How LLMs Do Inference

The fascinating process that turns mathematical operations into human-like text

"The quick brown fox jumps over the lazy dog"

The Inference Process

1. Tokenization — The input text is broken into tokens (words or subwords) that the model can process.
2. Embedding — Each token is converted to a numerical vector that captures semantic meaning.
3. Attention — The model calculates which parts of the input to focus on using self-attention mechanisms.
4. Feed-forward — Neural network layers process the attended information through nonlinear transformations.
5. Prediction — The model outputs a probability distribution over its vocabulary for the next token in the sequence.
6. Sampling — A token is selected from that distribution (either greedily or with some randomness) to continue the sequence; the sketch after this list walks through all six steps.
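
To make the six steps concrete, here is a minimal, self-contained sketch in Python with NumPy. The tokenizer, vocabulary, and weights are all toy, randomly initialized stand-ins rather than a real trained model; the goal is only to show the shape of each step.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Tokenization: a toy whitespace tokenizer over a tiny fixed vocabulary.
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
token_to_id = {t: i for i, t in enumerate(vocab)}
id_to_token = {i: t for t, i in token_to_id.items()}

def tokenize(text):
    return [token_to_id[w] for w in text.lower().split()]

d_model = 16
# 2. Embedding: a lookup table mapping token ids to vectors (random here).
embedding = rng.normal(size=(len(vocab), d_model))

# 3. Attention: single-head causal self-attention with random projections.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf            # each position sees only earlier positions
    return softmax(scores) @ v

# 4. Feed-forward: a two-layer MLP applied to each position independently.
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))

def feed_forward(x):
    return np.maximum(x @ W1, 0) @ W2  # ReLU nonlinearity

# 5. Prediction: project the last position back onto the vocabulary (logits).
W_out = rng.normal(size=(d_model, len(vocab)))

# 6. Sampling: pick the next token greedily or with temperature-scaled randomness.
def sample(logits, temperature=0.0):
    if temperature == 0.0:
        return int(np.argmax(logits))             # greedy
    probs = softmax(logits / temperature)
    return int(rng.choice(len(probs), p=probs))   # random, weighted by probability

ids = tokenize("the quick brown fox")             # step 1
x = embedding[ids]                                # step 2
x = x + causal_self_attention(x)                  # step 3
x = x + feed_forward(x)                           # step 4
logits = x[-1] @ W_out                            # step 5
next_id = sample(logits, temperature=0.8)         # step 6
print("next token:", id_to_token[next_id])
```

Because the weights are random, the printed token is meaningless; in a trained model the same pipeline produces a plausible continuation.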

Transformer Architecture

Input text → Tokenization → Embedding → Transformer decoder layers (attention + FFN) → Prediction (logits) → Sampling → Output token

Decoder-Only

Modern LLMs primarily use the decoder part of the Transformer: the prompt is processed in one parallel pass, and new tokens are then generated one at a time, each predicted from all previous tokens.
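
The "based on all previous tokens" behavior comes from a causal (lower-triangular) attention mask. A small NumPy sketch of how such a mask is built and applied; the score matrix here is a random stand-in for QKᵀ, just to show the mechanics:

```python
import numpy as np

seq_len = 5
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))  # stand-in attention scores

# Causal mask: position i may attend to positions 0..i, never to the future.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(causal_mask, scores, -np.inf)

# After softmax, each row is a distribution over previous tokens only.
weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # the upper triangle is all zeros
```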

Key Components

  • Multi-head self-attention: Lets the model weigh the importance of different tokens in the input sequence from several learned perspectives at once.
  • Positional embeddings: Add information about each token's position, since attention by itself is permutation-invariant.
  • Feed-forward networks: Standard neural network layers applied independently to each position.
  • Layer normalization & residual connections: Help stabilize training and allow for deeper networks (see the sketch after this list).
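
One common way these components fit together is the "pre-norm" residual block used by most modern decoder-only models: normalize, apply the sublayer, then add the result back to the input. A minimal sketch with placeholder sublayers; real models also learn gain and bias parameters in the normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feed_forward):
    # Pre-norm residual wiring: each sublayer reads a normalized copy of x,
    # and its output is added back to x, so the skip path stays unobstructed.
    x = x + attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x))
    return x

def identity(x):
    return x  # stand-in sublayer, just to make the skeleton runnable

x = np.random.default_rng(2).normal(size=(4, 16))  # (sequence length, model dim)
y = transformer_block(x, attention=identity, feed_forward=identity)
print(y.shape)  # (4, 16)
```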

Inference Optimizations

KV Caching

Stores previous key-value pairs from attention layers to avoid recomputing them for each new token, significantly speeding up sequential generation.
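
A sketch of the idea: keep the key and value vectors computed for earlier tokens, so each decoding step only projects the newest token and attends against the cached history. The projections here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []   # grows by one entry per generated token

def attend_with_cache(new_token_vec):
    # Only the newest token's Q/K/V are computed; old K/V come from the cache.
    q = new_token_vec @ W_q
    k_cache.append(new_token_vec @ W_k)
    v_cache.append(new_token_vec @ W_v)
    K = np.stack(k_cache)            # (t, d)
    V = np.stack(v_cache)            # (t, d)
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V               # attention output for the newest position

for step in range(5):                # one call per generated token
    out = attend_with_cache(rng.normal(size=d))
print("cached steps:", len(k_cache), "output shape:", out.shape)
```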

Quantization

Reduces the numerical precision of model weights (e.g., from 16- or 32-bit floating point to 8-bit or 4-bit integers) to decrease memory usage and speed up computation on compatible hardware.
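
A simple illustration of the idea using symmetric per-tensor int8 quantization of a random weight matrix; production inference stacks typically use finer-grained per-channel or group-wise schemes:

```python
import numpy as np

W = np.random.default_rng(4).normal(size=(256, 256)).astype(np.float32)

# Symmetric int8 quantization: map the largest magnitude to 127.
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)  # stored weights
W_dequant = W_int8.astype(np.float32) * scale                     # used at compute time

print("memory:", W.nbytes, "bytes ->", W_int8.nbytes, "bytes")    # 4x smaller
print("max abs error:", np.abs(W - W_dequant).max())
```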

Batching

Groups multiple inference requests together to process them simultaneously, improving GPU utilization and throughput, especially for smaller prompts.
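
A sketch of the mechanics: prompts of different lengths are padded to a common length and stacked into one tensor, with a mask recording which positions are real, so the GPU can process them in a single forward pass. The token ids and pad id below are arbitrary.

```python
import numpy as np

# Three requests with different prompt lengths (arbitrary token ids).
requests = [[12, 7, 99], [5, 42], [8, 3, 16, 27, 1]]
pad_id = 0

max_len = max(len(r) for r in requests)
batch = np.full((len(requests), max_len), pad_id, dtype=np.int64)
attention_mask = np.zeros((len(requests), max_len), dtype=np.int64)

for i, ids in enumerate(requests):
    batch[i, :len(ids)] = ids          # left-aligned tokens, padding on the right
    attention_mask[i, :len(ids)] = 1   # 1 = real token, 0 = padding to ignore

print(batch)
print(attention_mask)  # the model masks out padded positions in attention
```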

Speculative Decoding

Uses a smaller, faster draft model to propose several tokens ahead, then verifies the whole proposal with the larger model in a single parallel pass. Accepted tokens therefore cost one parallel pass instead of several sequential ones.
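
A simplified sketch of the control flow with toy stand-in models: the draft model greedily proposes k tokens, the target model scores the proposal, and tokens are kept only up to the first disagreement. Production implementations use a probabilistic accept/reject rule that preserves the target model's output distribution; the greedy check below is only meant to show the structure.

```python
import numpy as np

rng = np.random.default_rng(5)
VOCAB = 50

def draft_logits(tokens):   # small, fast stand-in model
    return rng.normal(size=VOCAB)

def target_logits(tokens):  # large, accurate stand-in model
    return rng.normal(size=VOCAB)

def speculative_step(tokens, k=4):
    # 1) Draft model proposes k tokens, one at a time (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(int(np.argmax(draft_logits(tokens + proposal))))

    # 2) Target model checks each proposed position; in a real system this
    #    verification is one batched forward pass rather than a Python loop.
    accepted = []
    for tok in proposal:
        target_choice = int(np.argmax(target_logits(tokens + accepted)))
        if target_choice == tok:
            accepted.append(tok)            # agreement: keep the drafted token
        else:
            accepted.append(target_choice)  # disagreement: take the target's token and stop
            break
    return tokens + accepted

sequence = [1, 2, 3]
sequence = speculative_step(sequence)
print(sequence)
```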