How LLMs Do Inference
From a user's prompt to the machine's reply: an interactive journey inside the model.
LLM Inference
The process by which a trained Large Language Model (LLM) generates text, predicting one token at a time based on the input it received. Unlike training, which can take months, each step of inference happens in milliseconds.
Introduction
You type a question into ChatGPT or Claude, and almost instantly, words start appearing on your screen. It feels like magic, or perhaps like a human typing on the other end. But what's actually happening deep inside the silicon of the server?
The process is called inference. It's a complex dance of mathematics, memory management, and massive parallel computation. Roughly speaking, the model is playing a very sophisticated game of "guess the next word," over and over again, billions of times a day.
In this interactive explainer, we'll peel back the layers. We'll trace the journey of a prompt as it gets sliced into numbers, processed by giant matrices, and converted back into human language.
Step 1: Tokenization
Computers don't understand words; they understand numbers. Before an LLM can do anything with your input, the text must be broken down into smaller chunks called tokens.
Modern LLMs use Byte Pair Encoding (BPE), which starts with individual characters and repeatedly merges the most common pairs. This creates a vocabulary of subwords that balances efficiency with flexibility.
BPE Tokenizer
How BPE works: The tokenizer starts with individual characters and repeatedly merges the most common pairs. The Ġ symbol represents a leading space.
Try typing: "tokenization", "unbelievable", or "artificial intelligence" to see how subwords are split.
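The merge rule at the heart of BPE fits in a few lines of Python. Below is a toy trainer (the corpus and word frequencies are invented for illustration) that repeatedly finds and merges the most frequent adjacent pair:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent pair of
    symbols across the corpus and merge it into a single token."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Toy corpus: word -> frequency, each word as a tuple of characters
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):
    corpus, pair = bpe_merge_step(corpus)
    print("merged:", pair)
```

A production tokenizer learns tens of thousands of such merges once, then applies them in the learned order at inference time.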
Step 2: Embeddings
Each token ID is converted into a dense vector of numbers called an embedding. These vectors live in a high-dimensional space where similar words cluster together.
The embedding captures semantic meaning: "cat" and "dog" will be closer to each other than "cat" and "computer". This is where the model's understanding of language begins.
Word Embeddings in 2D Space
Words with similar meanings cluster together
Word Analogy: In embedding space, king - man + woman ≈ queen
The vector from "man" to "woman" captures the concept of gender, and adding it to "king" points toward "queen".
Embeddings: Each word is represented as a vector (list of numbers). Similar words have similar vectors, so they appear close together in this 2D projection.
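The analogy arithmetic can be checked directly with vectors. The four embeddings below are hypothetical toy values (real models learn vectors with thousands of dimensions), but the cosine-similarity mechanics are the same:

```python
import numpy as np

# Toy 4-dimensional embeddings, made up for illustration
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.2, 0.9, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy: king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```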
The Missing Piece: Position
Here's a problem: if we just fed these embeddings into the attention mechanism, "The cat ate the fish" and "The fish ate the cat" would look identical! The model wouldn't know which word came first.
To fix this, we add a Positional Encoding vector to each word's embedding. By using sine and cosine waves of different frequencies, we give every position in the sequence a unique mathematical "fingerprint" that the model can learn to recognize.
Positional Encodings
How transformers know "where" words are using math (Sinusoidal frequencies)
Standard Transformer uses 10000. Lower values compress the waves.
Each row is a position vector added to the word embedding. High-frequency dimensions (left) change rapidly from row to row, while low-frequency dimensions (right) change slowly.
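The sinusoidal scheme is easy to reproduce. A sketch following the formulas from the original Transformer paper, PE[pos, 2i] = sin(pos / base^(2i/d)) and PE[pos, 2i+1] = cos(pos / base^(2i/d)):

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Sinusoidal positional encodings: each position gets a unique
    vector built from sines and cosines of geometrically spaced frequencies."""
    pos = np.arange(num_positions)[:, None]        # shape (P, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(base, i / d_model)     # shape (P, d_model/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(num_positions=8, d_model=16)
# Each row is simply added to the corresponding token's embedding
```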
Step 3: The Transformer Block
The heart of an LLM is the Transformer, a neural network architecture introduced in the 2017 paper "Attention Is All You Need". It consists of stacked layers, each containing an attention mechanism and a feed-forward network.
A modern LLM like GPT-4 or Llama-3 might have 80+ of these layers, each refining the model's understanding of the input and building toward the final prediction.
Transformer Block Architecture
Click any block to see details
Model Size Examples
Inside the Transformer
Once the input is tokenized and embedded, it enters the Transformer layers. This is where the "intelligence" lives. A modern LLM might have dozens (or even roughly 100) of these identical blocks stacked on top of each other.
The key mechanism here is Self-Attention. It allows each token to "look at" other tokens in the sequence to gather context. For example, in the sentence "The animal didn't cross the street because it was too tired," attention helps the model understand that "it" refers to "the animal" and not "the street."
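At its core, self-attention is three matrix multiplies and a softmax. A single-head sketch with random toy weights (real models learn Wq, Wk, Wv during training, and decoder models additionally apply a causal mask so tokens cannot attend to the future):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head. Each token's query
    is compared against every token's key; the resulting weights mix the
    value vectors into a context-aware representation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # (T, d_v)

rng = np.random.default_rng(0)
T, d = 5, 8                                # 5 tokens, 8-dim vectors (toy sizes)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```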
Multi-Head Attention Visualizer
Click on a word to see what it "attends to"
Grammatical relationships
Meaning connections
Adjacent words
Distant references
"it" attends to mat, cat, because, was, The
Multi-Head Attention: Each "head" learns to focus on different types of relationships. One head might track grammar, another tracks meaning, and another handles pronouns. The model combines all heads for a rich understanding.
Multi-Head Attention
Rather than having a single attention mechanism, transformers use multiple attention heads that each learn to focus on different types of relationships.
One head might specialize in grammatical dependencies (subjects and verbs), another in semantic similarity (synonyms and related concepts), and another in positional patterns (nearby words). The model combines all these perspectives for a richer understanding.
Multi-Head Attention Specialization
See how different heads focus on different patterns
Tracks grammatical structure: verbs find their subjects, adjectives find their nouns.
Finds meaning relationships: synonyms, antonyms, and related concepts cluster together.
Focuses on nearby tokens: builds local context from immediate neighbors.
Captures sentence-level structure: connects distant but structurally related words.
Key Insight: By having multiple heads with different "specializations", the model can capture many types of relationships simultaneously. The outputs are concatenated and projected back, giving a rich representation that combines syntactic, semantic, and structural information.
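Mechanically, "multiple heads" just means slicing the model dimension into independent subspaces, attending within each, and concatenating the results. A sketch of that reshaping:

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (T, d_model) into (n_heads, T, d_model // n_heads):
    each head attends in its own lower-dimensional subspace."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(T, n_heads, d_head).transpose(1, 0, 2)

def merge_heads(H):
    """Concatenate per-head outputs back into one (T, d_model) matrix,
    ready for the final output projection."""
    n_heads, T, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(T, n_heads * d_head)

X = np.arange(24, dtype=float).reshape(4, 6)   # 4 tokens, d_model = 6
heads = split_heads(X, n_heads=3)              # 3 heads of size 2
print(heads.shape)  # (3, 4, 2)
```

Splitting and merging are exact inverses, which is why adding heads costs no extra parameters in the Q/K/V projections themselves.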
The Two Phases: Prefill & Decode
Generation isn't just one smooth process; it actually happens in two distinct phases that behave very differently under the hood.
- Prefill (The "Reading" Phase): The model processes your entire prompt at once. This is highly parallelizable and usually very fast. It's like reading a whole page of a book in a glance to understand the context.
- Decode (The "Writing" Phase): The model generates the response one token at a time. Each new token depends on all the previous ones (including the prompt and what it just generated). This is sequential and harder to speed up.
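The two phases fall out of a simple generation loop. A minimal sketch, assuming a hypothetical `model.forward(tokens)` that returns one logits vector per position:

```python
def generate(model, prompt_tokens, max_new_tokens):
    """Prefill once over the prompt, then decode token by token."""
    tokens = list(prompt_tokens)
    # Prefill: one forward pass over the whole prompt (parallel across positions)
    logits = model.forward(tokens)
    # Decode: one token per step; each step depends on the previous output
    for _ in range(max_new_tokens):
        next_token = int(logits[-1].argmax())  # greedy pick, for simplicity
        tokens.append(next_token)
        # The sequential bottleneck (with a KV cache, this pass would
        # only need to process the single newest token)
        logits = model.forward(tokens)
    return tokens
```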
Prefill vs. Decode Phases
See how inference actually works
Prefill Phase
Parallel: All prompt tokens are processed simultaneously in a single forward pass. Highly parallelizable!
Decode Phase
Sequential: Each new token requires a full forward pass. This is the bottleneck in LLM inference.
Generated Output
The sky is
Why this matters: The decode phase is the bottleneck. Each token must wait for the previous one, making it inherently sequential. This is why "tokens per second" is a key LLM benchmark.
Training vs. Inference
It's crucial to distinguish between training (teaching the model) and inference (using the model).
Training
- Takes months or years
- Uses thousands of GPUs in parallel
- "Backward pass" updates weights
- Goal: Minimize error on massive datasets
Inference
- Takes milliseconds per token
- Can run on a single GPU (or CPU!)
- "Forward pass" only; weights are frozen
- Goal: Generate useful responses quickly
Step 4: Sampling
After passing through all the layers, the model produces a set of scores (logits) for every possible token in its vocabulary. These scores represent the likelihood of each token being the next one.
But the model doesn't just always pick the most likely word (that would be boring and repetitive). Instead, strategies like Temperature, Top-P, Top-K, and Min-P introduce controlled randomness to make the text feel more creative and human-like.
Interactive: Sampling Strategies
Adjust the sliders to see how the model chooses the next word for:
"It was a beautiful..."
Controls randomness. Low = predictable, High = creative.
Only consider tokens in top X% cumulative probability.
Only consider the top K highest probability tokens.
Filter tokens below X% of max probability. Modern alternative to Top-K.
Penalize recently used tokens. Higher = less repetition.
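These knobs compose in a pipeline: temperature reshapes the distribution, then top-k and top-p filter it. A sketch combining them (defaults are illustrative; repetition penalties, mentioned above, are omitted):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature + top-k + top-p (nucleus) sampling over one logits
    vector. top_k=0 and top_p=1.0 disable those filters."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most likely first
    if top_k > 0:
        probs[order[top_k:]] = 0.0               # keep only the top K
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # smallest nucleus >= top_p
        probs[order[cutoff:]] = 0.0
    probs /= probs.sum()                         # renormalize survivors
    return int(rng.choice(len(probs), p=probs))
```

With `top_k=1` this degenerates to greedy decoding; with high temperature and no filters it approaches uniform sampling over the vocabulary.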
Decoding Strategies: Beam Search
Sampling adds randomness, but what if we want the best possible sentence? Simply picking the most likely word at each step (Greedy Decoding) can lead to dead ends.
Beam Search explores multiple parallel futures simultaneously. It keeps the "top K" most promising incomplete sentences at every step, allowing it to find better overall sequences that might start with a lower-probability word.
Beam Search vs. Greedy Decoding
Explore multiple future possibilities simultaneously
Greedy Decoding: At each step, we simply pick the single most likely token. This is fast but can miss better full sentences if a low-probability word leads to a high-probability ending.
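A sketch of beam search over a hypothetical `logprob_fn(prefix)` that scores each possible next token given a prefix. The toy scorer is constructed so that the greedy first choice leads to a mediocre continuation, while the runner-up leads to a better overall sequence:

```python
import math

def beam_search(logprob_fn, vocab_size, beam_width, length):
    """Keep the top `beam_width` partial sequences at every step."""
    beams = [((), 0.0)]                    # (token sequence, total log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            logprobs = logprob_fn(seq)
            for tok in range(vocab_size):
                candidates.append((seq + (tok,), score + logprobs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

# Toy scorer: token 0 is the greedy first pick (p=0.6), but token 1
# (p=0.4) unlocks a much better continuation (p=0.9).
def toy_logprobs(prefix):
    if prefix == ():
        return [math.log(0.6), math.log(0.4)]
    if prefix[-1] == 1:
        return [math.log(0.9), math.log(0.1)]
    return [math.log(0.5), math.log(0.5)]

print(beam_search(toy_logprobs, vocab_size=2, beam_width=2, length=2)[0])  # (1, 0)
```

Greedy would commit to token 0 and end with probability 0.6 × 0.5 = 0.30; the beam keeps token 1 alive and finds the 0.4 × 0.9 = 0.36 sequence.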
Optimization: The KV Cache
Generating one token at a time sounds like it could be incredibly slow if we had to re-read the entire history for every single new word. Fortunately, engineers use a clever trick called the KV (Key-Value) Cache.
Instead of recalculating the attention keys and values for all previous tokens at every step, the model saves them in GPU memory. This means for each new step, it only needs to compute the values for the newest token. The trade-off? It uses a huge amount of memory!
KV Cache: Memory vs. Speed Tradeoff
See how caching affects memory usage and computation
KV Cache Memory Usage
For 256 tokens on Llama-3 8B
Decode Time per Token
O(N) - Linear with sequence length
Computation Cost: O(N) vs O(N²)
With KV Cache: We trade memory for speed. The cache stores previously computed keys and values, so each new token only needs O(1) computation for its own K/V, then O(N) to attend to the cached history. This is why GPU memory limits your context length!
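The data structure itself is simple: per layer (and per head), an append-only buffer of keys and values. A minimal sketch for a single head:

```python
import numpy as np

class KVCache:
    """Append-only cache of attention keys and values for one head.
    Each decode step appends one row instead of recomputing K/V for
    the whole history."""
    def __init__(self, d_head):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])

    def attend(self, q):
        """O(N) attention of the newest query against all cached keys."""
        scores = self.K @ q / np.sqrt(self.K.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V
```

For scale: Llama-3 8B has 32 layers and grouped-query attention with 8 KV heads of dimension 128, so each token adds roughly 2 (K and V) × 32 × 8 × 128 × 2 bytes ≈ 128 KB to the cache in FP16, or about 32 MB for a 256-token context.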
Throughput: Continuous Batching
Serving one user at a time is inefficient because GPU cores are massive parallel machines. To fix this, we process multiple requests in a batch.
But what if one request finishes early? In a naive static batch, the GPU waits for the slowest request. Continuous Batching (also called cellular batching or iteration-level scheduling) solves this by immediately evicting finished sequences and inserting new ones into the freed slots, keeping the GPU as close to fully busy as possible.
Continuous Batching Explainer
How to keep the GPU busy (maximize utilization)
Request Queue (6 pending)
GPU Batch Slots (Max 4)
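The slot-filling policy is easy to simulate. A toy scheduler (request names and lengths are invented; one "step" stands in for one decode iteration across the whole batch):

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """Toy scheduler: each request is (name, tokens_to_generate).
    Every step, each active request emits one token; finished requests
    are evicted immediately and a queued request takes the free slot."""
    queue = deque(requests)
    slots, finished, step = {}, [], 0
    while queue or slots:
        # Fill any free slots from the queue (the 'continuous' part)
        while queue and len(slots) < max_slots:
            name, n = queue.popleft()
            slots[name] = n
        # One decode step for every request currently in the batch
        for name in list(slots):
            slots[name] -= 1
            if slots[name] == 0:
                del slots[name]                 # evict immediately
                finished.append((name, step))   # record completion step
        step += 1
    return finished
```

With a static batch, the short requests would sit idle until the longest one finished; here they free their slots the moment they are done.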
Real-World Inference: Quantization
In production, engineers run into the Memory Wall: the weights of a model like Llama-3-70B take up ~140GB of VRAM at 16-bit precision, and moving that data from memory to the compute units is often the bottleneck.
To solve this, we use Quantization: reducing the precision of the model's numbers from high-definition (16-bit) to lower fidelity (8-bit or 4-bit). It's like lowering the bitrate on a video file: you save massive amounts of space with surprisingly little loss in quality.
Quantization: Precision vs. Memory
Reducing bits per weight to fit models in RAM
Impact on 70B Model
FP16 (16-bit): The training standard. Extremely precise, but requires massive enterprise GPUs (needs ~140GB for 70B params).
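The simplest scheme, symmetric per-tensor int8 quantization, fits in a few lines (production systems use finer-grained per-channel or per-group scales, and 4-bit formats like those in GPTQ or llama.cpp are more involved):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: store int8 codes plus a single
    float scale, and reconstruct weights as code * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"{w.nbytes} bytes -> {q.nbytes} bytes, max error {err:.4f}")
```

The reconstruction error is bounded by half the scale, which is why quantization works so well when weight magnitudes are small and well clustered; outlier weights are what the fancier schemes exist to handle.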
Key Takeaways
- 1. Tokenization (BPE) turns text into subword units the model can process.
- 2. Embeddings capture meaning, while Positional Encodings track word order.
- 3. Transformer layers process the sequence through attention and feed-forward networks.
- 4. Multi-Head Attention captures different types of relationships simultaneously.
- 5. Inference has a fast parallel "prefill" phase and a slower sequential "decode" phase.
- 6. Decoding strategies like Sampling (creativity) or Beam Search (quality) shape the output.
- 7. The KV Cache trades memory for speed, enabling real-time inference.
- 8. Continuous Batching maximizes GPU throughput by filling gaps in processing.
- 9. Quantization compresses models (e.g., to 4-bit) to fit on consumer hardware.