How LLMs Do Inference
From a user's prompt to the machine's reply: an interactive journey inside the model.
LLM Inference
The process by which a trained Large Language Model (LLM) generates text, predicting one token at a time based on the input it received. Unlike training, which can take months, each step of inference happens in milliseconds.
Introduction
You type a question into ChatGPT or Claude, and almost instantly, words start appearing on your screen. It feels like magic, or perhaps like a human typing on the other end. But what's actually happening deep inside the silicon of the server?
The process is called inference. It's a complex dance of mathematics, memory management, and massive parallel computation. Roughly speaking, the model is playing a very sophisticated game of "guess the next word," over and over again, billions of times a day.
In this interactive explainer, we'll peel back the layers. We'll trace the journey of a prompt as it gets sliced into numbers, processed by giant matrices, and converted back into human language.
Step 1: Tokenization
Computers don't understand words; they understand numbers. Before an LLM can do anything with your input, the text must be broken down into smaller chunks called tokens.
Modern LLMs use Byte Pair Encoding (BPE), which starts with individual characters and repeatedly merges the most common pairs. This creates a vocabulary of subwords that balances efficiency with flexibility.
BPE Tokenizer
How BPE works: The tokenizer starts with individual characters and repeatedly merges the most common pairs. The Ġ symbol represents a leading space.
Try typing: "tokenization", "unbelievable", or "artificial intelligence" to see how subwords are split.
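The merge rule at the heart of BPE fits in a few lines of Python. Below is a toy trainer (the corpus and word frequencies are invented for illustration) that repeatedly finds and merges the most frequent adjacent pair:

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE training step: find the most frequent adjacent pair of
    symbols across the corpus and merge it into a single token."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return words, None
    best = max(pairs, key=pairs.get)
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best

# Toy corpus: word -> frequency, each word as a tuple of characters
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):
    corpus, pair = bpe_merge_step(corpus)
    print("merged:", pair)
```

A production tokenizer learns tens of thousands of such merges once, then applies them in the learned order at inference time.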
Step 2: Embeddings
Each token ID is converted into a dense vector of numbers called an embedding. These vectors live in a high-dimensional space where similar words cluster together.
The embedding captures semantic meaning: "cat" and "dog" will be closer to each other than "cat" and "computer". This is where the model's understanding of language begins.
Word Embeddings in 2D Space
Words with similar meanings cluster together
Word Analogy: In embedding space, king - man + woman ≈ queen
The vector from "man" to "woman" captures the concept of gender, and adding it to "king" points toward "queen".
Embeddings: Each word is represented as a vector (list of numbers). Similar words have similar vectors, so they appear close together in this 2D projection.
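The analogy arithmetic can be checked directly with vectors. The four embeddings below are hypothetical toy values (real models learn vectors with thousands of dimensions), but the cosine-similarity mechanics are the same:

```python
import numpy as np

# Toy 4-dimensional embeddings, made up for illustration
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.2, 0.9, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy: king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```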
The Missing Piece: Position
Here's a problem: if we just fed these embeddings into the attention mechanism, "The cat ate the fish" and "The fish ate the cat" would look identical! The model wouldn't know which word came first.
To fix this, we add a Positional Encoding vector to each word's embedding. By using sine and cosine waves of different frequencies, we give every position in the sequence a unique mathematical "fingerprint" that the model can learn to recognize.
Positional Encodings
How transformers know "where" words are using math (Sinusoidal frequencies)
Standard Transformer uses 10000. Lower values compress the waves.
Each row is a position vector added to the word embedding. High-frequency dimensions (left) change rapidly from row to row, while low-frequency dimensions (right) change slowly.
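The sinusoidal scheme is easy to reproduce. A sketch following the formulas from the original Transformer paper, PE[pos, 2i] = sin(pos / base^(2i/d)) and PE[pos, 2i+1] = cos(pos / base^(2i/d)):

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Sinusoidal positional encodings: each position gets a unique
    vector built from sines and cosines of geometrically spaced frequencies."""
    pos = np.arange(num_positions)[:, None]        # shape (P, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(base, i / d_model)     # shape (P, d_model/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_pe(num_positions=8, d_model=16)
# Each row is simply added to the corresponding token's embedding
```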
Step 3: The Transformer Block
The heart of an LLM is the Transformer, a neural network architecture introduced in the 2017 paper "Attention Is All You Need". It consists of stacked layers, each containing an attention mechanism and a feed-forward network.
A modern LLM like GPT-4 or Llama-3 might have 80+ of these layers, each refining the model's understanding of the input and building toward the final prediction.
Transformer Block Architecture
Click any block to see details
Model Size Examples
Inside the Transformer
Once the input is tokenized and embedded, it enters the Transformer layers. This is where the "intelligence" lives. A modern LLM might have dozens (or even roughly 100) of these identical blocks stacked on top of each other.
The key mechanism here is Self-Attention. It allows each token to "look at" other tokens in the sequence to gather context. For example, in the sentence "The animal didn't cross the street because it was too tired," attention helps the model understand that "it" refers to "the animal" and not "the street."
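At its core, self-attention is three matrix multiplies and a softmax. A single-head sketch with random toy weights (real models learn Wq, Wk, Wv during training, and decoder models additionally apply a causal mask so tokens cannot attend to the future):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head. Each token's query
    is compared against every token's key; the resulting weights mix the
    value vectors into a context-aware representation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # (T, d_v)

rng = np.random.default_rng(0)
T, d = 5, 8                                # 5 tokens, 8-dim vectors (toy sizes)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```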
Multi-Head Attention Visualizer
Click on a word to see what it "attends to"
Grammatical relationships
Meaning connections
Adjacent words
Distant references
"it" attends to mat, cat, because, was, The
Multi-Head Attention: Each "head" learns to focus on different types of relationships. One head might track grammar, another tracks meaning, and another handles pronouns. The model combines all heads for a rich understanding.
Multi-Head Attention
Rather than having a single attention mechanism, transformers use multiple attention heads that each learn to focus on different types of relationships.
One head might specialize in grammatical dependencies (subjects and verbs), another in semantic similarity (synonyms and related concepts), and another in positional patterns (nearby words). The model combines all these perspectives for a richer understanding.
Multi-Head Attention Specialization
See how different heads focus on different patterns
Tracks grammatical structure: verbs find their subjects, adjectives find their nouns.
Finds meaning relationships: synonyms, antonyms, and related concepts cluster together.
Focuses on nearby tokens: builds local context from immediate neighbors.
Captures sentence-level structure: connects distant but structurally related words.
Key Insight: By having multiple heads with different "specializations", the model can capture many types of relationships simultaneously. The outputs are concatenated and projected back, giving a rich representation that combines syntactic, semantic, and structural information.
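Mechanically, "multiple heads" just means slicing the model dimension into independent subspaces, attending within each, and concatenating the results. A sketch of that reshaping:

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (T, d_model) into (n_heads, T, d_model // n_heads):
    each head attends in its own lower-dimensional subspace."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(T, n_heads, d_head).transpose(1, 0, 2)

def merge_heads(H):
    """Concatenate per-head outputs back into one (T, d_model) matrix,
    ready for the final output projection."""
    n_heads, T, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(T, n_heads * d_head)

X = np.arange(24, dtype=float).reshape(4, 6)   # 4 tokens, d_model = 6
heads = split_heads(X, n_heads=3)              # 3 heads of size 2
print(heads.shape)  # (3, 4, 2)
```

Splitting and merging are exact inverses, which is why adding heads costs no extra parameters in the Q/K/V projections themselves.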
The Two Phases: Prefill & Decode
Generation isn't just one smooth process; it actually happens in two distinct phases that behave very differently under the hood.
- Prefill (The "Reading" Phase): The model processes your entire prompt at once. This is highly parallelizable and usually very fast. It's like reading a whole page of a book in a glance to understand the context.
- Decode (The "Writing" Phase): The model generates the response one token at a time. Each new token depends on all the previous ones (including the prompt and what it just generated). This is sequential and harder to speed up.
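The two phases fall out of a simple generation loop. A minimal sketch, assuming a hypothetical `model.forward(tokens)` that returns one logits vector per position:

```python
def generate(model, prompt_tokens, max_new_tokens):
    """Prefill once over the prompt, then decode token by token."""
    tokens = list(prompt_tokens)
    # Prefill: one forward pass over the whole prompt (parallel across positions)
    logits = model.forward(tokens)
    # Decode: one token per step; each step depends on the previous output
    for _ in range(max_new_tokens):
        next_token = int(logits[-1].argmax())  # greedy pick, for simplicity
        tokens.append(next_token)
        # The sequential bottleneck (with a KV cache, this pass would
        # only need to process the single newest token)
        logits = model.forward(tokens)
    return tokens
```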
Prefill vs. Decode Phases
See how inference actually works
Prefill Phase
Parallel: All prompt tokens are processed simultaneously in a single forward pass. Highly parallelizable!
Decode Phase
Sequential: Each new token requires a full forward pass. This is the bottleneck in LLM inference.
Generated Output
The sky is
Why this matters: The decode phase is the bottleneck. Each token must wait for the previous one, making it inherently sequential. This is why "tokens per second" is a key LLM benchmark.
Training vs. Inference
It's crucial to distinguish between training (teaching the model) and inference (using the model).
Training
- Takes months or years
- Uses thousands of GPUs in parallel
- "Backward pass" updates weights
- Goal: Minimize error on massive datasets
Inference
- Takes milliseconds per token
- Can run on a single GPU (or CPU!)
- "Forward pass" only; weights are frozen
- Goal: Generate useful responses quickly
Step 4: Sampling
After passing through all the layers, the model produces a set of scores (logits) for every possible token in its vocabulary. These scores represent the likelihood of each token being the next one.
But the model doesn't just always pick the most likely word (that would be boring and repetitive). Instead, strategies like Temperature, Top-P, Top-K, and Min-P introduce controlled randomness to make the text feel more creative and human-like.
Interactive: Sampling Strategies
Adjust the sliders to see how the model chooses the next word for:
"It was a beautiful..."
Controls randomness. Low = predictable, High = creative.
Only consider tokens in top X% cumulative probability.
Only consider the top K highest probability tokens.
Filter tokens below X% of max probability. Modern alternative to Top-K.
Penalize recently used tokens. Higher = less repetition.
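These knobs compose in a pipeline: temperature reshapes the distribution, then top-k and top-p filter it. A sketch combining them (defaults are illustrative; repetition penalties, mentioned above, are omitted):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature + top-k + top-p (nucleus) sampling over one logits
    vector. top_k=0 and top_p=1.0 disable those filters."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most likely first
    if top_k > 0:
        probs[order[top_k:]] = 0.0               # keep only the top K
    if top_p < 1.0:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1  # smallest nucleus >= top_p
        probs[order[cutoff:]] = 0.0
    probs /= probs.sum()                         # renormalize survivors
    return int(rng.choice(len(probs), p=probs))
```

With `top_k=1` this degenerates to greedy decoding; with high temperature and no filters it approaches uniform sampling over the vocabulary.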
Decoding Strategies: Beam Search
Sampling adds randomness, but what if we want the best possible sentence? Simply picking the most likely word at each step (Greedy Decoding) can lead to dead ends.
Beam Search explores multiple parallel futures simultaneously. It keeps the "top K" most promising incomplete sentences at every step, allowing it to find better overall sequences that might start with a lower-probability word.
Beam Search vs. Greedy Decoding
Explore multiple future possibilities simultaneously
Greedy Decoding: At each step, we simply pick the single most likely token. This is fast but can miss better full sentences if a low-probability word leads to a high-probability ending.
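A sketch of beam search over a hypothetical `logprob_fn(prefix)` that scores each possible next token given a prefix. The toy scorer is constructed so that the greedy first choice leads to a mediocre continuation, while the runner-up leads to a better overall sequence:

```python
import math

def beam_search(logprob_fn, vocab_size, beam_width, length):
    """Keep the top `beam_width` partial sequences at every step."""
    beams = [((), 0.0)]                    # (token sequence, total log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            logprobs = logprob_fn(seq)
            for tok in range(vocab_size):
                candidates.append((seq + (tok,), score + logprobs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

# Toy scorer: token 0 is the greedy first pick (p=0.6), but token 1
# (p=0.4) unlocks a much better continuation (p=0.9).
def toy_logprobs(prefix):
    if prefix == ():
        return [math.log(0.6), math.log(0.4)]
    if prefix[-1] == 1:
        return [math.log(0.9), math.log(0.1)]
    return [math.log(0.5), math.log(0.5)]

print(beam_search(toy_logprobs, vocab_size=2, beam_width=2, length=2)[0])  # (1, 0)
```

Greedy would commit to token 0 and end with probability 0.6 × 0.5 = 0.30; the beam keeps token 1 alive and finds the 0.4 × 0.9 = 0.36 sequence.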
Optimization: The KV Cache
Generating one token at a time sounds like it could be incredibly slow if we had to re-read the entire history for every single new word. Fortunately, engineers use a clever trick called the KV (Key-Value) Cache.
Instead of recalculating the attention keys and values for all previous tokens at every step, the model saves them in GPU memory. This means for each new step, it only needs to compute the values for the newest token. The trade-off? It uses a huge amount of memory!
KV Cache: Memory vs. Speed Tradeoff
See how caching affects memory usage and computation
KV Cache Memory Usage
For 256 tokens on Llama-3 8B
Decode Time per Token
O(N) - Linear with sequence length
Computation Cost: O(N) vs O(N²)
With KV Cache: We trade memory for speed. The cache stores previously computed keys and values, so each new token only needs O(1) computation for its own K/V, then O(N) to attend to the cached history. This is why GPU memory limits your context length!
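The data structure itself is simple: per layer (and per head), an append-only buffer of keys and values. A minimal sketch for a single head:

```python
import numpy as np

class KVCache:
    """Append-only cache of attention keys and values for one head.
    Each decode step appends one row instead of recomputing K/V for
    the whole history."""
    def __init__(self, d_head):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, k, v):
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])

    def attend(self, q):
        """O(N) attention of the newest query against all cached keys."""
        scores = self.K @ q / np.sqrt(self.K.shape[1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V
```

For scale: Llama-3 8B has 32 layers and grouped-query attention with 8 KV heads of dimension 128, so each token adds roughly 2 (K and V) × 32 × 8 × 128 × 2 bytes ≈ 128 KB to the cache in FP16, or about 32 MB for a 256-token context.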
Throughput: Continuous Batching
Serving one user at a time is inefficient because GPU cores are massive parallel machines. To fix this, we process multiple requests in a batch.
But what if one request finishes early? In a naive static batch, the GPU waits for the slowest request. Continuous Batching (also called cellular batching or iteration-level scheduling) solves this by immediately evicting finished sequences and inserting new ones into the freed slots, keeping the GPU as close to fully busy as possible.
Continuous Batching Explainer
How to keep the GPU busy (maximize utilization)
Request Queue (6 pending)
GPU Batch Slots (Max 4)
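The slot-filling policy is easy to simulate. A toy scheduler (request names and lengths are invented; one "step" stands in for one decode iteration across the whole batch):

```python
from collections import deque

def continuous_batching(requests, max_slots):
    """Toy scheduler: each request is (name, tokens_to_generate).
    Every step, each active request emits one token; finished requests
    are evicted immediately and a queued request takes the free slot."""
    queue = deque(requests)
    slots, finished, step = {}, [], 0
    while queue or slots:
        # Fill any free slots from the queue (the 'continuous' part)
        while queue and len(slots) < max_slots:
            name, n = queue.popleft()
            slots[name] = n
        # One decode step for every request currently in the batch
        for name in list(slots):
            slots[name] -= 1
            if slots[name] == 0:
                del slots[name]                 # evict immediately
                finished.append((name, step))   # record completion step
        step += 1
    return finished
```

With a static batch, the short requests would sit idle until the longest one finished; here they free their slots the moment they are done.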
Real-World Inference: Quantization
In production, engineers run into the Memory Wall: the weights of a model like Llama-3-70B take up ~140GB of VRAM at 16-bit precision, and moving that data from memory to the compute units is often the bottleneck.
To solve this, we use Quantization: reducing the precision of the model's numbers from high-definition (16-bit) to lower fidelity (8-bit or 4-bit). It's like lowering the bitrate on a video file: you save massive amounts of space with surprisingly little loss in quality.
Quantization: Precision vs. Memory
Reducing bits per weight to fit models in RAM
Impact on 70B Model
FP16 (16-bit): The training standard. Extremely precise, but requires massive enterprise GPUs (needs ~140GB for 70B params).
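The simplest scheme, symmetric per-tensor int8 quantization, fits in a few lines (production systems use finer-grained per-channel or per-group scales, and 4-bit formats like those in GPTQ or llama.cpp are more involved):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: store int8 codes plus a single
    float scale, and reconstruct weights as code * scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"{w.nbytes} bytes -> {q.nbytes} bytes, max error {err:.4f}")
```

The reconstruction error is bounded by half the scale, which is why quantization works so well when weight magnitudes are small and well clustered; outlier weights are what the fancier schemes exist to handle.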
Key Takeaways
- 1. Tokenization (BPE) turns text into subword units the model can process.
- 2. Embeddings capture meaning, while Positional Encodings track word order.
- 3. Transformer layers process the sequence through attention and feed-forward networks.
- 4. Multi-Head Attention captures different types of relationships simultaneously.
- 5. Inference has a fast parallel "prefill" phase and a slower sequential "decode" phase.
- 6. Decoding strategies like Sampling (creativity) or Beam Search (quality) shape the output.
- 7. The KV Cache trades memory for speed, enabling real-time inference.
- 8. Continuous Batching maximizes GPU throughput by filling gaps in processing.
- 9. Quantization compresses models (e.g., to 4-bit) to fit on consumer hardware.