The complete guide to tokenization - the first step in how large language models process text
Tokenization is the process of breaking down raw text into smaller units called tokens, which serve as the fundamental building blocks that LLMs use to understand and generate language.
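To make the definition concrete, here is a minimal round trip from text to token IDs and back. This is a small sketch assuming OpenAI's tiktoken library is installed; the exact IDs and pieces you see depend on the encoding you load.

```python
# pip install tiktoken
import tiktoken

# Load a byte-level BPE encoding (cl100k_base is used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into numbers."
token_ids = enc.encode(text)                   # integer IDs the model actually sees
pieces = [enc.decode([i]) for i in token_ids]  # the text piece behind each ID

print(token_ids)                      # a short list of integers
print(pieces)                         # the subword pieces the text was split into
print(enc.decode(token_ids) == text)  # True: the round trip is lossless
```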
A typical tokenization pipeline has three stages (sketched in code below):

1. **Normalization:** standardizing the input text, for example converting to lowercase, removing extra spaces, and handling Unicode characters.
2. **Pre-tokenization:** an initial split into words, punctuation, and special tokens using rules or whitespace.
3. **Subword tokenization:** applying the tokenization algorithm (BPE, WordPiece, etc.) to produce the final tokens.
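The toy sketch below walks through those three stages in plain Python. It only illustrates the shape of the pipeline: the normalization rules, the regex pre-tokenizer, and the tiny hard-coded vocabulary are invented for the example and are far simpler than what real tokenizers do.

```python
import re
import unicodedata

# Stage 1: normalization - Unicode normalization, lowercasing, whitespace cleanup.
def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text.lower()).strip()

# Stage 2: pre-tokenization - rough split into words and punctuation.
def pre_tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)

# Stage 3: subword tokenization - real tokenizers apply learned BPE merges or a
# WordPiece vocabulary here; this greedy longest-match over a tiny hard-coded
# vocabulary only stands in for that step.
TOY_VOCAB = {"token", "ization", "is", "fun", "!"}

def subword_split(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # try the longest match first
            if word[start:end] in TOY_VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])            # unknown character: keep as-is
            start += 1
    return pieces

def tokenize(text: str) -> list[str]:
    words = pre_tokenize(normalize(text))
    return [piece for word in words for piece in subword_split(word)]

print(tokenize("  Tokenization   is FUN!  "))
# -> ['token', 'ization', 'is', 'fun', '!']
```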
**Byte Pair Encoding (BPE)** iteratively merges the most frequent character pairs in a corpus to build a vocabulary of subword units.
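As a rough sketch of that merge loop, the toy trainer below learns merges from a four-word corpus. A real BPE implementation additionally weights pairs by word frequency, works at the byte level, and handles word boundaries; none of that is modeled here.

```python
from collections import Counter

def bpe_train(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus; each word starts as characters."""
    symbol_seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the whole corpus.
        pair_counts = Counter()
        for seq in symbol_seqs:
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += 1
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]   # most frequent pair wins
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Replace every occurrence of the winning pair with the merged symbol.
        new_seqs = []
        for seq in symbol_seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        symbol_seqs = new_seqs
    return merges

print(bpe_train(["lower", "lowest", "newer", "newest"], num_merges=4))
# The first merge is ('w', 'e'), the most frequent pair in this corpus;
# the later picks depend on how ties are broken.
```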
**WordPiece** is similar to BPE but uses likelihood rather than raw frequency to decide which pairs to merge.
| Algorithm | Used In | Strengths | Weaknesses |
|---|---|---|---|
| BPE | GPT models | Simple, efficient, handles rare words | May split words unnaturally |
| WordPiece | BERT, DistilBERT | Better suited to masked language modeling | More complex training |
| SentencePiece | T5, ALBERT | Language-agnostic, no pre-tokenization needed | Larger vocabulary sizes |
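One way to feel these differences is to tokenize the same word with each family. The sketch below uses Hugging Face transformers; the model names are just common examples of each algorithm, and the exact splits depend on each model's learned vocabulary.

```python
# pip install transformers sentencepiece
from transformers import AutoTokenizer

word = "tokenization"

for model_name, algorithm in [("gpt2", "BPE"),
                              ("bert-base-uncased", "WordPiece"),
                              ("t5-small", "SentencePiece")]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"{algorithm:13} ({model_name}): {tokenizer.tokenize(word)}")

# Each line prints that model's subword split of "tokenization". Note the markers:
# WordPiece flags continuation pieces with "##"; SentencePiece marks word starts with "▁".
```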
Tokenization also comes with inherent limitations:

- **Out-of-vocabulary words:** when a model encounters an unknown word, it must fall back on subword pieces that may not capture the word's meaning accurately (illustrated in the sketch after this list).
- **Ambiguity:** the same token can have different meanings depending on context, which tokenization alone does not resolve.
- **Context length:** models have fixed maximum token limits, so long texts must be truncated or split into chunks.
- **Multilingual text:** languages with different scripts, or with no explicit word boundaries, pose their own tokenization challenges.
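The first and third limitations are easy to demonstrate. The sketch below, again using the GPT-2 tokenizer from transformers, shows a rare word dissolving into subword pieces and a long input being truncated to a fixed token budget (1024 is GPT-2's context length).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A rare word is never "unknown" - it just falls apart into smaller subword
# (ultimately byte-level) pieces that may carry little of its actual meaning.
print(tokenizer.tokenize("floccinaucinihilipilification"))

# Fixed context windows force truncation: only the first max_length tokens survive.
long_text = "word " * 5000
input_ids = tokenizer(long_text, truncation=True, max_length=1024)["input_ids"]
print(len(input_ids))  # 1024 - everything past the limit was dropped
```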
Tokenization is the critical first step that enables LLMs to process human language. Understanding it provides insight into how these models work and where their limitations come from.