LLM Tokenization Methods Explained
Compare BPE, WordPiece, SentencePiece, Tiktoken — which models use each and why it matters.
Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. It was originally a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).
- 1 Start: split all training text into individual characters
- 2 Count all adjacent symbol pairs across the corpus (character pairs on the first pass, merged tokens afterwards)
- 3 Merge the most frequent pair into a single new token
- 4 Repeat until the vocabulary reaches the target size (e.g. 50,257 for GPT-2)
- 5 At inference, apply the learned merges to new text in the order they were learned
- +Handles any word including unseen ones (falls back to characters)
- +Good balance between vocabulary size and token efficiency
- +Works reasonably well for code (frequent symbols and keywords tend to become single tokens)
- +Widely adopted — huge ecosystem support
- −Less efficient for scripts that are underrepresented in the training data (for example CJK, Arabic, and Hindi)
- −The same word can tokenize differently depending on whether it follows a space (" hello" and "hello" are different tokens)
- −Vocabulary is biased toward the languages that dominate the training data
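The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular model: the function names (`train_bpe`, `merge_pair`, `encode`) and the toy corpus are made up for this example, words are pre-split and treated independently, and training stops after a fixed number of merges rather than at a target vocabulary size.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Step 1: split every word into single characters, keeping word frequencies.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
        # Step 4: repeat; this sketch stops after a fixed number of merges.
    return merges


def encode(word, merges):
    """Step 5: tokenize a new word by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus, assumed for illustration only.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it still tokenizes into two learned pieces. This is the fallback behaviour listed in the pros above: unseen words are built from smaller known units.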
Cross-Language Token Efficiency
The same greeting in six languages, with approximate token counts. Non-Latin scripts use significantly more tokens in BPE models, which increases cost for non-English users.
| Language | Text | BPE (GPT) | SentencePiece | Byte-level |
|---|---|---|---|---|
| 🇬🇧 English | Hello, how are you? | 6 | 6 | 19 |
| 🇪🇸 Spanish | Hola, ¿cómo estás? | 8 | 7 | 21 |
| 🇨🇳 Chinese | 你好，你好吗？ | 14 | 8 | 21 |
| 🇸🇦 Arabic | مرحبا، كيف حالك؟ | 16 | 10 | 30 |
| 🇮🇳 Hindi | नमस्ते, आप कैसे हैं? | 18 | 11 | 50 |
| 🇯🇵 Japanese | こんにちは、元気ですか？ | 12 | 9 | 36 |
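For a pure byte-level tokenizer with no merges applied, the token count is simply the number of UTF-8 bytes in the text, which is why the last column grows so quickly for non-Latin scripts. The sketch below shows one way to check this with the Python standard library. The helper name `utf8_len` and the choice of sample rows are assumptions for illustration, and real byte-level BPE models merge bytes into longer tokens, so their counts are lower than these raw byte counts.

```python
def utf8_len(text):
    """Number of bytes in the UTF-8 encoding of `text` (hypothetical helper)."""
    return len(text.encode("utf-8"))


# Three greetings taken from the table above.
samples = {
    "English": "Hello, how are you?",
    "Spanish": "Hola, ¿cómo estás?",
    "Arabic": "مرحبا، كيف حالك؟",
}

for language, text in samples.items():
    # ASCII characters take 1 byte, accented Latin and Arabic letters take 2,
    # and CJK or Devanagari characters take 3.
    print(f"{language}: {len(text)} characters, {utf8_len(text)} bytes")
```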
About
This guide covers six tokenization approaches used in modern LLMs: Byte Pair Encoding (BPE, used by GPT models), WordPiece (used by BERT and related models such as DistilBERT), SentencePiece (a tokenizer library used by T5, LLaMA 1 and 2, and Gemma), Unigram Language Model (used in some multilingual models), Tiktoken (OpenAI's fast BPE implementation for GPT-3.5+), and Byte-Level BPE (used by GPT-2 and RoBERTa, and reportedly by Claude). Note that SentencePiece and Tiktoken are implementations rather than separate algorithms: SentencePiece can train either BPE or Unigram models, and Tiktoken runs BPE. Each entry explains the algorithm, pros and cons, vocabulary size, which models use it, and language efficiency. A cross-language efficiency table compares how different languages tokenize across algorithms.
- What is tokenization in LLMs?
- Tokenization is the process of splitting text into smaller units called tokens before feeding it to a language model. Tokens are not the same as words: a word like "tokenization" might be split into ["token", "ization"] or kept as one token, depending on the vocabulary. API usage is billed, and context windows are limited, by token count rather than by character or word count.
- Why does the same text use different token counts in GPT vs Claude?
- GPT-3.5 and GPT-4 use Tiktoken (BPE) with the cl100k_base vocabulary (newer OpenAI models such as GPT-4o use o200k_base), while Claude uses a Byte-Level BPE with a different vocabulary. Different merge rules and vocabulary sizes mean the same text maps to different token sequences.
- What is byte-level BPE?
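- Byte-Level BPE starts with individual bytes (0-255) rather than characters as the base vocabulary. This guarantees the model can represent any UTF-8 string without unknown tokens. Used by GPT-2 and RoBERTa, and reportedly by Claude. LLaMA 2 takes a related approach: a SentencePiece BPE vocabulary that falls back to raw bytes for characters it has not seen.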
- Which tokenizer is most efficient for non-English text?
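- SentencePiece models with large multilingual vocabularies (such as Gemma) tend to be most efficient for non-Latin scripts. LLaMA 3 is often grouped with them, but it switched from SentencePiece to a Tiktoken-based BPE tokenizer with a larger 128K vocabulary. English-heavy vocabularies such as Tiktoken's cl100k_base tend to over-tokenize languages like Chinese, Japanese, Korean, and Arabic compared to tokenizers trained specifically for those languages.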
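The byte-level idea from the FAQ above can be illustrated with a short sketch. It contrasts a character-level lookup, where anything outside the vocabulary becomes an unknown token, with a byte-level base vocabulary, where every string maps to ids in the range 0-255 and decodes back without loss. The function names, the `<unk>` marker, and the tiny character vocabulary are assumptions for illustration. A real byte-level BPE tokenizer would go on to merge these bytes into longer tokens.

```python
UNK = "<unk>"


def char_encode(text, vocab):
    """Character-level lookup: characters outside `vocab` become <unk>."""
    return [ch if ch in vocab else UNK for ch in text]


def byte_encode(text):
    """Byte-level lookup: every UTF-8 byte is already a valid id (0-255)."""
    return list(text.encode("utf-8"))


def byte_decode(ids):
    """Invert byte_encode by reassembling the bytes and decoding as UTF-8."""
    return bytes(ids).decode("utf-8")


# A toy character vocabulary that only knows lowercase ASCII and space.
char_vocab = set("abcdefghijklmnopqrstuvwxyz ")
text = "hi 你好"

print(char_encode(text, char_vocab))  # the two Chinese characters are lost as <unk>
ids = byte_encode(text)
print(ids)                            # nine ids, all between 0 and 255
print(byte_decode(ids) == text)       # True: the round trip is lossless
```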