What is tokenization in LLMs?

Tokenization is the process of splitting text into smaller units called tokens before feeding it to a language model. Tokens are not the same as words: depending on the vocabulary, a word like "tokenization" might be split into ["token", "ization"] or kept as a single token. API pricing and context limits are measured in tokens, not characters or words.

Why does the same text use different token counts in GPT vs Claude?

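OpenAI's GPT models use BPE vocabularies served by the tiktoken library (cl100k_base for GPT-3.5 and GPT-4, o200k_base for GPT-4o), while Claude uses a byte-level BPE with a different vocabulary. Different merge rules and vocabulary sizes mean the same text maps to different token sequences, and therefore to different token counts.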

What is byte-level BPE?

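Byte-level BPE uses the 256 possible byte values (0-255), rather than characters, as its base vocabulary. This guarantees the model can represent any UTF-8 string without unknown tokens. It is used by Claude, GPT-2, and RoBERTa. LLaMA 2 takes a related but different approach: SentencePiece BPE with a byte fallback for characters outside its vocabulary.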
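To make the base vocabulary concrete, here is a minimal Python sketch showing that any string reduces to a sequence of values in the range 0-255 once it is UTF-8 encoded. The helper name `to_byte_symbols` and the sample strings are assumptions for illustration only; a real byte-level BPE would go on to merge these byte symbols into larger tokens.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


# ASCII letters take one byte each; other characters take two to four.
for sample in ["hi", "é", "你", "🙂"]:
    symbols = to_byte_symbols(sample)
    # Every symbol is a valid base-vocabulary entry, so nothing is "unknown".
    assert all(0 <= b <= 255 for b in symbols)
    print(repr(sample), len(symbols), symbols)
```

Because the mapping is plain UTF-8, it is lossless: `bytes(symbols).decode("utf-8")` recovers the original string.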

Which tokenizer is most efficient for non-English text?

SentencePiece models with large multilingual vocabularies (for example Gemma and mT5) tend to be most efficient for non-Latin scripts. Note that LLaMA 3 is not in this group: it switched to a tiktoken-based BPE tokenizer. English-centric vocabularies such as tiktoken's cl100k_base tend to over-tokenize languages like Chinese, Japanese, Korean, and Arabic compared with tokenizers trained specifically for those languages.

LLM Tokenization Methods Explained

Compare BPE, WordPiece, SentencePiece, Tiktoken — which models use each and why it matters.

Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. It was originally a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).

How it works

1. Start: split all training text into individual characters
2. Count all adjacent character pairs across the corpus
3. Merge the most frequent pair into a single new token
4. Repeat until the vocabulary reaches the target size (e.g. 50,257 for GPT-2)
5. At inference, apply learned merges in order to tokenize new text
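
The steps above can be sketched in Python as follows. This is a minimal illustration on a toy word list, not any production tokenizer: the function names `train_bpe`, `merge_pair`, and `tokenize`, the corpus, and the tie-breaking rule (the first pair seen wins) are all assumptions made for the example. Real implementations add pre-tokenization, byte-level handling, and much faster data structures.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: split every word into characters, keeping word frequencies.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair (ties go to the first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
        # Step 4: the loop repeats until the merge budget is used up.
    return merges


def tokenize(word, merges):
    """Step 5: apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: each word repeated by its assumed frequency.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, 4)
print(merges)                      # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(tokenize("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it still tokenizes into two learned subwords. With no applicable merges, a word simply stays split into characters.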

Advantages

+ Handles any word, including unseen ones (falls back to characters)
+ Good balance between vocabulary size and token efficiency
+ Works well for code (frequent symbols and operators are often kept as single tokens)
+ Widely adopted, with broad ecosystem support

Limitations

− Less efficient for non-Latin scripts such as CJK and Arabic, which are often split into many short tokens
− The same word can tokenize differently depending on whether it follows a space (" hello" vs "hello")
− Vocabulary is biased toward the languages that dominate the training data

Cross-Language Token Efficiency

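The same greeting in six languages, with approximate token counts. Non-Latin scripts use significantly more tokens in BPE models, which increases cost for non-English users. The Byte-level column is the raw UTF-8 byte count, that is, the sequence length before any merges are applied.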

Language	Text	BPE (GPT)	SentencePiece	Byte-level
🇬🇧 English	Hello, how are you?	6	6	19
🇪🇸 Spanish	Hola, ¿cómo estás?	8	7	21
🇨🇳 Chinese	你好，你好吗？	14	8	21
🇸🇦 Arabic	مرحبا، كيف حالك؟	16	10	30
🇮🇳 Hindi	नमस्ते, आप कैसे हैं?	18	11	50
🇯🇵 Japanese	こんにちは、元気ですか？	12	9	36
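
One column in this table can be checked without any tokenizer. Assuming the Byte-level column means the raw UTF-8 byte length of each greeting (an interpretation, not something the table states), a short Python sketch reproduces it. The `greetings` dictionary and the `byte_counts` name are made up for this example. The BPE and SentencePiece columns depend on specific vocabularies and cannot be reproduced this way.

```python
# Greetings copied from the table above.
greetings = {
    "English": "Hello, how are you?",
    "Spanish": "Hola, ¿cómo estás?",
    "Chinese": "你好，你好吗？",
    "Arabic": "مرحبا، كيف حالك؟",
    "Hindi": "नमस्ते, आप कैसे हैं?",
    "Japanese": "こんにちは、元気ですか？",
}

# Length of the UTF-8 encoding: 1 byte per ASCII character,
# 2 per Arabic letter, 3 per CJK, kana, or Devanagari code point.
byte_counts = {lang: len(text.encode("utf-8")) for lang, text in greetings.items()}

for lang, count in byte_counts.items():
    print(f"{lang}: {len(greetings[lang])} code points, {count} bytes")
```

The ratio of bytes to code points shows why a tokenizer with few learned merges for a script pays a two- to three-fold penalty on it before any other effect comes into play.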
