
LLM Tokenization Methods Explained

Compare BPE, WordPiece, SentencePiece, Tiktoken — which models use each and why it matters.

Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. It was originally a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).

How it works
  1. Start: split all training text into individual characters
  2. Count all adjacent character pairs across the corpus
  3. Merge the most frequent pair into a single new token
  4. Repeat until the vocabulary reaches the target size (e.g. 50,257 for GPT-2)
  5. At inference, apply learned merges in order to tokenize new text
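The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular model: the function names (`train_bpe`, `merge_pair`, `encode`) are made up for this example, the corpus is assumed to be a list of already-split words, and ties between equally frequent pairs are broken by first occurrence. Production tokenizers add pre-tokenization, byte-level fallback, and much faster data structures.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (steps 1 to 4)."""
    # Step 1: every word starts as a tuple of single characters, stored with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair (first-seen pair wins a tie).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Step 5: tokenize a new word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on a toy corpus such as `["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3` with four merges, `encode("lowest", merges)` splits the word into `low` and `est`, while a word made of unmerged pairs such as `newer` stays as single characters.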
Advantages
  • Handles any word, including unseen ones (falls back to characters, or to bytes in byte-level BPE)
  • Good balance between vocabulary size and token efficiency
  • Works well for code (frequent symbols and operators are usually learned as single tokens)
  • Widely adopted, with broad ecosystem support
Limitations
  • Less efficient for non-Latin scripts (CJK, Arabic), which are typically split into more tokens per word
  • The same word can tokenize differently depending on position, for example with and without a leading space
  • Vocabulary is biased toward the languages that dominate the training data

Cross-Language Token Efficiency

The same greeting in six languages, with approximate token counts. In the BPE column, non-Latin scripts need roughly two to three times as many tokens as English, which raises cost for non-English users.

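| Language | Text | BPE (GPT) | SentencePiece | Byte-level |
|---|---|---|---|---|
| 🇬🇧 English | Hello, how are you? | 6 | 6 | 19 |
| 🇪🇸 Spanish | Hola, ¿cómo estás? | 8 | 7 | 21 |
| 🇨🇳 Chinese | 你好,你好吗? | 14 | 8 | 21 |
| 🇸🇦 Arabic | مرحبا، كيف حالك؟ | 16 | 10 | 30 |
| 🇮🇳 Hindi | नमस्ते, आप कैसे हैं? | 18 | 11 | 45 |
| 🇯🇵 Japanese | こんにちは、元気ですか? | 12 | 9 | 36 |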
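The byte-level column and the relative cost can be reproduced with a short Python sketch. It assumes that a purely byte-level tokenizer emits one token per UTF-8 byte; the helper names `utf8_byte_count` and `relative_cost` are invented for this example, and the BPE figures are copied from the table above, not measured here.

```python
def utf8_byte_count(text: str) -> int:
    """Tokens emitted by a purely byte-level tokenizer: one per UTF-8 byte."""
    return len(text.encode("utf-8"))


def relative_cost(tokens: int, baseline_tokens: int) -> float:
    """How many times more tokens a text needs than the baseline text."""
    return tokens / baseline_tokens


# BPE (GPT) column from the table above (approximate counts, not measured here).
bpe_counts = {
    "English": 6,
    "Spanish": 8,
    "Chinese": 14,
    "Arabic": 16,
    "Hindi": 18,
    "Japanese": 12,
}

if __name__ == "__main__":
    # ASCII characters take 1 byte; accented Latin letters take 2; CJK characters take 3.
    for text in ("Hello, how are you?", "Hola, ¿cómo estás?"):
        print(f"{text!r}: {utf8_byte_count(text)} bytes")

    # Token cost of each greeting relative to the English one.
    for language, count in bpe_counts.items():
        ratio = relative_cost(count, bpe_counts["English"])
        print(f"{language}: {ratio:.2f}x the English token count")
```

The English greeting is pure ASCII, so its 19 characters become 19 bytes. Each CJK character takes three bytes in UTF-8, which is why byte-level counts grow quickly for those scripts.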