What Is a Token? From Characters to BPE
A token is a chunk of text that a language model treats as a single unit. Tokens are not words, characters, or syllables; they are fragments determined by the model's tokenization algorithm. A single word might be one token, two tokens, or several, depending on how common it is in the training data.
Most modern LLMs (GPT, Claude, Gemini) use a variant of Byte Pair Encoding (BPE), a compression algorithm that Sennrich et al. (2016) adapted for subword tokenization. BPE builds a vocabulary of common character sequences by repeatedly merging the most frequent adjacent pairs:
- Start with individual characters as the vocabulary.
- Count all adjacent character pairs in the training corpus.
- Merge the most frequent pair into a new token.
- Repeat until the vocabulary reaches the target size (GPT-4's tokenizer has ~100,000 tokens).
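The merge loop above can be sketched in a few lines of Python. This is a toy trainer under simplifying assumptions: it works on a short list of words rather than a real corpus, it starts from characters rather than bytes, and the name `train_bpe` is hypothetical rather than taken from any tokenizer library.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy example)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Pick the most frequent pair and record it as a merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, replacing occurrences of the pair with one symbol.
        merged_words: Counter = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words

    return merges


# "aa" is the most frequent pair, so it is merged first; "aa" + "b" follows.
print(train_bpe(["aab", "aab", "aa"], num_merges=5))
```

Production tokenizers add details this sketch leaves out, such as pre-splitting text on whitespace and punctuation and storing merge ranks for fast encoding.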
The result: common English words and subwords get their own token; rare words, proper nouns, and non-English text are split into multiple tokens. Examples with OpenAI's cl100k_base tokenizer:
| Text | Tokens | Count |
|---|---|---|
| hello world | [hello] [ world] | 2 |
| tokenization | [token][ization] | 2 |
| Anthropic | [Anthrop][ic] | 2 |
| supercalifragilistic | [super][cal][if][rag][il][istic] | 6 |
| 日本語 | [日][本][語] | 3 (3 chars, 3 tokens) |
Rule of thumb for English: ~4 characters per token, or approximately 0.75 words per token (about 1.33 tokens per word). 1,000 tokens ≈ 750 words ≈ three pages of text at 250 words per page. Non-English text typically needs 1.5–3× more tokens per character than English, because the tokenizer was trained predominantly on English text.
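As a quick illustration of the arithmetic, the sketch below applies both heuristics to a piece of text, taking 1,000 tokens ≈ 750 words as the word-based ratio. The function name `estimate_tokens` is hypothetical, and the result is only a ballpark figure for English prose.

```python
import math


def estimate_tokens(text: str) -> dict[str, int]:
    """Rough English-only token estimates; not a substitute for a real tokenizer."""
    # ~4 characters per token
    by_chars = math.ceil(len(text) / 4)
    # ~0.75 words per token, i.e. 4/3 tokens per word
    by_words = math.ceil(len(text.split()) * 4 / 3)
    return {"by_chars": by_chars, "by_words": by_words}


print(estimate_tokens("hello world"))  # → {'by_chars': 3, 'by_words': 3}
```

The two estimates usually disagree slightly; a large gap between them is a hint that the text is not typical English prose (code, URLs, or another language) and that the heuristics should not be trusted.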
Tokenizers Differ Across Models
Each model family uses its own tokenizer, meaning the same text produces a different token count across models. You cannot assume GPT-4 and Claude give you the same count.
| Provider | Tokenizer | Vocab size | Counting tool |
|---|---|---|---|
| OpenAI (GPT-4o) | o200k_base | ~200,000 | tiktoken |
| OpenAI (GPT-4) | cl100k_base | ~100,000 | tiktoken |
| Anthropic (Claude) | Claude tokenizer | ~100,000 | Anthropic API |
| Google (Gemini) | SentencePiece | ~256,000 | Vertex AI SDK |
| Meta (Llama 3) | tiktoken (modified) | ~128,000 | Hugging Face tokenizers |
For production cost estimation, always count tokens using the specific model's tokenizer. Using tiktoken to estimate Claude token counts will give you a close approximation but not an exact number. The CodeLint.Dev Token Counter shows estimated token counts across major models simultaneously.
Context Windows: The Hard Limit
The context window is the maximum number of tokens a model can process in a single API call, counting both input (prompt + system message + conversation history) and output (completion). Exceeding the context window causes an error — the API will reject the request.
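A pre-flight check along these lines can catch oversized requests before they reach the API. It is a minimal sketch that assumes input and output share one window, as described above; the function names are hypothetical, and some models also enforce a separate output cap that this check does not model.

```python
def fits_in_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def remaining_output_budget(input_tokens: int, context_window: int) -> int:
    """Tokens left for the completion once the input has been counted."""
    return max(0, context_window - input_tokens)


# A 126,000-token prompt in a 128,000-token window leaves 2,000 tokens for output,
# so reserving 4,096 output tokens would not fit.
print(fits_in_context(126_000, 4_096, 128_000))   # → False
print(remaining_output_budget(126_000, 128_000))  # → 2000
```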
| Model | Context window | Approx. words / pages |
|---|---|---|
| GPT-4o | 128,000 tokens | ~96,000 words / ~384 pages |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words / ~600 pages |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words / ~3,000 pages |
| Gemini 2.0 Flash | 1,048,576 tokens | ~786,000 words / ~3,150 pages |
| Llama 3.1 (70B) | 128,000 tokens | ~96,000 words / ~384 pages |
Critical design consideration: In a multi-turn conversation, every message in the conversation history counts against the context window. A 100-turn conversation with 200 tokens per turn = 20,000 tokens of history before you add the current prompt or system message. For long-running agents or chat applications, you need a context management strategy (summarisation, sliding window, or embedding-based retrieval).
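Of the three strategies, the sliding window is the simplest to sketch. The example below keeps system messages and then as many of the most recent turns as fit in a token budget. The names `trim_history` and `approx_tokens` are hypothetical, and the default counter is the 4-characters-per-token heuristic, which should be replaced with the model's real tokenizer in production.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Placeholder counter (~4 characters per token); swap in a real tokenizer."""
    return -(-len(text) // 4)  # ceiling division


def trim_history(
    messages: list[dict],
    budget: int,
    count: Callable[[str], int] = approx_tokens,
) -> list[dict]:
    """Keep system messages plus the most recent turns that fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(count(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards from the newest turn, stopping at the first one that does not fit.
    for msg in reversed(turns):
        cost = count(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost

    # Restore chronological order, with system messages first.
    return system + list(reversed(kept))
```

This version ignores per-message formatting overhead and drops old turns outright. A summarising variant would replace the dropped turns with a short summary message instead of discarding them.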
Counting Tokens in Python and JavaScript
Use the official tokenizer libraries for accurate counts — never rely on word count approximations in production.

Python:

    import tiktoken


    def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
        """Count tokens for OpenAI models using tiktoken."""
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))


    def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
        """
        Count tokens for a chat completion request.
        Includes per-message overhead (role, formatting).
        """
        enc = tiktoken.encoding_for_model(model)
        tokens_per_message = 3  # every message has <|im_start|>, role, <|im_sep|>
        tokens_per_name = 1     # if 'name' key is present
        total = 3               # priming tokens: <|im_start|>assistant<|im_sep|>
        for msg in messages:
            total += tokens_per_message
            for key, value in msg.items():
                total += len(enc.encode(value))
                if key == "name":
                    total += tokens_per_name
        return total


    # Example usage
    text = "Explain quantum entanglement in simple terms."
    print(f"Tokens: {count_tokens_openai(text)}")  # → 8

    messages = [
        {"role": "system", "content": "You are a helpful physics teacher."},
        {"role": "user", "content": text},
    ]
    print(f"Chat tokens: {count_chat_tokens(messages)}")  # → ~25

JavaScript:

    // WASM build of tiktoken (npm package `tiktoken`); encoders must be freed after use
    import { encoding_for_model } from 'tiktoken';

    // Count tokens for a specific OpenAI model
    function countTokens(text, model = 'gpt-4o') {
      const enc = encoding_for_model(model);
      const tokens = enc.encode(text);
      enc.free(); // Important: free WASM memory
      return tokens.length;
    }

    // Approximate token count (no WASM dependency, useful for client-side)
    function approximateTokenCount(text) {
      // Rough approximation: 1 token per ~4 characters for English text
      return Math.ceil(text.length / 4);
    }

    // Count context for a conversation
    function countChatTokens(messages, model = 'gpt-4o') {
      const enc = encoding_for_model(model);
      let total = 3; // priming
      for (const msg of messages) {
        total += 3; // per-message overhead
        total += enc.encode(msg.role ?? '').length; // role, counted as in the Python version
        total += enc.encode(msg.content ?? '').length;
      }
      enc.free();
      return total;
    }

    const text = 'Explain quantum entanglement in simple terms.';
    console.log(countTokens(text)); // 8

7 Practical Ways to Reduce Token Usage
- Use concise system prompts. A system prompt that runs to 500 tokens is sent with every API call. A well-written 100-token system prompt saves 400 tokens × number of API calls per day. Over 10,000 daily calls, that is 4 million tokens saved per day.
- Strip boilerplate from input documents. HTML tags, repeated headers, legal disclaimers, navigation menus — none of these help the model answer the user's question. Strip them before passing the document to the model.
- Summarise conversation history. In long conversations, replace old message history with a running summary. Keep the last 2–3 turns in full detail; summarise everything before that.
- Use retrieval-augmented generation (RAG). Instead of sending an entire knowledge base in the context, retrieve only the 3–5 most relevant chunks using vector search and send those. A well-tuned RAG system uses 10–100× fewer tokens than stuffing the full document set.
- Reduce max_tokens when output is known to be short. Leaving max_tokens=4096 on a request that only needs a yes/no answer costs nothing extra at billing time (you pay for output tokens used, not the limit), but a tight limit does prevent runaway generations. Set a realistic upper bound.
- Use structured output formats. Asking the model to return JSON with a defined schema produces more compact output than free-form prose. A structured response of 50 tokens can contain the same information as a prose response of 200 tokens.
- Choose the right model tier. GPT-4o mini and Claude Haiku are 10–30× cheaper per token than their frontier counterparts and handle the majority of production use cases well. Reserve frontier models for tasks that genuinely require their capability — complex reasoning, nuanced writing, ambiguous instructions.
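As one concrete illustration of the boilerplate-stripping tip, the sketch below extracts visible text from an HTML page using only Python's standard library, skipping scripts, styles, and navigation chrome. The class and function names are hypothetical, and the set of skipped tags is an assumption that would need tuning per site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping elements that rarely help the model."""

    # Assumed list of boilerplate containers; adjust for the pages you process.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    """Return the page's visible text with tags and skipped sections removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><nav><a href='/'>Home</a></nav>"
    "<p>Refunds take 5 days.</p><script>track()</script></html>"
)
print(strip_boilerplate(page))  # → Refunds take 5 days.
```

Because each text fragment is trimmed and joined with a single space, inline markup such as bold text can introduce stray spaces around punctuation. For messy real-world pages a dedicated HTML-to-text library is likely a better fit.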