
LLM Token Counter: What Tokens Are, How Tokenization Works, and Why It Matters

Every large language model — GPT-4o, Claude Sonnet, Gemini Flash — processes text as a sequence of tokens, not characters or words. Tokens are the atomic unit of LLM input and output, and they determine two things directly: your API cost and whether your content fits in the model's context window. This guide explains how tokenization works from first principles, how different models handle tokens differently, the context window limits you need to know, and practical techniques for reducing token usage without degrading output quality.


What Is a Token? From Characters to BPE

A token is a chunk of text that a language model treats as a single unit. Tokens are not words, characters, or syllables — they are fragments determined by the model's tokenization algorithm. A single word might be one token, two tokens, or even three tokens depending on how common it is in the training data.

Most modern LLMs (GPT, Claude, Gemini) use a variant of Byte Pair Encoding (BPE), a compression algorithm (Gage, 1994) that Sennrich et al. (2016) adapted for subword tokenization. BPE builds a vocabulary of common character sequences by repeatedly merging the most frequent adjacent pairs:

  1. Start with individual characters as the vocabulary.
  2. Count all adjacent character pairs in the training corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until the vocabulary reaches the target size (GPT-4's tokenizer has ~100,000 tokens).
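The four steps above can be sketched as a toy trainer. This is a minimal illustration that works on whole characters and a tiny word list; production tokenizers operate on bytes, pre-split text with regular expressions, and train on far larger corpora. The function names `merge_pair`, `train_bpe`, and `tokenize` are hypothetical and not taken from any library.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair into a new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged: Counter = Counter()
        for symbols, freq in words.items():
            merged[tuple(merge_pair(list(symbols), best))] += freq
        words = merged
    # Step 4 is the loop itself: repeat until the merge budget is spent.
    return merges


def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_bpe(["aab", "aab", "aac"], num_merges=2)
print(merges)                   # [('a', 'a'), ('aa', 'b')]
print(tokenize("aac", merges))  # ['aa', 'c']
```

The frequent word "aab" ends up as a single token, while the rarer "aac" stays split in two, which is the same effect that makes common English words cheap and rare strings expensive.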

The result: common English words and subwords get their own token; rare words, proper nouns, and non-English text are split into multiple tokens. Examples with OpenAI's cl100k tokenizer:

Text | Tokens | Count
hello world | [hello] [ world] | 2
tokenization | [token] [ization] | 2
Anthropic | [Anthrop] [ic] | 2
supercalifragilistic | [super] [cal] [if] [rag] [il] [istic] | 6
日本語 | [日] [本] [語] | 3 (3 chars, 3 tokens)

Rule of thumb for English: ~4 characters per token, or about 0.75 words per token (roughly 1.33 tokens per word). 1,000 tokens ≈ 750 words ≈ three pages at 250 words per page. Non-English text typically needs 1.5–3× more tokens than English for equivalent content, because the tokenizer's vocabulary was learned predominantly from English text.
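As a quick sketch of that rule of thumb, the two helpers below turn a character count or a word count into a rough token estimate. The names and constants are illustrative assumptions; they apply only to English prose and are no substitute for a real tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # equivalently, about 1.33 tokens per word


def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate: one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate: one token per 0.75 words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(750))    # 1000
print(estimate_tokens_from_words(1_000))  # 1334
```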

Tokenizers Differ Across Models

Each model family uses its own tokenizer, meaning the same text produces a different token count across models. You cannot assume GPT-4 and Claude give you the same count.

Provider | Tokenizer | Vocab size | Counting tool
OpenAI (GPT-4o) | o200k_base | ~200,000 | tiktoken
OpenAI (GPT-4) | cl100k_base | ~100,000 | tiktoken
Anthropic (Claude) | Claude tokenizer | ~100,000 (not officially published) | Anthropic API
Google (Gemini) | SentencePiece | ~256,000 | Vertex AI SDK
Meta (Llama 3) | tiktoken (modified) | ~128,000 | Hugging Face tokenizers

For production cost estimation, always count tokens using the specific model's tokenizer. Using tiktoken to estimate Claude token counts gives only a rough approximation, not an exact number. The CodeLint.Dev Token Counter shows estimated token counts across major models simultaneously.

Context Windows: The Hard Limit

The context window is the maximum number of tokens a model can process in a single API call, counting both input (prompt + system message + conversation history) and output (completion). Exceeding the context window causes an error — the API will reject the request.

Model | Context window | Approx. words | Approx. pages (250 words per page)
GPT-4o | 128,000 tokens | ~96,000 | ~384
Claude 3.5 Sonnet | 200,000 tokens | ~150,000 | ~600
Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 | ~3,000
Gemini 2.0 Flash | 1,048,576 tokens | ~786,000 | ~3,100
Llama 3.1 (70B) | 128,000 tokens | ~96,000 | ~384
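Because input and output share one budget, a pre-flight check can catch oversized requests before the API rejects them. The sketch below hard-codes the limits from the table; the dictionary keys are illustrative labels rather than exact API model identifiers, and the function names are hypothetical.

```python
# Context window sizes in tokens, copied from the table above.
# Keys are illustrative labels, not exact API model IDs.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
    "gemini-2.0-flash": 1_048_576,
    "llama-3.1-70b": 128_000,
}


def headroom(model: str, input_tokens: int, max_output_tokens: int) -> int:
    """Tokens left in the window after reserving room for the reply."""
    return CONTEXT_WINDOWS[model] - input_tokens - max_output_tokens


def fits_in_context(model: str, input_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return headroom(model, input_tokens, max_output_tokens) >= 0


print(fits_in_context("gpt-4o", 120_000, 4_096))  # True
print(fits_in_context("gpt-4o", 126_000, 4_096))  # False
```

Reserving the output budget up front matters: a prompt that fits on its own can still fail once the requested completion length is added.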

Critical design consideration: In a multi-turn conversation, every message in the conversation history counts against the context window. A 100-turn conversation with 200 tokens per turn = 20,000 tokens of history before you add the current prompt or system message. For long-running agents or chat applications, you need a context management strategy (summarisation, sliding window, or embedding-based retrieval).
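One of the strategies mentioned above, the sliding window, can be sketched in a few lines. This is a minimal version under stated assumptions: messages are dicts with "role" and "content" keys, system messages are always kept, and `count_tokens` is any callable you supply (for example a tokenizer-backed counter). The name `trim_history` is hypothetical.

```python
from typing import Callable


def trim_history(
    messages: list[dict],
    budget: int,
    count_tokens: Callable[[str], int],
) -> list[dict]:
    """Keep system messages plus the newest turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards from the newest turn and stop at the first one
    # that would push the total over budget.
    for msg in reversed(turns):
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + kept[::-1]  # restore chronological order


# Usage with a crude whitespace counter standing in for a real tokenizer.
history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = trim_history(history, budget=5, count_tokens=lambda s: len(s.split()))
print([m["content"] for m in trimmed])  # ['be brief', 'four five', 'six']
```

A sliding window is the simplest option but drops old context entirely; summarisation or retrieval keeps more information at the cost of extra calls.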

Counting Tokens in Python and JavaScript

Use the official tokenizer libraries for accurate counts — never rely on word count approximations in production.

Python (tiktoken for OpenAI)
import tiktoken

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for OpenAI models using tiktoken."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """
    Count tokens for a chat completion request.
    Includes per-message overhead (role, formatting).
    """
    enc = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message has <|im_start|>, role, <|im_sep|>
    tokens_per_name = 1     # if 'name' key is present
    total = 3               # priming tokens: <|im_start|>assistant<|im_sep|>

    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            if not isinstance(value, str):
                continue  # skip None or structured content (e.g. tool calls)
            total += len(enc.encode(value))
            if key == "name":
                total += tokens_per_name
    return total

# Example usage
text = "Explain quantum entanglement in simple terms."
print(f"Tokens: {count_tokens_openai(text)}")  # exact value depends on the model's encoding

messages = [
    {"role": "system", "content": "You are a helpful physics teacher."},
    {"role": "user", "content": text},
]
print(f"Chat tokens: {count_chat_tokens(messages)}")  # → ~25
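Node.js (tiktoken npm package, WASM)
// The snake_case API and enc.free() below belong to the WASM-based `tiktoken` package.
// The pure-JS `js-tiktoken` package instead exports encodingForModel() and needs no free().
import { encoding_for_model } from 'tiktoken';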

// Count tokens for a specific OpenAI model
function countTokens(text, model = 'gpt-4o') {
  const enc = encoding_for_model(model);
  const tokens = enc.encode(text);
  enc.free(); // Important: free WASM memory
  return tokens.length;
}

// Approximate token count (no WASM dependency — useful for client-side)
function approximateTokenCount(text) {
  // Rough approximation: 1 token per ~4 characters for English text
  return Math.ceil(text.length / 4);
}

// Count context for a conversation
function countChatTokens(messages, model = 'gpt-4o') {
  const enc = encoding_for_model(model);
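  let total = 3; // priming tokens for the assistant reply
  for (const msg of messages) {
    total += 3; // per-message overhead
    total += enc.encode(msg.role ?? '').length; // role counts too, as in the Python version
    total += enc.encode(msg.content ?? '').length;
  }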
  enc.free();
  return total;
}

const text = 'Explain quantum entanglement in simple terms.';
console.log(countTokens(text)); // exact value depends on the model's encoding

7 Practical Ways to Reduce Token Usage

  1. Use concise system prompts. A system prompt that runs to 500 tokens is sent with every API call. A well-written 100-token system prompt saves 400 tokens × number of API calls per day. Over 10,000 daily calls, that is 4 million tokens saved per day.
  2. Strip boilerplate from input documents. HTML tags, repeated headers, legal disclaimers, navigation menus — none of these help the model answer the user's question. Strip them before passing the document to the model.
  3. Summarise conversation history. In long conversations, replace old message history with a running summary. Keep the last 2–3 turns in full detail; summarise everything before that.
  4. Use retrieval-augmented generation (RAG). Instead of sending an entire knowledge base in the context, retrieve only the 3–5 most relevant chunks using vector search and send those. Depending on corpus size, a well-tuned RAG system can use 10–100× fewer tokens than stuffing the full document set into the prompt.
  5. Set max_tokens to a realistic upper bound when output is known to be short. Leaving max_tokens=4096 on a request that only needs a yes/no answer costs nothing extra at billing time (you pay for output tokens generated, not for the limit), but it also does nothing to stop a runaway generation. A tight limit caps the worst case.
  6. Use structured output formats. Asking the model to return JSON with a defined schema often produces more compact output than free-form prose, because it removes preamble and filler. A structured response of 50 tokens can carry the same information as a prose response of 200 tokens. Keep keys short, since JSON punctuation and key names consume tokens too.
  7. Choose the right model tier. GPT-4o mini and Claude Haiku are 10–30× cheaper per token than their frontier counterparts and handle the majority of production use cases well. Reserve frontier models for tasks that genuinely require their capability — complex reasoning, nuanced writing, ambiguous instructions.
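Technique 2 in the list above (stripping boilerplate) is the easiest to automate. The sketch below uses only Python's standard `html.parser` module to drop tags plus the contents of navigation, header, footer, script, and style elements. The class name, function name, and the choice of which tags to skip are assumptions for illustration; real pages usually need site-specific rules.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring boilerplate containers."""

    SKIP = {"script", "style", "nav", "header", "footer"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_boilerplate(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = (
    "<html><nav>Home | About</nav><style>p{color:red}</style>"
    "<p>Tokens are <b>cheap</b>.</p><p>Markup is not.</p>"
    "<footer>Copyright 2024</footer></html>"
)
print(strip_boilerplate(page))  # Tokens are cheap. Markup is not.
```

Counting tokens before and after this step on a sample of your own documents is a quick way to see whether the cleanup is worth adding to the pipeline.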

Frequently Asked Questions

How many tokens is 1000 words?
For standard English prose, approximately 1,333 tokens, using the common rule of 4 characters per token and an average word length of about 5.3 characters including the trailing space (1,000 × 5.33 / 4 ≈ 1,333). Equivalently, 1,000 tokens ≈ 750 words. This varies by content type: code usually needs more tokens per word because of symbols and identifiers, while simple conversational text needs fewer.
Why does non-English text use more tokens?
BPE tokenizers are trained on text corpora that are predominantly English. Common English words and subwords get compact single-token representations. Characters and subwords in other languages — especially non-Latin scripts like Chinese, Japanese, Arabic, and Korean — appear less frequently in training data and are broken into more token fragments. A Chinese character may cost 1–3 tokens where a comparable English word costs 1. This means that for the same number of tokens, non-English content conveys less information, and non-English API calls can be 1.5–3× more expensive per semantic unit.
Does formatting (markdown, code blocks) use extra tokens?
Yes. Markdown formatting characters (**, ##, -, `, triple backticks), JSON structural characters ({, }, :, commas, and quotes), and whitespace (spaces, newlines, indentation) all consume tokens. A heavily formatted response with many markdown headers, bullet points, and code blocks will use more tokens than the same information in plain prose. For high-volume, cost-sensitive applications, consider requesting plain text output and handling formatting client-side.
What happens when I exceed the context window?
The API returns an error (typically HTTP 400 with a message like "maximum context length exceeded"). Your request is not processed and you are not charged. You need to reduce the total token count of your request — shorten the prompt, remove conversation history, or summarise long documents — before retrying. Some API clients implement automatic context truncation as a fallback, but this can silently remove important context.
Is the token count the same for input and output pricing?
The token count is the same (1 token = 1 token), but the prices differ. Output (completion) tokens are typically 3–5× more expensive than input (prompt) tokens because output is generated sequentially, with one forward pass through the model per token, while all input tokens can be processed in parallel in a single pass. For example, at the time of writing GPT-4o charges $2.50 per million input tokens and $10.00 per million output tokens (a 4× difference). This pricing asymmetry makes it economical to write long, detailed prompts if they produce shorter, more accurate completions.
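The arithmetic behind that asymmetry is simple enough to sketch. The default prices below are the GPT-4o figures quoted in this answer and should be treated as assumptions that will go stale; the function name is hypothetical.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float = 2.50,    # assumed: figure quoted above
    output_price_per_million: float = 10.00,  # assumed: figure quoted above
) -> float:
    """Cost of one request in US dollars, given per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# A 2,000-token prompt with a 500-token reply: both halves cost the same,
# even though the prompt is four times longer.
print(request_cost_usd(2_000, 500))  # 0.01
```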
How do I count tokens for Claude (Anthropic) models?
Anthropic does not publish a standalone tokenizer library equivalent to tiktoken for current Claude models. The most accurate way to count Claude tokens is the Anthropic API's token counting endpoint: POST to /v1/messages/count_tokens with your model name and messages array. This returns the input token count before you commit to a paid API call. For rough estimation, tiktoken with cl100k_base (GPT-4's tokenizer) gives a ballpark figure for English text, but the two tokenizers are different and the counts can diverge, so do not rely on it for billing or hard limits.
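A call to that endpoint might look like the sketch below, written with only the standard library so the request shape is visible. The header names, the `anthropic-version` value, the `input_tokens` response field, and the example model ID reflect Anthropic's public API documentation as best understood here; treat them as assumptions and confirm against the current docs. The function names are hypothetical.

```python
import json
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(
    api_key: str, model: str, messages: list[dict]
) -> urllib.request.Request:
    """Build (but do not send) a token counting request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",  # assumed API version string
            "content-type": "application/json",
        },
    )


def count_claude_tokens(api_key: str, model: str, messages: list[dict]) -> int:
    """Send the request and return the reported input token count."""
    request = build_count_tokens_request(api_key, model, messages)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["input_tokens"]  # assumed response field


# Usage (requires a real API key and network access):
# n = count_claude_tokens(
#     "YOUR_API_KEY",
#     "claude-3-5-sonnet-20241022",
#     [{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
# )
```

Anthropic's official SDKs wrap the same endpoint, which is usually the more convenient route in application code.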
