What is a context window in an LLM?

The context window is the maximum number of tokens a model can process in a single request — system prompt, conversation history, pasted documents, tool results, and the generated output all count against it. A token is roughly three-quarters of an English word. A 128K window fits about a 300-page book; a 1M window fits several books or a mid-size codebase.

Which LLM has the largest context window in 2026?

Among open-weights models, Llama 4 Scout advertises 10 million tokens (self-hosted). Among hosted frontier models, more than a dozen — including Gemini 3.x Pro, recent Claude models, GPT-5.x via API, DeepSeek V4, and Qwen 3.5 — offer 1M+ token windows, with some providers charging a premium for the portion of a request beyond a threshold like 200K tokens. Advertised size and reliably usable size are not the same thing, so test on your workload.

Why does my model forget things that are still inside its context window?

Effective context is smaller than advertised context. Research beginning with the 2023 "Lost in the Middle" paper shows retrieval accuracy is highest at the start and end of the window and weakest in the middle, and multi-hop benchmarks like RULER show reasoning quality degrading well before the window is full. Mitigations: put critical instructions at the start, place key data near the question, remove irrelevant material, and summarize long histories instead of resending them.

How much does it cost to use a full 1M-token context window?

Input tokens are billed per request, so one full 1M-token prompt costs roughly $0.15 on the cheapest long-context models and $10 or more on premium frontier models, a spread of more than 60×. In multi-turn chats and agent loops the history is resent every turn, multiplying the cost. Prompt caching (50–90% discounts on repeated prefixes), history compaction, and routing extraction work to cheaper models are the standard cost controls.

Is RAG dead now that context windows are huge?

No. The economics and quality data say otherwise. Re-sending a large corpus on every query costs orders of magnitude more than retrieving the relevant slice, and irrelevant context measurably degrades answer quality ("context rot"). What changed is RAG's job: with 1M-token windows, retrieval only needs to find the right neighborhood (tens to a few hundred thousand tokens), not the exact paragraph. Whole-corpus reasoning tasks, one-off analyses, and small corpora are where pure long context wins.

Do all languages use the same number of tokens?

No. Tokenizers are trained predominantly on English-heavy data, so English is the most token-efficient language. Hindi, Japanese, Korean, Arabic, and other non-Latin scripts commonly consume 2–4× more tokens for the same content, and source code sits between prose and worst-case text. This means the same document translated into different languages fills very different fractions of a context window — and costs proportionally more.

LLM Context Windows in 2026: 128K to 10M Tokens Explained

What a Context Window Actually Is

A context window is the maximum number of tokens a model can attend to in one request, and everything counts against it: the system prompt, conversation history, documents you paste, tool definitions and tool results, and the tokens the model generates as output. A token is roughly ¾ of an English word or about 4 characters. Code, numbers, and non-Latin scripts tokenize less efficiently: Hindi, Japanese, or Arabic text can consume 2–4× more tokens than English for the same content.

Rules of thumb for sizing:

1,000 tokens ≈ 750 English words ≈ 1.5 pages
128K tokens ≈ a 300-page novel
1M tokens ≈ 7–8 novels, or a mid-size codebase (~30–50K lines with comments)
10M tokens ≈ a small library — or one enterprise monorepo
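The rules of thumb above can be turned into a quick pre-flight estimate. This is a minimal sketch that assumes English prose and uses only the two heuristics from the text (about 4 characters per token, about ¾ of a word per token); the function names are illustrative, and a real tokenizer for your model will give different counts, especially for code and non-Latin scripts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose.

    Uses the two heuristics from the text and returns the larger one,
    so that budgets err on the safe side.
    """
    by_chars = len(text) / 4             # ~4 characters per token
    by_words = len(text.split()) / 0.75  # ~3/4 of a word per token
    return math.ceil(max(by_chars, by_words))


def window_fraction(tokens: int, window: int) -> float:
    """Share of a context window that `tokens` would occupy."""
    return tokens / window


if __name__ == "__main__":
    sample = "word " * 750  # 750 short English-like words
    tokens = estimate_tokens(sample)
    print(tokens, f"{window_fraction(tokens, 128_000):.2%} of a 128K window")
```

Treat the result as a planning number only; count with the provider's tokenizer before relying on it near a limit.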

Two details regularly bite developers. First, the window is shared between input and output: a model with a 200K window and 8K max output cannot read 199K tokens and still write 8K, because 199K + 8K exceeds 200K; the input must stay at or below 192K to leave room for a full-length answer. Second, conversation history re-enters the window on every turn: a chat that has consumed 90% of the window has only 10% left for the next question and answer, which is why long chats degrade and why context management (summarization, pruning) exists.
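The shared-budget arithmetic is easy to check before sending a request. The sketch below assumes, as the text describes, a single window shared by input and output; the function names and the `history_tokens` parameter are hypothetical helpers, not part of any provider SDK.

```python
def fits(input_tokens: int, window: int, max_output: int) -> bool:
    """True if the input still leaves room for a full-length output."""
    return input_tokens + max_output <= window


def max_input_tokens(window: int, max_output: int, history_tokens: int = 0) -> int:
    """Tokens left for new input after reserving output space and resent history."""
    return max(window - max_output - history_tokens, 0)


if __name__ == "__main__":
    # The example from the text: 200K window, 8K max output.
    print(fits(199_000, 200_000, 8_000))       # 199K + 8K > 200K
    print(max_input_tokens(200_000, 8_000))    # room for new input in a fresh chat
    # A chat whose history already fills 90% of the window.
    print(max_input_tokens(200_000, 8_000, history_tokens=180_000))
```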

The 2026 Landscape: Who Offers What

The frontier has moved fast. In 2023, 32K tokens was premium; by 2026, more than a dozen models ship 1M+ windows. The approximate landscape as of mid-2026:

Tier	Window	Representative models
Mega	10M	Llama 4 Scout (open weights, self-hosted)
Frontier	1M+	Gemini 3.1 Pro, recent Claude models, GPT-5.x (1M via API tier), DeepSeek V4, Qwen 3.5 Plus, Llama 4 Maverick
Extended	400–512K	GPT-5.5 default window, several mid-tier hosted models
Standard	128–256K	Most open-weights deployments, smaller hosted models

Caveats that the headline numbers hide:

Tiered pricing: several providers charge a premium (often ~2×) for the portion of a request beyond a threshold (e.g. beyond 200K or 272K tokens).
Rate limits: a 1M-token request may be allowed by the model but throttled by your API tier's tokens-per-minute limits.
Latency: time-to-first-token grows with input size; a near-full 1M-token request can take tens of seconds before the first output token arrives.
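To see what tiered pricing does to a bill, here is a small sketch of the rule as described above: a premium multiplier applied to the portion of the input beyond a threshold. The threshold, multiplier, and price are placeholder values, not any provider's actual rates. Some providers may instead reprice the whole request once the threshold is crossed; the hypothetical `reprice_whole_request` flag models that case, so check the pricing page for the model you use.

```python
def tiered_input_cost(
    input_tokens: int,
    base_price_per_mtok: float,
    threshold: int = 200_000,           # placeholder threshold
    premium_multiplier: float = 2.0,    # placeholder premium
    reprice_whole_request: bool = False,
) -> float:
    """Input cost in dollars under a simple two-tier pricing rule."""
    if input_tokens <= threshold:
        return input_tokens * base_price_per_mtok / 1_000_000
    if reprice_whole_request:
        # Assumption: crossing the threshold reprices every input token.
        return input_tokens * base_price_per_mtok * premium_multiplier / 1_000_000
    # Only the tokens beyond the threshold pay the premium.
    above = input_tokens - threshold
    return (threshold + above * premium_multiplier) * base_price_per_mtok / 1_000_000


if __name__ == "__main__":
    for n in (100_000, 200_000, 1_000_000):
        print(n, tiered_input_cost(n, base_price_per_mtok=1.0))
```

At a placeholder base price of $1 per million tokens, a full 1M-token prompt costs $1.80 under the portion-only rule instead of $1.00, so the premium tier nearly doubles the bill.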

Advertised vs Effective Context: The Research

The most important finding in long-context research is that every model's usable context is smaller than its advertised context — the only question is by how much.

Lost in the middle. The landmark result (Liu et al., 2023, "Lost in the Middle") showed models retrieve information best from the beginning and end of the context and worst from the middle — a U-shaped accuracy curve. Newer models have flattened the U considerably, but the effect has not vanished.
Needle-in-a-haystack is the easy test. Finding one planted sentence in a sea of text is nearly solved — frontier models score near-perfectly. But NIAH only tests retrieval of a verbatim fact.
Harder benchmarks tell the real story. Benchmarks like RULER and multi-hop reasoning tests, which require combining several pieces of information scattered across the window, show effective context dropping to a fraction of the advertised window — a model advertising 1M may reason reliably over only the first few hundred thousand tokens of genuinely interdependent material.
Context rot. Practitioners consistently observe that irrelevant material does not just waste space — it actively degrades answers. More context is not neutral; it is a distractor unless it is relevant.

The practical translation: treat the advertised window as a hard ceiling, and the first ~50–70% of it as the zone where you should place anything the model must reason about carefully. Put critical instructions at the start, critical data near the question, and do not pad the window just because it is there.
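One way to apply that placement advice is to fix the order in which a prompt is assembled. This is a minimal sketch with a hypothetical `build_prompt` helper: instructions go first, bulk background material sits in the middle, and the data the model must reason about sits directly before the question. The four-part split is an illustrative assumption, not a prescribed format.

```python
def build_prompt(
    instructions: str,
    background: list[str],
    key_data: list[str],
    question: str,
) -> str:
    """Assemble a prompt so critical parts avoid the middle of the window."""
    parts = [
        instructions,   # critical instructions at the start
        *background,    # lower-priority material in the middle
        *key_data,      # critical data near the question
        question,       # the question last
    ]
    # Drop empty pieces rather than padding the window with blank sections.
    return "\n\n".join(p.strip() for p in parts if p.strip())


if __name__ == "__main__":
    print(build_prompt(
        "Answer only from the documents provided.",
        ["Background A: company history."],
        ["Key fact B: the contract ends in March."],
        "When does the contract end?",
    ))
```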

The Economics: What Filling a Window Costs

Long context is priced per input token, which makes big windows deceptively expensive. Filling a 1M-token window once costs anywhere from roughly $0.15 on budget models to $10 or more on premium frontier models, a spread of more than 60× for the same token count ($10 / $0.15 ≈ 67). Now multiply by a chat interface that resends the full history every turn, or an agent that loops 20 times, and the window becomes your dominant cost driver.
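The resend effect is quadratic, which a few lines of arithmetic make concrete. The sketch below assumes a simplified chat in which every turn adds the same number of tokens to the history and the full history is sent as input each turn, with no caching; the per-turn size and the price are placeholders.

```python
def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed when every turn resends the whole history.

    Turn k sends k * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


def input_cost(total_tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of `total_tokens` input tokens at a per-million price."""
    return total_tokens * price_per_mtok / 1_000_000


if __name__ == "__main__":
    total = cumulative_input_tokens(tokens_per_turn=5_000, turns=20)
    print(total)                         # 1,050,000 tokens billed
    print(input_cost(total, 10.0))       # at a placeholder $10 per million
```

In this example the conversation itself is only 100K tokens long, yet 20 turns bill 1.05M input tokens, more than ten times the size of the final history.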

The three levers that matter most in practice:

Prompt caching. All major providers now discount repeated prefixes (typically 50–90% off cached input). Structure prompts so the stable part (system prompt, tool definitions, reference documents) is a byte-identical prefix, and the variable part (the user question) comes last. For agents and chatbots whose prompts are mostly stable prefix, this can cut input costs several-fold, approaching 10× at a 90% discount.
Context management. Summarize or prune old turns instead of resending everything. An agent that compacts its history at 50% window usage often outperforms one that runs to the edge — both on cost and on answer quality (see context rot above).
Model routing. Long-document tasks that are mostly extraction (find, quote, tabulate) run fine on cheap long-context models; reserve premium models for the reasoning step over the extracted material. A two-stage pipeline frequently costs 10× less than one premium mega-prompt.
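The first two levers can be sketched in a few lines. `cached_input_cost` is a simplified cost model that assumes the first call pays full price and every later call gets the discount on the stable prefix; it ignores cache-write surcharges and cache expiry, which vary by provider. `compact_history` is a hypothetical helper that summarizes older turns once usage passes a trigger fraction of the window; `count_tokens` and `summarize` are caller-supplied stand-ins for a real tokenizer and a real summarization call.

```python
from typing import Callable


def cached_input_cost(
    prefix_tokens: int,
    variable_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_discount: float,
) -> float:
    """Total input cost when a stable prefix is cached after the first call."""
    full = (prefix_tokens + variable_tokens) * price_per_mtok / 1_000_000
    discounted = (
        prefix_tokens * (1 - cache_discount) + variable_tokens
    ) * price_per_mtok / 1_000_000
    return full + (calls - 1) * discounted


def compact_history(
    turns: list[str],
    window: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    trigger: float = 0.5,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary once usage passes the trigger."""
    used = sum(count_tokens(t) for t in turns)
    if used <= trigger * window or len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old), *recent]


if __name__ == "__main__":
    uncached = 100 * (95_000 + 5_000) * 1.0 / 1_000_000
    cached = cached_input_cost(95_000, 5_000, calls=100,
                               price_per_mtok=1.0, cache_discount=0.9)
    print(uncached, cached, uncached / cached)
```

With a 95K-token stable prefix, a 5K-token variable part, and a 90% discount, 100 calls cost about $1.54 instead of $10 at a placeholder $1 per million tokens, roughly a 6.5× saving. At a 50% discount the same setup saves less than 2×, so the size of the discount and the share of the prompt that is stable both matter.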

Long Context vs RAG: A Decision Framework

"Just put everything in context" and "always build RAG" are both wrong as defaults. The honest trade-off:

Situation	Better default
Corpus fits comfortably in the window and is queried once or twice	Long context — no infrastructure, no retrieval misses
Same corpus queried thousands of times	RAG — paying to re-read the corpus per query is waste; retrieve the relevant slice
Corpus larger than any window (10M+ tokens, or constantly changing)	RAG by necessity, possibly feeding a long-context model large retrieved chunks
Task needs global understanding (summarize the whole codebase, find all contradictions)	Long context — retrieval fragments destroy global structure
Strict latency or cost budget	RAG with small context — smaller prompts are faster and cheaper
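The "queried thousands of times" row comes down to simple arithmetic. The sketch below compares input cost only, under loud assumptions: no prompt caching, one price for both approaches, and a hypothetical one-off `index_cost` for building the retrieval index. All numbers are placeholders.

```python
def long_context_cost(corpus_tokens: int, queries: int, price_per_mtok: float) -> float:
    """Input cost of resending the whole corpus on every query."""
    return corpus_tokens * queries * price_per_mtok / 1_000_000


def rag_cost(
    slice_tokens: int,
    queries: int,
    price_per_mtok: float,
    index_cost: float = 0.0,  # hypothetical one-off cost of building the index
) -> float:
    """Input cost of sending only a retrieved slice on every query."""
    return index_cost + slice_tokens * queries * price_per_mtok / 1_000_000


if __name__ == "__main__":
    for q in (2, 1_000):
        lc = long_context_cost(800_000, q, price_per_mtok=1.0)
        rag = rag_cost(50_000, q, price_per_mtok=1.0, index_cost=5.0)
        print(q, lc, rag)
```

With these placeholder numbers, two queries cost $1.60 with long context versus $5.10 with RAG once the index is included, while 1,000 queries cost $800 versus $55, which is the pattern the table describes.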

The emerging 2026 pattern is hybrid: retrieval selects the right 50–200K tokens, a long-context model reasons over that generous slice, and prompt caching amortizes anything stable. Long context did not kill RAG — it changed RAG's job from "find the exact paragraph" to "find the right neighborhood", which is a much easier retrieval problem to get right.
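The "right neighborhood" step of the hybrid pattern can be sketched as packing retrieved chunks into a token budget. This is an illustrative greedy selection, not a specific library's API: the `Chunk` type, the relevance scores, and the 100K default budget are all assumptions, and the scores would come from whatever retriever you use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a retriever (assumed, higher is better)
    tokens: int   # token count of the chunk


def select_neighborhood(chunks: list[Chunk], budget_tokens: int = 100_000) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit in the token budget."""
    chosen: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            chosen.append(chunk)
            used += chunk.tokens
        # Chunks that do not fit are skipped; smaller ones may still fit later.
    return chosen


if __name__ == "__main__":
    retrieved = [
        Chunk("design doc", 0.9, 60_000),
        Chunk("old changelog", 0.8, 50_000),
        Chunk("api reference", 0.7, 30_000),
    ]
    for c in select_neighborhood(retrieved):
        print(c.text, c.tokens)
```

Because the budget is generous, the retriever only has to rank the right documents near the top; it does not have to isolate the exact paragraph.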
