What a Context Window Actually Is
A context window is the maximum number of tokens a model can attend to in one request — everything counts against it: the system prompt, conversation history, documents you paste, tool definitions and tool results, and (for output) the tokens the model generates. A token is roughly ¾ of an English word or about 4 characters; code, non-Latin scripts, and numbers tokenize less efficiently (Hindi, Japanese, or Arabic text can consume 2–4× more tokens per character than English).
Rules of thumb for sizing:
- 1,000 tokens ≈ 750 English words ≈ 1.5 pages
- 128K tokens ≈ 96,000 words ≈ a 300-page novel (at roughly 320 words per printed page)
- 1M tokens ≈ 7–8 novels, or a mid-size codebase (~30–50K lines with comments)
- 10M tokens ≈ a small library — or one enterprise monorepo
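The rules of thumb above can be turned into a quick estimator. This is a minimal sketch for English prose only, assuming the 4-characters and ¾-word heuristics; `estimate_tokens` is a hypothetical helper, not a real tokenizer, and actual counts depend on the model's tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # heuristic for English prose, from the text above
WORDS_PER_TOKEN = 0.75   # heuristic for English prose, from the text above


def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text (hypothetical helper, not a tokenizer)."""
    by_chars = len(text) / CHARS_PER_TOKEN
    by_words = len(text.split()) / WORDS_PER_TOKEN
    # Take the larger of the two heuristics so budgets err on the safe side.
    return math.ceil(max(by_chars, by_words))


def words_for_tokens(tokens: int) -> int:
    """Approximate English word count that fits in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN)
```

For code, numbers, or non-Latin scripts the estimate will be too low; count with the provider's own tokenizer before relying on the number.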
Two details regularly bite developers. First, the window is shared between input and output: a model with a 200K window and 8K max output cannot read 199K tokens and write 8K, because 199K + 8K exceeds 200K. Second, conversation history re-enters the window on every turn: a chat that has consumed 90% of the window has only 10% left for the next question and answer. That is why long chats degrade and why context management (summarization, pruning) exists.
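The shared-window arithmetic can be checked before a request is sent. The sketch below assumes the accounting described above, where input and reserved output draw on the same window; the function names are illustrative, and some providers enforce separate limits, so treat it as a pre-flight check rather than a guarantee.

```python
def fits_window(input_tokens: int, max_output_tokens: int, window: int) -> bool:
    """True if the input plus the reserved output fits in a shared window."""
    return input_tokens + max_output_tokens <= window


def max_input_tokens(window: int, max_output_tokens: int) -> int:
    """Largest input that still leaves room for the full output budget."""
    return window - max_output_tokens


def remaining_tokens(window: int, used_tokens: int) -> int:
    """Tokens left for the next question and answer in a running chat."""
    return max(window - used_tokens, 0)
```

With a 200K window and 8K reserved for output, `max_input_tokens(200_000, 8_000)` gives 192,000, which is why a 199K-token input is rejected.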
The 2026 Landscape: Who Offers What
The frontier has moved fast. In 2023, 32K tokens was premium; by 2026, more than a dozen models ship 1M+ windows. The approximate landscape as of mid-2026:
| Tier | Window | Representative models |
|---|---|---|
| Mega | 10M | Llama 4 Scout (open weights, self-hosted) |
| Frontier | 1M+ | Gemini 3.1 Pro, recent Claude models, GPT-5.x (1M via API tier), DeepSeek V4, Qwen 3.5 Plus, Llama 4 Maverick |
| Extended | 400–512K | GPT-5.5 default window, several mid-tier hosted models |
| Standard | 128–256K | Most open-weights deployments, smaller hosted models |
Caveats that the headline numbers hide:
- Tiered pricing: several providers charge a premium (often ~2×) for the portion of a request beyond a threshold (e.g. beyond 200K or 272K tokens).
- Rate limits: a 1M-token request may be allowed by the model but throttled by your API tier's tokens-per-minute limits.
- Latency: time-to-first-token grows with input size; a near-full 1M-token request can take tens of seconds before the first output token arrives.
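The tiered-pricing caveat is easy to underestimate, so here is a small worked calculation. It assumes the premium applies only to the portion of the input beyond the threshold, as described above; the price, threshold, and multiplier in the usage note are made-up illustration values, and some providers may instead reprice the whole request, so check the actual price sheet.

```python
def tiered_input_cost_usd(tokens: int, price_per_mtok: float,
                          threshold: int, premium_multiplier: float = 2.0) -> float:
    """Input cost when tokens beyond `threshold` are billed at a premium rate.

    All parameters are hypothetical; real thresholds and multipliers vary by provider.
    """
    base_tokens = min(tokens, threshold)
    premium_tokens = max(tokens - threshold, 0)
    billed = base_tokens * price_per_mtok + premium_tokens * price_per_mtok * premium_multiplier
    return billed / 1_000_000
```

At a hypothetical $2 per million tokens with a 200K threshold and a 2× premium, a 1M-token request costs $3.60 rather than the $2.00 a flat rate would suggest.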
Advertised vs Effective Context: The Research
The most important finding in long-context research is that every model's usable context is smaller than its advertised context — the only question is by how much.
- Lost in the middle. The landmark result (Liu et al., 2023, "Lost in the Middle") showed models retrieve information best from the beginning and end of the context and worst from the middle — a U-shaped accuracy curve. Newer models have flattened the U considerably, but the effect has not vanished.
- Needle-in-a-haystack is the easy test. Finding one planted sentence in a sea of text is nearly solved — frontier models score near-perfectly. But NIAH only tests retrieval of a verbatim fact.
- Harder benchmarks tell the real story. Benchmarks like RULER and multi-hop reasoning tests, which require combining several pieces of information scattered across the window, show effective context dropping to a fraction of the advertised window — a model advertising 1M may reason reliably over only the first few hundred thousand tokens of genuinely interdependent material.
- Context rot. Practitioners consistently observe that irrelevant material does not just waste space — it actively degrades answers. More context is not neutral; it is a distractor unless it is relevant.
The practical translation: treat the advertised window as a hard ceiling, and the first ~50–70% of it as the zone where you should place anything the model must reason about carefully. Put critical instructions at the start, critical data near the question, and do not pad the window just because it is there.
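One way to apply this placement advice is to fix the section order in code. The sketch below is illustrative only: `assemble_prompt` and `careful_zone_tokens` are hypothetical helpers, and the default fraction of 0.5 is simply the conservative end of the 50–70% range mentioned above.

```python
def careful_zone_tokens(window: int, fraction: float = 0.5) -> int:
    """Token budget for material the model must reason about carefully."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(window * fraction)


def assemble_prompt(instructions: str, background: list[str],
                    critical_data: str, question: str) -> str:
    """Order sections: instructions first, critical data right before the question."""
    sections = [instructions, *background, critical_data, question]
    # Skip empty sections so the prompt is not padded with blank blocks.
    return "\n\n".join(s for s in sections if s)
```

Background documents land in the middle, where a retrieval miss hurts least, while the instructions and the critical data sit at the two ends of the prompt.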
The Economics: What Filling a Window Costs
Long context is priced per input token, which makes big windows deceptively expensive. Filling a 1M-token window once costs anywhere from roughly $0.15 on budget models to $10+ on premium frontier models, a spread of more than 60× for the same token count. Now multiply by a chat interface that resends the full history every turn, or an agent that loops 20 times, and the window becomes your dominant cost driver.
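To see why resending history dominates cost, it helps to add up the input tokens billed across a whole conversation. This is a simplified sketch: it assumes every turn adds the same number of tokens to the history and that the full history is billed as input each time, with no caching. The function names and per-turn figures are illustrative.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens billed when the full history is resent every turn."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # the new exchange joins the history
        total += history            # the whole history is billed as input
    return total


def flat_input_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Input cost at a flat per-million-token price (hypothetical price)."""
    return tokens * price_per_mtok / 1_000_000
```

A 20-turn loop that adds 1,000 tokens per turn bills 210,000 input tokens in total, not 20,000: the total grows with the square of the number of turns.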
The three levers that matter most in practice:
- Prompt caching. All major providers now discount repeated prefixes (typically 50–90% off cached input). Structure prompts so the stable part (system prompt, tool definitions, reference documents) is a byte-identical prefix, and the variable part (the user question) comes last. For agents and chatbots dominated by a stable prefix, discounts near the top of that range cut input costs roughly 5–10×; a 50% discount caps the saving at 2×.
- Context management. Summarize or prune old turns instead of resending everything. An agent that compacts its history at 50% window usage often outperforms one that runs to the edge — both on cost and on answer quality (see context rot above).
- Model routing. Long-document tasks that are mostly extraction (find, quote, tabulate) run fine on cheap long-context models; reserve premium models for the reasoning step over the extracted material. A two-stage pipeline frequently costs 10× less than one premium mega-prompt.
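The first two levers can be sketched in a few lines. The cost model below is a simplification under stated assumptions: the first call pays full price, later calls pay a discounted rate on the stable prefix only, and cache-write surcharges or cache expiry are ignored. All names and numbers are hypothetical.

```python
def cached_input_cost_usd(prefix_tokens: int, variable_tokens: int, calls: int,
                          price_per_mtok: float, cache_discount: float) -> float:
    """Input cost over `calls` requests that share a byte-identical prefix."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    first_call = prefix_tokens + variable_tokens
    # Later calls pay the discounted rate on the prefix, full rate on the rest.
    later_call = prefix_tokens * (1 - cache_discount) + variable_tokens
    billed = first_call + (calls - 1) * later_call
    return billed * price_per_mtok / 1_000_000


def should_compact(used_tokens: int, window: int, threshold: float = 0.5) -> bool:
    """True once history reaches the compaction threshold (50% by default)."""
    return used_tokens >= window * threshold
```

With a 100K-token prefix, 1K variable tokens, and 20 calls at a hypothetical $1 per million tokens, the uncached cost is $2.02; a 90% cache discount brings it to about $0.31 (roughly 6.5× cheaper), while a 50% discount gives $1.07 (just under 2×).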
Long Context vs RAG: A Decision Framework
"Just put everything in context" and "always build RAG" are both wrong as defaults. The honest trade-off:
| Situation | Better default |
|---|---|
| Corpus fits comfortably in the window and is queried once or twice | Long context — no infrastructure, no retrieval misses |
| Same corpus queried thousands of times | RAG: paying to re-read the whole corpus on every query is wasteful; retrieve only the relevant slice |
| Corpus larger than any window (10M+ tokens, or constantly changing) | RAG by necessity, possibly feeding a long-context model large retrieved chunks |
| Task needs global understanding (summarize the whole codebase, find all contradictions) | Long context — retrieval fragments destroy global structure |
| Strict latency or cost budget | RAG with small context — smaller prompts are faster and cheaper |
The emerging 2026 pattern is hybrid: retrieval selects the right 50–200K tokens, a long-context model reasons over that generous slice, and prompt caching amortizes anything stable. Long context did not kill RAG — it changed RAG's job from "find the exact paragraph" to "find the right neighborhood", which is a much easier retrieval problem to get right.
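The decision table can be read as a small set of ordered rules. The sketch below is one possible encoding, not a definitive policy: the `comfortable_fraction` and `many_queries` thresholds are hypothetical, the rule order is a judgment call, and a corpus that fits the window only tightly is routed to the hybrid pattern.

```python
def choose_strategy(corpus_tokens: int, window_tokens: int, expected_queries: int,
                    needs_global_view: bool = False, strict_budget: bool = False,
                    comfortable_fraction: float = 0.5, many_queries: int = 1000) -> str:
    """Return 'long_context', 'rag', or 'hybrid' following the table's defaults."""
    if corpus_tokens > window_tokens:
        return "rag"            # corpus larger than any window: RAG by necessity
    if needs_global_view:
        return "long_context"   # retrieval fragments would destroy global structure
    if strict_budget or expected_queries >= many_queries:
        return "rag"            # smaller prompts are faster and cheaper per query
    if corpus_tokens <= window_tokens * comfortable_fraction:
        return "long_context"   # fits comfortably and is queried rarely
    return "hybrid"             # tight fit: retrieve a generous slice, then reason


# Example: a 300K-token corpus, 1M-token window, two queries.
print(choose_strategy(300_000, 1_000_000, 2))
```

Corpora that change constantly are not modeled here; per the table, they would also point toward RAG regardless of size.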