The Setup: Every Token Is a Weighted Lottery
An LLM does not "write" — at each step it produces a score (logit) for every token in its vocabulary (typically 50,000–250,000 candidates), which softmax converts into a probability distribution. Something must then pick one token, and that something is the sampler. Everything in this article happens after the model has done its thinking and before a token hits your screen.
The two degenerate strategies frame the problem:
- Greedy decoding (always take the top token) — maximally "safe", but produces repetitive, stilted text and gets trapped in loops ("the the the…" pathologies are greedy artifacts). Counterintuitively, the most-probable-next-token path is often not the most-probable overall text.
- Pure sampling from the full distribution — every token gets its fair chance, including the tens of thousands of collectively-significant nonsense candidates in the long tail. Output derails within sentences.
All practical sampling — temperature, top-k, top-p, min-p — is machinery for living between these poles: reshape the distribution, cut the tail, then sample. The parameters differ in how they reshape and where they cut.
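To make the two poles concrete, here is a minimal sketch using only the Python standard library. The four-token vocabulary and the logit values are made up for illustration, and `greedy_pick` and `sample_pick` are hypothetical helper names, not any library's API.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Greedy decoding: always return the index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample_pick(probs, rng):
    """Pure sampling: draw one index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy vocabulary and made-up logits (illustrative values only).
vocab = [" Paris", " the", " located", " banana"]
logits = [6.0, 2.5, 1.6, -1.0]

probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the sketch is repeatable

print("greedy:", vocab[greedy_pick(probs)])
print("sampled:", [vocab[sample_pick(probs, rng)] for _ in range(5)])
```

Greedy returns the same token on every call, while the sampled list can include any token with nonzero probability, the nonsense candidate included.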
Temperature: Reshaping the Distribution
Temperature divides the logits before softmax: softmax(logits / T). The effect on the distribution:
Next-token candidates after "The capital of France is":

| Token | T = 0.2 | T = 1.0 | T = 1.8 |
|---|---|---|---|
| " Paris" | 99.8% | 95% | 62% |
| " the" | 0.1% | 3% | 18% |
| " located" | 0.05% | 1.2% | 11% |
| (long tail) | ~0% | 0.8% | 9% |
- T < 1 sharpens: probability mass concentrates on the leaders. The model becomes more predictable, more repetitive, more "confident" — including confidently wrong: temperature does not improve knowledge, it only suppresses variety.
- T = 1 is neutral: the distribution as the model produced it.
- T > 1 flattens: underdogs gain share. More diverse, more surprising — and past ~1.5, increasingly incoherent, because natural language's actual entropy per token is low and flattening fights it.
- T = 0 conventionally means greedy decoding (argmax; the math would divide by zero).
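The scaling rule can be sketched in a few lines of standard-library Python. The logits below are made-up values, and treating T = 0 as a one-hot argmax is the convention described above rather than something the formula itself produces.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T); T = 0 is treated as greedy (one-hot argmax)."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        top = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # made-up scores for four tokens

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

The ranking of tokens never changes with temperature; only the share of probability held by the leader does.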
Two persistent misconceptions worth killing: temperature is not a creativity dial in any semantic sense — the model has no "creative mode"; it is variance in a lottery. And temperature 0 does not guarantee identical outputs across runs on hosted APIs: batching effects, floating-point non-associativity across GPU configurations, and infrastructure changes introduce nondeterminism below the sampler. Some APIs offer a seed parameter for best-effort reproducibility, but bit-identical replay is not a contract any major provider makes.
Top-k, Top-p, Min-p: Three Ways to Cut the Tail
Temperature reshapes; truncation decides which candidates are allowed at all. The three main schemes:
- Top-k: keep the k highest-probability tokens, renormalize, sample. Simple, but rigid — k=40 is too many candidates when the answer is obvious ("The capital of France is") and possibly too few when many continuations are equally good (open prose). The right k varies token by token, and top-k cannot adapt.
- Top-p (nucleus sampling): keep the smallest set of tokens whose cumulative probability reaches at least p. With p=0.9, a confident distribution might need only 2 tokens to reach 90% (tail cut ruthlessly), while an uncertain one might include 200 (variety preserved). This adaptivity is why top-p became the industry default over top-k.
- Min-p: keep tokens whose probability ≥ min_p × (top token's probability). With min_p=0.05, anything at least 5% as likely as the leader survives. Scales with confidence like top-p but handles the flat-distribution case more gracefully; increasingly available in open-source stacks (llama.cpp, vLLM) and popular for high-temperature creative generation, though mostly absent from big-provider APIs.
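The three truncation rules can be compared side by side. This is a minimal sketch over plain probability lists; the function names and the two example distributions are illustrative assumptions, and real implementations work on logit tensors rather than Python lists.

```python
def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep the k most probable tokens; zero out the rest."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

def min_p_filter(probs, min_p):
    """Keep tokens at least min_p times as likely as the top token."""
    threshold = min_p * max(probs)
    return renormalize([q if q >= threshold else 0.0 for q in probs])

def survivors(probs):
    """Count tokens that still have nonzero probability."""
    return sum(1 for q in probs if q > 0)

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # one obvious answer
flat = [0.25, 0.22, 0.20, 0.18, 0.15]       # many good continuations

print(survivors(top_k_filter(confident, 2)), survivors(top_k_filter(flat, 2)))        # 2 2
print(survivors(top_p_filter(confident, 0.9)), survivors(top_p_filter(flat, 0.9)))    # 1 5
print(survivors(min_p_filter(confident, 0.05)), survivors(min_p_filter(flat, 0.05)))  # 1 5
```

Top-k keeps two candidates in both cases, while top-p and min-p shrink to one token when the model is confident and keep all five when it is not.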
Order of operations and interactions: the order is implementation-specific. Hugging Face transformers, for example, applies temperature first and then truncates, while llama.cpp's default sampler chain truncates first and applies temperature last. Where temperature comes first, high temperature with tight top-p is not contradictory; it is a useful recipe: the flattening lets mid-tier candidates compete while the nucleus still guillotines the nonsense tail. Two practical warnings: parameters compound, so tune one at a time (the standard advice: adjust temperature or top-p, not both, until you know why you need both); and defaults differ per provider and per model, so "temperature 1.0" on two APIs does not imply the same behavior, because the underlying distributions and default truncation differ.
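The interaction between temperature and top-p can be sketched as a two-stage pipeline. This sketch assumes the temperature-first order; stacks that truncate before scaling will behave differently. The function name `temperature_then_top_p` and all logit values are made up for illustration.

```python
import math
import random

def temperature_then_top_p(logits, temperature, top_p):
    """Scale by temperature (assumed > 0), then keep the nucleus.

    Returns {token_index: renormalized probability} for surviving tokens.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# Made-up logits: one leader, two plausible mid-tier tokens, 50 junk tokens.
logits = [5.0, 4.0, 3.5] + [-3.0] * 50

cool = temperature_then_top_p(logits, temperature=0.7, top_p=0.85)
hot = temperature_then_top_p(logits, temperature=1.5, top_p=0.85)
print("T=0.7 keeps", sorted(cool), "leader share", round(cool[0], 3))
print("T=1.5 keeps", sorted(hot), "leader share", round(hot[0], 3))

# Final step: sample one surviving token from the renormalized nucleus.
rng = random.Random(0)
choice = rng.choices(list(hot), weights=list(hot.values()), k=1)[0]
print("sampled token index:", choice)
```

With these numbers the cooler run keeps two tokens and the hotter run keeps three, and none of the 50 junk tokens survive either way. That outcome depends on the tail being light: with a heavier tail or a higher temperature, the same top-p value starts admitting tail tokens.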
The Supporting Cast: Penalties, Stops, and Length
The remaining knobs solve specific pathologies:
- Frequency penalty — reduces a token's logit proportionally to how many times it already appeared. Fights the classic repetition spiral in long generations. Overdone (>1.0 on OpenAI's −2..2 scale), it forces the model away from legitimately repeated terms — variable names in code, the product name in marketing copy — degrading precision.
- Presence penalty — flat penalty once a token has appeared at all (versus frequency's per-occurrence scaling). Nudges toward new topics rather than merely un-repeating words.
- Repetition penalty (open-source ecosystems) — multiplicative variant; same purpose, different math, values ~1.05–1.15 typical.
- Stop sequences — strings that terminate generation immediately. Underrated for structured work: generating until "\n\n" or "</answer>" is cheaper and cleaner than generating extra text and trimming it afterward.
- Max tokens — a hard budget cap, not a target: the model does not "aim" for it, and hitting the cap mid-sentence is a truncation failure you should detect (finish_reason) and handle. Length steering belongs in the prompt ("respond in 2–3 sentences"); max_tokens is the safety rail and cost control.
- Reasoning-model caveat: current reasoning-class models (which generate internal thinking before answering) typically ignore or restrict sampling parameters. Several APIs reject non-default temperature/top_p values on these models outright, presumably because the models are tuned to run at a fixed sampling configuration. Check the model's docs before porting settings.
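The difference between the three penalties is easiest to see on raw logits. This is a sketch, not any provider's implementation: the additive form follows the shape of the formula OpenAI has documented for frequency and presence penalties, the multiplicative form follows the divide-if-positive, multiply-if-negative rule used in open-source stacks, and the function names and values are made up.

```python
from collections import Counter

def apply_additive_penalties(logits, generated_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract per-occurrence (frequency) and one-time (presence) penalties."""
    out = list(logits)
    for token_id, count in Counter(generated_ids).items():
        out[token_id] -= count * frequency_penalty  # grows with every repeat
        out[token_id] -= presence_penalty           # flat, applied once
    return out

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Multiplicative variant: push any already-seen token's logit downward."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink a positive score
        else:
            out[token_id] *= penalty  # make a negative score more negative
    return out

logits = [2.0, 1.0, -1.0, 0.5]  # made-up scores for token ids 0..3
history = [0, 0, 2]             # token 0 appeared twice, token 2 once

print(apply_additive_penalties(logits, history,
                               frequency_penalty=0.5, presence_penalty=0.2))
print(apply_repetition_penalty(logits, history, penalty=1.1))
```

Tokens 1 and 3 never appeared, so both functions leave them untouched; token 0 is hit harder than token 2 only by the frequency term.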
Recommended Settings by Task (and How to Tune)
Starting points, not gospel — model-specific behavior varies, and your eval set outranks any table:
| Task | Temperature | Top-p | Notes |
|---|---|---|---|
| Extraction, classification, structured output | 0 – 0.2 | default | Consistency is the goal; add schema-constrained output |
| Code generation | 0 – 0.3 | default | Low temp; repetition penalties OFF (code repeats legitimately) |
| Q&A / RAG answers | 0.2 – 0.5 | 0.9 | Low enough to stay grounded in the context |
| General chat / assistants | 0.6 – 0.8 | 0.9 – 0.95 | The all-purpose zone most defaults target |
| Creative writing, brainstorming | 0.9 – 1.3 | 0.95+ (or min-p 0.05) | Sample several candidates; curate |
| Synthetic data / eval variety | 1.0 – 1.5 | 0.95+ | Diversity is the product; validate downstream |
The tuning discipline that separates cargo-culting from engineering:
- Decide what failure looks like first. Hallucinated fields? Repetitive prose? Identical outputs where variety was wanted? The failure mode picks the knob; without one named, leave defaults alone.
- Run each setting against a fixed input set multiple times: sampling is stochastic, so single-run comparisons are noise. As a rule of thumb, ten runs per setting on 20 inputs tells you more about a setting's run-to-run variance than one run on 200.
- Log the full sampling config with every production call. When output quality shifts, "what were the parameters?" must be answerable from the trace — especially since model upgrades reset the empirical ground under your tuned values.
- Remember the ceiling: sampling shapes how the model's distribution is read, not what it knows. No temperature makes a model more accurate than its knowledge; prompt quality and model choice move accuracy, sampling moves consistency and variety.
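The repeated-run and logging habits above can be sketched as a small harness. Everything here is hypothetical scaffolding: `fake_generate` is a stand-in for a real model call, and the `agreement` metric (the share of runs that produced the most common output) is one simple consistency measure among many.

```python
import json
import math
import random
from collections import Counter

def fake_generate(prompt, temperature, rng):
    """Hypothetical stand-in for a model call; swap in a real client."""
    options = ["A", "B", "C", "D"]
    # Crude imitation: higher temperature spreads choices more evenly.
    weights = [math.exp(-i / max(temperature, 1e-6)) for i in range(len(options))]
    return rng.choices(options, weights=weights, k=1)[0]

def run_trials(generate, inputs, config, runs, seed=0):
    """Run every input `runs` times under one sampling config."""
    rng = random.Random(seed)
    records = []
    for prompt in inputs:
        outputs = [generate(prompt, config["temperature"], rng) for _ in range(runs)]
        most_common_count = Counter(outputs).most_common(1)[0][1]
        records.append({
            "prompt": prompt,
            "config": config,  # log the full sampling config with every record
            "distinct_outputs": len(set(outputs)),
            "agreement": most_common_count / runs,
        })
    return records

inputs = ["question 1", "question 2", "question 3"]  # fixed input set
for config in ({"temperature": 0.2, "top_p": 1.0},
               {"temperature": 1.2, "top_p": 1.0}):
    for record in run_trials(fake_generate, inputs, config, runs=10):
        print(json.dumps(record))
```

Because each record carries its own config, a later shift in output quality can be traced back to the exact settings that produced it.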