What does temperature actually do in an LLM?

It rescales the model's token probabilities before sampling: logits are divided by the temperature, so T below 1 sharpens the distribution (probability concentrates on top candidates — more predictable, more repetitive), T above 1 flattens it (unlikely tokens gain share — more diverse, eventually incoherent), and T=0 conventionally means always taking the single most likely token. It is variance control in a lottery, not a "creativity" setting — it cannot add knowledge or improve accuracy, only trade consistency against variety.

What is the difference between top-p and top-k?

Both cut the low-probability tail before sampling, but top-k keeps a fixed number of candidates (say, the 40 most likely) while top-p (nucleus sampling) keeps the smallest set whose probabilities sum to p (say, 90%). Top-p adapts per token: when the model is confident, the nucleus may be 2 tokens; when many continuations are plausible, it may be 200. That adaptivity is why top-p became the standard. Min-p is a newer alternative (keep tokens at least X% as likely as the leader) popular in open-source stacks for high-temperature generation.

Why do I get different outputs at temperature 0?

Temperature 0 makes the sampler deterministic, but the numbers feeding it are not: hosted inference batches your request with others, GPU floating-point arithmetic is non-associative (so batch composition changes tiny logit values, occasionally flipping near-tied tokens), and providers update infrastructure. Seed parameters (where offered) give best-effort reproducibility, not a guarantee. If you need identical outputs, cache them; if you need consistent behavior, use temperature 0 plus schema-constrained output and treat small variation as expected.

What temperature should I use for my task?

Starting points: 0–0.2 for extraction, classification, and structured output; 0–0.3 for code; 0.2–0.5 for RAG/Q&A; 0.6–0.8 for general chat; 0.9–1.3 for creative writing and brainstorming (paired with top-p 0.95+ and candidate curation). Then tune empirically: name the failure mode you are fixing, change one parameter at a time, and compare multiple runs per setting on a fixed input set, since single stochastic runs are noise. Note that reasoning-class models often ignore or reject sampling parameters entirely — check the docs.

What do frequency and presence penalties do?

Both discourage repetition by lowering the scores of already-used tokens: frequency penalty scales with how many times a token appeared (fighting repetition spirals in long text), while presence penalty applies a flat one-time penalty (nudging toward new topics). Use small values (0.1–0.5 on OpenAI's −2..2 scale) when long generations loop or echo; keep them at zero for code and technical content, where identifiers and terms must legitimately repeat. Over-penalization visibly degrades precision by forcing synonym drift.

Do sampling parameters affect accuracy or hallucinations?

Indirectly and modestly. Lower temperature reduces the chance of sampling a low-probability wrong token, which helps tasks with a single correct answer — but the model's top-ranked token can be confidently wrong, and no temperature fixes that. Hallucination is primarily addressed by grounding (RAG, citations), prompt design ("say so if the answer is not in the document"), and model choice; sampling settings then reduce variance around whatever behavior those establish. Think of sampling as consistency control, with accuracy set upstream.

Temperature, Top-p, Top-k: LLM Sampling Parameters Explained

The Setup: Every Token Is a Weighted Lottery

An LLM does not "write" — at each step it produces a score (logit) for every token in its vocabulary (typically 50,000–250,000 candidates), which softmax converts into a probability distribution. Something must then pick one token, and that something is the sampler. Everything in this article happens after the model has done its thinking and before a token hits your screen.

The two degenerate strategies frame the problem:

Greedy decoding (always take the top token) — maximally "safe", but produces repetitive, stilted text and gets trapped in loops ("the the the…" pathologies are greedy artifacts). Counterintuitively, the most-probable-next-token path is often not the most-probable overall text.
Pure sampling from the full distribution — every token gets its fair chance, including the tens of thousands of collectively-significant nonsense candidates in the long tail. Output derails within sentences.

All practical sampling — temperature, top-k, top-p, min-p — is machinery for living between these poles: reshape the distribution, cut the tail, then sample. The parameters differ in how they reshape and where they cut.
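
To make the two poles concrete, here is a minimal sketch using a made-up toy distribution: one sensible token holding 60% of the mass and 40,000 junk tokens sharing the remaining 40%. The vocabulary, the probabilities, and the names `vocab`, `probs`, and `tail_share` are all assumptions for illustration, not measurements from any model.

```python
import random

# Toy distribution (hypothetical): one good token plus a long tail of junk.
vocab = ["good"] + [f"junk_{i}" for i in range(40_000)]
probs = [0.6] + [0.4 / 40_000] * 40_000

# Greedy decoding: always take the single most likely token.
greedy = vocab[max(range(len(vocab)), key=probs.__getitem__)]

# Pure sampling: draw from the full distribution, tail included.
rng = random.Random(0)
draws = rng.choices(vocab, weights=probs, k=1_000)
tail_share = sum(tok != "good" for tok in draws) / len(draws)

print("greedy pick:", greedy)
print("share of pure-sampling draws that landed in the tail:", tail_share)
```

Each junk token is individually negligible (0.001% here), yet together they win a large fraction of draws, which is the problem truncation exists to solve.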

Temperature: Reshaping the Distribution

Temperature divides the logits before softmax: softmax(logits / T). The effect on the distribution:

Next-token candidates after "The capital of France is" (illustrative figures)
                 T = 0.2      T = 1.0      T = 1.8
" Paris"          99.8%        95%          62%
" the"             0.1%         3%          18%
" located"         0.05%        1.2%        11%
(long tail)        ~0%          0.8%         9%

T < 1 sharpens: probability mass concentrates on the leaders. The model becomes more predictable, more repetitive, more "confident" — including confidently wrong: temperature does not improve knowledge, it only suppresses variety.
T = 1 is neutral: the distribution as the model produced it.
T > 1 flattens: underdogs gain share. Output becomes more diverse and more surprising, and past ~1.5 increasingly incoherent, because natural language has low per-token entropy (only a few continuations are acceptable at most positions) and flattening moves probability onto the rest.
T = 0 conventionally means greedy decoding (argmax; the math would divide by zero).
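
The scaling described above can be sketched in a few lines of plain Python. The logits below are invented for illustration, and `softmax_with_temperature` is a hypothetical helper name, not a function from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (made up for illustration).
logits = [4.0, 2.0, 1.0, 0.0]

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T = {t}:", [round(p, 4) for p in probs])
```

Running it shows the top candidate's share shrinking and the weakest candidate's share growing as T rises, without the ranking ever changing.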

Two persistent misconceptions worth killing: temperature is not a creativity dial in any semantic sense — the model has no "creative mode"; it is variance in a lottery. And temperature 0 does not guarantee identical outputs across runs on hosted APIs: batching effects, floating-point non-associativity across GPU configurations, and infrastructure changes introduce nondeterminism below the sampler. Some APIs offer a seed parameter for best-effort reproducibility, but bit-identical replay is not a contract any major provider makes.
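
The floating-point point is easy to reproduce on a CPU. The sketch below is a deliberately tiny stand-in for what happens inside batched GPU kernels: it sums the same three numbers in two groupings and feeds the result to a greedy pick against a near-tied rival. The values and the `greedy_pick` helper are assumptions chosen to make the effect visible, not real logits.

```python
def greedy_pick(logit_a, logit_b):
    """Return the token with the strictly larger logit (ties go to B)."""
    return "A" if logit_a > logit_b else "B"

# The same three contributions to token A's logit, added in two groupings.
parts = (0.1, 0.2, 0.3)
logit_a_left = (parts[0] + parts[1]) + parts[2]
logit_a_right = parts[0] + (parts[1] + parts[2])

# A rival token whose logit sits right at the tie.
logit_b = 0.6

print(logit_a_left == logit_a_right)       # False
print(greedy_pick(logit_a_left, logit_b))   # A
print(greedy_pick(logit_a_right, logit_b))  # B
```

The difference is in the last bit of the number, which only matters when two candidates are nearly tied, and that is why temperature 0 output is usually stable but not guaranteed identical.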

Top-k, Top-p, Min-p: Three Ways to Cut the Tail

Temperature reshapes; truncation decides which candidates are allowed at all. The three main schemes:

Top-k: keep the k highest-probability tokens, renormalize, sample. Simple, but rigid — k=40 is too many candidates when the answer is obvious ("The capital of France is") and possibly too few when many continuations are equally good (open prose). The right k varies token by token, and top-k cannot adapt.
Top-p (nucleus sampling): keep the smallest set of tokens whose cumulative probability ≥ p. With p=0.9: a confident distribution might need only 2 tokens to reach 90% (tail cut ruthlessly), an uncertain one might include 200 (variety preserved). This adaptivity is why top-p became the industry default over top-k.
Min-p: keep tokens whose probability ≥ min_p × (top token's probability). With min_p=0.05, anything at least 5% as likely as the leader survives. Scales with confidence like top-p but handles the flat-distribution case more gracefully; increasingly available in open-source stacks (llama.cpp, vLLM) and popular for high-temperature creative generation, though mostly absent from big-provider APIs.
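
The three schemes can be compared side by side with a small sketch. The two distributions are invented (one confident, one uncertain), and the function names are hypothetical; real inference stacks implement these filters on tensors of logits rather than dictionaries.

```python
def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

def ranked(probs):
    """Tokens sorted from most to least likely."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def top_k_filter(probs, k):
    """Keep the k most likely tokens."""
    return renormalize(dict(ranked(probs)[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in ranked(probs):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)

def min_p_filter(probs, min_p):
    """Keep tokens at least min_p times as likely as the top token."""
    threshold = min_p * max(probs.values())
    return renormalize({t: pr for t, pr in probs.items() if pr >= threshold})

# Made-up distributions for illustration.
confident = {" Paris": 0.95, " the": 0.03, " located": 0.012, " a": 0.008}
uncertain = {"sun": 0.30, "moon": 0.22, "sea": 0.17,
             "wind": 0.13, "rain": 0.10, "snow": 0.08}

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top-k=3 keeps", len(top_k_filter(dist, 3)),
          "| top-p=0.9 keeps", len(top_p_filter(dist, 0.9)),
          "| min-p=0.05 keeps", len(min_p_filter(dist, 0.05)))
```

On these toy inputs top-k keeps 3 tokens in both cases, while top-p keeps 1 and then 5, and min-p keeps 1 and then 6: the adaptive schemes shrink when the model is sure and widen when it is not.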

Order of operations and interactions: many implementations apply temperature first, then truncation, though the order is implementation-specific and some stacks let you reorder the sampler chain. With temperature applied first, high temperature with tight top-p is not contradictory; it is a useful recipe: the flattening lets mid-tier candidates compete while the nucleus still guillotines the nonsense tail. Two practical warnings: parameters compound, so tune one at a time (the standard advice: adjust temperature or top-p, not both, until you know why you need both); and defaults differ per provider and per model. "Temperature 1.0" on two APIs does not imply the same behavior, because the underlying distributions and default truncation differ.
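
A minimal sketch of the temperature-then-nucleus ordering follows. It assumes temperature is applied before truncation, which is one common ordering rather than a universal one; the logits and the `sample_token` name are hypothetical.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token: temperature first, then nucleus truncation."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy convention for T = 0
    # Step 1: temperature-scaled softmax.
    peak = max(logits.values())
    weights = {t: math.exp((x - peak) / temperature) for t, x in logits.items()}
    total = sum(weights.values())
    by_prob = sorted(((t, w / total) for t, w in weights.items()),
                     key=lambda kv: kv[1], reverse=True)
    # Step 2: keep the smallest prefix whose cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, prob in by_prob:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Step 3: sample from what survived (weights are renormalized implicitly).
    tokens, probs = zip(*nucleus)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Made-up logits: one leader, two mid-tier candidates, one nonsense token.
logits = {" Paris": 5.0, " the": 2.0, " located": 1.0, " zebra": -3.0}
rng = random.Random(0)
samples = [sample_token(logits, temperature=1.5, top_p=0.95, rng=rng)
           for _ in range(500)]
print({tok: samples.count(tok) for tok in logits})
```

With these numbers the mid-tier tokens get sampled occasionally while the nonsense token never appears, because it falls outside the nucleus regardless of the random seed.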

The Supporting Cast: Penalties, Stops, and Length

The remaining knobs solve specific pathologies:

Frequency penalty — reduces a token's logit proportionally to how many times it already appeared. Fights the classic repetition spiral in long generations. Overdone (>1.0 on OpenAI's −2..2 scale), it forces the model away from legitimately repeated terms — variable names in code, the product name in marketing copy — degrading precision.
Presence penalty — flat penalty once a token has appeared at all (versus frequency's per-occurrence scaling). Nudges toward new topics rather than merely un-repeating words.
Repetition penalty (open-source ecosystems) — multiplicative variant; same purpose, different math, values ~1.05–1.15 typical.
Stop sequences — strings that terminate generation immediately. Underrated for structured work: generating until "\n\n" or "</answer>" is cheaper and cleaner than generating extra text and trimming it afterward.
Max tokens — a hard budget cap, not a target: the model does not "aim" for it, and hitting the cap mid-sentence is a truncation failure you should detect (finish_reason) and handle. Length steering belongs in the prompt ("respond in 2–3 sentences"); max_tokens is the safety rail and cost control.
Reasoning-model caveat: current reasoning-class models (which generate internal thinking before answering) typically ignore or restrict sampling parameters — several APIs reject temperature/top_p on these models outright, because sampling noise in the reasoning chain harms it. Check the model's docs before porting settings.
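
The difference between the additive penalties and the multiplicative repetition penalty is easier to see in code. This sketch assumes the additive form that OpenAI documents (subtract count times the frequency penalty, plus the presence penalty once) and the divide-if-positive, multiply-if-negative form common in open-source stacks; exact details vary by implementation, and the function names and logits are hypothetical.

```python
from collections import Counter

def apply_additive_penalties(logits, generated,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count * frequency_penalty, plus presence_penalty once."""
    adjusted = dict(logits)
    for tok, count in Counter(generated).items():
        if tok in adjusted:
            adjusted[tok] -= count * frequency_penalty + presence_penalty
    return adjusted

def apply_repetition_penalty(logits, generated, penalty=1.1):
    """Multiplicative variant: push seen tokens toward lower scores."""
    adjusted = dict(logits)
    for tok in set(generated):
        if tok in adjusted:
            x = adjusted[tok]
            # Dividing a positive logit or multiplying a negative one
            # both make the token less likely.
            adjusted[tok] = x / penalty if x > 0 else x * penalty
    return adjusted

# Made-up logits and generation history for illustration.
logits = {"cat": 2.0, "dog": 1.5, "fish": -1.0}
history = ["cat", "cat", "cat", "dog"]

print(apply_additive_penalties(logits, history,
                               frequency_penalty=0.5, presence_penalty=0.2))
print(apply_repetition_penalty(logits, history, penalty=1.1))
```

Note how "cat", used three times, loses far more under the frequency penalty than "dog", used once, while the presence penalty charges both the same flat amount.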

Recommended Settings by Task (and How to Tune)

Starting points, not gospel — model-specific behavior varies, and your eval set outranks any table:

Task	Temperature	Top-p	Notes
Extraction, classification, structured output	0 – 0.2	default	Consistency is the goal; add schema-constrained output
Code generation	0 – 0.3	default	Low temp; repetition penalties OFF (code repeats legitimately)
Q&A / RAG answers	0.2 – 0.5	0.9	Low enough to stay grounded in the context
General chat / assistants	0.6 – 0.8	0.9 – 0.95	The all-purpose zone most defaults target
Creative writing, brainstorming	0.9 – 1.3	0.95+ (or min-p 0.05)	Sample several candidates; curate
Synthetic data / eval variety	1.0 – 1.5	0.95+	Diversity is the product; validate downstream

The tuning discipline that separates cargo-culting from engineering:

Decide what failure looks like first. Hallucinated fields? Repetitive prose? Identical outputs where variety was wanted? The failure mode picks the knob; without one named, leave defaults alone.
Run each setting against a fixed input set multiple times — sampling is stochastic, so single-run comparisons are noise. Ten runs per setting on 20 inputs beats one run on 200.
Log the full sampling config with every production call. When output quality shifts, "what were the parameters?" must be answerable from the trace — especially since model upgrades reset the empirical ground under your tuned values.
Remember the ceiling: sampling shapes how the model's distribution is read, not what it knows. No temperature makes a model more accurate than its knowledge; prompt quality and model choice move accuracy, sampling moves consistency and variety.
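
The multi-run comparison can be sketched as a small harness. Everything here is an assumption for illustration: `fake_generate` is a stand-in for a real model call, the consistency metric (share of runs matching the most common output) is one possible choice among many, and the inputs and settings are invented.

```python
import random
from collections import Counter

def consistency(outputs):
    """Share of runs matching the most common output (1.0 = all identical)."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)

def evaluate(generate, inputs, settings, runs=10):
    """Average per-input consistency for each named sampling config."""
    report = {}
    for name, params in settings.items():
        scores = []
        for text in inputs:
            outputs = [generate(text, **params) for _ in range(runs)]
            scores.append(consistency(outputs))
        report[name] = sum(scores) / len(scores)
    return report

_rng = random.Random(0)

def fake_generate(prompt, temperature):
    """Hypothetical stand-in for a model call: varies more as T rises."""
    variants = [prompt.upper(), prompt.lower(), prompt.title()]
    if _rng.random() < min(temperature, 1.0):
        return _rng.choice(variants)
    return variants[0]

# Fixed input set and the settings under comparison (both made up).
inputs = ["invoice total", "due date", "vendor name"]
settings = {"t0": {"temperature": 0.0}, "t1": {"temperature": 1.0}}

report = evaluate(fake_generate, inputs, settings)
print(report)
```

In practice you would swap `fake_generate` for your own client call and replace the consistency score with whatever measures the failure mode you named, such as schema validity or a repetition count.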

