AI / LLM Glossary
46 terms across architecture, training, inference, evaluation, safety, and deployment.
The neural network architecture underlying virtually all modern LLMs. Introduced by Vaswani et al. (2017) in "Attention Is All You Need". Uses self-attention mechanisms to relate every token to every other token in a sequence — enabling parallel training and long-range dependency capture.
A mechanism that allows each token to "attend to" (weight the importance of) all other tokens in the sequence when computing its representation. Scaled dot-product attention: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V where Q=queries, K=keys, V=values.
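The scaled dot-product formula above can be sketched in a few lines of plain Python. This is an illustrative toy, not how production frameworks implement attention: it uses nested lists instead of tensors, handles a single head, and applies no masking. The function names and the 2-dimensional example vectors are assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v; the result is n_q x d_v.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy input: two tokens with made-up 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Because the first query matches the first key more closely, the first output row lands nearer to the first value vector than to the second.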
Runs multiple attention operations ("heads") in parallel, each learning different aspects of token relationships (e.g. one head may capture syntactic structure, another semantic similarity). Outputs are concatenated and projected back to the model dimension.
Attention where queries, keys, and values all come from the same sequence: each token computes attention over the tokens of that same sequence (in decoder-only models, a causal mask restricts this to the token itself and earlier positions). Enables the model to build contextual representations.
The maximum number of tokens a model can process at once (prompt and response combined). Input beyond this limit must be truncated, summarised, or rejected, depending on the application or API. GPT-4o: 128K tokens; Claude 4: 200K tokens; Gemini 2.5: 1M tokens.
A dense vector representation of a token, word, sentence, or document in a high-dimensional space (e.g. 768 or 4096 dimensions). Semantically similar items have vectors that are geometrically close (high cosine similarity).
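Cosine similarity, the closeness measure mentioned above, can be computed as follows. The three vectors are invented 3-dimensional stand-ins for real embeddings, which would come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)    # close to 1
unrelated = cosine_similarity(cat, car)     # close to 0
```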
Since transformers have no inherent notion of order, sequence position information must be injected explicitly. The original Transformer adds sinusoidal positional encodings to the token embeddings; most modern models instead use Rotary Position Embeddings (RoPE), which rotate the query and key vectors by a position-dependent angle and extend to long contexts more effectively.
An architecture where a router sends each token to only a subset of specialised "expert" sub-networks (FFN layers). Most parameters are not activated for any given input, allowing very large parameter counts without proportional compute costs. Used by Mixtral and DeepSeek V3, and reportedly by GPT-4.
A memory-efficient attention variant where each group of query heads shares a single key head and value head, instead of every query head having its own. Reduces the KV-cache size significantly during inference. Used in Llama 3, Mistral, and most recent LLMs.
During autoregressive generation, the key and value tensors for previously processed tokens are cached to avoid recomputation. The KV cache grows linearly with sequence length and batch size and, at long contexts, is often the largest memory cost during inference after the model weights.
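A back-of-the-envelope estimate shows why the KV cache matters and how sharing key/value heads shrinks it. The model shape below (32 layers, 32 query heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 1024 ** 3

# Standard multi-head attention: every one of the 32 query heads has its own K/V head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)

# Grouped-query attention: the 32 query heads share 8 K/V heads (groups of 4).
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

mha_gib = mha / GIB   # 4.0
gqa_gib = gqa / GIB   # 1.0
```

Under these assumptions a single 8,192-token sequence needs about 4 GiB of cache with full multi-head attention and about 1 GiB with 8 shared key/value heads; both figures scale linearly with batch size.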
The initial training phase where a model learns general language representations from massive amounts of unlabelled text (next-token prediction or masked token prediction). Requires enormous compute — GPT-4 reportedly used 25,000 A100s for months.
Training a pre-trained model on a curated dataset of (instruction, response) pairs, where the response is the "correct" output. Makes the model follow instructions. The first step after pre-training for instruction-following models.
A training method where a reward model (trained on human preference rankings) provides a signal to the LLM via PPO (Proximal Policy Optimisation). Used by OpenAI to align GPT-4 and ChatGPT with human preferences.
An alternative to RLHF that directly trains on human preference pairs (chosen vs. rejected response) without a separate reward model. More stable and simpler than PPO-based RLHF. Widely adopted in open-source models.
Low-Rank Adaptation: a parameter-efficient fine-tuning technique that freezes the pre-trained weights and, for each adapted weight matrix (typically the attention projections), trains a pair of small low-rank matrices whose product is added to the frozen weight. See the Optimization Guide for a full explanation.
Quantized LoRA: combines 4-bit NF4 quantization of the frozen base model with LoRA adapters kept in 16-bit precision (bfloat16 in the original paper). Enables fine-tuning a 65B-parameter model on a single 48 GB GPU.
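The saving from training two small matrices instead of a full weight matrix is easy to quantify. The sketch below counts trainable parameters for one weight matrix; the 4096 x 4096 shape and rank 8 are illustrative assumptions, and `lora_param_counts` is a name invented for this example.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates W (d_out x d_in). LoRA keeps W frozen and trains
    B (d_out x rank) and A (rank x d_in); the adapted layer uses W + scale * (B @ A).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Hypothetical 4096 x 4096 projection matrix adapted with rank 8.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
fraction = lora / full   # 1/256, i.e. about 0.39% of the original parameters
```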
Parameter-Efficient Fine-Tuning — umbrella term for methods (LoRA, QLoRA, prefix tuning, prompt tuning, adapters) that fine-tune a small number of parameters rather than the full model. The Hugging Face PEFT library implements most of these.
When a model is fine-tuned on a specific task, it tends to "forget" capabilities learned during pre-training. Full fine-tuning is most susceptible. LoRA mitigates this by keeping original weights frozen.
LLMs generate text one token at a time, feeding each generated token back as input to predict the next. This means generating a 100-token response requires 100 sequential forward passes.
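The basic unit of text that a model processes. Tokens are typically subword pieces, not whole words or single characters; for example, "tokenization" might be split into ["token", "ization"], though the exact split depends on the tokenizer. One token ≈ 4 characters in English; code and non-Latin scripts usually yield fewer characters per token, so the same amount of text costs more tokens.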
The input text sent to a language model. Includes the system prompt, conversation history, and user message. Token count directly determines cost and speed.
Instructions provided at the start of a conversation to define the model's persona, behaviour, capabilities, and constraints. Typically hidden from the end user but processed before every query.
A sampling parameter controlling output randomness by scaling the logits before the softmax (the accepted range depends on the API, commonly 0–2 or 0–1). Low values produce near-deterministic outputs; high values produce diverse, creative outputs. See Sampling Parameters reference for full details.
A sampling strategy (also called nucleus sampling) that considers only the smallest set of most-probable tokens whose cumulative probability reaches p. Adapts the candidate pool based on model confidence: smaller pool when the model is confident, larger when uncertain.
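Temperature is usually applied by dividing the logits by the temperature before the softmax. The sketch below shows that effect on three made-up logits; it treats temperature 0 as invalid, since in practice that case is typically handled separately as greedy decoding.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding for 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative values, not from a real model
sharp = softmax_with_temperature(logits, 0.5)  # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # distribution moves toward uniform
```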
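The way the candidate pool adapts to model confidence can be seen in a small filter function. This is a minimal sketch on hand-written probability lists; `top_p_filter` is a name chosen for the example, and a real sampler would then draw a token from the returned distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative probability reaches p.

    Returns {token_index: renormalised_probability} for the kept tokens.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

# Confident distribution: one token already covers the threshold.
confident = top_p_filter([0.90, 0.05, 0.03, 0.02], p=0.85)

# Uncertain distribution: all four tokens are needed to reach the threshold.
uncertain = top_p_filter([0.30, 0.30, 0.20, 0.20], p=0.85)
```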
Always picks the token with the highest probability at each step. Equivalent to temperature=0. Fast but often produces repetitive or locally optimal (but globally suboptimal) text.
Maintains k candidate sequences ("beams") simultaneously and returns the highest-scoring one found. It is a heuristic search, so it is not guaranteed to find the globally most probable sequence. Often produces higher-probability text than greedy decoding, but is deterministic and computationally expensive. Less common in modern LLMs (sampling preferred).
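A toy example makes the difference between greedy decoding and beam search concrete. The "model" here is a hypothetical hand-written lookup table of next-token probabilities, constructed so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"X": 0.5, "Y": 0.5},
    ("B",): {"X": 0.9, "Y": 0.1},
}

def greedy_decode(steps):
    """Pick the single most probable next token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, prob = max(TOY_MODEL[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(prob)
    return seq, logp

def beam_search(steps, k):
    """Keep the k best partial sequences by total log-probability at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), logp + math.log(prob))
            for seq, logp in beams
            for token, prob in TOY_MODEL[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]

greedy_seq, greedy_logp = greedy_decode(2)   # commits to "A" first
beam_seq, beam_logp = beam_search(2, k=2)    # keeps "B" alive and finds ("B", "X")
```

Greedy decoding commits to "A" (0.6) and ends with a sequence of probability 0.30, while a beam of width 2 also explores "B" and finds a sequence of probability 0.36. With k=1 the beam search reduces to greedy decoding.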
A latency-reduction technique where a small draft model generates several candidate tokens quickly, and the large model verifies multiple tokens in parallel. Can achieve 2–3× speedup for the large model with no quality loss.
A pattern where relevant documents are retrieved from an external knowledge base and injected into the model's context before generation. Reduces hallucination and extends knowledge beyond the training cutoff without retraining.
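The retrieve-then-inject flow can be sketched without any external services. In this toy version, retrieval is a simple word-overlap score standing in for the embedding search a real system would typically use, and the document list, function names, and prompt wording are all assumptions made for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenisation; a crude stand-in for real text processing."""
    return set(text.lower().split())

def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_terms = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_terms & tokenize(doc)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=1):
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base.
DOCS = [
    "The KV cache stores key and value tensors for earlier tokens.",
    "BLEU compares generated text to reference translations.",
    "LoRA trains small low-rank matrices while base weights stay frozen.",
]

prompt = build_prompt("What does the KV cache store?", DOCS)
```

The resulting `prompt` string is what would be sent to the language model; only the best-matching document is included, so the model answers from supplied text rather than from memory alone.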
A capability where the model outputs a structured call to an external function or API (e.g. search, calculator, database) rather than generating text directly. The calling application executes the function and returns results to the model.
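The application side of this loop, parsing the model's structured call, executing it, and packaging the result, might look like the sketch below. The JSON shape (`name`, `arguments`) and the `add` tool are hypothetical and do not follow any specific vendor's schema.

```python
import json

def add(a, b):
    """Example tool the model is allowed to call."""
    return a + b

# Registry mapping tool names to Python callables (illustrative).
TOOLS = {"add": add}

def execute_tool_call(model_output):
    """Parse a JSON tool call, run the matching function, and return a JSON result."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    result = TOOLS[name](**call["arguments"])
    return json.dumps({"name": name, "result": result})

# What a model might emit instead of free text (assumed format).
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
tool_message = execute_tool_call(model_output)
```

In a full application, `tool_message` would be appended to the conversation so the model can use the result in its final answer.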
Returning generated tokens to the client incrementally as they are produced, rather than waiting for the full response. Improves perceived latency significantly — especially important for long responses.
When a model generates factually incorrect, fabricated, or nonsensical information presented as fact. A fundamental limitation of autoregressive LLMs — the model predicts plausible-sounding tokens, not factually verified ones.
A measure of how well a language model predicts a test corpus. Lower perplexity = better predictions. Used as an intrinsic evaluation metric during training. Not always correlated with downstream task performance.
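Perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token probabilities; the probability values are invented, whereas a real evaluation would read them from the model's output on a test corpus.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each observed token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that always spreads probability evenly over 4 choices has perplexity 4.
uniform = perplexity([0.25] * 10)

# Higher probabilities on the observed tokens give lower perplexity.
good_model = perplexity([0.9, 0.8, 0.95])
poor_model = perplexity([0.2, 0.1, 0.3])
```

One intuition: a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens.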
Massive Multitask Language Understanding — a benchmark covering 57 subjects (STEM, humanities, law, medicine, etc.) with 4-choice multiple-choice questions. One of the most widely reported LLM benchmarks.
A coding benchmark by OpenAI with 164 hand-written programming problems, each checked by unit tests. Results are reported as pass@k: the probability that at least one of k sampled solutions to a problem passes its tests. A widely used benchmark for comparing code generation capabilities.
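In practice pass@k is usually estimated from n sampled solutions per problem, of which c pass, using the unbiased estimator 1 - C(n-c, k) / C(n, k) described in the paper that introduced the benchmark (Chen et al., 2021). The function below is a minimal sketch for a single problem; benchmark scores average this value over all problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is among the c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the ways to pick k samples that are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples for one problem, 3 of them pass the tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

For k=1 the estimate is simply the fraction of correct samples; larger k can only increase it.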
Bilingual Evaluation Understudy — a precision-based metric comparing machine-generated text to reference translations using n-gram overlap. Originally for machine translation; now widely used for text generation, though criticised for poor correlation with human judgment.
The challenge of ensuring AI systems behave in accordance with human values and intentions. In practice for LLMs: making the model helpful, harmless, and honest (Anthropic's HHH framework).
Techniques that attempt to bypass a model's safety guardrails by crafting adversarial prompts. Examples include role-play scenarios, prompt injection, and context switching that trick the model into ignoring its safety training.
An attack where malicious instructions are hidden in user-provided content (e.g. a web page retrieved by the model, a document being summarised) that override the system prompt or hijack the model's actions.
Anthropic's approach to AI alignment where the model is trained using a set of principles ("constitution") to critique and revise its own outputs. Reduces dependence on human feedback labellers for safety.
Adversarial testing where humans or automated systems probe an AI model to find harmful, unsafe, or unexpected behaviours before public deployment.
Running a trained model to generate predictions or text — as opposed to training (updating weights). Most API usage is inference. Speed measured in tokens/second; cost measured in $/million tokens.
Reducing numerical precision of model weights (e.g. float32 → int8 or int4) to reduce memory and increase inference speed. Introduces a small quality trade-off. See Optimization Guide for GPTQ, AWQ, GGUF details.
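The precision trade-off can be illustrated with the simplest scheme, symmetric "absmax" int8 quantization with one scale per tensor. This is a teaching sketch only: it is not GPTQ, AWQ, or GGUF, which use more elaborate grouping and calibration, and the weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]

weights = [0.12, -0.5, 0.33, 1.0, -0.98]   # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored in one byte instead of four, at the cost of an error of at most `scale / 2` per value.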
A high-throughput inference framework that uses PagedAttention to manage the KV cache in fixed-size blocks, much like virtual-memory pages, which substantially increases serving throughput. One of the most widely used options for self-hosted LLM serving.
A tool for running large language models locally. Handles model download, GGUF quantization, and serves an OpenAI-compatible API. Supports macOS (Metal), Linux (CUDA/ROCm), and Windows.
Latency is how long a single request takes: time to first token (TTFT) plus the time to generate the remaining tokens. Throughput is the number of tokens generated per second across all concurrent users. Optimising one often hurts the other: small batches reduce latency, large batches increase throughput.
Further training a pre-trained model on a task-specific or domain-specific dataset to improve performance. Ranges from full fine-tuning (all weights updated) to PEFT methods like LoRA that update only a tiny fraction of parameters.
About
This glossary defines 46 key AI and LLM terms across six categories: Architecture (Transformer, Attention, Multi-Head Attention, KV Cache, MoE, RoPE), Training (Pre-training, Fine-tuning, RLHF, DPO, LoRA, QLoRA, SFT, Instruction Tuning), Inference (Temperature, Top-p, Speculative Decoding, Beam Search, Greedy Decoding, vLLM), Evaluation (Perplexity, BLEU, ROUGE, BERTScore, MMLU, HumanEval), Safety (Hallucination, Constitutional AI, Alignment, Jailbreak, System Prompt, Red Teaming), and Deployment (Quantization, Distillation, ONNX, TensorRT, Triton, GGUF). Each term includes a plain-English definition and cross-references to related terms.
How to use
1. Use the search box to find any term by name or definition text.
2. Filter by category to browse a specific domain (e.g. Safety, Training).
3. Click "See also" chips to discover related concepts.
4. The term count in the top-right updates as you filter.
- What is a Transformer in AI?
- A Transformer is the neural network architecture that underpins virtually all modern LLMs. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it uses self-attention mechanisms to process all tokens in parallel (unlike RNNs which process sequentially). Encoder-only Transformers (BERT) are good at understanding; decoder-only (GPT, Claude, Llama) are used for generation.
- What is RAG (Retrieval-Augmented Generation)?
- RAG is a technique that augments a language model's response by retrieving relevant documents from an external knowledge base at inference time. The retrieved context is inserted into the prompt before generation. This lets a model access up-to-date or private information without retraining, and reduces hallucination on factual queries.
- What does hallucination mean in AI?
- Hallucination refers to when an LLM generates plausible-sounding but factually incorrect or fabricated information. It occurs because models are trained to produce high-probability text sequences, not to verify factual accuracy. Techniques to reduce hallucination include RAG, grounding in source documents, RLHF with factuality rewards, and Constitutional AI.
- What is RLHF?
- RLHF (Reinforcement Learning from Human Feedback) is a fine-tuning technique where human raters compare model outputs, their preferences are used to train a reward model, and the language model is then fine-tuned using reinforcement learning (PPO) to maximize the reward. Used by ChatGPT, Claude, and Gemini to align model outputs with human preferences.