CodeLint.Dev Dev Tools

AI / LLM Glossary

46 terms across architecture, training, inference, evaluation, safety, and deployment.

Transformer Architecture

The neural network architecture underlying virtually all modern LLMs. Introduced by Vaswani et al. (2017) in "Attention Is All You Need". Uses self-attention mechanisms to relate every token to every other token in a sequence — enabling parallel training and long-range dependency capture.

See also: Attention, Self-Attention, Encoder, Decoder
Attention Architecture

A mechanism that lets each token "attend to" (weight the importance of) the other tokens in the sequence when computing its representation. Scaled dot-product attention: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V, where Q = queries, K = keys, V = values, and d_k is the dimension of the key vectors.
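
As a rough illustration of the formula above, the following pure-Python sketch computes scaled dot-product attention for a single head. The function names and the toy Q, K, V values are invented for this example; real implementations use batched tensor operations and usually a mask.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q and K are lists of vectors of length d_k; V is a list of value vectors.
    Returns one output vector per query.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (illustrative only): two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each query puts more weight on the key it matches, so the first output leans towards the first value vector and the second towards the second.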

Multi-Head Attention Architecture

Runs multiple attention operations ("heads") in parallel, each learning different aspects of token relationships (e.g. one head may capture syntactic structure, another semantic similarity). Outputs are concatenated and projected back to the model dimension.

Self-Attention Architecture

Attention where queries, keys, and values are all derived from the same sequence, so each token attends to the tokens of that same sequence (in decoder-only LLMs, only to itself and earlier tokens, because of the causal mask). Enables the model to build contextual representations.

Context Window Architecture

The maximum number of tokens a model can attend to at once (prompt and response combined). Input beyond this limit must be truncated, summarised, or rejected; many chat applications simply drop the oldest turns. Advertised limits: GPT-4o: 128K tokens; Claude 4: 200K tokens; Gemini 2.5: 1M tokens.

Embedding Architecture

A dense vector representation of a token, word, sentence, or document in a high-dimensional space (e.g. 768 or 4096 dimensions). Semantically similar items have vectors that are geometrically close (high cosine similarity).
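
A minimal sketch of the cosine similarity mentioned above. The three-dimensional vectors are hypothetical values invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
```

Here `related` comes out much closer to 1 than `unrelated`, which is the geometric sense in which similar items are "close".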

Positional Encoding Architecture

Since transformers have no inherent notion of order, positional encodings are added to token embeddings to inject sequence position information. Modern models use Rotary Position Embeddings (RoPE) which extend to long contexts more effectively than the original sinusoidal encodings.
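
For reference, a small sketch of the original sinusoidal scheme (not RoPE), assuming the formulation in Vaswani et al. (2017): even dimensions use a sine, odd dimensions a cosine, with wavelengths growing geometrically. The function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding vector for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so i / d_model equals 2k / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # dimension i
        encoding.append(math.cos(angle))  # dimension i + 1
    return encoding[:d_model]

first = sinusoidal_encoding(0, 4)
tenth = sinusoidal_encoding(10, 4)
```

The resulting vector is added element-wise to the token embedding at that position.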

MoE — Mixture of Experts Architecture

An architecture where a router sends each token to only a subset of specialised "expert" sub-networks (FFN layers). Most parameters are not activated for any given token, allowing very large parameter counts without proportional compute costs. Used by Mixtral and DeepSeek V3; GPT-4 is widely reported, though not officially confirmed, to use it as well.
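
The routing step can be sketched roughly as follows. This is a simplified top-k router over scalar inputs; taking a softmax over only the selected experts' scores is one common choice and is an assumption here, as are all names and the toy experts.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    m = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - m) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(router_logits, k))

# Hypothetical experts: plain functions standing in for FFN sub-networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
output = moe_layer(3.0, experts, logits, k=2)
```

Only two of the four experts run for this input, which is where the compute saving comes from.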

GQA — Grouped Query Attention Architecture

A memory-efficient attention variant where multiple query heads share key and value heads. Reduces the KV-cache size significantly during inference. Used in Llama 3, Mistral, and most recent LLMs.

KV Cache Architecture

During autoregressive generation, the key and value tensors for all previous tokens (prompt and generated) are cached to avoid recomputing them at every step. The KV cache grows linearly with sequence length and batch size and is usually the largest memory cost after the model weights, especially at long contexts.
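
A back-of-the-envelope estimate of cache size, as a sketch. The model shape below is hypothetical, and the calculation assumes 16-bit cache entries; the number of key/value heads is the quantity that grouped query attention reduces.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model: 32 layers, head_dim 128, 8192-token context, fp16 cache.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

mha_gib = mha_bytes / 1024 ** 3
gqa_gib = gqa_bytes / 1024 ** 3
```

With these assumed numbers, sharing each key/value head across four query heads cuts the cache from 4 GiB to 1 GiB per sequence.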

Pre-training Training

The initial training phase where a model learns general language representations from massive amounts of unlabelled text (next-token prediction or masked token prediction). Requires enormous compute — GPT-4 reportedly used 25,000 A100s for months.

SFT — Supervised Fine-tuning Training

Training a pre-trained model on a curated dataset of (instruction, response) pairs, where the response is the "correct" output. Makes the model follow instructions. The first step after pre-training for instruction-following models.

RLHF — Reinforcement Learning from Human Feedback Training

A training method where a reward model (trained on human preference rankings) scores the LLM's outputs, and the LLM is then optimised against that reward with a reinforcement learning algorithm, typically PPO (Proximal Policy Optimisation). Used by OpenAI to align GPT-4 and ChatGPT with human preferences.

DPO — Direct Preference Optimisation Training

An alternative to RLHF that directly trains on human preference pairs (chosen vs. rejected response) without a separate reward model. More stable and simpler than PPO-based RLHF. Widely adopted in open-source models.
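
For a single preference pair, the DPO objective can be sketched as below. It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference model; the argument names and the default beta are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy prefers "chosen" over "rejected"
    # than the reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
learned_preference = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, without any separate reward model in the loop.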

LoRA Training

Low-Rank Adaptation: a parameter-efficient fine-tuning technique that freezes the original weights and, for each adapted weight matrix (typically the attention projections), trains a pair of small low-rank matrices whose product is added to the frozen weight. See the Optimization Guide for a full explanation.
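
A minimal sketch of the idea, assuming the usual formulation y = Wx + scale · B(Ax), with W frozen and only A and B trained. The helper names and matrix sizes are illustrative.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen weight W plus a low-rank update B @ A.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, rank):
    """(parameters in the full matrix, trainable parameters in A and B)."""
    return d_out * d_in, rank * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]           # rank 1
B_zero = [[0.0], [0.0]]    # B is typically initialised to zero
B_trained = [[1.0], [0.0]]

y_initial = lora_forward([2.0, 3.0], W, A, B_zero)
y_adapted = lora_forward([2.0, 3.0], W, A, B_trained)
full, trainable = lora_param_counts(4096, 4096, rank=8)
```

For a 4096 × 4096 matrix at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million.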

See also: QLoRA, PEFT
QLoRA Training

Quantized LoRA: keeps the base model frozen in 4-bit NF4 precision and trains LoRA adapters in 16-bit precision (bfloat16 in the original paper). Enables fine-tuning a 65B-parameter model on a single 48 GB GPU.

See also: LoRA
PEFT Training

Parameter-Efficient Fine-Tuning — umbrella term for methods (LoRA, QLoRA, prefix tuning, prompt tuning, adapters) that fine-tune a small number of parameters rather than the full model. The Hugging Face PEFT library implements most of these.

Catastrophic Forgetting Training

When a model is fine-tuned on a specific task, it tends to "forget" capabilities learned during pre-training. Full fine-tuning is most susceptible. LoRA mitigates this by keeping original weights frozen.

Autoregressive Generation Inference & Sampling

LLMs generate text one token at a time, feeding each generated token back as input to predict the next. This means generating a 100-token response requires 100 sequential forward passes.
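
The loop itself is simple. The sketch below uses a hypothetical `toy_model` function as a stand-in for a real forward pass; everything else about a real model (logits, sampling, caching) is omitted.

```python
def generate(next_token_fn, prompt_tokens, max_new_tokens, eos_token=None):
    """Autoregressive loop: one call to the model per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = next_token_fn(tokens)  # one forward pass
        tokens.append(next_token)           # fed back in on the next step
        if next_token == eos_token:
            break
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for an LLM: predicts last token + 1."""
    return tokens[-1] + 1

full_run = generate(toy_model, [1, 2], max_new_tokens=3)
stopped_early = generate(toy_model, [1], max_new_tokens=10, eos_token=3)
```

Because each step depends on the previous one, the passes cannot run in parallel, which is why generation latency scales with output length.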

Token Inference & Sampling

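The basic unit of text that a model processes. Tokens are usually subword pieces rather than whole words or single characters, e.g. "tokenization" → ["token", "ization"] (the exact split depends on the tokenizer). One token ≈ 4 characters of English text; code and non-Latin scripts typically get fewer characters per token, so the same amount of text costs more tokens.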

Prompt Inference & Sampling

The input text sent to a language model. Includes the system prompt, conversation history, and user message. Token count directly determines cost and speed.

System Prompt Inference & Sampling

Instructions provided at the start of a conversation to define the model's persona, behaviour, capabilities, and constraints. Typically hidden from the end user but processed before every query.

Temperature Inference & Sampling

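A sampling parameter that controls output randomness by dividing the logits before the softmax. The allowed range depends on the API (commonly 0–2 or 0–1). Low values make outputs near-deterministic; high values produce more diverse, less predictable outputs. See Sampling Parameters reference for full details.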
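
A short sketch of how temperature reshapes the next-token distribution, assuming the standard approach of dividing logits by the temperature before the softmax. The logits are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
```

At temperature 0.2 almost all probability mass sits on the top token; at 2.0 the distribution is much flatter, so sampling picks the other tokens far more often.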

Top-P / Nucleus Sampling Inference & Sampling

A sampling strategy that samples only from the smallest set of highest-probability tokens whose cumulative probability reaches p. The candidate pool adapts to model confidence: smaller when the model is confident, larger when it is uncertain.
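
The filtering step can be sketched as follows: keep the most probable tokens until their cumulative probability reaches p, then renormalise. The function name and the example distributions are illustrative.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalised_prob} for the nucleus."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break  # smallest prefix whose mass reaches p
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

# Hypothetical next-token distributions.
confident = top_p_filter([0.95, 0.03, 0.01, 0.01], p=0.85)
uncertain = top_p_filter([0.4, 0.3, 0.2, 0.1], p=0.85)
```

With the same p, the confident distribution keeps a single candidate while the uncertain one keeps three. Sampling then draws from the renormalised probabilities.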

Greedy Decoding Inference & Sampling

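Always picks the token with the highest probability at each step. Equivalent to sampling in the limit of temperature → 0 (most APIs treat temperature=0 as greedy). Fast and deterministic, but often produces repetitive text, because choosing the locally best token at each step does not guarantee the most probable overall sequence.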

Beam Search Inference & Sampling

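Maintains k candidate sequences ("beams") at each step, keeping the k highest-scoring partial sequences and returning the best complete one. It is an approximate search: it often finds higher-probability sequences than greedy decoding but does not guarantee the global optimum. Deterministic and more computationally expensive than sampling. Less common in modern LLMs (sampling preferred).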
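
A compact sketch over a hypothetical toy distribution, chosen so that the locally best first token leads to a worse overall sequence. `toy_step` stands in for a model's next-token probabilities; a beam width of 1 behaves like greedy decoding.

```python
import math

def beam_search(step_fn, start, steps, beam_width):
    """Return (log_prob, sequence) of the best sequence found."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((score + math.log(prob), seq + [token]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

def toy_step(seq):
    """Hypothetical next-token probabilities, conditioned on the last token."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"x": 0.9, "y": 0.1},
    }
    return table[seq[-1]]

greedy_score, greedy_seq = beam_search(toy_step, "<s>", steps=2, beam_width=1)
beam_score, beam_seq = beam_search(toy_step, "<s>", steps=2, beam_width=2)
```

Greedy commits to "a" (probability 0.6) and ends with a sequence of probability 0.30, while a width-2 beam keeps "b" alive and finds the sequence with probability 0.36.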

Speculative Decoding Inference & Sampling

A latency-reduction technique where a small draft model proposes several tokens quickly, and the large model verifies them in a single parallel forward pass, keeping the accepted prefix. Reported speedups are typically 2–3×, and with exact verification the output distribution is unchanged from the large model's.

RAG — Retrieval-Augmented Generation Inference & Sampling

A pattern where relevant documents are retrieved from an external knowledge base and injected into the model's context before generation. Reduces hallucination and extends knowledge beyond the training cutoff without retraining.
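
The retrieve-then-inject flow can be sketched as below. Word overlap stands in for a real retriever (production systems typically use embedding similarity over a vector index), and the documents, prompt template, and function names are all invented for illustration.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by the number of words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, documents, top_k=2):
    """Inject the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical knowledge base.
DOCS = [
    "The refund policy allows returns within 14 days",
    "Shipping takes 3 to 5 business days",
    "Support is available by email on weekdays",
]
question = "what is the refund policy"
prompt = build_prompt(question, DOCS, top_k=2)
```

The resulting `prompt` string is what would be sent to the model; the model itself is unchanged.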

Function Calling / Tool Use Inference & Sampling

A capability where the model outputs a structured call to an external function or API (e.g. search, calculator, database) rather than generating text directly. The calling application executes the function and returns results to the model.
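
On the application side, handling a tool call amounts to parsing the model's structured output, running the matching function, and returning the result. The JSON message shape and the `TOOLS` registry below are hypothetical; each provider defines its own schema.

```python
import json

def add(a, b):
    """Example tool the model is allowed to call."""
    return a + b

# Registry of callable tools, keyed by the name the model uses.
TOOLS = {"add": add}

def handle_model_output(raw):
    """Execute a tool call if the model asked for one, else return its text."""
    message = json.loads(raw)
    if message.get("type") != "tool_call":
        return message.get("content")
    function = TOOLS[message["name"]]
    result = function(**message["arguments"])
    # This result would be sent back to the model as a new message.
    return json.dumps({"type": "tool_result",
                       "name": message["name"],
                       "result": result})

tool_reply = handle_model_output(
    '{"type": "tool_call", "name": "add", "arguments": {"a": 2, "b": 3}}'
)
text_reply = handle_model_output('{"type": "text", "content": "Hello"}')
```

Note that the model never runs code itself: it only emits the request, and the application decides whether and how to execute it.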

Streaming Inference & Sampling

Returning generated tokens to the client incrementally as they are produced, rather than waiting for the full response. Improves perceived latency significantly — especially important for long responses.

Hallucination Evaluation

When a model generates factually incorrect, fabricated, or nonsensical information presented as fact. A fundamental limitation of autoregressive LLMs — the model predicts plausible-sounding tokens, not factually verified ones.

Perplexity Evaluation

A measure of how well a language model predicts a test corpus: the exponential of the average negative log-likelihood per token. Lower perplexity = better predictions. Used as an intrinsic evaluation metric during training. Not always correlated with downstream task performance.
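
Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, taking as input the probability the model assigned to each actual token (the values here are made up):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
perfect = perplexity([1.0, 1.0])
```

A model that assigns probability 0.25 to every correct token has perplexity 4, as if it were choosing uniformly among four options at each step.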

MMLU Evaluation

Massive Multitask Language Understanding — a benchmark covering 57 subjects (STEM, humanities, law, medicine, etc.) with 4-choice multiple-choice questions. One of the most widely reported LLM benchmarks.

HumanEval Evaluation

A coding benchmark by OpenAI with 164 hand-written Python programming problems, each checked by unit tests. Reports pass@k: the fraction of problems for which at least one of k generated samples passes all tests. Long a standard benchmark for comparing code generation capabilities.
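
pass@k is usually computed with the unbiased estimator from the paper that introduced HumanEval: generate n samples per problem, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). A sketch for a single problem (the benchmark score averages this over all problems):

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples passes.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_1 = pass_at_k(n=10, c=3, k=1)
pass_at_5 = pass_at_k(n=10, c=3, k=5)
```

With 3 passing samples out of 10, pass@1 is 0.3, and pass@5 is considerably higher because any one of five draws may succeed.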

BLEU Evaluation

Bilingual Evaluation Understudy — a precision-based metric comparing machine-generated text to reference translations using n-gram overlap. Originally for machine translation; now widely used for text generation, though criticised for poor correlation with human judgment.
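
The core of BLEU is clipped ("modified") n-gram precision. The sketch below shows only that component for a single reference; full BLEU combines precisions for several n-gram sizes with a brevity penalty, which is omitted here. Function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()
degenerate = "the the the the the the the".split()
unigram_precision = modified_precision(degenerate, reference, 1)
```

Clipping is what stops a degenerate candidate from scoring well: "the" appears twice in the reference, so only 2 of the 7 candidate tokens count.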

Alignment Safety & Alignment

The challenge of ensuring AI systems behave in accordance with human values and intentions. In practice for LLMs: making the model helpful, harmless, and honest (Anthropic's HHH framework).

Jailbreaking Safety & Alignment

Techniques that attempt to bypass a model's safety guardrails by crafting adversarial prompts. Examples include role-play scenarios, prompt injection, and context switching that trick the model into ignoring its safety training.

Prompt Injection Safety & Alignment

An attack where malicious instructions are hidden in content the model processes (e.g. a web page retrieved by the model, a document being summarised, or direct user input) in an attempt to override the system prompt or hijack the model's actions.

Constitutional AI Safety & Alignment

Anthropic's approach to AI alignment where the model is trained using a set of principles ("constitution") to critique and revise its own outputs. Reduces dependence on human feedback labellers for safety.

Red Teaming Safety & Alignment

Adversarial testing where humans or automated systems probe an AI model to find harmful, unsafe, or unexpected behaviours before public deployment.

Inference Deployment & Serving

Running a trained model to generate predictions or text — as opposed to training (updating weights). Most API usage is inference. Speed measured in tokens/second; cost measured in $/million tokens.

Quantization Deployment & Serving

Reducing numerical precision of model weights (e.g. float32 → int8 or int4) to reduce memory and increase inference speed. Introduces a small quality trade-off. See Optimization Guide for GPTQ, AWQ, GGUF details.
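
To make the trade-off concrete, here is a sketch of symmetric "absmax" int8 quantization, one of the simplest schemes. It is not GPTQ, AWQ, or the GGUF formats, which use more elaborate grouping and calibration; the weights are made-up values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is the "small quality trade-off" in practice.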

vLLM Deployment & Serving

A high-throughput inference framework that uses PagedAttention to manage the KV cache like virtual memory, substantially increasing serving throughput. One of the most widely used frameworks for self-hosted LLM serving.

Ollama Deployment & Serving

A tool for running large language models locally. Handles model download, runs GGUF-format (typically quantized) models via llama.cpp, and serves a local REST API, including OpenAI-compatible endpoints. Supports macOS (Metal), Linux (CUDA/ROCm), and Windows.

Latency vs Throughput Deployment & Serving

Latency is the time a single request takes: time to first token (TTFT) plus the time to generate the remaining tokens. Throughput is the total number of tokens generated per second across all concurrent requests. Optimising one often hurts the other: small batches reduce latency, large batches increase throughput.
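
The trade-off can be illustrated with simple arithmetic. All timings below are hypothetical numbers chosen for the example, not measurements; the assumption is that larger batches make each decode step slower but emit one token per request per step.

```python
def request_latency(ttft_s, output_tokens, seconds_per_token):
    """End-to-end latency of one request, in seconds."""
    # The first token arrives after ttft_s; each later token adds one decode step.
    return ttft_s + (output_tokens - 1) * seconds_per_token

def system_throughput(batch_size, seconds_per_token):
    """Tokens per second across the whole batch."""
    return batch_size / seconds_per_token

# Hypothetical timings for a 101-token response.
small_batch = (request_latency(0.2, 101, 0.02), system_throughput(1, 0.02))
large_batch = (request_latency(0.8, 101, 0.05), system_throughput(32, 0.05))
```

Under these assumed numbers, the batch of 32 serves roughly 13 times more tokens per second, while each individual user waits more than twice as long.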

Fine-tuning Deployment & Serving

Further training a pre-trained model on a task-specific or domain-specific dataset to improve performance. Ranges from full fine-tuning (all weights updated) to PEFT methods like LoRA that update only a tiny fraction of parameters.