What is a Transformer in AI?

A Transformer is the neural network architecture that underpins virtually all modern LLMs. Introduced in "Attention Is All You Need" (Vaswani et al., 2017), it uses self-attention to process all tokens of a sequence in parallel, unlike RNNs, which process them one at a time. Encoder-only Transformers (BERT) are suited to understanding tasks such as classification; decoder-only Transformers (GPT, Claude, Llama) are used for generation.

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that augments a language model's response by retrieving relevant documents from an external knowledge base at inference time. The retrieved context is inserted into the prompt before generation. This lets a model access up-to-date or private information without retraining, and reduces hallucination on factual queries.
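
The retrieve-then-insert flow described above can be sketched in a few lines. This is a toy illustration only: the function names (tokenize, retrieve, build_prompt) and the sample documents are hypothetical, and plain word overlap stands in for the embedding-based search a real system would more likely use.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query.
    Word overlap is an assumed stand-in for vector similarity search."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Insert the retrieved context into the prompt before generation."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
]
prompt = build_prompt("How long is the refund window?", docs, k=1)
```

The resulting prompt string would then be sent to the language model; the model itself is outside the scope of this sketch.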

What does hallucination mean in AI?

Hallucination refers to when an LLM generates plausible-sounding but factually incorrect or fabricated information. It occurs because models are trained to produce high-probability text sequences, not to verify factual accuracy. Techniques to reduce hallucination include RAG, grounding in source documents, RLHF with factuality rewards, and Constitutional AI.

What is RLHF?

RLHF (Reinforcement Learning from Human Feedback) is a fine-tuning technique in three steps: human raters compare model outputs, their preferences are used to train a reward model, and the language model is then fine-tuned with reinforcement learning (typically PPO) to maximize the reward. It is used for ChatGPT, Claude, and Gemini to align model outputs with human preferences.

AI / LLM Glossary

46 terms across architecture, training, inference, evaluation, safety, and deployment.

Transformer Architecture

The neural network architecture underlying virtually all modern LLMs. Introduced by Vaswani et al. (2017) in "Attention Is All You Need". Uses self-attention mechanisms to relate every token to every other token in a sequence — enabling parallel training and long-range dependency capture.

See also: Attention, Self-Attention, Encoder, Decoder

Attention Architecture

A mechanism that allows each token to "attend to" (weight the importance of) all other tokens in the sequence when computing its representation. Scaled dot-product attention: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V where Q=queries, K=keys, V=values.
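
As a rough illustration of the formula above, here is a minimal pure-Python sketch of scaled dot-product attention. Matrices are plain lists of rows, and the function names are chosen for this example only; real implementations use batched tensor libraries.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Because the query matches the first key more closely, the output row lies nearer to the first value row than to the second.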

Multi-Head Attention Architecture

Runs multiple attention operations ("heads") in parallel, each learning different aspects of token relationships (e.g. one head may capture syntactic structure, another semantic similarity). Outputs are concatenated and projected back to the model dimension.
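
The split, attend, and concatenate steps can be sketched as follows. This is a simplified illustration: the learned per-head projection matrices and the final output projection are deliberately omitted (treated as identity), so only the head mechanics are shown. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, n_heads):
    """Slice each token vector into n_heads equal chunks, one matrix per head."""
    d_head = len(X[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(n_heads)]

def multi_head_self_attention(X, n_heads):
    """Run attention independently per head, then concatenate per token.
    Assumption: learned Q/K/V and output projections are left out."""
    heads = [attend(Xh, Xh, Xh) for Xh in split_heads(X, n_heads)]
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_self_attention(X, n_heads=2)
```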

Self-Attention Architecture

Attention where queries, keys, and values all come from the same sequence: each token computes attention over every token in that sequence (including itself). Enables the model to build contextual representations.

Context Window Architecture

The maximum number of tokens a model can process at once (prompt + response combined). Input beyond this limit cannot be attended to: depending on the application, the request is rejected or the oldest tokens are truncated. GPT-4o: 128K tokens; Claude 4: 200K tokens; Gemini 2.5: 1M tokens.
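
One common way for an application to stay within the limit is to reserve room for the response and keep only the most recent prompt tokens. The sketch below assumes that policy; the function name and the numbers are hypothetical, and other strategies (summarising or rejecting the request) are equally valid.

```python
def fit_to_window(prompt_tokens, context_window, max_response_tokens):
    """Keep the most recent prompt tokens so that prompt + response fits."""
    budget = context_window - max_response_tokens
    if budget <= 0:
        raise ValueError("response reservation leaves no room for the prompt")
    # A negative slice keeps the last `budget` tokens (or all, if fewer).
    return prompt_tokens[-budget:]

tokens = list(range(10))            # stand-in for 10 token ids
kept = fit_to_window(tokens, context_window=8, max_response_tokens=3)
```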

Embedding Architecture

A dense vector representation of a token, word, sentence, or document in a high-dimensional space (e.g. 768 or 4096 dimensions). Semantically similar items have vectors that are geometrically close (high cosine similarity).
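
Cosine similarity between two embeddings can be computed as below. The three-dimensional vectors are made up for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not output from any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

close = cosine_similarity(cat, kitten)
far = cosine_similarity(cat, car)
```

With these toy values, the "cat" and "kitten" vectors score much higher than "cat" and "car".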

Positional Encoding Architecture

Since self-attention has no inherent notion of order, position information must be injected explicitly. The original Transformer adds sinusoidal positional encodings to the token embeddings. Many modern models instead use Rotary Position Embeddings (RoPE), which rotate the query and key vectors by position-dependent angles and extend to long contexts more effectively than the original sinusoidal encodings.
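
For reference, the original sinusoidal scheme from Vaswani et al. (2017) can be sketched as follows: even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across the vector. The function names are chosen for this example; RoPE works differently and is not shown.

```python
import math

def sinusoidal_encoding(position, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    encoding = []
    for even_dim in range(0, d_model, 2):
        angle = position / (10000 ** (even_dim / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

def add_position(embedding, position):
    """Add the positional encoding element-wise to a token embedding."""
    pe = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

first = sinusoidal_encoding(0, 4)
shifted = add_position([0.5, 0.5, 0.5, 0.5], position=0)
```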

MoE — Mixture of Experts Architecture

An architecture where the model routes each token to only a subset of specialised "expert" sub-networks (FFN layers). Most parameters are not activated for any given input, allowing very large parameter counts without proportional compute costs. Used by Mixtral and DeepSeek V3, and reportedly by GPT-4.
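
A minimal sketch of top-k routing, assuming the router has already produced one logit per expert for the current token. The experts here are toy functions and every name is hypothetical; production routers also add load-balancing terms that are not shown.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalise their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, w in route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

calls = []  # records which experts actually ran

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [xi * (i + 1) for xi in x]  # toy stand-in for an FFN
    return expert

experts = [make_expert(i) for i in range(4)]
out = moe_layer([1.0, 2.0], [0.1, 2.0, -1.0, 1.0], experts, k=2)
```

Only two of the four experts run for this token, which is the source of the compute saving.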

GQA — Grouped Query Attention Architecture

A memory-efficient attention variant where groups of query heads share a single key head and value head. Reduces the KV-cache size significantly during inference. Used in Llama 3, Mistral, and many other recent LLMs.
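
The sharing pattern amounts to an index mapping from query heads to key/value heads. The sketch below assumes contiguous grouping and uses hypothetical head counts (8 query heads, 2 KV heads); actual values vary by model.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the index of the key/value head shared by this query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical configuration: 8 query heads sharing 2 KV heads.
mapping = [kv_head_for(h, n_query_heads=8, n_kv_heads=2) for h in range(8)]
cache_reduction = 8 // 2  # KV cache is 4x smaller than with 8 KV heads
```

Setting n_kv_heads equal to n_query_heads recovers standard multi-head attention, and setting it to 1 gives multi-query attention.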

KV Cache Architecture

During autoregressive generation, the key and value tensors for all previous tokens (prompt and generated) are cached to avoid recomputation. The KV cache grows linearly with sequence length and batch size and, at long contexts, is often the largest memory cost during inference after the model weights.
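
A back-of-the-envelope estimate of cache size: one key and one value vector are stored per token, per layer, per KV head. The configuration below is hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Total bytes for cached keys and values (the leading 2 covers K and V)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model: 32 layers, 8 KV heads of size 128, 16-bit values.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
size_gib = size / 2**30
```

Under these assumptions an 8,192-token sequence needs 1 GiB of cache, and doubling the sequence length doubles that figure.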

Pre-training Training

The initial training phase where a model learns general language representations from massive amounts of unlabelled text (next-token prediction or masked token prediction). Requires enormous compute — GPT-4 reportedly used 25,000 A100s for months.
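
The next-token prediction objective is an average cross-entropy: the model's output at position t is scored against the token that actually appears at position t+1. A minimal sketch, with hypothetical names and raw logits supplied by hand rather than by a model:

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    total, count = 0.0, 0
    # zip stops at the shorter input, so the last position has no target.
    for logits, target in zip(logits_per_position, token_ids[1:]):
        m = max(logits)
        log_normaliser = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_normaliser - logits[target]   # -log softmax(logits)[target]
        count += 1
    return total / count

# A model with no information (uniform logits) over a 4-token vocabulary.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = next_token_loss(uniform, token_ids=[0, 1, 2])
```

A uniform guess over V tokens gives a loss of ln(V); training pushes the loss below that baseline.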

SFT — Supervised Fine-tuning Training

Training a pre-trained model on a curated dataset of (instruction, response) pairs, where the response is the "correct" output. Makes the model follow instructions. The first step after pre-training for instruction-following models.

RLHF — Reinforcement Learning from Human Feedback Training

A training method where a reward model (trained on human preference rankings) scores the LLM's outputs, and the LLM is updated with a reinforcement learning algorithm, typically PPO (Proximal Policy Optimisation), to increase that score. Used by OpenAI to align GPT-4 and ChatGPT with human preferences.
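
The reward model is commonly trained with a pairwise loss that is small when the preferred response scores higher than the rejected one. The sketch below assumes that Bradley-Terry style formulation, which the entry above does not specify; the PPO step itself is not shown, and the function name is hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

agrees = preference_loss(2.0, 0.0)      # reward model ranks the pair correctly
undecided = preference_loss(0.0, 0.0)   # reward model has no preference
disagrees = preference_loss(0.0, 2.0)   # reward model ranks the pair wrongly
```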

DPO — Direct Preference Optimisation Training

An alternative to RLHF that directly trains on human preference pairs (chosen vs. rejected response) without a separate reward model. More stable and simpler than PPO-based RLHF. Widely adopted in open-source models.
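
The DPO objective for a single preference pair can be sketched as below. The inputs are total log-probabilities of each response under the model being trained (the policy) and under a frozen reference copy; beta is a temperature hyperparameter whose value of 0.1 here is only an illustrative assumption.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * margin), where the margin measures how much more the
    policy favours the chosen response than the reference model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-beta * margin))

start = dpo_loss(-1.0, -1.0, -1.0, -1.0)      # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy now prefers the chosen one
```

No reward model appears anywhere in the computation, which is the simplification the entry above describes.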

LoRA Training

Low-Rank Adaptation: a parameter-efficient fine-tuning technique that freezes the pre-trained weights and adds a pair of small trainable low-rank matrices alongside selected weight matrices (typically the attention projections) instead of updating all weights. See the Optimization Guide for a full explanation.
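
The low-rank update can be written as W' = W + (alpha / r) * B A, where W is frozen and only B (d_out x r) and A (r x d_in) are trained. A minimal sketch with plain lists as matrices; the names, the tiny 2x2 example, and the 4096-wide layer used for the parameter count are illustrative assumptions.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha, r):
    """Effective weight: frozen W plus the scaled low-rank product B @ A."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters in B and A combined."""
    return r * (d_out + d_in)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight
B = [[1.0], [2.0]]             # d_out x r, with r = 1
A = [[3.0, 4.0]]               # r x d_in
merged = lora_weight(W, A, B, alpha=2, r=1)

full = 4096 * 4096
lora = trainable_params(4096, 4096, r=8)
```

For the assumed 4096 x 4096 layer, rank 8 trains 65,536 parameters instead of about 16.8 million.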
