LLM Fine-tuning & Optimization Guide
LoRA, QLoRA, GPTQ, AWQ, GGUF — when to use each, VRAM requirements, quality tradeoffs.
Quick Comparison
| Method | Trainable % | VRAM | Quality | Best for |
|---|---|---|---|---|
| Full Fine-tune | 100% | Full | ★★★★★ | Best quality, large datasets |
| LoRA | ~0.5% | 3–5×↓ | ★★★★☆ | Single GPU, adapter swapping |
| QLoRA | ~0.5% | 8–12×↓ | ★★★★☆ | Consumer GPU, 7B–65B models |
| Prefix Tuning | < 0.1% | 10×↓ | ★★★☆☆ | Multi-task, storage efficiency |
| Prompt Tuning | < 0.01% | 20×↓ | ★★☆☆☆ | Extreme scale (10B+) |
| INT8 Quant | N/A | 2×↓ | ★★★★★ | Inference, GPU VRAM saving |
| GPTQ (4-bit) | N/A | 4×↓ | ★★★★☆ | Production serving, large models |
| AWQ (4-bit) | N/A | 4×↓ | ★★★★★ | Edge/on-device, best 4-bit quality |
| GGUF/llama.cpp | N/A | 4–8×↓ | ★★★★☆ | Local/offline CPU+GPU inference |
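The 2× and 4× figures for the quantization rows follow from simple arithmetic on weight storage. The sketch below is a rough estimate only, assuming a 16-bit baseline and a hypothetical 7B-parameter model; the function names are illustrative, and it ignores activations, KV cache, optimizer state, and quantization metadata such as scales and zero points.

```python
# Rough weight-only memory estimate: parameters x bits per weight / 8 bytes.
# Assumption: a 16-bit (FP16/BF16) baseline; real usage is higher because of
# activations, KV cache, and per-group quantization metadata.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def reduction_vs_fp16(bits_per_weight: float) -> float:
    """How many times smaller the weights are than a 16-bit baseline."""
    return 16 / bits_per_weight


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        size = weight_memory_gb(n_params, bits)
        ratio = reduction_vs_fp16(bits)
        print(f"{name:>5}: {size:5.1f} GB ({ratio:.0f}x smaller than FP16)")
```

The fine-tuning rows cannot be estimated this way, because their savings come mostly from not storing gradients and optimizer state for the frozen weights.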
LoRA (Hu et al. 2021, Microsoft) freezes the original pre-trained weights and injects two small trainable matrices (A and B) into each attention layer. Only these low-rank matrices are trained, keeping the original model intact. At inference, the adapted weights can be merged back — zero latency overhead.
How it works
- 1 For a weight matrix W (d×k), introduce two matrices: B (d×r) and A (r×k), where rank r << min(d, k)
- 2 Forward pass: output = Wx + BAx (original + low-rank adaptation)
- 3 Only A and B are updated during training; W is frozen
- 4 Rank r controls the capacity vs. efficiency trade-off (common values: 4, 8, 16, 64)
- 5 After training, merge: W' = W + BA for zero-overhead inference
Use cases
- Fine-tuning on a single consumer GPU (16–24 GB VRAM)
- Task-specific adapters that can be swapped without reloading the base model
- Style and tone adaptation for LLMs
- Domain adaptation for code or domain-specific language
Limitations
- Smaller trainable parameter count may underfit very large domain shifts
- Rank selection requires experimentation
- Not all model layers benefit equally; typically applied to the Q, K, V, and O attention matrices
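The mechanism described for LoRA (a frozen W plus a low-rank update BA, merged after training) can be checked numerically. This is a minimal sketch in pure Python, not a training implementation: the toy sizes, the helper names, and the random values are assumptions for illustration. It takes B as d×r and A as r×k so that BAx is well-defined for a k-dimensional input x. In practice B is usually initialized to zero; both matrices are random here only so that the equivalence check is non-trivial.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(m, n):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*n))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in m]


def lora_forward(W, A, B, x):
    """Unmerged forward pass: Wx + B(Ax)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [p + q for p, q in zip(base, update)]


def merge(W, A, B):
    """Fold the adapter into the base weights: W' = W + BA."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed for illustration): W is d x k, B is d x r, A is r x k.
d, k, r = 6, 4, 2
rng = random.Random(0)
W = random_matrix(d, k, rng)  # frozen pre-trained weights
B = random_matrix(d, r, rng)  # trainable
A = random_matrix(r, k, rng)  # trainable
x = [rng.gauss(0.0, 1.0) for _ in range(k)]

y_adapter = lora_forward(W, A, B, x)   # adapter kept separate
y_merged = matvec(merge(W, A, B), x)   # adapter folded into W

max_diff = max(abs(p - q) for p, q in zip(y_adapter, y_merged))
print(f"max difference between unmerged and merged outputs: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why a merged adapter adds no inference latency: after merging, the layer is a single matrix multiply again.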
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
About
This guide covers nine optimization techniques for large language models: Full Fine-tuning (updates all weights; highest quality, highest VRAM), LoRA (Low-Rank Adaptation; adds small trainable matrices, about 99% fewer trainable parameters), QLoRA (LoRA on a 4-bit quantized base model; enables fine-tuning 65B-class models on a single 48 GB GPU and smaller models on consumer GPUs), Prefix Tuning (prepends trainable soft tokens), Prompt Tuning (the lightest approach; trains only soft prompt embeddings added to the input), INT8 Quantization (LLM.int8(), about 2× VRAM reduction), GPTQ (post-training quantization to 4-bit, fast inference), AWQ (Activation-aware Weight Quantization, often better quality than GPTQ at the same bit-width), and GGUF (llama.cpp format for CPU/GPU inference). Each entry explains how it works step by step, its use cases, limitations, and recommended tools.
How to use
- 1 Use the Quick Comparison table to find the right technique at a glance.
- 2 Filter by category: Fine-tuning or Quantization.
- 3 Click any technique to expand its full explanation.
- 4 Read "How it works" for the step-by-step mechanism.
- 5 Compare Use Cases vs Limitations to choose the best approach for your hardware and task.
- What is LoRA and why is it popular?
- LoRA (Low-Rank Adaptation) adds small trainable matrices to attention layers while freezing the original weights. Only ~0.1–1% of parameters are trained, which can reduce VRAM by roughly 3–10× compared with full fine-tuning, depending on the model and optimizer. The trained adapter is a small file (~10–100 MB) that can be shared and swapped. It is used for Stable Diffusion, Llama fine-tunes, and most open-source model customizations.
- What is the difference between LoRA and QLoRA?
- QLoRA combines LoRA with a 4-bit quantized base model (NF4 quantization via bitsandbytes). This reduces the base model's VRAM by ~4× compared to bfloat16, making it possible to fine-tune 70B models on a single 48 GB GPU or 13B models on consumer GPUs. Quality loss vs full-precision LoRA is minimal.
- When should I use GPTQ vs AWQ?
- Both are post-training quantization methods that target 4-bit weights, but AWQ (Activation-aware Weight Quantization) is generally preferred for quality: it uses activation statistics to identify and protect important weights. GPTQ is older and has more community support. Both formats can be served with vLLM; GPTQ checkpoints are typically created with AutoGPTQ and AWQ checkpoints with AutoAWQ. llama.cpp does not load these formats directly and uses GGUF quantizations instead.
- What is GGUF and when should I use it?
- GGUF is the file format used by llama.cpp for running quantized models on CPUs and consumer GPUs (including Apple Silicon). It supports Q4_K_M, Q5_K_M, Q8_0, and other quantization levels. Use GGUF when you want to run a model locally with Ollama, LM Studio, or Jan without writing code.
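The LoRA answer above quotes ~0.1–1% trainable parameters and adapter files of ~10–100 MB. A back-of-the-envelope check is sketched below. Every model dimension in it is an assumption chosen to resemble a 7B-class transformer (hidden size 4096, 32 layers, four square attention projections per layer, 16-bit adapter weights); the function names are hypothetical, and real models and adapter file formats will differ.

```python
def lora_params_per_matrix(d_out: int, d_in: int, rank: int) -> int:
    """B is d_out x rank and A is rank x d_in: rank * (d_out + d_in) parameters."""
    return rank * (d_out + d_in)


def adapter_stats(rank, hidden=4096, n_layers=32, targets_per_layer=4,
                  base_params=7e9, bytes_per_param=2):
    """Return (adapter parameters, percent of base model, size in MB).

    All defaults are illustrative assumptions, not values from a real checkpoint.
    """
    per_matrix = lora_params_per_matrix(hidden, hidden, rank)
    total = per_matrix * targets_per_layer * n_layers
    percent = 100.0 * total / base_params
    size_mb = total * bytes_per_param / 1e6
    return total, percent, size_mb


if __name__ == "__main__":
    for rank in (8, 16, 32):
        total, percent, size_mb = adapter_stats(rank)
        print(f"r={rank:>2}: {total:>10,} params, "
              f"{percent:.2f}% of base, ~{size_mb:.0f} MB at 16-bit")
```

Under these assumptions, ranks 8 through 32 stay within the quoted ranges; larger ranks or more target matrices push both numbers up proportionally.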