What is LoRA and why is it popular?

LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable matrices, typically to the attention layers. Only ~0.1–1% of parameters are trained, which cuts VRAM use roughly 3–10× compared with full fine-tuning. The trained adapter is a small file (~10–100 MB) that can be shared and swapped. It is widely used for Stable Diffusion, Llama fine-tunes, and most open-source model customizations.
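As a rough check on the ~0.1–1% figure, the arithmetic for a single adapted matrix can be sketched as below. The 4096×4096 shape and rank 8 are illustrative assumptions, not values taken from a specific model, and `lora_param_fraction` is a hypothetical helper name.

```python
def lora_param_fraction(d: int, k: int, r: int) -> float:
    """Fraction of trainable parameters when a d x k weight gets a rank-r adapter."""
    full = d * k            # parameters in the frozen weight matrix
    adapter = r * (d + k)   # parameters in the two low-rank factors
    return adapter / full

# Illustrative shape: a 4096 x 4096 projection with rank 8 (assumed values).
fraction = lora_param_fraction(4096, 4096, 8)
print(f"{fraction:.2%} of this matrix's parameters are trainable")
```

Because layers without adapters contribute no trainable parameters, the fraction measured over a whole model is typically lower than the per-matrix figure.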

What is the difference between LoRA and QLoRA?

QLoRA combines LoRA with a 4-bit quantized base model (NF4 quantization via bitsandbytes). This reduces the base model's VRAM by ~4× compared to bfloat16, making it possible to fine-tune 70B models on a single 48 GB GPU or 13B models on consumer GPUs. Quality loss vs full-precision LoRA is minimal.
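The ~4× saving follows directly from bits per parameter. The sketch below estimates memory for the base model weights only; it deliberately ignores quantization constants, adapter weights, optimizer state, and activations, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, and the model sizes are the ones mentioned above.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for billions in (13, 70):
    n = billions * 1e9
    bf16 = weight_memory_gb(n, 16)   # 16 bits per weight
    nf4 = weight_memory_gb(n, 4)     # 4 bits per weight
    print(f"{billions}B: bfloat16 ~{bf16:.0f} GB, 4-bit ~{nf4:.1f} GB")
```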

When should I use GPTQ vs AWQ?

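Both are post-training quantization methods that reduce weights to 4 bits, but AWQ (Activation-aware Weight Quantization) is generally preferred for quality: it uses activation statistics to identify and protect important weights. GPTQ is faster to quantize and has more community support. For inference, vLLM can serve both formats. GPTQ checkpoints are typically produced and loaded with AutoGPTQ, while AWQ has its own tooling (AutoAWQ). llama.cpp does not load GPTQ or AWQ checkpoints; it uses its own GGUF quantization formats.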
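For orientation, here is a minimal sketch of plain round-to-nearest 4-bit quantization of one weight group. This is neither GPTQ nor AWQ; it is the simple baseline that both improve on (GPTQ by compensating rounding error with second-order information, AWQ by rescaling weights using activation statistics). The function names and the sample weights are made up for illustration.

```python
def quantize_rtn_4bit(weights):
    """Asymmetric round-to-nearest quantization of one group to 4-bit codes (0..15)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]

# Hypothetical weight group, chosen only to show the round trip.
group = [-0.42, -0.10, 0.03, 0.07, 0.15, 0.31, 0.88, -0.27]
codes, scale, lo = quantize_rtn_4bit(group)
restored = dequantize(codes, scale, lo)
max_err = max(abs(w - r) for w, r in zip(group, restored))
print(codes, f"max error {max_err:.4f} (scale {scale:.4f})")
```

The worst-case error per weight is half a quantization step, so a single outlier in a group widens the step for every other weight. Both methods are, loosely speaking, ways of spending that error budget more carefully.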

What is GGUF and when should I use it?

GGUF is the file format used by llama.cpp for running quantized models on CPUs and consumer GPUs (including Apple Silicon). It supports Q4_K_M, Q5_K_M, Q8_0, and other quantization levels. Use GGUF when you want to run a model locally with Ollama, LM Studio, or Jan without writing code.
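To get a feel for file sizes, the sketch below estimates the size of a Q8_0 model. It assumes the Q8_0 block layout of 32 8-bit weights sharing one 16-bit scale. Real GGUF files also carry metadata and may store some tensors in other types, so treat the result as approximate. The 7B parameter count is an illustrative assumption.

```python
Q8_0_BLOCK_WEIGHTS = 32          # weights per block
Q8_0_BLOCK_BYTES = 32 * 1 + 2    # 32 int8 values plus one 16-bit scale

def q8_0_size_gb(n_params: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    blocks = n_params / Q8_0_BLOCK_WEIGHTS
    return blocks * Q8_0_BLOCK_BYTES / 1e9

bits_per_weight = Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_WEIGHTS
print(f"Q8_0: {bits_per_weight} bits/weight, ~{q8_0_size_gb(7e9):.1f} GB for a 7B model")
```

Lower levels such as Q4_K_M and Q5_K_M use fewer bits per weight and produce proportionally smaller files, at some cost in quality.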

LLM Fine-tuning & Optimization Guide

LoRA, QLoRA, GPTQ, AWQ, GGUF — when to use each, VRAM requirements, quality tradeoffs.

Quick Comparison

Method	Trainable params	VRAM (× reduction)	Quality	Best for
Full Fine-tune	100%	Baseline	★★★★★	Best quality, large datasets
LoRA	~0.5%	3–5×↓	★★★★☆	Single GPU, adapter swapping
QLoRA	~0.5%	8–12×↓	★★★★☆	Consumer GPU, 7B–65B models
Prefix Tuning	< 0.1%	10×↓	★★★☆☆	Multi-task, storage efficiency
Prompt Tuning	< 0.01%	20×↓	★★☆☆☆	Extreme scale (10B+)
INT8 Quant	N/A	2×↓	★★★★★	Inference, GPU VRAM saving
GPTQ (4-bit)	N/A	4×↓	★★★★☆	Production serving, large models
AWQ (4-bit)	N/A	4×↓	★★★★★	Edge/on-device, best 4-bit quality
GGUF/llama.cpp	N/A	4–8×↓	★★★★☆	Local/offline CPU+GPU inference

LoRA (Hu et al. 2021, Microsoft) freezes the original pre-trained weights and injects two small trainable matrices (A and B) into each attention layer. Only these low-rank matrices are trained, keeping the original model intact. At inference, the adapted weights can be merged back — zero latency overhead.

How it works

1. For a weight matrix W (d×k), introduce two matrices: B (d×r) and A (r×k), where rank r << min(d, k)
2. Forward pass: output = Wx + BAx (original + low-rank adaptation)
3. Only A and B are updated during training; W is frozen
4. Rank r controls the capacity vs. efficiency trade-off (common values: 4, 8, 16, 64)
5. After training, merge: W' = W + BA for zero-overhead inference
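The forward pass and the merge step can be sketched in plain Python. This uses the shape convention from the LoRA paper, where B is d×r and A is r×k, so that BA has the same shape as W. The toy sizes and values are assumptions for illustration; a real implementation would use a tensor library and train A and B by gradient descent.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Toy sizes (assumed): d = 3 outputs, k = 4 inputs, rank r = 2.
W = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, -2.0, 1.0, 3.0]]   # frozen weight, d x k
B = [[0.1, 0.0],
     [0.0, 0.2],
     [-0.1, 0.1]]             # trainable, d x r
A = [[1.0, 0.0, -1.0, 0.0],
     [0.0, 2.0, 0.0, 1.0]]    # trainable, r x k
x = [1.0, 2.0, 3.0, 4.0]

# Training-time forward pass: frozen path plus low-rank path.
low_rank = matvec(B, matvec(A, x))
y_adapter = [a + b for a, b in zip(matvec(W, x), low_rank)]

# After training: fold the adapter into a single matrix W' = W + BA.
W_merged = add(W, matmul(B, A))
y_merged = matvec(W_merged, x)

print(y_adapter)
print(y_merged)
```

Both paths produce the same output up to floating-point rounding, which is why merging adds no inference latency. Keeping A and B separate is what allows adapters to be swapped over a shared base model.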

Use Cases

→ Fine-tuning on a single consumer GPU (16–24 GB VRAM)
→ Task-specific adapters that can be swapped without reloading the base model
→ Style and tone adaptation for LLMs
→ Domain adaptation for code or domain-specific language

Limitations

− Smaller trainable parameter count may underfit very large domain shifts
− Rank selection requires experimentation
− Not all model layers benefit equally; LoRA is typically applied to the Q, K, V, O attention matrices

Tools

Hugging Face PEFT, Unsloth, LLaMA Factory, Axolotl, LitGPT

LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
