
LLM Fine-tuning & Optimization Guide

LoRA, QLoRA, GPTQ, AWQ, GGUF — when to use each, VRAM requirements, quality tradeoffs.

Quick Comparison

Method          | Trainable % | VRAM   | Quality | Best for
Full Fine-tune  | 100%        | Full   | ★★★★★   | Best quality, large datasets
LoRA            | ~0.5%       | 3–5×↓  | ★★★★☆   | Single GPU, adapter swapping
QLoRA           | ~0.5%       | 8–12×↓ | ★★★★☆   | Consumer GPU, 7B–65B models
Prefix Tuning   | < 0.1%      | 10×↓   | ★★★☆☆   | Multi-task, storage efficiency
Prompt Tuning   | < 0.01%     | 20×↓   | ★★☆☆☆   | Extreme scale (10B+)
INT8 Quant      | N/A         | 2×↓    | ★★★★★   | Inference, GPU VRAM saving
GPTQ (4-bit)    | N/A         | 4×↓    | ★★★★☆   | Production serving, large models
AWQ (4-bit)     | N/A         | 4×↓    | ★★★★★   | Edge/on-device, best 4-bit quality
GGUF/llama.cpp  | N/A         | 4–8×↓  | ★★★★☆   | Local/offline CPU+GPU inference

LoRA (Hu et al. 2021, Microsoft) freezes the original pre-trained weights and adds a pair of small trainable low-rank matrices (A and B) alongside selected weight matrices, typically the attention projections. Only these low-rank matrices are trained, so the base model stays intact. At inference, the product BA can be merged into the base weights, so the adapted model adds no latency.

How it works
  1. For a weight matrix W (d×k), introduce two matrices: B (d×r) and A (r×k), where rank r << min(d, k)
  2. Forward pass: output = Wx + BAx (original + low-rank adaptation)
  3. Only A and B are updated during training; W is frozen
  4. Rank r controls the capacity vs. efficiency trade-off (common values: 4, 8, 16, 64)
  5. After training, merge: W' = W + BA for zero-overhead inference
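The steps above can be checked numerically. The sketch below is a toy illustration in plain Python, not a training implementation. It follows the shape convention of Hu et al., with B of size d×r and A of size r×k, so that BA has the same shape as W. The sizes, the random values standing in for trained weights, and the helper names (`matmul`, `matvec`, `rand_matrix`) are all assumptions made for the example, and the paper's α/r scaling factor is left out for brevity.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix; stands in for real weights in this toy example."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 6, 4, 2                      # toy sizes; real layers use d, k in the thousands

W = rand_matrix(d, k, rng)             # frozen pre-trained weight, d x k
B = rand_matrix(d, r, rng)             # trainable low-rank factor, d x r
A = rand_matrix(r, k, rng)             # trainable low-rank factor, r x k
x = [rng.gauss(0.0, 1.0) for _ in range(k)]

# Unmerged forward pass: Wx + B(Ax). Computing Ax first keeps the work low-rank.
Ax = matvec(A, x)
unmerged = [w + b for w, b in zip(matvec(W, x), matvec(B, Ax))]

# Merge once after training: W' = W + BA, then use W' like an ordinary weight.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
merged = matvec(W_merged, x)

# Both paths should agree up to floating-point rounding.
max_diff = max(abs(u - m) for u, m in zip(unmerged, merged))
print("outputs match:", max_diff < 1e-9)
```

Because the merged matrix W' has the same shape as W, serving the merged model costs the same as serving the base model. Keeping A and B separate instead is what allows adapters to be swapped without reloading the base weights.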
Use Cases
  • Fine-tuning on a single consumer GPU (16–24 GB VRAM)
  • Task-specific adapters that can be swapped without reloading base model
  • Style and tone adaptation for LLMs
  • Domain adaptation for code or domain-specific language
Limitations
  • Smaller trainable parameter count may underfit very large domain shifts
  • Rank selection requires experimentation
  • Not all model layers benefit equally; LoRA is typically applied to the Q, K, V, and O attention projection matrices
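When experimenting with rank, it helps to know how the trainable parameter count scales. A LoRA pair on a d×k weight adds r·(d + k) parameters, so the count grows linearly in r. The sketch below does this arithmetic for a hypothetical model. The hidden size, layer count, and total parameter count are illustrative assumptions rather than figures for any specific checkpoint, and each of the Q, K, V, O projections is assumed to be square (hidden × hidden).

```python
def lora_params(d, k, r):
    """Trainable parameters added by one LoRA pair: B (d x r) plus A (r x k)."""
    return r * (d + k)

# Assumed, illustrative model dimensions (not taken from a specific model card).
hidden = 4096                  # hidden size; each projection assumed hidden x hidden
layers = 32                    # number of transformer layers
adapted_per_layer = 4          # Q, K, V, O projections
base_params = 7_000_000_000    # assumed total parameters of the base model

counts = {}
for r in (4, 8, 16, 64):
    counts[r] = layers * adapted_per_layer * lora_params(hidden, hidden, r)
    share = 100 * counts[r] / base_params
    print(f"r={r:>2}: {counts[r]:>12,} trainable parameters ({share:.3f}% of base)")
```

Under these assumptions, every rank from 4 to 64 stays below 1% of the base model. Adapting more matrices (for example the MLP projections) or raising r increases the count proportionally.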
Tools
  • Hugging Face PEFT
  • Unsloth
  • LLaMA Factory
  • Axolotl
  • LitGPT

LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)