Fine-tuning + PEFT
Phase 45 of /root Year 5: parameter-efficient fine-tuning. LoRA, QLoRA, instruction tuning, DPO basics. K8s-native: KubeRay-distributed fine-tune jobs tracked in MLflow. 6-8 weeks, ~60-80 hours.
Seventh phase of Year 5. Customizing models without breaking the bank.
Full fine-tuning of a large model requires datacenter compute. Parameter-Efficient Fine-Tuning (PEFT), with LoRA and QLoRA as the main methods, freezes the base weights and trains small low-rank adapters amounting to roughly 1% of the parameter count or less, which makes fine-tuning practical on consumer GPUs. By phase end you’ve fine-tuned at least one open-weights model on a small task, tracked the run in MLflow, and served the fine-tuned variant alongside the base via KServe.
This phase makes you operationally fluent at fine-tuning. The patterns transfer to instruction tuning, DPO (Direct Preference Optimization), and the next-generation alignment techniques.
Prerequisites
- Phase 44 complete; inference optimization fluency
- GPU available (24GB ideal; 16GB workable for QLoRA on smaller models)
- 12 hrs/week budget reserved
Why this phase exists
In 2026 fine-tuning is the practical adaptation mechanism for production LLM applications. Few teams pre-train; many fine-tune. The patterns (LoRA, instruction tuning, DPO) are dominant. This phase installs them at production depth.
The pattern-first frame
Same eight steps.
1. PROBLEM
You have an open-weights base model. You want it to behave differently — answer in your company’s voice, follow a specific task format, refuse certain categories, adopt domain expertise. Fine-tuning adjusts the model. Full fine-tuning costs too much; PEFT makes it tractable.
2. PRINCIPLES
2.1 LoRA — low-rank adaptation
Instead of updating all weights, freeze them and train small low-rank matrices whose product “shifts” the original weights. Trainable parameters are typically ~1% of the total or less. Quality is usually close to a full fine-tune.
→ Pattern: fine-tuning-strategies — OUTLINE this phase
Investigate:
- Walk LoRA math: W + ΔW where ΔW = B × A, with B of shape d × r, A of shape r × k, and rank r much smaller than d and k.
- Why does LoRA work? (Hint: the weight updates needed for adaptation tend to have low intrinsic rank.)
- What’s the cost/benefit of higher rank?
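The low-rank update can be sketched in plain Python. This is a minimal illustration, not a training loop: it assumes the LoRA paper's convention (ΔW = B × A scaled by alpha / r, B initialized to zero, A to small random values), and the toy dimensions and the helper names `matmul`, `lora_effective_weight`, and `lora_param_counts` are made up for the example.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W0, B, A, alpha, r):
    """Return W0 + (alpha / r) * (B x A), the weight the adapted layer applies."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]

def lora_param_counts(d, k, r):
    """Trainable parameters: full update (d*k) versus the two LoRA factors (r*(d+k))."""
    return d * k, r * (d + k)

# Toy layer: W0 is d x k and stays frozen; only B (d x r) and A (r x k) would train.
d, k, r, alpha = 6, 4, 2, 4
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                               # zero init

# With B at zero the update is zero, so training starts from the base model.
W_start = lora_effective_weight(W0, B, A, alpha, r)

# Parameter count for one 4096 x 4096 projection at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)
```

For the 4096 × 4096 layer at rank 8, the factors hold 65,536 parameters against 16,777,216 for a full update, about 0.39%. Doubling the rank doubles the adapter size, which is the cost side of the higher-rank question.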
2.2 QLoRA — quantized LoRA
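Combine quantization (4-bit base) with LoRA. The frozen base model is stored in 4-bit so it fits on a small GPU, and only the LoRA adapter is trained. This is the unlock that let the QLoRA authors fine-tune a 65B-parameter model on a single 48GB GPU; a 24GB card tops out at roughly the 33B class.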
Investigate:
- Walk QLoRA: base model in 4-bit (NF4) + LoRA adapters in 16-bit (BF16 in the paper), double quantization of the quantization constants for memory.
- Why is gradient computation through 4-bit base still feasible?
- When does QLoRA degrade quality vs LoRA on FP16 base?
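A back-of-envelope calculation shows where the memory saving comes from. The sketch below counts weight storage only; activations, gradients, adapter weights, and optimizer state are deliberately left out, so real usage will be higher. The block sizes (64 and 256) follow the QLoRA paper's description of double quantization, and the 8B parameter count is a hypothetical example.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def quant_constant_bits(block_size=64, double_quant=False, second_block=256):
    """Per-parameter overhead of the quantization constants, in bits.

    Without double quantization: one FP32 constant per block of weights.
    With it: one 8-bit constant per block, plus one FP32 constant per
    `second_block` of those first-level constants.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block)

n = 8e9  # hypothetical 8B-parameter base model
fp16 = weight_memory_gb(n, 16)
nf4 = weight_memory_gb(n, 4 + quant_constant_bits())
nf4_dq = weight_memory_gb(n, 4 + quant_constant_bits(double_quant=True))
print(f"16-bit: {fp16:.1f} GB, 4-bit: {nf4:.2f} GB, 4-bit + double quant: {nf4_dq:.2f} GB")
```

Under these assumptions, double quantization cuts the constant overhead from 0.5 bits to about 0.127 bits per parameter. That is small for an 8B model but adds up to several GB at 65B.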
2.3 Instruction tuning
Fine-tune on (instruction, response) pairs. Teaches the model to follow instructions specifically. The dataset shape (format, diversity, quality) dominates outcome.
Investigate:
- Walk instruction tuning vs raw next-token prediction.
- Why does dataset quality matter more than dataset size for instruction tuning?
- What’s the role of the training signal (supervised fine-tuning on demonstrations vs preference-based RLHF)?
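One common way instruction tuning differs from raw next-token prediction is that the loss is computed on the response tokens only. The sketch below illustrates that masking with a toy whitespace tokenizer. The prompt template and helper names are hypothetical, a real run would use the model's own tokenizer and chat template, and masking the prompt is a frequent choice rather than a universal rule. The one-position shift for next-token prediction is left to the training code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Hypothetical prompt template; real models ship their own chat templates.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def toy_tokenize(text, vocab):
    """Whitespace 'tokenizer' that grows a vocab on the fly (stand-in for a real one)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def build_example(instruction, response, vocab):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    prompt_ids = toy_tokenize(PROMPT_TEMPLATE.format(instruction=instruction), vocab)
    response_ids = toy_tokenize(response, vocab)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }

vocab = {}
ex = build_example("Summarize the incident.", "Disk filled; logs rotated.", vocab)
print(ex["input_ids"])
print(ex["labels"])
```

Keeping one template across the whole dataset is part of the format consistency that the section says dominates the outcome.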
2.4 DPO — Direct Preference Optimization
Train on (prompt, chosen_response, rejected_response) tuples to align toward preferred behavior. Simpler than RLHF; comparable results in many cases.
Investigate:
- Walk DPO loss: push the policy-to-reference log-likelihood ratio of the chosen response above that of the rejected one.
- Why is DPO simpler than RLHF? (Hint: no separate reward model and no PPO loop needed.)
- When does RLHF beat DPO?
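The DPO objective for a single preference pair can be written in a few lines. This sketch assumes you already have the summed log-probabilities of each response under the policy being trained and under the frozen reference model; the function name, argument names, and the beta value of 0.1 are illustrative choices.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) tuple.

    Each argument is the summed log-probability of a full response:
    pi_* under the policy being trained, ref_* under the frozen reference.
    """
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The reference log-probabilities act as an anchor: the loss rewards relative movement away from the reference rather than raw likelihood, and nothing in it requires a learned reward model.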
2.5 Dataset preparation as the actual work
The model isn’t the bottleneck; the data is. Data quality, format consistency, deduplication, contamination check (test set leakage), distribution shaping.
Investigate:
- What’s a contamination check, and why does it matter?
- How do you curate a diverse instruction dataset?
- What’s data deduplication’s role in dataset prep?
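Deduplication and a contamination check can both be sketched with the standard library. This is a deliberately simple version: exact-match dedup after normalization, and n-gram overlap against benchmark items. The function names, the n-gram length, and the sample strings are assumptions for illustration; production pipelines often add near-duplicate detection such as MinHash.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())

def dedupe(examples):
    """Drop exact duplicates after normalization, keeping first occurrences."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def ngrams(text, n):
    """Set of word n-grams of the normalized text."""
    toks = normalize(text).split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_examples, benchmark_items, n=8):
    """Return training examples that share any n-gram with a benchmark item."""
    bench = set()
    for item in benchmark_items:
        bench |= ngrams(item, n)
    return [ex for ex in train_examples if ngrams(ex, n) & bench]

train = [
    "Restart the pod, then check the logs.",
    "restart the pod then check the logs",
    "What is the capital of France and why is it famous?",
]
benchmark = ["Q: What is the capital of France and why?"]

clean = dedupe(train)
flagged = contaminated(clean, benchmark, n=5)
print(len(clean), flagged)
```

A shorter n catches more paraphrased leakage but also flags more common phrases, so the threshold is worth tuning per benchmark.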
2.6 K8s-native fine-tune workflow
Fine-tune as a KubeRay RayJob: declare the job → operator runs it → MLflow tracks → registered model lands in registry. Composes with Phase 37’s distributed training stack.
→ Pattern: operator-pattern reinforced
Investigate:
- Walk a fine-tune RayJob CRD: container image + dataset reference + LoRA config.
- How does the fine-tuned adapter end up in MLflow?
- How does KServe load the base model + LoRA adapter for serving?
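The shape of a fine-tune RayJob can be sketched by building the manifest as a Python dict and emitting JSON, which `kubectl apply -f` accepts as well as YAML. Field names follow the RayJob CRD as understood here and should be checked against the KubeRay version you run. The image name, the dataset URI, the `train_lora.py` script with its flags, and the `build_rayjob` helper are all hypothetical placeholders.

```python
import json

def build_rayjob(name, image, dataset_uri, lora_rank, gpu_workers):
    """Assemble a RayJob manifest as a plain dict (all values are placeholders)."""
    def pod(gpus):
        container = {"name": "ray", "image": image}
        if gpus:
            container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
        return {"spec": {"containers": [container]}}

    # Hypothetical training script and flags; the LoRA config rides on the entrypoint.
    entrypoint = f"python train_lora.py --dataset {dataset_uri} --lora-rank {lora_rank}"
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            "shutdownAfterJobFinishes": True,  # tear the cluster down when done
            "rayClusterSpec": {
                "headGroupSpec": {"rayStartParams": {}, "template": pod(0)},
                "workerGroupSpecs": [{
                    "groupName": "gpu-workers",
                    "replicas": gpu_workers,
                    "rayStartParams": {},
                    "template": pod(1),  # one GPU per worker pod
                }],
            },
        },
    }

manifest = build_rayjob(
    name="lora-finetune-demo",
    image="registry.example.com/finetune:latest",  # placeholder image
    dataset_uri="s3://example-bucket/instructions.jsonl",  # placeholder dataset
    lora_rank=16,
    gpu_workers=2,
)
print(json.dumps(manifest, indent=2))
```

The three inputs the section names (container image, dataset reference, LoRA config) each map to one argument, which is what makes the manifest a reasonable basis for a reusable template.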
3. TRADE-OFFS
| Decision | Options | Cost |
|---|---|---|
| Method | LoRA; QLoRA; full fine-tune; prompt tuning | LoRA: standard. QLoRA: when GPU memory is tight. Full FT: when you have the compute. Prompt tuning: very small task. |
| Library | PEFT (HuggingFace); raw PyTorch; Unsloth | PEFT: standard (recommended). Unsloth: faster, less flexible. |
| Alignment technique | SFT (instruction tuning); DPO; RLHF | SFT: simplest. DPO: middle. RLHF: most powerful, complex. |
| Dataset source | Public (HuggingFace Hub); synthetic; private | Public: convenient, generic. Synthetic: targeted, risk of artifacts. Private: best quality, hardest to obtain. |
4. TOOLS (as of 2026-06)
- PEFT (HuggingFace) — LoRA + QLoRA library
- TRL (HuggingFace) — SFT + DPO + RLHF trainers
- bitsandbytes — 4-bit quantization
- Unsloth — speed-optimized LoRA training
- KubeRay for distribution
- MLflow for tracking
Reading
- LoRA paper (Hu et al.)
- QLoRA paper (Dettmers et al.)
- DPO paper (Rafailov et al.)
- HuggingFace PEFT + TRL docs
5. MASTERY: Fine-tune via KubeRay
[ ] Pick a base model (Llama 3.1 8B, Mistral, or similar open weights)
[ ] Pick a fine-tune task (your ops-handbook style mimicry, code-completion, classification)
[ ] Curate or compose dataset (~1K-10K examples)
[ ] Contamination check against any benchmark you care about
[ ] LoRA fine-tune locally (single GPU); verify reasonable loss curve
[ ] Switch to KubeRay RayJob for distributed fine-tune (multi-GPU)
[ ] Track in MLflow: hyperparameters, loss curves, adapter artifacts
[ ] Eval the fine-tuned model: held-out test set + qualitative inspection
[ ] Register fine-tuned adapter in MLflow registry
[ ] Deploy via KServe: base model + LoRA adapter via vLLM LoRA support
[ ] Optional: DPO fine-tune on a small preference dataset
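For the held-out test set in the checklist above, one option is a deterministic split keyed on a stable example ID, so an example never migrates between train and test when the dataset is regenerated. This is a sketch under the assumption that each example has such an ID; the `ex-N` ID format and the 10% test fraction are illustrative.

```python
import hashlib

def split_bucket(example_id, test_fraction=0.1):
    """Assign an example to 'train' or 'test' from a hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if position < test_fraction else "train"

ids = [f"ex-{i}" for i in range(1000)]  # hypothetical example IDs
assignment = {i: split_bucket(i) for i in ids}
n_test = sum(1 for split in assignment.values() if split == "test")
print(n_test, len(ids) - n_test)
```

Because the assignment depends only on the ID, adding new examples later leaves every existing assignment unchanged, which keeps eval numbers comparable across fine-tune runs.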
6. COMPARE: Unsloth
Pick one fine-tune; run it via Unsloth’s optimized path. Compare training time + quality.
400-word reflection.
7. OPERATE
- 2-3 runbooks: fine-tune OOM, LoRA serving failure (adapter not loading), eval regression
- 1-2 ADRs (LoRA over QLoRA when fits; PEFT lib over raw PyTorch)
- Weekly log
8. CONTRIBUTE
- PEFT / TRL — docs, edge cases
- A public dataset + LoRA fine-tune on HuggingFace Hub
What ships from this phase
- At least one fine-tuned model registered + served on basecamp
- Fine-tune RayJob template in basecamp/ml-infra-helpers
- Fine-tune runbooks
Validation criteria
[ ] LoRA fine-tune working via KubeRay RayJob
[ ] Fine-tuned model registered + served via KServe
[ ] Quality eval shows targeted behavior change
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 2-3 runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
- fine-tuning-strategies → OUTLINE
[ ] Exit Test passed
Exit Test
Time: 2.5 hours.
Part 1: Build (90 min)
Run a small LoRA fine-tune via KubeRay. Register adapter. Serve via KServe + vLLM LoRA support. Verify behavior change.
Part 2: Articulate (60 min)
~1000 words: “Walk a LoRA fine-tune end-to-end. Math (low-rank update), engineering (KubeRay job, MLflow tracking), deployment (LoRA adapter loading at serving). Cite patterns.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Fine-tuning without eval | Can’t tell if you helped or hurt |
| Skipping contamination check | Benchmark scores inflated, real perf worse |
| Using bad/unfiltered data | Worse model than baseline |
| Full fine-tune when LoRA suffices | Far more GPU memory, optimizer state, and checkpoint storage for marginal gain |
Patterns touched this phase
- fine-tuning-strategies → OUTLINE
- operator-pattern → reinforced