Fine-tuning + PEFT

Phase 45 of /root Year 5: parameter-efficient fine-tuning. LoRA, QLoRA, instruction tuning, DPO basics. K8s-native: KubeRay-distributed fine-tune jobs tracked in MLflow. 6-8 weeks, ~60-80 hours.

Seventh phase of Year 5. Customizing models without breaking the bank.

Full fine-tuning of a large model requires datacenter compute. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and QLoRA freeze the base weights and train small low-rank adapters, typically ~1% of the parameter count or less, making fine-tuning practical on consumer GPUs. By phase end you’ve fine-tuned at least one open-weights model on a small task, tracked the run in MLflow, and served the fine-tuned variant alongside the base via KServe.

This phase makes you operationally fluent at fine-tuning. The patterns transfer to instruction tuning, DPO (Direct Preference Optimization), and the next-generation alignment techniques.


Prerequisites

  • Phase 44 complete; inference optimization fluency
  • GPU available (24GB ideal; 16GB workable for QLoRA on smaller models)
  • 12 hrs/week budget reserved

Why this phase exists

In 2026 fine-tuning is the practical adaptation mechanism for production LLM applications. Few teams pre-train; many fine-tune. The patterns (LoRA, instruction tuning, DPO) are dominant. This phase installs them at production depth.


The pattern-first frame

Same eight steps.


1. PROBLEM

You have an open-weights base model. You want it to behave differently — answer in your company’s voice, follow a specific task format, refuse certain categories, adopt domain expertise. Fine-tuning adjusts the model. Full fine-tuning costs too much; PEFT makes it tractable.


2. PRINCIPLES

2.1 LoRA — low-rank adaptation

Instead of updating all weights, train small low-rank matrices that “shift” the original weights. Typical: ~1% of total parameters. Quality usually close to full fine-tune.

→ Pattern: fine-tuning-strategies — OUTLINE this phase

Investigate:

  • Walk LoRA math: W + ΔW where ΔW = B × A, with B (d × r) and A (r × k) sharing a small rank r ≪ min(d, k).
  • Why does LoRA work? (Hint: most useful adaptations are low-rank.)
  • What’s the cost/benefit of higher rank?
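A minimal sketch of the low-rank update, assuming the LoRA paper's convention ΔW = B × A with scaling alpha / r. The tiny dimensions, the helper names, and the pure-Python matrix multiply are illustrative choices only, not how PEFT implements it.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = r * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora


def lora_effective_weight(W, B, A, alpha, r):
    """Merge the adapter into the frozen weight: W + (alpha / r) * B @ A."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


random.seed(0)
d_out, d_in, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]  # trainable, zero init: adapter starts as a no-op

W_eff = lora_effective_weight(W, B, A, alpha, r)

full, lora = lora_param_counts(4096, 4096, 8)
print(f"4096x4096 layer at r=8: {lora} adapter params vs {full} full "
      f"({100 * lora / full:.2f}%)")
```

The parameter count shows the rank trade-off directly: adapter size grows linearly in r, so doubling the rank doubles trainable parameters and adapter memory, while the frozen W is untouched.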

2.2 QLoRA — quantized LoRA

Combine quantization (4-bit base) with LoRA. The base model fits on a small GPU; only the LoRA adapter is trained. The unlock that enabled fine-tuning a 65B-parameter model on a single 48GB GPU.

Investigate:

  • Walk QLoRA: base model in 4-bit (NF4) + LoRA adapters in 16-bit (BF16 in the paper), double quantization of the quantization constants for memory.
  • Why is gradient computation through 4-bit base still feasible?
  • When does QLoRA degrade quality vs LoRA on FP16 base?
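The memory argument behind QLoRA can be checked with back-of-the-envelope arithmetic. The sketch below counts weight memory only; it ignores activations, gradients, optimizer state, and quantization constants. The 8B-class shape numbers are illustrative assumptions, and treating every adapted matrix as d_model × d_model is a simplification.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_adapter_params(n_layers, d_model, rank, matrices_per_layer):
    """Adapter parameter count, assuming every adapted matrix is d_model x d_model."""
    return n_layers * matrices_per_layer * rank * (d_model + d_model)


# Hypothetical 8B-class model; the shape numbers are illustrative assumptions.
N_PARAMS = 8e9
adapter = lora_adapter_params(n_layers=32, d_model=4096, rank=16, matrices_per_layer=4)

base_16bit = weight_memory_gb(N_PARAMS, 16)    # frozen base kept in 16-bit
base_4bit = weight_memory_gb(N_PARAMS, 4)      # frozen base quantized to 4-bit
adapter_16bit = weight_memory_gb(adapter, 16)  # trainable adapter in 16-bit

print(f"base @ 16-bit: {base_16bit:.1f} GB")
print(f"base @ 4-bit:  {base_4bit:.1f} GB")
print(f"adapter:       {adapter:,} params, {adapter_16bit:.3f} GB")
```

The frozen base dominates the footprint, which is why quantizing it (and not the adapter) is where the savings come from.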

2.3 Instruction tuning

Fine-tune on (instruction, response) pairs. Teaches the model to follow instructions specifically. The dataset shape (format, diversity, quality) dominates outcome.

Investigate:

  • Walk instruction tuning vs raw next-token prediction.
  • Why does dataset quality matter more than dataset size for instruction tuning?
  • What’s the role of the supervision signal (supervised demonstrations in SFT vs preference feedback in RLHF)?
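One concrete difference from raw next-token prediction is where the loss is applied. A common choice, sketched here under assumptions, is to compute loss on response tokens only by masking prompt positions in the labels. The prompt template and the whitespace tokenizer are hypothetical stand-ins. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def make_toy_tokenizer():
    """Whitespace tokenizer with a growing vocabulary (illustration only)."""
    vocab = {}

    def tokenize(text):
        return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

    return tokenize


def build_example(instruction, response, tokenize):
    """Turn one (instruction, response) pair into input_ids and masked labels."""
    # Hypothetical template; real runs should use the base model's own chat format.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenize(prompt)
    response_ids = tokenize(response)
    return {
        "input_ids": prompt_ids + response_ids,
        # Prompt positions are masked, so only response tokens contribute to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
        "prompt_len": len(prompt_ids),
    }


tokenize = make_toy_tokenizer()
ex = build_example("Restart the pod", "kubectl delete pod web-0", tokenize)
print(ex["input_ids"])
print(ex["labels"])
```

With raw next-token prediction the labels would simply equal input_ids, so the model would also be trained to reproduce the instruction text.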

2.4 DPO — Direct Preference Optimization

Train on (prompt, chosen_response, rejected_response) tuples to align toward preferred behavior. Simpler than RLHF; comparable results in many cases.

Investigate:

  • Walk DPO loss: raise the likelihood of chosen over rejected, measured as a log-likelihood ratio against a frozen reference model.
  • Why is DPO simpler than RLHF? (Hint: no reward model + PPO needed.)
  • When does RLHF beat DPO?
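The DPO loss for a single preference tuple can be sketched directly from sequence log-probabilities. This follows the form in the DPO paper, -log σ(β · margin), where the margin compares policy-vs-reference log-ratios for the chosen and rejected responses. The function name, argument names, and example numbers are illustrative assumptions.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) tuple.

    Arguments are summed log-probabilities of each full response under the
    trainable policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference expressed yet.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response's log-probability relative to the reference.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)

# Policy has raised the rejected response instead.
worse = dpo_loss(-10.0, -9.0, -10.0, -12.0)

print(at_init, improved, worse)
```

No reward model appears anywhere: the log-ratio against the reference plays that role implicitly, which is the simplification relative to RLHF with PPO.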

2.5 Dataset preparation as the actual work

The model isn’t the bottleneck; the data is. Data quality, format consistency, deduplication, contamination check (test set leakage), distribution shaping.

Investigate:

  • What’s a contamination check, and why does it matter?
  • How do you curate a diverse instruction dataset?
  • What’s data deduplication’s role in dataset prep?
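A minimal sketch of two of the prep steps above: exact deduplication after normalization, and a contamination check that flags training examples sharing any word n-gram with a benchmark. The n-gram length of 8, the normalization rules, and the sample strings are assumptions for illustration; production pipelines often add near-duplicate detection on top.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe(examples):
    """Drop exact duplicates (after normalization), keeping first occurrences."""
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated(train_examples, benchmark_examples, n=8):
    """Return training examples that share at least one n-gram with the benchmark."""
    bench = set()
    for b in benchmark_examples:
        bench |= ngrams(b, n)
    return [ex for ex in train_examples if ngrams(ex, n) & bench]


train = [
    "How do I restart a pod?",
    "how do I  restart a POD?",
    "Explain the difference between a Deployment and a StatefulSet in Kubernetes",
]
benchmark = [
    "Q: explain the difference between a deployment and a statefulset in kubernetes please",
]

unique = dedupe(train)
leaks = contaminated(unique, benchmark)
print(len(unique), "unique;", len(leaks), "flagged as possible benchmark leakage")
```

Flagged examples should be removed from training data (or the benchmark discarded as a measure), otherwise eval scores reflect memorization.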

2.6 K8s-native fine-tune workflow

Fine-tune as a KubeRay RayJob: declare the job → operator runs it → MLflow tracks → registered model lands in registry. Composes with Phase 37’s distributed training stack.

→ Pattern: operator-pattern reinforced

Investigate:

  • Walk a fine-tune RayJob CRD: container image + dataset reference + LoRA config.
  • How does the fine-tuned adapter end up in MLflow?
  • How does KServe load the base model + LoRA adapter for serving?
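To make the "declare the job" step concrete, here is a sketch that builds a RayJob manifest as a Python dict and prints it as JSON, which kubectl accepts in place of YAML. The field names reflect the KubeRay RayJob CRD (ray.io/v1) as I understand it; verify them against the CRD version installed in your cluster. The image, the train_lora.py script and its flags, and the dataset URI are hypothetical.

```python
import json


def finetune_rayjob(name, image, dataset_uri, lora_rank, workers, gpus_per_worker=1):
    """Build a RayJob manifest for a LoRA fine-tune (sketch; fields are assumptions)."""
    head = {"name": "ray-head", "image": image}
    worker = {
        "name": "ray-worker",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical training script and flags baked into the image.
            "entrypoint": (
                f"python train_lora.py --dataset {dataset_uri} --lora-rank {lora_rank}"
            ),
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [head]}},
                },
                "workerGroupSpecs": [{
                    "groupName": "gpu-workers",
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [worker]}},
                }],
            },
        },
    }


manifest = finetune_rayjob(
    name="lora-finetune",
    image="registry.example.com/finetune:dev",   # hypothetical image
    dataset_uri="s3://example-bucket/train.jsonl",  # hypothetical dataset location
    lora_rank=16,
    workers=2,
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from a function keeps the three things that vary per run (image, dataset reference, LoRA config) as explicit parameters, which is the shape a reusable job template would need.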

3. TRADE-OFFS

Decision | Options | Cost
Method | LoRA; QLoRA; full fine-tune; prompt tuning | LoRA: standard. QLoRA: when GPU memory is tight. Full FT: when you have the compute. Prompt tuning: very small task.
Library | PEFT (HuggingFace); raw PyTorch; Unsloth | PEFT: standard (recommended). Unsloth: faster, less flexible.
Alignment technique | SFT (instruction tuning); DPO; RLHF | SFT: simplest. DPO: middle. RLHF: most powerful, complex.
Dataset source | Public (HuggingFace Hub); synthetic; private | Public: convenient, generic. Synthetic: targeted, risk of artifacts. Private: best quality, hardest to obtain.

4. TOOLS (as of 2026-06)

  • PEFT (HuggingFace) — LoRA + QLoRA library
  • TRL (HuggingFace) — SFT + DPO + RLHF trainers
  • bitsandbytes — 4-bit quantization
  • Unsloth — speed-optimized LoRA training
  • KubeRay for distribution
  • MLflow for tracking

Reading

  • LoRA paper (Hu et al.)
  • QLoRA paper (Dettmers et al.)
  • DPO paper (Rafailov et al.)
  • HuggingFace PEFT + TRL docs

5. MASTERY: Fine-tune via KubeRay

[ ] Pick a base model (Llama 3.1 8B, Mistral, or similar open weights)
[ ] Pick a fine-tune task (your ops-handbook style mimicry, code-completion, classification)
[ ] Curate or compose dataset (~1K-10K examples)
[ ] Contamination check against any benchmark you care about
[ ] LoRA fine-tune locally (single GPU); verify reasonable loss curve
[ ] Switch to KubeRay RayJob for distributed fine-tune (multi-GPU)
[ ] Track in MLflow: hyperparameters, loss curves, adapter artifacts
[ ] Eval the fine-tuned model: held-out test set + qualitative inspection
[ ] Register fine-tuned adapter in MLflow registry
[ ] Deploy via KServe: base model + LoRA adapter via vLLM LoRA support
[ ] Optional: DPO fine-tune on a small preference dataset
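For the serving step in the checklist above, a sketch of how the same prompt can be sent to the base model and to the adapter. It assumes a vLLM OpenAI-compatible server started with --enable-lora and the adapter registered through --lora-modules, in which case the adapter is selected by passing its registered name in the request's model field. Both served names below are hypothetical, and the snippet only builds the request bodies rather than sending them.

```python
import json

BASE_MODEL = "base-model"            # hypothetical served name of the base model
ADAPTER_NAME = "ops-handbook-lora"   # hypothetical name given to the LoRA adapter


def completion_request(model_name, prompt, max_tokens=128, temperature=0.0):
    """Body for a POST to the server's /v1/completions endpoint."""
    return {
        "model": model_name,         # base name or adapter name picks the variant
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 keeps the comparison repeatable
    }


def ab_requests(prompt):
    """Same prompt for both variants, to compare base vs fine-tuned behavior."""
    return {name: completion_request(name, prompt)
            for name in (BASE_MODEL, ADAPTER_NAME)}


requests_by_model = ab_requests("Summarize the on-call escalation policy.")
for name, body in requests_by_model.items():
    print(name, json.dumps(body))
```

Sending both bodies to the same endpoint and diffing the completions is a quick way to verify the targeted behavior change before running the full held-out eval.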

6. COMPARE: Unsloth

Pick one fine-tune; run it via Unsloth’s optimized path. Compare training time + quality.

400-word reflection.


7. OPERATE

  • 2-3 runbooks: fine-tune OOM, LoRA serving failure (adapter not loading), eval regression
  • 1-2 ADRs (LoRA over QLoRA when fits; PEFT lib over raw PyTorch)
  • Weekly log

8. CONTRIBUTE

  • PEFT / TRL — docs, edge cases
  • A public dataset + LoRA fine-tune on HuggingFace Hub

What ships from this phase

  • At least one fine-tuned model registered + served on basecamp
  • Fine-tune RayJob template in basecamp/ml-infra-helpers
  • Fine-tune runbooks

Validation criteria

[ ] LoRA fine-tune working via KubeRay RayJob
[ ] Fine-tuned model registered + served via KServe
[ ] Quality eval shows targeted behavior change
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 2-3 runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
    - fine-tuning-strategies → OUTLINE
[ ] Exit Test passed

Exit Test

Time: 2.5 hours.

Part 1: Build (90 min)

Run a small LoRA fine-tune via KubeRay. Register adapter. Serve via KServe + vLLM LoRA support. Verify behavior change.

Part 2: Articulate (60 min)

~1000 words: “Walk a LoRA fine-tune end-to-end. Math (low-rank update), engineering (KubeRay job, MLflow tracking), deployment (LoRA adapter loading at serving). Cite patterns.”


Anti-patterns

Anti-pattern | Why
Fine-tuning without eval | Can’t tell if you helped or hurt
Skipping contamination check | Benchmark scores inflated, real perf worse
Using bad/unfiltered data | Worse model than baseline
Full fine-tune when LoRA suffices | Far more GPU memory and cost (full gradients and optimizer state) for marginal gain

Patterns touched this phase

  • fine-tuning-strategies — OUTLINE
  • operator-pattern reinforced

→ Next: Phase 46: LLM Gateway — ship llm-gateway