Fine-Tuning Strategies

Parameter-Efficient Fine-Tuning (PEFT) — LoRA, QLoRA, adapters. Update small low-rank deltas instead of full model weights. Makes fine-tuning practical on consumer GPUs.

Don’t update all weights. Train small low-rank deltas that adapt a frozen base model. LoRA and QLoRA are the PEFT techniques that made fine-tuning a 24GB-GPU game; DPO is a preference-training objective often run on top of them. Status: STUB (to be promoted to OUTLINE in Y5 Phase 45).

What this pattern is

Fine-tuning strategies are the family of techniques for adapting a pre-trained LLM to a specific task without re-training the full model. Full fine-tuning updates all model parameters: it is expensive, prone to catastrophic forgetting, and requires production-scale GPUs. LoRA (Hu et al., 2021) freezes the base model and, for selected weight matrices (the attention projections in the original paper), learns a low-rank decomposition of the weight update; those small matrices are the only trained parameters, often under 1% of the model's parameter count. QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit while keeping the LoRA matrices in higher precision, which fits fine-tuning of a 65B-parameter model onto a single 48GB GPU. DPO (Direct Preference Optimization; Rafailov et al., 2023) trains directly on preference data (chosen vs rejected responses) without an explicit reward model; it is a training objective rather than a PEFT method and is commonly combined with LoRA. HuggingFace PEFT is the canonical OSS library for the parameter-efficient methods.
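The "under 1%" figure follows from the shapes involved. For a frozen d×k weight matrix, LoRA trains B (d×r) and A (r×k), so it adds r(d+k) trainable parameters instead of updating all d·k. A minimal sketch of that count, assuming a hypothetical 4096×4096 projection (an illustrative size, not tied to a particular model):

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA delta B @ A (B is d x r, A is r x k)."""
    return r * (d + k)


if __name__ == "__main__":
    d = k = 4096  # illustrative projection size (assumption)
    for r in (4, 8, 16, 64):
        ratio = lora_params(d, k, r) / full_params(d, k)
        print(f"rank {r:>2}: {lora_params(d, k, r):>9,} trainable params "
              f"({ratio:.2%} of the full matrix)")
```

For rank 8 this gives 65,536 trainable parameters against 16,777,216 in the full matrix, roughly 0.39%. The ratio grows linearly with r, which is why very large ranks erode LoRA's savings; and because LoRA is usually applied to only some of a model's matrices, its share of the whole model is smaller still.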

The pattern made fine-tuning operationally accessible: what required a cluster of A100s in 2022 fits on a used 3090 in 2026. Senior ML engineers know the trade-offs across full-FT, LoRA, QLoRA, and DPO and pick the right one per use case.

The pattern’s insight is that most task adaptation doesn’t require changing the whole model. A 70B model’s linguistic knowledge, world knowledge, and reasoning capabilities are already in the pre-trained weights. Adapting it to a specific task (customer support tone, medical terminology, code generation style) requires small, targeted changes to a handful of weight matrices, and the LoRA paper’s hypothesis is that those changes have low intrinsic rank. LoRA captures them in low-rank matrices; full fine-tuning wastefully updates parameters that don’t need to change.
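A minimal pure-Python sketch of a LoRA-wrapped linear layer, following the formulation in Hu et al. (2021): the output is W0·x + (alpha/r)·B·A·x, with A initialized to small random values and B to zero so training starts from exactly the base model. The class and helper names are hypothetical, and plain lists stand in for tensors; libraries such as HuggingFace PEFT apply the same idea to GPU tensors.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix-matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class LoRALinear:
    """Frozen weight W0 (d_out x d_in) plus a trainable rank-r delta (alpha / r) * B @ A."""

    def __init__(self, w0: Matrix, r: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(w0), len(w0[0])
        self.w0 = w0            # frozen: never updated during fine-tuning
        self.scale = alpha / r
        # A: small random init; B: zeros, so B @ A == 0 and training starts at the base model.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.b = [[0.0] * r for _ in range(d_out)]

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.w0, x)
        # B @ (A @ x) costs O(r * (d_in + d_out)) and never builds the full delta matrix.
        delta = matvec(self.b, matvec(self.a, x))
        return [h + self.scale * d for h, d in zip(base, delta)]

    def merged_weight(self) -> Matrix:
        """Fold the adapter into the base: W = W0 + (alpha / r) * B @ A."""
        ba = matmul(self.b, self.a)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.w0, ba)]
```

Because B starts at zero, forward(x) initially equals W0·x. After training, merged_weight() folds the delta into a single matrix that produces the same outputs, so an adapter can either be merged for zero-overhead inference or kept separate and swapped at serving time.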

The pattern also enables new deployment patterns. Multi-LoRA serving: one base model plus many LoRA adapters, each fine-tuned for a specific task. Users select the adapter at inference time; the base model weights are shared. This is dramatically cheaper than serving many full fine-tuned models. vLLM and TGI both support multi-LoRA serving natively.
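A toy sketch of the multi-LoRA structure, assuming a single linear layer stands in for the whole model. MultiLoRAServer, register, infer, and the adapter names are hypothetical and are not the vLLM or TGI APIs; production servers also handle batching and adapter memory management, which this sketch omits. What it shows is the sharing: every request reads the same base weights, and only the small per-task matrices differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class Adapter:
    a: Matrix     # r x d_in down-projection
    b: Matrix     # d_out x r up-projection
    scale: float  # alpha / r


class MultiLoRAServer:
    """Hypothetical toy server: one frozen base weight shared by every request,
    plus a registry of small per-task adapters chosen by name per request."""

    def __init__(self, base_weight: Matrix):
        self.base = base_weight
        self.adapters: Dict[str, Adapter] = {}

    def register(self, name: str, adapter: Adapter) -> None:
        self.adapters[name] = adapter

    def infer(self, x: Vector, adapter: Optional[str] = None) -> Vector:
        out = matvec(self.base, x)  # every request reads the same base weights
        if adapter is None:
            return out              # plain base-model behavior
        ad = self.adapters[adapter]
        delta = matvec(ad.b, matvec(ad.a, x))
        return [o + ad.scale * d for o, d in zip(out, delta)]
```

Adding a new task means registering another pair of small matrices; the base weights are loaded once no matter how many adapters are registered.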

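DPO and other preference-based methods are the current frontier for alignment fine-tuning. Instead of RLHF (which requires training a separate reward model), DPO directly optimizes the LLM to prefer chosen responses over rejected ones. It is simpler, more stable, and often produces better results. DPO variants change the data or the objective: KTO learns from unpaired good/bad labels instead of pairs, IPO swaps the log-sigmoid loss for a squared loss to curb overfitting, and ORPO folds the preference term into supervised fine-tuning without a reference model.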
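The DPO objective for one preference pair, as given by Rafailov et al. (2023), is L = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))]), where y_w is the chosen response, y_l the rejected one, pi_ref a frozen copy of the starting model, and beta controls how far the policy may drift from it. A minimal scalar sketch, assuming the summed response log-probabilities have already been computed; beta=0.1 is an illustrative default, and libraries such as Hugging Face TRL wrap the batched version in a trainer.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each *_logp is the summed log-probability of the whole response under
    the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # how much the policy up-weights y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # how much it up-weights y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)) == softplus(-margin)
    return softplus(-margin)
```

With identical policy and reference log-probabilities the margin is zero and the loss is log 2 (about 0.693); it falls below that as the policy raises the chosen response relative to the rejected one, with no reward model or RL loop involved.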

Concrete instances in the wild

  • HuggingFace PEFT. OSS library for LoRA, QLoRA, adapters, prefix tuning. Standard for fine-tuning in 2026.
  • LoRA (Hu et al., 2021). Low-rank adaptation. The technique that made PEFT mainstream.
  • QLoRA (Dettmers et al., 2023). Quantized base + higher-precision LoRA. Consumer GPU fine-tuning.
  • DPO (Direct Preference Optimization; Rafailov et al., 2023). Preference-based alignment without a separate reward model.
  • KTO, IPO, ORPO. DPO variants for different preference-data shapes.
  • Full fine-tuning. Traditional approach. Still useful for major model adaptation (medical, legal, domain-specific base models).
  • Prefix tuning / Prompt tuning. Freeze model; train soft prompts. Older PEFT technique; less common in 2026.
  • Adapters (Houlsby et al., 2019). Insert small trainable modules between transformer layers. Similar to LoRA in spirit.
  • Axolotl. Popular OSS fine-tuning framework. Wraps PEFT for common workflows.
  • Unsloth. Optimized fine-tuning framework. Faster than raw PEFT for common cases.
  • Together.ai / Fireworks.ai fine-tuning APIs. Managed fine-tuning for teams without GPU infrastructure.

Why this pattern matters

Full fine-tuning is expensive. Fully fine-tuning a 70B model requires hundreds of GB to over a terabyte of GPU memory, days of training time, and multi-GPU coordination, which put it out of reach for all but teams with production-scale GPU budgets. LoRA and QLoRA collapsed the cost by orders of magnitude. Fine-tuning became something a small team with one GPU could do overnight rather than something requiring cloud budgets.
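A back-of-the-envelope memory estimate makes the gap concrete. The sketch assumes the common rule of thumb of 16 bytes per parameter for full fine-tuning with mixed-precision Adam (bf16 weights and gradients, an fp32 master copy, and two fp32 optimizer moments) and about 0.5 bytes per parameter for a 4-bit base model. It ignores activations, quantization constants, and framework overhead, and the 200M-parameter adapter size in the usage note is hypothetical.

```python
GB = 1e9

# Rule-of-thumb bytes per parameter (assumptions, not measurements).
FULL_FT_BYTES = 2 + 2 + 4 + 4 + 4  # bf16 weights + bf16 grads + fp32 master + Adam m + Adam v
NF4_BYTES = 0.5                     # 4-bit frozen base weights, ignoring quantization constants
LORA_TRAIN_BYTES = FULL_FT_BYTES    # trainable adapter params still pay the full optimizer cost


def full_ft_gb(n_params: float) -> float:
    """Approximate weight + gradient + optimizer memory for full fine-tuning."""
    return n_params * FULL_FT_BYTES / GB


def qlora_gb(n_params: float, n_lora_params: float) -> float:
    """Approximate memory for a 4-bit frozen base plus trainable LoRA adapters."""
    return (n_params * NF4_BYTES + n_lora_params * LORA_TRAIN_BYTES) / GB
```

For a 70B model, full_ft_gb(70e9) comes to 1,120 GB, while qlora_gb(70e9, 200e6) is about 38 GB (35 GB of 4-bit base weights plus 3.2 GB for the adapter and its optimizer state). Activations add to both figures, but the difference between a multi-node job and a single large GPU is already visible.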

The economic shift changed what fine-tuning was for. Pre-LoRA, fine-tuning was reserved for major model adaptations (adapting a base model to medical text, code generation, etc.). Post-LoRA, fine-tuning became a normal iteration in the LLM app development loop (adapting to specific customer tones, task-specific behaviors, edge cases). Fine-tuning went from “occasional major project” to “regular iteration technique.”

For LLM applications specifically, fine-tuning fills gaps that prompt engineering can’t. Style consistency (matching a brand voice). Structured output reliability (consistently well-formed JSON). Domain-specific terminology (medical, legal, financial). Instruction adherence for specific instruction styles. Each is more reliably achieved through fine-tuning than through prompt engineering alone, especially at scale.

The pattern also enables per-customer or per-tenant model customization. Multi-LoRA serving means each tenant can have their own fine-tuned adapter over a shared base model. This scales economically: one base model serves many tenants, and adapters swap at inference time. Without PEFT, per-tenant customization required a full fine-tuned model per tenant, which was prohibitively expensive.
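The storage side of that economics is simple arithmetic. A sketch with hypothetical sizes: a model whose bf16 weights take 16 GB (roughly an 8B-parameter model) and adapters of 0.05 GB each; real adapter sizes depend on the rank and on which matrices are targeted.

```python
def per_tenant_full_models_gb(n_tenants: int, model_gb: float) -> float:
    """Every tenant gets its own full fine-tuned copy of the model."""
    return n_tenants * model_gb


def shared_base_plus_adapters_gb(n_tenants: int, model_gb: float, adapter_gb: float) -> float:
    """One shared base model plus one small LoRA adapter per tenant."""
    return model_gb + n_tenants * adapter_gb
```

With 100 tenants, separate fine-tuned copies need 1,600 GB, while one base plus 100 adapters needs about 21 GB; the gap widens linearly with the tenant count.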

For preference alignment specifically, DPO and variants matter because RLHF (the traditional preference-alignment method) is complex and unstable. RLHF requires training a reward model, then using it to shape the LLM via reinforcement learning. The multi-step process is error-prone. DPO does the same job in one training pass, directly optimizing for preferences. Simpler, more stable, comparable or better results. Most preference fine-tuning in 2026 uses DPO variants rather than classic RLHF.

The failure modes to know: catastrophic forgetting (fine-tuning destroys general capabilities); overfitting to small datasets (LoRA is not immune to this); poor LoRA rank choice (too small underfits; too large approaches full fine-tuning cost); data quality issues (garbage fine-tuning data produces worse models than the base); insufficient evaluation (the fine-tuned model regresses on general tasks unnoticed). Each has known mitigations; skipping them wastes training runs and ships regressions.
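Catching catastrophic forgetting and general-task regressions mostly comes down to gating every fine-tune on a comparison against the base model. A minimal sketch of such a gate; the function name, suite names, scores, and the 0.02 tolerance are hypothetical, and in practice the scores would come from running the same evaluation harness on both models.

```python
from typing import Dict


def find_regressions(base_scores: Dict[str, float],
                     tuned_scores: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Return {suite: drop} for every general-capability suite where the
    fine-tuned model scores more than `tolerance` below the base model.

    Scores are assumed to be accuracies in [0, 1]. A suite missing from
    tuned_scores counts as 0.0, so a skipped eval is never silently passed.
    """
    regressions = {}
    for suite, base in base_scores.items():
        drop = base - tuned_scores.get(suite, 0.0)
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions
```

A hypothetical fine-tune that gains a point on code but loses 8 points on math would come back as {"math": 0.08}, which is the signal to revisit the data mix, lower the learning rate or rank, or add general-purpose examples before shipping.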

Modern platforms make PEFT accessible. HuggingFace PEFT + Axolotl + Unsloth cover most fine-tuning workflows with sensible defaults. Managed services (Together, Fireworks) provide fine-tuning APIs for teams without GPU infrastructure. Colab / RunPod / Lambda Labs offer per-hour GPU access. The barrier to entry dropped from “significant infrastructure investment” to “an afternoon and a credit card.”

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 45 trains at least one LoRA fine-tune on basecamp.
DEEP     Out of scope unless capstone direction calls for production-grade
         fine-tuning. Default: OUTLINE.

Preview: what OUTLINE will answer

When Y5 Phase 45 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you adapt a pre-trained LLM to a specific task or domain without prohibitive cost?
  • PRINCIPLES. Update small deltas, not full weights. Match technique to task (LoRA for style; QLoRA for scale; DPO for alignment). Multi-LoRA serving for per-tenant customization. Evaluate for catastrophic forgetting. Data quality matters more than technique choice.
  • TRADE-OFFS. Full FT (max flexibility, max cost) vs LoRA (efficient, some limits) vs QLoRA (consumer GPU, quantization complexity). Task fine-tuning (SFT) vs preference alignment (DPO). Self-managed (control) vs managed (Together, Fireworks — convenience).
  • TOOLS (time-stamped as of 2026-06): HuggingFace PEFT, Axolotl, Unsloth, DPO/KTO/IPO/ORPO implementations, Together.ai/Fireworks.ai fine-tuning APIs, LlamaFactory, Lit-GPT.

The DEEP promotion is out of scope for basecamp default; if pursued (e.g., Y5 capstone calls for domain-specific fine-tuning), it would add MASTERY (operating fine-tuning workflows on basecamp), COMPARE (LoRA vs QLoRA vs DPO for different workloads), OPERATE (a specific fine-tuning event and eval outcome), and CONTRIBUTE (a PEFT documentation improvement or public case study).

Canonical references

  • Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (2021). Free. Foundational.
  • Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs” (2023). Free.
  • Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (2023). Free.
  • HuggingFace PEFT documentation. Free at huggingface.co/docs/peft.
  • Sebastian Raschka’s blog on PEFT patterns. Free at sebastianraschka.com.

Cross-references