Inference Optimization

Phase 44 of /root Year 5: quantization (INT8, INT4), distillation, speculative decoding, batch size tuning, KV cache optimization. Squeeze 2-5× more throughput from the same GPU. 5-7 weeks, ~50-70 hours.

Sixth phase of Year 5. Squeezing every drop from the GPU.

Phase 43 deployed vLLM. This phase makes it faster + cheaper. Quantization, distillation, speculative decoding, and batch tuning are the techniques that let you serve 2-5× more traffic on the same hardware. By the end of the phase, basecamp serves quantized, speculatively decoded models with measured quality + throughput changes.

These optimizations aren’t optional at scale. They can be the difference between serving 100 RPS and 500 RPS on the same GPU. Frontier AI labs (Anthropic, OpenAI, and others) all apply them aggressively. The patterns outlast any specific implementation.

Prerequisites

Phase 43 complete; vLLM operational

12 hrs/week budget reserved

Why this phase exists

LLM serving is GPU-bound. Every percent of GPU utilization is real money. Optimization techniques compound. As an illustration: quantization gives 2× → batch-size tuning gives 1.3× → speculative decoding gives 1.5× → roughly 4× in total (2 × 1.3 × 1.5 ≈ 3.9) from the same hardware. Senior ML platform engineers know these techniques and when each fits.
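The compounding arithmetic above can be checked in a few lines. This is a minimal sketch: the factors are the illustrative ones quoted in the text, not measurements, and `combined_speedup` is a hypothetical helper name. It also assumes the gains are independent, which real workloads only approximate.

```python
import math

# Illustrative per-technique speedups from the paragraph above (not measurements).
speedups = {
    "quantization": 2.0,
    "batch-size tuning": 1.3,
    "speculative decoding": 1.5,
}

def combined_speedup(factors):
    """Multiply speedup factors, assuming each gain is independent of the others."""
    return math.prod(factors)

total = combined_speedup(speedups.values())
print(f"combined: {total:.1f}x")  # 2.0 * 1.3 * 1.5 = 3.9
```

If two techniques compete for the same bottleneck, the real combined factor will be lower than the product, which is why each step gets measured separately.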

The pattern-first frame

Same eight steps.

1. PROBLEM

You have an LLM serving at some throughput. You want more throughput, lower latency, or both — without changing hardware. The techniques are well-known; the engineering is picking the right ones for your workload + measuring the quality trade-offs.

2. PRINCIPLES

2.1 Quantization

Reduce numerical precision: FP16 → INT8 → INT4. Smaller model = less GPU memory = larger batches = higher throughput. Quality degrades at extreme quantization but is often acceptable.
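To make "smaller model = less GPU memory" concrete, here is a rough sketch of weight memory at each precision. The 7B parameter count is a hypothetical example, and `weight_memory_gib` is an illustrative helper. It counts weights only; real INT4 formats also store per-group scales and zero points, so actual files are somewhat larger.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3

NUM_PARAMS = 7e9  # hypothetical 7B-parameter model

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Each halving of precision halves the weight footprint, and the freed memory is what lets the KV cache and batch size grow.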

→ Pattern: inference-optimization — OUTLINE this phase

Investigate:

Walk an INT8 quantization scheme: what’s the calibration step?
Why does AWQ preserve quality better than naive INT4?
When does INT4 surprise you (small quality drop) vs disappoint (large drop)?
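As a starting point for the calibration question above, here is a minimal sketch of symmetric absmax INT8 quantization. It is the simplest possible scheme, not what AWQ or GPTQ do. The function names and the weight values are made up for illustration, and real toolchains calibrate per channel or per group on representative activations rather than on one flat list.

```python
from typing import List, Sequence

def calibrate_scale(calibration_values: Sequence[float]) -> float:
    """Calibration step: derive one scale from the largest magnitude observed."""
    absmax = max(abs(v) for v in calibration_values)
    return absmax / 127 if absmax > 0 else 1.0

def quantize(values: Sequence[float], scale: float) -> List[int]:
    """Map floats to INT8 codes, clamping anything outside the calibrated range."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(codes: Sequence[int], scale: float) -> List[float]:
    """Recover approximate floats from INT8 codes."""
    return [c * scale for c in codes]

weights = [0.02, -0.51, 0.33, 1.27, -0.08]  # made-up values for illustration
scale = calibrate_scale(weights)
codes = quantize(weights, scale)
restored = dequantize(codes, scale)

# For in-range values, round-trip error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.4f}")
```

A single outlier inflates `absmax`, stretches the scale, and wastes most of the 255 codes on empty range. That is the intuition behind why naive low-bit schemes lose quality and why outlier-aware methods exist.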

2.2 Distillation

Train a smaller “student” model to mimic a larger “teacher.” Smaller student = faster inference + lower cost. Common for fine-tuning specific tasks where you don’t need the teacher’s full generality.

Investigate:

Walk knowledge distillation: teacher outputs soft labels, student trained against them.
When does distillation work well (narrow tasks) vs poorly (general capability)?
What’s the relationship between distillation and fine-tuning (Phase 45 deepens)?
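The soft-label idea above can be sketched as a loss function. This is a minimal pure-Python version assuming the common formulation: KL divergence between temperature-softened teacher and student distributions, scaled by the temperature squared. The names `softmax` and `distillation_loss` are illustrative; a real training loop would use a tensor library and usually mix this term with a hard-label loss.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives softer probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature**2

teacher = [4.0, 1.0, 0.5]  # made-up logits for one token position
print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # identical logits: zero loss
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # disagreeing student: positive loss
```

The soft labels carry more signal than a one-hot target because the teacher's relative preference among wrong answers is also learned.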

2.3 Speculative decoding

A small “draft” model proposes tokens; the large “verifier” model checks them in parallel. Net effect: latency drops because several draft tokens are typically accepted per verification step, so the large model runs fewer sequential forward passes.

Investigate:

Walk speculative decoding: draft model proposes 5 tokens → verifier model evaluates all 5 in one forward pass → accepts some.
Why does it work? (Hint: language is predictable; draft model is usually right.)
When does it fail (low draft-model acceptance rate)?
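The propose-then-verify walk above can be sketched with toy models. This is a simplified greedy variant: draft tokens are accepted while they match the verifier's own choice. The method in Leviathan et al. uses a rejection-sampling rule instead, so that sampled output keeps the verifier's distribution. `speculative_step`, `toy_draft`, and `toy_verifier` are hypothetical names, and the per-position verifier loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token

def speculative_step(
    prefix: List[int], draft: NextToken, verifier: NextToken, k: int = 5
) -> Tuple[List[int], int]:
    """Run one draft/verify round; return (new tokens, number of draft tokens accepted)."""
    # 1. The draft model proposes k tokens autoregressively (cheap, sequential).
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The verifier scores every proposed position (one parallel pass in a real system).
    new_tokens: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        target = verifier(ctx)
        if target != tok:
            new_tokens.append(target)  # first mismatch: keep the verifier's token, stop
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the verifier's next token comes for free.
    new_tokens.append(verifier(ctx))
    return new_tokens, k

def toy_verifier(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy rule: count upward modulo 10

def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong only after a 3

tokens, accepted = speculative_step([0], toy_draft, toy_verifier, k=5)
print(tokens, accepted)
```

Acceptance rate is `accepted / k` averaged over many steps. When it is low, most draft work is thrown away and the extra draft passes can make latency worse than plain decoding.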

2.4 Batch size tuning

Find the batch size that maximizes GPU utilization without degrading latency unacceptably. Workload-dependent.

Investigate:

For a chatbot: what’s the right max-num-batched-tokens?
For a batch summarizer: what’s the right max-num-seqs?
What’s the right way to measure both?
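One way to frame the measurement question: sweep a setting, record throughput and tail latency at each value, then pick the highest-throughput value that still meets the latency target. The sketch below assumes you already have such measurements. The numbers are placeholders for illustration only, and `Measurement` and `pick_batch_limit` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional

class Measurement(NamedTuple):
    throughput_tok_s: float
    p95_latency_ms: float

def pick_batch_limit(
    results: Dict[int, Measurement], p95_slo_ms: float
) -> Optional[int]:
    """Return the setting with the highest throughput whose p95 latency meets the SLO."""
    within_slo = {k: m for k, m in results.items() if m.p95_latency_ms <= p95_slo_ms}
    if not within_slo:
        return None  # no setting meets the latency target
    return max(within_slo, key=lambda k: within_slo[k].throughput_tok_s)

# Placeholder sweep results, keyed by max concurrent sequences.
# Replace with measurements taken against replayed production traffic.
sweep = {
    8: Measurement(900.0, 180.0),
    16: Measurement(1600.0, 260.0),
    32: Measurement(2500.0, 480.0),
    64: Measurement(3100.0, 950.0),
}

chatbot_choice = pick_batch_limit(sweep, p95_slo_ms=500.0)     # latency-sensitive
batch_job_choice = pick_batch_limit(sweep, p95_slo_ms=5000.0)  # throughput-oriented
print(chatbot_choice, batch_job_choice)
```

The same sweep gives different answers for the chatbot and the batch summarizer because only the latency target changes, which is the sense in which the right batch size is workload-dependent.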

2.5 KV cache optimization

KV cache dominates LLM memory at long contexts. Optimizations: prefix sharing (multi-tenant prompts), KV cache offload (RAM/disk for cold contexts), grouped-query attention (smaller cache).

Investigate:

How does prefix sharing work? When does it earn its weight?
What’s grouped-query attention (GQA), and which 2026 models use it?
When does KV cache offload to RAM make sense?
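To see why grouped-query attention shrinks the cache, it helps to compute KV cache size directly: two tensors (K and V) per layer, each holding one vector per KV head per token. The model shape below is a hypothetical example and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # FP16
) -> int:
    """KV cache size: K and V tensors for every layer, KV head, and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

GIB = 1024**3

# Hypothetical shape: 32 layers, head_dim 128, one 4096-token sequence.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"multi-head attention (32 KV heads): {mha / GIB:.1f} GiB per sequence")
print(f"grouped-query attention (8 KV heads): {gqa / GIB:.1f} GiB per sequence")
```

With this shape, the cache drops from 2 GiB to 0.5 GiB per 4096-token sequence, so the same GPU memory holds four times as many concurrent sequences. Prefix sharing attacks the same budget from another side by storing a common prompt prefix once instead of once per request.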

2.6 The optimization stack composes

Quantization + speculative decoding + batch tuning + KV optimization compose roughly multiplicatively when they relieve different bottlenecks. The combined stack typically lands at 2-5× over a vanilla FP16 baseline.

Investigate:

Walk the order to apply optimizations.
Which ones have quality risk vs which are quality-neutral?
How do you A/B test optimization changes safely?
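One way to keep gains attributable is to add a single optimization per benchmark run and compare each run only to the one before it. The sketch below assumes that workflow. The run labels and numbers are placeholders rather than results, and `Run` and `attribute_steps` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class Run(NamedTuple):
    label: str
    throughput_tok_s: float
    quality_score: float  # higher is better, e.g. downstream task accuracy

def attribute_steps(
    runs: List[Run], max_quality_drop: float
) -> List[Tuple[str, float, bool]]:
    """Compare each run to the previous one so every change has exactly one cause."""
    report = []
    for prev, cur in zip(runs, runs[1:]):
        speedup = cur.throughput_tok_s / prev.throughput_tok_s
        quality_ok = (cur.quality_score - prev.quality_score) >= -max_quality_drop
        report.append((cur.label, round(speedup, 2), quality_ok))
    return report

# Placeholder numbers for illustration only.
runs = [
    Run("fp16 baseline", 1000.0, 0.800),
    Run("+ int4 awq", 1900.0, 0.790),
    Run("+ speculative decoding", 2600.0, 0.790),
]

report = attribute_steps(runs, max_quality_drop=0.02)
for label, speedup, quality_ok in report:
    print(f"{label}: {speedup}x, quality within budget: {quality_ok}")
```

A step that fails the quality check is rolled back on its own, without having to untangle it from the other changes.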

3. TRADE-OFFS

Decision	Options	Trade-off
Quantization	AWQ (INT4); GPTQ (INT4); FP8 (newer); GGUF (llama.cpp)	AWQ: strong INT4 quality. GPTQ: similar. FP8: needs newer GPUs with FP8 support. GGUF: llama.cpp format, suited to CPU inference.
Distillation	Custom train; off-the-shelf smaller model	Custom: targeted, $$. Off-the-shelf: cheap, generic.
Speculative decoding	Native vLLM; n-gram speculation; Medusa	Native: built-in. n-gram: cheap, lower acceptance. Medusa: research-leaning.
Optimization budget	All optimizations; selective; none	All: complexity. Selective: pragmatic. None: leaving performance on the table.

4. TOOLS (as of 2026-06)

AWQ, GPTQ quantization tools
vLLM speculative decoding built-in
LitGPT or TensorRT-LLM (TRT-LLM) for advanced optimizations
DeepSpeed-Inference for some techniques

Reading

AWQ paper (Lin et al.)
“Mixed Precision Inference for Transformers” papers
vLLM optimization docs
“Speculative Decoding” paper (Leviathan et al.)

5. MASTERY: Optimized inference on basecamp

[ ] Quantize Phase 43's model to INT4 via AWQ; serve via KServe
[ ] Benchmark quality (perplexity + downstream task) before/after
[ ] Benchmark throughput before/after
[ ] Enable speculative decoding in vLLM with a small draft model
[ ] Measure speculative acceptance rate; tune draft model
[ ] Tune max-num-batched-tokens, max-num-seqs for your workload
[ ] Profile KV cache memory; observe prefix-cache reuse
[ ] Distill one specific behavior into a smaller model (optional, time-permitting)
[ ] A/B test optimization changes via Flagger (Phase 38) — observe quality stability
[ ] Document optimization stack as ADR
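For the perplexity half of the quality benchmark in the checklist above, the metric itself is small: the exponential of the mean negative log-likelihood per token. This sketch assumes you have already collected natural-log token probabilities from the model on a held-out text. `perplexity` is an illustrative helper, and the log-probabilities shown are made up.

```python
import math
from typing import Sequence

def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up log-probabilities for the same text under two model variants.
baseline_logprobs = [-1.20, -0.35, -2.10, -0.80]
quantized_logprobs = [-1.25, -0.40, -2.30, -0.85]

print(f"baseline:  {perplexity(baseline_logprobs):.2f}")
print(f"quantized: {perplexity(quantized_logprobs):.2f}")
```

Run both variants on the same evaluation text with the same tokenizer, otherwise the numbers are not comparable. Perplexity alone can miss task-specific regressions, which is why the checklist pairs it with a downstream task.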

6. COMPARE: TensorRT-LLM or DeepSpeed-Inference

Pick one alternative optimization stack. Apply to one model. Compare results.

400-word reflection.

7. OPERATE

2-3 runbooks: quantization quality regression, speculative decoding low acceptance, batch saturation
1-2 ADRs (AWQ as quantization default; speculative decoding policy)
Weekly log

8. CONTRIBUTE

vLLM optimization features
AWQ / GPTQ / AutoAWQ projects
A blog post on a real optimization win

What ships from this phase

Optimized LLM serving on basecamp (quantized + speculative + tuned)
Optimization runbooks

Validation criteria

[ ] Quantized model (INT4 via AWQ) serving via KServe
[ ] Speculative decoding enabled with measured acceptance rate
[ ] Batch tuning documented
[ ] Quality + throughput before/after benchmarks
[ ] All 9 operational depth checks
[ ] Compare reflection (400 words)
[ ] 2-3 runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
    - inference-optimization → OUTLINE
[ ] Exit Test passed

Exit Test

Time: 2 hours.

Part 1: Build (75 min)

Apply quantization + speculative decoding to a new model. Benchmark both throughput and quality.

Part 2: Articulate (45 min)

~800 words: “Walk the inference optimization stack you applied. Cite each technique’s quality trade-off and throughput gain. Explain how you’d present these trade-offs to a non-technical stakeholder.”

Anti-patterns

Anti-pattern	Why
Quantizing without quality measurement	Silently degraded model
Applying all optimizations at once	Can’t attribute gains or regressions
Tuning batch size on synthetic workload	Production workload has different distribution
Ignoring KV cache analysis at long context	Major performance left on the table

Patterns touched this phase

inference-optimization — OUTLINE

→ Next: Phase 45: Fine-tuning + PEFT