Inference Optimization

Quantization, distillation, speculative decoding, batch tuning, KV cache optimization. The compounding techniques that make LLM serving 2–5× cheaper without changing hardware.

Same model. Same GPU. 2–5× more throughput. The compounding optimization stack: quantization × speculative decoding × batch tuning × KV cache. Status: STUB — promoted to OUTLINE in Y5 Phase 44.

What this pattern is

Inference optimization is the set of compounding techniques that squeeze more throughput out of the same GPU hardware. Quantization (FP16 → INT8 or INT4 via methods such as AWQ and GPTQ, or FP8 on newer hardware) shrinks the model weights, freeing GPU memory for larger batches. Distillation trains a smaller student model from a larger teacher; narrow tasks tend to survive distillation well. Speculative decoding uses a small draft model to propose several tokens that a large verifier model checks in one parallel pass, reducing per-token latency. Batch size tuning finds the workload-specific maximum that maintains acceptable latency. KV cache optimization (prefix sharing, GQA, offload) reduces the dominant memory cost at long contexts.
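The draft-and-verify loop of speculative decoding can be sketched in a few lines. This is a minimal sketch of the greedy variant only, assuming toy "models" that are plain functions from a token list to the next token; `toy_target`, `toy_draft`, and the parameter `k` are hypothetical names for illustration, and a real engine verifies all proposed positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

# A "model" here is a function: token sequence -> next token (greedy choice).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: same output as greedy_decode(target, ...)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position; keep the matching prefix and
        #    replace the first mismatch with the target's own token.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if tok != expected:
                break
        tokens.extend(accepted)
    return tokens


def toy_target(tokens: List[int]) -> int:
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


spec_out = speculative_decode(toy_target, toy_draft, [1, 2], max_new_tokens=10)
ref_out = greedy_decode(toy_target, [1, 2], max_new_tokens=10)
```

Because every emitted token is the target's own greedy choice, `spec_out` equals `ref_out`; the speedup in a real system comes from the target scoring several positions per forward pass. Production implementations typically also take one extra "bonus" token from the target when every draft token is accepted, which this sketch omits.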

The techniques compound. 2x from quantization × 1.3x from batch tuning × 1.5x from speculative decoding ≈ 3.9x, or roughly 4x total, assuming the gains are independent. Senior ML platform engineers know all of them and which ones fit which workload. Quality measurement at each step is non-negotiable: silent quality degradation is the failure mode.
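The compounding arithmetic is simple enough to check directly. A minimal sketch using the illustrative factors from the paragraph above (the dictionary keys are hypothetical labels, not measured results):

```python
import math

# Illustrative per-technique speedups taken from the text, not measurements.
speedups = {
    "quantization": 2.0,
    "batch_tuning": 1.3,
    "speculative_decoding": 1.5,
}


def compound_speedup(factors) -> float:
    """Multiply independent speedup factors into one overall factor."""
    return math.prod(factors)


total = compound_speedup(speedups.values())
print(f"combined speedup: {total:.1f}x")  # combined speedup: 3.9x
```

Multiplication is the best case. When two techniques relieve the same bottleneck their gains overlap, so the combined factor should be measured on the real workload rather than assumed.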

Each technique targets a specific bottleneck. Quantization targets GPU memory (fitting bigger batches). Distillation targets model size (smaller compute per token). Speculative decoding targets sequential token generation (parallelizing what was serial). Batch tuning targets GPU utilization (fewer idle cycles). KV cache optimization targets long-context memory pressure. Understanding which bottleneck applies to your workload determines which optimizations pay off.

Quality measurement is what separates real optimization from theater. Quantization can degrade model quality subtly — 4-bit quantization of a 70B model might produce responses that pass eyeball tests but score lower on evals. Distillation can lose important behaviors that the teacher had. Speculative decoding is usually quality-preserving but bugs in verifier logic can produce silent errors. Every optimization needs eval verification before deployment; skipping this step is how “we made it faster and worse” happens without anyone noticing until users complain.
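One way to make the eval step mechanical is a regression gate that compares the optimized model's scores against the baseline before deployment. This is a minimal sketch, assuming higher scores are better; the metric names, scores, and the `max_drop` threshold are all hypothetical and invented for illustration.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     max_drop: float = 0.01) -> List[str]:
    """Return the eval metrics where the candidate falls more than max_drop
    below the baseline. A metric missing from the candidate also counts."""
    failed = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or base_score - cand_score > max_drop:
            failed.append(name)
    return failed


# Hypothetical eval scores for an FP16 baseline and an INT4 candidate.
fp16_scores = {"qa_accuracy": 0.82, "code_pass_rate": 0.61}
int4_scores = {"qa_accuracy": 0.815, "code_pass_rate": 0.57}

regressions = find_regressions(fp16_scores, int4_scores)
if regressions:
    print("blocked, regressed metrics:", regressions)
```

Here the small drop in `qa_accuracy` passes while the larger drop in `code_pass_rate` blocks the rollout, which is the kind of regression an eyeball test would miss.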

Concrete instances in the wild

  • AWQ (Activation-aware Weight Quantization). Common 4-bit quantization. Balances quality and compression.
  • GPTQ. Alternative 4-bit quantization method. Slightly different quality/speed trade-off.
  • FP8 quantization (H100+). Newer hardware supports FP8 natively. Typically preserves quality better than INT8 at the same size.
  • SmoothQuant. Technique that makes activation quantization more robust for INT8.
  • Speculative decoding (vLLM built-in). Small draft model + large verifier. 1.3-2x latency reduction typical.
  • Medusa / EAGLE (advanced speculative decoding). Tree-based speculation for better acceptance rates.
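  • Distilled models (DistilBERT, DistilGPT2, Llama-2-Distill). Trained smaller variants of larger models.
  • Model surgery (SlimPajama, MobileLLM). Smaller models obtained by restructuring the architecture or trimming the training data. MobileLLM is a compact sub-billion-parameter architecture; SlimPajama is a deduplicated training dataset rather than a model.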
  • KV cache prefix sharing (vLLM automatic prefix caching). Multi-tenant prompts sharing prefix cache.
  • Grouped Query Attention (GQA). Architectural change reducing KV cache size. Standard in Llama 3+.
  • Multi-query attention (MQA). The limiting case of GQA: all query heads share a single key/value head. Used in some architectures.
  • Batch tuning per-workload. vLLM’s max_num_batched_tokens tuning.
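The effect of GQA and MQA on KV cache size follows from a simple formula: two tensors (keys and values) per layer, each of size KV heads × head dimension, per token. The sketch below assumes a hypothetical 80-layer model with 64 query heads, head dimension 128, and FP16 cache values; the numbers are illustrative, not a description of any specific model.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128
CONTEXT_TOKENS = 32_768

# Number of KV heads under each attention variant.
variants = {"MHA": QUERY_HEADS, "GQA (8 KV heads)": 8, "MQA": 1}

for name, kv_heads in variants.items():
    per_token = kv_cache_bytes_per_token(LAYERS, kv_heads, HEAD_DIM)
    per_sequence_gib = per_token * CONTEXT_TOKENS / 2**30
    print(f"{name:>18}: {per_token:>9,} B/token, "
          f"{per_sequence_gib:6.2f} GiB per 32k-token sequence")
```

Under these assumptions a single 32k-token sequence needs 80 GiB of cache with full multi-head attention, 10 GiB with 8 KV heads, and 1.25 GiB with MQA, which is why the KV head count dominates long-context memory planning.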

Why this pattern matters

LLM inference costs dominate the operating budget of any serious AI application. A 4x throughput improvement translates directly to 4x cost reduction at the same load, or 4x more load at the same cost. For applications with millions of daily queries, this is the difference between economically viable and infeasible. Optimization isn’t optional; it’s the difference between shipping and not shipping.
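The throughput-to-cost relationship can be written out explicitly. A minimal sketch, assuming a hypothetical GPU price of $4.00 per hour and hypothetical throughputs of 1,000 and 4,000 tokens per second; none of these figures come from a real deployment.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical numbers: same GPU, 4x the throughput after optimization.
baseline_cost = cost_per_million_tokens(gpu_cost_per_hour=4.00, tokens_per_second=1000)
optimized_cost = cost_per_million_tokens(gpu_cost_per_hour=4.00, tokens_per_second=4000)

print(f"baseline:  ${baseline_cost:.2f} per 1M tokens")
print(f"optimized: ${optimized_cost:.2f} per 1M tokens")
```

Because hardware cost per hour is fixed, cost per token is inversely proportional to throughput: quadrupling tokens per second cuts cost per token to a quarter.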

The pattern also determines what models you can afford to run. A 405B model served naively is prohibitively expensive for most workloads. Quantized to 4-bit with speculative decoding, it might fit within budget. Optimization is what makes frontier-capability models operationally accessible to teams without hyperscale budgets.
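The budget argument starts with weight memory, which scales linearly with bits per weight. A rough sketch, assuming weights are the only cost; it ignores quantization scales and zero-points, the KV cache, and activations, so real footprints are somewhat higher.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"405B parameters @ {fmt}: {weight_memory_gb(405e9, fmt):.1f} GB")
```

At FP16 the weights alone come to about 810 GB, against roughly 202.5 GB at 4 bits, which is the difference between a large multi-node deployment and a much smaller GPU count.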

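For latency-sensitive applications, speculative decoding specifically matters. In interactive chat, users judge quality by latency as well as accuracy. A 100ms improvement in first-token latency is often more valuable than a slight accuracy improvement. Each technique shaves a different part of the latency: speculative decoding shortens the time between generated tokens, prefix caching cuts time to first token when prompts share a prefix, and chunked prefill keeps long prompts from stalling tokens already being generated for other requests.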

The pattern also enables architectures that would otherwise be impractical. Multi-LoRA serving (many fine-tuned variants sharing base model weights) often relies on quantization to fit the base model plus adapters in memory. Multi-tenant serving with prefix caching builds on block-based KV cache management such as PagedAttention. Long-context serving relies on GQA or MQA to keep the KV cache manageable. Each optimization enables patterns that unoptimized serving can’t support.

Modern serving frameworks bake in many of these optimizations. vLLM ships PagedAttention, continuous batching, prefix caching, and speculative decoding; some are on by default and others, such as speculative decoding, need explicit configuration. TGI has similar features. What used to require custom kernel-level work in 2022 is a config flag in 2026. The frontier keeps moving, with new optimizations appearing every few months, but the baseline is significantly higher than it used to be.
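As an illustration of how little configuration this can take, the sketch below assembles keyword arguments for a vLLM engine without importing vLLM itself. The argument names `quantization`, `enable_prefix_caching`, and `max_num_batched_tokens` are believed to match vLLM's engine arguments but should be checked against the installed version; the helper `build_engine_kwargs`, the model name, and the default values are hypothetical.

```python
from typing import Any, Dict, Optional


def build_engine_kwargs(model: str, quantization: Optional[str] = None,
                        prefix_caching: bool = True,
                        max_batched_tokens: int = 8192) -> Dict[str, Any]:
    """Collect serving options into one dict of engine keyword arguments."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "enable_prefix_caching": prefix_caching,
        "max_num_batched_tokens": max_batched_tokens,
    }
    if quantization is not None:
        kwargs["quantization"] = quantization  # e.g. "awq" for an AWQ checkpoint
    return kwargs


# Placeholder model identifier, not a real checkpoint.
engine_kwargs = build_engine_kwargs("example-org/example-model-awq", quantization="awq")
print(engine_kwargs)

# With vLLM installed and a GPU available, these would be passed on as:
#   from vllm import LLM
#   llm = LLM(**engine_kwargs)
```

Keeping the options in one dictionary makes it easy to log the exact configuration next to each eval and benchmark run.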

The failure modes to know: quantization degrading quality without measurement (silent regression); speculative decoding bugs producing wrong tokens (harder-to-detect regression); overly aggressive batch sizes producing latency spikes; distillation losing important edge-case behaviors. Each has known mitigations, but adopting optimizations without eval discipline produces mysteries later.
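The batch-size failure mode has a straightforward guard: benchmark several batch sizes and keep the highest-throughput one that still meets the latency target. A minimal sketch, assuming measurements are already collected; the benchmark numbers and the 400 ms p95 target are hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_batch_size(measurements: Dict[int, Tuple[float, float]],
                    p95_latency_slo_ms: float) -> Optional[int]:
    """measurements maps batch size -> (tokens_per_second, p95_latency_ms).
    Return the batch size with the highest throughput whose p95 latency
    meets the SLO, or None if no batch size qualifies."""
    eligible = {
        batch: tps
        for batch, (tps, p95_ms) in measurements.items()
        if p95_ms <= p95_latency_slo_ms
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Hypothetical benchmark results for one workload.
measured = {
    8: (900.0, 120.0),
    16: (1600.0, 180.0),
    32: (2600.0, 310.0),
    64: (3100.0, 720.0),  # best throughput, but violates the latency target
}

chosen = pick_batch_size(measured, p95_latency_slo_ms=400.0)
print("chosen batch size:", chosen)
```

Batch size 64 has the best raw throughput here but breaks the latency target, so 32 is selected. The measurements are workload-specific and need repeating whenever the model, hardware, or traffic mix changes.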

The pattern also depends on the specific model and hardware. Optimizations that help Llama 3 70B on H100 might not help Mixtral 8x22B on A100. Newer models have optimizations baked in (GQA in Llama 3; sparse attention in newer models). Newer hardware unlocks further optimizations (FP8 through NVIDIA's Transformer Engine on H100 and later). What works today might not be optimal tomorrow; staying current is part of the discipline.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 44 applies quantization + speculative decoding
         to a basecamp-served model.
DEEP     Out of scope unless capstone direction prioritizes it. Default: OUTLINE.

Preview: what OUTLINE will answer

When Y5 Phase 44 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you get more throughput and lower latency from the same LLM inference hardware?
  • PRINCIPLES. Techniques compound multiplicatively. Each targets a specific bottleneck. Quality measurement at every step. Match technique to workload characteristics. Newer models bake in optimizations (GQA, FP8).
  • TRADE-OFFS. Aggressive quantization (max compression, some quality loss) vs conservative (safe, less gain). Distillation (fast, task-narrow) vs speculative decoding (quality-preserving, complex). Larger batches (throughput-optimized) vs smaller (latency-optimized). Static optimization (deployment-time) vs dynamic (per-request).
  • TOOLS (time-stamped as of 2026-06): AWQ, GPTQ, FP8 (H100+), SmoothQuant, speculative decoding (vLLM built-in), Medusa/EAGLE, distilled models (DistilBERT, etc.), model surgery (SlimPajama), KV cache optimization (GQA/MQA), batch tuning per-workload.

The DEEP promotion is out of scope for basecamp default; if pursued, it would add MASTERY (operating optimized serving with measured cost + quality), COMPARE (AWQ vs GPTQ vs FP8; speculative decoding variants), OPERATE (a specific optimization event and its measured impact), and CONTRIBUTE (a vLLM or TensorRT-LLM optimization documentation improvement).

Canonical references

  • vLLM documentation on optimization features. Free at docs.vllm.ai.
  • Semianalysis writing on inference economics. Free selections.
  • Anyscale blog on LLM inference optimization. Free at anyscale.com.
  • Chip Huyen’s talks on LLM cost optimization. Free.
  • Papers: AWQ (Lin et al., 2023); Speculative Decoding (Leviathan et al., 2023); Medusa (Cai et al., 2024). All free.

Cross-references