LLM Serving

Production-grade LLM inference at scale. PagedAttention, continuous batching, KV cache management. vLLM, TGI, KServe ServingRuntime.

Loading a model is the easy part. Serving it efficiently (high throughput, low latency, high GPU utilization) is the senior-IC challenge. The core pieces are vLLM, KV cache management, and continuous batching. Status: STUB, to be promoted to OUTLINE in Y5 Phase 43.

What this pattern is

LLM serving is production-grade inference at scale. Naive serving (load the model in PyTorch, expose an HTTP endpoint, process one request at a time) can waste 80% or more of GPU capacity: decoding a single sequence is bound by memory bandwidth rather than compute, and the GPU sits idle between requests. Production-grade serving combines several techniques:

  • PagedAttention. vLLM’s contribution. Manages KV cache in fixed-size pages, so many concurrent sequences share GPU memory efficiently.
  • Continuous batching. New requests join an in-flight batch at each token step instead of waiting for the previous batch to finish.
  • KV cache prefix sharing. Multi-tenant prompts with shared prefixes reuse cached entries.
  • Quantization. INT8/INT4 models fit in less GPU memory and leave room for larger batches.

vLLM is the canonical OSS serving framework; TGI (Text Generation Inference) from Hugging Face is a parallel runtime; KServe with vLLM as its ServingRuntime is the K8s-native deployment shape.
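As a rough illustration of why continuous batching helps, the toy simulation below counts decode steps for a queue of requests under static and continuous batching. It is a sketch under simplifying assumptions, not how vLLM actually schedules: every step is assumed to cost the same regardless of how full the batch is, prefill and KV memory limits are ignored, and the function names and request lengths are invented for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short sequences in the batch hold their slot idle until the longest ends.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled at the next token step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the step runs.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: alternating short (2-token) and long (8-token) outputs.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))      # 32
print(continuous_batching_steps(lengths, batch_size=2))  # 22
```

With two slots and 40 tokens to generate, 20 steps is the lower bound. Continuous batching lands close to it because short requests no longer wait on long ones.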

The pattern’s central concern is GPU utilization. GPUs are the most expensive resource in an LLM stack. A serving system that gets 30% GPU utilization pays about 3.3x as much per token as one that gets 100% (1 / 0.3 ≈ 3.3). The techniques above are all mechanisms to keep the GPU busy while requests are flowing: naive serving leaves the GPU idle while it waits for one request to finish before starting the next, and continuous batching keeps it busy across many concurrent requests.
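The utilization arithmetic can be written out as a small function. This is a sketch that assumes delivered throughput scales linearly with utilization; the hourly price and peak throughput are hypothetical placeholders, not measured basecamp numbers.

```python
def cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_sec, utilization):
    """USD per one million generated tokens at a given GPU utilization (0 < u <= 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Assumption: useful throughput is peak throughput scaled by utilization.
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: 4 USD per hour, 2500 tokens per second at full utilization.
full = cost_per_million_tokens(4.0, 2500, 1.0)
low = cost_per_million_tokens(4.0, 2500, 0.3)
print(round(low / full, 2))  # 3.33
```

The ratio depends only on utilization, so the 3.3x penalty holds whatever the GPU costs.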

KV cache dominates memory at long contexts. Each request’s KV cache scales with sequence length and model size. At 100K-token contexts on 70B models, the KV cache can reach tens of gigabytes per request. PagedAttention manages this cache as fixed-size pages (like OS virtual memory), so many requests can pack into non-contiguous GPU memory with little waste. Without PagedAttention, GPUs run out of memory well before compute is saturated.
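A back-of-the-envelope sketch of both claims follows: the per-token KV footprint, and how much memory page-granular allocation reserves compared with preallocating a maximum-length slab per request. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumed 70B-class configuration with grouped-query attention, and the block size of 16 tokens and the request lengths are illustrative choices.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def reserved_tokens_contiguous(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len slots."""
    return len(seq_lens) * max_len


def reserved_tokens_paged(seq_lens, block_size=16):
    """Token slots reserved when each request holds only the pages it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                  # 327680 bytes per token
print(per_token * 100_000 / 1e9)  # 32.768 GB for one 100K-token request

# Hypothetical mix of request lengths sharing one GPU.
seq_lens = [100, 900, 50, 2000]
print(reserved_tokens_contiguous(seq_lens, max_len=4096))  # 16384
print(reserved_tokens_paged(seq_lens, block_size=16))      # 3088
```

With paging, waste is bounded by one partly filled page per request, which is why many more sequences fit before memory runs out.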

Serving quality also depends on request scheduling. First-token latency (time to first token) matters for interactive UX. Token throughput (tokens per second per request) matters for user-perceived speed. Total throughput (tokens per second across all requests) matters for cost. Optimizing all three simultaneously requires tuning batch sizes, priorities, and preemption policies. Production LLM serving is a scheduling problem as much as an ML problem.
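The three metrics above can be computed from per-request token timestamps. The sketch below assumes a hypothetical `RequestTrace` record (not a vLLM type) that holds the arrival time and the wall-clock time at which each token was emitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request record; times are seconds on a shared clock."""
    arrival: float
    token_times: List[float]  # emission time of each generated token


def time_to_first_token(trace):
    """First-token latency: arrival until the first token is emitted."""
    return trace.token_times[0] - trace.arrival


def per_request_tokens_per_sec(trace):
    """Decode rate one user sees after the first token arrives."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (n - 1) / (trace.token_times[-1] - trace.token_times[0])


def total_tokens_per_sec(traces):
    """Aggregate throughput across all requests, the number that drives cost."""
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


a = RequestTrace(arrival=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5])
b = RequestTrace(arrival=1.0, token_times=[1.5, 2.0, 2.5, 3.0])
print(time_to_first_token(a))         # 0.5
print(per_request_tokens_per_sec(a))  # 2.0
print(total_tokens_per_sec([a, b]))   # 3.0
```

Total throughput here exceeds either request's own rate because the two requests overlap in time. That overlap is what batching buys, and the scheduler trades it against first-token latency.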

Concrete instances in the wild

  • vLLM. OSS serving framework from UC Berkeley. PagedAttention originator. Dominant OSS runtime as of 2026.
  • Hugging Face Text Generation Inference (TGI). OSS serving runtime from Hugging Face. Parallel implementation, similar techniques.
  • NVIDIA TensorRT-LLM. Hardware-optimized serving for NVIDIA GPUs. Best absolute performance on supported models.
  • KServe with vLLM ServingRuntime. K8s-native serving. basecamp default.
  • Anyscale Ray Serve. Ray-native serving. Common when workloads span serving + training on same cluster.
  • BentoML. OSS serving framework, supports LLMs among other models.
  • SGLang. Newer OSS serving framework. Its RadixAttention reuses KV cache across requests that share prefixes; also optimized for structured generation.
  • Together.ai / Fireworks.ai. Managed LLM inference-as-a-service. Convenience for teams not operating GPU infra.
  • Groq. Custom hardware for extremely fast LLM inference.
  • AWS SageMaker + DJL. AWS-managed LLM serving with Deep Java Library.
  • Replicate. Managed inference-as-a-service, popular for prototyping.

Why this pattern matters

Naive LLM serving is prohibitively expensive at any real scale. A 70B model served naively might handle 1-5 requests per second per GPU. The same GPU with vLLM might handle 10-50 requests per second. That is roughly a 10x difference in cost per request on the same hardware. For any team spending real money on LLM inference, adopting production serving techniques isn’t optional; it’s the difference between viable and infeasible economics.

The pattern also determines what’s possible at your budget. Serving Llama 3.1 70B naively might require dozens of GPUs for modest traffic; with vLLM + quantization + speculative decoding it might fit on a handful. The techniques compound — quantization enables larger batches; larger batches enable higher utilization; higher utilization enables lower per-token cost. Getting the stack right shifts what your infrastructure can afford by an order of magnitude.
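One link in that chain, quantization enabling larger batches, can be checked with simple arithmetic: whatever memory the weights do not use is available for KV cache, and that bounds how many sequences can run concurrently. The sketch below assumes a hypothetical node with 4 x 80 GB of GPU memory, a 70B-parameter model, an 8K-token context, and a per-token KV footprint of 327,680 bytes (an assumed 70B-class shape). It ignores activations, runtime overhead, and quantization metadata, so real numbers will be lower.

```python
def max_concurrent_sequences(total_gpu_bytes, num_params, weight_bits, kv_bytes_per_seq):
    """Upper bound on concurrent sequences once weights are resident."""
    weight_bytes = num_params * weight_bits // 8
    free_bytes = total_gpu_bytes - weight_bytes
    return max(free_bytes // kv_bytes_per_seq, 0)


TOTAL_GPU_BYTES = 4 * 80 * 10**9   # assumed: four 80 GB GPUs
NUM_PARAMS = 70 * 10**9            # 70B parameters
KV_BYTES_PER_SEQ = 327_680 * 8192  # assumed per-token KV bytes x 8K context

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    n = max_concurrent_sequences(TOTAL_GPU_BYTES, NUM_PARAMS, bits, KV_BYTES_PER_SEQ)
    print(name, n)
# FP16 67
# INT8 93
# INT4 106
```

Under these assumptions INT4 weights leave room for about 58% more concurrent 8K-token sequences than FP16 on the same hardware.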

For basecamp specifically, LLM serving is the substrate under the AIOps agent (Y5 Phase 50), the ops-handbook chatbot (Y5 Phase 42), and any other basecamp AI use case. All of them route through vLLM. Getting serving right once means every basecamp AI application inherits good performance and cost characteristics.

The pattern also matters for latency-sensitive applications. Interactive chat needs sub-second first-token latency. Real-time coding assistants need sub-100ms latency. Serving optimizations that reduce first-token latency (chunked prefill, prompt caching, speculative decoding) are load-bearing for these applications. Naive serving that’s fine for batch inference is unusable for interactive use.

Modern serving also enables new patterns. Prefix caching enables cheap RAG (the same document prefix is reused across many queries). Multi-LoRA serving enables serving many fine-tuned variants from a single base model. Speculative decoding lets a small draft model propose several tokens that the large model verifies in one forward pass, which cuts latency per generated token. Each technique is impractical without production-grade serving infrastructure.
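The prefix-caching case can be sketched with a toy block cache keyed by the full token prefix. This is a minimal illustration of the idea, not vLLM's or SGLang's implementation: the class name, block size, and token ids are invented, and real systems use chained hashes or a radix tree rather than storing whole prefixes as keys.

```python
class ToyPrefixCache:
    """Tracks which full KV blocks are already cached, keyed by token prefix."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()  # each key is the token prefix up to a block boundary

    def admit(self, tokens):
        """Return (reused_blocks, new_blocks) for a prompt; partial blocks are skipped."""
        reused = new = 0
        full_len = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full_len + 1, self.block_size):
            key = tuple(tokens[:end])  # a block is valid only for its exact prefix
            if key in self.cached:
                reused += 1
            else:
                self.cached.add(key)
                new += 1
        return reused, new


# Hypothetical RAG traffic: the same 8-token document followed by different questions.
document = list(range(8))
cache = ToyPrefixCache(block_size=4)
print(cache.admit(document + [100, 101, 102, 103]))  # (0, 3)
print(cache.admit(document + [200, 201, 202, 203]))  # (2, 1)
```

The second query recomputes only its own question block. The two document blocks are served from cache, which is where the RAG savings come from.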

The failure modes to know: GPU memory fragmentation without PagedAttention (OOM before compute saturates); poor batch sizing (either underutilization or unacceptable latency); missing quantization (paying up to 4x the memory, FP16 versus INT4, for a marginal quality gain); ignoring first-token latency (users perceive the app as slow even at high throughput). Each has known solutions in modern serving frameworks; using them means adopting the framework, not just calling the model.

Modern platforms make production serving accessible. vLLM deploys on Kubernetes through KServe custom resources (an InferenceService backed by a vLLM ServingRuntime). TGI has Docker images ready to run. Managed services (Together, Fireworks, Replicate) remove serving concerns entirely at a per-token markup. The choice between self-hosted and managed is a cost/control tradeoff, not a capability gap.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 43 deploys vLLM on basecamp via KServe.
DEEP     Promoted after Y5 Phase 44 — vLLM operational with measured RPS + p99
         latency, plus inference optimizations (quantization, speculative decoding)
         applied.

Preview: what OUTLINE will answer

When Y5 Phase 43 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you serve LLMs efficiently enough that per-token cost is viable?
  • PRINCIPLES. GPU utilization is the metric. PagedAttention manages KV cache. Continuous batching keeps GPU busy. Prefix sharing amortizes cache. Quantization enables larger batches. Scheduling optimizes for latency + throughput jointly.
  • TRADE-OFFS. vLLM (dominant OSS, most-adopted) vs TGI (parallel implementation) vs TensorRT-LLM (best NVIDIA perf) vs SGLang (structured generation). Self-hosted (control, ops burden) vs managed (easy, per-token markup). Throughput optimization (batch-friendly) vs latency (interactive-friendly).
  • TOOLS (time-stamped as of 2026-06): vLLM (basecamp default), Hugging Face TGI, NVIDIA TensorRT-LLM, KServe ServingRuntime, Ray Serve, BentoML, SGLang, Together.ai/Fireworks.ai (managed), Groq (custom hardware).

The DEEP promotion, after Y5 Phase 44 with measured RPS + p99 latency, will add MASTERY (operating vLLM on basecamp), COMPARE (vLLM vs TGI vs managed inference), OPERATE (a specific tuning event or capacity incident), and CONTRIBUTE (a vLLM or KServe documentation improvement).

Canonical references

  • Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (2023). Free. The vLLM paper.
  • vLLM documentation. Free at docs.vllm.ai.
  • Hugging Face TGI documentation. Free at huggingface.co/docs/text-generation-inference.
  • Anyscale blog series on LLM serving. Free.
  • SemiAnalysis writing on LLM inference economics. Free selections.

Cross-references