LLM Routing

Route requests to the right model based on query characteristics, cost, latency, and capability. Routing is the dispatch logic at the heart of every production LLM gateway.

Not every query needs the largest model. Routing picks a model per query: a small, cheap model for simple requests, a large, expensive model for complex ones, and a specialized model for specialized tasks. Status: STUB, to be promoted to OUTLINE in Y5 Phase 46.

What this pattern is

LLM routing is the dispatch logic in an LLM gateway that decides which upstream model handles each request. The decision dimensions are capability (does this query need reasoning, code generation, vision, or function calling?); cost (illustratively, a small model might cost around $0.001 per query and a frontier model around $0.05 per query); latency (a local vLLM deployment avoids the external network round trip, while managed APIs add tens of milliseconds); and availability (failover when a primary is down). Routing strategies range from rule-based (regex on the query, prefix match on the user) to classifier-based (a small classifier model decides) to LLM-as-router (a fast LLM picks the right downstream LLM).
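
The rule-based end of that range can be sketched in a few lines. This is a minimal illustration, not the basecamp gateway's actual logic: the backend labels, the regex patterns, and the "batch-" user prefix are all hypothetical, chosen only to show regex-on-query and prefix-on-user rules with first-match-wins ordering.

```python
import re
from dataclasses import dataclass

# Hypothetical backend labels, not the names of any real deployment.
LOCAL_SMALL = "local-small"
CODE_MODEL = "code-specialist"
FRONTIER = "frontier"


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    backend: str


# Rules are checked in order; the first match wins. Patterns are illustrative.
RULES = [
    Rule("code", re.compile(r"\bdef \w+\(|\btraceback\b|\bstack trace\b", re.IGNORECASE), CODE_MODEL),
    Rule("reasoning", re.compile(r"\b(why|prove|debug|root cause)\b", re.IGNORECASE), FRONTIER),
]


def route(query: str, user: str = "", default: str = LOCAL_SMALL) -> str:
    """Return a backend label using a user-prefix rule, then regex rules."""
    # Prefix match on the caller: assumed convention that batch jobs
    # always use the cheap backend, whatever the query says.
    if user.startswith("batch-"):
        return LOCAL_SMALL
    for rule in RULES:
        if rule.pattern.search(query):
            return rule.backend
    # Nothing matched: fall through to the cheapest option.
    return default
```

Under these assumed rules, route("Classify this ticket: printer is offline") falls through to the default, while a query containing "traceback" goes to the code backend. The brittleness is visible in the patterns themselves: every new phrasing needs a new rule.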

The pattern is operationally load-bearing for any org that uses multiple LLM backends. Without routing, every query goes to a single default model, usually the most expensive one. With routing, total cost can drop by as much as 5-10x while user-perceived quality stays largely unchanged. /root’s Y5 llm-gateway is the homelab-scale instantiation.

The pattern is mediation applied to LLM traffic. The gateway sits between application code and LLM backends. It handles routing decisions, cost accounting, rate limiting, caching, retries, and observability. Application code sends “I need an LLM response for this query”; the gateway decides how to fulfill that. This is the same architecture as service mesh or API gateway, applied to LLM-specific concerns.

Routing quality depends on knowing what each model is good at. GPT-4 is strong at reasoning; Claude is strong at long-context and coding; Llama 3.1 70B (self-hosted) is cost-efficient for high-volume simple queries; specialized models (Whisper for speech, embedding models for vector embeddings) handle specific tasks. A good router matches query characteristics to model capabilities. The premise is not that “bigger is always better,” but that different models have different sweet spots.
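
One way to express capability matching is a catalog lookup that picks the cheapest model covering the required capabilities. The sketch below assumes a hand-maintained catalog; the model names, capability tags, and per-query costs are hypothetical placeholders, not real prices or a claim about any vendor's models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    capabilities: frozenset
    cost_per_query: float  # illustrative USD, not a real price


# Hypothetical catalog; a real one would come from measured data.
CATALOG = [
    Model("local-llama", frozenset({"text"}), 0.0),
    Model("frontier-a", frozenset({"text", "reasoning", "code", "long_context"}), 0.05),
    Model("frontier-b", frozenset({"text", "reasoning", "vision"}), 0.04),
]


def cheapest_capable(required: set, catalog=CATALOG) -> Model:
    """Return the cheapest model whose capabilities cover `required`."""
    candidates = [m for m in catalog if required <= m.capabilities]
    if not candidates:
        raise LookupError(f"no model covers {sorted(required)}")
    return min(candidates, key=lambda m: m.cost_per_query)
```

With this catalog, a plain-text request resolves to the local model, and a request needing both reasoning and code resolves to the only model that lists both. The lookup is only as good as the capability tags, which is why the catalog needs to track observed quality rather than vendor claims.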

Concrete instances in the wild

  • basecamp llm-gateway (Y5 Phase 46). Rule-based routing across vLLM (local Llama), Anthropic API, OpenAI API.
  • LiteLLM. OSS LLM gateway with routing, cost tracking, retries. Popular for multi-backend setups.
  • Portkey. Commercial LLM gateway with routing, caching, observability.
  • Helicone. Commercial gateway focused on observability and cost optimization.
  • OpenRouter. Meta-provider that routes across many upstream LLM APIs.
  • Anyscale Aviary. Ray-based LLM router.
  • AWS Bedrock. Managed service that provides routing across multiple foundation models.
  • Cloudflare AI Gateway. Edge-based LLM gateway with routing and caching.
  • Kong AI Gateway. Kong-based API gateway with LLM routing features.
  • Custom internal gateways at frontier AI companies. Every serious AI company builds one; specific designs are often internal.

Why this pattern matters

Different LLM queries have wildly different requirements. Classifying a support ticket into a category is a task that a small cheap model handles well. Debugging a complex distributed-systems issue benefits from a frontier reasoning model. Extracting structured data from an invoice needs vision plus structured-extraction capability. Generating a marketing paragraph depends more on style than on reasoning. If every query goes to the same model, you either overpay (using GPT-4 for classification) or underdeliver (using GPT-3.5 for complex reasoning).

Routing solves this. Each query gets the model appropriate to its complexity, latency requirement, and cost budget. Total cost can drop significantly: reductions on the order of 5-10x without quality regression are plausible when most traffic can be served by a small model. Latency drops for simple queries (small models are faster). Overall system throughput increases because expensive models aren’t tied up by requests that didn’t need them.
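
The arithmetic behind a cost-reduction figure can be checked with a blended-cost calculation. This is a sketch under stated assumptions: it uses illustrative per-query costs of $0.001 (small) and $0.05 (large), treats cost per query as constant, and ignores router overhead and any quality difference.

```python
def blended_cost(share_small: float,
                 cost_small: float = 0.001,
                 cost_large: float = 0.05) -> float:
    """Average cost per query when `share_small` of traffic uses the small model."""
    if not 0.0 <= share_small <= 1.0:
        raise ValueError("share_small must be between 0 and 1")
    return share_small * cost_small + (1.0 - share_small) * cost_large


def reduction_factor(share_small: float,
                     cost_small: float = 0.001,
                     cost_large: float = 0.05) -> float:
    """How many times cheaper routing is than sending everything to the large model."""
    return cost_large / blended_cost(share_small, cost_small, cost_large)
```

Under these assumed prices, sending 80% of traffic to the small model gives roughly a 4.6x reduction and 90% gives roughly 8.5x. The reduction depends far more on the share of traffic that can safely go to the small model than on the exact price of either model.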

The pattern also enables failover and reliability. Multi-provider routing means an outage at OpenAI doesn’t take down the LLM application; the gateway falls back to Anthropic or a self-hosted model. Rate-limit errors at one provider can be redirected to another. Quality regressions at one model (a bad new deployment) can be routed around while the vendor fixes it. Without a gateway with routing, application code has to handle all of this per call, which is rarely done consistently.
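
The fallback behavior can be sketched as an ordered list of backends tried in turn. This is a minimal illustration, not a real gateway's API: BackendError, the backend callables, and the stub functions are all hypothetical, and a production version would add timeouts, backoff, and per-error-type handling.

```python
from typing import Callable, List, Sequence, Tuple


class BackendError(Exception):
    """Hypothetical error a backend callable raises on outage or rate limit."""


def call_with_failover(
    prompt: str,
    backends: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try backends in priority order; return (backend_name, response)."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except BackendError as exc:
            # Record the failure and move on to the next backend.
            errors.append(f"{name}: {exc}")
    raise BackendError("all backends failed: " + "; ".join(errors))


# Stub backends standing in for real provider clients.
def primary_down(prompt: str) -> str:
    raise BackendError("503 service unavailable")


def fallback_ok(prompt: str) -> str:
    return "echo: " + prompt
```

Calling call_with_failover("hi", [("primary", primary_down), ("fallback", fallback_ok)]) returns ("fallback", "echo: hi"). Returning the backend name alongside the response lets the gateway record which provider actually served each request.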

For multi-model organizations, routing also enforces consistency. Model selection is not left to each application developer’s preference; it’s governed by the gateway’s routing rules. This means the organization can enforce cost budgets, quality standards, and security policies uniformly. Without a gateway, each application makes its own choices, which produces both cost sprawl and quality inconsistency.

For basecamp specifically, routing is the mechanism that makes the AI stack economically viable. Simple queries (classification, extraction, formatting) go to self-hosted Llama 3.1 70B on vLLM (~$0 marginal cost). Complex reasoning goes to Claude or GPT-4 (higher per-query cost, better quality). The routing decision determines whether basecamp’s AI features are affordable or budget-blowing.

The failure modes to know: routing rules that don’t reflect actual model capabilities (wrong model for the task); classifier drift as models and use cases evolve; routing that ignores observed quality (bad model choice never gets corrected); overly complex routing that becomes hard to reason about. Each has known mitigations, but operating routing means owning them.

Modern routing is increasingly LLM-based rather than rule-based. A small fast LLM reads the query and decides which model to route to. This handles nuance that rule-based routing misses. The catch is that the router LLM itself has cost and latency — the routing decision needs to be much cheaper than the difference between routing options for the pattern to make sense.
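
The break-even condition for an LLM-as-router can be written down directly. The sketch below assumes the baseline is sending every query to the large model, uses illustrative per-query costs, and considers cost only (the router's added latency needs a separate budget); the function and parameter names are hypothetical.

```python
def net_saving_per_query(share_small: float,
                         cost_router: float,
                         cost_small: float = 0.001,
                         cost_large: float = 0.05) -> float:
    """Saving per query versus sending everything to the large model.

    Baseline cost:  cost_large
    Routed cost:    cost_router + share_small * cost_small
                    + (1 - share_small) * cost_large
    Difference:     share_small * (cost_large - cost_small) - cost_router
    """
    return share_small * (cost_large - cost_small) - cost_router


def router_pays_off(share_small: float, cost_router: float,
                    cost_small: float = 0.001,
                    cost_large: float = 0.05) -> bool:
    """True when the router costs less than the savings it produces."""
    return net_saving_per_query(share_small, cost_router, cost_small, cost_large) > 0
```

With these assumed prices, a router costing $0.0005 per query pays off easily when 90% of traffic can be downgraded, while a router costing $0.01 per query loses money when only 10% can. The router fee is paid on every query, but the saving is earned only on the queries that get downgraded.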

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 46 ships llm-gateway with rule-based routing.
DEEP     Promoted after Y5 Phase 46 — routing operational with measured per-model
         cost + latency + quality, and at least one route-rule change informed by
         observed data.

Preview: what OUTLINE will answer

When Y5 Phase 46 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you dispatch LLM queries across multiple backends to optimize cost, latency, capability, and reliability?
  • PRINCIPLES. Match query characteristics to model capabilities. Route by cost/quality tradeoff. Support failover. Observe per-model quality. Iterate routing rules based on measured data.
  • TRADE-OFFS. Rule-based (fast, brittle) vs classifier-based (accurate, adds latency) vs LLM-as-router (nuanced, expensive). Static routing rules vs dynamic (learned from data). Single provider (simple) vs multi-provider (resilient, complex).
  • TOOLS (time-stamped as of 2026-06): LiteLLM (basecamp candidate), Portkey, Helicone, OpenRouter, Anyscale Aviary, AWS Bedrock, Cloudflare AI Gateway, Kong AI Gateway, custom gateways.

The DEEP promotion, after Y5 Phase 46 with routing informed by observed data, will add MASTERY (operating llm-gateway on basecamp), COMPARE (LiteLLM vs Portkey vs custom gateway), OPERATE (a specific routing-rule change and its measured impact), and CONTRIBUTE (a LiteLLM documentation improvement or public case study).

Canonical references

  • LiteLLM documentation. Free at docs.litellm.ai.
  • Portkey blog on LLM routing patterns. Free.
  • Anyscale blog on multi-model serving. Free at anyscale.com.
  • Cloudflare’s AI Gateway announcement and technical docs. Free.
  • Chip Huyen’s writings on production LLM systems. Free at huyenchip.com.

Cross-references