Token-Bucket Rate Limiting

Each request consumes tokens; tokens refill at a fixed rate. Burst tolerance + steady-state limit. The canonical rate-limiting algorithm at every layer from HTTP to LLM tokens.

A bucket holds N tokens and refills at rate R; each request consumes one (or more). Bursts are allowed up to N; steady-state throughput is capped at R. Simple, robust, composable. Status: STUB, to be promoted to OUTLINE in Y2 Phase 16.

What this pattern is

Token-bucket rate limiting models a quota as a bucket with capacity N (the burst limit) that refills at rate R tokens per second (the steady-state limit). Each request consumes one token, or more for weighted requests, which matters when the cost is measured in LLM tokens. If the bucket holds fewer tokens than the request costs, the request is rejected (or queued). The pattern naturally tolerates short bursts above the average rate while bounding sustained traffic.
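A minimal single-process sketch of the bucket described above. The class name, method names, and the injectable clock are illustrative assumptions, not taken from any particular library; a real deployment would also need locking or shared state.

```python
import time


class TokenBucket:
    """Token bucket with capacity N and refill rate R tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self._clock = clock              # injectable so tests can control time
        self._tokens = float(capacity)   # start full: a new caller may burst
        self._last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)

    def try_consume(self, cost=1.0):
        """Take `cost` tokens if available; return True if the request may proceed."""
        self._refill()
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False
```

Calling `TokenBucket(capacity=5, refill_rate=1).try_consume()` in a tight loop admits the first five calls and then rejects until roughly one second has passed per additional token. Passing `cost` greater than 1 gives weighted requests.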

The pattern fits most production needs better than simpler approaches. Fixed-window counters have a boundary problem: a burst that straddles two windows can briefly admit up to twice the configured limit. Sliding-window logs are precise but memory-expensive (one entry per request). Token bucket sits in a sweet spot: bounded memory (two values per bucket, the current level and the last-refill time), configurable burst tolerance (bucket capacity), and predictable steady-state (refill rate).
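The fixed-window boundary problem can be made concrete with a small simulation. The numbers are assumptions chosen for illustration: a limit of 100 requests per minute, and a client that sends 100 requests just before a window boundary and 100 just after it. Both limiter classes are simplified sketches.

```python
class FixedWindowCounter:
    """Allow up to `limit` requests per aligned window of `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.window_id:   # a new window resets the counter
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class TokenBucket:
    """Capacity `capacity`, refilled at `rate` tokens per second; starts full."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(limiter, timestamps):
    """Count how many of the timestamped requests the limiter lets through."""
    return sum(limiter.allow(t) for t in timestamps)


# 100 requests just before the one-minute boundary, 100 more just after it.
straddling_burst = [59.9] * 100 + [60.0] * 100
```

With these inputs, `admitted(FixedWindowCounter(100, 60), straddling_burst)` lets all 200 requests through within a tenth of a second, while `admitted(TokenBucket(100, 100 / 60), straddling_burst)` admits only the first 100, because the bucket has had no time to refill.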

The pattern is the same across layers. HTTP middleware limits requests per client. API Gateway throttles API calls per API key. AWS SDK enforces retry budgets to prevent client-side runaway. gRPC client-side limits protect downstream services. LLM gateways (Y5 Phase 46) enforce per-tenant plus per-model token-bucket limits where the “tokens” are literal LLM tokens consumed per request. The bucket algorithm doesn’t care what the “tokens” represent; the semantics just change.
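As a sketch of how the same algorithm carries over to the LLM case, the example below keys one bucket per (tenant, model) pair and charges each request its LLM token count. The model names, limits, and class names are hypothetical and are not the llm-gateway design.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    capacity: float
    refill_rate: float   # tokens per second
    tokens: float
    last: float          # timestamp of the last refill

    def try_consume(self, cost, now):
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """One bucket per (tenant, model); limits are looked up per model."""

    def __init__(self, limits):
        # limits: {model_name: (capacity, refill_rate)}, a hypothetical config shape
        self._limits = limits
        self._buckets = {}

    def allow(self, tenant, model, llm_tokens, now):
        key = (tenant, model)
        bucket = self._buckets.get(key)
        if bucket is None:
            capacity, rate = self._limits[model]
            bucket = self._buckets[key] = Bucket(capacity, rate, capacity, now)
        # The request's weight is its LLM token count.
        return bucket.try_consume(llm_tokens, now)
```

Because the key includes both tenant and model, an expensive model can be given a small, slow bucket while a cheap one gets a large, fast bucket, and one tenant exhausting its bucket does not affect another.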

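redis-cell, a Redis module that implements the generic cell rate algorithm (GCRA, a close relative of token bucket that stores a single timestamp instead of a token count), is a widely used distributed implementation when state must be shared across replicas. Client-side implementations exist, but each client sees only its own traffic, so they cannot enforce a global limit; server-side implementations that share state via Redis are the modern reference.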
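The redis-cell README describes its algorithm as GCRA (generic cell rate algorithm). The sketch below is an in-memory illustration of GCRA only, not redis-cell's code; the class and parameter names are assumptions. It shows why the state per key can be a single number: the "theoretical arrival time" of the next conforming request.

```python
class GCRA:
    """In-memory GCRA limiter: `capacity` burst, `rate` requests per second."""

    def __init__(self, capacity, rate):
        self.emission_interval = 1.0 / rate                    # seconds "charged" per unit of cost
        self.tolerance = capacity * self.emission_interval     # how far ahead of `now` we may run
        self.tat = 0.0                                         # theoretical arrival time: the only state

    def allow(self, now, cost=1):
        tat = max(self.tat, now)                 # an idle limiter does not bank extra credit
        new_tat = tat + cost * self.emission_interval
        if new_tat - self.tolerance > now:
            return False                         # would exceed the burst allowance; state unchanged
        self.tat = new_tat
        return True
```

In a shared deployment the single `tat` value would live under a key in the shared store, and the read-compare-write in `allow` would need to execute atomically so that concurrent replicas cannot both spend the same allowance.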

Concrete instances in the wild

HTTP rate limiting middleware. nginx limit_req module, Envoy rate limit service, Traefik middleware — all implement token-bucket or leaky-bucket variants at the HTTP layer.
AWS API Gateway throttling. Per-API-key burst plus steady-state limits. AWS docs describe them as token bucket explicitly.
AWS SDK retry budgets. A client-side retry quota: retries draw from a token bucket that successful calls replenish, so retrying stops once the quota is spent. Same pattern, applied to client-side outbound traffic.
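gRPC keepalive limits. gRPC servers enforce a minimum interval between client keepalive pings and close connections that ping too often. This bounds a rate rather than implementing a full token bucket, but it serves the same goal of keeping clients from overwhelming servers.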
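Twitter API rate limits. Historically the reference for public API rate limiting. Per-endpoint, per-access-token limits over 15-minute windows, with remaining counts visible in response headers. Strictly a fixed-window scheme, listed here as the familiar public example.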
llm-gateway (Y5 Phase 46). Token-bucket rate limits where “tokens” are LLM tokens. Per-tenant per-model buckets with different rates for cheap vs expensive models.
redis-cell. A community-maintained Redis module, not an official Redis product. Its CL.THROTTLE command implements GCRA, a close relative of token bucket, over a Redis key.
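Cloudflare Rate Limiting. Edge-based rate limiting; Cloudflare’s engineering blog describes a sliding-window counter approach rather than a strict token bucket. Applied per IP, per URL, per user agent, or any combination of these.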
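Kubernetes API Priority and Fairness. A related fairness mechanism in the K8s API server: it caps concurrent in-flight requests per priority level and queues the excess, rather than refilling tokens at a rate. Distributes API-server capacity across client priorities.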

Why this pattern matters

Without rate limiting, downstream services get overwhelmed. A misbehaving client (or a legitimate client that suddenly scales) can consume all downstream capacity, cascading into service failures. Rate limits prevent the cascade by giving each caller a bounded share and rejecting requests beyond that share. The protected service can serve everyone (albeit slowly under pressure) instead of failing for everyone.

The pattern also enables cost control. LLM gateways demonstrate this vividly. Without rate limits, a runaway agent can burn through a company’s daily LLM budget in minutes. With per-tenant token buckets, each tenant has a predictable maximum spend; overages become bounded and visible. Same math applies to any pay-per-request service: databases, third-party APIs, expensive compute.
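The "predictable maximum spend" follows from a simple bound: over any window of T seconds, a bucket with capacity N and refill rate R can release at most N + R × T tokens (a full bucket at the start plus everything refilled during the window). A small sketch of that arithmetic, with hypothetical function names and a hypothetical price:

```python
def max_tokens_consumed(capacity, refill_rate, window_seconds):
    """Upper bound on tokens a single bucket can release in any window."""
    return capacity + refill_rate * window_seconds


def max_spend(capacity, refill_rate, window_seconds, price_per_1k_tokens):
    """Worst-case cost for one bucket over the window, at a flat per-token price."""
    tokens = max_tokens_consumed(capacity, refill_rate, window_seconds)
    return tokens / 1000 * price_per_1k_tokens


# Hypothetical tenant: 50,000-token burst, 100 tokens/second, over one day.
daily_tokens = max_tokens_consumed(50_000, 100, 86_400)   # 8,690,000 tokens
daily_cost = max_spend(50_000, 100, 86_400, 0.01)          # about 86.90 at 0.01 per 1k tokens
```

The bound holds regardless of how the traffic is shaped within the window, which is what makes the per-tenant ceiling predictable.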

The pattern’s burst-tolerance property matters more than it looks. Real traffic is bursty. A steady-state limit without burst tolerance means legitimate spikes get rejected. A generous burst limit with tight steady-state means legitimate spikes are absorbed while sustained abuse is bounded. Token bucket gives you both dials independently: raise the bucket capacity to absorb bigger bursts, lower the refill rate to constrain steady-state.
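The independence of the two dials can be checked by replaying traffic through buckets with different settings. The traces and parameter values below are made up for illustration.

```python
def count_admitted(capacity, refill_rate, arrivals):
    """Replay sorted arrival times (seconds) through a token bucket that starts full."""
    tokens = float(capacity)
    last = arrivals[0] if arrivals else 0.0
    admitted = 0
    for now in arrivals:
        tokens = min(capacity, tokens + (now - last) * refill_rate)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0
            admitted += 1
    return admitted


# Hypothetical traces.
spike = [0.0] * 20                          # 20 requests at the same instant
sustained = [i * 0.5 for i in range(120)]   # 2 requests/second for 60 seconds

# Capacity dial: only the spike outcome changes.
spike_small = count_admitted(10, 1.0, spike)   # 10 of 20 admitted
spike_large = count_admitted(20, 1.0, spike)   # all 20 admitted

# Refill dial: sustained throughput changes, spike handling does not.
sustained_fast = count_admitted(20, 1.0, sustained)
sustained_slow = count_admitted(20, 0.5, sustained)
```

Raising capacity from 10 to 20 admits the whole 20-request spike, while halving the refill rate reduces how much of the sustained trace gets through (79 versus 49 of 120 requests here) without changing how the spike is treated.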

Modern rate-limiting failure modes are well-documented. Ineffective (limits too high): attacks can still overwhelm the service. Overreaching (limits too low): legitimate users hit them constantly, driving support burden. Misconfigured (per-IP instead of per-user): corporate NATs share IPs, so a single bad actor blocks the whole company. The senior operator’s discipline: measure real usage, set limits at a high percentile of real usage (p99, for example) plus headroom, and revisit as usage grows.
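One way to turn "percentile plus headroom" into numbers is sketched below. The choice of the 99th percentile, the 1.5× headroom factor, the nearest-rank percentile method, and the input shapes are all assumptions for illustration; the right values depend on the workload.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_limits(observed_rates, observed_bursts, p=99, headroom=1.5):
    """Suggest (capacity, refill_rate) from per-client measurements.

    observed_rates:  sustained requests/second seen per client
    observed_bursts: largest short burst (request count) seen per client
    """
    refill_rate = percentile(observed_rates, p) * headroom
    capacity = math.ceil(percentile(observed_bursts, p) * headroom)
    return capacity, refill_rate
```

Using a percentile rather than the maximum keeps a single outlier (for example, one runaway client) from inflating the limit for everyone; rerunning the calculation periodically covers the "revisit as usage grows" step.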

Different rate-limiting algorithms have different failure modes. Fixed-window: susceptible to boundary bursts. Sliding-window log: memory-expensive at scale. Sliding-window counter: approximation errors near boundaries. Token bucket: burst tolerance is a feature; the trade-off is you need a good sense of what “reasonable burst” looks like for your workload.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y2 Phase 16 (backend at scale) implements token-bucket
         middleware.
DEEP     Promoted after Y5 Phase 46 — llm-gateway implements per-tenant +
         per-model token-bucket limits with both request count AND token cost.

Preview: what OUTLINE will answer

When Y2 Phase 16 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you protect a service from being overwhelmed by legitimate spikes or misbehaving clients while still tolerating short bursts?
PRINCIPLES. Bucket capacity bounds burst. Refill rate bounds steady-state. Weighted consumption (some requests cost more) enables cost-sensitive limiting. Distributed state via Redis or similar for cross-replica consistency. Failing closed (reject when unsure) preserves protection.
TRADE-OFFS. Token bucket vs leaky bucket vs sliding window (algorithmic differences with real operational implications). Client-side (fast, unreliable) vs server-side (accurate, more infrastructure). Per-user vs per-IP vs per-region (different granularities for different threats).
TOOLS (time-stamped as of 2026-06): nginx limit_req, Envoy rate limit service, Traefik middleware, Redis cell, cloud-managed rate limiters (AWS API Gateway, Cloudflare, Google Cloud), K8s API Priority and Fairness.

The DEEP promotion, after Y5 Phase 46 with llm-gateway operational, will add MASTERY (operating token-bucket limits across HTTP and LLM layers), COMPARE (token bucket vs leaky bucket vs sliding window in practice), OPERATE (a real incident where rate limiting prevented cascade or blocked legitimate users), and CONTRIBUTE (an OSS rate-limiting library contribution or blog post on tuning).

Canonical references

ACM SIGCOMM’s networking literature on token bucket algorithms — the theoretical foundation from the 1990s.
Cloudflare’s engineering blog posts on rate limiting at edge scale. Free at blog.cloudflare.com.
Stripe’s engineering blog on rate limiting patterns — the SaaS operational perspective.
Marc Brooker’s blog posts on adaptive rate limiting and backpressure — modern practitioner treatment.
The redis-cell README (github.com/brandur/redis-cell) and Redis documentation on rate-limiting patterns.

Cross-references

Y2 Phase 16: Backend at Scale
Y5 Phase 46: LLM Gateway
Related: backpressure, llm-routing, llm-caching