LLM Caching

Exact + semantic caching for LLM responses. Save cost + latency by deduplicating identical and similar prompts. Redis (exact) + pgvector (semantic) is the canonical pair.

Identical prompts → identical responses. Similar prompts → maybe identical responses. The cost savings compound, but a semantic cache has a poisoning rate (the share of hits that return a wrong cached answer) that must be measured. Status: STUB, to be promoted to OUTLINE in Y5 Phase 46.

What this pattern is

LLM caching deduplicates inference for repeated or similar prompts. Exact caching (key = hash of prompt + model + parameters; value = response) is straightforward — Redis-backed, with TTL. It’s most valuable when prompts are mechanically generated (the same RAG prompt on the same document; the same evaluation prompt on the same input). Semantic caching (embed prompt → search vector store → return cached response if similarity > threshold) catches paraphrases — “Tell me about X” and “What is X?” produce the same response. Semantic caching adds poisoning rate as an operational concern: a too-low threshold returns wrong cached answers for queries that look similar but aren’t equivalent.
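As a minimal sketch of the exact-cache half, the key derivation and TTL behaviour might look like the following. The `cache_key` and `TTLCache` names, the `llmcache:` key prefix, and the in-memory dict standing in for Redis are all assumptions made for illustration, not the basecamp implementation.

```python
import hashlib
import json
import time


def cache_key(prompt: str, model: str, params: dict) -> str:
    """Hash prompt + model + parameters into a stable cache key."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return "llmcache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in (assumption) for a Redis-style store with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, response)

    def set(self, key: str, response: str, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, response)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return response
```

Including the model and sampling parameters in the key keeps a response generated under one configuration from being served for another. In a real deployment the store's own expiry mechanism would replace the manual check in `get`.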

The pattern composes with llm-routing (cached response means no upstream call), with vector-search (semantic cache is vector search applied to past prompts), and with evals (eval-gate the semantic-cache threshold).
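The composition above can be sketched as a lookup order: exact cache first, then semantic cache, and only a full miss reaches the upstream (routed) call. The `answer` function and its injected callables are hypothetical names for illustration; a real gateway would supply its own cache and routing clients.

```python
from typing import Callable, Optional, Tuple


def answer(
    prompt: str,
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[str], Optional[str]],
    call_upstream: Callable[[str], str],
    store: Callable[[str, str], None],
) -> Tuple[str, str]:
    """Return (response, source), where source names the layer that answered."""
    cached = exact_get(prompt)          # cheapest check first
    if cached is not None:
        return cached, "exact"
    cached = semantic_get(prompt)       # vector search over past prompts
    if cached is not None:
        return cached, "semantic"
    response = call_upstream(prompt)    # only a full miss reaches routing
    store(prompt, response)             # populate the cache for next time
    return response, "upstream"
```

Returning the source alongside the response makes it straightforward to log hit rates per layer, which is what an eval gate on the semantic threshold would need.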

Caching in LLM contexts also includes KV cache reuse at the serving layer (automatic prefix caching in vLLM, RadixAttention in SGLang). This is a different level of cache: it operates on the internal attention state, not on responses. Prefix caching is transparent to the application and can produce dramatic latency improvements when many requests share a common prefix (e.g., a shared system prompt across all requests). This is technically a serving optimization, not application-level caching, but it’s often discussed together with response caching.
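To illustrate why a shared prefix is reusable, here is a toy model of block-level prefix matching. It is a conceptual sketch only, not vLLM's or SGLang's actual implementation. The block size, hashing scheme, and function names are assumptions, and the "cache" is a set of hashes rather than real attention state.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real servers use larger blocks


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with everything that precedes it."""
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        # Chaining in the previous digest ties a block to its whole prefix.
        digest = hashlib.sha256(prev + repr(block).encode("utf-8")).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def reusable_blocks(tokens, cached):
    """Count leading blocks already in `cached`, then record all blocks."""
    hashes = block_hashes(tokens)
    reused = 0
    for h in hashes:
        if h not in cached:
            break
        reused += 1
    cached.update(hashes)
    return reused
```

Because each hash covers the full prefix, two requests share cached blocks only while their token sequences are identical from the start. This is why a common system prompt placed first benefits most.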

The pattern’s operational value depends heavily on workload characteristics. Applications with high prompt reuse (batch jobs, evaluation loops, RAG on stable corpora) get large cost savings. Applications with unique prompts per query (interactive chat, code generation) get less benefit. Measuring cache hit rate is essential. As rough rules of thumb: below about 5%, the cache overhead may exceed savings; above about 20%, the cache is significantly reducing cost and latency; in between, the verdict depends on how per-call LLM cost compares with lookup and infrastructure cost.
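A back-of-the-envelope model makes the hit-rate argument concrete. The function names and every figure in the usage note are hypothetical. The model assumes each request pays one cache lookup and each hit avoids exactly one upstream call.

```python
def net_savings(requests, hit_rate, llm_cost, lookup_cost, fixed_cost=0.0):
    """Savings over a period, in the same currency unit as the inputs."""
    avoided = requests * hit_rate * llm_cost        # upstream calls not made
    overhead = requests * lookup_cost + fixed_cost  # every request pays a lookup
    return avoided - overhead


def break_even_hit_rate(requests, llm_cost, lookup_cost, fixed_cost=0.0):
    """Hit rate at which avoided spend equals cache overhead."""
    return (requests * lookup_cost + fixed_cost) / (requests * llm_cost)
```

With illustrative inputs of 100,000 requests per month, $0.05 per LLM call, $0.0001 per lookup, and $200 of fixed cache infrastructure, the break-even hit rate is about 4.2%. A 2% hit rate loses about $110 per month, while a 20% hit rate saves about $790.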

Concrete instances in the wild

basecamp caching (planned, Y5 Phase 46). Redis for exact cache + pgvector for semantic cache, integrated into llm-gateway.
GPTCache. OSS caching layer for LLMs. Exact + semantic caching with pluggable backends.
LangChain LLM cache. Framework-level caching with pluggable backends (Redis, SQLite, Postgres).
LiteLLM caching. Built into LiteLLM gateway; Redis + Qdrant backend options.
Portkey caching. Commercial LLM gateway with cache built in.
Helicone caching. Commercial gateway with caching + observability.
vLLM automatic prefix caching. Serving-level KV cache reuse for shared prompt prefixes.
SGLang RadixAttention. Structured KV cache for shared prefixes in tree-shaped prompts.
Anthropic prompt caching (2024). API-level caching where the API reuses marked prompt prefixes (system prompt, tool definitions, earlier messages) across requests.
OpenAI prompt caching (2024). Similar feature; automatic cache of shared prompt prefixes.

Why this pattern matters

LLM inference is expensive per query. Frontier model calls cost cents; long-context calls cost tens of cents. For applications with high query volume, these costs compound quickly. Caching converts repeated cost to one-time cost. A prompt that costs $0.05 to answer, cached and served 1000 times, costs $0.05 total instead of $50. For applications with structural prompt reuse, this is transformative economics.

Latency benefits are also significant. LLM inference latency is dominated by generation time — often seconds for long responses. A cache hit returns in milliseconds. For applications where latency matters (interactive chat, code completion, real-time features), cache hits dramatically improve user experience. Interactive apps become responsive; batch jobs finish faster.

For basecamp specifically, caching is what makes the ops-handbook RAG chatbot economically viable. Similar operational questions come up repeatedly (“how do I restart the ingest pipeline?”). Semantic caching means the same underlying answer serves many phrasings of the same question. Without caching, each phrasing costs a full LLM call; with caching, the second phrasing costs a cache lookup.

The pattern also matters for evaluation and testing loops. When developing an LLM application, you might run the same eval suite hundreds of times as you iterate. Without caching, each eval run costs full LLM inference. With caching, only the first run pays inference cost; subsequent runs pay cache lookup cost. Development velocity increases because iteration cycles get faster.

Semantic caching specifically enables handling paraphrases and query variations. Users don’t ask the same question the same way twice. “How do I reset my password?” and “I forgot my password, what do I do?” have the same answer. Exact caching misses this; semantic caching catches it. The catch is the poisoning risk — semantically similar queries might have different answers (“How do I reset my password?” vs “How do I reset my email password?”). Setting the semantic similarity threshold correctly is what separates useful semantic caching from confusion-producing semantic caching.
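The threshold trade-off can be shown with a toy semantic cache. The three-dimensional vectors in the usage note are hand-made stand-ins for real embeddings, and the `SemanticCache` class with its linear scan is an illustrative assumption. A production cache would use an embedding model and a vector index such as pgvector.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the nearest cached response if similarity > threshold."""
        best_score, best_response = -1.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


# Hypothetical embeddings, chosen by hand for the example.
RESET_PASSWORD = [1.0, 0.0, 0.1]          # "How do I reset my password?"
FORGOT_PASSWORD = [0.98, 0.05, 0.12]      # paraphrase, same answer
RESET_EMAIL_PASSWORD = [0.8, 0.6, 0.1]    # looks similar, different answer
```

With these toy vectors the paraphrase scores close to 1.0 against the cached prompt and the email variant scores about 0.80. A threshold of 0.95 therefore serves the paraphrase and rejects the email question, while a threshold of 0.75 would return the account-password answer for both.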

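The pattern is also increasingly built into the LLM APIs themselves. Anthropic and OpenAI both added prompt caching in 2024, where the API caches shared prompt prefixes across requests. OpenAI applies it automatically to sufficiently long prompts; Anthropic's 2024 release is opt-in, with the application marking cacheable prefixes in the request. Either way, it often provides significant cost savings with little or no application-level cache implementation. Provider-level caching discounts repeated input tokens but still generates a fresh response, so application-level caching still adds value on top (whole-response caching, cross-provider caching) but is now one layer among several.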

The failure modes to know: exact cache hit rate too low to justify the infrastructure (measure before deploying); semantic cache threshold too low, producing wrong cached answers (measure poisoning rate); cache invalidation forgotten (stale answers persist beyond their relevance); cache becomes critical infrastructure whose downtime breaks apps (needs monitoring). Each has known mitigations, but operating LLM caching means owning them.
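Measuring poisoning rate can be sketched as a threshold sweep over a labelled evaluation set. The sketch assumes each evaluation pair records the similarity between a query and its nearest cached prompt, plus a human label for whether the cached answer would be correct. The function names and the data in the usage note are hypothetical.

```python
def rates(pairs, threshold):
    """pairs: (similarity, is_equivalent) per query vs. nearest cached prompt.

    Returns (hit_rate, poisoning_rate) at the given threshold.
    """
    hits = [equiv for sim, equiv in pairs if sim > threshold]
    hit_rate = len(hits) / len(pairs)
    # Poisoned hits are served from cache but labelled not equivalent.
    poisoning_rate = (hits.count(False) / len(hits)) if hits else 0.0
    return hit_rate, poisoning_rate


def pick_threshold(pairs, candidates, max_poisoning):
    """Lowest candidate threshold whose poisoning rate stays within budget."""
    for t in sorted(candidates):
        _, poisoning = rates(pairs, t)
        if poisoning <= max_poisoning:
            return t
    return None  # no candidate is safe enough


# Hypothetical labelled evaluation set.
EVAL_PAIRS = [
    (0.99, True), (0.97, True), (0.93, True), (0.91, False),
    (0.88, True), (0.82, False), (0.70, False), (0.60, False),
]
```

On this toy data, a threshold of 0.80 gives a 75% hit rate with one third of hits poisoned, and 0.92 gives a 37.5% hit rate with none poisoned. Choosing the lowest threshold within the poisoning budget keeps as many hits as the budget allows, since hit rate can only fall as the threshold rises.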

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 46 implements Redis + pgvector caching in
         llm-gateway.
DEEP     Out of scope unless ops reveal sustained cache use. Default: OUTLINE.

Preview: what OUTLINE will answer

When Y5 Phase 46 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you reduce LLM inference cost and latency by not recomputing responses to repeated or similar queries?
PRINCIPLES. Exact caching for mechanically-repeated prompts. Semantic caching for paraphrases. TTL for freshness. Measure hit rate and poisoning rate. Combine with routing and provider-level caching. Cache invalidation matters.
TRADE-OFFS. Exact (safe, misses paraphrases) vs semantic (catches variations, poisoning risk). Application-level (control) vs provider-level (Anthropic/OpenAI caching — automatic, less flexible). Aggressive TTL (fresh, low hit rate) vs long TTL (stale risk, high hit rate). Semantic threshold high (few hits, safe) vs low (many hits, risk).
TOOLS (time-stamped as of 2026-06): Redis + pgvector (basecamp default), GPTCache, LangChain cache, LiteLLM cache, Portkey cache, Helicone cache, vLLM prefix caching, SGLang RadixAttention, Anthropic/OpenAI provider-level caching.

The DEEP promotion is out of scope for basecamp default; if pursued, it would add MASTERY (operating cache with measured hit and poisoning rates), COMPARE (application-level vs provider-level caching), OPERATE (a specific cache-configuration event), and CONTRIBUTE (a GPTCache or LiteLLM documentation improvement).

Canonical references

GPTCache documentation. Free at github.com/zilliztech/GPTCache.
vLLM documentation on automatic prefix caching. Free.
Anthropic prompt caching documentation. Free at anthropic.com.
OpenAI prompt caching documentation. Free.
Portkey and Helicone blogs on LLM caching patterns. Free.

Cross-references

Y5 Phase 46: LLM Gateway
Related: llm-routing, vector-search, llm-serving, evals