RAG as Pattern

The pattern: Retrieval-Augmented Generation = three subsystems. Ingest — chunk, embed, store in a vector DB. Retrieve — embed the query, top-K vector search, optionally re-rank with a cross-encoder, optionally hybrid (dense + BM25). Generate — build the prompt with retrieved context, stream the LLM response, optionally cite sources. Each is a separable engineering problem with its own eval surface.
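The three subsystems can be sketched in a few dozen lines. This is a minimal illustration, not the notes-rag implementation: `chunk`, `embed`, `VectorStore`, and `build_prompt` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, the in-memory list stands in for a vector DB, and the generate step stops at prompt construction instead of calling an LLM.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (a deliberately simple chunker)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector DB: rows of (chunk_id, text, vector)."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, str, Counter]] = []

    def ingest(self, doc_id: str, text: str) -> None:
        # Ingest: chunk, embed, store.
        for n, piece in enumerate(chunk(text)):
            self.rows.append((f"{doc_id}#{n}", piece, embed(piece)))

    def retrieve(self, query: str, k: int = 2) -> list[tuple[str, str]]:
        # Retrieve: embed the query with the same function, then exact top-K.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(cid, piece) for cid, piece, _ in ranked[:k]]


def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    # Generate (first half): put retrieved chunks in the prompt, tagged for citation.
    context = "\n".join(f"[{cid}] {piece}" for cid, piece in hits)
    return ("Answer using only the context and cite chunk ids.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


store = VectorStore()
store.ingest("k8s", "Kubernetes schedules pods onto nodes based on resource requests")
store.ingest("pg", "Postgres uses write-ahead logging to stay durable across crashes")

question = "How does Postgres stay durable?"
hits = store.retrieve(question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Each function maps to one eval surface: `chunk` and `embed` can be tested offline against the corpus, `retrieve` against labeled query/chunk pairs, and `build_prompt` plus the model call against answer-quality judgments.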

The trade-off: answer quality vs. system complexity. Naive RAG (dense top-5 + LLM) works on toy data; real-data RAG needs hybrid retrieval, re-ranking, chunking strategy, freshness handling. Quality lift from each technique is workload-dependent — measure with an eval set, don’t optimize on vibes. The most common failure mode: blaming the LLM when the bug is in retrieval.
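One way to "measure with an eval set" is to score retrieval on its own, before any LLM is involved, so that a retrieval bug is not misread as a generation bug. The sketch below is an assumption-laden toy: `hit_rate_at_k`, `rrf_fuse`, the hard-coded `DENSE` and `LEXICAL` rankings, and `EVAL_SET` are all hypothetical, and the canned rankings stand in for a real dense index and a real BM25 index. Hybrid fusion here uses reciprocal rank fusion, which is one common choice among several.

```python
from typing import Callable

# A retriever maps (query, k) to a ranked list of chunk ids.
Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(retriever: Retriever,
                  eval_set: list[tuple[str, set[str]]],
                  k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk id in the top k."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for query, relevant in eval_set
               if relevant & set(retriever(query, k)))
    return hits / len(eval_set)


def rrf_fuse(rankings: list[list[str]], k: int, c: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (c + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Canned rankings standing in for real indexes (hypothetical data).
DENSE = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
LEXICAL = {"q1": ["c", "a", "x"], "q2": ["g", "f", "h"]}


def dense(query: str, k: int) -> list[str]:
    return DENSE[query][:k]


def lexical(query: str, k: int) -> list[str]:
    return LEXICAL[query][:k]


def hybrid(query: str, k: int) -> list[str]:
    # Fuse a deeper candidate list from each retriever, then cut to k.
    return rrf_fuse([dense(query, 10), lexical(query, 10)], k)


# Each eval item: (query, set of chunk ids a human marked as relevant).
EVAL_SET = [("q1", {"c"}), ("q2", {"f"})]

for name, retriever in [("dense", dense), ("lexical", lexical), ("hybrid", hybrid)]:
    for k in (1, 2):
        print(f"{name:8s} hit@{k} = {hit_rate_at_k(retriever, EVAL_SET, k):.2f}")
```

Running the same harness before and after adding a technique (re-ranking, a new chunk size, hybrid search) gives a per-workload number to compare, which is the point of the paragraph above. The toy rankings prove nothing about which technique wins in general.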

Deepens in Year 4 Phase 24: LLM Infra + RAG — building llm-gateway v1 plus notes-rag end-to-end is the DEEP exercise. Reinforced in Phase 25: GPU Infrastructure once drift detection covers embeddings + prompts in production.

inference-shapes — generation is the streaming shape; retrieval is online; ingest is batch.
model-lifecycle — embeddings and rerankers are models with their own promotion story.
feature-store — vector index is a sibling online store; metadata filters live alongside features.
train-serve-skew — the embedding model and text normalization used at ingest time and at query time must be the same, and any re-ingest must reuse the same chunker.
agent-loop — retrieval is a common node inside an agent graph.
tool-use-as-capability — search_docs is the canonical retrieval tool.
oltp-vs-olap — vector search is an OLAP-shaped scan with an ANN index on top.
schema-on-read-vs-write — chunking + metadata is a schema-on-write decision.
basecamp — services/llm-gateway/ and notes-rag live here.
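The train-serve-skew entry in the list above can be guarded mechanically. A minimal sketch, assuming the index stores a fingerprint of the configuration it was built with and the query path refuses to run against a different one: `EmbedConfig`, `ConfigMismatch`, `check_query_config`, and the model name `example-embedder` are hypothetical, not part of llm-gateway or notes-rag.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbedConfig:
    """Everything that changes what vectors look like or what gets indexed."""
    model_name: str
    model_version: str
    normalize: bool
    chunk_size: int
    chunk_overlap: int

    def fingerprint(self) -> str:
        # Stable hash of the config; sort_keys keeps field order irrelevant.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


class ConfigMismatch(RuntimeError):
    """Raised when the query path and the index disagree on configuration."""


def check_query_config(index_fingerprint: str, query_config: EmbedConfig) -> None:
    """Fail fast instead of silently searching with incomparable vectors."""
    actual = query_config.fingerprint()
    if actual != index_fingerprint:
        raise ConfigMismatch(
            f"index built with {index_fingerprint}, query path uses {actual}"
        )


# Ingest side: record the fingerprint next to the index (hypothetical values).
ingest_config = EmbedConfig("example-embedder", "1.0", True, 400, 50)
stored_fingerprint = ingest_config.fingerprint()

# Query side: same config passes, a bumped model version is rejected.
check_query_config(stored_fingerprint, EmbedConfig("example-embedder", "1.0", True, 400, 50))
try:
    check_query_config(stored_fingerprint, EmbedConfig("example-embedder", "2.0", True, 400, 50))
except ConfigMismatch as err:
    print(f"rejected: {err}")
```

Storing the fingerprint as index metadata also ties into the model-lifecycle entry: promoting a new embedding model then implies building a new index, not swapping the model under an existing one.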

Related patterns