RAG as Pattern
The pattern: Retrieval-Augmented Generation = three subsystems. Ingest — chunk, embed, store in a vector DB. Retrieve — embed the query, top-K vector search, optionally re-rank with a cross-encoder, optionally hybrid (dense + BM25). Generate — build the prompt with retrieved context, stream the LLM response, optionally cite sources. Each is a separable engineering problem with its own eval surface.
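The three subsystems can be sketched end to end in a few dozen lines. This is a toy under loud assumptions, not a reference implementation: `embed` is a hashed bag-of-words stand-in for a real embedding model, a plain Python list stands in for the vector DB, retrieval is brute-force cosine rather than ANN, and the generate step stops at prompt construction (no LLM call). All function names here are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # toy embedding width (assumption); a real model fixes its own dimension


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(doc: str, max_words: int = 40) -> list[str]:
    """Fixed-size word windows; real chunkers usually respect document structure."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def ingest(docs: dict[str, str]) -> list[dict]:
    """Ingest: chunk, embed, store. A list of rows stands in for the vector DB."""
    index = []
    for doc_id, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            index.append({"id": f"{doc_id}#{n}", "text": piece, "vector": embed(piece)})
    return index


def retrieve(index: list[dict], query: str, k: int = 2) -> list[dict]:
    """Retrieve: embed the query, then brute-force cosine top-K (unit vectors)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, row["vector"])), row) for row in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:k]]


def build_prompt(query: str, hits: list[dict]) -> str:
    """Generate (first half): assemble the prompt with citable chunk ids."""
    context = "\n".join(f"[{h['id']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context and cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


docs = {
    "bm25": "BM25 is a sparse lexical ranking function based on term frequency.",
    "ann": "Approximate nearest neighbour indexes trade recall for lower query latency.",
}
index = ingest(docs)
hits = retrieve(index, "what is BM25 ranking", k=1)
prompt = build_prompt("what is BM25 ranking", hits)
```

Each function is a seam: `chunk` and `embed` can be swapped and evaluated without touching `retrieve`, and `retrieve` can be scored on ranked ids alone before any LLM is involved.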
The trade-off: answer quality vs. system complexity. Naive RAG (dense top-5 + LLM) works on toy data; real-data RAG needs hybrid retrieval, re-ranking, chunking strategy, freshness handling. Quality lift from each technique is workload-dependent — measure with an eval set, don’t optimize on vibes. The most common failure mode: blaming the LLM when the bug is in retrieval.
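A minimal sketch of "measure with an eval set", scoring retrieval in isolation so a retrieval bug is not misread as an LLM bug. Assumptions: the eval set is a list of (query, set of relevant chunk ids) pairs, and a retriever is any function from a query to a ranked list of chunk ids. The two canned retrievers and their rankings are made-up illustration data, not measured results.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked chunk ids
EvalSet = list[tuple[str, set[str]]]    # (query, relevant chunk ids)


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(retriever: Retriever, eval_set: EvalSet, k: int = 5) -> dict[str, float]:
    """Mean recall@k and mean reciprocal rank over the eval set."""
    recalls, rrs = [], []
    for query, relevant in eval_set:
        ranked = retriever(query)  # call once per query
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    n = len(eval_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}


# Hypothetical eval set and canned rankings, purely for illustration.
eval_set: EvalSet = [("q1", {"a"}), ("q2", {"b", "c"})]
rankings_a = {"q1": ["x", "a", "y"], "q2": ["b", "x", "y"]}
rankings_b = {"q1": ["a", "x", "y"], "q2": ["b", "c", "x"]}

report_a = evaluate(lambda q: rankings_a[q], eval_set, k=2)
report_b = evaluate(lambda q: rankings_b[q], eval_set, k=2)
print(report_a)  # {'recall@k': 0.75, 'mrr': 0.75}
print(report_b)  # {'recall@k': 1.0, 'mrr': 1.0}
```

Running the same eval set before and after adding hybrid retrieval, a re-ranker, or a new chunking strategy gives a per-technique lift for this workload, which is the number the trade-off depends on.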
Deepens in Year 4 Phase 24: LLM Infra + RAG — building llm-gateway v1 plus notes-rag end-to-end is the DEEP exercise. Reinforced in Phase 25: GPU Infrastructure once drift detection covers embeddings + prompts in production.
Related patterns
- inference-shapes — generation is the streaming shape; retrieval is online; ingest is batch.
- model-lifecycle — embeddings and rerankers are models with their own promotion story.
- feature-store — vector index is a sibling online store; metadata filters live alongside features.
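- train-serve-skew — the embedding model and text preprocessing must be identical at ingest time and at query time; chunking happens only at ingest, so changing the chunker means re-indexing.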
- agent-loop — retrieval is a common node inside an agent graph.
- tool-use-as-capability — search_docs is the canonical retrieval tool.
- oltp-vs-olap — vector search is an OLAP-shaped scan with an ANN index on top.
- schema-on-read-vs-write — chunking + metadata is a schema-on-write decision.
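One way to make the model-lifecycle and train-serve-skew links above mechanical is to stamp the index with a fingerprint of the ingest configuration and refuse queries from a mismatched configuration. A minimal sketch, assuming the configuration can be captured in a small record; `PipelineConfig`, its fields, `SkewError`, and the model names are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Everything that changes what vectors mean (hypothetical field set)."""
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    normalizer: str

    def fingerprint(self) -> str:
        # Stable serialisation so equal configs always hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


class SkewError(RuntimeError):
    """Raised when the query path does not match the config the index was built with."""


def check_compatible(index_fingerprint: str, query_config: PipelineConfig) -> None:
    """Fail fast instead of silently searching with mismatched vectors."""
    actual = query_config.fingerprint()
    if index_fingerprint != actual:
        raise SkewError(f"index built with {index_fingerprint}, query path is {actual}")


ingest_cfg = PipelineConfig("embedder-v1", 512, 64, "lowercase")
stored_fingerprint = ingest_cfg.fingerprint()  # persisted next to the index at ingest

check_compatible(stored_fingerprint, ingest_cfg)  # same config: passes silently

query_cfg = PipelineConfig("embedder-v2", 512, 64, "lowercase")  # model was upgraded
try:
    check_compatible(stored_fingerprint, query_cfg)
    skew_detected = False
except SkewError:
    skew_detected = True  # a re-index is needed before serving with embedder-v2
```

Treating the fingerprint as index metadata also gives embeddings a promotion story: a new embedder ships as a new index with a new fingerprint, not as an in-place swap.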
basecamp — services/llm-gateway/ and notes-rag live here.