RAG (Retrieval-Augmented Generation)
Retrieve relevant context from a corpus, inject into the LLM prompt, generate. The pattern that lets LLMs answer with grounded, cited, up-to-date information.
Retrieve. Augment. Generate. The corpus is the LLM’s working memory; the embedding model is its index; the prompt template is the bridge.
Status: STUB, to be promoted to OUTLINE in Y5 Phase 42.
What this pattern is
RAG (Retrieval-Augmented Generation) is the three-stage shape that lets an LLM answer questions using information it wasn’t trained on. Retrieve: embed the user query, search a vector store (vector-search) for the top-K most relevant chunks. Augment: inject those chunks into the LLM prompt as context. Generate: the LLM produces an answer grounded in the retrieved context, ideally with citations. The pattern lets LLMs work with proprietary documents, fresh data, and corpora too large to fit in any single context window — without retraining the model.
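The three stages can be sketched end to end in a few lines. This is a minimal illustration, not the basecamp pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory `CHUNKS` dict stands in for a vector store, and `fake_llm`, the chunk ids, and the chunk texts are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Retrieve: rank every chunk by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]


def augment(query: str, retrieved: list[tuple[str, str]]) -> str:
    """Augment: inject the retrieved chunks into the prompt, tagged with their ids."""
    context = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in retrieved)
    return (
        "Answer using only the context below. Cite chunk ids in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query: str, chunks: dict[str, str], llm: Callable[[str], str], k: int = 2):
    """Generate: call the LLM on the augmented prompt; also return the ids shown to it."""
    retrieved = retrieve(query, chunks, k)
    return llm(augment(query, retrieved)), [chunk_id for chunk_id, _ in retrieved]


# Hypothetical corpus and LLM stub, for illustration only.
CHUNKS = {
    "pm-014": "Postmortem: the postgres primary ran out of disk because WAL archiving stalled.",
    "rb-003": "Runbook: rotate TLS certificates on the ingress controller before expiry.",
    "pm-022": "Postmortem: DNS resolution failures caused by an exhausted conntrack table.",
}


def fake_llm(prompt: str) -> str:
    return "WAL archiving stalled and filled the disk [pm-014]."


text, shown_ids = answer("postgres disk full WAL", CHUNKS, fake_llm, k=1)
print(text, shown_ids)
```

In a real deployment the ranking step would be a nearest-neighbour query against the vector store rather than a full sort over the corpus; the shape of the three functions stays the same.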
The /root canonical RAG use case is mcp-ops-handbook: 5 years of weekly logs + postmortems + runbooks indexed in pgvector; the Y5 AIOps agent retrieves similar past incidents when triaging new alerts. The pattern is the load-bearing primitive of nearly every “AI for your docs” product, every internal copilot, every modern semantic-search system.
RAG’s operational quality depends heavily on retrieval quality. If the retrieval step returns irrelevant chunks, the LLM will generate plausible-sounding but wrong answers grounded in the wrong context. Improving RAG means improving retrieval — better embedding models, better chunking strategies, better re-ranking after initial retrieval, hybrid search combining semantic and keyword. The LLM’s role in RAG is smaller than it appears; the retrieval infrastructure does most of the work.
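One common way to combine semantic and keyword results is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. The sketch below assumes the two rankings have already been produced by separate retrievers; the chunk ids are hypothetical, and `k=60` is the constant conventionally used with RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical output of a vector search and a keyword (e.g. BM25) search.
semantic_ranking = ["pm-014", "pm-022", "rb-003"]
keyword_ranking = ["rb-007", "pm-014", "pm-022"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Chunks that appear near the top of both lists rise above chunks that only one retriever found, which is the behaviour hybrid search is after.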
The pattern has evolved substantially. Naive RAG (single retrieval, direct prompt injection) is the starting point. Advanced patterns include multi-hop retrieval (retrieve; use result to formulate follow-up retrieval), self-querying (the LLM writes its own retrieval query from the user’s question), hypothetical document embeddings (HyDE — LLM generates a hypothetical answer; embed that instead of the query), and agentic RAG (the LLM decides when and what to retrieve mid-conversation). Each addresses a specific weakness of naive RAG.
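As an illustration of one of these variants, here is a minimal HyDE sketch. It assumes a toy token-overlap score in place of real embedding similarity, and a stubbed `fake_llm` that returns a canned hypothetical passage; the corpus, ids, and question are hypothetical. The point is only the control flow: search with the generated passage instead of the raw question.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(a: str, b: str) -> float:
    """Jaccard token overlap: a toy stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def search(text: str, chunks: dict[str, str], k: int = 1) -> list[str]:
    """Return the ids of the k chunks most similar to the given text."""
    return sorted(chunks, key=lambda cid: overlap_score(text, chunks[cid]), reverse=True)[:k]


def hyde_search(question: str, chunks: dict[str, str], llm: Callable[[str], str], k: int = 1) -> list[str]:
    """HyDE: ask the LLM for a hypothetical answer, then search with that answer."""
    hypothetical = llm(f"Write a short passage that answers: {question}")
    return search(hypothetical, chunks, k)


# Hypothetical corpus: the question shares almost no vocabulary with the right chunk.
CHUNKS = {
    "rb-003": "Runbook: rotate TLS certificates on the ingress controller before expiry.",
    "pm-014": "Postmortem: postgres primary ran out of disk because WAL archiving stalled.",
}
QUESTION = "Why did the database fall over?"


def fake_llm(prompt: str) -> str:
    # Canned output standing in for a real model call.
    return "The postgres primary likely ran out of disk or WAL archiving stalled."


naive_hit = search(QUESTION, CHUNKS)
hyde_hit = hyde_search(QUESTION, CHUNKS, fake_llm)
print(naive_hit, hyde_hit)
```

The hypothetical passage does not need to be correct; it only needs to look like the documents that would contain the answer, so that it lands closer to them than the question does.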
Concrete instances in the wild
- Every “AI for your docs” product. Notion AI, Coda AI, Copilot Chat, Cursor, GitHub Copilot Enterprise — all RAG under the hood.
- basecamp mcp-ops-handbook (Y5 Phase 42). RAG over 5 years of ops history for the AIOps agent.
- Perplexity. Web-scale RAG. Query → web search → retrieve pages → LLM generates answer with citations.
- Bing Chat (now Microsoft Copilot) / ChatGPT with search. Same pattern applied to consumer search.
- Anthropic Claude with Projects / Files. Claude has built-in RAG over user-uploaded files.
- LangChain / LlamaIndex. OSS frameworks that make RAG easy to prototype. Load documents, embed, query.
- Vectara. Commercial RAG-as-a-service. Managed retrieval + generation.
- Elasticsearch with vector search + LLM integration. Existing search infrastructure extended to RAG.
- Vespa. OSS platform that originated at Yahoo. Rich hybrid search + LLM integration.
- Haystack. OSS framework for building RAG applications with strong evaluation support.
Why this pattern matters
LLMs are trained on a fixed snapshot of data and cannot know anything that happened after their training cutoff. Without RAG, they can’t answer questions about your internal documents, recent events, or specialized corpora. They can only riff on what was in their training set. This is severely limiting for enterprise applications.
With RAG, LLMs become useful for information tasks over any corpus. Internal wikis. Support tickets. Product documentation. Legal contracts. Medical records. Codebases. Anything you can chunk, embed, and index becomes queryable via natural language. The LLM provides the language understanding and synthesis; the retrieval system provides the grounded facts. Together they produce answers that neither can produce alone.
The pattern is also what enables citations and verifiability. Naked LLM output has no provenance: you have no way to verify the model isn’t making things up. RAG-generated output can carry citations to the source documents, provided the prompt asks for them and the cited documents are ones that were actually retrieved. Users can then check the source. Hallucination becomes containable because the LLM’s job is to synthesize from cited context, not to produce facts from parametric memory.
For SRE and platform teams, RAG applied to operational corpora is transformative. Every postmortem, every runbook, every incident timeline becomes queryable during future incidents. “Have we seen an issue like this before?” becomes a one-click retrieval instead of a Slack search. Incident response gets faster because institutional memory becomes accessible.
RAG also matters for cost management. Without RAG, keeping an LLM current on your corpus requires fine-tuning (expensive, slow, needs updates whenever the corpus changes). With RAG, the corpus lives in a vector store (cheap, fast to update). The LLM stays generic; the corpus provides specificity. Adding new documents means embedding and indexing them, not retraining.
The failure modes to know: poor chunking (context missing key details), poor retrieval (irrelevant chunks in context), context stuffing (too much irrelevant text confuses the LLM), citation hallucination (LLM cites documents it wasn’t actually shown), stale corpora (retrieving outdated information), permission leakage (retrieval returns documents the user shouldn’t see). Each has known mitigations; getting them all right is what separates production RAG from prototype RAG.
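Two of these mitigations are simple enough to sketch. The example below assumes each chunk carries an `allowed_groups` set (a hypothetical metadata field) and that answers cite chunk ids in square brackets; both conventions are assumptions for illustration, not a description of any particular product.

```python
import re


def filter_by_permission(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Permission leakage: drop chunks the user may not see BEFORE they reach the prompt."""
    return [chunk for chunk in chunks if chunk["allowed_groups"] & user_groups]


def unknown_citations(answer: str, shown_ids: set[str]) -> set[str]:
    """Citation hallucination: return cited ids that were never shown to the LLM."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    return cited - shown_ids


# Hypothetical retrieved chunks with access-control metadata.
retrieved = [
    {"id": "pm-014", "allowed_groups": {"sre", "platform"}},
    {"id": "hr-002", "allowed_groups": {"hr"}},
]

visible = filter_by_permission(retrieved, user_groups={"sre"})
shown_ids = {chunk["id"] for chunk in visible}

answer = "The disk filled up [pm-014]; a similar case is described in [pm-099]."
bad = unknown_citations(answer, shown_ids)
print(shown_ids, bad)
```

Filtering at retrieval time matters because anything placed in the prompt can surface in the answer; a non-empty result from the citation check is a signal to reject or regenerate the response.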
Modern tooling is improving RAG quality substantially. Re-rankers (Cohere Rerank, ColBERT) improve retrieval precision. Structured chunking (respecting document boundaries) improves context relevance. Query expansion (LLM writes multiple search queries from one user question) improves recall. RAG in 2026 is genuinely useful; RAG in 2023 was frequently unreliable.
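Query expansion, for instance, can be sketched as running several rewrites of the question and merging the hits. In this minimal example `fake_search` and `fake_rewrite` are hypothetical stand-ins for a retriever and an LLM rewriting step, and the merge is a plain order-preserving union; a rank-fusion method would be another reasonable choice.

```python
from typing import Callable


def expanded_search(
    question: str,
    search: Callable[[str, int], list[str]],
    rewrite: Callable[[str], list[str]],
    k: int = 2,
) -> list[str]:
    """Search with the original question plus its rewrites; merge without duplicates."""
    queries = [question, *rewrite(question)]
    merged: list[str] = []
    for query in queries:
        for chunk_id in search(query, k):
            if chunk_id not in merged:
                merged.append(chunk_id)
    return merged


# Hypothetical retriever: a lookup table from query string to ranked chunk ids.
FAKE_INDEX = {
    "why was checkout slow": ["pm-022"],
    "checkout latency postmortem": ["pm-031", "pm-022"],
    "payment service timeout": ["pm-040"],
}


def fake_search(query: str, k: int) -> list[str]:
    return FAKE_INDEX.get(query, [])[:k]


def fake_rewrite(question: str) -> list[str]:
    # Canned rewrites standing in for an LLM call.
    return ["checkout latency postmortem", "payment service timeout"]


results = expanded_search("why was checkout slow", fake_search, fake_rewrite)
print(results)
```

The original question alone finds one chunk; the rewrites surface two more, which is the recall gain the technique aims for, at the cost of extra retrieval calls per question.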
Depth progression
STUB ← you are here.
OUTLINE Promoted when Y5 Phase 42 stands a working RAG pipeline on basecamp.
DEEP Promoted after Y5 Phase 42 + Y5 Phase 50: operating RAG over the ops-handbook corpus, with measured retrieval-quality + grounding metrics.
Preview: what OUTLINE will answer
When Y5 Phase 42 promotes this entry to OUTLINE, it will name:
- PROBLEM. How do you make an LLM answer questions using information beyond its training data, with grounding and citations?
- PRINCIPLES. Retrieve semantically relevant chunks. Inject into prompt as context. Instruct LLM to answer only from context. Return citations. Improve retrieval quality (chunking, re-ranking, hybrid search) before improving LLM. Evaluate retrieval and generation separately.
- TRADE-OFFS. Naive RAG (simple, weaker) vs advanced (multi-hop, self-querying — powerful, complex). Semantic search only vs hybrid (semantic + keyword). Large context window (less chunking needed, more cost per query) vs small (more chunking, less context per query). Local retrieval (fast, limited corpus) vs global (slower, comprehensive).
- TOOLS (time-stamped as of 2026-06): LangChain, LlamaIndex, Haystack, Vectara (managed), Elasticsearch with vector, Vespa, pgvector + custom pipeline (basecamp), Perplexity/Claude/ChatGPT with search (consumer RAG).
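The principle of evaluating retrieval separately from generation can be made concrete with two standard retrieval metrics, recall@k and reciprocal rank. The sketch below assumes a small hand-labelled set of relevant chunk ids per question; the ids and labels are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labelled example: what the retriever returned vs. what a human marked relevant.
retrieved_ids = ["rb-003", "pm-014", "pm-022"]
relevant_ids = {"pm-014", "pm-031"}

print(recall_at_k(retrieved_ids, relevant_ids, k=3))
print(reciprocal_rank(retrieved_ids, relevant_ids))
```

Averaging these over a fixed question set gives a retrieval score that moves independently of the LLM, so a change to chunking or re-ranking can be judged before any generation-side evaluation is run.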
The DEEP promotion, after Y5 with production RAG operating over the ops-handbook, will add MASTERY (operating RAG at basecamp scale), COMPARE (naive vs advanced RAG patterns), OPERATE (a specific retrieval-quality tuning event), and CONTRIBUTE (a LangChain/LlamaIndex example or a public case study).
Canonical references
- Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (2020). Free. Original RAG paper.
- LangChain documentation. Free at langchain.com.
- LlamaIndex documentation. Free at llamaindex.ai.
- Anthropic’s guides on effective RAG. Free at anthropic.com.
- Anyscale blog on production RAG patterns. Free at anyscale.com.