Vector Stores + Embeddings + RAG

Phase 42 of /root Year 5: vector search at depth via pgvector (on CloudNativePG) and Milvus Operator. Embedding pipelines. RAG architecture: ingest + retrieve + generate. Hybrid search. Tier 7 entry of basecamp. 6-8 weeks, ~60-80 hours.

Fourth phase of Year 5. The retrieval layer of every AI system.

Phase 36 introduced attention. This phase puts it to work: embedding models turn text, images, and other inputs into vectors; vector stores serve similarity search in under 100 ms; and RAG (Retrieval-Augmented Generation) is the dominant production architecture for LLM applications in 2026. By the end of the phase, basecamp has pgvector on CloudNativePG and Milvus (via Milvus Operator) running side by side; an embedding pipeline ingests ops-handbook and code into vectors; and a RAG endpoint serves “what’s in our docs about X?” queries.

The K8s-native deployment is consistent with the rest of basecamp: every vector store is a CRD-driven operator-managed component.


Prerequisites

  • All Y4 + Y5 Phase 39-41 complete
  • GPU available for embedding generation (or fast CPU)
  • 12 hrs/week budget reserved

Why this phase exists

LLMs alone are stateless: they don’t know your data. RAG is the pattern that connects the two: ingest your data → embed → store → retrieve relevant chunks at query time → feed them to the LLM. Most production LLM applications that go beyond a plain chatbot use RAG. Operationally, the retrieval layer (the vector store) is the hard part.


The pattern-first frame

Same eight steps.


1. PROBLEM

You have data (ops-handbook, code, runbooks, weekly logs) and an LLM. You want users to ask questions and get answers grounded in your data. Without that grounding, the LLM hallucinates. You need to turn data into vectors (embeddings), store and index them (vector store), query by similarity (nearest neighbors), and include the results in the LLM prompt (RAG).


2. PRINCIPLES

2.1 Embeddings as vector representations

An embedding model converts text/images/audio into fixed-dimensional vectors. Semantically similar inputs map to nearby vectors.

→ Pattern: embedding-store — OUTLINE

Investigate:

  • Walk how a sentence becomes a 768-dim vector via a transformer encoder.
  • Why are embedding models often smaller than generation models?
  • What’s the relationship between embedding model and downstream LLM quality?
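As a rough illustration of the first question above, the sketch below mean-pools per-token vectors into one fixed-size sentence vector and compares sentences by cosine similarity. The 4-dimensional token table is invented for the example; a real transformer encoder produces contextual token vectors (for example 768 dimensions each) before a pooling step like this one.

```python
import math

# Hypothetical 4-dim token vectors, invented for this sketch.
# A real encoder computes contextual vectors per token instead of a lookup.
TOKEN_VECTORS = {
    "restart": [0.9, 0.1, 0.0, 0.2],
    "reboot":  [0.8, 0.2, 0.1, 0.2],
    "the":     [0.1, 0.1, 0.1, 0.1],
    "node":    [0.2, 0.9, 0.1, 0.0],
    "server":  [0.3, 0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9, 0.7],
}

def embed(sentence):
    """Mean-pool the token vectors, then L2-normalize to unit length."""
    vecs = [TOKEN_VECTORS[t] for t in sentence.lower().split() if t in TOKEN_VECTORS]
    if not vecs:
        raise ValueError("no known tokens in sentence")
    dim = len(vecs[0])
    pooled = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

def cosine(a, b):
    """Cosine similarity; for unit-length inputs this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

a = embed("restart the node")
b = embed("reboot the server")
c = embed("invoice")
print(round(cosine(a, b), 3), round(cosine(a, c), 3))
```

The two operations sentences land close together and the unrelated one lands far away, which is the property a vector store relies on. The output length is fixed by the model, not by the length of the input.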

2.2 Vector search via ANN

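Exact nearest-neighbor search costs O(n) distance computations per query, because every stored vector is compared against the query. Approximate Nearest Neighbor (ANN) algorithms (HNSW, IVF, ScaNN) give up a little recall in exchange for sub-100ms latency at million-vector scale.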

→ Pattern: vector-search (DEEP target this phase)

Investigate:

  • Walk HNSW: layers, search start point, greedy descent.
  • Why is exact NN impractical above ~1M vectors?
  • When does pgvector beat Milvus (one machine, <10M vectors)? When does Milvus beat pgvector (distributed, >50M)?
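To make the exact-versus-approximate contrast concrete, here is a minimal sketch that puts a brute-force scan next to a greedy walk over a neighbor graph. It is not HNSW: it has a single layer, a fixed entry point, and a brute-force graph build, so it only illustrates the "greedy descent" step that HNSW repeats on each layer. All function names are made up for the example.

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

def exact_nn(points, query):
    """Exact search: one distance computation per stored vector, O(n) per query."""
    return min(range(len(points)), key=lambda i: dist(points[i], query))

def build_knn_graph(points, m=6):
    """Link every point to its m nearest neighbors (brute force; fine for a toy)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(points[j], p))
        graph[i] = others[:m]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Hop to the closest neighbor until no neighbor is closer than the current node.

    Returns (index, number of distance computations). The result is a local
    minimum of the graph, so it can differ from the exact nearest neighbor.
    """
    current = entry
    current_d = dist(points[current], query)
    n_dist = 1
    while True:
        best, best_d = current, current_d
        for nb in graph[current]:
            n_dist += 1
            d = dist(points[nb], query)
            if d < best_d:
                best, best_d = nb, d
        if best == current:
            return current, n_dist
        current, current_d = best, best_d

random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
graph = build_knn_graph(points, m=6)
query = (0.5, 0.5)
approx_idx, n_dist = greedy_search(points, graph, query)
exact_idx = exact_nn(points, query)
print("exact:", exact_idx, "greedy:", approx_idx, "distance computations:", n_dist)
```

The greedy walk can stop at a point that is not the true nearest neighbor; that gap is the recall being traded away. HNSW narrows it with multiple layers and a candidate list wider than one node.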

2.3 RAG architecture

Ingest pipeline: data → chunk → embed → store. Query pipeline: query → embed → retrieve top-k → format prompt → LLM. Each step has trade-offs.

→ Pattern: rag-as-pattern (DEEP target this phase)

Investigate:

  • Walk a RAG end-to-end. What’s at each step?
  • What’s chunking strategy, and why does it dominate retrieval quality?
  • When does pure-vector retrieval fail (specific entity lookup, exact keyword matching)?
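The two pipelines can be sketched end to end in a few dozen lines. Everything here is a stand-in chosen so the example runs without dependencies: the "embedder" is a normalized bag-of-words (not semantic), the store is a Python list, the document names are invented, and the LLM call is left out, so the sketch stops at the formatted prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedder: L2-normalized bag-of-words as a sparse dict."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}

def similarity(a, b):
    """Dot product of two sparse unit vectors (equals cosine similarity)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def chunk(text, size=40, overlap=10):
    """Split into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def ingest(docs):
    """Ingest pipeline: data -> chunk -> embed -> store."""
    store = []
    for doc_id, text in docs.items():
        for piece in chunk(text):
            store.append((doc_id, piece, embed(piece)))
    return store

def retrieve(store, question, k=2):
    """Query pipeline, first half: embed the query, return the top-k chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda row: similarity(q, row[2]), reverse=True)
    return ranked[:k]

def build_prompt(question, hits):
    """Query pipeline, second half: format retrieved chunks into the prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text, _ in hits)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# Invented sample documents standing in for ops-handbook pages.
DOCS = {
    "runbook-index": "To rebuild the vector index, drain writes, drop the HNSW "
                     "index, and recreate it during a maintenance window.",
    "runbook-disk": "When Milvus reports disk pressure, expand the persistent "
                    "volume claim and compact old segments.",
}
store = ingest(DOCS)
question = "how do I rebuild the vector index"
hits = retrieve(store, question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Swapping in a real embedding model and a real store changes `embed`, `ingest`, and `retrieve`; the shape of the pipeline stays the same. The `size` and `overlap` parameters of `chunk` are where the chunking trade-off lives.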

2.4 Hybrid search

Pure vector retrieval misses cases where exact keywords matter. Hybrid search combines vector similarity, BM25 (keyword) scoring, and metadata filters; rerankers then reorder the merged results.

Investigate:

  • What’s reciprocal rank fusion, and why is it the workhorse of hybrid?
  • When does a reranker model on top of retrieval results earn its weight?
  • Why does Postgres-native (pgvector + tsvector) often beat dedicated vector stores at moderate scale?
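Reciprocal rank fusion is small enough to show in full. The sketch below scores each document by summing 1 / (k + rank) over every ranked list it appears in; k = 60 is the commonly used constant, and the document IDs are invented for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each list contributes 1 / (k + rank) for a document at 1-based `rank`.
    Only ranks are used, so the raw scores of the retrievers never need to be
    on a comparable scale.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["doc-a", "doc-b", "doc-c"]   # e.g. from ANN search
keyword_hits = ["doc-c", "doc-a", "doc-d"]  # e.g. from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents that appear in both lists (doc-a, doc-c) rise above documents that only one retriever found, without any score normalization.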

2.5 K8s-native vector store deployment

pgvector: an extension on top of CloudNativePG (Y3 Phase 20). Milvus: deployed via Milvus Operator (Milvus CRD). Both fit basecamp’s K8s-native ecosystem.

→ Pattern: operator-pattern reinforced

Investigate:

  • Walk Milvus Operator: declare Milvus CRD → operator creates etcd + MinIO + Milvus components.
  • How does pgvector compose with CloudNativePG (same operator, just enable extension)?
  • When do you reach for Milvus over pgvector? (Hint: >50M vectors, distributed sharding, GPU indexing.)
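On the pgvector side, once the extension is available in the cluster, the database work is ordinary SQL. The sketch below only builds the statements as strings so it runs without a database. The table name, column names, and 768-dim size are hypothetical; the `%s` placeholder assumes a psycopg-style driver; the `m` and `ef_construction` values shown are pgvector's documented defaults for HNSW.

```python
TABLE = "handbook_chunks"  # hypothetical table name

def schema_sql(dim=768):
    """DDL: enable pgvector, create a table with a vector column, add an HNSW index."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {TABLE} ("
        "id bigserial PRIMARY KEY, "
        "doc_id text NOT NULL, "
        "body text NOT NULL, "
        f"embedding vector({dim}) NOT NULL);",
        f"CREATE INDEX IF NOT EXISTS {TABLE}_hnsw ON {TABLE} "
        "USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64);",
    ]

def to_vector_literal(vec):
    """Format a Python sequence as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

def search_sql(k=10):
    """Top-k by cosine distance (<=>); the query vector is passed as a parameter."""
    return (f"SELECT doc_id, body FROM {TABLE} "
            f"ORDER BY embedding <=> %s::vector LIMIT {int(k)};")

for statement in schema_sql():
    print(statement)
print(search_sql(5))
print(to_vector_literal([0.1, 2]))
```

With a driver, the search would be executed as something like `cursor.execute(search_sql(5), (to_vector_literal(query_vec),))`. How the extension gets enabled on the CloudNativePG cluster is operator configuration and is not shown here.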

2.6 Retrieval evaluation

Retrieval quality drives generation quality. Evaluate recall@k, MRR, and NDCG on a held-out set of labeled queries.

→ Pattern: reinforces evals from Phase 41

Investigate:

  • Walk recall@10: what does it measure?
  • What’s a “query / relevant doc” labeled set, and how do you curate one?
  • Why is “RAG output looks good” not evidence retrieval is working?
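The three metrics are short enough to write out. This is a minimal sketch with binary relevance (a document is either relevant or not) and an invented two-query labeled set; it assumes each retrieved list contains no duplicate IDs.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best possible gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Invented labeled set: query -> set of relevant doc IDs.
labeled = {"q1": {"d1", "d4"}, "q2": {"d7"}}
# Invented retrieval output: query -> ranked doc IDs.
runs = {"q1": ["d3", "d1", "d9", "d4"], "q2": ["d7", "d2", "d5"]}

mean_recall = sum(recall_at_k(runs[q], labeled[q], 3) for q in labeled) / len(labeled)
mrr = sum(reciprocal_rank(runs[q], labeled[q]) for q in labeled) / len(labeled)
mean_ndcg = sum(ndcg_at_k(runs[q], labeled[q], 3) for q in labeled) / len(labeled)
print(mean_recall, mrr, round(mean_ndcg, 3))
```

For q1, only one of the two relevant documents makes the top 3, so recall@3 is 0.5 even though the ranked list looks plausible at a glance. That is the kind of miss the generated answer alone does not reveal.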

3. TRADE-OFFS

Decision | Options | Cost
Vector store | pgvector (small-mid scale); Milvus Operator (large scale); Qdrant; Weaviate | pgvector: simple, K8s-native via CloudNativePG. Milvus: scale, K8s-native operator. Qdrant/Weaviate: managed alternatives.
Embedding model | OpenAI text-embedding-3; Anthropic equivalents; open-weights (BGE, GTE, Nomic) | Closed: best quality, per-call cost. Open: free, runs on your GPU, quality varies.
ANN algorithm | HNSW; IVF; ScaNN | HNSW: modern default. IVF: smaller memory. ScaNN: Google-optimized.
Reranker | None (rely on retrieval); Cohere reranker; open-weights cross-encoder | None: cheapest. Cohere: best quality, paid. Cross-encoder: middle.

4. TOOLS (as of 2026-06)

  • pgvector (extension on CloudNativePG)
  • Milvus Operator (Milvus CRD)
  • sentence-transformers — embedding model library
  • BGE-M3, Nomic-Embed-v1.5 — open-weights embedding models in 2026

Reading

  • “Building LLM-Powered Applications” (Alammar et al.) — RAG chapter
  • The original HNSW paper (Malkov + Yashunin)
  • pgvector docs + Milvus docs

5. MASTERY: Vector search + RAG on basecamp

[ ] pgvector enabled on a CloudNativePG cluster; create a vector table
[ ] Milvus Operator deployed via Flux; declare a Milvus CRD
[ ] Ingest ops-handbook into both stores; compare ANN performance
[ ] Build an embedding pipeline: read from Iceberg, embed with sentence-transformers, write to vector store
[ ] Implement HNSW indexing; benchmark recall@10 + latency
[ ] Add hybrid search (vector + Postgres tsvector); compare against vector-only
[ ] Add reciprocal rank fusion for hybrid scoring
[ ] Curate 50-query relevant-doc labeled set
[ ] Eval retrieval: recall@10, MRR, NDCG
[ ] Build a small RAG endpoint: query → retrieve → format → LLM call (use a small open model locally)

6. COMPARE: Qdrant or Weaviate

Deploy one as a parallel installation and load the same vectors into it. Compare ergonomics, query model, and K8s integration.

400-word reflection.


7. OPERATE

  • 3-4 runbooks: “Vector index rebuild”, “Embedding model upgrade”, “Retrieval quality regression”, “Milvus disk pressure”
  • 2 ADRs, chosen from: pgvector for now, Milvus when >50M vectors; embedding model choice; reranker choice
  • Weekly log

8. CONTRIBUTE

  • pgvector — improvements
  • Milvus (LF AI & Data graduated project) — docs
  • A retrieval eval set for ops-handbook (private; pattern is public)

What ships from this phase

  • Tier 7 entry of basecamp: pgvector + Milvus + embedding pipelines + RAG endpoint
  • Vector store runbooks

Validation criteria

[ ] pgvector + Milvus both operational
[ ] Embedding pipeline ingesting ops-handbook
[ ] RAG endpoint serving queries
[ ] Hybrid search + reranker integrated
[ ] Retrieval eval set + metrics tracked
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 3-4 runbooks
[ ] 2 ADRs
[ ] Pattern entries:
    - vector-search → DEEP
    - embedding-store → OUTLINE
    - rag-as-pattern → DEEP
    - operator-pattern reinforced
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Part 1: Build (90 min)

Add a new corpus (e.g., your weekly logs from the past year) to the vector store. Build an end-to-end RAG endpoint. Eval recall@10 on a 20-query labeled set.

Part 2: Diagnose (60 min)

A retrieval scenario: recall@10 dropped from 0.85 to 0.40 after an embedding model upgrade. Possible causes: the embedding dimensions changed; old embeddings are still in the store; the reranker is incompatible with the new model.
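One way to check the first two causes is to audit what is actually in the store. The sketch below assumes each stored row carries the name of the model that produced it, which is a hypothetical layout and not something pgvector or Milvus adds for you; the model names are placeholders.

```python
from collections import Counter

def audit_embeddings(rows, query_model, query_dim):
    """Report stored vectors that cannot be compared with the query vectors.

    rows: list of (chunk_id, model_name, vector).
    Vectors from different models live in different spaces, so similarity
    between an old-model chunk and a new-model query is not meaningful.
    """
    problems = []
    by_model = Counter(model for _, model, _ in rows)
    for model, count in sorted(by_model.items()):
        if model != query_model:
            problems.append(
                f"{count} chunk(s) embedded with {model}; queries use {query_model}")
    bad_dims = [chunk_id for chunk_id, _, vec in rows if len(vec) != query_dim]
    if bad_dims:
        problems.append(f"{len(bad_dims)} chunk(s) have dimension != {query_dim}")
    return problems

# Invented sample: two chunks were never re-embedded after the upgrade.
rows = [
    ("c1", "model-v1", [0.0] * 4),
    ("c2", "model-v1", [0.0] * 4),
    ("c3", "model-v2", [0.0] * 6),
]
for line in audit_embeddings(rows, query_model="model-v2", query_dim=6):
    print(line)
```

A clean audit points the investigation at the reranker or at the labeled set instead; a dirty one means re-embedding the whole corpus before measuring recall again.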

Part 3: Articulate (30 min)

~600 words: “Walk a RAG query end-to-end: user input → embedding → vector retrieval → reranking → LLM prompt construction → LLM call → response. Cite patterns at each step.”


Anti-patterns

Anti-pattern | Why
Pure-vector without hybrid | Misses exact entity lookups
Chunking that splits coherent ideas | Retrieval returns fragments; LLM hallucinates around them
No retrieval eval | “RAG output looks good” is not evidence
Using a massive embedding model when a smaller one suffices | Cost + latency for marginal quality

Patterns touched this phase

  • vector-search → DEEP
  • embedding-store → OUTLINE
  • rag-as-pattern → DEEP
  • operator-pattern reinforced

→ Next: Phase 43: LLM Serving Deep