5-YEAR PROGRAM · YEAR 4 · PHASE 24
UPCOMING

LLM Infrastructure + RAG + llm-gateway v1

Fifth phase of Year 4. The big LLM-shaped phase. vLLM at depth, RAG as a 3-system pattern, vector DBs (pgvector + Qdrant), llm-gateway v1 ships. Personal RAG over your weekly logs lands as notes-rag. ~8 weeks, ~100 hrs.


Where this phase sits

P24 is the centerpiece of Year 4. Phase 21 shipped the services/llm-gateway/ v0 scaffold — one Go service, one endpoint, one backend, deliberately minimal. P24 turns that scaffold into v1: RAG, streaming SSE, multi-model routing, per-user rate limiting, cost and latency tracking, OpenAI-compatible API surface. It’s the phase where the gateway stops being a stub and starts being the substance behind Year 4’s flagship.

This is also the phase where notes-rag lands — your personal RAG over four years of weekly logs. By P24, the Master Plan’s weekly-log discipline has produced a corpus no demo dataset can fake: real questions you were stuck on, real patterns you noticed, real things you broke and fixed. You ingest that corpus, index it, point llm-gateway at it, and ask your homelab “what was I stuck on in March 2027?” That’s the second Studio composition recipe, and it’s the cinematic moment Year 4 has been building toward.

The pattern depth this phase: rag-as-pattern and inference-shapes both reach DEEP. RAG specifically reaches DEEP because P24 forces the three sub-systems (ingest, retrieve, generate) to be operated separately. They fail differently, scale differently, and cost differently — and you’ll have evidence of all three by the end of the phase. The full P21 → P24 → P25 llm-gateway arc is summarized in the Year 4 overview; P24 is the middle step where the gateway becomes real.


Prerequisites

  • Phase 23 complete — train→deploy recipe operational
  • You accept: LLMs are a different model shape (bigger, slower, sometimes streaming) with their own infrastructure (vector DBs, embedding services, prompt management). The patterns from P20-23 still apply; the infra just gets bigger.

Why this phase exists

The Year 4 flagship is services/llm-gateway/. P21 shipped v0 (skeleton); this phase ships v1 (production-grade for homelab traffic): RAG, streaming, multi-model routing, cost tracking.

This phase also lands the second composition recipe: Personal RAG over your weekly logs (notes-rag). After 4 years of weekly logs in Iceberg, you can ask your homelab “what was I stuck on in March 2027?” and get an answer.

→ Pattern: rag-as-pattern (DEEP this phase)

→ Pattern: inference-shapes (DEEP, adds streaming)


1. PROBLEM

You want to serve LLMs in production with:

  • Multi-model routing (small for cheap, big for hard)
  • Retrieval-augmented generation (domain context injected)
  • Vector search at scale
  • Streaming token output
  • Cost + latency tracking per user/request
  • Rate limiting + abuse detection
  • OpenAI-compatible API so clients are vendor-neutral

That’s services/llm-gateway/ v1.


2. PRINCIPLES

2.1 RAG as 3 sub-systems

Ingest + retrieve + generate. Each is a separate problem.

→ Pattern: rag-as-pattern

Investigate each sub-system independently:

Ingest:

  • Chunk documents (semantic vs fixed-size; overlap)
  • Embed chunks (sentence-transformers, OpenAI, custom)
  • Store in vector DB
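
The fixed-size-with-overlap option can be sketched in a few lines. This is a minimal illustration, not the phase's prescribed chunker: the function name `chunk_text`, the character-based window, and the default sizes are all assumptions, and a real ingest job would more likely count tokens than characters.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: `size` and `overlap` are measured in characters
    to keep the sketch dependency-free.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Overlap trades index size for recall: a sentence that straddles a window boundary still appears whole in one of the two neighbouring chunks.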

Retrieve:

  • Embed user query
  • Vector similarity search (top-K)
  • Hybrid: dense + keyword (BM25)
  • Re-rank (cross-encoder)
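
One common way to combine the dense and BM25 result lists is reciprocal rank fusion (RRF), which needs only ranks and no score normalization. The sketch below assumes each retriever returns an ordered list of document IDs; the function name and the conventional constant `k = 60` are illustrative choices, not something this phase mandates.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by ID for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: "b" and "a" appear in both lists, so they outrank "c" and "d".
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

A weighted sum of normalized scores is the usual alternative; RRF is shown here only because it avoids having to calibrate cosine similarity against BM25 scores.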

Generate:

  • Construct prompt with retrieved context
  • Stream LLM response
  • Optionally cite sources
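
The prompt-construction step can be sketched as a pure function, which also makes it easy to unit-test. Everything here is an assumption for illustration: the chunk keys `source` and `text`, the character budget, and the instruction wording.

```python
def build_prompt(question: str, chunks: list[dict], max_chars: int = 2000) -> str:
    """Assemble a RAG prompt with numbered, citable context entries.

    `chunks` are dicts with hypothetical keys 'source' and 'text', already
    ordered best-first by the retriever. Entries are added until the
    character budget would be exceeded.
    """
    lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_chars:
            break  # keep the best-ranked chunks, drop the rest
        lines.append(entry)
        used += len(entry)

    context = "\n".join(lines)
    return (
        "Answer using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the entries gives the model something concrete to cite, and the budget keeps retrieval from silently overflowing the context window.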

2.2 Vector databases: pgvector vs Qdrant

Both store embeddings + do similarity search. Different operational shapes.

Investigate:

  • pgvector extension on your existing Postgres (Tier 1) — single source of truth
  • Qdrant on K8s (Tier 7) — dedicated vector DB
  • Index 100K embeddings in both; compare query latency at 10/100/1000 QPS
  • When does each win? (pgvector for “I already have Postgres”; Qdrant for “real scale.”)
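
For the latency comparison, it helps to report percentiles rather than averages. The harness below is a hedged sketch: `run_query` is a hypothetical callable wrapping whichever client you use (pgvector or Qdrant), and it times queries sequentially. Driving a fixed 10/100/1000 QPS needs a real load generator on top of this.

```python
import math
import time


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def time_queries(run_query, queries) -> list[float]:
    """Run each query once and return per-query latency in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical: performs one vector search
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Collapse raw latencies into the three numbers worth comparing."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Run the same query set against both stores and compare the `summarize` output side by side; tail latency (p95, p99) is usually where the two diverge.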

2.3 vLLM at depth

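vLLM combines PagedAttention with continuous batching. The vLLM authors report 2-4x higher throughput than prior serving systems (FasterTransformer, Orca) at similar latency, and up to 24x over naive HuggingFace Transformers serving on the same GPU.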

Investigate:

  • Read the vLLM paper (Kwon et al.) properly this time
  • Configure continuous batching params (max-num-seqs, max-num-batched-tokens, scheduling policy)
  • Multi-model serving: route small queries to Llama-3.2-1B (~1B params), big to Phi-3-mini (~3.8B params), or bigger via spot GPU
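
A first routing policy can be a plain heuristic before anything learned is introduced. The sketch below is illustrative only: the model names are placeholders, and the length threshold and keyword markers are assumptions you would tune against your own traffic.

```python
from dataclasses import dataclass

# Hypothetical markers that suggest a query needs the larger model.
HARD_MARKERS = ("prove", "refactor", "step by step")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str  # recorded so routing decisions can be audited later


def choose_model(
    prompt: str,
    small: str = "small-model",
    large: str = "large-model",
    max_small_chars: int = 500,
) -> Route:
    """Pick a backend for one request using length and keyword heuristics."""
    if len(prompt) > max_small_chars:
        return Route(large, "long prompt")
    lowered = prompt.lower()
    for marker in HARD_MARKERS:
        if marker in lowered:
            return Route(large, f"matched marker: {marker}")
    return Route(small, "default")
```

Returning the reason alongside the model makes it possible to check later, from logs, how often each rule fires and whether the expensive route was justified.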

2.4 Streaming responses (SSE)

LLMs are slow; stream tokens as generated.

→ Pattern: inference-shapes (streaming variant)

Investigate:

  • Implement SSE endpoint in Go (basecamp llm-gateway)
  • Backpressure handling: what happens when the client reads slower than the server generates
  • Why is first-token-latency more important than total-latency for chat?
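
The wire format is simple enough to sketch independently of the Go handler. The example below frames each token as one SSE `data:` event and ends with the `[DONE]` sentinel used by OpenAI-style streaming. The payload carries only the `choices[0].delta.content` field, which is a simplification: real chunks include more fields. The function names are hypothetical.

```python
import json


def sse_event(payload: dict) -> str:
    """Encode one SSE event. json.dumps escapes newlines, so the payload
    always fits on a single 'data:' line; a blank line ends the event."""
    return f"data: {json.dumps(payload, separators=(',', ':'))}\n\n"


def stream_tokens(tokens):
    """Yield one SSE frame per token, then the end-of-stream sentinel."""
    for token in tokens:
        yield sse_event({"choices": [{"delta": {"content": token}}]})
    yield "data: [DONE]\n\n"


def parse_sse(stream: str) -> list[str]:
    """Client-side counterpart: recover the tokens from a raw SSE stream."""
    tokens = []
    for frame in stream.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        data = frame[len("data: "):]
        if data == "[DONE]":
            break
        tokens.append(json.loads(data)["choices"][0]["delta"]["content"])
    return tokens
```

Because each frame is flushed as soon as its token exists, the user sees output after the first token rather than after the whole completion, which is why first-token latency dominates perceived speed in chat.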

2.5 Cost + latency tracking

Every request: tokens-in/out, latency, model, user. Track. Bill.

Investigate:

  • Per-request cost calculation (model-specific pricing)
  • Aggregate by user/team/day
  • Dashboard in Superset (Y3 P18)
  • SLO: tokens/sec for streaming, e2e latency for non-streaming
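
Per-request cost is a small arithmetic step once token counts are recorded. The sketch below uses made-up prices and model names purely for illustration; substitute whatever pricing (or amortized GPU cost) applies to your models.

```python
from collections import defaultdict
from dataclasses import dataclass

# HYPOTHETICAL prices: (input, output) in USD per 1,000 tokens.
PRICES = {
    "small-model": (0.0002, 0.0004),
    "large-model": (0.0010, 0.0020),
}


@dataclass
class Usage:
    user: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float


def request_cost(usage: Usage) -> float:
    """Cost of one request in USD, using the model-specific price pair."""
    price_in, price_out = PRICES[usage.model]
    return (usage.tokens_in * price_in + usage.tokens_out * price_out) / 1000


def cost_by_user(usages: list[Usage]) -> dict[str, float]:
    """Aggregate request costs per user, e.g. for a daily rollup."""
    totals: dict[str, float] = defaultdict(float)
    for usage in usages:
        totals[usage.user] += request_cost(usage)
    return dict(totals)
```

Worked example: 1,000 input tokens and 500 output tokens on `small-model` cost (1000 × 0.0002 + 500 × 0.0004) / 1000 = 0.0004 USD under these assumed prices.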

2.6 Per-user rate limiting

Token bucket via Redis. Same shape as P3 + P12 work.

→ Pattern: caching + Redis (Y3 P18)
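
The token-bucket logic itself is independent of Redis. The in-memory sketch below shows the refill-then-spend step that a Redis implementation would run atomically per user key; the class name and the explicit `now` parameter (used instead of a wall clock so the behaviour is testable) are assumptions for illustration.

```python
class TokenBucket:
    """In-memory token bucket. A Redis version would keep `tokens` and
    `updated` under a per-user key and run this logic atomically."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full so a new user can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: one bucket per user, created on first request.
buckets: dict[str, TokenBucket] = {}
bucket = buckets.setdefault("ana", TokenBucket(capacity=2, refill_per_sec=1.0))
first_allowed = bucket.allow(now=0.0)
```

For an LLM gateway, `cost` can be set to the request's token count instead of 1, so the limit tracks actual load rather than request count.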

2.7 Prompt management

Prompts are config. Version them. Test them. Store them.

Investigate:

  • Prompt as code: .txt files in basecamp git, versioned
  • Or: prompt store as a service (preview only — Y5 P29 portal hosts this)
  • Eval suite: prompts + expected outputs; gate prompt changes via PR
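
A prompt eval gate can start as a pass-rate check that CI runs on every prompt change. In the sketch below, `generate` is a hypothetical callable that sends a prompt to the model and returns text, and the `must_contain` substring check stands in for whatever scoring the eval set really needs.

```python
def run_eval(generate, cases: list[dict], min_pass_rate: float = 0.9):
    """Run every case through `generate` and decide whether the gate passes.

    Each case is a dict with hypothetical keys 'prompt' and 'must_contain'
    (a list of substrings expected in the output, matched case-insensitively).
    Returns (passed, pass_rate, failures).
    """
    if not cases:
        raise ValueError("eval set is empty")

    failures = []
    for case in cases:
        output = generate(case["prompt"]).lower()
        missing = [s for s in case["must_contain"] if s.lower() not in output]
        if missing:
            failures.append((case["prompt"], missing))

    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

In CI, a failing gate would exit non-zero and print `failures`, so the PR shows exactly which expected outputs the new prompt version lost.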

3. TRADE-OFFS

Decision | Option A | Option B | Option C / When
Vector DB | pgvector (in Postgres) | Qdrant | Weaviate
Embedding | OpenAI API | sentence-transformers (self-host) | When: hosted is pay-per-call; self-hosted costs GPU time but gives control
LLM serving | vLLM | TGI (HuggingFace) | OpenAI API
RAG framework | LangChain | LlamaIndex | custom (full control)
Re-ranking | cross-encoder (BAAI/bge-reranker) | LLM judge | none

4. TOOLS (as of 2025-10)

  • vLLM 0.6+ (LLM runtime)
  • pgvector 0.7+
  • Qdrant 1.12+
  • sentence-transformers (embeddings)
  • HuggingFace transformers (model loading)
  • LangChain or LlamaIndex (awareness — likely DIY for production)
  • Go (continuing llm-gateway)

5. MASTERY

5.1 Reading list

Required | Why
vLLM paper (“Efficient Memory Management for Large Language Model Serving with PagedAttention”, Kwon et al.) | The runtime
“AI Engineering” (Chip Huyen, 2024) | Modern LLM systems book
Anthropic / OpenAI cookbooks | Practical patterns

Recommended | Why
“Speech and Language Processing” (Jurafsky & Martin), chapter on RAG | Depth
Pinecone / Weaviate blog series on RAG | Field practice

5.2 Operational depth checklist

[ ] Install pgvector extension on basecamp's Postgres
[ ] Deploy Qdrant on basecamp Tier 7
[ ] Embed 100K chunks; compare pgvector vs Qdrant latency at 10/100/1000 QPS
[ ] Build minimal RAG end-to-end: query → top-5 retrieve → vLLM with context → response
[ ] Add hybrid search (dense + BM25); compare relevance vs dense-only
[ ] Add re-ranking (cross-encoder); measure relevance lift
[ ] Implement streaming SSE endpoint in Go (extend llm-gateway v0)
[ ] Add cost + latency tracking per request to MLflow + Prometheus
[ ] Multi-model routing in llm-gateway: small queries to Llama-3.2-1B (~1B params), larger to Phi-3-mini (~3.8B params)
[ ] Per-user rate limiting via Redis token bucket
[ ] Prompt versioning: prompts in basecamp git; CI tests them against eval set
[ ] Build cost-tracking dashboard in Superset

5.3 services/llm-gateway/ v1

By phase end, llm-gateway is real:

services/llm-gateway/ v1 (PUBLIC via basecamp's repo):
  Go service in basecamp/charts/llm-gateway/
  Endpoints:
    POST /v1/chat/completions  (OpenAI-compat, streaming SSE)
    POST /v1/embeddings        (embedding generation)
    GET  /v1/models            (available models)
  Features:
    - Multi-model routing (small vs large)
    - RAG pipeline (vector search + context injection)
    - Streaming SSE responses
    - Per-user rate limit (Redis token bucket)
    - Per-request cost + latency tracking (OTel + Prometheus)
    - mTLS via mesh
    - OIDC auth via Dex
    - Audit log to Loki
  Deferred to P25:
    - Drift detection on input embeddings (KS-test)
    - Auto-rollback on drift alert
    - Quantization-aware deployment

5.4 notes-rag personal service ships

By phase end, the second Studio composition recipe is live:

notes-rag = "ask my homelab about my own writing"
Pipeline:
  1. Ingest (offline, weekly via Airflow):
     Iceberg abukix.weekly_logs → Spark chunk → embed → pgvector
  2. Retrieve (online):
     user query → embed → pgvector top-5 → re-rank
  3. Generate (online):
     prompt(retrieved context + query) → llm-gateway → stream response
  4. UI:
     small Go HTTP server with htmx (same stack as triage from Y1)
     deployed via basecamp/charts/personal/notes-rag/
Personal moment:
  "What was I stuck on in March 2027?"
  → response: "You wrote about X, Y, Z..."
  → first cinematic moment of the homelab as your second brain

This is private at first; a sanitized public demo lives at studio.abukix.dev when Year 5’s portal launches.


6. COMPARE: pgvector vs Qdrant for production

You did the latency comparison. Now do the operational one: backups, multi-replica, schema evolution, ops burden.

500 words.


7. OPERATE

  • 4+ runbooks (vllm-oom, vector-db-recovery, rag-quality-drift, llm-gateway-streaming-stuck)
  • 2+ postmortems
  • 1+ ADR (pgvector-or-qdrant-for-basecamp)
  • Weekly log

8. CONTRIBUTE

vLLM, pgvector, Qdrant, sentence-transformers — all active.


Validation criteria

[ ] All 12 operational depth checks
[ ] llm-gateway v1 shipped in basecamp Tier 7 (RAG + streaming + multi-model + cost)
[ ] notes-rag personal service running on basecamp
[ ] pgvector vs Qdrant comparison written up
[ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
- rag-as-pattern → DEEP
- inference-shapes → DEEP (streaming variant added)
[ ] Exit Test passed

Exit Test

Time: 4 hours.

  1. Build (120 min) — add a new model to llm-gateway v1: vLLM with a different small model, RAG-enabled, streaming, cost-tracked, rate-limited. End-to-end. With OTel traces.
  2. Debug (90 min) — scenario: notes-rag returns irrelevant context for a known-answer query. Find why (chunking? embedding model? top-K too small? hybrid weighting?).
  3. Articulate (30 min) — 800 words: “Trace a notes-rag query end-to-end: user prompt → final streamed response. Cite every component + cost.”

Anti-patterns

Anti-pattern | Why
LLM “for everything” without classical baseline | Often a sklearn classifier wins for small problems
RAG without re-ranking | Top-K-only retrieval misses relevance gains
No prompt versioning | Prompts drift; results drift; no audit
Streaming without backpressure | Buffer growth on slow clients
Vector DB without backup strategy | One MinIO disk loss = embeddings rebuild from scratch

Patterns deepened this phase

→ Pattern: rag-as-pattern (DEEP)
→ Pattern: inference-shapes (DEEP, streaming variant added)

→ Next: Phase 25: GPU Infrastructure + Production (Y4 capstone)