LLM Infrastructure + RAG + llm-gateway v1
Fifth phase of Year 4. The big LLM-shaped phase. vLLM at depth, RAG as a 3-system pattern, vector DBs (pgvector + Qdrant),
llm-gateway v1 ships. Personal RAG over your weekly logs lands as notes-rag. ~8 weeks, ~100 hrs.
Where this phase sits
P24 is the centerpiece of Year 4. Phase 21 shipped the services/llm-gateway/ v0 scaffold — one Go service, one endpoint, one backend, deliberately minimal. P24 turns that scaffold into v1: RAG, streaming SSE, multi-model routing, per-user rate limiting, cost and latency tracking, OpenAI-compatible API surface. It’s the phase where the gateway stops being a stub and starts being the substance behind Year 4’s flagship.
This is also the phase where notes-rag lands — your personal RAG over four years of weekly logs. By P24, the Master Plan’s weekly-log discipline has produced a corpus no demo dataset can fake: real questions you were stuck on, real patterns you noticed, real things you broke and fixed. You ingest that corpus, index it, point llm-gateway at it, and ask your homelab “what was I stuck on in March 2027?” That’s the second Studio composition recipe, and it’s the cinematic moment Year 4 has been building toward.
The pattern depth this phase: rag-as-pattern and inference-shapes both reach DEEP. RAG specifically reaches DEEP because P24 forces the three sub-systems (ingest, retrieve, generate) to be operated separately. They fail differently, scale differently, and cost differently — and you’ll have evidence of all three by the end of the phase. The full P21 → P24 → P25 llm-gateway arc is summarized in the Year 4 overview; P24 is the middle step where the gateway becomes real.
Prerequisites
- Phase 23 complete — train→deploy recipe operational
- You accept: LLMs are a different model shape (bigger, slower, sometimes streaming) with their own infrastructure (vector DBs, embedding services, prompt management). The patterns from P20-23 still apply; the infra just gets bigger.
Why this phase exists
The Year 4 flagship is services/llm-gateway/. P21 shipped v0 (skeleton); this phase ships v1 (production-grade for homelab traffic): RAG, streaming, multi-model routing, cost tracking.
This phase also lands the second composition recipe: Personal RAG over your weekly logs (notes-rag). After 4 years of weekly logs in Iceberg, you can ask your homelab “what was I stuck on in March 2027?” and get an answer.
→ Pattern: rag-as-pattern (DEEP this phase)
→ Pattern: inference-shapes (DEEP, adds streaming)
1. PROBLEM
You want to serve LLMs in production with:
- Multi-model routing (small for cheap, big for hard)
- Retrieval-augmented generation (domain context injected)
- Vector search at scale
- Streaming token output
- Cost + latency tracking per user/request
- Rate limiting + abuse detection
- OpenAI-compatible API so clients are vendor-neutral
That’s services/llm-gateway/ v1.
2. PRINCIPLES
2.1 RAG as 3 sub-systems
Ingest + retrieve + generate. Each is a separate problem.
→ Pattern: rag-as-pattern
Investigate each sub-system independently:
Ingest:
- Chunk documents (semantic vs fixed-size; overlap)
- Embed chunks (sentence-transformers, OpenAI, custom)
- Store in vector DB
Retrieve:
- Embed user query
- Vector similarity search (top-K)
- Hybrid: dense + keyword (BM25)
- Re-rank (cross-encoder)
Generate:
- Construct prompt with retrieved context
- Stream LLM response
- Optionally cite sources
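The ingest and retrieve steps above can be sketched in a few lines. This is a minimal illustration, not the gateway's implementation: `chunk_words`, `embed`, `cosine`, and `top_k` are hypothetical names, and the bag-of-words `embed` is a toy stand-in for a real embedding model such as sentence-transformers.

```python
import math
from collections import Counter


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Fixed-size word chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 5) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Swapping `embed` for a dense model and the linear scan in `top_k` for a vector DB query changes the cost profile of each step without changing the shape of the pipeline, which is the point of treating the sub-systems separately.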
2.2 Vector databases: pgvector vs Qdrant
Both store embeddings + do similarity search. Different operational shapes.
Investigate:
- pgvector extension on your existing Postgres (Tier 1) — single source of truth
- Qdrant on K8s (Tier 7) — dedicated vector DB
- Index 100K embeddings in both; compare query latency at 10/100/1000 QPS
- When does each win? (pgvector for “I already have Postgres”; Qdrant for “real scale.”)
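For the latency comparison, a small harness keeps the measurement identical across both stores. The sketch below is an assumption about how you might structure it: `query_fn` is a hypothetical callable that wraps either a pgvector SQL query or a Qdrant client call, and the pacing is single-threaded, so it cannot sustain a target rate once a single query takes longer than `1 / target_qps` seconds. A real test at 100 or 1000 QPS needs concurrent workers.

```python
import math
import time
from typing import Callable, Sequence


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted sequence."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure(query_fn: Callable[[object], object],
            queries: Sequence[object],
            target_qps: float) -> dict:
    """Issue queries at a paced rate and report latency percentiles in ms."""
    interval = 1.0 / target_qps
    latencies = []
    next_start = time.perf_counter()
    for q in queries:
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)  # wait for this query's scheduled start
        t0 = time.perf_counter()
        query_fn(q)
        latencies.append((time.perf_counter() - t0) * 1000)
        next_start += interval
    latencies.sort()
    return {
        "n": len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Run the same query set against both backends and compare p95 and p99 rather than the mean; tail latency is usually where the two diverge.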
2.3 vLLM at depth
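vLLM combines PagedAttention with continuous batching. The result is much higher throughput on the same GPU than naive serving: the vLLM paper reports 2-4x over FasterTransformer and Orca at similar latency, and the project's launch post claims up to 24x over HuggingFace Transformers.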
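Some back-of-the-envelope arithmetic shows why block-based KV cache allocation matters. The sketch below assumes illustrative model dimensions (32 layers, 8 KV heads, head dimension 128, fp16) and a block size of 16 tokens; none of these come from this plan, so substitute the values for the model you actually serve.

```python
import math


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token: keys and values (the factor 2) per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def contiguous_waste(seq_lens: list[int], max_len: int) -> int:
    """Unused token slots when every sequence reserves max_len up front."""
    return sum(max_len - n for n in seq_lens)


def paged_waste(seq_lens: list[int], block_size: int = 16) -> int:
    """Unused token slots when blocks are allocated on demand.

    Only the last block of each sequence can be partly empty.
    """
    return sum(math.ceil(n / block_size) * block_size - n for n in seq_lens)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    lens = [100, 500, 30]  # illustrative sequence lengths in tokens
    print("KiB per token:", per_token / 1024)
    print("wasted slots, contiguous:", contiguous_waste(lens, max_len=2048))
    print("wasted slots, paged:", paged_waste(lens))
```

Memory that is not wasted on reservations can hold more concurrent sequences, and that is what continuous batching then turns into throughput.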
Investigate:
- Read the vLLM paper (Kwon et al.) properly this time
- Configure continuous batching params (in vLLM, max-num-seqs and max-num-batched-tokens bound the batch; plus scheduling)
- Multi-model serving: route small queries to Phi-3-mini, big to Llama-3.2-1B (or bigger via spot GPU)
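One simple way to express the routing rule is a size threshold on the prompt. The sketch below is only an illustration of that idea: `Route`, `estimate_tokens`, and `choose_model` are hypothetical names, the 256-token threshold is arbitrary, and the four-characters-per-token estimate is a rough heuristic that a real gateway would replace with the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    max_prompt_tokens: int  # use this model when the prompt is at or below this size


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def choose_model(prompt: str, routes: list[Route], fallback: str) -> str:
    """Pick the first route whose threshold fits; otherwise use the fallback."""
    tokens = estimate_tokens(prompt)
    for route in sorted(routes, key=lambda r: r.max_prompt_tokens):
        if tokens <= route.max_prompt_tokens:
            return route.model
    return fallback


if __name__ == "__main__":
    routes = [Route(model="phi-3-mini", max_prompt_tokens=256)]
    print(choose_model("Summarize this sentence.", routes, fallback="llama-3.2-1b"))
    print(choose_model("x" * 4000, routes, fallback="llama-3.2-1b"))
```

Prompt length is a weak proxy for difficulty. Log which route each request took alongside its latency and cost so the threshold can be tuned from evidence.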
2.4 Streaming responses (SSE)
LLMs are slow; stream tokens as generated.
→ Pattern: inference-shapes (streaming variant)
Investigate:
- Implement SSE endpoint in Go (basecamp llm-gateway)
- Backpressure handling: what happens when the client reads more slowly than the server generates
- Why is first-token-latency more important than total-latency for chat?
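The wire format is small enough to sketch. The example below frames tokens as OpenAI-style `chat.completion.chunk` events and shows a matching client-side parser; it is written in Python for brevity even though the gateway is Go, and `sse_chunks` and `parse_sse` are hypothetical helper names. Each frame can be flushed as soon as its token exists, which is why the user sees output after the first token rather than after the full completion.

```python
import json
from typing import Iterable, Iterator


def sse_chunks(tokens: Iterable[str], model: str, request_id: str) -> Iterator[str]:
    """Yield one SSE frame per token, then a stop frame and the [DONE] sentinel."""
    def frame(delta: dict, finish_reason) -> str:
        payload = {
            "id": request_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
        }
        # An SSE event is a "data:" line terminated by a blank line.
        return f"data: {json.dumps(payload)}\n\n"

    for token in tokens:
        yield frame({"content": token}, None)
    yield frame({}, "stop")
    yield "data: [DONE]\n\n"


def parse_sse(frames: Iterable[str]) -> str:
    """Client side: concatenate delta content until the [DONE] sentinel."""
    parts = []
    for raw in frames:
        data = raw.strip().removeprefix("data: ")
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```

In the Go service the same frames would be written to the response and flushed after each one; if the write blocks because the client is slow, that blocking is the backpressure signal to propagate upstream.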
2.5 Cost + latency tracking
Every request: tokens-in/out, latency, model, user. Track. Bill.
Investigate:
- Per-request cost calculation (model-specific pricing)
- Aggregate by user/team/day
- Dashboard in Superset (Y3 P18)
- SLO: tokens/sec for streaming, e2e latency for non-streaming
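A sketch of the per-request calculation and the aggregation follows. The prices in `PRICES` are made-up placeholders (for self-hosted models you might derive them from amortized GPU and power cost instead), and `Usage`, `request_cost`, `tokens_per_second`, and `cost_by_user_day` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES = {"small": (0.0001, 0.0002), "large": (0.001, 0.002)}


@dataclass(frozen=True)
class Usage:
    user: str
    day: str          # e.g. "2027-03-01"
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float


def request_cost(u: Usage, prices: dict = PRICES) -> float:
    """Cost of one request from model-specific input and output prices."""
    price_in, price_out = prices[u.model]
    return u.tokens_in / 1000 * price_in + u.tokens_out / 1000 * price_out


def tokens_per_second(u: Usage) -> float:
    """Output throughput for one request, the streaming SLO metric."""
    return u.tokens_out / (u.latency_ms / 1000) if u.latency_ms > 0 else 0.0


def cost_by_user_day(records: list[Usage]) -> dict:
    """Sum request costs per (user, day) pair."""
    totals = defaultdict(float)
    for u in records:
        totals[(u.user, u.day)] += request_cost(u)
    return dict(totals)
```

The same record, emitted once per request, can feed both the Prometheus metrics and the Superset dashboard.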
2.6 Per-user rate limiting
Token bucket via Redis. Same shape as P3 + P12 work.
→ Pattern: caching + Redis (Y3 P18)
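The token bucket logic itself fits in one small class. This in-memory sketch is an assumption about the shape, not the gateway's code: in production the per-user `tokens` and `updated` values would live in Redis and be updated atomically (for example in a Lua script), and the injectable `clock` exists only to make the behavior testable.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity       # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For an LLM gateway, `cost` can be the number of requested tokens rather than 1, so a single large completion drains the bucket faster than many small ones.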
2.7 Prompt management
Prompts are config. Version them. Test them. Store them.
Investigate:
- Prompt as code: .txt files in basecamp git, versioned
- Or: prompt store as a service (preview only; the Y5 P29 portal hosts this)
- Eval suite: prompts + expected outputs; gate prompt changes via PR
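A prompt eval gate can start very small. The sketch below assumes prompts are stored as `string.Template` text with `$name` placeholders and that each eval case checks for an expected substring; `render`, `run_eval`, the case layout, and `model_fn` (a callable that sends a prompt to a model and returns its text) are all hypothetical.

```python
from string import Template
from typing import Callable


def render(template_text: str, **variables: str) -> str:
    """Fill a versioned prompt template; a missing variable raises KeyError."""
    return Template(template_text).substitute(**variables)


def run_eval(template_text: str,
             cases: list[dict],
             model_fn: Callable[[str], str],
             min_pass_rate: float = 1.0) -> dict:
    """Run every case through the model and gate on the pass rate."""
    passed = 0
    failures = []
    for case in cases:
        output = model_fn(render(template_text, **case["vars"]))
        if case["expect_substring"].lower() in output.lower():
            passed += 1
        else:
            failures.append(case["name"])
    rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": rate, "failures": failures, "ok": rate >= min_pass_rate}
```

In CI, a failing `ok` would fail the PR that changed the prompt file. Substring checks are brittle for free-form answers, so treat them as a first gate rather than a full quality measure.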
3. TRADE-OFFS
| Decision | Option A | Option B | Option C, or when to choose |
|---|---|---|---|
| Vector DB | pgvector (in Postgres) | Qdrant | Weaviate |
| Embedding | OpenAI API | sentence-transformers (self-host) | Hosted: pay-per-call. Self: GPU cost + control |
| LLM serving | vLLM | TGI (HuggingFace) | OpenAI API |
| RAG framework | LangChain | LlamaIndex | custom (full control) |
| Re-ranking | cross-encoder (BAAI/bge-reranker) | LLM judge | none |
4. TOOLS (as of 2025-10)
- vLLM 0.6+ (LLM runtime)
- pgvector 0.7+
- Qdrant 1.12+
- sentence-transformers (embeddings)
- HuggingFace transformers (model loading)
- LangChain or LlamaIndex (awareness — likely DIY for production)
- Go (continuing llm-gateway)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| vLLM paper (“Efficient Memory Management for LLM Serving” — Kwon et al.) | The runtime |
| “AI Engineering” (Chip Huyen, 2024) | Modern LLM systems book |
| Anthropic / OpenAI cookbooks | Practical patterns |
| Recommended | Why |
|---|---|
| “Speech and Language Processing” (Jurafsky & Martin) Ch. on RAG | Depth |
| Pinecone / Weaviate blog series on RAG | Field practice |
5.2 Operational depth checklist
[ ] Install pgvector extension on basecamp's Postgres
[ ] Deploy Qdrant on basecamp Tier 7
[ ] Embed 100K chunks; compare pgvector vs Qdrant latency at 10/100/1000 QPS
[ ] Build minimal RAG end-to-end: query → top-5 retrieve → vLLM with context → response
[ ] Add hybrid search (dense + BM25); compare relevance vs dense-only
[ ] Add re-ranking (cross-encoder); measure relevance lift
[ ] Implement streaming SSE endpoint in Go (extend llm-gateway v0)
[ ] Add cost + latency tracking per request to MLflow + Prometheus
[ ] Multi-model routing in llm-gateway: small queries to Phi-3-mini, larger to Llama-3.2-1B
[ ] Per-user rate limiting via Redis token bucket
[ ] Prompt versioning: prompts in basecamp git; CI tests them against eval set
[ ] Build cost-tracking dashboard in Superset
5.3 services/llm-gateway/ v1
By phase end, llm-gateway is real:
services/llm-gateway/ v1 (PUBLIC via basecamp's repo): Go service in basecamp/charts/llm-gateway/
Endpoints:
- POST /v1/chat/completions: OpenAI-compat, streaming SSE
- POST /v1/embeddings: embedding generation
- GET /v1/models: available models
Features:
- Multi-model routing (small vs large)
- RAG pipeline (vector search + context injection)
- Streaming SSE responses
- Per-user rate limit (Redis token bucket)
- Per-request cost + latency tracking (OTel + Prometheus)
- mTLS via mesh
- OIDC auth via Dex
- Audit log to Loki
Deferred to P25:
- Drift detection on input embeddings (KS-test)
- Auto-rollback on drift alert
- Quantization-aware deployment
5.4 notes-rag personal service ships
By phase end, the second Studio composition recipe is live:
notes-rag = "ask my homelab about my own writing"
Pipeline:
1. Ingest (offline, weekly via Airflow): Iceberg abukix.weekly_logs → Spark chunk → embed → pgvector
2. Retrieve (online): user query → embed → pgvector top-5 → re-rank
3. Generate (online): prompt(retrieved context + query) → llm-gateway → stream response
4. UI: small Go HTTP server with htmx (same stack as triage from Y1) deployed via basecamp/charts/personal/notes-rag/
Personal moment: "What was I stuck on in March 2027?" → response: "You wrote about X, Y, Z..." → first cinematic moment of the homelab as your second brain
This is private at first; a sanitized public demo lives at studio.abukix.dev when Year 5’s portal launches.
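The generate step's prompt(retrieved context + query) can be sketched as a pure function, which also makes it easy to test in CI. Everything here is an assumption for illustration: the `Chunk` fields, the week-style source identifiers, the instruction wording, and the character budget standing in for a token budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str   # hypothetical identifier, e.g. a weekly-log entry such as "2027-W10"
    text: str
    score: float  # relevance score from retrieval or re-ranking


def build_prompt(question: str, chunks: list[Chunk], max_context_chars: int = 4000) -> str:
    """Number the best chunks, stay within a context budget, and ask for citations."""
    lines = []
    used = 0
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    for i, chunk in enumerate(ranked, start=1):
        entry = f"[{i}] ({chunk.source}) {chunk.text}"
        if used + len(entry) > max_context_chars:
            break  # budget exhausted; lower-ranked chunks are dropped
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines) if lines else "(no context retrieved)"
    return (
        "Answer using only the numbered context. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Keeping the source identifier next to each chunk is what lets the UI turn a cited [n] back into a link to the original log entry.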
6. COMPARE: pgvector vs Qdrant for production
You did the latency comparison. Now do the operational one: backups, multi-replica, schema evolution, ops burden.
500 words.
7. OPERATE
- 4+ runbooks (vllm-oom, vector-db-recovery, rag-quality-drift, llm-gateway-streaming-stuck)
- 2+ postmortems
- 1+ ADR (pgvector-or-qdrant-for-basecamp)
- Weekly log
8. CONTRIBUTE
vLLM, pgvector, Qdrant, sentence-transformers — all active.
Validation criteria
[ ] All 12 operational depth checks
[ ] llm-gateway v1 shipped in basecamp Tier 7 (RAG + streaming + multi-model + cost)
[ ] notes-rag personal service running on basecamp
[ ] pgvector vs Qdrant comparison written up
[ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
  - rag-as-pattern → DEEP
  - inference-shapes → DEEP (streaming variant added)
[ ] Exit Test passed
Exit Test
Time: 4 hours.
- Build (120 min) — add a new model to llm-gateway v1: vLLM with a different small model, RAG-enabled, streaming, cost-tracked, rate-limited. End-to-end. With OTel traces.
- Debug (90 min) — scenario: notes-rag returns irrelevant context for a known-answer query. Find why (chunking? embedding model? top-K too small? hybrid weighting?).
- Articulate (30 min) — 800 words: “Trace a notes-rag query end-to-end: user prompt → final streamed response. Cite every component + cost.”
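For the Debug scenario, a hit-rate check over known-answer queries narrows the search quickly. This is one possible approach, not a prescribed method: `retrieve` is a hypothetical callable returning chunk IDs for a query, and the labeled set is something you would assemble by hand from queries whose source entries you already know.

```python
from typing import Callable, Sequence


def hit_rate_at_k(retrieve: Callable[[str, int], Sequence[str]],
                  labeled_queries: Sequence[tuple[str, Sequence[str]]],
                  k: int = 5) -> tuple[float, list[str]]:
    """Fraction of queries with at least one relevant chunk in the top k, plus the misses."""
    hits = 0
    misses = []
    for query, relevant_ids in labeled_queries:
        returned = retrieve(query, k)
        if set(returned[:k]) & set(relevant_ids):
            hits += 1
        else:
            misses.append(query)
    rate = hits / len(labeled_queries) if labeled_queries else 0.0
    return rate, misses


def sweep_k(retrieve: Callable[[str, int], Sequence[str]],
            labeled_queries: Sequence[tuple[str, Sequence[str]]],
            ks: Sequence[int] = (1, 5, 20)) -> dict:
    """Hit rate at several values of k, to separate ranking from recall problems."""
    return {k: hit_rate_at_k(retrieve, labeled_queries, k)[0] for k in ks}
```

If the hit rate climbs as k grows, the right chunk is indexed but ranked too low, which points at top-K, re-ranking, or hybrid weighting. If it stays flat even at large k, suspect chunking or the embedding model.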
Anti-patterns
| Anti-pattern | Why |
|---|---|
| LLM “for everything” without classical baseline | Often a sklearn classifier wins for small problems |
| RAG without re-ranking | Top-K-only retrieval misses relevance gains |
| No prompt versioning | Prompts drift; results drift; no audit |
| Streaming without backpressure | Buffer growth on slow clients |
| Vector DB without backup strategy | One MinIO disk loss = embeddings rebuild from scratch |
Patterns deepened this phase
- rag-as-pattern → DEEP
- inference-shapes → DEEP
→ Next: Phase 25: GPU Infrastructure + Production (Y4 capstone)