LLM Infrastructure + RAG + llm-gateway v1

Fifth phase of Year 4. The big LLM-shaped phase. vLLM at depth, RAG as a 3-system pattern, vector DBs (pgvector + Qdrant), llm-gateway v1 ships. Personal RAG over your weekly logs lands as notes-rag. ~8 weeks, ~100 hrs.

Where this phase sits

P24 is the centerpiece of Year 4. Phase 21 shipped the services/llm-gateway/ v0 scaffold — one Go service, one endpoint, one backend, deliberately minimal. P24 turns that scaffold into v1: RAG, streaming SSE, multi-model routing, per-user rate limiting, cost and latency tracking, OpenAI-compatible API surface. It’s the phase where the gateway stops being a stub and starts being the substance behind Year 4’s flagship.

This is also the phase where notes-rag lands — your personal RAG over four years of weekly logs. By P24, the Master Plan’s weekly-log discipline has produced a corpus no demo dataset can fake: real questions you were stuck on, real patterns you noticed, real things you broke and fixed. You ingest that corpus, index it, point llm-gateway at it, and ask your homelab “what was I stuck on in March 2027?” That’s the second Studio composition recipe, and it’s the cinematic moment Year 4 has been building toward.

The pattern depth this phase: rag-as-pattern and inference-shapes both reach DEEP. RAG specifically reaches DEEP because P24 forces the three sub-systems (ingest, retrieve, generate) to be operated separately. They fail differently, scale differently, and cost differently — and you’ll have evidence of all three by the end of the phase. The full P21 → P24 → P25 llm-gateway arc is summarized in the Year 4 overview; P24 is the middle step where the gateway becomes real.

Prerequisites

Phase 23 complete — train→deploy recipe operational

You accept: LLMs are a different model shape (bigger, slower, sometimes streaming) with their own infrastructure (vector DBs, embedding services, prompt management). The patterns from P20-23 still apply; the infra just gets bigger.

Why this phase exists

The Year 4 flagship is services/llm-gateway/. P21 shipped v0 (skeleton); this phase ships v1 (production-grade for homelab traffic): RAG, streaming, multi-model routing, cost tracking.

This phase also lands the second composition recipe: Personal RAG over your weekly logs (notes-rag). After 4 years of weekly logs in Iceberg, you can ask your homelab “what was I stuck on in March 2027?” and get an answer.

→ Pattern: rag-as-pattern (DEEP this phase)
→ Pattern: inference-shapes (DEEP, adds streaming)

1. PROBLEM

You want to serve LLMs in production with:

Multi-model routing (small for cheap, big for hard)
Retrieval-augmented generation (domain context injected)
Vector search at scale
Streaming token output
Cost + latency tracking per user/request
Rate limiting + abuse detection
OpenAI-compatible API so clients are vendor-neutral

That’s services/llm-gateway/ v1.

2. PRINCIPLES

2.1 RAG as 3 sub-systems

Ingest + retrieve + generate. Each is a separate problem.

→ Pattern: rag-as-pattern

Investigate each sub-system independently:

Ingest:

Chunk documents (semantic vs fixed-size; overlap)
Embed chunks (sentence-transformers, OpenAI, custom)
Store in vector DB
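
A minimal sketch of the fixed-size-with-overlap option, assuming chunks are measured in whitespace-separated words rather than model tokens. The function name and the default sizes are illustrative, not values this phase prescribes.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size` words that share `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e f g h i j", size=4, overlap=1):
        print(chunk)
```

Overlap trades index size for recall at chunk boundaries: a sentence split across two windows still appears whole in one of them, as long as it is shorter than the overlap.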

Retrieve:

Embed user query
Vector similarity search (top-K)
Hybrid: dense + keyword (BM25)
Re-rank (cross-encoder)
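
One common way to combine the dense and BM25 result lists is reciprocal rank fusion (RRF). The sketch below assumes RRF; the text above does not say which fusion method to use, and weighted score blending is an equally valid choice. Inputs are ranked lists of chunk ids, best first; k=60 is a widely used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


if __name__ == "__main__":
    dense = ["a", "b", "c"]  # hypothetical ids from vector search
    bm25 = ["b", "d", "a"]   # hypothetical ids from keyword search
    print(reciprocal_rank_fusion([dense, bm25], top_n=3))
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.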

Generate:

Construct prompt with retrieved context
Stream LLM response
Optionally cite sources
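
Prompt construction with citable context might look like the sketch below. The chunk shape (dicts with "source" and "text" keys), the instruction wording, and the character budget are all assumptions for illustration.

```python
def build_prompt(query: str, chunks: list[dict], max_chars: int = 4000) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context_lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(entry)
        used += len(entry)

    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    retrieved = [
        {"source": "2027-W10", "text": "Stuck on vLLM OOM."},
        {"source": "2027-W11", "text": "Fixed it."},
    ]
    print(build_prompt("What was I stuck on?", retrieved))
```

Chunks are dropped from the tail when the budget runs out, so the order coming out of retrieval (or re-ranking) decides what the model actually sees.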

2.2 Vector databases: pgvector vs Qdrant

Both store embeddings + do similarity search. Different operational shapes.

Investigate:

pgvector extension on your existing Postgres (Tier 1) — single source of truth
Qdrant on K8s (Tier 7) — dedicated vector DB
Index 100K embeddings in both; compare query latency at 10/100/1000 QPS
When does each win? (pgvector for “I already have Postgres”; Qdrant for “real scale.”)
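
For the latency comparison, a small harness that reports percentiles keeps the two systems on equal footing. This is a sketch only: run_query is a hypothetical callable wrapping whichever client you test, and the loop issues queries sequentially, so it measures per-query latency rather than behaviour at a target QPS (that needs a paced or concurrent load generator).

```python
import math
import time


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def time_queries(run_query, queries) -> list[float]:
    """Run each query once and return wall-clock latencies in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Report the percentiles that usually matter for serving."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Stand-in for a real vector search call.
    fake_search = lambda q: sum(range(10_000))
    print(summarize(time_queries(fake_search, ["q"] * 200)))
```

Compare p95 and p99, not the mean: tail latency is where an index that no longer fits in memory shows up first.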

2.3 vLLM at depth

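vLLM combines PagedAttention (KV-cache memory managed in fixed-size pages) with continuous batching. The vLLM paper reports 2-4x higher throughput than prior serving systems at similar latency; the gain over naive one-request-at-a-time serving on the same GPU can be considerably larger.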

Investigate:

Read the vLLM paper (Kwon et al.) properly this time
Configure continuous batching params (max-num-seqs, max-num-batched-tokens, scheduling)
Multi-model serving: route small queries to Llama-3.2-1B (1B parameters), larger ones to Phi-3-mini (3.8B), or to something bigger via spot GPU
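
The simplest routing rule is prompt length. The sketch below assumes a rough four-characters-per-token estimate and a hypothetical 256-token threshold; model names are passed in by the caller, so nothing here depends on which models you deploy. A real router would likely also look at the requested max output tokens or an explicit client hint.

```python
def estimate_tokens(text: str) -> int:
    """Rough size estimate: about 4 characters per token (assumption, English text)."""
    return max(1, len(text) // 4)


def pick_model(
    prompt: str,
    small_model: str,
    large_model: str,
    threshold_tokens: int = 256,  # hypothetical cut-off, tune from your own traffic
) -> str:
    """Send short prompts to the cheap model and everything else to the large one."""
    if estimate_tokens(prompt) <= threshold_tokens:
        return small_model
    return large_model


if __name__ == "__main__":
    print(pick_model("Summarize this sentence.", "small-model", "large-model"))
    print(pick_model("x" * 5000, "small-model", "large-model"))
```

Log the routing decision alongside the cost and latency fields so you can later check whether the threshold is actually saving money.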

2.4 Streaming responses (SSE)

LLMs are slow; stream tokens as generated.

→ Pattern: inference-shapes (streaming variant)

Investigate:

Implement SSE endpoint in Go (basecamp llm-gateway)
Backpressure handling: what should happen when the client reads slower than the server generates?
Why is first-token-latency more important than total-latency for chat?
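
The wire format is small enough to sketch. The gateway itself is Go, but the framing is language-neutral: each event is a "data: " line followed by a blank line, and OpenAI-compatible streams end with a literal [DONE] event. The payload below follows the chat.completion.chunk shape with only the fields needed for illustration; the model name is a placeholder.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], model: str = "example-model") -> Iterator[str]:
    """Wrap generated tokens as server-sent events, one frame per token."""
    for token in tokens:
        payload = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}}],
        }
        yield f"data: {json.dumps(payload)}\n\n"
    # Terminator used by OpenAI-compatible streaming clients.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in sse_events(["Hel", "lo"]):
        print(frame, end="")
```

Because this is a generator, a frame is produced only when the consumer asks for the next one. That pull-based shape is the property to preserve in the Go handler: write, flush, and let a blocked write slow the producer instead of buffering without bound.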

2.5 Cost + latency tracking

Every request: tokens-in/out, latency, model, user. Track. Bill.

Investigate:

Per-request cost calculation (model-specific pricing)
Aggregate by user/team/day
Dashboard in Superset (Y3 P18)
SLO: tokens/sec for streaming, e2e latency for non-streaming
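
Per-request cost is arithmetic over token counts and a price table. The prices and model names below are invented placeholders; for self-hosted models you would derive a per-token figure from GPU cost and measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float   # currency units per 1M prompt tokens
    output_per_million: float  # currency units per 1M generated tokens


# Hypothetical price table, for illustration only.
PRICES = {
    "small-model": Price(0.10, 0.20),
    "large-model": Price(1.00, 2.00),
}


def request_cost(model: str, tokens_in: int, tokens_out: int, prices=PRICES) -> float:
    """Cost of one request given its token counts."""
    price = prices[model]
    total = tokens_in * price.input_per_million + tokens_out * price.output_per_million
    return total / 1_000_000


def aggregate_by_user(requests: list[dict]) -> dict[str, float]:
    """Sum request costs per user; each request carries user, model and token counts."""
    totals: dict[str, float] = {}
    for r in requests:
        cost = request_cost(r["model"], r["tokens_in"], r["tokens_out"])
        totals[r["user"]] = totals.get(r["user"], 0.0) + cost
    return totals


if __name__ == "__main__":
    log = [{"user": "alice", "model": "small-model", "tokens_in": 1000, "tokens_out": 1000}]
    print(aggregate_by_user(log))
```

Input and output tokens are priced separately because generation is usually the more expensive side, so keep both counts in the request log.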

2.6 Per-user rate limiting

Token bucket via Redis. Same shape as P3 + P12 work.

→ Pattern: caching + Redis (Y3 P18)
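
The token-bucket logic itself, shown in memory so it can be read and tested without Redis. In the Redis-backed version the two state fields (tokens, last update time) would live under a per-user key and be updated atomically, for example in a server-side script; that part is an assumption about the design, not shown here.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled continuously at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
    print([bucket.allow() for _ in range(3)])
```

For an LLM gateway, consider passing the request's token count as cost instead of 1, so that one long generation cannot hide behind a per-request limit.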

2.7 Prompt management

Prompts are config. Version them. Test them. Store them.

Investigate:

Prompt as code: .txt files in basecamp git, versioned
Or: prompt store as a service (preview only — Y5 P29 portal hosts this)
Eval suite: prompts + expected outputs; gate prompt changes via PR
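
A prompt eval gate can start very small. In this sketch, generate is a hypothetical callable that sends a rendered prompt to the model and returns text, a case passes if the expected string appears in the output, and the 0.9 threshold is an arbitrary placeholder. Substring matching is crude; it is here to show the shape of the gate, not to recommend a metric.

```python
def render(template: str, **variables: str) -> str:
    """Fill a versioned prompt template with case-specific variables."""
    return template.format(**variables)


def run_eval(generate, template: str, cases: list[dict]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        output = generate(render(template, **case["vars"]))
        if case["expect"].lower() in output.lower():
            passed += 1
    return passed / len(cases)


def gate(pass_rate: float, threshold: float = 0.9) -> bool:
    """CI decision: accept the prompt change only above the threshold."""
    return pass_rate >= threshold


if __name__ == "__main__":
    fake_model = lambda prompt: "Paris" if "France" in prompt else "unknown"
    cases = [{"vars": {"question": "Capital of France?"}, "expect": "paris"}]
    rate = run_eval(fake_model, "Q: {question}\nA:", cases)
    print(rate, gate(rate))
```

In CI, the template comes from the .txt file changed in the PR and the cases from a checked-in eval set, so a prompt edit and its score land in the same review.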

3. TRADE-OFFS

Decision	Option A	Option B	Option C	When
Vector DB	pgvector (in Postgres)	Qdrant	Weaviate	-
Embedding	OpenAI API	sentence-transformers (self-host)	-	Hosted: pay-per-call. Self-hosted: GPU cost + control
LLM serving	vLLM	TGI (HuggingFace)	OpenAI API	-
RAG framework	LangChain	LlamaIndex	custom (full control)	-
Re-ranking	cross-encoder (BAAI/bge-reranker)	LLM judge	none	-

4. TOOLS (as of 2025-10)

vLLM 0.6+ (LLM runtime)
pgvector 0.7+
Qdrant 1.12+
sentence-transformers (embeddings)
HuggingFace transformers (model loading)
LangChain or LlamaIndex (awareness — likely DIY for production)
Go (continuing llm-gateway)

5. MASTERY

5.1 Reading list

Required	Why
vLLM paper (“Efficient Memory Management for Large Language Model Serving with PagedAttention”, Kwon et al., 2023)	The runtime
“AI Engineering” (Chip Huyen, 2024)	Modern LLM systems book
Anthropic / OpenAI cookbooks	Practical patterns

Recommended	Why
“Speech and Language Processing” (Jurafsky & Martin) Ch. on RAG	Depth
Pinecone / Weaviate blog series on RAG	Field practice

5.2 Operational depth checklist

[ ] Install pgvector extension on basecamp's Postgres
[ ] Deploy Qdrant on basecamp Tier 7
[ ] Embed 100K chunks; compare pgvector vs Qdrant latency at 10/100/1000 QPS
[ ] Build minimal RAG end-to-end: query → top-5 retrieve → vLLM with context → response
[ ] Add hybrid search (dense + BM25); compare relevance vs dense-only
[ ] Add re-ranking (cross-encoder); measure relevance lift
[ ] Implement streaming SSE endpoint in Go (extend llm-gateway v0)
[ ] Add cost + latency tracking per request to MLflow + Prometheus
[ ] Multi-model routing in llm-gateway: small queries to Llama-3.2-1B, larger to Phi-3-mini
[ ] Per-user rate limiting via Redis token bucket
[ ] Prompt versioning: prompts in basecamp git; CI tests them against eval set
[ ] Build cost-tracking dashboard in Superset

5.3 services/llm-gateway/ v1

By phase end, llm-gateway is real:

services/llm-gateway/ v1 (PUBLIC via basecamp's repo):
  Go service in basecamp/charts/llm-gateway/

  Endpoints:
    POST /v1/chat/completions       — OpenAI-compat, streaming SSE
    POST /v1/embeddings             — embedding generation
    GET  /v1/models                 — available models

  Features:
    - Multi-model routing (small vs large)
    - RAG pipeline (vector search + context injection)
    - Streaming SSE responses
    - Per-user rate limit (Redis token bucket)
    - Per-request cost + latency tracking (OTel + Prometheus)
    - mTLS via mesh
    - OIDC auth via Dex
    - Audit log to Loki

  Deferred to P25:
    - Drift detection on input embeddings (KS-test)
    - Auto-rollback on drift alert
    - Quantization-aware deployment

5.4 notes-rag personal service ships

By phase end, the second Studio composition recipe is live:

notes-rag = "ask my homelab about my own writing"

  Pipeline:
    1. Ingest (offline, weekly via Airflow):
       Iceberg abukix.weekly_logs → Spark chunk → embed → pgvector
    2. Retrieve (online):
       user query → embed → pgvector top-5 → re-rank
    3. Generate (online):
       prompt(retrieved context + query) → llm-gateway → stream response
    4. UI:
       small Go HTTP server with htmx (same stack as triage from Y1)
       deployed via basecamp/charts/personal/notes-rag/

  Personal moment:
    "What was I stuck on in March 2027?"
    → response: "You wrote about X, Y, Z..."
    → first cinematic moment of the homelab as your second brain

This is private at first; a sanitized public demo lives at studio.abukix.dev when Year 5’s portal launches.

6. COMPARE: pgvector vs Qdrant for production

You did the latency comparison. Now do the operational one: backups, multi-replica, schema evolution, ops burden.

500 words.

7. OPERATE

4+ runbooks (vllm-oom, vector-db-recovery, rag-quality-drift, llm-gateway-streaming-stuck)
2+ postmortems
1+ ADR (pgvector-or-qdrant-for-basecamp)
Weekly log

8. CONTRIBUTE

vLLM, pgvector, Qdrant, sentence-transformers: all are active open-source projects that accept outside contributions.

Validation criteria

[ ] All 12 operational depth checks
[ ] llm-gateway v1 shipped in basecamp Tier 7 (RAG + streaming + multi-model + cost)
[ ] notes-rag personal service running on basecamp
[ ] pgvector vs Qdrant comparison written up
[ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
    - rag-as-pattern → DEEP
    - inference-shapes → DEEP (streaming variant added)
[ ] Exit Test passed

Exit Test

Time: 4 hours.

Build (120 min) — add a new model to llm-gateway v1: vLLM with a different small model, RAG-enabled, streaming, cost-tracked, rate-limited. End-to-end. With OTel traces.
Debug (90 min) — scenario: notes-rag returns irrelevant context for a known-answer query. Find why (chunking? embedding model? top-K too small? hybrid weighting?).
Articulate (30 min) — 800 words: “Trace a notes-rag query end-to-end: user prompt → final streamed response. Cite every component + cost.”
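
For the Debug step, a known-answer retrieval check narrows the search quickly. The sketch assumes you hand-label which chunk ids should come back for a few queries; search is a hypothetical callable returning ranked chunk ids for a query and a k.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def sweep_k(search, known_answers: dict[str, set[str]], ks=(1, 5, 20)) -> dict[int, float]:
    """Average recall@k over all labelled queries, for each k."""
    if not known_answers:
        raise ValueError("need at least one labelled query")
    report = {}
    for k in ks:
        scores = [
            recall_at_k(search(query, k), relevant, k)
            for query, relevant in known_answers.items()
        ]
        report[k] = sum(scores) / len(scores)
    return report


if __name__ == "__main__":
    ranking = ["c1", "c2", "c3", "c4"]  # stand-in for a real index
    fake_search = lambda query, k: ranking[:k]
    print(sweep_k(fake_search, {"what was I stuck on?": {"c3"}}, ks=(1, 3)))
```

If recall jumps as k grows, the right chunk is being retrieved but ranked too low, which points at top-K or re-ranking. If it stays low at every k, look at chunking or the embedding model instead.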

Anti-patterns

Anti-pattern	Why
LLM “for everything” without classical baseline	Often a sklearn classifier wins for small problems
RAG without re-ranking	Top-K-only retrieval misses relevance gains
No prompt versioning	Prompts drift; results drift; no audit
Streaming without backpressure	Buffer growth on slow clients
Vector DB without backup strategy	One MinIO disk loss = embeddings rebuild from scratch

Patterns deepened this phase

rag-as-pattern → DEEP
inference-shapes → DEEP

→ Next: Phase 25: GPU Infrastructure + Production (Y4 capstone)
