GPU Infrastructure + Production (Year 4 Capstone)
Final phase of Year 4. GPU scheduling patterns + production-grade llm-gateway (drift detection + auto-rollback) + mlship v0.5. Year 4 synthesis. ~8 weeks, ~95 hrs.
Where this phase sits
P25 is the Year 4 capstone. It does three things at once: it isolates GPU scheduling as its own pattern (the Year 4 overview explains why GPU isn’t a sub-topic of cloud), it hardens llm-gateway from v1 to v1.5 by closing the drift-detection and auto-rollback loop, and it advances mlship from v0 to v0.5 with PyTorch detection and a Fargate target. Each of those alone would be a small phase. Together they’re the synthesis that earns the Year 4 graduation badge.
The cloud spend constraint is real and not optional: $20-50, destroy-at-end-of-session. Buying a homelab GPU is the wrong move at this stage of the homelab/hardware plan — the ratio of “money spent” to “patterns learned” is much better with a few hours of cloud spot than with a depreciating $1500 card. The point of P25 is GPU-pattern fluency, not GPU mastery. You’ll never have a GPU fleet at home. You’ll demonstrate enough operational depth on cloud spot to interview convincingly, and that’s the right bar for an ML Platform / AI Infrastructure Engineer exit ramp.
P25 closes every Y4 pattern at DEEP: model-lifecycle (drift + auto-rollback complete the retrain side), train-serve-skew (revisited at production scale via input-distribution monitoring), feature-store, inference-shapes, rag-as-pattern — all operational, all observable, all drift-aware. By the end of this phase, you stop being someone who deploys ML and start being someone who operates an ML platform. That’s the operator → architect inflection Year 5 is built on.
Prerequisites
- Phase 24 complete — llm-gateway v1 + notes-rag operational
- Cloud GPU spend budget ready (~$20-50 total this phase)
- You accept: homelab can’t have real GPUs at meaningful scale. This phase masters GPU patterns via cloud spot + benchmark + production-shape understanding.
Why this phase exists
Year 4 exit is ML Platform Engineer / AI Infrastructure Engineer. That role manages GPU fleets, multi-tenant inference, cost. You won’t have a GPU fleet in the homelab, but you’ll do meaningful work on cloud GPUs to understand the patterns.
This phase synthesizes Year 4 + ships the hardened llm-gateway v1.5 + grows mlship to v0.5.
1. PROBLEM
GPUs are 10-100x faster than CPUs for ML but expensive ($0.30-5/hr cloud). Multi-tenant access requires scheduling, sharing, cost attribution. Production LLM serving needs careful GPU memory management. Models drift in production; you need detection + safe rollback.
2. PRINCIPLES
2.1 GPU scheduling in K8s
K8s schedules GPUs via the `nvidia.com/gpu` extended resource. The NVIDIA Device Plugin advertises each node's GPUs to the kubelet, which lets the scheduler place Pods that request them.
Investigate:
- Spin up an EKS cluster with 1 GPU node (g5.xlarge spot)
- Install NVIDIA Device Plugin
- Schedule a Pod requesting `nvidia.com/gpu: 1`
- Verify CUDA via `nvidia-smi` from inside the pod
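The last two steps can be sketched as a small manifest generator. This is a minimal sketch, not a definitive setup: the Pod name `gpu-smoke-test` and the CUDA image tag are assumptions chosen for illustration, and a GPU node with taints would also need matching tolerations.

```python
import json


def gpu_smoke_test_pod(name="gpu-smoke-test",
                       image="nvidia/cuda:12.4.1-base-ubuntu22.04"):
    """Build a one-shot Pod manifest that requests one GPU and runs nvidia-smi.

    The name and image are illustrative assumptions; swap in whatever
    CUDA base image matches the driver on the node.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once and stop, so the logs hold the nvidia-smi output.
            "restartPolicy": "Never",
            "containers": [{
                "name": "cuda",
                "image": image,
                "command": ["nvidia-smi"],
                # GPUs are requested under limits as an extended resource.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(gpu_smoke_test_pod(), indent=2))
```

Assuming the script is saved as `gpu_pod.py` (a hypothetical filename), piping its output into `kubectl apply -f -` and then reading `kubectl logs gpu-smoke-test` should show the GPU table if the device plugin is working.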
2.2 GPU sharing strategies
- Time-slicing: multiple pods take turns on one GPU; context-switch overhead, no memory or fault isolation between pods
- MPS (Multi-Process Service): spatial sharing for compatible workloads
- MIG (Multi-Instance GPU): A100/H100 hard partition
- vGPU: hypervisor-level GPU virtualization; requires an NVIDIA license
Investigate:
- Configure time-slicing in K8s device plugin
- Run 4 small inference pods sharing 1 GPU
- Measure throughput vs dedicated
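A sketch of the time-slicing setup and the throughput comparison follows. The config layout (`version`, `sharing.timeSlicing.resources`) is assumed from the device plugin's documented sharing config and should be checked against the installed version; the ConfigMap name, namespace, and data key are hypothetical, and the config is emitted as JSON on the assumption that the plugin's YAML parser accepts it.

```python
import json


def time_slicing_configmap(replicas=4, name="time-slicing-config",
                           namespace="gpu-operator", key="any"):
    """Build a ConfigMap carrying a time-slicing config for the device plugin.

    With replicas=4, each physical GPU is advertised as 4 schedulable
    nvidia.com/gpu resources. name, namespace, and key are assumptions.
    """
    plugin_config = {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [
                    {"name": "nvidia.com/gpu", "replicas": replicas},
                ],
            },
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {key: json.dumps(plugin_config, indent=2)},
    }


def sharing_efficiency(shared_throughputs, dedicated_throughput):
    """Aggregate throughput of the sharing pods relative to one dedicated pod.

    Values near 1.0 mean sharing costs little; well below 1.0 means the
    context switches are eating the GPU.
    """
    if dedicated_throughput <= 0:
        raise ValueError("dedicated_throughput must be positive")
    return sum(shared_throughputs) / dedicated_throughput


if __name__ == "__main__":
    print(json.dumps(time_slicing_configmap(), indent=2))
```

The numbers fed to `sharing_efficiency` come from your own benchmark runs (for example, requests per second per pod); the function only does the arithmetic.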
2.3 Drift detection in production
→ Pattern: train-serve-skew (revisited at production scale)
Models degrade silently as input distributions shift.
Investigate:
- Add input-feature distribution monitoring to llm-gateway (mean, std, KS-test alert)
- Add output-quality monitoring (refusal rate, length distribution)
- Statistical tests for drift: KS, MMD, population stability index
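To make the KS-test alert concrete, here is a dependency-free sketch of a two-sample Kolmogorov-Smirnov check. It is an illustration under assumptions, not the gateway's implementation: the function names are hypothetical, the threshold uses the standard large-sample approximation, and in practice a library such as SciPy, Evidently, or Alibi Detect would likely replace it.

```python
from bisect import bisect_right
from math import log, sqrt


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(live)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS test."""
    c = sqrt(-log(alpha / 2) / 2)  # about 1.36 for alpha = 0.05
    return c * sqrt((n + m) / (n * m))


def drift_detected(reference, live, alpha=0.05):
    """True when the live window differs from the reference at level alpha."""
    d = ks_statistic(reference, live)
    return d > ks_critical_value(len(reference), len(live), alpha)


if __name__ == "__main__":
    training_window = [float(i) for i in range(100)]
    shifted_window = [float(i) for i in range(50, 150)]
    print(ks_statistic(training_window, shifted_window))
    print(drift_detected(training_window, shifted_window))
```

The test is univariate, so for embeddings it would have to run per dimension or on a scalar summary (for example, distance to a reference centroid); that choice is left open here.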
2.4 ML CI/CD with auto-rollback
Models flow through staging → production via KFP + KServe canary. If drift fires, auto-rollback.
→ Pattern: model-lifecycle (DEEP — closes the retrain loop)
Investigate:
- Add eval suite to llm-gateway (golden-set + scoring on every prompt change)
- Block promote-to-prod if eval regression
- Wire drift alert → KServe canary advance/rollback
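The gating logic in the three steps above can be sketched as a small decision function. Everything here is hypothetical naming for illustration: `EvalResult`, the `max_regression` tolerance, and the action strings are assumptions, and the actual wiring to KFP and KServe is out of scope.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Golden-set scores for the current production model and the candidate."""
    baseline_score: float
    candidate_score: float


def eval_passed(result, max_regression=0.02):
    """The candidate may score at most max_regression below the baseline."""
    return result.candidate_score >= result.baseline_score - max_regression


def canary_action(result, drift_alert, max_regression=0.02):
    """Decide what to do with a canary.

    Returns one of:
      "rollback" - a drift alert fired, send traffic back to the stable revision
      "block"    - the eval suite regressed, do not promote
      "advance"  - evals hold and no drift, shift more traffic to the canary
    """
    if drift_alert:
        return "rollback"
    if not eval_passed(result, max_regression):
        return "block"
    return "advance"


if __name__ == "__main__":
    print(canary_action(EvalResult(0.90, 0.91), drift_alert=False))
    print(canary_action(EvalResult(0.90, 0.85), drift_alert=False))
    print(canary_action(EvalResult(0.90, 0.91), drift_alert=True))
```

Drift takes priority over eval results in this sketch, on the assumption that a live alert is stronger evidence than an offline score. With KServe, a rollback would typically mean setting `canaryTrafficPercent` to 0 on the InferenceService; confirm the exact behaviour against the docs for the installed version.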
2.5 Cost optimization
GPUs are expensive. The main cost levers are spot instances, scale-to-zero, model quantization, and batching.
Investigate:
- Compare cost: g5.xlarge on-demand vs spot vs reserved
- Quantize Llama 3.2 1B to AWQ; compare quality + cost
- Scale-to-zero with Knative for low-traffic models
- Cost-per-thousand-tokens calculation; bill back to users via the dashboard
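The cost-per-thousand-tokens arithmetic and the bill-back split can be sketched as below. The function names are hypothetical, and the prices and throughput in the example are made-up inputs, not measurements.

```python
def cost_per_1k_tokens(hourly_price_usd, tokens_per_second):
    """Cost of 1,000 generated tokens on an instance billed by the hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1000


def bill_back(total_cost_usd, tokens_by_user):
    """Split a bill across users in proportion to the tokens each consumed."""
    total_tokens = sum(tokens_by_user.values())
    if total_tokens == 0:
        return {user: 0.0 for user in tokens_by_user}
    return {
        user: total_cost_usd * tokens / total_tokens
        for user, tokens in tokens_by_user.items()
    }


if __name__ == "__main__":
    # Illustrative inputs only: $0.36/hr at a sustained 100 tokens/s.
    print(cost_per_1k_tokens(0.36, 100))
    print(bill_back(10.0, {"alice": 750, "bob": 250}))
```

Note that the per-token cost assumes the GPU is saturated at the measured throughput; idle time raises the real figure, which is why scale-to-zero and batching show up in the same list.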
3. TRADE-OFFS
| Decision | Option A | Option B | Option C |
|---|---|---|---|
| GPU access | Cloud on-demand | Cloud spot (70% off, interruptible) | Reserved |
| Sharing | Dedicated | Time-slicing | MIG |
| Quantization | None (FP16) | 8-bit | 4-bit (AWQ/GPTQ) |
| Serving runtime | vLLM | TGI | TensorRT-LLM |
| Drift detection | KS-test (statistical) | Embedding-distance | Output-quality |
4. TOOLS (as of 2025-10)
- NVIDIA Device Plugin (K8s GPU exposure)
- NVIDIA GPU Operator (full GPU stack on K8s)
- AWS / GCP GPU instances (e.g. AWS g5 with A10G GPUs; T4-backed instances on either cloud)
- AWQ / GPTQ (quantization)
- Evidently AI or Alibi Detect (drift libraries)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Chip Huyen “Designing Machine Learning Systems” Ch. 8-9 (Distribution Shifts + Continual Learning) | The drift discipline |
| NVIDIA Device Plugin docs | The implementation |
| AWQ / GPTQ papers | Quantization theory |
5.2 Operational depth checklist
[ ] EKS cluster with 1 GPU spot node
[ ] NVIDIA Device Plugin installed
[ ] Schedule a Pod with `nvidia.com/gpu: 1`; verify nvidia-smi
[ ] Configure time-slicing; run 4 inference pods on 1 GPU
[ ] Quantize Llama 3.2 1B to AWQ; benchmark vs FP16
[ ] Add drift detection to llm-gateway (KS-test on input embeddings)
[ ] Build ML CI: eval suite blocks bad model promotion
[ ] Configure auto-rollback on drift alert (KServe canary reverse)
[ ] Set up GPU cost dashboard (per-cluster, per-model, per-user)
[ ] mlship v0.5: add KServe + cloud (Fargate or Cloud Run) deploy targets
[ ] Document the GPU operational guide in ops-handbook
[ ] Destroy the GPU instance at end of phase; verify $0 ongoing
5.3 llm-gateway v1.5 (production-shaped)
Final form for Y4:
services/llm-gateway/ v1.5:
  + Drift detection (input embeddings KS-test, output quality monitoring)
  + Auto-rollback on drift alert (revert KServe canary)
  + Quantization-aware deployment (route AWQ vs FP16 by load)
  + Cost-aware routing (cheap model when SLO budget allows)
  + Eval suite gating prompt changes
This is what closes the P21 → P24 → P25 arc. v0 was scaffold, v1 was production for happy paths, v1.5 is production for the unhappy paths: the model degrading silently, the canary going wrong, the prompt change that regresses on a corner case. Once those failure modes have explicit handling, the gateway is operationally complete for Y4.
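One possible reading of the two routing features (AWQ vs FP16 by load, cheap model when the SLO budget allows) is sketched below. This is an assumption about the policy, not the gateway's actual logic: the function name, the thresholds, and the idea of expressing SLO budget as a remaining error-budget fraction are all hypothetical.

```python
def choose_variant(in_flight_requests, error_budget_remaining,
                   load_threshold=16, min_budget=0.25):
    """Pick a model variant for the next request.

    in_flight_requests: current concurrent requests on the gateway.
    error_budget_remaining: fraction (0.0-1.0) of the SLO error budget left.
    Returns "awq" (cheaper, quantized) or "fp16" (full precision).
    All thresholds are illustrative assumptions.
    """
    # Under heavy load, the quantized model keeps latency and memory in check.
    if in_flight_requests >= load_threshold:
        return "awq"
    # With plenty of budget left, the cheaper model is an acceptable default.
    if error_budget_remaining >= min_budget:
        return "awq"
    # Budget is nearly spent and there is spare capacity: favour quality.
    return "fp16"


if __name__ == "__main__":
    print(choose_variant(in_flight_requests=32, error_budget_remaining=0.1))
    print(choose_variant(in_flight_requests=2, error_budget_remaining=0.9))
    print(choose_variant(in_flight_requests=2, error_budget_remaining=0.1))
```

Keeping the policy a pure function makes it easy to unit-test and to log alongside each routed request, which helps when a quantization-quality regression needs investigating.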
5.4 mlship v0.5
mlship v0.5 (still PRIVATE; v2 capstone in Y5):
  + sklearn auto-detect (v0)
  + pytorch auto-detect (added)
  + KServe deploy target (added)
  + AWS Fargate deploy target (added; preview only with cloud spend)
  + Better error messages
  + 80% test coverage
The Year 5 capstone (P30) adds: HuggingFace + ONNX + TF detection, vLLM routing for LLMs, GCP Cloud Run, polished docs, demo video, Show HN launch. Full version arc and OSS launch plan: the mlship project plan.
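Framework auto-detection can be done without importing either library, by inspecting where a model's class hierarchy was defined. The sketch below is one generic way to do it and is not taken from mlship itself; `detect_framework` and its return values are hypothetical.

```python
def detect_framework(model):
    """Guess the ML framework of a loaded model object.

    Walks the class hierarchy and looks at the top-level package each class
    was defined in, so a user-defined subclass of torch.nn.Module is still
    recognised. Returns "sklearn", "pytorch", or "unknown".
    """
    for cls in type(model).__mro__:
        root_package = cls.__module__.split(".")[0]
        if root_package == "sklearn":
            return "sklearn"
        if root_package == "torch":
            return "pytorch"
    return "unknown"


if __name__ == "__main__":
    # Stand-in classes so the example runs without sklearn or torch installed.
    FakeEstimator = type("LinearRegression", (),
                         {"__module__": "sklearn.linear_model._base"})
    print(detect_framework(FakeEstimator()))
    print(detect_framework({"not": "a model"}))
```

Checking the class hierarchy rather than a file extension avoids unpickling surprises at the detection step, though the model still has to be loaded first, which carries its own trust considerations for pickled files.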
6. COMPARE: cloud GPU vs Colab vs RunPod
Homelab can’t have meaningful GPUs. Beyond AWS/GCP, you have RunPod, Lambda Labs, and Vast.ai (a marketplace for renting GPUs from individual hosts).
Compare for a “fine-tune Phi-3 on your weekly logs” hypothetical:
- Cost
- Spin-up latency
- Persistence
- Ops ergonomics
400 words.
7. OPERATE
- 4+ runbooks (gpu-node-failure, model-rollback-via-canary, quantization-quality-regression, drift-alert-investigation)
- 2+ postmortems (Y4 capstone: biggest learnings)
- Year 4 PR shipped
- Weekly log
8. CONTRIBUTE
vLLM, NVIDIA Device Plugin, KServe, Evidently AI.
Validation criteria (= Year 4 Final Exam readiness)
[ ] All 12 operational depth checks
[ ] Cloud GPU work demonstrated end-to-end
[ ] llm-gateway v1.5 with drift + auto-rollback + quantization-aware
[ ] mlship v0.5 working (sklearn + pytorch; Docker + KServe + Fargate targets)
[ ] All Year 4 patterns DEEP:
    - model-lifecycle, train-serve-skew, feature-store, inference-shapes, rag-as-pattern
[ ] 4+ runbooks; 2+ postmortems; 8+ weekly log entries
[ ] Year 4 cloud spend: <$50
[ ] Year 4 Final Exam passed
Year 4 Final Exam (full day)
See final-exam.md for the full spec — 8 hours, 3 parts: build (Kubeflow + Ray + KServe end-to-end), triple incident, design review.
Year 4 graduation
You can:
- Operate the ML platform end-to-end (train + register + serve + monitor + retrain)
- Deploy LLM infrastructure (vLLM + RAG + vector DB) production-shaped
- Manage GPU resources + cost
- Detect + respond to model drift via canary auto-rollback
- Ship OSS in the ML space (llm-gateway in basecamp; mlship v0.5)
- Run personal RAG over your own writing (notes-rag dogfoods llm-gateway)
Exit ramp: ML Platform Engineer / AI Infrastructure Engineer
Confidence: ~40 patterns DEEP, ML platform operational, llm-gateway in production, mlship v0.5 ready for Y5 capstone polish + launch
The bridge from here into Year 5 is the operator → architect transition described in the Master Plan. Y4 finishes the engine; Y5 builds the agents that drive it. The patterns you’ll need next (the agents category) start as OUTLINE entries that Year 5 promotes to DEEP, the same ladder Y4 just walked.
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Buying a homelab GPU | $1000-3000 for what cloud spot gives you for $50 across the year |
| Drift detection without rollback | Detecting + paging is half the value |
| Quantization without quality eval | “It’s faster!” until users notice it’s worse |
| GPU running idle | $$$/hour even when zero traffic; scale-to-zero |
| Cloud GPU forgotten over weekend | $50 lesson |
Patterns deepened this phase
- All Year 4 patterns reach DEEP. By P25 end, basecamp’s ML/AI layer is operational + observable + drift-aware. See the ml-and-ai pattern category for the full list.
→ Next: Year 4 Final Exam