GPU Infrastructure + Production (Year 4 Capstone)

Final phase of Year 4. GPU scheduling patterns + production-grade llm-gateway (drift detection + auto-rollback) + mlship v0.5. Year 4 synthesis. ~8 weeks, ~95 hrs.

Where this phase sits

P25 is the Year 4 capstone. It does three things at once: it isolates GPU scheduling as its own pattern (the Year 4 overview explains why GPU isn’t a sub-topic of cloud), it hardens llm-gateway from v1 to v1.5 by closing the drift-detection and auto-rollback loop, and it advances mlship from v0 to v0.5 with PyTorch detection and a Fargate target. Each of those alone would be a small phase. Together they’re the synthesis that earns the Year 4 graduation badge.

The cloud spend constraint is real and not optional: $20-50, destroy-at-end-of-session. Buying a homelab GPU is the wrong move at this stage of the homelab/hardware plan — the ratio of “money spent” to “patterns learned” is much better with a few hours of cloud spot than with a depreciating $1500 card. The point of P25 is GPU-pattern fluency, not GPU mastery. You’ll never have a GPU fleet at home. You’ll demonstrate enough operational depth on cloud spot to interview convincingly, and that’s the right bar for an ML Platform / AI Infrastructure Engineer exit ramp.

P25 closes every Y4 pattern at DEEP: model-lifecycle (drift + auto-rollback complete the retrain side), train-serve-skew (revisited at production scale via input-distribution monitoring), feature-store, inference-shapes, rag-as-pattern — all operational, all observable, all drift-aware. By the end of this phase, you stop being someone who deploys ML and start being someone who operates an ML platform. That’s the operator → architect inflection Year 5 is built on.

Prerequisites

Phase 24 complete — llm-gateway v1 + notes-rag operational

Cloud GPU spend budget ready (~$20-50 total this phase)

You accept: homelab can’t have real GPUs at meaningful scale. This phase masters GPU patterns via cloud spot + benchmark + production-shape understanding.

Why this phase exists

Year 4 exit is ML Platform Engineer / AI Infrastructure Engineer. That role manages GPU fleets, multi-tenant inference, cost. You won’t have a GPU fleet in the homelab, but you’ll do meaningful work on cloud GPUs to understand the patterns.

This phase synthesizes Year 4 + ships the hardened llm-gateway v1.5 + grows mlship to v0.5.

1. PROBLEM

GPUs are often 10-100x faster than CPUs for ML workloads, but they are expensive (roughly $0.30-5/hr in the cloud). Multi-tenant access requires scheduling, sharing, and cost attribution. Production LLM serving needs careful GPU memory management. Models drift in production; you need detection plus safe rollback.

2. PRINCIPLES

2.1 GPU scheduling in K8s

K8s schedules GPUs through the nvidia.com/gpu extended resource. The NVIDIA Device Plugin advertises each node's GPUs to the kubelet.

Investigate:

Spin up an EKS cluster with 1 GPU node (g5.xlarge spot)
Install NVIDIA Device Plugin
Schedule a Pod requesting nvidia.com/gpu: 1
Verify CUDA via nvidia-smi from inside the pod
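
The scheduling step above can be sketched as a Pod manifest built in Python and printed as JSON, which kubectl accepts alongside YAML. This is a minimal sketch, not a required layout: the pod name, the container name, and the CUDA image tag are assumptions chosen for illustration. The one part that comes from the text is the `nvidia.com/gpu: 1` limit.

```python
import json


def gpu_smoke_test_pod(name="gpu-smoke-test",
                       image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                       gpus=1):
    """Build a Pod manifest that requests GPUs and runs nvidia-smi once.

    The name and image defaults are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run once and stop; this is a smoke test, not a service.
            "restartPolicy": "Never",
            "containers": [{
                "name": "cuda",
                "image": image,
                "command": ["nvidia-smi"],
                # GPUs are requested under limits, like any extended resource.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_smoke_test_pod(), indent=2))
```

Assuming the file is saved as pod.py, piping its output into `kubectl apply -f -` and then reading the pod's logs should show the nvidia-smi table if the device plugin is working.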

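2.2 GPU sharing strategies

Time-slicing: multiple pods take turns on one GPU; there is no memory or fault isolation between them, and context switches add latency
MPS (Multi-Process Service): spatial sharing for compatible workloads
MIG (Multi-Instance GPU): hardware partitioning on A100/H100-class GPUs (not available on the A10G in a g5.xlarge)
vGPU: NVIDIA-licensed virtualization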

Investigate:

Configure time-slicing in K8s device plugin
Run 4 small inference pods sharing 1 GPU
Measure throughput vs dedicated
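
The experiment above can be sketched in two parts: the time-slicing stanza that tells the device plugin to advertise one physical GPU as several schedulable replicas, and a small helper that compares shared throughput against a dedicated baseline. The stanza follows the device plugin's documented `sharing.timeSlicing` shape. The helper names and the requests-per-second numbers are hypothetical, and how the config reaches the plugin (typically a ConfigMap) is left out.

```python
import json


def time_slicing_config(replicas=4):
    """Device-plugin config advertising each GPU as `replicas` schedulable GPUs."""
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}],
            },
        },
    }


def sharing_report(dedicated_rps, shared_rps_per_pod):
    """Compare pods sharing one GPU against one pod owning the whole GPU.

    dedicated_rps: throughput measured with a single pod on the GPU.
    shared_rps_per_pod: throughput measured in each pod while sharing.
    """
    aggregate = sum(shared_rps_per_pod)
    return {
        "aggregate_rps": aggregate,
        "aggregate_vs_dedicated": aggregate / dedicated_rps,
        "mean_per_pod_vs_dedicated": aggregate / len(shared_rps_per_pod) / dedicated_rps,
    }


if __name__ == "__main__":
    print(json.dumps(time_slicing_config(4), indent=2))
    # Hypothetical measurements, for illustration only.
    print(sharing_report(100.0, [20.0, 30.0, 25.0, 25.0]))
```

An aggregate ratio well below 1.0 would suggest context-switch overhead is eating throughput; a ratio near 1.0 would suggest the small models were not saturating the GPU in the first place.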

2.3 Drift detection in production

→ Pattern: train-serve-skew (revisited at production scale)

Models degrade silently as input distributions shift.

Investigate:

Add input-feature distribution monitoring to llm-gateway (mean, std, KS-test alert)
Add output-quality monitoring (refusal rate, length distribution)
Statistical tests for drift: KS (Kolmogorov-Smirnov), MMD (maximum mean discrepancy), PSI (population stability index)
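
Two of the tests named above are small enough to sketch with the standard library only. This is an illustration of the statistics, not llm-gateway's implementation: the function names are hypothetical, the KS alert uses the large-sample critical-value approximation, and the PSI bins are equal-width over the reference range. In practice a library such as Evidently AI or Alibi Detect would do this work.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    d = 0.0
    for x in ref + cur:  # the gap can only change at observed values
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d


def ks_drift(reference, live, alpha=0.05):
    """Return (statistic, alert) using the asymptotic critical value."""
    n, m = len(reference), len(live)
    d = ks_statistic(reference, live)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return d, d > critical


def psi(reference, live, bins=10, eps=1e-6):
    """Population stability index over equal-width bins of the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]  # avoid log(0)

    p, q = fractions(reference), fractions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    baseline = [float(x) for x in range(100)]
    shifted = [x + 50.0 for x in baseline]
    print(ks_drift(baseline, shifted), psi(baseline, shifted))
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but alert thresholds are worth tuning against the gateway's own traffic.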

2.4 ML CI/CD with auto-rollback

Models flow through staging → production via KFP + KServe canary. If a drift alert fires during the canary, traffic rolls back automatically.

→ Pattern: model-lifecycle (DEEP — closes the retrain loop)

Investigate:

Add eval suite to llm-gateway (golden-set + scoring on every prompt change)
Block promote-to-prod if eval regression
Wire drift alert → KServe canary advance/rollback
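
The gate and the rollback wiring above reduce to two small decisions, sketched here as pure functions. The function names, the 0.02 regression tolerance, and the 25-point canary step are assumptions for illustration. The patch shape relies on KServe's `canaryTrafficPercent` field on the InferenceService predictor; applying the patch to the cluster is left out.

```python
def eval_gate(baseline_score, candidate_score, max_regression=0.02):
    """Return True when the candidate may be promoted.

    Scores come from the golden-set eval; higher is better.
    max_regression is a hypothetical tolerance, not a recommended value.
    """
    return candidate_score >= baseline_score - max_regression


def canary_patch(drift_alert, current_percent, step=25):
    """Build a merge patch for an InferenceService.

    On a drift alert the canary is pinned to 0% so all traffic returns to the
    previous revision; otherwise the canary advances by `step`, capped at 100.
    """
    next_percent = 0 if drift_alert else min(100, current_percent + step)
    return {"spec": {"predictor": {"canaryTrafficPercent": next_percent}}}


if __name__ == "__main__":
    print(eval_gate(0.90, 0.89))   # small dip within tolerance
    print(canary_patch(True, 50))  # drift fired: roll back
    print(canary_patch(False, 50)) # healthy: advance
```

Keeping both decisions as pure functions makes them easy to unit-test in CI before anything touches the cluster.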

2.5 Cost optimization

GPU time is expensive. The main levers are spot instances, scale-to-zero, model quantization, and batching.

Investigate:

Compare cost: g5.xlarge on-demand vs spot vs reserved
Quantize Llama 3.2 1B to AWQ; compare quality + cost
Scale-to-zero with Knative for low-traffic models
Cost-per-thousand-tokens calculation; bill back to users via the dashboard
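
The cost-per-thousand-tokens calculation above is plain arithmetic: hourly price divided by tokens served per hour, times 1000. A minimal sketch follows. The hourly rate, throughput, and usage figures in the example are hypothetical placeholders, not measured or quoted prices, and the function names are illustrative.

```python
def cost_per_1k_tokens(hourly_rate_usd, tokens_per_second, utilization=1.0):
    """Cost in USD of serving 1000 tokens on one instance.

    utilization is the fraction of each hour the GPU spends serving;
    idle time still costs money, so lower utilization raises the unit cost.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate_usd / tokens_per_hour * 1000


def bill_back(usage_tokens_by_user, cost_per_1k):
    """Split cost across users in proportion to the tokens each consumed."""
    return {user: tokens / 1000 * cost_per_1k
            for user, tokens in usage_tokens_by_user.items()}


if __name__ == "__main__":
    # Hypothetical numbers: a $3.60/hr instance serving 1000 tokens/s.
    unit = cost_per_1k_tokens(3.60, 1000)
    print(unit)
    print(bill_back({"alice": 2000, "bob": 500}, unit))
```

Running the same function with on-demand, spot, and reserved hourly rates gives the comparison in the first item, and the utilization parameter shows why an idle GPU makes every token more expensive.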

3. TRADE-OFFS

Decision	Option A	Option B	Option C
GPU access	Cloud on-demand	Cloud spot (up to ~70% off, interruptible)	Reserved
Sharing	Dedicated	Time-slicing	MIG
Quantization	None (FP16)	8-bit	4-bit (AWQ/GPTQ)
Serving runtime	vLLM	TGI	TensorRT-LLM
Drift detection	KS-test (statistical)	Embedding-distance	Output-quality
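
The quantization row can be made concrete with a back-of-envelope estimate of weight memory: parameters times bits per weight, divided by 8 to get bytes. This sketch counts weights only. It ignores the KV cache, activations, and the per-group scales that AWQ/GPTQ store alongside the weights, and the 1e9 parameter count is a nominal stand-in for a "1B" model rather than an exact figure.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    nominal_params = 1_000_000_000  # assumption: nominal size of a "1B" model
    for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gib(nominal_params, bits):.2f} GiB")
```

For the nominal 1B model this works out to about 1.86 GiB at FP16, 0.93 GiB at 8-bit, and 0.47 GiB at 4-bit, which is why quantized variants leave more room for batching on a small GPU.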

4. TOOLS (as of 2025-10)

NVIDIA Device Plugin (K8s GPU exposure)
NVIDIA GPU Operator (full GPU stack on K8s)
AWS / GCP GPU instances (g5 family, T4-backed instances)
AWQ / GPTQ (quantization)
Evidently AI or Alibi Detect (drift libraries)

5. MASTERY

5.1 Reading list

Required	Why
Chip Huyen “Designing ML Systems” Ch. 8-9 (Distribution Shifts + Continual Learning)	The drift discipline
NVIDIA Device Plugin docs	The implementation
AWQ / GPTQ papers	Quantization theory

5.2 Operational depth checklist

[ ] EKS cluster with 1 GPU spot node
[ ] NVIDIA Device Plugin installed
[ ] Schedule a Pod with `nvidia.com/gpu: 1`; verify nvidia-smi
[ ] Configure time-slicing; run 4 inference pods on 1 GPU
[ ] Quantize Llama 3.2 1B to AWQ; benchmark vs FP16
[ ] Add drift detection to llm-gateway (KS-test on input embeddings)
[ ] Build ML CI: eval suite blocks bad model promotion
[ ] Configure auto-rollback on drift alert (KServe canary reverse)
[ ] Set up GPU cost dashboard (per-cluster, per-model, per-user)
[ ] mlship v0.5: add KServe + AWS Fargate deploy targets (GCP Cloud Run waits for the Year 5 capstone)
[ ] Document the GPU operational guide in ops-handbook
[ ] Destroy the GPU instance at end of phase; verify $0 ongoing

5.3 `llm-gateway` v1.5 (production-shaped)

Final form for Y4:

services/llm-gateway/ v1.5:
  + Drift detection (input embeddings KS-test, output quality monitoring)
  + Auto-rollback on drift alert (revert KServe canary)
  + Quantization-aware deployment (route AWQ vs FP16 by load)
  + Cost-aware routing (cheap model when SLO budget allows)
  + Eval suite gating prompt changes

This is what closes the P21 → P24 → P25 arc. v0 was scaffold, v1 was production for happy paths, v1.5 is production for the unhappy paths — the model degrading silently, the canary going wrong, the prompt change that regresses on a corner case. Once those failure modes have explicit handling, the gateway is operationally complete for Y4.
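
The two routing items in the v1.5 list (AWQ vs FP16 by load, cheap model when the SLO budget allows) can be sketched as one selection function. Everything here is hypothetical: the `Variant` fields, the queue-depth threshold, and the example numbers are illustrative assumptions, not llm-gateway's actual routing logic or measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    cost_per_1k_tokens: float  # USD, hypothetical
    p95_latency_ms: float      # observed p95 latency, hypothetical


def choose_variant(variants, latency_slo_ms, queue_depth, max_queue_depth=32):
    """Pick a model variant for the next request.

    Under heavy load, or when nothing meets the SLO, use the fastest variant.
    Otherwise use the cheapest variant whose p95 latency fits the SLO.
    """
    fastest = min(variants, key=lambda v: v.p95_latency_ms)
    if queue_depth >= max_queue_depth:
        return fastest
    within_slo = [v for v in variants if v.p95_latency_ms <= latency_slo_ms]
    if not within_slo:
        return fastest
    return min(within_slo, key=lambda v: v.cost_per_1k_tokens)


if __name__ == "__main__":
    fp16 = Variant("fp16", cost_per_1k_tokens=0.004, p95_latency_ms=200.0)
    awq = Variant("awq", cost_per_1k_tokens=0.002, p95_latency_ms=350.0)
    print(choose_variant([fp16, awq], latency_slo_ms=400, queue_depth=0).name)
    print(choose_variant([fp16, awq], latency_slo_ms=300, queue_depth=0).name)
```

Which variant is actually faster or cheaper depends on the benchmark from the checklist; the function only encodes the policy once those numbers exist.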

5.4 `mlship` v0.5

mlship v0.5 (still PRIVATE; v2 capstone in Y5):
  + sklearn auto-detect (v0)
  + pytorch auto-detect (added)
  + KServe deploy target (added)
  + AWS Fargate deploy target (added — preview only with cloud spend)
  + Better error messages
  + 80% test coverage
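
One way the sklearn and PyTorch auto-detect items could work is to inspect which package a model object's class hierarchy comes from, without importing either framework. This is a hypothetical sketch, not mlship's actual detection code. The function name and the returned labels are assumptions.

```python
def detect_framework(model):
    """Guess the ML framework of a loaded model object.

    Walks the class hierarchy so that user-defined subclasses (for example a
    custom network deriving from a torch base class) are still recognised.
    Returns "sklearn", "pytorch", or None when nothing matches.
    """
    for cls in type(model).__mro__:
        root_package = cls.__module__.split(".")[0]
        if root_package == "sklearn":
            return "sklearn"
        if root_package == "torch":
            return "pytorch"
    return None


if __name__ == "__main__":
    print(detect_framework(object()))  # a plain object matches no framework
```

Checking module names rather than using isinstance keeps the detector importable on machines where only one of the frameworks is installed.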

The Year 5 capstone (P30) adds: HuggingFace + ONNX + TF detection, vLLM routing for LLMs, GCP Cloud Run, polished docs, demo video, Show HN launch. Full version arc and OSS launch plan: the mlship project plan.

6. COMPARE: cloud GPU vs Colab vs RunPod

Homelab can’t have meaningful GPUs. Beyond AWS/GCP, the options include Colab, RunPod, Lambda Labs, and Vast.ai (a marketplace for renting GPUs from individuals).

Compare for a “fine-tune Phi-3 on your weekly logs” hypothetical:

Cost
Spin-up latency
Persistence
Ops ergonomics

Write it up in about 400 words.

7. OPERATE

4+ runbooks (gpu-node-failure, model-rollback-via-canary, quantization-quality-regression, drift-alert-investigation)
2+ postmortems (Y4 capstone — biggest learnings)
Year 4 PR shipped
Weekly log

8. CONTRIBUTE

vLLM, NVIDIA Device Plugin, KServe, Evidently AI.

Validation criteria (= Year 4 Final Exam readiness)

[ ] All 12 operational depth checks
[ ] Cloud GPU work demonstrated end-to-end
[ ] llm-gateway v1.5 with drift + auto-rollback + quantization-aware
[ ] mlship v0.5 working (sklearn + pytorch; Docker + KServe + Fargate targets)
[ ] All Year 4 patterns DEEP:
    - model-lifecycle, train-serve-skew, feature-store, inference-shapes, rag-as-pattern
[ ] 4+ runbooks; 2+ postmortems; 8+ weekly log entries
[ ] Year 4 cloud spend: <$50
[ ] Year 4 Final Exam passed

Year 4 Final Exam (full day)

See final-exam.md for the full spec — 8 hours, 3 parts: build (Kubeflow + Ray + KServe end-to-end), triple incident, design review.

Year 4 graduation

You can:
- Operate the ML platform end-to-end (train + register + serve + monitor + retrain)
- Deploy LLM infrastructure (vLLM + RAG + vector DB) production-shaped
- Manage GPU resources + cost
- Detect + respond to model drift via canary auto-rollback
- Ship OSS in the ML space (llm-gateway in basecamp; mlship v0.5)
- Run personal RAG over your own writing (notes-rag dogfoods llm-gateway)

Exit ramp: ML Platform Engineer / AI Infrastructure Engineer
Confidence: ~40 patterns DEEP, ML platform operational, llm-gateway in production,
            mlship v0.5 ready for Y5 capstone polish + launch

The bridge from here into Year 5 is the operator → architect transition described in the Master Plan. Y4 finishes the engine; Y5 builds the agents that drive it. The patterns you’ll need next (the agents category) start as OUTLINE entries Year 5 promotes to DEEP — the same ladder Y4 just walked.

Anti-patterns

Anti-pattern	Why
Buying a homelab GPU	$1000-3000 for what cloud spot gives you for $50 across the year
Drift detection without rollback	Detecting + paging is half the value
Quantization without quality eval	“It’s faster!” until users notice it’s worse
GPU running idle	Billed every hour even at zero traffic; use scale-to-zero
Cloud GPU forgotten over weekend	$50 lesson

Patterns deepened this phase

All Year 4 patterns reach DEEP. By P25 end, basecamp’s ML/AI layer is operational + observable + drift-aware. See the ml-and-ai pattern category for the full list.

→ Next: Year 4 Final Exam
