5-YEAR PROGRAM · YEAR 4 · PHASE 25
UPCOMING

GPU Infrastructure + Production (Year 4 Capstone)

Final phase of Year 4. GPU scheduling patterns + production-grade llm-gateway (drift detection + auto-rollback) + mlship v0.5. Year 4 synthesis. ~8 weeks, ~95 hrs.


Where this phase sits

P25 is the Year 4 capstone. It does three things at once: it isolates GPU scheduling as its own pattern (the Year 4 overview explains why GPU isn’t a sub-topic of cloud), it hardens llm-gateway from v1 to v1.5 by closing the drift-detection and auto-rollback loop, and it advances mlship from v0 to v0.5 with PyTorch detection and a Fargate target. Each of those alone would be a small phase. Together they’re the synthesis that earns the Year 4 graduation badge.

The cloud spend constraint is real and not optional: $20-50, destroy-at-end-of-session. Buying a homelab GPU is the wrong move at this stage of the homelab/hardware plan — the ratio of “money spent” to “patterns learned” is much better with a few hours of cloud spot than with a depreciating $1500 card. The point of P25 is GPU-pattern fluency, not GPU mastery. You’ll never have a GPU fleet at home. You’ll demonstrate enough operational depth on cloud spot to interview convincingly, and that’s the right bar for an ML Platform / AI Infrastructure Engineer exit ramp.

P25 closes every Y4 pattern at DEEP: model-lifecycle (drift + auto-rollback complete the retrain side), train-serve-skew (revisited at production scale via input-distribution monitoring), feature-store, inference-shapes, rag-as-pattern — all operational, all observable, all drift-aware. By the end of this phase, you stop being someone who deploys ML and start being someone who operates an ML platform. That’s the operator → architect inflection Year 5 is built on.


Prerequisites

  • Phase 24 complete — llm-gateway v1 + notes-rag operational
  • Cloud GPU spend budget ready (~$20-50 total this phase)
  • You accept: homelab can’t have real GPUs at meaningful scale. This phase masters GPU patterns via cloud spot + benchmark + production-shape understanding.

Why this phase exists

Year 4 exit is ML Platform Engineer / AI Infrastructure Engineer. That role manages GPU fleets, multi-tenant inference, cost. You won’t have a GPU fleet in the homelab, but you’ll do meaningful work on cloud GPUs to understand the patterns.

This phase synthesizes Year 4 + ships the hardened llm-gateway v1.5 + grows mlship to v0.5.


1. PROBLEM

GPUs run the dense matrix math behind ML roughly 10-100x faster than CPUs, but they are expensive (about $0.30-5/hr in the cloud). Multi-tenant access requires scheduling, sharing, and cost attribution. Production LLM serving needs careful GPU memory management. Models also drift in production, so you need detection plus safe rollback.


2. PRINCIPLES

2.1 GPU scheduling in K8s

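K8s schedules GPUs through the extended resource nvidia.com/gpu. The NVIDIA Device Plugin runs on each GPU node and advertises that node’s GPUs to the kubelet; a Pod then claims them under resources.limits.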

Investigate:

  • Spin up an EKS cluster with 1 GPU node (g5.xlarge spot)
  • Install NVIDIA Device Plugin
  • Schedule a Pod requesting nvidia.com/gpu: 1
  • Verify CUDA via nvidia-smi from inside the pod
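
The Pod in the third step can be sketched as a small Python generator that emits a manifest for kubectl. This is a minimal sketch, not a reference manifest: the Pod name `cuda-smoke-test` and the image tag are assumptions, and it relies on the NVIDIA container runtime making `nvidia-smi` available inside the container.

```python
import json


def gpu_pod_manifest(name: str = "cuda-smoke-test", gpus: int = 1) -> dict:
    """Build a Pod manifest that claims GPUs via the nvidia.com/gpu extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Run nvidia-smi once and exit instead of restarting forever.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    # Assumed image tag; substitute any CUDA base image you trust.
                    "image": "nvidia/cuda:12.4.1-base-ubuntu22.04",
                    "command": ["nvidia-smi"],
                    # GPUs are requested under limits; the scheduler only places
                    # the Pod on a node that advertises enough nvidia.com/gpu.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML.
    print(json.dumps(gpu_pod_manifest(), indent=2))
```

One way to use it, assuming the script is saved as `gpu_pod.py`: pipe the output into `kubectl apply -f -`, then read the result with `kubectl logs cuda-smoke-test`. If the Pod stays Pending, the node is probably not advertising `nvidia.com/gpu` yet.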

2.2 GPU sharing strategies

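  • Time-slicing: multiple pods take turns on one GPU; no memory or fault isolation between them, and context switches add overhead
  • MPS (Multi-Process Service): spatial sharing, where compatible processes run on the GPU concurrently
  • MIG (Multi-Instance GPU): hardware partitioning on A100/H100-class GPUs, each slice with its own memory and compute
  • vGPU: virtualized GPUs for VMs; requires an NVIDIA license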

Investigate:

  • Configure time-slicing in K8s device plugin
  • Run 4 small inference pods sharing 1 GPU
  • Measure throughput vs dedicated
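
The time-slicing experiment can be sketched in two parts: the sharing config handed to the device plugin, and the arithmetic for comparing shared against dedicated throughput. The config keys follow the device plugin’s sharing format as I understand it, so check them against the version you install. The throughput numbers in the demo are placeholders, not measurements.

```python
import json


def time_slicing_config(replicas: int = 4) -> dict:
    """Sharing config that advertises each physical GPU as `replicas` schedulable GPUs."""
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas per GPU")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }


def sharing_efficiency(dedicated_rps: float, shared_rps_per_pod: list[float]) -> float:
    """Aggregate throughput of the sharing pods as a fraction of one dedicated pod."""
    if dedicated_rps <= 0:
        raise ValueError("dedicated_rps must be positive")
    return sum(shared_rps_per_pod) / dedicated_rps


if __name__ == "__main__":
    # JSON is valid YAML, so this output can go straight into the plugin's ConfigMap.
    print(json.dumps(time_slicing_config(4), indent=2))
    # Placeholder requests-per-second figures for 1 dedicated pod vs 4 sharing pods.
    print(sharing_efficiency(dedicated_rps=100.0, shared_rps_per_pod=[22.0, 21.0, 23.0, 20.0]))
```

A ratio close to 1.0 means sharing costs little aggregate throughput; per-pod latency is a separate measurement and usually the one that hurts.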

2.3 Drift detection in production

→ Pattern: train-serve-skew (revisited at production scale)

Models degrade silently as input distributions shift.

Investigate:

  • Add input-feature distribution monitoring to llm-gateway (mean, std, KS-test alert)
  • Add output-quality monitoring (refusal rate, length distribution)
  • Statistical tests for drift: KS, MMD, population stability index
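
Two of the tests named above are small enough to sketch in plain Python: the two-sample KS statistic with its large-sample critical value, and the population stability index (PSI). This is a sketch for building intuition on one scalar feature, not what llm-gateway ships. The function names are hypothetical, and in practice a library such as Evidently AI or Alibi Detect would do this work.

```python
import bisect
import math


def ks_statistic(reference: list[float], live: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in ref + cur:  # the maximum gap always occurs at an observed value
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def ks_drifted(reference: list[float], live: list[float], alpha: float = 0.05) -> bool:
    """Compare the statistic with the asymptotic critical value at level alpha."""
    n, m = len(reference), len(live)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > critical


def psi(reference: list[float], live: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index over equal-width bins fitted on the reference data."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width when the reference is constant

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = fractions(reference), fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


if __name__ == "__main__":
    baseline = [float(i) for i in range(100)]
    shifted = [x + 50.0 for x in baseline]
    print(ks_statistic(baseline, shifted), ks_drifted(baseline, shifted), psi(baseline, shifted))
```

KS is univariate, so for embeddings it has to run per dimension or on a scalar summary such as distance to a reference centroid; MMD handles the multivariate case directly. A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but the alert threshold is something to calibrate on your own traffic.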

2.4 ML CI/CD with auto-rollback

Models flow through staging → production via KFP + KServe canary. If drift fires, auto-rollback.

→ Pattern: model-lifecycle (DEEP — closes the retrain loop)

Investigate:

  • Add eval suite to llm-gateway (golden-set + scoring on every prompt change)
  • Block promote-to-prod if eval regression
  • Wire drift alert → KServe canary advance/rollback
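
The wiring in the last step reduces to one decision: given the eval scores and the drift signal, what should the canary’s traffic percentage be next? The sketch below is one possible shape, assuming KServe’s `canaryTrafficPercent` field on the predictor spec. The `Signals` type, the 0.02 regression tolerance, and the 20-point step are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Signals:
    baseline_eval: float   # golden-set score of the model currently in production
    candidate_eval: float  # golden-set score of the canary
    drift_alert: bool      # True when the drift monitor has fired


def next_canary_percent(
    signals: Signals,
    current_percent: int,
    max_regression: float = 0.02,  # hypothetical tolerance on the golden-set score
    step: int = 20,                # hypothetical traffic increment per healthy check
) -> int:
    """Return the next canary traffic share: 0 rolls back, 100 is full promotion."""
    if signals.drift_alert:
        return 0  # drift fired: send all traffic back to the previous revision
    if signals.baseline_eval - signals.candidate_eval > max_regression:
        return 0  # eval regression: block promotion
    return min(current_percent + step, 100)


def canary_patch(percent: int) -> str:
    """JSON merge-patch body that sets the canary share on an InferenceService."""
    return json.dumps({"spec": {"predictor": {"canaryTrafficPercent": percent}}})


if __name__ == "__main__":
    healthy = Signals(baseline_eval=0.90, candidate_eval=0.91, drift_alert=False)
    drifting = Signals(baseline_eval=0.90, candidate_eval=0.91, drift_alert=True)
    print(canary_patch(next_canary_percent(healthy, current_percent=10)))
    print(canary_patch(next_canary_percent(drifting, current_percent=50)))
```

The patch body could be applied with `kubectl patch inferenceservice <name> --type merge -p '<body>'`. Keeping the decision in a pure function means the rollback logic can be unit-tested without a cluster.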

2.5 Cost optimization

GPU is expensive. Spot instances, scale-to-zero, model quantization, batching.

Investigate:

  • Compare cost: g5.xlarge on-demand vs spot vs reserved
  • Quantize Llama 3.2 1B with AWQ (4-bit); compare quality + cost against FP16
  • Scale-to-zero with Knative for low-traffic models
  • Cost-per-thousand-tokens calculation; bill back to users via the dashboard
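
The cost-per-thousand-tokens figure is plain arithmetic: hourly instance price divided by tokens served per hour, times 1000. A minimal sketch follows; the function names are hypothetical and the prices and throughput in the demo are placeholders, not quotes or benchmarks.

```python
def cost_per_1k_tokens(hourly_usd: float, tokens_per_second: float, utilization: float = 1.0) -> float:
    """USD per 1000 tokens for an instance billed hourly.

    utilization is the fraction of each billed hour spent serving traffic;
    an idle GPU still costs money, so low utilization raises the unit cost.
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("tokens_per_second must be > 0 and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1000


def bill_back(tokens_by_user: dict[str, int], usd_per_1k: float) -> dict[str, float]:
    """Attribute cost to each user in proportion to the tokens they consumed."""
    return {user: tokens / 1000 * usd_per_1k for user, tokens in tokens_by_user.items()}


if __name__ == "__main__":
    # Placeholder numbers: a $1.00/hr instance sustaining 500 tokens/s, busy 40% of the time.
    rate = cost_per_1k_tokens(hourly_usd=1.00, tokens_per_second=500.0, utilization=0.4)
    print(f"${rate:.6f} per 1k tokens")
    print(bill_back({"alice": 120_000, "bob": 30_000}, rate))
```

Running the same function with on-demand, spot, and reserved hourly prices gives the comparison in the first bullet on a per-token basis rather than per hour.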

3. TRADE-OFFS

| Decision | Option A | Option B | Option C |
| --- | --- | --- | --- |
| GPU access | Cloud on-demand | Cloud spot (~70% off, interruptible) | Reserved |
| Sharing | Dedicated | Time-slicing | MIG |
| Quantization | None (FP16) | 8-bit | 4-bit (AWQ/GPTQ) |
| Serving runtime | vLLM | TGI | TensorRT-LLM |
| Drift detection | KS-test (statistical) | Embedding-distance | Output-quality |

4. TOOLS (as of 2025-10)

  • NVIDIA Device Plugin (K8s GPU exposure)
  • NVIDIA GPU Operator (full GPU stack on K8s)
  • AWS / GCP GPU instances (AWS g5 with A10G; T4-based instances on either cloud)
  • AWQ / GPTQ (quantization)
  • Evidently AI or Alibi Detect (drift libraries)

5. MASTERY

5.1 Reading list

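| Required | Why |
| --- | --- |
| Chip Huyen, “Designing Machine Learning Systems”, Ch. 8-9 (Data Distribution Shifts and Monitoring; Continual Learning and Test in Production) | The drift discipline |
| NVIDIA Device Plugin docs | The implementation |
| AWQ / GPTQ papers | Quantization theory |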

5.2 Operational depth checklist

[ ] EKS cluster with 1 GPU spot node
[ ] NVIDIA Device Plugin installed
[ ] Schedule a Pod with `nvidia.com/gpu: 1`; verify nvidia-smi
[ ] Configure time-slicing; run 4 inference pods on 1 GPU
[ ] Quantize Llama 3.2 1B with AWQ; benchmark vs FP16
[ ] Add drift detection to llm-gateway (KS-test on input embeddings)
[ ] Build ML CI: eval suite blocks bad model promotion
[ ] Configure auto-rollback on drift alert (KServe canary reverse)
[ ] Set up GPU cost dashboard (per-cluster, per-model, per-user)
[ ] mlship v0.5: add PyTorch auto-detect + KServe and AWS Fargate deploy targets (Cloud Run waits for the Y5 capstone)
[ ] Document the GPU operational guide in ops-handbook
[ ] Destroy the GPU instance at end of phase; verify $0 ongoing

5.3 llm-gateway v1.5 (production-shaped)

Final form for Y4:

services/llm-gateway/ v1.5:
+ Drift detection (input embeddings KS-test, output quality monitoring)
+ Auto-rollback on drift alert (revert KServe canary)
+ Quantization-aware deployment (route AWQ vs FP16 by load)
+ Cost-aware routing (cheap model when SLO budget allows)
+ Eval suite gating prompt changes

This is what closes the P21-P24 → P25 arc. v0 was the scaffold, v1 was production for happy paths, and v1.5 is production for the unhappy paths: the model degrading silently, the canary going wrong, the prompt change that regresses on a corner case. Once those failure modes have explicit handling, the gateway is operationally complete for Y4.
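
The cost-aware routing item can be sketched as a pure selection function: pick the cheapest backend whose observed p95 latency still fits the request’s latency budget, and fall back to the fastest one when nothing fits. Everything here is hypothetical, including the `Backend` type, the backend names, and the numbers, which are placeholders rather than measurements of AWQ or FP16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    usd_per_1k_tokens: float
    p95_latency_ms: float  # refreshed from live metrics, so it rises under load


def pick_backend(backends: list[Backend], latency_budget_ms: float) -> Backend:
    """Cheapest backend that fits the latency budget; fastest one if none fits."""
    if not backends:
        raise ValueError("no backends configured")
    eligible = [b for b in backends if b.p95_latency_ms <= latency_budget_ms]
    if not eligible:
        # Nothing meets the SLO: degrade gracefully by minimizing latency.
        return min(backends, key=lambda b: b.p95_latency_ms)
    return min(eligible, key=lambda b: b.usd_per_1k_tokens)


if __name__ == "__main__":
    # Placeholder figures chosen only to exercise both branches.
    fleet = [
        Backend("fp16", usd_per_1k_tokens=0.0020, p95_latency_ms=300.0),
        Backend("awq-4bit", usd_per_1k_tokens=0.0008, p95_latency_ms=450.0),
    ]
    for budget in (500.0, 350.0, 100.0):
        print(budget, pick_backend(fleet, budget).name)
```

Because the latency field is meant to track live metrics, the same function also covers routing by load: as one backend saturates, its p95 climbs and it drops out of the eligible set.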

5.4 mlship v0.5

mlship v0.5 (still PRIVATE; v2 capstone in Y5):
+ sklearn auto-detect (v0)
+ pytorch auto-detect (added)
+ KServe deploy target (added)
+ AWS Fargate deploy target (added — preview only with cloud spend)
+ Better error messages
+ 80% test coverage

The Year 5 capstone (P30) adds: HuggingFace + ONNX + TF detection, vLLM routing for LLMs, GCP Cloud Run, polished docs, demo video, Show HN launch. Full version arc and OSS launch plan: the mlship project plan.
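
To make “auto-detect” and “better error messages” concrete, here is one way framework detection could work: map the model file’s extension to a framework and fail with a message that lists what is supported. This is a hypothetical sketch. The document does not describe mlship’s real detection logic, and a pickle file is not guaranteed to hold a scikit-learn model.

```python
from pathlib import Path

# Hypothetical mapping for illustration; not taken from mlship itself.
EXTENSION_TO_FRAMEWORK = {
    ".pkl": "sklearn",
    ".joblib": "sklearn",
    ".pt": "pytorch",
    ".pth": "pytorch",
}


def detect_framework(model_path: str) -> str:
    """Guess the framework from the file extension, case-insensitively."""
    suffix = Path(model_path).suffix.lower()
    try:
        return EXTENSION_TO_FRAMEWORK[suffix]
    except KeyError:
        supported = ", ".join(sorted(EXTENSION_TO_FRAMEWORK))
        # Tell the user what was seen and what would have worked.
        raise ValueError(
            f"Cannot detect framework for {model_path!r} (extension {suffix!r}); "
            f"supported extensions: {supported}"
        ) from None


if __name__ == "__main__":
    for path in ("models/churn.joblib", "models/resnet.PT"):
        print(path, "->", detect_framework(path))
```

A sturdier detector would open the file and inspect the object it contains, but the extension check is cheap and gives the error message something specific to report.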


6. COMPARE: cloud GPU vs Colab vs RunPod

Homelab can’t have meaningful GPUs. Beyond AWS/GCP, the options include Colab, RunPod, Lambda Labs, and Vast.ai (a marketplace where you rent GPUs from individual hosts).

Compare for a “fine-tune Phi-3 on your weekly logs” hypothetical:

  • Cost
  • Spin-up latency
  • Persistence
  • Ops ergonomics

Write it up in about 400 words.


7. OPERATE

  • 4+ runbooks (gpu-node-failure, model-rollback-via-canary, quantization-quality-regression, drift-alert-investigation)
  • 2+ postmortems (Y4 capstone — biggest learnings)
  • Year 4 PR shipped
  • Weekly log

8. CONTRIBUTE

vLLM, NVIDIA Device Plugin, KServe, Evidently AI.


Validation criteria (= Year 4 Final Exam readiness)

[ ] All 12 operational depth checks
[ ] Cloud GPU work demonstrated end-to-end
[ ] llm-gateway v1.5 with drift + auto-rollback + quantization-aware
[ ] mlship v0.5 working (sklearn + pytorch; Docker + KServe + Fargate targets)
[ ] All Year 4 patterns DEEP:
- model-lifecycle, train-serve-skew, feature-store, inference-shapes, rag-as-pattern
[ ] 4+ runbooks; 2+ postmortems; 8+ weekly log entries
[ ] Year 4 cloud spend: <$50
[ ] Year 4 Final Exam passed

Year 4 Final Exam (full day)

See final-exam.md for the full spec — 8 hours, 3 parts: build (Kubeflow + Ray + KServe end-to-end), triple incident, design review.


Year 4 graduation

You can:
- Operate the ML platform end-to-end (train + register + serve + monitor + retrain)
- Deploy LLM infrastructure (vLLM + RAG + vector DB) production-shaped
- Manage GPU resources + cost
- Detect + respond to model drift via canary auto-rollback
- Ship OSS in the ML space (llm-gateway in basecamp; mlship v0.5)
- Run personal RAG over your own writing (notes-rag dogfoods llm-gateway)
Exit ramp: ML Platform Engineer / AI Infrastructure Engineer
Confidence: ~40 patterns DEEP, ML platform operational, llm-gateway in production,
mlship v0.5 ready for Y5 capstone polish + launch

The bridge from here into Year 5 is the operator → architect transition described in the Master Plan. Y4 finishes the engine; Y5 builds the agents that drive it. The patterns you’ll need next (the agents category) start as OUTLINE entries that Year 5 promotes to DEEP, the same ladder Y4 just walked.


Anti-patterns

| Anti-pattern | Why |
| --- | --- |
| Buying a homelab GPU | $1000-3000 for what cloud spot gives you for $50 across the year |
| Drift detection without rollback | Detecting + paging is half the value |
| Quantization without quality eval | “It’s faster!” until users notice it’s worse |
| GPU running idle | Billed every hour even at zero traffic; use scale-to-zero |
| Cloud GPU forgotten over weekend | $50 lesson |

Patterns deepened this phase

  • All Year 4 patterns reach DEEP. By P25 end, basecamp’s ML/AI layer is operational + observable + drift-aware. See the ml-and-ai pattern category for the full list.

→ Next: Year 4 Final Exam