Final Exam

Full-day (~8 hour) scenario combining all 6 phases of Year 4. Pass = ready for Year 5 (ML Platform / AI Infrastructure Engineer exit ramp credible; Staff/Principal AI Platform Engineer trajectory in reach).

The Year 4 Final Exam audits whether the data platform from Year 3 actually grew an ML/AI layer that you can operate at production rigor — not “I followed the tutorial and got an inference response.” One end-to-end ML pipeline (train → register → deploy → canary → drift → auto-rollback) built in 180 minutes. Three independent incidents from the messy parts of ML (cold start, train/serve skew, vLLM OOM, drift sensitivity, GPU thrash) in 120 minutes. One design review where the answer to “should we buy a homelab GPU?” is almost certainly no, and your job is to explain why with numbers and alternatives.

What’s measured is pattern fluency at the ML/AI infrastructure layer — can you reason about model-lifecycle, inference-shapes (online vs batch vs streaming), feature-store train/serve skew, rag-as-pattern, and the GPU economics every ML platform engineer eventually defends in a budget review? Two engineers can both wire MLflow → KServe; only one can explain what trade-off was made and why, defend a canary that’s flapping, and recognize when “drift” is a sensitivity miscalibration rather than a real distribution shift. The exam audits the second one.

This is the inflection toward Staff. Y3’s exam graduates you toward Senior; Y4’s exam graduates you toward a Staff/Principal trajectory. The design review part is where that shift is most visible: Senior reviewers identify the issues; Staff reviewers propose the smaller alternative that gets 80% of the value, with cost numbers from their own work.

When to take

After the Phase 25 validation criteria are green and services/llm-gateway/ v1.5 has been moving real (homelab-scale) traffic for at least two weeks with drift + auto-rollback active. Schedule ~2 weeks ahead so mlship v0.5 is stable, the train→deploy composition recipe runs end-to-end without hand-holding, and notes-rag has produced at least one real “huh, that retrieval is bad” moment you can cite.

Setup

basecamp Tiers 1-7 all operational
services/llm-gateway/ v1.5 running with drift + auto-rollback
mlship v0.5 working (Docker + KServe + Fargate)
notes-rag personal service running
Train→deploy + Personal-RAG composition recipes runnable
8 hours uninterrupted
The root-exam skill (or solo + this doc)

Format

3 parts, ~8 hours total:
  Part 1: Build               (180 min)
  Part 2: Triple incident     (120 min)
  Part 3: Design review       (120 min)

Part 1: Build (180 min)

“Train a model on Year 3 Iceberg data via Kubeflow Pipelines + Ray; register in MLflow; deploy to KServe via mlship with canary; add drift detection; verify end-to-end through services/llm-gateway/. Cloud GPU optional but encouraged.”

Required deliverables:

[ ] KFP pipeline (define in Python; compile to Argo Workflow)
[ ] Step 1: extract from Iceberg with PIT-correct Feast features
[ ] Step 2: Ray-distributed training (sklearn or PyTorch)
[ ] Step 3: eval suite; gate promotion on threshold
[ ] Step 4: register in MLflow as Staging
[ ] Step 5: mlship deploy --to k8s with canary 90/10
[ ] Step 6: drift detection wired (KS-test on input embeddings or features)
[ ] Step 7: auto-rollback configured (canary reverts on drift alert)
[ ] Pipeline runnable via platform-ctl pipeline run train-deploy
[ ] Lineage visible in DataHub end-to-end
[ ] Observability: traces, logs, metrics correlated for one inference request through llm-gateway
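
Step 1 asks for PIT-correct (point-in-time) features. As a reminder of what that means, here is a minimal pure-Python sketch of the join: each training event only sees the latest feature value recorded at or before the event time. The function name `point_in_time_join` and the tuple layouts are hypothetical, and the sketch ignores feature TTLs; in the exam, Feast's historical retrieval performs this kind of join for you.

```python
from bisect import bisect_right


def point_in_time_join(events, feature_rows):
    """Attach to each event the latest feature value with ts <= event_ts.

    events:       iterable of (entity_id, event_ts)
    feature_rows: iterable of (entity_id, feature_ts, value)
    Returns a list of (entity_id, event_ts, value_or_None).
    """
    # Group feature rows per entity, sorted by timestamp.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        timestamps, values = by_entity.setdefault(entity_id, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for entity_id, event_ts in events:
        timestamps, values = by_entity.get(entity_id, ([], []))
        # Index of the last feature row at or before the event time.
        idx = bisect_right(timestamps, event_ts) - 1
        joined.append((entity_id, event_ts, values[idx] if idx >= 0 else None))
    return joined
```

An event that predates every feature row gets `None` rather than a "future" value. A join that silently fills in the later value is one way leakage sneaks into a training set.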

Pass criteria:

Pipeline succeeds end-to-end
Canary advances correctly under success-rate threshold
Drift simulation triggers auto-rollback within 5 min
Lineage visible from Iceberg → MLflow → KServe → llm-gateway
0 manual kubectl apply -f; everything via basecamp + platform-ctl

What passing looks like: the pipeline run is observable end-to-end, the canary’s reversal is logged with a timestamp and a reason, and the writeup of “why this canary threshold and not its neighbor” is something you could hand to a Y5 student as a study artifact. The pipeline isn’t just “it worked once”; it’s reproducible from platform-ctl and reviewable from a basecamp PR.
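
For the drift and auto-rollback deliverables, the core decision can be sketched in a few lines. This is a stdlib-only illustration of a two-sample KS test feeding a rollback decision, not the mlship or llm-gateway implementation. The names `ks_statistic`, `drift_detected`, and `next_canary_weight`, and the `alpha=0.01` default, are assumptions. In practice you would likely reach for `scipy.stats.ks_2samp` instead of hand-rolling the statistic.

```python
import math


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both samples past x, then compare the CDFs at x.
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def drift_detected(reference, current, alpha=0.01):
    """Reject 'same distribution' using the large-sample KS critical value."""
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


def next_canary_weight(current_weight, reference, current, alpha=0.01):
    """Return 0 (all traffic back to stable) on drift, else keep the weight."""
    return 0 if drift_detected(reference, current, alpha) else current_weight
```

Whatever the real wiring looks like, the writeup should be able to say which of these three pieces (statistic, threshold, action) each config value controls.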

Part 2: Triple incident (120 min)

Three scenarios, ~40 min each. The root-exam skill picks them deterministically: one each from the ML serving, feature store, and LLM-and-GPU buckets.

Sample scenarios

From Phase 21 (ML Serving)

Cold start > 60s; investigate via OTel + Loki + KServe metrics; tune
Canary rollout stuck (success rate flapping); identify whether real or noisy SLO
Ray cluster lost head node; recover from checkpoint
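
For the flapping-canary scenario, one way to separate a real regression from a noisy SLO is to put a confidence interval around the observed success rate before acting on it. The sketch below uses the Wilson score interval. The function names, the 0.99 SLO, and the three-way verdict are assumptions for illustration, not the canary controller's actual logic.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def canary_verdict(successes, total, slo=0.99):
    """Advance or roll back only when the interval clears the SLO on one side."""
    low, high = wilson_interval(successes, total)
    if low >= slo:
        return "advance"
    if high < slo:
        return "rollback"
    return "hold"  # interval straddles the SLO: not enough evidence either way
```

With 98 successes out of 100 requests the interval straddles 0.99, so the verdict is "hold". A canary at 10% of homelab-scale traffic may simply not see enough requests per evaluation window to decide, which looks like flapping.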

From Phase 22 (Feature Store)

Model accuracy dropped after Feast schema change; trace via lineage
Feast online store stale (Redis); find materialization breakage
Train/serve skew: features differ between training and inference
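
A train/serve skew investigation usually comes down to comparing, for the same entities, the feature values the model was trained on against the values it receives at inference time. A minimal sketch of that comparison, assuming both sides have already been pulled into dicts keyed by entity id (the `find_skew` name and the data shapes are hypothetical):

```python
import math


def find_skew(training_rows, serving_rows, rel_tol=1e-6, abs_tol=1e-9):
    """Return (entity_id, feature, train_value, serve_value) for each mismatch.

    Both arguments map entity_id -> {feature_name: value}. Only entities
    present on both sides are compared; a feature missing on one side
    shows up as None.
    """
    mismatches = []
    for entity_id in sorted(training_rows.keys() & serving_rows.keys()):
        train, serve = training_rows[entity_id], serving_rows[entity_id]
        for name in sorted(train.keys() | serve.keys()):
            t, s = train.get(name), serve.get(name)
            if isinstance(t, (int, float)) and isinstance(s, (int, float)):
                # Tolerate float round-off between the offline and online stores.
                same = math.isclose(t, s, rel_tol=rel_tol, abs_tol=abs_tol)
            else:
                same = t == s
            if not same:
                mismatches.append((entity_id, name, t, s))
    return mismatches
```

The interesting part of the incident is what the mismatches have in common (one feature, one time range, one transformation), since that points at where the two code paths diverged.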

From Phase 24 (LLM Infra)

vLLM OOM under modest load; tune batching params + quantize
RAG returning hallucinations; investigate chunking / re-ranking / hybrid weights
llm-gateway streaming stuck; find backpressure / SSE buffer issue
Cost spike: one user generated 10M tokens overnight; detect + rate-limit
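
For the cost-spike scenario, the "rate-limit" half of the fix can be as simple as a per-user token budget over a fixed window. The class below is a sketch under assumed names and limits; it is not how llm-gateway implements limits, and a real deployment would keep the counters in shared storage rather than process memory.

```python
from collections import defaultdict


class TokenBudget:
    """Per-user token budget over a fixed window (limits are hypothetical)."""

    def __init__(self, limit_tokens, window_seconds):
        self.limit = limit_tokens
        self.window = window_seconds
        self._window_start = {}
        self._used = defaultdict(int)

    def allow(self, user, tokens, now):
        """Return True and record usage if the request fits the budget.

        `now` is a timestamp in seconds, passed in explicitly so the
        behaviour is easy to test.
        """
        start = self._window_start.get(user)
        if start is None or now - start >= self.window:
            # First request, or the previous window has expired: reset.
            self._window_start[user] = now
            self._used[user] = 0
        if self._used[user] + tokens > self.limit:
            return False
        self._used[user] += tokens
        return True
```

The "detect" half is the same counter exported as a metric: an alert on per-user tokens per window would have caught the overnight run before the invoice did.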

From Phase 25 (GPU + Drift)

GPU node OOM; identify whether memory leak or quantization regression
Drift alert fires but model is fine; tune drift sensitivity
Auto-rollback didn’t fire when it should have; debug the canary advance logic
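
For the "drift alert fires but model is fine" scenario, one common cause is alerting on a p-value: with large windows, even a negligible shift is statistically significant. A hedged sketch of one way to lower sensitivity is to threshold on the KS statistic itself (an effect size) and require several consecutive breaching windows. The function name and both defaults are assumptions to tune against your own traffic.

```python
def should_alert(d_values, min_effect=0.1, consecutive=3):
    """Alert only if the last `consecutive` windows all exceed `min_effect`.

    d_values: KS statistics for successive evaluation windows, oldest first.
    """
    if len(d_values) < consecutive:
        return False
    return all(d > min_effect for d in d_values[-consecutive:])
```

The trade-off is detection latency: requiring three windows means a real shift takes at least three windows to trigger, which has to be reconciled with the 5-minute auto-rollback target in Part 1.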

For each: triage → fix → runbook + postmortem skeleton.

Pass criteria:

3/3 root-caused
Fixes via GitOps where applicable
Each runbook usable by another engineer
Each postmortem skeleton names at least one pattern the failure rode in on (e.g., train-serve-skew, rag-as-pattern, inference-shapes, model-lifecycle)

What passing looks like: the runbooks read like ones an ML platform engineer would actually reach for at 3am. They name signals before commands, distinguish “drift fired” from “model is wrong,” and treat canary advance logic as a system to debug rather than a magic word.

Part 3: Design review (120 min)

“You’re a Staff Engineer. A junior proposes ‘let’s add GPU to the homelab — we can run real LLMs locally’ and submits a hardware-spec + integration design. Read it; write a thorough but constructive review.”

The provided design has 5 deliberate issues:

ROI math: $2K hardware vs $50/year cloud spot for the actual workload
Power + cooling not accounted for in homelab (real ops cost)
No multi-tenant story (one GPU, who gets it?)
Drift detection skipped because “we’ll just retrain weekly”
vLLM-only (no fallback when GPU is occupied)
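
The first two issues combine into one break-even calculation. The sketch below uses the $2K and $50/year figures from the planted design. The power figure in the usage note is a placeholder assumption; replace it with the number you measured in Phase 25.

```python
import math


def break_even_years(hardware_cost, cloud_cost_per_year, power_cost_per_year=0.0):
    """Years until buying hardware beats renting, or inf if it never does."""
    yearly_saving = cloud_cost_per_year - power_cost_per_year
    if yearly_saving <= 0:
        # Running the box costs at least as much as renting: no break-even.
        return math.inf
    return hardware_cost / yearly_saving
```

With the design's own numbers, `break_even_years(2000, 50)` is 40 years before power and cooling are counted. If the box costs even a hypothetical $60/year to run, `break_even_years(2000, 50, 60)` is infinite: the purchase never pays back for this workload.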

Your review must:

Identify at least 4 of the 5 issues
Propose alternatives (cloud spot for spikes; remote GPU rental; CPU + quantization for steady-state)
Be constructive (not “no”)
Cite Y4 patterns + cite cost numbers from your own GPU work in Phase 25

Pass criteria:

4/5 issues identified
2+ alternatives with trade-offs
Cites at least 4 Y4 patterns
Tone: Staff-level + constructive (numbers, not just opinions)

What passing looks like: the review opens with what the design gets right, names the missing trade-offs with your own numbers (P25 left receipts), and closes with a smaller alternative — typically “cloud spot for spikes, CPU + quantization for steady-state, defer the homelab GPU until utilization exceeds X.” The numbers are what make this Staff-tone instead of Senior-tone.

Overall pass criteria

[ ] Part 1: pipeline + canary + drift + auto-rollback work end-to-end
[ ] Part 2: 3/3 incidents root-caused; runbooks + postmortems written
[ ] Part 3: 4/5 issues caught + alternatives proposed
[ ] Self-grade vs Claude grade: agree within ~10%

If you fail Part 2 by symptom-patching the drift or canary scenario: don’t retake immediately. Drift is the contract that closes the model lifecycle loop; if it’s not muscle memory, the mlship v2 capstone in Year 5 will compound the gap. Spend 1-2 weeks operating drift detection with deliberate breakage, then retake.

After passing

You can:
- Architect + operate the full ML/AI lifecycle on a self-built platform
- Reason about LLM systems (RAG, vLLM, vector DBs, drift) at intermediate depth
- Manage GPU + cost trade-offs without over-spending
- Build and deploy ML pipelines that are reproducible + observable + reversible

You have:
- llm-gateway in production (homelab-scale traffic) with drift + auto-rollback
- mlship v0.5 ready for Y5 capstone polish
- notes-rag dogfooding llm-gateway over your own four years of writing
- 2 Studio composition recipes runnable end-to-end
- ops-handbook with ~110 runbooks, ~20 postmortems, ~200 weekly logs, 12+ ADRs
- 4+ merged upstream PRs (across years)
- ~40 patterns DEEP

Exit ramp: ML Platform Engineer / AI Infrastructure Engineer

Update program/overview.md Status block: “Year 4 complete: YYYY-MM-DD.”

→ Next: Year 5 — AI Platform + Capstone

Anti-patterns

Anti-pattern	Why
Cramming the Y4 reading the week before	“Designing ML Systems” + “AI Engineering” don’t compress
Skipping the design review part	Senior+ work IS architectural review
Forgetting to tear down the cloud GPU used in Part 1	Real-money lesson
Symptom-patching the drift / canary parts of Part 2	Drift is the contract that closes the model lifecycle loop
Reviewing the GPU proposal without your own cost numbers	Staff tone is numerical; Senior-tone-without-numbers fails Part 3