5-YEAR PROGRAM · YEAR 4
UPCOMING

Final Exam

Full-day (~8 hour) scenario combining all 6 phases of Year 4. Passing means you are ready for Year 5: the ML Platform / AI Infrastructure Engineer exit ramp is credible, and the Staff/Principal AI Platform Engineer trajectory is within reach.

The Year 4 Final Exam audits whether the data platform from Year 3 actually grew an ML/AI layer that you can operate at production rigor — not “I followed the tutorial and got an inference response.” One end-to-end ML pipeline (train → register → deploy → canary → drift → auto-rollback) built in 180 minutes. Three independent incidents from the messy parts of ML (cold start, train/serve skew, vLLM OOM, drift sensitivity, GPU thrash) in 120 minutes. One design review where the answer to “should we buy a homelab GPU?” is almost certainly no, and your job is to explain why with numbers and alternatives.

What’s measured is pattern fluency at the ML/AI infrastructure layer — can you reason about model-lifecycle, inference-shapes (online vs batch vs streaming), feature-store train/serve skew, rag-as-pattern, and the GPU economics every ML platform engineer eventually defends in a budget review? Two engineers can both wire MLflow → KServe; only one can explain what trade-off was made and why, defend a canary that’s flapping, and recognize when “drift” is a sensitivity miscalibration rather than a real distribution shift. The exam audits the second one.

This is the inflection toward Staff. Y3’s exam graduates you toward Senior; Y4’s exam graduates you toward Staff/Principal trajectory. The design review part is where that shift is most visible — Senior reviewers identify the issues; Staff reviewers propose the smaller alternative that gets 80% of the value, with cost numbers from their own work.


When to take

Take the exam once the Phase 25 validation criteria are green and services/llm-gateway/ v1.5 has been moving real (homelab-scale) traffic for at least two weeks with drift + auto-rollback active. Schedule it ~2 weeks ahead so that mlship v0.5 is stable, the train→deploy composition recipe runs end-to-end without hand-holding, and notes-rag has produced at least one real “huh, that retrieval is bad” moment you can cite.


Setup

  • basecamp Tier 1-7 all operational
  • services/llm-gateway/ v1.5 running with drift + auto-rollback
  • mlship v0.5 working (Docker + KServe + Fargate)
  • notes-rag personal service running
  • Train→deploy + Personal-RAG composition recipes runnable
  • 8 hours uninterrupted
  • The root-exam skill (or solo + this doc)

Format

3 parts, 420 timed minutes inside the ~8 hour day:
Part 1: Build (180 min)
Part 2: Triple incident (120 min)
Part 3: Design review (120 min)

Part 1: Build (180 min)

“Train a model on Year 3 Iceberg data via Kubeflow Pipelines + Ray; register in MLflow; deploy to KServe via mlship with canary; add drift detection; verify end-to-end through services/llm-gateway/. Cloud GPU optional but encouraged.”

Required deliverables:

[ ] KFP pipeline (define in Python; compile to Argo Workflow)
[ ] Step 1: extract from Iceberg with PIT-correct Feast features
[ ] Step 2: Ray-distributed training (sklearn or PyTorch)
[ ] Step 3: eval suite; gate promotion on threshold
[ ] Step 4: register in MLflow as Staging
[ ] Step 5: mlship deploy --to k8s with canary 90/10
[ ] Step 6: drift detection wired (KS-test on input embeddings or features)
[ ] Step 7: auto-rollback configured (canary reverse on drift alert)
[ ] Pipeline runnable via platform-ctl pipeline run train-deploy
[ ] Lineage visible in DataHub end-to-end
[ ] Observability: traces, logs, metrics correlated for one inference request through llm-gateway
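Step 6 asks for a KS-test on input features or embeddings. As a rough illustration of what that check computes, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic with the standard large-sample critical value. The function names and the `alpha` default are assumptions made for this example; they are not part of mlship or llm-gateway, and a real deployment would more likely call a library implementation on one feature (or one embedding dimension) at a time.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(live)
    n, m = len(ref), len(cur)
    d = 0.0
    # The gap between two step functions can only change at an observed value,
    # so checking every observed value is enough to find the maximum.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / n
        cdf_cur = bisect_right(cur, x) / m
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample rejection threshold for D at significance level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def drifted(reference, live, alpha=0.05):
    """Return (drift_flag, statistic) for one feature."""
    d = ks_statistic(reference, live)
    return d > ks_critical_value(len(reference), len(live), alpha), d


if __name__ == "__main__":
    training_feature = list(range(200))               # illustrative training values
    serving_feature = [x + 100 for x in training_feature]  # simulated shift
    print(drifted(training_feature, training_feature))
    print(drifted(training_feature, serving_feature))
```

With large serving windows the critical value shrinks, so even tiny shifts become statistically significant; that is one reason the sensitivity of this check needs tuning rather than a fixed default.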

Pass criteria:

  • Pipeline succeeds end-to-end
  • Canary advances only while the success rate holds at or above its threshold
  • Drift simulation triggers auto-rollback within 5 min
  • Lineage visible from Iceberg → MLflow → KServe → llm-gateway
  • 0 manual kubectl apply -f; everything via basecamp + platform-ctl

What passing looks like: the pipeline run is observable end-to-end, the canary’s reverse is logged with a timestamp and a reason, and the writeup of “why this canary threshold and not its neighbor” is something you could hand to a Y5 student as a study artifact. The pipeline isn’t just “it worked once” — it’s reproducible from platform-ctl and reviewable from a basecamp PR.
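To make "the canary's reverse is logged with a timestamp and a reason" concrete, the following is a small sketch of canary advance logic as an explicit state machine. Everything here is hypothetical: `CanaryState`, `step`, the 0.99 threshold, and the 20-point increment are illustrative choices, not the behavior of KServe or mlship. The point is only that every advance and every reversal leaves a record you can cite in the writeup.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CanaryState:
    canary_weight: int = 10     # percent of traffic on the new model (90/10 start)
    rolled_back: bool = False   # once reversed, the rollout stays reversed
    events: list = field(default_factory=list)  # (timestamp, action, reason)


def step(state, success_rate, drift_alert, threshold=0.99, increment=20, now=None):
    """Evaluate one interval: reverse on drift or a bad success rate, else advance."""
    if state.rolled_back:
        return state  # a reversed canary needs a new rollout, not another step
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    if drift_alert:
        reason = "drift alert"
    elif success_rate < threshold:
        reason = f"success rate {success_rate:.3f} below {threshold:.3f}"
    else:
        reason = None
    if reason is not None:
        state.canary_weight = 0
        state.rolled_back = True
        state.events.append((stamp, "rollback", reason))
    elif state.canary_weight < 100:
        state.canary_weight = min(100, state.canary_weight + increment)
        state.events.append((stamp, "advance", f"weight now {state.canary_weight}%"))
    return state


if __name__ == "__main__":
    rollout = CanaryState()
    step(rollout, success_rate=0.995, drift_alert=False)
    step(rollout, success_rate=0.999, drift_alert=True)
    for event in rollout.events:
        print(event)
```

Keeping the threshold and increment as parameters is what makes the "why this threshold and not its neighbor" question answerable: you can replay recorded success rates through `step` with a neighboring value and compare the event logs.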


Part 2: Triple incident (120 min)

Three scenarios, ~40 min each. The root-exam skill picks them deterministically, one scenario from each bucket: ML serving, feature store, and LLM-and-GPU.

Sample scenarios

From Phase 21 (ML Serving)

  • Cold start > 60s; investigate via OTel + Loki + KServe metrics; tune
  • Canary rollout stuck (success rate flapping); identify whether real or noisy SLO
  • Ray cluster lost head node; recover from checkpoint
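For the flapping-canary scenario, one way to separate a real regression from a noisy SLO is to put a confidence interval around the observed success rate before comparing it to the threshold. The sketch below uses the Wilson score interval; the function names, the 0.99 SLO, and the three verdict strings are assumptions for illustration, not something KServe reports.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a success proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def classify(successes, total, slo=0.99):
    """Compare the whole interval, not the point estimate, against the SLO."""
    low, high = wilson_interval(successes, total)
    if high < slo:
        return "real regression"
    if low >= slo:
        return "healthy"
    return "inconclusive: too few requests"


if __name__ == "__main__":
    print(classify(20, 20))        # tiny canary window
    print(classify(9500, 10000))   # clearly below the SLO
    print(classify(9990, 10000))   # clearly above the SLO
```

At 10% canary traffic on a homelab, a short evaluation window may see only a few dozen requests, so a single failure swings the point estimate across the threshold. An "inconclusive" verdict suggests lengthening the window or raising the minimum request count rather than loosening the SLO.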

From Phase 22 (Feature Store)

  • Model accuracy dropped after Feast schema change; trace via lineage
  • Feast online store stale (Redis); find materialization breakage
  • Train/serve skew: features differ between training and inference
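A train/serve skew investigation usually starts by comparing the same features on both sides. The sketch below is a deliberately simple version of that comparison: per-feature means, with the shift measured in training standard deviations. The row format (a list of dicts), the `skew_report` name, and the 0.25 threshold are assumptions for this example; it does not use the Feast API.

```python
from statistics import mean, pstdev


def skew_report(train_rows, serve_rows, max_shift=0.25):
    """Flag features whose serving mean moved, or that exist on only one side."""
    features = {k for row in train_rows for k in row} | {k for row in serve_rows for k in row}
    report = {}
    for name in sorted(features):
        train_vals = [r[name] for r in train_rows if r.get(name) is not None]
        serve_vals = [r[name] for r in serve_rows if r.get(name) is not None]
        if not train_vals or not serve_vals:
            report[name] = "missing on one side"
            continue
        spread = pstdev(train_vals) or 1.0  # avoid dividing by zero for constants
        shift = abs(mean(serve_vals) - mean(train_vals)) / spread
        if shift > max_shift:
            report[name] = f"mean shifted by {shift:.2f} training std-devs"
    return report


if __name__ == "__main__":
    # Hypothetical example: training saw dollars, serving sends cents.
    train = [{"amount": 10.0, "visits": 1}, {"amount": 20.0, "visits": 2}, {"amount": 30.0, "visits": 3}]
    serve = [{"amount": 1000.0, "visits": 1}, {"amount": 2000.0, "visits": 2}, {"amount": 3000.0, "visits": 3}]
    print(skew_report(train, serve))
```

A report like this only tells you which feature diverged; tracing why (a schema change, a stale online store, a transformation applied on one path only) is still the lineage work the scenario asks for.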

From Phase 24 (LLM Infra)

  • vLLM OOM under modest load; tune batching params + quantize
  • RAG returning hallucinations; investigate chunking / re-ranking / hybrid weights
  • llm-gateway streaming stuck; find backpressure / SSE buffer issue
  • Cost spike: one user generated 10M tokens overnight; detect + rate-limit
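For the vLLM OOM scenario, a back-of-envelope KV-cache estimate helps explain why both batching limits and quantization matter. The sketch below uses the usual per-token formula (keys plus values, per layer, per KV head). The model shape (32 layers, 32 KV heads, head dimension 128), the 24 GiB card, and the parameter counts are illustrative assumptions, and the estimate ignores activations and framework overhead. The `utilization` argument plays a role loosely similar to vLLM's `gpu_memory_utilization` setting.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes, weight_bytes, per_token_bytes, utilization=0.90):
    """Tokens that fit in the memory left after weights, across all sequences."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


if __name__ == "__main__":
    gib = 2 ** 30
    per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
    fp16_weights = 7_000_000_000 * 2      # assumed 7B parameters at 2 bytes each
    int4_weights = 7_000_000_000 // 2     # same model at roughly 4 bits per weight
    print("KV bytes per token:", per_token)
    print("fp16 weights:", max_cached_tokens(24 * gib, fp16_weights, per_token), "tokens")
    print("int4 weights:", max_cached_tokens(24 * gib, int4_weights, per_token), "tokens")
```

The token budget is shared by every concurrent sequence, so dividing it by the configured maximum context length gives a rough ceiling on concurrency. If "modest load" exceeds that ceiling, the fix is in the batching limits or the weight precision, not in restarting the pod.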

From Phase 25 (GPU + Drift)

  • GPU node OOM; identify whether memory leak or quantization regression
  • Drift alert fires but model is fine; tune drift sensitivity
  • Auto-rollback didn’t fire when it should have; debug the canary advance logic
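"Drift alert fires but model is fine" is often a sensitivity problem: a single noisy window crosses the threshold. One common mitigation is to require several breaching windows before alerting. The sketch below shows that idea; `DebouncedDriftAlert`, the window size of 5, and the requirement of 3 breaches are hypothetical values chosen for illustration.

```python
from collections import deque


class DebouncedDriftAlert:
    """Fire only when enough of the most recent windows breach the threshold."""

    def __init__(self, threshold, window=5, required=3):
        self.threshold = threshold
        self.required = required
        self.recent = deque(maxlen=window)  # True for each breaching window

    def observe(self, statistic):
        """Record one window's drift statistic; return True if the alert fires."""
        self.recent.append(statistic > self.threshold)
        return sum(self.recent) >= self.required


if __name__ == "__main__":
    noisy = DebouncedDriftAlert(threshold=0.2)
    print([noisy.observe(s) for s in [0.1, 0.3, 0.1, 0.1, 0.1]])  # one spike
    sustained = DebouncedDriftAlert(threshold=0.2)
    print([sustained.observe(s) for s in [0.3, 0.3, 0.3]])        # real shift
```

Debouncing trades false alarms for latency. If auto-rollback has to fire within 5 minutes of a simulated drift, the window length multiplied by the required breach count has to fit inside that budget, which is also a plausible place to look when a rollback did not fire on time.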

For each: triage → fix → runbook + postmortem skeleton.

Pass criteria:

  • 3/3 root-caused
  • Fixes via GitOps where applicable
  • Each runbook usable by another engineer
  • Each postmortem skeleton names at least one pattern the failure rode in on (e.g., train-serve-skew, rag-as-pattern, inference-shapes, model-lifecycle)

What passing looks like: the runbooks read like ones an ML platform engineer would actually reach for at 3am. They name signals before commands, distinguish “drift fired” from “model is wrong,” and treat canary advance logic as a system to debug rather than a magic word.


Part 3: Design review (120 min)

“You’re a Staff Engineer. A junior proposes ‘let’s add GPU to the homelab — we can run real LLMs locally’ and submits a hardware-spec + integration design. Read it; write a thorough but constructive review.”

The provided design has 5 deliberate issues:

  • ROI math: $2K hardware vs $50/year cloud spot for the actual workload
  • Power + cooling not accounted for in homelab (real ops cost)
  • No multi-tenant story (one GPU, who gets it?)
  • Drift detection skipped because “we’ll just retrain weekly”
  • vLLM-only (no fallback when GPU is occupied)
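The first two issues reduce to arithmetic you can put in the review. The sketch below works the break-even from the proposal's own figures ($2K hardware, $50/year cloud spot) and then adds power. The power inputs (100 W average draw, $0.15 per kWh) and the 4-year hardware lifetime are placeholder assumptions; substitute the numbers from your own Phase 25 work.

```python
def homelab_annual_cost(hardware_usd, lifetime_years, avg_watts, usd_per_kwh):
    """Amortized hardware plus always-on electricity, per year."""
    amortized = hardware_usd / lifetime_years
    power = avg_watts / 1000 * 24 * 365 * usd_per_kwh
    return amortized + power


def break_even_years(hardware_usd, cloud_usd_per_year, power_usd_per_year=0.0):
    """Years until the hardware pays for itself, or None if it never does."""
    saving = cloud_usd_per_year - power_usd_per_year
    if saving <= 0:
        return None
    return hardware_usd / saving


if __name__ == "__main__":
    # Figures from the proposal: $2K hardware vs $50/year cloud spot.
    print("Ignoring power:", break_even_years(2000, 50), "years")
    # Assumed power cost: 100 W average at $0.15/kWh.
    power = 100 / 1000 * 24 * 365 * 0.15
    print("Power per year: $%.2f" % power)
    print("With power:", break_even_years(2000, 50, power))
    print("Homelab annual cost: $%.2f" % homelab_annual_cost(2000, 4, 100, 0.15))
```

Even before power is counted, the proposal's own numbers give a 40-year payback. Under the assumed power figures the electricity alone exceeds the cloud bill, so the hardware never pays back for this workload, which is the kind of statement that lets the review stay constructive while still saying "not yet."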

Your review must:

  • Identify 4 of 5 issues
  • Propose alternatives (cloud spot for spikes; remote GPU rental; CPU + quantization for steady-state)
  • Be constructive (not “no”)
  • Cite Y4 patterns + cite cost numbers from your own GPU work in Phase 25

Pass criteria:

  • 4/5 issues identified
  • 2+ alternatives with trade-offs
  • Cites at least 4 Y4 patterns
  • Tone: senior + constructive

What passing looks like: the review opens with what the design gets right, names the missing trade-offs with your own numbers (P25 left receipts), and closes with a smaller alternative — typically “cloud spot for spikes, CPU + quantization for steady-state, defer the homelab GPU until utilization exceeds X.” The numbers are what make this Staff-tone instead of Senior-tone.


Overall pass criteria

[ ] Part 1: pipeline + canary + drift + auto-rollback work end-to-end
[ ] Part 2: 3/3 incidents root-caused; runbooks + postmortems written
[ ] Part 3: 4/5 issues caught + alternatives proposed
[ ] Self-grade vs Claude grade: agree within ~10%

If you fail Part 2 by symptom-patching the drift or canary scenario: don’t retake immediately. Drift is the contract that closes the model lifecycle loop; if it’s not muscle memory, the mlship v2 capstone in Year 5 will compound the gap. Spend 1-2 weeks operating drift detection with deliberate breakage, then retake.


After passing

You can:
- Architect + operate the full ML/AI lifecycle on a self-built platform
- Reason about LLM systems (RAG, vLLM, vector DBs, drift) at intermediate depth
- Manage GPU + cost trade-offs without over-spending
- Build and deploy ML pipelines that are reproducible + observable + reversible

You have:
- llm-gateway in production (homelab-scale traffic) with drift + auto-rollback
- mlship v0.5 ready for Y5 capstone polish
- notes-rag dogfooding llm-gateway over your own 4-year writing
- 2 Studio composition recipes runnable end-to-end
- ops-handbook with ~110 runbooks, ~20 postmortems, ~200 weekly logs, 12+ ADRs
- 4+ merged upstream PRs (across years)
- ~40 patterns DEEP

Exit ramp: ML Platform Engineer / AI Infrastructure Engineer

Update program/overview.md Status block: “Year 4 complete: YYYY-MM-DD.”

→ Next: Year 5 — AI Platform + Capstone


Anti-patterns

Anti-pattern | Why
Cramming the Y4 reading the week before | “Designing ML Systems” + “AI Engineering” don’t compress
Skipping the design review part | Senior+ work IS architectural review
Forgetting to destroy GPU in Part 1 | Real money lesson
Symptom-patching the drift / canary parts of Part 2 | Drift is the contract that closes the model lifecycle loop
Reviewing the GPU proposal without your own cost numbers | Staff tone is numerical; Senior-tone-without-numbers fails Part 3