MLOps Foundations
First phase of Year 4. Model lifecycle as a pattern (train → register → deploy → monitor → retrain). MLflow as the registry. Set up basecamp Tier 5 (ML). ~8 weeks, ~95 hrs.
Where this phase sits
Phase 20 is the entry point into Year 4. Year 3 ended with a credible data platform — lakehouse, processing, serving, governance. Year 4 stacks the ML layer on top, and that layer starts here. Before you can serve a model (Phase 21), build features around it (Phase 22), or orchestrate its retraining (Phase 23), you need an honest answer to a much smaller question: where does this model artifact live, and how do I know what produced it? That’s what P20 settles.
The phase is deliberately quieter than the rest of Y4. No GPUs, no streaming, no RAG: just MLflow, a sklearn classifier, and the discipline of pinning every input that contributed to the output. The point is to install the frame (lifecycle as a pattern) so the louder phases that follow have somewhere to plug in. By the end of P20, Tier 5 of the basecamp plan is live: MLflow on Postgres + MinIO, OIDC via Dex, and lineage emitted through OpenLineage into DataHub.
This is also the first phase that asks you to read the ml-and-ai pattern category seriously. model-lifecycle and train-serve-skew were near-empty stubs through Y1-Y3; P20 promotes model-lifecycle to OUTLINE and reinforces schema-on-read-vs-write from a new angle (model inputs, not Iceberg files).
Prerequisites
- Year 3 complete — basecamp public, lakehouse + JupyterHub operational
- You accept: MLOps is software engineering applied to ML, not a separate discipline. The patterns (CI/CD, observability, idempotent deploys, gated rollouts) all transfer from Y2-Y3. The “ML” specifics are: tracking experiments, registering models, serving with versions, detecting drift.
Why this phase exists
Years 4-5 deploy ML/LLM workloads that need: experiment tracking (which run produced this model?), model registry (where does the binary live? what’s its lineage?), versioned serving (canary, rollback), and monitoring at the model layer (not just the service layer).
This phase lays the lifecycle foundation. P21 adds serving. P22 adds features. P23 orchestrates. P24 adds LLMs. P25 hardens.
→ Pattern: model-lifecycle (first deepening)
1. PROBLEM
You have data (Year 3’s lakehouse). Someone wants to predict something from it. You need to:
- Train experimentally + track every experiment’s params + metrics + artifacts
- Pick a winner; promote the model binary to a registry with provenance
- Serve the model behind a stable API while iterating new versions
- Detect when the world shifts (drift) and the deployed model is no longer right
- Retrain on a schedule + a trigger
MLflow handles tracking + registry. KServe (P21) handles serving. Drift + retrain (P25) close the loop.
2. PRINCIPLES
2.1 The model lifecycle as a pattern
Train → evaluate → register → deploy → monitor → retrain. Each step has hand-offs that need contracts.
→ Pattern: model-lifecycle
Investigate:
- Read Chip Huyen’s “Designing Machine Learning Systems” Ch. 1 + 6
- Map the lifecycle to a real model: e.g., predicting next-week-commits from abukix.commits
- What’s the hardest hand-off? (Usually: monitoring vs. retraining trigger.)
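One way to make the hand-offs concrete is to write the lifecycle down as an explicit state machine, so that every transition has to be named before it can happen. The sketch below is illustrative only: `Stage`, `ALLOWED`, and `advance` are hypothetical names, not part of MLflow or any other library, and the "failed evaluation goes back to training" edge is an assumption added for the example.

```python
from enum import Enum


class Stage(Enum):
    TRAIN = "train"
    EVALUATE = "evaluate"
    REGISTER = "register"
    DEPLOY = "deploy"
    MONITOR = "monitor"
    RETRAIN = "retrain"


# Hypothetical hand-off table: each stage lists the stages it may hand off to.
ALLOWED = {
    Stage.TRAIN: {Stage.EVALUATE},
    Stage.EVALUATE: {Stage.REGISTER, Stage.TRAIN},  # assumption: a failed eval returns to training
    Stage.REGISTER: {Stage.DEPLOY},
    Stage.DEPLOY: {Stage.MONITOR},
    Stage.MONITOR: {Stage.RETRAIN},
    Stage.RETRAIN: {Stage.EVALUATE},
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the hand-off is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"no contract for hand-off {current.value} -> {target.value}")
    return target
```

Calling `advance(Stage.TRAIN, Stage.DEPLOY)` raises, which is the point: skipping evaluation and registration is not a hand-off anyone has agreed to.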
2.2 Experiment tracking
Every training run logs: params, metrics, code version (git SHA), data version (Iceberg snapshot), produced artifacts.
Investigate:
- MLflow Tracking API: mlflow.log_params, mlflow.log_metrics, mlflow.log_artifact
- Why is “code version + data version + params” the irreducible reproducibility minimum?
- Compare with Weights & Biases / Neptune / Comet — same shape, different UX
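A minimal sketch of what one logged run might look like, assuming an MLflow tracking server is already reachable (for example via the `MLFLOW_TRACKING_URI` environment variable). The helper names and the tag keys `git_sha` and `iceberg_snapshot_id` are illustrative choices, not an MLflow convention.

```python
from __future__ import annotations

import subprocess


def current_git_sha() -> str:
    """Return the commit SHA of the working tree (requires git on PATH)."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def provenance_tags(git_sha: str, iceberg_snapshot_id: int) -> dict[str, str]:
    """Build the tags that tie a run to its code and data versions.

    The tag keys are hypothetical; pick a naming convention and keep it stable.
    """
    return {
        "git_sha": git_sha,
        "iceberg_snapshot_id": str(iceberg_snapshot_id),
    }


def log_training_run(
    params: dict,
    metrics: dict[str, float],
    tags: dict[str, str],
    artifact_path: str | None = None,
) -> str:
    """Log one training run to MLflow and return its run ID."""
    import mlflow  # imported lazily so the helpers above work without MLflow installed

    with mlflow.start_run() as run:
        mlflow.set_tags(tags)
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)
        return run.info.run_id
```

Keeping the code and data versions as tags rather than params is a design choice in this sketch: tags stay searchable and leave the params table for hyperparameters only.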
2.3 Model registry: versioning
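A model has versions. Each version has a stage (None / Staging / Production / Archived). Promotions are auditable. Note that MLflow deprecated stages in 2.9 in favor of aliases and tags, so on MLflow 2.16+ a promotion is usually expressed by moving an alias (such as staging or production) to a new version.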
Investigate:
- MLflow Model Registry API: register, transition, alias
- Provenance: every version traceable to (training run → data → code)
- How does this map to GitOps for non-model services?
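As a rough sketch of the register-then-promote step, assuming the model was logged under the artifact path `model` inside a tracked run and that promotion is expressed with a registry alias. The function names and the default alias `staging` are assumptions for illustration.

```python
from __future__ import annotations


def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the runs:/ URI that points at a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_and_alias(run_id: str, model_name: str, alias: str = "staging") -> str:
    """Register the run's model as a new version and point an alias at it."""
    import mlflow  # imported lazily so run_model_uri works without MLflow installed
    from mlflow.tracking import MlflowClient

    version = mlflow.register_model(run_model_uri(run_id), model_name)
    MlflowClient().set_registered_model_alias(model_name, alias, version.version)
    return version.version
```

Because the version is registered from a `runs:/` URI, the registry keeps the originating run ID on the model version, which is what makes the (training run → data → code) trace possible later.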
2.4 Reproducibility
Same code + same data + same params = same model? Almost. Random seeds, hardware, library versions all matter.
Investigate:
- Pin everything: numpy seed, python hash seed, CUDA determinism mode
- Pin Python deps: uv lock (Y1 Phase 4 already taught this)
- Pin data: Iceberg snapshot ID at training time
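A small sketch of the seed-pinning part, under the assumption that the training code uses Python's `random` module and, optionally, NumPy's global generator. `seed_everything` is a hypothetical helper name. CUDA determinism is left out because this phase trains on CPU.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the random sources this process controls."""
    random.seed(seed)
    # PYTHONHASHSEED is read at interpreter start-up, so setting it here only
    # affects child processes; export it before launching Python to pin hashing.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional in this sketch
    np.random.seed(seed)  # seeds NumPy's legacy global generator
```

For scikit-learn estimators, passing `random_state` explicitly is usually more robust than relying on the global NumPy seed, and the seed itself is worth logging as a run parameter.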
2.5 Schema validation for ML inputs
A trained model expects a feature shape. Inference inputs must match. This is konfig (Y1) plus shape-check.
→ Pattern: schema-on-read-vs-write (revisited)
Investigate:
- pydantic models for inference input + output
- Why is input-shape skew the most common ML production bug?
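A minimal sketch of input and output schemas, assuming pydantic v2. The field names are hypothetical and are not the real columns of abukix.commits; the point is only that a missing, extra, or out-of-range feature fails loudly before it reaches the model.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class CommitFeatures(BaseModel):
    """Inference input; the field names are hypothetical."""

    model_config = ConfigDict(extra="forbid")  # unexpected columns are an error

    commits_last_week: int = Field(ge=0)
    active_authors: int = Field(ge=0)
    open_pull_requests: int = Field(ge=0)


class CommitPrediction(BaseModel):
    """Inference output, carrying the registry version that produced it."""

    next_week_commits: float = Field(ge=0)
    registered_version: str


def parse_request(payload: dict) -> CommitFeatures:
    """Validate a raw payload; raises ValidationError on any shape mismatch."""
    return CommitFeatures(**payload)
```

Forbidding extra fields is a deliberate choice here: silently dropping an unexpected column hides exactly the kind of skew this section is about.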
2.6 Test the model, not just the code
Unit tests cover code; model tests cover behavior. Slice tests, invariance tests, expected-output tests.
Investigate:
- Read Chip Huyen Ch. 6 (Model Development + Offline Evaluation)
- Build 3 invariance tests for a sklearn classifier (e.g., “renaming a feature shouldn’t change predictions”)
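Invariance tests can be written as properties of a predict function rather than of a specific model class. The sketch below assumes only a callable that maps a batch of rows to labels; a fitted scikit-learn classifier's `predict` should fit that shape, but `toy_predict` here is a hypothetical stand-in so the example has no dependencies. The three checks are examples, not a canonical set.

```python
from __future__ import annotations

import random
from typing import Callable, Sequence

Row = Sequence[float]
Predict = Callable[[Sequence[Row]], Sequence[int]]


def is_deterministic(predict: Predict, rows: Sequence[Row]) -> bool:
    """The same input twice gives the same predictions."""
    return list(predict(rows)) == list(predict(rows))


def is_row_order_invariant(predict: Predict, rows: Sequence[Row], seed: int = 0) -> bool:
    """Shuffling the batch shuffles the predictions in exactly the same way."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    baseline = list(predict(rows))
    shuffled = list(predict([rows[i] for i in order]))
    return shuffled == [baseline[i] for i in order]


def is_invariant_to_column(
    predict: Predict, rows: Sequence[Row], column: int, delta: float
) -> bool:
    """Perturbing a column the label should not depend on leaves predictions unchanged."""
    perturbed = [[v + delta if j == column else v for j, v in enumerate(row)] for row in rows]
    return list(predict(perturbed)) == list(predict(rows))


def toy_predict(rows: Sequence[Row]) -> list[int]:
    """Hypothetical stand-in model: label 1 when the first feature exceeds 5."""
    return [int(row[0] > 5) for row in rows]
```

In a pytest suite each helper becomes one `assert`, for example `assert is_row_order_invariant(model.predict, sample_rows)`, where `model` and `sample_rows` are whatever fixture the project provides.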
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / combined approach |
|---|---|---|---|
| Tracking | MLflow (OSS) | Weights & Biases (commercial) | Neptune |
| Registry | MLflow Model Registry | DVC | custom S3 layout |
| Experiment scale | one run at a time | parallel sweeps | Optuna / Ray Tune (P22-23) |
| Reproducibility | seed + version pinning | Docker images per experiment | both |
| Eval | offline (test set) | online (canary in prod) | Both — offline gates promotion, online validates |
4. TOOLS (as of 2025-10)
- MLflow 2.16+ — tracking + registry + serving (you’ll use registry; serving via KServe in P21)
- scikit-learn — for early experiments; familiar
- PyTorch — deepens P21
- HuggingFace transformers + datasets — preview; deepens P24
- uv + ruff + pytest (Y1 Python toolchain)
- JupyterHub (Y3 P15) — primary ML workspace
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| “Designing Machine Learning Systems” (Chip Huyen) Ch. 1-7 | The book |
| MLflow docs — Tracking + Registry | The implementation |
| “Reliable Machine Learning” (Chen, Murphy, Parisa, Sculley, Underwood) Ch. 1-5 | The discipline |
| Recommended | Why |
|---|---|
| “Machine Learning Engineering” (Andriy Burkov) | Complementary depth |
5.2 Operational depth checklist
[ ] Deploy MLflow on basecamp Tier 5 (with Postgres backend + MinIO artifacts)
[ ] Configure MLflow OIDC auth via Dex
[ ] Train a small sklearn model on abukix.commits (predict next-week-commits)
  - Use JupyterHub for the work
  - Read data from Iceberg via Spark or DuckDB
  - Pin Iceberg snapshot ID; log in MLflow
[ ] Log 5 experiments with different hyperparameters; compare in MLflow UI
[ ] Register the winner in MLflow Model Registry; tag as "Staging"
[ ] Build pydantic input/output schemas for the model
[ ] Build 3 invariance tests
[ ] Containerize a "predict" function (you'll wire to KServe in P21)
[ ] Wire MLflow to OpenLineage (already from Y3 P19); verify lineage in DataHub
[ ] Document the basecamp ML stack in basecamp/README.md (Tier 5 section)
6. COMPARE: MLflow vs Weights & Biases vs DIY
For the same training run, log to MLflow + W&B (free tier OK) + a homemade Postgres + MinIO solution. Compare:
- UX (the “find my best run” experience)
- Cost at scale
- Lock-in / portability
- Extensibility
400 words.
7. OPERATE
- 3+ runbooks (mlflow-postgres-bloat, model-registry-promotion-stuck, lineage-broken-from-mlflow-side)
- 1+ ADR (why-mlflow-over-wandb for the homelab)
- Weekly log
8. CONTRIBUTE
MLflow itself, OpenLineage MLflow integration, scikit-learn docs.
Validation criteria
[ ] All 10 operational depth checks
[ ] MLflow + Postgres + MinIO running in basecamp Tier 5
[ ] At least 1 model registered with full provenance (code + data + params + metrics)
[ ] Invariance tests + pydantic schemas
[ ] MLflow vs W&B comparison written up
[ ] 3+ runbooks; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
  - model-lifecycle → OUTLINE (DEEP after P25 closes the loop)
  - schema-on-read-vs-write → reinforced (ML inputs)
[ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min): given a notebook in JupyterHub, train a small classifier, log to MLflow, register in Model Registry, write 3 invariance tests, package the predict function as a container.
- Diagnose (60 min): MLflow shows two “best” models with identical metrics but different predictions. Trace the difference via lineage and explain it.
- Articulate (30 min): 600 words on “Walk the model lifecycle for next-week-commits. What hand-offs exist? Where does each fail in production?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Jupyter notebooks as production code | Notebooks are for exploration; productionize via containers |
| Training runs not logged to MLflow | “I forgot which run made this model” is career-limiting |
| Random seed not set | Non-reproducible; debug nightmare |
| Skipping invariance tests | Model “works” until a column reorder breaks it |
| Direct deploy from notebook to KServe | Skips the registry; loses provenance |
Patterns deepened this phase
- model-lifecycle → OUTLINE (closes to DEEP at P25)
- schema-on-read-vs-write → reinforced
→ Next: Phase 21: ML Serving + mlship v0