5-YEAR PROGRAM · YEAR 4 · PHASE 20
UPCOMING

MLOps Foundations

First phase of Year 4. Model lifecycle as a pattern (train → register → deploy → monitor → retrain). MLflow as the registry. Set up basecamp Tier 5 (ML). ~8 weeks, ~95 hrs.


Where this phase sits

Phase 20 is the entry point into Year 4. Year 3 ended with a credible data platform — lakehouse, processing, serving, governance. Year 4 stacks the ML layer on top, and that layer starts here. Before you can serve a model (Phase 21), build features around it (Phase 22), or orchestrate its retraining (Phase 23), you need an honest answer to a much smaller question: where does this model artifact live, and how do I know what produced it? That’s what P20 settles.

The phase is deliberately quieter than the rest of Y4. No GPUs, no streaming, no RAG: just MLflow, a sklearn classifier, and the discipline of pinning every input that contributed to the output. The point is to install the frame (lifecycle as a pattern) so the louder phases that follow have somewhere to plug in. By the end of P20, Tier 5 of the basecamp plan has come alive: MLflow on Postgres + MinIO, OIDC via Dex, lineage emitting through OpenLineage into DataHub.

This is also the first phase that asks you to read the ml-and-ai pattern category seriously. model-lifecycle and train-serve-skew were near-empty stubs through Y1-Y3; P20 promotes model-lifecycle to OUTLINE and reinforces schema-on-read-vs-write from a new angle (model inputs, not Iceberg files).


Prerequisites

  • Year 3 complete — basecamp public, lakehouse + JupyterHub operational
  • You accept: MLOps is software engineering applied to ML, not a separate discipline. The patterns (CI/CD, observability, idempotent deploys, gated rollouts) all transfer from Y2-Y3. The “ML” specifics are: tracking experiments, registering models, serving with versions, detecting drift.

Why this phase exists

Years 4-5 deploy ML/LLM workloads that need: experiment tracking (which run produced this model?), model registry (where does the binary live? what’s its lineage?), versioned serving (canary, rollback), and monitoring at the model layer (not just the service layer).

This phase lays the lifecycle foundation. P21 adds serving. P22 adds features. P23 orchestrates. P24 adds LLMs. P25 hardens.

→ Pattern: model-lifecycle (first deepening)


1. PROBLEM

You have data (Year 3’s lakehouse). Someone wants to predict something from it. You need to:

  • Train experimentally + track every experiment’s params + metrics + artifacts
  • Pick a winner; promote the model binary to a registry with provenance
  • Serve the model behind a stable API while iterating new versions
  • Detect when the world shifts (drift) and the deployed model is no longer right
  • Retrain on a schedule + a trigger

MLflow handles tracking + registry. KServe (P21) handles serving. Drift + retrain (P25) close the loop.


2. PRINCIPLES

2.1 The model lifecycle as a pattern

Train → evaluate → register → deploy → monitor → retrain. Each step has hand-offs that need contracts.

→ Pattern: model-lifecycle

Investigate:

  • Read Chip Huyen’s “Designing Machine Learning Systems” Ch. 1 + 6
  • Map the lifecycle to a real model: e.g., predicting next-week-commits from abukix.commits
  • What’s the hardest hand-off? (Usually: monitoring vs. retraining trigger.)
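One way to make "hand-offs that need contracts" concrete is to write the lifecycle down as a small state machine. The sketch below is illustrative only: the stage names follow the lifecycle above, but the required fields per hand-off (`run_id`, `trigger_reason`, and so on) are assumptions for this example, not a standard.

```python
from enum import Enum


class Stage(Enum):
    TRAIN = "train"
    EVALUATE = "evaluate"
    REGISTER = "register"
    DEPLOY = "deploy"
    MONITOR = "monitor"
    RETRAIN = "retrain"


# Each hand-off lists what the receiving stage must be given (the contract).
# Field names are hypothetical; adapt them to your own platform.
HANDOFFS = {
    (Stage.TRAIN, Stage.EVALUATE): {"run_id", "model_artifact"},
    (Stage.EVALUATE, Stage.REGISTER): {"run_id", "metrics"},
    (Stage.REGISTER, Stage.DEPLOY): {"model_name", "model_version"},
    (Stage.DEPLOY, Stage.MONITOR): {"endpoint", "model_version"},
    (Stage.MONITOR, Stage.RETRAIN): {"trigger_reason"},
    (Stage.RETRAIN, Stage.TRAIN): {"data_snapshot_id"},
}


def missing_fields(current, nxt, payload):
    """Return the contract fields the payload lacks for this hand-off."""
    required = HANDOFFS.get((current, nxt))
    if required is None:
        raise ValueError(f"no hand-off defined from {current.value} to {nxt.value}")
    return sorted(required - set(payload))


# A complete hand-off has nothing missing; an incomplete one names the gap.
complete = missing_fields(
    Stage.TRAIN, Stage.EVALUATE, {"run_id": "r1", "model_artifact": "model.pkl"}
)
incomplete = missing_fields(Stage.REGISTER, Stage.DEPLOY, {"model_name": "next-week-commits"})
```

Note what the table does not allow: there is no `(TRAIN, DEPLOY)` entry, so deploying straight from training is not a defined hand-off.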

2.2 Experiment tracking

Every training run logs: params, metrics, code version (git SHA), data version (Iceberg snapshot), produced artifacts.

Investigate:

  • MLflow Tracking API: mlflow.log_params, mlflow.log_metrics, mlflow.log_artifact
  • Why is “code version + data version + params” the irreducible reproducibility minimum?
  • Compare with Weights & Biases / Neptune / Comet — same shape, different UX

2.3 Model registry: versioning

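A model has versions. Each version has a stage (None / Staging / Production / Archived). Promotions are auditable. Note that MLflow has deprecated stages since 2.9 in favor of aliases and tags, so on MLflow 2.16+ expect to express Staging / Production as aliases or tags rather than stage transitions.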

Investigate:

  • MLflow Model Registry API: register, transition, alias
  • Provenance: every version traceable to (training run → data → code)
  • How does this map to GitOps for non-model services?
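Before reaching for the real API, it can help to model what a registry has to guarantee. The sketch below is a toy in-memory registry, not MLflow's implementation or API: it shows immutable versions, a movable alias, and an audit trail for every promotion. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelVersion:
    """Immutable record: a version always points at the run that produced it."""
    version: int
    run_id: str
    git_sha: str
    data_snapshot_id: str


@dataclass
class ToyRegistry:
    versions: list = field(default_factory=list)
    aliases: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _audit(self, actor, action):
        self.audit_log.append((datetime.now(timezone.utc), actor, action))

    def register(self, run_id, git_sha, data_snapshot_id, actor):
        mv = ModelVersion(len(self.versions) + 1, run_id, git_sha, data_snapshot_id)
        self.versions.append(mv)
        self._audit(actor, f"register v{mv.version}")
        return mv

    def set_alias(self, alias, version, actor):
        if not any(v.version == version for v in self.versions):
            raise KeyError(f"unknown version {version}")
        previous = self.aliases.get(alias)
        self.aliases[alias] = version
        self._audit(actor, f"alias {alias}: {previous} -> {version}")

    def resolve(self, alias):
        """Look up the version an alias currently points at."""
        return self.versions[self.aliases[alias] - 1]


registry = ToyRegistry()
registry.register("run-a", "0000000", "111", actor="alice")  # placeholder values
registry.register("run-b", "0000001", "222", actor="alice")
registry.set_alias("staging", 1, actor="alice")
registry.set_alias("staging", 2, actor="bob")  # promotion: alias moves, versions do not
```

The GitOps parallel: versions behave like immutable commits, the alias behaves like a branch pointer, and the audit log is the history of who moved it.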

2.4 Reproducibility

Same code + same data + same params = same model? Almost. Random seeds, hardware, library versions all matter.

Investigate:

  • Pin everything: numpy seed, python hash seed, CUDA determinism mode
  • Pin Python deps: uv lock (Y1 Phase 4 already taught this)
  • Pin data: Iceberg snapshot ID at training time
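A sketch of what "pin everything" can look like in practice, using only the standard library: a seeded split that is repeatable, and a fingerprint that changes whenever any pinned input changes. The function names and the choice of fields are assumptions for illustration. One caveat worth stating: `PYTHONHASHSEED` only takes effect if it is set before the interpreter starts, so it belongs in the container or launcher environment, not in the training script.

```python
import hashlib
import json
import random


def run_fingerprint(git_sha, data_snapshot_id, params, seed, lockfile_text):
    """Hash every pinned input into one ID; changing any input changes the ID."""
    payload = {
        "git_sha": git_sha,
        "data_snapshot_id": data_snapshot_id,
        "params": params,
        "seed": seed,
        # Hash the lockfile contents rather than storing them inline.
        "lockfile_sha256": hashlib.sha256(lockfile_text.encode()).hexdigest(),
    }
    # Canonical JSON (sorted keys) so dict ordering cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def seeded_split(seed, n_rows, test_fraction=0.2):
    """Shuffle row indices with a private RNG; same seed gives the same split."""
    rng = random.Random(seed)  # private RNG, no dependence on global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = n_rows - int(n_rows * test_fraction)
    return indices[:cut], indices[cut:]


# Placeholder inputs: the SHA, snapshot ID, and lockfile text are made up.
base = run_fingerprint("0000000", "1234567890", {"max_depth": 4}, 42, "numpy==x.y.z")
same = run_fingerprint("0000000", "1234567890", {"max_depth": 4}, 42, "numpy==x.y.z")
new_data = run_fingerprint("0000000", "9999999999", {"max_depth": 4}, 42, "numpy==x.y.z")
train_a, test_a = seeded_split(42, 100)
train_b, test_b = seeded_split(42, 100)
```

Logging the fingerprint alongside the run gives you a quick equality check: two runs with different fingerprints were not the same experiment, whatever their metrics say.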

2.5 Schema validation for ML inputs

A trained model expects a feature shape. Inference inputs must match. This is konfig (Y1) plus shape-check.

→ Pattern: schema-on-read-vs-write (revisited)

Investigate:

  • pydantic models for inference input + output
  • Why is input-shape skew the most common ML production bug?
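A minimal sketch of input and output schemas, assuming pydantic v2. The feature names (`commits_last_week` and so on) are hypothetical stand-ins for whatever the next-week-commits model actually consumes. The key choice is `extra="forbid"`: a renamed or unexpected column then fails at the boundary instead of silently reaching the model.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class PredictRequest(BaseModel):
    # Reject unknown fields so a renamed or extra column fails loudly.
    model_config = ConfigDict(extra="forbid")

    commits_last_week: int = Field(ge=0)
    active_contributors: int = Field(ge=0)
    open_pull_requests: int = Field(ge=0)


class PredictResponse(BaseModel):
    next_week_commits: float = Field(ge=0)
    version: str  # which registered model version answered


def parse_request(payload):
    """Return (request, []) on success or (None, error messages) on failure."""
    try:
        return PredictRequest.model_validate(payload), []
    except ValidationError as exc:
        messages = [
            ".".join(str(part) for part in err["loc"]) + ": " + err["msg"]
            for err in exc.errors()
        ]
        return None, messages


good, good_errors = parse_request(
    {"commits_last_week": 12, "active_contributors": 3, "open_pull_requests": 5}
)
# "contributors" is a plausible upstream rename of "active_contributors".
bad, bad_errors = parse_request(
    {"commits_last_week": 12, "contributors": 3, "open_pull_requests": 5}
)
```

The renamed field produces two errors, not one: the expected field is missing and the unexpected one is forbidden. That pairing is a useful signature when you are diagnosing upstream schema drift.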

2.6 Test the model, not just the code

Unit tests cover code; model tests cover behavior. Slice tests, invariance tests, expected-output tests.

Investigate:

  • Read Chip Huyen Ch. 6 (Model Development + Offline Evaluation)
  • Build 3 invariance tests for a sklearn classifier (e.g., “renaming a feature shouldn’t change predictions”)
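To show the shape of an invariance test, here is a sketch against a small sklearn pipeline trained on synthetic data. These three checks are examples, not the required set: predictions should not depend on row order, on batch size, or on a column the model is supposed to ignore. The synthetic dataset, the ID column, and the function names are all assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

ID_COLUMN = 4  # hypothetical row identifier the model must ignore


def make_data(seed=0):
    """Four synthetic features plus a row-ID column appended at index 4."""
    X, y = make_classification(
        n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=seed
    )
    row_ids = np.arange(len(X), dtype=float).reshape(-1, 1)
    return np.hstack([X, row_ids]), y


def build_model():
    # Keep only the four real features; the ID column is dropped before the tree.
    keep = ColumnTransformer([("features", "passthrough", [0, 1, 2, 3])], remainder="drop")
    return make_pipeline(keep, DecisionTreeClassifier(max_depth=4, random_state=0))


def row_order_invariant(model, X, seed=0):
    """Shuffling the batch must shuffle the predictions the same way."""
    perm = np.random.default_rng(seed).permutation(len(X))
    return np.array_equal(model.predict(X[perm]), model.predict(X)[perm])


def batch_size_invariant(model, X, n_rows=20):
    """Predicting rows one at a time must match predicting them as a batch."""
    singles = np.concatenate([model.predict(X[i:i + 1]) for i in range(n_rows)])
    return np.array_equal(singles, model.predict(X[:n_rows]))


def id_column_invariant(model, X, seed=0):
    """Scrambling the ignored ID column must not change any prediction."""
    perturbed = X.copy()
    perturbed[:, ID_COLUMN] = np.random.default_rng(seed).integers(0, 10**6, size=len(X))
    return np.array_equal(model.predict(perturbed), model.predict(X))


X, y = make_data()
model = build_model().fit(X, y)
results = {
    "row_order": row_order_invariant(model, X),
    "batch_size": batch_size_invariant(model, X),
    "id_column": id_column_invariant(model, X),
}
```

In a real suite each check would be a pytest test asserting the returned boolean. The ID-column check is the one most likely to catch a real bug: it fails as soon as someone removes the column filter from the pipeline.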

3. TRADE-OFFS

Decision | Option A | Option B | Option C / notes
Tracking | MLflow (OSS) | Weights & Biases (commercial) | Neptune
Registry | MLflow Model Registry | DVC | custom S3 layout
Experiment scale | one run at a time | parallel sweeps | Optuna / Ray Tune (P22-23)
Reproducibility | seed + version pinning | Docker images per experiment | both
Eval | offline (test set) | online (canary in prod) | Both: offline gates promotion, online validates

4. TOOLS (as of 2025-10)

  • MLflow 2.16+ — tracking + registry + serving (you’ll use registry; serving via KServe in P21)
  • scikit-learn — for early experiments; familiar
  • PyTorch — deepens P21
  • HuggingFace transformers + datasets — preview; deepens P24
  • uv + ruff + pytest (Y1 Python toolchain)
  • JupyterHub (Y3 P15) — primary ML workspace

5. MASTERY

5.1 Reading list

Required | Why
“Designing Machine Learning Systems” (Chip Huyen), Ch. 1-7 | The book
MLflow docs: Tracking + Registry | The implementation
“Reliable Machine Learning” (Chen, Murphy, Parisa, Sculley, Underwood), Ch. 1-5 | The discipline
Recommended | Why
“Machine Learning Engineering” (Andriy Burkov) | Complementary depth

5.2 Operational depth checklist

[ ] Deploy MLflow on basecamp Tier 5 (with Postgres backend + MinIO artifacts)
[ ] Configure MLflow OIDC auth via Dex
[ ] Train a small sklearn model on abukix.commits (predict next-week-commits)
- Use JupyterHub for the work
- Read data from Iceberg via Spark or DuckDB
- Pin Iceberg snapshot ID; log in MLflow
[ ] Log 5 experiments with different hyperparameters; compare in MLflow UI
[ ] Register the winner in MLflow Model Registry; tag as "Staging"
[ ] Build pydantic input/output schemas for the model
[ ] Build 3 invariance tests
[ ] Containerize a "predict" function (you'll wire to KServe in P21)
[ ] Wire MLflow to OpenLineage (already from Y3 P19); verify lineage in DataHub
[ ] Document the basecamp ML stack in basecamp/README.md (Tier 5 section)

6. COMPARE: MLflow vs Weights & Biases vs DIY

For the same training run, log to MLflow + W&B (free tier OK) + a homemade Postgres + MinIO solution. Compare:

  • UX (the “find my best run” experience)
  • Cost at scale
  • Lock-in / portability
  • Extensibility

400 words.
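For the DIY leg of the comparison, a starting point might look like the sketch below. It uses the standard library's `sqlite3` as a stand-in for Postgres, and stores only an artifact URI rather than talking to MinIO. The table layout, function names, and the `s3://` bucket path are all hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the runs table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " run_id INTEGER PRIMARY KEY,"
        " params TEXT NOT NULL,"
        " metric_name TEXT NOT NULL,"
        " metric_value REAL NOT NULL,"
        " artifact_uri TEXT NOT NULL)"
    )
    return conn


def log_run(conn, params, metric_name, metric_value, artifact_uri):
    """Insert one run; params are stored as canonical JSON text."""
    cur = conn.execute(
        "INSERT INTO runs (params, metric_name, metric_value, artifact_uri)"
        " VALUES (?, ?, ?, ?)",
        (json.dumps(params, sort_keys=True), metric_name, metric_value, artifact_uri),
    )
    conn.commit()
    return cur.lastrowid


def best_run(conn, metric_name, higher_is_better=True):
    """The 'find my best run' query, which is most of what a tracking UI does."""
    order = "DESC" if higher_is_better else "ASC"  # fixed strings, not user input
    row = conn.execute(
        "SELECT run_id, params, metric_value, artifact_uri FROM runs"
        f" WHERE metric_name = ? ORDER BY metric_value {order} LIMIT 1",
        (metric_name,),
    ).fetchone()
    if row is None:
        return None
    run_id, params, value, uri = row
    return {"run_id": run_id, "params": json.loads(params), "value": value, "uri": uri}


conn = open_store()
# Placeholder metric values and a hypothetical bucket layout.
log_run(conn, {"max_depth": 2}, "accuracy", 0.71, "s3://mlflow-diy/runs/1/model.pkl")
log_run(conn, {"max_depth": 4}, "accuracy", 0.78, "s3://mlflow-diy/runs/2/model.pkl")
log_run(conn, {"max_depth": 8}, "accuracy", 0.74, "s3://mlflow-diy/runs/3/model.pkl")
winner = best_run(conn, "accuracy")
```

Writing this takes an hour. Everything it lacks (multiple metrics per run, metric history, auth, a UI, a registry) is the material for the comparison.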


7. OPERATE

  • 3+ runbooks (mlflow-postgres-bloat, model-registry-promotion-stuck, lineage-broken-from-mlflow-side)
  • 1+ ADR (why-mlflow-over-wandb for the homelab)
  • Weekly log

8. CONTRIBUTE

MLflow itself, OpenLineage MLflow integration, scikit-learn docs.


Validation criteria

[ ] All 10 operational depth checks
[ ] MLflow + Postgres + MinIO running in basecamp Tier 5
[ ] At least 1 model registered with full provenance (code + data + params + metrics)
[ ] Invariance tests + pydantic schemas
[ ] MLflow vs W&B comparison written up
[ ] 3+ runbooks; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
- model-lifecycle → OUTLINE (DEEP after P25 closes the loop)
- schema-on-read-vs-write → reinforced (ML inputs)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

  1. Build (90 min) — given a notebook in JupyterHub, train a small classifier, log to MLflow, register in Model Registry, write 3 invariance tests, package the predict function as a container.
  2. Diagnose (60 min) — scenario: MLflow shows two “best” models with identical metrics but different predictions; trace via lineage; explain.
  3. Articulate (30 min) — 600 words: “Walk the model lifecycle for next-week-commits. What hand-offs exist? Where does each fail in production?”
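As background for the Diagnose scenario, the sketch below shows only that an aggregate metric can hide row-level differences; it says nothing about the cause, which is what the lineage trace is for. The labels and predictions are made-up toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def disagreement(pred_a, pred_b):
    """Indices where two models predict differently."""
    return [i for i, (a, b) in enumerate(zip(pred_a, pred_b)) if a != b]


# Toy data: both models make two mistakes, but on different rows.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 0, 1, 1]
pred_b = [1, 1, 1, 1, 0, 0, 0, 0]

acc_a = accuracy(y_true, pred_a)
acc_b = accuracy(y_true, pred_b)
differs_at = disagreement(pred_a, pred_b)
```

Here both accuracies are 0.75 while the models disagree on rows 1, 3, 6, and 7, half the dataset. Comparing predictions row by row, not just metrics, is the first step before asking what differed in code, data, or seed.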

Anti-patterns

Anti-pattern | Why
Jupyter notebooks as production code | Notebooks are for exploration; productionize via containers
Training runs not logged to MLflow | “I forgot which run made this model”: career-limiting
Random seed not set | Non-reproducible; debug nightmare
Skipping invariance tests | Model “works” until a column reorder breaks it
Direct deploy from notebook to KServe | Skip the registry; lose provenance

Patterns deepened this phase

  • model-lifecycle → OUTLINE (DEEP after P25 closes the loop)
  • schema-on-read-vs-write → reinforced (ML inputs)


→ Next: Phase 21: ML Serving + mlship v0