MLOps Foundations

First phase of Year 4. Model lifecycle as a pattern (train → evaluate → register → deploy → monitor → retrain). MLflow as the registry. Set up basecamp Tier 5 (ML). ~8 weeks, ~95 hrs.

Where this phase sits

Phase 20 is the entry point into Year 4. Year 3 ended with a credible data platform — lakehouse, processing, serving, governance. Year 4 stacks the ML layer on top, and that layer starts here. Before you can serve a model (Phase 21), build features around it (Phase 22), or orchestrate its retraining (Phase 23), you need an honest answer to a much smaller question: where does this model artifact live, and how do I know what produced it? That’s what P20 settles.

The phase is deliberately quieter than the rest of Y4. No GPUs, no streaming, no RAG: just MLflow, a sklearn classifier, and the discipline of pinning every input that contributed to the output. The point is to install the frame (lifecycle as a pattern) so the louder phases that follow have somewhere to plug in. By the end of P20, Tier 5 of the basecamp plan is live: MLflow on Postgres + MinIO, OIDC via Dex, lineage emitting through OpenLineage into DataHub.

This is also the first phase that asks you to read the ml-and-ai pattern category seriously. model-lifecycle and train-serve-skew were near-empty stubs through Y1-Y3; P20 promotes model-lifecycle to OUTLINE and reinforces schema-on-read-vs-write from a new angle (model inputs, not Iceberg files).

Prerequisites

Year 3 complete — basecamp public, lakehouse + JupyterHub operational

You accept: MLOps is software engineering applied to ML, not a separate discipline. The patterns (CI/CD, observability, idempotent deploys, gated rollouts) all transfer from Y2-Y3. The “ML” specifics are: tracking experiments, registering models, serving with versions, detecting drift.

Why this phase exists

Years 4-5 deploy ML/LLM workloads that need: experiment tracking (which run produced this model?), model registry (where does the binary live? what’s its lineage?), versioned serving (canary, rollback), and monitoring at the model layer (not just the service layer).

This phase lays the lifecycle foundation. P21 adds serving. P22 adds features. P23 orchestrates. P24 adds LLMs. P25 hardens.

→ Pattern: model-lifecycle (first deepening)

1. PROBLEM

You have data (Year 3’s lakehouse). Someone wants to predict something from it. You need to:

Train experimentally + track every experiment’s params + metrics + artifacts
Pick a winner; promote the model binary to a registry with provenance
Serve the model behind a stable API while iterating new versions
Detect when the world shifts (drift) and the deployed model is no longer right
Retrain on a schedule + a trigger

MLflow handles tracking + registry. KServe (P21) handles serving. Drift + retrain (P25) close the loop.

2. PRINCIPLES

2.1 The model lifecycle as a pattern

Train → evaluate → register → deploy → monitor → retrain. Each step has hand-offs that need contracts.

→ Pattern: model-lifecycle

Investigate:

Read Chip Huyen’s “Designing Machine Learning Systems” Ch. 1 + 6
Map the lifecycle to a real model: e.g., predicting next-week-commits from abukix.commits
What’s the hardest hand-off? (Usually: monitoring vs. retraining trigger.)
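
One way to make the hand-offs concrete is to write the lifecycle down as a small state machine. The sketch below is illustrative only: the `Stage` enum, the `ALLOWED` table, and the `advance` helper are hypothetical names, not part of MLflow or any other library, and the back-edges (evaluate → train, retrain → evaluate) are one assumed reading of the loop.

```python
from enum import Enum


class Stage(Enum):
    TRAIN = "train"
    EVALUATE = "evaluate"
    REGISTER = "register"
    DEPLOY = "deploy"
    MONITOR = "monitor"
    RETRAIN = "retrain"


# Hypothetical hand-off table: each stage lists the stages it may hand off to.
ALLOWED = {
    Stage.TRAIN: {Stage.EVALUATE},
    Stage.EVALUATE: {Stage.REGISTER, Stage.TRAIN},  # a failed evaluation goes back to training
    Stage.REGISTER: {Stage.DEPLOY},
    Stage.DEPLOY: {Stage.MONITOR},
    Stage.MONITOR: {Stage.RETRAIN},
    Stage.RETRAIN: {Stage.EVALUATE},  # a retrained model must be re-evaluated
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the hand-off is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal hand-off: {current.value} -> {target.value}")
    return target
```

Each key in `ALLOWED` is a place where a contract is needed; `advance(Stage.TRAIN, Stage.DEPLOY)` raises, which is the "direct deploy from notebook" shortcut expressed as a rule.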

2.2 Experiment tracking

Every training run logs: params, metrics, code version (git SHA), data version (Iceberg snapshot), produced artifacts.

Investigate:

MLflow Tracking API: mlflow.log_params, mlflow.log_metrics, mlflow.log_artifact
Why is “code version + data version + params” the irreducible reproducibility minimum?
Compare with Weights & Biases / Neptune / Comet — same shape, different UX
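
A minimal sketch of what "every run logs params, metrics, code version, data version, artifacts" can look like with the MLflow Tracking API. The helper names (`current_git_sha`, `provenance_tags`, `log_run`) and the tag keys `git_sha` and `iceberg_snapshot_id` are assumptions made for illustration; MLflow does not mandate them. The tracking server is assumed to be configured through the `MLFLOW_TRACKING_URI` environment variable.

```python
import subprocess
from typing import Optional


def current_git_sha() -> str:
    """Return the commit SHA of the working tree, or "unknown" outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def provenance_tags(git_sha: str, iceberg_snapshot_id: int) -> dict:
    """Build the tags that tie a run to its code and data versions (hypothetical keys)."""
    return {"git_sha": git_sha, "iceberg_snapshot_id": str(iceberg_snapshot_id)}


def log_run(params: dict, metrics: dict, tags: dict,
            artifact_path: Optional[str] = None) -> str:
    """Log one training run to MLflow and return its run ID."""
    import mlflow  # imported here so the helpers above work without MLflow installed

    with mlflow.start_run() as run:
        mlflow.set_tags(tags)
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)  # a local file, e.g. a plot or report
        return run.info.run_id
```

Calling `log_run(params, metrics, provenance_tags(current_git_sha(), snapshot_id))` once per experiment keeps the three reproducibility inputs (code, data, params) attached to the same run record.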

2.3 Model registry: versioning

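A model has versions. Each version has a stage (None / Staging / Production / Archived). Promotions are auditable. Note that MLflow has deprecated stages since 2.9 in favor of aliases and tags, so on MLflow 2.16+ it is safer to treat "Staging" and "Production" as aliases than as stage transitions.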

Investigate:

MLflow Model Registry API: register, transition, alias
Provenance: every version traceable to (training run → data → code)
How does this map to GitOps for non-model services?
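
A hedged sketch of the register-then-alias flow using the MLflow client. It uses aliases rather than stage transitions, which recent MLflow releases have deprecated. The function names, the `source_run_id` tag key, the default alias `staging`, and the default artifact path `model` are all assumptions for illustration.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model artifact logged under a tracking run."""
    return f"runs:/{run_id}/{artifact_path}"


def alias_model_uri(model_name: str, alias: str) -> str:
    """URI that resolves to whichever version currently holds the alias."""
    return f"models:/{model_name}@{alias}"


def register_and_alias(run_id: str, model_name: str, alias: str = "staging") -> str:
    """Register the run's model as a new version and point an alias at it."""
    import mlflow  # imported here so the URI helpers work without MLflow installed
    from mlflow.tracking import MlflowClient

    version = mlflow.register_model(run_model_uri(run_id), model_name)
    client = MlflowClient()
    # Hypothetical tag key: keeps the version traceable back to its training run.
    client.set_model_version_tag(model_name, version.version, "source_run_id", run_id)
    client.set_registered_model_alias(model_name, alias, version.version)
    return version.version
```

Consumers then load `alias_model_uri(model_name, "staging")` instead of a hard-coded version number, so a promotion becomes a single, auditable pointer move, which is roughly how a GitOps repo moves an image tag.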

2.4 Reproducibility

Same code + same data + same params = same model? Almost. Random seeds, hardware, library versions all matter.

Investigate:

Pin everything: NumPy seed, Python hash seed (PYTHONHASHSEED), CUDA determinism mode
Pin Python deps: uv lock (Y1 Phase 4 already taught this)
Pin data: Iceberg snapshot ID at training time
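
The seed-pinning part can be sketched as below. `seed_everything` is a hypothetical helper name; CUDA determinism is left out because it depends on the deep-learning framework in use, and library and data pinning are handled separately by the lock file and the snapshot ID.

```python
import os
import random

import numpy as np


def seed_everything(seed: int) -> np.random.Generator:
    """Seed Python's and NumPy's global RNGs and return a dedicated Generator."""
    random.seed(seed)
    np.random.seed(seed)  # legacy global RNG, still used by some libraries
    # PYTHONHASHSEED only takes effect at interpreter start-up; setting it here
    # affects child processes, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    return np.random.default_rng(seed)
```

Passing the returned `Generator` explicitly into sampling code is usually easier to reason about than relying on global state, and the seed itself belongs in the logged params.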

2.5 Schema validation for ML inputs

A trained model expects a feature shape. Inference inputs must match. This is konfig (Y1) plus shape-check.

→ Pattern: schema-on-read-vs-write (revisited)

Investigate:

pydantic models for inference input + output
Why is input-shape skew the most common ML production bug?
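
A minimal sketch of input and output schemas, assuming pydantic v2. The field names are hypothetical stand-ins and are not the real abukix.commits columns; `parse_request` is an illustrative helper.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class CommitFeatures(BaseModel):
    """Inference input. Field names are hypothetical, not the real feature set."""

    model_config = ConfigDict(extra="forbid")  # reject unexpected columns

    commits_last_week: int = Field(ge=0)
    active_authors: int = Field(ge=0)
    days_since_last_commit: float = Field(ge=0)


class CommitPrediction(BaseModel):
    """Inference output, stamped with the registry version that produced it."""

    next_week_commits: float = Field(ge=0)
    registry_version: str


def parse_request(payload: dict) -> CommitFeatures:
    """Validate a raw payload; raises ValidationError on shape or type skew."""
    return CommitFeatures.model_validate(payload)
```

With `extra="forbid"`, a renamed, missing, or added column fails loudly at the API boundary instead of silently shifting the feature vector the model sees.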

2.6 Test the model, not just the code

Unit tests cover code; model tests cover behavior. Slice tests, invariance tests, expected-output tests.

Investigate:

Read Chip Huyen Ch. 6 (Model Development + Offline Evaluation)
Build 3 invariance tests for a sklearn classifier (e.g., “renaming a feature shouldn’t change predictions”)
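
As a starting point, here is a sketch of three invariance tests in pytest style. They are not the feature-renaming example above: they check row order, batch size, and same-seed retraining instead, using a decision tree on synthetic data as a stand-in for the real classifier. The `train` helper and all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def train(seed: int = 0):
    """Fit a small tree on synthetic data; stands in for the real classifier."""
    X, y = make_classification(n_samples=200, n_features=5, random_state=seed)
    model = DecisionTreeClassifier(max_depth=4, random_state=seed).fit(X, y)
    return model, X


def test_row_order_invariance():
    # Shuffling the rows of a batch must only shuffle the predictions.
    model, X = train()
    perm = np.random.default_rng(0).permutation(len(X))
    assert np.array_equal(model.predict(X[perm]), model.predict(X)[perm])


def test_batch_size_invariance():
    # Predicting one row at a time must match predicting the whole batch.
    model, X = train()
    one_by_one = np.concatenate([model.predict(row.reshape(1, -1)) for row in X[:20]])
    assert np.array_equal(one_by_one, model.predict(X[:20]))


def test_same_seed_same_model():
    # Retraining with identical data and seed must give identical predictions.
    (m1, X), (m2, _) = train(seed=1), train(seed=1)
    assert np.array_equal(m1.predict(X), m2.predict(X))
```

Tests like these run under plain `pytest` next to the unit tests, so a behavioral regression blocks promotion the same way a failing code test would.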

3. TRADE-OFFS

Decision	Option A	Option B	Option C / notes
Tracking	MLflow (OSS)	Weights & Biases (commercial)	Neptune
Registry	MLflow Model Registry	DVC	custom S3 layout
Experiment scale	one run at a time	parallel sweeps	Optuna / Ray Tune (P22-23)
Reproducibility	seed + version pinning	Docker images per experiment	both
Eval	offline (test set)	online (canary in prod)	Both — offline gates promotion, online validates

4. TOOLS (as of 2025-10)

MLflow 2.16+ — tracking + registry + serving (you’ll use registry; serving via KServe in P21)
scikit-learn — for early experiments; familiar
PyTorch — deepens P21
HuggingFace transformers + datasets — preview; deepens P24
uv + ruff + pytest (Y1 Python toolchain)
JupyterHub (Y3 P15) — primary ML workspace

5. MASTERY

5.1 Reading list

Required	Why
“Designing Machine Learning Systems” (Chip Huyen) Ch. 1-7	The book
MLflow docs — Tracking + Registry	The implementation
“Reliable Machine Learning” (Chen, Murphy, Parisa, Sculley, Underwood) Ch. 1-5	The discipline

Recommended	Why
“Machine Learning Engineering” (Andriy Burkov)	Complementary depth

5.2 Operational depth checklist

[ ] Deploy MLflow on basecamp Tier 5 (with Postgres backend + MinIO artifacts)
[ ] Configure MLflow OIDC auth via Dex
[ ] Train a small sklearn model on abukix.commits (predict next-week-commits)
    - Use JupyterHub for the work
    - Read data from Iceberg via Spark or DuckDB
    - Pin Iceberg snapshot ID; log in MLflow
[ ] Log 5 experiments with different hyperparameters; compare in MLflow UI
[ ] Register the winner in MLflow Model Registry; tag as "Staging"
[ ] Build pydantic input/output schemas for the model
[ ] Build 3 invariance tests
[ ] Containerize a "predict" function (you'll wire to KServe in P21)
[ ] Wire MLflow to OpenLineage (already deployed in Y3 P19); verify lineage in DataHub
[ ] Document the basecamp ML stack in basecamp/README.md (Tier 5 section)

6. COMPARE: MLflow vs Weights & Biases vs DIY

For the same training run, log to MLflow + W&B (free tier OK) + a homemade Postgres + MinIO solution. Compare:

UX (the “find my best run” experience)
Cost at scale
Lock-in / portability
Extensibility

400 words.

7. OPERATE

3+ runbooks (mlflow-postgres-bloat, model-registry-promotion-stuck, lineage-broken-from-mlflow-side)
1+ ADR (why-mlflow-over-wandb for the homelab)
Weekly log

8. CONTRIBUTE

MLflow itself, OpenLineage MLflow integration, scikit-learn docs.

Validation criteria

[ ] All 10 operational depth checks
[ ] MLflow + Postgres + MinIO running in basecamp Tier 5
[ ] At least 1 model registered with full provenance (code + data + params + metrics)
[ ] Invariance tests + pydantic schemas
[ ] MLflow vs W&B comparison written up
[ ] 3+ runbooks; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
    - model-lifecycle → OUTLINE (DEEP after P25 closes the loop)
    - schema-on-read-vs-write → reinforced (ML inputs)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — given a notebook in JupyterHub, train a small classifier, log to MLflow, register in Model Registry, write 3 invariance tests, package the predict function as a container.
Diagnose (60 min) — scenario: MLflow shows two “best” models with identical metrics but different predictions; trace via lineage; explain.
Articulate (30 min) — 600 words: “Walk the model lifecycle for next-week-commits. What hand-offs exist? Where does each fail in production?”

Anti-patterns

Anti-pattern	Why
Jupyter notebooks as production code	Notebooks are for exploration; productionize via containers
Training runs not logged to MLflow	“I forgot which run made this model”: career-limiting
Random seed not set	Non-reproducible; debug nightmare
Skipping invariance tests	Model “works” until a column reorder breaks it
Direct deploy from notebook to KServe	Skips the registry; loses provenance

Patterns deepened this phase

model-lifecycle → OUTLINE (closes to DEEP at P25)
schema-on-read-vs-write → reinforced

→ Next: Phase 21: ML Serving + mlship v0