Experiment Tracking

Every training run logs its hyperparameters, metrics, code SHA, and dataset version. This is the discipline that makes ML work reproducible six months later.

Every run captures its inputs. Reproducibility becomes a query, not an archaeology project. Status: STUB, to be promoted to OUTLINE in Y5 Phase 39.

What this pattern is

Experiment tracking is the discipline of recording every training run’s full context: hyperparameters, evaluation metrics, Git SHA of the training code, dataset version (an Iceberg snapshot ID is ideal), runtime environment, random seeds, and resulting artifacts. MLflow Tracking, Weights & Biases, Neptune, and Comet are the major tools. The minimum bar: mlflow.start_run() wraps every training script, with explicit logging of inputs and metrics. The higher bar: tracking is enforced. CI rejects training code that doesn’t log, mlflow.log_input() records each dataset together with its digest, and artifact upload is automatic.
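The minimum bar can be sketched as follows. This is an illustration under stated assumptions, not the basecamp implementation: the helper names (build_run_context, tracked_training_run), the "hp." key prefix, and the shape of train_fn are all hypothetical. Only mlflow.start_run(), mlflow.log_params(), and mlflow.log_metrics() are MLflow calls.

```python
import subprocess


def build_run_context(hyperparams, dataset_version, seed, git_sha=None):
    """Flatten a run's inputs into the string params a tracker stores.

    Hypothetical helper: the "hp." prefix is an illustrative convention.
    """
    if git_sha is None:
        # Assumes the script runs inside a Git checkout.
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    context = {f"hp.{name}": str(value) for name, value in sorted(hyperparams.items())}
    context.update(
        {
            "git_sha": git_sha,
            "dataset_version": str(dataset_version),
            "seed": str(seed),
        }
    )
    return context


def tracked_training_run(train_fn, hyperparams, dataset_version, seed, git_sha=None):
    """Wrap train_fn in an MLflow run.

    Assumes train_fn(hyperparams, seed) returns a dict of float metrics.
    """
    import mlflow  # imported lazily so build_run_context works without MLflow

    context = build_run_context(hyperparams, dataset_version, seed, git_sha)
    with mlflow.start_run():
        mlflow.log_params(context)      # inputs: hyperparameters, SHA, data, seed
        metrics = train_fn(hyperparams, seed)
        mlflow.log_metrics(metrics)     # outputs: evaluation metrics
    return metrics
```

Logging the inputs before training starts means a run that crashes halfway still leaves a record of what it was given.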

The pattern is the load-bearing prerequisite for model-registry (the registry consumes tracked runs), for evals (eval results attached to runs), and for incident response (“what produced this bad prediction?” becomes a registry lookup, not detective work).

The discipline maps directly to what software engineers call “reproducible builds.” Given the same inputs (code, data, configuration), running the training should produce the same outputs (model artifact, metrics). Experiment tracking captures the inputs so reproducibility becomes verifiable. Without it, six months later “why did the model perform this way?” is answerable only by re-running training and hoping — which doesn’t work if the training environment has drifted, the data has changed, or the researcher who ran it has moved teams.
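One way to make “same inputs” checkable is to reduce the tracked inputs to a single fingerprint and log it with the run. The sketch below is an assumption about how that could look, not something the source prescribes; input_fingerprint and its field names are hypothetical.

```python
import hashlib
import json


def input_fingerprint(git_sha, dataset_version, config, seed):
    """Return a stable SHA-256 digest over a run's declared inputs.

    Hypothetical helper. Keys are sorted so that dict ordering in `config`
    does not change the digest.
    """
    payload = json.dumps(
        {
            "git_sha": git_sha,
            "dataset_version": dataset_version,
            "config": config,
            "seed": seed,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs with equal fingerprints claimed the same code, data, configuration, and seed. If their metrics still differ, the cause lies in something the fingerprint does not cover, such as the runtime environment.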

The pattern is not sophisticated theoretically. It’s git commit + parameter logging + metric logging + artifact upload — mechanically simple. The discipline is what’s hard. Teams that don’t enforce tracking end up with hundreds of untracked runs, dozens of “which one was the good one?” moments per quarter, and no ability to answer basic questions about their own work. Teams that do enforce tracking get compound benefits — every subsequent analysis, comparison, and reproduction is trivial.
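Enforcement can start small. The following sketch shows one possible CI check, assuming training scripts are plain Python files: it parses each script and flags those containing no call named start_run or autolog. The function names and the set of accepted call names are hypothetical, and a name-based check like this is a heuristic, not proof that a script logs anything useful.

```python
import ast

# Call names treated as evidence of tracking (an illustrative choice).
TRACKING_CALLS = {"start_run", "autolog"}


def calls_tracking(source):
    """Return True if the source contains a call named start_run or autolog."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                name = func.attr            # e.g. mlflow.start_run(...)
            else:
                name = getattr(func, "id", None)  # e.g. start_run(...)
            if name in TRACKING_CALLS:
                return True
    return False


def untracked_scripts(sources):
    """Given {path: source text}, return the sorted paths with no tracking call."""
    return sorted(path for path, text in sources.items() if not calls_tracking(text))
```

A CI job could read the training scripts, pass them to untracked_scripts, and fail the build when the returned list is non-empty.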

Concrete instances in the wild

  • MLflow Tracking. OSS, the reference implementation. basecamp default. Tracks runs, params, metrics, artifacts.
  • Weights & Biases. Commercial, richer UI, popular in research settings. Excellent for hyperparameter sweeps.
  • Neptune. Commercial alternative focused on team collaboration and comparison.
  • Comet. Commercial alternative with strong experiment visualization.
  • TensorBoard. TensorFlow’s visualization toolkit, from Google. Common in DL research; simpler than MLflow.
  • DVC + DVCLive. Git-based tracking. Simpler alignment with existing Git workflows.
  • AimStack. OSS experiment tracker with a modern UI.
  • ClearML. OSS/commercial platform combining tracking + orchestration.
  • basecamp tracking (Y5 Phase 39). MLflow tracking server deployed on K8s, backed by Postgres + MinIO.
  • Hugging Face Trainer with TensorBoard/W&B integration. Standard tracking for LLM fine-tuning workflows.

Why this pattern matters

Untracked ML work becomes archaeology within weeks. “Which hyperparameters did we use for the good model?” becomes a Slack search. “What dataset was that trained on?” becomes speculation. “Can you reproduce that result?” becomes “not really.” Every ML team without tracking has these conversations weekly, and every one wastes engineering time that tracking would have prevented.

Tracked ML work compounds. Every run is a data point in a queryable database. Comparing runs (which hyperparameters worked best?) is a UI query. Reproducing a specific result starts from a run ID: mlflow.get_run() returns the logged parameters and metrics, and mlflow.artifacts.download_artifacts() fetches the stored artifacts. Auditing (what produced this model in production?) is a lineage query. The engineering time spent on tracking pays for itself several times over within the first month.
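The same queries can be run from code. The sketch below assumes an MLflow tracking server is already configured; the helper names (order_clause, best_run_id, fetch_run) are hypothetical, as are any experiment and metric names passed in. The MLflow calls used are mlflow.search_runs(), mlflow.get_run(), and mlflow.artifacts.download_artifacts().

```python
def order_clause(metric, higher_is_better=True):
    """Build an order_by entry for mlflow.search_runs (hypothetical helper)."""
    direction = "DESC" if higher_is_better else "ASC"
    return f"metrics.{metric} {direction}"


def best_run_id(experiment_name, metric, higher_is_better=True):
    """Return the run ID with the best value of `metric`, or None if no runs."""
    import mlflow  # lazy import: order_clause stays usable without MLflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[order_clause(metric, higher_is_better)],
        max_results=1,
    )  # returns a pandas DataFrame by default
    return None if runs.empty else runs.iloc[0]["run_id"]


def fetch_run(run_id, dst_path):
    """Return (params, metrics, local artifact directory) for one run."""
    import mlflow
    import mlflow.artifacts

    run = mlflow.get_run(run_id)
    local_dir = mlflow.artifacts.download_artifacts(run_id=run_id, dst_path=dst_path)
    return run.data.params, run.data.metrics, local_dir
```

Fetching a run this way recovers what was logged. Re-running the training still requires checking out the recorded Git SHA and dataset version.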

The pattern also enables higher-level ML disciplines. Model registry needs tracked runs as its input. Automated retraining needs tracked baselines to compare against. Hyperparameter sweeps need tracked results to select from. LLM eval pipelines need tracked eval scores to gate promotion. Everything downstream of “we trained a model” is easier when tracking is discipline rather than heroism.

For LLM workflows specifically, tracking matters differently. Fine-tune runs need tracked base models, tracked datasets, tracked hyperparameters. Prompt engineering needs tracked prompt variations and their eval scores. Model routing decisions need tracked A/B test results. The scope expands from “hyperparameters” to “everything that shapes model behavior,” but the tracking discipline is the same.

Modern tools make tracking cheap. MLflow’s autolog for scikit-learn, PyTorch (through Lightning), and Hugging Face captures runs with a single mlflow.autolog() call and no changes to the training loop. W&B’s SDK is similarly minimal. Neptune, Comet, and Aim have similar autolog capabilities. The engineering cost of “adopt tracking” is one line of code plus running the tracking server. The engineering cost of “not adopting tracking” is compounding chaos.
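A minimal sketch of the “one line plus a server” setup, under stated assumptions: resolve_tracking_uri and enable_autolog are hypothetical helpers, and the "./mlruns" fallback is an illustrative local default. MLFLOW_TRACKING_URI is the environment variable MLflow itself reads; mlflow.set_tracking_uri(), mlflow.set_experiment(), and mlflow.autolog() are MLflow calls.

```python
import os


def resolve_tracking_uri(env=None, default="./mlruns"):
    """Pick the tracking URI from the environment, else a local fallback.

    Hypothetical helper; the fallback path is an assumption for local work.
    """
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI") or default


def enable_autolog(experiment_name, env=None):
    """Point MLflow at the tracking server and switch on autologging."""
    import mlflow  # lazy import: resolve_tracking_uri works without MLflow

    mlflow.set_tracking_uri(resolve_tracking_uri(env))
    mlflow.set_experiment(experiment_name)
    mlflow.autolog()  # the one line; supported libraries log from here on
```

Calling enable_autolog once at the top of a training script is enough for supported libraries. What autolog captures varies by library, which is the “opaque” side of the autolog trade-off.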

The failure modes to know: tracking that’s optional gets skipped (needs to be enforced); tracking servers that go down block training (need HA); tracked runs that accumulate without cleanup bloat storage (need retention policies); manual tracking (typing metrics into a spreadsheet) doesn’t scale and doesn’t survive team turnover.
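A retention policy can be sketched as a pure selection step followed by deletion. The policy itself (age cutoff plus a protected set) and the helper names are assumptions for illustration. MlflowClient.delete_run() is the MLflow call, and it marks a run as deleted without erasing it; permanent removal is a separate step (the mlflow gc command).

```python
from datetime import datetime, timedelta, timezone


def expired_run_ids(runs, max_age_days, now=None, protected_ids=frozenset()):
    """Select run IDs older than the cutoff and not protected.

    `runs` is an iterable of (run_id, started_at) pairs with timezone-aware
    datetimes. Hypothetical policy: protected_ids would hold runs that a
    registered model still points to.
    """
    now = datetime.now(timezone.utc) if now is None else now
    cutoff = now - timedelta(days=max_age_days)
    return [
        run_id
        for run_id, started_at in runs
        if started_at < cutoff and run_id not in protected_ids
    ]


def delete_runs(run_ids):
    """Soft-delete the selected runs on the configured tracking server."""
    from mlflow.tracking import MlflowClient  # lazy import

    client = MlflowClient()
    for run_id in run_ids:
        client.delete_run(run_id)
```

Keeping the selection separate from the deletion makes the policy easy to dry-run: print what expired_run_ids returns before passing it to delete_runs.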

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 39 instruments all training scripts with MLflow.
DEEP     Out of scope for explicit DEEP; reinforced cross-cutting through Y5.

Preview: what OUTLINE will answer

When Y5 Phase 39 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you make ML work reproducible and auditable months after the training runs happened?
  • PRINCIPLES. Every training run logs its full context. Autolog when possible. Enforce tracking via CI, not policy. Capture dataset lineage (Iceberg snapshot IDs). Registry consumes tracked runs. Cleanup policies prevent bloat.
  • TRADE-OFFS. OSS self-hosted (MLflow — flexible, ops burden) vs managed (W&B, Neptune — easy, cost). Autolog (zero-code, opaque) vs explicit logging (verbose, transparent). Simple tracking (MLflow) vs feature-rich (W&B with sweeps, reports).
  • TOOLS (time-stamped as of 2026-06): MLflow (basecamp default), Weights & Biases, Neptune, Comet, TensorBoard, DVC + DVCLive, AimStack, ClearML.

The DEEP-level discipline is folded into every Y5 model-development phase; no separate DEEP promotion, but tracking becomes second nature across the basecamp ML workflows.

Canonical references

  • MLflow documentation. Free at mlflow.org.
  • Chip Huyen, Designing Machine Learning Systems, chapter on experiment tracking.
  • Weights & Biases documentation and blog. Free at wandb.ai.
  • Neptune blog on tracking best practices. Free at neptune.ai.
  • ML Ops community best practices docs. Free.

Cross-references