Drift Detection

Monitor the distribution of inputs, features, and predictions over time. Catch the moment the world stops looking like the training data.

The training data was last year. The world is this year. Distributions drift. The model that was right then may be wrong now.

Status: STUB. Promotion to OUTLINE is planned for Y5 Phase 41.

What this pattern is

Drift detection monitors the statistical distribution of model inputs, feature values, and predictions over time, alerting when current distributions diverge significantly from the training-time baseline. Covariate drift (the input distribution shifts), label drift (the output distribution shifts; strictly this means the true labels, but because ground truth often arrives late it is usually watched through the prediction distribution), and concept drift (the relationship between inputs and outputs shifts) all have different remedies. Evidently, Arize Phoenix, and whylogs are the OSS observability tools; cloud-native ML platforms have managed equivalents.

The pattern matters because models silently degrade as the world changes. A fraud-detection model trained on pre-pandemic transactions may misclassify post-pandemic patterns; a recommendation model trained on summer behavior may underperform in winter. Drift detection alerts operators before the business notices, giving them time to retrain or roll back. The pattern composes with evals (drift triggers re-running the eval suite on recent data) and with model-registry (drift triggers retraining and re-promotion workflows).

Statistical tests underneath drift detection include Kolmogorov-Smirnov (distribution comparison for numeric features), chi-squared (categorical features), Population Stability Index (PSI, common in credit scoring), and Jensen-Shannon divergence (general-purpose distribution distance). Each has trade-offs in sensitivity to sample size, robustness to outliers, and interpretability; for example, with very large samples a Kolmogorov-Smirnov test flags shifts too small to matter, which is one reason distance measures such as PSI are often thresholded instead of p-values. Picking the right test per feature type is a specialization; using off-the-shelf tools (Evidently, whylogs) usually works better than rolling your own.
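To make the four tests concrete, here is a minimal sketch using NumPy and SciPy. The function names (psi, numeric_drift, categorical_drift), the ten quantile bins, and the eps clipping value are illustrative assumptions, not taken from any of the tools named above; PSI in particular has several binning conventions and this is only one of them.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import chi2_contingency, ks_2samp


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index, binned on baseline quantiles (one common convention)."""
    # Interior quantile edges only, so out-of-range values land in the end bins.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))[1:-1]
    base = np.bincount(np.searchsorted(edges, baseline, side="right"), minlength=bins)
    curr = np.bincount(np.searchsorted(edges, current, side="right"), minlength=bins)
    # Clip empty bins to eps so the log ratio stays finite.
    base_frac = np.clip(base / base.sum(), eps, None)
    curr_frac = np.clip(curr / curr.sum(), eps, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))


def numeric_drift(baseline, current):
    """Kolmogorov-Smirnov test and PSI for one numeric feature."""
    ks = ks_2samp(baseline, current)
    return {
        "ks_stat": float(ks.statistic),
        "ks_p": float(ks.pvalue),
        "psi": psi(baseline, current),
    }


def categorical_drift(baseline, current):
    """Chi-squared test and Jensen-Shannon distance for one categorical feature."""
    base_counts, curr_counts = Counter(baseline), Counter(current)
    categories = sorted(set(base_counts) | set(curr_counts))
    table = np.array([
        [base_counts[c] for c in categories],
        [curr_counts[c] for c in categories],
    ])
    _, p_value, _, _ = chi2_contingency(table)
    # jensenshannon normalizes the counts and returns the distance
    # (square root of the divergence); base=2 bounds it to [0, 1].
    js = jensenshannon(table[0], table[1], base=2)
    return {"chi2_p": float(p_value), "js_distance": float(js)}
```

Calling numeric_drift(training_values, recent_values) per numeric column and categorical_drift per categorical column gives one set of scores per feature. Which score to alert on, and at what threshold, is a per-feature decision that this sketch deliberately leaves open.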

Drift is often talked about as a single concept but is really a family: feature drift (inputs look different; the covariate drift described above), label drift (outputs look different, usually first visible as a shift in predictions), and concept drift (the mapping between inputs and outputs changed). Each requires different remediation. Feature drift usually means retraining on recent data. Label drift often means model recalibration. Concept drift may require significant model redesign. Diagnosing which type of drift is happening is what makes drift detection useful rather than just noisy.
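The mapping from drift type to remediation can be written down as a small triage helper. This is a hedged sketch, not a diagnostic method: the function name triage and its three boolean inputs are hypothetical, and the rule that an accuracy drop with stable inputs points at concept drift is a common heuristic assumed here, not something the tools above guarantee.

```python
from typing import List, Optional


def triage(input_drift: bool,
           prediction_drift: bool,
           accuracy_dropped: Optional[bool] = None) -> List[str]:
    """Turn observed signals into candidate remediations (heuristic only).

    accuracy_dropped is None when no ground-truth labels are available yet.
    """
    findings = []
    if input_drift:
        findings.append("feature drift: consider retraining on recent data")
    if prediction_drift:
        findings.append("label drift: consider recalibrating the model")
    # Assumption: accuracy falling while inputs look stable suggests the
    # input-output mapping itself changed.
    if accuracy_dropped and not input_drift:
        findings.append("possible concept drift: review the model design")
    if accuracy_dropped is None:
        findings.append("no ground truth yet: concept drift cannot be ruled out")
    return findings
```

The point of the sketch is that distribution checks alone only cover the first two branches; the concept-drift branch needs labeled outcomes, which is why the helper reports explicitly when they are missing.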

Concrete instances in the wild

Evidently AI. OSS drift detection with rich reports. K8s-deployable. basecamp default candidate.
Arize AI (commercial + OSS Phoenix). Commercial ML observability with strong drift detection; Phoenix is its open-source companion.
WhyLabs / whylogs. OSS observability library plus commercial platform. Statistical profiles of data.
Great Expectations. Data quality framework; also usable for drift detection via expectations.
AWS SageMaker Model Monitor. AWS-managed drift detection.
GCP Vertex AI Model Monitoring. GCP-managed equivalent.
Fiddler AI. Commercial ML observability platform.
Aporia. Commercial ML monitoring focused on drift.
NannyML. OSS library specifically for performance monitoring without ground truth.
basecamp drift monitoring (Y5 Phase 41). Evidently + Grafana dashboards + Alertmanager.

Why this pattern matters

Models silently degrade over time. The training data was collected at some historical moment. The world keeps changing after that moment. Users adopt new behaviors. Product surfaces change. External events (pandemics, economic shifts, regulatory changes) affect input distributions. Without drift detection, models that were right at training time slowly become wrong at serving time, and nobody notices until business metrics degrade.

With drift detection, operators know when the model’s assumptions no longer hold. The alert doesn’t say “the model is wrong”; it says “the input distribution has shifted past its alert threshold relative to baseline; consider re-evaluating.” Operators can then decide whether to retrain, roll back, or gate the model’s outputs until they understand the shift. The information changes what would have been a silent degradation into an operational decision.

The pattern also enables data-quality monitoring beyond drift. Sudden schema changes (a required field starts arriving null), sudden distribution changes (a new user segment appears), and sudden volume changes (traffic dropped 50%) are all detectable via similar statistical tests. Drift detection tools serve double duty as data-quality monitors.
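The three data-quality signals above (nulls in a required field, a new segment, a volume drop) can be sketched without any statistics library. The function name quality_findings, the row-as-dict representation, and both tolerance defaults are assumptions chosen for illustration.

```python
def _null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def quality_findings(baseline_rows, current_rows, required_fields=(),
                     categorical_fields=(), null_rate_tolerance=0.05,
                     volume_drop_tolerance=0.5):
    """Compare a current batch of rows (dicts) against a baseline batch."""
    findings = []

    # Volume: flag when the current batch is smaller than the tolerated fraction.
    if len(current_rows) < (1 - volume_drop_tolerance) * len(baseline_rows):
        findings.append(f"volume: {len(baseline_rows)} -> {len(current_rows)} rows")

    # Schema: flag required fields whose null rate rose beyond the tolerance.
    for field in required_fields:
        base_rate = _null_rate(baseline_rows, field)
        curr_rate = _null_rate(current_rows, field)
        if curr_rate - base_rate > null_rate_tolerance:
            findings.append(f"nulls: {field} {base_rate:.0%} -> {curr_rate:.0%}")

    # Segments: flag categorical values never seen in the baseline.
    for field in categorical_fields:
        seen = {row.get(field) for row in baseline_rows}
        unseen = {row.get(field) for row in current_rows} - seen
        if unseen:
            findings.append(f"new values: {field} {sorted(map(str, unseen))}")

    return findings
```

An empty list means the batch passed all three checks. In practice the baseline would be a comparable time window (same hour or weekday), otherwise normal traffic cycles would trip the volume check.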

For LLM systems, drift shows up differently. Prompt distributions change as products evolve. Output distributions change as models are fine-tuned. Retrieval context distributions change as RAG corpora grow. Each needs monitoring, though the specific statistical tests differ from classical ML drift detection. LLM drift monitoring is a newer discipline; expect the tooling to mature significantly over 2026-2028.

The pattern matters most for production ML that operates over long time horizons. Fraud detection running for years. Recommendation systems running for years. Any model whose predictions affect ongoing business decisions. For short-lived experimental models or models retrained frequently (for example, daily), drift detection matters less because each retraining absorbs recent drift as a side effect. For long-lived stable models, drift detection is what prevents silent degradation.

Modern tooling has made drift detection accessible. Evidently provides a comprehensive OSS library with report generation. WhyLabs provides OSS profiling (whylogs) with commercial platform integration. Cloud platforms provide managed drift detection integrated with their ML services. What used to require in-house data-science tooling is now largely a matter of configuration, although baselines and thresholds still need tuning per model.

The failure modes to know: alert fatigue if thresholds are too sensitive (operators ignore alerts); missed drift if thresholds are too lax (silent degradation continues); false positives on seasonal patterns (weekly / monthly drift is normal, not concerning); reliance on drift detection instead of proper evals (drift is a proxy for quality, not quality itself).
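One common way to trade sensitivity against alert fatigue is to require several consecutive breaches before alerting. The sketch below assumes a drift score (for example a PSI value) is computed once per monitoring window; the class name DebouncedDriftAlert and its parameters are hypothetical.

```python
from collections import deque


class DebouncedDriftAlert:
    """Fire only after `consecutive` windows in a row exceed the threshold."""

    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        # Keep only the most recent `consecutive` breach flags.
        self._recent = deque(maxlen=consecutive)

    def observe(self, score):
        """Record one window's drift score; return True if the alert should fire."""
        self._recent.append(score > self.threshold)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)


alert = DebouncedDriftAlert(threshold=0.2, consecutive=3)
fired = [alert.observe(s) for s in [0.3, 0.1, 0.3, 0.3, 0.3]]
# fired == [False, False, False, False, True]
```

A single spike no longer pages anyone, at the cost of detecting real drift a few windows later. For the seasonal false positives mentioned above, a complementary assumption is to compute each score against a baseline from the same weekday or month instead of one global training baseline.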

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 41 deploys Evidently or equivalent on basecamp.
DEEP     Out of scope for /root unless a Y5 capstone direction calls for it.
         Default: OUTLINE target.

Preview: what OUTLINE will answer

When Y5 Phase 41 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you catch silent model degradation as the world changes after training?
PRINCIPLES. Monitor input, feature, and prediction distributions over time. Compare current to training baseline. Statistical tests appropriate to feature types. Alert on significant divergence. Combine with evals for full quality picture.
TRADE-OFFS. Sensitivity (catch drift early, alert fatigue) vs specificity (few false positives, may miss real drift). OSS (Evidently, WhyLabs — flexible) vs managed (SageMaker Model Monitor, Vertex AI — easy). Statistical drift detection vs performance monitoring (with ground truth) vs no-ground-truth methods (NannyML).
TOOLS (time-stamped as of 2026-06): Evidently AI (basecamp candidate), Arize AI (+ OSS Phoenix), WhyLabs / whylogs, Great Expectations, AWS SageMaker Model Monitor, GCP Vertex AI Model Monitoring, Fiddler AI, Aporia, NannyML.

The DEEP promotion is out of scope for basecamp; if pursued (e.g., a Y5 capstone calls for a long-running production model), it would add MASTERY (operating drift monitoring on a long-running model), COMPARE (Evidently vs WhyLabs vs cloud-managed), OPERATE (a specific drift event and its remediation), and CONTRIBUTE (a contribution to Evidently or a similar OSS project).

Canonical references

Evidently AI documentation. Free at evidentlyai.com.
Chip Huyen, Designing Machine Learning Systems, chapters on monitoring and continuous learning.
Google’s “Rules of Machine Learning” on monitoring. Free.
WhyLabs blog on drift detection patterns. Free at whylabs.ai.
Emmanuel Ameisen, Building Machine Learning Powered Applications (O’Reilly) — chapter on monitoring.

Cross-references

Y5 Phase 41: ML Evaluation + Monitoring
Related: evals, train-serve-skew, ai-observability