Train / Serve Skew

The mismatch between features at training time and features at serving time. The silent ML failure mode the feature store pattern is designed to prevent.

Symptom: the model performs worse in production than in offline eval. Cause: the features are computed differently in the two paths. The skew stays silent until somebody compares them. Status: STUB, to be promoted to OUTLINE in Y5 Phase 40.

What this pattern is

Train/serve skew is the silent ML failure mode in which features computed at training time differ, often only slightly, from features computed at inference time. Classic causes: training uses data the model could not have had at inference time (leakage of future information, including label leakage); training computes a feature in Spark batch while serving computes the same feature in the application layer with slightly different logic; training uses cleaned data while serving sees raw data with missing values or different categorical encodings. The result: offline metrics look great, the model deploys, online performance is significantly worse, and nobody can tell why.

The discipline that prevents skew: point-in-time correctness (training only sees features as they existed at the time of the historical event); single source of truth for feature logic (the feature store); parity testing in CI (compute features both ways on the same data; assert equality); monitoring of feature distributions in production vs training.
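Point-in-time correctness is the least intuitive of these disciplines, so here is a minimal sketch of it using only the Python standard library. The feature history, the feature name in the comments, and the helper `feature_as_of` are all hypothetical; a feature store would normally perform this lookup for you.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of one feature for one user:
# (time the value became known, value). Must be sorted by time.
FEATURE_HISTORY = [
    (datetime(2024, 1, 1), 2),   # e.g. purchases_last_30d as known on Jan 1
    (datetime(2024, 1, 10), 5),
    (datetime(2024, 1, 20), 9),
]


def feature_as_of(history, event_time):
    """Return the latest value known at or before event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # nothing was known yet; never backfill from the future
    return history[idx - 1][1]


# Historical events we want training rows for.
events = [datetime(2023, 12, 31), datetime(2024, 1, 10), datetime(2024, 1, 15)]
training_values = [feature_as_of(FEATURE_HISTORY, t) for t in events]
print(training_values)  # [None, 5, 5]
```

A naive join against the latest value would have given 9 for every row, which is information none of these events could have had at serving time.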

The pattern is not really “a pattern to adopt”; it’s an anti-pattern to prevent, and the primary prevention mechanism is the feature store (see feature-store). Naming it as its own concept matters because engineers need vocabulary for what’s going wrong before they can adopt the fix. Every ML team eventually hits train/serve skew; teams that have named it and drilled on it recognize it faster and remediate faster than teams that only encounter it as “the model is behaving weirdly.”

The failure mode is particularly insidious because offline eval doesn’t catch it. Offline eval uses training-time features; production uses serving-time features. If the two differ, offline eval scores the model on inputs it will never see in production, so the model looks great in eval and performs poorly once deployed. What’s missing is the standard software testing intuition (“test what production does”): ML teams need parity testing that specifically compares training-time and serving-time feature values on the same inputs.
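A parity test of this kind can be sketched in a few lines. Everything here is illustrative: `training_features` and `serving_features` stand in for the real batch and online code paths, the feature names are made up, and the serving path contains a deliberate divergence (a different default for missing amounts) so the check has something to find.

```python
import math


def training_features(row):
    """Stand-in for the batch (training) path. Hypothetical logic."""
    amount = row["amount"] if row["amount"] is not None else 0.0
    return {"log_amount": math.log1p(amount), "is_weekend": row["weekday"] >= 5}


def serving_features(row):
    """Stand-in for the online (serving) path, with a deliberate divergence."""
    amount = row["amount"] if row["amount"] is not None else 1.0  # different default
    return {"log_amount": math.log1p(amount), "is_weekend": row["weekday"] >= 5}


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parity_mismatches(rows, train_fn, serve_fn, tol=1e-9):
    """Return (row_index, feature, train_value, serve_value) for each disagreement."""
    mismatches = []
    for i, row in enumerate(rows):
        t, s = train_fn(row), serve_fn(row)
        for name in sorted(set(t) | set(s)):
            tv, sv = t.get(name), s.get(name)
            if _is_number(tv) and _is_number(sv):
                same = math.isclose(tv, sv, abs_tol=tol)
            else:
                same = tv == sv  # also flags a feature missing from one path
            if not same:
                mismatches.append((i, name, tv, sv))
    return mismatches


# Golden inputs should include the awkward cases: nulls, boundaries, rare values.
GOLDEN_ROWS = [
    {"amount": 12.5, "weekday": 2},
    {"amount": None, "weekday": 6},
]

mismatches = parity_mismatches(GOLDEN_ROWS, training_features, serving_features)
for row_index, name, tv, sv in mismatches:
    print(f"row {row_index}: {name} train={tv} serve={sv}")
```

In CI the last step would be `assert not mismatches`, so a divergence like this one blocks the deploy instead of surfacing weeks later as degraded online metrics.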

Concrete instances in the wild

Label leakage. Training data contains information that wouldn’t be available at inference time. Classic: including “user purchased this product” as a feature for a purchase-prediction model.
Feature computation drift. Training pipeline computes features in Spark; serving computes them in Python with subtly different logic. Every ML team eventually hits this.
Missing value handling drift. Training pipeline drops rows with missing values; serving still receives such rows and scores them anyway, using whatever default or null the serving code produces, with no error raised.
Categorical encoding drift. Training uses one-hot encoding produced by scikit-learn; serving uses hand-written encoding that misses categories from the tail.
Timezone drift. Training uses UTC timestamps; serving uses local time. Model learns UTC patterns; serves against local-time inputs.
Timestamp precision drift. Training uses second precision; serving uses millisecond precision. Model behavior slightly off.
Test-set contamination. Training data accidentally includes examples from the test set. Not skew per se but adjacent.
basecamp parity testing (Y5 Phase 40). Feast + CI-based parity tests that compute features both ways and assert equality on golden inputs.
TensorFlow Data Validation (TFDV), part of Google’s TFX. Infers and enforces a schema for feature values, and flags skew by comparing feature statistics between training and serving data.
Great Expectations for ML features. Assertions on feature values, drift detection.
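To make one of the instances above concrete, the timezone drift case can be reproduced with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, and the serving host is assumed to run at a fixed UTC-5 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the serving host runs at a fixed UTC-5 offset.
LOCAL_TZ = timezone(timedelta(hours=-5))


def hour_feature_training(epoch_seconds):
    """Training path: hour of day in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).hour


def hour_feature_serving(epoch_seconds):
    """Serving path: hour of day in local time (the drift)."""
    return datetime.fromtimestamp(epoch_seconds, tz=LOCAL_TZ).hour


# One event at 02:30 UTC, represented as epoch seconds in both paths.
event = int(datetime(2024, 3, 1, 2, 30, tzinfo=timezone.utc).timestamp())
train_hour = hour_feature_training(event)
serve_hour = hour_feature_serving(event)
print(train_hour, serve_hour)  # 2 21
```

Both values are valid hours, so nothing fails; the model simply receives "evening" where it learned "early morning".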

Why this pattern matters

Train/serve skew is one of the top production ML failure modes. Most teams that run models in production have stories of models that looked great in eval and performed markedly worse in production because of skew (anecdotally, on the order of 10-30%). The cost is direct: bad predictions, revenue impact, customer trust erosion. The cost is also indirect: engineering time spent debugging “why is this model bad?” when the model is fine and the features are the problem.

The pattern matters because ML systems don’t fail loudly the way conventional software does. Conventional software with bugs tends to throw exceptions or return wrong results that tests can catch. ML with skew produces predictions that look reasonable individually but are collectively worse than they should be. The failure mode is statistical, not deterministic. It only shows up when you aggregate many predictions and compare to a baseline.

Naming the failure mode explicitly gives teams vocabulary to prevent it. “This looks like it might be train/serve skew” is a hypothesis that leads to specific diagnostics: check the feature computation code in both paths, compute feature values on the same input in both paths, look at feature distribution drift between training and production. Teams without the vocabulary flounder longer before finding the cause.
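The third diagnostic, comparing feature distributions between training and production, can be sketched with the population stability index (PSI), one common choice among several. The bin edges, sample data, and `eps` smoothing value below are assumptions for illustration, and the alert threshold is left as a team decision.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` against `expected` over shared bins.

    `edges` are ascending interior bin boundaries; len(edges) + 1 bins result.
    Empty bins are clamped to `eps` so the logarithm stays defined
    (fractions are not renormalized afterwards, which is fine for a sketch).
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


EDGES = [25, 50, 75]
training_sample = list(range(100))                    # feature values seen in training
production_sample = [v + 30 for v in range(100)]      # same feature, shifted in production

print(psi(training_sample, training_sample, EDGES))    # 0.0
print(psi(training_sample, production_sample, EDGES))  # clearly above zero
```

Run per feature on a schedule, a check like this turns "the model is behaving weirdly" into "this specific feature moved".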

The pattern also motivates infrastructure investment. Feature stores exist primarily to prevent train/serve skew. Feature validation frameworks (TFX, Great Expectations) exist to catch skew. CI parity tests exist to prevent skew before deployment. Understanding why these exist is understanding what train/serve skew is and why it’s expensive.

Modern LLM workflows have their own version. Prompt templates used in training might differ from prompt templates used at serving. Retrieval configurations at training time might differ from serving. Context window handling might differ. Even without classical features, LLMs have their own train/serve skew risks that need the same discipline.
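One lightweight way to apply the same discipline to LLM workflows is to fingerprint the prompt and retrieval settings used in each path and compare them. This is only a sketch: the config field names are illustrative and do not come from any particular framework.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable hash of a prompt/retrieval configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_keys(a, b):
    """Names of settings whose values differ between two configs."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Hypothetical settings for the two paths.
training_config = {
    "prompt_template": "Question: {question}\nContext: {context}\nAnswer:",
    "retrieval_top_k": 5,
    "max_context_tokens": 4096,
}
serving_config = {
    "prompt_template": "Question: {question}\nContext: {context}\nAnswer: ",  # trailing space
    "retrieval_top_k": 3,
    "max_context_tokens": 4096,
}

in_sync = config_fingerprint(training_config) == config_fingerprint(serving_config)
changed = diff_keys(training_config, serving_config)
print(in_sync, changed)  # False ['prompt_template', 'retrieval_top_k']
```

The fingerprint gives a cheap equality check to store alongside a trained model; the key diff tells you where to look when the check fails.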

How skew keeps happening despite awareness: teams add new features to serving without updating training pipelines; teams change feature computation logic in one path but not the other; teams skip parity tests because “we’re moving too fast”; teams monitor model metrics but not feature distributions. Each is a specific anti-pattern with a specific fix.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y5 Phase 40 introduces parity testing on Feast features.
DEEP     Promoted after Y5 end — at least one observed near-miss where the
         pattern caught a skew that would have shipped.

Preview: what OUTLINE will answer

When Y5 Phase 40 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you prevent the silent failure mode where models perform worse in production than in eval because features differ?
PRINCIPLES. Single source of truth for feature logic (feature store). Point-in-time correctness for training. Parity tests in CI. Feature distribution monitoring in production vs training. Named vocabulary so teams recognize the failure mode fast.
TRADE-OFFS. Feature store approach (safe, adds infrastructure) vs manual discipline (cheap, error-prone). Aggressive parity tests (safe, slow CI) vs lightweight (fast, less coverage). Distribution monitoring (production-focused) vs pre-deployment testing (early catch).
TOOLS (time-stamped as of 2026-06): Feast + CI parity tests (basecamp default), TFX Data Validation, Great Expectations, WhyLabs, Arize AI, custom monitoring dashboards.

The DEEP promotion, after Y5 end with at least one near-miss caught, will add MASTERY (operating parity-tested feature pipelines), COMPARE (feature store vs distribution monitoring approaches), OPERATE (a specific skew near-miss and how it was caught), and CONTRIBUTE (a case study of skew prevention).

Canonical references

Google’s “Rules of Machine Learning” (Martin Zinkevich). Free. Its section on training-serving skew is definitive.
Chip Huyen, Designing Machine Learning Systems, sections on production ML failure modes.
TFX documentation on Data Validation. Free at tensorflow.org/tfx.
Andrew Ng’s “MLOps: From Model-centric to Data-centric AI” talks. Free.
ML in production talks from various companies (Uber, Airbnb, Netflix). Free.

Cross-references

Y5 Phase 40: Feature Stores
Related: feature-store, drift-detection, evals