Train-Serve Skew

The pattern: features are computed differently at training time and at inference time, so the model performs worse in production than in offline eval. It is one of the most common ML production bugs. Sources: divergent code paths (SQL in a notebook vs. Python in a service), point-in-time leakage (training rows that see data not yet available at prediction time), distribution shift, and data freshness gaps.
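A minimal sketch of the divergent-code-path source, assuming a hypothetical "average order value" feature. The function names, the sentinel values, and the sample rows are all illustrative and not taken from any real pipeline. Two implementations share one intent but differ in rounding and in their default for missing data, and a small parity check surfaces the rows where they disagree.

```python
import math

def training_feature(amount_cents, n_orders):
    """Notebook-style path (hypothetical): average order value in dollars."""
    if n_orders == 0:
        return 0.0  # "no orders" encoded as 0.0
    return (amount_cents / 100) / n_orders

def serving_feature(amount_cents, n_orders):
    """Service-style path (hypothetical): same intent, subtly different result."""
    if n_orders == 0:
        return -1.0  # different sentinel for "no orders"
    return (amount_cents // 100) / n_orders  # integer division drops the cents

def skew_report(rows, tol=1e-9):
    """Return (row, training_value, serving_value) where the two paths disagree."""
    mismatches = []
    for row in rows:
        a = training_feature(**row)
        b = serving_feature(**row)
        if not math.isclose(a, b, abs_tol=tol):
            mismatches.append((row, a, b))
    return mismatches

rows = [
    {"amount_cents": 1000, "n_orders": 2},  # both paths agree
    {"amount_cents": 1999, "n_orders": 1},  # truncation makes them differ
    {"amount_cents": 0, "n_orders": 0},     # sentinels differ
]
mismatches = skew_report(rows)
for row, train_val, serve_val in mismatches:
    print(row, "training:", train_val, "serving:", serve_val)
```

Replaying a sample of logged serving inputs through the training path is one way to run a check like this. Neither function raises an error, which is why this kind of skew tends to go unnoticed until model quality drops.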

The trade-off: iteration speed vs. correctness guarantee. Inline feature engineering in notebooks is fast. Routing features through a feature store is slower upfront but removes the divergent-code-path source of skew structurally: one definition, two materializations (offline for training, online for inference). Drift detection in production catches the skew that creeps in despite discipline.
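The "one definition, two materializations" idea can be sketched without any feature-store library. The code below is plain Python, not the Feast API. The feature (`order_count_7d`), the event data, and both helper names are assumptions made for illustration. The offline and online paths call the same function, and the offline path computes each row as of its label time, so later events cannot leak in.

```python
from datetime import datetime, timedelta

def order_count_7d(events, as_of):
    """Single feature definition: orders in the 7 days up to and including as_of."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for ts in events if window_start < ts <= as_of)

def materialize_offline(events_by_user, label_times):
    """Offline: one training value per (user, label time), computed as of that time."""
    return {
        (user, as_of): order_count_7d(events_by_user.get(user, []), as_of)
        for user, as_of in label_times
    }

def materialize_online(events_by_user, user, now):
    """Online: a single value for the request being served."""
    return order_count_7d(events_by_user.get(user, []), now)

# Hypothetical event log; the last event happens after the label time below.
events_by_user = {
    "u1": [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 20)],
}
t = datetime(2024, 1, 6)

offline = materialize_offline(events_by_user, [("u1", t)])
online = materialize_online(events_by_user, "u1", t)
print(offline[("u1", t)], online)
```

Because both paths share `order_count_7d`, a change to the window or the boundary condition reaches training and serving together. The event on January 20 is excluded from the offline row because it falls after the label time.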

Deepens in Year 4 Phase 22: Feature Store (Feast eliminates the structural source), and is reinforced in Phase 25: GPU Infrastructure, where KS-test drift detection on llm-gateway catches the residue. DEEP after both.
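A rough sketch of what KS-test drift detection on a single numeric feature could look like. How llm-gateway actually implements it is not specified here, so the monitored feature, the sample values, and the `alpha` threshold are assumptions. The statistic is the largest gap between the two empirical CDFs, and the decision rule uses the standard large-sample critical value for the two-sample Kolmogorov-Smirnov test.

```python
import math

def ks_statistic(reference, live):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both samples past every value <= x, then compare the ECDFs at x.
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def drifted(reference, live, alpha=0.05):
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(live)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha=0.05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > critical

# Hypothetical feature values, e.g. prompt token counts seen at training time.
reference = list(range(100, 200))
live_same = list(range(100, 200))          # production matches training
live_shifted = [x + 50 for x in reference]  # production has shifted upward

print(drifted(reference, live_same), drifted(reference, live_shifted))
```

In practice a library routine such as `scipy.stats.ks_2samp` would normally replace the hand-rolled statistic. The pure-Python version is used here only to keep the sketch dependency-free.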

Related patterns

feature-store — the structural fix.
model-lifecycle — drift monitoring is the lifecycle step that catches what slipped.
inference-shapes — online vs. batch is where divergent code paths typically hide.
rag-as-pattern — chunker-at-ingest vs. chunker-at-query is the RAG flavor of the same bug.
prompt-as-program — prompt drift is the LLM-era cousin.
schema-on-read-vs-write — schema-on-read pipelines are where silent feature drift lives.
snapshot-plus-delta — point-in-time-correct training data depends on this.
mlship — emits skew + drift telemetry by default on every deploy.
basecamp — Feast (Tier 5) and drift detection (Tier 6) are where this pattern is enforced.
