Train-Serve Skew
The pattern: features are computed differently at training time and at inference time, so the model performs worse in production than in offline eval. One of the most common ML production bugs. Sources: different code paths (SQL in a notebook vs. Python in a service), point-in-time leakage, distribution shift, data freshness gaps.
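Of the sources listed above, point-in-time leakage is the easiest to show in a few lines. The sketch below is illustrative only: the `history` data, the integer timestamps, and both function names are assumptions, not part of any library. It contrasts a leaky "latest value" lookup with a lookup restricted to what was known when the label was observed.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, lifetime_purchase_count),
# sorted by timestamp. Timestamps are plain integers (e.g. days) for brevity.
history = [(1, 0), (5, 2), (9, 7)]


def latest_value(history):
    """Leaky lookup: returns the newest value, ignoring when the label was observed."""
    return history[-1][1]


def value_as_of(history, label_ts):
    """Point-in-time lookup: newest value recorded at or before label_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx > 0 else None


# A training label observed at t=6 may only see features known by t=6.
label_ts = 6
leaky = latest_value(history)             # reads the t=9 row, which lies in the label's future
correct = value_as_of(history, label_ts)  # reads the t=5 row
print(leaky, correct)
```

A model trained on the leaky value sees information that will not exist at inference time, which is why offline metrics look better than production ones.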
The trade-off: iteration speed vs. correctness guarantee. Inline feature engineering in notebooks is fast. Doing it via a feature store is slower upfront but eliminates skew structurally — same definition, two materializations (offline for training, online for inference). Drift detection in production catches the skew that crept in despite discipline.
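The "same definition, two materializations" idea can be sketched without any feature-store library. This is a minimal plain-Python illustration, not Feast's API: the feature `days_since_last_purchase`, the event fields, and all function names are hypothetical. The point is that the offline and online paths both call one definition rather than re-implementing it.

```python
from datetime import datetime


def days_since_last_purchase(last_purchase_at, as_of):
    """The single feature definition. Both paths below call this and nothing else."""
    delta = as_of - last_purchase_at
    return max(delta.days, 0)  # clamp so a purchase after `as_of` never goes negative


def build_training_rows(events):
    """Offline materialization: one row per historical event, as of the event time."""
    return [
        {
            "user_id": e["user_id"],
            "days_since_last_purchase": days_since_last_purchase(
                e["last_purchase_at"], e["event_at"]
            ),
        }
        for e in events
    ]


def serve_features(user_id, last_purchase_at, now):
    """Online materialization: one row for the current request, as of `now`."""
    return {
        "user_id": user_id,
        "days_since_last_purchase": days_since_last_purchase(last_purchase_at, now),
    }


# Parity check: replaying a historical event through the online path
# should reproduce the offline row exactly.
event = {
    "user_id": "u1",
    "last_purchase_at": datetime(2024, 1, 1),
    "event_at": datetime(2024, 1, 11),
}
offline_row = build_training_rows([event])[0]
online_row = serve_features("u1", event["last_purchase_at"], event["event_at"])
print(offline_row == online_row)
```

A replay check like the last few lines is cheap to run in CI and catches the case where someone later edits one path but not the other.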
Deepens in Year 4 Phase 22: Feature Store (Feast eliminates the structural source) and is reinforced in Phase 25: GPU Infrastructure, where KS-test drift detection on llm-gateway catches the residue. DEEP after both.
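For orientation, a two-sample KS drift check can be sketched in pure Python. This is a hedged illustration, not how llm-gateway implements it: the sample data, the function names, and the choice of `alpha=0.05` are assumptions, and the threshold uses the standard large-sample approximation. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import math


def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance both samples past x so ties are counted in both CDFs.
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def drifted(reference, live, alpha=0.05):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(live)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha = 0.05
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > threshold


# Hypothetical feature values: a training-time sample and a shifted production sample.
training_sample = list(range(100))
production_sample = [x + 50 for x in range(100)]
print(ks_statistic(training_sample, production_sample))
print(drifted(training_sample, production_sample))
```

The test compares whole distributions rather than means, so it can flag a feature whose average is stable but whose shape has changed.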
Related patterns
- feature-store — the structural fix.
- model-lifecycle — drift monitoring is the lifecycle step that catches what slipped.
- inference-shapes — online vs. batch is where divergent code paths typically hide.
- rag-as-pattern — chunker-at-ingest vs. chunker-at-query is the RAG flavor of the same bug.
- prompt-as-program — prompt drift is the LLM-era cousin.
- schema-on-read-vs-write — schema-on-read pipelines are where silent feature drift lives.
- snapshot-plus-delta — point-in-time-correct training data depends on this.
mlship — emits skew + drift telemetry by default on every deploy.
basecamp — Feast (Tier 5) and drift detection (Tier 6) are where this pattern is enforced.