Feature Store: Feast
Third phase of Year 4. Features as a first-class citizen: defined once, used for both training and serving. Feast on Iceberg. Train/serve skew prevention as the central discipline. ~7 weeks, ~80 hrs.
Where this phase sits
Phase 20 registered models. Phase 21 served them. P22 fills the gap that opens the moment you have both — the question of where features come from. In P20 and P21 you wrote feature SQL twice: once in the JupyterHub notebook for training, once (implicitly) in whatever you fed the KServe inference call. That works exactly until your codebases diverge by a week, at which point your model trains on one definition of commits_last_7d and predicts on a different one. That’s train/serve skew, and it’s the most common ML production bug nobody catches in code review.
P22 is shorter than the surrounding phases (~80 hours vs ~95-100) because Feast’s surface area is smaller than KServe’s or Kubeflow’s. The depth is conceptual, not topological. By the end you’ve done the surgery on next-week-commits — features defined in Feast, training pulls from the offline view, inference pulls from the online view, the skew-prone code paths from P20/P21 are gone. Two patterns reach DEEP this phase: feature-store and train-serve-skew. Both come from the ml-and-ai pattern category.
The infrastructure landing in basecamp this phase is small but durable: Feast on top of Year 3’s Iceberg lakehouse, Redis from Year 1 as the online store, Airflow from Year 3 driving the materialization schedule. Tier 5 of the basecamp plan is now real ML infrastructure, not a single MLflow instance.
Prerequisites
- Phase 21 complete — KServe + Ray + mlship v0
- You accept that the feature store solves one specific problem: train/serve skew. If you build features differently for training and inference, your model fails in subtle ways. Feast forces the same definition on both sides.
Why this phase exists
In P20-21 you trained on Iceberg (abukix.commits) and served from a model artifact. The features were fine because you wrote both sides. In a real platform, training is offline (batch) and serving is online (low-latency). Different code paths = drift.
→ Pattern: feature-store
→ Pattern: train-serve-skew
Feast solves: define a feature once (SQL or Python), get an offline view (for training, point-in-time-correct) and an online view (for inference, low-latency).
1. PROBLEM
You want a feature like “commits in the last 7 days.” For training, that means: as-of any historical timestamp, what was the value? (point-in-time correctness). For inference, that means: right now, what’s the value? (low-latency lookup).
If you write SQL twice (once in the training notebook, once in the inference service), the two will drift. The model learns one thing; production sees another. That’s train/serve skew, and it’s the single most common ML production bug.
Feast’s promise: define the feature once; the feature store handles offline (Iceberg) + online (Redis or DynamoDB) materialization.
2. PRINCIPLES
2.1 Features as named entities
A feature has: a name, a type, an entity (the thing it describes — e.g., a repo), a source query, freshness expectations.
→ Pattern: feature-store
Investigate:
- Read Feast docs Concepts page
- Define commits_last_7d for entity repo from abukix.commits in Iceberg
- What’s an entity? When do you need a join key vs. just a primary key?
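As a rough illustration of the attributes listed above, a feature definition can be modeled as a small record. This is a stdlib-only sketch, not Feast's API: the `FeatureDef` class, its field names, the `repo_id` join key, the `committed_at` column, and the SQL text are all hypothetical and exist only to make the shape concrete.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FeatureDef:
    """Illustrative stand-in for a feature-store definition (not Feast's API)."""
    name: str                 # what models ask for
    dtype: type               # value type
    entity: str               # the thing described, e.g. a repo
    join_key: str             # column used to join features onto entity rows
    source_query: str         # where the values come from
    max_staleness: timedelta  # freshness expectation


COMMITS_LAST_7D = FeatureDef(
    name="commits_last_7d",
    dtype=int,
    entity="repo",
    join_key="repo_id",  # hypothetical column name
    source_query=(
        # Hypothetical SQL; the committed_at column is an assumption.
        "SELECT repo_id, COUNT(*) AS commits_last_7d "
        "FROM abukix.commits "
        "WHERE committed_at >= :as_of - INTERVAL '7' DAY "
        "AND committed_at < :as_of "
        "GROUP BY repo_id"
    ),
    max_staleness=timedelta(hours=1),
)

# A registry keyed by name is what makes "defined once" enforceable:
# training and serving both look the definition up here.
REGISTRY = {COMMITS_LAST_7D.name: COMMITS_LAST_7D}
```

The entity names what the feature describes; the join key is the column that attaches feature values to rows for that entity. Keeping the two separate is one way to approach the join-key question in the list above.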
2.2 Point-in-time correctness
For a training row dated 2025-03-15, what was commits_last_7d for repo=abukix/mlship as of 2025-03-15? Not as of now.
Investigate:
- Iceberg time-travel (Y3 P15) is the substrate Feast leans on
- Why is it wrong to “just left-join the latest values”?
- Test: train two models, one with PIT-correct features, one with current-value features; compare evaluation
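The difference between an as-of lookup and a latest-value join can be sketched in a few lines. This is a stdlib-only illustration of the idea, not how Feast implements its point-in-time join; the `HISTORY` data and both helper functions are made up for the example.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of commits_last_7d for one repo: (as-of date, value),
# sorted by date. The numbers are invented for illustration.
HISTORY = {
    "abukix/mlship": [
        (date(2025, 3, 1), 4),
        (date(2025, 3, 10), 9),
        (date(2025, 3, 20), 2),
        (date(2025, 6, 1), 30),
    ]
}


def value_as_of(repo, as_of):
    """Return the most recent value recorded at or before as_of, else None."""
    rows = HISTORY.get(repo, [])
    dates = [d for d, _ in rows]
    idx = bisect_right(dates, as_of)  # number of rows with date <= as_of
    return rows[idx - 1][1] if idx else None


def latest_value(repo):
    """The 'just left-join the latest values' shortcut: ignores as_of."""
    rows = HISTORY.get(repo, [])
    return rows[-1][1] if rows else None


training_row = ("abukix/mlship", date(2025, 3, 15))
pit = value_as_of(*training_row)        # 9: what was known on 2025-03-15
leaky = latest_value(training_row[0])   # 30: a value from the row's future
```

The latest-value join hands the model information that did not exist at the row's timestamp, so offline evaluation looks better than production will.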
2.3 Online materialization
For low-latency inference, the feature must be precomputed + cached. Feast materializes from offline → online (Redis / DynamoDB) on a schedule.
→ Pattern: caching (revisited)
Investigate:
- Feast materialize job: schedule via Airflow (Y3 P17)
- TTL strategy: how often does commits_last_7d change? Materialize hourly? Daily?
- Cold start for online store: what does the first request after empty Redis do?
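The offline-to-online copy, the TTL check, and the cold-start case can be sketched with a plain dict standing in for Redis. This is a simplified model under stated assumptions, not Feast's materialization code: the 2-hour `TTL`, the row layout, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=2)  # assumed TTL, chosen only for the example


def materialize(offline_rows, online_store, now):
    """Copy the newest offline value per repo into the online store."""
    newest = {}
    for repo, event_ts, value in offline_rows:
        if repo not in newest or event_ts > newest[repo][0]:
            newest[repo] = (event_ts, value)
    for repo, (event_ts, value) in newest.items():
        online_store[repo] = {
            "value": value,
            "event_ts": event_ts,
            "written_at": now,
        }


def get_online(online_store, repo, now):
    """Return the cached value, or None on a miss or an expired entry."""
    entry = online_store.get(repo)
    if entry is None or now - entry["event_ts"] > TTL:
        return None
    return entry["value"]


now = datetime(2025, 3, 15, 12, 0, tzinfo=timezone.utc)
offline = [
    ("abukix/mlship", now - timedelta(hours=5), 7),
    ("abukix/mlship", now - timedelta(hours=1), 9),
]
online = {}  # stands in for Redis

cold = get_online(online, "abukix/mlship", now)  # miss: nothing materialized
materialize(offline, online, now)
warm = get_online(online, "abukix/mlship", now)  # newest offline value
stale = get_online(online, "abukix/mlship", now + timedelta(hours=3))  # expired
```

In this sketch a cold or expired lookup returns `None`, which pushes the decision to the caller: fail the request, fall back to a default, or compute on demand. Whichever is chosen should match what the model saw for missing values during training.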
2.4 Train/serve skew prevention
→ Pattern: train-serve-skew
The same feature definition is the only path to either side. No SQL in the training notebook. No SQL in the inference service. Just Feast retrieval calls against the same feature view: get_historical_features for training, get_online_features for serving.
Investigate:
- Audit your P20 model: where did features come from? Was it skew-prone?
- Refactor: define those features in Feast; train via Feast; serve via Feast
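A toy example of how skew creeps in when the definition is copied rather than shared. This is a plain-Python sketch, not Feast code; the function names and the commit timestamps are invented, and the "drifted" copy shows two typical divergences (an inclusive upper bound and an off-by-one window).

```python
from datetime import datetime, timedelta


def commits_last_7d(commit_times, as_of):
    """The one shared definition: commits in the window [as_of - 7d, as_of)."""
    start = as_of - timedelta(days=7)
    return sum(1 for t in commit_times if start <= t < as_of)


def training_value(commit_times, label_time):
    # Training asks for the value as of a historical timestamp.
    return commits_last_7d(commit_times, as_of=label_time)


def serving_value(commit_times, now):
    # Serving asks for the value as of "now", through the same function.
    return commits_last_7d(commit_times, as_of=now)


def drifted_serving_value(commit_times, now):
    """A hand-written copy that drifted: inclusive upper bound, 8-day window."""
    start = now - timedelta(days=8)
    return sum(1 for t in commit_times if start <= t <= now)


as_of = datetime(2025, 3, 15)
commits = [as_of - timedelta(days=d) for d in (0, 1, 3, 7, 8, 10)]

shared_train = training_value(commits, as_of)    # 3
shared_serve = serving_value(commits, as_of)     # 3, same code path
drifted = drifted_serving_value(commits, as_of)  # 5, same name, different number
```

Both versions would pass a casual review as "commits in the last week", which is why the audit above asks where each feature actually came from rather than what it was called.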
2.5 Feature reuse
A feature defined once is queryable by every model. Reuse increases consistency + reduces effort.
Investigate:
- Define 5 features for the repo entity
- Define 3 features for a new entity author
- Show that two models can share commits_last_7d
2.6 Feature freshness as SLI
Features have freshness SLOs: “commits_last_7d updated within 1 hour, 99%.”
→ Pattern: sli-slo-error-budget (Y2 — applied to features now)
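One way to turn the example SLO into a number is sketched below. The probe data is invented and the helper name is hypothetical; in practice the ages would come from whatever records the online store's last update time.

```python
from datetime import timedelta

FRESHNESS_THRESHOLD = timedelta(hours=1)  # "updated within 1 hour"
SLO_TARGET = 0.99                         # "99%"


def freshness_sli(observed_ages):
    """Fraction of probes where the feature was within the freshness threshold."""
    if not observed_ages:
        raise ValueError("need at least one probe")
    good = sum(1 for age in observed_ages if age <= FRESHNESS_THRESHOLD)
    return good / len(observed_ages)


# Made-up probe results: 200 checks, 3 of which found the feature stale.
ages = [timedelta(minutes=20)] * 197 + [timedelta(minutes=95)] * 3

sli = freshness_sli(ages)       # 197 good probes out of 200
slo_met = sli >= SLO_TARGET     # False here: three stale probes is too many
allowed_stale = (1 - SLO_TARGET) * len(ages)  # error budget, in probes
```

With 200 probes and a 99% target, the error budget is roughly two stale probes, so three misses already breaches it.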
3. TRADE-OFFS
| Decision | Option A | Option B | Option C |
|---|---|---|---|
| Feature store | Feast (OSS) | Tecton (commercial) | Hopsworks |
| Online store | Redis | DynamoDB | Cassandra |
| Offline store | Iceberg via Spark | BigQuery | Snowflake |
| Materialization | scheduled (Airflow) | streaming (Flink) | hybrid |
| Definitions | Python (Feast) | SQL (dbt-feature-store) | Both |
4. TOOLS (as of 2025-10)
- Feast 0.40+ (Python feature store)
- Redis (already from Y1 — online store)
- Iceberg + Spark (Y3 — offline store)
- Airflow (Y3 — materialization scheduling)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Feast docs — Concepts + Quickstart | The implementation |
| “Designing ML Systems” Ch. 5 (Feature Engineering) | The discipline |
| “Reliable Machine Learning” Ch. 6 (Feature Engineering Reliability) | The depth |
5.2 Operational depth checklist
[ ] Deploy Feast on basecamp Tier 5 (Postgres registry + Redis online + Iceberg offline)
[ ] Define entity `repo` and feature `commits_last_7d`
[ ] Define entity `author` and 3 features (commits_total, langs_used, weekly_streak)
[ ] Materialize all 4 features to Redis on an Airflow schedule
[ ] Build a Feast feature view; query offline (training) + online (serving)
[ ] Verify point-in-time correctness: same feature value at training-as-of-X
[ ] Refactor next-week-commits model to use Feast features end-to-end (no inline SQL)
[ ] Define freshness SLO for each feature; wire to Grafana
[ ] Add a feature lineage view in DataHub (already from Y3 P19)
[ ] Test train/serve skew: deliberately introduce skew; observe model accuracy drop
[ ] Document features in basecamp/README.md (Feast registry + how to add)
6. COMPARE: Feast vs DIY (Postgres-only feature serving)
For low-throughput models, you could skip Feast and just SELECT from Iceberg/Postgres. When does Feast earn its weight?
400 words.
7. OPERATE
- 3+ runbooks (feast-materialization-stuck, redis-online-store-stale, feature-freshness-slo-violation)
- 1+ ADR (why-feast-over-diy-for-basecamp)
- Weekly log
8. CONTRIBUTE
Feast itself, Feast connectors (e.g., the Iceberg connector), feature-store community.
Validation criteria
[ ] All 11 operational depth checks
[ ] Feast running in basecamp Tier 5 (registry + offline + online)
[ ] At least 5 features defined across 2 entities
[ ] next-week-commits model refactored to use Feast end-to-end
[ ] Freshness SLOs wired to Grafana
[ ] Feast vs DIY comparison written up
[ ] 3+ runbooks; 1+ ADR; 7+ weekly log entries
[ ] Pattern entries deepened:
    - feature-store → DEEP
    - train-serve-skew → DEEP (refactor caught the skew)
[ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min): define a new feature mean_files_per_commit_last_30d; materialize to Redis; query offline + online; verify PIT correctness with a 1-month-ago test.
- Diagnose (60 min): scenario: model accuracy dropped after a Feast schema change. Trace via lineage + DataHub; find the skew.
- Articulate (30 min): 600 words: “Walk a feature from definition to model training to inference. What contracts must hold for skew prevention?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Inline feature SQL in inference service | Train/serve skew waiting to happen |
| Materialize on every inference | Defeats the online store; Redis exists for a reason |
| Features without freshness SLOs | Stale features = bad predictions silently |
| Feast for “single user, single model” | Feast for shared/scaled use; one-off can be DIY |
Patterns deepened this phase
- feature-store → DEEP
- train-serve-skew → DEEP
→ Next: Phase 23: Kubeflow Operations