Feature Store: Feast

Third phase of Year 4. Features as a first-class citizen: defined once, used for both training and serving. Feast on Iceberg. Train/serve skew prevention as the central discipline. ~7 weeks, ~80 hrs.

Where this phase sits

Phase 20 registered models. Phase 21 served them. P22 fills the gap that opens the moment you have both — the question of where features come from. In P20 and P21 you wrote feature SQL twice: once in the JupyterHub notebook for training, once (implicitly) in whatever you fed the KServe inference call. That works exactly until your codebases diverge by a week, at which point your model trains on one definition of commits_last_7d and predicts on a different one. That’s train/serve skew, and it’s the most common ML production bug nobody catches in code review.

P22 is shorter than the surrounding phases (~80 hours vs ~95-100) because Feast’s surface area is smaller than KServe’s or Kubeflow’s. The depth is conceptual, not topological. By the end you’ve done the surgery on next-week-commits — features defined in Feast, training pulls from the offline view, inference pulls from the online view, the skew-prone code paths from P20/P21 are gone. Two patterns reach DEEP this phase: feature-store and train-serve-skew. Both come from the ml-and-ai pattern category.

The infrastructure landing in basecamp this phase is small but durable: Feast on top of Year 3’s Iceberg lakehouse, Redis from Year 1 as the online store, Airflow from Year 3 driving the materialization schedule. Tier 5 of the basecamp plan is now real ML infrastructure, not a single MLflow instance.

Prerequisites

Phase 21 complete — KServe + Ray + mlship v0

You accept that the feature store solves one specific problem: train/serve skew. If you build features differently for training and inference, your model fails in subtle ways. Feast forces the same definition on both sides.

Why this phase exists

In P20-21 you trained on Iceberg (abukix.commits) and served from a model artifact. The features were fine because you wrote both sides. In a real platform, training is offline (batch) and serving is online (low-latency). Different code paths = skew.

→ Pattern: feature-store → Pattern: train-serve-skew

Feast solves: define a feature once (SQL or Python), get an offline view (for training, point-in-time-correct) and an online view (for inference, low-latency).

1. PROBLEM

You want a feature like “commits in the last 7 days.” For training, that means: as-of any historical timestamp, what was the value? (point-in-time correctness). For inference, that means: right now, what’s the value? (low-latency lookup).

If you write SQL twice (once in the training notebook, once in the inference service), the two will drift. The model learns one thing; production sees another. That’s train/serve skew, and it’s the single most common ML production bug.
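A minimal sketch of how two hand-written copies of the same feature can disagree. The commit timestamps and both function names are hypothetical; the only difference between the copies is whether the 7-day boundary is inclusive, which is the kind of detail that slips through code review.

```python
from datetime import datetime, timedelta

# Hypothetical commit timestamps for one repo.
commits = [
    datetime(2025, 3, 8, 0, 0),    # exactly 7 days before as_of
    datetime(2025, 3, 10, 9, 30),
    datetime(2025, 3, 14, 18, 0),
]
as_of = datetime(2025, 3, 15, 0, 0)


def commits_last_7d_training(ts_list, as_of):
    """Notebook copy: the window start is inclusive."""
    start = as_of - timedelta(days=7)
    return sum(1 for ts in ts_list if start <= ts < as_of)


def commits_last_7d_serving(ts_list, as_of):
    """Service copy: same feature name, but the window start is exclusive."""
    start = as_of - timedelta(days=7)
    return sum(1 for ts in ts_list if start < ts < as_of)


train_value = commits_last_7d_training(commits, as_of)
serve_value = commits_last_7d_serving(commits, as_of)
print(train_value, serve_value)
```

Both functions claim to compute commits_last_7d, yet the model is trained on one value and sees another at inference for the same repo and timestamp.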

Feast’s promise: define the feature once; the feature store handles offline (Iceberg) + online (Redis or DynamoDB) materialization.

2. PRINCIPLES

2.1 Features as named entities

A feature has: a name, a type, an entity (the thing it describes — e.g., a repo), a source query, freshness expectations.

→ Pattern: feature-store

Investigate:

Read Feast docs Concepts page
Define commits_last_7d for entity repo from abukix.commits in Iceberg
What’s an entity? When do you need a join key vs. just a primary key?
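The five attributes listed above can be written down as a plain data structure before touching Feast. This is an illustrative sketch only, not Feast's API: the class names, the repo_id join key, and the abukix.commits_features table are assumptions. Feast's own Entity, FeatureView, and Field objects carry equivalent information.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Entity:
    name: str       # the thing a feature describes, e.g. a repo
    join_key: str   # column used to join feature rows to training rows


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    dtype: type
    entity: Entity
    source_query: str
    max_staleness: timedelta  # freshness expectation


# Hypothetical join key; pick whatever column identifies a repo in your data.
repo = Entity(name="repo", join_key="repo_id")

commits_last_7d = FeatureSpec(
    name="commits_last_7d",
    dtype=int,
    entity=repo,
    # Hypothetical derived table holding one row per repo per event timestamp.
    source_query=(
        "SELECT repo_id, event_timestamp, commits_last_7d "
        "FROM abukix.commits_features"
    ),
    max_staleness=timedelta(hours=1),
)

print(commits_last_7d.name, commits_last_7d.entity.join_key)
```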

2.2 Point-in-time correctness

For a training row dated 2025-03-15, what was commits_last_7d for repo=abukix/mlship as of 2025-03-15? Not as of now.

Investigate:

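Feast's point-in-time join keys on the event timestamp column in the source data; Iceberg time-travel (Y3 P15) is snapshot-based and answers a different question. Work out where each one applies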
Why is it wrong to “just left-join the latest values”?
Test: train two models, one with PIT-correct features, one with current-value features; compare evaluation
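A small sketch of the difference between a point-in-time lookup and a "latest value" join, using only the standard library. The feature history values are made up, and value_as_of is a hypothetical helper, not a Feast function.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for repo=abukix/mlship:
# (event_timestamp, commits_last_7d), sorted by event_timestamp.
history = [
    (datetime(2025, 3, 1), 4),
    (datetime(2025, 3, 14), 9),
    (datetime(2025, 6, 1), 31),
]


def value_as_of(history, as_of):
    """Most recent value with event_timestamp <= as_of, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


def latest_value(history):
    """What a naive left-join on the latest values would return."""
    return history[-1][1]


label_time = datetime(2025, 3, 15)
pit_value = value_as_of(history, label_time)
leaky_value = latest_value(history)
print(pit_value, leaky_value)
```

The latest-value join hands the 2025-03-15 training row a number that only existed in June, so the model trains on information from the future that it will never have at inference time.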

2.3 Online materialization

For low-latency inference, the feature must be precomputed + cached. Feast materializes from offline → online (Redis / DynamoDB) on a schedule.

→ Pattern: caching (revisited)

Investigate:

Feast materialize job — schedule via Airflow (Y3 P17)
TTL strategy: how often does commits_last_7d change? Materialize hourly? Daily?
Cold start for online store: what does the first request after empty Redis do?
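To reason about TTL and cold start, it can help to model the online store as a dictionary. The sketch below is an in-memory stand-in, not Redis and not Feast's materialization code; the OnlineStore class and its expiry rule (event timestamp plus TTL) are assumptions made for illustration.

```python
from datetime import datetime, timedelta


class OnlineStore:
    """In-memory stand-in for Redis: entity key -> (value, event_timestamp)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}

    def materialize(self, rows):
        """Load offline rows, keeping only the newest row per entity key."""
        for key, event_ts, value in rows:
            current = self._data.get(key)
            if current is None or event_ts > current[1]:
                self._data[key] = (value, event_ts)

    def get(self, key, now):
        """Return the value, or None on a miss or when the value is past its TTL."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, event_ts = entry
        if now - event_ts > self.ttl:
            return None
        return value


store = OnlineStore(ttl=timedelta(hours=24))
now = datetime(2025, 3, 15, 12, 0)

cold = store.get("abukix/mlship", now)  # empty store: nothing to serve
store.materialize([
    ("abukix/mlship", datetime(2025, 3, 15, 11, 0), 9),
    ("abukix/mlship", datetime(2025, 3, 15, 10, 0), 8),
])
warm = store.get("abukix/mlship", now)
stale = store.get("abukix/mlship", now + timedelta(days=2))
print(cold, warm, stale)
```

The cold and stale cases both return None, which means the inference service needs an explicit policy for a missing feature (default value, reject, or fall back), decided before the first request hits an empty Redis.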

2.4 Train/serve skew prevention

→ Pattern: train-serve-skew

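The same feature definition is the only path to either side. No SQL in the training notebook. No SQL in the inference service. Just FeatureStore.get_historical_features(...) for training and FeatureStore.get_online_features(...) for inference, both naming the same commits_last_7d feature.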

Investigate:

Audit your P20 model: where did features come from? Was it skew-prone?
Refactor: define those features in Feast; train via Feast; serve via Feast
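One way to run the audit is to log the feature values the service actually used and recompute them offline at the same timestamps. The sketch below assumes such a log exists; the log format, the sample data, and the skew_report helper are all hypothetical.

```python
from datetime import datetime, timedelta


def commits_last_7d(commit_times, as_of):
    """Reference definition: commits in [as_of - 7 days, as_of)."""
    start = as_of - timedelta(days=7)
    return sum(1 for ts in commit_times if start <= ts < as_of)


def skew_report(served_log, commit_times_by_repo):
    """Compare logged serving values with an offline recomputation."""
    mismatches = []
    for repo, request_time, served_value in served_log:
        expected = commits_last_7d(commit_times_by_repo[repo], request_time)
        if expected != served_value:
            mismatches.append((repo, request_time, served_value, expected))
    return mismatches


# Hypothetical raw data and serving log.
commit_times_by_repo = {
    "abukix/mlship": [
        datetime(2025, 3, 10, 9, 0),
        datetime(2025, 3, 12, 14, 0),
        datetime(2025, 3, 14, 18, 0),
    ],
}
served_log = [
    ("abukix/mlship", datetime(2025, 3, 15), 3),  # matches the recomputation
    ("abukix/mlship", datetime(2025, 3, 18), 3),  # service served an old value
]

mismatches = skew_report(served_log, commit_times_by_repo)
print(mismatches)
```

Any row in the report is a request where the model saw a value the training path would never have produced. After the refactor to Feast, the same check works as a regression test.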

2.5 Feature reuse

A feature defined once is queryable by every model. Reuse increases consistency + reduces effort.

Investigate:

Define 5 features for the repo entity
Define 3 features for a new entity author
Show that two models can share commits_last_7d

2.6 Feature freshness as SLI

Features have freshness SLOs: “commits_last_7d was updated within the last hour, 99% of the time.”

→ Pattern: sli-slo-error-budget (Y2 — applied to features now)
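A freshness SLI can be computed as the fraction of periodic checks in which the feature's age was within the threshold. This is a sketch under assumed inputs: the observation pairs are invented, and in practice they would come from whatever probe feeds Grafana.

```python
from datetime import datetime, timedelta


def freshness_sli(observations, threshold):
    """Fraction of checks where (checked_at - last_updated_at) <= threshold."""
    if not observations:
        return None
    good = sum(
        1 for checked_at, updated_at in observations
        if checked_at - updated_at <= threshold
    )
    return good / len(observations)


# Hypothetical probe results: (checked_at, feature_last_updated_at).
observations = [
    (datetime(2025, 3, 15, 12, 0), datetime(2025, 3, 15, 11, 30)),
    (datetime(2025, 3, 15, 13, 0), datetime(2025, 3, 15, 12, 45)),
    (datetime(2025, 3, 15, 14, 0), datetime(2025, 3, 15, 12, 45)),  # 75 min old
    (datetime(2025, 3, 15, 15, 0), datetime(2025, 3, 15, 14, 10)),
]

sli = freshness_sli(observations, threshold=timedelta(hours=1))
slo_target = 0.99
meets_slo = sli >= slo_target
print(sli, meets_slo)
```

With a 99% target, a single missed hourly materialization in a four-check window already burns the error budget, which is a useful input when choosing between hourly and daily schedules.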

3. TRADE-OFFS

Decision	Option A	Option B	Option C
Feature store	Feast (OSS)	Tecton (commercial)	Hopsworks
Online store	Redis	DynamoDB	Cassandra
Offline store	Iceberg via Spark	BigQuery	Snowflake
Materialization	scheduled (Airflow)	streaming (Flink)	hybrid
Definitions	Python (Feast)	SQL (dbt-feature-store)	Both

4. TOOLS (as of 2025-10)

Feast 0.40+ (Python feature store)
Redis (already from Y1 — online store)
Iceberg + Spark (Y3 — offline store)
Airflow (Y3 — materialization scheduling)

5. MASTERY

5.1 Reading list

Required	Why
Feast docs — Concepts + Quickstart	The implementation
“Designing ML Systems” Ch. 5 (Feature Engineering)	The discipline
“Reliable Machine Learning” Ch. 6 (Feature Engineering Reliability)	The depth

5.2 Operational depth checklist

[ ] Deploy Feast on basecamp Tier 5 (Postgres registry + Redis online + Iceberg offline)
[ ] Define entity `repo` and feature `commits_last_7d`
[ ] Define entity `author` and 3 features (commits_total, langs_used, weekly_streak)
[ ] Materialize all 4 features to Redis on an Airflow schedule
[ ] Build a Feast feature view; query offline (training) + online (serving)
[ ] Verify point-in-time correctness: the value Feast returns for a training row as of timestamp X matches the value recomputed from raw data as of X
[ ] Refactor next-week-commits model to use Feast features end-to-end (no inline SQL)
[ ] Define freshness SLO for each feature; wire to Grafana
[ ] Add a feature lineage view in DataHub (already from Y3 P19)
[ ] Test train/serve skew: deliberately introduce skew; observe model accuracy drop
[ ] Document features in basecamp/README.md (Feast registry + how to add)

6. COMPARE: Feast vs DIY (Postgres-only feature serving)

For low-throughput models, you could skip Feast and just SELECT from Iceberg/Postgres. When does Feast earn its weight?

400 words.

7. OPERATE

3+ runbooks (feast-materialization-stuck, redis-online-store-stale, feature-freshness-slo-violation)
1+ ADR (why-feast-over-diy-for-basecamp)
Weekly log

8. CONTRIBUTE

Feast itself, Feast connectors (e.g., the Iceberg connector), feature-store community.

Validation criteria

[ ] All 11 operational depth checks
[ ] Feast running in basecamp Tier 5 (registry + offline + online)
[ ] At least 5 features defined across 2 entities
[ ] next-week-commits model refactored to use Feast end-to-end
[ ] Freshness SLOs wired to Grafana
[ ] Feast vs DIY comparison written up
[ ] 3+ runbooks; 1+ ADR; 7+ weekly log entries
[ ] Pattern entries deepened:
    - feature-store → DEEP
    - train-serve-skew → DEEP (refactor caught the skew)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — define a new feature mean_files_per_commit_last_30d; materialize to Redis; query offline + online; verify PIT correctness with a 1-month-ago test.
Diagnose (60 min) — scenario: model accuracy dropped after a Feast schema change. Trace via lineage + DataHub; find the skew.
Articulate (30 min) — 600 words: “Walk a feature from definition to model training to inference. What contracts must hold for skew prevention?”

Anti-patterns

Anti-pattern	Why
Inline feature SQL in inference service	Train/serve skew waiting to happen
Materialize on every inference	Defeats the online store; Redis exists for a reason
Features without freshness SLOs	Stale features silently produce bad predictions
Feast for “single user, single model”	Feast earns its keep in shared or scaled use; a one-off can be DIY

Patterns deepened this phase

feature-store → DEEP
train-serve-skew → DEEP

→ Next: Phase 23: Kubeflow Operations