Data Engineering Patterns
Three patterns at the data tier: lakehouse, streaming-vs-batch, and change-data-capture (CDC).
Three patterns from the modern data stack. First touched in Y4 Phases 31-33; each deepens through the rest of Y4 as you operate the data tier.
Patterns in this category
| Pattern | First touched | DEEP target |
|---|---|---|
| lakehouse | Y4 Phase 31 | Y4 end (Iceberg + Trino operational) |
| streaming-vs-batch | Y4 Phase 32 + Phase 33 | Y4 end |
| cdc | Y4 Phase 32 | Y4 end (Debezium + Kafka operational) |
Why this category exists
The modern data stack converges on a small number of shapes. The lakehouse pattern is the table-on-object-storage shape that many large data platforms have converged on. Streaming-vs-batch is the trade-off that determines latency, cost, and operational complexity. CDC is the pattern that lets the OLTP world feed the OLAP world without bulk ETL.
These three patterns plus the storage-and-data category primitives (snapshot-plus-delta, schema-evolution, partitioning) cover the data tier. The split between categories is deliberate: storage-and-data covers engine-level patterns that describe how a single system works; data-engineering covers stack-level patterns that describe how systems compose into a data platform.
The category is small because the data engineering surface is small once you strip out the tool-specific noise. The tools are legion (Snowflake, BigQuery, Databricks, Iceberg, Delta Lake, Hudi, Apache Flink, Kafka Streams, Spark, Airflow, Dagster, Prefect, Nessie, Polaris), but the patterns beneath them are three. Being fluent in the three patterns means you can evaluate any new data tool by mapping it back to which patterns it implements and what trade-offs it makes.
How to read this category
All three patterns first-fire in Y4 (Phases 31-33). Read all three STUB entries before starting Phase 31. Each pattern deepens through operating a specific piece of the data tier.
Y4 Phase 31: lakehouse first-fires as Iceberg on MinIO comes online. Read the OUTLINE sections during the phase; the pattern goes DEEP as you actually query, evolve schemas, and time-travel through snapshots.
Y4 Phase 32: cdc first-fires as Debezium reads Postgres WAL and produces Kafka topics. streaming-vs-batch first-fires here too as Kafka Streams and Flink introduce the streaming half of the trade-off.
Y4 Phase 33: streaming-vs-batch deepens as Spark and Airflow/Dagster introduce the batch half. Running the same computation in both modes is what makes the trade-off concrete.
By end of Y4, all three should be DEEP. The data tier operates for several months, hits real ingestion pressure, has real schema changes, has real cross-region delays, and the patterns get their operational evidence.
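The "same computation in both modes" exercise from Phase 33 can be previewed in miniature. The following is a minimal sketch in plain Python, not Flink or Spark code: the window size, the sample events, and the names `batch_counts` and `StreamingCounter` are all hypothetical. It computes per-key counts over tumbling windows once over a bounded input and once event by event.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window size

# (event_time_seconds, key) pairs; illustrative data only
EVENTS = [(5, "a"), (20, "b"), (61, "a"), (62, "a"), (130, "b")]


def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def batch_counts(events):
    """Batch mode: the whole bounded input is available up front."""
    counts = defaultdict(int)
    for event_time, key in events:
        counts[(window_start(event_time), key)] += 1
    return dict(counts)


class StreamingCounter:
    """Streaming mode: long-lived state updated one event at a time."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, event_time: int, key: str) -> int:
        slot = (window_start(event_time), key)
        self.counts[slot] += 1
        # A real job would emit or checkpoint the updated count here.
        return self.counts[slot]


stream = StreamingCounter()
for event_time, key in EVENTS:
    stream.on_event(event_time, key)

batch_result = batch_counts(EVENTS)
stream_result = dict(stream.counts)
```

Both modes reach the same counts on this input. The difference is in what each has to operate: the batch function needs the full input before it answers, while the streaming class has a partial answer after every event but must keep state alive indefinitely.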
How the patterns connect
The three patterns form the data tier’s operating triangle.
- lakehouse is where the data lives. Iceberg tables on object storage, unified catalog, ACID transactions, time-travel queries.
- streaming-vs-batch is how data moves in. Streaming for low-latency updates and real-time analytics; batch for throughput-heavy transformations and historical rebuilds.
- cdc is how OLTP and OLAP connect. Postgres write-ahead log feeds a Kafka topic, which feeds streaming or batch ingestion into the lakehouse.
The composition is: OLTP writes go to Postgres → CDC captures the WAL → Kafka topic streams the changes → streaming or batch jobs materialize the changes into Iceberg → analytical queries hit Iceberg via Trino or Spark. Every arrow in that pipeline is one of the three patterns.
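The step where change events are materialized into a table can be sketched as follows. This is a minimal illustration, not Debezium's or Iceberg's API. The event shape is loosely modeled on Debezium's change-event envelope (`op`, `before`, `after`, `source`) and heavily simplified. The `apply_change` function, the `id` primary key, and the in-memory `table` dict standing in for the sink are hypothetical. The sketch skips any event whose log position is not newer than the last one applied for that key, so a redelivered event cannot overwrite newer state.

```python
def apply_change(table: dict, applied_lsn: dict, event: dict) -> bool:
    """Apply one change event to the sink; return True if the table changed."""
    op = event["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    row = event["before"] if op == "d" else event["after"]
    key = row["id"]  # assumes every captured table has a primary key called "id"
    lsn = event["source"]["lsn"]

    # At-least-once delivery: ignore anything not newer than what was applied.
    if applied_lsn.get(key, -1) >= lsn:
        return False

    if op == "d":
        table.pop(key, None)
    else:
        table[key] = row
    applied_lsn[key] = lsn
    return True


events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "email": "a@example.com"}, "source": {"lsn": 100}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "b@example.com"}, "source": {"lsn": 110}},
    {"op": "c", "before": None,
     "after": {"id": 2, "email": "c@example.com"}, "source": {"lsn": 120}},
    {"op": "d", "before": {"id": 2, "email": "c@example.com"},
     "after": None, "source": {"lsn": 130}},
]

table, applied_lsn = {}, {}
# Deliver every event once, then redeliver the stale create for id=1.
results = [apply_change(table, applied_lsn, e) for e in events + [events[0]]]
```

Without the log-position check, the redelivered create would reset row 1 to its original email. A real sink would more likely express this as a merge keyed on the primary key, but the requirement is the same: applying an event twice must leave the table unchanged.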
lakehouse sits below the other two: streaming and CDC are meaningful because the lakehouse exists as their destination. Without the lakehouse, streaming produces ephemeral state and batch produces one-off, snowflake data marts.
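The snapshot idea behind time travel can be illustrated with a toy model. This is a sketch under stated assumptions, not Iceberg's metadata format or API: `ToyTable`, `Snapshot`, and the file names are hypothetical. Each commit records an immutable list of data files, readers pick a snapshot, and expiring old snapshots reveals which files are no longer referenced.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # data files visible in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()) -> int:
        """Create a new snapshot from the current one; older snapshots stay readable."""
        current = self.snapshots[-1].files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(snapshot_id=len(self.snapshots) + 1, files=files)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, snapshot_id=None) -> tuple:
        """Files to read: the latest snapshot by default, an older one for time travel."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].files
        return next(s.files for s in self.snapshots if s.snapshot_id == snapshot_id)

    def expire(self, keep_last: int) -> list:
        """Drop old snapshots; return files that no kept snapshot references."""
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        dropped = self.snapshots[:-keep_last]
        self.snapshots = self.snapshots[-keep_last:]
        live = {f for s in self.snapshots for f in s.files}
        return sorted({f for s in dropped for f in s.files} - live)


t = ToyTable()
s1 = t.commit(added=["a.parquet"])
s2 = t.commit(added=["b.parquet"])
# Compaction: rewrite two small files into one, as a new snapshot.
s3 = t.commit(added=["ab-compacted.parquet"], removed=["a.parquet", "b.parquet"])

latest = t.scan()
as_of_first = t.scan(s1)
unreferenced = t.expire(keep_last=1)
```

In this model, compaction does not remove the old files, it only stops referencing them. They remain readable through older snapshots until those snapshots are expired, which is why expiration and file cleanup are part of operating the write path.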
Where these show up in /root
- Y2 Phase 9: Postgres internals expose the WAL. This is the mechanism CDC will later exploit, but the pattern (cdc) doesn’t first-fire until Y4.
- Y2 Phase 14: queues and event-driven patterns introduce the streaming mindset without naming streaming-vs-batch yet.
- Y3 Phase 21: DDIA covers the theoretical framing (batch vs stream processing) as part of Chapter 10-11. No DEEP evidence yet.
- Y4 Phase 31: the lakehouse phase. lakehouse first-fires through Iceberg on MinIO. Nessie or Polaris as the catalog. Trino as the query engine. Time-travel queries. Snapshot management.
- Y4 Phase 32: the streaming phase. cdc first-fires through Debezium reading Postgres WAL. Kafka Connect wiring. streaming-vs-batch first-fires as Kafka Streams and Flink introduce streaming semantics.
- Y4 Phase 33: the batch phase. Spark for large-scale transformations, Airflow or Dagster for orchestration. Running the same aggregation in both streaming (Flink) and batch (Spark) is the exercise that makes streaming-vs-batch DEEP.
- Y5 Phase 42: pgvector as the vector store inherits the OLTP-vs-OLAP distinction from the data-tier work.
- Y5 Phase 48: mcp-data-tier exposes Iceberg tables as an MCP tool. The lakehouse becomes an agent-accessible query surface.
By end of Y4, the data tier is operating end-to-end. By end of Y5, the AI tier queries against it (RAG over structured tables, feature-store retrieval, training-data pipelines all run through the lakehouse).
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Lakehouse without a catalog | Without a catalog, Iceberg tables on S3 are just files. A catalog (Hive Metastore, Nessie, or Polaris) makes the tables discoverable and enables cross-engine queries. Skipping the catalog is a common early mistake. |
| Streaming-vs-batch as an either/or | Real data platforms run both. Lambda architecture (both, then merge) was one answer; Kappa architecture (streaming, batch-as-replay) is another. The pattern isn’t picking one; it’s understanding when each fits. |
| CDC without idempotency at the sink | Debezium delivers change events at least once by default, so the same event can arrive more than once. The sink writing to Iceberg needs to handle duplicates: duplicate-safe upserts (or a downstream deduplication step) are required. Skipping this produces silently corrupted tables. |
| Lakehouse promotion to DEEP after only Iceberg reads | DEEP requires operating write-heavy paths: schema evolution mid-flight, compaction triggering, snapshot expiration, orphan file cleanup. Reads are easy; writes are where the operational learning happens. |
| Streaming without watermarks | Event-time processing requires watermarks; processing-time processing accepts skew. Streaming systems without watermarks produce wrong results when events arrive out of order. If you can’t articulate your watermark strategy, you don’t have streaming; you have unbounded processing. |
| CDC over primary-key-less tables | CDC needs a stable identity to represent changes. Tables without primary keys either can’t be captured (no updates, only inserts) or produce ambiguous change records. Fix the schema before enabling CDC. |
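
A watermark strategy, as named in the streaming row above, can be stated in a few lines. The following is a minimal sketch of one common choice, bounded out-of-orderness, where the watermark trails the largest event time seen by a fixed allowance. It is not Flink's API: `WatermarkedCounter`, the window size, and the lateness bound are hypothetical. A window is finalized once the watermark passes its end, and events for an already finalized window are set aside as late instead of silently changing a published result.

```python
from collections import defaultdict

WINDOW = 10           # tumbling window size in seconds (hypothetical)
ALLOWED_LATENESS = 5  # how far out of order events may arrive (hypothetical)


class WatermarkedCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.late = []                        # events whose window already closed
        self.max_event_time = float("-inf")

    @property
    def watermark(self):
        """No event older than this is expected to arrive anymore."""
        return self.max_event_time - ALLOWED_LATENESS

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % WINDOW
        if start + WINDOW <= self.watermark:
            self.late.append(event_time)  # window is final; route elsewhere
            return
        self.open_windows[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every open window whose end the watermark has passed.
        for s in sorted(self.open_windows):
            if s + WINDOW <= self.watermark:
                self.closed[s] = self.open_windows.pop(s)


counter = WatermarkedCounter()
for t in [1, 4, 12, 8, 17, 3, 26]:  # arrival order, not event-time order
    counter.on_event(t)

closed = counter.closed
late = counter.late
still_open = dict(counter.open_windows)
```

In this run the event at time 8 arrives after 12 but is still counted, because the watermark has not yet passed the end of its window. The event at time 3 arrives after that window was finalized and is recorded as late. Choosing the lateness bound is the trade-off: a larger bound admits more stragglers but delays every result.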
Cross-references
- Pattern Library
- Y4 Phase 31: Data Lakehouse
- Y4 Phase 32: Stream Processing
- Storage and Data Patterns — WAL, snapshot-plus-delta, schema-evolution, partitioning are the engine-level patterns beneath this category
- Distributed Systems Patterns — delivery-semantics, backpressure, and eventual-consistency compose with CDC and streaming
- ML Systems Patterns — feature stores and training-data pipelines depend on the data tier