Data Engineering Patterns

Three patterns at the data tier: lakehouse, streaming-vs-batch, and change-data-capture (CDC).

Three patterns from the modern data stack. First touched in Y4 Phases 31-33; each deepens through Y4 by operating the data tier.

Patterns in this category

Pattern | First touched | DEEP target
lakehouse | Y4 Phase 31 | Y4 end (Iceberg + Trino operational)
streaming-vs-batch | Y4 Phase 32 + Phase 33 | Y4 end
cdc | Y4 Phase 32 | Y4 end (Debezium + Kafka operational)

Why this category exists

The modern data stack converges on a small number of shapes. The lakehouse pattern is the table-on-object-storage shape that large data platforms have broadly converged on. Streaming-vs-batch is the trade-off that determines latency, cost, and operational complexity. CDC is the pattern that lets the OLTP world feed the OLAP world without bulk ETL.

These three patterns plus the storage-and-data category primitives (snapshot-plus-delta, schema-evolution, partitioning) cover the data tier. The split between categories is deliberate: storage-and-data covers engine-level patterns that describe how a single system works; data-engineering covers stack-level patterns that describe how systems compose into a data platform.

The category is small because the data engineering surface is small once you strip out the tool-specific noise. The tools are legion (Snowflake, BigQuery, Databricks, Iceberg, Delta Lake, Hudi, Apache Flink, Kafka Streams, Spark, Airflow, Dagster, Prefect, Nessie, Polaris), but the patterns beneath them are three. Being fluent in the three patterns means you can evaluate any new data tool by mapping it back to which patterns it implements and what trade-offs it makes.

How to read this category

All three patterns first-fire in Y4 (Phases 31-33). Read all three STUB entries before starting Phase 31. Each deepens through operating a specific piece of the data tier.

Y4 Phase 31: lakehouse first-fires as Iceberg on MinIO comes online. Read the OUTLINE sections during the phase; the pattern goes DEEP as you actually query, evolve schemas, and time-travel through snapshots.

Y4 Phase 32: cdc first-fires as Debezium reads Postgres WAL and produces Kafka topics. streaming-vs-batch first-fires here too as Kafka Streams and Flink introduce the streaming half of the trade-off.

Y4 Phase 33: streaming-vs-batch deepens as Spark and Airflow/Dagster introduce the batch half. Running the same computation in both modes is what makes the trade-off concrete.
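To make "the same computation in both modes" concrete before Flink and Spark are in play, here is a minimal sketch in plain Python. The event data and the names `batch_totals` and `StreamingTotals` are hypothetical, chosen only for illustration; neither mirrors a real Flink or Spark API.

```python
from collections import defaultdict

# Hypothetical events: (key, amount). Illustrative data only.
EVENTS = [("a", 3), ("b", 5), ("a", 2), ("c", 1), ("b", 4)]

def batch_totals(events):
    """Batch mode: the whole bounded input is available, so aggregate in one pass."""
    totals = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
    return dict(totals)

class StreamingTotals:
    """Streaming mode: keep running state and emit an updated result per event."""

    def __init__(self):
        self.state = defaultdict(int)

    def on_event(self, key, amount):
        self.state[key] += amount
        return key, self.state[key]  # the updated total is visible immediately

stream = StreamingTotals()
updates = [stream.on_event(key, amount) for key, amount in EVENTS]

# Both modes agree on the final answer; they differ in when results become visible.
print(batch_totals(EVENTS))
print(dict(stream.state))
```

The batch version answers once, after all input is read. The streaming version answers after every event but has to hold state for as long as the stream runs, which is where much of the operational cost of streaming comes from.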

By end of Y4, all three should be DEEP. The data tier operates for several months, hits real ingestion pressure, has real schema changes, has real cross-region delays, and the patterns get their operational evidence.

How the patterns connect

The three patterns form the data tier’s operating triangle.

  • lakehouse is where the data lives. Iceberg tables on object storage, unified catalog, ACID transactions, time-travel queries.
  • streaming-vs-batch is how data moves in. Streaming for low-latency updates and real-time analytics; batch for throughput-heavy transformations and historical rebuilds.
  • cdc is how OLTP and OLAP connect. Postgres write-ahead log feeds a Kafka topic, which feeds streaming or batch ingestion into the lakehouse.

The composition is: OLTP writes go to Postgres → CDC captures the WAL → Kafka topic streams the changes → streaming or batch jobs materialize the changes into Iceberg → analytical queries hit Iceberg via Trino or Spark. Every arrow in that pipeline is one of the three patterns.
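The "materialize the changes" arrow can be sketched as a small apply loop. This is a toy model under stated assumptions: the event shape is loosely modeled on a CDC envelope with an `op` code and an `after` row image, but the flat `key` and `lsn` fields are a simplification for illustration, not the exact Debezium schema, and a Python dict stands in for the Iceberg table.

```python
def apply_change(table, applied_lsn, event):
    """Apply one change event; skip it if that key already saw this LSN or a later one."""
    key, lsn = event["key"], event["lsn"]
    if applied_lsn.get(key, -1) >= lsn:
        return False  # duplicate or stale redelivery: applying it again would be wrong
    if event["op"] == "d":
        table.pop(key, None)          # delete
    else:
        table[key] = event["after"]   # "c" create / "u" update: upsert the row image
    applied_lsn[key] = lsn            # kept even after a delete, acting as a tombstone
    return True

# Hypothetical change events for an orders table.
e1 = {"key": 1, "lsn": 100, "op": "c", "after": {"id": 1, "status": "new"}}
e2 = {"key": 1, "lsn": 110, "op": "u", "after": {"id": 1, "status": "paid"}}
e3 = {"key": 2, "lsn": 120, "op": "c", "after": {"id": 2, "status": "new"}}
e4 = {"key": 2, "lsn": 130, "op": "d", "after": None}

# At-least-once delivery: e2 arrives twice and e1 is redelivered late.
delivered = [e1, e2, e2, e3, e1, e4]

table, applied_lsn = {}, {}
results = [apply_change(table, applied_lsn, ev) for ev in delivered]
print(table)
print(results)
```

Tracking the last applied log position per key is one way to make the sink idempotent; a merge keyed on primary key plus log position in the target table is another. Either way, a redelivered event must leave the table unchanged.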

lakehouse sits below the other two: streaming and CDC are meaningful because the lakehouse exists as their destination. Without the lakehouse, streaming produces ephemeral state; batch produces one-off ("snowflake") data marts.
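Snapshots and time travel, the lakehouse properties named in the list above, can be pictured with a toy model. The `ToyTable` class and the file names below are hypothetical and this is not the Iceberg API; it only illustrates the idea that each commit produces a new immutable snapshot listing the data files that make up the table.

```python
class ToyTable:
    """Toy snapshot log: each commit appends an immutable set of data file names."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0 is the empty table

    def commit(self, add=(), remove=()):
        """Create a new snapshot from the current one; older snapshots are untouched."""
        current = self.snapshots[-1]
        self.snapshots.append((current - frozenset(remove)) | frozenset(add))
        return len(self.snapshots) - 1  # id of the new snapshot

    def files(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for a time-travel query."""
        return self.snapshots[-1 if snapshot_id is None else snapshot_id]

t = ToyTable()
s1 = t.commit(add=["a.parquet"])
s2 = t.commit(add=["b.parquet"])
# Compaction: rewrite two small files into one, committed as a single atomic swap.
s3 = t.commit(add=["ab.parquet"], remove=["a.parquet", "b.parquet"])

print(sorted(t.files()))    # current state after compaction
print(sorted(t.files(s2)))  # time travel to before compaction
```

In this model, snapshot expiration would mean dropping old entries from the log, and orphan file cleanup would mean deleting files that no remaining snapshot references.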

Where these show up in /root

  • Y2 Phase 9 — Postgres internals expose the WAL. This is the mechanism CDC will later exploit, but the pattern (cdc) doesn’t first-fire until Y4.
  • Y2 Phase 14 — queues and event-driven patterns introduce the streaming mindset without naming streaming-vs-batch yet.
  • Y3 Phase 21 — DDIA covers the theoretical framing (batch vs stream processing) in Chapters 10-11. No DEEP evidence yet.
  • Y4 Phase 31 — the lakehouse phase. lakehouse first-fires through Iceberg on MinIO. Nessie or Polaris as the catalog. Trino as the query engine. Time-travel queries. Snapshot management.
  • Y4 Phase 32 — the streaming phase. cdc first-fires through Debezium reading Postgres WAL. Kafka Connect wiring. streaming-vs-batch first-fires as Kafka Streams and Flink introduce streaming semantics.
  • Y4 Phase 33 — the batch phase. Spark for large-scale transformations, Airflow or Dagster for orchestration. Running the same aggregation in both streaming (Flink) and batch (Spark) is the exercise that makes streaming-vs-batch DEEP.
  • Y5 Phase 42 — pgvector as the vector store inherits the OLTP-vs-OLAP distinction from the data-tier work.
  • Y5 Phase 48 — mcp-data-tier exposes Iceberg tables as an MCP tool. The lakehouse becomes an agent-accessible query surface.

By end of Y4, the data tier is operating end-to-end. By end of Y5, the AI tier queries against it (RAG over structured tables, feature-store retrieval, training-data pipelines all run through the lakehouse).

Anti-patterns

Anti-pattern | Why
Lakehouse without a catalog | Iceberg tables on S3 are just files without a catalog. Hive Metastore, Nessie, or Polaris makes the tables discoverable and enables cross-engine queries. Skipping the catalog is a common early mistake.
Streaming-vs-batch as an either/or | Real data platforms run both. Lambda architecture (both, then merge) was one answer; Kappa architecture (streaming, batch-as-replay) is another. The pattern isn’t picking one; it’s understanding when each fits.
CDC without idempotency at the sink | Debezium delivers at-least-once by default, so the sink writing to Iceberg must tolerate duplicates. Duplicate-safe upserts (or a downstream deduplication step) are required. Skipping this silently corrupts tables.
Lakehouse promotion to DEEP after only Iceberg reads | DEEP requires operating write-heavy paths: schema evolution mid-flight, compaction triggering, snapshot expiration, orphan file cleanup. Reads are easy; writes are where the operational learning happens.
Streaming without watermarks | Event-time processing requires watermarks; processing-time processing accepts skew. Streaming systems without watermarks produce wrong results when events arrive out of order. If you can’t articulate your watermark strategy, you don’t have streaming; you have unbounded processing.
CDC over primary-key-less tables | CDC needs a stable identity to represent changes. Tables without primary keys either can’t be fully captured (inserts only, no updates) or produce ambiguous change records. Fix the schema before enabling CDC.
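The watermark anti-pattern is easier to reason about with a small example. The sketch below is a simplified illustration, not Flink's or Kafka Streams' implementation: it assumes tumbling event-time windows and a watermark defined as the maximum event time seen minus a fixed allowed lateness. `WINDOW`, `ALLOWED_LATENESS`, and `windowed_counts` are hypothetical names and values.

```python
from collections import defaultdict

WINDOW = 10           # tumbling window size, in event-time seconds (assumed)
ALLOWED_LATENESS = 5  # watermark trails the max event time seen by this much (assumed)

def windowed_counts(timestamps):
    """Count events per event-time window, closing windows as the watermark passes them."""
    open_windows = defaultdict(int)  # window start -> running count
    closed = {}                      # window start -> final count
    late = []                        # events that arrived after their window closed
    max_ts = float("-inf")
    watermark = float("-inf")

    for ts in timestamps:            # iteration order is arrival order, not event-time order
        start = (ts // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            late.append(ts)          # window already finalized: route to a late-data path
            continue
        open_windows[start] += 1
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        # Finalize every open window whose end the watermark has now passed.
        for s in [s for s in open_windows if s + WINDOW <= watermark]:
            closed[s] = open_windows.pop(s)
    return closed, dict(open_windows), late

# The event at t=8 arrives after t=12 but is still counted in window [0, 10).
# The event at t=3 arrives after that window was finalized and is reported as late.
closed, still_open, late = windowed_counts([1, 4, 12, 8, 17, 3, 26])
print(closed, still_open, late)
```

The allowed lateness is the watermark strategy in miniature: a larger value tolerates more disorder but delays every result, while a smaller value emits sooner and drops or reroutes more late events.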

Cross-references