Change Data Capture (CDC)

Stream database changes as events. The pattern that lets OLTP feed OLAP without bulk ETL. Debezium, AWS DMS, native logical-replication consumers.

Every commit to the OLTP database becomes an event. Downstream systems consume the event stream. No bulk reads; no batch lag. Status: STUB, to be promoted to OUTLINE in Y4 Phase 32.

What this pattern is

Change Data Capture (CDC) streams every committed change from a source database (OLTP: Postgres, MySQL) as events on a message bus (Kafka), so downstream consumers (OLAP lakehouse, search indexes, cache invalidation, audit logs) can react to changes in near-real-time without polling. The implementation reads the database’s transaction log directly (the Postgres write-ahead log, or WAL, through a logical replication slot; the MySQL binlog) and translates the row-level changes into a stream of structured events.
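To make "structured events" concrete, here is a minimal sketch of a consumer applying row-level change events to an in-memory copy of a table. The event shape follows Debezium's envelope (op, before, after), where op is "c" for create, "u" for update, "d" for delete and "r" for a snapshot read. The orders rows, the id key field and the apply_change helper are hypothetical and exist only for illustration.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(replica: Dict[Any, Row], event: Dict[str, Any], key_field: str = "id") -> None:
    """Apply one Debezium-style change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read and update all carry the new row in "after".
        row = event["after"]
        replica[row[key_field]] = row  # upsert, so a replayed event is harmless
    elif op == "d":
        # A delete carries the old row (at least its key) in "before".
        replica.pop(event["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events for an "orders" table, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

After the four events the replica holds only order 1 with status "paid". Writing the apply step as an upsert is a deliberate choice: it keeps redelivered events from corrupting the downstream copy.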

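Debezium is the canonical OSS CDC connector. It runs as a Kafka Connect plugin or as the standalone Debezium Server, and on Kubernetes the Kafka Connect deployment is commonly managed by Strimzi. AWS DMS is the cloud-managed alternative. Both read the source database’s transaction log. Debezium produces one Kafka topic per source table carrying structured change events (create, update, delete, plus snapshot reads), with schema changes reported on a separate topic by the connectors that track them. DMS can target Kafka as well as many other endpoints.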
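As a rough illustration of what registering a Debezium Postgres connector on Kafka Connect involves, the sketch below builds the JSON payload that would be POSTed to the Connect REST endpoint /connectors. The property names are Debezium's Postgres connector settings as of the 2.x line (older releases used database.server.name instead of topic.prefix). Every value here (host, user, slot, publication and table names) is a placeholder assumption, not the basecamp configuration.

```python
import json
from typing import Any, Dict, List


def postgres_connector_config(
    name: str,
    host: str,
    dbname: str,
    tables: List[str],
    topic_prefix: str,
) -> Dict[str, Any]:
    """Build a Kafka Connect payload for a Debezium Postgres connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",             # Postgres built-in logical decoding plugin
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_user",           # placeholder
            "database.password": "CHANGE_ME",      # placeholder; use a secret store in practice
            "database.dbname": dbname,
            "topic.prefix": topic_prefix,
            "slot.name": "cdc_slot",               # placeholder replication slot name
            "publication.name": "cdc_publication", # placeholder publication name
            "table.include.list": ",".join(tables),
        },
    }


def topic_for(topic_prefix: str, qualified_table: str) -> str:
    """Debezium names Postgres topics <prefix>.<schema>.<table>."""
    return f"{topic_prefix}.{qualified_table}"


payload = postgres_connector_config(
    name="orders-cdc",
    host="postgres.example.internal",
    dbname="shop",
    tables=["public.orders", "public.customers"],
    topic_prefix="shop",
)
print(json.dumps(payload, indent=2))
print(topic_for("shop", "public.orders"))
```

With this configuration, changes to public.orders would land on the topic shop.public.orders, one topic per captured table.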

The pattern resolves the OLTP-vs-OLAP split. The OLTP database remains optimized for transactions. The OLAP system receives the data as it changes. No nightly batch ETL window, no stale dashboards. Iceberg + Debezium + Kafka is the modern lakehouse-friendly shape.

The pattern’s operational value goes beyond OLTP-to-OLAP. CDC events feed cache invalidation (Redis, Memcached), search indexes (Elasticsearch, Solr), audit logs (immutable event log per commit), microservice event-driven architectures (payment service commits an order; shipping service consumes the event), and reverse-ETL (data warehouse changes flow back to operational systems). One CDC pipeline supports many downstream patterns because the events are structured, ordered per key within a topic partition (and so per table when the topic has a single partition), and delivered at-least-once.
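At-least-once delivery means a consumer can see the same event twice, for example after a connector restart. A minimal sketch of an idempotent consumer follows: it remembers the last applied position per key and skips anything at or below it. The position field is an assumption here, standing for any value that increases monotonically per key (for Postgres it could be derived from the log sequence number in the event's source metadata). IdempotentApplier is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, List, Tuple


class IdempotentApplier:
    """Applies (key, position, value) changes at most once per key and position."""

    def __init__(self) -> None:
        self.state: Dict[Any, Any] = {}
        self.last_position: Dict[Any, int] = {}
        self.applied = 0

    def handle(self, key: Any, position: int, value: Any) -> bool:
        """Return True if the change was applied, False if it was a duplicate or stale."""
        if position <= self.last_position.get(key, -1):
            return False  # already seen: redelivery or replay
        self.state[key] = value
        self.last_position[key] = position
        self.applied += 1
        return True


# Hypothetical delivery sequence including a redelivery and an old replay.
deliveries: List[Tuple[int, int, str]] = [
    (1, 10, "new"),
    (1, 11, "paid"),
    (1, 11, "paid"),  # redelivered after a restart
    (2, 12, "new"),
    (1, 10, "new"),   # stale replay of an older event
]

applier = IdempotentApplier()
results = [applier.handle(key, pos, value) for key, pos, value in deliveries]
print(applier.state, results)
```

Only three of the five deliveries change state, and key 1 ends at "paid" rather than being rolled back by the stale replay.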

Concrete instances in the wild

  • Debezium. The canonical OSS CDC connector. Supports Postgres, MySQL, MongoDB, SQL Server, Oracle, Db2, Cassandra. Runs on Kafka Connect or standalone Debezium Server.
  • AWS DMS. AWS-managed CDC. Multi-source, multi-target. Good for one-time migrations and ongoing replication.
  • Fivetran. Commercial CDC SaaS. Very popular in analytics teams.
  • Airbyte. OSS + commercial CDC + ETL. Growing adoption.
  • Estuary Flow. Real-time CDC platform. Newer entrant.
  • basecamp CDC pipeline (Y4 Phase 32). Debezium on Strimzi + Kafka + Iceberg via Flink or Kafka Connect sink.
  • Google Cloud Datastream. GCP-managed CDC for Postgres, MySQL, Oracle.
  • Postgres native logical replication. Postgres to Postgres via publications and subscriptions. Simpler than Debezium for Postgres-to-Postgres flows.
  • MySQL binlog replication. Native primary-to-replica replication. Also usable as CDC source for tools like Debezium.
  • Materialize. Streaming database that consumes CDC events and maintains materialized views.

Why this pattern matters

Before CDC, moving data from operational databases to analytics systems required nightly batch ETL. Full table scans, bulk exports, staging tables, transformation jobs. The lag between “data changes in production” and “data appears in analytics” was measured in hours or days. Real-time dashboards were impossible. Freshness expectations were low.

With CDC, that lag drops to seconds. A commit in Postgres typically reaches Kafka as an event within milliseconds to a few seconds, depending on connector and broker configuration. Consumers process the event and update the downstream system within seconds. Real-time dashboards become possible. Freshness expectations shift from “yesterday’s data” to “the last few minutes.”

The pattern also eliminates the load spike from batch ETL. Nightly bulk reads used to hammer the primary database during off-hours, competing with whatever traffic was still running. CDC reads the WAL, which the database already streams continuously to replicas. Logical decoding adds some CPU and I/O overhead, but the load on the source database is low and steady rather than spiky.

CDC also enables event-driven architectures without requiring application code changes. Traditional event-driven patterns require the application to publish an event explicitly on every change. That is the dual-write problem: the application must commit to the database AND publish to Kafka, and a failure between the two steps leaves them inconsistent. CDC extracts events from the WAL after the commit is durable, so the database stays the source of truth and the events are guaranteed consistent with what committed. The outbox pattern builds on this: the application writes an event row to an outbox table in the same transaction as the business change, and CDC publishes that table. Either way, the event-driven benefits come without the dual-write hazard.
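The outbox variant can be sketched as follows: the business row and the event row are written in one transaction, so either both commit or neither does, and a CDC connector would then publish the outbox table. SQLite stands in for Postgres so the example is self-contained. The orders and outbox schemas and the place_order function are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_id INTEGER NOT NULL,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    """
)


def place_order(db: sqlite3.Connection, order_id: int, fail: bool = False) -> None:
    """Write the order and its event atomically; no separate publish step exists."""
    with db:  # commits on success, rolls back if an exception escapes
        db.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "placed"))
        if fail:
            raise RuntimeError("simulated crash between the two writes")
        db.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", json.dumps({"order_id": order_id})),
        )


place_order(conn, 1)
try:
    place_order(conn, 2, fail=True)
except RuntimeError:
    pass  # the transaction rolled back, so order 2 was never committed

orders = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()
outbox = conn.execute("SELECT aggregate_id, event_type FROM outbox ORDER BY id").fetchall()
print(orders, outbox)
```

The simulated crash leaves no order without an event and no event without an order, which is exactly the guarantee a dual write to the database and Kafka cannot give.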

The pattern is also foundational to modern data infrastructure. Feature stores use CDC to keep features fresh. Cache invalidation happens via CDC. Search indexes update via CDC. Audit logs are literally the CDC stream. Reverse-ETL flows analytics changes back to operational systems via CDC-like mechanisms. Understanding CDC is understanding a significant chunk of how modern data infrastructure is glued together.

The failure modes to know:

  • Schema evolution. A column change on the source can break downstream consumers. Pairing Debezium with a schema registry and enforcing compatibility rules mitigates this.
  • Replication slot lag. If the slot’s consumer (the CDC connector) falls behind or stops, the source database retains WAL for it, and the retained WAL can fill the disk.
  • Initial snapshot vs ongoing replication. Consumers need to be bootstrapped with the current state before changes are applied.
  • Deletes. Some sinks don’t support deletes cleanly.

Each has known patterns, but operating CDC means owning them.
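Replication slot lag is the easiest of these failure modes to watch numerically. The sketch below shows a query against the Postgres catalog view pg_replication_slots and, for illustration, the same byte arithmetic done in Python on textual LSNs (two hexadecimal halves separated by a slash). The 1 GiB alert threshold and the helper names are assumptions, not recommended values.

```python
# Query to run on the source database (Postgres 10 or later).
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

ALERT_THRESHOLD_BYTES = 1024 ** 3  # hypothetical threshold: 1 GiB of retained WAL


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL written but not yet confirmed by the slot's consumer."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


def should_alert(lag_bytes: int, threshold: int = ALERT_THRESHOLD_BYTES) -> bool:
    """True when the slot retains more WAL than the chosen threshold."""
    return lag_bytes > threshold


lag = slot_lag_bytes("16/B374D848", "16/B374D000")
print(lag, should_alert(lag))
```

A slot whose lag keeps growing while active is false usually means the connector is down; the source will keep retaining WAL until the connector resumes or the slot is dropped.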

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y4 Phase 32 deploys Debezium on basecamp with at least
         one Postgres → Kafka → Iceberg flow.
DEEP     Promoted after Y4 end — CDC operational ~6+ months with at least one
         observed schema-evolution event flowing cleanly through CDC.

Preview: what OUTLINE will answer

When Y4 Phase 32 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you keep downstream systems in sync with an OLTP database in near-real-time without bulk ETL?
  • PRINCIPLES. Read the WAL, not the tables. Events are structured and ordered per key (per table only on a single-partition topic). At-least-once delivery to downstream, so consumers must be idempotent. Schema evolution handled explicitly. Initial snapshot bootstraps consumers.
  • TRADE-OFFS. Debezium (OSS, self-hosted, flexible) vs AWS DMS (managed, less flexible) vs Fivetran (SaaS, expensive at scale). Kafka Connect model vs standalone Debezium Server. Per-table topics vs unified topic. Full row events vs delta-only.
  • TOOLS (time-stamped as of 2026-06): Debezium (via Strimzi), AWS DMS, Google Cloud Datastream, Fivetran, Airbyte, Estuary Flow, Postgres logical replication, Materialize, Kafka Connect sinks (Iceberg, Elasticsearch, JDBC).

The DEEP promotion, after Y4 with 6+ months of CDC operation, will add MASTERY (operating Debezium on basecamp), COMPARE (Debezium vs DMS vs Fivetran for different use cases), OPERATE (a specific schema-evolution or replication-slot-lag event), and CONTRIBUTE (a Debezium connector or documentation improvement).

Canonical references

  • Debezium documentation. Free at debezium.io.
  • Martin Kleppmann’s talks and blog posts on CDC and event sourcing. Free.
  • Kleppmann, Designing Data-Intensive Applications (DDIA), Chapter 11, “Stream Processing.”
  • Confluent’s blog series on CDC patterns. Free at confluent.io.
  • Postgres documentation on logical replication and replication slots. Free.

Cross-references