Stream Processing (Strimzi + Flink)
Phase 32 of /root Year 4: Kafka via Strimzi operator, stream processing with Kafka Streams + Flink Kubernetes Operator, CDC via Debezium. Exactly-once-ish semantics, watermarks + event-time windowing, idempotent consumers. All K8s-native: Kafka/KafkaTopic/KafkaConnect/FlinkDeployment as CRDs. 7-9 weeks, ~80-100 hours.
Second phase of Year 4. Streaming as a deliberate engineering choice. 7-9 weeks, ~80-100 hrs.
Y2 Phase 14 introduced queues + pub/sub. This phase goes to depth: a real Kafka cluster managed by Strimzi (the K8s-native operator), real stream processing via Kafka Streams (in-app library) and Flink (separate cluster, also operator-managed). CDC via Debezium. By phase end basecamp ingests events at high rate from Postgres, lands them in Iceberg via streaming Flink jobs, and runs Kafka Streams topology for low-latency in-pipeline analytics.
Every Kafka resource is a CRD: Kafka cluster, KafkaTopic, KafkaUser, KafkaConnect, KafkaConnector. Every Flink job is a FlinkDeployment CRD. Same K8s-native pattern as the rest of basecamp.
Prerequisites
- Phase 31 complete; lakehouse alive
- Sufficient RAM for Kafka + Flink (32GB minimum; 64GB if you can)
- 12 hrs/week budget reserved
- You accept: streaming is harder than batch. The patterns (exactly-once-ish, watermarks, late-arriving events) don’t have shortcuts.
Why this phase exists
Most data needs are batch. But some — fresh recommendations, real-time fraud detection, agent inputs, observability — need streaming. The patterns differ from batch in ways that surprise: event time vs processing time, watermarks for handling late arrivals, exactly-once-ish via idempotent producers + transactions, stateful operators that must checkpoint.
This phase installs the patterns. And it teaches the K8s-native deployment: Strimzi reduces Kafka ops from days-of-work to applying a Kafka CRD; Flink Operator handles Flink job lifecycle the same way.
The pattern-first frame
Same eight steps.
1. PROBLEM
You have events arriving continuously — Postgres row changes, user actions, sensor readings, observability data. You want to: ingest reliably, process in real-time, land in long-term storage (Iceberg), serve to consumers (other services, ML models, alerts).
Streaming infrastructure (Kafka or alternatives like Pulsar, Redpanda) is the durable substrate. Stream processing engines (Kafka Streams, Flink, Spark Structured Streaming) are the compute layer above.
2. PRINCIPLES
2.1 Streaming vs batch — when each fits
→ Pattern: streaming-vs-batch
Investigate:
- When is streaming worth the operational cost?
- “Kappa architecture” — when does it deliver vs when is it cargo-culted?
- Why do most teams that “want streaming” actually want micro-batch?
2.2 Exactly-once-ish via idempotent producers + transactions
True exactly-once across multiple systems is impossible without coordination on both ends. Kafka provides effectively-once via idempotent producers + Kafka transactions + careful consumer offset management.
→ Pattern: delivery-semantics reinforced from Y2 + Y3
Investigate:
- Walk Kafka’s transactional producer + consumer pattern.
- What goes wrong if a consumer commits offsets before processing completes?
- Why does Kafka Streams default to at-least-once and require explicit transactional mode?
2.3 Event time + watermarks
Stream processing must distinguish event time (when something happened) from processing time (when we see it). Watermarks define “we don’t expect events older than this to arrive.”
→ Pattern: event-time-windowing
Investigate:
- Walk a Flink window aggregation with watermark of 10 seconds. What happens when events arrive 5s late? 15s late?
- What’s an “allowed lateness” + side output?
- Why is watermark generation a real engineering problem?
2.4 Change Data Capture (CDC)
Debezium reads Postgres’ WAL and emits row-change events to Kafka. CDC bridges OLTP (Phase 9 Postgres) to OLAP (Phase 31 Iceberg) continuously.
→ Pattern: change-data-capture
Investigate:
- How does Debezium read Postgres logical replication slot without breaking it?
- What’s query-based CDC vs log-based CDC?
- When does CDC break (large transactions, schema changes mid-flight)?
2.5 Strimzi as the K8s-native Kafka
Strimzi provides operators for Kafka, Connect, MirrorMaker, Topics, Users. Every Kafka resource is a CRD. Configuration that previously took hours (cluster setup, security, monitoring) becomes kubectl apply.
→ Pattern: operator-pattern reinforced from Y3
Investigate:
- Walk a Strimzi
KafkaCRD: declare → controller creates StatefulSets + Services + Secrets + ZooKeeper (or KRaft). - What does Strimzi’s
KafkaTopicCRD give you over manually-managed topics? - How does Strimzi integrate with Prometheus + Grafana (built-in observability)?
2.6 Flink Kubernetes Operator + FlinkDeployment
Flink jobs run as Kubernetes operators. The FlinkDeployment CRD describes a Flink cluster + jobs; the operator manages JobManager + TaskManagers + savepoints + failover.
Investigate:
- Walk a
FlinkDeploymentCRD: declare → operator launches JM + TMs + manages savepoints. - What’s a Flink savepoint vs checkpoint, and when does each matter?
- How does the Flink operator handle stateful jobs across restarts?
3. TRADE-OFFS
| Decision | Options | Cost |
|---|---|---|
| Kafka deployment | Strimzi; raw StatefulSet; managed (MSK, Confluent Cloud) | Strimzi: K8s-native, CRD-driven (recommended). Raw: too much manual ops. Managed: convenience, $$. |
| Stream processing | Flink (K8s Operator); Kafka Streams (library); Spark Structured Streaming | Flink: full streaming, stateful. Kafka Streams: simple, library-only. Spark Streaming: micro-batch. |
| Kafka mode | KRaft (modern, no ZooKeeper); ZooKeeper-based (legacy) | KRaft: standard going forward. ZK: legacy. |
| CDC | Debezium; AWS DMS; manual triggers | Debezium: K8s-native via Strimzi KafkaConnect (recommended). Others: vendor-specific or fragile. |
4. TOOLS (as of 2026-06)
K8s-native stack
- Strimzi — Kafka operator;
Kafka,KafkaTopic,KafkaUser,KafkaConnect,KafkaConnectorCRDs - Flink Kubernetes Operator —
FlinkDeployment,FlinkSessionJobCRDs - Debezium — runs as a KafkaConnect connector via Strimzi
- Confluent Schema Registry (or Apicurio for K8s-native alternative) — Helm-deployed
Reading
- “Kafka: The Definitive Guide” (Shapira et al., 2nd ed.)
- “Streaming Systems” (Akidau, Chernyak, Lax) — streaming theory book
- “Stream Processing with Apache Flink” (Hueske, Kalavri)
- Strimzi docs
- Flink Kubernetes Operator docs
5. MASTERY: Streaming alive on basecamp
[ ] Strimzi installed via Flux; deploy a 3-broker Kafka cluster via `Kafka` CRD
[ ] Verify durability: produce, kill a broker, recover, verify no message loss
[ ] Create `KafkaTopic` CRDs for at least 3 topics; partition counts deliberate
[ ] Deploy Schema Registry; register one Avro schema
[ ] Deploy Debezium as `KafkaConnect` + `KafkaConnector`; ingest Postgres changes to Kafka
[ ] CDC end-to-end: Postgres INSERT → Kafka topic → consumer
[ ] Write a small Kafka Streams app in Go or Java; process events in a topology
[ ] Deploy Flink Kubernetes Operator via Flux; deploy a `FlinkDeployment`
[ ] Write a Flink streaming job: read from Kafka → window aggregation with watermark → write to Iceberg
[ ] Verify exactly-once-ish: kill TaskManager mid-process; recover via checkpoint
[ ] Trigger a schema evolution in Postgres; observe Debezium handle (or fail gracefully) downstream
6. COMPARE: Pulsar or Redpanda
Pick one:
- Apache Pulsar — multi-tenant by design, tiered storage
- Redpanda — Kafka API-compatible, no JVM, C++-based, single-binary
400-word reflection on what each gets right vs Kafka.
7. OPERATE
- 4-5 runbooks: Kafka broker disk full, Strimzi cluster degraded, Debezium replication slot disconnect, Flink savepoint corrupted, schema evolution broke a consumer
- 2-3 ADRs (Strimzi over MSK, Flink over Kafka Streams for stateful, KRaft over ZK)
- Weekly log
8. CONTRIBUTE
- Apache Kafka — docs, edge cases
- Strimzi (CNCF) — operator improvements
- Flink Kubernetes Operator — connectors, docs
- Debezium connector for any database you care about
What ships from this phase
- Tier 5 deepened: Strimzi Kafka + Schema Registry + Debezium + Flink Operator alive on basecamp
- CDC pipeline Postgres → Kafka → Iceberg running continuously
- Streaming runbooks
Validation criteria
[ ] Strimzi Kafka 3-broker cluster operational via CRDs
[ ] Debezium CDC end-to-end working
[ ] Flink Operator deployed; one streaming job running
[ ] CDC → Iceberg pipeline operational
[ ] All 11 operational depth checks
[ ] Compare reflection (400 words)
[ ] 4-5 streaming runbooks
[ ] 2-3 ADRs
[ ] Pattern entries:
- streaming-vs-batch → OUTLINE
- change-data-capture → OUTLINE
- event-time-windowing → OUTLINE
- delivery-semantics reinforced toward DEEP
- operator-pattern reinforced via Strimzi + Flink Operator
[ ] Exit Test passed
Exit Test
Time: 3 hours.
Part 1: Build (90 min)
Add a new streaming pipeline: a new Postgres table → Debezium → new Kafka topic → Flink job with windowed aggregation → Iceberg sink. Verify exactly-once-ish via deliberate failure injection.
Part 2: Diagnose (75 min)
A streaming scenario (e.g., “Flink job is reading from Kafka at 1/10th expected throughput”). Possible: partition count mismatch, parallelism misconfig, GC pressure on JobManager, network bandwidth.
Part 3: Articulate (15 min)
~400 words: “Defend Strimzi over a managed Kafka service (MSK or Confluent Cloud) for basecamp. Cite K8s-native ecosystem composition + the meta-pattern.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Streaming when batch would work | Streaming costs 5-10× more operationally |
| Claiming exactly-once delivery | Usually a marketing lie — explain what you actually mean |
| Long-running stateful Flink job without savepoints | Restart loses state forever |
| CDC without thinking about schema evolution | Schema changes break the pipeline |
| Hand-managed Kafka without Strimzi | The ops burden multiplies; CRDs are the modern answer |
Patterns touched this phase
streaming-vs-batch— OUTLINEchange-data-capture— OUTLINEevent-time-windowing— OUTLINEdelivery-semanticsreinforced toward DEEPoperator-patternreinforced