Stream Processing: Redpanda + Flink
Third phase of Year 3. Real-time data: events through a broker, stateful processing with watermarks. The Lambda-vs-Kappa architectural decision lands here. ~8 weeks, ~90 hrs.
Prerequisites
- Phase 15 complete — lakehouse operational; Iceberg tables exist
- You accept: streaming is harder than batch. State, time, watermarks, exactly-once-ish — every problem has trade-offs that don’t exist at rest.
Why this phase exists
Year 4’s services/llm-gateway/ streams response tokens. Year 5’s services/aiops/ reacts to alerts in real time. Year 5’s portal pushes live dashboards. All of these depend on a streaming foundation: a broker (Redpanda) + a stateful processor (Flink). The pattern (durable log + replayable consumers + stateful operators with checkpoints) outlives whatever specific tools you happen to be running — same shape Kafka pioneered in 2011, same shape Redpanda iterates on in 2026, same shape whatever replaces Redpanda in 2031 will keep.
This phase also wires the first half of “Homelab life API” — GitHub events flow into Redpanda, Flink enriches, output lands in the lakehouse from P15. It’s the streaming half of the Year 3 composition recipe and the first time the platform has a live data path, not just a periodic ETL.
→ See: stream-vs-batch, storage-and-data
1. PROBLEM
You have data that’s continuously produced (events, logs, metrics, user actions). You want to:
- Buffer it so producers and consumers decouple in time
- Process it stateful (sessionization, joins, aggregations) with low latency
- Replay history when business logic changes
- Handle out-of-order events without breaking your numbers
Brokers solve buffering + replay. Stream processors (Flink, Kafka Streams, Spark Structured Streaming) solve stateful processing. Together they’re the substrate every event-driven architecture in 2026 stands on.
2. PRINCIPLES
2.1 The broker as decoupling layer
Producer writes events; broker stores them; consumers read at their own pace. Replay = restart the consumer at an old offset. The broker is durable storage with subscription semantics — that’s the whole pattern, and it’s what separates a queue (delete-on-read) from a log (retain-and-replay).
Investigate:
- Read Redpanda’s “Why we rewrote Kafka in C++” — they kept the protocol; rewrote everything else. The decision to keep the wire protocol is the lesson.
- Set up a 3-node Redpanda cluster on basecamp; produce + consume; observe partitioning.
- What’s a consumer group? What does “exactly-once” mean for it? (Spoiler: it means transactional offset commits, not “the broker delivers each message once”.)
2.2 Time in streams
→ Pattern: stream-processing
Event time vs processing time. Watermarks bound how late an event can arrive. Get watermarks wrong and your hourly aggregates either lose data (too aggressive) or never fire (too conservative).
Investigate:
- Read DDIA Ch. 11 (stream processing) — the time-and-windows treatment is canonical.
- In Flink: write a tumbling window aggregation; observe behavior with out-of-order events.
- What’s a watermark? When do you set it conservatively (correctness) vs aggressively (latency)?
2.3 State management
Streams without state are trivial. Streams with state are 90% of the problem. Sessionization, joins, dedup, running aggregations — all state, all needing to survive a restart.
Investigate:
- Flink keyed state, operator state, broadcast state — when each.
- RocksDB state backend — why? What’s the trade-off vs heap?
- Checkpoints + savepoints — how Flink survives a job restart, and why savepoints are the migration primitive.
2.4 Delivery semantics revisited
→ Pattern: delivery-semantics (DEEP from Year 2)
At-most-once, at-least-once, exactly-once-ish. In streaming, exactly-once-ish is real but expensive (transactions, idempotent producers, 2PC at the sink). The “ish” is doing real work in that sentence — read the asterisks.
Investigate:
- Configure Kafka/Redpanda transactions for exactly-once.
- What’s the cost? What’s the asterisk? (Hint: it’s exactly-once within the system, and the boundary at your external sink is where the lie lives unless you cooperate.)
2.5 Lambda vs Kappa architectures
→ Pattern: lambda-and-kappa
Lambda: batch + stream paths, results unified. Kappa: stream is the only path; batch is just slow stream. The Kappa argument is that maintaining two codebases for the same logic is the actual cost; the Lambda argument is that some backfills are too expensive to do as a stream replay.
Investigate:
- For
personal-api: would you use Lambda (Airflow batch + Flink stream) or Kappa (Flink only with replay)? - Real-world tradeoff: complexity vs flexibility. You’ll commit to an answer at the end of P17 when both halves exist.
2.6 Schema evolution in streams
The producer adds a field. The consumer hasn’t updated. Now what?
Investigate:
- Schema Registry (Confluent, or Redpanda’s built-in).
- Avro, Protobuf, JSON Schema — when each.
- Forward + backward compatibility rules. The compatibility mode you pick at registry-config time is the contract every producer/consumer pair will live with.
3. TRADE-OFFS
| Decision | Option A | Option B | When |
|---|---|---|---|
| Broker | Redpanda (C++, no ZK) | Kafka (JVM, mature) | Pulsar |
| Stream processor | Flink (stateful, low-lat) | Kafka Streams (JVM only) | Spark Structured Streaming (batch+stream) |
| Schema format | Avro | Protobuf | JSON Schema |
| State backend (Flink) | RocksDB (on-disk) | heap (in-memory) | RocksDB for production scale |
| Architecture | Lambda (batch + stream) | Kappa (stream only) | Kappa is cleaner; Lambda hedges |
4. TOOLS (as of 2025-10)
- Redpanda 24+ (Kafka-compatible; simpler ops)
- Apache Flink 1.20+
- Flink Kubernetes Operator (lifecycle on K8s)
- Schema Registry (Redpanda’s built-in or Confluent’s OSS version)
- kafka-cli / rpk for ops
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| DDIA Ch. 11 (Stream Processing) | The theory |
| ”Streaming Systems” (Akidau, Chernyak, Lax) | Watermarks + windows in depth |
| Flink docs — DataStream API | The implementation |
| Redpanda docs — Architecture | The broker |
5.2 Operational depth checklist
[ ] Deploy 3-node Redpanda on basecamp; verify with rpk[ ] Deploy Flink Kubernetes Operator; submit a sample job[ ] Build the GitHub-events pipeline: - Producer: webhook → Redpanda topic abukix.commits-raw - Flink job: parse + enrich + emit to abukix.commits (Iceberg sink)[ ] Configure exactly-once via Flink + Iceberg sink transactions[ ] Add a tumbling 1-hour window aggregation: commits per repo per hour[ ] Force out-of-order events; observe watermark behavior; tune[ ] Configure Flink checkpoints to MinIO; force a TaskManager kill; verify recovery[ ] Set up Schema Registry; evolve a schema; verify forward/backward compat[ ] Build a Grafana dashboard from Flink metrics (Prometheus exporter)[ ] Replay one day of events from Redpanda; verify idempotent landing in Iceberg5.3 First half of Homelab life API
This phase lands the streaming half of the Y3 composition recipe (see projects/basecamp for where the full recipe lives):
GitHub webhook → Redpanda (abukix.commits-raw) ↓ Flink (parse, enrich) ↓ Iceberg (abukix.commits)P17 will add the batch backfill (Airflow). P18 will add Trino + REST API. By the end of P18, every basecamp resident can ask “how many commits did I make to mlship in March 2027?” and get an answer that ran through your own platform end-to-end.
6. COMPARE: Flink vs Spark Structured Streaming
Both consume from Kafka, write to Iceberg. Build the same pipeline twice. Compare:
- Latency — micro-batch vs true streaming.
- State management ergonomics — Flink’s keyed state vs Spark’s stateful operators.
- Failure recovery — checkpoint cost and restart speed.
- Backfill story — what happens when you need to replay a month of events.
- Operational complexity — Flink Operator vs Spark Operator on K8s.
400 words.
7. OPERATE
- 3+ runbooks (
redpanda-broker-down,flink-job-restart-from-savepoint,consumer-lag-investigation) - 1+ postmortem
- Weekly log
8. CONTRIBUTE
Redpanda, Flink, Kafka clients, OpenLineage Flink integration. The OpenLineage Flink integration in particular is under-documented in 2026 — a real-world example PR has high payoff.
Validation criteria
[ ] All 10 operational depth checks[ ] Redpanda + Flink in basecamp Tier 4[ ] GitHub-events streaming pipeline working end-to-end[ ] Flink vs Spark Structured Streaming comparison[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries[ ] Pattern entries deepened: - stream-processing → DEEP - lambda-and-kappa → OUTLINE (P17 brings the batch side; DEEP after) - delivery-semantics → reinforced[ ] Exit Test passedExit Test
Time: 3 hours.
- Build (90 min) — given a fresh topic + an Iceberg table, write a Flink job that consumes events, sessionizes by user, emits hourly aggregates with watermark-tolerant logic. Exactly-once.
- Diagnose (60 min) — Flink job is checkpointing slowly; consumer lag growing; find why (state size, network, sink backpressure).
- Articulate (30 min) — 600 words: “Why is exactly-once delivery a lie, and what does Flink + Iceberg’s exactly-once actually deliver?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Treating streams as queues | Streams replay; queues don’t. Different mental model. |
| State without checkpoints | One TaskManager death = lost state |
| Watermarks too aggressive | Late events dropped silently; wrong numbers |
| Schema-less topics | Brittle consumers; schema registry exists for a reason |
Patterns deepened this phase
- stream-processing → DEEP
- lambda-and-kappa → OUTLINE
- delivery-semantics → reinforced
→ Next: Phase 17: Batch Processing