Stream Processing: Redpanda + Flink

Third phase of Year 3. Real-time data: events through a broker, stateful processing with watermarks. The Lambda-vs-Kappa architectural decision lands here. ~8 weeks, ~90 hrs.

Prerequisites

Phase 15 complete — lakehouse operational; Iceberg tables exist

You accept: streaming is harder than batch. State, time, watermarks, exactly-once-ish — every problem has trade-offs that don’t exist at rest.

Why this phase exists

Year 4’s services/llm-gateway/ streams response tokens. Year 5’s services/aiops/ reacts to alerts in real time. Year 5’s portal pushes live dashboards. All of these depend on a streaming foundation: a broker (Redpanda) + a stateful processor (Flink). The pattern (durable log + replayable consumers + stateful operators with checkpoints) outlives whatever specific tools you happen to be running — same shape Kafka pioneered in 2011, same shape Redpanda iterates on in 2026, same shape whatever replaces Redpanda in 2031 will keep.

This phase also wires the first half of “Homelab life API” — GitHub events flow into Redpanda, Flink enriches, output lands in the lakehouse from P15. It’s the streaming half of the Year 3 composition recipe and the first time the platform has a live data path, not just a periodic ETL.

→ See: stream-vs-batch, storage-and-data

1. PROBLEM

You have data that’s continuously produced (events, logs, metrics, user actions). You want to:

Buffer it so producers and consumers decouple in time
Process it statefully (sessionization, joins, aggregations) with low latency
Replay history when business logic changes
Handle out-of-order events without breaking your numbers

Brokers solve buffering + replay. Stream processors (Flink, Kafka Streams, Spark Structured Streaming) solve stateful processing. Together they’re the substrate every event-driven architecture in 2026 stands on.

2. PRINCIPLES

2.1 The broker as decoupling layer

Producer writes events; broker stores them; consumers read at their own pace. Replay = restart the consumer at an old offset. The broker is durable storage with subscription semantics — that’s the whole pattern, and it’s what separates a queue (delete-on-read) from a log (retain-and-replay).
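
The queue-versus-log distinction above can be sketched in a few lines. This is an in-memory toy, not Redpanda's implementation; the class names Queue and Log and the consumer names are hypothetical, chosen only for illustration.

```python
from collections import deque


class Queue:
    """Delete-on-read: once a message is consumed it is gone."""

    def __init__(self):
        self._items = deque()

    def publish(self, msg):
        self._items.append(msg)

    def consume(self):
        return self._items.popleft() if self._items else None


class Log:
    """Retain-and-replay: reads never delete; each consumer tracks its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, msg):
        self._records.append(msg)
        return len(self._records) - 1  # offset of the new record

    def poll(self, consumer, max_records=10):
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Replay = move the consumer back to an older offset."""
        self._offsets[consumer] = offset


log = Log()
for event in ["push", "fork", "star"]:
    log.append(event)

first_read = log.poll("enricher")
log.seek("enricher", 0)              # business logic changed: reprocess from the start
replayed = log.poll("enricher")
late_joiner = log.poll("dashboard")  # a new consumer still sees the full history

queue = Queue()
for event in ["push", "fork", "star"]:
    queue.publish(event)
drained = [queue.consume() for _ in range(3)]
after_drain = queue.consume()        # nothing left to replay
```

The log hands the same three events to the original read, the replay, and the late-joining consumer; the queue has nothing left once it is drained.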

Investigate:

Read Redpanda’s “Why we rewrote Kafka in C++” — they kept the protocol; rewrote everything else. The decision to keep the wire protocol is the lesson.
Set up a 3-node Redpanda cluster on basecamp; produce + consume; observe partitioning.
What’s a consumer group? What does “exactly-once” mean for it? (Spoiler: it means transactional offset commits, not “the broker delivers each message once”.)
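
As a warm-up for the partitioning and consumer-group questions, here is a minimal sketch of the two mappings involved: key to partition, and partition to group member. It assumes a crc32 hash and a round-robin assignor purely for illustration; real clients use their own partitioner hash and configurable assignment strategies, and the function names here are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable key -> partition mapping. crc32 is a stand-in for the real partitioner hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, members):
    """Round-robin assignment: each partition is owned by exactly one group member."""
    members = sorted(members)
    return {p: members[i % len(members)] for i, p in enumerate(partitions)}


NUM_PARTITIONS = 6
partitions = list(range(NUM_PARTITIONS))

# The same key always lands on the same partition, which preserves per-key ordering.
p1 = partition_for("repo:mlship", NUM_PARTITIONS)
p2 = partition_for("repo:mlship", NUM_PARTITIONS)

# Two members split six partitions three each; adding a third triggers a rebalance.
two_members = assign(partitions, ["consumer-a", "consumer-b"])
three_members = assign(partitions, ["consumer-a", "consumer-b", "consumer-c"])
```

A consumer group scales reads by dividing partitions among members, so the partition count is the ceiling on useful parallelism within one group.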

2.2 Time in streams

→ Pattern: stream-processing

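Event time vs processing time. A watermark is the processor’s running claim that no event older than a given timestamp is still in flight, so it bounds how late an event can arrive and still be counted. Get watermarks wrong and your hourly aggregates either lose data (too aggressive) or fire far too late, in the worst case not at all (too conservative).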

Investigate:

Read DDIA Ch. 11 (stream processing) — the time-and-windows treatment is canonical.
In Flink: write a tumbling window aggregation; observe behavior with out-of-order events.
What’s a watermark? When do you set it conservatively (correctness) vs aggressively (latency)?
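
Before trying this in Flink, the watermark trade-off can be simulated in plain Python. The sketch below assumes a bounded-out-of-orderness watermark (max event time seen minus a fixed delay) and drops any event whose window has already closed; the function name and the sample timestamps are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(event_times, window_size, max_out_of_orderness):
    """Count events per tumbling window in event time.

    event_times: event timestamps in arrival order (may be out of order).
    Watermark = max event time seen - max_out_of_orderness.
    A window [start, start + window_size) fires once the watermark reaches its end;
    events arriving for an already-closed window are dropped.
    Returns (fired windows, still-open windows, dropped count).
    """
    open_windows = defaultdict(int)   # window start -> running count
    fired = {}                        # window start -> final count
    dropped = 0
    max_seen = float("-inf")
    watermark = float("-inf")

    for event_time in event_times:
        start = event_time - (event_time % window_size)
        if start + window_size <= watermark:
            dropped += 1              # window already closed: late event
            continue
        open_windows[start] += 1
        max_seen = max(max_seen, event_time)
        watermark = max_seen - max_out_of_orderness
        for w in sorted(open_windows):
            if w + window_size <= watermark:
                fired[w] = open_windows.pop(w)
    return fired, dict(open_windows), dropped


# Events 8 and 9 belong to window [0, 10) but arrive after later events.
arrivals = [1, 4, 12, 8, 15, 23, 9, 31]

aggressive = tumbling_counts(arrivals, window_size=10, max_out_of_orderness=0)
balanced = tumbling_counts(arrivals, window_size=10, max_out_of_orderness=5)
conservative = tumbling_counts(arrivals, window_size=10, max_out_of_orderness=25)
```

With zero tolerance both stragglers are dropped, with a delay of 5 one is rescued, and with a delay of 25 nothing is dropped but no window has fired by the end of the input. That is the correctness-versus-latency dial in miniature.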

2.3 State management

Streams without state are trivial. Streams with state are 90% of the problem. Sessionization, joins, dedup, running aggregations — all state, all needing to survive a restart.

Investigate:

Flink keyed state, operator state, broadcast state — when each.
RocksDB state backend — why? What’s the trade-off vs heap?
Checkpoints + savepoints — how Flink survives a job restart, and why savepoints are the migration primitive.
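
The core idea behind checkpoint recovery is that operator state and source position are snapshotted together, so a restart resumes both from the same point. A minimal sketch of that pairing, assuming a single keyed counter over a replayable list; CountingJob is a hypothetical class, not Flink's API, and it ignores distributed barriers entirely.

```python
import copy


class CountingJob:
    """Keyed running count over a replayable source, with snapshot and restore."""

    def __init__(self, source):
        self.source = source      # list standing in for a replayable log partition
        self.offset = 0           # next record to read
        self.counts = {}          # keyed state: key -> running count

    def step(self):
        key = self.source[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self):
        # State and source position are captured together; that pairing is the point.
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, source, snapshot):
        job = cls(source)
        job.offset = snapshot["offset"]
        job.counts = copy.deepcopy(snapshot["counts"])
        return job


events = ["repo-a", "repo-b", "repo-a", "repo-c", "repo-a", "repo-b"]

job = CountingJob(events)
for _ in range(3):
    job.step()
snapshot = job.checkpoint()       # in a real job this goes to durable storage
job.step()                        # progress made after the checkpoint...
del job                           # ...is lost when the worker dies

recovered = CountingJob.restore(events, snapshot)
while recovered.offset < len(events):
    recovered.step()
```

The recovered job reprocesses the record that was handled after the checkpoint, yet the final counts match a run with no failure, because the state it was applied to was rolled back as well.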

2.4 Delivery semantics revisited

→ Pattern: delivery-semantics (DEEP from Year 2)

At-most-once, at-least-once, exactly-once-ish. In streaming, exactly-once-ish is real but expensive (transactions, idempotent producers, 2PC at the sink). The “ish” is doing real work in that sentence — read the asterisks.

Investigate:

Configure Kafka/Redpanda transactions for exactly-once.
What’s the cost? What’s the asterisk? (Hint: it’s exactly-once within the system, and the boundary at your external sink is where the lie lives unless you cooperate.)
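
One way to see where the asterisk lives: at-least-once delivery plus an idempotent sink gives effectively-once results, while the same delivery into an append-only sink double counts. A minimal sketch, assuming every event carries a unique id; the event shape and function name are hypothetical.

```python
def deliver_at_least_once(events, redeliver_from):
    """Yield every event, then redeliver the tail, as happens after a consumer
    restart that resumes from the last committed offset."""
    yield from events
    yield from events[redeliver_from:]


events = [
    {"id": "e1", "repo": "mlship", "commits": 2},
    {"id": "e2", "repo": "mlship", "commits": 1},
    {"id": "e3", "repo": "basecamp", "commits": 4},
]

# Naive sink: append-only, so redelivered events are counted twice.
naive_total = 0
for ev in deliver_at_least_once(events, redeliver_from=1):
    naive_total += ev["commits"]

# Idempotent sink: upsert keyed by event id, so a redelivery overwrites itself.
table = {}
for ev in deliver_at_least_once(events, redeliver_from=1):
    table[ev["id"]] = ev
idempotent_total = sum(ev["commits"] for ev in table.values())
```

The true total is 7. The naive sink reports 12 because e2 and e3 arrive twice; the keyed upsert reports 7. The sink had to cooperate for the guarantee to hold end to end.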

2.5 Lambda vs Kappa architectures

→ Pattern: lambda-and-kappa

Lambda: batch + stream paths, results unified. Kappa: stream is the only path; batch is just slow stream. The Kappa argument is that maintaining two codebases for the same logic is the actual cost; the Lambda argument is that some backfills are too expensive to do as a stream replay.

Investigate:

For personal-api: would you use Lambda (Airflow batch + Flink stream) or Kappa (Flink only with replay)?
Real-world tradeoff: complexity vs flexibility. You’ll commit to an answer at the end of P17 when both halves exist.
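
To make the two-codebases argument concrete, here is a toy version of both architectures computing commits per repo. It assumes a tiny in-memory event list and a hypothetical BATCH_CUTOFF marking what the last batch run covered; none of this is Airflow or Flink code.

```python
from collections import Counter

events = [
    (1, "mlship"), (2, "basecamp"), (3, "mlship"),   # covered by the last batch run
    (4, "mlship"), (5, "basecamp"),                  # arrived since then
]
BATCH_CUTOFF = 3  # hypothetical: last event time included in the batch run


def batch_view(history):
    """Lambda batch layer: periodic recompute over data at rest."""
    return Counter(repo for ts, repo in history if ts <= BATCH_CUTOFF)


def speed_view(stream):
    """Lambda speed layer: a second implementation of the same logic, incremental."""
    counts = Counter()
    for ts, repo in stream:
        if ts > BATCH_CUTOFF:
            counts[repo] += 1
    return counts


def kappa_view(log):
    """Kappa: one implementation; a backfill is a replay of the log from the start."""
    counts = Counter()
    for _ts, repo in log:
        counts[repo] += 1
    return counts


lambda_answer = batch_view(events) + speed_view(events)   # merged at query time
kappa_answer = kappa_view(events)
```

Both give the same answer here only because batch_view and speed_view agree on the logic. Keeping them in agreement as the logic evolves is the Lambda cost; replaying the whole log on every logic change is the Kappa cost.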

2.6 Schema evolution in streams

The producer adds a field. The consumer hasn’t updated. Now what?

Investigate:

Schema Registry (Confluent, or Redpanda’s built-in).
Avro, Protobuf, JSON Schema — when each.
Forward + backward compatibility rules. The compatibility mode you pick at registry-config time is the contract every producer/consumer pair will live with.
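
The compatibility rules reduce to one question: can a reader on schema X decode data written with schema Y? A simplified sketch, assuming schemas are plain dicts of field name to an optional default and ignoring type changes entirely; real registries apply richer, format-specific rules, and the function names here are hypothetical.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded with `reader`.

    Schemas are dicts: field name -> {"default": value}, or {} when there is no default.
    Simplified rule: every reader field must either exist in the writer's data
    or carry a default. Fields the reader does not know are ignored.
    """
    return all(name in writer or "default" in spec for name, spec in reader.items())


def backward_compatible(new, old):
    """Consumers on the new schema can read data produced with the old one."""
    return can_read(reader=new, writer=old)


def forward_compatible(new, old):
    """Consumers still on the old schema can read data produced with the new one."""
    return can_read(reader=old, writer=new)


v1 = {"sha": {}, "repo": {}}
v2_with_default = {"sha": {}, "repo": {}, "branch": {"default": "main"}}
v2_no_default = {"sha": {}, "repo": {}, "branch": {}}
v2_dropped = {"sha": {}}

added_with_default = (backward_compatible(v2_with_default, v1), forward_compatible(v2_with_default, v1))
added_no_default = (backward_compatible(v2_no_default, v1), forward_compatible(v2_no_default, v1))
dropped_required = (backward_compatible(v2_dropped, v1), forward_compatible(v2_dropped, v1))
```

Adding a field with a default is safe in both directions. Adding one without a default breaks new readers on old data, and dropping a field that has no default breaks old readers on new data. That answers the opening question: an un-updated consumer survives a new field because unknown fields are ignored.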

3. TRADE-OFFS

Decision	Option A	Option B	Option C, or when to pick
Broker	Redpanda (C++, no ZK)	Kafka (JVM, mature)	Pulsar
Stream processor	Flink (stateful, low-lat)	Kafka Streams (JVM only)	Spark Structured Streaming (batch+stream)
Schema format	Avro	Protobuf	JSON Schema
State backend (Flink)	RocksDB (on-disk)	heap (in-memory)	RocksDB for production scale
Architecture	Lambda (batch + stream)	Kappa (stream only)	Kappa is cleaner; Lambda hedges

4. TOOLS (as of 2025-10)

Redpanda 24+ (Kafka-compatible; simpler ops)
Apache Flink 1.20+
Flink Kubernetes Operator (lifecycle on K8s)
Schema Registry (Redpanda’s built-in or Confluent’s OSS version)
rpk (Redpanda’s CLI) or the Kafka CLI tools for ops

5. MASTERY

5.1 Reading list

Required	Why
DDIA Ch. 11 (Stream Processing)	The theory
“Streaming Systems” (Akidau, Chernyak, Lax)	Watermarks + windows in depth
Flink docs — DataStream API	The implementation
Redpanda docs — Architecture	The broker

5.2 Operational depth checklist

[ ] Deploy 3-node Redpanda on basecamp; verify with rpk
[ ] Deploy Flink Kubernetes Operator; submit a sample job
[ ] Build the GitHub-events pipeline:
    - Producer: webhook → Redpanda topic abukix.commits-raw
    - Flink job: parse + enrich + emit to abukix.commits (Iceberg sink)
[ ] Configure exactly-once via Flink + Iceberg sink transactions
[ ] Add a tumbling 1-hour window aggregation: commits per repo per hour
[ ] Force out-of-order events; observe watermark behavior; tune
[ ] Configure Flink checkpoints to MinIO; force a TaskManager kill; verify recovery
[ ] Set up Schema Registry; evolve a schema; verify forward/backward compat
[ ] Build a Grafana dashboard from Flink metrics (Prometheus exporter)
[ ] Replay one day of events from Redpanda; verify idempotent landing in Iceberg

5.3 First half of Homelab life API

This phase lands the streaming half of the Y3 composition recipe (see projects/basecamp for where the full recipe lives):

GitHub webhook → Redpanda (abukix.commits-raw)
                    ↓
                 Flink (parse, enrich)
                    ↓
              Iceberg (abukix.commits)

P17 will add the batch backfill (Airflow). P18 will add Trino + REST API. By the end of P18, every basecamp resident can ask “how many commits did I make to mlship in March 2027?” and get an answer that ran through your own platform end-to-end.
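
The parse-and-enrich step in the diagram above might look roughly like the function below, written in plain Python rather than as a Flink operator. It assumes the GitHub push-event payload fields repository.full_name and commits[].id, .timestamp, .author.name; the output row shape, the hour_bucket field, and the sample repository name are hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_and_enrich(raw):
    """Flatten one push-event payload into one row per commit.

    Returns [] for payloads that are not valid JSON or lack the expected fields,
    so a malformed webhook cannot crash the job.
    """
    try:
        payload = json.loads(raw)
        repo = payload["repository"]["full_name"]
        rows = []
        for commit in payload["commits"]:
            # Normalize a trailing "Z" so older Python versions can parse it.
            ts = datetime.fromisoformat(commit["timestamp"].replace("Z", "+00:00"))
            utc = ts.astimezone(timezone.utc)
            rows.append({
                "sha": commit["id"],
                "repo": repo,
                "author": commit["author"]["name"],
                "event_time": utc.isoformat(),
                "hour_bucket": utc.strftime("%Y-%m-%dT%H:00Z"),  # enrichment for hourly rollups
            })
        return rows
    except (ValueError, KeyError, TypeError, AttributeError):
        return []


raw = json.dumps({
    "repository": {"full_name": "example-org/mlship"},   # hypothetical repository
    "commits": [
        {"id": "abc123", "timestamp": "2027-03-01T09:15:00-05:00", "author": {"name": "Ada"}},
    ],
})
rows = parse_and_enrich(raw)
bad = parse_and_enrich("not json")
```

Normalizing to UTC at parse time means the downstream window aggregation and the Iceberg partitioning agree on which hour a commit belongs to. In a real job, rejected payloads would more likely go to a dead-letter topic than be silently dropped.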

6. COMPARE: Flink vs Spark Structured Streaming

Both consume from Kafka, write to Iceberg. Build the same pipeline twice. Compare:

Latency — micro-batch vs true streaming.
State management ergonomics — Flink’s keyed state vs Spark’s stateful operators.
Failure recovery — checkpoint cost and restart speed.
Backfill story — what happens when you need to replay a month of events.
Operational complexity — Flink Operator vs Spark Operator on K8s.

Deliverable: a 400-word write-up of the comparison.

7. OPERATE

3+ runbooks (redpanda-broker-down, flink-job-restart-from-savepoint, consumer-lag-investigation)
1+ postmortem
Weekly log

8. CONTRIBUTE

Redpanda, Flink, Kafka clients, OpenLineage Flink integration. The OpenLineage Flink integration in particular is under-documented in 2026 — a real-world example PR has high payoff.

Validation criteria

[ ] All 10 operational depth checks
[ ] Redpanda + Flink in basecamp Tier 4
[ ] GitHub-events streaming pipeline working end-to-end
[ ] Flink vs Spark Structured Streaming comparison
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - stream-processing → DEEP
    - lambda-and-kappa → OUTLINE (P17 brings the batch side; DEEP after)
    - delivery-semantics → reinforced
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — given a fresh topic + an Iceberg table, write a Flink job that consumes events, sessionizes by user, emits hourly aggregates with watermark-tolerant logic. Exactly-once.
Diagnose (60 min) — Flink job is checkpointing slowly; consumer lag growing; find why (state size, network, sink backpressure).
Articulate (30 min) — 600 words: “Why is exactly-once delivery a lie, and what does Flink + Iceberg’s exactly-once actually deliver?”

Anti-patterns

Anti-pattern	Why
Treating streams as queues	Streams replay; queues don’t. Different mental model.
State without checkpoints	One TaskManager death = lost state
Watermarks too aggressive	Late events dropped silently; wrong numbers
Schema-less topics	Brittle consumers; schema registry exists for a reason

Patterns deepened this phase

stream-processing → DEEP
lambda-and-kappa → OUTLINE
delivery-semantics → reinforced

→ Next: Phase 17: Batch Processing