Streaming vs Batch

Two processing paradigms with different latency/cost/complexity trade-offs. Streaming: low latency, continuous. Batch: high throughput, periodic. Most modern stacks run both.

Streaming: events processed as they arrive. Batch: events processed in scheduled chunks. The choice shapes latency, cost, and operational complexity. Status: STUB, to be promoted to OUTLINE in Y4 Phase 32.

What this pattern is

Streaming and batch are two processing paradigms with different trade-off shapes. Streaming (Kafka Streams, Flink, Spark Structured Streaming) processes events as they arrive: seconds to minutes of latency, continuously running jobs, complex semantics around windowing and watermarks, harder backfill. Batch (Spark jobs scheduled by Airflow, Argo Workflows, or Dagster) processes events in chunks: minutes to hours of latency, periodic resource use, simpler semantics, trivial backfill (just re-run the job).

The same logical transformation often has both a streaming implementation and a batch implementation in production stacks. The kappa architecture says “just use streaming everywhere”; the lambda architecture says “run both for the same source of truth.” Neither is universally correct: the right answer depends on the workload, the team, and what already exists.
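A minimal sketch of one logical transformation (per-user totals) written both ways, in plain Python with no Kafka, Flink, or Spark involved. The event shape and the function names are hypothetical. The batch form reads the whole chunk and returns one final answer; the streaming form keeps running state and emits an update per event.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (user_id, amount): a hypothetical event shape


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch form: consume the whole chunk, then return one final result."""
    totals: Dict[str, int] = defaultdict(int)
    for user_id, amount in events:
        totals[user_id] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Streaming form: keep running state and emit an update after every event."""
    totals: Dict[str, int] = defaultdict(int)
    for user_id, amount in events:
        totals[user_id] += amount
        yield user_id, totals[user_id]


events = [("a", 5), ("b", 3), ("a", 2)]
batch_result = batch_totals(events)
stream_updates = list(streaming_totals(events))
```

Once the stream has caught up, the latest update per key matches the batch result. That equivalence is what a lambda-style setup has to keep true across two codebases.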

The /root choice: Kafka + Flink for streaming; Spark Operator + Argo Workflows for batch. Both K8s-native; both observed with the same telemetry; both feeding Iceberg as the unified storage layer. Senior engineers don’t religiously pick one; they articulate when each fits.

The pattern’s central insight is that latency and throughput trade off against each other, and different workloads want different points on that curve. Real-time dashboards need seconds; hourly reports can wait 15 minutes; monthly financial reports can wait until end-of-month. Trying to serve all three needs with one paradigm produces either over-engineered batch pipelines that pretend to be real-time, or expensive streaming pipelines that produce data nobody looks at more than once a day.
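The latency-driven choice can be sketched as a single comparison. The function name and the 15-minute default are illustrative assumptions, not a basecamp rule; the real cutoff depends on how often a batch job can affordably run.

```python
def choose_paradigm(max_staleness_s: float, batch_floor_s: float = 15 * 60) -> str:
    """Pick 'streaming' only when the freshness target is tighter than batch can meet.

    batch_floor_s is an assumed lower bound on the staleness a scheduled
    job can deliver (schedule interval plus runtime).
    """
    if max_staleness_s <= 0:
        raise ValueError("max_staleness_s must be positive")
    return "streaming" if max_staleness_s < batch_floor_s else "batch"


dashboard = choose_paradigm(max_staleness_s=5)             # real-time dashboard
hourly_report = choose_paradigm(max_staleness_s=15 * 60)   # can wait 15 minutes
monthly_report = choose_paradigm(max_staleness_s=30 * 24 * 3600)
```

With these assumed numbers, only the dashboard lands on streaming; both reports fit in batch.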

Modern stacks often use micro-batch as a middle ground. Spark Structured Streaming processes small batches every few seconds. Flink can do true streaming with per-event processing but often batches internally for throughput. Iceberg’s small-file problem forces streaming writers to batch their commits. The line between “streaming” and “batch” is blurrier than the marketing suggests.
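A minimal sketch of the micro-batch idea: buffer events and commit them together once the buffer is full or has been open long enough. The MicroBatcher class, its parameters, and the list-based sink are hypothetical, not the Spark, Flink, or Iceberg API. Time is passed in explicitly so the example is deterministic.

```python
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffer events; flush when the buffer is full or has been open too long."""

    def __init__(self, sink: Callable[[List[dict]], None], max_size: int, max_wait_s: float):
        self.sink = sink
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.buffer: List[dict] = []
        self.first_arrival: Optional[float] = None

    def add(self, event: dict, now: float) -> None:
        if not self.buffer:
            self.first_arrival = now  # the wait clock starts at the first buffered event
        self.buffer.append(event)
        # The time check runs only when an event arrives; a real writer
        # would also flush from a timer so a quiet stream still commits.
        if len(self.buffer) >= self.max_size or now - self.first_arrival >= self.max_wait_s:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.sink(self.buffer)  # one commit for the whole buffer
            self.buffer = []
            self.first_arrival = None


commits: List[List[dict]] = []
batcher = MicroBatcher(sink=commits.append, max_size=3, max_wait_s=5.0)
for i, t in enumerate([0.0, 1.0, 2.0, 3.0, 9.0]):
    batcher.add({"id": i}, now=t)
batcher.flush()  # drain whatever is left at shutdown
```

Five events produce two commits here: the first closes on size, the second on elapsed time. Fewer, larger commits are the lever against small files, at the cost of added latency.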

Concrete instances in the wild

  • Kafka Streams. JVM-based streaming library. Runs in your application. Lightweight streaming for JVM shops.
  • Apache Flink. Standalone streaming engine. Powerful windowing, state management, exactly-once semantics.
  • Spark Structured Streaming. Micro-batch by default; an experimental continuous processing mode also exists. Familiar Spark API applied to streams.
  • Apache Beam. Unified batch + streaming API. Runs on Flink, Spark, or Google Dataflow.
  • basecamp streaming stack (Y4 Phase 32). Kafka + Flink + Iceberg + Debezium.
  • basecamp batch stack (Y4 Phase 33). Spark Operator + Argo Workflows + Iceberg.
  • Apache Airflow. Widely deployed batch orchestrator. DAGs as Python code.
  • Argo Workflows. K8s-native batch orchestrator. Pods as DAG nodes. basecamp default.
  • Dagster. Modern batch orchestrator with better data-quality integration than Airflow.
  • Prefect. Modern batch orchestrator alternative to Airflow.
  • Google Cloud Dataflow. Managed Beam. Combines batch + streaming.

Why this pattern matters

Every data-processing team eventually has both streaming and batch workloads. The dashboard team wants real-time. The finance team wants reproducible reports. The ML team wants both (real-time features + batch training data). Deciding which paradigm serves which workload is a foundational data-platform decision.

Getting the choice right saves substantial operational cost. Streaming pipelines are expensive to operate — they run continuously, they require careful state management, they need explicit handling of late-arriving data, they’re hard to backfill. Batch pipelines are cheaper — they run when scheduled, they can be re-run trivially, state is externalized. Using streaming for workloads that don’t need it wastes engineering resources.

Getting the choice wrong the other way (using batch for workloads that need real-time) produces UX problems and business problems. Fraud detection that’s 15 minutes stale misses fraud. Dashboards that lag 30 minutes get ignored. Model features that are hours old drift from serving data. Batch latency has a floor, roughly the schedule interval plus the job’s runtime, that streaming avoids.
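The batch latency floor can be worked out with simple arithmetic. This sketch assumes each run picks up everything that arrived before it started and publishes results only when it finishes; the 15-minute schedule and 4-minute runtime are made-up numbers.

```python
def worst_case_staleness_s(interval_s: float, runtime_s: float) -> float:
    """Worst case for a scheduled job.

    An event that arrives just after a run starts waits one full interval
    for the next run, then that run's runtime, before it becomes visible.
    """
    return interval_s + runtime_s


staleness = worst_case_staleness_s(interval_s=15 * 60, runtime_s=4 * 60)
staleness_minutes = staleness / 60
```

Under these assumptions, a job scheduled every 15 minutes can serve data up to 19 minutes old. Tightening the schedule lowers the floor only until runs start to overlap.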

The pattern also matters for organizational complexity. Streaming systems require different skills (stateful stream processing, windowing semantics, watermark reasoning) than batch systems (dependency management, backfill strategies, cost optimization). Teams often specialize. Platforms that support both cleanly (via shared storage, shared observability, shared operational tooling) reduce the coordination burden.

Modern platforms provide better unification than a decade ago. Apache Beam’s unified API works across streaming and batch. Flink’s Table API abstracts over both. Iceberg serves as unified storage. Kubernetes operators manage both types of workloads uniformly. The operational cost of running both paradigms is lower than it used to be, which shifts the trade-off toward “run both, pick per workload” rather than “commit to one.”

The failure modes to know: streaming pipelines that silently drop late-arriving data (watermark misconfigurations); batch pipelines that miss their SLAs when data volumes grow (scheduling misconfigurations); duplicated work between streaming and batch implementations (lambda architecture’s operational cost). Each has known patterns for prevention.
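A minimal sketch of the first failure mode: tumbling event-time windows with a watermark that trails the highest timestamp seen. This is plain Python that simplifies the semantics, not Flink's watermark API; the function and parameter names are hypothetical. Late events are returned rather than discarded, which is one way to make the drop visible.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def windowed_counts(
    event_times: Iterable[int], window_s: int, allowed_lateness_s: int
) -> Tuple[Dict[int, int], List[int]]:
    """Count events per tumbling event-time window.

    The watermark trails the highest event time seen by allowed_lateness_s.
    An event whose window ended at or before the watermark is late and is
    returned separately instead of being counted.
    """
    counts: Dict[int, int] = defaultdict(int)
    late: List[int] = []
    max_seen: Optional[int] = None
    for ts in event_times:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - allowed_lateness_s
        window_start = ts - ts % window_s
        if window_start + window_s <= watermark:
            late.append(ts)  # window already closed: surface it, don't lose it
        else:
            counts[window_start] += 1
    return dict(counts), late


arrivals = [1, 4, 12, 3, 27, 8]  # event times in arrival order, out of order
counts, late = windowed_counts(arrivals, window_s=10, allowed_lateness_s=5)
tight_counts, tight_late = windowed_counts(arrivals, window_s=10, allowed_lateness_s=0)
```

With 5 seconds of allowed lateness, only the event at time 8 is late. With zero lateness, the event at time 3 is lost as well, which is what a too-tight watermark looks like when nothing is counting the drops.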

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y4 Phase 32 (Kafka + Flink) + Y4 Phase 33 (Spark + Argo)
         deploy both stacks on basecamp.
DEEP     Promoted after Y4 end — at least one pipeline implemented streaming-first
         AND one batch-first, with conscious choice of which fit which workload.

Preview: what OUTLINE will answer

When Y4 Phase 32 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you choose between streaming and batch for a given workload?
  • PRINCIPLES. Latency requirement drives paradigm choice. Streaming for seconds-to-minutes latency; batch for minutes-to-hours. Both share storage (Iceberg). Both share observability. Backfill is cheap in batch, hard in streaming. Watermarks and windowing matter deeply in streaming.
  • TRADE-OFFS. Streaming (low latency, continuous cost, complex semantics) vs batch (higher latency, periodic cost, simple semantics). Kappa (streaming-only) vs lambda (both). True streaming (Flink) vs micro-batch (Spark). Managed (Dataflow) vs self-hosted.
  • TOOLS (time-stamped as of 2026-06): Kafka Streams, Flink, Spark Structured Streaming, Apache Beam, Airflow, Argo Workflows, Dagster, Prefect, Google Dataflow.

The DEEP promotion, after Y4 end with both paradigms operational, will add MASTERY (operating both streaming and batch on basecamp), COMPARE (Flink vs Spark Structured Streaming vs Beam), OPERATE (a specific case where paradigm choice mattered), and CONTRIBUTE (a Flink or Spark documentation improvement).

Canonical references

  • Tyler Akidau et al., Streaming Systems (O’Reilly) — the definitive text on streaming semantics.
  • Flink documentation. Free at flink.apache.org.
  • Spark Structured Streaming documentation. Free at spark.apache.org.
  • Airflow documentation. Free at airflow.apache.org.
  • Google’s “Dataflow Model” paper (2015). Free. Introduces the unified batch/streaming model.

Cross-references