Batch Processing

The pattern: split a bounded data set into partitions, process them in parallel, and write the result. Idempotent: re-running with the same input produces the same output. Retryable: failed partitions are retried independently. Backfillable: old partitions can be reprocessed when the logic changes.
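A minimal sketch of those three properties, assuming partitions are keyed by a date string and the sink is an in-memory dict standing in for partitioned storage. The names `run_partition`, `run_batch`, and `sink` are hypothetical and chosen only for illustration; they are not the API of Spark or any other framework.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Partition = List[int]
Transform = Callable[[Partition], int]


def run_partition(key: str, rows: Partition, transform: Transform,
                  sink: Dict[str, int], max_attempts: int = 3) -> None:
    """Process one partition, retrying it alone if the transform fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Overwrite by partition key: a retry or re-run replaces the
            # previous result instead of adding to it.
            sink[key] = transform(rows)
            return
        except Exception:
            if attempt == max_attempts:
                raise  # give up on this partition only


def run_batch(partitions: Dict[str, Partition], transform: Transform,
              sink: Dict[str, int]) -> None:
    """Process all partitions in parallel; each one retries independently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_partition, key, rows, transform, sink)
                   for key, rows in partitions.items()]
        for future in futures:
            future.result()  # re-raise if a partition exhausted its retries


partitions = {"2024-01-01": [1, 2, 3], "2024-01-02": [4, 5]}
sink: Dict[str, int] = {}

run_batch(partitions, sum, sink)   # initial run
run_batch(partitions, sum, sink)   # re-run: same input, same output

# Backfill: the logic changed (max instead of sum), so reprocess one old
# partition without touching the others.
run_partition("2024-01-01", partitions["2024-01-01"], max, sink)
```

The re-run and the backfill are safe here only because each partition's output is a pure function of its input and the write replaces rather than accumulates.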

The trade-off: throughput vs. latency. Batch is “stream that’s allowed to be slow”: you give up real-time response in exchange for throughput, simpler semantics, and easier reasoning. The discipline is idempotent + partitioned + retryable. If all three hold, batch is operationally pleasant. If they don’t, you’ve built a fragile pipeline that breaks every Monday morning.
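To make the fragile case concrete, the sketch below writes the same partition twice, as a retry or a scheduled re-run would, once through an appending sink and once through an overwriting one. The function names, the `order_id`/`amount` fields, and the in-memory tables are illustrative assumptions, not taken from any particular system.

```python
from typing import Dict, List


def write_append(table: List[dict], partition: str, rows: List[dict]) -> None:
    """Non-idempotent sink: every call adds rows, so a retry duplicates them."""
    table.extend({"partition": partition, **row} for row in rows)


def write_overwrite(table: Dict[str, List[dict]], partition: str,
                    rows: List[dict]) -> None:
    """Idempotent sink: each call replaces the partition's previous contents."""
    table[partition] = list(rows)


rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 5}]

appended: List[dict] = []
overwritten: Dict[str, List[dict]] = {}

# Simulate the same partition being written twice (a retry or a re-run).
for _ in range(2):
    write_append(appended, "2024-01-01", rows)
    write_overwrite(overwritten, "2024-01-01", rows)

append_total = sum(r["amount"] for r in appended)                       # 30: double-counted
overwrite_total = sum(r["amount"] for r in overwritten["2024-01-01"])   # 15: correct
```

With the appending sink, the retry silently doubles the total; with the overwriting sink, the retry is a no-op on the final state, which is what makes retrying a partition safe.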

Deepens in Year 3 Phase 17: Batch Processing, where Spark, Iceberg MERGE INTO, and dbt incremental models are the worked example. The streaming counterpart lands in Year 3 Phase 16: Stream Processing, and the two converge in lambda-and-kappa.

Related patterns

stream-processing: the unbounded counterpart; same data, different temporal contract.
lambda-and-kappa: how stream and batch coexist (or don’t) in a single architecture.
idempotency: the precondition that makes “retry the partition” a safe operation.
partitioning: how the unit of parallelism is defined and assigned.
snapshot-plus-delta: how Iceberg/Delta make MERGE INTO and backfill atomic.
oltp-vs-olap: batch is the OLAP side (columnar, scan-heavy, throughput-tuned).
append-only-log: the input shape batch jobs prefer to read from.