Batch Processing
The pattern: process a bounded data set in parallel partitions and write each partition's result. Idempotent: re-running with the same input produces the same output. Retryable: failed partitions are retried independently. Backfillable: old partitions can be reprocessed when logic changes.
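A minimal sketch of the pattern in plain Python, assuming partitions keyed by date and one JSON output file per partition. The names `process_partition`, `write_partition`, and `run_batch` are hypothetical, and the per-user sum stands in for whatever transformation a real job performs. The idea it illustrates is that each partition overwrites its own output at a deterministic path, so re-running the job (or backfilling one partition) replaces results rather than appending to them.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_partition(records):
    """Pure transformation: total amount per user, in a deterministic order."""
    totals = {}
    for rec in records:
        totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]
    return dict(sorted(totals.items()))


def write_partition(out_dir, key, result):
    """Overwrite the partition's output: write a temp file, then rename it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{key}.json"
    fd, tmp_name = tempfile.mkstemp(dir=str(out_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(result, tmp, sort_keys=True)
    os.replace(tmp_name, target)  # replaces any earlier output for this key
    return target


def run_batch(partitions, out_dir, max_workers=4):
    """Process every partition in parallel; each one writes only its own file."""
    out_dir = Path(out_dir)

    def job(item):
        key, records = item
        return write_partition(out_dir, key, process_partition(records))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sorted(pool.map(job, partitions.items()))
```

Calling `run_batch` twice with the same `partitions` leaves the same files with the same contents. A backfill is the same call restricted to the partitions whose logic or input changed.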
The trade-off: throughput vs. latency. Batch is “stream that’s allowed to be slow” — you trade real-time response for throughput, simpler semantics, and easier reasoning. The discipline is idempotent + partitioned + retryable: if those hold, batch is operationally pleasant. If they don’t, you’ve built a fragile pipeline that breaks every Monday morning.
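The "retryable" part of that discipline can be sketched as a small driver that re-runs only the partitions that failed. This is an illustration under assumptions, not a scheduler: `run_with_retries` is a hypothetical name, `process` is any caller-supplied function, and the loop is only safe when `process` is idempotent, because a partition that failed halfway will be executed again from the start.

```python
def run_with_retries(partition_keys, process, max_attempts=3):
    """Run process(key) for each key, retrying only the keys that failed.

    Returns (done, errors): results for partitions that succeeded and the
    last exception for partitions that still failed after max_attempts.
    """
    pending = list(partition_keys)
    done, errors = {}, {}
    for _attempt in range(max_attempts):
        failed = []
        for key in pending:
            try:
                done[key] = process(key)
                errors.pop(key, None)  # clear any error from an earlier attempt
            except Exception as exc:
                errors[key] = exc
                failed.append(key)
        pending = failed  # successful partitions are never re-run
        if not pending:
            break
    return done, errors
```

Partitions that succeed on the first pass are not touched again, so one flaky partition costs a retry of that partition and not of the whole job.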
Deepens in Year 3 Phase 17: Batch Processing — Spark + Iceberg MERGE INTO + dbt incremental models are the worked example. The streaming counterpart lands in Year 3 Phase 16: Stream Processing, and the two converge in lambda-and-kappa.
Related patterns
- stream-processing: the unbounded counterpart — same data, different temporal contract.
- lambda-and-kappa: how stream and batch coexist (or don’t) in a single architecture.
- idempotency: the precondition that makes “retry the partition” a safe operation.
- partitioning: how the unit of parallelism is defined and assigned.
- snapshot-plus-delta: how Iceberg/Delta make MERGE INTO and backfill atomic.
- oltp-vs-olap: batch is the OLAP side (columnar, scan-heavy, throughput-tuned).
- append-only-log: the input shape batch jobs prefer to read from.