Batch Processing: Spark + Airflow + dbt

Fourth phase of Year 3. Batch as a discipline. Spark for compute, Airflow for orchestration, dbt for analytics transformations. The “Homelab life API” gets its backfill + aggregation half. ~8 weeks, ~90 hrs.

Prerequisites

Phase 16 complete — streaming half of the personal-api pipeline working

You accept: batch is “stream that’s allowed to be slow.” The patterns (idempotency, partitioning, retries, watermarks-for-arrived-late-data) are the same.

Why this phase exists

Streaming gets the live data. Batch handles the rest: backfill, complex joins, ML feature engineering, analytical aggregations. Year 4 ML training is batch. Year 5 nightly portal aggregates are batch. Real platforms run both — and the engineers who can reason across both at the pattern level are the ones who get to design the pipeline architecture rather than just operate it.

This phase also lands the second half of Homelab life API — Airflow DAGs that backfill historical GitHub data into Iceberg + dbt models that aggregate it for the API. Combined with P16’s stream and P18’s serving layer, you’ll have one of the Master Plan’s 5 composition recipes running end-to-end on your own platform.

By the end of P17 the Lambda-vs-Kappa pattern reaches DEEP for real, because both sides exist and you’ve had to commit to one. That kind of forced architectural choice is the whole point of the pattern-first scaffold — patterns get DEEP when you’ve operated something that depends on them, not when you’ve read about them.

→ Pattern: batch-processing → Pattern: materialized-views → Pattern: idempotency (reinforced)

1. PROBLEM

You have data at rest that needs to become data in a different shape. You want to:

Run computations across many partitions in parallel
Re-run when logic changes (backfill)
Survive partial failures without rerunning everything
Schedule by time + dependency (run B after A succeeds)
Test the transformations like real software

Spark = the compute engine. Airflow = the orchestrator. dbt = the SQL-first transformation framework. Each owns one part of the problem; together they’re what every data team in 2026 has some variant of.

→ See: stream-vs-batch, storage-and-data

2. PRINCIPLES

2.1 Batch as a stream-of-files

Each batch run reads N input partitions, writes M output partitions. Idempotent + partitioned + retryable. The partitioning strategy is the most consequential design decision — it determines parallelism, retry granularity, and reprocessing cost.

Investigate:

Why does Spark prefer partitioned input over single-file input? (Hint: parallelism, predicate pushdown, and the cost of starting a stage.)
What does “shuffle” mean in Spark, and why is it expensive?
Read Spark’s Catalyst optimizer overview — the same query plan tree shape every analytical engine of the era uses.

2.2 Idempotent jobs

Running a batch job twice should produce the same output. Side effects must be idempotent.

→ Pattern: idempotency (DEEP from Year 2; reinforced here)

Investigate:

Why is INSERT INTO not idempotent? When is MERGE INTO the right primitive?
Iceberg’s MERGE INTO semantics; concurrency under MERGE.
Design every Airflow task with an idempotency key — make it explicit, not implicit.

2.3 Orchestration as code

Airflow DAGs are Python. Tasks are nodes. Dependencies are edges. Schedules are cron-y. The DAG-as-code property is what lets you code-review pipeline changes and roll them back via git like any other software.

Investigate:

Build a small DAG: download GitHub events → land in MinIO → trigger Spark job → trigger dbt model.
TaskFlow API vs operator-style DAGs (TaskFlow is the modern default).
Why is Airflow’s “scheduler tick” model the source of half its operational issues? (Hint: it’s a polling architecture pretending to be event-driven.)

2.4 Analytics transformations (dbt)

dbt: SQL templates + tests + documentation. Each model becomes an Iceberg table. The killer feature is that tests are first-class — not_null, unique, accepted_values are wired into the same DAG that builds the table, so a broken test gates production.

→ Pattern: materialized-views

Investigate:

Build 3 dbt models for personal-api: commits_per_repo_per_day, commits_per_language_per_week, weekly_activity.
dbt tests: not_null, unique, accepted_values — gate production.
Incremental models — when each row should be reprocessed vs appended.

2.5 Backfill discipline

History changes. New columns. Bug-fixed transformations. You re-run history without breaking live consumers — and without doubling production cost while the backfill runs.

Investigate:

Iceberg branching for backfill (Nessie helps): branch the table, backfill, validate, swap.
“Slowly Changing Dimensions” — Type 1 / 2 / 4 strategies for tracking history.

2.6 Lambda + Kappa revisited

→ Pattern: lambda-and-kappa (DEEP after this phase)

You now have both halves: Flink (stream from P16) + Spark/Airflow (batch this phase). The Lambda question becomes concrete — not “which architecture is better in general?” but “which works for personal-api, given what I’ve built?”

Investigate:

Run the same logic in Flink (streaming) and Spark (batch); verify outputs converge.
When does Lambda’s complexity earn its keep? (Real answer: rarely; Kappa is usually cleaner. Pick deliberately, document the choice as an ADR.)

3. TRADE-OFFS

Decision	Option A	Option B	When
Compute engine	Spark	Trino (analytical only)	Dask
Orchestrator	Airflow	Dagster (modern, asset-aware)	Prefect (Python-native)
Transformations	dbt (SQL)	Spark Scala/Python	dbt for analytics; Spark for ML feature engineering
Job submission	Spark Operator (K8s)	spark-submit	Operator for K8s-native; submit for ad-hoc
Architecture	Lambda	Kappa	Kappa is cleaner; Lambda when streaming reprocessing isn’t viable

4. TOOLS (as of 2025-10)

Apache Spark 3.5+ (with Iceberg + Nessie connectors)
Spark Operator for Kubernetes
Apache Airflow 2.10+
dbt-core + dbt-spark + dbt-trino
Polars + DuckDB for ad-hoc + tests

5. MASTERY

5.1 Reading list

Required	Why
DDIA Ch. 10 (Batch Processing)	The theory
”Spark: The Definitive Guide” Ch. 1-12	The implementation
Airflow docs — TaskFlow + Operators	Real ops
dbt docs — Models + Tests	The discipline

5.2 Operational depth checklist

[ ] Deploy Spark Operator on basecamp; submit a SparkApplication via YAML
[ ] Deploy Airflow on basecamp; configure with KubernetesExecutor
[ ] Build the backfill DAG: GitHub Events API → MinIO → Spark transformation → Iceberg
[ ] Build 3 dbt models: commits-per-repo-per-day, commits-per-lang-per-week, weekly_activity
[ ] Add dbt tests: not_null, unique, accepted_values; wire failures to PagerDuty/Slack
[ ] Configure Iceberg MERGE INTO for incremental updates (vs INSERT)
[ ] Force a Spark job failure mid-run; observe Airflow retry; verify idempotent re-run
[ ] Backfill 1 year of historical GitHub events; verify against streaming results
[ ] Set up dbt docs site (deploy via basecamp); browse lineage graph
[ ] Build a Grafana dashboard for batch jobs (Airflow + Spark Prometheus exporters)

5.3 The full Homelab life API ingest

By phase end, the data side of personal-api is complete:

Stream side (P16):  GitHub webhook → Redpanda → Flink → abukix.commits (Iceberg)
Batch side (P17):   GitHub Events API → Airflow → Spark → abukix.commits (Iceberg)
                    dbt models: commits_per_repo, commits_per_lang, weekly_activity

Both write to the same Iceberg tables (Lambda or Kappa choice; pick one and document).
Trino + REST API (P18) read from these.

personal-api PLAN.md gets updated with the schema + DAG references. See projects/basecamp for where the full personal services tier lives. The Lambda/Kappa choice gets an ADR — write it down so future-you knows why, not just what.

6. COMPARE: Airflow vs Dagster

For the same DAG (the backfill above), build it in Airflow + Dagster. Compare:

Asset-awareness (Dagster’s killer feature — pipelines as transformations of things, not chains of tasks).
Local development ergonomics — what it takes to run a single task on your laptop.
UI + observability.
Failure recovery — partial reruns, asset materialization tracking.
Migration path — what it would take to move basecamp from Airflow to Dagster, or back.

400 words.

7. OPERATE

4+ runbooks (spark-job-stuck, airflow-scheduler-lag, dbt-test-failure-triage, iceberg-merge-conflict)
1+ postmortem
Weekly log

8. CONTRIBUTE

Apache Spark, Airflow, dbt, OpenLineage — all active CNCF/Apache communities. Year 3’s first OpenLineage contribution lands well here, because P19’s governance work will lean on whatever instrumentation you’ve added.

Validation criteria

[ ] All 10 operational depth checks
[ ] Spark + Airflow + dbt in basecamp Tier 4
[ ] personal-api ingest pipeline working end-to-end (stream + batch convergent)
[ ] Airflow vs Dagster comparison written up
[ ] 4+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - batch-processing → DEEP
    - materialized-views → DEEP
    - lambda-and-kappa → DEEP (now you have both sides)
    - idempotency → reinforced via MERGE INTO
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — given the GitHub Events API + Iceberg, build an Airflow DAG that backfills 30 days, runs a Spark transformation, lands in Iceberg via MERGE, runs dbt tests, alerts on failure.
Diagnose (60 min) — scenario: dbt model produces wrong numbers; trace via lineage; find the upstream bug.
Articulate (30 min) — 600 words: “When do I reach for Lambda vs Kappa? Defend with examples from this phase.”

Anti-patterns

Anti-pattern	Why
Non-idempotent batch jobs	Re-runs corrupt data; refactor to idempotent now
Airflow DAGs without idempotency contracts on tasks	Same as above; one rerun ruins the table
dbt without tests	Pipeline lies silently; tests gate production
Spark over `pandas.read_csv`	If your data fits in Polars/DuckDB, you don’t need Spark
Lambda for “just in case” complexity	Real cost; pick Kappa unless streaming-replay-doesn’t-work

Patterns deepened this phase

batch-processing → DEEP
materialized-views → DEEP
lambda-and-kappa → DEEP
idempotency → reinforced

→ Next: Phase 18: Data Serving — Trino + Superset