data-tier

Umbrella project covering Iceberg helpers + Spark utilities + Strimzi templates + Argo Workflow templates. The data tier of basecamp (Tier 5). Built across Year 4 Phases 31-33.

What this is

data-tier is an umbrella project — not a single repo but a coordinated set of small repos and components that together form basecamp’s Tier 5 (Data Engineering). Each piece is independently useful; together they’re how basecamp ingests, stores, and processes data at homelab scale, K8s-native.

Components:

  • Iceberg helpers (Python + Go libraries) — schema migration scripts, compaction utilities, partition health checks
  • Spark utilities — pre-configured Spark Operator templates for common batch jobs (CDC → Iceberg, partition rebuild, dbt model run)
  • Strimzi templates — Kafka topic patterns, Schema Registry integration, Debezium connector configs
  • Argo Workflow templates — reusable workflow steps for typical data pipelines

Why it exists

Three reasons:

  1. basecamp Tier 5. Without a data tier, Y5’s ML and AI work has no substrate. data-tier IS the substrate.
  2. Reusable patterns. Every data engineer reinvents Iceberg compaction; every team rewrites Spark Operator boilerplate. data-tier captures these patterns once.
  3. K8s-native demonstration. The whole umbrella runs as operators + CRDs. It’s a public example of K8s-native data engineering.

Spec (v0.1.0)

Iceberg helpers

Python library + CLI for common Iceberg operations:

icehelper compact <table>       — compact small files
icehelper expire-snapshots <table> --older-than 7d
icehelper partition-health <table>
icehelper schema-diff <table1> <table2>
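
The command list above maps naturally onto a subcommand parser. What follows is a minimal sketch of that layout using argparse, not icehelper's actual implementation. The function names parse_age and build_parser are hypothetical, and the accepted age units (d, h, m) are an assumption extrapolated from the single "7d" example.

```python
import argparse
import re
from datetime import timedelta

# Assumed unit suffixes; only "d" appears in the spec above.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_age(text: str) -> timedelta:
    """Turn a string such as '7d' into a timedelta (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([dhm])", text)
    if match is None:
        raise argparse.ArgumentTypeError(
            f"invalid age {text!r}; expected forms like 7d, 12h, 30m"
        )
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the four subcommands listed in the spec."""
    parser = argparse.ArgumentParser(prog="icehelper")
    sub = parser.add_subparsers(dest="command", required=True)

    compact = sub.add_parser("compact", help="compact small files")
    compact.add_argument("table")

    expire = sub.add_parser("expire-snapshots", help="expire old snapshots")
    expire.add_argument("table")
    expire.add_argument("--older-than", type=parse_age, required=True)

    health = sub.add_parser("partition-health", help="check partition health")
    health.add_argument("table")

    diff = sub.add_parser("schema-diff", help="compare two table schemas")
    diff.add_argument("table1")
    diff.add_argument("table2")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing the age at the argparse layer means a malformed value such as "7days" is rejected with a usage error before any table is touched.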

Spark utilities

Spark Operator SparkApplication templates (Kustomize-friendly):

data-tier/spark-templates/
├── cdc-to-iceberg/        — Kafka → Iceberg streaming job
├── partition-rebuild/     — backfill or repartition
├── dbt-runner/            — dbt models via Spark SQL
└── aggregation-batch/     — common analytical aggregations
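
For orientation, this is roughly the shape of manifest a template in this directory would resolve to. The sketch builds a SparkApplication as a plain Python dict, using fields from the Spark Operator v1beta2 API. The function name, the namespace, the service account name, and the resource sizes are illustrative assumptions, not values taken from data-tier.

```python
import json


def spark_application(name, image, main_file, executors=2,
                      spark_version="3.5.0", namespace="data-tier"):
    """Return a SparkApplication manifest as a dict (hypothetical helper).

    All defaults are placeholders chosen for illustration only.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            "restartPolicy": {"type": "Never"},
            # Assumed service account name; it must exist in the namespace.
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 1, "memory": "2g"},
        },
    }


if __name__ == "__main__":
    manifest = spark_application(
        "partition-rebuild",
        "example.invalid/spark-job:latest",  # placeholder image
        "local:///opt/app/rebuild.py",       # placeholder entry point
    )
    print(json.dumps(manifest, indent=2))
```

The JSON output can be piped to kubectl apply -f -, since kubectl accepts JSON as well as YAML. In a Kustomize workflow the same fields would live in a base file and be patched per job.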

Strimzi templates

Kafka, KafkaTopic, KafkaConnect patterns:

data-tier/strimzi-templates/
├── small-cluster/         — 3-broker KRaft Kafka
├── debezium-postgres/     — KafkaConnect + Debezium for Postgres CDC
└── schema-registry/       — Apicurio / Confluent Schema Registry
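
As a rough illustration of a topic pattern, the sketch below builds a Strimzi KafkaTopic manifest as a Python dict. The apiVersion and the strimzi.io/cluster label follow the Strimzi CRD. The function name, the default partition count, and the retention value are assumptions, and the broker_count check simply reflects the 3-broker small-cluster layout named above.

```python
import json


def kafka_topic(name, cluster, partitions=3, replicas=3,
                retention_ms=7 * 24 * 60 * 60 * 1000, broker_count=3):
    """Return a KafkaTopic manifest as a dict (hypothetical helper)."""
    if replicas > broker_count:
        # A topic cannot have more replicas than there are brokers.
        raise ValueError(
            f"replicas={replicas} exceeds broker_count={broker_count}"
        )
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # This label tells the Topic Operator which cluster owns the topic.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": {"retention.ms": retention_ms},
        },
    }


if __name__ == "__main__":
    # Topic and cluster names below are placeholders.
    print(json.dumps(kafka_topic("cdc.public.orders", "small-cluster"), indent=2))
```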

Argo Workflow templates

WorkflowTemplate and CronWorkflow patterns:

data-tier/argo-templates/
├── daily-aggregation/     — extract from Iceberg → transform → write
├── ml-feature-prep/       — feature engineering pipeline
└── cdc-backfill/          — replay Kafka → Iceberg from a timestamp
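
The extract, transform, write sequence of daily-aggregation can be expressed as an Argo WorkflowTemplate whose entrypoint runs three steps in order. The sketch below generates such a manifest as a Python dict. The container command (python -m pipeline) and the function name are hypothetical stand-ins, not the contents of the real template.

```python
import json

STAGES = ["extract", "transform", "write"]


def daily_aggregation_template(image, name="daily-aggregation"):
    """Return a WorkflowTemplate manifest as a dict (hypothetical helper)."""
    # One container template per stage; the command is a placeholder.
    stage_templates = [
        {
            "name": stage,
            "container": {
                "image": image,
                "command": ["python", "-m", "pipeline"],
                "args": [stage],
            },
        }
        for stage in STAGES
    ]
    # In Argo, "steps" is a list of lists: outer entries run sequentially,
    # inner entries run in parallel. One step per group gives a linear chain.
    pipeline = {
        "name": "pipeline",
        "steps": [[{"name": stage, "template": stage}] for stage in STAGES],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "WorkflowTemplate",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [pipeline, *stage_templates],
        },
    }


if __name__ == "__main__":
    print(json.dumps(daily_aggregation_template("example.invalid/etl:latest"), indent=2))
```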

Architecture

data-tier/
├── icehelper/             — Python lib + CLI for Iceberg ops
├── spark-templates/       — SparkApplication CRD templates
├── strimzi-templates/     — Kafka CRD templates
├── argo-templates/        — WorkflowTemplate CRD templates
├── docs/
│   ├── iceberg-on-basecamp.md
│   ├── streaming-patterns.md
│   └── batch-patterns.md
└── examples/

Each component is independently versioned. The umbrella has its own README explaining the relationships.

/root phases involved

Phase         What happens
Y4 Phase 31   Iceberg helpers + Trino integration
Y4 Phase 32   Strimzi templates + Debezium configs + Flink integration
Y4 Phase 33   Spark Operator templates + Argo Workflow templates
Y4 Phase 37   Feature pipeline patterns added (for Phase 40 Feast prep)
Y5 Phase 40   Feast integration; offline feature retrieval from Iceberg via Trino
Y5 Phase 50   AIOps consumes data-tier helpers for incident analysis

Public vs private

Public from Y4 Phase 31, component-by-component. Each component ships when its phase reaches it. Quiet ships throughout.

Launch energy

Quiet ship. The components are useful; the umbrella isn’t a flagship. Loud launches are reserved for terralabs, llm-gateway, Studio.

Integration with basecamp

data-tier IS basecamp’s Tier 5. Strimzi + Iceberg + Spark + Argo Workflows all deploy via Flux from basecamp’s GitOps repo, using data-tier’s templates as starting points.

Validation criteria

[ ] icehelper Python library + CLI shipped to PyPI
[ ] Spark Operator templates working on basecamp
[ ] Strimzi templates working on basecamp
[ ] Argo Workflow templates working on basecamp
[ ] At least 3 multi-step pipelines using the templates
[ ] CI: validation runs for all template CRDs
[ ] Tests where applicable (Python helpers, template renders)
[ ] README + docs explaining the umbrella structure
[ ] Feast integration in Y5 Phase 40 uses data-tier templates
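
The CI criterion above could start with a structural check on every rendered manifest before any cluster-side validation runs. This is a minimal sketch under the assumption that templates are rendered to dicts first. The function name and the exact set of checks are illustrative, and a real pipeline would likely add schema validation against the installed CRDs.

```python
# Kinds the umbrella ships, paired with the API group/version each CRD uses.
EXPECTED_API_VERSIONS = {
    "SparkApplication": "sparkoperator.k8s.io/v1beta2",
    "KafkaTopic": "kafka.strimzi.io/v1beta2",
    "WorkflowTemplate": "argoproj.io/v1alpha1",
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    errors = []
    kind = manifest.get("kind")
    if kind not in EXPECTED_API_VERSIONS:
        errors.append(f"unsupported kind: {kind!r}")
    elif manifest.get("apiVersion") != EXPECTED_API_VERSIONS[kind]:
        errors.append(
            f"{kind}: expected apiVersion {EXPECTED_API_VERSIONS[kind]!r}, "
            f"got {manifest.get('apiVersion')!r}"
        )
    metadata = manifest.get("metadata")
    if not isinstance(metadata, dict) or not metadata.get("name"):
        errors.append("metadata.name is missing")
    if not isinstance(manifest.get("spec"), dict):
        errors.append("spec is missing or not a mapping")
    return errors
```

Returning a list of messages rather than raising on the first failure lets one CI run report every broken template at once.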

Status

Created:           not yet
Started:           Y4 Phase 31 (Lakehouse)
Continues through: Y4 Phase 33 + Y5 Phase 40

Anti-patterns

Anti-pattern                                    Why
One giant repo instead of umbrella              Each component should be independently consumable
Templates without examples                      Templates are abstract; examples ground them
Skipping Iceberg compaction discipline          Small-files problem kills query performance at scale
Hand-rolling Spark Operator YAML in every job   The templates exist for a reason; use them
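
To make the small-files row concrete, here is a sketch of the kind of check a compaction discipline relies on: given (partition, file size) pairs, flag partitions where most data files fall below a size threshold. The 32 MiB threshold, the 0.5 ratio, and the function name are assumptions for illustration. This does not read Iceberg metadata and is not icehelper's actual partition-health logic.

```python
from collections import defaultdict

MIB = 1024 * 1024


def small_file_report(files, small_bytes=32 * MIB, max_small_ratio=0.5):
    """Map each flagged partition to its fraction of small files.

    `files` is an iterable of (partition, size_in_bytes) pairs. A partition
    is flagged when more than `max_small_ratio` of its files are smaller
    than `small_bytes`. Both thresholds are illustrative defaults.
    """
    sizes = defaultdict(list)
    for partition, size in files:
        sizes[partition].append(size)

    flagged = {}
    for partition, values in sizes.items():
        small = sum(1 for size in values if size < small_bytes)
        ratio = small / len(values)
        if ratio > max_small_ratio:
            flagged[partition] = ratio
    return flagged


if __name__ == "__main__":
    sample = [
        ("day=2024-01-01", 1 * MIB),
        ("day=2024-01-01", 2 * MIB),
        ("day=2024-01-01", 256 * MIB),
        ("day=2024-01-02", 256 * MIB),
    ]
    print(small_file_report(sample))
```

Partitions returned by such a check are the natural candidates to compact first.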

Cross-references