data-tier
Umbrella project covering Iceberg helpers + Spark utilities + Strimzi templates + Argo Workflow templates. The data tier of basecamp (Tier 5). Built across Year 4 Phases 31-33.
What this is
data-tier is an umbrella project: not a single repo but a coordinated set of small repos and components that together form basecamp’s Tier 5 (Data Engineering). Each piece is independently useful; together they are how basecamp ingests, stores, and processes data at homelab scale, using K8s-native tooling throughout.
Components:
- Iceberg helpers (Python + Go libraries) — schema migration scripts, compaction utilities, partition health checks
- Spark utilities — pre-configured Spark Operator templates for common batch jobs (CDC → Iceberg, partition rebuild, dbt model run)
- Strimzi templates — Kafka topic patterns, Schema Registry integration, Debezium connector configs
- Argo Workflow templates — reusable workflow steps for typical data pipelines
Why it exists
Three reasons:
- basecamp Tier 5. Without a data tier, Y5’s ML and AI work has no substrate. data-tier IS the substrate.
- Reusable patterns. Every data engineer reinvents Iceberg compaction; every team rewrites Spark Operator boilerplate. data-tier captures these patterns once.
- K8s-native demonstration. The whole umbrella runs as operators + CRDs. It’s a public example of K8s-native data engineering.
Spec (v0.1.0)
Iceberg helpers
Python library + CLI for common Iceberg operations:
icehelper compact <table> — compact small files
icehelper expire-snapshots <table> --older-than 7d
icehelper partition-health <table>
icehelper schema-diff <table1> <table2>
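The command surface above could be wired up roughly as follows. This is a minimal sketch using only the standard library, not the actual icehelper implementation: the function names `parse_age` and `build_parser`, the accepted age units, and making `--older-than` required are all assumptions made for illustration.

```python
import argparse
import re
from datetime import timedelta

# Assumed unit suffixes; the spec only shows "7d".
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_age(text: str) -> timedelta:
    """Parse a compact age such as '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid age: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per operation in the spec."""
    parser = argparse.ArgumentParser(prog="icehelper")
    sub = parser.add_subparsers(dest="command", required=True)

    compact = sub.add_parser("compact", help="compact small files")
    compact.add_argument("table")

    expire = sub.add_parser("expire-snapshots", help="drop old snapshots")
    expire.add_argument("table")
    expire.add_argument("--older-than", type=parse_age, required=True)

    health = sub.add_parser("partition-health", help="report partition health")
    health.add_argument("table")

    diff = sub.add_parser("schema-diff", help="compare two table schemas")
    diff.add_argument("table1")
    diff.add_argument("table2")
    return parser
```

Calling `build_parser().parse_args(["expire-snapshots", "db.events", "--older-than", "7d"])` yields a namespace whose `older_than` is a seven-day `timedelta`; the table name here is a placeholder.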
Spark utilities
Spark Operator SparkApplication templates (Kustomize-friendly):
data-tier/spark-templates/
├── cdc-to-iceberg/ — Kafka → Iceberg streaming job
├── partition-rebuild/ — backfill or repartition
├── dbt-runner/ — dbt models via Spark SQL
└── aggregation-batch/ — common analytical aggregations
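To give a sense of what one of these templates pins down, here is a hedged sketch that builds a SparkApplication manifest as a plain Python dict. The field names follow the Spark Operator `v1beta2` CRD; the helper name `spark_application`, the image, file path, namespace, service account, Spark version, and resource sizes are placeholders, not values taken from data-tier.

```python
def spark_application(name: str, image: str, main_file: str,
                      executors: int = 2, namespace: str = "data") -> dict:
    """Return a SparkApplication manifest for a PySpark batch job."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",               # assumed version
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": 1,
                "memory": "1g",
                "serviceAccount": "spark",         # assumed service account
            },
            "executor": {"instances": executors, "cores": 1, "memory": "2g"},
        },
    }


# Hypothetical image and job path, for illustration only.
manifest = spark_application(
    name="partition-rebuild",
    image="registry.example/spark-jobs:latest",
    main_file="local:///opt/spark/jobs/partition_rebuild.py",
    executors=3,
)
```

Serialising the dict with `json.dumps` gives output that is also valid YAML, so it can sit next to hand-written Kustomize bases.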
Strimzi templates
Kafka, KafkaTopic, KafkaConnect patterns:
data-tier/strimzi-templates/
├── small-cluster/ — 3-broker KRaft Kafka
├── debezium-postgres/ — KafkaConnect + Debezium for Postgres CDC
└── schema-registry/ — Apicurio / Confluent Schema Registry
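A topic pattern in this spirit might look like the sketch below, which builds a Strimzi KafkaTopic manifest as a Python dict. The `strimzi.io/cluster` label is how the Topic Operator associates a topic with a cluster; the helper name `kafka_topic`, the cluster and topic names, and the partition, replica, and retention defaults are assumptions chosen for illustration.

```python
WEEK_MS = 7 * 24 * 60 * 60 * 1000  # seven days in milliseconds


def kafka_topic(name: str, cluster: str, partitions: int = 3,
                replicas: int = 3, retention_ms: int = WEEK_MS) -> dict:
    """Return a KafkaTopic manifest managed by the Strimzi Topic Operator."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # Ties the topic to a specific Strimzi-managed Kafka cluster.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": {
                "retention.ms": retention_ms,
                "cleanup.policy": "delete",
            },
        },
    }


# Hypothetical names, for illustration only.
topic = kafka_topic("cdc.public.orders", cluster="basecamp-kafka")
```

A replica count of 3 matches the 3-broker small-cluster layout; anything higher could not be satisfied on that cluster.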
Argo Workflow templates
WorkflowTemplate and CronWorkflow patterns:
data-tier/argo-templates/
├── daily-aggregation/ — extract from Iceberg → transform → write
├── ml-feature-prep/ — feature engineering pipeline
└── cdc-backfill/ — replay Kafka → Iceberg from a timestamp
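As an illustration of the extract, transform, write shape, the sketch below builds a WorkflowTemplate with three sequential steps and a CronWorkflow that references it. The manifest fields follow the Argo Workflows `v1alpha1` API; the helper names, the container image, the stage arguments, and the 02:00 schedule are assumptions, not data-tier's actual templates.

```python
STAGES = ["extract", "transform", "write"]


def daily_aggregation_template(image: str) -> dict:
    """WorkflowTemplate that runs the three stages one after another."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "WorkflowTemplate",
        "metadata": {"name": "daily-aggregation"},
        "spec": {
            "entrypoint": "pipeline",
            "templates": [
                # Each inner list is one sequential step group.
                {"name": "pipeline",
                 "steps": [[{"name": s, "template": s}] for s in STAGES]},
                # One container template per stage; args are placeholders.
                *[{"name": s, "container": {"image": image, "args": [s]}}
                  for s in STAGES],
            ],
        },
    }


def nightly_schedule(template_name: str, schedule: str = "0 2 * * *") -> dict:
    """CronWorkflow that runs an existing WorkflowTemplate on a schedule."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": f"{template_name}-nightly"},
        "spec": {
            "schedule": schedule,
            "workflowSpec": {"workflowTemplateRef": {"name": template_name}},
        },
    }


template = daily_aggregation_template("registry.example/pipeline:latest")
cron = nightly_schedule(template["metadata"]["name"])
```

Keeping the schedule in a separate CronWorkflow means the same WorkflowTemplate can also be submitted by hand, for example during a backfill.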
Architecture
data-tier/
├── icehelper/ — Python lib + CLI for Iceberg ops
├── spark-templates/ — SparkApplication CRD templates
├── strimzi-templates/ — Kafka CRD templates
├── argo-templates/ — WorkflowTemplate CRD templates
├── docs/
│ ├── iceberg-on-basecamp.md
│ ├── streaming-patterns.md
│ └── batch-patterns.md
└── examples/
Each component is independently versioned. The umbrella has its own README explaining the relationships.
/root phases involved
| Phase | What happens |
|---|---|
| Y4 Phase 31 | Iceberg helpers + Trino integration |
| Y4 Phase 32 | Strimzi templates + Debezium configs + Flink integration |
| Y4 Phase 33 | Spark Operator templates + Argo Workflow templates |
| Y4 Phase 37 | Feature pipeline patterns added (for Phase 40 Feast prep) |
| Y5 Phase 40 | Feast integration; offline feature retrieval from Iceberg via Trino |
| Y5 Phase 50 | AIOps consumes data-tier helpers for incident analysis |
Public vs private
Public from Y4 Phase 31, component-by-component. Each component ships when its phase reaches it. Quiet ships throughout.
Launch energy
Quiet ship. The components are useful; the umbrella isn’t a flagship. Loud launches are reserved for terralabs, llm-gateway, Studio.
Integration with basecamp
data-tier IS basecamp’s Tier 5. Strimzi + Iceberg + Spark + Argo Workflows all deploy via Flux from basecamp’s GitOps repo, using data-tier’s templates as starting points.
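On the Flux side, each template directory would typically be reconciled by a Kustomization object. The sketch below shows the shape of such an object as a Python dict; the fields follow Flux's `kustomize.toolkit.fluxcd.io/v1` API, while the helper name, the repository path, the GitRepository name, and the interval are assumptions about how basecamp's GitOps repo might be laid out.

```python
def flux_kustomization(name: str, path: str, source: str = "basecamp",
                       interval: str = "10m") -> dict:
    """Return a Flux Kustomization that reconciles one directory of manifests."""
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": name, "namespace": "flux-system"},
        "spec": {
            "interval": interval,
            "path": path,
            "prune": True,  # remove cluster objects deleted from Git
            "sourceRef": {"kind": "GitRepository", "name": source},
        },
    }


# Hypothetical path inside the GitOps repo, for illustration only.
strimzi = flux_kustomization("data-tier-strimzi", "./tier5/strimzi")
```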
Validation criteria
- [ ] icehelper Python library + CLI shipped to PyPI
- [ ] Spark Operator templates working on basecamp
- [ ] Strimzi templates working on basecamp
- [ ] Argo Workflow templates working on basecamp
- [ ] At least 3 multi-step pipelines using the templates
- [ ] CI: validation runs for all template CRDs
- [ ] Tests where applicable (Python helpers, template renders)
- [ ] README + docs explaining the umbrella structure
- [ ] Feast integration in Y5 Phase 40 uses data-tier templates
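The CI validation item could start as a cheap structural check over rendered manifests, before any schema-aware validation against the real CRDs. The sketch below is one such check under stated assumptions: the function name `validate_manifest` and the exact set of rules are hypothetical, and the kind-to-API-group table covers only the kinds named in this document.

```python
# API group expected for each kind the templates are meant to produce.
EXPECTED_GROUPS = {
    "SparkApplication": "sparkoperator.k8s.io",
    "Kafka": "kafka.strimzi.io",
    "KafkaTopic": "kafka.strimzi.io",
    "KafkaConnect": "kafka.strimzi.io",
    "WorkflowTemplate": "argoproj.io",
    "CronWorkflow": "argoproj.io",
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    errors = []
    for key in ("apiVersion", "kind", "metadata", "spec"):
        if key not in manifest:
            errors.append(f"missing top-level field: {key}")

    kind = manifest.get("kind")
    group = str(manifest.get("apiVersion", "")).split("/")[0]
    if kind in EXPECTED_GROUPS:
        if group != EXPECTED_GROUPS[kind]:
            errors.append(f"{kind} should use API group {EXPECTED_GROUPS[kind]}")
    elif kind is not None:
        errors.append(f"unexpected kind: {kind}")

    if not (manifest.get("metadata") or {}).get("name"):
        errors.append("metadata.name is empty")
    return errors
```

A CI job would run this over every rendered template and fail on the first non-empty result. It does not replace validating against the CRD schemas themselves.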
Status
Created: not yet
Started: Y4 Phase 31 (Lakehouse)
Continues through: Y4 Phase 33 + Y5 Phase 40
Anti-patterns
| Anti-pattern | Why |
|---|---|
| One giant repo instead of umbrella | Each component should be independently consumable |
| Templates without examples | Templates are abstract; examples ground them |
| Skipping Iceberg compaction discipline | Small-files problem kills query performance at scale |
| Hand-rolling Spark Operator YAML in every job | The templates exist for a reason — use them |
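The compaction row can be made concrete with a small check of the kind a partition-health command might run: given data file sizes per partition, flag partitions dominated by small files. This is a sketch under assumptions, not icehelper's actual logic; the function name, the 32 MiB cutoff, and the 50% ratio are illustrative, and the input is a plain list rather than real Iceberg metadata.

```python
from collections import defaultdict

MIB = 1024 * 1024


def small_file_report(files, small_bytes=32 * MIB, max_small_ratio=0.5):
    """Map each unhealthy partition to its fraction of small files.

    `files` is an iterable of (partition, size_in_bytes) pairs. A partition
    is reported when more than `max_small_ratio` of its files are smaller
    than `small_bytes`.
    """
    total = defaultdict(int)
    small = defaultdict(int)
    for partition, size in files:
        total[partition] += 1
        if size < small_bytes:
            small[partition] += 1

    report = {}
    for partition, count in total.items():
        ratio = small[partition] / count
        if ratio > max_small_ratio:
            report[partition] = ratio
    return report


# Made-up file listing, for illustration only.
files = [
    ("2024-01-01", 1 * MIB),
    ("2024-01-01", 2 * MIB),
    ("2024-01-01", 256 * MIB),
    ("2024-01-02", 256 * MIB),
    ("2024-01-02", 1 * MIB),
]
report = small_file_report(files)
```

With this listing, only the first partition is flagged: two of its three files fall under the cutoff, while the second partition sits exactly at the 50% ratio and is not reported.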
Cross-references
- Y4 Phase 31: Data Lakehouse — Iceberg helpers
- Y4 Phase 32: Stream Processing — Strimzi + Flink templates
- Y4 Phase 33: Batch + Orchestration — Spark + Argo templates
- Y5 Phase 40: Feature Stores — Feast integration
- basecamp: consumes data-tier as Tier 5