Lakehouse: MinIO + Iceberg + JupyterHub
Second phase of Year 3. Build the lakehouse foundation that everything ML/AI in Y4-Y5 reads from + writes to. Object storage as substrate; Iceberg as the table format; JupyterHub as the notebook entry point. ~8 weeks, ~100 hrs.
Prerequisites
- Phase 14 complete — observability stack live
- 64GB RAM in place (see homelab/hardware)
- You accept: the lakehouse is the modern data platform architecture. It replaces “warehouse” + “data lake” with one storage tier + many compute engines. Iceberg is the open table format that makes it work. JupyterHub is the interactive entry point.
Why this phase exists
Years 4-5 ML/AI work needs a place where data lives. Building it on Postgres doesn’t scale; building it on a single cloud warehouse locks you in and locks you out of the open ecosystem. The lakehouse pattern (MinIO + Iceberg + open compute) is what 2025 data platforms look like — and the pattern, not the specific tools, is the thing that transfers when the stack rotates again in 2030.
JupyterHub-as-a-service joins basecamp Tier 5 this phase. It’s the entry point for every Studio composition recipe. Year 4 RAG, Year 4 ML training, Year 5 portal command palette — all start with “open a notebook.” The notebook is also the first basecamp surface where the platform stops being a YAML repo and becomes a UX a non-platform-engineer would recognize.
This is the phase where Tier 3 (Lakehouse) of basecamp comes online. Combined with Tier 4 (Processing — landing across P16 and P17) and Tier 8 (Serving — P18), it’s the data-engineering shape that makes the Year 3 Master Plan commitment real.
→ Pattern: oltp-vs-olap (DEEP this phase)
→ Pattern: schema-on-read-vs-write
→ Pattern: snapshot-plus-delta
→ Pattern: append-only-log
1. PROBLEM
You have analytical data (events, logs, ML training data, your own GitHub history). You want:
- Cheap storage at scale — object storage solves this
- Multiple query engines reading the same data (SQL + Spark + ML pipelines)
- Schema evolution without rewriting all the data
- Time-travel queries (“what did this table look like 7 days ago?”)
- Transactional guarantees on writes (no half-written tables)
- A notebook environment where data scientists / future-you write ad-hoc queries
Iceberg + object storage solve the first five. JupyterHub solves the sixth.
→ See: storage-and-data
2. PRINCIPLES
2.1 Storage / compute separation
The lakehouse insight: object storage is dumb cheap (S3, GCS, MinIO); compute is expensive and bursty. Separate them. Multiple compute engines can read the same data without ETL between them — Spark for batch, Trino for ad-hoc, DuckDB for laptop-local exploration, all hitting the same Parquet bytes.
Investigate:
- Compare lakehouse vs warehouse: where is storage? where is compute? what’s the cost shape?
- Why was the warehouse era (Teradata, Vertica) coupled storage + compute? Why did lakehouse decouple? (Hint: cloud object storage had to exist first.)
2.2 Open table formats (Iceberg, Delta, Hudi)
A “table” in the lakehouse is a directory of Parquet files + a metadata layer that tracks which files belong to which snapshot. The metadata layer is where transactional guarantees come from — no half-written tables, no readers seeing partial writes.
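A minimal sketch of why a metadata layer gives you "no half-written tables": a writer stages new files first, then publishes them by moving a single "current snapshot" pointer. This is a toy in-memory model, not Iceberg's implementation; `ToyTable`, `Snapshot`, and `commit_append` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # paths of the Parquet files visible in this snapshot


@dataclass
class ToyTable:
    """Toy stand-in for table metadata: a snapshot log plus one 'current' pointer."""
    snapshots: list = field(default_factory=list)
    current_id: Optional[int] = None

    def current_files(self) -> tuple:
        # Readers resolve the pointer once, then read only that snapshot's files.
        if self.current_id is None:
            return ()
        return self.snapshots[self.current_id].data_files

    def commit_append(self, new_files, expected_current) -> Snapshot:
        # Optimistic commit: refuse if another writer moved the pointer first.
        if self.current_id != expected_current:
            raise RuntimeError("concurrent commit detected; retry on the new base")
        snap = Snapshot(len(self.snapshots), self.current_files() + tuple(new_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id  # the single publishing step
        return snap
```

A reader that resolved its file list before a commit keeps seeing the old snapshot; a reader that starts afterwards sees the new one. Nobody sees a partial write, because files only become visible when the pointer moves.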
Investigate:
- Read the Iceberg spec; understand `metadata.json`, snapshot, manifest, data file.
- Compare Iceberg vs Delta vs Hudi: what does each optimize for?
- Implement schema evolution on an Iceberg table; verify old data still readable with the new schema.
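Schema evolution without rewriting data works because Iceberg resolves columns by field ID rather than by name or position. The sketch below is a toy model of that idea, assuming each data file remembers the ID-to-name mapping it was written with; `read_row` and the dict-based schemas are hypothetical, not an Iceberg API.

```python
def read_row(stored: dict, file_schema: dict, table_schema: dict) -> dict:
    """Project one stored row onto the current table schema.

    file_schema and table_schema map field_id -> column name. `stored` is keyed
    by the column names that were current when the file was written.
    """
    out = {}
    for field_id, current_name in table_schema.items():
        old_name = file_schema.get(field_id)
        if old_name is None:
            # Column was added after this file was written: reads back as NULL.
            out[current_name] = None
        else:
            # Same field ID, possibly renamed since: value carries over.
            out[current_name] = stored.get(old_name)
    return out
```

With `{1: "sha", 2: "msg"}` as the file schema and `{1: "sha", 2: "message", 3: "additions"}` as the table schema, an old row reads back with `message` populated and `additions` as `None`, which is the behavior the schema-evolution exercise asks you to verify on a real table.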
2.3 Object storage as substrate
→ Pattern: append-only-log (Parquet files are immutable once written; tables change by adding new files)
MinIO = self-hosted S3-compatible object storage. AWS S3 is the same shape; MinIO is what you run locally. The S3 API has won: even GCS exposes an S3-interoperable XML API (with HMAC keys). Code that targets the S3 API is portable across AWS, most S3-compatible clouds, and every homelab; the usual differences are the endpoint URL, credentials, and addressing style.
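A sketch of what S3 API portability looks like in practice with `boto3`: the same client code targets AWS or MinIO depending only on the endpoint. The helper names are hypothetical; credentials are assumed to come from the environment, and path-style addressing is an assumption that usually suits a MinIO deployment without wildcard DNS.

```python
def s3_client_kwargs(endpoint_url=None, region="us-east-1"):
    """Keyword arguments for boto3.client("s3", ...).

    endpoint_url=None targets AWS S3; a MinIO URL targets the homelab.
    """
    kwargs = {"region_name": region}
    if endpoint_url is not None:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def make_client(endpoint_url=None):
    # Imported lazily so the helper above works even without boto3 installed.
    import boto3
    from botocore.config import Config

    # Path-style addressing (http://host/bucket/key) avoids per-bucket DNS names.
    cfg = Config(signature_version="s3v4", s3={"addressing_style": "path"})
    return boto3.client("s3", config=cfg, **s3_client_kwargs(endpoint_url))
```

Usage would be something like `make_client("http://minio.example.internal:9000").list_buckets()`, where the hostname is a placeholder for your own MinIO service.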
Investigate:
- Stand up MinIO on K3s with persistent storage (Longhorn-backed).
- Test S3 SDK compatibility (`boto3`, `aws-cli`, `s3cmd`) against MinIO.
- Set up multi-node erasure coding for durability: survive losing 1 node out of 4.
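To sanity-check the "lose 1 node out of 4" target, here is a small arithmetic sketch. It assumes one drive per node and MinIO's documented quorum behavior as I understand it (reads need the data-shard count; writes need one more drive when parity is exactly half the set). Treat those rules as assumptions and confirm them against the MinIO docs for your version; `erasure_profile` is a hypothetical helper.

```python
def erasure_profile(drives: int, parity: int) -> dict:
    """Capacity and failure tolerance for one erasure set."""
    if not 0 < parity <= drives // 2:
        raise ValueError("parity must be between 1 and drives // 2")
    data = drives - parity
    read_quorum = data
    # Assumption: with parity == half the set, writes need one extra drive online.
    write_quorum = data + 1 if parity * 2 == drives else data
    return {
        "usable_fraction": data / drives,
        "read_tolerance": drives - read_quorum,
        "write_tolerance": drives - write_quorum,
    }
```

Under these assumptions, 4 drives with parity 2 gives 50% usable capacity, keeps serving reads with 2 drives down, and keeps accepting writes with 1 drive down, which is consistent with the exercise's target.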
2.4 Catalog (Hive Metastore, Nessie, Polaris)
A catalog tells engines where Iceberg tables live. Hive is legacy; Nessie + Polaris (Apache) are the modern Iceberg-native options. The catalog is also where governance and access control will hook in during Phase 19 — pick one whose roadmap you trust for that.
Investigate:
- Install Nessie or Polaris as the Iceberg catalog.
- Connect Spark + Trino to the same catalog (Trino lands in P18, but verify the catalog is engine-agnostic now).
- Git-like branching on data via Nessie: branch a table, mutate, merge. This is what makes P17 backfill safe.
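A sketch of the Spark side of the catalog wiring, assuming Iceberg's `SparkCatalog` with the Nessie catalog implementation and `S3FileIO` pointed at MinIO. The property names follow the Iceberg and Nessie Spark docs as I recall them; verify them against the versions you install. The function name, catalog name, and all URLs are placeholders.

```python
def nessie_spark_conf(nessie_uri, warehouse, s3_endpoint, ref="main", catalog="nessie"):
    """Spark properties that register an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,  # Nessie branch that reads and writes target
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }
```

Each pair would be applied with `SparkSession.builder.config(key, value)` before `getOrCreate()`. Changing `ref` to a feature branch is what lets a job mutate a table without touching `main`.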
2.5 Parquet (the file format)
Columnar, compressed, splittable. The file format the lakehouse stores data in. Every analytical engine on the planet has a fast Parquet reader; that ubiquity is why it’s the substrate.
Investigate:
- Read a Parquet file with `pyarrow`; inspect row groups, schema, statistics.
- Compare Parquet vs ORC vs JSON for the same data: size + query speed.
- Why columnar wins for analytics queries (column pruning + vectorized execution).
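The row-group statistics you inspect with `pyarrow` are what let engines skip data. A toy, pure-Python sketch of the idea: each row group stores min/max for a column, and a filter skips any group whose stats prove no row can match. `RowGroup`, `write_row_groups`, and `scan_greater_than` are hypothetical names; real Parquet readers do this per column chunk and in far more detail.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    values: list


def write_row_groups(values, rows_per_group):
    """Split a column into row groups, recording min/max stats for each."""
    groups = []
    for i in range(0, len(values), rows_per_group):
        chunk = values[i:i + rows_per_group]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for g in groups:
        if g.max_value <= threshold:
            continue  # stats prove nothing here can match; skip without decoding
        groups_read += 1
        matches.extend(v for v in g.values if v > threshold)
    return matches, groups_read
```

On sorted or clustered data the skip rate is high, which is one reason partitioning and sort order matter as much as the file format itself.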
2.6 Snapshots + time travel
→ Pattern: snapshot-plus-delta
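Iceberg keeps every snapshot until you expire it. You can query “as of 7 days ago.” Compaction rewrites small data files into larger ones; snapshot expiration removes old snapshots on a retention policy, and time travel only reaches back as far as the oldest retained snapshot. Time travel is the feature that turns “what changed?” from a forensics question into a SQL query.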
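A toy sketch of the two operations in this section: resolving an "as of" timestamp to a snapshot, and applying a retention window. It assumes a snapshot log sorted by commit time; the function names are hypothetical and the logic is a simplification of what an engine does against table metadata.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def snapshot_as_of(log, as_of):
    """log: list of (committed_at, snapshot_id) sorted by committed_at.

    Returns the newest snapshot committed at or before `as_of`.
    """
    times = [ts for ts, _ in log]
    idx = bisect_right(times, as_of)
    if idx == 0:
        raise LookupError("no snapshot existed at that time")
    return log[idx - 1][1]


def expire_older_than(log, now, retention):
    """Drop snapshots older than the retention window, always keeping the newest."""
    cutoff = now - retention
    kept = [entry for entry in log[:-1] if entry[0] >= cutoff]
    return kept + log[-1:]
```

The trade-off is visible in the sketch: once a snapshot is expired, an "as of" query for that period has nothing to resolve to, so the retention window is also your time-travel horizon.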
2.7 Notebooks-as-a-service (JupyterHub)
JupyterHub on K8s gives every user a notebook with preloaded creds (S3 keys, Iceberg catalog endpoint, OIDC token). Cold-start to “running a Spark query against Iceberg” should be under 60 seconds — anything slower and the notebook isn’t the entry point, the YAML repo still is.
→ Pattern: platform-as-product (reinforced — notebook UX is platform UX)
Investigate:
- Install JupyterHub via Helm on basecamp; configure OIDC auth (Dex from Y2).
- Pre-load each user’s notebook with: S3 endpoint, Nessie URL, Spark client config.
- Test the cold-start: a user logs in, gets a fresh notebook, can `spark.read.parquet("s3a://...")` immediately.
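One way to pre-load notebooks is to inject endpoints as environment variables into every spawned user pod. A sketch, assuming a KubeSpawner-based JupyterHub: `notebook_environment` is a hypothetical helper, `NESSIE_URI` and `ICEBERG_WAREHOUSE` are made-up variable names your notebook code would have to read, and `AWS_ENDPOINT_URL` is only honored by reasonably recent AWS SDK versions. Secrets are deliberately left out; those are better mounted from Kubernetes Secrets than set as plain values.

```python
def notebook_environment(s3_endpoint, nessie_uri, warehouse):
    """Non-secret settings every user notebook should start with."""
    return {
        "AWS_ENDPOINT_URL": s3_endpoint,   # points AWS SDKs at MinIO
        "AWS_DEFAULT_REGION": "us-east-1",
        "NESSIE_URI": nessie_uri,          # hypothetical name
        "ICEBERG_WAREHOUSE": warehouse,    # hypothetical name
    }


# In jupyterhub_config.py this could be wired up roughly as:
#   c.KubeSpawner.environment = notebook_environment(...)
# With the Helm chart, the equivalent values key is singleuser.extraEnv.
```

Keeping this in one function means the cold-start test has a single source of truth: if a notebook cannot reach MinIO or Nessie, compare its environment against this dict first.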
3. TRADE-OFFS
| Decision | Option A | Option B | Option C |
|---|---|---|---|
| Table format | Iceberg (Netflix; Apache TLP) | Delta Lake (Databricks-flavored) | Hudi (Uber) |
| Catalog | Hive Metastore (legacy) | Nessie (git-like branching) | Polaris (Apache) |
| Object storage | S3 | MinIO (self-hosted) | GCS |
| Engine for ad-hoc | Spark | DuckDB (lightweight) | Trino |
| Notebook UX | JupyterHub on K8s | per-user VMs | hosted (Colab) |
4. TOOLS (as of 2025-10)
- MinIO (S3-compatible)
- Apache Iceberg (table format)
- Nessie OR Polaris (catalog)
- Apache Spark 3.5+ with Iceberg + Nessie connectors (deepens P17)
- DuckDB (local Parquet ops + lightweight queries)
- Trino (deepens P18)
- PyArrow + pandas for notebook-side ops
- JupyterHub (Helm chart on K3s)
- Polars (modern DataFrame; Year 4 ML may prefer over pandas)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Iceberg paper + Iceberg docs (iceberg.apache.org) | Start here |
| DDIA Ch. 3 (re-read) | Storage & retrieval with lakehouse lens |
| MinIO docs — distributed deployment + erasure coding | Real ops |
| Recommended | Why |
|---|---|
| ”Data Mesh” (Zhamak Dehghani) | Year 5 architectural lens |
| ”The Data Warehouse Toolkit” (Kimball) | Historical context |
5.2 Operational depth checklist
[ ] Deploy MinIO on basecamp K3s (4+ nodes, distributed mode); verify erasure coding
[ ] Deploy Nessie (or Polaris) as Iceberg catalog
[ ] Use PySpark to write an Iceberg table to MinIO via Nessie catalog
[ ] Schema-evolve an Iceberg table: add a column; verify old + new readers work
[ ] Time-travel query (Spark SQL): SELECT * FROM table TIMESTAMP AS OF '2027-...'
[ ] Branch a table via Nessie; mutate; verify main is unaffected
[ ] Read the same Iceberg table from Spark + DuckDB; same data
[ ] Compact small files (Iceberg's compaction action)
[ ] Set up retention: expire snapshots older than 7 days
[ ] Deploy JupyterHub on basecamp; OIDC auth via Dex; pre-loaded with S3 + Nessie creds
[ ] Move 1GB of test data into the lakehouse; query 3 ways (Spark, DuckDB, notebook)
5.3 The seed dataset for personal-api (lands in P17)
Plan ahead: this phase lands MinIO + Iceberg. P17 (Airflow) will populate them. Define the schema now:
    -- abukix.commits: every git commit you've made publicly
    CREATE TABLE abukix.commits (
        repo STRING,
        sha STRING,
        author_date TIMESTAMP,
        committer_date TIMESTAMP,
        message STRING,
        files_changed INT,
        additions INT,
        deletions INT
    )
    USING iceberg
    PARTITIONED BY (days(author_date));

    -- abukix.weekly_logs: your own weekly logs (parsed from ops-handbook markdown)
    CREATE TABLE abukix.weekly_logs (
        week_start DATE,
        learned STRING,
        broke STRING,
        surprised STRING,
        open_questions STRING,
        progress STRING,
        sustainability STRING
    )
    USING iceberg;

The commits table is what personal-api queries from P18 onward. Year 4’s notes-rag will consume weekly_logs. Defining the schema in P15 forces you to answer “what shape is the data we’ll actually want?” before any ETL exists: schema-first, not data-first.
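The `days(author_date)` partition transform maps each timestamp to a whole number of days since 1970-01-01, so all commits from one calendar day share a partition. A minimal sketch of that mapping, assuming naive datetimes are UTC (an assumption of this example, not something the schema states); `days_partition` is a hypothetical helper.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_partition(author_date: datetime) -> int:
    """Days since the Unix epoch, the value a day-granularity partition groups by."""
    if author_date.tzinfo is None:
        # Assumption for this sketch: treat naive timestamps as UTC.
        author_date = author_date.replace(tzinfo=timezone.utc)
    return (author_date - EPOCH).days
```

Two commits at 09:00 and 23:00 on the same UTC day land in the same partition, which is a useful check on whether daily granularity gives you reasonably sized files for your commit volume.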
6. COMPARE: Iceberg vs Delta Lake
Same exercise on both. Compare:
- Setup complexity — what it takes to get a table on each.
- Performance for write-heavy workloads.
- Performance for read-heavy workloads.
- Ecosystem maturity (engines supporting natively in 2026).
- Where each wins, and what the migration story looks like if you pick wrong.
400 words.
7. OPERATE
- 3+ runbooks (`minio-node-failure`, `iceberg-metadata-corruption`, `jupyterhub-user-onboarding`)
- 1+ postmortem
- Weekly log
8. CONTRIBUTE
Iceberg, MinIO, Nessie, JupyterHub — all welcoming. A docs PR for an under-documented Iceberg evolution rule or a Helm chart values fix for JupyterHub both count.
Validation criteria
[ ] All 11 operational depth checks
[ ] MinIO + Nessie + Iceberg in basecamp Tier 3
[ ] JupyterHub in basecamp Tier 5; OIDC auth working
[ ] Iceberg vs Delta comparison written up
[ ] `abukix.commits` schema defined (data lands in P17)
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - oltp-vs-olap → DEEP
    - schema-on-read-vs-write → DEEP
    - append-only-log → DEEP
    - snapshot-plus-delta → DEEP
[ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min) — stand up MinIO + Nessie; create an Iceberg table; write 1M rows via Spark; query from DuckDB; schema-evolve; time-travel back. End-to-end. Notebook session via JupyterHub.
- Diagnose (60 min) — scenario: an Iceberg query returns stale data despite a recent write. Find why (snapshot isolation, catalog staleness, branching).
- Articulate (30 min) — 600 words: “Why is the lakehouse displacing the warehouse model? What does it give up?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Putting OLTP data in the lakehouse | Wrong shape; use Postgres |
| One huge table for everything | Iceberg likes reasonable file sizes; partition strategy matters |
| Skipping compaction | Small files kill query perf |
| No snapshot retention policy | Snapshots forever = costs forever |
| JupyterHub without OIDC | Don’t reinvent auth; use Dex |
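The compaction and retention rows above correspond to Iceberg's Spark maintenance procedures. A sketch that builds the two `CALL` statements for a scheduled job, assuming a Spark catalog named `nessie` (a placeholder) and a 7-day window; `maintenance_statements` is a hypothetical helper. With a Nessie catalog, snapshot cleanup may need Nessie's own GC tooling rather than `expire_snapshots`, so check the Nessie docs before relying on this.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog, table, now, retention_days=7):
    """Spark SQL for compacting small files and expiring old snapshots."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Rewrites many small data files into fewer, larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Removes snapshots committed before the cutoff.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]
```

Each string would be passed to `spark.sql(...)` from a scheduled job; running compaction before expiration means the files replaced by compaction become eligible for cleanup in the same cycle.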
Patterns deepened this phase
- oltp-vs-olap → DEEP
- schema-on-read-vs-write → DEEP
- append-only-log → DEEP
- snapshot-plus-delta → DEEP
- platform-as-product → reinforced (JupyterHub as platform UX)
→ Next: Phase 16: Stream Processing