Lakehouse: MinIO + Iceberg + JupyterHub

Second phase of Year 3. Build the lakehouse foundation that everything ML/AI in Y4-Y5 reads from + writes to. Object storage as substrate; Iceberg as the table format; JupyterHub as the notebook entry point. ~8 weeks, ~100 hrs.

Prerequisites

Phase 14 complete — observability stack live

64GB RAM in place (see homelab/hardware)

You accept: the lakehouse is the modern data platform architecture. It replaces “warehouse” + “data lake” with one storage tier + many compute engines. Iceberg is the open table format that makes it work. JupyterHub is the interactive entry point.

Why this phase exists

Years 4-5 ML/AI work needs a place where data lives. Building it on Postgres doesn’t scale; building it on a single cloud warehouse locks you in and locks you out of the open ecosystem. The lakehouse pattern (MinIO + Iceberg + open compute) is what 2025 data platforms look like — and the pattern, not the specific tools, is the thing that transfers when the stack rotates again in 2030.

JupyterHub-as-a-service joins basecamp Tier 5 this phase. It’s the entry point for every Studio composition recipe. Year 4 RAG, Year 4 ML training, Year 5 portal command palette — all start with “open a notebook.” The notebook is also the first basecamp surface where the platform stops being a YAML repo and becomes a UX a non-platform-engineer would recognize.

This is the phase where Tier 3 (Lakehouse) of basecamp comes online. Combined with Tier 4 (Processing — landing across P16 and P17) and Tier 8 (Serving — P18), it’s the data-engineering shape that makes the Year 3 Master Plan commitment real.

→ Pattern: oltp-vs-olap (DEEP this phase)
→ Pattern: schema-on-read-vs-write
→ Pattern: snapshot-plus-delta
→ Pattern: append-only-log

1. PROBLEM

You have analytical data (events, logs, ML training data, your own GitHub history). You want:

Cheap storage at scale — object storage solves this
Multiple query engines reading the same data (SQL + Spark + ML pipelines)
Schema evolution without rewriting all the data
Time-travel queries (“what did this table look like 7 days ago?”)
Transactional guarantees on writes (no half-written tables)
A notebook environment where data scientists / future-you write ad-hoc queries

Iceberg + object storage solve the first five. JupyterHub solves the sixth.

→ See: storage-and-data

2. PRINCIPLES

2.1 Storage / compute separation

The lakehouse insight: object storage is dumb cheap (S3, GCS, MinIO); compute is expensive and bursty. Separate them. Multiple compute engines can read the same data without ETL between them — Spark for batch, Trino for ad-hoc, DuckDB for laptop-local exploration, all hitting the same Parquet bytes.

Investigate:

Compare lakehouse vs warehouse: where is storage? where is compute? what’s the cost shape?
Why did warehouse-era systems (Teradata, Vertica) couple storage + compute? Why did the lakehouse decouple them? (Hint: cloud object storage had to exist first.)

2.2 Open table formats (Iceberg, Delta, Hudi)

A “table” in the lakehouse is a directory of Parquet files + a metadata layer that tracks which files belong to which snapshot. The metadata layer is where transactional guarantees come from: a write becomes visible only when the catalog atomically swaps the table’s pointer to a new metadata file, so there are no half-written tables and no readers seeing partial writes.
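The commit idea can be sketched in a few lines of plain Python. This is a toy model, not the Iceberg format: real Iceberg stores manifest lists and manifests rather than a flat file tuple, and the class and method names here (`ToyTable`, `commit_append`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]  # every data file visible at this snapshot


@dataclass
class ToyTable:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: int = 0  # 0 means "empty table"

    def scan(self) -> Tuple[str, ...]:
        """Readers resolve the pointer once, then see only that snapshot's files."""
        snap = self.snapshots.get(self.current_snapshot_id)
        return snap.data_files if snap else ()

    def commit_append(self, new_files: List[str], expected_current: int) -> int:
        """Optimistic commit: publish only if nobody else committed in between."""
        if expected_current != self.current_snapshot_id:
            raise RuntimeError("commit conflict: table changed, retry on the new base")
        new_id = max(self.snapshots, default=0) + 1
        self.snapshots[new_id] = Snapshot(new_id, self.scan() + tuple(new_files))
        self.current_snapshot_id = new_id  # the single "pointer swap"
        return new_id


table = ToyTable()
s1 = table.commit_append(["data/a.parquet"], expected_current=0)
s2 = table.commit_append(["data/b.parquet", "data/c.parquet"], expected_current=s1)
print(table.scan())

try:
    # A writer that still thinks s1 is current is rejected, not half-applied.
    table.commit_append(["data/d.parquet"], expected_current=s1)
except RuntimeError as err:
    print("rejected:", err)
```

Old snapshots stay addressable after the swap, which is the same mechanism time travel relies on.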

Investigate:

Read the Iceberg spec; understand metadata.json, snapshot, manifest, data file.
Compare Iceberg vs Delta vs Hudi: what does each optimize for?
Implement schema evolution on an Iceberg table; verify old data still readable with the new schema.

2.3 Object storage as substrate

→ Pattern: append-only-log (Parquet data files are immutable once written; tables grow by adding new files)

MinIO = self-hosted S3-compatible object storage. AWS S3 is the same shape; MinIO is what you run locally. The S3 API has won: even GCS exposes an S3-interoperable XML API. Code that targets the S3 API ports to most clouds and to every homelab, usually by changing only the endpoint and credentials.
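A minimal sketch of what “same API, different endpoint” looks like with boto3. The endpoint URL and credentials are placeholders, and the helper names are hypothetical; the only MinIO-specific parts are `endpoint_url` and path-style addressing.

```python
def minio_client_kwargs(endpoint_url: str, access_key: str, secret_key: str) -> dict:
    """Keyword arguments for boto3.client() that point the S3 SDK at MinIO."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,        # omit this key to talk to AWS S3 instead
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": "us-east-1",          # assumption: MinIO left on its default region
    }


def make_s3_client(**kwargs):
    """Build the client lazily so this module imports even without boto3 installed."""
    import boto3
    from botocore.config import Config

    # Path-style addressing (http://host/bucket/key) avoids needing per-bucket DNS.
    return boto3.client(config=Config(s3={"addressing_style": "path"}), **kwargs)


# Placeholder endpoint and credentials; substitute your own.
kwargs = minio_client_kwargs("http://minio.example.internal:9000", "ACCESS_KEY", "SECRET_KEY")
```

With a MinIO instance reachable, `make_s3_client(**kwargs).list_buckets()` should behave as it would against AWS.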

Investigate:

Stand up MinIO on K3s with persistent storage (Longhorn-backed).
Test S3 SDK compatibility (boto3, aws-cli, s3cmd) against MinIO.
Set up multi-node erasure coding for durability — survive losing 1 node out of 4.
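The “lose 1 of 4” target is worth checking on paper before deploying. The sketch below assumes one drive per node and the quorum rules as I understand them from the MinIO docs (reads need as many drives as there are data shards; writes need one more when parity is exactly half the set). Verify against the docs for the version you run.

```python
def erasure_profile(drives: int, parity: int) -> dict:
    """Usable capacity and failure tolerance for one erasure set."""
    if not 0 < parity <= drives // 2:
        raise ValueError("parity must be between 1 and half the erasure set size")
    data = drives - parity
    read_quorum = data
    # Assumed rule: with parity == drives/2, writes need one extra drive to avoid split-brain.
    write_quorum = data + 1 if parity * 2 == drives else data
    return {
        "usable_fraction": data / drives,
        "read_tolerates": drives - read_quorum,
        "write_tolerates": drives - write_quorum,
    }


four_node = erasure_profile(drives=4, parity=2)
print(four_node)
```

Under these assumptions a 4-drive set with parity 2 keeps serving reads with two drives gone but accepts writes with only one gone, at the cost of half the raw capacity.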

2.4 Catalog (Hive Metastore, Nessie, Polaris)

A catalog tells engines where Iceberg tables live. Hive Metastore is legacy; Project Nessie and Apache Polaris are the modern Iceberg-native options. The catalog is also where governance and access control will hook in during Phase 19, so pick one whose roadmap you trust for that.

Investigate:

Install Nessie or Polaris as the Iceberg catalog.
Connect Spark + Trino to the same catalog (Trino lands in P18, but verify the catalog is engine-agnostic now).
Git-like branching on data via Nessie: branch a table, mutate, merge. This is what makes P17 backfill safe.
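A sketch of wiring Spark to a Nessie-backed Iceberg catalog, plus the branch workflow as SQL strings. The class names and property keys are as I understand them from the Iceberg and Nessie docs; the catalog name `nessie`, the URIs, the bucket, and the branch name `backfill` are placeholders. It also assumes the Iceberg Spark runtime and Nessie SQL extension jars are already on the classpath.

```python
def nessie_spark_conf(nessie_uri: str, warehouse: str, s3_endpoint: str,
                      ref: str = "main", catalog: str = "nessie") -> dict:
    """Spark properties registering an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": nessie_uri,
        f"{prefix}.ref": ref,                      # branch the session starts on
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,      # MinIO instead of AWS
        f"{prefix}.s3.path-style-access": "true",
    }


def build_session(conf: dict):
    """Create the SparkSession lazily; requires pyspark and the connector jars."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("lakehouse")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = nessie_spark_conf(
    nessie_uri="http://nessie.example.internal:19120/api/v1",  # placeholder
    warehouse="s3://lakehouse/warehouse",                       # placeholder bucket
    s3_endpoint="http://minio.example.internal:9000",           # placeholder
)

# Branch, mutate on the branch, then merge: run each with spark.sql(...).
BRANCH_WORKFLOW = [
    "CREATE BRANCH IF NOT EXISTS backfill IN nessie FROM main",
    "USE REFERENCE backfill IN nessie",
    "MERGE BRANCH backfill INTO main IN nessie",
]
```

Because the catalog holds the table pointer, any other engine configured against the same Nessie URI and ref should resolve the same tables.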

2.5 Parquet (the file format)

Columnar, compressed, splittable. The file format the lakehouse stores data in. Every analytical engine on the planet has a fast Parquet reader; that ubiquity is why it’s the substrate.

Investigate:

Read a Parquet file with pyarrow; inspect row groups, schema, statistics.
Compare Parquet vs ORC vs JSON for the same data: size + query speed.
Why columnar wins for analytics queries (column pruning + vectorized execution).
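To make the column-pruning argument concrete, here is a pure-Python model of row groups with per-column min/max statistics. It is not the Parquet format and does not use pyarrow; it only imitates the two tricks a columnar reader relies on: touch one column, and skip row groups whose statistics rule them out.

```python
from typing import Dict, List, Tuple

# Hypothetical rows shaped loosely like commit records.
ROWS = [{"repo": f"repo-{i % 3}", "additions": i, "message": "x" * 50} for i in range(1000)]


def to_row_groups(rows: List[dict], size: int) -> List[dict]:
    """Pivot rows into column chunks and record min/max per column."""
    groups = []
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def sum_additions_at_least(groups: List[dict], threshold: int) -> Tuple[int, int]:
    """Sum one column, skipping row groups whose max is below the threshold."""
    total, groups_read = 0, 0
    for group in groups:
        _, col_max = group["stats"]["additions"]
        if col_max < threshold:
            continue  # statistics prove no row here can match
        groups_read += 1
        # Only the 'additions' chunk is touched; 'repo' and 'message' are never read.
        total += sum(v for v in group["columns"]["additions"] if v >= threshold)
    return total, groups_read


groups = to_row_groups(ROWS, size=100)
total, groups_read = sum_additions_at_least(groups, threshold=900)
print(f"read {groups_read} of {len(groups)} row groups, total={total}")
```

A row-oriented format would have to decode every field of every row to answer the same question.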

2.6 Snapshots + time travel

→ Pattern: snapshot-plus-delta

Iceberg keeps every snapshot until you expire it, so you can query “as of 7 days ago.” Compaction rewrites small data files into larger ones; a retention policy expires old snapshots and bounds how far back you can travel. Time travel is the feature that turns “what changed?” from a forensics question into a SQL query.
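How “as of” resolution and retention interact can be sketched without Iceberg at all. The snapshot log below is invented data, and the function names are hypothetical; the point is only that expiring snapshots shrinks the time-travel window.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

SnapshotLog = List[Tuple[datetime, int]]  # (committed_at, snapshot_id), oldest first


def snapshot_as_of(log: SnapshotLog, ts: datetime) -> int:
    """Return the newest snapshot committed at or before ts."""
    visible = [sid for committed_at, sid in log if committed_at <= ts]
    if not visible:
        raise LookupError("no snapshot that old is retained")
    return visible[-1]


def expire_older_than(log: SnapshotLog, cutoff: datetime) -> SnapshotLog:
    """Drop snapshots committed before cutoff, always keeping the current one."""
    last = len(log) - 1
    return [entry for i, entry in enumerate(log) if entry[0] >= cutoff or i == last]


now = datetime(2025, 10, 20, tzinfo=timezone.utc)  # hypothetical "today"
log = [(now - timedelta(days=d), sid) for d, sid in [(10, 1), (8, 2), (3, 3), (1, 4)]]

week_ago = now - timedelta(days=7)
resolved = snapshot_as_of(log, week_ago)          # newest snapshot at least 7 days old
trimmed = expire_older_than(log, cutoff=week_ago)
print(resolved, [sid for _, sid in trimmed])
```

After trimming, the same “7 days ago” lookup fails, because the snapshot that answered it has been expired. Pick the retention window with that in mind.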

2.7 Notebooks-as-a-service (JupyterHub)

JupyterHub on K8s gives every user a notebook with preloaded creds (S3 keys, Iceberg catalog endpoint, OIDC token). Cold-start to “running a Spark query against Iceberg” should be under 60 seconds — anything slower and the notebook isn’t the entry point, the YAML repo still is.

→ Pattern: platform-as-product (reinforced — notebook UX is platform UX)

Investigate:

Install JupyterHub via Helm on basecamp; configure OIDC auth (Dex from Y2).
Pre-load each user’s notebook with: S3 endpoint, Nessie URL, Spark client config.
Test the cold-start: a user logs in, gets a fresh notebook, can spark.read.parquet("s3a://...") immediately.
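One way to keep the cold-start honest is a first-cell helper that fails fast when the spawner did not inject what the notebook needs. The variable list is an assumption: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_ENDPOINT_URL` are names the AWS SDKs read, while `NESSIE_URI` is a made-up name for this sketch.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = (
    "AWS_ENDPOINT_URL",       # MinIO endpoint
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "NESSIE_URI",             # hypothetical variable name for the catalog endpoint
)


def load_lakehouse_env(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the preloaded settings, or raise naming everything that is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("notebook not provisioned, missing: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def redacted(settings: Mapping[str, str]) -> Dict[str, str]:
    """Safe-to-print view: hide anything that looks like a secret."""
    return {k: ("***" if "SECRET" in k or "KEY" in k else v) for k, v in settings.items()}


# Example with an in-memory mapping instead of the real process environment.
example = load_lakehouse_env({
    "AWS_ENDPOINT_URL": "http://minio.example.internal:9000",
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "SECRET_KEY",
    "NESSIE_URI": "http://nessie.example.internal:19120/api/v1",
})
print(redacted(example))
```

In a real notebook you would call `load_lakehouse_env()` with no arguments and feed the result into the Spark session config.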

3. TRADE-OFFS

Decision	Option A	Option B	Option C
Table format	Iceberg (Netflix; Apache TLP)	Delta Lake (Databricks-flavored)	Hudi (Uber)
Catalog	Hive Metastore (legacy)	Nessie (git-like branching)	Polaris (Apache)
Object storage	S3	MinIO (self-hosted)	GCS
Engine for ad-hoc	Spark	DuckDB (lightweight)	Trino
Notebook UX	JupyterHub on K8s	per-user VMs	hosted (Colab)

4. TOOLS (as of 2025-10)

MinIO (S3-compatible)
Apache Iceberg (table format)
Nessie OR Polaris (catalog)
Apache Spark 3.5+ with Iceberg + Nessie connectors (deepens P17)
DuckDB (local Parquet ops + lightweight queries)
Trino (deepens P18)
PyArrow + pandas for notebook-side ops
JupyterHub (Helm chart on K3s)
Polars (modern DataFrame; Year 4 ML may prefer over pandas)

5. MASTERY

5.1 Reading list

Required	Why
Iceberg table spec + Iceberg docs (iceberg.apache.org)	Start here
DDIA Ch. 3 (re-read)	Storage & retrieval with lakehouse lens
MinIO docs — distributed deployment + erasure coding	Real ops

Recommended	Why
”Data Mesh” (Zhamak Dehghani)	Year 5 architectural lens
”The Data Warehouse Toolkit” (Kimball)	Historical context

5.2 Operational depth checklist

[ ] Deploy MinIO on basecamp K3s (4+ nodes, distributed mode); verify erasure coding
[ ] Deploy Nessie (or Polaris) as Iceberg catalog
[ ] Use PySpark to write an Iceberg table to MinIO via Nessie catalog
[ ] Schema-evolve an Iceberg table: add a column; verify old + new readers work
[ ] Time-travel query (Spark SQL): SELECT * FROM table TIMESTAMP AS OF '2027-...'
[ ] Branch a table via Nessie; mutate; verify main is unaffected
[ ] Read the same Iceberg table from Spark + DuckDB; same data
[ ] Compact small files (Iceberg's compaction action)
[ ] Set up retention: expire snapshots older than 7 days
[ ] Deploy JupyterHub on basecamp; OIDC auth via Dex; pre-loaded with S3 + Nessie creds
[ ] Move 1GB of test data into the lakehouse; query 3 ways (Spark, DuckDB, notebook)
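Several checklist items reduce to one Spark SQL statement each. The sketch below builds those statements as strings, assuming a catalog named `nessie`, Iceberg's Spark SQL extensions, and placeholder timestamps; the added column `branch` is invented for the example. With a Nessie catalog, snapshot cleanup may need Nessie's own GC tooling rather than `expire_snapshots`, so treat that call as the plain-Iceberg form.

```python
from typing import Dict


def checklist_sql(table: str = "nessie.abukix.commits",
                  as_of: str = "2025-10-01 00:00:00",
                  expire_before: str = "2025-10-13 00:00:00") -> Dict[str, str]:
    """Spark SQL for schema evolution, time travel, compaction, and retention."""
    catalog, namespace, name = table.split(".")
    ident = f"{namespace}.{name}"
    return {
        # Metadata-only change: existing data files are not rewritten.
        "evolve": f"ALTER TABLE {table} ADD COLUMNS (branch STRING)",
        "time_travel": f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'",
        "compact": f"CALL {catalog}.system.rewrite_data_files(table => '{ident}')",
        "expire": (
            f"CALL {catalog}.system.expire_snapshots("
            f"table => '{ident}', older_than => TIMESTAMP '{expire_before}')"
        ),
    }


def run_all(spark, statements: Dict[str, str]) -> None:
    """Execute each statement on a session exposing .sql(), such as a SparkSession."""
    for label, sql in statements.items():
        print(f"-- {label}")
        spark.sql(sql)


statements = checklist_sql()
```

Rows written before the `evolve` step read back with NULL in the new column, which is the “old + new readers work” check in practice.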

5.3 The seed dataset for `personal-api` (lands in P17)

Plan ahead: this phase lands MinIO + Iceberg. P17 (Airflow) will populate them. Define the schema now:

-- abukix.commits: every git commit you've made publicly
CREATE TABLE abukix.commits (
    repo           STRING,
    sha            STRING,
    author_date    TIMESTAMP,
    committer_date TIMESTAMP,
    message        STRING,
    files_changed  INT,
    additions      INT,
    deletions      INT
)
USING iceberg
PARTITIONED BY (days(author_date));

-- abukix.weekly_logs: your own weekly logs (parsed from ops-handbook markdown)
CREATE TABLE abukix.weekly_logs (
    week_start     DATE,
    learned        STRING,
    broke          STRING,
    surprised      STRING,
    open_questions STRING,
    progress       STRING,
    sustainability STRING
)
USING iceberg;

The commits table is what personal-api queries from P18 onward. Year 4’s notes-rag will consume weekly_logs. Defining the schema in P15 forces you to answer “what shape is the data we’ll actually want?” before any ETL exists — schema-first, not data-first.
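The `days(author_date)` partition transform is simple enough to reason about by hand: Iceberg defines it as whole days since 1970-01-01. A sketch, assuming timestamps are timezone-aware and bucketed in UTC; the commit records are invented.

```python
from collections import defaultdict
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def days_partition(ts: datetime) -> int:
    """Partition value for a timestamp: whole days since the Unix epoch, in UTC."""
    if ts.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return (ts.astimezone(timezone.utc).date() - EPOCH).days


# Hypothetical commits: (sha, author_date)
commits = [
    ("a1", datetime(2025, 10, 1, 0, 5, tzinfo=timezone.utc)),
    ("b2", datetime(2025, 10, 1, 23, 59, tzinfo=timezone.utc)),
    ("c3", datetime(2025, 10, 2, 0, 1, tzinfo=timezone.utc)),
]

by_partition = defaultdict(list)
for sha, author_date in commits:
    by_partition[days_partition(author_date)].append(sha)

for day, shas in sorted(by_partition.items()):
    print(day, shas)
```

A query filtered to one day of `author_date` only has to open the files in that one partition, which is why the partition column should match how `personal-api` will filter.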

6. COMPARE: Iceberg vs Delta Lake

Same exercise on both. Compare:

Setup complexity — what it takes to get a table on each.
Performance for write-heavy workloads.
Performance for read-heavy workloads.
Ecosystem maturity (engines supporting natively in 2026).
Where each wins, and what the migration story looks like if you pick wrong.

400 words.

7. OPERATE

3+ runbooks (minio-node-failure, iceberg-metadata-corruption, jupyterhub-user-onboarding)
1+ postmortem
Weekly log

8. CONTRIBUTE

Iceberg, MinIO, Nessie, JupyterHub — all welcoming. A docs PR for an under-documented Iceberg evolution rule or a Helm chart values fix for JupyterHub both count.

Validation criteria

[ ] All 11 operational depth checks
[ ] MinIO + Nessie + Iceberg in basecamp Tier 3
[ ] JupyterHub in basecamp Tier 5; OIDC auth working
[ ] Iceberg vs Delta comparison written up
[ ] `abukix.commits` schema defined (data lands in P17)
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - oltp-vs-olap → DEEP
    - schema-on-read-vs-write → DEEP
    - append-only-log → DEEP
    - snapshot-plus-delta → DEEP
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — stand up MinIO + Nessie; create an Iceberg table; write 1M rows via Spark; query from DuckDB; schema-evolve; time-travel back. End-to-end. Notebook session via JupyterHub.
Diagnose (60 min) — scenario: an Iceberg query returns stale data despite a recent write. Find why (snapshot isolation, catalog staleness, branching).
Articulate (30 min) — 600 words: “Why is the lakehouse displacing the warehouse model? What does it give up?”

Anti-patterns

Anti-pattern	Why
Putting OLTP data in the lakehouse	Wrong shape; use Postgres
One huge table for everything	Iceberg likes reasonable file sizes; partition strategy matters
Skipping compaction	Small files kill query perf
No snapshot retention policy	Snapshots forever = costs forever
JupyterHub without OIDC	Don’t reinvent auth; use Dex

Patterns deepened this phase

oltp-vs-olap → DEEP
schema-on-read-vs-write → DEEP
append-only-log → DEEP
snapshot-plus-delta → DEEP
platform-as-product → reinforced (JupyterHub as platform UX)

→ Next: Phase 16: Stream Processing
