5-YEAR PROGRAM · YEAR 3 · PHASE 15
UPCOMING

Lakehouse: MinIO + Iceberg + JupyterHub

Second phase of Year 3. Build the lakehouse foundation that everything ML/AI in Y4-Y5 reads from + writes to. Object storage as substrate; Iceberg as the table format; JupyterHub as the notebook entry point. ~8 weeks, ~100 hrs.


Prerequisites

  • Phase 14 complete — observability stack live
  • 64GB RAM in place (see homelab/hardware)
  • You accept: the lakehouse is the modern data platform architecture. It replaces “warehouse” + “data lake” with one storage tier + many compute engines. Iceberg is the open table format that makes it work. JupyterHub is the interactive entry point.

Why this phase exists

Years 4-5 ML/AI work needs a place where data lives. Building it on Postgres doesn’t scale; building it on a single cloud warehouse locks you in and locks you out of the open ecosystem. The lakehouse pattern (MinIO + Iceberg + open compute) is what 2025 data platforms look like — and the pattern, not the specific tools, is the thing that transfers when the stack rotates again in 2030.

JupyterHub-as-a-service joins basecamp Tier 5 this phase. It’s the entry point for every Studio composition recipe. Year 4 RAG, Year 4 ML training, Year 5 portal command palette — all start with “open a notebook.” The notebook is also the first basecamp surface where the platform stops being a YAML repo and becomes a UX a non-platform-engineer would recognize.

This is the phase where Tier 3 (Lakehouse) of basecamp comes online. Combined with Tier 4 (Processing — landing across P16 and P17) and Tier 8 (Serving — P18), it’s the data-engineering shape that makes the Year 3 Master Plan commitment real.

→ Pattern: oltp-vs-olap (DEEP this phase)
→ Pattern: schema-on-read-vs-write
→ Pattern: snapshot-plus-delta
→ Pattern: append-only-log


1. PROBLEM

You have analytical data (events, logs, ML training data, your own GitHub history). You want:

  • Cheap storage at scale — object storage solves this
  • Multiple query engines reading the same data (SQL + Spark + ML pipelines)
  • Schema evolution without rewriting all the data
  • Time-travel queries (“what did this table look like 7 days ago?”)
  • Transactional guarantees on writes (no half-written tables)
  • A notebook environment where data scientists / future-you write ad-hoc queries

Iceberg + object storage solve the first five. JupyterHub solves the sixth.

→ See: storage-and-data


2. PRINCIPLES

2.1 Storage / compute separation

The lakehouse insight: object storage is dumb cheap (S3, GCS, MinIO); compute is expensive and bursty. Separate them. Multiple compute engines can read the same data without ETL between them — Spark for batch, Trino for ad-hoc, DuckDB for laptop-local exploration, all hitting the same Parquet bytes.

Investigate:

  • Compare lakehouse vs warehouse: where is storage? where is compute? what’s the cost shape?
  • Why did the warehouse era (Teradata, Vertica) couple storage and compute? Why did the lakehouse decouple them? (Hint: cloud object storage had to exist first.)

2.2 Open table formats (Iceberg, Delta, Hudi)

A “table” in the lakehouse is a directory of Parquet files + a metadata layer that tracks which files belong to which snapshot. The metadata layer is where transactional guarantees come from — no half-written tables, no readers seeing partial writes.
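To make the metadata idea concrete, here is a minimal toy model in plain Python. It is not the Iceberg implementation: `ToyTable`, `Snapshot`, and `commit_append` are hypothetical names, and the real format adds manifests, manifest lists, and a catalog-side atomic swap. The sketch only shows why a single pointer swap means readers see the old table or the new one, never a mix.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable set of Parquet paths visible in this snapshot


@dataclass
class ToyTable:
    """Toy stand-in for table metadata: a snapshot log plus a current pointer."""
    snapshots: dict = field(default_factory=dict)
    current_id: Optional[int] = None

    def current_files(self):
        if self.current_id is None:
            return ()
        return self.snapshots[self.current_id].data_files

    def commit_append(self, new_files, expected_current):
        # Optimistic concurrency: refuse if another writer committed first.
        if expected_current != self.current_id:
            raise RuntimeError("concurrent commit detected; retry on the new snapshot")
        new_id = len(self.snapshots) + 1
        files = self.current_files() + tuple(new_files)
        self.snapshots[new_id] = Snapshot(new_id, files)
        # The pointer swap is the commit. Nothing is visible before this line.
        self.current_id = new_id
        return new_id


table = ToyTable()
s1 = table.commit_append(["data/a.parquet"], expected_current=None)
reader_view = table.current_files()  # a reader pins what snapshot 1 contains
s2 = table.commit_append(["data/b.parquet"], expected_current=s1)

print(reader_view)            # ('data/a.parquet',)
print(table.current_files())  # ('data/a.parquet', 'data/b.parquet')
```

The reader that started before the second commit keeps its view of snapshot 1; a writer holding a stale `expected_current` is rejected rather than overwriting newer work.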

Investigate:

  • Read the Iceberg spec; understand metadata.json, snapshot, manifest, data file.
  • Compare Iceberg vs Delta vs Hudi: what does each optimize for?
  • Implement schema evolution on an Iceberg table; verify old data still readable with the new schema.
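For the schema-evolution exercise, it helps to know what to expect before running it. Iceberg identifies columns by field ID rather than by name, so a rename is a metadata change and an added column reads as null from files written earlier. The toy below illustrates that behavior in plain Python; the dict layout and `read_rows` are illustrative assumptions, not the on-disk format.

```python
# Schema maps field ID -> column name. The ID is the stable identity.
schema_v1 = {1: "repo", 2: "sha"}
schema_v2 = {1: "repo", 2: "commit_sha", 3: "additions"}  # rename ID 2, add ID 3

# A data file written under schema_v1 stores values keyed by field ID.
old_file_rows = [{1: "basecamp", 2: "abc123"}]


def read_rows(file_rows, schema):
    """Project stored rows onto a schema; IDs absent from the file read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in file_rows]


rows = read_rows(old_file_rows, schema_v2)
print(rows)  # [{'repo': 'basecamp', 'commit_sha': 'abc123', 'additions': None}]
```

No data file was rewritten: the old file is read through the new schema, which is the property the exercise asks you to verify on a real table.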

2.3 Object storage as substrate

→ Pattern: append-only-log (Parquet files are immutable once written; a table grows by adding new files)

MinIO is self-hosted, S3-compatible object storage. AWS S3 is the same shape; MinIO is what you run locally. The S3 API has won: even GCS offers an S3-compatible interoperability mode on its XML API. Code that targets the S3 API is portable across most clouds and every homelab.

Investigate:

  • Stand up MinIO on K3s with persistent storage (Longhorn-backed).
  • Test S3 SDK compatibility (boto3, aws-cli, s3cmd) against MinIO.
  • Set up multi-node erasure coding for durability — survive losing 1 node out of 4.
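Before sizing the 4-node cluster, the capacity trade-off can be worked out on paper. The sketch below assumes one drive per node and a Reed-Solomon style layout of K data shards plus M parity shards per object, where reads survive the loss of up to M shards. `erasure_profile` is a hypothetical helper; MinIO's actual default parity and write-quorum rules should be checked against its documentation.

```python
def erasure_profile(total_drives, parity, drive_tb):
    """Capacity and read tolerance for K data + M parity shards per object."""
    if not 1 <= parity <= total_drives // 2:
        raise ValueError("parity must be between 1 and half the drive count")
    data = total_drives - parity
    return {
        "data_shards": data,
        "parity_shards": parity,
        "usable_tb": data * drive_tb,
        "storage_efficiency": data / total_drives,
        "drives_lost_still_readable": parity,
    }


profile = erasure_profile(total_drives=4, parity=2, drive_tb=1.0)
print(profile["usable_tb"], profile["drives_lost_still_readable"])  # 2.0 2
```

With 4 drives and parity 2, half the raw capacity is usable and losing one node leaves data readable, which matches the durability target in the list above.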

2.4 Catalog (Hive Metastore, Nessie, Polaris)

A catalog tells engines where Iceberg tables live. Hive Metastore is legacy; Project Nessie and Apache Polaris are the modern Iceberg-native options. The catalog is also where governance and access control will hook in during Phase 19, so pick one whose roadmap you trust for that.

Investigate:

  • Install Nessie or Polaris as the Iceberg catalog.
  • Connect Spark + Trino to the same catalog (Trino lands in P18, but verify the catalog is engine-agnostic now).
  • Git-like branching on data via Nessie: branch a table, mutate, merge. This is what makes P17 backfill safe.
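The branching exercise is easier to reason about once you see that a branch is a set of pointers, not a copy of the data. The toy below models that idea only; `ToyCatalog` and its methods are hypothetical and are not the Nessie API, and the merge here skips the conflict checks a real catalog performs.

```python
class ToyCatalog:
    """Toy model: each branch maps table names to a snapshot ID (pointers, not data)."""

    def __init__(self):
        self.refs = {"main": {}}

    def create_branch(self, name, from_ref="main"):
        # Copy the pointers only; no data files are duplicated.
        self.refs[name] = dict(self.refs[from_ref])

    def commit(self, branch, table, snapshot_id):
        self.refs[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Simplification: the source always wins. A real catalog checks for conflicts.
        self.refs[target].update(self.refs[source])


catalog = ToyCatalog()
catalog.commit("main", "abukix.commits", snapshot_id=1)
catalog.create_branch("backfill")
catalog.commit("backfill", "abukix.commits", snapshot_id=2)

main_before_merge = catalog.refs["main"]["abukix.commits"]
catalog.merge("backfill")
main_after_merge = catalog.refs["main"]["abukix.commits"]
print(main_before_merge, main_after_merge)  # 1 2
```

Until the merge, readers on `main` keep seeing snapshot 1. That isolation is what makes a backfill safe, and it is also one way a query can look stale: the write landed on a branch the reader is not using.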

2.5 Parquet (the file format)

Columnar, compressed, splittable. The file format the lakehouse stores data in. Every analytical engine on the planet has a fast Parquet reader; that ubiquity is why it’s the substrate.

Investigate:

  • Read a Parquet file with pyarrow; inspect row groups, schema, statistics.
  • Compare Parquet vs ORC vs JSON for the same data: size + query speed.
  • Why columnar wins for analytics queries (column pruning + vectorized execution).
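The columnar argument can be illustrated without any Parquet library. The toy below stores each column as its own chunk per row group and keeps min/max statistics, loosely mirroring what you will see when inspecting a real file with pyarrow. The layout and `sum_additions_over` are illustrative assumptions, not Parquet's actual encoding.

```python
def make_group(start, stop):
    """One row group: a chunk per column plus min/max statistics."""
    additions = list(range(start, stop))
    return {
        "columns": {"sha": [f"sha{i}" for i in additions], "additions": additions},
        "stats": {"additions": (additions[0], additions[-1])},
    }


row_groups = [make_group(1, 101), make_group(101, 201), make_group(201, 301)]


def sum_additions_over(groups, threshold):
    """SUM(additions) WHERE additions > threshold, reading as little as possible."""
    total, chunks_read = 0, 0
    for group in groups:
        _, col_max = group["stats"]["additions"]
        if col_max <= threshold:
            continue  # statistics prove nothing in this group qualifies
        chunks_read += 1  # only the 'additions' chunk; 'sha' is never touched
        total += sum(v for v in group["columns"]["additions"] if v > threshold)
    return total, chunks_read


total, chunks_read = sum_additions_over(row_groups, 250)
print(total, chunks_read)  # 13775 1
```

Of six column chunks in the file, the query reads one. A row-oriented layout would have had to read every row, including the `sha` strings it never uses.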

2.6 Snapshots + time travel

→ Pattern: snapshot-plus-delta

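Iceberg records every commit as a snapshot, so you can query “as of 7 days ago.” Compaction rewrites small data files into larger ones; old snapshots expire on a retention policy, and time travel only reaches back as far as the oldest retained snapshot. Time travel is the feature that turns “what changed?” from a forensics question into a SQL query.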
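A small sketch of how "as of" resolution and retention interact, in plain Python. The snapshot log, dates, and function names are hypothetical; a real engine resolves this from the table's metadata, but the rule is the same: pick the latest snapshot committed at or before the requested time, and fail if retention has already dropped it.

```python
from datetime import datetime, timedelta

# (commit time, snapshot id), oldest first. Dates are made up for illustration.
snapshots = [
    (datetime(2025, 1, 1), 1),
    (datetime(2025, 1, 5), 2),
    (datetime(2025, 1, 9), 3),
]


def snapshot_as_of(log, ts):
    """Return the latest snapshot committed at or before ts."""
    candidates = [sid for committed, sid in log if committed <= ts]
    if not candidates:
        raise LookupError("no snapshot at or before that time")
    return candidates[-1]


def expire_older_than(log, now, keep=timedelta(days=7)):
    """Drop snapshots older than the retention window, never the current one."""
    cutoff = now - keep
    kept = [entry for entry in log if entry[0] >= cutoff]
    return kept or log[-1:]


print(snapshot_as_of(snapshots, datetime(2025, 1, 6)))  # 2

retained = expire_older_than(snapshots, now=datetime(2025, 1, 10))
print([sid for _, sid in retained])  # [2, 3]
```

After expiry, asking for the table as of 2 January raises `LookupError`: the retention policy bounds how far back time travel can go.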

2.7 Notebooks-as-a-service (JupyterHub)

JupyterHub on K8s gives every user a notebook with preloaded creds (S3 keys, Iceberg catalog endpoint, OIDC token). Cold-start to “running a Spark query against Iceberg” should be under 60 seconds — anything slower and the notebook isn’t the entry point, the YAML repo still is.

→ Pattern: platform-as-product (reinforced — notebook UX is platform UX)

Investigate:

  • Install JupyterHub via Helm on basecamp; configure OIDC auth (Dex from Y2).
  • Pre-load each user’s notebook with: S3 endpoint, Nessie URL, Spark client config.
  • Test the cold-start: a user logs in, gets a fresh notebook, can spark.read.parquet("s3a://...") immediately.
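One way to pre-load the notebook is a small helper, baked into the user image, that turns environment variables into Spark settings. The sketch below is an assumption-laden starting point: the environment variable names, default hostnames, and the catalog name `nessie` are hypothetical, and the property keys should be verified against the Iceberg and Nessie documentation for the versions you deploy.

```python
import os


def lakehouse_spark_conf(env=os.environ):
    """Build Spark settings for an Iceberg catalog on Nessie with S3A pointed at MinIO."""
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        # Hypothetical in-cluster service names; replace with your own.
        "spark.sql.catalog.nessie.uri": env.get("NESSIE_URI", "http://nessie:19120/api/v1"),
        "spark.sql.catalog.nessie.ref": env.get("NESSIE_REF", "main"),
        "spark.sql.catalog.nessie.warehouse": env.get("WAREHOUSE", "s3a://warehouse/"),
        "spark.hadoop.fs.s3a.endpoint": env.get("S3_ENDPOINT", "http://minio:9000"),
        # MinIO is usually addressed path-style rather than by bucket subdomain.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": env.get("AWS_ACCESS_KEY_ID", ""),
        "spark.hadoop.fs.s3a.secret.key": env.get("AWS_SECRET_ACCESS_KEY", ""),
    }


conf = lakehouse_spark_conf({"S3_ENDPOINT": "http://minio.basecamp:9000"})
for key in sorted(conf):
    if "secret" not in key:
        print(key, "=", conf[key])
```

In a notebook, the dict would be applied by looping over it and calling `SparkSession.builder.config(key, value)` for each entry before `getOrCreate()`. Keeping the helper free of Spark imports makes it cheap to unit test in the image build.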

3. TRADE-OFFS

| Decision | Option A | Option B | Option C |
|---|---|---|---|
| Table format | Iceberg (Netflix; Apache TLP) | Delta Lake (Databricks-flavored) | Hudi (Uber) |
| Catalog | Hive Metastore (legacy) | Nessie (git-like branching) | Polaris (Apache) |
| Object storage | S3 | MinIO (self-hosted) | GCS |
| Engine for ad-hoc | Spark | DuckDB (lightweight) | Trino |
| Notebook UX | JupyterHub on K8s | per-user VMs | hosted (Colab) |

4. TOOLS (as of 2025-10)

  • MinIO (S3-compatible)
  • Apache Iceberg (table format)
  • Nessie OR Polaris (catalog)
  • Apache Spark 3.5+ with Iceberg + Nessie connectors (deepens P17)
  • DuckDB (local Parquet ops + lightweight queries)
  • Trino (deepens P18)
  • PyArrow + pandas for notebook-side ops
  • JupyterHub (Helm chart on K3s)
  • Polars (modern DataFrame; Year 4 ML may prefer over pandas)

5. MASTERY

5.1 Reading list

| Required | Why |
|---|---|
| Iceberg paper + Iceberg docs (iceberg.apache.org) | Start here |
| DDIA Ch. 3 (re-read) | Storage & retrieval with lakehouse lens |
| MinIO docs: distributed deployment + erasure coding | Real ops |

| Recommended | Why |
|---|---|
| “Data Mesh” (Zhamak Dehghani) | Year 5 architectural lens |
| “The Data Warehouse Toolkit” (Kimball) | Historical context |
5.2 Operational depth checklist

[ ] Deploy MinIO on basecamp K3s (4+ nodes, distributed mode); verify erasure coding
[ ] Deploy Nessie (or Polaris) as Iceberg catalog
[ ] Use PySpark to write an Iceberg table to MinIO via Nessie catalog
[ ] Schema-evolve an Iceberg table: add a column; verify old + new readers work
[ ] Time-travel query (Spark SQL): SELECT * FROM table TIMESTAMP AS OF '2027-...'
[ ] Branch a table via Nessie; mutate; verify main is unaffected
[ ] Read the same Iceberg table from Spark + DuckDB; same data
[ ] Compact small files (Iceberg's compaction action)
[ ] Set up retention: expire snapshots older than 7 days
[ ] Deploy JupyterHub on basecamp; OIDC auth via Dex; pre-loaded with S3 + Nessie creds
[ ] Move 1GB of test data into the lakehouse; query 3 ways (Spark, DuckDB, notebook)
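For the compaction item, it can help to see what a rewrite is trying to achieve before running it on a real table. The sketch below is a simplified planner in plain Python, not Iceberg's algorithm: `plan_compaction` is a hypothetical helper and the 128 MB target is an arbitrary illustration. On a real table the work is done by Iceberg's data-file rewrite action.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group files, largest first, so each group stays at or under the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [4, 8, 16, 60, 70, 2, 100]
plan = plan_compaction(small_files)
print(plan)  # [[100], [70], [60, 16, 8, 4, 2]]
print(len(small_files), "files ->", len(plan), "files")
```

Each group becomes one rewritten file, so seven files turn into three. Fewer, larger files mean fewer object-store requests and less metadata to plan over per query.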

5.3 The seed dataset for personal-api (lands in P17)

Plan ahead: this phase lands MinIO + Iceberg. P17 (Airflow) will populate them. Define the schema now:

-- abukix.commits: every git commit you've made publicly
-- (Spark SQL with the Iceberg extensions; the partition clause goes after the column list)
CREATE TABLE abukix.commits (
  repo STRING,
  sha STRING,
  author_date TIMESTAMP,
  committer_date TIMESTAMP,
  message STRING,
  files_changed INT,
  additions INT,
  deletions INT
)
USING iceberg
PARTITIONED BY (days(author_date));

-- abukix.weekly_logs: your own weekly logs (parsed from ops-handbook markdown)
-- Unpartitioned: roughly one row per week, so a single small file is fine.
CREATE TABLE abukix.weekly_logs (
  week_start DATE,
  learned STRING,
  broke STRING,
  surprised STRING,
  open_questions STRING,
  progress STRING,
  sustainability STRING
)
USING iceberg;

The commits table is what personal-api queries from P18 onward. Year 4’s notes-rag will consume weekly_logs. Defining the schema in P15 forces you to answer “what shape is the data we’ll actually want?” before any ETL exists — schema-first, not data-first.
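The `days(author_date)` partition transform in the commits schema maps each timestamp to a whole number of days since 1970-01-01, so all commits from the same calendar day land in the same partition. A minimal sketch of that mapping, assuming naive UTC timestamps; `days_transform` and the sample commits are hypothetical:

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def days_transform(ts: datetime) -> int:
    """Days since the Unix epoch for the calendar day containing ts."""
    return (ts.date() - EPOCH).days


# Made-up commits: two on the same day, one just after midnight.
commits = [
    ("abc", datetime(2025, 3, 1, 9, 30)),
    ("def", datetime(2025, 3, 1, 23, 59)),
    ("ghi", datetime(2025, 3, 2, 0, 1)),
]

partitions = {}
for sha, author_date in commits:
    partitions.setdefault(days_transform(author_date), []).append(sha)

for day, shas in sorted(partitions.items()):
    print(day, shas)
```

A query filtered on a range of `author_date` only needs the partitions whose day values fall in that range. Daily granularity suits a personal commit history; a much busier table might justify a finer or coarser transform.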


6. COMPARE: Iceberg vs Delta Lake

Same exercise on both. Compare:

  • Setup complexity — what it takes to get a table on each.
  • Performance for write-heavy workloads.
  • Performance for read-heavy workloads.
  • Ecosystem maturity (engines supporting natively in 2026).
  • Where each wins, and what the migration story looks like if you pick wrong.

400 words.


7. OPERATE

  • 3+ runbooks (minio-node-failure, iceberg-metadata-corruption, jupyterhub-user-onboarding)
  • 1+ postmortem
  • Weekly log

8. CONTRIBUTE

Iceberg, MinIO, Nessie, JupyterHub — all welcoming. A docs PR for an under-documented Iceberg evolution rule or a Helm chart values fix for JupyterHub both count.


Validation criteria

[ ] All 11 operational depth checks
[ ] MinIO + Nessie + Iceberg in basecamp Tier 3
[ ] JupyterHub in basecamp Tier 5; OIDC auth working
[ ] Iceberg vs Delta comparison written up
[ ] `abukix.commits` schema defined (data lands in P17)
[ ] 3+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
- oltp-vs-olap → DEEP
- schema-on-read-vs-write → DEEP
- append-only-log → DEEP
- snapshot-plus-delta → DEEP
[ ] Exit Test passed

Exit Test

Time: 3 hours.

  1. Build (90 min) — stand up MinIO + Nessie; create an Iceberg table; write 1M rows via Spark; query from DuckDB; schema-evolve; time-travel back. End-to-end. Notebook session via JupyterHub.
  2. Diagnose (60 min) — scenario: an Iceberg query returns stale data despite a recent write. Find why (snapshot isolation, catalog staleness, branching).
  3. Articulate (30 min) — 600 words: “Why is the lakehouse displacing the warehouse model? What does it give up?”

Anti-patterns

| Anti-pattern | Why |
|---|---|
| Putting OLTP data in the lakehouse | Wrong shape; use Postgres |
| One huge table for everything | Iceberg likes reasonable file sizes; partition strategy matters |
| Skipping compaction | Small files kill query perf |
| No snapshot retention policy | Snapshots forever = costs forever |
| JupyterHub without OIDC | Don’t reinvent auth; use Dex |

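Patterns deepened this phase

→ Pattern: oltp-vs-olap
→ Pattern: schema-on-read-vs-write
→ Pattern: append-only-log
→ Pattern: snapshot-plus-delta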

→ Next: Phase 16: Stream Processing