Lakehouse

Tables on object storage with transactional semantics. Warehouse-quality reads + lake-quality storage costs. The architectural pattern behind Iceberg, Delta Lake, Hudi.

Object storage gives you cheap durable bytes. Iceberg-style table formats give you transactions, schema evolution, and time-travel on top. Together: warehouse-quality semantics at lake-scale economics. Status: STUB, to be promoted to OUTLINE in Y4 Phase 31.

What this pattern is

The lakehouse pattern combines two previously separate worlds: the data lake (cheap durable bytes on object storage such as S3, MinIO, or GCS) and the data warehouse (transactional semantics, schema enforcement, ACID, time-travel). The combination is realized by table formats (Apache Iceberg, Delta Lake, Apache Hudi) layered on top of object storage. Tables become metadata plus data files; a transaction becomes an atomic swap of the catalog pointer to a new snapshot, which reuses the unchanged files of its parent and adds only the delta; schema evolution is cheap (metadata changes only); time-travel is a read through an older snapshot pointer.
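The pointer swap and snapshot read described above can be sketched with a toy in-memory model. This is a minimal illustration under stated assumptions, not Iceberg's implementation: ToyCatalog, Snapshot, commit_append, and files_as_of are hypothetical names, and a real catalog persists metadata files on object storage rather than holding them in a dict.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]  # every data file visible in this snapshot


class CommitConflict(Exception):
    """Raised when the table pointer moved after the writer read it."""


@dataclass
class ToyCatalog:
    # Hypothetical toy model: all snapshots plus one mutable "current" pointer.
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def current(self) -> Optional[Snapshot]:
        if self.current_id is None:
            return None
        return self.snapshots[self.current_id]

    def commit_append(self, expected_id: Optional[int], new_files: List[str]) -> Snapshot:
        # Compare-and-swap: commit only if nobody else moved the pointer.
        if expected_id != self.current_id:
            raise CommitConflict(f"expected {expected_id}, found {self.current_id}")
        base = self.snapshots[expected_id].data_files if expected_id is not None else ()
        # The new snapshot reuses the parent's files and adds only the delta.
        snap = Snapshot(len(self.snapshots) + 1, expected_id, base + tuple(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # the pointer swap
        return snap

    def files_as_of(self, snapshot_id: int) -> Tuple[str, ...]:
        # Time-travel: read through an older snapshot instead of the current one.
        return self.snapshots[snapshot_id].data_files


catalog = ToyCatalog()
s1 = catalog.commit_append(None, ["a.parquet"])
s2 = catalog.commit_append(s1.snapshot_id, ["b.parquet"])

# A writer still holding s1 as its base is rejected and would have to retry.
try:
    catalog.commit_append(s1.snapshot_id, ["c.parquet"])
    stale_writer_conflicted = False
except CommitConflict:
    stale_writer_conflicted = True
```

In this sketch the old snapshot stays readable after the second commit, which is what makes time-travel a pointer read, and the stale writer's rejection is the optimistic-concurrency conflict a real writer would resolve by refreshing and retrying.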

For /root, the lakehouse stack is MinIO (object storage) + Iceberg (table format) + Nessie or Polaris (catalog) + Trino (interactive queries) + Spark Operator (batch processing) — all K8s-native, all reconciled by Flux. Netflix originated Iceberg; Databricks originated Delta Lake; Uber originated Hudi. All three converge on the same pattern.

The lakehouse pattern is the current end state of a decade-long evolution. First-generation data lakes (Hadoop, S3 with Hive tables) had cheap storage but weak semantics: concurrent writes clobbered each other, schema changes broke queries, and time-travel required custom tooling. Cloud warehouses (Redshift, Snowflake, BigQuery) had strong semantics but expensive storage and vendor lock-in. Table formats bridged the gap by adding transactional metadata layers on top of open object storage. Reads reached warehouse quality, storage costs stayed at lake levels, and the vendor coupling loosened.

The pattern composes with streaming-vs-batch (streaming writes append snapshots; batch writes append snapshots; both target the same table), cdc (CDC events flow into Iceberg tables as they arrive), and schema-evolution (Iceberg’s schema evolution is cheap because it’s metadata-only). The full lakehouse stack is what makes analytics-scale data platforms operationally tractable in 2026.

Concrete instances in the wild

  • Apache Iceberg. Netflix-originated table format. Now the dominant open standard. Backed by Snowflake, Databricks (via Delta UniForm), Google, AWS.
  • Delta Lake. Databricks-originated table format. Delta UniForm now provides Iceberg-compatible metadata.
  • Apache Hudi. Uber-originated table format. Strong support for streaming ingestion and upserts.
  • basecamp lakehouse (Y4 Phase 31). MinIO + Iceberg + Nessie + Trino + Spark Operator.
  • Netflix’s data platform. The origin story for Iceberg. Petabyte-scale on S3.
  • Databricks Lakehouse Platform. Commercial lakehouse product. Delta Lake + Spark + MLflow + Unity Catalog.
  • Snowflake with Iceberg external tables. Snowflake has been able to read external Iceberg tables since 2023.
  • Google BigQuery with Iceberg external tables. Same pattern applied to BigQuery.
  • AWS Athena with Iceberg. Athena queries Iceberg tables directly on S3.
  • Airbnb’s Minerva. Metric-layer product built on top of a lakehouse.
  • Frontier-lab Iceberg deployments. Public conference talks from many frontier data organizations describe petabyte-scale Iceberg deployments that follow similar patterns.

Why this pattern matters

Before the lakehouse pattern, teams had to choose between the lake (cheap, weak semantics) and the warehouse (expensive, strong semantics), and often ended up with both: expensive dual pipelines shipping data from the lake to the warehouse via nightly ETL, keeping the warehouse consistent enough to trust for reporting. This two-tier model had significant operational cost and consistency latency. (The “medallion” or “bronze-silver-gold” layering is a different thing: it describes refinement stages within a single lakehouse, not the lake-plus-warehouse split.)

The lakehouse eliminates the split. One storage layer serves both analytics and ML. Tables have ACID transactions, so multiple writers coordinate correctly. Schema evolution happens without rewriting data. Time-travel enables reproducible experiments (train the model on the data as it existed at snapshot X). Reader compatibility across engines (Trino, Spark, Flink, Snowflake) means the same table serves multiple query patterns.
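Why schema evolution needs no data rewrite can be shown with a small sketch. It assumes, as Iceberg does, that columns are tracked by stable field IDs rather than by name; the data layout, the schema dicts, and read_rows below are hypothetical simplifications for illustration only.

```python
from typing import Any, Dict, List

# Data files store values keyed by a stable field ID, never by column name.
data_file: List[Dict[int, Any]] = [
    {1: "u-1", 2: 30},
    {1: "u-2", 2: 45},
]

# A schema is only metadata: a mapping from column name to field ID.
schema_v1 = {"user_id": 1, "age": 2}

# Evolution: rename "age" to "age_years" and add an optional "country" column.
# Only this mapping changes; data_file is not touched.
schema_v2 = {"user_id": 1, "age_years": 2, "country": 3}


def read_rows(rows: List[Dict[int, Any]], schema: Dict[str, int]) -> List[Dict[str, Any]]:
    # Project stored field IDs onto the schema's names.
    # Fields absent from older files read as None.
    return [{name: row.get(fid) for name, fid in schema.items()} for row in rows]


rows_v1 = read_rows(data_file, schema_v1)
rows_v2 = read_rows(data_file, schema_v2)
```

The same unchanged bytes answer queries under both schemas, which is the sense in which a rename or an added column is a metadata-only operation.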

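The economics matter. Object storage typically costs a fraction of warehouse-managed storage. Compute and storage are separated, so you scale each independently. Data doesn’t need to be duplicated across systems. Multi-region durability is a storage-layer feature (for example, S3 replication) rather than something the platform team builds, though replication is billed separately. The cost-per-terabyte is often cited as 10-100x lower than traditional warehouses at scale; the actual ratio depends on the vendor, pricing tier, and access pattern.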

The pattern also unlocks ML workflows. Feature stores can read directly from Iceberg tables. Training data can be point-in-time reproducible via time-travel. Model versions can reference specific table snapshots. The ML/analytics divide that plagued earlier architectures largely dissolves. Any dataset the analytics team produces is available to the ML team without duplication.

Modern platforms make lakehouses operationally tractable. Iceberg’s REST catalog spec (Polaris, Nessie, Unity, AWS Glue Iceberg REST) provides a standard interface. Kubernetes operators (Spark Operator, Flink Operator, Trino Operator) make compute management declarative. Object stores are commodities (S3, GCS, MinIO for on-prem). What used to be a heroic data-platform team is increasingly a small platform-engineering team plus a well-integrated stack.

The failure modes to know: catalog contention under high concurrency (Iceberg’s optimistic concurrency means concurrent commits to the same table can conflict and must retry); the small-file problem (streaming writes produce many small files, which requires compaction); metadata bloat (long-lived tables accumulate snapshots and metadata files, which requires expiration). Each has known mitigations, but operating a lakehouse means owning them.
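Two of these mitigations, compaction and expiration, can be sketched as planning functions. This is a toy illustration under assumed policies (first-fit bin-packing toward a target size, keep the newest N snapshots); plan_compaction and expire_snapshots are hypothetical names, and real table-maintenance jobs apply more policy than this.

```python
from typing import Dict, List, Tuple


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group files smaller than target_bytes into bins (first-fit, largest first)."""
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    bins: List[Tuple[int, List[str]]] = []
    for name, size in small:
        for i, (used, names) in enumerate(bins):
            if used + size <= target_bytes:
                bins[i] = (used + size, names + [name])
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append((size, [name]))
    # A bin holding a single file gains nothing from being rewritten.
    return [names for _, names in bins if len(names) > 1]


def expire_snapshots(snapshot_ids: List[int], keep_last: int) -> Tuple[List[int], List[int]]:
    """Return (kept, expired), keeping the keep_last highest snapshot IDs."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    ordered = sorted(snapshot_ids)
    return ordered[-keep_last:], ordered[:-keep_last]


# Sizes in arbitrary units; f4 already exceeds the target and is left alone.
groups = plan_compaction({"f1": 10, "f2": 20, "f3": 30, "f4": 200, "f5": 90}, target_bytes=100)
kept, expired = expire_snapshots([1, 2, 3, 4, 5], keep_last=2)
```

Each returned group would be rewritten as one larger file in a single commit, and expired snapshots release the metadata and data files that only they reference.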

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y4 Phase 31 stands the Iceberg + Trino + Nessie stack
         on basecamp.
DEEP     Promoted after Y4 end — lakehouse operational ~6+ months with real
         data, real schema evolution events, and at least one time-travel
         reproducibility use.

Preview: what OUTLINE will answer

When Y4 Phase 31 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you get warehouse-quality semantics on lake-quality storage economics?
  • PRINCIPLES. Metadata layer on top of object storage. Atomic catalog swaps for transactions. Snapshot-plus-delta for time-travel. Schema evolution is metadata-only. Reader compatibility across engines. Compaction and expiration are ongoing operations.
  • TRADE-OFFS. Iceberg (most-adopted open standard) vs Delta (Databricks-native, now Iceberg-compatible) vs Hudi (streaming-optimized). Central catalog (simpler operations) vs metastore-per-team (isolation). Compute options (Trino for interactive, Spark for batch, Flink for streaming).
  • TOOLS (time-stamped as of 2026-06): Apache Iceberg, Delta Lake (+ UniForm), Apache Hudi, MinIO (self-hosted), S3/GCS (cloud), Nessie (catalog), Polaris (catalog), AWS Glue Iceberg REST, Trino, Spark, Flink, Snowflake external tables.

The DEEP promotion, after Y4 with 6+ months of lakehouse operation, will add MASTERY (operating basecamp’s lakehouse with real workloads), COMPARE (Iceberg vs Delta vs Hudi in practice), OPERATE (a specific schema-evolution event or compaction operation), and CONTRIBUTE (an Iceberg or Nessie documentation improvement or catalog integration example).

Canonical references

  • Iceberg documentation and spec. Free at iceberg.apache.org.
  • Delta Lake documentation. Free at delta.io.
  • Hudi documentation. Free at hudi.apache.org.
  • Databricks’ original “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” paper (CIDR 2021). Free.
  • Kleppmann, DDIA, chapters on batch processing and dataflow — foundational context.

Cross-references