Data Serving: Trino + Superset

Fifth phase of Year 3. Data lands; people query it. Trino as the federated SQL engine; Superset as the analytics frontend. personal-api ships as the first user-facing service on the data layer. ~7 weeks, ~80 hrs.

Prerequisites

Phase 17 complete — Iceberg + Spark + Airflow + dbt operational

You accept: the lakehouse is dead weight if no one can query it. Trino + Superset (or any equivalent) is the human + machine interface.

Why this phase exists

You have data in Iceberg from P15-P17. Now: someone needs to ask questions of it. Year 4 ML training pipelines query Iceberg via Trino. Year 5 portal command palette queries Trino via the agent. personal-api (this phase) wraps Trino with a tiny REST API — the first basecamp resident that exists for a user (you), not just for the platform itself.

This is the phase where Tier 8 (Data Serving) of basecamp gets its first occupant. The serving layer is also where caching stops being abstract Y1 reading and starts being a real concern: the same questions get asked over and over, the underlying Iceberg snapshots only change when batch lands, and Redis is sitting there from Year 1 waiting to do its job. Caching at depth (TTL strategy, stampede prevention, layered caches) is what separates a personal-api that costs $0 to run from one that re-queries Trino on every request.

→ Pattern: caching (DEEP this phase from Y1’s first touch)

1. PROBLEM

You want to:

Run SQL across multiple sources (Iceberg + Postgres + Redis + maybe S3 directly)
Federate queries (one query joins Iceberg + Postgres without ETL)
Serve analysts/BI users + serve programmatic API consumers
Cache common queries; evict when underlying data changes

Trino does the federation + query execution. Superset does the BI front-end. A small REST API + Redis cache does the programmatic layer.

→ See: storage-and-data

2. PRINCIPLES

2.1 Federated SQL

One query, multiple data sources. Trino has connectors for Iceberg, Hive, Postgres, MySQL, Redis, S3, MongoDB, Kafka, BigQuery — most of them. The federation pattern is older than Trino (Presto, Drill, even mainframe federation), but Trino’s connector ecosystem in 2026 is the broadest open option.

Investigate:

Configure Trino with Iceberg + Postgres connectors.
Write a query that joins abukix.commits (Iceberg) with incidents (Postgres from triage).
What does Trino push down to each source vs handle in its own engine? (Run EXPLAIN and read carefully — pushdown is the difference between a 100ms query and a 100s query.)

2.2 Query optimization

Cost-based optimization, predicate pushdown, projection pushdown, dynamic filtering. The optimizer is only as good as its statistics — ANALYZE TABLE is not optional.

Investigate:

Use EXPLAIN ANALYZE to see what Trino does with a join across Iceberg + Postgres.
Configure dynamic filtering for star-join workloads.
When does Trino’s planner make wrong choices? How do you tell it the truth (statistics)?

2.3 Caching at the query layer

→ Pattern: caching

Trino’s caching options + Redis as application-level cache. Two layers, two failure modes, two TTL strategies — and the discipline is in keeping them straight.

Investigate:

Trino’s Hive caching (file-level) vs result caching (query-level).
Redis as the application-level cache for personal-api query results.
TTL strategy: when data updates daily (commits backfill), how long to cache?
Cache stampede prevention: probabilistic early expiration, request coalescing, the difference between them.

2.4 BI visualization (Superset)

Superset = open-source BI tool. Connects to Trino. Charts, dashboards, alerts, basic SQL Lab. The Y3 use case is “look at your own data the way a stakeholder would” — same shape every analytics engineer uses, smaller scale.

Investigate:

Install Superset on basecamp; configure Trino connection; build a dashboard for personal-api data.
Row-level security in Superset (preview only — depth is in P19 governance).
Alert when a metric crosses a threshold (e.g., “commits in last 7 days < 5”).

2.5 The serving API pattern

Wrap Trino + cache + auth into a tiny REST API for programmatic consumers. The same Go + chi + slog stack from triage — same shape, same patterns, fewer surprises.

Investigate:

Build personal-api (Go, chi, slog — same stack as triage from Y1).
Endpoints: /v1/commits, /v1/activity-summary, /v1/repos/:repo/stats.
Cache layer: Redis with TTL based on data freshness.
OIDC auth via Dex; rate limiting per user.

3. TRADE-OFFS

Decision	Option A	Option B	When
Query engine	Trino	Spark SQL	Snowflake (commercial)
BI frontend	Superset	Metabase	Looker (commercial)
Caching	Trino native	Redis app-layer	Both
API style	REST	GraphQL	gRPC

4. TOOLS (as of 2025-10)

Trino 460+
Apache Superset 4+
Redis (already from Y1)
Go + chi + slog for personal-api
OpenLineage Trino integration (sets up P19 governance)

5. MASTERY

5.1 Reading list

Required	Why
Trino docs — Concepts, Connectors, Tuning	The implementation
”Trino: The Definitive Guide” (Fuller, Moser, Traverso)	Depth
Superset docs — Basic + Advanced	The BI front

5.2 Operational depth checklist

[ ] Deploy Trino on basecamp (4-node coordinator + workers); Iceberg + Postgres + Redis connectors
[ ] Write 5 federated queries that join Iceberg + Postgres + Redis
[ ] Use EXPLAIN ANALYZE to tune one slow query; document the optimization
[ ] Deploy Superset on basecamp; OIDC auth via Dex
[ ] Build a `personal-api` Superset dashboard: commits per repo, activity heatmap, weekly summary
[ ] Build personal-api Go service:
    - Endpoints: /v1/commits, /v1/activity-summary, /v1/repos/:repo/stats
    - Trino client (presto-go-client or trino-go-client)
    - Redis cache with stampede prevention
    - OIDC auth + per-user rate limit
    - OTel + slog + Prometheus metrics
[ ] Containerize personal-api; Helm chart in basecamp; ArgoCD-managed
[ ] Set up Trino caching (Hive caching for Iceberg data files)
[ ] Build a Superset alert: notify when a query result crosses a threshold
[ ] Document personal-api in Backstage catalog; SLO defined; owner set

5.3 personal-api ships (private at first)

By phase end, personal-api is live at personal-api.basecamp.local (or wherever the Ingress lands). Year 5’s portal command palette will query it.

abukix/personal-api (PRIVATE for now — goes public Year 5 if interesting enough)
  Go + chi + slog
  Trino client + Redis cache
  OIDC auth + rate limit
  Helm chart in basecamp
  Backstage entry + SLO

This is the first personal service running on basecamp. The platform now serves you with content — your own activity history, queryable via your own API. Combined with P16’s stream and P17’s batch, the Homelab life API composition recipe is now end-to-end:

GitHub events → Redpanda → Flink → Iceberg
GitHub Events API → Airflow → Spark → Iceberg
                        ↓
                   Trino → personal-api → portal command palette (Y5)

The first runnable example in basecamp/examples/. Every other composition recipe in Y4-Y5 builds on this shape.

6. COMPARE: Trino vs DuckDB for personal-api

DuckDB is in-process; reads Iceberg natively; great for “single-user.” Trino is multi-tenant; federated; production-grade.

For personal-api specifically, when does DuckDB win? When does Trino’s overhead earn its keep? Build the API both ways for one endpoint and measure cold-start latency, warm-query latency, memory footprint, and operational complexity. The answer is probably DuckDB for single-tenant homelab, Trino for the platform overall — but make sure the choice is informed.

400 words.

7. OPERATE

3+ runbooks (trino-coordinator-down, superset-cache-stale, personal-api-cache-stampede)
1+ postmortem
Weekly log

8. CONTRIBUTE

Trino, Superset, dbt-trino, presto/trino Go clients. The Go clients especially are thin and welcoming — a real-world example or a connector improvement is achievable in a phase.

Validation criteria

[ ] All 10 operational depth checks
[ ] Trino + Superset in basecamp Tier 8
[ ] personal-api running on basecamp; queryable by you via OIDC
[ ] Trino vs DuckDB comparison written up
[ ] 3+ runbooks; 1+ postmortem; 7+ weekly log entries
[ ] Pattern entries deepened:
    - caching → DEEP (revisited from Y1; now with stampede prevention + Redis at scale)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — add a new endpoint to personal-api: /v1/diff?from=YYYY-MM-DD&to=YYYY-MM-DD returning commits between two dates. With cache + auth + tests + Backstage entry.
Diagnose (60 min) — scenario: personal-api latency spiked from 50ms to 5s; trace via OTel + Trino query log + Redis stats; find cause.
Articulate (30 min) — 600 words: “Walk a request through personal-api: from curl to response. Cover: auth → cache check → Trino → Iceberg → cache write → response.”

Anti-patterns

Anti-pattern	Why
Trino with no statistics	Planner makes wrong choices; ANALYZE your tables
Caching without TTL strategy	Stale data confuses users; thundering herds melt downstreams
Building personal-api without auth “for now”	Rate-limit abuse; OIDC was already wired in P12
Superset directly hitting Postgres OLTP	Wrong workload shape; route through Trino

Patterns deepened this phase

caching → DEEP

→ Next: Phase 19: Data Governance + Lineage (Y3 capstone)