Data Serving: Trino + Superset
Fifth phase of Year 3. Data lands; people query it. Trino as the federated SQL engine; Superset as the analytics frontend.
personal-apiships as the first user-facing service on the data layer. ~7 weeks, ~80 hrs.
Prerequisites
- Phase 17 complete — Iceberg + Spark + Airflow + dbt operational
- You accept: the lakehouse is dead weight if no one can query it. Trino + Superset (or any equivalent) is the human + machine interface.
Why this phase exists
You have data in Iceberg from P15-P17. Now: someone needs to ask questions of it. Year 4 ML training pipelines query Iceberg via Trino. Year 5 portal command palette queries Trino via the agent. personal-api (this phase) wraps Trino with a tiny REST API — the first basecamp resident that exists for a user (you), not just for the platform itself.
This is the phase where Tier 8 (Data Serving) of basecamp gets its first occupant. The serving layer is also where caching stops being abstract Y1 reading and starts being a real concern: the same questions get asked over and over, the underlying Iceberg snapshots only change when batch lands, and Redis is sitting there from Year 1 waiting to do its job. Caching at depth (TTL strategy, stampede prevention, layered caches) is what separates a personal-api that costs $0 to run from one that re-queries Trino on every request.
→ Pattern: caching (DEEP this phase from Y1’s first touch)
1. PROBLEM
You want to:
- Run SQL across multiple sources (Iceberg + Postgres + Redis + maybe S3 directly)
- Federate queries (one query joins Iceberg + Postgres without ETL)
- Serve analysts/BI users + serve programmatic API consumers
- Cache common queries; evict when underlying data changes
Trino does the federation + query execution. Superset does the BI front-end. A small REST API + Redis cache does the programmatic layer.
→ See: storage-and-data
2. PRINCIPLES
2.1 Federated SQL
One query, multiple data sources. Trino has connectors for Iceberg, Hive, Postgres, MySQL, Redis, S3, MongoDB, Kafka, BigQuery — most of them. The federation pattern is older than Trino (Presto, Drill, even mainframe federation), but Trino’s connector ecosystem in 2026 is the broadest open option.
Investigate:
- Configure Trino with Iceberg + Postgres connectors.
- Write a query that joins
abukix.commits(Iceberg) withincidents(Postgres from triage). - What does Trino push down to each source vs handle in its own engine? (Run
EXPLAINand read carefully — pushdown is the difference between a 100ms query and a 100s query.)
2.2 Query optimization
Cost-based optimization, predicate pushdown, projection pushdown, dynamic filtering. The optimizer is only as good as its statistics — ANALYZE TABLE is not optional.
Investigate:
- Use
EXPLAIN ANALYZEto see what Trino does with a join across Iceberg + Postgres. - Configure dynamic filtering for star-join workloads.
- When does Trino’s planner make wrong choices? How do you tell it the truth (statistics)?
2.3 Caching at the query layer
→ Pattern: caching
Trino’s caching options + Redis as application-level cache. Two layers, two failure modes, two TTL strategies — and the discipline is in keeping them straight.
Investigate:
- Trino’s Hive caching (file-level) vs result caching (query-level).
- Redis as the application-level cache for
personal-apiquery results. - TTL strategy: when data updates daily (commits backfill), how long to cache?
- Cache stampede prevention: probabilistic early expiration, request coalescing, the difference between them.
2.4 BI visualization (Superset)
Superset = open-source BI tool. Connects to Trino. Charts, dashboards, alerts, basic SQL Lab. The Y3 use case is “look at your own data the way a stakeholder would” — same shape every analytics engineer uses, smaller scale.
Investigate:
- Install Superset on basecamp; configure Trino connection; build a dashboard for
personal-apidata. - Row-level security in Superset (preview only — depth is in P19 governance).
- Alert when a metric crosses a threshold (e.g., “commits in last 7 days < 5”).
2.5 The serving API pattern
Wrap Trino + cache + auth into a tiny REST API for programmatic consumers. The same Go + chi + slog stack from triage — same shape, same patterns, fewer surprises.
Investigate:
- Build
personal-api(Go, chi, slog — same stack as triage from Y1). - Endpoints:
/v1/commits,/v1/activity-summary,/v1/repos/:repo/stats. - Cache layer: Redis with TTL based on data freshness.
- OIDC auth via Dex; rate limiting per user.
3. TRADE-OFFS
| Decision | Option A | Option B | When |
|---|---|---|---|
| Query engine | Trino | Spark SQL | Snowflake (commercial) |
| BI frontend | Superset | Metabase | Looker (commercial) |
| Caching | Trino native | Redis app-layer | Both |
| API style | REST | GraphQL | gRPC |
4. TOOLS (as of 2025-10)
- Trino 460+
- Apache Superset 4+
- Redis (already from Y1)
- Go + chi + slog for
personal-api - OpenLineage Trino integration (sets up P19 governance)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Trino docs — Concepts, Connectors, Tuning | The implementation |
| ”Trino: The Definitive Guide” (Fuller, Moser, Traverso) | Depth |
| Superset docs — Basic + Advanced | The BI front |
5.2 Operational depth checklist
[ ] Deploy Trino on basecamp (4-node coordinator + workers); Iceberg + Postgres + Redis connectors[ ] Write 5 federated queries that join Iceberg + Postgres + Redis[ ] Use EXPLAIN ANALYZE to tune one slow query; document the optimization[ ] Deploy Superset on basecamp; OIDC auth via Dex[ ] Build a `personal-api` Superset dashboard: commits per repo, activity heatmap, weekly summary[ ] Build personal-api Go service: - Endpoints: /v1/commits, /v1/activity-summary, /v1/repos/:repo/stats - Trino client (presto-go-client or trino-go-client) - Redis cache with stampede prevention - OIDC auth + per-user rate limit - OTel + slog + Prometheus metrics[ ] Containerize personal-api; Helm chart in basecamp; ArgoCD-managed[ ] Set up Trino caching (Hive caching for Iceberg data files)[ ] Build a Superset alert: notify when a query result crosses a threshold[ ] Document personal-api in Backstage catalog; SLO defined; owner set5.3 personal-api ships (private at first)
By phase end, personal-api is live at personal-api.basecamp.local (or wherever the Ingress lands). Year 5’s portal command palette will query it.
abukix/personal-api (PRIVATE for now — goes public Year 5 if interesting enough) Go + chi + slog Trino client + Redis cache OIDC auth + rate limit Helm chart in basecamp Backstage entry + SLOThis is the first personal service running on basecamp. The platform now serves you with content — your own activity history, queryable via your own API. Combined with P16’s stream and P17’s batch, the Homelab life API composition recipe is now end-to-end:
GitHub events → Redpanda → Flink → IcebergGitHub Events API → Airflow → Spark → Iceberg ↓ Trino → personal-api → portal command palette (Y5)The first runnable example in basecamp/examples/. Every other composition recipe in Y4-Y5 builds on this shape.
6. COMPARE: Trino vs DuckDB for personal-api
DuckDB is in-process; reads Iceberg natively; great for “single-user.” Trino is multi-tenant; federated; production-grade.
For personal-api specifically, when does DuckDB win? When does Trino’s overhead earn its keep? Build the API both ways for one endpoint and measure cold-start latency, warm-query latency, memory footprint, and operational complexity. The answer is probably DuckDB for single-tenant homelab, Trino for the platform overall — but make sure the choice is informed.
400 words.
7. OPERATE
- 3+ runbooks (
trino-coordinator-down,superset-cache-stale,personal-api-cache-stampede) - 1+ postmortem
- Weekly log
8. CONTRIBUTE
Trino, Superset, dbt-trino, presto/trino Go clients. The Go clients especially are thin and welcoming — a real-world example or a connector improvement is achievable in a phase.
Validation criteria
[ ] All 10 operational depth checks[ ] Trino + Superset in basecamp Tier 8[ ] personal-api running on basecamp; queryable by you via OIDC[ ] Trino vs DuckDB comparison written up[ ] 3+ runbooks; 1+ postmortem; 7+ weekly log entries[ ] Pattern entries deepened: - caching → DEEP (revisited from Y1; now with stampede prevention + Redis at scale)[ ] Exit Test passedExit Test
Time: 3 hours.
- Build (90 min) — add a new endpoint to personal-api:
/v1/diff?from=YYYY-MM-DD&to=YYYY-MM-DDreturning commits between two dates. With cache + auth + tests + Backstage entry. - Diagnose (60 min) — scenario: personal-api latency spiked from 50ms to 5s; trace via OTel + Trino query log + Redis stats; find cause.
- Articulate (30 min) — 600 words: “Walk a request through personal-api: from
curlto response. Cover: auth → cache check → Trino → Iceberg → cache write → response.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Trino with no statistics | Planner makes wrong choices; ANALYZE your tables |
| Caching without TTL strategy | Stale data confuses users; thundering herds melt downstreams |
| Building personal-api without auth “for now” | Rate-limit abuse; OIDC was already wired in P12 |
| Superset directly hitting Postgres OLTP | Wrong workload shape; route through Trino |
Patterns deepened this phase
- caching → DEEP