Final Exam

Full-day (~8 hour) scenario combining all 6 phases of Year 3. Pass = ready for Year 4 (Senior DevOps / Data Platform Engineer exit ramp credible; ML Platform trajectory in reach).

The Year 3 Final Exam is where the platform stops being something you run and gets audited as something you offer. Year 1 tested one machine. Year 2 tested multi-cloud. Year 3 tests the full data path — from CDC at the source through stream + batch through serving and BI, with observability and lineage as engineering disciplines, not afterthoughts. One pipeline built end-to-end. One multi-layer incident diagnosed using your own observability stack. One design review where the right answer might be “no” and is definitely “let me show you the trade-offs.”

What’s measured is pattern fluency at the data-platform layer — can you reason about oltp-vs-olap, lambda-vs-kappa, schema-on-read-vs-write, snapshot-plus-delta, and cardinality-as-cost under pressure, with real cluster state, and produce a postmortem that a Senior peer would respect? Two engineers can both wire CDC → Iceberg → Trino → Superset; only one can explain why this composition, with these trade-offs, and what fails when. The exam audits the second one.

This is also the year basecamp goes public. The exam treats that as a fact: the platform is in front of strangers, secrets are sanitized, and the lineage you cite in Part 3 is something an external reader could verify.

When to take

After Phase 19 validation criteria are all green and basecamp is public on GitHub. Schedule it ~2 weeks ahead so DataHub lineage is settled, the observability stack (Loki + Tempo + Hubble) has a few weeks of real data, and the personal-api on the platform has been running long enough to have produced at least one real incident.

Setup

basecamp public on GitHub (Y3 launch)
Tiers 1-8 all operational; Tier 5 has JupyterHub
personal-api live on basecamp
DataHub + OpenLineage live; lineage visible end-to-end
Loki + Tempo + Hubble all live; SLO dashboards in Grafana
8 hours of uninterrupted time
The root-exam skill (or solo + this doc as the script)

Format

3 parts, 7 hours of timed work within the ~8-hour day:
  Part 1: Build a pipeline                (180 min)
  Part 2: Diagnose a multi-layer incident (120 min)
  Part 3: Design review                   (120 min)

Part 1: Build a pipeline (180 min)

“Ingest CDC from Postgres → Redpanda → Flink (clean + enrich) → Iceberg → dbt model → Superset dashboard. End-to-end. Tested. Observability wired. Lineage visible in DataHub.”
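The "clean + enrich" step can be prototyped outside Flink before you write the real job. The sketch below assumes Debezium-style change events with `op`, `before`, `after`, and `ts_ms` fields; the `customer_id` column, the `REGIONS` lookup table, and the function name are hypothetical stand-ins for whatever your table and dimension data actually look like.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical lookup table: customer_id -> region. In the real job this
# would be a lookup join against a dimension table.
REGIONS = {1: "eu", 2: "us"}

# Debezium-style op codes: create, update, delete, snapshot read.
KNOWN_OPS = {"c", "u", "d", "r"}


def clean_and_enrich(events: Iterable[dict], lookup: dict) -> Iterator[dict]:
    """Turn change events into flat rows with a looked-up region column."""
    for ev in events:
        op = ev.get("op")
        # Deletes carry the row in "before"; every other op carries it in "after".
        row: Optional[dict] = ev.get("before") if op == "d" else ev.get("after")
        if op not in KNOWN_OPS or not row:
            continue  # clean: drop malformed or unknown events
        yield {
            **row,
            "op": op,
            "ts_ms": ev.get("ts_ms"),
            # enrich: left-join semantics, unknown keys get None
            "region": lookup.get(row.get("customer_id")),
        }
```

Keeping the delete events (instead of dropping them) matters downstream: the Iceberg table and the dbt rollup need them to avoid counting rows that no longer exist at the source.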

Required deliverables:

[ ] Debezium-style CDC from a new Postgres table (or simulated CDC producer)
[ ] Redpanda topic with schema in Schema Registry
[ ] Flink job (Kubernetes Operator): consume CDC, enrich with a lookup join, sink to Iceberg
[ ] Exactly-once via Flink + Iceberg sink transactions
[ ] dbt model materializing a 1-hour rollup; dbt tests on the model
[ ] Superset chart from the dbt model; alert when threshold crossed
[ ] OTel traces, slog logs, Prometheus metrics for the Flink job
[ ] OpenLineage events visible in DataHub for the full pipeline
[ ] Iceberg table has retention + compaction configured
[ ] Column masking on one PII column verified
[ ] All deployed via basecamp ArgoCD; no manual `kubectl apply`
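One way to reason about the exactly-once deliverable: a replay after a crash is harmless if the sink records which checkpoint it last published, and publishes the rows and that marker in one atomic step. The sketch below is a toy in-memory model of that idea, not the Flink or Iceberg API; `ToyTable` and its fields are hypothetical names.

```python
from typing import Dict, List


class ToyTable:
    """In-memory stand-in for a table with atomic, snapshot-style commits."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []
        # The marker lives with the table, not with the job, so it survives
        # a job restart.
        self.max_committed_checkpoint = -1

    def commit(self, checkpoint_id: int, staged: List[Dict]) -> bool:
        """Publish staged rows once per checkpoint; replays become no-ops."""
        if checkpoint_id <= self.max_committed_checkpoint:
            return False  # already published before the crash; skip
        # Rows and the checkpoint marker change together in one step.
        self.rows.extend(staged)
        self.max_committed_checkpoint = checkpoint_id
        return True
```

When you verify this deliverable on the real pipeline, the equivalent check is to kill the job mid-stream, let it recover, and confirm the row count in Iceberg matches the source with no duplicates.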

Pass criteria:

End-to-end pipeline working with a sample of 100k events
Lineage visible from CDC source to Superset chart
All observability pillars correlated for one event (you can pick one event and find its trace, its log line, and its metric panel)
0 manual `kubectl apply -f`
The dbt tests catch a deliberately-broken upstream row
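The dbt model and its tests are written in SQL and YAML; the Python below only sketches the logic they encode, so you can sanity-check expected numbers by hand. The `ts_ms` and `amount` fields and both function names are assumptions for illustration, and the check mirrors a not-null plus non-negative test rather than any specific dbt test you have configured.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def hour_bucket(ts_ms: int) -> str:
    """Truncate an epoch-millisecond timestamp to its UTC hour."""
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:00Z")


def hourly_rollup(rows: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate event count and total amount per UTC hour."""
    out = defaultdict(lambda: {"events": 0, "amount": 0.0})
    for r in rows:
        bucket = out[hour_bucket(r["ts_ms"])]
        bucket["events"] += 1
        bucket["amount"] += r["amount"]
    return dict(out)


def failing_rows(rows: Iterable[dict]) -> List[dict]:
    """Return rows that violate the check; an empty list means the test passes."""
    return [r for r in rows if r.get("amount") is None or r["amount"] < 0]
```

The deliberately-broken upstream row in the pass criterion corresponds to a row that `failing_rows` returns: the test should fail loudly before the rollup is ever computed from it.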

What passing looks like: the pipeline diff in basecamp is small and reviewable. The DataHub lineage graph is dense and accurate. A stranger could follow the README, run the demo, and reproduce the dashboard within 30 minutes.

Part 2: Diagnose a multi-layer incident (120 min)

“This Superset dashboard shows wrong numbers. Investigate.”

The root-exam skill (or solo prep) breaks one thing in the pipeline. Could be:

Cache layer: Trino caching a stale value (cache invalidation broken)
Trino layer: dynamic filtering off; wrong join order; statistics stale
Iceberg layer: stale snapshot read; branch unmerged; compaction race
dbt layer: model logic regressed; test missing
Spark/Airflow layer: backfill ran with wrong date partition
CDC layer: Debezium dropped events during a network partition; Redpanda lost messages
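Because the fault can sit in any of these layers, a cheap first move is to compare the same measurement at every hop and find where it first changes. The sketch below assumes you have already collected a row count for one fixed time window at each layer, and that the hops in question are supposed to preserve rows one-to-one (no filters or aggregations in between); the layer names are hypothetical.

```python
from typing import List, Optional, Tuple


def first_divergence(counts: List[Tuple[str, int]]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent (upstream, downstream) pair whose counts differ.

    `counts` is ordered from source to sink, one (layer_name, row_count) per hop.
    """
    for (up, n_up), (down, n_down) in zip(counts, counts[1:]):
        if n_up != n_down:
            return (up, down)
    return None  # counts agree everywhere; the fault is not a row-loss problem
```

A `None` result is still information: it rules out dropped or duplicated rows and points toward logic, caching, or stale-read causes instead.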

You must:

Use the observability stack to triage (Loki, Tempo, Grafana, OTel)
Use lineage in DataHub to trace upstream
Find the root cause (not symptom)
Fix via GitOps (basecamp PR; ArgoCD reconciles)
Write a postmortem with action items
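Triage goes faster when a trace found in Tempo leads directly to its log lines. The sketch below assumes your logs are JSON lines carrying a `trace_id` field (that field name is an assumption; use whatever your logger actually emits) and filters an exported log stream down to one trace.

```python
import json
from typing import Iterable, List


def logs_for_trace(lines: Iterable[str], trace_id: str) -> List[dict]:
    """Return parsed JSON log records that carry the given trace id."""
    hits = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines instead of aborting the triage
        if isinstance(rec, dict) and rec.get("trace_id") == trace_id:
            hits.append(rec)
    return hits
```

In practice you would run the equivalent filter as a query in Loki; the point of the sketch is that the join key between pillars has to exist in the log line in the first place.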

Pass criteria:

Root cause identified within 90 min
Fix applied via GitOps
Postmortem written; action items prevent recurrence
At least 1 of the action items is a new dbt or Great Expectations (GE) test or a new alert
Postmortem cites the pattern the failure rode in on (e.g., stream-processing, snapshot-plus-delta, caching)

What passing looks like: the diagnosis path is short and pattern-rooted. You start at the dashboard, walk lineage one hop, check the observability pillar most likely to confirm or deny that hop, and proceed. No flailing. The postmortem reads blameless and concrete.
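The "walk lineage one hop at a time" order can be made explicit as a breadth-first walk over the lineage graph. This is a minimal sketch over a hand-written dictionary, not the DataHub API; the dataset names in `LINEAGE` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical lineage: dataset -> list of direct upstream datasets.
LINEAGE: Dict[str, List[str]] = {
    "superset.chart": ["dbt.hourly_rollup"],
    "dbt.hourly_rollup": ["iceberg.events"],
    "iceberg.events": ["redpanda.events_topic", "postgres.customers"],
    "redpanda.events_topic": ["postgres.events"],
}


def upstream_by_hop(start: str, lineage: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """List upstream datasets as (hop_distance, name), nearest first."""
    seen = {start}
    queue = deque([(0, start)])
    order: List[Tuple[int, str]] = []
    while queue:
        hops, node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:  # guard against cycles and shared parents
                seen.add(parent)
                order.append((hops + 1, parent))
                queue.append((hops + 1, parent))
    return order
```

Checking datasets in this order keeps the diagnosis path short: each hop either clears a layer or names it, and you never inspect the CDC source before ruling out the dbt model sitting right behind the chart.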

Part 3: Design review (120 min)

“You’re a Staff Engineer. A junior on the team proposes adding a closed analytics warehouse to the platform for analytics queries because Trino feels slow on big joins. Read their design (provided); write a thorough review.”

The provided design has 5 deliberate issues:

Conflates “slow” with “Trino’s fault” (could be statistics, could be Iceberg compaction)
No cost projection; closed-warehouse costs scale with usage in ways MinIO+Trino don’t
No DR plan for moving from open lakehouse to closed warehouse
No ML training story (they want analytics; ignores Y4 needs)
Doesn’t address losing federation if the platform goes warehouse-only
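The missing cost projection does not need to be elaborate to be useful in a review. The sketch below compares a usage-priced warehouse against a roughly fixed monthly cost for the self-hosted stack. Every number you feed it is a placeholder you must replace with real quotes and real scan volumes, and the per-TB pricing model itself is an assumption; some vendors bill on compute time instead.

```python
from typing import Optional


def break_even_tb(fixed_monthly_usd: float, usd_per_tb: float) -> float:
    """TB scanned per month at which usage pricing equals the fixed cost."""
    if usd_per_tb <= 0:
        raise ValueError("usd_per_tb must be positive")
    return fixed_monthly_usd / usd_per_tb


def months_until_more_expensive(
    tb_now: float,
    monthly_growth: float,
    fixed_monthly_usd: float,
    usd_per_tb: float,
    horizon: int = 60,
) -> Optional[int]:
    """First month (0 = now) where usage pricing exceeds the fixed cost."""
    tb = tb_now
    for month in range(horizon + 1):
        if tb * usd_per_tb > fixed_monthly_usd:
            return month
        tb *= 1 + monthly_growth  # compound growth in scanned volume
    return None  # never crosses within the horizon
```

A review that includes even this much arithmetic turns "warehouses are expensive" into a dated claim the junior can check against their own growth assumptions.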

Your review must:

Identify at least 4 of the 5 issues
Propose alternatives (tune Trino, add caching, partition strategy, switch only specific workloads)
Be constructive, not gatekeeping
Cite Y3 patterns (oltp-vs-olap, lambda-vs-kappa, materialized-views, caching)
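If you propose caching as an alternative, the review should say how the cache avoids serving stale results. One option is to make the table's current snapshot id part of the cache key, so a new commit causes a miss instead of requiring explicit invalidation. The sketch below is a hypothetical application-side cache, not a Trino feature; `SnapshotKeyedCache` and the `run_query` callable are illustrative names.

```python
from typing import Callable, Dict, Tuple


class SnapshotKeyedCache:
    """Cache query results under (query, snapshot_id) so new snapshots miss."""

    def __init__(self, run_query: Callable[[str, int], object]) -> None:
        self._run_query = run_query
        self._store: Dict[Tuple[str, int], object] = {}

    def get(self, query: str, current_snapshot_id: int) -> object:
        key = (query, current_snapshot_id)
        if key not in self._store:
            # Miss: either a new query or a new snapshot of the table.
            self._store[key] = self._run_query(query, current_snapshot_id)
        return self._store[key]
```

The trade-off to name in the review: this sketch never evicts old entries, so a real version needs a size or age bound, and it only helps workloads that repeat the same query between commits.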

Pass criteria:

At least 4/5 issues identified
2+ alternatives proposed with trade-offs
Cites at least 4 patterns
Tone: senior + constructive

What passing looks like: the review opens with what the design gets right, names the missing trade-offs, and closes with a smaller alternative that gets the team most of the speedup at a fraction of the lock-in cost. It reads like a review a junior would want to receive — specific, kind, and pattern-rooted.

Overall pass criteria

[ ] Part 1: end-to-end pipeline working in <3 hr wall time
[ ] Part 2: root-caused via observability + lineage; fix via GitOps; postmortem written
[ ] Part 3: 4/5 issues caught; alternatives proposed; senior tone
[ ] Self-grade vs Claude grade: agree within ~10%

If Part 2 took 90+ minutes to root-cause, that’s a signal — not a fail by itself, but a sign the observability stack isn’t muscle memory yet. Spend 1-2 weeks running deliberate breakage drills before retaking; the speed comes from operating, not from re-reading the phase docs.

After passing

You can:
- Architect + operate a complete data platform from CDC to BI
- Reason about lineage + governance as engineering discipline
- Tune Trino/Spark/Flink at intermediate depth
- Catalog + monitor data quality at scale
- Operate basecamp publicly with sanitized secrets

You have:
- basecamp PUBLIC with documentation, blog post, LinkedIn launch
- personal-api running on the platform (your own data, queryable via REST)
- terralabs grown with data-infra modules
- ops-handbook with ~80 runbooks, ~15 postmortems, ~150 weekly logs, 8+ ADRs
- 3+ merged upstream PRs (across years)
- ~35 patterns DEEP

Exit ramp: Senior DevOps / Data Platform Engineer / Site Reliability Engineer

Update program/overview.md Status block: “Year 3 complete: YYYY-MM-DD. basecamp public.”

→ Next: Year 4 — ML & AI Infrastructure

Anti-patterns

Anti-pattern	Why
Cramming Phase 14-19 reading the week before	Y3’s reading is dense; it doesn’t compress
Skipping the design review part	Senior+ work IS architectural review; that’s the level you’re testing for
Not sanitizing basecamp before public launch	gitleaks scan is non-negotiable
Symptom-patching the dashboard “wrong numbers” issue	Root cause is the contract; lineage + observability are the tools
Treating observability as dashboards rather than as a triage method	Part 2’s clock rewards muscle memory, not dashboard tourism