5-YEAR PROGRAM · YEAR 3
UPCOMING

Final Exam

Full-day (~8 hour) scenario combining all 6 phases of Year 3. Pass = ready for Year 4 (Senior DevOps / Data Platform Engineer exit ramp credible; ML Platform trajectory in reach).

The Year 3 Final Exam is where the platform stops being something you run and gets audited as something you offer. Year 1 tested one machine. Year 2 tested multi-cloud. Year 3 tests the full data path — from CDC at the source through stream + batch through serving and BI, with observability and lineage as engineering disciplines, not afterthoughts. One pipeline built end-to-end. One multi-layer incident diagnosed using your own observability stack. One design review where the right answer might be “no” and is definitely “let me show you the trade-offs.”

What’s measured is pattern fluency at the data-platform layer — can you reason about oltp-vs-olap, lambda-vs-kappa, schema-on-read-vs-write, snapshot-plus-delta, and cardinality-as-cost under pressure, with real cluster state, and produce a postmortem that a Senior peer would respect? Two engineers can both wire CDC → Iceberg → Trino → Superset; only one can explain why this composition, with these trade-offs, and what fails when. The exam audits the second one.

This is also the year basecamp goes public. The exam treats that as a fact: the platform is in front of strangers, secrets are sanitized, and the lineage you cite in Part 3 is something an external reader could verify.


When to take

After Phase 19 validation criteria are all green and basecamp is public on GitHub. Schedule it ~2 weeks ahead so DataHub lineage is settled, the observability stack (Loki + Tempo + Hubble) has a few weeks of real data, and the personal-api on the platform has been running long enough to have produced at least one real incident.


Setup

  • basecamp public on GitHub (Y3 launch)
  • Tier 1-8 all operational; Tier 5 has JupyterHub
  • personal-api live on basecamp
  • DataHub + OpenLineage live; lineage visible end-to-end
  • Loki + Tempo + Hubble all live; SLO dashboards in Grafana
  • 8 hours of uninterrupted time
  • The root-exam skill (or solo + this doc as the script)

Format

3 parts, ~8 hours total (420 min of timed work):
Part 1: Build a pipeline (180 min)
Part 2: Diagnose a multi-layer incident (120 min)
Part 3: Design review (120 min)

Part 1: Build a pipeline (180 min)

“Ingest CDC from Postgres → Redpanda → Flink (clean + enrich) → Iceberg → dbt model → Superset dashboard. End-to-end. Tested. Observability wired. Lineage visible in DataHub.”

Required deliverables:

[ ] Debezium-style CDC from a new Postgres table (or simulated CDC producer)
[ ] Redpanda topic with schema in Schema Registry
[ ] Flink job (Kubernetes Operator): consume CDC, enrich with a lookup join, sink to Iceberg
[ ] Exactly-once via Flink + Iceberg sink transactions
[ ] dbt model materializing a 1-hour rollup; dbt tests on the model
[ ] Superset chart from the dbt model; alert when threshold crossed
[ ] OTel traces, slog logs, Prometheus metrics for Flink job
[ ] OpenLineage events visible in DataHub for the full pipeline
[ ] Iceberg table has retention + compaction configured
[ ] Column masking on one PII column verified
[ ] All deployed via basecamp ArgoCD; no manual `kubectl apply`
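
To illustrate the 1-hour rollup deliverable, here is a minimal Python sketch of the aggregation that the dbt model would express in SQL. The field names `ts` and `amount` and the function `hourly_rollup` are hypothetical and not part of the exam; the real model runs in the query engine, not in Python.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_rollup(events):
    """Group events into 1-hour UTC buckets; return count and sum per bucket.

    Assumes each event has a timezone-aware ISO 8601 `ts` and a numeric
    `amount` (both hypothetical field names).
    """
    buckets = defaultdict(lambda: {"event_count": 0, "amount_sum": 0.0})
    for event in events:
        ts = datetime.fromisoformat(event["ts"]).astimezone(timezone.utc)
        # Truncate to the start of the hour to get the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        bucket = buckets[hour.isoformat()]
        bucket["event_count"] += 1
        bucket["amount_sum"] += event["amount"]
    # ISO 8601 strings in a single timezone sort chronologically.
    return dict(sorted(buckets.items()))


events = [
    {"ts": "2024-01-01T10:15:00+00:00", "amount": 2.0},
    {"ts": "2024-01-01T12:30:00+02:00", "amount": 3.0},  # 10:30 UTC
    {"ts": "2024-01-01T11:05:00+00:00", "amount": 1.0},
]
rollup = hourly_rollup(events)
```

Normalizing to UTC before truncating keeps events from different source timezones in the same bucket, which is the kind of detail a dbt test on the model can pin down.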

Pass criteria:

  • End-to-end pipeline working with a sample of 100k events
  • Lineage visible from CDC source to Superset chart
  • All observability pillars correlated for one event (you can pick one event and find its trace, its log line, and its metric panel)
  • Zero manual `kubectl apply -f` invocations
  • The dbt tests catch a deliberately-broken upstream row
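
The last criterion can be pictured with a small Python analogue of dbt's generic `not_null` and `accepted_values` tests. This is a sketch of the idea only, not how dbt runs tests; the column names `customer_id` and `op` and the helper `run_tests` are hypothetical.

```python
def not_null(rows, column):
    """Return the rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def accepted_values(rows, column, allowed):
    """Return the rows whose non-null `column` value is outside `allowed`."""
    return [r for r in rows if r.get(column) is not None and r[column] not in allowed]


def run_tests(rows):
    """Run every check; return only the failing ones with their offending rows."""
    results = {
        "not_null(customer_id)": not_null(rows, "customer_id"),
        "accepted_values(op)": accepted_values(rows, "op", {"c", "u", "d", "r"}),
    }
    return {name: bad for name, bad in results.items() if bad}


good_rows = [
    {"customer_id": 1, "op": "c"},
    {"customer_id": 2, "op": "u"},
]
# A deliberately broken upstream row: the key is null.
broken_row = {"customer_id": None, "op": "c"}

clean_result = run_tests(good_rows)
broken_result = run_tests(good_rows + [broken_row])
```

A test run that returns an empty result on clean data and names the failing check on the broken row is the behavior the pass criterion asks for.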

What passing looks like: the pipeline diff in basecamp is small and reviewable. The DataHub lineage graph is dense and accurate. A stranger could follow the README, run the demo, and reproduce the dashboard within 30 minutes.


Part 2: Diagnose a multi-layer incident (120 min)

“This Superset dashboard shows wrong numbers. Investigate.”

The root-exam skill (or solo prep) breaks one thing in the pipeline. The fault could be in any of these layers:

  • Cache layer: Trino caching a stale value (cache invalidation broken)
  • Trino layer: dynamic filtering off; wrong join order; statistics stale
  • Iceberg layer: stale snapshot read; branch unmerged; compaction race
  • dbt layer: model logic regressed; test missing
  • Spark/Airflow layer: backfill ran with wrong date partition
  • CDC layer: Debezium dropped events under partition; Redpanda lost messages
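
One way to narrow down which of the layers above is at fault is to compare a conserved measure (for example, the number of source events in a fixed hour) at each hop and find the first hop where it stops matching the source. A minimal sketch, assuming such a count can be obtained per layer; the layer names, the counts, and the helper `first_divergence` are hypothetical.

```python
def first_divergence(layer_counts, tolerance=0):
    """Return the first layer whose count differs from the source count.

    `layer_counts` is a list of (layer_name, count) pairs in pipeline
    order, source first. Returns None if every layer matches the source
    within `tolerance`.
    """
    if not layer_counts:
        return None
    _, reference = layer_counts[0]
    for layer, count in layer_counts[1:]:
        if abs(count - reference) > tolerance:
            return layer
    return None


# Hypothetical counts for one hour of events, source first.
counts = [
    ("postgres_source", 100_000),
    ("redpanda_topic", 100_000),
    ("iceberg_table", 99_200),
    ("dbt_rollup_total", 99_200),
]
suspect = first_divergence(counts)
```

The returned layer marks where the discrepancy first appears, so the fault sits in that layer or in the hop feeding it. This only works for measures the pipeline is supposed to preserve; a rollup changes row counts but not the summed event count.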

You must:

  1. Use the observability stack to triage (Loki, Tempo, Grafana, OTel)
  2. Use lineage in DataHub to trace upstream
  3. Find the root cause (not the symptom)
  4. Fix via GitOps (basecamp PR; ArgoCD reconciles)
  5. Write a postmortem with action items
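
Step 1 depends on being able to pivot between pillars for a single event. A minimal sketch of that pivot, assuming logs and spans have already been fetched as dictionaries that share a `trace_id` field; the record shapes and the `correlate` helper are hypothetical and do not reflect the Loki or Tempo APIs.

```python
def correlate(trace_id, log_lines, spans):
    """Collect the log lines and spans for one trace; flag the slowest span."""
    logs = [line for line in log_lines if line.get("trace_id") == trace_id]
    matched = [span for span in spans if span.get("trace_id") == trace_id]
    slowest = max(matched, key=lambda span: span["duration_ms"], default=None)
    return {"logs": logs, "spans": matched, "slowest_span": slowest}


# Hypothetical records, as if already pulled from the log and trace stores.
log_lines = [
    {"trace_id": "abc", "msg": "lookup join miss"},
    {"trace_id": "xyz", "msg": "checkpoint complete"},
]
spans = [
    {"trace_id": "abc", "name": "enrich", "duration_ms": 840},
    {"trace_id": "abc", "name": "iceberg_commit", "duration_ms": 120},
    {"trace_id": "xyz", "name": "enrich", "duration_ms": 15},
]
view = correlate("abc", log_lines, spans)
```

The point is the join key: if the trace ID is not propagated into log lines, this pivot is impossible no matter how good the dashboards are.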

Pass criteria:

  • Root cause identified within 90 min
  • Fix applied via GitOps
  • Postmortem written; action items prevent recurrence
  • At least 1 of the action items is a new dbt or Great Expectations (GE) test, or a new alert
  • Postmortem cites the pattern the failure rode in on (e.g., stream-processing, snapshot-plus-delta, caching)

What passing looks like: the diagnosis path is short and pattern-rooted. You start at the dashboard, walk lineage one hop, check the observability pillar most likely to confirm or deny that hop, and proceed. No flailing. The postmortem reads blameless and concrete.
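
The "walk lineage one hop" order described above amounts to a breadth-first traversal upstream from the dashboard. A minimal sketch, assuming the lineage is available as a map from each node to its direct upstream nodes; the map below is hand-written and hypothetical, whereas in practice it would come from DataHub.

```python
from collections import deque


def upstream_order(lineage, start):
    """Return upstream nodes in breadth-first order, nearest hop first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical lineage: each key maps to its direct upstream dependencies.
lineage = {
    "superset_chart": ["dbt_hourly_rollup"],
    "dbt_hourly_rollup": ["iceberg_events"],
    "iceberg_events": ["flink_enrich_job"],
    "flink_enrich_job": ["redpanda_cdc_topic", "lookup_table"],
    "redpanda_cdc_topic": ["postgres_source"],
}
check_order = upstream_order(lineage, "superset_chart")
```

Checking nodes in this order, and stopping at the first one whose data is already wrong, keeps the diagnosis path as short as the lineage allows.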


Part 3: Design review (120 min)

“You’re a Staff Engineer. A junior on the team proposes adding a closed analytics warehouse to the platform for analytics queries because Trino feels slow on big joins. Read their design (provided); write a thorough review.”

The provided design has 5 deliberate issues:

  • Conflates “slow” with “Trino’s fault” (could be statistics, could be Iceberg compaction)
  • No cost projection; closed-warehouse costs scale with usage in ways MinIO+Trino don’t
  • No DR plan for moving from open lakehouse to closed warehouse
  • No ML training story (they want analytics; ignores Y4 needs)
  • Doesn’t articulate why federation goes away if warehouse-only

Your review must:

  • Identify at least 4 of the 5 issues
  • Propose alternatives (tune Trino, add caching, partition strategy, switch only specific workloads)
  • Be constructive, not gatekeeping
  • Cite Y3 patterns (oltp-vs-olap, lambda-vs-kappa, materialized-views, caching)

Pass criteria:

  • At least 4 of the 5 issues identified
  • 2+ alternatives proposed with trade-offs
  • Cites at least 4 patterns
  • Tone: senior + constructive

What passing looks like: the review opens with what the design gets right, names the missing trade-offs, and closes with a smaller alternative that gets the team most of the speedup at a fraction of the lock-in cost. It reads like a review a junior would want to receive — specific, kind, and pattern-rooted.


Overall pass criteria

[ ] Part 1: end-to-end pipeline working in <3 hr wall time
[ ] Part 2: root-caused via observability + lineage; fix via GitOps; postmortem written
[ ] Part 3: 4/5 issues caught; alternatives proposed; senior tone
[ ] Self-grade vs Claude grade: agree within ~10%

If Part 2 took 90+ minutes to root-cause, that’s a signal — not a fail by itself, but a sign the observability stack isn’t muscle memory yet. Spend 1-2 weeks running deliberate breakage drills before retaking; the speed comes from operating, not from re-reading the phase docs.


After passing

You can:
- Architect + operate a complete data platform from CDC to BI
- Reason about lineage + governance as engineering discipline
- Tune Trino/Spark/Flink at intermediate depth
- Catalog + monitor data quality at scale
- Operate basecamp publicly with sanitized secrets

You have:
- basecamp PUBLIC with documentation, blog post, LinkedIn launch
- personal-api running on the platform (your own data, queryable via REST)
- terralabs grown with data-infra modules
- ops-handbook with ~80 runbooks, ~15 postmortems, ~150 weekly logs, 8+ ADRs
- 3+ merged upstream PRs (across years)
- ~35 patterns DEEP

Exit ramp: Senior DevOps / Data Platform Engineer / Site Reliability Engineer

Update program/overview.md Status block: “Year 3 complete: YYYY-MM-DD. basecamp public.”

→ Next: Year 4 — ML & AI Infrastructure


Anti-patterns

| Anti-pattern | Why |
| --- | --- |
| Cramming Phase 14-19 reading the week before | Y3’s reading is dense; it doesn’t compress |
| Skipping the design review part | Senior+ work IS architectural review; that’s the level you’re testing for |
| Not sanitizing basecamp before public launch | gitleaks scan is non-negotiable |
| Symptom-patching the dashboard “wrong numbers” issue | Root cause is the contract; lineage + observability are the tools |
| Treating observability as dashboards rather than as a triage method | Part 2’s clock rewards muscle memory, not dashboard tourism |