Final Exam
8-hour scenario combining all 6 phases of Year 2. Pass = ready for Year 3 (DevOps / Cloud Engineer exit ramp credible; Senior trajectory in reach).
The Year 2 Final Exam is where single-machine intuition gets audited. Year 1 tested “can you reason about one machine and one cluster?” Year 2 tests “can you reason across machines, across clouds, across teams?” Three platforms (k3s + EKS + GKE) operating from one git repo. Three independent incidents in 120 minutes. One architecture review where the right answer is sometimes “no, and here’s the alternative.”
What’s measured isn’t AWS trivia or Terraform syntax. It’s pattern fluency at the platform layer — can you reason about CAP/PACELC in a real review, name the secrets-lifecycle trade-offs, defend an SLO contract against a noisy alert, and recognize when zero-trust networking is being undermined by a “harmless” exception. Two engineers can both ship multi-cloud Backstage; only one can explain why this composition, with these trade-offs, and how it fails. The exam audits the second one.
This is also the inflection in tone. Y1’s exam is a Junior SRE bar; Y2’s exam graduates you toward Senior — the architecture review part is non-negotiable for that reason.
When to take
After Phase 13 validation criteria are all green and the multi-cloud capstone has been operating for at least a week. Schedule it ~2 weeks ahead so terralabs is stable, Backstage has a populated catalog, and platform-ctl has its first 5+ commands working.
Setup
- basecamp managing K3s + EKS + GKE simultaneously
- terralabs available; both clouds with budget headroom
- Backstage running with catalog populated
- Service mesh + Sealed Secrets + ESO + Cosign all operational
- platform-ctl private, 5+ commands working
- 8 hours of uninterrupted time
- The root-exam skill (or solo + this doc as the script)
Format
3 parts, 8 hours total:
- Part 1: Build (180 min)
- Part 2: Triple incident (120 min)
- Part 3: Architecture review (120 min)

Part 1: Build (180 min)
“Onboard a new team to basecamp. Provision their AWS account + GCP project via terralabs; create EKS + GKE clusters; deploy a starter app via Backstage; sign image; default-deny NetworkPolicy; service mesh mTLS verified; ArgoCD reconciling. End-to-end, all observable in Backstage + Prometheus.”
Required deliverables:
- [ ] AWS + GCP accounts provisioned via terralabs (VPC + cluster + DB shape)
- [ ] basecamp ArgoCD added both clusters as targets
- [ ] Backstage Software Template used to scaffold a new service repo
- [ ] platform-ctl service create + service deploy used (no manual YAML)
- [ ] Image signed with Cosign; admission policy verifies
- [ ] SealedSecret for DB password; ESO syncs from AWS Secrets Manager
- [ ] NetworkPolicy default-deny + explicit allow
- [ ] Service mesh mTLS verified between two services
- [ ] Pod Security restricted in tenant namespace
- [ ] PrometheusRule + SLO + burn-rate alert
- [ ] Backstage shows the service with health, owners, SLO state

Pass criteria:
- Onboarding completes in <90 min wall time (the rest of the budget is for verification + screenshots)
- All 11 deliverables verifiable
- No manual kubectl apply -f at any point
- The new team can self-service from Backstage without you sitting next to them
What passing looks like: the Backstage catalog entry is auto-populated, the SLO panel is real (not a placeholder), and the pull request that wired the new tenant into basecamp is small, reviewable, and obviously safe to merge. The platform feels like a product, not a hand-rolled stack.
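The "NetworkPolicy default-deny + explicit allow" deliverable can be sketched as two manifests. This is a minimal illustration rather than the program's actual template: the namespace `team-a`, the `app` labels, and port 8080 are hypothetical, and in the exam flow the output would be committed to the GitOps repo for ArgoCD to reconcile rather than applied by hand.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and lists no allow rules."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches every pod in the namespace.
            "podSelector": {},
            # Both directions are listed with no rules, so both are denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress_policy(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Explicitly allow ingress to `app` pods from `from_app` pods on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant namespace and app labels, for illustration only.
    manifests = [
        default_deny_policy("team-a"),
        allow_ingress_policy("team-a", app="api", from_app="web", port=8080),
    ]
    print(json.dumps(manifests, indent=2))
```

Because the default-deny policy also covers egress, the calling pods would additionally need a matching egress allow, plus an allow for DNS, before traffic actually flows.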
Part 2: Triple incident (120 min)
Three scenarios simulated, ~40 min each. The root-exam skill picks them deterministically; one each from Phase 9 / Phase 10 / Phase 12.
Scenario shape (each)
- Symptom presented
- Cluster + clouds in real-but-broken state
- You diagnose, fix, write a runbook + postmortem skeleton
Sample scenarios
From Phase 9 (IaC)
- TF state corrupted; recover from backup; document
- Crossplane Composition stuck in ReadyCondition: False; identify why
- TF apply failing because resource exists but not in state; reconcile
From Phase 10 (AWS)
- EKS upgrade broke the AWS Load Balancer Controller; users seeing 503
- IAM “Access Denied” — IRSA misconfigured; trace the OIDC trust chain
- Cloud bill spiked 3x by month-end; identify cause via Cost Explorer; remediate
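For the IRSA scenario, tracing the OIDC trust chain usually comes down to three checks on the IAM role's trust policy: the federated principal points at the cluster's OIDC provider, the `sub` condition names the right service account, and the `aud` condition is `sts.amazonaws.com`. The sketch below runs those checks on a trust policy already loaded as a dict. The function name and the example values are hypothetical, and it only handles `StringEquals` conditions (a real policy may use `StringLike`).

```python
def check_irsa_trust(policy: dict, issuer: str, namespace: str, service_account: str) -> list[str]:
    """Return the problems found in an IAM role trust policy used for IRSA.

    `issuer` is the cluster's OIDC issuer URL without the https:// prefix.
    """
    expected_sub = f"system:serviceaccount:{namespace}:{service_account}"
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue  # not the web-identity statement

        problems = []
        federated = stmt.get("Principal", {}).get("Federated", "")
        if not federated.endswith(f"oidc-provider/{issuer}"):
            problems.append(f"Federated principal does not match issuer {issuer}")

        cond = stmt.get("Condition", {}).get("StringEquals", {})
        if cond.get(f"{issuer}:sub") != expected_sub:
            problems.append(f"sub condition is not {expected_sub}")
        if cond.get(f"{issuer}:aud") != "sts.amazonaws.com":
            problems.append("aud condition is not sts.amazonaws.com")
        # Only the first web-identity statement is inspected in this sketch.
        return problems

    return ["no sts:AssumeRoleWithWebIdentity statement found"]


if __name__ == "__main__":
    # Hypothetical issuer, account ID, namespace, and service account.
    issuer = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::111122223333:oidc-provider/{issuer}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{issuer}:sub": "system:serviceaccount:team-a:api",
                f"{issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }
    print(check_irsa_trust(trust_policy, issuer, "team-b", "api"))
```

A namespace or service account mismatch in the `sub` condition is a common cause of "Access Denied" after a workload is moved or renamed, which is why the check compares the full `system:serviceaccount:<namespace>:<name>` string.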
From Phase 12 (Platform Eng)
- ArgoCD application stuck OutOfSync; refusing to apply; identify why (RBAC, hook, manifest)
- Cosign admission policy blocking deploys after a key rotation; debug + fix
- Service mesh mTLS broken between two services; trace certs; fix without disrupting traffic
- SLO burn-rate alert fires; identify whether real degradation or noisy SLO; recommend action
For each: triage → fix → runbook + postmortem skeleton.
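For the burn-rate scenario, one way to separate real degradation from a noisy SLO is to compare burn rates over a long and a short window. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a 30-day SLO window and the commonly cited 14.4 threshold (2% of the budget spent in one hour); the window choice, threshold, and labels are illustrative assumptions, not values taken from the exam.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def classify(long_window_errors: float, short_window_errors: float,
             slo_target: float, threshold: float = 14.4) -> str:
    """Classify an alert from the error ratios seen in a long and a short window."""
    long_burn = burn_rate(long_window_errors, slo_target)
    short_burn = burn_rate(short_window_errors, slo_target)
    if long_burn >= threshold and short_burn >= threshold:
        return "sustained"    # burning fast over both windows
    if long_burn >= threshold:
        return "recovering"   # long window is high but the short window has dropped
    if short_burn >= threshold:
        return "short-spike"  # brief burst that has not moved the long window
    return "ok"


if __name__ == "__main__":
    # Hypothetical 99.9% SLO with 2% errors over the last hour and the last 5 minutes.
    print(classify(long_window_errors=0.02, short_window_errors=0.02, slo_target=0.999))
```

An alert that keeps firing on "short-spike" conditions alone is a hint that the SLO or its alert windows need tuning, while "sustained" points at real degradation worth a root-cause investigation.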
Pass criteria:
- 3/3 root-caused, not symptom-patched
- Fixes applied via GitOps where possible (no manual kubectl apply)
- Each runbook usable by another engineer
- Each postmortem skeleton names at least one pattern and one action item that would prevent recurrence
What passing looks like: the postmortems read blameless and specific. They cite patterns (e.g., secrets-lifecycle, defense-in-depth) rather than tools, and the action items are scoped (a PR, a runbook update, a new alert) rather than aspirational.
Part 3: Architecture review (120 min)
“You’re a Staff Engineer reviewing a junior’s basecamp PR. They propose adding a new K8s cluster on a third cloud (Azure). Read their design (provided); write a thorough but constructive review covering: cost, security, ops burden, DR implications, multi-cloud justification, alternatives.”
The provided design (generated ahead of time by the root-exam skill, or from this doc) deliberately has 4-5 issues:
- Adding a third cloud without articulating why
- IAM/identity model conflated across clouds
- Pod Security set to baseline, not restricted
- No SLO for cross-cluster traffic
- Cost projection optimistic by 2x
Your review must catch most of these + propose alternatives + be constructive (not just “no”).
Pass criteria:
- 4/5 issues identified
- At least 1 alternative architecture proposed
- Tone: senior + constructive (not gatekeeping)
- Cites specific patterns from your library (e.g., zero-trust-networking, platform-as-product, defense-in-depth)
What passing looks like: the review reads like one a Senior engineer would actually leave on a junior’s PR — specific, kind, and pattern-rooted. It opens with what’s strong, names the missing trade-off articulation, and closes with a smaller-step alternative that gets the team 80% of the value at 20% of the cost.
Overall pass criteria
- [ ] Part 1: end-to-end onboarding in <90 min wall time
- [ ] Part 2: 3/3 incidents root-caused; runbooks + postmortems written
- [ ] Part 3: architecture review catches 4/5 issues, proposes alternatives
- [ ] Self-grade vs Claude grade: agree within ~10%

If you fail Part 2 or Part 3: don’t retake immediately. Spend 1-2 weeks on the gap, then retake. A weak Part 3 in particular usually means the pattern library hasn’t been promoted past OUTLINE. The fix is to operate the platform longer with deliberate writeups, not to re-read DDIA.
After passing
You can:
- Design and operate multi-cloud K8s platforms
- Reason about distributed-systems trade-offs from theory + practice
- Build internal developer platforms (Backstage, mesh, security)
- Manage cost across clouds, defend security at depth, recover from DR
- Ship OSS that other engineers find useful (terralabs public)
- Define + measure platform SLOs

You have:
- terralabs publicly launched (with stars + maybe a couple PRs)
- basecamp ready to go public Year 3 (multi-cloud)
- platform-ctl private with 5+ working commands
- ops-handbook with ~50 runbooks, ~10 postmortems, ~100 weekly logs, 5+ ADRs
- 2+ merged upstream PRs total
- ~22 patterns DEEP

Exit ramp: DevOps Engineer / Senior DevOps / Cloud Engineer / Platform Engineer

Update program/overview.md Status block: “Year 2 complete: YYYY-MM-DD.”
→ Next: Year 3 — Platform Engineering & Data
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Cramming the week before | Cramming masks gaps; plan 1-2 weeks of steady practice, then take the test |
| Retaking same day after a fail | You’ll memorize scenarios, not patterns |
| Symptom-patching to look fast | Defeats the program; root cause is the contract |
| Skipping Part 3 because “writing isn’t engineering” | Architecture review IS engineering at Senior+ level |
| Reviewing the junior’s PR as a gatekeeper | Senior tone is constructive; gatekeeping fails Part 3 even with all 5 issues caught |