Final Exam

8-hour scenario combining all 6 phases of Year 2. Pass = ready for Year 3 (DevOps / Cloud Engineer exit ramp credible; Senior trajectory in reach).

The Year 2 Final Exam is where single-machine intuition gets audited. Year 1 asked: can you reason about one machine and one cluster? Year 2 asks: can you reason across machines, across clouds, across teams? Three platforms (K3s + EKS + GKE) operating from one git repo. Three independent incidents in 120 minutes. One architecture review where the right answer is sometimes “no, and here’s the alternative.”

What’s measured isn’t AWS trivia or Terraform syntax. It’s pattern fluency at the platform layer: can you reason about CAP/PACELC in a real review, name the secrets-lifecycle trade-offs, defend an SLO contract against a noisy alert, and recognize when zero-trust networking is being undermined by a “harmless” exception? Two engineers can both ship multi-cloud Backstage; only one can explain why this composition, with these trade-offs, and how it fails. The exam audits the second one.

This is also where the tone shifts. The Year 1 exam sets a Junior SRE bar; the Year 2 exam graduates you toward Senior, which is why the architecture review part is non-negotiable.

When to take

After Phase 13 validation criteria are all green and the multi-cloud capstone has been operating for at least a week. Schedule it ~2 weeks ahead so terralabs is stable, Backstage has a populated catalog, and platform-ctl has its first 5+ commands working.

Setup

basecamp managing K3s + EKS + GKE simultaneously
terralabs available; both clouds with budget headroom
Backstage running with catalog populated
Service mesh + Sealed Secrets + ESO + Cosign all operational
platform-ctl private, 5+ commands working
8 hours of uninterrupted time
The root-exam skill (or solo + this doc as the script)

Format

3 parts, 420 min of timed work inside the 8-hour block:
  Part 1: Build               (180 min)
  Part 2: Triple incident     (120 min)
  Part 3: Architecture review (120 min)

Part 1: Build (180 min)

“Onboard a new team to basecamp. Provision their AWS account + GCP project via terralabs; create EKS + GKE clusters; deploy a starter app via Backstage; sign image; default-deny NetworkPolicy; service mesh mTLS verified; ArgoCD reconciling. End-to-end, all observable in Backstage + Prometheus.”

Required deliverables:

[ ] AWS + GCP accounts provisioned via terralabs (VPC + cluster + DB shape)
[ ] basecamp ArgoCD added both clusters as targets
[ ] Backstage Software Template used to scaffold a new service repo
[ ] platform-ctl service create + service deploy used (no manual YAML)
[ ] Image signed with Cosign; admission policy verifies
[ ] SealedSecret for DB password; ESO syncs from AWS Secrets Manager
[ ] NetworkPolicy default-deny + explicit allow
[ ] Service mesh mTLS verified between two services
[ ] Pod Security restricted in tenant namespace
[ ] PrometheusRule + SLO + burn-rate alert
[ ] Backstage shows the service with health, owners, SLO state
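
Two of the deliverables above, the default-deny NetworkPolicy with an explicit allow and Pod Security restricted on the tenant namespace, come down to three small manifests. A minimal sketch of their shape, assuming a hypothetical tenant namespace `team-a` and hypothetical `app: web` and `app: api` labels on port 8080. In the exam these would come out of the Backstage template or platform-ctl, whose real output this doc does not specify.

```python
import json


def tenant_namespace(name):
    """Namespace enforcing the 'restricted' Pod Security Standard."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            # Label read by the built-in Pod Security Admission controller.
            "labels": {"pod-security.kubernetes.io/enforce": "restricted"},
        },
    }


def default_deny(namespace):
    """Empty podSelector selects every pod; no rules means nothing is allowed."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, to_app, from_app, port):
    """Explicit allow: app=from_app pods may reach app=to_app on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"allow-{from_app}-to-{to_app}",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {"matchLabels": {"app": to_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


def tenant_baseline(namespace, to_app, from_app, port):
    """The three manifests a new tenant starts with."""
    return [
        tenant_namespace(namespace),
        default_deny(namespace),
        allow_ingress(namespace, to_app, from_app, port),
    ]


if __name__ == "__main__":
    # Hypothetical tenant and app names, for illustration only.
    for manifest in tenant_baseline("team-a", "api", "web", 8080):
        print(json.dumps(manifest, indent=2))
```

Because the default-deny policy covers egress as well, the calling pods would also need egress allows (towards the target app and DNS) before traffic flows; those are left out to keep the sketch short. The JSON output is valid YAML, so it can be committed to the tenant repo and reconciled by ArgoCD rather than applied by hand.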

Pass criteria:

Onboarding completes in <90 min wall time (the rest of the budget is for verification + screenshots)
All 11 deliverables verifiable
No manual kubectl apply -f at any point
The new team can self-service from Backstage without you sitting next to them

What passing looks like: the Backstage catalog entry is auto-populated, the SLO panel is real (not a placeholder), and the pull request that wired the new tenant into basecamp is small, reviewable, and obviously safe to merge. The platform feels like a product, not a hand-rolled stack.

Part 2: Triple incident (120 min)

Three simulated scenarios, ~40 min each. The root-exam skill picks them deterministically, one each from Phase 9, Phase 10, and Phase 12.

Scenario shape (each)

Symptom presented
Cluster + clouds in real-but-broken state
You diagnose, fix, write a runbook + postmortem skeleton

Sample scenarios

From Phase 9 (IaC)

TF state corrupted; recover from backup; document
Crossplane composite resource stuck with its Ready condition at False; identify why
TF apply failing because resource exists but not in state; reconcile
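
For the "resource exists but not in state" scenario, reconciliation starts with knowing exactly which resources the cloud has that state does not. A rough sketch, assuming the version 4 state file layout (a top-level `resources` list whose entries carry `instances[].attributes.id`) and a hypothetical `cloud_ids` list you gathered yourself from the provider's CLI or console. The function names and IDs are illustrative, not part of terralabs.

```python
def ids_in_state(state):
    """Collect the 'id' attribute of every managed resource instance."""
    found = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # data sources are read-only lookups, not owned resources
        for instance in resource.get("instances", []):
            resource_id = (instance.get("attributes") or {}).get("id")
            if resource_id:
                found.add(resource_id)
    return found


def drift(state, cloud_ids):
    """Compare what state tracks against what actually exists in the cloud."""
    tracked = ids_in_state(state)
    cloud = set(cloud_ids)
    return {
        "import_candidates": sorted(cloud - tracked),  # exists, not in state
        "missing_in_cloud": sorted(tracked - cloud),   # in state, gone from cloud
    }


if __name__ == "__main__":
    # Hypothetical, heavily trimmed state document.
    state = {
        "version": 4,
        "resources": [
            {"mode": "managed", "type": "aws_vpc", "name": "main",
             "instances": [{"attributes": {"id": "vpc-111"}}]},
            {"mode": "data", "type": "aws_ami", "name": "base",
             "instances": [{"attributes": {"id": "ami-999"}}]},
        ],
    }
    print(drift(state, ["vpc-111", "sg-222"]))
    # {'import_candidates': ['sg-222'], 'missing_in_cloud': []}
```

`terraform state pull` prints the current state as JSON for input. Each import candidate is then either brought under management (`terraform import`, or an `import` block in configuration on Terraform 1.5 and later) or deliberately left unmanaged, and the runbook should say which and why.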

From Phase 10 (AWS)

EKS upgrade broke the AWS Load Balancer Controller; users seeing 503
IAM “Access Denied” — IRSA misconfigured; trace the OIDC trust chain
Cloud bill spiked 3x by month-end; identify cause via Cost Explorer; remediate
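
For the IRSA scenario, tracing the OIDC trust chain means checking that the role's trust policy names the cluster's OIDC provider as the federated principal and pins the `sub` claim to `system:serviceaccount:<namespace>:<name>`. A minimal sketch of that check over a trust policy document. The account ID, issuer URL, and names are placeholders, and the sketch only understands `StringEquals` conditions (a `StringLike` condition with wildcards would be reported as a mismatch).

```python
def _as_list(value):
    """IAM policy fields may be a single string or a list of strings."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def irsa_trust_problems(policy, issuer, namespace, service_account):
    """Return reasons the trust policy would not let this service account in."""
    host = issuer.split("://", 1)[-1]  # condition keys drop the https:// scheme
    want_sub = f"system:serviceaccount:{namespace}:{service_account}"
    candidates = [
        s for s in policy.get("Statement", [])
        if s.get("Effect") == "Allow"
        and "sts:AssumeRoleWithWebIdentity" in _as_list(s.get("Action"))
    ]
    if not candidates:
        return ["no Allow statement for sts:AssumeRoleWithWebIdentity"]
    problems = []
    for statement in candidates:
        found = []
        principal = statement.get("Principal")
        federated = _as_list(principal.get("Federated")) if isinstance(principal, dict) else []
        if not any(arn.endswith(f":oidc-provider/{host}") for arn in federated):
            found.append(f"Federated principal is not the OIDC provider for {host}")
        equals = statement.get("Condition", {}).get("StringEquals", {})
        got_sub = equals.get(f"{host}:sub")
        if got_sub != want_sub:
            found.append(f"sub is {got_sub!r}, expected {want_sub!r}")
        if not found:
            return []  # one fully matching statement is enough
        problems.extend(found)
    return problems


if __name__ == "__main__":
    issuer = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"  # placeholder
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::111122223333:oidc-provider/"
                                       "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub":
                    "system:serviceaccount:default:api",
            }},
        }],
    }
    print(irsa_trust_problems(policy, issuer, "team-a", "api"))
    # ["sub is 'system:serviceaccount:default:api', expected 'system:serviceaccount:team-a:api'"]
```

A clean result here only clears the trust policy. The rest of the chain (the `eks.amazonaws.com/role-arn` annotation on the ServiceAccount, the OIDC provider being registered in IAM, and the permissions policy attached to the role) still needs checking separately.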

From Phase 12 (Platform Eng)

ArgoCD application stuck OutOfSync; refusing to apply; identify why (RBAC, hook, manifest)
Cosign admission policy blocking deploys after a key rotation; debug + fix
Service mesh mTLS broken between two services; trace certs; fix without disrupting traffic
SLO burn-rate alert fires; identify whether real degradation or noisy SLO; recommend action

For each: triage → fix → runbook + postmortem skeleton.
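
For the SLO burn-rate scenario, "real degradation or noisy SLO" is mostly arithmetic on the error budget. A minimal sketch of a two-window burn-rate check. The 14.4 threshold (paired with a 1-hour and a 5-minute window on a 30-day SLO) is the commonly cited example from Google's SRE Workbook and is an assumption here, not a value this program prescribes.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target  # e.g. a 99.9% target leaves a 0.1% error budget
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the whole period's budget spent at this rate over the window."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both windows burn fast: the long one shows the burn is
    significant, the short one shows it is still happening."""
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical error ratios against a 99.9% SLO.
    print(should_page(0.02, 0.02, 0.999))    # True: 20x burn in both windows
    print(should_page(0.02, 0.0005, 0.999))  # False: short window has recovered
```

The second case is the shape a noisy alert usually has: a single-window rule keeps firing on the long window after the short window has already recovered. If the alert in the scenario uses only one window, or a threshold that a routine deploy can exceed, a scoped action item is a change to the PrometheusRule rather than a change to the service.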

Pass criteria:

3/3 root-caused, not symptom-patched
Fixes applied via GitOps where possible (no manual kubectl apply)
Each runbook usable by another engineer
Each postmortem skeleton names at least one pattern and one action item that would prevent recurrence

What passing looks like: the postmortems read blameless and specific. They cite patterns (e.g., secrets-lifecycle, defense-in-depth) rather than tools, and the action items are scoped (a PR, a runbook update, a new alert) rather than aspirational.

Part 3: Architecture review (120 min)

“You’re a Staff Engineer reviewing a junior’s basecamp PR. They propose adding a new K8s cluster on a third cloud (Azure). Read their design (provided); write a thorough but constructive review covering: cost, security, ops burden, DR implications, multi-cloud justification, alternatives.”

The provided design (the root-exam skill or this doc generates one ahead of time) deliberately has 5 issues:

Adding a third cloud without articulating why
IAM/identity model conflated across clouds
Pod Security set to baseline, not restricted
No SLO for cross-cluster traffic
Cost projection optimistic by 2x

Your review must catch at least 4 of these, propose alternatives, and be constructive (not just “no”).

Pass criteria:

4/5 issues identified
At least 1 alternative architecture proposed
Tone: senior + constructive (not gatekeeping)
Cites specific patterns from your library (e.g., zero-trust-networking, platform-as-product, defense-in-depth)

What passing looks like: the review reads like one a Senior engineer would actually leave on a junior’s PR — specific, kind, and pattern-rooted. It opens with what’s strong, names the missing trade-off articulation, and closes with a smaller-step alternative that gets the team 80% of the value at 20% of the cost.

Overall pass criteria

[ ] Part 1: end-to-end onboarding in <90 min wall time
[ ] Part 2: 3/3 incidents root-caused; runbooks + postmortems written
[ ] Part 3: architecture review catches 4/5 issues, proposes alternatives
[ ] Self-grade vs Claude grade: agree within ~10%
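
The checklist above can be read as a single predicate. A sketch, assuming grades on a 0-100 scale and reading "within ~10%" as at most 10 points apart. The doc does not say whether that tolerance is absolute or relative, so that reading, like the field names, is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExamResult:
    onboarding_minutes: float   # Part 1 wall time
    incidents_root_caused: int  # Part 2, out of 3
    writeups_complete: bool     # Part 2 runbooks + postmortems written
    review_issues_caught: int   # Part 3, out of 5
    alternatives_proposed: int  # Part 3
    self_grade: float           # 0-100, assumed scale
    external_grade: float       # 0-100, assumed scale


def failed_criteria(result, grade_tolerance=10.0):
    """Return the overall pass criteria that were not met (empty list = pass)."""
    failures = []
    if not result.onboarding_minutes < 90:
        failures.append("Part 1: onboarding took 90 min or more")
    if result.incidents_root_caused < 3 or not result.writeups_complete:
        failures.append("Part 2: needs 3/3 root-caused with runbooks + postmortems")
    if result.review_issues_caught < 4 or result.alternatives_proposed < 1:
        failures.append("Part 3: needs 4/5 issues and at least 1 alternative")
    if abs(result.self_grade - result.external_grade) > grade_tolerance:
        failures.append("Self-grade and external grade disagree by more than the tolerance")
    return failures


if __name__ == "__main__":
    # Hypothetical attempt: strong on every part, but self-graded too generously.
    attempt = ExamResult(75, 3, True, 4, 1, self_grade=90, external_grade=70)
    print(failed_criteria(attempt))
    # ['Self-grade and external grade disagree by more than the tolerance']
```

Returning the list of failed criteria rather than a bare boolean matches how the retake advice works: the gap to spend 1-2 weeks on is whichever entry comes back.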

If you fail Part 2 or Part 3: don’t retake immediately. Spend 1-2 weeks on the gap, then retake. A weak Part 3 in particular usually means the pattern library hasn’t been promoted past OUTLINE — the fix is to operate the platform longer with deliberate writeups, not to re-read DDIA.

After passing

You can:
- Design and operate multi-cloud K8s platforms
- Reason about distributed-systems trade-offs from theory + practice
- Build internal developer platforms (Backstage, mesh, security)
- Manage cost across clouds, defend security at depth, recover from DR
- Ship OSS that other engineers find useful (terralabs public)
- Define + measure platform SLOs

You have:
- terralabs publicly launched (with stars + maybe a couple PRs)
- basecamp ready to go public Year 3 (multi-cloud)
- platform-ctl private with 5+ working commands
- ops-handbook with ~50 runbooks, ~10 postmortems, ~100 weekly logs, 5+ ADRs
- 2+ merged upstream PRs total
- ~22 patterns DEEP

Exit ramp: DevOps Engineer / Senior DevOps / Cloud Engineer / Platform Engineer

Update program/overview.md Status block: “Year 2 complete: YYYY-MM-DD.”

→ Next: Year 3 — Platform Engineering & Data

Anti-patterns

Anti-pattern	Why
Cramming the week before	Cramming masks gaps; do 1-2 weeks of steady practice, then take the test
Retaking same day after a fail	You’ll memorize scenarios, not patterns
Symptom-patching to look fast	Defeats the program; root cause is the contract
Skipping Part 3 because “writing isn’t engineering”	Architecture review IS engineering at Senior+ level
Reviewing the junior’s PR as a gatekeeper	Senior tone is constructive; gatekeeping fails Part 3 even with all 5 issues caught