Final Exam
6-hour scenario combining all 7 phases of Year 1. Pass = ready for Year 2 (Junior SRE / IT Engineer exit ramp credible).
The Year 1 Final Exam is a synthesis test, not a checklist. It asks one question across six hours and three formats: can you reason about a small Kubernetes platform from the kernel up, in patterns rather than in commands? Every part composes prior phases: the build leans on Phase 7 (Kubernetes + GitOps); the debug catalog draws from Phase 3 (Databases), Phase 5 (Go), and Phase 7; and the articulation cashes in on Phase 1 (OS) and Phase 2 (Networking) as well as Phase 7.
What’s being measured is pattern fluency, not tool memorization. Two engineers can both deploy a service via ArgoCD; only one of them can explain why the control-loop pattern is what makes the deploy converge — and only that one is ready for Year 2. The exam audits which one you are.
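To make the control-loop claim concrete, here is a minimal sketch of reconciliation in Python. It is a toy model, not ArgoCD's or the kubelet's actual code: `reconcile_once` and `converge` are hypothetical names, and state is reduced to a replica count per workload.

```python
"""Toy control loop: repeatedly diff desired vs. observed state and act on the gap."""


def reconcile_once(desired: dict[str, int], observed: dict[str, int]) -> list[str]:
    """Apply one round of corrections to `observed` in place; return the actions taken."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, 0)
        if have < want:
            observed[name] = have + 1          # move one step toward the declared count
            actions.append(f"scale-up {name}")
        elif have > want:
            observed[name] = have - 1
            actions.append(f"scale-down {name}")
    for name in list(observed):
        if name not in desired:                # prune anything no longer declared
            del observed[name]
            actions.append(f"delete {name}")
    return actions


def converge(desired: dict[str, int], observed: dict[str, int], max_rounds: int = 100) -> int:
    """Loop until a round takes no action; return how many rounds had to act."""
    for rounds in range(max_rounds):
        if not reconcile_once(desired, observed):
            return rounds
    raise RuntimeError("did not converge")


if __name__ == "__main__":
    desired = {"web": 3, "redis": 1}
    observed = {"web": 0, "legacy": 2}
    print(converge(desired, observed), observed)
```

The loop never asks how the gap arose. A fresh deploy, a deleted pod, and a manual edit all look the same to it: observed differs from desired, so it acts. That indifference is why the deploy converges.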
Treat this doc the way you’d treat a real on-call interview prep: open it ~2 weeks ahead of the date, scan the format, then close it. Cramming the catalog in week 51 is the anti-pattern this exam exists to expose.
When to take
After Phase 7 validation criteria are all green and triage has been live on basecamp for at least two weeks. Schedule the exam date ~2 weeks ahead so the homelab is settled and the ops-handbook is current.
Setup
- Fresh K3s cluster (3 VMs in Proxmox)
- Bastion VM reachable
- Postgres + Redis VMs available
- ArgoCD installed; basecamp repo connected
- 6 hours of uninterrupted time
- The root-exam skill (or solo + this doc as the script)
Format
3 parts, 6 hours total:
- Part 1: Build (90 min)
- Part 2: Debug (180 min)
- Part 3: Articulate (90 min)
The exam is AI-administered via the root-exam skill. Claude plays the examiner: presents the scenario, watches your diagnosis, scores root-cause vs symptom-patching, and reveals what was actually broken at the end. You can also run it solo using just this doc as the script; it’s harder without the live grading but cheaper to schedule.
Part 1: Build (90 min)
“Given a fresh K3s cluster + a new namespace `tenant-alpha`, deploy the following via basecamp + ArgoCD, end-to-end, no manual `kubectl apply`.”
Required deliverables:
- [ ] A new ArgoCD Application: `tenant-alpha-stack`
- [ ] Postgres StatefulSet with PVC (Longhorn-backed)
- [ ] Redis Deployment with persistence
- [ ] A Go HTTP service (use a tiny prebuilt image; focus is K8s, not coding)
- [ ] Ingress with TLS (self-signed cert from cert-manager OK)
- [ ] Default-deny NetworkPolicy + explicit allow rules
- [ ] A Sealed Secret for the DB password
- [ ] PrometheusRule that fires when the service is down >5 min
- [ ] All committed in basecamp git repo; ArgoCD reconciles on push
Pass criteria:
- All resources reach Healthy in <10 min after `git push`
- No `kubectl apply -f` used at any point
- NetworkPolicy verified (curl from disallowed pod fails; from allowed succeeds)
- Sealed Secret round-trip works (encrypted in git, decrypted in cluster)
- PrometheusRule fires on a simulated outage and resolves on recovery
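The NetworkPolicy check above is easier to reason about with the evaluation rule written out. The sketch below is a simplified Python model of ingress evaluation within one namespace. It assumes equality-only label selectors and ignores ports, namespaces, and egress. The function names and the policy dict shape are hypothetical, not the Kubernetes API.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Equality-based selector: every key/value must be present; {} matches all pods."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(src: dict[str, str], dst: dict[str, str], policies: list[dict]) -> bool:
    """Decide whether traffic from pod labels `src` may reach pod labels `dst`."""
    selecting = [p for p in policies if matches(p["pod_selector"], dst)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    # Policies are additive: one matching allow rule in any selecting policy is enough.
    return any(matches(rule, src) for p in selecting for rule in p["allow_from"])


# Hypothetical policies for tenant-alpha: deny everything, then allow api -> postgres.
DEFAULT_DENY = {"pod_selector": {}, "allow_from": []}
ALLOW_API_TO_DB = {"pod_selector": {"app": "postgres"}, "allow_from": [{"app": "api"}]}

if __name__ == "__main__":
    policies = [DEFAULT_DENY, ALLOW_API_TO_DB]
    print(ingress_allowed({"app": "api"}, {"app": "postgres"}, policies))    # allowed pod
    print(ingress_allowed({"app": "debug"}, {"app": "postgres"}, policies))  # disallowed pod
```

The two calls at the bottom mirror the two curl checks: one from an allowed pod, one from a disallowed pod. Testing only the success case would not show that the default-deny policy is in effect.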
What passing looks like: the diff in basecamp is small and obvious. Each resource is in its expected directory. The ArgoCD UI shows green within minutes of the push, and the runbook for “what is tenant-alpha-stack and where is it declared?” writes itself.
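The PrometheusRule criterion in Part 1 (fires after more than 5 minutes down, resolves on recovery) depends on the condition holding continuously. The following Python sketch models that pending-then-firing behaviour as I understand Prometheus's `for` clause. It is an illustration under that assumption, not Prometheus code, and `alert_states` is a hypothetical name.

```python
def alert_states(samples: list[tuple[int, bool]], for_seconds: int = 300) -> list[str]:
    """Map rule evaluations to alert states.

    samples: (unix_seconds, service_is_up) per evaluation, in time order.
    """
    states = []
    down_since = None
    for ts, is_up in samples:
        if is_up:
            down_since = None              # recovery resets the timer and resolves the alert
            states.append("inactive")
            continue
        if down_since is None:
            down_since = ts                # first evaluation that saw the outage
        held = ts - down_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    outage = [(0, True)] + [(t, False) for t in range(60, 361, 60)] + [(420, True)]
    print(alert_states(outage))
```

When simulating the outage, keep the service down for longer than the hold duration plus one evaluation interval. In this model a brief recovery in the middle resets the timer, so a flapping service may never reach firing.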
Part 2: Debug (180 min)
Three independent scenarios, ~60 min each. The root-exam skill picks them deterministically from a seed; one each from Phase 3 / Phase 5 / Phase 7.
Scenario shape (each)
- Symptom presented (alert fires, dashboard shows red, user complaint)
- Cluster + DBs are in a real-but-broken state
- You diagnose, fix, write a runbook for the diagnosis path
Sample scenarios (the catalog)
From Phase 3 (databases)
- Postgres autovacuum is hours behind; queries getting slower
- Streaming replica is lagging; lag growing; replication slot WAL piling up
- Redis OOM-killed under load; cache miss storm hitting Postgres
From Phase 5 (Go + concurrency)
- A Go service has a goroutine leak; use pprof (goroutine profile + heap diff) to find it
- Race condition under load (race detector clean in dev — find it in prod logs)
- Worker pool deadlocked; identify the channel that’s blocked
From Phase 7 (K8s + GitOps)
- Pod CrashLoopBackOff — root cause is one of: configmap typo, secret missing, image-pull failure, command/args wrong, capabilities dropped that are needed
- Service has no endpoints — selector mismatch, pod readiness probe failing, or all pods in CrashLoopBackOff
- Ingress returns 503 — upstream pod not ready, ingress controller misconfigured, or DNS broken at CoreDNS
- ArgoCD application stuck OutOfSync — bad manifest, RBAC denied, hook failed
- CoreDNS broken cluster-wide — find why; fix without disrupting basecamp
Pass criteria per scenario:
- Root cause identified, not symptom-patched
- Runbook written that another engineer could follow
- Fix applied via GitOps (no manual apply)
- The runbook cites at least one pattern — not just commands
What passing looks like: the runbook reads like one a Junior SRE would actually use at 3am. Specific signals are named (e.g., “if kubectl get endpoints is empty, check the selector before checking readiness”), not vague (“look at the logs”).
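The "selector before readiness" ordering in that example follows from how a Service's endpoints are derived. Below is a minimal Python sketch of that derivation, assuming a non-empty, equality-only selector. The pod dict shape and the `endpoints` and `diagnose` helpers are hypothetical, not a Kubernetes client API.

```python
def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """Equality-based selector: every key/value in the selector must be on the pod."""
    return all(labels.get(k) == v for k, v in selector.items())


def endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """IPs of pods that both match the selector and report ready."""
    return [p["ip"] for p in pods if matches(selector, p["labels"]) and p["ready"]]


def diagnose(selector: dict[str, str], pods: list[dict]) -> str:
    """Name the first broken link in the chain: selector, then readiness."""
    matching = [p for p in pods if matches(selector, p["labels"])]
    if not matching:
        return "selector matches no pods"
    if not any(p["ready"] for p in matching):
        return "pods match but none are ready"
    return "endpoints present"


if __name__ == "__main__":
    pods = [
        {"ip": "10.42.0.7", "labels": {"app": "api"}, "ready": True},
        {"ip": "10.42.1.9", "labels": {"app": "api"}, "ready": False},
    ]
    print(endpoints({"app": "api"}, pods), diagnose({"app": "apii"}, pods))
```

A runbook step built on this names the signal and the order. An empty endpoints list with zero matching pods points at labels, while matching pods that are all unready point at the probe or the container.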
Part 3: Articulate (90 min)
~1500 words. One prompt, no outside notes (you may refer to your own ops-handbook + pattern library; they’re your work).
“Trace what happens when a request arrives at my homelab Ingress, end-to-end down to the kernel and back up. Cite the controllers, syscalls, and patterns at each step.”
Cover at minimum:
- Layer 2-4: packet arrives at NIC; kernel routes via iptables/eBPF
- K8s Service: kube-proxy / Cilium handles the cluster-IP → pod-IP translation
- Pod: container’s network namespace receives the packet
- Process: socket read; syscall; eventual `write()` of response
- Storage: if the request hits Postgres, walk the WAL → page cache → fsync path
- GitOps: how does this pod even exist? Walk basecamp git → ArgoCD → API server → scheduler → kubelet → containerd → namespaces + cgroups
- Patterns at each step: control-loops, layering-and-abstraction, mediation, privilege-separation, service-discovery, load-balancing
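For the storage step in the list above, it helps to have seen the write-then-fsync ordering in code at least once. The sketch below is a toy append-only log in Python that uses the real `write(2)` and `fsync(2)` wrappers from the `os` module. It illustrates the ordering only and is not how Postgres implements its WAL. `append_durably` and the file name are hypothetical.

```python
import os
import tempfile


def append_durably(path: str, record: bytes) -> None:
    """Append one record; return only after the kernel reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        data = record + b"\n"
        while data:                        # write(2) may accept fewer bytes than offered
            written = os.write(fd, data)   # bytes land in the kernel page cache
            data = data[written:]
        os.fsync(fd)                       # ask the kernel to flush cached pages to the device
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "toy.wal")
        append_durably(log, b"BEGIN; UPDATE ...; COMMIT")
        with open(log, "rb") as f:
            print(f.read())
```

The point to carry into the writeup is the gap between the two calls. After `write` returns, the data is visible to readers but would not survive a power loss. Only after `fsync` returns is it reasonable to acknowledge the commit.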
Pass criteria:
- Specific (no hand-waving — name the controller, the syscall, the file)
- Cites at least 6 patterns from the library
- Distinguishes K8s pattern from K8s implementation
- Identifies at least 2 places where the same pattern shows up in different layers (e.g., reconciliation in ArgoCD AND in the kubelet)
What passing looks like: the writeup could be handed to a Year 2 student as a study artifact. It treats “Kubernetes” as one implementation of the control-loop pattern, names which patterns recur at which layer, and admits which parts of the path you’re least sure about — that admission is a feature, not a flaw.
Overall pass criteria
- [ ] Part 1: end-to-end deploy works in <10 min after final commit
- [ ] Part 2: 3/3 scenarios root-caused; runbooks written; fixes applied via GitOps
- [ ] Part 3: 1500-word articulation passes the depth check
- [ ] Self-grade vs Claude's grade: agree within ~10%
If you fail Part 2 (more than 1 symptom-patched), don’t retake immediately. Spend 1-2 weeks on the gap (which phase failed?), then retake. Symptom-patching under exam pressure usually means the underlying pattern wasn’t deepened enough during operating, not that the test was unfair.
After passing
You can:
- Debug any layer above the kernel from first principles
- Reason about a system in terms of patterns, not commands
- Operate a small K8s cluster with GitOps reconciliation
- Ship code with PR review, CI, and releases across two languages
- Write a runbook another engineer can follow at 3am
You have:
- 4 OSS projects shipped: rxp, konfig, pulse, triage
- ops-handbook with ~30 runbooks, 5+ postmortems, 50+ weekly logs, 1+ ADR
- 1+ merged upstream PR
- ~15 patterns deepened to OUTLINE (some pushed to DEEP)
- basecamp Tier 1 operational, hosting triage
Exit ramp: Junior SRE / IT Engineer
Update program/overview.md Status block: “Year 1 complete: YYYY-MM-DD.”
→ Next: Year 2 — Distributed Systems & Cloud
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Cramming the week before the exam | The exam is a diagnostic; cramming masks gaps |
| Retaking the same day after a fail | You’ll memorize the scenarios, not the patterns |
| Symptom-patching to look fast | Defeats the entire program; root cause is the contract |
| Skipping Part 3 because “writing isn’t engineering” | Articulation IS the audit. If you can’t say it, you don’t know it. |
| Treating the exam as a tool quiz | The bar is pattern fluency. Tools are the surface; patterns are the score. |