ops-handbook Plan
Personal platform knowledge base. Runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans. Every phase of ROOT adds to it. By Year 5 it’s a public reference work other SREs cite.
ops-handbook is the Group C craft artifact that runs in parallel with everything else in ROOT. It’s not a code project — it’s the journal of running the platform. Every runbook you write while debugging a Phase 3 Postgres replication issue, every postmortem after a Phase 7 etcd corruption, every ADR explaining why you picked Cilium over Calico, every Sunday weekly log entry — it all lives here.
The repo is initialized in Year 1 Phase 1 (Month 1, Week 1) and never “completes.” It’s the only artifact in ROOT that compounds across all 60 months without per-phase boundaries — by the time the program ends in Month 60, ops-handbook is ~140 runbooks, ~25 postmortems, ~250 weekly logs, and 15+ ADRs deep. It’s also the dataset that fuels notes-rag in Year 4 and services/aiops/ in Year 5 — the platform learning from its own operational history.
The discipline ops-handbook enforces is the runbook-as-code and blameless-postmortem patterns made tangible. Every artifact in here uses a structured template, is searchable, is AI-consumable, and survives 3am.
What it is
A version-controlled markdown repo holding every artifact ROOT generates during day-to-day operations. Not a wiki (wikis go stale). Not docs-as-prose (those don’t survive 3am). Structured templates, indexed by topic, searchable, AI-consumable.
    ops-handbook/
    ├── runbooks/
    │   ├── linux/
    │   ├── networking/
    │   ├── kubernetes/
    │   ├── data/
    │   ├── ml/
    │   ├── platform/
    │   └── agents/                  # Year 5
    ├── incidents/
    │   ├── 2026/
    │   │   ├── 2026-W22-postgres-replication-lag.md
    │   │   └── ...
    │   ├── 2027/
    │   └── ...
    ├── postmortems/                 # parallel structure to incidents/
    ├── adrs/
    │   ├── 0001-cilium-cni.md
    │   ├── 0002-iceberg-over-delta.md
    │   ├── 0003-mcp-over-custom-rest.md
    │   └── ...
    ├── weekly-logs/
    │   ├── 2026/
    │   │   ├── W22-2026-05-31.md
    │   │   └── ...
    │   ├── 2027/
    │   └── ...
    ├── contributions/
    │   ├── contribution-plan.md     # which OSS PRs you're targeting
    │   └── shipped/                 # the merged ones
    ├── notes/                       # phase-aligned reading notes (DDIA, Chip Huyen, etc.)
    │   ├── ddia/
    │   ├── designing-ml-systems/
    │   └── ai-engineering/
    └── README.md                    # the index of indexes

Why it exists
- 3am-you needs runbooks. A bad runbook is how a 3am page turns into a 4-hour incident.
- Pattern recognition over time. Re-reading postmortems 6 months later surfaces recurring failure patterns.
- Onboarding artifact. Hand the repo to a new engineer and they can get up to speed from the runbooks, ADRs, and postmortems.
- Compounds across years. The discipline accumulates; by Year 5 it’s a serious reference work.
- Audit trail. Every architectural decision (Cilium over Calico, Iceberg over Delta, etc.) has an ADR with the trade-offs you considered.
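The ADR filenames in the layout (0001-cilium-cni.md, 0002-iceberg-over-delta.md, ...) suggest a zero-padded sequence number followed by a slug. A minimal sketch of picking the next filename, assuming that convention holds; the helper name `next_adr_filename` and the slug `example-decision` are hypothetical and not part of the plan:

```python
import re

# Assumed convention, inferred from the layout: four digits, a hyphen, a lowercase slug.
ADR_NAME = re.compile(r"^(\d{4})-[a-z0-9-]+\.md$")


def next_adr_filename(existing: list[str], slug: str) -> str:
    """Return the next zero-padded ADR filename for the given slug."""
    numbers = [int(m.group(1)) for name in existing if (m := ADR_NAME.match(name))]
    return f"{max(numbers, default=0) + 1:04d}-{slug}.md"


existing = [
    "0001-cilium-cni.md",
    "0002-iceberg-over-delta.md",
    "0003-mcp-over-custom-rest.md",
]
print(next_adr_filename(existing, "example-decision"))  # 0004-example-decision.md
```

Files that do not match the pattern are ignored, so a stray README in adrs/ would not affect the numbering.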
Pattern it teaches
runbook-as-code: operational knowledge as structured, versioned, testable artifacts.
blameless-postmortem: the discipline of focusing on systems, not people.
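One way to make "testable" concrete is a lint that fails when a runbook lacks the sections its template requires. The sketch below is an assumption-heavy illustration: the real section names live in the runbook template under meta/, which is not shown here, so `REQUIRED_SECTIONS` and the sample runbook text are hypothetical.

```python
# Hypothetical section names; the actual list would come from the runbook template.
REQUIRED_SECTIONS = ("Symptoms", "Diagnosis", "Remediation", "Verification")


def missing_sections(markdown_text: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return required section names that have no matching markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in markdown_text.splitlines()
        if line.startswith("#")
    }
    return [name for name in required if name.lower() not in headings]


# Made-up sample runbook, for illustration only.
sample = """# Postgres replication lag
## Symptoms
Replica lag alert fires.
## Remediation
Restart the replica.
"""
print(missing_sections(sample))  # ['Diagnosis', 'Verification']
```

This simple version treats any line starting with `#` as a heading, so shell comments inside code blocks would also count; a stricter check would need a markdown parser.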
Scope
- In: runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans, reading notes per phase, anti-pattern catalogs
- Out: code (lives in basecamp / terralabs / etc.); ephemeral notes (use a scratchpad)
When built
Phase 1, Month 1. Initialized the first week of ROOT. Every subsequent phase adds. Never “complete” — it’s a live artifact.
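The first-week initialization could be as small as creating the top-level directories from the layout under "What it is". This is a sketch under stated assumptions, not the actual bootstrap: the `scaffold` function is hypothetical, `.gitkeep` is a common convention (not a Git feature) for tracking otherwise empty directories, and runbooks/agents/ is left out because the layout marks it as Year 5.

```python
import tempfile
from pathlib import Path

# Top-level directories and initial subdirectories, taken from the layout.
TOP_LEVEL = {
    "runbooks": ["linux", "networking", "kubernetes", "data", "ml", "platform"],
    "incidents": [],
    "postmortems": [],
    "adrs": [],
    "weekly-logs": [],
    "contributions": ["shipped"],
    "notes": [],
}


def scaffold(root: Path) -> list[Path]:
    """Create the directory skeleton under root and return the created directories."""
    created = []
    for parent, children in TOP_LEVEL.items():
        for rel in [parent] + [f"{parent}/{child}" for child in children]:
            path = root / rel
            path.mkdir(parents=True, exist_ok=True)
            (path / ".gitkeep").touch()  # placeholder so Git tracks the empty directory
            created.append(path)
    (root / "README.md").touch()  # the index of indexes starts empty
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dirs = scaffold(Path(tmp) / "ops-handbook")
    print(len(dirs), "directories created")
```

Because `mkdir` is called with `exist_ok=True`, rerunning the script against an existing repo leaves current content in place.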
Deliverables (by year)
| Year | Cumulative state |
|---|---|
| Y1 | ~30 runbooks (Linux, networking, databases, containers, K8s); 5+ postmortems; 50+ weekly logs; 1+ ADR; 1+ shipped contribution |
| Y2 | + AWS + GCP + IaC + Backstage + service mesh + security runbooks; multi-cloud postmortems; ~10 ADRs; 2+ contributions |
| Y3 | + observability + lakehouse + processing + serving + governance runbooks; data quality ADRs; 3+ contributions |
| Y4 | + ML lifecycle + serving + GPU + RAG + agents runbooks; AI security ADRs; 4+ contributions |
| Y5 | + AIOps + portal + capstone runbooks; final reflection postmortem on the program itself; 5+ contributions; 15+ ADRs |
Public vs private
- Initially: PRIVATE (github.com/abukix/ops-handbook), because incidents may contain sensitive context
- Year 5 capstone consideration: extract a sanitized version as a public reference. The discipline becomes a teaching artifact (“a real engineer’s 5-year operational journal”). The public version lives at github.com/abukix/ops-handbook-public with internal context redacted.
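What counts as internal context is not specified here, so the following is only an illustrative sketch of a redaction pass. The two patterns (IPv4 addresses and hostnames under a made-up `internal.example` domain), the placeholders, and the sample sentence are all assumptions; a real pass would need a reviewed pattern list and a manual read-through before publishing.

```python
import re

# Hypothetical redaction rules: (pattern, placeholder).
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED-IP]"),
    (re.compile(r"\b[\w.-]+\.internal\.example\b"), "[REDACTED-HOST]"),
]


def sanitize(text: str) -> str:
    """Apply each redaction rule in order and return the sanitized text."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(sanitize("Replica db-2.internal.example at 10.0.4.17 fell behind."))
# Replica [REDACTED-HOST] at [REDACTED-IP] fell behind.
```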
Cross-references
- Templates: meta/ (runbook-template, incident-template, postmortem-template, adr-template, weekly-log-template, pattern-template, blog-template — written in the meta phase)
- Used in: every phase of ROOT
- Pattern: runbook-as-code, blameless-postmortem
- Master plan context: Master Plan — Group C: the craft
- Year 1 context: Year 1 — what ships publicly
- Brand context: Abukix Studio
Success criteria
- Sunday weekly log NEVER missed without an explicit “skipped this week because X” entry
- Every SEV-1/2/3 incident gets a postmortem within 72 hours
- Every architecture decision (Cilium over Calico, etc.) gets an ADR
- Every runbook tested at least once by handing it to Claude in “play the runner” mode
- By Y5: sanitized public version usable as reference by external engineers
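The first two criteria are mechanical enough to check with a script. A minimal sketch, assuming weekly logs are named like W22-2026-05-31.md (ISO week number plus the Sunday's date, as in the layout) and that the 72-hour clock starts when the incident is resolved, which the criterion does not state. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def weekly_log_name(sunday: date) -> str:
    """Build the expected weekly-log filename for a given Sunday."""
    _iso_year, iso_week, iso_weekday = sunday.isocalendar()
    if iso_weekday != 7:
        raise ValueError(f"{sunday} is not a Sunday")
    return f"W{iso_week:02d}-{sunday.isoformat()}.md"


def missing_logs(first_sunday: date, last_sunday: date, present: set[str]) -> list[str]:
    """List expected weekly-log filenames that are absent from `present`."""
    missing = []
    day = first_sunday
    while day <= last_sunday:
        name = weekly_log_name(day)
        if name not in present:
            missing.append(name)
        day += timedelta(weeks=1)
    return missing


def postmortem_on_time(resolved_at: datetime, published_at: datetime) -> bool:
    """True if the postmortem was published within 72 hours of resolution (assumed start)."""
    return published_at - resolved_at <= timedelta(hours=72)


present = {"W22-2026-05-31.md", "W24-2026-06-14.md"}
print(missing_logs(date(2026, 5, 31), date(2026, 6, 14), present))
# ['W23-2026-06-07.md']
print(postmortem_on_time(datetime(2026, 6, 1, 9, 0), datetime(2026, 6, 3, 18, 0)))
# True
```

A “skipped this week because X” entry would still be a file with the expected name, so it counts as present and the check only flags weeks with no entry at all.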