ops-handbook Plan

Personal platform knowledge base: runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans. Every phase of ROOT adds to it. The goal is that by Year 5 a sanitized version is a public reference work other SREs can cite.

ops-handbook is the Group C craft artifact that runs in parallel with everything else in ROOT. It’s not a code project — it’s the journal of running the platform. Every runbook you write while debugging a Phase 3 Postgres replication issue, every postmortem after a Phase 7 etcd corruption, every ADR explaining why you picked Cilium over Calico, every Sunday weekly log entry — it all lives here.

The repo is initialized in Year 1 Phase 1 (Month 1, Week 1) and never “completes.” It’s the only artifact in ROOT that compounds across all 60 months without per-phase boundaries — by the time the program ends in Month 60, ops-handbook is ~140 runbooks, ~25 postmortems, ~250 weekly logs, and 15+ ADRs deep. It’s also the dataset that fuels notes-rag in Year 4 and services/aiops/ in Year 5 — the platform learning from its own operational history.

ops-handbook makes the runbook-as-code and blameless-postmortem patterns tangible. Every artifact in here uses a structured template, is searchable, is AI-consumable, and survives 3am.


What it is

A version-controlled markdown repo holding every artifact ROOT generates during day-to-day operations. Not a wiki (wikis go stale). Not docs-as-prose (those don’t survive 3am). Structured templates, indexed by topic, searchable, AI-consumable.

ops-handbook/
├── runbooks/
│   ├── linux/
│   ├── networking/
│   ├── kubernetes/
│   ├── data/
│   ├── ml/
│   ├── platform/
│   └── agents/                  # Year 5
├── incidents/
│   ├── 2026/
│   │   ├── 2026-W22-postgres-replication-lag.md
│   │   └── ...
│   ├── 2027/
│   └── ...
├── postmortems/                 # parallel structure to incidents/
├── adrs/
│   ├── 0001-cilium-cni.md
│   ├── 0002-iceberg-over-delta.md
│   ├── 0003-mcp-over-custom-rest.md
│   └── ...
├── weekly-logs/
│   ├── 2026/
│   │   ├── W22-2026-05-31.md
│   │   └── ...
│   ├── 2027/
│   └── ...
├── contributions/
│   ├── contribution-plan.md     # which OSS PRs you're targeting
│   └── shipped/                 # the merged ones
├── notes/                       # phase-aligned reading notes (DDIA, Chip Huyen, etc.)
│   ├── ddia/
│   ├── designing-ml-systems/
│   └── ai-engineering/
└── README.md                    # the index of indexes
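The weekly-log filenames in the tree appear to combine an ISO week number with the date of the Sunday the entry was written (`W22-2026-05-31.md`; 31 May 2026 is the Sunday of ISO week 22). A minimal sketch of generating such a path follows. The function names are hypothetical, and filing the log under the ISO year (rather than the calendar year) is an assumption; the source does not say which applies around New Year.

```python
from datetime import date


def weekly_log_name(sunday: date) -> str:
    """Build a filename like W22-2026-05-31.md from the Sunday of the entry."""
    if sunday.isoweekday() != 7:
        raise ValueError(f"{sunday} is not a Sunday")
    _, iso_week, _ = sunday.isocalendar()
    return f"W{iso_week:02d}-{sunday.isoformat()}.md"


def weekly_log_path(sunday: date) -> str:
    """Place the log under weekly-logs/<year>/ (assumption: the ISO year)."""
    iso_year, _, _ = sunday.isocalendar()
    return f"weekly-logs/{iso_year}/{weekly_log_name(sunday)}"


print(weekly_log_path(date(2026, 5, 31)))  # weekly-logs/2026/W22-2026-05-31.md
```

Rejecting non-Sunday dates keeps the filename convention and the Sunday cadence from drifting apart.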

Why it exists

  • 3am you needs runbooks. Bad runbooks are how 3am pages take 4 hours.
  • Pattern recognition over time. Postmortems re-read 6 months later catch recurring patterns.
  • Onboarding artifact. Hand the repo to a new engineer; they get up to speed.
  • Compounds across years. The discipline accumulates; by Year 5 it’s a serious reference work.
  • Audit trail. Every architectural decision (Cilium over Calico, Iceberg over Delta, etc.) has an ADR with the trade-offs you considered.

Pattern it teaches

runbook-as-code: operational knowledge as structured, versioned, testable artifacts.

blameless-postmortem: the discipline of focusing on systems, not people.
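One way to make "structured, versioned, testable" concrete is a check that every runbook contains the sections its template requires. The sketch below is illustrative only: the section names in `REQUIRED_SECTIONS` are hypothetical (the plan does not specify the template), and it assumes sections are written as `## Title` headings.

```python
import re

# Hypothetical template sections; the plan does not list the real ones.
REQUIRED_SECTIONS = ("Symptoms", "Diagnosis", "Remediation", "Verification", "Rollback")

# Matches level-2 markdown headings only ("## Title"), not "#" or "###".
HEADING = re.compile(r"^##[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(markdown: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section titles that have no matching heading."""
    present = {m.group(1).lower() for m in HEADING.finditer(markdown)}
    return [title for title in required if title.lower() not in present]


sample = """# Postgres replication lag

## Symptoms
Replica lag alert fires.

## Diagnosis
Check pg_stat_replication on the primary.

## Remediation
Restart the stalled replica.
"""

print(missing_sections(sample))  # ['Verification', 'Rollback']
```

A check like this could run in CI on every commit, so an incomplete runbook fails review instead of failing at 3am.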


Scope

  • In: runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans, reading notes per phase, anti-pattern catalogs
  • Out: code (lives in basecamp / terralabs / etc.); ephemeral notes (use a scratchpad)

When built

Phase 1, Month 1. Initialized in the first week of ROOT. Every subsequent phase adds to it. Never “complete”: it’s a live artifact.


Deliverables (by year)

Year | Cumulative state
Y1 | ~30 runbooks (Linux, networking, databases, containers, K8s); 5+ postmortems; 50+ weekly logs; 1+ ADR; 1+ shipped contribution
Y2 | + AWS + GCP + IaC + Backstage + service mesh + security runbooks; multi-cloud postmortems; ~10 ADRs; 2+ contributions
Y3 | + observability + lakehouse + processing + serving + governance runbooks; data quality ADRs; 3+ contributions
Y4 | + ML lifecycle + serving + GPU + RAG + agents runbooks; AI security ADRs; 4+ contributions
Y5 | + AIOps + portal + capstone runbooks; final reflection postmortem on the program itself; 5+ contributions; 15+ ADRs
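Progress against these cumulative targets can be tracked by counting markdown files per top-level directory. The following is a rough sketch under the assumption that every artifact is a single `.md` file; the function name and the demo files are hypothetical, and the demo runs against a throwaway temporary directory rather than the real repo.

```python
import tempfile
from pathlib import Path

# Top-level artifact directories taken from the repo layout.
ARTIFACT_DIRS = ("runbooks", "incidents", "postmortems", "adrs", "weekly-logs")


def count_artifacts(repo_root: Path) -> dict[str, int]:
    """Count .md files (recursively) under each artifact directory."""
    counts = {}
    for name in ARTIFACT_DIRS:
        directory = repo_root / name
        # A directory that does not exist yet simply counts as zero.
        counts[name] = sum(1 for _ in directory.rglob("*.md")) if directory.is_dir() else 0
    return counts


# Demo against a temporary directory with two stub artifacts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "runbooks" / "data").mkdir(parents=True)
    (root / "runbooks" / "data" / "postgres-replication-lag.md").write_text("# stub\n")
    (root / "adrs").mkdir()
    (root / "adrs" / "0001-cilium-cni.md").write_text("# stub\n")
    demo_counts = count_artifacts(root)

print(demo_counts)
```

Comparing these counts with the yearly targets (for example, roughly 30 runbooks by the end of Y1) gives a quick signal of whether the handbook is keeping pace.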

Public vs private

  • Initially: PRIVATE (github.com/abukix/ops-handbook) — incidents may contain sensitive context
  • Year 5 capstone consideration: extract a sanitized version as a public reference, so the discipline becomes a teaching artifact (“a real engineer’s 5-year operational journal”). The public version would live at github.com/abukix/ops-handbook-public with internal context redacted.
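Redaction for the public version could start with simple pattern substitution before any manual review. The sketch below is only an illustration of that idea: the patterns, the `internal.example` domain, and the placeholder strings are all hypothetical, and a real sanitization pass would need a much broader rule set plus human review.

```python
import re

# Hypothetical redaction rules: (pattern, placeholder).
REDACTIONS = (
    # Anything shaped like a dotted-quad IPv4 address.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED-IP]"),
    # Hostnames under an assumed internal domain.
    (re.compile(r"\b[\w.-]+\.internal\.example\b"), "[REDACTED-HOST]"),
)


def sanitize(text: str) -> str:
    """Apply each redaction rule in order and return the scrubbed text."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


print(sanitize("Replica db-2.internal.example (10.0.4.17) fell behind."))
# Replica [REDACTED-HOST] ([REDACTED-IP]) fell behind.
```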

Cross-references


Success criteria

  • Sunday weekly log NEVER missed without an explicit “skipped this week because X” entry
  • Every SEV-1/2/3 incident gets a postmortem within 72 hours
  • Every architecture decision (Cilium over Calico, etc.) gets an ADR
  • Every runbook tested at least once by handing it to Claude in “play the runner” mode
  • By Y5: sanitized public version usable as reference by external engineers
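The first two criteria are mechanical enough to check with a script. A minimal sketch follows, with stated assumptions: the function names are hypothetical, a "skipped this week because X" entry counts as a logged week, and the 72-hour window is measured from incident resolution (the plan does not say whether the clock starts at detection or resolution).

```python
from datetime import date, datetime, timedelta

POSTMORTEM_WINDOW = timedelta(hours=72)


def missing_sundays(logged: set[date], first: date, last: date) -> list[date]:
    """List every Sunday from first to last (inclusive) with no log entry."""
    if first.isoweekday() != 7:
        raise ValueError("first must be a Sunday")
    missing = []
    day = first
    while day <= last:
        if day not in logged:
            missing.append(day)
        day += timedelta(weeks=1)
    return missing


def postmortem_on_time(resolved_at: datetime, written_at: datetime) -> bool:
    """True if the postmortem landed within 72 hours of resolution (assumed start)."""
    return written_at - resolved_at <= POSTMORTEM_WINDOW


logged = {date(2026, 5, 31), date(2026, 6, 14)}
print(missing_sundays(logged, date(2026, 5, 31), date(2026, 6, 21)))
print(postmortem_on_time(datetime(2026, 6, 1, 3, 0), datetime(2026, 6, 4, 3, 0)))
```

In this example the two Sundays without entries are 7 June and 21 June 2026, and a postmortem written exactly 72 hours after resolution still counts as on time.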