ops-handbook Plan

Personal platform knowledge base. Runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans. Every phase of ROOT adds to it. By Year 5 the goal is a public reference work other SREs can cite.

ops-handbook is the Group C craft artifact that runs in parallel with everything else in ROOT. It’s not a code project — it’s the journal of running the platform. Every runbook you write while debugging a Phase 3 Postgres replication issue, every postmortem after a Phase 7 etcd corruption, every ADR explaining why you picked Cilium over Calico, every Sunday weekly log entry — it all lives here.

The repo is initialized in Year 1 Phase 1 (Month 1, Week 1) and never “completes.” It’s the only artifact in ROOT that compounds across all 60 months without per-phase boundaries — by the time the program ends in Month 60, ops-handbook is ~140 runbooks, ~25 postmortems, ~250 weekly logs, and 15+ ADRs deep. It’s also the dataset that fuels notes-rag in Year 4 and services/aiops/ in Year 5 — the platform learning from its own operational history.

ops-handbook makes two patterns tangible: runbook-as-code and blameless-postmortem. Every artifact in here uses a structured template, is searchable, is AI-consumable, and survives 3am.

What it is

A version-controlled markdown repo holding every artifact ROOT generates during day-to-day operations. Not a wiki (wikis go stale). Not docs-as-prose (those don’t survive 3am). Structured templates, indexed by topic, searchable, AI-consumable.

ops-handbook/
├── runbooks/
│   ├── linux/
│   ├── networking/
│   ├── kubernetes/
│   ├── data/
│   ├── ml/
│   ├── platform/
│   └── agents/                 # Year 5
├── incidents/
│   ├── 2026/
│   │   ├── 2026-W22-postgres-replication-lag.md
│   │   └── ...
│   ├── 2027/
│   └── ...
├── postmortems/                # parallel structure to incidents/
├── adrs/
│   ├── 0001-cilium-cni.md
│   ├── 0002-iceberg-over-delta.md
│   ├── 0003-mcp-over-custom-rest.md
│   └── ...
├── weekly-logs/
│   ├── 2026/
│   │   ├── W22-2026-05-31.md
│   │   └── ...
│   ├── 2027/
│   └── ...
├── contributions/
│   ├── contribution-plan.md    # which OSS PRs you're targeting
│   └── shipped/                # the merged ones
├── notes/                      # phase-aligned reading notes (DDIA, Chip Huyen, etc.)
│   ├── ddia/
│   ├── designing-ml-systems/
│   └── ai-engineering/
└── README.md                   # the index of indexes
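The tree above implies two file-naming conventions: incidents as `YYYY-Www-slug.md` and weekly logs as `Www-YYYY-MM-DD.md`. A minimal Python sketch of generating both follows. It assumes the `W22` part is the ISO 8601 week number (the plan does not say which week numbering is used), and the function names `weekly_log_name` and `incident_name` are hypothetical helpers, not part of any existing tooling.

```python
from datetime import date
import re


def weekly_log_name(day: date) -> str:
    """Build a weekly-log file name such as W22-2026-05-31.md.

    Assumption: the week number is the ISO 8601 week of `day`.
    """
    iso_week = day.isocalendar()[1]
    return f"W{iso_week:02d}-{day.isoformat()}.md"


def incident_name(day: date, title: str) -> str:
    """Build an incident file name such as 2026-W22-postgres-replication-lag.md.

    Assumption: the year prefix is the ISO year, and the slug is the
    lowercased title with non-alphanumeric runs collapsed to hyphens.
    """
    iso_year, iso_week, _ = day.isocalendar()
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{iso_year}-W{iso_week:02d}-{slug}.md"
```

Under the ISO assumption, `weekly_log_name(date(2026, 5, 31))` reproduces the `W22-2026-05-31.md` example from the tree. Note that the ISO year can differ from the calendar year in the first and last days of a year, so the year directory a file lands in is a choice the real tooling would need to make explicitly.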

Why it exists

3am-you needs runbooks. Bad runbooks are how a 3am page turns into 4 hours of debugging.
Pattern recognition over time. Postmortems re-read 6 months later catch recurring patterns.
Onboarding artifact. Hand the repo to a new engineer; they get up to speed.
Compounds across years. The discipline accumulates; by Year 5 it’s a serious reference work.
Audit trail. Every architectural decision (Cilium over Calico, Iceberg over Delta, etc.) has an ADR with the trade-offs you considered.

Patterns it teaches

runbook-as-code: operational knowledge as structured, versioned, testable artifacts.

blameless-postmortem: the discipline of focusing on systems, not people.
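One way to make "testable" concrete for runbook-as-code is a check that every runbook contains the sections its template requires. The sketch below is illustrative only: the section names in `REQUIRED_SECTIONS` are hypothetical placeholders (the real list would come from the runbook template in meta/), and it assumes runbooks use markdown `#` headings.

```python
import re

# Hypothetical section headings; the real list lives in the runbook template.
REQUIRED_SECTIONS = ("Symptoms", "Diagnosis", "Remediation", "Verification")


def missing_sections(runbook_text: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required headings that do not appear in the runbook."""
    # Collect markdown heading titles, e.g. "## Symptoms" -> "symptoms".
    found = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", runbook_text, re.MULTILINE)
    }
    return [name for name in required if name.lower() not in found]
```

A check like this could run in CI over `runbooks/` and fail the build when any file returns a non-empty list, which keeps template drift from reaching the 3am reader.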

Scope

In: runbooks, incidents, postmortems, ADRs, weekly logs, contribution plans, reading notes per phase, anti-pattern catalogs
Out: code (lives in basecamp / terralabs / etc.); ephemeral notes (use a scratchpad)

When built

Phase 1, Month 1. Initialized in the first week of ROOT. Every subsequent phase adds to it. It is never “complete”; it’s a live artifact.

Deliverables (by year)

Year	Cumulative state
Y1	~30 runbooks (Linux, networking, databases, containers, K8s); 5+ postmortems; 50+ weekly logs; 1+ ADR; 1+ shipped contribution
Y2	+ AWS + GCP + IaC + Backstage + service mesh + security runbooks; multi-cloud postmortems; ~10 ADRs; 2+ contributions
Y3	+ observability + lakehouse + processing + serving + governance runbooks; data quality ADRs; 3+ contributions
Y4	+ ML lifecycle + serving + GPU + RAG + agents runbooks; AI security ADRs; 4+ contributions
Y5	+ AIOps + portal + capstone runbooks; final reflection postmortem on the program itself; 5+ contributions; 15+ ADRs

Public vs private

Initially: PRIVATE (github.com/abukix/ops-handbook) — incidents may contain sensitive context
Year 5 capstone consideration: extract a sanitized version as a public reference. The discipline becomes a teaching artifact (“a real engineer’s 5-year operational journal”). The public version lives at github.com/abukix/ops-handbook-public with internal context redacted.
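The sanitization step could start as a simple pattern-based pass before human review. The sketch below is a rough illustration, not a complete redaction policy: the two rules (IPv4 addresses and email addresses) and the placeholder strings are assumptions chosen for the example, and a real extraction would need a reviewed, project-specific rule list plus a manual read of every incident.

```python
import re

# Hypothetical redaction rules; these two patterns are examples only.
REDACTIONS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED-IP]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED-EMAIL]"),
]


def sanitize(text: str) -> str:
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```

Regex redaction only catches what it has a pattern for, so it works best as a first filter in front of a manual pass rather than as the only safeguard.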

Cross-references

Templates: meta/ (runbook-template, incident-template, postmortem-template, adr-template, weekly-log-template, pattern-template, blog-template — written in the meta phase)
Used in: every phase of ROOT
Patterns: runbook-as-code, blameless-postmortem
Master plan context: Master Plan — Group C: the craft
Year 1 context: Year 1 — what ships publicly
Brand context: Abukix Studio

Success criteria

Sunday weekly log NEVER missed without an explicit “skipped this week because X” entry
Every SEV-1/2/3 incident gets a postmortem within 72 hours
Every architecture decision (Cilium over Calico, etc.) gets an ADR
Every runbook tested at least once by handing it to Claude in “play the runner” mode
By Y5: sanitized public version usable as reference by external engineers
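Two of the criteria above are mechanical enough to check with a script. The following is a minimal sketch under stated assumptions: the 72-hour clock is assumed to start when the incident is opened (the plan does not say when it starts), a “skipped this week because X” entry counts as a logged Sunday, and both function names are hypothetical.

```python
from datetime import date, datetime, timedelta

POSTMORTEM_DEADLINE = timedelta(hours=72)


def postmortem_on_time(incident_opened: datetime, postmortem_written: datetime) -> bool:
    """True if the postmortem landed within 72 hours of the incident opening.

    Assumption: the deadline is measured from the incident's open time.
    """
    return postmortem_written - incident_opened <= POSTMORTEM_DEADLINE


def missing_sundays(logged: set[date], start: date, end: date) -> list[date]:
    """List every Sunday in [start, end] that has no entry in `logged`."""
    # Advance to the first Sunday on or after `start` (weekday(): Monday=0, Sunday=6).
    day = start + timedelta(days=(6 - start.weekday()) % 7)
    missing = []
    while day <= end:
        if day not in logged:
            missing.append(day)
        day += timedelta(days=7)
    return missing
```

The `logged` set would be built from the dates in the `weekly-logs/` file names; an empty result from `missing_sundays` means the weekly-log criterion holds for that range.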