# Runbook Template
Use this template for any recurring operational procedure that must be repeatable at 3am. Test each runbook by handing it to Claude in “play the runner” mode.
A runbook is the prescription, not the explanation. It exists so a tired version of you, paged out of bed, can recover the system without re-deriving the architecture. Theory belongs in the Pattern Library; operational discipline belongs here. The two cross-link, but they never blur.
This template is the canonical shape for any entry under ops-handbook/runbooks/. The shape matters: every runbook in ops-handbook shares the same skeleton so future-you doesn’t waste cognition on document layout when a real alert is firing. Trigger, prerequisites, goal, steps, rollback, troubleshooting — same order, every time.
If you find yourself wanting to add a “Background” section, you’re writing the wrong document. That’s a pattern entry or a phase doc. The runbook is the verb, not the noun.
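Because every runbook shares the same skeleton, the section order can be checked mechanically. The following is a minimal sketch, not an existing ops-handbook tool: `check_runbook_sections` is a hypothetical helper name, and it assumes each section is a level-2 (`## `) heading. It also matches `## ` lines inside fenced code blocks, so treat it as a rough lint.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ops-handbook): verify that a runbook
# contains the six skeleton sections, in canonical order, and that it
# has no "Background" section.
check_runbook_sections() {
  local file="$1"
  local expected="Trigger Prerequisites Goal Steps Rollback Troubleshooting"
  local headings found

  # Collect level-2 heading names, one per line ("## Trigger" -> "Trigger").
  headings="$(grep -E '^## ' "$file" | sed -E 's/^## +//; s/ +$//')"

  if printf '%s\n' "$headings" | grep -qix 'Background'; then
    echo "FAIL: $file has a Background section; that belongs in a pattern entry" >&2
    return 1
  fi

  # Keep only the skeleton headings, in the order they appear in the file.
  found="$(printf '%s\n' "$headings" \
    | grep -xE 'Trigger|Prerequisites|Goal|Steps|Rollback|Troubleshooting' \
    | tr '\n' ' ' | sed 's/ $//')"

  if [ "$found" != "$expected" ]; then
    echo "FAIL: $file has sections '$found', expected '$expected'" >&2
    return 1
  fi
  echo "OK: $file"
}

# Usage:
#   check_runbook_sections ops-handbook/runbooks/<category>/<slug>.md
```

Extra sections such as Related or History are ignored; only the six skeleton sections and their order are enforced.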
## Template
Copy this into ops-handbook/runbooks/{category}/{slug}.md:
````markdown
---
title: "Runbook: <verb> <thing>"
slug: <verb>-<thing>
tags: [runbook, <category>]
last-tested: YYYY-MM-DD
---

# Runbook: <verb> <thing>

> One-sentence description of what this runbook does.

## Trigger

When you'd reach for this runbook. Be specific:

- Alert: `PostgresConnectionsHigh` fires (>85% of `max_connections`)
- User report: "triage dashboard is timing out on every page load"
- Scheduled: weekly WAL archive verification, Sundays 09:00

If the trigger is vague ("Postgres seems slow"), the runbook will be reached for in the wrong situations. Force yourself to name the alert, the symptom, or the schedule.

## Prerequisites

- Access required: `kubeconfig` for the `basecamp-prod` cluster; `psql` access to `triage-db` via the bastion
- Tools installed locally: `kubectl >= 1.28`, `argocd >= 2.9`, `psql`
- State expected: cluster reachable; ArgoCD shows `triage` as `Synced`; no concurrent maintenance window

If a prerequisite is missing, stop here and resolve it; do not improvise into the steps below.

## Goal

What "done" looks like, as post-conditions you can verify:

- The `PostgresConnectionsHigh` alert has cleared in Prometheus
- `kubectl get pods -n triage` shows all replicas `Running` and `Ready`
- A test request to `https://triage.basecamp.local/healthz` returns `200` within 500ms

## Steps

1. **Verify the trigger.** Confirm the symptom matches before you act.

   ```bash
   $ kubectl get pods -n triage
   NAME       READY   STATUS    RESTARTS   AGE
   triage-0   0/1     Running   0          12m
   triage-1   0/1     Running   0          12m
   ```

   If pods are Ready 1/1, this isn't the right runbook; go to TROUBLESHOOTING.

2. **Investigate.** Look at the highest-signal source first.

   ```bash
   $ kubectl logs -n triage triage-0 --tail=50 | grep -i 'connection'
   FATAL: too many connections for role "triage"
   ```

3. **Take action.** The fix should be the smallest reversible change.

   ```bash
   $ kubectl scale deploy/triage -n triage --replicas=1
   deployment.apps/triage scaled
   ```

4. **Verify.** Re-check the goal post-conditions.

   ```bash
   $ kubectl get pods -n triage
   NAME       READY   STATUS    RESTARTS   AGE
   triage-0   1/1     Running   0          14m
   ```

5. **Communicate.** Post in `#incidents`: `triage` connection-pool exhaustion mitigated by scaling to 1 replica. Postmortem to follow within 72h.

## Rollback

If the fix made things worse (e.g., scaling down caused user-facing 503s), revert:

- `kubectl scale deploy/triage -n triage --replicas=3`
- Re-open the incident; this runbook didn't apply.

## Troubleshooting

If step N fails:

- Step 1 output shows `Pending` instead of `Running`: node pressure, not connection exhaustion. Switch to `node-pressure-eviction.md`.
- Step 3 returns `error: deployment not found`: namespace drift; check ArgoCD sync status before continuing.

## Related

- Postmortem: ../postmortems/2026-W23-triage-connection-pool.md
- Pattern: ../../patterns/foundations/connection-pooling.md
- Adjacent runbook: `restart-triage.md` (when the symptom is pod crash-loop, not connection exhaustion)

## History

- 2026-06-08: created by @abukix after incident 2026-W23
- 2026-06-09: tested by handing to Claude in "play the runner" mode; clarified step 3
- 2026-06-12: used during incident 2026-W24; worked end-to-end in 6 minutes
````
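The Goal post-conditions in the template's example are written so they can be verified, and the health-check one lends itself to a small script. This is a sketch under assumptions: `healthz_ok` is a hypothetical helper, the 0.5 second default mirrors the 500ms in the example, and the URL is the example's own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if URL answers 200 within BUDGET seconds.
# Uses curl's --write-out variables http_code and time_total.
healthz_ok() {
  local url="$1" budget="${2:-0.5}" result code secs

  # Prints e.g. "200 0.123456"; --max-time caps the request at 5 seconds.
  result="$(curl -s -o /dev/null -w '%{http_code} %{time_total}' --max-time 5 "$url")" || return 1
  code="${result%% *}"
  secs="${result##* }"

  # awk handles the fractional-seconds comparison.
  [ "$code" = "200" ] && awk -v t="$secs" -v b="$budget" 'BEGIN { exit !(t <= b) }'
}

# Usage:
#   healthz_ok https://triage.basecamp.local/healthz && echo "goal met"
```

A check like this fits the Verify step: it answers yes or no, so the runner does not have to interpret output at 3am.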
---
## How to write a good runbook
Three rules:
1. **3am-readable.** A tired you, half-awake, paged out of bed, can follow this without thinking. If a step requires reasoning, the reasoning belongs in the doc, not in your head.
2. **Idempotent steps.** Re-running step 3 shouldn't break things. If it does, redesign the step (use `apply` not `create`; check current state before mutating).
3. **Tested.** Hand it to Claude in "play the runner" mode (per the [AI Learning Protocol](/program/ai-learning-protocol/), this is a legitimate "validate-then-write" use of AI). They follow it; you watch. Where they get confused, fix the doc.
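As an illustration of rule 2, here is one way "check current state before mutating" could look for the scaling step in the template's example. `scale_to` is a hypothetical helper, not something the handbook defines. Scaling to the current replica count is already harmless with `kubectl scale`, so the check mainly makes a re-run visibly a no-op.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read the current replica count first and only
# call `kubectl scale` when the desired count differs.
scale_to() {
  local deploy="$1" namespace="$2" want="$3" have

  # Current desired replica count from the Deployment spec.
  have="$(kubectl get "deploy/$deploy" -n "$namespace" -o jsonpath='{.spec.replicas}')" || return 1

  if [ "$have" = "$want" ]; then
    echo "no-op: deploy/$deploy already at $want replica(s)"
    return 0
  fi

  kubectl scale "deploy/$deploy" -n "$namespace" --replicas="$want"
}

# Usage (assumes kubectl is already pointed at the right cluster):
#   scale_to triage triage 1
```

Running it twice in a row performs the change once and reports a no-op the second time, which is the behaviour a half-awake runner needs when unsure whether step 3 already ran.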
:::caution[The `last-tested` field is load-bearing]
A runbook whose `last-tested` date is more than 6 months old is suspect. Tools change, cluster topology drifts, the credentials path moves. If you're reading a stale runbook during an incident, you're reading fiction. Re-test on a cadence; the field is there to make staleness visible.
:::
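Staleness is easy to surface automatically. The sketch below assumes the frontmatter carries an unquoted `last-tested: YYYY-MM-DD` line, as in the template; `last_tested` and `is_stale` are hypothetical helper names. A runbook with no valid date, including one that still holds the `YYYY-MM-DD` placeholder, is treated as stale.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the runbook's last-tested date, or nothing
# if the field is missing or not a YYYY-MM-DD date.
last_tested() {
  sed -n 's/^last-tested:[[:space:]]*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*$/\1/p' "$1" | head -n 1
}

# is_stale FILE CUTOFF: exit 0 if FILE was last tested before CUTOFF
# (YYYY-MM-DD) or has no valid last-tested date.
is_stale() {
  local tested cutoff="$2"
  tested="$(last_tested "$1")"
  [ -n "$tested" ] || return 0
  # ISO dates compare correctly as integers once the dashes are removed.
  [ "$(printf '%s' "$tested" | tr -d '-')" -lt "$(printf '%s' "$cutoff" | tr -d '-')" ]
}

# Usage, assuming GNU date for the "6 months ago" arithmetic:
#   cutoff="$(date -d '6 months ago' +%F)"
#   for f in ops-handbook/runbooks/*/*.md; do
#     is_stale "$f" "$cutoff" && echo "STALE: $f"
#   done
```

Passing the cutoff as an argument keeps the date arithmetic out of the helper, so the same function works wherever the cutoff can be computed.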
---
## What NOT to put in a runbook
- Theory ("here's why Postgres has WAL"). The pattern library has that.
- "We should also..." action items. Those go in `ops-handbook/contributions/contribution-plan.md` or as a postmortem action item.
- Long debug narratives. Those belong in incident docs or postmortems.
A runbook is *prescription*, not *explanation*.
---
## Cross-reference
- Pattern: [runbook-as-code](../patterns/observability-and-ops/runbook-as-code.md)
- Related templates: [incident-template.md](incident-template.md) (during the event), [postmortem-template.md](postmortem-template.md) (after)
- Style guide: [doc-style-guide.md](doc-style-guide.md)
- Program context: [Master Plan](/program/overview/), [ops-handbook plan](/projects/ops-handbook/plan/)