
Runbook Template

Use for any recurring operational procedure that needs to be repeatable at 3am. Test by handing to Claude in “play the runner” mode.

A runbook is the prescription, not the explanation. It exists so a tired version of you, paged out of bed, can recover the system without re-deriving the architecture. Theory belongs in the Pattern Library; operational discipline belongs here. The two cross-link, but they never blur.

This template is the canonical shape for any entry under ops-handbook/runbooks/. The shape matters: every runbook in ops-handbook shares the same skeleton so future-you doesn’t waste cognition on document layout when a real alert is firing. Trigger, prerequisites, goal, steps, rollback, troubleshooting — same order, every time.

If you find yourself wanting to add a “Background” section, you’re writing the wrong document. That’s a pattern entry or a phase doc. The runbook is the verb, not the noun.
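Because the skeleton is fixed, the section order described above can be checked mechanically. The following is a minimal sketch, not an existing ops-handbook tool: the function name `check_skeleton` is hypothetical, and it assumes each section is a level-2 heading (`## Trigger`, `## Prerequisites`, and so on) as in the template below. Extra headings such as `## History` are tolerated.

```bash
# Hypothetical helper: succeed only if the required sections appear in the canonical order.
check_skeleton() {
  awk '
    BEGIN {
      n = split("Trigger Prerequisites Goal Steps Rollback Troubleshooting", want, " ")
      pos = 1
    }
    /^## / {
      name = substr($0, 4)               # drop the leading "## "
      sub(/[ \t\r]+$/, "", name)         # ignore trailing whitespace
      if (pos <= n && name == want[pos]) pos++
    }
    END {
      if (pos <= n) {
        printf "%s: missing or out-of-order section: %s\n", FILENAME, want[pos]
        exit 1
      }
    }
  ' "$1"
}

# Usage (the path layout is taken from the template instructions):
#   for f in ops-handbook/runbooks/*/*.md; do check_skeleton "$f"; done
```

The check only advances when it sees the next expected heading, so a runbook with `## Goal` before `## Prerequisites` is reported as missing `Goal` in its proper position.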


Template

Copy this into ops-handbook/runbooks/{category}/{slug}.md:

---
title: "Runbook: <verb> <thing>"
slug: <verb>-<thing>
tags: [runbook, <category>]
last-tested: YYYY-MM-DD
---
# Runbook: <verb> <thing>
> One-sentence description of what this runbook does.
## Trigger
When you'd reach for this runbook. Be specific:
- Alert: `PostgresConnectionsHigh` fires (>85% of `max_connections`)
- User report: "triage dashboard is timing out on every page load"
- Scheduled: weekly WAL archive verification, Sundays 09:00
If the trigger is vague ("Postgres seems slow"), the runbook will be reached for in the wrong situations. Force yourself to name the alert, the symptom, or the schedule.
## Prerequisites
- Access required: `kubeconfig` for the `basecamp-prod` cluster; `psql` access to `triage-db` via the bastion
- Tools installed locally: `kubectl >= 1.28`, `argocd >= 2.9`, `psql`
- State expected: cluster reachable; ArgoCD shows `triage` as `Synced`; no concurrent maintenance window
If a prerequisite is missing, stop here and resolve it — do not improvise into the steps below.
## Goal
What "done" looks like — the post-conditions you can verify:
- The `PostgresConnectionsHigh` alert has cleared in Prometheus
- `kubectl get pods -n triage` shows all replicas `Running` and `Ready`
- A test request to `https://triage.basecamp.local/healthz` returns `200` within 500ms
## Steps
1. **Verify the trigger.** Confirm the symptom matches before you act.
```bash
$ kubectl get pods -n triage
NAME       READY   STATUS    RESTARTS   AGE
triage-0   0/1     Running   0          12m
triage-1   0/1     Running   0          12m
```
If pods are `Ready` `1/1`, this isn't the right runbook; go to Troubleshooting.
2. **Investigate.** Look at the highest-signal source first.
```bash
$ kubectl logs -n triage triage-0 --tail=50 | grep -i 'connection'
FATAL: too many connections for role "triage"
```
3. **Take action.** The fix should be the smallest reversible change.
```bash
$ kubectl scale deploy/triage -n triage --replicas=1
deployment.apps/triage scaled
```
4. **Verify.** Re-check the goal post-conditions.
```bash
$ kubectl get pods -n triage
NAME       READY   STATUS    RESTARTS   AGE
triage-0   1/1     Running   0          14m
```
5. **Communicate.** Post in `#incidents`:
> triage connection-pool exhaustion mitigated by scaling to 1 replica. Postmortem to follow within 72h.

## Rollback
If the fix made things worse (e.g., scaling down caused user-facing 503s), revert:
1. `kubectl scale deploy/triage -n triage --replicas=3`
2. Re-open the incident; this runbook didn't apply.

## Troubleshooting
If step N fails:
- Step 1 output shows `Pending` instead of `Running`: node pressure, not connection exhaustion. Switch to `node-pressure-eviction.md`.
- Step 3 returns `error: deployment not found`: namespace drift; check ArgoCD sync status before continuing.

## History
- 2026-06-08: created by @abukix after incident 2026-W23
- 2026-06-09: tested by handing to Claude in "play the runner" mode; clarified step 3
- 2026-06-12: used during incident 2026-W24; worked end-to-end in 6 minutes
---
## How to write a good runbook
Three rules:
1. **3am-readable.** A tired you, half-awake, paged out of bed, can follow this without thinking. If a step requires reasoning, the reasoning belongs in the doc — not in your head.
2. **Idempotent steps.** Re-running step 3 shouldn't break things. If it does, redesign the step (use `apply` not `create`; check current state before mutating).
3. **Tested.** Hand it to Claude in "play the runner" mode (per the [AI Learning Protocol](/program/ai-learning-protocol/), this is a legitimate "validate-then-write" use of AI). They follow it; you watch. Where they get confused, fix the doc.
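As an illustration of rule 2, here is a minimal sketch of a "check current state before mutating" wrapper around the scale step used in the template. The function name `scale_to` is hypothetical, and it assumes `kubectl` is on the `PATH` and that the target is a Deployment, as in the template's example.

```bash
# Hypothetical helper: scale only when the current replica count differs from the target.
# $1 = namespace, $2 = deployment name, $3 = desired replica count
scale_to() {
  # Read the declared replica count; bail out if the deployment cannot be read.
  current="$(kubectl get "deploy/$2" -n "$1" -o jsonpath='{.spec.replicas}')" || return 1
  if [ "$current" = "$3" ]; then
    echo "deploy/$2 already at $3 replicas; nothing to do"
    return 0
  fi
  kubectl scale "deploy/$2" -n "$1" --replicas="$3"
}

# Usage, mirroring the template's step 3:
#   scale_to triage triage 1
```

Running it twice is safe: the second call reports that nothing needs to change instead of issuing another mutation, so a half-awake runner can repeat the step without side effects.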
:::caution[The `last-tested` field is load-bearing]
A runbook whose `last-tested` date is more than 6 months old is suspect. Tools change, cluster topology drifts, the credentials path moves. If you're reading a stale runbook during an incident, you're reading fiction. Re-test on a cadence; the field is there to make staleness visible.
:::
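One way to make staleness visible is to scan the front matter on a schedule. The sketch below is an assumption about how that could look, not an existing ops-handbook script: `last_tested` and `is_stale` are hypothetical names, and the usage comment assumes GNU `date` for the "6 months ago" arithmetic.

```bash
# Hypothetical helper: print the last-tested value from a runbook's front matter
# (the '---' ... '---' block that must start on the first line).
last_tested() {
  awk '
    NR == 1 && !/^---[ \t]*$/ { exit }           # no front matter at all
    /^---[ \t]*$/ { n++; if (n == 2) exit; next }
    n == 1 && /^last-tested:/ { print $2; exit }
  ' "$1"
}

# Hypothetical helper: succeed (exit 0) when the runbook should be treated as stale.
# $1 = last-tested value, $2 = cutoff date, both YYYY-MM-DD
is_stale() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) return 0 ;;  # missing, or still the YYYY-MM-DD placeholder
  esac
  # With the hyphens removed, ISO dates compare correctly as integers.
  [ "$(printf '%s' "$1" | sed 's/-//g')" -lt "$(printf '%s' "$2" | sed 's/-//g')" ]
}

# Usage (assumes GNU date):
#   cutoff="$(date -d '6 months ago' +%F)"
#   for f in ops-handbook/runbooks/*/*.md; do
#     is_stale "$(last_tested "$f")" "$cutoff" && echo "STALE: $f"
#   done
```

Treating a missing or placeholder date as stale is a deliberate choice in this sketch: a runbook that was never tested is at least as suspect as one tested long ago.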
---
## What NOT to put in a runbook
- Theory ("here's why Postgres has WAL"). The pattern library has that.
- "We should also..." action items. Those go in `ops-handbook/contributions/contribution-plan.md` or as a postmortem action item.
- Long debug narratives. Those belong in incident docs or postmortems.
A runbook is *prescription*, not *explanation*.
---
## Cross-reference
- Pattern: [runbook-as-code](../patterns/observability-and-ops/runbook-as-code.md)
- Related templates: [incident-template.md](incident-template.md) (during the event), [postmortem-template.md](postmortem-template.md) (after)
- Style guide: [doc-style-guide.md](doc-style-guide.md)
- Program context: [Master Plan](/program/overview/), [ops-handbook plan](/projects/ops-handbook/plan/)