# Runbook Template

Use this template for any recurring operational procedure that needs to be repeatable at 3am. Test each runbook by handing it to Claude in “play the runner” mode.

A runbook is the prescription, not the explanation. It exists so a tired version of you, paged out of bed, can recover the system without re-deriving the architecture. Theory belongs in the Pattern Library; operational discipline belongs here. The two cross-link, but they never blur.

This template is the canonical shape for any entry under ops-handbook/runbooks/. The shape matters: every runbook in ops-handbook shares the same skeleton so future-you doesn’t waste cognition on document layout when a real alert is firing. Trigger, prerequisites, goal, steps, rollback, troubleshooting — same order, every time.

If you find yourself wanting to add a “Background” section, you’re writing the wrong document. That’s a pattern entry or a phase doc. The runbook is the verb, not the noun.
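The shared skeleton is easy to check mechanically. The sketch below is one hypothetical way to lint a runbook file. It assumes each section is a level-2 heading (`## Trigger`, `## Prerequisites`, and so on); the function name `check_runbook_shape` is invented for illustration and is not an existing ops-handbook tool.

```bash
#!/usr/bin/env bash
# Sketch: check that a runbook's level-2 headings follow the shared skeleton.
# The function name and messages are illustrative assumptions.

check_runbook_shape() {
  local file="$1"
  local required=("Trigger" "Prerequisites" "Goal" "Steps" "Rollback" "Troubleshooting")
  local last=0 name line

  # A Background section means the content belongs in a pattern entry or phase doc.
  if grep -qi '^## Background' "$file"; then
    echo "found a Background section: move it to a pattern entry or phase doc"
    return 1
  fi

  for name in "${required[@]}"; do
    # Line number of the first matching heading; empty if the heading is absent.
    line=$(grep -n -m1 "^## ${name}[[:space:]]*\$" "$file" | cut -d: -f1 || true)
    if [ -z "$line" ]; then
      echo "missing section: ${name}"
      return 1
    fi
    # Each required heading must appear after the previous one.
    if [ "$line" -le "$last" ]; then
      echo "section out of order: ${name}"
      return 1
    fi
    last="$line"
  done
  echo "ok"
}
```

Extra sections after the required six (for example a history log) pass unchanged, because only the relative order of the required headings is checked.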

## Template

Copy this into ops-handbook/runbooks/{category}/{slug}.md:

---
title: "Runbook: <verb> <thing>"
slug: <verb>-<thing>
tags: [runbook, <category>]
last-tested: YYYY-MM-DD
---

# Runbook: <verb> <thing>

> One-sentence description of what this runbook does.

## Trigger

When you'd reach for this runbook. Be specific:
- Alert: `PostgresConnectionsHigh` fires (>85% of `max_connections`)
- User report: "triage dashboard is timing out on every page load"
- Scheduled: weekly WAL archive verification, Sundays 09:00

If the trigger is vague ("Postgres seems slow"), the runbook will be reached for in the wrong situations. Force yourself to name the alert, the symptom, or the schedule.

## Prerequisites

- Access required: `kubeconfig` for the `basecamp-prod` cluster; `psql` access to `triage-db` via the bastion
- Tools installed locally: `kubectl >= 1.28`, `argocd >= 2.9`, `psql`
- State expected: cluster reachable; ArgoCD shows `triage` as `Synced`; no concurrent maintenance window

If a prerequisite is missing, stop here and resolve it — do not improvise into the steps below.
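One way to make that stop mechanical is a short preflight check. This is a minimal sketch, assuming the three tools listed above; the `preflight` function is hypothetical and only checks that each CLI is on `PATH`, not its version or your cluster access.

```bash
#!/usr/bin/env bash
# Sketch: stop early if a required CLI is missing from PATH.
# Version checks (kubectl >= 1.28, argocd >= 2.9) are deliberately left out.

preflight() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing prerequisite: ${tool}"
      missing=1
    fi
  done
  # Non-zero exit if any tool was not found.
  return "$missing"
}

# Usage: preflight kubectl argocd psql || exit 1
```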

## Goal

What "done" looks like — the post-conditions you can verify:
- The `PostgresConnectionsHigh` alert has cleared in Prometheus
- `kubectl get pods -n triage` shows all replicas `Running` and `Ready`
- A test request to `https://triage.basecamp.local/healthz` returns `200` within 500ms
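These post-conditions are easier to verify when they are expressed as pass/fail checks. The sketch below covers only the health-check condition, assuming `curl` and `awk` are available; `meets_goal` and `check_healthz` are hypothetical helper names, and the 5-second `--max-time` cap is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch: turn the health-check post-condition into a pass/fail test.

# Pass only if the status is 200 and the total time is within the budget (seconds).
meets_goal() {
  local status="$1" seconds="$2" budget="${3:-0.5}"
  [ "$status" = "200" ] || return 1
  # awk does the floating-point comparison that bash arithmetic cannot.
  awk -v t="$seconds" -v b="$budget" 'BEGIN { exit !(t <= b) }'
}

# Fetch the status code and total time with curl, then apply meets_goal.
check_healthz() {
  local url="$1" result status seconds
  result=$(curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}' "$url") || return 1
  read -r status seconds <<<"$result"
  meets_goal "$status" "$seconds"
}

# Usage: check_healthz https://triage.basecamp.local/healthz && echo "goal met"
```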

## Steps

1. **Verify the trigger.** Confirm the symptom matches before you act.
   ```bash
   $ kubectl get pods -n triage
   NAME           READY   STATUS    RESTARTS   AGE
   triage-0       0/1     Running   0          12m
   triage-1       0/1     Running   0          12m
   ```
   If pods are Ready `1/1`, this isn't the right runbook. Go to Troubleshooting.

2. **Investigate.** Look at the highest-signal source first.
   ```bash
   $ kubectl logs -n triage triage-0 --tail=50 | grep -i 'connection'
   FATAL: too many connections for role "triage"
   ```

3. **Take action.** The fix should be the smallest reversible change.
   ```bash
   $ kubectl scale deploy/triage -n triage --replicas=1
   deployment.apps/triage scaled
   ```

4. **Verify.** Re-check the goal post-conditions.
   ```bash
   $ kubectl get pods -n triage
   NAME           READY   STATUS    RESTARTS   AGE
   triage-0       1/1     Running   0          14m
   ```

5. **Communicate.** Post in #incidents:
   > triage connection-pool exhaustion mitigated by scaling to 1 replica. Postmortem to follow within 72h.

## Rollback

If the fix made things worse (e.g., scaling down caused user-facing 503s), revert:

```bash
kubectl scale deploy/triage -n triage --replicas=3
```

Re-open the incident; this runbook didn't apply.

## Troubleshooting

If step N fails:

- Step 1 output shows `Pending` instead of `Running`: node pressure, not connection exhaustion. Switch to node-pressure-eviction.md.
- Step 3 returns `error: deployment not found`: namespace drift; check ArgoCD sync status before continuing.

## Related

- Postmortem: ../postmortems/2026-W23-triage-connection-pool.md
- Pattern: ../../patterns/foundations/connection-pooling.md
- Adjacent runbook: restart-triage.md (when the symptom is pod crash-loop, not connection exhaustion)

## History

- 2026-06-08: created by @abukix after incident 2026-W23
- 2026-06-09: tested by handing to Claude in "play the runner" mode; clarified step 3
- 2026-06-12: used during incident 2026-W24; worked end-to-end in 6 minutes

---

## How to write a good runbook

Three rules:

1. **3am-readable.** A tired you, half-awake, paged out of bed, can follow this without thinking. If a step requires reasoning, the reasoning belongs in the doc — not in your head.
2. **Idempotent steps.** Re-running step 3 shouldn't break things. If it does, redesign the step (use `apply` not `create`; check current state before mutating).
3. **Tested.** Hand it to Claude in "play the runner" mode (per the [AI Learning Protocol](/program/ai-learning-protocol/), this is a legitimate "validate-then-write" use of AI). They follow it; you watch. Where they get confused, fix the doc.
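Rule 2's advice to check current state before mutating might look like this in practice. This is a minimal sketch, assuming a deployment scaled with `kubectl`; the `scale_to` wrapper is hypothetical, and the names in the usage line are taken from the template's example.

```bash
#!/usr/bin/env bash
# Sketch: an idempotent scale step that reads current state before mutating.

scale_to() {
  local deploy="$1" namespace="$2" want="$3" have
  # Read the desired replica count currently set on the deployment.
  have=$(kubectl get "deploy/${deploy}" -n "$namespace" -o jsonpath='{.spec.replicas}') || return 1
  if [ "$have" = "$want" ]; then
    echo "already at ${want} replicas; nothing to do"
    return 0
  fi
  kubectl scale "deploy/${deploy}" -n "$namespace" --replicas="$want"
}

# Usage: scale_to triage triage 1
```

Running the step a second time prints the "nothing to do" message instead of issuing another mutation, which also tells a half-awake operator that the earlier attempt already took effect.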

:::caution[The `last-tested` field is load-bearing]
A runbook whose `last-tested` date is more than 6 months old is suspect. Tools change, cluster topology drifts, the credentials path moves. If you're reading a stale runbook during an incident, you're reading fiction. Re-test on a cadence; the field is there to make staleness visible.
:::
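Staleness can be made visible with a small check over the front matter. The sketch below is an assumption about how such a check might work, not an existing ops-handbook script: `last_tested` and `is_stale` are hypothetical names, and "6 months" is interpreted as six calendar months.

```bash
#!/usr/bin/env bash
# Sketch: flag a runbook whose last-tested date is more than six months old.

# Print the last-tested value from a runbook's front matter.
last_tested() {
  sed -n 's/^last-tested:[[:space:]]*//p' "$1" | head -n 1
}

# Exit 0 (stale) if $1 is more than six calendar months before $2 (default: today).
is_stale() {
  local tested="$1" today="${2:-$(date +%F)}" months
  # An unfilled or malformed date is treated as stale.
  if ! [[ "$tested" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
    echo "unparseable last-tested value: ${tested}" >&2
    return 0
  fi
  # 10# forces base 10 so "08" and "09" are not parsed as octal.
  months=$(( (10#${today:0:4} * 12 + 10#${today:5:2}) - (10#${tested:0:4} * 12 + 10#${tested:5:2}) ))
  if [ "$months" -gt 6 ]; then return 0; fi
  if [ "$months" -eq 6 ] && [ "$((10#${today:8:2}))" -gt "$((10#${tested:8:2}))" ]; then return 0; fi
  return 1
}

# Usage: is_stale "$(last_tested path/to/runbook.md)" && echo "stale: re-test before trusting"
```

Treating the unfilled `YYYY-MM-DD` placeholder as stale is a deliberate choice in this sketch: a runbook that was never tested should not pass the check.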

---

## What NOT to put in a runbook

- Theory ("here's why Postgres has WAL"). The pattern library has that.
- "We should also..." action items. Those go in `ops-handbook/contributions/contribution-plan.md` or as a postmortem action item.
- Long debug narratives. Those belong in incident docs or postmortems.

A runbook is *prescription*, not *explanation*.

---

## Cross-reference

- Pattern: [runbook-as-code](../patterns/observability-and-ops/runbook-as-code.md)
- Related templates: [incident-template.md](incident-template.md) (during the event), [postmortem-template.md](postmortem-template.md) (after)
- Style guide: [doc-style-guide.md](doc-style-guide.md)
- Program context: [Master Plan](/program/overview/), [ops-handbook plan](/projects/ops-handbook/plan/)