
Incident Template

Use during an active incident. Append-only: never edit history. The postmortem is written later from this raw record.

The incident doc is the contemporaneous record — the thing you write while the system is broken, not after. Its job is to preserve the timeline accurately, including the wrong hypotheses, so the postmortem can reconstruct what really happened. Memory degrades fast under stress; the doc is the antidote.

The append-only discipline is the core feature. Wrong hypotheses are data: they show what was plausible at the time, what evidence misled you, and where the gap between the system’s behavior and your model lived. Editing them out makes the postmortem look cleaner and the learning shallower.

This template lives upstream of the postmortem in the ops-handbook flow: incident doc during the event → postmortem within 72 hours → action items that spawn runbooks, ADRs, or pattern deepenings. Each step has its own template; the chain is the ROOT operational discipline made visible.


Template

Copy this into ops-handbook/incidents/{year}/{week}-{slug}.md at the start of the incident:

---
title: "Incident: <symptom>"
slug: YYYY-WXX-<short-slug>
tags: [incident, sev-X]
severity: 1 | 2 | 3 | 4
started: YYYY-MM-DD HH:MM (TZ)
ended: YYYY-MM-DD HH:MM (TZ) # filled in when resolved
status: open | resolved
---
# Incident: <one-line symptom>
## Summary
One sentence: what's broken, who's affected, when it started.
> "triage dashboard returning 503 on every request since 21:14; all internal users affected."
## Severity
- **SEV-1** — full outage, all users affected
- **SEV-2** — partial outage, a significant share of users affected
- **SEV-3** — degraded, small impact
- **SEV-4** — internal-only, no user impact
## Timeline
> Append entries with timestamps. NEVER edit past entries.
- **21:14** — Alert `TriageHighErrorRate` fired (>5% 5xx for 60s)
- **21:15** — Confirmed: triage dashboard at `/healthz` returns 503; error rate ~92%
- **21:16** — Hypothesis: deploy at 21:11 (triage v2.1) regressed something
- **21:19** — Action: rolled back deploy via `argocd app rollback triage`
- **21:24** — Observation: error rate flat at high level; rollback didn't help. Hypothesis wrong.
- **21:31** — New hypothesis: Postgres connection pool exhausted (saw `FATAL: too many connections` in triage logs)
- **21:36** — Action: scaled triage from 5 to 1 replica (`kubectl scale deploy/triage --replicas=1`)
- **21:42** — Observation: error rate dropping; ~40% now
- **21:52** — Resolved: error rate back to baseline; users no longer affected
- **21:55** — Status: incident closed; postmortem due 2026-06-11 (72h)
## Hypothesis log
> Append-only. Each hypothesis: stated, action taken, result observed.
| Time | Hypothesis | Action | Result |
|---|---|---|---|
| 21:16 | recent deploy regressed | rollback | no change — hypothesis wrong |
| 21:31 | Postgres pool exhausted | scale triage down to 1 | error rate dropped, then returned to baseline |
## Action items (during incident)
- Mitigations applied (in timeline above): rollback (no effect), scale-down (worked)
- Things that need follow-up (handed to postmortem):
  - Why did rollback not show any effect? (deploy wasn't the cause)
  - Where is the replicas-vs-connection-pool relationship documented? (nowhere — that's a gap)
## Communication
- Slack channel: `#incidents`
- Status page updated: 21:18 ("triage degraded; investigating")
- User-facing message: "triage is currently unavailable; we're investigating. Updates in #incidents."
- Resolved message at 21:55: "triage is back; postmortem will follow within 72h."
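
Outside the template itself, the file path and the `slug` field can be derived from the incident's start date. The following is a minimal Python sketch, not part of the handbook: it assumes `{week}` means the zero-padded ISO week written as `WXX` and `{year}` means the ISO year. The function names `incident_path` and `incident_slug` and the short slug `triage-503` are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def incident_path(started: date, short_slug: str) -> PurePosixPath:
    """Build ops-handbook/incidents/{year}/{week}-{slug}.md for a start date.

    Assumption: {week} is the ISO week number, zero-padded and prefixed with "W".
    """
    iso_year, iso_week, _ = started.isocalendar()
    filename = f"W{iso_week:02d}-{short_slug}.md"
    return PurePosixPath("ops-handbook/incidents") / str(iso_year) / filename


def incident_slug(started: date, short_slug: str) -> str:
    """Build the front-matter slug in the YYYY-WXX-<short-slug> shape."""
    iso_year, iso_week, _ = started.isocalendar()
    return f"{iso_year}-W{iso_week:02d}-{short_slug}"


# Hypothetical incident that started on 2026-06-08.
print(incident_path(date(2026, 6, 8), "triage-503"))
print(incident_slug(date(2026, 6, 8), "triage-503"))
```

Using the ISO year rather than the calendar year keeps the directory and the week number consistent for incidents that fall in the first or last days of January and December.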

How to write a good incident doc

Three rules:

  1. Append-only. Once an entry is written, you don’t edit it. Even if you were wrong. The wrong hypothesis is data — it’s what makes the postmortem’s “what went poorly” section land instead of feeling like fiction.
  2. Timestamps on everything. Future-you will reconstruct the timeline from this. A bullet without a timestamp is useless to the postmortem.
  3. Short entries. This is a real-time log, not an essay. Save the narrative for the postmortem; here you want raw observation density.
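
The first two rules are mechanical enough to check with a script. The sketch below is one possible approach, assuming the timeline section is available as plain text and that its bullets follow the template's `- **HH:MM**` shape. The helper names `format_entry`, `is_append_only`, and `untimestamped` are hypothetical.

```python
import re
from datetime import datetime

# A timeline bullet starts with "- **HH:MM** " as in the template.
ENTRY_RE = re.compile(r"^- \*\*\d{2}:\d{2}\*\* ")


def format_entry(now: datetime, text: str) -> str:
    """Render one timeline bullet; \u2014 is the dash separator the template uses."""
    return f"- **{now:%H:%M}** \u2014 {text}"


def is_append_only(old: str, new: str) -> bool:
    """True if every line of `old` survives unchanged, in order, at the start of `new`."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    return new_lines[: len(old_lines)] == old_lines


def untimestamped(timeline: str) -> list[str]:
    """Return bullets that lack a leading **HH:MM** timestamp."""
    return [
        line
        for line in timeline.splitlines()
        if line.startswith("- ") and not ENTRY_RE.match(line)
    ]


before = format_entry(datetime(2026, 6, 8, 21, 14), "Alert fired")
after = before + "\n" + format_entry(datetime(2026, 6, 8, 21, 15), "Confirmed 503")
print(is_append_only(before, after))             # True: a line was only added
print(is_append_only(before, before + " (fixed)"))  # False: history was edited
```

A check like `is_append_only` could compare the previous saved version of the timeline with the current one, for example in a pre-commit hook, though where to run it is a design choice this sketch leaves open.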

When to start an incident doc

Three triggers:

  1. SEV-1 or SEV-2 alerts fire
  2. A user reports something significant is broken
  3. You’re about to spend more than 30 minutes investigating something

The cost of starting the doc unnecessarily is low. The cost of NOT having one when you wish you did is high. Default to opening.
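
The three triggers combine with a plain "any of these" rule, which the following sketch makes explicit. The function name `should_open_incident_doc` and its parameter names are hypothetical; only the thresholds (SEV-1 or SEV-2, more than 30 minutes) come from the list above.

```python
from typing import Optional


def should_open_incident_doc(
    severity: Optional[int] = None,
    user_reported_breakage: bool = False,
    expected_investigation_minutes: int = 0,
) -> bool:
    """Return True if any one of the three triggers applies."""
    sev_trigger = severity in (1, 2)                     # SEV-1 or SEV-2 alert fired
    time_trigger = expected_investigation_minutes > 30   # long investigation ahead
    return sev_trigger or user_reported_breakage or time_trigger


print(should_open_incident_doc(severity=2))                        # True
print(should_open_incident_doc(expected_investigation_minutes=45))  # True
print(should_open_incident_doc(severity=3))                        # False
```

The rule deliberately errs toward opening a doc: a single trigger is enough, and nothing in it requires certainty about the cause.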


When to close

Close when:

  • The user-visible symptom is gone (not necessarily root-caused — that’s the postmortem)
  • The system is stable
  • You’ve documented enough that the postmortem can reconstruct the timeline

The postmortem is due within 72 hours. Set a calendar reminder as part of closing the incident doc — that’s the cheapest moment to commit the time, before the urgency fades.
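
The due date for that reminder is simple date arithmetic. The sketch below assumes the 72 hours are counted from the moment the incident is closed, which the handbook does not state explicitly. The close time of 2026-06-08 21:55 is inferred from the template's example, where the incident closes at 21:55 with the postmortem due 2026-06-11. The name `postmortem_due` is hypothetical.

```python
from datetime import datetime, timedelta

POSTMORTEM_WINDOW = timedelta(hours=72)


def postmortem_due(closed_at: datetime) -> datetime:
    """Deadline for the postmortem, counted from the close time (an assumption)."""
    return closed_at + POSTMORTEM_WINDOW


closed = datetime(2026, 6, 8, 21, 55)
print(postmortem_due(closed).strftime("%Y-%m-%d %H:%M"))  # 2026-06-11 21:55
```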


Cross-reference