Postmortem Template

Use after an incident is resolved. Blameless. Focus on systems, not people. Due within 72 hours.

A postmortem is the artifact that turns an incident from a bad night into a permanent reduction in failure rate. It’s not a confession; it’s a system analysis. Done well, it surfaces the implicit assumption or missing guardrail that enabled the failure — and the action items become the changes that prevent the next instance.

The 72-hour deadline is real. Memory degrades fast: the timeline you can reconstruct on Tuesday morning is sharper than the one you can reconstruct the following weekend. The incident doc you wrote during the event is the source material; the postmortem is the synthesis on top of it.

The hardest discipline in postmortems is staying blameless under self-pressure. When you operate the platform alone (as in ROOT), every “human error” framing is pointing at yourself — and the temptation to write “I was sloppy” instead of “the deploy doc didn’t mention the connection-pool ceiling” is enormous. Resist. The system view is the only view that produces durable improvements.

Template

Copy into ops-handbook/postmortems/{year}/{week}-{slug}.md:

---
title: "Postmortem: <symptom>"
slug: YYYY-WXX-<short-slug>
tags: [postmortem, sev-X]
incident: ../incidents/YYYY-WXX-<slug>.md
severity: 1 | 2 | 3 | 4
duration: <H>h <M>m
---

# Postmortem: <one-line symptom>

## Summary

What broke, why it broke, what changed to fix it. 2-3 sentences, no jargon. Concrete shape:

> triage was unable to serve traffic for 38 minutes after a deploy bumped replicas from 3 to 5, exhausting the Postgres connection pool. Mitigated by scaling triage back to 1 replica; permanent fix is a CI capacity check.

## Impact

- Who was affected: all triage dashboard users (homelab-only; ~3 internal sessions during the window)
- For how long: 2026-06-08 21:14 → 21:52 (38m)
- What they couldn't do: load any triage page; saw 503s after a 30s timeout
- Quantitative: ~92% of requests during the window failed; 0 data loss; 0 corrupted writes

## Timeline

(Reconstructed from the incident doc — pull the key entries.)

| Time | Event |
|---|---|
| 21:11 | Deploy of triage v2.1 (replicas 3 → 5) |
| 21:14 | Alert `TriageHighErrorRate` fires |
| 21:16 | Hypothesis: recent deploy regressed |
| 21:19 | Action: rollback (no effect) |
| 21:31 | Hypothesis: Postgres connection pool exhausted |
| 21:36 | Action: scaled triage replicas down to 1 |
| 21:52 | Recovery: error rate returns to baseline |

## Root cause (5-whys)

1. **Why did the alert fire?** Triage error rate exceeded 5%.
2. **Why did the error rate spike?** Triage pods couldn't connect to Postgres.
3. **Why couldn't they connect?** Connection pool exhausted (max_connections=100, triage opened ~120).
4. **Why did triage open 120 connections?** Recent deploy bumped replicas from 3 to 5; default pool size per replica is 25; 5 × 25 = 125 > 100.
5. **Why didn't the deploy catch this?** No connection-pool capacity check in CI; the relationship between replicas and pool size was implicit, not encoded.

**Root cause:** missing capacity check; the replicas-vs-pool-size relationship is implicit knowledge instead of an enforced invariant.

## What went well

- Detection time: 90s from deploy completion to page (Prometheus alert rule was tuned right)
- Communication: incident channel populated within 5 min; status doc started immediately
- Rollback was attempted promptly even though it didn't help — fast hypothesis testing

## What went poorly

- First hypothesis (recent deploy regressed) was wrong + wasted 15 minutes
- The connection-pool relationship to replica count wasn't documented anywhere
- Postgres client logs weren't checked until 25 minutes in — should have been step 2

## Action items

> Each action item: assignee, deadline, link to issue. No vague aspirations.

- [ ] Add a CI check: `replicas × pool_size <= max_connections - safety_margin` (Owner: @abukix, Due: 2026-06-15, Issue: basecamp#412)
- [ ] Add a runbook entry: "Postgres connection pool exhausted" (Owner: @abukix, Due: 2026-06-12, Issue: ops-handbook#88)
- [ ] Add the relationship to `ops-handbook/architecture/triage-postgres.md` (Owner: @abukix, Due: 2026-06-15)
- [ ] Configure HPA to respect connection-pool ceiling (Owner: @abukix, Due: 2026-06-22, Issue: basecamp#413)

## What we learned

- The implicit replicas × pool relationship is the kind of trap that hits at the worst time. Make implicit invariants explicit, or expect them to break.
- Always check the database client first when "service not responding" alerts fire — the database was healthy; the *connections* weren't, and that's a different log line.
- The pattern library entry on `caching` mentions cache stampedes; the analogous "connection stampede" pattern is worth promoting from STUB.

## Anti-pattern check

- "Human error" framing avoided? ✅ (the deploy was correct given existing docs; the system enabled the misconfiguration)
- Action items concrete + assigned + due-dated? ✅
- Action items prevent recurrence (not just patch the symptom)? ✅
- 5-whys reaches a *system* root cause, not a person? ✅

## Cross-references

- Incident: [../incidents/2026-W23-triage-postgres-connections.md](#)
- Pattern: [../../patterns/foundations/caching.md](#) (analogous stampede pattern)
- Runbook: [../runbooks/data/postgres-connection-pool-exhausted.md](#) (created by action item)
- ADR: [../adrs/0008-hpa-connection-pool-aware.md](#) (if the HPA fix needs an architectural decision)

How to write a good postmortem

Five rules:

Blameless. “Human error” is not a root cause; it’s a symptom of a system that enabled the error. Push back on yourself if you write it. The AI Learning Protocol tells AI to push back on shallow conclusions — apply the same rule to your own first draft.
5-whys to a system root cause. “Because Alice forgot” is not a system. “Because the deploy doc didn’t mention X and CI didn’t catch X” is.
Action items must be concrete + assigned + due-dated. “We should improve monitoring” is not an action item.
What went well matters. Postmortems aren’t only about failure; capture what helped recovery so you do more of it next time.
Read it 6 months later. That re-read is where pattern recognition happens — the same root cause showing up across three postmortems is the signal that produces a pattern entry at OUTLINE depth.

Anti-patterns

Anti-pattern	Why
”Human error” as root cause	Vacuous; pushes back on the discipline. The system enabled the error.
Action items without owners or dates	Won’t get done; pretend-fixes
Long narrative chronology	The timeline table + 5-whys is enough; skip the prose
”We’ll add monitoring” without specifying which alert	Never gets done — name the metric, the threshold, the alert
Skipping “what went well”	Loses the chance to reinforce the things that worked

Cross-reference

Pattern: blameless-postmortem
Came from: incident-template.md
Often spawns: adr-template.md (when an action item needs an architectural decision), runbook-template.md (when an action item is a new operational procedure)
Program context: Master Plan, ops-handbook plan