Skip to content
STUB

Blameless Postmortem

The pattern: after an incident, document what happened with focus on the system, not the operator. Timeline, root cause via 5-whys (or similar), what went well, what went poorly, action items. The discipline: ban “human error” framings; ask “what made the human error possible?” The system enabled the mistake.

The trade-off: comfort vs. honesty. Blameless culture is hard to maintain — the temptation to blame an individual is real, especially under pressure. Without blamelessness, postmortems become political theatre and engineers stop writing them. The framework forces structural fixes (process, tooling, monitoring) over individual scapegoating.

First touched in Year 1 Phase 7: Kubernetes + GitOps — the first real K3s incident produces the first real postmortem. Deepens through Year 3 Phase 14: Observability, where ~15 postmortems in ops-handbook make the discipline muscle memory.

  • runbook-as-code — postmortem action items often land as new runbooks.
  • sli-slo-error-budget — burned error budget is what triggers many postmortems.
  • three-pillars-and-unified-telemetry — the timeline reconstruction depends on logs, metrics, traces correlating.
  • fault-isolation — most “what made this possible” answers point at missing isolation boundaries.
  • pulse — the probe scanner whose alerts are the cold-start trigger for many of these writeups.