SLI/SLO and Error Budget
The pattern: a measurable platform contract. SLI (Service Level Indicator): a number you measure (e.g., “% of requests succeeding”). SLO (Service Level Objective): your target for the SLI (e.g., “99.5% over 30 days”). Error budget: the amount of unreliability the SLO permits (0.5% in this example). You can spend it on risky changes; once it is gone, you stop shipping them and improve reliability.
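The arithmetic behind these three terms can be sketched in a few lines. This is a minimal illustration, assuming a request-based SLI; the names `Slo`, `sli`, and `budget_consumed` are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a target ratio over a rolling window."""
    target: float      # e.g. 0.995 for "99.5%"
    window_days: int   # e.g. 30


def sli(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded (1.0 when there is no traffic)."""
    return 1.0 if total == 0 else good / total


def error_budget(slo: Slo) -> float:
    """Error budget: the fraction of requests the SLO allows to fail."""
    return 1.0 - slo.target


def budget_consumed(good: int, total: int, slo: Slo) -> float:
    """Share of the error budget already spent (1.0 means exhausted)."""
    if total == 0:
        return 0.0
    allowed_bad = error_budget(slo) * total
    actual_bad = total - good
    return actual_bad / allowed_bad


if __name__ == "__main__":
    slo = Slo(target=0.995, window_days=30)
    # Assumed traffic figures, for illustration only.
    good, total = 996_000, 1_000_000
    print(f"SLI: {sli(good, total):.3%}")
    print(f"Budget consumed: {budget_consumed(good, total, slo):.0%}")
```

With the assumed figures, 4,000 failed requests against an allowance of 5,000 means roughly 80% of the budget is spent while the SLO itself is still being met.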
The trade-off: iteration speed vs. reliability. Without SLOs, the org argues based on vibes (“this is unreliable”). With SLOs, the argument rests on data (“we burned 80% of the error budget this week, so pause new features”). The discipline forces realistic reliability targets and turns reliability into an explicit product trade-off.
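A sketch of how that data-driven argument could be turned into a release gate. The burn-rate definition (observed error rate divided by the rate the SLO allows) is the common one; the function names, the 0.8 pause threshold, and the decision labels are assumptions made for illustration, not a prescribed policy.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 spends it exactly over the window."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / budget


def days_to_exhaustion(rate: float, window_days: int,
                       budget_remaining: float = 1.0) -> float:
    """Days until the remaining budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * budget_remaining / rate


def release_decision(consumed: float, pause_at: float = 0.8) -> str:
    """Hypothetical policy: map the consumed budget share to a shipping decision."""
    if consumed >= 1.0:
        return "freeze"           # budget exhausted: reliability work only
    if consumed >= pause_at:
        return "pause-features"   # assumed threshold, mirrors the 80% example
    return "ship"


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.01, slo_target=0.995)
    print(f"burn rate: {rate:.1f}x")
    print(f"days left at this rate: {days_to_exhaustion(rate, 30):.0f}")
    print(release_decision(0.8))
```

Keeping the threshold as a parameter reflects that where to pause is a product decision made per service, not a constant.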
Deepens in Year 2 Phase 12: Platform Engineering, pulled forward from Year 3 because SLO discipline IS platform engineering, not a tooling concern. The instrumentation that feeds SLIs is built in Year 1 Phase 7: Kubernetes + GitOps and matures in Year 3 Phase 14: Observability. Burn-rate alerts get their runbooks in ops-handbook; the synthetic probes that produce many SLIs come from pulse.
Related patterns
- cardinality-as-cost — SLI metrics must stay aggregatable; cardinality is the budget that funds them.
- three-pillars-and-unified-telemetry — metrics for SLIs, traces for diagnosis, logs for evidence.
- runbook-as-code — every burn-rate alert needs a runbook attached.
- blameless-postmortem — exhausting an error budget is a structural signal, not a person’s failure.
- load-balancing — health-checked routing is where availability SLIs are physically enforced.