Reliability Engineering

Engineered resilience. Assume failure; plan for it; practice for it. Chaos engineering, DR drills, backup verification, runbook discipline. The SRE-book pillar.

Failures happen. The question is whether your system was designed for them. Chaos experiments, DR drills, backup verification, and runbooks should all be routine, not aspirational.

Status: STUB, to be promoted to OUTLINE in Y3 Phase 30.

What this pattern is

Reliability engineering is the discipline of designing for failure. Its building blocks: chaos engineering (deliberately injecting failure in production-like environments to verify resilience — Netflix Chaos Monkey is the founding artifact); DR drills (rehearsing recovery from regional outage, lost cluster, deleted database); backup verification (not just “we have backups” — proven restoration on a schedule); runbook discipline (every recurring failure has a runbook in the ops-handbook); postmortems (blameless, system-level analysis after every real incident); error budgets as the planning input for risk-bearing work.

The pattern shows most clearly in what a team practices. A team that says “we have DR” but has never actually run a DR drill has hope, not engineering. A team that says “we have backups” but has never restored one has hope, not engineering. A team that says “our system is chaos-proof” but has never run chaos experiments has hope, not engineering. Real reliability engineering is what you can prove works because you’ve watched it work under simulated failure.
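One way to make "we have restored one" concrete is a scheduled restore drill that restores a backup into a scratch location, checks its integrity, and records how long the restore took. The sketch below is a minimal illustration under stated assumptions: `DrillResult`, `run_restore_drill`, `copy_restore`, and the `rto_seconds` threshold are hypothetical names, and a plain file copy stands in for a real restore procedure.

```python
import hashlib
import shutil
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class DrillResult:
    restored_ok: bool   # restored data matched the expected checksum
    seconds: float      # measured time-to-restore
    within_rto: bool    # restore was correct AND finished inside the target


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_restore(backup: Path) -> Callable[[Path], Path]:
    """Hypothetical stand-in for a real restore: copy the backup file."""
    def _restore(scratch: Path) -> Path:
        return Path(shutil.copy(backup, scratch / backup.name))
    return _restore


def run_restore_drill(
    restore: Callable[[Path], Path],
    expected_sha256: str,
    rto_seconds: float,
) -> DrillResult:
    """Restore into a scratch directory, verify integrity, and time the run."""
    with tempfile.TemporaryDirectory() as scratch:
        started = time.monotonic()
        restored_file = restore(Path(scratch))
        elapsed = time.monotonic() - started
        ok = sha256_of(restored_file) == expected_sha256
    return DrillResult(ok, elapsed, ok and elapsed <= rto_seconds)
```

In a real drill the `restore` callable would wrap whatever backup tool is in use, and the `DrillResult` would be recorded after every run so that time-to-restore can be tracked over time.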

The mindset shift is treating failure as a first-class concern. Every design decision includes “what happens when this fails?” Every deploy includes “what’s the rollback plan?” Every runbook includes “what if this step fails?” Teams that internalize this mindset produce systems that survive things other teams’ systems don’t. Teams that treat failure as an exception produce systems that fail predictably when the exception happens.

The pattern composes with the other observability-and-ops patterns. SLI/SLO/error-budget provides the measurement framework; error budget spent on chaos experiments is deliberate risk-taking. Disaster recovery is a specific subset of reliability engineering. The three-pillars pattern (logs, metrics, traces) provides the observability substrate that makes chaos experiment outcomes visible.
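The error-budget arithmetic behind that kind of deliberate risk-taking is small enough to sketch. The function names, the 30-day window, and the 99.9% target used here are illustrative assumptions, not values prescribed by this entry.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def can_afford(
    experiment_minutes: float,
    spent_minutes: float,
    slo: float,
    window_days: int = 30,
) -> bool:
    """True if an experiment's worst-case cost fits in the remaining budget."""
    remaining = error_budget_minutes(slo, window_days) - spent_minutes
    return experiment_minutes <= remaining
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of unavailability. If 20 of those minutes are already spent, an experiment with a worst case of 10 minutes fits in the budget and one with a worst case of 30 minutes does not.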

Concrete instances in the wild

Netflix’s chaos engineering discipline. Chaos Monkey (terminate random instances), Chaos Kong (simulate the loss of an entire region), Latency Monkey (inject latency), and the rest of the Simian Army. Public engineering blogs describe the practice at hyperscale.
Google SRE’s error budget policy. Codified in the SRE book. Error budget spent on planned risk (deploys, experiments). Budget exhausted triggers a freeze.
Amazon’s Game Days. Structured chaos exercises where teams simulate outages and rehearse responses. Multi-hour events with executive attention.
basecamp’s DR drills (Y3 Phase 30+). Scheduled quarterly. Destroy an etcd replica; restore from backup; measure time-to-recover. Document in ops-handbook.
Netflix’s Failure Injection Testing (FIT). Request-scoped fault injection in production with tight blast-radius controls. The Netflix tech blog describes the methodology.
Etsy’s Blameless Post-Mortems post. John Allspaw’s canonical blog post that changed how the industry approaches postmortems. Free.
Gremlin. SaaS chaos-engineering platform. Turnkey experiments across cloud, K8s, and application layers.
LitmusChaos. OSS Kubernetes-native chaos engineering. CRD-based experiments; runs on any K8s cluster.
Chaos Mesh. Another OSS K8s chaos framework from PingCAP. Similar shape to Litmus.

Why this pattern matters

Systems that survive real failure share a common trait: they’ve already survived simulated failure. Systems that fall over during an outage share the opposite trait: nobody practiced the failure mode in advance. The gap between the two is reliability engineering as a discipline versus reliability as aspiration.

Many major outage postmortems since 2015 include a variant of “we didn’t test this scenario.” The 2017 AWS S3 outage, which disrupted a large share of the web: an operator ran a command that removed more servers than intended. Post-incident: safeguards on the capacity-removal tool, plus the finding that key subsystems had not been fully restarted in years, which slowed recovery. The 2021 Fastly outage: a customer configuration change triggered a latent software bug. Post-incident: better testing coverage for edge cases. In both cases the failure path had never been exercised before it fired in production.

Chaos engineering is the practice that closes the gap. If you don’t know how your system fails, chaos experiments teach you. If you think you know but haven’t verified, chaos experiments confirm or contradict your model. If you’ve never run chaos experiments, your confidence in your system’s failure modes is essentially a belief system. The discipline turns beliefs into knowledge.
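The confirm-or-contradict loop described above usually comes down to four steps: check steady state, inject one fault, check steady state again, and always roll back. The following is a minimal in-memory sketch. `ReplicaSet`, `run_experiment`, and the "at least two healthy replicas" hypothesis are hypothetical stand-ins for a real system and a real fault injector.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReplicaSet:
    """Toy system under test: named replicas, some of which may be down."""
    replicas: set[str]
    down: set[str] = field(default_factory=set)

    def healthy(self) -> int:
        return len(self.replicas - self.down)


@dataclass
class ExperimentResult:
    ran: bool               # False if steady state was violated before injection
    hypothesis_held: bool   # True if steady state survived the fault


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Verify steady state, inject one fault, re-verify, and always roll back."""
    if not steady_state():
        # Never add failure to a system that is already unhealthy.
        return ExperimentResult(ran=False, hypothesis_held=False)
    try:
        inject()
        held = steady_state()
    finally:
        rollback()  # runs even if injection or the check raises
    return ExperimentResult(ran=True, hypothesis_held=held)


# Hypothesis: the service stays healthy (>= 2 replicas) if one replica dies.
system = ReplicaSet(replicas={"a", "b", "c"})
result = run_experiment(
    steady_state=lambda: system.healthy() >= 2,
    inject=lambda: system.down.add("a"),
    rollback=lambda: system.down.discard("a"),
)
```

With three replicas the hypothesis holds. The same experiment against a two-replica system contradicts it, which is exactly the kind of finding the experiment exists to surface.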

The pattern also produces operational readiness. Teams that practice DR drills know what to do when disaster strikes. Teams that practice runbooks know what to do when specific failures happen. Teams that practice incident response know their own procedures. Practice reveals problems in the procedures themselves: the runbook that references a decommissioned tool, the DR script that fails because a dependency was upgraded, the escalation contact who left the company. Every drill is a chance to fix these before the real incident.
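Some of this procedural rot can be caught between drills by checking runbooks mechanically. The sketch below assumes a hypothetical convention in which runbook commands appear on lines starting with `$ `, and reports any command whose executable cannot be found. The `missing_tools` name and the `lookup` parameter are illustrative. By default the check uses `shutil.which` against the PATH of the machine running it.

```python
import shlex
import shutil
from typing import Callable, Optional


def missing_tools(
    runbook_text: str,
    lookup: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return executables named in '$ ' command lines that lookup cannot find."""
    missing: list[str] = []
    for line in runbook_text.splitlines():
        line = line.strip()
        if not line.startswith("$ "):
            continue  # prose, headings, or expected output
        try:
            argv = shlex.split(line[2:])
        except ValueError:
            continue  # unbalanced quotes: skip rather than guess
        if argv and lookup(argv[0]) is None and argv[0] not in missing:
            missing.append(argv[0])
    return missing
```

This only looks at the first word of each command, so pipelines and lines that begin with an environment assignment would need extra handling. It complements drills; it does not replace them.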

Cost matters. Chaos engineering programs consume engineering time; DR drills consume operator time; postmortems consume calendar time. The math still works out because the alternative is worse: an unpracticed incident consumes far more time than any drill, and the reputational cost of “we couldn’t recover” dwarfs the drill cost.

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y3 Phase 30 schedules DR drills + chaos experiments
         on basecamp.
DEEP     Promoted after Y3 end + ongoing — chaos experiments + DR drills
         monthly, postmortems for every real incident.

Preview: what OUTLINE will answer

When Y3 Phase 30 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you design a system that survives real failure rather than one that hopes to?
PRINCIPLES. Assume failure; design for it. Verify resilience through experiments, not hope. Every failure mode has a runbook. Every incident produces a postmortem. Practice recovery routinely, not only during real incidents.
TRADE-OFFS. Chaos in production (verified resilience) vs staging (safer, less realistic). Blast radius per experiment (larger = more validation, more risk). Manual chaos (deliberate) vs continuous chaos (Netflix-style). Postmortem depth (deep = expensive, shallow = missed lessons).
TOOLS (time-stamped as of 2026-06): LitmusChaos, Chaos Mesh, Gremlin, Chaos Monkey (canonical; the Simian Army suite around it has been retired, but the pattern persists), custom kubectl-scripted experiments, backup tools (Velero, cloud-native snapshots), postmortem templates.
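For custom scripted experiments, the blast-radius trade-off named above can be enforced in code rather than left to operator judgment. The sketch below picks victims at random but caps them at a percentage of the fleet. `pick_victims`, `max_percent`, `delete_commands`, and the pod names are hypothetical. The only kubectl form assumed is `kubectl delete pod <name> -n <namespace>`, and the sketch builds those commands without running them.

```python
import random
from typing import Optional, Sequence


def pick_victims(
    targets: Sequence[str],
    max_percent: int,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Choose random targets, never more than max_percent of the fleet."""
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be in 1..100")
    rng = rng or random.Random()
    # Integer floor: a small fleet may yield zero victims, the safe outcome.
    limit = len(targets) * max_percent // 100
    return rng.sample(list(targets), limit)


def delete_commands(victims: Sequence[str], namespace: str) -> list[list[str]]:
    """Build (but do not run) one kubectl delete command per victim."""
    return [["kubectl", "delete", "pod", name, "-n", namespace] for name in victims]
```

With ten pods and `max_percent=20` at most two pods are selected. With three pods the cap rounds down to zero, so the experiment does nothing, which is a deliberate choice to fail safe on small fleets.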

The DEEP promotion, after Y3+ with monthly practice, will add MASTERY (operating chaos and DR routines for months), COMPARE (LitmusChaos vs Chaos Mesh vs custom scripts), OPERATE (a specific chaos experiment that revealed a real weakness), and CONTRIBUTE (a Litmus experiment shared upstream, a postmortem published, or a chaos-engineering blog post).

Canonical references

Betsy Beyer et al., Site Reliability Engineering (2016) and The Site Reliability Workbook (2018): the canonical Google SRE books. Free at sre.google.
Casey Rosenthal & Nora Jones, Chaos Engineering (2020): the definitive modern reference. Free at O’Reilly’s radar site.
Netflix’s engineering blog on Chaos Monkey and the Simian Army. Free.
John Allspaw, Blameless PostMortems and a Just Culture (2012): the blog post that changed the industry. Free at Etsy’s Code as Craft archive.
Charity Majors’s writing on operational maturity and reliability practice: a modern practitioner perspective.

Cross-references

Y3 Phase 30: Reliability Engineering
Related: disaster-recovery, fault-isolation, sli-slo-error-budget
Industry: Platform Patterns — Reliability Engineering
Canonical text: Google SRE Book + Workbook