Disaster Recovery

Recover from large-scale failure: lost region, deleted database, ransomware. RPO + RTO as the contract. Backups verified, runbooks rehearsed.

The region disappeared. The cluster is gone. The database was deleted. DR is the discipline that has an answer to all three, not in theory but in practice. Status: STUB (to be promoted to OUTLINE in Y3 Phase 30).

What this pattern is

Disaster recovery (DR) is the planned, rehearsed response to large-scale failures: a cloud region goes down, a database is corrupted or deleted, ransomware encrypts disks, a misconfigured deploy wipes state. The contract is expressed as two numbers, both measured in time: RPO (Recovery Point Objective: how large a window of recent data the organization can afford to lose) and RTO (Recovery Time Objective: how long until the system is back). Different tiers of service carry different RPO/RTO commitments.
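One way to make the RPO/RTO contract concrete is to write it down as data and check every drill or incident against it. The sketch below is illustrative only: the class name, the function name, the tier names, and the numbers are all hypothetical, not taken from any existing tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryContract:
    """RPO/RTO commitment for one service tier (tier names are hypothetical)."""
    tier: str
    rpo: timedelta  # maximum acceptable window of lost data
    rto: timedelta  # maximum acceptable time until service is restored


def evaluate_recovery(contract, data_lost, time_to_restore):
    """Return a list of contract violations; an empty list means the contract held."""
    violations = []
    if data_lost > contract.rpo:
        violations.append(f"RPO missed: lost {data_lost}, allowed {contract.rpo}")
    if time_to_restore > contract.rto:
        violations.append(f"RTO missed: took {time_to_restore}, allowed {contract.rto}")
    return violations


# Hypothetical tiers; the numbers are illustrative, not recommendations.
TIER_1 = RecoveryContract("tier-1", rpo=timedelta(minutes=5), rto=timedelta(hours=1))
TIER_3 = RecoveryContract("tier-3", rpo=timedelta(hours=24), rto=timedelta(hours=48))
```

A drill that loses ten minutes of data and takes two hours to restore would violate both halves of the tier-1 contract while comfortably satisfying tier-3, which is the sense in which different tiers carry different commitments.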

In practice the discipline comes down to five things: backups (multiple geographically separated copies, encrypted, with retention policies); restoration drills (regular test-restores to verify that the backups actually work); multi-AZ or multi-region topologies that can survive the loss of a single zone or region; DR runbooks that walk through the recovery step by step; and chaos experiments that simulate the scenarios. DR isn’t a single tool or process; it’s the whole practice of preparing for large-scale failure and rehearsing the response.
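A restoration drill needs a pass/fail criterion. A minimal sketch, assuming file-level backups and a checksum manifest recorded at backup time: restore into a scratch directory, recompute the checksums, and report anything missing or altered. The function names here are hypothetical, and a database restore would need its own checks (row counts, application-level queries) on top of this.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(expected: dict[str, str], restored_root: Path) -> list[str]:
    """Compare a restored tree against the manifest recorded at backup time."""
    actual = manifest(restored_root)
    problems = [f"missing: {name}" for name in expected if name not in actual]
    problems += [
        f"corrupt: {name}"
        for name, digest in expected.items()
        if name in actual and actual[name] != digest
    ]
    return problems
```

An empty result means every file in the manifest came back byte-for-byte identical. Files present in the restore but absent from the manifest are deliberately ignored in this sketch.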

The pattern is what distinguishes “we have a backup script” from “we can recover from disaster.” The first is hope; the second is engineering. The difference shows up during real disasters. Teams with the first practice discover their backups were incomplete, or their restoration procedure has been broken for months, or their DR contacts have left the company, or their runbook references a tool that was decommissioned. Teams with the second have already discovered these issues during drills and fixed them.

DR composes tightly with reliability engineering: DR is the subset of that practice concerned with rehearsing large-scale failure. It also composes with replication (multi-region topologies enable regional DR) and immutable-infrastructure (declarative infrastructure can be rebuilt from Git after a full-loss event).

Concrete instances in the wild

  • basecamp DR drills. Quarterly. Destroy an etcd replica; restore from backup; measure time-to-recover. Destroy Postgres primary; verify replica promotion. Restore a specific Iceberg snapshot to a fresh MinIO instance. Each drill documents time, issues found, and fixes applied.
  • AWS Multi-Region deployments with Route 53 failover. Primary region + warm-standby region with DNS-based failover. Common pattern for high-availability services.
  • Google’s multi-region Spanner. Spanner natively replicates across regions with strong consistency. DR is a configuration, not a separate procedure.
  • Immutable-infrastructure DR. If everything is declared in Git, “recover from disaster” is terraform apply against a fresh cloud account plus data restoration from backup. This is basecamp’s approach.
  • Cross-region S3 replication. Automatic replication of objects to a second region. Simple, cheap DR primitive.
  • Velero for Kubernetes DR. OSS tool that backs up K8s cluster state and persistent volumes to object storage. Rehydrate a cluster from Velero backups.
  • Bank multi-datacenter DR. Banks and other regulated firms commonly operate at least two active datacenters with synchronous replication between them. DR is then a datacenter-swap procedure, typically drilled quarterly.
  • Ransomware recovery. Immutable backups (append-only, air-gapped, versioned) protect against ransomware that would otherwise encrypt live backups. The 2020s pattern; not optional for enterprise anymore.
  • DNS-based failover. Route 53, Cloudflare, or Akamai health checks trigger DNS record updates. TTL-limited but effective for RTO in minutes.
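The "TTL-limited" caveat on DNS-based failover can be put into numbers. A rough sketch, assuming a health checker that marks an endpoint unhealthy after a fixed number of consecutive failed probes, and resolvers that honor the record TTL (some do not, so treat the result as an estimate rather than a guarantee). The function name and the example values are hypothetical.

```python
def dns_failover_seconds(check_interval_s: int, failure_threshold: int, record_ttl_s: int) -> int:
    """Estimate the time from outage to clients reaching the standby.

    Detection takes up to `failure_threshold` consecutive failed probes.
    After the record is updated, clients may keep the stale answer for up
    to one full TTL.
    """
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s


# Illustrative values: 30 s probes, 3 failures to trip, 60 s TTL.
estimate = dns_failover_seconds(check_interval_s=30, failure_threshold=3, record_ttl_s=60)
print(estimate)  # 150
```

Under these assumptions the failover fits an RTO measured in minutes. Lowering the TTL shortens the tail at the cost of more DNS queries.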

Why this pattern matters

Every organization eventually experiences a scenario for which normal high availability isn’t enough. Cloud regions go down (rare, but real: AWS us-east-1 has had multiple major incidents, and GCP has had region-wide failures). Databases get deleted (by accident, by a misconfigured deploy, or maliciously). Ransomware encrypts everything reachable, including live backups. Data centers burn down. Attackers exfiltrate or destroy data. Any single event of this class can destroy an unprepared organization.

DR is the discipline that turns these events from existential threats into serious-but-manageable incidents. The Norsk Hydro ransomware attack in 2019 disrupted IT systems across roughly 170 sites; the company fell back to manual operation at many plants for weeks, declined to pay the ransom, and rebuilt from backups. The Colonial Pipeline attack in 2021 hit the company’s IT systems; it halted pipeline operations as a precaution and restarted them within days rather than weeks.

The pattern’s cost is real but bounded. Backup storage, cross-region replication, DR drills, and DR-tested infrastructure all cost money. But the cost of not having DR is unbounded — an unrecoverable disaster can end a business. The math strongly favors DR investment for any organization whose survival depends on its data or systems.

Modern infrastructure makes DR cheaper than it was in the 2000s. Cloud multi-region deployments, object-storage cross-region replication, K8s-native backup tools (Velero), and immutable infrastructure all reduce the DR investment required. What was once heroic engineering (“build a second datacenter”) is now close to a checkbox (“enable cross-region replication”). What was once a quarterly-drill effort (“rehearse the datacenter swap”) can now be scripted (“run this DR notebook”).

The failure mode to watch: DR that isn’t actually tested. Every organization believes their backups work; a large fraction discover during real incidents that they don’t. Untested DR is worse than no DR because it produces false confidence. The rule: if you haven’t successfully restored from a backup in the past 90 days, you have no backups. Practice or accept the risk.
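The 90-day rule is simple enough to automate as a recurring check. A minimal sketch, assuming the date of the last successful test-restore is recorded somewhere; the function and constant names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

MAX_RESTORE_AGE = timedelta(days=90)  # the 90-day rule stated above


def backups_are_trustworthy(last_successful_restore: Optional[date], today: date) -> bool:
    """A backup counts only if it was successfully restored within the last 90 days."""
    if last_successful_restore is None:
        return False  # never restored means never verified
    return today - last_successful_restore <= MAX_RESTORE_AGE
```

Wiring a check like this into alerting turns "we should run a drill" into a failing signal that someone has to clear.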

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y3 Phase 30 runs a real restoration drill on basecamp.
DEEP     Promoted after Y3 end with monthly DR drills logged and at least one
         observed real-recovery event (controlled or otherwise).

Preview: what OUTLINE will answer

When Y3 Phase 30 promotes this entry to OUTLINE, it will name:

  • PROBLEM. How do you prepare for and recover from disasters that destroy substantial portions of your infrastructure?
  • PRINCIPLES. RPO and RTO as explicit contracts. Backups verified through restoration, not existence. Multi-region topology for regional resilience. Runbooks documented and drilled. Immutable/append-only backups against ransomware.
  • TRADE-OFFS. Synchronous replication (zero RPO, latency cost) vs asynchronous (RPO of seconds to minutes, no latency cost). Warm standby (fast RTO, ongoing cost) vs cold standby (slow RTO, low cost). Full DR drills (comprehensive, disruptive) vs partial (component-specific, less disruptive).
  • TOOLS (time-stamped as of 2026-06): Velero (K8s backups), cloud-native cross-region replication, Route 53 / Cloudflare failover, immutable backup targets (Wasabi immutable buckets, AWS S3 Object Lock), DR-as-code frameworks (custom Terraform + runbooks).

The DEEP promotion, after Y3 end with monthly drills, will add MASTERY (operating DR routines for months), COMPARE (cross-region replication approaches across clouds), OPERATE (a specific DR drill or recovery event, documented in ops-handbook), and CONTRIBUTE (a DR runbook or blog post walking through a specific scenario).
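The TRADE-OFFS item in the preview above can be read as a pair of decision rules. The sketch below encodes them as a rule of thumb, not a sizing method: it assumes the only inputs are the RPO, the RTO, and an estimate of how long a cold restore would take (ideally a figure measured during a drill). Both function names are hypothetical.

```python
from datetime import timedelta


def choose_replication(rpo: timedelta) -> str:
    """Zero RPO requires synchronous replication; anything looser tolerates async lag."""
    return "synchronous" if rpo <= timedelta(0) else "asynchronous"


def choose_standby(rto: timedelta, cold_restore_estimate: timedelta) -> str:
    """Cold standby is cheaper, so prefer it whenever a cold restore fits inside the RTO."""
    return "cold" if cold_restore_estimate <= rto else "warm"
```

For example, a service with an RTO of four hours and a measured two-hour cold restore can stay on cold standby, while the same restore time against a fifteen-minute RTO forces a warm standby.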

Canonical references

  • Google’s Site Reliability Workbook (2018) — chapter on DR includes real practices at hyperscale.
  • AWS’s Disaster Recovery whitepapers — reference architectures for common DR patterns.
  • Kelsey Hightower’s talks and blog posts on Kubernetes DR practices.
  • Velero documentation. Free at velero.io.
  • Post-mortems from real large-scale outages (AWS, Cloudflare, Fastly) — free at each vendor’s status blog.

Cross-references