Distributed Systems Theory

First phase of Year 2. The intuition shift from “one machine” to “many machines that fail independently.” DDIA Ch. 5-9 + a Raft implementation in Go. ~8 weeks, ~80 hrs.

Phase 8 is where Year 1’s depth gets renegotiated. You spent twelve months earning the right to debug a Linux box from first principles; now you have to accept that the box is no longer the unit of correctness. Two boxes connected by a network are not “twice the box” — they are a fundamentally different kind of system, and most of your Year 1 reflexes are wrong about it.

This is the only phase in ROOT that is mostly reading. There is no shippable artifact at the end. The artifact is you — specifically, the version of you that has internalized DDIA Ch. 5-9, implemented Raft leader election in Go, and walked the CAP/PACELC trade-off matrix on a whiteboard from memory. Years 3-5 stack lakehouse, streaming, ML serving, and agent platforms on top of this foundation; if Phase 8 is shaky, every later phase pays interest on the debt.

The pattern-first scaffold from the Master Plan bends a little here. There is no second tool to COMPARE against; the comparison is between Postgres streaming replication and etcd Raft on your homelab — same problem (replicate state), two implementations, two guarantee shapes. That’s the proof the pattern transferred.

Prerequisites

Year 1 complete

You accept: most of what’s intuitive on one machine is wrong at scale. Replication is hard. Time is hard. “Just retry” creates duplicates. The patterns survive whatever the next-decade distributed-systems substrate looks like.

Why this phase exists

Years 3-5 build on top of distributed systems: lakehouse (Iceberg with concurrent writers), Kafka (Redpanda), Flink (state across nodes), Spark (driver + executors), Kubeflow (multi-step pipelines), Ray (distributed Python), serving meshes (mTLS across nodes), agents (multi-step state machines).

If you don’t have the theory, you’ll learn each tool’s distributed-systems chapter from the README — badly, repeatedly. With theory, every new tool is “another implementation of patterns I already understand.” That is exactly the bet ROOT makes — patterns outlive tools by a decade or more, and Phase 8 is where that bet starts to pay.

This phase is mostly reading + a small implementation. No big project to ship. The proof is internalization, tested by Year 2’s Final Exam architecture review and reinforced by every later phase that touches replication, consensus, or partitioning.

1. PROBLEM

You have data and computation that no longer fit on one machine. Or you have one machine that fails, and you don’t want to lose data when it does. Either way: you need multiple machines that act as if they were one — to applications, to users.

The hard part isn’t that there are multiple machines. The hard part is that:

Networks fail in ways disks don’t (partial, asymmetric, asynchronous — a packet can be delayed for an hour and then arrive after its retry’s reply).
Time isn’t shared (no global clock — your “happens-before” intuition is a wall-clock illusion).
Coordination is expensive (consensus has a price; quorum writes are slower than single-leader writes; that’s not a bug, it’s the bill).
“Just send a message twice” is the wrong fix to almost every problem (now you have two problems plus a duplicate).

The pattern category here is “agreement under uncertainty” — and every distributed system in Year 3 onward is one more implementation of it.

2. PRINCIPLES

2.1 Replication: the second copy

Sync vs async. Single leader vs multi-leader vs leaderless. Each trade-off has a cost.

→ Pattern: replication

Investigate:

Re-read DDIA Ch. 5; map each replication mode to a real system you know (Postgres = single-leader sync/async; DynamoDB = leaderless; Cassandra = leaderless quorum).
What does “RPO = 0, RTO = 30s” actually require? (Hint: synchronous replication, automated failover, and a witness for split-brain.)
What’s split-brain? How does each replication mode handle (or not) it?

2.2 Consensus: agreeing on a single truth

Paxos is famously hard to implement; Raft is the readable alternative.

→ Pattern: consensus

Investigate:

Read the Raft paper (“In Search of an Understandable Consensus Algorithm” — Ongaro & Ousterhout).
Implement Raft leader election in Go. Just leader election — full log replication is a 6-month project; election is a weekend.
Visit raft.github.io’s animation; trace what etcd does during your homelab’s K8s cluster operations (the K3s control plane runs etcd; you can etcdctl endpoint status your way to the actual leader).

2.3 Partitioning: sharding the data

Split data across nodes. Routing strategies (range, hash, consistent-hash). Rebalancing. The hot-shard problem.

→ Pattern: partitioning

Investigate:

DDIA Ch. 6.
Implement a tiny consistent-hash ring in Go (50 lines). Test with N nodes failing; observe which keys move and which don’t.
What’s a “rebalance” in DynamoDB / Cassandra terms? When does it pause writes?

2.4 Eventual consistency: the price of availability

If you choose A over C in CAP, you live with stale reads. The trick is bounding the staleness in a useful way.

→ Pattern: eventual-consistency

Investigate:

Read-your-writes consistency, monotonic reads, causal consistency — what does each give you?
Why is “session consistency” a useful middle ground? (It’s the consistency model most users actually expect, even when the system technically only promises eventual.)

2.5 CAP + PACELC: the trade-off frame

CAP is the cocktail-party version. PACELC (Daniel Abadi) extends it: on partition, choose A or C; else (no partition), choose latency or consistency.

→ Pattern: cap-and-pacelc

Investigate:

Map 5 systems you know to PACELC quadrants: Postgres, DynamoDB, Cassandra, MongoDB, etcd.
What does “tunable consistency” actually mean operationally? (Cassandra’s ONE / QUORUM / ALL — when do you pick each?)

2.6 Idempotency at scale

Every distributed system has duplicate messages. Build for them, don’t fight them.

→ Pattern: idempotency (deepens from Y1)

2.7 Delivery semantics

At-most-once, at-least-once, exactly-once-ish. The third is mostly a lie.

→ Pattern: delivery-semantics

Investigate:

Why is exactly-once “ish”? Which side is the lie — producer, broker, or consumer?
Kafka’s exactly-once-semantics — what does it actually deliver, and where’s the asterisk? (Hint: it works inside one Kafka transaction; the asterisk is everything outside it.)

2.8 Saga vs 2PC

Long-running transactions across services. Sagas (compensating actions) vs two-phase commit (blocking, fragile).

→ Pattern: two-phase-commit-vs-sagas

2.9 CRDTs: coordination-free data

Conflict-free replicated data types: math that lets concurrent edits merge deterministically.

→ Pattern: crdts

Investigate:

Read “A comprehensive study of CRDTs” (Shapiro et al.) — at least the intro + G-Counter/PN-Counter sections.
Where would CRDTs help in your platform? (Spoiler: rarely; their domain is collaborative-edit and offline-first. Knowing when not to reach for them is part of the skill.)

2.10 Distributed time

Wall clocks lie. Logical clocks (Lamport, vector clocks) give you happens-before without sync.

→ Pattern: distributed-time

Investigate:

DDIA Ch. 8 — the time chapter.
Implement a Lamport clock in Go (15 lines).
Why does Spanner use TrueTime + atomic clocks? What does it buy? (External consistency at planet scale, paid for in dedicated hardware in every datacenter.)

3. TRADE-OFFS

The whole phase IS trade-offs. Internalize the table:

Choice axis	Option A	Option B	When each wins
Replication mode	sync	async	sync: safety; async: latency
Leader topology	single-leader	multi-leader	leaderless
Consistency model	strong (linearizable)	sequential	eventual
Partitioning key	hash	range	hash: even; range: range-queries
Coord. style	2PC	Saga	local

4. TOOLS

This phase isn’t tool-heavy. The “tools” are textbooks + papers:

DDIA (Kleppmann) Ch. 5-9 — the spine
Raft paper (Ongaro & Ousterhout)
Spanner paper (Corbett et al.) — TrueTime
Dynamo paper (DeCandia et al., 2007) — leaderless replication
Go for the Raft implementation
Jepsen reports (jepsen.io) — when DBs fail under network partition

5. MASTERY

5.1 Reading list

Required	Why
DDIA Ch. 5 (Replication)	The spine
DDIA Ch. 6 (Partitioning)	The spine
DDIA Ch. 7 (Transactions)	The spine
DDIA Ch. 8 (The Trouble with Distributed Systems)	The spine
DDIA Ch. 9 (Consistency and Consensus)	The spine
Raft paper	Implement leader election

Recommended	Why
Spanner paper	TrueTime + global consistency
Dynamo paper	Leaderless
Jepsen reports for systems you use	What actually breaks under partition

5.2 Operational depth checklist

[ ] Read DDIA Ch. 5-9 (one chapter per week, 5 weeks)
[ ] Implement Raft leader election in Go: 3-node cluster, kill leader, observe re-election
[ ] Implement a consistent-hash ring; test with N nodes failing; observe key redistribution
[ ] Implement a Lamport clock; reason about happens-before for concurrent events
[ ] Postgres failover drill: streaming replica → promote → former primary returns; how do you reconcile?
[ ] Read 1 Jepsen report on a system you use (Postgres, Redis, MongoDB, Kafka — all have reports)
[ ] Walk DDIA's 5 trade-off matrices on the whiteboard from memory
[ ] Pick a real system (etcd, Kafka, CockroachDB) and identify its consensus algorithm + partitioning + replication mode
[ ] Write a 1500-word essay: "Why is exactly-once delivery a lie, and what should I build instead?"
[ ] Practice articulating CAP/PACELC trade-offs at a whiteboard for 30 min

6. COMPARE: Postgres replication vs etcd Raft

Both replicate state. Postgres uses streaming replication (single-leader async/sync). etcd uses Raft (single-leader consensus). On your homelab, both are running — Postgres serves basecamp Tier 1; etcd serves the K3s control plane underneath everything.

What’s different about the guarantees each offers? When would you pick one shape vs the other? Postgres async streaming gives you a fast follower with possible data loss on primary failure; Postgres sync streaming gives you durability at the cost of write latency and a hard dependency on the replica being alive. etcd Raft gives you linearizable reads/writes through a quorum of (N/2)+1 — a 3-node cluster survives 1 failure, a 5-node cluster survives 2, and the math doesn’t care whether you’re storing K8s objects or session tokens.

300 words. This is the prep for Year 2’s Final Exam architecture review.

7. OPERATE

2+ runbooks (Postgres failover; etcd member down)
1+ ADR if a real decision warrants (probably “Why streaming replication for Postgres in basecamp”)
Weekly log

This phase is more reading than operating. Don’t fake operational depth — the next phases (Phase 9 onward) bring it back fast.

8. CONTRIBUTE

DDIA-adjacent: contribute to the DDIA references repo, or any open-source implementation of consensus (etcd, Raft libraries). One landed PR — docs fix counts.

Validation criteria

[ ] All 10 operational depth checks
[ ] DDIA Ch. 5-9 read; notes in ops-handbook/notes/ddia/
[ ] Raft leader election working in Go
[ ] Consistent-hash ring + Lamport clock implementations
[ ] Postgres replication failover drill done
[ ] 1 Jepsen report digested
[ ] Postgres-vs-etcd comparison written
[ ] 2+ runbooks; 8+ weekly log entries
[ ] Pattern entries deepened to OUTLINE (some DEEP):
    - replication, consensus, partitioning, eventual-consistency, cap-and-pacelc
    - delivery-semantics, two-phase-commit-vs-sagas, crdts, distributed-time
    - idempotency (now DEEP)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (60 min) — kill etcd member, observe quorum behavior; intentionally cause split-brain in a 2-node Postgres replication setup; recover.
Articulate (90 min) — 1500 words: “I have a multi-region service that needs strong consistency for billing reads. Walk through CAP/PACELC trade-offs, propose an architecture, defend the choice.”
Whiteboard (30 min) — explain Raft to a hypothetical interviewer in 10 min; explain CRDTs in 5; explain CAP-vs-PACELC in 5.

Anti-patterns

Anti-pattern	Why
Reading DDIA without taking notes	The patterns won’t stick; the marginal cost of writing them down is what makes them yours
”Exactly-once delivery” claims without skepticism	The guarantee is asterisked everywhere; learn where the asterisks live
Treating consensus as a black box	etcd, the K8s API server’s source of truth, runs Raft. You should know what that means.
Skipping the Raft implementation	Implementing it is what makes it not magic

Patterns deepened this phase

All Year 2 distributed-systems patterns reach OUTLINE; several reach DEEP:

replication → DEEP
consensus → DEEP (after Raft impl)
partitioning → DEEP
eventual-consistency → DEEP
cap-and-pacelc → DEEP
idempotency → DEEP
delivery-semantics → OUTLINE
two-phase-commit-vs-sagas → OUTLINE
crdts → OUTLINE
distributed-time → OUTLINE

Browse the full category at patterns/distributed-systems/.

→ Next: Phase 9: IaC: Terraform + Crossplane