Distributed Systems Theory
First phase of Year 2. The intuition shift from “one machine” to “many machines that fail independently.” DDIA Ch. 5-9 + a Raft implementation in Go. ~8 weeks, ~80 hrs.
Phase 8 is where Year 1’s depth gets renegotiated. You spent twelve months earning the right to debug a Linux box from first principles; now you have to accept that the box is no longer the unit of correctness. Two boxes connected by a network are not “twice the box” — they are a fundamentally different kind of system, and most of your Year 1 reflexes are wrong about it.
This is the only phase in ROOT that is mostly reading. There is no shippable artifact at the end. The artifact is you — specifically, the version of you that has internalized DDIA Ch. 5-9, implemented Raft leader election in Go, and walked the CAP/PACELC trade-off matrix on a whiteboard from memory. Years 3-5 stack lakehouse, streaming, ML serving, and agent platforms on top of this foundation; if Phase 8 is shaky, every later phase pays interest on the debt.
The pattern-first scaffold from the Master Plan bends a little here. There is no second tool to COMPARE against; the comparison is between Postgres streaming replication and etcd Raft on your homelab — same problem (replicate state), two implementations, two guarantee shapes. That’s the proof the pattern transferred.
Prerequisites
- Year 1 complete
- You accept: most of what’s intuitive on one machine is wrong at scale. Replication is hard. Time is hard. “Just retry” creates duplicates. The patterns survive whatever the next-decade distributed-systems substrate looks like.
Why this phase exists
Years 3-5 build on top of distributed systems: lakehouse (Iceberg with concurrent writers), Kafka (Redpanda), Flink (state across nodes), Spark (driver + executors), Kubeflow (multi-step pipelines), Ray (distributed Python), serving meshes (mTLS across nodes), agents (multi-step state machines).
If you don’t have the theory, you’ll learn each tool’s distributed-systems chapter from the README — badly, repeatedly. With theory, every new tool is “another implementation of patterns I already understand.” That is exactly the bet ROOT makes — patterns outlive tools by a decade or more, and Phase 8 is where that bet starts to pay.
This phase is mostly reading + a small implementation. No big project to ship. The proof is internalization, tested by Year 2’s Final Exam architecture review and reinforced by every later phase that touches replication, consensus, or partitioning.
1. PROBLEM
You have data and computation that no longer fit on one machine. Or you have one machine that fails, and you don’t want to lose data when it does. Either way: you need multiple machines that act as if they were one — to applications, to users.
The hard part isn’t that there are multiple machines. The hard part is that:
- Networks fail in ways disks don’t (partial, asymmetric, asynchronous — a packet can be delayed for an hour and then arrive after its retry’s reply).
- Time isn’t shared (no global clock — your “happens-before” intuition is a wall-clock illusion).
- Coordination is expensive (consensus has a price; quorum writes are slower than single-leader writes; that’s not a bug, it’s the bill).
- “Just send a message twice” is the wrong fix to almost every problem (now you have two problems plus a duplicate).
The pattern category here is “agreement under uncertainty” — and every distributed system in Year 3 onward is one more implementation of it.
2. PRINCIPLES
2.1 Replication: the second copy
Sync vs async. Single leader vs multi-leader vs leaderless. Each trade-off has a cost.
→ Pattern: replication
Investigate:
- Re-read DDIA Ch. 5; map each replication mode to a real system you know (Postgres = single-leader sync/async; DynamoDB = leaderless; Cassandra = leaderless quorum).
- What does “RPO = 0, RTO = 30s” actually require? (Hint: synchronous replication, automated failover, and a witness for split-brain.)
- What’s split-brain? How does each replication mode handle (or not) it?
2.2 Consensus: agreeing on a single truth
Paxos is famously hard to implement; Raft is the readable alternative.
→ Pattern: consensus
Investigate:
- Read the Raft paper (“In Search of an Understandable Consensus Algorithm” — Ongaro & Ousterhout).
- Implement Raft leader election in Go. Just leader election — full log replication is a 6-month project; election is a weekend.
- Visit raft.github.io’s animation; trace what etcd does during your homelab’s K8s cluster operations (the K3s control plane runs etcd; you can
etcdctl endpoint statusyour way to the actual leader).
2.3 Partitioning: sharding the data
Split data across nodes. Routing strategies (range, hash, consistent-hash). Rebalancing. The hot-shard problem.
→ Pattern: partitioning
Investigate:
- DDIA Ch. 6.
- Implement a tiny consistent-hash ring in Go (50 lines). Test with N nodes failing; observe which keys move and which don’t.
- What’s a “rebalance” in DynamoDB / Cassandra terms? When does it pause writes?
2.4 Eventual consistency: the price of availability
If you choose A over C in CAP, you live with stale reads. The trick is bounding the staleness in a useful way.
→ Pattern: eventual-consistency
Investigate:
- Read-your-writes consistency, monotonic reads, causal consistency — what does each give you?
- Why is “session consistency” a useful middle ground? (It’s the consistency model most users actually expect, even when the system technically only promises eventual.)
2.5 CAP + PACELC: the trade-off frame
CAP is the cocktail-party version. PACELC (Daniel Abadi) extends it: on partition, choose A or C; else (no partition), choose latency or consistency.
→ Pattern: cap-and-pacelc
Investigate:
- Map 5 systems you know to PACELC quadrants: Postgres, DynamoDB, Cassandra, MongoDB, etcd.
- What does “tunable consistency” actually mean operationally? (Cassandra’s
ONE/QUORUM/ALL— when do you pick each?)
2.6 Idempotency at scale
Every distributed system has duplicate messages. Build for them, don’t fight them.
→ Pattern: idempotency (deepens from Y1)
2.7 Delivery semantics
At-most-once, at-least-once, exactly-once-ish. The third is mostly a lie.
→ Pattern: delivery-semantics
Investigate:
- Why is exactly-once “ish”? Which side is the lie — producer, broker, or consumer?
- Kafka’s exactly-once-semantics — what does it actually deliver, and where’s the asterisk? (Hint: it works inside one Kafka transaction; the asterisk is everything outside it.)
2.8 Saga vs 2PC
Long-running transactions across services. Sagas (compensating actions) vs two-phase commit (blocking, fragile).
→ Pattern: two-phase-commit-vs-sagas
2.9 CRDTs: coordination-free data
Conflict-free replicated data types: math that lets concurrent edits merge deterministically.
→ Pattern: crdts
Investigate:
- Read “A comprehensive study of CRDTs” (Shapiro et al.) — at least the intro + G-Counter/PN-Counter sections.
- Where would CRDTs help in your platform? (Spoiler: rarely; their domain is collaborative-edit and offline-first. Knowing when not to reach for them is part of the skill.)
2.10 Distributed time
Wall clocks lie. Logical clocks (Lamport, vector clocks) give you happens-before without sync.
→ Pattern: distributed-time
Investigate:
- DDIA Ch. 8 — the time chapter.
- Implement a Lamport clock in Go (15 lines).
- Why does Spanner use TrueTime + atomic clocks? What does it buy? (External consistency at planet scale, paid for in dedicated hardware in every datacenter.)
3. TRADE-OFFS
The whole phase IS trade-offs. Internalize the table:
| Choice axis | Option A | Option B | When each wins |
|---|---|---|---|
| Replication mode | sync | async | sync: safety; async: latency |
| Leader topology | single-leader | multi-leader | leaderless |
| Consistency model | strong (linearizable) | sequential | eventual |
| Partitioning key | hash | range | hash: even; range: range-queries |
| Coord. style | 2PC | Saga | local |
4. TOOLS
This phase isn’t tool-heavy. The “tools” are textbooks + papers:
- DDIA (Kleppmann) Ch. 5-9 — the spine
- Raft paper (Ongaro & Ousterhout)
- Spanner paper (Corbett et al.) — TrueTime
- Dynamo paper (DeCandia et al., 2007) — leaderless replication
- Go for the Raft implementation
- Jepsen reports (jepsen.io) — when DBs fail under network partition
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| DDIA Ch. 5 (Replication) | The spine |
| DDIA Ch. 6 (Partitioning) | The spine |
| DDIA Ch. 7 (Transactions) | The spine |
| DDIA Ch. 8 (The Trouble with Distributed Systems) | The spine |
| DDIA Ch. 9 (Consistency and Consensus) | The spine |
| Raft paper | Implement leader election |
| Recommended | Why |
|---|---|
| Spanner paper | TrueTime + global consistency |
| Dynamo paper | Leaderless |
| Jepsen reports for systems you use | What actually breaks under partition |
5.2 Operational depth checklist
[ ] Read DDIA Ch. 5-9 (one chapter per week, 5 weeks)[ ] Implement Raft leader election in Go: 3-node cluster, kill leader, observe re-election[ ] Implement a consistent-hash ring; test with N nodes failing; observe key redistribution[ ] Implement a Lamport clock; reason about happens-before for concurrent events[ ] Postgres failover drill: streaming replica → promote → former primary returns; how do you reconcile?[ ] Read 1 Jepsen report on a system you use (Postgres, Redis, MongoDB, Kafka — all have reports)[ ] Walk DDIA's 5 trade-off matrices on the whiteboard from memory[ ] Pick a real system (etcd, Kafka, CockroachDB) and identify its consensus algorithm + partitioning + replication mode[ ] Write a 1500-word essay: "Why is exactly-once delivery a lie, and what should I build instead?"[ ] Practice articulating CAP/PACELC trade-offs at a whiteboard for 30 min6. COMPARE: Postgres replication vs etcd Raft
Both replicate state. Postgres uses streaming replication (single-leader async/sync). etcd uses Raft (single-leader consensus). On your homelab, both are running — Postgres serves basecamp Tier 1; etcd serves the K3s control plane underneath everything.
What’s different about the guarantees each offers? When would you pick one shape vs the other? Postgres async streaming gives you a fast follower with possible data loss on primary failure; Postgres sync streaming gives you durability at the cost of write latency and a hard dependency on the replica being alive. etcd Raft gives you linearizable reads/writes through a quorum of (N/2)+1 — a 3-node cluster survives 1 failure, a 5-node cluster survives 2, and the math doesn’t care whether you’re storing K8s objects or session tokens.
300 words. This is the prep for Year 2’s Final Exam architecture review.
7. OPERATE
- 2+ runbooks (Postgres failover; etcd member down)
- 1+ ADR if a real decision warrants (probably “Why streaming replication for Postgres in basecamp”)
- Weekly log
This phase is more reading than operating. Don’t fake operational depth — the next phases (Phase 9 onward) bring it back fast.
8. CONTRIBUTE
DDIA-adjacent: contribute to the DDIA references repo, or any open-source implementation of consensus (etcd, Raft libraries). One landed PR — docs fix counts.
Validation criteria
[ ] All 10 operational depth checks[ ] DDIA Ch. 5-9 read; notes in ops-handbook/notes/ddia/[ ] Raft leader election working in Go[ ] Consistent-hash ring + Lamport clock implementations[ ] Postgres replication failover drill done[ ] 1 Jepsen report digested[ ] Postgres-vs-etcd comparison written[ ] 2+ runbooks; 8+ weekly log entries[ ] Pattern entries deepened to OUTLINE (some DEEP): - replication, consensus, partitioning, eventual-consistency, cap-and-pacelc - delivery-semantics, two-phase-commit-vs-sagas, crdts, distributed-time - idempotency (now DEEP)[ ] Exit Test passedExit Test
Time: 3 hours.
- Build (60 min) — kill etcd member, observe quorum behavior; intentionally cause split-brain in a 2-node Postgres replication setup; recover.
- Articulate (90 min) — 1500 words: “I have a multi-region service that needs strong consistency for billing reads. Walk through CAP/PACELC trade-offs, propose an architecture, defend the choice.”
- Whiteboard (30 min) — explain Raft to a hypothetical interviewer in 10 min; explain CRDTs in 5; explain CAP-vs-PACELC in 5.
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Reading DDIA without taking notes | The patterns won’t stick; the marginal cost of writing them down is what makes them yours |
| ”Exactly-once delivery” claims without skepticism | The guarantee is asterisked everywhere; learn where the asterisks live |
| Treating consensus as a black box | etcd, the K8s API server’s source of truth, runs Raft. You should know what that means. |
| Skipping the Raft implementation | Implementing it is what makes it not magic |
Patterns deepened this phase
All Year 2 distributed-systems patterns reach OUTLINE; several reach DEEP:
- replication → DEEP
- consensus → DEEP (after Raft impl)
- partitioning → DEEP
- eventual-consistency → DEEP
- cap-and-pacelc → DEEP
- idempotency → DEEP
- delivery-semantics → OUTLINE
- two-phase-commit-vs-sagas → OUTLINE
- crdts → OUTLINE
- distributed-time → OUTLINE
Browse the full category at patterns/distributed-systems/.