Skip to content
5-YEAR PROGRAM · YEAR 1 · PHASE 3
UPCOMING

Databases: Postgres + Redis

Third phase. SQL fluency, transactions, replication, caching. The OLTP/OLAP boundary first appears. ~7-8 weeks, ~90 hrs.

If Phase 1 was “what is a process?” and Phase 2 was “what happens between processes on different machines?”, Phase 3 is “what happens to state that has to survive both?” Every other phase from here forward — containers, Kubernetes, the data tier in Year 3, the ML platform in Year 4 — is built on top of databases that don’t lose data. If the durability primitives feel approximate, every higher tier feels approximate.

Phase 3 also has the heaviest pattern density of any Year 1 phase: write-ahead logging, LSM vs B-tree, OLTP vs OLAP, replication, idempotency, caching all get their first real encounter here. The Year 1 overview calls Phase 3 the substrate phase for state — that’s not marketing. It’s literally where five distinct pattern entries get promoted from STUB to OUTLINE in eight weeks.


Prerequisites

  • Phase 2 complete — networking internalized
  • You accept: you are not learning Postgres. You are learning what a database does, with Postgres as one implementation. The patterns (WAL, MVCC, indexes, replication) survive the tool.

Why this phase exists

Year 1 ships triage (Phase 7), which uses Postgres + Redis. Year 3 builds the lakehouse on top of object storage but with the OLTP/OLAP boundary as a key axis. Year 4 ML serving uses pgvector. Year 5 AIOps queries past incidents from Postgres.

Databases also exercise the most patterns of any Year 1 phase: write-ahead logging, indexes, MVCC, replication, idempotency, schema-on-write vs schema-on-read.


1. PROBLEM

You have data that must persist across crashes, be queried efficiently, support concurrent writers without corruption, scale beyond a single machine when needed, and recover when something breaks. That’s the database problem.

OLTP databases (Postgres, MySQL) optimize for many small concurrent transactions. OLAP databases (Trino, ClickHouse, Snowflake) optimize for few large analytical queries. Caches (Redis, Memcached) optimize for tiny reads at memory speed. Each has a place; the boundary is where most architectural mistakes happen.


2. PRINCIPLES

2.1 Durability via write-ahead logging

Before mutating the on-disk data, log the intent. On crash, replay the log.

→ Pattern: write-ahead-logging

Investigate:

  • Watch Postgres WAL grow on disk; force a checkpoint; observe the file shrink/rotate
  • Read the Postgres source for XLogInsert (just enough to see the structure)
  • What does synchronous_commit = off actually trade away?

2.2 Indexes: B-tree vs LSM

B-tree: balanced, in-place updates, optimized for read-heavy. LSM: append-only, batched compaction, optimized for write-heavy.

→ Pattern: lsm-vs-btree

Investigate:

  • Build a B-tree mental model — read CMU’s database lectures or a textbook chapter
  • LSM in practice — read a RocksDB or Cassandra design doc
  • When does each win? Workload-driven analysis

2.3 MVCC + transactions

Multi-version concurrency control: each row has multiple versions; readers see a consistent snapshot; writers don’t block readers.

Investigate:

  • ACID — pin down what each letter means with a concrete example
  • Postgres pg_xact + tuple visibility — read the docs
  • What’s a write skew? What’s a phantom read? What does each isolation level prevent?

2.4 Replication

The “second copy of the data” problem. Sync vs async. Single leader vs multi-leader vs leaderless.

→ Pattern: replication

Investigate:

  • Set up Postgres streaming replication: 1 primary + 1 replica
  • Force a failover; what’s lost? what’s preserved?
  • Read DDIA Ch. 5 (replication) — Year 2 will deepen this

2.5 Idempotency

A retried request shouldn’t double-charge the user. Idempotency keys + at-most-once semantics.

→ Pattern: idempotency

Investigate:

  • Design an idempotency-key column for a transactions table
  • What’s the trade-off between server-side and client-side idempotency?

Idempotency reappears in Phase 5 (Go + concurrency) when you build retry logic in pulse, and again in Phase 7 where every Kubernetes controller is required to be idempotent — you apply the same manifest a thousand times, the cluster converges to one state. Three different abstraction levels, identical pattern.

2.6 The OLTP/OLAP boundary

OLTP: rows, indexed, small transactions. OLAP: columns, scan-heavy, big aggregations. Mixing them on one instance is the “we’ll add analytics queries to the prod DB” trap.

→ Pattern: oltp-vs-olap

Investigate:

  • Same dataset, two queries (point lookup + analytical aggregate). Compare Postgres vs DuckDB.
  • Why do row stores lose at analytics? Why do column stores lose at OLTP?

2.7 Caching (Redis)

Redis is in-memory data structures over the network. Used as cache, rate limiter, queue, session store.

→ Pattern: caching

Investigate:

  • Cache invalidation strategies (TTL, write-through, write-back, key-by-version)
  • Why is “cache stampede” a thing? How do you prevent it? (e.g., probabilistic early expiration)

3. TRADE-OFFS

DecisionOption AOption BCost
OLTP DBPostgresMySQLSQLite
IndexB-treeHashGIN/GiST
CacheRedisMemcachedRedis: data structures + persistence. Memcached: simpler
ReplicationSyncAsyncSync: safety, latency. Async: speed, possible loss
Schemastrict (NOT NULL, FKs)looseStrict catches bugs early; loose is faster to iterate

The replication row deserves extra attention: it’s the same fundamental sync-vs-async choice you’ll re-encounter in Year 2 (CAP/PACELC), in Phase 7 (etcd quorum writes), and in Year 3’s lakehouse (eventual consistency on object storage). Three different scales, one trade-off.


4. TOOLS (as of 2025-10)

  • Postgres 17 — homelab default
  • Redis 7 — cache + session store
  • DuckDB — for the OLAP compare exercise (in-process, no server)
  • psql, pgcli — REPLs
  • pg_stat_statements — query introspection
  • pgbench — synthetic load
  • pgvector — appears in Year 4; install but don’t use yet

5. MASTERY

5.1 Reading list

RequiredWhy
”Designing Data-Intensive Applications” (DDIA) Ch. 1-3The book; read 1 chapter/week
Postgres docs Ch. 5 (Data Definition) + Ch. 11 (Indexes) + Ch. 13 (Concurrency)The actual spec
”The Internals of PostgreSQL” (Hironobu Suzuki, online)Implementation depth
RecommendedWhy
Use The Index, Luke (Markus Winand)Indexing as practitioner skill

5.2 Operational depth checklist

[ ] Install Postgres 17 on a VM; configure pg_hba.conf, listen_addresses; tune memory
[ ] Design the `incidents` schema for triage (Phase 7 will use this): incidents, paging, escalation, audit_log. NOT NULL + FKs everywhere
[ ] Add appropriate indexes; verify usage with `EXPLAIN ANALYZE`
[ ] Implement an idempotency-key pattern for `incidents` inserts
[ ] Set up streaming replication: primary + replica; force failover; document
[ ] Trigger and recover from: dead tuples bloat (autovacuum off), long-running transaction blocking writes, transaction wraparound warning
[ ] Stand up Redis 7; configure persistence (RDB + AOF); benchmark with redis-benchmark
[ ] Implement a token-bucket rate limiter in Redis (Phase 5 + Phase 7 use this)
[ ] OLAP compare: same 10M-row dataset in Postgres + DuckDB; run analytical query on both; explain the latency gap
[ ] Backup + restore drill: pg_basebackup + WAL replay; verify data integrity post-restore

The backup/restore drill is non-negotiable. A backup you’ve never restored is a wish, not a backup. By the end of this checklist you should have actually run pg_basebackup, actually lost the primary, and actually watched WAL replay onto the replica — and timed the whole thing.

5.3 The incidents schema

This is the schema triage (Phase 7) builds against. Owning it through this phase means Phase 7 doesn’t introduce a schema design problem on top of K8s.

CREATE TABLE incidents (
id UUID PRIMARY KEY,
title TEXT NOT NULL,
severity SMALLINT NOT NULL CHECK (severity BETWEEN 1 AND 4),
status TEXT NOT NULL CHECK (status IN ('open','ack','resolved')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ,
idempotency_key TEXT UNIQUE -- defends against double-creates from retries
);
-- + paging, escalation, audit_log tables

Build the full schema; populate with seed data; benchmark queries. Document tradeoffs as ADRs in ops-handbook/adrs/. The idempotency_key TEXT UNIQUE line is the smallest possible expression of the idempotency pattern — when a duplicate insert hits, the unique-constraint violation is the database telling you “this already happened, no-op.” That’s the same protocol K8s admission controllers use, scaled down to one row.


6. COMPARE: Postgres vs SQLite

For a single-user homelab service like triage, why isn’t SQLite enough? Build the same incidents table in both, run pgbench-equivalent loads, document the trade-offs.

Write 400 words: when does SQLite win? When is Postgres worth the operational tax?

The honest answer for a single-user homelab service is SQLite would actually work. The reason Postgres is worth it for triage is the program’s later trajectory: by Year 4 you need pgvector for RAG, by Year 5 the AIOps agent runs concurrent reads against the same database, and migrating SQLite → Postgres mid-program is more expensive than just paying the operational tax now. That’s a reasoning move you should be able to articulate in the comparison writeup.


7. OPERATE

  • 4+ runbooks (postgres-restart, postgres-vacuum-bloat, redis-replication-lag, pg_stat_statements-tuning)
  • 1+ postmortem (something will go wrong with replication setup)
  • 1-2 ADRs (e.g., “Why streaming replication over logical replication for the homelab”)
  • Weekly log

8. CONTRIBUTE

Postgres docs, Redis docs, pg_stat_statements examples, DuckDB examples — all approachable.


Validation criteria

[ ] All 10 operational depth checks
[ ] `incidents` schema designed, deployed, benchmarked
[ ] Postgres vs SQLite comparison written up
[ ] 4+ runbooks; 1+ postmortem; 1+ ADR; 7+ weekly log entries
[ ] Pattern entries deepened:
- write-ahead-logging → OUTLINE
- lsm-vs-btree → OUTLINE
- oltp-vs-olap → OUTLINE (deepens Year 3)
- replication → OUTLINE (deepens Year 2)
- idempotency → OUTLINE
- caching → first touch (STUB stays)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

  1. Build (90 min) — given a fresh Postgres + Redis VM, create the full incidents schema with appropriate indexes + a rate-limiter in Redis; benchmark; document.
  2. Debug (60 min) — scenario from Phase 3 catalog (autovacuum behind, replication lagging, deadlock).
  3. Articulate (30 min) — 600 words: “Walk a INSERT INTO incidents from psql keystroke to durable on-disk WAL. Cover parser, planner, executor, lock manager, WAL writer, fsync.”

The articulation prompt mirrors Phase 1’s read(2) walk-through — same shape (a single user-facing operation traced down to durable state) at a different layer of the stack. If you can do both, you’ve internalized the layering pattern.


Anti-patterns

Anti-patternWhy
Skipping EXPLAIN ANALYZEYou don’t know what your queries do
Putting analytics queries on the OLTP boxTier-1 incidents waiting to happen
synchronous_commit = off “for performance” without understandingYou just traded durability for latency without knowing
Caching without invalidation strategyThe “two hard problems” joke is a real anti-pattern

Patterns touched this phase


→ Next: Phase 4: Python — Fluency