Message Queues + Event-Driven Patterns
Phase 14 of /root Year 2: queues vs pub/sub, delivery semantics, idempotent consumers, dead-letter handling. Add async background work and webhooks to your Y2 service. 5-7 weeks, ~50-70 hours.
Sixth phase of Year 2. Asynchrony as a design choice. 5-7 weeks, ~50-70 hrs.
Most backend services need to do work asynchronously: send an email, process an upload, recompute a leaderboard, notify external systems. Doing this work synchronously (in the request path) creates slow endpoints, fragile systems, and customer-visible failures when downstream services hiccup. This phase teaches asynchrony as a deliberate pattern — queues, pub/sub, delivery semantics, idempotent consumers, dead-letter queues — applied to your Y2 service.
By phase end your service has at least one async-processed background job and at least one outbound webhook. Both are designed with explicit delivery semantics (at-least-once + idempotent consumer is almost always the right answer in 2026). You’ve recovered a stuck queue. You can defend the decision to use a queue vs pub/sub for a specific use case.
Prerequisites
- Phase 13 complete; service containerized
- 12 hrs/week budget reserved
- You accept: distributed systems primitives leak. You’ll meet failure modes that don’t exist in synchronous code. The phase exists to make those failures legible.
Why this phase exists
Queues and events are the entry to distributed-systems thinking. They introduce concepts that synchronous backend work never surfaces: messages lost, messages duplicated, messages out-of-order, workers that crash mid-processing, queues that grow without bound, schemas that evolve. These failure modes are unavoidable in any production system at scale. Y3’s distributed systems theory (Phase 21) takes the full course; this phase makes you fluent enough to design around the most common ones.
The pattern-first frame
Same eight steps as every phase.
1. PROBLEM
Some work your service needs to do is expensive (long-running), deferred (don’t need to do it now), or fan-out (one event → many consumers). Doing this work in the request path makes endpoints slow and tightly couples your service to downstream availability. Putting the work in a queue and processing it asynchronously is the standard solution.
That’s the messaging problem. It comes in two flavors: queues (any number of producers; each message is processed once, by exactly one of the competing workers) and pub/sub (each event is delivered to every subscriber). The implementations (Redis queues, RabbitMQ, NATS, Kafka, SQS, Pub/Sub) fit different points on a complexity / durability / scale curve.
2. PRINCIPLES
2.1 Queue vs pub/sub
Queue: each message goes to exactly one consumer. Workers pull messages, and once a message has been acknowledged it is gone from the queue. Pub/sub: each event goes to every subscriber, and each subscriber processes it independently of the others.
→ Pattern: message-queue and pub-sub
Investigate:
- When does queue fit (background work)? When does pub/sub fit (event broadcasting, fan-out)?
- What’s “fan-out” and what’s “fan-in”?
- Why do some systems blur the line (Kafka consumer groups are queue-like over a pub/sub log)?
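The difference in semantics can be sketched in a few lines of in-memory Python. This is a toy illustration, not a broker: `WorkQueue`, `Topic`, and `demo` are hypothetical names, and a real backend adds persistence, acknowledgements, and network delivery on top.

```python
from collections import deque
from typing import Callable, Optional


class WorkQueue:
    """Queue semantics: each message is handed to exactly one worker."""

    def __init__(self) -> None:
        self._messages = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        # Whichever worker calls receive() first takes the message;
        # no other worker will ever see it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Pub/sub semantics: every subscriber gets its own copy of every event."""

    def __init__(self) -> None:
        self._subscribers = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: str) -> None:
        for handler in self._subscribers:
            handler(event)


def demo():
    # Two jobs, two workers: the work is split between them.
    queue = WorkQueue()
    for job in ("resize:1", "resize:2"):
        queue.send(job)
    worker_a = [queue.receive()]
    worker_b = [queue.receive()]

    # One event, two subscribers: both see it.
    topic = Topic()
    email_log, audit_log = [], []
    topic.subscribe(email_log.append)
    topic.subscribe(audit_log.append)
    topic.publish("user.signed_up")
    return worker_a, worker_b, email_log, audit_log
```

With a queue, adding workers divides the load; with a topic, adding subscribers multiplies the deliveries. That difference is usually what decides between the two.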
2.2 Delivery semantics
Three flavors: at-most-once (might lose, won’t duplicate), at-least-once (might duplicate, won’t lose), exactly-once (claimed by some systems, usually means “effectively-once with caveats”). At-least-once is the standard default; you make it effectively-once by making consumers idempotent.
→ Pattern: delivery-semantics
Investigate:
- Why is true exactly-once delivery impossible without coordination on both ends?
- What does Kafka’s “exactly-once semantics” actually guarantee, and at what cost?
- When is at-most-once acceptable? (Hint: when an occasional loss is harmless and duplicates would skew the result, as with metrics.)
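One way to see where duplicates and losses come from is to look at the order of "handle" and "acknowledge" in the consumer. The sketch below uses a toy in-memory `Broker` (a hypothetical class, not any real library's API) and simulates a crash between the two steps.

```python
from typing import Callable, List, Optional


class Broker:
    """Toy broker: a message stays pending until it is acknowledged."""

    def __init__(self, messages: List[str]) -> None:
        self._pending = list(messages)

    def deliver(self) -> Optional[str]:
        return self._pending[0] if self._pending else None

    def ack(self, message: str) -> None:
        self._pending.remove(message)


def consume_at_least_once(
    broker: Broker, handler: Callable[[str], None], crash_on_attempt: int
) -> None:
    """Handle first, ack second. A crash between the two causes a redelivery."""
    attempt = 0
    while (message := broker.deliver()) is not None:
        attempt += 1
        handler(message)
        if attempt == crash_on_attempt:
            continue  # simulated crash: work was done, ack was never sent
        broker.ack(message)


def consume_at_most_once(
    broker: Broker, handler: Callable[[str], None], crash_on_attempt: int
) -> None:
    """Ack first, handle second. A crash between the two loses the message."""
    attempt = 0
    while (message := broker.deliver()) is not None:
        attempt += 1
        broker.ack(message)
        if attempt == crash_on_attempt:
            continue  # simulated crash: acked, but never handled
        handler(message)
```

Running the first consumer against a single message with a crash on the first attempt invokes the handler twice; running the second invokes it zero times. Neither ordering gives exactly-once on its own, which is why the idempotent consumer carries the weight.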
2.3 Idempotent consumers
A consumer is idempotent if processing the same message twice produces the same effect as processing it once. Idempotency is the practical defense against at-least-once delivery’s duplicates.
→ Pattern: idempotent-consumer
Investigate:
- Walk a concrete example: “send email on signup” — what makes the consumer idempotent?
- What’s an idempotency key, and where does it live (database, cache, message header)?
- When is full idempotency impossible? (Hint: external side effects you can’t deduplicate or undo, like “post tweet.”)
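For the "send email on signup" example, a minimal sketch of an idempotent consumer might record each idempotency key in a table with a uniqueness constraint and skip any message whose key is already there. It assumes the key travels in the message body under `idempotency_key` and uses an in-memory SQLite database; the table name, message shape, and `send_email` callable are all hypothetical.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # PRIMARY KEY makes a second insert of the same key fail.
    conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
    return conn


def handle_signup(
    conn: sqlite3.Connection, message: dict, send_email: Callable[[str], None]
) -> bool:
    """Send the welcome email once per idempotency key.

    Returns True if the email was sent, False if the message was a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT INTO processed (idempotency_key) VALUES (?)",
                (message["idempotency_key"],),
            )
            # If sending fails, the insert is rolled back and a retry can succeed.
            send_email(message["email"])
    except sqlite3.IntegrityError:
        return False  # key already recorded: duplicate delivery, do nothing
    return True
```

A gap remains: if the email goes out and the commit then fails, a retry sends it again. Closing that gap needs the email provider to accept an idempotency key of its own, which is the kind of limit the last question above is pointing at.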
2.4 Dead-letter queues (DLQs)
Some messages fail repeatedly — bad input, downstream outage, bug in the consumer. Without a DLQ, they retry forever and block the queue. With one, after N retries they move to the DLQ for human inspection.
Investigate:
- What goes in a DLQ, and how do you process it back into the main queue?
- What’s the right retry count + backoff before DLQ’ing?
- Why is “alerts on DLQ growth” usually the right signal?
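Retry, backoff, and dead-lettering can be sketched together. The version below retries in-process with exponential backoff and full jitter, then appends the message to a dead-letter list. The function names, the default of 5 attempts, and the list-as-DLQ are illustrative assumptions; queue libraries typically re-enqueue with a delay rather than sleeping inside the worker.

```python
import random
import time
from typing import Any, Callable, List


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def process_with_dlq(
    message: Any,
    handler: Callable[[Any], None],
    dead_letters: List[dict],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times; dead-letter the message on final failure."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:  # a real consumer might only retry transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Keep the error alongside the message so a human can see why it failed.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return False
```

Storing the last error and attempt count with the dead-lettered message is what makes the DLQ inspectable; replaying is then a matter of fixing the cause and feeding `message` back into the main queue.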
2.5 Webhooks (outbound events)
Your service emits events to other systems via webhooks (HTTP POSTs to consumer URLs). The same delivery-semantics + idempotency concerns apply — but with HTTP failures on top.
Investigate:
- Why do webhooks need retries with backoff?
- Why is a signature header (HMAC) the standard way to authenticate webhooks?
- What’s an event-replay endpoint, and when does it earn its weight?
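A minimal sketch of HMAC signing and verification, using only the standard library. It assumes the sender signs `"<timestamp>.<body>"` with a shared secret using SHA-256 and that the receiver rejects stale timestamps to limit replay; the payload layout and the 300-second tolerance are assumptions for illustration, not a standard.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over the timestamp and the raw request body."""
    payload = str(timestamp).encode() + b"." + body
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[int] = None,
    tolerance_seconds: int = 300,
) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, timestamp, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature)
```

The receiver has to verify against the raw bytes it received, not a re-serialized JSON object, because any change in whitespace or key order changes the signature.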
2.6 Event-driven architecture (the broader pattern)
Beyond background work, “event-driven architecture” is a system design where services communicate primarily by emitting events rather than calling each other directly. Looser coupling, harder debugging.
Investigate:
- Walk a synchronous flow vs event-driven flow for the same business logic. Where does each win?
- What’s CQRS (Command Query Responsibility Segregation), and when does it fit?
- Why is event sourcing (storing every event, deriving state) usually overkill?
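To make "storing every event, deriving state" concrete, here is a toy sketch in which an account balance is never stored and is instead recomputed by folding over the event log. The `Event` type and the event kinds are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    kind: str  # "deposited" or "withdrew"
    amount: int


def balance(events: Iterable[Event]) -> int:
    """Derive current state by replaying every event in order."""
    total = 0
    for event in events:
        if event.kind == "deposited":
            total += event.amount
        elif event.kind == "withdrew":
            total -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return total
```

Replaying a prefix of the log gives the state at any past point, which is the appeal. The cost shows up in the `else` branch: every event kind ever written has to stay readable forever, which is a large part of why the pattern is often more than a service needs.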
3. TRADE-OFFS
| Decision | Options | Cost |
|---|---|---|
| Queue backend | Redis (RPUSH/BLPOP); RQ/Celery (Python); Asynq (Go); RabbitMQ; NATS; SQS; Kafka | Redis: simple, weak durability by default. Dedicated brokers (RabbitMQ, NATS): durable, more ops. Cloud (SQS): managed. Kafka: scales, complex. |
| Pub/sub backend | Redis pub/sub; NATS; Kafka; cloud Pub/Sub | Redis: simple, lossy. NATS: lightweight, durable optional. Kafka: log, full replay. Cloud: managed. |
| Delivery guarantee | At-least-once + idempotent (default); at-most-once (metrics, telemetry); claimed exactly-once | At-least-once: standard, requires idempotency. At-most-once: simple, lossy. Exactly-once: complex, expensive. |
| Schema | None (JSON blob); JSON Schema; Protobuf; Avro | None: brittle. JSON Schema: validation. Protobuf: typed + compact. Avro: schema registry-friendly. |
| Webhook auth | HMAC signature; mTLS; OAuth; nothing | HMAC: standard. mTLS: stronger, harder setup. OAuth: heavy. None: don’t ship to real consumers. |
4. TOOLS (as of 2026-06)
Queue libraries
- Python: Celery (full-featured, heavy), RQ (Redis-backed, simple), Dramatiq, Arq (async)
- Go: Asynq (Redis-backed), River (Postgres-backed), or roll your own with Redis BLPOP
Brokers
- Redis 7+ — simple, ubiquitous
- NATS — lightweight, JetStream for durability
- RabbitMQ — durable, AMQP, more ops
- Kafka — Y4 territory; mention but don’t deploy this phase
Reading
- “Designing Data-Intensive Applications” Ch. 11 (Stream Processing)
- “Enterprise Integration Patterns” (Hohpe + Woolf) — the canonical messaging patterns book
- Kafka docs (background reading; we deploy in Y4)
- “Web Hooks Best Practices” — various engineering blog posts (Stripe’s are particularly good)
5. MASTERY: Async + webhooks in your Y2 service
5.1 The deliverable
Your Y2 service now has:
- At least one background job — a real one for your service (e.g., “send welcome email after signup,” “regenerate thumbnails after upload,” “recompute counts every 5 minutes”)
- At least one outbound webhook — your service POSTs to subscriber URLs on relevant events
- Idempotent consumers with documented idempotency keys
- Dead-letter handling — failed messages after N retries land somewhere inspectable
- Webhook signature (HMAC) for authenticating outbound POSTs
- Metrics — queue depth, processing rate, DLQ count
5.2 Operational depth checklist
[ ] Pick a queue backend (Redis + RQ/Asynq is recommended for /root Y2 scale)
[ ] Implement at least one background job; verify it runs end-to-end
[ ] Make the consumer idempotent; verify by sending the same message twice
[ ] Implement retry + backoff + DLQ for failed messages
[ ] Implement outbound webhooks with HMAC signature
[ ] Add an HTTP endpoint to verify webhook delivery (the "consumer-acks-by-200" pattern)
[ ] Trigger a downstream failure (kill the webhook consumer); observe + recover via retry
[ ] Add metrics: queue depth, jobs-per-second, retry count, DLQ depth
[ ] Add a runbook for "queue stuck — investigate"
[ ] Read at least one Stripe-engineering or similar post on idempotency in payments (real-world stakes)
6. COMPARE: NATS or RabbitMQ
Pick one:
- NATS — install NATS locally; replicate one of your queues onto NATS. Observe the differences.
- RabbitMQ — install RabbitMQ; same replication exercise.
400-word reflection on when each fits vs Redis-based simplicity.
7. OPERATE
- 3-4 runbooks: “Queue stuck,” “DLQ growing,” “Webhook delivery failing,” “Idempotency-key collision”
- 1-2 ADRs (e.g., “Why Redis-based queue vs RabbitMQ for /root Y2”)
- Weekly log
8. CONTRIBUTE
- A messaging library — docs, edge cases
- A webhook-tooling project (Hookdeck, Svix — public docs welcome contributions)
- A blog post (when blog is live) on a non-obvious idempotency pattern
What ships from this phase
- Y2 service with async jobs + outbound webhooks
- Queue runbooks in ops-handbook
- Pattern OUTLINEs: message-queue, pub-sub, delivery-semantics, idempotent-consumer
Learning loop cadence
| Week | Focus | Work |
|---|---|---|
| 1 | PROBLEM + PRINCIPLES 2.1-2.2 (queue vs pub/sub, delivery) | Pick queue backend; first background job |
| 2 | PRINCIPLES 2.3 (idempotent consumers) | Idempotency keys; duplicate-message test |
| 3 | PRINCIPLES 2.4 (DLQs) | Retry + backoff + DLQ implementation |
| 4 | PRINCIPLES 2.5 (webhooks) | Outbound webhooks with HMAC |
| 5 | PRINCIPLES 2.6 (event-driven architecture) | Metrics + runbooks; COMPARE: NATS or RabbitMQ |
| 6-7 | OPERATE + CONTRIBUTE | |
Exit Test
Validation criteria
[ ] Y2 service has at least one background job + one outbound webhook
[ ] Consumers idempotent; DLQ working; webhook signatures verified
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 3-4 queue runbooks
[ ] 1-2 ADRs
[ ] Pattern entries deepened STUB → OUTLINE:
- message-queue
- pub-sub
- delivery-semantics
- idempotent-consumer
[ ] Exit Test passed
Exit Test
Time: 2.5 hours.
Part 1: Build (90 min)
Add a new background job to your service (spec provided): on event X, process Y, emit webhook Z. The job must be idempotent, implement retry with a DLQ, and sign its webhook output.
Part 2: Diagnose (45 min)
A messaging scenario is provided (e.g., “DLQ grew from 0 to 50,000 messages in the last hour”). Possible causes: downstream API outage; serialization bug; new consumer code panicking; quota exhaustion.
Part 3: Articulate (15 min)
~400 words: “Walk through the design of one of your idempotent consumers. What’s the idempotency key? Where is it stored? What happens on duplicate?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Synchronous work in request paths “because it’s simple” | Coupling tightens; one downstream slowdown makes your API slow |
| Webhook without HMAC | Anyone who learns a subscriber’s endpoint URL can spoof events from your service |
| Retries without backoff | Retries become DDoS-on-yourself |
| Queue without DLQ | Bad messages retry forever, blocking real work |
| Claiming exactly-once delivery | It’s almost always a lie. Be at-least-once + idempotent and own the truth. |
Patterns touched this phase
- message-queue: first OUTLINE
- pub-sub: first OUTLINE
- delivery-semantics: first OUTLINE
- idempotent-consumer: first OUTLINE
→ Next: Phase 15: Service Observability