Distributed Tracing
Record the path of a single request as it crosses services. OpenTelemetry, Tempo, Jaeger. The observability surface that makes microservice debugging tractable.
One request, propagated trace context, every service contributes a span. The result: a flame graph of where the latency really lived. Status: STUB — promoted to OUTLINE in Y2 Phase 15.
What this pattern is
Distributed tracing records the path of a single request as it crosses many services. Each service contributes one or more spans: units of work tagged with a start timestamp, a duration, attributes, a span ID, a parent span ID, and a shared trace ID. The trace ID propagates via the W3C Trace Context traceparent HTTP header (or gRPC metadata, or message-queue attributes). When all spans for a trace arrive at the backend (Tempo, Jaeger, Honeycomb), the system reconstructs the call graph: which service called which, how long each took, where errors occurred. OpenTelemetry (OTel) provides the standard SDKs and the OTLP wire protocol; Tempo, Jaeger, and most commercial backends can ingest OTLP.
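As a rough illustration of how a backend might reassemble spans into a call graph, here is a minimal sketch in plain Python. The Span fields, the build_tree and render helpers, and the sample data are all hypothetical; real backends such as Tempo or Jaeger store and index spans very differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique per unit of work
    parent_id: Optional[str]   # None marks the root span
    service: str
    name: str
    start_ms: int
    duration_ms: int
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Index spans by parent_id; spans may arrive in any order."""
    children = {}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(node, children, depth=0):
    """Return indented lines, one per span, children ordered by start time."""
    lines = [f"{'  ' * depth}{node.service}:{node.name} {node.duration_ms}ms"]
    for child in sorted(children.get(node.span_id, []), key=lambda s: s.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical spans for one trace, deliberately out of order.
spans = [
    Span("t1", "c", "b", "db", "SELECT orders", 15, 80),
    Span("t1", "a", None, "gateway", "GET /orders", 0, 120),
    Span("t1", "b", "a", "orders", "list_orders", 5, 110),
]
root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

The parent_id links are the only thing needed to rebuild the tree, which is why arrival order at the backend does not matter.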
The pattern is the only realistic way to debug a “the request was slow somewhere” problem across five or more services. Without traces, you’re stuck correlating logs by approximate timestamps across nodes whose clocks disagree. With traces, the flame graph is the answer. A trace’s flame graph shows every span’s start time, duration, and parent relationship, laid out visually so the slow hop is immediately apparent. What took hours of log correlation before now takes seconds.
Context propagation is the pattern’s load-bearing invariant. Every service in a request chain must extract the incoming trace context, propagate it to downstream calls, and emit its own spans with the correct parent-child relationships. Miss propagation at any hop and the trace becomes disconnected islands instead of a coherent graph. Well-designed frameworks handle propagation automatically for common transports (HTTP, gRPC, database drivers, message queues); custom transports require explicit instrumentation.
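The extract, create-span, inject cycle can be sketched by hand to show what instrumentation libraries do at each hop. This is an illustration only, assuming version 00 of the W3C traceparent header and a plain dict with lowercase header names; the extract, start_span, and inject helpers are hypothetical names, and a real service would normally rely on the OpenTelemetry propagators instead.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def extract(headers):
    """Return (trace_id, parent_span_id, flags) from incoming headers, or None."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def start_span(incoming_headers):
    """Continue the caller's trace, or start a new one if no valid context arrived."""
    ctx = extract(incoming_headers)
    if ctx is None:
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        trace_id, parent_id, flags = ctx
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # fresh ID for this service's span
        "parent_id": parent_id,
        "flags": flags,
    }


def inject(span, outgoing_headers):
    """Write this span's identity so the next hop records it as its parent."""
    outgoing_headers["traceparent"] = (
        f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    )
    return outgoing_headers


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
span = start_span(incoming)
outgoing = inject(span, {})
```

A hop that skips inject sends no traceparent downstream, so the next service starts a fresh trace ID: that is the disconnected island described above.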
Traces compose with the other observability pillars. Every log line emitted during a span can be tagged with the trace ID; every metric can be exemplar-linked to a representative trace. Once trace IDs propagate through logs and metrics, every debugging session can pivot: from a metric to the traces contributing to the anomaly, from a trace to the logs generated during the slow span, from a log to the full trace context. The three pillars become one debugging fabric.
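One way to tag log lines with the active trace ID is sketched below using only the Python standard library. The current_trace_id variable, the TraceIdFilter class, and the log format are assumptions made for illustration; OpenTelemetry ships its own logging integrations, which this does not attempt to reproduce.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" when there is none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record passing through the handler."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


# In-memory stream so the result is easy to inspect; a real service would log to stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("loading orders")      # emitted while a span is active
current_trace_id.reset(token)
logger.info("idle")                # emitted outside any request

print(stream.getvalue(), end="")
```

With the trace ID in every line, a log search by trace_id returns exactly the lines emitted during that request, which is what makes the trace-to-logs pivot possible.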
Concrete instances in the wild
- basecamp Y3+ tracing. OpenTelemetry SDKs auto-instrument HTTP, gRPC, Postgres, Redis, and Kafka. Traces flow through OTel Collector to Grafana Tempo. Every incident correlates trace IDs to logs and metrics.
- Google Dapper. The original distributed tracing paper (2010) that inspired modern tools. Google’s internal tracing system, still not open-sourced but publicly documented.
- Uber Jaeger. OSS distributed tracing backend from Uber. Widely adopted, especially in Java and Go shops.
- Zipkin. Twitter’s tracing project (2012), older than Jaeger, still in use at many organizations.
- Grafana Tempo. Object-storage-backed tracing, extremely cheap for cold storage. Native fit with the Grafana stack.
- AWS X-Ray. AWS-native distributed tracing. Integrates with API Gateway, Lambda, ECS, and application SDKs.
- Honeycomb. Event-based observability that includes traces as one query dimension. Extremely rich for high-cardinality debugging.
- Datadog APM. SaaS-based APM with distributed tracing. Dominant in enterprise deployments preferring managed services.
- New Relic. Similar shape to Datadog. Older enterprise tool with modern tracing support.
Why this pattern matters
Debugging a single-service system is straightforward: read the logs, follow the flow. Debugging a five-service system requires correlating logs across services, and correlating means reconciling timestamps that may disagree by seconds. Debugging a fifty-service system is essentially impossible without tracing. The difficulty of distributed debugging grows much faster than the service count, because every added service adds new call paths to rule out; distributed tracing turns that back into a tractable problem: one trace, one flame graph, one root cause.
The pattern also matters for performance work. Without traces, “the request is slow” is impossible to attribute. Is it the load balancer? The auth service? The database call? The downstream API? The serialization overhead? Every guess costs time and often lands on the wrong culprit. A trace shows the answer: here’s the slow span, here’s what it was doing, here’s why. Performance investigations that used to take days take minutes.
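Attributing latency to a hop usually comes down to self time: a span's duration minus the time covered by its direct children. The sketch below is one way to compute it, assuming spans are plain dicts with hypothetical field names and that timestamps across services are comparable (clock skew is ignored apart from clamping children to the parent's interval).

```python
def self_time_ms(span, spans):
    """Time inside `span` not covered by any direct child (children may overlap)."""
    start = span["start_ms"]
    end = start + span["duration_ms"]
    # Clamp each child's interval to the parent's, then walk them in start order.
    intervals = sorted(
        (max(c["start_ms"], start), min(c["start_ms"] + c["duration_ms"], end))
        for c in spans
        if c["parent_id"] == span["span_id"]
    )
    covered, cursor = 0, start
    for lo, hi in intervals:
        lo = max(lo, cursor)      # skip time already counted by an earlier child
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return span["duration_ms"] - covered


def slowest_hop(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Hypothetical 400 ms request: where did the time go?
trace = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "start_ms": 0, "duration_ms": 400},
    {"span_id": "b", "parent_id": "a", "service": "auth", "start_ms": 5, "duration_ms": 20},
    {"span_id": "c", "parent_id": "a", "service": "orders", "start_ms": 30, "duration_ms": 360},
    {"span_id": "d", "parent_id": "c", "service": "db", "start_ms": 40, "duration_ms": 330},
]
print(slowest_hop(trace)["service"])
```

In this made-up trace the gateway and orders spans are long only because they wait on their children; the database span holds almost all of the self time.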
For production incidents, traces reduce time-to-diagnosis. When an SLO burns and on-call is paged, the on-call engineer’s first move is to look at recent traces from the affected service. Aggregated trace views show error rates and latency distributions; individual traces show the specific slow requests. The engineer can drill into a slow trace, see which downstream call was slow, and either fix it directly or escalate to the owner of that downstream service. Without traces, the same diagnosis typically takes several times longer.
The pattern’s failure modes are also well-known. Untraced calls create trace gaps; the flame graph shows a mysterious 400ms of nothing. Sampling policies that discard error traces produce blind spots exactly where debugging is most needed. Head-based sampling makes the keep-or-drop decision at the start of a request, before anyone knows whether the trace will turn out to be important, so errors and slow outliers are dropped at the same rate as everything else. Many modern setups add tail-based sampling: buffer every span until the trace completes, then keep or discard the whole trace based on latency or error status.
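A tail-based decision can be sketched as a function that runs once all spans of a trace are in hand. The keep_trace name, the 500 ms threshold, the 1% baseline rate, and the dict-shaped spans are illustrative assumptions, not the configuration of any particular collector.

```python
import random


def keep_trace(spans, latency_threshold_ms=500, baseline_rate=0.01, rng=random.random):
    """Tail-based decision: runs only after all spans of one trace are buffered."""
    if any(span.get("error", False) for span in spans):
        return True  # always keep traces that contain an error
    start = min(span["start_ms"] for span in spans)
    end = max(span["start_ms"] + span["duration_ms"] for span in spans)
    if end - start >= latency_threshold_ms:
        return True  # always keep slow traces
    return rng() < baseline_rate  # keep a small random share of ordinary traces


# Hypothetical single-span traces.
fast_ok = [{"start_ms": 0, "duration_ms": 40}]
slow_ok = [{"start_ms": 0, "duration_ms": 900}]
fast_err = [{"start_ms": 0, "duration_ms": 40, "error": True}]

print(keep_trace(slow_ok), keep_trace(fast_err))
```

The cost hidden in this sketch is the buffering: every span has to be held somewhere until its trace is judged complete, which is why tail-based sampling is the more expensive option.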
Depth progression
STUB ← you are here.
OUTLINE Promoted when Y2 Phase 15 instruments a service with OTel spans;
Y3 Phase 28 deploys Tempo on basecamp.
DEEP Promoted after Y3 Phase 28 — Tempo operational with auto-instrumentation
on every basecamp service, plus at least one debugging session where the
trace was the load-bearing evidence.
Preview: what OUTLINE will answer
When Y2 Phase 15 promotes this entry to OUTLINE, it will name:
- PROBLEM. How do you debug “the request was slow somewhere” across a distributed system?
- PRINCIPLES. Every request has a globally unique trace ID. Context propagates across service boundaries via standard headers. Each service contributes at least one span. Parent-child span relationships form a tree. Traces correlate with logs and metrics via shared IDs.
- TRADE-OFFS. Head-based sampling (cheap, but blind to how the request turns out) vs tail-based (expensive to buffer, but can keep every error and slow trace). Instrumentation via SDK (accurate, requires code) vs eBPF-based (zero code changes, less semantic detail). Retention of full traces (expensive) vs summary metrics (loses individual request detail).
- TOOLS (time-stamped as of 2026-06): OpenTelemetry (SDK + protocol standard), Grafana Tempo (OSS, object storage), Jaeger (OSS), Zipkin (OSS, older), AWS X-Ray, Datadog APM, Honeycomb, New Relic.
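For contrast with tail-based sampling, a head-based decision can be sketched as a pure function of the trace ID, so every service reaches the same verdict without coordination. The head_sample helper and its choice of the last 16 hex digits are assumptions made for illustration; this is not a description of how any specific SDK sampler is implemented.

```python
def head_sample(trace_id, rate):
    """Decide at request start, from the trace ID alone.

    Treats the last 16 hex digits as a 64-bit number and keeps the trace
    when that number falls below rate * 2**64.
    """
    bound = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < bound


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, 0.5), head_sample(trace_id, 0.7))
```

Nothing about the request's outcome enters the decision, which is what makes it cheap and also why it cannot favor errors or slow requests.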
The DEEP promotion, after Y3 Phase 28 with Tempo operational, will add MASTERY (operating tracing across basecamp for months), COMPARE (Tempo vs Jaeger vs Honeycomb), OPERATE (a specific debugging session where the trace pointed at the root cause), and CONTRIBUTE (an OTel instrumentation contribution or docs improvement).
Canonical references
- Google Dapper paper (2010). The original that seeded modern distributed tracing.
- OpenTelemetry documentation. Free at opentelemetry.io.
- W3C Trace Context specification. The propagation format standard.
- Cindy Sridharan, Distributed Systems Observability. Modern canonical text on observability including traces.
- Yuri Shkuro, Mastering Distributed Tracing (2019). Written by the creator of Jaeger; deep on tracing internals.