Three Pillars (Logs, Metrics, Traces)

Logs: what happened. Metrics: how much, how often, how fast. Traces: the path of a single request. Together these are the three observability surfaces every production service needs.

Logs explain individual events. Metrics summarize trends. Traces show the request’s path across services. Together they cover the debugging surface. Status: STUB, to be promoted to OUTLINE in Y2 Phase 15.

What this pattern is

The three pillars (logs, metrics, traces) are the canonical observability surfaces. Logs are timestamped records of discrete events, written either as human-readable text or as machine-readable structured data such as JSON: what specifically happened, with context. Metrics are numeric aggregations over time: request rate, error rate, p99 latency, queue depth, GPU utilization. Traces are the request-scoped record of how one logical operation propagated through services: what called what, how long each hop took, where the latency lived.
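To make the three definitions concrete, here is a minimal sketch of one request handler emitting all three signals. It uses only the Python standard library; the in-memory sinks and the names handle_request, LOG_LINES, REQUEST_COUNTS, and SPANS are hypothetical stand-ins for illustration, not the OpenTelemetry API or any real backend.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for a log store, a metrics
# store, and a trace store. A real service would export these instead.
LOG_LINES = []
REQUEST_COUNTS = Counter()
SPANS = []


def handle_request(route, work):
    """Run `work` and emit one log line, one metric increment, and one span."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the three signals together
    start = time.perf_counter()
    status = "ok"
    try:
        work()
    except Exception:
        # Swallowed to keep the sketch short; a real handler would re-raise.
        status = "error"
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Log: one discrete event with context, as structured JSON.
    LOG_LINES.append(json.dumps({
        "ts": time.time(),
        "level": "INFO" if status == "ok" else "ERROR",
        "msg": "request finished",
        "route": route,
        "status": status,
        "trace_id": trace_id,
    }))
    # Metric: a numeric aggregate keyed by low-cardinality labels only.
    REQUEST_COUNTS[(route, status)] += 1
    # Span: the request-scoped timing record.
    SPANS.append({"trace_id": trace_id, "name": route, "duration_ms": duration_ms})
    return trace_id


def _fail():
    raise RuntimeError("boom")


ok_trace = handle_request("/checkout", lambda: None)
handle_request("/checkout", _fail)
print(dict(REQUEST_COUNTS))
```

The log line and the span both carry the trace ID, while the metric deliberately does not: putting a per-request ID on a metric would create one time series per request.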

Each pillar answers a different question. Logs answer “what happened at this exact moment?” — the granular record when you need to understand a specific event. Metrics answer “how often is this happening?” and “how has this trended?” — the aggregate view for capacity planning and anomaly detection. Traces answer “where in this specific request did time go?” — the request-scoped debugging view for distributed systems. All three are needed because none of them can answer the others’ questions well.

The pillars compose. A useful debugging session starts with metrics (some SLO is burning), narrows to a specific time window and service (traces from that window show which downstream call was slow), and drills into logs from the slow span (the log message reveals the root cause). Each pillar hands off to the next. Systems that instrument only one pillar force debugging to happen entirely within that pillar’s limitations.
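The metrics-to-traces-to-logs handoff described above can be sketched in a few lines. Everything here is assumed for illustration: the telemetry values are made up, the record shapes are hypothetical, and drill_down is not part of any real tool. In practice each step would be a query against the pillar's own backend.

```python
# Hypothetical telemetry for one incident window; all values are invented.
ERROR_RATE_BY_MINUTE = {"12:00": 0.01, "12:01": 0.02, "12:02": 0.31, "12:03": 0.04}
SPANS = [
    {"trace_id": "t1", "span_id": "s1", "minute": "12:02", "name": "GET /cart", "duration_ms": 40},
    {"trace_id": "t2", "span_id": "s2", "minute": "12:02", "name": "db.query", "duration_ms": 2900},
    {"trace_id": "t3", "span_id": "s3", "minute": "12:03", "name": "db.query", "duration_ms": 35},
]
LOGS = [
    {"trace_id": "t2", "span_id": "s2", "msg": "lock wait timeout on table orders"},
    {"trace_id": "t1", "span_id": "s1", "msg": "cart loaded"},
]


def drill_down(error_rates, spans, logs, slo=0.05):
    """Walk metrics -> traces -> logs and return what each step found."""
    # Step 1 (metrics): find the worst minute that breaches the SLO threshold.
    breaches = {minute: rate for minute, rate in error_rates.items() if rate > slo}
    if not breaches:
        return None
    minute = max(breaches, key=breaches.get)

    # Step 2 (traces): pick the slowest span recorded in that window.
    window = [s for s in spans if s["minute"] == minute]
    if not window:
        return None
    slowest = max(window, key=lambda s: s["duration_ms"])

    # Step 3 (logs): pull the log lines attached to that span.
    messages = [entry["msg"] for entry in logs if entry["span_id"] == slowest["span_id"]]
    return {"minute": minute, "span": slowest["name"], "logs": messages}


finding = drill_down(ERROR_RATE_BY_MINUTE, SPANS, LOGS)
print(finding)
```

The handoff only works because spans and log lines share identifiers. Without the span_id on the log entry, step 3 would fall back to guessing by timestamp.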

The de facto standard for emission is OpenTelemetry (OTel), paired with a backend per pillar. A common open-source stack is Prometheus (metrics), Tempo (traces), and Loki (logs), with Grafana as the shared UI. On basecamp by Y3 Phase 28 every service emits all three, all flow into this one stack, and dashboards correlate them via shared trace IDs.

Concrete instances in the wild

basecamp Y3+ observability. OpenTelemetry SDKs in every service; OTel Collector aggregates and forwards; Prometheus for metrics; Tempo for traces; Loki for logs; Grafana as the unified UI. All three pillars reachable through one browser tab.
Netflix. An early adopter of the three-pillars pattern at hyperscale. Atlas for metrics, Mantis for real-time streaming of operational events, Edgar for distributed tracing. Public engineering blog posts describe the setup.
Jaeger (Uber). Uber’s open-source distributed tracing backend, now a CNCF project, and one of the systems that brought the ideas of Google’s Dapper paper into wide use.
Grafana Stack (LGTM). Loki + Grafana + Tempo + Mimir — Grafana Labs’s answer to the three pillars, all OSS.
Elastic Stack (ELK). Elasticsearch + Logstash + Kibana + APM. Older stack, still widely used, especially for logs.
Splunk. Enterprise log analytics, historically dominant for logs, expanded to metrics and traces via acquisitions.
Datadog. SaaS-based unified three-pillars observability. Widely used in enterprise deployments that prefer a managed service.
Honeycomb. Event-based observability that arguably transcends the three-pillars framing (events with high-cardinality attributes cover all three use cases with one primitive).
AWS CloudWatch. AWS-native three-pillars: CloudWatch Logs, CloudWatch Metrics, X-Ray for traces.

Why this pattern matters

Systems built without observability produce a specific and painful failure mode: something breaks in production, and the engineers investigating have no way to reconstruct what happened. Was it a slow database query? A network partition? A downstream service failing? Without instrumentation, the answer is guesswork. Add the three pillars (logs, metrics, traces) and an incident becomes something you can reconstruct from evidence rather than guess at.

The pattern matters more as service count grows. A single monolithic service can be debugged with logs alone; you can grep for the request ID and read the log lines. A distributed system with 20 services can’t. Requests cross service boundaries, logs disperse across nodes, timestamps disagree, and “what happened during this request” becomes a puzzle you can only solve with distributed tracing. The three pillars pattern is what makes distributed debugging tractable.

Modern platforms make the pattern cheap to adopt. OpenTelemetry auto-instrumentation covers common HTTP frameworks, database drivers, and RPC libraries with little or no code change in languages such as Java, Python, .NET, and Node.js. Grafana Cloud’s free tier includes a three-pillars starter. Prometheus operators, Loki Helm charts, and Tempo operators install with one command. What used to be a serious engineering project (build your own metrics collection, log aggregation, distributed tracing) is now close to a checkbox.

The failure mode: instrumentation without discipline. Adding logs, metrics, and traces without thinking about what to instrument produces noise instead of signal. Log every function entry and exit; you get millions of lines you’ll never read. Emit metrics for every internal state; you get dashboards nobody looks at. Trace every span with 100 attributes; you get expensive storage and slow queries. Real observability is deliberate: which questions do we need to answer? What instrumentation answers them? Everything else is cost without benefit.
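The cost of undisciplined metrics can be estimated with simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses made-up label counts, and series_count is a hypothetical helper, not a function from any metrics library.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities.

    Real counts are usually lower, since not every label combination occurs.
    """
    return prod(label_cardinalities.values())


# Deliberate labels: a few bounded dimensions that answer a known question.
deliberate = {"route": 20, "method": 4, "status_class": 5}

# Careless labels: the same metric with an unbounded per-user label added.
careless = dict(deliberate, user_id=100_000)

print(series_count(deliberate))  # 20 * 4 * 5 = 400
print(series_count(careless))    # 400 * 100_000 = 40_000_000
```

One unbounded label turns a metric that fits on a dashboard into tens of millions of potential series. Per-user or per-request detail belongs in logs or span attributes, where it is stored per event instead of per series.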

Depth progression

STUB     ← you are here.
OUTLINE  Promoted when Y2 Phase 15 (service observability) instruments a service
         in all three pillars.
DEEP     Promoted after Y3 Phase 28 — full OTel + Prom + Tempo + Loki stack
         on basecamp, with at least one debugging session that correlated across
         pillars.

Preview: what OUTLINE will answer

When Y2 Phase 15 promotes this entry to OUTLINE, it will name:

PROBLEM. How do you give operators the information needed to reconstruct what happened during any production event?
PRINCIPLES. Each pillar answers distinct questions. Instrumentation is deliberate, not exhaustive. Correlate via shared trace IDs. Emit at the service boundary; enrich in the collector; store in pillar-appropriate backends.
TRADE-OFFS. Cardinality (high-cardinality metrics and traces cost storage) vs granularity. Sampling (cheaper storage) vs completeness (every request captured). Structured logging (queryable) vs unstructured (developer-familiar). Push vs pull metrics collection.
TOOLS (time-stamped as of 2026-06): OpenTelemetry (emission standard), Prometheus + Grafana (metrics OSS), Loki (logs OSS), Tempo (traces OSS), Elastic Stack (older but common), Datadog (SaaS), Honeycomb (event-based), AWS CloudWatch (cloud native).
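
The sampling trade-off listed above can be illustrated with a deterministic, trace-ID-keyed sampler. This is a hand-rolled sketch under stated assumptions, not OpenTelemetry's own sampler implementation: keep_trace, the SHA-256 bucketing, and the 10% ratio are all chosen here for illustration.

```python
import hashlib


def keep_trace(trace_id: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID.

    Because the decision depends only on the trace ID, every service that
    sees the same ID makes the same choice, so a trace is kept or dropped whole.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio


# Hypothetical trace IDs, used only to show the kept fraction.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

At a 10% ratio, storage drops by roughly a factor of ten, but nine in ten requests leave no trace at all, which is the completeness side of the trade-off. A rare failing request is more likely to be missing than present.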

The DEEP promotion, after Y3 Phase 28, will add MASTERY (operating the full stack for months on basecamp), COMPARE (Prometheus + Grafana vs Datadog vs Honeycomb for the same workload), OPERATE (a real debugging session that correlated across all three pillars), and CONTRIBUTE (an OTel or Grafana docs contribution).

Canonical references

Cindy Sridharan, Distributed Systems Observability (2018) — the modern canonical treatment. Free at O’Reilly’s radar site.
OpenTelemetry documentation. Free at opentelemetry.io.
Google Dapper paper (2010) — the original distributed tracing paper that seeded modern tracing tools.
Charity Majors, Liz Fong-Jones, and George Miranda, Observability Engineering (2022). Argues for the “event-based” alternative to the three-pillars framing. Worth reading for the counterpoint.
Google’s Site Reliability Engineering — chapters on monitoring philosophy and the difference between white-box and black-box monitoring.