Observability at Platform Depth
Phase 28 of /root Year 3: eBPF observability, OpenTelemetry Collector, Prometheus + Grafana + Tempo + Loki on basecamp, SLO discipline as platform contract. Y2's service observability becomes Y3's platform observability. 6-8 weeks, ~70-90 hours.
Twelfth phase of Year 3.
Y2 Phase 15 instrumented services from inside. This phase consumes that telemetry at platform scale: OpenTelemetry Collector, Prometheus, Grafana, Tempo (traces), Loki (logs), and eBPF-based observability via Cilium Hubble (already available from the Phase 25 mesh). By the end of the phase, basecamp runs the full observability stack, every service emits standard signals, every team has SLO contracts, and alerts are tied to user-visible outcomes.
This phase reaches DEEP on three major patterns: SLI-SLO-error-budget, three-pillars, distributed-tracing. The discipline pays interest for Years 4-5 when ML platforms emit very different telemetry.
Prerequisites
- Phase 27 complete; secrets infrastructure operational
- 12 hrs/week budget reserved
Why this phase exists
A platform without observability is a black box. Senior engineers don’t operate black boxes — they operate systems where every layer’s behavior is visible, measurable, and tied to user-visible outcomes. This phase installs that visibility at platform scale.
The pattern-first frame
Same eight steps.
1. PROBLEM
basecamp runs services across multiple clouds. Each service emits logs, metrics, traces. You need to: collect them centrally, query them efficiently, correlate them across services, define SLOs that map to user experience, alert when SLOs erode.
That’s the platform observability problem. The OSS stack (Prometheus + Grafana + Loki + Tempo + OTel Collector) is one solution. Datadog, Honeycomb, New Relic are managed alternatives.
2. PRINCIPLES
2.1 The three pillars at platform scale
Logs (high cardinality, debuggable), metrics (low cardinality, aggregable), traces (cross-service causal). Each pillar answers different questions.
→ Pattern: three-pillars — DEEP target
Investigate:
- For a production incident, which pillar do you reach for first?
- Why are metrics good for SLOs but bad for individual debugging?
- What’s the cardinality problem at platform scale?
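One way to make the cardinality question concrete is a back-of-the-envelope calculation: the worst-case number of time series for a metric is the product of the distinct values of each label. The sketch below assumes hypothetical label names and counts, chosen only for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each unique combination of label values is its own series, so the
    bound is the product of the distinct values per label.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-duration metric.
bounded = {"service": 20, "route": 30, "status_class": 5}
unbounded = {**bounded, "user_id": 100_000}  # one unbounded label added

print(worst_case_series(bounded))    # 3000
print(worst_case_series(unbounded))  # 300000000
```

A single unbounded label multiplies the series count, which is why identifiers such as user or request IDs usually belong in logs and traces rather than in metric labels.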
2.2 SLI / SLO / error budget as platform contract
Service Level Indicator: what you measure. Service Level Objective: the target. Error budget: the gap between the target and 100%, i.e. how much unreliability the SLO permits (a 99.9% SLO leaves a 0.1% budget).
→ Pattern: sli-slo-error-budget — DEEP target
Investigate:
- Pick an SLI for triage. Defend it. Pick the SLO.
- What does spending the error budget look like operationally?
- Why are SLOs more useful than SLAs as internal contract?
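The arithmetic behind an error budget is small enough to sketch directly. The example below assumes a request-based SLI (good requests over total requests) and a 30-day window; both are illustrative choices, not something this phase prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO is blown.
    """
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


print(round(error_budget_minutes(0.999), 1))                  # 43.2
print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))  # 0.5
```

With a 99.9% target, 500 failed requests out of a million have spent half the budget; the remaining half is what the team can still spend on risky changes within the window.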
2.3 OpenTelemetry as the standard
OTel SDKs emit; OTel Collector receives, processes, routes to backends. The Collector is the buffer between services and storage.
Investigate:
- What do the OTel Collector's receivers, processors, and exporters do?
- Why is the Collector a multi-tenant concern?
- Semantic conventions: why do they matter?
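The receive, process, export flow can be pictured with a toy model. This is plain Python written for illustration, not the Collector's API or configuration format; the record shape, the `/healthz` route, and the `cluster` attribute are all hypothetical.

```python
from typing import Callable, Iterable

Record = dict  # a telemetry record, simplified to a plain dict
Processor = Callable[[list[Record]], list[Record]]
Exporter = Callable[[list[Record]], None]


def drop_health_checks(batch: list[Record]) -> list[Record]:
    """Processor: filter out noise before it reaches storage."""
    return [r for r in batch if r.get("route") != "/healthz"]


def add_cluster_attribute(batch: list[Record]) -> list[Record]:
    """Processor: enrich every record with a platform-level attribute."""
    return [{**r, "cluster": "basecamp"} for r in batch]


def run_pipeline(
    received: Iterable[Record],
    processors: list[Processor],
    exporters: list[Exporter],
) -> list[Record]:
    """Apply processors in order, then fan the result out to every exporter."""
    batch = list(received)
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


stored: list[Record] = []  # stands in for a backend such as Tempo or Loki
out = run_pipeline(
    received=[{"route": "/healthz"}, {"route": "/orders", "status": 500}],
    processors=[drop_health_checks, add_cluster_attribute],
    exporters=[stored.extend],
)
print(out)
```

Because every service's telemetry passes through the same processors, a change to one of them affects all tenants at once, which is one way to approach the multi-tenancy question above.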
2.4 eBPF for zero-code observability
eBPF programs see kernel + syscall events. Cilium Hubble (Phase 25) gives you L7 observability without code changes. Parca and Pyroscope cover continuous profiling.
Investigate:
- What does Hubble see that OTel doesn’t?
- When does eBPF observability replace SDK-based instrumentation, and when does it augment it?
- What’s continuous profiling, and when does it earn its weight?
2.5 Distributed tracing at platform depth
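Traces follow requests across services. Head-based sampling decides at the start of a trace, before the outcome is known, so it keeps or drops traces without regard to whether they turn out slow or failed. Tail-based sampling decides after the trace completes, so it can keep the interesting (slow, error) traces and drop most of the fast, successful ones.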
→ Pattern: distributed-tracing — DEEP target
Investigate:
- Head-based vs tail-based sampling — when each?
- What’s the trace ID propagation strategy across the mesh?
- When does distributed tracing reveal what metrics + logs can’t?
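The difference between the two sampling strategies comes down to what information is available when the decision is made. The sketch below is a simplified illustration: the hash-based ratio check is a stand-in, not the algorithm any OTel SDK or Collector uses, and the span fields (`trace_id`, `duration_ms`, `error`) are hypothetical.

```python
import hashlib


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span finishes.

    Hashing makes the decision deterministic, so every service that sees
    the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < ratio


def tail_sample(
    spans: list[dict], latency_threshold_ms: float, baseline_ratio: float
) -> bool:
    """Tail-based: decide after the whole trace has been buffered."""
    if not spans:
        return False
    if any(s["error"] for s in spans):
        return True  # always keep failed traces
    if max(s["duration_ms"] for s in spans) >= latency_threshold_ms:
        return True  # always keep slow traces
    # Fast, successful traces: keep only a small baseline sample.
    return head_sample(spans[0]["trace_id"], baseline_ratio)


failed = [{"trace_id": "t1", "duration_ms": 12.0, "error": True}]
slow = [{"trace_id": "t2", "duration_ms": 950.0, "error": False}]
print(tail_sample(failed, latency_threshold_ms=500.0, baseline_ratio=0.0))  # True
print(tail_sample(slow, latency_threshold_ms=500.0, baseline_ratio=0.0))    # True
```

The cost of the tail-based version is visible in its signature: it needs the full list of spans, which means something has to buffer every trace until it completes.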
2.6 Alert design (SLO-driven, not metric-driven)
Bad alerts: “CPU above 80%.” Good alerts: “Error budget burning faster than X.” Alert on the outcome, not the cause.
Investigate:
- Why is “CPU above 80%” almost never the right alert?
- What’s “alert fatigue,” and how does SLO-driven alerting prevent it?
- Multi-window multi-burn-rate alerts: what + when?
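A burn rate is the observed error ratio divided by the ratio the SLO allows; a rate of 1 spends exactly the whole budget over the SLO period. The sketch below pairs a long and a short window, as multi-window alerts do. The 14.4 threshold is the paging value commonly cited from the Google SRE Workbook for a 1-hour window on a 30-day SLO; treat it, and the window sizes, as assumptions to tune.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float) -> float:
    """Fraction of the period's error budget spent at this rate over the window."""
    return rate * window_hours / period_hours


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo) >= threshold
        and burn_rate(short_window_error_ratio, slo) >= threshold
    )


# 2% errors against a 99.9% SLO is a burn rate of about 20.
print(should_page(0.02, 0.02, slo=0.999))    # True: ongoing fast burn
print(should_page(0.02, 0.0005, slo=0.999))  # False: already recovered
print(round(budget_consumed(14.4, window_hours=1, period_hours=720), 3))  # 0.02
```

The last line shows where the threshold comes from: sustaining a 14.4 burn rate for one hour spends 2% of a 30-day budget.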
3. TRADE-OFFS
| Decision | Options | Cost |
|---|---|---|
| Backend | OSS stack (Prom + Grafana + Loki + Tempo); managed (Datadog, Honeycomb); cloud-native | OSS: ops cost, max control. Managed: convenience, vendor cost. Cloud: integrated, lock-in. |
| Sampling | None (100%); fixed %; tail-based | None: storage explosion. Fixed: blind to outliers. Tail: complex, best signal. |
| SLO tooling | Pyrra; Sloth; OpenSLO; manual | Pyrra/Sloth: declarative. Manual: drift. |
4. TOOLS (as of 2026-06)
- OpenTelemetry Collector (K8s-native; deployed via Helm + Flux)
- Prometheus (and kube-prometheus-stack)
- Grafana
- Loki — logs
- Tempo — traces (or Jaeger as alternative; both speak OTel)
- Cilium Hubble — eBPF observability (from Phase 25)
- Pyrra or Sloth — SLO definitions as CRDs (K8s-native; YAML in Git, reconciled by their operators)
- Pyroscope / Parca — continuous profiling
The whole observability stack is K8s-native: every component is a CRD-driven controller or a HelmRelease reconciled by Flux. The OTel Collector itself runs as a DaemonSet + Deployment; its configuration is a Kubernetes ConfigMap reconciled like any other resource. The “pillars” are emitted by services, collected uniformly, and stored in components that are all deployed and configured through the K8s API.
Reading
- “Observability Engineering” (Charity Majors et al.)
- “Implementing Service Level Objectives” (Hidalgo)
- OTel docs — semantic conventions
- Google SRE Book — SLO chapters
5. MASTERY: Full observability on basecamp
[ ] OTel Collector deployed on basecamp; receives from all services
[ ] kube-prometheus-stack collecting metrics
[ ] Loki collecting logs
[ ] Tempo collecting traces
[ ] Grafana dashboards: per-service RED + per-cluster USE + platform-wide SLO overview
[ ] At least 3 services with proper SLO definitions in Pyrra/Sloth
[ ] Multi-burn-rate alerts wired; trigger one deliberately and recover
[ ] Cilium Hubble L7 service map operational
[ ] Continuous profiling for at least one service (Pyroscope)
[ ] Tail-based sampling configured for distributed traces
6. COMPARE: Honeycomb or Datadog
Free tier; route OTel data there. Compare query model + insight latency vs OSS local.
400-word reflection.
7. OPERATE
- 4-5 runbooks: high-cardinality metric explosion, SLO burn alert firing, trace gap diagnosis, OTel Collector OOM, Grafana datasource lag
- 2-3 ADRs (OSS stack over managed, sampling strategy, SLO ownership model)
- Weekly log
8. CONTRIBUTE
- OTel semantic conventions
- Pyrra / Sloth rule definitions
- A blog post on a real observability debugging story
What ships from this phase
- Tier 2 of basecamp deepened: full observability stack
- SLO contracts for 3+ services
- Observability runbooks
Validation criteria
[ ] OTel + Prom + Grafana + Loki + Tempo all running on basecamp
[ ] 3+ services with SLO definitions
[ ] Multi-burn-rate alerts wired
[ ] Continuous profiling for one service
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 4-5 runbooks
[ ] 2-3 ADRs
[ ] Pattern entries:
- sli-slo-error-budget → DEEP
- three-pillars → DEEP
- distributed-tracing → DEEP
[ ] Exit Test passed
Exit Test
Time: 3 hours.
Part 1: Build (90 min)
Add SLO + dashboards + alerts for a new service in basecamp. Verify the SLO burn-rate alert fires when you deliberately degrade the service.
Part 2: Diagnose (60 min)
A latency scenario: p99 for one service jumped 10× starting an hour ago. Use traces + logs + metrics together to identify root cause.
Part 3: Articulate (30 min)
~800 words: “Walk the design of basecamp’s observability pipeline. Where each pillar is collected, processed, stored, queried. Where SLOs sit. Where alerts originate. Cite the patterns.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Alerting on CPU > 80% | Resource symptom, not user outcome |
| Free-form logs in production | Unqueryable; grep-by-archaeology |
| Logging every request body | Drowns signal; PII risk |
| Uniform sampling at low % | Loses outliers, which are exactly the traces you want to keep |
| Dashboards as “real observability” | Dashboards show known unknowns. Observability is unknown unknowns. |
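As a counterpart to the free-form logging anti-pattern, here is a minimal sketch of structured logging using only the Python standard library. The logger name, the `fields` key passed through `extra`, and the field values are hypothetical; a real service would more likely use an OTel-aware logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream: io.TextIOBase) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment failed", extra={"fields": {"trace_id": "abc123", "status": 502}})
print(buffer.getvalue().strip())
```

Each line is now queryable by field, and carrying the trace ID in every record is what lets logs be correlated with traces during an incident.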
Patterns touched this phase
- sli-slo-error-budget → DEEP
- three-pillars → DEEP
- distributed-tracing → DEEP
- structured-logging → reinforced