Observability + eBPF

First phase of Year 3. Observability tooling at depth — the kernel-level visibility eBPF gives you, plus the discipline of three-pillars + cardinality + runbook-as-code. SLI/SLO discipline from Y2 P12 gets real telemetry to back it up. ~8 weeks, ~100 hrs.

Prerequisites

Year 2 complete — multi-cloud basecamp, SLOs already defined

Hardware: 64GB upgrade installed (Month-25 milestone; see homelab/hardware)

You accept: observability at depth means kernel-level events. eBPF makes the kernel programmable from userspace; Cilium uses it for networking + observability; the same primitive powers tools you’ll see in Y4-Y5.

Why this phase exists

Year 2’s SLI/SLO work was discipline first, telemetry second. You knew what you’d promise users; you didn’t yet have the instruments to prove you were keeping the promise. Year 3 lands the instruments: structured logs (Loki), metrics with cardinality control (Prometheus + Mimir), distributed traces (Tempo via OTel), and kernel-level events (Cilium Hubble + bpftrace). Together they’re what makes Year 4’s ML observability and Year 5’s services/aiops/ even thinkable — the agent that operates the platform needs a platform whose state is legible.

This phase also formalizes runbook-as-code — the discipline that’s been building since Year 1, Phase 1. By phase end, every basecamp service has runbooks that are tested by handing them to Claude in “play the runner” mode. If Claude can’t execute the runbook from the text alone, the runbook is broken — not Claude.

The Master Plan names the Y2→Y3 transition as platform-as-tool → platform-as-product. Observability is what makes the transition real. Tools are operated; products are understood. You can’t sell a platform whose failures you can only describe in past tense.

1. PROBLEM

You have 20+ services across 3 clusters across 2 clouds. Something breaks. The SLO burn-rate alert fires. You need to know:

What’s broken? (logs)
How bad? (metrics)
Where in the request path? (traces)
What did the kernel see at the moment of failure? (eBPF events)

Three pillars (logs + metrics + traces) cover most cases. eBPF covers the rest — when the issue is below the application: socket-level, page-cache, scheduler, syscall. Without eBPF you’re guessing at kernel behavior from userspace symptoms; with it, you read the kernel directly.

→ See: observability-and-ops

2. PRINCIPLES

2.1 Three pillars + unified telemetry

Logs answer “what happened?”, metrics answer “how often + how much?”, traces answer “where in the request path?”. OpenTelemetry unifies them with shared correlation IDs — a single trace_id propagates from ingress through every span, log line, and exemplar.

→ Pattern: three-pillars-and-unified-telemetry

Investigate:

For one request to triage, capture: structured log line (slog → Loki), metric (Prometheus counter), trace (OTel → Tempo). Correlate via trace_id.
Why is correlation across pillars the hard part? (Hint: clock skew, sampling decisions, log shipper buffering.)
Read OpenTelemetry’s “Why OTel?” docs — what was wrong with the pre-OTel world of vendor-specific agents?
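
One way to make the first correlation step concrete is a minimal sketch of a slog handler that stamps trace_id onto every log line. It assumes the trace ID travels in the request context under a hypothetical key (traceKey); a real service would read it from the active OTel span context instead. The timestamp is dropped only so the demo output is deterministic.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

// traceKey is a hypothetical context key. A real service would read the
// trace ID from the active OTel span context instead.
type traceKey struct{}

// traceHandler wraps another slog.Handler and adds trace_id to every record
// whose context carries one.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceKey{}).(string); ok && id != "" {
		r = r.Clone() // avoid mutating state shared with other copies of the record
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the result so derived loggers keep the behaviour.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.Handler.WithGroup(name)}
}

// dropTime removes the timestamp so the demo output is deterministic.
func dropTime(groups []string, a slog.Attr) slog.Attr {
	if len(groups) == 0 && a.Key == slog.TimeKey {
		return slog.Attr{}
	}
	return a
}

// LogLine logs msg at INFO with the given trace ID in the context and
// returns the JSON line that would be shipped to Loki.
func LogLine(traceID, msg string) string {
	var buf bytes.Buffer
	inner := slog.NewJSONHandler(&buf, &slog.HandlerOptions{ReplaceAttr: dropTime})
	logger := slog.New(traceHandler{inner})
	ctx := context.WithValue(context.Background(), traceKey{}, traceID)
	logger.InfoContext(ctx, msg)
	return buf.String()
}

func main() {
	fmt.Print(LogLine("4bf92f3577b34da6a3ce929d0e0e4736", "ticket triaged"))
}
```

Because the field lands in the JSON body rather than in a Loki label, it stays queryable without adding a high-cardinality stream label.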

2.2 Cardinality as cost

High-cardinality labels (user_id, request_id, trace_id-on-metric) explode metric storage. The discipline: structured logs for high-cardinality fields; metrics for low-cardinality aggregates; exemplars to bridge the two.

→ Pattern: cardinality-as-cost

Investigate:

Calculate the Prometheus storage cost of http_requests_total{user_id="..."} with 100k users, then don’t ship that label.
When IS user_id-on-metric OK? (Spoiler: rarely; almost always log-and-aggregate.)
Build a cardinality budget per service: “no more than N active series.” Enforce in CI via promtool.
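
The arithmetic behind the first and third items can be sketched as below. The label sets (4 methods, 5 status codes), the 15s scrape interval, and the 10,000-series budget are illustrative assumptions, not basecamp values.

```go
package main

import "fmt"

// SeriesCount returns the worst-case number of active series for one metric:
// the product of the number of distinct values each label can take.
func SeriesCount(labelValues map[string]int64) int64 {
	n := int64(1)
	for _, v := range labelValues {
		n *= v
	}
	return n
}

// SamplesPerDay returns how many samples those series produce per day
// at the given scrape interval.
func SamplesPerDay(series, scrapeIntervalSeconds int64) int64 {
	return series * (86400 / scrapeIntervalSeconds)
}

// WithinBudget reports whether a service stays inside its series budget.
func WithinBudget(series, budget int64) bool {
	return series <= budget
}

func main() {
	// Hypothetical label cardinalities.
	withUser := SeriesCount(map[string]int64{"user_id": 100_000, "method": 4, "status": 5})
	withoutUser := SeriesCount(map[string]int64{"method": 4, "status": 5})

	fmt.Println("series with user_id:   ", withUser)
	fmt.Println("series without user_id:", withoutUser)
	fmt.Println("samples/day with user_id at 15s:", SamplesPerDay(withUser, 15))
	fmt.Println("within 10k budget:", WithinBudget(withUser, 10_000), WithinBudget(withoutUser, 10_000))
}
```

Under these assumptions the label turns 20 series into 2,000,000, or about 11.5 billion samples a day. At the 1 to 2 bytes per sample the Prometheus storage docs cite as an average, that is roughly 11.5 to 23 GB a day for one metric, before counting the memory each active series holds.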

2.3 eBPF: kernel as observable surface

eBPF lets you attach small programs to kernel hooks (syscalls, network, scheduler) and emit events to userspace. The in-kernel verifier rejects any program it cannot prove memory-safe and bounded before it loads; the JIT compiles what passes to native code, which keeps it fast. Used by Cilium, Pixie, Falco, Hubble, Tetragon: same primitive, different observability lens.

Investigate:

Run bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }' for 30s, then press Ctrl-C so bpftrace prints the per-process counts; identify what’s opening files.
Cilium Hubble: see L3/L4 flows between pods, plus L7 (HTTP, DNS) where L7 visibility is enabled, without sidecars.
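Why is eBPF safer than kernel modules? (The verifier checks the program before load and it runs sandboxed with a restricted set of helpers, so a buggy probe is rejected rather than crashing the kernel; the program is proven safe, not “trusted because the vendor said so”.)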
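
For the bpftrace step, a small helper can rank the result. This is a sketch that assumes the usual `@[comm]: count` lines bpftrace prints for a map on exit; the sample input in main is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// CommCount is one row of a bpftrace map keyed by process name.
type CommCount struct {
	Comm  string
	Count int
}

// ParseCounts extracts "@[comm]: N" lines from bpftrace output and returns
// them sorted by count, highest first. Lines in any other shape are skipped.
func ParseCounts(output string) []CommCount {
	var rows []CommCount
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "@[") {
			continue
		}
		end := strings.LastIndex(line, "]: ")
		if end < 0 {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(line[end+3:]))
		if err != nil {
			continue
		}
		rows = append(rows, CommCount{Comm: line[2:end], Count: n})
	}
	sort.SliceStable(rows, func(i, j int) bool { return rows[i].Count > rows[j].Count })
	return rows
}

func main() {
	// Made-up sample output for illustration.
	sample := "Attaching 1 probe...\n\n@[sshd]: 3\n@[dockerd]: 41\n@[kubelet]: 120\n"
	for _, r := range ParseCounts(sample) {
		fmt.Printf("%-10s %d\n", r.Comm, r.Count)
	}
}
```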

2.4 Logs at scale (Loki)

Logs as a stream, not a database. Loki indexes only the labels that identify a stream; searching message text means scanning the chunks of the streams you selected. The trade-off: cheap ingest, slower full-text search than Elasticsearch, and you accept the slowness because the cost shape lets you log more.

Investigate:

Install Loki on basecamp; ship slog output from triage + pulse, plus the Postgres and Cilium logs.
LogQL: query last hour of error logs from triage filtered by trace_id.
Why doesn’t Loki index the log message body? (Cost — and because well-structured logs make it unnecessary.)
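
The LogQL exercise might look like the sketch below, which builds the query and the matching request URL for Loki's query_range HTTP endpoint. The `app` stream label, the JSON field names, and the `http://loki:3100` address are assumptions; `level="ERROR"` matches the upper-case level slog's JSON handler writes by default.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// ErrorQuery selects one app's stream by label, parses each line as JSON,
// then filters on fields from the log body. %q quoting is fine for simple
// values such as service names and hex trace IDs.
func ErrorQuery(app, traceID string) string {
	return fmt.Sprintf(`{app=%q} | json | level="ERROR" | trace_id=%q`, app, traceID)
}

// QueryRangeURL builds a GET URL for /loki/api/v1/query_range.
// start and end are Unix timestamps in nanoseconds.
func QueryRangeURL(base, query string, startNs, endNs int64) string {
	v := url.Values{}
	v.Set("query", query)
	v.Set("start", strconv.FormatInt(startNs, 10))
	v.Set("end", strconv.FormatInt(endNs, 10))
	v.Set("limit", "100")
	return base + "/loki/api/v1/query_range?" + v.Encode()
}

func main() {
	end := time.Now()
	start := end.Add(-time.Hour) // last hour
	q := ErrorQuery("triage", "4bf92f3577b34da6a3ce929d0e0e4736")
	fmt.Println(q)
	fmt.Println(QueryRangeURL("http://loki:3100", q, start.UnixNano(), end.UnixNano()))
}
```

Only `app` is a label here; `level` and `trace_id` are read from the line at query time, which is the cost shape this section describes.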

2.5 Tracing at scale (Tempo + OTel)

Distributed traces show the full request path. Head-based sampling decides at the edge, before the outcome is known; tail-based sampling decides after the trace completes. Whatever gets sampled is stored and queried by trace ID. Tail-based sampling is what lets you keep all errors and 1% of successes, the asymmetry that matters.

Investigate:

Add the OTel SDK to triage (Go); instrument net/http (for example by wrapping handlers and clients with otelhttp); export to Tempo.
Trace one request: ingress → triage → Postgres → response. View in Grafana.
Tail-based sampling vs head-based: which when? (Tail wins for “always keep errors”; head wins for predictable sample rate at scale.)
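
The tail-based rule ("keep all errors and 1% of successes") can be sketched as a decision function that runs once a trace's spans are all in. The Span type here is a hypothetical stand-in; in practice this decision would normally live in a collector-side component such as the OTel Collector's tail_sampling processor, not in application code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Span is a hypothetical, stripped-down span: only what the decision needs.
type Span struct {
	Name  string
	Error bool
}

// KeepTrace decides after the whole trace is known.
// Any error span keeps the trace; otherwise a hash of the trace ID keeps
// roughly successPercent of traces, deterministically per trace ID.
func KeepTrace(traceID string, spans []Span, successPercent uint32) bool {
	for _, s := range spans {
		if s.Error {
			return true
		}
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%100 < successPercent
}

func main() {
	failed := []Span{{Name: "ingress"}, {Name: "triage"}, {Name: "postgres", Error: true}}
	ok := []Span{{Name: "ingress"}, {Name: "triage"}, {Name: "postgres"}}

	fmt.Println("failed trace kept:", KeepTrace("trace-a", failed, 1))
	fmt.Println("ok trace kept:    ", KeepTrace("trace-b", ok, 1))
}
```

A head-based sampler could use the same hash but would have to decide before the Postgres span exists, which is why it cannot promise to keep every error.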

2.6 Runbook-as-code (formalize)

The discipline that’s been building since Phase 1: structured runbooks, versioned, AI-testable. By P14 every basecamp runbook is a file in ops-handbook/runbooks/ following meta/runbook-template.md, and every critical one has been handed to Claude in “play the runner” mode at least once.

→ Pattern: runbook-as-code

Investigate:

Audit all ops-handbook/runbooks/; ensure every one follows meta/runbook-template.md.
For 3 critical runbooks: hand to Claude in “play the runner” mode, observe gaps, revise.
Where do runbooks become executable (P25, AIOps)? Plan ahead — the structured format is what makes Year 5’s agent-driven runbook execution possible.
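
The audit step lends itself to a small check. The sketch below assumes runbooks are Markdown files whose sections are headings; the section names in requiredSections are hypothetical placeholders, and the real list would come from meta/runbook-template.md.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections is a hypothetical list, not the actual template.
var requiredSections = []string{"Symptoms", "Diagnosis", "Remediation", "Verification", "Rollback"}

// MissingSections returns the required sections that have no matching
// Markdown heading in the runbook text (case-insensitive).
func MissingSections(runbook string) []string {
	present := map[string]bool{}
	for _, line := range strings.Split(runbook, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "#") {
			title := strings.TrimSpace(strings.TrimLeft(line, "#"))
			present[strings.ToLower(title)] = true
		}
	}
	var missing []string
	for _, s := range requiredSections {
		if !present[strings.ToLower(s)] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	draft := "# loki-investigate-spike\n\n## Symptoms\n...\n## Diagnosis\n...\n## Remediation\n...\n"
	fmt.Println("missing:", MissingSections(draft))
}
```

A structural check like this catches only missing sections; whether the steps are executable from the text alone is what the “play the runner” pass tests.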

2.7 Blameless postmortem discipline (formalize)

You are mostly practicing this already. P14 is where you formalize it as the only postmortem style basecamp accepts. Templates, review checklist, and a public archive in ops-handbook/postmortems/.

→ Pattern: blameless-postmortem

3. TRADE-OFFS

Decision	Option A	Option B	Option C or guidance
Metrics backend	Prometheus	Mimir (HA Prometheus)	Cortex
Logs backend	Loki (cheap)	Elasticsearch (expensive, fast search)	Loki for most workloads
Traces backend	Tempo (Grafana)	Jaeger	Both; Tempo for the Grafana stack
eBPF tooling	Cilium Hubble (CNI-integrated)	Pixie (auto-instrumentation)	Falco (security)
OTel collector	sidecar	DaemonSet	Gateway
Sampling	head-based	tail-based	Tail wins for “always keep errors”

4. TOOLS (as of 2025-10)

Prometheus + Grafana (already from Y1)
Loki + Promtail: logs. Promtail is now in long-term support, with Grafana Alloy as its successor.
Tempo + OTel collector — traces
Cilium Hubble (already from Y1 P7 Cilium CNI)
bpftrace + bcc-tools — ad-hoc eBPF
kube-prometheus-stack Helm chart bundles Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics; Loki and Tempo install separately
Mimir — if you outgrow single-cluster Prometheus

5. MASTERY

5.1 Reading list

Required	Why
“Observability Engineering” (Majors, Fong-Jones, Miranda)	The book
OpenTelemetry docs — concepts	The unifier
Cilium docs — Hubble + Network Observability	eBPF in practice

Recommended	Why
“Distributed Tracing in Practice” (Parker, Spoonhower, Mace, et al.)	Tracing depth
Liz Fong-Jones’s blog	Cardinality wisdom

5.2 Operational depth checklist

[ ] Deploy Loki + Promtail in basecamp; all pods ship logs
[ ] Deploy Tempo + OTel collector; instrument triage + pulse with OTel-Go
[ ] Correlate logs + metrics + traces for one real triage request via trace_id
[ ] Set up Hubble UI; observe pod-to-pod L7 traffic without sidecars
[ ] Use bpftrace on each node to count syscalls by process for 30s
[ ] Add cardinality control: replace one bad-cardinality metric with a structured log + Loki query
[ ] Build a Grafana dashboard tying logs + metrics + traces for triage
[ ] Audit + revise 5 runbooks; hand each to Claude in "play the runner" mode
[ ] Write a postmortem for one self-inflicted incident this phase using the blameless discipline
[ ] Define observability SLOs for the platform itself (e.g., "trace ingestion lag <1 min, 99%")
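
The example SLO in the last item ("trace ingestion lag <1 min, 99%") reduces to a simple ratio. A minimal sketch, assuming you already have per-trace ingestion lags in seconds from some measurement you trust; the sample values are invented.

```go
package main

import "fmt"

// Compliance returns the fraction of observations strictly under the
// threshold. ok is false when there is no data, which should not count as
// meeting the SLO.
func Compliance(lagSeconds []float64, thresholdSeconds float64) (fraction float64, ok bool) {
	if len(lagSeconds) == 0 {
		return 0, false
	}
	within := 0
	for _, l := range lagSeconds {
		if l < thresholdSeconds {
			within++
		}
	}
	return float64(within) / float64(len(lagSeconds)), true
}

// MeetsSLO reports whether the fraction under the threshold reaches the target.
func MeetsSLO(lagSeconds []float64, thresholdSeconds, target float64) bool {
	f, ok := Compliance(lagSeconds, thresholdSeconds)
	return ok && f >= target
}

func main() {
	lags := []float64{5, 12, 30, 61} // invented sample, in seconds
	f, _ := Compliance(lags, 60)
	fmt.Printf("compliance: %.2f, meets 99%% target: %v\n", f, MeetsSLO(lags, 60, 0.99))
}
```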

5.3 platform-ctl extension

Add platform-ctl observe <service> — opens Grafana + filters logs/metrics/traces for the service. Small ergonomic win, big quality-of-life gain. The kind of one-line tool that earns its keep daily and proves platform-ctl is genuinely the platform’s front door, not a vanity wrapper around kubectl.
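
One possible core for the subcommand is a function that turns a service name into a Grafana dashboard URL, sketched below. It assumes a dashboard with a template variable named `service`; the base URL and the dashboard UID (`svc-overview`) are hypothetical, and nothing here reflects how platform-ctl is actually built.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// ObserveURL builds a link to a Grafana dashboard with the service
// pre-selected through a var-<name> query parameter and a one-hour window.
func ObserveURL(grafanaBase, dashboardUID, service string) string {
	v := url.Values{}
	v.Set("var-service", service)
	v.Set("from", "now-1h")
	v.Set("to", "now")
	return strings.TrimRight(grafanaBase, "/") + "/d/" + url.PathEscape(dashboardUID) + "?" + v.Encode()
}

func main() {
	service := "triage"
	if len(os.Args) > 1 {
		service = os.Args[1]
	}
	// Hypothetical Grafana address and dashboard UID.
	fmt.Println(ObserveURL("https://grafana.example.internal", "svc-overview", service))
}
```

Printing the URL keeps the sketch portable; opening a browser is a platform-specific step left out here.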

6. COMPARE: Hubble (eBPF) vs sidecar mesh observability

Year 2 P12’s service mesh already gives you L7 observability via sidecars (or ambient). Cilium Hubble gives you the same via eBPF. Compare:

Latency overhead — sidecars proxy every byte; eBPF observes without proxying.
Cardinality + storage cost — what each tool emits by default, and what it costs to keep.
Coverage (mesh sees L7; eBPF sees everything below L7 too — which matters when the bug is in the kernel).
When each wins, and when running both is the right answer.

Write up the comparison in about 400 words.

7. OPERATE

4+ runbooks (loki-investigate-spike, tempo-trace-missing, prometheus-cardinality-runaway, bpftrace-syscall-investigation)
1+ postmortem (using the formalized blameless discipline)
Weekly log

8. CONTRIBUTE

OpenTelemetry and Cilium are CNCF projects; Loki is Grafana Labs open source; kube-prometheus-stack is maintained by the Prometheus community. All are welcoming. A docs PR for an under-documented Hubble flow or a Loki LogQL example for a real query you wrote both count.

Validation criteria

[ ] All 10 operational depth checks
[ ] Loki + Tempo + OTel + Hubble all operational in basecamp
[ ] At least 2 services fully instrumented (logs + metrics + traces correlated)
[ ] All ops-handbook runbooks audited + tested via "play the runner"
[ ] Hubble vs sidecar comparison written up
[ ] 4+ runbooks; 1+ postmortem; 8+ weekly log entries
[ ] Pattern entries deepened:
    - three-pillars-and-unified-telemetry → DEEP
    - cardinality-as-cost → DEEP
    - runbook-as-code → DEEP
    - blameless-postmortem → DEEP
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (60 min) — given a service with no instrumentation, add OTel + slog + Prometheus metrics; correlate via trace_id; verify in Grafana.
Diagnose (90 min) — scenario: a request is sometimes slow but Prometheus shows nothing; find the cause via traces + logs + eBPF.
Articulate (30 min) — 600 words: “When do I reach for logs vs metrics vs traces vs eBPF? Give 3 real basecamp examples.”

Anti-patterns

Anti-pattern	Why
Logging everything as `INFO`	Loki cost + signal-to-noise destroyed
`user_id` as a Prometheus label	Cardinality explosion; structured logs are the right tool
Tracing without sampling at scale	Storage + query cost; tail-sampling for “always keep errors”
eBPF “for everything”	Probes cost CPU and operational complexity; reach for eBPF when the other pillars don’t answer the question
Runbooks no one tests	Bad runbooks are worse than no runbook; “play the runner” is the test

Patterns deepened this phase

three-pillars-and-unified-telemetry → DEEP
cardinality-as-cost → DEEP
runbook-as-code → DEEP
blameless-postmortem → DEEP

→ Next: Phase 15: Lakehouse — MinIO + Iceberg + JupyterHub