AI Observability
Telemetry specific to LLM + agent systems: per-request traces with prompts, completions, tool calls, costs, quality scores. OpenLLMetry, OTel for LLMs.
Standard observability (three pillars) plus LLM-specific signals: token counts, costs, prompt/completion content, trajectory shape, tool-call audit. Status: STUB, to be promoted to OUTLINE in Y5 Phase 49.
What this pattern is
AI observability extends standard telemetry (three-pillars) with signals specific to LLM + agent systems. The minimum surface:
- Per-request trace. Captures prompt + completion (with PII handling), tokens-in + tokens-out + cost, latency to first token + total latency, which model handled the request.
- Agent trajectory. Sequence of tool calls + observations per agent run, with timing per step.
- Per-prompt quality metrics. Eval scores, user feedback, regret rate.
- Cost dashboards. Per-tenant, per-model, per-route, per-feature.
OpenLLMetry and OTel’s GenAI semantic conventions are the emerging standard for instrumenting this data. Langfuse, Helicone, LangSmith are the AI-observability tools that consume it.
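A minimal sketch of the per-request surface described above, using only the standard library. The record shape, the `to_span_attributes` helper, and the `app.llm.*` keys are hypothetical names invented for illustration. The `gen_ai.*` keys are meant to follow the OpenTelemetry GenAI semantic conventions, which are still evolving, so check the current spec before relying on exact names.

```python
from dataclasses import dataclass


@dataclass
class LLMRequestRecord:
    """One traced LLM request (hypothetical record shape)."""
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    time_to_first_token_s: float
    total_latency_s: float
    prompt: str
    completion: str


def to_span_attributes(rec: LLMRequestRecord, capture_content: bool = False) -> dict:
    """Flatten a record into span attributes.

    gen_ai.* keys: intended to match the OTel GenAI conventions.
    app.llm.* keys: assumptions made up for this sketch.
    """
    attrs = {
        "gen_ai.request.model": rec.model,
        "gen_ai.usage.input_tokens": rec.input_tokens,
        "gen_ai.usage.output_tokens": rec.output_tokens,
        "app.llm.cost_usd": rec.cost_usd,
        "app.llm.time_to_first_token_s": rec.time_to_first_token_s,
        "app.llm.total_latency_s": rec.total_latency_s,
    }
    # Content capture is opt-in so PII handling can be decided per deployment.
    if capture_content:
        attrs["app.llm.prompt"] = rec.prompt
        attrs["app.llm.completion"] = rec.completion
    return attrs
```

Keeping content capture behind a flag is one way to make the PII decision explicit; a real deployment would likely attach these attributes to a span rather than return a dict.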
The pattern composes with evals (quality metrics in observability inform eval suite updates), llm-routing (route decisions are observed for cost/quality balancing), and aiops (agent traces are AIOps inputs).
The pattern’s central insight is that traditional observability signals don’t capture what matters for LLM systems. Request latency matters, but so does first-token latency. Error rate matters, but so does response quality (which requires eval, not just success/failure). Throughput matters, but so does per-request cost (which varies wildly with token counts). LLM observability needs additional dimensions to reflect the reality of what these systems do.
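To make the first-token versus total latency distinction concrete, here is a small sketch that wraps any iterable of streamed text chunks. The function name `timed_stream` and the injectable `clock` parameter are assumptions for illustration, not part of any library.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def timed_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, Optional[float], float]:
    """Consume a chunk stream and return (text, first_token_latency, total_latency).

    first_token_latency is None when the stream yields nothing.
    """
    start = clock()
    first_token_latency = None
    parts = []
    for chunk in chunks:
        if first_token_latency is None:
            # Time until the first chunk arrives: what the user perceives as "responsiveness".
            first_token_latency = clock() - start
        parts.append(chunk)
    total_latency = clock() - start
    return "".join(parts), first_token_latency, total_latency
```

Passing the clock in makes the measurement testable without real sleeps; in production the default `time.perf_counter` would be used and both values recorded on the request trace.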
For agent systems specifically, trajectory observability is critical. A trace shows the sequence of LLM calls, tool invocations, tool outputs, and reasoning steps that led to a final result. Without trajectory observability, debugging agent behavior is guesswork. With it, you can inspect why an agent chose a specific action, what the tool returned, and how the agent updated its plan. Trajectory replay becomes a debugging tool similar to distributed tracing in microservices.
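The trajectory idea above can be sketched as an ordered list of timed steps. The `Step` and `Trajectory` classes and their fields are hypothetical, and a real system would more likely emit these as nested spans; this only shows the shape of the data that makes step-by-step replay possible.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Step:
    kind: str            # e.g. "llm" or "tool"
    name: str            # model or tool name
    input: Any
    output: Any
    duration_s: float
    error: Optional[str] = None


@dataclass
class Trajectory:
    run_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, fn: Callable[[Any], Any], arg: Any) -> Any:
        """Run fn(arg), append a timed Step, and return fn's result.

        Failures are recorded too, then re-raised, so the trace shows where a run broke.
        """
        start = time.perf_counter()
        try:
            output = fn(arg)
        except Exception as exc:
            elapsed = time.perf_counter() - start
            self.steps.append(Step(kind, name, arg, None, elapsed, error=repr(exc)))
            raise
        self.steps.append(Step(kind, name, arg, output, time.perf_counter() - start))
        return output

    def replay(self) -> List[str]:
        """One readable line per step, in execution order."""
        return [
            f"{i}. [{s.kind}] {s.name}({s.input!r}) -> {s.output!r}"
            for i, s in enumerate(self.steps, 1)
        ]
```

Wrapping each LLM call and tool call in `record` gives a per-run log that can be read top to bottom when an agent produces a surprising result.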
Cost observability matters at a different level than latency observability. Latency issues affect user experience per-request. Cost issues affect the operating budget in aggregate. Without cost observability, you might not notice that a specific feature is 10x more expensive than intended until the monthly bill arrives. Real-time cost dashboards enable cost-aware decisions in the moment (redesign the prompt, route to cheaper models, cache more aggressively).
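As an illustration of how aggregate cost problems surface, the sketch below rolls per-request cost records up by a chosen dimension and flags anything over budget. The record keys (`feature`, `tenant`, `cost_usd`), the function names, and the budget figures are all assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rollup_cost(records: Iterable[dict], dimension: str) -> Dict[str, float]:
    """Sum cost_usd per value of the given dimension (e.g. 'feature' or 'tenant')."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec[dimension]] += rec["cost_usd"]
    return dict(totals)


def over_budget(totals: Dict[str, float], budgets: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (key, spend/budget ratio) for keys over budget, worst offender first."""
    flagged = [
        (key, totals[key] / budgets[key])
        for key in totals
        if key in budgets and totals[key] > budgets[key]
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Run over a short window (say, the last hour of request records), a check like this is what turns "the monthly bill was a surprise" into a same-day signal.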
Concrete instances in the wild
- OpenLLMetry. OSS extension of OpenTelemetry for LLM observability. Semantic conventions for LLM spans.
- OpenTelemetry GenAI semantic conventions. Standardized attributes for LLM spans. Reference for consistent tracing.
- Langfuse. OSS + commercial LLM observability platform. Rich UI for trace inspection.
- Helicone. Commercial LLM observability + gateway. Focused on cost visibility.
- LangSmith. LangChain’s commercial observability + evaluation tooling.
- Arize AI Phoenix. OSS, with a commercial Arize platform alongside. ML/LLM observability with strong evaluation integration.
- Portkey. Commercial gateway with built-in observability.
- Braintrust. Commercial LLM evaluation + observability platform.
- basecamp AI observability (Y5 Phase 49). OpenLLMetry-instrumented llm-gateway + agent runtime, feeding Grafana/Loki/Tempo for visualization.
- Datadog LLM Observability. Datadog’s addition of LLM-specific dashboards and traces.
- New Relic AI monitoring. New Relic’s equivalent.
Why this pattern matters
Operating LLM systems in production without observability is flying blind. Cost surprises happen. Quality regresses silently. Users experience latency issues nobody notices. Agents fail in ways nobody can debug. Every operational problem that plagued microservices in the early 2010s is happening to LLM applications in 2024-2026 — and the solution is the same: comprehensive observability from day one.
The pattern matters because LLM systems fail in more subtle ways than traditional software. A microservice with 99% success rate is a well-understood metric. An LLM that returns “correct” responses 90% of the time and “plausible-but-wrong” responses 10% of the time has a much harder-to-characterize failure mode. Observing prompt + completion content (with PII handling) is what lets you find these failure modes empirically.
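One simple form of the PII handling mentioned above is to redact obvious identifiers before content is written to a trace. The sketch below uses two illustrative regular expressions; the patterns and the `redact` helper are assumptions, and regex redaction is known to miss many kinds of PII, so treat it as a floor rather than a complete solution.

```python
import re

# Deliberately simple patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Applying `redact` to both prompt and completion before export keeps traces useful for spotting plausible-but-wrong answers while reducing what sensitive data reaches the observability backend.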
For cost management, observability is essential. LLM costs vary wildly per request based on input length, output length, and model choice. A poorly-designed prompt might cost 100x a well-designed one for the same task. A wrong routing decision might send simple queries to expensive models. Without per-request cost observability, none of these problems are visible; with it, they show up in dashboards and become actionable.
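The arithmetic behind per-request cost is simple enough to sketch. The model names and per-million-token prices below are invented for illustration and do not correspond to any real provider's pricing.

```python
# Hypothetical USD prices per 1M tokens; real prices vary by provider and change over time.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.5, "output": 1.5},
    "large-model": {"input": 10.0, "output": 30.0},
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 prices: dict = PRICES_PER_MILLION) -> float:
    """Cost of one request: tokens times the per-token price, input and output priced separately."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Under these assumed prices, a lean request on the small model (1,000 tokens in, 500 out) costs $0.00125, while a bloated prompt sent to the large model (20,000 in, 2,000 out) costs $0.26, roughly 200x more for what may be the same task.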
For debugging agent behavior, trajectory observability is transformative. A failed agent invocation can be replayed step-by-step: which tool did it call, what did the tool return, why did the agent make its next decision. Without this, debugging agents is guessing based on final output. With it, you can reason about the agent’s actual behavior at each step.
For basecamp specifically, AI observability is what makes the ML stack operationally trustworthy. Every LLM call is traced. Every agent trajectory is captured. Costs are visible per-team per-feature. Quality metrics are tracked over time. When something goes wrong (a spike in cost, a regression in quality, an incident), the observability data is what enables diagnosis and remediation.
The pattern also matters for compliance and audit. Regulated industries need to prove what their LLM systems did. Emerging AI regulations increasingly require this. Full request/response audit with retention is what satisfies these requirements. Building observability from the start is easier than retrofitting it under compliance pressure later.
Modern tooling is maturing rapidly. OpenLLMetry and OTel GenAI conventions are converging as standards. Langfuse and similar tools provide managed observability with reasonable defaults. Cloud-native platforms (Datadog, New Relic) are adding LLM-specific features. What used to require custom instrumentation is increasingly available as OSS or managed products.
The failure modes to know: observability that captures too much (PII leakage in traces); observability that captures too little (can’t debug specific issues); aggressive sampling (a low sample rate) that misses rare failures; expensive observability that itself becomes a cost problem; observability without alerting (data collected, never acted on). Each has known mitigations, but adopting AI observability means engineering for them.
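One common way to balance the sampling and cost failure modes is to decide after a request completes (tail-style sampling): always keep traces that look interesting and sample the rest at a low base rate. The function name, trace keys, and thresholds below are assumptions for illustration.

```python
import random
from typing import Callable


def keep_trace(
    trace: dict,
    base_rate: float = 0.05,
    cost_threshold_usd: float = 1.0,
    min_quality: float = 0.5,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to retain the full trace for a completed request."""
    # Always keep failures: these are the rare events uniform sampling would drop.
    if trace.get("error"):
        return True
    # Always keep unusually expensive requests.
    if trace.get("cost_usd", 0.0) >= cost_threshold_usd:
        return True
    # Always keep low-quality responses when a score is available.
    quality = trace.get("quality_score")
    if quality is not None and quality < min_quality:
        return True
    # Everything else is sampled at the base rate to bound storage cost.
    return rng() < base_rate
```

The trade-off is that the decision needs the whole trace buffered until the request finishes, which is more work than deciding up front but avoids discarding the traces most worth debugging.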
Depth progression
- STUB. ← you are here.
- OUTLINE. Promoted when Y5 Phase 49 instruments llm-gateway + agent runtime with OpenLLMetry-shaped traces.
- DEEP. Out of scope unless capstone direction prioritizes it. Default: OUTLINE.
Preview: what OUTLINE will answer
When Y5 Phase 49 promotes this entry to OUTLINE, it will name:
- PROBLEM. How do you observe LLM + agent systems well enough to operate them reliably in production?
- PRINCIPLES. Traditional three-pillars observability plus LLM-specific signals. Per-request cost, tokens, latency. Trajectory capture for agents. Quality metrics feeding evals. PII handling in traces. Alerting on cost, quality, and reliability signals.
- TRADE-OFFS. Comprehensive tracing (rich, expensive) vs sampled (cheaper, may miss issues). Content capture (great debugging, PII risk) vs redacted (safer, less useful). Managed (Langfuse, LangSmith — easy) vs self-hosted (control, ops cost). OpenLLMetry standard vs vendor-specific.
- TOOLS (time-stamped as of 2026-06): OpenLLMetry, OpenTelemetry GenAI conventions, Langfuse (OSS + commercial), Helicone, LangSmith, Arize AI Phoenix, Portkey, Braintrust, Datadog LLM Observability, New Relic AI monitoring, custom Grafana/Loki/Tempo stack.
The DEEP promotion is out of scope for basecamp default; if pursued, it would add MASTERY (operating AI observability across basecamp), COMPARE (Langfuse vs LangSmith vs custom OTel stack), OPERATE (a specific incident diagnosed through AI observability), and CONTRIBUTE (an OpenLLMetry semantic conventions contribution or public case study).
Canonical references
- OpenLLMetry documentation. Free at github.com/traceloop/openllmetry.
- OpenTelemetry GenAI semantic conventions. Free at opentelemetry.io.
- Langfuse documentation. Free at langfuse.com.
- Helicone documentation and blog. Free at helicone.ai.
- Chip Huyen’s writings on production LLM systems. Free at huyenchip.com.