AIOps
Agents that operate the platform: triage incidents, propose runbooks, execute through paved-road APIs under approval gates. The Y5 capstone of /root's agent work.
Agents operating the platform from its own telemetry. Read postmortems. Propose triage. Execute through platform-ctl under approval. The pattern’s outline is stabilizing; the OSS shape is still emerging. Status: STUB, to be promoted to OUTLINE in Y5 Phase 50.
What this pattern is
AIOps applies the agent-loop pattern to platform operations. Agents read alerts, retrieve similar past incidents (via RAG on the ops-handbook), reason about likely causes, propose triage steps, and execute the safe ones through paved-road APIs while pausing for human approval on destructive actions. The pattern composes with rag-as-pattern (the corpus is the agent’s institutional memory), tool-use (platform-ctl is the safe action surface), operator-pattern (AIOps is itself a custom kubebuilder operator watching IncidentReport CRDs), and ai-security (approval gates + capability allowlists are non-negotiable).
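A minimal sketch of one pass through that loop, assuming the retrieval, reasoning, and execution pieces are injected as plain callables. The names `Step`, `TriageResult`, and `triage`, and the stub data, are hypothetical illustrations, not basecamp's actual interfaces; a real deployment would route `retrieve` through RAG on the ops-handbook and `execute` through platform-ctl.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    destructive: bool  # True for state-changing actions that need a human


@dataclass
class TriageResult:
    executed: list[str] = field(default_factory=list)
    pending_approval: list[str] = field(default_factory=list)


def triage(
    alert: str,
    retrieve: Callable[[str], list[str]],
    propose: Callable[[str, list[str]], list[Step]],
    execute: Callable[[Step], None],
) -> TriageResult:
    """Retrieve context, propose steps, run the safe ones, park the rest."""
    similar = retrieve(alert)        # stands in for RAG over past incidents
    steps = propose(alert, similar)  # stands in for the LLM reasoning call
    result = TriageResult()
    for step in steps:
        if step.destructive:
            result.pending_approval.append(step.name)  # pause for approval
        else:
            execute(step)
            result.executed.append(step.name)
    return result


# Stub collaborators (hypothetical data, for illustration only).
def fake_retrieve(alert: str) -> list[str]:
    return ["postmortem: disk full on ingest nodes"]


def fake_propose(alert: str, similar: list[str]) -> list[Step]:
    return [Step("read-disk-usage", False), Step("restart-ingest", True)]


ran: list[str] = []
result = triage("ingest latency high", fake_retrieve, fake_propose,
                lambda step: ran.append(step.name))
print(result.executed, result.pending_approval)
```

The point of the shape is that the loop itself never decides what is safe; it only honors the `destructive` flag, so the safety decision lives in one reviewable place.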
The pattern’s senior-IC value: it demonstrates that the full /root stack composes. Y5 agents read the Y1-Y5 ops-handbook through Y5 RAG, reach models via the Y5 llm-gateway, and act through Y3 platform-ctl. New Relic AI and Datadog Watchdog are proprietary versions of “agent operating the platform.” The OSS shape is emerging from public engineering posts and talks, 2024-2026.
AIOps has been a marketing term for a decade, most of it referring to statistical anomaly detection rather than actual agents. The 2024+ AIOps that matters is different — LLM-based agents that reason about incidents using natural-language context (postmortems, runbooks, docs), propose specific actions, and execute them through approved interfaces. This is a categorical shift from earlier AIOps. Earlier tools detected anomalies; newer agents reason about causes and take action.
The pattern’s central design tension is between agent autonomy and human control. Fully autonomous agents can act quickly but might act wrongly at scale. Human-in-the-loop agents are safer but slower. AIOps in production usually falls in the middle: read-only actions (investigating, proposing) autonomous; state-changing actions (deploying, restarting, scaling) approval-gated. As trust builds, more actions move from approval-gated to autonomous. The trajectory over time is toward more autonomy, but it’s a trust-building process, not a technical achievement.
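The trust trajectory described above can be sketched as a small policy object. This assumes a simple rule that is not from the source: a state-changing action becomes autonomous after a fixed number of consecutive approved, successful runs, and any rejection or failure resets the count. `TrustPolicy`, `promote_after`, and the action names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrustPolicy:
    promote_after: int = 5  # assumed threshold, tune per team
    read_only: set[str] = field(
        default_factory=lambda: {"get", "describe", "logs"}
    )
    approved_runs: dict[str, int] = field(default_factory=dict)

    def mode(self, action: str) -> str:
        """Return 'autonomous' or 'approval-gated' for an action."""
        if action in self.read_only:
            return "autonomous"
        if self.approved_runs.get(action, 0) >= self.promote_after:
            return "autonomous"
        return "approval-gated"

    def record(self, action: str, approved: bool, succeeded: bool) -> None:
        """Count consecutive approved successes; anything else resets trust."""
        if approved and succeeded:
            self.approved_runs[action] = self.approved_runs.get(action, 0) + 1
        else:
            self.approved_runs[action] = 0


policy = TrustPolicy(promote_after=2)
print(policy.mode("logs"), policy.mode("restart"))
policy.record("restart", approved=True, succeeded=True)
policy.record("restart", approved=True, succeeded=True)
print(policy.mode("restart"))
```

Resetting on any failure is deliberately conservative: it keeps promotion a record of demonstrated trust rather than a one-way ratchet.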
The pattern also requires a strong observability substrate. AIOps agents consume observability data as input. They need alerts to trigger on. They need metrics to reason about. They need traces to investigate. They need logs to search. An AIOps deployment on a platform without mature observability produces agents that can’t see enough to be useful. The platform observability investment is the prerequisite.
Concrete instances in the wild
- basecamp aiops (Y5 Phase 50). Custom kubebuilder operator watching IncidentReport CRDs; agent runtime consuming MCP servers for ops-handbook, telemetry, platform-ctl.
- New Relic AI. Commercial AIOps built into New Relic. Agent triages incidents, proposes remediation.
- Datadog Watchdog + AI (2024+). Similar shape from Datadog. Uses LLM to reason about anomalies.
- PagerDuty AI (PagerDuty Advance). LLM-assisted incident response in PagerDuty.
- Snowflake Copilot for Snowflake operations. Snowflake-specific AIOps.
- Google Cloud Gemini for Operations. Google’s AIOps offering.
- Runbook agents (emerging OSS). Various projects experimenting with LLM-based runbook execution.
- Cursor / Claude Code as informal AIOps. Not marketed as such, but engineers use them for operational tasks (writing kubectl commands, debugging configs, drafting runbooks).
- Anthropic Claude for security operations (SOC). Public case studies of LLM-based security incident triage.
- Custom internal AIOps at hyperscalers. Google, Meta, Netflix all have internal AIOps agents; specifics are proprietary.
Why this pattern matters
Modern platforms produce more operational data than humans can process. Thousands of alerts per day. Hundreds of dashboards. Petabytes of logs. On-call engineers can’t manually investigate every anomaly; they triage by severity and hope the important ones surface. Non-obvious issues get missed. Recurring problems get re-diagnosed each time because the previous diagnosis lives in someone’s Slack scroll rather than an accessible corpus.
AIOps agents change the operational math. Agents can read every alert, retrieve similar past incidents, and propose triage before a human sees the alert. This reduces on-call cognitive load — the human sees “here’s the alert, here’s what similar past incidents were, here’s the proposed action” rather than “here’s an alert, figure it out from scratch.” Even if the human always makes the final call, the agent’s context reduces time-to-triage significantly.
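To make "retrieve similar past incidents" concrete, here is a minimal sketch that ranks postmortems by token overlap (Jaccard similarity). This is a stand-in for illustration: a production RAG layer would more likely use embeddings and a vector index. The corpus entries and IDs are invented for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_incidents(
    alert: str, corpus: dict[str, str], k: int = 2
) -> list[tuple[str, float]]:
    """Return up to k (incident_id, score) pairs, best match first."""
    query = tokens(alert)
    scored = [(iid, jaccard(query, tokens(text))) for iid, text in corpus.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # score desc, id asc
    return scored[:k]


# Hypothetical postmortem summaries.
corpus = {
    "pm-001": "ingest queue latency high after disk full on broker",
    "pm-002": "certificate expired on gateway",
    "pm-003": "queue latency spike during deploy",
}
top = similar_incidents("queue latency high on ingest", corpus)
print([iid for iid, _ in top])
```

Whatever the scoring method, the contract is the same: the on-call human sees the alert already paired with the closest past incidents.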
The pattern also enables institutional memory to persist across on-call rotations. Every postmortem, every runbook, every past incident becomes queryable by the agent. The engineer who joined last week gets the same context as the one who’s been there five years. Tribal knowledge stops being tribal. On-call becomes less punishing for new team members.
For SRE and platform engineering specifically, AIOps changes what’s operationally possible. Reactive on-call becomes proactive investigation. Repetitive incident response becomes automated after enough examples. Time freed from mechanical triage goes to root-cause fixes and platform improvements. The team’s leverage increases without headcount increases.
For basecamp specifically, AIOps is the Y5 capstone that demonstrates the full stack. The ops-handbook exists (Y1-Y5 weekly logs, postmortems, runbooks). The observability exists (Y3 phases). The platform exists (Y3+ platform-ctl). The LLM stack exists (Y5 llm-gateway, agent runtime, MCP servers). AIOps composes them: the agent reads real alerts, retrieves real past incidents from real postmortems, reasons about real causes, and takes real actions on real infrastructure. This is what a platform engineer at senior-IC level should be able to build.
The pattern also demonstrates a specific architectural discipline: AIOps is a custom operator (kubebuilder), not a chat bot. IncidentReports become CRDs. The agent runtime is a controller. Approvals are Kubernetes-native. Audit trails are Kubernetes events. This is operator-pattern applied to AI workflows — reusing the K8s-native patterns rather than inventing new ones.
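The controller shape can be sketched as a level-triggered reconcile function over an incident record. A real kubebuilder operator would be written in Go against the Kubernetes API; this Python sketch only illustrates the state machine, and the phase names, fields, and `reconcile` signature are assumptions rather than basecamp's actual CRD schema.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IncidentReport:
    name: str
    phase: str = "New"
    proposal: str = ""
    approved: bool = False  # set by a human, not by the agent
    events: list[str] = field(default_factory=list)  # stands in for K8s events


def reconcile(
    report: IncidentReport,
    propose: Callable[[str], str],
    execute: Callable[[str], None],
) -> bool:
    """Advance the report by at most one phase. Returns True to requeue."""
    if report.phase == "New":
        report.proposal = propose(report.name)
        report.phase = "AwaitingApproval"
        report.events.append(f"Proposed: {report.proposal}")
        return True
    if report.phase == "AwaitingApproval":
        if not report.approved:
            return False  # nothing to do until the approval field changes
        report.phase = "Executing"
        report.events.append("Approved")
        return True
    if report.phase == "Executing":
        execute(report.proposal)
        report.phase = "Resolved"
        report.events.append(f"Executed: {report.proposal}")
    return False


done: list[str] = []
report = IncidentReport("inc-1")
reconcile(report, lambda name: "restart ingest", done.append)
report.approved = True  # a human approves out of band
while reconcile(report, lambda name: "restart ingest", done.append):
    pass
print(report.phase, report.events)
```

Because each call inspects current state and takes at most one step, calling it again on a resolved or unapproved report is a no-op, which is the property that makes the approval gate hard to skip.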
The failure modes to know:
- Agents that propose wrong actions confidently (need human review).
- Agents that fail silently on unexpected inputs (need error handling and escalation).
- Agents that lose context across long incidents (need memory management).
- Agents that violate approval gates (need airtight capability scoping).
- Over-reliance on AIOps that atrophies human skill (need periodic exercises).
Each has known mitigations, but building production AIOps means engineering for them.
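Two of the mitigations above, capability scoping and escalation instead of silent failure, can be sketched as a wrapper around the agent's tools. `ScopedTools`, `CapabilityError`, and the tool names are hypothetical; the escalation callback stands in for whatever paging or ticketing path the platform uses.

```python
from typing import Any, Callable


class CapabilityError(Exception):
    """Raised when the agent asks for a tool outside its allowlist."""


class ScopedTools:
    def __init__(
        self,
        tools: dict[str, Callable[..., Any]],
        allowed: frozenset[str],
        escalate: Callable[[str], None],
    ) -> None:
        self._tools = tools
        self._allowed = allowed
        self._escalate = escalate

    def call(self, name: str, *args: Any) -> Any:
        """Run an allowlisted tool; escalate on denial or on any failure."""
        if name not in self._allowed:
            self._escalate(f"denied: {name}")
            raise CapabilityError(name)
        try:
            return self._tools[name](*args)
        except Exception as exc:
            self._escalate(f"failed: {name}: {exc}")  # never fail silently
            raise


def _flaky(_service: str) -> str:
    raise RuntimeError("telemetry backend timed out")


pages: list[str] = []
tools = ScopedTools(
    tools={
        "get_logs": lambda svc: f"logs for {svc}",
        "delete_pod": lambda pod: None,
        "get_traces": _flaky,
    },
    allowed=frozenset({"get_logs", "get_traces"}),
    escalate=pages.append,
)
print(tools.call("get_logs", "ingest"))
```

The allowlist is checked before the tool is looked up, so a tool that exists but is not granted behaves exactly like one that does not exist from the agent's point of view.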
The OSS AIOps ecosystem is still nascent. Proprietary AIOps products (New Relic AI, Datadog Watchdog) are mature but locked into their vendors. OSS AIOps is emerging but not yet standardized. Building basecamp’s AIOps is partly research, partly synthesis of emerging patterns, partly custom engineering. This is where /root’s Y5 work operates — at the frontier of a maturing pattern rather than a stable well-understood one.
Depth progression
STUB ← you are here.
OUTLINE Promoted when Y5 Phase 50 ships services/aiops/ on basecamp.
DEEP Out of scope unless capstone direction prioritizes it. Default: OUTLINE.
Preview: what OUTLINE will answer
When Y5 Phase 50 promotes this entry to OUTLINE, it will name:
- PROBLEM. How do you deploy LLM agents to operate a platform safely and usefully?
- PRINCIPLES. Compose the stack (agent loop + RAG + tools + observability + security). Read-only autonomous; state-changing approval-gated. Institutional memory via RAG on ops corpus. K8s-native (custom operator) rather than bolt-on chat. Trajectory audit for every action. Trust built incrementally.
- TRADE-OFFS. Autonomous (fast, risk) vs human-in-the-loop (safe, slower). Custom (control, engineering cost) vs vendor (easy, lock-in). Real-time triage (proactive) vs post-incident analysis (reactive, safer). Narrow scope (safer, less useful) vs broad (powerful, more failure modes).
- TOOLS (time-stamped as of 2026-06): basecamp aiops CRD + operator + agent runtime (custom), New Relic AI, Datadog Watchdog + AI, PagerDuty AI, Snowflake Copilot, Google Cloud Gemini for Ops, runbook agents (emerging OSS), Anthropic Claude for security operations.
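As one possible reading of "trajectory audit for every action" in the PRINCIPLES item above, here is a sketch of an append-only log where each entry carries a hash of its predecessor, so later edits are detectable. Hash chaining is an assumption made for illustration, not something the source prescribes, and `TrajectoryLog` is a hypothetical name; the source's own design records audit trails as Kubernetes events.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(body: dict[str, str]) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class TrajectoryLog:
    def __init__(self) -> None:
        self.entries: list[dict[str, str]] = []

    def append(self, actor: str, action: str, detail: str) -> None:
        """Record one step, linked to the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"actor": actor, "action": action, "detail": detail, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self) -> bool:
        """Check every link and every entry hash from the start."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: entry[k] for k in ("actor", "action", "detail", "prev")}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = TrajectoryLog()
log.append("agent", "propose", "restart ingest")
log.append("human", "approve", "restart ingest")
print(log.verify())
```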
The DEEP promotion is out of scope for basecamp default; if pursued (e.g., Y5 capstone direction focuses on AIOps maturity), it would add MASTERY (operating AIOps on basecamp with real incidents), COMPARE (custom aiops vs vendor products), OPERATE (a specific incident triaged by aiops), and CONTRIBUTE (an OSS AIOps contribution or public case study).
Canonical references
- Anthropic’s writings on building agents for operations. Free at anthropic.com.
- New Relic AI documentation. Public product docs.
- Datadog AI/Watchdog documentation. Public product docs.
- Google SRE Book chapters on incident response — foundational context for what AIOps automates.
- Chip Huyen’s writings on production LLM systems. Free at huyenchip.com.