AIOps: services/aiops/

Third phase of Year 5. The narrative payoff: build agents that operate the platform you built, using the ~4 years of operational data it has accumulated. The auto-incident triage loop lands as a runnable composition recipe. ~8 weeks, ~100 hrs.

Phase 28 is the Year 5 inflection point — the phase that earns the Staff/Principal AI Platform Engineer title the Master Plan pins to the Year 5 exit ramp. P26 gave you state-machine agents. P27 gave you a typed tool gateway. P28 puts both on top of ~4 years of operational data and closes the loop: agents you built, running on the platform you built, operating the same platform you operate.

The artifact is services/aiops/ — Tier 9 of basecamp, sitting on top of every prior tier and consuming basecamp-mcp from P27 plus llm-gateway from Year 4. The deliverable isn’t “an agent that triages incidents” in the abstract; it’s a system that, when an alert fires at 2am, gives you a 75%-helpful first hypothesis instead of a blank Grafana dashboard. Honest measurement is the discipline: the agent’s hypothesis is logged next to the actual root cause every time, and the disagreement is the eval signal that drives the next prompt iteration.

This is also the phase where the AI in AI Platform Engineer becomes load-bearing. Up to here, the platform served LLMs; from here, the platform is operated by LLMs you authored. See patterns/agents/ and patterns/ml-and-ai/ for the durable framing.

Prerequisites

Phase 27 complete — basecamp-mcp running with read + approval-gated write tools

4+ years of operational data in your platform (incidents in Postgres, runbooks in ops-handbook, postmortems indexed in Iceberg / vector DB)

You accept: AIOps isn’t replacing operators; it’s amplifying them. Agents triage, recommend, and execute pre-approved runbooks. Humans stay in the loop for novel and risky decisions.

Why this phase exists

You’ve operated the platform for ~4 years and accumulated rich operational data along the way. AIOps combines that data, your runbooks, and LLM reasoning to make operations faster and more consistent.

This phase ships services/aiops/ inside basecamp. It’s the loop closing: agents you built, running on the platform you built, with the patterns you learned, operating the same platform you operate.

The cinematic Year 5 moment.

1. PROBLEM

Operations work has 3 patterns:

Repetitive triage (read N dashboards; decide if it’s incident X or Y) — AIOps automates the first pass
Runbook execution (steps 1-5 of “Postgres connections exhausted”) — AIOps executes pre-approved runbooks
Pattern detection (this anomaly looks like an incident from 3 months ago) — AIOps surfaces analogies

Humans still own: novel incidents, risky decisions, postmortems, learning.

2. PRINCIPLES

2.1 Incident triage agent

When an alert fires, the agent reads context (metrics, recent deploys, related runbooks, similar past incidents), forms a hypothesis, and posts it to the incident channel.

Investigate:

Subscribe to the Prometheus AlertManager webhook
Agent reads via basecamp-mcp: alert payload, related Grafana dashboards, recent deploys, related runbooks
Output: structured hypothesis + suggested first action + confidence
Composition recipe: Auto-incident triage loop — see Studio composition
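
A minimal sketch of the first two items above, assuming the standard Alertmanager webhook body (a top-level "alerts" list whose entries carry "status", "labels", "annotations", and "startsAt"). The AlertSummary and Hypothesis types, the function name, and the "service" label are illustrative assumptions, not part of basecamp or basecamp-mcp.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AlertSummary:
    name: str
    severity: str
    service: str
    started_at: str
    summary: str


@dataclass
class Hypothesis:
    """Structured triage output: hypothesis + suggested first action + confidence."""
    alert_name: str
    hypothesis: str
    suggested_first_action: str
    confidence: float  # 0.0 to 1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def parse_alertmanager_webhook(payload: dict[str, Any]) -> list[AlertSummary]:
    """Extract the firing alerts from an Alertmanager webhook body."""
    summaries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no triage
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        summaries.append(
            AlertSummary(
                name=labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
                service=labels.get("service", "unknown"),  # assumed label name
                started_at=alert.get("startsAt", ""),
                summary=annotations.get("summary", ""),
            )
        )
    return summaries


# Hand-written example payload (illustrative values only).
example_payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "PostgresConnectionsExhausted",
                "severity": "page",
                "service": "triage",
            },
            "annotations": {"summary": "connections at max_connections"},
            "startsAt": "2025-10-01T02:00:00Z",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "RedisMemoryPressure"},
            "annotations": {},
            "startsAt": "2025-10-01T01:00:00Z",
        },
    ],
}

firing = parse_alertmanager_webhook(example_payload)
```

Keeping the hypothesis as a typed record rather than free text is what makes it possible to log it next to the actual root cause later.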

2.2 Runbook executor

Pre-approved runbooks get an “AI-executable” flag. Agent runs steps; pauses for human confirmation on destructive ops.

Investigate:

Annotate 3 runbooks as AI-executable (with explicit steps + safety bounds)
Build an executor agent that follows step-by-step + reports
Implement “destructive confirmation” mode (Slack approval before reboot/delete)

The three first-target runbooks should be the ones you’ve actually run more than 5 times in ops-handbook — postgres-connections-exhausted, redis-memory-pressure, k8s-pod-crashlooping are good defaults but pick whatever shows up most in your weekly logs. Frequency is the prioritization signal.
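
The executor behavior described in this section can be sketched roughly as follows. This is a simplified illustration, not the basecamp implementation: Step, Runbook, and execute_runbook are hypothetical names, and the approve callback stands in for whatever Slack approval flow you wire up.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], str]  # runs the step and returns a short result string
    destructive: bool = False  # destructive steps need human approval first


@dataclass
class Runbook:
    name: str
    ai_executable: bool
    steps: list[Step] = field(default_factory=list)


def execute_runbook(runbook: Runbook, approve: Callable[[Step], bool]) -> list[str]:
    """Run steps in order, pausing at the first unapproved destructive step."""
    report: list[str] = []
    if not runbook.ai_executable:
        report.append(f"{runbook.name}: not AI-executable, handing off to a human")
        return report
    for number, step in enumerate(runbook.steps, start=1):
        if step.destructive and not approve(step):
            report.append(f"step {number} PAUSED (approval denied): {step.description}")
            break  # never skip ahead past a denied step
        result = step.action()
        report.append(f"step {number} OK: {step.description} -> {result}")
    return report
```

The executor stops at the first denied step instead of skipping it, on the assumption that later steps may depend on the destructive one having run.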

2.3 Pattern detection: vector index past incidents

Vector-index past postmortems (you have ~25 by Y5). New incident → search for similar.

Investigate:

Embed all past postmortems via sentence-transformers (already in your stack from Y4)
Pgvector index in services/aiops/ (Tier 9)
New incident: embed alert summary; top-3 similar past incidents shown
“Has the team seen this before? What worked?”
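
To make the retrieval step concrete, here is a toy top-3 search using cosine similarity in plain Python. The 3-dimensional vectors and postmortem IDs are made up for illustration. In the real system the embeddings would come from sentence-transformers and the nearest-neighbor search would run inside pgvector rather than in application code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: list[float], index: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k postmortem IDs most similar to the query, best first."""
    scored = [(pm_id, cosine_similarity(query, vec)) for pm_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones have hundreds of dimensions.
toy_index = {
    "pm-postgres-conn": [0.9, 0.1, 0.0],
    "pm-redis-memory": [0.1, 0.9, 0.0],
    "pm-pod-crashloop": [0.0, 0.2, 0.9],
    "pm-postgres-slow-query": [0.8, 0.0, 0.3],
}
alert_embedding = [1.0, 0.0, 0.1]  # stands in for the embedded alert summary

similar = top_k_similar(alert_embedding, toy_index, k=3)
```

With these toy vectors the two Postgres postmortems rank first and second, which is the kind of analogy the agent would surface next to its hypothesis.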

2.4 Safety + guardrails

→ Pattern: defense-in-depth (revisited at AI surface)

Read-only by default
Destructive ops require human approval
Per-agent token budget
Audit log of every action
Per-incident step budget (prevent runaway loops on confused alerts)
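
One possible shape for these guardrails, sketched as a single guard object that every tool call passes through. The class and field names are hypothetical. For brevity the sketch tracks the token budget and the step budget on the same per-incident object, and it keeps the audit log in memory where the real system would write to a durable store.

```python
import json
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a token or step budget would be exceeded."""


@dataclass
class IncidentGuard:
    token_budget: int
    step_budget: int
    tokens_used: int = 0
    steps_taken: int = 0
    audit_log: list[str] = field(default_factory=list)

    def record_step(
        self, tool: str, tokens: int, read_only: bool = True, approved: bool = False
    ) -> None:
        """Check every guardrail before a tool call; audit the outcome either way."""
        if not read_only and not approved:
            self._audit(tool, "denied: write without human approval")
            raise PermissionError(f"{tool} requires human approval")
        if self.steps_taken + 1 > self.step_budget:
            self._audit(tool, "aborted: step budget exceeded")
            raise BudgetExceeded("step budget exceeded")
        if self.tokens_used + tokens > self.token_budget:
            self._audit(tool, "aborted: token budget exceeded")
            raise BudgetExceeded("token budget exceeded")
        self.steps_taken += 1
        self.tokens_used += tokens
        self._audit(tool, "allowed")

    def _audit(self, tool: str, outcome: str) -> None:
        # One JSON line per decision, including denials and aborts.
        entry = {"ts": time.time(), "tool": tool, "outcome": outcome}
        self.audit_log.append(json.dumps(entry))
```

Denied and aborted calls are audited too, since those are often the entries you most want to read during a postmortem.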

2.5 Eval + improvement loop

Track AIOps recommendations vs human ground truth. Improve.

Investigate:

After each incident: log “agent’s hypothesis”, “actual root cause”, “agreement?”
Weekly review: disagreements; update prompts/runbooks
Quarterly: train a small classifier on hypothesis-vs-actual to spot patterns

The weekly review is the load-bearing habit, same shape as the Sunday weekly log from the Master Plan. Without it, the agent’s “looks helpful” answers quietly drift away from useful.
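
A small sketch of the per-incident log and the number the weekly review would start from. EvalRecord and its field names are assumptions that mirror the three logged items above, and the "agreement" flag is assumed to be judged by a human reviewer rather than computed.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    incident_id: str
    agent_hypothesis: str
    actual_root_cause: str
    agreement: bool  # set by a human reviewer after the incident


def agreement_rate(records: list[EvalRecord]) -> float:
    """Fraction of incidents where the agent's hypothesis matched the root cause."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.agreement) / len(records)


def disagreements(records: list[EvalRecord]) -> list[EvalRecord]:
    """The records worth reading in the weekly review."""
    return [r for r in records if not r.agreement]


# Illustrative records only; not real incident data.
week = [
    EvalRecord("inc-1", "connection pool leak", "connection pool leak", True),
    EvalRecord("inc-2", "redis eviction storm", "redis eviction storm", True),
    EvalRecord("inc-3", "bad deploy", "expired certificate", False),
    EvalRecord("inc-4", "OOM kill loop", "OOM kill loop", True),
]

rate = agreement_rate(week)       # 3 of 4 agree
to_review = disagreements(week)   # only inc-3
```

The rate is the dashboard number; the disagreement list is what actually drives prompt and runbook updates.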

2.6 The composition recipe: Auto-incident triage loop

Prometheus alert → AlertManager webhook → services/aiops/
  ↓
aiops calls basecamp-mcp tools:
  - query_trino: "what services have alerted in last hour?"
  - read_logs: "tail of triage's logs around alert time"
  - list_services: "what changed? recent deploys?"
  - vector_search_postmortems: "similar past incidents?"
  ↓
LLM (via llm-gateway): forms hypothesis
  ↓
notify_slack: "Hypothesis: X. Suggested first action: Y. Confidence: 75%."
  ↓
If runbook AI-executable + confidence > threshold: execute step 1
   Otherwise: human-in-loop in Slack

This is composition recipe #2 of the five Y5 recipes — it lands here in P28 and gets surfaced in the Studio command palette in P29.
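
The last two stages of the loop above (the Slack notification and the execute-or-escalate gate) can be sketched as two pure functions. The names and the 0.8 threshold are hypothetical; the source only says "confidence > threshold", so the actual value should be tuned against your eval data.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # hypothetical default; tune against eval data


@dataclass
class TriageResult:
    hypothesis: str
    suggested_first_action: str
    confidence: float  # 0.0 to 1.0
    runbook_ai_executable: bool


def format_slack_message(result: TriageResult) -> str:
    """Render the triage result in the message shape used by the recipe."""
    return (
        f"Hypothesis: {result.hypothesis}. "
        f"Suggested first action: {result.suggested_first_action}. "
        f"Confidence: {result.confidence:.0%}."
    )


def next_action(result: TriageResult, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Auto-execute step 1 only for AI-executable runbooks above the threshold."""
    if result.runbook_ai_executable and result.confidence > threshold:
        return "execute_step_1"
    return "human_in_loop"


result = TriageResult(
    hypothesis="connection pool leak after deploy",
    suggested_first_action="check pg_stat_activity",
    confidence=0.75,
    runbook_ai_executable=True,
)
message = format_slack_message(result)
decision = next_action(result)  # 0.75 is below 0.8, so a human decides
```

Both conditions must hold before anything runs automatically; a confident hypothesis without an AI-executable runbook still goes to a human.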

3. TRADE-OFFS

Decision	Option A	Option B	Option C / when to choose
Action authority	Read-only (advisory)	Approved actions only	Full autonomy
Trigger	Alert-driven	Schedule-driven (anomaly detection)	On-demand
Eval cadence	Weekly	Per-incident	Continuous
Agent runtime	LangGraph (Python)	Go state machine	LangGraph for iter-speed; Go for production once stable

4. TOOLS (as of 2025-10)

LangGraph (P26)
basecamp-mcp (P27)
llm-gateway v1.5 (Y4 P25)
Prometheus AlertManager (Y3 — alert source)
pgvector (Y4 P24 — incident retrieval)

5. MASTERY

5.1 Reading list

Required	Why
Google SRE Book — Incident Management chapter	The discipline AIOps amplifies
Anthropic / OpenAI blog posts on production agents (2024-2025)	Real-world patterns
Honeycomb / Datadog AIOps writeups	Vendor perspectives; extract the vendor-neutral patterns

5.2 Operational depth checklist

[ ] services/aiops/ scaffolded inside basecamp/charts/aiops/
[ ] Wire Prometheus AlertManager → AIOps webhook
[ ] Build triage agent: reads alert, hypothesizes, posts to Slack
[ ] Index all past postmortems in pgvector (auto-update on new postmortem in ops-handbook)
[ ] New incident: surface top-3 similar past incidents
[ ] Annotate 3 runbooks as AI-executable (postgres-connections-exhausted, redis-memory-pressure, k8s-pod-crashlooping)
[ ] Build runbook executor (read-only first; destructive-requires-approval)
[ ] Audit log every action to Loki + a Postgres "aiops_audit" table
[ ] Per-agent token budget + abort
[ ] Weekly eval review process: agent hypothesis vs actual cause
[ ] Build aiops Grafana dashboard: incidents triaged, agreement rate, action latency
[ ] Document composition recipe in basecamp/examples/recipe-incident-triage/

5.3 The aiops production launch

By phase end:

services/aiops/ (PUBLIC via basecamp's repo):
  Triage agent live in production
  Pattern-detection via vector search of postmortems
  Runbook executor (3 AI-executable runbooks)
  Eval + improvement loop weekly
  Audit log + observability dashboard

  Connected to:
    Prometheus AlertManager (event source)
    basecamp-mcp (tools)
    llm-gateway (reasoning)
    triage (incidents schema; aiops augments)
    pgvector (postmortem index)
    Slack (human interface)

This is the system that, when an alert fires at 2am, gives you a 75%-helpful first hypothesis instead of a blank Grafana dashboard.

6. COMPARE: AIOps vs traditional alert routing

Compare manual triage time (current) vs agent-assisted (with eval data). Measure honestly.

400 words. Real numbers from your own incident log, not vendor marketing.

7. OPERATE

4+ runbooks (aiops-wrong-hypothesis-investigation, runbook-executor-stuck, vector-index-stale, agent-token-budget-exceeded)
2+ postmortems (yes, AIOps will misfire — postmortem honestly)
Weekly log

8. CONTRIBUTE

LangGraph, Anthropic Cookbook, OpenLineage agent integration.

Validation criteria

[ ] All 12 operational depth checks
[ ] services/aiops/ live in basecamp Tier 9
[ ] Triage agent shipping hypotheses for real alerts
[ ] At least 1 incident in Y5 P28 where AIOps actually shortened your time to diagnosis
[ ] AIOps comparison written up
[ ] 4+ runbooks; 2+ postmortems
[ ] Pattern entries deepened:
    - all agent patterns DEEP (combined with P26 + P27)
    - defense-in-depth → reinforced (AI-action audit + approval)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min): add a new alert type to AIOps: “PostgreSQL connection exhaustion.” Agent should: identify likely cause via tools, run diagnostics, propose fix, execute if pre-approved. End-to-end with audit log.
Debug (60 min): scenario: AIOps recommended wrong action 3 times in a row. Investigate prompt, eval data, retrieved context.
Articulate (30 min): 600 words: “Where does AIOps add value vs add risk? When is the human-in-the-loop boundary right?”

Anti-patterns

Anti-pattern	Why
Auto-execute destructive ops without approval	Catastrophe waiting
Skipping the eval loop	“Looks helpful” until it’s quietly wrong
AIOps without read-only default	Worse than no AIOps
Vector-index without freshness	New postmortems missing from analogies
Treating AIOps as replacement, not amplifier	Operators still own novel + risky

Patterns deepened this phase

All agent patterns reach DEEP (combined across P26-P28)
defense-in-depth → reinforced

→ Next: Phase 29: Platform Portal + Governance — Abukix Studio launches