AIOps: services/aiops/
Third phase of Year 5. The narrative payoff: build agents that operate the platform you built, using 5 years of operational data. Auto-incident triage loop lands as a runnable composition recipe. ~8 weeks, ~100 hrs.
Phase 28 is the Year 5 inflection point — the phase that earns the Staff/Principal AI Platform Engineer title the Master Plan pins to the Year 5 exit ramp. P26 gave you state-machine agents. P27 gave you a typed tool gateway. P28 puts both on top of ~4 years of operational data and closes the loop: agents you built, running on the platform you built, operating the same platform you operate.
The artifact is services/aiops/ — Tier 9 of basecamp, sitting on top of every prior tier and consuming basecamp-mcp from P27 plus llm-gateway from Year 4. The deliverable isn’t “an agent that triages incidents” in the abstract; it’s a system that, when an alert fires at 2am, gives you a 75%-helpful first hypothesis instead of a blank Grafana dashboard. Honest measurement is the discipline: the agent’s hypothesis is logged next to the actual root cause every time, and the disagreement is the eval signal that drives the next prompt iteration.
This is also the phase where the AI in AI Platform Engineer becomes load-bearing. Up to here, the platform served LLMs; from here, the platform is operated by LLMs you authored. See patterns/agents/ and patterns/ml-and-ai/ for the durable framing.
Prerequisites
- Phase 27 complete — basecamp-mcp running with read + approval-gated write tools
- 4+ years of operational data in your platform (incidents in Postgres, runbooks in ops-handbook, postmortems indexed in Iceberg / vector DB)
- You accept: AIOps isn’t replacing operators — it’s amplifying them. Agents triage, recommend, execute pre-approved runbooks. Humans stay on the loop for novel + risky decisions.
Why this phase exists
You’ve operated the platform for ~4 years. Rich operational data. AIOps takes that data + your runbooks + LLM reasoning to make operations faster + more consistent.
This phase ships services/aiops/ inside basecamp. It’s the loop closing: agents you built, running on the platform you built, with the patterns you learned, operating the same platform you operate.
The cinematic Year 5 moment.
1. PROBLEM
Operations work has 3 patterns:
- Repetitive triage (read N dashboards; decide if it’s incident X or Y): AIOps automates the first pass
- Runbook execution (steps 1-5 of “Postgres connections exhausted”): AIOps executes pre-approved runbooks
- Pattern detection (this anomaly looks like an incident from 3 months ago): AIOps surfaces analogies
Humans still own: novel incidents, risky decisions, postmortems, learning.
2. PRINCIPLES
2.1 Incident triage agent
When an alert fires, the agent reads context (metrics, recent deploys, related runbooks, similar past incidents), forms a hypothesis, and posts it to the incident channel.
Investigate:
- Subscribe to the Prometheus AlertManager webhook
- Agent reads via basecamp-mcp: alert payload, related Grafana dashboards, recent deploys, related runbooks
- Output: structured hypothesis + suggested first action + confidence
- Composition recipe: Auto-incident triage loop — see Studio composition
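The webhook-to-hypothesis step can be sketched as follows. This is a minimal sketch, not the phase's actual implementation: it assumes the standard AlertManager webhook shape (a top-level `alerts` list whose entries carry `status` and `labels`), and the `Hypothesis` fields, the `service` label, and the sample values are all hypothetical. The LLM call that would produce the hypothesis text is out of scope here, so the text is passed in by hand.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Hypothesis:
    """Structured triage output; field names are illustrative assumptions."""
    alert_name: str
    service: str
    hypothesis: str
    suggested_first_action: str
    confidence: float  # clamped to the range 0.0 to 1.0


def firing_alerts(payload: dict) -> list[dict]:
    """Return only the alerts in the webhook payload that are still firing."""
    return [a for a in payload.get("alerts", []) if a.get("status") == "firing"]


def to_hypothesis(alert: dict, text: str, action: str, confidence: float) -> Hypothesis:
    """Combine alert labels with (externally produced) reasoning output."""
    labels = alert.get("labels", {})
    return Hypothesis(
        alert_name=labels.get("alertname", "unknown"),
        service=labels.get("service", "unknown"),  # assumes alerts carry a "service" label
        hypothesis=text,
        suggested_first_action=action,
        confidence=max(0.0, min(1.0, confidence)),
    )


# Hypothetical payload, trimmed to the fields this sketch reads.
sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "PostgresConnectionsHigh", "service": "triage"}},
        {"status": "resolved", "labels": {"alertname": "RedisMemoryHigh", "service": "cache"}},
    ],
}

alerts = firing_alerts(sample_payload)
h = to_hypothesis(
    alerts[0],
    text="Connection pool exhausted after a recent deploy",
    action="Check active connections per client",
    confidence=1.2,  # out-of-range on purpose to show clamping
)
print(json.dumps(asdict(h), indent=2))
```

Keeping the output a typed structure rather than free text is what lets the later eval loop log it next to the actual root cause.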
2.2 Runbook executor
Pre-approved runbooks get an “AI-executable” flag. Agent runs steps; pauses for human confirmation on destructive ops.
Investigate:
- Annotate 3 runbooks as AI-executable (with explicit steps + safety bounds)
- Build an executor agent that follows step-by-step + reports
- Implement “destructive confirmation” mode (Slack approval before reboot/delete)
The three first-target runbooks should be the ones you’ve actually run more than 5 times in ops-handbook — postgres-connections-exhausted, redis-memory-pressure, k8s-pod-crashlooping are good defaults but pick whatever shows up most in your weekly logs. Frequency is the prioritization signal.
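One possible shape for the executor, as a sketch under stated assumptions: the `Runbook` and `Step` types, the `ai_executable` flag name, and the `approve` callback (standing in for a Slack approval) are hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    description: str
    destructive: bool = False  # destructive steps need human confirmation


@dataclass
class Runbook:
    name: str
    ai_executable: bool
    steps: list[Step] = field(default_factory=list)


def execute(
    runbook: Runbook,
    run_step: Callable[[Step], str],
    approve: Callable[[Step], bool],
) -> list[str]:
    """Run steps in order; stop at the first destructive step that is not approved."""
    if not runbook.ai_executable:
        return [f"SKIPPED {runbook.name}: not flagged AI-executable"]
    report = []
    for i, step in enumerate(runbook.steps, start=1):
        if step.destructive and not approve(step):
            report.append(f"step {i} PAUSED (approval denied): {step.description}")
            break
        report.append(f"step {i} OK: {run_step(step)}")
    return report


runbook = Runbook(
    name="postgres-connections-exhausted",
    ai_executable=True,
    steps=[
        Step("count active connections per client"),
        Step("terminate idle connections", destructive=True),
        Step("verify connection count recovered"),
    ],
)

# Stand-ins: a real run_step would call tools; a real approve would wait on Slack.
report = execute(runbook, run_step=lambda s: f"ran '{s.description}'", approve=lambda s: False)
for line in report:
    print(line)
```

The executor stops rather than skips when approval is denied, since later steps usually assume the earlier ones ran.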
2.3 Pattern detection: vector index past incidents
Vector-index past postmortems (you have ~25 by Y5). New incident → search for similar.
Investigate:
- Embed all past postmortems via sentence-transformers (already in use since Y4)
- pgvector index in services/aiops/ (Tier 9)
- New incident: embed alert summary; top-3 similar past incidents shown
- “Has the team seen this before? What worked?”
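The top-3 lookup reduces to ranking stored vectors by similarity to the new alert's vector. The sketch below does that in plain Python with toy 3-dimensional vectors so it runs anywhere. In the real service the embeddings would come from sentence-transformers and the ranking would happen inside pgvector; the postmortem IDs and vectors here are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec: list[float], indexed: dict[str, list[float]], k: int = 3):
    """Return the k postmortem IDs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), pm_id) for pm_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [(pm_id, round(score, 3)) for score, pm_id in scored[:k]]


# Hypothetical index: postmortem ID -> embedding (toy vectors, not real embeddings).
postmortems = {
    "pm-001": [1.0, 0.0, 0.0],
    "pm-002": [0.5, 0.5, 0.0],
    "pm-003": [0.0, 1.0, 0.0],
    "pm-004": [0.0, 0.0, 1.0],
}

new_alert_vec = [1.0, 0.05, 0.0]
matches = top_k_similar(new_alert_vec, postmortems)
for pm_id, score in matches:
    print(pm_id, score)
```

With only ~25 postmortems a brute-force scan like this would be fast enough on its own; the pgvector index matters more for keeping the data next to the incidents schema than for speed.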
2.4 Safety + guardrails
→ Pattern: defense-in-depth (revisited at AI surface)
- Read-only by default
- Destructive ops require human approval
- Per-agent token budget
- Audit log of every action
- Per-incident step budget (prevent runaway loops on confused alerts)
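The guardrails above can be combined into a single per-incident check that every agent action passes through. This is a sketch only: `IncidentGuard`, `BudgetExceeded`, and the budget numbers are hypothetical, and an in-memory list stands in for the real audit sink.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when an incident run exceeds its token or step budget."""


@dataclass
class IncidentGuard:
    max_tokens: int
    max_steps: int
    tokens_used: int = 0
    steps_taken: int = 0
    audit_log: list = field(default_factory=list)

    def _audit(self, action: str, outcome: str) -> None:
        # Every attempt is logged, including blocked ones.
        self.audit_log.append({"action": action, "outcome": outcome})

    def check(self, action: str, tokens: int, destructive: bool = False, approved: bool = False) -> None:
        """Record one agent action, or raise if a guardrail blocks it."""
        if destructive and not approved:
            self._audit(action, "blocked: needs human approval")
            raise PermissionError(f"{action} requires human approval")
        if self.steps_taken + 1 > self.max_steps:
            self._audit(action, "blocked: step budget exceeded")
            raise BudgetExceeded("step budget exceeded")
        if self.tokens_used + tokens > self.max_tokens:
            self._audit(action, "blocked: token budget exceeded")
            raise BudgetExceeded("token budget exceeded")
        self.steps_taken += 1
        self.tokens_used += tokens
        self._audit(action, "allowed")


guard = IncidentGuard(max_tokens=1000, max_steps=3)
guard.check("read_logs", tokens=400)
try:
    guard.check("restart_pod", tokens=50, destructive=True)  # no approval given
except PermissionError as exc:
    print(exc)
try:
    guard.check("query_trino", tokens=700)  # would push usage to 1100
except BudgetExceeded as exc:
    print(exc)
print(len(guard.audit_log), "audit entries")
```

Blocked attempts are audited too, because a confused agent repeatedly hitting a guardrail is itself a signal worth reviewing.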
2.5 Eval + improvement loop
Track AIOps recommendations vs human ground truth. Improve.
Investigate:
- After each incident: log “agent’s hypothesis”, “actual root cause”, “agreement?”
- Weekly review: disagreements; update prompts/runbooks
- Quarterly: train a small classifier on hypothesis-vs-actual to spot patterns
The weekly review is the load-bearing habit, same shape as the Sunday weekly log from the Master Plan. Without it, the agent’s “looks helpful” answers quietly drift away from useful.
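The weekly review boils down to one number and one list: the agreement rate, and the disagreements to read through. A minimal sketch, assuming each incident is logged as a record with the three fields named above; the record keys and the sample incidents are made up for illustration.

```python
def weekly_review(records: list[dict]):
    """Return (agreement_rate, disagreements) over incidents with a known root cause.

    Incidents whose root cause is still unknown are left out of the rate.
    """
    judged = [r for r in records if r["actual_root_cause"] is not None]
    if not judged:
        return None, []
    disagreements = [r for r in judged if not r["agreement"]]
    rate = (len(judged) - len(disagreements)) / len(judged)
    return rate, disagreements


# Hypothetical log entries; real ones would come from the incident store.
records = [
    {"incident": "inc-101", "hypothesis": "pool exhausted",
     "actual_root_cause": "pool exhausted", "agreement": True},
    {"incident": "inc-102", "hypothesis": "bad deploy",
     "actual_root_cause": "expired cert", "agreement": False},
    {"incident": "inc-103", "hypothesis": "memory pressure",
     "actual_root_cause": "memory pressure", "agreement": True},
    {"incident": "inc-104", "hypothesis": "crashloop from config",
     "actual_root_cause": None, "agreement": None},  # postmortem still open
]

rate, disagreements = weekly_review(records)
print(f"agreement rate: {rate:.0%}")
for r in disagreements:
    print(f"{r['incident']}: agent said '{r['hypothesis']}', actual was '{r['actual_root_cause']}'")
```

The disagreement list is the part that drives prompt and runbook updates; the rate is only useful as a trend across weeks.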
2.6 The composition recipe: Auto-incident triage loop
    Prometheus alert → AlertManager webhook → services/aiops/
        ↓
    aiops calls basecamp-mcp tools:
      - query_trino: "what services have alerted in last hour?"
      - read_logs: "tail of triage's logs around alert time"
      - list_services: "what changed? recent deploys?"
      - vector_search_postmortems: "similar past incidents?"
        ↓
    LLM (via llm-gateway): forms hypothesis
        ↓
    notify_slack: "Hypothesis: X. Suggested first action: Y. Confidence: 75%."
        ↓
    If runbook AI-executable + confidence > threshold: execute step 1
    Otherwise: human-in-loop in Slack

This is composition recipe #2 of the five Y5 recipes. It lands here in P28 and gets surfaced in the Studio command palette in P29.
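The last two stages of the recipe, the Slack message and the execute-or-escalate gate, are small enough to sketch directly. The function names and the 0.8 default threshold are assumptions for illustration; the recipe itself only says "confidence > threshold".

```python
def format_slack_message(hypothesis: str, action: str, confidence: float) -> str:
    """Render the triage result in the message shape the recipe shows."""
    return (
        f"Hypothesis: {hypothesis}. "
        f"Suggested first action: {action}. "
        f"Confidence: {confidence:.0%}."
    )


def next_step(runbook_ai_executable: bool, confidence: float, threshold: float = 0.8) -> str:
    """Gate from the recipe: execute only if the runbook is flagged AND confidence clears the bar."""
    if runbook_ai_executable and confidence > threshold:
        return "execute_step_1"
    return "human_in_loop"


print(format_slack_message("X", "Y", 0.75))
print(next_step(runbook_ai_executable=True, confidence=0.75))   # below the assumed threshold
print(next_step(runbook_ai_executable=True, confidence=0.90))
print(next_step(runbook_ai_executable=False, confidence=0.99))  # flag missing, so a human decides
```

Both conditions must hold for automatic execution, so an unflagged runbook always falls back to a human regardless of confidence.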
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / when to choose |
|---|---|---|---|
| Action authority | Read-only (advisory) | Approved actions only | Full autonomy |
| Trigger | Alert-driven | Schedule-driven (anomaly detection) | On-demand |
| Eval cadence | Weekly | Per-incident | Continuous |
| Agent runtime | LangGraph (Python) | Go state machine | LangGraph for iter-speed; Go for production once stable |
4. TOOLS (as of 2025-10)
- LangGraph (P26)
- basecamp-mcp (P27)
- llm-gateway v1.5 (Y4 P25)
- Prometheus AlertManager (Y3 — alert source)
- pgvector (Y4 P24 — incident retrieval)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Google SRE Book — Incident Management chapter | The discipline AIOps amplifies |
| Anthropic / OpenAI blog posts on production agents (2024-2025) | Real-world patterns |
| Honeycomb / Datadog AIOps writeups | Vendor-neutral patterns |
5.2 Operational depth checklist
[ ] services/aiops/ scaffolded inside basecamp/charts/aiops/
[ ] Wire Prometheus AlertManager → AIOps webhook
[ ] Build triage agent: reads alert, hypothesizes, posts to Slack
[ ] Index all past postmortems in pgvector (auto-update on new postmortem in ops-handbook)
[ ] New incident: surface top-3 similar past incidents
[ ] Annotate 3 runbooks as AI-executable (postgres-connections-exhausted, redis-memory-pressure, k8s-pod-crashlooping)
[ ] Build runbook executor (read-only first; destructive-requires-approval)
[ ] Audit log every action to Loki + a Postgres "aiops_audit" table
[ ] Per-agent token budget + abort
[ ] Weekly eval review process: agent hypothesis vs actual cause
[ ] Build aiops Grafana dashboard: incidents triaged, agreement rate, action latency
[ ] Document composition recipe in basecamp/examples/recipe-incident-triage/
5.3 The aiops production launch
By phase end:
    services/aiops/ (PUBLIC via basecamp's repo):
      Triage agent live in production
      Pattern-detection via vector search of postmortems
      Runbook executor (3 AI-executable runbooks)
      Eval + improvement loop weekly
      Audit log + observability dashboard

    Connected to:
      Prometheus AlertManager (event source)
      basecamp-mcp (tools)
      llm-gateway (reasoning)
      triage (incidents schema; aiops augments)
      pgvector (postmortem index)
      Slack (human interface)

This is the system that, when an alert fires at 2am, gives you a 75%-helpful first hypothesis instead of a blank Grafana dashboard.
6. COMPARE: AIOps vs traditional alert routing
Compare manual triage time (current) vs agent-assisted (with eval data). Measure honestly.
400 words. Real numbers from your own incident log, not vendor marketing.
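The comparison itself is simple arithmetic once the incident log has a time-to-diagnosis per incident. A sketch of that calculation follows. The minute values are placeholders invented to make the example run and must be replaced with numbers from your own incident log. Medians are used here as an assumption, since a single long incident can skew a mean.

```python
from statistics import median


def compare(manual_minutes: list[float], assisted_minutes: list[float]) -> dict:
    """Median time-to-diagnosis for each mode, plus the relative reduction."""
    m = median(manual_minutes)
    a = median(assisted_minutes)
    return {
        "manual_median": m,
        "assisted_median": a,
        "reduction_pct": round(100 * (m - a) / m, 1),
    }


# PLACEHOLDER data, not real measurements.
manual = [42, 35, 60, 28, 51]
assisted = [30, 22, 45, 25, 38]

result = compare(manual, assisted)
print(result)
```

With so few incidents per phase, report the raw counts alongside the percentage so the write-up does not overstate the effect.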
7. OPERATE
- 4+ runbooks (aiops-wrong-hypothesis-investigation, runbook-executor-stuck, vector-index-stale, agent-token-budget-exceeded)
- 2+ postmortems (yes, AIOps will misfire; postmortem honestly)
- Weekly log
8. CONTRIBUTE
LangGraph, Anthropic Cookbook, OpenLineage agent integration.
Validation criteria
[ ] All 12 operational depth checks
[ ] services/aiops/ live in basecamp Tier 9
[ ] Triage agent shipping hypotheses for real alerts
[ ] At least 1 incident in Y5 P28 where AIOps actually shortened your time to diagnosis
[ ] AIOps comparison written up
[ ] 4+ runbooks; 2+ postmortems
[ ] Pattern entries deepened:
  - all agent patterns DEEP (combined with P26 + P27)
  - defense-in-depth → reinforced (AI-action audit + approval)
[ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min): add a new alert type to AIOps: “PostgreSQL connection exhaustion.” Agent should: identify likely cause via tools, run diagnostics, propose fix, execute if pre-approved. End-to-end with audit log.
- Debug (60 min): scenario: AIOps recommended wrong action 3 times in a row. Investigate prompt, eval data, retrieved context.
- Articulate (30 min): 600 words: “Where does AIOps add value vs add risk? When is the human-in-the-loop boundary right?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Auto-execute destructive ops without approval | Catastrophe waiting |
| Skipping the eval loop | “Looks helpful” until it’s quietly wrong |
| AIOps without read-only default | Worse than no AIOps |
| Vector-index without freshness | New postmortems missing from analogies |
| Treating AIOps as replacement, not amplifier | Operators still own novel + risky |
Patterns deepened this phase
- All agent patterns reach DEEP (combined across P26-P28)
- defense-in-depth → reinforced
→ Next: Phase 29: Platform Portal + Governance — Abukix Studio launches