AIOps — services/aiops/

Phase 50 of /root, the twelfth phase of Year 5: agents that operate the platform. AIOps on the ops-handbook corpus. pgvector indexing of postmortems. Incident triage via agents through MCP. Tier 8 of basecamp comes alive. The final phase before the Y5 capstone. 6-8 weeks, ~60-80 hours.

This is the AI tier’s closing move: agents that operate basecamp using the corpus you’ve built across 5 years. The ops-handbook is now ~250 weekly logs, ~25 postmortems, ~140 runbooks, ~15 ADRs — a real corpus. Phase 50 builds agents that index it (pgvector embeddings), reason over it (RAG to retrieve relevant past incidents), and propose actions through MCP (Phase 48).

By phase end basecamp’s Tier 8 — AIOps — is alive. Incidents are triaged by agents that read past postmortems, identify similar patterns, and propose runbook execution. Humans approve destructive actions; agents handle the rest. The platform-engineering + AIOps pattern at homelab scale.

This is also the substrate for the Y5 Capstone: Abukix Studio’s AIOps panel renders agent activity to the user.


Prerequisites

  • All Y5 Phase 39-49 complete; Tier 7 + AI security operational
  • ops-handbook has substantial corpus (~5 years of weekly logs at this point)
  • 12 hrs/week budget reserved

Why this phase exists

AIOps is the practical face of AI applied to platform operations: agents that read your runbooks, postmortems, and telemetry, identify similar past incidents, and propose triage steps. Platform teams at frontier labs commonly build internal variants. The pattern works because you have a rich, structured, time-ordered corpus, and your ops-handbook is exactly that.

This phase also closes the operator-pattern arc. In Phase 26 you built platform-ctl and a custom kubebuilder operator. Phase 50 builds another custom operator: services/aiops/ watches alerts, dispatches agents to triage them, and surfaces results through the same K8s API.


The pattern-first frame

Same eight steps.


1. PROBLEM

basecamp has 5+ years of operations behind it. The ops-handbook holds the institutional knowledge. Incidents recur as variations of past ones. Without AIOps, every incident is investigated cold and experienced engineers’ time is consumed by pattern-matching. With AIOps, agents do the pattern-matching; humans handle judgment and destructive actions.


2. PRINCIPLES

2.1 The AIOps loop

Alert fires → agent retrieves similar past postmortems via vector search → reasons over them → proposes triage steps → human approves destructive actions → agent executes through platform-ctl (Phase 26) MCP.
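
The loop above can be sketched as a single function. This is a minimal illustration under stated assumptions, not basecamp’s actual implementation: `Step`, `TriageResult`, `run_loop`, and the four callables are hypothetical names standing in for the retrieval endpoint, the agent, the approval channel, and the MCP tool call.

```python
from dataclasses import dataclass, field
from typing import Callable

# Every name here is a hypothetical placeholder, not part of
# platform-ctl or of any MCP SDK.

@dataclass
class Step:
    action: str        # e.g. "kubectl describe pod api-0"
    destructive: bool  # True for restart/scale/delete
    cites: list        # postmortem IDs that motivated this step

@dataclass
class TriageResult:
    executed: list = field(default_factory=list)
    skipped: list = field(default_factory=list)

def run_loop(alert: dict,
             retrieve: Callable[[dict], list],
             propose: Callable[[dict, list], list],
             approve: Callable[[Step], bool],
             execute: Callable[[Step], None]) -> TriageResult:
    """One pass: retrieve -> propose -> approve (if destructive) -> execute."""
    postmortems = retrieve(alert)        # vector search over the corpus
    steps = propose(alert, postmortems)  # agent reasoning
    result = TriageResult()
    for step in steps:
        # Read-only steps run immediately; destructive ones need a human OK.
        if step.destructive and not approve(step):
            result.skipped.append(step.action)
            continue
        execute(step)
        result.executed.append(step.action)
    return result
```

Passing the four stages in as callables keeps the control flow testable without a live cluster or model; each stage can be swapped for a stub.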

→ Pattern: aiops — OUTLINE this phase

Investigate:

  • Walk an AIOps loop: alert from Prometheus → trigger agent → retrieve postmortems → propose → approve → execute.
  • What’s the “first three triage questions” pattern, encoded as a Kyverno policy or agent system prompt?
  • When does human-in-the-loop approval slow the response beyond what is acceptable?

2.2 ops-handbook as the corpus

5 years of ops-handbook is a substantial RAG corpus. ~250 weekly logs + ~25 postmortems + ~140 runbooks + ~15 ADRs. Embed it all into pgvector; serve via Phase 42’s RAG endpoint.
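
To make the retrieval step concrete, here is a small pure-Python stand-in for what the vector index does. It ranks documents by cosine distance, which is the metric behind pgvector’s `<=>` operator. The functions `cosine_distance` and `top_k` and the `embedding` key are illustrative assumptions; a real deployment would run the ranking inside Postgres rather than in application code.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally unrelated
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, corpus, k=5):
    """Return the k documents whose embeddings are closest to the query.

    `corpus` is a list of dicts, each with an "embedding" list of floats
    (a hypothetical in-memory shape used only for this sketch).
    """
    ranked = sorted(corpus, key=lambda doc: cosine_distance(query, doc["embedding"]))
    return ranked[:k]
```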

→ Pattern: rag-as-pattern reinforced

Investigate:

  • For your ops-handbook: what’s the right chunking strategy?
  • How do you keep the corpus fresh (continuous embedding pipeline)?
  • What’s the value of structured metadata (incident_type, severity, service_affected) alongside the embeddings?
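
One possible answer to the chunking and metadata questions, as a sketch: split each document at its section headings and copy the structured metadata onto every chunk, so retrieval can filter on it. This assumes ops-handbook documents are Markdown with `## ` section headings; that layout, the function name `chunk_by_heading`, and the `max_chars` default are assumptions, not facts about the corpus.

```python
def chunk_by_heading(text, metadata, max_chars=1200):
    """Split Markdown at '## ' headings and attach metadata to every chunk."""
    sections, title, buf = [], "preamble", []
    for line in text.splitlines():
        if line.startswith("## "):
            if buf:
                sections.append((title, "\n".join(buf).strip()))
            title, buf = line[3:].strip(), []
        else:
            buf.append(line)
    if buf:
        sections.append((title, "\n".join(buf).strip()))

    chunks = []
    for title, body in sections:
        if not body:
            continue  # skip sections that contain only blank lines
        # Long sections are cut into fixed-size windows of max_chars.
        for start in range(0, len(body), max_chars):
            chunks.append({
                "section": title,
                "text": body[start:start + max_chars],
                **metadata,  # e.g. incident_type, severity, service_affected
            })
    return chunks
```

Heading-based chunks keep a postmortem’s “Root cause” and “Remediation” sections separately retrievable, which fixed-size windows alone would blur.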

2.3 Trajectory bounding

Agents must not loop forever. Max iterations, max cost, max time. Trajectory eval ensures agents converge.

→ Pattern: agent-loop reinforced

Investigate:

  • What’s the right max-iteration budget for a triage agent? (Hint: 5-10, not 100.)
  • How do you detect a stuck agent vs a slow agent?
  • When does the agent give up and escalate to human?
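
The three bounds (iterations, cost, time) plus a simple stuck-detector can be sketched as one budget object that the agent loop consults after every step. The class name, defaults, and the “same action repeated N times” heuristic for a stuck agent are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrajectoryBudget:
    max_iterations: int = 10    # hypothetical defaults, tune per agent
    max_cost_usd: float = 1.0
    max_seconds: float = 300.0
    repeat_limit: int = 3       # identical actions in a row => "stuck"
    iterations: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)
    recent_actions: list = field(default_factory=list)

    def record(self, action: str, cost_usd: float) -> None:
        """Call once per agent step."""
        self.iterations += 1
        self.cost_usd += cost_usd
        self.recent_actions.append(action)

    def stop_reason(self) -> Optional[str]:
        """Return why the agent must stop and escalate, or None to continue."""
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        if self.cost_usd >= self.max_cost_usd:
            return "max_cost"
        if time.monotonic() - self.started >= self.max_seconds:
            return "max_time"
        tail = self.recent_actions[-self.repeat_limit:]
        if len(tail) == self.repeat_limit and len(set(tail)) == 1:
            return "stuck"  # repeating itself, unlike a slow agent that still varies
        return None
```

A slow agent keeps producing different actions and eventually hits `max_time`; a stuck agent repeats itself and is caught earlier, which is the distinction the second question asks about.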

2.4 Approval gates for destructive actions

Reading is free. Writing is dangerous. Approval gates partition agent actions: read-only allowed; restart/scale/delete requires human OK.

Investigate:

  • Walk approval gate UX: agent proposes → Slack/email/Studio → human approves → execution.
  • What’s the right approval SLA?
  • When does pre-approved automation make sense (within blast radius limits)?
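
The read/write partition and the blast-radius idea can be expressed as a small default-deny gate. The verb list, the `affected_pods` measure of blast radius, and the `pre_approved` mapping are hypothetical choices for this sketch; the real action vocabulary would come from the MCP tool surface.

```python
from enum import Enum

class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"

# Hypothetical set of verbs considered safe to run without a human.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top"})

def gate(verb, affected_pods=0, pre_approved=None):
    """Decide whether an action may run without human approval.

    pre_approved maps a verb to the largest blast radius (pods affected)
    that is allowed to run automatically, e.g. {"restart": 1}.
    Anything not explicitly allowed needs approval (default deny).
    """
    pre_approved = pre_approved or {}
    v = verb.strip().lower()
    if v in READ_ONLY_VERBS:
        return Decision.AUTO
    limit = pre_approved.get(v)
    if limit is not None and affected_pods <= limit:
        return Decision.AUTO
    return Decision.NEEDS_APPROVAL
```

Default deny matters here: an unknown verb the agent invents falls through to `NEEDS_APPROVAL` rather than executing.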

2.5 AIOps as another custom operator

services/aiops/ is a custom kubebuilder operator (like Phase 26’s platform operator). It watches IncidentReport custom resources, which are created from Prometheus alerts, and reconciles them into agent runs.
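
As a sketch of the alert-to-resource step, the function below maps an Alertmanager webhook payload to IncidentReport manifests. The payload fields used (`alerts`, `status`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow Alertmanager’s webhook format; the API group `aiops.example.com/v1alpha1`, the spec field names, the `service` label, and the `aiops` namespace are hypothetical.

```python
def incident_reports(payload, namespace="aiops"):
    """Build one IncidentReport manifest per firing alert in the payload."""
    manifests = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not open new incidents
        labels = alert.get("labels", {})
        manifests.append({
            "apiVersion": "aiops.example.com/v1alpha1",  # hypothetical group/version
            "kind": "IncidentReport",
            "metadata": {
                # The fingerprint gives a stable name, so a re-delivered
                # alert maps to the same object instead of a duplicate.
                "name": "ir-" + alert["fingerprint"],
                "namespace": namespace,
            },
            "spec": {  # hypothetical spec fields
                "alertName": labels.get("alertname", "unknown"),
                "severity": labels.get("severity", "unknown"),
                "service": labels.get("service", "unknown"),
                "summary": alert.get("annotations", {}).get("summary", ""),
                "startsAt": alert.get("startsAt"),
            },
        })
    return manifests
```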

→ Pattern: operator-pattern reinforced toward DEEP

Investigate:

  • Walk the IncidentReport lifecycle: alert fires → custom resource created → AIOps operator notices → agent runs.
  • How does the operator handle agent failure (retry, escalate)?
  • Why is “AIOps as operator” the right K8s-native shape?
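
The retry-then-escalate question can be explored with a small model of the reconcile step. A kubebuilder operator is written in Go; the Python below only illustrates the decision logic as a pure function from the current status to the next one. The phase names (`Pending`, `Retrying`, `Proposed`, `Escalated`), the status fields, and `MAX_ATTEMPTS` are assumptions.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget before a human is paged

def reconcile(status, run_agent):
    """Compute the next IncidentReport status from the current one.

    `run_agent` is any zero-argument callable that returns a proposal
    or raises on failure.
    """
    phase = status.get("phase", "Pending")
    if phase in ("Proposed", "Escalated"):
        return status  # terminal in this sketch: reconciling again is a no-op

    attempts = status.get("attempts", 0) + 1
    try:
        proposal = run_agent()
    except Exception as exc:
        next_phase = "Escalated" if attempts >= MAX_ATTEMPTS else "Retrying"
        return {"phase": next_phase, "attempts": attempts, "reason": str(exc)}
    return {"phase": "Proposed", "attempts": attempts, "proposal": proposal}
```

Keeping all progress in the resource’s status is what makes the operator shape fit: the reconciler can crash and restart without losing track of how many attempts were made.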

2.6 The full program’s patterns compose here

Phase 50 is where everything composes:

  • Phase 17 OS internals (debugging)
  • Phase 20 K8s + GitOps (state)
  • Phase 22 IaC (provisioning fix)
  • Phase 26 custom operator (the AIOps operator itself)
  • Phase 28 observability (the alert source)
  • Phase 38 KServe (the LLM inference)
  • Phase 42 RAG (the corpus retrieval)
  • Phase 46 llm-gateway (the model access)
  • Phase 48 agent runtime + MCP (the tool execution)
  • Phase 49 AI security (the safety layer)

Investigate:

  • For one synthetic incident, walk every prior pattern that participates in the AIOps response.
  • Which patterns are load-bearing? Which are nice-to-have?
  • What would break the loop, and which prior phases would you call first?

3. TRADE-OFFS

| Decision | Options | Cost |
|---|---|---|
| AIOps deployment | Custom kubebuilder operator; standalone service; agent without operator | Custom operator: K8s-native (recommended). Standalone: simpler. |
| Action scope | Read-only; read + approve writes; full automation | Read + approve (recommended); full automation only for low-risk + pre-approved. |
| Trigger source | Prometheus alerts; PagerDuty integration; Slack mentions | Prometheus (recommended; native); others as additions. |

4. TOOLS (as of 2026-06)

  • kubebuilder — for the AIOps operator
  • LangGraph — agent runtime (from Phase 48)
  • pgvector — corpus retrieval (from Phase 42)
  • llm-gateway — LLM access (from Phase 46)
  • MCP servers — tool surface (from Phase 48)
  • Prometheus Alertmanager — alert source
  • Slack/Discord webhook — approval UX

Reading

  • “AIOps in Practice” — various engineering blogs (New Relic, Datadog)
  • Anthropic on agent design — current best practices
  • Public KubeCon / SREcon talks on agents-in-production

5. MASTERY: Tier 8 alive on basecamp

[ ] Custom AIOps kubebuilder operator deployed via Flux
[ ] IncidentReport CRD defined; reconciler triggers agent runs
[ ] `ops-handbook` corpus indexed in pgvector (continuous update via Argo CronWorkflow)
[ ] Agent retrieves top-5 similar past incidents for a new alert
[ ] Agent proposes triage steps + cites postmortem references
[ ] Approval gate for destructive actions (Slack webhook or Studio UI)
[ ] Read-only actions (curl health endpoint, run `kubectl describe`) automated
[ ] Trajectory eval: agents converge within 10 iterations on benchmark scenarios
[ ] Mean time to triage: measure baseline (human-only) vs AIOps-assisted
[ ] AIOps OTel traces visible in Grafana
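
For the mean-time-to-triage check, the comparison reduces to two means and a relative reduction. A minimal sketch, assuming triage durations are recorded in minutes for each incident; both function names are hypothetical.

```python
from statistics import mean

def mean_time_to_triage(durations_min):
    """Average triage duration in minutes over a set of incidents."""
    if not durations_min:
        raise ValueError("need at least one incident")
    return mean(durations_min)

def relative_reduction(baseline_min, assisted_min):
    """Fraction by which AIOps-assisted triage is faster than the baseline.

    0.25 means assisted triage took 25% less time on average; a negative
    value means it was slower. Assumes the baseline mean is positive.
    """
    base = mean_time_to_triage(baseline_min)
    return (base - mean_time_to_triage(assisted_min)) / base
```

With only a handful of incidents the mean is sensitive to outliers, so it may be worth recording the median alongside it.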

6. COMPARE: PagerDuty AIOps or Datadog Watchdog

Read about one managed AIOps offering. Reflect on what’s gained and what’s lost by building your own implementation.

400-word reflection.


7. OPERATE

  • 4-5 runbooks: agent loop stuck, false-positive triage, corpus index stale, approval gate broken, hallucinated postmortem reference
  • 2-3 ADRs (custom operator over standalone; approval scope; corpus update cadence)
  • Weekly log

8. CONTRIBUTE

  • LangGraph agent patterns
  • A public AIOps blog post — what worked, what didn’t

What ships from this phase

  • Tier 8 of basecamp alive: services/aiops/ custom operator + agent + corpus indexing
  • AIOps runbooks
  • Full Y5 portfolio complete — ready for Y5 Capstone + Final Exam

Validation criteria

[ ] AIOps custom operator deployed
[ ] IncidentReport CRD + agent reconciliation working
[ ] ops-handbook corpus indexed and queryable
[ ] Approval gates functional
[ ] At least one real or simulated incident triaged by AIOps successfully
[ ] All 10 operational depth checks (the MASTERY checklist in section 5) complete
[ ] Compare reflection (400 words)
[ ] 4-5 runbooks
[ ] 2-3 ADRs
[ ] Pattern entries:
    - aiops → OUTLINE
    - agent-loop reinforced
    - rag-as-pattern reinforced
    - operator-pattern reinforced toward DEEP
[ ] Exit Test passed
[ ] Year 5 Capstone prep can begin

Exit Test

Time: 3 hours.

Part 1: Build (90 min)

Define a new IncidentReport scenario as a YAML custom resource. Verify that the AIOps operator picks it up, dispatches an agent, retrieves similar postmortems, and proposes triage. Approve a safe action; verify execution.

Part 2: Diagnose (60 min)

An AIOps scenario (e.g., “AIOps proposed the wrong runbook for an incident: investigate”). Possible causes: corpus drift; retrieval failure; agent hallucination.

Part 3: Articulate (30 min)

~800 words: “Walk an alert through basecamp’s full AIOps pipeline. Cite every prior /root phase that contributes. Show how 5 years of patterns compose into one operational capability.”


Anti-patterns

| Anti-pattern | Why |
|---|---|
| Full automation of destructive actions from day one | One agent error = production outage |
| Stale corpus | Agent retrieves obsolete postmortems |
| No trajectory bounds | Agent loops forever; cost explodes |
| Trusting agent output without citation | “Agent said X” is not evidence; cite the postmortem |
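
The citation rule can be enforced mechanically: every postmortem the agent cites must be one that retrieval actually returned. The sketch below assumes postmortems carry IDs of the form `PM-<number>`; that ID scheme and both function names are hypothetical.

```python
import re

# Hypothetical ID scheme: postmortems are referenced as "PM-<number>".
CITATION = re.compile(r"\bPM-\d+\b")

def unsupported_citations(agent_output, retrieved_ids):
    """Return cited postmortem IDs that were not in the retrieved set."""
    return set(CITATION.findall(agent_output)) - set(retrieved_ids)

def is_grounded(agent_output, retrieved_ids):
    """True only if the output cites something and every citation was retrieved."""
    cited = set(CITATION.findall(agent_output))
    return bool(cited) and cited <= set(retrieved_ids)
```

A non-empty result from `unsupported_citations` is a direct signal for the “hallucinated postmortem reference” runbook: the agent named an incident it was never shown.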

Patterns touched this phase

  • aiops — OUTLINE
  • agent-loop reinforced
  • rag-as-pattern reinforced
  • operator-pattern reinforced toward DEEP

→ Next: Year 5 Capstone + Final Exam — Abukix Studio MVP + Pattern Paper