aiops
AI-augmented on-call. Not autonomous chaos.
aiops applies LLMs to the operations tasks that are tedious and rules-based: incident summarization, log clustering, runbook selection, alert-quality assessment. Every action is auditable. Nothing runs without a human OK.
What you can do
Incident summaries
When triage pages, aiops generates a first-cut summary from the trace and logs.
Log clustering
Similar errors get grouped. You see 5 patterns, not 500 lines.
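To make "5 patterns, not 500 lines" concrete, here is a minimal sketch of one way such grouping could work. It is not how aiops necessarily does it: the page does not say which method is used, and this example swaps in plain template masking (no LLM) so it can run on its own. The names `template`, `cluster`, and the masking rules are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (an assumption, not taken from aiops).
# Order matters: more specific patterns run before the generic number rule.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b[0-9a-fA-F]{8,}\b"), "<id>"),
    (re.compile(r"\d+"), "<n>"),
]


def template(line: str) -> str:
    """Replace variable parts of a log line so similar lines collapse together."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def cluster(lines):
    """Group lines by template; return (template, lines) pairs, largest group first."""
    groups = defaultdict(list)
    for line in lines:
        groups[template(line)].append(line)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

With this sketch, `timeout after 30s connecting to 10.0.0.1` and `timeout after 45s connecting to 10.0.0.2` both reduce to the template `timeout after <n>s connecting to <ip>` and land in the same group.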
Runbook retrieval
Given an alert, aiops finds the closest runbook and suggests it as a starting point.
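As an illustration of "closest runbook", the sketch below ranks runbooks by word overlap (Jaccard similarity) with the alert text. This is an assumption made for the example: the page does not state the retrieval method, and a real setup might well use embeddings instead. `closest_runbook` and the shape of the `runbooks` mapping (name to text) are hypothetical.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_runbook(alert: str, runbooks: dict[str, str]):
    """Return (name, score) of the best-matching runbook, or None if nothing overlaps.

    The result is only a suggested starting point; a human decides whether to use it.
    """
    alert_tokens = tokens(alert)
    scored = [(jaccard(alert_tokens, tokens(body)), name) for name, body in runbooks.items()]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score > 0 else None
```

Returning `None` when nothing overlaps keeps the tool from suggesting an unrelated runbook just because it was the least bad match.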
Alert quality
After each incident, aiops rates the alerts: which fired usefully, which were noise.
Fully audited
Every LLM call is logged with input, output, model, and cost. Reviewable per incident.
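A minimal sketch of what one audit record might look like, assuming an append-only JSON Lines file. The four logged fields (input, output, model, cost) come from the description above; the field names, the `incident_id` key, the timestamp, and the file format are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LLMCallRecord:
    """One logged LLM call. Field names are hypothetical."""
    incident_id: str
    model: str
    input: str
    output: str
    cost_usd: float
    # Timestamp is an addition for this sketch, not something the page promises.
    logged_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def append_record(path: str, record: LLMCallRecord) -> None:
    """Append the record as one JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def records_for_incident(path: str, incident_id: str) -> list:
    """Read back every logged call for one incident, in the order it was written."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [row for row in rows if row["incident_id"] == incident_id]
```

Keying each record by incident is what makes the log reviewable per incident: a reviewer can pull every prompt, response, and cost for a single page without scanning unrelated calls.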
No autonomous action
aiops proposes. A human accepts. There is no auto-remediation, by design.
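The propose-then-accept rule can be sketched as a small gate: a proposal carries its action, and the action refuses to run until a named human has accepted it. This is an illustrative sketch only; `Proposal`, `approve`, and `run` are hypothetical names, and the page does not describe how aiops implements the check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    """Something the tool suggests, e.g. attaching a runbook to an incident."""
    summary: str
    action: Callable[[], str]
    approved_by: Optional[str] = None

    def approve(self, human: str) -> None:
        """Record which human accepted the proposal."""
        if not human.strip():
            raise ValueError("approval requires a named human")
        self.approved_by = human

    def run(self) -> str:
        """Run the action only after a human has accepted it."""
        if self.approved_by is None:
            raise PermissionError("proposal has not been accepted by a human")
        return self.action()
```

Because the check lives in `run` itself, there is no code path in this sketch that executes an action without an approver on record, which also gives the audit log a name to attach to each accepted proposal.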
Explore the rest of /root
The 9-tier K8s-native platform
Reproducible experiments
Observability & tracing
On-call alerting
Cloud infrastructure as code
Platform control CLI
Lakehouse & data platform
LLM routing & rate limits
MCP tool server plans
Weekly logs + runbooks
Get started with aiops
Clone the repo, read the plan, and start building your own version.