AI Operations

aiops

AI-augmented on-call. Not autonomous chaos.

aiops applies LLMs to operations tasks that are boring, tedious, and rules-based: incident summarization, log clustering, runbook selection, alert-quality assessment. Every action is auditable. Nothing runs without a human OK.

View on GitHub Read the docs

Capabilities

What you can do

Incident summaries

When triage pages, aiops generates a first-cut summary from the trace and logs.

Log clustering

Similar errors get grouped. You see 5 patterns, not 500 lines.

Runbook retrieval

Given an alert, aiops finds the closest runbook and suggests it as a starting point.

Alert quality

After each incident, aiops rates the alerts: which fired usefully, which were noise.

Fully audited

Every LLM call is logged with input, output, model, and cost. Reviewable per incident.

No autonomous action

aiops proposes. A human accepts. There is no auto-remediation, by design.

Explore the rest of /root

basecamp

The 9-tier K8s-native platform

rxp

Reproducible experiments

pulse

Observability & tracing

triage

On-call alerting

terralabs

Cloud infrastructure as code

platform-ctl

Platform control CLI

data-tier

Lakehouse & data platform

llm-gateway

LLM routing & rate limits

mcp-servers

MCP tool server plans

ops-handbook

Weekly logs + runbooks

Get started with aiops

Clone the repo, read the plan, and start building your own version.

View on GitHub Read the docs

All projects