Agent Development

First phase of Year 5. Agents as state machines, not vibes. LangGraph as the canonical OSS framework. Build, debug, and eval agents with the same rigor as services. ~8 weeks, ~100 hrs.

Phase 26 opens Year 5 and starts the operator-to-architect inflection the Master Plan calls out as the most important transition in ROOT. Through Year 4 you built services/llm-gateway/ and learned to serve models. Now you start building the agents that operate the platform you built — and the only way that ends well is if those agents are engineered like systems, not prompted like demos.

The frame for the entire year is set here: agents are state machines that call LLMs. Every other piece of Y5 — P27’s MCP servers, P28’s services/aiops/, P29’s command palette — assumes you can reason about an agent the same way you reason about a Kubernetes controller: typed state, explicit transitions, replayable history, observable failure modes. If that frame doesn’t land here, the rest of Y5 collapses into vibes-driven prototyping.

This phase deepens three core entries in patterns/agents/ — agent-loop, prompt-as-program, tool-use-as-capability — and ships the first useful agent against your own platform: query-helper.

Prerequisites

Year 4 complete — llm-gateway v1.5 in production with drift detection

You accept: agents are state machines that call LLMs. The state-machine framing is what makes them debuggable. “Build an agent” without a graph is “write a loop and pray.”

Why this phase exists

Year 5’s payoff is services/aiops/ (P28) — an agent that operates the platform. Before you can build a useful agent, you need to:

Frame the agent as a state machine (LangGraph or equivalent)
Treat prompts as code (versioned, tested, gated)
Treat tools as capabilities (typed, allowlisted, audited)
Eval against ground truth (not “looks good to me”)

This phase covers all four. P27 deepens tools via MCP. P28 uses everything to ship aiops.

→ Pattern: agent-loop → Pattern: tool-use-as-capability → Pattern: prompt-as-program

1. PROBLEM

You want a system that:

Takes a goal (natural language)
Plans steps using an LLM
Calls tools (read state, take actions)
Updates state from tool results
Loops until goal reached or budget exhausted
Reports back

That’s an agent. The naive implementation is a while loop with prompts and side effects. The structured implementation is a state machine with explicit transitions, replayable history, and observability.

2. PRINCIPLES

2.1 The agent loop as a state machine

→ Pattern: agent-loop

LangGraph: nodes are functions, edges are conditional transitions, state is a typed object that flows.

Investigate:

Read LangGraph docs (langchain-ai.github.io/langgraph)
Build a simple agent: planner → tool-caller → observer → planner (loop) → done
Why does the state machine framing make debugging tractable? (You can replay any state.)

Concrete example: a 4-node query-helper looks like plan → call_tool → observe → answer, with a conditional edge from observe back to plan when the answer isn’t grounded yet. State is a pydantic model holding question, messages, tool_calls, evidence, final_answer. Each node returns a state update that the runtime merges into the current state; with a checkpointer configured, the runtime persists that state after every step. Every transition becomes a row you can replay.
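The query-helper graph described above can be sketched without any framework. This is a minimal, stdlib-only illustration of the state-machine shape, not LangGraph's API: the node bodies are canned stand-ins for LLM and tool calls, and the names `AgentState`, `NODES`, `EDGES`, and `run` are assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AgentState:
    question: str
    messages: tuple = ()
    tool_calls: tuple = ()
    evidence: tuple = ()
    final_answer: Optional[str] = None


def plan(state: AgentState) -> AgentState:
    # Stand-in for an LLM call that picks the next tool call.
    call = ("query_trino", "SELECT count(*) FROM incidents WHERE status='open'")
    return replace(state, tool_calls=state.tool_calls + (call,))


def call_tool(state: AgentState) -> AgentState:
    # Stand-in for real tool execution; returns a canned result.
    name, _arg = state.tool_calls[-1]
    return replace(state, messages=state.messages + (f"{name} -> 5",))


def observe(state: AgentState) -> AgentState:
    # Promote the latest tool result to evidence.
    return replace(state, evidence=state.evidence + (state.messages[-1],))


def answer(state: AgentState) -> AgentState:
    return replace(state, final_answer=f"Grounded in: {state.evidence[-1]}")


def route_after_observe(state: AgentState) -> str:
    # Conditional edge: go back to plan until there is evidence to answer from.
    return "answer" if state.evidence else "plan"


NODES = {"plan": plan, "call_tool": call_tool, "observe": observe, "answer": answer}
EDGES = {
    "plan": lambda s: "call_tool",
    "call_tool": lambda s: "observe",
    "observe": route_after_observe,
    "answer": lambda s: "END",
}


def run(state: AgentState, start: str = "plan", max_steps: int = 10):
    """Run the graph; return the final state and a (node, state) history."""
    history = [("START", state)]
    node = start
    for _ in range(max_steps):
        if node == "END":
            break
        state = NODES[node](state)
        history.append((node, state))  # one replayable row per transition
        node = EDGES[node](state)
    return state, history


final, history = run(AgentState(question="How many active incidents in triage?"))
# Replay: resume from the state recorded after call_tool.
replayed, _ = run(history[2][1], start="observe")
```

Because every node is a pure function of the state, resuming from any recorded row reproduces the rest of the run. That property is what the "replay any state" argument rests on.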

2.2 Prompt-as-program

→ Pattern: prompt-as-program

Prompts are config. They have versions, tests, and PR review.

Investigate:

Store prompts in basecamp git (or a prompt-store service)
Version prompts; PR template includes “ran eval suite, no regression”
Eval suite: golden inputs + expected outputs (or expected-properties)
Compare with model versioning (Y4 P20) — same shape, different artifact

Same shape as the model-registry discipline from Year 4: prompts/query-helper/v3.yaml lives next to a tests/ directory; the PR that bumps it must show eval pass; rollback is git revert.
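One way to picture the discipline above is a prompt record that carries its own version and content hash, plus a no-regression gate that a PR check could call. This is a sketch under assumptions: `PromptVersion`, `gate`, and the template text are hypothetical, and the scores would come from whatever eval harness you build.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str

    @property
    def digest(self) -> str:
        # Content hash so traces can pin the exact prompt text that ran.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        return self.template.format(**variables)


def gate(baseline_score: float, candidate_score: float, tolerance: float = 0.0) -> bool:
    """Return True when the candidate prompt does not regress on the eval suite."""
    return candidate_score >= baseline_score - tolerance


v3 = PromptVersion(
    name="query-helper",
    version=3,
    template="Answer using only the evidence.\nQuestion: {question}",
)
prompt_text = v3.render(question="What ran in basecamp Airflow last night?")
can_merge = gate(baseline_score=0.80, candidate_score=0.83)
```

Logging `name`, `version`, and `digest` with every run makes it possible to tie an eval score or a bad answer back to one specific prompt revision.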

2.3 Tool use as capability

→ Pattern: tool-use-as-capability

A tool = a typed function the agent can call. Tools are allowlisted; agents can’t invent new ones; every tool call is audited.

Investigate:

Define 5 tools for a homelab agent: read_logs, query_trino, list_runbooks, get_alert, notify_slack
Each: typed inputs (pydantic / JSON Schema), typed outputs, side-effect declaration
Allowlist: agents declare which tools they need; platform enforces
Audit log: every tool call → Loki

This is the OUTLINE pass on the pattern; P27 takes it to DEEP via MCP.
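The four properties listed above (typed inputs, side-effect declaration, allowlist, audit log) fit in a small registry. The sketch below uses stdlib dataclasses instead of pydantic and a plain callable instead of a Loki client; `ReadLogsInput`, `Tool`, and `ToolRegistry` are names assumed for illustration, not an existing API.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class ReadLogsInput:
    service: str
    since_minutes: int = 15

    def __post_init__(self) -> None:
        # Validate at the boundary so bad LLM-generated arguments fail loudly.
        if not isinstance(self.service, str) or not self.service:
            raise ValueError("service must be a non-empty string")
        if not isinstance(self.since_minutes, int) or not 0 < self.since_minutes <= 1440:
            raise ValueError("since_minutes must be an int in 1..1440")


@dataclass(frozen=True)
class Tool:
    name: str
    input_type: type
    side_effect: str  # declared up front: "read" or "write"
    fn: Callable


class ToolRegistry:
    def __init__(self, audit_sink: Callable[[str], None]) -> None:
        self._tools = {}
        self._audit_sink = audit_sink  # assumption: something that ships lines to a log store

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, agent: str, allowlist: frozenset, name: str, raw_args: dict):
        # Default-deny: the tool must exist and be on this agent's allowlist.
        if name not in allowlist or name not in self._tools:
            self._audit(agent, name, raw_args, "denied")
            raise PermissionError(f"{agent} may not call {name}")
        tool = self._tools[name]
        try:
            args = tool.input_type(**raw_args)
        except (TypeError, ValueError):
            self._audit(agent, name, raw_args, "invalid")
            raise
        result = tool.fn(args)
        self._audit(agent, name, asdict(args), "ok")
        return result

    def _audit(self, agent: str, tool: str, args: dict, outcome: str) -> None:
        record = {"ts": time.time(), "agent": agent, "tool": tool,
                  "args": args, "outcome": outcome}
        self._audit_sink(json.dumps(record, default=str))


audit_lines = []
registry = ToolRegistry(audit_lines.append)
registry.register(Tool("read_logs", ReadLogsInput, "read",
                       lambda a: [f"{a.service}: ok"]))
logs = registry.call("query-helper", frozenset({"read_logs"}),
                     "read_logs", {"service": "llm-gateway"})
```

Denied and invalid calls are audited as well as successful ones, since the calls an agent attempted but was refused are often the most useful lines when debugging.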

2.4 Eval: ground truth, not vibes

You can’t improve what you can’t measure. Eval set + scoring + tracking over time.

Investigate:

Build an eval set for your agent (20-50 inputs with expected behavior)
Scoring: rule-based where possible; LLM-judge with disagreement audit
Track eval over agent versions; gate prompt changes on no-regression

Example: query-helper eval set has 30 questions where the expected SQL fragment is known; rule-based grading checks the agent’s query_trino call contains it. The LLM-judge handles freeform answers; sample 10% for human re-judging weekly to catch judge drift.
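The rule-based half of that example might look like the sketch below. It assumes tool calls are recorded as `(tool_name, argument)` pairs; `EvalCase`, `grade`, and `success_rate` are hypothetical names, and the normalization is deliberately crude.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_sql_fragment: str


def normalize(sql: str) -> str:
    # Collapse whitespace and case so formatting differences do not fail a case.
    # Caveat: this also lowercases string literals inside the SQL.
    return re.sub(r"\s+", " ", sql).strip().lower()


def grade(case: EvalCase, tool_calls: list) -> bool:
    """Pass when any query_trino call contains the expected SQL fragment."""
    want = normalize(case.expected_sql_fragment)
    return any(
        name == "query_trino" and want in normalize(arg)
        for name, arg in tool_calls
    )


def success_rate(results: list) -> float:
    return sum(results) / len(results) if results else 0.0


case = EvalCase(
    question="How many active incidents in triage?",
    expected_sql_fragment="FROM incidents WHERE status='open'",
)
recorded_calls = [
    ("query_trino", "SELECT count(*)\n  FROM incidents\n  WHERE status='open'"),
]
passed = grade(case, recorded_calls)
```

Storing `success_rate` per agent version gives the no-regression gate something concrete to compare against.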

2.5 Observability for agents

Trace every agent step. Log every tool call. Metrics: success rate, steps-per-task, cost-per-task.

Investigate:

Wire LangGraph to OTel: each node = a span; tool calls = child spans
Build a Grafana dashboard: agent success rate, step count distribution, cost
Add a “replay this run” UI in Studio (P29) so you can debug agent runs visually
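To make the "node = span, tool call = child span" shape concrete, here is a toy tracer built on the standard library. It is not the OpenTelemetry API: `Tracer`, `Span`, `run_metrics`, and the `node:` / `tool:` naming convention are assumptions for this sketch. In practice the same nesting would be produced with a real OTel tracer and exported to Tempo.

```python
import contextlib
import itertools
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: int
    name: str
    parent_id: Optional[int]
    attributes: dict = field(default_factory=dict)
    duration_s: float = 0.0


class Tracer:
    """Single-threaded toy tracer: nesting `with` blocks creates child spans."""

    def __init__(self) -> None:
        self.spans = []
        self._stack = []
        self._ids = itertools.count(1)

    @contextlib.contextmanager
    def span(self, name: str, **attributes):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(next(self._ids), name, parent_id, dict(attributes))
        self.spans.append(current)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()


def run_metrics(tracer: Tracer) -> dict:
    # Derive per-task numbers for a dashboard from the recorded spans.
    steps = sum(1 for s in tracer.spans if s.name.startswith("node:"))
    tool_calls = sum(1 for s in tracer.spans if s.name.startswith("tool:"))
    cost = sum(s.attributes.get("cost_usd", 0.0) for s in tracer.spans)
    return {"steps": steps, "tool_calls": tool_calls, "cost_usd": round(cost, 6)}


tracer = Tracer()
with tracer.span("run", agent="query-helper"):
    with tracer.span("node:plan", cost_usd=0.002):
        pass  # the LLM call would go here
    with tracer.span("node:call_tool"):
        with tracer.span("tool:query_trino"):
            pass  # the tool execution would go here
metrics = run_metrics(tracer)
```

Steps-per-task and cost-per-task fall out of the spans directly; success rate needs the eval result or a final status attached to the run span.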

2.6 Budget + safety

Agents can run away. Token budget per task. Step budget per task. Destructive-action gate.

Investigate:

Set max-steps + max-tokens; abort + report on exceed
Read-only by default; destructive ops require human approval
Default-deny tool list; explicit allowlist per agent
Per-agent rate limit (Redis token bucket from Y3 P18)
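The first two controls above can be sketched in a few lines. This is an illustration under assumptions: `Budget`, `BudgetExceeded`, and `authorize` are hypothetical names, and the token counts would come from your LLM client's usage data.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(Exception):
    """Raised when an agent run exceeds its step or token budget."""


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        # Call once per agent step, with the tokens that step consumed.
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget exceeded: {self.steps} > {self.max_steps}")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens} > {self.max_tokens}")


def authorize(side_effect: str, approved_by: Optional[str] = None) -> bool:
    """Read-only by default: anything other than a read needs a named approver."""
    if side_effect == "read":
        return True
    return approved_by is not None


budget = Budget(max_steps=2, max_tokens=1000)
budget.charge(400)
budget.charge(400)
try:
    budget.charge(100)  # third step: over the step budget
    report = "completed"
except BudgetExceeded as exc:
    report = f"aborted: {exc}"
```

Catching `BudgetExceeded` at the top of the run loop is what turns a runaway agent into an "abort + report" outcome rather than a silent stall.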

3. TRADE-OFFS

Decision	Option A	Option B	Option C / recommendation
Framework	LangGraph (LangChain)	Custom Go state machine	LlamaIndex agents
Prompt store	basecamp git	dedicated service	tool registry
Tool definition	typed (pydantic / JSON Schema)	freeform string-in/out	Typed; freeform = reliability disaster
Auth	OIDC token in agent context	service-to-service token	OIDC (Dex, already there)
Eval cadence	per-PR	weekly	continuous

4. TOOLS (as of 2025-10)

LangGraph 0.2+ (Python; the framework)
LangSmith (LangChain’s tracing — or use OTel directly to Tempo)
Anthropic / OpenAI SDKs OR your own llm-gateway (preferred; eat your own dog food)
pydantic (tool-input typing)
pytest + a small eval harness

5. MASTERY

5.1 Reading list

Required	Why
Anthropic’s “Building effective agents” blog post + research	The state-of-the-art framing
LangGraph docs — Concepts + Tutorials	The implementation
“The agent loop as a state machine” papers (Reflexion, ReAct)	The theory

Recommended	Why
Anthropic Cookbook examples	Practical patterns
Promptops literature on prompt engineering at scale	Discipline

5.2 Operational depth checklist

[ ] Install LangGraph; build a 4-node agent (plan / tool / observe / answer)
[ ] Define 5 typed tools for a homelab use case (read_logs, query_trino, etc.)
[ ] Wire prompts to basecamp git; PR template requires eval pass
[ ] Build eval suite (20-50 inputs); track score over agent versions
[ ] Wire OTel from LangGraph nodes; view traces in Tempo
[ ] Add agent-Grafana dashboard (success rate, step count, cost)
[ ] Token + step budget per agent; abort + report on exceed
[ ] Audit log every tool call to Loki
[ ] Build "replay this run" via stored state from each step
[ ] Run agent against 50 eval inputs; achieve >70% success rate before P28
[ ] Document agent patterns in basecamp/charts/agents/README.md

5.3 The first useful agent: `query-helper`

Build a small but real agent that answers questions about basecamp:

"How many active incidents in triage?"
  → query_trino(SELECT count(*) FROM incidents WHERE status='open')
  → 5

"What ran in basecamp Airflow last night?"
  → query_trino(SELECT * FROM airflow_runs WHERE date = current_date - INTERVAL '1' DAY)
  → list of runs with status

"Show me weekly logs from March 2027 mentioning 'flink'"
  → query_notes_rag(...)  (uses Y4 notes-rag)
  → results

This is query-helper — a read-only agent against your own data. Lives in basecamp/charts/agents/query-helper/. P29’s Studio command palette wires it as the default backend, and the same agent shape gets reused (with write tools added through P27) as the spine of services/aiops/ in P28. One agent, three Y5 launches.

6. COMPARE: LangGraph vs custom Go state machine

For a production agent (P28 aiops), is LangGraph (Python) the right framework or do you write the state machine in Go for production reliability?

500 words. The honest answer probably isn’t “one or the other” — it’s “Python now, port the hot loops to Go when the eval data tells you to.”

7. OPERATE

3+ runbooks (agent-stuck-in-loop, tool-call-failed-debug, eval-regression-investigation)
1+ ADR (e.g., “Why Python LangGraph for now, with custom Go in mind for P28”)
Weekly log

8. CONTRIBUTE

LangGraph, LangChain, MCP servers, Anthropic Cookbook examples.

Validation criteria

[ ] All 11 operational depth checks
[ ] LangGraph + 1 working agent (query-helper) in basecamp
[ ] Eval suite + tracking
[ ] LangGraph vs custom-Go writeup
[ ] 3+ runbooks; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
    - agent-loop → DEEP
    - tool-use-as-capability → OUTLINE (DEEP after P27 MCP)
    - prompt-as-program → DEEP
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min): given a new use case (“notify me when a service SLO is burning fast”), build an agent: 3+ tools, prompt versioned, eval set, OTel traced.
Diagnose (60 min): scenario: agent loops indefinitely; trace via OTel; find the prompt or tool flaw.
Articulate (30 min): 600 words: “Walk an agent’s execution from user prompt to final answer. What patterns fire at each step?”

Anti-patterns

Anti-pattern	Why
Prompts in code (not versioned)	Drift; can’t audit; can’t roll back
Tools as freeform string-in/out	Reliability disaster
Agent that can do anything by default	Unbounded blast radius; require allowlist + budget + approval gate
Eval = “looks good to me”	Can’t improve what you can’t measure
Skipping the state-machine framing	Loops without explicit transitions = unmaintainable

Patterns deepened this phase

agent-loop → DEEP
tool-use-as-capability → OUTLINE
prompt-as-program → DEEP

→ Next: Phase 27: MCP + Tool Use
