Agent Development
First phase of Year 5. Agents as state machines, not vibes. LangGraph as the canonical OSS framework. Build, debug, and eval agents with the same rigor as services. ~8 weeks, ~100 hrs.
Phase 26 opens Year 5 and starts the operator-to-architect inflection the Master Plan calls out as the most important transition in ROOT. Through Year 4 you built services/llm-gateway/ and learned to serve models. Now you start building the agents that operate the platform you built — and the only way that ends well is if those agents are engineered like systems, not prompted like demos.
The frame for the entire year is set here: agents are state machines that call LLMs. Every other piece of Y5 — P27’s MCP servers, P28’s services/aiops/, P29’s command palette — assumes you can reason about an agent the same way you reason about a Kubernetes controller: typed state, explicit transitions, replayable history, observable failure modes. If that frame doesn’t land here, the rest of Y5 collapses into vibes-driven prototyping.
This phase deepens three core entries in patterns/agents/ — agent-loop, prompt-as-program, tool-use-as-capability — and ships the first useful agent against your own platform: query-helper.
Prerequisites
- Year 4 complete — llm-gateway v1.5 in production with drift detection
- You accept: agents are state machines that call LLMs. The state-machine framing is what makes them debuggable. “Build an agent” without a graph is “write a loop and pray.”
Why this phase exists
Year 5’s payoff is services/aiops/ (P28) — an agent that operates the platform. Before you can build a useful agent, you need to:
- Frame the agent as a state machine (LangGraph or equivalent)
- Treat prompts as code (versioned, tested, gated)
- Treat tools as capabilities (typed, allowlisted, audited)
- Eval against ground truth (not “looks good to me”)
This phase covers all four. P27 deepens tools via MCP. P28 uses everything to ship aiops.
→ Pattern: agent-loop → Pattern: tool-use-as-capability → Pattern: prompt-as-program
1. PROBLEM
You want a system that:
- Takes a goal (natural language)
- Plans steps using an LLM
- Calls tools (read state, take actions)
- Updates state from tool results
- Loops until goal reached or budget exhausted
- Reports back
That’s an agent. The naive implementation is a while loop with prompts and side effects. The structured implementation is a state machine with explicit transitions, replayable history, and observability.
2. PRINCIPLES
2.1 The agent loop as a state machine
→ Pattern: agent-loop
LangGraph: nodes are functions, edges are transitions (fixed or conditional), and state is a typed object that flows from node to node.
Investigate:
- Read LangGraph docs (langchain-ai.github.io/langgraph)
- Build a simple agent: planner → tool-caller → observer → planner (loop) → done
- Why does the state machine framing make debugging tractable? (You can replay any state.)
Concrete example: a 4-node query-helper looks like plan → call_tool → observe → answer, with a conditional edge from observe back to plan when the answer isn’t grounded yet. State is a pydantic model holding question, messages, tool_calls, evidence, final_answer. Each node returns a state update that the runtime merges into the current state; with a checkpointer configured, each step is persisted, so every transition becomes a row you can replay.
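The plan → call_tool → observe → answer shape can be sketched without any framework, which makes the state-machine framing easy to see. This is a minimal stdlib-only sketch, not LangGraph's API: the dataclass stands in for the pydantic state model, `fake_query_trino` and the deliberately wrong first plan are hypothetical, and the `history` list stands in for whatever store the real runtime persists steps to.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AgentState:
    question: str
    messages: tuple = ()
    tool_calls: tuple = ()
    evidence: tuple = ()
    final_answer: Optional[str] = None


def fake_query_trino(sql: str) -> Optional[int]:
    # Hypothetical stand-in for the real tool: only the correct table returns a count.
    return 5 if "FROM incidents" in sql else None


def plan(s: AgentState) -> AgentState:
    # A real planner would call the LLM; here the first attempt is deliberately wrong.
    table = "incident" if not s.tool_calls else "incidents"
    call = {"tool": "query_trino",
            "sql": f"SELECT count(*) FROM {table} WHERE status='open'"}
    return replace(s, tool_calls=s.tool_calls + (call,))


def call_tool(s: AgentState) -> AgentState:
    result = fake_query_trino(s.tool_calls[-1]["sql"])
    return replace(s, messages=s.messages + ({"role": "tool", "result": result},))


def observe(s: AgentState) -> AgentState:
    # Only a non-empty tool result counts as grounding evidence.
    result = s.messages[-1]["result"]
    return s if result is None else replace(s, evidence=s.evidence + (result,))


def answer(s: AgentState) -> AgentState:
    return replace(s, final_answer=str(s.evidence[-1]))


NODES = {"plan": plan, "call_tool": call_tool, "observe": observe, "answer": answer}
EDGES = {
    "plan": lambda s: "call_tool",
    "call_tool": lambda s: "observe",
    # Conditional edge: loop back to plan until the answer is grounded.
    "observe": lambda s: "answer" if s.evidence else "plan",
    "answer": lambda s: None,
}


def run(state: AgentState, max_steps: int = 10):
    history = []  # one serialized row per transition, usable for replay
    node = "plan"
    for _ in range(max_steps):
        state = NODES[node](state)
        history.append((node, json.dumps(asdict(state), sort_keys=True)))
        node = EDGES[node](state)
        if node is None:
            return state, history
    raise RuntimeError("step budget exhausted")
```

Calling `run(AgentState(question="How many active incidents in triage?"))` walks the loop once through the failed plan and once through the corrected one; any row in `history` can be loaded with `json.loads` to inspect the state at that step.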
2.2 Prompt-as-program
→ Pattern: prompt-as-program
Prompts are config. They have versions, tests, and PR review.
Investigate:
- Store prompts in basecamp git (or a prompt-store service)
- Version prompts; PR template includes “ran eval suite, no regression”
- Eval suite: golden inputs + expected outputs (or expected-properties)
- Compare with model versioning (Y4 P20) — same shape, different artifact
Same shape as the model-registry discipline from Year 4: prompts/query-helper/v3.yaml lives next to a tests/ directory; the PR that bumps it must show eval pass; rollback is git revert.
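One way to picture a prompt as a versioned, testable artifact is sketched below. It is a stdlib-only illustration under assumptions: the template text, the `PromptVersion` record, and the property checks are all hypothetical, and the in-code constant stands in for the parsed contents of a file such as prompts/query-helper/v3.yaml.

```python
import hashlib
from dataclasses import dataclass
from string import Template
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str

    @property
    def digest(self) -> str:
        # Short content hash: lets an eval report pin the exact prompt text it ran.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        # substitute() raises KeyError if a placeholder has no value.
        return Template(self.template).substitute(**variables)


# Hypothetical prompt text, standing in for the versioned file's contents.
V3 = PromptVersion(
    name="query-helper",
    version="v3",
    template=(
        "You are a read-only assistant for basecamp.\n"
        "Answer using only these tools: $tools\n"
        "Question: $question\n"
    ),
)


def check_prompt_properties(prompt: PromptVersion) -> List[str]:
    """Expected-properties test: returns a list of failures, empty on pass."""
    failures = []
    rendered = prompt.render(tools="query_trino", question="How many open incidents?")
    if "read-only" not in rendered:
        failures.append("prompt must state the read-only constraint")
    if "$" in rendered:
        failures.append("unrendered placeholder left in prompt")
    if len(rendered) > 2000:
        failures.append("prompt exceeds the size limit")
    return failures
```

A pytest test would simply assert `check_prompt_properties(V3) == []`, so a PR that drops the read-only line or leaves a stray placeholder fails before review.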
2.3 Tool use as capability
→ Pattern: tool-use-as-capability
A tool = a typed function the agent can call. Tools are allowlisted; agents can’t invent new ones; every tool call is audited.
Investigate:
- Define 5 tools for a homelab agent: read_logs, query_trino, list_runbooks, get_alert, notify_slack
- Each: typed inputs (pydantic / JSON Schema), typed outputs, side-effect declaration
- Allowlist: agents declare which tools they need; platform enforces
- Audit log: every tool call → Loki
This is the OUTLINE pass on the pattern; P27 takes it to DEEP via MCP.
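The typed, allowlisted, audited shape described above might look roughly like this. It is a stdlib-only sketch under assumptions: dataclasses and `isinstance` checks stand in for pydantic / JSON Schema validation, the tool signatures and return values are invented for illustration, and `audit_sink` is a plain callable standing in for whatever ships lines to Loki.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Tool:
    name: str
    fn: Callable[..., object]
    input_types: Dict[str, type]
    side_effects: bool  # declared up front so a gate can treat writes differently


class ToolNotAllowed(PermissionError):
    pass


class ToolRegistry:
    def __init__(self, tools: List[Tool], audit_sink: Callable[[str], None]) -> None:
        self._tools = {t.name: t for t in tools}
        self._audit = audit_sink

    def call(self, agent: str, allowlist: FrozenSet[str], name: str, /, **args: object):
        outcome = "error"
        try:
            tool = self._tools.get(name)
            # Default-deny: unknown tools and tools off the allowlist are both refused.
            if tool is None or name not in allowlist:
                outcome = "denied"
                raise ToolNotAllowed(f"{agent} may not call {name}")
            if set(args) != set(tool.input_types):
                raise TypeError(f"{name} expects {sorted(tool.input_types)}")
            for key, expected in tool.input_types.items():
                if not isinstance(args[key], expected):
                    raise TypeError(f"{name}.{key} must be {expected.__name__}")
            result = tool.fn(**args)
            outcome = "ok"
            return result
        finally:
            # Every attempt is audited, including denied and failed calls.
            self._audit(json.dumps(
                {"ts": time.time(), "agent": agent, "tool": name,
                 "args": args, "outcome": outcome},
                default=str,
            ))


# Hypothetical tool bodies; real ones would hit the log store and Slack.
TOOLS = [
    Tool("read_logs",
         lambda service, limit: [f"{service} log line {i}" for i in range(limit)],
         {"service": str, "limit": int}, side_effects=False),
    Tool("notify_slack",
         lambda channel, text: "sent",
         {"channel": str, "text": str}, side_effects=True),
]
```

With `registry = ToolRegistry(TOOLS, audit.append)` and an allowlist of `frozenset({"read_logs"})`, a `read_logs` call succeeds, a `notify_slack` call raises `ToolNotAllowed`, and both attempts land in the audit list.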
2.4 Eval: ground truth, not vibes
You can’t improve what you can’t measure. Eval set + scoring + tracking over time.
Investigate:
- Build an eval set for your agent (20-50 inputs with expected behavior)
- Scoring: rule-based where possible; LLM-judge with disagreement audit
- Track eval over agent versions; gate prompt changes on no-regression
Example: query-helper eval set has 30 questions where the expected SQL fragment is known; rule-based grading checks the agent’s query_trino call contains it. The LLM-judge handles freeform answers; sample 10% for human re-judging weekly to catch judge drift.
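The rule-based half of that example can be sketched in a few lines. This is an illustrative sketch, not the course's harness: the `EvalCase` shape, the tool-call dict layout, and the whitespace/case normalization are assumptions, and `agent` is any callable that returns the tool calls it made for a question.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_sql_fragment: str


def normalize(sql: str) -> str:
    # Ignore case and whitespace differences so formatting does not fail a case.
    return " ".join(sql.lower().split())


def grade(case: EvalCase, tool_calls: List[Dict]) -> bool:
    """Pass if any query_trino call contains the expected SQL fragment."""
    wanted = normalize(case.expected_sql_fragment)
    for call in tool_calls:
        if call.get("tool") == "query_trino" and wanted in normalize(call.get("sql", "")):
            return True
    return False


def pass_rate(cases: List[EvalCase], agent: Callable[[str], List[Dict]]) -> float:
    passed = sum(grade(case, agent(case.question)) for case in cases)
    return passed / len(cases)


def no_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Gate for prompt changes: the candidate score must not drop below baseline."""
    return candidate >= baseline - tolerance


CASES = [
    EvalCase("How many active incidents in triage?", "FROM incidents WHERE status='open'"),
    EvalCase("What ran in basecamp Airflow last night?", "FROM airflow_runs"),
]


def fake_agent(question: str) -> List[Dict]:
    # Hypothetical agent that only knows how to answer the incidents question.
    if "incidents" in question:
        return [{"tool": "query_trino",
                 "sql": "SELECT count(*)\n  FROM incidents\n  WHERE status='open'"}]
    return [{"tool": "read_logs", "service": "airflow"}]
```

Here `pass_rate(CASES, fake_agent)` is 0.5, and `no_regression(0.7, 0.5)` is false, which is the signal a PR check would use to block a prompt bump.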
2.5 Observability for agents
Trace every agent step. Log every tool call. Metric: success rate, steps-per-task, cost-per-task.
Investigate:
- Wire LangGraph to OTel: each node = a span; tool calls = child spans
- Build a Grafana dashboard: agent success rate, step count distribution, cost
- Add a “replay this run” UI in Studio (P29) so you can debug agent runs visually
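The "node = span, tool call = child span" structure can be illustrated with a tiny in-memory tracer. This is a stdlib stand-in, not the OpenTelemetry SDK: the `Tracer` and `Span` classes and the span names are hypothetical, and a real setup would create spans through an OTel tracer and export them to Tempo.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]
    attributes: Dict[str, object] = field(default_factory=dict)
    duration_s: float = 0.0


class Tracer:
    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: object):
        # The innermost open span becomes the parent, which gives the nesting.
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent, dict(attributes))
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(current)


def run_traced(tracer: Tracer) -> None:
    # Hypothetical run: one span per node, tool calls as children of their node.
    with tracer.span("agent.run", agent="query-helper"):
        with tracer.span("node.plan", tokens=120):
            pass
        with tracer.span("node.call_tool"):
            with tracer.span("tool.query_trino") as tool_span:
                tool_span.attributes["rows"] = 1
        with tracer.span("node.observe"):
            pass
        with tracer.span("node.answer", tokens=40):
            pass


def summarize(spans: List[Span]) -> Dict[str, int]:
    """Derive dashboard-style numbers (steps, tool calls, tokens) from one run."""
    return {
        "steps": sum(s.name.startswith("node.") for s in spans),
        "tool_calls": sum(s.name.startswith("tool.") for s in spans),
        "tokens": sum(int(s.attributes.get("tokens", 0)) for s in spans),
    }
```

The same three numbers in `summarize` map onto the steps-per-task and cost-per-task metrics once token counts are converted to cost.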
2.6 Budget + safety
Agents can run away. Token budget per task. Step budget per task. Destructive-action gate.
Investigate:
- Set max-steps + max-tokens; abort + report on exceed
- Read-only by default; destructive ops require human approval
- Default-deny tool list; explicit allowlist per agent
- Per-agent rate limit (Redis token bucket from Y3 P18)
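The step budget, token budget, and destructive-action gate above can be sketched as follows. This is an illustrative sketch with hypothetical names (`Budget`, `guard_action`, `run_with_budget`); the per-agent rate limit is left out because it depends on the Redis token bucket from Y3 P18.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


class BudgetExceeded(RuntimeError):
    pass


class ApprovalRequired(PermissionError):
    pass


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        # Check before committing so the counters never exceed the limits.
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exhausted")
        if self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exhausted")
        self.steps += 1
        self.tokens += tokens


def guard_action(tool_name: str, destructive: bool,
                 approved_by: Optional[str] = None) -> None:
    """Read-only by default: destructive tools need a named human approver."""
    if destructive and approved_by is None:
        raise ApprovalRequired(f"{tool_name} needs human approval")


def run_with_budget(step_costs: Iterable[int], budget: Budget) -> Dict[str, object]:
    """Abort and report on exceed, rather than failing silently or looping on."""
    status, reason = "done", None
    try:
        for cost in step_costs:
            budget.charge(cost)
    except BudgetExceeded as exc:
        status, reason = "aborted", str(exc)
    return {"status": status, "steps": budget.steps,
            "tokens": budget.tokens, "reason": reason}
```

For example, `run_with_budget([100, 100, 100], Budget(max_steps=2, max_tokens=1000))` stops after two steps and returns a report with status "aborted" and the reason, which is what the run's caller or dashboard would surface.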
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / verdict |
|---|---|---|---|
| Framework | LangGraph (LangChain) | Custom Go state machine | LlamaIndex agents |
| Prompt store | basecamp git | dedicated service | tool registry |
| Tool definition | typed (pydantic / JSON Schema) | freeform string-in/out | Typed; freeform = reliability disaster |
| Auth | OIDC token in agent context | service-to-service token | OIDC (Dex, already there) |
| Eval cadence | per-PR | weekly | continuous |
4. TOOLS (as of 2025-10)
- LangGraph 0.2+ (Python; the framework)
- LangSmith (LangChain’s tracing — or use OTel directly to Tempo)
- Anthropic / OpenAI SDKs OR your own llm-gateway (preferred; eat your own dog food)
- pydantic (tool-input typing)
- pytest + a small eval harness
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| Anthropic’s “Building effective agents” blog post + research | The state-of-the-art framing |
| LangGraph docs — Concepts + Tutorials | The implementation |
| “The agent loop as a state machine” papers (Reflexion, ReAct) | The theory |
| Recommended | Why |
|---|---|
| Anthropic Cookbook examples | Practical patterns |
| Promptops literature on prompt engineering at scale | Discipline |
5.2 Operational depth checklist
[ ] Install LangGraph; build a 4-node agent (plan / tool / observe / answer)
[ ] Define 5 typed tools for a homelab use case (read_logs, query_trino, etc.)
[ ] Wire prompts to basecamp git; PR template requires eval pass
[ ] Build eval suite (20-50 inputs); track score over agent versions
[ ] Wire OTel from LangGraph nodes; view traces in Tempo
[ ] Add agent-Grafana dashboard (success rate, step count, cost)
[ ] Token + step budget per agent; abort + report on exceed
[ ] Audit log every tool call to Loki
[ ] Build "replay this run" via stored state from each step
[ ] Run agent against 50 eval inputs; achieve >70% success rate before P28
[ ] Document agent patterns in basecamp/charts/agents/README.md
5.3 The first useful agent: query-helper
Build a small but real agent that answers questions about basecamp:
"How many active incidents in triage?" → query_trino(SELECT count(*) FROM incidents WHERE status='open') → 5
"What ran in basecamp Airflow last night?" → query_trino(SELECT * FROM airflow_runs WHERE date = yesterday) → list of runs with status
"Show me weekly logs from March 2027 mentioning 'flink'" → query_notes_rag(...) (uses Y4 notes-rag) → results
This is query-helper: a read-only agent against your own data. Lives in basecamp/charts/agents/query-helper/. P29’s Studio command palette wires it as the default backend, and the same agent shape gets reused (with write tools added through P27) as the spine of services/aiops/ in P28. One agent, three Y5 launches.
6. COMPARE: LangGraph vs custom Go state machine
For a production agent (P28 aiops), is LangGraph (Python) the right framework or do you write the state machine in Go for production reliability?
500 words. The honest answer probably isn’t “one or the other” — it’s “Python now, port the hot loops to Go when the eval data tells you to.”
7. OPERATE
- 3+ runbooks (agent-stuck-in-loop, tool-call-failed-debug, eval-regression-investigation)
- 1+ ADR (e.g., “Why Python LangGraph for now, with custom Go in mind for P28”)
- Weekly log
8. CONTRIBUTE
LangGraph, LangChain, MCP servers, Anthropic Cookbook examples.
Validation criteria
[ ] All 11 operational depth checks
[ ] LangGraph + 1 working agent (query-helper) in basecamp
[ ] Eval suite + tracking
[ ] LangGraph vs custom-Go writeup
[ ] 3+ runbooks; 1+ ADR; 8+ weekly log entries
[ ] Pattern entries deepened:
  - agent-loop → DEEP
  - tool-use-as-capability → OUTLINE (DEEP after P27 MCP)
  - prompt-as-program → DEEP
[ ] Exit Test passed
Exit Test
Time: 3 hours.
- Build (90 min): given a new use case (“notify me when a service SLO is burning fast”), build an agent: 3+ tools, prompt versioned, eval set, OTel traced.
- Diagnose (60 min): scenario: agent loops indefinitely; trace via OTel; find the prompt or tool flaw.
- Articulate (30 min): 600 words: “Walk an agent’s execution from user prompt to final answer. What patterns fire at each step?”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Prompts in code (not versioned) | Drift; can’t audit; can’t roll back |
| Tools as freeform string-in/out | Reliability disaster |
| Agent that can do anything by default | Runaway or destructive actions; needs allowlist + budget + approval gate |
| Eval = “looks good to me” | Can’t improve what you can’t measure |
| Skipping the state-machine framing | Loops without explicit transitions = unmaintainable |
Patterns deepened this phase
- agent-loop → DEEP
- tool-use-as-capability → OUTLINE
- prompt-as-program → DEEP
→ Next: Phase 27: MCP + Tool Use