triage Plan

triage is an on-call app running on K3s. It lists open incidents, who is paged, and the next escalation time. It uses Postgres (Phase 3 schema), Redis (active-paging state), and Prometheus (metrics). It is the first real service on K3s, and its Helm chart lives in basecamp/charts/triage/.

triage is the Group B service that closes Year 1. It’s the first real workload deployed onto basecamp’s K3s cluster at the end of Phase 7 (Kubernetes + GitOps), and it’s deliberately scoped to exercise every Year 1 phase: the Phase 3 incidents schema, Phase 4 Python migration scripts, Phase 5 Go backend, Phase 6 container build, and Phase 7 K3s deployment + GitOps reconciliation.

Beyond being a phase deliverable, triage is real software with a real role. It’s better than the post-it-note incident tracking most homelabs run with, and it earns a permanent place in the platform: by Year 5 Phase 28, services/aiops/ queries triage’s open-incidents API as a tool via basecamp-mcp, making triage the data source for the auto-incident triage composition recipe (Recipe 2).

The architecture is intentionally boring: Go + chi + sqlx + slog backend, htmx server-rendered HTML frontend (no SPA — Year 1 isn’t where you learn frontend), Postgres for incidents, Redis for active-paging state, Prometheus for metrics. The interesting part isn’t the stack; it’s that this is the first service that proves the integration — that all those Year 1 phases compose into a thing that runs in production.


What it is

A small but real web service:

Backend: Go (chi router, sqlx Postgres, slog)
Frontend: server-rendered HTML + htmx (no SPA — not learning frontend yet)
Persistence: Postgres (incidents schema from Phase 3) + Redis (active paging)
Deploy: Helm chart in basecamp; ArgoCD-managed
Observability: Prometheus metrics; structured logs to Loki (from Y3)
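
The stack above names slog for structured logs. As a rough illustration of what one structured log line could look like, here is a minimal sketch using the standard library's log/slog JSON handler. The field names incident_id and escalation_level and the helper functions are hypothetical, not triage's actual log schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logEscalation writes one structured log line for an escalation event.
// The attribute names are hypothetical, chosen only for illustration.
func logEscalation(logger *slog.Logger, incidentID string, level int) {
	logger.Info("incident escalated",
		slog.String("incident_id", incidentID),
		slog.Int("escalation_level", level),
	)
}

// escalationLogFields logs one event into a buffer and decodes the JSON
// line again, which makes the emitted shape easy to inspect.
func escalationLogFields(incidentID string, level int) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logEscalation(logger, incidentID, level)

	var fields map[string]any
	if err := json.Unmarshal(buf.Bytes(), &fields); err != nil {
		return nil, err
	}
	return fields, nil
}

func main() {
	fields, err := escalationLogFields("inc-42", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["msg"], fields["incident_id"], fields["escalation_level"])
}
```

Because each line is a flat JSON object, a log pipeline such as Loki can filter on fields like incident_id without parsing free text.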

Endpoints:

  • GET /: open-incidents dashboard
  • GET /incidents/{id}: incident detail + timeline
  • POST /incidents: create new incident (idempotent via key)
  • POST /incidents/{id}/escalate: escalate
  • GET /healthz, GET /metrics, GET /readyz

Why it exists

  1. Phase deliverable. Year 1 Phase 7 (Kubernetes + GitOps); the first real service on K3s.
  2. Ties together Year 1. Uses the Phase 3 schema, Phase 4 Python migration scripts, Phase 5 Go backend, Phase 6 container build, and Phase 7 K3s deployment + GitOps.
  3. Real value. Better than the post-it-note incident tracking most homelabs have.
  4. Year 5 integration. services/aiops/ queries triage’s open-incidents API as a tool via basecamp-mcp.

Pattern it teaches

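The integration pattern: the first real service that exercises every Year 1 phase together:

  • Phase 3: incidents schema
  • Phase 4: Python migration scripts
  • Phase 5: Go backend
  • Phase 6: container build
  • Phase 7: K3s deployment + GitOps reconciliation
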

Scope

v1 (Year 1 Phase 7)

[ ] Go backend with chi + sqlx + slog
[ ] htmx frontend, server-rendered HTML
[ ] Postgres incidents schema (from Phase 3)
[ ] Redis for active-paging state
[ ] Helm chart in basecamp/charts/triage/
[ ] ArgoCD Application in basecamp/applications/tier-1-foundation/
[ ] Prometheus metrics + Grafana dashboard
[ ] Structured logs (slog); shipping to Loki is deferred to Y3 Phase 14, after the initial release
[ ] >70% test coverage
[ ] CI: GitHub Actions builds + pushes image; ArgoCD syncs on tag
[ ] README + architecture diagram

Y3+ (incremental enhancements)

  • Loki log shipping (Y3 P14)
  • OTel traces (Y3 P14)
  • SLO definition + burn-rate alerts (applying the discipline already established in Y2 P12)
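
For the burn-rate alerts item, the underlying arithmetic is small enough to sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and the 14.4 threshold below are illustrative values, not triage's actual SLO; 14.4 is the rate at which 2% of a 30-day budget is spent in one hour.

```go
package main

import "fmt"

// burnRate reports how fast the error budget is being spent.
// errorRatio is failed/total requests over the alert window; sloTarget is
// the target as a fraction below 1 (for example 0.999).
// A result of 1 means the budget would last exactly the SLO period.
func burnRate(errorRatio, sloTarget float64) float64 {
	errorBudget := 1 - sloTarget
	return errorRatio / errorBudget
}

// shouldAlert fires when the budget burns faster than threshold.
func shouldAlert(errorRatio, sloTarget, threshold float64) bool {
	return burnRate(errorRatio, sloTarget) > threshold
}

func main() {
	const slo = 0.999      // hypothetical target, not triage's real SLO
	const threshold = 14.4 // 2% of a 30-day budget spent in one hour
	fmt.Printf("burn rate: %.1f\n", burnRate(0.02, slo))
	fmt.Println("alert at 2% errors:", shouldAlert(0.02, slo, threshold))
	fmt.Println("alert at 0.5% errors:", shouldAlert(0.005, slo, threshold))
}
```

In practice this comparison would normally live in a Prometheus alerting rule rather than in service code; the Go version only makes the arithmetic explicit.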

Y5 (integration with AIOps)

  • Expose /v1/incidents/open API for services/aiops/ to consume via basecamp-mcp
  • Surface in Studio Portal
  • AI-executable runbook: “auto-escalate if no acknowledgment in N minutes”
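
The auto-escalate rule in the runbook item reduces to a time comparison, which also yields the next escalation time the dashboard shows. A minimal sketch, assuming a hypothetical page type for the active-paging state; the field names and the 10-minute timeout are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// page is a hypothetical view of active-paging state for one incident.
type page struct {
	PagedAt time.Time
	AckedAt *time.Time // nil until someone acknowledges
}

// shouldEscalate reports whether the page has gone unacknowledged for at
// least ackTimeout as of now.
func shouldEscalate(p page, now time.Time, ackTimeout time.Duration) bool {
	if p.AckedAt != nil {
		return false
	}
	return now.Sub(p.PagedAt) >= ackTimeout
}

// nextEscalation returns when the page escalates if nobody acknowledges.
func nextEscalation(p page, ackTimeout time.Duration) time.Time {
	return p.PagedAt.Add(ackTimeout)
}

func main() {
	pagedAt := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	p := page{PagedAt: pagedAt}
	timeout := 10 * time.Minute // the runbook's "N minutes"; value is illustrative

	fmt.Println(shouldEscalate(p, pagedAt.Add(5*time.Minute), timeout))
	fmt.Println(shouldEscalate(p, pagedAt.Add(15*time.Minute), timeout))
	fmt.Println(nextEscalation(p, timeout).Format(time.RFC3339))
}
```

Passing now as a parameter instead of calling time.Now inside the function keeps the rule deterministic and easy to test.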

When built

Year 1 Phase 7, Months 10-12. Ships to github.com/abukix/triage.


Dependencies

triage requires basecamp Tier 1 to be live (Postgres, Redis, ArgoCD, Prometheus, Grafana — all from Phase 7). It also leans on the Phase 3 incidents schema (database design), Phase 4 Python (migration scripts), and Phase 5 Go (the backend itself). Y5 integration depends on services/aiops/ and basecamp-mcp shipping in Year 5 Phases 27-28.


Deliverables checklist

[ ] github.com/abukix/triage public (quiet ship)
[ ] Helm chart in basecamp/charts/triage/
[ ] ArgoCD Application in basecamp/applications/triage.yaml
[ ] Deployed on homelab K3s
[ ] Real incident logged via triage during Phase 7 exit test
[ ] README + architecture diagram

Public vs private

Public from the Y1 P7 ship, as a quiet ship: push to GitHub and tag a release. No blog post, no LinkedIn announcement. Year 1 ships are about discipline (publish + PR review + cut release), not launch energy. Loud launches are reserved for terralabs (Y2), basecamp (Y3), and Abukix Studio + mlship v2 (Y5).


Cross-references