Kubernetes + GitOps: ship triage
Final phase of Year 1. The longest. K8s as the canonical distributed scheduler + GitOps as declarative deployment. By phase end, basecamp exists, triage is deployed via ArgoCD, and you can debug a pod stuck in Pending from first principles. ~10 weeks, ~120 hrs.
Phase 7 is where the prior six phases stop being parallel investigations and start being one composed system. Linux processes (Phase 1) packaged as containers (Phase 6), connected over a network (Phase 2), persisting state in databases (Phase 3), operated by tools written in Python (Phase 4) and Go (Phase 5) — all of it scheduled, healed, and reconciled by Kubernetes. Phase 7’s length is honest: this is where six phases of accumulated context get consolidated into a working platform.
It’s also the phase that introduces the control-loop pattern — the single most important pattern in the rest of the program. Every controller you’ll meet in Year 2 (Crossplane, Flux, Gateway), Year 3 (Iceberg compaction, Airflow scheduler), Year 4 (Kubeflow, Katib, KServe), and Year 5 (your own AIOps agents reconciling alerts to runbooks) is the same pattern: observe → diff → act → repeat. Phase 7 is where that pattern gets installed deeply enough that you’ll recognize it everywhere afterward.
The Year 1 overview is blunt about why this phase earns its 10-week length: short-changing it means paying interest for the next four years.
Prerequisites
- Phase 6 complete — containers built from scratch; Year 1 projects all containerized
- Hardware: 32GB RAM + 1TB external SSD installed. You can’t run K3s + workloads on 16GB. See homelab/hardware for the upgrade path.
- 3 VMs available for K3s nodes (or use Proxmox to provision 3 small VMs)
- You accept: K8s is the canonical distributed scheduler. By phase-end you can debug why a pod won’t schedule, why a Service has no endpoints, why traffic isn’t reaching your Ingress — without copy-pasting from Stack Overflow.
Why this phase exists
Years 2-5 deploy basically everything onto Kubernetes. basecamp is built on K8s. Year 4’s ML stack runs on K8s. Year 5’s agents run on K8s. If you treat K8s as magic, you’ll be a “kubectl apply” engineer. If you understand the scheduler + control plane + CNI + Ingress + etcd, you can operate any cloud-native system.
This phase also introduces GitOps — git as the source of truth for cluster state, ArgoCD as the reconciler. The pattern (declarative reconciliation) is what survives whatever replaces K8s in 2035.
This is also the phase where basecamp begins. You won’t ship basecamp publicly this phase — it grows through Years 2-5. But you do start it: enough that ArgoCD manages 3 apps, including triage.
1. PROBLEM
You have N containers across M machines. You want to:
- Schedule containers onto machines automatically (based on resources + constraints)
- Restart failed containers
- Route traffic to healthy instances
- Scale instances up/down based on load
- Update versions without downtime
- Survive machine failures
- All this declaratively (you say what, not how)
K8s solves all seven via the control loop pattern — every controller continuously reconciles actual state toward desired state.
→ Pattern: control-loops → Pattern: declarative-vs-imperative-infrastructure
2. PRINCIPLES
2.1 The control loop
Every K8s controller (Deployment, Job, StatefulSet, and the custom controllers you write for CRDs) implements the same loop: observe current state via the API server; compare it to desired state; act to shrink the difference; repeat.
Investigate:
- Find the Deployment controller in kubernetes/kubernetes. Read its sync loop (pkg/controller/deployment/).
- Build a tiny custom controller in Go using kubebuilder: watch ConfigMaps and log them.
The Go fluency from Phase 5 is what makes this exercise tractable. If you tried to read the Deployment controller source without that prep, you’d bounce off; with it, you can navigate the project structure and recognize idioms (errgroup, context propagation, structured logging) you’ve used in pulse.
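The observe, compare, act, repeat loop from 2.1 can be sketched in a few lines of plain Go. This is a toy model, not the Deployment controller: `ClusterState`, `Diff`, `Act`, and `Reconcile` are hypothetical names invented for illustration, and the "cluster" is just an integer replica count held in memory.

```go
package main

import "fmt"

// ClusterState is a toy stand-in for what a controller reads from the
// API server: how many replicas of one workload are currently running.
type ClusterState struct {
	Running int
}

// Action is the single step one reconcile pass decides to take.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// Diff compares desired replicas to observed replicas and picks one action.
func Diff(desired int, observed ClusterState) Action {
	switch {
	case observed.Running < desired:
		return ActionCreate
	case observed.Running > desired:
		return ActionDelete
	default:
		return ActionNone
	}
}

// Act applies one action to the toy state and returns the new state.
func Act(state ClusterState, a Action) ClusterState {
	switch a {
	case ActionCreate:
		state.Running++
	case ActionDelete:
		state.Running--
	}
	return state
}

// Reconcile repeats observe -> diff -> act until the state converges or
// maxPasses is used up. It returns the final state and the passes spent.
func Reconcile(desired int, state ClusterState, maxPasses int) (ClusterState, int) {
	passes := 0
	for passes < maxPasses {
		a := Diff(desired, state) // observe + diff
		if a == ActionNone {
			break // converged: nothing left to do
		}
		state = Act(state, a) // act
		passes++
	}
	return state, passes
}

func main() {
	state, passes := Reconcile(3, ClusterState{Running: 1}, 10)
	fmt.Printf("running=%d after %d passes\n", state.Running, passes)
}
```

Two things this sketch leaves out on purpose: real controllers are driven by watch events and a work queue rather than a tight loop, and each pass must be safe to repeat because the observed state can change between passes.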
2.2 API objects + lifecycle
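Pod, Deployment, Service, ConfigMap, Secret, Ingress, PersistentVolume, StatefulSet, Job, CronJob, CRD. Most have a spec, a status, and a controller that reconciles the two; ConfigMap and Secret are plain data objects with neither spec nor status.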
Investigate:
- Pod vs ReplicaSet vs Deployment — when each is the right level of abstraction
- StatefulSet vs Deployment — when you need stable hostnames + ordered deployment
- CRD as API extensibility — write one with kubebuilder
2.3 Scheduling
The scheduler matches pods to nodes based on resource requests, taints, tolerations, affinity, anti-affinity, topology spread.
Investigate:
- Why does my pod stay Pending? Diagnose 3 different causes.
- Use nodeAffinity to pin a pod; verify
- Use podAntiAffinity to spread replicas across nodes
The “why is my pod Pending” diagnostic is the most common K8s problem you’ll hit, and the operational checklist forces three different root causes — insufficient resources, taints/tolerations mismatch, PVC unbound. Each requires a different chain of kubectl describe reads. By the third one you have a reflex; before, you had a Stack Overflow tab.
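Two of the three Pending causes above (insufficient resources, taint/toleration mismatch) come from the scheduler's filtering step, which can be modelled in plain Go. The sketch below is an assumption-heavy simplification: `Node`, `Pod`, and `Filter` are hypothetical types, only CPU is considered, and taints are reduced to bare keys with an implied NoSchedule effect.

```go
package main

import "fmt"

// Node is a toy node: free CPU in millicores plus the keys of its
// NoSchedule taints. Real nodes carry far more state than this.
type Node struct {
	Name         string
	FreeMilliCPU int
	Taints       []string
}

// Pod is a toy pod: one CPU request and the taint keys it tolerates.
type Pod struct {
	Name            string
	RequestMilliCPU int
	Tolerations     []string
}

// tolerates reports whether the pod tolerates the given taint key.
func tolerates(p Pod, taint string) bool {
	for _, t := range p.Tolerations {
		if t == taint {
			return true
		}
	}
	return false
}

// Filter returns the names of nodes the pod could land on and, for every
// rejected node, a reason string loosely modelled on a scheduling event.
func Filter(p Pod, nodes []Node) ([]string, map[string]string) {
	var feasible []string
	reasons := make(map[string]string)
	for _, n := range nodes {
		if n.FreeMilliCPU < p.RequestMilliCPU {
			reasons[n.Name] = "insufficient cpu"
			continue
		}
		rejected := false
		for _, t := range n.Taints {
			if !tolerates(p, t) {
				reasons[n.Name] = "untolerated taint " + t
				rejected = true
				break
			}
		}
		if !rejected {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible, reasons
}

func main() {
	nodes := []Node{
		{Name: "worker-1", FreeMilliCPU: 200},
		{Name: "worker-2", FreeMilliCPU: 1000, Taints: []string{"dedicated"}},
	}
	pod := Pod{Name: "triage", RequestMilliCPU: 500}
	feasible, reasons := Filter(pod, nodes)
	if len(feasible) == 0 {
		fmt.Println("pod stays Pending:")
	}
	for name, why := range reasons {
		fmt.Printf("  %s: %s\n", name, why)
	}
}
```

When the feasible list is empty the pod has nowhere to go, which is the state kubectl describe reports as Pending along with per-node reasons. The third cause, an unbound PVC, is checked before node filtering and is not modelled here.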
2.4 Networking: CNI + Service + Ingress
→ Pattern: load-balancing → Pattern: network-policy → Pattern: service-discovery
Investigate:
- What does the CNI actually do? Cilium (eBPF) vs Calico (BGP/iptables)
- What’s a Service really? kube-proxy + iptables/IPVS implementation
- Set up an Ingress controller (Traefik or nginx); route by host + path
- Apply default-deny NetworkPolicy; allow only what’s needed
This section is where Phase 2’s networking patterns get scaled to a cluster. A Service is L4 load-balancing implemented by kube-proxy rewriting iptables rules. An Ingress is L7 reverse-proxy (nginx vs Caddy vs HAProxy from Phase 2) wrapped in a control loop. NetworkPolicy is defense-in-depth scaled from a single bastion’s nftables to per-pod east-west enforcement. Same patterns, K8s-shaped.
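Before kube-proxy can load-balance anything, the Service's label selector has to resolve to a set of ready pods. A minimal sketch of that resolution step, assuming hypothetical `Pod` and `Endpoints` names and ignoring ports, namespaces, and EndpointSlices entirely:

```go
package main

import "fmt"

// Pod is a toy pod: an IP, a label set, and a readiness flag.
type Pod struct {
	Name   string
	IP     string
	Labels map[string]string
	Ready  bool
}

// matches reports whether every selector key/value is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// Endpoints returns the IPs of ready pods whose labels satisfy the selector.
// A Service without a selector gets no automatically managed endpoints,
// so an empty selector yields nothing here.
func Endpoints(selector map[string]string, pods []Pod) []string {
	if len(selector) == 0 {
		return nil
	}
	var ips []string
	for _, p := range pods {
		if p.Ready && matches(selector, p.Labels) {
			ips = append(ips, p.IP)
		}
	}
	return ips
}

func main() {
	pods := []Pod{
		{Name: "triage-1", IP: "10.0.0.11", Labels: map[string]string{"app": "triage"}, Ready: true},
		{Name: "triage-2", IP: "10.0.0.12", Labels: map[string]string{"app": "triage"}, Ready: false},
	}
	fmt.Println(Endpoints(map[string]string{"app": "triage"}, pods))
	fmt.Println(Endpoints(map[string]string{"app": "triag"}, pods)) // typo in selector
}
```

The two ways this function returns an empty list, a selector that matches no labels and pods that match but are not ready, are the same two branches you check when a Service has no endpoints.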
2.5 Storage: PV / PVC / StorageClass / CSI
Investigate:
- PV/PVC binding lifecycle; reclaim policies
- Install Longhorn on K3s; create a StorageClass; provision a Postgres with persistent storage
The Postgres-on-Longhorn exercise is also the substrate for triage — the incidents schema from Phase 3 gets deployed onto a real persistent volume here. State, WAL durability, and pod restarts have to compose correctly for triage to actually survive a node reboot.
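The PV/PVC binding step can be sketched as a matching problem. What follows is a simplified model of static binding only, with hypothetical `PV`, `PVC`, and `Bind` names; it assumes a single requested access mode and picks the smallest volume that fits. With Longhorn you will mostly see dynamic provisioning instead, where the volume is created on demand for the claim.

```go
package main

import "fmt"

// PV is a toy PersistentVolume.
type PV struct {
	Name         string
	CapacityGi   int
	StorageClass string
	AccessModes  []string
	Bound        bool
}

// PVC is a toy claim with a single requested access mode (a simplification).
type PVC struct {
	RequestGi    int
	StorageClass string
	AccessMode   string
}

// supports reports whether the volume offers the requested access mode.
func supports(pv PV, mode string) bool {
	for _, m := range pv.AccessModes {
		if m == mode {
			return true
		}
	}
	return false
}

// Bind marks the smallest suitable unbound volume as bound and returns its
// name. It returns false when nothing fits, i.e. the claim stays Pending.
func Bind(claim PVC, vols []PV) (string, bool) {
	best := -1
	for i, pv := range vols {
		if pv.Bound || pv.StorageClass != claim.StorageClass ||
			pv.CapacityGi < claim.RequestGi || !supports(pv, claim.AccessMode) {
			continue
		}
		if best == -1 || pv.CapacityGi < vols[best].CapacityGi {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	vols[best].Bound = true
	return vols[best].Name, true
}

func main() {
	vols := []PV{
		{Name: "pv-big", CapacityGi: 100, StorageClass: "longhorn", AccessModes: []string{"ReadWriteOnce"}},
		{Name: "pv-small", CapacityGi: 10, StorageClass: "longhorn", AccessModes: []string{"ReadWriteOnce"}},
	}
	claim := PVC{RequestGi: 8, StorageClass: "longhorn", AccessMode: "ReadWriteOnce"}
	name, ok := Bind(claim, vols)
	fmt.Println(name, ok)
}
```

A claim that finds no match is exactly the "PVC unbound" case that keeps a Postgres pod Pending: the pod cannot be scheduled until its claim is bound.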
2.6 GitOps
The git repo is the source of truth. ArgoCD watches git, diffs against cluster, applies. No more “kubectl apply” from a developer laptop.
→ Pattern: gitops
Investigate:
- Install ArgoCD; point at a git repo; observe reconciliation on git push
- Why is GitOps better than “kubectl apply from CI”? Pull-based vs push-based; cluster credentials never leave the cluster
- Set up app-of-apps: one root ArgoCD Application managing many child Applications — the basecamp foundation
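The "watches git, diffs against cluster, applies" step boils down to a set comparison between what the repo declares and what the cluster holds. A rough sketch, assuming each resource is reduced to a name and a content hash; `Plan` and `DiffState` are hypothetical names and this is not how ArgoCD represents state internally:

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists what a sync would have to do to make live match desired.
type Plan struct {
	Create []string // in git, missing from the cluster
	Update []string // in both, but content differs
	Prune  []string // in the cluster, no longer in git
}

// InSync reports whether there is nothing left to do.
func (p Plan) InSync() bool {
	return len(p.Create) == 0 && len(p.Update) == 0 && len(p.Prune) == 0
}

// DiffState compares resource name -> content hash maps for git (desired)
// and the cluster (live). Results are sorted for stable output.
func DiffState(desired, live map[string]string) Plan {
	var p Plan
	for name, want := range desired {
		got, exists := live[name]
		switch {
		case !exists:
			p.Create = append(p.Create, name)
		case got != want:
			p.Update = append(p.Update, name)
		}
	}
	for name := range live {
		if _, wanted := desired[name]; !wanted {
			p.Prune = append(p.Prune, name)
		}
	}
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Prune)
	return p
}

func main() {
	git := map[string]string{"triage": "v2", "postgres": "v1", "monitoring": "v1"}
	cluster := map[string]string{"triage": "v1", "postgres": "v1", "legacy": "v1"}
	plan := DiffState(git, cluster)
	fmt.Printf("create=%v update=%v prune=%v inSync=%v\n",
		plan.Create, plan.Update, plan.Prune, plan.InSync())
}
```

A non-empty plan corresponds to an OutOfSync application. Because the comparison runs inside the cluster against a repo it pulls from, no external system needs cluster credentials, which is the pull-based argument from the list above.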
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / note |
|---|---|---|---|
| Distribution | K3s (lightweight) | Talos (immutable) | full kubeadm |
| CNI | Cilium (eBPF) | Calico (BGP) | Flannel (simple VXLAN) |
| Ingress | nginx-ingress | Traefik (K8s-native UX) | Gateway API |
| GitOps | ArgoCD (UI) | Flux (CLI) | Both same pattern; ArgoCD nicer UI |
| Package mgmt | Helm | Kustomize | Both |
| State (etcd) | Default 3-node | External etcd | sqlite (k3s minimal) |
Note that ArgoCD vs Flux is a trade-off in surface, not in pattern — both implement the same pull-based GitOps reconciliation loop. That’s the whole point. If you can articulate “they’re the same pattern with different UX”, you’ve internalized GitOps. If you can only argue “ArgoCD has a better UI”, you’re picking a tool, not understanding a pattern.
4. TOOLS (as of 2025-10)
Cluster distributions
- K3s — homelab default
- Talos Linux — immutable OS for K8s; preview here, deepen Year 2
- EKS / GKE / AKS — Year 2
CNI
- Cilium — eBPF; the modern default
- Calico — BGP; widely deployed
- Flannel — simple VXLAN; K3s default if not replaced
Ingress
- Traefik (K3s default) or ingress-nginx
- Gateway API (the modern standard; HTTPRoute, GRPCRoute)
Storage
- Longhorn (Rancher; lightweight)
- OpenEBS (alternative)
GitOps
- ArgoCD (most popular; UI-heavy)
- Flux (lighter; GitOps Toolkit foundation)
Package management
- Helm (templated YAML)
- Kustomize (overlay-based; built into kubectl)
5. MASTERY: K3s + ArgoCD + start basecamp
5.1 Reading list
| Required | Why |
|---|---|
| “Kubernetes Up & Running” (Burns, Beda, Hightower) | The standard intro |
| Kubernetes Concepts docs (full read) | The actual spec |
| ArgoCD architecture docs | GitOps in practice |
| Recommended | Why |
|---|---|
| “Programming Kubernetes” (Hausenblas & Schimanski) | Extend K8s, write operators |
| “Kubernetes the Hard Way” (Hightower) | Do once; install K8s without K3s shortcuts |
| Tim Hockin’s K8s talks on YouTube | The spirit of the project |
5.2 Operational depth checklist
[ ] Install K3s on 3 VMs: 1 control + 2 worker; replace flannel with Cilium
[ ] Deploy a stateless app (Deployment + Service + Ingress); reach via hostname
[ ] Deploy Postgres as StatefulSet with Longhorn-backed PV; verify data survives pod restart
[ ] Install ArgoCD; bootstrap basecamp repo as the first Application
[ ] Diagnose a pod stuck in Pending; identify root cause from kubectl describe
[ ] Apply default-deny NetworkPolicy in one namespace; selectively allow
[ ] Set up Prometheus + Grafana via kube-prometheus-stack; scrape K3s nodes + custom apps
[ ] Trigger a rolling update; observe surge/unavailable; rollback
[ ] Build a tiny CRD + controller in Go using kubebuilder; deploy; create CR; observe reconciliation
[ ] Diagnose "Ingress returns 503": pod down? service selector wrong? ingress controller misconfigured? DNS?

The “default-deny NetworkPolicy” item closes the loop opened in Phase 2: same pattern (default-deny + explicit allow), now applied at pod scope inside the cluster. The “build a CRD + controller” item closes the loop opened in Phase 5: same Go fluency, now applied to the canonical control-plane pattern.
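The default-deny + explicit allow pattern is small enough to model directly. The sketch below is a toy evaluation of ingress rules in plain Go, with hypothetical `IngressRule` and `Target` types; it assumes label-only peers and a single port per rule, and it ignores namespaces, egress, and CIDR blocks.

```go
package main

import "fmt"

// IngressRule allows traffic to Port from pods carrying all FromLabels.
type IngressRule struct {
	FromLabels map[string]string
	Port       int
}

// Target is a toy view of one pod's ingress policy state.
type Target struct {
	// Isolated is true once any policy selects the pod. A default-deny
	// policy selects every pod in the namespace and adds no rules.
	Isolated bool
	Allow    []IngressRule
}

// hasLabels reports whether got contains every key/value in want.
func hasLabels(want, got map[string]string) bool {
	for k, v := range want {
		if g, ok := got[k]; !ok || g != v {
			return false
		}
	}
	return true
}

// Allowed decides whether a connection from a pod with srcLabels to the
// given port is admitted: open if not isolated, otherwise only if some
// explicit rule matches.
func (t Target) Allowed(srcLabels map[string]string, port int) bool {
	if !t.Isolated {
		return true
	}
	for _, r := range t.Allow {
		if r.Port == port && hasLabels(r.FromLabels, srcLabels) {
			return true
		}
	}
	return false
}

func main() {
	postgres := Target{
		Isolated: true,
		Allow: []IngressRule{
			{FromLabels: map[string]string{"app": "triage"}, Port: 5432},
		},
	}
	fmt.Println(postgres.Allowed(map[string]string{"app": "triage"}, 5432))
	fmt.Println(postgres.Allowed(map[string]string{"app": "other"}, 5432))
}
```

The property worth noticing is that rules only ever add permissions. There is no deny rule to write: isolation comes from being selected by a policy, and everything not explicitly allowed is dropped.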
5.3 Project: triage (the on-call app)
This phase ships triage publicly. Uses the incidents schema from Phase 3.
triage = an on-call app running on K3s. Lists open incidents, who's paged, next escalation time. Uses Postgres (Phase 3 schema), exposed via Ingress, monitored by Prometheus.
Pattern: first real service-on-K3s; ties Phases 3, 4, 5, 6 together.
Stack:
- Backend: Go (Phase 5) with chi router, sqlx for Postgres, slog
- Frontend: server-rendered HTML with htmx (you're not learning frontend)
- Persistence: Postgres (the schema from Phase 3); Redis for active-paging state
- Deploy: Helm chart in basecamp repo, deployed via ArgoCD
- Observability: Prometheus metrics; structured logs (Loki ships Year 2)
- Tests: >70% Go coverage; one end-to-end via container
- CI: GitHub Actions builds + pushes image; ArgoCD syncs on tag
- README + architecture diagram in repo

triage is the first real service on the Abukix Studio platform. Year 5’s services/aiops/ will query its open-incidents API. Year 5’s portal will surface its dashboard. It’s not a demo; it’s the proof that everything in Year 1 composes into something a user could actually use.
See projects/triage/plan.
5.4 Project: basecamp (start)
You don’t ship basecamp publicly this phase — it grows through Years 2-5. But you do start it.
basecamp scope this phase:
- Repo: github.com/abukix/basecamp (PRIVATE; goes public Year 3)
- Structure: applications/, charts/, infra/
- First Applications: argocd-self (manage itself), postgres, monitoring-stack, triage
- Helm + Kustomize: pick one consistently; document why in an ADR
- README: "what is basecamp, how to bootstrap" (live document for years)

This is Tier 1 of basecamp’s eventual 9-tier architecture. By Year 5 it’ll be a complete production AI/data platform; today it’s just enough to host triage. The compounding starts here.
6. COMPARE: K8s vs Nomad (or ECS)
Pick one alternative orchestrator. Compare:
- Scheduler model (simpler than K8s)
- Workload types
- What you give up (smaller ecosystem; fewer “everyone knows this” patterns)
300 words: why did K8s win the orchestrator wars? When is an alternative still the right call?
The COMPARE step is non-negotiable for the same reason the Master Plan flags it — without it, you’re a K8s operator. With it, you understand “declarative orchestration” as a pattern that K8s implements and that simpler orchestrators implement differently. By Year 5, when something newer than K8s appears, you’ll evaluate it through the trade-off lens this exercise built — not the “is it like K8s?” lens.
7. OPERATE
This is the phase where homelab becomes a real platform.
- 5+ runbooks (k3s-install, k3s-upgrade, argocd-bootstrap, cilium-debug, networkpolicy-debug, pod-wont-schedule)
- 2+ postmortems (you WILL hit incidents)
- Weekly log every Sunday — by phase end you should have ~10-12 entries
The number of runbooks isn’t arbitrary — by phase end the cluster has been running long enough that you’ve genuinely operated it through multiple incidents. Runbooks written during incidents are the high-value ones; runbooks written speculatively before anything broke usually aren’t. Let real failures drive the writing.
8. CONTRIBUTE: Year 1 deadline
Year 1 deadline. If you haven’t shipped a merged PR yet, this is the phase. The K8s ecosystem is enormous and welcoming:
- Kubernetes docs (huge “good first issue” pool)
- Helm charts (any popular chart)
- ArgoCD docs / examples
- Cilium docs
- Any kubectl plugin
Submit, address review, get merged. Add to ops-handbook/contributions/shipped/.
Validation criteria (= Year 1 Final Exam prep)
[ ] All 10 operational depth checks
[ ] triage shipped publicly + deployed on K3s via ArgoCD
[ ] basecamp repo initialized + ArgoCD managing it (still private)
[ ] Alternative orchestrator comparison written up
[ ] 5+ runbooks; 2+ postmortems; 10+ weekly log entries
[ ] **At least 1 merged upstream PR (Year 1 deadline; must be done)**
[ ] Pattern entries deepened:
    - control-loops → OUTLINE (Reconcile loops as concrete example)
    - declarative-vs-imperative-infrastructure → OUTLINE
    - gitops → OUTLINE
    - load-balancing → OUTLINE (K8s Service + Ingress)
    - network-policy → OUTLINE
    - service-discovery → OUTLINE (K8s DNS + Service)
[ ] Exit Test passed
[ ] Ready for Year 1 Final Exam

Exit Test
Time: 4 hours (longest phase, longest exit test).
Part 1: Build (90 min)
Given a fresh K3s cluster, deploy a new app via ArgoCD (Helm chart in basecamp). App must include: Deployment, Service, Ingress, NetworkPolicy (default-deny + explicit allow), PrometheusRule for an alert, Sealed Secret. Verify everything reaches healthy in <10 min.
Part 2: Debug (90 min)
Two scenarios, run in parallel, drawn from the Phase 7 catalog:
- Pod stuck in CrashLoopBackOff (configmap, secret, image, command, args, capabilities)
- Service has no endpoints (selector mismatch, pod readiness)
- Ingress returns 503 (upstream pod not ready, controller misconfig)
- ArgoCD app stuck OutOfSync (manifest issue, RBAC, hook failure)
- Cluster-wide DNS broken (CoreDNS pod crashed)
Find root cause + fix + write runbook for each.
Part 3: Articulate (60 min)
~1200 words: “Explain how a Pod gets scheduled, started, and reaches healthy in K8s — from kubectl apply to Running state. Use specific examples + cite the controllers involved at each step.”
The articulation prompt mirrors Phase 1’s read(2) walk-through and Phase 3’s INSERT walk-through — same shape (a single user-facing operation traced through layers down to durable state) at the cluster scale. Three articulations across the year, each at a higher level of abstraction. They’re rehearsals for the Year 1 Final Exam’s end-to-end request walk.
Anti-patterns
| Anti-pattern | Why |
|---|---|
| kubectl apply -f without git | No history, no rollback, no audit; GitOps fixes this |
| Imperative scripts that wrap kubectl | Misses the declarative point |
| Helm everywhere because “it’s standard” | Templating YAML is brittle; Kustomize for simpler cases |
| --privileged containers in K8s | Defeats most of the K8s security model |
| Ignoring NetworkPolicy “for now” | “For now” becomes “forever”; default-deny early |
Patterns deepened this phase
- control-loops → OUTLINE
- declarative-vs-imperative-infrastructure → OUTLINE
- gitops → OUTLINE
- load-balancing → OUTLINE
- network-policy → OUTLINE
- service-discovery → OUTLINE
Year 1 Final Exam (next milestone)
Once Phase 7 validation passes, the Year 1 Final Exam is the next milestone — a separate ~6-hour scenario combining all 7 phases.
→ Next: Year 1 Final Exam