Kubernetes + GitOps: ship triage

Final phase of Year 1. The longest. K8s as the canonical distributed scheduler + GitOps as declarative deployment. By phase end, basecamp exists, triage is deployed via ArgoCD, and you can debug a pod stuck in Pending from first principles. ~10 weeks, ~120 hrs.

Phase 7 is where the prior six phases stop being parallel investigations and start being one composed system. Linux processes (Phase 1) packaged as containers (Phase 6), connected over a network (Phase 2), persisting state in databases (Phase 3), operated by tools written in Python (Phase 4) and Go (Phase 5) — all of it scheduled, healed, and reconciled by Kubernetes. Phase 7’s length is honest: this is where six phases of accumulated context get consolidated into a working platform.

It’s also the phase that introduces the control-loop pattern — the single most important pattern in the rest of the program. Every controller you’ll meet in Year 2 (Crossplane, Flux, Gateway), Year 3 (Iceberg compaction, Airflow scheduler), Year 4 (Kubeflow, Katib, KServe), and Year 5 (your own AIOps agents reconciling alerts to runbooks) is the same pattern: observe → diff → act → repeat. Phase 7 is where that pattern gets installed deeply enough that you’ll recognize it everywhere afterward.

The Year 1 overview is blunt about why this phase earns its 10-week length: short-changing it means paying interest for the next four years.

Prerequisites

Phase 6 complete — containers built from scratch; Year 1 projects all containerized

Hardware: 32GB RAM + 1TB external SSD installed. 16GB won’t hold three K3s VMs plus this phase’s workloads (Cilium, Longhorn, Postgres, ArgoCD, kube-prometheus-stack). See homelab/hardware for the upgrade path.

3 VMs available for K3s nodes (or use Proxmox to provision 3 small VMs)

You accept: K8s is the canonical distributed scheduler. By phase-end you can debug why a pod won’t schedule, why a Service has no endpoints, why traffic isn’t reaching your Ingress — without copy-pasting from Stack Overflow.

Why this phase exists

Years 2-5 deploy basically everything onto Kubernetes. basecamp is built on K8s. Year 4’s ML stack runs on K8s. Year 5’s agents run on K8s. If you treat K8s as magic, you’ll be a “kubectl apply” engineer. If you understand the scheduler + control plane + CNI + Ingress + etcd, you can operate any cloud-native system.

This phase also introduces GitOps — git as the source of truth for cluster state, ArgoCD as the reconciler. The pattern (declarative reconciliation) is what survives whatever replaces K8s in 2035.

This is also the phase where basecamp begins. You won’t ship basecamp publicly this phase; it grows through Years 2-5. But you do start it: enough that ArgoCD manages itself plus 3 apps (postgres, monitoring-stack, and triage).

1. PROBLEM

You have N containers across M machines. You want to:

Schedule containers onto machines automatically (based on resources + constraints)
Restart failed containers
Route traffic to healthy instances
Scale instances up/down based on load
Update versions without downtime
Survive machine failures
All this declaratively (you say what, not how)

K8s solves all seven via the control loop pattern — every controller continuously reconciles actual state toward desired state.

→ Pattern: control-loops → Pattern: declarative-vs-imperative-infrastructure

2. PRINCIPLES

2.1 The control loop

Every K8s controller (Deployment, Job, StatefulSet, and the controllers you write for your own CRDs) implements the same loop: observe current state via the API server; compare it to desired state; act to reduce the diff; repeat.
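
The loop can be sketched in a few lines. This is a toy model, not the Deployment controller's code: `Cluster`, `reconcile_once`, and `run_until_converged` are hypothetical names, and a dictionary of replica counts stands in for the API server.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for the API server: desired and actual replica counts."""
    desired: dict[str, int] = field(default_factory=dict)
    actual: dict[str, int] = field(default_factory=dict)


def reconcile_once(cluster: Cluster) -> list[str]:
    """One pass of observe, diff, act. Returns the actions taken."""
    actions = []
    for name, want in cluster.desired.items():
        have = cluster.actual.get(name, 0)        # observe
        if have < want:                            # diff
            cluster.actual[name] = have + 1        # act: one step toward desired
            actions.append(f"start 1 replica of {name}")
        elif have > want:
            cluster.actual[name] = have - 1
            actions.append(f"stop 1 replica of {name}")
    return actions


def run_until_converged(cluster: Cluster, max_passes: int = 100) -> int:
    """Repeat until a pass takes no action; returns how many passes acted."""
    passes = 0
    while passes < max_passes and reconcile_once(cluster):
        passes += 1
    return passes
```

Note that the loop never stores "what I did last time". Each pass starts from observed state, which is why a controller can crash, restart, and still converge.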

Investigate:

Find the Deployment controller in kubernetes/kubernetes. Read its sync loop (pkg/controller/deployment/).
Build a tiny custom controller in Go using kubebuilder — watch ConfigMaps and log them.

The Go fluency from Phase 5 is what makes this exercise tractable. If you tried to read the Deployment controller source without that prep, you’d bounce off; with it, you can navigate the project structure and recognize idioms (errgroup, context propagation, structured logging) you’ve used in pulse.

2.2 API objects + lifecycle

Pod, Deployment, Service, ConfigMap, Secret, Ingress, PersistentVolume, StatefulSet, Job, CronJob, CRD. Each has spec + status + a controller.

Investigate:

Pod vs ReplicaSet vs Deployment — when each is the right level of abstraction
StatefulSet vs Deployment — when you need stable hostnames + ordered deployment
CRD as API extensibility — write one with kubebuilder

2.3 Scheduling

The scheduler matches pods to nodes based on resource requests, taints, tolerations, affinity, anti-affinity, topology spread.

Investigate:

Why does my pod stay Pending? Diagnose 3 different causes.
Use nodeAffinity to pin a pod; verify
Use podAntiAffinity to spread replicas across nodes

The “why is my pod Pending” diagnostic is the most common K8s problem you’ll hit, and the operational checklist forces three different root causes — insufficient resources, taints/tolerations mismatch, PVC unbound. Each requires a different chain of kubectl describe reads. By the third one you have a reflex; before, you had a Stack Overflow tab.
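
The three root causes can be modelled as a filter over nodes. The sketch below is a deliberately simplified assumption, not the kube-scheduler's logic: taints are plain strings rather than key/value/effect triples, only CPU is considered, and `Node`, `Pod`, and `why_pending` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # unreserved CPU in millicores
    taints: set[str] = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    tolerations: set[str] = field(default_factory=set)
    pvcs: list[str] = field(default_factory=list)


def why_pending(pod: Pod, nodes: list[Node], bound_pvcs: set[str]) -> list[str]:
    """Return the reasons no node is feasible; an empty list means schedulable."""
    unbound = [c for c in pod.pvcs if c not in bound_pvcs]
    if unbound:
        return [f"unbound PVC: {c}" for c in unbound]

    reasons = []
    for node in nodes:
        untolerated = node.taints - pod.tolerations
        if untolerated:
            reasons.append(f"{node.name}: untolerated taint {sorted(untolerated)[0]}")
        elif node.free_cpu_m < pod.cpu_request_m:
            reasons.append(f"{node.name}: insufficient cpu")
        else:
            return []                    # at least one node fits
    return reasons
```

The useful habit this mirrors: a Pending pod has one reason per rejected node, so read the whole scheduling event rather than stopping at the first cause.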

2.4 Networking: CNI + Service + Ingress

→ Pattern: load-balancing → Pattern: network-policy → Pattern: service-discovery

Investigate:

What does the CNI actually do? Cilium (eBPF) vs Calico (BGP/iptables)
What’s a Service really? kube-proxy + iptables/IPVS implementation
Set up an Ingress controller (Traefik or nginx); route by host + path
Apply default-deny NetworkPolicy; allow only what’s needed

This section is where Phase 2’s networking patterns get scaled to a cluster. A Service is L4 load-balancing, implemented by default by kube-proxy programming iptables (or IPVS) rules on each node. An Ingress is an L7 reverse proxy (the nginx, Caddy, and HAProxy comparison from Phase 2) wrapped in a control loop. NetworkPolicy is defense-in-depth scaled from a single bastion’s nftables to per-pod east-west enforcement. Same patterns, K8s-shaped.
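
Before any packet is rewritten, a Service has to decide which pods back it. A minimal sketch of that selection step follows, assuming a simplified `Pod` record and a hypothetical `endpoints_for` helper; the real endpoint controllers track more state than this.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict[str, str]
    ready: bool
    ip: str


def endpoints_for(selector: dict[str, str], pods: list[Pod]) -> list[str]:
    """IPs of ready pods whose labels contain every key/value in the selector."""
    return [
        p.ip
        for p in pods
        if p.ready and all(p.labels.get(k) == v for k, v in selector.items())
    ]
```

Both classic "Service has no endpoints" causes show up here: a selector that matches no labels, and matching pods that are not ready.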

2.5 Storage: PV / PVC / StorageClass / CSI

Investigate:

PV/PVC binding lifecycle; reclaim policies
Install Longhorn on K3s; create a StorageClass; provision a Postgres with persistent storage

The Postgres-on-Longhorn exercise is also the substrate for triage — the incidents schema from Phase 3 gets deployed onto a real persistent volume here. State, WAL durability, and pod restarts have to compose correctly for triage to actually survive a node reboot.
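
For the binding lifecycle, a rough mental model of static PV/PVC matching is sketched below. It is an assumption-laden simplification: sizes are whole GiB, `PV`, `PVC`, and `find_match` are hypothetical names, and with a Longhorn StorageClass you would normally rely on dynamic provisioning, where the volume is created for the claim instead of being picked from a pool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PV:
    name: str
    capacity_gi: int
    storage_class: str
    access_modes: frozenset
    available: bool = True


@dataclass
class PVC:
    name: str
    request_gi: int
    storage_class: str
    access_modes: frozenset


def find_match(pvc: PVC, volumes: list[PV]) -> Optional[PV]:
    """Smallest available PV that satisfies the claim, or None (claim stays Pending)."""
    candidates = [
        v for v in volumes
        if v.available
        and v.storage_class == pvc.storage_class
        and v.capacity_gi >= pvc.request_gi
        and pvc.access_modes <= v.access_modes
    ]
    return min(candidates, key=lambda v: v.capacity_gi, default=None)
```

A `None` result here corresponds to the "PVC unbound" case from the Pending diagnostics in 2.3.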

2.6 GitOps

The git repo is the source of truth. ArgoCD watches git, diffs against cluster, applies. No more “kubectl apply” from a developer laptop.
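
The watch, diff, apply cycle reduces to a comparison between two maps. The sketch below is not ArgoCD's implementation: `plan_sync` is a hypothetical helper, manifests are plain dictionaries keyed by "Kind/name", and `live` is assumed to contain only objects the application already tracks.

```python
def plan_sync(desired: dict[str, dict], live: dict[str, dict], prune: bool = False):
    """Compare manifests from git with live objects and plan the sync."""
    create = sorted(k for k in desired if k not in live)
    update = sorted(k for k in desired if k in live and desired[k] != live[k])
    extra = sorted(k for k in live if k not in desired)
    delete = extra if prune else []          # deletion only when pruning is enabled
    status = "Synced" if not (create or update or extra) else "OutOfSync"
    return {"status": status, "create": create, "update": update, "delete": delete}
```

Git is only ever read in this model. The cluster is what changes, which is the property that makes a `git revert` a rollback.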

→ Pattern: gitops

Investigate:

Install ArgoCD; point at a git repo; observe reconciliation on git push
Why is GitOps better than “kubectl apply from CI”? Pull-based vs push-based; cluster credentials never leave the cluster
Set up app-of-apps: one root ArgoCD Application managing many child Applications — the basecamp foundation

3. TRADE-OFFS

Decision	Option A	Option B	Option C / notes
Distribution	K3s (lightweight)	Talos (immutable)	full kubeadm
CNI	Cilium (eBPF)	Calico (BGP)	Flannel (simple VXLAN)
Ingress	ingress-nginx	Traefik (K8s-native UX)	Gateway API
GitOps	ArgoCD (UI)	Flux (CLI)	Same pattern; ArgoCD has the richer UI
Package mgmt	Helm	Kustomize	Both
State (etcd)	Default 3-node	External etcd	sqlite (k3s minimal)

Note that ArgoCD vs Flux is a trade-off in surface, not in pattern — both implement the same pull-based GitOps reconciliation loop. That’s the whole point. If you can articulate “they’re the same pattern with different UX”, you’ve internalized GitOps. If you can only argue “ArgoCD has a better UI”, you’re picking a tool, not understanding a pattern.

4. TOOLS (as of 2025-10)

Cluster distributions

K3s — homelab default
Talos Linux — immutable OS for K8s; preview here, deepen Year 2
EKS / GKE / AKS — Year 2

CNI

Cilium — eBPF; the modern default
Calico — BGP; widely deployed
Flannel — simple VXLAN; K3s default if not replaced

Ingress

Traefik (K3s default) or ingress-nginx
Gateway API (the modern standard; HTTPRoute, GRPCRoute)

Storage

Longhorn (Rancher; lightweight)
OpenEBS (alternative)

GitOps

ArgoCD (most popular; UI-heavy)
Flux (lighter; GitOps Toolkit foundation)

Package management

Helm (templated YAML)
Kustomize (overlay-based; built into kubectl)

5. MASTERY: K3s + ArgoCD + start basecamp

5.1 Reading list

Required	Why
“Kubernetes Up & Running” (Burns, Beda, Hightower)	The standard intro
Kubernetes Concepts docs (full read)	The actual spec
ArgoCD architecture docs	GitOps in practice

Recommended	Why
“Programming Kubernetes” (Hausenblas & Schimanski)	Extend K8s, write operators
“Kubernetes the Hard Way” (Hightower)	Do once; install K8s without K3s shortcuts
Tim Hockin’s K8s talks on YouTube	The spirit of the project

5.2 Operational depth checklist

[ ] Install K3s on 3 VMs: 1 control + 2 worker; replace flannel with Cilium
[ ] Deploy a stateless app (Deployment + Service + Ingress); reach via hostname
[ ] Deploy Postgres as StatefulSet with Longhorn-backed PV; verify data survives pod restart
[ ] Install ArgoCD; bootstrap basecamp repo as the first Application
[ ] Diagnose a pod stuck in Pending; identify root cause from kubectl describe
[ ] Apply default-deny NetworkPolicy in one namespace; selectively allow
[ ] Set up Prometheus + Grafana via kube-prometheus-stack; scrape K3s nodes + custom apps
[ ] Trigger a rolling update; observe surge/unavailable; rollback
[ ] Build a tiny CRD + controller in Go using kubebuilder; deploy; create CR; observe reconciliation
[ ] Diagnose "Ingress returns 503" — pod down? service selector wrong? ingress controller misconfigured? DNS?
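
For the rolling-update item, it helps to compute the surge and unavailable bounds before watching them. A small sketch, assuming the Deployment defaults of 25% for both settings and the documented rounding (maxSurge rounds up, maxUnavailable rounds down); `resolve` and `rollout_bounds` are hypothetical helper names.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string like '25%' into a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas   # percent * replicas, still x100
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10))   # (8, 13)
```

With 3 replicas and the defaults, the bounds are (3, 4): nothing may go unavailable, so the rollout proceeds one surged pod at a time.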

The “default-deny NetworkPolicy” item closes the loop opened in Phase 2 — same pattern (default-deny + explicit allow), now applied at pod scope inside the cluster. The “build a CRD + controller” item closes the loop opened in Phase 5 — same Go fluency, now applied to the canonical control-plane pattern.
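
The default-deny plus explicit-allow pattern can be reasoned about as an allow-list lookup. The sketch below is a toy evaluator under stated assumptions: pods are identified by a single app label, `isolated_apps` is the set of apps selected by at least one ingress policy, and `AllowRule` and `ingress_allowed` are hypothetical names. Enforcement in the cluster is done by the CNI, not by code shaped like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    """Permit ingress to `to_app` pods from `from_app` pods on `port`."""
    to_app: str
    from_app: str
    port: int


def ingress_allowed(rules, isolated_apps, src_app, dst_app, port) -> bool:
    """Policies are additive allow-lists; unselected pods accept everything."""
    if dst_app not in isolated_apps:
        return True          # no policy selects the destination: not isolated
    return any(
        r.to_app == dst_app and r.from_app == src_app and r.port == port
        for r in rules
    )
```

There is no "deny" rule anywhere in the model. Isolation comes from being selected by a policy, and every rule after that can only widen access.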

5.3 Project: `triage` (the on-call app)

This phase ships triage publicly. Uses the incidents schema from Phase 3.

triage = an on-call app running on K3s. Lists open incidents, who's paged,
         next escalation time. Uses Postgres (Phase 3 schema), exposed via
         Ingress, monitored by Prometheus.

Pattern: first real service-on-K3s; ties Phases 3, 4, 5, 6 together.

Stack:
- Backend: Go (Phase 5) with chi router, sqlx for Postgres, slog
- Frontend: server-rendered HTML with htmx (you're not learning frontend)
- Persistence: Postgres (the schema from Phase 3); Redis for active-paging state
- Deploy: Helm chart in basecamp repo, deployed via ArgoCD
- Observability: Prometheus metrics; structured logs (Loki ships Year 2)
- Tests: >70% Go coverage; one end-to-end via container
- CI: GitHub Actions builds + pushes image; ArgoCD syncs on tag
- README + architecture diagram in repo

triage is the first real service on the Abukix Studio platform. Year 5’s services/aiops/ will query its open-incidents API. Year 5’s portal will surface its dashboard. It’s not a demo — it’s the proof that everything in Year 1 composes into something a user could actually use.

See projects/triage/plan.

5.4 Project: `basecamp` (start)

You don’t ship basecamp publicly this phase — it grows through Years 2-5. But you do start it.

basecamp scope this phase:
- Repo: github.com/abukix/basecamp (PRIVATE — goes public Year 3)
- Structure: applications/, charts/, infra/
- First Applications: argocd-self (manage itself), postgres, monitoring-stack, triage
- Helm + Kustomize: pick one consistently; document why in an ADR
- README: "what is basecamp, how to bootstrap" (live document for years)

This is Tier 1 of basecamp’s eventual 9-tier architecture. By Year 5 it’ll be a complete production AI/data platform; today it’s just enough to host triage. The compounding starts here.

See projects/basecamp/plan.

6. COMPARE: K8s vs Nomad (or ECS)

Pick one alternative orchestrator. Compare:

Scheduler model (simpler than K8s)
Workload types
What you give up (smaller ecosystem; fewer “everyone knows this” patterns)

300 words: why did K8s win the orchestrator wars? When is an alternative still the right call?

The COMPARE step is non-negotiable for the same reason the Master Plan flags it — without it, you’re a K8s operator. With it, you understand “declarative orchestration” as a pattern that K8s implements and that simpler orchestrators implement differently. By Year 5, when something newer than K8s appears, you’ll evaluate it through the trade-off lens this exercise built — not the “is it like K8s?” lens.

7. OPERATE

This is the phase where homelab becomes a real platform.

5+ runbooks (k3s-install, k3s-upgrade, argocd-bootstrap, cilium-debug, networkpolicy-debug, pod-wont-schedule)
2+ postmortems (you WILL hit incidents)
Weekly log every Sunday — by phase end you should have ~10-12 entries

The number of runbooks isn’t arbitrary — by phase end the cluster has been running long enough that you’ve genuinely operated it through multiple incidents. Runbooks written during incidents are the high-value ones; runbooks written speculatively before anything broke usually aren’t. Let real failures drive the writing.

8. CONTRIBUTE: Year 1 deadline

Year 1 deadline. If you haven’t shipped a merged PR yet, this is the phase. K8s ecosystem is enormous and welcoming:

Kubernetes docs (huge “good first issue” pool)
Helm charts (any popular chart)
ArgoCD docs / examples
Cilium docs
Any kubectl plugin

Submit, address review, get merged. Add to ops-handbook/contributions/shipped/.

Validation criteria (= Year 1 Final Exam prep)

[ ] All 10 operational depth checks
[ ] triage shipped publicly + deployed on K3s via ArgoCD
[ ] basecamp repo initialized + ArgoCD managing it (still private)
[ ] Alternative orchestrator comparison written up
[ ] 5+ runbooks; 2+ postmortems; 10+ weekly log entries
[ ] **At least 1 merged upstream PR — Year 1 deadline; must be done**
[ ] Pattern entries deepened:
    - control-loops → OUTLINE (Reconcile loops as concrete example)
    - declarative-vs-imperative-infrastructure → OUTLINE
    - gitops → OUTLINE
    - load-balancing → OUTLINE (K8s Service + Ingress)
    - network-policy → OUTLINE
    - service-discovery → OUTLINE (K8s DNS + Service)
[ ] Exit Test passed
[ ] Ready for Year 1 Final Exam

Exit Test

Time: 4 hours (longest phase, longest exit test).

Part 1: Build (90 min)

Given a fresh K3s cluster, deploy a new app via ArgoCD (Helm chart in basecamp). The app must include: Deployment, Service, Ingress, NetworkPolicy (default-deny + explicit allow), a PrometheusRule defining one alert, and a SealedSecret. Verify everything reaches healthy in <10 min.

Part 2: Debug (90 min)

Two scenarios, run in parallel, drawn from the Phase 7 catalog:

Pod stuck in CrashLoopBackOff (configmap, secret, image, command, args, capabilities)
Service has no endpoints (selector mismatch, pod readiness)
Ingress returns 503 (upstream pod not ready, controller misconfig)
ArgoCD app stuck OutOfSync (manifest issue, RBAC, hook failure)
Cluster-wide DNS broken (CoreDNS pod crashed)

Find root cause + fix + write runbook for each.

Part 3: Articulate (60 min)

~1200 words: “Explain how a Pod gets scheduled, started, and reaches healthy in K8s — from kubectl apply to Running state. Use specific examples + cite the controllers involved at each step.”

The articulation prompt mirrors Phase 1’s read(2) walk-through and Phase 3’s INSERT walk-through — same shape (a single user-facing operation traced through layers down to durable state) at the cluster scale. Three articulations across the year, each at a higher level of abstraction. They’re rehearsals for the Year 1 Final Exam’s end-to-end request walk.

Anti-patterns

Anti-pattern	Why
`kubectl apply -f` without git	No history, no rollback, no audit; GitOps fixes this
Imperative scripts that wrap kubectl	Misses the declarative point
Helm everywhere because “it’s standard”	Templating YAML is brittle; Kustomize for simpler cases
`--privileged` containers in K8s	Defeats most of the K8s security model
Ignoring NetworkPolicy “for now”	“For now” becomes “forever”; default-deny early

Patterns deepened this phase

control-loops → OUTLINE
declarative-vs-imperative-infrastructure → OUTLINE
gitops → OUTLINE
load-balancing → OUTLINE
network-policy → OUTLINE
service-discovery → OUTLINE

Year 1 Final Exam (next milestone)

Once Phase 7 validation passes, the Year 1 Final Exam is the next milestone — a separate ~6-hour scenario combining all 7 phases.

→ Next: Year 1 Final Exam
