Master Plan
A 5-year curriculum from SRE Support to Staff/Principal AI Platform Engineer. Built on patterns, not tools. The platform you build is the artifact. The depth you accumulate is the goal. The journey is the content.
This page is the master plan — the high-level map of the entire program. If you read only one document in ROOT, read this one. Everything else is a deeper view of something here.
What ROOT is
ROOT is the curriculum behind Abukix Studio. It’s a 5-year personal program built around three commitments:
- Pattern-first learning. Every phase asks you to find the underlying pattern before the tool. Kubernetes is one implementation of “declarative reconciliation” — not the only one, not the last one.
- Platform-driven practice. Theory is verified by operating real software in production-like conditions. You build basecamp, a 9-tier platform on a homelab mini-PC.
- Public documentation. Weekly logs, runbooks, ADRs, postmortems, pattern entries. Not because someone is watching — because compounding requires written external memory.
The brand pillar is Building an AI Platform in Public. ROOT is how that platform gets built. The five years are how it’s earned.
The bet
Tools change every 5 years. Patterns don’t change in 30.
If you spend five years learning Kubernetes, you’re fluent in 2026 tools that may not be canonical in 2031. If you spend five years learning the control-loop pattern that Kubernetes implements, you carry that pattern to its other implementations and to whatever replaces it: Terraform’s plan/apply cycle, ArgoCD’s sync loop, future cloud orchestrators, anything declarative.
Staff and Principal engineers reason in patterns and treat tools as interchangeable implementations. They pick up new tools in a week because they already understand the category of problem the tool solves and the trade-offs every implementation makes. ROOT trains that habit from Year 1, Phase 1.
The name is the contract: every phase asks you to find the root principle, not memorize a recipe.
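The control-loop pattern behind that bet fits in a few lines. What follows is a minimal sketch, assuming a toy in-memory "cluster" rather than any real API; the names `desired`, `actual`, and `reconcile` are illustrative and do not come from Kubernetes or any other tool.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Run one pass of a control loop.

    Compares desired and actual replica counts and returns the actions that
    move actual toward desired. Both arguments are hypothetical
    name -> replica-count maps; `actual` is mutated in place to stand in for
    calls to a real system.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name, 0)
        if have != want:
            actions.append(f"scale {name}: {have} -> {want}")
            actual[name] = want
    # Anything running that is no longer declared gets removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(f"delete {name}")
        del actual[name]
    return actions


desired = {"triage": 2, "pulse": 1}
actual = {"triage": 1, "legacy": 3}

first = reconcile(desired, actual)   # converges actual to desired
second = reconcile(desired, actual)  # idempotent: nothing left to do
print(first)
print(second)
```

The second pass returning an empty list is the part worth noticing: a reconciler that is safe to run repeatedly is what lets the same idea show up in very different tools.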
The 5 years at a glance
The curriculum spans 60 months across five thematically distinct years. Each year is self-contained: life can interrupt at any year boundary without wasting prior years, because every year exits at a real, credible role.
| Year | Theme | Months | Exit-ramp role | OSS shipped |
|---|---|---|---|---|
| 1 | Foundations of Computing | 1-12 | Junior SRE / IT Engineer | rxp, konfig, pulse, triage |
| 2 | Distributed Systems & Cloud | 13-24 | DevOps / Cloud Engineer | terralabs, platform-ctl (start) |
| 3 | Platform Engineering & Data | 25-36 | Senior DevOps / Data Platform Engineer | data tier of basecamp |
| 4 | ML & AI Infrastructure | 37-48 | ML Platform / AI Infrastructure Engineer | services/llm-gateway/ |
| 5 | AI Platform + Capstone | 49-60 | Staff/Principal AI Platform Engineer | mlship, services/aiops/, Studio launch |
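The table is effectively a lookup from program month to year and exit-ramp role. A minimal sketch of that lookup, assuming 12 months per year as the Months column shows; `Year`, `year_for_month`, and `exit_role_earned` are illustrative names, not part of any ROOT tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Year:
    number: int
    theme: str
    exit_role: str


# Rows copied from the table above; months are implied by position (12 per year).
YEARS = [
    Year(1, "Foundations of Computing", "Junior SRE / IT Engineer"),
    Year(2, "Distributed Systems & Cloud", "DevOps / Cloud Engineer"),
    Year(3, "Platform Engineering & Data", "Senior DevOps / Data Platform Engineer"),
    Year(4, "ML & AI Infrastructure", "ML Platform / AI Infrastructure Engineer"),
    Year(5, "AI Platform + Capstone", "Staff/Principal AI Platform Engineer"),
]


def year_for_month(month: int) -> Year:
    """Map a program month (1-60) to its row in the table."""
    if not 1 <= month <= 60:
        raise ValueError(f"month must be in 1..60, got {month}")
    return YEARS[(month - 1) // 12]


def exit_role_earned(month: int) -> Optional[str]:
    """Exit-ramp role already earned if the program pauses during `month`."""
    completed = (month - 1) // 12  # full years finished before this month
    return YEARS[completed - 1].exit_role if completed else None


print(year_for_month(13).theme)
print(exit_role_earned(25))
```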
The full year-by-year structure:
YEAR 1: Foundations of Computing (Months 1-12)
Exit ramp: Junior SRE / IT Engineer
OS (Linux + FreeBSD compare), Networking, Databases (Postgres + Redis), Programming fluency (Python + Go — fluency, not mastery), Containers (namespaces + cgroups + UnionFS from scratch), Kubernetes + GitOps (the longest phase — earns its length).
- Platform milestone: K3s + ArgoCD + Postgres + Redis + Prometheus + Grafana. First service deployed: triage (on-call app).
- Year 1 Final Exam: 6-hour scenario across all 7 phases.
- OSS shipped: rxp, konfig, pulse, triage.
- Patterns deepened: ~15 (foundations + early distributed systems).
YEAR 2: Distributed Systems & Cloud (Months 13-24)
Exit ramp: DevOps / Cloud Engineer
Distributed-systems theory (DDIA, CAP/PACELC, consensus, replication), IaC patterns (Terraform + Crossplane), AWS deep, GCP compare, Platform engineering (Backstage, service mesh, secrets, RBAC, Pod Security, SLI/SLO discipline as a platform contract). Multi-cloud basecamp synthesis (Y2 capstone).
- Platform milestone: multi-cloud (k3s + EKS + GKE), Backstage IDP, mesh mTLS, signed images, SLOs for the platform itself.
- Year 2 Final Exam: 8-hour scenario.
- Projects: terralabs ships public; platform-ctl starts.
- Patterns deepened: ~15 (distributed systems + infra + platform + early observability).
YEAR 3: Platform Engineering & Data (Months 25-36)
Exit ramp: Senior DevOps / Data Platform Engineer
Observability at depth (eBPF, OTel, three-pillars, cardinality discipline), Lakehouse (MinIO + Iceberg + Nessie), Stream (Redpanda + Flink), Batch (Spark + Airflow + dbt), Serving (Trino + Superset), Data governance (DataHub/OpenMetadata + lineage + access control).
- Platform milestone: full data engineering layer operational on basecamp. JupyterHub-as-a-service (notebooks-on-the-platform).
- Year 3 Final Exam: full-day scenario.
- Patterns deepened: ~10 (data, networking depth, observability).
YEAR 4: ML & AI Infrastructure (Months 37-48)
Exit ramp: ML Platform Engineer / AI Infrastructure Engineer
MLOps lifecycle (MLflow + KServe + Ray), Feature stores (Feast), Kubeflow Pipelines + Katib, LLM infra (vLLM + RAG + vector DBs), GPU scheduling (cloud spot), llm-gateway built incrementally (P21 → P24 → P25).
- Platform milestone: ML platform + LLM serving operational. llm-gateway shipping real (homelab-scale) traffic.
- Year 4 Final Exam: full-day scenario.
- Year 4 flagship: services/llm-gateway/ inside basecamp.
- Patterns deepened: ~5 (ML/AI patterns).
YEAR 5: AI Platform + Capstone (Months 49-60)
Exit ramp: Staff/Principal AI Platform Engineer
Agent development (LangGraph), MCP + tool use, AIOps (services/aiops/ — agents that operate the platform), Portal + governance (Abukix Studio public surface launches), Capstone: mlship v2 (one-command model deploy) + pattern paper.
- Platform milestone: complete data/AI platform with agents, governance, portal. Abukix Studio live at studio.abukix.dev.
- Year 5 Final Exam: full-day, panel/AI-administered. Graduation.
- Capstone primary: mlship v2, OSS launch (sklearn + HF text-gen, excellent; other frameworks land as v2.1, v2.2).
- Capstone secondary: pattern paper with Staff/Principal-grade writing, 2+ external readers, conference submission.

Three transitions matter most. They’re where the role identity changes:
- Y1 → Y2 (Month 12-13): single-machine intuition → distributed-systems thinking. The phase that makes this real is Phase 8 (Distributed Systems Theory).
- Y2 → Y3 (Month 24-25): platform-as-tool → platform-as-product. The platform stops being something you run and starts being something you offer.
- Y4 → Y5 (Month 48-49): operator → architect. You stop building services and start building the agents that operate services. AIOps is the inflection.
The pattern-first scaffold
Every phase doc inside ROOT — all 30 of them — follows this 8-step structure:
- PROBLEM: What category of human need exists?
- PRINCIPLES: The timeless patterns any solution must implement
- TRADE-OFFS: The decisions every implementation makes (and why)
- TOOLS: Current implementations (time-stamped; they age)
- MASTERY: Pick one tool, go to operational depth
- COMPARE: Re-implement the same problem in a second tool (this is the proof that the pattern transferred)
- OPERATE: Run it in your homelab, take real incidents
- CONTRIBUTE: Ship one fix upstream

Worked example (Phase 7: Kubernetes + GitOps):
| Step | What you do |
|---|---|
| Problem | “I need to run N stateless services across M machines, recover from failures, and deploy via Git.” |
| Principles | Declarative state · control loops · reconciliation · idempotency · operator pattern |
| Trade-offs | Push vs pull deploy · CRDs vs config maps · single-cluster vs multi-cluster · etcd consensus cost |
| Tools | K3s, Argo CD, Helm, Kustomize (Q1 2026) — pinned dates, not “best ever” |
| Mastery | K3s + Argo CD on the homelab. Bootstrap. Day 2. Recover from etcd corruption. |
| Compare | Re-implement the same workload using Nomad + Levant. Did the pattern transfer? |
| Operate | Run it for a real month. Real incidents. Real runbooks. |
| Contribute | One PR upstream — docs fix, bug fix, anything that lands. |
If a phase doc reads like a copy-paste tutorial, the doc is wrong: flag it. Phases give you the framing; you do the investigation. The Tools step points you in the right direction via “Starter hints:” lines; you write the actual commands.
What you build (the platform)
By Month 60 you’ve built basecamp — a 9-tier production platform on a homelab mini-PC. The architecture mirrors what Spotify, Netflix, and Uber operate, but at small scale. The pattern is identical; the scale is different.
| Tier | Layer | Components | Built |
|---|---|---|---|
| 9 | Agents | LangGraph, MCP servers, services/aiops/ | Y5 |
| 8 | Data Serving | Trino, Superset | Y3 |
| 7 | LLM | vLLM, vector DB, services/llm-gateway/ | Y4 |
| 6 | ML Platform | Kubeflow, Katib, Training Operators | Y4 |
| 5 | ML | MLflow, KServe, Ray, Feast, JupyterHub | Y4 / Y3 |
| 4 | Processing | Spark, Airflow, Redpanda, Flink, dbt | Y3 |
| 3 | Lakehouse | Iceberg, Nessie, MinIO | Y3 |
| 2 | Platform | Backstage, Sealed Secrets, ESO, mesh, OTel, Loki | Y2 |
| 1 | Foundation | ArgoCD, Postgres, Redis, Prometheus, Grafana | Y1 |

Each tier is a stable abstraction the next tier above it depends on. This isn’t an arbitrary stack; it reflects how production data platforms actually evolve. Tier 1 (control plane + observability) must be solid before Tier 2 (developer-facing platform) is meaningful. Tier 3 (storage substrate) is the foundation Tier 4 (processing engines) builds on. And so on.
On top of basecamp sits Abukix Studio — the public brand layer:
- Unified Web UI — Backstage-extended portal showing every service, deploy, and metric.
- Command palette AI assistant — natural-language interface to the platform: “deploy mlship to staging”, “show me yesterday’s errors in llm-gateway”.
- platform-ctl: the unified CLI front-door. One command surface to operate everything.
- 4-5 documented composition recipes: workflows that chain underlying services into real outcomes.
basecamp is what you operate. Abukix Studio is what you show.
The projects (three groups by role)
ROOT ships ~10 OSS projects across the 5 years. Think of them as actors in a play, not items on a list. They group into three roles:
Group A: the platform itself (the stage)
| Project | Role | First touched |
|---|---|---|
| terralabs | Provisions the land: Terraform + Crossplane modules for VPC, K8s, RDS, MinIO, Proxmox K3s | Y2 |
| basecamp | The GitOps repo of YAML that declares the platform (9 tiers) | Y1 |
| platform-ctl | The unified CLI front-door: one command surface to operate everything | Y2 → Y5 |
| Portal (inside basecamp) | Backstage-extended Web UI + command palette agent — the visitor’s view | Y5 |
These four together form Abukix Studio (the public-facing brand layer). They’re “the stage” because they’re what every other project runs on.
Group B: services running on the platform (the actors)
| Project | What it does | First touched |
|---|---|---|
| triage | A small on-call dashboard. The first real service-on-K3s, deployed at the end of Year 1. | Y1 |
| services/llm-gateway/ | OpenAI-compat LLM API with multi-model routing, RAG, streaming, cost tracking. Lives inside basecamp. | Y4 |
| services/aiops/ | Agent that operates the platform: alert triage, runbook execution, pattern detection. | Y5 |
| mlship | Capstone CLI users invoke from anywhere. mlship deploy ./model.pkl → URL. | Y5 |
These are what the platform actually offers to a user. They make the platform feel like a product, not a stack.
Group C: operational discipline + fluency artifacts (the craft)
| Project | What it does | First touched |
|---|---|---|
| ops-handbook | Runbooks, incidents, postmortems, ADRs, weekly logs. The journal of running the platform. | Y1 |
| rxp | Regex CLI. Fluency artifact + log-pattern utility used inside services/aiops/ (Y5). | Y1 |
| konfig | Config validator. Fluency artifact + used in basecamp CI to validate Helm values. | Y1 |
| pulse | Probe scanner. Fluency artifact + emits Prometheus metrics scraped by basecamp. | Y1 |
ops-handbook is the journal — proof you can write production-grade docs over years. The three small CLIs are proof you can ship code with PR review, CI, releases — habits, not products. Each gets a real integration role inside basecamp’s tooling so they’re not orphaned demos.
Personal services tier (the dogfood)
basecamp also runs your stuff, not just demos. This is what makes the platform real instead of a portfolio piece:
basecamp/charts/personal/
├── personal-blog/   Y2: your blog deployed via basecamp instead of Cloudflare Pages
├── personal-api/    Y3: life-data API (fitness, learning hours, GitHub activity)
├── notes-rag/       Y4: RAG over your own ROOT writing + weekly logs (dogfoods llm-gateway)
└── home-dash/       Y5: internal dashboard pulling from all of the above (dogfoods Studio portal)

These prove the platform serves you, not just GitHub stars. They’re also the cinematic moments: watch me self-host my own RAG over five years of weekly logs. Hard to fake, easy to film, impossible to ignore.
The composition recipes (Studio’s public proof)
A platform isn’t its components — it’s the workflows that chain components into real outcomes. ROOT documents 4-5 killer composition recipes that show Abukix Studio working end-to-end. Each recipe = one demo video, one blog post, one Show HN moment:
| Recipe | Services chained | Year landing |
|---|---|---|
| Personal RAG over your own weekly logs | Notebook → Spark (chunk) → embeddings → pgvector → llm-gateway → portal command palette | Y4 |
| Auto-incident triage loop | Prometheus alert → aiops → Trino (history) → propose runbook → platform-ctl execute | Y5 |
| Train → register → deploy in one flow | Notebook → Ray (distributed train) → MLflow → KServe → mlship deploy | Y4 |
| Homelab life API | GitHub → Airflow → Iceberg → Trino → personal API → portal | Y3 |
| AI-assisted on-call | triage + aiops + llm-gateway + command palette | Y5 |
These live in Studio composition recipes and have runnable scripts in basecamp/examples/.
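The middle of the first recipe (chunk → embeddings → vector search) can be tried in miniature before any of the real services exist. This is a sketch under loud assumptions: word counts stand in for a real embedding model, a Python list stands in for pgvector, and Spark and llm-gateway are not involved at all. Every function name and the sample log text are made up for illustration.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 8) -> list:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented weekly-log text, seven words per entry.
logs = (
    "week 12 etcd corruption recovered from snapshot "
    "week 13 postgres replication lag hit disk"
)
chunks = chunk(logs, size=7)
best = top_k("postgres replication", chunks)
print(best)
```

In the full recipe the retrieved chunks would be passed to the model as context; the retrieval step itself keeps this shape whatever embedding model or vector store sits behind it.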
How the platform is shared with the world
Three tiers, ordered by cost-to-you:
- OSS, self-host (free for everyone, free for you): github.com/abukix/basecamp. Clone, run on your homelab/cloud, follow the README, get an equivalent platform. This is the moat. This is how the program scales to thousands of users without operational cost. 99% of users land here.
- Hosted demo (free for visitors, ~$30-50/month for you): studio.abukix.dev. Read-only, rate-limited, pre-loaded notebook + tiny RAG demo + small mlship deploy + portal command palette. CPU-only models (no GPU; Phi-3 quantized via llama.cpp). 10-min session timeout. This is the cinematic surface: visitors see the platform live without authentication, then optionally clone the OSS to run it themselves.
- Managed offering (deferred, paid): only if Year 5 launch shows demand. Optional. Most successful open-core companies don’t launch managed for years.
The order matters. OSS first establishes credibility (real engineers can run real software). Hosted demo amplifies reach (non-engineers can see it work). Managed is monetization (deferred until both prior tiers prove demand).
Time budget: 12 hours/week, sustained over 60 months
The non-negotiable is consistency over intensity:
- Sunday (1-2 hrs): weekly log entry (what you learned, broke, were stuck on)
- Weeknight × 2-3 (1 hr each): current phase work (read, investigate, take notes)
- Weekend (3-4 hrs): operate the platform (incidents, runbooks, deeper investigation)
- Continuous: every commit lands runbooks/ADRs/patterns in ops-handbook

Total: ~12 hrs/week. Some weeks 8, some weeks 16, average 12. The non-negotiable is the Sunday weekly log; that’s the discipline that compounds. Everything else can flex around real life.
Hardware requirements
YEAR 1 (Months 1-12)
- RAM: 16GB DDR5 → upgrade to 32GB before Month 10 (Kubernetes)
- Storage: 256GB NVMe + 1TB external SSD added at Month 10

YEAR 2 (Months 13-24)
- RAM: 32GB DDR5
- Storage: same as end of Year 1

YEAR 3 (Months 25-36)
- RAM: 64GB DDR5 (upgrade Month 25)
- Storage: same

YEAR 4-5 (Months 37-60)
- RAM: 64GB DDR5
- Storage: same
- Optional: second M70q for multi-node testing (~$200-300 used)

CUMULATIVE COST
- Month 10: +$120-140 (32GB + 1TB SSD before Kubernetes)
- Month 25: +$150-200 (64GB before Year 3, sell 32GB kit)
- Month 49: +$200-300 (optional second node)

Full breakdown including specific SKUs, sourcing tips, and Proxmox setup: homelab/hardware.
Cloud requirements (phases that NEED cloud)
- Year 2, AWS Deep Dive: AWS Free Tier account; budget $50-100 total
- Year 2, GCP + Cloud-Agnostic: GCP $300 free credits; budget $0
- Year 4, GPU Infrastructure: cloud spot GPU (g5.xlarge or T4); budget $20-50 total

TOTAL CLOUD SPEND: ~$70-150 over 60 months

Cloud is exploration, not production. Default deployment target is the homelab. Cloud is the lab where you verify your patterns generalize across environments. Destroy at end of session.
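The totals are simple range sums, shown here as a quick check. The figures are copied from the cloud list above and from the hardware cost list; the hardware numbers are taken at face value, so any offset from reselling the 32GB kit is not modelled. The `total` helper and the variable names are illustrative.

```python
def total(ranges):
    """Sum (low, high) USD ranges into one (low, high) pair."""
    lows, highs = zip(*ranges)
    return sum(lows), sum(highs)


# Cloud budgets copied from the cloud requirements list.
cloud = {
    "Year 2: AWS Deep Dive": (50, 100),
    "Year 2: GCP + Cloud-Agnostic": (0, 0),
    "Year 4: GPU Infrastructure": (20, 50),
}

# Hardware upgrade costs copied from the hardware section, at face value.
hardware_required = {"Month 10": (120, 140), "Month 25": (150, 200)}
hardware_optional = {"Month 49 (second node)": (200, 300)}

cloud_total = total(cloud.values())
hw_required = total(hardware_required.values())
hw_with_optional = total(
    list(hardware_required.values()) + list(hardware_optional.values())
)

print("cloud:", cloud_total)
print("hardware, required upgrades:", hw_required)
print("hardware, with optional node:", hw_with_optional)
```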
The pattern depth ladder
Patterns aren’t a separate concept from phases — they’re the durable knowledge artifact phases produce. Every entry in the Pattern Library progresses through three depths:
- STUB — frontmatter + 1-paragraph summary. Default state. Created when first referenced.
- OUTLINE — promoted when a phase first touches the pattern. Adds Problem · Principles · Trade-offs · Tools sections.
- DEEP — promoted after 3+ months of operating something that depends on the pattern. Adds Mastery · Compare · Operate · Contribute sections — the proof you understand it.
Tools change. Patterns don’t. The Pattern Library is what survives the 60 months and outlives the specific implementations you’ll have learned.
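The ladder reads naturally as a small state machine. A minimal sketch, assuming a pattern entry is represented by the set of section names it contains plus a count of months spent operating something that depends on it; `Depth`, `REQUIRED`, and `can_promote` are hypothetical names, and treating "sections present" as the promotion check is an interpretation rather than a documented rule.

```python
from enum import IntEnum


class Depth(IntEnum):
    STUB = 0
    OUTLINE = 1
    DEEP = 2


# Sections each depth adds, as listed in the ladder above.
REQUIRED = {
    Depth.OUTLINE: {"Problem", "Principles", "Trade-offs", "Tools"},
    Depth.DEEP: {"Mastery", "Compare", "Operate", "Contribute"},
}


def can_promote(current: Depth, sections: set, months_operated: int = 0) -> bool:
    """Check whether an entry may move one rung up the ladder."""
    if current is Depth.DEEP:
        return False  # already at the top
    target = Depth(current + 1)
    # The target depth needs its own sections plus those of every depth below it.
    needed = set()
    for depth in Depth:
        if Depth.STUB < depth <= target:
            needed |= REQUIRED[depth]
    if not needed <= sections:
        return False
    # DEEP also requires 3+ months of operating a dependent system.
    return target is not Depth.DEEP or months_operated >= 3


outline_sections = {"Problem", "Principles", "Trade-offs", "Tools"}
deep_sections = outline_sections | {"Mastery", "Compare", "Operate", "Contribute"}

print(can_promote(Depth.STUB, outline_sections))
print(can_promote(Depth.OUTLINE, deep_sections, months_operated=2))
print(can_promote(Depth.OUTLINE, deep_sections, months_operated=3))
```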
Reading order
You’re reading this in roughly the right place. Here’s the full path:
- This Master Plan — you are here. Skim the rest of this page so you know what exists.
- The Story — the why. Read it before Year 1 starts.
- AI Learning Protocol — the rules for working with Claude/ChatGPT throughout the program. Critical to read before Phase 1.
- Year 1 overview → Phase 1: OS Foundations — the actual program begins.
- As you hit a pattern referenced in a phase: read its entry in Patterns; promote it from STUB to OUTLINE.
- As each project becomes active: read its plan in Projects.
- As you write your first runbook, postmortem, ADR, weekly log: copy from the Writing Templates.
For people who already work in infrastructure: jump straight to Year 1 final exam and see what the bar is. If you can already pass it, skip Year 1 entirely and start at Year 2.
Status
- Created: 2025-10-21
- Started: not yet. Phase 1 begins when Year 1 docs are reviewed + homelab cleared (set the date in this line then)
- Target graduation: Month 60 from start
- Target role: Staff/Principal AI Platform Engineer (or chosen Year-5 elective endpoint)

This page updates as the program progresses. The phase you’re currently in shows on every doc page in the kicker line (5-YEAR PROGRAM · YEAR N · PHASE M).
The single sentence to remember
ROOT is a 5-year process to make me the engineer past-me wanted to learn from — by building the platform that future-me would want to operate, while documenting the journey publicly so other engineers can follow.
The 30 phases, the 9 tiers, the ~10 OSS projects, the patterns, the docs, the cinematic content, the brand — all of it serves that one sentence.