Master Plan

A 5-year curriculum from SRE Support to Staff/Principal AI Platform Engineer. Built on patterns, not tools. The platform you build is the artifact. The depth you accumulate is the goal. The journey is the content.

This page is the master plan — the high-level map of the entire program. If you read only one document in ROOT, read this one. Everything else is a deeper view of something here.

What ROOT is

ROOT is the curriculum behind Abukix Studio. It’s a 5-year personal program built around three commitments:

Pattern-first learning. Every phase asks you to find the underlying pattern before the tool. Kubernetes is one implementation of “declarative reconciliation” — not the only one, not the last one.
Platform-driven practice. Theory is verified by operating real software in production-like conditions. You build basecamp, a 9-tier platform on a homelab mini-PC.
Public documentation. Weekly logs, runbooks, ADRs, postmortems, pattern entries. Not because someone is watching — because compounding requires written external memory.

The brand pillar is Building an AI Platform in Public. ROOT is how that platform gets built. The five years are how it’s earned.

The bet

Tools change every 5 years. Patterns don’t change in 30.

If you spend five years learning Kubernetes, you’re fluent in 2026 tools that may not be canonical in 2031. If you spend five years learning the control-loop pattern that Kubernetes implements, you carry that pattern to every other implementation and to whatever replaces it: Terraform’s plan/apply cycle, Argo CD’s sync loop, future cloud orchestrators, anything declarative.

Staff and Principal engineers reason in patterns and treat tools as interchangeable implementations. They pick up new tools in a week because they already understand the category of problem the tool solves and the trade-offs every implementation makes. ROOT trains that habit from Year 1, Phase 1.

The name is the contract: every phase asks you to find the root principle, not memorize a recipe.
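To make the control-loop pattern concrete, here is a minimal sketch of one reconciliation pass: observe the actual state, diff it against the declared state, and act on the difference. The `Cluster` class and the service names are hypothetical in-memory stand-ins, not a Kubernetes or Argo CD API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for a real system (not a Kubernetes API)."""
    running: dict[str, int] = field(default_factory=dict)  # service -> replica count


def reconcile(desired: dict[str, int], cluster: Cluster) -> list[str]:
    """One pass of a control loop: observe, diff, act. Returns the actions taken."""
    actions = []
    # Converge every declared service to its declared replica count.
    for name, want in desired.items():
        have = cluster.running.get(name, 0)
        if have != want:
            cluster.running[name] = want
            actions.append(f"scale {name}: {have} -> {want}")
    # Remove anything that is running but no longer declared.
    for name in list(cluster.running):
        if name not in desired:
            del cluster.running[name]
            actions.append(f"remove {name}")
    return actions


desired = {"triage": 2, "pulse": 1}
cluster = Cluster(running={"triage": 1, "legacy": 3})

first = reconcile(desired, cluster)   # drift exists, so actions are taken
second = reconcile(desired, cluster)  # idempotent: nothing left to do
print(first)
print(second)
```

The second pass returning no actions is the property that matters: a reconciler can run forever on a timer because applying it to an already-converged system changes nothing.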

The 5 years at a glance

The curriculum spans 60 months across five thematically distinct years. Each year is self-contained: life can interrupt at any year boundary without wasting prior years, because every year exits at a real, credible role.

Year	Theme	Months	Exit-ramp role	OSS shipped
1	Foundations of Computing	1-12	Junior SRE / IT Engineer	rxp, konfig, pulse, triage
2	Distributed Systems & Cloud	13-24	DevOps / Cloud Engineer	terralabs, platform-ctl (start)
3	Platform Engineering & Data	25-36	Senior DevOps / Data Platform Engineer	data tier of basecamp
4	ML & AI Infrastructure	37-48	ML Platform / AI Infrastructure Engineer	services/llm-gateway/
5	AI Platform + Capstone	49-60	Staff/Principal AI Platform Engineer	mlship, services/aiops/, Studio launch

The full year-by-year structure:

YEAR 1: Foundations of Computing                Months 1-12
─────────────────────────────────────────────────────────────────
Exit ramp: Junior SRE / IT Engineer

  OS (Linux + FreeBSD compare), Networking, Databases (Postgres + Redis),
  Programming fluency (Python + Go — fluency, not mastery),
  Containers (namespaces + cgroups + UnionFS from scratch),
  Kubernetes + GitOps (the longest phase — earns its length).

  Platform milestone: K3s + ArgoCD + Postgres + Redis + Prometheus + Grafana.
                      First service deployed: triage (on-call app).
  Year 1 Final Exam:  6-hour scenario across all 7 phases.
  OSS shipped:        rxp, konfig, pulse, triage.
  Patterns deepened:  ~15 (foundations + early distributed systems).


YEAR 2: Distributed Systems & Cloud             Months 13-24
─────────────────────────────────────────────────────────────────
Exit ramp: DevOps / Cloud Engineer

  Distributed-systems theory (DDIA, CAP/PACELC, consensus, replication),
  IaC patterns (Terraform + Crossplane), AWS deep, GCP compare,
  Platform engineering (Backstage, service mesh, secrets, RBAC, Pod Security,
                       SLI/SLO discipline as a platform contract).
  Multi-cloud basecamp synthesis (Y2 capstone).

  Platform milestone: multi-cloud (k3s + EKS + GKE), Backstage IDP, mesh mTLS,
                      signed images, SLOs for the platform itself.
  Year 2 Final Exam:  8-hour scenario.
  Projects:           terralabs ships public; platform-ctl starts.
  Patterns deepened:  ~15 (distributed systems + infra + platform + early observability).


YEAR 3: Platform Engineering & Data             Months 25-36
─────────────────────────────────────────────────────────────────
Exit ramp: Senior DevOps / Data Platform Engineer

  Observability at depth (eBPF, OTel, three-pillars, cardinality discipline),
  Lakehouse (MinIO + Iceberg + Nessie), Stream (Redpanda + Flink),
  Batch (Spark + Airflow + dbt), Serving (Trino + Superset),
  Data governance (DataHub/OpenMetadata + lineage + access control).

  Platform milestone: full data engineering layer operational on basecamp.
                      JupyterHub-as-a-service (notebooks-on-the-platform).
  Year 3 Final Exam:  full-day scenario.
  Patterns deepened:  ~10 (data, networking depth, observability).


YEAR 4: ML & AI Infrastructure                  Months 37-48
─────────────────────────────────────────────────────────────────
Exit ramp: ML Platform Engineer / AI Infrastructure Engineer

  MLOps lifecycle (MLflow + KServe + Ray), Feature stores (Feast),
  Kubeflow Pipelines + Katib, LLM infra (vLLM + RAG + vector DBs),
  GPU scheduling (cloud spot), llm-gateway built incrementally (P21 → P24 → P25).

  Platform milestone: ML platform + LLM serving operational.
                      llm-gateway shipping real (homelab-scale) traffic.
  Year 4 Final Exam:  full-day scenario.
  Year 4 flagship:    services/llm-gateway/ inside basecamp.
  Patterns deepened:  ~5 (ML/AI patterns).


YEAR 5: AI Platform + Capstone                  Months 49-60
─────────────────────────────────────────────────────────────────
Exit ramp: Staff/Principal AI Platform Engineer

  Agent development (LangGraph), MCP + tool use,
  AIOps (services/aiops/ — agents that operate the platform),
  Portal + governance (Abukix Studio public surface launches),
  Capstone: mlship v2 (one-command model deploy) + pattern paper.

  Platform milestone: complete data/AI platform with agents, governance, portal.
                      Abukix Studio live at studio.abukix.dev.
  Year 5 Final Exam:  full-day, panel/AI-administered. Graduation.
  Capstone primary:   mlship v2 — OSS launch (sklearn + HF text-gen, excellent;
                                              other frameworks land as v2.1, v2.2).
  Capstone secondary: pattern paper — Staff/Principal-grade writing,
                                       2+ external readers, conference submission.

Three transitions matter most. They’re where the role identity changes:

Y1 → Y2 (Month 12-13): single-machine intuition → distributed-systems thinking. The phase that makes this real is Phase 8 (Distributed Systems Theory).
Y2 → Y3 (Month 24-25): platform-as-tool → platform-as-product. The platform stops being something you run and starts being something you offer.
Y4 → Y5 (Month 48-49): operator → architect. You stop building services and start building the agents that operate services. AIOps is the inflection.

The pattern-first scaffold

Every phase doc inside ROOT — all 30 of them — follows this 8-step structure:

PROBLEM       What category of human need exists?
PRINCIPLES    The timeless patterns any solution must implement
TRADE-OFFS    The decisions every implementation makes (and why)
TOOLS         Current implementations (time-stamped — they age)
MASTERY       Pick one tool, go to operational depth
COMPARE       Re-implement the same problem in a second tool
              (this is the proof that the pattern transferred)
OPERATE       Run it in your homelab, take real incidents
CONTRIBUTE    Ship one fix upstream

Worked example — Phase 7: Kubernetes + GitOps:

Step	What you do
Problem	“I need to run N stateless services across M machines, recover from failures, and deploy via Git.”
Principles	Declarative state · control loops · reconciliation · idempotency · operator pattern
Trade-offs	Push vs pull deploy · CRDs vs config maps · single-cluster vs multi-cluster · etcd consensus cost
Tools	K3s, Argo CD, Helm, Kustomize (Q1 2026) — pinned dates, not “best ever”
Mastery	K3s + Argo CD on the homelab. Bootstrap. Day 2. Recover from etcd corruption.
Compare	Re-implement the same workload using Nomad + Levant. Did the pattern transfer?
Operate	Run it for a real month. Real incidents. Real runbooks.
Contribute	One PR upstream — docs fix, bug fix, anything that lands.

If a phase doc reads like a copy-paste tutorial, the doc is wrong: flag it. Phases give you the framing; you do the investigation. Tools sections point you in the right direction via “Starter hints:” lines, and you write the actual commands.
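The 8-step scaffold lends itself to a mechanical completeness check. The sketch below assumes a hypothetical convention in which each step name appears alone on its own line in a phase doc; the real phase docs may mark steps differently.

```python
# The eight scaffold steps, in the order the scaffold lists them.
SCAFFOLD = ["PROBLEM", "PRINCIPLES", "TRADE-OFFS", "TOOLS",
            "MASTERY", "COMPARE", "OPERATE", "CONTRIBUTE"]


def missing_steps(doc_text: str) -> list[str]:
    """Return scaffold steps that have no heading line in the doc, in scaffold order."""
    # Assumption: a step heading is a line consisting only of the step name.
    headings = {line.strip().upper() for line in doc_text.splitlines()}
    return [step for step in SCAFFOLD if step not in headings]


draft = """\
Problem
Run N stateless services across M machines.
Principles
Declarative state, control loops.
Tools
K3s, Argo CD
"""

gaps = missing_steps(draft)
print(gaps)
```

A check like this only proves the sections exist. Whether a section reads like a tutorial instead of a framing still needs a human reader.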

What you build (the platform)

By Month 60 you’ve built basecamp — a 9-tier production platform on a homelab mini-PC. The architecture mirrors what Spotify, Netflix, and Uber operate, but at small scale. The pattern is identical; the scale is different.

Tier 9: Agents          (LangGraph, MCP servers, services/aiops/)              Y5
Tier 8: Data Serving    (Trino, Superset)                                      Y3
Tier 7: LLM             (vLLM, vector DB, services/llm-gateway/)               Y4
Tier 6: ML Platform     (Kubeflow, Katib, Training Operators)                  Y4
Tier 5: ML              (MLflow, KServe, Ray, Feast, JupyterHub)               Y4 / Y3
Tier 4: Processing      (Spark, Airflow, Redpanda, Flink, dbt)                 Y3
Tier 3: Lakehouse       (Iceberg, Nessie, MinIO)                               Y3
Tier 2: Platform        (Backstage, Sealed Secrets, ESO, mesh, OTel, Loki)     Y2
Tier 1: Foundation      (ArgoCD, Postgres, Redis, Prometheus, Grafana)         Y1

Each tier is a stable abstraction the next tier above it depends on. This isn’t an arbitrary stack — it reflects how production data platforms actually evolve. Tier 1 (control plane + observability) must be solid before Tier 2 (developer-facing platform) is meaningful. Tier 3 (storage substrate) is the foundation Tier 4 (processing engines) builds on. And so on.

On top of basecamp sits Abukix Studio — the public brand layer:

Unified Web UI — Backstage-extended portal showing every service, deploy, and metric.
Command palette AI assistant — natural-language interface to the platform: “deploy mlship to staging”, “show me yesterday’s errors in llm-gateway”.
platform-ctl — the unified CLI front-door. One command surface to operate everything.
4-5 documented composition recipes — workflows that chain underlying services into real outcomes.

basecamp is what you operate. Abukix Studio is what you show.

The projects (three groups by role)

ROOT ships ~10 OSS projects across the 5 years. Think of them as actors in a play, not items on a list. They group into three roles:

Group A: the platform itself (the stage)

Project	Role	First touched
`terralabs`	Provisions the land — Terraform + Crossplane modules for VPC, K8s, RDS, MinIO, Proxmox K3s	Y2
`basecamp`	The GitOps repo of YAML that declares the platform (9 tiers)	Y1
`platform-ctl`	The unified CLI front-door — one command surface to operate everything	Y2 → Y5
Portal (inside basecamp)	Backstage-extended Web UI + command palette agent — the visitor’s view	Y5

These four together form Abukix Studio (the public-facing brand layer). They’re “the stage” because they’re what every other project runs on.

Group B: services running on the platform (the actors)

Project	What it does	First touched
`triage`	A small on-call dashboard. The first real service-on-K3s, deployed at the end of Year 1.	Y1
`services/llm-gateway/`	OpenAI-compatible LLM API with multi-model routing, RAG, streaming, cost tracking. Lives inside basecamp.	Y4
`services/aiops/`	Agent that operates the platform — alert triage, runbook execution, pattern detection.	Y5
`mlship`	Capstone CLI users invoke from anywhere. `mlship deploy ./model.pkl` → URL.	Y5

These are what the platform actually offers to a user. They make the platform feel like a product, not a stack.

Group C: operational discipline + fluency artifacts (the craft)

Project	What it does	First touched
`ops-handbook`	Runbooks, incidents, postmortems, ADRs, weekly logs. The journal of running the platform.	Y1
`rxp`	Regex CLI. Fluency artifact + log-pattern utility used inside `services/aiops/` (Y5).	Y1
`konfig`	Config validator. Fluency artifact + used in basecamp CI to validate Helm values.	Y1
`pulse`	Probe scanner. Fluency artifact + emits Prometheus metrics scraped by basecamp.	Y1

ops-handbook is the journal — proof you can write production-grade docs over years. The three small CLIs are proof you can ship code with PR review, CI, releases — habits, not products. Each gets a real integration role inside basecamp’s tooling so they’re not orphaned demos.
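As an illustration of the integration role described for pulse, the sketch below renders probe results in the Prometheus text exposition format, which is what a scrape endpoint would serve. The metric names, the `target` label, and the sample results are assumptions for illustration, not pulse's actual output.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(results: dict[str, tuple[bool, float]]) -> str:
    """Render {target: (succeeded, duration_seconds)} as Prometheus text exposition."""
    lines = [
        "# HELP probe_success Whether the last probe succeeded (1) or failed (0).",
        "# TYPE probe_success gauge",
    ]
    for target, (ok, _) in sorted(results.items()):
        lines.append(f'probe_success{{target="{escape_label(target)}"}} {int(ok)}')
    lines += [
        "# HELP probe_duration_seconds Duration of the last probe in seconds.",
        "# TYPE probe_duration_seconds gauge",
    ]
    for target, (_, seconds) in sorted(results.items()):
        lines.append(f'probe_duration_seconds{{target="{escape_label(target)}"}} {seconds}')
    return "\n".join(lines) + "\n"


# Hypothetical probe results: (succeeded, duration in seconds).
sample = {"postgres:5432": (True, 0.012), "redis:6379": (False, 1.5)}
text = render_metrics(sample)
print(text)
```

Serving this string over HTTP on a `/metrics` path is enough for a Prometheus scrape job to pick it up, which is what turns a small CLI into a participant in the platform's observability tier.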

Personal services tier (the dogfood)

basecamp also runs your stuff, not just demos. This is what makes the platform real instead of a portfolio piece:

basecamp/charts/personal/
├── personal-blog/      Y2  — your blog deployed via basecamp instead of Cloudflare Pages
├── personal-api/       Y3  — life-data API (fitness, learning hours, GitHub activity)
├── notes-rag/          Y4  — RAG over your own ROOT writing + weekly logs
│                            (dogfoods llm-gateway)
└── home-dash/          Y5  — internal dashboard pulling from all of the above
                              (dogfoods Studio portal)

These prove the platform serves you, not just GitHub stars. They’re also the cinematic moments: watch me self-host my own RAG over five years of weekly logs. Hard to fake, easy to film, impossible to ignore.

The composition recipes (Studio’s public proof)

A platform isn’t its components — it’s the workflows that chain components into real outcomes. ROOT documents 4-5 killer composition recipes that show Abukix Studio working end-to-end. Each recipe = one demo video, one blog post, one Show HN moment:

Recipe	Services chained	Year landing
Personal RAG over your own weekly logs	Notebook → Spark (chunk) → embeddings → pgvector → `llm-gateway` → portal command palette	Y4
Auto-incident triage loop	Prometheus alert → `aiops` → Trino (history) → propose runbook → `platform-ctl` execute	Y5
Train → register → deploy in one flow	Notebook → Ray (distributed train) → MLflow → KServe → `mlship` deploy	Y4
Homelab life API	GitHub → Airflow → Iceberg → Trino → personal API → portal	Y3
AI-assisted on-call	`triage` + `aiops` + `llm-gateway` + command palette	Y5

These are documented under Studio composition recipes, with runnable scripts in basecamp/examples/.

How the platform is shared with the world

Three tiers, ordered by cost-to-you:

OSS, self-host (free for everyone, free for you) — github.com/abukix/basecamp. Clone, run on your homelab/cloud, follow the README, get an equivalent platform. This is the moat. This is how the program scales to thousands of users without operational cost. 99% of users land here.
Hosted demo (free for visitors, ~$30-50/month for you) — studio.abukix.dev. Read-only, rate-limited, pre-loaded notebook + tiny RAG demo + small mlship deploy + portal command palette. CPU-only models (no GPU; Phi-3 quantized via llama.cpp). 10-min session timeout. This is the cinematic surface — visitors see the platform live without authentication, then optionally clone the OSS to run it themselves.
Managed offering (deferred, paid) — only if Year 5 launch shows demand. Optional. Most successful open-core companies don’t launch managed for years.

The order matters. OSS first establishes credibility (real engineers can run real software). Hosted demo amplifies reach (non-engineers can see it work). Managed is monetization (deferred until both prior tiers prove demand).
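The hosted demo's two guardrails, rate limiting and the 10-minute session timeout, can be sketched with a token bucket plus a session start time. Only the 10-minute figure comes from the plan; the class name, refill rate, and burst size are assumptions, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time

SESSION_TIMEOUT_S = 10 * 60  # the 10-minute session limit stated in the plan


class DemoSession:
    """Hypothetical per-visitor guard: token-bucket rate limit plus hard session timeout."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.clock = clock
        self.started = clock()
        self.rate = rate_per_s        # tokens refilled per second
        self.burst = burst            # maximum tokens the bucket can hold
        self.tokens = float(burst)
        self.last = self.started

    def allow(self) -> bool:
        """Return True if a request may proceed now."""
        now = self.clock()
        if now - self.started >= SESSION_TIMEOUT_S:
            return False              # session expired, regardless of tokens
        # Refill based on elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Drive the session with a fake clock (assumed rate: 1 request per 2 s, burst of 2).
now = [0.0]
session = DemoSession(rate_per_s=0.5, burst=2, clock=lambda: now[0])
burst_results = [session.allow(), session.allow(), session.allow()]
now[0] = 2.0
after_wait = session.allow()   # one token has refilled
now[0] = 601.0
expired = session.allow()      # past the 10-minute limit
print(burst_results, after_wait, expired)
```

Keeping both limits in one small object per visitor is a design choice for the sketch. A real deployment would more likely enforce them at the ingress or gateway layer.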

Time budget: 12 hours/week, sustained over 60 months

The non-negotiable is consistency over intensity:

Sunday          (1-2 hrs)   weekly log entry — what you learned, broke, were stuck on
Weeknight × 2-3 (1 hr each) current phase work — read, investigate, take notes
Weekend         (3-4 hrs)   operate the platform — incidents, runbooks, deeper investigation
Continuous                  every commit lands runbooks/ADRs/patterns in ops-handbook

Total: ~12 hrs/week. Some weeks 8, some weeks 16, average 12. The non-negotiable is the Sunday weekly log — that’s the discipline that compounds. Everything else can flex around real life.
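For a sense of scale, the weekly average can be multiplied out over the whole program. This is plain arithmetic on the figures above, assuming 52 weeks per year with no weeks off.

```python
HOURS_PER_WEEK = 12   # the stated weekly average
WEEKS_PER_YEAR = 52   # assumption: no weeks off
YEARS = 5

hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR
total_hours = hours_per_year * YEARS

print(f"{hours_per_year} hours per year, {total_hours} hours over the program")
```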

Hardware requirements

YEAR 1 (Months 1-12)
  RAM:     16GB DDR5 → upgrade to 32GB before Month 10 (Kubernetes)
  Storage: 256GB NVMe + 1TB external SSD added at Month 10

YEAR 2 (Months 13-24)
  RAM:     32GB DDR5
  Storage: unchanged from end of Year 1 (256GB NVMe + 1TB external SSD)

YEAR 3 (Months 25-36)
  RAM:     64GB DDR5 (upgrade Month 25)
  Storage: same

YEAR 4-5 (Months 37-60)
  RAM:     64GB DDR5
  Storage: same
  Optional: second M70q for multi-node testing (~$200-300 used)

UPGRADE COST (incremental, per purchase)
  Month 10:  +$120-140  (32GB + 1TB SSD before Kubernetes)
  Month 25:  +$150-200  (64GB before Year 3, sell 32GB kit)
  Month 49:  +$200-300  (optional second node)

Full breakdown including specific SKUs, sourcing tips, and Proxmox setup: homelab/hardware.

Cloud requirements (phases that NEED cloud)

Year 2: AWS Deep Dive
  └── AWS Free Tier account; budget $50-100 total

Year 2: GCP + Cloud-Agnostic
  └── GCP $300 free credits; budget $0

Year 4: GPU Infrastructure
  └── Cloud spot GPU (AWS g5.xlarge, or any T4-backed instance); budget $20-50 total

TOTAL CLOUD SPEND: ~$70-150 over 60 months

Cloud is exploration, not production. Default deployment target is the homelab. Cloud is the lab where you verify your patterns generalize across environments. Destroy at end of session.
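The hardware and cloud figures can be summed into one out-of-pocket range. The sketch below uses only the dollar ranges stated in the two sections above; the dictionary labels are shorthand, and resale of the 32GB kit is not subtracted.

```python
# (low, high) USD ranges copied from the hardware and cloud sections.
HARDWARE = {
    "Month 10: 32GB RAM + 1TB SSD": (120, 140),
    "Month 25: 64GB RAM": (150, 200),
}
OPTIONAL_HARDWARE = {"Month 49: second node": (200, 300)}
CLOUD = {
    "Year 2: AWS": (50, 100),
    "Year 2: GCP (free credits)": (0, 0),
    "Year 4: spot GPU": (20, 50),
}


def total(*groups: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends across any number of cost groups."""
    low = sum(lo for group in groups for lo, _ in group.values())
    high = sum(hi for group in groups for _, hi in group.values())
    return low, high


cloud = total(CLOUD)
required = total(HARDWARE, CLOUD)
with_node = total(HARDWARE, OPTIONAL_HARDWARE, CLOUD)

print(f"cloud only:        ${cloud[0]}-{cloud[1]}")
print(f"hardware + cloud:  ${required[0]}-{required[1]}")
print(f"with second node:  ${with_node[0]}-{with_node[1]}")
```

The cloud-only total reproduces the ~$70-150 figure above, which is a useful check that the per-phase budgets and the stated total agree.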

The pattern depth ladder

Patterns aren’t a separate concept from phases — they’re the durable knowledge artifact phases produce. Every entry in the Pattern Library progresses through three depths:

STUB — frontmatter + 1-paragraph summary. Default state. Created when first referenced.
OUTLINE — promoted when a phase first touches the pattern. Adds Problem · Principles · Trade-offs · Tools sections.
DEEP — promoted after 3+ months of operating something that depends on the pattern. Adds Mastery · Compare · Operate · Contribute sections — the proof you understand it.

Tools change. Patterns don’t. The Pattern Library is what survives the 60 months and outlives the specific implementations you’ll have learned.
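The depth ladder is a small state machine, and writing it down makes the promotion rules explicit. This is a sketch under stated assumptions: the function name and its two inputs are hypothetical, and promotion is modelled one rung at a time.

```python
from enum import IntEnum


class Depth(IntEnum):
    STUB = 1
    OUTLINE = 2
    DEEP = 3


# Sections each depth adds to a pattern entry, as listed in the ladder.
SECTIONS_ADDED = {
    Depth.STUB: ["Summary"],
    Depth.OUTLINE: ["Problem", "Principles", "Trade-offs", "Tools"],
    Depth.DEEP: ["Mastery", "Compare", "Operate", "Contribute"],
}


def next_depth(current: Depth, touched_by_phase: bool, months_operated: int) -> Depth:
    """Apply the promotion rules once; an entry never skips a rung or moves down."""
    if current is Depth.STUB and touched_by_phase:
        return Depth.OUTLINE
    if current is Depth.OUTLINE and months_operated >= 3:
        return Depth.DEEP
    return current


entry = Depth.STUB
entry = next_depth(entry, touched_by_phase=True, months_operated=0)   # first phase contact
after_two_months = next_depth(entry, touched_by_phase=True, months_operated=2)
after_three_months = next_depth(entry, touched_by_phase=True, months_operated=3)
print(entry.name, after_two_months.name, after_three_months.name)
```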

Reading order

You’re reading this in roughly the right place. Here’s the full path:

This Master Plan — you are here. Skim the rest of this page so you know what exists.
The Story — the why. Read it before Year 1 starts.
AI Learning Protocol — the rules for working with Claude/ChatGPT throughout the program. Critical to read before Phase 1.
Year 1 overview → Phase 1: OS Foundations — the actual program begins.
As you hit a pattern referenced in a phase: read its entry in Patterns; promote it from STUB to OUTLINE.
As each project becomes active: read its plan in Projects.
As you write your first runbook, postmortem, ADR, weekly log: copy from the Writing Templates.

For people who already work in infrastructure: jump straight to Year 1 final exam and see what the bar is. If you can already pass it, skip Year 1 entirely and start at Year 2.

Status

Created:                  2025-10-21
Started:                  not yet — Phase 1 begins when Year 1 docs are reviewed
                                    + homelab cleared (set the date in this line then)
Target graduation:        Month 60 from start
Target role:              Staff/Principal AI Platform Engineer
                          (or chosen Year-5 elective endpoint)

This page updates as the program progresses. The phase you’re currently in shows on every doc page in the kicker line (5-YEAR PROGRAM · YEAR N · PHASE M).
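The kicker line can be derived from the program month. In this sketch the year comes from the 12-months-per-year layout above, while the phase number is passed in because the month boundaries of individual phases are not given on this page. The function names are hypothetical.

```python
def program_year(month: int) -> int:
    """Map a program month (1-60) to its year (1-5); months 1-12 are Year 1."""
    if not 1 <= month <= 60:
        raise ValueError("month must be between 1 and 60")
    return (month - 1) // 12 + 1


def kicker(month: int, phase: int) -> str:
    """Format the kicker line shown on every doc page."""
    return f"5-YEAR PROGRAM · YEAR {program_year(month)} · PHASE {phase}"


# Phase 8 opens Year 2, so Month 13 falls in Year 2.
line = kicker(13, 8)
print(line)
```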

The single sentence to remember

ROOT is a 5-year process to make me the engineer past-me wanted to learn from — by building the platform that future-me would want to operate, while documenting the journey publicly so other engineers can follow.

The 30 phases, the 9 tiers, the ~10 OSS projects, the patterns, the docs, the cinematic content, the brand — all of it serves that one sentence.