Platform Engineering: UX + Security

The longest phase in ROOT. 4 months, two thematic halves: UX (Backstage + service mesh + the start of platform-ctl + SLI/SLO discipline) and Security (Sealed Secrets + ESO + Cosign + Pod Security Standards + RBAC audit). The phase that elevates “I run K8s” to “I build a platform engineers ship on.” ~16 weeks, ~192 hrs.

Phase 12 is the load-bearing phase of Year 2. It earns its 16 weeks by being the first time you stop being someone who runs a platform and start being someone who offers one — to a developer-user, with a UX contract, an SLO, and a security baseline that’s enforced by default rather than recommended in a wiki. Years 3-5 don’t add this discipline; they assume it.

The phase is split deliberately into two halves because UX and Security are different mental models. Half 1 is product thinking: golden paths, time-to-first-deploy, scaffolders, mesh observability, SLO burn-rate alerts that someone — including future-you — actually looks at. Half 2 is adversarial thinking: every layer in the stack has to assume the layer above is compromised. Image signing, Sealed Secrets, Pod Security Standards, RBAC audit, mesh ACLs — each is a control that fires in spite of the others having failed. The two halves rhyme but don’t blend.

This is also the phase that begins platform-ctl — the unified CLI front-door for basecamp. It stays private through Year 4 and ships publicly with the Studio launch in Year 5. By the end of Phase 12 it has 5+ working commands, tests, CI, and a real release pipeline. It’s small but it’s real, and it grows continuously through every later phase.

Prerequisites

Phase 11 complete — multi-cloud terralabs, basecamp deploying to K3s + EKS + GKE

You accept: Platform Engineering is a discipline, not a tool. Backstage and service mesh are implementations. The discipline is shipping internal products for developer-users with the same rigor as external products. Security is a co-equal half — applied at every layer, not bolted on.

Why this phase is split into two halves

Platform Engineering as a 16-week single push fails because UX and Security are two distinct mental models. The fix: explicitly pace them as halves.

Half 1 — Platform UX (weeks 1-8, ~96 hrs)
  Backstage IDP + service mesh + platform-ctl start + SLI/SLO discipline.
  Goal: a developer can self-serve "create new service" → running on basecamp in <10 min.

Half 2 — Platform Security (weeks 9-16, ~96 hrs)
  Sealed Secrets + ESO + Cosign image signing + Pod Security Standards + RBAC audit + supply chain.
  Goal: every basecamp service is signed, scoped, and secret-managed by default.

Each half is gated by its own validation criteria. Don’t start Half 2 until Half 1 is stable.

Why this phase exists at all

By Year 2 end you exit as Senior DevOps / Platform Engineer. That role requires:

Internal Developer Platform (IDP) experience — Backstage is the canonical OSS choice.
Service mesh fluency — mTLS, traffic management, L7 observability.
Platform-level security — secrets management beyond raw K8s Secrets, image signing, RBAC at depth.
SLI/SLO discipline — the platform itself has SLOs; users hold you accountable.

This phase ships all four. It also begins platform-ctl — the abukix CLI that wraps basecamp + terralabs into a self-service interface.

→ Pattern: platform-as-product
→ Pattern: service-mesh
→ Pattern: secrets-lifecycle
→ Pattern: defense-in-depth
→ Pattern: zero-trust-networking
→ Pattern: zero-trust-security
→ Pattern: sli-slo-error-budget

Half 1: Platform UX

1.1 PROBLEM

You have a platform (K3s + EKS + GKE + basecamp + terralabs). You want engineers — including future-you — to:

Discover what services exist.
Provision a new service in <10 min via a golden path.
Have observability + secrets + mesh wired automatically.
Trust that platform-managed defaults are correct.

Backstage solves discovery + scaffolding. Service mesh solves cross-service comms (mTLS, retries, observability) without changing app code. Together they collapse “spin up a new service” from a multi-day yak-shave into a Backstage Scaffolder click + a git push.

1.2 PRINCIPLES: Half 1

Platform-as-Product thinking

Internal platform has users. Users have UX expectations. Time-to-first-deploy is the diagnostic metric — the platform equivalent of SLI/SLO for the developer experience.

→ Pattern: platform-as-product

Investigate:

Measure your current “time from idea to running service” on basecamp. Is it 10 min or 2 hours? (Be honest. The first measurement is always embarrassing; that’s the point.)
What’s the single biggest friction point a new dev would hit?

Backstage as the IDP

Service catalog, scaffolder (templates), tech docs, all in one app.

Investigate:

Install Backstage; register basecamp services in the catalog.
Build a Scaffolder Template: “create new Go service” → git repo + ArgoCD app + Backstage entry, end to end.
Compare with industry IDPs (the developer-portal pattern as run at large-scale companies).

Service mesh

→ Pattern: service-mesh

Pick Linkerd (simplest) or Istio Ambient (modern, sidecar-less) or Cilium Service Mesh (eBPF, if Cilium is your CNI from Y1 P7).

Investigate:

Install on K3s; verify mTLS between two services with tcpdump.
Configure a canary rollout: 90/10 split, advance based on success rate.
Inject 5xx into 10% of traffic; verify retry + circuit breaker.
Read access logs end-to-end for one request — what fields does the mesh add?

SLI/SLO discipline (pulled forward from Y3)

The platform itself has SLOs, for example: “basecamp Applications reach Synced within 5 min of git commit, 99% of the time.”

→ Pattern: sli-slo-error-budget

Investigate:

Define 3 platform SLIs: ArgoCD sync latency, Ingress success rate, mesh mTLS coverage.
Set SLOs (99%, 99.5%) and error budgets.
Wire alerts that fire when error budget burn rate is fast (Honeycomb-style burn-rate alerts).
Dashboard the SLO state in Grafana; surface in Backstage.

1.3 PROJECT: start `platform-ctl`

This phase begins platform-ctl — the abukix CLI.

platform-ctl scope this half:
  github.com/abukix/platform-ctl (PRIVATE; goes public Y5)
  Go binary, cobra-based

  Subcommands so far:
    platform-ctl service create <name>   — scaffolds via Backstage Scaffolder
    platform-ctl service deploy <name>   — triggers ArgoCD sync
    platform-ctl service status <name>   — pods, recent deploys, error rate, SLO burn
    platform-ctl secret rotate <name>    — rotates via ESO (Half 2 wires this)

  Tests, CI, GoReleaser
  Continues growing in Year 3-5

See the platform-ctl plan.

1.4 Half-1 operational depth checklist

[ ] Install Backstage on EKS; register basecamp services in catalog
[ ] Build a Software Template: "Go service" → git repo + ArgoCD app + Backstage entry
[ ] Install service mesh (Linkerd or Istio Ambient); verify mTLS between two services with tcpdump
[ ] Configure canary rollout via mesh; advance based on success rate
[ ] Inject 5xx errors; verify retry + circuit breaker
[ ] Define 3 platform SLIs + SLOs + burn-rate alerts; surface in Grafana + Backstage
[ ] platform-ctl: 4 working subcommands; tests; binary releases
[ ] Time-to-first-deploy via Backstage + platform-ctl: target <10 min from idea to running
[ ] Document the platform's "golden path" in basecamp/README.md
[ ] Half-1 exit test passed

1.5 Half-1 exit test (4 hours)

Build (120 min) — onboard a new developer to basecamp using Backstage + platform-ctl. From zero to deployed-and-monitored service in <10 min. Document every step.
Diagnose (60 min) — scenario: the Backstage Scaffolder fails halfway, or a mesh canary is stuck and will not advance.
Articulate (60 min) — 1000 words: “Walk through the developer experience of creating a new service in basecamp. What controls fire at each step?”

Half 2: Platform Security

2.1 PROBLEM

Half 1 made it easy to ship services. Half 2 makes it safe.

You need: secrets that aren’t plaintext in git, images you can verify weren’t tampered with, RBAC that’s actually least-privilege, Pod Security defaults that limit blast radius, supply-chain provenance. Each control assumes the others have already failed — that’s defense-in-depth by definition, and the discipline that makes it real is enforce by default, not recommend in a wiki.

2.2 PRINCIPLES: Half 2

Sealed Secrets + External Secrets Operator

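Raw K8s Secrets are only base64-encoded, not encrypted, so a Secret manifest committed to git is effectively plaintext. Sealed Secrets encrypts them so the encrypted form is safe to commit. ESO syncs secrets into the cluster from an external store such as Vault or AWS Secrets Manager.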

→ Pattern: secrets-lifecycle

Investigate:

Install Sealed Secrets; convert 3 raw Secrets to SealedSecrets; commit encrypted versions.
Install External Secrets Operator; sync one secret from AWS Secrets Manager.
Rotate a database password without changing app code; observe how each approach handles rotation.

K8s RBAC at depth

→ Pattern: least-privilege (deepens from Phase 10’s IAM)

ClusterRole vs Role; ServiceAccount vs User vs Group; verb/resource matrix; common anti-patterns.

Investigate:

Audit basecamp RBAC: any ServiceAccount with */*? Fix.
Build a custom Role for a team that can deploy in their namespace but not others.
Use kubectl auth can-i to verify intent.

Supply-chain security: Cosign + SBOM + admission

Sign your images. Verify signatures at admission. SBOM (Syft) for vuln scanning + provenance.

Investigate:

Sign one basecamp service image with Cosign.
Install policy-controller; require signed images in one namespace.
Generate SBOM with Syft; scan with Trivy / Grype; commit SBOM with the chart.

Pod Security Standards

K8s enforces the Pod Security Standards (Privileged / Baseline / Restricted) through Pod Security Admission (PSA), which is configured with labels on each namespace.

Investigate:

Enforce restricted in one namespace; fix any violations.
Why is restricted strict? What does each restriction protect against? (Capabilities, hostPath mounts, runAsNonRoot — each is a documented escape vector that someone has used in production.)

Zero-trust mesh

Service mesh from Half 1 + identity (mTLS) + authorization (mesh ACLs) = zero-trust networking.

→ Pattern: zero-trust-networking
→ Pattern: zero-trust-security
→ Pattern: defense-in-depth (now DEEP after stacking image signing + PSS + NetPol + mTLS + RBAC)

Investigate:

Mesh ACL: service A can call service B but not service C.
Pod identity via SPIFFE/SPIRE — preview only.

2.3 Half-2 operational depth checklist

[ ] Install Sealed Secrets; convert all raw Secrets in basecamp; verify encrypted in git
[ ] Install ESO; sync 1+ secret from AWS Secrets Manager
[ ] Rotate a Postgres password without app changes; observe both approaches
[ ] Audit basecamp RBAC; fix wildcard verbs; minimum role per ServiceAccount
[ ] Sign basecamp service images with Cosign; require signed at admission
[ ] Generate SBOM with Syft; commit alongside chart; trivy-clean before merge
[ ] Enforce Pod Security restricted in 1+ namespace
[ ] Configure mesh ACL: service A can call B but not C; verify
[ ] Wire platform-ctl secret rotate <name> to ESO under the hood
[ ] Half-2 exit test passed

2.4 Half-2 exit test (3 hours)

Build (90 min) — a new service onboarded via Backstage now requires: signed image, SealedSecret for DB password, restricted Pod Security, namespace-scoped RBAC, mesh-only access from one upstream service.
Diagnose (60 min) — image-signing policy is blocking all deploys; debug + fix + postmortem.
Articulate (30 min) — 600 words: “Walk the security stack of basecamp from image build to running pod. What fires at each layer?”

3. TRADE-OFFS (combined)

Decision	Option A	Option B	Option C / notes
Service mesh	Istio (Ambient)	Linkerd	Cilium Service Mesh
Secrets	Sealed Secrets	ESO	Both
IDP	Backstage	Port (commercial)	Backstage: free, customizable, learning curve
Image signing	Cosign + Sigstore	Notary v2	Cosign is the modern default
Pod Security	PSS labels (built-in)	OPA Gatekeeper	Kyverno

4. TOOLS (as of Q1 2026)

Backstage (IDP)
Linkerd OR Istio Ambient OR Cilium Service Mesh
Sealed Secrets + External Secrets Operator
Cosign + Sigstore + policy-controller
Syft (SBOM) + Trivy / Grype (vuln scan)
Kyverno OR OPA Gatekeeper (policy)
Go + cobra (platform-ctl)

5. MASTERY

5.1 Reading list

Required	Why
“Team Topologies” (Skelton & Pais)	How platform teams structure
“Building Internal Developer Platforms” (Manuel Pais)	The discipline
Backstage docs (Catalog + Templates + Plugins)	The implementation
One service mesh docs (Istio Ambient OR Linkerd)	The implementation
Google SRE Book Ch. on SLOs	Half-1 SLI/SLO discipline

Recommended	Why
Spotify Backstage talks at KubeCon	Real patterns
“Container Security” (Liz Rice)	Half-2 security depth

5.2 Combined operational checklist

See Half-1 (10 items) + Half-2 (10 items) above.

6. COMPARE

6.1 Half-1 compare: Backstage vs Heroku-style

Heroku popularized the experience IDPs aim for (push code → URL). Backstage is the modern self-hosted version. 300 words: how does Backstage’s open + composable model compare to Heroku’s closed + opinionated one?

6.2 Half-2 compare: Sealed Secrets vs ESO

Both solve secrets-in-git. Different shapes. 300 words: when each wins.

7. OPERATE

8+ runbooks across the phase (Backstage ops, mesh debug, secret rotation, RBAC audit, image-signing failure, PSS violation)
3+ postmortems (Half 1 + Half 2 incidents)
2+ ADRs (mesh choice; secrets approach)
Weekly log

8. CONTRIBUTE

Backstage has tons of “good first issue” tickets. Service mesh projects (Istio, Linkerd) similarly. Sealed Secrets + ESO + Cosign all have welcoming communities.

Validation criteria (combined)

[ ] Half-1 validation passed
[ ] Half-2 validation passed
[ ] Backstage running, catalog populated, scaffolder working
[ ] Service mesh running with mTLS verified
[ ] Sealed Secrets + ESO in basecamp; raw Secrets gone
[ ] Cosign + policy-controller enforcing signed images
[ ] Pod Security restricted in tenant namespaces
[ ] platform-ctl private with 5+ commands working
[ ] 3 platform SLIs/SLOs live with burn-rate alerts
[ ] 8+ runbooks; 3+ postmortems; 2+ ADRs; 16+ weekly log entries
[ ] Pattern entries deepened to DEEP:
    - platform-as-product, service-mesh, secrets-lifecycle
    - least-privilege (now DEEP from K8s RBAC + IAM stacked)
    - defense-in-depth (DEEP — image signing + Pod Security + RBAC + NetPol + mTLS stacked)
    - zero-trust-networking, zero-trust-security
    - sli-slo-error-budget (OUTLINE → DEEP via real platform SLOs)
[ ] Phase Exit Test passed (combined or per-half)

Anti-patterns

Anti-pattern	Why
Backstage as documentation site only	Misses scaffolder + catalog power; just becomes a wiki
Service mesh “for everything” without measuring perf	Sidecar latency adds up
Raw K8s Secrets “we’ll fix later”	Later never comes; SealedSecrets is one helm install away
Signing images but not verifying	Half the security; verification at admission is non-negotiable
SLO theater (SLOs no one looks at)	The point is decision-making, not dashboards
Skipping platform-ctl ergonomics work	A platform without UX is just YAML

Patterns deepened this phase

Browse the full categories at patterns/infrastructure-and-platform/ and patterns/security/.

→ Next: Phase 13: Multi-cloud basecamp (Y2 capstone)
