Platform Engineering: UX + Security
The longest phase in ROOT. 4 months, two thematic halves: UX (Backstage + service mesh + platform-ctl start + SLI/SLO discipline) and Security (Sealed Secrets + ESO + Cosign + Pod Security Standards + RBAC audit). The phase that elevates “I run K8s” to “I build a platform engineers ship on.” ~16 weeks, ~192 hrs.
Phase 12 is the load-bearing phase of Year 2. It earns its 16 weeks by being the first time you stop being someone who runs a platform and start being someone who offers one — to a developer-user, with a UX contract, an SLO, and a security baseline that’s enforced by default rather than recommended in a wiki. Years 3-5 don’t add this discipline; they assume it.
The phase is split deliberately into two halves because UX and Security are different mental models. Half 1 is product thinking: golden paths, time-to-first-deploy, scaffolders, mesh observability, SLO burn-rate alerts that someone — including future-you — actually looks at. Half 2 is adversarial thinking: every layer in the stack has to assume the layer above is compromised. Image signing, Sealed Secrets, Pod Security Standards, RBAC audit, mesh ACLs — each is a control that fires in spite of the others having failed. The two halves rhyme but don’t blend.
This is also the phase that begins platform-ctl — the unified CLI front-door for basecamp. It stays private through Year 4 and ships publicly with the Studio launch in Year 5. By the end of Phase 12 it has 5+ working commands, tests, CI, and a real release pipeline. It’s small but it’s real, and it grows continuously through every later phase.
Prerequisites
- Phase 11 complete — multi-cloud terralabs, basecamp deploying to K3s + EKS + GKE
- You accept: Platform Engineering is a discipline, not a tool. Backstage and service mesh are implementations. The discipline is shipping internal products for developer-users with the same rigor as external products. Security is a co-equal half — applied at every layer, not bolted on.
Why this phase is split into two halves
Platform Engineering as a 16-week single push fails because UX and Security are two distinct mental models. The fix: explicitly pace them as halves.
Half 1: Platform UX (weeks 1-8, ~96 hrs). Backstage IDP + service mesh + platform-ctl start + SLI/SLO discipline. Goal: a developer can self-serve "create new service" → running on basecamp in <10 min.
Half 2: Platform Security (weeks 9-16, ~96 hrs). Sealed Secrets + ESO + Cosign image signing + Pod Security Standards + RBAC audit + supply chain. Goal: every basecamp service is signed, scoped, and secret-managed by default.
Each half is gated by its own validation criteria. Don’t start Half 2 until Half 1 is stable.
Why this phase exists at all
By Year 2 end you exit as Senior DevOps / Platform Engineer. That role requires:
- Internal Developer Platform (IDP) experience — Backstage is the canonical OSS choice.
- Service mesh fluency — mTLS, traffic management, L7 observability.
- Platform-level security — secrets management beyond raw K8s Secrets, image signing, RBAC at depth.
- SLI/SLO discipline — the platform itself has SLOs; users hold you accountable.
This phase ships all four. It also begins platform-ctl — the abukix CLI that wraps basecamp + terralabs into a self-service interface.
→ Pattern: platform-as-product → Pattern: service-mesh → Pattern: secrets-lifecycle → Pattern: defense-in-depth → Pattern: zero-trust-networking → Pattern: zero-trust-security → Pattern: sli-slo-error-budget
Half 1: Platform UX
1.1 PROBLEM
You have a platform (K3s + EKS + GKE + basecamp + terralabs). You want engineers — including future-you — to:
- Discover what services exist.
- Provision a new service in <10 min via a golden path.
- Have observability + secrets + mesh wired automatically.
- Trust that platform-managed defaults are correct.
Backstage solves discovery + scaffolding. Service mesh solves cross-service comms (mTLS, retries, observability) without changing app code. Together they collapse “spin up a new service” from a multi-day yak-shave into a Backstage Scaffolder click + a git push.
1.2 PRINCIPLES: Half 1
Platform-as-Product thinking
An internal platform has users, and users have UX expectations. Time-to-first-deploy is the diagnostic metric: the platform equivalent of SLI/SLO for the developer experience.
→ Pattern: platform-as-product
Investigate:
- Measure your current “time from idea to running service” on basecamp. Is it 10 min or 2 hours? (Be honest. The first measurement is always embarrassing; that’s the point.)
- What’s the single biggest friction point a new dev would hit?
Backstage as the IDP
Service catalog, scaffolder (templates), tech docs, all in one app.
Investigate:
- Install Backstage; register basecamp services in the catalog.
- Build a Scaffolder Template: “create new Go service” → git repo + ArgoCD app + Backstage entry, end to end.
- Compare with industry IDPs (the developer-portal pattern at large-scale companies).
Service mesh
→ Pattern: service-mesh
Pick Linkerd (simplest) or Istio Ambient (modern, sidecar-less) or Cilium Service Mesh (eBPF, if Cilium is your CNI from Y1 P7).
Investigate:
- Install on K3s; verify mTLS between two services with tcpdump.
- Configure a canary rollout: 90/10 split, advance based on success rate.
- Inject 5xx into 10% of traffic; verify retry + circuit breaker.
- Read access logs end-to-end for one request — what fields does the mesh add?
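The "advance based on success rate" rule from the canary item above can be sketched as a small decision function. This is a minimal illustration, not any mesh's or rollout controller's API: the `CanaryWindow` type, the `nextWeight` function, and the threshold, minimum-traffic, and step values are all hypothetical.

```go
package main

import "fmt"

// CanaryWindow is a hypothetical summary of one evaluation window.
type CanaryWindow struct {
	Weight    int // percent of traffic currently routed to the canary
	Requests  int // requests the canary served in this window
	Successes int // responses that were not 5xx
}

// nextWeight returns the canary weight for the next window:
//   - hold the current weight if there is too little traffic to judge,
//   - roll back to 0 if the success rate is below the threshold,
//   - otherwise advance by step, capped at 100.
func nextWeight(w CanaryWindow, threshold float64, minRequests, step int) int {
	if w.Requests < minRequests {
		return w.Weight
	}
	rate := float64(w.Successes) / float64(w.Requests)
	if rate < threshold {
		return 0
	}
	next := w.Weight + step
	if next > 100 {
		next = 100
	}
	return next
}

func main() {
	healthy := CanaryWindow{Weight: 10, Requests: 500, Successes: 498}
	failing := CanaryWindow{Weight: 10, Requests: 500, Successes: 450}
	fmt.Println("healthy canary, next weight:", nextWeight(healthy, 0.99, 100, 20))
	fmt.Println("failing canary, next weight:", nextWeight(failing, 0.99, 100, 20))
}
```

The "hold on low traffic" branch matters in a homelab: with only a handful of requests per window, a single 5xx would otherwise trigger a rollback.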
SLI/SLO discipline (pulled forward from Y3)
The platform itself has SLOs. “basecamp Applications reach Synced within 5 min of git commit, 99% of the time.”
→ Pattern: sli-slo-error-budget
Investigate:
- Define 3 platform SLIs: ArgoCD sync latency, Ingress success rate, mesh mTLS coverage.
- Set SLOs (99%, 99.5%) and error budgets.
- Wire alerts that fire when error budget burn rate is fast (Honeycomb-style burn-rate alerts).
- Dashboard the SLO state in Grafana; surface in Backstage.
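The arithmetic behind the error budget and burn-rate items above is short enough to sketch. The error budget is `1 - SLO`, and the burn rate is the observed error rate divided by that budget, so a burn rate of 1 spends the budget exactly over the SLO period. The function names and sample counts below are hypothetical. The 14.4 threshold is an assumption borrowed from the commonly cited multiwindow guidance in the Google SRE Workbook (one-hour window against a 30-day budget); tune it to your own SLO period.

```go
package main

import "fmt"

// burnRate returns how fast the error budget is being consumed.
// bad and total are event counts in one window; slo is a fraction such as 0.99.
// A result of 1.0 means the budget lasts exactly the SLO period.
func burnRate(bad, total int, slo float64) float64 {
	if total == 0 {
		return 0 // no traffic, nothing burned
	}
	errorRate := float64(bad) / float64(total)
	budget := 1 - slo
	return errorRate / budget
}

// shouldPage implements a two-window check: page only when both the long
// and the short window exceed the threshold, so an incident that has
// already recovered does not keep paging.
func shouldPage(longBurn, shortBurn, threshold float64) bool {
	return longBurn >= threshold && shortBurn >= threshold
}

func main() {
	const slo = 0.99 // e.g. "Synced within 5 min of git commit, 99% of the time"

	// Hypothetical counts: syncs that missed the 5-minute target.
	long := burnRate(200, 1000, slo) // last hour
	short := burnRate(30, 100, slo)  // last 5 minutes

	fmt.Printf("long-window burn rate:  %.1f\n", long)
	fmt.Printf("short-window burn rate: %.1f\n", short)
	fmt.Println("page:", shouldPage(long, short, 14.4))
}
```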
1.3 PROJECT: start platform-ctl
This phase begins platform-ctl — the abukix CLI.
platform-ctl scope this half:
- github.com/abukix/platform-ctl (PRIVATE; goes public Y5)
- Go binary, cobra-based
- Subcommands so far:
  - platform-ctl service create <name>: scaffolds via Backstage Scaffolder
  - platform-ctl service deploy <name>: triggers ArgoCD sync
  - platform-ctl service status <name>: pods, recent deploys, error rate, SLO burn
  - platform-ctl secret rotate <name>: rotates via ESO (Half 2 wires this)
- Tests, CI, GoReleaser
- Continues growing in Year 3-5
See the platform-ctl plan.
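The command surface above can be prototyped before any real integration exists. The sketch below is a stdlib-only stand-in, not the cobra-based implementation the plan calls for: the `commands` table and `run` function are hypothetical, and every handler only describes what it would do rather than calling Backstage, ArgoCD, or ESO.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// commands maps "<group> <verb>" to a description of the intended action.
// Real handlers would call Backstage, ArgoCD, or ESO; these are stubs.
var commands = map[string]string{
	"service create": "would scaffold via Backstage Scaffolder: ",
	"service deploy": "would trigger ArgoCD sync for ",
	"service status": "would show pods, recent deploys, error rate, SLO burn for ",
	"secret rotate":  "would rotate via ESO: ",
}

// run dispatches "<group> <verb> <name>" and returns the stub's output.
func run(args []string) (string, error) {
	if len(args) != 3 {
		return "", errors.New("usage: platform-ctl <group> <verb> <name>")
	}
	key := args[0] + " " + args[1]
	prefix, ok := commands[key]
	if !ok {
		return "", fmt.Errorf("unknown command %q", key)
	}
	return prefix + args[2], nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		args = []string{"service", "status", "demo"} // demo invocation
	}
	out, err := run(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println(out)
}
```

Keeping dispatch in a pure `run` function makes the command table testable without a cluster, which carries over unchanged when the dispatch layer is swapped for cobra.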
1.4 Half-1 operational depth checklist
[ ] Install Backstage on EKS; register basecamp services in catalog
[ ] Build a Software Template: "Go service" → git repo + ArgoCD app + Backstage entry
[ ] Install service mesh (Linkerd or Istio Ambient); verify mTLS between two services with tcpdump
[ ] Configure canary rollout via mesh; advance based on success rate
[ ] Inject 5xx errors; verify retry + circuit breaker
[ ] Define 3 platform SLIs + SLOs + burn-rate alerts; surface in Grafana + Backstage
[ ] platform-ctl: 4 working subcommands; tests; binary releases
[ ] Time-to-first-deploy via Backstage + platform-ctl: target <10 min from idea to running
[ ] Document the platform's "golden path" in basecamp/README.md
[ ] Half-1 exit test passed
1.5 Half-1 exit test (4 hours)
- Build (120 min) — onboard a new developer to basecamp using Backstage + platform-ctl. From zero to deployed-and-monitored service in <10 min. Document every step.
- Diagnose (60 min) — scenario: Backstage Scaffolder fails halfway; or mesh canary stuck advancing.
- Articulate (60 min) — 1000 words: “Walk through the developer experience of creating a new service in basecamp. What controls fire at each step?”
Half 2: Platform Security
2.1 PROBLEM
Half 1 made it easy to ship services. Half 2 makes it safe.
You need: secrets that aren’t plaintext in git, images you can verify weren’t tampered with, RBAC that’s actually least-privilege, Pod Security defaults that limit blast radius, supply-chain provenance. Each control assumes the others have already failed — that’s defense-in-depth by definition, and the discipline that makes it real is enforce by default, not recommend in a wiki.
2.2 PRINCIPLES: Half 2
Sealed Secrets + External Secrets Operator
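Raw K8s Secrets are only base64-encoded, which is encoding, not encryption. Sealed Secrets encrypts them so the encrypted form is safe to commit to git. ESO syncs secrets into the cluster from an external store such as Vault or AWS Secrets Manager.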
→ Pattern: secrets-lifecycle
Investigate:
- Install Sealed Secrets; convert 3 raw Secrets to SealedSecrets; commit encrypted versions.
- Install External Secrets Operator; sync one secret from AWS Secrets Manager.
- Rotate a database password without changing app code; observe how each approach handles rotation.
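To see why a raw Secret committed to git counts as plaintext, it helps to decode one. The sketch below reverses the base64 encoding applied to values under a Secret's `data` field; no key is involved. The function name and the sample value are hypothetical.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeSecretValue reverses the base64 encoding used for values in a
// Kubernetes Secret's data field. Anyone who can read the manifest can do this.
func decodeSecretValue(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// Hypothetical value as it might appear under data.password in a manifest.
	const fromManifest = "aHVudGVyMg=="

	plain, err := decodeSecretValue(fromManifest)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("recovered without any key:", plain)
}
```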
K8s RBAC at depth
→ Pattern: least-privilege (deepens from Phase 10’s IAM)
ClusterRole vs Role; ServiceAccount vs User vs Group; verb/resource matrix; common anti-patterns.
Investigate:
- Audit basecamp RBAC: any ServiceAccount with */*? Fix.
- Build a custom Role for a team that can deploy in their namespace but not others.
- Use kubectl auth can-i to verify intent.
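The wildcard audit above amounts to scanning every rule for `*` in its verbs or resources. The sketch below shows that check on simplified, hypothetical types. `PolicyRule` and `Role` only mirror the shape of the Kubernetes RBAC objects, and the role names are invented; a real audit would read Roles and ClusterRoles from the cluster.

```go
package main

import "fmt"

// PolicyRule and Role are simplified stand-ins for the RBAC API types.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

type Role struct {
	Name  string
	Rules []PolicyRule
}

func contains(vals []string, want string) bool {
	for _, v := range vals {
		if v == want {
			return true
		}
	}
	return false
}

// wildcardFindings reports every rule that grants "*" verbs or "*" resources.
func wildcardFindings(roles []Role) []string {
	var findings []string
	for _, r := range roles {
		for i, rule := range r.Rules {
			if contains(rule.Verbs, "*") {
				findings = append(findings, fmt.Sprintf("%s: rule %d has wildcard verbs", r.Name, i))
			}
			if contains(rule.Resources, "*") {
				findings = append(findings, fmt.Sprintf("%s: rule %d has wildcard resources", r.Name, i))
			}
		}
	}
	return findings
}

func main() {
	roles := []Role{
		{Name: "ci-deployer", Rules: []PolicyRule{
			{APIGroups: []string{"*"}, Resources: []string{"*"}, Verbs: []string{"*"}},
		}},
		{Name: "team-a-deployer", Rules: []PolicyRule{
			{APIGroups: []string{"apps"}, Resources: []string{"deployments"},
				Verbs: []string{"get", "list", "create", "update", "patch"}},
		}},
	}
	for _, f := range wildcardFindings(roles) {
		fmt.Println(f)
	}
}
```

The second role is the shape the custom-Role item asks for: explicit verbs on explicit resources, bound inside one namespace.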
Supply-chain security: Cosign + SBOM + admission
Sign your images. Verify signatures at admission. SBOM (Syft) for vuln scanning + provenance.
Investigate:
- Sign one basecamp service image with Cosign.
- Install policy-controller; require signed images in one namespace.
- Generate SBOM with Syft; scan with Trivy / Grype; commit SBOM with the chart.
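The sign-then-verify-at-admission flow above can be illustrated with the standard library alone. This is a conceptual sketch, not Cosign's or policy-controller's API: `newKey`, `signDigest`, and `admit` are hypothetical names, the signature covers only a digest string, and it assumes an ECDSA P-256 key pair. The point it shows is that a signature is bound to one image digest, and that admission must fail closed when the signature is missing.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// newKey generates an ECDSA P-256 key pair for the demo.
func newKey() (*ecdsa.PrivateKey, error) {
	return ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
}

// signDigest signs an image digest string (CI side).
func signDigest(priv *ecdsa.PrivateKey, imageDigest string) ([]byte, error) {
	h := sha256.Sum256([]byte(imageDigest))
	return ecdsa.SignASN1(rand.Reader, priv, h[:])
}

// admit is the admission-side check: allow the image only if the signature
// verifies against the trusted public key. Missing signatures are rejected.
func admit(pub *ecdsa.PublicKey, imageDigest string, sig []byte) bool {
	if len(sig) == 0 {
		return false // fail closed: unsigned images never run
	}
	h := sha256.Sum256([]byte(imageDigest))
	return ecdsa.VerifyASN1(pub, h[:], sig)
}

func main() {
	priv, err := newKey()
	if err != nil {
		fmt.Println("keygen failed:", err)
		return
	}
	const signed = "sha256:1111"   // hypothetical digest that CI signed
	const tampered = "sha256:2222" // hypothetical digest of a swapped image

	sig, err := signDigest(priv, signed)
	if err != nil {
		fmt.Println("signing failed:", err)
		return
	}
	fmt.Println("signed image admitted:  ", admit(&priv.PublicKey, signed, sig))
	fmt.Println("tampered image admitted:", admit(&priv.PublicKey, tampered, sig))
	fmt.Println("unsigned image admitted:", admit(&priv.PublicKey, signed, nil))
}
```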
Pod Security Standards
K8s enforces PSS (Restricted / Baseline / Privileged) via PSA labels.
Investigate:
- Enforce restricted in one namespace; fix any violations.
- Why is restricted strict? What does each restriction protect against? (Capabilities, hostPath mounts, runAsNonRoot: each is a documented escape vector that someone has used in production.)
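One way to answer "what does each restriction protect against" is to write the checks out by hand. The sketch below covers a simplified subset of the restricted profile on hypothetical types; the real profile also covers fields such as seccomp profiles and allowed volume types, so treat this as a study aid rather than a replacement for Pod Security Admission.

```go
package main

import "fmt"

// Container and PodSpec are simplified stand-ins for the Kubernetes types.
type Container struct {
	Name                     string
	Privileged               bool
	AllowPrivilegeEscalation bool
	RunAsNonRoot             bool
	DropsAllCapabilities     bool
}

type PodSpec struct {
	Containers      []Container
	HostPathVolumes []string
}

// restrictedViolations returns one message per violated check.
func restrictedViolations(p PodSpec) []string {
	var out []string
	for _, v := range p.HostPathVolumes {
		// hostPath exposes the node's filesystem to the pod.
		out = append(out, "hostPath volume "+v+" is not allowed")
	}
	for _, c := range p.Containers {
		if c.Privileged {
			out = append(out, c.Name+": privileged must be false")
		}
		if c.AllowPrivilegeEscalation {
			out = append(out, c.Name+": allowPrivilegeEscalation must be false")
		}
		if !c.RunAsNonRoot {
			out = append(out, c.Name+": runAsNonRoot must be true")
		}
		if !c.DropsAllCapabilities {
			out = append(out, c.Name+": capabilities must drop ALL")
		}
	}
	return out
}

func main() {
	pod := PodSpec{
		HostPathVolumes: []string{"/var/run/docker.sock"},
		Containers: []Container{
			{Name: "app", RunAsNonRoot: false, DropsAllCapabilities: true},
		},
	}
	for _, v := range restrictedViolations(pod) {
		fmt.Println(v)
	}
}
```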
Zero-trust mesh
Service mesh from Half 1 + identity (mTLS) + authorization (mesh ACLs) = zero-trust networking.
→ Pattern: zero-trust-networking → Pattern: zero-trust-security → Pattern: defense-in-depth (now DEEP after stacking image signing + PSS + NetPol + mTLS + RBAC)
Investigate:
- Mesh ACL: service A can call service B but not service C.
- Pod identity via SPIFFE/SPIRE — preview only.
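The mesh ACL item above (A may call B but not C) reduces to a default-deny allow-list keyed by caller identity. The sketch below shows only that decision logic; the `ACL` type and service names are hypothetical, and a real mesh expresses this through its own authorization policy resources, evaluated against the mTLS identity of the caller.

```go
package main

import "fmt"

// ACL maps a source identity to the set of destinations it may call.
// Anything not listed is denied.
type ACL map[string]map[string]bool

// Allowed reports whether src may call dst. Unknown sources and
// unlisted destinations both fall through to false (default deny).
func (a ACL) Allowed(src, dst string) bool {
	return a[src][dst]
}

func main() {
	acl := ACL{
		"service-a": {"service-b": true},
	}
	fmt.Println("a -> b:", acl.Allowed("service-a", "service-b"))
	fmt.Println("a -> c:", acl.Allowed("service-a", "service-c"))
	fmt.Println("c -> b:", acl.Allowed("service-c", "service-b"))
}
```

Default deny is the property to verify in the real mesh too: a policy that only blocks C passes the A-to-C test but still lets a new, unlisted service through.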
2.3 Half-2 operational depth checklist
[ ] Install Sealed Secrets; convert all raw Secrets in basecamp; verify encrypted in git
[ ] Install ESO; sync 1+ secret from AWS Secrets Manager
[ ] Rotate a Postgres password without app changes; observe both approaches
[ ] Audit basecamp RBAC; fix wildcard verbs; minimum role per ServiceAccount
[ ] Sign basecamp service images with Cosign; require signed at admission
[ ] Generate SBOM with Syft; commit alongside chart; trivy-clean before merge
[ ] Enforce Pod Security restricted in 1+ namespace
[ ] Configure mesh ACL: service A can call B but not C; verify
[ ] Wire platform-ctl secret rotate <name> to ESO under the hood
[ ] Half-2 exit test passed
2.4 Half-2 exit test (3 hours)
- Build (90 min): a new service onboarded via Backstage now requires a signed image, a SealedSecret for the DB password, restricted Pod Security, namespace-scoped RBAC, and mesh-only access from one upstream service.
- Diagnose (60 min): image-signing policy is blocking all deploys; debug + fix + postmortem.
- Articulate (30 min): 600 words: “Walk the security stack of basecamp from image build to running pod. What fires at each layer?”
3. TRADE-OFFS (combined)
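| Decision | Option A | Option B | Option C | When |
|---|---|---|---|---|
| Service mesh | Istio (Ambient) | Linkerd | Cilium Service Mesh | Linkerd: simplest. Istio Ambient: modern, sidecar-less. Cilium: eBPF, if Cilium is already your CNI |
| Secrets | Sealed Secrets | ESO | Both | Sealed Secrets: encrypted secrets in git. ESO: sync from Vault / Secrets Manager |
| IDP | Backstage | Port (commercial) | | Backstage: free, customizable, learning curve |
| Image signing | Cosign + Sigstore | Notary v2 | | Cosign is the modern default |
| Pod Security | PSS labels (built-in) | OPA Gatekeeper | Kyverno | PSS labels need no extra components; Gatekeeper and Kyverno are separate policy engines |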
4. TOOLS (as of Q1 2026)
- Backstage (IDP)
- Linkerd OR Istio Ambient OR Cilium Service Mesh
- Sealed Secrets + External Secrets Operator
- Cosign + Sigstore + policy-controller
- Syft (SBOM) + Trivy / Grype (vuln scan)
- Kyverno OR OPA Gatekeeper (policy)
- Go + cobra (platform-ctl)
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| “Team Topologies” (Skelton & Pais) | How platform teams structure |
| “Building Internal Developer Platforms” (Manuel Pais) | The discipline |
| Backstage docs (Catalog + Templates + Plugins) | The implementation |
| One service mesh docs (Istio Ambient OR Linkerd) | The implementation |
| Google SRE Book Ch. on SLOs | Half-1 SLI/SLO discipline |

| Recommended | Why |
|---|---|
| Spotify Backstage talks at KubeCon | Real patterns |
| “Container Security” (Liz Rice) | Half-2 security depth |
5.2 Combined operational checklist
See Half-1 (10 items) + Half-2 (10 items) above.
6. COMPARE
6.1 Half-1 compare: Backstage vs Heroku-style
Heroku invented the IDP pattern (push code → URL). Backstage is the modern self-hosted version. 300 words: how does Backstage’s open + composable model compare to Heroku’s closed + opinionated one?
6.2 Half-2 compare: Sealed Secrets vs ESO
Both solve secrets-in-git. Different shapes. 300 words: when each wins.
7. OPERATE
- 8+ runbooks across the phase (Backstage ops, mesh debug, secret rotation, RBAC audit, image-signing failure, PSS violation)
- 3+ postmortems (Half 1 + Half 2 incidents)
- 2+ ADRs (mesh choice; secrets approach)
- Weekly log
8. CONTRIBUTE
Backstage has tons of “good first issue” tickets. Service mesh projects (Istio, Linkerd) similarly. Sealed Secrets + ESO + Cosign all have welcoming communities.
Validation criteria (combined)
[ ] Half-1 validation passed
[ ] Half-2 validation passed
[ ] Backstage running, catalog populated, scaffolder working
[ ] Service mesh running with mTLS verified
[ ] Sealed Secrets + ESO in basecamp; raw Secrets gone
[ ] Cosign + policy-controller enforcing signed images
[ ] Pod Security restricted in tenant namespaces
[ ] platform-ctl private with 5+ commands working
[ ] 3 platform SLIs/SLOs live with burn-rate alerts
[ ] 8+ runbooks; 3+ postmortems; 2+ ADRs; 16+ weekly log entries
[ ] Pattern entries deepened to DEEP:
  - platform-as-product, service-mesh, secrets-lifecycle
  - least-privilege (now DEEP from K8s RBAC + IAM stacked)
  - defense-in-depth (DEEP: image signing + Pod Security + RBAC + NetPol + mTLS stacked)
  - zero-trust-networking, zero-trust-security
  - sli-slo-error-budget (OUTLINE → DEEP via real platform SLOs)
[ ] Phase Exit Test passed (combined or per-half)
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Backstage as documentation site only | Misses scaffolder + catalog power; just becomes a wiki |
| Service mesh “for everything” without measuring perf | Sidecar latency adds up |
| Raw K8s Secrets “we’ll fix later” | Later never comes; SealedSecrets is one helm install away |
| Signing images but not verifying | Half the security; verification at admission is non-negotiable |
| SLO theater (SLOs no one looks at) | The point is decision-making, not dashboards |
| Skipping platform-ctl ergonomics work | A platform without UX is just YAML |
Patterns deepened this phase
- platform-as-product → DEEP
- service-mesh → DEEP
- secrets-lifecycle → DEEP
- least-privilege → DEEP
- defense-in-depth → DEEP
- zero-trust-networking → DEEP
- zero-trust-security → DEEP
- sli-slo-error-budget → DEEP
Browse the full categories at patterns/infrastructure-and-platform/ and patterns/security/.