FinOps + Cost Engineering

Phase 29 of /root Year 3: cost as architecture. Cost attribution per tenant, reserved vs spot vs on-demand strategy, egress awareness, OpenCost. Apply FinOps discipline to basecamp's multi-cloud spend. 4-6 weeks, ~40-60 hours.

Thirteenth phase of Year 3. Cost as an engineering concern.

Most engineering teams reason about cost only after the first surprising bill. Senior engineers treat cost as part of the architecture, alongside latency, availability, and security. This phase installs that discipline. By phase end, basecamp has cost attribution per workload, cost dashboards across all three clouds, reserved and spot strategies applied where they earn their keep, and runbooks for “cost spike: investigate.”

This isn’t an accounting phase. It’s an engineering phase. The patterns transfer to every later employer where cost-of-infrastructure is a real number on a real spreadsheet.

Prerequisites

Phase 28 complete; observability operational (cost is just another metric)

12 hrs/week budget reserved

Why this phase exists

On-prem, you buy hardware once. In the cloud, you pay continuously. This changes architecture: idle compute is expensive, cross-AZ traffic is expensive, and data egress is very expensive. Architecture decisions made without cost awareness can produce 2-3× cost overruns at scale.

The pattern-first frame

Same eight steps.

1. PROBLEM

basecamp runs across K3s + EKS + GKE. Each charges differently for the same workload. You need to know where money goes (attribution), what’s optimizable (analysis), what trade-offs to make (architecture). And you want to do this without becoming a finance team.

2. PRINCIPLES

2.1 Cost attribution

You can’t optimize what you can’t attribute. Per-workload, per-team, per-environment cost visibility.

→ Pattern: finops

Investigate:

How does Kubernetes-native cost attribution work (per-namespace, per-pod)?
What does OpenCost give you that AWS Cost Explorer doesn’t?
Why is “showback” different from “chargeback”?
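One way to see what per-namespace attribution means in practice is to split a node's hourly price across pods in proportion to their CPU and memory requests, and to report the remainder as idle. The sketch below assumes a single node, a made-up hourly price, and an arbitrary 50/50 split of node cost between CPU and memory. It is not OpenCost's actual model, only the shape of the calculation.

```python
from collections import defaultdict

# All numbers below are hypothetical, chosen to make the arithmetic easy to follow.
NODE_HOURLY_COST = 0.20   # assumed price of one node, USD/hour
NODE_CPU_CORES = 4.0
NODE_MEM_GIB = 16.0
CPU_WEIGHT = 0.5          # assumed share of node cost assigned to CPU (rest is memory)

pods = [
    {"namespace": "api", "cpu": 1.0, "mem_gib": 2.0},
    {"namespace": "api", "cpu": 1.0, "mem_gib": 2.0},
    {"namespace": "batch", "cpu": 1.0, "mem_gib": 4.0},
]


def attribute(pods):
    """Return hourly cost per namespace, plus the unallocated (idle) remainder."""
    cpu_rate = NODE_HOURLY_COST * CPU_WEIGHT / NODE_CPU_CORES        # USD per core-hour
    mem_rate = NODE_HOURLY_COST * (1 - CPU_WEIGHT) / NODE_MEM_GIB    # USD per GiB-hour
    costs = defaultdict(float)
    for pod in pods:
        costs[pod["namespace"]] += pod["cpu"] * cpu_rate + pod["mem_gib"] * mem_rate
    # Whatever no pod requested is still paid for; surface it instead of hiding it.
    costs["__idle__"] = NODE_HOURLY_COST - sum(costs.values())
    return dict(costs)


if __name__ == "__main__":
    for namespace, cost in attribute(pods).items():
        print(f"{namespace:10s} ${cost:.4f}/hour")
```

Showback is publishing this table to teams. Chargeback is billing them for it. The explicit idle line is what makes either one honest.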

2.2 Reserved vs spot vs on-demand

Three pricing models per cloud. On-demand: flexible, expensive. Reserved (or Savings Plans): commit for a discount. Spot: cheap, can be evicted.

Investigate:

What workloads fit spot? (Hint: batch, stateless replicas, async work.)
When do Reserved Instances vs Savings Plans win?
What’s the right ratio of RIs to on-demand for a steady-state workload?
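The RI-to-on-demand ratio question reduces to arithmetic: a committed instance is paid for every hour, so it only wins if it is actually needed for more than (reserved rate / on-demand rate) of the hours. A minimal sketch, assuming hypothetical prices and a made-up daily demand curve:

```python
def breakeven_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of hours an instance must be in use for the commitment to win."""
    return reserved_hourly / on_demand_hourly


def fleet_cost(demand_by_hour, committed, on_demand_hourly, reserved_hourly):
    """Cost of serving hourly instance demand with `committed` reserved instances."""
    cost = 0.0
    for demand in demand_by_hour:
        cost += committed * reserved_hourly                     # paid whether used or not
        cost += max(0, demand - committed) * on_demand_hourly   # overflow runs on-demand
    return cost


def best_commitment(demand_by_hour, on_demand_hourly, reserved_hourly):
    """Brute-force the commitment level with the lowest total cost."""
    candidates = range(0, max(demand_by_hour) + 1)
    return min(
        candidates,
        key=lambda c: fleet_cost(demand_by_hour, c, on_demand_hourly, reserved_hourly),
    )


# Hypothetical prices and demand: 4 instances for 18 hours, 10 for a 6-hour peak.
ON_DEMAND = 0.10
RESERVED = 0.06
day = [4] * 18 + [10] * 6

if __name__ == "__main__":
    print("break-even utilization:", breakeven_utilization(ON_DEMAND, RESERVED))
    print("best commitment:", best_commitment(day, ON_DEMAND, RESERVED))
```

With these numbers the peak instances are needed only 25% of the time, below the 60% break-even, so committing to the baseline of 4 and paying on-demand for the peak is cheapest. Real Savings Plans commit to a dollar amount rather than an instance count, so treat this as intuition, not a pricing calculator.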

2.3 Cross-AZ + egress costs

Cross-availability-zone traffic costs money. Cross-region traffic costs more. Egress to the public internet costs the most. These are invisible to most architecture decisions until they bite.

Investigate:

AWS cross-AZ pricing — exact numbers?
Why is NAT Gateway $32/month just for being on?
When does CDN in front of egress pay for itself?
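The NAT Gateway number is just an hourly rate multiplied out. The sketch below makes that arithmetic explicit and puts the transfer categories side by side. The rates are illustrative assumptions (roughly what us-east-1 list prices have looked like); check the current price list for your region before using them in an ADR.

```python
HOURS_PER_MONTH = 730  # common billing approximation

# Assumed, illustrative rates in USD. Not authoritative; verify against current pricing.
RATES = {
    "nat_gateway_hourly": 0.045,
    "nat_processing_per_gb": 0.045,
    "cross_az_per_gb_each_way": 0.01,
    "internet_egress_per_gb": 0.09,
}


def nat_fixed_monthly(rates=RATES):
    """What one NAT Gateway costs per month before it moves a single byte."""
    return rates["nat_gateway_hourly"] * HOURS_PER_MONTH


def monthly_transfer_cost(cross_az_gb, nat_gb, egress_gb, rates=RATES):
    """Break a month's network bill into the categories from this section."""
    return {
        "nat_fixed": nat_fixed_monthly(rates),
        "nat_processing": nat_gb * rates["nat_processing_per_gb"],
        # Cross-AZ traffic is charged on both the sending and the receiving side.
        "cross_az": cross_az_gb * rates["cross_az_per_gb_each_way"] * 2,
        "internet_egress": egress_gb * rates["internet_egress_per_gb"],
    }


if __name__ == "__main__":
    bill = monthly_transfer_cost(cross_az_gb=1000, nat_gb=200, egress_gb=500)
    for item, usd in bill.items():
        print(f"{item:16s} ${usd:8.2f}")
    print(f"{'total':16s} ${sum(bill.values()):8.2f}")
```

Run it with basecamp's real traffic volumes from the egress audit. The category that dominates tells you which architectural change to try first.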

2.4 Idle as a cost category

Idle compute is often the largest source of cost waste in cloud-native deployments: workloads are provisioned for peak but run at something like 15% average utilization.

Investigate:

What’s the right utilization target for a fleet?
How does autoscaling (HPA, VPA, cluster autoscaler) help?
When does “rightsizing” become “underprovisioning”?
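Two small calculations make the idle discussion concrete: how much of the bill is idle at a given utilization, and how many replicas the HPA would ask for at a given target. The replica formula is the one documented for the Kubernetes Horizontal Pod Autoscaler. The real controller also applies a tolerance band, min/max replicas, and stabilization windows, which this sketch leaves out. The dollar figures are hypothetical.

```python
import math


def idle_cost(monthly_cost, avg_utilization):
    """Portion of the monthly bill paying for capacity nobody used."""
    return monthly_cost * (1 - avg_utilization)


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Core HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


if __name__ == "__main__":
    # Hypothetical fleet: $1000/month at 15% average utilization.
    print("idle spend:", idle_cost(1000, 0.15))
    # 10 replicas averaging 15% CPU against a 60% target.
    print("replicas at 60% target:", hpa_desired_replicas(10, 15, 60))
    # The same workload at a 90% peak: the target leaves headroom to scale back up.
    print("replicas at peak:", hpa_desired_replicas(3, 90, 60))
```

The last call is the rightsizing-versus-underprovisioning question in miniature: the higher the utilization target, the less room there is to absorb a spike before new replicas are ready.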

2.5 Node autoscaling with Karpenter

Karpenter is the K8s-native node autoscaler (Phase 20 deployed it). At FinOps scale, Karpenter’s NodePool CRDs are how you express cost-optimization policies declaratively: “use spot instances when available, fall back to on-demand,” “prefer arm64 for compute-bound workloads,” “consolidate nodes when underutilized.”

→ Pattern: karpenter-as-finops-tool (or just reinforcement of finops + operator-pattern)

Investigate:

Walk a Karpenter NodePool CRD: declare allowed instance types + spot preference + consolidation policy → controller provisions optimally.
Why does Karpenter beat the older cluster-autoscaler for cost optimization? (Hint: instance-type flexibility per pending pod, not static node groups.)
What’s “consolidation,” and when does it save real money?
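To walk a NodePool, it helps to have one in front of you. The sketch below builds a manifest as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. Field names follow the karpenter.sh/v1 schema as I understand it. The pool name, the EC2NodeClass name "default", the CPU limit, and the 5-minute consolidation delay are all assumptions. Verify the schema against the Karpenter version Phase 20 installed before applying anything.

```python
import json


def spot_first_nodepool(name="finops-default", node_class="default", cpu_limit="100"):
    """Build a NodePool that allows spot and on-demand and consolidates idle nodes.

    All argument defaults are hypothetical placeholders.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumes an EC2NodeClass with this name already exists (EKS only).
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        # Allowing both lets the controller use spot when it is available
                        # and fall back to on-demand when it is not.
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                        # Allowing both architectures widens the instance pool; it does
                        # not by itself express a preference for arm64.
                        {
                            "key": "kubernetes.io/arch",
                            "operator": "In",
                            "values": ["arm64", "amd64"],
                        },
                    ],
                }
            },
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "5m",
            },
            # Hard ceiling on what this pool may provision: a cost guardrail.
            "limits": {"cpu": cpu_limit},
        },
    }


manifest = spot_first_nodepool()

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

Every field here is a cost decision written down: which capacity types are acceptable, how aggressively to repack nodes, and how much the pool may ever spend. That makes the NodePool a natural artifact to reference from the spot-strategy ADR.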

3. TRADE-OFFS

Decision	Options	Trade-off
Pricing model	On-demand only; RIs/Savings Plans; mixed	On-demand: flexibility, premium. RIs/Savings Plans: discount, commitment risk. Mixed: commit to the baseline, pay on-demand for the rest.
Spot strategy	All-spot; mixed; none	All-spot: cheapest, eviction risk. Mixed: pragmatic. None: stable, expensive.
Attribution tool	OpenCost; Kubecost; cloud-native (Cost Explorer + tags); none	OpenCost/Kubecost: K8s-native. Cloud-native: integrated, gaps for K8s. None: blind.

4. TOOLS (as of 2026-06)

OpenCost (CNCF Incubating)
Kubecost (commercial product built on OpenCost)
AWS Cost Explorer + tagging discipline
GCP Billing Console
Infracost — predict TF cost changes in PRs

Reading

“Cloud FinOps” (J.R. Storment + Mike Fuller)
AWS Well-Architected Cost Optimization pillar
OpenCost docs

5. MASTERY: FinOps on basecamp

[ ] OpenCost deployed on basecamp K3s
[ ] Per-namespace cost attribution working
[ ] AWS Cost Explorer + GCP Billing reports reviewed weekly
[ ] Tagging standard applied: env, owner, service, cost-center
[ ] Reserved Instances or Savings Plans applied where steady-state workloads warrant
[ ] Spot pool configured for at least one batch workload
[ ] Egress costs audited; one architectural change reduced them measurably
[ ] HPA + VPA enabled for production services (not both scaling the same workload on the same CPU/memory metric)
[ ] Cost-spike alert wired (e.g., daily spend > 2× rolling average)
[ ] Infracost in CI to predict TF cost impact
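The cost-spike check in the list above is simple enough to prototype before wiring it into an alerting system. A minimal sketch, assuming a list of daily spend totals (oldest first), a 7-day window, and the 2× factor from the checklist. The function name and defaults are illustrative.

```python
from statistics import mean


def is_cost_spike(daily_spend, window=7, factor=2.0):
    """True if the latest day exceeds `factor` x the mean of the preceding `window` days."""
    if len(daily_spend) < window + 1:
        return False  # not enough history to form a baseline
    *history, today = daily_spend[-(window + 1):]
    baseline = mean(history)
    return baseline > 0 and today > factor * baseline


if __name__ == "__main__":
    steady = [10.0] * 7
    print(is_cost_spike(steady + [25.0]))  # well above 2x the baseline
    print(is_cost_spike(steady + [19.0]))  # elevated, but under the threshold
```

Excluding today from the baseline matters: if the spike is included in its own average, a large enough spike raises the threshold it is being compared against.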

6. COMPARE: Kubecost

Install Kubecost (commercial; free tier exists); compare its insights vs OpenCost.

400-word reflection.

7. OPERATE

3-4 runbooks: cost spike, idle resources, RI utilization low, autoscaling broken
1-2 ADRs (RI commitment level; spot strategy)
Weekly log

8. CONTRIBUTE

OpenCost docs or providers
Infracost docs
A blog post on a real cost-optimization story

What ships from this phase

Cost dashboards for basecamp multi-cloud
Cost runbooks
Tagging discipline as a documented standard

Validation criteria

[ ] Per-namespace cost attribution working on basecamp
[ ] Reserved Instances or Savings Plans applied where appropriate
[ ] Spot used for at least one workload
[ ] Egress audit + one reduction made
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 3-4 cost runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
    - finops → OUTLINE
[ ] Exit Test passed

Exit Test

Time: 2 hours.

Part 1: Analyze (60 min)

Given basecamp’s cost dashboard, identify the top 3 optimization opportunities. Propose specific changes with estimated savings.

Part 2: Articulate (60 min)

~1000 words: “Walk basecamp’s cost structure. Top 3 line items per cloud. Where the architecture decisions drove cost. What you’d change with infinite optimization budget vs limited.”

Anti-patterns

Anti-pattern	Why
“Just turn off the dev environment at night” without automation	Won’t happen consistently
Reserved Instances for workloads that scale wildly	RIs cover only the committed baseline; usage above it is billed on-demand, and commitment above actual usage is wasted
Untagged resources	Untagged means unattributable, which means unoptimizable
Ignoring egress until the bill	Egress is often the largest hidden cost in a deployment
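The untagged-resources anti-pattern is the easiest one to catch mechanically. The sketch below checks resources against the four tags from the tagging standard (env, owner, service, cost-center). The input shape, a dict of resource ID to tag dict, is an assumption; in practice it would be fed from a cloud inventory export or a Terraform plan.

```python
REQUIRED_TAGS = ("env", "owner", "service", "cost-center")


def missing_tags(resource_tags):
    """Return the required tags that are absent or blank on one resource."""
    return [tag for tag in REQUIRED_TAGS if not resource_tags.get(tag, "").strip()]


def audit(resources):
    """Map resource ID -> missing tags, listing only non-compliant resources."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical inventory for illustration.
    inventory = {
        "i-aaa": {"env": "prod", "owner": "platform", "service": "api", "cost-center": "eng"},
        "i-bbb": {"env": "dev", "owner": ""},
        "vol-ccc": {},
    }
    for resource_id, gaps in audit(inventory).items():
        print(f"{resource_id}: missing {', '.join(gaps)}")
```

Run as a CI gate or a weekly job, this turns the tagging standard from a document into something that fails loudly.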

Patterns touched this phase

finops — OUTLINE

→ Next: Phase 30: Reliability Engineering