AWS Deep Dive

Third phase of Year 2. AWS as the most-deployed cloud. IAM, VPC, EKS, RDS, S3, CloudWatch — at depth. Budget ~$50. ~8 weeks, ~90 hrs.

Phase 10 is the first time in ROOT that someone else’s hardware bills you for a mistake. That’s not a bug; it’s the lesson. Cloud isn’t “rented servers”; it’s a category of utility computing with its own primitives, its own failure modes, and its own cost-surprise stories. Learning AWS in depth is the fastest path to a mental model of cloud as a category, which Phase 11 cements via comparison.

The shape of this phase is intentional: you already wrote the terralabs aws-* modules in Phase 9, so AWS isn’t introduced through Console click-paths. It’s introduced through declarative modules you wrote yourself, deployed to a Free Tier account with budget alerts at $1 / $5 / $25 / $50, and torn down at the end of every session. The discipline of “destroy at end of session, every time” is not optional and it’s not a tip — it’s a Year 2 lesson, and your first real bill is your first real cost incident. Treat it like one.

The deeper goal is bigger than AWS itself: identify what’s primitive (compute, storage, identity, network, observability) versus what’s marketing (200+ services, mostly variations on those five). When Phase 11 maps the same shape to GCP, the patterns transfer in a week — but only if you learned shape, not service names.

Prerequisites

Phase 9 complete — terralabs ships AWS modules

AWS Free Tier account ready (12 months free, with budget alerts at $1 / $5 / $25 / $50)

You accept: AWS is one cloud implementation. Don’t memorize service names — learn the shape (compute / storage / identity / network / observability) so GCP/Azure are 90% pattern-transfer.

Why this phase exists

AWS is the most-deployed cloud, so a large share of production platforms run on it. Year 4’s GPU work uses AWS spot. Most LLM infrastructure tutorials assume AWS primitives. Knowing AWS deeply is a Year 2 graduation requirement.

But the deeper goal: understand cloud as a category. What’s primitive (compute, storage, identity, network)? What’s marketing (200+ services, mostly variations)? What patterns survive whichever cloud you’re forced to use next?

1. PROBLEM

You need infrastructure that scales beyond the homelab: managed K8s, scalable object storage, hosted databases, identity at scale. AWS is the most mature implementation. You’ll learn it via terralabs (you already wrote the modules in Phase 9), plus the AWS Console and the AWS CLI.

The discipline: destroy at end of every session. Bills add up otherwise.

2. PRINCIPLES

2.1 IAM: identity at scale

The hardest part of AWS. Users, roles, policies, trust relationships, STS, federation.

→ Pattern: least-privilege
→ Pattern: defense-in-depth

Investigate:

Build an IAM policy that allows S3 read on one bucket only — write it from scratch (no iamlive, no copy-paste; you should be able to read the JSON without flinching).
IAM role vs IAM user — when each.
AssumeRole + STS — service-to-service auth without long-lived credentials.
IRSA (IAM Roles for Service Accounts) — how an EKS pod gets AWS perms via OIDC. This is the load-bearing primitive for Phase 12 supply-chain security.
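
As a reference point for the first item above, here is a minimal sketch of what a "read one bucket only" identity policy looks like, written as a Python dict so it can be checked mechanically. The bucket name is hypothetical, and `decide` is a deliberately simplified toy: it models only the core rule for a single identity policy (explicit Deny beats Allow, and no match means implicit deny) and ignores SCPs, permission boundaries, session policies, resource policies, and conditions. Write your own policy first, then compare.

```python
import json
from fnmatch import fnmatchcase

BUCKET = "example-basecamp-backups"  # hypothetical bucket name

# Two statements are needed because listing targets the bucket ARN,
# while reading objects targets the object ARNs under it.
READ_ONE_BUCKET = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListTheBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}


def _as_list(value):
    """IAM accepts either a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def decide(policy, action, resource):
    """Toy evaluation of ONE identity policy (assumption: nothing else applies)."""
    allowed = False
    for stmt in policy["Statement"]:
        # Action names are matched case-insensitively; ARNs are case-sensitive.
        action_hit = any(
            fnmatchcase(action.lower(), a.lower()) for a in _as_list(stmt["Action"])
        )
        resource_hit = any(
            fnmatchcase(resource, r) for r in _as_list(stmt["Resource"])
        )
        if action_hit and resource_hit:
            if stmt["Effect"] == "Deny":
                return "explicit-deny"  # an explicit Deny always wins
            allowed = True
    return "allow" if allowed else "implicit-deny"


print(json.dumps(READ_ONE_BUCKET, indent=2))
```

Calling `decide(READ_ONE_BUCKET, "s3:PutObject", f"arn:aws:s3:::{BUCKET}/x")` returns `"implicit-deny"`: nothing forbids the write, but nothing allows it either, which is the default you want.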

2.2 VPC: network at scale

Subnets, route tables, security groups, NACLs, NAT gateways, VPC peering, endpoints.

Investigate:

Why public + private subnets? What goes where? (Hint: NAT gateway in public, workloads in private, RDS in isolated.)
Security Group vs NACL: when each. (SG: stateful, attached at the instance/ENI level, allow rules only. NACL: stateless, subnet-level, ordered allow and deny rules. Both are evaluated, and traffic must pass both.)
VPC endpoints — keep S3 traffic inside AWS network (cost + security).
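
To make the public / private / isolated split concrete, the sketch below carves one VPC range into one subnet per tier per availability zone using Python's standard `ipaddress` module. The CIDR, the `/20` subnet size, and the AZ list are illustrative assumptions, not values taken from terralabs; the routing notes restate the hint above.

```python
import ipaddress

VPC_CIDR = "10.0.0.0/16"                          # hypothetical VPC range
AZS = ["us-west-2a", "us-west-2b", "us-west-2c"]  # illustrative AZ names
TIERS = ["public", "private", "isolated"]

# What each tier's default route is meant to point at.
ROUTE_INTENT = {
    "public": "internet gateway (NAT gateways and load balancers live here)",
    "private": "NAT gateway in the public tier (workloads live here)",
    "isolated": "no default route (RDS lives here)",
}


def plan_subnets(vpc_cidr, azs, tiers, new_prefix=20):
    """Assign consecutive, non-overlapping blocks: one per (tier, AZ) pair."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan


plan = plan_subnets(VPC_CIDR, AZS, TIERS)
for (tier, az), block in plan.items():
    print(f"{tier:<9} {az}  {block}  -> {ROUTE_INTENT[tier]}")
```

A `/16` split into `/20` blocks yields 16 subnets, so this layout uses 9 and leaves 7 spare for later tiers.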

2.3 EKS: managed K8s

EKS is K8s with AWS doing the control plane. You still operate the workers + addons.

Investigate:

EKS control plane HA — what AWS gives you (and what they don’t).
Node groups (managed vs self-managed) — managed for ergonomics, self-managed for full control over the AMI.
AWS Load Balancer Controller for Ingress.
IRSA for pod-level AWS access — wire one pod to one S3 bucket and trace every IAM check that fires (this is the Exit Test scenario).
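
The IRSA wiring comes down to two documents: a trust policy on the IAM role that names the cluster's OIDC provider and one specific ServiceAccount, and an annotation on that ServiceAccount pointing back at the role. The sketch below builds both as Python dicts. The account ID, OIDC provider ID, role name, namespace, and ServiceAccount name are all placeholder assumptions; substitute the values from your own cluster.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account ID
# Placeholder OIDC issuer, written without the https:// prefix.
OIDC_PROVIDER = "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy letting exactly one ServiceAccount assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pins the role to one namespace + ServiceAccount.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        # Pins the token audience to STS.
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


def service_account_manifest(namespace, name, role_arn):
    """ServiceAccount carrying the annotation that links it to the IAM role."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


trust = irsa_trust_policy(ACCOUNT_ID, OIDC_PROVIDER, "basecamp", "s3-reader")
sa = service_account_manifest(
    "basecamp", "s3-reader", f"arn:aws:iam::{ACCOUNT_ID}:role/basecamp-s3-reader"
)
print(json.dumps(trust, indent=2))
print(json.dumps(sa, indent=2))
```

When tracing the checks, note that the `sub` condition is the only thing stopping every other pod in the cluster from assuming the same role; a trust policy without it is the IRSA equivalent of `*`.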

2.4 RDS / Aurora: managed Postgres

Single-AZ, multi-AZ, read replicas, blue/green deployments, snapshots.

Investigate:

Multi-AZ vs read replica — different goals (multi-AZ: HA; read replica: scale reads).
pg_basebackup-equivalent in RDS — can you self-restore?
Aurora vs vanilla Postgres on RDS — cost + perf trade-offs.

2.5 S3: object storage

Buckets, prefixes, storage classes, lifecycle policies, versioning, replication.

Investigate:

S3 consistency model (now strong read-after-write since 2020 — and why that one change made Iceberg on S3 viable).
Storage classes — Standard / IA / Glacier — when each.
S3 lifecycle for old basecamp Postgres backups (auto-tier to Glacier after 90 days).
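
A lifecycle rule for the backup case can be sketched as the dict shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The prefix and rule ID here are hypothetical, and the 30-day Standard-IA step comes from the operational checklist later in this phase rather than from the bullet above. The block only builds and validates the configuration; it makes no AWS calls.

```python
import json


def backup_lifecycle(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle config: Standard -> Standard-IA -> Glacier."""
    # S3 does not transition objects to Standard-IA before 30 days.
    if ia_after_days < 30:
        raise ValueError("Standard-IA transition needs at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-old-postgres-backups",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},       # only objects under this prefix
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = backup_lifecycle("postgres/")  # hypothetical backup prefix
print(json.dumps(config, indent=2))
```

In terralabs the same rule would normally be expressed declaratively; the dict form is useful mainly for seeing which fields the rule actually has before you write the module.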

2.6 Observability primitives

CloudWatch (logs + metrics), X-Ray (tracing), CloudTrail (audit logs).

Investigate:

CloudWatch metrics retention + cost.
CloudTrail — what’s logged, what’s not.
“Aggregator” patterns — push to your own Loki/Prometheus instead of paying CloudWatch retention. (basecamp Tier 1 already runs Prom + Grafana; this is the same pattern, scaled to cloud.)

2.7 Cost: the discipline

The biggest difference between cloud-comfortable and cloud-burned engineers: cost discipline.

Investigate:

AWS Pricing Calculator before deploying anything.
Budget alerts at $1 / $5 / $25 / $50, so spend never grows silently.
Cost Explorer + tags + the “destroy at end of session” habit.
Why does egress cost more than ingress? (You’re stuck once your data lands. That’s the business model — don’t let it become your model by accident.)
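
The arithmetic behind "destroy at end of session" is worth doing once by hand. The sketch below uses the $0.10/hr EKS figure quoted in the anti-patterns table and the $1 / $5 / $25 / $50 alert thresholds from the account setup; the 63-hour weekend is an assumed example, and the calculation covers the cluster fee only, so worker nodes, NAT gateways, and RDS would add to it.

```python
from decimal import Decimal

ALERT_THRESHOLDS = (1, 5, 25, 50)            # USD, from the budget-alert setup
EKS_CLUSTER_HOURLY = Decimal("0.10")         # USD/hr, figure quoted in this phase


def idle_cost(hourly_rate, hours):
    """Cost of leaving one hourly-billed resource running."""
    return hourly_rate * hours


def alerts_crossed(spend, thresholds=ALERT_THRESHOLDS):
    """Which budget alerts would have fired at this spend level."""
    return [t for t in thresholds if spend >= t]


# Assumed example: cluster left up from Friday 18:00 to Monday 09:00.
weekend_hours = 24 + 24 + 15
weekend_cost = idle_cost(EKS_CLUSTER_HOURLY, weekend_hours)

print(f"forgotten weekend: ${weekend_cost}")
print(f"alerts fired: {alerts_crossed(weekend_cost)}")
```

One forgotten weekend of cluster fee alone trips the $1 and $5 alerts, which is more than a tenth of the phase budget gone before any node or database hours are counted.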

3. TRADE-OFFS

Decision	Option A	Option B	Option C / When
Compute	EC2 (raw)	EKS (K8s)	Fargate (serverless containers)
State	RDS (managed Postgres)	DynamoDB (managed NoSQL)	self-host on EC2
Compute scaling	manual	ASG	EKS HPA
Observability	CloudWatch native	self-hosted (Loki + Prom)	Native: easy. Self-host: cheaper at scale

4. TOOLS (as of Q1 2026)

AWS CLI v2
eksctl — quick EKS bootstrap (use it once; then go via TF)
aws-vault — local credential isolation (never store keys in ~/.aws/credentials)
session-manager-plugin — SSH-less EC2 access via IAM
Cost Explorer + Budgets — set up day 1
terralabs aws-* modules from Phase 9

5. MASTERY

5.1 Reading list

Required	Why
AWS Well-Architected Framework — 6 pillars	The framing
AWS IAM docs — Policy evaluation logic	The hardest part
EKS documentation — full read	Real ops shape
Amazon Web Services in Action (Wittig & Wittig, latest edition)	Practical depth

Recommended	Why
Amazon Builders’ Library articles	Real architectures from inside AWS
Corey Quinn’s Last Week in AWS newsletter	Cost + cynicism

5.2 Operational depth checklist

[ ] Set up AWS Free Tier account; budget alerts at $1/$5/$25/$50; root MFA on; org SCP if you're feeling fancy
[ ] Use terralabs to provision a VPC + EKS + RDS Postgres in us-west-2; verify
[ ] Configure IRSA: K8s ServiceAccount → IAM Role → S3 bucket access from a pod
[ ] Deploy AWS Load Balancer Controller; expose a service via ALB Ingress
[ ] Configure RDS multi-AZ; force failover via the console; observe behavior
[ ] Set up S3 lifecycle: old objects → Standard-IA after 30 days → Glacier after 90
[ ] Use CloudTrail to identify what your account did in the last 7 days
[ ] Build an IAM policy from scratch: read-only access to one S3 bucket, no other perms
[ ] Hit a cost surprise; identify cause via Cost Explorer; remediate
[ ] **Destroy** the EKS + RDS at end of phase; verify $0 ongoing spend

5.3 basecamp expansion

By phase end, basecamp can target an EKS cluster as well as homelab K3s. The same Helm charts deploy to both. ArgoCD ApplicationSet handles per-cluster overlays — the same primitive that will carry into Phase 11 when GKE becomes a third target and into Phase 13 when all three run side-by-side.

6. COMPARE: AWS vs DIY

You could run the same workloads on bare metal or rented dedicated servers. When does AWS earn its margin? When does self-host win?

400 words.

7. OPERATE

4+ runbooks (aws-cost-spike-investigation, iam-debug-access-denied, eks-node-group-recovery, rds-failover-drill)
2+ postmortems (you WILL hit a cost surprise; you WILL hit an IAM error)
1+ ADR (e.g., “Why us-west-2 over us-east-1”)
Weekly log

8. CONTRIBUTE

AWS-adjacent OSS — eksctl, aws-load-balancer-controller, aws-cli, Terraform AWS provider, kube2iam.

Validation criteria

[ ] All 10 operational depth checks
[ ] basecamp targets EKS via terralabs + ArgoCD
[ ] AWS-vs-DIY comparison written up
[ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
[ ] Total AWS spend this phase: <$50
[ ] Pattern entries deepened:
    - least-privilege → DEEP (after IAM depth)
    - defense-in-depth → reinforced
    - threat-modeling → first deepening (STUB → OUTLINE) via IAM policy work
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Build (90 min) — given a fresh AWS account, use terralabs to provision EKS + RDS + S3, configure IRSA so a pod can read from S3, deploy basecamp app-of-apps.
Debug (60 min) — scenario from Phase 10 catalog (IAM “Access Denied” with no obvious cause; RDS connection timeout; cost spike).
Articulate (30 min) — 600 words: “Walk through every IAM check that fires when an EKS pod with IRSA reads from S3.”

Anti-patterns

Anti-pattern	Why
Storing AWS access keys in `~/.aws/credentials`	Use `aws-vault` or SSO
`*` in IAM policies	An incident waiting to happen; least-privilege is one rewrite away, but you’ll forget to do it
Forgetting to destroy EKS at end of session	$0.10/hr × forgotten weekend = ouch
Trusting Console clicks instead of IaC	Drift you can’t reproduce
Using `iamlive` to generate policies and never auditing	Let a tool write your perms from whatever calls it observed and you’ve defeated least-privilege

Patterns deepened this phase

least-privilege → DEEP
defense-in-depth → reinforced
threat-modeling → OUTLINE

Browse the full category at patterns/security/.

→ Next: Phase 11: GCP + Cloud-Agnostic