AWS Deep Dive
Third phase of Year 2. AWS as the most-deployed cloud. IAM, VPC, EKS, RDS, S3, CloudWatch — at depth. Budget ~$50. ~8 weeks, ~90 hrs.
Phase 10 is the first time in ROOT that someone else’s hardware bills you for a mistake. That’s not a bug; it’s the lesson. Cloud isn’t “rented servers”; it’s a category of utility computing with its own primitives, its own failure modes, and its own cost-surprise stories. Learning AWS in depth is the fastest path to a mental model of cloud as a category, which Phase 11 cements via comparison.
The shape of this phase is intentional: you already wrote the terralabs aws-* modules in Phase 9, so AWS isn’t introduced through Console click-paths. It’s introduced through declarative modules you wrote yourself, deployed to a Free Tier account with budget alerts at $1 / $5 / $25 / $50, and torn down at the end of every session. The discipline of “destroy at end of session, every time” is not optional and it’s not a tip — it’s a Year 2 lesson, and your first real bill is your first real cost incident. Treat it like one.
The deeper goal is bigger than AWS itself: identify what’s primitive (compute, storage, identity, network, observability) versus what’s marketing (200+ services, mostly variations on those five). When Phase 11 maps the same shape to GCP, the patterns transfer in a week — but only if you learned shape, not service names.
Prerequisites
- Phase 9 complete — terralabs ships AWS modules
- AWS Free Tier account ready (12 months free, with budget alerts at $1 / $5 / $25 / $50)
- You accept: AWS is one cloud implementation. Don’t memorize service names — learn the shape (compute / storage / identity / network / observability) so GCP/Azure are 90% pattern-transfer.
Why this phase exists
Most production platforms run on AWS. Year 4’s GPU work uses AWS spot. Most LLM infrastructure tutorials assume AWS primitives. Knowing AWS deeply is a Year 2-graduation requirement.
But the deeper goal: understand cloud as a category. What’s primitive (compute, storage, identity, network)? What’s marketing (200+ services, mostly variations)? What patterns survive whichever cloud you’re forced to use next?
1. PROBLEM
You need infrastructure that scales beyond the homelab: managed K8s, scalable object storage, hosted databases, identity at scale. AWS is the most mature implementation. You’ll learn it via terralabs (you already wrote the modules in Phase 9), the AWS Console, and the AWS CLI.
The discipline: destroy at end of every session. Bills add up otherwise.
2. PRINCIPLES
2.1 IAM: identity at scale
The hardest part of AWS. Users, roles, policies, trust relationships, STS, federation.
→ Pattern: least-privilege → Pattern: defense-in-depth
Investigate:
- Build an IAM policy that allows S3 read on one bucket only. Write it from scratch (no `iamlive`, no copy-paste; you should be able to read the JSON without flinching).
- IAM role vs IAM user: when to use each.
- AssumeRole + STS — service-to-service auth without long-lived credentials.
- IRSA (IAM Roles for Service Accounts) — how an EKS pod gets AWS perms via OIDC. This is the load-bearing primitive for Phase 12 supply-chain security.
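The first item in the list above can be sketched as a small Python helper that builds the policy document. This is a minimal sketch, not a complete answer: the bucket name `basecamp-backups-example` and the function name are hypothetical, and the sketch assumes that "read" means listing the bucket plus fetching objects (no versioned reads, no KMS permissions).

```python
import json


def s3_read_only_policy(bucket: str) -> dict:
    """Build an identity policy that allows read access to one bucket only."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListOneBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # Reading is an object-level action, so it targets object ARNs.
                "Sid": "ReadObjectsInOneBucket",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = s3_read_only_policy("basecamp-backups-example")
print(json.dumps(policy, indent=2))
```

The split into two statements is the part worth internalizing: bucket-level and object-level actions take different resource ARNs, and mixing them up is a common source of "Access Denied" with no obvious cause.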
2.2 VPC: network at scale
Subnets, route tables, security groups, NACLs, NAT gateways, VPC peering, endpoints.
Investigate:
- Why public + private subnets? What goes where? (Hint: NAT gateway in public, workloads in private, RDS in isolated.)
- Security Group vs NACL: when to use each. (SG: stateful, attached at the instance/ENI level, allow rules only. NACL: stateless, subnet-level, explicit allow and deny rules. Both are evaluated; either can block traffic.)
- VPC endpoints — keep S3 traffic inside AWS network (cost + security).
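The public / private / isolated layout from the hint above can be planned on paper before any Terraform runs. The sketch below uses only Python's standard `ipaddress` module; the VPC CIDR `10.0.0.0/16`, the `/20` subnet size, and the function name `plan_subnets` are assumptions chosen for illustration, not values taken from terralabs.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list[str], new_prefix: int = 20) -> dict[str, dict[str, str]]:
    """Carve one subnet per tier per AZ out of the VPC CIDR, in order."""
    tiers = ("public", "private", "isolated")
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan: dict[str, dict[str, str]] = {tier: {} for tier in tiers}
    for tier in tiers:
        for az in azs:
            plan[tier][az] = str(next(blocks))
    return plan


# Assumed CIDR and AZ names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["us-west-2a", "us-west-2b"])
for tier, by_az in plan.items():
    print(tier, by_az)
```

Under these assumptions the NAT gateway would sit in a `public` subnet, workloads in `private`, and RDS in `isolated`; what makes each tier different is its route table, not its CIDR.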
2.3 EKS: managed K8s
EKS is K8s with AWS doing the control plane. You still operate the workers + addons.
Investigate:
- EKS control plane HA — what AWS gives you (and what they don’t).
- Node groups (managed vs self-managed) — managed for ergonomics, self-managed for full control over the AMI.
- AWS Load Balancer Controller for Ingress.
- IRSA for pod-level AWS access — wire one pod to one S3 bucket and trace every IAM check that fires (this is the Exit Test scenario).
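For the IRSA item above, the piece that ties a Kubernetes ServiceAccount to an IAM role is the role's trust policy. A minimal sketch of that document follows, assuming the cluster's OIDC provider is already registered in IAM; the account ID, OIDC issuer ID, namespace, and ServiceAccount name below are all placeholders.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str, namespace: str, service_account: str) -> dict:
    """Build a role trust policy that lets exactly one ServiceAccount assume the role.

    oidc_provider is the cluster's OIDC issuer URL without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The principal is the IAM OIDC provider, not a user or a service.
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to one namespace and one ServiceAccount.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder values, for illustration only.
trust = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0000000000000000000000000",
    namespace="basecamp",
    service_account="s3-reader",
)
print(json.dumps(trust, indent=2))
```

On the Kubernetes side, the ServiceAccount carries the `eks.amazonaws.com/role-arn` annotation pointing at this role. Dropping the `sub` condition would let any ServiceAccount in the cluster assume the role, which is the mistake worth tracing on purpose.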
2.4 RDS / Aurora: managed Postgres
Single-AZ, multi-AZ, read replicas, blue/green deployments, snapshots.
Investigate:
- Multi-AZ vs read replica — different goals (multi-AZ: HA; read replica: scale reads).
- pg_basebackup-equivalent in RDS — can you self-restore?
- Aurora vs vanilla Postgres on RDS — cost + perf trade-offs.
2.5 S3: object storage
Buckets, prefixes, storage classes, lifecycle policies, versioning, replication.
Investigate:
- S3 consistency model (now strong read-after-write since 2020 — and why that one change made Iceberg on S3 viable).
- Storage classes — Standard / IA / Glacier — when each.
- S3 lifecycle for old basecamp Postgres backups (auto-tier to Glacier after 90 days).
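The lifecycle item above can be sketched as the configuration document S3 expects. This is a sketch under assumptions: the `backups/postgres/` prefix and the rule ID are hypothetical, and the 30-day Standard-IA step is borrowed from this phase's operational checklist rather than from the bullet itself. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the call is left as a comment so the block runs without credentials.

```python
def backup_lifecycle(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that tiers old objects under one prefix."""
    if ia_after_days >= glacier_after_days:
        raise ValueError("Standard-IA transition must come before the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "tier-old-backups",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = backup_lifecycle("backups/postgres/")  # hypothetical prefix
print(config)

# Applying it would look roughly like this (bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="basecamp-backups-example",
#     LifecycleConfiguration=config,
# )
```

In practice the same rule would more likely live in the terralabs S3 module; the Python form is only meant to make the structure of a rule (filter, status, ordered transitions) easy to read.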
2.6 Observability primitives
CloudWatch (logs + metrics), X-Ray (tracing), CloudTrail (audit logs).
Investigate:
- CloudWatch metrics retention + cost.
- CloudTrail — what’s logged, what’s not.
- “Aggregator” patterns — push to your own Loki/Prometheus instead of paying CloudWatch retention. (basecamp Tier 1 already runs Prom + Grafana; this is the same pattern, scaled to cloud.)
2.7 Cost: the discipline
The biggest difference between cloud-comfortable and cloud-burned engineers: cost discipline.
Investigate:
- AWS Pricing Calculator before deploying anything.
- Budget alerts at $1 / $5 / $25 / $50, so spend never grows silently.
- Cost Explorer + tags + the “destroy at end of session” habit.
- Why does egress cost more than ingress? (You’re stuck once your data lands. That’s the business model — don’t let it become your model by accident.)
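The arithmetic behind the "destroy at end of session" habit is small enough to sketch. The example below assumes the $0.10/hr figure this phase quotes for a forgotten EKS cluster, a 48-hour weekend, and the $1 / $5 / $25 / $50 alert thresholds from the phase introduction; the function names are hypothetical. It uses `Decimal` so that money values compare exactly.

```python
from decimal import Decimal

# Alert thresholds in USD, as set in the phase introduction.
BUDGET_ALERTS = [Decimal("1"), Decimal("5"), Decimal("25"), Decimal("50")]


def idle_cost(hourly_rate: Decimal, hours: int) -> Decimal:
    """Cost of leaving one hourly-billed resource running for the given hours."""
    return hourly_rate * hours


def alerts_crossed(spend: Decimal, thresholds: list[Decimal] = BUDGET_ALERTS) -> list[Decimal]:
    """Return every alert threshold the current spend has reached."""
    return [t for t in thresholds if spend >= t]


# Assumption: 48 idle hours at the $0.10/hr rate quoted in this phase.
weekend = idle_cost(Decimal("0.10"), 48)
print(f"forgotten weekend: ${weekend}, alerts fired: {alerts_crossed(weekend)}")
```

That rate covers only the one resource it is quoted for. Worker nodes, NAT gateways, and RDS instances bill on top of it, which is why the $1 alert is worth having even on a $50 budget.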
3. TRADE-OFFS
| Decision | Option A | Option B | Option C / when |
|---|---|---|---|
| Compute | EC2 (raw) | EKS (K8s) | Fargate (serverless containers) |
| State | RDS (managed Postgres) | DynamoDB (managed NoSQL) | self-host on EC2 |
| Compute scaling | manual | ASG | EKS HPA |
| Observability | CloudWatch native | self-hosted (Loki + Prom) | Native: easy. Self-host: cheaper at scale |
4. TOOLS (as of Q1 2026)
- AWS CLI v2
- `eksctl`: quick EKS bootstrap (use it once; then go via TF)
- `aws-vault`: local credential isolation (never store keys in `~/.aws/credentials`)
- `session-manager-plugin`: SSH-less EC2 access via IAM
- Cost Explorer + Budgets: set up day 1
- terralabs `aws-*` modules from Phase 9
5. MASTERY
5.1 Reading list
| Required | Why |
|---|---|
| AWS Well-Architected Framework — 6 pillars | The framing |
| AWS IAM docs — Policy evaluation logic | The hardest part |
| EKS documentation — full read | Real ops shape |
| Amazon Web Services in Action (Wittig & Wittig, latest edition) | Practical depth |
| Recommended | Why |
|---|---|
| Amazon Builders’ Library articles | Real architectures from inside AWS |
| Corey Quinn’s Last Week in AWS newsletter | Cost + cynicism |
5.2 Operational depth checklist
- [ ] Set up AWS Free Tier account; budget alerts at $1/$5/$25/$50; root MFA on; org SCP if you're feeling fancy
- [ ] Use terralabs to provision a VPC + EKS + RDS Postgres in us-west-2; verify
- [ ] Configure IRSA: K8s ServiceAccount → IAM Role → S3 bucket access from a pod
- [ ] Deploy AWS Load Balancer Controller; expose a service via ALB Ingress
- [ ] Configure RDS multi-AZ; force failover via the console; observe behavior
- [ ] Set up S3 lifecycle: old objects → Standard-IA after 30 days → Glacier after 90
- [ ] Use CloudTrail to identify what your account did in the last 7 days
- [ ] Build an IAM policy from scratch: read-only access to one S3 bucket, no other perms
- [ ] Hit a cost surprise; identify cause via Cost Explorer; remediate
- [ ] **Destroy** the EKS + RDS at end of phase; verify $0 ongoing spend

5.3 basecamp expansion
By phase end, basecamp can target an EKS cluster as well as homelab K3s. The same Helm charts deploy to both. ArgoCD ApplicationSet handles per-cluster overlays — the same primitive that will carry into Phase 11 when GKE becomes a third target and into Phase 13 when all three run side-by-side.
6. COMPARE: AWS vs DIY
You could run the same workloads on bare metal or rented dedicated servers. When does AWS earn its margin? When does self-host win?
400 words.
7. OPERATE
- 4+ runbooks (`aws-cost-spike-investigation`, `iam-debug-access-denied`, `eks-node-group-recovery`, `rds-failover-drill`)
- 2+ postmortems (you WILL hit a cost surprise; you WILL hit an IAM error)
- 1+ ADR (e.g., “Why us-west-2 over us-east-1”)
- Weekly log
8. CONTRIBUTE
AWS-adjacent OSS — eksctl, aws-load-balancer-controller, aws-cli, Terraform AWS provider, kube2iam.
Validation criteria
- [ ] All 10 operational depth checks
- [ ] basecamp targets EKS via terralabs + ArgoCD
- [ ] AWS-vs-DIY comparison written up
- [ ] 4+ runbooks; 2+ postmortems; 1+ ADR; 8+ weekly log entries
- [ ] Total AWS spend this phase: <$50
- [ ] Pattern entries deepened:
  - least-privilege → DEEP (after IAM depth)
  - defense-in-depth → reinforced
  - threat-modeling → first deepening (STUB → OUTLINE) via IAM policy work
- [ ] Exit Test passed

Exit Test
Time: 3 hours.
- Build (90 min) — given a fresh AWS account, use terralabs to provision EKS + RDS + S3, configure IRSA so a pod can read from S3, deploy basecamp app-of-apps.
- Debug (60 min) — scenario from Phase 10 catalog (IAM “Access Denied” with no obvious cause; RDS connection timeout; cost spike).
- Articulate (30 min) — 600 words: “Walk through every IAM check that fires when an EKS pod with IRSA reads from S3.”
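When preparing the Articulate answer, it can help to hold a toy model of how a single policy check resolves. The sketch below is deliberately simplified and is not AWS's evaluator: it looks at one set of identity-policy statements only, uses shell-style wildcard matching from the standard library, and ignores conditions, resource-based policies, permissions boundaries, session policies, and SCPs. It shows just the core ordering: an explicit deny wins, an allow is needed, and everything else is an implicit deny.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def _matches(patterns, value: str) -> bool:
    """True if any pattern (with * and ? wildcards) matches the value."""
    return any(fnmatchcase(value, p) for p in _as_list(patterns))


def evaluate(statements: list[dict], action: str, resource: str) -> str:
    """Resolve one request against a list of policy statements (toy model)."""
    decision = "ImplicitDeny"  # nothing is allowed until a statement says so
    for stmt in statements:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"  # an explicit deny overrides any allow
            decision = "Allow"
    return decision


# Hypothetical statements and bucket name, for illustration only.
statements = [
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::example-bucket/secret/*"},
]
for key in ("report.csv", "secret/keys.txt"):
    print(key, evaluate(statements, "s3:GetObject", f"arn:aws:s3:::example-bucket/{key}"))
```

The full IRSA walk-through has more stops than this (token projection, the web-identity call to STS, the role's trust policy, then the permissions check on S3), but each permissions check along the way resolves with the same ordering.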
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Storing AWS access keys in `~/.aws/credentials` | Use `aws-vault` or SSO |
| `*` in IAM policies | Pre-incident; least-privilege is one rewrite later, but you’ll forget |
| Forgetting to destroy EKS at end of session | $0.10/hr × forgotten weekend = ouch |
| Trusting Console clicks instead of IaC | Drift you can’t reproduce |
| Using `iamlive` to generate policies and never auditing | Let AWS write your perms and you’ve defeated least-privilege |
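Two of the anti-patterns above (wildcards in policies, generated policies that are never audited) can be caught with a very small check. The sketch below is a hypothetical helper, not an existing tool: it flags `Allow` statements whose action is `*` or `service:*`, or whose resource is `*`, and nothing more subtle than that.

```python
def _as_list(value):
    """IAM allows a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcards(policy: dict) -> list[str]:
    """Return a human-readable finding for each overly broad Allow entry."""
    findings = []
    for i, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue  # broad Deny statements are not the problem here
        label = stmt.get("Sid", f"statement[{i}]")
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"{label}: broad action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append(f"{label}: resource '*'")
    return findings


# Hypothetical generated policy, for illustration only.
generated = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    ],
}
for finding in find_wildcards(generated):
    print(finding)
```

A check like this is a floor, not an audit: a policy with no wildcards can still grant far more than the workload needs, so each statement still has to be read.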
Patterns deepened this phase
- least-privilege → DEEP
- defense-in-depth → reinforced
- threat-modeling → OUTLINE
Browse the full category at patterns/security/.
→ Next: Phase 11: GCP + Cloud-Agnostic