Containers: Namespaces + cgroups + UnionFS

Sixth phase. Build a container from scratch using Linux primitives. Docker as the canonical implementation. The magic dissolves; what’s left is just clever process isolation. ~6 weeks, ~70 hrs.

Phase 6 is the phase where the abstractions you’ve been building toward all year — process isolation from Phase 1, layering from Phase 1 + Phase 2 — collapse into a single artifact you can build with unshare(1). There is no magic in containers. There is just clever composition of Linux primitives you’ve already met. The phase exists to make that statement true for you.

The pattern frame is unusually clean here: a container is a process under privilege-separation (namespaces + capabilities + seccomp), running on a layered filesystem (OverlayFS), bounded by resource virtualization (cgroups v2). All three parts of that frame rest on concepts you internalized 10+ weeks ago in Phase 1; what’s new is the composition. By the end of Phase 6, “Docker” should feel like a convenient frontend over kernel features, not a mysterious daemon.

This is also the phase that unlocks Phase 7. Kubernetes assumes you know what a container is. Without Phase 6 you’re driving K8s blind.

Prerequisites

Phase 5 complete — Go fluent, pulse shipped

You accept: a container is a process with namespaces + cgroups + a UnionFS rootfs. That’s it. Once you build one with unshare(1), Docker stops being magic.

Why this phase exists

Phase 7’s K8s is built on containers. Year 4’s llm-gateway runs in containers. mlship (Year 5 capstone) auto-builds containers. All Year 3 data tools (Spark, Trino, Iceberg) ship as containers. If containers are magic, every higher-level system is partial magic.

The principle is privilege-separation revisited — containers are a process’s view of the system, scoped down via kernel primitives.

1. PROBLEM

You want to package and run software in a way that’s:

Reproducible — same image, same behavior, anywhere
Isolated — one container’s mistakes don’t break another
Lightweight — faster than VMs (no separate kernel)
Distributable — pull from a registry, run anywhere

Linux containers solve this with three building blocks: namespaces (isolation), cgroups (resource limits), UnionFS (layered filesystems).

2. PRINCIPLES

2.1 Namespaces: the “what can this process see?” boundary

Linux has eight namespace types: PID, mount, net, user, UTS, IPC, cgroup, and time (the newest, added in Linux 5.6). Each restricts what a process sees of that resource.

→ Pattern: privilege-separation (revisited)

Investigate:

Use sudo unshare -p -m -n -f --mount-proc /bin/bash to enter a namespaced shell; observe ps, ip link, mount (ps reads /proc, so without --mount-proc it still lists host processes)
Read man 7 namespaces; map each type to “what would break if it weren’t there?”
Why is the PID namespace particularly important?
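To make "which namespaces is this process in?" concrete, here is a minimal Go sketch that reads the symlinks under /proc/<pid>/ns and prints each namespace's inode. The helper name parseNSLink is an illustration, not part of any library. Two processes share a namespace when the inode numbers match.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNSLink splits a /proc/<pid>/ns symlink target such as
// "pid:[4026531836]" into the namespace type and its inode number.
func parseNSLink(target string) (kind string, inode uint64, err error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err = strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

func main() {
	// Default to the current process; pass a PID as the first argument
	// to inspect another one.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	dir := filepath.Join("/proc", pid, "ns")
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Println("cannot read", dir, "(Linux only):", err)
		return
	}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // may need more privilege for other users' processes
		}
		kind, inode, err := parseNSLink(target)
		if err != nil {
			continue
		}
		fmt.Printf("%-18s %-8s %d\n", e.Name(), kind, inode)
	}
}
```

Run it once on the host and once inside the unshare shell: the rows whose inode changed are the namespaces you actually left.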

2.2 cgroups: the “how much can this process use?” boundary

cgroups v2 (unified hierarchy) controls CPU, memory, IO, PIDs.

Investigate:

Create a cgroup manually under /sys/fs/cgroup/; move a process into it by writing its PID to cgroup.procs; cap memory via memory.max
What happens when the memory cap is exceeded? (OOM kill within the cgroup, kernel intact)
What’s the difference between cgroups v1 and v2?

This is the same /sys/fs/cgroup/ you played with in Phase 1 when bounding a single process. Phase 6’s contribution isn’t a new primitive — it’s the composition with namespaces and an overlay rootfs that turns “bounded process” into “container.”
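As a bridge between Docker flags and the raw cgroup files, the sketch below converts a fractional CPU count and a memory size into the values you would write to cpu.max and memory.max. It assumes a 100ms CPU period and binary size suffixes (1m = 1024*1024), which is how Docker documents --cpus and --memory; the cgroup name "demo" and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuMax renders a fractional CPU count as a cgroup v2 cpu.max value,
// "<quota> <period>" in microseconds, using a 100ms period.
func cpuMax(cpus float64) string {
	const period = 100000
	quota := int64(cpus*period + 0.5) // round to the nearest microsecond
	return fmt.Sprintf("%d %d", quota, period)
}

// memoryMax converts a size such as "100m" or "1g" into bytes for
// memory.max, treating the suffixes as binary multiples.
func memoryMax(size string) (int64, error) {
	s := strings.ToLower(strings.TrimSpace(size))
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "k"):
		mult, s = 1<<10, strings.TrimSuffix(s, "k")
	case strings.HasSuffix(s, "m"):
		mult, s = 1<<20, strings.TrimSuffix(s, "m")
	case strings.HasSuffix(s, "g"):
		mult, s = 1<<30, strings.TrimSuffix(s, "g")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid memory size %q", size)
	}
	return n * mult, nil
}

func main() {
	mem, err := memoryMax("100m")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Hypothetical cgroup directory; this only prints the writes, it does
	// not perform them.
	cg := "/sys/fs/cgroup/demo"
	fmt.Printf("echo %q > %s/cpu.max\n", cpuMax(0.5), cg)
	fmt.Printf("echo %d > %s/memory.max\n", mem, cg)
	fmt.Printf("echo $$ > %s/cgroup.procs\n", cg)
}
```

The printed writes only take effect if the cpu and memory controllers are enabled in the parent's cgroup.subtree_control.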

2.3 UnionFS: the layered filesystem

A container image is a stack of read-only layers + one writable layer on top. OverlayFS is the canonical Linux implementation.

→ Pattern: layering-and-abstraction (reinforced)

Investigate:

Use mount -t overlay to create your own overlay manually; understand lowerdir/upperdir/workdir
Why is the writable layer copy-on-write? What’s the cost?
What’s a Docker layer in image manifest terms? Read a manifest with crane manifest.
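The lowerdir/upperdir/workdir trio is easier to hold once you have assembled the option string yourself. This sketch builds it from a list of layer directories given base-first (the order an image manifest lists them) and reverses them, because OverlayFS treats the leftmost lowerdir as the top of the stack. All paths and the function name are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOptions builds the option string for an overlay mount.
// layers is ordered bottom-to-top (base layer first); OverlayFS wants
// lowerdir listed top-to-bottom, so the order is reversed here.
func overlayOptions(layers []string, upper, work string) string {
	lower := make([]string, len(layers))
	for i, l := range layers {
		lower[len(layers)-1-i] = l
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lower, ":"), upper, work)
}

func main() {
	// Hypothetical directories; create them before running the printed
	// command as root.
	opts := overlayOptions(
		[]string{"/tmp/c/base", "/tmp/c/app"},
		"/tmp/c/upper", "/tmp/c/work")
	fmt.Printf("mount -t overlay overlay -o %s /tmp/c/merged\n", opts)
}
```

upperdir receives every write (the copy-on-write layer), and workdir must be an empty directory on the same filesystem as upperdir.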

2.4 Capabilities: fine-grained root

root was historically all-or-nothing. Capabilities split root into ~40 specific permissions (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, etc.). Containers should drop everything not needed.

Investigate:

setcap cap_net_bind_service+ep ./mybinary — let a non-root binary bind port 80
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE — minimal-cap container
Why is --privileged almost always wrong?
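To check what a capability set really contains, you can decode the hex masks on the Cap* lines of /proc/<pid>/status. The sketch below does that with a bit-to-name table following the kernel's capability numbering (bit 0 is CAP_CHOWN). The second sample mask is the value commonly seen for Docker's default set; treat that as an assumption and compare it with your own container.

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames is indexed by capability bit number, following
// include/uapi/linux/capability.h.
var capNames = []string{
	"chown", "dac_override", "dac_read_search", "fowner", "fsetid",
	"kill", "setgid", "setuid", "setpcap", "linux_immutable",
	"net_bind_service", "net_broadcast", "net_admin", "net_raw", "ipc_lock",
	"ipc_owner", "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
	"sys_pacct", "sys_admin", "sys_boot", "sys_nice", "sys_resource",
	"sys_time", "sys_tty_config", "mknod", "lease", "audit_write",
	"audit_control", "setfcap", "mac_override", "mac_admin", "syslog",
	"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf",
	"checkpoint_restore",
}

// decodeCaps turns a hex mask (e.g. the CapEff value) into capability
// names; bits beyond the table are reported by number.
func decodeCaps(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, fmt.Errorf("bad capability mask %q: %w", hexMask, err)
	}
	var names []string
	for bit := 0; bit < 64; bit++ {
		if mask&(1<<uint(bit)) == 0 {
			continue
		}
		if bit < len(capNames) {
			names = append(names, "cap_"+capNames[bit])
		} else {
			names = append(names, fmt.Sprintf("cap_%d", bit))
		}
	}
	return names, nil
}

func main() {
	// 0x400 is bit 10 only: what a root process should show after
	// --cap-drop=ALL --cap-add=NET_BIND_SERVICE.
	for _, m := range []string{"0000000000000400", "00000000a80425fb"} {
		names, err := decodeCaps(m)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(m, len(names), names)
	}
}
```

Feed it the CapEff value from grep Cap /proc/1/status inside a container and compare against a --privileged run.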

2.5 seccomp: syscall filtering

seccomp-bpf restricts which syscalls a process can make. Docker’s default profile is an allowlist; the Docker docs put the result at around 44 blocked syscalls out of 300+.

Investigate:

Read the default Docker seccomp profile (it’s JSON)
Run a container with --security-opt seccomp=unconfined; observe which syscalls become available
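Reading the profile is easier with a small lookup tool. The sketch below parses only the defaultAction and syscalls[].names/action fields of a seccomp JSON profile and reports the action for a given syscall. It is a simplification: real profiles also carry architecture lists, argument filters, and capability-conditional rules, all ignored here, and the embedded sample is hand-written, not Docker's profile.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// profile models only the fields this sketch needs.
type profile struct {
	DefaultAction string `json:"defaultAction"`
	Syscalls      []struct {
		Names  []string `json:"names"`
		Action string   `json:"action"`
	} `json:"syscalls"`
}

// parseProfile decodes a seccomp JSON profile into the reduced model.
func parseProfile(data []byte) (profile, error) {
	var p profile
	if err := json.Unmarshal(data, &p); err != nil {
		return profile{}, fmt.Errorf("parse seccomp profile: %w", err)
	}
	return p, nil
}

// actionFor returns the action of the first rule naming the syscall,
// falling back to the profile's default action.
func (p profile) actionFor(syscall string) string {
	for _, rule := range p.Syscalls {
		for _, name := range rule.Names {
			if name == syscall {
				return rule.Action
			}
		}
	}
	return p.DefaultAction
}

// sample is a tiny illustrative allowlist.
const sample = `{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
  ]
}`

func main() {
	p, err := parseProfile([]byte(sample))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, sc := range []string{"read", "mount", "reboot"} {
		fmt.Printf("%-8s %s\n", sc, p.actionFor(sc))
	}
}
```

Swap sample for the real profile file and query a few syscalls you consider dangerous; anything not listed falls through to the default action.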

2.6 Image building: Dockerfile and beyond

A Dockerfile is the imperative recipe; the image is the declarative result. Multi-stage builds + distroless minimize attack surface.

Investigate:

Multi-stage Go build: builder image → distroless runtime; observe size delta
Why does COPY order matter for cache? Layer caching mental model.
Compare docker build vs buildah/kaniko/nixpkgs dockerTools: different paths to the same OCI image

The “imperative recipe → declarative result” framing also appears in Phase 7 at a different level — Helm charts are imperative templates that produce declarative manifests, just like Dockerfiles produce immutable images. Same shape, different artifact.
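For the layer-caching mental model, a toy is enough: give each step a key derived from its parent's key, its instruction text, and a digest of the files it reads. This is a deliberately simplified model, not BuildKit's actual algorithm; the step contents, digests, and image tag are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// step is one Dockerfile instruction plus a digest of the build-context
// files it reads (empty for instructions that read none).
type step struct {
	Instruction string
	InputDigest string
}

// cacheKeys chains each step's key onto its parent's key, so a change at
// step i changes the key of step i and of every step after it.
func cacheKeys(steps []step) []string {
	keys := make([]string, len(steps))
	parent := ""
	for i, s := range steps {
		sum := sha256.Sum256([]byte(parent + "\n" + s.Instruction + "\n" + s.InputDigest))
		keys[i] = fmt.Sprintf("%x", sum[:8])
		parent = keys[i]
	}
	return keys
}

// firstMiss returns the index of the first step whose key differs,
// or len(a) if every key matches (a full cache hit).
func firstMiss(a, b []string) int {
	for i := range a {
		if i >= len(b) || a[i] != b[i] {
			return i
		}
	}
	return len(a)
}

func main() {
	// Same recipe, built twice with different source digests.
	build := func(srcDigest string) []step {
		return []step{
			{"FROM golang:1.22", ""},
			{"COPY go.mod go.sum ./", "mod-v1"},
			{"RUN go mod download", ""},
			{"COPY . .", srcDigest},
			{"RUN go build -o /pulse .", ""},
		}
	}
	before := cacheKeys(build("src-v1"))
	after := cacheKeys(build("src-v2"))
	fmt.Println("first rebuilt step:", firstMiss(before, after))
}
```

Because go.mod is copied before the rest of the source, a source-only change misses the cache at the second COPY and the module download stays cached. Move COPY . . to the top and every step rebuilds.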

3. TRADE-OFFS

Decision	Option A	Option B	Option C	Cost
Runtime	docker	podman (rootless-friendly)	containerd direct	-
Builder	Dockerfile	Buildah	Kaniko	-
Base image	distroless (Google)	scratch	alpine	-
User in container	root	non-root with capabilities	-	non-root: defense in depth, harder configs

The Alpine row hides a real bug class — musl-libc is almost glibc-compatible, but the differences (DNS resolution behavior, thread-local storage, dynamic linking) bite at 3am in ways that are notoriously hard to debug. For Go binaries, distroless or scratch are usually the right call. For Python, paying the size cost of a Debian-slim base is often worth not chasing musl-libc compatibility issues.

4. TOOLS (as of 2025-10)

docker 25+ or podman 5+
unshare, nsenter (util-linux) — for the from-scratch exercises
buildah, skopeo — image manipulation
crane — registry interaction without Docker
dive — image-layer inspection
trivy — vulnerability scanning (warm-up for Year 2 supply chain)
distroless base images (gcr.io/distroless)

5. MASTERY

5.1 Reading list

Required	Why
“Container Security” (Liz Rice)	The principles + the pitfalls
`man 7 namespaces`, `man 7 capabilities`	The actual contracts
Docker docs (Build, Storage, Networking)	The implementation

Recommended	Why
“Kubernetes the Hard Way” (Hightower): read it now, you’ll do it in Phase 7	Bridge to Phase 7

5.2 Operational depth checklist

[ ] Build a container from scratch: `unshare -p -m -n -f -U -r /bin/bash` (`-r` maps you to root inside the user namespace so the mount is permitted), mount overlay, run a process. No Docker.
[ ] Multi-stage Go build for `pulse`: builder + distroless runtime; observe size (~10MB final)
[ ] Run a container with --cap-drop=ALL + only what's needed; verify with `getpcaps 1` (or `grep Cap /proc/1/status`) from inside
[ ] Reproduce by hand, under /sys/fs/cgroup/, the cgroups v2 limits that `docker run --cpus=0.5 --memory=100m` sets (cpu.max, memory.max); force OOM
[ ] Read a Docker image manifest via `crane manifest`; identify layer SHAs
[ ] Use `dive` on `pulse` image; identify wasted space; reduce
[ ] Run `trivy image` on `pulse`; address any HIGH/CRITICAL CVEs
[ ] Set up a local registry with `registry:2`; push/pull `pulse`
[ ] Containerize `triage`'s Postgres + Redis dependencies (foreshadow [Phase 7](/program/year-1/phase-7/))
[ ] Read Linux kernel source for one namespace type (e.g., PID — `kernel/pid_namespace.c`); 1 hour

The “build a container from scratch” item is the load-bearing exercise of the entire phase. If you skip it and rely on Docker the whole way through, the abstraction never dissolves and Phase 7’s K8s stays partially mysterious. Spend an entire afternoon on it. Watch ps from inside the namespace and from outside; reconcile the two views.

5.3 Containerize the Year 1 services

By phase end, you have container images for:

pulse (you ship this anyway)
triage (Phase 7 will deploy this)
rxp, konfig (CLIs containerized for CI use)

These all live in Dockerfiles in their respective project repos. Multi-stage. Distroless or scratch where possible. Trivy-clean. By the end of Phase 6 you have everything K8s would need to deploy in Phase 7 — only the orchestration is missing.

6. COMPARE: Docker vs Podman (rootless)

Run the same Dockerfile under Docker (root-daemon) and Podman (rootless). Compare:

Setup complexity
Permission model
Network behavior
CI/CD ergonomics

400 words.

The rootless Podman exercise also foreshadows Phase 7 and Year 2 supply-chain hygiene: don’t run privileged daemons you don’t need. Podman’s daemonless, rootless model is closer to the security posture you want for production K8s nodes than Docker’s root daemon is.

7. OPERATE

3+ runbooks (container-build-failed, container-runs-locally-fails-in-prod, image-too-large)
1+ ADR (e.g., “Why distroless over alpine for Go services”)
Weekly log

8. CONTRIBUTE

Container-adjacent OSS — buildah, podman, crane, dive, trivy, distroless. Lots of “good first issue” tickets.

Validation criteria

[ ] All 10 operational depth checks
[ ] Container-from-scratch exercise documented
[ ] Docker vs Podman comparison written up
[ ] All Year 1 projects containerized (pulse, triage, rxp, konfig)
[ ] 3+ runbooks; 1+ ADR; 6+ weekly log entries
[ ] Pattern entries:
    - privilege-separation → reinforced (now extends to namespace + capability scope)
    - layering-and-abstraction → reinforced (UnionFS layers as concrete example)
[ ] Exit Test passed

Exit Test

Time: 2 hours.

Build (60 min): given a Go binary, write a multi-stage Dockerfile producing a < 20MB distroless image with a non-root user and a healthcheck. Run it locally with --cap-drop=ALL and explicit cgroup limits.
Articulate (60 min): in 600 words, “Walk a docker run from CLI to running process. Cover: image pull, layer extraction, namespace creation, cgroup setup, exec.”
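One wrinkle in the Build task: distroless static images ship no shell or curl, so a healthcheck has to be something the binary can run itself. A minimal sketch of that idea follows, assuming a hypothetical /healthz endpoint; the throwaway httptest server is only there so the example runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe performs one GET and returns 0 (healthy) or 1 (unhealthy), the
// two exit codes Docker's HEALTHCHECK distinguishes.
func probe(url string, timeout time.Duration) int {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return 1
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 1
	}
	return 0
}

// demo starts a stand-in server and probes one good and one missing path.
func demo() (healthy, missing int) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/healthz" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()
	return probe(srv.URL+"/healthz", 2*time.Second),
		probe(srv.URL+"/nope", 2*time.Second)
}

func main() {
	healthy, missing := demo()
	fmt.Println("healthy endpoint ->", healthy)
	fmt.Println("missing endpoint ->", missing)
}
```

In a real service you might expose this as a subcommand that calls os.Exit(probe(...)) and reference it from the Dockerfile with an exec-form HEALTHCHECK CMD; the subcommand name is yours to choose.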

The articulation prompt is the exact composition the phase teaches: image (UnionFS) + namespaces + cgroups + exec. If you can describe each step crisply in your own words, the abstraction has dissolved. If any step still feels like “and then Docker does some stuff”, go back and redo the from-scratch exercise.

Anti-patterns

Anti-pattern	Why
`docker run --privileged` “to make it work”	Defeats the entire isolation model
Single-stage Dockerfile with build deps in final image	Bloated + larger attack surface
`latest` tag in production references	Unreproducible; pin to digests
Running as `root` inside the container	One escape and you’re root on the host (unless user namespaces remap container root to an unprivileged host UID)
Skipping `trivy` because “it’s just dev”	Dev images become prod images
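The "pin to digests" rule is easy to enforce mechanically. As a sketch, the check below accepts an image reference only if it ends in @sha256: followed by 64 hex characters; the registry host and image names are made up, and a CI lint would probably want a proper reference parser instead of a regular expression.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// digestRef matches an image reference ending in @sha256:<64 hex chars>.
var digestRef = regexp.MustCompile(`@sha256:[a-f0-9]{64}$`)

// isPinned reports whether ref is pinned to a content digest rather than
// a mutable tag.
func isPinned(ref string) bool {
	return digestRef.MatchString(ref)
}

func main() {
	// Hypothetical references; the digest is a placeholder, not a real image.
	refs := []string{
		"registry.example.com/pulse:latest",
		"registry.example.com/pulse:1.4.2",
		"registry.example.com/pulse@sha256:" + strings.Repeat("ab", 32),
	}
	for _, r := range refs {
		fmt.Printf("%-5v %s\n", isPinned(r), r)
	}
}
```

Note that a version tag such as 1.4.2 still fails the check: tags can be re-pushed, digests cannot.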

Patterns touched this phase

privilege-separation — reinforced (now scoped per-process via namespaces + capabilities)
layering-and-abstraction — reinforced (OverlayFS as canonical layered filesystem)
immutable-infrastructure — first touch (containers as immutable unit of deployment; deepens Year 2)

→ Next: Phase 7: Kubernetes + GitOps