Deep Learning Fundamentals (PyTorch)

Phase 36 of /root Year 4: PyTorch from tensors up. Autograd, NN modules, training loops, optimizers, common architectures (MLP, CNN, basic attention). The DL substrate every Y5 LLM/agent phase builds on. 7-9 weeks, ~80-100 hours.

This is the sixth phase of Year 4.

Phase 35 showed that classical ML wins on tabular data. This phase teaches deep learning for the cases where it wins: unstructured data (text, images, audio), large datasets, and patterns too rich for hand-engineered features. PyTorch is the canonical framework in 2026. By the end of the phase you’ve trained MLPs, CNNs, and a basic attention model from scratch, understanding the math and the code at each step.

This phase is the substrate every Y5 LLM/agent phase builds on. LLMs are transformers, transformers are attention, attention is matrix math on tensors. Master tensors here; the rest of Y5 reads naturally.


Prerequisites

  • Phase 35 complete; classical ML fluency
  • GPU available (local NVIDIA or cloud burst); Apple Silicon MPS works for small models
  • 12 hrs/week budget reserved
  • You accept: deep learning is just calculus and linear algebra with good engineering. Most of the work is data + plumbing.

Why this phase exists

You can’t operate a Y5 LLM gateway without understanding what an LLM is computing. You can’t reason about distributed training without understanding the training loop. You can’t tune fine-tuning without understanding the loss + optimizer + autograd. This phase installs the foundations so Y5 is engineering, not magic.


The pattern-first frame

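The same eight steps as in earlier phases: Problem, Principles, Trade-offs, Tools, Mastery, Compare, Operate, Contribute.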


1. PROBLEM

You have unstructured data (text, images, audio) or structured data with rich nonlinear patterns. Classical ML doesn’t capture the relevant structure. You want a model that learns hierarchical representations from data — that’s deep learning.

PyTorch is the canonical framework. TensorFlow + JAX are the alternatives. Hugging Face Transformers is the standard library for transformer models.


2. PRINCIPLES

2.1 Tensors and autograd

A tensor is a multi-dimensional array that can optionally track gradients (requires_grad=True). PyTorch’s autograd records the operations applied to such tensors and computes gradients via reverse-mode automatic differentiation.

→ Pattern: autograd

Investigate:

  • Walk a small computation: y = (x ** 2).sum(); y.backward(); print(x.grad). What did PyTorch do?
  • What’s the difference between tensor.detach() and with torch.no_grad():?
  • Why does .backward() accumulate gradients?
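The three questions above can be explored in a few lines. This is a minimal sketch, assuming a CPU-only PyTorch install; the variable names (first_grad, accumulated_grad, detached, untracked) are illustrative and not part of any API.

```python
import torch

# A leaf tensor that asks autograd to track operations applied to it.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # forward pass: autograd records pow, then sum
y.backward()         # reverse pass: dy/dx = 2 * x
first_grad = x.grad.clone()

# backward() adds into .grad instead of overwriting it.
y2 = (x ** 2).sum()
y2.backward()
accumulated_grad = x.grad.clone()

# Reset by hand; optimizer.zero_grad() does the same job for model parameters.
x.grad.zero_()

# detach(): returns a tensor that shares data with the original
# but is cut out of the computation graph.
detached = (x ** 2).detach()

# no_grad(): nothing computed inside the block is recorded at all.
with torch.no_grad():
    untracked = x ** 2

print("first backward: ", first_grad)
print("second backward:", accumulated_grad)
print(detached.requires_grad, untracked.requires_grad)
```

The second gradient is double the first because nothing cleared x.grad between the two backward calls. That accumulation is deliberate (it lets you sum gradients over several micro-batches), which is why the training loop has to clear gradients explicitly.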

2.2 Gradient descent + backpropagation

Gradient descent: move weights in the negative-gradient direction to reduce loss. Backpropagation: efficient gradient computation via the chain rule.

→ Pattern: gradient-descent, backpropagation
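To make the chain rule concrete, here is a sketch of backpropagation done by hand through a 2-layer MLP and checked against autograd. The shapes, the ReLU activation, the missing biases, and the mean-squared-error loss are all assumptions chosen to keep the example short.

```python
import torch

torch.manual_seed(0)

x = torch.randn(5, 3)                      # batch of 5 samples, 3 features
t = torch.randn(5, 1)                      # regression targets
W1 = torch.randn(3, 4, requires_grad=True)
W2 = torch.randn(4, 1, requires_grad=True)

# Forward pass, keeping the intermediates the backward pass needs.
z = x @ W1                                 # (5, 4) pre-activation
h = torch.relu(z)                          # (5, 4) hidden activation
y = h @ W2                                 # (5, 1) prediction
loss = ((y - t) ** 2).mean()

# Backward pass by hand: chain rule from the loss back to W1.
with torch.no_grad():
    dy = 2 * (y - t) / y.numel()           # dL/dy for the mean of squares
    dW2 = h.T @ dy                         # dL/dW2
    dh = dy @ W2.T                         # dL/dh
    dz = dh * (z > 0)                      # ReLU passes gradient only where z > 0
    dW1 = x.T @ dz                         # dL/dW1

loss.backward()                            # autograd's answer

print(torch.allclose(dW1, W1.grad, atol=1e-5))
print(torch.allclose(dW2, W2.grad, atol=1e-5))
```

Each hand-written line reuses the gradient computed on the line before it. That reuse is what makes backpropagation efficient compared with differentiating every weight independently.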

Investigate:

  • Walk SGD updates by hand on a 2-parameter problem.
  • Why is mini-batch SGD better than full-batch in practice? (Hint: lower memory per step, and gradient noise that helps escape poor local minima and saddle points.)
  • What’s the role of learning rate, and what do schedulers do?
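One way to walk SGD by hand is a 2-parameter linear fit in plain Python. This is a sketch only: the toy data (generated by y = 2x + 1), the learning rate, and the function names gradients and train are assumptions made for illustration.

```python
# Toy data generated by y = 2x + 1 (chosen for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def gradients(w, b, batch):
    """Gradient of mean squared error over one mini-batch, derived by hand."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def train(lr=0.05, epochs=500, batch_size=2):
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        # Real training shuffles every epoch; a fixed order keeps this reproducible.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = gradients(w, b, batch)
            w -= lr * grad_w      # step against the gradient
            b -= lr * grad_b
    return w, b


# First step, by hand: at w = b = 0 on the batch [(0, 1), (1, 3)]
# the residuals are -1 and -3, so grad_w = -3 and grad_b = -4.
print(gradients(0.0, 0.0, [(0.0, 1.0), (1.0, 3.0)]))

w, b = train()
print(f"w={w:.4f}, b={b:.4f}")
```

Try raising lr step by step until the loop diverges. That experiment answers the learning-rate question more vividly than any formula.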

2.3 NN modules + the training loop

PyTorch’s nn.Module is the core abstraction for models and layers. A training loop is: zero_grad → forward → loss → backward → optimizer.step → repeat.

Investigate:

  • Walk a minimal MLP class. What do __init__ and forward each do?
  • Why is model.train() vs model.eval() significant? (Dropout, BatchNorm behavior.)
  • What does optimizer.zero_grad() do, and why is forgetting it one of the most common PyTorch bugs?
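A minimal MLP and training loop might look like the sketch below. The synthetic regression data, the layer sizes, and the evaluate helper are hypothetical and chosen only to keep the example self-contained. Full-batch updates are used for brevity; a real loop would iterate over a DataLoader.

```python
import torch
from torch import nn

torch.manual_seed(0)


class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, p_drop=0.1):
        super().__init__()   # sets up parameter and submodule registration
        # __init__ declares the layers (and therefore the parameters).
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # forward defines how an input flows through those layers.
        return self.net(x)


# Synthetic data (hypothetical): the target is the sum of the features.
X = torch.randn(256, 4)
y = X.sum(dim=1, keepdim=True)

model = MLP(4, 32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()


def evaluate():
    model.eval()                  # dropout off
    with torch.no_grad():         # no graph needed for evaluation
        return loss_fn(model(X), y).item()


initial_loss = evaluate()

for step in range(200):
    model.train()                 # dropout on
    optimizer.zero_grad()         # clear gradients left over from the last step
    loss = loss_fn(model(X), y)   # forward + loss
    loss.backward()               # backward
    optimizer.step()              # weight update

final_loss = evaluate()
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Comment out optimizer.zero_grad() and rerun. The loop still executes without error, which is exactly why that bug is easy to miss.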

2.4 Optimizers and learning rate schedules

SGD, momentum, Adam, AdamW. Each has a use case. Learning rate schedules (cosine, linear, constant) shape the trajectory.

Investigate:

  • Walk Adam’s update rule. What’s it adapting?
  • Why is AdamW preferred over Adam in many cases? (Hint: weight decay is decoupled from the adaptive gradient update.)
  • What’s “warmup” and why does it matter for transformer training?
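The update rule and a warmup schedule can both be written out in plain Python. This is a sketch on a single scalar weight, not PyTorch's implementation; adamw_step and warmup_cosine are hypothetical helper names, and the hyperparameter defaults are the commonly used ones.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One Adam step on a scalar; weight_decay > 0 gives the AdamW variant."""
    # AdamW: shrink the weight directly, outside the adaptive rescaling.
    theta = theta - lr * weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def warmup_cosine(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear warmup to 1, then cosine decay to 0."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))


# Two very different gradient magnitudes, same starting weight.
theta_big, _, _ = adamw_step(theta=1.0, grad=100.0, m=0.0, v=0.0, t=1)
theta_small, _, _ = adamw_step(theta=1.0, grad=0.01, m=0.0, v=0.0, t=1)
print(theta_big, theta_small)

print([round(warmup_cosine(s, 10, 100), 3) for s in (0, 5, 10, 55, 100)])
```

Both first steps move the weight by roughly lr even though the gradients differ by four orders of magnitude. Adam adapts the per-parameter step size to the gradient scale. A multiplier like warmup_cosine could be passed to torch.optim.lr_scheduler.LambdaLR.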

2.5 Common architectures: MLP, CNN, attention

  • MLP: fully connected layers; baseline for any task.
  • CNN: convolutional layers; image-shaped data; locality + translation equivariance (pooling adds approximate invariance).
  • Attention: query-key-value; the transformer’s core operation. Phase 36 introduces basic attention; Y5 deepens.

→ Pattern: attention-mechanism — first OUTLINE (Y5 deepens)

Investigate:

  • Walk a convolution operation: 3×3 kernel over an image. What’s it computing?
  • Walk basic self-attention: softmax(Q × K^T / sqrt(d_k)) × V, where d_k is the key dimension. Why does each piece exist?
  • Why did transformers replace RNNs for sequence modeling? (Hint: parallelism + better long-range dependencies.)
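The first two walks above can be sketched in PyTorch as follows. The tensor sizes and the random projection matrices are assumptions; the self_attention function is an illustrative single-head version without masking, multiple heads, or an output projection.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# --- 3x3 convolution by hand (stride 1, no padding) ---
image = torch.randn(5, 5)
kernel = torch.randn(3, 3)
manual_conv = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        # Each output pixel is the dot product of the kernel with one 3x3 patch.
        manual_conv[i, j] = (image[i:i + 3, j:j + 3] * kernel).sum()

# F.conv2d expects (batch, channels, H, W) inputs and
# (out_channels, in_channels, kH, kW) weights. Like the loop above,
# it does not flip the kernel (strictly, it is cross-correlation).
library_conv = F.conv2d(image[None, None], kernel[None, None])[0, 0]
print(torch.allclose(manual_conv, library_conv, atol=1e-5))


# --- single-head self-attention ---
def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Similarity of every query with every key; the sqrt(d_k) scaling keeps
    # the scores from growing with dimension and saturating the softmax.
    scores = q @ k.T / math.sqrt(d_k)          # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ v, weights                # weighted average of values


seq_len, d_model, d_k = 4, 8, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.sum(dim=-1))
```

Note that every row of the attention weights is computed in one matrix multiply, with no loop over positions. That is the parallelism the RNN question hints at.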

2.6 Overfitting, regularization, the validation curve

Same patterns as classical ML, plus DL-specific ones: dropout, weight decay, data augmentation, early stopping.

Investigate:

  • Walk a learning curve where you’ve overfit. What does it look like?
  • When is more data better than more regularization?
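  • Why does dropout need a rescaling step? (The original formulation scales activations at inference; PyTorch’s nn.Dropout scales by 1/(1-p) during training instead.)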
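The dropout question is easy to check empirically. A small sketch, assuming p = 0.5 and an all-ones input so the scaling is visible directly:

```python
import torch
from torch import nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p)
x = torch.ones(10_000)

dropout.train()
train_out = dropout(x)   # surviving units are scaled by 1 / (1 - p); the rest are zeroed

dropout.eval()
eval_out = dropout(x)    # identity: no units dropped, no scaling

print("values seen in training:", train_out.unique().tolist())
print("train mean:", train_out.mean().item())
print("eval mean: ", eval_out.mean().item())
```

The training-time mean stays close to the eval-time mean, so the next layer sees inputs on the same scale in both modes. Without the rescaling, activations at inference would be systematically larger than anything the network saw during training.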

3. TRADE-OFFS

Decision | Options | Cost
Framework | PyTorch; TensorFlow; JAX | PyTorch: research + production standard. TF: enterprise legacy. JAX: research-leaning, fast.
Architecture for tabular | XGBoost (Phase 35); MLP | XGBoost usually wins; MLP only competitive with very large data + careful regularization.
Architecture for sequences | LSTM/GRU; Transformer | LSTM: simpler, weaker. Transformer: better, more compute.
Optimizer | SGD + momentum; AdamW | SGD: classical, careful tuning. AdamW: modern default.

4. TOOLS (as of 2026-06)

  • PyTorch 2.x
  • einops — tensor reshaping that’s readable
  • torch.compile — JIT compiler for PyTorch 2+
  • Hugging Face Transformers — pretrained models + tokenizers
  • Hugging Face Datasets — public datasets

Reading

  • “Deep Learning with PyTorch” (Stevens et al.)
  • “Dive into Deep Learning” (Zhang et al., free online)
  • The original “Attention Is All You Need” paper (Vaswani et al.)
  • Karpathy’s “Let’s build GPT: from scratch, in code, spelled out” video, the canonical intro

5. MASTERY: Three models end-to-end

Build three models from scratch (no high-level wrappers; PyTorch + numpy only for the first two):

5.1 MLP for a tabular task

Same data as Phase 35. Train an MLP. Compare to XGBoost. Reflect.

5.2 CNN for an image task

MNIST or CIFAR-10 (or your own image dataset). Train a small CNN. Reach > 90% test accuracy on MNIST.

5.3 Basic attention model

Implement a minimal self-attention layer + train a tiny transformer (~10M params or fewer) on a small text task. This is Karpathy-style “build GPT from scratch.”

5.4 Operational depth checklist

[ ] PyTorch installed; GPU detected (or MPS for Apple Silicon)
[ ] MLP trained on tabular data; compared to XGBoost
[ ] CNN trained on MNIST/CIFAR; reached >90% test accuracy on MNIST
[ ] Attention layer implemented from scratch
[ ] Tiny transformer trained on a text task
[ ] Profiled training: GPU util, memory, throughput
[ ] Used a learning rate scheduler; observed the loss curve
[ ] Applied dropout + weight decay; observed regularization effect
[ ] Saved + loaded model checkpoints
[ ] Used Hugging Face Transformers to load a pretrained model; compared it with your from-scratch implementation
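For the checkpoint item, one common pattern is to save the model and optimizer state dicts together. The sketch below assumes a tiny stand-in model and a temporary directory; the dictionary keys ("epoch", "model_state", "optimizer_state") and the file name are arbitrary choices, not a PyTorch convention.

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(4, 2)   # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Take one step so the optimizer has state worth saving.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

checkpoint = {
    "epoch": 1,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pt")
    torch.save(checkpoint, path)
    # map_location lets a checkpoint written on a GPU load on a CPU-only machine.
    loaded = torch.load(path, map_location="cpu")

# Rebuild the same architecture first, then load the saved state into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(loaded["model_state"])
restored_optimizer = torch.optim.AdamW(restored.parameters(), lr=1e-3)
restored_optimizer.load_state_dict(loaded["optimizer_state"])

print(torch.equal(restored.weight, model.weight))
```

Saving the optimizer state matters when you resume training: Adam-family optimizers keep running averages per parameter, and restarting them from zero changes the trajectory.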

6. COMPARE: JAX or Hugging Face

Pick one:

  • JAX — reimplement your attention model in JAX + Flax. Reflect on functional style + JIT.
  • HF Transformers — fine-tune a pretrained BERT on your text task. Reflect on what HF abstracts.

400-word reflection.


7. OPERATE

  • 2-3 runbooks: GPU OOM, training loss NaN, model checkpoint loading failure
  • 1-2 ADRs (PyTorch over TF; AdamW default)
  • Weekly log

8. CONTRIBUTE

  • PyTorch — docs, examples
  • Hugging Face Transformers — community models, docs
  • A public ML notebook or blog post (when blog is live)

What ships from this phase

  • Three trained DL models (MLP, CNN, tiny transformer)
  • Notebooks repo updated
  • Confidence reading PyTorch source

Validation criteria

[ ] MLP, CNN, tiny transformer all trained from scratch
[ ] Attention layer implemented from scratch
[ ] Compared to pretrained via HF
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 2-3 DL runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
    - autograd → OUTLINE
    - gradient-descent → OUTLINE
    - backpropagation → OUTLINE
    - attention-mechanism → OUTLINE (Y5 deepens)
[ ] Exit Test passed

Exit Test

Time: 3 hours.

Part 1: Build (120 min)

Implement a tiny transformer from scratch (≤ 50 lines of PyTorch). Train it on a small text task. Get loss to converge.

Part 2: Diagnose (45 min)

You are given three training scenarios: the loss is NaN, the model isn’t learning, and the GPU runs out of memory. List the possible causes for each.

Part 3: Articulate (15 min)

~400 words: “Walk a single backprop step through a 2-layer MLP. Cover forward pass, loss computation, gradient computation per layer, weight update.”


Anti-patterns

Anti-pattern | Why
Forgetting optimizer.zero_grad() | Gradients accumulate; training breaks silently
Using .train()/.eval() interchangeably | BatchNorm + Dropout behave differently; results lie
Loading data without num_workers > 0 | GPU idle waiting on data
Not profiling training | You optimize the wrong thing
Reaching for SOTA architectures before baselines | Sometimes a strong baseline + good data beats a complex model

Patterns touched this phase

  • autograd — OUTLINE
  • gradient-descent — OUTLINE
  • backpropagation — OUTLINE
  • attention-mechanism — OUTLINE (Y5 deepens)

→ Next: Phase 37: Distributed Training (KubeRay)