Deep Learning Fundamentals (PyTorch)
Phase 36 of /root Year 4: PyTorch from tensors up. Autograd, NN modules, training loops, optimizers, common architectures (MLP, CNN, basic attention). The DL substrate every Y5 LLM/agent phase builds on. 7-9 weeks, ~80-100 hours.
Sixth phase of Year 4. Deep learning from tensors up. 7-9 weeks, ~80-100 hrs.
Phase 35 showed classical ML wins on tabular. This phase teaches deep learning for the cases where it wins — unstructured data (text, images, audio), large datasets, when the patterns are too rich for hand-engineered features. PyTorch is the canonical framework in 2026. By phase end you’ve trained MLPs, CNNs, and a basic attention model from scratch, understanding the math + the code at each step.
This phase is the substrate every Y5 LLM/agent phase builds on. LLMs are transformers, transformers are attention, attention is matrix math on tensors. Master tensors here; the rest of Y5 reads naturally.
Prerequisites
- Phase 35 complete; classical ML fluency
- GPU available (local NVIDIA or cloud burst); Apple Silicon MPS works for small models
- 12 hrs/week budget reserved
- You accept: deep learning is just calculus and linear algebra with good engineering. Most of the work is data + plumbing.
Why this phase exists
You can’t operate a Y5 LLM gateway without understanding what an LLM is computing. You can’t reason about distributed training without understanding the training loop. You can’t tune fine-tuning without understanding the loss + optimizer + autograd. This phase installs the foundations so Y5 is engineering, not magic.
The pattern-first frame
Same eight steps.
1. PROBLEM
You have unstructured data (text, images, audio) or structured data with rich nonlinear patterns. Classical ML doesn’t capture the relevant structure. You want a model that learns hierarchical representations from data — that’s deep learning.
PyTorch is the canonical framework. TensorFlow + JAX are the alternatives. Hugging Face Transformers is the standard library for transformer models.
2. PRINCIPLES
2.1 Tensors and autograd
A tensor is a multi-dimensional array that can optionally track gradients (requires_grad=True). PyTorch’s autograd records the operations applied to such tensors and computes gradients via reverse-mode automatic differentiation.
→ Pattern: autograd
Investigate:
- Walk a small computation: y = (x ** 2).sum(); y.backward(); print(x.grad). What did PyTorch do?
- What’s the difference between tensor.detach() and with torch.no_grad():?
- Why does .backward() accumulate gradients?
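The small computation above can be run as-is. The following is a minimal sketch, assuming a three-element input chosen for illustration; the variable names (first_grad, accumulated_grad, detached, untracked) are hypothetical.

```python
import torch

# A leaf tensor that asks autograd to record operations applied to it.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()   # forward pass: y = 1 + 4 + 9
y.backward()         # reverse pass: dy/dx = 2 * x, written into x.grad
first_grad = x.grad.clone()          # 2 * x = [2, 4, 6]

# A second backward pass adds to x.grad instead of overwriting it.
y2 = (x ** 2).sum()
y2.backward()
accumulated_grad = x.grad.clone()    # 2 * x + 2 * x = [4, 8, 12]

# detach() returns a tensor that shares data but is cut out of the graph.
detached = x.detach()

# no_grad() disables graph recording for everything computed in the block.
with torch.no_grad():
    untracked = x * 2

print(first_grad, accumulated_grad)
print(detached.requires_grad, untracked.requires_grad)
```

The accumulation in the second pass is why a training loop clears gradients before each backward call.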
2.2 Gradient descent + backpropagation
Gradient descent: move weights in the negative-gradient direction to reduce loss. Backpropagation: efficient gradient computation via the chain rule.
→ Pattern: gradient-descent, backpropagation
Investigate:
- Walk SGD updates by hand on a 2-parameter problem.
- Why is mini-batch SGD better than full-batch in practice (memory + noise → escape local minima)?
- What’s the role of learning rate, and what do schedulers do?
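A minimal sketch of the hand walk, assuming a toy quadratic loss L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2 chosen here for illustration (it is not from the phase material). It runs plain full-batch gradient descent; mini-batch SGD would replace the exact gradient with a noisy estimate computed from a sample of the data.

```python
def loss(w1, w2):
    # Toy loss (assumption): minimum at w1 = 3, w2 = -1.
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2


def grad(w1, w2):
    # Partial derivatives of the toy loss, derived by hand.
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)


def gradient_descent(w1, w2, lr=0.1, steps=50):
    history = [loss(w1, w2)]
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        w1 -= lr * g1   # move against the gradient
        w2 -= lr * g2
        history.append(loss(w1, w2))
    return w1, w2, history


w1, w2, history = gradient_descent(0.0, 0.0)
print(w1, w2, history[0], history[-1])
```

With lr = 0.1 each step shrinks the distance to the minimum by a factor of 0.8; trying lr = 1.1 on the same function shows how an oversized learning rate diverges.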
2.3 NN modules + the training loop
PyTorch’s nn.Module is the abstraction. A training loop is: forward → loss → backward → optimizer.step → repeat.
Investigate:
- Walk a minimal MLP class. What does __init__ + forward do?
- Why is model.train() vs model.eval() significant? (Dropout, BatchNorm behavior.)
- What’s optimizer.zero_grad() and why is forgetting it the most common PyTorch bug?
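The forward → loss → backward → optimizer.step cycle can be sketched as follows. This is an illustrative example, not a reference implementation: the synthetic regression data, layer sizes, learning rate, and step count are all assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)


class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # __init__ declares the layers (and therefore the parameters).
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # forward defines how an input flows through those layers.
        return self.net(x)


# Synthetic data (assumption): the target is the sum of the four inputs.
X = torch.randn(256, 4)
y = X.sum(dim=1, keepdim=True)

model = MLP(4, 32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

model.train()                 # training-mode behavior for Dropout/BatchNorm
losses = []
for step in range(200):
    optimizer.zero_grad()     # clear gradients left over from the last step
    pred = model(X)           # forward
    loss = loss_fn(pred, y)   # loss
    loss.backward()           # backward: fill .grad on every parameter
    optimizer.step()          # update the weights
    losses.append(loss.item())

model.eval()                  # inference-mode behavior
with torch.no_grad():
    final_loss = loss_fn(model(X), y).item()
print(losses[0], final_loss)
```

Commenting out optimizer.zero_grad() and rerunning is a quick way to see the accumulation bug on the loss curve.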
2.4 Optimizers and learning rate schedules
SGD, momentum, Adam, AdamW. Each has a use case. Learning rate schedules (cosine, linear, constant) shape the trajectory.
Investigate:
- Walk Adam’s update rule. What’s it adapting?
- Why is AdamW preferred over Adam for many cases (weight decay correctness)?
- What’s “warmup” and why does it matter for transformer training?
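One way to walk Adam’s update rule is to write it by hand and check it against torch.optim.Adam. The sketch below assumes the standard Adam formulation with default hyperparameters and a toy loss sum(w^2); adam_step is a hypothetical helper, not a PyTorch function.

```python
import torch


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain tensors; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step size: large where gradients have been small, and vice versa.
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v


# Reference run: three steps of torch.optim.Adam on loss = sum(w^2).
w_ref = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.Adam([w_ref], lr=1e-3)

# Manual run from the same starting point.
w = torch.tensor([1.0, -2.0])
m = torch.zeros_like(w)
v = torch.zeros_like(w)

for t in range(1, 4):
    opt.zero_grad()
    (w_ref ** 2).sum().backward()
    opt.step()

    grad = 2 * w                              # gradient of sum(w^2), by hand
    w, m, v = adam_step(w, grad, m, v, t)

print(w, w_ref.detach())
```

AdamW differs in how weight decay is applied: it shrinks the parameter directly by a factor of (1 - lr * weight_decay) instead of adding the decay term to the gradient that feeds m and v.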
2.5 Common architectures: MLP, CNN, attention
- MLP: fully connected layers; baseline for any task.
- CNN: convolutional layers; image-shaped data; locality + translation equivariance (pooling adds approximate invariance).
- Attention: query-key-value; transformer’s core operation. Phase 36 introduces basic attention; Y5 deepens.
→ Pattern: attention-mechanism — first OUTLINE (Y5 deepens)
Investigate:
- Walk a convolution operation: 3×3 kernel over an image. What’s it computing?
- Walk basic self-attention: Q × K^T / sqrt(d), softmax, × V. Why does each piece exist?
- Why did transformers replace RNNs for sequence modeling? (Hint: parallelism + better long-range dependencies.)
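Both walks above fit in a few lines. The sketch below assumes a single-channel image, a single attention head, and random weights; conv2d_3x3 and self_attention are hypothetical helper names. The convolution is checked against torch.nn.functional.conv2d, which (like most DL libraries) computes a cross-correlation.

```python
import math

import torch
import torch.nn.functional as F


def conv2d_3x3(image, kernel):
    """Stride-1, no-padding cross-correlation of a 2D image with a 3x3 kernel."""
    h, w = image.shape
    out = torch.zeros(h - 2, w - 2)
    for i in range(h - 2):
        for j in range(w - 2):
            # Weighted sum of the 3x3 patch whose top-left corner is (i, j).
            out[i, j] = (image[i:i + 3, j:j + 3] * kernel).sum()
    return out


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / math.sqrt(d)           # pairwise similarity, scaled by sqrt(d)
    weights = torch.softmax(scores, dim=-1)   # each row becomes a distribution
    return weights @ v, weights               # weighted average of the values


torch.manual_seed(0)

image = torch.randn(6, 6)
kernel = torch.randn(3, 3)
manual = conv2d_3x3(image, kernel)
reference = F.conv2d(image[None, None], kernel[None, None])[0, 0]

seq_len, d_model, d_head = 5, 8, 4
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

print(manual.shape, out.shape, weights.sum(dim=-1))
```

Dividing by sqrt(d) keeps the scores from growing with the head dimension, which would otherwise push softmax toward near one-hot rows with very small gradients.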
2.6 Overfitting, regularization, the validation curve
Same patterns as classical ML, plus DL-specific ones: dropout, weight decay, data augmentation, early stopping.
Investigate:
- Walk a learning curve where you’ve overfit. What does it look like?
- When is more data better than more regularization?
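- Why does dropout need a rescaling step? (The original formulation scales activations at inference; PyTorch’s nn.Dropout instead scales by 1/(1-p) during training.)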
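Dropout’s scaling can be observed directly. The sketch below assumes p = 0.5 and an all-ones input chosen for illustration; in PyTorch’s nn.Dropout the rescaling happens in training mode (often called inverted dropout), so eval mode is a plain identity.

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
train_out = drop(x)   # about half the entries zeroed, survivors scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(x)    # identity: no masking, no scaling

# The expected activation matches between the two modes.
print(train_out.unique(), train_out.mean().item(), eval_out.mean().item())
```

Without the rescaling, the next layer would see inputs roughly (1 - p) times smaller in training than at inference, which is the mismatch the trick removes.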
3. TRADE-OFFS
| Decision | Options | Cost |
|---|---|---|
| Framework | PyTorch; TensorFlow; JAX | PyTorch: research + production standard. TF: enterprise legacy. JAX: research-leaning, fast. |
| Architecture for tabular | XGBoost (Phase 35); MLP | XGBoost usually wins; MLP only competitive with very large data + careful regularization. |
| Architecture for sequences | LSTM/GRU; Transformer | LSTM: simpler, weaker. Transformer: better, more compute. |
| Optimizer | SGD + momentum; AdamW | SGD: classical, careful tuning. AdamW: modern default. |
4. TOOLS (as of 2026-06)
- PyTorch 2.x
- einops: tensor reshaping that’s readable
- torch.compile: JIT compiler for PyTorch 2+
- Hugging Face Transformers: pretrained models + tokenizers
- Hugging Face Datasets: public datasets
Reading
- “Deep Learning with PyTorch” (Stevens et al.)
- “Dive into Deep Learning” (Zhang et al., free online)
- The original “Attention Is All You Need” paper (Vaswani et al.)
- Karpathy’s “Let’s build GPT from scratch” video — the canonical intro
5. MASTERY: Three models end-to-end
Build three models from scratch (no high-level wrappers; PyTorch + numpy only for the first two):
5.1 MLP for a tabular task
Same data as Phase 35. Train an MLP. Compare to XGBoost. Reflect.
5.2 CNN for an image task
MNIST or CIFAR-10 (or your own image dataset). Train a small CNN. Reach > 90% test accuracy on MNIST.
5.3 Basic attention model
Implement a minimal self-attention layer + train a tiny transformer (~10M params or fewer) on a small text task. This is Karpathy-style “build GPT from scratch.”
5.4 Operational depth checklist
[ ] PyTorch installed; GPU detected (or MPS for Apple Silicon)
[ ] MLP trained on tabular data; compared to XGBoost
[ ] CNN trained on MNIST/CIFAR; reached >90% test accuracy on MNIST
[ ] Attention layer implemented from scratch
[ ] Tiny transformer trained on a text task
[ ] Profiled training: GPU util, memory, throughput
[ ] Used a learning rate scheduler; observed the loss curve
[ ] Applied dropout + weight decay; observed regularization effect
[ ] Saved + loaded model checkpoints
[ ] Used Hugging Face Transformers to load a pretrained model; compared it against your from-scratch implementation
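Two of the checks above (device detection, checkpoint save + load) can be sketched together. This is a minimal example under assumptions: the nn.Linear model is a stand-in for a real one, the "step" field is a hypothetical piece of bookkeeping, and an in-memory buffer stands in for a checkpoint file path.

```python
import io

import torch
from torch import nn

# Pick the best available device, falling back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = nn.Linear(4, 2).to(device)   # stand-in model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Save model and optimizer state together so training can resume later.
buffer = io.BytesIO()                # stands in for a path on disk
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": 100},
    buffer,
)

# Load into a freshly constructed model of the same architecture.
buffer.seek(0)
checkpoint = torch.load(buffer, map_location=device)
restored = nn.Linear(4, 2).to(device)
restored.load_state_dict(checkpoint["model"])

print(device, checkpoint["step"])
```

Saving state_dict() rather than the whole model object keeps the checkpoint independent of the class definition’s import path, which is a common cause of load failures.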
6. COMPARE: JAX or Hugging Face
Pick one:
- JAX — reimplement your attention model in JAX + Flax. Reflect on functional style + JIT.
- HF Transformers — fine-tune a pretrained BERT on your text task. Reflect on what HF abstracts.
400-word reflection.
7. OPERATE
- 2-3 runbooks: GPU OOM, training loss NaN, model checkpoint loading failure
- 1-2 ADRs (PyTorch over TF; AdamW default)
- Weekly log
8. CONTRIBUTE
- PyTorch — docs, examples
- Hugging Face Transformers — community models, docs
- A public ML notebook or blog post (when blog is live)
What ships from this phase
- Three trained DL models (MLP, CNN, tiny transformer)
- Notebooks repo updated
- Confidence reading PyTorch source
Validation criteria
[ ] MLP, CNN, tiny transformer all trained from scratch
[ ] Attention layer implemented from scratch
[ ] Compared to pretrained via HF
[ ] All 10 operational depth checks
[ ] Compare reflection (400 words)
[ ] 2-3 DL runbooks
[ ] 1-2 ADRs
[ ] Pattern entries:
- autograd → OUTLINE
- gradient-descent → OUTLINE
- backpropagation → OUTLINE
- attention-mechanism → OUTLINE (Y5 deepens)
[ ] Exit Test passed
Exit Test
Time: 3 hours.
Part 1: Build (120 min)
Implement a tiny transformer from scratch (≤ 50 lines of PyTorch). Train it on a small text task. Get loss to converge.
Part 2: Diagnose (45 min)
Three training scenarios (loss is NaN; model isn’t learning; GPU OOM). List possible causes for each.
Part 3: Articulate (15 min)
~400 words: “Walk a single backprop step through a 2-layer MLP. Cover forward pass, loss computation, gradient computation per layer, weight update.”
Anti-patterns
| Anti-pattern | Why |
|---|---|
| Forgetting optimizer.zero_grad() | Gradients accumulate; training breaks silently |
| Using .train()/.eval() interchangeably | BatchNorm + Dropout behave differently; results lie |
| Loading data without num_workers > 0 | GPU idle waiting on data |
| Not profiling training | You optimize the wrong thing |
| Reaching for SOTA architectures before baselines | Sometimes a strong baseline + good data beats a complex model |
Patterns touched this phase
- autograd → OUTLINE
- gradient-descent → OUTLINE
- backpropagation → OUTLINE
- attention-mechanism → OUTLINE (Y5 deepens)