Neural networks learn nonlinear feature maps and train them end to end with backpropagation. They are linear algebra, nonlinear activations, loss functions, and chain-rule gradients stacked into a trainable program.
Overview
A feed-forward network is a composition of layers: $f(x) = f_L(f_{L-1}(\cdots f_1(x)))$.
Each layer usually computes an affine map followed by a nonlinearity: $h_l = \phi(W_l h_{l-1} + b_l)$.
Backpropagation computes gradients of the loss with respect to every parameter by reusing intermediate derivatives from the output back to the input.
Prerequisites
- Linear models and matrix multiplication
- Chain rule and gradients
- Cross-entropy and least-squares loss
- Basic optimization vocabulary
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates forward passes, backprop gradients, activations, initialization, gradient descent, Adam, normalization, dropout, and diagnostics. |
| exercises.ipynb | Ten practice problems for neural-network computations and debugging. |
Learning Objectives
After this section, you should be able to:
- Explain why nonlinear activations are necessary.
- Compute a forward pass through a small MLP.
- Derive backprop gradients for affine layers and activations.
- Compare sigmoid, tanh, ReLU, GELU, and gated activations.
- Explain Xavier and He initialization.
- Implement SGD and momentum, explain the intuition behind Adam, and apply gradient clipping.
- Explain BatchNorm, LayerNorm, dropout, and weight decay.
- Diagnose activation scale, gradient norms, overfitting, and optimization failure.
Table of Contents
- From Linear Models to Neural Networks
- 1.1 Learned feature map
- 1.2 Linear head
- 1.3 Composition
- 1.4 Nonlinearity
- 1.5 Representation learning
- Forward Pass
- 2.1 Affine layer
- 2.2 Activation
- 2.3 MLP layer stack
- 2.4 Batch vectorization
- 2.5 Output logits
- Loss Functions
- 3.1 Regression MSE
- 3.2 Binary cross-entropy
- 3.3 Softmax cross-entropy
- 3.4 Regularized objective
- 3.5 Empirical risk
- Backpropagation
- 4.1 Chain rule
- 4.2 Layer gradient
- 4.3 Activation derivative
- 4.4 Reverse accumulation
- 4.5 Gradient check
- Activations
- Initialization and Signal Propagation
- 6.1 Variance propagation
- 6.2 Xavier initialization
- 6.3 He initialization
- 6.4 Symmetry breaking
- 6.5 Depth instability
- Optimization
- 7.1 SGD
- 7.2 Momentum
- 7.3 Adam
- 7.4 Learning-rate schedule
- 7.5 Gradient clipping
- Normalization and Regularization
- 8.1 Batch normalization
- 8.2 Layer normalization
- 8.3 Dropout
- 8.4 Weight decay
- 8.5 Early stopping
- Expressivity and Generalization
- 9.1 Universal approximation
- 9.2 Depth efficiency
- 9.3 Overparameterization
- 9.4 Double descent
- 9.5 Inductive bias
- Diagnostics
- 10.1 Shape checks
- 10.2 Activation statistics
- 10.3 Gradient norms
- 10.4 Train validation curves
- 10.5 Ablations
Shape Map
input batch: X shape (B, d_in)
layer weights: W_l shape (d_out, d_in)
pre-activation: Z_l shape (B, d_out)
activation: H_l shape (B, d_out)
logits: Z shape (B, classes)
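A minimal NumPy sketch of checking these shapes for one layer; the sizes (B = 32, d_in = 20, and so on) are made up for illustration:

```python
import numpy as np

B, d_in, d_hidden = 32, 20, 64        # hypothetical sizes for illustration

X = np.random.default_rng(0).normal(size=(B, d_in))                  # input batch: (B, d_in)
W1 = np.random.default_rng(1).normal(size=(d_hidden, d_in)) * 0.1    # layer weights: (d_out, d_in)
b1 = np.zeros(d_hidden)

Z1 = X @ W1.T + b1                    # pre-activation: (B, d_out)
H1 = np.maximum(Z1, 0.0)              # activation: (B, d_out)

assert Z1.shape == (B, d_hidden) and H1.shape == (B, d_hidden)
```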
1. From Linear Models to Neural Networks
This part studies the step from linear models to neural networks, viewed as trainable representation learning. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Learned feature map | Make features trainable instead of fixed | $\hat{y} = w^\top \phi_\theta(x)$ |
| Linear head | The final prediction is often linear on learned features | $\hat{y} = W_{\text{out}} h_L + b_{\text{out}}$ |
| Composition | Depth composes many simple maps | $f = f_L \circ \cdots \circ f_1$ |
| Nonlinearity | Without nonlinear activations, stacked linear layers collapse to one linear map | $W_2 (W_1 x) = (W_2 W_1) x$ |
| Representation learning | Hidden layers learn intermediate coordinates useful for the task | $h_l = \phi(W_l h_{l-1} + b_l)$ |
1.1 Learned feature map
Main idea. Make features trainable instead of fixed.
Core relation: $\hat{y} = w^\top \phi_\theta(x)$, where the feature map $\phi_\theta$ is learned jointly with the output weights instead of being hand-designed.
A neural network is a differentiable program made from parameterized layers. The forward pass computes predictions. The backward pass uses the chain rule to assign credit to every parameter. Training works when signals and gradients stay numerically healthy across depth.
Worked micro-example. A two-layer network computes $h = \mathrm{ReLU}(W_1 x + b_1)$ and $\hat{y} = W_2 h + b_2$. If the pre-activation $W_1 x + b_1$ is negative for a hidden unit, ReLU outputs zero and the local gradient through that unit is also zero for that example.
Implementation check. Log activation means, activation standard deviations, loss, gradient norms, and parameter update norms. A falling training loss is useful; a stable diagnostic picture is better.
AI connection. Transformers, CNNs, and RNNs all follow this pattern: learned features followed by a simple prediction head.
Common mistake. Do not debug deep networks only from the final loss. The final loss is a symptom; layer statistics often reveal the cause.
1.2 Linear head
Main idea. The final prediction is often linear on learned features.
Core relation: $\hat{y} = W_{\text{out}} h_L + b_{\text{out}}$ for regression, or logits $z = W_{\text{out}} h_L + b_{\text{out}}$ fed to a softmax for classification.
1.3 Composition
Main idea. Depth composes many simple maps.
Core relation: $f(x) = f_L(f_{L-1}(\cdots f_1(x)))$, where each $f_l$ is simple but the composition can be highly nonlinear.
1.4 Nonlinearity
Main idea. Without nonlinear activations, stacked linear layers collapse to one linear map.
Core relation: $W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$, a single affine map, so depth adds nothing without a nonlinearity in between.
1.5 Representation learning
Main idea. Hidden layers learn intermediate coordinates useful for the task.
Core relation: each hidden layer re-represents its input as $h_l = \phi(W_l h_{l-1} + b_l)$; later layers see coordinates in which the task is easier.
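To make the collapse in 1.4 concrete, here is a small NumPy check (shapes and values chosen arbitrarily) that two stacked affine layers with no activation equal a single affine layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                 # a small batch of inputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two stacked affine layers with no activation in between...
two_layers = (x @ W1.T + b1) @ W2.T + b2

# ...equal a single affine layer with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = x @ W.T + b

print(np.allclose(two_layers, one_layer))   # True: depth without nonlinearity adds nothing
```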
2. Forward Pass
This part studies the forward pass as trainable representation learning. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Affine layer | Matrix multiply plus bias | $z = W x + b$ |
| Activation | Apply an elementwise nonlinearity | $h = \phi(z)$ |
| MLP layer stack | Repeat affine and nonlinear transformations | $h_l = \phi(W_l h_{l-1} + b_l)$ |
| Batch vectorization | Process many examples together | $Z_l = H_{l-1} W_l^\top + b_l$ |
| Output logits | Classification networks usually produce logits before softmax | $z = W_{\text{out}} h_L + b_{\text{out}}$ |
2.1 Affine layer
Main idea. Matrix multiply plus bias.
Core relation: $z = W x + b$ with $W$ of shape (d_out, d_in); every other layer type builds on this primitive.
2.2 Activation
Main idea. Apply an elementwise nonlinearity.
Core relation: $h = \phi(z)$ applied elementwise, so the activation changes values but not shapes.
2.3 MLP layer stack
Main idea. Repeat affine and nonlinear transformations.
Core relation: $h_0 = x$ and $h_l = \phi(W_l h_{l-1} + b_l)$ for $l = 1, \dots, L$.
2.4 Batch vectorization
Main idea. Process many examples together.
Core relation: stacking $B$ examples as the rows of $X$ gives $Z_l = H_{l-1} W_l^\top + b_l$, matching the shape map above ($X$ is (B, d_in), $Z_l$ is (B, d_out)).
2.5 Output logits
Main idea. Classification networks usually produce logits before softmax.
Core relation: $z = W_{\text{out}} h_L + b_{\text{out}}$; the softmax is usually applied inside the loss for numerical stability rather than in the network itself.
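A minimal NumPy sketch of the batched forward pass described above; the layer sizes and the `forward`/`relu` helper names are illustrative, not from the companion notebooks:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(X, params):
    """Forward pass through an MLP; returns logits and cached activations."""
    H = X
    cache = [H]
    for W, b in params[:-1]:
        H = relu(H @ W.T + b)        # hidden layer: affine map then nonlinearity
        cache.append(H)
    W_out, b_out = params[-1]
    logits = H @ W_out.T + b_out     # final layer stays affine (logits)
    return logits, cache

rng = np.random.default_rng(0)
sizes = [20, 64, 64, 10]             # hypothetical layer widths
params = [(rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

X = rng.normal(size=(32, sizes[0]))  # batch of 32 examples
logits, _ = forward(X, params)
print(logits.shape)                  # (32, 10)
```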
3. Loss Functions
This part studies the loss functions that turn predictions into a training objective. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Regression MSE | Penalize squared prediction error | $\frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$ |
| Binary cross-entropy | Bernoulli negative log likelihood | $-\frac{1}{N}\sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$ |
| Softmax cross-entropy | Multi-class negative log likelihood | $-\frac{1}{N}\sum_i \log p_{i, y_i}$ |
| Regularized objective | Add parameter penalties or other constraints | $\mathcal{J}(\theta) = \mathcal{L}(\theta) + \lambda \Omega(\theta)$ |
| Empirical risk | Training minimizes average loss over data | $\hat{R}(\theta) = \frac{1}{N}\sum_i \ell(f_\theta(x_i), y_i)$ |
3.1 Regression MSE
Main idea. Penalize squared prediction error.
Core relation: $\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$; up to constants, it is the negative log likelihood of a Gaussian noise model.
3.2 Binary cross-entropy
Main idea. Bernoulli negative log likelihood.
Core relation: $\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_i [y_i \log p_i + (1 - y_i)\log(1 - p_i)]$ with $p_i = \sigma(z_i)$.
3.3 Softmax cross-entropy
Main idea. Multi-class negative log likelihood.
Core relation: $\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_i \log p_{i, y_i}$ with $p_i = \mathrm{softmax}(z_i)$; combined with the softmax it has the simple per-example logit gradient $\partial \ell/\partial z = p - y$ for one-hot $y$.
3.4 Regularized objective
Main idea. Add parameter penalties or other constraints.
Core relation: $\mathcal{J}(\theta) = \mathcal{L}(\theta) + \lambda \Omega(\theta)$, for example $\Omega(\theta) = \tfrac{1}{2}\|\theta\|_2^2$ for weight decay.
3.5 Empirical risk
Main idea. Training minimizes average loss over data.
Core relation: $\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)$, a finite-sample estimate of the expected loss on new data.
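A small, numerically stable softmax cross-entropy sketch in NumPy (the `softmax_cross_entropy` name and the toy logits are illustrative):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean negative log likelihood of integer labels under softmax(logits).

    Subtracting the row-wise max before exponentiating avoids overflow
    without changing the softmax probabilities.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
print(softmax_cross_entropy(logits, labels))
```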
4. Backpropagation
This part studies backpropagation, the algorithm that computes every gradient training needs. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Chain rule | Gradients flow backward through composed functions | $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial h_L} \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_l}{\partial \theta_l}$ |
| Layer gradient | Weight gradient is the outer product of upstream error and input | $\nabla_W \mathcal{L} = \delta x^\top$ |
| Activation derivative | The nonlinearity gates gradient flow | $\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial h} \odot \phi'(z)$ |
| Reverse accumulation | Store intermediate activations and traverse backward | |
| Gradient check | Finite differences verify the backprop implementation | $\frac{\mathcal{L}(\theta + \epsilon e_j) - \mathcal{L}(\theta - \epsilon e_j)}{2\epsilon}$ |
4.1 Chain rule
Main idea. Gradients flow backward through composed functions.
Core relation: for a composition $\mathcal{L}(f_L(\cdots f_1(x)))$, the gradient with respect to any intermediate quantity is a product of local Jacobians, accumulated from the loss back toward the input.
AI connection. Backprop is the chain rule organized so every parameter receives credit efficiently.
4.2 Layer gradient
Main idea. The weight gradient is the outer product of upstream error and input.
Core relation: for $z = W x + b$ and upstream error $\delta = \partial \mathcal{L}/\partial z$, the gradients are $\nabla_W \mathcal{L} = \delta x^\top$, $\nabla_b \mathcal{L} = \delta$, and $\partial \mathcal{L}/\partial x = W^\top \delta$.
4.3 Activation derivative
Main idea. The nonlinearity gates gradient flow.
Core relation: $\partial \mathcal{L}/\partial z = (\partial \mathcal{L}/\partial h) \odot \phi'(z)$; for ReLU, $\phi'(z)$ is 1 where $z > 0$ and 0 elsewhere, so dead units pass no gradient.
4.4 Reverse accumulation
Main idea. Store intermediate activations and traverse backward.
Core relation: the forward pass caches every intermediate value; one backward sweep then computes all parameter gradients at a cost of the same order as the forward pass, at the price of the memory needed for the cache.
4.5 Gradient check
Main idea. Finite differences verify the backprop implementation.
Core relation: $\partial \mathcal{L}/\partial \theta_j \approx \frac{\mathcal{L}(\theta + \epsilon e_j) - \mathcal{L}(\theta - \epsilon e_j)}{2\epsilon}$; compare against the analytic gradient for a few randomly chosen parameters.
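A hand-written backward pass for a two-layer ReLU network with an MSE loss, followed by the finite-difference check from 4.5; all names and sizes are illustrative assumptions:

```python
import numpy as np

def loss_and_grads(params, X, y):
    """MSE loss and gradients for a two-layer ReLU network, by hand."""
    W1, b1, W2, b2 = params
    Z1 = X @ W1.T + b1                          # (B, hidden)
    H1 = np.maximum(Z1, 0.0)
    Y_hat = H1 @ W2.T + b2                      # (B, 1)
    B = X.shape[0]
    loss = ((Y_hat - y) ** 2).mean()

    # Backward pass: reuse upstream errors (deltas) layer by layer.
    dY = 2.0 * (Y_hat - y) / (B * y.shape[1])   # dL/dY_hat
    dW2 = dY.T @ H1
    db2 = dY.sum(axis=0)
    dH1 = dY @ W2
    dZ1 = dH1 * (Z1 > 0)                        # ReLU gates the gradient
    dW1 = dZ1.T @ X
    db1 = dZ1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
params = [rng.normal(size=(5, 3)), np.zeros(5), rng.normal(size=(1, 5)), np.zeros(1)]

loss, grads = loss_and_grads(params, X, y)

# Finite-difference check on one weight entry.
eps, (i, j) = 1e-5, (2, 1)
params[0][i, j] += eps; loss_plus, _ = loss_and_grads(params, X, y)
params[0][i, j] -= 2 * eps; loss_minus, _ = loss_and_grads(params, X, y)
params[0][i, j] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
print(numeric, grads[0][i, j])                  # should agree to several decimal places
```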
5. Activations
This part studies activation functions, the nonlinearities between affine layers. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Sigmoid | Squash to (0, 1), but it can saturate | $\sigma(z) = 1/(1 + e^{-z})$ |
| Tanh | Zero-centered saturating activation | $\tanh(z)$ |
| ReLU | Simple piecewise-linear activation | $\max(0, z)$ |
| GELU | Smooth activation used in many transformers | $z\,\Phi(z)$ |
| SwiGLU | Gated activation used in modern LLM MLP blocks | $\mathrm{Swish}(W_1 x) \odot (W_2 x)$ |
5.1 Sigmoid
Main idea. Squash to (0, 1), but it can saturate.
Core relation: $\sigma(z) = 1/(1 + e^{-z})$ with derivative $\sigma(z)(1 - \sigma(z))$, which is at most 0.25 and vanishes for large $|z|$.
5.2 Tanh
Main idea. Zero-centered saturating activation.
Core relation: $\tanh(z)$ maps to $(-1, 1)$ with derivative $1 - \tanh^2(z)$; it is zero-centered but still saturates for large $|z|$.
5.3 ReLU
Main idea. Simple piecewise-linear activation.
Core relation: $\mathrm{ReLU}(z) = \max(0, z)$ with derivative 1 for $z > 0$ and 0 for $z < 0$; units stuck in the negative region become dead units.
AI connection. ReLU made deep feed-forward networks easier to optimize than saturating activations in many settings.
5.4 GELU
Main idea. Smooth activation used in many transformers.
Core relation: $\mathrm{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard normal CDF; it behaves like ReLU for large $|z|$ but is smooth near zero.
5.5 SwiGLU
Main idea. Gated activation used in modern LLM MLP blocks.
Core relation: one common form is $\mathrm{SwiGLU}(x) = \mathrm{Swish}(W_1 x) \odot (W_2 x)$ with $\mathrm{Swish}(z) = z\,\sigma(z)$; a second linear branch gates the first elementwise.
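A NumPy sketch of the activations above and the derivatives backprop uses; the GELU line uses the common tanh approximation, and the SwiGLU form shown is one common variant, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # Common tanh approximation of GELU(z) = z * Phi(z).
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swiglu(x, W1, W2):
    # One common gated form: Swish(x W1^T) elementwise-times (x W2^T).
    a = x @ W1.T
    return (a * sigmoid(a)) * (x @ W2.T)

z = np.linspace(-4.0, 4.0, 9)
print(np.round(sigmoid(z), 3))
print(np.round(np.tanh(z), 3))
print(np.round(np.maximum(z, 0.0), 3))       # ReLU
print(np.round(gelu(z), 3))

# Derivatives used by backprop, written in terms of forward values:
s, t = sigmoid(z), np.tanh(z)
d_sigmoid = s * (1.0 - s)
d_tanh = 1.0 - t**2
d_relu = (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(swiglu(x, W1, W2).shape)               # (4, 16)
```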
6. Initialization and Signal Propagation
This part studies initialization and how signals propagate across depth. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Variance propagation | Activation variance should not explode or vanish across layers | $\mathrm{Var}(z) \approx n_{\text{in}}\,\mathrm{Var}(W)\,\mathrm{Var}(x)$ |
| Xavier initialization | Balance fan-in and fan-out for tanh-like activations | $\mathrm{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$ |
| He initialization | Scale for ReLU-like activations | $\mathrm{Var}(W) = 2/n_{\text{in}}$ |
| Symmetry breaking | Random weights let units learn different features | |
| Depth instability | Bad initialization makes gradients vanish or explode | |
6.1 Variance propagation
Main idea. Activation variance should not explode or vanish across layers.
Core relation: for $z = \sum_j W_j x_j$ with independent zero-mean weights and inputs, $\mathrm{Var}(z) = n_{\text{in}}\,\mathrm{Var}(W)\,\mathrm{Var}(x)$; keeping this per-layer factor near 1 keeps signals stable.
6.2 Xavier initialization
Main idea. Balance fan-in and fan-out for tanh-like activations.
Core relation: $\mathrm{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, a compromise that keeps both forward activations and backward gradients roughly constant in scale.
6.3 He initialization
Main idea. Scale for ReLU-like activations.
Core relation: $\mathrm{Var}(W) = 2/n_{\text{in}}$; the factor of 2 compensates for ReLU zeroing out roughly half of its inputs.
AI connection. Initialization is not decoration; it controls signal scale at depth.
6.4 Symmetry breaking
Main idea. Random weights let units learn different features.
Core relation: if all weights in a layer start identical, every unit computes the same output and receives the same gradient, so the units never differentiate; random initialization breaks this symmetry.
6.5 Depth instability
Main idea. Bad initialization makes gradients vanish or explode.
Core relation: per-layer scale factors multiply across depth, so a factor $c \neq 1$ behaves like $c^L$ after $L$ layers, driving signals or gradients toward zero or infinity.
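A small experiment in the spirit of 6.1 and 6.3 (the widths, depth, and scales are arbitrary choices): push one batch through a deep ReLU stack and watch the activation scale under a too-small weight scale versus He scaling:

```python
import numpy as np

def deep_relu_stats(scale_fn, depth=30, width=256, batch=128, seed=0):
    """Push a batch through a deep ReLU stack and report activation std per layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    stds = []
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        h = np.maximum(h @ W.T, 0.0)
        stds.append(h.std())
    return stds

naive = deep_relu_stats(lambda n_in: 0.01)                 # too small: the signal dies out
he = deep_relu_stats(lambda n_in: np.sqrt(2.0 / n_in))     # He scaling: std stays roughly stable

print("naive, layer 30 std:", naive[-1])
print("He,    layer 30 std:", he[-1])
```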
7. Optimization
This part studies the optimizers that turn gradients into parameter updates. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| SGD | Update parameters using mini-batch gradients | $\theta \leftarrow \theta - \eta\, g$ |
| Momentum | Smooth updates with a velocity term | $v \leftarrow \beta v + g,\ \theta \leftarrow \theta - \eta v$ |
| Adam | Scale the first moment by a second-moment estimate | $\theta \leftarrow \theta - \eta\, \hat{m}/(\sqrt{\hat{v}} + \epsilon)$ |
| Learning-rate schedule | Change the step size over training | $\eta_t$ |
| Gradient clipping | Cap the gradient norm to prevent unstable updates | $g \leftarrow g \cdot \min(1, c/\|g\|)$ |
7.1 SGD
Main idea. Update parameters using mini-batch gradients.
Core relation: $\theta \leftarrow \theta - \eta\, g$, where $g$ is the gradient of the average loss over a mini-batch; the mini-batch makes $g$ a noisy but cheap estimate of the full-data gradient.
7.2 Momentum
Main idea. Smooth updates with a velocity term.
Core relation: $v \leftarrow \beta v + g$ and $\theta \leftarrow \theta - \eta v$ (conventions vary in where a $1 - \beta$ factor appears); the velocity averages recent gradients and damps oscillation.
7.3 Adam
Main idea. Scale the first moment by a second-moment estimate.
Core relation: $m \leftarrow \beta_1 m + (1 - \beta_1) g$, $v \leftarrow \beta_2 v + (1 - \beta_2) g^2$, then $\theta \leftarrow \theta - \eta\, \hat{m}/(\sqrt{\hat{v}} + \epsilon)$ with bias-corrected $\hat{m} = m/(1 - \beta_1^t)$ and $\hat{v} = v/(1 - \beta_2^t)$.
7.4 Learning-rate schedule
Main idea. Change the step size over training.
Core relation: replace the constant $\eta$ with a schedule $\eta_t$, commonly a short warmup followed by a decay such as cosine or step decay.
7.5 Gradient clipping
Main idea. Cap the gradient norm to prevent unstable updates.
Core relation: if $\|g\| > c$, rescale $g \leftarrow c\, g/\|g\|$; the update direction is preserved while its size is bounded.
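Minimal sketches of the update rules above in NumPy; the function names, hyperparameter defaults, and the toy quadratic objective are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(theta, grad, state, lr=1e-2, beta=0.9):
    """Heavy-ball momentum: velocity accumulates gradients, parameters follow velocity."""
    v = beta * state.get("v", np.zeros_like(theta)) + grad
    state["v"] = v
    return theta - lr * v

def adam(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first-moment direction scaled by a second-moment estimate, with bias correction."""
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", np.zeros_like(theta)) + (1 - beta1) * grad
    v = beta2 * state.get("v", np.zeros_like(theta)) + (1 - beta2) * grad**2
    state.update(t=t, m=m, v=v)
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together when their combined norm exceeds max_norm."""
    total = np.sqrt(sum((g**2).sum() for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_a, state_a = np.array([3.0, -2.0]), {}
theta_m, state_m = np.array([3.0, -2.0]), {}
for _ in range(200):
    theta_a = adam(theta_a, 2.0 * theta_a, state_a, lr=0.1)
    theta_m = sgd_momentum(theta_m, 2.0 * theta_m, state_m, lr=0.05)
print(theta_a, theta_m)   # both close to zero

clipped = clip_by_global_norm([np.full(10, 3.0), np.full(5, -4.0)], max_norm=1.0)
print(np.sqrt(sum((g**2).sum() for g in clipped)))   # about 1.0
```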
8. Normalization and Regularization
This part studies normalization and regularization, the tools that keep training stable and control overfitting. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Batch normalization | Normalize features using batch statistics | $\hat{x} = (x - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$ |
| Layer normalization | Normalize features within an example | $\hat{x} = (x - \mu)/\sqrt{\sigma^2 + \epsilon}$ |
| Dropout | Randomly mask activations during training | $h \leftarrow h \odot m/(1 - p)$ |
| Weight decay | Penalize large weights | $\mathcal{J} = \mathcal{L} + \tfrac{\lambda}{2}\|\theta\|_2^2$ |
| Early stopping | Stop when validation loss stops improving | |
8.1 Batch normalization
Main idea. Normalize features using batch statistics.
Core relation: $\hat{x} = (x - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$ followed by a learned scale and shift $\gamma \hat{x} + \beta$; the statistics are computed per feature over the batch, and running averages replace them at test time.
8.2 Layer normalization
Main idea. Normalize features within an example.
Core relation: the same normalize-then-affine form, but with mean and variance computed over the feature dimension of each example, so it behaves identically at training and test time and does not depend on batch size.
AI connection. LayerNorm is one of the basic stabilizers behind modern transformers.
8.3 Dropout
Main idea. Randomly mask activations during training.
Core relation: during training, multiply activations by a mask $m \sim \mathrm{Bernoulli}(1 - p)$ and rescale by $1/(1 - p)$ (inverted dropout); at test time the layer is the identity.
8.4 Weight decay
Main idea. Penalize large weights.
Core relation: add $\tfrac{\lambda}{2}\|\theta\|_2^2$ to the loss, which for plain SGD is equivalent to shrinking parameters by $\eta \lambda \theta$ each step; with adaptive optimizers, the decoupled form (as in AdamW) is the common choice.
8.5 Early stopping
Main idea. Stop when validation loss stops improving.
Core relation: track validation loss during training and keep the parameters from the best checkpoint; this limits effective model complexity without changing the objective.
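Minimal NumPy sketches of LayerNorm (8.2) and inverted dropout (8.3); parameter names like `gamma`, `beta`, and `p` follow common convention but are assumptions here:

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each example over its feature dimension, then rescale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def dropout(h, p=0.1, training=True, seed=0):
    """Inverted dropout: mask and rescale during training, identity at evaluation time."""
    if not training or p == 0.0:
        return h
    mask = np.random.default_rng(seed).random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
H = rng.normal(loc=3.0, scale=2.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)

H_norm = layer_norm(H, gamma, beta)
print(np.round(H_norm.mean(axis=1), 4))     # per-example means near 0
print(np.round(H_norm.std(axis=1), 4))      # per-example stds near 1
print(dropout(H, p=0.5).mean(), H.mean())   # means roughly match in expectation
```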
9. Expressivity and Generalization
This part studies expressivity and generalization: what networks can represent and why trained networks predict well on new data. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Universal approximation | Wide nonlinear networks can approximate many functions | |
| Depth efficiency | Some functions are represented more compactly with depth | |
| Overparameterization | Large networks can fit the data yet generalize with the right training biases | |
| Double descent | Test error can be non-monotonic in model size | |
| Inductive bias | Architecture and optimization shape which functions are easy to learn | |
9.1 Universal approximation
Main idea. Wide nonlinear networks can approximate many functions.
Core relation: a single sufficiently wide hidden layer with a suitable nonlinearity can approximate any continuous function on a compact set to arbitrary accuracy; the theorem says nothing about how many units are needed or whether gradient descent will find the approximation.
9.2 Depth efficiency
Main idea. Some functions are represented more compactly with depth.
Core relation: certain functions that a deep network represents with a modest number of units require exponentially many units in a shallow network, which is one argument for depth beyond raw width.
9.3 Overparameterization
Main idea. Large networks can fit the data yet generalize with the right training biases.
Core relation: modern networks often have far more parameters than training examples and can fit the training set exactly, yet still achieve low test error; the optimizer, initialization, and regularizers select well-behaved solutions among the many that fit.
9.4 Double descent
Main idea. Test error can be non-monotonic in model size.
Core relation: as capacity grows, test error can fall, rise near the point where the model first interpolates the training data, and then fall again in the heavily overparameterized regime.
9.5 Inductive bias
Main idea. Architecture and optimization shape which functions are easy to learn.
Core relation: the architecture, the initialization, and the optimizer jointly determine which solutions gradient descent reaches first, so two models with identical capacity can generalize very differently.
10. Diagnostics
This part studies diagnostics: the measurements that separate healthy training from broken training. Keep track of forward values, backward gradients, scale, and diagnostics.
| Subtopic | Main idea | Formula |
|---|---|---|
| Shape checks | Track batch and feature axes at each layer | $(B, d_{\text{in}}) \to (B, d_{\text{out}})$ |
| Activation statistics | Watch means, variances, and dead units | |
| Gradient norms | Monitor gradients layer by layer | $\|\nabla_{W_l} \mathcal{L}\|$ |
| Train validation curves | Separate optimization failure from overfitting | |
| Ablations | Compare width, depth, activation, optimizer, and normalization | |
10.1 Shape checks
Main idea. Track batch and feature axes at each layer.
Core relation: every layer should map (B, d_in) to (B, d_out) as in the shape map above; assert shapes at layer boundaries so mistakes fail loudly instead of silently broadcasting.
10.2 Activation statistics
Main idea. Watch means, variances, and dead units.
Core relation: log per-layer activation mean, standard deviation, and the fraction of exactly-zero ReLU outputs; drifting statistics or a growing dead fraction usually point at initialization, learning rate, or normalization problems.
10.3 Gradient norms
Main idea. Monitor gradients layer by layer.
Core relation: track $\|\nabla_{W_l} \mathcal{L}\|$ for each layer; norms that shrink toward zero or grow without bound localize vanishing and exploding gradients to specific depths.
AI connection. Layer-wise gradient norms are the first place to look when a network does not train.
10.4 Train validation curves
Main idea. Separate optimization failure from overfitting.
Core relation: training loss falling while validation loss rises suggests overfitting; both curves staying flat suggests an optimization, data, or implementation problem rather than a capacity problem.
10.5 Ablations
Main idea. Compare width, depth, activation, optimizer, and normalization.
Core relation: change one factor at a time against a fixed baseline and rerun; a diagnosis that survives an ablation is far more trustworthy than one inferred from a single run.
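A sketch of logging activation statistics during the forward pass, in the spirit of 10.1 and 10.2; per-layer gradient norms (10.3) can be logged the same way from the backward pass. The names and sizes are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_with_stats(X, params):
    """Forward pass that records per-layer activation statistics for debugging."""
    H, stats = X, []
    for i, (W, b) in enumerate(params):
        Z = H @ W.T + b
        H = relu(Z)
        stats.append({
            "layer": i,
            "act_mean": float(H.mean()),
            "act_std": float(H.std()),
            "dead_frac": float((H == 0.0).mean()),   # fraction of dead ReLU outputs
            "weight_norm": float(np.linalg.norm(W)),
        })
    return H, stats

rng = np.random.default_rng(0)
sizes = [20, 64, 64, 64]
params = [(rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
_, stats = forward_with_stats(rng.normal(size=(32, sizes[0])), params)
for row in stats:
    print(row)
```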
Practice Exercises
- Compute one affine layer.
- Apply ReLU and its derivative.
- Compute a two-layer forward pass.
- Compute softmax cross-entropy.
- Compute an affine-layer gradient.
- Run a finite-difference gradient check.
- Compare Xavier and He initialization scales.
- Apply dropout with inverted scaling.
- Compute LayerNorm for one example.
- Write a neural-network debugging checklist.
Why This Matters for AI
Transformers, CNNs, RNNs, diffusion models, and reward models are all neural networks. The same concepts repeat everywhere: differentiable computation, learned representations, chain-rule gradients, initialization, normalization, regularization, and diagnostics.
Bridge to Probabilistic Models
The next section studies probabilistic models. Neural networks often parameterize probability distributions, so the next step is to connect learned functions with likelihoods, latent variables, and uncertainty.
References
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning", 2016: https://www.deeplearningbook.org/
- David Rumelhart, Geoffrey Hinton, and Ronald Williams, "Learning representations by back-propagating errors", Nature, 1986: https://www.nature.com/articles/323533a0
- Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010: https://proceedings.mlr.press/v9/glorot10a.html
- Kaiming He et al., "Delving Deep into Rectifiers", 2015: https://arxiv.org/abs/1502.01852