Math for LLMs / Math for Specific Models

Neural Networks

Notes

Neural networks learn nonlinear feature maps and train them end to end with backpropagation. They are linear algebra, nonlinear activations, loss functions, and chain-rule gradients stacked into a trainable program.

Overview

A feed-forward network is a composition:

f_\theta(x)=f_L(f_{L-1}(\cdots f_1(x))).

Each layer usually computes an affine map followed by a nonlinearity:

h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell).

Backpropagation computes gradients of the loss with respect to every parameter by reusing intermediate derivatives from the output back to the input.

Prerequisites

  • Linear models and matrix multiplication
  • Chain rule and gradients
  • Cross-entropy and least-squares loss
  • Basic optimization vocabulary

Companion Notebooks

  • theory.ipynb: demonstrates forward passes, backprop gradients, activations, initialization, gradient descent, Adam, normalization, dropout, and diagnostics.
  • exercises.ipynb: ten practice problems for neural-network computations and debugging.

Learning Objectives

After this section, you should be able to:

  • Explain why nonlinear activations are necessary.
  • Compute a forward pass through a small MLP.
  • Derive backprop gradients for affine layers and activations.
  • Compare sigmoid, tanh, ReLU, GELU, and gated activations.
  • Explain Xavier and He initialization.
  • Implement SGD and momentum, build intuition for Adam, and apply gradient clipping.
  • Explain BatchNorm, LayerNorm, dropout, and weight decay.
  • Diagnose activation scale, gradient norms, overfitting, and optimization failure.

Table of Contents

  1. From Linear Models to Neural Networks
  2. Forward Pass
  3. Loss Functions
  4. Backpropagation
  5. Activations
  6. Initialization and Signal Propagation
  7. Optimization
  8. Normalization and Regularization
  9. Expressivity and Generalization
  10. Diagnostics

Shape Map

input batch:        X       shape (B, d_in)
layer weights:      W_l     shape (d_out, d_in)
pre-activation:     Z_l     shape (B, d_out)
activation:         H_l     shape (B, d_out)
logits:             Z       shape (B, classes)

1. From Linear Models to Neural Networks

This part studies the step from linear models to neural networks as trainable representation learning. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Learned feature map: make features trainable instead of fixed. Formula: h=f_\theta(x)
  • Linear head: the final prediction is often linear on learned features. Formula: \hat y=Wh+b
  • Composition: depth composes many simple maps. Formula: f=f_L\circ\cdots\circ f_1
  • Nonlinearity: without nonlinear activations, stacked linear layers collapse to one linear map. Formula: W_2W_1x
  • Representation learning: hidden layers learn intermediate coordinates useful for the task. Formula: h_\ell

1.1 Learned feature map

Main idea. Make features trainable instead of fixed.

Core relation:

h=f_\theta(x)

A neural network is a differentiable program made from parameterized layers. The forward pass computes predictions. The backward pass uses the chain rule to assign credit to every parameter. Training works when signals and gradients stay numerically healthy across depth.

Worked micro-example. A two-layer network computes h=\mathrm{ReLU}(W_1x+b_1) and \hat y=W_2h+b_2. If W_1x+b_1 is negative for a hidden unit, ReLU outputs zero and the local gradient through that unit is also zero for that example.

Implementation check. Log activation means, activation standard deviations, loss, gradient norms, and parameter update norms. A falling training loss is useful; a stable diagnostic picture is better.

AI connection. This is a practical neural-network control variable.

Common mistake. Do not debug deep networks only from the final loss. The final loss is a symptom; layer statistics often reveal the cause.

1.2 Linear head

Main idea. Final prediction is often linear on learned features.

Core relation:

\hat y=Wh+b


1.3 Composition

Main idea. Depth composes many simple maps.

Core relation:

f=f_L\circ\cdots\circ f_1


1.4 Nonlinearity

Main idea. Without nonlinear activations, stacked linear layers collapse to one linear map.

Core relation:

W_2W_1x

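The collapse is easy to verify numerically. The sketch below (a minimal NumPy check with arbitrary small matrices, not from the companion notebooks) shows that two stacked linear layers equal a single linear layer, while an intermediate ReLU breaks the equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # batch of 4 inputs with 3 features
W1 = rng.normal(size=(5, 3))   # first linear layer
W2 = rng.normal(size=(2, 5))   # second linear layer

# Two stacked linear layers are exactly one linear layer with weights W2 @ W1.
print(np.allclose(x @ W1.T @ W2.T, x @ (W2 @ W1).T))   # True

# Inserting a ReLU between the layers breaks the collapse.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(relu(x @ W1.T) @ W2.T, x @ (W2 @ W1).T))   # False in general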

1.5 Representation learning

Main idea. Hidden layers learn intermediate coordinates useful for the task.

Core relation:

h_\ell


2. Forward Pass

This part studies the forward pass, which turns inputs into predictions layer by layer. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Affine layer: matrix multiply plus bias. Formula: z=Wx+b
  • Activation: apply an elementwise nonlinearity. Formula: a=\phi(z)
  • MLP layer stack: repeat affine and nonlinear transformations. Formula: h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell)
  • Batch vectorization: process many examples together. Formula: H_{\ell+1}=\phi(H_\ell W_\ell^\top+\mathbf{1}b_\ell^\top)
  • Output logits: classification networks usually produce logits before softmax. Formula: z_K=W_Kh_K+b_K

2.1 Affine layer

Main idea. Matrix multiply plus bias.

Core relation:

z=Wx+b


2.2 Activation

Main idea. Apply elementwise nonlinearity.

Core relation:

a=\phi(z)


2.3 MLP layer stack

Main idea. Repeat affine and nonlinear transformations.

Core relation:

h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell)


2.4 Batch vectorization

Main idea. Process many examples together.

Core relation:

H_{\ell+1}=\phi(H_\ell W_\ell^\top+\mathbf{1}b_\ell^\top)

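A minimal batched forward pass, assuming the Shape Map conventions above (weights stored as (d_out, d_in), examples as rows); the layer sizes and random inputs are placeholders.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params):
    # Batched MLP forward pass: H_{l+1} = relu(H_l @ W_l.T + b_l)
    H = X
    for W, b in params[:-1]:
        H = relu(H @ W.T + b)      # bias broadcasts across the batch dimension
    W, b = params[-1]
    return H @ W.T + b             # final layer returns logits, no activation

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]             # d_in = 8, two hidden layers, 3 classes
params = [(rng.normal(scale=0.1, size=(d_out, d_in)), np.zeros(d_out))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]
logits = forward(rng.normal(size=(32, 8)), params)
print(logits.shape)                # (32, 3)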

2.5 Output logits

Main idea. Classification networks usually produce logits before softmax.

Core relation:

z_K=W_Kh_K+b_K


3. Loss Functions

This part studies the loss functions that training minimizes. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Regression MSE: penalize squared prediction error. Formula: L=\frac{1}{n}\sum_i\Vert\hat y_i-y_i\Vert^2
  • Binary cross-entropy: Bernoulli negative log likelihood. Formula: -\left[y\log p+(1-y)\log(1-p)\right]
  • Softmax cross-entropy: multi-class negative log likelihood. Formula: L=-\log p_y
  • Regularized objective: add parameter penalties or other constraints. Formula: L_\mathrm{total}=L+\lambda R(\theta)
  • Empirical risk: training minimizes average loss over the data. Formula: \hat R(\theta)=n^{-1}\sum_i\ell(f_\theta(x_i),y_i)

3.1 Regression MSE

Main idea. Penalize squared prediction error.

Core relation:

L=\frac{1}{n}\sum_i\Vert \hat y_i-y_i\Vert^2


3.2 Binary cross-entropy

Main idea. Bernoulli negative log likelihood.

Core relation:

-\left[y\log p+(1-y)\log(1-p)\right]


3.3 Softmax cross-entropy

Main idea. Multi-class negative log likelihood.

Core relation:

L=-\log p_y

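A numerically stable softmax cross-entropy sketch: subtracting the row maximum before exponentiating is the standard guard against overflow. The toy logits and labels are made up for illustration.

import numpy as np

def softmax_cross_entropy(logits, labels):
    # Mean of -log p_y over the batch, computed in log space for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
print(softmax_cross_entropy(logits, labels))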

3.4 Regularized objective

Main idea. Add parameter penalties or other constraints.

Core relation:

L_\mathrm{total}=L+\lambda R(\theta)


3.5 Empirical risk

Main idea. Training minimizes average loss over data.

Core relation:

\hat R(\theta)=n^{-1}\sum_i\ell(f_\theta(x_i),y_i)


4. Backpropagation

This part studies backpropagation, which assigns a gradient to every parameter. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Chain rule: gradients flow backward through composed functions. Formula: \partial L/\partial x=(\partial z/\partial x)^\top\,\partial L/\partial z
  • Layer gradient: the weight gradient is the outer product of the upstream error and the input. Formula: \partial L/\partial W=\delta x^\top
  • Activation derivative: the nonlinearity gates gradient flow. Formula: \delta_z=\delta_a\odot\phi'(z)
  • Reverse accumulation: store intermediate activations and traverse backward. Formula: \delta_L\rightarrow\delta_{L-1}\rightarrow\cdots
  • Gradient check: finite differences verify the backprop implementation. Formula: \frac{L(\theta+\epsilon)-L(\theta-\epsilon)}{2\epsilon}

4.1 Chain rule

Main idea. Gradients flow backward through composed functions.

Core relation:

\partial L/\partial x=(\partial z/\partial x)^\top\,\partial L/\partial z

AI connection. Backprop is the chain rule organized so every parameter receives credit efficiently.


4.2 Layer gradient

Main idea. Weight gradient is outer product of upstream error and input.

Core relation:

\partial L/\partial W=\delta x^\top

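A sketch of the affine-layer backward rule for a batch, assuming Z = X @ W.T + b as in the Shape Map; summing per-example outer products gives the batched form of \partial L/\partial W=\delta x^\top. The upstream gradient here is random, standing in for whatever the later layers produce.

import numpy as np

def affine_backward(dZ, X, W):
    dW = dZ.T @ X            # (d_out, d_in): sum of outer products delta x^T
    db = dZ.sum(axis=0)      # bias gradient sums over the batch
    dX = dZ @ W              # gradient passed back to the previous layer
    return dX, dW, db

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 3))
dZ = rng.normal(size=(4, 5))               # upstream gradient dL/dZ
dX, dW, db = affine_backward(dZ, X, W)
print(dX.shape, dW.shape, db.shape)        # (4, 3) (5, 3) (5,)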

4.3 Activation derivative

Main idea. Nonlinearity gates gradient flow.

Core relation:

\delta_z=\delta_a\odot\phi'(z)


4.4 Reverse accumulation

Main idea. Store intermediate activations and traverse backward.

Core relation:

\delta_L\rightarrow\delta_{L-1}\rightarrow\cdots


4.5 Gradient check

Main idea. Finite differences verify backprop implementation.

Core relation:

\frac{L(\theta+\epsilon)-L(\theta-\epsilon)}{2\epsilon}

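A central-difference gradient check sketch. The loss here is a simple quadratic whose analytic gradient is known, so the maximum discrepancy should be near machine precision; the tolerance you accept in practice depends on the loss and on the step size eps.

import numpy as np

def loss(theta):
    return 0.5 * np.sum(theta ** 2)            # analytic gradient is theta

def grad_check(f, theta, analytic, eps=1e-6):
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump.flat[i] = eps
        numeric.flat[i] = (f(theta + bump) - f(theta - bump)) / (2 * eps)
    return np.max(np.abs(numeric - analytic))

theta = np.random.default_rng(0).normal(size=5)
print(grad_check(loss, theta, theta))          # ~1e-10 or smaller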

5. Activations

This part studies activation functions and how they shape signal and gradient flow. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Sigmoid: squashes to (0,1) but can saturate. Formula: \sigma(z)=1/(1+e^{-z})
  • Tanh: zero-centered saturating activation. Formula: \tanh z
  • ReLU: simple piecewise-linear activation. Formula: \max(0,z)
  • GELU: smooth activation used in many transformers. Formula: x\Phi(x)
  • SwiGLU: gated activation used in modern LLM MLPs. Formula: (xW_1)\odot\mathrm{swish}(xW_2)

5.1 Sigmoid

Main idea. Squash to (0,1) but can saturate.

Core relation:

\sigma(z)=1/(1+e^{-z})


5.2 Tanh

Main idea. Zero-centered saturating activation.

Core relation:

\tanh z


5.3 ReLU

Main idea. Simple piecewise linear activation.

Core relation:

\max(0,z)

AI connection. ReLU made deep feed-forward networks easier to optimize than saturating activations in many settings.


5.4 GELU

Main idea. Smooth activation used in many transformers.

Core relation:

x\Phi(x)


5.5 SwiGLU

Main idea. Gated activation used in modern LLM MLPs.

Core relation:

(xW_1)\odot\mathrm{swish}(xW_2)

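A minimal SwiGLU feed-forward sketch following the gating formula above. The hidden width, the separate W1/W2 matrices, and the output projection are illustrative assumptions rather than any specific model's layout.

import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))              # swish(z) = z * sigmoid(z)

def swiglu_ffn(x, W1, W2, W_out):
    gated = (x @ W1) * swish(x @ W2)           # (x W1) gated by swish(x W2)
    return gated @ W_out                       # project back to the model width

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.normal(size=(4, d_model))
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_out = rng.normal(scale=0.1, size=(d_hidden, d_model))
print(swiglu_ffn(x, W1, W2, W_out).shape)      # (4, 8)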

6. Initialization and Signal Propagation

This part studies initialization and how signal scale propagates across depth. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Variance propagation: activation variance should not explode or vanish across layers. Formula: \mathrm{Var}(h_\ell)
  • Xavier initialization: balance fan-in and fan-out for tanh-like activations. Formula: \mathrm{Var}(W)=2/(n_{in}+n_{out})
  • He initialization: scale for ReLU-like activations. Formula: \mathrm{Var}(W)=2/n_{in}
  • Symmetry breaking: random weights let units learn different features. Formula: W_i\ne W_j
  • Depth instability: bad initialization makes gradients vanish or explode. Formula: \prod_\ell J_\ell

6.1 Variance propagation

Main idea. Activation variance should not explode or vanish across layers.

Core relation:

\mathrm{Var}(h_\ell)


6.2 Xavier initialization

Main idea. Balance fan-in and fan-out for tanh-like activations.

Core relation:

\mathrm{Var}(W)=2/(n_{in}+n_{out})


6.3 He initialization

Main idea. Scale for ReLU-like activations.

Core relation:

\mathrm{Var}(W)=2/n_{in}

AI connection. Initialization is not decoration; it controls signal scale at depth.

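A sketch that compares the two scales empirically: push a standardized batch through a deep stack of ReLU layers and watch the activation standard deviation. Depth, width, and batch size are arbitrary; the qualitative gap is the point.

import numpy as np

def deep_relu_std(init_std, depth=20, width=256, batch=512, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=init_std, size=(width, width))
        h = np.maximum(0.0, h @ W.T)
    return h.std()

width = 256
xavier = np.sqrt(2.0 / (width + width))        # Var(W) = 2 / (n_in + n_out)
he = np.sqrt(2.0 / width)                      # Var(W) = 2 / n_in
print("Xavier scale, 20 ReLU layers:", deep_relu_std(xavier))   # shrinks toward 0
print("He scale, 20 ReLU layers:    ", deep_relu_std(he))       # stays near 1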

6.4 Symmetry breaking

Main idea. Random weights let units learn different features.

Core relation:

W_i\ne W_j


6.5 Depth instability

Main idea. Bad initialization makes gradients vanish or explode.

Core relation:

\prod_\ell J_\ell


7. Optimization

This part studies the optimizers used to train networks. Keep track of forward values, backward gradients, scale, and diagnostics.

  • SGD: update parameters using mini-batch gradients. Formula: \theta\leftarrow\theta-\eta g
  • Momentum: smooth updates with a velocity term. Formula: v\leftarrow\beta v+g
  • Adam: normalize the first-moment estimate by the second-moment estimate. Formula: \theta\leftarrow\theta-\eta\hat m/(\sqrt{\hat v}+\epsilon)
  • Learning-rate schedule: change the step size over training. Formula: \eta_t
  • Gradient clipping: cap the gradient norm to prevent unstable updates. Formula: g\leftarrow g\min(1,c/\Vert g\Vert)

7.1 SGD

Main idea. Update parameters using mini-batch gradients.

Core relation:

\theta\leftarrow\theta-\eta g


7.2 Momentum

Main idea. Smooth updates with velocity.

Core relation:

v\leftarrow\beta v+g


7.3 Adam

Main idea. Normalize first moment by second moment estimate.

Core relation:

\theta\leftarrow\theta-\eta\hat m/(\sqrt{\hat v}+\epsilon)

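A minimal Adam step sketch matching the update rule above, with the usual bias-corrected moment estimates. The hyperparameters are the common defaults, and the quadratic toy objective is only for illustration.

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = theta                                # gradient of 0.5 * ||theta||^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)                                 # moves toward the minimum at 0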

7.4 Learning-rate schedule

Main idea. Change step size over training.

Core relation:

\eta_t


7.5 Gradient clipping

Main idea. Cap gradient norm to prevent unstable updates.

Core relation:

g\leftarrow g\min(1,c/\Vert g\Vert)

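A sketch of clipping by global norm, following the rule above applied jointly to all parameter groups; the threshold is arbitrary.

import numpy as np

def clip_by_global_norm(grads, max_norm):
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped, before = clip_by_global_norm(grads, max_norm=5.0)
after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(before, after)                                  # 13.0 5.0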

8. Normalization and Regularization

This part studies normalization and regularization, which keep training stable and limit overfitting. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Batch normalization: normalize features using batch statistics. Formula: \hat x=(x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}
  • Layer normalization: normalize features within an example. Formula: \hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}
  • Dropout: randomly mask activations during training. Formula: \tilde h=m\odot h/(1-p)
  • Weight decay: penalize large weights. Formula: L+\lambda\Vert\theta\Vert^2
  • Early stopping: stop when the validation loss stops improving. Formula: L_\mathrm{val}

8.1 Batch normalization

Main idea. Normalize features using batch statistics.

Core relation:

\hat x=(x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}


8.2 Layer normalization

Main idea. Normalize features within an example.

Core relation:

\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}

AI connection. LayerNorm is one of the basic stabilizers behind modern transformers.

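A LayerNorm sketch that normalizes each example over its feature axis and then applies a learnable scale and shift (gamma and beta start at the conventional 1 and 0 here).

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # per-example statistics
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 10.0]])
gamma, beta = np.ones(4), np.zeros(4)
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))     # per-row mean ~0; std ~1 (0 for the constant row)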

8.3 Dropout

Main idea. Randomly mask activations during training.

Core relation:

\tilde h=m\odot h/(1-p)

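A sketch of inverted dropout: mask activations with probability p during training and rescale the survivors by 1/(1-p), so inference needs no adjustment. The rate p = 0.25 is only an example.

import numpy as np

def dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p            # keep each unit with probability 1 - p
    return mask * h / (1.0 - p)                # inverted scaling preserves the expectation

rng = np.random.default_rng(0)
h = np.ones((2, 8))
h_train = dropout(h, p=0.25, rng=rng)
print(h_train)            # zeros and 1.333... entries
print(h_train.mean())     # close to 1 in expectation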

8.4 Weight decay

Main idea. Penalize large weights.

Core relation:

L+\lambda\Vert\theta\Vert^2


8.5 Early stopping

Main idea. Stop when validation loss stops improving.

Core relation:

L_\mathrm{val}


9. Expressivity and Generalization

This part studies expressivity and generalization. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Universal approximation: wide nonlinear networks can approximate many functions. Formula: f_\theta\approx f
  • Depth efficiency: some functions are represented more compactly with depth. Formula: L>1
  • Overparameterization: large networks can fit data yet generalize with the right training biases. Formula: P\gg n
  • Double descent: test error can be non-monotonic in model size. Formula: E_\mathrm{test}(P)
  • Inductive bias: architecture and optimization shape which functions are easy to learn. Formula: \theta_\mathrm{SGD}

9.1 Universal approximation

Main idea. Wide nonlinear networks can approximate many functions.

Core relation:

f_\theta\approx f


9.2 Depth efficiency

Main idea. Some functions are represented more compactly with depth.

Core relation:

L>1


9.3 Overparameterization

Main idea. Large networks can fit data yet generalize with the right training biases.

Core relation:

P\gg n


9.4 Double descent

Main idea. Test error can be non-monotonic in model size.

Core relation:

E_\mathrm{test}(P)


9.5 Inductive bias

Main idea. Architecture and optimization shape what functions are easy to learn.

Core relation:

\theta_\mathrm{SGD}


10. Diagnostics

This part studies practical diagnostics for training runs. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Shape checks: track batch and feature axes at each layer. Formula: (B,d_\ell)
  • Activation statistics: watch means, variances, and dead units. Formula: \mu_\ell,\sigma_\ell
  • Gradient norms: monitor gradients by layer. Formula: \Vert\nabla_{\theta_\ell}L\Vert
  • Train/validation curves: separate optimization failure from overfitting. Formula: L_\mathrm{train},L_\mathrm{val}
  • Ablations: compare width, depth, activation, optimizer, and normalization. Formula: \Delta L

10.1 Shape checks

Main idea. Track batch and feature axes at each layer.

Core relation:

(B,d_\ell)


10.2 Activation statistics

Main idea. Watch means, variances, and dead units.

Core relation:

\mu_\ell,\sigma_\ell


10.3 Gradient norms

Main idea. Monitor gradients by layer.

Core relation:

\Vert\nabla_{\theta_\ell}L\Vert

AI connection. Layer-wise gradient norms are the first place to look when a network does not train.

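A diagnostics sketch for the two-layer example network used throughout this section: run one forward and backward pass on placeholder data and log the quantities named above (activation scale, dead-unit fraction, per-layer gradient norms).

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 8)), rng.normal(size=(32, 1))
W1, b1 = rng.normal(scale=0.5, size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(1)

# Forward pass with an MSE loss.
Z1 = X @ W1.T + b1
H1 = np.maximum(0.0, Z1)
Y_hat = H1 @ W2.T + b2
loss = np.mean((Y_hat - y) ** 2)

# Backward pass written out by hand (chain rule).
dY = 2 * (Y_hat - y) / len(X)
dW2 = dY.T @ H1
dZ1 = (dY @ W2) * (Z1 > 0)
dW1 = dZ1.T @ X

print(f"loss={loss:.4f}")
print(f"H1 mean={H1.mean():.3f} std={H1.std():.3f} dead={float((H1 == 0).mean()):.2f}")
print(f"grad norms: ||dW1||={np.linalg.norm(dW1):.3f} ||dW2||={np.linalg.norm(dW2):.3f}")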

10.4 Train validation curves

Main idea. Separate optimization failure from overfitting.

Core relation:

L_\mathrm{train},L_\mathrm{val}


10.5 Ablations

Main idea. Compare width, depth, activation, optimizer, and normalization.

Core relation:

\Delta L



Practice Exercises

  1. Compute one affine layer.
  2. Apply ReLU and its derivative.
  3. Compute a two-layer forward pass.
  4. Compute softmax cross-entropy.
  5. Compute an affine-layer gradient.
  6. Run a finite-difference gradient check.
  7. Compare Xavier and He initialization scales.
  8. Apply dropout with inverted scaling.
  9. Compute LayerNorm for one example.
  10. Write a neural-network debugging checklist.

Why This Matters for AI

Transformers, CNNs, RNNs, diffusion models, and reward models are all neural networks. The same concepts repeat everywhere: differentiable computation, learned representations, chain-rule gradients, initialization, normalization, regularization, and diagnostics.

Bridge to Probabilistic Models

The next section studies probabilistic models. Neural networks often parameterize probability distributions, so the next step is to connect learned functions with likelihoods, latent variables, and uncertainty.
