Math for LLMs / Math for Specific Models

Neural Networks

Notes

Neural networks learn nonlinear feature maps and train them end to end with backpropagation. They are linear algebra, nonlinear activations, loss functions, and chain-rule gradients stacked into a trainable program.

Overview

A feed-forward network is a composition:

f_\theta(x)=f_L(f_{L-1}(\cdots f_1(x))).

Each layer usually computes an affine map followed by a nonlinearity:

h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell).

Backpropagation computes gradients of the loss with respect to every parameter by reusing intermediate derivatives from the output back to the input.

Prerequisites

  • Linear models and matrix multiplication
  • Chain rule and gradients
  • Cross-entropy and least-squares loss
  • Basic optimization vocabulary

Companion Notebooks

  • theory.ipynb: demonstrates forward passes, backprop gradients, activations, initialization, gradient descent, Adam, normalization, dropout, and diagnostics.
  • exercises.ipynb: ten practice problems for neural-network computations and debugging.

Learning Objectives

After this section, you should be able to:

  • Explain why nonlinear activations are necessary.
  • Compute a forward pass through a small MLP.
  • Derive backprop gradients for affine layers and activations.
  • Compare sigmoid, tanh, ReLU, GELU, and gated activations.
  • Explain Xavier and He initialization.
  • Implement SGD and momentum, build intuition for Adam, and apply gradient clipping.
  • Explain BatchNorm, LayerNorm, dropout, and weight decay.
  • Diagnose activation scale, gradient norms, overfitting, and optimization failure.

Table of Contents

  1. From Linear Models to Neural Networks
  2. Forward Pass
  3. Loss Functions
  4. Backpropagation
  5. Activations
  6. Initialization and Signal Propagation
  7. Optimization
  8. Normalization and Regularization
  9. Expressivity and Generalization
  10. Diagnostics

Shape Map

input batch:        X       shape (B, d_in)
layer weights:      W_l     shape (d_out, d_in)
pre-activation:     Z_l     shape (B, d_out)
activation:         H_l     shape (B, d_out)
logits:             Z       shape (B, classes)

1. From Linear Models to Neural Networks

This part studies the step from linear models to neural networks as trainable representation learning. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Learned feature map: make features trainable instead of fixed. Formula: h=f_\theta(x)
  • Linear head: the final prediction is often linear on learned features. Formula: \hat y=Wh+b
  • Composition: depth composes many simple maps. Formula: f=f_L\circ\cdots\circ f_1
  • Nonlinearity: without nonlinear activations, stacked linear layers collapse to one linear map. Formula: W_2W_1x
  • Representation learning: hidden layers learn intermediate coordinates useful for the task. Formula: h_\ell

1.1 Learned feature map

Main idea. Make features trainable instead of fixed.

Core relation:

h=f_\theta(x)

A neural network is a differentiable program made from parameterized layers. The forward pass computes predictions. The backward pass uses the chain rule to assign credit to every parameter. Training works when signals and gradients stay numerically healthy across depth.

Worked micro-example. A two-layer network computes h=\mathrm{ReLU}(W_1x+b_1) and \hat y=W_2h+b_2. If W_1x+b_1 is negative for a hidden unit, ReLU outputs zero and the local gradient through that unit is also zero for that example.

Implementation check. Log activation means, activation standard deviations, loss, gradient norms, and parameter update norms. A falling training loss is useful; a stable diagnostic picture is better.

AI connection. This is a practical neural-network control variable.

Common mistake. Do not debug deep networks only from the final loss. The final loss is a symptom; layer statistics often reveal the cause.

1.2 Linear head

Main idea. Final prediction is often linear on learned features.

Core relation:

\hat y=Wh+b


1.3 Composition

Main idea. Depth composes many simple maps.

Core relation:

f=f_L\circ\cdots\circ f_1


1.4 Nonlinearity

Main idea. Without nonlinear activations, stacked linear layers collapse to one linear map.

Core relation:

W_2W_1x

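The collapse is easy to verify numerically. The sketch below (a minimal NumPy check with arbitrary small matrices, not from the companion notebooks) shows that two stacked linear layers equal a single linear layer, while an intermediate ReLU breaks the equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # batch of 4 inputs with 3 features
W1 = rng.normal(size=(5, 3))   # first linear layer
W2 = rng.normal(size=(2, 5))   # second linear layer

# Two stacked linear layers are exactly one linear layer with weights W2 @ W1.
print(np.allclose(x @ W1.T @ W2.T, x @ (W2 @ W1).T))   # True

# Inserting a ReLU between the layers breaks the collapse.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(relu(x @ W1.T) @ W2.T, x @ (W2 @ W1).T))   # False in general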

1.5 Representation learning

Main idea. Hidden layers learn intermediate coordinates useful for the task.

Core relation:

h_\ell


2. Forward Pass

This part studies the forward pass, which turns inputs into predictions layer by layer. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Affine layer: matrix multiply plus bias. Formula: z=Wx+b
  • Activation: apply an elementwise nonlinearity. Formula: a=\phi(z)
  • MLP layer stack: repeat affine and nonlinear transformations. Formula: h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell)
  • Batch vectorization: process many examples together. Formula: H_{\ell+1}=\phi(H_\ell W_\ell^\top+\mathbf{1}b_\ell^\top)
  • Output logits: classification networks usually produce logits before softmax. Formula: z_K=W_Kh_K+b_K

2.1 Affine layer

Main idea. Matrix multiply plus bias.

Core relation:

z=Wx+b


2.2 Activation

Main idea. Apply elementwise nonlinearity.

Core relation:

a=\phi(z)


2.3 MLP layer stack

Main idea. Repeat affine and nonlinear transformations.

Core relation:

h_{\ell+1}=\phi(W_\ell h_\ell+b_\ell)


2.4 Batch vectorization

Main idea. Process many examples together.

Core relation:

H_{\ell+1}=\phi(H_\ell W_\ell^\top+\mathbf{1}b_\ell^\top)

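A minimal batched forward pass, assuming the Shape Map conventions above (weights stored as (d_out, d_in), examples as rows); the layer sizes and random inputs are placeholders.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params):
    # Batched MLP forward pass: H_{l+1} = relu(H_l @ W_l.T + b_l)
    H = X
    for W, b in params[:-1]:
        H = relu(H @ W.T + b)      # bias broadcasts across the batch dimension
    W, b = params[-1]
    return H @ W.T + b             # final layer returns logits, no activation

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]             # d_in = 8, two hidden layers, 3 classes
params = [(rng.normal(scale=0.1, size=(d_out, d_in)), np.zeros(d_out))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]
logits = forward(rng.normal(size=(32, 8)), params)
print(logits.shape)                # (32, 3)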

2.5 Output logits

Main idea. Classification networks usually produce logits before softmax.

Core relation:

z_K=W_Kh_K+b_K


3. Loss Functions

This part studies the loss functions that training minimizes. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Regression MSE: penalize squared prediction error. Formula: L=\frac{1}{n}\sum_i\Vert\hat y_i-y_i\Vert^2
  • Binary cross-entropy: Bernoulli negative log likelihood. Formula: -\left[y\log p+(1-y)\log(1-p)\right]
  • Softmax cross-entropy: multi-class negative log likelihood. Formula: L=-\log p_y
  • Regularized objective: add parameter penalties or other constraints. Formula: L_\mathrm{total}=L+\lambda R(\theta)
  • Empirical risk: training minimizes average loss over the data. Formula: \hat R(\theta)=n^{-1}\sum_i\ell(f_\theta(x_i),y_i)

3.1 Regression MSE

Main idea. Penalize squared prediction error.

Core relation:

L=\frac{1}{n}\sum_i\Vert \hat y_i-y_i\Vert^2


3.2 Binary cross-entropy

Main idea. Bernoulli negative log likelihood.

Core relation:

-\left[y\log p+(1-y)\log(1-p)\right]


3.3 Softmax cross-entropy

Main idea. Multi-class negative log likelihood.

Core relation:

L=-\log p_y

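A numerically stable softmax cross-entropy sketch: subtracting the row maximum before exponentiating is the standard guard against overflow. The toy logits and labels are made up for illustration.

import numpy as np

def softmax_cross_entropy(logits, labels):
    # Mean of -log p_y over the batch, computed in log space for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])
labels = np.array([0, 2])
print(softmax_cross_entropy(logits, labels))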

3.4 Regularized objective

Main idea. Add parameter penalties or other constraints.

Core relation:

L_\mathrm{total}=L+\lambda R(\theta)


3.5 Empirical risk

Main idea. Training minimizes average loss over data.

Core relation:

\hat R(\theta)=n^{-1}\sum_i\ell(f_\theta(x_i),y_i)


4. Backpropagation

This part studies backpropagation, which assigns a gradient to every parameter. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Chain rule: gradients flow backward through composed functions. Formula: \partial L/\partial x=(\partial z/\partial x)^\top\,\partial L/\partial z
  • Layer gradient: the weight gradient is the outer product of the upstream error and the input. Formula: \partial L/\partial W=\delta x^\top
  • Activation derivative: the nonlinearity gates gradient flow. Formula: \delta_z=\delta_a\odot\phi'(z)
  • Reverse accumulation: store intermediate activations and traverse backward. Formula: \delta_L\rightarrow\delta_{L-1}\rightarrow\cdots
  • Gradient check: finite differences verify the backprop implementation. Formula: \frac{L(\theta+\epsilon)-L(\theta-\epsilon)}{2\epsilon}

4.1 Chain rule

Main idea. Gradients flow backward through composed functions.

Core relation:

\partial L/\partial x=(\partial z/\partial x)^\top\,\partial L/\partial z

AI connection. Backprop is the chain rule organized so every parameter receives credit efficiently.


4.2 Layer gradient

Main idea. Weight gradient is outer product of upstream error and input.

Core relation:

\partial L/\partial W=\delta x^\top

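A sketch of the affine-layer backward rule for a batch, assuming Z = X @ W.T + b as in the Shape Map; summing per-example outer products gives the batched form of \partial L/\partial W=\delta x^\top. The upstream gradient here is random, standing in for whatever the later layers produce.

import numpy as np

def affine_backward(dZ, X, W):
    dW = dZ.T @ X            # (d_out, d_in): sum of outer products delta x^T
    db = dZ.sum(axis=0)      # bias gradient sums over the batch
    dX = dZ @ W              # gradient passed back to the previous layer
    return dX, dW, db

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 3))
dZ = rng.normal(size=(4, 5))               # upstream gradient dL/dZ
dX, dW, db = affine_backward(dZ, X, W)
print(dX.shape, dW.shape, db.shape)        # (4, 3) (5, 3) (5,)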

4.3 Activation derivative

Main idea. Nonlinearity gates gradient flow.

Core relation:

\delta_z=\delta_a\odot\phi'(z)


4.4 Reverse accumulation

Main idea. Store intermediate activations and traverse backward.

Core relation:

\delta_L\rightarrow\delta_{L-1}\rightarrow\cdots


4.5 Gradient check

Main idea. Finite differences verify backprop implementation.

Core relation:

\frac{L(\theta+\epsilon)-L(\theta-\epsilon)}{2\epsilon}

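A central-difference gradient check sketch. The loss here is a simple quadratic whose analytic gradient is known, so the maximum discrepancy should be near machine precision; the tolerance you accept in practice depends on the loss and on the step size eps.

import numpy as np

def loss(theta):
    return 0.5 * np.sum(theta ** 2)            # analytic gradient is theta

def grad_check(f, theta, analytic, eps=1e-6):
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump.flat[i] = eps
        numeric.flat[i] = (f(theta + bump) - f(theta - bump)) / (2 * eps)
    return np.max(np.abs(numeric - analytic))

theta = np.random.default_rng(0).normal(size=5)
print(grad_check(loss, theta, theta))          # ~1e-10 or smaller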

5. Activations

This part studies activation functions and how they shape signal and gradient flow. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Sigmoid: squashes to (0,1) but can saturate. Formula: \sigma(z)=1/(1+e^{-z})
  • Tanh: zero-centered saturating activation. Formula: \tanh z
  • ReLU: simple piecewise-linear activation. Formula: \max(0,z)
  • GELU: smooth activation used in many transformers. Formula: x\Phi(x)
  • SwiGLU: gated activation used in modern LLM MLPs. Formula: (xW_1)\odot\mathrm{swish}(xW_2)

5.1 Sigmoid

Main idea. Squash to (0,1) but can saturate.

Core relation:

\sigma(z)=1/(1+e^{-z})


5.2 Tanh

Main idea. Zero-centered saturating activation.

Core relation:

\tanh z


5.3 ReLU

Main idea. Simple piecewise linear activation.

Core relation:

\max(0,z)

AI connection. ReLU made deep feed-forward networks easier to optimize than saturating activations in many settings.


5.4 GELU

Main idea. Smooth activation used in many transformers.

Core relation:

x\Phi(x)


5.5 SwiGLU

Main idea. Gated activation used in modern LLM MLPs.

Core relation:

(xW_1)\odot\mathrm{swish}(xW_2)

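A minimal SwiGLU feed-forward sketch following the gating formula above. The hidden width, the separate W1/W2 matrices, and the output projection are illustrative assumptions rather than any specific model's layout.

import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))              # swish(z) = z * sigmoid(z)

def swiglu_ffn(x, W1, W2, W_out):
    gated = (x @ W1) * swish(x @ W2)           # (x W1) gated by swish(x W2)
    return gated @ W_out                       # project back to the model width

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.normal(size=(4, d_model))
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_out = rng.normal(scale=0.1, size=(d_hidden, d_model))
print(swiglu_ffn(x, W1, W2, W_out).shape)      # (4, 8)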

6. Initialization and Signal Propagation

This part studies initialization and how signal scale propagates across depth. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Variance propagation: activation variance should not explode or vanish across layers. Formula: \mathrm{Var}(h_\ell)
  • Xavier initialization: balance fan-in and fan-out for tanh-like activations. Formula: \mathrm{Var}(W)=2/(n_{in}+n_{out})
  • He initialization: scale for ReLU-like activations. Formula: \mathrm{Var}(W)=2/n_{in}
  • Symmetry breaking: random weights let units learn different features. Formula: W_i\ne W_j
  • Depth instability: bad initialization makes gradients vanish or explode. Formula: \prod_\ell J_\ell

6.1 Variance propagation

Main idea. Activation variance should not explode or vanish across layers.

Core relation:

\mathrm{Var}(h_\ell)


6.2 Xavier initialization

Main idea. Balance fan-in and fan-out for tanh-like activations.

Core relation:

\mathrm{Var}(W)=2/(n_{in}+n_{out})


6.3 He initialization

Main idea. Scale for ReLU-like activations.

Core relation:

\mathrm{Var}(W)=2/n_{in}

AI connection. Initialization is not decoration; it controls signal scale at depth.

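A sketch that compares the two scales empirically: push a standardized batch through a deep stack of ReLU layers and watch the activation standard deviation. Depth, width, and batch size are arbitrary; the qualitative gap is the point.

import numpy as np

def deep_relu_std(init_std, depth=20, width=256, batch=512, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=init_std, size=(width, width))
        h = np.maximum(0.0, h @ W.T)
    return h.std()

width = 256
xavier = np.sqrt(2.0 / (width + width))        # Var(W) = 2 / (n_in + n_out)
he = np.sqrt(2.0 / width)                      # Var(W) = 2 / n_in
print("Xavier scale, 20 ReLU layers:", deep_relu_std(xavier))   # shrinks toward 0
print("He scale, 20 ReLU layers:    ", deep_relu_std(he))       # stays near 1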

6.4 Symmetry breaking

Main idea. Random weights let units learn different features.

Core relation:

W_i\ne W_j


6.5 Depth instability

Main idea. Bad initialization makes gradients vanish or explode.

Core relation:

\prod_\ell J_\ell


7. Optimization

This part studies the optimizers used to train networks. Keep track of forward values, backward gradients, scale, and diagnostics.

  • SGD: update parameters using mini-batch gradients. Formula: \theta\leftarrow\theta-\eta g
  • Momentum: smooth updates with a velocity term. Formula: v\leftarrow\beta v+g
  • Adam: normalize the first-moment estimate by the second-moment estimate. Formula: \theta\leftarrow\theta-\eta\hat m/(\sqrt{\hat v}+\epsilon)
  • Learning-rate schedule: change the step size over training. Formula: \eta_t
  • Gradient clipping: cap the gradient norm to prevent unstable updates. Formula: g\leftarrow g\min(1,c/\Vert g\Vert)

7.1 SGD

Main idea. Update parameters using mini-batch gradients.

Core relation:

\theta\leftarrow\theta-\eta g


7.2 Momentum

Main idea. Smooth updates with velocity.

Core relation:

v\leftarrow\beta v+g


7.3 Adam

Main idea. Normalize first moment by second moment estimate.

Core relation:

\theta\leftarrow\theta-\eta\hat m/(\sqrt{\hat v}+\epsilon)

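A minimal Adam step sketch matching the update rule above, with the usual bias-corrected moment estimates. The hyperparameters are the common defaults, and the quadratic toy objective is only for illustration.

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = theta                                # gradient of 0.5 * ||theta||^2
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)                                 # moves toward the minimum at 0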

7.4 Learning-rate schedule

Main idea. Change step size over training.

Core relation:

\eta_t


7.5 Gradient clipping

Main idea. Cap gradient norm to prevent unstable updates.

Core relation:

g\leftarrow g\min(1,c/\Vert g\Vert)

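A sketch of clipping by global norm, following the rule above applied jointly to all parameter groups; the threshold is arbitrary.

import numpy as np

def clip_by_global_norm(grads, max_norm):
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped, before = clip_by_global_norm(grads, max_norm=5.0)
after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(before, after)                                  # 13.0 5.0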

8. Normalization and Regularization

This part studies normalization and regularization, which keep training stable and limit overfitting. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Batch normalization: normalize features using batch statistics. Formula: \hat x=(x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}
  • Layer normalization: normalize features within an example. Formula: \hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}
  • Dropout: randomly mask activations during training. Formula: \tilde h=m\odot h/(1-p)
  • Weight decay: penalize large weights. Formula: L+\lambda\Vert\theta\Vert^2
  • Early stopping: stop when the validation loss stops improving. Formula: L_\mathrm{val}

8.1 Batch normalization

Main idea. Normalize features using batch statistics.

Core relation:

\hat x=(x-\mu_B)/\sqrt{\sigma_B^2+\epsilon}


8.2 Layer normalization

Main idea. Normalize features within an example.

Core relation:

\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}

AI connection. LayerNorm is one of the basic stabilizers behind modern transformers.

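A LayerNorm sketch that normalizes each example over its feature axis and then applies a learnable scale and shift (gamma and beta start at the conventional 1 and 0 here).

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # per-example statistics
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 10.0, 10.0]])
gamma, beta = np.ones(4), np.zeros(4)
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))     # per-row mean ~0; std ~1 (0 for the constant row)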

8.3 Dropout

Main idea. Randomly mask activations during training.

Core relation:

\tilde h=m\odot h/(1-p)

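A sketch of inverted dropout: mask activations with probability p during training and rescale the survivors by 1/(1-p), so inference needs no adjustment. The rate p = 0.25 is only an example.

import numpy as np

def dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p            # keep each unit with probability 1 - p
    return mask * h / (1.0 - p)                # inverted scaling preserves the expectation

rng = np.random.default_rng(0)
h = np.ones((2, 8))
h_train = dropout(h, p=0.25, rng=rng)
print(h_train)            # zeros and 1.333... entries
print(h_train.mean())     # close to 1 in expectation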

8.4 Weight decay

Main idea. Penalize large weights.

Core relation:

L+\lambda\Vert\theta\Vert^2


8.5 Early stopping

Main idea. Stop when validation loss stops improving.

Core relation:

L_\mathrm{val}


9. Expressivity and Generalization

This part studies expressivity and generalization. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Universal approximation: wide nonlinear networks can approximate many functions. Formula: f_\theta\approx f
  • Depth efficiency: some functions are represented more compactly with depth. Formula: L>1
  • Overparameterization: large networks can fit data yet generalize with the right training biases. Formula: P\gg n
  • Double descent: test error can be non-monotonic in model size. Formula: E_\mathrm{test}(P)
  • Inductive bias: architecture and optimization shape which functions are easy to learn. Formula: \theta_\mathrm{SGD}

9.1 Universal approximation

Main idea. Wide nonlinear networks can approximate many functions.

Core relation:

f_\theta\approx f


9.2 Depth efficiency

Main idea. Some functions are represented more compactly with depth.

Core relation:

L>1


9.3 Overparameterization

Main idea. Large networks can fit data yet generalize with the right training biases.

Core relation:

P\gg n


9.4 Double descent

Main idea. Test error can be non-monotonic in model size.

Core relation:

E_\mathrm{test}(P)


9.5 Inductive bias

Main idea. Architecture and optimization shape what functions are easy to learn.

Core relation:

\theta_\mathrm{SGD}


10. Diagnostics

This part studies practical diagnostics for training runs. Keep track of forward values, backward gradients, scale, and diagnostics.

  • Shape checks: track batch and feature axes at each layer. Formula: (B,d_\ell)
  • Activation statistics: watch means, variances, and dead units. Formula: \mu_\ell,\sigma_\ell
  • Gradient norms: monitor gradients by layer. Formula: \Vert\nabla_{\theta_\ell}L\Vert
  • Train/validation curves: separate optimization failure from overfitting. Formula: L_\mathrm{train},L_\mathrm{val}
  • Ablations: compare width, depth, activation, optimizer, and normalization. Formula: \Delta L

10.1 Shape checks

Main idea. Track batch and feature axes at each layer.

Core relation:

(B,d_\ell)


10.2 Activation statistics

Main idea. Watch means, variances, and dead units.

Core relation:

\mu_\ell,\sigma_\ell


10.3 Gradient norms

Main idea. Monitor gradients by layer.

Core relation:

\Vert\nabla_{\theta_\ell}L\Vert

AI connection. Layer-wise gradient norms are the first place to look when a network does not train.

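A diagnostics sketch for the two-layer example network used throughout this section: run one forward and backward pass on placeholder data and log the quantities named above (activation scale, dead-unit fraction, per-layer gradient norms).

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 8)), rng.normal(size=(32, 1))
W1, b1 = rng.normal(scale=0.5, size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(1)

# Forward pass with an MSE loss.
Z1 = X @ W1.T + b1
H1 = np.maximum(0.0, Z1)
Y_hat = H1 @ W2.T + b2
loss = np.mean((Y_hat - y) ** 2)

# Backward pass written out by hand (chain rule).
dY = 2 * (Y_hat - y) / len(X)
dW2 = dY.T @ H1
dZ1 = (dY @ W2) * (Z1 > 0)
dW1 = dZ1.T @ X

print(f"loss={loss:.4f}")
print(f"H1 mean={H1.mean():.3f} std={H1.std():.3f} dead={float((H1 == 0).mean()):.2f}")
print(f"grad norms: ||dW1||={np.linalg.norm(dW1):.3f} ||dW2||={np.linalg.norm(dW2):.3f}")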

10.4 Train validation curves

Main idea. Separate optimization failure from overfitting.

Core relation:

L_\mathrm{train},L_\mathrm{val}


10.5 Ablations

Main idea. Compare width, depth, activation, optimizer, and normalization.

Core relation:

\Delta L



Practice Exercises

  1. Compute one affine layer.
  2. Apply ReLU and its derivative.
  3. Compute a two-layer forward pass.
  4. Compute softmax cross-entropy.
  5. Compute an affine-layer gradient.
  6. Run a finite-difference gradient check.
  7. Compare Xavier and He initialization scales.
  8. Apply dropout with inverted scaling.
  9. Compute LayerNorm for one example.
  10. Write a neural-network debugging checklist.

Why This Matters for AI

Transformers, CNNs, RNNs, diffusion models, and reward models are all neural networks. The same concepts repeat everywhere: differentiable computation, learned representations, chain-rule gradients, initialization, normalization, regularization, and diagnostics.

Bridge to Probabilistic Models

The next section studies probabilistic models. Neural networks often parameterize probability distributions, so the next step is to connect learned functions with likelihoods, latent variables, and uncertainty.
