Math for LLMs / ML Specific Math

Normalization Techniques

"Deep networks are easier to train when their internal scales are made visible, controlled, and learnable."

Overview

Normalization techniques control the statistics of activations, weights, or features so optimization sees a better-conditioned problem. They are not just preprocessing tricks. Inside neural networks, normalization changes forward signal scale, backward gradient flow, effective parameterization, train/eval behavior, and numerical stability.

This section is the canonical home for normalization math as a reusable ML primitive. Chapter 14 discusses how BatchNorm, LayerNorm, RMSNorm, and SpectralNorm appear in specific model families. Chapter 15 discusses LLM-scale block design and fused kernels. Here we focus on axes, moments, affine reparameterization, train-time statistics, inference statistics, epsilon, RMS-only scaling, group statistics, weight statistics, residual placement, and the implementation mistakes that make normalization layers silently wrong.

The core habit is to always ask: "What axis is normalized? Which statistics are used? Are they batch-dependent? Are they learned, frozen, or recomputed? What gradient path does this create?"

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Axis-aware BatchNorm, LayerNorm, RMSNorm, GroupNorm, WeightNorm, SpectralNorm, and residual-placement demos |
| exercises.ipynb | 10 graded exercises on moments, broadcasting, train/eval behavior, and normalization diagnostics |

Learning Objectives

After completing this section, you will be able to:

  • Identify which axes BatchNorm, LayerNorm, RMSNorm, GroupNorm, InstanceNorm, WeightNorm, and SpectralNorm normalize
  • Derive the normalized affine transform $\gamma\hat{x}+\beta$
  • Explain why epsilon is a numerical stabilizer, not a regularization constant
  • Distinguish batch statistics, running statistics, and per-example statistics
  • Implement BatchNorm, LayerNorm, RMSNorm, and GroupNorm in NumPy
  • Explain why BatchNorm has different train and eval behavior
  • Compare LayerNorm and RMSNorm for sequence-style hidden states
  • Diagnose broadcasting, axis, small-batch, and mixed-precision failures
  • Explain pre-norm versus post-norm as a gradient-flow design choice

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Batch Normalization
  4. Layer Normalization
  5. RMSNorm
  6. Other Normalizations
  7. Residual Blocks and Placement
  8. Numerical and Implementation Details
  9. Applications in Machine Learning
  10. Common Mistakes
  11. Exercises
  12. Why This Matters for AI
  13. Conceptual Bridge

1. Intuition

1.1 Why Activations Drift

During training, every layer receives inputs produced by previous layers whose parameters are changing. Even if the raw data distribution is fixed, internal activation distributions move. Means shift, variances change, outliers appear, and gradient scales become inconsistent across depth.

Normalization addresses this by computing statistics over a chosen axis and reparameterizing activations into a controlled scale. A generic normalized activation has the form

$$\hat{x}=\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}.$$

1.2 Normalization as Re-Centering

Mean subtraction re-centers values:

$$x \mapsto x-\mu.$$

This can make optimization easier because downstream affine layers do not need to constantly compensate for drifting offsets. But not every normalization subtracts the mean. RMSNorm normalizes by root mean square and preserves the mean direction.

1.3 Normalization as Conditioning

Optimization is sensitive to scale. If one feature dimension has variance $10^6$ and another has variance $10^{-4}$, a single learning rate is poorly matched to both. Normalization reduces such scale mismatch. It does not make all optimization problems convex, but it often improves conditioning.

1.4 Signal Scale in Deep Nets

Deep networks multiply many local transformations. If activation scale grows slightly at each layer, it can explode. If it shrinks slightly, it can vanish. Normalization layers repeatedly reset or constrain scale so residual blocks, attention blocks, convolution blocks, or recurrent transitions remain trainable.

1.5 Historical Path

BatchNorm made normalization a core deep-learning layer by using mini-batch statistics. LayerNorm removed batch dependence and became natural for sequence models. GroupNorm and InstanceNorm addressed small-batch and image-style settings. RMSNorm simplified LayerNorm by removing mean subtraction. SpectralNorm constrained weight operators rather than activations.

2. Formal Definitions

2.1 Normalization Axes

Let $\mathcal{X}\in\mathbb{R}^{B\times T\times D}$ represent a batch of sequence activations. Here $B$ is batch size, $T$ is sequence length, and $D$ is feature width. Different normalization methods choose different axes:

| Method | Typical axes for statistics | Batch-dependent? |
| --- | --- | --- |
| BatchNorm | batch and sometimes spatial axes | Yes |
| LayerNorm | feature axis per example/token | No |
| RMSNorm | feature axis per example/token | No |
| GroupNorm | groups of channels per example | No |
| InstanceNorm | spatial axes per example/channel | No |
| WeightNorm | weight-vector direction/scale | No |
| SpectralNorm | largest singular value of weight matrix | No batch statistics |

The axis choice is the most important part of the definition. Two formulas can look identical while normalizing completely different objects.
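
To make the table concrete, here is a minimal NumPy sketch (shapes and variable names are illustrative) contrasting BatchNorm-style statistics, pooled over the batch and sequence axes, with LayerNorm-style statistics, taken over the feature axis of each token:

```python
import numpy as np

B, T, D = 4, 6, 8                          # illustrative batch, sequence, feature sizes
x = np.random.randn(B, T, D)

# BatchNorm-style: one mean/variance per feature, pooled over batch and sequence axes
mu_feat = x.mean(axis=(0, 1))              # shape (D,)
var_feat = x.var(axis=(0, 1))              # shape (D,)

# LayerNorm-style: one mean/variance per example and token, over the feature axis
mu_tok = x.mean(axis=-1, keepdims=True)    # shape (B, T, 1)
var_tok = x.var(axis=-1, keepdims=True)    # shape (B, T, 1)

print(mu_feat.shape, mu_tok.shape)         # (8,) (4, 6, 1)
```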

2.2 Mean and Variance Statistics

For a set of values $\{x_i\}_{i=1}^{m}$, the mean and variance are

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x_i, \qquad \sigma^2=\frac{1}{m}\sum_{i=1}^{m}(x_i-\mu)^2.$$

The normalized value is

$$\hat{x}_i=\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}.$$

In deep-learning libraries, variance is often the biased batch variance, not the unbiased sample variance. The choice is part of the layer definition.

2.3 Epsilon

Epsilon prevents division by zero:

$$\sqrt{\sigma^2+\epsilon}.$$

It is not a regularizer in the usual statistical sense. It is a numerical stabilizer. Too small an epsilon can fail in low precision. Too large an epsilon changes the effective normalization by preventing true unit variance.

2.4 Learnable Gain and Bias

After normalization, most methods apply learnable affine parameters:

$$y_i=\gamma_i\hat{x}_i+\beta_i.$$

The gain $\gamma$ lets the network restore useful scale. The bias $\beta$ lets it restore useful offset. Without these parameters, normalization could remove representations the model needs.

2.5 Train-Time Versus Inference-Time Behavior

BatchNorm uses mini-batch statistics during training and running estimates during inference. LayerNorm and RMSNorm compute per-example statistics at both train and inference time. This distinction affects reproducibility, deployment, and small-batch behavior.

3. Batch Normalization

3.1 Batch Statistics

For a feature $j$ over a batch,

$$\mu_j=\frac{1}{B}\sum_{i=1}^{B}x_{ij}, \qquad \sigma_j^2=\frac{1}{B}\sum_{i=1}^{B}(x_{ij}-\mu_j)^2.$$

Then

$$\hat{x}_{ij}=\frac{x_{ij}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}.$$

For convolutional tensors, statistics are often computed across batch and spatial positions for each channel.

3.2 Running Averages

BatchNorm maintains running estimates:

$$\mu_{\mathrm{run}}\leftarrow (1-\alpha)\mu_{\mathrm{run}}+\alpha\mu_{\mathrm{batch}}, \qquad \sigma^2_{\mathrm{run}}\leftarrow (1-\alpha)\sigma^2_{\mathrm{run}}+\alpha\sigma^2_{\mathrm{batch}}.$$

These estimates are used in evaluation mode. If the running estimates are bad, validation and deployment behavior can be bad even when training behavior looked fine.

3.3 Affine Transform

BatchNorm output is

$$y_{ij}=\gamma_j\hat{x}_{ij}+\beta_j.$$

The parameters are feature-specific. In CNNs they are channel-specific and broadcast over spatial dimensions.
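
The pieces of Sections 3.1 to 3.3 fit together as in the minimal NumPy sketch below. This is not a library API: the function name, the momentum argument (playing the role of the α in the running-average update above), and the (B, D) input shape are all illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, run_mu, run_var,
                      momentum=0.1, eps=1e-5, training=True):
    """BatchNorm over the batch axis of a (B, D) activation matrix."""
    if training:
        mu = x.mean(axis=0)                   # per-feature batch mean, shape (D,)
        var = x.var(axis=0)                   # biased per-feature batch variance
        run_mu = (1 - momentum) * run_mu + momentum * mu     # running estimates
        run_var = (1 - momentum) * run_var + momentum * var  # used later in eval mode
    else:
        mu, var = run_mu, run_var             # frozen statistics at inference
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta                  # feature-wise affine transform
    return y, run_mu, run_var

B, D = 32, 16
x = np.random.randn(B, D)
y, run_mu, run_var = batchnorm_forward(x, np.ones(D), np.zeros(D),
                                       run_mu=np.zeros(D), run_var=np.ones(D))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature
```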

3.4 Train/Eval Gap

During training, a sample is normalized using statistics from its current mini-batch. During inference, it is normalized using stored running statistics. This creates a train/eval gap. The gap is small when batches are large and representative. It can be large for tiny batches, distribution shift, or nonstationary data.

3.5 Batch-Size Sensitivity

BatchNorm is noisy with small batches because mean and variance estimates have high variance. Distributed training can also change effective batch statistics depending on whether BatchNorm is synchronized across devices.

4. Layer Normalization

4.1 Per-Example Statistics

LayerNorm computes statistics over features for each example:

$$\mu_i=\frac{1}{D}\sum_{j=1}^{D}x_{ij}, \qquad \sigma_i^2=\frac{1}{D}\sum_{j=1}^{D}(x_{ij}-\mu_i)^2.$$

For sequence tensors, it is usually applied separately to each token position.

4.2 Feature-Axis Normalization

LayerNorm normalizes the feature vector:

$$\hat{\mathbf{x}}_i=\frac{\mathbf{x}_i-\mu_i\mathbf{1}}{\sqrt{\sigma_i^2+\epsilon}}.$$

Its gain and bias are feature-wise:

$$\mathbf{y}_i=\boldsymbol{\gamma}\odot\hat{\mathbf{x}}_i+\boldsymbol{\beta}.$$
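
A minimal NumPy LayerNorm sketch following the two equations above; the function name and shapes are illustrative, and the biased variance matches the convention noted in Section 2.2.

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last (feature) axis; works for (B, D) or (B, T, D) inputs."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)        # biased variance over the feature axis
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta                # gamma and beta have shape (D,)

x = np.random.randn(4, 6, 8)                   # (B, T, D), sizes illustrative
y = layernorm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=-1).round(6).max(), y.std(axis=-1).round(3).max())  # ≈ 0 and ≈ 1
```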

4.3 Sequence Models

LayerNorm is natural for sequence models because it does not depend on other examples in the batch. A token can be normalized using only its own hidden state. This makes behavior consistent between training and autoregressive inference.

4.4 Invariance Properties

LayerNorm is invariant to adding a constant offset to all features in the normalized vector and to multiplying all features by a positive scalar, up to epsilon and learned affine parameters.

4.5 Comparison to BatchNorm

BatchNorm normalizes each feature using batch statistics. LayerNorm normalizes each example using feature statistics. BatchNorm couples examples in a batch. LayerNorm couples features inside an example.

5. RMSNorm

5.1 RMS Statistic

RMSNorm uses the root mean square:

$$\operatorname{RMS}(\mathbf{x})=\sqrt{\frac{1}{D}\sum_{j=1}^{D}x_j^2+\epsilon}.$$

5.2 Scale-Only Normalization

The normalized vector is

$$\hat{\mathbf{x}}=\frac{\mathbf{x}}{\operatorname{RMS}(\mathbf{x})}.$$

Then

$$\mathbf{y}=\boldsymbol{\gamma}\odot\hat{\mathbf{x}}.$$

5.3 No Mean Subtraction

RMSNorm does not subtract the mean and usually does not include a bias parameter. It controls scale but preserves mean direction. This makes it cheaper and can work well in residual-stream architectures.
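
A minimal NumPy RMSNorm sketch (names and shapes illustrative) implementing the scale-only formula from Section 5.2; the final print shows that the per-token RMS is driven to one while the per-token mean is generally not zero.

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm over the last (feature) axis: scale-only, no mean subtraction."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)                   # gamma has shape (D,)

x = np.random.randn(4, 6, 8)                   # (B, T, D), sizes illustrative
y = rmsnorm(x, gamma=np.ones(8))
rms_out = np.sqrt((y ** 2).mean(axis=-1))
print(rms_out.round(3).max(), y.mean(axis=-1).round(3).max())  # RMS ≈ 1, mean ≠ 0
```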

5.4 Efficiency

LayerNorm requires mean and variance. RMSNorm requires only the mean square. That saves operations and can simplify fused kernels. At large scale, small per-token savings matter.

5.5 LLM Usage Preview

Many modern LLM-style architectures use RMSNorm or RMSNorm-like variants. Chapter 15 covers LLM-specific block details. Here the reusable math is simply scale-only feature normalization.

6. Other Normalizations

6.1 GroupNorm

GroupNorm divides channels into groups and normalizes within each group for each example. It avoids batch dependence while retaining channel-group structure.
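
A minimal NumPy GroupNorm sketch for an (N, C, H, W) tensor; the group count, shapes, and function name are illustrative. Statistics are computed per example and per channel group, so no batch coupling is introduced.

```python
import numpy as np

def groupnorm(x, gamma, beta, num_groups, eps=1e-5):
    """GroupNorm for an (N, C, H, W) tensor: statistics per example and channel group."""
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)       # one mean per example and group
    var = g.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((g - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)

x = np.random.randn(2, 8, 5, 5)                      # (N, C, H, W), sizes illustrative
y = groupnorm(x, gamma=np.ones(8), beta=np.zeros(8), num_groups=4)
print(y.shape)                                       # (2, 8, 5, 5)
```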

6.2 InstanceNorm

InstanceNorm normalizes each sample and channel over spatial dimensions. It is common in style-transfer and image-generation contexts where instance-specific contrast should be controlled.

6.3 WeightNorm

WeightNorm reparameterizes a weight vector:

$$\mathbf{w}=g\,\frac{\mathbf{v}}{\lVert\mathbf{v}\rVert_2}.$$

It separates direction $\mathbf{v}/\lVert\mathbf{v}\rVert_2$ from magnitude $g$.
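
A two-line NumPy check of the reparameterization; the chosen dimension and the value of $g$ are illustrative.

```python
import numpy as np

v = np.random.randn(16)              # unconstrained direction parameter
g = 2.5                              # learned magnitude (illustrative value)
w = g * v / np.linalg.norm(v)        # WeightNorm reparameterization

print(np.linalg.norm(w))             # ≈ 2.5: the norm of w equals g by construction
```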

6.4 SpectralNorm

Spectral normalization constrains a weight matrix by its largest singular value:

$$\bar{W}=\frac{W}{\sigma_{\max}(W)}.$$

This controls the operator norm and therefore the layer's Lipschitz constant. The full singular-value theory belongs in Singular Value Decomposition.
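
A minimal power-iteration sketch for estimating the largest singular value (the quantity SpectralNorm divides by); the iteration count and matrix shape are illustrative, and practical implementations typically reuse the iterate across training steps.

```python
import numpy as np

def spectral_norm_estimate(W, num_iters=50):
    """Estimate sigma_max(W) by power iteration on W^T W."""
    u = np.random.randn(W.shape[1])
    for _ in range(num_iters):
        u = W.T @ (W @ u)                    # one step of power iteration
        u /= np.linalg.norm(u)
    return np.linalg.norm(W @ u)             # ≈ largest singular value

W = np.random.randn(32, 16)
sigma = spectral_norm_estimate(W)
W_bar = W / sigma                            # spectrally normalized weight
print(sigma, np.linalg.svd(W, compute_uv=False)[0])  # estimate vs exact sigma_max
```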

6.5 ScaleNorm Preview

ScaleNorm normalizes a vector by its norm and multiplies by a learned scalar. It is conceptually close to RMSNorm but uses explicit vector norm scaling.

7. Residual Blocks and Placement

7.1 Pre-Norm

Pre-norm residual blocks apply normalization before the sublayer:

$$\mathbf{h}_{l+1}=\mathbf{h}_l+F(\operatorname{Norm}(\mathbf{h}_l)).$$

This gives the residual path a direct identity route for gradients.

7.2 Post-Norm

Post-norm applies normalization after the residual addition:

$$\mathbf{h}_{l+1}=\operatorname{Norm}(\mathbf{h}_l+F(\mathbf{h}_l)).$$

It can produce cleaner normalized outputs per block but may make very deep stacks harder to optimize.
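
The two placements differ only in where the normalization sits relative to the residual addition. A minimal NumPy sketch, with a stand-in sublayer F and illustrative shapes:

```python
import numpy as np

def norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis (no affine parameters, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, W):
    return np.maximum(W @ x, 0.0)            # stand-in F: ReLU of a linear map

def pre_norm_block(h, W):
    return h + sublayer(norm(h), W)          # h_{l+1} = h_l + F(Norm(h_l))

def post_norm_block(h, W):
    return norm(h + sublayer(h, W))          # h_{l+1} = Norm(h_l + F(h_l))

D = 8
h = np.random.randn(D)
W = np.random.randn(D, D) / np.sqrt(D)
print(np.linalg.norm(pre_norm_block(h, W)), np.linalg.norm(post_norm_block(h, W)))
```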

7.3 Sandwich Norm Preview

Some architectures add normalization both before and after certain sublayers. This is a model-design detail, so the full treatment belongs in model-specific chapters. The general math is that each norm changes both forward scale and backward Jacobian.

7.4 Gradient Flow

Normalization layers have Jacobians. They do not simply "standardize and disappear." Their backward pass subtracts mean-like components and rescales gradients. This can stabilize training but can also introduce coupling across the normalized axis.

7.5 Deep-Stack Stability

Deep residual stacks rely on controlling activation and residual-stream scale. Normalization placement, residual scaling, initialization, and optimizer choice interact. A stable block is a system, not a single layer.

8. Numerical and Implementation Details

8.1 Broadcasting Axes

Most normalization bugs are axis bugs. A parameter $\gamma\in\mathbb{R}^D$ must broadcast over batch and sequence axes but align with the feature axis. If the shape is wrong, code may run while normalizing the wrong dimension.
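
One cheap defense is to assert parameter shapes against the normalized axis before training starts. A minimal sketch with a hypothetical checked wrapper:

```python
import numpy as np

def layernorm_checked(x, gamma, beta, eps=1e-5):
    """LayerNorm with explicit shape checks on the affine parameters."""
    D = x.shape[-1]
    assert gamma.shape == (D,), f"gamma shape {gamma.shape}, expected ({D},)"
    assert beta.shape == (D,), f"beta shape {beta.shape}, expected ({D},)"
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 6, 8)                 # (B, T, D), sizes illustrative
# A gamma of shape (T, 1) would broadcast silently and scale per token instead of
# per feature; the assertions reject such shapes before any training happens.
y = layernorm_checked(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.shape)
```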

8.2 Epsilon Choice

Epsilon must be large enough to protect low-variance vectors and low-precision formats. It must be small enough not to dominate real variance. Common values include $10^{-5}$ and $10^{-6}$, but the right value is implementation and dtype dependent.

8.3 Mixed Precision

Mixed-precision normalization often accumulates statistics in higher precision even when inputs are lower precision. This reduces underflow, overflow, and rounding errors in variance computation.
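
A small NumPy illustration of why accumulation precision matters: the same float16 activations give very different variance estimates depending on the accumulation dtype. The magnitudes here are chosen to trigger underflow and are illustrative.

```python
import numpy as np

# Small-magnitude activations stored in float16; squared deviations can underflow.
x16 = (np.random.randn(4096) * 1e-4).astype(np.float16)

var_low = x16.var(dtype=np.float16)      # statistics accumulated in float16
var_high = x16.var(dtype=np.float32)     # statistics accumulated in float32

print(var_low, var_high)                 # float16 accumulation can lose most of the variance
```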

8.4 Small-Batch Failure

BatchNorm with batch size one is usually ill-posed for dense features because the per-feature batch variance collapses to zero, leaving only epsilon in the denominator. CNN spatial axes may still provide enough samples, but the effective sample count should be checked.

8.5 Fused Kernels Preview

At large scale, normalization is often fused with neighboring operations to reduce memory traffic. The math is unchanged, but implementation details affect speed and numerical behavior.

9. Applications in Machine Learning

9.1 CNNs

BatchNorm is common in CNNs because batch and spatial dimensions provide many samples per channel. It also acts as a mild source of stochasticity during training.

9.2 RNNs

LayerNorm is often easier than BatchNorm for recurrent models because sequence lengths and time dependencies make batch statistics awkward.

9.3 Transformers

Transformers use LayerNorm, RMSNorm, and placement variants to stabilize deep residual blocks. The full Transformer block treatment belongs in Transformer Architecture.

9.4 GANs

SpectralNorm controls discriminator Lipschitz behavior and helps stabilize adversarial training. InstanceNorm and normalization variants also appear in image generation pipelines.

9.5 Large-Scale Training

At large scale, normalization affects stability, throughput, memory bandwidth, mixed-precision behavior, and distributed reproducibility. Seemingly small choices become system-level choices.

10. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Normalizing the wrong axis | Statistics describe the wrong object | Write tensor shapes and axes explicitly |
| 2 | Forgetting BatchNorm train/eval mode | Batch and running statistics differ | Switch modes deliberately and test both |
| 3 | Treating epsilon as harmless | Large epsilon changes scale | Tune epsilon with dtype and variance in mind |
| 4 | Using BatchNorm with tiny batches blindly | Statistics are noisy or degenerate | Use LayerNorm, GroupNorm, or synchronized stats |
| 5 | Broadcasting gain over the wrong dimension | Code may run but learn wrong parameters | Assert shapes before training |
| 6 | Comparing LayerNorm and RMSNorm as identical | RMSNorm does not center | Track mean and RMS separately |
| 7 | Ignoring mixed-precision accumulation | Variance can underflow or overflow | Accumulate statistics in higher precision |
| 8 | Calling SpectralNorm an activation norm | It normalizes weights, not activations | Separate activation and operator normalization |
| 9 | Assuming normalization replaces initialization | Bad initialization can still break training | Use compatible initialization and norm placement |
| 10 | Removing affine gain/bias casually | The model may need restored scale/offset | Remove only with a clear architectural reason |

11. Exercises

  1. (*) Compute mean, variance, and normalized values for a vector.
  2. (*) Implement BatchNorm for a matrix with shape $B\times D$.
  3. (*) Implement LayerNorm for a matrix with shape $B\times D$.
  4. (**) Show that BatchNorm changes when the batch composition changes.
  5. (**) Show that LayerNorm is independent of other examples in the batch.
  6. (**) Implement RMSNorm and compare its output mean to LayerNorm.
  7. (**) Implement GroupNorm for a tensor with grouped channels.
  8. (***) Compute a WeightNorm reparameterization and verify its norm.
  9. (***) Estimate a spectral norm by power iteration.
  10. (***) Diagnose a shape bug in a fake normalization layer.

12. Why This Matters for AI

| Concept | AI impact |
| --- | --- |
| Axis choice | Determines whether examples, features, channels, or weights are coupled |
| BatchNorm | Stabilizes CNN training but introduces batch dependence |
| LayerNorm | Makes sequence and token-wise models batch-independent |
| RMSNorm | Controls scale with lower overhead and no centering |
| GroupNorm | Works when batch statistics are unreliable |
| SpectralNorm | Controls operator scale and Lipschitz behavior |
| Epsilon | Prevents numerical failure in low-variance and low-precision regimes |
| Pre-norm | Improves gradient flow in deep residual stacks |
| Train/eval mode | Affects reproducibility and deployment correctness |
| Fused normalization | Matters for throughput at large scale |

13. Conceptual Bridge

Activation functions create nonlinear hidden states. Normalization techniques control the statistics of those hidden states.

activation values
    -> mean / variance / RMS statistics
    -> normalized hidden state
    -> learnable scale and bias
    -> next layer or residual block

Next, Sampling Methods moves from deterministic transformations to randomized estimators, proposal distributions, and generation procedures.

References

  • Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
  • Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization.
  • Salimans, T., and Kingma, D. P. (2016). Weight Normalization.
  • Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance Normalization.
  • Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks.
  • Wu, Y., and He, K. (2018). Group Normalization.
  • Zhang, B., and Sennrich, R. (2019). Root Mean Square Layer Normalization.