
Activation Functions


"A neural network without nonlinear activation is only a disguised linear map."

Overview

Activation functions are the local nonlinear maps that turn stacks of affine layers into expressive models. They decide which signals pass forward, which gradients flow backward, which neurons saturate, which features become sparse, and how numerical scale changes with depth. Their formulas look small, but their training consequences are large.

This section is the canonical home for activation-function math in the curriculum. Chapter 14 discusses how neural networks, RNNs, CNNs, and Transformers use these functions inside model architectures. Chapter 15 discusses LLM-specific block design. Here we focus on the reusable mathematical objects themselves: scalar activations, vector activations, derivatives, Jacobians, saturation, smoothness, Lipschitz behavior, gating, softmax temperature, and gradient-flow consequences.

The central question is not "Which activation is popular?" but "What map does this activation apply, what derivative does it create, and what does that derivative do to learning?"

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Curves, derivatives, Jacobians, saturation, gated activations, and softmax temperature |
| exercises.ipynb | 10 graded exercises on activation values, derivatives, gradients, and stable implementations |

Learning Objectives

After completing this section, you will be able to:

  • Define scalar, elementwise vector, and coupled vector activations
  • Compute derivatives for sigmoid, tanh, ReLU, Leaky ReLU, GELU, SiLU, and softmax
  • Explain saturation and why it causes vanishing gradients
  • Compare piecewise-linear and smooth activations by gradient behavior
  • Derive the softmax Jacobian $J_{ij}=s_i(\delta_{ij}-s_j)$
  • Explain why gated activations such as GLU, GEGLU, and SwiGLU are multiplicative feature selectors
  • Connect activation variance to Xavier and He initialization
  • Diagnose dead neurons, exploding activations, saturated gates, and unstable softmax
  • Choose activations by mathematical role rather than popularity

1. Intuition

1.1 Why Nonlinearities Matter

An affine layer has the form

$$\mathbf{h}=W\mathbf{x}+\mathbf{b}.$$

If we stack affine layers without nonlinearities, the result is still affine:

$$W_3(W_2(W_1\mathbf{x}+\mathbf{b}_1)+\mathbf{b}_2)+\mathbf{b}_3 = A\mathbf{x}+\mathbf{c}.$$

Depth alone does not create nonlinear expressivity. Activation functions insert nonlinear maps between affine transforms:

$$\mathbf{h}^{[l]}=\phi(W^{[l]}\mathbf{h}^{[l-1]}+\mathbf{b}^{[l]}).$$

The activation $\phi$ is what lets the network bend decision boundaries, compose features, gate information, and approximate complex functions.
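
The sketch below (a minimal NumPy example; the shapes, seed, and variable names are illustrative) checks this collapse numerically: three stacked affine layers match a single affine map $A\mathbf{x}+\mathbf{c}$ exactly, and inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three random affine layers: h -> W h + b
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
b1, b2, b3 = (rng.normal(size=4) for _ in range(3))

x = rng.normal(size=4)

# Stacked affine layers, no activation.
deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The equivalent single affine map A x + c.
A = W3 @ W2 @ W1
c = W3 @ W2 @ b1 + W3 @ b2 + b3
print(np.allclose(deep, A @ x + c))        # True: depth alone added no expressivity


def relu(z):
    return np.maximum(0.0, z)

# Inserting a nonlinearity between the affine maps breaks the collapse.
nonlinear = W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3
print(np.allclose(nonlinear, A @ x + c))   # False in general
```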

1.2 Activations as Gates

Many activations behave like gates. ReLU passes positive values and blocks negative values:

$$\operatorname{ReLU}(x)=\max(0,x).$$

Sigmoid maps values into $(0,1)$ and is often interpreted as a soft gate:

$$\sigma(x)=\frac{1}{1+\exp(-x)}.$$

Gated activations multiply a content stream by a learned gate:

$$\operatorname{GLU}(\mathbf{a},\mathbf{b})=\mathbf{a}\odot\sigma(\mathbf{b}).$$

This gate perspective is especially useful in LSTMs, GRUs, gated MLPs, and Transformer feedforward variants.

1.3 Activations as Gradient Controllers

Backpropagation multiplies by activation derivatives. In a depth-$L$ network, early-layer gradients contain products of terms like

$$W^{[l]\top}\operatorname{diag}(\phi'(\mathbf{z}^{[l]})).$$

If $\phi'$ is often near zero, gradients vanish. If the product of weight norms and activation slopes is large, gradients can explode. This is why sigmoid can saturate deep networks, why ReLU helped train deeper models, and why normalization and residual connections matter.
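
A rough NumPy sketch of this effect, under the simplifying assumption that every layer sees unit-variance Gaussian preactivations: it compares the typical local slopes that sigmoid and ReLU contribute to the backward product. The weight factors multiply in as well, which is why initialization (Section 8) must compensate.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 30, 64
# Preactivations drawn with unit variance at every layer, purely for illustration.
z = rng.normal(size=(depth, width))

# Local slopes phi'(z) for two activations.
slope_sigmoid = sigmoid(z) * (1 - sigmoid(z))   # at most 1/4, about 0.2 on average here
slope_relu = (z > 0).astype(float)              # 0 or 1, about 0.5 on average here

# Product of per-layer mean slopes: a crude scalar proxy for the diag(phi'(z))
# factors that backprop multiplies together across depth.
print("sigmoid:", np.prod(slope_sigmoid.mean(axis=1)))  # roughly 0.2**30: vanishingly small
print("relu:   ", np.prod(slope_relu.mean(axis=1)))     # roughly 0.5**30: far larger, but
                                                         # still needs weight scale to compensate
```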

1.4 Shape of Activation Curves

The curve shape determines both forward statistics and backward gradients:

| Activation | Range | Saturates? | Derivative behavior |
|---|---|---|---|
| Sigmoid | $(0,1)$ | Both tails | Max derivative $1/4$ |
| Tanh | $(-1,1)$ | Both tails | Max derivative $1$ |
| ReLU | $[0,\infty)$ | Negative side | Derivative $0$ or $1$ |
| GELU | $\mathbb{R}$ | Soft negative gate | Smooth nonmonotone region |
| SiLU | $\mathbb{R}$ | Soft negative gate | Smooth self-gating |
| Softmax | Simplex | Probability saturation | Coupled Jacobian |

1.5 Historical Path

Sigmoid and tanh dominated early neural networks because they looked like smooth biological firing rates and probabilistic gates. ReLU became important because it reduced saturation and made sparse activations easy. Smooth variants such as GELU and SiLU became common in large architectures because they combine gating-like behavior with differentiability. Gated activations such as GLU, GEGLU, and SwiGLU made multiplicative feature selection a standard MLP ingredient.

The historical lesson is that activation choice follows training dynamics. Expressivity, gradient propagation, initialization, normalization, and hardware all influence the best choice.

2. Formal Definitions

2.1 Scalar Activation

A scalar activation is a function

$$\phi:\mathbb{R}\to\mathbb{R}.$$

Examples:

  • $\sigma(x)=1/(1+\exp(-x))$
  • $\tanh(x)$
  • $\operatorname{ReLU}(x)=\max(0,x)$

Non-examples:

  • A loss function $\ell(\hat{y},y)$ is not an activation, because it compares prediction and target.
  • A full neural-network layer $W\mathbf{x}+\mathbf{b}$ is not an activation, because it is affine and parameterized.

2.2 Elementwise Vector Activation

Given $\mathbf{x}\in\mathbb{R}^d$, an elementwise activation applies the same scalar map to each coordinate:

$$\phi(\mathbf{x})=(\phi(x_1),\ldots,\phi(x_d))^\top.$$

The Jacobian is diagonal:

$$J_{\phi}(\mathbf{x})=\operatorname{diag}(\phi'(x_1),\ldots,\phi'(x_d)).$$

Examples include sigmoid, tanh, ReLU, GELU, and SiLU when applied to vectors.

2.3 Coupled Vector Activation

A coupled vector activation maps $\mathbb{R}^d$ to $\mathbb{R}^d$, but each output can depend on multiple input coordinates. Softmax is the central example:

$$\operatorname{softmax}(\mathbf{z})_i=\frac{\exp(z_i)}{\sum_{j=1}^{d}\exp(z_j)}.$$

Changing one logit changes all softmax probabilities because the denominator couples every coordinate.

2.4 Jacobian

For $\mathbf{s}=\operatorname{softmax}(\mathbf{z})$, the Jacobian entries are

$$\frac{\partial s_i}{\partial z_j}=s_i(\delta_{ij}-s_j).$$

In matrix form,

$$J_{\operatorname{softmax}}(\mathbf{z})=\operatorname{diag}(\mathbf{s})-\mathbf{s}\mathbf{s}^\top.$$

This matrix is positive semidefinite and has row sums zero. The row-sum property reflects shift invariance:

$$\operatorname{softmax}(\mathbf{z}+c\mathbf{1})=\operatorname{softmax}(\mathbf{z}).$$
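
The sketch below (illustrative logits, NumPy only) implements a max-subtracted softmax, checks shift invariance, and verifies the Jacobian formula and the zero row sums against central finite differences.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; shift invariance makes this exact.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0])
s = softmax(z)

# Shift invariance: softmax(z + c*1) == softmax(z)
print(np.allclose(softmax(z + 10.0), s))   # True

# Analytic Jacobian: diag(s) - s s^T, entries s_i (delta_ij - s_j).
J = np.diag(s) - np.outer(s, s)

# Central finite-difference check, column by column.
eps = 1e-6
I = np.eye(len(z))
J_num = np.column_stack([
    (softmax(z + eps * I[j]) - softmax(z - eps * I[j])) / (2 * eps)
    for j in range(len(z))
])
print(np.allclose(J, J_num, atol=1e-6))    # True
print(np.allclose(J.sum(axis=1), 0.0))     # Row sums are zero
```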

2.5 Smoothness, Lipschitz Constants, and Monotonicity

An activation is smooth if it has continuous derivatives of the needed order. ReLU is continuous but not differentiable at zero. GELU and SiLU are smooth. Sigmoid, tanh, softplus, and softmax are smooth on their domains.

An activation is Lipschitz if there exists $L$ such that

$$\lvert \phi(x)-\phi(y)\rvert\le L\lvert x-y\rvert.$$

For differentiable scalar activations, a bounded derivative gives a Lipschitz constant. ReLU is 1-Lipschitz. Sigmoid is $1/4$-Lipschitz. Tanh is 1-Lipschitz.

3. Classical Activations

3.1 Sigmoid

The sigmoid is

$$\sigma(x)=\frac{1}{1+\exp(-x)}.$$

Its derivative is

$$\sigma'(x)=\sigma(x)(1-\sigma(x)).$$

The derivative is largest at $x=0$ and approaches zero in both tails. This is why sigmoid works well as a gate but poorly as a default hidden activation in deep feedforward networks.

3.2 Tanh

The hyperbolic tangent is

$$\tanh(x)=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}.$$

Its derivative is

$$\frac{d}{dx}\tanh(x)=1-\tanh^2(x).$$

Tanh is zero-centered, unlike sigmoid, which often makes optimization easier. But it still saturates for large $\lvert x\rvert$.
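
A quick numerical check of both derivative identities against central finite differences; the grid and tolerance below are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 101)
eps = 1e-6

# sigma'(x) = sigma(x) * (1 - sigma(x))
num_sig = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(num_sig, sigmoid(x) * (1 - sigmoid(x)), atol=1e-8))  # True
print(num_sig.max())   # close to 0.25, the maximum at x = 0

# d/dx tanh(x) = 1 - tanh(x)^2
num_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(num_tanh, 1 - np.tanh(x) ** 2, atol=1e-8))           # True
print(num_tanh.max())  # close to 1, the maximum at x = 0
```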

3.3 Softplus

Softplus is a smooth approximation to ReLU:

$$\operatorname{softplus}(x)=\log(1+\exp(x)).$$

Its derivative is sigmoid:

$$\frac{d}{dx}\operatorname{softplus}(x)=\sigma(x).$$

Softplus is useful when a strictly positive output is needed, for example a variance or rate parameter.

3.4 Saturation

An activation saturates when its derivative becomes very small over a large input region. Sigmoid and tanh saturate in both tails:

$$\lim_{x\to\infty}\sigma'(x)=0,\qquad \lim_{x\to-\infty}\sigma'(x)=0.$$

Saturation creates vanishing gradients. Once a unit enters a saturated region, large upstream gradients may be multiplied by near-zero local derivatives.

3.5 Gate Interpretation

Sigmoid values can be interpreted as soft gates because they lie between zero and one. If

$$\mathbf{h}_{\mathrm{out}}=\mathbf{g}\odot\mathbf{h}_{\mathrm{in}},\qquad \mathbf{g}=\sigma(\mathbf{a}),$$

then each coordinate of $\mathbf{g}$ decides how much of the input passes. This pattern is foundational for LSTM input, forget, and output gates, and it reappears in modern gated feedforward blocks.

4. ReLU Family

4.1 ReLU

ReLU is

$$\operatorname{ReLU}(x)=\max(0,x).$$

A common derivative convention is

$$\operatorname{ReLU}'(x)=\begin{cases} 1, & x>0,\\ 0, & x\le 0. \end{cases}$$

ReLU avoids positive-side saturation and creates sparse activations. Its main failure mode is dead neurons: units whose preactivations remain negative so their gradients stay zero.

4.2 Leaky ReLU

Leaky ReLU uses a small negative slope:

$$\operatorname{LeakyReLU}_{\alpha}(x)=\begin{cases} x, & x>0,\\ \alpha x, & x\le 0. \end{cases}$$

Its derivative is $\alpha$ on the negative side and $1$ on the positive side. This reduces the dead-neuron problem while preserving much of ReLU's behavior.

4.3 PReLU

Parametric ReLU learns the negative slope:

$$\operatorname{PReLU}_{a}(x)=\begin{cases} x, & x>0,\\ ax, & x\le 0. \end{cases}$$

The slope $a$ becomes a parameter. This can improve flexibility, but it adds a small risk: if slopes grow uncontrolled, activation scale can drift.

4.4 ELU and SELU Preview

ELU uses an exponential negative branch:

$$\operatorname{ELU}_{\alpha}(x)=\begin{cases} x, & x>0,\\ \alpha(\exp(x)-1), & x\le 0. \end{cases}$$

SELU scales ELU to encourage self-normalizing behavior under assumptions about initialization and network structure. These activations are useful historically and conceptually, though modern large Transformer-style models more commonly use GELU, SiLU, or gated variants.

4.5 Dead Neurons and Sparse Activations

A ReLU neuron is dead when its preactivation is negative for almost all examples and updates do not move it back into the active region. Causes include too-large learning rates, biased initialization, distribution shift, and poor normalization.

Sparse activation is not always bad. ReLU sparsity can improve feature selectivity and computation. The problem is irreversible inactivity, not zeros themselves.
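
A small diagnostic sketch, assuming access to a layer's preactivations over a batch (simulated here by giving a few units a deliberately bad, very negative bias): the per-unit fraction of examples with positive preactivation directly flags candidate dead units. The threshold and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: preactivations of one hidden layer over a batch.
batch, width = 1024, 256
z = rng.normal(size=(batch, width))
z[:, :32] -= 6.0                            # these 32 units will almost never activate

# Per-unit fraction of examples on which the ReLU fires.
active_fraction = (z > 0).mean(axis=0)

# A unit that is active on (almost) no examples receives (almost) no gradient.
dead = active_fraction < 0.01
print("dead units:", int(dead.sum()), "of", width)
print("mean active fraction:", round(float(active_fraction.mean()), 3))
```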

5. Smooth Modern Activations

5.1 GELU

GELU is

$$\operatorname{GELU}(x)=x\,\Phi(x),$$

where $\Phi$ is the standard normal CDF. A common approximation is

$$\operatorname{GELU}(x)\approx 0.5x\left(1+\tanh\left(\sqrt{\tfrac{2}{\pi}}\left(x+0.044715x^3\right)\right)\right).$$

GELU can be read as stochastic-looking gating: larger positive values pass almost unchanged, large negative values are suppressed, and values near zero are softly weighted.
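
A short numerical comparison of the exact form against the tanh approximation, using scipy.special.erf for the Gaussian CDF; the evaluation grid and sample points are arbitrary.

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-5, 5, 201)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))   # small, well under 0.01

# Large positives pass almost unchanged; large negatives are suppressed.
for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(v, round(float(gelu_exact(v)), 4))
```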

5.2 SiLU and Swish

SiLU, also called Swish when parameterized, is

$$\operatorname{SiLU}(x)=x\,\sigma(x).$$

Its derivative is

$$\operatorname{SiLU}'(x)=\sigma(x)+x\,\sigma(x)(1-\sigma(x)).$$

SiLU is smooth and self-gated. It allows small negative outputs, which can improve gradient flow compared with hard zeroing.

5.3 Mish

Mish is

$$\operatorname{Mish}(x)=x\tanh(\operatorname{softplus}(x)).$$

Like GELU and SiLU, Mish is smooth and allows a negative tail. It is less standard in large LLM stacks but useful for understanding the family of smooth self-gated activations.

5.4 Curvature Comparison

Smooth activations have nonzero second derivatives over wider regions. That changes curvature seen by optimizers. ReLU is piecewise linear, so its second derivative is zero away from the kink. GELU and SiLU have smooth curvature near zero, which can give more gradual gradient transitions.

Curvature is not automatically good. It can improve optimization smoothness, but it also changes gradient scale and may interact with initialization.

5.5 Gradient-Flow Effects

The main gradient-flow question is the typical size, layer after layer, of the product

$$\text{typical slope}\times\text{typical weight scale}.$$

Sigmoid has small slopes in the tails. ReLU has slope one on active units and zero on inactive units. GELU and SiLU have soft slopes that vary smoothly. Gated activations multiply by learned factors, so their gradients include both content and gate paths.

6. Gated Activations

6.1 GLU

Given two vectors $\mathbf{a},\mathbf{b}\in\mathbb{R}^d$, the gated linear unit is

$$\operatorname{GLU}(\mathbf{a},\mathbf{b})=\mathbf{a}\odot\sigma(\mathbf{b}).$$

The output is linear content modulated by a sigmoid gate. The derivative has two paths: one through $\mathbf{a}$ and one through $\mathbf{b}$.

6.2 GEGLU

GEGLU replaces the sigmoid gate with GELU:

$$\operatorname{GEGLU}(\mathbf{a},\mathbf{b})=\mathbf{a}\odot\operatorname{GELU}(\mathbf{b}).$$

This gives a smooth non-probability gate. Unlike sigmoid gates, GELU gates are not constrained to $(0,1)$.

6.3 SwiGLU

SwiGLU uses SiLU as the gate:

$$\operatorname{SwiGLU}(\mathbf{a},\mathbf{b})=\mathbf{a}\odot\operatorname{SiLU}(\mathbf{b}).$$

SwiGLU is common in modern Transformer feedforward blocks. The architecture details belong to Chapter 14 and Chapter 15; here the reusable math is the bilinear interaction between content and gate streams.

6.4 Bilinear Gating

Gated activations introduce multiplicative interactions:

$$y_i=a_i g_i.$$

The local derivatives are

$$\frac{\partial y_i}{\partial a_i}=g_i,\qquad \frac{\partial y_i}{\partial g_i}=a_i.$$

Thus the gate controls content gradients, and the content controls gate gradients. This is more expressive than an elementwise activation applied to a single stream.
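
A minimal NumPy sketch of the two paths for a sigmoid-gated GLU (random vectors and an arbitrary upstream gradient, all illustrative): the content gradient is scaled by the gate, the gate gradient is scaled by the content, and a finite-difference probe confirms one coordinate.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a = rng.normal(size=5)        # content stream
b = rng.normal(size=5)        # gate preactivation

g = sigmoid(b)
y = a * g                     # GLU(a, b) = a ⊙ σ(b)

# Arbitrary upstream gradient dL/dy for the check.
dy = rng.normal(size=5)

# Two gradient paths from y_i = a_i * σ(b_i):
da = dy * g                   # content path: ∂y_i/∂a_i = σ(b_i)
db = dy * a * g * (1 - g)     # gate path:    ∂y_i/∂b_i = a_i σ'(b_i)

# Finite-difference check of the gate path for one coordinate.
eps = 1e-6
b_pert = b.copy()
b_pert[0] += eps
y_pert = a * sigmoid(b_pert)
print(np.isclose(((y_pert - y) @ dy) / eps, db[0], atol=1e-5))  # True
```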

6.5 Why Gated MLPs Help

Gated MLPs can select features conditionally. A standard activation transforms each coordinate independently. A gated activation lets one projection decide which coordinates of another projection matter. This adds a simple form of feature interaction without attention.

7. Vector Activations

7.1 Softmax

Softmax maps logits to probabilities:

$$s_i=\frac{\exp(z_i)}{\sum_{j=1}^{C}\exp(z_j)}.$$

It is shift-invariant:

$$\operatorname{softmax}(\mathbf{z}+c\mathbf{1})=\operatorname{softmax}(\mathbf{z}).$$

Softmax is used in multiclass classification, attention weights, categorical sampling, and contrastive objectives.

7.2 Temperature Softmax

Temperature softmax is

$$s_i(\tau)=\frac{\exp(z_i/\tau)}{\sum_j\exp(z_j/\tau)}.$$

Small $\tau$ sharpens the distribution. Large $\tau$ flattens it. Temperature changes both probabilities and gradient scale, because derivatives include the factor $1/\tau$.
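
A minimal demonstration on made-up logits of how temperature reshapes the distribution; `softmax_t` is an illustrative helper, not a library function.

```python
import numpy as np

def softmax_t(z, tau):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                  # stability via shift invariance
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])

for tau in (0.25, 1.0, 4.0):
    print(f"tau={tau:<4} probs={np.round(softmax_t(logits, tau), 3)}")
# Small tau concentrates mass on the arg max; large tau moves toward uniform.
```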

7.3 Softmax Jacobian

The softmax Jacobian is

$$J=\operatorname{diag}(\mathbf{s})-\mathbf{s}\mathbf{s}^\top.$$

The diagonal entries are $s_i(1-s_i)$. Off-diagonal entries are $-s_is_j$. This coupling is why increasing one logit decreases other probabilities.

7.4 Sparsemax and Entmax Preview

Softmax assigns positive probability to every class. Sparse alternatives such as sparsemax and entmax can assign exact zeros. These are useful when sparse probability distributions are desirable, but their full treatment belongs to specialized sequence and attention contexts.

7.5 Attention-Output Preview

In attention, softmax converts scores into weights:

$$A=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right).$$

The full attention mechanism belongs in Attention Mechanism Math. Here the important activation fact is that softmax can saturate when logits have large variance, which motivates scaling by $\sqrt{d_k}$.

8. Initialization and Gradient Flow

8.1 Activation Variance

If activations grow layer by layer, later layers may explode. If activations shrink layer by layer, signals may vanish. Initialization chooses weight variance to keep preactivation and activation variance controlled.

8.2 Xavier and He Initialization

Xavier initialization is designed for roughly symmetric activations such as tanh:

$$\operatorname{Var}(W_{ij})\approx\frac{2}{n_{\mathrm{in}}+n_{\mathrm{out}}}.$$

He initialization is designed for ReLU-like activations:

$$\operatorname{Var}(W_{ij})\approx\frac{2}{n_{\mathrm{in}}}.$$

The factor of two appears because ReLU zeroes roughly half of a symmetric input distribution.
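
A forward-pass sketch under idealized assumptions (i.i.d. Gaussian inputs and weights, square ReLU layers, no biases): it tracks the mean-squared activation across depth for the $2/n$ scaling versus a $1/n$ scaling. `forward_scale` is a hypothetical helper for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(z):
    return np.maximum(0.0, z)

def forward_scale(depth, width, weight_std):
    h = rng.normal(size=(4096, width))        # unit-variance input batch
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(width, width))
        h = relu(h @ W)
    return (h ** 2).mean()                    # mean-squared activation after the last layer

width, depth = 256, 20
print("He, Var(W)=2/n:    ", forward_scale(depth, width, np.sqrt(2.0 / width)))
print("Smaller, Var(W)=1/n:", forward_scale(depth, width, np.sqrt(1.0 / width)))
# With 2/n the post-ReLU scale stays on the order of 1 across depth;
# with 1/n it shrinks by roughly a factor of two per layer.
```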

8.3 Vanishing Gradients

If many derivatives satisfy $\lvert \phi'(z)\rvert<1$, products of derivatives can shrink exponentially with depth. This was a major reason sigmoid and tanh hidden activations were hard to train in deep networks without careful initialization, normalization, or residual paths.

8.4 Exploding Gradients

Exploding gradients occur when Jacobian products have large singular values. Activations contribute through their slopes. ReLU slopes are bounded by one, but weights can still produce exploding products. Smooth activations can have slopes slightly above one in some regions, so scale control still matters.

8.5 Residual Connection Preview

Residual connections create paths where gradients can flow through identity maps:

$$\mathbf{h}^{[l+1]}=\mathbf{h}^{[l]}+F(\mathbf{h}^{[l]}).$$

This reduces dependence on long products of activation derivatives. The full model-specific treatment belongs in Chapter 14.

9. Applications in Machine Learning

9.1 CNNs

CNNs historically use ReLU-family activations because they are cheap, piecewise-linear, and sparse. Leaky variants can help when dead filters appear.

9.2 RNN Gates

RNNs use sigmoid gates and tanh candidate states. The bounded range controls state updates, while gates regulate memory. The full recurrent architecture math belongs in RNN and LSTM Math.

9.3 Transformer Feedforward Blocks

Transformers commonly use GELU, SiLU, or gated variants inside feedforward blocks. The activation controls token-wise nonlinear transformation between attention layers. Chapter 14 and Chapter 15 cover the full block design.

9.4 Binary Outputs

Sigmoid maps logits to Bernoulli probabilities for binary classification. Training should usually use a logits-based BCE loss rather than manually applying sigmoid and then taking logs.
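
A sketch of why the fused form is preferred: writing binary cross-entropy directly in terms of the logit via softplus stays finite for extreme logits, while the sigmoid-then-log version overflows. The specific logits and labels below are illustrative.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)) computed stably for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bce_with_logits(z, y):
    # -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))] simplifies to softplus(z) - y*z
    return softplus(z) - y * z

def bce_naive(z, y):
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

print(bce_with_logits(z, y))   # finite everywhere, including the extreme logits
print(bce_naive(z, y))         # inf at the extremes, with overflow/divide warnings
```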

9.5 Probability Heads

Softmax maps class logits to categorical probabilities. Its coupling and shift-invariance make it the standard final activation for multiclass probability heads and many contrastive objectives.

10. Common Mistakes

| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Stacking affine layers without activations | The composition is still affine | Insert nonlinear activations between affine maps |
| 2 | Using sigmoid hidden units in a deep plain MLP | Saturation causes vanishing gradients | Prefer ReLU/GELU/SiLU plus normalization when appropriate |
| 3 | Calling softmax elementwise | Softmax couples coordinates through the denominator | Use the full Jacobian when deriving gradients |
| 4 | Forgetting softmax shift-invariance | Large logits may overflow | Subtract the max before exponentiating |
| 5 | Treating the ReLU derivative at zero as important | The convention rarely changes training | State the chosen convention and move on |
| 6 | Ignoring activation scale in initialization | Variance can explode or vanish | Match initialization to the activation family |
| 7 | Assuming smooth activations are always better | Smoothness changes scale and cost | Compare gradients, not only curves |
| 8 | Applying sigmoid before a logits BCE loss | The loss applies sigmoid internally | Pass raw logits to fused logits losses |
| 9 | Confusing GLU gates with probabilities | GELU/SwiGLU gates are not constrained to $(0,1)$ | Interpret them as multiplicative feature selectors |
| 10 | Using temperature without considering gradients | Temperature rescales derivatives | Retune or monitor gradient norms |

11. Exercises

  1. (*) Derive $\sigma'(x)=\sigma(x)(1-\sigma(x))$.
  2. (*) Derive $\frac{d}{dx}\tanh(x)=1-\tanh^2(x)$.
  3. (*) Compute ReLU, Leaky ReLU, and their derivatives for a vector.
  4. (**) Show that stacked affine layers without activations collapse to one affine map.
  5. (**) Implement stable softmax and verify shift-invariance.
  6. (**) Derive the softmax Jacobian for a three-class vector.
  7. (**) Compare GELU and SiLU curves and derivatives numerically.
  8. (***) Compute gradients for a GLU and explain the two gradient paths.
  9. (***) Estimate activation variance after Xavier and He initialization.
  10. (***) Diagnose a dead-ReLU layer from activation and gradient statistics.

12. Why This Matters for AI

| Activation concept | AI impact |
|---|---|
| Nonlinearity | Gives networks expressive function classes |
| Sigmoid gates | Controls memory and binary probabilities |
| Tanh | Bounded hidden states and centered gates |
| ReLU | Sparse, cheap, deep-friendly hidden activations |
| GELU | Smooth stochastic-style gating in Transformer blocks |
| SiLU/SwiGLU | Self-gated and multiplicative feedforward transformations |
| Softmax | Converts scores into probabilities and attention weights |
| Softmax temperature | Controls sharpness in classification, sampling, and contrastive learning |
| Activation derivatives | Determine gradient propagation and trainability |
| Initialization coupling | Keeps activation scale stable across depth |

13. Conceptual Bridge

Loss functions define the gradient at the output. Activation functions decide how that gradient passes through each hidden layer.

Loss gradient
    -> output head
    -> activation derivatives
    -> layer Jacobians
    -> earlier parameters

Next, Normalization Techniques studies how to control the statistics of those activations so deep networks remain trainable.

References

  • Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors.
  • Nair, V., and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.
  • Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks.
  • He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers.
  • Hendrycks, D., and Gimpel, K. (2016). Gaussian Error Linear Units.
  • Elfwing, S., Uchibe, E., and Doya, K. (2018). Sigmoid-Weighted Linear Units.
  • Shazeer, N. (2020). GLU Variants Improve Transformer.