"A neural network without nonlinear activation is only a disguised linear map."
Overview
Activation functions are the local nonlinear maps that turn stacks of affine layers into expressive models. They decide which signals pass forward, which gradients flow backward, which neurons saturate, which features become sparse, and how numerical scale changes with depth. Their formulas look small, but their training consequences are large.
This section is the canonical home for activation-function math in the curriculum. Chapter 14 discusses how neural networks, RNNs, CNNs, and Transformers use these functions inside model architectures. Chapter 15 discusses LLM-specific block design. Here we focus on the reusable mathematical objects themselves: scalar activations, vector activations, derivatives, Jacobians, saturation, smoothness, Lipschitz behavior, gating, softmax temperature, and gradient-flow consequences.
The central question is not "Which activation is popular?" but "What map does this activation apply, what derivative does it create, and what does that derivative do to learning?"
Prerequisites
- Single-variable derivatives and chain rule - Derivatives
- Jacobians and vector-valued functions - Jacobian Matrix
- Gradient flow through optimization - Gradient Descent
- Loss gradients at the output - Loss Functions
- Numerical stability - Floating Point Arithmetic
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Curves, derivatives, Jacobians, saturation, gated activations, and softmax temperature |
| exercises.ipynb | 10 graded exercises on activation values, derivatives, gradients, and stable implementations |
Learning Objectives
After completing this section, you will be able to:
- Define scalar, elementwise vector, and coupled vector activations
- Compute derivatives for sigmoid, tanh, ReLU, Leaky ReLU, GELU, SiLU, and softmax
- Explain saturation and why it causes vanishing gradients
- Compare piecewise-linear and smooth activations by gradient behavior
- Derive the softmax Jacobian
- Explain why gated activations such as GLU, GEGLU, and SwiGLU are multiplicative feature selectors
- Connect activation variance to Xavier and He initialization
- Diagnose dead neurons, exploding activations, saturated gates, and unstable softmax
- Choose activations by mathematical role rather than popularity
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Classical Activations
- 4. ReLU Family
- 5. Smooth Modern Activations
- 6. Gated Activations
- 7. Vector Activations
- 8. Initialization and Gradient Flow
- 9. Applications in Machine Learning
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI
- 13. Conceptual Bridge
- References
1. Intuition
1.1 Why Nonlinearities Matter
An affine layer has the form

$$z = W x + b.$$

If we stack affine layers without nonlinearities, the result is still affine:

$$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2).$$

Depth alone does not create nonlinear expressivity. Activation functions insert nonlinear maps between affine transforms:

$$h = \phi(W x + b).$$
The activation is what lets the network bend decision boundaries, compose features, gate information, and approximate complex functions.
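The affine-collapse claim above can be checked numerically. This is a minimal pure-Python sketch with hypothetical 2x2 weights: composing two affine maps gives exactly one affine map with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
def affine(W, b, x):
    """Apply z = W x + b for a 2x2 matrix W and length-2 vectors b, x."""
    return [W[i][0] * x[0] + W[i][1] * x[1] + b[i] for i in range(2)]

W1, b1 = [[2.0, 0.0], [1.0, 1.0]], [0.5, -0.5]
W2, b2 = [[1.0, -1.0], [0.0, 3.0]], [1.0, 2.0]

# Collapse the stack: W = W2 @ W1, b = W2 @ b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
b = affine(W2, b2, b1)

x = [0.7, -1.3]
stacked = affine(W2, b2, affine(W1, b1, x))   # two layers, no activation
collapsed = affine(W, b, x)                   # one equivalent affine map
print(stacked, collapsed)                     # identical up to float rounding
```

No nonlinearity between the layers means no gain in expressivity, which is exactly what the single collapsed map demonstrates.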
1.2 Activations as Gates
Many activations behave like gates. ReLU passes positive values and blocks negative values:

$$\mathrm{ReLU}(x) = \max(0, x).$$

Sigmoid maps values into $(0, 1)$ and is often interpreted as a soft gate:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

Gated activations multiply a content stream by a learned gate:

$$y = a \odot \sigma(b).$$
This gate perspective is especially useful in LSTMs, GRUs, gated MLPs, and Transformer feedforward variants.
1.3 Activations as Gradient Controllers
Backpropagation multiplies by activation derivatives. In a depth-$L$ network, early-layer gradients contain products of terms like

$$\prod_{\ell=1}^{L} W_\ell^{\top}\, \mathrm{diag}\big(\phi'(z_\ell)\big).$$

If $\phi'(z_\ell)$ is often near zero, gradients vanish. If the product of weight norms and activation slopes is large, gradients can explode. This is why sigmoid can saturate deep networks, why ReLU helped train deeper models, and why normalization and residual connections matter.
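A minimal sketch of the vanishing mechanism: even at sigmoid's maximum slope of $1/4$, multiplying that slope across 20 layers leaves almost nothing, and mildly saturated units make it far worse.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

depth = 20
best_case = sigmoid_grad(0.0) ** depth   # slope 1/4 at every layer
saturated = sigmoid_grad(3.0) ** depth   # mildly saturated units
print(best_case)   # 0.25**20, about 9.1e-13
print(saturated)   # many orders of magnitude smaller still
```

Real gradients also multiply by weight matrices, so this scalar product is only the activation-derivative part of the story, but it already shows why depth punishes saturating activations.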
1.4 Shape of Activation Curves
The curve shape determines both forward statistics and backward gradients:
| Activation | Range | Saturates? | Derivative behavior |
|---|---|---|---|
| Sigmoid | $(0, 1)$ | Both tails | Max derivative $1/4$ at $x = 0$ |
| Tanh | $(-1, 1)$ | Both tails | Max derivative $1$ at $x = 0$ |
| ReLU | $[0, \infty)$ | Negative side | Derivative $0$ or $1$ |
| GELU | $\approx (-0.17, \infty)$ | Soft negative gate | Smooth, nonmonotone region |
| SiLU | $\approx (-0.28, \infty)$ | Soft negative gate | Smooth self-gating |
| Softmax | Simplex | Probability saturation | Coupled Jacobian |
1.5 Historical Path
Sigmoid and tanh dominated early neural networks because they looked like smooth biological firing rates and probabilistic gates. ReLU became important because it reduced saturation and made sparse activations easy. Smooth variants such as GELU and SiLU became common in large architectures because they combine gating-like behavior with differentiability. Gated activations such as GLU, GEGLU, and SwiGLU made multiplicative feature selection a standard MLP ingredient.
The historical lesson is that activation choice follows training dynamics. Expressivity, gradient propagation, initialization, normalization, and hardware all influence the best choice.
2. Formal Definitions
2.1 Scalar Activation
A scalar activation is a function

$$\phi : \mathbb{R} \to \mathbb{R}$$

applied independently to each preactivation value.

Examples: sigmoid, tanh, ReLU, Leaky ReLU, softplus, GELU, and SiLU.
Non-examples:
- A loss function is not an activation, because it compares prediction and target.
- A full neural-network layer is not an activation, because it is affine and parameterized.
2.2 Elementwise Vector Activation
Given $z \in \mathbb{R}^n$, an elementwise activation applies the same scalar map to each coordinate:

$$\phi(z)_i = \phi(z_i).$$

The Jacobian is diagonal:

$$J_\phi(z) = \mathrm{diag}\big(\phi'(z_1), \ldots, \phi'(z_n)\big).$$
Examples include sigmoid, tanh, ReLU, GELU, and SiLU when applied to vectors.
2.3 Coupled Vector Activation
A coupled vector activation maps $\mathbb{R}^n$ to $\mathbb{R}^n$, but each output can depend on multiple input coordinates. Softmax is the central example:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}.$$
Changing one logit changes all softmax probabilities because the denominator couples every coordinate.
2.4 Jacobian
For $p = \mathrm{softmax}(z)$, the Jacobian entries are

$$\frac{\partial p_i}{\partial z_j} = p_i \big(\delta_{ij} - p_j\big).$$

In matrix form,

$$J = \mathrm{diag}(p) - p\,p^{\top}.$$

This matrix is positive semidefinite and has row sums zero. The row-sum property reflects shift invariance:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z).$$
2.5 Smoothness, Lipschitz Constants, and Monotonicity
An activation is smooth if it has continuous derivatives of the needed order. ReLU is continuous but not differentiable at zero. GELU and SiLU are smooth. Sigmoid, tanh, softplus, and softmax are smooth on their domains.
An activation is Lipschitz if there exists a constant $L \ge 0$ such that

$$|\phi(x) - \phi(y)| \le L\,|x - y| \quad \text{for all } x, y \in \mathbb{R}.$$

For differentiable scalar activations, a bounded derivative gives a Lipschitz constant. ReLU is $1$-Lipschitz. Sigmoid is $\tfrac{1}{4}$-Lipschitz. Tanh is $1$-Lipschitz.
3. Classical Activations
3.1 Sigmoid
The sigmoid is

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

Its derivative is

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big).$$

The derivative is largest at $x = 0$, where it equals $1/4$, and approaches zero in both tails. This is why sigmoid works well as a gate but poorly as a default hidden activation in deep feedforward networks.
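The derivative identity above can be checked numerically. A minimal sketch comparing $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ against a central finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

h = 1e-6
for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    # Central difference approximates the true derivative to O(h^2).
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(x)) < 1e-8

print(sigmoid_grad(0.0))  # 0.25, the maximum slope
```

The same finite-difference pattern works for every scalar activation in this section and is a good habit when implementing custom backward passes.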
3.2 Tanh
The hyperbolic tangent is

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

Its derivative is

$$\tanh'(x) = 1 - \tanh^2(x).$$

Tanh is zero-centered, unlike sigmoid, which often makes optimization easier. But it still saturates for large $|x|$.
3.3 Softplus
Softplus is a smooth approximation to ReLU:

$$\mathrm{softplus}(x) = \log\big(1 + e^{x}\big).$$

Its derivative is sigmoid:

$$\frac{d}{dx}\,\mathrm{softplus}(x) = \sigma(x).$$
Softplus is useful when a strictly positive output is needed, for example a variance or rate parameter.
3.4 Saturation
An activation saturates when its derivative becomes very small over a large input region. Sigmoid and tanh saturate in both tails:

$$\lim_{|x| \to \infty} \sigma'(x) = 0, \qquad \lim_{|x| \to \infty} \tanh'(x) = 0.$$
Saturation creates vanishing gradients. Once a unit enters a saturated region, large upstream gradients may be multiplied by near-zero local derivatives.
3.5 Gate Interpretation
Sigmoid values can be interpreted as soft gates because they lie between zero and one. If

$$g = \sigma(z), \qquad y = g \odot h,$$

then each coordinate of $g$ decides how much of the input passes. This pattern is foundational for LSTM input, forget, and output gates, and it reappears in modern gated feedforward blocks.
4. ReLU Family
4.1 ReLU
ReLU is

$$\mathrm{ReLU}(x) = \max(0, x).$$

A common derivative convention is

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0. \end{cases}$$

ReLU avoids positive-side saturation and creates sparse activations. Its main failure mode is dead neurons: units whose preactivations remain negative so their gradients stay zero.
4.2 Leaky ReLU
Leaky ReLU uses a small negative slope $\alpha$, often $0.01$:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0. \end{cases}$$

Its derivative is $\alpha$ on the negative side and $1$ on the positive side. This reduces the dead-neuron problem while preserving much of ReLU's behavior.
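A minimal sketch of both functions and their derivatives on a small vector, using the common but arbitrary choice $\alpha = 0.01$ and the derivative-zero-at-zero convention:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # convention: derivative 0 at x = 0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

z = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(v) for v in z])             # [0.0, 0.0, 0.0, 0.5, 2.0]
print([relu_grad(v) for v in z])        # [0.0, 0.0, 0.0, 1.0, 1.0]
print([leaky_relu(v) for v in z])       # negative inputs leak through scaled by 0.01
print([leaky_relu_grad(v) for v in z])  # negative side keeps slope 0.01
```

The key contrast is in the gradient lists: ReLU's zeros on the negative side are what make irreversibly dead neurons possible, while the leaky variant always passes some gradient.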
4.3 PReLU
Parametric ReLU learns the negative slope:

$$\mathrm{PReLU}(x) = \begin{cases} x & x > 0 \\ a\,x & x \le 0, \end{cases}$$

where the slope $a$ is a learned parameter. This can improve flexibility, but it adds a small risk: if slopes grow uncontrolled, activation scale can drift.
4.4 ELU and SELU Preview
ELU uses an exponential negative branch:

$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha\,\big(e^{x} - 1\big) & x \le 0. \end{cases}$$
SELU scales ELU to encourage self-normalizing behavior under assumptions about initialization and network structure. These activations are useful historically and conceptually, though modern large Transformer-style models more commonly use GELU, SiLU, or gated variants.
4.5 Dead Neurons and Sparse Activations
A ReLU neuron is dead when its preactivation is negative for almost all examples and updates do not move it back into the active region. Causes include too-large learning rates, biased initialization, distribution shift, and poor normalization.
Sparse activation is not always bad. ReLU sparsity can improve feature selectivity and computation. The problem is irreversible inactivity, not zeros themselves.
5. Smooth Modern Activations
5.1 GELU
GELU is

$$\mathrm{GELU}(x) = x\,\Phi(x),$$

where $\Phi$ is the standard normal CDF. A common approximation is

$$\mathrm{GELU}(x) \approx \frac{1}{2}\,x\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\,\big(x + 0.044715\,x^{3}\big)\right]\right).$$
GELU can be read as stochastic-looking gating: larger positive values pass almost unchanged, large negative values are suppressed, and values near zero are softly weighted.
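A minimal sketch comparing the exact GELU, computed through the normal CDF via `math.erf`, with the tanh approximation from the formula above:

```python
import math

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# The approximation tracks the exact curve closely across the useful range.
for x in [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    assert abs(gelu_exact(x) - gelu_tanh(x)) < 1e-3

print(gelu_exact(-0.75))  # near the curve's minimum, about -0.17
```

The small dip below zero around $x \approx -0.75$ is the nonmonotone region mentioned earlier: GELU suppresses but does not hard-zero moderately negative inputs.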
5.2 SiLU and Swish
SiLU, also called Swish when parameterized, is

$$\mathrm{SiLU}(x) = x\,\sigma(x).$$

Its derivative is

$$\mathrm{SiLU}'(x) = \sigma(x) + x\,\sigma(x)\big(1 - \sigma(x)\big).$$
SiLU is smooth and self-gated. It allows small negative outputs, which can improve gradient flow compared with hard zeroing.
5.3 Mish
Mish is

$$\mathrm{Mish}(x) = x\,\tanh\big(\mathrm{softplus}(x)\big).$$
Like GELU and SiLU, Mish is smooth and allows a negative tail. It is less standard in large LLM stacks but useful for understanding the family of smooth self-gated activations.
5.4 Curvature Comparison
Smooth activations have nonzero second derivatives over wider regions. That changes curvature seen by optimizers. ReLU is piecewise linear, so its second derivative is zero away from the kink. GELU and SiLU have smooth curvature near zero, which can give more gradual gradient transitions.
Curvature is not automatically good. It can improve optimization smoothness, but it also changes gradient scale and may interact with initialization.
5.5 Gradient-Flow Effects
The main gradient-flow question is: what local slope $\phi'(z)$ does the activation contribute at the preactivation values the network actually visits during training?
Sigmoid has small slopes in the tails. ReLU has slope one on active units and zero on inactive units. GELU and SiLU have soft slopes that vary smoothly. Gated activations multiply by learned factors, so their gradients include both content and gate paths.
6. Gated Activations
6.1 GLU
Given two vectors $a, b \in \mathbb{R}^n$, the gated linear unit is

$$\mathrm{GLU}(a, b) = a \odot \sigma(b).$$

The output is linear content modulated by a sigmoid gate. The derivative has two paths: one through $a$ and one through $b$.
6.2 GEGLU
GEGLU replaces the sigmoid gate with GELU:

$$\mathrm{GEGLU}(a, b) = a \odot \mathrm{GELU}(b).$$

This gives a smooth non-probability gate. Unlike sigmoid gates, GELU gates are not constrained to $(0, 1)$.
6.3 SwiGLU
SwiGLU uses SiLU as the gate:

$$\mathrm{SwiGLU}(a, b) = a \odot \mathrm{SiLU}(b).$$
SwiGLU is common in modern Transformer feedforward blocks. The architecture details belong to Chapter 14 and Chapter 15; here the reusable math is the bilinear interaction between content and gate streams.
6.4 Bilinear Gating
Gated activations introduce multiplicative interactions:

$$y = a \odot g(b).$$

The local derivatives are

$$\frac{\partial y_i}{\partial a_i} = g(b_i), \qquad \frac{\partial y_i}{\partial b_i} = a_i\, g'(b_i).$$
Thus the gate controls content gradients, and the content controls gate gradients. This is more expressive than an elementwise activation applied to a single stream.
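The two gradient paths can be verified on a single coordinate of a sigmoid-gated GLU, $y = a\,\sigma(b)$. A minimal sketch checking the analytic derivatives against central finite differences:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(a, b):
    return a * sigmoid(b)  # one coordinate of a * sigma(b)

a, b, h = 1.5, -0.3, 1e-6
grad_a = sigmoid(b)                           # content path: dy/da = sigma(b)
grad_b = a * sigmoid(b) * (1.0 - sigmoid(b))  # gate path: dy/db = a * sigma'(b)

num_a = (glu(a + h, b) - glu(a - h, b)) / (2 * h)
num_b = (glu(a, b + h) - glu(a, b - h)) / (2 * h)
assert abs(grad_a - num_a) < 1e-8
assert abs(grad_b - num_b) < 1e-8
print(grad_a, grad_b)
```

Note how `grad_b` is scaled by the content value `a`: a large content stream amplifies learning signal into the gate, and a closed gate shrinks signal into the content, exactly the coupling described above.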
6.5 Why Gated MLPs Help
Gated MLPs can select features conditionally. A standard activation transforms each coordinate independently. A gated activation lets one projection decide which coordinates of another projection matter. This adds a simple form of feature interaction without attention.
7. Vector Activations
7.1 Softmax
Softmax maps logits to probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.$$

It is shift-invariant:

$$\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z).$$
Softmax is used in multiclass classification, attention weights, categorical sampling, and contrastive objectives.
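Shift invariance is also the key to a numerically stable implementation: subtracting the max logit before exponentiating leaves the probabilities unchanged but keeps every exponent non-positive. A minimal sketch:

```python
import math

def softmax(z):
    m = max(z)                              # shift by the max logit
    exps = [math.exp(v - m) for v in z]     # all exponents <= 0, no overflow
    total = sum(exps)
    return [e / total for e in exps]

z = [1000.0, 1001.0, 1002.0]                # naive exp() would overflow here
p = softmax(z)
q = softmax([v - 1000.0 for v in z])        # shifted logits, same distribution
assert all(abs(pi - qi) < 1e-12 for pi, qi in zip(p, q))
assert abs(sum(p) - 1.0) < 1e-12
print(p)
```

This max-subtraction trick is the standard fix listed under Common Mistakes for overflowing softmax.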
7.2 Temperature Softmax
Temperature softmax is

$$\mathrm{softmax}_{\tau}(z)_i = \frac{e^{z_i/\tau}}{\sum_{j} e^{z_j/\tau}}.$$

Small $\tau$ sharpens the distribution. Large $\tau$ flattens it. Temperature changes both probabilities and gradient scale, because derivatives include the factor $1/\tau$.
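A minimal sketch of the same logits under three temperatures, showing sharpening at small $\tau$ and flattening at large $\tau$:

```python
import math

def softmax_temp(z, tau):
    m = max(z)  # shift for numerical stability
    exps = [math.exp((v - m) / tau) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.0]
for tau in [0.25, 1.0, 4.0]:
    print(tau, [round(p, 4) for p in softmax_temp(z, tau)])
# tau = 0.25 concentrates almost all mass on the largest logit;
# tau = 4.0 pushes the distribution toward uniform.
```

The same function appears in sampling from language models, where $\tau$ trades determinism against diversity.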
7.3 Softmax Jacobian
The softmax Jacobian is

$$\frac{\partial p_i}{\partial z_j} = p_i \big(\delta_{ij} - p_j\big).$$

The diagonal entries are $p_i(1 - p_i)$. Off-diagonal entries are $-p_i p_j$. This coupling is why increasing one logit decreases the other probabilities.
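A minimal sketch building the $3 \times 3$ Jacobian $J = \mathrm{diag}(p) - p\,p^{\top}$ entrywise and checking the row-sum-zero property and the sign pattern just described:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 0.0, -1.0])
J = [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(3)] for i in range(3)]

for row in J:
    assert abs(sum(row)) < 1e-12      # rows sum to zero (shift invariance)
for i in range(3):
    assert J[i][i] > 0                # diagonal p_i (1 - p_i) > 0
    for j in range(3):
        if i != j:
            assert J[i][j] < 0        # off-diagonal -p_i p_j < 0
print(J[0])
```

The zero row sums are the differential form of shift invariance: adding a constant to all logits moves along a direction the Jacobian annihilates.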
7.4 Sparsemax and Entmax Preview
Softmax assigns positive probability to every class. Sparse alternatives such as sparsemax and entmax can assign exact zeros. These are useful when sparse probability distributions are desirable, but their full treatment belongs to specialized sequence and attention contexts.
7.5 Attention-Output Preview
In attention, softmax converts scores into weights:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

The full attention mechanism belongs in Attention Mechanism Math. Here the important activation fact is that softmax can saturate when logits have large variance, which motivates scaling by $\sqrt{d_k}$.
8. Initialization and Gradient Flow
8.1 Activation Variance
If activations grow layer by layer, later layers may explode. If activations shrink layer by layer, signals may vanish. Initialization chooses weight variance to keep preactivation and activation variance controlled.
8.2 Xavier and He Initialization
Xavier initialization is designed for roughly symmetric activations such as tanh:

$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}.$$

He initialization is designed for ReLU-like activations:

$$\mathrm{Var}(W_{ij}) = \frac{2}{n_{\mathrm{in}}}.$$

The factor of two appears because ReLU zeroes roughly half of a symmetric input distribution.
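The factor of two can be seen in a small Monte Carlo experiment. This is a minimal pure-Python sketch with hypothetical layer sizes: with He scaling $\mathrm{Var}(W) = 2/n_{\mathrm{in}}$, the mean square of ReLU outputs stays near the input mean square of $1$, while a $1/n_{\mathrm{in}}$ scaling halves it.

```python
import random

random.seed(0)
n_in, trials = 256, 2000  # hypothetical fan-in and sample count

def relu_mean_square(weight_std):
    """Average of ReLU(w . x)^2 over random unit-variance inputs."""
    total = 0.0
    for _ in range(trials):
        x = [random.gauss(0, 1) for _ in range(n_in)]
        w = [random.gauss(0, weight_std) for _ in range(n_in)]
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += max(0.0, z) ** 2
    return total / trials

he = relu_mean_square((2.0 / n_in) ** 0.5)           # He scaling
xavier_like = relu_mean_square((1.0 / n_in) ** 0.5)  # tanh-oriented scaling
print(he)           # close to 1.0
print(xavier_like)  # roughly 0.5: ReLU discarded half the signal power
```

He initialization tracks the second moment rather than the variance, because the second moment is what propagates signal scale through a ReLU layer.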
8.3 Vanishing Gradients
If many derivatives satisfy $|\phi'(z)| \ll 1$, products of derivatives can shrink exponentially with depth. This was a major reason sigmoid and tanh hidden activations were hard to train in deep networks without careful initialization, normalization, or residual paths.
8.4 Exploding Gradients
Exploding gradients occur when Jacobian products have large singular values. Activations contribute through their slopes. ReLU slopes are bounded by one, but weights can still produce exploding products. Smooth activations can have slopes slightly above one in some regions, so scale control still matters.
8.5 Residual Connection Preview
Residual connections create paths where gradients can flow through identity maps:

$$h_{\ell+1} = h_\ell + f(h_\ell).$$
This reduces dependence on long products of activation derivatives. The full model-specific treatment belongs in Chapter 14.
9. Applications in Machine Learning
9.1 CNNs
CNNs historically use ReLU-family activations because they are cheap, piecewise-linear, and sparse. Leaky variants can help when dead filters appear.
9.2 RNN Gates
RNNs use sigmoid gates and tanh candidate states. The bounded range controls state updates, while gates regulate memory. The full recurrent architecture math belongs in RNN and LSTM Math.
9.3 Transformer Feedforward Blocks
Transformers commonly use GELU, SiLU, or gated variants inside feedforward blocks. The activation controls token-wise nonlinear transformation between attention layers. Chapter 14 and Chapter 15 cover the full block design.
9.4 Binary Outputs
Sigmoid maps logits to Bernoulli probabilities for binary classification. Training should usually use a logits-based BCE loss rather than manually applying sigmoid and then taking logs.
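The reason to prefer a logits-based loss is numerical. A minimal sketch of a stable binary cross-entropy on raw logits, using the standard identity $\mathrm{BCE}(z, y) = \max(z, 0) - z\,y + \log\!\big(1 + e^{-|z|}\big)$, checked against the naive sigmoid-then-log version in the regime where both are safe:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(z, y):
    # Unstable for extreme logits: sigmoid underflows and log blows up.
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    # Stable fused form: every term is bounded or uses log1p on a small value.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# The two agree wherever the naive version is still well-behaved.
for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    for y in [0.0, 1.0]:
        assert abs(bce_naive(z, y) - bce_with_logits(z, y)) < 1e-9

print(bce_with_logits(-50.0, 1.0))  # about 50, finite and correct
```

Framework losses that fuse sigmoid and BCE follow this same identity, which is why passing raw logits to them is the recommended pattern.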
9.5 Probability Heads
Softmax maps class logits to categorical probabilities. Its coupling and shift-invariance make it the standard final activation for multiclass probability heads and many contrastive objectives.
10. Common Mistakes
| # | Mistake | Why It Is Wrong | Fix |
|---|---|---|---|
| 1 | Stacking affine layers without activations | The composition is still affine | Insert nonlinear activations between affine maps |
| 2 | Using sigmoid hidden units in a deep plain MLP | Saturation causes vanishing gradients | Prefer ReLU/GELU/SiLU plus normalization when appropriate |
| 3 | Calling softmax elementwise | Softmax couples coordinates through the denominator | Use the full Jacobian when deriving gradients |
| 4 | Forgetting softmax shift-invariance | Large logits may overflow | Subtract the max before exponentiating |
| 5 | Treating ReLU derivative at zero as important | The convention rarely changes training | State the chosen convention and move on |
| 6 | Ignoring activation scale in initialization | Variance can explode or vanish | Match initialization to activation family |
| 7 | Assuming smooth activations are always better | Smoothness changes scale and cost | Compare gradients, not only curves |
| 8 | Applying sigmoid before a logits BCE loss | The loss applies sigmoid internally | Pass raw logits to fused logits losses |
| 9 | Confusing GLU gates with probabilities | GELU/SwiGLU gates are not constrained to $(0, 1)$ | Interpret them as multiplicative feature selectors |
| 10 | Using temperature without considering gradients | Temperature rescales derivatives | Retune or monitor gradient norms |
11. Exercises
- (*) Derive $\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$.
- (*) Derive $\dfrac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$.
- (*) Compute ReLU, Leaky ReLU, and their derivatives for a vector.
- (**) Show that stacked affine layers without activations collapse to one affine map.
- (**) Implement stable softmax and verify shift-invariance.
- (**) Derive the softmax Jacobian for a three-class vector.
- (**) Compare GELU and SiLU curves and derivatives numerically.
- (***) Compute gradients for a GLU and explain the two gradient paths.
- (***) Estimate activation variance after Xavier and He initialization.
- (***) Diagnose a dead-ReLU layer from activation and gradient statistics.
12. Why This Matters for AI
| Activation concept | AI impact |
|---|---|
| Nonlinearity | Gives networks expressive function classes |
| Sigmoid gates | Controls memory and binary probabilities |
| Tanh | Bounded hidden states and centered gates |
| ReLU | Sparse, cheap, deep-friendly hidden activations |
| GELU | Smooth stochastic-style gating in Transformer blocks |
| SiLU/SwiGLU | Self-gated and multiplicative feedforward transformations |
| Softmax | Converts scores into probabilities and attention weights |
| Softmax temperature | Controls sharpness in classification, sampling, and contrastive learning |
| Activation derivatives | Determine gradient propagation and trainability |
| Initialization coupling | Keeps activation scale stable across depth |
13. Conceptual Bridge
Loss functions define the gradient at the output. Activation functions decide how that gradient passes through each hidden layer.
Loss gradient
-> output head
-> activation derivatives
-> layer Jacobians
-> earlier parameters
Next, Normalization Techniques studies how to control the statistics of those activations so deep networks remain trainable.
References
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors.
- Nair, V., and Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.
- Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks.
- He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers.
- Hendrycks, D., and Gimpel, K. (2016). Gaussian Error Linear Units.
- Elfwing, S., Uchibe, E., and Doya, K. (2018). Sigmoid-Weighted Linear Units.
- Shazeer, N. (2020). GLU Variants Improve Transformer.