Math for LLMs / Mathematical Foundations

Functions and Mappings

<- Sets and Logic | Next: Summation and Product Notation ->

"A neural network is a composition of functions. Understanding functions - what they preserve, what they destroy, how they compose, where they break - is understanding neural networks at the deepest possible level."

Overview

A function is the most fundamental concept in mathematics after sets. It is a rule that assigns to each input exactly one output. Every mathematical operation you have ever performed - adding, multiplying, differentiating, integrating - is a function. Every layer in a neural network is a function. The entire model from tokens in to probabilities out is a single (very complex) function. Training is optimisation over a space of functions. Inference is function evaluation.

This chapter develops function theory from first principles through to the frontier concepts needed to understand modern LLMs and AI systems at a mathematical level. We cover the formal definition, types (injective, surjective, bijective), composition, inverses, special classes (linear, affine, activation functions, convex, Lipschitz), higher-order functions and functionals, continuity, differentiability, measure-theoretic foundations, the systematic role of functions in machine learning, and the category-theoretic perspective that unifies it all.

Every concept is connected to how it appears in real neural network computation, with emphasis on transformer architectures, attention mechanisms, and the training pipeline.

Prerequisites

  • Set theory fundamentals (Chapter 02: Sets and Logic)
  • Basic algebra and coordinate geometry
  • Familiarity with vectors and matrices (helpful but not required yet)
  • Python/NumPy basics for code examples

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive code: function visualisation, composition chains, activation functions, Lipschitz demos |
| exercises.ipynb | Practice problems: injectivity proofs, composition calculations, inverse construction, approximation |

Learning Objectives

After completing this section, you will:

  • Understand functions formally as subsets of Cartesian products satisfying totality and uniqueness
  • Distinguish domain, codomain, range, image, and preimage - and know why each matters for AI
  • Classify functions as injective, surjective, bijective, monotone, or periodic
  • Compose functions and understand associativity, non-commutativity, and the identity function
  • Construct inverses (left, right, two-sided, pseudo-inverse) and connect invertibility to information preservation
  • Master special function classes: linear, affine, multilinear, polynomial, activation functions, convex, Lipschitz
  • Work with higher-order functions, functionals, operators, currying, and function spaces
  • Understand continuity (pointwise, uniform) and its role in gradient-based optimisation
  • Apply derivatives, Jacobians, and the chain rule to function composition (backpropagation)
  • Connect functions to measure theory: measurable functions, Lebesgue integration, probability distributions
  • See every component of a neural network as a specific type of function with specific mathematical properties
  • Appreciate the category-theoretic perspective: morphisms, functors, natural transformations

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Types of Functions
  4. Function Composition
  5. Inverse Functions
  6. Special Classes of Functions
  7. Higher-Order Functions and Functionals
  8. Continuity and Limits
  9. Differentiable Functions and Derivatives
  10. Measure-Theoretic View of Functions
  11. Functions in Machine Learning
  12. Category Theory Perspective
  13. Common Mistakes and Misconceptions
  14. Exercises and Practice Problems
  15. Why This Matters for AI/LLMs
  16. Conceptual Bridge

1. Intuition

1.1 What Is a Function?

A function is a rule that assigns to each input exactly one output. No ambiguity, no missing outputs, no contradictions. If you give a function the same input twice, you get the same output both times. This is the single most important mathematical concept you will ever encounter.

The word "function" comes from the Latin functio, meaning "performance" or "execution". A function performs a transformation: it takes something and produces something else, reliably and deterministically.

The machine metaphor. Think of a function as a machine with an input slot and an output slot. You drop an object into the input slot; the machine does something to it; exactly one object comes out of the output slot. The same input always produces the same output. The machine never refuses an input (within its declared domain), and it never produces two outputs for a single input.

              +--------------+
  input  --->|   f(x) = 2x  |--->  output
  x = 3      |              |      f(3) = 6
              +--------------+

  Same input, same output - always:
  f(3) = 6, f(3) = 6, f(3) = 6, ...

Formal view. A function $f: A \to B$ is a subset of the Cartesian product $A \times B$ such that every element of $A$ appears as the first element of exactly one ordered pair. If $(a, b_1) \in f$ and $(a, b_2) \in f$, then $b_1 = b_2$. This definition reduces the concept of a function to the concept of a set - and since set theory is the foundation of all mathematics, this means functions are built on the most basic level of the mathematical hierarchy.

The word "mapping". The word "mapping" is a synonym for "function" that emphasises the geometric or structural idea. When we say "$f$ maps $A$ to $B$", we are emphasising that $f$ takes each point in one space and places it in another space. In linear algebra, we often say "linear map" rather than "linear function". In topology, we say "continuous map". The word "map" carries the same formal definition as "function".

For AI and LLMs. Every layer of a neural network is a function. Every activation function is a function. Every loss function is a function. The entire model - from raw text input to probability distribution over vocabulary - is a single (complicated) function. Training is optimisation: finding the best function within a parameterised family. Inference is function evaluation: computing the output for a given input. Understanding functions deeply is not a prerequisite for AI - it is understanding AI.

1.2 Why Functions Are Central to AI

Every component of a modern AI system is a function. Here is a systematic mapping:

| AI Component | Mathematical Function | Type Signature |
|---|---|---|
| Neural network layer | Maps vector space to vector space | $f_l: \mathbb{R}^{d} \to \mathbb{R}^{d'}$ |
| Embedding layer | Maps discrete tokens to continuous vectors | $E: V \to \mathbb{R}^d$ |
| Attention mechanism | Maps query, key, value triples to output | $\text{Attn}: \mathbb{R}^{n \times d_k} \times \mathbb{R}^{n \times d_k} \times \mathbb{R}^{n \times d_v} \to \mathbb{R}^{n \times d_v}$ |
| Loss function | Maps parameters to scalar loss | $L: \Theta \to \mathbb{R}$ |
| Tokeniser | Maps strings to token sequences | $T: \Sigma^* \to V^*$ |
| Probability head | Maps logits to probability distribution | $\sigma: \mathbb{R}^{\lvert V \rvert} \to \Delta^{\lvert V \rvert - 1}$ |
| Activation function | Elementwise nonlinearity | $\sigma: \mathbb{R} \to \mathbb{R}$ |
| Optimiser step | Maps current parameters to updated parameters | $\text{step}: \Theta \to \Theta$ |
| Learning rate scheduler | Maps step count to learning rate | $\eta: \mathbb{N} \to \mathbb{R}^+$ |
| Positional encoding | Maps position index to vector | $\text{PE}: \mathbb{N} \to \mathbb{R}^d$ |

Everything that happens inside an LLM - from the moment text enters as bytes to the moment a probability distribution over the next token is produced - is a composition of functions. Understanding the mathematical properties of these functions (continuity, differentiability, Lipschitz constants, injectivity, equivariance) directly determines what the model can learn, how stable training is, and what the model can represent.

1.3 Functions as Transformations

The geometric viewpoint reveals what functions do to space. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ takes every point in $n$-dimensional space and moves it to a point in $m$-dimensional space. Collections of points - lines, surfaces, regions - are transformed into new collections.

Linear functions preserve structure. Under a linear function, straight lines map to straight lines (or to points). The origin maps to the origin. Parallelism is preserved. Grid lines remain evenly spaced. Linear functions can rotate, reflect, stretch, compress, and project, but they cannot bend or fold.

Linear transformation of a grid:

  Before (identity):          After (rotation + stretch):
  +---+---+---+               /   /   /   /
  |   |   |   |              /   /   /   /
  +---+---+---+             /   /   /   /
  |   |   |   |            /   /   /   /
  +---+---+---+           /   /   /   /
  |   |   |   |
  +---+---+---+
  Grid lines remain straight; origin stays fixed.

Nonlinear functions bend, fold, and warp space. Under a nonlinear function, straight lines can become curves, parallel lines can converge or diverge, and the structure of space is fundamentally altered. This is exactly what enables neural networks to learn complex decision boundaries - if all functions were linear, a neural network would be no more powerful than a single matrix multiplication.

Invertible functions (bijections) transform space without destroying any information. Every point in the output space came from exactly one point in the input space, so you can always "undo" the transformation and recover the original. This is the principle behind normalising flows: a chain of bijective transformations that transforms a simple distribution (like Gaussian) into a complex one (like the data distribution), with the guarantee that the transformation can be reversed.

Non-invertible functions (many-to-one) destroy information. Multiple input points map to the same output point. Once the function is applied, you cannot tell which input was used. ReLU is a simple example: both $x = -3$ and $x = -5$ map to $\text{ReLU}(x) = 0$, so if you only see the output $0$, you cannot recover the input. This information destruction is actually useful - it is a form of compression, throwing away irrelevant details and keeping what matters.

Deep networks compose many simple functions. Each individual function may be simple (an affine map followed by an elementwise nonlinearity), but the composition of many such functions creates a global transformation that is highly nonlinear and enormously expressive. This is the fundamental insight of deep learning: complexity emerges from the composition of simplicity.

1.4 The Language of Functions in Mathematics

Functions appear everywhere in mathematics, but under different names depending on the context. Each name emphasises a different aspect of the same underlying concept:

| Name | Emphasis | Example |
|---|---|---|
| Function | General concept; rule from inputs to outputs | $f: \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$ |
| Map / Mapping | Geometric/structural view; moving points between spaces | Linear map $T: V \to W$ |
| Morphism | Structure-preserving; respects algebraic properties | Group homomorphism $\phi: G \to H$ |
| Operator | Function acting on functions (or function spaces) | Differentiation $D$, integration $\int$ |
| Functional | Function whose input is a function, output is a scalar | Definite integral $I[f] = \int_a^b f(x)\,dx$ |
| Transformation | Emphasis on changing/deforming objects | Linear transformation, Fourier transform |
| Kernel | Function of two arguments used in integral operators or ML | RBF kernel $K(x, y) = \exp(-\|x-y\|^2/2\sigma^2)$ |
| Distribution | Generalised function (Schwartz); or probability function | Dirac delta $\delta(x)$; Gaussian PDF |

The key insight is that all of these are instances of the same concept. A linear map is a function that happens to satisfy linearity. An operator is a function whose domain is a function space. A functional is an operator that happens to return scalars. Understanding that these are all functions - with the same formal definition, the same composition rules, the same invertibility questions - unifies vast swaths of mathematics under a single framework.

1.5 Historical Timeline

The concept of a function evolved over centuries, from implicit geometric constructions to the fully abstract modern definition:

| Date | Contributor | Development |
|---|---|---|
| ~300 BCE | Euclid | Geometric constructions implicitly define functions (compass-and-straightedge); proportionality as functional relationship |
| 1694 | Leibniz | First explicit use of the word "function" (functio) in a mathematical context; described quantities depending on a curve |
| 1748 | Euler | Function as analytic expression; introduced the $f(x)$ notation; classified functions as algebraic or transcendental |
| 1837 | Dirichlet | Modern definition: function as arbitrary rule assigning outputs to inputs, not necessarily given by a formula. The Dirichlet function $D(x) = 1$ if $x \in \mathbb{Q}$, $0$ otherwise, has no formula but is a valid function |
| 1870s | Cantor | Function as set of ordered pairs; foundation in set theory; functions between infinite sets |
| 1879 | Frege | Functions in formal logic; predicate as function mapping objects to truth values |
| 1888 | Dedekind | Morphism concept; structure-preserving maps between number systems |
| 1900s | Hilbert | Operator theory; functions on infinite-dimensional spaces; functional analysis foundations |
| 1920s | Banach | Normed function spaces (Banach spaces); fixed-point theorem; foundation of functional analysis |
| 1936 | Church | Lambda calculus: functions as fundamental computational objects; every computation is function evaluation |
| 1945 | Eilenberg & Mac Lane | Category theory: morphisms (functions) as primary objects; composition as fundamental operation |
| 1960s-1980s | Various | Functional programming languages (ML, Haskell, Lisp); functions as first-class values in computation |
| 1986 | Rumelhart, Hinton, Williams | Backpropagation popularised: neural network as differentiable function composition; chain rule as algorithm |
| 1989 | Cybenko, Hornik | Universal Approximation Theorem: neural networks can approximate any continuous function |
| 2017 | Vaswani et al. | Transformer architecture: attention as specific function class; self-attention as permutation-equivariant function |
| 2020-2026 | Scaling era | Functions at unprecedented scale: GPT-4, LLaMA, Gemini - trillions of parameters defining single functions from text to text |

The trend is clear: the concept of a function has become increasingly abstract and increasingly central. From Euler's "formula in $x$" to Dirichlet's "arbitrary rule" to Church's "computational object" to modern AI's "parameterised differentiable map" - each step expanded what counts as a function and what we can do with functions.

1.6 Pipeline Position in AI

Every stage of the LLM pipeline is a function. Here is the complete chain:

Input Data (text, tokens, images)
    |  [Tokeniser T: Σ* -> V*]
    v
Token Sequences (discrete integer IDs)
    |  [Embedding E: V^n -> R^{n×d}]
    v
Embedding Matrices (continuous vectors per token)
    |  [Positional encoding PE: R^{n×d} -> R^{n×d}]
    v
Position-Aware Embeddings
    |  [Transformer layer 1: f_1: R^{n×d} -> R^{n×d}]
    |  [Transformer layer 2: f_2: R^{n×d} -> R^{n×d}]
    |  [...]
    |  [Transformer layer L: f_L: R^{n×d} -> R^{n×d}]
    v
Contextual Representations (hidden states)
    |  [Layer norm LN: R^{n×d} -> R^{n×d}]
    v
Normalised Representations
    |  [LM head (linear projection) W: R^d -> R^{|V|}]
    v
Logits (unnormalised scores per vocabulary token)
    |  [Softmax σ: R^{|V|} -> Δ^{|V|-1}]
    v
Probability Distribution (simplex)
    |  [Sampling: sample: Δ^{|V|-1} -> V]
    v
Output Token (single discrete token)

Functions everywhere. The entire pipeline is one composed function:
sample ∘ σ ∘ W ∘ LN ∘ f_L ∘ ... ∘ f_1 ∘ PE ∘ E ∘ T

Each arrow in this pipeline is a function with specific mathematical properties:

  • T (tokeniser): discrete, non-differentiable, deterministic
  • E (embedding): linear (lookup is matrix multiply with one-hot), differentiable
  • PE (positional encoding): additive, fixed or learned, differentiable
  • f_1 through f_L (transformer layers): nonlinear, differentiable, permutation-equivariant, residual
  • LN (layer norm): nonlinear, differentiable, scale-invariant
  • W (LM head): linear, differentiable
  • \sigma (softmax): nonlinear, differentiable, maps to probability simplex
  • sample: stochastic (not a deterministic function in the classical sense; involves randomness)

Training optimises the parameters of the differentiable functions (E, f_1 through f_L, W) to minimise a loss function over the training data. The loss function itself is a map $L: \Theta \to \mathbb{R}$ from the entire parameter vector to a single scalar.
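To make this concrete, here is a minimal sketch of the head of that chain in NumPy - toy dimensions, random weights, and a simplified sinusoidal encoding, not a real model - showing that the pipeline really is just nested function application:

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, d = 50, 4, 8                           # toy vocabulary, sequence length, model dim

E = rng.normal(size=(V, d))                  # embedding table
W_head = rng.normal(size=(d, V))             # LM head weights

def embed(token_ids):                        # E: V^n -> R^{n x d}
    return E[token_ids]

def pos_encode(X):                           # PE: additive sinusoidal encoding
    pos = np.arange(X.shape[0])[:, None]
    i = np.arange(X.shape[1])[None, :]
    angles = pos / 10000 ** (2 * (i // 2) / X.shape[1])
    return X + np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def lm_head(X):                              # W: R^d -> R^{|V|}, applied per position
    return X @ W_head

def softmax(z):                              # sigma: maps into the open simplex
    z = z - z.max(axis=-1, keepdims=True)    # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# The pipeline is literally a composition: softmax(W(PE(E(tokens))))
tokens = np.array([3, 17, 42, 7])
probs = softmax(lm_head(pos_encode(embed(tokens))))
print(probs.shape, probs.sum(axis=-1))       # (4, 50), each row sums to 1
```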


2. Formal Definitions

2.1 Function as a Set of Ordered Pairs

Definition. A function from set $A$ to set $B$, written $f: A \to B$, is a subset $f \subseteq A \times B$ satisfying two conditions:

  1. Totality (every input has an output): For every $a \in A$, there exists some $b \in B$ such that $(a, b) \in f$.
  2. Uniqueness (single-valued): If $(a, b_1) \in f$ and $(a, b_2) \in f$, then $b_1 = b_2$.

Totality says the function is defined everywhere on its domain. Uniqueness says the function does not assign two different outputs to the same input. Together they guarantee: for each $a \in A$, there is exactly one $b \in B$ such that $(a, b) \in f$. We write this unique $b$ as $f(a)$.

Example. Consider $A = \{1, 2, 3\}$ and $B = \{a, b, c, d\}$. Then (checked mechanically in the sketch below):

  • $f = \{(1, a), (2, c), (3, c)\}$ is a valid function (total, single-valued). Note: two inputs map to the same output ($c$) - this is allowed.
  • $g = \{(1, a), (3, b)\}$ is NOT a function (not total: input 2 has no output).
  • $h = \{(1, a), (1, b), (2, c), (3, d)\}$ is NOT a function (not single-valued: input 1 maps to both $a$ and $b$).
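The two conditions can be tested directly. A small sketch (the helper `is_function` is ours, not a standard library routine):

```python
def is_function(pairs, domain):
    """Check whether a set of ordered pairs defines a function on `domain`."""
    firsts = [a for (a, _) in pairs]
    total = all(a in firsts for a in domain)   # totality: every input covered
    unique = len(firsts) == len(set(firsts))   # uniqueness: no input repeated
    return total and unique

A = {1, 2, 3}
f = {(1, 'a'), (2, 'c'), (3, 'c')}            # valid: total and single-valued
g = {(1, 'a'), (3, 'b')}                      # not total: input 2 has no output
h = {(1, 'a'), (1, 'b'), (2, 'c'), (3, 'd')}  # not single-valued at input 1

print(is_function(f, A), is_function(g, A), is_function(h, A))  # True False False
```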

Relation vs function. A relation from $A$ to $B$ is any subset of $A \times B$, without the totality or uniqueness requirements. Every function is a relation, but not every relation is a function. A relation that satisfies uniqueness but not totality is called a partial function.

   Relations ⊇ Partial Functions ⊇ Functions ⊇ Bijections
   (any subset    (unique but      (unique and    (unique, total,
    of A × B)      not total)       total)         one-to-one, onto)

For AI. When we write a neural network layer $f_l(x) = \sigma(Wx + b)$, we are defining a function from $\mathbb{R}^{d_{\text{in}}}$ to $\mathbb{R}^{d_{\text{out}}}$. Totality is guaranteed: for any input vector $x$, the matrix multiplication $Wx + b$ is defined, and $\sigma$ is defined on all of $\mathbb{R}$. Uniqueness is guaranteed: matrix multiplication is deterministic. During training with dropout, the computation is stochastic - but we model this as sampling a different function at each step, not as a single multi-valued function.

2.2 Domain, Codomain, and Range

Three fundamental sets are associated with every function $f: A \to B$:

| Concept | Symbol | Definition | Example: $f(x) = x^2$, $f: \mathbb{R} \to \mathbb{R}$ |
|---|---|---|---|
| Domain | $\text{dom}(f) = A$ | The set of all valid inputs | $\mathbb{R}$ (all real numbers) |
| Codomain | $\text{cod}(f) = B$ | The "target" set where outputs live | $\mathbb{R}$ (all real numbers) |
| Range (Image) | $\text{range}(f) = f(A)$ | The set of outputs actually produced | $[0, \infty)$ (non-negative reals) |

The distinction between codomain and range is subtle but crucial:

  • The codomain is part of the function's definition - it is the set where outputs are declared to live.
  • The range is a property of the function - it is the set of outputs that are actually achieved.
  • Always: $\text{range}(f) \subseteq \text{cod}(f)$.
  • A function is surjective (onto) if and only if $\text{range}(f) = \text{cod}(f)$.

Why the distinction matters for AI. Consider a neural network's output layer:

  • Domain: $\mathbb{R}^d$ (any hidden representation is a valid input).
  • Codomain: $\mathbb{R}^{|V|}$ (the output space has dimension equal to vocabulary size).
  • Range: typically a strict subset of $\mathbb{R}^{|V|}$ - not every logit vector will actually be produced.

When we apply softmax, the codomain becomes the probability simplex $\Delta^{|V|-1} = \{p \in \mathbb{R}^{|V|} : p_i \geq 0, \sum_i p_i = 1\}$. The range is a strict subset: softmax can never output exactly 0 or exactly 1 for any component, because $e^z > 0$ for all finite $z$.

Domain (R^d)                    Codomain (R^{|V|})
+-------------+               +-----------------+
|             |               |                 |
|  All hidden |---f(x)------->|  All possible   |
|  states     |               |  logit vectors  |
|             |               |                 |
|             |               | +-------------+ |
|             |               | | Range:      | |
|             |               | | Actually    | |
|             |               | | produced    | |
|             |               | | logits      | |
|             |               | +-------------+ |
+-------------+               +-----------------+
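The boundary claim is easy to see numerically. A small sketch (note that for extreme logits, floating-point rounding can saturate a component to exactly 1.0, even though the mathematical value never reaches it):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # stabilised softmax
    return e / e.sum()

p = softmax(np.array([5.0, 0.0, -5.0]))
print(p)                          # close to [1, 0, 0] but...
print(p.min() > 0, p.max() < 1)   # True True: the range excludes the boundary
```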

2.3 Image and Preimage

Image. For a function $f: A \to B$ and a subset $S \subseteq A$, the image of $S$ under $f$ is:

$$f(S) = \{f(a) : a \in S\} = \{b \in B : \exists a \in S,\ f(a) = b\}$$

Preimage (Inverse Image). For a subset $T \subseteq B$, the preimage of $T$ under $f$ is:

$$f^{-1}(T) = \{a \in A : f(a) \in T\}$$

Critical note: The preimage notation $f^{-1}(T)$ does NOT require $f$ to have an inverse function. Even for non-invertible functions, the preimage of a set is always well-defined.

Properties of images and preimages:

| Property | Image | Preimage |
|---|---|---|
| Union | $f(A_1 \cup A_2) = f(A_1) \cup f(A_2)$ ✓ | $f^{-1}(B_1 \cup B_2) = f^{-1}(B_1) \cup f^{-1}(B_2)$ ✓ |
| Intersection | $f(A_1 \cap A_2) \subseteq f(A_1) \cap f(A_2)$ ($\subseteq$ only!) | $f^{-1}(B_1 \cap B_2) = f^{-1}(B_1) \cap f^{-1}(B_2)$ ✓ |
| Complement | No simple rule | $f^{-1}(B \setminus T) = A \setminus f^{-1}(T)$ ✓ |
| Subset | $S_1 \subseteq S_2 \implies f(S_1) \subseteq f(S_2)$ | $T_1 \subseteq T_2 \implies f^{-1}(T_1) \subseteq f^{-1}(T_2)$ |

The key asymmetry. Preimage preserves ALL set operations (unions, intersections, complements). Image only preserves unions. Image preserves intersections if and only if $f$ is injective.

Proof that image does not preserve intersection in general.

Let $f(x) = x^2$, $A_1 = \{-1, 2\}$, $A_2 = \{1, 3\}$.

  • $A_1 \cap A_2 = \emptyset$, so $f(A_1 \cap A_2) = \emptyset$.
  • $f(A_1) = \{1, 4\}$, $f(A_2) = \{1, 9\}$, so $f(A_1) \cap f(A_2) = \{1\}$.
  • $\emptyset \neq \{1\}$, so $f(A_1 \cap A_2) \subsetneq f(A_1) \cap f(A_2)$. QED
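Images and preimages translate directly into code. The sketch below reproduces the counterexample and shows that preimages need no inverse function:

```python
def image(f, S):
    return {f(a) for a in S}

def preimage(f, T, domain):
    return {a for a in domain if f(a) in T}

f = lambda x: x ** 2
A1, A2 = {-1, 2}, {1, 3}

print(image(f, A1 & A2))              # set(): f(A1 ∩ A2) is empty
print(image(f, A1) & image(f, A2))    # {1}: strictly larger than the image above

domain = set(range(-3, 4))
print(preimage(f, {1, 4}, domain))    # {-2, -1, 1, 2}: no inverse of f required
```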

For AI. Preimages appear naturally in classification. The preimage $f^{-1}(\{\text{class } k\})$ is the decision region for class $k$ - the set of all inputs that the classifier assigns to class $k$. Understanding the geometry of these preimages (connected? convex? simply connected?) is central to understanding what the classifier has learned.

2.4 Equality of Functions

Definition. Two functions $f: A \to B$ and $g: A \to B$ are equal, written $f = g$, if and only if:

  1. They have the same domain: $\text{dom}(f) = \text{dom}(g)$
  2. They have the same codomain: $\text{cod}(f) = \text{cod}(g)$
  3. They agree on all inputs: $f(a) = g(a)$ for all $a \in A$

This is the principle of extensionality: two functions are equal if and only if they produce the same output for every input. How they compute their output is irrelevant - only the input-output mapping matters.

Example. The functions:

  • $f(x) = (x+1)^2 - 2x - 1$
  • $g(x) = x^2$

are equal as functions $\mathbb{R} \to \mathbb{R}$, because $(x+1)^2 - 2x - 1 = x^2 + 2x + 1 - 2x - 1 = x^2$ for all $x$.

For AI. Two neural networks with completely different weights can compute the same function (e.g., by permuting neurons - see Section 11.4). Conversely, two architectures that "look" the same but have different weights compute different functions. Function equality is about the input-output mapping, not the implementation.

2.5 Partial Functions

A partial function $f: A \rightharpoonup B$ satisfies uniqueness but not necessarily totality. There may be elements of $A$ for which $f$ is undefined.

The domain of definition: $\text{dom}(f) = \{a \in A : f(a) \text{ is defined}\} \subseteq A$.

Examples in mathematics:

  • Division: $f(x) = 1/x$ is partial on $\mathbb{R}$ - undefined at $x = 0$
  • Logarithm: $\log(x)$ is partial on $\mathbb{R}$ - undefined for $x \leq 0$
  • Square root: $\sqrt{x}$ is partial on $\mathbb{R}$ - undefined for $x < 0$

Examples in AI:

  • Masked attention: In causal (autoregressive) attention, token $i$ cannot attend to token $j > i$. The attention weight function is "undefined" for future positions - the logit is set to $-\infty$ before softmax, giving weight 0.
  • Sparse attention: Only a subset of positions are attended to; the attention function is partial.
  • Variable-length sequences: A function defined on sequences up to length $n$ is partial on the space of all sequences.

Making partial functions total. Two standard approaches:

  1. Restrict the domain: change $A$ to only those elements where $f$ is defined. $\sqrt{x}$ becomes total on $[0, \infty)$.
  2. Extend the codomain: add a special "undefined" value $\bot$ to $B$ and set $f(a) = \bot$ for undefined inputs.

In neural network implementations, the extension approach is standard: masked attention sets logits to -\infty (a stand-in for "undefined"), and after softmax, these become 0, contributing nothing to the weighted average.


3. Types of Functions

3.1 Injective Functions (One-to-One)

Definition. A function $f: A \to B$ is injective (or one-to-one) if different inputs always produce different outputs:

$$\forall a_1, a_2 \in A: \quad f(a_1) = f(a_2) \implies a_1 = a_2$$

Equivalently (contrapositive): $a_1 \neq a_2 \implies f(a_1) \neq f(a_2)$.

Intuition. An injective function preserves distinctness. No two inputs "collide" at the same output. If you see the output, you can determine which input produced it.

Injective:                    NOT injective:
  A        B                    A        B
  1 ----> a                    1 ----> a
  2 ----> b                    2 ----+
  3 ----> c                    3 ----+> b
           d (not hit)                  c (not hit)
  
Each output comes from         Two inputs map to
at most one input              the same output (b)

Proof technique for injectivity. Assume $f(a_1) = f(a_2)$ and derive $a_1 = a_2$.

Example 1. Prove that $f(x) = 2x + 3$ is injective.

Proof. Suppose $f(a_1) = f(a_2)$. Then $2a_1 + 3 = 2a_2 + 3$, so $2a_1 = 2a_2$, so $a_1 = a_2$. QED

Example 2. Prove that $f(x) = e^x$ is injective.

Proof. Suppose $e^{a_1} = e^{a_2}$. Taking $\ln$ of both sides: $a_1 = a_2$. QED

Example 3. Prove that $f(x) = x^2$ is NOT injective on $\mathbb{R}$.

Proof. Counterexample: $f(1) = 1 = f(-1)$ but $1 \neq -1$. QED

Proof technique to disprove injectivity: find two distinct inputs with the same output (a single counterexample suffices).

For AI. Injectivity means no information is lost. If a layer is injective, the original input can (in principle) be recovered from the output. Non-injective layers destroy information.

| Component | Injective? | Why |
|---|---|---|
| Token embedding $E$ | Yes | Each token maps to a unique vector |
| ReLU activation | No | All negative values map to 0 |
| Sigmoid activation | Yes | Strictly monotone increasing |
| Softmax ($\mathbb{R}^n \to \Delta^{n-1}$) | No | Adding a constant to all inputs gives the same output |
| Max pooling | No | Different inputs with the same max give the same output |
| Average pooling | No | Different inputs with the same mean give the same output |
| Invertible ResNet layers | Yes | Designed to be bijective |
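A quick numerical check of some rows of this table - a collision search, not a proof:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# ReLU collision: two distinct inputs, identical outputs -> not injective
print(relu(-3.0) == relu(-5.0))                    # True: sign information destroyed

# Softmax collision: shifting all logits by a constant changes nothing
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(z), softmax(z + 7.0)))   # True -> not injective

# Sigmoid is strictly increasing, hence injective: distinct inputs
# give distinct outputs on this grid
xs = np.linspace(-5, 5, 1001)
print(len(np.unique(sigmoid(xs))) == len(xs))      # True
```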

3.2 Surjective Functions (Onto)

Definition. A function $f: A \to B$ is surjective (or onto) if every element of the codomain is achieved:

$$\forall b \in B,\ \exists a \in A: \quad f(a) = b$$

Equivalently: $\text{range}(f) = B$. Every element of $B$ is "hit" by at least one input.

Surjective:                   NOT surjective:
  A        B                    A        B
  1 ----> a                    1 ----> a
  2 ---+                       2 ----> b
  3 ---+> b                    3 ----> c
  4 ----> c                             d <- never reached
  
Every output is hit            Element d in codomain
by at least one input          is never produced

Proof technique for surjectivity. Take an arbitrary $b \in B$ and explicitly find an $a \in A$ with $f(a) = b$.

Example 1. Prove that $f: \mathbb{R} \to \mathbb{R}$, $f(x) = 2x + 3$ is surjective.

Proof. Let $b \in \mathbb{R}$. We need $a$ with $f(a) = b$: solve $2a + 3 = b$ to get $a = (b-3)/2$. Since $b \in \mathbb{R}$, we have $a \in \mathbb{R}$, and $f(a) = 2 \cdot \frac{b-3}{2} + 3 = b$. QED

Example 2. Prove that $f(x) = x^2$ is NOT surjective as $f: \mathbb{R} \to \mathbb{R}$.

Proof. Take $b = -1$. For any $a \in \mathbb{R}$, $f(a) = a^2 \geq 0 > -1$, so no $a$ maps to $-1$. QED

Note on codomain choice. The same formula $f(x) = x^2$ gives:

  • $f: \mathbb{R} \to \mathbb{R}$ - NOT surjective (negative numbers not in range)
  • $f: \mathbb{R} \to [0, \infty)$ - surjective (every non-negative number is the square of something)

Changing the codomain changes whether the function is surjective.

For AI - surjectivity means the output space is fully utilised:

| Component | Surjective? | Why |
|---|---|---|
| Softmax onto open simplex | Yes | Any strictly positive distribution achievable |
| ReLU: $\mathbb{R} \to \mathbb{R}$ | No | Negative outputs never produced |
| ReLU: $\mathbb{R} \to [0, \infty)$ | Yes | Every non-negative value achievable |
| Tanh: $\mathbb{R} \to (-1, 1)$ | Yes | Approaches but never reaches $\pm 1$ |
| Linear $Wx$: $\mathbb{R}^n \to \mathbb{R}^m$ | Only if $\text{rank}(W) = m$ | Range is the column space of $W$ |

3.3 Bijective Functions (One-to-One and Onto)

Definition. A function $f: A \to B$ is bijective if it is both injective AND surjective. Every element of $B$ is the image of exactly one element of $A$.

Bijective (perfect pairing):
  A        B
  1 ----> a
  2 ----> b
  3 ----> c
  
  Every element of A maps to a unique element of B.
  Every element of B is hit exactly once.
  No information created, no information destroyed.

Key property. A function has a two-sided inverse $f^{-1}: B \to A$ if and only if it is bijective.

Bijections and cardinality. A bijection $f: A \to B$ proves that $|A| = |B|$ (same cardinality). This is the formal definition of "same size" for sets, including infinite sets. For example, $f(n) = 2n$ is a bijection $\mathbb{N} \to \{0, 2, 4, 6, \ldots\}$, proving that the natural numbers and the even numbers have the same cardinality - a result that seems paradoxical until you see the bijection.

For AI - where bijections are critical:

| Application | How Bijection Is Used |
|---|---|
| Normalising flows | Chain of bijections transforms a simple distribution into a complex one. Change of variables: $p_Y(y) = p_X(f^{-1}(y)) \cdot \bigl\lvert \det \frac{\partial f^{-1}}{\partial y} \bigr\rvert$ |
| Reversible architectures | Invertible ResNets (i-RevNet, Glow) save memory by recomputing activations during backprop |
| Permutation matrices | Reordering tokens is a bijection on indices; attention is permutation-equivariant |
| Tokeniser / Detokeniser | Bijection between token IDs and token strings (within vocabulary) |
| Encryption | Encryption and decryption are inverse bijections; information-preserving |

3.4 Summary Table

| Property | Condition | Injective? | Surjective? | Inverse type | Example |
|---|---|---|---|---|---|
| Injective only | $f(a_1) = f(a_2) \Rightarrow a_1 = a_2$ | ✓ | ✗ | Left inverse | $f(x) = e^x$, $\mathbb{R} \to \mathbb{R}$ |
| Surjective only | $\text{range}(f) = B$ | ✗ | ✓ | Right inverse | $f(x) = x^2$, $\mathbb{R} \to [0,\infty)$ |
| Bijective | Both | ✓ | ✓ | Two-sided | $f(x) = 2x + 1$, $\mathbb{R} \to \mathbb{R}$ |
| Neither | Neither | ✗ | ✗ | None | $f(x) = \sin(x)$, $\mathbb{R} \to [-2, 2]$ |

3.5 Monotone Functions

Definition. A function $f: \mathbb{R} \to \mathbb{R}$ is:

  • Strictly increasing: $x_1 < x_2 \implies f(x_1) < f(x_2)$
  • Non-decreasing (weakly increasing): $x_1 < x_2 \implies f(x_1) \leq f(x_2)$
  • Strictly decreasing: $x_1 < x_2 \implies f(x_1) > f(x_2)$
  • Non-increasing (weakly decreasing): $x_1 < x_2 \implies f(x_1) \geq f(x_2)$

A function is monotone if it is either non-decreasing or non-increasing.

Key theorem. Every strictly monotone function is injective.

Proof. Suppose $f$ is strictly increasing and $f(a_1) = f(a_2)$. If $a_1 < a_2$, then $f(a_1) < f(a_2)$ - contradiction. If $a_1 > a_2$, then $f(a_1) > f(a_2)$ - contradiction. So $a_1 = a_2$. QED

Monotonicity in AI:

| Function | Monotone? | Where | Consequence |
|---|---|---|---|
| Sigmoid $\sigma(x)$ | Strictly increasing | On all $\mathbb{R}$ | Injective; preserves input order |
| Tanh | Strictly increasing | On all $\mathbb{R}$ | Injective; preserves order |
| ReLU | Non-decreasing (not strict) | Flat for $x < 0$ | Not injective; loses sign info |
| GELU | NOT monotone | Dip (minimum $\approx -0.17$) near $x \approx -0.75$ | Not order-preserving |
| Leaky ReLU ($\alpha > 0$) | Strictly increasing | On all $\mathbb{R}$ | Injective; all info preserved |
| SiLU/Swish $x\sigma(x)$ | NOT monotone | Dip (minimum $\approx -0.28$) near $x \approx -1.28$ | Similar to GELU |
| Cross-entropy $L(p)$ | Strictly decreasing in $p$ | $p$ = correct-class probability | Higher confidence -> lower loss |
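A numerical monotonicity check on a grid (a sketch, not a proof; `gelu` here uses the exact erf form) confirms the table and locates the GELU dip:

```python
import numpy as np
from math import erf

def is_nondecreasing(f, xs):
    """Grid-based check: no consecutive output ever decreases."""
    return bool(np.all(np.diff(f(xs)) >= 0))

gelu = lambda x: x * 0.5 * (1 + np.vectorize(erf)(x / np.sqrt(2)))
silu = lambda x: x / (1 + np.exp(-x))
relu = lambda x: np.maximum(x, 0)

xs = np.linspace(-6, 6, 10001)
print(is_nondecreasing(relu, xs))   # True  (weakly increasing)
print(is_nondecreasing(gelu, xs))   # False (non-monotone dip)
print(is_nondecreasing(silu, xs))   # False

# Locate GELU's minimum: value ~ -0.17 at x ~ -0.75
y = gelu(xs)
i = int(np.argmin(y))
print(round(xs[i], 2), round(y[i], 2))   # -0.75 -0.17
```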

Why monotonicity matters for optimisation. If the loss function is monotone in some parameter direction, that direction has no local minima - the loss either always decreases or always increases. Non-monotone loss landscapes create local minima and saddle points that can trap gradient descent.

3.6 Periodic Functions

Definition. A function $f: \mathbb{R} \to \mathbb{R}$ is periodic with period $T > 0$ if:

$$f(x + T) = f(x) \quad \text{for all } x \in \mathbb{R}$$

The smallest such $T$ is the fundamental period.

Classical examples:

  • $\sin(x)$ and $\cos(x)$: period $2\pi$
  • $\tan(x)$: period $\pi$
  • $e^{ix}$: period $2\pi$ (Euler's formula)

Key property. Periodic functions on $\mathbb{R}$ are never injective (since $f(x) = f(x + T)$ with $x \neq x + T$). However, they are injective when restricted to one period $[0, T)$.

Periodic functions in modern AI:

1. Sinusoidal Positional Encoding (Vaswani et al. 2017):

$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \quad \text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

Each dimension is a periodic function of position with a different frequency. Low dimensions have short periods (capture local position); high dimensions have long periods (capture global position). The periods range from $2\pi$ to $2\pi \cdot 10000$.

Why sinusoidal? For any fixed offset $k$, $\text{PE}(\text{pos} + k)$ can be written as a linear function of $\text{PE}(\text{pos})$:

$$\begin{pmatrix} \sin(\omega(\text{pos}+k)) \\ \cos(\omega(\text{pos}+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{pmatrix} \begin{pmatrix} \sin(\omega \cdot \text{pos}) \\ \cos(\omega \cdot \text{pos}) \end{pmatrix}$$

This means relative position information is accessible via a linear transformation - the model can learn to attend to "3 positions back" using a simple linear projection.
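The rotation identity is easy to verify numerically for a single frequency; in this sketch `omega` and `k` are arbitrary choices:

```python
import numpy as np

# The shift matrix depends on the offset k and frequency omega,
# but NOT on the position pos - that is what makes the map linear.
omega, k = 0.37, 3
R = np.array([[np.cos(omega * k),  np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for pos in [0, 1, 5, 11]:
    pe_pos   = np.array([np.sin(omega * pos), np.cos(omega * pos)])
    pe_shift = np.array([np.sin(omega * (pos + k)), np.cos(omega * (pos + k))])
    assert np.allclose(R @ pe_pos, pe_shift)
print("relative shift is the same linear map at every position")
```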

2. Rotary Position Embedding (RoPE, Su et al. 2021):

RoPE rotates the query and key vectors by position-dependent angles:

$$f_{\text{RoPE}}(x_m, m) = R_{\Theta, m} \cdot x_m$$

where $R_{\Theta, m}$ is a block-diagonal rotation matrix with angles $\theta_i \cdot m$. Each block rotates by angle $\theta_i \cdot m$ - a periodic function of position with period $2\pi/\theta_i$.

Key property of RoPE: The dot product $\langle f_{\text{RoPE}}(q, m), f_{\text{RoPE}}(k, n) \rangle$ depends only on $q$, $k$, and the relative position $m - n$ (not the absolute positions). This gives the model a natural notion of relative distance.
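A minimal sketch of this relative-position property for a single 2-D block (a toy version of RoPE with one hypothetical frequency `theta`, not a full implementation):

```python
import numpy as np

def rot(angle):
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

def rope(x, pos, theta=0.5):       # rotate one 2-D block by pos * theta
    return rot(pos * theta) @ x

rng = np.random.default_rng(1)
q, k = rng.normal(size=2), rng.normal(size=2)

# <rope(q, m), rope(k, n)> depends only on m - n:
d1 = rope(q, 10) @ rope(k, 7)      # m - n = 3
d2 = rope(q, 103) @ rope(k, 100)   # m - n = 3, different absolute positions
print(np.allclose(d1, d2))         # True
```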

3. Fourier features in NeRF (Neural Radiance Fields):

$$\gamma(p) = \big(\sin(2^0 \pi p), \cos(2^0 \pi p), \sin(2^1 \pi p), \cos(2^1 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\big)$$

Periodic functions at multiple frequencies encode position information at multiple scales, enabling the network to represent high-frequency details.


4. Function Composition

4.1 Definition and Notation

Definition. Given functions $f: A \to B$ and $g: B \to C$, the composition of $g$ with $f$ is the function:

$$(g \circ f): A \to C, \quad (g \circ f)(x) = g(f(x))$$

Read: "g composed with f" or "g after f". The function $f$ is applied first, then $g$. Note the order: in $g \circ f$, the rightmost function ($f$) is applied first.

Requirement. The composition $g \circ f$ is only defined when $\text{range}(f) \subseteq \text{dom}(g)$, i.e., the outputs of $f$ must be valid inputs for $g$.

     f            g
A ------> B ------> C

     g \circ f
A ----------------> C

(g \circ f)(x) = g(f(x))

For AI. This requirement means the output dimension of layer $l$ must match the input dimension of layer $l+1$. Dimension mismatches between layers are among the most common neural network implementation bugs.
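In code, composition is a one-liner. The following sketch defines a generic `compose` helper (our own, not a standard library function) with the same right-to-left convention:

```python
from functools import reduce

def compose(*fs):
    """compose(g, f)(x) == g(f(x)): the rightmost function is applied first."""
    return reduce(lambda g, f: lambda x: g(f(x)), fs)

f = lambda x: x + 1        # A -> B
g = lambda x: x ** 2       # B -> C
h = compose(g, f)          # A -> C
print(h(3))                # g(f(3)) = (3 + 1)^2 = 16
```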

4.2 Associativity of Composition

Theorem. Function composition is associative: for $f: A \to B$, $g: B \to C$, $h: C \to D$:

$$(h \circ g) \circ f = h \circ (g \circ f)$$

Proof. For any $x \in A$:

$$((h \circ g) \circ f)(x) = (h \circ g)(f(x)) = h(g(f(x)))$$
$$(h \circ (g \circ f))(x) = h((g \circ f)(x)) = h(g(f(x)))$$

Both sides equal $h(g(f(x)))$. Since this holds for all $x \in A$, the functions are equal by extensionality. QED

Consequence. We can write $h \circ g \circ f$ without ambiguity - no parentheses needed.

For AI. Associativity means we can group layers however we want. A 96-layer transformer can be viewed as:

  • 96 individual layers
  • Two groups of 48 layers
  • Three groups of 32 layers

The computed function is the same regardless of grouping. This is why we can meaningfully talk about "the first 10 layers" or "the last 3 layers" as coherent sub-computations. It also allows pipeline parallelism: split a model across GPUs at any layer boundary, and the overall function is unchanged.

4.3 The Identity Function

Definition. The identity function on set $A$ is:

$$\text{id}_A: A \to A, \quad \text{id}_A(x) = x$$

Properties:

  • $f \circ \text{id}_A = f$ for any $f: A \to B$ (right identity)
  • $\text{id}_B \circ f = f$ for any $f: A \to B$ (left identity)

Proof of right identity. For any $x \in A$: $(f \circ \text{id}_A)(x) = f(\text{id}_A(x)) = f(x)$. QED

For AI - Residual Connections. The identity function appears in residual connections (He et al. 2016):

$$\text{ResBlock}(x) = x + F(x) = \text{id}(x) + F(x)$$

The output is the input plus a learned residual. If $F(x) \approx 0$, the block approximates the identity. This design allows the network to learn modifications to the identity, rather than learning the entire transformation from scratch.

Why this helps gradient flow: The Jacobian of a residual block is:

$$\frac{\partial}{\partial x}\big(x + F(x)\big) = I + \frac{\partial F}{\partial x}$$

The identity term $I$ ensures gradients always flow through, even if $\frac{\partial F}{\partial x}$ is small. Without residual connections, gradients must pass through every layer multiplicatively - if any layer's Jacobian has small singular values, gradients vanish. With residual connections, there is always a "skip path" carrying the identity gradient.

Pre-LN Transformer block (GPT-2, LLaMA, etc.):

$$\text{Block}(x) = x + \text{FFN}(\text{LN}(x + \text{Attn}(\text{LN}(x))))$$

Two residual connections per block: one around attention, one around the FFN. Each is $\text{id} + \text{learnable function}$.
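A finite-difference sketch confirms the Jacobian structure: even when the residual branch $F$ is nearly zero, the block's Jacobian stays close to the identity, so gradients survive. All names and sizes here are toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 0.01 * rng.normal(size=(d, d))      # deliberately tiny residual branch

F = lambda x: np.tanh(W @ x)            # learned residual function
res_block = lambda x: x + F(x)          # ResBlock(x) = x + F(x)

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian by central differences."""
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = rng.normal(size=d)
J_F, J_res = jacobian(F, x), jacobian(res_block, x)
print(np.allclose(J_res, np.eye(d) + J_F, atol=1e-4))  # True: J = I + dF/dx
# The residual branch alone is nearly singular, but the skip path keeps
# the block's smallest singular value near 1:
print(np.linalg.svd(J_F, compute_uv=False).min())      # ~ 0 (tiny)
print(np.linalg.svd(J_res, compute_uv=False).min())    # ~ 1 (identity path)
```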

4.4 Composition and Function Properties

Theorem. Composition preserves injectivity, surjectivity, and bijectivity:

| If $f$ is... | and $g$ is... | then $g \circ f$ is... | Proof sketch |
|---|---|---|---|
| Injective | Injective | Injective | $g(f(a_1)) = g(f(a_2)) \Rightarrow f(a_1) = f(a_2) \Rightarrow a_1 = a_2$ |
| Surjective | Surjective | Surjective | For $c \in C$: $g$ surjective gives $b$ with $g(b) = c$; $f$ surjective gives $a$ with $f(a) = b$ |
| Bijective | Bijective | Bijective | Combine both rows above |

Full proof of injectivity preservation. Suppose $f$ and $g$ are both injective. Let $(g \circ f)(a_1) = (g \circ f)(a_2)$, i.e., $g(f(a_1)) = g(f(a_2))$. Since $g$ is injective, $f(a_1) = f(a_2)$. Since $f$ is injective, $a_1 = a_2$. QED

Converse results (partial converses):

  • If $g \circ f$ is injective, then $f$ must be injective (but $g$ need not be).
  • If $g \circ f$ is surjective, then $g$ must be surjective (but $f$ need not be).

Proof that $g \circ f$ injective implies $f$ injective. Suppose $f(a_1) = f(a_2)$. Then $g(f(a_1)) = g(f(a_2))$, i.e., $(g \circ f)(a_1) = (g \circ f)(a_2)$. Since $g \circ f$ is injective, $a_1 = a_2$. QED

For AI. If every layer is injective, the entire network is injective (no information loss). If any single layer is non-injective, the composition is non-injective. Since ReLU is non-injective (maps all negatives to 0), standard ReLU networks are non-injective - they destroy information about the sign of pre-activations.

4.5 Neural Networks as Function Composition

A deep neural network is exactly a composition of simpler functions. For a transformer with $L$ layers:

$$f_{\text{transformer}} = \text{LM\_head} \circ \text{LN} \circ f_L \circ f_{L-1} \circ \cdots \circ f_1 \circ \text{PE} \circ \text{Embed}$$

where each transformer layer $f_l$ is itself a composition:

$$f_l(x) = x + \text{FFN}\big(\text{LN}_2(x + \text{Attn}(\text{LN}_1(x)))\big)$$

expanding further:

$$\text{FFN}(x) = W_2 \cdot \sigma(W_1 x + b_1) + b_2, \qquad \text{Attn}(X) = \text{softmax}\!\left(\frac{XW_Q(XW_K)^T}{\sqrt{d_k}}\right)XW_V \cdot W_O$$

The total function is a composition of $O(L)$ sub-functions, each parameterised by learnable weights. All are differentiable (except sampling at inference), enabling end-to-end gradient-based training via the chain rule.

Counting the depth. A single transformer layer involves roughly:

  • 2 layer norms (each: subtract mean, divide by std, scale, shift)
  • 4 linear projections (Q, K, V, O)
  • 1 softmax
  • 2 more linear projections (FFN up and down)
  • 1 activation function
  • 2 residual additions

That is approximately 12 elementary functions per transformer layer. A 96-layer model composes $96 \times 12 \approx 1{,}152$ elementary functions. The fact that gradient-based optimisation works over such deep compositions is remarkable, and depends critically on residual connections and careful normalisation.

4.6 Non-Commutativity of Composition

Composition is NOT commutative. In general, $f \circ g \neq g \circ f$.

Example. Let $f(x) = x^2$ and $g(x) = x + 1$:

  • $(f \circ g)(x) = f(g(x)) = f(x+1) = (x+1)^2 = x^2 + 2x + 1$
  • $(g \circ f)(x) = g(f(x)) = g(x^2) = x^2 + 1$

These are different: $(f \circ g)(2) = 9$ but $(g \circ f)(2) = 5$.

When does composition commute? Two functions $f$ and $g$ commute (i.e., $f \circ g = g \circ f$) in special cases:

  • $f = \text{id}$ or $g = \text{id}$ (trivially)
  • $f$ and $g$ are both powers of the same function: $f^{(m)} \circ f^{(n)} = f^{(n)} \circ f^{(m)} = f^{(m+n)}$
  • $f$ and $g$ are both linear: if $f(x) = Ax$ and $g(x) = Bx$, then $f \circ g = g \circ f$ iff $AB = BA$ (the matrices commute)
  • Diagonal matrices always commute with each other

For AI. The order of operations matters enormously:

| Ordering | Effect |
|---|---|
| Pre-LN: $\text{LN} \to \text{Attn}$ | More stable training; used in GPT-2, LLaMA, most modern LLMs |
| Post-LN: $\text{Attn} \to \text{LN}$ | Original Transformer; training can be unstable without warmup |
| BatchNorm before activation | Standard in CNNs |
| BatchNorm after activation | Different normalisation statistics; rarely used |
| Dropout before softmax | Drops attention logits; called "attention dropout" |
| Dropout after softmax | Drops attention weights; different regularisation effect |

The shift from Post-LN to Pre-LN significantly improved training stability for deep transformers - same components, different composition order. This is a direct consequence of non-commutativity.

4.7 Iterated Composition and Fixed Points

Definition. The $n$-fold composition of $f$ with itself:

$$f^{(n)} = \underbrace{f \circ f \circ \cdots \circ f}_{n \text{ times}}$$

with $f^{(0)} = \text{id}$.

Fixed point. A point $x^*$ is a fixed point of $f$ if $f(x^*) = x^*$. Fixed points are invariant under iteration: $f^{(n)}(x^*) = x^*$ for all $n$.

Banach Fixed Point Theorem (Contraction Mapping Theorem). Let $(X, d)$ be a complete metric space and $f: X \to X$ a contraction, meaning there exists $L \in [0, 1)$ such that:

$$d(f(x), f(y)) \leq L \cdot d(x, y) \quad \text{for all } x, y \in X$$

Then:

  1. $f$ has a unique fixed point $x^* \in X$
  2. For any starting point $x_0 \in X$, the sequence $x_{n+1} = f(x_n)$ converges to $x^*$
  3. Rate of convergence: $d(x_n, x^*) \leq \frac{L^n}{1-L} \cdot d(x_0, x_1)$

Proof sketch. The sequence $x_0, f(x_0), f(f(x_0)), \ldots$ is Cauchy because $d(x_{n+1}, x_n) \leq L \cdot d(x_n, x_{n-1}) \leq L^n \cdot d(x_1, x_0) \to 0$. Since $X$ is complete, it converges to some $x^*$. By continuity of $f$: $f(x^*) = f(\lim x_n) = \lim f(x_n) = \lim x_{n+1} = x^*$. Uniqueness: if $f(x^*) = x^*$ and $f(y^*) = y^*$, then $d(x^*, y^*) = d(f(x^*), f(y^*)) \leq L \cdot d(x^*, y^*)$, so $(1-L) \cdot d(x^*, y^*) \leq 0$, giving $d(x^*, y^*) = 0$. QED

For AI: Deep Equilibrium Models (DEQ, Bai et al. 2019). Instead of stacking $L$ copies of a layer (using $O(L)$ memory for activations), DEQ finds the fixed point of a single layer:

$$z^* = f_\theta(z^*, x), \quad \text{where } z^* = \lim_{n \to \infty} f_\theta^{(n)}(z_0, x)$$

This is equivalent to an infinitely deep weight-tied network that has converged.

| Property | Standard Deep Network | DEQ |
|---|---|---|
| Memory | $O(L)$ - store all activations | $O(1)$ - store only $z^*$ |
| Computation | $L$ forward passes | Iterate until convergence |
| Backprop | Through all $L$ layers | Implicit differentiation at the fixed point |
| Convergence guarantee | N/A | Guaranteed if $f_\theta$ is a contraction |

The Banach theorem guarantees convergence if the layer is a contraction - which can be enforced via spectral normalisation (constraining $\|W\|_{\text{op}} < 1$).

Example: iterating cosine. $f(x) = \cos(x)$ on $[0, 1]$ is a contraction, since $|f'(x)| = |\sin(x)| \leq \sin(1) \approx 0.84 < 1$.

Starting from $x_0 = 0.5$:

  • $x_1 = \cos(0.5) \approx 0.8776$
  • $x_2 = \cos(0.8776) \approx 0.6390$
  • $x_3 = \cos(0.6390) \approx 0.8027$
  • The sequence converges to the Dottie number: $x^* \approx 0.7391$, where $\cos(x^*) = x^*$.
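This iteration takes a few lines to reproduce:

```python
import math

x = 0.5
for _ in range(100):                  # iterate x_{n+1} = cos(x_n)
    x = math.cos(x)

print(x)                              # 0.7390851332151607 (the Dottie number)
print(math.isclose(math.cos(x), x))   # True: a fixed point of cos
```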

5. Inverse Functions

5.1 Left and Right Inverses

Not every function has a full (two-sided) inverse. The general picture involves left and right inverses:

Definition. For $f: A \to B$:

  • A left inverse of $f$ is a function $g: B \to A$ such that $g \circ f = \text{id}_A$ (i.e., $g(f(a)) = a$ for all $a$)
  • A right inverse of $f$ is a function $h: B \to A$ such that $f \circ h = \text{id}_B$ (i.e., $f(h(b)) = b$ for all $b$)
  • A two-sided inverse is a function that is both a left and a right inverse

Key theorems connecting inverses to injectivity/surjectivity:

| Statement | Holds if and only if |
|---|---|
| $f$ has a left inverse | $f$ is injective |
| $f$ has a right inverse | $f$ is surjective (requires the Axiom of Choice) |
| $f$ has a two-sided inverse | $f$ is bijective |
| Both left and right inverses exist | They are equal and unique |

Proof that injective implies a left inverse exists. Suppose $f: A \to B$ is injective with $A \neq \emptyset$. Fix some $a_0 \in A$. Define $g: B \to A$ by:

$$g(b) = \begin{cases} a & \text{if } b = f(a) \text{ for some (unique, by injectivity) } a \in A \\ a_0 & \text{if } b \notin f(A) \end{cases}$$

Then for any $a \in A$: $g(f(a)) = a$ (by the first case), so $g \circ f = \text{id}_A$. QED

Proof that the left and right inverse, if both exist, are equal. Suppose $g$ is a left inverse ($g \circ f = \text{id}_A$) and $h$ is a right inverse ($f \circ h = \text{id}_B$). Then:

$$g = g \circ \text{id}_B = g \circ (f \circ h) = (g \circ f) \circ h = \text{id}_A \circ h = h$$

So $g = h$. QED

For AI. The distinction between left and right inverses appears in:

  • Encoder-decoder architectures: The encoder $f$ maps a high-dimensional input to a low-dimensional bottleneck. The decoder $g$ maps back. We want $g \circ f \approx \text{id}$ (reconstruct the original), making $g$ an approximate left inverse. But $f \circ g \neq \text{id}$ (not every bottleneck vector corresponds to a real input).
  • Projection layers: Reducing dimension from $d$ to $d' < d$ cannot be injective, so it admits a right inverse (e.g., the pseudo-inverse) but no left inverse.

5.2 The Inverse Function

Definition. If $f: A \to B$ is bijective, the inverse function $f^{-1}: B \to A$ is defined by:

$$f^{-1}(b) = a \quad \iff \quad f(a) = b$$

Properties of inverse functions:

| Property | Statement |
|---|---|
| Composition with original | $f^{-1} \circ f = \text{id}_A$ and $f \circ f^{-1} = \text{id}_B$ |
| Inverse of inverse | $(f^{-1})^{-1} = f$ |
| Bijectivity | $f^{-1}$ is itself bijective |
| Graph reflection | The graph of $f^{-1}$ is the graph of $f$ reflected across $y = x$ |
| Continuity | If $f$ is continuous and bijective (on appropriate spaces), $f^{-1}$ is continuous |

Common inverse pairs used in AI:

| Function $f(x)$ | Inverse $f^{-1}(y)$ | Domain $\to$ Codomain | AI Use |
|---|---|---|---|
| $e^x$ | $\ln(y)$ | $\mathbb{R} \to (0,\infty)$ | Log-probabilities; $\log \circ \text{softmax}$ (LogSoftmax) |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\text{logit}(y) = \ln\!\left(\frac{y}{1-y}\right)$ | $\mathbb{R} \to (0,1)$ | Logit-space operations; probability calibration |
| $Ax$ ($A$ invertible) | $A^{-1}y$ | $\mathbb{R}^n \to \mathbb{R}^n$ | Solving linear systems; whitening transforms |
| $ax + b$ ($a \neq 0$) | $(y - b)/a$ | $\mathbb{R} \to \mathbb{R}$ | Undoing normalisation; de-standardisation |
| $\tanh(x)$ | $\text{arctanh}(y) = \frac{1}{2}\ln\!\left(\frac{1+y}{1-y}\right)$ | $\mathbb{R} \to (-1,1)$ | Inverse activation in normalising flows |

5.3 Inverse of a Composition

Theorem (Socks-and-Shoes Lemma). If $f: A \to B$ and $g: B \to C$ are both bijective, then:

$$(g \circ f)^{-1} = f^{-1} \circ g^{-1}$$

The inverse of a composition reverses the order.

Proof. We verify that $f^{-1} \circ g^{-1}$ is the inverse of $g \circ f$:

$$(f^{-1} \circ g^{-1}) \circ (g \circ f) = f^{-1} \circ (g^{-1} \circ g) \circ f = f^{-1} \circ \text{id}_B \circ f = f^{-1} \circ f = \text{id}_A$$

Similarly: $(g \circ f) \circ (f^{-1} \circ g^{-1}) = g \circ (f \circ f^{-1}) \circ g^{-1} = g \circ \text{id}_B \circ g^{-1} = g \circ g^{-1} = \text{id}_C$. QED

Name origin. You put on socks first, then shoes. To undo: you remove shoes first, then socks. The inverse reverses the order.

Generalisation. For $n$ bijections:

$$(f_n \circ f_{n-1} \circ \cdots \circ f_1)^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_n^{-1}$$

For AI: Normalising Flows. A normalising flow is a composition of bijections:

$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1$$

By the socks-and-shoes lemma, the inverse is:

$$f^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}$$

Sampling (generation): start with noise $z \sim \mathcal{N}(0, I)$ and apply $f$ (a forward pass through all layers in order $f_1, f_2, \ldots, f_K$).

Likelihood computation (training): given data $x$, apply $f^{-1}$ (an inverse pass through the layers in reverse order $f_K^{-1}, \ldots, f_1^{-1}$), and use the change-of-variables formula:

$$\log p(x) = \log p(z) + \sum_{k=1}^{K} \log\!\left|\det \frac{\partial f_k^{-1}}{\partial z_k}\right|$$

The order matters: forward pass and inverse pass traverse the layers in opposite directions.
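A two-layer toy flow makes the ordering visible - inverting in the wrong order fails:

```python
import numpy as np

# Two toy bijections on (an interval of) R and their inverses
f1,  f2  = lambda x: 2 * x + 1,   np.tanh
f1i, f2i = lambda y: (y - 1) / 2, np.arctanh

fwd = lambda x: f2(f1(x))            # f = f2 ∘ f1
inv = lambda y: f1i(f2i(y))          # f^{-1} = f1^{-1} ∘ f2^{-1} (reversed order)
bad = lambda y: f2i(f1i(y))          # wrong order does NOT invert f

x = 0.3
print(np.isclose(inv(fwd(x)), x))    # True
print(np.isclose(bad(fwd(x)), x))    # False
```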

5.4 Pseudo-Inverse (Moore-Penrose)

When $f$ is not bijective (e.g., a non-square matrix, or a square but singular matrix), no true inverse exists. The Moore-Penrose pseudo-inverse provides the "best substitute".

Definition. For a matrix $A \in \mathbb{R}^{m \times n}$, the pseudo-inverse $A^+ \in \mathbb{R}^{n \times m}$ is the unique matrix satisfying the four Penrose conditions:

  1. $A A^+ A = A$ (acts like an inverse where it can)
  2. $A^+ A A^+ = A^+$ (acts like an inverse where it can)
  3. $(A A^+)^T = A A^+$ (symmetric projection onto the column space)
  4. $(A^+ A)^T = A^+ A$ (symmetric projection onto the row space)

Via SVD. If $A = U \Sigma V^T$ (singular value decomposition), then:

$$A^+ = V \Sigma^+ U^T$$

where $\Sigma^+$ is obtained by taking the reciprocal of each non-zero singular value and transposing. If $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$, then $\Sigma^+ = \text{diag}(1/\sigma_1, \ldots, 1/\sigma_r, 0, \ldots, 0)^T$.

Geometric interpretation. $x^* = A^+ b$ gives the minimum-norm least-squares solution to $Ax = b$:

  • If $Ax = b$ has a unique solution: $A^+ b$ is that solution (same as $A^{-1}b$).
  • If $Ax = b$ has multiple solutions (underdetermined): $A^+ b$ is the solution with smallest $\|x\|$.
  • If $Ax = b$ has no exact solution (overdetermined): $A^+ b$ minimises $\|Ax - b\|^2$, and among all minimisers is the one with smallest $\|x\|$.
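NumPy exposes the pseudo-inverse directly as `np.linalg.pinv`; a short sketch checking the least-squares property and two of the Penrose conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined system: more equations than unknowns (tall A)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

x_pinv = np.linalg.pinv(A) @ b                    # minimum-norm least squares
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # same answer
print(np.allclose(x_pinv, x_lstsq))               # True

Ap = np.linalg.pinv(A)
print(np.allclose(A @ Ap @ A, A))                 # Penrose condition 1: A A+ A = A
print(np.allclose(Ap @ A @ Ap, Ap))               # Penrose condition 2: A+ A A+ = A+
```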

Special cases:

| Matrix Type | Pseudo-Inverse |
|---|---|
| Square invertible ($m = n$, full rank) | $A^+ = A^{-1}$ |
| Tall full-rank ($m > n$, $\text{rank} = n$) - overdetermined | $A^+ = (A^T A)^{-1} A^T$ (left inverse) |
| Wide full-rank ($m < n$, $\text{rank} = m$) - underdetermined | $A^+ = A^T (A A^T)^{-1}$ (right inverse) |
| Rank-deficient | Use the SVD formula |

For AI. The pseudo-inverse appears in:

  • Linear regression: the closed-form solution $\hat{\theta} = (X^TX)^{-1}X^Ty = X^+ y$
  • LoRA (Low-Rank Adaptation): the weight update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ and $r \ll d$; analysing its effective inverse involves pseudo-inverses of low-rank matrices
  • Weight initialisation analysis: understanding the effective rank and conditioning of weight matrices
  • Gradient computation: when computing gradients through rank-deficient Jacobians

5.5 Invertibility and Information Preservation

Theorem. A function is bijective (invertible) if and only if it preserves all information - i.e., the input can be recovered from the output.

Linear case. For a linear map $f(x) = Ax$ with $A \in \mathbb{R}^{m \times n}$:

| Property | Condition | Meaning |
|---|---|---|
| Injective | $\ker(A) = \{0\}$; $\text{rank}(A) = n$; columns linearly independent | Nothing in the kernel; no information destroyed |
| Surjective | $\text{range}(A) = \mathbb{R}^m$; $\text{rank}(A) = m$; rows span $\mathbb{R}^m$ | Full range; all outputs reachable |
| Bijective | $m = n$ and $\text{rank}(A) = n$ ($\det(A) \neq 0$) | Square and full rank |

The dimension argument. By the rank-nullity theorem (rank(A)+nullity(A)=n\text{rank}(A) + \text{nullity}(A) = n):

  • If n>mn > m (more inputs than outputs): rank(A)m<n\text{rank}(A) \leq m < n, so nullity(A)>0\text{nullity}(A) > 0, so ff is NOT injective. The layer necessarily compresses.
  • If n<mn < m (fewer inputs than outputs): rank(A)n<m\text{rank}(A) \leq n < m, so ff is NOT surjective. Not all outputs are reachable.
  • Only when n=mn = m can ff be bijective (and even then, only if det(A)0\det(A) \neq 0).
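A quick numerical illustration of the first case - any wide matrix has a non-trivial null space, so two distinct inputs collide (a minimal sketch; the matrix is an arbitrary random example):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))            # m=3 outputs, n=5 inputs: must compress
null_vec = np.linalg.svd(A)[2][-1]     # a direction in ker(A)

x = rng.normal(size=5)
assert np.allclose(A @ x, A @ (x + null_vec))  # two different inputs, one output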

Consequences for neural network architecture:

                    Information Flow in a Network
                    
     d=768         d=768         d=768         d=768
  +--------+    +--------+    +--------+    +--------+
  | Layer 1|--->| Layer 2|--->| Layer 3|--->| Layer 4|
  +--------+    +--------+    +--------+    +--------+
  Same dimension throughout -> bijection is POSSIBLE
  (Transformer: d_model stays constant through all layers)
  
     d=784        d=256          d=64          d=10
  +--------+    +--------+    +--------+    +--------+
  |Compress|--->|Compress|--->|Compress|--->| Output |
  +--------+    +--------+    +--------+    +--------+
  Decreasing dimension -> NOT injective -> information lost
  (Classifier: must compress to number of classes)

Autoencoders exploit this: the encoder maps high dimension to low (non-injective, compresses, loses information), and the decoder maps low to high (non-surjective, reconstructs). The bottleneck forces the network to learn a compact representation preserving the most important information.

Invertible neural networks (iRevNet, Glow, NICE) maintain d=dd = d' throughout and use carefully designed bijective layers (coupling layers, additive/affine transforms), enabling exact likelihood computation and perfect reconstruction.
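To make the last point concrete, here is a minimal sketch of an additive coupling layer, the building block of NICE. The inner function m is an arbitrary illustrative choice (it need not be invertible itself); the layer is bijective by construction:

import numpy as np

def m(x):  # any function works here; it need not be invertible itself
    return np.tanh(x) + 0.5 * x

def coupling_forward(x):
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + m(x1)])  # y1 = x1, y2 = x2 + m(x1)

def coupling_inverse(y):
    y1, y2 = np.split(y, 2)
    return np.concatenate([y1, y2 - m(y1)])  # exact inverse; det Jacobian = 1

x = np.random.default_rng(0).normal(size=6)
assert np.allclose(coupling_inverse(coupling_forward(x)), x)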


6. Special Classes of Functions

6.1 Linear Functions

Definition. A function f:VWf: V \to W between vector spaces is linear if it preserves vector addition and scalar multiplication:

  1. Additivity: f(u+v)=f(u)+f(v)f(u + v) = f(u) + f(v) for all u,vVu, v \in V
  2. Homogeneity: f(αv)=αf(v)f(\alpha v) = \alpha f(v) for all αR\alpha \in \mathbb{R}, vVv \in V

These two conditions can be combined into one: f(αu+βv)=αf(u)+βf(v)f(\alpha u + \beta v) = \alpha f(u) + \beta f(v) (preservation of linear combinations).

Key consequences:

  • f(0)=0f(0) = 0 (the zero vector must map to the zero vector)
  • ff is completely determined by its action on a basis: if {e1,,en}\{e_1, \ldots, e_n\} is a basis for VV, then f(v)=f(αiei)=αif(ei)f(v) = f(\sum \alpha_i e_i) = \sum \alpha_i f(e_i)
  • Every linear function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m can be represented as matrix multiplication: f(x)=Axf(x) = Ax where ARm×nA \in \mathbb{R}^{m \times n}

The kernel (null space). ker(f)={vV:f(v)=0}\ker(f) = \{v \in V : f(v) = 0\} - the set of vectors that ff maps to zero.

Rank-Nullity Theorem. For f:VWf: V \to W with VV finite-dimensional:

dim(V)=rank(f)+nullity(f)=dim(range(f))+dim(ker(f))\dim(V) = \text{rank}(f) + \text{nullity}(f) = \dim(\text{range}(f)) + \dim(\ker(f))

This is a conservation law: dimension is neither created nor destroyed, just redistributed between the range and the kernel.

For AI. Linear layers f(x)=Wxf(x) = Wx (without bias) are the backbone of neural networks:

  • The attention Q, K, V projections are linear maps
  • The FFN up/down projections are linear maps
  • The LM head (hidden state -> logits) is a linear map
  • LoRA approximates weight updates with low-rank linear maps: ΔW=BA\Delta W = BA where rank(BA)=rd\text{rank}(BA) = r \ll d

Without nonlinear activation functions, composing any number of linear layers gives a single linear layer: WLW2W1=WtotalW_L \cdots W_2 W_1 = W_{\text{total}}. This is why activations are essential - they break the linearity.

6.2 Affine Functions

Definition. A function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is affine if:

f(x)=Ax+bf(x) = Ax + b

where ARm×nA \in \mathbb{R}^{m \times n} is a matrix and bRmb \in \mathbb{R}^m is a bias vector.

Affine = linear + translation. An affine function is a linear function followed by a translation (shift by bb). It preserves affine combinations: f(αx+(1α)y)=αf(x)+(1α)f(y)f(\alpha x + (1-\alpha) y) = \alpha f(x) + (1-\alpha) f(y) (preserves weighted averages where weights sum to 1).

Key difference from linear:

  • Linear: f(0)=0f(0) = 0 always. Lines through the origin map to lines through the origin.
  • Affine: f(0)=b0f(0) = b \neq 0 in general. Lines map to lines, but the origin can shift.

Composition of affine functions is affine:

f2(f1(x))=A2(A1x+b1)+b2=(A2A1)x+(A2b1+b2)f_2(f_1(x)) = A_2(A_1 x + b_1) + b_2 = (A_2 A_1)x + (A_2 b_1 + b_2)

Result is affine with matrix A2A1A_2 A_1 and bias A2b1+b2A_2 b_1 + b_2.

Critical consequence. Composing any number of affine functions (without nonlinear activations) gives a single affine function. A 100-layer network with only affine layers is equivalent to a single affine layer. This is why nonlinear activation functions are essential - they prevent the entire network from collapsing into a single affine transformation.

For AI. Nearly every "linear layer" in a neural network is actually affine: f(x)=Wx+bf(x) = Wx + b. The bias term bb allows the function to shift the output away from the origin, which is necessary for representing functions that do not pass through zero. Some architectures omit the bias (e.g., some attention projections in LLaMA), but most include it.
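A direct numerical check that stacked affine layers collapse to a single affine map (a minimal sketch with arbitrary random weights):

import numpy as np

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
A2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = A2 @ (A1 @ x + b1) + b2
one_layer  = (A2 @ A1) @ x + (A2 @ b1 + b2)
assert np.allclose(two_layers, one_layer)  # depth without nonlinearity adds nothing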

6.3 Multilinear Functions

Definition. A function f:V1×V2××VkWf: V_1 \times V_2 \times \cdots \times V_k \to W is multilinear (kk-linear) if it is linear in each argument separately, with the others held fixed:

f(v1,,αvi+βvi,,vk)=αf(v1,,vi,,vk)+βf(v1,,vi,,vk)f(v_1, \ldots, \alpha v_i + \beta v_i', \ldots, v_k) = \alpha f(v_1, \ldots, v_i, \ldots, v_k) + \beta f(v_1, \ldots, v_i', \ldots, v_k)

The most important case is bilinear (k=2k = 2):

f:V×WU,f(αv1+βv2,w)=αf(v1,w)+βf(v2,w)f: V \times W \to U, \quad f(\alpha v_1 + \beta v_2, w) = \alpha f(v_1, w) + \beta f(v_2, w) f(v,αw1+βw2)=αf(v,w1)+βf(v,w2)f(v, \alpha w_1 + \beta w_2) = \alpha f(v, w_1) + \beta f(v, w_2)

Key property. A bilinear function is linear in each argument but NOT linear overall:

f(αv,αw)=α2f(v,w)αf(v,w)(unless α=0 or 1)f(\alpha v, \alpha w) = \alpha^2 f(v, w) \neq \alpha f(v, w) \quad \text{(unless } \alpha = 0 \text{ or } 1\text{)}

Common bilinear operations in AI:

OperationBilinear MapVariables
Dot productf(u,v)=uTvf(u, v) = u^T vLinear in uu, linear in vv
Matrix-vector productf(A,x)=Axf(A, x) = AxLinear in AA, linear in xx
Attention scoresf(Q,K)=QKTf(Q, K) = QK^TLinear in QQ, linear in KK
Outer productf(u,v)=uvTf(u, v) = uv^TLinear in uu, linear in vv
Bilinear formf(x,y)=xTAyf(x, y) = x^T A yLinear in xx, linear in yy

For AI: Attention is bilinear (before softmax). The raw attention score between query qq and key kk is:

score(q,k)=qTkdk\text{score}(q, k) = \frac{q^T k}{\sqrt{d_k}}

This is bilinear: linear in qq (doubling qq doubles the score) and linear in kk (doubling kk doubles the score). The softmax that follows makes the full attention mechanism nonlinear.
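The claimed homogeneity in each argument is easy to verify numerically (a minimal sketch; the dimension is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
score = lambda q, k: q @ k / np.sqrt(64)

assert np.isclose(score(2 * q, k), 2 * score(q, k))      # linear in q
assert np.isclose(score(q, 3 * k), 3 * score(q, k))      # linear in k
assert np.isclose(score(2 * q, 2 * k), 4 * score(q, k))  # quadratic jointly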

6.4 Polynomial Functions

Definition. A polynomial function f:RRf: \mathbb{R} \to \mathbb{R} has the form:

f(x)=anxn+an1xn1++a1x+a0=k=0nakxkf(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 = \sum_{k=0}^{n} a_k x^k

The degree of the polynomial is nn (the highest power with non-zero coefficient).

Weierstrass Approximation Theorem. Every continuous function on a closed interval [a,b][a, b] can be uniformly approximated by polynomials. That is, for any continuous f:[a,b]Rf: [a,b] \to \mathbb{R} and any ε>0\varepsilon > 0, there exists a polynomial pp with f(x)p(x)<ε|f(x) - p(x)| < \varepsilon for all x[a,b]x \in [a,b].

This is a classical universal approximation result - but it requires the degree (number of terms) to grow, and in general, the required degree grows quickly for complex functions.

Multivariate polynomials. A polynomial in nn variables:

f(x1,,xn)=αcαx1α1x2α2xnαnf(x_1, \ldots, x_n) = \sum_{\alpha} c_\alpha x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}

where α=(α1,,αn)\alpha = (\alpha_1, \ldots, \alpha_n) is a multi-index.

Connection to neural networks (depth separation). Telgarsky (2016) showed that there exist functions expressible by depth-kk ReLU networks with polynomial width that require exponential width in depth-(k1)(k-1) networks. This is analogous to the fact that high-degree polynomials can represent functions that low-degree polynomials cannot.

Why neural networks don't use polynomial activations. If the activation function σ(x)=xk\sigma(x) = x^k were polynomial, then composing affine layers with polynomial activations gives a polynomial of degree kLk^L (where LL is depth). While high-degree polynomials are universal approximators, they have catastrophic extrapolation behaviour: x100x^{100} explodes for x>1|x| > 1 and vanishes for x<1|x| < 1. ReLU networks avoid this by producing piecewise linear functions, which extrapolate more gracefully.

6.5 Activation Functions

Activation functions are the nonlinear components that give neural networks their expressive power. Without them, any deep network collapses to a single affine transformation.

Detailed analysis of key activation functions:

Sigmoid:

σ(x)=11+ex\sigma(x) = \frac{1}{1 + e^{-x}}
PropertyValue
Range(0,1)(0, 1)
Derivativeσ(x)=σ(x)(1σ(x))\sigma'(x) = \sigma(x)(1 - \sigma(x))
Max derivative1/41/4 at x=0x = 0
MonotoneStrictly increasing
InjectiveYes
Lipschitz constant1/41/4
BoundedYes
SymmetricAbout (0,0.5)(0, 0.5): σ(x)=1σ(x)\sigma(-x) = 1 - \sigma(x)
SaturationGradients vanish for large $|x|$ (both tails)

Problem. Max derivative is 1/41/4. After LL layers, gradient is at most (1/4)L(1/4)^L. For L=20L = 20: (1/4)201012(1/4)^{20} \approx 10^{-12}. This is the vanishing gradient problem - the reason sigmoid was abandoned as a hidden-layer activation in deep networks.

Tanh:

tanh(x)=exexex+ex=2σ(2x)1\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1
PropertyValue
Range(1,1)(-1, 1)
Derivativetanh(x)=1tanh2(x)\tanh'(x) = 1 - \tanh^2(x)
Max derivative11 at x=0x = 0
MonotoneStrictly increasing
Zero-centredYes: tanh(0)=0\tanh(0) = 0 and tanh(x)=tanh(x)\tanh(-x) = -\tanh(x)
SaturationStill saturates for large $|x|$

ReLU (Rectified Linear Unit):

ReLU(x)=max(0,x)={xif x>00if x0\text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}
PropertyValue
Range[0,)[0, \infty)
Derivative11 for x>0x > 0; 00 for x<0x < 0; undefined at x=0x = 0
MonotoneNon-decreasing (not strict)
InjectiveNo (all negatives map to 0)
Lipschitz constant11
BoundedNo (unbounded above)
ComputationExtremely cheap: just a comparison
Dying ReLUIf pre-activation is always negative, gradient is always 0 - neuron is "dead"
SparsityRoughly 50% of neurons output 0, creating sparse representations

Why ReLU dominates. Despite losing information (not injective) and not being differentiable at 0, ReLU works extremely well because:

  1. Gradient is exactly 1 for positive inputs - no vanishing gradient
  2. Creates piecewise linear functions - efficient to compute and optimise
  3. Induces sparsity - only active neurons contribute
  4. Extremely cheap to compute

GELU (Gaussian Error Linear Units, Hendrycks & Gimpel 2016):

GELU(x)=xΦ(x)=x12[1+erf ⁣(x2)]\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]

where Φ\Phi is the standard normal CDF. Approximation: GELU(x)0.5x(1+tanh[2/π(x+0.044715x3)])\text{GELU}(x) \approx 0.5 x \left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715 x^3)\right]\right).
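The exact form and the tanh approximation agree closely; a minimal sketch of the comparison (assumes SciPy for the exact erf):

import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return x * 0.5 * (1 + erf(x / np.sqrt(2)))

def gelu_tanh(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 1001)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small everywhere (~1e-3 or less)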

PropertyValue
Range[0.17,)[\approx -0.17, \infty)
MonotoneNo (dips to its minimum of $\approx -0.17$ near $x \approx -0.75$)
SmoothYes (CC^\infty - infinitely differentiable)
Used inGPT-2, BERT, most transformer models before SwiGLU era

SwiGLU (Shazeer 2020, used in LLaMA, PaLM, Gemini):

\text{FFN}_{\text{SwiGLU}}(x; W_1, W_2, W_3) = \left(\text{SiLU}(xW_1) \odot (xW_2)\right) W_3

where $\text{SiLU}(x) = x \cdot \sigma(x)$ (the Swish activation), $\odot$ is elementwise multiplication, and $W_3$ is the down projection back to the model dimension.

PropertyValue
GatedYes - one path gates the other via sigmoid
ParametersThree weight matrices instead of two; LLaMA-style models shrink the hidden width (e.g. to $\tfrac{2}{3} \cdot 4d$) to keep the FFN parameter count roughly constant
Used inLLaMA 1/2/3, PaLM, Gemini, Mistral, most 2023-2026 LLMs
Why preferredEmpirically better than GELU for large-scale language models

Mish (Misra 2019):

Mish(x)=xtanh(softplus(x))=xtanh(ln(1+ex))\text{Mish}(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh(\ln(1 + e^x))
PropertyValue
SmoothYes (CC^\infty)
Non-monotoneSlight negative dip (similar to GELU)
Self-regularisingBounded below, unbounded above
Used inYOLOv4, some vision models

Comprehensive comparison table:

ActivationFormulaRangeSmooth?Monotone?Injective?Lipschitz LLVanishing grad?
Sigmoid1/(1+ex)1/(1+e^{-x})(0,1)(0,1)YesYesYes1/41/4Yes (severe)
Tanhtanh(x)\tanh(x)(1,1)(-1,1)YesYesYes11Yes (moderate)
ReLUmax(0,x)\max(0,x)[0,)[0,\infty)No (x=0x=0)Yes (weak)No11No (but dying neurons)
Leaky ReLUmax(αx,x)\max(\alpha x, x)R\mathbb{R}No (x=0x=0)YesYes11No
GELUxΦ(x)x\Phi(x)[0.17,)\approx[-0.17,\infty)YesNoNo1.1\approx 1.1No
SiLU/Swishxσ(x)x\sigma(x)[0.28,)\approx[-0.28,\infty)YesNoNo1.1\approx 1.1No
Mishxtanh(sp(x))x\tanh(\text{sp}(x))[0.31,)\approx[-0.31,\infty)YesNoNo1.0\approx 1.0No
Softplusln(1+ex)\ln(1+e^x)(0,)(0,\infty)YesYesYes11No
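Several columns of this table can be spot-checked numerically by maximising the finite-difference slope on a grid - a minimal sketch:

import numpy as np

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]

def max_slope(f):
    return np.max(np.abs(np.diff(f(x)) / dx))  # empirical Lipschitz constant

sigmoid = lambda x: 1 / (1 + np.exp(-x))
print(max_slope(sigmoid))                      # ~0.25
print(max_slope(np.tanh))                      # ~1.0
print(max_slope(lambda x: np.maximum(x, 0)))   # ~1.0 (ReLU)
print(max_slope(lambda x: x * sigmoid(x)))     # ~1.1 (SiLU)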

6.6 Convex Functions

Definition. A function f:RnRf: \mathbb{R}^n \to \mathbb{R} is convex if for all x,ydom(f)x, y \in \text{dom}(f) and λ[0,1]\lambda \in [0, 1]:

f(λx+(1λ)y)λf(x)+(1λ)f(y)f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y)

Geometrically: the line segment connecting any two points on the graph lies above (or on) the graph. The function "curves upward".

         f(x)
          |    /
          |   /  Line segment (always above curve)
          |  */*
          | / * 
          |/  *  <- f curves below the line
          /   *    (this is convexity)
         /----*--------- x

Three equivalent characterisations (for differentiable ff):

  1. Definition: f(λx+(1λ)y)λf(x)+(1λ)f(y)f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)
  2. First-order condition: f(y)f(x)+f(x)T(yx)f(y) \geq f(x) + \nabla f(x)^T(y - x) (function lies above every tangent plane)
  3. Second-order condition: 2f(x)0\nabla^2 f(x) \succeq 0 (Hessian is positive semi-definite everywhere)

Strict convexity: replace \leq with << (for xyx \neq y and λ(0,1)\lambda \in (0,1)). Strict convexity implies a unique global minimum.

Strong convexity: f(y)f(x)+f(x)T(yx)+μ2yx2f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2 for some μ>0\mu > 0. Strong convexity implies:

  • Unique global minimum
  • Gradient descent converges at rate O(eμt)O(e^{-\mu t}) (exponential)
  • The Hessian satisfies 2f(x)μI\nabla^2 f(x) \succeq \mu I

Why convexity matters for optimisation:

PropertyConvex ffNon-convex ff
Local minimaEvery local minimum is globalCan have many local minima
Gradient descentGuaranteed to converge to global minMay converge to local min or saddle
UniquenessStrict convex -> unique minimumMultiple minima possible
Rate of convergenceO(1/t)O(1/t) or O(eμt)O(e^{-\mu t})No guarantees

For AI. The loss landscape of neural networks is non-convex in the parameters θ\theta. However:

  • Cross-entropy loss is convex in the logits zz (for fixed true labels): L(z)=zy+logezvL(z) = -z_y + \log\sum e^{z_v} is convex because log-sum-exp is convex
  • The L2L_2 regularisation term λ2θ2\frac{\lambda}{2}\|\theta\|^2 is strongly convex in θ\theta
  • Adding regularisation improves the conditioning of the loss landscape

Despite non-convexity, SGD finds good solutions in practice. This is one of the central mysteries of deep learning theory.
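The convexity of log-sum-exp can be verified numerically: its Hessian at $z$ is $\text{diag}(p) - pp^T$ with $p = \text{softmax}(z)$, which is positive semi-definite. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
p = np.exp(z - z.max()); p /= p.sum()   # softmax, numerically stable

H = np.diag(p) - np.outer(p, p)         # Hessian of log-sum-exp at z
eigs = np.linalg.eigvalsh(H)
assert np.all(eigs >= -1e-12)           # all eigenvalues non-negative: PSD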

6.7 Lipschitz Functions

Definition. A function f:XYf: X \to Y (between metric spaces) is Lipschitz continuous with constant L0L \geq 0 if:

dY(f(x1),f(x2))LdX(x1,x2)for all x1,x2Xd_Y(f(x_1), f(x_2)) \leq L \cdot d_X(x_1, x_2) \quad \text{for all } x_1, x_2 \in X

In Rn\mathbb{R}^n with Euclidean norm: f(x1)f(x2)Lx1x2\|f(x_1) - f(x_2)\| \leq L \|x_1 - x_2\|.

Intuition. A Lipschitz function cannot change its output faster than LL times the change in input. The constant LL is a bound on the "slope" or "sensitivity" of the function.

The smallest valid LL is the Lipschitz constant:

Lip(f)=supx1x2f(x1)f(x2)x1x2\text{Lip}(f) = \sup_{x_1 \neq x_2} \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|}

For differentiable functions: Lip(f)=supxJf(x)op\text{Lip}(f) = \sup_x \|Jf(x)\|_{\text{op}} (supremum of the operator norm of the Jacobian).

Hierarchy:

ContractionLipschitzUniformly ContinuousContinuous\text{Contraction} \subsetneq \text{Lipschitz} \subsetneq \text{Uniformly Continuous} \subsetneq \text{Continuous}

(Contraction: L<1L < 1. Lipschitz: L<L < \infty.)

Lipschitz constants for common AI functions:

FunctionLipschitz ConstantNotes
ReLU$1$Slope is $0$ or $1$ everywhere; the kink at $x = 0$ does not raise the constant
Sigmoid1/41/4Max derivative at x=0x = 0
Tanh11Max derivative at x=0x = 0
Softmax11In 1\ell^1 norm
Linear WxWxWop=σmax(W)\|W\|_{\text{op}} = \sigma_{\max}(W)Largest singular value
Layer Normd\leq \sqrt{d} (approximately)Depends on the input distribution
Affine Wx+bWx + bWop\|W\|_{\text{op}}Bias does not affect Lipschitz constant

Lipschitz constant of a composition. If ff has Lipschitz constant LfL_f and gg has Lipschitz constant LgL_g, then gfg \circ f has Lipschitz constant at most LgLfL_g \cdot L_f:

g(f(x))g(f(y))Lgf(x)f(y)LgLfxy\|g(f(x)) - g(f(y))\| \leq L_g \|f(x) - f(y)\| \leq L_g L_f \|x - y\|

For a deep network with LL layers, each with Lipschitz constant LlL_l:

Lip(fnetwork)l=1LLl\text{Lip}(f_{\text{network}}) \leq \prod_{l=1}^{L} L_l

If each layer has Lipschitz constant >1> 1, the bound grows exponentially with depth - a small input perturbation could be amplified exponentially. This is exactly the phenomenon behind adversarial examples: a tiny, imperceptible change to the input causes a large change in the output.

Spectral normalisation (Miyato et al. 2018). To control the Lipschitz constant of each layer, divide the weight matrix by its largest singular value:

W~=Wσmax(W)\tilde{W} = \frac{W}{\sigma_{\max}(W)}

This ensures W~op=1\|\tilde{W}\|_{\text{op}} = 1, so each layer has Lipschitz constant 1\leq 1 (when composed with a 1-Lipschitz activation like ReLU), and the overall network has Lipschitz constant 1L=1\leq 1^L = 1.
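In practice $\sigma_{\max}(W)$ is estimated with a few steps of power iteration rather than a full SVD. A minimal sketch (the iteration count and matrix are illustrative):

import numpy as np

def spectral_normalise(W, n_iters=100):
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):                 # power iteration on W W^T
        v = W.T @ u; v /= np.linalg.norm(v)
        u = W @ v;  u /= np.linalg.norm(u)
    sigma_max = u @ W @ v                    # estimate of the top singular value
    return W / sigma_max

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalise(W)
assert np.isclose(np.linalg.norm(W_sn, 2), 1.0, atol=1e-3)  # operator norm is ~1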

Applications of Lipschitz constraints in AI:

  • Wasserstein GANs (WGAN): The discriminator must be 1-Lipschitz; enforced via spectral normalisation or gradient penalty
  • Certified robustness: A network with Lipschitz constant $L$ and a prediction margin of $\varepsilon$ is guaranteed robust to perturbations smaller than $\varepsilon/L$ - no adversarial example within this radius can flip the prediction
  • Training stability: Constraining layer Lipschitz constants prevents gradient explosion
  • ODE-based models: Neural ODEs require Lipschitz dynamics for well-posedness (existence and uniqueness of solutions)

7. Higher-Order Functions and Functionals

7.1 Functions as Arguments

Definition. A higher-order function is a function that takes one or more functions as arguments, returns a function as output, or both.

In mathematical notation, if F\mathcal{F} is a set of functions, a higher-order function has the form:

H:FGorH:F×XYH: \mathcal{F} \to \mathcal{G} \quad \text{or} \quad H: \mathcal{F} \times X \to Y

Formal framework. Let Fun(X,Y)=YX={f:XY}\text{Fun}(X, Y) = Y^X = \{f : X \to Y\} denote the set of all functions from XX to YY. A higher-order function is simply a function whose domain or codomain includes such function spaces.

Examples in mathematics:

  • Derivative operator: D:C1(R)C0(R)D: C^1(\mathbb{R}) \to C^0(\mathbb{R}), D(f)=fD(f) = f'. Takes a differentiable function, returns its derivative.
  • Integration operator: Iab:C([a,b])RI_a^b: C([a,b]) \to \mathbb{R}, Iab(f)=abf(x)dxI_a^b(f) = \int_a^b f(x)\,dx. Takes a continuous function, returns a number.
  • Composition operator: Cg:Fun(X,Y)Fun(X,Z)C_g: \text{Fun}(X, Y) \to \text{Fun}(X, Z), Cg(f)=gfC_g(f) = g \circ f. Takes a function, returns a function.

Examples in programming (Python/JAX):

import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 2)  # a toy differentiable function R^n -> R

# map applies a function to each element:
# map(f, [x1, x2, x3]) -> [f(x1), f(x2), f(x3)]

# JAX grad: takes a function, returns its gradient function
grad_f = jax.grad(f)     # grad: (R^n -> R) -> (R^n -> R^n)

# JAX vmap: takes a function, returns its batched version
batched_f = jax.vmap(f)  # vmap: (R^n -> R^m) -> (R^{B x n} -> R^{B x m})

7.2 Functionals (Functions of Functions)

Definition. A functional is a function from a function space to the real numbers:

J:FRJ: \mathcal{F} \to \mathbb{R}

where F\mathcal{F} is a set of functions. A functional maps an entire function to a single number.

Key examples:

FunctionalDefinitionWhat It Measures
Definite integralJ(f)=abf(x)dxJ(f) = \int_a^b f(x)\,dxArea under the curve
$L^p$ norm$J(f) = \left(\int |f(x)|^p \, dx\right)^{1/p}$Overall size of the function
EvaluationJ(f)=f(x0)J(f) = f(x_0)Value at a specific point
Supremum norm$J(f) = \sup_x |f(x)|$Worst-case magnitude
EntropyH(p)=p(x)logp(x)dxH(p) = -\int p(x) \log p(x)\,dxUncertainty/information
KL divergenceDKL(pq)=p(x)logp(x)q(x)dxD_{KL}(p \| q) = \int p(x) \log\frac{p(x)}{q(x)}\,dxDistance between distributions

Loss functions as functionals. In machine learning, the loss function is a functional on the space of models:

L:FR,L(f)=E(x,y)D[(f(x),y)]\mathcal{L}: \mathcal{F} \to \mathbb{R}, \quad \mathcal{L}(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)]

This maps each model ff to a single number measuring how well ff fits the data. Training is the process of minimising this functional over the space of representable functions.

7.3 Operators (Functions on Function Spaces)

Definition. An operator is a function from one function space to another:

T:FGT: \mathcal{F} \to \mathcal{G}

Unlike functionals (which output numbers), operators output functions.

Linear operators. An operator TT is linear if T(αf+βg)=αT(f)+βT(g)T(\alpha f + \beta g) = \alpha T(f) + \beta T(g). These are the infinite-dimensional analogue of matrices.

Key examples of operators:

OperatorDefinitionType
DerivativeD(f)=fD(f) = f'Linear
IntegrationI(f)(x)=0xf(t)dtI(f)(x) = \int_0^x f(t)\,dtLinear
Fourier transformf^(ω)=f(t)eiωtdt\hat{f}(\omega) = \int f(t) e^{-i\omega t}\,dtLinear
Convolution(Tgf)(x)=g(xt)f(t)dt(T_g f)(x) = \int g(x-t)f(t)\,dtLinear (for fixed gg)
CompositionCg(f)=gfC_g(f) = g \circ fGenerally nonlinear
Softmaxsoftmax(f)i=efiefj\text{softmax}(f)_i = \frac{e^{f_i}}{\sum e^{f_j}}Nonlinear

For AI: Neural networks as operators. Recent work treats neural networks as operators between function spaces:

  • Neural Operators (FNO, DeepONet): Learn mappings between function spaces directly, e.g., mapping initial conditions to PDE solutions
  • Attention as an operator: Self-attention can be viewed as a nonlinear operator on a sequence of vectors: Attn:Rn×dRn×d\text{Attn}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}, where nn varies (the operator handles variable-length inputs)

7.4 Calculus of Variations

Problem. Find the function ff that minimises (or maximises) a functional J[f]J[f].

This is optimisation over function spaces rather than over finite-dimensional parameter vectors. The classical framework is the calculus of variations.

Euler-Lagrange equation. Consider the functional:

J[f]=abL(x,f(x),f(x))dxJ[f] = \int_a^b \mathcal{L}(x, f(x), f'(x))\,dx

where L\mathcal{L} is the Lagrangian. The function ff^* that minimises JJ satisfies:

LfddxLf=0\frac{\partial \mathcal{L}}{\partial f} - \frac{d}{dx}\frac{\partial \mathcal{L}}{\partial f'} = 0

This is the Euler-Lagrange equation - an ODE whose solutions are the critical points of the functional.

Example: shortest path. The length of a curve y=f(x)y = f(x) from (a,f(a))(a, f(a)) to (b,f(b))(b, f(b)) is:

J[f]=ab1+(f(x))2dxJ[f] = \int_a^b \sqrt{1 + (f'(x))^2}\,dx

The Euler-Lagrange equation gives f(x)=0f''(x) = 0, meaning the shortest path is a straight line f(x)=cx+df(x) = cx + d. (In Euclidean geometry.)

Functional derivative. The analogue of a gradient for functionals:

δJδf(x)=limε0J[f+εδx]J[f]ε\frac{\delta J}{\delta f}(x) = \lim_{\varepsilon \to 0} \frac{J[f + \varepsilon \delta_x] - J[f]}{\varepsilon}

This measures how JJ changes when ff is perturbed at the point xx.

Connection to machine learning. Training a neural network by gradient descent on parameters θ\theta is a finite-dimensional approximation to functional optimisation:

Functional view:   min_{f in F}         L(f)         <- optimise over all functions
                           |
                           |  parameterise: f = f_theta
                           v
Parameter view:    min_{theta in R^p}   L(f_theta)   <- optimise over parameters

The "lottery ticket hypothesis" and neural architecture search can be seen as trying to find the right parameterisation of the function space.

7.5 Transformations in AI Frameworks

Modern deep learning frameworks extensively use higher-order functions. Here is a conceptual map:

JAX's transformation system:

TransformType SignatureDescription
grad(f:RnR)(g:RnRn)(f: \mathbb{R}^n \to \mathbb{R}) \to (g: \mathbb{R}^n \to \mathbb{R}^n)Returns gradient function
vmap(f:XY)(g:XBYB)(f: X \to Y) \to (g: X^B \to Y^B)Vectorises (auto-batches)
jit(f:XY)(g:XY)(f: X \to Y) \to (g: X \to Y)Compiles for speed
pmap(f:XY)(g:XDYD)(f: X \to Y) \to (g: X^D \to Y^D)Parallelises across devices

The power of composability: These transforms compose:

jit(vmap(grad(f)))\text{jit}(\text{vmap}(\text{grad}(f)))

This gives a compiled, batched gradient function - three layers of higher-order functions applied in sequence. Each transform takes a function and returns a new, enhanced function.

Autodiff as a higher-order function. Automatic differentiation (the engine behind backpropagation) is fundamentally a higher-order function:

\text{AD}: (f: \mathbb{R}^n \to \mathbb{R}^m) \mapsto (J_f: \mathbb{R}^n \to \mathbb{R}^{m \times n})

It takes any differentiable function and returns its Jacobian (or gradient, for scalar outputs). The reverse-mode variant computes gradients in O(cost(f))O(\text{cost}(f)) time regardless of the number of parameters - this is why training networks with billions of parameters is feasible.
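A concrete sketch of these composed transforms (assumes JAX is installed; the loss function is a toy example):

import jax
import jax.numpy as jnp

def loss(w):                       # scalar-valued: R^3 -> R
    return jnp.sum(jnp.tanh(w) ** 2)

per_example_grad = jax.jit(jax.vmap(jax.grad(loss)))  # compiled, batched gradient

W = jnp.ones((8, 3))               # a batch of 8 parameter vectors
print(per_example_grad(W).shape)   # (8, 3): one gradient per batch element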


8. Continuity and Limits

This chapter only needs the core idea of continuity: functions can preserve nearby structure instead of tearing inputs apart with abrupt jumps. The full story belongs to Limits and Continuity, where epsilon-delta reasoning, theorem statements, and examples are treated systematically.

8.1 The Epsilon-Delta Definition

A function is continuous at aa when sufficiently small input changes force small output changes:

ε>0,  δ>0:  xa<δ    f(x)f(a)<ε\forall \varepsilon > 0,\; \exists \delta > 0:\; |x-a| < \delta \implies |f(x)-f(a)| < \varepsilon

You do not need to master epsilon-delta proofs yet. What matters in this chapter is the interpretation: continuity is a structural property of a function, not a plotting aesthetic. It tells us whether a function respects local neighborhoods.

8.2 Types of Continuity

Ordinary continuity is pointwise: the allowed δ\delta may depend on the location aa. Uniform continuity is stronger: one choice of δ\delta works across the whole set. Lipschitz continuity is stronger still and gives an explicit bound on how much outputs can move relative to inputs.

For AI, this hierarchy matters because robustness, sensitivity, and optimization stability are all function-shape questions. When we say a layer is "well behaved," we usually mean something in this family.

8.3 Continuity in Higher Dimensions

For f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m, continuity is stated with norms:

ε>0,  δ>0:  xa<δ    f(x)f(a)<ε\forall \varepsilon > 0,\; \exists \delta > 0:\; \|x-a\| < \delta \implies \|f(x)-f(a)\| < \varepsilon

This is the version that matters for neural networks, where inputs, hidden states, and parameters all live in high-dimensional spaces. A small step in parameter space should not cause an uncontrolled jump in loss unless the model includes genuinely discontinuous operations.

8.4 Limits and Convergence

Limits describe what a function approaches near a point, even if the function value at that point is missing or different. Continuity can be summarized as "the function value agrees with the limiting value."

This perspective is especially useful for activation functions and optimization trajectories: sigmoid and tanh saturate toward limiting values, while ReLU becomes asymptotically linear on the positive side. We will return to these ideas in Limits and Continuity.

8.5 Discontinuities and Their Consequences

Not every useful AI function is smooth or even continuous. argmax, hard thresholding, rounding, and hard attention all introduce abrupt changes. That is why so much modern ML relies on soft relaxations such as softmax, sigmoid gates, or temperature-controlled approximations.

The important bridge for this chapter is simple: continuity is one of the first major properties that distinguishes function classes in a way that directly affects trainability and robustness.


9. Differentiable Functions and Derivatives

Differentiability is the next major refinement: not only does the function avoid jumps, it admits a meaningful local linear approximation. The detailed machinery lives in Derivatives and Differentiation, Partial Derivatives and Gradients, Chain Rule and Backpropagation, and Automatic Differentiation.

9.1 Derivative as a Linear Approximation

For a scalar function, the derivative at aa is the limit

f(a)=limh0f(a+h)f(a)hf'(a) = \lim_{h \to 0}\frac{f(a+h)-f(a)}{h}

when that limit exists. In higher dimensions, the same idea becomes the gradient or Jacobian: the best linear map that approximates the function near a point. This "best local linear model" viewpoint is the right mental model for optimization.

9.2 Smoothness Classes

Analysts organize functions by regularity: continuous functions, continuously differentiable functions, twice differentiable functions, and so on. In ML terms, this is the difference between kinked activations such as ReLU and smoother choices such as GELU or softplus.

You do not need the full taxonomy yet. The important lesson is that stronger regularity assumptions buy stronger theorems, but many practical models deliberately work at the edge of those assumptions because piecewise-linear functions are computationally convenient.

9.3 The Chain Rule and Backpropagation

Differentiation becomes operationally important because neural networks are compositions:

f=fLf2f1f = f_L \circ \cdots \circ f_2 \circ f_1

The chain rule tells us how derivatives of compositions combine, and backpropagation is the computational form of that rule. This is the main reason the function-composition viewpoint earlier in the chapter matters so much.

9.4 Gradient Problems in Deep Networks

Composing many layers can make gradients shrink, blow up, or become numerically unstable. Vanishing and exploding gradients are not separate mysteries; they are consequences of how many Jacobian-like factors are multiplied together through depth.

Residual connections, normalization, initialization schemes, and gradient clipping are all function-shaping interventions designed to keep this derivative information usable.

9.5 Automatic Differentiation

Modern frameworks do not symbolically differentiate full formulas by hand. They build a computation graph and apply local derivative rules mechanically. That is automatic differentiation, and it is what turns abstract calculus into a practical training system.

For this section, the main bridge is conceptual: once functions are composed from simple pieces, their derivatives can also be assembled systematically. That idea powers nearly all large-scale deep learning.


10. Measure-Theoretic View of Functions

Functions eventually need to interact with probability, integration, and "almost everywhere" reasoning. This chapter only needs a preview so the vocabulary will feel familiar later. The closest developed companions in the current curriculum are Introduction to Probability and Random Variables, Expectation and Moments, and Integration; the dedicated measure-theory chapter sequence is still mostly scaffolded.

10.1 Why Measure Theory?

Probability theory needs a rigorous way to say which subsets count as events, which functions are valid random variables, and when an expectation is well defined. Measure theory supplies that language.

From the function viewpoint, the key upgrade is that we stop asking only "what output does this function produce?" and start asking "how does this function interact with a measure on its domain?"

10.2 Measurable Functions

A function between measurable spaces is measurable when preimages of measurable sets remain measurable. This is the measure-theoretic analogue of continuity preserving open-set structure.

That condition is what lets us push probabilities through a function and talk about the distribution of outputs in a mathematically controlled way.

10.3 Random Variables as Measurable Functions

A random variable is not fundamentally a mysterious object that "has a distribution." It is a measurable function

X:(Ω,F,P)(R,B(R))X : (\Omega, \mathcal{F}, P) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))

from outcomes to values. The distribution is the pushforward of PP through XX. This perspective becomes especially helpful when we later treat generative models as maps that transform one distribution into another.

10.4 Lebesgue Integration

Expectation is integration with respect to a probability measure. Lebesgue integration is the framework that makes this precise for very general classes of functions, far beyond the simple Riemann-integrable cases you first see in basic calculus.

For ML, this matters because loss expectations, likelihoods, densities, and interchange-of-limit arguments all rest on integration theory, even when the notation looks innocently familiar.

10.5 Almost Everywhere and Null Sets

Many theorems in analysis and ML care only about what happens outside sets of measure zero. A property that holds almost everywhere may fail on a negligible exceptional set and still be enough for integration, optimization, and probability arguments.

This is why ReLU's non-differentiability at a single point is usually harmless in practice: the bad set is tiny in the measure-theoretic sense that later theory makes precise.


11. Functions in Machine Learning

11.1 Universal Approximation Theorems

The fundamental question. Can neural networks represent any function? The universal approximation theorem says yes - with sufficient width or depth.

Width version (Cybenko 1989, Hornik 1991). For any continuous function f:[0,1]nRf: [0,1]^n \to \mathbb{R} and any ε>0\varepsilon > 0, there exists a single-hidden-layer network:

g(x)=i=1Nαiσ(wiTx+bi)+βg(x) = \sum_{i=1}^{N} \alpha_i \sigma(w_i^T x + b_i) + \beta

with NN sufficiently large (but finite), such that supx[0,1]nf(x)g(x)<ε\sup_{x \in [0,1]^n} |f(x) - g(x)| < \varepsilon.

Requirements on σ\sigma:

  • Original (Cybenko): σ\sigma must be sigmoidal (monotone, σ(x)1\sigma(x) \to 1 as xx \to \infty, σ(x)0\sigma(x) \to 0 as xx \to -\infty)
  • Extended (Hornik): Any continuous, non-polynomial σ\sigma works
  • ReLU: Specific constructions show that ReLU networks are universal approximators

Depth version (Lu et al. 2017). ReLU networks with width n+1n + 1 (where nn is input dimension) but arbitrary depth are universal approximators. Width n\leq n is NOT sufficient - there exist continuous functions that cannot be approximated by arbitrarily deep but narrow ReLU networks.

What the theorem does NOT say:

  • It does NOT guarantee that gradient descent can find the approximation
  • It does NOT bound the number of parameters needed (could be astronomically large)
  • It does NOT say anything about generalisation (fitting training data \neq generalising)
  • It does NOT say the approximation is efficient

Depth separation results. There exist functions that:

  • Can be computed by depth-kk ReLU networks with polynomial width
  • Require exponential width to approximate with depth-(k1)(k-1) networks (Telgarsky 2016, Eldan & Shamir 2016)

This justifies deeper architectures: depth provides exponential efficiency gains in representation.

11.2 Loss Functions as Function Compositions

The training loss decomposes as a composition of functions:

Input $x$ -> Model $f_\theta(x)$ -> Prediction $\hat{y}$ -> Loss $\ell(\hat{y}, y)$ -> Scalar $\mathcal{L}$
L(θ)=1ni=1n(fθ(xi),yi)\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)

Common loss functions and their properties:

LossFormulaDomain -> RangeConvex in pred?Bounded?
MSE1n(yiy^i)2\frac{1}{n}\sum(y_i - \hat{y}_i)^2Rn[0,)\mathbb{R}^n \to [0, \infty)Yes (strongly)No
Cross-entropyyilogy^i-\sum y_i \log \hat{y}_iΔk[0,)\Delta^k \to [0, \infty)YesNo (\to \infty as y^i0\hat{y}_i \to 0)
Hingemax(0,1yy^)\max(0, 1 - y\hat{y})R[0,)\mathbb{R} \to [0, \infty)YesNo
HuberMSE if $|e| < \delta$, MAE otherwise$\mathbb{R} \to [0, \infty)$YesNo
KL divergencepilog(pi/qi)\sum p_i \log(p_i/q_i)Δk×Δk[0,)\Delta^k \times \Delta^k \to [0,\infty)Yes (in qq)No
Focal(1y^)γlogy^-(1-\hat{y})^\gamma \log \hat{y}(0,1)[0,)(0,1) \to [0, \infty)Depends on γ\gammaNo

Important subtlety: convexity in predictions vs. parameters. Cross-entropy is convex in the logits zz (input to softmax), but the overall loss L(θ)\mathcal{L}(\theta) is non-convex in the parameters θ\theta because fθf_\theta is a non-convex function of θ\theta.

11.3 Hypothesis Classes and Function Spaces

Definition. A hypothesis class H\mathcal{H} is the set of functions that a learning algorithm can select from:

H={fθ:θΘ}\mathcal{H} = \{f_\theta : \theta \in \Theta\}

The choice of architecture determines H\mathcal{H}:

ArchitectureHypothesis ClassProperties
Linear regression{xwTx+b:wRd,bR}\{x \mapsto w^T x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}Finite-dim, convex, small
Polynomial degree $k$$\{x \mapsto \sum_{|\alpha| \leq k} a_\alpha x^\alpha\}$Finite-dim, convex, grows quickly with $k$
1-hidden-layer MLP{xαiσ(wiTx+bi)}\{x \mapsto \sum \alpha_i \sigma(w_i^T x + b_i)\}Finite-dim, non-convex
Deep networkVery complex function classFinite-dim, highly non-convex
Kernel methods{xαiK(x,xi)}\{x \mapsto \sum \alpha_i K(x, x_i)\}Can be infinite-dim, convex
TransformerSee belowFinite-dim, non-convex

The hypothesis class of a transformer. A transformer with LL layers, HH heads, hidden dimension dd, and vocabulary VV defines:

Htransformer{f:VΔV}\mathcal{H}_{\text{transformer}} \subset \{f: V^* \to \Delta^{|V|}\}

This is a subset of all functions from variable-length token sequences to distributions over the vocabulary. The subset is determined by the architecture and parameter count.

Bias-variance tradeoff through function space lens:

  • Small H\mathcal{H} (few parameters): High bias (can't fit complex functions), low variance (stable across datasets)
  • Large H\mathcal{H} (many parameters): Low bias (can fit anything), but potentially high variance
  • Modern deep learning paradox: Over-parameterised networks (θn|\theta| \gg n) have low bias AND low variance due to implicit regularisation by SGD

11.4 Attention as a Function

Self-attention defines a function Attn:Rn×dRn×d\text{Attn}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d} where nn is sequence length and dd is hidden dimension:

Attn(X)=softmax(XWQ(XWK)Tdk)XWV\text{Attn}(X) = \text{softmax}\left(\frac{XW_Q (XW_K)^T}{\sqrt{d_k}}\right) XW_V

Decomposing the function:

  1. Linear projections: Q=XWQQ = XW_Q, K=XWKK = XW_K, V=XWVV = XW_V (affine functions)
  2. Attention scores: S=QKT/dkS = QK^T / \sqrt{d_k} (bilinear function of QQ and KK)
  3. Softmax: A=softmax(S)A = \text{softmax}(S) (nonlinear, applied row-wise)
  4. Weighted sum: O=AVO = AV (bilinear function of AA and VV)
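These four steps fit in a few lines of NumPy. A minimal single-head sketch - no masking, no multi-head reshaping, and the weights are illustrative random matrices:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # 1. linear projections
    S = Q @ K.T / np.sqrt(Q.shape[-1])         # 2. scaled bilinear scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)         # 3. row-wise softmax
    return A @ V                               # 4. weighted sum of values

n, d = 5, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 16): same shape as the input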

Properties of attention as a function:

PropertyAttention
Linear?No (softmax makes it nonlinear)
Injective?Generally yes (for distinct inputs)
Equivariant?Yes - permuting input rows permutes output rows
Lipschitz?Yes, but constant depends on sequence length
Differentiable?Yes (CC^\infty)
Input sizeVariable (nn can change)

Attention is not a fixed function. Unlike a typical neural network layer with fixed computational graph, attention's computation depends on the input through the attention pattern AA. This makes it a data-dependent function: the effective "weight matrix" AWVA \cdot W_V changes with every input.

11.5 Tokenisation and Embedding as Functions

Tokenisation is a function from strings to token sequences:

Tokenise:ΣV\text{Tokenise}: \Sigma^* \to V^*

where Σ\Sigma is the character alphabet and VV is the token vocabulary. For BPE/SentencePiece tokenisation, this function is:

  • Deterministic (same input always gives same output)
  • Not injective in general (some tokenisers normalise text, losing information)
  • Not surjective (not all token sequences correspond to valid text)

Embedding is a learned function from discrete tokens to continuous vectors:

Embed:VRd,Embed(t)=E[t,:]\text{Embed}: V \to \mathbb{R}^d, \quad \text{Embed}(t) = E[t, :]

where ERV×dE \in \mathbb{R}^{|V| \times d} is the embedding matrix. This is:

  • Injective (distinct tokens get distinct vectors, provided the rows of $E$ are distinct - true in practice for trained embeddings)
  • Not continuous (the domain VV is discrete, so continuity doesn't apply)
  • A lookup table, not a "computation" - but mathematically it's a function

Complete pipeline as function composition:

String ---> Tokens ---> Embeddings ---> + Position ---> Transformer ---> Logits ---> Probs
       Tokenise    Embed          PosEnc         f_\theta           LM Head    Softmax
P(next tokencontext)=softmax(Wheadfθ(Embed(Tokenise(text))+PE))P(\text{next token} | \text{context}) = \text{softmax}(W_{\text{head}} \cdot f_\theta(\text{Embed}(\text{Tokenise}(\text{text})) + \text{PE}))

11.6 The Transformer Block as a Function

A single Pre-LN transformer block (as used in GPT-2+, LLaMA, etc.) defines a residual function:

Block(X)=X+FFN(LN2(X+Attn(LN1(X))))\text{Block}(X) = X + \text{FFN}(\text{LN}_2(X + \text{Attn}(\text{LN}_1(X))))

Breaking this down:

StepFunctionType
1. Layer NormLN1(X)\text{LN}_1(X)Nonlinear (normalisation)
2. Self-AttentionAttn()\text{Attn}(\cdot)Nonlinear (softmax), data-dependent
3. Residual AddX+Attn(LN1(X))X + \text{Attn}(\text{LN}_1(X))Linear in XX and in attention output
4. Layer NormLN2()\text{LN}_2(\cdot)Nonlinear
5. FFNW2σ(W1())W_2 \cdot \sigma(W_1 \cdot (\cdot)) or SwiGLU variantNonlinear (activation function)
6. Residual Add()+FFN()(\cdot) + \text{FFN}(\cdot)Linear

The full transformer stacks LL such blocks:

fθ=LNfinalBlockLBlockL1Block1f_\theta = \text{LN}_{\text{final}} \circ \text{Block}_L \circ \text{Block}_{L-1} \circ \cdots \circ \text{Block}_1

Parameter count and function expressiveness. For a transformer with hidden dimension dd, LL layers, HH heads, FFN dimension dff=4dd_{\text{ff}} = 4d, and vocabulary V|V|:

θL(12d2)+Vd12Ld2|\theta| \approx L(12d^2) + |V| \cdot d \approx 12Ld^2

For GPT-3 (175B): L=96L = 96, d=12288d = 12288, giving 12×96×122882174B\approx 12 \times 96 \times 12288^2 \approx 174B parameters. Each parameter is a coordinate in the hypothesis space Θ=R174B\Theta = \mathbb{R}^{174B}.
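A back-of-the-envelope sketch of this count (ignoring biases, layer norms, and positional embeddings; 50257 is GPT-3's BPE vocabulary size, and the helper name is illustrative):

def transformer_params(L, d, vocab):
    attn = 4 * d * d           # W_Q, W_K, W_V, W_O
    ffn = 2 * d * (4 * d)      # up and down projections with d_ff = 4d
    return L * (attn + ffn) + vocab * d

print(transformer_params(96, 12288, 50257) / 1e9)  # ~174.6 billion, matching the estimate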


12. Category Theory Perspective

12.1 Categories and Morphisms

Definition. A category C\mathcal{C} consists of:

  1. A collection of objects Ob(C)\text{Ob}(\mathcal{C})
  2. For each pair of objects A,BA, B, a set of morphisms Hom(A,B)\text{Hom}(A, B) (arrows from AA to BB)
  3. A composition operation: if f:ABf: A \to B and g:BCg: B \to C, then gf:ACg \circ f: A \to C
  4. An identity morphism idA:AA\text{id}_A: A \to A for each object

Subject to:

  • Associativity: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f
  • Identity: fidA=ff \circ \text{id}_A = f and idBf=f\text{id}_B \circ f = f

This should look familiar. The axioms of a category are precisely the axioms of function composition from Section 4!

Key categories:

CategoryObjectsMorphisms
SetSetsFunctions
VectVector spacesLinear maps
TopTopological spacesContinuous maps
MeasMeasurable spacesMeasurable functions
GrpGroupsGroup homomorphisms

For AI. A neural network layer is a morphism in the category of parameterised differentiable functions:

  • Objects: Rn\mathbb{R}^n (Euclidean spaces of various dimensions)
  • Morphisms: differentiable functions fθ:RnRmf_\theta: \mathbb{R}^n \to \mathbb{R}^m
  • Composition: layered network construction

12.2 Functors: Maps Between Categories

Definition. A functor F:CDF: \mathcal{C} \to \mathcal{D} is a "structure-preserving map" between categories:

  1. FF maps objects: AF(A)A \mapsto F(A)
  2. FF maps morphisms: (f:AB)(F(f):F(A)F(B))(f: A \to B) \mapsto (F(f): F(A) \to F(B))
  3. FF preserves composition: F(gf)=F(g)F(f)F(g \circ f) = F(g) \circ F(f)
  4. FF preserves identities: F(idA)=idF(A)F(\text{id}_A) = \text{id}_{F(A)}

Intuition. A functor is a "homomorphism of categories" - it maps the entire structure (objects + arrows + composition) from one category to another.

Examples relevant to AI:

FunctorFromToMaps
ForgetfulVect -> SetVector spacesForget the linear structure
FreeSet -> VectSetsMap to free vector space
ProbabilityMeas -> SetMeasurable spacesMap to set of probability measures
Dual spaceVect -> Vectop^{\text{op}}Vector spacesVVV \mapsto V^* (contravariant)

For AI: JAX transforms as functors. The vmap transform in JAX can be viewed as a functor:

  • It maps functions f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m to batched functions vmap(f):RB×nRB×m\text{vmap}(f): \mathbb{R}^{B \times n} \to \mathbb{R}^{B \times m}
  • It preserves composition: vmap(gf)=vmap(g)vmap(f)\text{vmap}(g \circ f) = \text{vmap}(g) \circ \text{vmap}(f)
  • It preserves identity: vmap(id)=id\text{vmap}(\text{id}) = \text{id}

Similarly, grad is a (contravariant-like) functor from scalar-valued functions to vector-valued functions.
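These functor laws can be checked directly (a minimal sketch in JAX):

import jax
import jax.numpy as jnp

f = lambda x: x ** 2          # R -> R
g = lambda y: y + 1.0         # R -> R

xs = jnp.arange(4.0)
lhs = jax.vmap(lambda x: g(f(x)))(xs)     # vmap(g . f)
rhs = jax.vmap(g)(jax.vmap(f)(xs))        # vmap(g) . vmap(f)
assert jnp.allclose(lhs, rhs)             # composition is preserved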

12.3 Natural Transformations

Definition. A natural transformation η:FG\eta: F \Rightarrow G between functors F,G:CDF, G: \mathcal{C} \to \mathcal{D} assigns to each object AA a morphism ηA:F(A)G(A)\eta_A: F(A) \to G(A) such that the following diagram commutes for every morphism f:ABf: A \to B:

F(A) ---- \eta_A ----> G(A)
  |                      |
 F(f)                  G(f)
  |                      |
  v                      v
F(B) ---- \eta_B ----> G(B)

Commutative: G(f)ηA=ηBF(f)G(f) \circ \eta_A = \eta_B \circ F(f).

Intuition. A natural transformation is a "systematic" way of converting one construction into another, one that works uniformly across all objects and is compatible with all morphisms.

For AI. Natural transformations appear in:

  • Activation functions: The ReLU activation defines a natural transformation from the identity functor to itself (in a suitable category of parameterised functions). It transforms each layer's output in a way that commutes with the layer's parameterisation.
  • Normalisation layers: Layer norm transforms representations in a way that is "natural" - it doesn't depend on the specific basis chosen for the hidden dimension.

12.4 Isomorphisms and Equivalences

Definition. A morphism f:ABf: A \to B is an isomorphism if there exists g:BAg: B \to A with gf=idAg \circ f = \text{id}_A and fg=idBf \circ g = \text{id}_B.

In different categories, isomorphisms have different names:

CategoryIsomorphism CalledMeaning
SetBijectionOne-to-one correspondence
VectLinear isomorphismInvertible linear map
TopHomeomorphismContinuous bijection with continuous inverse
Smooth manifoldsDiffeomorphismSmooth bijection with smooth inverse
GroupsGroup isomorphismStructure-preserving bijection

For AI: Diffeomorphisms in normalising flows. A normalising flow is a sequence of diffeomorphisms f1,f2,,fKf_1, f_2, \ldots, f_K in the category of smooth manifolds. Each fkf_k is:

  • Smooth (differentiable)
  • Bijective (invertible)
  • Has a smooth inverse

The composition f=fKf1f = f_K \circ \cdots \circ f_1 is also a diffeomorphism, and the change-of-variables formula allows exact likelihood computation:

logp(x)=logpz(f1(x))+k=1KlogdetJfk1\log p(x) = \log p_z(f^{-1}(x)) + \sum_{k=1}^{K} \log \left|\det J_{f_k^{-1}}\right|

12.5 The Yoneda Lemma and Representations

Yoneda Lemma (informal). An object is completely determined by its relationships (morphisms) to all other objects. More precisely:

Nat(Hom(A,),F)F(A)\text{Nat}(\text{Hom}(A, -), F) \cong F(A)

Natural transformations from the representable functor Hom(A,)\text{Hom}(A, -) to any functor FF are in bijection with elements of F(A)F(A).

Consequence. Two objects are isomorphic if and only if they have the same "relationship pattern" to all other objects: ABA \cong B if and only if Hom(A,C)Hom(B,C)\text{Hom}(A, C) \cong \text{Hom}(B, C) for all CC.

For AI (conceptual analogy). The Yoneda perspective suggests that a representation (embedding) is good if it captures all the relationships of the represented object:

  • A word embedding vwRdv_w \in \mathbb{R}^d is a "compressed" version of the word's relationships to all other words
  • Two words are "the same" (isomorphic) if they have the same relationship to all contexts
  • This is precisely the distributional hypothesis: "a word is characterised by the company it keeps" (Firth 1957)
  • Word2Vec, GloVe, and contextual embeddings all operationalise this principle: embeddings derive from co-occurrence patterns (relationships to other words)

The Yoneda perspective also connects to representation learning more broadly: a good representation of an object captures all the relevant "morphisms" (transformations, relationships) between it and everything else.


13. Common Mistakes and Misconceptions

13.1 Confusing Function with Formula

Mistake: Treating "f(x)=x2f(x) = x^2" as the function, rather than understanding that the function is the mapping f:RRf: \mathbb{R} \to \mathbb{R} defined by xx2x \mapsto x^2.

Why it matters: The same formula can define different functions depending on domain and codomain:

  • f:RRf: \mathbb{R} \to \mathbb{R}, f(x)=x2f(x) = x^2 - NOT injective
  • g:[0,)Rg: [0, \infty) \to \mathbb{R}, g(x)=x2g(x) = x^2 - injective
  • h:R[0,)h: \mathbb{R} \to [0, \infty), h(x)=x2h(x) = x^2 - surjective
  • k:[0,)[0,)k: [0, \infty) \to [0, \infty), k(x)=x2k(x) = x^2 - bijective

These are four different functions with the same formula.

13.2 Confusing Codomain and Range

Mistake: Assuming the codomain equals the range.

The codomain is part of the function's definition (what it can in principle output). The range is what it actually outputs. For f:RRf: \mathbb{R} \to \mathbb{R}, f(x)=x2f(x) = x^2:

  • Codomain: R\mathbb{R} (all reals)
  • Range: [0,)[0, \infty) (non-negative reals only)
  • The function is NOT surjective because range \neq codomain

In AI: a classifier f:RdR10f: \mathbb{R}^d \to \mathbb{R}^{10} has codomain R10\mathbb{R}^{10}, but after softmax the actual outputs are in Δ10\Delta^{10} (the probability simplex) - a strict subset.

13.3 Assuming Composition is Commutative

Mistake: Assuming fg=gff \circ g = g \circ f.

Composition is almost never commutative. In neural networks:

  • Pre-LN: x+Attn(LN(x))x + \text{Attn}(\text{LN}(x)) - normalise then attend
  • Post-LN: LN(x+Attn(x))\text{LN}(x + \text{Attn}(x)) - attend then normalise
  • These produce different results and have different training dynamics

13.4 Forgetting That Linear Means f(0) = 0

Mistake: Calling f(x)=Wx+bf(x) = Wx + b a "linear function".

It is affine, not linear (unless $b = 0$). A truly linear function satisfies $f(0) = 0$, $f(x + y) = f(x) + f(y)$, and $f(\alpha x) = \alpha f(x)$. The term "linear layer" in deep learning is a misnomer - PyTorch's nn.Linear (with its default bias=True) is actually an affine layer.

13.5 Assuming Differentiable Implies Smooth

Mistake: Thinking once-differentiable (C1C^1) means infinitely differentiable (CC^\infty).

ReLU is not even C1C^1 (not differentiable at x=0x = 0). The function f(x)=xxf(x) = x|x| is C1C^1 but not C2C^2. For optimisation, C1C^1 is sufficient for gradient descent, but C2C^2 is needed for second-order methods.

13.6 Ignoring Domain Restrictions

Mistake: Applying log(x)\log(x) without checking that x>0x > 0, or computing 1/x1/x at x=0x = 0.

In AI, this appears as:

  • log(0) in cross-entropy loss when prediction is exactly 0 - fix: use log(p + eps) or log_softmax
  • Division by zero in normalisation when variance is 0 - fix: use LayerNorm(x)=xμσ+ε\text{LayerNorm}(x) = \frac{x - \mu}{\sigma + \varepsilon}
  • sqrt(0) gradient is infinite - fix: use sqrt(x + eps)
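A minimal sketch of the standard fix for the first failure mode: compute log-probabilities directly with the max-subtraction trick rather than taking the log of a softmax output:

import numpy as np

def log_softmax(z):
    z = z - z.max()                         # shift for numerical stability
    return z - np.log(np.sum(np.exp(z)))    # never takes log of an exact 0

z = np.array([1000.0, 0.0, -1000.0])        # naive softmax would overflow/underflow
print(log_softmax(z))                       # finite values, no nan/inf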

13.7 Confusing Bijectivity with Invertibility

Mistake: Trying to "invert" a non-bijective function.

The pseudo-inverse $A^+$ is NOT a true inverse: it satisfies $AA^+A = A$ and $A^+AA^+ = A^+$, but in general $AA^+ \neq I$ and $A^+A \neq I$. In AI:

  • You cannot perfectly reconstruct an input from a lower-dimensional embedding (dimensionality reduction is not injective)
  • An autoencoder with bottleneck dlatent<dinputd_{\text{latent}} < d_{\text{input}} cannot be bijective
  • Pooling layers (max pool, average pool) are not injective - information is lost

14. Exercises and Practice Problems

14.1 Foundational Exercises

Exercise 1 (Injectivity). Prove that the function f:RnRnf: \mathbb{R}^n \to \mathbb{R}^n defined by f(x)=Axf(x) = Ax is injective if and only if det(A)0\det(A) \neq 0.

Hint: Injective means ker(A)={0}\ker(A) = \{0\}. Use the fact that det(A)0    ker(A)={0}\det(A) \neq 0 \iff \ker(A) = \{0\}.

Exercise 2 (Composition). Let f:RmRkf: \mathbb{R}^m \to \mathbb{R}^k and g:RnRmg: \mathbb{R}^n \to \mathbb{R}^m be linear functions represented by matrices AA and BB respectively. Show that fgf \circ g is represented by the matrix ABAB.

Exercise 3 (Surjectivity of softmax). Show that softmax:RnΔn\text{softmax}: \mathbb{R}^n \to \Delta^n is surjective onto the interior of the simplex int(Δn)={pRn:pi>0,pi=1}\text{int}(\Delta^n) = \{p \in \mathbb{R}^n : p_i > 0, \sum p_i = 1\} but NOT surjective onto the boundary.

Hint: For interior points, use zi=log(pi)z_i = \log(p_i). For the boundary, note that softmax outputs are always strictly positive.

Exercise 4 (Inverse of sigmoid). Derive the inverse of the sigmoid function σ(x)=1/(1+ex)\sigma(x) = 1/(1 + e^{-x}).

Solution sketch: y=1/(1+ex)    1+ex=1/y    ex=1/y1=(1y)/y    x=log(y/(1y))y = 1/(1 + e^{-x}) \implies 1 + e^{-x} = 1/y \implies e^{-x} = 1/y - 1 = (1-y)/y \implies x = \log(y/(1-y)). This is the logit function: logit(y)=log(y/(1y))\text{logit}(y) = \log(y/(1-y)).

14.2 Intermediate Exercises

Exercise 5 (Lipschitz constant of composition). A network has 3 layers: ReLU activation (Lipschitz constant 1), a weight matrix W1W_1 with σmax(W1)=2\sigma_{\max}(W_1) = 2, and another weight matrix W2W_2 with σmax(W2)=3\sigma_{\max}(W_2) = 3. What is the upper bound on the network's Lipschitz constant?

Answer: L1×2×3=6L \leq 1 \times 2 \times 3 = 6.

Exercise 6 (Residual connection Jacobian). For a residual block f(x)=x+g(x)f(x) = x + g(x) where g(x)=W2ReLU(W1x)g(x) = W_2 \cdot \text{ReLU}(W_1 x):

(a) Write the Jacobian Jf(x)J_f(x) in terms of Jg(x)J_g(x). (b) Show that if Jgop<1\|J_g\|_{\text{op}} < 1, then JfJ_f is invertible. (c) What does this imply about the information flow through the residual block?

Exercise 7 (Convexity). Prove that the cross-entropy loss L(z)=zy+logjezjL(z) = -z_y + \log\sum_{j} e^{z_j} is convex in the logits zz.

Hint: Compute the Hessian and show it is positive semi-definite. The Hessian of log-sum-exp is diag(p)ppT\text{diag}(p) - pp^T where p=softmax(z)p = \text{softmax}(z).

14.3 Advanced Exercises

Exercise 8 (Universal approximation). Construct a ReLU network with one hidden layer that exactly represents the function f(x)=xf(x) = |x| for xRx \in \mathbb{R}.

Solution: x=ReLU(x)+ReLU(x)|x| = \text{ReLU}(x) + \text{ReLU}(-x). This is a network with 2 hidden units, weights w1=1,w2=1w_1 = 1, w_2 = -1, biases b1=b2=0b_1 = b_2 = 0, output weights α1=α2=1\alpha_1 = \alpha_2 = 1.

Exercise 9 (Measure theory). Let XN(0,1)X \sim \mathcal{N}(0, 1) and Y=X2Y = X^2. Find the pushforward measure PYP_Y (the distribution of YY).

Hint: YY follows a chi-squared distribution with 1 degree of freedom. Use the change-of-variables formula for non-injective functions by splitting the domain into (,0)(-\infty, 0) and (0,)(0, \infty).

Exercise 10 (Category theory). Verify that neural network layers with composition form a category: (a) Define the objects and morphisms. (b) Verify associativity of composition. (c) Define the identity morphism and verify the identity laws. (d) Is this category a subcategory of Set? Of Vect? Why or why not?


15. Why This Matters for AI/LLMs

15.1 Everything Is a Function

The central thesis of this chapter: every component of a modern AI system is a function, and the properties of those functions (injectivity, surjectivity, linearity, continuity, differentiability, Lipschitz constants) directly determine the system's capabilities and limitations.

+-----------------------------------------------------------------+
|                    THE AI SYSTEM AS A FUNCTION                  |
|                                                                 |
|  String --> Tokens --> Embeddings --> Hidden --> Logits --> P   |
|        f_1        f_2             f_3        f_4         f_5    |
|                                                                 |
|  f_1: Tokenise  (deterministic, not injective in general)       |
|  f_2: Embed     (injective lookup, discrete -> continuous)      |
|  f_3: Transform (nonlinear, differentiable, high-dimensional)   |
|  f_4: LM Head   (linear projection, dimension reduction)        |
|  f_5: Softmax   (continuous, surjective onto int(Δ), bounded)   |
|                                                                 |
|  Full system: f = f_5 ∘ f_4 ∘ f_3 ∘ f_2 ∘ f_1                  |
|  Training:  find θ* = argmin_θ E[L(f_θ(x), y)]                  |
|  Inference: sample from f_θ*(context)                           |
+-----------------------------------------------------------------+
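
To make the diagram concrete, here is a deliberately tiny sketch of the five stages as literal Python functions; the vocabulary size, dimensions, and the toy byte-level "tokeniser" are illustrative assumptions, not a real model:

```python
import numpy as np

rng = np.random.default_rng(6)
V, d = 10, 4                               # toy vocabulary size and embedding dim
E = rng.standard_normal((V, d))            # embedding table
W3 = rng.standard_normal((d, d))           # stand-in "transformer" weights
W_head = rng.standard_normal((d, V))       # LM head projection

def f1(text):   # tokenise: toy byte-level scheme (deterministic, not injective)
    return [ord(c) % V for c in text]

def f2(tokens): # embed: injective lookup, discrete -> continuous
    return E[tokens]

def f3(h):      # transform: one nonlinear layer standing in for the full stack
    return np.tanh(h @ W3)

def f4(h):      # LM head: project the last position to logits
    return h[-1] @ W_head

def f5(z):      # softmax: R^V -> int(Delta)
    e = np.exp(z - z.max())
    return e / e.sum()

def model(text):   # the full system is literally f5 o f4 o f3 o f2 o f1
    return f5(f4(f3(f2(f1(text)))))

p = model("hello")
print(p.shape, p.sum(), (p > 0).all())   # (10,) ~1.0 True
```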

15.2 The Function Properties That Matter Most

| Property | Why It Matters | Where It Appears |
| --- | --- | --- |
| Differentiability | Enables gradient-based training | Everywhere (backpropagation) |
| Lipschitz continuity | Controls sensitivity and robustness | All layers (adversarial robustness) |
| Injectivity | Preserves information | Invertible networks, normalising flows |
| Composition | Builds complex from simple | Deep networks = composition of layers |
| Linearity | Efficient, well-understood | Attention projections, FFN |
| Convexity | Guarantees global optimum | Loss (in logits), regularisation |
| Bilinearity | Captures interactions | Attention scores, dot products |
| Measurability | Enables probabilistic reasoning | Random variables, generative models |

15.3 Scale and the Function Perspective

The remarkable discovery of the scaling-laws era (2020-present): for the parameterised function class $\mathcal{H}_\theta$ (indexed by hyperparameters such as depth and width), the achievable loss follows predictable relationships in parameter count $N$, dataset size $D$, and compute budget $C$:

$$\mathcal{L}(C) \approx \left(\frac{C_0}{C}\right)^{\alpha_C}, \quad \mathcal{L}(D) \approx \left(\frac{D_0}{D}\right)^{\alpha_D}, \quad \mathcal{L}(N) \approx \left(\frac{N_0}{N}\right)^{\alpha_N}$$

where $\mathcal{L}$ is the loss, $C$ is compute, $D$ is dataset size, and $N$ is parameter count. These are power laws: the loss falls as a power of each scale variable, so each curve is a straight line in log-log coordinates (see the fitting sketch after the list below).

From a function theory perspective:

  • More parameters -> larger hypothesis class -> can approximate more complex functions
  • More data -> better estimation of the target function -> lower approximation error
  • More compute -> more SGD steps -> better search through the hypothesis class
  • The power-law form suggests deep structural regularities in the relationship between function class capacity and approximation quality
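
The straight-line-in-log-log structure is easy to recover numerically. The sketch below fits a power law to synthetic losses; the constants loosely echo the $\alpha_N \approx 0.076$ reported by Kaplan et al. (reference 13) but are otherwise made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic losses obeying L(N) = (N0 / N)^alpha with small noise; the constants
# are illustrative, not measurements from any real model family.
N = np.logspace(6, 10, 9)                 # parameter counts from 1e6 to 1e10
alpha_true, N0 = 0.076, 8.8e13
loss = (N0 / N) ** alpha_true * np.exp(0.01 * rng.standard_normal(N.size))

# A power law is a straight line in log-log coordinates:
# log L = -alpha * log N + alpha * log N0, so a degree-1 fit recovers alpha.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"recovered alpha ~= {-slope:.3f} (true {alpha_true})")
```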

16. Conceptual Bridge

16.1 Looking Ahead

The mathematical tools from this chapter appear throughout the rest of this curriculum:

| Topic | Function Concepts Used |
| --- | --- |
| Linear Algebra | Linear maps, matrix representations of functions |
| Calculus | Derivatives (local linear approximations of functions) |
| Optimisation | Minimising loss functionals over parameterised function spaces |
| Probability | Random variables as measurable functions, distributions as pushforwards |
| Information Theory | Entropy and mutual information as functionals on distributions |
| Backpropagation | Chain rule for compositions of differentiable functions |

16.2 The Meta-Insight

Mathematics is, in a deep sense, the study of functions and their properties. Linear algebra studies linear functions. Calculus studies differentiable functions. Topology studies continuous functions. Measure theory studies measurable functions. Category theory studies all structure-preserving functions (morphisms) at once.

The bridge forward is this: later chapters will often look like they are about matrices, gradients, probability distributions, or optimisation algorithms, but underneath they are still about maps between spaces, compositions of transformations, and constraints on what those transformations can preserve. Understanding functions - truly understanding them - is understanding the language in which all of AI is written.


References

  1. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems, 2(4), 303-314.
  2. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." Neural Networks, 4(2), 251-257.
  3. Hendrycks, D. & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415.
  4. Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202.
  5. Miyato, T. et al. (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR 2018.
  6. Telgarsky, M. (2016). "Benefits of depth in neural networks." COLT 2016.
  7. He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016.
  8. Kingma, D.P. & Dhariwal, P. (2018). "Glow: Generative Flow with Invertible 1×1 Convolutions." NeurIPS 2018.
  9. Jang, E. et al. (2017). "Categorical Reparameterization with Gumbel-Softmax." ICLR 2017.
  10. Lu, Z. et al. (2017). "The Expressive Power of Neural Networks: A View from the Width." NeurIPS 2017.
  11. Penrose, R. (1955). "A generalized inverse for matrices." Mathematical Proceedings of the Cambridge Philosophical Society, 51(3), 406-413.
  12. Mac Lane, S. (1998). Categories for the Working Mathematician. Springer.
  13. Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
  14. Misra, D. (2019). "Mish: A Self Regularized Non-Monotonic Activation Function." arXiv:1908.08681.
  15. Bai, S. et al. (2019). "Deep Equilibrium Models." NeurIPS 2019.

Next: 04-Summation-and-Product-Notation