Theory NotebookMath for LLMs

Functions and Mappings

Mathematical Foundations / Functions and Mappings

Run notebook
Theory Notebook

Theory Notebook

Converted from theory.ipynb for web reading.

Functions and Mappings — From Mathematical Foundations to LLM Internals

A function is the most fundamental verb in mathematics: it takes an input and produces an output. Every layer of a neural network, every loss computation, every gradient step — all are functions. Understanding functions deeply is understanding neural networks deeply.

This notebook is the interactive companion to notes.md. It covers all 15 sections with runnable Python, NumPy, and PyTorch code — from set-theoretic definitions through category theory, with AI applications throughout.

SectionTopicWhat You'll Build
1IntuitionPipeline visualiser, function-as-machine demos
2Formal DefinitionsDomain/codomain/range checker, preimage calculator
3Types of FunctionsInjectivity/surjectivity tester, activation function zoo
4Function CompositionComposition engine, neural network as composition
5Inverse FunctionsInverse finder, pseudo-inverse via SVD
6Special ClassesLinear, affine, multilinear, convex, Lipschitz demos
7Higher-Order FunctionsMap/filter/reduce, functionals, currying
8Continuity and Limitsε-δ visualiser, continuity checker
9Differentiable FunctionsDerivative engine, chain rule, Taylor expansion
10Measure-Theoretic ViewMeasurable functions, Lebesgue integration, PDFs
11Functions in MLUniversal approximation, attention as function, loss functions
12Category TheoryCategories, functors, natural transformations
13Common MistakesInteractive mistake detector
14ExercisesHands-on problems with solutions
15Why This Matters for AIComprehensive impact summary

Prerequisites: Python, NumPy. PyTorch optional but recommended.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import matplotlib.pyplot as plt
from typing import Callable, Set, Tuple, List, Dict, Any, Optional
from itertools import product as cartesian_product
import warnings

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True)
plt.rcParams.update({'figure.figsize': (8, 4), 'font.size': 11})

# Optional: PyTorch for neural network demos
try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    torch.manual_seed(42)
    HAS_TORCH = True
    print(f'NumPy {np.__version__} | PyTorch {torch.__version__}')
    if torch.cuda.is_available():
        print(f'GPU: {torch.cuda.get_device_name()}')
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print('Device: Apple Silicon (MPS)')
    else:
        print('Device: CPU')
except ImportError:
    HAS_TORCH = False
    print(f'NumPy {np.__version__} | PyTorch not installed (NumPy demos still work)')

1. Intuition

1.1 What Is a Function?

A function is a rule that assigns to each input exactly one output — no ambiguity, no missing outputs.

It is the most fundamental concept in all of mathematics: nearly every mathematical object is either a set, a function, or both. The word "mapping" emphasises the geometric or structural idea — a function maps points from one space to another.

Formal definition (set-theoretic): A function f:ABf: A \to B is a subset of A×BA \times B (the Cartesian product) where every element of AA appears in exactly one ordered pair:

fA×Bsuch thataA,  !bB:(a,b)ff \subseteq A \times B \quad \text{such that} \quad \forall a \in A,\; \exists! \, b \in B : (a, b) \in f

The machine metaphor: Put something in, get something out. Same input always gives same output. This determinism is what separates functions from "relations" (which can map one input to many outputs).

For AI: Every layer in a neural network is a function. The entire model is a function composition. Training is optimisation over a space of functions. Understanding functions deeply is understanding neural networks deeply.

Code cell 5

# ══════════════════════════════════════════════════════════════════
# 1.1 Functions as Sets of Ordered Pairs
# ══════════════════════════════════════════════════════════════════

def is_valid_function(pairs: set, domain: set) -> Tuple[bool, str]:
    """
    Check if a set of ordered pairs defines a valid function.
    
    A valid function must satisfy:
      1. Totality:   every element of domain has at least one pair
      2. Uniqueness: no element of domain maps to two different outputs
    """
    mapped_from = {}
    for a, b in pairs:
        if a in mapped_from:
            if mapped_from[a] != b:
                return False, f'UNIQUENESS VIOLATED: {a} maps to both {mapped_from[a]} and {b}'
        mapped_from[a] = b
    
    for a in domain:
        if a not in mapped_from:
            return False, f'TOTALITY VIOLATED: {a} has no output'
    
    return True, 'Valid function'

# ── Example: define functions as sets of pairs ──
domain = {1, 2, 3}

f_valid = {(1, 'a'), (2, 'b'), (3, 'a')}           # valid: each input -> one output
f_not_unique = {(1, 'a'), (2, 'b'), (2, 'c'), (3, 'a')}  # fails uniqueness
f_not_total = {(1, 'a'), (2, 'b')}                  # fails totality (3 missing)

print('Function Validity Checker')
print('=' * 55)
print(f'Domain: {sorted(domain)}\n')

for name, pairs in [('f_valid', f_valid), ('f_not_unique', f_not_unique), ('f_not_total', f_not_total)]:
    valid, reason = is_valid_function(pairs, domain)
    print(f'{name} = {sorted(pairs)}')
    print(f'  -> {reason}\n')

# Key insight: functions are determined by input-output pairs, not representation
f1 = {(1, 'a'), (2, 'b'), (3, 'a')}
f2 = {(3, 'a'), (1, 'a'), (2, 'b')}  # same pairs, different order
print(f'f1 == f2 (same pairs, reordered): {f1 == f2}')
print('  Functions are determined entirely by input-output behaviour (extensionality)')

1.2 Why Functions Are Central to AI

Every component of a modern LLM is a function. Here is the complete inventory:

ComponentFunction SignatureWhat It Does
TokeniserT:ΣVT: \Sigma^* \to V^*Strings to token sequences
EmbeddingE:VRdE: V \to \mathbb{R}^dTokens to continuous vectors
AttentionAttn:Rn×dRn×d\text{Attn}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}Context mixing via weighted sums
FFN layerFFN:RdRd\text{FFN}: \mathbb{R}^d \to \mathbb{R}^dNonlinear feature transformation
Layer NormLN:RdRd\text{LN}: \mathbb{R}^d \to \mathbb{R}^dNormalise to zero mean, unit variance
LM Head$W: \mathbb{R}^d \to \mathbb{R}^{V
Softmax$\sigma: \mathbb{R}^{V
LossL:ΘR\mathcal{L}: \Theta \to \mathbb{R}Parameters to scalar loss value
Activationσ:RR\sigma: \mathbb{R} \to \mathbb{R}Elementwise nonlinearity

Everything that happens inside an LLM is a composition of functions.

Code cell 7

# ══════════════════════════════════════════════════════════════════
# 1.2 Every AI Component Is a Function — Concrete Demonstration
# ══════════════════════════════════════════════════════════════════

# Build a tiny "LLM" where every step is an explicit function

vocab = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, '<pad>': 4}
inv_vocab = {v: k for k, v in vocab.items()}
V = len(vocab)
d_model = 8  # embedding dimension

# ── Function 1: Tokeniser T: str -> list[int] ──
def tokeniser(text: str) -> List[int]:
    """T: Sigma* -> V*  maps strings to token ID sequences"""
    return [vocab.get(w, vocab['<pad>']) for w in text.lower().split()]

# ── Function 2: Embedding E: list[int] -> matrix ──
np.random.seed(42)
embedding_matrix = np.random.randn(V, d_model) * 0.1

def embed(token_ids: List[int]) -> np.ndarray:
    """E: V^n -> R^{n x d}  maps token IDs to embedding vectors"""
    return embedding_matrix[token_ids]

# ── Function 3: Simple self-attention ──
W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

def attention(X: np.ndarray) -> np.ndarray:
    """Attn: R^{n x d} -> R^{n x d}  self-attention function"""
    Q, K, V_mat = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ V_mat

# ── Function 4: Feed-forward network ──
W1 = np.random.randn(d_model, d_model * 4) * 0.1
b1 = np.zeros(d_model * 4)
W2 = np.random.randn(d_model * 4, d_model) * 0.1
b2 = np.zeros(d_model)

def ffn(X: np.ndarray) -> np.ndarray:
    """FFN: R^{n x d} -> R^{n x d}  two-layer feed-forward with ReLU"""
    hidden = np.maximum(0, X @ W1 + b1)  # ReLU activation
    return hidden @ W2 + b2

# ── Function 5: LM head ──
W_head = np.random.randn(d_model, V) * 0.1

def lm_head(X: np.ndarray) -> np.ndarray:
    """W: R^{n x d} -> R^{n x |V|}  project to vocabulary logits"""
    return X @ W_head

# ── Function 6: Softmax ──
def softmax_fn(logits: np.ndarray) -> np.ndarray:
    """sigma: R^{n x |V|} -> Delta^{|V|-1}  logits to probabilities"""
    exp_l = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp_l / exp_l.sum(axis=-1, keepdims=True)

# ── Compose them all: the full "LLM" is a function composition ──
text = "the cat sat"
print(f'Input text: "{text}"')
print('=' * 60)

tokens = tokeniser(text)
print(f'1. Tokeniser    T(text)        -> {tokens} (token IDs)')

X = embed(tokens)
print(f'2. Embedding    E(tokens)      -> shape {X.shape}')

X_attn = attention(X)
X = X + X_attn  # residual connection
print(f'3. Attention    Attn(X) + X    -> shape {X.shape}')

X_ffn = ffn(X)
X = X + X_ffn   # residual connection
print(f'4. FFN          FFN(X) + X     -> shape {X.shape}')

logits = lm_head(X)
print(f'5. LM Head      W(X)           -> shape {logits.shape}')

probs = softmax_fn(logits)
print(f'6. Softmax      sigma(logits)  -> shape {probs.shape}')

# Show final prediction for last token
last_probs = probs[-1]
pred_token = inv_vocab[np.argmax(last_probs)]
print(f'\nPredicted next token: "{pred_token}" (prob={last_probs.max():.4f})')
print(f'Full distribution: {dict(zip(vocab.keys(), np.round(last_probs, 4)))}')
print(f'\nThe entire pipeline: softmax . W_head . (FFN + id) . (Attn + id) . E . T')
print(f'This IS the model. A composition of functions. Nothing more.')

1.3 Functions as Transformations

Geometric view: A function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m transforms space — points move, shapes deform.

  • Linear functions preserve structure: lines map to lines, the origin maps to the origin, parallelism is preserved.
  • Nonlinear functions bend, fold, and warp space — enabling complex decision boundaries that linear functions cannot achieve.
  • Invertible functions (bijections) preserve information: you can always recover the input from the output.
  • Non-invertible functions (many-to-one) destroy information: compression happens, and recovery is impossible.
  • Deep networks compose many simple functions. The global transformation is highly nonlinear even if each individual layer is relatively simple.

Code cell 9

# ══════════════════════════════════════════════════════════════════
# 1.3 Functions as Transformations — Visualising How Functions
#     Warp Space (Linear vs Nonlinear)
# ══════════════════════════════════════════════════════════════════

# Create a grid of points in 2D
t = np.linspace(-2, 2, 15)
grid_x, grid_y = np.meshgrid(t, t)
points = np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)  # (225, 2)

# Define transformations
def linear_transform(pts):
    """Linear: rotation + scaling. Preserves lines and origin."""
    A = np.array([[1.5, -0.5],
                  [0.5,  1.2]])
    return pts @ A.T

def nonlinear_transform(pts):
    """Nonlinear: bends space. Lines become curves."""
    x, y = pts[:, 0], pts[:, 1]
    return np.stack([x + 0.3 * y**2, y + 0.3 * np.sin(2*x)], axis=1)

def relu_2d(pts):
    """ReLU applied componentwise: clips negative values to 0."""
    return np.maximum(0, pts)

# Plot the transformations
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
transforms = [
    ('Original Grid', lambda p: p),
    ('Linear (Ax)', linear_transform),
    ('Nonlinear (bend)', nonlinear_transform),
    ('ReLU (clip)', relu_2d),
]

for ax, (title, fn) in zip(axes, transforms):
    transformed = fn(points)
    ax.scatter(transformed[:, 0], transformed[:, 1], c=grid_x.ravel(),
               cmap='coolwarm', s=8, alpha=0.7)
    ax.set_title(title, fontsize=11, fontweight='bold')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.axhline(0, color='k', linewidth=0.5)
    ax.axvline(0, color='k', linewidth=0.5)

plt.suptitle('How Different Functions Transform Space', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Linear transform: lines stay lines, origin stays at origin')
print('Nonlinear transform: lines become curves, space is warped')
print('ReLU: all negative values collapsed to 0 (information lost in that region)')

1.4 The Language of Functions in Mathematics

Functions appear under many different names — each emphasising a different aspect:

NameEmphasisExample
FunctionGeneral input-output rulef(x)=x2f(x) = x^2
Map / MappingGeometric or structuralf:RnRmf: \mathbb{R}^n \to \mathbb{R}^m
OperatorActs on functions, not numbersD(f)=fD(f) = f' (differentiation)
FunctionalInput is a function, output is a scalar01f(x)dx\int_0^1 f(x)\,dx
MorphismPreserves algebraic structureGroup homomorphism
TransformationGeometric change of coordinatesRotation, scaling
KernelSimilarity or weighting functionK(x,y)=exy2K(x, y) = e^{-\|x-y\|^2}
DistributionProbability assignmentP:V[0,1]P: V \to [0,1]

Understanding that all of these are instances of the same concept unifies vast areas of mathematics.

1.5 Historical Timeline

YearMathematicianContribution
~300 BCEuclidGeometric constructions implicitly use functions; proportionality
1694LeibnizFirst use of word "function" in a mathematical context
1748EulerFunction as analytic expression; introduced f(x)f(x) notation
1837DirichletModern definition — function as arbitrary rule, not necessarily a formula
1870sCantorFunction as set of ordered pairs; foundation in set theory
1879FregeFunctions in formal logic; predicate as function to truth values
1888DedekindMorphism concept; structure-preserving maps between algebraic systems
1900sHilbertOperator theory; functions on infinite-dimensional spaces
1920sBanachFunctional analysis; functions on function spaces; normed spaces
1936ChurchLambda calculus; functions as fundamental computational objects
1945Eilenberg & Mac LaneCategory theory; morphisms as central concept; functions as arrows
1986Rumelhart et al.Backpropagation popularised; neural network as differentiable function composition
2017Vaswani et al.Transformer architecture; attention as a specific function class

From Euclid's geometric proportions to the Transformer's attention mechanism — the concept of function has been the thread running through all of mathematics.

1.6 Pipeline Position in AI

Every stage of an LLM pipeline is a specific function type:

Input Data (text, tokens, images)
    | [Tokeniser function T: Sigma* -> V*]
Token Sequences
    | [Embedding function E: V^n -> R^{n x d}]
Embedding Matrices
    | [Transformer layers f_1 . f_2 . ... . f_L: R^{n x d} -> R^{n x d}]
Contextual Representations
    | [LM head W: R^d -> R^{|V|}]
Logits
    | [Softmax sigma: R^{|V|} -> Delta^{|V|-1}]
Probability Distribution
    | [Sampling or argmax]
Output Token
^^^^^^^^^^^^^^^^^^^^^^^
Functions everywhere

The rest of this notebook develops each mathematical concept that makes this pipeline work.


2. Formal Definitions

2.1 Function as Set of Ordered Pairs

A function ff from set AA to set BB is a set of ordered pairs fA×Bf \subseteq A \times B satisfying two axioms:

  1. Totality: aA  bB:(a,b)f\forall a \in A\; \exists b \in B: (a, b) \in f — every input has an output
  2. Uniqueness: aA  b1,b2B:(a,b1)f(a,b2)f    b1=b2\forall a \in A\; \forall b_1, b_2 \in B: (a, b_1) \in f \land (a, b_2) \in f \implies b_1 = b_2 — each input has at most one output

Notation: f:ABf: A \to B; f(a)=bf(a) = b means (a,b)f(a, b) \in f.

Every element of AA has exactly one associated element of BB. No element is left out (totality). No element has two images (uniqueness).

2.2 Domain, Codomain, Range

  • Domain: dom(f)=A\text{dom}(f) = A — the set of all valid inputs
  • Codomain: cod(f)=B\text{cod}(f) = B — the set within which outputs must fall; declared in f:ABf: A \to B
  • Range (Image): ran(f)=im(f)={f(a)aA}B\text{ran}(f) = \text{im}(f) = \{f(a) \mid a \in A\} \subseteq B — the set of values actually achieved

Critical distinction: Codomain is declared; range is computed. The range \subseteq codomain but may be a proper subset.

Example: f:RRf: \mathbb{R} \to \mathbb{R} defined by f(x)=x2f(x) = x^2. Codomain = R\mathbb{R}, but range = [0,)R[0, \infty) \subsetneq \mathbb{R}.

AI Example: Softmax f:RnRnf: \mathbb{R}^n \to \mathbb{R}^n has codomain Rn\mathbb{R}^n but range is the probability simplex Δn1Rn\Delta^{n-1} \subsetneq \mathbb{R}^n.

Code cell 14

# ══════════════════════════════════════════════════════════════════
# 2.1–2.2 Domain, Codomain, Range — Concrete Analysis
# ══════════════════════════════════════════════════════════════════

class FunctionAnalyser:
    """Analyse domain, codomain, and range of finite functions."""
    
    def __init__(self, name: str, pairs: dict, codomain: set):
        self.name = name
        self.pairs = pairs        # {input: output}
        self.domain = set(pairs.keys())
        self.codomain = codomain
        self.range = set(pairs.values())
    
    def report(self):
        print(f'Function: {self.name}')
        print(f'  Domain   = {sorted(self.domain)}')
        print(f'  Codomain = {sorted(self.codomain)}')
        print(f'  Range    = {sorted(self.range)}')
        print(f'  Range == Codomain? {self.range == self.codomain}', end='')
        if self.range != self.codomain:
            missed = self.codomain - self.range
            print(f'  (missed: {sorted(missed)})')
        else:
            print()
        print()

# Example 1: f: {1,2,3} -> {a,b,c} with f = {1->a, 2->b, 3->a}
f1 = FunctionAnalyser('f1: {1,2,3} -> {a,b,c}', 
                       {1: 'a', 2: 'b', 3: 'a'}, {'a', 'b', 'c'})

# Example 2: g: {1,2,3} -> {a,b,c} with g = {1->a, 2->b, 3->c}
f2 = FunctionAnalyser('g: {1,2,3} -> {a,b,c}',
                       {1: 'a', 2: 'b', 3: 'c'}, {'a', 'b', 'c'})

print('Domain / Codomain / Range Analysis')
print('=' * 55)
f1.report()
f2.report()

# ── Continuous example: x^2 ──
print('Continuous example: f(x) = x^2, declared as f: R -> R')
x_vals = np.linspace(-5, 5, 1000)
y_vals = x_vals ** 2
print(f'  Codomain: all of R (declared)')
print(f'  Range:    [{y_vals.min():.1f}, {y_vals.max():.1f}] (computed from samples)')
print(f'  True range: [0, inf) — proper subset of R')
print(f'  Negative reals are in codomain but NOT in range')
print(f'  -> f is NOT surjective as f: R -> R')

# ── AI example: softmax range ──
print('\nAI example: softmax output')
logits = np.array([2.0, 1.0, 0.5])
exp_l = np.exp(logits - logits.max())
probs = exp_l / exp_l.sum()
print(f'  Logits: {logits}')
print(f'  Probs:  {np.round(probs, 4)} (sum = {probs.sum():.4f})')
print(f'  Codomain: R^3')
print(f'  Range: probability simplex Delta^2 = {{p in R^3 : p_i >= 0, sum(p) = 1}}')
print(f'  The vector {[0.5, 0.5, -0.1]} is in codomain but NOT in range (negative component)')

2.3 Image and Preimage

Image of a set: For SAS \subseteq A, the image f(S)={f(a)aS}f(S) = \{f(a) \mid a \in S\} — the set of outputs from inputs in SS.

Preimage (inverse image): For TBT \subseteq B, the preimage f1(T)={aAf(a)T}f^{-1}(T) = \{a \in A \mid f(a) \in T\} — the set of inputs mapping into TT.

Important: f1(T)f^{-1}(T) (preimage of a set) always exists — it does NOT require ff to be invertible. This is one of the most common confusions.

Properties of preimage (exact preservation of set operations):

f1(T1T2)=f1(T1)f1(T2)f^{-1}(T_1 \cup T_2) = f^{-1}(T_1) \cup f^{-1}(T_2) f1(T1T2)=f1(T1)f1(T2)f^{-1}(T_1 \cap T_2) = f^{-1}(T_1) \cap f^{-1}(T_2) f1(BT)=Af1(T)f^{-1}(B \setminus T) = A \setminus f^{-1}(T)

Properties of image (less exact — intersection is only \subseteq):

f(S1S2)=f(S1)f(S2)f(S_1 \cup S_2) = f(S_1) \cup f(S_2) f(S1S2)f(S1)f(S2)(not equality in general!)f(S_1 \cap S_2) \subseteq f(S_1) \cap f(S_2) \quad \text{(not equality in general!)}

Code cell 16

# ══════════════════════════════════════════════════════════════════
# 2.3 Image and Preimage — Computation and Verification
# ══════════════════════════════════════════════════════════════════

def image(f: dict, S: set) -> set:
    """Compute f(S) = {f(a) | a in S}."""
    return {f[a] for a in S if a in f}

def preimage(f: dict, T: set) -> set:
    """Compute f^{-1}(T) = {a in domain | f(a) in T}."""
    return {a for a, b in f.items() if b in T}

# Define a non-injective function: f(x) = x^2 mod 7 on {0,1,...,6}
f = {x: (x**2) % 7 for x in range(7)}
print('Function: f(x) = x^2 mod 7 on {0, 1, ..., 6}')
print(f'  f = {f}')
print()

# Image of subsets
S1 = {0, 1, 2}
S2 = {2, 3, 4}
print(f'S1 = {S1}, S2 = {S2}')
print(f'  f(S1) = {image(f, S1)}')
print(f'  f(S2) = {image(f, S2)}')
print(f'  f(S1 union S2) = {image(f, S1 | S2)}')
print(f'  f(S1) union f(S2) = {image(f, S1) | image(f, S2)}')
print(f'  Equal? {image(f, S1 | S2) == image(f, S1) | image(f, S2)}  (image preserves union)')
print()

# Intersection: image does NOT preserve it exactly
print(f'  f(S1 inter S2) = {image(f, S1 & S2)}')
print(f'  f(S1) inter f(S2) = {image(f, S1) & image(f, S2)}')
print(f'  f(S1 inter S2) subset f(S1) inter f(S2)? '
      f'{image(f, S1 & S2).issubset(image(f, S1) & image(f, S2))}')
print(f'  Equal? {image(f, S1 & S2) == image(f, S1) & image(f, S2)}'
      f'  (image does NOT preserve intersection in general)')
print()

# Preimage
T1 = {1, 4}
T2 = {0, 2}
B = set(f.values())

print(f'T1 = {T1}, T2 = {T2}')
print(f'  f^{{-1}}(T1) = {preimage(f, T1)}')
print(f'  f^{{-1}}(T2) = {preimage(f, T2)}')
print(f'  f^{{-1}}(T1 union T2) = {preimage(f, T1 | T2)}')
print(f'  f^{{-1}}(T1) union f^{{-1}}(T2) = {preimage(f, T1) | preimage(f, T2)}')
print(f'  Equal? {preimage(f, T1 | T2) == preimage(f, T1) | preimage(f, T2)}  '
      f'(preimage preserves union exactly)')
print()
print(f'  f^{{-1}}(T1 inter T2) = {preimage(f, T1 & T2)}')
print(f'  f^{{-1}}(T1) inter f^{{-1}}(T2) = {preimage(f, T1) & preimage(f, T2)}')
print(f'  Equal? {preimage(f, T1 & T2) == preimage(f, T1) & preimage(f, T2)}  '
      f'(preimage preserves intersection exactly)')

# Complement (De Morgan)
codomain_full = set(range(7))  # using same range as domain for simplicity
complement_T1 = codomain_full - T1
print(f'\n  f^{{-1}}(complement of T1) = {preimage(f, complement_T1)}')
print(f'  domain \\ f^{{-1}}(T1)      = {set(range(7)) - preimage(f, T1)}')
print(f'  Equal? {preimage(f, complement_T1) == set(range(7)) - preimage(f, T1)}  '
      f'(preimage preserves complement)')

2.4 Equality of Functions

Two functions f,g:ABf, g: A \to B are equal if and only if f(a)=g(a)f(a) = g(a) for all aAa \in A.

They must have the same domain, the same codomain, AND the same values everywhere.

Extensionality: Functions are completely determined by their input-output behaviour. Internal representation is irrelevant. Two neural networks with entirely different architectures but the same input-output map are equal as functions.

2.5 Partial Functions

A partial function f:ABf: A \rightharpoonup B is defined only on some subset dom(f)A\text{dom}(f) \subseteq A.

  • f(a)f(a) may be undefined for some aAa \in A.
  • A total function is a partial function where dom(f)=A\text{dom}(f) = A.

Examples in AI:

  • Division: div:R×RR\text{div}: \mathbb{R} \times \mathbb{R} \rightharpoonup \mathbb{R} — undefined when denominator = 0
  • Tokeniser: may fail on invalid byte sequences
  • LLM with context limit: undefined for inputs longer than context window

Code cell 18

# ══════════════════════════════════════════════════════════════════
# 2.4 Equality of Functions
# 2.5 Partial Functions
# ══════════════════════════════════════════════════════════════════

# ── 2.4 Function equality ──
print('2.4 Function Equality')
print('=' * 55)

# Two "different" definitions that produce the same function
def f_abs(x):
    """Defined using abs()"""
    return abs(x)

def g_abs(x):
    """Defined using x * sign(x)"""
    return x if x >= 0 else -x

# Test equality on a sample of inputs
test_inputs = np.linspace(-5, 5, 1000)
equal = all(f_abs(x) == g_abs(x) for x in test_inputs)
print(f'f(x) = |x| vs g(x) = x*sign(x): equal on 1000 test points? {equal}')
print('  Same function despite different internal definitions (extensionality)')

# Different domains => different functions even if formulas agree
print(f'\nf: R -> R, f(x) = |x|  vs  g: [0,inf) -> R, g(x) = x')
print(f'  f(-3) = {f_abs(-3)}, g(-3) = undefined (outside domain)')
print(f'  These are NOT the same function (different domains)')

# ── 2.5 Partial functions ──
print(f'\n2.5 Partial Functions')
print('=' * 55)

class PartialFunction:
    """A partial function that may be undefined on some inputs."""
    
    def __init__(self, name: str, func: Callable, domain_check: Callable):
        self.name = name
        self.func = func
        self.domain_check = domain_check
    
    def __call__(self, x):
        if not self.domain_check(x):
            return None  # undefined
        return self.func(x)
    
    def try_eval(self, x):
        result = self(x)
        if result is None:
            print(f'  {self.name}({x}) = UNDEFINED')
        else:
            print(f'  {self.name}({x}) = {result:.6f}')

# Division as partial function
div = PartialFunction(
    'div(a, b) = a/b',
    lambda x: x[0] / x[1],
    lambda x: x[1] != 0
)

# Logarithm as partial function
log = PartialFunction(
    'log(x)',
    lambda x: np.log(x),
    lambda x: x > 0
)

# Square root as partial function (on reals)
sqrt = PartialFunction(
    'sqrt(x)',
    lambda x: np.sqrt(x),
    lambda x: x >= 0
)

print('Division:')
div.try_eval((6, 3))
div.try_eval((6, 0))

print('\nLogarithm:')
log.try_eval(np.e)
log.try_eval(0)
log.try_eval(-1)

print('\nSquare root (on reals):')
sqrt.try_eval(4)
sqrt.try_eval(-4)

print('\nAI example: LLM with context limit 4096 tokens')
context_limit = 4096
test_lengths = [100, 2000, 4096, 8000]
for length in test_lengths:
    status = 'defined (within context)' if length <= context_limit else 'UNDEFINED (exceeds context)'
    print(f'  LLM(input of {length} tokens) -> {status}')

3. Types of Functions

3.1 Injective Functions (One-to-One)

A function f:ABf: A \to B is injective (one-to-one) if distinct inputs always map to distinct outputs:

a1,a2A:f(a1)=f(a2)    a1=a2\forall a_1, a_2 \in A: f(a_1) = f(a_2) \implies a_1 = a_2

Equivalently (contrapositive): a1a2    f(a1)f(a2)a_1 \neq a_2 \implies f(a_1) \neq f(a_2).

Geometric intuition: No two points in the domain map to the same point in the codomain. In an arrow diagram, no two arrows hit the same target.

Consequence: Injectivity means information is preserved — you can always recover the input from the output (at least on the range).

Test for injectivity: Assume f(a1)=f(a2)f(a_1) = f(a_2) and show a1=a2a_1 = a_2.

3.2 Surjective Functions (Onto)

A function f:ABf: A \to B is surjective (onto) if every element of BB is the image of some element of AA:

bB:aA:f(a)=b\forall b \in B: \exists a \in A: f(a) = b

Equivalently: range=codomain\text{range} = \text{codomain}, i.e. im(f)=B\text{im}(f) = B.

Geometric intuition: No element in the codomain is "missed". All of BB is "covered" by the function.

Test for surjectivity: Take arbitrary bBb \in B; find (construct) some aa with f(a)=bf(a) = b.

3.3 Bijective Functions (One-to-One and Onto)

A function f:ABf: A \to B is bijective if it is both injective AND surjective:

  • Every element of BB is the image of exactly one element of AA.
  • No collisions (injective), no gaps (surjective). A perfect matching.

Consequence: Bijections are exactly the functions that have (two-sided) inverses.

A bijection ABA \to B demonstrates A=B|A| = |B| (same cardinality).

3.4 Summary Table

PropertyConditionArrow DiagramInformation
InjectiveNo two inputs → same outputNo two arrows hit same targetPreserved; invertible if restricted
SurjectiveEvery output hit by some inputEvery target hit by ≥1 arrowCodomain fully covered
BijectiveBoth injective and surjectiveExactly one arrow per targetPerfectly preserved; fully invertible
NeitherCollisions AND missesArrows collide and missLost in both directions

Code cell 20

# ══════════════════════════════════════════════════════════════════
# 3.1–3.4 Injectivity, Surjectivity, Bijectivity — Complete Tester
# ══════════════════════════════════════════════════════════════════

def classify_function(f: dict, domain: set, codomain: set) -> dict:
    """
    Classify a finite function as injective, surjective, bijective, or neither.
    
    Parameters:
        f: dict mapping domain elements to codomain elements
        domain: set of valid inputs
        codomain: set of possible outputs (declared)
    
    Returns:
        dict with classification results and evidence
    """
    range_set = set(f.values())
    
    # Injectivity: check if any two inputs map to same output
    injective = True
    collision = None
    seen = {}
    for a, b in f.items():
        if b in seen:
            injective = False
            collision = (seen[b], a, b)  # (a1, a2, common_output)
            break
        seen[b] = a
    
    # Surjectivity: check if every codomain element is achieved
    surjective = range_set == codomain
    missed = codomain - range_set
    
    bijective = injective and surjective
    
    return {
        'injective': injective,
        'surjective': surjective,
        'bijective': bijective,
        'collision': collision,
        'missed': missed,
        'range': range_set,
    }

def print_classification(name: str, f: dict, domain: set, codomain: set):
    """Pretty-print function classification."""
    result = classify_function(f, domain, codomain)
    
    # Determine type label
    if result['bijective']:
        label = 'BIJECTIVE'
    elif result['injective']:
        label = 'INJECTIVE only'
    elif result['surjective']:
        label = 'SURJECTIVE only'
    else:
        label = 'NEITHER'
    
    print(f'{name}: {label}')
    print(f'  Mapping: {f}')
    print(f'  Range:   {sorted(result["range"])}')
    
    if not result['injective'] and result['collision']:
        a1, a2, b = result['collision']
        print(f'  Not injective: f({a1}) = f({a2}) = {b} (collision)')
    if not result['surjective'] and result['missed']:
        print(f'  Not surjective: {sorted(result["missed"])} not achieved')
    print()

# ── Test suite ──
print('Function Classification')
print('=' * 55)

# f1: bijective
f1 = {1: 'a', 2: 'b', 3: 'c'}
print_classification('f1: {1,2,3} -> {a,b,c}', f1, {1,2,3}, {'a','b','c'})

# f2: injective but not surjective
f2 = {1: 'a', 2: 'b'}
print_classification('f2: {1,2} -> {a,b,c}', f2, {1,2}, {'a','b','c'})

# f3: surjective but not injective
f3 = {1: 'a', 2: 'a', 3: 'b'}
print_classification('f3: {1,2,3} -> {a,b}', f3, {1,2,3}, {'a','b'})

# f4: neither
f4 = {1: 'a', 2: 'a', 3: 'b'}
print_classification('f4: {1,2,3} -> {a,b,c}', f4, {1,2,3}, {'a','b','c'})

# ── Continuous examples ──
print('Continuous Function Examples')
print('=' * 55)

x = np.linspace(-5, 5, 10000)

examples = [
    ('f(x) = 2x + 1 on R',  lambda x: 2*x + 1,  'BIJECTIVE (linear, nonzero slope)'),
    ('f(x) = x^2 on R',     lambda x: x**2,      'NOT injective: f(2) = f(-2) = 4'),
    ('f(x) = x^3 on R',     lambda x: x**3,      'BIJECTIVE (strictly increasing)'),
    ('f(x) = e^x on R -> R', lambda x: np.exp(x), 'Injective but NOT surjective (range = (0,inf))'),
    ('f(x) = sin(x) on R',  lambda x: np.sin(x), 'NEITHER (not injective, not surjective onto R)'),
]

for name, fn, classification in examples:
    y = fn(x)
    print(f'{name}')
    print(f'  Range approx: [{y.min():.4f}, {y.max():.4f}]')
    print(f'  -> {classification}')
    print()

3.5 Monotone Functions

Let f:ABf: A \to B where A,BA, B are ordered sets.

TypeConditionProperty
Monotone increasinga1a2    f(a1)f(a2)a_1 \leq a_2 \implies f(a_1) \leq f(a_2)Preserves order
Strictly increasinga1<a2    f(a1)<f(a2)a_1 < a_2 \implies f(a_1) < f(a_2)Strictly preserves order
Monotone decreasinga1a2    f(a1)f(a2)a_1 \leq a_2 \implies f(a_1) \geq f(a_2)Reverses order
Strictly decreasinga1<a2    f(a1)>f(a2)a_1 < a_2 \implies f(a_1) > f(a_2)Strictly reverses order

Key fact: Strictly monotone functions are automatically injective (no two distinct inputs can produce the same output if the function always goes up or always goes down).

AI relevance:

  • Sigmoid is strictly increasing — preserves ranking of inputs
  • ReLU is monotone increasing but NOT strictly (flat for x<0x < 0 → many-to-one)
  • Softmax is monotone in each logit (increasing one logit increases that token's probability)

3.6 Periodic Functions

A function f:RRf: \mathbb{R} \to \mathbb{R} is periodic with period T>0T > 0 iff f(x+T)=f(x)f(x + T) = f(x) for all xx.

The smallest such TT is the fundamental period. Examples: sin(x)\sin(x), cos(x)\cos(x) have period 2π2\pi; tan(x)\tan(x) has period π\pi.

Fourier analysis: Any periodic function can be represented as a sum of sines and cosines.

AI relevance:

  • Sinusoidal positional encodings: PE(pos,2i)=sin ⁣(pos100002i/d)PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right); each dimension periodic with a different period
  • RoPE: periodic rotations encode relative positions
  • Fourier features: periodic basis functions for function approximation in NeRF, kernel methods

Code cell 22

# ══════════════════════════════════════════════════════════════════
# 3.5 Monotone Functions — Analysis
# 3.6 Periodic Functions — Positional Encodings
# ══════════════════════════════════════════════════════════════════

def check_monotonicity(f: Callable, x: np.ndarray) -> dict:
    """Check monotonicity properties of a function on sampled points."""
    y = f(x)
    diffs = np.diff(y)
    return {
        'monotone_increasing': bool(np.all(diffs >= -1e-12)),
        'strictly_increasing': bool(np.all(diffs > 1e-12)),
        'monotone_decreasing': bool(np.all(diffs <= 1e-12)),
        'strictly_decreasing': bool(np.all(diffs < -1e-12)),
    }

x = np.linspace(-5, 5, 10000)

functions = {
    'sigmoid(x)':    lambda x: 1 / (1 + np.exp(-x)),
    'ReLU(x)':       lambda x: np.maximum(0, x),
    'tanh(x)':       lambda x: np.tanh(x),
    'x^2':           lambda x: x**2,
    'sin(x)':        lambda x: np.sin(x),
    '-exp(-x)':      lambda x: -np.exp(-x),
}

print('Monotonicity Analysis')
print('=' * 70)
print(f'{"Function":<14} {"Mon.Inc.":<12} {"Strict Inc.":<12} {"Mon.Dec.":<12} {"Strict Dec.":<12}')
print('-' * 70)

for name, fn in functions.items():
    m = check_monotonicity(fn, x)
    print(f'{name:<14} {str(m["monotone_increasing"]):<12} {str(m["strictly_increasing"]):<12} '
          f'{str(m["monotone_decreasing"]):<12} {str(m["strictly_decreasing"]):<12}')

print()
print('Key observations:')
print('  sigmoid: strictly increasing -> automatically injective')
print('  ReLU:    monotone increasing but NOT strictly (flat on x < 0)')
print('  x^2:     neither monotone increasing nor decreasing on full R')
print('  sin(x):  neither (oscillates)')

# ── 3.6 Periodic functions: Sinusoidal positional encodings ──
print('\n\nPeriodic Functions and Positional Encodings')
print('=' * 55)

d_model = 16
max_pos = 50

def sinusoidal_pe(pos: int, d: int) -> np.ndarray:
    """Compute sinusoidal positional encoding for position pos."""
    pe = np.zeros(d)
    for i in range(0, d, 2):
        denom = 10000 ** (i / d)
        pe[i] = np.sin(pos / denom)
        if i + 1 < d:
            pe[i+1] = np.cos(pos / denom)
    return pe

# Build PE matrix
positions = np.arange(max_pos)
pe_matrix = np.array([sinusoidal_pe(p, d_model) for p in positions])

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Left: PE values across positions for different dimensions
for dim_idx in [0, 2, 6, 14]:
    axes[0].plot(positions, pe_matrix[:, dim_idx], label=f'dim {dim_idx}', alpha=0.8)
axes[0].set_xlabel('Position')
axes[0].set_ylabel('PE value')
axes[0].set_title('Sinusoidal PE: Different Periods per Dimension')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)

# Right: full PE matrix as heatmap
im = axes[1].imshow(pe_matrix.T, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Dimension')
axes[1].set_title('Full PE Matrix (position x dimension)')
plt.colorbar(im, ax=axes[1], shrink=0.8)

plt.tight_layout()
plt.show()

print('Each dimension is a periodic function of position with different period.')
print('Low dimensions: short period (rapidly varying) -> fine position info')
print('High dimensions: long period (slowly varying) -> coarse position info')
print('This multi-scale periodic structure lets the model distinguish all positions.')

4. Function Composition

4.1 Definition and Notation

Given f:ABf: A \to B and g:BCg: B \to C, the composition gf:ACg \circ f: A \to C is defined by:

(gf)(a)=g(f(a))(g \circ f)(a) = g(f(a))

Read: "g after f" or "g composed with f". Apply ff first, then apply gg to the result.

Requirement: The codomain of ff must be compatible with the domain of gg (at minimum, range(f)dom(g)\text{range}(f) \subseteq \text{dom}(g)).

4.2 Associativity

Composition is associative: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f.

Proof: For any aa:

(h(gf))(a)=h((gf)(a))=h(g(f(a)))=(hg)(f(a))=((hg)f)(a)(h \circ (g \circ f))(a) = h((g \circ f)(a)) = h(g(f(a))) = (h \circ g)(f(a)) = ((h \circ g) \circ f)(a) \quad \square

This means we can write hgfh \circ g \circ f without ambiguity — parentheses don't matter.

Multi-layer networks: fLfL1f1f_L \circ f_{L-1} \circ \cdots \circ f_1 — associativity means we can group layers arbitrarily.

4.3 Identity Function

The identity function idA:AA\text{id}_A: A \to A is defined by idA(a)=a\text{id}_A(a) = a for all aa.

It acts as the identity for composition: fidA=ff \circ \text{id}_A = f and idBf=f\text{id}_B \circ f = f.

Neural network analogy: A residual connection computes x+f(x)x + f(x). When f0f \approx 0, this approaches the identity function. ResNets learn residuals on top of identity.

4.4 Composition Properties

  • Composition of injections is injective
  • Composition of surjections is surjective
  • Composition of bijections is bijective
  • If gfg \circ f is injective, then ff must be injective (but gg need not be)
  • If gfg \circ f is surjective, then gg must be surjective (but ff need not be)

4.5 Neural Networks as Function Composition

Each transformer layer: fl:Rn×dRn×df_l: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}

fl(X)=X+Attnl(LN(X))+FFNl(LN(X+Attnl(LN(X))))f_l(X) = X + \text{Attn}_l(\text{LN}(X)) + \text{FFN}_l(\text{LN}(X + \text{Attn}_l(\text{LN}(X))))

Full transformer: F=softmaxWheadfLf1EF = \text{softmax} \circ W_{\text{head}} \circ f_L \circ \cdots \circ f_1 \circ E

Backpropagation is the chain rule for derivatives applied to this composition — gradient flows backward through the composition graph.

Code cell 24

# ══════════════════════════════════════════════════════════════════
# 4.1–4.4 Function Composition — Engine and Properties
# ══════════════════════════════════════════════════════════════════

def compose(g: Callable, f: Callable) -> Callable:
    """Return the composition g . f (apply f first, then g)."""
    return lambda x: g(f(x))

# ── 4.2 Verify associativity ──
f = lambda x: 2 * x + 1
g = lambda x: x ** 2
h = lambda x: x - 3

# (h . (g . f))(x) vs ((h . g) . f)(x)
hgf_1 = compose(h, compose(g, f))  # h . (g . f)
hgf_2 = compose(compose(h, g), f)  # (h . g) . f

test_vals = np.array([-2, -1, 0, 1, 2, 3, 5.5])
print('4.2 Associativity Verification: h . (g . f) == (h . g) . f')
print('=' * 60)
print(f'f(x) = 2x + 1,  g(x) = x^2,  h(x) = x - 3')
print(f'{"x":>6}  {"h.(g.f)(x)":>14}  {"(h.g).f(x)":>14}  {"Equal?":>8}')
print('-' * 50)
for x in test_vals:
    v1, v2 = hgf_1(x), hgf_2(x)
    print(f'{x:>6.1f}  {v1:>14.4f}  {v2:>14.4f}  {str(np.isclose(v1, v2)):>8}')

# ── 4.3 Identity function ──
print('\n4.3 Identity Function')
print('=' * 40)
identity = lambda x: x
for x in [0, 3.14, -7]:
    print(f'  f . id({x}) = {compose(f, identity)(x):.4f} == f({x}) = {f(x):.4f}')

# ── 4.6 Non-commutativity ──
print('\n4.6 Composition is NOT Commutative')
print('=' * 55)
print('f(x) = 2x + 1,  g(x) = x^2')
print(f'{"x":>6}  {"(g.f)(x)":>12}  {"(f.g)(x)":>12}  {"Equal?":>8}')
print('-' * 45)
gf = compose(g, f)  # g(f(x)) = (2x+1)^2
fg = compose(f, g)  # f(g(x)) = 2x^2 + 1

for x in [-2, -1, 0, 1, 2, 3]:
    v1, v2 = gf(x), fg(x)
    print(f'{x:>6}  {v1:>12.1f}  {v2:>12.1f}  {str(np.isclose(v1, v2)):>8}')
print()
print('(g.f)(x) = (2x+1)^2  vs  (f.g)(x) = 2x^2 + 1  -- clearly different!')
print('Order matters. Cannot swap layers in a neural network.')

# ── 4.4 Composition preserves injectivity/surjectivity ──
print('\n4.4 Composition Preserves Properties')
print('=' * 55)

# Two injective functions
f_inj = lambda x: 2*x      # injective
g_inj = lambda x: x + 10   # injective
gf_inj = compose(g_inj, f_inj)

# Check on sample
x_test = np.array([1, 2, 3, 4, 5])
y_test = np.array([gf_inj(x) for x in x_test])
is_inj = len(set(y_test)) == len(y_test)
print(f'f(x)=2x (injective), g(x)=x+10 (injective)')
print(f'  (g.f)(x) = 2x+10: values {y_test} -> all distinct? {is_inj}')
print(f'  Composition of injections is injective.')

# Non-injective composed with anything
f_non = lambda x: x**2   # not injective on R
g_any = lambda x: x + 1
gf_non = compose(g_any, f_non)
y_non = np.array([gf_non(x) for x in [-2, 2]])
print(f'\nf(x)=x^2 (NOT injective), g(x)=x+1')
print(f'  (g.f)(-2) = {gf_non(-2)}, (g.f)(2) = {gf_non(2)}')
print(f'  Collision: -2 and 2 both map to {gf_non(2)} -> composition NOT injective')

4.7 Iterated Composition and Fixed Points

Iterated composition: fn=fffn timesf^n = \underbrace{f \circ f \circ \cdots \circ f}_{n \text{ times}} — apply ff repeatedly.

f0=idf^0 = \text{id}, f1=ff^1 = f, f2=fff^2 = f \circ f, etc.

Fixed point: aa^* such that f(a)=af(a^*) = a^* — a point unchanged by application of ff.

Banach fixed point theorem: If f:XXf: X \to X is a contraction (f(x)f(y)kxy\|f(x) - f(y)\| \leq k\|x-y\| for k<1k < 1) on a complete metric space, then ff has a unique fixed point, and iteration fn(x0)af^n(x_0) \to a^* for any starting point.

AI relevance:

  • Gradient descent approaching a fixed point at the loss minimum
  • Attention in Hopfield networks: repeated application converges to retrieved memory
  • Value iteration in RL: Bellman operator TT is a contraction; TnVVT^n V \to V^*

Code cell 26

# ══════════════════════════════════════════════════════════════════
# 4.7 Iterated Composition and Fixed Points
# ══════════════════════════════════════════════════════════════════

def iterate(f: Callable, x0: float, n_steps: int) -> List[float]:
    """Iterate f starting from x0 for n_steps: x0, f(x0), f(f(x0)), ..."""
    trajectory = [x0]
    x = x0
    for _ in range(n_steps):
        x = f(x)
        trajectory.append(x)
    return trajectory

# ── Example 1: Contraction mapping → converges to unique fixed point ──
print('Iterated Composition & Fixed Points')
print('=' * 60)

# f(x) = cos(x) is a contraction on [0, 1] (|cos'(x)| = |sin(x)| < 1)
# Fixed point: x* where cos(x*) = x* ≈ 0.7390851332
f_cos = np.cos
traj = iterate(f_cos, 0.5, 30)

print('f(x) = cos(x), starting at x0 = 0.5')
print(f'  Iteration 0:  {traj[0]:.10f}')
print(f'  Iteration 5:  {traj[5]:.10f}')
print(f'  Iteration 10: {traj[10]:.10f}')
print(f'  Iteration 20: {traj[20]:.10f}')
print(f'  Iteration 30: {traj[30]:.10f}')
print(f'  Fixed point:  cos(x*) = x* ≈ 0.7390851332')
print(f'  Converged?    |f^30(x0) - x*| = {abs(traj[30] - 0.7390851332):.2e}')

# ── Example 2: Different starting points all converge (Banach theorem) ──
print('\nBanach Fixed Point Theorem: ALL starting points converge')
for x0 in [0.0, 0.5, 1.0, -1.0, 3.0, 10.0]:
    traj = iterate(f_cos, x0, 100)
    print(f'  x0 = {x0:>5.1f} -> f^100(x0) = {traj[-1]:.10f}')

# ── Example 3: Non-contraction → diverges ──
print('\nNon-contraction: f(x) = 2x (|f\'| = 2 > 1)')
f_expand = lambda x: 2 * x
traj_div = iterate(f_expand, 0.01, 20)
print(f'  x0 = 0.01: f^5 = {traj_div[5]:.4f}, f^10 = {traj_div[10]:.2f}, f^20 = {traj_div[20]:.0f}')
print(f'  Diverges! Expansion factor > 1 means no convergence.')

# ── Visualise convergence ──
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: cobweb diagram for cos(x)
x_line = np.linspace(-0.5, 1.5, 200)
axes[0].plot(x_line, np.cos(x_line), 'b-', label='y = cos(x)', linewidth=2)
axes[0].plot(x_line, x_line, 'k--', label='y = x', linewidth=1)

# Cobweb
traj = iterate(f_cos, 0.2, 15)
for i in range(len(traj)-1):
    axes[0].plot([traj[i], traj[i]], [traj[i], traj[i+1]], 'r-', alpha=0.5, linewidth=0.8)
    axes[0].plot([traj[i], traj[i+1]], [traj[i+1], traj[i+1]], 'r-', alpha=0.5, linewidth=0.8)
axes[0].scatter([traj[-1]], [traj[-1]], c='red', s=40, zorder=5, label=f'Fixed point ≈ {traj[-1]:.4f}')
axes[0].set_title('Cobweb Diagram: cos(x) → Fixed Point', fontweight='bold')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')

# Right: convergence rate
starts = [0.0, 0.3, 0.7, 1.0, 1.5]
for x0 in starts:
    traj = iterate(f_cos, x0, 30)
    errors = [abs(t - 0.7390851332) for t in traj]
    axes[1].semilogy(errors, label=f'x0={x0}', alpha=0.8)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('|x_n - x*| (log scale)')
axes[1].set_title('Convergence Rate from Different Starts', fontweight='bold')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print('All starting points converge to the same fixed point (Banach theorem).')

5. Inverse Functions

5.1 Left and Right Inverses

  • Left inverse (retraction): g:BAg: B \to A is a left inverse of f:ABf: A \to B iff gf=idAg \circ f = \text{id}_A
    • g(f(a))=ag(f(a)) = a for all aAa \in A; ff is left-invertible iff ff is injective
  • Right inverse (section): h:BAh: B \to A is a right inverse of ff iff fh=idBf \circ h = \text{id}_B
    • f(h(b))=bf(h(b)) = b for all bBb \in B; ff is right-invertible iff ff is surjective
  • Two-sided inverse: gg is both left and right inverse; gf=idAg \circ f = \text{id}_A AND fg=idBf \circ g = \text{id}_B
    • Exists iff ff is bijective
    • Unique: if gg and hh are both two-sided, then g=hg = h

5.2 The Inverse Function

For a bijection f:ABf: A \to B, the inverse f1:BAf^{-1}: B \to A is defined by:

f1(b)=a    f(a)=bf^{-1}(b) = a \iff f(a) = b
  • f1f^{-1} is well-defined because ff is bijective (each bb has exactly one preimage)
  • f1f^{-1} is also bijective; (f1)1=f(f^{-1})^{-1} = f
  • Composition: f1f=idAf^{-1} \circ f = \text{id}_A and ff1=idBf \circ f^{-1} = \text{id}_B

5.3 Inverse of Composition

(gf)1=f1g1(reverse order!)(g \circ f)^{-1} = f^{-1} \circ g^{-1} \quad \text{(reverse order!)}

Proof: (gf)(f1g1)=g(ff1)g1=gidg1=gg1=id(g \circ f) \circ (f^{-1} \circ g^{-1}) = g \circ (f \circ f^{-1}) \circ g^{-1} = g \circ \text{id} \circ g^{-1} = g \circ g^{-1} = \text{id} \square

Analogy: Putting on shoes then socks → reverse by taking off socks then shoes.

AI: Invertible neural networks (normalising flows): decoding = inverse of encoding, applied in reverse layer order.

Code cell 28

# ══════════════════════════════════════════════════════════════════
# 5.1–5.3 Inverse Functions — Demonstration
# ══════════════════════════════════════════════════════════════════

# ── 5.1 Left and right inverses ──
print('5.1 Left and Right Inverses')
print('=' * 55)

# f: R -> R, f(x) = 2x + 3 (bijective -> has two-sided inverse)
f_lin = lambda x: 2*x + 3
f_lin_inv = lambda x: (x - 3) / 2  # inverse: y = 2x+3 => x = (y-3)/2

x_test = np.array([-3, 0, 1, 5, 10])
print('f(x) = 2x + 3, f_inv(x) = (x-3)/2')
print(f'  f_inv(f(x)):  {[f_lin_inv(f_lin(x)) for x in x_test]}  (should be x)')
print(f'  f(f_inv(x)):  {[f_lin(f_lin_inv(x)) for x in x_test]}  (should be x)')

# ── Sigmoid and logit are mutual inverses ──
print('\nSigmoid and Logit are mutual inverses:')
sigmoid = lambda x: 1 / (1 + np.exp(-x))
logit = lambda p: np.log(p / (1 - p))

for x in [-2, 0, 1, 3]:
    print(f'  logit(sigmoid({x})) = {logit(sigmoid(x)):.6f}  (should be {x})')
for p in [0.1, 0.5, 0.73, 0.99]:
    print(f'  sigmoid(logit({p})) = {sigmoid(logit(p)):.6f}  (should be {p})')

# ── 5.2 Inverse is unique for bijections ──
print('\n5.2 For bijections, the inverse is unique')
print('=' * 55)

# exp and ln
print('exp: R -> (0,inf) is bijective')
print('  Its unique inverse is ln: (0,inf) -> R')
for x in [0.5, 1, 2, np.e]:
    print(f'  ln(exp({x:.4f})) = {np.log(np.exp(x)):.4f}')

# ── 5.3 Inverse of composition: reverse order ──
print('\n5.3 Inverse of Composition: (g . f)^{-1} = f^{-1} . g^{-1}')
print('=' * 55)

f = lambda x: 2*x + 1    # f^{-1}(y) = (y-1)/2
g = lambda x: x**3        # g^{-1}(y) = y^(1/3)
f_inv = lambda y: (y - 1) / 2
g_inv = lambda y: np.sign(y) * np.abs(y)**(1/3)

# g . f composition and its inverse
gf = lambda x: g(f(x))
gf_inv_correct = lambda y: f_inv(g_inv(y))   # f^{-1} . g^{-1} (correct order)
gf_inv_wrong   = lambda y: g_inv(f_inv(y))   # g^{-1} . f^{-1} (wrong order)

x_test = np.array([-2, -1, 0, 1, 2, 3])
print(f'f(x) = 2x+1,  g(x) = x^3')
print(f'{"x":>5}  {"(g.f)(x)":>12}  {"f_inv.g_inv(y)":>15}  {"correct?":>10}  {"g_inv.f_inv(y)":>15}  {"correct?":>10}')
print('-' * 75)
for x in x_test:
    y = gf(x)
    x_recovered_correct = gf_inv_correct(y)
    x_recovered_wrong = gf_inv_wrong(y)
    print(f'{x:>5.0f}  {y:>12.2f}  {x_recovered_correct:>15.6f}  '
          f'{str(np.isclose(x_recovered_correct, x)):>10}  '
          f'{x_recovered_wrong:>15.6f}  {str(np.isclose(x_recovered_wrong, x)):>10}')
print()
print('Correct order: f^{-1} . g^{-1} (reverse!)')
print('Wrong order:   g^{-1} . f^{-1} does NOT recover x')

5.4 Pseudo-Inverse (Moore-Penrose)

For a non-square or non-invertible matrix ARm×nA \in \mathbb{R}^{m \times n}, the Moore-Penrose pseudo-inverse A+Rn×mA^+ \in \mathbb{R}^{n \times m} is defined by four conditions:

  1. AA+A=AAA^+A = A
  2. A+AA+=A+A^+AA^+ = A^+
  3. (AA+)T=AA+(AA^+)^T = AA^+
  4. (A+A)T=A+A(A^+A)^T = A^+A

Via SVD: If A=UΣVTA = U\Sigma V^T then A+=VΣ+UTA^+ = V\Sigma^+U^T where Σ+\Sigma^+ inverts the non-zero singular values.

Least squares: x=A+bx = A^+b gives the minimum-norm least-squares solution to Ax=bAx = b.

5.5 Invertibility and Information Preservation

  • Invertible function: no information lost; can always recover input from output
  • Non-invertible (many-to-one): information destroyed; cannot recover which input was used
  • Dimension argument: f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m; if m<nm < n, ff cannot be injective → information necessarily lost
  • Autoencoder: encoder f:XZf: X \to Z, decoder g:ZX^g: Z \to \hat{X}. If dim(Z)<dim(X)\dim(Z) < \dim(X), then gfidg \circ f \neq \text{id} — perfect reconstruction impossible

Code cell 30

# ══════════════════════════════════════════════════════════════════
# 5.4 Moore-Penrose Pseudo-Inverse
# 5.5 Invertibility and Information Preservation
# ══════════════════════════════════════════════════════════════════

# ── 5.4 Pseudo-inverse via SVD ──
print('5.4 Moore-Penrose Pseudo-Inverse')
print('=' * 60)

# Tall matrix (more rows than columns): full column rank
A = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)

print(f'A (3x2):\n{A}\n')

# Compute pseudo-inverse via np.linalg.pinv (uses SVD internally)
A_pinv = np.linalg.pinv(A)
print(f'A+ (pseudo-inverse, 2x3):\n{np.round(A_pinv, 6)}\n')

# Verify the four Penrose conditions
print('Verifying Penrose conditions:')
cond1 = np.allclose(A @ A_pinv @ A, A)
cond2 = np.allclose(A_pinv @ A @ A_pinv, A_pinv)
cond3 = np.allclose((A @ A_pinv).T, A @ A_pinv)
cond4 = np.allclose((A_pinv @ A).T, A_pinv @ A)
print(f'  1. A A+ A = A:       {cond1}')
print(f'  2. A+ A A+ = A+:     {cond2}')
print(f'  3. (A A+)^T = A A+:  {cond3}')
print(f'  4. (A+ A)^T = A+ A:  {cond4}')

# Manual SVD computation
U, S, Vt = np.linalg.svd(A, full_matrices=False)
S_pinv = np.diag(1.0 / S)
A_pinv_manual = Vt.T @ S_pinv @ U.T
print(f'\nA+ via SVD (V Sigma+ U^T):\n{np.round(A_pinv_manual, 6)}')
print(f'Matches np.linalg.pinv? {np.allclose(A_pinv, A_pinv_manual)}')

# Least-squares: solve Ax = b in least-squares sense
b = np.array([1, 2, 3])
x_ls = A_pinv @ b
print(f'\nLeast-squares: A+ b = {np.round(x_ls, 4)}')
print(f'Residual ||Ax - b|| = {np.linalg.norm(A @ x_ls - b):.6f}')

# ── 5.5 Information preservation ──
print('\n\n5.5 Invertibility and Information Preservation')
print('=' * 60)

# Dimension argument: R^n -> R^m with m < n => info lost
print('Dimension reduction destroys information:')
np.random.seed(42)
X = np.random.randn(5, 10)  # 5 samples in R^10

# Project to R^3 (lossy)
W_down = np.random.randn(10, 3)  # project R^10 -> R^3
W_up = np.random.randn(3, 10)    # project R^3 -> R^10

Z = X @ W_down              # encode: R^10 -> R^3
X_hat = Z @ W_up             # decode: R^3 -> R^10

recon_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X) * 100
print(f'  Original X:     shape {X.shape}')
print(f'  Compressed Z:   shape {Z.shape}  (dimension 10 -> 3)')
print(f'  Reconstructed:  shape {X_hat.shape}')
print(f'  Reconstruction error: {recon_error:.1f}% of original norm')
print(f'  -> Cannot perfectly reconstruct: dim(Z) < dim(X)')

# No dimension reduction => perfect reconstruction possible
W_square = np.random.randn(10, 10)
W_square_inv = np.linalg.inv(W_square)

Z_full = X @ W_square
X_perfect = Z_full @ W_square_inv
recon_err_full = np.linalg.norm(X - X_perfect) / np.linalg.norm(X) * 100
print(f'\n  Square (10->10) transform:')
print(f'  Reconstruction error: {recon_err_full:.2e}% (essentially zero)')
print(f'  -> Bijective transform: perfect reconstruction')

6. Special Classes of Functions

6.1 Linear Functions

f:VWf: V \to W is linear iff:

  1. f(u+v)=f(u)+f(v)f(u + v) = f(u) + f(v) (additivity)
  2. f(αv)=αf(v)f(\alpha v) = \alpha f(v) (homogeneity)

Equivalently: f(αu+βv)=αf(u)+βf(v)f(\alpha u + \beta v) = \alpha f(u) + \beta f(v).

Matrix representation: Every linear f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m corresponds to a matrix ARm×nA \in \mathbb{R}^{m \times n} via f(x)=Axf(x) = Ax.

Key properties:

  • f(0)=0f(0) = 0 always
  • Kernel: ker(f)={vf(v)=0}\ker(f) = \{v \mid f(v) = 0\} — measures "collapse"
  • Rank-nullity: dim(ker(f))+dim(im(f))=dim(domain)\dim(\ker(f)) + \dim(\text{im}(f)) = \dim(\text{domain})

6.2 Affine Functions

f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is affine iff f(x)=Ax+bf(x) = Ax + b for matrix AA and vector bb.

Affine = linear + translation. Not linear unless b=0b = 0 (since f(0)=b0f(0) = b \neq 0).

Every dense layer in a neural network (without activation) is affine: z=Wx+bz = Wx + b.

6.3 Multilinear Functions

f:V1×V2××VkWf: V_1 \times V_2 \times \cdots \times V_k \to W is multilinear iff linear in each argument separately.

Bilinear example: The inner product u,v=uTv\langle u, v \rangle = u^T v is bilinear — linear in uu and linear in vv separately.

AI relevance: Attention score qTkq^T k is bilinear in qq and kk. Matrix multiplication (A,B)AB(A, B) \mapsto AB is bilinear.

Code cell 32

# ══════════════════════════════════════════════════════════════════
# 6.1–6.3 Linear, Affine, and Multilinear Functions
# ══════════════════════════════════════════════════════════════════

# ── 6.1 Test linearity ──
print('6.1 Linear Functions')
print('=' * 55)

def test_linearity(name: str, f: Callable, n_dim: int, n_tests: int = 100):
    """Test if f is linear by checking f(au + bv) = af(u) + bf(v)."""
    np.random.seed(42)
    for _ in range(n_tests):
        u = np.random.randn(n_dim)
        v = np.random.randn(n_dim)
        a, b = np.random.randn(2)
        lhs = f(a * u + b * v)
        rhs = a * f(u) + b * f(v)
        if not np.allclose(lhs, rhs, atol=1e-10):
            print(f'  {name}: NOT LINEAR (failed at a={a:.3f}, b={b:.3f})')
            return False
    print(f'  {name}: LINEAR (passed {n_tests} random tests)')
    return True

# Linear: matrix multiply
A = np.array([[2, -1], [0, 3], [1, 1]])
f_linear = lambda x: A @ x
test_linearity('f(x) = Ax', f_linear, 2)

# Affine: NOT linear (has bias)
b_vec = np.array([1, 0, -1])
f_affine = lambda x: A @ x + b_vec
test_linearity('f(x) = Ax + b', f_affine, 2)

# ReLU: NOT linear
f_relu = lambda x: np.maximum(0, x)
test_linearity('ReLU(x)', f_relu, 3)

# ── Why affine is not linear ──
print(f'\n  f_affine(0) = {f_affine(np.zeros(2))}  (should be 0 for linear, but it\'s b)')
print(f'  Key: linear functions MUST map 0 to 0. Affine shifts the origin.')

# ── 6.1 Rank-Nullity theorem ──
print(f'\nRank-Nullity Theorem for A:')
print(f'  A shape: {A.shape} (maps R^2 -> R^3)')
rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank
print(f'  rank(A) = {rank}')
print(f'  nullity(A) = {nullity}')
print(f'  rank + nullity = {rank + nullity} = dim(domain) = {A.shape[1]}')

# ── 6.3 Bilinear: attention scores ──
print(f'\n6.3 Bilinear Functions: Attention Scores')
print('=' * 55)

d_k = 4
q = np.random.randn(d_k)
k1 = np.random.randn(d_k)
k2 = np.random.randn(d_k)
alpha, beta = 2.5, -1.3

# score(q, k) = q^T k is bilinear
score = lambda q, k: q @ k

# Linear in q: score(a*q1 + b*q2, k) = a*score(q1,k) + b*score(q2,k)
q1, q2 = np.random.randn(d_k), np.random.randn(d_k)
lhs_q = score(alpha * q1 + beta * q2, k1)
rhs_q = alpha * score(q1, k1) + beta * score(q2, k1)
print(f'  Linear in q: score(a*q1 + b*q2, k) = {lhs_q:.6f}')
print(f'               a*score(q1,k) + b*score(q2,k) = {rhs_q:.6f}')
print(f'               Equal? {np.isclose(lhs_q, rhs_q)}')

# Linear in k
lhs_k = score(q, alpha * k1 + beta * k2)
rhs_k = alpha * score(q, k1) + beta * score(q, k2)
print(f'  Linear in k: score(q, a*k1 + b*k2) = {lhs_k:.6f}')
print(f'               a*score(q,k1) + b*score(q,k2) = {rhs_k:.6f}')
print(f'               Equal? {np.isclose(lhs_k, rhs_k)}')
print(f'  -> Attention score q^T k is bilinear (linear in each argument separately)')

6.4 Polynomial Functions

f:RRf: \mathbb{R} \to \mathbb{R} is a polynomial of degree dd: f(x)=adxd+ad1xd1++a1x+a0f(x) = a_d x^d + a_{d-1} x^{d-1} + \cdots + a_1 x + a_0

Weierstrass theorem: Any continuous function on [a,b][a, b] can be uniformly approximated by polynomials.

Neural networks vs polynomials: Depth-dd networks can represent exponentially more functions than degree-dd polynomials (exponential expressiveness advantage).

6.5 Activation Functions

Nonlinear functions applied elementwise to break linearity in neural networks. Without nonlinearity, composition of linear maps = single linear map → depth gives no benefit.

ActivationFormulaRangeUsed In
Sigmoidσ(x)=11+ex\sigma(x) = \frac{1}{1+e^{-x}}(0,1)(0, 1)Gates, binary classification
Tanhtanh(x)=exexex+ex\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}(1,1)(-1, 1)RNNs, zero-centred
ReLUmax(0,x)\max(0, x)[0,)[0, \infty)CNNs, dominant 2012–2020
GELUxΦ(x)x \cdot \Phi(x)[0.17,)\approx [-0.17, \infty)BERT, GPT-2/3
SwiGLUSwish(xW)(xV)\text{Swish}(xW) \odot (xV)(,)(-\infty, \infty)LLaMA, PaLM, Gemma, Mistral
Mishxtanh(softplus(x))x \cdot \tanh(\text{softplus}(x))[0.31,)\approx [-0.31, \infty)YOLOv4, niche

6.6 Convex Functions

f:RnRf: \mathbb{R}^n \to \mathbb{R} is convex iff:

f(λx+(1λ)y)λf(x)+(1λ)f(y)for all x,y,  λ[0,1]f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) \quad \text{for all } x, y,\; \lambda \in [0,1]

Key property: Every local minimum of a convex function is a global minimum.

Second-order condition: ff is convex iff the Hessian 2f\nabla^2 f is positive semidefinite everywhere.

AI: Loss landscapes of neural networks are NOT convex in general — but cross-entropy as a function of logits IS convex.

6.7 Lipschitz Functions

f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is LL-Lipschitz iff:

f(x)f(y)Lxyfor all x,y\|f(x) - f(y)\| \leq L\|x - y\| \quad \text{for all } x, y

LL bounds how fast ff can change. Lipschitz functions cannot amplify distances by more than factor LL.

Key properties:

  • Composition: gfg \circ f is (L1L2)(L_1 L_2)-Lipschitz if ff is L1L_1- and gg is L2L_2-Lipschitz
  • ReLU is 1-Lipschitz; Sigmoid is 0.25-Lipschitz
  • Spectral normalisation: constrain weight matrices to ensure Lipschitz networks (used in GANs, adversarial robustness)

Code cell 34

# ══════════════════════════════════════════════════════════════════
# 6.5 Activation Function Zoo — Complete Analysis
# ══════════════════════════════════════════════════════════════════

x = np.linspace(-5, 5, 1000)

# Define activations and their derivatives
activations = {
    'Sigmoid': {
        'fn': lambda x: 1 / (1 + np.exp(-x)),
        'deriv': lambda x: (1/(1+np.exp(-x))) * (1 - 1/(1+np.exp(-x))),
        'range': '(0, 1)',
    },
    'Tanh': {
        'fn': lambda x: np.tanh(x),
        'deriv': lambda x: 1 - np.tanh(x)**2,
        'range': '(-1, 1)',
    },
    'ReLU': {
        'fn': lambda x: np.maximum(0, x),
        'deriv': lambda x: np.where(x > 0, 1.0, 0.0),
        'range': '[0, inf)',
    },
    'GELU': {
        'fn': lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3))),
        'deriv': None,  # computed numerically
        'range': '~[-0.17, inf)',
    },
    'Swish': {
        'fn': lambda x: x / (1 + np.exp(-x)),
        'deriv': lambda x: (1/(1+np.exp(-x))) + x * (1/(1+np.exp(-x))) * (1 - 1/(1+np.exp(-x))),
        'range': '~[-0.28, inf)',
    },
    'Mish': {
        'fn': lambda x: x * np.tanh(np.log1p(np.exp(x))),
        'deriv': None,
        'range': '~[-0.31, inf)',
    },
}

# Plot all activations and their derivatives
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for idx, (name, info) in enumerate(activations.items()):
    row, col = divmod(idx, 3)
    ax = axes[row, col]
    
    y = info['fn'](x)
    ax.plot(x, y, 'b-', linewidth=2, label=name)
    
    # Derivative (numerical if not provided)
    if info['deriv'] is not None:
        dy = info['deriv'](x)
    else:
        dy = np.gradient(y, x)
    ax.plot(x, dy, 'r--', linewidth=1.5, label=f"{name}'", alpha=0.7)
    
    ax.axhline(0, color='k', linewidth=0.5)
    ax.axvline(0, color='k', linewidth=0.5)
    ax.set_title(f'{name}  (range: {info["range"]})', fontweight='bold', fontsize=10)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-5, 5)

plt.suptitle('Activation Functions and Their Derivatives', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# ── Analysis table ──
print('Activation Function Properties')
print('=' * 80)
print(f'{"Name":<10} {"Max |f\'|":<10} {"f(0)":<8} {"f\'(0)":<8} {"Monotone?":<10} {"Smooth?":<10}')
print('-' * 80)

for name, info in activations.items():
    fn = info['fn']
    y = fn(x)
    if info['deriv']:
        dy = info['deriv'](x)
    else:
        dy = np.gradient(y, x)
    
    max_deriv = np.max(np.abs(dy))
    f_zero = fn(np.array([0.0]))[0] if hasattr(fn(np.array([0.0])), '__len__') else fn(0.0)
    d_zero = dy[len(dy)//2]
    monotone = 'Yes' if np.all(np.diff(y) >= -1e-8) else 'No'
    smooth = 'No' if name == 'ReLU' else 'Yes'
    
    print(f'{name:<10} {max_deriv:<10.4f} {f_zero:<8.4f} {d_zero:<8.4f} {monotone:<10} {smooth:<10}')

print()
print('Key insights:')
print('  Sigmoid/Tanh: saturate (derivative -> 0 for large |x|) -> vanishing gradient')
print('  ReLU: constant gradient for x>0, but dead neurons when x<0')
print('  GELU/Swish: smooth ReLU-like; non-zero gradient for negative x')
print('  Modern LLMs (2023-2026): SwiGLU = Swish gating, used in LLaMA/Mistral/Gemma')

Code cell 35

# ══════════════════════════════════════════════════════════════════
# 6.6 Convex Functions
# 6.7 Lipschitz Functions
# ══════════════════════════════════════════════════════════════════

# ── 6.6 Convexity test ──
print('6.6 Convex Functions')
print('=' * 55)

def test_convexity(name: str, f: Callable, x_range: np.ndarray,
                   n_tests: int = 10000) -> bool:
    """Test convexity: f(lambda*x + (1-lambda)*y) <= lambda*f(x) + (1-lambda)*f(y)."""
    np.random.seed(42)
    for _ in range(n_tests):
        x, y = np.random.choice(x_range, 2)
        lam = np.random.uniform(0, 1)
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + 1e-10:
            print(f'  {name}: NOT CONVEX (violated at x={x:.3f}, y={y:.3f}, lambda={lam:.3f})')
            return False
    print(f'  {name}: CONVEX (passed {n_tests} random tests)')
    return True

x_range = np.linspace(-5, 5, 500)

test_convexity('f(x) = x^2',         lambda x: x**2, x_range)
test_convexity('f(x) = |x|',         lambda x: abs(x), x_range)
test_convexity('f(x) = e^x',         lambda x: np.exp(x), x_range)
test_convexity('f(x) = x^3',         lambda x: x**3, x_range)
test_convexity('f(x) = -log(x)',     lambda x: -np.log(x) if x > 0 else np.inf, 
               np.linspace(0.01, 5, 500))
test_convexity('f(x) = cos(x)',      lambda x: np.cos(x), x_range)

# Cross-entropy as function of logits (convex!)
print(f'\n  Cross-entropy loss L(z) = -z_y + log(sum(exp(z))) is convex in z')
print(f'  This means: no local minima when optimising logits (but NOT when optimising weights)')

# ── 6.7 Lipschitz constants ──
print(f'\n\n6.7 Lipschitz Functions')
print('=' * 55)

def estimate_lipschitz(f: Callable, x_range: np.ndarray, n_pairs: int = 50000) -> float:
    """Estimate Lipschitz constant by looking at max |f(x)-f(y)|/|x-y|."""
    np.random.seed(42)
    idx = np.random.choice(len(x_range), (n_pairs, 2))
    x_pairs = x_range[idx]
    
    x1, x2 = x_pairs[:, 0], x_pairs[:, 1]
    mask = np.abs(x1 - x2) > 1e-12
    x1, x2 = x1[mask], x2[mask]
    
    ratios = np.abs(f(x1) - f(x2)) / np.abs(x1 - x2)
    return np.max(ratios)

x_dense = np.linspace(-10, 10, 10000)

lip_functions = {
    'ReLU':    (lambda x: np.maximum(0, x),    1.0),
    'Sigmoid': (lambda x: 1/(1+np.exp(-x)),    0.25),
    'Tanh':    (lambda x: np.tanh(x),          1.0),
    'sin(x)':  (lambda x: np.sin(x),           1.0),
    'x^2':     (lambda x: x**2,                'unbounded'),
}

print(f'{"Function":<12} {"Estimated L":<14} {"True L":<14} {"L-Lipschitz?":<14}')
print('-' * 55)

for name, (fn, true_L) in lip_functions.items():
    est_L = estimate_lipschitz(fn, x_dense)
    print(f'{name:<12} {est_L:<14.4f} {str(true_L):<14} {"Yes" if isinstance(true_L, (int, float)) else "No"}')

print()
print('Key insights:')
print('  ReLU is 1-Lipschitz: |ReLU(x) - ReLU(y)| <= |x - y|')
print('  Sigmoid max derivative = 0.25, so it is 0.25-Lipschitz')
print('  x^2 has unbounded derivative => NOT Lipschitz on all of R')
print()
print('  Composition: L1-Lip . L2-Lip => (L1*L2)-Lipschitz')
print('  Deep network with L layers, each L_i-Lip: total L = prod(L_i)')
print('  -> Lipschitz constant can grow EXPONENTIALLY with depth!')
print('  -> Spectral normalisation constrains each layer to 1-Lipschitz')

7. Higher-Order Functions and Functionals

7.1 Functions as First-Class Objects

A higher-order function takes a function as input and/or returns a function as output. Functions are objects — they can be stored, passed, returned, and composed like any other value.

Lambda calculus (Church 1936): formal system where functions are first-class; foundation of functional programming.

7.2 Map, Filter, Reduce

OperationWhat It DoesAI Example
MapApply ff to every elementElementwise activation; LayerNorm per token
FilterKeep elements satisfying predicate PPAttention masking; top-kk sampling
Reduce (Fold)Combine elements with binary operationSum pooling; softmax denominator ezi\sum e^{z_i}

7.3 Functionals

A functional maps functions to values: F:(AB)CF: (A \to B) \to C.

  • Definite integral: F[f]=abf(x)dxF[f] = \int_a^b f(x)\,dx maps a function to a scalar
  • Loss functional: L[f]=E(x,y)D[(f(x),y)]\mathcal{L}[f] = \mathbb{E}_{(x,y) \sim D}[\ell(f(x), y)] maps a model function to expected loss
  • KL divergence: DKL(PQ)D_{KL}(P \| Q) is a functional on distributions

7.4 Operators

An operator TT maps one function space to another: T:(AB)(CD)T: (A \to B) \to (C \to D).

  • Differential operator: D(f)=fD(f) = f'; linear: D(αf+βg)=αD(f)+βD(g)D(\alpha f + \beta g) = \alpha D(f) + \beta D(g)
  • Attention operator: Qsoftmax(QKT/d)VQ \mapsto \text{softmax}(QK^T/\sqrt{d})V; maps queries to attended values
  • LayerNorm operator: maps a vector to its normalised version

7.5 Currying and Partial Application

Currying converts a multi-argument function into a sequence of single-argument functions:

curry(f):A(BC)fromf:A×BC\text{curry}(f): A \to (B \to C) \quad \text{from} \quad f: A \times B \to C

So f(a,b)=(curry(f)(a))(b)f(a, b) = (\text{curry}(f)(a))(b) — evaluate one argument at a time.

AI example: Attention score score(q,k)=qTk/dk\text{score}(q, k) = q^T k / \sqrt{d_k}. Curry in qq: scoreq(k)=qTk/dk\text{score}_q(k) = q^T k / \sqrt{d_k} gives a function from keys to scalars.

7.6 Function Spaces

The set of all functions from AA to BB is written BAB^A or Hom(A,B)\text{Hom}(A, B). For finite sets, BA=BA|B^A| = |B|^{|A|}.

Vector space of functions: C(X)={f:XRf continuous}C(X) = \{f: X \to \mathbb{R} \mid f \text{ continuous}\} under pointwise operations.

Universal approximation theorem: Neural networks are dense in C(K)C(K) (continuous functions on compact KK) — they can approximate any continuous function arbitrarily well.

Code cell 37

# ══════════════════════════════════════════════════════════════════
# 7.2 Map, Filter, Reduce — Implemented for AI
# ══════════════════════════════════════════════════════════════════

from functools import reduce

print('7.2 Map, Filter, Reduce')
print('=' * 60)

# ── MAP: apply function elementwise ──
print('MAP: apply f to every element')
x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

relu = lambda x: max(0, x)
mapped = list(map(relu, x))
print(f'  x = {x}')
print(f'  map(ReLU, x) = {mapped}')
print(f'  Equivalent to: np.maximum(0, x) = {np.maximum(0, x)}')
print(f'  -> This IS what activation functions do: map(sigma, z) for each neuron')

# ── FILTER: keep elements satisfying predicate ──
print(f'\nFILTER: keep elements satisfying predicate')
logits = np.array([2.1, 0.5, -0.3, 1.8, 3.0, 0.1, -1.0])
token_names = ['the', 'cat', 'sat', 'on', 'mat', 'a', 'is']
probs = np.exp(logits) / np.exp(logits).sum()

# Top-k sampling: filter to keep only top-3 tokens
k = 3
top_k_indices = np.argsort(probs)[-k:]
filtered_tokens = [(token_names[i], f'{probs[i]:.4f}') for i in top_k_indices]
print(f'  All tokens: {list(zip(token_names, np.round(probs, 4)))}')
print(f'  filter(top-{k}, tokens) = {filtered_tokens}')
print(f'  -> Top-k sampling IS a filter operation on the probability distribution')

# ── REDUCE: combine elements with binary operation ──
print(f'\nREDUCE: combine elements with binary function')
z = np.array([2.0, 1.0, 0.5, 3.0])
print(f'  Logits z = {z}')

# Sum of exponentials (softmax denominator)
exp_sum = reduce(lambda acc, zi: acc + np.exp(zi), z, 0.0)
print(f'  reduce(lambda acc, zi: acc + exp(zi), z, 0) = {exp_sum:.4f}')
print(f'  np.sum(np.exp(z)) = {np.sum(np.exp(z)):.4f}')
print(f'  -> Softmax denominator is a REDUCE operation')

# Log-sum-exp (numerically stable)
max_z = z.max()
lse = max_z + np.log(reduce(lambda acc, zi: acc + np.exp(zi - max_z), z, 0.0))
print(f'  log-sum-exp(z) = {lse:.4f}')

# Attention weighted sum is also reduce
print(f'\n  Attention output = sum(alpha_i * v_i) = reduce(+, alpha_i * v_i)')
print(f'  -> Weighted combination of values is a REDUCE operation')

Code cell 38

# ══════════════════════════════════════════════════════════════════
# 7.3–7.6 Functionals, Operators, Currying, Function Spaces
# ══════════════════════════════════════════════════════════════════

# ── 7.3 Functionals: functions that take functions as input ──
print('7.3 Functionals')
print('=' * 55)

def integral_functional(f: Callable, a: float, b: float, n: int = 10000) -> float:
    """Compute integral of f from a to b (trapezoidal rule)."""
    x = np.linspace(a, b, n)
    y = np.array([f(xi) for xi in x])
    return np.trapezoid(y, x)

# The integral is a functional: maps a function to a number
print('Definite integral as functional F[f] = integral_0^1 f(x) dx:')
for name, fn, expected in [
    ('f(x) = x',     lambda x: x,      0.5),
    ('f(x) = x^2',   lambda x: x**2,   1/3),
    ('f(x) = sin(x)', lambda x: np.sin(x), 1 - np.cos(1)),
    ('f(x) = 1',     lambda x: 1.0,    1.0),
]:
    result = integral_functional(fn, 0, 1)
    print(f'  F[{name}] = {result:.6f}  (expected: {expected:.6f})')

# Loss as functional
print('\nLoss as functional L[f] = E[(f(x) - y)^2]:')
print('  L maps a prediction function f to a scalar loss value')
print('  Training = finding f* that minimises L[f] over function space')

# ── 7.4 Operators ──
print(f'\n7.4 Operators')
print('=' * 55)

def numerical_derivative(f: Callable, x: float, h: float = 1e-7) -> float:
    """Differentiation operator D: functions -> functions."""
    return (f(x + h) - f(x - h)) / (2 * h)

# D is a linear operator
print('Differentiation D is a linear operator:')
f1 = lambda x: x**3
f2 = lambda x: np.sin(x)
a, b_coeff = 2.0, 3.0

x_test = 1.0
D_sum = numerical_derivative(lambda x: a*f1(x) + b_coeff*f2(x), x_test)
sum_D = a * numerical_derivative(f1, x_test) + b_coeff * numerical_derivative(f2, x_test)
print(f'  D(af + bg)(1.0) = {D_sum:.6f}')
print(f'  aD(f)(1.0) + bD(g)(1.0) = {sum_D:.6f}')
print(f'  Equal? {np.isclose(D_sum, sum_D)}  -> D is linear')

# ── 7.5 Currying ──
print(f'\n7.5 Currying and Partial Application')
print('=' * 55)

def attention_score(q: np.ndarray, k: np.ndarray, d_k: int) -> float:
    """Full attention score: score(q, k) = q^T k / sqrt(d_k)."""
    return q @ k / np.sqrt(d_k)

def curry_attention(q: np.ndarray, d_k: int) -> Callable:
    """Curry in q: returns a function from keys to scores."""
    return lambda k: q @ k / np.sqrt(d_k)

d_k = 4
q = np.array([1.0, 0.5, -1.0, 2.0])
keys = [np.random.randn(d_k) for _ in range(5)]

# Curried version: fix q, get function over keys
score_q = curry_attention(q, d_k)

print(f'  score(q, k) = q^T k / sqrt(d_k)    [2-argument function]')
print(f'  score_q(k)  = q^T k / sqrt(d_k)     [1-argument function, q fixed]')
print()
for i, k in enumerate(keys):
    full = attention_score(q, k, d_k)
    curried = score_q(k)
    print(f'  score(q, k{i}) = {full:.4f}  ==  score_q(k{i}) = {curried:.4f}  '
          f'match: {np.isclose(full, curried)}')

# ── 7.6 Function space cardinality ──
print(f'\n7.6 Function Spaces')
print('=' * 55)
print('|B^A| = |B|^|A| for finite sets:')
for size_A, size_B in [(2, 2), (3, 2), (2, 3), (3, 3), (10, 2)]:
    count = size_B ** size_A
    print(f'  |A|={size_A}, |B|={size_B}: {count} possible functions A -> B')

print(f'\nNeural network as point in function space:')
print(f'  A network with p parameters lives in R^p')
print(f'  Each theta in R^p defines a different function f_theta')
print(f'  Training = searching R^p for the best function')

8. Continuity and Limits

8.1 Limits of Functions

Informal: limxaf(x)=L\lim_{x \to a} f(x) = L means f(x)f(x) gets arbitrarily close to LL as xx gets close to (but not equal to) aa.

Formal (ε\varepsilon-δ\delta definition):

limxaf(x)=L    ε>0  δ>0  x:0<xa<δ    f(x)L<ε\lim_{x \to a} f(x) = L \iff \forall \varepsilon > 0\; \exists \delta > 0\; \forall x: 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon

Note: f(a)f(a) may be undefined. The limit concerns behaviour near aa, not at aa.

8.2 Continuity

ff is continuous at aa iff:

  1. f(a)f(a) is defined
  2. limxaf(x)\lim_{x \to a} f(x) exists
  3. limxaf(x)=f(a)\lim_{x \to a} f(x) = f(a)

Intuition: Small input change → small output change. No jumps or gaps.

8.3 Continuity in Higher Dimensions

f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is continuous at aa iff: ε>0  δ>0:xa<δ    f(x)f(a)<ε\forall \varepsilon > 0\; \exists \delta > 0: \|x - a\| < \delta \implies \|f(x) - f(a)\| < \varepsilon.

Composition of continuous functions is continuous. All modern activations (sigmoid, tanh, GELU, ReLU) are continuous, so neural networks with any of these define continuous functions.

8.4 Uniform Continuity

ff is uniformly continuous iff a single δ\delta works for ALL points simultaneously:

ε>0  δ>0  x,y:xy<δ    f(x)f(y)<ε\forall \varepsilon > 0\; \exists \delta > 0\; \forall x, y: |x - y| < \delta \implies |f(x) - f(y)| < \varepsilon

Stronger than pointwise continuity: δ\delta cannot depend on location.

Lipschitz     \implies uniformly continuous: LL-Lipschitz functions are uniformly continuous with δ=ε/L\delta = \varepsilon / L.

8.5 Discontinuities

TypeDescriptionExample
RemovableLimit exists but f(a)\neq f(a) or f(a)f(a) undefinedsinxx\frac{\sin x}{x} at x=0x = 0
JumpLeft and right limits exist but differHeaviside step function at 0
EssentialLimit does not existsin(1/x)\sin(1/x) at x=0x = 0

AI: Hard attention (argmax) is discontinuous → cannot backpropagate through it. Soft attention (softmax) is continuous → gradients flow.

8.6 Intermediate Value Theorem

If f:[a,b]Rf: [a, b] \to \mathbb{R} is continuous and yy is between f(a)f(a) and f(b)f(b), then c(a,b):f(c)=y\exists c \in (a, b): f(c) = y.

Continuous functions cannot "skip over" values. If f(a)<0f(a) < 0 and f(b)>0f(b) > 0, then ff has a root between aa and bb.

Code cell 40

# ══════════════════════════════════════════════════════════════════
# 8.1–8.6 Continuity, Limits, and Discontinuities
# ══════════════════════════════════════════════════════════════════

# ── 8.1 Epsilon-delta demonstration ──
print('8.1 Limits: epsilon-delta Demonstration')
print('=' * 60)

def verify_limit(f: Callable, a: float, L: float, epsilon: float) -> float:
    """Find delta such that |x-a| < delta => |f(x) - L| < epsilon."""
    # Binary search for the largest working delta
    delta = 1.0
    while delta > 1e-15:
        # Test many points in (a-delta, a+delta) \ {a}
        x_test = np.linspace(a - delta, a + delta, 10000)
        x_test = x_test[np.abs(x_test - a) > 1e-15]  # exclude a itself
        
        if len(x_test) == 0:
            break
        
        violations = np.abs(f(x_test) - L) >= epsilon
        if not np.any(violations):
            return delta
        delta /= 2
    return delta

print('lim(x->0) sin(x)/x = 1')
sinc = lambda x: np.where(np.abs(x) < 1e-15, 1.0, np.sin(x) / x)
for eps in [0.1, 0.01, 0.001]:
    delta = verify_limit(sinc, 0.0, 1.0, eps)
    print(f'  epsilon = {eps}: delta = {delta:.6f} works')
    print(f'    meaning: |x - 0| < {delta:.6f} => |sin(x)/x - 1| < {eps}')

# ── 8.2 Continuity checker ──
print(f'\n8.2-8.3 Continuity Analysis')
print('=' * 60)

def check_continuity(name: str, f: Callable, points: List[float],
                     epsilon: float = 0.01) -> None:
    """Check continuity at specific points."""
    for a in points:
        # Check from left and right
        h_vals = np.logspace(-1, -10, 50)
        
        try:
            f_a = f(a)
        except (ValueError, ZeroDivisionError):
            print(f'  {name} at x={a}: f(a) UNDEFINED -> discontinuous')
            continue
        
        left_lim = f(a - h_vals[-1])
        right_lim = f(a + h_vals[-1])
        
        if np.isnan(f_a) or np.isinf(f_a):
            print(f'  {name} at x={a}: f(a) = {f_a} -> discontinuous')
        elif not np.isclose(left_lim, right_lim, atol=epsilon):
            print(f'  {name} at x={a}: JUMP discontinuity')
            print(f'    left limit ≈ {left_lim:.6f}, right limit ≈ {right_lim:.6f}')
        elif not np.isclose(left_lim, f_a, atol=epsilon):
            print(f'  {name} at x={a}: REMOVABLE discontinuity')
            print(f'    limit ≈ {left_lim:.6f}, but f({a}) = {f_a:.6f}')
        else:
            print(f'  {name} at x={a}: CONTINUOUS (f(a)={f_a:.6f})')

# Continuous functions
check_continuity('sigmoid', lambda x: 1/(1+np.exp(-x)), [0, -5, 5])
check_continuity('ReLU', lambda x: max(0, x), [0, -1, 1])

# Discontinuous: Heaviside step
heaviside = lambda x: 1.0 if x > 0 else (0.5 if x == 0 else 0.0)
check_continuity('Heaviside', heaviside, [0])

# ── 8.5 Visualise discontinuity types ──
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

x = np.linspace(-2, 2, 1000)

# Removable: sinc at 0
axes[0].plot(x, np.where(np.abs(x) < 1e-10, np.nan, np.sin(x)/x), 'b-', linewidth=2)
axes[0].scatter([0], [1], c='red', s=60, zorder=5, label='Limit = 1')
axes[0].scatter([0], [0], c='white', edgecolors='blue', s=60, zorder=5, label='Not defined')
axes[0].set_title('Removable: sin(x)/x at x=0', fontweight='bold')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)

# Jump: Heaviside
x_neg = x[x < 0]
x_pos = x[x > 0]
axes[1].plot(x_neg, np.zeros_like(x_neg), 'b-', linewidth=2)
axes[1].plot(x_pos, np.ones_like(x_pos), 'b-', linewidth=2)
axes[1].scatter([0], [0.5], c='blue', s=60, zorder=5)
axes[1].scatter([0], [0], c='white', edgecolors='blue', s=60, zorder=5)
axes[1].scatter([0], [1], c='white', edgecolors='blue', s=60, zorder=5)
axes[1].set_title('Jump: Heaviside step at x=0', fontweight='bold')
axes[1].set_ylim(-0.3, 1.3)
axes[1].grid(True, alpha=0.3)

# Essential: sin(1/x)
x_ess = np.linspace(-2, 2, 10000)
x_ess = x_ess[np.abs(x_ess) > 0.001]
axes[2].plot(x_ess, np.sin(1/x_ess), 'b-', linewidth=0.5, alpha=0.7)
axes[2].set_title('Essential: sin(1/x) near x=0', fontweight='bold')
axes[2].set_xlim(-0.5, 0.5)
axes[2].grid(True, alpha=0.3)

for ax in axes:
    ax.axhline(0, color='k', linewidth=0.5)
    ax.axvline(0, color='k', linewidth=0.5)

plt.suptitle('Types of Discontinuities', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Hard attention (argmax) is like Heaviside: discontinuous, no gradient.')
print('Soft attention (softmax) is continuous: gradients flow smoothly.')

9. Differentiable Functions and Derivatives

9.1 The Derivative as a Function

The derivative f:RRf': \mathbb{R} \to \mathbb{R} is itself a function, mapping each point to the instantaneous rate of change:

f(a)=limh0f(a+h)f(a)hf'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}

Higher-order derivatives f,f,f(n)f'', f''', f^{(n)} are also functions. Differentiation is a linear operator: D(αf+βg)=αD(f)+βD(g)D(\alpha f + \beta g) = \alpha D(f) + \beta D(g).

9.2 Partial Derivatives and the Jacobian

For f:RnRf: \mathbb{R}^n \to \mathbb{R}: the gradient f=(fx1,,fxn)T\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)^T points in the direction of steepest ascent.

For f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m (vector-valued): the Jacobian JRm×nJ \in \mathbb{R}^{m \times n}:

Jij=fixjJ_{ij} = \frac{\partial f_i}{\partial x_j}

The Jacobian is the linear approximation to ff at a point: f(x+δ)f(x)+Jδf(x + \delta) \approx f(x) + J \cdot \delta.

9.3 The Chain Rule

For composition gfg \circ f: (gf)(a)=g(f(a))f(a)(g \circ f)'(a) = g'(f(a)) \cdot f'(a)

Multivariable: Jh(x)=Jg(f(x))Jf(x)J_{h}(x) = J_g(f(x)) \cdot J_f(x) — Jacobian of composition = product of Jacobians.

Backpropagation IS the chain rule applied to the composition of all network layers.

9.4 Differentiability Implies Continuity

If ff is differentiable at aa, then ff is continuous at aa. But the converse fails: x|x| is continuous everywhere but not differentiable at x=0x = 0.

9.5 Taylor's Theorem

Any smooth function is locally approximated by a polynomial:

f(x)=f(a)+f(a)(xa)+f(a)2!(xa)2++f(n)(a)n!(xa)n+Rn(x)f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_n(x)

Multivariate: f(x+δ)f(x)+f(x)Tδ+12δTHδ+f(x + \delta) \approx f(x) + \nabla f(x)^T \delta + \frac{1}{2}\delta^T H \delta + \cdots (HH = Hessian)

AI:

  • Gradient descent = first-order Taylor approximation
  • Newton's method = second-order Taylor approximation
  • Understanding convergence requires Taylor analysis

Code cell 42

# ══════════════════════════════════════════════════════════════════
# 9.1–9.3 Derivatives, Jacobians, and the Chain Rule
# ══════════════════════════════════════════════════════════════════

# ── 9.1 Derivative as a function ──
print('9.1 The Derivative as a Function')
print('=' * 55)

def numerical_deriv(f, x, h=1e-7):
    """Central difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Show that derivative is a new function derived from f
funcs = [
    ('f(x) = x^3',     lambda x: x**3,           "f'(x) = 3x^2"),
    ('f(x) = sin(x)',   lambda x: np.sin(x),      "f'(x) = cos(x)"),
    ('f(x) = e^x',      lambda x: np.exp(x),      "f'(x) = e^x"),
    ('f(x) = sigmoid',  lambda x: 1/(1+np.exp(-x)), "f'(x) = sig(x)(1-sig(x))"),
]

x_test = np.array([0.0, 1.0, -1.0, 2.0])
print(f'{"Function":<18} {"x":<6} {"f(x)":<12} {"f\'(x) numerical":<18} {"f\'(x) analytic":<18}')
print('-' * 75)

analytic_derivs = [
    lambda x: 3*x**2,
    lambda x: np.cos(x),
    lambda x: np.exp(x),
    lambda x: (1/(1+np.exp(-x))) * (1 - 1/(1+np.exp(-x))),
]

for (name, fn, deriv_name), analytic in zip(funcs, analytic_derivs):
    for x in [0.0, 1.0]:
        num_d = numerical_deriv(fn, x)
        ana_d = analytic(x)
        print(f'{name:<18} {x:<6.1f} {fn(x):<12.6f} {num_d:<18.6f} {ana_d:<18.6f}')

# ── 9.2 Jacobian ──
print(f'\n9.2 Jacobian Matrix')
print('=' * 55)

def compute_jacobian(f: Callable, x: np.ndarray, h: float = 1e-7) -> np.ndarray:
    """Compute Jacobian of f: R^n -> R^m at point x."""
    n = len(x)
    f_x = f(x)
    m = len(f_x)
    J = np.zeros((m, n))
    
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    
    return J

# Example: f(x1, x2) = [x1^2 + x2, sin(x1*x2)]
def f_vec(x):
    return np.array([x[0]**2 + x[1], np.sin(x[0] * x[1])])

x0 = np.array([1.0, 2.0])
J = compute_jacobian(f_vec, x0)
print(f'f(x1, x2) = [x1^2 + x2, sin(x1*x2)]')
print(f'At x = {x0}:')
print(f'  f(x) = {f_vec(x0)}')
print(f'  Jacobian J =')
print(f'    {J}')
print(f'  J[0,0] = df1/dx1 = 2*x1 = {2*x0[0]:.4f}')
print(f'  J[0,1] = df1/dx2 = 1 = 1.0000')
print(f'  J[1,0] = df2/dx1 = x2*cos(x1*x2) = {x0[1]*np.cos(x0[0]*x0[1]):.4f}')
print(f'  J[1,1] = df2/dx2 = x1*cos(x1*x2) = {x0[0]*np.cos(x0[0]*x0[1]):.4f}')

# ── 9.3 Chain rule: Jacobian of composition ──
print(f'\n9.3 Chain Rule: J_{{g.f}}(x) = J_g(f(x)) * J_f(x)')
print('=' * 55)

def g_vec(y):
    return np.array([y[0] * y[1], y[0] + y[1]**2])

# Composition h = g . f
def h_composed(x):
    return g_vec(f_vec(x))

# Jacobian of composition (direct)
J_h_direct = compute_jacobian(h_composed, x0)

# Jacobian via chain rule: J_g(f(x)) * J_f(x)
J_f = compute_jacobian(f_vec, x0)
J_g = compute_jacobian(g_vec, f_vec(x0))
J_h_chain = J_g @ J_f

print(f'J_h (direct computation):')
print(f'  {J_h_direct}')
print(f'J_h (chain rule: J_g @ J_f):')
print(f'  {J_h_chain}')
print(f'Match? {np.allclose(J_h_direct, J_h_chain)}')
print('-> Backpropagation is exactly this: J_loss = J_L @ J_(L-1) @ ... @ J_1')

Code cell 43

# ══════════════════════════════════════════════════════════════════
# 9.5 Taylor's Theorem — Approximation Quality
# ══════════════════════════════════════════════════════════════════

from math import factorial

def taylor_approx(f_derivs: list, a: float, x: np.ndarray) -> np.ndarray:
    """
    Taylor approximation of f around point a.
    f_derivs[k] = f^(k)(a) for k = 0, 1, 2, ...
    """
    result = np.zeros_like(x, dtype=float)
    for k, dk in enumerate(f_derivs):
        result += dk * (x - a)**k / factorial(k)
    return result

# Taylor series of e^x around a=0: all derivatives = 1
x = np.linspace(-3, 3, 500)
a = 0.0

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: e^x Taylor approximations
f_true = np.exp(x)
axes[0].plot(x, f_true, 'k-', linewidth=2, label='$e^x$ (exact)')

for n_terms in [1, 2, 3, 5, 8]:
    derivs = [1.0] * n_terms  # all derivatives of e^x at 0 are 1
    y_approx = taylor_approx(derivs, a, x)
    axes[0].plot(x, y_approx, '--', linewidth=1.5, label=f'Order {n_terms-1}', alpha=0.7)

axes[0].set_ylim(-5, 25)
axes[0].set_title('Taylor Approximation of $e^x$ at $a=0$', fontweight='bold')
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlabel('x')

# Right: Approximation error vs order
errors = []
orders = range(1, 15)
x_test_point = 1.0  # approximate e^1
for n in orders:
    derivs = [1.0] * n
    approx = taylor_approx(derivs, 0.0, np.array([x_test_point]))[0]
    errors.append(abs(approx - np.e))

axes[1].semilogy(list(orders), errors, 'bo-', linewidth=2, markersize=6)
axes[1].set_xlabel('Taylor Order')
axes[1].set_ylabel('|Approximation Error| at x=1')
axes[1].set_title('Error Drops Exponentially with Order', fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('Taylor\'s theorem connects to AI optimisation:')
print('  Order 0: f(x) ≈ f(a)                        -> constant approximation')
print('  Order 1: f(x) ≈ f(a) + f\'(a)(x-a)           -> gradient descent uses this')
print('  Order 2: f(x) ≈ f(a) + f\'(a)(x-a) + ½f\'\'(a)(x-a)²  -> Newton\'s method uses this')
print('  Higher orders: better approximation, but more expensive to compute')

10. Measure-Theoretic View of Functions

"Probability is a measure on a σ-algebra; random variables are measurable functions." — Andrey Kolmogorov

When we move from discrete to continuous domains, the notion of "function" must be refined. Measure theory provides the rigorous foundation for probability, integration, and the mathematical framework underlying modern machine learning.


10.1 Measurable Functions

Definition. Let (X,F)(X, \mathcal{F}) and (Y,G)(Y, \mathcal{G}) be measurable spaces (sets equipped with σ-algebras). A function f:XYf: X \to Y is measurable if:

BG:f1(B)F\forall\, B \in \mathcal{G}:\quad f^{-1}(B) \in \mathcal{F}

That is, the preimage of every measurable set in the codomain is measurable in the domain.

Why this matters for ML:

  • A random variable X:ΩRX: \Omega \to \mathbb{R} is simply a measurable function from a probability space (Ω,F,P)(\Omega, \mathcal{F}, P) to (R,B(R))(\mathbb{R}, \mathcal{B}(\mathbb{R}))
  • Neural network outputs must be measurable functions of inputs for probabilistic reasoning to be valid
  • Loss expectations E[L(f(X),Y)]\mathbb{E}[L(f(X), Y)] only exist when LfL \circ f is measurable

Key Properties of Measurable Functions

PropertyStatementML Relevance
Compositionf,gf, g measurable \Rightarrow gfg \circ f measurableDeep network layers preserve measurability
Pointwise limitslimnfn\lim_{n \to \infty} f_n measurable if each fnf_n isTraining convergence stays in measurable world
Algebraic closuref+gf + g, fgf \cdot g, max(f,g)\max(f,g) measurableReLU, addition of features are measurable
Borel functionsContinuous \Rightarrow measurable (w.r.t. Borel σ-algebra)Standard activations are automatically measurable

10.2 Lebesgue Integration

The Lebesgue integral generalises the Riemann integral and is essential for probability theory.

For a non-negative measurable function f:X[0,]f: X \to [0, \infty]:

Xfdμ=sup{Xsdμ:0sf,  s simple}\int_X f \, d\mu = \sup\left\{\int_X s \, d\mu : 0 \le s \le f, \; s \text{ simple}\right\}

where a simple function s=i=1nai1Ais = \sum_{i=1}^{n} a_i \, \mathbf{1}_{A_i} for measurable sets AiA_i.

Connection to expectation:

E[f(X)]=Ωf(X(ω))dP(ω)=Rf(x)dPX(x)\mathbb{E}[f(X)] = \int_\Omega f(X(\omega)) \, dP(\omega) = \int_{\mathbb{R}} f(x) \, dP_X(x)

Lebesgue vs Riemann

AspectRiemannLebesgue
ApproachPartition the domainPartition the range
HandlesContinuous/piecewise continuousAny measurable function
Limit theoremsLimitedMonotone & Dominated Convergence
ProbabilityInadequate for general theoryFoundation of modern probability
ML contextSimple 1D integralsExpected loss, KL divergence, mutual information

10.3 Almost Everywhere Properties

Definition. A property holds almost everywhere (a.e.) with respect to measure μ\mu if the set where it fails has measure zero:

μ({xX:property fails at x})=0\mu(\{x \in X : \text{property fails at } x\}) = 0

Key results:

  • Functions equal a.e. have the same integral: f=gf = g a.e. fdμ=gdμ\Rightarrow \int f \, d\mu = \int g \, d\mu
  • Dominated Convergence Theorem (DCT): If fnff_n \to f a.e. and fng|f_n| \le g with gdμ<\int g \, d\mu < \infty, then fndμfdμ\int f_n \, d\mu \to \int f \, d\mu
  • Monotone Convergence Theorem (MCT): If 0f1f20 \le f_1 \le f_2 \le \cdots a.e., then limfndμ=limfndμ\int \lim f_n \, d\mu = \lim \int f_n \, d\mu

ML Significance:

  • DCT justifies interchanging limits and expectations in SGD convergence proofs
  • Two models that differ on a measure-zero set of inputs are "the same" for all practical purposes
  • Dropout can be analysed as modifying functions on measure-zero sets during training

10.4 Probability Distributions as Functions

Every probability distribution can be characterised by several function representations:

Cumulative Distribution Function (CDF)

FX(x)=P(Xx)=xfX(t)dtF_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t) \, dt

Properties: right-continuous, non-decreasing, limxF(x)=0\lim_{x \to -\infty} F(x) = 0, limxF(x)=1\lim_{x \to \infty} F(x) = 1.

Probability Density Function (PDF)

fX(x)=dFXdx(x)(when the derivative exists)f_X(x) = \frac{dF_X}{dx}(x) \quad \text{(when the derivative exists)}

Moment-Generating Function (MGF)

MX(t)=E[etX]=etxfX(x)dxM_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx

Characteristic Function

φX(t)=E[eitX]\varphi_X(t) = \mathbb{E}[e^{itX}]

Always exists (unlike MGF) and uniquely determines the distribution.

FunctionDomain → RangeAlways Exists?Uniquely Determines Distribution?
CDFR[0,1]\mathbb{R} \to [0,1]YesYes
PDFR[0,)\mathbb{R} \to [0,\infty)Only for abs. continuousYes (when exists)
MGFR[0,)\mathbb{R} \to [0,\infty)Not alwaysYes (in neighbourhood of 0)
Characteristic fnRC\mathbb{R} \to \mathbb{C}YesYes

Reference: Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley. Kolmogorov, A.N. (1933). Foundations of the Theory of Probability.

Code cell 45

# ══════════════════════════════════════════════════════════════════
# SECTION 10 — Measure-Theoretic View: Demonstrations
# ══════════════════════════════════════════════════════════════════

# --- 10.1  Measurable Functions: Preimage Verification ---

def is_measurable_discrete(f: dict, sigma_algebra_domain: list, sigma_algebra_codomain: list) -> bool:
    """
    For a finite function (given as a dict), check measurability:
    every preimage of a set in the codomain σ-algebra must be in the domain σ-algebra.
    """
    for B in sigma_algebra_codomain:
        preimage = frozenset(x for x, y in f.items() if y in B)
        if preimage not in {frozenset(s) for s in sigma_algebra_domain}:
            return False
    return True

# Example: f: {1,2,3,4} -> {a,b}
f_map = {1: 'a', 2: 'a', 3: 'b', 4: 'b'}

# Trivial σ-algebra: {∅, {a,b}}
sigma_cod = [set(), {'a', 'b'}]
sigma_dom_trivial = [set(), {1, 2, 3, 4}]

# Finer σ-algebra on domain: {∅, {1,2}, {3,4}, {1,2,3,4}}
sigma_dom_fine = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]

# Full power-set σ-algebra on codomain
sigma_cod_full = [set(), {'a'}, {'b'}, {'a', 'b'}]

print("=== 10.1 Measurability Check ===")
print(f"f = {f_map}")
print(f"Trivial codomain σ-algebra → measurable: "
      f"{is_measurable_discrete(f_map, sigma_dom_trivial, sigma_cod)}")
print(f"Full codomain σ-algebra, fine domain σ-algebra → measurable: "
      f"{is_measurable_discrete(f_map, sigma_dom_fine, sigma_cod_full)}")

# With wrong domain σ-algebra (missing {1,2})
sigma_dom_wrong = [set(), {1, 3}, {2, 4}, {1, 2, 3, 4}]
print(f"Full codomain σ-algebra, wrong domain σ-algebra → measurable: "
      f"{is_measurable_discrete(f_map, sigma_dom_wrong, sigma_cod_full)}")

# --- 10.2  Lebesgue-style Integration via Simple Functions ---

def lebesgue_integral_simple(f_values: np.ndarray, x_grid: np.ndarray) -> float:
    """
    Approximate Lebesgue integral by partitioning the RANGE.
    Standard numerical approach: partition y-axis, measure preimage of each level set.
    """
    y_min, y_max = f_values.min(), f_values.max()
    n_levels = 200
    y_levels = np.linspace(y_min, y_max, n_levels + 1)
    dx = x_grid[1] - x_grid[0]
    
    integral = 0.0
    for i in range(n_levels):
        y_lo, y_hi = y_levels[i], y_levels[i + 1]
        # Measure of preimage: {x : y_lo ≤ f(x) < y_hi}
        mask = (f_values >= y_lo) & (f_values < y_hi)
        measure = np.sum(mask) * dx
        integral += y_lo * measure  # Lower sum approximation
    return integral

x = np.linspace(0, 1, 1000)
f_vals = x**2  # ∫₀¹ x² dx = 1/3

riemann_approx = np.trapezoid(f_vals, x)
lebesgue_approx = lebesgue_integral_simple(f_vals, x)

print(f"\n=== 10.2 Lebesgue vs Riemann Integration ===")
print(f"Integrating f(x) = x² over [0, 1]")
print(f"  Exact value:           {1/3:.6f}")
print(f"  Riemann (trapezoidal): {riemann_approx:.6f}")
print(f"  Lebesgue (level sets): {lebesgue_approx:.6f}")

# --- 10.3  Dominated Convergence Theorem Demonstration ---

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Sequence f_n(x) = x^n on [0,1]: converges to 0 a.e. (except at x=1)
x = np.linspace(0, 1, 500)

ax = axes[0]
for n in [1, 2, 5, 10, 50, 200]:
    ax.plot(x, x**n, label=f'n={n}', alpha=0.7)
ax.set_title('$f_n(x) = x^n$ converges to 0 a.e.')
ax.set_xlabel('x')
ax.set_ylabel('$f_n(x)$')
ax.legend(fontsize=7)
ax.grid(True, alpha=0.3)

# Integrals converge: ∫₀¹ xⁿ dx = 1/(n+1) → 0
ax = axes[1]
ns = np.arange(1, 101)
integrals = 1.0 / (ns + 1)
ax.plot(ns, integrals, 'b-', linewidth=2)
ax.axhline(y=0, color='r', linestyle='--', label='Limit = 0')
ax.set_title('$\\int_0^1 x^n \\, dx = \\frac{1}{n+1} \\to 0$ (DCT)')
ax.set_xlabel('n')
ax.set_ylabel('Integral value')
ax.legend()
ax.grid(True, alpha=0.3)

# Dominating function: g(x) = 1 (integrable on [0,1])
ax = axes[2]
ax.fill_between(x, 0, 1, alpha=0.2, color='red', label='$g(x) = 1$ (dominator)')
ax.plot(x, x**2, 'b-', label='$f_2(x) = x^2$')
ax.plot(x, x**10, 'g-', label='$f_{10}(x) = x^{10}$')
ax.set_title('$|f_n| \\leq g = 1$ (domination condition)')
ax.set_xlabel('x')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('10_dominated_convergence.png', dpi=100, bbox_inches='tight')
plt.show()
print("✓ DCT verified: ∫f_n → ∫(lim f_n) = 0")

# --- 10.4  Probability Distributions as Functions ---

from scipy import stats  # For comparison

x = np.linspace(-4, 4, 1000)

# Standard Normal N(0,1)
pdf_normal = (1 / np.sqrt(2 * np.pi)) * np.exp(-x**2 / 2)
cdf_normal = np.cumsum(pdf_normal) * (x[1] - x[0])  # Numerical CDF

# Moment-generating function M(t) = e^(μt + σ²t²/2) for N(μ,σ²)
t_vals = np.linspace(-2, 2, 200)
mgf_normal = np.exp(t_vals**2 / 2)  # μ=0, σ=1

# Characteristic function φ(t) = e^(iμt - σ²t²/2)
char_real = np.exp(-t_vals**2 / 2)  # Real part (imaginary = 0 for μ=0)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].plot(x, pdf_normal, 'b-', linewidth=2)
axes[0, 0].fill_between(x, pdf_normal, alpha=0.2)
axes[0, 0].set_title('PDF: $f_X(x) = \\frac{1}{\\sqrt{2\\pi}} e^{-x^2/2}$')
axes[0, 0].set_ylabel('Density')
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].plot(x, cdf_normal, 'r-', linewidth=2)
axes[0, 1].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0, 1].set_title('CDF: $F_X(x) = P(X \\leq x)$')
axes[0, 1].set_ylabel('Cumulative Probability')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(t_vals, mgf_normal, 'g-', linewidth=2)
axes[1, 0].set_title('MGF: $M_X(t) = e^{t^2/2}$')
axes[1, 0].set_xlabel('t')
axes[1, 0].set_ylabel('$M_X(t)$')
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(t_vals, char_real, 'm-', linewidth=2, label='Real part')
axes[1, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1, 1].set_title('Characteristic fn: $\\varphi_X(t) = e^{-t^2/2}$')
axes[1, 1].set_xlabel('t')
axes[1, 1].set_ylabel('$\\varphi_X(t)$')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('Four Function Representations of N(0,1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('10_distribution_functions.png', dpi=100, bbox_inches='tight')
plt.show()

print("\n=== 10.4 Distribution Function Properties ===")
print(f"PDF integrates to:  {np.trapezoid(pdf_normal, x):.6f}  (should be ≈ 1)")
print(f"CDF at x=0:         {cdf_normal[len(x)//2]:.6f}  (should be ≈ 0.5)")
print(f"MGF at t=0:         {mgf_normal[len(t_vals)//2]:.6f}  (should be 1.0)")
print(f"Char fn at t=0:     {char_real[len(t_vals)//2]:.6f}  (should be 1.0)")

11. Functions in Machine Learning

"Learning is the process of finding a function from a hypothesis class that best fits the data."

Machine learning is, at its mathematical core, the study of function approximation. This section formalises that connection and shows how every major ML concept maps to function-theoretic ideas.


11.1 Learning as Function Approximation

The fundamental ML problem: given data D={(xi,yi)}i=1N\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} drawn from an unknown distribution, find f^H\hat{f} \in \mathcal{H} that minimises:

f^=argminfH1Ni=1NL(f(xi),yi)+λΩ(f)\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i) + \lambda \, \Omega(f)

where:

  • H\mathcal{H} is the hypothesis class (a set of functions)
  • LL is the loss function measuring prediction error
  • Ω(f)\Omega(f) is a regulariser controlling complexity
  • λ>0\lambda > 0 balances fit vs. complexity

The Bias-Variance Decomposition (Function-Theoretic View)

E[(f(x)f^(x))2]total error=(E[f^(x)]f(x))2bias2:H too small+Var(f^(x))variance: f^ too sensitive+σ2irreducible noise\underbrace{\mathbb{E}[(f^*(x) - \hat{f}(x))^2]}_{\text{total error}} = \underbrace{(\mathbb{E}[\hat{f}(x)] - f^*(x))^2}_{\text{bias}^2: \mathcal{H} \text{ too small}} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{variance: } \hat{f} \text{ too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible noise}}

11.2 Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik et al., 1989). Let σ\sigma be a non-constant, bounded, continuous activation function. Then for any continuous function f:[0,1]dRf: [0,1]^d \to \mathbb{R} and any ε>0\varepsilon > 0, there exists NNN \in \mathbb{N} and parameters such that:

g(x)=i=1Nαiσ(wix+bi)g(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i)

satisfies supx[0,1]df(x)g(x)<ε\sup_{x \in [0,1]^d} |f(x) - g(x)| < \varepsilon.

Important caveats:

  1. Existence, not construction — the theorem doesn't tell you how to find the weights
  2. Width vs. depth — modern results show depth can be exponentially more efficient than width
  3. ReLU networks — Lu et al. (2017) proved UAT for ReLU with width d+4d + 4
  4. Practical gap — approximation ≠ learning (optimisation + generalisation are separate challenges)

11.3 Parameterised Function Families

A neural network defines a parameterised function family:

Fθ={fθ:RdRkθΘRp}\mathcal{F}_\theta = \{f_\theta : \mathbb{R}^d \to \mathbb{R}^k \mid \theta \in \Theta \subseteq \mathbb{R}^p\}
ArchitectureFunction FamilyParametersKey Property
Linearf(x)=Wx+bf(x) = Wx + bW,bW, bLinearity (can only learn linear boundaries)
MLPf=σLALσ1A1f = \sigma_L \circ A_L \circ \cdots \circ \sigma_1 \circ A_1{Wi,bi}\{W_i, b_i\}Universal approximation
CNNf(x)=pool(σ(Wx+b))f(x) = \text{pool}(\sigma(W * x + b))Shared WW (kernel)Translation equivariance
RNNht=σ(Whht1+Wxxt)h_t = \sigma(W_h h_{t-1} + W_x x_t)Shared Wh,WxW_h, W_xSequential memory
Transformerf(x)=FFN(Attn(x))f(x) = \text{FFN}(\text{Attn}(x))WQ,WK,WV,WOW_Q, W_K, W_V, W_OPermutation equivariance
ResNetf(x)=x+g(x)f(x) = x + g(x)Params of ggIdentity + perturbation

11.4 Symmetries and Equivariance

Definition. A function f:XYf: X \to Y is equivariant to a group action TgT_g if:

f(Tgx)=Sgf(x)gGf(T_g \cdot x) = S_g \cdot f(x) \quad \forall g \in G

When Sg=idS_g = \text{id}, we call ff invariant: f(Tgx)=f(x)f(T_g \cdot x) = f(x).

SymmetryGroupEquivariant ArchitectureInvariant Architecture
Translation(R2,+)(\mathbb{R}^2, +)CNNCNN + Global Pooling
PermutationSnS_nTransformer (self-attention)DeepSets (\sum)
RotationSO(3)SO(3)SE(3)-TransformerScalar output networks
Scale(R+,×)(\mathbb{R}^+, \times)Scale-equivariant CNNsNormalised features

11.5 Attention as a Function

Self-attention is a data-dependent function composition:

Attention(Q,K,V)=softmax(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Breaking this down as a composition of functions:

softmaxRn×nΔnscaleRRbilinearRn×d×Rn×dRn×n\underbrace{\text{softmax}}_{\mathbb{R}^{n \times n} \to \Delta^n} \circ \underbrace{\text{scale}}_{\mathbb{R} \to \mathbb{R}} \circ \underbrace{\text{bilinear}}_{\mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times n}}

Key insight: Unlike fixed composition in MLPs, attention weights are functions of the input itself — making attention a second-order function (a function that computes functions).


11.6 Loss Functions as Functionals

Loss functions map from function space to R\mathbb{R}:

L:FR,L(f)=E(x,y)D[L(f(x),y)]\mathcal{L}: \mathcal{F} \to \mathbb{R}, \quad \mathcal{L}(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[L(f(x), y)]
LossFormulaDomain → RangeMinimiser
MSEyf(x)2\|y - f(x)\|^2Rk×RkR0\mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}_{\ge 0}E[YX=x]\mathbb{E}[Y \mid X = x] (conditional mean)
Cross-entropyyilogf(x)i-\sum y_i \log f(x)_iΔk×ΔkR0\Delta^k \times \Delta^k \to \mathbb{R}_{\ge 0}True posterior P(YX)P(Y \mid X)
KL divergencepilog(pi/qi)\sum p_i \log(p_i / q_i)Δk×ΔkR0\Delta^k \times \Delta^k \to \mathbb{R}_{\ge 0}q=pq = p
Hingemax(0,1yf(x))\max(0, 1 - y \cdot f(x))R×{1,+1}R0\mathbb{R} \times \{-1,+1\} \to \mathbb{R}_{\ge 0}sign(E[YX])\text{sign}(\mathbb{E}[Y \mid X])

11.7 Function Transformations During Training

Training transforms the function fθ0fθ1fθTf_{\theta_0} \to f_{\theta_1} \to \cdots \to f_{\theta_T}:

θt+1=θtηθL(fθt)\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(f_{\theta_t})

This is a discrete dynamical system in function space. Key phenomena:

  • Loss landscape: The surface θL(fθ)\theta \mapsto \mathcal{L}(f_\theta) determines training dynamics
  • Mode connectivity: Different minima are often connected by low-loss paths
  • Neural tangent kernel: In the infinite-width limit, fθf_\theta evolves as a linear function of parameters, governed by the NTK: K(x,x)=θf(x),θf(x)K(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle

References: Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems. Hornik, K. et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks. Jacot, A. et al. (2018). Neural tangent kernel. NeurIPS.

Code cell 47

# ══════════════════════════════════════════════════════════════════
# SECTION 11 — Functions in Machine Learning: Demonstrations
# ══════════════════════════════════════════════════════════════════

# --- 11.2  Universal Approximation Theorem: Visualisation ---

def shallow_network(x, weights, biases, alphas, activation=lambda z: 1/(1+np.exp(-z))):
    """Single hidden-layer network: g(x) = Σ αᵢ σ(wᵢx + bᵢ)"""
    hidden = np.array([alpha * activation(w * x + b) 
                       for w, b, alpha in zip(weights, biases, alphas)])
    return np.sum(hidden, axis=0)

# Target function to approximate
x = np.linspace(-3, 3, 500)
target = np.sin(x) + 0.5 * np.cos(3*x)

# Approximations with increasing neurons
np.random.seed(42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, n_neurons in enumerate([3, 10, 50]):
    # Random (but seeded) parameters — in practice these would be learned
    w = np.random.randn(n_neurons) * 2
    b = np.random.randn(n_neurons)
    
    # Fit alphas by least squares (simple linear regression on hidden features)
    H = np.array([1/(1+np.exp(-(wi*x + bi))) for wi, bi in zip(w, b)])  # (n_neurons, len(x))
    alphas_fit, _, _, _ = np.linalg.lstsq(H.T, target, rcond=None)
    
    approx = shallow_network(x, w, b, alphas_fit)
    error = np.mean((target - approx)**2)
    
    axes[idx].plot(x, target, 'b-', linewidth=2, label='Target', alpha=0.7)
    axes[idx].plot(x, approx, 'r--', linewidth=2, label=f'{n_neurons} neurons')
    axes[idx].set_title(f'N = {n_neurons},  MSE = {error:.4f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Universal Approximation: More Neurons → Better Fit', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('11_universal_approximation.png', dpi=100, bbox_inches='tight')
plt.show()

# --- 11.3  Parameterised Function Families ---

print("\n=== 11.3 Parameterised Function Families ===")

# Count parameters for different architectures
architectures = {
    'Linear (784→10)': 784*10 + 10,
    'MLP (784→256→128→10)': 784*256+256 + 256*128+128 + 128*10+10,
    'CNN (3×3 kernel, 32 filters)': 3*3*1*32 + 32,  # Single conv layer
    'ResNet block (256→256)': 2*(256*256 + 256),  # Two conv layers
}

print(f"{'Architecture':<30} {'Parameters':>12}")
print("-" * 44)
for name, params in architectures.items():
    print(f"{name:<30} {params:>12,}")

# --- 11.4  Equivariance Demonstration ---

print("\n=== 11.4 Equivariance ===")

def check_equivariance(f, transform_in, transform_out, x, name=""):
    """Check: f(T_in(x)) ≈ T_out(f(x))"""
    lhs = f(transform_in(x))
    rhs = transform_out(f(x))
    is_equiv = np.allclose(lhs, rhs, atol=1e-10)
    print(f"  {name}: f(T(x)) ≈ T'(f(x))? {is_equiv}")
    return is_equiv

# Example 1: Sum is permutation-invariant
x_vec = np.array([1.0, 3.0, 2.0, 5.0])
x_perm = np.array([3.0, 5.0, 1.0, 2.0])  # Same elements, different order

print("Permutation invariance of sum:")
print(f"  sum([1,3,2,5]) = {np.sum(x_vec)}")
print(f"  sum([3,5,1,2]) = {np.sum(x_perm)}")
print(f"  Invariant? {np.isclose(np.sum(x_vec), np.sum(x_perm))}")

# Example 2: Sorting is permutation-equivariant (maps any permutation to sorted order)
print("\nSorting as equivariant function:")
print(f"  sort([1,3,2,5]) = {np.sort(x_vec)}")
print(f"  sort([3,5,1,2]) = {np.sort(x_perm)}")
print(f"  Equivariant? {np.array_equal(np.sort(x_vec), np.sort(x_perm))}")

# --- 11.5  Attention Mechanism Decomposition ---

print("\n=== 11.5 Attention as Function Composition ===")

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ / √d_k) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # Bilinear: ℝ^(n×d) × ℝ^(n×d) → ℝ^(n×n)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # Softmax: ℝ^(n×n) → Δⁿ
    output = weights @ V                           # Linear map: Δⁿ × ℝ^(n×d) → ℝ^(n×d)
    return output, weights

# Small example: 4 tokens, dimension 3
np.random.seed(42)
n_tokens, d_model = 4, 3
X = np.random.randn(n_tokens, d_model)

# Linear projections (simplified: same W for Q, K, V)
W_Q = np.random.randn(d_model, d_model) * 0.5
W_K = np.random.randn(d_model, d_model) * 0.5
W_V = np.random.randn(d_model, d_model) * 0.5

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, attn_weights = attention(Q, K, V)

print(f"Input shape:            {X.shape} (tokens × d_model)")
print(f"Q, K, V shapes:         {Q.shape} (after linear projection)")
print(f"Attention weights shape: {attn_weights.shape} (token-to-token)")
print(f"Output shape:           {output.shape} (same as input)")
print(f"\nAttention weight matrix (rows sum to 1):")
print(np.array2string(attn_weights, precision=3))
print(f"Row sums: {attn_weights.sum(axis=1)}")

# --- 11.6  Loss Function Landscape ---

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1D loss landscapes for different losses
y_true = 1.0
y_pred = np.linspace(-2, 4, 500)

# MSE
mse = (y_true - y_pred)**2
axes[0].plot(y_pred, mse, 'b-', linewidth=2)
axes[0].axvline(x=y_true, color='r', linestyle='--', label=f'y_true={y_true}')
axes[0].set_title('MSE Loss: $(y - \\hat{y})^2$')
axes[0].set_xlabel('$\\hat{y}$')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Cross-entropy (binary)
p_pred = np.linspace(0.01, 0.99, 500)
ce = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
axes[1].plot(p_pred, ce, 'g-', linewidth=2)
axes[1].axvline(x=y_true, color='r', linestyle='--', label=f'y_true={y_true}')
axes[1].set_title('Cross-Entropy: $-y\\log\\hat{y} - (1-y)\\log(1-\\hat{y})$')
axes[1].set_xlabel('$\\hat{y}$ (predicted probability)')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Hinge loss
hinge = np.maximum(0, 1 - y_true * y_pred)
axes[2].plot(y_pred, hinge, 'm-', linewidth=2)
axes[2].axvline(x=y_true, color='r', linestyle='--', label=f'y_true={y_true}')
axes[2].set_title('Hinge Loss: $\\max(0, 1 - y \\cdot \\hat{y})$')
axes[2].set_xlabel('$\\hat{y}$')
axes[2].set_ylabel('Loss')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.suptitle('Loss Functions as Mappings: ℝ → ℝ₊', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('11_loss_functions.png', dpi=100, bbox_inches='tight')
plt.show()

# --- 11.7  Training as Function Evolution ---

print("\n=== 11.7 Training as Function Transformation ===")

# Simple 1D regression: watch how the function evolves during gradient descent
np.random.seed(123)
x_train = np.linspace(-2, 2, 30)
y_train = np.sin(np.pi * x_train) + 0.1 * np.random.randn(30)

# Simple 2-layer network with manual gradient descent
def predict(x, w1, b1, w2, b2):
    h = np.maximum(0, np.outer(x, w1) + b1)  # ReLU hidden layer
    return h @ w2 + b2

n_hidden = 20
w1 = np.random.randn(n_hidden) * 0.5
b1 = np.zeros(n_hidden)
w2 = np.random.randn(n_hidden) * 0.5
b2 = 0.0
lr = 0.01

fig, axes = plt.subplots(1, 4, figsize=(16, 3.5))
x_plot = np.linspace(-2.5, 2.5, 200)
snapshots = [0, 50, 500, 5000]
snap_idx = 0

for epoch in range(5001):
    # Forward pass
    h = np.maximum(0, np.outer(x_train, w1) + b1)
    y_pred = h @ w2 + b2
    loss = np.mean((y_train - y_pred)**2)
    
    # Capture snapshots
    if epoch in snapshots:
        y_plot = predict(x_plot, w1, b1, w2, b2)
        axes[snap_idx].plot(x_plot, y_plot, 'r-', linewidth=2, label=f'f_θ (epoch {epoch})')
        axes[snap_idx].scatter(x_train, y_train, c='blue', s=15, alpha=0.6, label='Data')
        axes[snap_idx].plot(x_plot, np.sin(np.pi * x_plot), 'g--', alpha=0.5, label='True f')
        axes[snap_idx].set_title(f'Epoch {epoch}, Loss={loss:.4f}')
        axes[snap_idx].set_ylim(-2, 2)
        axes[snap_idx].legend(fontsize=7)
        axes[snap_idx].grid(True, alpha=0.3)
        snap_idx += 1
    
    # Backward pass (manual gradients)
    error = y_pred - y_train  # (30,)
    grad_b2 = np.mean(error)
    grad_w2 = h.T @ error / len(x_train)
    grad_h = np.outer(error, w2)  # (30, n_hidden)
    grad_h[h <= 0] = 0  # ReLU gradient
    grad_w1 = (x_train[:, None] * grad_h).mean(axis=0)
    grad_b1 = grad_h.mean(axis=0)
    
    w1 -= lr * grad_w1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2

plt.suptitle('Training = Evolving $f_θ$ Through Function Space', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('11_training_evolution.png', dpi=100, bbox_inches='tight')
plt.show()
print(f"✓ Function f_θ evolved from random → trained over 5000 gradient steps")

12. Category Theory Perspective

"Category theory is the mathematics of mathematics." — Eugenia Cheng

Category theory provides the most abstract and unifying view of functions. While it may seem distant from practical ML, its ideas have deeply influenced functional programming, type systems, and increasingly, the design of neural architectures and probabilistic programming.


12.1 Categories: The Abstraction

Definition. A category C\mathcal{C} consists of:

  1. A collection of objects: Ob(C)\text{Ob}(\mathcal{C})
  2. For each pair of objects A,BA, B, a collection of morphisms (arrows): Hom(A,B)\text{Hom}(A, B)
  3. A composition operation: if f:ABf: A \to B and g:BCg: B \to C, then gf:ACg \circ f: A \to C
  4. An identity morphism idA:AA\text{id}_A: A \to A for each object

subject to:

  • Associativity: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f
  • Identity: fidA=f=idBff \circ \text{id}_A = f = \text{id}_B \circ f

These are exactly the properties we proved for function composition in Section 4!


12.2 Important Categories for ML

CategoryObjectsMorphismsML Connection
SetSetsFunctionsData types and transformations
VectVector spacesLinear mapsLinear layers, embeddings
TopTopological spacesContinuous mapsManifold learning, topology of data
MeasMeasurable spacesMeasurable functionsProbabilistic models
ProbProbability spacesMeasure-preserving mapsSampling, distribution transforms
ParaEuclidean spacesParameterised functionsNeural network layers
DiffSmooth manifoldsSmooth mapsDifferential geometry of networks

The Category Para (Parameterised Maps)

This is perhaps the most ML-relevant category. Objects are Euclidean spaces, and a morphism from AA to BB is a pair (P,f)(P, f) where PP is a parameter space and f:P×ABf: P \times A \to B.

Neural network layer: fW,b(x)=σ(Wx+b)is a morphism in Para\text{Neural network layer: } f_{W,b}(x) = \sigma(Wx + b) \quad \text{is a morphism in Para}

Composition: (P1,f)(P2,g)=(P1×P2,fg)(P_1, f) \circ (P_2, g) = (P_1 \times P_2, \, f \circ g) where parameters are concatenated.


12.3 Functors: Structure-Preserving Maps Between Categories

Definition. A functor F:CDF: \mathcal{C} \to \mathcal{D} maps:

  • Objects: AF(A)A \mapsto F(A)
  • Morphisms: f:ABF(f):F(A)F(B)f: A \to B \mapsto F(f): F(A) \to F(B)

preserving composition and identities:

F(gf)=F(g)F(f),F(idA)=idF(A)F(g \circ f) = F(g) \circ F(f), \qquad F(\text{id}_A) = \text{id}_{F(A)}

ML Examples of Functors

FunctorFrom → ToActionML Interpretation
Feature map ϕ\phiDataVectMaps data points to vectorsEmbedding (Word2Vec, BERT)
Kernel kkSetHilbMaps to reproducing kernel Hilbert spaceKernel methods, Gaussian processes
BackpropParaop^{\text{op}}ParaReverses computation graphAutomatic differentiation
ForgetfulVectSetForget vector space structureTreating vectors as plain data
FreeSetVectGenerate vector space from setOne-hot encoding

12.4 Natural Transformations

Definition. A natural transformation η:FG\eta: F \Rightarrow G between functors F,G:CDF, G: \mathcal{C} \to \mathcal{D} assigns to each object AA a morphism ηA:F(A)G(A)\eta_A: F(A) \to G(A) such that for every morphism f:ABf: A \to B:

ηBF(f)=G(f)ηA\eta_B \circ F(f) = G(f) \circ \eta_A

This naturality square commutes:

F(A) ---F(f)--→ F(B)
 |                |
 η_A              η_B
 |                |
 ↓                ↓
G(A) ---G(f)--→ G(B)

ML interpretation: A natural transformation is a "systematic" way to convert between representations that respects the structure. Examples:

  • Batch normalisation transforms features naturally across different input dimensions
  • Adapter layers in transfer learning are natural transformations between pre-trained and fine-tuned feature functors
  • Attention can be viewed as a natural transformation in the functor category

12.5 Why Category Theory for AI?

Compositionality

Neural networks are fundamentally compositional: layers compose to form networks, networks compose to form systems. Category theory provides the language for this.

Type Safety in ML Pipelines

Categorical thinking prevents dimension mismatches and ensures pipeline validity (the "types" must match for morphisms to compose).

Probabilistic Programming

Languages like Pyro and Stan use monadic structures (from category theory) to compose probability distributions:

sample:Distribution(A)A(Kleisli arrow of the probability monad)\text{sample} : \text{Distribution}(A) \to A \qquad \text{(Kleisli arrow of the probability monad)}

Emerging Applications

ConceptCategorical ToolApplication
BackpropagationReverse-mode AD as functorEfficient gradient computation
Generative modelsProbability monadVAEs, normalising flows
Graph neural networksSheaves on graphsMessage passing formalisation
Composable AI systemsString diagramsVisual programming of ML pipelines
Symmetry in MLGroup actions as functorsEquivariant architectures

References: Mac Lane, S. (1998). Categories for the Working Mathematician, 2nd ed. Springer. Fong, B., Spivak, D. (2019). An Invitation to Applied Category Theory. Cambridge UP. Cruttwell, G. et al. (2022). Categorical foundations of gradient-based learning. ESOP.

Code cell 49

# ══════════════════════════════════════════════════════════════════
# SECTION 12 — Category Theory: Demonstrations
# ══════════════════════════════════════════════════════════════════

# --- 12.1  Category Verification ---

class Category:
    """Minimal category implementation to verify axioms."""
    
    def __init__(self, name: str):
        self.name = name
        self.objects = set()
        self.morphisms = {}  # (source, target) -> list of morphism names
        self.identities = {}  # object -> identity morphism
        self.compose_table = {}  # (f, g) -> h  where h = g ∘ f
    
    def add_object(self, obj):
        self.objects.add(obj)
        # Every object gets an identity morphism
        id_name = f"id_{obj}"
        self.identities[obj] = id_name
        self.morphisms.setdefault((obj, obj), []).append(id_name)
    
    def add_morphism(self, name, source, target):
        self.morphisms.setdefault((source, target), []).append(name)
    
    def set_composition(self, f, g, result):
        """Set g ∘ f = result (f goes first, then g)."""
        self.compose_table[(f, g)] = result
    
    def verify_identity_law(self, f, source, target):
        """Check: f ∘ id_source = f and id_target ∘ f = f"""
        id_s = self.identities[source]
        id_t = self.identities[target]
        
        # f ∘ id_source should exist and equal f
        right_id = self.compose_table.get((id_s, f), f)  # id composed with f
        left_id = self.compose_table.get((f, id_t), f)   # f composed with id
        return right_id == f and left_id == f
    
    def verify_associativity(self, f, g, h):
        """Check: h ∘ (g ∘ f) = (h ∘ g) ∘ f"""
        gf = self.compose_table.get((f, g))
        hg = self.compose_table.get((g, h))
        if gf is None or hg is None:
            return None  # Can't verify
        h_gf = self.compose_table.get((gf, h))
        hg_f = self.compose_table.get((f, hg))
        return h_gf == hg_f

# Build a small category: three objects, morphisms between them
print("=== 12.1 Category Axiom Verification ===\n")

C = Category("SmallCat")
for obj in ['A', 'B', 'C']:
    C.add_object(obj)

C.add_morphism('f', 'A', 'B')
C.add_morphism('g', 'B', 'C')
C.add_morphism('gf', 'A', 'C')  # The composition g ∘ f

# Set composition rules
C.set_composition('f', 'g', 'gf')  # g ∘ f = gf
# Identity compositions
for obj in ['A', 'B', 'C']:
    C.set_composition(C.identities[obj], C.identities[obj], C.identities[obj])
C.set_composition(C.identities['A'], 'f', 'f')
C.set_composition('f', C.identities['B'], 'f')
C.set_composition(C.identities['B'], 'g', 'g')
C.set_composition('g', C.identities['C'], 'g')
C.set_composition(C.identities['A'], 'gf', 'gf')
C.set_composition('gf', C.identities['C'], 'gf')

print(f"Category: {C.name}")
print(f"Objects: {C.objects}")
print(f"Morphisms: f: A→B, g: B→C, g∘f: A→C")
print(f"Identity law for f: {C.verify_identity_law('f', 'A', 'B')}")
print(f"Identity law for g: {C.verify_identity_law('g', 'B', 'C')}")

# --- 12.3  Functor Example: Free Functor (Set → Vect) ---

print("\n=== 12.3 Functor: One-Hot Encoding as Free Functor ===\n")

def free_functor_objects(finite_set):
    """Free functor on objects: map a finite set to a vector space (one-hot)."""
    elements = sorted(finite_set)
    n = len(elements)
    encoding = {}
    for i, elem in enumerate(elements):
        vec = np.zeros(n)
        vec[i] = 1.0
        encoding[elem] = vec
    return encoding, n

def free_functor_morphism(f_map, encoding_domain, encoding_codomain, dim_domain, dim_codomain):
    """
    Free functor on morphisms: a function f: A → B becomes a linear map F(f): ℝ^|A| → ℝ^|B|.
    The matrix has F(f)[j, i] = 1 if f maps the i-th element of A to the j-th element of B.
    """
    M = np.zeros((dim_codomain, dim_domain))
    domain_elements = sorted(encoding_domain.keys())
    codomain_elements = sorted(encoding_codomain.keys())
    
    for i, a in enumerate(domain_elements):
        b = f_map[a]
        j = codomain_elements.index(b)
        M[j, i] = 1.0
    return M

# Example: f: {cat, dog, fish} → {mammal, non-mammal}
animals = {'cat', 'dog', 'fish'}
classes = {'mammal', 'non-mammal'}
classify = {'cat': 'mammal', 'dog': 'mammal', 'fish': 'non-mammal'}

enc_animals, dim_a = free_functor_objects(animals)
enc_classes, dim_c = free_functor_objects(classes)
M = free_functor_morphism(classify, enc_animals, enc_classes, dim_a, dim_c)

print("Set function: classify = {cat→mammal, dog→mammal, fish→non-mammal}")
print(f"\nFree functor on objects (one-hot encoding):")
for elem, vec in sorted(enc_animals.items()):
    print(f"  {elem:>5}{vec}")
print(f"\nFree functor on morphism (linear map matrix):")
print(f"  F(classify) =\n{M}")

# Verify: F(classify)(one_hot(cat)) = one_hot(mammal)
for animal in sorted(animals):
    input_vec = enc_animals[animal]
    output_vec = M @ input_vec
    expected = enc_classes[classify[animal]]
    print(f"  F(classify) · e_{animal} = {output_vec}  "
          f"(= e_{classify[animal]}? {np.allclose(output_vec, expected)})")

# --- 12.4  Natural Transformation Verification ---

print("\n=== 12.4 Natural Transformation: Naturality Square ===\n")

# Two functors F, G: Set → Set
# F(X) = X × X  (diagonal/product functor)
# G(X) = X      (identity functor)
# Natural transformation η: F ⇒ G given by η_X(x, y) = x (first projection)

# For the naturality square to commute:
# For any f: A → B, we need: η_B ∘ F(f) = G(f) ∘ η_A

# Let A = {1,2,3}, B = {a,b}, f: A → B
f_nat = {1: 'a', 2: 'b', 3: 'a'}

# F(f): A×A → B×B defined as F(f)(x,y) = (f(x), f(y))
# G(f): A → B defined as G(f)(x) = f(x)
# η_A: A×A → A defined as η_A(x,y) = x (first projection)
# η_B: B×B → B defined as η_B(x,y) = x (first projection)

print("Checking naturality square for first projection η: (-)×(-) ⇒ Id")
print(f"f = {f_nat}\n")

all_commute = True
for x in [1, 2, 3]:
    for y in [1, 2, 3]:
        # Path 1: η_B ∘ F(f)  →  (x,y) ↦ F(f)(x,y) = (f(x),f(y)) ↦ η_B = f(x)
        path1 = f_nat[x]  # η_B(F(f)(x,y)) = η_B(f(x), f(y)) = f(x)
        
        # Path 2: G(f) ∘ η_A  →  (x,y) ↦ η_A(x,y) = x ↦ G(f)(x) = f(x)
        path2 = f_nat[x]  # G(f)(η_A(x,y)) = G(f)(x) = f(x)
        
        commutes = (path1 == path2)
        all_commute = all_commute and commutes

print(f"Naturality square commutes for all inputs? {all_commute}")
print("η_B ∘ F(f) = G(f) ∘ η_A  ✓")

# --- 12.5  Visualise: Neural Network as Categorical Diagram ---

print("\n=== 12.5 Neural Network as Morphism Composition in Para ===\n")

# A 3-layer MLP as composition of morphisms in the Para category
layers = [
    ("Input → Hidden₁",  "ℝ⁷⁸⁴",  "ℝ²⁵⁶",  784*256 + 256),
    ("Hidden₁ → Hidden₂", "ℝ²⁵⁶",  "ℝ¹²⁸",  256*128 + 128),
    ("Hidden₂ → Output",  "ℝ¹²⁸",  "ℝ¹⁰",   128*10 + 10),
]

total_params = 0
print(f"{'Morphism':<25} {'Source':>8} {'Target':>8} {'|Params|':>10}")
print("-" * 55)
for name, src, tgt, params in layers:
    print(f"{name:<25} {src:>8} {tgt:>8} {params:>10,}")
    total_params += params

print("-" * 55)
print(f"{'Composition (full network)':<25} {'ℝ⁷⁸⁴':>8} {'ℝ¹⁰':>8} {total_params:>10,}")
print(f"\nIn Para: objects are spaces, morphisms are (param_space, function) pairs")
print(f"Composition concatenates parameter spaces: P₁ × P₂ × P₃ = ℝ^{total_params}")

13. Common Mistakes and Pitfalls

Understanding where practitioners commonly go wrong deepens comprehension and prevents costly bugs in ML systems.

#MistakeWhy It's WrongCorrect UnderstandingML Consequence
1Confusing codomain with rangef(x)=x2f(x) = x^2: codomain can be R\mathbb{R}, but range is [0,)[0, \infty)Range \subseteq Codomain; they're equal only for surjectionsOutput layer dimension ≠ number of distinct outputs
2Assuming all functions have inversesOnly bijections have true inversesNon-bijective ⇒ use pseudo-inverse or left/right inverseDecoder can't perfectly invert encoder if dimensions differ
3Assuming composition is commutativegffgg \circ f \ne f \circ g in generalEven domain/codomain may not match for reversed orderLayer order in a network matters fundamentally
4Confusing linear with affinef(x)=Wx+bf(x) = Wx + b is affine, not linear (unless b=0b = 0)Linear maps must satisfy f(0)=0f(0) = 0"Linear layer" in PyTorch is actually affine
5Applying chain rule incorrectly for matrices(AB)AB(AB)' \ne A'B'; need Jacobiansx(gf)=Jg(f(x))Jf(x)\frac{\partial}{\partial x}(g \circ f) = J_g(f(x)) \cdot J_f(x)Backpropagation errors from wrong Jacobian multiplication order
6Ignoring domain restrictionslog(x)\log(x) undefined for x0x \le 0Always check domain before applyingNaN from log(0) in cross-entropy; use log(x + ε)
7Treating correlation as functional dependenceY=X2Y = X^2: XX and YY are dependent but can have corr=0\text{corr} = 0Correlation measures linear relationship onlyFeature selection based only on correlation misses nonlinear dependencies
8Confusing continuity with differentiabilityReLU is continuous but not differentiable at 0C0C1C2C^0 \supset C^1 \supset C^2 \supset \cdotsReLU works despite non-differentiability; subgradients suffice
9Assuming universal approximation means learnabilityUAT guarantees existence, not learnabilityApproximation ≠ Optimisation ≠ GeneralisationA network can represent any function but may not learn it from data
10Confusing function equality with pointwise equality on training dataTwo functions agreeing on training points may disagree elsewheref=gf = g requires agreement on entire domainOverfitting: perfect training accuracy, poor generalisation

Code cell 51

# ══════════════════════════════════════════════════════════════════
# SECTION 13 — Common Mistakes: Demonstrations
# ══════════════════════════════════════════════════════════════════

print("=== Pitfall Demonstrations ===\n")

# --- Pitfall 1: Codomain ≠ Range ---
print("Pitfall 1: Codomain vs Range")
f = lambda x: x**2
domain = np.linspace(-3, 3, 100)
codomain = set(range(-10, 11))  # integers from -10 to 10
actual_range = {round(f(x)) for x in range(-3, 4)}
print(f"  f(x) = x²,  domain = [-3, 3]")
print(f"  If codomain = {{-10,...,10}}, range = {sorted(actual_range)}")
print(f"  Range ⊂ Codomain: {actual_range.issubset(codomain)}")

# --- Pitfall 3: Composition is NOT commutative ---
print("\nPitfall 3: Composition Non-Commutativity")
f_p3 = lambda x: x + 1
g_p3 = lambda x: x * 2
x_test = 5
print(f"  f(x) = x+1, g(x) = 2x")
print(f"  (g ∘ f)(5) = g(f(5)) = g(6) = {g_p3(f_p3(x_test))}")
print(f"  (f ∘ g)(5) = f(g(5)) = f(10) = {f_p3(g_p3(x_test))}")
print(f"  g ∘ f ≠ f ∘ g: {g_p3(f_p3(x_test)) != f_p3(g_p3(x_test))}")

# --- Pitfall 4: Linear vs Affine ---
print("\nPitfall 4: Linear vs Affine")
W = np.array([[2, 1], [0, 3]])
b = np.array([1, -1])
x_zero = np.array([0, 0])

affine = lambda x: W @ x + b
print(f"  f(x) = Wx + b where b = {b}")
print(f"  f(0) = {affine(x_zero)}  (not zero → NOT linear!)")
print(f"  True linear map requires f(0) = 0")

# Check superposition principle: f(x+y) = f(x) + f(y)?
x1 = np.array([1, 2])
x2 = np.array([3, -1])
print(f"  f(x₁ + x₂) = {affine(x1 + x2)}")
print(f"  f(x₁) + f(x₂) = {affine(x1) + affine(x2)}")
print(f"  Equal? {np.allclose(affine(x1 + x2), affine(x1) + affine(x2))}  (fails due to b)")

# --- Pitfall 6: Domain Restrictions ---
print("\nPitfall 6: Domain Restrictions in Practice")
probs = np.array([0.9, 0.1, 0.0, 0.7])
print(f"  Probabilities: {probs}")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    log_probs_bad = np.log(probs)  # Will produce -inf for 0
    print(f"  log(probs):    {log_probs_bad}  ← -inf from log(0)!")

epsilon = 1e-7
log_probs_safe = np.log(probs + epsilon)
print(f"  log(probs+ε):  {np.round(log_probs_safe, 4)}  ← safe with ε={epsilon}")

# --- Pitfall 7: Correlation ≠ Functional Dependence ---
print("\nPitfall 7: Correlation vs Dependence")
np.random.seed(42)
X = np.random.uniform(-2, 2, 10000)
Y = X**2  # Perfect functional dependence: Y = f(X)

correlation = np.corrcoef(X, Y)[0, 1]
print(f"  Y = X², clearly Y depends on X")
print(f"  Pearson correlation: {correlation:.6f}  ← near zero!")
print(f"  Correlation misses nonlinear dependence entirely")

# --- Pitfall 8: Continuity vs Differentiability ---
print("\nPitfall 8: Continuity ≠ Differentiability")
relu = lambda x: np.maximum(0, x)
x_at_zero = 0.0

# ReLU is continuous at 0
left_limit = relu(-1e-10)
right_limit = relu(1e-10)
value_at_zero = relu(0)
print(f"  ReLU: continuous at 0? lim⁻={left_limit:.2e}, f(0)={value_at_zero}, lim⁺={right_limit:.2e}")
print(f"  Continuous: ✓")

# But not differentiable: left and right derivatives differ
left_deriv = (relu(0) - relu(-1e-10)) / 1e-10
right_deriv = (relu(1e-10) - relu(0)) / 1e-10
print(f"  Left derivative at 0:  {left_deriv:.1f}")
print(f"  Right derivative at 0: {right_deriv:.1f}")
print(f"  Differentiable: ✗ (left ≠ right)")
print(f"  → In practice, we use subgradient (typically set f'(0) = 0)")

# --- Pitfall 10: Training Fit ≠ Function Equality ---
print("\nPitfall 10: Pointwise Agreement ≠ Function Equality")
x_train_p10 = np.array([0, 1, 2, 3])
y_train_p10 = np.array([0, 1, 4, 9])

# Fit polynomial (degree 3 through 4 points = exact fit)
poly_overfit = np.polyfit(x_train_p10, y_train_p10, 3)
poly_correct = [1, 0, 0]  # x²

# Both agree on training data
x_test_p10 = np.array([0.5, 1.5, 2.5, 4.0, 5.0])
pred_overfit = np.polyval(poly_overfit, x_test_p10)
pred_correct = np.polyval(poly_correct, x_test_p10)

print(f"  Degree-3 poly (overfit) and x² agree on training points: "
      f"{np.allclose(np.polyval(poly_overfit, x_train_p10), y_train_p10)}")
print(f"  But disagree on test points:")
for xt, po, pc in zip(x_test_p10, pred_overfit, pred_correct):
    print(f"    x={xt}: overfit={po:.2f}, correct(x²)={pc:.2f}, diff={abs(po-pc):.2f}")

14. Practice Exercises

Work through these exercises to consolidate your understanding. Solutions are provided in the code cell below.


Exercise Set A: Foundations

A1. Let f:{1,2,3,4}{a,b,c}f: \{1, 2, 3, 4\} \to \{a, b, c\} with f={(1,a),(2,b),(3,a),(4,c)}f = \{(1,a), (2,b), (3,a), (4,c)\}.

  • Find the range of ff
  • Is ff injective? Surjective? Bijective?
  • Find f1({a})f^{-1}(\{a\}) (the preimage of {a}\{a\})

A2. Prove or disprove: If f:ABf: A \to B is injective and g:BCg: B \to C is injective, then gf:ACg \circ f: A \to C is injective.


Exercise Set B: Composition and Inverses

B1. Let f(x)=2x+1f(x) = 2x + 1 and g(x)=x2g(x) = x^2. Compute:

  • (gf)(3)(g \circ f)(3)
  • (fg)(3)(f \circ g)(3)
  • Verify that gffgg \circ f \ne f \circ g

B2. Find the inverse of f(x)=3x12x+5f(x) = \frac{3x - 1}{2x + 5} and verify ff1=idf \circ f^{-1} = \text{id}.


Exercise Set C: Special Function Classes

C1. Classify each function as linear, affine, polynomial, or none:

  • f(x)=3xf(x) = 3x
  • g(x)=3x+2g(x) = 3x + 2
  • h(x)=x2+1h(x) = x^2 + 1
  • σ(x)=11+ex\sigma(x) = \frac{1}{1 + e^{-x}}

C2. Prove that the composition of two linear functions is linear. (Hint: show f(g(αx+βy))=αf(g(x))+βf(g(y))f(g(\alpha x + \beta y)) = \alpha f(g(x)) + \beta f(g(y)))


Exercise Set D: Calculus of Functions

D1. Compute the Jacobian of f:R2R2f: \mathbb{R}^2 \to \mathbb{R}^2 defined by f(x,y)=(x2+y,xy)f(x, y) = (x^2 + y, \, xy) at the point (1,2)(1, 2).

D2. Use the chain rule: if g(u,v)=u2vg(u, v) = u^2 v and f(x)=(2x,x+1)f(x) = (2x, x+1), find ddx(gf)(x)\frac{d}{dx}(g \circ f)(x) at x=1x = 1.


Exercise Set E: Higher-Order Functions

E1. Write a Python function compose_n(f, n) that returns ff composed with itself nn times: f(n)=ffff^{(n)} = f \circ f \circ \cdots \circ f.

E2. The fixed-point iteration xn+1=cos(xn)x_{n+1} = \cos(x_n) converges. Find the fixed point numerically and verify that it satisfies x=cos(x)x^* = \cos(x^*).


Exercise Set F: Measure Theory and Probability

F1. Show that the CDF of the standard normal distribution is a valid CDF by checking: monotonicity, right-continuity, and boundary values.

F2. The quantile function Q(p)=F1(p)Q(p) = F^{-1}(p) is the (generalised) inverse of the CDF. For the exponential distribution with rate λ=2\lambda = 2, compute Q(0.5)Q(0.5) (the median).


Exercise Set G: ML Applications

G1. For a 3-class classification with softmax output [0.7,0.2,0.1][0.7, 0.2, 0.1] and true label class 0, compute the cross-entropy loss.

G2. Show that the softmax function is not Lipschitz continuous on all of Rn\mathbb{R}^n, but is Lipschitz on any bounded subset.


Exercise Set H: Category Theory

H1. Verify that Set (with sets as objects and functions as morphisms) forms a valid category by checking the identity and associativity axioms.

H2. Describe the one-hot encoding functor from FinSet to Vect and show it preserves composition.

Code cell 53

# ══════════════════════════════════════════════════════════════════
# SECTION 14 — Exercise Solutions
# ══════════════════════════════════════════════════════════════════

print("=" * 60)
print("EXERCISE SOLUTIONS")
print("=" * 60)

# --- A1 ---
print("\n--- A1: Basic Function Properties ---")
f_A1 = {1: 'a', 2: 'b', 3: 'a', 4: 'c'}
range_f = set(f_A1.values())
print(f"f = {f_A1}")
print(f"Range = {range_f}")
print(f"Injective? {len(f_A1.values()) == len(set(f_A1.values()))}  (1 and 3 both map to 'a')")
print(f"Surjective? {range_f == {'a', 'b', 'c'}}  (range = codomain)")
print(f"Bijective? No (not injective)")
preimage_a = {x for x, y in f_A1.items() if y == 'a'}
print(f"f⁻¹({{a}}) = {preimage_a}")

# --- A2 ---
print("\n--- A2: Composition Preserves Injectivity ---")
print("Proof: Suppose g(f(x₁)) = g(f(x₂)).")
print("Since g is injective: f(x₁) = f(x₂).")
print("Since f is injective: x₁ = x₂.")
print("Therefore g ∘ f is injective. ✓")

# Numerical verification
f_inj = lambda x: 2*x + 1  # injective
g_inj = lambda x: x**3      # injective
test_vals = np.random.randn(1000)
gf_vals = g_inj(f_inj(test_vals))
print(f"Numerical check: all g(f(xᵢ)) distinct? {len(set(np.round(gf_vals, 10))) == len(test_vals)}")

# --- B1 ---
print("\n--- B1: Composition ---")
f_B1 = lambda x: 2*x + 1
g_B1 = lambda x: x**2
print(f"(g ∘ f)(3) = g(f(3)) = g(7) = {g_B1(f_B1(3))}")
print(f"(f ∘ g)(3) = f(g(3)) = f(9) = {f_B1(g_B1(3))}")
print(f"g ∘ f ≠ f ∘ g: {g_B1(f_B1(3)) != f_B1(g_B1(3))}")

# --- B2 ---
print("\n--- B2: Finding Inverse ---")
print("f(x) = (3x - 1)/(2x + 5)")
print("Solve y = (3x-1)/(2x+5) for x:")
print("  y(2x+5) = 3x-1  →  2xy + 5y = 3x - 1  →  x(2y-3) = -5y-1")
print("  x = (-5y-1)/(2y-3) = (5y+1)/(3-2y)")

f_inv = lambda y: (5*y + 1) / (3 - 2*y)
f_B2 = lambda x: (3*x - 1) / (2*x + 5)
test_x = np.array([0, 1, -1, 2, 0.5])
roundtrip = np.array([f_inv(f_B2(x)) for x in test_x])
print(f"Verification f⁻¹(f(x)) = x:")
for x, rt in zip(test_x, roundtrip):
    print(f"  f⁻¹(f({x})) = {rt:.6f}")

# --- C1 ---
print("\n--- C1: Function Classification ---")
classifications = [
    ("f(x) = 3x",       "Linear",     "f(0)=0, f(ax+by)=af(x)+bf(y)"),
    ("g(x) = 3x + 2",   "Affine",     "g(0)=2≠0, so not linear"),
    ("h(x) = x² + 1",   "Polynomial", "degree 2"),
    ("σ(x) = 1/(1+e⁻ˣ)", "None (sigmoid)", "transcendental, not polynomial"),
]
for func, cls, reason in classifications:
    print(f"  {func:<22}{cls:<15} ({reason})")

# --- D1 ---
print("\n--- D1: Jacobian Computation ---")
print("f(x,y) = (x² + y,  xy)")
print("J = [[∂f₁/∂x, ∂f₁/∂y],  = [[2x, 1],")
print("     [∂f₂/∂x, ∂f₂/∂y]]     [ y, x]]")
x_d1, y_d1 = 1, 2
J = np.array([[2*x_d1, 1], [y_d1, x_d1]])
print(f"At (1,2): J = \n{J}")

# --- D2 ---
print("\n--- D2: Chain Rule ---")
print("g(u,v) = u²v,  f(x) = (2x, x+1)")
print("d/dx [g(f(x))] = ∇g(f(x)) · f'(x)")
print("  ∇g = (2uv, u²), f'(x) = (2, 1)")
x_d2 = 1
u, v = 2*x_d2, x_d2 + 1  # f(1) = (2, 2)
grad_g = np.array([2*u*v, u**2])  # ∇g at (2,2) = (8, 4)
f_prime = np.array([2, 1])
result = grad_g @ f_prime
print(f"  At x=1: f(1)=(2,2), ∇g(2,2)={grad_g}, f'(1)={f_prime}")
print(f"  d/dx [g(f(x))] at x=1 = {grad_g} · {f_prime} = {result}")

# --- E1 ---
print("\n--- E1: compose_n ---")
def compose_n(f, n):
    """Return f composed with itself n times."""
    def composed(x):
        result = x
        for _ in range(n):
            result = f(result)
        return result
    return composed

f_double = lambda x: 2 * x
f_cubed = compose_n(f_double, 3)
print(f"  f(x) = 2x,  f³(5) = f(f(f(5))) = {f_cubed(5)}  (should be 40)")

# --- E2 ---
print("\n--- E2: Fixed Point of cos ---")
x_fp = 1.0
for i in range(100):
    x_fp = np.cos(x_fp)
print(f"  Fixed point x* = {x_fp:.10f}")
print(f"  cos(x*) = {np.cos(x_fp):.10f}")
print(f"  |x* - cos(x*)| = {abs(x_fp - np.cos(x_fp)):.2e}")

# --- F2 ---
print("\n--- F2: Quantile Function ---")
print("Exponential(λ=2): F(x) = 1 - e^(-2x)")
print("Q(p) = F⁻¹(p) = -ln(1-p) / λ")
lam = 2
p_median = 0.5
Q_median = -np.log(1 - p_median) / lam
print(f"  Q(0.5) = -ln(0.5)/2 = {Q_median:.6f}")
print(f"  Verify: F(Q(0.5)) = 1 - e^(-2·{Q_median:.4f}) = {1 - np.exp(-lam * Q_median):.6f}")

# --- G1 ---
print("\n--- G1: Cross-Entropy Loss ---")
softmax_output = np.array([0.7, 0.2, 0.1])
true_label = 0  # Class 0
ce_loss = -np.log(softmax_output[true_label])
print(f"  Softmax output: {softmax_output}")
print(f"  True label: class {true_label}")
print(f"  CE loss = -log({softmax_output[true_label]}) = {ce_loss:.6f}")

# Full one-hot version
one_hot = np.array([1, 0, 0])
ce_full = -np.sum(one_hot * np.log(softmax_output))
print(f"  Full CE = -Σ yᵢ log(ŷᵢ) = {ce_full:.6f}  (same)")

print("\n" + "=" * 60)
print("All exercises solved ✓")
print("=" * 60)

15. Why This Matters for AI/ML — Impact Summary

Every concept in this notebook connects directly to building, understanding, and debugging machine learning systems. Here is the complete impact map:

SectionCore ConceptDirect ML ApplicationWhat Breaks Without It
1. IntuitionFunctions as mappingsUnderstanding what models doCan't reason about model behavior
2. Formal DefinitionsDomain, codomain, rangeInput validation, output spacesDimension mismatches, invalid inputs
3. Types (Inj/Surj/Bij)Function classificationInvertibility of encoders/decodersInformation loss in autoencoders
4. Compositiongfg \circ fDeep network = layer compositionWrong layer ordering
5. Inverse Functionsf1f^{-1}, pseudo-inverseGenerative models, normalising flowsCan't decode/invert representations
6. Special ClassesLinear, convex, LipschitzArchitecture choice, stabilityGradient explosion, non-convergence
7. Higher-OrderFunctionals, operatorsLoss functions, optimisersCan't formalise "learning"
8. Continuityε\varepsilon-δ\delta, smoothnessRobustness, adversarial attacksTiny input changes → huge output changes
9. DifferentiabilityDerivatives, JacobiansBackpropagation, gradient descentTraining is impossible
10. Measure TheoryMeasurable functionsProbabilistic models, expectationsInvalid probability computations
11. ML FunctionsUAT, parameterised familiesNetwork design, capacityWrong architecture choices
12. Category TheoryFunctors, compositionType-safe pipelines, symmetryAd-hoc designs without principled structure

Conceptual Bridge: What Comes Next

This notebook on Functions and Mappings is the gateway to everything that follows in the mathematics for AI curriculum.

                    ┌─────────────────────────────────────────────┐
                    │     FUNCTIONS AND MAPPINGS (This Notebook)   │
                    │  Domain → Codomain, Composition, Inverses   │
                    └──────────────────┬──────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              │                        │                        │
              ▼                        ▼                        ▼
    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │  LINEAR ALGEBRA  │    │    CALCULUS      │    │  PROBABILITY    │
    │  Linear maps     │    │  Differentiable  │    │  Measurable     │
    │  = Linear funcs  │    │  functions       │    │  functions      │
    │  Matrix = func   │    │  Gradients       │    │  Random vars    │
    │  repr            │    │  = func deriv    │    │  = measurable   │
    │                  │    │                  │    │  functions      │
    └────────┬─────┬──┘    └───────┬──────────┘    └───────┬────────┘
             │     │               │                       │
             │     └───────────────┼───────────────────────┘
             │                     │
             ▼                     ▼
    ┌─────────────────┐    ┌─────────────────┐
    │  OPTIMISATION    │    │  INFORMATION     │
    │  min f(θ)       │    │  THEORY          │
    │  Loss landscape │    │  H, KL, MI       │
    │  = function over │    │  = functionals   │
    │  parameter space │    │  on distributions│
    └────────┬────────┘    └───────┬──────────┘
             │                     │
             └──────────┬──────────┘
                        ▼
              ┌─────────────────┐
              │   DEEP LEARNING  │
              │   f_θ = σ∘W∘···  │
              │   Everything is  │
              │   a function     │
              └─────────────────┘

Key Takeaways

  1. A neural network IS a function — specifically, a parameterised composition of simpler functions
  2. Training IS optimisation — minimising a functional (loss) over function space
  3. Probability IS measure theory — random variables are measurable functions
  4. Backpropagation IS the chain rule — for composed functions
  5. Architecture design IS choosing the function class — the hypothesis space H\mathcal{H}

"In mathematics, you don't understand things. You just get used to them." — John von Neumann


References (Complete)

  1. Halmos, P. (1960). Naive Set Theory. Van Nostrand.
  2. Kolmogorov, A.N. & Fomin, S.V. (1975). Introductory Real Analysis. Dover.
  3. Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed. McGraw-Hill.
  4. Mac Lane, S. (1998). Categories for the Working Mathematician, 2nd ed. Springer.
  5. Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley.
  6. Goodfellow, I. et al. (2016). Deep Learning. MIT Press.
  7. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
  8. Hornik, K. et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  9. Vaswani, A. et al. (2017). Attention is all you need. NeurIPS.
  10. Jacot, A. et al. (2018). Neural tangent kernel. NeurIPS.
  11. Fong, B. & Spivak, D. (2019). An Invitation to Applied Category Theory. Cambridge UP.
  12. Cruttwell, G. et al. (2022). Categorical foundations of gradient-based learning. ESOP.

End of notebook. Proceed to 04-Summation-and-Product-Notation for the next topic.

Code cell 55

# ══════════════════════════════════════════════════════════════════
# SECTION 15 — Final Summary & Notebook Statistics
# ══════════════════════════════════════════════════════════════════

print("=" * 70)
print("   FUNCTIONS AND MAPPINGS — NOTEBOOK COMPLETE")
print("=" * 70)

sections = [
    ("1.  Intuitive Understanding",         "Functions as machines, AI pipeline view"),
    ("2.  Formal Definitions",              "Domain, codomain, range, function equality"),
    ("3.  Types of Functions",              "Injective, surjective, bijective, monotonic"),
    ("4.  Function Composition",            "Associativity, composition in ML, fixed points"),
    ("5.  Inverse Functions",               "Inverses, pseudo-inverses, Moore-Penrose"),
    ("6.  Special Function Classes",        "Linear, affine, polynomial, activation, Lipschitz"),
    ("7.  Higher-Order Functions",          "Functionals, operators, currying, function spaces"),
    ("8.  Continuity & Limits",             "ε-δ definition, types of discontinuity"),
    ("9.  Differentiable Functions",        "Derivatives, Jacobians, chain rule, Taylor"),
    ("10. Measure-Theoretic View",          "Measurable functions, Lebesgue, distributions"),
    ("11. Functions in ML",                 "UAT, parameterised families, attention, loss"),
    ("12. Category Theory",                 "Categories, functors, natural transformations"),
    ("13. Common Mistakes",                 "10 pitfalls with corrections"),
    ("14. Practice Exercises",              "8 exercise sets with solutions"),
    ("15. Impact Summary & Bridge",         "Complete roadmap to next topics"),
]

print(f"\n{'#':<4} {'Section':<37} {'Coverage'}")
print("-" * 70)
for i, (section, coverage) in enumerate(sections, 1):
    print(f"{i:<4} {section:<37} {coverage}")

print(f"\n{'─' * 70}")
print(f"Total sections:       {len(sections)}")
print(f"Key concepts covered: Functions, composition, inverses, continuity,")
print(f"                      differentiability, measure theory, category theory")
print(f"ML connections:       Neural networks, backpropagation, attention,")
print(f"                      loss functions, UAT, equivariance, training dynamics")
print(f"{'─' * 70}")
print(f"\n✓ Ready to proceed to: 04-Summation-and-Product-Notation")