Vector Spaces and Subspaces, Part 2: The Four Fundamental Subspaces (Section 7) to Subspaces in Functional Analysis (Section 11)

7. The Four Fundamental Subspaces

Recall: earlier sections used these subspaces informally:

  • Systems of Equations 4.3 used $\text{col}(A)$ and $\text{null}(A^\top)$ to characterise when $A\mathbf{x} = \mathbf{b}$ is solvable
  • Matrix Rank 4 used rank and null space dimension to state the rank-nullity theorem

This section is the canonical home: rigorous definitions, dimensional identities, bases computed from RREF, the orthogonality relationships, and their geometric interpretation as the complete decomposition of $\mathbb{R}^m$ and $\mathbb{R}^n$.

For any matrix $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = r$, Gilbert Strang identified four fundamental subspaces that together give a complete picture of how $A$ acts as a linear map $\mathbb{R}^n \to \mathbb{R}^m$. Understanding all four - their definitions, dimensions, bases, and mutual relationships - is the complete theory of linear systems.

7.1 Definition of All Four Subspaces

1. Column space $\text{col}(A) \subseteq \mathbb{R}^m$ (also called the image or range):

$$\text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$$

  • The set of all possible outputs of $A$; the "reachable" subspace in $\mathbb{R}^m$
  • $\dim(\text{col}(A)) = r$
  • Basis: pivot columns of $A$ (the original $A$, not its RREF)
  • The system $A\mathbf{x} = \mathbf{b}$ is consistent if and only if $\mathbf{b} \in \text{col}(A)$

2. Null space $\text{null}(A) \subseteq \mathbb{R}^n$ (also called the kernel):

$$\text{null}(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$$

  • The set of all inputs mapped to zero; the "invisible" directions in $\mathbb{R}^n$
  • $\dim(\text{null}(A)) = n - r$ (by Rank-Nullity)
  • Basis: free variable solution vectors from the RREF of $A$
  • $A$ cannot distinguish $\mathbf{x}$ from $\mathbf{x} + \mathbf{z}$ for any $\mathbf{z} \in \text{null}(A)$: both produce the same output

3. Row space $\text{row}(A) = \text{col}(A^\top) \subseteq \mathbb{R}^n$:

$$\text{row}(A) = \{A^\top \mathbf{y} : \mathbf{y} \in \mathbb{R}^m\} = \text{span of the rows of } A$$

  • The set of all linear combinations of the rows of $A$; the directions in $\mathbb{R}^n$ that $A$ "listens to"
  • $\dim(\text{row}(A)) = r$ (same as the column space - both equal the rank)
  • Basis: non-zero rows of the RREF of $A$
  • The projection of any input $\mathbf{x}$ onto $\text{row}(A)$ determines the output; the null space component of $\mathbf{x}$ is discarded

4. Left null space $\text{null}(A^\top) \subseteq \mathbb{R}^m$:

$$\text{null}(A^\top) = \{\mathbf{y} \in \mathbb{R}^m : A^\top \mathbf{y} = \mathbf{0}\}$$

  • The set of all output-space vectors orthogonal to the column space of $A$
  • $\dim(\text{null}(A^\top)) = m - r$ (by Rank-Nullity applied to $A^\top$)
  • Basis: free variable solution vectors from the RREF of $A^\top$
  • If $\mathbf{b} \in \text{null}(A^\top)$ with $\mathbf{b} \neq \mathbf{0}$, then $\mathbf{b} \perp \text{col}(A)$, which means $A\mathbf{x} = \mathbf{b}$ has no solution

Dimension summary:

| Subspace | Ambient space | Dimension | Orthogonal complement |
|---|---|---|---|
| $\text{row}(A)$ | $\mathbb{R}^n$ | $r$ | $\text{null}(A)$ |
| $\text{null}(A)$ | $\mathbb{R}^n$ | $n-r$ | $\text{row}(A)$ |
| $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\text{null}(A^\top)$ |
| $\text{null}(A^\top)$ | $\mathbb{R}^m$ | $m-r$ | $\text{col}(A)$ |

7.2 The Orthogonality Relations

The four subspaces pair into two orthogonal decompositions:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A) \qquad \text{(orthogonal direct sum in } \mathbb{R}^n\text{)}$$
$$\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top) \qquad \text{(orthogonal direct sum in } \mathbb{R}^m\text{)}$$

Proof that $\text{row}(A) \perp \text{null}(A)$.

Take any $\mathbf{x} \in \text{null}(A)$ and any $\mathbf{y} \in \text{row}(A) = \text{col}(A^\top)$. Then $\mathbf{y} = A^\top \mathbf{v}$ for some $\mathbf{v} \in \mathbb{R}^m$. Compute:

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \mathbf{x}^\top A^\top \mathbf{v} = (A\mathbf{x})^\top \mathbf{v} = \mathbf{0}^\top \mathbf{v} = 0$$

So every null space vector is orthogonal to every row space vector.

Since $\dim(\text{row}(A)) + \dim(\text{null}(A)) = r + (n-r) = n = \dim(\mathbb{R}^n)$ and the two subspaces are orthogonal (hence their intersection is $\{\mathbf{0}\}$), they form an orthogonal direct sum decomposition of $\mathbb{R}^n$. Every vector $\mathbf{x} \in \mathbb{R}^n$ decomposes uniquely as:

$$\mathbf{x} = \underbrace{\mathbf{x}_r}_{\in \text{row}(A)} + \underbrace{\mathbf{x}_n}_{\in \text{null}(A)}$$

And $A\mathbf{x} = A\mathbf{x}_r + A\mathbf{x}_n = A\mathbf{x}_r + \mathbf{0} = A\mathbf{x}_r$. The null space component is silenced; only the row space component matters.

7.3 Strang's Big Picture

The complete action of $A$ as a linear map $\mathbb{R}^n \to \mathbb{R}^m$ is captured in the following diagram:

STRANG'S FOUR FUNDAMENTAL SUBSPACES

        R^n (input space)                 R^m (output space)

   +----------------------+          +------------------------+
   | row(A)               |    A     | col(A)                 |
   | dim = r              | -------> | dim = r                |
   |                      |          |                        |
   | "the listening space"|          | "the reachable space"  |
   | A is sensitive here  |          | A can write here       |
   +----------------------+          +------------------------+

   +----------------------+          +------------------------+
   | null(A)              |    A     | null(A^T)              |
   | dim = n - r          | -------> | dim = m - r            |
   |                      |  (-> 0)  |                        |
   | "the silent space"   |          | "the unreachable       |
   | A maps this to 0     |          |  space"                |
   +----------------------+          +------------------------+

  Key facts:
  - A is a bijection from row(A) to col(A) - restricted to row(A),
    A is invertible.
  - A maps all of null(A) to the single point {0} in R^m.
  - No input - however large - can produce an output in null(A^T).
  - The orthogonal complements pair up:
    row(A)^⊥ = null(A)  and  col(A)^⊥ = null(A^T)

The action of $A$, completely described. For any input $\mathbf{x} = \mathbf{x}_r + \mathbf{x}_n$ (row space + null space decomposition):

$$A\mathbf{x} = A\mathbf{x}_r \in \text{col}(A)$$

The null space component vanishes; the row space component is mapped bijectively into the column space. The left null space is entirely inaccessible from the input side.

7.4 Computing the Four Subspaces

The systematic procedure to find bases for all four fundamental subspaces of a matrix $A \in \mathbb{R}^{m \times n}$:

Step 1: Row reduce $A$ to RREF.

$$A \xrightarrow{\text{row ops}} R = \text{RREF}(A)$$

Identify the $r$ pivot positions (row $i$, column $j$ pairs where $R_{ij} = 1$ is the leading 1 of row $i$). Let $r = \text{rank}(A)$.

Step 2: $\text{null}(A)$.

  • Free variables: columns of $R$ without a pivot (say columns $j_1, \ldots, j_{n-r}$)
  • For each free variable $x_{j_\ell}$: set $x_{j_\ell} = 1$, all other free variables $= 0$, and solve for the pivot variables from $R\mathbf{x} = \mathbf{0}$
  • The resulting vector $\mathbf{n}_\ell \in \mathbb{R}^n$ is the $\ell$-th null space basis vector
  • Repeat for $\ell = 1, \ldots, n-r$; these $n-r$ vectors form a basis for $\text{null}(A)$

Step 3: $\text{col}(A)$.

  • The pivot columns of the original matrix $A$ (not $R$) form a basis for $\text{col}(A)$
  • Specifically: if pivots appear in columns $p_1, p_2, \ldots, p_r$, then $\{\mathbf{a}_{p_1}, \mathbf{a}_{p_2}, \ldots, \mathbf{a}_{p_r}\}$ (columns of $A$) is a basis

Why the original $A$, not $R$? Row operations change the column space, although they preserve the linear dependencies between columns. The pivot positions therefore identify which columns are independent, but the actual column vectors must be taken from the original $A$ to get a basis for the original column space.

Step 4: $\text{row}(A)$.

  • The non-zero rows of $R$ (the RREF of $A$) form a basis for $\text{row}(A)$
  • Unlike the column space, the row space is preserved by row operations; the non-zero rows of $R$ are particular convenient linear combinations of the rows of $A$ that happen to be in reduced form

Step 5: $\text{null}(A^\top)$.

  • Row reduce $A^\top$ to its RREF, then apply the null space procedure (Step 2) to $A^\top$
  • Alternatively: if you have already computed the four dimensions (and bases for three of the four subspaces), use the complementary dimension argument
  • Basis vectors of $\text{null}(A^\top)$ can also be read off from certain row operations applied to $[A \mid I_m]$
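
The whole procedure can be run mechanically. Below is a minimal sketch using SymPy's exact `rref` and `nullspace` routines; the $3 \times 4$ matrix is a hypothetical example chosen to have rank 2, not one from the lesson.

```python
# Bases for all four fundamental subspaces via exact RREF (SymPy).
import sympy as sp

A = sp.Matrix([[1, 2, 0, 1],
               [2, 4, 1, 3],
               [3, 6, 1, 4]])          # hypothetical 3x4 example, rank 2
m, n = A.shape

R, pivots = A.rref()                   # RREF and the pivot column indices
r = len(pivots)                        # rank(A)

col_basis  = [A[:, j] for j in pivots]    # pivot columns of the ORIGINAL A
row_basis  = [R[i, :] for i in range(r)]  # non-zero rows of the RREF
null_basis = A.nullspace()                # free-variable solution vectors
left_null  = A.T.nullspace()              # Step 2 applied to A^T

assert r + len(null_basis) == n        # rank-nullity in R^n
assert r + len(left_null) == m         # rank-nullity in R^m
```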

7.5 AI Interpretation of the Four Subspaces

For a weight matrix $W \in \mathbb{R}^{m \times n}$ in a neural network layer $\mathbf{y} = W\mathbf{x}$, the four subspaces tell the complete story of what this layer can and cannot do:

Column space col(W) - the "reachable" subspace.

Only outputs in $\text{col}(W) \subseteq \mathbb{R}^m$ can be produced by this layer. If $\text{rank}(W) = r < m$, then $m - r$ dimensions of the output space receive exactly zero contribution from this layer, regardless of input. In a transformer residual stream: each attention head and MLP block writes to a specific subspace of $\mathbb{R}^d$; the column space of the output projection $W_O$ is the subspace the head writes to. Dimensions orthogonal to $\text{col}(W_O)$ are untouched by this head.

Null space null(W) - the "invisible" subspace.

Input directions in $\text{null}(W)$ are completely ignored by the layer. If an input vector has all its "mass" in the null space, the output is zero - the layer cannot see it. This is both a limitation and a design feature: in mechanistic interpretability, the null space of a query projection $W_Q$ is the set of residual stream directions that this attention head does not use to form its queries. These directions are invisible to the head's query computation.

Row space row(W) - the "listening" subspace.

The row space is the complement of the null space: input directions in $\text{row}(W)$ are the ones the layer actually "listens to". The projection of the input onto $\text{row}(W)$ determines the output; the null space projection vanishes. In a transformer, the row space of $W_K$ determines which directions of the residual stream the key computation is sensitive to. Keys "live in" the row space of $W_K$.

Left null space null(W^T) - the "unreachable" output subspace.

The left null space $\text{null}(W^\top) \subseteq \mathbb{R}^m$ is orthogonal to $\text{col}(W)$. No input, however crafted, can produce an output with a component in this subspace from this layer. In a transformer with residual connections, the left null space of one layer's output projection is the subspace that must be written by other layers. This is why residual connections are essential: they allow information to flow "around" layers whose column spaces don't reach certain directions.

AI INTERPRETATION: WEIGHT MATRIX SUBSPACES

  W in R^{m x n}, rank r

  R^n (input):                      R^m (output):
  +--------------------+           +--------------------+
  | row(W)             |     W     | col(W)             |
  | dim r              | --------> | dim r              |
  | "W listens here"   |           | "W writes here"    |
  +--------------------+           +--------------------+
  | null(W)            |     W     | null(W^T)          |
  | dim n-r            | ---> {0}  | dim m-r            |
  | "W ignores this"   |           | "W never here"     |
  +--------------------+           +--------------------+

  Transformer layer (y = Wx + b, residual):
  - Attention head h reads from row(W_Q^h), row(W_K^h), row(W_V^h)
  - Head h writes to col(W_O^h)
  - Information in null(W_Q^h) is invisible to head h's queries
  - Information in null(W_V^h) is not read by head h's values
  - The residual stream preserves ALL d dimensions; individual
    layers each modify only their respective column space subspaces

8. Subspace Operations

8.1 Sum of Subspaces

For subspaces $W_1, W_2 \subseteq V$, their sum is:

$$W_1 + W_2 = \{\mathbf{w}_1 + \mathbf{w}_2 : \mathbf{w}_1 \in W_1,\ \mathbf{w}_2 \in W_2\}$$

Theorem. $W_1 + W_2$ is a subspace of $V$.

Proof.

  1. Contains $\mathbf{0}$: $\mathbf{0} = \mathbf{0} + \mathbf{0} \in W_1 + W_2$
  2. Closed under $+$: $(\mathbf{w}_1 + \mathbf{w}_2) + (\mathbf{w}_1' + \mathbf{w}_2') = (\mathbf{w}_1 + \mathbf{w}_1') + (\mathbf{w}_2 + \mathbf{w}_2') \in W_1 + W_2$ (since $W_1$, $W_2$ are closed)
  3. Closed under scaling: $\alpha(\mathbf{w}_1 + \mathbf{w}_2) = \alpha\mathbf{w}_1 + \alpha\mathbf{w}_2 \in W_1 + W_2$

$W_1 + W_2$ is the smallest subspace containing both $W_1$ and $W_2$. Any subspace $U$ that contains both $W_1$ and $W_2$ must contain all sums $\mathbf{w}_1 + \mathbf{w}_2$ (by closure), so $W_1 + W_2 \subseteq U$.

Grassmann dimension formula:

$$\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$$

Important distinction: $W_1 \cup W_2$ (the set union) is not a subspace in general. Take $W_1 = \text{span}\{(1,0)\}$ (the x-axis) and $W_2 = \text{span}\{(0,1)\}$ (the y-axis) in $\mathbb{R}^2$. Then $(1,0) \in W_1 \cup W_2$ and $(0,1) \in W_1 \cup W_2$, but $(1,0) + (0,1) = (1,1) \notin W_1 \cup W_2$. The union is not closed under addition. The sum $W_1 + W_2 = \mathbb{R}^2$ (the whole plane) is a subspace; the union (just the two axes) is not.
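
The Grassmann formula is easy to check numerically: stacking spanning matrices side by side gives $\dim(W_1 + W_2)$ as a rank. A small sketch, with two hypothetical coordinate subspaces of $\mathbb{R}^4$:

```python
# Numerical check of the Grassmann dimension formula.
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])  # span{e1, e2}
W2 = np.array([[0., 0.], [1., 0.], [0., 1.], [0., 0.]])  # span{e2, e3}

dim_W1 = np.linalg.matrix_rank(W1)
dim_W2 = np.linalg.matrix_rank(W2)
dim_sum = np.linalg.matrix_rank(np.hstack([W1, W2]))     # dim(W1 + W2)

# dim(W1 ∩ W2) recovered from the Grassmann formula
dim_intersection = dim_W1 + dim_W2 - dim_sum
print(dim_sum, dim_intersection)   # 3, 1 (the shared direction is e2)
```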

8.2 Intersection of Subspaces

For subspaces $W_1, W_2 \subseteq V$, their intersection is:

$$W_1 \cap W_2 = \{\mathbf{v} \in V : \mathbf{v} \in W_1 \text{ and } \mathbf{v} \in W_2\}$$

Theorem. $W_1 \cap W_2$ is a subspace of $V$.

Proof.

  1. Contains $\mathbf{0}$: $\mathbf{0} \in W_1$ and $\mathbf{0} \in W_2$, so $\mathbf{0} \in W_1 \cap W_2$
  2. Closed under $+$: if $\mathbf{v}, \mathbf{w} \in W_1 \cap W_2$, then $\mathbf{v} + \mathbf{w} \in W_1$ (since $W_1$ is closed) and $\mathbf{v} + \mathbf{w} \in W_2$ (since $W_2$ is closed), so $\mathbf{v} + \mathbf{w} \in W_1 \cap W_2$
  3. Closed under scaling: $\alpha\mathbf{v} \in W_1$ and $\alpha\mathbf{v} \in W_2$, so $\alpha\mathbf{v} \in W_1 \cap W_2$

$W_1 \cap W_2$ is the largest subspace contained in both $W_1$ and $W_2$.

Computing $W_1 \cap W_2$: if $W_1 = \text{null}(A_1)$ and $W_2 = \text{null}(A_2)$, then:

$$W_1 \cap W_2 = \text{null}\left(\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}\right)$$

For general subspaces given by spanning sets: find the vectors in $\text{span}\{\mathbf{u}_1, \ldots\}$ that also lie in $\text{span}\{\mathbf{v}_1, \ldots\}$ by setting up a linear system and solving.
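
When both subspaces arrive as null spaces, the stacked-matrix recipe above is one line with SciPy; the two constraint matrices below are hypothetical:

```python
# Intersection of two null spaces via vertical stacking.
import numpy as np
from scipy.linalg import null_space

A1 = np.array([[1., 1., 0.]])    # W1 = null(A1): the plane x + y = 0
A2 = np.array([[0., 1., -1.]])   # W2 = null(A2): the plane y = z

# W1 ∩ W2 = null([A1; A2])
basis = null_space(np.vstack([A1, A2]))
print(basis)   # one orthonormal column, proportional to (1, -1, -1)
```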

8.3 Direct Sum

Subspaces $W_1$ and $W_2$ are complementary in $V$ (equivalently, $V$ is their direct sum) if:

  1. $W_1 + W_2 = V$ - together they span $V$
  2. $W_1 \cap W_2 = \{\mathbf{0}\}$ - they share only the origin

We write $V = W_1 \oplus W_2$.

Theorem. $V = W_1 \oplus W_2$ if and only if every $\mathbf{v} \in V$ has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2$ with $\mathbf{w}_1 \in W_1$ and $\mathbf{w}_2 \in W_2$.

Proof of uniqueness. If $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2 = \mathbf{w}_1' + \mathbf{w}_2'$, then $\mathbf{w}_1 - \mathbf{w}_1' = \mathbf{w}_2' - \mathbf{w}_2$. The left side is in $W_1$; the right side is in $W_2$. So both sides lie in $W_1 \cap W_2 = \{\mathbf{0}\}$. Hence $\mathbf{w}_1 = \mathbf{w}_1'$ and $\mathbf{w}_2 = \mathbf{w}_2'$.

Dimension: $\dim(W_1 \oplus W_2) = \dim(W_1) + \dim(W_2)$. This follows from the Grassmann formula with $\dim(W_1 \cap W_2) = 0$.

Complements are not unique. For a given subspace $W \subseteq V$, there are many subspaces $W'$ such that $V = W \oplus W'$. The orthogonal complement $W^\perp$ is the canonical choice:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

Theorem (for finite-dimensional inner product spaces). $V = W \oplus W^\perp$, and $\dim(W^\perp) = \dim(V) - \dim(W)$.

The fundamental subspace decompositions from Section 7 are orthogonal direct sums:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A) = \text{row}(A) \oplus \text{row}(A)^\perp$$
$$\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top) = \text{col}(A) \oplus \text{col}(A)^\perp$$

Multiple direct sums. The direct sum generalises: $V = W_1 \oplus W_2 \oplus \cdots \oplus W_k$ if the $W_i$ are pairwise "independent" (each $W_i \cap (W_1 + \cdots + \hat{W}_i + \cdots + W_k) = \{\mathbf{0}\}$, where $\hat{W}_i$ means $W_i$ is omitted from the sum) and together they span $V$. Then every $\mathbf{v}$ has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \cdots + \mathbf{w}_k$ and $\dim(V) = \sum_i \dim(W_i)$.

In the spectral theorem (Section 13), the orthogonal decomposition into eigenspaces is a multiple orthogonal direct sum: $\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$.

8.4 Projection onto Subspaces

Given a subspace $W \subseteq V$, the orthogonal projection $\text{Proj}_W: V \to W$ maps each vector $\mathbf{v}$ to the closest point in $W$:

$$\text{Proj}_W(\mathbf{v}) = \arg\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|$$

The unique minimiser exists because $W$ is a closed convex set (any subspace is convex, and in finite dimensions, closed).

Characterisation. $\hat{\mathbf{v}} = \text{Proj}_W(\mathbf{v})$ is the unique vector in $W$ such that $\mathbf{v} - \hat{\mathbf{v}} \perp W$, i.e., $\langle \mathbf{v} - \hat{\mathbf{v}}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$.

Formula 1: orthonormal basis.

If $W$ has orthonormal basis $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$, then:

$$\text{Proj}_W(\mathbf{v}) = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle \mathbf{q}_i$$

In matrix form, with $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k] \in \mathbb{R}^{n \times k}$ (orthonormal columns):

$$\text{Proj}_W(\mathbf{v}) = QQ^\top \mathbf{v}$$

The projection matrix is $P = QQ^\top \in \mathbb{R}^{n \times n}$.

Formula 2: general (non-orthonormal) basis.

If $W = \text{col}(A)$ for $A \in \mathbb{R}^{n \times k}$ with linearly independent columns (full column rank):

$$\text{Proj}_W(\mathbf{v}) = A(A^\top A)^{-1} A^\top \mathbf{v}$$

The projection matrix is $P = A(A^\top A)^{-1}A^\top$. The matrix $(A^\top A)^{-1} A^\top$ is the Moore-Penrose pseudoinverse $A^\dagger$ when $A$ has full column rank.

Properties of an orthogonal projection matrix $P$:

| Property | Expression | Meaning |
|---|---|---|
| Idempotent | $P^2 = P$ | Projecting twice = projecting once |
| Symmetric | $P^\top = P$ | Projection is self-adjoint |
| Eigenvalues | $\lambda \in \{0, 1\}$ | Vectors in $W$ map to themselves; vectors in $W^\perp$ map to $\mathbf{0}$ |
| Rank | $\text{rank}(P) = \dim(W)$ | Rank = dimension of the subspace projected onto |
| Trace | $\text{tr}(P) = \dim(W)$ | Eigenvalues are 0s and 1s; trace = sum of eigenvalues |
| Complement | $I - P$ | Projection onto $W^\perp$ |

Proof that $P^2 = P$: $P^2 = (QQ^\top)(QQ^\top) = Q(Q^\top Q)Q^\top = QIQ^\top = QQ^\top = P$ (using $Q^\top Q = I_k$ for orthonormal columns).

Decomposition. Every $\mathbf{v} \in V$ decomposes as:

$$\mathbf{v} = \underbrace{P\mathbf{v}}_{\in W} + \underbrace{(I-P)\mathbf{v}}_{\in W^\perp}$$

The residual $(I-P)\mathbf{v} = \mathbf{v} - P\mathbf{v}$ is the component of $\mathbf{v}$ orthogonal to $W$, and $I-P$ is itself an orthogonal projection matrix (onto $W^\perp$).
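
Each property in the table above is a one-line check in code. A minimal sketch using Formula 2 on a hypothetical random 2-dimensional subspace of $\mathbb{R}^4$:

```python
# Projection matrix P = A (A^T A)^{-1} A^T and its properties.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))            # independent columns spanning W
P = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P @ P, P)               # idempotent
assert np.allclose(P, P.T)                 # symmetric
assert round(np.trace(P)) == 2             # trace = dim(W)

v = rng.standard_normal(4)
w, w_perp = P @ v, (np.eye(4) - P) @ v     # v = Pv + (I - P)v
assert np.isclose(w @ w_perp, 0.0)         # the two components are orthogonal
```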

AI applications of projection:

  • Least squares: the best-fit solution $\hat{\mathbf{x}} = (A^\top A)^{-1} A^\top \mathbf{b}$ projects $\mathbf{b}$ onto $\text{col}(A)$; least squares is orthogonal projection
  • PCA: projecting data onto the top-$r$ principal component subspace is a rank-$r$ projection $P = U_r U_r^\top$, where $U_r$ contains the top $r$ left singular vectors
  • Attention: (soft) attention can be viewed as a weighted projection of value vectors onto query-determined subspaces
  • Concept erasure: removing a concept from an embedding by projecting onto the orthogonal complement of the concept subspace, $P_{\perp\text{concept}} = I - P_{\text{concept}}$; used in LEACE and related methods

8.5 Subspace Angles and Principal Angles

The angle between two vectors is well-defined: $\cos\theta = \langle \mathbf{u}, \mathbf{v} \rangle / (\|\mathbf{u}\| \|\mathbf{v}\|)$. The "angle" between two subspaces is more subtle - it requires a collection of angles called principal angles.

Definition. For subspaces $W_1, W_2 \subseteq V$ with $\dim(W_1) = p$, $\dim(W_2) = q$, and $k = \min(p, q)$, the principal angles $0 \leq \theta_1 \leq \theta_2 \leq \cdots \leq \theta_k \leq \pi/2$ are defined recursively:

$$\cos\theta_j = \max_{\substack{\mathbf{u} \in W_1 \\ \mathbf{v} \in W_2}} \langle \mathbf{u}, \mathbf{v} \rangle \quad \text{subject to } \|\mathbf{u}\| = \|\mathbf{v}\| = 1,\ \mathbf{u} \perp \mathbf{u}_i,\ \mathbf{v} \perp \mathbf{v}_i \text{ for } i < j$$

where $\mathbf{u}_i, \mathbf{v}_i$ denote the maximising pairs from the earlier steps.

Computation via SVD. If $Q_1 \in \mathbb{R}^{n \times p}$ and $Q_2 \in \mathbb{R}^{n \times q}$ are orthonormal bases for $W_1$ and $W_2$:

$$Q_1^\top Q_2 = U \Sigma V^\top \qquad \text{(SVD)}$$

Then $\cos\theta_i = \sigma_i$ (the $i$-th singular value of $Q_1^\top Q_2$). The principal angles are the arc-cosines of the singular values.
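
A minimal sketch of this computation for two hypothetical random 2-dimensional subspaces of $\mathbb{R}^5$ (SciPy packages the same recipe as `scipy.linalg.subspace_angles`):

```python
# Principal angles from the singular values of Q1^T Q2.
import numpy as np

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis for W1
Q2, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis for W2

sigma = np.linalg.svd(Q1.T @ Q2, compute_uv=False)  # cosines, in [0, 1]
theta = np.arccos(np.clip(sigma, -1.0, 1.0))        # principal angles, ascending
print(np.degrees(theta))
```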

Interpretation:

  • $\theta_1 = 0$: the subspaces share a common direction (their intersection is non-trivial)
  • $\theta_k = \pi/2$: the subspaces contain a pair of orthogonal directions (they are "partially orthogonal")
  • All $\theta_i = \pi/2$: the subspaces are orthogonal ($W_1 \perp W_2$, i.e., $W_1 \subseteq W_2^\perp$)
  • All $\theta_i = 0$: $W_1 \subseteq W_2$ or $W_2 \subseteq W_1$ (one contains the other)

AI applications:

  • Head overlap: principal angles between two attention heads' OV subspaces measure how much they overlap. If $\theta_1 \approx 0$, the heads share a direction and are partially redundant. If all $\theta_i \approx \pi/2$, the heads write to orthogonal subspaces - they are fully independent.
  • Gradient similarity across layers: principal angles between gradient subspaces at different layers measure gradient diversity during training.
  • Representation similarity: the CKA (Centered Kernel Alignment) measure of representation similarity between two networks is related to the principal angles between their representation subspaces.
  • LoRA subspace alignment: after fine-tuning, the principal angles between the LoRA update subspace and the top-$r$ gradient subspace reveal how well LoRA captured the gradient's preferred directions.

9. Inner Product Spaces and Orthogonality

Recall: Vectors and Spaces 5-6 developed norms and inner products concretely in $\mathbb{R}^n$ - dot product, angle, Cauchy-Schwarz, and orthogonal projection. This section extends those ideas to abstract inner product spaces: general Hilbert spaces, orthonormal bases, Gram-Schmidt orthogonalisation, and orthogonal complements. The abstract setting is what makes these concepts applicable to function spaces (Fourier analysis, kernels) and not just $\mathbb{R}^n$.

9.1 Inner Products

An inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
  2. Linearity in the first argument: $\langle \alpha\mathbf{u} + \beta\mathbf{w}, \mathbf{v} \rangle = \alpha\langle\mathbf{u},\mathbf{v}\rangle + \beta\langle\mathbf{w},\mathbf{v}\rangle$
  3. Positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, with equality if and only if $\mathbf{v} = \mathbf{0}$

Together, properties 1 and 2 imply linearity in the second argument as well (by symmetry), making the inner product bilinear: linear in each argument separately.

The induced norm is $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$, and the induced metric (distance) is $d(\mathbf{u},\mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$.

Standard inner products:

| Space | Inner product | Induced norm |
|---|---|---|
| $\mathbb{R}^n$ | $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_i u_i v_i$ | $\lVert\mathbf{u}\rVert = \sqrt{\sum_i u_i^2}$ (Euclidean) |
| $\mathbb{R}^{m \times n}$ | $\langle A, B \rangle = \text{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$ | $\lVert A\rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius) |
| $C([a,b])$ | $\langle f, g \rangle = \int_a^b f(t) g(t)\, dt$ | $\lVert f\rVert = \sqrt{\int_a^b f(t)^2\, dt}$ ($L^2$ norm) |

Weighted inner product. For a symmetric positive definite matrix $M \in \mathbb{R}^{n \times n}$:

$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^\top M \mathbf{v}$$

This defines a different geometry on $\mathbb{R}^n$: the unit ball $\{\mathbf{v} : \langle\mathbf{v},\mathbf{v}\rangle_M \leq 1\}$ is an ellipsoid rather than a sphere. The natural gradient in optimisation uses the Fisher information matrix $F$ as the weight matrix: $\langle \mathbf{g}, \mathbf{g} \rangle_F = \mathbf{g}^\top F \mathbf{g}$ measures gradient magnitude in the geometry of the statistical model, not the flat geometry of parameter space.

Cauchy-Schwarz inequality. For any inner product:

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$$

with equality if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent ($\mathbf{u} = \alpha\mathbf{v}$ for some scalar $\alpha$).

This is one of the most important inequalities in mathematics. It implies:

  • Cosine similarity $\frac{\langle \mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\|\mathbf{v}\|}$ is always in $[-1, 1]$ - well-defined as a cosine
  • The triangle inequality $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ (from Cauchy-Schwarz applied to $\langle\mathbf{u},\mathbf{v}\rangle \leq \|\mathbf{u}\|\|\mathbf{v}\|$)
  • The law of cosines: $\|\mathbf{u}-\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2$

9.2 Orthogonality

Vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal, written $\mathbf{u} \perp \mathbf{v}$, if $\langle \mathbf{u}, \mathbf{v} \rangle = 0$.

Orthogonality generalises perpendicularity from Euclidean geometry to any inner product space. In the Fourier inner product on $L^2$, $\sin(mt)$ and $\cos(nt)$ are orthogonal for all integers $m, n$. In the Frobenius inner product on matrices, symmetric and skew-symmetric matrices are orthogonal (since $\text{tr}(A^\top B) = 0$ when $A = A^\top$ and $B = -B^\top$).

Pythagorean theorem. If $\mathbf{u} \perp \mathbf{v}$, then:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

Proof:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \langle\mathbf{u}+\mathbf{v},\mathbf{u}+\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{u}\rangle + 2\langle\mathbf{u},\mathbf{v}\rangle + \langle\mathbf{v},\mathbf{v}\rangle = \|\mathbf{u}\|^2 + 0 + \|\mathbf{v}\|^2 \quad \checkmark$$

More generally, if $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are pairwise orthogonal:

$$\left\|\sum_{i=1}^k \mathbf{v}_i\right\|^2 = \sum_{i=1}^k \|\mathbf{v}_i\|^2$$

Orthogonal sets and independence. Any set of non-zero pairwise orthogonal vectors is linearly independent.

Proof. Suppose $\sum_{i=1}^k \alpha_i \mathbf{v}_i = \mathbf{0}$ with the $\mathbf{v}_i \neq \mathbf{0}$ pairwise orthogonal. Take the inner product of both sides with $\mathbf{v}_j$:

$$0 = \left\langle \sum_i \alpha_i \mathbf{v}_i,\ \mathbf{v}_j \right\rangle = \sum_i \alpha_i \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \alpha_j \langle \mathbf{v}_j, \mathbf{v}_j \rangle = \alpha_j \|\mathbf{v}_j\|^2$$

Since $\mathbf{v}_j \neq \mathbf{0}$, we have $\|\mathbf{v}_j\|^2 > 0$, so $\alpha_j = 0$. This holds for all $j$.

This is a powerful observation: orthogonality implies independence. An orthogonal set is automatically independent, so an orthogonal spanning set is automatically a basis.

9.3 Gram-Schmidt Orthogonalisation

Given $k$ linearly independent vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}$ in an inner product space, the Gram-Schmidt process produces an orthonormal set $\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_k\}$ such that:

$$\text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_j\} = \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_j\} \quad \text{for all } j = 1, \ldots, k$$

The key invariant: at each step, the orthonormal set spans the same subspace as the original vectors up to that index.

Algorithm:

Step 1:  u_1 = v_1
         q_1 = u_1 / ||u_1||

Step 2:  u_2 = v_2 - <v_2, q_1> q_1
         q_2 = u_2 / ||u_2||

Step 3:  u_3 = v_3 - <v_3, q_1> q_1 - <v_3, q_2> q_2
         q_3 = u_3 / ||u_3||

         ...

Step j:  u_j = v_j - sum_{i=1}^{j-1} <v_j, q_i> q_i
         q_j = u_j / ||u_j||

What each step does. At step $j$, we subtract from $\mathbf{v}_j$ all of its projections onto the previously constructed orthonormal vectors $\mathbf{q}_1, \ldots, \mathbf{q}_{j-1}$. The result $\mathbf{u}_j$ is the component of $\mathbf{v}_j$ not explained by the previous vectors - the "new" direction that $\mathbf{v}_j$ adds. Normalising gives $\mathbf{q}_j$.

Why $\mathbf{u}_j \neq \mathbf{0}$: since $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ are linearly independent, $\mathbf{v}_j \notin \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{j-1}\} = \text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_{j-1}\}$. Therefore $\mathbf{u}_j = \mathbf{v}_j - \sum_{i<j}\langle\mathbf{v}_j,\mathbf{q}_i\rangle\mathbf{q}_i \neq \mathbf{0}$ (if it were zero, $\mathbf{v}_j$ would be in the span of the previous $\mathbf{q}$'s, contradicting independence).

Verification of orthonormality. For $i < j$:

$$\langle \mathbf{q}_j, \mathbf{q}_i \rangle = \frac{1}{\|\mathbf{u}_j\|} \left\langle \mathbf{v}_j - \sum_{\ell < j} \langle \mathbf{v}_j, \mathbf{q}_\ell \rangle \mathbf{q}_\ell,\ \mathbf{q}_i \right\rangle = \frac{1}{\|\mathbf{u}_j\|}\left(\langle \mathbf{v}_j, \mathbf{q}_i \rangle - \langle \mathbf{v}_j, \mathbf{q}_i \rangle \langle \mathbf{q}_i, \mathbf{q}_i \rangle\right) = 0$$

Connection to QR decomposition. The Gram-Schmidt process is the algorithm underlying QR decomposition. For a matrix $A = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k]$ with independent columns:

$$A = QR$$

where $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k]$ has orthonormal columns and $R$ is upper triangular with positive diagonal:

$$R_{ij} = \begin{cases} \langle \mathbf{v}_j, \mathbf{q}_i \rangle & \text{if } i \leq j \\ 0 & \text{if } i > j \end{cases}$$

The diagonal entries are $R_{jj} = \|\mathbf{u}_j\| > 0$.

Numerical issues. The classical Gram-Schmidt algorithm as stated is numerically unstable for nearly dependent vectors: rounding errors accumulate and the resulting vectors may not be accurately orthogonal. The Modified Gram-Schmidt algorithm reorders the operations to reduce error propagation. For production use, Householder QR (using Householder reflections rather than projections) is numerically stable and preferred.
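
A minimal sketch of the modified variant in NumPy, applied to the same two vectors as the worked example below; the function name `mgs` is ours, not a library routine:

```python
# Modified Gram-Schmidt: projections are subtracted one at a time, using the
# already-updated vector, which reduces rounding-error accumulation.
import numpy as np

def mgs(V):
    """Return Q (orthonormal columns) and upper-triangular R with A = QR."""
    V = V.astype(float).copy()
    n, k = V.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for j in range(k):
        for i in range(j):
            R[i, j] = Q[:, i] @ V[:, j]     # coefficient against the current v_j
            V[:, j] -= R[i, j] * Q[:, i]    # subtract immediately (the "modified" order)
        R[j, j] = np.linalg.norm(V[:, j])
        Q[:, j] = V[:, j] / R[j, j]
    return Q, R

A = np.array([[1., 1.], [1., 0.], [0., 1.]])   # columns v1, v2 of the worked example
Q, R = mgs(A)
assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = QR
```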

Worked example. Let $\mathbf{v}_1 = (1, 1, 0)^\top$, $\mathbf{v}_2 = (1, 0, 1)^\top$ in $\mathbb{R}^3$.

Step 1: $\mathbf{u}_1 = (1,1,0)^\top$, $\|\mathbf{u}_1\| = \sqrt{2}$, $\mathbf{q}_1 = (1/\sqrt{2},\ 1/\sqrt{2},\ 0)^\top$

Step 2: $\langle \mathbf{v}_2, \mathbf{q}_1 \rangle = (1)(1/\sqrt{2}) + (0)(1/\sqrt{2}) + (1)(0) = 1/\sqrt{2}$

$$\mathbf{u}_2 = (1,0,1)^\top - \frac{1}{\sqrt{2}}(1/\sqrt{2}, 1/\sqrt{2}, 0)^\top = (1,0,1)^\top - (1/2, 1/2, 0)^\top = (1/2, -1/2, 1)^\top$$

$$\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}$$

$$\mathbf{q}_2 = \frac{1}{\sqrt{3/2}}(1/2, -1/2, 1)^\top = (1/\sqrt{6},\ -1/\sqrt{6},\ 2/\sqrt{6})^\top$$

Verify: $\langle\mathbf{q}_1,\mathbf{q}_2\rangle = (1/\sqrt{2})(1/\sqrt{6}) + (1/\sqrt{2})(-1/\sqrt{6}) + 0 = 1/\sqrt{12} - 1/\sqrt{12} = 0$ ✓

9.4 The Orthogonal Complement

For a subspace $W \subseteq V$, the orthogonal complement is:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

$W^\perp$ is a subspace (easy check):

  1. $\langle \mathbf{0}, \mathbf{w} \rangle = 0$ for all $\mathbf{w}$
  2. If $\langle \mathbf{v}, \mathbf{w} \rangle = 0$ and $\langle \mathbf{u}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$, then $\langle \mathbf{v}+\mathbf{u}, \mathbf{w} \rangle = 0$
  3. $\langle \alpha\mathbf{v}, \mathbf{w} \rangle = \alpha \langle \mathbf{v},\mathbf{w} \rangle = 0$

Properties (for finite-dimensional inner product spaces):

| Property | Statement |
|---|---|
| Double complement | $(W^\perp)^\perp = W$ |
| Dimension | $\dim(W) + \dim(W^\perp) = \dim(V)$ |
| Trivial intersection | $W \cap W^\perp = \{\mathbf{0}\}$ |
| Direct sum decomposition | $V = W \oplus W^\perp$ |
| Fundamental subspaces | $\text{null}(A) = \text{row}(A)^\perp$; $\text{null}(A^\top) = \text{col}(A)^\perp$ |

Why $(W^\perp)^\perp = W$: clearly $W \subseteq (W^\perp)^\perp$ (every vector in $W$ is orthogonal to everything in $W^\perp$). By the dimension formula: $\dim((W^\perp)^\perp) = \dim(V) - \dim(W^\perp) = \dim(V) - (\dim(V) - \dim(W)) = \dim(W)$. Same dimension and one contains the other, so they are equal.

Computing $W^\perp$. If $W = \text{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ and we form $A = [\mathbf{w}_1 \mid \cdots \mid \mathbf{w}_k]^\top$ (whose rows are the $\mathbf{w}_i$), then:

$$W^\perp = \text{null}(A)$$

because $\langle \mathbf{v}, \mathbf{w}_i \rangle = 0$ for all $i$ is equivalent to $A\mathbf{v} = \mathbf{0}$.

9.5 Orthogonal Bases for AI

The choice of basis - orthogonal vs arbitrary - has concrete consequences in AI applications.

Interpretability and orthogonal bases. When the basis vectors of an embedding space are orthogonal (as in the standard basis of $\mathbb{R}^d$), each dimension corresponds to an independent direction. Inner products between embeddings measure similarity in a well-defined way. Projections onto individual basis directions give clean, independent "components". In practice, the features of a learned representation are rarely aligned with the standard basis - but if they were, each feature could be read off by looking at one coordinate.

The superposition hypothesis and basis independence. Elhage et al. (2022) argue that transformers represent features as directions in $\mathbb{R}^d$, not as individual coordinates. The model does not commit to any particular orthonormal basis; instead, features are placed as arbitrary unit vectors in $\mathbb{R}^d$. When there are fewer features than dimensions, they can be nearly orthogonal (low interference). When features exceed $d$, they must be non-orthogonal (superposed). The interference between features is measured by the inner products between their representing directions - exactly the off-diagonal entries of the Gram matrix of the feature vectors.

Cosine similarity. The dominant similarity measure in embedding spaces:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\| \|\mathbf{v}\|}$$

This is the cosine of the angle between $\mathbf{u}$ and $\mathbf{v}$, ranging from $-1$ (anti-parallel) to $+1$ (parallel). Orthogonal vectors have cosine similarity 0 - they are "unrelated" by this measure. Cosine similarity is basis-dependent: it changes if we apply a non-orthogonal change of basis to the space.

Orthogonal vs orthonormal bases. Orthogonal bases (pairwise orthogonal, but not necessarily unit length) are often more natural than orthonormal ones for intermediate computations. An orthonormal basis (each vector unit length AND pairwise orthogonal) is the "gold standard" for numerical stability and interpretability. The Gram-Schmidt process converts any independent set into an orthonormal one.

PCA as an orthonormal basis for data. Principal Component Analysis finds the orthonormal basis of $\mathbb{R}^d$ that best aligns with the directions of maximal variance in a dataset. In the PCA basis, the covariance matrix is diagonal (its off-diagonal elements are zero), meaning the principal components are statistically uncorrelated. PCA is precisely the process of finding the orthonormal eigenbasis of the covariance matrix and using it as the new coordinate system.

Layer normalisation and orthogonality. Layer normalisation in transformers includes a learnable scale $\gamma$ and shift $\beta$:

$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$

The mean-subtraction step projects $\mathbf{x}$ onto the orthogonal complement of the normalised all-ones vector $\mathbf{1}/\sqrt{d}$. This is an explicit orthogonal projection: $\mathbf{x} - \mu\mathbf{1} = (I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top)\mathbf{x}$, where $P = \frac{1}{d}\mathbf{1}\mathbf{1}^\top$ is the projection onto the 1-dimensional "mean direction" and $I - P$ is the projection onto its orthogonal complement.
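
A minimal sketch verifying that identity numerically; the dimension $d = 8$ and the input vector are hypothetical:

```python
# Mean-subtraction as an orthogonal projection onto span{1}^perp.
import numpy as np

d = 8
x = np.random.default_rng(0).standard_normal(d)

ones = np.ones((d, 1))
P = ones @ ones.T / d                        # projection onto the mean direction
centered = (np.eye(d) - P) @ x               # (I - P) x

assert np.allclose(centered, x - x.mean())   # identical to mean-subtraction
assert np.isclose(centered @ np.ones(d), 0)  # orthogonal to the all-ones vector
```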


10. Affine Subspaces and Quotient Spaces

10.1 Affine Subspaces

An affine subspace (also called an affine flat or coset) is a translate of a linear subspace:

$$W + \mathbf{v}_0 = \{\mathbf{w} + \mathbf{v}_0 : \mathbf{w} \in W\}$$

for a fixed offset vector $\mathbf{v}_0 \in V$ and a linear subspace $W \subseteq V$.

Geometric picture:

  • If $W$ is a line through the origin, $W + \mathbf{v}_0$ is a parallel line through $\mathbf{v}_0$
  • If $W$ is a plane through the origin, $W + \mathbf{v}_0$ is a parallel plane through $\mathbf{v}_0$
  • The affine subspace is a "shifted" copy of the linear subspace; it has the same "shape" and dimension as $W$, but it generally does not pass through the origin (unless $\mathbf{v}_0 \in W$, in which case $W + \mathbf{v}_0 = W$)

Affine subspaces are NOT vector subspaces. Unless $\mathbf{v}_0 \in W$, the affine subspace $W + \mathbf{v}_0$ does not contain $\mathbf{0}$ and is not closed under addition (the sum of two elements is offset by $2\mathbf{v}_0$, not $\mathbf{v}_0$).

Key example: solution sets of linear systems. The solution set of $A\mathbf{x} = \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is an affine subspace:

$$\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\} = \mathbf{x}_p + \text{null}(A)$$

where $\mathbf{x}_p$ is any particular solution and $\text{null}(A)$ is the null space (a linear subspace). The solution set is a translate of the null space: all solutions lie in the same affine subspace parallel to $\text{null}(A)$, offset by $\mathbf{x}_p$.

Other examples:

  • A line not through the origin in $\mathbb{R}^2$: $\{(1,0) + t(1,2) : t \in \mathbb{R}\}$ - a translate of the span of $(1,2)$
  • The probability simplex $\Delta^{n-1}$: sits inside the $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$ defined by $\sum_i p_i = 1$ - a translate of the hyperplane $\sum_i x_i = 0$
  • Batch normalisation output space: after fixing the batch statistics, the normalised output lives in an affine subspace determined by the running mean and variance

Dimension of an affine subspace. The affine subspace $W + \mathbf{v}_0$ has the same dimension as the underlying linear subspace $W$. A $k$-dimensional affine subspace in $\mathbb{R}^n$ is sometimes called a $k$-flat.

AFFINE VS LINEAR SUBSPACES

  Linear subspace W:               Affine subspace W + v_0:
  - passes through the origin      - passes through v_0 (not 0)
  - closed under + and scaling     - closed under affine combinations
  - IS a vector space              - is NOT a vector space
  - examples: lines/planes/        - examples: lines/planes/
    hyperplanes through 0            hyperplanes NOT through 0

  In AI:
  Linear: null(W), col(W), LoRA update subspace
  Affine: solution set of Ax=b, probability simplex,
          offset embeddings before centering

10.2 Affine Combinations

An affine combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is a linear combination $\sum_{i=1}^k \alpha_i \mathbf{v}_i$ where the coefficients sum to one:

$$\sum_{i=1}^k \alpha_i = 1$$

Affine combinations "stay within" affine subspaces: if $\mathbf{v}_1, \ldots, \mathbf{v}_k \in W + \mathbf{v}_0$, then any affine combination of them also lies in $W + \mathbf{v}_0$.

Convex combinations are affine combinations with the additional constraint $\alpha_i \geq 0$:

$$\sum_{i=1}^k \alpha_i \mathbf{v}_i \quad \text{with } \alpha_i \geq 0 \text{ and } \sum_i \alpha_i = 1$$

Convex combinations stay within the convex hull of $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$.

Why the constraint $\sum_i \alpha_i = 1$ matters. Without it, scaling the vectors by a common factor would scale the combination - the result would depend on "how large" the vectors are. With $\sum_i \alpha_i = 1$, the combination is "location-aware" rather than "direction-aware": it picks a point in the affine hull regardless of scale.

AI applications of affine and convex combinations:

  • Embedding interpolation. Given two embedding vectors $\mathbf{u}$ and $\mathbf{v}$, the linear interpolation $\alpha\mathbf{u} + (1-\alpha)\mathbf{v}$ for $\alpha \in [0,1]$ is a convex combination. This is the operation underlying latent space arithmetic: "find the midpoint of 'cat' and 'dog' in embedding space". Note: this is NOT generally a valid probability-weighted average of the corresponding tokens' distributions - that arithmetic must happen in logit space.

  • Spherical linear interpolation (slerp). For unit vectors (on the unit sphere), linear interpolation does not stay on the sphere. Slerp uses trigonometric weights to interpolate along the great circle: $\text{slerp}(\mathbf{u}, \mathbf{v}, t) = \frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{u} + \frac{\sin(t\theta)}{\sin\theta}\mathbf{v}$ where $\cos\theta = \langle\mathbf{u},\mathbf{v}\rangle$. The weights $\frac{\sin((1-t)\theta)}{\sin\theta}$ and $\frac{\sin(t\theta)}{\sin\theta}$ sum to at least 1, with equality only at the endpoints $t = 0, 1$ (and in the limit $\theta \to 0$, where slerp reduces to linear interpolation); slerp is the sphere's analogue of an affine combination rather than a literal one. See the sketch after this list.

  • Model averaging and interpolation. Averaging two neural networks $\theta_1$ and $\theta_2$ by $\theta = \frac{1}{2}(\theta_1 + \theta_2)$ is a convex combination in parameter space. "Model soup" (Wortsman et al. 2022) averages fine-tuned models; WiSE-FT interpolates between pretrained and fine-tuned weights. These are affine combinations in $\mathbb{R}^p$.

  • Probability distributions. A mixture distribution $p_{\text{mix}}(x) = \alpha p_1(x) + (1-\alpha) p_2(x)$ is a convex combination of two distributions. This is valid because $p_{\text{mix}}(x) \geq 0$ (a convex combination of non-negatives) and $\int p_{\text{mix}}\, dx = \alpha + (1-\alpha) = 1$ (the constraint $\sum_i \alpha_i = 1$ ensures the mixture is still a probability distribution). This is why mixture models work: affine combinations with unit-sum weights preserve the probability simplex.
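
A minimal sketch of slerp as described above; the fallback to linear interpolation for nearly parallel inputs is a standard numerical guard, and the example vectors are hypothetical:

```python
# Spherical linear interpolation between unit vectors.
import numpy as np

def slerp(u, v, t):
    """Interpolate along the great circle between unit vectors u and v."""
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
    if np.isclose(theta, 0.0):               # (nearly) parallel: fall back to lerp
        return (1 - t) * u + t * v
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
mid = slerp(u, v, 0.5)
assert np.isclose(np.linalg.norm(mid), 1.0)  # stays on the unit sphere
```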

10.3 Quotient Spaces

The quotient space $V / W$ (read "$V$ mod $W$") captures the idea of "ignoring the $W$ directions" - it identifies all vectors that differ by an element of $W$.

Definition. For a subspace $W \subseteq V$, define the equivalence relation $\mathbf{u} \sim \mathbf{v}$ iff $\mathbf{u} - \mathbf{v} \in W$. The coset of $\mathbf{v}$ is:

$$[\mathbf{v}] = \mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}$$

This is an affine subspace - a translate of $W$ passing through $\mathbf{v}$. Note that $[\mathbf{u}] = [\mathbf{v}]$ iff $\mathbf{u} - \mathbf{v} \in W$ (the cosets are identical iff the vectors are equivalent).

The quotient space $V / W = \{[\mathbf{v}] : \mathbf{v} \in V\}$ is the collection of all cosets of $W$ in $V$.

Operations on $V/W$:

  • Addition: $[\mathbf{u}] + [\mathbf{v}] = [\mathbf{u} + \mathbf{v}]$ (well-defined: the result does not depend on which representatives $\mathbf{u}, \mathbf{v}$ we choose, as long as the cosets are the same)
  • Scalar multiplication: $\alpha[\mathbf{v}] = [\alpha\mathbf{v}]$ (well-defined by the same argument)

These operations make $V/W$ a vector space (the quotient space). The zero vector of $V/W$ is $[\mathbf{0}] = W$ (the coset containing $\mathbf{0}$, which is $W$ itself).

Dimension: $\dim(V/W) = \dim(V) - \dim(W)$.

Intuition. $V/W$ "collapses" the $W$ directions to a point (the zero element $[\mathbf{0}]$) and retains only the directions complementary to $W$. The quotient space has dimension $\dim(V) - \dim(W)$ because it "forgets" $\dim(W)$ directions.

First Isomorphism Theorem. For a linear map $T: V \to U$:

$$V / \text{null}(T) \cong \text{col}(T)$$

The quotient space $V / \text{null}(T)$ is isomorphic to the image of $T$. Intuitively: the null space is exactly the "ambiguity" in $T$ - different inputs mapping to the same output; the quotient space removes this ambiguity, leaving a bijection between cosets and outputs.

AI applications:

  • Layer normalisation. The mean-subtraction in LayerNorm - subtracting the mean of all $d$ coordinates from each coordinate - is equivalent to projecting onto the orthogonal complement of the all-ones direction. The "normalised" space can be viewed as the quotient $\mathbb{R}^d / \text{span}\{\mathbf{1}\}$, where the direction $\mathbf{1} = (1,1,\ldots,1)^\top / \sqrt{d}$ is "divided out". Two activations that differ only by a constant shift of all coordinates are identified in this quotient space.

  • Residual connections. The residual connection $\mathbf{x} \leftarrow \mathbf{x} + f(\mathbf{x})$ adds a vector to the current residual stream. From the quotient space perspective: the "content" of the residual stream modulo the current layer's contribution is what gets passed to the next layer. Each layer writes to its column-space subspace; the rest of $\mathbb{R}^d$ is preserved as-is.

  • Equivalence classes in training. If $W = \text{null}(A)$ for a data matrix $A$, then all weight vectors in the same coset $\mathbf{w} + \text{null}(A)$ produce the same predictions on the training data. The space of distinct prediction functions is the quotient $\mathbb{R}^n / \text{null}(A) \cong \text{row}(A)$. Gradient descent with squared loss finds the minimum-norm representative of the coset $[\mathbf{w}^*] = \mathbf{w}^* + \text{null}(A)$ - the vector in the coset closest to the origin.

10.4 Cosets and Their Structure

The cosets $\mathbf{v} + W$ partition $V$ into disjoint, equally-shaped affine subspaces:

  • Partition: every $\mathbf{v} \in V$ belongs to exactly one coset; cosets are either identical or disjoint
  • Uniform shape: every coset is a translate of $W$, so all cosets have the same dimension ($= \dim(W)$) and the same "shape"
  • No overlap: two cosets $\mathbf{u} + W$ and $\mathbf{v} + W$ are either equal (when $\mathbf{u} - \mathbf{v} \in W$) or disjoint

Solution structure for linear systems. This is the most immediate application. For $A\mathbf{x} = \mathbf{b}$:

  • The solution set (if non-empty) is the coset $\mathbf{x}_p + \text{null}(A)$ for any particular solution $\mathbf{x}_p$
  • All solutions in this coset produce the same output: $A(\mathbf{x}_p + \mathbf{z}) = A\mathbf{x}_p + A\mathbf{z} = \mathbf{b} + \mathbf{0} = \mathbf{b}$ for any $\mathbf{z} \in \text{null}(A)$
  • The minimum-norm solution (the "pseudoinverse solution") is $\mathbf{x}^+ = A^\dagger \mathbf{b} = A^\top(AA^\top)^{-1}\mathbf{b}$ (for full row rank $A$), which lies in $\text{row}(A)$ - the unique element of the coset that is orthogonal to $\text{null}(A)$

Implicit bias as coset selection. In overparameterised linear models ($n > m$: more parameters than constraints), gradient descent with zero initialisation and MSE loss converges to the minimum-norm solution in $\text{row}(A)$ (the unique solution in $\text{row}(A) \cap (\mathbf{x}_p + \text{null}(A))$). This "implicit bias" towards minimum-norm solutions is a coset-selection phenomenon: gradient descent selects the minimum-$\ell^2$-norm representative from the coset of all equally good solutions.
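
The coset picture is directly visible in code. A minimal sketch with a hypothetical underdetermined system, using the pseudoinverse for the minimum-norm representative:

```python
# Coset structure of the solution set of an underdetermined system.
import numpy as np
from scipy.linalg import null_space

A = np.array([[1., 2., 0.],
              [0., 1., 1.]])           # 2x3, rank 2: a 1-dimensional null space
b = np.array([3., 2.])

x_min = np.linalg.pinv(A) @ b          # minimum-norm solution, lies in row(A)
N = null_space(A)                      # basis for null(A)

x_other = x_min + 5.0 * N[:, 0]        # another point in the same coset
assert np.allclose(A @ x_other, b)     # every coset member solves Ax = b
assert np.linalg.norm(x_min) <= np.linalg.norm(x_other)  # pinv picks the shortest
assert np.isclose(N[:, 0] @ x_min, 0)  # x_min ⟂ null(A), i.e. x_min ∈ row(A)
```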


11. Subspaces in Functional Analysis

11.1 Infinite-Dimensional Spaces and Closed Subspaces

In finite dimensions, every subspace is automatically closed (in the topological sense). This ceases to be true in infinite dimensions, where one must distinguish between algebraic subspaces (satisfying the three conditions of Section 3) and closed subspaces (additionally closed under limits).

Closed subspace. A subspace $W$ of a Hilbert space $H$ is closed if every Cauchy sequence in $W$ converges to a limit that remains in $W$. Equivalently, $W$ is closed if it equals its topological closure $\bar{W}$.

Why closedness matters. In infinite dimensions:

  • The projection theorem ($H = W \oplus W^\perp$) holds only for closed subspaces
  • The best approximation in $W$ is guaranteed to exist only if $W$ is closed
  • The spectral theorem and other key results require closed invariant subspaces

Examples of closed subspaces:

  • $\mathcal{P}_n \subset L^2([0,1])$: polynomials of degree $\leq n$; finite-dimensional, hence closed
  • $\text{null}(T)$ for a bounded linear operator $T$: always closed (because $T$ is continuous and $\{0\}$ is closed)
  • $L^2_{\text{even}}([-\pi,\pi]) = \{f \in L^2 : f(-x) = f(x)\}$: the even functions; closed under limits
  • The span of any finite set of vectors in a Hilbert space: always closed (a finite-dimensional subspace)

Examples of non-closed subspaces in infinite dimensions:

  • $\mathcal{P} \subset L^2([0,1])$: all polynomials (of any degree); algebraically a subspace, but not closed - the sequence $f_n(x) = \sum_{k=0}^n x^k / k!$ converges in $L^2$ to $e^x$, which is not a polynomial. The closure is $\overline{\mathcal{P}} = L^2([0,1])$ (by the Weierstrass approximation theorem, carried over to $L^2$)
  • Finitely supported sequences in $\ell^2$: sequences with only finitely many non-zero entries; algebraically a subspace, but the limit of $\mathbf{e}_1,\ \mathbf{e}_1 + (1/2)\mathbf{e}_2,\ \mathbf{e}_1 + (1/2)\mathbf{e}_2 + (1/3)\mathbf{e}_3,\ \ldots$ is $(1, 1/2, 1/3, \ldots) \in \ell^2$, which is not finitely supported

Projection in infinite dimensions. The projection theorem extends to Hilbert spaces: if $W$ is a closed subspace of a Hilbert space $H$, then $H = W \oplus W^\perp$ and every $\mathbf{v} \in H$ has a unique best approximation $\hat{\mathbf{v}} \in W$. This is the foundation of least squares in infinite dimensions and of the representer theorem in kernel methods.

11.2 Function Spaces Relevant to AI

$L^2(\mathbb{R}^d)$ - square-integrable functions on $\mathbb{R}^d$.

The space of functions $f: \mathbb{R}^d \to \mathbb{R}$ with $\int_{\mathbb{R}^d} f(\mathbf{x})^2 \, d\mathbf{x} < \infty$, equipped with the inner product $\langle f, g \rangle = \int f(\mathbf{x}) g(\mathbf{x}) \, d\mathbf{x}$. This is a separable Hilbert space. It is the natural setting for:

  • Kernel methods: the RKHS of a translation-invariant kernel is a subspace of $L^2$
  • Probability: densities of probability distributions with finite second moment live here
  • Signal processing: finite-energy signals are in $L^2(\mathbb{R})$
  • Neural networks: functions representable by a network can be analysed as elements of $L^2$

Sobolev spaces $W^{k,p}(\Omega)$.

Functions with $k$ weak derivatives all lying in $L^p(\Omega)$, equipped with norms that account for both function values and derivative values. They quantify smoothness:

  • $W^{0,2} = L^2$ (no smoothness condition beyond square integrability)
  • $W^{1,2} = H^1$ (square-integrable function with square-integrable first derivative): relevant for PDEs and PINNs
  • $W^{2,2} = H^2$ (second derivatives in $L^2$): relevant for spline regression and Gaussian processes with Matérn kernels

Sobolev spaces are used in physics-informed neural networks (PINNs): the solution to a PDE is sought in a Sobolev space, and the network is trained to minimise a loss that penalises PDE residuals in the $L^2$ norm. The constraint "solution in $H^1$" is a subspace constraint on the function space.

Reproducing Kernel Hilbert Spaces (RKHS).

An RKHS is a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ such that point evaluation is a bounded linear functional: for each $\mathbf{x} \in \mathcal{X}$, the map $f \mapsto f(\mathbf{x})$ is continuous in the $\mathcal{H}$-norm.

By the Riesz representation theorem, there exists a unique function $k(\cdot, \mathbf{x}) \in \mathcal{H}$ such that:

$$f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}$$

The function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by $k(\mathbf{x}, \mathbf{x}') = \langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle_{\mathcal{H}}$ is the reproducing kernel. It is symmetric and positive semi-definite.

Representer theorem. For any regularised learning problem of the form:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^n \ell(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}_k}^2$$

the optimal $f^*$ lies in the finite-dimensional subspace $\text{span}\{k(\cdot, \mathbf{x}_1), \ldots, k(\cdot, \mathbf{x}_n)\} \subseteq \mathcal{H}_k$. This is a subspace result: despite the infinite-dimensional function space, the optimal solution lies in an $n$-dimensional subspace spanned by the kernel functions at the training points. SVMs, Gaussian process regression, and kernel ridge regression all instantiate this theorem.
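
Kernel ridge regression makes the theorem concrete: the fitted function is parameterised by exactly $n$ coefficients, one per training point. A minimal sketch with hypothetical 1-dimensional data and an RBF kernel (the bandwidth and regulariser values are arbitrary choices):

```python
# Kernel ridge regression: f*(x) = sum_i alpha_i k(x, x_i).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))            # n = 20 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

def rbf(Xa, Xb, sigma=1.0):
    """Gaussian kernel matrix between two sets of points."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # n coefficients, one per point

# The optimum lives in span{k(., x_1), ..., k(., x_n)}: evaluate it anywhere
X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf(X_test, X) @ alpha
print(f_test)
```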

Common kernels and their RKHS subspaces:

  • Linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$: RKHS = linear functions on $\mathbb{R}^d$; a finite-dimensional subspace
  • RBF / Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2 / 2\sigma^2)$: RKHS is a dense subspace of $L^2(\mathbb{R}^d)$; infinite-dimensional
  • Matérn kernel: RKHS is the Sobolev space $W^{\nu+d/2, 2}$; the smoothness parameter $\nu$ controls which Sobolev subspace
  • Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top\mathbf{x}')^p$: RKHS = polynomials of degree $\leq p$; $\binom{d+p}{p}$-dimensional

11.3 Neural Networks as Subspaces of Function Spaces

A neural network architecture with parameter space $\mathbb{R}^p$ defines a parametric family of functions:

$$\mathcal{F}_\theta = \{f_\theta : \theta \in \mathbb{R}^p\} \subset L^2(\mathcal{X})$$

This family is not a subspace of $L^2$. In general, $f_{\theta_1} + f_{\theta_2}$ is not equal to $f_\theta$ for any $\theta$ (the sum of two networks is not itself a network with the same architecture), and similarly for scalar multiples. The non-linearity of the architecture means the function family is a non-linear manifold embedded in $L^2$, not a subspace.

However, the tangent space at any parameter $\theta$ IS a subspace. The tangent space to the manifold $\mathcal{F}_\theta$ at the point $f_\theta$ is:

$$T_{f_\theta}\mathcal{F}_\theta = \text{span}\left\{\frac{\partial f_\theta}{\partial \theta_i} : i = 1, \ldots, p\right\} \subseteq L^2(\mathcal{X})$$

This is the span of $p$ functions - an (at most) $p$-dimensional linear subspace of $L^2$. When you take a gradient step $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$, the first-order change in the network function lies in this tangent space. First-order optimisation operates in the tangent subspace.

Neural Tangent Kernel (NTK). In the infinite-width limit (network width $\to \infty$ with appropriate scaling), the evolution of the network function during gradient flow is governed by a linear equation in function space:

$$\dot{f}_t(\mathbf{x}) = -\sum_{\mathbf{x}'} K(\mathbf{x}, \mathbf{x}') \frac{\partial \mathcal{L}}{\partial f(\mathbf{x}')}$$

where $K(\mathbf{x}, \mathbf{x}') = \langle \nabla_\theta f_\theta(\mathbf{x}), \nabla_\theta f_\theta(\mathbf{x}') \rangle_{\mathbb{R}^p}$ is the NTK. In the infinite-width limit, $K$ becomes constant during training. The trained network converges to kernel regression with the NTK as the kernel - meaning the trained function lies in the RKHS defined by the NTK, which is a subspace of $L^2$. The NTK result says: in the lazy training regime, the neural network effectively performs a linear projection onto a specific infinite-dimensional subspace of function space.

Universal approximation and density. The universal approximation theorem says that the set of functions representable by a neural network with one hidden layer and arbitrary width (or bounded width and arbitrary depth) is dense in $C([0,1]^n)$ (continuous functions on the unit hypercube) under the uniform norm. "Dense" means: for any continuous function $f$ and any $\varepsilon > 0$, there is a network function $f_\theta$ with $\|f - f_\theta\|_\infty < \varepsilon$. This is a density statement about the parametric family $\mathcal{F}$ inside the function space $C([0,1]^n)$. Depth (or width) determines which functions are reachable; nonlinearity is what allows density.

11.4 Krylov Subspaces

Krylov subspaces are the foundation of the most practical iterative linear algebra algorithms. They connect the abstract geometry of subspaces to the computational efficiency of matrix-vector products.

Definition. For a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $\mathbf{b} \in \mathbb{R}^n$, the Krylov subspace of order $k$ is:

$$\mathcal{K}_k(A, \mathbf{b}) = \text{span}\{\mathbf{b},\ A\mathbf{b},\ A^2\mathbf{b},\ \ldots,\ A^{k-1}\mathbf{b}\}$$

This is the span of the first $k$ vectors in the sequence $\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots$ - the orbit of $\mathbf{b}$ under repeated application of $A$.

Nested structure. The Krylov subspaces form a nested sequence:

$$\mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \mathcal{K}_3 \subseteq \cdots \subseteq \mathcal{K}_n$$

The sequence stabilises at some $r \leq n$: $\mathcal{K}_r = \mathcal{K}_{r+1} = \cdots$. At that point, $A \cdot \mathcal{K}_r \subseteq \mathcal{K}_r$ (the subspace is invariant under $A$), and $r$ equals the degree of the minimal polynomial of $A$ with respect to $\mathbf{b}$.

Krylov methods. Iterative solvers based on Krylov subspaces find the best approximate solution within $\mathcal{K}_k$ at step $k$, then expand the subspace:

| Method | Problem | Optimisation in $\mathcal{K}_k$ |
|---|---|---|
| Conjugate Gradients (CG) | $A\mathbf{x} = \mathbf{b}$, $A$ SPD | Minimises $\lVert\mathbf{x} - \mathbf{x}^*\rVert_A$ over $\mathcal{K}_k$ |
| GMRES | $A\mathbf{x} = \mathbf{b}$, general $A$ | Minimises $\lVert A\mathbf{x} - \mathbf{b}\rVert_2$ over $\mathcal{K}_k$ |
| Lanczos | Eigenvalues of symmetric $A$ | Finds the best rank-$k$ approximation to the spectrum |
| Arnoldi | Eigenvalues of general $A$ | Orthogonalises $\mathcal{K}_k$ via Gram-Schmidt |

Each step costs one matrix-vector product with $A$ plus $O(kn)$ work for orthogonalisation. For a sparse $A$ with $\text{nnz}$ non-zeros, each matrix-vector product costs $O(\text{nnz})$. Contrast with direct methods (Gaussian elimination): $O(n^3)$ cost regardless of sparsity.
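
A minimal sketch of the matrix-free pattern with SciPy's conjugate gradient solver; the SPD matrix is a hypothetical example, and in practice the matvec would be, e.g., a Hessian-vector product rather than an explicit matrix:

```python
# Conjugate gradients using only matrix-vector products.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50 * np.eye(50)            # symmetric positive definite
b = rng.standard_normal(50)

# CG never reads A's entries, only the map v -> Av: this is what makes it
# usable for implicit matrices (Fisher, Hessian) in large models.
A_op = LinearOperator((50, 50), matvec=lambda v: A @ v)

x, info = cg(A_op, b, maxiter=50)
assert info == 0                          # converged
assert np.linalg.norm(A @ x - b) < 1e-3 * np.linalg.norm(b)
```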

AI applications of Krylov methods:

  • Second-order optimisation. Methods like K-FAC (Kronecker-Factored Approximate Curvature) need to solve linear systems involving the Fisher information matrix $F$ to compute the natural gradient. Krylov methods can solve $F\mathbf{d} = \mathbf{g}$ (where $\mathbf{g}$ is the gradient) without explicitly forming $F$ - only matrix-vector products $F\mathbf{v}$ are needed, and these can be computed efficiently.
  • Eigenvalue computation for interpretability. Computing the top eigenvalues of the Hessian $\nabla^2 \mathcal{L}$ or the covariance matrix of gradients is done via Lanczos, which builds a Krylov subspace using Hessian-vector products. These products are available cheaply via the Pearlmutter trick (forward-over-backward automatic differentiation). The Krylov subspace approach gives the dominant eigenspace in $O(k)$ Hessian-vector products.
  • Linear attention and state space models. SSMs (S4, Mamba) compute recurrences of the form $\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t$ whose outputs lie in Krylov-like subspaces. The efficiency of convolutional SSM computation is related to the structure of the Krylov subspace generated by the state matrix $A$ and the input matrix $B$.
