Vector Spaces and Subspaces, Part 2: The Four Fundamental Subspaces (Section 7) to Subspaces in Functional Analysis (Section 11)

7. The Four Fundamental Subspaces

Recall: earlier sections used these subspaces informally:

  • Systems of Equations 4.3 used $\text{col}(A)$ and $\text{null}(A^\top)$ to characterise when $A\mathbf{x} = \mathbf{b}$ is solvable
  • Matrix Rank 4 used rank and null space dimension to state the rank-nullity theorem

This section is the canonical home: rigorous definitions, dimensional identities, bases computed from RREF, the orthogonality relationships, and their geometric interpretation as the complete decomposition of $\mathbb{R}^m$ and $\mathbb{R}^n$.

For any matrix $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = r$, Gilbert Strang identified four fundamental subspaces that together give a complete picture of how $A$ acts as a linear map $\mathbb{R}^n \to \mathbb{R}^m$. Understanding all four - their definitions, dimensions, bases, and mutual relationships - is the complete theory of linear systems.

7.1 Definition of All Four Subspaces

1. Column space $\text{col}(A) \subseteq \mathbb{R}^m$ (also called the image or range):

$$\text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$$

  • The set of all possible outputs of $A$; the "reachable" subspace in $\mathbb{R}^m$
  • $\dim(\text{col}(A)) = r$
  • Basis: pivot columns of $A$ (the original $A$, not its RREF)
  • The system $A\mathbf{x} = \mathbf{b}$ is consistent if and only if $\mathbf{b} \in \text{col}(A)$

2. Null space $\text{null}(A) \subseteq \mathbb{R}^n$ (also called the kernel):

$$\text{null}(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$$

  • The set of all inputs mapped to zero; the "invisible" directions in $\mathbb{R}^n$
  • $\dim(\text{null}(A)) = n - r$ (by Rank-Nullity)
  • Basis: free variable solution vectors from the RREF of $A$
  • $A$ cannot distinguish $\mathbf{x}$ from $\mathbf{x} + \mathbf{z}$ for any $\mathbf{z} \in \text{null}(A)$: both produce the same output

3. Row space $\text{row}(A) = \text{col}(A^\top) \subseteq \mathbb{R}^n$:

$$\text{row}(A) = \{A^\top \mathbf{y} : \mathbf{y} \in \mathbb{R}^m\} = \text{span of the rows of } A$$

  • The set of all linear combinations of the rows of $A$; the directions in $\mathbb{R}^n$ that $A$ "listens to"
  • $\dim(\text{row}(A)) = r$ (same as the column space - both equal the rank)
  • Basis: non-zero rows of the RREF of $A$
  • The projection of any input $\mathbf{x}$ onto $\text{row}(A)$ determines the output; the null space component of $\mathbf{x}$ is discarded

4. Left null space $\text{null}(A^\top) \subseteq \mathbb{R}^m$:

$$\text{null}(A^\top) = \{\mathbf{y} \in \mathbb{R}^m : A^\top \mathbf{y} = \mathbf{0}\}$$

  • The set of all output-space vectors orthogonal to the column space of $A$
  • $\dim(\text{null}(A^\top)) = m - r$ (by Rank-Nullity applied to $A^\top$)
  • Basis: free variable solution vectors from the RREF of $A^\top$
  • If $\mathbf{b} \in \text{null}(A^\top)$ with $\mathbf{b} \neq \mathbf{0}$, then $\mathbf{b} \perp \text{col}(A)$, which means $A\mathbf{x} = \mathbf{b}$ has no solution

Dimension summary:

| Subspace | Ambient space | Dimension | Orthogonal complement |
|---|---|---|---|
| $\text{row}(A)$ | $\mathbb{R}^n$ | $r$ | $\text{null}(A)$ |
| $\text{null}(A)$ | $\mathbb{R}^n$ | $n-r$ | $\text{row}(A)$ |
| $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\text{null}(A^\top)$ |
| $\text{null}(A^\top)$ | $\mathbb{R}^m$ | $m-r$ | $\text{col}(A)$ |

7.2 The Orthogonality Relations

The four subspaces pair into two orthogonal decompositions:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A) \qquad \text{(orthogonal direct sum in } \mathbb{R}^n\text{)}$$
$$\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top) \qquad \text{(orthogonal direct sum in } \mathbb{R}^m\text{)}$$

Proof that $\text{row}(A) \perp \text{null}(A)$.

Take any $\mathbf{x} \in \text{null}(A)$ and any $\mathbf{y} \in \text{row}(A) = \text{col}(A^\top)$. Then $\mathbf{y} = A^\top \mathbf{v}$ for some $\mathbf{v} \in \mathbb{R}^m$. Compute:

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \mathbf{x}^\top A^\top \mathbf{v} = (A\mathbf{x})^\top \mathbf{v} = \mathbf{0}^\top \mathbf{v} = 0$$

So every null space vector is orthogonal to every row space vector.

Since $\dim(\text{row}(A)) + \dim(\text{null}(A)) = r + (n-r) = n = \dim(\mathbb{R}^n)$ and the two subspaces are orthogonal (hence their intersection is $\{\mathbf{0}\}$), they form an orthogonal direct sum decomposition of $\mathbb{R}^n$. Every vector $\mathbf{x} \in \mathbb{R}^n$ decomposes uniquely as:

$$\mathbf{x} = \underbrace{\mathbf{x}_r}_{\in \text{row}(A)} + \underbrace{\mathbf{x}_n}_{\in \text{null}(A)}$$

And $A\mathbf{x} = A\mathbf{x}_r + A\mathbf{x}_n = A\mathbf{x}_r + \mathbf{0} = A\mathbf{x}_r$. The null space component is silenced; only the row space component matters.

7.3 Strang's Big Picture

The complete action of $A$ as a linear map $\mathbb{R}^n \to \mathbb{R}^m$ is captured in the following diagram:

STRANG'S FOUR FUNDAMENTAL SUBSPACES

        R^n (input space)                 R^m (output space)

   +----------------------+          +------------------------+
   | row(A)               |    A     | col(A)                 |
   | dim = r              | -------> | dim = r                |
   |                      |          |                        |
   | "the listening space"|          | "the reachable space"  |
   | A is sensitive here  |          | A can write here       |
   +----------------------+          +------------------------+

   +----------------------+          +------------------------+
   | null(A)              |    A     | null(A^T)              |
   | dim = n - r          | -------> | dim = m - r            |
   |                      |  (-> 0)  |                        |
   | "the silent space"   |          | "the unreachable       |
   | A maps this to 0     |          |  space"                |
   +----------------------+          +------------------------+

  Key facts:
  - A is a bijection from row(A) to col(A) - restricted to row(A),
    A is invertible.
  - A maps all of null(A) to the single point {0} in R^m.
  - No input - however large - can produce an output in null(A^T).
  - The orthogonal complements pair up:
    row(A)^⊥ = null(A)  and  col(A)^⊥ = null(A^T)

The action of $A$, completely described. For any input $\mathbf{x} = \mathbf{x}_r + \mathbf{x}_n$ (row space + null space decomposition):

$$A\mathbf{x} = A\mathbf{x}_r \in \text{col}(A)$$

The null space component vanishes; the row space component is mapped bijectively into the column space. The left null space is entirely inaccessible from the input side.

7.4 Computing the Four Subspaces

The systematic procedure to find bases for all four fundamental subspaces of a matrix $A \in \mathbb{R}^{m \times n}$:

Step 1: Row reduce $A$ to RREF.

$$A \xrightarrow{\text{row ops}} R = \text{RREF}(A)$$

Identify the $r$ pivot positions (row $i$, column $j$ pairs where $R_{ij} = 1$ is the leading 1 of row $i$). Let $r = \text{rank}(A)$.

Step 2: $\text{null}(A)$.

  • Free variables: columns of $R$ without a pivot (say columns $j_1, \ldots, j_{n-r}$)
  • For each free variable $x_{j_\ell}$: set $x_{j_\ell} = 1$, all other free variables $= 0$, and solve for the pivot variables from $R\mathbf{x} = \mathbf{0}$
  • The resulting vector $\mathbf{n}_\ell \in \mathbb{R}^n$ is the $\ell$-th null space basis vector
  • Repeat for $\ell = 1, \ldots, n-r$; these $n-r$ vectors form a basis for $\text{null}(A)$

Step 3: $\text{col}(A)$.

  • The pivot columns of the original matrix $A$ (not $R$) form a basis for $\text{col}(A)$
  • Specifically: if pivots appear in columns $p_1, p_2, \ldots, p_r$, then $\{\mathbf{a}_{p_1}, \mathbf{a}_{p_2}, \ldots, \mathbf{a}_{p_r}\}$ (columns of $A$) is a basis

Why the original $A$, not $R$? Row operations change the column space, although they preserve the linear dependencies between columns. The pivot positions therefore identify which columns are independent, but the actual column vectors must be taken from the original $A$ to get a basis for the original column space.

Step 4: $\text{row}(A)$.

  • The non-zero rows of $R$ (the RREF of $A$) form a basis for $\text{row}(A)$
  • Unlike the column space, the row space is preserved by row operations; the non-zero rows of $R$ are particular convenient linear combinations of the rows of $A$ that happen to be in reduced form

Step 5: $\text{null}(A^\top)$.

  • Row reduce $A^\top$ to its RREF, then apply the null space procedure (Step 2) to $A^\top$
  • Alternatively: if you have already computed the four dimensions (and bases for three of the four subspaces), use the complementary dimension argument
  • Basis vectors of $\text{null}(A^\top)$ can also be read off from certain row operations applied to $[A \mid I_m]$
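
The whole procedure can be run mechanically. Below is a minimal sketch using SymPy's exact `rref` and `nullspace` routines; the $3 \times 4$ matrix is a hypothetical example chosen to have rank 2, not one from the lesson.

```python
# Bases for all four fundamental subspaces via exact RREF (SymPy).
import sympy as sp

A = sp.Matrix([[1, 2, 0, 1],
               [2, 4, 1, 3],
               [3, 6, 1, 4]])          # hypothetical 3x4 example, rank 2
m, n = A.shape

R, pivots = A.rref()                   # RREF and the pivot column indices
r = len(pivots)                        # rank(A)

col_basis  = [A[:, j] for j in pivots]    # pivot columns of the ORIGINAL A
row_basis  = [R[i, :] for i in range(r)]  # non-zero rows of the RREF
null_basis = A.nullspace()                # free-variable solution vectors
left_null  = A.T.nullspace()              # Step 2 applied to A^T

assert r + len(null_basis) == n        # rank-nullity in R^n
assert r + len(left_null) == m         # rank-nullity in R^m
```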

7.5 AI Interpretation of the Four Subspaces

For a weight matrix $W \in \mathbb{R}^{m \times n}$ in a neural network layer $\mathbf{y} = W\mathbf{x}$, the four subspaces tell the complete story of what this layer can and cannot do:

Column space col(W) - the "reachable" subspace.

Only outputs in $\text{col}(W) \subseteq \mathbb{R}^m$ can be produced by this layer. If $\text{rank}(W) = r < m$, then $m - r$ dimensions of the output space receive exactly zero contribution from this layer, regardless of input. In a transformer residual stream: each attention head and MLP block writes to a specific subspace of $\mathbb{R}^d$; the column space of the output projection $W_O$ is the subspace the head writes to. Dimensions orthogonal to $\text{col}(W_O)$ are untouched by this head.

Null space null(W) - the "invisible" subspace.

Input directions in $\text{null}(W)$ are completely ignored by the layer. If an input vector has all its "mass" in the null space, the output is zero - the layer cannot see it. This is both a limitation and a design feature: in mechanistic interpretability, the null space of a query projection $W_Q$ is the set of residual stream directions that this attention head does not use to form its queries. These directions are invisible to the head's query computation.

Row space row(W) - the "listening" subspace.

The row space is the complement of the null space: input directions in $\text{row}(W)$ are the ones the layer actually "listens to". The projection of the input onto $\text{row}(W)$ determines the output; the null space projection vanishes. In a transformer, the row space of $W_K$ determines which directions of the residual stream the key computation is sensitive to. Keys "live in" the row space of $W_K$.

Left null space null(W^T) - the "unreachable" output subspace.

The left null space $\text{null}(W^\top) \subseteq \mathbb{R}^m$ is orthogonal to $\text{col}(W)$. No input, however crafted, can produce an output with a component in this subspace from this layer. In a transformer with residual connections, the left null space of one layer's output projection is the subspace that must be written by other layers. This is why residual connections are essential: they allow information to flow "around" layers whose column spaces don't reach certain directions.

AI INTERPRETATION: WEIGHT MATRIX SUBSPACES

  W in R^{m x n}, rank r

  R^n (input):                      R^m (output):
  +--------------------+           +--------------------+
  | row(W)             |     W     | col(W)             |
  | dim r              | --------> | dim r              |
  | "W listens here"   |           | "W writes here"    |
  +--------------------+           +--------------------+
  | null(W)            |     W     | null(W^T)          |
  | dim n-r            | ---> {0}  | dim m-r            |
  | "W ignores this"   |           | "W never here"     |
  +--------------------+           +--------------------+

  Transformer layer (y = Wx + b, residual):
  - Attention head h reads from row(W_Q^h), row(W_K^h), row(W_V^h)
  - Head h writes to col(W_O^h)
  - Information in null(W_Q^h) is invisible to head h's queries
  - Information in null(W_V^h) is not read by head h's values
  - The residual stream preserves ALL d dimensions; individual
    layers each modify only their respective column space subspaces

8. Subspace Operations

8.1 Sum of Subspaces

For subspaces $W_1, W_2 \subseteq V$, their sum is:

$$W_1 + W_2 = \{\mathbf{w}_1 + \mathbf{w}_2 : \mathbf{w}_1 \in W_1,\ \mathbf{w}_2 \in W_2\}$$

Theorem. $W_1 + W_2$ is a subspace of $V$.

Proof.

  1. Contains $\mathbf{0}$: $\mathbf{0} = \mathbf{0} + \mathbf{0} \in W_1 + W_2$
  2. Closed under $+$: $(\mathbf{w}_1 + \mathbf{w}_2) + (\mathbf{w}_1' + \mathbf{w}_2') = (\mathbf{w}_1 + \mathbf{w}_1') + (\mathbf{w}_2 + \mathbf{w}_2') \in W_1 + W_2$ (since $W_1$, $W_2$ are closed)
  3. Closed under scaling: $\alpha(\mathbf{w}_1 + \mathbf{w}_2) = \alpha\mathbf{w}_1 + \alpha\mathbf{w}_2 \in W_1 + W_2$

$W_1 + W_2$ is the smallest subspace containing both $W_1$ and $W_2$. Any subspace $U$ that contains both $W_1$ and $W_2$ must contain all sums $\mathbf{w}_1 + \mathbf{w}_2$ (by closure), so $W_1 + W_2 \subseteq U$.

Grassmann dimension formula:

$$\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$$

Important distinction: $W_1 \cup W_2$ (the set union) is not a subspace in general. Take $W_1 = \text{span}\{(1,0)\}$ (the x-axis) and $W_2 = \text{span}\{(0,1)\}$ (the y-axis) in $\mathbb{R}^2$. Then $(1,0) \in W_1 \cup W_2$ and $(0,1) \in W_1 \cup W_2$, but $(1,0) + (0,1) = (1,1) \notin W_1 \cup W_2$. The union is not closed under addition. The sum $W_1 + W_2 = \mathbb{R}^2$ (the whole plane) is a subspace; the union (just the two axes) is not.
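
The Grassmann formula is easy to check numerically: stacking spanning matrices side by side gives $\dim(W_1 + W_2)$ as a rank. A small sketch, with two hypothetical coordinate subspaces of $\mathbb{R}^4$:

```python
# Numerical check of the Grassmann dimension formula.
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])  # span{e1, e2}
W2 = np.array([[0., 0.], [1., 0.], [0., 1.], [0., 0.]])  # span{e2, e3}

dim_W1 = np.linalg.matrix_rank(W1)
dim_W2 = np.linalg.matrix_rank(W2)
dim_sum = np.linalg.matrix_rank(np.hstack([W1, W2]))     # dim(W1 + W2)

# dim(W1 ∩ W2) recovered from the Grassmann formula
dim_intersection = dim_W1 + dim_W2 - dim_sum
print(dim_sum, dim_intersection)   # 3, 1 (the shared direction is e2)
```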

8.2 Intersection of Subspaces

For subspaces $W_1, W_2 \subseteq V$, their intersection is:

$$W_1 \cap W_2 = \{\mathbf{v} \in V : \mathbf{v} \in W_1 \text{ and } \mathbf{v} \in W_2\}$$

Theorem. $W_1 \cap W_2$ is a subspace of $V$.

Proof.

  1. Contains $\mathbf{0}$: $\mathbf{0} \in W_1$ and $\mathbf{0} \in W_2$, so $\mathbf{0} \in W_1 \cap W_2$
  2. Closed under $+$: if $\mathbf{v}, \mathbf{w} \in W_1 \cap W_2$, then $\mathbf{v} + \mathbf{w} \in W_1$ (since $W_1$ is closed) and $\mathbf{v} + \mathbf{w} \in W_2$ (since $W_2$ is closed), so $\mathbf{v} + \mathbf{w} \in W_1 \cap W_2$
  3. Closed under scaling: $\alpha\mathbf{v} \in W_1$ and $\alpha\mathbf{v} \in W_2$, so $\alpha\mathbf{v} \in W_1 \cap W_2$

$W_1 \cap W_2$ is the largest subspace contained in both $W_1$ and $W_2$.

Computing $W_1 \cap W_2$: if $W_1 = \text{null}(A_1)$ and $W_2 = \text{null}(A_2)$, then:

$$W_1 \cap W_2 = \text{null}\left(\begin{bmatrix} A_1 \\ A_2 \end{bmatrix}\right)$$

For general subspaces given by spanning sets: find the vectors in $\text{span}\{\mathbf{u}_1, \ldots\}$ that also lie in $\text{span}\{\mathbf{v}_1, \ldots\}$ by setting up a linear system and solving.
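
When both subspaces arrive as null spaces, the stacked-matrix recipe above is one line with SciPy; the two constraint matrices below are hypothetical:

```python
# Intersection of two null spaces via vertical stacking.
import numpy as np
from scipy.linalg import null_space

A1 = np.array([[1., 1., 0.]])    # W1 = null(A1): the plane x + y = 0
A2 = np.array([[0., 1., -1.]])   # W2 = null(A2): the plane y = z

# W1 ∩ W2 = null([A1; A2])
basis = null_space(np.vstack([A1, A2]))
print(basis)   # one orthonormal column, proportional to (1, -1, -1)
```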

8.3 Direct Sum

Subspaces $W_1$ and $W_2$ are complementary in $V$ (equivalently, $V$ is their direct sum) if:

  1. $W_1 + W_2 = V$ - together they span $V$
  2. $W_1 \cap W_2 = \{\mathbf{0}\}$ - they share only the origin

We write $V = W_1 \oplus W_2$.

Theorem. $V = W_1 \oplus W_2$ if and only if every $\mathbf{v} \in V$ has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2$ with $\mathbf{w}_1 \in W_1$ and $\mathbf{w}_2 \in W_2$.

Proof of uniqueness. If $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2 = \mathbf{w}_1' + \mathbf{w}_2'$, then $\mathbf{w}_1 - \mathbf{w}_1' = \mathbf{w}_2' - \mathbf{w}_2$. The left side is in $W_1$; the right side is in $W_2$. So both sides lie in $W_1 \cap W_2 = \{\mathbf{0}\}$. Hence $\mathbf{w}_1 = \mathbf{w}_1'$ and $\mathbf{w}_2 = \mathbf{w}_2'$.

Dimension: $\dim(W_1 \oplus W_2) = \dim(W_1) + \dim(W_2)$. This follows from the Grassmann formula with $\dim(W_1 \cap W_2) = 0$.

Complements are not unique. For a given subspace $W \subseteq V$, there are many subspaces $W'$ such that $V = W \oplus W'$. The orthogonal complement $W^\perp$ is the canonical choice:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

Theorem (for finite-dimensional inner product spaces). $V = W \oplus W^\perp$, and $\dim(W^\perp) = \dim(V) - \dim(W)$.

The fundamental subspace decompositions from Section 7 are orthogonal direct sums:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A) = \text{row}(A) \oplus \text{row}(A)^\perp$$
$$\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top) = \text{col}(A) \oplus \text{col}(A)^\perp$$

Multiple direct sums. The direct sum generalises: $V = W_1 \oplus W_2 \oplus \cdots \oplus W_k$ if the $W_i$ are pairwise "independent" (each $W_i \cap (W_1 + \cdots + \hat{W}_i + \cdots + W_k) = \{\mathbf{0}\}$, where $\hat{W}_i$ means $W_i$ is omitted from the sum) and together they span $V$. Then every $\mathbf{v}$ has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \cdots + \mathbf{w}_k$ and $\dim(V) = \sum_i \dim(W_i)$.

In the spectral theorem (Section 13), the orthogonal decomposition into eigenspaces is a multiple orthogonal direct sum: $\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$.

8.4 Projection onto Subspaces

Given a subspace $W \subseteq V$, the orthogonal projection $\text{Proj}_W: V \to W$ maps each vector $\mathbf{v}$ to the closest point in $W$:

$$\text{Proj}_W(\mathbf{v}) = \arg\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|$$

The unique minimiser exists because $W$ is a closed convex set (any subspace is convex, and in finite dimensions, closed).

Characterisation. $\hat{\mathbf{v}} = \text{Proj}_W(\mathbf{v})$ is the unique vector in $W$ such that $\mathbf{v} - \hat{\mathbf{v}} \perp W$, i.e., $\langle \mathbf{v} - \hat{\mathbf{v}}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$.

Formula 1: orthonormal basis.

If $W$ has orthonormal basis $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$, then:

$$\text{Proj}_W(\mathbf{v}) = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle \mathbf{q}_i$$

In matrix form, with $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k] \in \mathbb{R}^{n \times k}$ (orthonormal columns):

$$\text{Proj}_W(\mathbf{v}) = QQ^\top \mathbf{v}$$

The projection matrix is $P = QQ^\top \in \mathbb{R}^{n \times n}$.

Formula 2: general (non-orthonormal) basis.

If $W = \text{col}(A)$ for $A \in \mathbb{R}^{n \times k}$ with linearly independent columns (full column rank):

$$\text{Proj}_W(\mathbf{v}) = A(A^\top A)^{-1} A^\top \mathbf{v}$$

The projection matrix is $P = A(A^\top A)^{-1}A^\top$. The matrix $(A^\top A)^{-1} A^\top$ is the Moore-Penrose pseudoinverse $A^\dagger$ when $A$ has full column rank.

Properties of an orthogonal projection matrix $P$:

| Property | Expression | Meaning |
|---|---|---|
| Idempotent | $P^2 = P$ | Projecting twice = projecting once |
| Symmetric | $P^\top = P$ | Projection is self-adjoint |
| Eigenvalues | $\lambda \in \{0, 1\}$ | Vectors in $W$ map to themselves; vectors in $W^\perp$ map to $\mathbf{0}$ |
| Rank | $\text{rank}(P) = \dim(W)$ | Rank = dimension of the subspace projected onto |
| Trace | $\text{tr}(P) = \dim(W)$ | Eigenvalues are 0s and 1s; trace = sum of eigenvalues |
| Complement | $I - P$ | Projection onto $W^\perp$ |

Proof that $P^2 = P$: $P^2 = (QQ^\top)(QQ^\top) = Q(Q^\top Q)Q^\top = QIQ^\top = QQ^\top = P$ (using $Q^\top Q = I_k$ for orthonormal columns).

Decomposition. Every $\mathbf{v} \in V$ decomposes as:

$$\mathbf{v} = \underbrace{P\mathbf{v}}_{\in W} + \underbrace{(I-P)\mathbf{v}}_{\in W^\perp}$$

The residual $(I-P)\mathbf{v} = \mathbf{v} - P\mathbf{v}$ is the component of $\mathbf{v}$ orthogonal to $W$, and $I-P$ is itself an orthogonal projection matrix (onto $W^\perp$).
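
Each property in the table above is a one-line check in code. A minimal sketch using Formula 2 on a hypothetical random 2-dimensional subspace of $\mathbb{R}^4$:

```python
# Projection matrix P = A (A^T A)^{-1} A^T and its properties.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))            # independent columns spanning W
P = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P @ P, P)               # idempotent
assert np.allclose(P, P.T)                 # symmetric
assert round(np.trace(P)) == 2             # trace = dim(W)

v = rng.standard_normal(4)
w, w_perp = P @ v, (np.eye(4) - P) @ v     # v = Pv + (I - P)v
assert np.isclose(w @ w_perp, 0.0)         # the two components are orthogonal
```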

AI applications of projection:

  • Least squares: the best-fit solution $\hat{\mathbf{x}} = (A^\top A)^{-1} A^\top \mathbf{b}$ projects $\mathbf{b}$ onto $\text{col}(A)$; least squares is orthogonal projection
  • PCA: projecting data onto the top-$r$ principal component subspace is a rank-$r$ projection $P = U_r U_r^\top$, where $U_r$ contains the top $r$ left singular vectors
  • Attention: (soft) attention can be viewed as a weighted projection of value vectors onto query-determined subspaces
  • Concept erasure: removing a concept from an embedding by projecting onto the orthogonal complement of the concept subspace, $P_{\perp\text{concept}} = I - P_{\text{concept}}$; used in LEACE and related methods

8.5 Subspace Angles and Principal Angles

The angle between two vectors is well-defined: $\cos\theta = \langle \mathbf{u}, \mathbf{v} \rangle / (\|\mathbf{u}\| \|\mathbf{v}\|)$. The "angle" between two subspaces is more subtle - it requires a collection of angles called principal angles.

Definition. For subspaces $W_1, W_2 \subseteq V$ with $\dim(W_1) = p$, $\dim(W_2) = q$, and $k = \min(p, q)$, the principal angles $0 \leq \theta_1 \leq \theta_2 \leq \cdots \leq \theta_k \leq \pi/2$ are defined recursively:

$$\cos\theta_j = \max_{\substack{\mathbf{u} \in W_1 \\ \mathbf{v} \in W_2}} \langle \mathbf{u}, \mathbf{v} \rangle \quad \text{subject to } \|\mathbf{u}\| = \|\mathbf{v}\| = 1,\ \mathbf{u} \perp \mathbf{u}_i,\ \mathbf{v} \perp \mathbf{v}_i \text{ for } i < j$$

where $\mathbf{u}_i, \mathbf{v}_i$ denote the maximising pairs from the earlier steps.

Computation via SVD. If $Q_1 \in \mathbb{R}^{n \times p}$ and $Q_2 \in \mathbb{R}^{n \times q}$ are orthonormal bases for $W_1$ and $W_2$:

$$Q_1^\top Q_2 = U \Sigma V^\top \qquad \text{(SVD)}$$

Then $\cos\theta_i = \sigma_i$ (the $i$-th singular value of $Q_1^\top Q_2$). The principal angles are the arc-cosines of the singular values.
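
A minimal sketch of this computation for two hypothetical random 2-dimensional subspaces of $\mathbb{R}^5$ (SciPy packages the same recipe as `scipy.linalg.subspace_angles`):

```python
# Principal angles from the singular values of Q1^T Q2.
import numpy as np

rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis for W1
Q2, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # orthonormal basis for W2

sigma = np.linalg.svd(Q1.T @ Q2, compute_uv=False)  # cosines, in [0, 1]
theta = np.arccos(np.clip(sigma, -1.0, 1.0))        # principal angles, ascending
print(np.degrees(theta))
```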

Interpretation:

  • $\theta_1 = 0$: the subspaces share a common direction (their intersection is non-trivial)
  • $\theta_k = \pi/2$: the subspaces contain a pair of orthogonal directions (they are "partially orthogonal")
  • All $\theta_i = \pi/2$: the subspaces are orthogonal ($W_1 \perp W_2$, i.e., $W_1 \subseteq W_2^\perp$)
  • All $\theta_i = 0$: $W_1 \subseteq W_2$ or $W_2 \subseteq W_1$ (one contains the other)

AI applications:

  • Head overlap: principal angles between two attention heads' OV subspaces measure how much they overlap. If $\theta_1 \approx 0$, the heads share a direction and are partially redundant. If all $\theta_i \approx \pi/2$, the heads write to orthogonal subspaces - they are fully independent.
  • Gradient similarity across layers: principal angles between gradient subspaces at different layers measure gradient diversity during training.
  • Representation similarity: the CKA (Centered Kernel Alignment) measure of representation similarity between two networks is related to the principal angles between their representation subspaces.
  • LoRA subspace alignment: after fine-tuning, the principal angles between the LoRA update subspace and the top-$r$ gradient subspace reveal how well LoRA captured the gradient's preferred directions.

9. Inner Product Spaces and Orthogonality

Recall: Vectors and Spaces 5-6 developed norms and inner products concretely in $\mathbb{R}^n$ - dot product, angle, Cauchy-Schwarz, and orthogonal projection. This section extends those ideas to abstract inner product spaces: general Hilbert spaces, orthonormal bases, Gram-Schmidt orthogonalisation, and orthogonal complements. The abstract setting is what makes these concepts applicable to function spaces (Fourier analysis, kernels) and not just $\mathbb{R}^n$.

9.1 Inner Products

An inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

  1. Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
  2. Linearity in the first argument: $\langle \alpha\mathbf{u} + \beta\mathbf{w}, \mathbf{v} \rangle = \alpha\langle\mathbf{u},\mathbf{v}\rangle + \beta\langle\mathbf{w},\mathbf{v}\rangle$
  3. Positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, with equality if and only if $\mathbf{v} = \mathbf{0}$

Together, properties 1 and 2 imply linearity in the second argument as well (by symmetry), making the inner product bilinear: linear in each argument separately.

The induced norm is $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$, and the induced metric (distance) is $d(\mathbf{u},\mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$.

Standard inner products:

| Space | Inner product | Induced norm |
|---|---|---|
| $\mathbb{R}^n$ | $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_i u_i v_i$ | $\lVert\mathbf{u}\rVert = \sqrt{\sum_i u_i^2}$ (Euclidean) |
| $\mathbb{R}^{m \times n}$ | $\langle A, B \rangle = \text{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$ | $\lVert A\rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}$ (Frobenius) |
| $C([a,b])$ | $\langle f, g \rangle = \int_a^b f(t) g(t)\, dt$ | $\lVert f\rVert = \sqrt{\int_a^b f(t)^2\, dt}$ ($L^2$ norm) |

Weighted inner product. For a symmetric positive definite matrix $M \in \mathbb{R}^{n \times n}$:

$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^\top M \mathbf{v}$$

This defines a different geometry on $\mathbb{R}^n$: the unit ball $\{\mathbf{v} : \langle\mathbf{v},\mathbf{v}\rangle_M \leq 1\}$ is an ellipsoid rather than a sphere. The natural gradient in optimisation uses the Fisher information matrix $F$ as the weight matrix: $\langle \mathbf{g}, \mathbf{g} \rangle_F = \mathbf{g}^\top F \mathbf{g}$ measures gradient magnitude in the geometry of the statistical model, not the flat geometry of parameter space.

Cauchy-Schwarz inequality. For any inner product:

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$$

with equality if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent ($\mathbf{u} = \alpha\mathbf{v}$ for some scalar $\alpha$).

This is one of the most important inequalities in mathematics. It implies:

  • Cosine similarity $\frac{\langle \mathbf{u},\mathbf{v}\rangle}{\|\mathbf{u}\|\|\mathbf{v}\|}$ is always in $[-1, 1]$ - well-defined as a cosine
  • The triangle inequality $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ (from Cauchy-Schwarz applied to $\langle\mathbf{u},\mathbf{v}\rangle \leq \|\mathbf{u}\|\|\mathbf{v}\|$)
  • The law of cosines: $\|\mathbf{u}-\mathbf{v}\|^2 = \|\mathbf{u}\|^2 - 2\langle\mathbf{u},\mathbf{v}\rangle + \|\mathbf{v}\|^2$

9.2 Orthogonality

Vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal, written $\mathbf{u} \perp \mathbf{v}$, if $\langle \mathbf{u}, \mathbf{v} \rangle = 0$.

Orthogonality generalises perpendicularity from Euclidean geometry to any inner product space. In the Fourier inner product on $L^2$, $\sin(mt)$ and $\cos(nt)$ are orthogonal for all integers $m, n$. In the Frobenius inner product on matrices, symmetric and skew-symmetric matrices are orthogonal (since $\text{tr}(A^\top B) = 0$ when $A = A^\top$ and $B = -B^\top$).

Pythagorean theorem. If $\mathbf{u} \perp \mathbf{v}$, then:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

Proof:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \langle\mathbf{u}+\mathbf{v},\mathbf{u}+\mathbf{v}\rangle = \langle\mathbf{u},\mathbf{u}\rangle + 2\langle\mathbf{u},\mathbf{v}\rangle + \langle\mathbf{v},\mathbf{v}\rangle = \|\mathbf{u}\|^2 + 0 + \|\mathbf{v}\|^2 \quad \checkmark$$

More generally, if $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are pairwise orthogonal:

$$\left\|\sum_{i=1}^k \mathbf{v}_i\right\|^2 = \sum_{i=1}^k \|\mathbf{v}_i\|^2$$

Orthogonal sets and independence. Any set of non-zero pairwise orthogonal vectors is linearly independent.

Proof. Suppose $\sum_{i=1}^k \alpha_i \mathbf{v}_i = \mathbf{0}$ with the $\mathbf{v}_i \neq \mathbf{0}$ pairwise orthogonal. Take the inner product of both sides with $\mathbf{v}_j$:

$$0 = \left\langle \sum_i \alpha_i \mathbf{v}_i,\ \mathbf{v}_j \right\rangle = \sum_i \alpha_i \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \alpha_j \langle \mathbf{v}_j, \mathbf{v}_j \rangle = \alpha_j \|\mathbf{v}_j\|^2$$

Since $\mathbf{v}_j \neq \mathbf{0}$, we have $\|\mathbf{v}_j\|^2 > 0$, so $\alpha_j = 0$. This holds for all $j$.

This is a powerful observation: orthogonality implies independence. An orthogonal set is automatically independent, so an orthogonal spanning set is automatically a basis.

9.3 Gram-Schmidt Orthogonalisation

Given $k$ linearly independent vectors $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}$ in an inner product space, the Gram-Schmidt process produces an orthonormal set $\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_k\}$ such that:

$$\text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_j\} = \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_j\} \quad \text{for all } j = 1, \ldots, k$$

The key invariant: at each step, the orthonormal set spans the same subspace as the original vectors up to that index.

Algorithm:

Step 1:  u_1 = v_1
         q_1 = u_1 / ||u_1||

Step 2:  u_2 = v_2 - <v_2, q_1> q_1
         q_2 = u_2 / ||u_2||

Step 3:  u_3 = v_3 - <v_3, q_1> q_1 - <v_3, q_2> q_2
         q_3 = u_3 / ||u_3||

         ...

Step j:  u_j = v_j - sum_{i=1}^{j-1} <v_j, q_i> q_i
         q_j = u_j / ||u_j||

What each step does. At step $j$, we subtract from $\mathbf{v}_j$ all of its projections onto the previously constructed orthonormal vectors $\mathbf{q}_1, \ldots, \mathbf{q}_{j-1}$. The result $\mathbf{u}_j$ is the component of $\mathbf{v}_j$ not explained by the previous vectors - the "new" direction that $\mathbf{v}_j$ adds. Normalising gives $\mathbf{q}_j$.

Why $\mathbf{u}_j \neq \mathbf{0}$: since $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ are linearly independent, $\mathbf{v}_j \notin \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_{j-1}\} = \text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_{j-1}\}$. Therefore $\mathbf{u}_j = \mathbf{v}_j - \sum_{i<j}\langle\mathbf{v}_j,\mathbf{q}_i\rangle\mathbf{q}_i \neq \mathbf{0}$ (if it were zero, $\mathbf{v}_j$ would be in the span of the previous $\mathbf{q}$'s, contradicting independence).

Verification of orthonormality. For $i < j$:

$$\langle \mathbf{q}_j, \mathbf{q}_i \rangle = \frac{1}{\|\mathbf{u}_j\|} \left\langle \mathbf{v}_j - \sum_{\ell < j} \langle \mathbf{v}_j, \mathbf{q}_\ell \rangle \mathbf{q}_\ell,\ \mathbf{q}_i \right\rangle = \frac{1}{\|\mathbf{u}_j\|}\left(\langle \mathbf{v}_j, \mathbf{q}_i \rangle - \langle \mathbf{v}_j, \mathbf{q}_i \rangle \langle \mathbf{q}_i, \mathbf{q}_i \rangle\right) = 0$$

Connection to QR decomposition. The Gram-Schmidt process is the algorithm underlying QR decomposition. For a matrix $A = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k]$ with independent columns:

$$A = QR$$

where $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k]$ has orthonormal columns and $R$ is upper triangular with positive diagonal:

$$R_{ij} = \begin{cases} \langle \mathbf{v}_j, \mathbf{q}_i \rangle & \text{if } i \leq j \\ 0 & \text{if } i > j \end{cases}$$

The diagonal entries are $R_{jj} = \|\mathbf{u}_j\| > 0$.

Numerical issues. The classical Gram-Schmidt algorithm as stated is numerically unstable for nearly dependent vectors: rounding errors accumulate and the resulting vectors may not be accurately orthogonal. The Modified Gram-Schmidt algorithm reorders the operations to reduce error propagation. For production use, Householder QR (using Householder reflections rather than projections) is numerically stable and preferred.
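
A minimal sketch of the modified variant in NumPy, applied to the same two vectors as the worked example below; the function name `mgs` is ours, not a library routine:

```python
# Modified Gram-Schmidt: projections are subtracted one at a time, using the
# already-updated vector, which reduces rounding-error accumulation.
import numpy as np

def mgs(V):
    """Return Q (orthonormal columns) and upper-triangular R with A = QR."""
    V = V.astype(float).copy()
    n, k = V.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for j in range(k):
        for i in range(j):
            R[i, j] = Q[:, i] @ V[:, j]     # coefficient against the current v_j
            V[:, j] -= R[i, j] * Q[:, i]    # subtract immediately (the "modified" order)
        R[j, j] = np.linalg.norm(V[:, j])
        Q[:, j] = V[:, j] / R[j, j]
    return Q, R

A = np.array([[1., 1.], [1., 0.], [0., 1.]])   # columns v1, v2 of the worked example
Q, R = mgs(A)
assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = QR
```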

Worked example. Let $\mathbf{v}_1 = (1, 1, 0)^\top$, $\mathbf{v}_2 = (1, 0, 1)^\top$ in $\mathbb{R}^3$.

Step 1: $\mathbf{u}_1 = (1,1,0)^\top$, $\|\mathbf{u}_1\| = \sqrt{2}$, $\mathbf{q}_1 = (1/\sqrt{2},\ 1/\sqrt{2},\ 0)^\top$

Step 2: $\langle \mathbf{v}_2, \mathbf{q}_1 \rangle = (1)(1/\sqrt{2}) + (0)(1/\sqrt{2}) + (1)(0) = 1/\sqrt{2}$

$$\mathbf{u}_2 = (1,0,1)^\top - \frac{1}{\sqrt{2}}(1/\sqrt{2}, 1/\sqrt{2}, 0)^\top = (1,0,1)^\top - (1/2, 1/2, 0)^\top = (1/2, -1/2, 1)^\top$$

$$\|\mathbf{u}_2\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2}$$

$$\mathbf{q}_2 = \frac{1}{\sqrt{3/2}}(1/2, -1/2, 1)^\top = (1/\sqrt{6},\ -1/\sqrt{6},\ 2/\sqrt{6})^\top$$

Verify: $\langle\mathbf{q}_1,\mathbf{q}_2\rangle = (1/\sqrt{2})(1/\sqrt{6}) + (1/\sqrt{2})(-1/\sqrt{6}) + 0 = 1/\sqrt{12} - 1/\sqrt{12} = 0$ ✓

9.4 The Orthogonal Complement

For a subspace $W \subseteq V$, the orthogonal complement is:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

$W^\perp$ is a subspace (easy check):

  1. $\langle \mathbf{0}, \mathbf{w} \rangle = 0$ for all $\mathbf{w}$
  2. If $\langle \mathbf{v}, \mathbf{w} \rangle = 0$ and $\langle \mathbf{u}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$, then $\langle \mathbf{v}+\mathbf{u}, \mathbf{w} \rangle = 0$
  3. $\langle \alpha\mathbf{v}, \mathbf{w} \rangle = \alpha \langle \mathbf{v},\mathbf{w} \rangle = 0$

Properties (for finite-dimensional inner product spaces):

| Property | Statement |
|---|---|
| Double complement | $(W^\perp)^\perp = W$ |
| Dimension | $\dim(W) + \dim(W^\perp) = \dim(V)$ |
| Trivial intersection | $W \cap W^\perp = \{\mathbf{0}\}$ |
| Direct sum decomposition | $V = W \oplus W^\perp$ |
| Fundamental subspaces | $\text{null}(A) = \text{row}(A)^\perp$; $\text{null}(A^\top) = \text{col}(A)^\perp$ |

Why $(W^\perp)^\perp = W$: clearly $W \subseteq (W^\perp)^\perp$ (every vector in $W$ is orthogonal to everything in $W^\perp$). By the dimension formula: $\dim((W^\perp)^\perp) = \dim(V) - \dim(W^\perp) = \dim(V) - (\dim(V) - \dim(W)) = \dim(W)$. Same dimension and one contains the other, so they are equal.

Computing $W^\perp$. If $W = \text{span}\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ and we form $A = [\mathbf{w}_1 \mid \cdots \mid \mathbf{w}_k]^\top$ (whose rows are the $\mathbf{w}_i$), then:

$$W^\perp = \text{null}(A)$$

because $\langle \mathbf{v}, \mathbf{w}_i \rangle = 0$ for all $i$ is equivalent to $A\mathbf{v} = \mathbf{0}$.

9.5 Orthogonal Bases for AI

The choice of basis - orthogonal vs arbitrary - has concrete consequences in AI applications.

Interpretability and orthogonal bases. When the basis vectors of an embedding space are orthogonal (as in the standard basis of $\mathbb{R}^d$), each dimension corresponds to an independent direction. Inner products between embeddings measure similarity in a well-defined way. Projections onto individual basis directions give clean, independent "components". In practice, the features of a learned representation are rarely aligned with the standard basis - but if they were, each feature could be read off by looking at one coordinate.

The superposition hypothesis and basis independence. Elhage et al. (2022) argue that transformers represent features as directions in $\mathbb{R}^d$, not as individual coordinates. The model does not commit to any particular orthonormal basis; instead, features are placed as arbitrary unit vectors in $\mathbb{R}^d$. When there are fewer features than dimensions, they can be nearly orthogonal (low interference). When features exceed $d$, they must be non-orthogonal (superposed). The interference between features is measured by the inner products between their representing directions - exactly the off-diagonal entries of the Gram matrix of the feature vectors.

Cosine similarity. The dominant similarity measure in embedding spaces:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\| \|\mathbf{v}\|}$$

This is the cosine of the angle between $\mathbf{u}$ and $\mathbf{v}$, ranging from $-1$ (anti-parallel) to $+1$ (parallel). Orthogonal vectors have cosine similarity 0 - they are "unrelated" by this measure. Cosine similarity is basis-dependent: it changes if we apply a non-orthogonal change of basis to the space.

Orthogonal vs orthonormal bases. Orthogonal bases (pairwise orthogonal, but not necessarily unit length) are often more natural than orthonormal ones for intermediate computations. An orthonormal basis (each vector unit length AND pairwise orthogonal) is the "gold standard" for numerical stability and interpretability. The Gram-Schmidt process converts any independent set into an orthonormal one.

PCA as an orthonormal basis for data. Principal Component Analysis finds the orthonormal basis of $\mathbb{R}^d$ that best aligns with the directions of maximal variance in a dataset. In the PCA basis, the covariance matrix is diagonal (its off-diagonal elements are zero), meaning the principal components are statistically uncorrelated. PCA is precisely the process of finding the orthonormal eigenbasis of the covariance matrix and using it as the new coordinate system.

Layer normalisation and orthogonality. Layer normalisation in transformers includes a learnable scale $\gamma$ and shift $\beta$:

$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$

The mean-subtraction step projects $\mathbf{x}$ onto the orthogonal complement of the normalised all-ones vector $\mathbf{1}/\sqrt{d}$. This is an explicit orthogonal projection: $\mathbf{x} - \mu\mathbf{1} = (I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top)\mathbf{x}$, where $P = \frac{1}{d}\mathbf{1}\mathbf{1}^\top$ is the projection onto the 1-dimensional "mean direction" and $I - P$ is the projection onto its orthogonal complement.
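
A minimal sketch verifying that identity numerically; the dimension $d = 8$ and the input vector are hypothetical:

```python
# Mean-subtraction as an orthogonal projection onto span{1}^perp.
import numpy as np

d = 8
x = np.random.default_rng(0).standard_normal(d)

ones = np.ones((d, 1))
P = ones @ ones.T / d                        # projection onto the mean direction
centered = (np.eye(d) - P) @ x               # (I - P) x

assert np.allclose(centered, x - x.mean())   # identical to mean-subtraction
assert np.isclose(centered @ np.ones(d), 0)  # orthogonal to the all-ones vector
```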


10. Affine Subspaces and Quotient Spaces

10.1 Affine Subspaces

An affine subspace (also called an affine flat or coset) is a translate of a linear subspace:

$$W + \mathbf{v}_0 = \{\mathbf{w} + \mathbf{v}_0 : \mathbf{w} \in W\}$$

for a fixed offset vector $\mathbf{v}_0 \in V$ and a linear subspace $W \subseteq V$.

Geometric picture:

  • If $W$ is a line through the origin, $W + \mathbf{v}_0$ is a parallel line through $\mathbf{v}_0$
  • If $W$ is a plane through the origin, $W + \mathbf{v}_0$ is a parallel plane through $\mathbf{v}_0$
  • The affine subspace is a "shifted" copy of the linear subspace; it has the same "shape" and dimension as $W$, but it generally does not pass through the origin (unless $\mathbf{v}_0 \in W$, in which case $W + \mathbf{v}_0 = W$)

Affine subspaces are NOT vector subspaces. Unless $\mathbf{v}_0 \in W$, the affine subspace $W + \mathbf{v}_0$ does not contain $\mathbf{0}$ and is not closed under addition (the sum of two elements is offset by $2\mathbf{v}_0$, not $\mathbf{v}_0$).

Key example: solution sets of linear systems. The solution set of $A\mathbf{x} = \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is an affine subspace:

$$\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\} = \mathbf{x}_p + \text{null}(A)$$

where $\mathbf{x}_p$ is any particular solution and $\text{null}(A)$ is the null space (a linear subspace). The solution set is a translate of the null space: all solutions lie in the same affine subspace parallel to $\text{null}(A)$, offset by $\mathbf{x}_p$.

Other examples:

  • A line not through the origin in $\mathbb{R}^2$: $\{(1,0) + t(1,2) : t \in \mathbb{R}\}$ - a translate of the span of $(1,2)$
  • The probability simplex $\Delta^{n-1}$: sits inside the $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$ defined by $\sum_i p_i = 1$ - a translate of the hyperplane $\sum_i x_i = 0$
  • Batch normalisation output space: after fixing the batch statistics, the normalised output lives in an affine subspace determined by the running mean and variance

Dimension of an affine subspace. The affine subspace $W + \mathbf{v}_0$ has the same dimension as the underlying linear subspace $W$. A $k$-dimensional affine subspace in $\mathbb{R}^n$ is sometimes called a $k$-flat.

AFFINE VS LINEAR SUBSPACES

  Linear subspace W:               Affine subspace W + v_0:
  - passes through the origin      - passes through v_0 (not 0)
  - closed under + and scaling     - closed under affine combinations
  - IS a vector space              - is NOT a vector space
  - examples: lines/planes/        - examples: lines/planes/
    hyperplanes through 0            hyperplanes NOT through 0

  In AI:
  Linear: null(W), col(W), LoRA update subspace
  Affine: solution set of Ax=b, probability simplex,
          offset embeddings before centering

10.2 Affine Combinations

An affine combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is a linear combination $\sum_{i=1}^k \alpha_i \mathbf{v}_i$ where the coefficients sum to one:

$$\sum_{i=1}^k \alpha_i = 1$$

Affine combinations "stay within" affine subspaces: if $\mathbf{v}_1, \ldots, \mathbf{v}_k \in W + \mathbf{v}_0$, then any affine combination of them also lies in $W + \mathbf{v}_0$.

Convex combinations are affine combinations with the additional constraint $\alpha_i \geq 0$:

$$\sum_{i=1}^k \alpha_i \mathbf{v}_i \quad \text{with } \alpha_i \geq 0 \text{ and } \sum_i \alpha_i = 1$$

Convex combinations stay within the convex hull of $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$.

Why the constraint $\sum_i \alpha_i = 1$ matters. Without it, scaling the vectors by a common factor would scale the combination - the result would depend on "how large" the vectors are. With $\sum_i \alpha_i = 1$, the combination is "location-aware" rather than "direction-aware": it picks a point in the affine hull regardless of scale.

AI applications of affine and convex combinations:

  • Embedding interpolation. Given two embedding vectors $\mathbf{u}$ and $\mathbf{v}$, the linear interpolation $\alpha\mathbf{u} + (1-\alpha)\mathbf{v}$ for $\alpha \in [0,1]$ is a convex combination. This is the operation underlying latent space arithmetic: "find the midpoint of 'cat' and 'dog' in embedding space". Note: this is NOT generally a valid probability-weighted average of the corresponding tokens' distributions - that arithmetic must happen in logit space.

  • Spherical linear interpolation (slerp). For unit vectors (on the unit sphere), linear interpolation does not stay on the sphere. Slerp uses trigonometric weights to interpolate along the great circle: $\text{slerp}(\mathbf{u}, \mathbf{v}, t) = \frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{u} + \frac{\sin(t\theta)}{\sin\theta}\mathbf{v}$ where $\cos\theta = \langle\mathbf{u},\mathbf{v}\rangle$. The weights $\frac{\sin((1-t)\theta)}{\sin\theta}$ and $\frac{\sin(t\theta)}{\sin\theta}$ sum to at least 1, with equality only at the endpoints $t = 0, 1$ (and in the limit $\theta \to 0$, where slerp reduces to linear interpolation); slerp is the sphere's analogue of an affine combination rather than a literal one. See the sketch after this list.

  • Model averaging and interpolation. Averaging two neural networks $\theta_1$ and $\theta_2$ by $\theta = \frac{1}{2}(\theta_1 + \theta_2)$ is a convex combination in parameter space. "Model soup" (Wortsman et al. 2022) averages fine-tuned models; WiSE-FT interpolates between pretrained and fine-tuned weights. These are affine combinations in $\mathbb{R}^p$.

  • Probability distributions. A mixture distribution $p_{\text{mix}}(x) = \alpha p_1(x) + (1-\alpha) p_2(x)$ is a convex combination of two distributions. This is valid because $p_{\text{mix}}(x) \geq 0$ (a convex combination of non-negatives) and $\int p_{\text{mix}}\, dx = \alpha + (1-\alpha) = 1$ (the constraint $\sum_i \alpha_i = 1$ ensures the mixture is still a probability distribution). This is why mixture models work: affine combinations with unit-sum weights preserve the probability simplex.
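
A minimal sketch of slerp as described above; the fallback to linear interpolation for nearly parallel inputs is a standard numerical guard, and the example vectors are hypothetical:

```python
# Spherical linear interpolation between unit vectors.
import numpy as np

def slerp(u, v, t):
    """Interpolate along the great circle between unit vectors u and v."""
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
    if np.isclose(theta, 0.0):               # (nearly) parallel: fall back to lerp
        return (1 - t) * u + t * v
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
mid = slerp(u, v, 0.5)
assert np.isclose(np.linalg.norm(mid), 1.0)  # stays on the unit sphere
```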

10.3 Quotient Spaces

The quotient space $V / W$ (read "$V$ mod $W$") captures the idea of "ignoring the $W$ directions" - it identifies all vectors that differ by an element of $W$.

Definition. For a subspace $W \subseteq V$, define the equivalence relation $\mathbf{u} \sim \mathbf{v}$ iff $\mathbf{u} - \mathbf{v} \in W$. The coset of $\mathbf{v}$ is:

$$[\mathbf{v}] = \mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}$$

This is an affine subspace - a translate of $W$ passing through $\mathbf{v}$. Note that $[\mathbf{u}] = [\mathbf{v}]$ iff $\mathbf{u} - \mathbf{v} \in W$ (the cosets are identical iff the vectors are equivalent).

The quotient space $V / W = \{[\mathbf{v}] : \mathbf{v} \in V\}$ is the collection of all cosets of $W$ in $V$.

Operations on $V/W$:

  • Addition: $[\mathbf{u}] + [\mathbf{v}] = [\mathbf{u} + \mathbf{v}]$ (well-defined: the result does not depend on which representatives $\mathbf{u}, \mathbf{v}$ we choose, as long as the cosets are the same)
  • Scalar multiplication: $\alpha[\mathbf{v}] = [\alpha\mathbf{v}]$ (well-defined by the same argument)

These operations make $V/W$ a vector space (the quotient space). The zero vector of $V/W$ is $[\mathbf{0}] = W$ (the coset containing $\mathbf{0}$, which is $W$ itself).

Dimension: $\dim(V/W) = \dim(V) - \dim(W)$.

Intuition. $V/W$ "collapses" the $W$ directions to a point (the zero element $[\mathbf{0}]$) and retains only the directions complementary to $W$. The quotient space has dimension $\dim(V) - \dim(W)$ because it "forgets" $\dim(W)$ directions.

First Isomorphism Theorem. For a linear map $T: V \to U$:

$$V / \text{null}(T) \cong \text{col}(T)$$

The quotient space $V / \text{null}(T)$ is isomorphic to the image of $T$. Intuitively: the null space is exactly the "ambiguity" in $T$ - different inputs mapping to the same output; the quotient space removes this ambiguity, leaving a bijection between cosets and outputs.

AI applications:

  • Layer normalisation. The mean-subtraction in LayerNorm - subtracting the mean of all $d$ coordinates from each coordinate - is equivalent to projecting onto the orthogonal complement of the all-ones direction. The "normalised" space can be viewed as the quotient $\mathbb{R}^d / \text{span}\{\mathbf{1}\}$, where the direction $\mathbf{1} = (1,1,\ldots,1)^\top / \sqrt{d}$ is "divided out". Two activations that differ only by a constant shift of all coordinates are identified in this quotient space.

  • Residual connections. The residual connection $\mathbf{x} \leftarrow \mathbf{x} + f(\mathbf{x})$ adds a vector to the current residual stream. From the quotient space perspective: the "content" of the residual stream modulo the current layer's contribution is what gets passed to the next layer. Each layer writes to its column-space subspace; the rest of $\mathbb{R}^d$ is preserved as-is.

  • Equivalence classes in training. If $W = \text{null}(A)$ for a data matrix $A$, then all weight vectors in the same coset $\mathbf{w} + \text{null}(A)$ produce the same predictions on the training data. The space of distinct prediction functions is the quotient $\mathbb{R}^n / \text{null}(A) \cong \text{row}(A)$. Gradient descent with squared loss finds the minimum-norm representative of the coset $[\mathbf{w}^*] = \mathbf{w}^* + \text{null}(A)$ - the vector in the coset closest to the origin.

10.4 Cosets and Their Structure

The cosets $\mathbf{v} + W$ partition $V$ into disjoint, equally-shaped affine subspaces:

  • Partition: every $\mathbf{v} \in V$ belongs to exactly one coset; cosets are either identical or disjoint
  • Uniform shape: every coset is a translate of $W$, so all cosets have the same dimension ($= \dim(W)$) and the same "shape"
  • No overlap: two cosets $\mathbf{u} + W$ and $\mathbf{v} + W$ are either equal (when $\mathbf{u} - \mathbf{v} \in W$) or disjoint

Solution structure for linear systems. This is the most immediate application. For $A\mathbf{x} = \mathbf{b}$:

  • The solution set (if non-empty) is the coset $\mathbf{x}_p + \text{null}(A)$ for any particular solution $\mathbf{x}_p$
  • All solutions in this coset produce the same output: $A(\mathbf{x}_p + \mathbf{z}) = A\mathbf{x}_p + A\mathbf{z} = \mathbf{b} + \mathbf{0} = \mathbf{b}$ for any $\mathbf{z} \in \text{null}(A)$
  • The minimum-norm solution (the "pseudoinverse solution") is $\mathbf{x}^+ = A^\dagger \mathbf{b} = A^\top(AA^\top)^{-1}\mathbf{b}$ (for full row rank $A$), which lies in $\text{row}(A)$ - the unique element of the coset that is orthogonal to $\text{null}(A)$

Implicit bias as coset selection. In overparameterised linear models ($n > m$: more parameters than constraints), gradient descent with zero initialisation and MSE loss converges to the minimum-norm solution in $\text{row}(A)$ (the unique solution in $\text{row}(A) \cap (\mathbf{x}_p + \text{null}(A))$). This "implicit bias" towards minimum-norm solutions is a coset-selection phenomenon: gradient descent selects the minimum-$\ell^2$-norm representative from the coset of all equally good solutions.
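
The coset picture is directly visible in code. A minimal sketch with a hypothetical underdetermined system, using the pseudoinverse for the minimum-norm representative:

```python
# Coset structure of the solution set of an underdetermined system.
import numpy as np
from scipy.linalg import null_space

A = np.array([[1., 2., 0.],
              [0., 1., 1.]])           # 2x3, rank 2: a 1-dimensional null space
b = np.array([3., 2.])

x_min = np.linalg.pinv(A) @ b          # minimum-norm solution, lies in row(A)
N = null_space(A)                      # basis for null(A)

x_other = x_min + 5.0 * N[:, 0]        # another point in the same coset
assert np.allclose(A @ x_other, b)     # every coset member solves Ax = b
assert np.linalg.norm(x_min) <= np.linalg.norm(x_other)  # pinv picks the shortest
assert np.isclose(N[:, 0] @ x_min, 0)  # x_min ⟂ null(A), i.e. x_min ∈ row(A)
```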


11. Subspaces in Functional Analysis

11.1 Infinite-Dimensional Spaces and Closed Subspaces

In finite dimensions, every subspace is automatically closed (in the topological sense). This ceases to be true in infinite dimensions, where one must distinguish between algebraic subspaces (satisfying the three conditions of Section 3) and closed subspaces (additionally closed under limits).

Closed subspace. A subspace $W$ of a Hilbert space $H$ is closed if every Cauchy sequence in $W$ converges to a limit that remains in $W$. Equivalently, $W$ is closed if it equals its topological closure $\bar{W}$.

Why closedness matters. In infinite dimensions:

  • The projection theorem ($H = W \oplus W^\perp$) holds only for closed subspaces
  • The best approximation in $W$ is guaranteed to exist only if $W$ is closed
  • The spectral theorem and other key results require closed invariant subspaces

Examples of closed subspaces:

  • $\mathcal{P}_n \subset L^2([0,1])$: polynomials of degree $\leq n$; finite-dimensional, hence closed
  • $\text{null}(T)$ for a bounded linear operator $T$: always closed (because $T$ is continuous and $\{0\}$ is closed)
  • $L^2_{\text{even}}([-\pi,\pi]) = \{f \in L^2 : f(-x) = f(x)\}$: the even functions; closed under limits
  • The span of any finite set of vectors in a Hilbert space: always closed (a finite-dimensional subspace)

Examples of non-closed subspaces in infinite dimensions:

  • $\mathcal{P} \subset L^2([0,1])$: all polynomials (of any degree); algebraically a subspace, but not closed - the sequence $f_n(x) = \sum_{k=0}^n x^k / k!$ converges in $L^2$ to $e^x$, which is not a polynomial. The closure is $\overline{\mathcal{P}} = L^2([0,1])$ (by the Weierstrass approximation theorem, carried over to $L^2$)
  • Finitely supported sequences in $\ell^2$: sequences with only finitely many non-zero entries; algebraically a subspace, but the limit of $\mathbf{e}_1,\ \mathbf{e}_1 + (1/2)\mathbf{e}_2,\ \mathbf{e}_1 + (1/2)\mathbf{e}_2 + (1/3)\mathbf{e}_3,\ \ldots$ is $(1, 1/2, 1/3, \ldots) \in \ell^2$, which is not finitely supported

Projection in infinite dimensions. The projection theorem extends to Hilbert spaces: if $W$ is a closed subspace of a Hilbert space $H$, then $H = W \oplus W^\perp$ and every $\mathbf{v} \in H$ has a unique best approximation $\hat{\mathbf{v}} \in W$. This is the foundation of least squares in infinite dimensions and of the representer theorem in kernel methods.

11.2 Function Spaces Relevant to AI

$L^2(\mathbb{R}^d)$ - square-integrable functions on $\mathbb{R}^d$.

The space of functions $f: \mathbb{R}^d \to \mathbb{R}$ with $\int_{\mathbb{R}^d} f(\mathbf{x})^2 \, d\mathbf{x} < \infty$, equipped with the inner product $\langle f, g \rangle = \int f(\mathbf{x}) g(\mathbf{x}) \, d\mathbf{x}$. This is a separable Hilbert space. It is the natural setting for:

  • Kernel methods: the RKHS of a translation-invariant kernel is a subspace of $L^2$
  • Probability: densities of probability distributions with finite second moment live here
  • Signal processing: finite-energy signals are in $L^2(\mathbb{R})$
  • Neural networks: functions representable by a network can be analysed as elements of $L^2$

Sobolev spaces $W^{k,p}(\Omega)$.

Functions with $k$ weak derivatives all lying in $L^p(\Omega)$, equipped with norms that account for both function values and derivative values. They quantify smoothness:

  • $W^{0,2} = L^2$ (no smoothness condition beyond square integrability)
  • $W^{1,2} = H^1$ (square-integrable function with square-integrable first derivative): relevant for PDEs and PINNs
  • $W^{2,2} = H^2$ (second derivatives in $L^2$): relevant for spline regression and Gaussian processes with Matérn kernels

Sobolev spaces are used in physics-informed neural networks (PINNs): the solution to a PDE is sought in a Sobolev space, and the network is trained to minimise a loss that penalises PDE residuals in the $L^2$ norm. The constraint "solution in $H^1$" is a subspace constraint on the function space.

Reproducing Kernel Hilbert Spaces (RKHS).

An RKHS is a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ such that point evaluation is a bounded linear functional: for each $\mathbf{x} \in \mathcal{X}$, the map $f \mapsto f(\mathbf{x})$ is continuous in the $\mathcal{H}$-norm.

By the Riesz representation theorem, there exists a unique function $k(\cdot, \mathbf{x}) \in \mathcal{H}$ such that:

$$f(\mathbf{x}) = \langle f, k(\cdot, \mathbf{x}) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}$$

The function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by $k(\mathbf{x}, \mathbf{x}') = \langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{x}') \rangle_{\mathcal{H}}$ is the reproducing kernel. It is symmetric and positive semi-definite.

Representer theorem. For any regularised learning problem of the form:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^n \ell(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}_k}^2$$

the optimal $f^*$ lies in the finite-dimensional subspace $\text{span}\{k(\cdot, \mathbf{x}_1), \ldots, k(\cdot, \mathbf{x}_n)\} \subseteq \mathcal{H}_k$. This is a subspace result: despite the infinite-dimensional function space, the optimal solution lies in an $n$-dimensional subspace spanned by the kernel functions at the training points. SVMs, Gaussian process regression, and kernel ridge regression all instantiate this theorem.
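
Kernel ridge regression makes the theorem concrete: the fitted function is parameterised by exactly $n$ coefficients, one per training point. A minimal sketch with hypothetical 1-dimensional data and an RBF kernel (the bandwidth and regulariser values are arbitrary choices):

```python
# Kernel ridge regression: f*(x) = sum_i alpha_i k(x, x_i).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))            # n = 20 training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

def rbf(Xa, Xb, sigma=1.0):
    """Gaussian kernel matrix between two sets of points."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 0.1
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # n coefficients, one per point

# The optimum lives in span{k(., x_1), ..., k(., x_n)}: evaluate it anywhere
X_test = np.linspace(-3, 3, 5)[:, None]
f_test = rbf(X_test, X) @ alpha
print(f_test)
```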

Common kernels and their RKHS subspaces:

  • Linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$: RKHS = linear functions on $\mathbb{R}^d$; a finite-dimensional subspace
  • RBF / Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2 / 2\sigma^2)$: RKHS is a dense subspace of $L^2(\mathbb{R}^d)$; infinite-dimensional
  • Matérn kernel: RKHS is the Sobolev space $W^{\nu+d/2, 2}$; the smoothness parameter $\nu$ controls which Sobolev subspace
  • Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^\top\mathbf{x}')^p$: RKHS = polynomials of degree $\leq p$; $\binom{d+p}{p}$-dimensional

11.3 Neural Networks as Subspaces of Function Spaces

A neural network architecture with parameter space $\mathbb{R}^p$ defines a parametric family of functions:

$$\mathcal{F}_\theta = \{f_\theta : \theta \in \mathbb{R}^p\} \subset L^2(\mathcal{X})$$

This family is not a subspace of $L^2$. In general, $f_{\theta_1} + f_{\theta_2}$ is not equal to $f_\theta$ for any $\theta$ (the sum of two networks is not itself a network with the same architecture), and similarly for scalar multiples. The non-linearity of the architecture means the function family is a non-linear manifold embedded in $L^2$, not a subspace.

However, the tangent space at any parameter $\theta$ IS a subspace. The tangent space to the manifold $\mathcal{F}_\theta$ at the point $f_\theta$ is:

$$T_{f_\theta}\mathcal{F}_\theta = \text{span}\left\{\frac{\partial f_\theta}{\partial \theta_i} : i = 1, \ldots, p\right\} \subseteq L^2(\mathcal{X})$$

This is the span of $p$ functions - an (at most) $p$-dimensional linear subspace of $L^2$. When you take a gradient step $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$, the first-order change in the network function lies in this tangent space. First-order optimisation operates in the tangent subspace.

Neural Tangent Kernel (NTK). In the infinite-width limit (network width $\to \infty$ with appropriate scaling), the evolution of the network function during gradient flow is governed by a linear equation in function space:

$$\dot{f}_t(\mathbf{x}) = -\sum_{\mathbf{x}'} K(\mathbf{x}, \mathbf{x}') \frac{\partial \mathcal{L}}{\partial f(\mathbf{x}')}$$

where $K(\mathbf{x}, \mathbf{x}') = \langle \nabla_\theta f_\theta(\mathbf{x}), \nabla_\theta f_\theta(\mathbf{x}') \rangle_{\mathbb{R}^p}$ is the NTK. In the infinite-width limit, $K$ becomes constant during training. The trained network converges to kernel regression with the NTK as the kernel - meaning the trained function lies in the RKHS defined by the NTK, which is a subspace of $L^2$. The NTK result says: in the lazy training regime, the neural network effectively performs a linear projection onto a specific infinite-dimensional subspace of function space.

Universal approximation and density. The universal approximation theorem says that the set of functions representable by a neural network with one hidden layer and arbitrary width (or bounded width and arbitrary depth) is dense in $C([0,1]^n)$ (continuous functions on the unit hypercube) under the uniform norm. "Dense" means: for any continuous function $f$ and any $\varepsilon > 0$, there is a network function $f_\theta$ with $\|f - f_\theta\|_\infty < \varepsilon$. This is a density statement about the parametric family $\mathcal{F}$ inside the function space $C([0,1]^n)$. Depth (or width) determines which functions are reachable; nonlinearity is what allows density.

11.4 Krylov Subspaces

Krylov subspaces are the foundation of the most practical iterative linear algebra algorithms. They connect the abstract geometry of subspaces to the computational efficiency of matrix-vector products.

Definition. For a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $\mathbf{b} \in \mathbb{R}^n$, the Krylov subspace of order $k$ is:

$$\mathcal{K}_k(A, \mathbf{b}) = \text{span}\{\mathbf{b},\ A\mathbf{b},\ A^2\mathbf{b},\ \ldots,\ A^{k-1}\mathbf{b}\}$$

This is the span of the first $k$ vectors in the sequence $\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots$ - the orbit of $\mathbf{b}$ under repeated application of $A$.

Nested structure. The Krylov subspaces form a nested sequence:

$$\mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \mathcal{K}_3 \subseteq \cdots \subseteq \mathcal{K}_n$$

The sequence stabilises at some $r \leq n$: $\mathcal{K}_r = \mathcal{K}_{r+1} = \cdots$. At that point, $A \cdot \mathcal{K}_r \subseteq \mathcal{K}_r$ (the subspace is invariant under $A$), and $r$ equals the degree of the minimal polynomial of $A$ with respect to $\mathbf{b}$.

Krylov methods. Iterative solvers based on Krylov subspaces find the best approximate solution within $\mathcal{K}_k$ at step $k$, then expand the subspace:

| Method | Problem | Optimisation in $\mathcal{K}_k$ |
|---|---|---|
| Conjugate Gradients (CG) | $A\mathbf{x} = \mathbf{b}$, $A$ SPD | Minimises $\lVert\mathbf{x} - \mathbf{x}^*\rVert_A$ over $\mathcal{K}_k$ |
| GMRES | $A\mathbf{x} = \mathbf{b}$, general $A$ | Minimises $\lVert A\mathbf{x} - \mathbf{b}\rVert_2$ over $\mathcal{K}_k$ |
| Lanczos | Eigenvalues of symmetric $A$ | Finds the best rank-$k$ approximation to the spectrum |
| Arnoldi | Eigenvalues of general $A$ | Orthogonalises $\mathcal{K}_k$ via Gram-Schmidt |

Each step costs one matrix-vector product with $A$ plus $O(kn)$ work for orthogonalisation. For a sparse $A$ with $\text{nnz}$ non-zeros, each matrix-vector product costs $O(\text{nnz})$. Contrast with direct methods (Gaussian elimination): $O(n^3)$ cost regardless of sparsity.
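
A minimal sketch of the matrix-free pattern with SciPy's conjugate gradient solver; the SPD matrix is a hypothetical example, and in practice the matvec would be, e.g., a Hessian-vector product rather than an explicit matrix:

```python
# Conjugate gradients using only matrix-vector products.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50 * np.eye(50)            # symmetric positive definite
b = rng.standard_normal(50)

# CG never reads A's entries, only the map v -> Av: this is what makes it
# usable for implicit matrices (Fisher, Hessian) in large models.
A_op = LinearOperator((50, 50), matvec=lambda v: A @ v)

x, info = cg(A_op, b, maxiter=50)
assert info == 0                          # converged
assert np.linalg.norm(A @ x - b) < 1e-3 * np.linalg.norm(b)
```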

AI applications of Krylov methods:

  • Second-order optimisation. Methods like K-FAC (Kronecker-Factored Approximate Curvature) need to solve linear systems involving the Fisher information matrix $F$ to compute the natural gradient. Krylov methods can solve $F\mathbf{d} = \mathbf{g}$ (where $\mathbf{g}$ is the gradient) without explicitly forming $F$ - only matrix-vector products $F\mathbf{v}$ are needed, and these can be computed efficiently.
  • Eigenvalue computation for interpretability. Computing the top eigenvalues of the Hessian $\nabla^2 \mathcal{L}$ or the covariance matrix of gradients is done via Lanczos, which builds a Krylov subspace using Hessian-vector products. These products are available cheaply via the Pearlmutter trick (forward-over-backward automatic differentiation). The Krylov subspace approach gives the dominant eigenspace in $O(k)$ Hessian-vector products.
  • Linear attention and state space models. SSMs (S4, Mamba) compute recurrences of the form $\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t$ whose outputs lie in Krylov-like subspaces. The efficiency of convolutional SSM computation is related to the structure of the Krylov subspace generated by the state matrix $A$ and the input matrix $B$.
