Vector Spaces and Subspaces, Part 3: 12. Subspace Methods in Machine Learning to 14. Common Mistakes

12. Subspace Methods in Machine Learning

12.1 PCA as Subspace Finding

Principal Component Analysis is fundamentally a subspace problem: find the rank-$r$ subspace of $\mathbb{R}^d$ that best explains the variance in a dataset.

Setup. Given data $X = [\mathbf{x}_1 \mid \cdots \mid \mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$ with centred rows ($\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i = \mathbf{0}$), the sample covariance matrix is:

$$C = \frac{1}{n} X^\top X = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}$$

$C$ is symmetric positive semidefinite. Its eigendecomposition $C = Q \Lambda Q^\top$ (with $Q$ orthogonal, $\Lambda$ diagonal with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$) defines the principal components.

PCA as orthogonal projection. The optimal rank-$r$ subspace $W^*$ minimises the mean squared reconstruction error:

$$W^* = \arg\min_{\substack{W \subseteq \mathbb{R}^d \\ \dim(W) = r}} \frac{1}{n} \sum_{i=1}^n \|\mathbf{x}_i - \text{Proj}_W(\mathbf{x}_i)\|^2$$

Theorem (Eckart-Young for PCA). The solution is $W^* = \text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_r\}$, the span of the top-$r$ eigenvectors of $C$ (equivalently, the top-$r$ right singular vectors of $\frac{1}{\sqrt{n}}X$).

The minimum reconstruction error equals the sum of the discarded eigenvalues: $\frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - P_{W^*}\mathbf{x}_i\|^2 = \sum_{j=r+1}^d \lambda_j$.

Equivalences. PCA can be described four equivalent ways, all pointing to the same subspace (a numerical sketch follows the list):

  1. Maximum variance: find the $r$-dimensional subspace on which the projected data has maximum variance
  2. Minimum reconstruction error: find the $r$-dimensional subspace minimising projection error (the formulation above)
  3. SVD: compute $\frac{1}{\sqrt{n}}X = U\Sigma V^\top$; the top-$r$ right singular vectors $\{v_1,\ldots,v_r\}$ span $W^*$
  4. Eigendecomposition: compute the eigenvectors of $C$; the top-$r$ eigenvectors span $W^*$
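
A minimal NumPy sketch of equivalences 3 and 4 (the synthetic data, dimensions, and noise level are illustrative assumptions): the squared singular values of $X/\sqrt{n}$ match the eigenvalues of $C$, and the rank-$r$ reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 10, 3

# Synthetic centred data with a dominant 3-dimensional subspace plus small noise.
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 0.05 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
C = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(C)           # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of X / sqrt(n); sigma_i^2 = lambda_i.
U, S, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
print(np.allclose(S**2, eigvals))              # True

# Reconstruction error of the rank-r projection = sum of discarded eigenvalues.
P = eigvecs[:, :r] @ eigvecs[:, :r].T          # orthogonal projector onto W*
err = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print(err, eigvals[r:].sum())                  # the two numbers agree
```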

AI applications of PCA as subspace finding:

  • Embedding analysis. PCA of a word embedding matrix $E \in \mathbb{R}^{|V| \times d}$ reveals the dominant semantic axes. The top principal component often corresponds to a frequency direction; later components correspond to semantic distinctions. PCA on embeddings is a way of finding the most informative subspace of the $d$-dimensional embedding space.
  • Activation subspaces. PCA on the activations of a layer across many inputs reveals the effective dimensionality of the layer's representation. If 95% of the variance is explained by the top-$k$ principal components, the layer effectively operates in a $k$-dimensional subspace of $\mathbb{R}^{d_{\text{hidden}}}$, even though it is nominally $d_{\text{hidden}}$-dimensional.
  • Weight matrix analysis. PCA on the rows or columns of a weight matrix reveals which directions the matrix emphasises. The top singular vectors of $W$ are the principal directions: the right singular vector is the row-space direction that $W$ maps to the largest-magnitude column-space direction.
  • Collapse detection. In self-supervised learning (SimCLR, BYOL, VICReg), representation collapse means all embeddings converge to a low-dimensional subspace (or a single point). Tracking the rank (effective number of significant singular values) of the representation matrix over training detects and diagnoses collapse; a sketch of this check follows the list.
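
A sketch of the collapse check described above, assuming a simple singular-value-counting notion of effective rank (the 1% threshold and the synthetic matrices are illustrative choices, not standard values):

```python
import numpy as np

def effective_rank(H: np.ndarray, threshold: float = 0.01) -> int:
    """Count singular values above `threshold` times the largest one.

    H is an (n_samples, d) matrix of representations; the 1% threshold is an
    illustrative assumption, not a standard value.
    """
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    return int(np.sum(s > threshold * s[0]))

rng = np.random.default_rng(1)
healthy = rng.normal(size=(1000, 128))                               # spreads over ~128 dims
collapsed = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 128))   # lives in a 4-dim subspace

print(effective_rank(healthy), effective_rank(collapsed))            # ~128 vs 4
```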

12.2 Subspace Tracking During Training

The gradient subspace evolves over the course of training, and its structure is a key diagnostic for understanding optimisation.

The gradient subspace. At training step $t$, the gradient $\nabla_\theta \mathcal{L}(\theta_t) \in \mathbb{R}^p$ is a single vector. Over $T$ steps, the gradients $\{\nabla_\theta \mathcal{L}(\theta_t)\}_{t=1}^T$ lie in some subspace of $\mathbb{R}^p$. The gradient subspace at step $T$ is approximately:

$$G_T = \text{span}\{\nabla_\theta \mathcal{L}(\theta_1), \nabla_\theta \mathcal{L}(\theta_2), \ldots, \nabla_\theta \mathcal{L}(\theta_T)\}$$

or, more precisely, the leading eigenspace of the gradient covariance matrix $\sum_{t=1}^T \nabla_t \nabla_t^\top$, where $\nabla_t = \nabla_\theta \mathcal{L}(\theta_t)$.

Empirical finding: the gradient subspace is small. Gur-Ari, Roberts, and Dyer (2018) showed that for large neural networks, the gradient vectors observed during training effectively lie in a subspace of dimension much smaller than $p$. For models with $p \sim 10^7$ to $10^9$ parameters, the effective gradient subspace has dimension in the thousands or tens of thousands. This means (a toy demonstration follows the list):

  • Gradient descent traverses a low-dimensional "groove" in the high-dimensional parameter space
  • The $p - k$ directions orthogonal to the gradient subspace are never updated
  • A $k$-dimensional update rule (where $k \ll p$) can achieve nearly the same performance
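
A toy demonstration of the idea (the least-squares setup and every constant in it are assumptions chosen for illustration): when the data occupy a $k$-dimensional subspace, every gradient of the quadratic loss lies in that same subspace, so the stacked gradient matrix has effective rank $k$ no matter how large $p$ is.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, k = 200, 1000, 5                       # parameters, samples, data rank

# Low-rank design matrix: the rows of X span only a k-dimensional subspace of R^p.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
y = rng.normal(size=n)

def grad(theta):
    # Gradient of 0.5 * ||X theta - y||^2 / n; it always lies in row(X).
    return X.T @ (X @ theta - y) / n

# Collect gradients along a gradient-descent trajectory.
theta = rng.normal(size=p)
grads = []
for _ in range(100):
    g = grad(theta)
    grads.append(g)
    theta -= 1e-3 * g

G = np.stack(grads)                          # 100 x p matrix of gradients
s = np.linalg.svd(G, compute_uv=False)
print(np.sum(s > 1e-8 * s[0]))               # effective rank = k = 5, not p = 200
```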

Intrinsic dimensionality of fine-tuning (Aghajanyan, Zettlemoyer, and Gupta, 2020). They parameterise the fine-tuning update as $\theta = \theta_0 + P\delta$, where $P \in \mathbb{R}^{p \times d}$ is a fixed random projection matrix and $\delta \in \mathbb{R}^d$ is optimised. The smallest $d$ for which fine-tuning achieves 90% of full-fine-tuning performance is the intrinsic dimension. For GPT-2 fine-tuned on MRPC (a sentence similarity task), the intrinsic dimension is $\approx 200$ out of $p \approx 1.5 \times 10^8$ parameters. Fine-tuning happens in a subspace of effective dimension $\approx 200$.
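
A toy sketch of the $\theta = \theta_0 + P\delta$ parameterisation on a synthetic logistic-regression problem (all dimensions, the projection scaling, and the learning rate are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p, d_int, n = 5000, 20, 400                       # full dim, subspace dim, samples

# Toy binary classification problem.
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = (X @ w_true > 0).astype(float)

theta0 = np.zeros(p)                              # "pre-trained" starting point
P = rng.normal(size=(p, d_int)) / np.sqrt(p)      # fixed random projection
delta = np.zeros(d_int)                           # the only trainable parameters

def loss_and_grad(delta):
    theta = theta0 + P @ delta                    # parameters constrained to theta0 + col(P)
    prob = 1.0 / (1.0 + np.exp(-(X @ theta)))
    loss = -np.mean(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12))
    grad_theta = X.T @ (prob - y) / n             # gradient in the full p-dim space
    return loss, P.T @ grad_theta                 # chain rule: project onto the d_int subspace

initial_loss, _ = loss_and_grad(delta)
for _ in range(200):
    _, g = loss_and_grad(delta)
    delta -= 0.5 * g
final_loss, _ = loss_and_grad(delta)

# Loss decreases even though only 20 numbers are optimised; how far it drops
# depends on how much of the task the random 20-dim subspace happens to capture.
print(initial_loss, final_loss)
```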

LoRA as gradient subspace approximation. LoRA parameterises the update as $\Delta W = BA^\top$, a matrix of rank at most $r$: its column space lies in the $r$-dimensional subspace $\text{col}(B) \subseteq \mathbb{R}^m$ and its row space in $\text{col}(A) \subseteq \mathbb{R}^n$. The justification is that the gradient $\nabla_W \mathcal{L}$ during fine-tuning lives in a low-rank subspace. By restricting updates to rank $r$, LoRA approximates this gradient subspace with $r(m+n)$ parameters instead of $mn$. The choice of $r$ balances: larger $r$ = larger subspace = more expressive but more parameters; smaller $r$ = tighter constraint but fewer parameters. The "right" $r$ is approximately the intrinsic dimension of the task.
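
A minimal PyTorch sketch of the LoRA parameterisation (the initialisation scale, the $\alpha/r$ scaling convention, and the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update W0 + B A^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                   # freeze the pre-trained weights
            p.requires_grad_(False)
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n, r) * 0.01)    # row-space factor
        self.B = nn.Parameter(torch.zeros(m, r))           # column-space factor; zero init => Delta W = 0
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the low-rank correction x A B^T (equivalent to adding B A^T to W0).
        return self.base(x) + self.scale * (x @ self.A) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # r*(m+n) = 8192 trainable params
```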

GaLore (Zhao et al. 2024). Rather than permanently fixing the subspace (as LoRA does), GaLore periodically updates the projection subspace by computing the top-$r$ singular vectors of the gradient matrix. The gradient is projected onto the current subspace before the optimiser step, and optimiser state is maintained in the subspace. Memory savings come from keeping optimiser state for an $r \times n$ projected gradient rather than the full $m \times n$ gradient.
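
A rough NumPy sketch of the projection idea (it recomputes the projector on every call and uses a plain SGD step, whereas the actual method refreshes the projector only periodically and keeps Adam state in the subspace; all sizes are illustrative):

```python
import numpy as np

def galore_like_step(W, G, r, lr=1e-3):
    """One heavily simplified GaLore-style update."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :r]                      # top-r left singular vectors: the current gradient subspace
    G_small = P.T @ G                 # (r, n) projected gradient; optimiser state would live at this size
    return W - lr * (P @ G_small), G_small.size

rng = np.random.default_rng(4)
W = rng.normal(size=(512, 2048))
G = rng.normal(size=(512, 2048))      # stand-in for a gradient matrix
W, state_size = galore_like_step(W, G, r=32)
print(G.size, state_size)             # 1,048,576 full-gradient entries vs 65,536 in the subspace
```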

12.3 Representation Subspaces

The linear representation hypothesis proposes that concepts are encoded as linear subspaces of the residual stream in large language models. This is a strong structural claim with substantial empirical support.

Concept directions. A concept direction for a binary concept $c$ (e.g., "positive sentiment", "female gender", "medical domain") is a unit vector $\hat{\mathbf{d}} \in \mathbb{R}^d$ such that $\langle \mathbf{x}, \hat{\mathbf{d}} \rangle > 0$ indicates the concept is present and $\langle \mathbf{x}, \hat{\mathbf{d}} \rangle < 0$ indicates it is absent. The concept is represented as a 1-dimensional subspace (a line through the origin) in $\mathbb{R}^d$.

Evidence: probing classifiers (linear models predicting a concept from an embedding) work well for many concepts, suggesting the concept is linearly decodable. The existence of an accurate linear probe is evidence that the concept has a direction in the embedding space - a 1D subspace.
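
A toy probe sketch (synthetic embeddings with a planted "sentiment" direction; the difference-of-means probe is the simplest possible linear probe, not the method of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 256, 2000

# Synthetic embeddings: a planted concept direction plus isotropic noise.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)                       # 1 = concept present
X = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * labels - 1, true_dir)

# Simplest linear probe: the unit-normalised difference of class means.
d_hat = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
d_hat /= np.linalg.norm(d_hat)

preds = (X @ d_hat > 0).astype(int)
print((preds == labels).mean())                           # high accuracy => linearly decodable
print(abs(d_hat @ true_dir))                              # close to 1: the probe recovers the direction
```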

Multi-dimensional concept subspaces. Some concepts are not well-represented by a single direction. "Colour" might involve multiple hues; "grammatical role" might involve multiple positions. In these cases, the concept corresponds to a subspace of dimension $> 1$. The concept subspace $C \subseteq \mathbb{R}^d$ contains all directions relevant to the concept.

LEACE (Least-squares Concept Erasure) (Belrose et al. 2023). Given two sets of representations - one with the concept present, one absent - LEACE finds the minimal subspace $W$ such that projecting onto $W^\perp$ makes the concept undetectable by any linear probe. This is explicitly a subspace computation:

  1. Compute the within-class covariance $\Sigma_W$ and the between-class difference $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0$
  2. Find the projection direction that maximally separates the classes (the Fisher discriminant direction)
  3. Project onto the orthogonal complement: $\mathbf{x} \leftarrow (I - P_{\text{concept}})\mathbf{x}$

Concept erasure = projection onto the orthogonal complement of the concept subspace.
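
A simplified erasure sketch: it removes a single known concept direction with an orthogonal projection, which is the geometric core of the idea but not the full whitened LEACE construction.

```python
import numpy as np

def erase_direction(X: np.ndarray, d_hat: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the orthogonal complement of span{d_hat}."""
    d_hat = d_hat / np.linalg.norm(d_hat)
    P_concept = np.outer(d_hat, d_hat)            # rank-1 projector onto the concept line
    return X @ (np.eye(X.shape[1]) - P_concept)   # x <- (I - P) x, applied row-wise

rng = np.random.default_rng(6)
d = 64
d_hat = rng.normal(size=d)
X = rng.normal(size=(500, d))
X_erased = erase_direction(X, d_hat)

# No component of the erased representations remains along the concept direction.
print(np.abs(X_erased @ (d_hat / np.linalg.norm(d_hat))).max())   # ~0
```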

Distributed representations. A single concept may activate many neurons, and a single neuron may respond to many concepts. This is distributed representation, and it is the normal state in large networks:

  • Distributed over neurons: each neuron contributes a little to many concepts; the concept is "smeared" across many dimensions
  • Superposition: many concepts share the same dimensions (polysemanticity); the concept subspaces are non-orthogonal

The superposition hypothesis (Elhage et al. 2022) says that in the regime where the number of features $F > d$ (more features than dimensions), the optimal strategy is to store features as unit vectors that are nearly orthogonal but not exactly orthogonal. Interference is minimised by choosing feature directions to be as close to orthogonal as possible, but it cannot be entirely eliminated when $F > d$. This is the geometry of packing more vectors into a space than the dimension allows.

12.4 Mechanistic Interpretability Through Subspaces

Mechanistic interpretability analyses the internal computations of neural networks at the level of circuits - sequences of operations that implement specific algorithmic behaviours. Subspace geometry is the natural language for this analysis.

Residual stream decomposition. At layer $\ell$, the residual stream $\mathbf{x}^\ell \in \mathbb{R}^d$ is updated by:

$$\mathbf{x}^{\ell+1} = \mathbf{x}^\ell + \text{Attn}^\ell(\mathbf{x}^\ell) + \text{MLP}^\ell(\mathbf{x}^\ell)$$

Each component (attention head $h$, MLP block $\ell$) writes to a specific subspace of $\mathbb{R}^d$ (the image of its write map) and reads from another (the row space of its read map). The residual stream is a shared highway; every component adds its contribution to the same $d$-dimensional space. With the shapes used below ($W_O^h \in \mathbb{R}^{d_k \times d}$), the subspace an attention head $h$ writes to is $\text{row}(W_O^h) \subseteq \mathbb{R}^d$; for MLP blocks it is determined by the output weight matrix.

OV and QK circuits. For attention head $h$ with projections $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{d \times d_k}$ and $W_O^h \in \mathbb{R}^{d_k \times d}$:

  • QK circuit: $W_Q^h (W_K^h)^\top \in \mathbb{R}^{d \times d}$ but has rank $\leq d_k$; it maps the residual stream to a $d_k$-dimensional subspace for query-key comparison; the head computes attention scores using only $d_k$ dimensions of the $d$-dimensional residual stream
  • OV circuit: $W_V^h W_O^h \in \mathbb{R}^{d \times d}$ but has rank $\leq d_k$; the head reads from a $d_k$-dimensional subspace (via $W_V^h$) and writes to a $d_k$-dimensional subspace (via $W_O^h$); the composition is a rank-$d_k$ map from $\mathbb{R}^d$ to $\mathbb{R}^d$

Both the QK and OV circuits are low-rank (rank $\leq d_k$) maps on the $d$-dimensional residual stream. Each head has access to $d$-dimensional representations but can only "look at" or "write to" $d_k$-dimensional subspaces. When $d_k \ll d$ (typically $d_k = d/H$ for $H$ heads), each head is a dimensionality-reducing probe and writer; the sketch below checks the rank claim numerically.
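
A quick numerical check of the rank claims, with illustrative sizes ($d = 768$, $d_k = 64$) and random projection matrices standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(7)
d, d_k = 768, 64                              # residual width and head dimension (illustrative)

W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))
W_O = rng.normal(size=(d_k, d))

QK = W_Q @ W_K.T                              # d x d, but rank <= d_k
OV = W_V @ W_O                                # d x d, but rank <= d_k
print(np.linalg.matrix_rank(QK), np.linalg.matrix_rank(OV))   # 64, 64

# The OV circuit writes only into row(W_O), a 64-dimensional subspace of R^d.
x = rng.normal(size=d)
update = x @ OV                               # the head's contribution (row-vector convention)
_, _, Vt = np.linalg.svd(W_O, full_matrices=False)            # rows of Vt span row(W_O)
proj = (update @ Vt.T) @ Vt
print(np.linalg.norm(update - proj) / np.linalg.norm(update)) # ~0: the update lies in row(W_O)
```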

Induction heads. An induction head (Olsson et al. 2022) is an attention head that implements in-context learning by attending to previous occurrences of the current token. It works via two components: a "previous token head" that shifts information one position back (writing to a specific subspace), and an "induction head" that looks for matches in that subspace. The circuit is: head A writes "what came before position $t$" to a subspace $S_A$; head B reads from $S_A$ and attends accordingly. The communication between heads A and B happens through a shared subspace of the residual stream.

Superposition and polysemanticity. Toy models of superposition (Elhage et al. 2022) show that:

  • With $F \leq d$ features: each feature can be stored in its own orthogonal direction; no interference; polysemanticity unnecessary
  • With $F > d$ features (superposition regime): features are stored as nearly orthogonal (but not exactly orthogonal) directions; each direction contains a weighted sum of multiple features; individual neurons are polysemantic (they respond to multiple distinct features)
  • The "geometry" of superposition: features are arranged as vertices of polytopes inscribed in the unit sphere; the maximum number of unit vectors in $\mathbb{R}^d$ with pairwise inner products of magnitude $\leq \epsilon$ grows like $e^{c \cdot d}$ for some constant $c > 0$ depending on $\epsilon$ - exponential in the dimension (the sketch below illustrates the packing)

12.5 Subspace Fine-Tuning Methods

The empirical low-dimensionality of fine-tuning gradients motivates a family of methods that explicitly restrict weight updates to specific subspaces. Here is a comparative overview:

LoRA (Hu et al. 2021).

Parameterise the weight update as $\Delta W = BA^\top$ with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{n \times r}$, and rank $r \ll \min(m,n)$. The update has rank at most $r$: its column space lies in the $r$-dimensional subspace $\text{col}(B) \subseteq \mathbb{R}^m$ and its row space in $\text{col}(A) \subseteq \mathbb{R}^n$. Only $r(m+n)$ parameters are needed instead of $mn$. The pre-trained weights $W_0$ are frozen; only $A$ and $B$ are trained. At inference, the update is merged: $W = W_0 + BA^\top$, adding no latency.

AdaLoRA (Zhang et al. 2023).

Parameterise $\Delta W = P \Lambda Q^\top$ where $P \in \mathbb{R}^{m \times r}$, $Q \in \mathbb{R}^{n \times r}$ are updated by gradient descent and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_r)$ is a diagonal matrix of learned singular values. Singular values that shrink toward zero can be pruned during training, giving adaptive rank allocation: some layers get high rank, others low rank, depending on the task's needs. The total parameter count is bounded, but rank is allocated where the gradient signal is strongest.

IA^3 (Liu et al. 2022).

Rather than adding a low-rank matrix, IA^3 multiplies activations by learned scaling vectors: $\mathbf{h} \leftarrow \mathbf{l} \odot \mathbf{h}$ for a learned vector $\mathbf{l}$. Equivalently, it rescales the rows of the weight matrix producing $\mathbf{h}$: $W_{\text{eff}} = \text{diag}(\mathbf{l}) \, W$ when $\mathbf{h} = W\mathbf{x}$. Each entry of $\mathbf{l}$ modulates one coordinate direction of activation space, so the whole vector is a diagonal (per-direction) rescaling rather than a low-rank additive update. Extremely parameter-efficient (on the order of $d$ parameters per layer).

Prefix Tuning (Li and Liang 2021).

Prepend $k$ trainable "prefix" token embeddings to the input sequence. At each attention layer, the prefix tokens contribute additional keys and values that the original tokens can attend to. The prefix embeddings inject task-specific information into the attention computation. The "prefix subspace" is $\text{span}\{\mathbf{p}_1, \ldots, \mathbf{p}_k\} \subseteq \mathbb{R}^d$ - the (at most) $k$-dimensional subspace of information injected at each layer.

DoRA (Liu et al. 2024).

Decompose the weight matrix as $W = \|W\|_c \cdot (W / \|W\|_c)$ (magnitude times direction, computed column-wise). Fine-tune magnitude and direction separately: the magnitude is a scalar per column (1D), and the direction is updated with LoRA (a rank-$r$ subspace). The decomposition reflects the observation that weight magnitude and weight direction play different roles in learning: magnitude controls the "strength" of a feature; direction controls "which feature". DoRA consistently outperforms LoRA at the same rank by freeing the direction update from the magnitude constraint.

DARE (Yu et al. 2024).

After standard fine-tuning, randomly prune $\delta W = W_{\text{FT}} - W_0$ with probability $p$ (set entries to zero) and rescale the remainder by $1/(1-p)$ to preserve the expectation. The effect is to project the fine-tuning update onto a sparse "subspace" (sparse in the coordinate basis). Pruned updates can be summed across multiple fine-tuned models (model merging) with reduced interference, since the non-zero coordinates are less likely to overlap; a sketch follows.
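
A minimal sketch of the drop-and-rescale operation and of why two independently pruned updates rarely collide (the sizes and the pruning probability are illustrative assumptions):

```python
import numpy as np

def dare(delta_W: np.ndarray, p: float, rng) -> np.ndarray:
    """Drop-And-REscale sketch: zero each entry with probability p, then
    rescale the survivors by 1/(1-p) so each entry is unchanged in expectation."""
    mask = rng.random(delta_W.shape) >= p              # keep with probability 1 - p
    return delta_W * mask / (1.0 - p)

rng = np.random.default_rng(9)
W0 = rng.normal(size=(256, 256))                       # stand-in for pre-trained weights
delta = 0.01 * rng.normal(size=(256, 256))             # stand-in for W_FT - W_0

sparse_delta = dare(delta, p=0.9, rng=rng)
print((sparse_delta != 0).mean())                      # ~0.1: only ~10% of coordinates survive

# Merging two independently pruned updates: their non-zero coordinates rarely overlap.
other_delta = dare(0.01 * rng.normal(size=(256, 256)), p=0.9, rng=rng)
overlap = np.mean((sparse_delta != 0) & (other_delta != 0))
print(overlap)                                         # ~0.01: little coordinate-level interference
W_merged = W0 + sparse_delta + other_delta
```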

SUBSPACE FINE-TUNING: COMPARISON


  Method         Subspace type           Parameters               Memory saving
  ------         -------------           ----------               -------------
  Full FT        all of R^{m x n}        mn                       1x (baseline)
  LoRA (rank r)  rank-r subspace         r(m+n)                   mn / r(m+n)
  AdaLoRA        adaptive rank-r         r(m+n) + r               similar to LoRA
  IA^3           per-direction scaling   m or n                   ~ n or m
  Prefix (k)     k-dim prefix            k*d per layer            varies
  DoRA           magnitude + LoRA        r(m+n) + n               similar to LoRA
  DARE           sparse random           mn (train), 0 (prune)    0 (post-hoc)

  All methods exploit the same fact: the fine-tuning update lives in a low-dimensional subspace.

13. Invariant Subspaces and Spectral Theory

13.1 Invariant Subspaces

A subspace $W \subseteq V$ is invariant under a linear map $T: V \to V$ if $T$ maps $W$ back into $W$:

$$T(W) \subseteq W \quad \text{i.e., for all } \mathbf{w} \in W:\ T(\mathbf{w}) \in W$$

Invariant subspaces are the subspaces that $T$ "respects" - it does not mix $W$ with its complement. Restricted to $W$, the map $T|_W: W \to W$ is a linear map from $W$ to itself.

Trivial invariant subspaces. Every linear map has two trivial invariant subspaces: $\{\mathbf{0}\}$ and $V$ itself. Any map $T$ sends $\mathbf{0}$ to $\mathbf{0}$ and maps $V$ into $T(V) \subseteq V$.

Eigenspaces are 1-dimensional invariant subspaces. If $T\mathbf{v} = \lambda\mathbf{v}$ (the eigenvector equation), then $\text{span}\{\mathbf{v}\}$ is invariant: for any $\mathbf{w} = \alpha\mathbf{v} \in \text{span}\{\mathbf{v}\}$:

$$T(\mathbf{w}) = T(\alpha\mathbf{v}) = \alpha T(\mathbf{v}) = \alpha\lambda\mathbf{v} = \lambda\mathbf{w} \in \text{span}\{\mathbf{v}\}$$

Conversely, any 1-dimensional invariant subspace $\text{span}\{\mathbf{v}\}$ must have $T(\mathbf{v}) = \lambda\mathbf{v}$ for some scalar $\lambda$ (since $T(\mathbf{v})$ must lie in $\text{span}\{\mathbf{v}\}$). So eigenvectors and 1-dimensional invariant subspaces are the same thing.

Generalised eigenspaces. For an eigenvalue $\lambda$, the generalised eigenspace is:

$$G(\lambda) = \text{null}((T - \lambda I)^k) \quad \text{for large enough } k$$

(Specifically, $k = n$ always works, but often a smaller $k$ suffices - the smallest $k$ with $\text{null}((T-\lambda I)^k) = \text{null}((T-\lambda I)^{k+1})$ is the index of $\lambda$.) The generalised eigenspace is invariant under $T$: $T(G(\lambda)) \subseteq G(\lambda)$.

Invariant subspace decomposition. The ideal scenario is when $V$ decomposes as a direct sum of invariant subspaces:

$$V = W_1 \oplus W_2 \oplus \cdots \oplus W_k \quad \text{with each } W_i \text{ invariant under } T$$

In this case, $T$ acts independently on each $W_i$: the restriction $T|_{W_i}$ is a linear map from $W_i$ to $W_i$, and the overall action of $T$ is the "direct sum" of these restricted maps. In a basis that respects this decomposition, the matrix of $T$ is block diagonal:

$$[T]_{\mathcal{B}} = \begin{pmatrix} [T|_{W_1}] & & \\ & \ddots & \\ & & [T|_{W_k}] \end{pmatrix}$$

Block diagonalisation is the goal of spectral theory: decompose $V$ into invariant subspaces so that $T$ is maximally "decoupled".

13.2 The Spectral Theorem via Invariant Subspaces

The real spectral theorem is the most important result about invariant subspaces. It applies to symmetric matrices (or more generally, self-adjoint operators on inner product spaces).

Theorem (Real Spectral Theorem). For a symmetric matrix $A = A^\top \in \mathbb{R}^{n \times n}$:

  1. All eigenvalues of $A$ are real
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal
  3. $\mathbb{R}^n$ decomposes as an orthogonal direct sum of eigenspaces:
     $$\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$$
  4. $A = Q \Lambda Q^\top$ where $Q$ is orthogonal (eigenvectors as columns) and $\Lambda$ is diagonal (eigenvalues)

Why eigenvalues are real. For symmetric $A$: $\langle A\mathbf{u}, \mathbf{v} \rangle = (A\mathbf{u})^\top \mathbf{v} = \mathbf{u}^\top A^\top \mathbf{v} = \mathbf{u}^\top A \mathbf{v} = \langle \mathbf{u}, A\mathbf{v} \rangle$. Such an operator is called self-adjoint. If $A\mathbf{v} = \lambda\mathbf{v}$ (working over $\mathbb{C}$):

$$\lambda \langle \mathbf{v}, \mathbf{v} \rangle = \langle \lambda\mathbf{v}, \mathbf{v} \rangle = \langle A\mathbf{v}, \mathbf{v} \rangle = \langle \mathbf{v}, A\mathbf{v} \rangle = \langle \mathbf{v}, \lambda\mathbf{v} \rangle = \bar{\lambda} \langle \mathbf{v}, \mathbf{v} \rangle$$

Since $\|\mathbf{v}\|^2 > 0$: $\lambda = \bar{\lambda}$, so $\lambda \in \mathbb{R}$.

Why eigenvectors for distinct eigenvalues are orthogonal. If $A\mathbf{u} = \lambda\mathbf{u}$ and $A\mathbf{v} = \mu\mathbf{v}$ with $\lambda \neq \mu$:

$$\lambda \langle \mathbf{u}, \mathbf{v} \rangle = \langle A\mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{u}, A\mathbf{v} \rangle = \mu \langle \mathbf{u}, \mathbf{v} \rangle$$

So $(\lambda - \mu)\langle\mathbf{u},\mathbf{v}\rangle = 0$, and since $\lambda \neq \mu$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$.

Spectral decomposition. Let $P_i$ be the orthogonal projection onto the eigenspace $E(\lambda_i)$. Then:

$$A = \sum_{i=1}^k \lambda_i P_i, \qquad I = \sum_{i=1}^k P_i, \qquad P_i P_j = \delta_{ij} P_i$$

The action of $A$ decomposes into independent scalings of orthogonal subspaces: $A$ scales $E(\lambda_i)$ by $\lambda_i$ and acts on each eigenspace independently. A numerical check of this decomposition follows.
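
A short numerical check of the spectral decomposition for a random symmetric matrix (generic, so every eigenspace is 1-dimensional and each projector is rank-1):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 6
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                              # a random symmetric matrix

eigvals, Q = np.linalg.eigh(A)                 # A = Q diag(eigvals) Q^T with Q orthogonal

# Rebuild A as a sum of scaled rank-1 projectors onto the eigenspaces.
A_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
print(np.allclose(A, A_rebuilt))               # True: A = sum_i lambda_i P_i

# The projectors are mutually orthogonal and sum to the identity.
P = [np.outer(q, q) for q in Q.T]
print(np.allclose(sum(P), np.eye(n)))          # True: sum_i P_i = I
print(np.allclose(P[0] @ P[1], 0))             # True: P_i P_j = 0 for i != j
```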

AI relevance. Symmetric matrices appear constantly in deep learning:

  • Covariance matrices $C = \frac{1}{n}X^\top X$: PCA = spectral decomposition of $C$; eigenspaces = principal component subspaces; eigenvalues = variances in each direction
  • Gram matrices $G = XX^\top$ (dual of the covariance): the non-zero eigenvalues of $G$ match the non-zero eigenvalues of $X^\top X$ (i.e., of $nC$); the eigenvectors are related through $X$
  • Hessian $H = \nabla^2 \mathcal{L}(\theta)$: curvature of the loss landscape; eigenspaces = directions of different curvature; large positive eigenvalues = directions curving up sharply (need a small learning rate); near-zero eigenvalues = flat directions
  • Graph Laplacian $L = D - A$: symmetric positive semidefinite; eigenvectors = graph Fourier basis; used in spectral clustering and graph neural networks

13.3 Singular Value Decomposition as Subspace Decomposition

The SVD extends the spectral theorem to non-square (and non-symmetric) matrices. It is the complete subspace decomposition of any linear map.

Theorem (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = r$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with non-negative entries only on the main diagonal, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 = \sigma_{r+1} = \cdots$, such that:

$$A = U \Sigma V^\top$$

The $\sigma_i$ are the singular values, the $\mathbf{u}_i$ (columns of $U$) are the left singular vectors, and the $\mathbf{v}_i$ (columns of $V$) are the right singular vectors.

SVD as complete subspace decomposition:

  • Domain $\mathbb{R}^n$: decomposes as $\text{row}(A) \oplus \text{null}(A)$

    • $\text{row}(A) = \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ (right singular vectors for non-zero singular values)
    • $\text{null}(A) = \text{span}\{\mathbf{v}_{r+1}, \ldots, \mathbf{v}_n\}$ (right singular vectors for zero singular values)
  • Codomain $\mathbb{R}^m$: decomposes as $\text{col}(A) \oplus \text{null}(A^\top)$

    • $\text{col}(A) = \text{span}\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ (left singular vectors for non-zero singular values)
    • $\text{null}(A^\top) = \text{span}\{\mathbf{u}_{r+1}, \ldots, \mathbf{u}_m\}$ (left singular vectors for zero singular values)
  • The action of $A$: $A$ maps the $i$-th right singular direction to the $i$-th left singular direction, with scaling $\sigma_i$:

$$A \mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \text{for } i = 1, \ldots, r, \qquad A \mathbf{v}_i = \mathbf{0} \quad \text{for } i > r$$

Rank-$k$ approximation. The best rank-$k$ approximation to $A$ (in Frobenius or operator norm) is:

$$A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k \Sigma_k V_k^\top$$

where $U_k, V_k$ contain the first $k$ left/right singular vectors and $\Sigma_k$ the top-$k$ singular values. This is the Eckart-Young theorem: among all rank-$k$ matrices, $A_k$ is closest to $A$.

The approximation error is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$ (the discarded singular values), as the sketch below verifies.
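
A numerical check of the Eckart-Young error formula with a random matrix (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
m, n, k = 100, 80, 10

A = rng.normal(size=(m, n))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the top-k singular triplets.
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

err = np.linalg.norm(A - A_k, "fro") ** 2
print(err, np.sum(S[k:] ** 2))                 # equal: error = sum of discarded sigma_i^2
print(np.linalg.matrix_rank(A_k))              # k = 10

# Compression view: mn entries vs k(m + n + 1) numbers for the stored factors.
print(m * n, k * (m + n + 1))                  # 8000 vs 1810
```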

AI applications of SVD as subspace decomposition:

  • Weight matrix compression. A weight matrix $W \in \mathbb{R}^{m \times n}$ can be approximated as $W \approx U_k \Sigma_k V_k^\top$ using the top-$k$ singular components. This compresses $mn$ parameters to $k(m+n)$. The column space of the compressed weight uses only the top-$k$ left singular directions; the null space grows accordingly.

  • LoRA in SVD language. LoRA's $\Delta W = BA^\top$ is a rank-$r$ matrix, hence its SVD has at most $r$ non-zero singular values. The columns of $B$ span the column space of $\Delta W$; the columns of $A$ span the row space. Initialising $B = \mathbf{0}$ (as in standard LoRA) means $\Delta W = \mathbf{0}$ at the start, which is correct for starting from the pre-trained weights.

  • Activation space analysis. The SVD of the activation matrix $H \in \mathbb{R}^{n \times d}$ (rows = activations for $n$ inputs) reveals the effective dimensionality of the representation. If many singular values are near zero, the activations lie in a low-dimensional subspace of $\mathbb{R}^d$. This is a direct measure of representation collapse or dimensionality.

  • Principal angle computation. As noted in 8.5, the principal angles between two subspaces spanned by the orthonormal columns of $Q_1$ and $Q_2$ are given by the singular values of $Q_1^\top Q_2$ (each singular value is the cosine of a principal angle). SVD gives the geometry of subspace-to-subspace relationships; a sketch follows.
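
A small sketch of the principal-angle computation (the `principal_angles` helper is a hypothetical utility written for this example):

```python
import numpy as np

def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between col(A) and col(B) via the SVD of Q1^T Q2."""
    Q1, _ = np.linalg.qr(A)                    # orthonormal bases for the two subspaces
    Q2, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(12)
A = rng.normal(size=(50, 3))
print(np.degrees(principal_angles(A, A)))                          # [0, 0, 0]: a subspace vs itself
print(np.degrees(principal_angles(A, rng.normal(size=(50, 3)))))   # angles between two random 3-dim subspaces
```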

13.4 Schur Decomposition

Theorem (Schur). Every square matrix $A \in \mathbb{R}^{n \times n}$ (working over $\mathbb{C}$, or over $\mathbb{R}$ for the real Schur form) can be written as:

$$A = Q T Q^*$$

where $Q$ is unitary ($Q^* Q = I$) and $T$ is upper triangular. The diagonal entries of $T$ are the eigenvalues of $A$ (possibly complex, in some order).

Real Schur form. Over $\mathbb{R}$, $A = QTQ^\top$ where $T$ is quasi-upper triangular: block upper triangular with $1 \times 1$ blocks (real eigenvalues) and $2 \times 2$ blocks (complex conjugate eigenvalue pairs). The columns of $Q$ form an orthonormal basis.

Invariant subspace connection. The first $k$ columns of $Q$ span a $k$-dimensional invariant subspace of $A$:

$$A \, \text{span}\{Q_1, \ldots, Q_k\} \subseteq \text{span}\{Q_1, \ldots, Q_k\}$$

This follows because $AQ = QT$ and $T$ is upper triangular: $(AQ)_{j} = \sum_{i=1}^j T_{ij} Q_i$, so $AQ_j$ is a linear combination of $Q_1, \ldots, Q_j$. The Schur decomposition gives a nested sequence of invariant subspaces:

$$\{0\} \subset \text{span}\{Q_1\} \subset \text{span}\{Q_1, Q_2\} \subset \cdots \subset \mathbb{R}^n$$

Each subspace in the chain is invariant under $A$. The Schur form is the "staircase" of all invariant subspaces simultaneously; the sketch below checks this invariance numerically.
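
A numerical check using SciPy's Schur routine (the complex Schur form is used so that any leading block of columns is invariant; in the real quasi-triangular form one must avoid splitting a $2 \times 2$ block):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(13)
n, k = 8, 3
A = rng.normal(size=(n, n))                    # a generic non-symmetric matrix

T, Q = schur(A, output="complex")              # A = Q T Q*, with T upper triangular
print(np.allclose(A, Q @ T @ Q.conj().T))      # True

# The first k columns of Q span an invariant subspace: A Q_k stays inside span(Q_k).
Qk = Q[:, :k]
AQk = A @ Qk
residual = AQk - Qk @ (Qk.conj().T @ AQk)      # component of A Q_k outside span(Q_k)
print(np.linalg.norm(residual))                # ~0 (up to rounding), confirming invariance
```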

Relationship to the spectral theorem. For symmetric $A$, the Schur form has $T = \Lambda$ (diagonal): $T = Q^\top A Q$ is both symmetric and upper triangular, which forces it to be diagonal. The Schur decomposition of a symmetric matrix is exactly the spectral decomposition.

Practical relevance. The QR algorithm (the standard algorithm for computing eigenvalues) works by iteratively applying QR decompositions to drive $A$ toward its Schur form. Each QR step refines the invariant subspace structure. The Schur form is numerically stable (unitary transformations do not amplify errors) and can always be computed, unlike the eigendecomposition (which may not exist for non-diagonalisable matrices).


14. Common Mistakes

Each entry states the mistaken claim, why it is wrong, and the correct understanding.

  1. "The union of two subspaces is a subspace."
     Why it's wrong: $W_1 \cup W_2$ fails closure under addition: take $\mathbf{u} \in W_1 \setminus W_2$ and $\mathbf{v} \in W_2 \setminus W_1$; their sum $\mathbf{u} + \mathbf{v}$ need not lie in $W_1$ or $W_2$. Concrete example: $W_1 = \text{span}\{(1,0)\}$ (x-axis) and $W_2 = \text{span}\{(0,1)\}$ (y-axis) in $\mathbb{R}^2$; $(1,0) + (0,1) = (1,1) \notin W_1 \cup W_2$.
     Correct understanding: The sum $W_1 + W_2$ is a subspace; the union generally is not. Use $W_1 + W_2$ when you need the subspace generated by both.

  2. "An affine subspace is a subspace."
     Why it's wrong: The solution set $\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\}$ with $\mathbf{b} \neq \mathbf{0}$ does not contain the zero vector ($A\mathbf{0} = \mathbf{0} \neq \mathbf{b}$) and is not closed under addition ($A(\mathbf{x}+\mathbf{y}) = 2\mathbf{b} \neq \mathbf{b}$). The probability simplex $\Delta^n$ lies in an affine subspace (the hyperplane $\sum_i x_i = 1$) - not in a vector subspace.
     Correct understanding: An affine subspace = a coset of a linear subspace = a linear subspace shifted off the origin. It shares the same dimension and "shape" but is NOT a vector space. Subspaces must contain $\mathbf{0}$.

  3. "Closure under addition is sufficient for a subspace."
     Why it's wrong: Closure under addition alone (without scalar multiplication) is insufficient. The set $\mathbb{Z}^n \subset \mathbb{R}^n$ is closed under addition, but $\pi \cdot (1,0,\ldots,0) = (\pi, 0, \ldots, 0) \notin \mathbb{Z}^n$, so it fails the scalar multiplication axiom.
     Correct understanding: All three conditions are required: (1) contains $\mathbf{0}$, (2) closed under addition, (3) closed under scalar multiplication. All three are independent of each other.

  4. "$\text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\} = \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_{k+1}\}$ implies $\mathbf{v}_{k+1} = \mathbf{0}$."
     Why it's wrong: Equal spans mean $\mathbf{v}_{k+1}$ is linearly dependent on $\mathbf{v}_1,\ldots,\mathbf{v}_k$ - it can be expressed as a linear combination of the earlier vectors. But it need not be zero. For example: $\text{span}\{(1,0),(0,1)\} = \text{span}\{(1,0),(0,1),(1,1)\}$, and $(1,1)$ is not zero.
     Correct understanding: Equal spans mean $\mathbf{v}_{k+1} \in \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\}$ - redundant, not zero.

  5. "$\dim(W_1 \cap W_2) = \dim(W_1) + \dim(W_2) - \dim(V)$."
     Why it's wrong: This formula is wrong in general. The correct identity involves the sum: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$. The dimension of $W_1 \cap W_2$ can be anywhere between $\max(0, \dim(W_1)+\dim(W_2)-n)$ and $\min(\dim(W_1),\dim(W_2))$ for subspaces of $\mathbb{R}^n$.
     Correct understanding: $\dim(W_1 \cap W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 + W_2)$. You cannot determine $\dim(W_1 \cap W_2)$ from the individual dimensions alone without knowing $\dim(W_1 + W_2)$.

  6. "The complement of a subspace is a subspace."
     Why it's wrong: The set complement $V \setminus W$ is never a subspace: it does not contain $\mathbf{0}$ (since $\mathbf{0} \in W$) and is not closed under addition or scalar multiplication.
     Correct understanding: The orthogonal complement $W^\perp$ IS a subspace. Always say "orthogonal complement $W^\perp$", never just "complement". The orthogonal complement is the canonical linear-algebraic complement of $W$.

  7. "All bases for $V$ have the same vectors."
     Why it's wrong: Bases are not unique. Any set of $n$ linearly independent spanning vectors is a basis for an $n$-dimensional space. For $\mathbb{R}^2$, $\{(1,0),(0,1)\}$, $\{(1,1),(1,-1)\}$, and $\{(2,3),(1,2)\}$ are three distinct bases.
     Correct understanding: All bases for $V$ have the same number of vectors ($= \dim(V)$). The actual vectors can be any linearly independent spanning set. The choice of basis is a choice of coordinate system.

  8. "$\text{null}(AB) = \text{null}(B)$."
     Why it's wrong: $\text{null}(B) \subseteq \text{null}(AB)$ always holds (if $B\mathbf{x} = \mathbf{0}$ then $AB\mathbf{x} = \mathbf{0}$). But $\text{null}(AB)$ can be strictly larger: if $A$ maps some $B\mathbf{x} \neq \mathbf{0}$ to zero (i.e., $B\mathbf{x} \in \text{null}(A)$), then $\mathbf{x} \in \text{null}(AB)$ but $\mathbf{x} \notin \text{null}(B)$.
     Correct understanding: $\text{null}(AB) \supseteq \text{null}(B)$. Equality holds iff $\text{null}(A) \cap \text{col}(B) = \{\mathbf{0}\}$, i.e., $A$ is injective on the column space of $B$.

  9. "Two subspaces with the same dimension are equal."
     Why it's wrong: Dimension only describes the "size" of a subspace, not its orientation. The x-axis $\text{span}\{(1,0)\}$ and y-axis $\text{span}\{(0,1)\}$ in $\mathbb{R}^2$ both have dimension 1 but are completely different subspaces (they share only the origin).
     Correct understanding: Equal dimension means isomorphic as abstract vector spaces. Actual equality as subsets requires showing every vector in one is in the other - or equivalently, that each spans the other.

  10. "Gram-Schmidt always produces a basis from a spanning set."
      Why it's wrong: Gram-Schmidt fails (division by zero) when it encounters a vector already in the span of the previously constructed orthonormal vectors. The intermediate vector $\mathbf{u}_j = \mathbf{v}_j - \sum_{i<j}\langle\mathbf{v}_j,\mathbf{q}_i\rangle\mathbf{q}_i$ becomes $\mathbf{0}$, which cannot be normalised.
      Correct understanding: Apply Gram-Schmidt only to linearly independent vectors. For a spanning set, first extract an independent subset via row reduction, then apply Gram-Schmidt. A zero intermediate vector signals linear dependence - discard that vector and continue.

  11. "The pivot columns of the RREF give a basis for the column space."
      Why it's wrong: Row operations change the column space. The RREF of $A$ has different columns than $A$ itself. The pivot positions (column indices) are correct, but the actual basis vectors must be taken from the original matrix $A$, not from the RREF.
      Correct understanding: Identify pivot positions from the RREF, but extract the corresponding columns from the original $A$. The pivot columns of $A$ (not of its RREF) form a basis for $\text{col}(A)$.

  12. "$A\mathbf{x} = \mathbf{0}$ has only the trivial solution, so $A$ is invertible."
      Why it's wrong: For $A \in \mathbb{R}^{m \times n}$ with $m \neq n$, $A$ cannot be invertible regardless. A trivial null space means $A$ is injective ($\text{rank}(A) = n$), but invertibility requires $A$ to be square with $\text{rank}(A) = n = m$.
      Correct understanding: For square $A \in \mathbb{R}^{n \times n}$: $\text{null}(A) = \{\mathbf{0}\}$ iff $A$ is invertible. For rectangular $A$: $\text{null}(A) = \{\mathbf{0}\}$ means $A$ is injective (left-invertible) but not invertible.
