Vector Spaces and Subspaces, Part 3: 12. Subspace Methods in Machine Learning to 14. Common Mistakes

12. Subspace Methods in Machine Learning

12.1 PCA as Subspace Finding

Principal Component Analysis is fundamentally a subspace problem: find the rank-$r$ subspace of $\mathbb{R}^d$ that best explains the variance in a dataset.

Setup. Given data $X = [\mathbf{x}_1 \mid \cdots \mid \mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$ with centred rows ($\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i = \mathbf{0}$), the sample covariance matrix is:

$$C = \frac{1}{n} X^\top X = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \in \mathbb{R}^{d \times d}$$

$C$ is symmetric positive semidefinite. Its eigendecomposition $C = Q \Lambda Q^\top$ (with $Q$ orthogonal, $\Lambda$ diagonal with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$) defines the principal components.

PCA as orthogonal projection. The optimal rank-$r$ subspace $W^*$ minimises the mean squared reconstruction error:

$$W^* = \arg\min_{\substack{W \subseteq \mathbb{R}^d \\ \dim(W) = r}} \frac{1}{n} \sum_{i=1}^n \|\mathbf{x}_i - \text{Proj}_W(\mathbf{x}_i)\|^2$$

Theorem (Eckart-Young for PCA). The solution is $W^* = \text{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_r\}$, the span of the top-$r$ eigenvectors of $C$ (equivalently, the top-$r$ right singular vectors of $\frac{1}{\sqrt{n}}X$).

The minimum reconstruction error equals the sum of the discarded eigenvalues: $\frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - P_{W^*}\mathbf{x}_i\|^2 = \sum_{j=r+1}^d \lambda_j$.

Equivalences. PCA can be described four equivalent ways, all pointing to the same subspace (a numerical sketch follows the list):

  1. Maximum variance: find the $r$-dimensional subspace on which the projected data has maximum variance
  2. Minimum reconstruction error: find the $r$-dimensional subspace minimising projection error (the formulation above)
  3. SVD: compute $\frac{1}{\sqrt{n}}X = U\Sigma V^\top$; the top-$r$ right singular vectors $\{v_1,\ldots,v_r\}$ span $W^*$
  4. Eigendecomposition: compute the eigenvectors of $C$; the top-$r$ eigenvectors span $W^*$
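
A minimal NumPy sketch of equivalences 3 and 4 (the synthetic data, dimensions, and noise level are illustrative assumptions): the squared singular values of $X/\sqrt{n}$ match the eigenvalues of $C$, and the rank-$r$ reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 10, 3

# Synthetic centred data with a dominant 3-dimensional subspace plus small noise.
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, d)) + 0.05 * rng.normal(size=(n, d))
X -= X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
C = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(C)           # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of X / sqrt(n); sigma_i^2 = lambda_i.
U, S, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
print(np.allclose(S**2, eigvals))              # True

# Reconstruction error of the rank-r projection = sum of discarded eigenvalues.
P = eigvecs[:, :r] @ eigvecs[:, :r].T          # orthogonal projector onto W*
err = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print(err, eigvals[r:].sum())                  # the two numbers agree
```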

AI applications of PCA as subspace finding:

  • Embedding analysis. PCA of a word embedding matrix $E \in \mathbb{R}^{|V| \times d}$ reveals the dominant semantic axes. The top principal component often corresponds to a frequency direction; later components correspond to semantic distinctions. PCA on embeddings is a way of finding the most informative subspace of the $d$-dimensional embedding space.
  • Activation subspaces. PCA on the activations of a layer across many inputs reveals the effective dimensionality of the layer's representation. If 95% of the variance is explained by the top-$k$ principal components, the layer effectively operates in a $k$-dimensional subspace of $\mathbb{R}^{d_{\text{hidden}}}$, even though it is nominally $d_{\text{hidden}}$-dimensional.
  • Weight matrix analysis. PCA on the rows or columns of a weight matrix reveals which directions the matrix emphasises. The top singular vectors of $W$ are the principal directions: the right singular vector is the row-space direction that $W$ maps to the largest-magnitude column-space direction.
  • Collapse detection. In self-supervised learning (SimCLR, BYOL, VICReg), representation collapse means all embeddings converge to a low-dimensional subspace (or a single point). Tracking the rank (effective number of significant singular values) of the representation matrix over training detects and diagnoses collapse; a sketch of this check follows the list.
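
A sketch of the collapse check described above, assuming a simple singular-value-counting notion of effective rank (the 1% threshold and the synthetic matrices are illustrative choices, not standard values):

```python
import numpy as np

def effective_rank(H: np.ndarray, threshold: float = 0.01) -> int:
    """Count singular values above `threshold` times the largest one.

    H is an (n_samples, d) matrix of representations; the 1% threshold is an
    illustrative assumption, not a standard value.
    """
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    return int(np.sum(s > threshold * s[0]))

rng = np.random.default_rng(1)
healthy = rng.normal(size=(1000, 128))                               # spreads over ~128 dims
collapsed = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 128))   # lives in a 4-dim subspace

print(effective_rank(healthy), effective_rank(collapsed))            # ~128 vs 4
```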

12.2 Subspace Tracking During Training

The gradient subspace evolves over the course of training, and its structure is a key diagnostic for understanding optimisation.

The gradient subspace. At training step $t$, the gradient $\nabla_\theta \mathcal{L}(\theta_t) \in \mathbb{R}^p$ is a single vector. Over $T$ steps, the gradients $\{\nabla_\theta \mathcal{L}(\theta_t)\}_{t=1}^T$ lie in some subspace of $\mathbb{R}^p$. The gradient subspace at step $T$ is approximately:

$$G_T = \text{span}\{\nabla_\theta \mathcal{L}(\theta_1), \nabla_\theta \mathcal{L}(\theta_2), \ldots, \nabla_\theta \mathcal{L}(\theta_T)\}$$

or, more precisely, the leading eigenspace of the gradient covariance matrix $\sum_{t=1}^T \nabla_t \nabla_t^\top$, where $\nabla_t = \nabla_\theta \mathcal{L}(\theta_t)$.

Empirical finding: the gradient subspace is small. Gur-Ari, Roberts, and Dyer (2018) showed that for large neural networks, the gradient vectors observed during training effectively lie in a subspace of dimension much smaller than $p$. For models with $p \sim 10^7$ to $10^9$ parameters, the effective gradient subspace has dimension in the thousands or tens of thousands. This means (a toy demonstration follows the list):

  • Gradient descent traverses a low-dimensional "groove" in the high-dimensional parameter space
  • The $p - k$ directions orthogonal to the gradient subspace are never updated
  • A $k$-dimensional update rule (where $k \ll p$) can achieve nearly the same performance
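
A toy demonstration of the idea (the least-squares setup and every constant in it are assumptions chosen for illustration): when the data occupy a $k$-dimensional subspace, every gradient of the quadratic loss lies in that same subspace, so the stacked gradient matrix has effective rank $k$ no matter how large $p$ is.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, k = 200, 1000, 5                       # parameters, samples, data rank

# Low-rank design matrix: the rows of X span only a k-dimensional subspace of R^p.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
y = rng.normal(size=n)

def grad(theta):
    # Gradient of 0.5 * ||X theta - y||^2 / n; it always lies in row(X).
    return X.T @ (X @ theta - y) / n

# Collect gradients along a gradient-descent trajectory.
theta = rng.normal(size=p)
grads = []
for _ in range(100):
    g = grad(theta)
    grads.append(g)
    theta -= 1e-3 * g

G = np.stack(grads)                          # 100 x p matrix of gradients
s = np.linalg.svd(G, compute_uv=False)
print(np.sum(s > 1e-8 * s[0]))               # effective rank = k = 5, not p = 200
```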

Intrinsic dimensionality of fine-tuning (Aghajanyan, Zettlemoyer, and Gupta, 2020). They parameterise the fine-tuning update as $\theta = \theta_0 + P\delta$, where $P \in \mathbb{R}^{p \times d}$ is a fixed random projection matrix and $\delta \in \mathbb{R}^d$ is optimised. The smallest $d$ for which fine-tuning achieves 90% of full-fine-tuning performance is the intrinsic dimension. For GPT-2 fine-tuned on MRPC (a sentence similarity task), the intrinsic dimension is $\approx 200$ out of $p \approx 1.5 \times 10^8$ parameters. Fine-tuning happens in a subspace of effective dimension $\approx 200$.
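
A toy sketch of the $\theta = \theta_0 + P\delta$ parameterisation on a synthetic logistic-regression problem (all dimensions, the projection scaling, and the learning rate are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p, d_int, n = 5000, 20, 400                       # full dim, subspace dim, samples

# Toy binary classification problem.
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = (X @ w_true > 0).astype(float)

theta0 = np.zeros(p)                              # "pre-trained" starting point
P = rng.normal(size=(p, d_int)) / np.sqrt(p)      # fixed random projection
delta = np.zeros(d_int)                           # the only trainable parameters

def loss_and_grad(delta):
    theta = theta0 + P @ delta                    # parameters constrained to theta0 + col(P)
    prob = 1.0 / (1.0 + np.exp(-(X @ theta)))
    loss = -np.mean(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12))
    grad_theta = X.T @ (prob - y) / n             # gradient in the full p-dim space
    return loss, P.T @ grad_theta                 # chain rule: project onto the d_int subspace

initial_loss, _ = loss_and_grad(delta)
for _ in range(200):
    _, g = loss_and_grad(delta)
    delta -= 0.5 * g
final_loss, _ = loss_and_grad(delta)

# Loss decreases even though only 20 numbers are optimised; how far it drops
# depends on how much of the task the random 20-dim subspace happens to capture.
print(initial_loss, final_loss)
```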

LoRA as gradient subspace approximation. LoRA parameterises the update as $\Delta W = BA^\top$, a matrix of rank at most $r$: its column space lies in the $r$-dimensional subspace $\text{col}(B) \subseteq \mathbb{R}^m$ and its row space in $\text{col}(A) \subseteq \mathbb{R}^n$. The justification is that the gradient $\nabla_W \mathcal{L}$ during fine-tuning lives in a low-rank subspace. By restricting updates to rank $r$, LoRA approximates this gradient subspace with $r(m+n)$ parameters instead of $mn$. The choice of $r$ balances: larger $r$ = larger subspace = more expressive but more parameters; smaller $r$ = tighter constraint but fewer parameters. The "right" $r$ is approximately the intrinsic dimension of the task.
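
A minimal PyTorch sketch of the LoRA parameterisation (the initialisation scale, the $\alpha/r$ scaling convention, and the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update W0 + B A^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                   # freeze the pre-trained weights
            p.requires_grad_(False)
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n, r) * 0.01)    # row-space factor
        self.B = nn.Parameter(torch.zeros(m, r))           # column-space factor; zero init => Delta W = 0
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the low-rank correction x A B^T (equivalent to adding B A^T to W0).
        return self.base(x) + self.scale * (x @ self.A) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # r*(m+n) = 8192 trainable params
```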

GaLore (Zhao et al. 2024). Rather than permanently fixing the subspace (as LoRA does), GaLore periodically updates the projection subspace by computing the top-$r$ singular vectors of the gradient matrix. The gradient is projected onto the current subspace before the optimiser step, and optimiser state is maintained in the subspace. Memory savings come from keeping optimiser state for an $r \times n$ projected gradient rather than the full $m \times n$ gradient.
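
A rough NumPy sketch of the projection idea (it recomputes the projector on every call and uses a plain SGD step, whereas the actual method refreshes the projector only periodically and keeps Adam state in the subspace; all sizes are illustrative):

```python
import numpy as np

def galore_like_step(W, G, r, lr=1e-3):
    """One heavily simplified GaLore-style update."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :r]                      # top-r left singular vectors: the current gradient subspace
    G_small = P.T @ G                 # (r, n) projected gradient; optimiser state would live at this size
    return W - lr * (P @ G_small), G_small.size

rng = np.random.default_rng(4)
W = rng.normal(size=(512, 2048))
G = rng.normal(size=(512, 2048))      # stand-in for a gradient matrix
W, state_size = galore_like_step(W, G, r=32)
print(G.size, state_size)             # 1,048,576 full-gradient entries vs 65,536 in the subspace
```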

12.3 Representation Subspaces

The linear representation hypothesis proposes that concepts are encoded as linear subspaces of the residual stream in large language models. This is a strong structural claim with substantial empirical support.

Concept directions. A concept direction for a binary concept $c$ (e.g., "positive sentiment", "female gender", "medical domain") is a unit vector $\hat{\mathbf{d}} \in \mathbb{R}^d$ such that $\langle \mathbf{x}, \hat{\mathbf{d}} \rangle > 0$ indicates the concept is present and $\langle \mathbf{x}, \hat{\mathbf{d}} \rangle < 0$ indicates it is absent. The concept is represented as a 1-dimensional subspace (a line through the origin) in $\mathbb{R}^d$.

Evidence: probing classifiers (linear models predicting a concept from an embedding) work well for many concepts, suggesting the concept is linearly decodable. The existence of an accurate linear probe is evidence that the concept has a direction in the embedding space - a 1D subspace.
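
A toy probe sketch (synthetic embeddings with a planted "sentiment" direction; the difference-of-means probe is the simplest possible linear probe, not the method of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 256, 2000

# Synthetic embeddings: a planted concept direction plus isotropic noise.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)                       # 1 = concept present
X = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * labels - 1, true_dir)

# Simplest linear probe: the unit-normalised difference of class means.
d_hat = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
d_hat /= np.linalg.norm(d_hat)

preds = (X @ d_hat > 0).astype(int)
print((preds == labels).mean())                           # high accuracy => linearly decodable
print(abs(d_hat @ true_dir))                              # close to 1: the probe recovers the direction
```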

Multi-dimensional concept subspaces. Some concepts are not well-represented by a single direction. "Colour" might involve multiple hues; "grammatical role" might involve multiple positions. In these cases, the concept corresponds to a subspace of dimension $> 1$. The concept subspace $C \subseteq \mathbb{R}^d$ contains all directions relevant to the concept.

LEACE (Least-squares Concept Erasure) (Belrose et al. 2023). Given two sets of representations - one with the concept present, one absent - LEACE finds the minimal subspace $W$ such that projecting onto $W^\perp$ makes the concept undetectable by any linear probe. This is explicitly a subspace computation:

  1. Compute the within-class covariance $\Sigma_W$ and the between-class difference $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0$
  2. Find the projection direction that maximally separates the classes (the Fisher discriminant direction)
  3. Project onto the orthogonal complement: $\mathbf{x} \leftarrow (I - P_{\text{concept}})\mathbf{x}$

Concept erasure = projection onto the orthogonal complement of the concept subspace.
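
A simplified erasure sketch: it removes a single known concept direction with an orthogonal projection, which is the geometric core of the idea but not the full whitened LEACE construction.

```python
import numpy as np

def erase_direction(X: np.ndarray, d_hat: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the orthogonal complement of span{d_hat}."""
    d_hat = d_hat / np.linalg.norm(d_hat)
    P_concept = np.outer(d_hat, d_hat)            # rank-1 projector onto the concept line
    return X @ (np.eye(X.shape[1]) - P_concept)   # x <- (I - P) x, applied row-wise

rng = np.random.default_rng(6)
d = 64
d_hat = rng.normal(size=d)
X = rng.normal(size=(500, d))
X_erased = erase_direction(X, d_hat)

# No component of the erased representations remains along the concept direction.
print(np.abs(X_erased @ (d_hat / np.linalg.norm(d_hat))).max())   # ~0
```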

Distributed representations. A single concept may activate many neurons, and a single neuron may respond to many concepts. This is distributed representation, and it is the normal state in large networks:

  • Distributed over neurons: each neuron contributes a little to many concepts; the concept is "smeared" across many dimensions
  • Superposition: many concepts share the same dimensions (polysemanticity); the concept subspaces are non-orthogonal

The superposition hypothesis (Elhage et al. 2022) says that in the regime where the number of features $F > d$ (more features than dimensions), the optimal strategy is to store features as unit vectors that are nearly orthogonal but not exactly orthogonal. Interference is minimised by choosing feature directions to be as close to orthogonal as possible, but it cannot be entirely eliminated when $F > d$. This is the geometry of packing more vectors into a space than the dimension allows.

12.4 Mechanistic Interpretability Through Subspaces

Mechanistic interpretability analyses the internal computations of neural networks at the level of circuits - sequences of operations that implement specific algorithmic behaviours. Subspace geometry is the natural language for this analysis.

Residual stream decomposition. At layer $\ell$, the residual stream $\mathbf{x}^\ell \in \mathbb{R}^d$ is updated by:

$$\mathbf{x}^{\ell+1} = \mathbf{x}^\ell + \text{Attn}^\ell(\mathbf{x}^\ell) + \text{MLP}^\ell(\mathbf{x}^\ell)$$

Each component (attention head $h$, MLP block $\ell$) writes to a specific subspace of $\mathbb{R}^d$ (the image of its write map) and reads from another (the row space of its read map). The residual stream is a shared highway; every component adds its contribution to the same $d$-dimensional space. With the shapes used below ($W_O^h \in \mathbb{R}^{d_k \times d}$), the subspace an attention head $h$ writes to is $\text{row}(W_O^h) \subseteq \mathbb{R}^d$; for MLP blocks it is determined by the output weight matrix.

OV and QK circuits. For attention head $h$ with projections $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{d \times d_k}$ and $W_O^h \in \mathbb{R}^{d_k \times d}$:

  • QK circuit: $W_Q^h (W_K^h)^\top \in \mathbb{R}^{d \times d}$ but has rank $\leq d_k$; it maps the residual stream to a $d_k$-dimensional subspace for query-key comparison; the head computes attention scores using only $d_k$ dimensions of the $d$-dimensional residual stream
  • OV circuit: $W_V^h W_O^h \in \mathbb{R}^{d \times d}$ but has rank $\leq d_k$; the head reads from a $d_k$-dimensional subspace (via $W_V^h$) and writes to a $d_k$-dimensional subspace (via $W_O^h$); the composition is a rank-$d_k$ map from $\mathbb{R}^d$ to $\mathbb{R}^d$

Both the QK and OV circuits are low-rank (rank $\leq d_k$) maps on the $d$-dimensional residual stream. Each head has access to $d$-dimensional representations but can only "look at" or "write to" $d_k$-dimensional subspaces. When $d_k \ll d$ (typically $d_k = d/H$ for $H$ heads), each head is a dimensionality-reducing probe and writer; the sketch below checks the rank claim numerically.
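
A quick numerical check of the rank claims, with illustrative sizes ($d = 768$, $d_k = 64$) and random projection matrices standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(7)
d, d_k = 768, 64                              # residual width and head dimension (illustrative)

W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))
W_O = rng.normal(size=(d_k, d))

QK = W_Q @ W_K.T                              # d x d, but rank <= d_k
OV = W_V @ W_O                                # d x d, but rank <= d_k
print(np.linalg.matrix_rank(QK), np.linalg.matrix_rank(OV))   # 64, 64

# The OV circuit writes only into row(W_O), a 64-dimensional subspace of R^d.
x = rng.normal(size=d)
update = x @ OV                               # the head's contribution (row-vector convention)
_, _, Vt = np.linalg.svd(W_O, full_matrices=False)            # rows of Vt span row(W_O)
proj = (update @ Vt.T) @ Vt
print(np.linalg.norm(update - proj) / np.linalg.norm(update)) # ~0: the update lies in row(W_O)
```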

Induction heads. An induction head (Olsson et al. 2022) is an attention head that implements in-context learning by attending to previous occurrences of the current token. It works via two components: a "previous token head" that shifts information one position back (writing to a specific subspace), and an "induction head" that looks for matches in that subspace. The circuit is: head A writes "what came before position $t$" to a subspace $S_A$; head B reads from $S_A$ and attends accordingly. The communication between heads A and B happens through a shared subspace of the residual stream.

Superposition and polysemanticity. Toy models of superposition (Elhage et al. 2022) show that:

  • With $F \leq d$ features: each feature can be stored in its own orthogonal direction; no interference; polysemanticity unnecessary
  • With $F > d$ features (superposition regime): features are stored as nearly orthogonal (but not exactly orthogonal) directions; each direction contains a weighted sum of multiple features; individual neurons are polysemantic (they respond to multiple distinct features)
  • The "geometry" of superposition: features are arranged as vertices of polytopes inscribed in the unit sphere; the maximum number of unit vectors in $\mathbb{R}^d$ with pairwise inner products of magnitude $\leq \epsilon$ grows like $e^{c \cdot d}$ for some constant $c > 0$ depending on $\epsilon$ - exponential in the dimension (the sketch below illustrates the packing)

12.5 Subspace Fine-Tuning Methods

The empirical low-dimensionality of fine-tuning gradients motivates a family of methods that explicitly restrict weight updates to specific subspaces. Here is a comparative overview:

LoRA (Hu et al. 2021).

Parameterise the weight update as $\Delta W = BA^\top$ with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{n \times r}$, and rank $r \ll \min(m,n)$. The update has rank at most $r$: its column space lies in the $r$-dimensional subspace $\text{col}(B) \subseteq \mathbb{R}^m$ and its row space in $\text{col}(A) \subseteq \mathbb{R}^n$. Only $r(m+n)$ parameters are needed instead of $mn$. The pre-trained weights $W_0$ are frozen; only $A$ and $B$ are trained. At inference, the update is merged: $W = W_0 + BA^\top$, adding no latency.

AdaLoRA (Zhang et al. 2023).

Parameterise $\Delta W = P \Lambda Q^\top$ where $P \in \mathbb{R}^{m \times r}$, $Q \in \mathbb{R}^{n \times r}$ are updated by gradient descent and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_r)$ is a diagonal matrix of learned singular values. Singular values that shrink toward zero can be pruned during training, giving adaptive rank allocation: some layers get high rank, others low rank, depending on the task's needs. The total parameter count is bounded, but rank is allocated where the gradient signal is strongest.

IA^3 (Liu et al. 2022).

Rather than adding a low-rank matrix, IA^3 multiplies activations by learned scaling vectors: $\mathbf{h} \leftarrow \mathbf{l} \odot \mathbf{h}$ for a learned vector $\mathbf{l}$. Equivalently, it rescales the rows of the weight matrix producing $\mathbf{h}$: $W_{\text{eff}} = \text{diag}(\mathbf{l}) \, W$ when $\mathbf{h} = W\mathbf{x}$. Each entry of $\mathbf{l}$ modulates one coordinate direction of activation space, so the whole vector is a diagonal (per-direction) rescaling rather than a low-rank additive update. Extremely parameter-efficient (on the order of $d$ parameters per layer).

Prefix Tuning (Li and Liang 2021).

Prepend $k$ trainable "prefix" token embeddings to the input sequence. At each attention layer, the prefix tokens contribute additional keys and values that the original tokens can attend to. The prefix embeddings inject task-specific information into the attention computation. The "prefix subspace" is $\text{span}\{\mathbf{p}_1, \ldots, \mathbf{p}_k\} \subseteq \mathbb{R}^d$ - the (at most) $k$-dimensional subspace of information injected at each layer.

DoRA (Liu et al. 2024).

Decompose the weight matrix as $W = \|W\|_c \cdot (W / \|W\|_c)$ (magnitude times direction, computed column-wise). Fine-tune magnitude and direction separately: the magnitude is a scalar per column (1D), and the direction is updated with LoRA (a rank-$r$ subspace). The decomposition reflects the observation that weight magnitude and weight direction play different roles in learning: magnitude controls the "strength" of a feature; direction controls "which feature". DoRA consistently outperforms LoRA at the same rank by freeing the direction update from the magnitude constraint.

DARE (Yu et al. 2024).

After standard fine-tuning, randomly prune $\delta W = W_{\text{FT}} - W_0$ with probability $p$ (set entries to zero) and rescale the remainder by $1/(1-p)$ to preserve the expectation. The effect is to project the fine-tuning update onto a sparse "subspace" (sparse in the coordinate basis). Pruned updates can be summed across multiple fine-tuned models (model merging) with reduced interference, since the non-zero coordinates are less likely to overlap; a sketch follows.
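
A minimal sketch of the drop-and-rescale operation and of why two independently pruned updates rarely collide (the sizes and the pruning probability are illustrative assumptions):

```python
import numpy as np

def dare(delta_W: np.ndarray, p: float, rng) -> np.ndarray:
    """Drop-And-REscale sketch: zero each entry with probability p, then
    rescale the survivors by 1/(1-p) so each entry is unchanged in expectation."""
    mask = rng.random(delta_W.shape) >= p              # keep with probability 1 - p
    return delta_W * mask / (1.0 - p)

rng = np.random.default_rng(9)
W0 = rng.normal(size=(256, 256))                       # stand-in for pre-trained weights
delta = 0.01 * rng.normal(size=(256, 256))             # stand-in for W_FT - W_0

sparse_delta = dare(delta, p=0.9, rng=rng)
print((sparse_delta != 0).mean())                      # ~0.1: only ~10% of coordinates survive

# Merging two independently pruned updates: their non-zero coordinates rarely overlap.
other_delta = dare(0.01 * rng.normal(size=(256, 256)), p=0.9, rng=rng)
overlap = np.mean((sparse_delta != 0) & (other_delta != 0))
print(overlap)                                         # ~0.01: little coordinate-level interference
W_merged = W0 + sparse_delta + other_delta
```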

SUBSPACE FINE-TUNING: COMPARISON


  Method         Subspace type           Parameters               Memory saving
  ------         -------------           ----------               -------------
  Full FT        all of R^{m x n}        mn                       1x (baseline)
  LoRA (rank r)  rank-r subspace         r(m+n)                   mn / r(m+n)
  AdaLoRA        adaptive rank-r         r(m+n) + r               similar to LoRA
  IA^3           per-direction scaling   m or n                   ~ n or m
  Prefix (k)     k-dim prefix            k*d per layer            varies
  DoRA           magnitude + LoRA        r(m+n) + n               similar to LoRA
  DARE           sparse random           mn (train), 0 (prune)    0 (post-hoc)

  All methods exploit the same fact: the fine-tuning update lives in a low-dimensional subspace.

13. Invariant Subspaces and Spectral Theory

13.1 Invariant Subspaces

A subspace $W \subseteq V$ is invariant under a linear map $T: V \to V$ if $T$ maps $W$ back into $W$:

$$T(W) \subseteq W \quad \text{i.e., for all } \mathbf{w} \in W:\ T(\mathbf{w}) \in W$$

Invariant subspaces are the subspaces that $T$ "respects" - it does not mix $W$ with its complement. Restricted to $W$, the map $T|_W: W \to W$ is a linear map from $W$ to itself.

Trivial invariant subspaces. Every linear map has two trivial invariant subspaces: $\{\mathbf{0}\}$ and $V$ itself. Any map $T$ sends $\mathbf{0}$ to $\mathbf{0}$ and maps $V$ into $T(V) \subseteq V$.

Eigenspaces are 1-dimensional invariant subspaces. If $T\mathbf{v} = \lambda\mathbf{v}$ (the eigenvector equation), then $\text{span}\{\mathbf{v}\}$ is invariant: for any $\mathbf{w} = \alpha\mathbf{v} \in \text{span}\{\mathbf{v}\}$:

$$T(\mathbf{w}) = T(\alpha\mathbf{v}) = \alpha T(\mathbf{v}) = \alpha\lambda\mathbf{v} = \lambda\mathbf{w} \in \text{span}\{\mathbf{v}\}$$

Conversely, any 1-dimensional invariant subspace $\text{span}\{\mathbf{v}\}$ must have $T(\mathbf{v}) = \lambda\mathbf{v}$ for some scalar $\lambda$ (since $T(\mathbf{v})$ must lie in $\text{span}\{\mathbf{v}\}$). So eigenvectors and 1-dimensional invariant subspaces are the same thing.

Generalised eigenspaces. For an eigenvalue $\lambda$, the generalised eigenspace is:

$$G(\lambda) = \text{null}((T - \lambda I)^k) \quad \text{for large enough } k$$

(Specifically, $k = n$ always works, but often a smaller $k$ suffices - the smallest $k$ with $\text{null}((T-\lambda I)^k) = \text{null}((T-\lambda I)^{k+1})$ is the index of $\lambda$.) The generalised eigenspace is invariant under $T$: $T(G(\lambda)) \subseteq G(\lambda)$.

Invariant subspace decomposition. The ideal scenario is when $V$ decomposes as a direct sum of invariant subspaces:

$$V = W_1 \oplus W_2 \oplus \cdots \oplus W_k \quad \text{with each } W_i \text{ invariant under } T$$

In this case, $T$ acts independently on each $W_i$: the restriction $T|_{W_i}$ is a linear map from $W_i$ to $W_i$, and the overall action of $T$ is the "direct sum" of these restricted maps. In a basis that respects this decomposition, the matrix of $T$ is block diagonal:

$$[T]_{\mathcal{B}} = \begin{pmatrix} [T|_{W_1}] & & \\ & \ddots & \\ & & [T|_{W_k}] \end{pmatrix}$$

Block diagonalisation is the goal of spectral theory: decompose $V$ into invariant subspaces so that $T$ is maximally "decoupled".

13.2 The Spectral Theorem via Invariant Subspaces

The real spectral theorem is the most important result about invariant subspaces. It applies to symmetric matrices (or more generally, self-adjoint operators on inner product spaces).

Theorem (Real Spectral Theorem). For a symmetric matrix $A = A^\top \in \mathbb{R}^{n \times n}$:

  1. All eigenvalues of $A$ are real
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal
  3. $\mathbb{R}^n$ decomposes as an orthogonal direct sum of eigenspaces:
     $$\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$$
  4. $A = Q \Lambda Q^\top$ where $Q$ is orthogonal (eigenvectors as columns) and $\Lambda$ is diagonal (eigenvalues)

Why eigenvalues are real. For symmetric $A$: $\langle A\mathbf{u}, \mathbf{v} \rangle = (A\mathbf{u})^\top \mathbf{v} = \mathbf{u}^\top A^\top \mathbf{v} = \mathbf{u}^\top A \mathbf{v} = \langle \mathbf{u}, A\mathbf{v} \rangle$. Such an operator is called self-adjoint. If $A\mathbf{v} = \lambda\mathbf{v}$ (working over $\mathbb{C}$):

$$\lambda \langle \mathbf{v}, \mathbf{v} \rangle = \langle \lambda\mathbf{v}, \mathbf{v} \rangle = \langle A\mathbf{v}, \mathbf{v} \rangle = \langle \mathbf{v}, A\mathbf{v} \rangle = \langle \mathbf{v}, \lambda\mathbf{v} \rangle = \bar{\lambda} \langle \mathbf{v}, \mathbf{v} \rangle$$

Since $\|\mathbf{v}\|^2 > 0$: $\lambda = \bar{\lambda}$, so $\lambda \in \mathbb{R}$.

Why eigenvectors for distinct eigenvalues are orthogonal. If $A\mathbf{u} = \lambda\mathbf{u}$ and $A\mathbf{v} = \mu\mathbf{v}$ with $\lambda \neq \mu$:

$$\lambda \langle \mathbf{u}, \mathbf{v} \rangle = \langle A\mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{u}, A\mathbf{v} \rangle = \mu \langle \mathbf{u}, \mathbf{v} \rangle$$

So $(\lambda - \mu)\langle\mathbf{u},\mathbf{v}\rangle = 0$, and since $\lambda \neq \mu$: $\langle\mathbf{u},\mathbf{v}\rangle = 0$.

Spectral decomposition. Let $P_i$ be the orthogonal projection onto the eigenspace $E(\lambda_i)$. Then:

$$A = \sum_{i=1}^k \lambda_i P_i, \qquad I = \sum_{i=1}^k P_i, \qquad P_i P_j = \delta_{ij} P_i$$

The action of $A$ decomposes into independent scalings of orthogonal subspaces: $A$ scales $E(\lambda_i)$ by $\lambda_i$ and acts on each eigenspace independently. A numerical check of this decomposition follows.
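
A short numerical check of the spectral decomposition for a random symmetric matrix (generic, so every eigenspace is 1-dimensional and each projector is rank-1):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 6
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                              # a random symmetric matrix

eigvals, Q = np.linalg.eigh(A)                 # A = Q diag(eigvals) Q^T with Q orthogonal

# Rebuild A as a sum of scaled rank-1 projectors onto the eigenspaces.
A_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
print(np.allclose(A, A_rebuilt))               # True: A = sum_i lambda_i P_i

# The projectors are mutually orthogonal and sum to the identity.
P = [np.outer(q, q) for q in Q.T]
print(np.allclose(sum(P), np.eye(n)))          # True: sum_i P_i = I
print(np.allclose(P[0] @ P[1], 0))             # True: P_i P_j = 0 for i != j
```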

AI relevance. Symmetric matrices appear constantly in deep learning:

  • Covariance matrices $C = \frac{1}{n}X^\top X$: PCA = spectral decomposition of $C$; eigenspaces = principal component subspaces; eigenvalues = variances in each direction
  • Gram matrices $G = XX^\top$ (dual of the covariance): the non-zero eigenvalues of $G$ match the non-zero eigenvalues of $X^\top X$ (i.e., of $nC$); the eigenvectors are related through $X$
  • Hessian $H = \nabla^2 \mathcal{L}(\theta)$: curvature of the loss landscape; eigenspaces = directions of different curvature; large positive eigenvalues = directions curving up sharply (need a small learning rate); near-zero eigenvalues = flat directions
  • Graph Laplacian $L = D - A$: symmetric positive semidefinite; eigenvectors = graph Fourier basis; used in spectral clustering and graph neural networks

13.3 Singular Value Decomposition as Subspace Decomposition

The SVD extends the spectral theorem to non-square (and non-symmetric) matrices. It is the complete subspace decomposition of any linear map.

Theorem (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = r$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with non-negative entries only on the main diagonal, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 = \sigma_{r+1} = \cdots$, such that:

$$A = U \Sigma V^\top$$

The $\sigma_i$ are the singular values, the $\mathbf{u}_i$ (columns of $U$) are the left singular vectors, and the $\mathbf{v}_i$ (columns of $V$) are the right singular vectors.

SVD as complete subspace decomposition:

  • Domain $\mathbb{R}^n$: decomposes as $\text{row}(A) \oplus \text{null}(A)$

    • $\text{row}(A) = \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ (right singular vectors for non-zero singular values)
    • $\text{null}(A) = \text{span}\{\mathbf{v}_{r+1}, \ldots, \mathbf{v}_n\}$ (right singular vectors for zero singular values)
  • Codomain $\mathbb{R}^m$: decomposes as $\text{col}(A) \oplus \text{null}(A^\top)$

    • $\text{col}(A) = \text{span}\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ (left singular vectors for non-zero singular values)
    • $\text{null}(A^\top) = \text{span}\{\mathbf{u}_{r+1}, \ldots, \mathbf{u}_m\}$ (left singular vectors for zero singular values)
  • The action of $A$: $A$ maps the $i$-th right singular direction to the $i$-th left singular direction, with scaling $\sigma_i$:

$$A \mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \text{for } i = 1, \ldots, r, \qquad A \mathbf{v}_i = \mathbf{0} \quad \text{for } i > r$$

Rank-$k$ approximation. The best rank-$k$ approximation to $A$ (in Frobenius or operator norm) is:

$$A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k \Sigma_k V_k^\top$$

where $U_k, V_k$ contain the first $k$ left/right singular vectors and $\Sigma_k$ the top-$k$ singular values. This is the Eckart-Young theorem: among all rank-$k$ matrices, $A_k$ is closest to $A$.

The approximation error is $\|A - A_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2$ (the discarded singular values), as the sketch below verifies.
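
A numerical check of the Eckart-Young error formula with a random matrix (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
m, n, k = 100, 80, 10

A = rng.normal(size=(m, n))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the top-k singular triplets.
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

err = np.linalg.norm(A - A_k, "fro") ** 2
print(err, np.sum(S[k:] ** 2))                 # equal: error = sum of discarded sigma_i^2
print(np.linalg.matrix_rank(A_k))              # k = 10

# Compression view: mn entries vs k(m + n + 1) numbers for the stored factors.
print(m * n, k * (m + n + 1))                  # 8000 vs 1810
```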

AI applications of SVD as subspace decomposition:

  • Weight matrix compression. A weight matrix $W \in \mathbb{R}^{m \times n}$ can be approximated as $W \approx U_k \Sigma_k V_k^\top$ using the top-$k$ singular components. This compresses $mn$ parameters to $k(m+n)$. The column space of the compressed weight uses only the top-$k$ left singular directions; the null space grows accordingly.

  • LoRA in SVD language. LoRA's $\Delta W = BA^\top$ is a rank-$r$ matrix, hence its SVD has at most $r$ non-zero singular values. The columns of $B$ span the column space of $\Delta W$; the columns of $A$ span the row space. Initialising $B = \mathbf{0}$ (as in standard LoRA) means $\Delta W = \mathbf{0}$ at the start, which is correct for starting from the pre-trained weights.

  • Activation space analysis. The SVD of the activation matrix $H \in \mathbb{R}^{n \times d}$ (rows = activations for $n$ inputs) reveals the effective dimensionality of the representation. If many singular values are near zero, the activations lie in a low-dimensional subspace of $\mathbb{R}^d$. This is a direct measure of representation collapse or dimensionality.

  • Principal angle computation. As noted in 8.5, the principal angles between two subspaces spanned by the orthonormal columns of $Q_1$ and $Q_2$ are given by the singular values of $Q_1^\top Q_2$ (each singular value is the cosine of a principal angle). SVD gives the geometry of subspace-to-subspace relationships; a sketch follows.
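
A small sketch of the principal-angle computation (the `principal_angles` helper is a hypothetical utility written for this example):

```python
import numpy as np

def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between col(A) and col(B) via the SVD of Q1^T Q2."""
    Q1, _ = np.linalg.qr(A)                    # orthonormal bases for the two subspaces
    Q2, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(12)
A = rng.normal(size=(50, 3))
print(np.degrees(principal_angles(A, A)))                          # [0, 0, 0]: a subspace vs itself
print(np.degrees(principal_angles(A, rng.normal(size=(50, 3)))))   # angles between two random 3-dim subspaces
```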

13.4 Schur Decomposition

Theorem (Schur). Every square matrix $A \in \mathbb{R}^{n \times n}$ (working over $\mathbb{C}$, or over $\mathbb{R}$ for the real Schur form) can be written as:

$$A = Q T Q^*$$

where $Q$ is unitary ($Q^* Q = I$) and $T$ is upper triangular. The diagonal entries of $T$ are the eigenvalues of $A$ (possibly complex, in some order).

Real Schur form. Over $\mathbb{R}$, $A = QTQ^\top$ where $T$ is quasi-upper triangular: block upper triangular with $1 \times 1$ blocks (real eigenvalues) and $2 \times 2$ blocks (complex conjugate eigenvalue pairs). The columns of $Q$ form an orthonormal basis.

Invariant subspace connection. The first $k$ columns of $Q$ span a $k$-dimensional invariant subspace of $A$:

$$A \, \text{span}\{Q_1, \ldots, Q_k\} \subseteq \text{span}\{Q_1, \ldots, Q_k\}$$

This follows because $AQ = QT$ and $T$ is upper triangular: $(AQ)_{j} = \sum_{i=1}^j T_{ij} Q_i$, so $AQ_j$ is a linear combination of $Q_1, \ldots, Q_j$. The Schur decomposition gives a nested sequence of invariant subspaces:

$$\{0\} \subset \text{span}\{Q_1\} \subset \text{span}\{Q_1, Q_2\} \subset \cdots \subset \mathbb{R}^n$$

Each subspace in the chain is invariant under $A$. The Schur form is the "staircase" of all invariant subspaces simultaneously; the sketch below checks this invariance numerically.
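
A numerical check using SciPy's Schur routine (the complex Schur form is used so that any leading block of columns is invariant; in the real quasi-triangular form one must avoid splitting a $2 \times 2$ block):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(13)
n, k = 8, 3
A = rng.normal(size=(n, n))                    # a generic non-symmetric matrix

T, Q = schur(A, output="complex")              # A = Q T Q*, with T upper triangular
print(np.allclose(A, Q @ T @ Q.conj().T))      # True

# The first k columns of Q span an invariant subspace: A Q_k stays inside span(Q_k).
Qk = Q[:, :k]
AQk = A @ Qk
residual = AQk - Qk @ (Qk.conj().T @ AQk)      # component of A Q_k outside span(Q_k)
print(np.linalg.norm(residual))                # ~0 (up to rounding), confirming invariance
```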

Relationship to the spectral theorem. For symmetric $A$, the Schur form has $T = \Lambda$ (diagonal): $T = Q^\top A Q$ is both symmetric and upper triangular, which forces it to be diagonal. The Schur decomposition of a symmetric matrix is exactly the spectral decomposition.

Practical relevance. The QR algorithm (the standard algorithm for computing eigenvalues) works by iteratively applying QR decompositions to drive $A$ toward its Schur form. Each QR step refines the invariant subspace structure. The Schur form is numerically stable (unitary transformations do not amplify errors) and can always be computed, unlike the eigendecomposition (which may not exist for non-diagonalisable matrices).


14. Common Mistakes

Each entry states the mistaken claim, why it is wrong, and the correct understanding.

  1. "The union of two subspaces is a subspace."
     Why it's wrong: $W_1 \cup W_2$ fails closure under addition: take $\mathbf{u} \in W_1 \setminus W_2$ and $\mathbf{v} \in W_2 \setminus W_1$; their sum $\mathbf{u} + \mathbf{v}$ need not lie in $W_1$ or $W_2$. Concrete example: $W_1 = \text{span}\{(1,0)\}$ (x-axis) and $W_2 = \text{span}\{(0,1)\}$ (y-axis) in $\mathbb{R}^2$; $(1,0) + (0,1) = (1,1) \notin W_1 \cup W_2$.
     Correct understanding: The sum $W_1 + W_2$ is a subspace; the union generally is not. Use $W_1 + W_2$ when you need the subspace generated by both.

  2. "An affine subspace is a subspace."
     Why it's wrong: The solution set $\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\}$ with $\mathbf{b} \neq \mathbf{0}$ does not contain the zero vector ($A\mathbf{0} = \mathbf{0} \neq \mathbf{b}$) and is not closed under addition ($A(\mathbf{x}+\mathbf{y}) = 2\mathbf{b} \neq \mathbf{b}$). The probability simplex $\Delta^n$ lies in an affine subspace (the hyperplane $\sum_i x_i = 1$) - not in a vector subspace.
     Correct understanding: An affine subspace = a coset of a linear subspace = a linear subspace shifted off the origin. It shares the same dimension and "shape" but is NOT a vector space. Subspaces must contain $\mathbf{0}$.

  3. "Closure under addition is sufficient for a subspace."
     Why it's wrong: Closure under addition alone (without scalar multiplication) is insufficient. The set $\mathbb{Z}^n \subset \mathbb{R}^n$ is closed under addition, but $\pi \cdot (1,0,\ldots,0) = (\pi, 0, \ldots, 0) \notin \mathbb{Z}^n$, so it fails the scalar multiplication axiom.
     Correct understanding: All three conditions are required: (1) contains $\mathbf{0}$, (2) closed under addition, (3) closed under scalar multiplication. All three are independent of each other.

  4. "$\text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\} = \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_{k+1}\}$ implies $\mathbf{v}_{k+1} = \mathbf{0}$."
     Why it's wrong: Equal spans mean $\mathbf{v}_{k+1}$ is linearly dependent on $\mathbf{v}_1,\ldots,\mathbf{v}_k$ - it can be expressed as a linear combination of the earlier vectors. But it need not be zero. For example: $\text{span}\{(1,0),(0,1)\} = \text{span}\{(1,0),(0,1),(1,1)\}$, and $(1,1)$ is not zero.
     Correct understanding: Equal spans mean $\mathbf{v}_{k+1} \in \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\}$ - redundant, not zero.

  5. "$\dim(W_1 \cap W_2) = \dim(W_1) + \dim(W_2) - \dim(V)$."
     Why it's wrong: This formula is wrong in general. The correct identity involves the sum: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$. The dimension of $W_1 \cap W_2$ can be anywhere between $\max(0, \dim(W_1)+\dim(W_2)-n)$ and $\min(\dim(W_1),\dim(W_2))$ for subspaces of $\mathbb{R}^n$.
     Correct understanding: $\dim(W_1 \cap W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 + W_2)$. You cannot determine $\dim(W_1 \cap W_2)$ from the individual dimensions alone without knowing $\dim(W_1 + W_2)$.

  6. "The complement of a subspace is a subspace."
     Why it's wrong: The set complement $V \setminus W$ is never a subspace: it does not contain $\mathbf{0}$ (since $\mathbf{0} \in W$) and is not closed under addition or scalar multiplication.
     Correct understanding: The orthogonal complement $W^\perp$ IS a subspace. Always say "orthogonal complement $W^\perp$", never just "complement". The orthogonal complement is the canonical linear-algebraic complement of $W$.

  7. "All bases for $V$ have the same vectors."
     Why it's wrong: Bases are not unique. Any set of $n$ linearly independent spanning vectors is a basis for an $n$-dimensional space. For $\mathbb{R}^2$, $\{(1,0),(0,1)\}$, $\{(1,1),(1,-1)\}$, and $\{(2,3),(1,2)\}$ are three distinct bases.
     Correct understanding: All bases for $V$ have the same number of vectors ($= \dim(V)$). The actual vectors can be any linearly independent spanning set. The choice of basis is a choice of coordinate system.

  8. "$\text{null}(AB) = \text{null}(B)$."
     Why it's wrong: $\text{null}(B) \subseteq \text{null}(AB)$ always holds (if $B\mathbf{x} = \mathbf{0}$ then $AB\mathbf{x} = \mathbf{0}$). But $\text{null}(AB)$ can be strictly larger: if $A$ maps some $B\mathbf{x} \neq \mathbf{0}$ to zero (i.e., $B\mathbf{x} \in \text{null}(A)$), then $\mathbf{x} \in \text{null}(AB)$ but $\mathbf{x} \notin \text{null}(B)$.
     Correct understanding: $\text{null}(AB) \supseteq \text{null}(B)$. Equality holds iff $\text{null}(A) \cap \text{col}(B) = \{\mathbf{0}\}$, i.e., $A$ is injective on the column space of $B$.

  9. "Two subspaces with the same dimension are equal."
     Why it's wrong: Dimension only describes the "size" of a subspace, not its orientation. The x-axis $\text{span}\{(1,0)\}$ and y-axis $\text{span}\{(0,1)\}$ in $\mathbb{R}^2$ both have dimension 1 but are completely different subspaces (they share only the origin).
     Correct understanding: Equal dimension means isomorphic as abstract vector spaces. Actual equality as subsets requires showing every vector in one is in the other - or equivalently, that each spans the other.

  10. "Gram-Schmidt always produces a basis from a spanning set."
      Why it's wrong: Gram-Schmidt fails (division by zero) when it encounters a vector already in the span of the previously constructed orthonormal vectors. The intermediate vector $\mathbf{u}_j = \mathbf{v}_j - \sum_{i<j}\langle\mathbf{v}_j,\mathbf{q}_i\rangle\mathbf{q}_i$ becomes $\mathbf{0}$, which cannot be normalised.
      Correct understanding: Apply Gram-Schmidt only to linearly independent vectors. For a spanning set, first extract an independent subset via row reduction, then apply Gram-Schmidt. A zero intermediate vector signals linear dependence - discard that vector and continue.

  11. "The pivot columns of the RREF give a basis for the column space."
      Why it's wrong: Row operations change the column space. The RREF of $A$ has different columns than $A$ itself. The pivot positions (column indices) are correct, but the actual basis vectors must be taken from the original matrix $A$, not from the RREF.
      Correct understanding: Identify pivot positions from the RREF, but extract the corresponding columns from the original $A$. The pivot columns of $A$ (not of its RREF) form a basis for $\text{col}(A)$.

  12. "$A\mathbf{x} = \mathbf{0}$ has only the trivial solution, so $A$ is invertible."
      Why it's wrong: For $A \in \mathbb{R}^{m \times n}$ with $m \neq n$, $A$ cannot be invertible regardless. A trivial null space means $A$ is injective ($\text{rank}(A) = n$), but invertibility requires $A$ to be square with $\text{rank}(A) = n = m$.
      Correct understanding: For square $A \in \mathbb{R}^{n \times n}$: $\text{null}(A) = \{\mathbf{0}\}$ iff $A$ is invertible. For rectangular $A$: $\text{null}(A) = \{\mathbf{0}\}$ means $A$ is injective (left-invertible) but not invertible.
