Vector Spaces and Subspaces, Part 12: Subspace Methods in Machine Learning through Part 14: Common Mistakes
12. Subspace Methods in Machine Learning
12.1 PCA as Subspace Finding
Principal Component Analysis is fundamentally a subspace problem: find the $k$-dimensional subspace of $\mathbb{R}^d$ that best explains the variance in a dataset.
Setup. Given data $X \in \mathbb{R}^{n \times d}$ with centred rows ($\frac{1}{n}\sum_i x_i = 0$), the sample covariance matrix is:
$$C = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d}.$$
$C$ is symmetric positive semidefinite. Its eigendecomposition $C = V \Lambda V^\top$ (with $V$ orthogonal, $\Lambda$ diagonal with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$) defines the principal components.
PCA as orthogonal projection. The optimal $k$-dimensional subspace $S^\star$ minimises the mean squared reconstruction error:
$$S^\star = \arg\min_{\dim S = k} \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - P_S x_i \rVert^2,$$
where $P_S$ is the orthogonal projection onto $S$.
Theorem (Eckart-Young for PCA). The solution is $S^\star = \operatorname{span}(v_1, \dots, v_k)$, the span of the top-$k$ eigenvectors of $C$ (equivalently, the top-$k$ right singular vectors of $X$).
The minimum reconstruction error equals the sum of the discarded eigenvalues: $\sum_{j=k+1}^{d} \lambda_j$.
Equivalences. PCA can be described in four equivalent ways, all pointing to the same subspace (a numerical check follows the list):
- Maximum variance: find the $k$-dimensional subspace on which the projected data has maximum variance
- Minimum reconstruction error: find the $k$-dimensional subspace minimising projection error (the formulation above)
- SVD: compute $X = U \Sigma V^\top$; the top-$k$ right singular vectors $v_1, \dots, v_k$ span $S^\star$
- Eigendecomposition: compute the eigenvectors of $C = \frac{1}{n} X^\top X$; the top-$k$ eigenvectors span $S^\star$
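A minimal numerical check of the SVD/eigendecomposition equivalence and of the reconstruction-error formula (a sketch in NumPy; the data matrix and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 3

X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated synthetic data
X = X - X.mean(axis=0)                                  # centre the rows

# Route 1: eigendecomposition of the covariance C = X^T X / n
C = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
V_eig = eigvecs[:, ::-1][:, :k]               # top-k eigenvectors

# Route 2: SVD of X; top-k right singular vectors
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt[:k].T

# The two k-dimensional subspaces coincide (compare projectors, not signed vectors)
print(np.allclose(V_eig @ V_eig.T, V_svd @ V_svd.T))   # True

# Mean squared reconstruction error equals the sum of the discarded eigenvalues
P = V_svd @ V_svd.T
err = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print(np.isclose(err, eigvals[::-1][k:].sum()))        # True
```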
AI applications of PCA as subspace finding:
- Embedding analysis. PCA of a word embedding matrix reveals the dominant semantic axes. The top principal component often corresponds to a frequency direction; later components correspond to semantic distinctions. PCA on embeddings is a way of finding the most informative subspace of the $d$-dimensional embedding space.
- Activation subspaces. PCA on the activations of a layer across many inputs reveals the effective dimensionality of the layer's representation. If 95% of the variance is explained by the top-$k$ principal components, the layer effectively operates in a $k$-dimensional subspace of $\mathbb{R}^d$, even though it is nominally $d$-dimensional.
- Weight matrix analysis. PCA on the rows or columns of a weight matrix reveals which directions the matrix emphasises. The top singular vectors of $W$ are the principal directions: $v_1$ is the row-space direction that $W$ maps to the largest column-space direction $\sigma_1 u_1$.
- Collapse detection. In self-supervised learning (SimCLR, BYOL, VICReg), representation collapse means all embeddings converge to a low-dimensional subspace (or a single point). Tracking the effective rank (the number of significant singular values) of the representation matrix over training detects and diagnoses collapse; a sketch follows this list.
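A sketch of collapse detection via the singular value spectrum of a representation matrix; the function name and the 95% variance threshold are illustrative choices, not a fixed standard:

```python
import numpy as np

def effective_dim(Z, var_threshold=0.95):
    """Number of principal directions needed to explain `var_threshold`
    of the variance of the (centred) representation matrix Z (n x d)."""
    Zc = Z - Z.mean(axis=0)
    s = np.linalg.svd(Zc, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(1000, 128))                               # spread out
collapsed = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 128))   # rank ~4

print(effective_dim(healthy))    # large: uses most of the 128 available directions
print(effective_dim(collapsed))  # ~4: embeddings live in a tiny subspace
```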
12.2 Subspace Tracking During Training
The gradient subspace evolves over the course of training, and its structure is a key diagnostic for understanding optimisation.
The gradient subspace. At training step $t$, the gradient $g_t = \nabla_\theta L(\theta_t) \in \mathbb{R}^N$ (where $N$ is the number of parameters) is a single vector. Over $T$ steps, the gradients lie in some subspace of $\mathbb{R}^N$. The gradient subspace at step $T$ is approximately
$$\operatorname{span}(g_1, g_2, \dots, g_T),$$
or more precisely, the leading eigenspace of the gradient covariance matrix $\Sigma_g = \frac{1}{T}\sum_{t=1}^{T} g_t g_t^\top$.
Empirical finding: the gradient subspace is small. Gur-Ari, Bar-On, and Shashua (2018) showed that for large neural networks, the gradient vectors observed during training effectively lie in a subspace of dimension much smaller than $N$. For large models, the effective gradient subspace has dimension in the thousands or tens of thousands, orders of magnitude below $N$. This means:
- Gradient descent traverses a low-dimensional "groove" in the high-dimensional parameter space
- The directions orthogonal to the gradient subspace are never updated
- A $k$-dimensional update rule (where $k \ll N$) can achieve nearly the same performance
Intrinsic dimensionality of fine-tuning (Aghajanyan, Zettlemoyer, Gupta, 2020). They parameterise the fine-tuning update as $\theta = \theta_0 + Pz$, where $P \in \mathbb{R}^{N \times d}$ is a fixed random projection matrix and $z \in \mathbb{R}^d$ is optimised. The smallest $d$ for which fine-tuning achieves 90% of full-fine-tuning performance is the intrinsic dimension. For GPT-2 fine-tuned on MRPC (a sentence similarity task), the intrinsic dimension is orders of magnitude smaller than the total parameter count: fine-tuning happens in a subspace of tiny effective dimension.
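A toy sketch of the random-subspace parameterisation $\theta = \theta_0 + Pz$, applied here to a linear least-squares problem rather than a language model; every name and size is illustrative, and the target is deliberately constructed to be reachable inside the subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_int = 10_000, 50                  # full parameter count vs. subspace dimension

theta0 = rng.normal(size=N)            # "pre-trained" parameters (frozen)
P = rng.normal(size=(N, d_int)) / np.sqrt(N)   # fixed random projection
z = np.zeros(d_int)                    # only these d_int numbers are trained

X = rng.normal(size=(200, N))
y = X @ (theta0 + P @ rng.normal(size=d_int))  # target reachable within the subspace

lr = 0.5
for _ in range(300):
    theta = theta0 + P @ z             # parameters always stay in theta0 + col(P)
    grad_theta = X.T @ (X @ theta - y) / len(y)
    z -= lr * (P.T @ grad_theta)       # chain rule: gradient with respect to z

print(np.mean((X @ (theta0 + P @ z) - y) ** 2))   # loss driven close to zero
```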
LoRA as gradient subspace approximation. LoRA restricts the update $\Delta W$ to rank at most $r$, a low-dimensional slice of $\mathbb{R}^{m \times n}$. The justification is that the gradient during fine-tuning lives in a low-rank subspace. By restricting updates to rank $r$, LoRA approximates this gradient subspace with $r(m+n)$ parameters instead of $mn$. The choice of $r$ balances: larger $r$ = larger subspace = more expressive but more parameters; smaller $r$ = tighter constraint but fewer parameters. The "right" $r$ is approximately the intrinsic dimension of the task.
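A minimal sketch of the LoRA parameterisation in NumPy. The $\alpha/r$ scaling and the zero-initialisation of $B$ follow the LoRA paper; the layer sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, alpha = 768, 768, 8, 16

W0 = rng.normal(size=(m, n)) * 0.02      # frozen pre-trained weight
A = rng.normal(size=(r, n)) * 0.01       # trained, initialised small
B = np.zeros((m, r))                     # trained, initialised to zero -> dW = 0

def lora_forward(x):
    """y = W0 x + (alpha / r) * B A x: the update is confined to rank <= r."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=n)
print(np.allclose(lora_forward(x), W0 @ x))        # True at init, since B = 0

# After training, the update is merged once, adding no inference latency
W_merged = W0 + (alpha / r) * (B @ A)
print(np.linalg.matrix_rank(W_merged - W0) <= r)   # True: rank-limited update
```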
GaLore (Zhao et al. 2024). Rather than permanently fixing the subspace (as LoRA does), GaLore periodically updates the projection subspace by computing the top-$r$ singular vectors of the gradient matrix $G \in \mathbb{R}^{m \times n}$. The gradient is projected onto the current subspace before the optimiser step; optimiser state is maintained in the subspace. Memory savings come from keeping optimiser state in $\mathbb{R}^{r \times n}$ rather than $\mathbb{R}^{m \times n}$.
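A sketch of the GaLore-style projection step: recompute the top-$r$ left singular vectors of the gradient matrix, keep optimiser state in the projected $r \times n$ space, and lift back for the weight update. This is a simplification of Zhao et al. (2024); the names and sizes are illustrative:

```python
import numpy as np

def update_projector(G, r):
    """Top-r left singular vectors of the gradient matrix G (m x n)."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]                            # m x r

rng = np.random.default_rng(0)
m, n, r = 256, 512, 8
G = rng.normal(size=(m, 4)) @ rng.normal(size=(4, n))   # a nearly low-rank gradient

P = update_projector(G, r)        # in GaLore this is refreshed every few hundred steps
G_proj = P.T @ G                  # r x n: optimiser state (momentum etc.) lives here
G_back = P @ G_proj               # lifted back to m x n for the actual weight update

print(G_proj.shape)                                    # (8, 512) instead of (256, 512)
print(np.linalg.norm(G - G_back) / np.linalg.norm(G))  # ~0: little gradient is lost
```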
12.3 Representation Subspaces
The linear representation hypothesis proposes that concepts are encoded as linear subspaces of the residual stream in large language models. This is a strong structural claim with substantial empirical support.
Concept directions. A concept direction for a binary concept (e.g., "positive sentiment", "female gender", "medical domain") is a unit vector $v \in \mathbb{R}^d$ such that a large value of $\langle x, v \rangle$ indicates the concept is present and a small (or negative) value indicates it is absent. The concept is represented as a 1-dimensional subspace (a line through the origin) in $\mathbb{R}^d$.
Evidence: probing classifiers (linear models predicting a concept from an embedding) work well for many concepts, suggesting the concept is linearly decodable. The existence of an accurate linear probe is evidence that the concept has a direction in the embedding space - a 1D subspace.
Multi-dimensional concept subspaces. Some concepts are not well-represented by a single direction. "Colour" might involve multiple hues; "grammatical role" might involve multiple positions. In these cases, the concept corresponds to a subspace of dimension $k > 1$. The concept subspace contains all directions relevant to the concept.
LEACE (Least-squares Concept Erasure) (Belrose et al. 2023). Given two sets of representations - one with the concept present, one absent - LEACE finds the minimal subspace $S$ such that projecting onto $S^\perp$ makes the concept undetectable by any linear probe. This is explicitly a subspace computation:
- Compute the within-class covariance $\Sigma_W$ and the between-class mean difference $\delta = \mu_+ - \mu_-$
- Find the projection direction $w \propto \Sigma_W^{-1} \delta$ that maximally separates the classes (the Fisher discriminant direction)
- Project onto the orthogonal complement: $x \mapsto \left(I - \frac{w w^\top}{\lVert w \rVert^2}\right) x$
Concept erasure = projection onto the orthogonal complement of the concept subspace.
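A sketch of erasing a single concept direction by projecting onto its orthogonal complement. This is the simple mean-difference variant on synthetic data, not the full LEACE estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Synthetic embeddings: the positive class is shifted along the concept direction
X_pos = rng.normal(size=(500, d)) + 2.0 * concept
X_neg = rng.normal(size=(500, d))

# Estimate the concept direction from the class means
w = X_pos.mean(axis=0) - X_neg.mean(axis=0)
w /= np.linalg.norm(w)

# Erasure: project every embedding onto the orthogonal complement of w
P_perp = np.eye(d) - np.outer(w, w)
X_pos_erased, X_neg_erased = X_pos @ P_perp, X_neg @ P_perp

print((X_pos @ w).mean() - (X_neg @ w).mean())                # ~2: separated before
print((X_pos_erased @ w).mean() - (X_neg_erased @ w).mean())  # ~0: erased
```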
Distributed representations. A single concept may activate many neurons, and a single neuron may respond to many concepts. This is distributed representation, and it is the normal state in large networks:
- Distributed over neurons: each neuron contributes a little to many concepts; the concept is "smeared" across many dimensions
- Superposition: many concepts share the same dimensions (polysemanticity); the concept subspaces are non-orthogonal
The superposition hypothesis (Elhage et al. 2022) says that in the regime where the number of features exceeds the dimension ($f > d$: more features than dimensions), the optimal strategy is to store features as nearly-orthogonal (but not exactly orthogonal) unit vectors. Interference is minimised by choosing feature directions to be as orthogonal as possible, but it cannot be entirely eliminated when $f > d$. This is the geometry of packing more vectors into a space than its dimension allows.
12.4 Mechanistic Interpretability Through Subspaces
Mechanistic interpretability analyses the internal computations of neural networks at the level of circuits - sequences of operations that implement specific algorithmic behaviours. Subspace geometry is the natural language for this analysis.
Residual stream decomposition. At layer $\ell$, the residual stream $x_\ell \in \mathbb{R}^{d_{\text{model}}}$ is updated by:
$$x_{\ell+1} = x_\ell + \sum_{h} \operatorname{head}_{\ell,h}(x_\ell) + \operatorname{MLP}_\ell(x_\ell)$$
Each component (attention head $h$, MLP block $\ell$) writes to a specific subspace of $\mathbb{R}^{d_{\text{model}}}$ (its column space) and reads from another (its row space). The residual stream is a shared highway; every component adds its contribution to the same $d_{\text{model}}$-dimensional space. The subspace written to by a component is the image of its output projection ($W_O^h$ for attention heads, or the output weight matrix for MLP blocks).
OV and QK circuits. For attention head $h$ with projections $W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ and $W_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$:
- QK circuit: $W_Q W_K^\top \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, but it has rank at most $d_{\text{head}}$; it maps the residual stream to a $d_{\text{head}}$-dimensional subspace for query-key comparison; the head computes attention scores using only $d_{\text{head}}$ dimensions of the $d_{\text{model}}$-dimensional residual stream
- OV circuit: $W_V W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, but it has rank at most $d_{\text{head}}$; the head reads from a $d_{\text{head}}$-dimensional subspace (via $W_V$) and writes to a $d_{\text{head}}$-dimensional subspace (via $W_O$); the composition is a rank-$d_{\text{head}}$ map from $\mathbb{R}^{d_{\text{model}}}$ to $\mathbb{R}^{d_{\text{model}}}$
Both the QK and OV circuits are low-rank (rank $\le d_{\text{head}}$) operations on $\mathbb{R}^{d_{\text{model}}}$, even though the residual stream is $d_{\text{model}}$-dimensional. Each head has access to $d_{\text{model}}$-dimensional representations but can only "look at" or "write to" $d_{\text{head}}$-dimensional subspaces. When $d_{\text{head}} \ll d_{\text{model}}$ (typical: $d_{\text{head}} = d_{\text{model}}/H$ for $H$ heads), each head is a dimensionality-reducing probe and writer.
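A quick numerical check that the QK and OV circuits are rank-limited by the head dimension (shapes follow the text; values are random):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 512, 64

W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

QK = W_Q @ W_K.T    # d_model x d_model, but rank <= d_head
OV = W_V @ W_O      # d_model x d_model, but rank <= d_head

print(np.linalg.matrix_rank(QK))   # 64
print(np.linalg.matrix_rank(OV))   # 64
```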
Induction heads. An induction head (Olsson et al. 2022) is an attention head that implements in-context learning by attending to the token that followed previous occurrences of the current token. It works via two components: a "previous token head" that shifts information one position back (writing to a specific subspace), and an "induction head" that looks for matches in that subspace. The circuit is: head A writes "what came before position $t$" to a subspace $S$; head B reads from $S$ and attends accordingly. The communication between heads A and B happens through a shared subspace of the residual stream.
Superposition and polysemanticity. Toy models of superposition (Elhage et al. 2022) show that:
- With $f \le d$ features: each feature can be stored in its own orthogonal direction; no interference; polysemanticity unnecessary
- With $f > d$ features (superposition regime): features are stored as nearly-orthogonal (but not exactly orthogonal) directions; each direction carries a weighted sum of multiple features; individual neurons are polysemantic (they respond to multiple distinct features)
- The "geometry" of superposition: features are arranged as vertices of polytopes inscribed in the unit sphere; the maximum number of nearly-orthogonal unit vectors in $\mathbb{R}^d$ (pairwise inner products of magnitude at most $\epsilon$) grows roughly like $e^{cd}$ for some constant $c$ depending on $\epsilon$ - exponential in the dimension. A small numerical illustration follows this list.
12.5 Subspace Fine-Tuning Methods
The empirical low-dimensionality of fine-tuning gradients motivates a family of methods that explicitly restrict weight updates to specific subspaces. Here is a comparative overview:
LoRA (Hu et al. 2021).
Parameterise the weight update as $\Delta W = BA$ with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, rank $r \ll \min(m, n)$. The update lives in the set of matrices of rank at most $r$ in $\mathbb{R}^{m \times n}$. Only $r(m+n)$ parameters are needed instead of $mn$. The pre-trained weights $W_0$ are frozen; only $A$ and $B$ are trained. At inference, the update is merged: $W = W_0 + BA$, adding no latency.
AdaLoRA (Zhang et al. 2023).
Parameterise $\Delta W = P \Lambda Q$, where $P$ and $Q$ are updated by gradient descent and $\Lambda$ is a diagonal matrix of learned singular values. Singular values that shrink toward zero can be pruned during training, giving adaptive rank allocation: some layers get high rank, others get low rank, depending on the task's needs. The total parameter count is bounded, but rank is allocated where the gradient signal is strongest.
IA^3 (Liu et al. 2022).
Rather than adding a low-rank matrix, IA^3 multiplies activations by learned scaling vectors: $h \mapsto \ell \odot h$ for a learned vector $\ell$. Equivalently, this rescales the columns of the downstream weight matrix: $W \mapsto W \operatorname{diag}(\ell)$. Each entry of the learned vector modulates one coordinate direction of activation space - a per-direction multiplicative update. Extremely parameter-efficient ($O(d)$ parameters per layer).
Prefix Tuning (Li and Liang 2021).
Prepend $k$ trainable "prefix" token embeddings to the input sequence. At each attention layer, the prefix tokens contribute additional keys and values that the original tokens can attend to. The prefix embeddings inject task-specific information into the attention computation. The "prefix subspace" is the span of the prefix key and value vectors - the (at most) $k$-dimensional subspace of information injected at each layer.
DoRA (Liu et al. 2024).
Decompose the weight matrix as $W = m \odot \frac{V}{\lVert V \rVert_c}$ (magnitude times direction, computed column-wise, with $\lVert V \rVert_c$ the per-column norm). Fine-tune magnitude and direction separately: the magnitude is a scalar per column (1D), the direction uses LoRA (a rank-$r$ update). The decomposition reflects the observation that weight magnitude and weight direction play different roles in learning: magnitude controls the "strength" of a feature; direction controls "which feature". DoRA consistently outperforms LoRA at the same rank by freeing the direction update from the magnitude constraint.
DARE (Yu et al. 2024).
After standard fine-tuning, randomly prune the delta parameters $\Delta\theta = \theta_{\text{ft}} - \theta_{\text{pre}}$ with probability $p$ (set them to zero) and rescale the remainder by $\frac{1}{1-p}$ to maintain the expectation. The effect is to project the fine-tuning update onto a sparse "subspace" (sparse in the coordinate basis). Pruned updates can be summed across multiple fine-tuned models (model merging) with reduced interference, since the non-zero coordinates are less likely to overlap.
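A sketch of the DARE drop-and-rescale operation on a fine-tuning delta; the drop probability and the sizes are arbitrary:

```python
import numpy as np

def dare(delta, p, rng):
    """Zero each entry of the fine-tuning delta with probability p, then
    rescale the survivors by 1/(1-p) so the expected update is unchanged."""
    mask = rng.random(delta.shape) >= p
    return delta * mask / (1.0 - p)

rng = np.random.default_rng(0)
delta = rng.normal(size=(200, 200)) * 0.01          # theta_ft - theta_pre
sparse_delta = dare(delta, p=0.9, rng=rng)

print(np.mean(sparse_delta == 0))                   # ~0.9: most coordinates dropped

# Unbiasedness: averaging many independent DARE draws recovers the original delta
avg = np.mean([dare(delta[:5, :5], 0.9, rng) for _ in range(20000)], axis=0)
print(np.allclose(avg, delta[:5, :5], atol=0.01))   # True
```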
Subspace fine-tuning: comparison

| Method | Subspace type | Parameters | Memory saving |
|---|---|---|---|
| Full FT | all of $\mathbb{R}^{m \times n}$ | $mn$ | 1x (baseline) |
| LoRA | rank-$r$ subspace | $r(m+n)$ | $mn / r(m+n)$ |
| AdaLoRA | adaptive rank-$r$ | $r(m+n) + r$ | similar to LoRA |
| IA^3 | column scaling | $m$ or $n$ | $mn/d \approx n$ |
| Prefix | $k$-dim prefix | $k \cdot d$ per layer | varies |
| DoRA | magnitude + LoRA | $r(m+n) + n$ | similar to LoRA |
| DARE | sparse random | $mn$ (train), 0 (prune) | 0 (post-hoc) |

All methods exploit the same fact: the fine-tuning gradient lives in a low-dimensional subspace.
13. Invariant Subspaces and Spectral Theory
13.1 Invariant Subspaces
A subspace $U \subseteq V$ is invariant under a linear map $T : V \to V$ if $T$ maps $U$ back into $U$:
$$T(U) \subseteq U, \quad \text{i.e., } u \in U \implies Tu \in U.$$
Invariant subspaces are the subspaces that $T$ "respects" - it does not mix $U$ with its complement. Restricted to $U$, the map $T|_U$ is a linear map from $U$ to itself.
Trivial invariant subspaces. Every linear map has two trivial invariant subspaces: $\{0\}$ and $V$ itself. Any map sends $0$ to $0$ and maps $V$ into $V$.
Eigenvector spans are 1-dimensional invariant subspaces. If $Tv = \lambda v$ with $v \ne 0$ (the eigenvector equation), then $U = \operatorname{span}(v)$ is invariant: for any $cv \in U$:
$$T(cv) = c\,Tv = c\lambda v \in U.$$
Conversely, any 1-dimensional invariant subspace $\operatorname{span}(v)$ must have $Tv = \lambda v$ for some scalar $\lambda$ (since $Tv$ must lie in $\operatorname{span}(v)$). So eigenvectors and 1-dimensional invariant subspaces are the same thing.
Generalised eigenspaces. For an eigenvalue $\lambda$, the generalised eigenspace is:
$$G_\lambda = \{ v \in V : (T - \lambda I)^k v = 0 \text{ for some } k \ge 1 \}.$$
(Specifically, $k = \dim V$ always works, but often a smaller $k$ suffices - the smallest $k$ with $\ker(T - \lambda I)^k = \ker(T - \lambda I)^{k+1}$ is the index of $\lambda$.) The generalised eigenspace is invariant under $T$: $T(G_\lambda) \subseteq G_\lambda$.
Invariant subspace decomposition. The ideal scenario is when $V$ decomposes as a direct sum of invariant subspaces:
$$V = U_1 \oplus U_2 \oplus \cdots \oplus U_m.$$
In this case, $T$ acts independently on each $U_i$: the restriction $T|_{U_i}$ is a linear map from $U_i$ to $U_i$, and the overall action of $T$ is the "direct sum" of these restricted maps. In a basis that respects this decomposition, the matrix of $T$ is block diagonal:
$$[T] = \begin{pmatrix} T_1 & & \\ & T_2 & \\ & & \ddots \end{pmatrix}$$
Block diagonalisation is the goal of spectral theory: decompose $V$ into invariant subspaces so that $T$ is maximally "decoupled".
13.2 The Spectral Theorem via Invariant Subspaces
The real spectral theorem is the most important result about invariant subspaces. It applies to symmetric matrices (or more generally, self-adjoint operators on inner product spaces).
Theorem (Real Spectral Theorem). For a symmetric matrix $A \in \mathbb{R}^{n \times n}$:
- All eigenvalues of $A$ are real
- Eigenvectors corresponding to distinct eigenvalues are orthogonal
- $\mathbb{R}^n$ decomposes as an orthogonal direct sum of eigenspaces: $\mathbb{R}^n = E_{\lambda_1} \oplus \cdots \oplus E_{\lambda_k}$
- $A = Q \Lambda Q^\top$, where $Q$ is orthogonal (eigenvectors as columns) and $\Lambda$ is diagonal (eigenvalues)
Why eigenvalues are real. For symmetric $A$: $\langle Ax, y \rangle = \langle x, Ay \rangle$ for all $x, y$. Such an operator is called self-adjoint. If $Av = \lambda v$ with $v \ne 0$ (over $\mathbb{C}$):
$$\lambda \langle v, v \rangle = \langle Av, v \rangle = \langle v, Av \rangle = \bar{\lambda} \langle v, v \rangle.$$
Since $\langle v, v \rangle = \lVert v \rVert^2 \ne 0$: $\lambda = \bar{\lambda}$, so $\lambda$ is real.
Why eigenvectors for distinct eigenvalues are orthogonal. If $Av = \lambda v$ and $Aw = \mu w$ with $\lambda \ne \mu$:
$$\lambda \langle v, w \rangle = \langle Av, w \rangle = \langle v, Aw \rangle = \mu \langle v, w \rangle.$$
So $(\lambda - \mu)\langle v, w \rangle = 0$, and since $\lambda \ne \mu$: $\langle v, w \rangle = 0$.
Spectral decomposition. Let $P_i$ be the orthogonal projection onto the eigenspace $E_{\lambda_i}$. Then:
$$A = \sum_i \lambda_i P_i, \qquad \sum_i P_i = I, \qquad P_i P_j = 0 \text{ for } i \ne j.$$
The action of $A$ decomposes into independent scalings of orthogonal subspaces: $A$ scales $E_{\lambda_i}$ by $\lambda_i$ and acts on each eigenspace independently.
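A numerical check of the spectral decomposition $A = \sum_i \lambda_i P_i$ for a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = (M + M.T) / 2                        # symmetric

eigvals, Q = np.linalg.eigh(A)           # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(6)))   # True: eigenvectors are orthonormal

# Rebuild A as a sum of eigenvalues times rank-1 orthogonal projections
A_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
print(np.allclose(A, A_rebuilt))         # True: A = sum_i lambda_i P_i
```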
AI relevance. Symmetric matrices appear constantly in deep learning:
- Covariance matrices $C = \frac{1}{n} X^\top X$: PCA = spectral decomposition of $C$; eigenspaces = principal component subspaces; eigenvalues = variances in each direction
- Gram matrices $X X^\top$ (dual of the covariance): the non-zero eigenvalues of $X X^\top$ equal the non-zero eigenvalues of $X^\top X$; the eigenvectors are related by multiplication by $X$ (if $X^\top X v = \lambda v$, then $X X^\top (Xv) = \lambda (Xv)$)
- Hessian $H = \nabla^2 L(\theta)$: curvature of the loss landscape; eigenspaces = directions of different curvature; positive eigenvalues = directions curving up (need small learning rate); near-zero eigenvalues = flat directions (optimisation easy in these directions)
- Laplacian of a graph, $L = D - A$: symmetric positive semidefinite; eigenvectors = graph Fourier basis; used in spectral clustering and graph neural networks
13.3 Singular Value Decomposition as Subspace Decomposition
The SVD extends the spectral theorem to non-square (and non-symmetric) matrices. It is the complete subspace decomposition of any linear map.
Theorem (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$ with rank $r$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with non-negative entries only on the main diagonal such that:
$$A = U \Sigma V^\top$$
The $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ are the singular values, $u_1, \dots, u_m$ (columns of $U$) are the left singular vectors, and $v_1, \dots, v_n$ (columns of $V$) are the right singular vectors.
SVD as complete subspace decomposition:
- Domain: $\mathbb{R}^n$ decomposes as $\mathbb{R}^n = \operatorname{row}(A) \oplus N(A)$
  - $\operatorname{row}(A) = \operatorname{span}(v_1, \dots, v_r)$ (right singular vectors for non-zero singular values)
  - $N(A) = \operatorname{span}(v_{r+1}, \dots, v_n)$ (right singular vectors for zero singular values)
- Codomain: $\mathbb{R}^m$ decomposes as $\mathbb{R}^m = \operatorname{col}(A) \oplus N(A^\top)$
  - $\operatorname{col}(A) = \operatorname{span}(u_1, \dots, u_r)$ (left singular vectors for non-zero singular values)
  - $N(A^\top) = \operatorname{span}(u_{r+1}, \dots, u_m)$ (left singular vectors for zero singular values)
- The action of $A$: it maps the $i$-th right singular direction to the $i$-th left singular direction, with scaling $\sigma_i$:
$$A v_i = \sigma_i u_i, \quad i = 1, \dots, r$$
Rank-$k$ approximation. The best rank-$k$ approximation to $A$ (in Frobenius or operator norm) is:
$$A_k = U_k \Sigma_k V_k^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top,$$
where $U_k$, $V_k$ contain the first $k$ left/right singular vectors and $\Sigma_k$ the top-$k$ singular values. This is the Eckart-Young theorem: among all rank-$k$ matrices, $A_k$ is closest to $A$.
The approximation error: $\lVert A - A_k \rVert_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$ (the discarded singular values).
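A numerical check of the Eckart-Young statement: the truncated SVD attains the error formula, and a random rank-$k$ candidate does no better:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 20))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]             # best rank-k approximation

err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True: discarded singular values

# Any other rank-k matrix does at least as badly (one random candidate for comparison)
B = rng.normal(size=(30, k)) @ rng.normal(size=(k, 20))
print(np.linalg.norm(A - B, "fro") >= err)           # True
```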
AI applications of SVD as subspace decomposition:
- Weight matrix compression. A weight matrix $W \in \mathbb{R}^{m \times n}$ can be approximated as $W_k = U_k \Sigma_k V_k^\top$ using the top-$k$ singular components. This compresses $mn$ parameters to roughly $k(m+n)$. The column space of the compressed weight uses only the top-$k$ left singular directions; the null space grows accordingly.
- LoRA in SVD language. LoRA's $\Delta W = BA$ is a rank-$r$ matrix, hence its SVD has at most $r$ non-zero singular values. The columns of $B$ span the column space of $\Delta W$; the rows of $A$ span the row space. Initialising $B = 0$ (as in standard LoRA) means $\Delta W = 0$ at the start, which is correct for starting from the pre-trained weights.
- Activation space analysis. The SVD of the activation matrix $H \in \mathbb{R}^{n \times d}$ (rows = activations for $n$ inputs) reveals the effective dimensionality of the representation. If many singular values are near zero, the activations lie in a low-dimensional subspace of $\mathbb{R}^d$. This is a direct measure of representation collapse or dimensionality.
- Principal angle computation. As noted in 8.5, the principal angles between two subspaces spanned by orthonormal bases $Q_1$ and $Q_2$ are given by the singular values of $Q_1^\top Q_2$ ($\cos\theta_i = \sigma_i$). SVD gives the geometry of subspace-to-subspace relationships; a sketch follows this list.
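A sketch of computing principal angles between two random subspaces via the SVD of $Q_1^\top Q_2$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k1, k2 = 50, 5, 7

# Orthonormal bases for two random subspaces of R^d
Q1, _ = np.linalg.qr(rng.normal(size=(d, k1)))
Q2, _ = np.linalg.qr(rng.normal(size=(d, k2)))

cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)   # singular values = cos(theta_i)
angles = np.arccos(np.clip(cosines, -1.0, 1.0))
print(np.degrees(angles))    # min(k1, k2) principal angles, smallest first
```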
13.4 Schur Decomposition
Theorem (Schur). Every square matrix $A$ (working over $\mathbb{C}$, or over $\mathbb{R}$ for the real Schur form) can be written as:
$$A = Q T Q^*$$
where $Q$ is unitary ($Q^* Q = I$) and $T$ is upper triangular. The diagonal entries of $T$ are the eigenvalues of $A$ (possibly complex, in some order).
Real Schur form. Over $\mathbb{R}$, $A = Q T Q^\top$ where $Q$ is orthogonal and $T$ is quasi-upper triangular: block upper triangular with $1 \times 1$ blocks (real eigenvalues) and $2 \times 2$ blocks (complex conjugate eigenvalue pairs). The columns of $Q$ form an orthonormal basis.
Invariant subspace connection. The first $k$ columns of $Q$ span a $k$-dimensional invariant subspace of $A$:
$$A \cdot \operatorname{span}(q_1, \dots, q_k) \subseteq \operatorname{span}(q_1, \dots, q_k)$$
This follows because $AQ = QT$ and $T$ is upper triangular: $A q_j = \sum_{i \le j} T_{ij} q_i$, so $A q_j$ is a linear combination of $q_1, \dots, q_j$. The Schur decomposition gives a nested sequence of invariant subspaces:
$$\operatorname{span}(q_1) \subset \operatorname{span}(q_1, q_2) \subset \cdots \subset \operatorname{span}(q_1, \dots, q_n) = \mathbb{C}^n$$
Each subspace in the chain is invariant under $A$. The Schur form is a "staircase" of nested invariant subspaces, one of each dimension.
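A numerical check that the leading Schur vectors span an invariant subspace. The complex Schur form is used so that any $k$ works; in the real form, $k$ must not split a $2 \times 2$ block:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))

T, Q = schur(A.astype(complex), output="complex")   # A = Q T Q^*, T upper triangular
k = 3
Qk = Q[:, :k]                                       # first k Schur vectors

# The component of A @ span(Qk) outside span(Qk) should vanish
AQk = A @ Qk
residual = AQk - Qk @ (Qk.conj().T @ AQk)
print(np.linalg.norm(residual))                     # ~1e-15: the subspace is invariant
```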
Relationship to spectral theorem. For symmetric $A$: the Schur form has $T$ diagonal, since $T = Q^\top A Q$ is both upper triangular and symmetric, which forces it to be diagonal. The Schur decomposition of a symmetric matrix is exactly the spectral decomposition.
Practical relevance. The QR algorithm (the standard algorithm for computing eigenvalues) works by iteratively applying QR decompositions to drive $A$ toward its Schur form. Each QR step refines the invariant subspace structure. The Schur form is numerically stable (unitary transformations do not amplify errors) and can always be computed, unlike the eigendecomposition (which may not exist for non-diagonalisable matrices).
14. Common Mistakes
| # | Mistake | Why It's Wrong | Correct Understanding |
|---|---|---|---|
| 1 | "The union of two subspaces is a subspace" | $U \cup W$ fails closure under addition: take $u \in U$ and $w \in W$; their sum $u + w$ need not lie in $U$ or $W$. Concrete example: $U = \operatorname{span}(e_1)$ (x-axis) and $W = \operatorname{span}(e_2)$ (y-axis) in $\mathbb{R}^2$; $e_1 + e_2 \notin U \cup W$. | The sum $U + W$ is a subspace; the union is generally not. Use $U + W$ when you need the subspace generated by both. |
| 2 | "An affine subspace is a subspace" | The solution set $\{x : Ax = b\}$ with $b \ne 0$ does not contain the zero vector ($A0 = 0 \ne b$) and is not closed under addition ($A(x + y) = 2b \ne b$). The probability simplex sits inside an affine subspace (the hyperplane $\sum_i x_i = 1$) - not a vector subspace. | An affine subspace = coset of a linear subspace = a linear subspace shifted off the origin. It shares the same dimension and "shape" but is NOT a vector space. Subspaces must contain $0$. |
| 3 | "Closure under addition is sufficient for a subspace" | Closure under addition alone (without scalar multiplication) is insufficient. The first quadrant $\{(x, y) \in \mathbb{R}^2 : x \ge 0, y \ge 0\}$ is closed under addition, but $(-1)\cdot(1, 1) = (-1, -1)$ leaves the set, so it fails the scalar multiplication axiom. | All three conditions are required: (1) contains $0$, (2) closed under addition, (3) closed under scalar multiplication. Checking only one or two of them is not enough. |
| 4 | "$\operatorname{span}(v_1, \dots, v_k) = \operatorname{span}(v_1, \dots, v_{k-1})$ implies $v_k = 0$" | The spans being equal means $v_k$ is linearly dependent on $v_1, \dots, v_{k-1}$ - it can be expressed as a linear combination of the earlier vectors. But it need not be zero. For example: $\operatorname{span}((1,0), (0,1), (1,1)) = \operatorname{span}((1,0), (0,1))$, yet $(1,1)$ is not zero. | Equal spans means $v_k \in \operatorname{span}(v_1, \dots, v_{k-1})$ - redundant, not zero. |
| 5 | "$\dim(U + W) = \dim U + \dim W$" | This formula is wrong. The correct formula involves the intersection: $\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$. The dimension of $U + W$ ranges between $\max(\dim U, \dim W)$ and $\dim U + \dim W$. | Correct formula: $\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$. You cannot determine $\dim(U + W)$ from dimensions alone without knowing $\dim(U \cap W)$. |
| 6 | "The complement of a subspace is a subspace" | The set complement $V \setminus U$ is never a subspace: it does not contain $0$ (since $0 \in U$) and is not closed under addition or scalar multiplication. | The orthogonal complement $U^\perp$ IS a subspace. Always say "orthogonal complement $U^\perp$", never just "complement". The orthogonal complement is the canonical linear-algebraic complement of $U$. |
| 7 | "All bases for $V$ have the same vectors" | Bases are not unique. Any set of $n$ linearly independent spanning vectors is a basis for an $n$-dimensional space. For $\mathbb{R}^2$: $\{(1,0), (0,1)\}$, $\{(1,1), (1,-1)\}$, and $\{(2,0), (0,3)\}$ are three distinct bases. | All bases for $V$ have the same number of vectors ($= \dim V$). The actual vectors can be anything linearly independent and spanning. The choice of basis is a choice of coordinate system. |
| 8 | "$\operatorname{rank}(AB) = \min(\operatorname{rank} A, \operatorname{rank} B)$" | The inequality $\operatorname{rank}(AB) \le \min(\operatorname{rank} A, \operatorname{rank} B)$ always holds (since $C(AB) \subseteq C(A)$ and $\operatorname{row}(AB) \subseteq \operatorname{row}(B)$). But the right-hand side can be strictly larger: if $A$ maps part of $C(B)$ to zero ($N(A) \cap C(B) \ne \{0\}$), then $\operatorname{rank}(AB) < \operatorname{rank} B$. | $\operatorname{rank}(AB) \le \min(\operatorname{rank} A, \operatorname{rank} B)$. Equality $\operatorname{rank}(AB) = \operatorname{rank} B$ holds iff $N(A) \cap C(B) = \{0\}$, i.e., $A$ is injective on the column space of $B$. |
| 9 | "Two subspaces with the same dimension are equal" | Dimension only describes the "size" of a subspace, not its orientation. The x-axis and y-axis in $\mathbb{R}^2$ both have dimension 1 but are completely different subspaces (they share only the origin). | Equal dimension means isomorphic as abstract vector spaces. Actual equality as subsets requires showing every vector in one is in the other - or equivalently, given equal dimensions, that one contains the other. |
| 10 | "Gram-Schmidt always produces a basis from a spanning set" | Gram-Schmidt fails (division by zero) when it encounters a vector already in the span of the previously constructed orthonormal vectors. The intermediate vector becomes $0$, which cannot be normalised. | Apply Gram-Schmidt only to linearly independent vectors. For a spanning set, first extract an independent subset via row reduction, then apply Gram-Schmidt. A zero intermediate vector signals linear dependence - discard that vector and continue. |
| 11 | "The pivot columns of the RREF give a basis for the column space" | Row operations change the column space. The RREF of $A$ generally has different columns from $A$ itself. The pivot positions (column indices) are correct, but the actual basis vectors must be taken from the original matrix $A$, not from the RREF. | Identify pivot positions from the RREF, but extract the corresponding columns from the original $A$. The pivot columns of $A$ (not its RREF) form a basis for $C(A)$. |
| 12 | "$Ax = 0$ has only the trivial solution implies $A$ is invertible" | For rectangular $A \in \mathbb{R}^{m \times n}$ with $m \ne n$: $A$ cannot be invertible regardless. The trivial null space means $A$ is injective ($\operatorname{rank} A = n$), but invertibility requires $A$ to be square with $\operatorname{rank} A = n = m$. | For square $A$: $N(A) = \{0\}$ iff $A$ is invertible. For rectangular $A$ with $m > n$: $N(A) = \{0\}$ means $A$ is injective (left-invertible) but not invertible. |