Vector Spaces and Subspaces: Part 15: Exercises to Conceptual Bridge
15. Exercises
Exercise 1: Verifying Vector Space Axioms
For each of the following, determine whether it is a vector space with the given operations. If not, identify which axiom(s) fail. For each valid vector space, identify the zero vector and describe the additive inverses.
(a) $V = \mathbb{R}^2$, addition: standard componentwise $(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)$, scalar mult: $c \cdot (x, y) = (cx, y)$ - scalar multiplication only affects the first coordinate.
(b) $V = \mathbb{R}_{>0}$ (the positive reals), addition defined as $x \oplus y = xy$ (multiplication of positive reals), scalar multiplication $c \odot x = x^c$.
(c) with the standard addition and scalar multiplication inherited from .
(d) with standard pointwise addition and scalar multiplication .
(e) with the same standard operations.
Hints: For (b), verify all eight axioms carefully with the non-standard operations; the zero element of the vector space (if it exists) is the identity for $\oplus$, not the number 0. For (d) and (e), think about which evaluation constraint is compatible with the zero function.
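A quick numerical spot-check can complement the axiom proofs, especially for part (b). The sketch below is illustrative only: the helper names `add` and `smul` and the sampling ranges are my own choices, and random sampling can falsify an axiom but never prove one.

```python
import numpy as np

# Part (b): V = positive reals, "addition" x (+) y = x*y, "scalar mult" c (*) x = x**c.
rng = np.random.default_rng(0)

def add(x, y):          # vector addition in this space
    return x * y

def smul(c, x):         # scalar multiplication in this space
    return x ** c

zero = 1.0              # candidate zero vector: the identity for the new addition is 1, not 0

for _ in range(1000):
    x, y = rng.uniform(0.1, 10.0, size=2)
    a, b = rng.uniform(-3.0, 3.0, size=2)
    assert np.isclose(add(x, y), add(y, x))                              # x + y = y + x
    assert np.isclose(add(x, zero), x)                                   # x + 0 = x
    assert np.isclose(add(x, smul(-1.0, x)), zero)                       # inverse of x is 1/x
    assert np.isclose(smul(a, add(x, y)), add(smul(a, x), smul(a, y)))   # a(x + y) = ax + ay
    assert np.isclose(smul(a + b, x), add(smul(a, x), smul(b, x)))       # (a + b)x = ax + bx
    assert np.isclose(smul(a * b, x), smul(a, smul(b, x)))               # (ab)x = a(bx)
    assert np.isclose(smul(1.0, x), x)                                   # 1x = x
print("all sampled axiom checks passed")
```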
Exercise 2: Subspace Verification
For each subset, determine whether it is a subspace of the given vector space. Prove it is a subspace (using the three-condition test) or find an explicit counterexample showing it is not.
(a) inside
(b) $\{(x, y) \in \mathbb{R}^2 : xy = 0\}$ (the two coordinate axes together) inside $\mathbb{R}^2$
(c) $\{A \in M_{n \times n}(\mathbb{R}) : \operatorname{tr}(A) = 0\}$ (traceless matrices) inside $M_{n \times n}(\mathbb{R})$
(d) $\{p \in P_n : p(a) = 0\}$ (polynomials of degree $\le n$ that vanish at a fixed point $a$) inside $P_n$
(e) $\{(x, y, z) \in \mathbb{R}^3 : x^2 + y^2 = z^2\}$ (double cone) inside $\mathbb{R}^3$
For each valid subspace: find a basis and state the dimension.
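Closure failures are often easiest to find by direct search. The sketch below assumes the definitions of parts (b) and (c) as written above, uses a 3x3 size for the traceless case purely for concreteness, and only searches for counterexamples; passing the search is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def in_axes(v):                         # part (b): union of the two coordinate axes
    return np.isclose(v[0] * v[1], 0.0)

def is_traceless(M):                    # part (c): trace-zero matrices
    return np.isclose(np.trace(M), 0.0)

# Union of axes: closed under scaling, but addition escapes the set.
u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(in_axes(u), in_axes(v), in_axes(u + v))        # True True False -> counterexample

# Traceless matrices: random sums and scalings stay traceless.
for _ in range(100):
    A = rng.normal(size=(3, 3)); A -= (np.trace(A) / 3) * np.eye(3)
    B = rng.normal(size=(3, 3)); B -= (np.trace(B) / 3) * np.eye(3)
    c = rng.normal()
    assert is_traceless(A + B) and is_traceless(c * A)
print("no closure counterexample found for traceless matrices")
```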
Exercise 3: The Four Fundamental Subspaces
Let $A$ be the given matrix.
(a) Row-reduce $A$ to RREF. Identify the pivot columns and free columns.
(b) Find a basis for the null space $N(A)$. Verify the dimension via the Rank-Nullity Theorem.
(c) Find a basis for the column space $C(A)$ using the pivot columns of the original $A$ (not the RREF). State the dimension.
(d) Find a basis for the row space $C(A^T)$ using the non-zero rows of the RREF. State the dimension.
(e) Find a basis for the left null space $N(A^T)$ by row-reducing $A^T$ and finding its null space. Alternatively, argue from the dimension formula.
(f) Verify the dimension counts: $\dim C(A) + \dim N(A) = n$ (the number of columns) and $\dim C(A^T) + \dim N(A^T) = m$ (the number of rows).
(g) Verify the orthogonality $C(A^T) \perp N(A)$: compute the dot product of every basis vector of $C(A^T)$ with every basis vector of $N(A)$ and confirm it equals zero.
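The entire pipeline can be cross-checked symbolically. The sketch below uses SymPy with a stand-in matrix (the exercise supplies its own $A$); the procedure is identical for any matrix.

```python
import sympy as sp

# Stand-in matrix; the workflow (RREF, pivot columns, nullspaces, transposes) is generic.
A = sp.Matrix([[1, 2, 0, 3],
               [2, 4, 1, 7],
               [3, 6, 1, 10]])
m, n = A.shape

R, pivots = A.rref()                      # (a) RREF and pivot-column indices
col_basis  = [A.col(j) for j in pivots]   # (c) pivot columns of the ORIGINAL A
row_basis  = A.rowspace()                 # (d) non-zero rows of the RREF
null_basis = A.nullspace()                # (b) special solutions of Ax = 0
left_null  = A.T.nullspace()              # (e) null space of A^T

r = len(pivots)
assert r + len(null_basis) == n           # (f) rank-nullity for A
assert r + len(left_null) == m            # (f) rank-nullity for A^T

# (g) every row-space basis vector is orthogonal to every null-space basis vector
assert all(sum(ui * vi for ui, vi in zip(u, v)) == 0
           for u in row_basis for v in null_basis)
print("rank =", r, "| dim N(A) =", len(null_basis), "| dim N(A^T) =", len(left_null))
```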
Exercise 4: Span, Independence, and Basis in $\mathbb{R}^n$
Let $v_1$, $v_2$, $v_3$, $v_4$ be the four given vectors.
(a) Form the matrix $V = [\,v_1 \; v_2 \; v_3 \; v_4\,]$ with these vectors as columns and row-reduce. Are $v_1, v_2, v_3, v_4$ linearly independent?
(b) What is $\dim \operatorname{span}\{v_1, v_2, v_3, v_4\}$?
(c) Identify a subset of the given vectors that forms a basis for $\operatorname{span}\{v_1, v_2, v_3, v_4\}$.
(d) Express any dependent vectors as explicit linear combinations of your basis vectors.
(e) Is the given vector $w$ in $\operatorname{span}\{v_1, v_2, v_3, v_4\}$? If yes, find coordinates $c_1, \dots, c_4$ such that $w = c_1 v_1 + c_2 v_2 + c_3 v_3 + c_4 v_4$. If no, explain why.
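Both the independence test and the span-membership test reduce to rank comparisons. The sketch below uses placeholder vectors of my own (substitute the exercise's $v_1, \dots, v_4$ and $w$):

```python
import numpy as np

# Placeholder data; substitute the exercise's vectors.
v1, v2, v3, v4 = (np.array(v, dtype=float) for v in
                  ([1, 0, 2], [0, 1, 1], [1, 1, 3], [2, 1, 5]))
w = np.array([3.0, 1.0, 7.0])

V = np.column_stack([v1, v2, v3, v4])
r = np.linalg.matrix_rank(V)
print("independent?", r == 4, "| dim span =", r)                  # (a), (b)

# (e) w lies in the span iff appending it does not increase the rank
in_span = np.linalg.matrix_rank(np.column_stack([V, w])) == r
print("w in span?", in_span)
if in_span:
    # least squares returns one valid coefficient vector (not unique if the v's are dependent)
    c, *_ = np.linalg.lstsq(V, w, rcond=None)
    print("coordinates:", np.round(c, 3), "| check V @ c =", V @ c)
```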
Exercise 5: Gram-Schmidt and Orthogonal Projection
Work in $\mathbb{R}^3$ with the standard inner product.
(a) Starting from the three given vectors $v_1$, $v_2$, $v_3$, verify that they are linearly independent (compute the determinant of the matrix they form).
(b) Apply Gram-Schmidt to $v_1, v_2, v_3$ to produce an orthonormal basis $q_1, q_2, q_3$ for their span.
(c) Verify your result: check that $\|q_i\| = 1$ for each $i$ and $\langle q_i, q_j \rangle = 0$ for $i \ne j$.
(d) Express the given vector $w$ in your orthonormal basis: find $c_i = \langle w, q_i \rangle$ for each $i$. Verify that $w = c_1 q_1 + c_2 q_2 + c_3 q_3$.
(e) Construct the orthogonal projection matrix onto $S = \operatorname{span}\{q_1, q_2\}$ using $P = QQ^T$, where $Q = [\,q_1 \; q_2\,]$. Verify: (i) $P^T = P$, (ii) $P^2 = P$, (iii) $Pq_i = q_i$ for $i = 1, 2$. Compute $Pw$ and the residual $w - Pw$. Verify that $Pw \in S$ and $(w - Pw) \perp S$.
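The Gram-Schmidt and projection identities are straightforward to verify numerically. The vectors below are placeholders for the given ones:

```python
import numpy as np

# Placeholder vectors; substitute the given v1, v2, v3 and w.
v1, v2, v3 = np.array([1., 1., 0.]), np.array([1., 0., 1.]), np.array([0., 1., 2.])
w = np.array([2., -1., 3.])

def gram_schmidt(vectors):
    qs = []
    for v in vectors:
        u = v - sum(np.dot(v, q) * q for q in qs)   # remove components along earlier q's
        qs.append(u / np.linalg.norm(u))
    return np.column_stack(qs)

Q = gram_schmidt([v1, v2, v3])
assert np.allclose(Q.T @ Q, np.eye(3))              # (c) orthonormality

c = Q.T @ w                                         # (d) coordinates c_i = <w, q_i>
assert np.allclose(Q @ c, w)                        # w is recovered from its coordinates

Q2 = Q[:, :2]                                       # (e) projection onto span{q1, q2}
P = Q2 @ Q2.T
assert np.allclose(P, P.T) and np.allclose(P @ P, P)
resid = w - P @ w
assert np.allclose(Q2.T @ resid, 0)                 # residual is orthogonal to the subspace
print("P @ w =", np.round(P @ w, 4), "| residual =", np.round(resid, 4))
```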
Exercise 6: Subspace Operations and Dimension Formula
(a) In $\mathbb{R}^3$, let $U = \operatorname{span}\{e_1, e_2\}$ (the $xy$-plane) and let $W$ be the second given subspace. Find a basis for $U \cap W$. Is $U + W = \mathbb{R}^3$? Find $\dim(U + W)$ and verify the Grassmann formula: $\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$.
(b) Is the sum direct, i.e. is $\mathbb{R}^3 = U \oplus W$? If not, find a subspace $W'$ such that $\mathbb{R}^3 = U \oplus W'$.
(c) For the second given pair of subspaces $U$ and $W$: find $\dim U$, $\dim W$, $\dim(U \cap W)$, and $\dim(U + W)$. Verify the Grassmann formula.
(d) For $U \cap W$ from part (c): find an explicit basis.
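The dimension bookkeeping and the intersection basis can both be computed mechanically. The sketch below uses placeholder subspaces of $\mathbb{R}^4$; the `null_space` helper is my own, built on the SVD.

```python
import numpy as np

def null_space(M, tol=1e-10):
    # basis (as columns) for {z : Mz = 0}, via the right singular vectors of M
    _, s, Vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return Vt[rank:].T

# Placeholder subspaces of R^4, each given by basis vectors stored as columns.
U = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]]).T          # U = span{e1, e2}
W = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.]]).T          # W = span{e2, e3}

dim_U, dim_W = np.linalg.matrix_rank(U), np.linalg.matrix_rank(W)
dim_sum = np.linalg.matrix_rank(np.hstack([U, W]))          # dim(U + W)
dim_int = dim_U + dim_W - dim_sum                           # Grassmann formula
print("dim U, dim W, dim(U+W), dim(U cap W) =", dim_U, dim_W, dim_sum, dim_int)

# Explicit intersection basis: coefficients (a, b) with U a = W b satisfy [U | -W] z = 0,
# and the intersection vectors are then U a.
Z = null_space(np.hstack([U, -W]))
basis_int = U @ Z[:U.shape[1], :]
print("basis of U cap W (columns):\n", np.round(basis_int, 4))
```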
Exercise 7: Change of Basis and Coordinates
In $\mathbb{R}^2$, let $B = \{b_1, b_2\}$ and $C = \{c_1, c_2\}$ be the two given bases.
(a) Verify that both $B$ and $C$ are bases for $\mathbb{R}^2$ (compute the determinants of the corresponding matrices).
(b) Find the change-of-basis matrix $P_{C \leftarrow B}$: express each $b_j$ as a linear combination of $c_1$ and $c_2$, and use these coefficient vectors as the columns of $P_{C \leftarrow B}$.
(c) For the vector $v$ with the given coordinates $[v]_B$ in basis $B$, compute $[v]_C = P_{C \leftarrow B}\,[v]_B$.
(d) Compute $v$ explicitly in the standard basis. Then verify your answer to (c) by directly expressing $v$ as a linear combination of $c_1$ and $c_2$.
(e) If a linear map has matrix $A$ in the standard basis, find its matrix representation in basis $B$.
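The whole change-of-basis chain is a few lines of NumPy. The bases, coordinate vector, and linear map below are placeholders for the given ones:

```python
import numpy as np

# Placeholder bases and data; substitute the given b's, c's, [v]_B, and A.
b1, b2 = np.array([1., 1.]), np.array([1., -1.])
c1, c2 = np.array([2., 1.]), np.array([1., 1.])
B = np.column_stack([b1, b2])          # maps B-coordinates to standard coordinates
C = np.column_stack([c1, c2])          # maps C-coordinates to standard coordinates
assert np.linalg.det(B) != 0 and np.linalg.det(C) != 0   # (a) both are bases

P_CB = np.linalg.solve(C, B)           # (b) change of basis C <- B (solves C @ P = B)
vB = np.array([3., 2.])                # (c) placeholder coordinates of v in basis B
vC = P_CB @ vB

v = B @ vB                             # (d) v in the standard basis
assert np.allclose(C @ vC, v)          # expressing v in C gives back the same vector

A = np.array([[2., 1.], [0., 3.]])     # (e) placeholder linear map in the standard basis
A_B = np.linalg.inv(B) @ A @ B         # its matrix in basis B
print("P_CB =\n", np.round(P_CB, 4), "\n[v]_C =", np.round(vC, 4), "\nA_B =\n", np.round(A_B, 4))
```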
Exercise 8: AI Application - Subspace Analysis of a Weight Matrix
Let $W$ be the given weight matrix.
(a) Compute $\operatorname{rank}(W)$ and the nullity $\dim N(W)$. Is $W$ full rank?
(b) Find a basis for $C(W)$ and a basis for $N(W)$. Verify that $\operatorname{rank}(W) + \dim N(W)$ equals the number of columns of $W$.
(c) A LoRA adapter has rank $r$ with update $\Delta W = BA$, where $B$ and $A$ are the given factor matrices ($B$ has $r$ columns and $A$ has $r$ rows). Explicitly compute $\Delta W = BA$. What is $\operatorname{rank}(\Delta W)$? What is $C(\Delta W)$?
(d) Consider the updated weight $W' = W + \Delta W$. Without computing $W'$ explicitly, state the upper and lower bounds on $\operatorname{rank}(W')$. Under what condition on the relationship between $C(\Delta W)$ and $N(W^T)$ (i.e., the left null space of $W$) would the rank increase? Under what condition would it stay the same?
(e) Now compute $W' = W + \Delta W$ explicitly. Verify your predictions from (d) by computing $\operatorname{rank}(W')$.
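The rank arithmetic in (c)-(e) is easy to see with a synthetic low-rank weight matrix (a placeholder, not the exercise's $W$): the update raises the rank exactly when its column space reaches outside $C(W)$, i.e. has a component in the left null space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder rank-2 base weight (4 x 6) built from two outer products.
u1, u2, u3 = np.eye(4)[:, 0], np.eye(4)[:, 1], np.eye(4)[:, 2]
W = np.outer(u1, rng.normal(size=6)) + np.outer(u2, rng.normal(size=6))
print("rank(W) =", np.linalg.matrix_rank(W))                         # 2

A = rng.normal(size=(1, 6))                                          # LoRA "A" factor (r = 1)

# Case 1: B inside C(W) -> the update only rewrites existing output directions.
B_in = (u1 + 2 * u2).reshape(4, 1)
print("rank(W + B_in @ A)  =", np.linalg.matrix_rank(W + B_in @ A))  # stays 2

# Case 2: B has a component in the left null space N(W^T) -> the rank grows.
B_out = u3.reshape(4, 1)
print("rank(W + B_out @ A) =", np.linalg.matrix_rank(W + B_out @ A)) # becomes 3

# (d) in general: |rank(W) - rank(dW)| <= rank(W + dW) <= rank(W) + rank(dW)
```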
Exercise 9 (Challenge): The Superposition Geometry
This exercise develops the geometry of the superposition hypothesis.
(a) In $\mathbb{R}^2$ (a 2-dimensional embedding space), suppose we want to store three features as unit vectors with maximum pairwise orthogonality. Show that it is impossible for all three feature vectors to be pairwise orthogonal. (Hint: pairwise orthogonal non-zero vectors are linearly independent, and $\mathbb{R}^2$ cannot contain three linearly independent vectors.)
(b) Instead, place three unit vectors at angles $0^\circ$, $120^\circ$, $240^\circ$ from the positive $x$-axis: $f_1 = (1, 0)$, $f_2 = (-\tfrac{1}{2}, \tfrac{\sqrt{3}}{2})$, $f_3 = (-\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2})$. Compute all pairwise inner products $\langle f_i, f_j \rangle$ for $i \ne j$. What is the interference level?
(c) The "reconstruction loss" for feature $i$ when the features have activations $a_1, a_2, a_3$ is the error made when reading off the $i$-th feature along its feature direction. Specifically, if the residual stream stores $x = \sum_j a_j f_j$ and we estimate $\hat{a}_i = \langle x, f_i \rangle$, show that $\hat{a}_i = a_i + \sum_{j \ne i} a_j \langle f_j, f_i \rangle$. The error is the interference from other features.
(d) Compute the expected interference for the configuration in (b), assuming each feature is independently active with probability $p$. What does the interference approach as $p \to 0$?
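A short simulation of the 120-degree configuration makes the sparsity trade-off concrete. The activation model below (each feature independently on with probability $p$) matches the reading of part (d) given above; the sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

angles = np.deg2rad([0.0, 120.0, 240.0])
F = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # rows are f1, f2, f3
print("pairwise inner products:\n", np.round(F @ F.T, 4))  # off-diagonal entries are -0.5

for p in (0.5, 0.1, 0.01):
    errs = []
    for _ in range(10000):
        a = (rng.random(3) < p).astype(float)    # each feature active with probability p
        x = F.T @ a                              # stored vector: sum_j a_j f_j
        a_hat = F @ x                            # read-off estimates <x, f_i>
        errs.append(np.abs(a_hat - a).mean())    # interference from the other features
    print(f"p = {p:>4}: mean interference = {np.mean(errs):.4f}")
# The interference shrinks toward zero as the features become sparser (p -> 0).
```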
Exercise 10 (Challenge): Krylov Subspaces
Let $A$ be the given symmetric matrix and $b$ the given starting vector.
(a) Compute $Ab$ and $A^2 b$.
(b) Does $\mathcal{K}_3(A, b) = \operatorname{span}\{b, Ab, A^2 b\}$ have dimension 3? (Check whether $b$, $Ab$, and $A^2 b$ are linearly independent.)
(c) Apply the Lanczos process (Gram-Schmidt applied to $\{b, Ab, A^2 b\}$) to produce an orthonormal basis $\{q_1, q_2, q_3\}$ for $\mathcal{K}_3(A, b)$.
(d) In this orthonormal basis, form $Q = [\,q_1 \; q_2 \; q_3\,]$ and compute the matrix $T = Q^T A Q$, with entries $T_{ij} = q_i^T A q_j$. Verify that $T$ is symmetric and (nearly) tridiagonal. The eigenvalues of $T$ approximate the eigenvalues of $A$ - verify this by comparing with the actual eigenvalues of $A$.
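A numerical sketch of the whole exercise, with a placeholder symmetric $A$ and start vector $b$. Here $\mathcal{K}_3$ is all of $\mathbb{R}^3$, so the eigenvalues of $T$ match those of $A$ exactly; for a larger matrix they would only approximate them.

```python
import numpy as np

# Placeholder symmetric matrix and start vector.
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 0., 0.])

K = np.column_stack([b, A @ b, A @ A @ b])             # (a) Krylov vectors b, Ab, A^2 b
print("dim K_3 =", np.linalg.matrix_rank(K))           # (b) 3 when they are independent

Q, _ = np.linalg.qr(K)                                 # (c) Gram-Schmidt via QR
T = Q.T @ A @ Q                                        # (d) tridiagonal in the Lanczos basis
print("T =\n", np.round(T, 4))
print("eigenvalues of T:", np.round(np.sort(np.linalg.eigvalsh(T)), 4))
print("eigenvalues of A:", np.round(np.sort(np.linalg.eigvalsh(A)), 4))
```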
16. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| Residual stream as shared vector space | The residual stream in a transformer is the central shared vector space. Every attention head, MLP layer, and positional encoding adds vectors to this space via the residual connection. All components communicate through this one $d_{\text{model}}$-dimensional space - nothing else is shared. Understanding the subspace decomposition of the residual stream (which subspaces are written by which components, which are read by which) is literally the same as understanding how information flows through a transformer. There is no higher-level description. |
| LoRA rank = subspace dimension | LoRA's entire design is a subspace constraint. The update $\Delta W = BA$ is a rank-$r$ matrix, parameterised by only $r(d_{\text{in}} + d_{\text{out}})$ numbers instead of the full $d_{\text{in}} d_{\text{out}}$. Choosing $r$ is choosing the dimension of the subspace to search in. Too small: the subspace doesn't contain the optimal update direction, and performance suffers. Too large: you are searching a subspace larger than necessary, wasting parameters and potentially overfitting. The right $r$ is the intrinsic dimension of the fine-tuning task - a subspace dimension. |
| Superposition and polysemanticity | The superposition hypothesis says LLMs represent more features than their embedding dimension allows. Since more than $d$ linearly independent directions cannot exist in $\mathbb{R}^d$, features beyond the first $d$ must share dimensions through non-orthogonal superposition. Each neuron becomes polysemantic - it responds to multiple features. This limits interpretability (you cannot read off features from individual dimensions) and causes interference between features. Solving superposition is one of the central unsolved problems in mechanistic interpretability, and the solution must be phrased in the language of subspace geometry. |
| MLA and KV subspace compression | DeepSeek-V3's Multi-head Latent Attention compresses the key/value projections into a low-rank latent subspace of dimension $d_c \ll d_{\text{model}}$. The KV cache stores only the $d_c$-dimensional compressed representation; at inference, it is decompressed back to the full dimension. The latent rank is the architectural bottleneck that enables a substantial reduction in KV cache memory. The subspace dimension is the design variable that trades memory against expressiveness. This is subspace thinking at the architecture level: the architecture is designed around a low-dimensional subspace constraint. |
| Mechanistic interpretability circuits | Every circuit found in mechanistic interpretability is a composition of subspace operations. A "previous token head" reads from a specific subspace of the residual stream (via its $W_Q$ and $W_K$ row spaces) and writes to a specific subspace (via its output projection's column space). An "induction head" communicates with the previous token head through a shared subspace. Superposition of features happens in specific subspaces. The "language" of mechanistic interpretability is entirely the language of subspaces: read, write, rotate, project, in $\mathbb{R}^{d_{\text{model}}}$. |
| Representation collapse prevention | In self-supervised learning, representation collapse means all embeddings converge to a low-dimensional (or 0-dimensional) subspace - the network learns to output the same or similar vectors for all inputs. VICReg, Barlow Twins, and BYOL all prevent collapse by adding losses (or architectural asymmetries) that penalise low-dimensional representations. VICReg's variance loss requires that the covariance matrix of batch embeddings has high trace (high average variance), which prevents the embeddings from collapsing to a subspace. Collapse = subspace dimension far below the embedding dimension; the loss explicitly maximises the effective subspace dimension. |
| Implicit bias and minimum-norm solutions | Gradient descent on overparameterised linear models (and, empirically, on large neural networks) converges to minimum-norm solutions. The minimum-norm solution lies in the row space of the data matrix - it is the unique solution lying in the orthogonal complement of the null space. The implicit bias of gradient descent is a subspace selection: it selects the solution in the row space rather than any other coset representative. This subspace selection is what enables generalisation in overparameterised models. |
| Orthogonality and head diversity | If attention heads write to mutually orthogonal subspaces of the residual stream, they do not interfere with each other - each head has an independent information channel. Head redundancy is equivalent to subspace overlap: if one head's column space is contained in another head's column space, the first head is redundant. Pruning heads that write to subspaces already covered by remaining heads preserves expressiveness while reducing computation. Orthogonality between head subspaces is the precise mathematical criterion for head independence (a minimal numerical check of this criterion appears after this table). |
| Function space universality | The universal approximation theorem says that neural networks with nonlinear activations can approximate any continuous function to arbitrary accuracy. In subspace language: the set of functions representable by a sufficiently large network is dense in the space of continuous functions on a compact domain - it is not a subspace (the network family is non-linear), but its closure is the entire function space. The depth and nonlinearity of the architecture determine which function-space "directions" are accessible. Width and depth determine the "dimensionality" of the accessible function subspace. |
| Second-order optimisation | The Hessian of the training loss is a matrix with most of its spectral mass concentrated in a low-dimensional subspace (the "bulk" subspace where curvature is large). The remaining directions have near-zero curvature (flat directions). K-FAC, Shampoo, and SOAP all identify and exploit this curvature subspace by approximating the inverse Hessian in the curved subspace and using the identity in the flat directions. Natural gradient preconditions updates by the inverse Fisher (a positive semidefinite matrix): it aligns updates with the geometry of the curved subspace and suppresses movement in the flat directions. |
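To make the head-diversity row concrete: the sketch below uses hypothetical per-head output projections (random matrices, not weights from any real model) and measures subspace overlap via the cosines of the principal angles between their column spaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, head_dim = 64, 8                                    # residual width and per-head width (hypothetical)
W1 = rng.normal(size=(d, head_dim))                    # head 1 output projection (random stand-in)
W2 = rng.normal(size=(d, head_dim))                    # head 2: an unrelated random subspace
W3 = W1 @ rng.normal(size=(head_dim, head_dim))        # head 3: writes into head 1's subspace

def overlap(Wa, Wb):
    Qa, _ = np.linalg.qr(Wa)                           # orthonormal basis of the column space
    Qb, _ = np.linalg.qr(Wb)
    # singular values of Qa^T Qb = cosines of the principal angles between the subspaces
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

print("random heads  :", np.round(overlap(W1, W2), 3))   # well below 1: nearly independent channels
print("redundant head:", np.round(overlap(W1, W3), 3))   # all 1.0: identical write subspace
```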
Conceptual Bridge
Vector spaces and subspaces are the geometry underlying all of linear algebra. Every concept - linear independence, span, basis, dimension, rank, orthogonality, projections, eigenvalues - is ultimately a statement about the structure of subspaces. The eight axioms that define a vector space are simple; the richness emerges from the subspace hierarchy they generate.
The abstract framework is what makes the theory universal. The same theorems that govern arrows in a plane govern polynomials, matrices, functions, and probability distributions. Verifying the eight axioms once grants access to a century of results - for free, without reproof. This universality is not just mathematical elegance; it is practical power. When you recognise that the gradient vectors during training form a set of vectors in $\mathbb{R}^n$ and that $\mathbb{R}^n$ is a vector space, you immediately know that all of linear algebra applies: span, independence, subspaces, projections, and dimension all become tools for understanding training dynamics.
For AI in 2026, subspaces are not abstract. They are the architectural primitives of transformers (the residual stream, the attention subspace, the MLP subspace), the design variables of efficient fine-tuning (LoRA rank = subspace dimension), the diagnostic language of interpretability (which subspace does this head write to?), and the theoretical foundation of generalisation (gradient updates live in a low-dimensional subspace; the implicit bias selects the minimum-norm solution in the row space).
WHERE THIS MODULE SITS IN THE CURRICULUM
Sets and Logic -> Functions -> Summation
Proof Techniques
Vectors and Spaces (geometry: what vectors are)
Matrix Operations (computation: how to multiply matrices)
Systems of Equations (solving: Gaussian elimination)
Determinants (volume: the signed volume of a parallelepiped)
Matrix Rank (dimensionality: the rank-nullity connection)
Vector Spaces and Subspaces (structure) <- THIS MODULE
The payoff: everything from earlier modules
is now understood geometrically as a statement
about subspaces. Rank = dimension of col space.
Null space = kernel. Solutions of Ax=b live in
a coset of null(A). Determinant = 0 iff the
column space is a proper subspace of R^n.
Eigenvalues and Decompositions (spectrum of a linear map)
- invariant subspaces of T: revealed through its eigenvalues
- eigenvector: 1D invariant subspace
- spectral theorem: orthogonal direct sum of eigenspaces
- SVD: complete subspace decomposition of any matrix
Probability and Information Theory
- probability distributions as vectors in function space
- KL divergence as a geometry on the simplex
Calculus and Optimisation
- gradient lives in R^n, a vector space
- Hessian's eigenspaces = directions of curvature
- natural gradient uses curved subspace geometry
LLM Mathematics
What comes next: Eigenvalues, Eigenvectors, and Matrix Decompositions.
The next module reveals the internal subspace structure of a linear map - the invariant subspaces that a matrix preserves, exposed through its eigenvalues and eigenvectors. An eigenvector spans a 1-dimensional invariant subspace; the spectral theorem decomposes any symmetric matrix into orthogonal 1-dimensional invariant subspaces; the SVD extends this to the complete subspace structure of any rectangular matrix. The four fundamental subspaces are decomposed even further - into singular subspaces aligned with singular values. All of this is the continuation of the subspace story begun in this module.
<- Back to Linear Algebra Basics | Next: Eigenvalues and Decompositions ->