Vector Spaces and Subspaces: Part 15: Exercises to Conceptual Bridge
15. Exercises
Exercise 1: Verifying Vector Space Axioms
For each of the following, determine whether it is a vector space with the given operations. If not, identify which axiom(s) fail. For each valid vector space, identify the zero vector and describe the additive inverses.
(a) $V = \mathbb{R}^2$, addition: standard componentwise $(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)$, scalar mult: $c \cdot (x, y) = (cx, y)$ - scalar multiplication only affects the first coordinate.
(b) $V = \mathbb{R}_{>0}$ (the positive reals), addition defined as $x \oplus y = xy$ (multiplication of positive reals), scalar multiplication $c \odot x = x^c$.
(c) with the standard addition and scalar multiplication inherited from .
(d) with standard pointwise addition and scalar multiplication .
(e) with the same standard operations.
Hints: For (b), verify all eight axioms carefully with the non-standard operations; the zero element of the vector space (if it exists) is the identity for $\oplus$, not the number 0. For (d) and (e), think about which evaluation constraint is compatible with the zero function.
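A quick numerical spot-check can complement the axiom proofs, especially for part (b). The sketch below is illustrative only: the helper names `add` and `smul` and the sampling ranges are my own choices, and random sampling can falsify an axiom but never prove one.

```python
import numpy as np

# Part (b): V = positive reals, "addition" x (+) y = x*y, "scalar mult" c (*) x = x**c.
rng = np.random.default_rng(0)

def add(x, y):          # vector addition in this space
    return x * y

def smul(c, x):         # scalar multiplication in this space
    return x ** c

zero = 1.0              # candidate zero vector: the identity for the new addition is 1, not 0

for _ in range(1000):
    x, y = rng.uniform(0.1, 10.0, size=2)
    a, b = rng.uniform(-3.0, 3.0, size=2)
    assert np.isclose(add(x, y), add(y, x))                              # x + y = y + x
    assert np.isclose(add(x, zero), x)                                   # x + 0 = x
    assert np.isclose(add(x, smul(-1.0, x)), zero)                       # inverse of x is 1/x
    assert np.isclose(smul(a, add(x, y)), add(smul(a, x), smul(a, y)))   # a(x + y) = ax + ay
    assert np.isclose(smul(a + b, x), add(smul(a, x), smul(b, x)))       # (a + b)x = ax + bx
    assert np.isclose(smul(a * b, x), smul(a, smul(b, x)))               # (ab)x = a(bx)
    assert np.isclose(smul(1.0, x), x)                                   # 1x = x
print("all sampled axiom checks passed")
```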
Exercise 2: Subspace Verification
For each subset, determine whether it is a subspace of the given vector space. Prove it is a subspace (using the three-condition test) or find an explicit counterexample showing it is not.
(a) inside
(b) $\{(x, y) \in \mathbb{R}^2 : xy = 0\}$ (the two coordinate axes together) inside $\mathbb{R}^2$
(c) $\{A \in M_{n \times n}(\mathbb{R}) : \operatorname{tr}(A) = 0\}$ (traceless matrices) inside $M_{n \times n}(\mathbb{R})$
(d) $\{p \in P_n : p(a) = 0\}$ (polynomials of degree $\le n$ that vanish at a fixed point $a$) inside $P_n$
(e) $\{(x, y, z) \in \mathbb{R}^3 : x^2 + y^2 = z^2\}$ (double cone) inside $\mathbb{R}^3$
For each valid subspace: find a basis and state the dimension.
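Closure failures are often easiest to find by direct search. The sketch below assumes the definitions of parts (b) and (c) as written above, uses a 3x3 size for the traceless case purely for concreteness, and only searches for counterexamples; passing the search is evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def in_axes(v):                         # part (b): union of the two coordinate axes
    return np.isclose(v[0] * v[1], 0.0)

def is_traceless(M):                    # part (c): trace-zero matrices
    return np.isclose(np.trace(M), 0.0)

# Union of axes: closed under scaling, but addition escapes the set.
u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(in_axes(u), in_axes(v), in_axes(u + v))        # True True False -> counterexample

# Traceless matrices: random sums and scalings stay traceless.
for _ in range(100):
    A = rng.normal(size=(3, 3)); A -= (np.trace(A) / 3) * np.eye(3)
    B = rng.normal(size=(3, 3)); B -= (np.trace(B) / 3) * np.eye(3)
    c = rng.normal()
    assert is_traceless(A + B) and is_traceless(c * A)
print("no closure counterexample found for traceless matrices")
```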
Exercise 3: The Four Fundamental Subspaces
Let $A$ be the given matrix.
(a) Row-reduce $A$ to RREF. Identify the pivot columns and free columns.
(b) Find a basis for the null space $N(A)$. Verify the dimension via the Rank-Nullity Theorem.
(c) Find a basis for the column space $C(A)$ using the pivot columns of the original $A$ (not the RREF). State the dimension.
(d) Find a basis for the row space $C(A^T)$ using the non-zero rows of the RREF. State the dimension.
(e) Find a basis for the left null space $N(A^T)$ by row-reducing $A^T$ and finding its null space. Alternatively, argue from the dimension formula.
(f) Verify the dimension counts: $\dim C(A) + \dim N(A) = n$ (the number of columns) and $\dim C(A^T) + \dim N(A^T) = m$ (the number of rows).
(g) Verify the orthogonality $C(A^T) \perp N(A)$: compute the dot product of every basis vector of $C(A^T)$ with every basis vector of $N(A)$ and confirm it equals zero.
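The entire pipeline can be cross-checked symbolically. The sketch below uses SymPy with a stand-in matrix (the exercise supplies its own $A$); the procedure is identical for any matrix.

```python
import sympy as sp

# Stand-in matrix; the workflow (RREF, pivot columns, nullspaces, transposes) is generic.
A = sp.Matrix([[1, 2, 0, 3],
               [2, 4, 1, 7],
               [3, 6, 1, 10]])
m, n = A.shape

R, pivots = A.rref()                      # (a) RREF and pivot-column indices
col_basis  = [A.col(j) for j in pivots]   # (c) pivot columns of the ORIGINAL A
row_basis  = A.rowspace()                 # (d) non-zero rows of the RREF
null_basis = A.nullspace()                # (b) special solutions of Ax = 0
left_null  = A.T.nullspace()              # (e) null space of A^T

r = len(pivots)
assert r + len(null_basis) == n           # (f) rank-nullity for A
assert r + len(left_null) == m            # (f) rank-nullity for A^T

# (g) every row-space basis vector is orthogonal to every null-space basis vector
assert all(sum(ui * vi for ui, vi in zip(u, v)) == 0
           for u in row_basis for v in null_basis)
print("rank =", r, "| dim N(A) =", len(null_basis), "| dim N(A^T) =", len(left_null))
```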
Exercise 4: Span, Independence, and Basis in $\mathbb{R}^n$
Let $v_1$, $v_2$, $v_3$, $v_4$ be the four given vectors.
(a) Form the matrix $V = [\,v_1 \; v_2 \; v_3 \; v_4\,]$ with these vectors as columns and row-reduce. Are $v_1, v_2, v_3, v_4$ linearly independent?
(b) What is $\dim \operatorname{span}\{v_1, v_2, v_3, v_4\}$?
(c) Identify a subset of the given vectors that forms a basis for $\operatorname{span}\{v_1, v_2, v_3, v_4\}$.
(d) Express any dependent vectors as explicit linear combinations of your basis vectors.
(e) Is the given vector $w$ in $\operatorname{span}\{v_1, v_2, v_3, v_4\}$? If yes, find coordinates $c_1, \dots, c_4$ such that $w = c_1 v_1 + c_2 v_2 + c_3 v_3 + c_4 v_4$. If no, explain why.
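Both the independence test and the span-membership test reduce to rank comparisons. The sketch below uses placeholder vectors of my own (substitute the exercise's $v_1, \dots, v_4$ and $w$):

```python
import numpy as np

# Placeholder data; substitute the exercise's vectors.
v1, v2, v3, v4 = (np.array(v, dtype=float) for v in
                  ([1, 0, 2], [0, 1, 1], [1, 1, 3], [2, 1, 5]))
w = np.array([3.0, 1.0, 7.0])

V = np.column_stack([v1, v2, v3, v4])
r = np.linalg.matrix_rank(V)
print("independent?", r == 4, "| dim span =", r)                  # (a), (b)

# (e) w lies in the span iff appending it does not increase the rank
in_span = np.linalg.matrix_rank(np.column_stack([V, w])) == r
print("w in span?", in_span)
if in_span:
    # least squares returns one valid coefficient vector (not unique if the v's are dependent)
    c, *_ = np.linalg.lstsq(V, w, rcond=None)
    print("coordinates:", np.round(c, 3), "| check V @ c =", V @ c)
```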
Exercise 5: Gram-Schmidt and Orthogonal Projection
Work in $\mathbb{R}^3$ with the standard inner product.
(a) Starting from the three given vectors $v_1$, $v_2$, $v_3$, verify that they are linearly independent (compute the determinant of the matrix they form).
(b) Apply Gram-Schmidt to $v_1, v_2, v_3$ to produce an orthonormal basis $q_1, q_2, q_3$ for their span.
(c) Verify your result: check that $\|q_i\| = 1$ for each $i$ and $\langle q_i, q_j \rangle = 0$ for $i \ne j$.
(d) Express the given vector $w$ in your orthonormal basis: find $c_i = \langle w, q_i \rangle$ for each $i$. Verify that $w = c_1 q_1 + c_2 q_2 + c_3 q_3$.
(e) Construct the orthogonal projection matrix onto $S = \operatorname{span}\{q_1, q_2\}$ using $P = QQ^T$, where $Q = [\,q_1 \; q_2\,]$. Verify: (i) $P^T = P$, (ii) $P^2 = P$, (iii) $Pq_i = q_i$ for $i = 1, 2$. Compute $Pw$ and the residual $w - Pw$. Verify that $Pw \in S$ and $(w - Pw) \perp S$.
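The Gram-Schmidt and projection identities are straightforward to verify numerically. The vectors below are placeholders for the given ones:

```python
import numpy as np

# Placeholder vectors; substitute the given v1, v2, v3 and w.
v1, v2, v3 = np.array([1., 1., 0.]), np.array([1., 0., 1.]), np.array([0., 1., 2.])
w = np.array([2., -1., 3.])

def gram_schmidt(vectors):
    qs = []
    for v in vectors:
        u = v - sum(np.dot(v, q) * q for q in qs)   # remove components along earlier q's
        qs.append(u / np.linalg.norm(u))
    return np.column_stack(qs)

Q = gram_schmidt([v1, v2, v3])
assert np.allclose(Q.T @ Q, np.eye(3))              # (c) orthonormality

c = Q.T @ w                                         # (d) coordinates c_i = <w, q_i>
assert np.allclose(Q @ c, w)                        # w is recovered from its coordinates

Q2 = Q[:, :2]                                       # (e) projection onto span{q1, q2}
P = Q2 @ Q2.T
assert np.allclose(P, P.T) and np.allclose(P @ P, P)
resid = w - P @ w
assert np.allclose(Q2.T @ resid, 0)                 # residual is orthogonal to the subspace
print("P @ w =", np.round(P @ w, 4), "| residual =", np.round(resid, 4))
```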
Exercise 6: Subspace Operations and Dimension Formula
(a) In $\mathbb{R}^3$, let $U = \operatorname{span}\{e_1, e_2\}$ (the $xy$-plane) and let $W$ be the second given subspace. Find a basis for $U \cap W$. Is $U + W = \mathbb{R}^3$? Find $\dim(U + W)$ and verify the Grassmann formula: $\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$.
(b) Is the sum direct, i.e. is $\mathbb{R}^3 = U \oplus W$? If not, find a subspace $W'$ such that $\mathbb{R}^3 = U \oplus W'$.
(c) For the second given pair of subspaces $U$ and $W$: find $\dim U$, $\dim W$, $\dim(U \cap W)$, and $\dim(U + W)$. Verify the Grassmann formula.
(d) For $U \cap W$ from part (c): find an explicit basis.
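The dimension bookkeeping and the intersection basis can both be computed mechanically. The sketch below uses placeholder subspaces of $\mathbb{R}^4$; the `null_space` helper is my own, built on the SVD.

```python
import numpy as np

def null_space(M, tol=1e-10):
    # basis (as columns) for {z : Mz = 0}, via the right singular vectors of M
    _, s, Vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return Vt[rank:].T

# Placeholder subspaces of R^4, each given by basis vectors stored as columns.
U = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]]).T          # U = span{e1, e2}
W = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.]]).T          # W = span{e2, e3}

dim_U, dim_W = np.linalg.matrix_rank(U), np.linalg.matrix_rank(W)
dim_sum = np.linalg.matrix_rank(np.hstack([U, W]))          # dim(U + W)
dim_int = dim_U + dim_W - dim_sum                           # Grassmann formula
print("dim U, dim W, dim(U+W), dim(U cap W) =", dim_U, dim_W, dim_sum, dim_int)

# Explicit intersection basis: coefficients (a, b) with U a = W b satisfy [U | -W] z = 0,
# and the intersection vectors are then U a.
Z = null_space(np.hstack([U, -W]))
basis_int = U @ Z[:U.shape[1], :]
print("basis of U cap W (columns):\n", np.round(basis_int, 4))
```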
Exercise 7: Change of Basis and Coordinates
In $\mathbb{R}^2$, let $B = \{b_1, b_2\}$ and $C = \{c_1, c_2\}$ be the two given bases.
(a) Verify that both $B$ and $C$ are bases for $\mathbb{R}^2$ (compute the determinants of the corresponding matrices).
(b) Find the change-of-basis matrix $P_{C \leftarrow B}$: express each $b_j$ as a linear combination of $c_1$ and $c_2$, and use these coefficient vectors as the columns of $P_{C \leftarrow B}$.
(c) For the vector $v$ with the given coordinates $[v]_B$ in basis $B$, compute $[v]_C = P_{C \leftarrow B}\,[v]_B$.
(d) Compute $v$ explicitly in the standard basis. Then verify your answer to (c) by directly expressing $v$ as a linear combination of $c_1$ and $c_2$.
(e) If a linear map has matrix $A$ in the standard basis, find its matrix representation in basis $B$.
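The whole change-of-basis chain is a few lines of NumPy. The bases, coordinate vector, and linear map below are placeholders for the given ones:

```python
import numpy as np

# Placeholder bases and data; substitute the given b's, c's, [v]_B, and A.
b1, b2 = np.array([1., 1.]), np.array([1., -1.])
c1, c2 = np.array([2., 1.]), np.array([1., 1.])
B = np.column_stack([b1, b2])          # maps B-coordinates to standard coordinates
C = np.column_stack([c1, c2])          # maps C-coordinates to standard coordinates
assert np.linalg.det(B) != 0 and np.linalg.det(C) != 0   # (a) both are bases

P_CB = np.linalg.solve(C, B)           # (b) change of basis C <- B (solves C @ P = B)
vB = np.array([3., 2.])                # (c) placeholder coordinates of v in basis B
vC = P_CB @ vB

v = B @ vB                             # (d) v in the standard basis
assert np.allclose(C @ vC, v)          # expressing v in C gives back the same vector

A = np.array([[2., 1.], [0., 3.]])     # (e) placeholder linear map in the standard basis
A_B = np.linalg.inv(B) @ A @ B         # its matrix in basis B
print("P_CB =\n", np.round(P_CB, 4), "\n[v]_C =", np.round(vC, 4), "\nA_B =\n", np.round(A_B, 4))
```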
Exercise 8: AI Application - Subspace Analysis of a Weight Matrix
Let $W$ be the given weight matrix.
(a) Compute $\operatorname{rank}(W)$ and the nullity $\dim N(W)$. Is $W$ full rank?
(b) Find a basis for $C(W)$ and a basis for $N(W)$. Verify that $\operatorname{rank}(W) + \dim N(W)$ equals the number of columns of $W$.
(c) A LoRA adapter has rank $r$ with update $\Delta W = BA$, where $B$ and $A$ are the given factor matrices ($B$ has $r$ columns and $A$ has $r$ rows). Explicitly compute $\Delta W = BA$. What is $\operatorname{rank}(\Delta W)$? What is $C(\Delta W)$?
(d) Consider the updated weight $W' = W + \Delta W$. Without computing $W'$ explicitly, state the upper and lower bounds on $\operatorname{rank}(W')$. Under what condition on the relationship between $C(\Delta W)$ and $N(W^T)$ (i.e., the left null space of $W$) would the rank increase? Under what condition would it stay the same?
(e) Now compute $W' = W + \Delta W$ explicitly. Verify your predictions from (d) by computing $\operatorname{rank}(W')$.
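The rank arithmetic in (c)-(e) is easy to see with a synthetic low-rank weight matrix (a placeholder, not the exercise's $W$): the update raises the rank exactly when its column space reaches outside $C(W)$, i.e. has a component in the left null space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder rank-2 base weight (4 x 6) built from two outer products.
u1, u2, u3 = np.eye(4)[:, 0], np.eye(4)[:, 1], np.eye(4)[:, 2]
W = np.outer(u1, rng.normal(size=6)) + np.outer(u2, rng.normal(size=6))
print("rank(W) =", np.linalg.matrix_rank(W))                         # 2

A = rng.normal(size=(1, 6))                                          # LoRA "A" factor (r = 1)

# Case 1: B inside C(W) -> the update only rewrites existing output directions.
B_in = (u1 + 2 * u2).reshape(4, 1)
print("rank(W + B_in @ A)  =", np.linalg.matrix_rank(W + B_in @ A))  # stays 2

# Case 2: B has a component in the left null space N(W^T) -> the rank grows.
B_out = u3.reshape(4, 1)
print("rank(W + B_out @ A) =", np.linalg.matrix_rank(W + B_out @ A)) # becomes 3

# (d) in general: |rank(W) - rank(dW)| <= rank(W + dW) <= rank(W) + rank(dW)
```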
Exercise 9 (Challenge): The Superposition Geometry
This exercise develops the geometry of the superposition hypothesis.
(a) In $\mathbb{R}^2$ (a 2-dimensional embedding space), suppose we want to store three features as unit vectors with maximum pairwise orthogonality. Show that it is impossible for all three feature vectors to be pairwise orthogonal. (Hint: pairwise orthogonal non-zero vectors are linearly independent, and $\mathbb{R}^2$ cannot contain three linearly independent vectors.)
(b) Instead, place three unit vectors at angles $0^\circ$, $120^\circ$, $240^\circ$ from the positive $x$-axis: $f_1 = (1, 0)$, $f_2 = (-\tfrac{1}{2}, \tfrac{\sqrt{3}}{2})$, $f_3 = (-\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2})$. Compute all pairwise inner products $\langle f_i, f_j \rangle$ for $i \ne j$. What is the interference level?
(c) The "reconstruction loss" for feature $i$ when the features have activations $a_1, a_2, a_3$ is the error made when reading off the $i$-th feature along its feature direction. Specifically, if the residual stream stores $x = \sum_j a_j f_j$ and we estimate $\hat{a}_i = \langle x, f_i \rangle$, show that $\hat{a}_i = a_i + \sum_{j \ne i} a_j \langle f_j, f_i \rangle$. The error is the interference from other features.
(d) Compute the expected interference for the configuration in (b), assuming each feature is independently active with probability $p$. What does the interference approach as $p \to 0$?
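A short simulation of the 120-degree configuration makes the sparsity trade-off concrete. The activation model below (each feature independently on with probability $p$) matches the reading of part (d) given above; the sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

angles = np.deg2rad([0.0, 120.0, 240.0])
F = np.stack([np.cos(angles), np.sin(angles)], axis=1)     # rows are f1, f2, f3
print("pairwise inner products:\n", np.round(F @ F.T, 4))  # off-diagonal entries are -0.5

for p in (0.5, 0.1, 0.01):
    errs = []
    for _ in range(10000):
        a = (rng.random(3) < p).astype(float)    # each feature active with probability p
        x = F.T @ a                              # stored vector: sum_j a_j f_j
        a_hat = F @ x                            # read-off estimates <x, f_i>
        errs.append(np.abs(a_hat - a).mean())    # interference from the other features
    print(f"p = {p:>4}: mean interference = {np.mean(errs):.4f}")
# The interference shrinks toward zero as the features become sparser (p -> 0).
```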
Exercise 10 (Challenge): Krylov Subspaces
Let $A$ be the given symmetric matrix and $b$ the given starting vector.
(a) Compute $Ab$ and $A^2 b$.
(b) Does $\mathcal{K}_3(A, b) = \operatorname{span}\{b, Ab, A^2 b\}$ have dimension 3? (Check whether $b$, $Ab$, and $A^2 b$ are linearly independent.)
(c) Apply the Lanczos process (Gram-Schmidt applied to $\{b, Ab, A^2 b\}$) to produce an orthonormal basis $\{q_1, q_2, q_3\}$ for $\mathcal{K}_3(A, b)$.
(d) In this orthonormal basis, form $Q = [\,q_1 \; q_2 \; q_3\,]$ and compute the matrix $T = Q^T A Q$, with entries $T_{ij} = q_i^T A q_j$. Verify that $T$ is symmetric and (nearly) tridiagonal. The eigenvalues of $T$ approximate the eigenvalues of $A$ - verify this by comparing with the actual eigenvalues of $A$.
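A numerical sketch of the whole exercise, with a placeholder symmetric $A$ and start vector $b$. Here $\mathcal{K}_3$ is all of $\mathbb{R}^3$, so the eigenvalues of $T$ match those of $A$ exactly; for a larger matrix they would only approximate them.

```python
import numpy as np

# Placeholder symmetric matrix and start vector.
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 0., 0.])

K = np.column_stack([b, A @ b, A @ A @ b])             # (a) Krylov vectors b, Ab, A^2 b
print("dim K_3 =", np.linalg.matrix_rank(K))           # (b) 3 when they are independent

Q, _ = np.linalg.qr(K)                                 # (c) Gram-Schmidt via QR
T = Q.T @ A @ Q                                        # (d) tridiagonal in the Lanczos basis
print("T =\n", np.round(T, 4))
print("eigenvalues of T:", np.round(np.sort(np.linalg.eigvalsh(T)), 4))
print("eigenvalues of A:", np.round(np.sort(np.linalg.eigvalsh(A)), 4))
```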
16. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| Residual stream as shared vector space | The residual stream in a transformer is the central shared vector space. Every attention head, MLP layer, and positional encoding adds vectors to this space via the residual connection. All components communicate through this one $d_{\text{model}}$-dimensional space - nothing else is shared. Understanding the subspace decomposition of the residual stream (which subspaces are written by which components, which are read by which) is literally the same as understanding how information flows through a transformer. There is no higher-level description. |
| LoRA rank = subspace dimension | LoRA's entire design is a subspace constraint. The update $\Delta W = BA$ is a rank-$r$ matrix, parameterised by only $r(d_{\text{in}} + d_{\text{out}})$ numbers instead of the full $d_{\text{in}} d_{\text{out}}$. Choosing $r$ is choosing the dimension of the subspace to search in. Too small: the subspace doesn't contain the optimal update direction, and performance suffers. Too large: you are searching a subspace larger than necessary, wasting parameters and potentially overfitting. The right $r$ is the intrinsic dimension of the fine-tuning task - a subspace dimension. |
| Superposition and polysemanticity | The superposition hypothesis says LLMs represent more features than their embedding dimension allows. Since more than $d$ linearly independent directions cannot exist in $\mathbb{R}^d$, features beyond the first $d$ must share dimensions through non-orthogonal superposition. Each neuron becomes polysemantic - it responds to multiple features. This limits interpretability (you cannot read off features from individual dimensions) and causes interference between features. Solving superposition is one of the central unsolved problems in mechanistic interpretability, and the solution must be phrased in the language of subspace geometry. |
| MLA and KV subspace compression | DeepSeek-V3's Multi-head Latent Attention compresses the key/value projections into a low-rank latent subspace of dimension $d_c \ll d_{\text{model}}$. The KV cache stores only the $d_c$-dimensional compressed representation; at inference, it is decompressed back to the full dimension. The latent rank is the architectural bottleneck that enables a substantial reduction in KV cache memory. The subspace dimension is the design variable that trades memory against expressiveness. This is subspace thinking at the architecture level: the architecture is designed around a low-dimensional subspace constraint. |
| Mechanistic interpretability circuits | Every circuit found in mechanistic interpretability is a composition of subspace operations. A "previous token head" reads from a specific subspace of the residual stream (via its $W_Q$ and $W_K$ row spaces) and writes to a specific subspace (via its output projection's column space). An "induction head" communicates with the previous token head through a shared subspace. Superposition of features happens in specific subspaces. The "language" of mechanistic interpretability is entirely the language of subspaces: read, write, rotate, project, in $\mathbb{R}^{d_{\text{model}}}$. |
| Representation collapse prevention | In self-supervised learning, representation collapse means all embeddings converge to a low-dimensional (or 0-dimensional) subspace - the network learns to output the same or similar vectors for all inputs. VICReg, Barlow Twins, and BYOL all prevent collapse by adding losses (or architectural asymmetries) that penalise low-dimensional representations. VICReg's variance loss requires that the covariance matrix of batch embeddings has high trace (high average variance), which prevents the embeddings from collapsing to a subspace. Collapse = subspace dimension far below the embedding dimension; the loss explicitly maximises the effective subspace dimension. |
| Implicit bias and minimum-norm solutions | Gradient descent on overparameterised linear models (and, empirically, on large neural networks) converges to minimum-norm solutions. The minimum-norm solution lies in the row space of the data matrix - it is the unique solution lying in the orthogonal complement of the null space. The implicit bias of gradient descent is a subspace selection: it selects the solution in the row space rather than any other coset representative. This subspace selection is what enables generalisation in overparameterised models. |
| Orthogonality and head diversity | If attention heads write to mutually orthogonal subspaces of the residual stream, they do not interfere with each other - each head has an independent information channel. Head redundancy is equivalent to subspace overlap: if one head's column space is contained in another head's column space, the first head is redundant. Pruning heads that write to subspaces already covered by remaining heads preserves expressiveness while reducing computation. Orthogonality between head subspaces is the precise mathematical criterion for head independence (a minimal numerical check of this criterion appears after this table). |
| Function space universality | The universal approximation theorem says that neural networks with nonlinear activations can approximate any continuous function to arbitrary accuracy. In subspace language: the set of functions representable by a sufficiently large network is dense in the space of continuous functions on a compact domain - it is not a subspace (the network family is non-linear), but its closure is the entire function space. The depth and nonlinearity of the architecture determine which function-space "directions" are accessible. Width and depth determine the "dimensionality" of the accessible function subspace. |
| Second-order optimisation | The Hessian of the training loss is a matrix with most of its spectral mass concentrated in a low-dimensional subspace (the "bulk" subspace where curvature is large). The remaining directions have near-zero curvature (flat directions). K-FAC, Shampoo, and SOAP all identify and exploit this curvature subspace by approximating the inverse Hessian in the curved subspace and using the identity in the flat directions. Natural gradient preconditions updates by the inverse Fisher (a positive semidefinite matrix): it aligns updates with the geometry of the curved subspace and suppresses movement in the flat directions. |
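To make the head-diversity row concrete: the sketch below uses hypothetical per-head output projections (random matrices, not weights from any real model) and measures subspace overlap via the cosines of the principal angles between their column spaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, head_dim = 64, 8                                    # residual width and per-head width (hypothetical)
W1 = rng.normal(size=(d, head_dim))                    # head 1 output projection (random stand-in)
W2 = rng.normal(size=(d, head_dim))                    # head 2: an unrelated random subspace
W3 = W1 @ rng.normal(size=(head_dim, head_dim))        # head 3: writes into head 1's subspace

def overlap(Wa, Wb):
    Qa, _ = np.linalg.qr(Wa)                           # orthonormal basis of the column space
    Qb, _ = np.linalg.qr(Wb)
    # singular values of Qa^T Qb = cosines of the principal angles between the subspaces
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

print("random heads  :", np.round(overlap(W1, W2), 3))   # well below 1: nearly independent channels
print("redundant head:", np.round(overlap(W1, W3), 3))   # all 1.0: identical write subspace
```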
Conceptual Bridge
Vector spaces and subspaces are the geometry underlying all of linear algebra. Every concept - linear independence, span, basis, dimension, rank, orthogonality, projections, eigenvalues - is ultimately a statement about the structure of subspaces. The eight axioms that define a vector space are simple; the richness emerges from the subspace hierarchy they generate.
The abstract framework is what makes the theory universal. The same theorems that govern arrows in a plane govern polynomials, matrices, functions, and probability distributions. Verifying the eight axioms once grants access to a century of results - for free, without reproof. This universality is not just mathematical elegance; it is practical power. When you recognise that the gradient vectors during training form a set of vectors in $\mathbb{R}^n$ and that $\mathbb{R}^n$ is a vector space, you immediately know that all of linear algebra applies: span, independence, subspaces, projections, and dimension all become tools for understanding training dynamics.
For AI in 2026, subspaces are not abstract. They are the architectural primitives of transformers (the residual stream, the attention subspace, the MLP subspace), the design variables of efficient fine-tuning (LoRA rank = subspace dimension), the diagnostic language of interpretability (which subspace does this head write to?), and the theoretical foundation of generalisation (gradient updates live in a low-dimensional subspace; the implicit bias selects the minimum-norm solution in the row space).
WHERE THIS MODULE SITS IN THE CURRICULUM
Sets and Logic -> Functions -> Summation
Proof Techniques
Vectors and Spaces (geometry: what vectors are)
Matrix Operations (computation: how to multiply matrices)
Systems of Equations (solving: Gaussian elimination)
Determinants (volume: the signed volume of a parallelepiped)
Matrix Rank (dimensionality: the rank-nullity connection)
Vector Spaces and Subspaces (structure) <- THIS MODULE
The payoff: everything from earlier modules
is now understood geometrically as a statement
about subspaces. Rank = dimension of col space.
Null space = kernel. Solutions of Ax=b live in
a coset of null(A). Determinant = 0 iff the
column space is a proper subspace of R^n.
Eigenvalues and Decompositions (spectrum of a linear map)
- invariant subspaces of T: revealed through its eigenvalues
- eigenvector: 1D invariant subspace
- spectral theorem: orthogonal direct sum of eigenspaces
- SVD: complete subspace decomposition of any matrix
Probability and Information Theory
- probability distributions as vectors in function space
- KL divergence as a geometry on the simplex
Calculus and Optimisation
- gradient lives in R^n, a vector space
- Hessian's eigenspaces = directions of curvature
- natural gradient uses curved subspace geometry
LLM Mathematics
What comes next: Eigenvalues, Eigenvectors, and Matrix Decompositions.
The next module reveals the internal subspace structure of a linear map - the invariant subspaces that a matrix preserves, exposed through its eigenvalues and eigenvectors. An eigenvector spans a 1-dimensional invariant subspace; the spectral theorem decomposes any symmetric matrix into orthogonal 1-dimensional invariant subspaces; the SVD extends this to the complete subspace structure of any rectangular matrix. The four fundamental subspaces are decomposed even further - into singular subspaces aligned with singular values. All of this is the continuation of the subspace story begun in this module.
<- Back to Linear Algebra Basics | Next: Eigenvalues and Decompositions ->