

Vector Spaces and Subspaces: Part 1: Intuition to 6. Basis and Dimension

1. Intuition

1.1 What Is a Vector Space?

A vector space is a collection of objects - called vectors - together with two operations (addition and scalar multiplication) that behave exactly the way arrows in the plane behave. You can add any two vectors and get another vector in the same collection. You can stretch any vector by any scalar and stay in the collection. And these two operations interact with each other in the exact way you would expect from ordinary arithmetic.

The power of the abstract definition is this: once you verify that a collection satisfies the vector space axioms, every theorem ever proved about vector spaces applies immediately - whether the "vectors" are geometric arrows, polynomials, matrices, functions, probability distributions, or sequences of real numbers. You do not reprove theorems for each new setting; you simply check the axioms once and inherit a century of results for free.

The word "space" is deliberate. A vector space has geometric structure. You can talk about directions, linear combinations, independence, and dimensionality - even when the "vectors" are abstract objects that look nothing like arrows. The polynomial 1+2t+3t21 + 2t + 3t^2 lives in a vector space. So does the 3×33 \times 3 identity matrix. So does the constant function f(x)=7f(x) = 7. In each case, the "geometry" is real and consequential, even if it is not immediately visible.

For AI, this abstraction is not philosophical - it is practical:

VECTOR SPACES THAT APPEAR IN EVERY NEURAL NETWORK


  R^d  - embedding space
     tokens, sentences, images, users are all vectors here
     arithmetic on vectors (king - man + woman ~= queen)
     only makes sense because R^d is a vector space

  R^p  - parameter space
     all weights of a neural network form a single vector in R^p
     gradient descent navigates this space
     LoRA restricts updates to a subspace of R^p

  R^(m x n)  - weight matrix space
     each weight matrix is a "vector" in a higher-dimensional space
     adding matrices (as in LoRA: W + DeltaW) is vector addition
     scaling a matrix by a learning rate is scalar multiplication

  infinity-dim  - function space
     the set of all functions a neural architecture can represent
     approximation theory asks which subspace of all functions
     a given architecture can reach

The abstract definition captures all of these simultaneously. That is the point.

1.2 Why Vector Spaces Are the Language of Deep Learning

Every major architectural concept in modern deep learning can be phrased as a statement about vector spaces or their subspaces. This is not a coincidence - it reflects the fundamental linearity of most operations in a transformer.

Embedding spaces. Tokens, sentences, images, and users are all represented as vectors in $\mathbb{R}^d$. This only makes sense because $\mathbb{R}^d$ is a vector space: you can meaningfully add embedding vectors, scale them, and take linear combinations. The famous word embedding arithmetic $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ is a statement about the linear structure of the embedding space. Cosine similarity - the standard measure of semantic relatedness - is an inner product operation. None of this would be coherent if the embedding space were not a vector space.

Parameter space. All parameters of a neural network - every weight matrix, every bias vector, every LayerNorm scale - can be concatenated into a single vector $\theta \in \mathbb{R}^p$. Gradient descent navigates this $p$-dimensional vector space. The loss landscape $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$ lives over this space. The geometry of the parameter space - its curvature, the directions of steepest descent, the flat directions - shapes every aspect of optimisation. Gradient clipping, momentum, and weight decay are all operations in the vector space $\mathbb{R}^p$.

Residual stream. In transformer architectures, the central computational object is the residual stream: a vector $\mathbf{x} \in \mathbb{R}^d$ that is passed from layer to layer and updated by each component:

$$\mathbf{x} \leftarrow \mathbf{x} + \text{Attention}(\mathbf{x}) + \text{MLP}(\mathbf{x})$$

The update rule makes sense only because $\mathbb{R}^d$ is a vector space - you can add the attention output (a vector in $\mathbb{R}^d$) and the MLP output (also a vector in $\mathbb{R}^d$) to the residual stream. Each layer writes a new vector into the same shared space. All components communicate through this one $d$-dimensional vector space. The subspace structure of $\mathbb{R}^d$ - which directions are "owned" by which components - is what mechanistic interpretability studies.
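A minimal NumPy sketch of that update, with toy random stand-ins for the attention and MLP blocks (the functions and sizes here are illustrative, not a real transformer): the point is simply that each component's output is another vector added into the same shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                       # residual stream width (toy size)

# Hypothetical stand-ins for one block's components: any maps R^d -> R^d.
W_attn = rng.standard_normal((d, d)) / np.sqrt(d)
W_mlp  = rng.standard_normal((d, d)) / np.sqrt(d)

def attention(x):  return W_attn @ x          # toy "attention" output, a vector in R^d
def mlp(x):        return np.tanh(W_mlp @ x)  # toy "MLP" output, also a vector in R^d

x = rng.standard_normal(d)                    # residual stream vector

# The residual update is literally vector addition in R^d:
x = x + attention(x) + mlp(x)
print(x.shape)                                # (16,) -- still one vector in R^d
```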

Function spaces. Neural networks represent functions. The set of all functions representable by a given architecture with bounded parameters is a subset of the infinite-dimensional function space $L^2$. Approximation theory studies which subspaces of $L^2$ can be well-approximated by which architectures. The universal approximation theorem is a statement about density in a function space. Understanding which functions a model can and cannot represent requires understanding subspaces of infinite-dimensional vector spaces.

Gradient space. At each training step, the gradient $\nabla \mathcal{L}(\theta)$ is a vector in the same parameter space $\mathbb{R}^p$. Gradient accumulation (summing gradients across microbatches) is vector addition. Gradient clipping (bounding the norm) is a scaling operation. Momentum (maintaining a weighted sum of past gradients) is a linear combination. All of these rely on the vector space structure of $\mathbb{R}^p$. Empirically, the gradient vectors observed during training lie in a much lower-dimensional subspace of $\mathbb{R}^p$ - a fact that motivates LoRA and related methods.
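A small sketch of these three operations as plain vector arithmetic in $\mathbb{R}^p$, with a random stand-in for the gradient (names such as `clip_norm` and the sizes are illustrative, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000                                        # toy parameter count
micro_grads = [rng.standard_normal(p) for _ in range(4)]

# Gradient accumulation: vector addition in R^p
g = sum(micro_grads)

# Gradient clipping: scalar multiplication (rescale by a single scalar)
clip_norm = 1.0
norm = np.linalg.norm(g)
if norm > clip_norm:
    g = (clip_norm / norm) * g

# Momentum: a linear combination of the past buffer and the new gradient
beta, m = 0.9, np.zeros(p)
m = beta * m + (1 - beta) * g
print(np.linalg.norm(g), np.linalg.norm(m))
```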

Tangent spaces. The loss surface is a manifold embedded in $\mathbb{R}^p$. At each point $\theta$, the tangent space - the set of directions one can move while remaining (to first order) on the manifold - is a vector space. Natural gradient methods use the geometry of these tangent spaces; the Fisher information matrix defines a metric on the parameter manifold that is fundamentally a statement about the inner product structure of the tangent space.

1.3 The Hierarchy of Structure

Vector spaces occupy a specific level in a hierarchy of mathematical structures. Each level adds additional structure on top of the previous, enabling more refined geometric notions:

HIERARCHY OF MATHEMATICAL STRUCTURE


  Set  (no structure)
      
      + addition and scalar multiplication obeying 8 axioms
      
  Vector Space  (linear structure: add, scale)
      - can speak of linear combinations, span, independence, basis
      - can speak of dimension
      
      + norm  ||v||  (a function measuring "length")
      
  Normed Space  (metric structure: can measure distances)
      - can speak of convergence, continuity, open/closed sets
      - can speak of the "size" of a vector
      
      + inner product  <u, v>  (a function measuring "alignment")
      
  Inner Product Space  (angular structure: can measure angles)
      - can speak of orthogonality, projections, cosine similarity
      - can speak of the "angle" between two vectors
      
      + completeness  (every Cauchy sequence converges in the space)
      
  Hilbert Space  (complete inner product space)
      - can speak of infinite sums that converge
      - can speak of Fourier decompositions
      - the correct setting for quantum mechanics and function spaces

  
  Key facts:
  - R^n (Euclidean space) satisfies ALL levels simultaneously.
  - Every finite-dimensional inner product space is a Hilbert space
    (completeness is automatic in finite dimensions).
  - Infinite-dimensional function spaces (L^2, Sobolev spaces)
    require the Hilbert structure to behave well.
  - AI embedding spaces are finite-dimensional: all levels apply.
  - RKHS for kernel methods: infinite-dimensional Hilbert spaces.

The hierarchy matters because each additional level of structure enables new theorems. In a bare vector space, you can talk about linear combinations and dimension. Add a norm: you can measure convergence. Add an inner product: you can define orthogonality and projections. Add completeness: you can take limits of infinite sequences. For AI, finite-dimensional vector spaces with inner products (Hilbert spaces) are the default working environment. Function spaces require infinite-dimensional Hilbert structure.

1.4 The Intuition of a Subspace

A subspace is a vector space living inside a larger vector space - a flat, linear subset that passes through the origin and is closed under the same two operations.

"Flat" and "through the origin" are the two defining geometric properties:

  • A line through the origin in $\mathbb{R}^3$ is a subspace. A line not through the origin is not.
  • A plane through the origin is a subspace. A plane not through the origin is not.
  • A sphere, a circle, a parabola, a curved surface - none are subspaces.
  • The positive quadrant $\{(x,y) : x \geq 0, y \geq 0\}$ is not a subspace (not closed under scalar multiplication by $-1$).

Why "through the origin"? The zero vector must be in any subspace. This follows from closure under scalar multiplication: if v\mathbf{v} is in the subspace, then 0v=00 \cdot \mathbf{v} = \mathbf{0} must also be in the subspace. A subset that doesn't contain 0\mathbf{0} cannot be a subspace.

Why "flat"? Take any two vectors in the subspace. Their sum must also be in the subspace. Multiply either by any scalar - must still be in the subspace. This forces the subset to be "straight". There can be no curves, bends, or boundaries. If the subspace contains a vector v\mathbf{v}, it must contain all scalar multiples αv\alpha\mathbf{v} for every αR\alpha \in \mathbb{R} - the entire line through the origin in the direction of v\mathbf{v}.

SUBSPACES AND NON-SUBSPACES IN R^2 AND R^3


  R^2:                            R^3:

  Subspaces:                      Subspaces:
    The whole plane R^2             All of R^3
    Any line through origin         Any plane through origin
    {0}                             Any line through origin
                                    {0}
  Not subspaces:                  Not subspaces:
    Line not through origin         Plane not through origin
    Half-plane                      Sphere / cylinder / cone
    Unit circle                     Positive octant
    Filled disk                     Affine subspace (offset from origin)
    Upper half-plane

  Test: does it contain 0? is it closed under + and scaling?
  If any answer is NO -> not a subspace.

For AI, the subspace concept is concrete. The column space of a weight matrix is a subspace of the output space. The null space is a subspace of the input space. The set of all weight updates targeted by LoRA is a subspace of the weight matrix space. The set of all probability vectors (probabilities summing to 1) is not a subspace (it does not contain the zero vector and is not closed under vector addition). This distinction matters: arithmetic that is meaningful in a vector space may be meaningless in a non-subspace.

1.5 Subspaces That Appear Constantly in AI

The following subspaces recur throughout deep learning theory, practice, and interpretability. Each is worth understanding geometrically before encountering it formally.

Column space col(W). For a weight matrix $W \in \mathbb{R}^{m \times n}$, the column space is the set of all vectors $W\mathbf{x}$ as $\mathbf{x}$ ranges over $\mathbb{R}^n$. It is the set of all reachable outputs of the linear layer. If $\text{rank}(W) = r < m$, then only an $r$-dimensional subspace of $\mathbb{R}^m$ can be produced by this layer regardless of the input. The remaining $m - r$ dimensions of the output space are always zero from this layer's contribution.

Null space null(W). The null space is the set of all inputs $\mathbf{x}$ that produce zero output: $W\mathbf{x} = \mathbf{0}$. It is the "invisible" subspace - directions in the input that this layer completely ignores. A vector in the null space carries no information through the layer. Its dimension is $n - \text{rank}(W)$. Mechanistically: if a component of the residual stream lies in the null space of all query and key projections of some head, that component is invisible to that head.

Row space row(W). The row space is the span of the rows of $W$, equivalently the column space of $W^\top$. It is the set of input directions that $W$ "listens to". The projection of an input onto the row space determines the output; the projection onto the null space is discarded.
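All three of these subspaces can be read off numerically from the SVD of a weight matrix. A sketch with a deliberately rank-deficient random matrix (sizes and the tolerance are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-2 by construction

U, s, Vt = np.linalg.svd(W)
rank = int(np.sum(s > 1e-10))

col_basis  = U[:, :rank]        # orthonormal basis of col(W)  (subspace of R^m)
row_basis  = Vt[:rank, :].T     # orthonormal basis of row(W)  (subspace of R^n)
null_basis = Vt[rank:, :].T     # orthonormal basis of null(W) (subspace of R^n)

print(rank, null_basis.shape[1])                        # 2 and n - rank = 2 (rank-nullity)

# Any output W @ x lies in col(W); any input direction in null(W) is ignored.
x = rng.standard_normal(n)
y = W @ x
print(np.allclose(col_basis @ (col_basis.T @ y), y))    # True: y is inside col(W)
print(np.allclose(W @ null_basis, 0))                   # True: null directions map to zero
```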

Attention subspace. Each attention head in a transformer has query and key projection matrices $W_Q, W_K \in \mathbb{R}^{d \times d_k}$. The attention scores are computed in the $d_k$-dimensional subspace defined by these projections. The head is completely blind to directions in $\mathbb{R}^d$ orthogonal to the column spaces of $W_Q$ and $W_K$. The value and output projections $W_V, W_O$ similarly define the subspace that the head writes to. Attention is fundamentally a subspace operation: it reads from one subspace and writes to another.

Gradient subspace. The gradient $\nabla \mathcal{L}(\theta_t)$ at step $t$ is a vector in $\mathbb{R}^p$. Over the course of training, these gradient vectors empirically lie in a subspace of $\mathbb{R}^p$ whose dimension is much smaller than $p$. Gur-Ari et al. (2018) showed that the top eigenspace of the gradient covariance matrix - the "gradient subspace" - has effective dimension in the hundreds or thousands, not in the hundreds of millions. This is the empirical foundation for LoRA: if gradients live in a low-dimensional subspace, it may suffice to restrict weight updates to that subspace.

Residual stream subspace. Mechanistic interpretability reveals that different components of a transformer write to different subspaces of the $d$-dimensional residual stream. Each attention head "owns" a subspace (defined by its OV circuit) and communicates with other components through that subspace. The residual stream is a shared message-passing space, and the subspace structure determines which components can communicate with which.

LoRA update subspace. Low-Rank Adaptation restricts the weight update to $\Delta W = B A^\top$ where $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{n \times r}$. This constrains $\Delta W$ to a rank-$r$ subspace of $\mathbb{R}^{m \times n}$: the subspace of matrices with column space contained in $\text{col}(B)$. The rank $r$ is literally the dimension of the subspace being searched. Choosing $r$ is choosing the dimension of the subspace.
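A minimal sketch of that rank constraint (only the low-rank product itself, not the full LoRA recipe with scaling factors or initialisation): every update $\Delta W = BA^\top$ has rank at most $r$, and its columns lie in $\text{col}(B)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4

B = rng.standard_normal((m, r))
A = rng.standard_normal((n, r))
delta_W = B @ A.T                       # the LoRA-style update, an m x n matrix

print(np.linalg.matrix_rank(delta_W))   # at most r = 4

# Every column of delta_W is a linear combination of the columns of B,
# i.e. col(delta_W) is contained in the r-dimensional subspace col(B).
Q, _ = np.linalg.qr(B)                  # orthonormal basis for col(B)
proj = Q @ (Q.T @ delta_W)              # project the columns of delta_W onto col(B)
print(np.allclose(proj, delta_W))       # True: the projection changes nothing
```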

1.6 Historical Timeline

The concept of a vector space did not emerge from a single insight - it crystallised gradually over seven decades as mathematicians in different countries, working on different problems, all needed the same abstract structure:

| Year | Person / Event | Significance |
|---|---|---|
| 1827 | Möbius - barycentric coordinates | Linear combinations with fixed coefficient sum; early subspace thinking embedded in geometry |
| 1844 | Grassmann - Ausdehnungslehre (Theory of Extension) | Abstract $n$-dimensional linear spaces, vector algebra, exterior products; 50 years ahead of its time; largely ignored until the 1880s |
| 1858 | Cayley - matrix algebra | Matrices as the natural representation of linear maps; implicit use of vector space structure in $\mathbb{R}^n$ |
| 1888 | Peano - Calcolo Geometrico | First rigorous axiomatic definition of a vector space (Peano called them "linear systems"); the modern eight axioms are essentially unchanged from his formulation |
| 1900-10 | Hilbert - infinite-dimensional spaces | Spectral theory of operators; sequences as vectors; laid the foundation for quantum mechanics and functional analysis; introduced Hilbert spaces |
| 1920s | Banach - normed complete spaces | Generalised Hilbert spaces to settings without an inner product; functional analysis matures as a discipline |
| 1929 | von Neumann - Hilbert space formalism | Rigorous foundation for quantum mechanics as operators on Hilbert space; spectral theorem in full generality |
| 1930s-60s | Bourbaki - abstract algebra | Vector spaces as modules over fields; categorical perspective; the modern textbook treatment of linear algebra |
| 1935 | Whitney - tangent bundles | A vector space (tangent space) attached to every point of a smooth manifold; modern differential geometry |
| 1950s-90s | Krylov subspace methods | CG, GMRES, Lanczos: iterative solvers that expand a nested sequence of subspaces; subspace geometry as a computational algorithm |
| 1970s-90s | LAPACK / numerical linear algebra | Practical algorithms for computing with subspaces (QR, SVD, Gram-Schmidt); subspace methods become usable at scale |
| 1901 / 1991 | Pearson / Turk-Pentland - PCA | Pearson (1901) defines principal component analysis; Turk and Pentland (1991) apply PCA to face recognition ("eigenfaces"); subspace learning as a machine learning tool |
| 2013 | Mikolov et al. - word2vec | Words as vectors in $\mathbb{R}^d$; semantic arithmetic; gender direction as a 1D subspace; demonstrated that natural language semantics has linear subspace structure |
| 2018 | Gur-Ari et al. - gradient subspace | The "gradient subspace" of neural-network training has dimension $\ll p$; gradient updates live in a low-dimensional subspace of parameter space |
| 2020 | Aghajanyan et al. - intrinsic dimensionality | Fine-tuning gradient updates for GPT-2 lie in a subspace of dimension $\approx 200$ out of $p \approx 1.5 \times 10^8$; formal evidence for low-dimensional fine-tuning |
| 2021 | Hu et al. - LoRA | Explicit restriction of weight updates to a rank-$r$ subspace; subspace constraint as an inductive bias; state-of-the-art parameter-efficient fine-tuning |
| 2022 | Elhage et al. - superposition hypothesis | Features in LLMs are represented as directions (1D subspaces) in the residual stream; features outnumber dimensions, forcing non-orthogonal superposition; subspace geometry is the language of mechanistic interpretability |
| 2024 | DeepSeek-V2/V3 - MLA | Multi-head Latent Attention: KV projections compressed to a rank-$r$ subspace; the KV information is architecturally constrained to a low-dimensional subspace; 5.75x memory reduction |

The arc of this timeline is clear: what began as abstract geometry became, over 180 years, the core language of AI architecture and efficiency.


2. Formal Definitions

Recall: Vectors and Spaces 3 introduced the vector space axioms with concrete geometric intuition ($\mathbb{R}^n$, function spaces, polynomial spaces). This section re-derives them in full axiomatic rigour - the goal here is provability, not intuition. The two sections are complementary: the earlier lesson tells you what the axioms describe; this section tells you what follows from them.

2.1 The Axioms of a Vector Space

A vector space over a field $F$ (we use $F = \mathbb{R}$ throughout unless stated otherwise) is a set $V$ together with two operations:

  • Addition: $+: V \times V \to V$ - takes two vectors, produces a vector
  • Scalar multiplication: $\cdot: \mathbb{R} \times V \to V$ - takes a scalar and a vector, produces a vector

satisfying the following eight axioms for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and all $\alpha, \beta \in \mathbb{R}$:

Axioms for addition:

| # | Name | Statement |
|---|---|---|
| 1 | Commutativity | $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$ |
| 2 | Associativity | $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$ |
| 3 | Additive identity | $\exists\, \mathbf{0} \in V$ such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for all $\mathbf{v}$ |
| 4 | Additive inverse | For each $\mathbf{v} \in V$, $\exists\, (-\mathbf{v}) \in V$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$ |

Axioms for scalar multiplication:

| # | Name | Statement |
|---|---|---|
| 5 | Multiplicative identity | $1 \cdot \mathbf{v} = \mathbf{v}$ |
| 6 | Compatibility | $\alpha(\beta \mathbf{v}) = (\alpha\beta)\mathbf{v}$ |

Distributivity (linking the two operations):

| # | Name | Statement |
|---|---|---|
| 7 | Scalar over vector sum | $\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$ |
| 8 | Vector over scalar sum | $(\alpha + \beta)\mathbf{v} = \alpha\mathbf{v} + \beta\mathbf{v}$ |

These eight axioms are the complete definition. Anything satisfying all eight is a vector space; anything failing even one is not. The axioms are minimal: each is independent of the others; removing any one produces a strictly weaker structure.

Notice what the axioms do and do not say:

  • They say nothing about what the "vectors" look like. Arrows, polynomials, matrices, functions - all are equally valid.
  • They say nothing about angles, lengths, or orthogonality. Those require additional structure (inner product).
  • They say nothing about convergence. That requires a topology (norm or metric).
  • They require closure under both operations: the result of adding two vectors must be back in $V$; scaling a vector must stay in $V$.

THE AXIOMS AS CLOSURE CONDITIONS


  Axioms 1-4: addition is a well-behaved operation on V
              V is an abelian group under +

  Axioms 5-6: scalar multiplication interacts correctly with itself

  Axioms 7-8: scalar multiplication distributes over both kinds of +
              (over vector sums and over scalar sums)

  Together: V behaves like "an R^n that we haven't yet coordinatised"

2.2 Immediate Consequences of the Axioms

The eight axioms imply several useful facts that are not axioms themselves. These are theorems - they must be proved from the axioms - but they follow quickly:

Uniqueness of the zero vector. If $\mathbf{0}$ and $\mathbf{0}'$ both satisfy Axiom 3, then:

$$\mathbf{0} = \mathbf{0} + \mathbf{0}' = \mathbf{0}'$$

The first equality uses $\mathbf{0}'$ as the identity; the second uses $\mathbf{0}$. So the zero vector is unique.

Uniqueness of additive inverses. If $\mathbf{v} + \mathbf{w} = \mathbf{0}$ and $\mathbf{v} + \mathbf{w}' = \mathbf{0}$, then:

$$\mathbf{w} = \mathbf{w} + \mathbf{0} = \mathbf{w} + (\mathbf{v} + \mathbf{w}') = (\mathbf{w} + \mathbf{v}) + \mathbf{w}' = \mathbf{0} + \mathbf{w}' = \mathbf{w}'$$

So the additive inverse $-\mathbf{v}$ is unique for each $\mathbf{v}$.

$0 \cdot \mathbf{v} = \mathbf{0}$ (the zero scalar times any vector = zero vector):

$$0 \cdot \mathbf{v} = (0 + 0) \cdot \mathbf{v} = 0 \cdot \mathbf{v} + 0 \cdot \mathbf{v}$$

Subtract $0 \cdot \mathbf{v}$ from both sides (add the additive inverse of $0 \cdot \mathbf{v}$):

$$\mathbf{0} = 0 \cdot \mathbf{v}$$

$\alpha \cdot \mathbf{0} = \mathbf{0}$ (any scalar times the zero vector = zero vector):

$$\alpha \cdot \mathbf{0} = \alpha(\mathbf{0} + \mathbf{0}) = \alpha \cdot \mathbf{0} + \alpha \cdot \mathbf{0}$$

Subtract $\alpha \cdot \mathbf{0}$: $\mathbf{0} = \alpha \cdot \mathbf{0}$.

$(-1) \cdot \mathbf{v} = -\mathbf{v}$ (multiplying by $-1$ gives the additive inverse):

$$\mathbf{v} + (-1)\mathbf{v} = 1 \cdot \mathbf{v} + (-1)\mathbf{v} = (1 + (-1))\mathbf{v} = 0 \cdot \mathbf{v} = \mathbf{0}$$

So $(-1)\mathbf{v}$ is the additive inverse of $\mathbf{v}$, i.e., $(-1)\mathbf{v} = -\mathbf{v}$.

Cancellation law for scalars. If $\alpha\mathbf{v} = \mathbf{0}$, then either $\alpha = 0$ or $\mathbf{v} = \mathbf{0}$:

Proof: Suppose $\alpha \neq 0$. Then $\alpha$ has a multiplicative inverse $\alpha^{-1}$ in $\mathbb{R}$. Multiply both sides:

$$\mathbf{v} = 1 \cdot \mathbf{v} = (\alpha^{-1}\alpha)\mathbf{v} = \alpha^{-1}(\alpha\mathbf{v}) = \alpha^{-1} \cdot \mathbf{0} = \mathbf{0}$$

This last fact is particularly useful in linear independence arguments: if a scalar multiple of a vector is zero, either the scalar is zero or the vector is zero - no third option.

2.3 Standard Examples of Vector Spaces

Understanding which objects form vector spaces - and why - is as important as knowing the axioms. Here are the key examples, from most concrete to most abstract.


$\mathbb{R}^n$ (Euclidean space)

  • Vectors: $n$-tuples of real numbers $(x_1, x_2, \ldots, x_n)$
  • Addition: $(x_1,\ldots,x_n) + (y_1,\ldots,y_n) = (x_1+y_1,\ldots,x_n+y_n)$ (componentwise)
  • Scalar multiplication: $\alpha(x_1,\ldots,x_n) = (\alpha x_1,\ldots,\alpha x_n)$ (componentwise)
  • Zero vector: $(0, 0, \ldots, 0)$
  • Additive inverse: $-(x_1,\ldots,x_n) = (-x_1,\ldots,-x_n)$
  • Dimension: $n$

This is the prototype. Every finite-dimensional real vector space is isomorphic to $\mathbb{R}^n$ for some $n$. All eight axioms follow directly from the arithmetic of real numbers applied componentwise. The AI embedding space $\mathbb{R}^d$ is exactly this.


$\mathbb{R}^{m \times n}$ (space of $m \times n$ real matrices)

  • Vectors: $m \times n$ matrices $A$ with real entries
  • Addition: entrywise addition $(A + B)_{ij} = A_{ij} + B_{ij}$
  • Scalar multiplication: $(\alpha A)_{ij} = \alpha A_{ij}$
  • Zero: zero matrix $\mathbf{0}_{m \times n}$
  • Dimension: $mn$ (each of the $mn$ entries is an independent coordinate)

This is the vector space of all weight matrices of a fixed shape. Gradient descent in a single layer's weight space is gradient descent in $\mathbb{R}^{m \times n}$. When LoRA writes $W \leftarrow W + BA^\top$, it is performing vector addition in $\mathbb{R}^{m \times n}$. The learning rate times the gradient, $\eta \nabla_W \mathcal{L}$, is a scalar multiplication in $\mathbb{R}^{m \times n}$.


$\mathcal{P}_n$ (polynomials of degree $\leq n$)

  • Vectors: polynomials $p(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_n t^n$
  • Addition: standard polynomial addition: $(p+q)(t) = p(t) + q(t)$
  • Scalar multiplication: $(\alpha p)(t) = \alpha p(t)$
  • Zero: the zero polynomial $p(t) = 0$ (all coefficients zero)
  • Dimension: $n + 1$ (the coefficients $a_0, a_1, \ldots, a_n$ are the $n+1$ coordinates)

The vectors here "look like" arrows in $\mathbb{R}^{n+1}$ (just replace each polynomial with its coefficient vector), but the polynomial structure is richer - you can evaluate $p(t)$ at any point, differentiate, integrate, and compose. The vector space structure is the foundation; the polynomial structure is additional.


$\mathcal{P}$ (all polynomials, any degree)

  • Vectors: polynomials of any (finite) degree
  • Operations: same as $\mathcal{P}_n$
  • Zero: zero polynomial
  • Dimension: $\infty$ (countably infinite; basis $= \{1, t, t^2, t^3, \ldots\}$)

Note that the sum of two polynomials of degree $\leq n$ is again a polynomial of degree $\leq n$, so $\mathcal{P}_n$ is a subspace of $\mathcal{P}$. This is our first example of a subspace inside a larger space.


$C([a, b])$ (continuous functions on $[a,b]$)

  • Vectors: continuous functions $f: [a,b] \to \mathbb{R}$
  • Addition: $(f + g)(t) = f(t) + g(t)$ (pointwise)
  • Scalar multiplication: $(\alpha f)(t) = \alpha f(t)$ (pointwise)
  • Zero: constant zero function $f(t) = 0$
  • Dimension: $\infty$ (uncountably infinite dimensional)

The sum of two continuous functions is continuous; a scalar multiple of a continuous function is continuous. All eight axioms follow from corresponding facts about real numbers applied pointwise. The space $C([a,b])$ is the natural setting for physics-informed neural networks (PINNs), neural ODEs, and any application where the model represents a continuous function.


$L^2([a,b])$ (square-integrable functions)

  • Vectors: (equivalence classes of) functions $f: [a,b] \to \mathbb{R}$ with $\int_a^b f(t)^2 \, dt < \infty$
  • Operations: pointwise, same as $C([a,b])$
  • Inner product: $\langle f, g \rangle = \int_a^b f(t) g(t) \, dt$; induced norm $\|f\| = \sqrt{\langle f, f \rangle}$
  • Dimension: $\infty$; a separable Hilbert space

$L^2$ is the correct setting for Fourier analysis (the Fourier system $\{1, \sin(t), \cos(t), \sin(2t), \ldots\}$ is an orthogonal basis for $L^2([0, 2\pi])$), for quantum mechanics (wavefunctions are unit vectors in $L^2$), and for kernel methods (the RKHS of a translation-invariant kernel is a subspace of $L^2$). Every finite-dimensional vector space with an inner product is a Hilbert space; $L^2$ is the prototypical infinite-dimensional Hilbert space.


$\ell^2$ (square-summable sequences)

  • Vectors: infinite sequences $(x_1, x_2, x_3, \ldots)$ with $\sum_{i=1}^\infty x_i^2 < \infty$
  • Operations: componentwise
  • Inner product: $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^\infty x_i y_i$
  • Dimension: $\infty$; a separable Hilbert space; isomorphic to $L^2([0,1])$

The infinite-dimensional analogue of $\mathbb{R}^n$. Relevant as the theoretical limit of embedding spaces as vocabulary size $|V| \to \infty$.


$\{\mathbf{0}\}$ (trivial vector space)

  • The set containing only the zero vector
  • Addition: $\mathbf{0} + \mathbf{0} = \mathbf{0}$; scalar multiplication: $\alpha \cdot \mathbf{0} = \mathbf{0}$
  • The smallest vector space; dimension 0
  • It is a subspace of every vector space

2.4 Non-Examples of Vector Spaces

Understanding which sets fail to be vector spaces - and precisely which axiom they violate - sharpens intuition and prevents common errors in AI applications.

$\mathbb{R}_{+}^n = \{(x_1,\ldots,x_n) : x_i > 0\}$ (positive reals)

Fails closure under scalar multiplication and additive inverse: $(-1) \cdot (1, 2, 3) = (-1, -2, -3) \notin \mathbb{R}_{+}^n$. Also fails to contain the zero vector. This is why you cannot do linear algebra directly in probability space (where all entries must be positive).

$S^{n-1} = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| = 1\}$ (unit sphere)

Fails closure under addition ($\|\mathbf{u} + \mathbf{v}\| \neq 1$ in general) and scalar multiplication ($\|2\mathbf{u}\| = 2 \neq 1$). Fails to contain the zero vector. Normalised embedding vectors live on the unit sphere - a non-subspace. This is why arithmetic on normalised vectors (like averaging two unit vectors) is not straightforward; the result must be renormalised.

$\Delta^{n-1} = \{\mathbf{p} \in \mathbb{R}^n : p_i \geq 0,\ \sum_i p_i = 1\}$ (probability simplex)

Fails closure under addition: $\mathbf{p} + \mathbf{q}$ has $\sum_i(p_i + q_i) = 2 \neq 1$. Fails to contain the zero vector. Fails scalar multiplication: $2\mathbf{p}$ has $\sum_i 2p_i = 2 \neq 1$. The simplex sits inside the affine hyperplane $\{\mathbf{p} : \sum_i p_i = 1\}$ - a translate of a linear subspace - and is not a vector subspace.

This is a crucial AI pitfall: softmax outputs live in the interior of $\Delta^{|V|}$. Convex combinations (such as averaging two distributions) do stay inside the simplex, but general linear arithmetic does not: scaling, subtraction, and extrapolation all leave it, and averaging probability vectors is not the same as averaging the underlying logits. Interpolation and vector arithmetic that should respect the model's parameterisation are therefore done in logit space (pre-softmax), which is a vector space.
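A small numerical illustration of the difference, with softmax implemented inline (the specific logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits_a = np.array([2.0, 0.0, -1.0])
logits_b = np.array([-1.0, 3.0, 0.0])
p_a, p_b = softmax(logits_a), softmax(logits_b)

# Subtraction leaves the simplex: entries go negative, so it is not a distribution.
diff = p_a - p_b
print(diff.sum(), (diff < 0).any())              # sums to 0 and has negative entries

# Averaging probabilities is a convex combination (still a distribution) ...
avg_prob = 0.5 * (p_a + p_b)
# ... but it is NOT the same distribution as averaging in logit space:
avg_logit = softmax(0.5 * (logits_a + logits_b))
print(avg_prob, avg_logit, np.allclose(avg_prob, avg_logit))   # False
```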

$\mathbb{Z}^n$ (integer vectors)

Closed under addition (integer + integer = integer) but fails closure under scalar multiplication over $\mathbb{R}$: $\pi \cdot (1, 0, \ldots, 0) = (\pi, 0, \ldots, 0) \notin \mathbb{Z}^n$. The integers form a module over $\mathbb{Z}$ (a more general structure), but not a vector space over $\mathbb{R}$.

$\{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{b}\}$ with $\mathbf{b} \neq \mathbf{0}$ (inhomogeneous linear system)

Does not contain the zero vector ($A \cdot \mathbf{0} = \mathbf{0} \neq \mathbf{b}$). Not closed under addition: $A(\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y} = \mathbf{b} + \mathbf{b} = 2\mathbf{b} \neq \mathbf{b}$. This is an affine subspace (a coset of the null space of $A$), not a vector subspace. The general solution to $A\mathbf{x} = \mathbf{b}$ is $\mathbf{x}_p + \text{null}(A)$, where $\mathbf{x}_p$ is any particular solution - a coset structure, not a subspace structure.

SUMMARY: WHY COMMON AI SETS FAIL TO BE VECTOR SPACES


  Softmax outputs (interior of Delta):
       no zero vector    not closed under +    not closed under scaling
      -> Do arithmetic in logit space, not probability space

  Normalised embeddings (unit sphere):
       no zero vector    not closed under +    not closed under scaling
      -> Spherical interpolation (slerp) for moving on the sphere

  Integer weight tensors (quantised models):
       closed under +    not closed under R-scaling
      -> Module over Z; must dequantise before doing real linear algebra

  Solution set Ax = b, b != 0:
       no zero vector    not closed under +
      -> Affine subspace = coset of null(A); add particular solution last

2.5 Vector Spaces over Other Fields

While we work over $\mathbb{R}$ throughout, the definition of a vector space makes sense over any field $F$. The most relevant alternatives for AI:

$\mathbb{C}^n$ (complex vector space). Each scalar is complex; each component is complex. Used in quantum computing, signal processing, and complex-valued Fourier analysis. Dimension $n$ over $\mathbb{C}$, but $2n$ over $\mathbb{R}$ (since each complex number requires 2 real numbers). The inner product becomes sesquilinear: $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\dagger \mathbf{v} = \sum_i \overline{u_i} v_i$ (conjugate of first argument). Complex exponentials $e^{i\omega t}$ are the natural eigenfunctions of differentiation; Fourier analysis over $\mathbb{C}$ is cleaner than over $\mathbb{R}$.

$\mathbb{F}_2^n$ (binary vectors). Field $\mathbb{F}_2 = \{0, 1\}$ with arithmetic modulo 2: $1 + 1 = 0$. Addition = XOR; multiplication = AND. Used in error-correcting codes (Hamming codes, LDPC codes, turbo codes) and cryptography. All of linear algebra carries over: basis, span, null space, rank - all computable over $\mathbb{F}_2$ by the same row reduction algorithm, with arithmetic mod 2.

$\mathbb{F}_p^n$ (vectors over finite fields). For prime $p$, $\mathbb{F}_p = \{0, 1, \ldots, p-1\}$ with arithmetic mod $p$. Used in coding theory, cryptography (elliptic curve cryptography over finite fields), and randomised algorithms. Reed-Solomon codes are linear codes over $\mathbb{F}_p$ for large $p$; the generator matrix spans the code as a subspace of $\mathbb{F}_p^n$.

Practical AI note. Transformers use real vector spaces $\mathbb{R}^d$ (or approximately real, since floating-point arithmetic is finite-precision). Quantum transformers (theoretical) would use $\mathbb{C}^d$. Reliable communication protocols for distributed training use $\mathbb{F}_2$ codes for error correction. GPU hardware operates over approximately $\mathbb{R}$ with format-specific precision (BF16, FP8, INT4).


3. Subspaces

3.1 Definition and the Three Conditions

A subset $W \subseteq V$ is a subspace of $V$ if $W$ is itself a vector space under the same operations inherited from $V$.

Checking all eight axioms would be redundant. The operations themselves are inherited from $V$: the addition of two elements of $W$ uses the same addition formula as in $V$, and similarly for scalar multiplication. Because these operations are already defined and well-behaved in $V$, the axioms involving equalities (commutativity, associativity, distributivity) are automatically satisfied for elements of $W$ - they hold for all elements of $V$, so in particular they hold for elements of $W \subseteq V$.

What can fail is closure: the result of an operation on elements of $W$ might fall outside $W$. And the zero vector might not be in $W$. So only three conditions must be verified:

Subspace Test. A subset $W \subseteq V$ is a subspace of $V$ if and only if:

  1. Non-empty / contains zero: $\mathbf{0} \in W$
  2. Closed under addition: for all $\mathbf{u}, \mathbf{w} \in W$: $\mathbf{u} + \mathbf{w} \in W$
  3. Closed under scalar multiplication: for all $\mathbf{w} \in W$ and $\alpha \in \mathbb{R}$: $\alpha\mathbf{w} \in W$

Equivalently, conditions 2 and 3 can be combined into a single condition:

Combined condition: For all $\mathbf{u}, \mathbf{w} \in W$ and all $\alpha, \beta \in \mathbb{R}$: $\alpha\mathbf{u} + \beta\mathbf{w} \in W$

This single condition says that $W$ is closed under all linear combinations of its elements. It is often the fastest way to verify that something is a subspace: show that every linear combination of elements of $W$ remains in $W$.

Strategy for proving subspace:

  1. Show $W$ is non-empty (usually by showing $\mathbf{0} \in W$ directly)
  2. Take arbitrary $\mathbf{u}, \mathbf{w} \in W$ and arbitrary $\alpha, \beta \in \mathbb{R}$
  3. Compute $\alpha\mathbf{u} + \beta\mathbf{w}$ and verify it satisfies the defining property of $W$

Strategy for disproving subspace:

  • Find a specific counterexample: either $\mathbf{0} \notin W$, or two elements whose sum is not in $W$, or an element whose scalar multiple is not in $W$.

3.2 Why These Three Conditions Suffice

Each condition handles a different potential failure:

Condition 1 (contains zero): Without $\mathbf{0}$, $W$ has no additive identity, so Axiom 3 fails. More fundamentally: if $\mathbf{0} \notin W$, then $W$ cannot be a vector space under the same operations as $V$, because the additive identity is uniquely determined by the operations (we proved this in 2.2).

Note that showing $\mathbf{0} \in W$ also shows $W$ is non-empty, which is required for $W$ to be a set with meaningful structure. In practice, checking $\mathbf{0} \in W$ first is the quickest way to rule out non-subspaces: if the zero vector fails to satisfy the defining property of $W$, we are done.

Condition 2 (closed under addition): Without this, addition is not even a well-defined operation on $W$ - adding two "vectors" in $W$ might produce something outside $W$. If the operation leaves the set, the set cannot be a vector space under that operation.

Condition 3 (closed under scalar multiplication): Without this, scalar multiplication is not well-defined on $W$ - scaling a vector in $W$ might produce something outside $W$. This failure also implies Axiom 4 (additive inverse) fails: since $(-1) \cdot \mathbf{v}$ must be in $W$ for each $\mathbf{v} \in W$, closure under scalar multiplication is required for inverses to exist.

The remaining axioms are inherited automatically. For example, commutativity: for any $\mathbf{u}, \mathbf{v} \in W \subseteq V$, we have $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$ because this equality holds in $V$ for all vectors, and $\mathbf{u}, \mathbf{v}$ are in particular vectors in $V$. The same argument applies to associativity and both distributivity laws. These properties are free.

3.3 Standard Subspace Examples in R

Lines through the origin in $\mathbb{R}^2$ and $\mathbb{R}^3$

$W = \{(t, 2t) : t \in \mathbb{R}\} \subseteq \mathbb{R}^2$ - the line $y = 2x$.

  • Contains $\mathbf{0}$: set $t = 0$ -> $(0, 0) \in W$
  • Closed under $+$: $(t, 2t) + (s, 2s) = (t+s, 2(t+s)) \in W$
  • Closed under $\cdot$: $\alpha(t, 2t) = (\alpha t, 2\alpha t) \in W$

Dimension: 1. Basis: $\{(1, 2)^\top\}$. This is a 1-dimensional subspace of $\mathbb{R}^2$ - a line through the origin.


Planes through the origin in $\mathbb{R}^3$

$W = \{(x, y, z) \in \mathbb{R}^3 : ax + by + cz = 0\}$ - a homogeneous linear equation.

  • Contains $\mathbf{0}$: $a \cdot 0 + b \cdot 0 + c \cdot 0 = 0$
  • Closed under $+$: if $ax + by + cz = 0$ and $ax' + by' + cz' = 0$, then $a(x+x') + b(y+y') + c(z+z') = 0$ (by linearity of the equation)
  • Closed under $\cdot$: if $ax + by + cz = 0$, then $a(\alpha x) + b(\alpha y) + c(\alpha z) = \alpha(ax+by+cz) = 0$

Dimension: 2 (one homogeneous linear constraint on 3 variables leaves 2 free variables, provided $(a, b, c) \neq (0, 0, 0)$).


Null space of a matrix (solution set of a homogeneous system)

$W = \text{null}(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$ for $A \in \mathbb{R}^{m \times n}$.

  • Contains $\mathbf{0}$: $A \cdot \mathbf{0} = \mathbf{0}$
  • Closed under $+$: $A(\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y} = \mathbf{0} + \mathbf{0} = \mathbf{0}$
  • Closed under $\cdot$: $A(\alpha \mathbf{x}) = \alpha A\mathbf{x} = \alpha \cdot \mathbf{0} = \mathbf{0}$

The null space is always a subspace of $\mathbb{R}^n$, regardless of $A$. By the Rank-Nullity Theorem (6.7): $\dim(\text{null}(A)) = n - \text{rank}(A)$.


Column space of a matrix

$W = \text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$ for $A \in \mathbb{R}^{m \times n}$.

  • Contains $\mathbf{0}$: $A \cdot \mathbf{0} = \mathbf{0}$
  • Closed under $+$: $A\mathbf{x} + A\mathbf{y} = A(\mathbf{x} + \mathbf{y}) \in \text{col}(A)$
  • Closed under $\cdot$: $\alpha(A\mathbf{x}) = A(\alpha\mathbf{x}) \in \text{col}(A)$

The column space is always a subspace of $\mathbb{R}^m$. Its dimension equals $\text{rank}(A)$.


Symmetric matrices

$W = \{A \in \mathbb{R}^{n \times n} : A = A^\top\} \subseteq \mathbb{R}^{n \times n}$.

  • Contains $\mathbf{0}$: $\mathbf{0}^\top = \mathbf{0}$
  • Closed under $+$: $(A + B)^\top = A^\top + B^\top = A + B$
  • Closed under $\cdot$: $(\alpha A)^\top = \alpha A^\top = \alpha A$

Dimension: $n(n+1)/2$. A basis consists of all matrices $E_{ii}$ (diagonal) and $E_{ij} + E_{ji}$ (off-diagonal symmetric pairs) for $i < j$. Symmetric matrices appear constantly in AI: covariance matrices, Gram matrices, Hessians, and attention score matrices (in some formulations) are all symmetric.


Traceless matrices

$W = \{A \in \mathbb{R}^{n \times n} : \text{tr}(A) = 0\} \subseteq \mathbb{R}^{n \times n}$.

  • Contains $\mathbf{0}$: $\text{tr}(\mathbf{0}) = 0$
  • Closed under $+$: $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B) = 0$
  • Closed under $\cdot$: $\text{tr}(\alpha A) = \alpha \, \text{tr}(A) = 0$

Dimension: $n^2 - 1$ (one linear constraint on $n^2$ free entries). The analogous construction for vectors is relevant to normalisation in transformers: the centering step of LayerNorm projects activations onto the mean-zero subspace $\{\mathbf{x} \in \mathbb{R}^d : \sum_i x_i = 0\}$, which is defined by a single homogeneous linear constraint in exactly the same way.
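A sketch of that centering step as a projection onto the mean-zero subspace of $\mathbb{R}^d$ (only the centering, not the full LayerNorm with variance scaling and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)

# Centering subtracts the mean; equivalently, project onto {v in R^d : sum(v) = 0}.
ones = np.ones(d)
P = np.eye(d) - np.outer(ones, ones) / d      # projection matrix onto the mean-zero subspace
centered = P @ x

print(np.isclose(centered.sum(), 0.0))        # True: the result lies in the subspace
print(np.allclose(centered, x - x.mean()))    # True: same as subtracting the mean
print(np.allclose(P @ centered, centered))    # True: the subspace is fixed by P (idempotent)
```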


Upper triangular matrices

$W = \{A \in \mathbb{R}^{n \times n} : A_{ij} = 0 \text{ for all } i > j\}$.

The defining condition $A_{ij} = 0$ for $i > j$ is a collection of linear constraints; closure under $+$ and $\cdot$ follows because the zero entries stay zero under linear combinations. Dimension: $n(n+1)/2$. The space of upper triangular matrices is a subspace of $\mathbb{R}^{n \times n}$.


Polynomials vanishing at a point

$W = \{p \in \mathcal{P}_n : p(1) = 0\}$ - polynomials of degree $\leq n$ with $p(1) = 0$.

  • Contains $\mathbf{0}$: the zero polynomial satisfies $0(1) = 0$
  • Closed: if $p(1) = 0$ and $q(1) = 0$, then $(\alpha p + \beta q)(1) = \alpha p(1) + \beta q(1) = 0$

Dimension: $n$ (one linear constraint on $n+1$ coefficients). A basis: $\{t - 1,\ t^2 - 1,\ t^3 - 1,\ \ldots,\ t^n - 1\}$.

3.4 Non-Subspace Examples

For each of the following, we identify the specific condition that fails:

Affine subspace $\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\}$ with $\mathbf{b} \neq \mathbf{0}$.

Condition 1 fails: $A \cdot \mathbf{0} = \mathbf{0} \neq \mathbf{b}$, so $\mathbf{0} \notin W$. Additionally, if $\mathbf{x}, \mathbf{y}$ are both solutions, $A(\mathbf{x} + \mathbf{y}) = 2\mathbf{b} \neq \mathbf{b}$, so condition 2 also fails. This set is a coset of $\text{null}(A)$: the general solution is $\mathbf{x}_p + \text{null}(A)$ for any particular solution $\mathbf{x}_p$.

Unit ball $\{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| \leq 1\}$.

Condition 1 holds: $\mathbf{0} \in B$. Condition 3 fails: take $\mathbf{x}$ with $\|\mathbf{x}\| = 1$ and $\alpha = 2$: then $\|\alpha\mathbf{x}\| = 2 > 1$, so $\alpha\mathbf{x} \notin B$. The unit ball is bounded; a non-trivial subspace is never bounded (it contains all scalar multiples of its vectors, so it extends to infinity in every direction it contains).

First quadrant $\{(x, y) \in \mathbb{R}^2 : x \geq 0,\ y \geq 0\}$.

Conditions 1 and 2 hold: $\mathbf{0} \in W$; the sum of non-negative vectors is non-negative. Condition 3 fails: $(-1) \cdot (1, 1) = (-1, -1) \notin W$. A subspace must be closed under all scalars, including negative ones - which forces it to be symmetric about the origin.

Nonzero vectors $\{\mathbf{v} \in V : \mathbf{v} \neq \mathbf{0}\}$.

Condition 1 fails immediately: $\mathbf{0}$ is excluded by definition. Since the zero vector is always required, this cannot be a subspace.

The probability simplex $\Delta^{|V|-1}$ (softmax outputs).

All three conditions fail: no zero vector; not closed under $+$; not closed under scaling. The AI implication: you cannot scale, subtract, or take general linear combinations of probability distributions and expect to get another valid probability distribution (only convex combinations stay in the simplex). Vector arithmetic must happen in logit space (pre-softmax), where the vector space structure is intact.

QUICK NON-SUBSPACE CHECK


  Given a candidate set W, the fastest refutation is usually:

  1. Check 0 in W - if the zero vector fails the defining property,
     stop immediately. W is not a subspace.

  2. Check negative scalar - find v in W and test -v in W.
     This catches all sets that require positivity or a fixed norm.

  3. Check sum of two "extreme" elements - if W has a boundary
     condition, vectors near the boundary often sum outside W.

3.5 The Two Trivial Subspaces

Every vector space $V$ has exactly two subspaces that are always present, regardless of what $V$ is:

  1. $\{\mathbf{0}\}$ - the trivial subspace, containing only the zero vector. Dimension 0.
  2. $V$ itself - the whole space. Dimension $= \dim(V)$.

Any subspace $W \subseteq V$ satisfies $\{\mathbf{0}\} \subseteq W \subseteq V$.

A proper subspace is any subspace $W \subsetneq V$ (not the whole space). A non-trivial subspace is any subspace $W \neq \{\mathbf{0}\}$ (not just the zero vector). The interesting subspaces are the non-trivial proper ones - the lines, planes, hyperplanes, and higher-dimensional flats that live inside $V$ without being all of $V$.

For AI, interpreting these extremes:

  • A layer that writes only the zero vector to the residual stream corresponds to the trivial subspace $\{\mathbf{0}\}$ - it contributes nothing.
  • A layer with full rank that can write anything to $\mathbb{R}^d$ corresponds to the full space $V$ - it has maximum expressive power.
  • Most layers operate in a proper non-trivial subspace: they can produce outputs in an $r$-dimensional subspace of $\mathbb{R}^d$ where $0 < r < d$.

4. Span and Linear Combinations

4.1 Linear Combinations

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k \in V$ is any vector of the form:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \sum_{i=1}^k \alpha_i \mathbf{v}_i$$

where $\alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R}$ are arbitrary scalars called the coefficients of the linear combination.

A linear combination uses only the two vector space operations: scalar multiplication (to produce each $\alpha_i \mathbf{v}_i$) and addition (to sum them). By the closure axioms, any linear combination of vectors in $V$ produces another vector in $V$. By the closure conditions for subspaces, any linear combination of vectors in a subspace $W$ produces another vector in $W$.

Special cases of linear combinations:

| Coefficients | Name | Meaning |
|---|---|---|
| Any $\alpha_i \in \mathbb{R}$ | Linear combination | General; unrestricted |
| $\sum_i \alpha_i = 1$ | Affine combination | Preserves "position" (stays in affine hull) |
| $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$ | Convex combination | Stays in convex hull (interpolation) |
| $\alpha_i \in \{0, 1\}$ | Subset sum | Used in combinatorics over $\mathbb{F}_2$ |

For AI, the most important special case is the unconstrained linear combination - it defines the span of a set of vectors, which in turn defines the reachable outputs of a linear layer.

4.2 The Span

The span of a set $S = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\} \subseteq V$ is the set of all linear combinations of vectors in $S$:

$$\text{span}(S) = \left\{ \sum_{i=1}^k \alpha_i \mathbf{v}_i \;:\; \alpha_1, \ldots, \alpha_k \in \mathbb{R} \right\}$$

Theorem: $\text{span}(S)$ is always a subspace of $V$.

Proof:

  1. Contains $\mathbf{0}$: set all $\alpha_i = 0$ -> $\sum \alpha_i \mathbf{v}_i = \mathbf{0} \in \text{span}(S)$
  2. Closed under addition: if $\mathbf{u} = \sum \alpha_i \mathbf{v}_i$ and $\mathbf{w} = \sum \beta_i \mathbf{v}_i$, then $\mathbf{u} + \mathbf{w} = \sum (\alpha_i + \beta_i) \mathbf{v}_i \in \text{span}(S)$
  3. Closed under scalar multiplication: if $\mathbf{u} = \sum \alpha_i \mathbf{v}_i$ then $\gamma \mathbf{u} = \sum (\gamma\alpha_i) \mathbf{v}_i \in \text{span}(S)$

Minimality: $\text{span}(S)$ is the smallest subspace containing all vectors in $S$. Any subspace $W$ that contains all vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ must also contain all their linear combinations (by closure), and hence must contain $\text{span}(S)$.

Conventions:

  • $\text{span}(\emptyset) = \{\mathbf{0}\}$ - the span of the empty set is the trivial subspace. (An empty sum equals zero.)
  • The span grows as $S$ grows: if $S \subseteq T$ then $\text{span}(S) \subseteq \text{span}(T)$.
  • Adding a vector already in $\text{span}(S)$ does not increase the span.

4.3 Geometric Pictures of Span

Geometric intuition for span is crucial and carries directly into AI:

GEOMETRIC MEANING OF SPAN


  span{v}  (one non-zero vector in R^n):

  A line through the origin in the direction of v.
  All scalar multiples of v: {alpha * v : alpha in R}.
  Always a 1-dimensional subspace.

  span{u, v}  (two non-parallel vectors in R^n):

  A plane through the origin containing both directions.
  All combinations alpha*u + beta*v: a 2-dimensional subspace.
  (If u and v ARE parallel: same as span{u} - 1-dimensional.)

  span{u, v, w}  (in R^3):

  If u, v, w are coplanar -> a plane (2D subspace).
  If not coplanar -> all of R^3 (3D subspace = whole space).

  span{v_1, ..., v_k}  (in R^n):

  Dimension = rank of the matrix [v_1 | v_2 | ... | v_k].
  The "flat hull" of the vectors, passing through the origin.
  It is the column space of the matrix with columns v_1, ..., v_k.

The key dimension count. Given $k$ vectors in $\mathbb{R}^n$:

  • If $k < n$: the span has dimension at most $k$ (a proper subspace)
  • If $k = n$: the span has dimension $n$ iff the vectors are independent (iff the matrix is invertible)
  • If $k > n$: the span has dimension at most $n$; the vectors must be linearly dependent

4.4 Spanning Sets

A set $S$ spans $V$ (or is a spanning set for $V$) if $\text{span}(S) = V$ - equivalently, if every vector in $V$ can be expressed as a linear combination of vectors in $S$.

Standard examples:

  • $\{\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n\}$ spans $\mathbb{R}^n$ (standard basis, by definition of coordinates).
  • $\{\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_1 + \mathbf{e}_2\}$ spans $\mathbb{R}^2$ - but is redundant; $\mathbf{e}_1 + \mathbf{e}_2 \in \text{span}\{\mathbf{e}_1, \mathbf{e}_2\}$ already.
  • $\{(1,1), (1,-1)\}$ spans $\mathbb{R}^2$: every $(x,y)$ can be written as $\frac{x+y}{2}(1,1) + \frac{x-y}{2}(1,-1)$.
  • $\{1, t, t^2, \ldots, t^n\}$ spans $\mathcal{P}_n$ (every polynomial is a linear combination of monomials).
  • The columns of a matrix $A \in \mathbb{R}^{m \times n}$ span $\text{col}(A) \subseteq \mathbb{R}^m$; they span all of $\mathbb{R}^m$ iff $A$ has full row rank ($\text{rank}(A) = m$).

AI application. The columns of a weight matrix $W \in \mathbb{R}^{m \times n}$ span $\text{col}(W) \subseteq \mathbb{R}^m$. If $\text{rank}(W) = r < m$, then only an $r$-dimensional subspace of $\mathbb{R}^m$ is reachable: regardless of the input, the layer's output can never leave the $r$-dimensional column space. Outputs in the orthogonal complement $\text{null}(W^\top) = \text{col}(W)^\perp$ are permanently inaccessible from this layer. They must come from other components (residual connections, other layers).

A spanning set need not be a basis - it may contain redundant vectors. The minimal spanning sets are the bases (Section 6).

4.5 Span in Terms of Matrices

Given vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k \in \mathbb{R}^m$, form the matrix:

$$V = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{m \times k}$$

Then:

$$\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \{V\boldsymbol{\alpha} : \boldsymbol{\alpha} \in \mathbb{R}^k\} = \text{col}(V)$$

This connects span to:

  • Column space: the span of the vectors = the column space of the matrix they form
  • Rank: $\dim(\text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\}) = \text{rank}(V)$
  • Linear system solvability: $\mathbf{b} \in \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\}$ if and only if the system $V\boldsymbol{\alpha} = \mathbf{b}$ is consistent, which by the Rouché-Capelli theorem holds if and only if $\text{rank}(V) = \text{rank}([V \mid \mathbf{b}])$.

Algorithm for membership testing. To check whether $\mathbf{b} \in \text{span}\{\mathbf{v}_1,\ldots,\mathbf{v}_k\}$:

  1. Form the augmented matrix $[V \mid \mathbf{b}]$
  2. Row reduce to RREF
  3. If no row of the form $[0\ 0\ \cdots\ 0 \mid *]$ with $* \neq 0$ appears: $\mathbf{b} \in \text{span}$
  4. If such a row appears: $\mathbf{b} \notin \text{span}$

If $\mathbf{b} \in \text{span}$, the RREF also yields the coordinates $\boldsymbol{\alpha}$ expressing $\mathbf{b}$ as a linear combination.
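In floating point, the same test is usually phrased in terms of rank rather than RREF: $\mathbf{b}$ is in the span exactly when appending it as an extra column does not increase the rank. A sketch (vectors and tolerance chosen for illustration):

```python
import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, -1.0])
V = np.column_stack([v1, v2])

def in_span(V, b, tol=1e-10):
    """b is in span of the columns of V iff rank([V | b]) == rank(V)."""
    return np.linalg.matrix_rank(np.column_stack([V, b]), tol) == np.linalg.matrix_rank(V, tol)

b_inside  = 3 * v1 - 2 * v2             # a linear combination, so inside the span
b_outside = np.array([0.0, 0.0, 1.0])   # not reachable from v1 and v2

print(in_span(V, b_inside))    # True
print(in_span(V, b_outside))   # False

# If b is in the span, least squares recovers the coefficients alpha with V @ alpha = b.
alpha, *_ = np.linalg.lstsq(V, b_inside, rcond=None)
print(alpha)                   # approximately [3, -2]
```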


5. Linear Independence

5.1 Definition of Linear Independence

Vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ are linearly independent if the only linear combination that equals the zero vector is the trivial one - all coefficients zero:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \implies \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0$$

Equivalently: no vector in the set is a linear combination of the others; each vector contributes a genuinely new direction that cannot be reached from the others.

The vectors are linearly dependent if the negation holds: there exist coefficients $\alpha_1, \ldots, \alpha_k$, not all zero, such that $\sum_i \alpha_i \mathbf{v}_i = \mathbf{0}$. In this case, at least one vector is redundant - it can be expressed as a linear combination of the others.

The contrast is worth making explicit:

INDEPENDENT vs DEPENDENT


  Independent: each vector adds a new direction.
               The set is "lean" - nothing is redundant.
               Remove any one vector: the span strictly shrinks.
               dim(span) = k  (maximum possible given k vectors)

  Dependent:   at least one vector is a combination of the others.
               The set contains redundancy.
               Remove the dependent vector: span unchanged.
               dim(span) < k  (less than the number of vectors)

Why it matters. Linear independence is what makes a basis a basis. Without it, some vectors in a spanning set are superfluous - they do not add to the span. A basis is the most efficient spanning set: independent and spanning.

5.2 The Dependence Relation

If $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly dependent, there exist coefficients $\alpha_1, \ldots, \alpha_k$ (not all zero) with $\sum_{i=1}^k \alpha_i \mathbf{v}_i = \mathbf{0}$. Let $j$ be any index with $\alpha_j \neq 0$. Then:

$$\mathbf{v}_j = -\frac{1}{\alpha_j} \sum_{i \neq j} \alpha_i \mathbf{v}_i = \sum_{i \neq j} \left(-\frac{\alpha_i}{\alpha_j}\right) \mathbf{v}_i$$

So $\mathbf{v}_j$ is a linear combination of the other vectors. Consequently, removing $\mathbf{v}_j$ from the set does not reduce the span:

$$\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \text{span}\{\mathbf{v}_1, \ldots, \hat{\mathbf{v}}_j, \ldots, \mathbf{v}_k\}$$

where $\hat{\mathbf{v}}_j$ means $\mathbf{v}_j$ is omitted. This is the operational meaning of dependence: you can identify and remove dependent vectors without losing span. Row reduction does exactly this.

5.3 Checking Linear Independence

Matrix method (universal).

Form the matrix A = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{m \times k} and row reduce to RREF.

  • Linearly independent iff every column of A has a pivot iff \text{rank}(A) = k iff \text{null}(A) = \{\mathbf{0}\}
  • Linearly dependent iff some column has no pivot iff \text{rank}(A) < k iff \text{null}(A) \neq \{\mathbf{0}\}

When dependent, the RREF reveals which vectors are dependent and expresses them in terms of the pivot columns.
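
As a quick illustration, the sketch below (assuming NumPy and SymPy are available; the example vectors are made up) runs the rank test and then uses the RREF pivot columns to see which vector is dependent.

```python
import numpy as np
from sympy import Matrix

# Columns: v3 = v1 + v2, so the set is dependent.
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [2, 1, 3]])

k = A.shape[1]
print(np.linalg.matrix_rank(A) == k)   # False: rank 2 < 3 columns

# The RREF exposes which columns carry pivots (independent directions)
# and which are combinations of earlier pivot columns.
_, pivot_cols = Matrix(A).rref()
print(pivot_cols)                       # (0, 1): column 2 has no pivot, so v3 is dependent
```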

Determinant test (square matrices only, k = n).

\mathbf{v}_1, \ldots, \mathbf{v}_n \in \mathbb{R}^n are linearly independent iff \det([\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_n]) \neq 0.

SVD / singular value test (numerical).

\sigma_{\min}(A) > \varepsilon for some tolerance \varepsilon iff the columns are linearly independent to numerical precision \varepsilon. The smallest singular value \sigma_{\min} measures "how far from dependent" the columns are. When \sigma_{\min} \approx 0, the columns are nearly dependent (nearly lying in a lower-dimensional subspace).

Gram matrix test.

The Gram matrix is G = A^\top A \in \mathbb{R}^{k \times k}, with entries G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle. The columns \mathbf{v}_1, \ldots, \mathbf{v}_k are linearly independent iff G is positive definite (all eigenvalues > 0) iff \det(G) > 0.
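
The two numerical tests can be run side by side. The sketch below is illustrative only: the matrices are random, and the interpretation of "clearly positive" versus "approximately zero" is informal.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                              # generic columns: independent
B = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1]])   # third column = first + second

for name, M in [("A", A), ("B", B)]:
    sigma_min = np.linalg.svd(M, compute_uv=False)[-1]  # smallest singular value
    gram_det = np.linalg.det(M.T @ M)                    # determinant of the Gram matrix
    print(name, f"sigma_min = {sigma_min:.2e}", f"det(G) = {gram_det:.2e}")
# A: both quantities are clearly positive -> independent columns
# B: both are ~0 (up to floating point) -> numerically dependent columns
```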

5.4 Key Properties

Any set containing \mathbf{0} is linearly dependent.

1 \cdot \mathbf{0} + 0 \cdot \mathbf{v}_2 + \cdots + 0 \cdot \mathbf{v}_k = \mathbf{0} is a non-trivial combination (the coefficient of \mathbf{0} is non-zero), so any set that contains \mathbf{0} is dependent. This is consistent with the intuition behind the subspace requirement: \mathbf{0} carries no directional information.

A single non-zero vector is always linearly independent.

\alpha \mathbf{v} = \mathbf{0} with \mathbf{v} \neq \mathbf{0} forces \alpha = 0 (by the cancellation law proved in 2.2). So \{\mathbf{v}\} is independent whenever \mathbf{v} \neq \mathbf{0}.

Two vectors are linearly dependent iff one is a scalar multiple of the other.

Suppose \alpha \mathbf{u} + \beta \mathbf{v} = \mathbf{0} with (\alpha, \beta) \neq (0,0). If \beta \neq 0: \mathbf{v} = -(\alpha/\beta)\mathbf{u}. If \beta = 0: \alpha \neq 0, so \mathbf{u} = \mathbf{0}, which is itself the scalar multiple 0 \cdot \mathbf{v}. So dependence of two vectors means one is a scalar multiple of the other - they point in the same (or opposite) direction.

More vectors than dimensions forces dependence.

If k > n for vectors in \mathbb{R}^n, they are always linearly dependent. The matrix A = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{n \times k} has rank at most n < k; by rank-nullity, \text{null}(A) is non-trivial, giving a non-trivial dependence relation.

This has a profound implication: in \mathbb{R}^d, you cannot have more than d linearly independent vectors. In a d-dimensional embedding space, there are at most d independent directions. Any collection of more than d feature vectors must be linearly dependent.

Subsets of independent sets are independent.

If \{\mathbf{v}_1, \ldots, \mathbf{v}_k\} is linearly independent, so is every subset. (Removing vectors from an independent set cannot introduce a new dependence - the remaining vectors were already independent as part of the larger set.)

Supersets of dependent sets are dependent.

If \{\mathbf{v}_1, \ldots, \mathbf{v}_k\} is linearly dependent, so is every superset. (The same non-trivial dependence relation, with zero coefficients for the added vectors, still witnesses dependence.)

5.5 Linear Independence in AI Contexts

Attention heads. If the H attention heads write to linearly independent subspaces of \mathbb{R}^d (their OV circuits have mutually orthogonal column spaces), then each head provides genuinely independent information. Dependent heads write to overlapping subspaces; they interfere and are candidates for pruning. Head redundancy is exactly linear dependence among the subspaces they write to.

The superposition hypothesis. Elhage et al. (2022) observe that large language models appear to represent more features than the embedding dimension d allows. Since more than d linearly independent directions cannot exist in \mathbb{R}^d, the features must be stored along linearly dependent (non-orthogonal) directions. This is the superposition hypothesis: features are packed into the embedding space at densities exceeding the dimension, using non-orthogonal directions. Each feature "leaks" into the directions of nearby features - a phenomenon called polysemanticity. Linear independence is the threshold: up to d features can be orthogonal and non-interfering; beyond d, interference is inevitable.

LoRA rank selection. In LoRA, the update is \Delta W = BA^\top with B \in \mathbb{R}^{m \times r}, A \in \mathbb{R}^{n \times r}. The rank of \Delta W is at most the number of linearly independent columns of B (and likewise of A). If the columns of B are linearly dependent, the effective rank is less than r; some of the r rank budget is wasted. Choosing r large helps only if the columns of B (and A) actually end up spanning r-dimensional spaces after training. A numerical check of the effective rank is sketched below.
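
A rough way to see this numerically is to count the singular values of \Delta W above a threshold. The sketch below is illustrative: the dimensions, the threshold, and the helper name effective_rank are arbitrary choices, not part of any LoRA library.

```python
import numpy as np

def effective_rank(delta_w, rel_tol=1e-6):
    """Number of singular values above rel_tol * (largest singular value)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int((s > rel_tol * s[0]).sum())

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8
B = rng.standard_normal((m, r))
A = rng.standard_normal((n, r))
print(effective_rank(B @ A.T))      # 8: the full rank budget is used

B_dep = B.copy()
B_dep[:, -1] = B_dep[:, 0]          # make two columns of B identical (dependent)
print(effective_rank(B_dep @ A.T))  # 7: one unit of the rank budget is wasted
```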

Gradient directions across batches. If gradients from different batches are linearly independent, each batch provides new information about the loss landscape. If the gradients are nearly parallel (nearly dependent), the batches are redundant - they are teaching the model the same thing. Gradient diversity, measured by the rank or effective dimension of the gradient matrix [\nabla\mathcal{L}_1 \mid \cdots \mid \nabla\mathcal{L}_B], quantifies how much information each batch contributes.
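
One illustrative way to quantify this: stack per-batch gradients as columns and count how many singular values are non-negligible. The helper name and the 10% threshold below are arbitrary illustrative choices, not a canonical definition of gradient diversity.

```python
import numpy as np

def gradient_diversity(grads, rel_tol=0.1):
    """Count directions whose singular value is at least rel_tol of the largest.

    grads: (num_params, num_batches), one per-batch gradient per column.
    """
    s = np.linalg.svd(grads, compute_uv=False)
    return int((s >= rel_tol * s[0]).sum())

rng = np.random.default_rng(1)
g = rng.standard_normal((1000, 1))                      # one shared direction
diverse = rng.standard_normal((1000, 8))                # unrelated batch gradients
redundant = g + 0.01 * rng.standard_normal((1000, 8))   # nearly parallel gradients

print(gradient_diversity(diverse))    # 8: each batch contributes a new direction
print(gradient_diversity(redundant))  # 1: the batches teach the model the same thing
```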

Feature directions in representation learning. Each probing direction in an embedding space is a vector in \mathbb{R}^d. For k concepts to be probed without interference, their associated directions should be linearly independent (ideally orthogonal). The maximum number of perfectly non-interfering linear features is d. Beyond that, interference is structural - a consequence of linear independence breaking down.


6. Basis and Dimension

6.1 Definition of a Basis

A basis for a vector space V is a set \mathcal{B} = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n\} satisfying two conditions simultaneously:

  1. Linearly independent: the vectors \mathbf{b}_i are linearly independent
  2. Spans V: every vector in V is a linear combination of the \mathbf{b}_i

A basis is the most efficient possible spanning set: it spans V with no redundancy. Equivalently:

  • A basis is a minimal spanning set: remove any single vector and the set no longer spans V
  • A basis is a maximal independent set: add any vector from V and the set becomes linearly dependent

These three characterisations - (independent + spanning), (minimal spanning), (maximal independent) - are all equivalent. Each one defines the same objects.

THE ROLE OF A BASIS


  A basis gives V a coordinate system.

  Once you fix a basis {b_1, ..., b_n} for V:
  - Every vector v in V corresponds to a unique n-tuple (alpha_1, ..., alpha_n)
  - The n-tuple is the vector's coordinates in this basis
  - V becomes "identified" with R^n via this correspondence

  Different bases = different coordinate systems for the same space.
  The space itself is unchanged; only the coordinatisation differs.

  In AI: the "basis" of the residual stream is not canonical.
  Attention heads and MLP layers operate in their own "preferred"
  bases, related to the standard basis by learned rotations.

6.2 The Unique Representation Theorem

Theorem. If \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} is a basis for V, then every vector \mathbf{v} \in V has a unique representation:

\mathbf{v} = \alpha_1 \mathbf{b}_1 + \alpha_2 \mathbf{b}_2 + \cdots + \alpha_n \mathbf{b}_n

The scalars (\alpha_1, \ldots, \alpha_n) are called the coordinates (or components) of \mathbf{v} with respect to the basis \mathcal{B}, written [\mathbf{v}]_{\mathcal{B}} = (\alpha_1, \ldots, \alpha_n).

Proof of uniqueness. Suppose \mathbf{v} = \sum_i \alpha_i \mathbf{b}_i and \mathbf{v} = \sum_i \beta_i \mathbf{b}_i. Subtract:

\mathbf{0} = \mathbf{v} - \mathbf{v} = \sum_{i=1}^n (\alpha_i - \beta_i) \mathbf{b}_i

By linear independence of the basis vectors, all coefficients are zero: \alpha_i - \beta_i = 0, so \alpha_i = \beta_i for all i.

Why uniqueness matters. Without uniqueness, coordinates would be ambiguous - the same vector could have multiple different "addresses" in the same coordinate system. The combination of independence (ensures uniqueness) and spanning (ensures existence) is precisely what makes a basis the right tool for establishing a coordinate system.

The isomorphism. The unique representation theorem says that every n-dimensional vector space V is isomorphic to \mathbb{R}^n: the map \mathbf{v} \mapsto [\mathbf{v}]_{\mathcal{B}} \in \mathbb{R}^n is a bijection that preserves both operations (addition and scalar multiplication). This is why working in coordinates is always valid: the coordinate map translates the abstract vector space structure faithfully into \mathbb{R}^n.
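
Concretely, in \mathbb{R}^n the coordinates of \mathbf{v} in a basis \mathcal{B} are found by solving B\boldsymbol{\alpha} = \mathbf{v}, where the columns of B are the basis vectors. A minimal NumPy example with a made-up basis:

```python
import numpy as np

# Coordinates of v in the basis {b1, b2}: solve B @ alpha = v,
# where the columns of B are the basis vectors.
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])       # basis b1 = (1, 0), b2 = (1, 1)
v = np.array([3.0, 2.0])

alpha = np.linalg.solve(B, v)    # unique solution because B is invertible (a basis!)
print(alpha)                     # [1. 2.]: v = 1*b1 + 2*b2
print(B @ alpha)                 # reconstructs v exactly
```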

6.3 Dimension

Theorem. All bases for a given vector space V have the same number of vectors. This number is the dimension of V:

\dim(V) = \text{number of vectors in any basis of } V

Proof sketch. Suppose \mathcal{B}_1 = \{\mathbf{b}_1, \ldots, \mathbf{b}_m\} and \mathcal{B}_2 = \{\mathbf{c}_1, \ldots, \mathbf{c}_n\} are both bases for V. Since \mathcal{B}_1 spans V and \mathcal{B}_2 is independent, the Steinitz Exchange Lemma (or the Replacement Theorem) gives n \leq m. By symmetry, m \leq n. Hence m = n.

Dimension is the fundamental measure of the "size" of a vector space - the number of independent directions it contains. It is an intrinsic property of the space, independent of any particular basis.

Dimensions of standard spaces:

  Space                                        Dimension     Basis
  R^n                                          n             {e_1, ..., e_n} (standard)
  R^{m x n}                                    mn            {E_ij : 1 <= i <= m, 1 <= j <= n}
  P_n (polynomials of degree <= n)             n + 1         {1, t, t^2, ..., t^n}
  Sym_n (symmetric n x n)                      n(n+1)/2      {E_ii} ∪ {E_ij + E_ji : i < j}
  Skew_n (skew-symmetric n x n)                n(n-1)/2      {E_ij - E_ji : i < j}
  Diag_n (diagonal n x n)                      n             {E_ii}
  {0}                                          0             ∅ (the empty set)
  C([a,b]), L^2([a,b]), P (all polynomials)    infinite      no finite basis exists

Dimension inequalities. For a subspace W \subseteq V:

  • \dim(W) \leq \dim(V)
  • \dim(W) = \dim(V) and W \subseteq V together imply W = V (in finite dimensions)

This last point is crucial: if you find a subspace with the same dimension as the whole space, it must equal the whole space. In particular, a subspace of \mathbb{R}^n with dimension n is all of \mathbb{R}^n.

6.4 Standard Bases

Standard basis for \mathbb{R}^n:

\mathbf{e}_i = (0, \ldots, 0, \underbrace{1}_{i\text{-th}}, 0, \ldots, 0)

The coordinates of any vector \mathbf{v} = (v_1, \ldots, v_n) in the standard basis are simply its components: [\mathbf{v}]_{\text{std}} = (v_1, \ldots, v_n). This is why the standard basis is the "natural" choice for \mathbb{R}^n - coordinates and components coincide.

Standard basis for \mathbb{R}^{m \times n}:

E_{ij} is the matrix with a 1 in position (i,j) and 0 elsewhere. There are mn such matrices, and they form a basis for \mathbb{R}^{m \times n}. The coordinate of a matrix A in this basis is simply A_{ij}.

Standard basis for \mathcal{P}_n:

\{1, t, t^2, \ldots, t^n\} - the monomials. The coordinates of p(t) = a_0 + a_1 t + \cdots + a_n t^n are its coefficients (a_0, a_1, \ldots, a_n).

Fourier basis for L^2([0, 2\pi]):

\left\{\frac{1}{\sqrt{2\pi}},\ \frac{\cos(t)}{\sqrt{\pi}},\ \frac{\sin(t)}{\sqrt{\pi}},\ \frac{\cos(2t)}{\sqrt{\pi}},\ \frac{\sin(2t)}{\sqrt{\pi}},\ \ldots\right\}

An orthonormal basis. The coordinates of f in this basis are the Fourier coefficients. This is the infinite-dimensional analogue of an orthonormal basis in \mathbb{R}^n.

Eigenbasis. For a linear map T: V \to V with n linearly independent eigenvectors \{\mathbf{q}_1, \ldots, \mathbf{q}_n\}, this set forms a basis called the eigenbasis of T. In the eigenbasis, T acts diagonally: T\mathbf{q}_i = \lambda_i \mathbf{q}_i. This diagonalises T, making its action transparent. PCA uses the eigenbasis of the covariance matrix; the action of the covariance matrix in its eigenbasis is pure scaling by eigenvalues (variances).

6.5 Change of Basis

The same vector can be described by different coordinate tuples in different bases. Understanding how coordinates transform under a change of basis is essential for working in multiple representations simultaneously - which is exactly what happens in a transformer, where the residual stream has a "standard" basis but each component may operate in a different preferred basis.

Setup. Let \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} and \mathcal{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_n\} be two bases for V. The change-of-basis matrix P_{\mathcal{B} \to \mathcal{C}} is the n \times n matrix whose j-th column is the coordinate representation of \mathbf{b}_j in basis \mathcal{C}:

P_{\mathcal{B} \to \mathcal{C}} = \big[[\mathbf{b}_1]_{\mathcal{C}} \mid [\mathbf{b}_2]_{\mathcal{C}} \mid \cdots \mid [\mathbf{b}_n]_{\mathcal{C}}\big]

Coordinate transformation. For any \mathbf{v} \in V:

[\mathbf{v}]_{\mathcal{C}} = P_{\mathcal{B} \to \mathcal{C}}\, [\mathbf{v}]_{\mathcal{B}}

Properties:

  • P_{\mathcal{B} \to \mathcal{C}} is always invertible (since both \mathcal{B} and \mathcal{C} are bases)
  • P_{\mathcal{C} \to \mathcal{B}} = P_{\mathcal{B} \to \mathcal{C}}^{-1} - the inverse change-of-basis goes the other way
  • Composition: P_{\mathcal{A} \to \mathcal{C}} = P_{\mathcal{B} \to \mathcal{C}}\, P_{\mathcal{A} \to \mathcal{B}}

Linear maps under change of basis. If a linear map T: V \to V has matrix M in basis \mathcal{B}, its matrix in basis \mathcal{C} is:

[T]_{\mathcal{C}} = P_{\mathcal{B} \to \mathcal{C}}\, M\, P_{\mathcal{B} \to \mathcal{C}}^{-1}

This is a similarity transformation: M and P M P^{-1} represent the same linear map in different bases. They have the same eigenvalues, rank, trace, and determinant - these are basis-independent properties of the map.
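
A small NumPy sketch of both facts, with two made-up bases for \mathbb{R}^2:

```python
import numpy as np

# Two bases for R^2, stored as columns (in standard-basis coordinates).
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])
C = np.array([[2.0, 0.0],
              [1.0, 1.0]])

# Change-of-basis matrix from B-coordinates to C-coordinates:
# column j is [b_j]_C, obtained by solving C x = b_j.
P = np.linalg.solve(C, B)

v_B = np.array([1.0, 2.0])           # a vector expressed in basis B
v_std = B @ v_B                      # its standard-basis representation
v_C = P @ v_B                        # the same vector in basis C
print(np.allclose(C @ v_C, v_std))   # True: both describe the same vector

# A linear map in basis B, and its similarity-transformed matrix in basis C
M = np.array([[2.0, 1.0],
              [0.0, 3.0]])
M_C = P @ M @ np.linalg.inv(P)
print(np.linalg.eigvals(M), np.linalg.eigvals(M_C))  # same eigenvalues {2, 3}
```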

AI application: the residual stream basis. In a transformer, the residual stream has dimension d and technically lives in \mathbb{R}^d with the standard basis. But each attention head's OV circuit is parameterised by W_V \in \mathbb{R}^{d \times d_k} and W_O \in \mathbb{R}^{d_k \times d}: the head reads from one d_k-dimensional subspace and writes to another. The "natural basis" for head h is different from the standard basis of \mathbb{R}^d. Change-of-basis matrices connect these representations. When we say a head "rotates" the residual stream, we mean it applies a change-of-basis transformation to the component it reads.

6.6 Constructing a Basis

From a spanning set (row reduction).

Given a spanning set \{\mathbf{v}_1, \ldots, \mathbf{v}_k\} for some subspace W:

  1. Form the matrix A = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k]
  2. Row reduce A to RREF
  3. The pivot columns of A (of the original A, not of the RREF) form a basis for \text{col}(A) = W

The non-pivot columns correspond to dependent vectors that can be removed without reducing the span.
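
A minimal SymPy/NumPy sketch of this procedure; the spanning set is made up, with a deliberately redundant third vector:

```python
import numpy as np
from sympy import Matrix

# Spanning set for a plane in R^3; the third column is v1 + v2, hence redundant.
A = np.array([[1, 0, 1],
              [1, 1, 2],
              [0, 2, 2]])

_, pivot_cols = Matrix(A).rref()
basis = A[:, list(pivot_cols)]   # pivot columns of the ORIGINAL matrix
print(pivot_cols)                # (0, 1)
print(basis)                     # the two independent columns: a basis for col(A)
```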

From a null space (free variables).

To find a basis for \text{null}(A) (a short sketch follows the steps below):

  1. Row reduce A to RREF; identify the free variables x_{j_1}, \ldots, x_{j_k}
  2. For each free variable x_{j_\ell}, set x_{j_\ell} = 1 and all other free variables to 0; solve for the pivot variables
  3. The resulting vector is one basis vector for \text{null}(A)
  4. Repeat for each free variable; the k vectors obtained form a basis for \text{null}(A)
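
A minimal SymPy sketch (the matrix is illustrative); Matrix.nullspace() returns exactly one basis vector per free variable:

```python
from sympy import Matrix

# A has rank 2 and 4 columns, so null(A) is 2-dimensional.
A = Matrix([[1, 2, 0, 1],
            [0, 0, 1, 3]])

basis = A.nullspace()      # one basis vector per free variable
for vec in basis:
    print(vec.T)           # e.g. [-2, 1, 0, 0] and [-1, 0, -3, 1]
    print((A * vec).T)     # [0, 0]: each basis vector really lies in null(A)
```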

Gram-Schmidt (from independent vectors to orthonormal basis).

Given k linearly independent vectors \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}, produce an orthonormal basis \{\mathbf{q}_1, \ldots, \mathbf{q}_k\} for \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}. See Section 9.3 for the full procedure.

Extending a basis.

Given a basis \mathcal{B}_W = \{\mathbf{b}_1, \ldots, \mathbf{b}_k\} for a subspace W \subsetneq V, extend it to a basis for all of V:

  1. Start with S = \mathcal{B}_W
  2. Find any \mathbf{v} \in V \setminus \text{span}(S) and add it: S \leftarrow S \cup \{\mathbf{v}\}
  3. Repeat until \text{span}(S) = V

This process terminates in at most \dim(V) - \dim(W) steps.
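
A simple numerical version of this loop, assuming V = \mathbb{R}^n and trying standard basis vectors as candidates; the helper name extend_to_basis is a made-up illustration, not a standard routine:

```python
import numpy as np

def extend_to_basis(W_basis, n, tol=1e-10):
    """Extend the columns of W_basis (independent vectors in R^n) to a basis of R^n
    by greedily adding standard basis vectors that increase the rank."""
    S = W_basis.copy()
    for i in range(n):
        if S.shape[1] == n:
            break
        e_i = np.zeros(n)
        e_i[i] = 1.0
        candidate = np.column_stack([S, e_i])
        if np.linalg.matrix_rank(candidate, tol=tol) > np.linalg.matrix_rank(S, tol=tol):
            S = candidate
    return S

W = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])                        # basis of a 2-dim subspace of R^3
full = extend_to_basis(W, 3)
print(full.shape, np.linalg.matrix_rank(full))    # (3, 3) 3: now a basis for R^3
```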

6.7 Dimension Counting Theorems

These theorems relate the dimensions of related subspaces. They are among the most useful tools in linear algebra.


Rank-Nullity Theorem. For any matrix A \in \mathbb{R}^{m \times n}:

\underbrace{\dim(\text{null}(A))}_{\text{nullity}(A)} + \underbrace{\dim(\text{col}(A))}_{\text{rank}(A)} = n

In words: the number of "invisible" input directions plus the number of "active" input directions equals the total number of input dimensions.

Proof idea. Row reduce A to RREF. Let r = \text{rank}(A). There are r pivot columns (corresponding to r independent output directions, so \dim(\text{col}(A)) = r) and n - r free columns (corresponding to n - r null space basis vectors, so \dim(\text{null}(A)) = n - r). Total: r + (n - r) = n.

AI application. For a weight matrix W \in \mathbb{R}^{m \times n} with rank r (a numerical check is sketched after this list):

  • r input dimensions "pass through" to the output (row space directions)
  • n - r input dimensions are silenced (null space directions)
  • The layer uses r out of n available input dimensions
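
Here is a quick numerical check of the count, with a weight matrix built to have rank 3; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 3
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank 3 by construction

rank = np.linalg.matrix_rank(W)

# Null space basis from the SVD: the right singular vectors whose singular
# values are (numerically) zero span null(W).
_, s, Vt = np.linalg.svd(W)
null_basis = Vt[rank:].T                  # columns span null(W)
print(rank, null_basis.shape[1])          # 3 2
print(np.allclose(W @ null_basis, 0))     # True: these input directions are silenced
print(rank + null_basis.shape[1] == n)    # True: rank + nullity = n
```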

Dimension of a Sum. For subspaces W_1, W_2 \subseteq V:

\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)

This is the Grassmann formula, or inclusion-exclusion for dimensions. It is the direct analogue of |A \cup B| = |A| + |B| - |A \cap B| for finite sets, but for dimensions of subspaces.

Key consequence. If W_1 \cap W_2 = \{\mathbf{0}\} (the intersection is trivial):

\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2)

and the sum is a direct sum: W_1 + W_2 = W_1 \oplus W_2. Dimensions add when the subspaces are independent (share only the origin).


Dimension bounds for the intersection. Given \dim(W_1) = k_1, \dim(W_2) = k_2, \dim(V) = n:

\max(0,\ k_1 + k_2 - n) \leq \dim(W_1 \cap W_2) \leq \min(k_1, k_2)

The upper bound: the intersection is contained in both W_1 and W_2, so its dimension cannot exceed either. The lower bound follows from the Grassmann formula together with \dim(W_1 + W_2) \leq n.

AI application. Two attention heads each write to a d_k-dimensional subspace of \mathbb{R}^d: the intersection of their subspaces has dimension at least \max(0, 2d_k - d). If d_k > d/2, the heads must share at least 2d_k - d dimensions; their communication channels overlap inevitably.
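
The intersection dimension can be computed numerically from the Grassmann formula, since \dim(W_1 \cap W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 + W_2) and each term is a rank. The subspaces below are random and purely illustrative:

```python
import numpy as np

def dim_intersection(U, W, tol=1e-10):
    """dim(U ∩ W) via the Grassmann formula: dim U + dim W - dim(U + W).

    U, W: matrices whose columns span the two subspaces.
    """
    dim_U = np.linalg.matrix_rank(U, tol=tol)
    dim_W = np.linalg.matrix_rank(W, tol=tol)
    dim_sum = np.linalg.matrix_rank(np.column_stack([U, W]), tol=tol)
    return dim_U + dim_W - dim_sum

d, d_k = 8, 5                       # two "heads" with 5-dim subspaces of R^8
rng = np.random.default_rng(3)
U = rng.standard_normal((d, d_k))
W = rng.standard_normal((d, d_k))
print(dim_intersection(U, W))       # 2 = max(0, 2*5 - 8): the overlap is forced
```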


Complement dimension. For a subspace W \subseteq V with \dim(V) = n:

\dim(W^\perp) = n - \dim(W)

So \dim(W) + \dim(W^\perp) = n. The orthogonal complement "fills in" the remaining dimensions. The fundamental subspaces obey:

\text{rank}(A) + \text{nullity}(A) = n \quad \Leftrightarrow \quad \dim(\text{row}(A)) + \dim(\text{null}(A)) = n

which is just the Rank-Nullity Theorem re-expressed using the fact that \text{null}(A) = \text{row}(A)^\perp.

