Vector Spaces and Subspaces: Sections 7 (The Four Fundamental Subspaces) to 11 (Subspaces in Functional Analysis)
7. The Four Fundamental Subspaces
Recall: Earlier sections used these subspaces informally:
- Systems of Equations 4.3 - used $\text{col}(A)$ and $\text{null}(A)$ to characterise when $A\mathbf{x} = \mathbf{b}$ is solvable
- Matrix Rank 4 - used rank and null space dimension to state the rank-nullity theorem
This section is the canonical home: rigorous definitions, dimensional identities, bases computed from RREF, the orthogonality relationships, and their geometric interpretation as the complete decomposition of $\mathbb{R}^n$ and $\mathbb{R}^m$.
For any matrix $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = r$, Gilbert Strang identified four fundamental subspaces that together give a complete picture of how $A$ acts as a linear map $\mathbb{R}^n \to \mathbb{R}^m$. Understanding all four - their definitions, dimensions, bases, and mutual relationships - is the complete theory of linear systems.
7.1 Definition of All Four Subspaces
1. Column space $\text{col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$ (also called the image or range):
- The set of all possible outputs of $A$; the "reachable" subspace in $\mathbb{R}^m$
- Basis: pivot columns of $A$ (the original $A$, not its RREF)
- The system $A\mathbf{x} = \mathbf{b}$ is consistent if and only if $\mathbf{b} \in \text{col}(A)$
2. Null space $\text{null}(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$ (also called the kernel):
- The set of all inputs mapped to zero; the "invisible" directions in $\mathbb{R}^n$
- $\dim \text{null}(A) = n - r$ (by Rank-Nullity)
- Basis: free variable solution vectors from the RREF of $A$
- $A$ cannot distinguish $\mathbf{x}$ from $\mathbf{x} + \mathbf{z}$ for any $\mathbf{z} \in \text{null}(A)$: both produce the same output
3. Row space $\text{row}(A) = \text{col}(A^T) \subseteq \mathbb{R}^n$:
- The set of all linear combinations of the rows of $A$; the directions in $\mathbb{R}^n$ that $A$ "listens to"
- $\dim \text{row}(A) = r$ (same as the column space - both equal the rank)
- Basis: non-zero rows of the RREF of $A$
- The projection of any input onto $\text{row}(A)$ determines the output; the null space component of $\mathbf{x}$ is discarded
4. Left null space $\text{null}(A^T) = \{\mathbf{y} \in \mathbb{R}^m : A^T\mathbf{y} = \mathbf{0}\}$:
- The set of all output-space vectors orthogonal to the column space of $A$
- $\dim \text{null}(A^T) = m - r$ (by Rank-Nullity applied to $A^T$)
- Basis: free variable solution vectors from the RREF of $A^T$
- If $\mathbf{b}$ has a non-zero component in $\text{null}(A^T)$ then $\mathbf{b} \notin \text{col}(A)$, which means $A\mathbf{x} = \mathbf{b}$ has no solution
Dimension summary:
| Subspace | Ambient space | Dimension | Complement |
|---|---|---|---|
| $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\text{null}(A^T)$ |
| $\text{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | $\text{row}(A)$ |
| $\text{row}(A)$ | $\mathbb{R}^n$ | $r$ | $\text{null}(A)$ |
| $\text{null}(A^T)$ | $\mathbb{R}^m$ | $m - r$ | $\text{col}(A)$ |
7.2 The Orthogonality Relations
The four subspaces pair into two orthogonal decompositions:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A), \qquad \mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^T)$$

Proof that $\text{null}(A) \perp \text{row}(A)$.
Take any $\mathbf{x} \in \text{null}(A)$ and any $\mathbf{w} \in \text{row}(A)$. Then $\mathbf{w} = A^T\mathbf{y}$ for some $\mathbf{y} \in \mathbb{R}^m$. Compute:

$$\mathbf{w}^T\mathbf{x} = (A^T\mathbf{y})^T\mathbf{x} = \mathbf{y}^T(A\mathbf{x}) = \mathbf{y}^T\mathbf{0} = 0$$

So every null space vector is orthogonal to every row space vector.
Since $\dim \text{row}(A) + \dim \text{null}(A) = r + (n - r) = n$ and the two subspaces are orthogonal (hence their intersection is $\{\mathbf{0}\}$), they form an orthogonal direct sum decomposition of $\mathbb{R}^n$. Every vector $\mathbf{x} \in \mathbb{R}^n$ decomposes uniquely as:

$$\mathbf{x} = \mathbf{x}_{\text{row}} + \mathbf{x}_{\text{null}}, \qquad \mathbf{x}_{\text{row}} \in \text{row}(A), \ \mathbf{x}_{\text{null}} \in \text{null}(A)$$

And $A\mathbf{x} = A\mathbf{x}_{\text{row}} + A\mathbf{x}_{\text{null}} = A\mathbf{x}_{\text{row}}$. The null space component is silenced; only the row space component matters.
7.3 Strang's Big Picture
The complete action of $A$ as a linear map $\mathbb{R}^n \to \mathbb{R}^m$ is captured in the following diagram:
STRANG'S FOUR FUNDAMENTAL SUBSPACES

R^n (input space)                     R^m (output space)

  row(A)                                col(A)
  dim = r             --A-->            dim = r
  "the listening space"                 "the reachable space"
  A is sensitive here                   A can write here

  null(A)                               null(A^T)
  dim = n - r         --A--> {0}        dim = m - r
  (x -> 0)
  "the silent space"                    "the unreachable
  A maps this to 0                       space"

Key facts:
- A is a bijection from row(A) to col(A) - restricted to row(A),
  A is invertible.
- A maps all of null(A) to the single point {0} in R^m.
- No input - however large - can produce an output in null(A^T).
- The orthogonal complements pair up:
  row(A)^perp = null(A) and col(A)^perp = null(A^T)
The action of $A$ is now completely described. For any input $\mathbf{x} = \mathbf{x}_{\text{row}} + \mathbf{x}_{\text{null}}$ (row space + null space decomposition):

$$A\mathbf{x} = A\mathbf{x}_{\text{row}} \in \text{col}(A)$$
The null space component vanishes; the row space component is mapped bijectively into the column space. The left null space is entirely inaccessible from the input side.
7.4 Computing the Four Subspaces
The systematic procedure to find bases for all four fundamental subspaces from a matrix $A \in \mathbb{R}^{m \times n}$ (a code sketch follows the five steps):
Step 1: Row reduce $A$ to its RREF $R$.
Identify the pivot positions (row $i$, column $j_i$ pairs where $R_{i j_i}$ is the leading 1 of row $i$). Let $r$ be the number of pivots.
Step 2: null(A).
- Free variables: columns of $R$ without a pivot (say columns $f_1, \dots, f_{n-r}$)
- For each free variable $x_{f_k}$: set $x_{f_k} = 1$, set all other free variables to $0$, and solve for the pivot variables from $R\mathbf{x} = \mathbf{0}$
- The resulting vector is the $k$-th null space basis vector
- Repeat for $k = 1, \dots, n - r$; these vectors form a basis for $\text{null}(A)$
Step 3: col(A).
- The pivot columns of the original matrix $A$ (not $R$) form a basis for $\text{col}(A)$
- Specifically: if pivots appear in columns $j_1, \dots, j_r$, then $\{\mathbf{a}_{j_1}, \dots, \mathbf{a}_{j_r}\}$ (columns of $A$) is a basis
Why the original $A$, not $R$? Row operations change the column space (the columns of $R$ generally span a different subspace than the columns of $A$), although they preserve the linear dependencies between columns - which is why the pivot positions still identify which columns of $A$ are independent. The actual column vectors must be taken from the original $A$ to get a basis for the original column space.
Step 4: row(A).
- The non-zero rows of $R$ (the RREF of $A$) form a basis for $\text{row}(A)$
- Unlike for the column space, row operations preserve the row space; the non-zero rows of $R$ are particularly convenient linear combinations of the rows of $A$ that happen to be in reduced form
Step 5: null(A^T).
- Row reduce $A^T$ to its RREF, then apply the null space procedure (Step 2) to $A^T$
- Alternatively: if you have already computed the four dimensions (and bases for three of the four subspaces), use the complementary dimension argument
- Basis vectors of $\text{null}(A^T)$ can also be read off from the row operations applied to $A$: if $EA = R$ records the elimination, the rows of $E$ corresponding to the zero rows of $R$ form a basis
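A minimal sketch of this five-step procedure, using SymPy for exact row reduction; the matrix `A` below is an arbitrary example, not one from the lesson:

```python
import sympy as sp

A = sp.Matrix([[1, 2, 0, 3],
               [2, 4, 1, 7],
               [3, 6, 1, 10]])            # m = 3, n = 4, rank r = 2

R, pivot_cols = A.rref()                   # Step 1: RREF and pivot column indices
r = len(pivot_cols)

null_basis = A.nullspace()                 # Step 2: free-variable solution vectors
col_basis = [A[:, j] for j in pivot_cols]  # Step 3: pivot columns of the ORIGINAL A
row_basis = [R[i, :] for i in range(r)]    # Step 4: non-zero rows of R
left_null_basis = A.T.nullspace()          # Step 5: null space of A^T

# Dimension checks: r, n - r, r, m - r
assert (len(col_basis), len(null_basis)) == (r, A.cols - r)
assert (len(row_basis), len(left_null_basis)) == (r, A.rows - r)

# Orthogonality checks: row(A) is perpendicular to null(A), col(A) to null(A^T)
assert all((w * z)[0] == 0 for w in row_basis for z in null_basis)
assert all((c.T * y)[0] == 0 for c in col_basis for y in left_null_basis)
```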
7.5 AI Interpretation of the Four Subspaces
For a weight matrix $W \in \mathbb{R}^{m \times n}$ in a neural network layer $\mathbf{y} = W\mathbf{x} + \mathbf{b}$, the four subspaces tell the complete story of what this layer can and cannot do:
Column space col(W) - the "reachable" subspace.
Only outputs in $\text{col}(W)$ can be produced by this layer. If $\text{rank}(W) = r < m$, then $m - r$ dimensions of the output space receive exactly zero contribution from this layer, regardless of input. In a transformer residual stream: each attention head and MLP block writes to a specific subspace of $\mathbb{R}^{d_{\text{model}}}$; the column space of the output projection $W_O$ is the subspace the head writes to. Dimensions orthogonal to $\text{col}(W_O)$ are untouched by this head.
Null space null(W) - the "invisible" subspace.
Input directions in $\text{null}(W)$ are completely ignored by the layer. If an input vector has all its "mass" in the null space, the output is zero - the layer cannot see it. This is both a limitation and a design feature: in mechanistic interpretability, the null space of a query projection $W_Q$ is the set of residual stream directions that this attention head does not use to form its queries. These directions are invisible to the head's query computation.
Row space row(W) - the "listening" subspace.
The row space is the complement of the null space: input directions in $\text{row}(W)$ are the ones the layer actually "listens to". The projection of the input onto $\text{row}(W)$ determines the output; the null space projection vanishes. In a transformer, the row space of $W_K$ determines which directions of the residual stream the key computation is sensitive to: the keys depend only on the component of the residual stream lying in $\text{row}(W_K)$.
Left null space null(W^T) - the "unreachable" output subspace.
The left null space is orthogonal to $\text{col}(W)$. No input, however crafted, can produce an output with a component in this subspace from this layer. In a transformer with residual connections, the left null space of one layer's output projection is the subspace that must be written by other layers. This is why residual connections are essential: they allow information to flow "around" layers whose column spaces don't reach certain directions.
AI INTERPRETATION: WEIGHT MATRIX SUBSPACES
W in R^{m x n}, rank r

R^n (input):                    R^m (output):
  row(W)                          col(W)
  dim r          --W-->           dim r
  "W listens here"                "W writes here"
  null(W)                         null(W^T)
  dim n-r        --W--> {0}       dim m-r
  "W ignores this"                "W never here"
Transformer layer (y = Wx + b, residual):
- Attention head h reads from row(W_Q^h), row(W_K^h), row(W_V^h)
- Head h writes to col(W_O^h)
- Information in null(W_Q^h) is invisible to head h's queries
- Information in null(W_V^h) is not read by head h's values
- The residual stream preserves ALL d dimensions; individual
layers each modify only their respective column space subspaces
8. Subspace Operations
8.1 Sum of Subspaces
For subspaces $U, W \subseteq V$, their sum is:

$$U + W = \{\mathbf{u} + \mathbf{w} : \mathbf{u} \in U, \mathbf{w} \in W\}$$

Theorem. $U + W$ is a subspace of $V$.
Proof.
- Contains $\mathbf{0}$: $\mathbf{0} = \mathbf{0} + \mathbf{0} \in U + W$
- Closed under $+$: $(\mathbf{u}_1 + \mathbf{w}_1) + (\mathbf{u}_2 + \mathbf{w}_2) = (\mathbf{u}_1 + \mathbf{u}_2) + (\mathbf{w}_1 + \mathbf{w}_2) \in U + W$ (since $U$, $W$ are closed)
- Closed under scaling: $c(\mathbf{u} + \mathbf{w}) = c\mathbf{u} + c\mathbf{w} \in U + W$
$U + W$ is the smallest subspace containing both $U$ and $W$. Any subspace $S$ that contains both $U$ and $W$ must contain all sums $\mathbf{u} + \mathbf{w}$ (by closure), so $U + W \subseteq S$.
Grassmann dimension formula:

$$\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$$

Important distinction: $U \cup W$ (set union) is not a subspace in general. Take $U = \text{span}\{\mathbf{e}_1\}$ (x-axis) and $W = \text{span}\{\mathbf{e}_2\}$ (y-axis) in $\mathbb{R}^2$. Then $\mathbf{e}_1 \in U \cup W$ and $\mathbf{e}_2 \in U \cup W$, but $\mathbf{e}_1 + \mathbf{e}_2 \notin U \cup W$. The union is not closed under addition. The sum $U + W = \mathbb{R}^2$ (the whole plane) is a subspace; the union (just the two axes) is not.
8.2 Intersection of Subspaces
For subspaces $U, W \subseteq V$, their intersection is:

$$U \cap W = \{\mathbf{v} : \mathbf{v} \in U \text{ and } \mathbf{v} \in W\}$$

Theorem. $U \cap W$ is a subspace of $V$.
Proof.
- Contains $\mathbf{0}$: $\mathbf{0} \in U$ and $\mathbf{0} \in W$, so $\mathbf{0} \in U \cap W$
- Closed under $+$: if $\mathbf{v}_1, \mathbf{v}_2 \in U \cap W$, then $\mathbf{v}_1 + \mathbf{v}_2 \in U$ (since $U$ is closed) and $\mathbf{v}_1 + \mathbf{v}_2 \in W$ (since $W$ is closed), so $\mathbf{v}_1 + \mathbf{v}_2 \in U \cap W$
- Closed under scaling: $c\mathbf{v} \in U$ and $c\mathbf{v} \in W$, so $c\mathbf{v} \in U \cap W$
$U \cap W$ is the largest subspace contained in both $U$ and $W$.
Computing $U \cap W$: If $U = \text{null}(A)$ and $W = \text{null}(B)$, then:

$$U \cap W = \text{null}\begin{pmatrix} A \\ B \end{pmatrix}$$

For general subspaces given by spanning sets: find vectors in $U$ that also lie in $W$ by setting up a linear system and solving.
8.3 Direct Sum
Subspaces $U$ and $W$ are complementary in $V$ (equivalently, $V$ is their direct sum) if:
- $U + W = V$ - they together span $V$
- $U \cap W = \{\mathbf{0}\}$ - they share only the origin
We write $V = U \oplus W$.
Theorem. $V = U \oplus W$ if and only if every $\mathbf{v} \in V$ has a unique decomposition $\mathbf{v} = \mathbf{u} + \mathbf{w}$ with $\mathbf{u} \in U$ and $\mathbf{w} \in W$.
Proof of uniqueness. If $\mathbf{u}_1 + \mathbf{w}_1 = \mathbf{u}_2 + \mathbf{w}_2$, then $\mathbf{u}_1 - \mathbf{u}_2 = \mathbf{w}_2 - \mathbf{w}_1$. The left side is in $U$; the right side is in $W$. So both sides lie in $U \cap W = \{\mathbf{0}\}$. Hence $\mathbf{u}_1 = \mathbf{u}_2$ and $\mathbf{w}_1 = \mathbf{w}_2$.
Dimension: $\dim(U \oplus W) = \dim U + \dim W$. This follows from the Grassmann formula with $\dim(U \cap W) = 0$.
Complements are not unique. For a given subspace $W \subseteq V$, there are many subspaces $U$ such that $V = U \oplus W$. The orthogonal complement is the canonical, unique choice:
Theorem (for finite-dimensional inner product spaces). $V = W \oplus W^\perp$, and $(W^\perp)^\perp = W$.
The fundamental subspace decompositions from Section 7 are orthogonal direct sums:

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A), \qquad \mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^T)$$

Multiple direct sums. The direct sum generalises: $V = W_1 \oplus W_2 \oplus \cdots \oplus W_k$ if the $W_i$ are pairwise "independent" (each $W_i \cap \sum_{j \neq i} W_j = \{\mathbf{0}\}$) and together span $V$. Then every $\mathbf{v} \in V$ has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \cdots + \mathbf{w}_k$ and $\dim V = \sum_i \dim W_i$.
In the spectral theorem (Section 13), the orthogonal decomposition into eigenspaces is a multiple orthogonal direct sum: $\mathbb{R}^n = E_{\lambda_1} \oplus E_{\lambda_2} \oplus \cdots \oplus E_{\lambda_k}$.
8.4 Projection onto Subspaces
Given a subspace $W \subseteq \mathbb{R}^n$, the orthogonal projection $P_W$ maps each vector $\mathbf{v}$ to the closest point in $W$:

$$P_W\mathbf{v} = \arg\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|$$

The unique minimiser exists because $W$ is a closed convex set (any subspace is convex, and in finite dimensions, closed).
Characterisation. $P_W\mathbf{v}$ is the unique vector in $W$ such that $\mathbf{v} - P_W\mathbf{v} \perp W$, i.e., $\langle \mathbf{v} - P_W\mathbf{v}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$.
Formula 1: orthonormal basis.
If $W$ has orthonormal basis $\{\mathbf{q}_1, \dots, \mathbf{q}_k\}$, then:

$$P_W\mathbf{v} = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle \mathbf{q}_i$$

In matrix form, with $Q = [\mathbf{q}_1 \cdots \mathbf{q}_k]$ (orthonormal columns):

$$P_W\mathbf{v} = QQ^T\mathbf{v}$$

The projection matrix is $P = QQ^T$.
Formula 2: general (non-orthonormal) basis.
If $W = \text{col}(A)$ for $A \in \mathbb{R}^{n \times k}$ with linearly independent columns (full column rank):

$$P_W\mathbf{v} = A(A^TA)^{-1}A^T\mathbf{v}$$

The projection matrix is $P = A(A^TA)^{-1}A^T$. The matrix $A^+ = (A^TA)^{-1}A^T$ is the Moore-Penrose pseudoinverse of $A$ when $A$ has full column rank.
Properties of an orthogonal projection matrix $P$:
| Property | Expression | Meaning |
|---|---|---|
| Idempotent | $P^2 = P$ | Projecting twice = projecting once |
| Symmetric | $P^T = P$ | Projection is self-adjoint |
| Eigenvalues | $\lambda \in \{0, 1\}$ | Vectors in $W$ map to themselves; vectors in $W^\perp$ map to $\mathbf{0}$ |
| Rank | $\text{rank}(P) = \dim W$ | Rank = dimension of the subspace projected onto |
| Trace | $\text{tr}(P) = \dim W$ | Since eigenvalues are 0s and 1s; trace = sum of eigenvalues |
| Complement | $I - P$ | Projection onto $W^\perp$ |
Proof that $P^2 = P$: $P^2 = (QQ^T)(QQ^T) = Q(Q^TQ)Q^T = QIQ^T = QQ^T = P$ (using $Q^TQ = I$ for orthonormal columns).
Decomposition. Every $\mathbf{v}$ decomposes as:

$$\mathbf{v} = P\mathbf{v} + (I - P)\mathbf{v}$$

The residual $(I - P)\mathbf{v}$ is the component of $\mathbf{v}$ orthogonal to $W$, and $I - P$ is itself an orthogonal projection matrix (onto $W^\perp$).
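A minimal NumPy check of Formula 2 and the properties in the table above; the matrix `A` spanning $W$ is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))          # W = col(A), a 2-dimensional subspace of R^5

P = A @ np.linalg.inv(A.T @ A) @ A.T     # projection matrix onto W
v = rng.standard_normal(5)

assert np.allclose(P @ P, P)             # idempotent: P^2 = P
assert np.allclose(P, P.T)               # symmetric
assert np.isclose(np.trace(P), 2.0)      # trace = dim W
assert np.allclose(A.T @ (v - P @ v), 0) # residual (I - P)v is orthogonal to W
```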
AI applications of projection:
- Least squares: the best-fit solution projects $\mathbf{b}$ onto $\text{col}(A)$; least squares is orthogonal projection
- PCA: projecting data onto the top-$k$ principal component subspace is a rank-$k$ projection $P = U_kU_k^T$ where $U_k$ contains the top $k$ left singular vectors
- Attention: (soft) attention can be viewed as a weighted projection of value vectors onto query-determined subspaces
- Concept erasure: removing a concept from an embedding by projecting onto the orthogonal complement of the concept subspace: $\mathbf{x} \mapsto (I - P_{\text{concept}})\mathbf{x}$; used in LEACE and related methods
8.5 Subspace Angles and Principal Angles
The angle between two vectors is well-defined: $\cos\theta = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$. The "angle" between two subspaces is more subtle - it requires a collection of angles called principal angles.
Definition. For subspaces $U, W \subseteq \mathbb{R}^n$ with $\dim U = p$, $\dim W = q$, $p \leq q$, the principal angles $0 \leq \theta_1 \leq \cdots \leq \theta_p \leq \pi/2$ are defined recursively: $\cos\theta_k$ is the largest value of $\frac{\mathbf{u}^T\mathbf{w}}{\|\mathbf{u}\|\,\|\mathbf{w}\|}$ over $\mathbf{u} \in U$, $\mathbf{w} \in W$, subject to $\mathbf{u}$ being orthogonal to $\mathbf{u}_1, \dots, \mathbf{u}_{k-1}$ and $\mathbf{w}$ being orthogonal to $\mathbf{w}_1, \dots, \mathbf{w}_{k-1}$ (the maximisers of the previous steps).
Computation via SVD. If $Q_U \in \mathbb{R}^{n \times p}$ and $Q_W \in \mathbb{R}^{n \times q}$ are orthonormal bases for $U$ and $W$:

$$M = Q_U^T Q_W, \qquad M = Y\Sigma Z^T \ \text{(SVD)}$$

Then $\cos\theta_k = \sigma_k(M)$ (the $k$-th singular value of $M$). The principal angles are the arc-cosines of the singular values.
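A minimal NumPy sketch of this computation; the two subspaces are arbitrary random examples, and SciPy's `scipy.linalg.subspace_angles` implements a more numerically careful version of the same idea:

```python
import numpy as np

rng = np.random.default_rng(1)
BU = rng.standard_normal((10, 3))              # columns span U (dim 3) in R^10
BW = rng.standard_normal((10, 4))              # columns span W (dim 4) in R^10

QU, _ = np.linalg.qr(BU)                       # orthonormal bases via QR
QW, _ = np.linalg.qr(BW)

sigma = np.linalg.svd(QU.T @ QW, compute_uv=False)
theta = np.arccos(np.clip(sigma, 0.0, 1.0))    # principal angles, in ascending order
print(np.degrees(theta))                       # min(3, 4) = 3 angles
```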
Interpretation:
- $\theta_1 = 0$: the subspaces share a common direction (their intersection is non-trivial)
- Some $\theta_k = \pi/2$: the subspaces have a pair of orthogonal directions (they are "partially orthogonal")
- All $\theta_k = \pi/2$: the subspaces are orthogonal ($U \perp W$, i.e., $U \subseteq W^\perp$)
- All $\theta_k = 0$: $U \subseteq W$ or $W \subseteq U$ (one contains the other)
AI applications:
- Head overlap: principal angles between two attention heads' OV subspaces measure how much they overlap. If $\theta_1 \approx 0$, the heads share a direction and are partially redundant. If all $\theta_k \approx \pi/2$, the heads write to orthogonal subspaces - they are fully independent.
- Gradient similarity across layers: principal angles between gradient subspaces at different layers measure gradient diversity during training.
- Representation similarity: the CKA (Centered Kernel Alignment) measure of representation similarity between two networks is related to principal angles between their representation subspaces.
- LoRA subspace alignment: after fine-tuning, the principal angles between the LoRA update subspace and the top-$k$ gradient subspace reveal how well LoRA captured the gradient's preferred directions.
9. Inner Product Spaces and Orthogonality
Recall: Vectors and Spaces 5-6 developed norms and inner products concretely in $\mathbb{R}^n$ - dot product, angle, Cauchy-Schwarz, and orthogonal projection. This section extends those ideas to abstract inner product spaces: general Hilbert spaces, orthonormal bases, Gram-Schmidt orthogonalisation, and orthogonal complements. The abstract setting is what makes these concepts applicable to function spaces (Fourier analysis, kernels) and not just $\mathbb{R}^n$.
9.1 Inner Products
An inner product on a vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:
- Symmetry: $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
- Linearity in the first argument: $\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a\langle \mathbf{u}, \mathbf{w} \rangle + b\langle \mathbf{v}, \mathbf{w} \rangle$
- Positive definiteness: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, with equality if and only if $\mathbf{v} = \mathbf{0}$
Together, properties 1 and 2 imply linearity in the second argument as well (by symmetry), making the inner product bilinear: linear in each argument separately.
The induced norm is $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$, and the induced metric (distance) is $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$.
Standard inner products:
| Space | Inner product | Induced norm |
|---|---|---|
| $\mathbb{R}^n$ (Euclidean) | $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T\mathbf{v} = \sum_i u_iv_i$ | $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$ |
| $\mathbb{R}^{m \times n}$ (Frobenius) | $\langle A, B \rangle = \text{tr}(A^TB) = \sum_{i,j} A_{ij}B_{ij}$ | $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ |
| $L^2([a, b])$ ($L^2$ norm) | $\langle f, g \rangle = \int_a^b f(x)g(x)\,dx$ | $\|f\|_2 = \sqrt{\int_a^b f(x)^2\,dx}$ |
Weighted inner product. For a symmetric positive definite matrix $M$:

$$\langle \mathbf{u}, \mathbf{v} \rangle_M = \mathbf{u}^TM\mathbf{v}$$

This defines a different geometry on $\mathbb{R}^n$: the unit ball is an ellipsoid rather than a sphere. The natural gradient in optimisation uses the Fisher information matrix $F$ as the weight matrix: $\|\mathbf{g}\|_F = \sqrt{\mathbf{g}^TF\mathbf{g}}$ measures gradient magnitude in the geometry of the statistical model, not the flat geometry of parameter space.
Cauchy-Schwarz inequality. For any inner product:

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \, \|\mathbf{v}\|$$

with equality if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent ($\mathbf{v} = c\mathbf{u}$ for some scalar $c$).
This is one of the most important inequalities in mathematics. It implies:
- Cosine similarity is always in $[-1, 1]$ - well-defined as a cosine
- Triangle inequality $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ (from Cauchy-Schwarz applied to the expansion of $\|\mathbf{u} + \mathbf{v}\|^2$)
- The law of cosines: $\|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\langle \mathbf{u}, \mathbf{v} \rangle$
9.2 Orthogonality
Vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal, written $\mathbf{u} \perp \mathbf{v}$, if $\langle \mathbf{u}, \mathbf{v} \rangle = 0$.
Orthogonality generalises perpendicularity from Euclidean geometry to any inner product space. In the Fourier inner product on $L^2([0, 2\pi])$, $\sin(mx)$ and $\cos(nx)$ are orthogonal for all integers $m, n$. In the Frobenius inner product on matrices, symmetric and skew-symmetric matrices are orthogonal (since $\text{tr}(A^TB) = 0$ when $A^T = A$ and $B^T = -B$).
Pythagorean theorem. If $\mathbf{u} \perp \mathbf{v}$, then:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

Proof:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \langle \mathbf{u} + \mathbf{v}, \mathbf{u} + \mathbf{v} \rangle = \|\mathbf{u}\|^2 + 2\langle \mathbf{u}, \mathbf{v} \rangle + \|\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

More generally, if $\mathbf{v}_1, \dots, \mathbf{v}_k$ are pairwise orthogonal: $\|\mathbf{v}_1 + \cdots + \mathbf{v}_k\|^2 = \|\mathbf{v}_1\|^2 + \cdots + \|\mathbf{v}_k\|^2$
Orthogonal sets and independence. Any set of non-zero pairwise orthogonal vectors is linearly independent.
Proof. Suppose $c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$ with the $\mathbf{v}_i$ non-zero and pairwise orthogonal. Take the inner product of both sides with $\mathbf{v}_j$:

$$c_j\langle \mathbf{v}_j, \mathbf{v}_j \rangle = 0$$

Since $\mathbf{v}_j \neq \mathbf{0}$, we have $\langle \mathbf{v}_j, \mathbf{v}_j \rangle > 0$, so $c_j = 0$. This holds for all $j$.
This is a powerful observation: orthogonality implies independence. An orthogonal set is automatically independent, so an orthogonal spanning set is automatically a basis.
9.3 Gram-Schmidt Orthogonalisation
Given linearly independent vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$ in an inner product space, the Gram-Schmidt process produces an orthonormal set $\mathbf{q}_1, \dots, \mathbf{q}_k$ such that:

$$\text{span}\{\mathbf{q}_1, \dots, \mathbf{q}_j\} = \text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_j\} \quad \text{for every } j$$
The key invariant is that at each step, the orthonormal basis spans the same subspace as the original vectors up to that index.
Algorithm:
Step 1: u_1 = v_1
        q_1 = u_1 / ||u_1||
Step 2: u_2 = v_2 - <v_2, q_1> q_1
        q_2 = u_2 / ||u_2||
Step 3: u_3 = v_3 - <v_3, q_1> q_1 - <v_3, q_2> q_2
        q_3 = u_3 / ||u_3||
Step j: u_j = v_j - sum_{i=1}^{j-1} <v_j, q_i> q_i
        q_j = u_j / ||u_j||
What each step does. At step $j$, we subtract from $\mathbf{v}_j$ all of its projections onto the previously constructed orthonormal vectors $\mathbf{q}_1, \dots, \mathbf{q}_{j-1}$. The result $\mathbf{u}_j$ is the component of $\mathbf{v}_j$ not explained by the previous vectors - the "new" direction that $\mathbf{v}_j$ adds. Normalising gives $\mathbf{q}_j$.
Why $\mathbf{u}_j \neq \mathbf{0}$: Since $\mathbf{v}_1, \dots, \mathbf{v}_j$ are linearly independent, $\mathbf{v}_j \notin \text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_{j-1}\} = \text{span}\{\mathbf{q}_1, \dots, \mathbf{q}_{j-1}\}$. Therefore $\mathbf{u}_j \neq \mathbf{0}$ (if it were zero, $\mathbf{v}_j$ would be in the span of the previous $\mathbf{q}$'s, contradicting independence).
Verification of orthonormality. For $i < j$:

$$\langle \mathbf{u}_j, \mathbf{q}_i \rangle = \langle \mathbf{v}_j, \mathbf{q}_i \rangle - \sum_{l=1}^{j-1} \langle \mathbf{v}_j, \mathbf{q}_l \rangle \langle \mathbf{q}_l, \mathbf{q}_i \rangle = \langle \mathbf{v}_j, \mathbf{q}_i \rangle - \langle \mathbf{v}_j, \mathbf{q}_i \rangle = 0$$

Connection to QR decomposition. The Gram-Schmidt process is the algorithm underlying QR decomposition. For a matrix $A = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ with independent columns:

$$A = QR$$

where $Q = [\mathbf{q}_1 \cdots \mathbf{q}_n]$ has orthonormal columns and $R$ is upper triangular with positive diagonal, with $R_{ij} = \langle \mathbf{v}_j, \mathbf{q}_i \rangle$ for $i \leq j$.
The diagonal entries are $R_{jj} = \|\mathbf{u}_j\|$.
Numerical issues. The classical Gram-Schmidt algorithm as stated is numerically unstable for nearly dependent vectors: rounding errors accumulate and the resulting vectors may not be accurately orthogonal. The Modified Gram-Schmidt algorithm reorders the operations to reduce error propagation. For production use, Householder QR (using Householder reflections rather than projections) is numerically stable and preferred.
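A minimal sketch of Modified Gram-Schmidt as just described (the example vectors are arbitrary); for production code, `np.linalg.qr`, which is Householder-based, is the usual choice:

```python
import numpy as np

def modified_gram_schmidt(V: np.ndarray) -> np.ndarray:
    """Orthonormalise the columns of V (assumed linearly independent)."""
    V = V.astype(float).copy()
    n, k = V.shape
    Q = np.zeros((n, k))
    for j in range(k):
        Q[:, j] = V[:, j] / np.linalg.norm(V[:, j])
        # Immediately remove the q_j component from the REMAINING vectors;
        # this reordering is what distinguishes MGS from classical GS.
        for l in range(j + 1, k):
            V[:, l] -= (Q[:, j] @ V[:, l]) * Q[:, j]
    return Q

V = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q = modified_gram_schmidt(V)
assert np.allclose(Q.T @ Q, np.eye(2))   # the columns are orthonormal
```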
Worked example. Let , in .
Step 1: , ,
Step 2:
Verify:
9.4 The Orthogonal Complement
For a subspace $W \subseteq V$, the orthogonal complement is:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

$W^\perp$ is a subspace (easy check):
- $\langle \mathbf{0}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$, so $\mathbf{0} \in W^\perp$
- If $\langle \mathbf{u}, \mathbf{w} \rangle = 0$ and $\langle \mathbf{v}, \mathbf{w} \rangle = 0$ for all $\mathbf{w} \in W$, then $\langle a\mathbf{u} + b\mathbf{v}, \mathbf{w} \rangle = a \cdot 0 + b \cdot 0 = 0$
Properties (for finite-dimensional inner product spaces):
| Property | Statement |
|---|---|
| Double complement | $(W^\perp)^\perp = W$ |
| Dimension | $\dim W + \dim W^\perp = \dim V$ |
| Trivial intersection | $W \cap W^\perp = \{\mathbf{0}\}$ |
| Direct sum decomposition | $V = W \oplus W^\perp$ |
| Fundamental subspaces | $\text{null}(A) = \text{row}(A)^\perp$; $\text{null}(A^T) = \text{col}(A)^\perp$ |
Why $(W^\perp)^\perp = W$: Clearly $W \subseteq (W^\perp)^\perp$ (every vector in $W$ is orthogonal to everything in $W^\perp$). By the dimension formula: $\dim (W^\perp)^\perp = \dim V - \dim W^\perp = \dim W$. Same dimension and one contains the other -> they are equal.
Computing $W^\perp$. If $W = \text{span}\{\mathbf{w}_1, \dots, \mathbf{w}_k\} \subseteq \mathbb{R}^n$ and we form $B \in \mathbb{R}^{k \times n}$ (rows are the $\mathbf{w}_i^T$), then:

$$W^\perp = \text{null}(B)$$

because $\langle \mathbf{v}, \mathbf{w}_i \rangle = 0$ for all $i$ is equivalent to $B\mathbf{v} = \mathbf{0}$.
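A minimal sketch of this computation with `scipy.linalg.null_space`; the spanning vectors are an arbitrary example:

```python
import numpy as np
from scipy.linalg import null_space

w1 = np.array([1.0, 0.0, 1.0, 0.0])
w2 = np.array([0.0, 1.0, 0.0, 1.0])
B = np.vstack([w1, w2])            # rows span W, a 2-dimensional subspace of R^4

W_perp = null_space(B)             # orthonormal basis for W^perp (4 - 2 = 2 columns)
assert W_perp.shape == (4, 2)
assert np.allclose(B @ W_perp, 0)  # every basis vector of W^perp is orthogonal to W
```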
9.5 Orthogonal Bases for AI
The choice of basis - orthogonal vs arbitrary - has concrete consequences in AI applications.
Interpretability and orthogonal bases. When the basis vectors of an embedding space are orthogonal (as in the standard basis of $\mathbb{R}^d$), each dimension corresponds to an independent direction. Inner products between embeddings measure similarity in a well-defined way. Projections onto individual basis directions give clean, independent "components". In practice, the features of a learned representation are rarely aligned with the standard basis - but if they were, each feature could be read off by looking at one coordinate.
The superposition hypothesis and basis independence. Elhage et al. (2022) argue that transformers represent features as directions in $\mathbb{R}^d$, not as individual coordinates. The model does not commit to any particular orthonormal basis; instead, features are placed as arbitrary unit vectors in $\mathbb{R}^d$. When there are fewer features than dimensions, they can be nearly orthogonal (low interference). When features exceed $d$, they must be non-orthogonal (superposed). The interference between features is measured by the inner products between their representing directions - exactly the off-diagonal entries of the Gram matrix of feature vectors.
Cosine similarity. The dominant similarity measure in embedding spaces:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^T\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$

This is the cosine of the angle between $\mathbf{u}$ and $\mathbf{v}$, ranging from $-1$ (anti-parallel) to $+1$ (parallel). Orthogonal vectors have cosine similarity 0 - they are "unrelated" by this measure. The cosine similarity is basis-dependent: it changes if we apply a non-orthogonal change of basis to the space.
Orthogonal vs orthonormal bases. Orthogonal bases (pairwise orthogonal, but not necessarily unit length) are often more natural than orthonormal ones for intermediate computations. An orthonormal basis (each vector unit length AND pairwise orthogonal) is the "gold standard" for numerical stability and interpretability. The Gram-Schmidt process converts any independent set into an orthonormal one.
PCA as an orthonormal basis for data. Principal Component Analysis finds the orthonormal basis of that best aligns with the directions of maximal variance in a dataset. In the PCA basis, the covariance matrix is diagonal (its off-diagonal elements are zero), meaning the principal components are statistically uncorrelated. PCA is precisely the process of finding the orthonormal eigenbasis of the covariance matrix and using it as the new coordinate system.
Layer normalisation and orthogonality. Layer normalisation in transformers includes a learnable scale $\gamma$ and shift $\beta$:

$$\text{LN}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu\mathbf{1}}{\sigma} + \beta, \qquad \mu = \frac{1}{d}\sum_i x_i$$

The mean-subtraction step projects $\mathbf{x}$ onto the orthogonal complement of the all-ones vector $\mathbf{1}$. This is an explicit orthogonal projection: $\mathbf{x} - \mu\mathbf{1} = (I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T)\mathbf{x}$, where $\tfrac{1}{d}\mathbf{1}\mathbf{1}^T$ is the projection onto the 1-dimensional "mean direction" and $I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^T$ is the projection onto its orthogonal complement.
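A minimal NumPy check of this claim (mean-subtraction equals projection onto the orthogonal complement of $\text{span}\{\mathbf{1}\}$):

```python
import numpy as np

d = 8
x = np.random.default_rng(2).standard_normal(d)

ones = np.ones(d)
P_mean = np.outer(ones, ones) / d              # projection onto span{1}
P_comp = np.eye(d) - P_mean                    # projection onto its orthogonal complement

assert np.allclose(P_comp @ x, x - x.mean())   # mean-subtraction IS this projection
assert np.isclose((P_comp @ x) @ ones, 0.0)    # the result has zero mean
```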
10. Affine Subspaces and Quotient Spaces
10.1 Affine Subspaces
An affine subspace (also called an affine flat or coset) is a translate of a linear subspace:

$$\mathbf{v}_0 + W = \{\mathbf{v}_0 + \mathbf{w} : \mathbf{w} \in W\}$$

for a fixed offset vector $\mathbf{v}_0$ and a linear subspace $W$.
Geometric picture:
- If $W$ is a line through the origin, $\mathbf{v}_0 + W$ is a parallel line through $\mathbf{v}_0$
- If $W$ is a plane through the origin, $\mathbf{v}_0 + W$ is a parallel plane through $\mathbf{v}_0$
- The affine subspace is a "shifted" copy of the linear subspace; it has the same "shape" and dimension as $W$, but it generally does not pass through the origin (unless $\mathbf{v}_0 \in W$, in which case $\mathbf{v}_0 + W = W$)
Affine subspaces are NOT vector subspaces. Unless $\mathbf{v}_0 \in W$, the affine subspace does not contain $\mathbf{0}$ and is not closed under addition (the sum of two elements is offset by $2\mathbf{v}_0$, not $\mathbf{v}_0$).
Key example: solution sets of linear systems. The solution set of $A\mathbf{x} = \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is an affine subspace:

$$\{\mathbf{x} : A\mathbf{x} = \mathbf{b}\} = \mathbf{x}_p + \text{null}(A)$$

where $\mathbf{x}_p$ is any particular solution and $\text{null}(A)$ is the null space (a linear subspace). The solution set is a translate of the null space. All solutions lie in the same affine subspace parallel to $\text{null}(A)$, offset by $\mathbf{x}_p$.
Other examples:
- A line not through the origin in $\mathbb{R}^2$: $\{\mathbf{p} + t\mathbf{d} : t \in \mathbb{R}\}$ with $\mathbf{p} \neq \mathbf{0}$ - a translate of the span of $\mathbf{d}$
- The affine hull of the probability simplex $\Delta^{n-1}$: the set $\{\mathbf{x} \in \mathbb{R}^n : \sum_i x_i = 1\}$ is an $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$ - a translate of the hyperplane $\{\mathbf{x} : \sum_i x_i = 0\}$ (the simplex itself adds the constraints $x_i \geq 0$)
- Batch normalisation output space: after fixing the batch statistics, the normalised output lives in an affine subspace determined by the running mean and variance
Dimension of an affine subspace. The affine subspace $\mathbf{v}_0 + W$ has the same dimension as the underlying linear subspace $W$. A $k$-dimensional affine subspace in $\mathbb{R}^n$ is sometimes called a $k$-flat.
AFFINE VS LINEAR SUBSPACES
Linear subspace W: Affine subspace W + v_0:
passes through origin passes through v_0 (not 0)
closed under + and scaling closed under affine combinations
IS a vector space is NOT a vector space
examples: lines/planes/ examples: lines/planes/
hyperplanes through 0 hyperplanes NOT through 0
In AI:
Linear: null(W), col(W), LoRA update subspace
Affine: solution set of Ax=b, probability simplex,
offset embeddings before centering
10.2 Affine Combinations
An affine combination of vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$ is a linear combination whose coefficients sum to one:

$$\alpha_1\mathbf{v}_1 + \cdots + \alpha_k\mathbf{v}_k, \qquad \sum_{i=1}^k \alpha_i = 1$$

Affine combinations "stay within" affine subspaces: if $\mathbf{v}_1, \dots, \mathbf{v}_k \in \mathbf{v}_0 + W$, then any affine combination of them also lies in $\mathbf{v}_0 + W$.
Convex combinations are affine combinations with the additional constraint $\alpha_i \geq 0$:

$$\alpha_1\mathbf{v}_1 + \cdots + \alpha_k\mathbf{v}_k, \qquad \sum_i \alpha_i = 1, \ \alpha_i \geq 0$$

Convex combinations stay within the convex hull of $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$.
Why the constraint matters. Without it, scaling the vectors by a common factor would scale the combination - the result would depend on "how large" the vectors are. With $\sum_i \alpha_i = 1$, the combination is "location-aware" rather than "direction-aware" - it picks a point in the affine hull regardless of scale.
AI applications of affine and convex combinations:
- Embedding interpolation. Given two embedding vectors $\mathbf{e}_1$ and $\mathbf{e}_2$, the linear interpolation $(1 - t)\mathbf{e}_1 + t\mathbf{e}_2$ for $t \in [0, 1]$ is a convex combination. This is the operation underlying latent space arithmetic: "find the midpoint of 'cat' and 'dog' in embedding space". Note: this is NOT generally a valid probability-weighted average of the corresponding tokens' distributions - that arithmetic must happen in logit space.
- Spherical linear interpolation (slerp). For unit vectors (on the unit sphere), linear interpolation does not stay on the sphere. Slerp uses trigonometric weights to interpolate along the great circle: $\text{slerp}(\mathbf{u}, \mathbf{v}; t) = \frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{u} + \frac{\sin(t\theta)}{\sin\theta}\mathbf{v}$ where $\cos\theta = \mathbf{u}^T\mathbf{v}$. The weights $\frac{\sin((1-t)\theta)}{\sin\theta}$ and $\frac{\sin(t\theta)}{\sin\theta}$ sum to approximately (but not exactly) 1 - they play the role of affine weights adapted to the sphere's geometry (see the sketch after this list).
- Model averaging and interpolation. Averaging two neural networks $\theta_1$ and $\theta_2$ by $(1-t)\theta_1 + t\theta_2$ is a convex combination in parameter space. "Model soup" (Wortsman et al. 2022) averages fine-tuned models; WiSE-FT interpolates between pretrained and fine-tuned weights. These are affine combinations in parameter space $\mathbb{R}^P$.
- Probability distributions. A mixture distribution $\alpha p_1 + (1 - \alpha)p_2$ with $\alpha \in [0, 1]$ is a convex combination of two distributions. This is valid because: $\alpha p_1(x) + (1-\alpha)p_2(x) \geq 0$ (convex combination of non-negatives) and $\int \alpha p_1 + (1-\alpha)p_2 = \alpha + (1-\alpha) = 1$ (the unit-sum constraint ensures the mixture is still a probability distribution). This is why mixture models work: affine combinations with unit-sum weights preserve the probability simplex.
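A minimal slerp sketch, following the formula above; the two example vectors are arbitrary:

```python
import numpy as np

def slerp(u: np.ndarray, v: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between unit vectors u and v along the great circle."""
    theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return u                          # the vectors (nearly) coincide
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
mid = slerp(u, v, 0.5)

assert np.isclose(np.linalg.norm(mid), 1.0)      # stays on the unit sphere,
assert np.linalg.norm(0.5 * u + 0.5 * v) < 1.0   # unlike the linear interpolation
```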
10.3 Quotient Spaces
The quotient space $V/W$ (read "V mod W") captures the idea of "ignoring the $W$ directions" - it identifies all vectors that differ by an element of $W$.
Definition. For a subspace $W \subseteq V$, define the equivalence relation: $\mathbf{u} \sim \mathbf{v}$ iff $\mathbf{u} - \mathbf{v} \in W$. The coset of $\mathbf{v}$ is:

$$\mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}$$

This is an affine subspace - a translate of $W$ passing through $\mathbf{v}$. Note that $\mathbf{u} + W = \mathbf{v} + W$ iff $\mathbf{u} \sim \mathbf{v}$ (the cosets are identical iff the vectors are equivalent).
The quotient space $V/W$ is the collection of all cosets of $W$ in $V$.
Operations on $V/W$:
- Addition: $(\mathbf{u} + W) + (\mathbf{v} + W) = (\mathbf{u} + \mathbf{v}) + W$ (well-defined: the result does not depend on which representatives we choose, as long as the cosets are the same)
- Scalar multiplication: $c(\mathbf{v} + W) = c\mathbf{v} + W$ (well-defined by the same argument)
These operations make $V/W$ a vector space (the quotient space). The zero vector of $V/W$ is the coset $\mathbf{0} + W = W$ (the coset containing $\mathbf{0}$, which is $W$ itself).
Dimension: $\dim(V/W) = \dim V - \dim W$.
Intuition. $V/W$ "collapses" the $W$ directions to a point (the zero element $W$) and retains only the directions perpendicular to $W$. The quotient space has dimension $\dim V - \dim W$ because it "forgets" $\dim W$ directions.
First Isomorphism Theorem. For a linear map $T : V \to U$:

$$V/\text{null}(T) \cong \text{im}(T)$$

The quotient space $V/\text{null}(T)$ is isomorphic to the image of $T$. Intuitively: the null space is exactly the "ambiguity" in $T$ - different inputs mapping to the same output; the quotient space removes this ambiguity, leaving a bijection between cosets and outputs.
AI applications:
- Layer normalisation. The mean-subtraction in LayerNorm - subtracting the mean of all coordinates from each coordinate - is equivalent to projecting onto the orthogonal complement of the all-ones direction $\mathbf{1}$. The "normalised" space can be viewed as the quotient $\mathbb{R}^d / \text{span}\{\mathbf{1}\}$, where the $\mathbf{1}$ direction is "divided out". Two activations that differ only by a constant shift (a multiple of $\mathbf{1}$) are identified in this quotient space.
- Residual connections. The residual connection $\mathbf{x} \mapsto \mathbf{x} + f(\mathbf{x})$ adds a vector to the current residual stream. From the quotient space perspective: the "content" of the residual stream modulo the current layer's contribution is what gets passed to the next layer. Each layer writes to its column-space subspace; the rest of $\mathbb{R}^d$ is preserved as-is.
- Equivalence classes in training. If $X\mathbf{w}_1 = X\mathbf{w}_2$ for a data matrix $X$, then all weight vectors in the same coset $\mathbf{w} + \text{null}(X)$ produce the same predictions on the training data. The space of distinct prediction functions is the quotient $\mathbb{R}^n / \text{null}(X)$. Gradient descent with squared loss (initialised at zero) finds the minimum-norm representative of the coset - the vector in the coset closest to the origin.
10.4 Cosets and Their Structure
The cosets of $W$ partition $V$ into disjoint, equally-shaped affine subspaces:
- Partition: every $\mathbf{v} \in V$ belongs to exactly one coset; cosets are either identical or disjoint
- Uniform shape: every coset is a translate of $W$, so all cosets have the same dimension ($\dim W$) and the same "shape"
- No overlap: two cosets $\mathbf{u} + W$ and $\mathbf{v} + W$ are either equal (when $\mathbf{u} - \mathbf{v} \in W$) or disjoint
Solution structure for linear systems. This is the most immediate application. For $A\mathbf{x} = \mathbf{b}$:
- The solution set (if non-empty) is the coset $\mathbf{x}_p + \text{null}(A)$ for any particular solution $\mathbf{x}_p$
- All solutions in this coset produce the same output: $A\mathbf{x} = \mathbf{b}$ for any $\mathbf{x}$ in the coset
- The minimum-norm solution (the "pseudoinverse solution") is $\mathbf{x}^+ = A^+\mathbf{b}$, which is the unique element of the coset lying in $\text{row}(A) = \text{null}(A)^\perp$
Implicit bias as coset selection. In overparameterised models (more parameters than constraints), gradient descent with zero initialisation and MSE loss converges to the minimum-norm solution in the coset (the unique solution lying in $\text{row}(X)$). This "implicit bias" towards minimum-norm solutions is a coset-selection phenomenon: gradient descent selects the minimum-$\ell_2$-norm representative from the coset of all equally good solutions.
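A minimal NumPy sketch of this coset-selection picture for an underdetermined linear system; the system is an arbitrary example, and `np.linalg.pinv` supplies the minimum-norm (pseudoinverse) solution:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 10))        # 4 constraints, 10 parameters
y = rng.standard_normal(4)

w_min = np.linalg.pinv(X) @ y           # minimum-norm solution x^+ = X^+ y
N = null_space(X)                       # basis for null(X), dimension 10 - 4 = 6

assert np.allclose(X @ w_min, y)        # it solves the system
assert np.allclose(N.T @ w_min, 0)      # no null-space component: it lies in row(X)

w_other = w_min + N @ rng.standard_normal(6)   # another point in the same coset
assert np.allclose(X @ w_other, y)             # still a solution,
assert np.linalg.norm(w_other) >= np.linalg.norm(w_min)   # but never smaller norm
```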
11. Subspaces in Functional Analysis
11.1 Infinite-Dimensional Spaces and Closed Subspaces
In finite dimensions, every subspace is automatically closed (in the topological sense). This ceases to be true in infinite dimensions, where one must distinguish between algebraic subspaces (satisfying the three conditions of Section 3) and closed subspaces (additionally closed under limits).
Closed subspace. A subspace $W$ of a Hilbert space $H$ is closed if every Cauchy sequence in $W$ converges to a limit that remains in $W$. Equivalently, $W$ is closed if it equals its topological closure $\overline{W}$.
Why closedness matters. In infinite dimensions:
- The projection theorem ($H = W \oplus W^\perp$) holds only for closed subspaces
- The best approximation $P_W\mathbf{v}$ in $W$ exists only if $W$ is closed
- The spectral theorem and other key results require closed invariant subspaces
Examples of closed subspaces:
- Polynomials of degree $\leq n$ in $L^2([a, b])$: finite-dimensional, hence closed
- $\text{null}(T)$ for a bounded linear operator $T$: always closed (because $T$ is continuous and $\{\mathbf{0}\}$ is closed)
- The even functions in $L^2([-a, a])$: closed under limits
- The span of any finite set of vectors in a Hilbert space: always closed (finite-dimensional subspace)
Examples of non-closed subspaces in infinite dimensions:
- All polynomials (of any degree) in $L^2([a, b])$: algebraically a subspace, but not closed - a sequence of polynomials (for example, the partial sums of the series for $e^x$) converges in $L^2$ to $e^x$, which is not a polynomial. The closure is all of $L^2([a, b])$ (by the Weierstrass approximation theorem, polynomials are dense in $C([a, b])$ and hence in $L^2([a, b])$)
- Finitely supported sequences in $\ell^2$: sequences with only finitely many non-zero entries; algebraically a subspace, but the limit of the truncations $(1, \tfrac{1}{2}, \dots, \tfrac{1}{n}, 0, 0, \dots)$ is $(1, \tfrac{1}{2}, \tfrac{1}{3}, \dots)$, which is in $\ell^2$ but not finitely supported
Projection in infinite dimensions. The projection theorem extends to Hilbert spaces: if $W$ is a closed subspace of a Hilbert space $H$, then $H = W \oplus W^\perp$ and every $\mathbf{v} \in H$ has a unique best approximation $P_W\mathbf{v} \in W$. This is the foundation of least squares in infinite dimensions and of the representer theorem in kernel methods.
11.2 Function Spaces Relevant to AI
$L^2(\Omega)$ - square-integrable functions on $\Omega$.
The space of functions $f$ with $\int_\Omega f(x)^2\,dx < \infty$, equipped with inner product $\langle f, g \rangle = \int_\Omega f(x)g(x)\,dx$. This is a separable Hilbert space. It is the natural setting for:
- Kernel methods: the RKHS of a translation-invariant kernel is a subspace of $L^2$
- Probability: densities of probability distributions with finite second moment live here
- Signal processing: finite-energy signals are in $L^2$
- Neural networks: functions representable by a network can be analysed as elements of $L^2$
Sobolev spaces $H^k$.
Functions with weak derivatives up to order $k$ all lying in $L^2$, equipped with norms that account for function values and derivative values. They specify "smoothness":
- $H^0 = L^2$ (no smoothness condition beyond square integrability)
- $H^1$ (square-integrable function with square-integrable first derivatives): relevant for PDEs and PINNs
- $H^2$ (second derivatives in $L^2$): relevant for spline regression, Gaussian processes with Matérn kernels
Sobolev spaces are used in physics-informed neural networks (PINNs): the solution to a PDE is sought in a Sobolev space, and the network is trained to minimise a loss that penalises PDE residuals in the $L^2$ norm. The constraint "solution in $H^k$" is a subspace constraint on the function space.
Reproducing Kernel Hilbert Spaces (RKHS).
An RKHS is a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ such that point evaluation is a bounded linear functional: for each $x \in \mathcal{X}$, the map $f \mapsto f(x)$ is continuous in the $\mathcal{H}$-norm.
By the Riesz representation theorem, there exists a unique function $k_x \in \mathcal{H}$ such that:

$$f(x) = \langle f, k_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}$$

The function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by $k(x, x') = \langle k_x, k_{x'} \rangle_{\mathcal{H}}$ is the reproducing kernel. It is symmetric and positive semi-definite.
Representer theorem. For any regularised learning problem of the form:

$$\min_{f \in \mathcal{H}} \ \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2$$

the optimal $f^*$ lies in the finite-dimensional subspace $\text{span}\{k_{x_1}, \dots, k_{x_N}\}$, i.e. $f^*(x) = \sum_{i=1}^N \alpha_i k(x_i, x)$. This is a subspace result: despite the infinite-dimensional function space, the optimal solution lies in an $N$-dimensional subspace spanned by the kernel functions at the training points. SVMs, Gaussian process regression, and kernel ridge regression all instantiate this theorem.
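A minimal kernel ridge regression sketch illustrating the representer theorem: the fitted function is a combination of kernel functions at the $N$ training points. The RBF kernel, bandwidth, regularisation strength, and toy data are all arbitrary choices:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian / RBF kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(30, 1))               # N = 30 training points
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # coefficients of k(x_i, .)

def predict(X_new):
    return rbf_kernel(X_new, X) @ alpha            # f(x) = sum_i alpha_i k(x_i, x)

print(predict(np.array([[0.0]])))                  # prediction at x = 0
```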
Common kernels and their RKHS subspaces:
- Linear kernel $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$: RKHS = linear functions on $\mathbb{R}^d$; finite-dimensional subspace
- RBF / Gaussian kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$: RKHS is a dense subspace of $L^2$; infinite-dimensional
- Matérn kernel: RKHS is a Sobolev space $H^s$; the smoothness parameter controls which Sobolev subspace
- Polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T\mathbf{x}' + c)^p$: RKHS = polynomials of degree $\leq p$; $\binom{d+p}{p}$-dimensional
11.3 Neural Networks as Subspaces of Function Spaces
A neural network architecture with parameter space $\mathbb{R}^P$ defines a parametric family of functions:

$$\mathcal{F} = \{f_\theta : \theta \in \mathbb{R}^P\} \subset L^2$$

This family is not a subspace of $L^2$. In general, $f_{\theta_1} + f_{\theta_2}$ is not equal to $f_{\theta_3}$ for any $\theta_3$ (the sum of two networks is not itself a network with the same architecture). Similarly for scalar multiples. The non-linearity of the architecture means the function family is a non-linear manifold embedded in $L^2$, not a subspace.
However, the tangent space at any parameter $\theta_0$ IS a subspace. The tangent space to the manifold at the point $f_{\theta_0}$ is:

$$T_{\theta_0} = \text{span}\left\{\left.\frac{\partial f_\theta}{\partial \theta_i}\right|_{\theta_0} : i = 1, \dots, P\right\}$$

This is the span of $P$ functions - an (at most) $P$-dimensional linear subspace of $L^2$. When you take a gradient step $\theta_0 \to \theta_0 + \delta\theta$, the first-order change in the function, $\sum_i \delta\theta_i \, \partial f_\theta / \partial \theta_i$, lies in this tangent space. First-order optimisation operates in the tangent subspace.
Neural Tangent Kernel (NTK). In the infinite-width limit (network width $\to \infty$ with appropriate scaling), the evolution of the network function during gradient flow is governed by a linear equation in function space:

$$\frac{\partial f_t(x)}{\partial t} = -\eta \sum_{i=1}^N \Theta(x, x_i)\, \frac{\partial L}{\partial f_t(x_i)}$$

where $\Theta(x, x') = \nabla_\theta f_\theta(x)^T \nabla_\theta f_\theta(x')$ is the NTK. In the infinite-width limit, $\Theta$ becomes constant during training. The trained network converges to kernel regression with the NTK as kernel - meaning the trained function lies in the RKHS defined by the NTK, which is a subspace of $L^2$. The NTK result says: in the lazy training regime, the neural network effectively performs a linear projection onto a specific infinite-dimensional subspace of function space.
Universal approximation and subspace density. The universal approximation theorem says that the set of functions representable by a neural network with bounded width and arbitrary depth (or arbitrary width and one hidden layer) is dense in $C([0, 1]^d)$ (continuous functions on the unit hypercube) under the uniform norm. "Dense" means: for any continuous function $g$ and any $\epsilon > 0$, there is a network function $f_\theta$ with $\sup_x |f_\theta(x) - g(x)| < \epsilon$. This is a subspace density statement: the parametric family is dense in the function space $C([0, 1]^d)$. Depth (or width) determines which subspace is reachable; nonlinearity is what allows density.
11.4 Krylov Subspaces
Krylov subspaces are the foundation of the most practical iterative linear algebra algorithms. They connect the abstract geometry of subspaces to the computational efficiency of matrix-vector products.
Definition. For a matrix $A \in \mathbb{R}^{n \times n}$ and a vector $\mathbf{b} \in \mathbb{R}^n$, the Krylov subspace of order $k$ is:

$$\mathcal{K}_k(A, \mathbf{b}) = \text{span}\{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \dots, A^{k-1}\mathbf{b}\}$$

This is the span of the first $k$ vectors in the sequence $\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \dots$ - the orbit of $\mathbf{b}$ under repeated application of $A$.
Nested structure. The Krylov subspaces form a nested sequence:

$$\mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \mathcal{K}_3 \subseteq \cdots$$

The sequence stabilises at some $k^* \leq n$: $\mathcal{K}_{k^*} = \mathcal{K}_{k^*+1} = \cdots$. At that point, $A\mathcal{K}_{k^*} \subseteq \mathcal{K}_{k^*}$ (the subspace is invariant under $A$), and $k^*$ equals the degree of the minimal polynomial of $A$ with respect to $\mathbf{b}$.
Krylov methods. Iterative solvers based on Krylov subspaces find the best approximate solution within $\mathcal{K}_k(A, \mathbf{b})$ at step $k$, then expand the subspace:
| Method | Problem | Optimisation in $\mathcal{K}_k$ |
|---|---|---|
| Conjugate Gradients (CG) | $A\mathbf{x} = \mathbf{b}$, $A$ SPD | Minimises the $A$-norm error $\|\mathbf{x} - \mathbf{x}^*\|_A$ over $\mathcal{K}_k$ |
| GMRES | $A\mathbf{x} = \mathbf{b}$, general $A$ | Minimises the residual $\|A\mathbf{x} - \mathbf{b}\|_2$ over $\mathcal{K}_k$ |
| Lanczos | Eigenvalues of symmetric $A$ | Finds best rank-$k$ approximation to spectrum |
| Arnoldi | Eigenvalues of general $A$ | Orthogonalises $\mathcal{K}_k$ via Gram-Schmidt |
Each step costs one matrix-vector product with $A$ plus $O(kn)$ work for orthogonalisation. For a sparse $A$ with $\text{nnz}$ non-zeros, each matrix-vector product costs $O(\text{nnz})$. Contrast with direct methods (Gaussian elimination): cost $O(n^3)$ regardless of sparsity.
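A minimal sketch of one Krylov method, plain conjugate gradients: each iteration uses a single matrix-vector product, and the $k$-th iterate lies in $\mathcal{K}_k(A, \mathbf{b})$. The SPD test matrix is an arbitrary example; in practice `scipy.sparse.linalg.cg` would be used:

```python
import numpy as np

def conjugate_gradients(A, b, tol=1e-10, max_iter=None):
    n = len(b)
    x = np.zeros(n)
    r = b.copy()                 # residual b - A x (x = 0 initially)
    p = r.copy()                 # search direction
    for _ in range(max_iter or n):
        Ap = A @ p               # the only matrix-vector product per step
        alpha = (r @ r) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

rng = np.random.default_rng(5)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)    # symmetric positive definite
b = rng.standard_normal(50)

x = conjugate_gradients(A, b)
assert np.allclose(A @ x, b, atol=1e-6)
```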
AI applications of Krylov methods:
- Second-order optimisation. Methods like K-FAC (Kronecker-Factored Approximate Curvature) need to solve linear systems involving the Fisher information matrix $F$ to compute the natural gradient. Krylov methods can solve $F\mathbf{x} = \mathbf{g}$ (where $\mathbf{g}$ is the gradient) without explicitly forming $F$ - only matrix-vector products $F\mathbf{v}$ are needed, which can be computed efficiently.
- Eigenvalue computation for interpretability. Computing the top $k$ eigenvalues of the Hessian or the covariance matrix of gradients is done via Lanczos, which builds a Krylov subspace using Hessian-vector products. These products are available cheaply via the "Pearlmutter trick" (forward-over-backward AD). The Krylov subspace approach gives the dominant eigenspace in $O(k)$ Hessian-vector products.
- Linear attention and state space models. SSMs (S4, Mamba) compute recurrences of the form $\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t$ whose outputs lie in Krylov-like subspaces. The efficiency of convolutional SSM computation is related to the structure of the Krylov subspace generated by the state matrix $A$ and input matrix $B$.