Theory Notebook
Converted from theory.ipynb for web reading.
Vector Spaces and Subspaces — From Axioms to Interpretability
A vector space is not a collection of arrows. It is a collection of objects that behave like arrows — and that behaviour is the entire point. Once you verify the axioms, every theorem ever proved about vector spaces applies for free.
This notebook is the interactive companion to notes.md. It is arranged as a learning path rather than a section-by-section copy.
| Block | Topic | What you will build |
|---|---|---|
| 1 | Intuition | why vector spaces are the language of deep learning |
| 2 | Formal Definitions | axiom verification for standard and exotic spaces |
| 3 | Subspaces | the three-condition test on concrete examples |
| 4 | Span & Linear Combinations | geometric pictures and matrix formulation |
| 5 | Linear Independence | dependence detection and redundancy |
| 6 | Basis & Dimension | coordinate systems, change of basis, dimension counting |
| 7 | Four Fundamental Subspaces | column space, null space, row space, left null space |
| 8 | Subspace Operations | sum, intersection, direct sum, projection |
| 9 | Inner Product Spaces & Orthogonality | Gram-Schmidt, orthogonal complements |
| 10 | Affine Subspaces & Quotient Spaces | translated subspaces, cosets |
| 11 | Subspaces in Functional Analysis | infinite dimensions, RKHS, Krylov |
| 12 | Subspace Methods in ML | PCA, LoRA, representation subspaces, mechanistic interpretability |
| 13 | Invariant Subspaces & Spectral Theory | eigenspaces, SVD as subspace decomposition |
The notebook uses numpy and scipy, with optional matplotlib/seaborn for plots. It is intentionally code-first: the goal is to make the abstract algebra feel operational.
Primary references: Axler (Linear Algebra Done Right), Strang (Linear Algebra and Its Applications), Halmos (Finite-Dimensional Vector Spaces), Hu et al. (2021, LoRA), Elhage et al. (2022, superposition), Aghajanyan et al. (2020, intrinsic dimensionality).
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False
mpl.rcParams.update({
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "font.size": 13,
    "axes.titlesize": 15,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
    "legend.framealpha": 0.85,
    "lines.linewidth": 2.0,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "savefig.bbox": "tight",
    "savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
import scipy.linalg as sla
from scipy import stats
COLORS = {
"primary": "#0077BB",
"secondary": "#EE7733",
"tertiary": "#009988",
"error": "#CC3311",
"neutral": "#555555",
"highlight": "#EE3377",
}
HAS_MPL = True
np.set_printoptions(precision=8, suppress=True)
np.random.seed(42)
def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

def check_true(name, cond):
    ok = bool(cond)
    print(f"{'PASS' if ok else 'FAIL'} - {name}")
    return ok

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'} - {name}: got {got}, expected {expected}")
    return ok

def check(name, got, expected, tol=1e-8):
    return check_close(name, got, expected, tol=tol)

def softmax(z, axis=-1, tau=1.0):
    z = np.asarray(z, dtype=float) / float(tau)
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float); b = np.asarray(b, dtype=float)
    return float(a @ b / (la.norm(a) * la.norm(b) + 1e-12))

def numerical_rank(A, tol=1e-10):
    return int(np.sum(la.svd(A, compute_uv=False) > tol))

def orthonormal_basis(A, tol=1e-10):
    Q, R = la.qr(A)
    keep = np.abs(np.diag(R)) > tol
    return Q[:, keep]

# Compatibility helper used by the Chapter 02 theory and exercise cells.
def null_space(A, tol=1e-10):
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T

svd_null_space = null_space

def gram_schmidt(vectors, tol=1e-10):
    A = np.asarray(vectors, dtype=float)
    if A.ndim == 1:
        A = A.reshape(1, -1)
    basis = []
    for v in A:
        w = v.astype(float).copy()
        for q in basis:
            w = w - np.dot(w, q) * q
        norm = la.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)

def projection_matrix_from_columns(A, tol=1e-10):
    Q = orthonormal_basis(np.asarray(A, dtype=float), tol=tol)
    return Q @ Q.T

def random_unit_vectors(n, d):
    X = np.random.randn(n, d)
    return X / np.maximum(la.norm(X, axis=1, keepdims=True), 1e-12)

def pairwise_distances(X):
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return la.norm(diff, axis=-1)

def normalize(x, axis=None, tol=1e-12):
    x = np.asarray(x, dtype=float)
    norm = la.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, tol)

def frobenius_inner(A, B):
    return float(np.sum(np.asarray(A, dtype=float) * np.asarray(B, dtype=float)))

def outer_sum_product(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

def softmax_rows(X):
    return softmax(X, axis=1)

def col_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float), tol=tol)

def row_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float).T, tol=tol).T

def rref(A, tol=1e-10):
    R = np.array(A, dtype=float, copy=True)
    m, n = R.shape
    pivots = []
    row = 0
    for col in range(n):
        pivot = row + int(np.argmax(np.abs(R[row:, col]))) if row < m else row
        if row >= m or abs(R[pivot, col]) <= tol:
            continue
        if pivot != row:
            R[[row, pivot]] = R[[pivot, row]]
        R[row] = R[row] / R[row, col]
        for r in range(m):
            if r != row:
                R[r] = R[r] - R[r, col] * R[row]
        pivots.append(col)
        row += 1
        if row == m:
            break
    R[np.abs(R) < tol] = 0.0
    return R, pivots

def nullspace_basis(A, tol=1e-10):
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T, rank
print("Chapter helper setup complete.")
1. Intuition — What Is a Vector Space and Why It Matters
A vector space is a collection of objects — called vectors — together with two operations:
- Addition: combine two vectors to get another vector in the same collection.
- Scalar multiplication: stretch any vector by a real number and stay in the collection.
These two operations must satisfy eight axioms (Section 2). Once verified, every theorem ever proved about vector spaces applies automatically — whether the "vectors" are geometric arrows, polynomials, matrices, functions, probability distributions, or neural network weights.
Why this abstraction matters for AI
| AI concept | Underlying vector space | Why it matters |
|---|---|---|
| Token embeddings | ℝ^d | king − man + woman ≈ queen relies on vector arithmetic |
| Parameter space | ℝ^{#params} | Gradient descent navigates this space; the loss landscape lives over it |
| Residual stream | ℝ^{d_model} at each layer | x ← x + Attn(x) + MLP(x) is vector addition |
| Gradient space | ℝ^{#params} | Gradient accumulation, clipping, momentum — all vector operations |
| Function spaces | L², RKHS | Neural networks approximate functions; approximation theory uses subspaces |
The hierarchy of structure
Set (no structure)
↓ + addition and scalar multiplication + axioms
Vector Space (linear structure: add, scale)
↓ + norm ‖v‖
Normed Space (distances: can measure length)
↓ + inner product ⟨u,v⟩
Inner Product Space (angles: can measure cosine similarity)
↓ + completeness
Hilbert Space (complete inner product space)
Each level adds structure and inherits all theorems from above. ℝ^n satisfies all levels simultaneously. AI embedding spaces are finite-dimensional, so all levels are equivalent. Function spaces require infinite dimensions, where the distinctions become essential.
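The hierarchy can be made operational: in ℝ^n the inner product induces the norm, and the norm induces distances, so each level of structure is computable from the one below. A quick numeric sketch (the vectors are chosen for clean values):

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])

# Inner-product level: <u, v> gives angles
inner = float(u @ v)

# Norm induced by the inner product: ||u|| = sqrt(<u, u>)
norm_u = float(np.sqrt(u @ u))

# Metric induced by the norm: d(u, v) = ||u - v||
dist = float(np.linalg.norm(u - v))

# Cosine similarity lives at the inner-product level
cos_uv = inner / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"<u,v> = {inner}, ||u|| = {norm_u}, d(u,v) = {dist:.4f}, cos = {cos_uv:.4f}")
```

In finite dimensions completeness is automatic, which is why the top (Hilbert) level adds nothing new for embedding spaces.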
Code cell 5
# === 1.1 The same algebra works across wildly different "vector" types ===
# Vectors in R^3
u = np.array([1.0, 2.0, -1.0])
v = np.array([3.0, 0.0, 1.0])
print("R^3 vectors:")
print(" u + v =", u + v)
print(" 2.5 * u =", 2.5 * u)
# Matrices as vectors in R^{2x3}
A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = np.array([[0.0, -1.0, 2.0], [1.0, 0.0, -1.0]])
print("\nR^{2x3} matrices:")
print(" A + B =\n", A + B)
print(" -3 * A =\n", -3 * A)
# Polynomials (represented by coefficient vectors)
# p(t) = 1 + 2t - t^2, q(t) = 3 - t + 4t^2
p = np.array([1.0, 2.0, -1.0]) # coefficients of 1, t, t^2
q = np.array([3.0, -1.0, 4.0])
print("\nP_2 polynomials (coefficients):")
print(" p + q =", p + q, " -> (p+q)(t) = 4 + t + 3t^2")
print(" 2p =", 2 * p, " -> 2p(t) = 2 + 4t - 2t^2")
# Functions (sampled pointwise)
t = np.linspace(0, 1, 5)
f = t**2
g = np.sin(np.pi * t)
print("\nC([0,1]) functions (sampled):")
print(" (f+g)(t) =", f + g)
print(" (3f)(t) =", 3 * f)
print("\nKey insight: + and scalar * work the same way in every vector space.")
print("The axioms guarantee this uniformity.")
Code cell 6
# === 1.2 Vector space structure in AI: embedding arithmetic ===
# Synthetic embedding space (small d for clarity; real models use d=768+)
tokens = ["king", "queen", "man", "woman", "apple"]
E = np.array([
[0.8, 0.9, 0.2, 0.1], # king
[0.8, 0.9, -0.2, 0.1], # queen
[0.7, 0.8, 0.4, 0.1], # man
[0.7, 0.8, -0.4, 0.1], # woman
[-0.3, 0.2, 0.0, 0.9], # apple
], dtype=float)
# king - man + woman: relies on R^d being a vector space
analogy = E[0] - E[2] + E[3] # king - man + woman
print("king - man + woman =", analogy)
print("queen =", E[1])
print("cosine(analogy, queen) =", cosine_similarity(analogy, E[1]))
# This arithmetic is meaningful ONLY because R^d is a vector space:
# subtraction (additive inverse), addition, and the result stays in R^d.
# Residual stream: x <- x + attn_output + mlp_output
d = 8
x = np.random.randn(d)
attn_out = np.random.randn(d) * 0.3
mlp_out = np.random.randn(d) * 0.3
x_next = x + attn_out + mlp_out # vector addition in R^d
print("\nResidual stream update:")
print(" ||x|| =", np.linalg.norm(x))
print(" ||x + attn + mlp|| =", np.linalg.norm(x_next))
print(" This sum makes sense because R^d is closed under addition.")
Subspace intuition (preview)
A subspace is a vector space living inside a larger vector space — a flat, linear subset that passes through the origin and is closed under the same operations.
- A line through the origin in ℝ³ is a subspace.
- A line not through the origin is not (it doesn't contain the zero vector).
- A plane through the origin is a subspace; a curved surface never is.
Why "through the origin": the zero vector must belong to any subspace (scale any vector by 0 and you get 0; if 0 isn't in the set, it can't be closed under scalar multiplication).
Why "flat": closure under addition and scalar multiplication forces the set to be "straight" — no curves, no bends.
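The "flat" requirement can be checked directly on a curve. A minimal sketch (the parabola is an illustrative choice, not from the notes): points of {(t, t²)} include the origin, yet their sums leave the curve, so closure under addition fails.

```python
import numpy as np

# The parabola {(t, t^2)} contains the origin (t = 0) but is not flat:
# sums of points leave the curve, so closure under addition fails.
p1 = np.array([1.0, 1.0])   # t = 1
p2 = np.array([2.0, 4.0])   # t = 2
s = p1 + p2                 # (3, 5)
on_parabola = bool(np.isclose(s[1], s[0] ** 2))
print(f"p1 + p2 = {s}; second coord {s[1]} vs {s[0]**2} -> on parabola? {on_parabola}")
```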
Subspaces that appear constantly in AI:
| Subspace | Definition | AI meaning |
|---|---|---|
| Column space col(W) | {Wx : x ∈ ℝ^n} | Reachable outputs of the layer |
| Null space null(W) | {x : Wx = 0} | Input directions the layer ignores |
| Row space row(W) | col(Wᵀ) | Input directions the layer listens to |
| LoRA update space | ΔW = BA for B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k} | Rank-r subspace of weight updates |
Code cell 8
# === 1.3 Subspace vs non-subspace: quick visual check ===
# A line through the origin: W = {t*(1,2) : t in R}
# Check: contains 0? Yes (t=0). Closed under + and scalar mult? Yes.
direction = np.array([1.0, 2.0])
t_vals = np.array([-2, -1, 0, 0.5, 1, 3])
line_points = np.outer(t_vals, direction)
print("Line through origin (subspace):")
for t, pt in zip(t_vals, line_points):
    print(f" t={t:>4}: ({pt[0]:>5.1f}, {pt[1]:>5.1f})")
# Sum of two points: (1,2) + (3,6) = (4,8) = 4*(1,2) -> still on the line
print(" (1,2)+(3,6) =", direction + 3*direction, "= 4*(1,2) -> in W")
# A line NOT through origin: {(1,0) + t*(1,2) : t in R}
# Does not contain 0: (1,0) + t*(1,2) = (0,0) requires t=0 and 1=0. Contradiction.
print("\nLine NOT through origin (NOT a subspace):")
print(" Does it contain (0,0)? No. Already fails.")
# Probability simplex: {p in R^3 : p_i >= 0, sum = 1}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print("\nProbability simplex (NOT a subspace):")
print(" p =", p, "sum =", p.sum())
print(" q =", q, "sum =", q.sum())
print(" p + q =", p + q, "sum =", (p+q).sum(), "!= 1 -> not in simplex")
print(" 2*p =", 2*p, "sum =", (2*p).sum(), "!= 1 -> not in simplex")
print(" (0,0,0) not in simplex (sum=0 != 1) -> no zero vector")
2. Formal Definitions — The Eight Axioms
A vector space over ℝ is a set V together with two operations:
- Addition: V × V → V, written u + v
- Scalar multiplication: ℝ × V → V, written αv
satisfying these eight axioms for all u, v, w ∈ V and all α, β ∈ ℝ:
Addition axioms:
| # | Name | Statement |
|---|---|---|
| A1 | Commutativity | u + v = v + u |
| A2 | Associativity | (u + v) + w = u + (v + w) |
| A3 | Additive identity | There exists 0 ∈ V such that v + 0 = v for all v |
| A4 | Additive inverse | For each v, there exists −v such that v + (−v) = 0 |
Scalar multiplication axioms:
| # | Name | Statement |
|---|---|---|
| S1 | Multiplicative identity | 1 · v = v |
| S2 | Compatibility | α(βv) = (αβ)v |
Distributivity axioms:
| # | Name | Statement |
|---|---|---|
| D1 | Over vector addition | α(u + v) = αu + αv |
| D2 | Over scalar addition | (α + β)v = αv + βv |
These eight axioms are the complete definition. The axioms are minimal: each is independent of the others.
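To make "each axiom is independent" concrete, here is one classic construction (an illustrative sketch, not part of the original notebook): keep ordinary addition on ℝ² but define scalar multiplication as α ⊙ (x, y) = (αx, 0). The other scalar axioms survive, while S1 fails.

```python
import numpy as np

def smul(alpha, v):
    # Nonstandard scalar multiplication: alpha . (x, y) = (alpha*x, 0)
    return np.array([alpha * v[0], 0.0])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
a, b = 2.0, -0.5

# S2, D1, D2 all still hold with this operation...
assert np.allclose(smul(a, smul(b, v)), smul(a * b, v))        # S2
assert np.allclose(smul(a, u + v), smul(a, u) + smul(a, v))    # D1
assert np.allclose(smul(a + b, v), smul(a, v) + smul(b, v))    # D2

# ...but S1 fails: 1 . v != v whenever the second coordinate is non-zero
print("1 . v =", smul(1.0, v), " v =", v)
assert not np.allclose(smul(1.0, v), v)
```

So S1 cannot be derived from the other axioms: dropping it admits structures like this one.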
Code cell 10
# === 2.1 Verify all eight axioms for R^n ===
u = np.array([1.0, -2.0, 3.0])
v = np.array([4.0, 0.5, -1.0])
w = np.array([-1.0, 3.0, 2.0])
alpha, beta = 2.5, -1.3
zero = np.zeros(3)
print("Verifying all 8 axioms for R^3:")
print(f"A1 Commutativity: u+v == v+u? {np.allclose(u+v, v+u)}")
print(f"A2 Associativity: (u+v)+w == u+(v+w)? {np.allclose((u+v)+w, u+(v+w))}")
print(f"A3 Additive identity: v+0 == v? {np.allclose(v+zero, v)}")
print(f"A4 Additive inverse: v+(-v) == 0? {np.allclose(v+(-v), zero)}")
print(f"S1 Mult identity: 1*v == v? {np.allclose(1*v, v)}")
print(f"S2 Compatibility: a(bv) == (ab)v? {np.allclose(alpha*(beta*v), (alpha*beta)*v)}")
print(f"D1 Scalar distrib: a(u+v) == au+av? {np.allclose(alpha*(u+v), alpha*u+alpha*v)}")
print(f"D2 Vector distrib: (a+b)v == av+bv? {np.allclose((alpha+beta)*v, alpha*v+beta*v)}")
Code cell 11
# === 2.2 Verify axioms for R^{m x n} (matrices as vectors) ===
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
B = np.array([[0.0, -1.0], [2.0, 1.0], [-1.0, 3.0]])
C = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
Z = np.zeros((3, 2))
a, b = 3.0, -0.7
print("Verifying axioms for R^{3x2}:")
print(f"A1 Commutativity: {np.allclose(A+B, B+A)}")
print(f"A2 Associativity: {np.allclose((A+B)+C, A+(B+C))}")
print(f"A3 Zero matrix: {np.allclose(A+Z, A)}")
print(f"A4 Inverse: {np.allclose(A+(-A), Z)}")
print(f"S1 Identity: {np.allclose(1*A, A)}")
print(f"S2 Compatibility: {np.allclose(a*(b*A), (a*b)*A)}")
print(f"D1 Scalar distrib: {np.allclose(a*(A+B), a*A+a*B)}")
print(f"D2 Vector distrib: {np.allclose((a+b)*A, a*A+b*A)}")
print(f"\ndim(R^{{3x2}}) = 3*2 = {3*2}")
Code cell 12
# === 2.3 Verify axioms for P_n (polynomials of degree <= n) ===
# Represent p(t) = a0 + a1*t + a2*t^2 by coefficient vector [a0, a1, a2]
p = np.array([1.0, 2.0, -1.0]) # 1 + 2t - t^2
q = np.array([3.0, -1.0, 4.0]) # 3 - t + 4t^2
r = np.array([0.0, 1.0, 1.0]) # t + t^2
z = np.zeros(3) # zero polynomial
a, b = 2.0, -0.5
print("Verifying axioms for P_2 (polynomials of degree <= 2):")
print(f" p(t) = 1 + 2t - t^2, q(t) = 3 - t + 4t^2")
print(f" p + q = {p + q} -> (p+q)(t) = 4 + t + 3t^2")
print(f" A1: {np.allclose(p+q, q+p)}")
print(f" A2: {np.allclose((p+q)+r, p+(q+r))}")
print(f" A3: {np.allclose(p+z, p)}")
print(f" A4: {np.allclose(p+(-p), z)}")
print(f" S1: {np.allclose(1*p, p)}")
print(f" S2: {np.allclose(a*(b*p), (a*b)*p)}")
print(f" D1: {np.allclose(a*(p+q), a*p+a*q)}")
print(f" D2: {np.allclose((a+b)*p, a*p+b*p)}")
print(f"\ndim(P_2) = 3 (basis: {{1, t, t^2}})")
print("Note: p+q has degree 2 <= 2, so it stays in P_2. Closure holds.")
Immediate consequences of the axioms
These follow from the eight axioms alone — they hold in every vector space:
- Uniqueness of zero: if 0 and 0′ both satisfy A3, then 0 = 0 + 0′ = 0′.
- Uniqueness of additive inverse: if v + w = 0 and v + w′ = 0, then w = w + (v + w′) = (w + v) + w′ = w′.
- 0 · v = 0 (zero scalar × any vector = zero vector): 0v = (0 + 0)v = 0v + 0v. Subtract 0v: 0v = 0.
- α · 0 = 0 (any scalar × zero vector = zero vector): α0 = α(0 + 0) = α0 + α0. Subtract: α0 = 0.
- (−1) · v = −v: v + (−1)v = (1 + (−1))v = 0v = 0.
- If αv = 0, then α = 0 or v = 0: if α ≠ 0, multiply by α⁻¹: v = α⁻¹(αv) = α⁻¹0 = 0.
Code cell 14
# === 2.4 Verify consequences computationally ===
v = np.array([3.0, -1.0, 7.0, 2.0])
zero = np.zeros(4)
print("Consequence 3: 0 * v = 0")
print(" 0 * v =", 0 * v)
print(" equals zero?", np.allclose(0 * v, zero))
print("\nConsequence 4: alpha * 0 = 0")
print(" 7.5 * 0 =", 7.5 * zero)
print(" equals zero?", np.allclose(7.5 * zero, zero))
print("\nConsequence 5: (-1) * v = -v")
print(" (-1)*v =", (-1)*v)
print(" -v =", -v)
print(" equal?", np.allclose((-1)*v, -v))
print("\nConsequence 6: alpha*v = 0 implies alpha=0 or v=0")
alpha = 3.0
result = alpha * v
print(f" {alpha} * {v} = {result}")
print(f" result is zero? {np.allclose(result, zero)} (alpha != 0 and v != 0, so correct)")
Non-examples: what fails and why
Understanding what fails the axioms is as important as knowing the examples.
| Set | What fails | Why |
|---|---|---|
| ℝ₊ⁿ (positive reals) | Scalar mult, inverse | (−1)·v has negative entries; no additive inverse in the set |
| Unit sphere Sⁿ⁻¹ | Addition, zero | ‖u + v‖ ≠ 1 generally; ‖0‖ = 0 ≠ 1 |
| Simplex Δⁿ⁻¹ | Addition, scalar mult, zero | p + q sums to 2; 0 sums to 0 ≠ 1 |
| ℤⁿ (integers) | Scalar mult over ℝ | 0.5 · v ∉ ℤⁿ |
| {x : Ax = b}, b ≠ 0 | Zero, addition | A·0 = 0 ≠ b; A(x₁ + x₂) = 2b ≠ b |
Code cell 16
# === 2.5 Demonstrate axiom failures for non-vector-spaces ===
print("=== R_+^n (positive reals) ===")
v_pos = np.array([1.0, 2.0, 3.0])
scaled = (-1) * v_pos
print(f" v = {v_pos} (all positive)")
print(f" (-1)*v = {scaled} (not all positive -> FAILS closure under scalar mult)")
print(f" Also: no additive inverse in R_+ (v + (-v) = 0, but -v not in R_+)")
print("\n=== Unit sphere S^{n-1} ===")
u_sphere = normalize(np.array([1.0, 0.0, 0.0]))
v_sphere = normalize(np.array([0.0, 1.0, 0.0]))
s = u_sphere + v_sphere
print(f" ||u|| = {np.linalg.norm(u_sphere):.1f}, ||v|| = {np.linalg.norm(v_sphere):.1f}")
print(f" u + v = {s}, ||u+v|| = {np.linalg.norm(s):.4f} != 1 (FAILS closure under +)")
print(f" ||0|| = 0 != 1 (zero vector not on sphere -> FAILS zero)")
print("\n=== Probability simplex ===")
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print(f" p = {p}, sum = {p.sum():.1f}")
print(f" q = {q}, sum = {q.sum():.1f}")
print(f" p+q = {p+q}, sum = {(p+q).sum():.1f} != 1 (FAILS closure under +)")
print(f" This is why logit space (pre-softmax) is used for arithmetic, not probability space.")
print("\n=== Affine subspace {x : Ax = b}, b != 0 ===")
A_sys = np.array([[1.0, 2.0], [3.0, 4.0]])
b_sys = np.array([5.0, 6.0])
x_particular = np.linalg.solve(A_sys, b_sys)
print(f" A*0 = {A_sys @ np.zeros(2)} != b = {b_sys} (zero not in solution set -> FAILS)")
print(f" One solution: x = {x_particular}")
print(f" A*x = {A_sys @ x_particular} = b (correct, but set is not a subspace)")
Vector spaces over other fields
The definition works over any field 𝔽, not just ℝ:
| Field | Notation | Used in |
|---|---|---|
| Real numbers | ℝ | Standard ML, transformers |
| Complex numbers | ℂ | Quantum computing, Fourier analysis, signal processing |
| Binary field | 𝔽₂ = {0, 1} | Error-correcting codes, cryptography |
| Rationals | ℚ | Exact arithmetic, symbolic computation |
In 𝔽₂: addition is XOR, multiplication is AND. All linear algebra carries over — including Gaussian elimination, rank, and null spaces.
Code cell 18
# === 2.6 Vector space over F_2 (binary field) ===
# Addition = XOR, Scalar multiplication: 0*v=0, 1*v=v
u_bin = np.array([1, 0, 1, 1])
v_bin = np.array([0, 1, 1, 0])
# Addition in F_2 is XOR
add_f2 = np.bitwise_xor(u_bin, v_bin)
print("F_2^4 vector space:")
print(f" u = {u_bin}")
print(f" v = {v_bin}")
print(f" u + v (XOR) = {add_f2}")
# Additive inverse: in F_2, every element is its own inverse (1+1=0)
print(f" u + u (XOR) = {np.bitwise_xor(u_bin, u_bin)} (= zero vector, so -u = u in F_2)")
# Commutativity
print(f" u+v == v+u? {np.array_equal(add_f2, np.bitwise_xor(v_bin, u_bin))}")
print(" Used in: error-correcting codes, parity checks, cryptographic linear algebra")
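The claim that Gaussian elimination and rank carry over to 𝔽₂ can be sketched with XOR row reduction (the `rank_f2` helper below is illustrative, not part of the notebook cells):

```python
import numpy as np

def rank_f2(M):
    """Rank of a binary matrix over F_2 via XOR Gaussian elimination."""
    M = np.array(M, dtype=np.uint8) % 2
    m, n = M.shape
    rank = 0
    for col in range(n):
        # find a pivot row with a 1 in this column
        pivot = next((r for r in range(rank, m) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(m):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]   # row addition in F_2 is XOR
        rank += 1
    return rank

# Third row is the XOR of the first two -> dependent over F_2,
# yet the same rows are independent over R: rank depends on the field.
H = np.array([[1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 1]])
print("rank over F_2:", rank_f2(H))
print("rank over R:  ", np.linalg.matrix_rank(H.astype(float)))
```

The interesting point: the same matrix has rank 2 over 𝔽₂ but rank 3 over ℝ, so rank is a field-dependent notion.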
3. Subspaces — The Three-Condition Test
A subset W ⊆ V is a subspace of V if W is itself a vector space under the same operations inherited from V.
Checking all eight axioms would be redundant. Commutativity, associativity, and distributivity are inherited automatically from . Only three conditions need to be verified:
- Contains zero: 0 ∈ W
- Closed under addition: u, v ∈ W ⟹ u + v ∈ W
- Closed under scalar multiplication: α ∈ ℝ, v ∈ W ⟹ αv ∈ W
Equivalent single condition (often quickest): αu + βv ∈ W for all u, v ∈ W and all α, β ∈ ℝ.
Why these three suffice
- Condition 1 ensures W is non-empty and has an additive identity.
- Condition 2 ensures u + v stays inside W.
- Condition 3 ensures scalar multiplication stays inside W (and gives additive inverses via (−1)v = −v).
- The remaining axioms (commutativity, associativity, distributivity) are equalities that hold in V for all vectors; they hold in particular for vectors in W.
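The single-condition test lends itself to a randomized check. A sketch for the plane x + 2y − z = 0 (the helper names `in_W` and `sample_W` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-condition test for W = {(x, y, z) : x + 2y - z = 0}:
# for all u, v in W and all scalars a, b, check a*u + b*v in W.
def in_W(v, tol=1e-10):
    return abs(v[0] + 2 * v[1] - v[2]) <= tol

def sample_W():
    # parameterize W by (x, y) free, with z = x + 2y
    x, y = rng.standard_normal(2)
    return np.array([x, y, x + 2 * y])

assert in_W(np.zeros(3))          # zero vector is in W
for _ in range(100):
    u, v = sample_W(), sample_W()
    a, b = rng.standard_normal(2)
    assert in_W(a * u + b * v)    # closed under linear combinations
print("single-condition test passed for 100 random combinations")
```

A passing randomized check is evidence, not proof; the proof is the linearity of the defining equation.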
Code cell 20
# === 3.1 Lines through the origin (1D subspaces of R^2 and R^3) ===
# W = {t*(1,2) : t in R} -- a line through origin in R^2
d = np.array([1.0, 2.0])
# Test 1: contains zero?
print("Line W = span{(1,2)} in R^2:")
print(f" Contains 0? 0*(1,2) = {0*d} -> Yes")
# Test 2: closed under addition?
w1 = 3.0 * d # (3, 6)
w2 = -1.5 * d # (-1.5, -3)
w_sum = w1 + w2 # (1.5, 3) = 1.5*(1,2)
t_sum = w_sum[0] / d[0] # coefficient
print(f" w1 = {w1}, w2 = {w2}")
print(f" w1+w2 = {w_sum} = {t_sum}*(1,2) -> in W? {np.allclose(w_sum, t_sum*d)}")
# Test 3: closed under scalar mult?
alpha = -2.7
w_scaled = alpha * w1
t_scaled = w_scaled[0] / d[0]
print(f" {alpha}*w1 = {w_scaled} = {t_scaled}*(1,2) -> in W? {np.allclose(w_scaled, t_scaled*d)}")
print(" Conclusion: W is a subspace of R^2. dim(W) = 1.")
Code cell 21
# === 3.2 Planes through the origin (2D subspaces of R^3) ===
# W = {(x,y,z) : x + 2y - z = 0} -- a plane through origin in R^3
# This is null(A) where A = [[1, 2, -1]]
A_plane = np.array([[1.0, 2.0, -1.0]])
# Find null space basis
N = null_space(A_plane)
print("Plane W = {(x,y,z) : x + 2y - z = 0}")
print(f"Null space basis (columns):\n{N}")
print(f"dim(W) = {N.shape[1]}")
# Verify subspace conditions with random elements of W
coeffs1 = np.array([2.0, -1.0])
coeffs2 = np.array([0.5, 3.0])
w1 = N @ coeffs1
w2 = N @ coeffs2
def scalar_value(x):
    return float(np.asarray(x).reshape(-1)[0])
print(f"\nw1 = {w1}, A*w1 = {scalar_value(A_plane @ w1):.10f} (= 0, so w1 in W)")
print(f"w2 = {w2}, A*w2 = {scalar_value(A_plane @ w2):.10f} (= 0, so w2 in W)")
# Test closure
print(f"\nw1+w2 = {w1+w2}, A*(w1+w2) = {scalar_value(A_plane @ (w1+w2)):.10f} (closed under +)")
print(f"3.5*w1 = {3.5*w1}, A*(3.5*w1) = {scalar_value(A_plane @ (3.5*w1)):.10f} (closed under scalar mult)")
print("All three conditions satisfied -> W is a subspace.")
Code cell 22
# === 3.3 Null space of a matrix is always a subspace ===
# null(A) = {x : Ax = 0}
# Proof that it's a subspace:
# 1. 0 in null(A) because A*0 = 0
# 2. If Ax=0 and Ay=0, then A(x+y) = Ax+Ay = 0+0 = 0
# 3. If Ax=0, then A(alpha*x) = alpha*Ax = alpha*0 = 0
A = np.array([
[1.0, 2.0, 3.0, 4.0],
[2.0, 4.0, 6.0, 8.0],
[1.0, 1.0, 2.0, 3.0]
])
N = null_space(A)
rank_A = np.linalg.matrix_rank(A)
n = A.shape[1]
print(f"A is {A.shape[0]}x{A.shape[1]}, rank = {rank_A}")
print(f"nullity = n - rank = {n} - {rank_A} = {n - rank_A}")
print(f"Null space basis (columns):\n{N}")
print(f"\nVerification: A @ N =\n{A @ N}")
print(f"All zero? {np.allclose(A @ N, 0)}")
# Show closure explicitly
if N.shape[1] >= 2:
    x = N[:, 0]
    y = N[:, 1]
    print(f"\nx in null(A): A*x = {A @ x}")
    print(f"y in null(A): A*y = {A @ y}")
    print(f"x+y in null(A)? A*(x+y) = {A @ (x+y)} -> {np.allclose(A @ (x+y), 0)}")
    print(f"5x in null(A)? A*(5x) = {A @ (5*x)} -> {np.allclose(A @ (5*x), 0)}")
Code cell 23
# === 3.4 Column space of a matrix is always a subspace ===
# col(A) = {Ax : x in R^n}
# Proof that it's a subspace:
# 1. 0 in col(A) because A*0 = 0
# 2. If Ax, Ay in col(A), then Ax+Ay = A(x+y) in col(A)
# 3. alpha*(Ax) = A(alpha*x) in col(A)
A = np.array([
[1.0, 0.0, 2.0],
[0.0, 1.0, 1.0],
[1.0, 1.0, 3.0]
])
C = col_space(A)
rank_A = np.linalg.matrix_rank(A)
print(f"A =\n{A}")
print(f"rank(A) = {rank_A}")
print(f"dim(col(A)) = {rank_A}")
print(f"Column space basis (orthonormal columns):\n{C}")
# Generate two vectors in col(A) and check closure
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 2.0, -1.0])
v1 = A @ x1
v2 = A @ x2
print(f"\nv1 = A*x1 = {v1} (in col(A))")
print(f"v2 = A*x2 = {v2} (in col(A))")
v_sum = v1 + v2
# v1+v2 = A*(x1+x2), so it's in col(A)
print(f"v1+v2 = {v_sum} = A*(x1+x2) = A*{x1+x2} = {A @ (x1+x2)} -> in col(A)")
Code cell 24
# === 3.5 Symmetric matrices form a subspace of R^{nxn} ===
# W = {A in R^{nxn} : A = A^T}
# Check: 0 is symmetric, sum of symmetric = symmetric, scalar * symmetric = symmetric
n = 3
S1 = np.array([[1.0, 2.0, 3.0], [2.0, 5.0, 4.0], [3.0, 4.0, 6.0]])
S2 = np.array([[0.0, -1.0, 2.0], [-1.0, 3.0, 0.0], [2.0, 0.0, 1.0]])
print("Symmetric matrices as a subspace of R^{3x3}:")
print(f" S1 symmetric? {np.allclose(S1, S1.T)}")
print(f" S2 symmetric? {np.allclose(S2, S2.T)}")
print(f" Zero symmetric? {np.allclose(np.zeros((n,n)), np.zeros((n,n)).T)}")
print(f" S1+S2 symmetric? {np.allclose(S1+S2, (S1+S2).T)}")
print(f" -3*S1 symmetric? {np.allclose(-3*S1, (-3*S1).T)}")
print(f"\n dim(R^{{3x3}}) = {n*n}")
print(f" dim(Sym_3) = n(n+1)/2 = {n*(n+1)//2}")
print(f" Sym_3 is a proper subspace of R^{{3x3}} (6 < 9 dimensions)")
Code cell 25
# === 3.6 Non-subspace examples: systematic failure analysis ===
print("=== {x : Ax = b} with b != 0 (affine subspace, NOT a subspace) ===")
A_aff = np.array([[1.0, 1.0], [0.0, 1.0]])
b_aff = np.array([3.0, 1.0])
x_p = np.linalg.solve(A_aff, b_aff)
print(f" A*0 = {A_aff @ np.zeros(2)} != b = {b_aff} -> 0 not in set (FAILS)")
# Also: if x1, x2 are solutions, x1+x2 is not: A(x1+x2) = 2b != b
print(f" If x is a solution, A*(2x) = 2b = {2*b_aff} != b -> not closed under scalar mult")
print("\n=== Unit ball {{x : ||x|| <= 1}} ===")
x_ball = np.array([0.8, 0.5])
print(f" x = {x_ball}, ||x|| = {np.linalg.norm(x_ball):.3f} <= 1 (in ball)")
print(f" 2*x = {2*x_ball}, ||2x|| = {np.linalg.norm(2*x_ball):.3f} > 1 (NOT in ball -> FAILS)")
print(f" Contains 0? Yes. Closed under +? No ((1,0)+(0,1) has norm sqrt(2) > 1). Scalar mult? No.")
print("\n=== First quadrant {{(x,y) : x>=0, y>=0}} ===")
v_quad = np.array([1.0, 2.0])
print(f" v = {v_quad} (in first quadrant)")
print(f" (-1)*v = {-1*v_quad} (NOT in first quadrant -> FAILS scalar mult)")
print(f" Contains 0? Yes. Closed under +? Yes. But fails scalar mult.")
print("\n=== Invertible matrices (NOT a subspace of R^{nxn}) ===")
print(" The zero matrix is NOT invertible -> 0 not in set (FAILS immediately)")
print(" Also: A + (-A) = 0, which is not invertible -> not closed under addition")
Code cell 26
# === 3.7 The two trivial subspaces ===
# Every vector space V has exactly two subspaces that always exist:
# 1. {0} -- the trivial subspace, dimension 0
# 2. V -- the whole space, dimension dim(V)
print("The two trivial subspaces of R^4:")
print(f" {{0}}: dim = 0")
print(f" R^4: dim = 4")
print(f" Any subspace W satisfies {{0}} ⊆ W ⊆ R^4")
print(f" Possible subspace dimensions in R^4: 0, 1, 2, 3, 4")
# Demonstrate with null spaces of different ranks
for r in range(5):
    if r == 0:
        M = np.zeros((1, 4))
    elif r == 4:
        M = np.eye(4)
    else:
        M = np.random.randn(r, 4)
    N = null_space(M)
    dim_null = N.shape[1] if N.size > 0 else 0
    print(f" rank-{r} matrix -> null space dim = {dim_null}")
Visualising subspaces in ℝ³
In ℝ³, subspaces can only be:
- The origin {0} (dim 0)
- A line through the origin (dim 1)
- A plane through the origin (dim 2)
- All of ℝ³ (dim 3)
No curves, no spheres, no shifted planes — only flat spaces through the origin.
Code cell 28
# === 3.8 Visualise subspaces of R^3 ===
if HAS_MPL:
    from mpl_toolkits.mplot3d import Axes3D
    fig = plt.figure(figsize=(14, 5))
    # 1D subspace: line through origin in direction (1,2,1)
    ax1 = fig.add_subplot(131, projection='3d')
    d_line = np.array([1, 2, 1], dtype=float)
    t = np.linspace(-2, 2, 50)
    ax1.plot(t*d_line[0], t*d_line[1], t*d_line[2], 'b-', linewidth=2)
    ax1.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax1.set_title('1D subspace (line)', fontsize=10)
    ax1.set_xlabel('x'); ax1.set_ylabel('y'); ax1.set_zlabel('z')
    # 2D subspace: plane x + y - z = 0
    ax2 = fig.add_subplot(132, projection='3d')
    s_range = np.linspace(-2, 2, 20)
    S, T = np.meshgrid(s_range, s_range)
    # Two basis vectors for the plane x+y-z=0: (1,0,1) and (0,1,1)
    X_plane = S * 1 + T * 0
    Y_plane = S * 0 + T * 1
    Z_plane = S * 1 + T * 1
    ax2.plot_surface(X_plane, Y_plane, Z_plane, alpha=0.3, color='cyan')
    ax2.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax2.set_title('2D subspace (plane)', fontsize=10)
    ax2.set_xlabel('x'); ax2.set_ylabel('y'); ax2.set_zlabel('z')
    # NOT a subspace: plane x + y + z = 1 (does not pass through origin)
    ax3 = fig.add_subplot(133, projection='3d')
    X_aff = S * 1 + T * 0
    Y_aff = S * 0 + T * 1
    Z_aff = 1 - X_aff - Y_aff
    ax3.plot_surface(X_aff, Y_aff, Z_aff, alpha=0.3, color='salmon')
    ax3.scatter([0], [0], [0], color='red', s=60, zorder=5, label='origin')
    ax3.set_title('NOT a subspace (affine)', fontsize=10)
    ax3.set_xlabel('x'); ax3.set_ylabel('y'); ax3.set_zlabel('z')
    plt.tight_layout()
    plt.show()
else:
    print("(matplotlib not available — skipping 3D subspace plots)")
4. Span and Linear Combinations
A linear combination of vectors v₁, …, v_k is any vector of the form:
α₁v₁ + α₂v₂ + ⋯ + α_k v_k
where α₁, …, α_k are arbitrary scalars (the coefficients).
The span of a set S = {v₁, …, v_k} is the set of all linear combinations:
span(S) = {α₁v₁ + ⋯ + α_k v_k : αᵢ ∈ ℝ}
Key facts about span
- span(S) is always a subspace (the smallest subspace containing all vectors in S).
- span(∅) = {0} by convention.
- If w ∈ span(S), then span(S ∪ {w}) = span(S).
- A set S spans V if span(S) = V.
Geometric pictures
- span{v}, v ≠ 0: a line through the origin (1D)
- span{v₁, v₂} (non-parallel): a plane through the origin (2D)
- span{v₁, v₂, v₃} (not coplanar): a 3D subspace
Connection to matrices
Given v₁, …, v_k ∈ ℝ^m, form A = [v₁ | ⋯ | v_k] (vectors as columns); then span{v₁, …, v_k} = col(A).
To check if b ∈ span{v₁, …, v_k}: solve Ax = b. Consistent iff rank(A) = rank([A | b]).
Code cell 30
# === 4.1 Linear combinations and span in R^3 ===
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([1.0, 1.0, 0.0]) # v3 = v1 + v2 (redundant)
# Linear combination
combo = 2*v1 + 3*v2 - 1*v3
print("v1 =", v1)
print("v2 =", v2)
print("v3 = v1 + v2 =", v3)
print("2*v1 + 3*v2 - 1*v3 =", combo, " (lies in the xy-plane)")
# span{v1, v2} = xy-plane (2D subspace of R^3)
# span{v1, v2, v3} = same xy-plane (v3 is redundant)
V12 = np.column_stack([v1, v2])
V123 = np.column_stack([v1, v2, v3])
print(f"\nrank([v1|v2]) = {np.linalg.matrix_rank(V12)} -> span is 2D")
print(f"rank([v1|v2|v3]) = {np.linalg.matrix_rank(V123)} -> adding v3 didn't increase span")
print("v3 is redundant: it's already in span{v1, v2}.")
Code cell 31
# === 4.2 Checking membership in a span ===
# Does b lie in span{v1, v2}?
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 1.0])
V = np.column_stack([v1, v2])
b_in = np.array([2.0, 5.0, 7.0]) # 2*v1 + 1*v2
b_out = np.array([1.0, 0.0, 0.0]) # not in span
for label, b in [("b_in", b_in), ("b_out", b_out)]:
    augmented = np.column_stack([V, b])
    rank_V = np.linalg.matrix_rank(V)
    rank_aug = np.linalg.matrix_rank(augmented)
    in_span = rank_V == rank_aug
    print(f"{label} = {b}")
    print(f" rank(V) = {rank_V}, rank([V|b]) = {rank_aug}")
    print(f" In span? {in_span}")
    if in_span:
        coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
        print(f" Coefficients: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2")
        print(f" Verify: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2 = {V @ coeffs}")
    print()
Code cell 32
# === 4.3 Span = column space: the matrix viewpoint ===
# The columns of a weight matrix W span col(W) -- the reachable outputs.
# If rank(W) < m, some output directions are unreachable.
W = np.array([
[1.0, 2.0],
[2.0, 4.0],
[0.0, 1.0]
])
print(f"W (3x2) =\n{W}")
print(f"rank(W) = {np.linalg.matrix_rank(W)}")
print(f"col(W) is a {np.linalg.matrix_rank(W)}D subspace of R^3")
# Try many inputs: outputs always land in col(W)
C = col_space(W)
print(f"\nOrthonormal basis for col(W):\n{C}")
np.random.seed(0)
for i in range(5):
    x = np.random.randn(2)
    y = W @ x
    # Project y onto col(W) and check it equals y
    y_proj = C @ (C.T @ y)
    print(f" x={x} -> Wx={y} -> proj_col(W)(Wx)={y_proj} -> same? {np.allclose(y, y_proj)}")
print("\nAll outputs lie in col(W). Directions outside col(W) are unreachable.")
5. Linear Independence
Vectors $v_1, \dots, v_k$ are linearly independent if the only linear combination equal to the zero vector uses all-zero coefficients:
$$c_1 v_1 + c_2 v_2 + \dots + c_k v_k = 0 \implies c_1 = c_2 = \dots = c_k = 0$$
Equivalently: no vector in the set is a linear combination of the others. Each vector adds a genuinely new direction.
Linearly dependent: there exist non-trivial (not all-zero) coefficients $c_i$ such that $c_1 v_1 + \dots + c_k v_k = 0$. At least one vector is redundant.
Checking independence
Form $V = [\,v_1 \mid v_2 \mid \dots \mid v_k\,]$ with the vectors as columns and row reduce:
- Independent iff every column has a pivot iff $\operatorname{rank}(V) = k$
- Dependent iff $\operatorname{rank}(V) < k$
- Determinant test (square case): independent iff $\det(V) \neq 0$
Key properties
- Any set containing $0$ is dependent ($1 \cdot 0 = 0$ is a non-trivial combination)
- A single non-zero vector is always independent
- Two vectors are dependent iff one is a scalar multiple of the other
- More vectors than dimensions are always dependent ($k > n$ vectors in $\mathbb{R}^n$)
Code cell 34
# === 5.1 Independent vs dependent sets ===
# Independent set in R^3
ind = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float).T # standard basis
# Dependent set: third vector = first + second
dep = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float).T
print("Independent set (standard basis of R^3):")
print(f" rank = {np.linalg.matrix_rank(ind)}, k = {ind.shape[1]}")
print(f" rank == k? {np.linalg.matrix_rank(ind) == ind.shape[1]} -> independent")
print("\nDependent set {(1,0,0), (0,1,0), (1,1,0)}:")
print(f" rank = {np.linalg.matrix_rank(dep)}, k = {dep.shape[1]}")
print(f" rank < k? {np.linalg.matrix_rank(dep) < dep.shape[1]} -> dependent")
print(" Dependence relation: 1*(1,0,0) + 1*(0,1,0) + (-1)*(1,1,0) = (0,0,0)")
print(" Verify:", 1*dep[:,0] + 1*dep[:,1] + (-1)*dep[:,2])
Code cell 35
# === 5.2 Detecting dependence via RREF and the determinant ===
V = np.array([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0]
])
print("V =\n", V)
print(f"det(V) = {np.linalg.det(V):.6f}")
print(f"rank(V) = {np.linalg.matrix_rank(V)}")
R, pivots = rref(V)
print(f"\nRREF(V) =\n{R}")
print(f"Pivot columns: {pivots}")
print(f"Only {len(pivots)} pivots for 3 columns -> column 3 is dependent on columns 1,2")
print(f"Dependence: col3 = {R[0,2]:.0f}*col1 + {R[1,2]:.0f}*col2")
print(f"Verify: {R[0,2]}*(1,4,7) + {R[1,2]}*(2,5,8) = {R[0,2]*V[:,0] + R[1,2]*V[:,1]}")
print(f"col3 = {V[:,2]}")
Code cell 36
# === 5.3 More vectors than dimensions -> always dependent ===
np.random.seed(42)
n = 3
k = 5 # 5 vectors in R^3 -> must be dependent
V = np.random.randn(n, k)
print(f"{k} vectors in R^{n}:")
print(f" rank([v1|...|v{k}]) = {np.linalg.matrix_rank(V)}")
print(f" rank < k = {k}? {np.linalg.matrix_rank(V) < k} -> dependent (pigeonhole)")
print(f" At most {n} vectors in R^{n} can be independent.")
# The null space of V^T gives the dependence relations
N = null_space(V.T)
if N.shape[1] > 0:
c = N[:, 0]
print(f"\n Dependence coefficients: {c}")
print(f" Sum c_i * v_i = {V @ c}")
print(f" Is zero? {np.allclose(V @ c, 0)}")
Code cell 37
# === 5.4 AI context: superposition — more features than dimensions ===
# The superposition hypothesis: LLMs store more features than dimensions.
# If |features| > d, feature directions MUST be linearly dependent.
# This forces features to "share" dimensions and interfere with each other.
d = 4 # embedding dimension
n_features = 8 # more features than dimensions
np.random.seed(7)
# Random unit feature directions
features = np.random.randn(n_features, d)
features = features / np.linalg.norm(features, axis=1, keepdims=True)
print(f"{n_features} feature directions in R^{d}:")
print(f" rank = {np.linalg.matrix_rank(features)}")
print(f" {n_features} > {d} -> features are linearly dependent (superposition)")
# Measure interference via pairwise cosine similarities
G = features @ features.T
np.fill_diagonal(G, 0) # ignore self-similarity
print(f"\n Max |cosine| between features: {np.max(np.abs(G)):.4f}")
print(f" Mean |cosine| between features: {np.mean(np.abs(G[np.triu_indices(n_features, 1)])):.4f}")
print(" Non-zero cosines = interference between features.")
6. Basis and Dimension
A basis for a vector space $V$ is a set $\{b_1, \dots, b_n\}$ that is:
- Linearly independent — no vector is redundant
- Spanning — every vector in $V$ is a linear combination of the $b_i$
A basis is both a minimal spanning set and a maximal independent set.
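Both halves of that characterisation can be checked numerically: removing a basis vector destroys spanning, while adding any extra vector destroys independence. A minimal sketch with the standard basis of $\mathbb{R}^3$:

```python
import numpy as np

# Standard basis of R^3: independent AND spanning.
B = np.eye(3)

# Minimal spanning set: remove a vector -> still independent, no longer spans R^3.
B_minus = B[:, :2]
print("after removing e3: rank =", np.linalg.matrix_rank(B_minus), "(< 3, spanning fails)")

# Maximal independent set: add any vector -> still spans, no longer independent.
B_plus = np.column_stack([B, np.array([1.0, 1.0, 0.0])])
print("after adding (1,1,0): rank =", np.linalg.matrix_rank(B_plus),
      "with", B_plus.shape[1], "vectors (rank < k, independence fails)")
```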
The Unique Representation Theorem
If $\{b_1, \dots, b_n\}$ is a basis, then every $v \in V$ has a unique representation $v = c_1 b_1 + \dots + c_n b_n$. The scalars $c_i$ are the coordinates of $v$ with respect to the basis.
Dimension
All bases for $V$ have the same number of elements. This number is the dimension, $\dim V$:
| Space | Dimension |
|---|---|
| $P_n$ (polynomials of degree $\le n$) | $n+1$ |
| Symmetric $n \times n$ matrices | $n(n+1)/2$ |
Dimension counting theorems
Rank-Nullity: $\operatorname{rank}(A) + \dim \operatorname{null}(A) = n$ for $A \in \mathbb{R}^{m \times n}$.
Dimension of sum: $\dim(W_1 + W_2) = \dim W_1 + \dim W_2 - \dim(W_1 \cap W_2)$.
Code cell 39
# === 6.1 Coordinates in the standard basis vs a non-standard basis ===
# Standard basis for R^2: {e1, e2}
# Non-standard basis: B = {(1,1), (1,-1)}
B = np.array([[1.0, 1.0], [1.0, -1.0]]).T # columns are basis vectors
x = np.array([4.0, 2.0])
# Coordinates in standard basis = entries themselves
print(f"x = {x}")
print(f"[x]_standard = {x} (coordinates = entries)")
# Coordinates in basis B: solve B * coords = x
coords_B = np.linalg.solve(B, x)
print(f"\nBasis B = columns of\n{B}")
print(f"[x]_B = {coords_B}")
print(f"Verify: {coords_B[0]}*(1,1) + {coords_B[1]}*(1,-1) = {B @ coords_B}")
# The vector is unchanged; only the coordinate representation changes
print(f"\nSame vector, different coordinates:")
print(f" Standard: ({x[0]}, {x[1]})")
print(f" Basis B: ({coords_B[0]}, {coords_B[1]})")
Code cell 40
# === 6.2 Change of basis ===
# Two bases for R^2
B1 = np.array([[1.0, 1.0], [1.0, -1.0]]).T # columns = basis vectors
B2 = np.array([[2.0, 1.0], [0.0, 1.0]]).T
# Change-of-basis matrix from B1 to B2: P = B2^{-1} B1
P_B1_to_B2 = np.linalg.solve(B2, B1)
print("Change-of-basis matrix P (B1 -> B2):")
print(P_B1_to_B2)
# A vector x with known coords in B1
x = np.array([4.0, 2.0])
coords_B1 = np.linalg.solve(B1, x)
coords_B2 = P_B1_to_B2 @ coords_B1
coords_B2_direct = np.linalg.solve(B2, x)
print(f"\nx = {x}")
print(f"[x]_B1 = {coords_B1}")
print(f"[x]_B2 via change-of-basis = {coords_B2}")
print(f"[x]_B2 via direct solve = {coords_B2_direct}")
print(f"Match? {np.allclose(coords_B2, coords_B2_direct)}")
# Inverse change of basis
P_B2_to_B1 = np.linalg.inv(P_B1_to_B2)
print(f"\nP_B2_to_B1 @ P_B1_to_B2 = I? {np.allclose(P_B2_to_B1 @ P_B1_to_B2, np.eye(2))}")
Code cell 41
# === 6.3 Rank-Nullity theorem: conservation of dimensions ===
np.random.seed(42)
for m, n in [(3, 5), (4, 4), (5, 3), (3, 3)]:
A = np.random.randn(m, n)
r = np.linalg.matrix_rank(A)
nullity = n - r
print(f"A is {m}x{n}: rank = {r}, nullity = {nullity}, rank + nullity = {r + nullity} = n = {n}")
# Concrete example with structure
A = np.array([
[1.0, 2.0, 1.0, 0.0],
[0.0, 0.0, 1.0, 1.0],
[1.0, 2.0, 2.0, 1.0]
])
r = np.linalg.matrix_rank(A)
N = null_space(A)
print(f"\nA (3x4) =\n{A}")
print(f"rank = {r}, nullity = {N.shape[1]}")
print(f"rank + nullity = {r + N.shape[1]} = n = {A.shape[1]}")
print(f"Null space basis:\n{N}")
print(f"Verify A @ N = 0? {np.allclose(A @ N, 0)}")
7. The Four Fundamental Subspaces
For $A \in \mathbb{R}^{m \times n}$ with rank $r$, there are four fundamental subspaces:
| Subspace | Lives in | Dimension | Definition |
|---|---|---|---|
| Column space $\operatorname{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\{Ax : x \in \mathbb{R}^n\}$ — reachable outputs |
| Null space $\operatorname{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | $\{x : Ax = 0\}$ — invisible inputs |
| Row space $\operatorname{row}(A) = \operatorname{col}(A^\top)$ | $\mathbb{R}^n$ | $r$ | directions $A$ listens to |
| Left null space $\operatorname{null}(A^\top)$ | $\mathbb{R}^m$ | $m - r$ | unreachable output directions |
The orthogonality relations
$$\operatorname{row}(A) \perp \operatorname{null}(A) \ \text{ in } \mathbb{R}^n, \qquad \operatorname{col}(A) \perp \operatorname{null}(A^\top) \ \text{ in } \mathbb{R}^m$$
Strang's Big Picture
R^n R^m
┌──────────────────┐ ┌──────────────────┐
│ row(A) │ │ col(A) │
│ dim = r │────A─────▶│ dim = r │
│ │ │ │
│ ⊕ │ │ ⊕ │
│ │ │ │
│ null(A) │────A─────▶│ {0} │
│ dim = n − r │ (→ 0) │ │
└──────────────────┘ └──────────────────┘
$A$ is a bijection from $\operatorname{row}(A)$ to $\operatorname{col}(A)$. $A$ maps $\operatorname{null}(A)$ to $\{0\}$.
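The bijection can be made concrete with the Moore-Penrose pseudoinverse: $A^+ A$ is the orthogonal projector onto $\operatorname{row}(A)$, and $A^+$ inverts $A$ on the row space. A minimal sketch with a rank-2 matrix:

```python
import numpy as np

# Rank-2 matrix: A restricted to row(A) is a bijection onto col(A),
# and the pseudoinverse A^+ implements the inverse col(A) -> row(A).
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 2.0]])
A_pinv = np.linalg.pinv(A)

x = np.array([2.0, -1.0, 3.0])
x_row = A_pinv @ (A @ x)        # A^+ A = orthogonal projector onto row(A)

print("x     =", x)
print("x_row =", x_row, "(null-space component of x is gone)")
print("A x_row == A x ?", np.allclose(A @ x_row, A @ x))           # same output
print("A^+ (A x_row) == x_row ?", np.allclose(A_pinv @ (A @ x_row), x_row))
```

The round trip row(A) → col(A) → row(A) is lossless; only the null-space part of $x$, which $A$ sends to zero, can never be recovered.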
Code cell 43
# === 7.1 Computing all four fundamental subspaces via SVD ===
A = np.array([
[1.0, 2.0, 3.0],
[2.0, 4.0, 6.0],
[1.0, 1.0, 2.0]
])
U, S, Vt = np.linalg.svd(A, full_matrices=True)
r = np.linalg.matrix_rank(A)
m, n = A.shape
col_basis = U[:, :r]
null_A_basis = Vt[r:].T
row_basis = Vt[:r]
left_null_basis = U[:, r:]
print(f"A ({m}x{n}), rank = {r}")
print(f"Singular values: {S}")
print(f"\n--- Four Fundamental Subspaces ---")
print(f"\nColumn space (dim {r}, lives in R^{m}):")
print(col_basis)
print(f"\nNull space (dim {n-r}, lives in R^{n}):")
print(null_A_basis)
print(f"\nRow space (dim {r}, lives in R^{n}):")
print(row_basis)
print(f"\nLeft null space (dim {m-r}, lives in R^{m}):")
print(left_null_basis)
# Dimension check
print(f"\n--- Dimension checks ---")
print(f"col + left_null = {r} + {m-r} = {m} = m")
print(f"row + null = {r} + {n-r} = {n} = n")
Code cell 44
# === 7.2 Orthogonality verification and direct-sum decomposition ===
# Verify: row(A) ⊥ null(A) and col(A) ⊥ null(A^T)
print("Orthogonality checks:")
if null_A_basis.size > 0:
cross = row_basis @ null_A_basis
print(f" row(A)^T @ null(A) = \n{cross}")
print(f" All zero? {np.allclose(cross, 0)}")
if left_null_basis.size > 0:
cross2 = col_basis.T @ left_null_basis
print(f" col(A)^T @ null(A^T) = \n{cross2}")
print(f" All zero? {np.allclose(cross2, 0)}")
# Direct-sum decomposition in the domain R^n
z = np.array([2.0, -1.0, 3.0])
row_comp = row_basis.T @ (row_basis @ z) # projection onto row space
null_comp = null_A_basis @ (null_A_basis.T @ z) # projection onto null space
print(f"\n--- Direct-sum decomposition of z = {z} in R^{n} ---")
print(f" Row-space component: {row_comp}")
print(f" Null-space component: {null_comp}")
print(f" Sum: {row_comp + null_comp}")
print(f" Equals z? {np.allclose(row_comp + null_comp, z)}")
print(f" Orthogonal? dot = {np.dot(row_comp, null_comp):.10f}")
print(f"\n A maps row-comp to: {A @ row_comp}")
print(f" A maps null-comp to: {A @ null_comp} (= 0, as expected)")
Code cell 45
# === 7.3 AI interpretation: weight matrix subspaces ===
# For a neural network layer W in R^{m x n}:
# col(W) = reachable output directions
# null(W) = input directions the layer ignores
# row(W) = input directions the layer listens to
np.random.seed(42)
m, n, r = 6, 8, 3 # 6 outputs, 8 inputs, effective rank 3
# Create a rank-r weight matrix (simulating a bottleneck layer)
U_full = np.random.randn(m, r)
V_full = np.random.randn(r, n)
W = U_full @ V_full
print(f"Weight matrix W: {m}x{n}, rank = {np.linalg.matrix_rank(W)}")
print(f"\nWhat this means for the layer:")
print(f" col(W): {r}D subspace of R^{m} -> only {r} of {m} output directions are reachable")
print(f" null(W): {n-r}D subspace of R^{n} -> {n-r} input directions are ignored")
print(f" row(W): {r}D subspace of R^{n} -> only {r} input directions matter")
print(f"\n Dead output dimensions: {m-r}")
print(f" Ignored input dimensions: {n-r}")
# Demonstrate: two inputs differing only in null(W) produce the same output
N_W = null_space(W)
x = np.random.randn(n)
z = N_W @ np.random.randn(N_W.shape[1]) * 5 # large null-space perturbation
print(f"\n x and x+z differ by a null-space vector:")
print(f" ||z|| = {np.linalg.norm(z):.4f} (large perturbation)")
print(f" W*x = {W @ x}")
print(f" W*(x+z)= {W @ (x + z)}")
print(f" Same output? {np.allclose(W @ x, W @ (x + z))} (null space is invisible)")
8. Subspace Operations
Sum of subspaces
The sum $W_1 + W_2 = \{w_1 + w_2 : w_1 \in W_1,\ w_2 \in W_2\}$ is the smallest subspace containing both $W_1$ and $W_2$.
Dimension formula: $\dim(W_1 + W_2) = \dim W_1 + \dim W_2 - \dim(W_1 \cap W_2)$.
Intersection of subspaces
$W_1 \cap W_2$ is always a subspace — the largest subspace contained in both. Note: the union $W_1 \cup W_2$ is generally not a subspace.
Direct sum
$V = W_1 \oplus W_2$ if $V = W_1 + W_2$ and $W_1 \cap W_2 = \{0\}$.
Every $v \in V$ has a unique decomposition $v = w_1 + w_2$ with $w_1 \in W_1$, $w_2 \in W_2$.
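Uniqueness is easy to see computationally: stacking bases of the two summands gives an invertible matrix, so the decomposition coefficients are the unique solution of a linear system. A minimal sketch with two non-orthogonal lines in $\mathbb{R}^2$:

```python
import numpy as np

# R^2 = W1 (+) W2 with W1 = span{(1,0)}, W2 = span{(1,1)} -- not orthogonal,
# but still a direct sum: the two lines meet only at the origin.
w1 = np.array([1.0, 0.0])
w2 = np.array([1.0, 1.0])
B = np.column_stack([w1, w2])   # invertible <=> the sum is direct and fills R^2

v = np.array([3.0, 2.0])
c = np.linalg.solve(B, v)       # unique coefficients => unique decomposition
comp1, comp2 = c[0] * w1, c[1] * w2
print("v =", comp1, "+", comp2)
print("recombines to v?", np.allclose(comp1 + comp2, v))
```

Note the summands need not be orthogonal — direct sum only requires trivial intersection.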
Orthogonal projection
For a subspace $W$ with orthonormal basis $\{q_1, \dots, q_k\}$ (columns of $Q$):
$$\operatorname{proj}_W(v) = \sum_{i=1}^{k} \langle v, q_i \rangle\, q_i = QQ^\top v$$
The projection matrix $P = QQ^\top$ satisfies $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric).
Code cell 47
# === 8.1 Sum and intersection of subspaces ===
# W1 = span{(1,0,0), (0,1,0)} = xy-plane
# W2 = span{(0,1,0), (0,0,1)} = yz-plane
# W1 + W2 = R^3 (every vector is reachable)
# W1 ∩ W2 = span{(0,1,0)} = y-axis (shared direction)
B1 = np.array([[1,0,0],[0,1,0]], dtype=float).T
B2 = np.array([[0,1,0],[0,0,1]], dtype=float).T
# Sum: combine all basis vectors and find rank
B_sum = np.column_stack([B1, B2])
dim_sum = np.linalg.matrix_rank(B_sum)
dim_W1 = np.linalg.matrix_rank(B1)
dim_W2 = np.linalg.matrix_rank(B2)
dim_intersect = dim_W1 + dim_W2 - dim_sum
print("W1 = xy-plane (dim 2), W2 = yz-plane (dim 2)")
print(f"dim(W1) = {dim_W1}")
print(f"dim(W2) = {dim_W2}")
print(f"dim(W1 + W2) = {dim_sum}")
print(f"dim(W1 ∩ W2) = dim(W1) + dim(W2) - dim(W1+W2) = {dim_W1}+{dim_W2}-{dim_sum} = {dim_intersect}")
print(f"\nDimension formula verified: {dim_sum} = {dim_W1} + {dim_W2} - {dim_intersect}")
print("W1 ∩ W2 = y-axis (the only direction shared by both planes)")
Code cell 48
# === 8.2 Union is NOT a subspace (important counterexample) ===
# W1 = x-axis = span{(1,0)}, W2 = y-axis = span{(0,1)}
v1 = np.array([1.0, 0.0]) # in W1
v2 = np.array([0.0, 1.0]) # in W2
v_sum = v1 + v2 # (1,1) -- not in W1 or W2
print("W1 = x-axis, W2 = y-axis")
print(f" (1,0) in W1 ∪ W2? Yes")
print(f" (0,1) in W1 ∪ W2? Yes")
print(f" (1,0)+(0,1) = {v_sum}")
print(f" (1,1) in W1? No (not on x-axis)")
print(f" (1,1) in W2? No (not on y-axis)")
print(f" (1,1) in W1 ∪ W2? No -> union is NOT closed under addition")
print(f"\n W1 ∪ W2 is NOT a subspace.")
print(f" W1 + W2 = R^2 IS a subspace (the correct notion of 'combining' subspaces).")
Code cell 49
# === 8.3 Orthogonal projection onto a subspace ===
# W = span{(1,1,0), (0,1,1)} in R^3
W_basis = np.array([[1,1,0],[0,1,1]], dtype=float).T
# Orthonormalize the basis
Q = gram_schmidt(W_basis.T).T # columns are orthonormal basis for W
print("Orthonormal basis Q for W:")
print(Q)
print(f"Q^T Q = \n{Q.T @ Q}")
# Projection matrix P = Q Q^T
P = Q @ Q.T
print(f"\nProjection matrix P = QQ^T:")
print(P)
print(f"P^2 = P (idempotent)? {np.allclose(P @ P, P)}")
print(f"P^T = P (symmetric)? {np.allclose(P.T, P)}")
print(f"rank(P) = {np.linalg.matrix_rank(P)} = dim(W)")
# Project a vector
v = np.array([3.0, 1.0, 2.0])
proj_v = P @ v
residual = v - proj_v
print(f"\nv = {v}")
print(f"Proj_W(v) = {proj_v}")
print(f"Residual v - Proj_W(v) = {residual}")
print(f"Residual ⊥ W? Q^T @ residual = {Q.T @ residual}")
print(f"||v||^2 = ||proj||^2 + ||resid||^2? {np.linalg.norm(v)**2:.4f} = {np.linalg.norm(proj_v)**2:.4f} + {np.linalg.norm(residual)**2:.4f} = {np.linalg.norm(proj_v)**2 + np.linalg.norm(residual)**2:.4f}")
Code cell 50
# === 8.4 Principal angles between subspaces ===
# The principal angles between two subspaces measure how "aligned" they are.
# cos(theta_i) = sigma_i(Q1^T Q2)
np.random.seed(42)
# Two 2D subspaces of R^4
A1 = np.random.randn(4, 2)
A2 = np.random.randn(4, 2)
Q1 = gram_schmidt(A1.T).T # orthonormal basis, columns
Q2 = gram_schmidt(A2.T).T
cos_angles = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))
print("Two random 2D subspaces of R^4:")
print(f" cos(principal angles) = {cos_angles}")
print(f" principal angles (degrees) = {angles_deg}")
print(f"\n First angle = {angles_deg[0]:.1f}° (closest directions between subspaces)")
print(f" Second angle = {angles_deg[1]:.1f}° (next pair)")
# Identical subspaces -> angles = 0
cos_same = np.linalg.svd(Q1.T @ Q1, compute_uv=False)
print(f"\n Same subspace: cos(angles) = {cos_same} (all 1 -> all angles 0°)")
# Orthogonal subspaces: compare Q1 with its own orthogonal complement
# (a fixed subspace like span{e3, e4} is NOT orthogonal to a random Q1)
Q_orth = null_space(Q1.T)  # columns span the orthogonal complement of col(Q1)
cos_orth = np.linalg.svd(Q1.T @ Q_orth, compute_uv=False)
angles_orth = np.degrees(np.arccos(np.clip(cos_orth, -1, 1)))
print(f" Orthogonal complement: angles = {angles_orth} degrees (all 90)")
9. Inner Product Spaces and Orthogonality
An inner product on $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:
- Symmetry: $\langle u, v \rangle = \langle v, u \rangle$
- Linearity: $\langle \alpha u + \beta w, v \rangle = \alpha \langle u, v \rangle + \beta \langle w, v \rangle$
- Positive definiteness: $\langle v, v \rangle \ge 0$, with equality iff $v = 0$
Induced norm: $\|v\| = \sqrt{\langle v, v \rangle}$. Cauchy-Schwarz: $|\langle u, v \rangle| \le \|u\|\,\|v\|$.
Standard inner products
| Space | Inner product | Induced norm |
|---|---|---|
| $\mathbb{R}^n$ | $\langle u, v \rangle = u^\top v$ | $\ell_2$ norm |
| $\mathbb{R}^{m \times n}$ | $\langle A, B \rangle = \operatorname{tr}(A^\top B)$ | Frobenius norm |
| $L^2(\Omega)$ | $\langle f, g \rangle = \int_\Omega f g$ | $L^2$ norm |
Orthogonality and Gram-Schmidt
- $u \perp v$ if $\langle u, v \rangle = 0$
- Pythagorean theorem: $u \perp v \implies \|u + v\|^2 = \|u\|^2 + \|v\|^2$
- Orthogonal/orthonormal sets are always linearly independent
- Gram-Schmidt converts any independent set to an orthonormal basis for the same span
- The orthogonal complement $W^\perp = \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}$
- $V = W \oplus W^\perp$ and $(W^\perp)^\perp = W$
Code cell 52
# === 9.1 Inner product properties and Cauchy-Schwarz ===
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])
w = np.array([0.0, 1.0, -1.0])
alpha, beta = 2.0, -0.5
# Verify inner product axioms for standard dot product
print("Inner product axioms (standard dot product on R^3):")
print(f" Symmetry: <u,v> = {np.dot(u,v)}, <v,u> = {np.dot(v,u)}, equal? {np.isclose(np.dot(u,v), np.dot(v,u))}")
print(f" Linearity: <au+bw,v> = {np.dot(alpha*u+beta*w, v):.4f}")
print(f" a<u,v>+b<w,v> = {alpha*np.dot(u,v)+beta*np.dot(w,v):.4f}")
print(f" equal? {np.isclose(np.dot(alpha*u+beta*w, v), alpha*np.dot(u,v)+beta*np.dot(w,v))}")
print(f" Pos. def.: <u,u> = {np.dot(u,u):.4f} >= 0")
print(f" <0,0> = {np.dot(np.zeros(3), np.zeros(3)):.4f} = 0")
# Cauchy-Schwarz
lhs = abs(np.dot(u, v))
rhs = np.linalg.norm(u) * np.linalg.norm(v)
print(f"\nCauchy-Schwarz: |<u,v>| = {lhs:.4f} <= ||u|| ||v|| = {rhs:.4f}? {lhs <= rhs + 1e-12}")
# Frobenius inner product for matrices
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
fro_ip = np.trace(A.T @ B)
print(f"\nFrobenius inner product: <A,B> = tr(A^T B) = {fro_ip}")
print(f"Manual sum: {np.sum(A * B)}")
print(f"||A||_F = sqrt(<A,A>) = {np.sqrt(np.trace(A.T @ A)):.4f} = {np.linalg.norm(A, 'fro'):.4f}")
Code cell 53
# === 9.2 Gram-Schmidt orthogonalisation: step by step ===
vectors = np.array([
[1.0, 1.0, 0.0],
[1.0, 0.0, 1.0],
[0.0, 1.0, 1.0]
], dtype=float)
print("Input vectors (rows):")
for i, v in enumerate(vectors):
print(f" v{i+1} = {v}")
Q = gram_schmidt(vectors)
print("\nGram-Schmidt output (orthonormal):")
for i, q in enumerate(Q):
print(f" q{i+1} = {q}")
print("\nVerification:")
print(f" Q Q^T (should be I):\n{Q @ Q.T}")
print(f" Orthonormal? {np.allclose(Q @ Q.T, np.eye(len(Q)))}")
# Same span check: rank([v1..vk]) == rank([q1..qk])
V_mat = vectors.T
Q_mat = Q.T
print(f"\n rank(original vectors) = {np.linalg.matrix_rank(V_mat)}")
print(f" rank(orthonormal basis) = {np.linalg.matrix_rank(Q_mat)}")
print(f" Same span? {np.linalg.matrix_rank(V_mat) == np.linalg.matrix_rank(Q_mat)}")
# Connection to QR decomposition
A_qr = vectors.T # columns = original vectors
Q_qr, R_qr = np.linalg.qr(A_qr)
print(f"\n QR decomposition: A = QR")
print(f" Q (orthonormal columns):\n{Q_qr}")
print(f" R (upper triangular):\n{R_qr}")
print(f" A = QR? {np.allclose(A_qr, Q_qr @ R_qr)}")
Code cell 54
# === 9.3 Orthogonal complement and V = W ⊕ W⊥ ===
# W = null(A) for some A. Then W⊥ = row(A).
# This is exactly the fundamental subspace decomposition.
A = np.array([
[1.0, 2.0, 1.0],
[0.0, 1.0, 1.0]
])
W_basis = null_space(A) # null(A) = W
W_perp_basis = row_space(A).T # row(A) = W⊥
dim_W = W_basis.shape[1] if W_basis.size > 0 else 0
dim_Wp = W_perp_basis.shape[1] if W_perp_basis.size > 0 else 0
print(f"A ({A.shape[0]}x{A.shape[1]}), rank = {np.linalg.matrix_rank(A)}")
print(f"W = null(A): dim = {dim_W}")
print(f"W⊥ = row(A): dim = {dim_Wp}")
print(f"dim(W) + dim(W⊥) = {dim_W + dim_Wp} = n = {A.shape[1]}")
# Verify orthogonality
if dim_W > 0 and dim_Wp > 0:
cross = W_perp_basis.T @ W_basis
print(f"\nW⊥^T @ W =\n{cross}")
print(f"All zero (orthogonal)? {np.allclose(cross, 0)}")
# Decompose a vector
v = np.array([3.0, 1.0, 2.0])
# Project onto W and W⊥
P_W = W_basis @ np.linalg.pinv(W_basis) if dim_W > 0 else np.zeros((3,3))
P_Wp = W_perp_basis @ np.linalg.pinv(W_perp_basis) if dim_Wp > 0 else np.zeros((3,3))
v_W = P_W @ v
v_Wp = P_Wp @ v
print(f"\nv = {v}")
print(f"W-component: {v_W}")
print(f"W⊥-component: {v_Wp}")
print(f"Sum = v? {np.allclose(v_W + v_Wp, v)}")
print(f"Orthogonal? <v_W, v_Wp> = {np.dot(v_W, v_Wp):.10f}")
10. Affine Subspaces and Quotient Spaces
Affine subspaces
An affine subspace is a translation of a linear subspace: $A = v_0 + W = \{v_0 + w : w \in W\}$.
- Does not pass through the origin (unless $v_0 \in W$).
- The solution set of $Ax = b$ with $b \neq 0$ is an affine subspace: $x_p + \operatorname{null}(A)$ for any particular solution $x_p$.
- Same dimension as the underlying $W$, but NOT a subspace of $V$.
Affine and convex combinations
- Affine combination: $\sum_i \alpha_i v_i$ where $\sum_i \alpha_i = 1$ (coefficients sum to 1)
- Convex combination: affine combination with $\alpha_i \ge 0$ (stay in the convex hull)
- AI: interpolation between embeddings is a convex combination
Quotient spaces (intuition)
"forgets" the -directions: two vectors are equivalent if they differ by an element of .
. In normalisation layers, mean-subtraction projects onto a quotient.
Code cell 56
# === 10.1 Affine subspace: solution set of Ax = b ===
A = np.array([
[1.0, 1.0, 1.0],
[0.0, 1.0, 2.0]
])
b = np.array([6.0, 4.0])
# Particular solution
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)
# Null space
N = null_space(A)
print("System Ax = b:")
print(f"A =\n{A}")
print(f"b = {b}")
print(f"\nParticular solution x_p = {x_p}")
print(f"Verify: A*x_p = {A @ x_p}")
print(f"\nNull space of A (dim = {N.shape[1]}):")
print(f" Basis:\n{N}")
# The full solution set is x_p + null(A)
print(f"\nSolution set = x_p + null(A):")
for t in [-2, -1, 0, 1, 2]:
x = x_p + t * N[:, 0]
print(f" t={t:>2}: x = {x}, A*x = {A @ x}")
print(f"\nAll solutions differ by null-space vectors.")
print(f"Solution set is an affine subspace, NOT a linear subspace (0 not in it).")
print(f"A*0 = {A @ np.zeros(3)} != b = {b}")
Code cell 57
# === 10.2 Convex combinations and embedding interpolation ===
# Interpolation between two embedding vectors
e_cat = np.array([0.8, 0.3, -0.1, 0.5])
e_dog = np.array([0.7, 0.4, -0.2, 0.6])
print("Embedding interpolation (convex combination):")
print(f" e_cat = {e_cat}")
print(f" e_dog = {e_dog}")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
interp = alpha * e_cat + (1 - alpha) * e_dog
cos_cat = cosine_similarity(interp, e_cat)
cos_dog = cosine_similarity(interp, e_dog)
print(f" alpha={alpha:.2f}: {interp} cos(cat)={cos_cat:.3f} cos(dog)={cos_dog:.3f}")
# Quotient space intuition: mean subtraction
print(f"\n--- Quotient space: mean subtraction ---")
x = np.array([5.0, 3.0, 7.0, 1.0])
mean = np.mean(x)
x_centered = x - mean
print(f"x = {x}, mean = {mean}")
print(f"x - mean = {x_centered}, sum = {x_centered.sum():.10f}")
print("Mean subtraction projects onto the hyperplane sum(x_i)=0.")
print("This is the quotient V / span{(1,1,...,1)}: forget the 'all-ones' direction.")
11. Subspaces in Functional Analysis
Infinite-dimensional spaces
In finite dimensions, every subspace is automatically closed. In infinite dimensions, there exist subspaces that are not closed — dense proper subspaces whose limits escape the subspace (e.g. the polynomials inside $C[0,1]$: every continuous function is a uniform limit of polynomials, but not every continuous function is a polynomial).
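A finite numerical sketch of this density phenomenon (a stand-in for the Weierstrass approximation theorem, not a proof): least-squares polynomial fits to $e^x$ on a grid improve without bound as the degree grows, even though $e^x$ itself is never a polynomial.

```python
import numpy as np

# Least-squares polynomial fits to exp on [0, 1]: error -> 0 as degree grows,
# yet exp is not a polynomial -- the polynomials form a dense proper subspace.
x = np.linspace(0.0, 1.0, 200)
f = np.exp(x)
errs = []
for deg in [1, 2, 4, 8]:
    coeffs = np.polyfit(x, f, deg)            # least-squares fit of given degree
    errs.append(np.max(np.abs(np.polyval(coeffs, x) - f)))
    print(f"degree {deg}: max |poly - exp| = {errs[-1]:.2e}")
```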
Function spaces relevant to AI
| Space | Description | AI application |
|---|---|---|
| $L^2(\Omega)$ | Square-integrable functions | Kernel methods, GP priors |
| Sobolev $H^k$ | Functions with weak derivatives in $L^2$ | PINNs, variational problems |
| RKHS $\mathcal{H}_k$ | Reproducing Kernel Hilbert Space | SVMs, kernel regression |
Krylov subspaces
$\mathcal{K}_k(A, b) = \operatorname{span}\{b, Ab, A^2 b, \dots, A^{k-1} b\}$ — a nested sequence of expanding subspaces. Iterative solvers (CG, GMRES, Lanczos) find the best solution within $\mathcal{K}_k$ and expand until convergence. Cost: $k$ matrix-vector products — far cheaper than direct methods for sparse systems.
Neural networks and function spaces
A neural network architecture defines a subset of function space — generally not a subspace (the sum of two neural networks is not typically representable by one network of the same architecture). But the tangent space at a parameter point $\theta$ is a subspace: the Neural Tangent Kernel (NTK) regime corresponds to learning in this tangent subspace.
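The tangent-subspace idea can be made concrete on a toy one-hidden-layer net (a hypothetical illustration, not any standard architecture): the columns of the Jacobian $J = \partial f / \partial \theta$ span the tangent subspace, and a small parameter step moves the function inside $\operatorname{col}(J)$ to first order.

```python
import numpy as np

# Toy net on a fixed input grid: f_theta(x) = sum_j a_j * tanh(w_j * x).
rng = np.random.default_rng(0)
xs = np.linspace(-2.0, 2.0, 50)
H = 4
theta = rng.standard_normal(2 * H)            # params: [w_1..w_H, a_1..a_H]

def net(theta):
    w, a = theta[:H], theta[H:]
    return np.tanh(np.outer(xs, w)) @ a       # function values on the grid

# Finite-difference Jacobian: column i = d(net)/d(theta_i)
eps = 1e-6
f0 = net(theta)
J = np.column_stack([(net(theta + eps * np.eye(2 * H)[i]) - f0) / eps
                     for i in range(2 * H)])

# A small parameter step changes the function, to first order, within col(J)
delta = 1e-3 * rng.standard_normal(2 * H)
df = net(theta + delta) - f0
rel_err = np.linalg.norm(df - J @ delta) / np.linalg.norm(df)
print(f"tangent subspace col(J): J is {J.shape} -> at most {2*H}-dimensional")
print(f"relative linearisation error: {rel_err:.2e} (small -> df lies in col(J))")
```

In the NTK regime the network stays close to its initialisation, so training effectively performs linear regression in this fixed tangent subspace.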
Code cell 59
# === 11.1 Krylov subspaces: expanding nested subspaces ===
np.random.seed(42)
n = 6
A = np.random.randn(n, n)
A = A + A.T # make symmetric for cleaner behaviour
b = np.random.randn(n)
print(f"Krylov subspaces K_k(A, b) for {n}x{n} symmetric A:")
print(f" K_k = span{{b, Ab, A^2 b, ..., A^(k-1) b}}\n")
for k in range(1, n + 1):
# Build Krylov basis: [b, Ab, A^2 b, ..., A^{k-1} b]
K = np.zeros((n, k))
K[:, 0] = b
for j in range(1, k):
K[:, j] = A @ K[:, j-1]
dim_k = np.linalg.matrix_rank(K)
print(f" K_{k}: {k} vectors, effective dim = {dim_k}", end="")
if dim_k < k:
print(f" (Krylov sequence stalled — rank stabilised)")
else:
print()
print(f"\n The Krylov subspace grows until it captures the full relevant space.")
print(f" CG/GMRES solve Ax=b by searching in K_k; converge when K_k is rich enough.")
Code cell 60
# === 11.2 RKHS intuition: kernel defines a subspace of functions ===
# A kernel k(x, x') defines an RKHS H_k.
# Given data {x_1, ..., x_n}, the representer theorem says the optimal
# predictor lies in span{k(x_1, .), k(x_2, .), ..., k(x_n, .)} — a
# finite-dimensional subspace of the infinite-dimensional H_k.
# RBF kernel: k(x, x') = exp(-||x-x'||^2 / (2*sigma^2))
def rbf_kernel(X, sigma=1.0):
sq_dists = np.sum(X**2, axis=1, keepdims=True) - 2*X@X.T + np.sum(X**2, axis=1)
return np.exp(-sq_dists / (2 * sigma**2))
np.random.seed(0)
X_data = np.sort(np.random.randn(8, 1), axis=0)
K = rbf_kernel(X_data, sigma=1.0)
print(f"Data points: {X_data.ravel()}")
print(f"\nKernel Gram matrix K (8x8):\n{K}")
print(f"rank(K) = {np.linalg.matrix_rank(K)}")
print(f"K is PSD? min eigenvalue = {np.min(np.linalg.eigvalsh(K)):.6f} >= 0")
print(f"\nThe optimal predictor lives in an {np.linalg.matrix_rank(K)}-dimensional")
print(f"subspace of the RKHS, spanned by the kernel functions at data points.")
12. Subspace Methods in Machine Learning
PCA as subspace finding
PCA finds the rank-$r$ subspace of $\mathbb{R}^d$ that captures the most variance:
$$\max_{Q \in \mathbb{R}^{d \times r},\ Q^\top Q = I_r} \operatorname{tr}(Q^\top C Q)$$
Solution: the top-$r$ eigenvectors of the covariance matrix $C = \tfrac{1}{n} X^\top X$ (for centred $X$).
LoRA as explicit subspace restriction
LoRA replaces $W$ with $W + \Delta W = W + BA$, where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$. This restricts updates to a rank-$r$ subspace. The intrinsic dimension of fine-tuning is often far smaller than the full parameter count (Aghajanyan et al. 2020).
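A minimal sketch of the rank-$r$ parametrisation $\Delta W = BA$ (the initialisation here is arbitrary illustration, not the paper's zero-init of $B$):

```python
import numpy as np

# delta_W = B @ A has rank <= r: every weight update lives in a tiny subspace
# of the m*n-dimensional weight space, with only r*(m+n) trainable parameters.
rng = np.random.default_rng(0)
m, n, r = 16, 32, 4
W = rng.standard_normal((m, n))           # frozen pretrained weights
B = 0.01 * rng.standard_normal((m, r))    # trainable low-rank factor
A = 0.01 * rng.standard_normal((r, n))    # trainable low-rank factor

delta_W = B @ A
print("rank(delta_W) =", np.linalg.matrix_rank(delta_W), "(bound: r =", r, ")")
print("full params:", m * n, "| LoRA params:", r * (m + n),
      f"({r * (m + n) / (m * n):.1%} of full)")

x = rng.standard_normal(n)
y = (W + delta_W) @ x                     # adapted forward pass; W itself untouched
```

Because $W$ stays frozen, the adapter can be merged ($W \leftarrow W + BA$) or removed at any time.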
Mechanistic interpretability
- Attention heads write to $\operatorname{col}(W_O^\top)$ and read from $\operatorname{col}(W_V)$ — specific subspaces of the residual stream $\mathbb{R}^d$.
- The superposition hypothesis: features outnumber dimensions, so they live in non-orthogonal subspaces with interference.
- LEACE (concept erasure): project out a subspace to remove a concept from representations.
Code cell 62
# === 12.1 PCA: finding the optimal subspace ===
np.random.seed(42)
n_samples = 200
d = 5
# Data lives mostly in a 2D subspace with noise in other directions
true_directions = np.array([
[1, 1, 0, 0, 0],
[0, 0, 1, 1, 0]
], dtype=float)
true_directions = true_directions / np.linalg.norm(true_directions, axis=1, keepdims=True)
coeffs = np.random.randn(n_samples, 2) * np.array([3.0, 2.0])
noise = np.random.randn(n_samples, d) * 0.1
X = coeffs @ true_directions + noise
# Center the data
mu = X.mean(axis=0)
X_centered = X - mu
# Covariance matrix and its eigendecomposition
C = (X_centered.T @ X_centered) / n_samples
eigvals, eigvecs = np.linalg.eigh(C)
# Sort descending
idx = np.argsort(eigvals)[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]
print(f"Data: {n_samples} points in R^{d}")
print(f"Eigenvalues of covariance matrix: {eigvals}")
# Variance explained
total_var = eigvals.sum()
for r in range(1, d + 1):
explained = eigvals[:r].sum() / total_var
print(f" Top-{r} PCs explain {explained:.4f} of total variance")
print(f"\nThe data lives approximately in a 2D subspace.")
print(f"PCA finds this subspace by maximising captured variance.")
Code cell 63
# === 12.2 Gradient subspace during training + LoRA rank analysis ===
#
# Empirical finding (Gur-Ari et al. 2018, Aghajanyan et al. 2020):
# Gradient updates across training steps lie in a subspace whose dimension
# is much smaller than the full parameter count p.
# LoRA exploits this by restricting delta_W to a rank-r subspace.
np.random.seed(0)
m, n = 64, 64
p = m * n
T = 50
grads = []
true_dirs = np.random.randn(5, p)   # one fixed 5-dim subspace shared by every step
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)
for t in range(T):
    # each gradient = signal in the shared subspace + small ambient noise
    g = (np.random.randn(5) @ true_dirs) + 0.01 * np.random.randn(p)
    grads.append(g)
G = np.array(grads) # T x p
Gram = G @ G.T / T # T x T
eigvals_gram = np.sort(np.linalg.eigvalsh(Gram))[::-1]
total_energy = eigvals_gram.sum()
cumulative = np.cumsum(eigvals_gram) / total_energy
print(f'Full parameter count p = {p:,}')
print(f'Gradient snapshots T = {T}')
print(f'\nTop-10 eigenvalues of gradient Gram matrix:')
for i in range(10):
print(f' lambda_{i+1:2d} = {eigvals_gram[i]:8.4f} cumulative variance = {cumulative[i]:.4f}')
dim_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f'\n90% of gradient energy in top-{dim_90} directions (full dim p = {p:,})')
print(f'Compression ratio: {p / dim_90:.1f}x')
print(f'\n--- LoRA rank analysis ---')
print(f'{"r":>4} {"params":>10} {"fraction":>10} {"covers 90% gradient subspace?":>32}')
for r in [1, 2, 4, 5, 8, 16, 32]:
params_lora = r * (m + n)
frac = params_lora / p
covers = 'YES' if r >= dim_90 else 'no'
print(f' r={r:>2} {params_lora:>10,} {frac:>10.4f} {covers}')
print(f'\nKey: the intrinsic gradient dimension (approx {dim_90}) sets the minimum')
print(f'LoRA rank needed to capture 90% of the gradient update information.')
print(f'Using r < {dim_90} loses information; using r > {dim_90} wastes capacity.')
Code cell 64
# === 12.3 Concept directions in embedding space ===
#
# The linear representation hypothesis: concepts are encoded as directions
# (1-D subspaces) in the residual stream.
# Classic: gender direction in word2vec; king - man + woman = queen.
np.random.seed(7)
d = 16
# Fixed semantic directions (random unit vectors in R^d)
gender_dir = np.random.randn(d)
gender_dir /= np.linalg.norm(gender_dir)
royalty_dir = np.random.randn(d)
royalty_dir -= royalty_dir @ gender_dir * gender_dir # orthogonalise
royalty_dir /= np.linalg.norm(royalty_dir)
base = np.random.randn(d) * 0.5
noise = lambda: np.random.randn(d) * 0.05
king = base + 2.0 * gender_dir + 2.0 * royalty_dir + noise()
queen = base - 2.0 * gender_dir + 2.0 * royalty_dir + noise()
man = base + 2.0 * gender_dir + noise()
woman = base - 2.0 * gender_dir + noise()
prince = base + 2.0 * gender_dir + 1.0 * royalty_dir + noise()
def cosine(u, v):
return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print('=== Embedding arithmetic: king - man + woman approx queen ===')
analogy = king - man + woman
print(f' cos(king-man+woman, queen) = {cosine(analogy, queen):+.4f} (should be near +1)')
print(f' cos(king-man+woman, king) = {cosine(analogy, king):+.4f}')
print(f' cos(king-man+woman, man) = {cosine(analogy, man):+.4f}')
print('\n=== Recovering the gender direction via difference of means ===')
male_words = np.array([king, man, prince])
female_words = np.array([queen, woman])
gender_est = male_words.mean(0) - female_words.mean(0)
gender_est /= np.linalg.norm(gender_est)
print(f' cos(estimated, true gender direction) = {cosine(gender_est, gender_dir):.4f}')
print('\n=== Concept erasure: project out the gender direction ===')
def project_out(v, direction):
return v - np.dot(v, direction) * direction
king_d = project_out(king, gender_est)
queen_d = project_out(queen, gender_est)
man_d = project_out(man, gender_est)
woman_d = project_out(woman, gender_est)
print(' After projecting out gender direction:')
print(f' cos(king_debiased, queen_debiased) = {cosine(king_d, queen_d):+.4f}')
print(f' cos(king_debiased, man_debiased) = {cosine(king_d, man_d):+.4f}')
print(f' cos(man_debiased, woman_debiased) = {cosine(man_d, woman_d):+.4f}')
print(' Gender contrast (man vs woman) is erased; royalty contrast remains.')
print('\n=== Superposition: more concepts than dimensions ===')
np.random.seed(3)
n_features, d_emb = 40, 16
feat = np.random.randn(n_features, d_emb)
feat /= np.linalg.norm(feat, axis=1, keepdims=True)
G_feat = feat @ feat.T
np.fill_diagonal(G_feat, 0)
avg_int = np.abs(G_feat).sum() / (n_features * (n_features - 1))
print(f' {n_features} features in d={d_emb} space')
print(f' Average |cos(fi, fj)| = {avg_int:.4f} (0 = orthogonal, impossible here)')
print(f' Since {n_features} > {d_emb}, features MUST be non-orthogonal (superposed).')
print(f' Interference is unavoidable -- this is the superposition hypothesis.')
Code cell 65
# === 12.4 Mechanistic interpretability: OV and QK subspace circuits ===
#
# Attention head h:
# W_Q, W_K in R^{d x d_k} -- query/key projections
# W_V in R^{d x d_v}, W_O in R^{d_v x d} -- value/output projections
#
# QK circuit: score(i,j) = x_i^T W_Q W_K^T x_j
# Rank of score matrix <= d_k regardless of sequence length.
# OV circuit: output = softmax(scores) V W_O
# Head writes to col(W_O^T) -- a d_v-dim subspace of R^d.
np.random.seed(11)
d, d_k, d_v, seq = 32, 8, 8, 6
W_Q = np.random.randn(d, d_k) * 0.1
W_K = np.random.randn(d, d_k) * 0.1
W_V = np.random.randn(d, d_v) * 0.1
W_O = np.random.randn(d_v, d) * 0.1
X = np.random.randn(seq, d)
QK_mat = W_Q @ W_K.T # d x d
rank_QK = np.linalg.matrix_rank(QK_mat)
scores = X @ QK_mat @ X.T # seq x seq
print(f'=== QK Circuit ===')
print(f' W_Q W_K^T: shape {QK_mat.shape}, rank = {rank_QK} (bound: d_k = {d_k})')
print(f' Score matrix S = X (W_Q W_K^T) X^T: shape {scores.shape}, rank = {np.linalg.matrix_rank(scores)}')
print(f' Key: {d}-dim residual stream compressed to {d_k} dims for attention comparison.')
def softmax(x):
e = np.exp(x - x.max(axis=-1, keepdims=True))
return e / e.sum(axis=-1, keepdims=True)
attn_weights = softmax(scores / np.sqrt(d_k))
V_proj = X @ W_V # seq x d_v
head_out = attn_weights @ V_proj @ W_O # seq x d
OV_mat = W_O.T @ W_V.T # d x d
rank_OV = np.linalg.matrix_rank(OV_mat)
print(f'\n=== OV Circuit ===')
print(f' W_O^T W_V^T: shape {OV_mat.shape}, rank = {rank_OV} (bound: d_v = {d_v})')
write_dim = np.linalg.matrix_rank(W_O.T)
read_dim = np.linalg.matrix_rank(W_V)
print(f' Write subspace col(W_O^T): dim = {write_dim} of {d}')
print(f' Read subspace col(W_V): dim = {read_dim} of {d}')
print(f' Bottleneck: {d} -> {d_k} (QK) -> score -> {d_v} (OV) -> {d} (write)')
print(f'\n=== Principal angles between write subspaces of two independent heads ===')
np.random.seed(22)
W_O2 = np.random.randn(d_v, d) * 0.1
U1 = np.linalg.svd(W_O.T, full_matrices=False)[0]
U2 = np.linalg.svd(W_O2.T, full_matrices=False)[0]
cos_angles = np.linalg.svd(U1.T @ U2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))
overlap = np.mean(cos_angles**2)
print(f' cos(theta_i): {cos_angles}')
print(f' angles (deg): {angles_deg}')
print(f' Mean cos^2 = {overlap:.4f} (0=orthogonal/independent, 1=identical heads)')
print(f' Orthogonal write subspaces => heads do not interfere => can prune either safely.')
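The principal-angle computation in the cell above can be cross-checked against `scipy.linalg.subspace_angles`, which performs the same SVD-based calculation with extra numerical care (a sketch assuming scipy is available; the rest of the notebook needs only numpy):

```python
import numpy as np
from scipy.linalg import subspace_angles  # assumes scipy is installed

np.random.seed(22)
d, d_v = 32, 8
B1 = np.random.randn(d, d_v)  # basis matrix for one write subspace of R^d
B2 = np.random.randn(d, d_v)  # basis matrix for another

# Manual route (as in the cell above): orthonormalise, then SVD of the cross-Gram matrix
Q1 = np.linalg.qr(B1)[0]
Q2 = np.linalg.qr(B2)[0]
cos_manual = np.linalg.svd(Q1.T @ Q2, compute_uv=False)

# scipy route: subspace_angles returns the principal angles in radians, largest first
cos_scipy = np.cos(subspace_angles(B1, B2))

print(np.allclose(np.sort(cos_manual), np.sort(cos_scipy)))  # True
```

Note that `subspace_angles` takes the raw basis matrices directly; it orthonormalises internally.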
Code cell 66
# === 12.5 Subspace fine-tuning: LoRA rank vs approximation quality ===
#
# If the true weight update DW_true has intrinsic rank r_true,
# the best rank-r LoRA approximation captures all signal when r >= r_true.
# Eckart-Young: best rank-r approximation uses truncated SVD.
np.random.seed(5)
m, n = 256, 256
true_rank = 8
lora_ranks = [1, 2, 4, 8, 16, 32, 64]
# Construct a true rank-true_rank update + small noise
U_t = np.random.randn(m, true_rank)
V_t = np.random.randn(n, true_rank)
DW_true = U_t @ V_t.T + 0.01 * np.random.randn(m, n)
fro_true = np.linalg.norm(DW_true, 'fro')
U_s, s_s, Vt_s = np.linalg.svd(DW_true, full_matrices=False)
print(f'Weight matrix: R^{{{m} x {n}}}, full params = {m*n:,}')
print(f'True update rank = {true_rank}')
print(f'Singular values of true DW (top 12): {s_s[:12]}')
print(f'\n{"LoRA r":>8} {"Params":>10} {"Fraction":>10} {"Relative error":>16}')
for r in lora_ranks:
DW_r = (U_s[:, :r] * s_s[:r]) @ Vt_s[:r, :]
err = np.linalg.norm(DW_true - DW_r, 'fro') / fro_true
pars = r * (m + n)
print(f' r={r:>3} {pars:>10,} {pars/(m*n):>10.4f} {err:>16.6f}')
print(f'\nAt r = {true_rank} = true rank, error drops to noise level.')
print(f'r < {true_rank}: information lost. r > {true_rank}: capacity wasted.')
print(f'\n=== Fine-tuning method comparison (all are subspace constraints) ===')
rows = [
('LoRA', 'DW = B A^T, B in R^{mxr}, A in R^{nxr}', 'r(m+n)'),
('AdaLoRA', 'DW = P Lambda Q^T, prune small sing vals', 'varies by layer'),
('IA3', 'multiply activations by learned scalar', '~1 per layer'),
('BitFit', 'fine-tune bias vectors only', 'sum of biases'),
('DoRA', 'DW = magnitude * direction (rank-r dir)', 'r(m+n)+m'),
]
print(f'{"Method":>10} {"Constraint":>50} {"Params":>20}')
for name, desc, pars in rows:
print(f' {name:>8} {desc:>50} {pars:>20}')
print(f'\nAll methods restrict DW to a structured low-dimensional subspace of R^{{mxn}}.')
print(f'The choice of subspace structure (rank, sparsity, magnitude/direction) is the design.')
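The table treats LoRA as a subspace constraint; here is a minimal sketch of that constraint in isolation (hypothetical layer sizes; the alpha/r scaling and zero-initialisation of one factor follow Hu et al. 2021):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, alpha = 64, 48, 4, 8.0          # hypothetical layer sizes and LoRA rank

W = rng.standard_normal((m, n))          # frozen pretrained weight (never updated)
B = np.zeros((m, r))                     # trainable; zero init => DW = 0 at start
A = rng.standard_normal((n, r)) * 0.01   # trainable, as in DW = B A^T above

def lora_forward(x):
    # y = W x + (alpha/r) B A^T x -- the update is confined to col(B),
    # a subspace of R^m of dimension at most r
    return W @ x + (alpha / r) * (B @ (A.T @ x))

x = rng.standard_normal(n)
print(np.allclose(lora_forward(x), W @ x))   # True: adapter is a no-op at init

B = rng.standard_normal((m, r))              # pretend training has updated B
DW = (alpha / r) * (B @ A.T)
print(np.linalg.matrix_rank(DW) <= r)        # True: update rank bounded by r
```

The zero-initialised factor is what makes LoRA safe to bolt onto a pretrained model: training starts exactly at the pretrained function and only ever moves within the rank-r subspace.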
13. Invariant Subspaces and Spectral Theory
A subspace W ⊆ V is invariant under a linear map T if T(W) ⊆ W. Invariant subspaces are the key to understanding what a linear map does: every invariant subspace is a channel along which T acts independently.
Eigenspaces are the simplest invariant subspaces. If Av = λv with v ≠ 0, then span{v} is a 1-D invariant subspace.
The Spectral Theorem says every symmetric matrix admits an orthogonal decomposition:
A = Q Λ Q^T and R^n = E(λ1) ⊕ ⋯ ⊕ E(λk), where Q is orthogonal and Λ is diagonal.
SVD extends this to non-square matrices: A = U Σ V^T decomposes both domain and codomain into invariant directions.
For AI: covariance matrices, Hessians, and Gram matrices are symmetric — the Spectral Theorem always applies. PCA is exactly the Spectral Theorem applied to the covariance matrix. LoRA targets the top invariant directions of the gradient covariance.
Code cell 68
# === 13.1 Invariant subspaces: eigenspaces and the Spectral Theorem ===
np.random.seed(17)
# --- Example 1: rotation has no real 1-D invariant subspace ---
theta = np.pi / 4
R_rot = np.array([[np.cos(theta), -np.sin(theta)],
[np.sin(theta), np.cos(theta)]])
print('=== 2x2 rotation matrix (theta = pi/4) ===')
print(f' R =\n{R_rot}')
eigvals_rot = np.linalg.eigvals(R_rot)
print(f' Eigenvalues (complex): {eigvals_rot}')
print(f' No real eigenvalues => no real 1-D invariant subspace.')
print(f' The entire plane rotates; no direction is preserved.\n')
# --- Example 2: symmetric matrix => orthogonal invariant eigenspaces ---
A_sym = np.array([[4.0, 2.0, 0.0],
[2.0, 3.0, 1.0],
[0.0, 1.0, 2.0]])
eigvals_s, eigvecs_s = np.linalg.eigh(A_sym) # guaranteed real for symmetric
print('=== Symmetric 3x3 matrix: spectral decomposition ===')
print(f' Eigenvalues: {eigvals_s}')
print(f' Eigenvectors Q:\n{eigvecs_s}\n')
for i in range(3):
v = eigvecs_s[:, i]
Av = A_sym @ v
ok = np.allclose(Av, eigvals_s[i] * v)
print(f' E(lambda={eigvals_s[i]:.4f}): A v = lambda v? {ok} (1-D invariant subspace)')
print(f'\n Q^T Q (should be I):\n{eigvecs_s.T @ eigvecs_s}\n')
# Spectral decomposition A = Q Lambda Q^T
Lam = np.diag(eigvals_s)
A_recon = eigvecs_s @ Lam @ eigvecs_s.T
print(f' A = Q Lambda Q^T reconstruction error: {np.max(np.abs(A_sym - A_recon)):.2e}')
# As sum of rank-1 projections
A_sum = sum(eigvals_s[i] * np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3))
print(f' A = sum lambda_i q_i q_i^T error: {np.max(np.abs(A_sym - A_sum)):.2e}')
# === 13.2 R^n = E(lambda1) + E(lambda2) + ... ===
print(f'\n=== Spectral Theorem: R^3 = E(l1) oplus E(l2) oplus E(l3) ===')
projs = [np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3)]
print(f' P1+P2+P3 = I? error: {np.max(np.abs(sum(projs) - np.eye(3))):.2e}')
x = np.array([1.0, 2.0, -1.0])
Ax_spec = sum(eigvals_s[i] * projs[i] @ x for i in range(3))
print(f' Ax via eigenspace decomposition: {Ax_spec}')
print(f' A @ x directly: {A_sym @ x}')
print(f' Match: {np.allclose(Ax_spec, A_sym @ x)}')
print(f'\n A acts on each eigenspace independently -- scaling by lambda_i.')
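The prose above claims PCA is exactly the Spectral Theorem applied to the covariance matrix; a quick numerical check on synthetic data (not one of the original cells) confirms that diagonalising the sample covariance and taking the SVD of the centred data give the same principal directions and spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated sample
Xc = X - X.mean(axis=0)                                        # centre the data

# Route 1 (Spectral Theorem): eigh on the symmetric sample covariance
C = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending

# Route 2 (SVD): right singular vectors of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Spectra agree: sigma_i^2 / (n - 1) = lambda_i
print(np.allclose(s**2 / (n - 1), evals))    # True

# Both eigenbases reconstruct the same covariance (directions agree up to sign)
C_from_svd = (Vt.T * (s**2 / (n - 1))) @ Vt
print(np.allclose(C, C_from_svd))            # True
```

The sign ambiguity is why the reconstruction test is used rather than comparing eigenvector columns directly: each principal direction is only defined up to a factor of ±1.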
Code cell 69
# === 13.3 SVD: complete subspace decomposition of any linear map ===
#
# A = U Sigma V^T:
# Domain: R^n = row(A) [v1..vr] (+) null(A) [v_{r+1}..v_n]
# Codomain: R^m = col(A) [u1..ur] (+) null(A^T) [u_{r+1}..u_m]
# A maps v_i -> sigma_i u_i (bijection between row(A) and col(A))
# A maps null(A) -> 0
A_svd = np.array([
[3.0, 1.0, 0.0, 2.0],
[1.0, 2.0, 1.0, 0.0],
[2.0, 0.0, 1.0, 1.0]
]) # 3 x 4
U, s, Vt = np.linalg.svd(A_svd, full_matrices=True)
V = Vt.T
r = np.linalg.matrix_rank(A_svd)
m_s, n_s = A_svd.shape
print(f'=== SVD of A in R^{{{m_s} x {n_s}}} ===')
print(f' rank = {r}, singular values = {s}')
print(f'\n=== Domain R^{n_s}: row(A) oplus null(A) ===')
print(f' row(A) = span{{v1,...,v{r}}} dim = {r}')
print(f' null(A) = span{{v{r+1},...,v{n_s}}} dim = {n_s - r}')
print(f'\n=== Codomain R^{m_s}: col(A) oplus null(A^T) ===')
print(f' col(A) = span{{u1,...,u{r}}} dim = {r}')
print(f' null(A^T) = span{{u{r+1},...,u{m_s}}} dim = {m_s - r}')
print(f'\n=== A is a bijection: row(A) -> col(A) ===')
for i in range(r):
vi = V[:, i]
Avi = A_svd @ vi
s_ui = s[i] * U[:, i]
print(f' A v{i+1} = sigma{i+1} u{i+1}? {np.allclose(Avi, s_ui)} sigma{i+1} = {s[i]:.4f}')
print(f'\n=== A maps null(A) to 0 ===')
for i in range(r, n_s):
print(f' ||A v{i+1}|| = {np.linalg.norm(A_svd @ V[:, i]):.2e}')
print(f'\n=== Low-rank approximation (Eckart-Young) ===')
print(f'{"rank k":>8} {"Frobenius error":>16} {"Variance captured":>20}')
for k in range(r + 1):
if k == 0:
err = np.linalg.norm(A_svd, 'fro')
var = 0.0
else:
Ak = sum(s[i] * np.outer(U[:, i], V[:, i]) for i in range(k))
err = np.linalg.norm(A_svd - Ak, 'fro')
var = 1.0 - err**2 / np.sum(s**2)
print(f' k={k} {err:>16.4f} {var:>20.4f}')
Code cell 70
# === 13.4 Schur decomposition and nested invariant subspaces ===
#
# Every square A admits A = Q T Q^* where Q orthogonal, T upper triangular.
# First k columns of Q span a k-dim invariant subspace of A.
# For symmetric A, T is diagonal (= spectral theorem).
from scipy.linalg import schur
print('=== Schur decomposition of a non-symmetric matrix ===')
A_sch = np.array([
[4.0, 3.0, 2.0],
[0.0, 2.0, 1.0],
[0.0, 1.0, 3.0]
])
T_s, Q_s = schur(A_sch)
print(f'A =\n{A_sch}')
print(f'\nSchur T (upper triangular):\n{T_s}')
print(f'\nSchur Q (orthonormal columns):\n{Q_s}')
print(f'\nReconstruction A = Q T Q^T error: {np.max(np.abs(A_sch - Q_s @ T_s @ Q_s.T)):.2e}')
eigvals_d = np.sort(np.linalg.eigvals(A_sch).real)
eigvals_t = np.sort(np.diag(T_s).real)
print(f'\nEigenvalues from eigvals: {eigvals_d}')
print(f'Eigenvalues from diag(T): {eigvals_t}')
print(f'Match: {np.allclose(eigvals_d, eigvals_t)}')
print(f'\n=== Nested invariant subspaces from Schur vectors ===')
for k in range(1, 4):
Qk = Q_s[:, :k]
AQk = A_sch @ Qk
proj = Qk @ Qk.T @ AQk
is_inv = np.allclose(AQk, proj)
print(f' span{{q1,...,q{k}}} invariant under A? {is_inv}')
print(f'\n=== Symmetric case: Schur = spectral (T diagonal) ===')
A_sym3 = np.array([[5.0, 2.0], [2.0, 3.0]])
T_sym, Q_sym = schur(A_sym3)
print(f' T for symmetric A:\n{T_sym}')
print(f' T is diagonal: {np.allclose(T_sym, np.diag(np.diag(T_sym)))}')
print(f'\n=== Connection to AI ===')
print(f' Hessian H of training loss is symmetric => spectral theorem applies.')
print(f' Top-k Schur/eigenspace of H = high-curvature subspace for 2nd-order optimisers.')
print(f' K-FAC / Shampoo / SOAP approximate H^{{-1}} g using structured invariant subspaces.')
print(f' Natural gradient = gradient in metric defined by the Fisher (symmetric PSD).')
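The Schur vectors above come out of iterative eigensolvers; the simplest such scheme is orthogonal (subspace) iteration, sketched here on a synthetic symmetric matrix (assuming a spectral gap |lambda_k| > |lambda_{k+1}|, which guarantees convergence to the dominant k-dim invariant subspace):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 6, 2
# Symmetric test matrix with a known gap between lambda_2 = 4 and lambda_3 = 1
lam = np.array([5.0, 4.0, 1.0, 0.8, 0.5, 0.1])
Q_true = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q_true @ np.diag(lam) @ Q_true.T

Q = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random starting block
for _ in range(100):
    Q = np.linalg.qr(A @ Q)[0]                     # multiply by A, re-orthonormalise

# Invariance: A Q lies (numerically) inside span(Q)
residual = np.linalg.norm(A @ Q - Q @ (Q.T @ A @ Q))
print(f'invariance residual: {residual:.2e}')

# span(Q) equals the dominant eigenspace span{q1, q2} of A
cosines = np.linalg.svd(Q_true[:, :k].T @ Q, compute_uv=False)
print(f'principal-angle cosines vs top-{k} eigenspace: {cosines}')
```

Convergence is geometric at rate |lambda_{k+1}/lambda_k| per iteration (here 1/4), which is why 100 iterations drive the residual to machine precision.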
14. Common Mistakes
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| "Union of two subspaces is a subspace" | U ∪ W fails closure under addition: e1 + e2 = (1, 1) is not on either coordinate axis. | Use the sum U + W, which IS a subspace. |
| "An affine subspace is a subspace" | {x : Ax = b} with b ≠ 0 does not contain 0. | An affine subspace is a coset v + W of a linear subspace — parallel but offset. A linear subspace must contain 0. |
| "Closure under addition alone makes a subspace" | Z^n ⊂ R^n is closed under addition but not scalar mult (0.5 · (1, 0, ..., 0) ∉ Z^n). | Check all three conditions: 0 ∈ W, closed under +, closed under scalar mult. |
| "Span unchanged after adding v => v is zero" | v could be any linear combination of the others — dependent, not zero. | Span equality means v is redundant (linearly dependent), not necessarily v = 0. |
| "dim(U ∩ W) = min(dim U, dim W)" | The intersection dimension depends on orientation, not just sizes. | The correct formula is for the sum: dim(U + W) = dim U + dim W − dim(U ∩ W). |
| "The complement of a subspace is a subspace" | R^n \ W does not contain 0 and is not closed under addition. | The orthogonal complement W⊥ IS a subspace. Always say W⊥, not "complement". |
| "All bases have the same vectors" | Infinitely many valid bases exist for R^n with n ≥ 1. | All bases have the same count (= dim V); the actual vectors vary freely. |
| "null(AB) = null(B)" | null(B) ⊆ null(AB) always; equality can fail if A is not injective on col(B). | null(B) ⊆ null(AB); equality iff null(A) ∩ col(B) = {0}. |
| "Same dimension => same subspace" | The x-axis and y-axis in R^2 both have dim 1 but are different subspaces. | Equal dimension means isomorphic; actual equality requires containment in both directions. |
| "Gram-Schmidt works on any set of vectors" | If a vector is already in the span of the previous ones, its residual is 0 and cannot be normalised. | First extract a linearly independent subset; discard any vector that produces 0. |
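Two rows of the table can be falsified numerically in a few lines (a quick demo, not one of the original cells):

```python
import numpy as np

# "Union of two subspaces is a subspace" -- falsified in R^2:
# U = x-axis, W = y-axis; e1 + e2 = (1, 1) lies in neither.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
v = e1 + e2
in_U = np.isclose(v[1], 0.0)   # on the x-axis?
in_W = np.isclose(v[0], 0.0)   # on the y-axis?
print(f'e1 + e2 in U or in W? {bool(in_U or in_W)}')   # False: union not closed under +

# "An affine subspace is a subspace" -- falsified:
# the solution set of Ax = b with b != 0 cannot contain the zero vector.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
print(f'zero vector satisfies Ax = b? {np.allclose(A @ np.zeros(2), b)}')   # False
```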
16. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| Residual stream as shared vector space | The residual stream is the central shared vector space of a transformer. Every attention head and MLP layer adds vectors to this space. Understanding subspace decompositions of the residual stream IS understanding how transformers route and store information. |
| LoRA and gradient subspaces | LoRA's premise: weight updates lie in a low-dimensional subspace. Aghajanyan et al. (2020) showed the fine-tuning subspace has dimension orders of magnitude below the full parameter count. Choosing LoRA rank r = choosing the subspace dimension to search in. |
| Superposition and polysemanticity | LLMs represent more features than dimensions by storing them as non-orthogonal directions. Features share subspace dimensions, causing polysemanticity. This is the superposition hypothesis (Elhage et al. 2022). |
| MLA and KV subspace compression | DeepSeek-V3's Multi-head Latent Attention compresses KV projections to a rank-r subspace with r far below the model dimension, enabling 5.75x memory reduction. Subspace dimensionality IS the design variable. |
| Mechanistic interpretability | Every circuit in mechanistic interpretability is a composition of subspace operations: heads read from and write to specific subspaces of the residual stream via their W_Q, W_K, W_V, W_O matrices. |
| Representation collapse prevention | VICReg, Barlow Twins, BYOL prevent collapse by ensuring embeddings span a high-dimensional subspace. Collapse = the effective subspace dimension shrinking toward 1; these losses maximise it. |
| Generalisation and implicit bias | Gradient descent on overparameterised models converges to minimum-norm solutions supported in a low-dimensional subspace of parameter space. Generalisation is explained by subspace geometry. |
| Orthogonality and head diversity | Attention heads writing to orthogonal subspaces do not interfere. Head redundancy = subspace overlap. Orthogonality = independence = safe pruning. |
| Second-order optimisation | The Hessian concentrates spectral mass in a low-dimensional high-curvature subspace. K-FAC, Shampoo, and SOAP exploit this. Natural gradient aligns updates with the curved subspace. |
Conceptual Bridge
Vector spaces and subspaces are the geometry underlying all of linear algebra. Every concept — span, independence, basis, dimension, rank, orthogonality, projections, eigenvalues — is ultimately a statement about the structure of subspaces.
The abstract framework (eight axioms) is what makes the theory universal: the same theorems that govern arrows in a plane also govern polynomials, matrices, functions, and probability distributions. When you recognise a new mathematical object as a vector space, you immediately inherit centuries of theorems for free.
For AI in 2026, subspaces are the architectural primitives of transformers, the design variables of efficient fine-tuning, the diagnostic language of mechanistic interpretability, and the theoretical foundation of generalisation.
Next: Eigenvalues, Eigenvectors, and Matrix Decompositions — where invariant subspaces of a linear map are revealed through its spectrum. An eigenvector spans a 1-D invariant subspace; the spectral theorem decomposes any symmetric matrix into orthogonal invariant eigenspaces; the SVD extends this to the complete subspace structure of any rectangular matrix.