Converted from theory.ipynb for web reading.

Vector Spaces and Subspaces — From Axioms to Interpretability

A vector space is not a collection of arrows. It is a collection of objects that behave like arrows — and that behaviour is the entire point. Once you verify the axioms, every theorem ever proved about vector spaces applies for free.

This notebook is the interactive companion to notes.md. It is arranged as a learning path rather than a section-by-section copy.

| Block | Topic | What you will build |
|---|---|---|
| 1 | Intuition | why vector spaces are the language of deep learning |
| 2 | Formal Definitions | axiom verification for standard and exotic spaces |
| 3 | Subspaces | the three-condition test on concrete examples |
| 4 | Span & Linear Combinations | geometric pictures and matrix formulation |
| 5 | Linear Independence | dependence detection and redundancy |
| 6 | Basis & Dimension | coordinate systems, change of basis, dimension counting |
| 7 | Four Fundamental Subspaces | column space, null space, row space, left null space |
| 8 | Subspace Operations | sum, intersection, direct sum, projection |
| 9 | Inner Product Spaces & Orthogonality | Gram-Schmidt, orthogonal complements |
| 10 | Affine Subspaces & Quotient Spaces | translated subspaces, cosets |
| 11 | Subspaces in Functional Analysis | infinite dimensions, RKHS, Krylov |
| 12 | Subspace Methods in ML | PCA, LoRA, representation subspaces, mechanistic interpretability |
| 13 | Invariant Subspaces & Spectral Theory | eigenspaces, SVD as subspace decomposition |

The notebook uses numpy and scipy, with optional matplotlib/seaborn for plots. It is intentionally code-first: the goal is to make the abstract algebra feel operational.

Primary references: Axler (Linear Algebra Done Right), Strang (Linear Algebra and Its Applications), Halmos (Finite-Dimensional Vector Spaces), Hu et al. (2021, LoRA), Elhage et al. (2022, superposition), Aghajanyan et al. (2020, intrinsic dimensionality).

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
import scipy.linalg as sla
from scipy import stats

COLORS = {
    "primary": "#0077BB",
    "secondary": "#EE7733",
    "tertiary": "#009988",
    "error": "#CC3311",
    "neutral": "#555555",
    "highlight": "#EE3377",
}
HAS_MPL = True
np.set_printoptions(precision=8, suppress=True)
np.random.seed(42)

def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

def check_true(name, cond):
    ok = bool(cond)
    print(f"{'PASS' if ok else 'FAIL'} - {name}")
    return ok

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'} - {name}: got {got}, expected {expected}")
    return ok

def check(name, got, expected, tol=1e-8):
    return check_close(name, got, expected, tol=tol)

def softmax(z, axis=-1, tau=1.0):
    z = np.asarray(z, dtype=float) / float(tau)
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float); b = np.asarray(b, dtype=float)
    return float(a @ b / (la.norm(a) * la.norm(b) + 1e-12))

def numerical_rank(A, tol=1e-10):
    return int(np.sum(la.svd(A, compute_uv=False) > tol))

def orthonormal_basis(A, tol=1e-10):
    Q, R = la.qr(A)
    keep = np.abs(np.diag(R)) > tol
    return Q[:, keep]

# Compatibility helpers used by the Chapter 02 theory and exercise cells.
def null_space(A, tol=1e-10):
    """Orthonormal basis (columns) for null(A), computed via the SVD."""
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T

svd_null_space = null_space

def gram_schmidt(vectors, tol=1e-10):
    A = np.asarray(vectors, dtype=float)
    if A.ndim == 1:
        A = A.reshape(1, -1)
    basis = []
    for v in A:
        w = v.astype(float).copy()
        for q in basis:
            w = w - np.dot(w, q) * q
        norm = la.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)

def projection_matrix_from_columns(A, tol=1e-10):
    Q = orthonormal_basis(np.asarray(A, dtype=float), tol=tol)
    return Q @ Q.T


def random_unit_vectors(n, d):
    X = np.random.randn(n, d)
    return X / np.maximum(la.norm(X, axis=1, keepdims=True), 1e-12)

def pairwise_distances(X):
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return la.norm(diff, axis=-1)


def normalize(x, axis=None, tol=1e-12):
    x = np.asarray(x, dtype=float)
    norm = la.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, tol)

def frobenius_inner(A, B):
    return float(np.sum(np.asarray(A, dtype=float) * np.asarray(B, dtype=float)))

def outer_sum_product(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

def softmax_rows(X):
    return softmax(X, axis=1)

def col_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float), tol=tol)

def row_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float).T, tol=tol).T

def rref(A, tol=1e-10):
    R = np.array(A, dtype=float, copy=True)
    m, n = R.shape
    pivots = []
    row = 0
    for col in range(n):
        pivot = row + int(np.argmax(np.abs(R[row:, col]))) if row < m else row
        if row >= m or abs(R[pivot, col]) <= tol:
            continue
        if pivot != row:
            R[[row, pivot]] = R[[pivot, row]]
        R[row] = R[row] / R[row, col]
        for r in range(m):
            if r != row:
                R[r] = R[r] - R[r, col] * R[row]
        pivots.append(col)
        row += 1
        if row == m:
            break
    R[np.abs(R) < tol] = 0.0
    return R, pivots

def nullspace_basis(A, tol=1e-10):
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T, rank

print("Chapter helper setup complete.")

1. Intuition — What Is a Vector Space and Why It Matters

A vector space is a collection of objects — called vectors — together with two operations:

  • Addition: combine two vectors to get another vector in the same collection.
  • Scalar multiplication: stretch any vector by a real number and stay in the collection.

These two operations must satisfy eight axioms (Section 2). Once verified, every theorem ever proved about vector spaces applies automatically — whether the "vectors" are geometric arrows, polynomials, matrices, functions, probability distributions, or neural network weights.

Why this abstraction matters for AI

| AI concept | Underlying vector space | Why it matters |
|---|---|---|
| Token embeddings | $\mathbb{R}^d$ | king − man + woman ≈ queen relies on vector arithmetic |
| Parameter space | $\mathbb{R}^p$ | Gradient descent navigates this space; the loss landscape lives over it |
| Residual stream | $\mathbb{R}^d$ at each layer | x ← x + Attn(x) + MLP(x) is vector addition |
| Gradient space | $\mathbb{R}^p$ | Gradient accumulation, clipping, momentum — all vector operations |
| Function spaces | $L^2$, RKHS | Neural networks approximate functions; approximation theory uses subspaces |

The hierarchy of structure

Set (no structure)
    ↓ + addition and scalar multiplication + axioms
Vector Space (linear structure: add, scale)
    ↓ + norm ‖v‖
Normed Space (distances: can measure length)
    ↓ + inner product ⟨u,v⟩
Inner Product Space (angles: can measure cosine similarity)
    ↓ + completeness
Hilbert Space (complete inner product space)

Each level adds structure and inherits all theorems from above. $\mathbb{R}^n$ satisfies all levels simultaneously. AI embedding spaces are finite-dimensional, so all levels are equivalent there. Function spaces require infinite dimensions, where the distinctions become essential.
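A minimal numpy sketch of this hierarchy (an illustration added alongside the notebook's own cells, not one of them): the same two vectors, with one new measurement unlocked at each level.

import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, -1.0, 0.0])

# Vector-space level: only + and scalar * are available.
print("u + 2v =", u + 2 * v)

# Normed-space level: a norm adds lengths (here the l2 norm).
print("||u|| =", np.linalg.norm(u))

# Inner-product level: <u, v> adds angles and cosine similarity.
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print("cos(u, v) =", cos)

# Completeness is automatic in finite dimensions, so R^n is already
# a Hilbert space -- the top of the hierarchy.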

Code cell 5

# === 1.1 The same algebra works across wildly different "vector" types ===

# Vectors in R^3
u = np.array([1.0, 2.0, -1.0])
v = np.array([3.0, 0.0, 1.0])
print("R^3 vectors:")
print("  u + v =", u + v)
print("  2.5 * u =", 2.5 * u)

# Matrices as vectors in R^{2x3}
A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = np.array([[0.0, -1.0, 2.0], [1.0, 0.0, -1.0]])
print("\nR^{2x3} matrices:")
print("  A + B =\n", A + B)
print("  -3 * A =\n", -3 * A)

# Polynomials (represented by coefficient vectors)
# p(t) = 1 + 2t - t^2,  q(t) = 3 - t + 4t^2
p = np.array([1.0, 2.0, -1.0])   # coefficients of 1, t, t^2
q = np.array([3.0, -1.0, 4.0])
print("\nP_2 polynomials (coefficients):")
print("  p + q =", p + q, " -> (p+q)(t) = 4 + t + 3t^2")
print("  2p   =", 2 * p, "   -> 2p(t) = 2 + 4t - 2t^2")

# Functions (sampled pointwise)
t = np.linspace(0, 1, 5)
f = t**2
g = np.sin(np.pi * t)
print("\nC([0,1]) functions (sampled):")
print("  (f+g)(t) =", f + g)
print("  (3f)(t)  =", 3 * f)

print("\nKey insight: + and scalar * work the same way in every vector space.")
print("The axioms guarantee this uniformity.")

Code cell 6

# === 1.2 Vector space structure in AI: embedding arithmetic ===

# Synthetic embedding space (small d for clarity; real models use d=768+)
tokens = ["king", "queen", "man", "woman", "apple"]
E = np.array([
    [0.8,  0.9,  0.2,  0.1],   # king
    [0.8,  0.9, -0.2,  0.1],   # queen
    [0.7,  0.8,  0.4,  0.1],   # man
    [0.7,  0.8, -0.4,  0.1],   # woman
    [-0.3, 0.2,  0.0,  0.9],   # apple
], dtype=float)

# king - man + woman: relies on R^d being a vector space
analogy = E[0] - E[2] + E[3]  # king - man + woman
print("king - man + woman =", analogy)
print("queen              =", E[1])
print("cosine(analogy, queen) =", cosine_similarity(analogy, E[1]))

# This arithmetic is meaningful ONLY because R^d is a vector space:
# subtraction (additive inverse), addition, and the result stays in R^d.

# Residual stream: x <- x + attn_output + mlp_output
d = 8
x = np.random.randn(d)
attn_out = np.random.randn(d) * 0.3
mlp_out = np.random.randn(d) * 0.3
x_next = x + attn_out + mlp_out   # vector addition in R^d
print("\nResidual stream update:")
print("  ||x||       =", np.linalg.norm(x))
print("  ||x + attn + mlp|| =", np.linalg.norm(x_next))
print("  This sum makes sense because R^d is closed under addition.")

Subspace intuition (preview)

A subspace is a vector space living inside a larger vector space — a flat, linear subset that passes through the origin and is closed under the same operations.

  • A line through the origin in $\mathbb{R}^3$ is a subspace.
  • A line not through the origin is not (it doesn't contain the zero vector).
  • A plane through the origin is a subspace; a curved surface never is.

Why "through the origin": the zero vector must belong to any subspace (scale any vector by 0 and you get 0; if 0 isn't in the set, it can't be closed under scalar multiplication).

Why "flat": closure under addition and scalar multiplication forces the set to be "straight" — no curves, no bends.

Subspaces that appear constantly in AI:

| Subspace | Definition | AI meaning |
|---|---|---|
| $\text{col}(W)$ | $\{Wx : x \in \mathbb{R}^n\}$ | Reachable outputs of layer $W$ |
| $\text{null}(W)$ | $\{x : Wx = 0\}$ | Input directions the layer ignores |
| $\text{row}(W)$ | $\{W^T y : y \in \mathbb{R}^m\}$ | Input directions the layer listens to |
| LoRA update space | $\text{col}(B)$ for $\Delta W = BA^T$ | Rank-$r$ subspace of weight updates |

Code cell 8

# === 1.3 Subspace vs non-subspace: quick visual check ===

# A line through the origin: W = {t*(1,2) : t in R}
# Check: contains 0? Yes (t=0). Closed under + and scalar mult? Yes.
direction = np.array([1.0, 2.0])
t_vals = np.array([-2, -1, 0, 0.5, 1, 3])
line_points = np.outer(t_vals, direction)
print("Line through origin (subspace):")
for t, pt in zip(t_vals, line_points):
    print(f"  t={t:>4}: ({pt[0]:>5.1f}, {pt[1]:>5.1f})")
# Sum of two points: (1,2) + (3,6) = (4,8) = 4*(1,2) -> still on the line
print("  (1,2)+(3,6) =", direction + 3*direction, "= 4*(1,2) -> in W")

# A line NOT through origin: {(1,0) + t*(1,2) : t in R}
# Does not contain 0: (1,0) + t*(1,2) = (0,0) requires t=0 and 1=0. Contradiction.
print("\nLine NOT through origin (NOT a subspace):")
print("  Does it contain (0,0)? No. Already fails.")

# Probability simplex: {p in R^3 : p_i >= 0, sum = 1}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print("\nProbability simplex (NOT a subspace):")
print("  p =", p, "sum =", p.sum())
print("  q =", q, "sum =", q.sum())
print("  p + q =", p + q, "sum =", (p+q).sum(), "!= 1 -> not in simplex")
print("  2*p  =", 2*p, "sum =", (2*p).sum(), "!= 1 -> not in simplex")
print("  (0,0,0) not in simplex (sum=0 != 1) -> no zero vector")

2. Formal Definitions — The Eight Axioms

A vector space over $\mathbb{R}$ is a set $V$ together with two operations:

  • Addition: $+ : V \times V \to V$
  • Scalar multiplication: $\cdot : \mathbb{R} \times V \to V$

satisfying these eight axioms for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and all $\alpha, \beta \in \mathbb{R}$:

Addition axioms:

| # | Name | Statement |
|---|---|---|
| A1 | Commutativity | $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$ |
| A2 | Associativity | $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$ |
| A3 | Additive identity | $\exists\, \mathbf{0} \in V$ such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for all $\mathbf{v}$ |
| A4 | Additive inverse | For each $\mathbf{v}$, $\exists\, (-\mathbf{v})$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$ |

Scalar multiplication axioms:

| # | Name | Statement |
|---|---|---|
| S1 | Multiplicative identity | $1 \cdot \mathbf{v} = \mathbf{v}$ |
| S2 | Compatibility | $\alpha(\beta \mathbf{v}) = (\alpha\beta)\mathbf{v}$ |

Distributivity axioms:

| # | Name | Statement |
|---|---|---|
| D1 | Over vector addition | $\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$ |
| D2 | Over scalar addition | $(\alpha + \beta)\mathbf{v} = \alpha\mathbf{v} + \beta\mathbf{v}$ |

These eight axioms are the complete definition. The axioms are minimal: each is independent of the others.

Code cell 10

# === 2.1 Verify all eight axioms for R^n ===

u = np.array([1.0, -2.0, 3.0])
v = np.array([4.0, 0.5, -1.0])
w = np.array([-1.0, 3.0, 2.0])
alpha, beta = 2.5, -1.3
zero = np.zeros(3)

print("Verifying all 8 axioms for R^3:")
print(f"A1 Commutativity:     u+v == v+u?            {np.allclose(u+v, v+u)}")
print(f"A2 Associativity:     (u+v)+w == u+(v+w)?    {np.allclose((u+v)+w, u+(v+w))}")
print(f"A3 Additive identity: v+0 == v?              {np.allclose(v+zero, v)}")
print(f"A4 Additive inverse:  v+(-v) == 0?           {np.allclose(v+(-v), zero)}")
print(f"S1 Mult identity:     1*v == v?              {np.allclose(1*v, v)}")
print(f"S2 Compatibility:     a(bv) == (ab)v?        {np.allclose(alpha*(beta*v), (alpha*beta)*v)}")
print(f"D1 Scalar distrib:    a(u+v) == au+av?       {np.allclose(alpha*(u+v), alpha*u+alpha*v)}")
print(f"D2 Vector distrib:    (a+b)v == av+bv?       {np.allclose((alpha+beta)*v, alpha*v+beta*v)}")

Code cell 11

# === 2.2 Verify axioms for R^{m x n} (matrices as vectors) ===

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
B = np.array([[0.0, -1.0], [2.0, 1.0], [-1.0, 3.0]])
C = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
Z = np.zeros((3, 2))
a, b = 3.0, -0.7

print("Verifying axioms for R^{3x2}:")
print(f"A1 Commutativity:  {np.allclose(A+B, B+A)}")
print(f"A2 Associativity:  {np.allclose((A+B)+C, A+(B+C))}")
print(f"A3 Zero matrix:    {np.allclose(A+Z, A)}")
print(f"A4 Inverse:        {np.allclose(A+(-A), Z)}")
print(f"S1 Identity:       {np.allclose(1*A, A)}")
print(f"S2 Compatibility:  {np.allclose(a*(b*A), (a*b)*A)}")
print(f"D1 Scalar distrib: {np.allclose(a*(A+B), a*A+a*B)}")
print(f"D2 Vector distrib: {np.allclose((a+b)*A, a*A+b*A)}")
print(f"\ndim(R^{{3x2}}) = 3*2 = {3*2}")

Code cell 12

# === 2.3 Verify axioms for P_n (polynomials of degree <= n) ===
# Represent p(t) = a0 + a1*t + a2*t^2 by coefficient vector [a0, a1, a2]

p = np.array([1.0, 2.0, -1.0])    # 1 + 2t - t^2
q = np.array([3.0, -1.0, 4.0])    # 3 - t + 4t^2
r = np.array([0.0, 1.0, 1.0])     # t + t^2
z = np.zeros(3)                     # zero polynomial
a, b = 2.0, -0.5

print("Verifying axioms for P_2 (polynomials of degree <= 2):")
print(f"  p(t) = 1 + 2t - t^2,  q(t) = 3 - t + 4t^2")
print(f"  p + q = {p + q}  -> (p+q)(t) = 4 + t + 3t^2")
print(f"  A1: {np.allclose(p+q, q+p)}")
print(f"  A2: {np.allclose((p+q)+r, p+(q+r))}")
print(f"  A3: {np.allclose(p+z, p)}")
print(f"  A4: {np.allclose(p+(-p), z)}")
print(f"  S1: {np.allclose(1*p, p)}")
print(f"  S2: {np.allclose(a*(b*p), (a*b)*p)}")
print(f"  D1: {np.allclose(a*(p+q), a*p+a*q)}")
print(f"  D2: {np.allclose((a+b)*p, a*p+b*p)}")
print(f"\ndim(P_2) = 3 (basis: {{1, t, t^2}})")
print("Note: p+q has degree 2 <= 2, so it stays in P_2. Closure holds.")

Immediate consequences of the axioms

These follow from the eight axioms alone — they hold in every vector space:

  1. Uniqueness of zero: if $\mathbf{0}$ and $\mathbf{0}'$ both satisfy A3, then $\mathbf{0} = \mathbf{0} + \mathbf{0}' = \mathbf{0}'$.

  2. Uniqueness of additive inverse: if $\mathbf{v} + \mathbf{w} = \mathbf{0}$ and $\mathbf{v} + \mathbf{w}' = \mathbf{0}$, then $\mathbf{w} = \mathbf{w}'$.

  3. $0 \cdot \mathbf{v} = \mathbf{0}$ (zero scalar times any vector gives the zero vector): $0\mathbf{v} = (0+0)\mathbf{v} = 0\mathbf{v} + 0\mathbf{v}$; subtracting $0\mathbf{v}$ gives $\mathbf{0} = 0\mathbf{v}$. ✓

  4. $\alpha \cdot \mathbf{0} = \mathbf{0}$ (any scalar times the zero vector gives the zero vector): $\alpha\mathbf{0} = \alpha(\mathbf{0}+\mathbf{0}) = \alpha\mathbf{0} + \alpha\mathbf{0}$; subtracting gives $\mathbf{0} = \alpha\mathbf{0}$. ✓

  5. $(-1)\mathbf{v} = -\mathbf{v}$: $\mathbf{v} + (-1)\mathbf{v} = (1+(-1))\mathbf{v} = 0\mathbf{v} = \mathbf{0}$. ✓

  6. If $\alpha\mathbf{v} = \mathbf{0}$, then $\alpha = 0$ or $\mathbf{v} = \mathbf{0}$: if $\alpha \neq 0$, multiply by $\alpha^{-1}$ to get $\mathbf{v} = \alpha^{-1}\mathbf{0} = \mathbf{0}$. ✓

Code cell 14

# === 2.4 Verify consequences computationally ===

v = np.array([3.0, -1.0, 7.0, 2.0])
zero = np.zeros(4)

print("Consequence 3: 0 * v = 0")
print("  0 * v =", 0 * v)
print("  equals zero?", np.allclose(0 * v, zero))

print("\nConsequence 4: alpha * 0 = 0")
print("  7.5 * 0 =", 7.5 * zero)
print("  equals zero?", np.allclose(7.5 * zero, zero))

print("\nConsequence 5: (-1) * v = -v")
print("  (-1)*v =", (-1)*v)
print("  -v     =", -v)
print("  equal?", np.allclose((-1)*v, -v))

print("\nConsequence 6: alpha*v = 0 implies alpha=0 or v=0")
alpha = 3.0
result = alpha * v
print(f"  {alpha} * {v} = {result}")
print(f"  result is zero? {np.allclose(result, zero)} (alpha != 0 and v != 0, so correct)")

Non-examples: what fails and why

Understanding what fails the axioms is as important as knowing the examples.

| Set | What fails | Why |
|---|---|---|
| $\mathbb{R}_+^n$ (positive reals) | Scalar mult, inverse | $(-1) \cdot (1,2) = (-1,-2) \notin \mathbb{R}_+^n$ |
| Unit sphere $S^{n-1}$ | Addition, zero | $\lVert u+v \rVert \neq 1$ generally; $\mathbf{0} \notin S^{n-1}$ |
| Simplex $\Delta^n$ | Addition, scalar mult, zero | $p+q$ sums to 2; $\mathbf{0} \notin \Delta^n$ |
| $\mathbb{Z}^n$ (integers) | Scalar mult over $\mathbb{R}$ | $\pi \cdot (1,0) = (\pi, 0) \notin \mathbb{Z}^n$ |
| $\{x : Ax = b\}$, $b \neq 0$ | Zero, addition | $A\mathbf{0} = \mathbf{0} \neq b$ |

Code cell 16

# === 2.5 Demonstrate axiom failures for non-vector-spaces ===

print("=== R_+^n (positive reals) ===")
v_pos = np.array([1.0, 2.0, 3.0])
scaled = (-1) * v_pos
print(f"  v = {v_pos}  (all positive)")
print(f"  (-1)*v = {scaled}  (not all positive -> FAILS closure under scalar mult)")
print(f"  Also: no additive inverse in R_+  (v + (-v) = 0, but -v not in R_+)")

print("\n=== Unit sphere S^{n-1} ===")
u_sphere = normalize(np.array([1.0, 0.0, 0.0]))
v_sphere = normalize(np.array([0.0, 1.0, 0.0]))
s = u_sphere + v_sphere
print(f"  ||u|| = {np.linalg.norm(u_sphere):.1f},  ||v|| = {np.linalg.norm(v_sphere):.1f}")
print(f"  u + v = {s},  ||u+v|| = {np.linalg.norm(s):.4f} != 1  (FAILS closure under +)")
print(f"  ||0|| = 0 != 1  (zero vector not on sphere -> FAILS zero)")

print("\n=== Probability simplex ===")
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print(f"  p = {p}, sum = {p.sum():.1f}")
print(f"  q = {q}, sum = {q.sum():.1f}")
print(f"  p+q = {p+q}, sum = {(p+q).sum():.1f} != 1  (FAILS closure under +)")
print(f"  This is why logit space (pre-softmax) is used for arithmetic, not probability space.")

print("\n=== Affine subspace {x : Ax = b}, b != 0 ===")
A_sys = np.array([[1.0, 2.0], [3.0, 4.0]])
b_sys = np.array([5.0, 6.0])
x_particular = np.linalg.solve(A_sys, b_sys)
print(f"  A*0 = {A_sys @ np.zeros(2)} != b = {b_sys}  (zero not in solution set -> FAILS)")
print(f"  One solution: x = {x_particular}")
print(f"  A*x = {A_sys @ x_particular} = b  (correct, but set is not a subspace)")

Vector spaces over other fields

The definition works over any field $\mathbb{F}$, not just $\mathbb{R}$:

| Field | Notation | Used in |
|---|---|---|
| Real numbers $\mathbb{R}$ | $\mathbb{R}^n$ | Standard ML, transformers |
| Complex numbers $\mathbb{C}$ | $\mathbb{C}^n$ | Quantum computing, Fourier analysis, signal processing |
| Binary field $\mathbb{F}_2 = \{0,1\}$ | $\mathbb{F}_2^n$ | Error-correcting codes, cryptography |
| Rationals $\mathbb{Q}$ | $\mathbb{Q}^n$ | Exact arithmetic, symbolic computation |

In $\mathbb{F}_2^n$: addition is XOR, multiplication is AND. All linear algebra carries over — including Gaussian elimination, rank, and null spaces (a rank-over-$\mathbb{F}_2$ sketch follows the next cell).

Code cell 18

# === 2.6 Vector space over F_2 (binary field) ===
# Addition = XOR, Scalar multiplication: 0*v=0, 1*v=v

u_bin = np.array([1, 0, 1, 1])
v_bin = np.array([0, 1, 1, 0])

# Addition in F_2 is XOR
add_f2 = np.bitwise_xor(u_bin, v_bin)
print("F_2^4 vector space:")
print(f"  u = {u_bin}")
print(f"  v = {v_bin}")
print(f"  u + v (XOR) = {add_f2}")

# Additive inverse: in F_2, every element is its own inverse (1+1=0)
print(f"  u + u (XOR) = {np.bitwise_xor(u_bin, u_bin)} (= zero vector, so -u = u in F_2)")

# Commutativity
print(f"  u+v == v+u? {np.array_equal(add_f2, np.bitwise_xor(v_bin, u_bin))}")
print("  Used in: error-correcting codes, parity checks, cryptographic linear algebra")

3. Subspaces — The Three-Condition Test

A subset $W \subseteq V$ is a subspace of $V$ if $W$ is itself a vector space under the same operations inherited from $V$.

Checking all eight axioms would be redundant. Commutativity, associativity, and distributivity are inherited automatically from $V$. Only three conditions need to be verified:

  1. Contains zero: $\mathbf{0} \in W$
  2. Closed under addition: $\mathbf{u}, \mathbf{w} \in W \implies \mathbf{u} + \mathbf{w} \in W$
  3. Closed under scalar multiplication: $\mathbf{w} \in W,\; \alpha \in \mathbb{R} \implies \alpha\mathbf{w} \in W$

Equivalent single condition (often quickest; a numerical sketch follows the list below): for all $\mathbf{u}, \mathbf{w} \in W$ and $\alpha, \beta \in \mathbb{R}$:

$$\alpha\mathbf{u} + \beta\mathbf{w} \in W$$

Why these three suffice

  • Condition 1 ensures $W$ is non-empty and has an additive identity.
  • Condition 2 ensures $+$ stays inside $W$.
  • Condition 3 ensures scalar multiplication stays inside $W$ (and gives additive inverses via $\alpha = -1$).
  • The remaining axioms (commutativity, associativity, distributivity) are equalities that hold in $V$ for all vectors; they hold in particular for vectors in $W \subseteq V$.
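A quick sketch of the single combined condition before the numbered cells (the helper `in_null_space` is illustrative and not part of the original notebook; `null_space` comes from the setup cell): draw random members of $W = \text{null}(A)$ and check that an arbitrary linear combination stays in $W$.

def in_null_space(A, x, tol=1e-10):
    """Membership test for W = null(A)."""
    return np.linalg.norm(A @ x) <= tol * max(1.0, np.linalg.norm(x))

A = np.array([[1.0, 2.0, -1.0]])       # W = plane x + 2y - z = 0
N = null_space(A)                       # basis for W, from the setup cell

rng = np.random.default_rng(0)
u = N @ rng.standard_normal(N.shape[1])
w = N @ rng.standard_normal(N.shape[1])
alpha, beta = rng.standard_normal(2)

combo = alpha * u + beta * w
print("u in W:", in_null_space(A, u))
print("w in W:", in_null_space(A, w))
print("alpha*u + beta*w in W:", in_null_space(A, combo))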

Code cell 20

# === 3.1 Lines through the origin (1D subspaces of R^2 and R^3) ===

# W = {t*(1,2) : t in R} -- a line through origin in R^2
d = np.array([1.0, 2.0])

# Test 1: contains zero?
print("Line W = span{(1,2)} in R^2:")
print(f"  Contains 0? 0*(1,2) = {0*d}  -> Yes")

# Test 2: closed under addition?
w1 = 3.0 * d   # (3, 6)
w2 = -1.5 * d  # (-1.5, -3)
w_sum = w1 + w2 # (1.5, 3) = 1.5*(1,2)
t_sum = w_sum[0] / d[0]  # coefficient
print(f"  w1 = {w1}, w2 = {w2}")
print(f"  w1+w2 = {w_sum} = {t_sum}*(1,2) -> in W? {np.allclose(w_sum, t_sum*d)}")

# Test 3: closed under scalar mult?
alpha = -2.7
w_scaled = alpha * w1
t_scaled = w_scaled[0] / d[0]
print(f"  {alpha}*w1 = {w_scaled} = {t_scaled}*(1,2) -> in W? {np.allclose(w_scaled, t_scaled*d)}")
print("  Conclusion: W is a subspace of R^2. dim(W) = 1.")

Code cell 21

# === 3.2 Planes through the origin (2D subspaces of R^3) ===

# W = {(x,y,z) : x + 2y - z = 0} -- a plane through origin in R^3
# This is null(A) where A = [[1, 2, -1]]

A_plane = np.array([[1.0, 2.0, -1.0]])

# Find null space basis
N = null_space(A_plane)
print("Plane W = {(x,y,z) : x + 2y - z = 0}")
print(f"Null space basis (columns):\n{N}")
print(f"dim(W) = {N.shape[1]}")

# Verify subspace conditions with random elements of W
coeffs1 = np.array([2.0, -1.0])
coeffs2 = np.array([0.5, 3.0])
w1 = N @ coeffs1
w2 = N @ coeffs2

def scalar_value(x):
    return float(np.asarray(x).reshape(-1)[0])

print(f"\nw1 = {w1},  A*w1 = {scalar_value(A_plane @ w1):.10f}  (= 0, so w1 in W)")
print(f"w2 = {w2},  A*w2 = {scalar_value(A_plane @ w2):.10f}  (= 0, so w2 in W)")

# Test closure
print(f"\nw1+w2 = {w1+w2},  A*(w1+w2) = {scalar_value(A_plane @ (w1+w2)):.10f}  (closed under +)")
print(f"3.5*w1 = {3.5*w1},  A*(3.5*w1) = {scalar_value(A_plane @ (3.5*w1)):.10f}  (closed under scalar mult)")
print("All three conditions satisfied -> W is a subspace.")

Code cell 22

# === 3.3 Null space of a matrix is always a subspace ===

# null(A) = {x : Ax = 0}
# Proof that it's a subspace:
#   1. 0 in null(A) because A*0 = 0
#   2. If Ax=0 and Ay=0, then A(x+y) = Ax+Ay = 0+0 = 0
#   3. If Ax=0, then A(alpha*x) = alpha*Ax = alpha*0 = 0

A = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 1.0, 2.0, 3.0]
])

N = null_space(A)
rank_A = np.linalg.matrix_rank(A)
n = A.shape[1]

print(f"A is {A.shape[0]}x{A.shape[1]}, rank = {rank_A}")
print(f"nullity = n - rank = {n} - {rank_A} = {n - rank_A}")
print(f"Null space basis (columns):\n{N}")
print(f"\nVerification: A @ N =\n{A @ N}")
print(f"All zero? {np.allclose(A @ N, 0)}")

# Show closure explicitly
if N.shape[1] >= 2:
    x = N[:, 0]
    y = N[:, 1]
    print(f"\nx in null(A): A*x = {A @ x}")
    print(f"y in null(A): A*y = {A @ y}")
    print(f"x+y in null(A)? A*(x+y) = {A @ (x+y)} -> {np.allclose(A @ (x+y), 0)}")
    print(f"5x in null(A)?  A*(5x) = {A @ (5*x)} -> {np.allclose(A @ (5*x), 0)}")

Code cell 23

# === 3.4 Column space of a matrix is always a subspace ===

# col(A) = {Ax : x in R^n}
# Proof that it's a subspace:
#   1. 0 in col(A) because A*0 = 0
#   2. If Ax, Ay in col(A), then Ax+Ay = A(x+y) in col(A)
#   3. alpha*(Ax) = A(alpha*x) in col(A)

A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0]
])

C = col_space(A)
rank_A = np.linalg.matrix_rank(A)

print(f"A =\n{A}")
print(f"rank(A) = {rank_A}")
print(f"dim(col(A)) = {rank_A}")
print(f"Column space basis (orthonormal columns):\n{C}")

# Generate two vectors in col(A) and check closure
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 2.0, -1.0])
v1 = A @ x1
v2 = A @ x2
print(f"\nv1 = A*x1 = {v1}  (in col(A))")
print(f"v2 = A*x2 = {v2}  (in col(A))")
v_sum = v1 + v2
# v1+v2 = A*(x1+x2), so it's in col(A)
print(f"v1+v2 = {v_sum} = A*(x1+x2) = A*{x1+x2} = {A @ (x1+x2)} -> in col(A)")

Code cell 24

# === 3.5 Symmetric matrices form a subspace of R^{nxn} ===

# W = {A in R^{nxn} : A = A^T}
# Check: 0 is symmetric, sum of symmetric = symmetric, scalar * symmetric = symmetric

n = 3
S1 = np.array([[1.0, 2.0, 3.0], [2.0, 5.0, 4.0], [3.0, 4.0, 6.0]])
S2 = np.array([[0.0, -1.0, 2.0], [-1.0, 3.0, 0.0], [2.0, 0.0, 1.0]])

print("Symmetric matrices as a subspace of R^{3x3}:")
print(f"  S1 symmetric? {np.allclose(S1, S1.T)}")
print(f"  S2 symmetric? {np.allclose(S2, S2.T)}")
print(f"  Zero symmetric? {np.allclose(np.zeros((n,n)), np.zeros((n,n)).T)}")
print(f"  S1+S2 symmetric? {np.allclose(S1+S2, (S1+S2).T)}")
print(f"  -3*S1 symmetric? {np.allclose(-3*S1, (-3*S1).T)}")
print(f"\n  dim(R^{{3x3}}) = {n*n}")
print(f"  dim(Sym_3)    = n(n+1)/2 = {n*(n+1)//2}")
print(f"  Sym_3 is a proper subspace of R^{{3x3}} (6 < 9 dimensions)")

Code cell 25

# === 3.6 Non-subspace examples: systematic failure analysis ===

print("=== {x : Ax = b} with b != 0  (affine subspace, NOT a subspace) ===")
A_aff = np.array([[1.0, 1.0], [0.0, 1.0]])
b_aff = np.array([3.0, 1.0])
x_p = np.linalg.solve(A_aff, b_aff)
print(f"  A*0 = {A_aff @ np.zeros(2)} != b = {b_aff}  -> 0 not in set (FAILS)")
# Also: if x1, x2 are solutions, x1+x2 is not: A(x1+x2) = 2b != b
print(f"  If x is a solution, A*(2x) = 2b = {2*b_aff} != b  -> not closed under scalar mult")

print("\n=== Unit ball {{x : ||x|| <= 1}} ===")
x_ball = np.array([0.8, 0.5])
print(f"  x = {x_ball}, ||x|| = {np.linalg.norm(x_ball):.3f} <= 1  (in ball)")
print(f"  2*x = {2*x_ball}, ||2x|| = {np.linalg.norm(2*x_ball):.3f} > 1  (NOT in ball -> FAILS)")
print(f"  Contains 0? Yes. Closed under +? Maybe. Scalar mult? No.")

print("\n=== First quadrant {{(x,y) : x>=0, y>=0}} ===")
v_quad = np.array([1.0, 2.0])
print(f"  v = {v_quad}  (in first quadrant)")
print(f"  (-1)*v = {-1*v_quad}  (NOT in first quadrant -> FAILS scalar mult)")
print(f"  Contains 0? Yes. Closed under +? Yes. But fails scalar mult.")

print("\n=== Invertible matrices (NOT a subspace of R^{nxn}) ===")
print("  The zero matrix is NOT invertible -> 0 not in set (FAILS immediately)")
print("  Also: A + (-A) = 0, which is not invertible -> not closed under addition")

Code cell 26

# === 3.7 The two trivial subspaces ===

# Every vector space V has exactly two subspaces that always exist:
#   1. {0}  -- the trivial subspace, dimension 0
#   2. V    -- the whole space, dimension dim(V)

print("The two trivial subspaces of R^4:")
print(f"  {{0}}: dim = 0")
print(f"  R^4:  dim = 4")
print(f"  Any subspace W satisfies {{0}} ⊆ W ⊆ R^4")
print(f"  Possible subspace dimensions in R^4: 0, 1, 2, 3, 4")

# Demonstrate with null spaces of different ranks
for r in range(5):
    if r == 0:
        M = np.zeros((1, 4))
    elif r == 4:
        M = np.eye(4)
    else:
        M = np.random.randn(r, 4)
    N = null_space(M)
    dim_null = N.shape[1] if N.size > 0 else 0
    print(f"  rank-{r} matrix -> null space dim = {dim_null}")

Visualising subspaces in $\mathbb{R}^3$

In $\mathbb{R}^3$, subspaces can only be:

  • The origin (dim 0)
  • A line through the origin (dim 1)
  • A plane through the origin (dim 2)
  • All of $\mathbb{R}^3$ (dim 3)

No curves, no spheres, no shifted planes — only flat spaces through the origin.

Code cell 28

# === 3.8 Visualise subspaces of R^3 ===

if HAS_MPL:
    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure(figsize=(14, 5))

    # 1D subspace: line through origin in direction (1,2,1)
    ax1 = fig.add_subplot(131, projection='3d')
    d_line = np.array([1, 2, 1], dtype=float)
    t = np.linspace(-2, 2, 50)
    ax1.plot(t*d_line[0], t*d_line[1], t*d_line[2], 'b-', linewidth=2)
    ax1.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax1.set_title('1D subspace (line)', fontsize=10)
    ax1.set_xlabel('x'); ax1.set_ylabel('y'); ax1.set_zlabel('z')

    # 2D subspace: plane x + y - z = 0
    ax2 = fig.add_subplot(132, projection='3d')
    s_range = np.linspace(-2, 2, 20)
    S, T = np.meshgrid(s_range, s_range)
    # Two basis vectors for the plane x+y-z=0: (1,0,1) and (0,1,1)
    X_plane = S * 1 + T * 0
    Y_plane = S * 0 + T * 1
    Z_plane = S * 1 + T * 1
    ax2.plot_surface(X_plane, Y_plane, Z_plane, alpha=0.3, color='cyan')
    ax2.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax2.set_title('2D subspace (plane)', fontsize=10)
    ax2.set_xlabel('x'); ax2.set_ylabel('y'); ax2.set_zlabel('z')

    # NOT a subspace: plane x + y + z = 1 (does not pass through origin)
    ax3 = fig.add_subplot(133, projection='3d')
    X_aff = S * 1 + T * 0
    Y_aff = S * 0 + T * 1
    Z_aff = 1 - X_aff - Y_aff
    ax3.plot_surface(X_aff, Y_aff, Z_aff, alpha=0.3, color='salmon')
    ax3.scatter([0], [0], [0], color='red', s=60, zorder=5, label='origin')
    ax3.set_title('NOT a subspace (affine)', fontsize=10)
    ax3.set_xlabel('x'); ax3.set_ylabel('y'); ax3.set_zlabel('z')

    plt.tight_layout()
    plt.show()
else:
    print("(matplotlib not available — skipping 3D subspace plots)")

4. Span and Linear Combinations

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k \in V$ is any vector of the form:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k$$

where $\alpha_1, \ldots, \alpha_k \in \mathbb{R}$ are arbitrary scalars (the coefficients).

The span of a set $S = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is the set of all linear combinations:

$$\text{span}(S) = \left\{ \sum_{i=1}^k \alpha_i \mathbf{v}_i : \alpha_1, \ldots, \alpha_k \in \mathbb{R} \right\}$$

Key facts about span

  • $\text{span}(S)$ is always a subspace (the smallest subspace containing all vectors in $S$).
  • $\text{span}(\emptyset) = \{\mathbf{0}\}$ by convention.
  • If $S \subseteq T$, then $\text{span}(S) \subseteq \text{span}(T)$.
  • A set $S$ spans $V$ if $\text{span}(S) = V$.

Geometric pictures

  • $\text{span}\{\mathbf{v}\}$: a line through the origin (1D)
  • $\text{span}\{\mathbf{u}, \mathbf{v}\}$ (non-parallel): a plane through the origin (2D)
  • $\text{span}\{\mathbf{u}, \mathbf{v}, \mathbf{w}\}$ (not coplanar): a 3D subspace

Connection to matrices

Given $\mathbf{v}_1, \ldots, \mathbf{v}_k \in \mathbb{R}^m$, form $V = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{m \times k}$. Then:

$$\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \{V\boldsymbol{\alpha} : \boldsymbol{\alpha} \in \mathbb{R}^k\} = \text{col}(V)$$

To check whether $\mathbf{b} \in \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$: solve $V\boldsymbol{\alpha} = \mathbf{b}$. The system is consistent iff $\text{rank}(V) = \text{rank}([V \mid \mathbf{b}])$.

Code cell 30

# === 4.1 Linear combinations and span in R^3 ===

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([1.0, 1.0, 0.0])  # v3 = v1 + v2 (redundant)

# Linear combination
combo = 2*v1 + 3*v2 - 1*v3
print("v1 =", v1)
print("v2 =", v2)
print("v3 = v1 + v2 =", v3)
print("2*v1 + 3*v2 - 1*v3 =", combo, " (lies in the xy-plane)")

# span{v1, v2} = xy-plane (2D subspace of R^3)
# span{v1, v2, v3} = same xy-plane (v3 is redundant)
V12 = np.column_stack([v1, v2])
V123 = np.column_stack([v1, v2, v3])
print(f"\nrank([v1|v2])    = {np.linalg.matrix_rank(V12)}  -> span is 2D")
print(f"rank([v1|v2|v3]) = {np.linalg.matrix_rank(V123)}  -> adding v3 didn't increase span")
print("v3 is redundant: it's already in span{v1, v2}.")

Code cell 31

# === 4.2 Checking membership in a span ===

# Does b lie in span{v1, v2}?
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 1.0])
V = np.column_stack([v1, v2])

b_in = np.array([2.0, 5.0, 7.0])     # 2*v1 + 1*v2
b_out = np.array([1.0, 0.0, 0.0])    # not in span

for label, b in [("b_in", b_in), ("b_out", b_out)]:
    augmented = np.column_stack([V, b])
    rank_V = np.linalg.matrix_rank(V)
    rank_aug = np.linalg.matrix_rank(augmented)
    in_span = rank_V == rank_aug
    print(f"{label} = {b}")
    print(f"  rank(V) = {rank_V}, rank([V|b]) = {rank_aug}")
    print(f"  In span? {in_span}")
    if in_span:
        coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
        print(f"  Coefficients: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2")
        print(f"  Verify: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2 = {V @ coeffs}")
    print()

Code cell 32

# === 4.3 Span = column space: the matrix viewpoint ===

# The columns of a weight matrix W span col(W) -- the reachable outputs.
# If rank(W) < m, some output directions are unreachable.

W = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [0.0, 1.0]
])
print(f"W (3x2) =\n{W}")
print(f"rank(W) = {np.linalg.matrix_rank(W)}")
print(f"col(W) is a {np.linalg.matrix_rank(W)}D subspace of R^3")

# Try many inputs: outputs always land in col(W)
C = col_space(W)
print(f"\nOrthonormal basis for col(W):\n{C}")

np.random.seed(0)
for i in range(5):
    x = np.random.randn(2)
    y = W @ x
    # Project y onto col(W) and check it equals y
    y_proj = C @ (C.T @ y)
    print(f"  x={x} -> Wx={y} -> proj_col(W)(Wx)={y_proj} -> same? {np.allclose(y, y_proj)}")

print("\nAll outputs lie in col(W). Directions outside col(W) are unreachable.")

5. Linear Independence

Vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ are linearly independent if the only linear combination equal to the zero vector uses all-zero coefficients:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \implies \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0$$

Equivalently: no vector in the set is a linear combination of the others. Each vector adds a genuinely new direction.

Linearly dependent: there exist non-trivial (not all-zero) coefficients such that $\sum_i \alpha_i \mathbf{v}_i = \mathbf{0}$. At least one vector is redundant.

Checking independence

Form $V = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k]$ and row reduce:

  • Independent iff every column has a pivot iff $\text{rank}(V) = k$
  • Dependent iff $\text{rank}(V) < k$
  • Determinant test (square case): independent iff $\det(V) \neq 0$

Key properties

  • Any set containing $\mathbf{0}$ is dependent ($1 \cdot \mathbf{0} + 0 \cdot \mathbf{v}_2 + \cdots = \mathbf{0}$)
  • A single non-zero vector is always independent
  • Two vectors are dependent iff one is a scalar multiple of the other
  • More vectors than dimensions $\implies$ always dependent ($k > n$ in $\mathbb{R}^n$)

Code cell 34

# === 5.1 Independent vs dependent sets ===

# Independent set in R^3
ind = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float).T  # standard basis
# Dependent set: third vector = first + second
dep = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float).T

print("Independent set (standard basis of R^3):")
print(f"  rank = {np.linalg.matrix_rank(ind)}, k = {ind.shape[1]}")
print(f"  rank == k? {np.linalg.matrix_rank(ind) == ind.shape[1]} -> independent")

print("\nDependent set {(1,0,0), (0,1,0), (1,1,0)}:")
print(f"  rank = {np.linalg.matrix_rank(dep)}, k = {dep.shape[1]}")
print(f"  rank < k? {np.linalg.matrix_rank(dep) < dep.shape[1]} -> dependent")
print("  Dependence relation: 1*(1,0,0) + 1*(0,1,0) + (-1)*(1,1,0) = (0,0,0)")
print("  Verify:", 1*dep[:,0] + 1*dep[:,1] + (-1)*dep[:,2])

Code cell 35

# === 5.2 Detecting dependence via RREF and the determinant ===

V = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
])
print("V =\n", V)
print(f"det(V) = {np.linalg.det(V):.6f}")
print(f"rank(V) = {np.linalg.matrix_rank(V)}")

R, pivots = rref(V)
print(f"\nRREF(V) =\n{R}")
print(f"Pivot columns: {pivots}")
print(f"Only {len(pivots)} pivots for 3 columns -> column 3 is dependent on columns 1,2")
print(f"Dependence: col3 = {R[0,2]:.0f}*col1 + {R[1,2]:.0f}*col2")
print(f"Verify: {R[0,2]}*(1,4,7) + {R[1,2]}*(2,5,8) = {R[0,2]*V[:,0] + R[1,2]*V[:,1]}")
print(f"col3 = {V[:,2]}")

Code cell 36

# === 5.3 More vectors than dimensions -> always dependent ===

np.random.seed(42)
n = 3
k = 5  # 5 vectors in R^3 -> must be dependent
V = np.random.randn(n, k)
print(f"{k} vectors in R^{n}:")
print(f"  rank([v1|...|v{k}]) = {np.linalg.matrix_rank(V)}")
print(f"  rank < k = {k}? {np.linalg.matrix_rank(V) < k} -> dependent (pigeonhole)")
print(f"  At most {n} vectors in R^{n} can be independent.")

# The null space of V^T gives the dependence relations
N = null_space(V.T)
if N.shape[1] > 0:
    c = N[:, 0]
    print(f"\n  Dependence coefficients: {c}")
    print(f"  Sum c_i * v_i = {V @ c}")
    print(f"  Is zero? {np.allclose(V @ c, 0)}")

Code cell 37

# === 5.4 AI context: superposition — more features than dimensions ===

# The superposition hypothesis: LLMs store more features than dimensions.
# If |features| > d, feature directions MUST be linearly dependent.
# This forces features to "share" dimensions and interfere with each other.

d = 4           # embedding dimension
n_features = 8  # more features than dimensions

np.random.seed(7)
# Random unit feature directions
features = np.random.randn(n_features, d)
features = features / np.linalg.norm(features, axis=1, keepdims=True)

print(f"{n_features} feature directions in R^{d}:")
print(f"  rank = {np.linalg.matrix_rank(features)}")
print(f"  {n_features} > {d} -> features are linearly dependent (superposition)")

# Measure interference via pairwise cosine similarities
G = features @ features.T
np.fill_diagonal(G, 0)  # ignore self-similarity
print(f"\n  Max |cosine| between features: {np.max(np.abs(G)):.4f}")
print(f"  Mean |cosine| between features: {np.mean(np.abs(G[np.triu_indices(n_features, 1)])):.4f}")
print("  Non-zero cosines = interference between features.")

6. Basis and Dimension

A basis for a vector space $V$ is a set $\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}$ that is:

  1. Linearly independent — no vector is redundant
  2. Spans $V$ — every vector in $V$ is a linear combination of the $\mathbf{b}_i$

A basis is both the minimal spanning set and the maximal independent set.

The Unique Representation Theorem

If $\mathcal{B}$ is a basis, then every $\mathbf{v} \in V$ has a unique representation $\mathbf{v} = \sum_i \alpha_i \mathbf{b}_i$. The scalars $(\alpha_1, \ldots, \alpha_n)$ are the coordinates of $\mathbf{v}$ with respect to $\mathcal{B}$.
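A short numerical companion to the theorem (an added sketch, reusing the setup cell's `null_space` helper): coordinates are unique because two coordinate vectors for the same $\mathbf{v}$ would differ by an element of the basis matrix's null space, which is trivial.

B = np.array([[1.0, 1.0], [1.0, -1.0]])   # columns are the basis vectors
x = np.array([4.0, 2.0])

c = np.linalg.solve(B, x)                  # the coordinate vector [x]_B
print("coords of x in basis B:", c)

# If B c1 = B c2, then B (c1 - c2) = 0; a basis matrix is invertible,
# so its null space is {0} and c1 = c2.
print("dim null(B) =", null_space(B).shape[1])   # 0 -> coordinates are unique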

Dimension

All bases for $V$ have the same number of elements. This number is the dimension:

$$\dim(V) = |\text{any basis of } V|$$

| Space | Dimension |
|---|---|
| $\mathbb{R}^n$ | $n$ |
| $\mathbb{R}^{m \times n}$ | $mn$ |
| $P_n$ (polynomials of degree $\leq n$) | $n+1$ |
| Symmetric $n \times n$ matrices | $n(n+1)/2$ |
| $\{\mathbf{0}\}$ | $0$ |

Dimension counting theorems

Rank-Nullity: $\text{nullity}(A) + \text{rank}(A) = n$ for $A \in \mathbb{R}^{m \times n}$.

Dimension of sum: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$.

Code cell 39

# === 6.1 Coordinates in the standard basis vs a non-standard basis ===

# Standard basis for R^2: {e1, e2}
# Non-standard basis: B = {(1,1), (1,-1)}
B = np.array([[1.0, 1.0], [1.0, -1.0]]).T  # columns are basis vectors
x = np.array([4.0, 2.0])

# Coordinates in standard basis = entries themselves
print(f"x = {x}")
print(f"[x]_standard = {x}  (coordinates = entries)")

# Coordinates in basis B: solve B * coords = x
coords_B = np.linalg.solve(B, x)
print(f"\nBasis B = columns of\n{B}")
print(f"[x]_B = {coords_B}")
print(f"Verify: {coords_B[0]}*(1,1) + {coords_B[1]}*(1,-1) = {B @ coords_B}")

# The vector is unchanged; only the coordinate representation changes
print(f"\nSame vector, different coordinates:")
print(f"  Standard: ({x[0]}, {x[1]})")
print(f"  Basis B:  ({coords_B[0]}, {coords_B[1]})")

Code cell 40

# === 6.2 Change of basis ===

# Two bases for R^2
B1 = np.array([[1.0, 1.0], [1.0, -1.0]]).T   # columns = basis vectors
B2 = np.array([[2.0, 1.0], [0.0, 1.0]]).T

# Change-of-basis matrix from B1 to B2: P = B2^{-1} B1
P_B1_to_B2 = np.linalg.solve(B2, B1)
print("Change-of-basis matrix P (B1 -> B2):")
print(P_B1_to_B2)

# A vector x with known coords in B1
x = np.array([4.0, 2.0])
coords_B1 = np.linalg.solve(B1, x)
coords_B2 = P_B1_to_B2 @ coords_B1
coords_B2_direct = np.linalg.solve(B2, x)

print(f"\nx = {x}")
print(f"[x]_B1 = {coords_B1}")
print(f"[x]_B2 via change-of-basis = {coords_B2}")
print(f"[x]_B2 via direct solve    = {coords_B2_direct}")
print(f"Match? {np.allclose(coords_B2, coords_B2_direct)}")

# Inverse change of basis
P_B2_to_B1 = np.linalg.inv(P_B1_to_B2)
print(f"\nP_B2_to_B1 @ P_B1_to_B2 = I? {np.allclose(P_B2_to_B1 @ P_B1_to_B2, np.eye(2))}")

Code cell 41

# === 6.3 Rank-Nullity theorem: conservation of dimensions ===

np.random.seed(42)
for m, n in [(3, 5), (4, 4), (5, 3), (3, 3)]:
    A = np.random.randn(m, n)
    r = np.linalg.matrix_rank(A)
    nullity = n - r
    print(f"A is {m}x{n}: rank = {r}, nullity = {nullity}, rank + nullity = {r + nullity} = n = {n}")

# Concrete example with structure
A = np.array([
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 2.0, 2.0, 1.0]
])
r = np.linalg.matrix_rank(A)
N = null_space(A)
print(f"\nA (3x4) =\n{A}")
print(f"rank = {r}, nullity = {N.shape[1]}")
print(f"rank + nullity = {r + N.shape[1]} = n = {A.shape[1]}")
print(f"Null space basis:\n{N}")
print(f"Verify A @ N = 0? {np.allclose(A @ N, 0)}")

7. The Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$ with rank $r$, there are four fundamental subspaces:

| Subspace | Lives in | Dimension | Definition |
|---|---|---|---|
| Column space $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ — reachable outputs |
| Null space $\text{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}$ — invisible inputs |
| Row space $\text{row}(A)$ | $\mathbb{R}^n$ | $r$ | $\{A^T\mathbf{y} : \mathbf{y} \in \mathbb{R}^m\}$ — directions $A$ listens to |
| Left null space $\text{null}(A^T)$ | $\mathbb{R}^m$ | $m - r$ | $\{\mathbf{y} : A^T\mathbf{y} = \mathbf{0}\}$ — unreachable output directions |

The orthogonality relations

$$\text{row}(A) \perp \text{null}(A) \ \text{in } \mathbb{R}^n, \qquad \text{col}(A) \perp \text{null}(A^T) \ \text{in } \mathbb{R}^m$$

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A), \qquad \mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^T)$$

Strang's Big Picture

              R^n                              R^m
    ┌──────────────────┐           ┌──────────────────┐
    │  row(A)          │           │  col(A)          │
    │  dim = r         │────A─────▶│  dim = r         │
    │                  │           │                  │
    │    ⊕             │           │    ⊕             │
    │                  │           │                  │
    │  null(A)         │────A─────▶│  {0}             │
    │  dim = n − r     │  (→ 0)    │                  │
    └──────────────────┘           └──────────────────┘

$A$ is a bijection from $\text{row}(A)$ to $\text{col}(A)$, and $A$ maps $\text{null}(A)$ to $\{\mathbf{0}\}$. A numerical check of the bijection follows.
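A small check of the bijection claim (an added sketch, not one of the original cells): project a vector onto $\text{row}(A)$, push it through $A$ into $\text{col}(A)$, and use the Moore-Penrose pseudoinverse to come back; restricted to the row space, the round trip is the identity.

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 2.0]])
r = np.linalg.matrix_rank(A)
U, S, Vt = np.linalg.svd(A)

x = np.array([2.0, -1.0, 3.0])
x_row = Vt[:r].T @ (Vt[:r] @ x)        # component of x in row(A)

y = A @ x_row                           # lands in col(A)
x_back = np.linalg.pinv(A) @ y          # pinv maps col(A) back into row(A)
print("x_row  =", x_row)
print("x_back =", x_back)
print("round trip is identity on row(A)?", np.allclose(x_row, x_back))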

Code cell 43

# === 7.1 Computing all four fundamental subspaces via SVD ===

A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 1.0, 2.0]
])

U, S, Vt = np.linalg.svd(A, full_matrices=True)
r = np.linalg.matrix_rank(A)
m, n = A.shape

col_basis = U[:, :r]
null_A_basis = Vt[r:].T
row_basis = Vt[:r]
left_null_basis = U[:, r:]

print(f"A ({m}x{n}), rank = {r}")
print(f"Singular values: {S}")
print(f"\n--- Four Fundamental Subspaces ---")
print(f"\nColumn space (dim {r}, lives in R^{m}):")
print(col_basis)
print(f"\nNull space (dim {n-r}, lives in R^{n}):")
print(null_A_basis)
print(f"\nRow space (dim {r}, lives in R^{n}):")
print(row_basis)
print(f"\nLeft null space (dim {m-r}, lives in R^{m}):")
print(left_null_basis)

# Dimension check
print(f"\n--- Dimension checks ---")
print(f"col + left_null = {r} + {m-r} = {m} = m")
print(f"row + null      = {r} + {n-r} = {n} = n")

Code cell 44

# === 7.2 Orthogonality verification and direct-sum decomposition ===

# Verify: row(A) ⊥ null(A)  and  col(A) ⊥ null(A^T)
print("Orthogonality checks:")
if null_A_basis.size > 0:
    cross = row_basis @ null_A_basis
    print(f"  row(A)^T @ null(A) = \n{cross}")
    print(f"  All zero? {np.allclose(cross, 0)}")

if left_null_basis.size > 0:
    cross2 = col_basis.T @ left_null_basis
    print(f"  col(A)^T @ null(A^T) = \n{cross2}")
    print(f"  All zero? {np.allclose(cross2, 0)}")

# Direct-sum decomposition in the domain R^n
z = np.array([2.0, -1.0, 3.0])
row_comp = row_basis.T @ (row_basis @ z)          # projection onto row space
null_comp = null_A_basis @ (null_A_basis.T @ z)    # projection onto null space

print(f"\n--- Direct-sum decomposition of z = {z} in R^{n} ---")
print(f"  Row-space component:  {row_comp}")
print(f"  Null-space component: {null_comp}")
print(f"  Sum:                  {row_comp + null_comp}")
print(f"  Equals z? {np.allclose(row_comp + null_comp, z)}")
print(f"  Orthogonal? dot = {np.dot(row_comp, null_comp):.10f}")
print(f"\n  A maps row-comp to: {A @ row_comp}")
print(f"  A maps null-comp to: {A @ null_comp} (= 0, as expected)")

Code cell 45

# === 7.3 AI interpretation: weight matrix subspaces ===

# For a neural network layer W in R^{m x n}:
#   col(W) = reachable output directions
#   null(W) = input directions the layer ignores
#   row(W) = input directions the layer listens to

np.random.seed(42)
m, n, r = 6, 8, 3  # 6 outputs, 8 inputs, effective rank 3

# Create a rank-r weight matrix (simulating a bottleneck layer)
U_full = np.random.randn(m, r)
V_full = np.random.randn(r, n)
W = U_full @ V_full

print(f"Weight matrix W: {m}x{n}, rank = {np.linalg.matrix_rank(W)}")
print(f"\nWhat this means for the layer:")
print(f"  col(W): {r}D subspace of R^{m} -> only {r} of {m} output directions are reachable")
print(f"  null(W): {n-r}D subspace of R^{n} -> {n-r} input directions are ignored")
print(f"  row(W): {r}D subspace of R^{n} -> only {r} input directions matter")
print(f"\n  Dead output dimensions: {m-r}")
print(f"  Ignored input dimensions: {n-r}")

# Demonstrate: two inputs differing only in null(W) produce the same output
N_W = null_space(W)
x = np.random.randn(n)
z = N_W @ np.random.randn(N_W.shape[1]) * 5  # large null-space perturbation
print(f"\n  x and x+z differ by a null-space vector:")
print(f"  ||z||  = {np.linalg.norm(z):.4f}  (large perturbation)")
print(f"  W*x    = {W @ x}")
print(f"  W*(x+z)= {W @ (x + z)}")
print(f"  Same output? {np.allclose(W @ x, W @ (x + z))}  (null space is invisible)")

8. Subspace Operations

Sum of subspaces

$$W_1 + W_2 = \{\mathbf{w}_1 + \mathbf{w}_2 : \mathbf{w}_1 \in W_1,\; \mathbf{w}_2 \in W_2\}$$

The sum is the smallest subspace containing both $W_1$ and $W_2$.

Dimension formula: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$

Intersection of subspaces

$$W_1 \cap W_2 = \{\mathbf{v} : \mathbf{v} \in W_1 \text{ and } \mathbf{v} \in W_2\}$$

Always a subspace — the largest subspace contained in both. Note: $W_1 \cup W_2$ is generally not a subspace.

Direct sum

$V = W_1 \oplus W_2$ if $W_1 + W_2 = V$ and $W_1 \cap W_2 = \{\mathbf{0}\}$.

Every $\mathbf{v} \in V$ then has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2$ with $\mathbf{w}_1 \in W_1$ and $\mathbf{w}_2 \in W_2$.

Orthogonal projection

For a subspace $W$ with orthonormal basis $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k]$:

$$\text{Proj}_W(\mathbf{v}) = QQ^T\mathbf{v} = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle\, \mathbf{q}_i$$

The projection matrix $P = QQ^T$ satisfies $P^2 = P$ (idempotent) and $P^T = P$ (symmetric).

Code cell 47

# === 8.1 Sum and intersection of subspaces ===

# W1 = span{(1,0,0), (0,1,0)} = xy-plane
# W2 = span{(0,1,0), (0,0,1)} = yz-plane
# W1 + W2 = R^3 (every vector is reachable)
# W1 ∩ W2 = span{(0,1,0)} = y-axis (shared direction)

B1 = np.array([[1,0,0],[0,1,0]], dtype=float).T
B2 = np.array([[0,1,0],[0,0,1]], dtype=float).T

# Sum: combine all basis vectors and find rank
B_sum = np.column_stack([B1, B2])
dim_sum = np.linalg.matrix_rank(B_sum)
dim_W1 = np.linalg.matrix_rank(B1)
dim_W2 = np.linalg.matrix_rank(B2)
dim_intersect = dim_W1 + dim_W2 - dim_sum

print("W1 = xy-plane (dim 2), W2 = yz-plane (dim 2)")
print(f"dim(W1) = {dim_W1}")
print(f"dim(W2) = {dim_W2}")
print(f"dim(W1 + W2) = {dim_sum}")
print(f"dim(W1 ∩ W2) = dim(W1) + dim(W2) - dim(W1+W2) = {dim_W1}+{dim_W2}-{dim_sum} = {dim_intersect}")
print(f"\nDimension formula verified: {dim_sum} = {dim_W1} + {dim_W2} - {dim_intersect}")
print("W1 ∩ W2 = y-axis (the only direction shared by both planes)")

Code cell 48

# === 8.2 Union is NOT a subspace (important counterexample) ===

# W1 = x-axis = span{(1,0)}, W2 = y-axis = span{(0,1)}
v1 = np.array([1.0, 0.0])  # in W1
v2 = np.array([0.0, 1.0])  # in W2
v_sum = v1 + v2             # (1,1) -- not in W1 or W2

print("W1 = x-axis, W2 = y-axis")
print(f"  (1,0) in W1 ∪ W2? Yes")
print(f"  (0,1) in W1 ∪ W2? Yes")
print(f"  (1,0)+(0,1) = {v_sum}")
print(f"  (1,1) in W1? No (not on x-axis)")
print(f"  (1,1) in W2? No (not on y-axis)")
print(f"  (1,1) in W1 ∪ W2? No -> union is NOT closed under addition")
print(f"\n  W1 ∪ W2 is NOT a subspace.")
print(f"  W1 + W2 = R^2 IS a subspace (the correct notion of 'combining' subspaces).")

Code cell 49

# === 8.3 Orthogonal projection onto a subspace ===

# W = span{(1,1,0), (0,1,1)} in R^3
W_basis = np.array([[1,1,0],[0,1,1]], dtype=float).T

# Orthonormalize the basis
Q = gram_schmidt(W_basis.T).T  # columns are orthonormal basis for W
print("Orthonormal basis Q for W:")
print(Q)
print(f"Q^T Q = \n{Q.T @ Q}")

# Projection matrix P = Q Q^T
P = Q @ Q.T
print(f"\nProjection matrix P = QQ^T:")
print(P)
print(f"P^2 = P (idempotent)? {np.allclose(P @ P, P)}")
print(f"P^T = P (symmetric)?  {np.allclose(P.T, P)}")
print(f"rank(P) = {np.linalg.matrix_rank(P)} = dim(W)")

# Project a vector
v = np.array([3.0, 1.0, 2.0])
proj_v = P @ v
residual = v - proj_v

print(f"\nv = {v}")
print(f"Proj_W(v) = {proj_v}")
print(f"Residual v - Proj_W(v) = {residual}")
print(f"Residual ⊥ W? Q^T @ residual = {Q.T @ residual}")
print(f"||v||^2 = ||proj||^2 + ||resid||^2? {np.linalg.norm(v)**2:.4f} = {np.linalg.norm(proj_v)**2:.4f} + {np.linalg.norm(residual)**2:.4f} = {np.linalg.norm(proj_v)**2 + np.linalg.norm(residual)**2:.4f}")

Code cell 50

# === 8.4 Principal angles between subspaces ===

# The principal angles between two subspaces measure how "aligned" they are.
# cos(theta_i) = sigma_i(Q1^T Q2)

np.random.seed(42)

# Two 2D subspaces of R^4
A1 = np.random.randn(4, 2)
A2 = np.random.randn(4, 2)

Q1 = gram_schmidt(A1.T).T  # orthonormal basis, columns
Q2 = gram_schmidt(A2.T).T

cos_angles = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))

print("Two random 2D subspaces of R^4:")
print(f"  cos(principal angles) = {cos_angles}")
print(f"  principal angles (degrees) = {angles_deg}")
print(f"\n  First angle = {angles_deg[0]:.1f}° (closest directions between subspaces)")
print(f"  Second angle = {angles_deg[1]:.1f}° (next pair)")

# Identical subspaces -> angles = 0
cos_same = np.linalg.svd(Q1.T @ Q1, compute_uv=False)
print(f"\n  Same subspace: cos(angles) = {cos_same} (all 1 -> all angles 0°)")

# Orthogonal subspaces: take the orthogonal complement of span(Q1)
# (a fixed subspace like span{e3,e4} is NOT orthogonal to a random Q1)
Q_orth = null_space(Q1.T)  # columns span the 2D complement of col(Q1) in R^4
cos_orth = np.linalg.svd(Q1.T @ Q_orth, compute_uv=False)
angles_orth = np.degrees(np.arccos(np.clip(cos_orth, -1, 1)))
print(f"  Orthogonal complement: angles = {angles_orth} degrees (all 90)")

9. Inner Product Spaces and Orthogonality

An inner product on VV is a function ,:V×VR\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R} satisfying:

  1. Symmetry: u,v=v,u\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle
  2. Linearity: αu+βw,v=αu,v+βw,v\langle \alpha\mathbf{u} + \beta\mathbf{w}, \mathbf{v} \rangle = \alpha\langle \mathbf{u}, \mathbf{v} \rangle + \beta\langle \mathbf{w}, \mathbf{v} \rangle
  3. Positive definiteness: v,v0\langle \mathbf{v}, \mathbf{v} \rangle \geq 0, with equality iff v=0\mathbf{v} = \mathbf{0}

Induced norm: v=v,v\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}. Cauchy-Schwarz: u,vuv|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \|\mathbf{v}\|.

Standard inner products

SpaceInner productInduced norm
Rn\mathbb{R}^nu,v=uTv\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T\mathbf{v}2\ell^2 norm
Rm×n\mathbb{R}^{m \times n}A,B=tr(ATB)\langle A, B \rangle = \text{tr}(A^TB)Frobenius norm
C([a,b])C([a,b])f,g=abf(t)g(t)dt\langle f, g \rangle = \int_a^b f(t)g(t)\,dtL2L^2 norm

Orthogonality and Gram-Schmidt

  • $\mathbf{u} \perp \mathbf{v}$ if $\langle \mathbf{u}, \mathbf{v} \rangle = 0$
  • Pythagorean theorem: $\mathbf{u} \perp \mathbf{v} \implies \|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$
  • Orthogonal sets of nonzero vectors (in particular, orthonormal sets) are linearly independent
  • Gram-Schmidt converts any linearly independent set into an orthonormal basis for the same span
  • The orthogonal complement $W^\perp = \{\mathbf{v} : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \;\forall\, \mathbf{w} \in W\}$
  • $V = W \oplus W^\perp$ and $\dim(W) + \dim(W^\perp) = \dim(V)$

Code cell 52

# === 9.1 Inner product properties and Cauchy-Schwarz ===

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])
w = np.array([0.0, 1.0, -1.0])
alpha, beta = 2.0, -0.5

# Verify inner product axioms for standard dot product
print("Inner product axioms (standard dot product on R^3):")
print(f"  Symmetry:  <u,v> = {np.dot(u,v)}, <v,u> = {np.dot(v,u)}, equal? {np.isclose(np.dot(u,v), np.dot(v,u))}")
print(f"  Linearity: <au+bw,v> = {np.dot(alpha*u+beta*w, v):.4f}")
print(f"             a<u,v>+b<w,v> = {alpha*np.dot(u,v)+beta*np.dot(w,v):.4f}")
print(f"             equal? {np.isclose(np.dot(alpha*u+beta*w, v), alpha*np.dot(u,v)+beta*np.dot(w,v))}")
print(f"  Pos. def.: <u,u> = {np.dot(u,u):.4f} >= 0")
print(f"             <0,0> = {np.dot(np.zeros(3), np.zeros(3)):.4f} = 0")

# Cauchy-Schwarz
lhs = abs(np.dot(u, v))
rhs = np.linalg.norm(u) * np.linalg.norm(v)
print(f"\nCauchy-Schwarz: |<u,v>| = {lhs:.4f} <= ||u|| ||v|| = {rhs:.4f}? {lhs <= rhs + 1e-12}")

# Frobenius inner product for matrices
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
fro_ip = np.trace(A.T @ B)
print(f"\nFrobenius inner product: <A,B> = tr(A^T B) = {fro_ip}")
print(f"Manual sum: {np.sum(A * B)}")
print(f"||A||_F = sqrt(<A,A>) = {np.sqrt(np.trace(A.T @ A)):.4f} = {np.linalg.norm(A, 'fro'):.4f}")

Code cell 53

# === 9.2 Gram-Schmidt orthogonalisation: step by step ===

vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0]
], dtype=float)

print("Input vectors (rows):")
for i, v in enumerate(vectors):
    print(f"  v{i+1} = {v}")

Q = gram_schmidt(vectors)

print("\nGram-Schmidt output (orthonormal):")
for i, q in enumerate(Q):
    print(f"  q{i+1} = {q}")

print("\nVerification:")
print(f"  Q Q^T (should be I):\n{Q @ Q.T}")
print(f"  Orthonormal? {np.allclose(Q @ Q.T, np.eye(len(Q)))}")

# Same span check: rank([v1..vk]) == rank([q1..qk])
V_mat = vectors.T
Q_mat = Q.T
print(f"\n  rank(original vectors) = {np.linalg.matrix_rank(V_mat)}")
print(f"  rank(orthonormal basis) = {np.linalg.matrix_rank(Q_mat)}")
print(f"  Same span? {np.linalg.matrix_rank(V_mat) == np.linalg.matrix_rank(Q_mat)}")

# Connection to QR decomposition
A_qr = vectors.T  # columns = original vectors
Q_qr, R_qr = np.linalg.qr(A_qr)
print(f"\n  QR decomposition: A = QR")
print(f"  Q (orthonormal columns):\n{Q_qr}")
print(f"  R (upper triangular):\n{R_qr}")
print(f"  A = QR? {np.allclose(A_qr, Q_qr @ R_qr)}")

Code cell 54

# === 9.3 Orthogonal complement and V = W ⊕ W⊥ ===

# W = null(A) for some A. Then W⊥ = row(A).
# This is exactly the fundamental subspace decomposition.

A = np.array([
    [1.0, 2.0, 1.0],
    [0.0, 1.0, 1.0]
])

W_basis = null_space(A)         # null(A) = W
W_perp_basis = row_space(A).T   # row(A) = W⊥

dim_W = W_basis.shape[1] if W_basis.size > 0 else 0
dim_Wp = W_perp_basis.shape[1] if W_perp_basis.size > 0 else 0

print(f"A ({A.shape[0]}x{A.shape[1]}), rank = {np.linalg.matrix_rank(A)}")
print(f"W = null(A): dim = {dim_W}")
print(f"W⊥ = row(A): dim = {dim_Wp}")
print(f"dim(W) + dim(W⊥) = {dim_W + dim_Wp} = n = {A.shape[1]}")

# Verify orthogonality
if dim_W > 0 and dim_Wp > 0:
    cross = W_perp_basis.T @ W_basis
    print(f"\nW⊥^T @ W =\n{cross}")
    print(f"All zero (orthogonal)? {np.allclose(cross, 0)}")

# Decompose a vector
v = np.array([3.0, 1.0, 2.0])
# Project onto W and W⊥
P_W = W_basis @ np.linalg.pinv(W_basis) if dim_W > 0 else np.zeros((3,3))
P_Wp = W_perp_basis @ np.linalg.pinv(W_perp_basis) if dim_Wp > 0 else np.zeros((3,3))
v_W = P_W @ v
v_Wp = P_Wp @ v

print(f"\nv = {v}")
print(f"W-component:  {v_W}")
print(f"W⊥-component: {v_Wp}")
print(f"Sum = v? {np.allclose(v_W + v_Wp, v)}")
print(f"Orthogonal? <v_W, v_Wp> = {np.dot(v_W, v_Wp):.10f}")

10. Affine Subspaces and Quotient Spaces

Affine subspaces

An affine subspace is a translation of a linear subspace: $W + \mathbf{v}_0 = \{\mathbf{w} + \mathbf{v}_0 : \mathbf{w} \in W\}$.

  • Does not pass through the origin (unless $\mathbf{v}_0 \in W$).
  • The solution set of $A\mathbf{x} = \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is an affine subspace: $\mathbf{x}_p + \text{null}(A)$.
  • Same dimension as the underlying $W$, but NOT a subspace of $V$.

Affine and convex combinations

  • Affine combination: $\sum_i \alpha_i \mathbf{v}_i$ where $\sum_i \alpha_i = 1$ (coefficients sum to 1)
  • Convex combination: an affine combination with $\alpha_i \geq 0$ (stays inside the convex hull)
  • AI: interpolation between embeddings $\alpha\mathbf{u} + (1-\alpha)\mathbf{v}$ is a convex combination

Quotient spaces (intuition)

$V/W$ "forgets" the $W$-directions: two vectors are equivalent if they differ by an element of $W$.

$[\mathbf{v}] = \mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}$

$\dim(V/W) = \dim(V) - \dim(W)$. In normalisation layers, mean subtraction projects onto a quotient: it identifies vectors that differ only along the all-ones direction (a coset sketch follows code cell 57 below).

Code cell 56

# === 10.1 Affine subspace: solution set of Ax = b ===

A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 2.0]
])
b = np.array([6.0, 4.0])

# Particular solution
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

# Null space
N = null_space(A)

print("System Ax = b:")
print(f"A =\n{A}")
print(f"b = {b}")
print(f"\nParticular solution x_p = {x_p}")
print(f"Verify: A*x_p = {A @ x_p}")

print(f"\nNull space of A (dim = {N.shape[1]}):")
print(f"  Basis:\n{N}")

# The full solution set is x_p + null(A)
print(f"\nSolution set = x_p + null(A):")
for t in [-2, -1, 0, 1, 2]:
    x = x_p + t * N[:, 0]
    print(f"  t={t:>2}: x = {x},  A*x = {A @ x}")

print(f"\nAll solutions differ by null-space vectors.")
print(f"Solution set is an affine subspace, NOT a linear subspace (0 not in it).")
print(f"A*0 = {A @ np.zeros(3)} != b = {b}")

Code cell 57

# === 10.2 Convex combinations and embedding interpolation ===

# Interpolation between two embedding vectors
e_cat = np.array([0.8, 0.3, -0.1, 0.5])
e_dog = np.array([0.7, 0.4, -0.2, 0.6])

print("Embedding interpolation (convex combination):")
print(f"  e_cat = {e_cat}")
print(f"  e_dog = {e_dog}")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    interp = alpha * e_cat + (1 - alpha) * e_dog
    cos_cat = cosine_similarity(interp, e_cat)
    cos_dog = cosine_similarity(interp, e_dog)
    print(f"  alpha={alpha:.2f}: {interp}  cos(cat)={cos_cat:.3f}  cos(dog)={cos_dog:.3f}")

# Quotient space intuition: mean subtraction
print(f"\n--- Quotient space: mean subtraction ---")
x = np.array([5.0, 3.0, 7.0, 1.0])
mean = np.mean(x)
x_centered = x - mean
print(f"x = {x}, mean = {mean}")
print(f"x - mean = {x_centered}, sum = {x_centered.sum():.10f}")
print("Mean subtraction projects onto the hyperplane sum(x_i)=0.")
print("This is the quotient V / span{(1,1,...,1)}: forget the 'all-ones' direction.")

11. Subspaces in Functional Analysis

Infinite-dimensional spaces

In finite dimensions, every subspace is automatically closed. In infinite dimensions, there exist subspaces that are not closed — dense proper subspaces whose limits escape the subspace (e.g. polynomials inside $L^2([a,b])$: every $L^2$ function on a bounded interval is a limit of polynomials, but not every $L^2$ function is a polynomial). A finite-grid sketch follows.
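This is a caricature of the density claim on a finite grid (an added sketch; the grid, the target $|t|$, and Chebyshev least-squares fitting are illustrative choices, not a proof):

# === Sketch (added): polynomials are dense, but the limit escapes ===
import numpy as np

t = np.linspace(-1.0, 1.0, 2001)
dt = t[1] - t[0]
f = np.abs(t)                                  # |t| is not a polynomial

for deg in [1, 2, 4, 8, 16, 32]:
    c = np.polynomial.chebyshev.chebfit(t, f, deg)     # least-squares polynomial fit
    p = np.polynomial.chebyshev.chebval(t, c)
    err = np.sqrt(np.sum((f - p) ** 2) * dt)           # discretised L^2 distance
    print(f"  degree {deg:>2}: || |t| - p ||_L2 ~ {err:.6f}")

print("The error keeps shrinking, yet the limit |t| lies outside the")
print("polynomial subspace: dense, but not closed.")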

Function spaces relevant to AI

| Space | Description | AI application |
| --- | --- | --- |
| $L^2(\mathbb{R}^d)$ | Square-integrable functions | Kernel methods, GP priors |
| Sobolev $W^{k,p}$ | Functions with $k$ weak derivatives in $L^p$ | PINNs, variational problems |
| RKHS $\mathcal{H}_k$ | Reproducing kernel Hilbert space | SVMs, kernel regression |

Krylov subspaces

$\mathcal{K}_k(A, \mathbf{b}) = \text{span}\{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^{k-1}\mathbf{b}\}$

A nested sequence of expanding subspaces. Iterative solvers (CG, GMRES, Lanczos) find the best solution within $\mathcal{K}_k$ and expand $k$ until convergence. Cost: $k$ matrix-vector products — far cheaper than direct methods for sparse systems.

Neural networks and function spaces

A neural network architecture defines a subset $\mathcal{F}_\theta \subset L^2$ — generally not a subspace (the sum of two networks is not, in general, representable by a single network of the same architecture). But the tangent space at $\theta$ is a subspace: the Neural Tangent Kernel (NTK) regime corresponds to learning in this tangent subspace, as the sketch below illustrates.
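A toy version of the tangent-space claim (an added sketch: one tanh unit with three parameters, derivatives by finite differences, not a real NTK computation):

# === Sketch (added): the tangent space of a tiny network is (affine) linear ===
import numpy as np

def f(x, theta):
    w, b, a = theta
    return a * np.tanh(w * x + b)

theta0 = np.array([1.0, 0.2, 0.7])
x = np.linspace(-2.0, 2.0, 9)

# Jacobian df(x)/dtheta at theta0 via central finite differences
eps = 1e-5
J = np.stack([(f(x, theta0 + eps * np.eye(3)[i]) -
               f(x, theta0 - eps * np.eye(3)[i])) / (2 * eps)
              for i in range(3)], axis=1)     # shape (len(x), 3)

d = np.array([0.03, -0.02, 0.01])             # small parameter step
lin  = f(x, theta0) + J @ d                   # tangent-space (linearised) model
full = f(x, theta0 + d)                       # actual network

print(f"Tangent space dimension <= {J.shape[1]} (one direction per parameter)")
print(f"max |linearised - exact| for a small step: {np.max(np.abs(lin - full)):.2e}")
print("Linearised (NTK-regime) training moves in span of the columns of J,")
print("a subspace of function space, even though the network itself is nonlinear.")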

Code cell 59

# === 11.1 Krylov subspaces: expanding nested subspaces ===

np.random.seed(42)
n = 6
A = np.random.randn(n, n)
A = A + A.T  # make symmetric for cleaner behaviour
b = np.random.randn(n)

print(f"Krylov subspaces K_k(A, b) for {n}x{n} symmetric A:")
print(f"  K_k = span{{b, Ab, A^2 b, ..., A^(k-1) b}}\n")

for k in range(1, n + 1):
    # Build Krylov basis: [b, Ab, A^2 b, ..., A^{k-1} b]
    K = np.zeros((n, k))
    K[:, 0] = b
    for j in range(1, k):
        K[:, j] = A @ K[:, j-1]
    dim_k = np.linalg.matrix_rank(K)
    print(f"  K_{k}: {k} vectors, effective dim = {dim_k}", end="")
    if dim_k < k:
        print(f"  (Krylov sequence stalled — rank stabilised)")
    else:
        print()

print(f"\n  The Krylov subspace grows until it captures the full relevant space.")
print(f"  CG/GMRES solve Ax=b by searching in K_k; converge when K_k is rich enough.")

Code cell 60

# === 11.2 RKHS intuition: kernel defines a subspace of functions ===

# A kernel k(x, x') defines an RKHS H_k.
# Given data {x_1, ..., x_n}, the representer theorem says the optimal
# predictor lies in span{k(x_1, .), k(x_2, .), ..., k(x_n, .)} — a
# finite-dimensional subspace of the infinite-dimensional H_k.

# RBF kernel: k(x, x') = exp(-||x-x'||^2 / (2*sigma^2))
def rbf_kernel(X, sigma=1.0):
    sq_dists = np.sum(X**2, axis=1, keepdims=True) - 2*X@X.T + np.sum(X**2, axis=1)
    return np.exp(-sq_dists / (2 * sigma**2))

np.random.seed(0)
X_data = np.sort(np.random.randn(8, 1), axis=0)
K = rbf_kernel(X_data, sigma=1.0)

print(f"Data points: {X_data.ravel()}")
print(f"\nKernel Gram matrix K (8x8):\n{K}")
print(f"rank(K) = {np.linalg.matrix_rank(K)}")
print(f"K is PSD? min eigenvalue = {np.min(np.linalg.eigvalsh(K)):.6f} >= 0")
print(f"\nThe optimal predictor lives in an {np.linalg.matrix_rank(K)}-dimensional")
print(f"subspace of the RKHS, spanned by the kernel functions at data points.")

12. Subspace Methods in Machine Learning

PCA as subspace finding

PCA finds the rank-$r$ subspace of $\mathbb{R}^d$ that captures the most variance (equivalently, the one minimising mean squared reconstruction error):

$W^* = \arg\min_{\dim(W)=r} \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \text{Proj}_W(\mathbf{x}_i)\|^2$

Solution: the top-$r$ eigenvectors of the covariance matrix $C = \frac{1}{n}\sum_i (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$.

LoRA as explicit subspace restriction

LoRA replaces $\Delta W \in \mathbb{R}^{m \times n}$ with $\Delta W = BA^T$ where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{n \times r}$. This restricts updates to a rank-$r$ subspace. The intrinsic dimension of fine-tuning is often far smaller than the full parameter count $p$ (Aghajanyan et al. 2020).

Mechanistic interpretability

  • Attention heads write to $\text{col}(W_O^h)$ and read from $\text{row}(W_Q^h)$, $\text{row}(W_K^h)$ — specific subspaces of $\mathbb{R}^d$.
  • The superposition hypothesis: features outnumber dimensions, so they live in non-orthogonal subspaces with interference.
  • LEACE (concept erasure): project out a subspace to remove a concept from representations.

Code cell 62

# === 12.1 PCA: finding the optimal subspace ===

np.random.seed(42)
n_samples = 200
d = 5

# Data lives mostly in a 2D subspace with noise in other directions
true_directions = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0]
], dtype=float)
true_directions = true_directions / np.linalg.norm(true_directions, axis=1, keepdims=True)

coeffs = np.random.randn(n_samples, 2) * np.array([3.0, 2.0])
noise = np.random.randn(n_samples, d) * 0.1
X = coeffs @ true_directions + noise

# Center the data
mu = X.mean(axis=0)
X_centered = X - mu

# Covariance matrix and its eigendecomposition
C = (X_centered.T @ X_centered) / n_samples
eigvals, eigvecs = np.linalg.eigh(C)
# Sort descending
idx = np.argsort(eigvals)[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]

print(f"Data: {n_samples} points in R^{d}")
print(f"Eigenvalues of covariance matrix: {eigvals}")

# Variance explained
total_var = eigvals.sum()
for r in range(1, d + 1):
    explained = eigvals[:r].sum() / total_var
    print(f"  Top-{r} PCs explain {explained:.4f} of total variance")

print(f"\nThe data lives approximately in a 2D subspace.")
print(f"PCA finds this subspace by maximising captured variance.")

Code cell 63

# === 12.2  Gradient subspace during training + LoRA rank analysis ===
#
# Empirical finding (Gur-Ari et al. 2018, Aghajanyan et al. 2020):
# Gradient updates across training steps lie in a subspace whose dimension
# is much smaller than the full parameter count p.
# LoRA exploits this by restricting delta_W to a rank-r subspace.

np.random.seed(0)
m, n = 64, 64
p = m * n
T = 50

# One FIXED 5-dim subspace shared by every gradient snapshot.
# (Drawing fresh directions inside the loop would give each step its own
# subspace and destroy the low-dimensional structure this cell demonstrates.)
true_dirs = np.random.randn(5, p)
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)

grads = []
for t in range(T):
    # combination of the shared directions + noise, kept small enough that
    # the rank-5 structure dominates the spectrum
    g = (np.random.randn(5) @ true_dirs) + 0.01 * np.random.randn(p)
    grads.append(g)

G = np.array(grads)          # T x p
Gram = G @ G.T / T           # T x T
eigvals_gram = np.sort(np.linalg.eigvalsh(Gram))[::-1]
total_energy = eigvals_gram.sum()
cumulative = np.cumsum(eigvals_gram) / total_energy

print(f'Full parameter count p = {p:,}')
print(f'Gradient snapshots T = {T}')
print(f'\nTop-10 eigenvalues of gradient Gram matrix:')
for i in range(10):
    print(f'  lambda_{i+1:2d} = {eigvals_gram[i]:8.4f}  cumulative variance = {cumulative[i]:.4f}')

dim_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f'\n90% of gradient energy in top-{dim_90} directions (full dim p = {p:,})')
print(f'Compression ratio: {p / dim_90:.1f}x')

print(f'\n--- LoRA rank analysis ---')
print(f'{"r":>4}  {"params":>10}  {"fraction":>10}  {"covers 90% gradient subspace?":>32}')
for r in [1, 2, 4, 5, 8, 16, 32]:
    params_lora = r * (m + n)
    frac = params_lora / p
    covers = 'YES' if r >= dim_90 else 'no'
    print(f'  r={r:>2}        {params_lora:>10,}  {frac:>10.4f}  {covers}')

print(f'\nKey: the intrinsic gradient dimension (approx {dim_90}) sets the minimum')
print(f'LoRA rank needed to capture 90% of the gradient update information.')
print(f'Using r < {dim_90} loses information; using r > {dim_90} wastes capacity.')

Code cell 64

# === 12.3  Concept directions in embedding space ===
#
# The linear representation hypothesis: concepts are encoded as directions
# (1-D subspaces) in the residual stream.
# Classic: gender direction in word2vec;  king - man + woman = queen.

np.random.seed(7)
d = 16

# Fixed semantic directions (random unit vectors in R^d)
gender_dir  = np.random.randn(d)
gender_dir /= np.linalg.norm(gender_dir)
royalty_dir = np.random.randn(d)
royalty_dir -= royalty_dir @ gender_dir * gender_dir   # orthogonalise
royalty_dir /= np.linalg.norm(royalty_dir)

base = np.random.randn(d) * 0.5
noise = lambda: np.random.randn(d) * 0.05

king   = base + 2.0 * gender_dir  + 2.0 * royalty_dir + noise()
queen  = base - 2.0 * gender_dir  + 2.0 * royalty_dir + noise()
man    = base + 2.0 * gender_dir                       + noise()
woman  = base - 2.0 * gender_dir                       + noise()
prince = base + 2.0 * gender_dir  + 1.0 * royalty_dir + noise()

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print('=== Embedding arithmetic: king - man + woman approx queen ===')
analogy = king - man + woman
print(f'  cos(king-man+woman, queen) = {cosine(analogy, queen):+.4f}  (should be near +1)')
print(f'  cos(king-man+woman, king)  = {cosine(analogy, king):+.4f}')
print(f'  cos(king-man+woman, man)   = {cosine(analogy, man):+.4f}')

print('\n=== Recovering the gender direction via difference of means ===')
male_words   = np.array([king, man, prince])
female_words = np.array([queen, woman])
gender_est = male_words.mean(0) - female_words.mean(0)
gender_est /= np.linalg.norm(gender_est)
print(f'  cos(estimated, true gender direction) = {cosine(gender_est, gender_dir):.4f}')

print('\n=== Concept erasure: project out the gender direction ===')
def project_out(v, direction):
    return v - np.dot(v, direction) * direction

king_d  = project_out(king,  gender_est)
queen_d = project_out(queen, gender_est)
man_d   = project_out(man,   gender_est)
woman_d = project_out(woman, gender_est)

print('  After projecting out gender direction:')
print(f'    cos(king_debiased,  queen_debiased) = {cosine(king_d, queen_d):+.4f}')
print(f'    cos(king_debiased,  man_debiased)   = {cosine(king_d, man_d):+.4f}')
print(f'    cos(man_debiased,   woman_debiased) = {cosine(man_d, woman_d):+.4f}')
print('  Gender contrast (man vs woman) is erased; royalty contrast remains.')

print('\n=== Superposition: more concepts than dimensions ===')
np.random.seed(3)
n_features, d_emb = 40, 16
feat = np.random.randn(n_features, d_emb)
feat /= np.linalg.norm(feat, axis=1, keepdims=True)
G_feat = feat @ feat.T
np.fill_diagonal(G_feat, 0)
avg_int = np.abs(G_feat).sum() / (n_features * (n_features - 1))
print(f'  {n_features} features in d={d_emb} space')
print(f'  Average |cos(fi, fj)| = {avg_int:.4f}  (0 = orthogonal, impossible here)')
print(f'  Since {n_features} > {d_emb}, features MUST be non-orthogonal (superposed).')
print(f'  Interference is unavoidable -- this is the superposition hypothesis.')

Code cell 65

# === 12.4  Mechanistic interpretability: OV and QK subspace circuits ===
#
# Attention head h:
#   W_Q, W_K in R^{d x d_k}  -- query/key projections
#   W_V in R^{d x d_v}, W_O in R^{d_v x d}  -- value/output projections
#
# QK circuit: score(i,j) = x_i^T W_Q W_K^T x_j
#   Rank of score matrix <= d_k regardless of sequence length.
# OV circuit: output = softmax(scores) V W_O
#   Head writes to col(W_O^T) -- a d_v-dim subspace of R^d.

np.random.seed(11)
d, d_k, d_v, seq = 32, 8, 8, 6

W_Q = np.random.randn(d, d_k) * 0.1
W_K = np.random.randn(d, d_k) * 0.1
W_V = np.random.randn(d, d_v) * 0.1
W_O = np.random.randn(d_v, d) * 0.1
X   = np.random.randn(seq, d)

QK_mat  = W_Q @ W_K.T                          # d x d
rank_QK = np.linalg.matrix_rank(QK_mat)
scores  = X @ QK_mat @ X.T                      # seq x seq

print(f'=== QK Circuit ===')
print(f'  W_Q W_K^T: shape {QK_mat.shape}, rank = {rank_QK} (bound: d_k = {d_k})')
print(f'  Score matrix S = X (W_Q W_K^T) X^T: shape {scores.shape}, rank = {np.linalg.matrix_rank(scores)}')
print(f'  Key: {d}-dim residual stream compressed to {d_k} dims for attention comparison.')

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

attn_weights = softmax(scores / np.sqrt(d_k))
V_proj       = X @ W_V                          # seq x d_v
head_out     = attn_weights @ V_proj @ W_O       # seq x d

OV_mat  = W_O.T @ W_V.T                         # d x d
rank_OV = np.linalg.matrix_rank(OV_mat)

print(f'\n=== OV Circuit ===')
print(f'  W_O^T W_V^T: shape {OV_mat.shape}, rank = {rank_OV} (bound: d_v = {d_v})')
write_dim = np.linalg.matrix_rank(W_O.T)
read_dim  = np.linalg.matrix_rank(W_V)
print(f'  Write subspace col(W_O^T): dim = {write_dim} of {d}')
print(f'  Read  subspace col(W_V):   dim = {read_dim}  of {d}')
print(f'  Bottleneck: {d} -> {d_k} (QK) -> score -> {d_v} (OV) -> {d} (write)')

print(f'\n=== Principal angles between write subspaces of two independent heads ===')
np.random.seed(22)
W_O2 = np.random.randn(d_v, d) * 0.1
U1 = np.linalg.svd(W_O.T,  full_matrices=False)[0]
U2 = np.linalg.svd(W_O2.T, full_matrices=False)[0]
cos_angles = np.linalg.svd(U1.T @ U2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))
overlap = np.mean(cos_angles**2)
print(f'  cos(theta_i): {cos_angles}')
print(f'  angles (deg): {angles_deg}')
print(f'  Mean cos^2 = {overlap:.4f}  (0=orthogonal/independent, 1=identical heads)')
print(f'  Orthogonal write subspaces => heads do not interfere => can prune either safely.')

Code cell 66

# === 12.5  Subspace fine-tuning: LoRA rank vs approximation quality ===
#
# If the true weight update DW_true has intrinsic rank r_true,
# the best rank-r LoRA approximation captures all signal when r >= r_true.
# Eckart-Young: best rank-r approximation uses truncated SVD.

np.random.seed(5)
m, n = 256, 256
true_rank = 8
lora_ranks = [1, 2, 4, 8, 16, 32, 64]

# Construct a true rank-true_rank update + small noise
U_t = np.random.randn(m, true_rank)
V_t = np.random.randn(n, true_rank)
DW_true = U_t @ V_t.T + 0.01 * np.random.randn(m, n)

fro_true = np.linalg.norm(DW_true, 'fro')
U_s, s_s, Vt_s = np.linalg.svd(DW_true, full_matrices=False)

print(f'Weight matrix: R^{{{m} x {n}}},  full params = {m*n:,}')
print(f'True update rank = {true_rank}')
print(f'Singular values of true DW (top 12): {s_s[:12]}')

print(f'\n{"LoRA r":>8}  {"Params":>10}  {"Fraction":>10}  {"Relative error":>16}')
for r in lora_ranks:
    DW_r = (U_s[:, :r] * s_s[:r]) @ Vt_s[:r, :]
    err  = np.linalg.norm(DW_true - DW_r, 'fro') / fro_true
    pars = r * (m + n)
    print(f'  r={r:>3}      {pars:>10,}  {pars/(m*n):>10.4f}  {err:>16.6f}')

print(f'\nAt r = {true_rank} = true rank, error drops to noise level.')
print(f'r < {true_rank}: information lost.  r > {true_rank}: capacity wasted.')

print(f'\n=== Fine-tuning method comparison (all are subspace constraints) ===')
rows = [
    ('LoRA',    'DW = B A^T, B in R^{mxr}, A in R^{nxr}',   'r(m+n)'),
    ('AdaLoRA', 'DW = P Lambda Q^T, prune small sing vals',  'varies by layer'),
    ('IA3',     'multiply activations by learned scalar',     '~1 per layer'),
    ('BitFit',  'fine-tune bias vectors only',               'sum of biases'),
    ('DoRA',    'DW = magnitude * direction  (rank-r dir)',   'r(m+n)+m'),
]
print(f'{"Method":>10}  {"Constraint":>50}  {"Params":>20}')
for name, desc, pars in rows:
    print(f'  {name:>8}  {desc:>50}  {pars:>20}')
print(f'\nAll methods restrict DW to a structured low-dimensional subspace of R^{{mxn}}.')
print(f'The choice of subspace structure (rank, sparsity, magnitude/direction) is the design.')

13. Invariant Subspaces and Spectral Theory

A subspace $W \subseteq V$ is invariant under a linear map $T : V \to V$ if $T(W) \subseteq W$. Invariant subspaces are the key to understanding what a linear map does: every invariant subspace is a channel along which $T$ acts independently.

Eigenspaces are the simplest invariant subspaces. If $T\mathbf{v} = \lambda \mathbf{v}$, then $\operatorname{span}\{\mathbf{v}\}$ is a 1-D invariant subspace.

The Spectral Theorem says every symmetric matrix $A = A^\top$ admits an orthogonal decomposition:

$\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$

and $A = Q \Lambda Q^\top$ where $Q$ is orthogonal and $\Lambda$ is diagonal.

SVD extends this to non-square matrices: $A = U \Sigma V^\top$ pairs orthonormal directions in the domain with orthonormal directions in the codomain ($A\mathbf{v}_i = \sigma_i \mathbf{u}_i$), giving a complete subspace decomposition of both spaces.

For AI: covariance matrices, Hessians, and Gram matrices are symmetric — the Spectral Theorem always applies. PCA is exactly the Spectral Theorem applied to the covariance matrix. LoRA targets the top invariant directions of the gradient covariance.

Code cell 68

# === 13.1  Invariant subspaces: eigenspaces and the Spectral Theorem ===

np.random.seed(17)

# --- Example 1: rotation has no real 1-D invariant subspace ---
theta = np.pi / 4
R_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
print('=== 2x2 rotation matrix (theta = pi/4) ===')
print(f'  R =\n{R_rot}')
eigvals_rot = np.linalg.eigvals(R_rot)
print(f'  Eigenvalues (complex): {eigvals_rot}')
print(f'  No real eigenvalues => no real 1-D invariant subspace.')
print(f'  The entire plane rotates; no direction is preserved.\n')

# --- Example 2: symmetric matrix => orthogonal invariant eigenspaces ---
A_sym = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
eigvals_s, eigvecs_s = np.linalg.eigh(A_sym)   # guaranteed real for symmetric

print('=== Symmetric 3x3 matrix: spectral decomposition ===')
print(f'  Eigenvalues: {eigvals_s}')
print(f'  Eigenvectors Q:\n{eigvecs_s}\n')

for i in range(3):
    v  = eigvecs_s[:, i]
    Av = A_sym @ v
    ok = np.allclose(Av, eigvals_s[i] * v)
    print(f'  E(lambda={eigvals_s[i]:.4f}): A v = lambda v?  {ok}  (1-D invariant subspace)')

print(f'\n  Q^T Q (should be I):\n{eigvecs_s.T @ eigvecs_s}\n')

# Spectral decomposition A = Q Lambda Q^T
Lam = np.diag(eigvals_s)
A_recon = eigvecs_s @ Lam @ eigvecs_s.T
print(f'  A = Q Lambda Q^T  reconstruction error: {np.max(np.abs(A_sym - A_recon)):.2e}')

# As sum of rank-1 projections
A_sum = sum(eigvals_s[i] * np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3))
print(f'  A = sum lambda_i q_i q_i^T  error: {np.max(np.abs(A_sym - A_sum)):.2e}')

# === 13.2  R^n = E(lambda1) + E(lambda2) + ... ===
print(f'\n=== Spectral Theorem: R^3 = E(l1) oplus E(l2) oplus E(l3) ===')
projs = [np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3)]
print(f'  P1+P2+P3 = I?  error: {np.max(np.abs(sum(projs) - np.eye(3))):.2e}')

x = np.array([1.0, 2.0, -1.0])
Ax_spec = sum(eigvals_s[i] * projs[i] @ x for i in range(3))
print(f'  Ax via eigenspace decomposition: {Ax_spec}')
print(f'  A @ x directly:                  {A_sym @ x}')
print(f'  Match: {np.allclose(Ax_spec, A_sym @ x)}')
print(f'\n  A acts on each eigenspace independently -- scaling by lambda_i.')

Code cell 69

# === 13.3  SVD: complete subspace decomposition of any linear map ===
#
# A = U Sigma V^T:
#   Domain:   R^n = row(A) [v1..vr] (+) null(A) [v_{r+1}..v_n]
#   Codomain: R^m = col(A) [u1..ur] (+) null(A^T) [u_{r+1}..u_m]
#   A maps v_i -> sigma_i u_i (bijection between row(A) and col(A))
#   A maps null(A) -> 0

A_svd = np.array([
    [3.0, 1.0, 0.0, 2.0],
    [1.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 1.0]
])  # 3 x 4

U, s, Vt = np.linalg.svd(A_svd, full_matrices=True)
V = Vt.T
r = np.linalg.matrix_rank(A_svd)
m_s, n_s = A_svd.shape

print(f'=== SVD of A in R^{{{m_s} x {n_s}}} ===')
print(f'  rank = {r},  singular values = {s}')

print(f'\n=== Domain R^{n_s}: row(A) oplus null(A) ===')
print(f'  row(A)  = span{{v1,...,v{r}}}    dim = {r}')
print(f'  null(A) = span{{v{r+1},...,v{n_s}}} dim = {n_s - r}')

print(f'\n=== Codomain R^{m_s}: col(A) oplus null(A^T) ===')
print(f'  col(A)    = span{{u1,...,u{r}}}    dim = {r}')
print(f'  null(A^T) = span{{u{r+1},...,u{m_s}}} dim = {m_s - r}')

print(f'\n=== A is a bijection: row(A) -> col(A) ===')
for i in range(r):
    vi   = V[:, i]
    Avi  = A_svd @ vi
    s_ui = s[i] * U[:, i]
    print(f'  A v{i+1} = sigma{i+1} u{i+1}?  {np.allclose(Avi, s_ui)}   sigma{i+1} = {s[i]:.4f}')

print(f'\n=== A maps null(A) to 0 ===')
for i in range(r, n_s):
    print(f'  ||A v{i+1}|| = {np.linalg.norm(A_svd @ V[:, i]):.2e}')

print(f'\n=== Low-rank approximation (Eckart-Young) ===')
print(f'{"rank k":>8}  {"Frobenius error":>16}  {"Variance captured":>20}')
for k in range(r + 1):
    if k == 0:
        err = np.linalg.norm(A_svd, 'fro')
        var = 0.0
    else:
        Ak  = sum(s[i] * np.outer(U[:, i], V[:, i]) for i in range(k))
        err = np.linalg.norm(A_svd - Ak, 'fro')
        var = 1.0 - err**2 / np.sum(s**2)
    print(f'  k={k}       {err:>16.4f}  {var:>20.4f}')

Code cell 70

# === 13.4  Schur decomposition and nested invariant subspaces ===
#
# Every square A admits  A = Q T Q^*  where Q orthogonal, T upper triangular.
# First k columns of Q span a k-dim invariant subspace of A.
# For symmetric A, T is diagonal (= spectral theorem).

from scipy.linalg import schur

print('=== Schur decomposition of a non-symmetric matrix ===')
A_sch = np.array([
    [4.0, 3.0, 2.0],
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 3.0]
])

T_s, Q_s = schur(A_sch)
print(f'A =\n{A_sch}')
print(f'\nSchur T (upper triangular):\n{T_s}')
print(f'\nSchur Q (orthonormal columns):\n{Q_s}')
print(f'\nReconstruction A = Q T Q^T error: {np.max(np.abs(A_sch - Q_s @ T_s @ Q_s.T)):.2e}')

eigvals_d = np.sort(np.linalg.eigvals(A_sch).real)
eigvals_t = np.sort(np.diag(T_s).real)
print(f'\nEigenvalues from eigvals: {eigvals_d}')
print(f'Eigenvalues from diag(T): {eigvals_t}')
print(f'Match: {np.allclose(eigvals_d, eigvals_t)}')

print(f'\n=== Nested invariant subspaces from Schur vectors ===')
for k in range(1, 4):
    Qk   = Q_s[:, :k]
    AQk  = A_sch @ Qk
    proj = Qk @ Qk.T @ AQk
    is_inv = np.allclose(AQk, proj)
    print(f'  span{{q1,...,q{k}}} invariant under A?  {is_inv}')

print(f'\n=== Symmetric case: Schur = spectral (T diagonal) ===')
A_sym3 = np.array([[5.0, 2.0], [2.0, 3.0]])
T_sym, Q_sym = schur(A_sym3)
print(f'  T for symmetric A:\n{T_sym}')
print(f'  T is diagonal: {np.allclose(T_sym, np.diag(np.diag(T_sym)))}')

print(f'\n=== Connection to AI ===')
print(f'  Hessian H of training loss is symmetric => spectral theorem applies.')
print(f'  Top-k Schur/eigenspace of H = high-curvature subspace for 2nd-order optimisers.')
print(f'  K-FAC / Shampoo / SOAP approximate H^{{-1}} g using structured invariant subspaces.')
print(f'  Natural gradient = gradient in metric defined by the Fisher (symmetric PSD).')

14. Common Mistakes

| Mistake | Why it's wrong | Fix |
| --- | --- | --- |
| "Union of two subspaces is a subspace" | $W_1 \cup W_2$ fails closure under addition: $(1,0)+(0,1)=(1,1)$ is on neither coordinate axis. | Use the sum $W_1 + W_2$, which IS a subspace. |
| "An affine subspace is a subspace" | $\{x : Ax=b\}$ with $b \neq 0$ does not contain $\mathbf{0}$. | An affine subspace is a coset of a linear subspace — parallel but offset. A linear subspace must contain $\mathbf{0}$. |
| "Closure under addition alone makes a subspace" | $\mathbb{Z}^n$ is closed under addition but not scalar multiplication: $\pi \cdot (1,0)=(\pi,0) \notin \mathbb{Z}^n$. | Check all three conditions: $\mathbf{0} \in W$, closure under $+$, closure under scalar multiplication. |
| "Span unchanged $\Rightarrow$ the new vector is zero" | $\mathbf{v}_{k+1}$ could be any linear combination of the others — dependent, not zero. | Span equality means $\mathbf{v}_{k+1}$ is redundant (linearly dependent), not necessarily $\mathbf{0}$. |
| "$\dim(W_1 \cap W_2) = \dim(W_1)+\dim(W_2)-\dim(V)$" | The intersection dimension depends on how the subspaces are oriented, not just on their sizes. | The correct identity is for the sum: $\dim(W_1+W_2) = \dim(W_1)+\dim(W_2)-\dim(W_1 \cap W_2)$. |
| "The complement of a subspace is a subspace" | $V \setminus W$ does not contain $\mathbf{0}$ and is not closed under addition. | The orthogonal complement $W^\perp$ IS a subspace. Always say $W^\perp$, not "complement". |
| "All bases have the same vectors" | Infinitely many valid bases exist for $\mathbb{R}^n$. | All bases have the same count ($= \dim(V)$); the actual vectors vary freely. |
| "$\text{null}(AB) = \text{null}(B)$" | $\text{null}(B) \subseteq \text{null}(AB)$ always; equality can fail if $A$ is not injective on $\text{col}(B)$. | $\text{null}(AB) \supseteq \text{null}(B)$; equality iff $\text{null}(A) \cap \text{col}(B) = \{\mathbf{0}\}$. |
| "Same dimension $\Rightarrow$ same subspace" | The $x$-axis and $y$-axis in $\mathbb{R}^2$ both have $\dim = 1$ but are different subspaces. | Equal dimension means isomorphic; actual equality requires containment in both directions. |
| "Gram-Schmidt works on any set of vectors" | If a vector is already in the span of the previous ones, $\mathbf{u}_j = \mathbf{0}$ and cannot be normalised. | First extract a linearly independent subset; discard any vector that produces $\mathbf{u}_j \approx \mathbf{0}$. |

16. Why This Matters for AI (2026 Perspective)

| Aspect | Impact |
| --- | --- |
| Residual stream as shared vector space | The residual stream $\mathbf{x} \in \mathbb{R}^d$ is the central shared vector space of a transformer. Every attention head and MLP layer adds vectors to this space. Understanding subspace decompositions of $\mathbb{R}^d$ IS understanding how transformers route and store information. |
| LoRA and gradient subspaces | LoRA's premise: weight updates $\Delta W$ lie in a low-dimensional subspace. Aghajanyan et al. (2020) showed the fine-tuning gradient subspace has dimension $\ll p$. Choosing LoRA rank $r$ = choosing the subspace dimension to search in. |
| Superposition and polysemanticity | LLMs represent more features than dimensions by storing them as non-orthogonal directions. Features share subspace dimensions, causing polysemanticity. This is the superposition hypothesis (Elhage et al. 2022). |
| MLA and KV subspace compression | DeepSeek-V3's Multi-head Latent Attention compresses KV projections to a rank-$r$ subspace ($r \ll d$), enabling a 5.75$\times$ memory reduction. Subspace dimensionality IS the design variable. |
| Mechanistic interpretability | Every circuit in mechanistic interpretability is a composition of subspace operations: heads read from and write to specific subspaces of $\mathbb{R}^d$ via their $W_Q, W_K, W_V, W_O$ matrices. |
| Representation collapse prevention | VICReg, Barlow Twins, BYOL prevent collapse by ensuring embeddings span a high-dimensional subspace. Collapse = subspace dimension $\to 0$; these losses maximise it (see the sketch after this table). |
| Generalisation and implicit bias | Gradient descent on overparameterised models converges to minimum-norm solutions supported in a low-dimensional subspace of parameter space. Generalisation is explained by subspace geometry. |
| Orthogonality and head diversity | Attention heads writing to orthogonal subspaces do not interfere. Head redundancy = subspace overlap. Orthogonality = independence = safe pruning. |
| Second-order optimisation | The Hessian concentrates spectral mass in a low-dimensional high-curvature subspace. K-FAC, Shampoo, and SOAP exploit this. Natural gradient aligns updates with the curved subspace. |
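A toy numeric for the representation-collapse row (an added sketch; "effective rank" here is the entropy-based measure exp(H) of the normalised singular-value spectrum, one common choice rather than the metric any particular method prescribes):

# === Sketch (added): detecting representation collapse via effective rank ===
import numpy as np

def effective_rank(Z):
    s = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-np.sum(p * np.log(p))))   # exp of spectral entropy

rng = np.random.default_rng(0)
n, d = 512, 64
Z_healthy = rng.standard_normal((n, d))                     # spans ~d directions
u = rng.standard_normal(d)
Z_collapsed = rng.standard_normal((n, 1)) @ u[None, :]      # all rows along one direction
Z_collapsed += 0.01 * rng.standard_normal((n, d))           # tiny jitter

print(f"effective rank, healthy embeddings:   {effective_rank(Z_healthy):6.1f} / {d}")
print(f"effective rank, collapsed embeddings: {effective_rank(Z_collapsed):6.1f} / {d}")
print("Anti-collapse losses push the spanned subspace dimension back up.")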

Conceptual Bridge

Vector spaces and subspaces are the geometry underlying all of linear algebra. Every concept — span, independence, basis, dimension, rank, orthogonality, projections, eigenvalues — is ultimately a statement about the structure of subspaces.

The abstract framework (eight axioms) is what makes the theory universal: the same theorems that govern arrows in a plane also govern polynomials, matrices, functions, and probability distributions. When you recognise a new mathematical object as a vector space, you immediately inherit centuries of theorems for free.

For AI in 2026, subspaces are the architectural primitives of transformers, the design variables of efficient fine-tuning, the diagnostic language of mechanistic interpretability, and the theoretical foundation of generalisation.

Next: Eigenvalues, Eigenvectors, and Matrix Decompositions — where invariant subspaces of a linear map are revealed through its spectrum. An eigenvector spans a 1-D invariant subspace; the spectral theorem decomposes any symmetric matrix into orthogonal invariant eigenspaces; the SVD extends this to the complete subspace structure of any rectangular matrix.