Converted from theory.ipynb for web reading.

Vector Spaces and Subspaces — From Axioms to Interpretability

A vector space is not a collection of arrows. It is a collection of objects that behave like arrows — and that behaviour is the entire point. Once you verify the axioms, every theorem ever proved about vector spaces applies for free.

This notebook is the interactive companion to notes.md. It is arranged as a learning path rather than a section-by-section copy.

| Block | Topic | What you will build |
|---|---|---|
| 1 | Intuition | why vector spaces are the language of deep learning |
| 2 | Formal Definitions | axiom verification for standard and exotic spaces |
| 3 | Subspaces | the three-condition test on concrete examples |
| 4 | Span & Linear Combinations | geometric pictures and matrix formulation |
| 5 | Linear Independence | dependence detection and redundancy |
| 6 | Basis & Dimension | coordinate systems, change of basis, dimension counting |
| 7 | Four Fundamental Subspaces | column space, null space, row space, left null space |
| 8 | Subspace Operations | sum, intersection, direct sum, projection |
| 9 | Inner Product Spaces & Orthogonality | Gram-Schmidt, orthogonal complements |
| 10 | Affine Subspaces & Quotient Spaces | translated subspaces, cosets |
| 11 | Subspaces in Functional Analysis | infinite dimensions, RKHS, Krylov |
| 12 | Subspace Methods in ML | PCA, LoRA, representation subspaces, mechanistic interpretability |
| 13 | Invariant Subspaces & Spectral Theory | eigenspaces, SVD as subspace decomposition |

The notebook uses numpy and scipy, with optional matplotlib/seaborn for plots. It is intentionally code-first: the goal is to make the abstract algebra feel operational.

Primary references: Axler (Linear Algebra Done Right), Strang (Linear Algebra and Its Applications), Halmos (Finite-Dimensional Vector Spaces), Hu et al. (2021, LoRA), Elhage et al. (2022, superposition), Aghajanyan et al. (2020, intrinsic dimensionality).

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
import scipy.linalg as sla
from scipy import stats

COLORS = {
    "primary": "#0077BB",
    "secondary": "#EE7733",
    "tertiary": "#009988",
    "error": "#CC3311",
    "neutral": "#555555",
    "highlight": "#EE3377",
}
HAS_MPL = True
np.set_printoptions(precision=8, suppress=True)
np.random.seed(42)

def header(title):
    print("\n" + "=" * len(title))
    print(title)
    print("=" * len(title))

def check_true(name, cond):
    ok = bool(cond)
    print(f"{'PASS' if ok else 'FAIL'} - {name}")
    return ok

def check_close(name, got, expected, tol=1e-8):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'} - {name}: got {got}, expected {expected}")
    return ok

def check(name, got, expected, tol=1e-8):
    return check_close(name, got, expected, tol=tol)

def softmax(z, axis=-1, tau=1.0):
    z = np.asarray(z, dtype=float) / float(tau)
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float); b = np.asarray(b, dtype=float)
    return float(a @ b / (la.norm(a) * la.norm(b) + 1e-12))

def numerical_rank(A, tol=1e-10):
    return int(np.sum(la.svd(A, compute_uv=False) > tol))

def orthonormal_basis(A, tol=1e-10):
    Q, R = la.qr(A)
    keep = np.abs(np.diag(R)) > tol
    return Q[:, keep]

# Compatibility helpers used by the Chapter 02 theory and exercise cells.
def null_space(A, tol=1e-10):
    """Orthonormal basis (columns) for null(A), computed via the SVD."""
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T

svd_null_space = null_space

def gram_schmidt(vectors, tol=1e-10):
    A = np.asarray(vectors, dtype=float)
    if A.ndim == 1:
        A = A.reshape(1, -1)
    basis = []
    for v in A:
        w = v.astype(float).copy()
        for q in basis:
            w = w - np.dot(w, q) * q
        norm = la.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)

def projection_matrix_from_columns(A, tol=1e-10):
    Q = orthonormal_basis(np.asarray(A, dtype=float), tol=tol)
    return Q @ Q.T


def random_unit_vectors(n, d):
    X = np.random.randn(n, d)
    return X / np.maximum(la.norm(X, axis=1, keepdims=True), 1e-12)

def pairwise_distances(X):
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return la.norm(diff, axis=-1)


def normalize(x, axis=None, tol=1e-12):
    x = np.asarray(x, dtype=float)
    norm = la.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, tol)

def frobenius_inner(A, B):
    return float(np.sum(np.asarray(A, dtype=float) * np.asarray(B, dtype=float)))

def outer_sum_product(A, B):
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

def softmax_rows(X):
    return softmax(X, axis=1)

def col_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float), tol=tol)

def row_space(A, tol=1e-10):
    return orthonormal_basis(np.asarray(A, dtype=float).T, tol=tol).T

def rref(A, tol=1e-10):
    R = np.array(A, dtype=float, copy=True)
    m, n = R.shape
    pivots = []
    row = 0
    for col in range(n):
        pivot = row + int(np.argmax(np.abs(R[row:, col]))) if row < m else row
        if row >= m or abs(R[pivot, col]) <= tol:
            continue
        if pivot != row:
            R[[row, pivot]] = R[[pivot, row]]
        R[row] = R[row] / R[row, col]
        for r in range(m):
            if r != row:
                R[r] = R[r] - R[r, col] * R[row]
        pivots.append(col)
        row += 1
        if row == m:
            break
    R[np.abs(R) < tol] = 0.0
    return R, pivots

def nullspace_basis(A, tol=1e-10):
    A = np.asarray(A, dtype=float)
    U, S, Vt = la.svd(A, full_matrices=True)
    rank = int(np.sum(S > tol))
    return Vt[rank:].T, rank

print("Chapter helper setup complete.")

1. Intuition — What Is a Vector Space and Why It Matters

A vector space is a collection of objects — called vectors — together with two operations:

  • Addition: combine two vectors to get another vector in the same collection.
  • Scalar multiplication: stretch any vector by a real number and stay in the collection.

These two operations must satisfy eight axioms (Section 2). Once verified, every theorem ever proved about vector spaces applies automatically — whether the "vectors" are geometric arrows, polynomials, matrices, functions, probability distributions, or neural network weights.

Why this abstraction matters for AI

| AI concept | Underlying vector space | Why it matters |
|---|---|---|
| Token embeddings | $\mathbb{R}^d$ | king − man + woman ≈ queen relies on vector arithmetic |
| Parameter space | $\mathbb{R}^p$ | Gradient descent navigates this space; the loss landscape lives over it |
| Residual stream | $\mathbb{R}^d$ at each layer | x ← x + Attn(x) + MLP(x) is vector addition |
| Gradient space | $\mathbb{R}^p$ | Gradient accumulation, clipping, momentum — all vector operations |
| Function spaces | $L^2$, RKHS | Neural networks approximate functions; approximation theory uses subspaces |

The hierarchy of structure

Set (no structure)
    ↓ + addition and scalar multiplication + axioms
Vector Space (linear structure: add, scale)
    ↓ + norm ‖v‖
Normed Space (distances: can measure length)
    ↓ + inner product ⟨u,v⟩
Inner Product Space (angles: can measure cosine similarity)
    ↓ + completeness
Hilbert Space (complete inner product space)

Each level adds structure and inherits all theorems from above. $\mathbb{R}^n$ satisfies all levels simultaneously. AI embedding spaces are finite-dimensional, so all levels are equivalent there. Function spaces require infinite dimensions, where the distinctions become essential.
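A minimal numpy sketch of this hierarchy (an illustration added alongside the notebook's own cells, not one of them): the same two vectors, with one new measurement unlocked at each level.

import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, -1.0, 0.0])

# Vector-space level: only + and scalar * are available.
print("u + 2v =", u + 2 * v)

# Normed-space level: a norm adds lengths (here the l2 norm).
print("||u|| =", np.linalg.norm(u))

# Inner-product level: <u, v> adds angles and cosine similarity.
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print("cos(u, v) =", cos)

# Completeness is automatic in finite dimensions, so R^n is already
# a Hilbert space -- the top of the hierarchy.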

Code cell 5

# === 1.1 The same algebra works across wildly different "vector" types ===

# Vectors in R^3
u = np.array([1.0, 2.0, -1.0])
v = np.array([3.0, 0.0, 1.0])
print("R^3 vectors:")
print("  u + v =", u + v)
print("  2.5 * u =", 2.5 * u)

# Matrices as vectors in R^{2x3}
A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = np.array([[0.0, -1.0, 2.0], [1.0, 0.0, -1.0]])
print("\nR^{2x3} matrices:")
print("  A + B =\n", A + B)
print("  -3 * A =\n", -3 * A)

# Polynomials (represented by coefficient vectors)
# p(t) = 1 + 2t - t^2,  q(t) = 3 - t + 4t^2
p = np.array([1.0, 2.0, -1.0])   # coefficients of 1, t, t^2
q = np.array([3.0, -1.0, 4.0])
print("\nP_2 polynomials (coefficients):")
print("  p + q =", p + q, " -> (p+q)(t) = 4 + t + 3t^2")
print("  2p   =", 2 * p, "   -> 2p(t) = 2 + 4t - 2t^2")

# Functions (sampled pointwise)
t = np.linspace(0, 1, 5)
f = t**2
g = np.sin(np.pi * t)
print("\nC([0,1]) functions (sampled):")
print("  (f+g)(t) =", f + g)
print("  (3f)(t)  =", 3 * f)

print("\nKey insight: + and scalar * work the same way in every vector space.")
print("The axioms guarantee this uniformity.")

Code cell 6

# === 1.2 Vector space structure in AI: embedding arithmetic ===

# Synthetic embedding space (small d for clarity; real models use d=768+)
tokens = ["king", "queen", "man", "woman", "apple"]
E = np.array([
    [0.8,  0.9,  0.2,  0.1],   # king
    [0.8,  0.9, -0.2,  0.1],   # queen
    [0.7,  0.8,  0.4,  0.1],   # man
    [0.7,  0.8, -0.4,  0.1],   # woman
    [-0.3, 0.2,  0.0,  0.9],   # apple
], dtype=float)

# king - man + woman: relies on R^d being a vector space
analogy = E[0] - E[2] + E[3]  # king - man + woman
print("king - man + woman =", analogy)
print("queen              =", E[1])
print("cosine(analogy, queen) =", cosine_similarity(analogy, E[1]))

# This arithmetic is meaningful ONLY because R^d is a vector space:
# subtraction (additive inverse), addition, and the result stays in R^d.

# Residual stream: x <- x + attn_output + mlp_output
d = 8
x = np.random.randn(d)
attn_out = np.random.randn(d) * 0.3
mlp_out = np.random.randn(d) * 0.3
x_next = x + attn_out + mlp_out   # vector addition in R^d
print("\nResidual stream update:")
print("  ||x||       =", np.linalg.norm(x))
print("  ||x + attn + mlp|| =", np.linalg.norm(x_next))
print("  This sum makes sense because R^d is closed under addition.")

Subspace intuition (preview)

A subspace is a vector space living inside a larger vector space — a flat, linear subset that passes through the origin and is closed under the same operations.

  • A line through the origin in $\mathbb{R}^3$ is a subspace.
  • A line not through the origin is not (it doesn't contain the zero vector).
  • A plane through the origin is a subspace; a curved surface never is.

Why "through the origin": the zero vector must belong to any subspace (scale any vector by 0 and you get 0; if 0 isn't in the set, it can't be closed under scalar multiplication).

Why "flat": closure under addition and scalar multiplication forces the set to be "straight" — no curves, no bends.

Subspaces that appear constantly in AI:

| Subspace | Definition | AI meaning |
|---|---|---|
| $\text{col}(W)$ | $\{Wx : x \in \mathbb{R}^n\}$ | Reachable outputs of layer $W$ |
| $\text{null}(W)$ | $\{x : Wx = 0\}$ | Input directions the layer ignores |
| $\text{row}(W)$ | $\{W^T y : y \in \mathbb{R}^m\}$ | Input directions the layer listens to |
| LoRA update space | $\text{col}(B)$ for $\Delta W = BA^T$ | Rank-$r$ subspace of weight updates |

Code cell 8

# === 1.3 Subspace vs non-subspace: quick visual check ===

# A line through the origin: W = {t*(1,2) : t in R}
# Check: contains 0? Yes (t=0). Closed under + and scalar mult? Yes.
direction = np.array([1.0, 2.0])
t_vals = np.array([-2, -1, 0, 0.5, 1, 3])
line_points = np.outer(t_vals, direction)
print("Line through origin (subspace):")
for t, pt in zip(t_vals, line_points):
    print(f"  t={t:>4}: ({pt[0]:>5.1f}, {pt[1]:>5.1f})")
# Sum of two points: (1,2) + (3,6) = (4,8) = 4*(1,2) -> still on the line
print("  (1,2)+(3,6) =", direction + 3*direction, "= 4*(1,2) -> in W")

# A line NOT through origin: {(1,0) + t*(1,2) : t in R}
# Does not contain 0: (1,0) + t*(1,2) = (0,0) requires t=0 and 1=0. Contradiction.
print("\nLine NOT through origin (NOT a subspace):")
print("  Does it contain (0,0)? No. Already fails.")

# Probability simplex: {p in R^3 : p_i >= 0, sum = 1}
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print("\nProbability simplex (NOT a subspace):")
print("  p =", p, "sum =", p.sum())
print("  q =", q, "sum =", q.sum())
print("  p + q =", p + q, "sum =", (p+q).sum(), "!= 1 -> not in simplex")
print("  2*p  =", 2*p, "sum =", (2*p).sum(), "!= 1 -> not in simplex")
print("  (0,0,0) not in simplex (sum=0 != 1) -> no zero vector")

2. Formal Definitions — The Eight Axioms

A vector space over $\mathbb{R}$ is a set $V$ together with two operations:

  • Addition: $+ : V \times V \to V$
  • Scalar multiplication: $\cdot : \mathbb{R} \times V \to V$

satisfying these eight axioms for all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and all $\alpha, \beta \in \mathbb{R}$:

Addition axioms:

| # | Name | Statement |
|---|---|---|
| A1 | Commutativity | $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$ |
| A2 | Associativity | $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$ |
| A3 | Additive identity | $\exists\, \mathbf{0} \in V$ such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for all $\mathbf{v}$ |
| A4 | Additive inverse | For each $\mathbf{v}$, $\exists\, (-\mathbf{v})$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$ |

Scalar multiplication axioms:

| # | Name | Statement |
|---|---|---|
| S1 | Multiplicative identity | $1 \cdot \mathbf{v} = \mathbf{v}$ |
| S2 | Compatibility | $\alpha(\beta \mathbf{v}) = (\alpha\beta)\mathbf{v}$ |

Distributivity axioms:

| # | Name | Statement |
|---|---|---|
| D1 | Over vector addition | $\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$ |
| D2 | Over scalar addition | $(\alpha + \beta)\mathbf{v} = \alpha\mathbf{v} + \beta\mathbf{v}$ |

These eight axioms are the complete definition. The axioms are minimal: each is independent of the others.

Code cell 10

# === 2.1 Verify all eight axioms for R^n ===

u = np.array([1.0, -2.0, 3.0])
v = np.array([4.0, 0.5, -1.0])
w = np.array([-1.0, 3.0, 2.0])
alpha, beta = 2.5, -1.3
zero = np.zeros(3)

print("Verifying all 8 axioms for R^3:")
print(f"A1 Commutativity:     u+v == v+u?            {np.allclose(u+v, v+u)}")
print(f"A2 Associativity:     (u+v)+w == u+(v+w)?    {np.allclose((u+v)+w, u+(v+w))}")
print(f"A3 Additive identity: v+0 == v?              {np.allclose(v+zero, v)}")
print(f"A4 Additive inverse:  v+(-v) == 0?           {np.allclose(v+(-v), zero)}")
print(f"S1 Mult identity:     1*v == v?              {np.allclose(1*v, v)}")
print(f"S2 Compatibility:     a(bv) == (ab)v?        {np.allclose(alpha*(beta*v), (alpha*beta)*v)}")
print(f"D1 Scalar distrib:    a(u+v) == au+av?       {np.allclose(alpha*(u+v), alpha*u+alpha*v)}")
print(f"D2 Vector distrib:    (a+b)v == av+bv?       {np.allclose((alpha+beta)*v, alpha*v+beta*v)}")

Code cell 11

# === 2.2 Verify axioms for R^{m x n} (matrices as vectors) ===

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
B = np.array([[0.0, -1.0], [2.0, 1.0], [-1.0, 3.0]])
C = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
Z = np.zeros((3, 2))
a, b = 3.0, -0.7

print("Verifying axioms for R^{3x2}:")
print(f"A1 Commutativity:  {np.allclose(A+B, B+A)}")
print(f"A2 Associativity:  {np.allclose((A+B)+C, A+(B+C))}")
print(f"A3 Zero matrix:    {np.allclose(A+Z, A)}")
print(f"A4 Inverse:        {np.allclose(A+(-A), Z)}")
print(f"S1 Identity:       {np.allclose(1*A, A)}")
print(f"S2 Compatibility:  {np.allclose(a*(b*A), (a*b)*A)}")
print(f"D1 Scalar distrib: {np.allclose(a*(A+B), a*A+a*B)}")
print(f"D2 Vector distrib: {np.allclose((a+b)*A, a*A+b*A)}")
print(f"\ndim(R^{{3x2}}) = 3*2 = {3*2}")

Code cell 12

# === 2.3 Verify axioms for P_n (polynomials of degree <= n) ===
# Represent p(t) = a0 + a1*t + a2*t^2 by coefficient vector [a0, a1, a2]

p = np.array([1.0, 2.0, -1.0])    # 1 + 2t - t^2
q = np.array([3.0, -1.0, 4.0])    # 3 - t + 4t^2
r = np.array([0.0, 1.0, 1.0])     # t + t^2
z = np.zeros(3)                     # zero polynomial
a, b = 2.0, -0.5

print("Verifying axioms for P_2 (polynomials of degree <= 2):")
print(f"  p(t) = 1 + 2t - t^2,  q(t) = 3 - t + 4t^2")
print(f"  p + q = {p + q}  -> (p+q)(t) = 4 + t + 3t^2")
print(f"  A1: {np.allclose(p+q, q+p)}")
print(f"  A2: {np.allclose((p+q)+r, p+(q+r))}")
print(f"  A3: {np.allclose(p+z, p)}")
print(f"  A4: {np.allclose(p+(-p), z)}")
print(f"  S1: {np.allclose(1*p, p)}")
print(f"  S2: {np.allclose(a*(b*p), (a*b)*p)}")
print(f"  D1: {np.allclose(a*(p+q), a*p+a*q)}")
print(f"  D2: {np.allclose((a+b)*p, a*p+b*p)}")
print(f"\ndim(P_2) = 3 (basis: {{1, t, t^2}})")
print("Note: p+q has degree 2 <= 2, so it stays in P_2. Closure holds.")

Immediate consequences of the axioms

These follow from the eight axioms alone — they hold in every vector space:

  1. Uniqueness of zero: if $\mathbf{0}$ and $\mathbf{0}'$ both satisfy A3, then $\mathbf{0} = \mathbf{0} + \mathbf{0}' = \mathbf{0}'$.

  2. Uniqueness of additive inverse: if $\mathbf{v} + \mathbf{w} = \mathbf{0}$ and $\mathbf{v} + \mathbf{w}' = \mathbf{0}$, then $\mathbf{w} = \mathbf{w}'$.

  3. $0 \cdot \mathbf{v} = \mathbf{0}$ (zero scalar times any vector gives the zero vector): $0\mathbf{v} = (0+0)\mathbf{v} = 0\mathbf{v} + 0\mathbf{v}$; subtracting $0\mathbf{v}$ gives $\mathbf{0} = 0\mathbf{v}$. ✓

  4. $\alpha \cdot \mathbf{0} = \mathbf{0}$ (any scalar times the zero vector gives the zero vector): $\alpha\mathbf{0} = \alpha(\mathbf{0}+\mathbf{0}) = \alpha\mathbf{0} + \alpha\mathbf{0}$; subtracting gives $\mathbf{0} = \alpha\mathbf{0}$. ✓

  5. $(-1)\mathbf{v} = -\mathbf{v}$: $\mathbf{v} + (-1)\mathbf{v} = (1+(-1))\mathbf{v} = 0\mathbf{v} = \mathbf{0}$. ✓

  6. If $\alpha\mathbf{v} = \mathbf{0}$, then $\alpha = 0$ or $\mathbf{v} = \mathbf{0}$: if $\alpha \neq 0$, multiply by $\alpha^{-1}$ to get $\mathbf{v} = \alpha^{-1}\mathbf{0} = \mathbf{0}$. ✓

Code cell 14

# === 2.4 Verify consequences computationally ===

v = np.array([3.0, -1.0, 7.0, 2.0])
zero = np.zeros(4)

print("Consequence 3: 0 * v = 0")
print("  0 * v =", 0 * v)
print("  equals zero?", np.allclose(0 * v, zero))

print("\nConsequence 4: alpha * 0 = 0")
print("  7.5 * 0 =", 7.5 * zero)
print("  equals zero?", np.allclose(7.5 * zero, zero))

print("\nConsequence 5: (-1) * v = -v")
print("  (-1)*v =", (-1)*v)
print("  -v     =", -v)
print("  equal?", np.allclose((-1)*v, -v))

print("\nConsequence 6: alpha*v = 0 implies alpha=0 or v=0")
alpha = 3.0
result = alpha * v
print(f"  {alpha} * {v} = {result}")
print(f"  result is zero? {np.allclose(result, zero)} (alpha != 0 and v != 0, so correct)")

Non-examples: what fails and why

Understanding what fails the axioms is as important as knowing the examples.

| Set | What fails | Why |
|---|---|---|
| $\mathbb{R}_+^n$ (positive reals) | Scalar mult, inverse | $(-1) \cdot (1,2) = (-1,-2) \notin \mathbb{R}_+^n$ |
| Unit sphere $S^{n-1}$ | Addition, zero | $\lVert u+v \rVert \neq 1$ generally; $\mathbf{0} \notin S^{n-1}$ |
| Simplex $\Delta^n$ | Addition, scalar mult, zero | $p+q$ sums to 2; $\mathbf{0} \notin \Delta^n$ |
| $\mathbb{Z}^n$ (integers) | Scalar mult over $\mathbb{R}$ | $\pi \cdot (1,0) = (\pi, 0) \notin \mathbb{Z}^n$ |
| $\{x : Ax = b\}$, $b \neq 0$ | Zero, addition | $A\mathbf{0} = \mathbf{0} \neq b$ |

Code cell 16

# === 2.5 Demonstrate axiom failures for non-vector-spaces ===

print("=== R_+^n (positive reals) ===")
v_pos = np.array([1.0, 2.0, 3.0])
scaled = (-1) * v_pos
print(f"  v = {v_pos}  (all positive)")
print(f"  (-1)*v = {scaled}  (not all positive -> FAILS closure under scalar mult)")
print(f"  Also: no additive inverse in R_+  (v + (-v) = 0, but -v not in R_+)")

print("\n=== Unit sphere S^{n-1} ===")
u_sphere = normalize(np.array([1.0, 0.0, 0.0]))
v_sphere = normalize(np.array([0.0, 1.0, 0.0]))
s = u_sphere + v_sphere
print(f"  ||u|| = {np.linalg.norm(u_sphere):.1f},  ||v|| = {np.linalg.norm(v_sphere):.1f}")
print(f"  u + v = {s},  ||u+v|| = {np.linalg.norm(s):.4f} != 1  (FAILS closure under +)")
print(f"  ||0|| = 0 != 1  (zero vector not on sphere -> FAILS zero)")

print("\n=== Probability simplex ===")
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
print(f"  p = {p}, sum = {p.sum():.1f}")
print(f"  q = {q}, sum = {q.sum():.1f}")
print(f"  p+q = {p+q}, sum = {(p+q).sum():.1f} != 1  (FAILS closure under +)")
print(f"  This is why logit space (pre-softmax) is used for arithmetic, not probability space.")

print("\n=== Affine subspace {x : Ax = b}, b != 0 ===")
A_sys = np.array([[1.0, 2.0], [3.0, 4.0]])
b_sys = np.array([5.0, 6.0])
x_particular = np.linalg.solve(A_sys, b_sys)
print(f"  A*0 = {A_sys @ np.zeros(2)} != b = {b_sys}  (zero not in solution set -> FAILS)")
print(f"  One solution: x = {x_particular}")
print(f"  A*x = {A_sys @ x_particular} = b  (correct, but set is not a subspace)")

Vector spaces over other fields

The definition works over any field $\mathbb{F}$, not just $\mathbb{R}$:

| Field | Notation | Used in |
|---|---|---|
| Real numbers $\mathbb{R}$ | $\mathbb{R}^n$ | Standard ML, transformers |
| Complex numbers $\mathbb{C}$ | $\mathbb{C}^n$ | Quantum computing, Fourier analysis, signal processing |
| Binary field $\mathbb{F}_2 = \{0,1\}$ | $\mathbb{F}_2^n$ | Error-correcting codes, cryptography |
| Rationals $\mathbb{Q}$ | $\mathbb{Q}^n$ | Exact arithmetic, symbolic computation |

In $\mathbb{F}_2^n$: addition is XOR, multiplication is AND. All linear algebra carries over — including Gaussian elimination, rank, and null spaces (a rank-over-$\mathbb{F}_2$ sketch follows the next cell).

Code cell 18

# === 2.6 Vector space over F_2 (binary field) ===
# Addition = XOR, Scalar multiplication: 0*v=0, 1*v=v

u_bin = np.array([1, 0, 1, 1])
v_bin = np.array([0, 1, 1, 0])

# Addition in F_2 is XOR
add_f2 = np.bitwise_xor(u_bin, v_bin)
print("F_2^4 vector space:")
print(f"  u = {u_bin}")
print(f"  v = {v_bin}")
print(f"  u + v (XOR) = {add_f2}")

# Additive inverse: in F_2, every element is its own inverse (1+1=0)
print(f"  u + u (XOR) = {np.bitwise_xor(u_bin, u_bin)} (= zero vector, so -u = u in F_2)")

# Commutativity
print(f"  u+v == v+u? {np.array_equal(add_f2, np.bitwise_xor(v_bin, u_bin))}")
print("  Used in: error-correcting codes, parity checks, cryptographic linear algebra")

3. Subspaces — The Three-Condition Test

A subset $W \subseteq V$ is a subspace of $V$ if $W$ is itself a vector space under the same operations inherited from $V$.

Checking all eight axioms would be redundant. Commutativity, associativity, and distributivity are inherited automatically from $V$. Only three conditions need to be verified:

  1. Contains zero: $\mathbf{0} \in W$
  2. Closed under addition: $\mathbf{u}, \mathbf{w} \in W \implies \mathbf{u} + \mathbf{w} \in W$
  3. Closed under scalar multiplication: $\mathbf{w} \in W,\; \alpha \in \mathbb{R} \implies \alpha\mathbf{w} \in W$

Equivalent single condition (often quickest; a numerical sketch follows the list below): for all $\mathbf{u}, \mathbf{w} \in W$ and $\alpha, \beta \in \mathbb{R}$:

$$\alpha\mathbf{u} + \beta\mathbf{w} \in W$$

Why these three suffice

  • Condition 1 ensures $W$ is non-empty and has an additive identity.
  • Condition 2 ensures $+$ stays inside $W$.
  • Condition 3 ensures scalar multiplication stays inside $W$ (and gives additive inverses via $\alpha = -1$).
  • The remaining axioms (commutativity, associativity, distributivity) are equalities that hold in $V$ for all vectors; they hold in particular for vectors in $W \subseteq V$.
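A quick sketch of the single combined condition before the numbered cells (the helper `in_null_space` is illustrative and not part of the original notebook; `null_space` comes from the setup cell): draw random members of $W = \text{null}(A)$ and check that an arbitrary linear combination stays in $W$.

def in_null_space(A, x, tol=1e-10):
    """Membership test for W = null(A)."""
    return np.linalg.norm(A @ x) <= tol * max(1.0, np.linalg.norm(x))

A = np.array([[1.0, 2.0, -1.0]])       # W = plane x + 2y - z = 0
N = null_space(A)                       # basis for W, from the setup cell

rng = np.random.default_rng(0)
u = N @ rng.standard_normal(N.shape[1])
w = N @ rng.standard_normal(N.shape[1])
alpha, beta = rng.standard_normal(2)

combo = alpha * u + beta * w
print("u in W:", in_null_space(A, u))
print("w in W:", in_null_space(A, w))
print("alpha*u + beta*w in W:", in_null_space(A, combo))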

Code cell 20

# === 3.1 Lines through the origin (1D subspaces of R^2 and R^3) ===

# W = {t*(1,2) : t in R} -- a line through origin in R^2
d = np.array([1.0, 2.0])

# Test 1: contains zero?
print("Line W = span{(1,2)} in R^2:")
print(f"  Contains 0? 0*(1,2) = {0*d}  -> Yes")

# Test 2: closed under addition?
w1 = 3.0 * d   # (3, 6)
w2 = -1.5 * d  # (-1.5, -3)
w_sum = w1 + w2 # (1.5, 3) = 1.5*(1,2)
t_sum = w_sum[0] / d[0]  # coefficient
print(f"  w1 = {w1}, w2 = {w2}")
print(f"  w1+w2 = {w_sum} = {t_sum}*(1,2) -> in W? {np.allclose(w_sum, t_sum*d)}")

# Test 3: closed under scalar mult?
alpha = -2.7
w_scaled = alpha * w1
t_scaled = w_scaled[0] / d[0]
print(f"  {alpha}*w1 = {w_scaled} = {t_scaled}*(1,2) -> in W? {np.allclose(w_scaled, t_scaled*d)}")
print("  Conclusion: W is a subspace of R^2. dim(W) = 1.")

Code cell 21

# === 3.2 Planes through the origin (2D subspaces of R^3) ===

# W = {(x,y,z) : x + 2y - z = 0} -- a plane through origin in R^3
# This is null(A) where A = [[1, 2, -1]]

A_plane = np.array([[1.0, 2.0, -1.0]])

# Find null space basis
N = null_space(A_plane)
print("Plane W = {(x,y,z) : x + 2y - z = 0}")
print(f"Null space basis (columns):\n{N}")
print(f"dim(W) = {N.shape[1]}")

# Verify subspace conditions with random elements of W
coeffs1 = np.array([2.0, -1.0])
coeffs2 = np.array([0.5, 3.0])
w1 = N @ coeffs1
w2 = N @ coeffs2

def scalar_value(x):
    return float(np.asarray(x).reshape(-1)[0])

print(f"\nw1 = {w1},  A*w1 = {scalar_value(A_plane @ w1):.10f}  (= 0, so w1 in W)")
print(f"w2 = {w2},  A*w2 = {scalar_value(A_plane @ w2):.10f}  (= 0, so w2 in W)")

# Test closure
print(f"\nw1+w2 = {w1+w2},  A*(w1+w2) = {scalar_value(A_plane @ (w1+w2)):.10f}  (closed under +)")
print(f"3.5*w1 = {3.5*w1},  A*(3.5*w1) = {scalar_value(A_plane @ (3.5*w1)):.10f}  (closed under scalar mult)")
print("All three conditions satisfied -> W is a subspace.")

Code cell 22

# === 3.3 Null space of a matrix is always a subspace ===

# null(A) = {x : Ax = 0}
# Proof that it's a subspace:
#   1. 0 in null(A) because A*0 = 0
#   2. If Ax=0 and Ay=0, then A(x+y) = Ax+Ay = 0+0 = 0
#   3. If Ax=0, then A(alpha*x) = alpha*Ax = alpha*0 = 0

A = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 1.0, 2.0, 3.0]
])

N = null_space(A)
rank_A = np.linalg.matrix_rank(A)
n = A.shape[1]

print(f"A is {A.shape[0]}x{A.shape[1]}, rank = {rank_A}")
print(f"nullity = n - rank = {n} - {rank_A} = {n - rank_A}")
print(f"Null space basis (columns):\n{N}")
print(f"\nVerification: A @ N =\n{A @ N}")
print(f"All zero? {np.allclose(A @ N, 0)}")

# Show closure explicitly
if N.shape[1] >= 2:
    x = N[:, 0]
    y = N[:, 1]
    print(f"\nx in null(A): A*x = {A @ x}")
    print(f"y in null(A): A*y = {A @ y}")
    print(f"x+y in null(A)? A*(x+y) = {A @ (x+y)} -> {np.allclose(A @ (x+y), 0)}")
    print(f"5x in null(A)?  A*(5x) = {A @ (5*x)} -> {np.allclose(A @ (5*x), 0)}")

Code cell 23

# === 3.4 Column space of a matrix is always a subspace ===

# col(A) = {Ax : x in R^n}
# Proof that it's a subspace:
#   1. 0 in col(A) because A*0 = 0
#   2. If Ax, Ay in col(A), then Ax+Ay = A(x+y) in col(A)
#   3. alpha*(Ax) = A(alpha*x) in col(A)

A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0]
])

C = col_space(A)
rank_A = np.linalg.matrix_rank(A)

print(f"A =\n{A}")
print(f"rank(A) = {rank_A}")
print(f"dim(col(A)) = {rank_A}")
print(f"Column space basis (orthonormal columns):\n{C}")

# Generate two vectors in col(A) and check closure
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 2.0, -1.0])
v1 = A @ x1
v2 = A @ x2
print(f"\nv1 = A*x1 = {v1}  (in col(A))")
print(f"v2 = A*x2 = {v2}  (in col(A))")
v_sum = v1 + v2
# v1+v2 = A*(x1+x2), so it's in col(A)
print(f"v1+v2 = {v_sum} = A*(x1+x2) = A*{x1+x2} = {A @ (x1+x2)} -> in col(A)")

Code cell 24

# === 3.5 Symmetric matrices form a subspace of R^{nxn} ===

# W = {A in R^{nxn} : A = A^T}
# Check: 0 is symmetric, sum of symmetric = symmetric, scalar * symmetric = symmetric

n = 3
S1 = np.array([[1.0, 2.0, 3.0], [2.0, 5.0, 4.0], [3.0, 4.0, 6.0]])
S2 = np.array([[0.0, -1.0, 2.0], [-1.0, 3.0, 0.0], [2.0, 0.0, 1.0]])

print("Symmetric matrices as a subspace of R^{3x3}:")
print(f"  S1 symmetric? {np.allclose(S1, S1.T)}")
print(f"  S2 symmetric? {np.allclose(S2, S2.T)}")
print(f"  Zero symmetric? {np.allclose(np.zeros((n,n)), np.zeros((n,n)).T)}")
print(f"  S1+S2 symmetric? {np.allclose(S1+S2, (S1+S2).T)}")
print(f"  -3*S1 symmetric? {np.allclose(-3*S1, (-3*S1).T)}")
print(f"\n  dim(R^{{3x3}}) = {n*n}")
print(f"  dim(Sym_3)    = n(n+1)/2 = {n*(n+1)//2}")
print(f"  Sym_3 is a proper subspace of R^{{3x3}} (6 < 9 dimensions)")

Code cell 25

# === 3.6 Non-subspace examples: systematic failure analysis ===

print("=== {x : Ax = b} with b != 0  (affine subspace, NOT a subspace) ===")
A_aff = np.array([[1.0, 1.0], [0.0, 1.0]])
b_aff = np.array([3.0, 1.0])
x_p = np.linalg.solve(A_aff, b_aff)
print(f"  A*0 = {A_aff @ np.zeros(2)} != b = {b_aff}  -> 0 not in set (FAILS)")
# Also: if x1, x2 are solutions, x1+x2 is not: A(x1+x2) = 2b != b
print(f"  If x is a solution, A*(2x) = 2b = {2*b_aff} != b  -> not closed under scalar mult")

print("\n=== Unit ball {{x : ||x|| <= 1}} ===")
x_ball = np.array([0.8, 0.5])
print(f"  x = {x_ball}, ||x|| = {np.linalg.norm(x_ball):.3f} <= 1  (in ball)")
print(f"  2*x = {2*x_ball}, ||2x|| = {np.linalg.norm(2*x_ball):.3f} > 1  (NOT in ball -> FAILS)")
print(f"  Contains 0? Yes. Closed under +? Maybe. Scalar mult? No.")

print("\n=== First quadrant {{(x,y) : x>=0, y>=0}} ===")
v_quad = np.array([1.0, 2.0])
print(f"  v = {v_quad}  (in first quadrant)")
print(f"  (-1)*v = {-1*v_quad}  (NOT in first quadrant -> FAILS scalar mult)")
print(f"  Contains 0? Yes. Closed under +? Yes. But fails scalar mult.")

print("\n=== Invertible matrices (NOT a subspace of R^{nxn}) ===")
print("  The zero matrix is NOT invertible -> 0 not in set (FAILS immediately)")
print("  Also: A + (-A) = 0, which is not invertible -> not closed under addition")

Code cell 26

# === 3.7 The two trivial subspaces ===

# Every vector space V has exactly two subspaces that always exist:
#   1. {0}  -- the trivial subspace, dimension 0
#   2. V    -- the whole space, dimension dim(V)

print("The two trivial subspaces of R^4:")
print(f"  {{0}}: dim = 0")
print(f"  R^4:  dim = 4")
print(f"  Any subspace W satisfies {{0}} ⊆ W ⊆ R^4")
print(f"  Possible subspace dimensions in R^4: 0, 1, 2, 3, 4")

# Demonstrate with null spaces of different ranks
for r in range(5):
    if r == 0:
        M = np.zeros((1, 4))
    elif r == 4:
        M = np.eye(4)
    else:
        M = np.random.randn(r, 4)
    N = null_space(M)
    dim_null = N.shape[1] if N.size > 0 else 0
    print(f"  rank-{r} matrix -> null space dim = {dim_null}")

Visualising subspaces in $\mathbb{R}^3$

In $\mathbb{R}^3$, subspaces can only be:

  • The origin (dim 0)
  • A line through the origin (dim 1)
  • A plane through the origin (dim 2)
  • All of $\mathbb{R}^3$ (dim 3)

No curves, no spheres, no shifted planes — only flat spaces through the origin.

Code cell 28

# === 3.8 Visualise subspaces of R^3 ===

if HAS_MPL:
    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure(figsize=(14, 5))

    # 1D subspace: line through origin in direction (1,2,1)
    ax1 = fig.add_subplot(131, projection='3d')
    d_line = np.array([1, 2, 1], dtype=float)
    t = np.linspace(-2, 2, 50)
    ax1.plot(t*d_line[0], t*d_line[1], t*d_line[2], 'b-', linewidth=2)
    ax1.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax1.set_title('1D subspace (line)', fontsize=10)
    ax1.set_xlabel('x'); ax1.set_ylabel('y'); ax1.set_zlabel('z')

    # 2D subspace: plane x + y - z = 0
    ax2 = fig.add_subplot(132, projection='3d')
    s_range = np.linspace(-2, 2, 20)
    S, T = np.meshgrid(s_range, s_range)
    # Two basis vectors for the plane x+y-z=0: (1,0,1) and (0,1,1)
    X_plane = S * 1 + T * 0
    Y_plane = S * 0 + T * 1
    Z_plane = S * 1 + T * 1
    ax2.plot_surface(X_plane, Y_plane, Z_plane, alpha=0.3, color='cyan')
    ax2.scatter([0], [0], [0], color='red', s=60, zorder=5)
    ax2.set_title('2D subspace (plane)', fontsize=10)
    ax2.set_xlabel('x'); ax2.set_ylabel('y'); ax2.set_zlabel('z')

    # NOT a subspace: plane x + y + z = 1 (does not pass through origin)
    ax3 = fig.add_subplot(133, projection='3d')
    X_aff = S * 1 + T * 0
    Y_aff = S * 0 + T * 1
    Z_aff = 1 - X_aff - Y_aff
    ax3.plot_surface(X_aff, Y_aff, Z_aff, alpha=0.3, color='salmon')
    ax3.scatter([0], [0], [0], color='red', s=60, zorder=5, label='origin')
    ax3.set_title('NOT a subspace (affine)', fontsize=10)
    ax3.set_xlabel('x'); ax3.set_ylabel('y'); ax3.set_zlabel('z')

    plt.tight_layout()
    plt.show()
else:
    print("(matplotlib not available — skipping 3D subspace plots)")

4. Span and Linear Combinations

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k \in V$ is any vector of the form:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k$$

where $\alpha_1, \ldots, \alpha_k \in \mathbb{R}$ are arbitrary scalars (the coefficients).

The span of a set $S = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is the set of all linear combinations:

$$\text{span}(S) = \left\{ \sum_{i=1}^k \alpha_i \mathbf{v}_i : \alpha_1, \ldots, \alpha_k \in \mathbb{R} \right\}$$

Key facts about span

  • $\text{span}(S)$ is always a subspace (the smallest subspace containing all vectors in $S$).
  • $\text{span}(\emptyset) = \{\mathbf{0}\}$ by convention.
  • If $S \subseteq T$, then $\text{span}(S) \subseteq \text{span}(T)$.
  • A set $S$ spans $V$ if $\text{span}(S) = V$.

Geometric pictures

  • $\text{span}\{\mathbf{v}\}$: a line through the origin (1D)
  • $\text{span}\{\mathbf{u}, \mathbf{v}\}$ (non-parallel): a plane through the origin (2D)
  • $\text{span}\{\mathbf{u}, \mathbf{v}, \mathbf{w}\}$ (not coplanar): a 3D subspace

Connection to matrices

Given $\mathbf{v}_1, \ldots, \mathbf{v}_k \in \mathbb{R}^m$, form $V = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k] \in \mathbb{R}^{m \times k}$. Then:

$$\text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \{V\boldsymbol{\alpha} : \boldsymbol{\alpha} \in \mathbb{R}^k\} = \text{col}(V)$$

To check whether $\mathbf{b} \in \text{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$: solve $V\boldsymbol{\alpha} = \mathbf{b}$. The system is consistent iff $\text{rank}(V) = \text{rank}([V \mid \mathbf{b}])$.

Code cell 30

# === 4.1 Linear combinations and span in R^3 ===

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([1.0, 1.0, 0.0])  # v3 = v1 + v2 (redundant)

# Linear combination
combo = 2*v1 + 3*v2 - 1*v3
print("v1 =", v1)
print("v2 =", v2)
print("v3 = v1 + v2 =", v3)
print("2*v1 + 3*v2 - 1*v3 =", combo, " (lies in the xy-plane)")

# span{v1, v2} = xy-plane (2D subspace of R^3)
# span{v1, v2, v3} = same xy-plane (v3 is redundant)
V12 = np.column_stack([v1, v2])
V123 = np.column_stack([v1, v2, v3])
print(f"\nrank([v1|v2])    = {np.linalg.matrix_rank(V12)}  -> span is 2D")
print(f"rank([v1|v2|v3]) = {np.linalg.matrix_rank(V123)}  -> adding v3 didn't increase span")
print("v3 is redundant: it's already in span{v1, v2}.")

Code cell 31

# === 4.2 Checking membership in a span ===

# Does b lie in span{v1, v2}?
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 1.0])
V = np.column_stack([v1, v2])

b_in = np.array([2.0, 5.0, 7.0])     # 2*v1 + 1*v2
b_out = np.array([1.0, 0.0, 0.0])    # not in span

for label, b in [("b_in", b_in), ("b_out", b_out)]:
    augmented = np.column_stack([V, b])
    rank_V = np.linalg.matrix_rank(V)
    rank_aug = np.linalg.matrix_rank(augmented)
    in_span = rank_V == rank_aug
    print(f"{label} = {b}")
    print(f"  rank(V) = {rank_V}, rank([V|b]) = {rank_aug}")
    print(f"  In span? {in_span}")
    if in_span:
        coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
        print(f"  Coefficients: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2")
        print(f"  Verify: {coeffs[0]:.4f}*v1 + {coeffs[1]:.4f}*v2 = {V @ coeffs}")
    print()

Code cell 32

# === 4.3 Span = column space: the matrix viewpoint ===

# The columns of a weight matrix W span col(W) -- the reachable outputs.
# If rank(W) < m, some output directions are unreachable.

W = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [0.0, 1.0]
])
print(f"W (3x2) =\n{W}")
print(f"rank(W) = {np.linalg.matrix_rank(W)}")
print(f"col(W) is a {np.linalg.matrix_rank(W)}D subspace of R^3")

# Try many inputs: outputs always land in col(W)
C = col_space(W)
print(f"\nOrthonormal basis for col(W):\n{C}")

np.random.seed(0)
for i in range(5):
    x = np.random.randn(2)
    y = W @ x
    # Project y onto col(W) and check it equals y
    y_proj = C @ (C.T @ y)
    print(f"  x={x} -> Wx={y} -> proj_col(W)(Wx)={y_proj} -> same? {np.allclose(y, y_proj)}")

print("\nAll outputs lie in col(W). Directions outside col(W) are unreachable.")

5. Linear Independence

Vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ are linearly independent if the only linear combination equal to the zero vector uses all-zero coefficients:

$$\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \implies \alpha_1 = \alpha_2 = \cdots = \alpha_k = 0$$

Equivalently: no vector in the set is a linear combination of the others. Each vector adds a genuinely new direction.

Linearly dependent: there exist non-trivial (not all-zero) coefficients such that $\sum_i \alpha_i \mathbf{v}_i = \mathbf{0}$. At least one vector is redundant.

Checking independence

Form $V = [\mathbf{v}_1 \mid \cdots \mid \mathbf{v}_k]$ and row reduce:

  • Independent iff every column has a pivot iff $\text{rank}(V) = k$
  • Dependent iff $\text{rank}(V) < k$
  • Determinant test (square case): independent iff $\det(V) \neq 0$

Key properties

  • Any set containing $\mathbf{0}$ is dependent ($1 \cdot \mathbf{0} + 0 \cdot \mathbf{v}_2 + \cdots = \mathbf{0}$)
  • A single non-zero vector is always independent
  • Two vectors are dependent iff one is a scalar multiple of the other
  • More vectors than dimensions $\implies$ always dependent ($k > n$ in $\mathbb{R}^n$)

Code cell 34

# === 5.1 Independent vs dependent sets ===

# Independent set in R^3
ind = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float).T  # standard basis
# Dependent set: third vector = first + second
dep = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float).T

print("Independent set (standard basis of R^3):")
print(f"  rank = {np.linalg.matrix_rank(ind)}, k = {ind.shape[1]}")
print(f"  rank == k? {np.linalg.matrix_rank(ind) == ind.shape[1]} -> independent")

print("\nDependent set {(1,0,0), (0,1,0), (1,1,0)}:")
print(f"  rank = {np.linalg.matrix_rank(dep)}, k = {dep.shape[1]}")
print(f"  rank < k? {np.linalg.matrix_rank(dep) < dep.shape[1]} -> dependent")
print("  Dependence relation: 1*(1,0,0) + 1*(0,1,0) + (-1)*(1,1,0) = (0,0,0)")
print("  Verify:", 1*dep[:,0] + 1*dep[:,1] + (-1)*dep[:,2])

Code cell 35

# === 5.2 Detecting dependence via RREF and the determinant ===

V = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
])
print("V =\n", V)
print(f"det(V) = {np.linalg.det(V):.6f}")
print(f"rank(V) = {np.linalg.matrix_rank(V)}")

R, pivots = rref(V)
print(f"\nRREF(V) =\n{R}")
print(f"Pivot columns: {pivots}")
print(f"Only {len(pivots)} pivots for 3 columns -> column 3 is dependent on columns 1,2")
print(f"Dependence: col3 = {R[0,2]:.0f}*col1 + {R[1,2]:.0f}*col2")
print(f"Verify: {R[0,2]}*(1,4,7) + {R[1,2]}*(2,5,8) = {R[0,2]*V[:,0] + R[1,2]*V[:,1]}")
print(f"col3 = {V[:,2]}")

Code cell 36

# === 5.3 More vectors than dimensions -> always dependent ===

np.random.seed(42)
n = 3
k = 5  # 5 vectors in R^3 -> must be dependent
V = np.random.randn(n, k)
print(f"{k} vectors in R^{n}:")
print(f"  rank([v1|...|v{k}]) = {np.linalg.matrix_rank(V)}")
print(f"  rank < k = {k}? {np.linalg.matrix_rank(V) < k} -> dependent (pigeonhole)")
print(f"  At most {n} vectors in R^{n} can be independent.")

# The null space of V^T gives the dependence relations
N = null_space(V.T)
if N.shape[1] > 0:
    c = N[:, 0]
    print(f"\n  Dependence coefficients: {c}")
    print(f"  Sum c_i * v_i = {V @ c}")
    print(f"  Is zero? {np.allclose(V @ c, 0)}")

Code cell 37

# === 5.4 AI context: superposition — more features than dimensions ===

# The superposition hypothesis: LLMs store more features than dimensions.
# If |features| > d, feature directions MUST be linearly dependent.
# This forces features to "share" dimensions and interfere with each other.

d = 4           # embedding dimension
n_features = 8  # more features than dimensions

np.random.seed(7)
# Random unit feature directions
features = np.random.randn(n_features, d)
features = features / np.linalg.norm(features, axis=1, keepdims=True)

print(f"{n_features} feature directions in R^{d}:")
print(f"  rank = {np.linalg.matrix_rank(features)}")
print(f"  {n_features} > {d} -> features are linearly dependent (superposition)")

# Measure interference via pairwise cosine similarities
G = features @ features.T
np.fill_diagonal(G, 0)  # ignore self-similarity
print(f"\n  Max |cosine| between features: {np.max(np.abs(G)):.4f}")
print(f"  Mean |cosine| between features: {np.mean(np.abs(G[np.triu_indices(n_features, 1)])):.4f}")
print("  Non-zero cosines = interference between features.")

6. Basis and Dimension

A basis for a vector space $V$ is a set $\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\}$ that is:

  1. Linearly independent — no vector is redundant
  2. Spans $V$ — every vector in $V$ is a linear combination of the $\mathbf{b}_i$

A basis is both the minimal spanning set and the maximal independent set.

The Unique Representation Theorem

If $\mathcal{B}$ is a basis, then every $\mathbf{v} \in V$ has a unique representation $\mathbf{v} = \sum_i \alpha_i \mathbf{b}_i$. The scalars $(\alpha_1, \ldots, \alpha_n)$ are the coordinates of $\mathbf{v}$ with respect to $\mathcal{B}$.
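A short numerical companion to the theorem (an added sketch, reusing the setup cell's `null_space` helper): coordinates are unique because two coordinate vectors for the same $\mathbf{v}$ would differ by an element of the basis matrix's null space, which is trivial.

B = np.array([[1.0, 1.0], [1.0, -1.0]])   # columns are the basis vectors
x = np.array([4.0, 2.0])

c = np.linalg.solve(B, x)                  # the coordinate vector [x]_B
print("coords of x in basis B:", c)

# If B c1 = B c2, then B (c1 - c2) = 0; a basis matrix is invertible,
# so its null space is {0} and c1 = c2.
print("dim null(B) =", null_space(B).shape[1])   # 0 -> coordinates are unique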

Dimension

All bases for $V$ have the same number of elements. This number is the dimension:

$$\dim(V) = |\text{any basis of } V|$$

| Space | Dimension |
|---|---|
| $\mathbb{R}^n$ | $n$ |
| $\mathbb{R}^{m \times n}$ | $mn$ |
| $P_n$ (polynomials of degree $\leq n$) | $n+1$ |
| Symmetric $n \times n$ matrices | $n(n+1)/2$ |
| $\{\mathbf{0}\}$ | $0$ |

Dimension counting theorems

Rank-Nullity: $\text{nullity}(A) + \text{rank}(A) = n$ for $A \in \mathbb{R}^{m \times n}$.

Dimension of sum: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$.

Code cell 39

# === 6.1 Coordinates in the standard basis vs a non-standard basis ===

# Standard basis for R^2: {e1, e2}
# Non-standard basis: B = {(1,1), (1,-1)}
B = np.array([[1.0, 1.0], [1.0, -1.0]]).T  # columns are basis vectors
x = np.array([4.0, 2.0])

# Coordinates in standard basis = entries themselves
print(f"x = {x}")
print(f"[x]_standard = {x}  (coordinates = entries)")

# Coordinates in basis B: solve B * coords = x
coords_B = np.linalg.solve(B, x)
print(f"\nBasis B = columns of\n{B}")
print(f"[x]_B = {coords_B}")
print(f"Verify: {coords_B[0]}*(1,1) + {coords_B[1]}*(1,-1) = {B @ coords_B}")

# The vector is unchanged; only the coordinate representation changes
print(f"\nSame vector, different coordinates:")
print(f"  Standard: ({x[0]}, {x[1]})")
print(f"  Basis B:  ({coords_B[0]}, {coords_B[1]})")

Code cell 40

# === 6.2 Change of basis ===

# Two bases for R^2
B1 = np.array([[1.0, 1.0], [1.0, -1.0]]).T   # columns = basis vectors
B2 = np.array([[2.0, 1.0], [0.0, 1.0]]).T

# Change-of-basis matrix from B1 to B2: P = B2^{-1} B1
P_B1_to_B2 = np.linalg.solve(B2, B1)
print("Change-of-basis matrix P (B1 -> B2):")
print(P_B1_to_B2)

# A vector x with known coords in B1
x = np.array([4.0, 2.0])
coords_B1 = np.linalg.solve(B1, x)
coords_B2 = P_B1_to_B2 @ coords_B1
coords_B2_direct = np.linalg.solve(B2, x)

print(f"\nx = {x}")
print(f"[x]_B1 = {coords_B1}")
print(f"[x]_B2 via change-of-basis = {coords_B2}")
print(f"[x]_B2 via direct solve    = {coords_B2_direct}")
print(f"Match? {np.allclose(coords_B2, coords_B2_direct)}")

# Inverse change of basis
P_B2_to_B1 = np.linalg.inv(P_B1_to_B2)
print(f"\nP_B2_to_B1 @ P_B1_to_B2 = I? {np.allclose(P_B2_to_B1 @ P_B1_to_B2, np.eye(2))}")

Code cell 41

# === 6.3 Rank-Nullity theorem: conservation of dimensions ===

np.random.seed(42)
for m, n in [(3, 5), (4, 4), (5, 3), (3, 3)]:
    A = np.random.randn(m, n)
    r = np.linalg.matrix_rank(A)
    nullity = n - r
    print(f"A is {m}x{n}: rank = {r}, nullity = {nullity}, rank + nullity = {r + nullity} = n = {n}")

# Concrete example with structure
A = np.array([
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 2.0, 2.0, 1.0]
])
r = np.linalg.matrix_rank(A)
N = null_space(A)
print(f"\nA (3x4) =\n{A}")
print(f"rank = {r}, nullity = {N.shape[1]}")
print(f"rank + nullity = {r + N.shape[1]} = n = {A.shape[1]}")
print(f"Null space basis:\n{N}")
print(f"Verify A @ N = 0? {np.allclose(A @ N, 0)}")

7. The Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$ with rank $r$, there are four fundamental subspaces:

| Subspace | Lives in | Dimension | Definition |
|---|---|---|---|
| Column space $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ — reachable outputs |
| Null space $\text{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}$ — invisible inputs |
| Row space $\text{row}(A)$ | $\mathbb{R}^n$ | $r$ | $\{A^T\mathbf{y} : \mathbf{y} \in \mathbb{R}^m\}$ — directions $A$ listens to |
| Left null space $\text{null}(A^T)$ | $\mathbb{R}^m$ | $m - r$ | $\{\mathbf{y} : A^T\mathbf{y} = \mathbf{0}\}$ — unreachable output directions |

The orthogonality relations

$$\text{row}(A) \perp \text{null}(A) \ \text{in } \mathbb{R}^n, \qquad \text{col}(A) \perp \text{null}(A^T) \ \text{in } \mathbb{R}^m$$

$$\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A), \qquad \mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^T)$$

Strang's Big Picture

              R^n                              R^m
    ┌──────────────────┐           ┌──────────────────┐
    │  row(A)          │           │  col(A)          │
    │  dim = r         │────A─────▶│  dim = r         │
    │                  │           │                  │
    │    ⊕             │           │    ⊕             │
    │                  │           │                  │
    │  null(A)         │────A─────▶│  {0}             │
    │  dim = n − r     │  (→ 0)    │                  │
    └──────────────────┘           └──────────────────┘

$A$ is a bijection from $\text{row}(A)$ to $\text{col}(A)$, and $A$ maps $\text{null}(A)$ to $\{\mathbf{0}\}$. A numerical check of the bijection follows.
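A small check of the bijection claim (an added sketch, not one of the original cells): project a vector onto $\text{row}(A)$, push it through $A$ into $\text{col}(A)$, and use the Moore-Penrose pseudoinverse to come back; restricted to the row space, the round trip is the identity.

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 2.0]])
r = np.linalg.matrix_rank(A)
U, S, Vt = np.linalg.svd(A)

x = np.array([2.0, -1.0, 3.0])
x_row = Vt[:r].T @ (Vt[:r] @ x)        # component of x in row(A)

y = A @ x_row                           # lands in col(A)
x_back = np.linalg.pinv(A) @ y          # pinv maps col(A) back into row(A)
print("x_row  =", x_row)
print("x_back =", x_back)
print("round trip is identity on row(A)?", np.allclose(x_row, x_back))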

Code cell 43

# === 7.1 Computing all four fundamental subspaces via SVD ===

A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 1.0, 2.0]
])

U, S, Vt = np.linalg.svd(A, full_matrices=True)
r = np.linalg.matrix_rank(A)
m, n = A.shape

col_basis = U[:, :r]
null_A_basis = Vt[r:].T
row_basis = Vt[:r]
left_null_basis = U[:, r:]

print(f"A ({m}x{n}), rank = {r}")
print(f"Singular values: {S}")
print(f"\n--- Four Fundamental Subspaces ---")
print(f"\nColumn space (dim {r}, lives in R^{m}):")
print(col_basis)
print(f"\nNull space (dim {n-r}, lives in R^{n}):")
print(null_A_basis)
print(f"\nRow space (dim {r}, lives in R^{n}):")
print(row_basis)
print(f"\nLeft null space (dim {m-r}, lives in R^{m}):")
print(left_null_basis)

# Dimension check
print(f"\n--- Dimension checks ---")
print(f"col + left_null = {r} + {m-r} = {m} = m")
print(f"row + null      = {r} + {n-r} = {n} = n")

Code cell 44

# === 7.2 Orthogonality verification and direct-sum decomposition ===

# Verify: row(A) ⊥ null(A)  and  col(A) ⊥ null(A^T)
print("Orthogonality checks:")
if null_A_basis.size > 0:
    cross = row_basis @ null_A_basis
    print(f"  row(A)^T @ null(A) = \n{cross}")
    print(f"  All zero? {np.allclose(cross, 0)}")

if left_null_basis.size > 0:
    cross2 = col_basis.T @ left_null_basis
    print(f"  col(A)^T @ null(A^T) = \n{cross2}")
    print(f"  All zero? {np.allclose(cross2, 0)}")

# Direct-sum decomposition in the domain R^n
z = np.array([2.0, -1.0, 3.0])
row_comp = row_basis.T @ (row_basis @ z)          # projection onto row space
null_comp = null_A_basis @ (null_A_basis.T @ z)    # projection onto null space

print(f"\n--- Direct-sum decomposition of z = {z} in R^{n} ---")
print(f"  Row-space component:  {row_comp}")
print(f"  Null-space component: {null_comp}")
print(f"  Sum:                  {row_comp + null_comp}")
print(f"  Equals z? {np.allclose(row_comp + null_comp, z)}")
print(f"  Orthogonal? dot = {np.dot(row_comp, null_comp):.10f}")
print(f"\n  A maps row-comp to: {A @ row_comp}")
print(f"  A maps null-comp to: {A @ null_comp} (= 0, as expected)")

Code cell 45

# === 7.3 AI interpretation: weight matrix subspaces ===

# For a neural network layer W in R^{m x n}:
#   col(W) = reachable output directions
#   null(W) = input directions the layer ignores
#   row(W) = input directions the layer listens to

np.random.seed(42)
m, n, r = 6, 8, 3  # 6 outputs, 8 inputs, effective rank 3

# Create a rank-r weight matrix (simulating a bottleneck layer)
U_full = np.random.randn(m, r)
V_full = np.random.randn(r, n)
W = U_full @ V_full

print(f"Weight matrix W: {m}x{n}, rank = {np.linalg.matrix_rank(W)}")
print(f"\nWhat this means for the layer:")
print(f"  col(W): {r}D subspace of R^{m} -> only {r} of {m} output directions are reachable")
print(f"  null(W): {n-r}D subspace of R^{n} -> {n-r} input directions are ignored")
print(f"  row(W): {r}D subspace of R^{n} -> only {r} input directions matter")
print(f"\n  Dead output dimensions: {m-r}")
print(f"  Ignored input dimensions: {n-r}")

# Demonstrate: two inputs differing only in null(W) produce the same output
N_W = null_space(W)
x = np.random.randn(n)
z = N_W @ np.random.randn(N_W.shape[1]) * 5  # large null-space perturbation
print(f"\n  x and x+z differ by a null-space vector:")
print(f"  ||z||  = {np.linalg.norm(z):.4f}  (large perturbation)")
print(f"  W*x    = {W @ x}")
print(f"  W*(x+z)= {W @ (x + z)}")
print(f"  Same output? {np.allclose(W @ x, W @ (x + z))}  (null space is invisible)")

8. Subspace Operations

Sum of subspaces

$$W_1 + W_2 = \{\mathbf{w}_1 + \mathbf{w}_2 : \mathbf{w}_1 \in W_1,\; \mathbf{w}_2 \in W_2\}$$

The sum is the smallest subspace containing both $W_1$ and $W_2$.

Dimension formula: $\dim(W_1 + W_2) = \dim(W_1) + \dim(W_2) - \dim(W_1 \cap W_2)$

Intersection of subspaces

$$W_1 \cap W_2 = \{\mathbf{v} : \mathbf{v} \in W_1 \text{ and } \mathbf{v} \in W_2\}$$

Always a subspace — the largest subspace contained in both. Note: $W_1 \cup W_2$ is generally not a subspace.

Direct sum

$V = W_1 \oplus W_2$ if $W_1 + W_2 = V$ and $W_1 \cap W_2 = \{\mathbf{0}\}$.

Every $\mathbf{v} \in V$ then has a unique decomposition $\mathbf{v} = \mathbf{w}_1 + \mathbf{w}_2$ with $\mathbf{w}_1 \in W_1$ and $\mathbf{w}_2 \in W_2$.

Orthogonal projection

For a subspace $W$ with orthonormal basis $Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k]$:

$$\text{Proj}_W(\mathbf{v}) = QQ^T\mathbf{v} = \sum_{i=1}^k \langle \mathbf{v}, \mathbf{q}_i \rangle\, \mathbf{q}_i$$

The projection matrix $P = QQ^T$ satisfies $P^2 = P$ (idempotent) and $P^T = P$ (symmetric).

Code cell 47

# === 8.1 Sum and intersection of subspaces ===

# W1 = span{(1,0,0), (0,1,0)} = xy-plane
# W2 = span{(0,1,0), (0,0,1)} = yz-plane
# W1 + W2 = R^3 (every vector is reachable)
# W1 ∩ W2 = span{(0,1,0)} = y-axis (shared direction)

B1 = np.array([[1,0,0],[0,1,0]], dtype=float).T
B2 = np.array([[0,1,0],[0,0,1]], dtype=float).T

# Sum: combine all basis vectors and find rank
B_sum = np.column_stack([B1, B2])
dim_sum = np.linalg.matrix_rank(B_sum)
dim_W1 = np.linalg.matrix_rank(B1)
dim_W2 = np.linalg.matrix_rank(B2)
dim_intersect = dim_W1 + dim_W2 - dim_sum

print("W1 = xy-plane (dim 2), W2 = yz-plane (dim 2)")
print(f"dim(W1) = {dim_W1}")
print(f"dim(W2) = {dim_W2}")
print(f"dim(W1 + W2) = {dim_sum}")
print(f"dim(W1 ∩ W2) = dim(W1) + dim(W2) - dim(W1+W2) = {dim_W1}+{dim_W2}-{dim_sum} = {dim_intersect}")
print(f"\nDimension formula verified: {dim_sum} = {dim_W1} + {dim_W2} - {dim_intersect}")
print("W1 ∩ W2 = y-axis (the only direction shared by both planes)")

Code cell 48

# === 8.2 Union is NOT a subspace (important counterexample) ===

# W1 = x-axis = span{(1,0)}, W2 = y-axis = span{(0,1)}
v1 = np.array([1.0, 0.0])  # in W1
v2 = np.array([0.0, 1.0])  # in W2
v_sum = v1 + v2             # (1,1) -- not in W1 or W2

print("W1 = x-axis, W2 = y-axis")
print(f"  (1,0) in W1 ∪ W2? Yes")
print(f"  (0,1) in W1 ∪ W2? Yes")
print(f"  (1,0)+(0,1) = {v_sum}")
print(f"  (1,1) in W1? No (not on x-axis)")
print(f"  (1,1) in W2? No (not on y-axis)")
print(f"  (1,1) in W1 ∪ W2? No -> union is NOT closed under addition")
print(f"\n  W1 ∪ W2 is NOT a subspace.")
print(f"  W1 + W2 = R^2 IS a subspace (the correct notion of 'combining' subspaces).")

Code cell 49

# === 8.3 Orthogonal projection onto a subspace ===

# W = span{(1,1,0), (0,1,1)} in R^3
W_basis = np.array([[1,1,0],[0,1,1]], dtype=float).T

# Orthonormalize the basis
Q = gram_schmidt(W_basis.T).T  # columns are orthonormal basis for W
print("Orthonormal basis Q for W:")
print(Q)
print(f"Q^T Q = \n{Q.T @ Q}")

# Projection matrix P = Q Q^T
P = Q @ Q.T
print(f"\nProjection matrix P = QQ^T:")
print(P)
print(f"P^2 = P (idempotent)? {np.allclose(P @ P, P)}")
print(f"P^T = P (symmetric)?  {np.allclose(P.T, P)}")
print(f"rank(P) = {np.linalg.matrix_rank(P)} = dim(W)")

# Project a vector
v = np.array([3.0, 1.0, 2.0])
proj_v = P @ v
residual = v - proj_v

print(f"\nv = {v}")
print(f"Proj_W(v) = {proj_v}")
print(f"Residual v - Proj_W(v) = {residual}")
print(f"Residual ⊥ W? Q^T @ residual = {Q.T @ residual}")
print(f"||v||^2 = ||proj||^2 + ||resid||^2? {np.linalg.norm(v)**2:.4f} = {np.linalg.norm(proj_v)**2:.4f} + {np.linalg.norm(residual)**2:.4f} = {np.linalg.norm(proj_v)**2 + np.linalg.norm(residual)**2:.4f}")

Code cell 50

# === 8.4 Principal angles between subspaces ===

# The principal angles between two subspaces measure how "aligned" they are.
# cos(theta_i) = sigma_i(Q1^T Q2)

np.random.seed(42)

# Two 2D subspaces of R^4
A1 = np.random.randn(4, 2)
A2 = np.random.randn(4, 2)

Q1 = gram_schmidt(A1.T).T  # orthonormal basis, columns
Q2 = gram_schmidt(A2.T).T

cos_angles = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))

print("Two random 2D subspaces of R^4:")
print(f"  cos(principal angles) = {cos_angles}")
print(f"  principal angles (degrees) = {angles_deg}")
print(f"\n  First angle = {angles_deg[0]:.1f}° (closest directions between subspaces)")
print(f"  Second angle = {angles_deg[1]:.1f}° (next pair)")

# Identical subspaces -> angles = 0
cos_same = np.linalg.svd(Q1.T @ Q1, compute_uv=False)
print(f"\n  Same subspace: cos(angles) = {cos_same} (all 1 -> all angles 0°)")

# Orthogonal subspaces: take the orthogonal complement of span(Q1)
# (a fixed subspace like span{e3,e4} is NOT orthogonal to a random Q1)
Q_orth = null_space(Q1.T)  # columns span the 2D complement of col(Q1) in R^4
cos_orth = np.linalg.svd(Q1.T @ Q_orth, compute_uv=False)
angles_orth = np.degrees(np.arccos(np.clip(cos_orth, -1, 1)))
print(f"  Orthogonal complement: angles = {angles_orth} degrees (all 90)")

9. Inner Product Spaces and Orthogonality

An inner product on VV is a function ,:V×VR\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R} satisfying:

  1. Symmetry: u,v=v,u\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle
  2. Linearity: αu+βw,v=αu,v+βw,v\langle \alpha\mathbf{u} + \beta\mathbf{w}, \mathbf{v} \rangle = \alpha\langle \mathbf{u}, \mathbf{v} \rangle + \beta\langle \mathbf{w}, \mathbf{v} \rangle
  3. Positive definiteness: v,v0\langle \mathbf{v}, \mathbf{v} \rangle \geq 0, with equality iff v=0\mathbf{v} = \mathbf{0}

Induced norm: v=v,v\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}. Cauchy-Schwarz: u,vuv|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \|\mathbf{v}\|.

Standard inner products

SpaceInner productInduced norm
Rn\mathbb{R}^nu,v=uTv\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T\mathbf{v}2\ell^2 norm
Rm×n\mathbb{R}^{m \times n}A,B=tr(ATB)\langle A, B \rangle = \text{tr}(A^TB)Frobenius norm
C([a,b])C([a,b])f,g=abf(t)g(t)dt\langle f, g \rangle = \int_a^b f(t)g(t)\,dtL2L^2 norm

Orthogonality and Gram-Schmidt

  • $\mathbf{u} \perp \mathbf{v}$ if $\langle \mathbf{u}, \mathbf{v} \rangle = 0$
  • Pythagorean theorem: $\mathbf{u} \perp \mathbf{v} \implies \|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$
  • Orthogonal sets of nonzero vectors (in particular, orthonormal sets) are linearly independent
  • Gram-Schmidt converts any linearly independent set into an orthonormal basis for the same span
  • The orthogonal complement $W^\perp = \{\mathbf{v} : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \;\forall\, \mathbf{w} \in W\}$
  • $V = W \oplus W^\perp$ and $\dim(W) + \dim(W^\perp) = \dim(V)$

Code cell 52

# === 9.1 Inner product properties and Cauchy-Schwarz ===

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])
w = np.array([0.0, 1.0, -1.0])
alpha, beta = 2.0, -0.5

# Verify inner product axioms for standard dot product
print("Inner product axioms (standard dot product on R^3):")
print(f"  Symmetry:  <u,v> = {np.dot(u,v)}, <v,u> = {np.dot(v,u)}, equal? {np.isclose(np.dot(u,v), np.dot(v,u))}")
print(f"  Linearity: <au+bw,v> = {np.dot(alpha*u+beta*w, v):.4f}")
print(f"             a<u,v>+b<w,v> = {alpha*np.dot(u,v)+beta*np.dot(w,v):.4f}")
print(f"             equal? {np.isclose(np.dot(alpha*u+beta*w, v), alpha*np.dot(u,v)+beta*np.dot(w,v))}")
print(f"  Pos. def.: <u,u> = {np.dot(u,u):.4f} >= 0")
print(f"             <0,0> = {np.dot(np.zeros(3), np.zeros(3)):.4f} = 0")

# Cauchy-Schwarz
lhs = abs(np.dot(u, v))
rhs = np.linalg.norm(u) * np.linalg.norm(v)
print(f"\nCauchy-Schwarz: |<u,v>| = {lhs:.4f} <= ||u|| ||v|| = {rhs:.4f}? {lhs <= rhs + 1e-12}")

# Frobenius inner product for matrices
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
fro_ip = np.trace(A.T @ B)
print(f"\nFrobenius inner product: <A,B> = tr(A^T B) = {fro_ip}")
print(f"Manual sum: {np.sum(A * B)}")
print(f"||A||_F = sqrt(<A,A>) = {np.sqrt(np.trace(A.T @ A)):.4f} = {np.linalg.norm(A, 'fro'):.4f}")

Code cell 53

# === 9.2 Gram-Schmidt orthogonalisation: step by step ===

vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0]
], dtype=float)

print("Input vectors (rows):")
for i, v in enumerate(vectors):
    print(f"  v{i+1} = {v}")

Q = gram_schmidt(vectors)

print("\nGram-Schmidt output (orthonormal):")
for i, q in enumerate(Q):
    print(f"  q{i+1} = {q}")

print("\nVerification:")
print(f"  Q Q^T (should be I):\n{Q @ Q.T}")
print(f"  Orthonormal? {np.allclose(Q @ Q.T, np.eye(len(Q)))}")

# Same span check: rank([v1..vk]) == rank([q1..qk])
V_mat = vectors.T
Q_mat = Q.T
print(f"\n  rank(original vectors) = {np.linalg.matrix_rank(V_mat)}")
print(f"  rank(orthonormal basis) = {np.linalg.matrix_rank(Q_mat)}")
print(f"  Same span? {np.linalg.matrix_rank(V_mat) == np.linalg.matrix_rank(Q_mat)}")

# Connection to QR decomposition
A_qr = vectors.T  # columns = original vectors
Q_qr, R_qr = np.linalg.qr(A_qr)
print(f"\n  QR decomposition: A = QR")
print(f"  Q (orthonormal columns):\n{Q_qr}")
print(f"  R (upper triangular):\n{R_qr}")
print(f"  A = QR? {np.allclose(A_qr, Q_qr @ R_qr)}")

Code cell 54

# === 9.3 Orthogonal complement and V = W ⊕ W⊥ ===

# W = null(A) for some A. Then W⊥ = row(A).
# This is exactly the fundamental subspace decomposition.

A = np.array([
    [1.0, 2.0, 1.0],
    [0.0, 1.0, 1.0]
])

W_basis = null_space(A)         # null(A) = W
W_perp_basis = row_space(A).T   # row(A) = W⊥

dim_W = W_basis.shape[1] if W_basis.size > 0 else 0
dim_Wp = W_perp_basis.shape[1] if W_perp_basis.size > 0 else 0

print(f"A ({A.shape[0]}x{A.shape[1]}), rank = {np.linalg.matrix_rank(A)}")
print(f"W = null(A): dim = {dim_W}")
print(f"W⊥ = row(A): dim = {dim_Wp}")
print(f"dim(W) + dim(W⊥) = {dim_W + dim_Wp} = n = {A.shape[1]}")

# Verify orthogonality
if dim_W > 0 and dim_Wp > 0:
    cross = W_perp_basis.T @ W_basis
    print(f"\nW⊥^T @ W =\n{cross}")
    print(f"All zero (orthogonal)? {np.allclose(cross, 0)}")

# Decompose a vector
v = np.array([3.0, 1.0, 2.0])
# Project onto W and W⊥
P_W = W_basis @ np.linalg.pinv(W_basis) if dim_W > 0 else np.zeros((3,3))
P_Wp = W_perp_basis @ np.linalg.pinv(W_perp_basis) if dim_Wp > 0 else np.zeros((3,3))
v_W = P_W @ v
v_Wp = P_Wp @ v

print(f"\nv = {v}")
print(f"W-component:  {v_W}")
print(f"W⊥-component: {v_Wp}")
print(f"Sum = v? {np.allclose(v_W + v_Wp, v)}")
print(f"Orthogonal? <v_W, v_Wp> = {np.dot(v_W, v_Wp):.10f}")

10. Affine Subspaces and Quotient Spaces

Affine subspaces

An affine subspace is a translation of a linear subspace: $W + \mathbf{v}_0 = \{\mathbf{w} + \mathbf{v}_0 : \mathbf{w} \in W\}$.

  • Does not pass through the origin (unless $\mathbf{v}_0 \in W$).
  • The solution set of $A\mathbf{x} = \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is an affine subspace: $\mathbf{x}_p + \text{null}(A)$.
  • Same dimension as the underlying $W$, but NOT a subspace of $V$.

Affine and convex combinations

  • Affine combination: $\sum_i \alpha_i \mathbf{v}_i$ where $\sum_i \alpha_i = 1$ (coefficients sum to 1)
  • Convex combination: an affine combination with $\alpha_i \geq 0$ (stays inside the convex hull)
  • AI: interpolation between embeddings $\alpha\mathbf{u} + (1-\alpha)\mathbf{v}$ is a convex combination

Quotient spaces (intuition)

$V/W$ "forgets" the $W$-directions: two vectors are equivalent if they differ by an element of $W$.

$[\mathbf{v}] = \mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}$

$\dim(V/W) = \dim(V) - \dim(W)$. In normalisation layers, mean subtraction projects onto a quotient: it identifies vectors that differ only along the all-ones direction (a coset sketch follows code cell 57 below).

Code cell 56

# === 10.1 Affine subspace: solution set of Ax = b ===

A = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 2.0]
])
b = np.array([6.0, 4.0])

# Particular solution
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

# Null space
N = null_space(A)

print("System Ax = b:")
print(f"A =\n{A}")
print(f"b = {b}")
print(f"\nParticular solution x_p = {x_p}")
print(f"Verify: A*x_p = {A @ x_p}")

print(f"\nNull space of A (dim = {N.shape[1]}):")
print(f"  Basis:\n{N}")

# The full solution set is x_p + null(A)
print(f"\nSolution set = x_p + null(A):")
for t in [-2, -1, 0, 1, 2]:
    x = x_p + t * N[:, 0]
    print(f"  t={t:>2}: x = {x},  A*x = {A @ x}")

print(f"\nAll solutions differ by null-space vectors.")
print(f"Solution set is an affine subspace, NOT a linear subspace (0 not in it).")
print(f"A*0 = {A @ np.zeros(3)} != b = {b}")

Code cell 57

# === 10.2 Convex combinations and embedding interpolation ===

# Interpolation between two embedding vectors
e_cat = np.array([0.8, 0.3, -0.1, 0.5])
e_dog = np.array([0.7, 0.4, -0.2, 0.6])

print("Embedding interpolation (convex combination):")
print(f"  e_cat = {e_cat}")
print(f"  e_dog = {e_dog}")
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    interp = alpha * e_cat + (1 - alpha) * e_dog
    cos_cat = cosine_similarity(interp, e_cat)
    cos_dog = cosine_similarity(interp, e_dog)
    print(f"  alpha={alpha:.2f}: {interp}  cos(cat)={cos_cat:.3f}  cos(dog)={cos_dog:.3f}")

# Quotient space intuition: mean subtraction
print(f"\n--- Quotient space: mean subtraction ---")
x = np.array([5.0, 3.0, 7.0, 1.0])
mean = np.mean(x)
x_centered = x - mean
print(f"x = {x}, mean = {mean}")
print(f"x - mean = {x_centered}, sum = {x_centered.sum():.10f}")
print("Mean subtraction projects onto the hyperplane sum(x_i)=0.")
print("This is the quotient V / span{(1,1,...,1)}: forget the 'all-ones' direction.")

11. Subspaces in Functional Analysis

Infinite-dimensional spaces

In finite dimensions, every subspace is automatically closed. In infinite dimensions, there exist subspaces that are not closed — dense proper subspaces whose limits escape the subspace (e.g. polynomials inside $L^2([a,b])$: every $L^2$ function on a bounded interval is a limit of polynomials, but not every $L^2$ function is a polynomial). A finite-grid sketch follows.
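This is a caricature of the density claim on a finite grid (an added sketch; the grid, the target $|t|$, and Chebyshev least-squares fitting are illustrative choices, not a proof):

# === Sketch (added): polynomials are dense, but the limit escapes ===
import numpy as np

t = np.linspace(-1.0, 1.0, 2001)
dt = t[1] - t[0]
f = np.abs(t)                                  # |t| is not a polynomial

for deg in [1, 2, 4, 8, 16, 32]:
    c = np.polynomial.chebyshev.chebfit(t, f, deg)     # least-squares polynomial fit
    p = np.polynomial.chebyshev.chebval(t, c)
    err = np.sqrt(np.sum((f - p) ** 2) * dt)           # discretised L^2 distance
    print(f"  degree {deg:>2}: || |t| - p ||_L2 ~ {err:.6f}")

print("The error keeps shrinking, yet the limit |t| lies outside the")
print("polynomial subspace: dense, but not closed.")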

Function spaces relevant to AI

| Space | Description | AI application |
| --- | --- | --- |
| $L^2(\mathbb{R}^d)$ | Square-integrable functions | Kernel methods, GP priors |
| Sobolev $W^{k,p}$ | Functions with $k$ weak derivatives in $L^p$ | PINNs, variational problems |
| RKHS $\mathcal{H}_k$ | Reproducing kernel Hilbert space | SVMs, kernel regression |

Krylov subspaces

$\mathcal{K}_k(A, \mathbf{b}) = \text{span}\{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^{k-1}\mathbf{b}\}$

A nested sequence of expanding subspaces. Iterative solvers (CG, GMRES, Lanczos) find the best solution within $\mathcal{K}_k$ and expand $k$ until convergence. Cost: $k$ matrix-vector products — far cheaper than direct methods for sparse systems.

Neural networks and function spaces

A neural network architecture defines a subset $\mathcal{F}_\theta \subset L^2$ — generally not a subspace (the sum of two networks is not, in general, representable by a single network of the same architecture). But the tangent space at $\theta$ is a subspace: the Neural Tangent Kernel (NTK) regime corresponds to learning in this tangent subspace, as the sketch below illustrates.
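A toy version of the tangent-space claim (an added sketch: one tanh unit with three parameters, derivatives by finite differences, not a real NTK computation):

# === Sketch (added): the tangent space of a tiny network is (affine) linear ===
import numpy as np

def f(x, theta):
    w, b, a = theta
    return a * np.tanh(w * x + b)

theta0 = np.array([1.0, 0.2, 0.7])
x = np.linspace(-2.0, 2.0, 9)

# Jacobian df(x)/dtheta at theta0 via central finite differences
eps = 1e-5
J = np.stack([(f(x, theta0 + eps * np.eye(3)[i]) -
               f(x, theta0 - eps * np.eye(3)[i])) / (2 * eps)
              for i in range(3)], axis=1)     # shape (len(x), 3)

d = np.array([0.03, -0.02, 0.01])             # small parameter step
lin  = f(x, theta0) + J @ d                   # tangent-space (linearised) model
full = f(x, theta0 + d)                       # actual network

print(f"Tangent space dimension <= {J.shape[1]} (one direction per parameter)")
print(f"max |linearised - exact| for a small step: {np.max(np.abs(lin - full)):.2e}")
print("Linearised (NTK-regime) training moves in span of the columns of J,")
print("a subspace of function space, even though the network itself is nonlinear.")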

Code cell 59

# === 11.1 Krylov subspaces: expanding nested subspaces ===

np.random.seed(42)
n = 6
A = np.random.randn(n, n)
A = A + A.T  # make symmetric for cleaner behaviour
b = np.random.randn(n)

print(f"Krylov subspaces K_k(A, b) for {n}x{n} symmetric A:")
print(f"  K_k = span{{b, Ab, A^2 b, ..., A^(k-1) b}}\n")

for k in range(1, n + 1):
    # Build Krylov basis: [b, Ab, A^2 b, ..., A^{k-1} b]
    K = np.zeros((n, k))
    K[:, 0] = b
    for j in range(1, k):
        K[:, j] = A @ K[:, j-1]
    dim_k = np.linalg.matrix_rank(K)
    print(f"  K_{k}: {k} vectors, effective dim = {dim_k}", end="")
    if dim_k < k:
        print(f"  (Krylov sequence stalled — rank stabilised)")
    else:
        print()

print(f"\n  The Krylov subspace grows until it captures the full relevant space.")
print(f"  CG/GMRES solve Ax=b by searching in K_k; converge when K_k is rich enough.")

Code cell 60

# === 11.2 RKHS intuition: kernel defines a subspace of functions ===

# A kernel k(x, x') defines an RKHS H_k.
# Given data {x_1, ..., x_n}, the representer theorem says the optimal
# predictor lies in span{k(x_1, .), k(x_2, .), ..., k(x_n, .)} — a
# finite-dimensional subspace of the infinite-dimensional H_k.

# RBF kernel: k(x, x') = exp(-||x-x'||^2 / (2*sigma^2))
def rbf_kernel(X, sigma=1.0):
    sq_dists = np.sum(X**2, axis=1, keepdims=True) - 2*X@X.T + np.sum(X**2, axis=1)
    return np.exp(-sq_dists / (2 * sigma**2))

np.random.seed(0)
X_data = np.sort(np.random.randn(8, 1), axis=0)
K = rbf_kernel(X_data, sigma=1.0)

print(f"Data points: {X_data.ravel()}")
print(f"\nKernel Gram matrix K (8x8):\n{K}")
print(f"rank(K) = {np.linalg.matrix_rank(K)}")
print(f"K is PSD? min eigenvalue = {np.min(np.linalg.eigvalsh(K)):.6f} >= 0")
print(f"\nThe optimal predictor lives in an {np.linalg.matrix_rank(K)}-dimensional")
print(f"subspace of the RKHS, spanned by the kernel functions at data points.")

12. Subspace Methods in Machine Learning

PCA as subspace finding

PCA finds the rank-$r$ subspace of $\mathbb{R}^d$ that captures the most variance (equivalently, the one minimising mean squared reconstruction error):

$W^* = \arg\min_{\dim(W)=r} \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \text{Proj}_W(\mathbf{x}_i)\|^2$

Solution: the top-$r$ eigenvectors of the covariance matrix $C = \frac{1}{n}\sum_i (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$.

LoRA as explicit subspace restriction

LoRA replaces $\Delta W \in \mathbb{R}^{m \times n}$ with $\Delta W = BA^T$ where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{n \times r}$. This restricts updates to a rank-$r$ subspace. The intrinsic dimension of fine-tuning is often far smaller than the full parameter count $p$ (Aghajanyan et al. 2020).

Mechanistic interpretability

  • Attention heads write to $\text{col}(W_O^h)$ and read from $\text{row}(W_Q^h)$, $\text{row}(W_K^h)$ — specific subspaces of $\mathbb{R}^d$.
  • The superposition hypothesis: features outnumber dimensions, so they live in non-orthogonal subspaces with interference.
  • LEACE (concept erasure): project out a subspace to remove a concept from representations.

Code cell 62

# === 12.1 PCA: finding the optimal subspace ===

np.random.seed(42)
n_samples = 200
d = 5

# Data lives mostly in a 2D subspace with noise in other directions
true_directions = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0]
], dtype=float)
true_directions = true_directions / np.linalg.norm(true_directions, axis=1, keepdims=True)

coeffs = np.random.randn(n_samples, 2) * np.array([3.0, 2.0])
noise = np.random.randn(n_samples, d) * 0.1
X = coeffs @ true_directions + noise

# Center the data
mu = X.mean(axis=0)
X_centered = X - mu

# Covariance matrix and its eigendecomposition
C = (X_centered.T @ X_centered) / n_samples
eigvals, eigvecs = np.linalg.eigh(C)
# Sort descending
idx = np.argsort(eigvals)[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]

print(f"Data: {n_samples} points in R^{d}")
print(f"Eigenvalues of covariance matrix: {eigvals}")

# Variance explained
total_var = eigvals.sum()
for r in range(1, d + 1):
    explained = eigvals[:r].sum() / total_var
    print(f"  Top-{r} PCs explain {explained:.4f} of total variance")

print(f"\nThe data lives approximately in a 2D subspace.")
print(f"PCA finds this subspace by maximising captured variance.")

Code cell 63

# === 12.2  Gradient subspace during training + LoRA rank analysis ===
#
# Empirical finding (Gur-Ari et al. 2018, Aghajanyan et al. 2020):
# Gradient updates across training steps lie in a subspace whose dimension
# is much smaller than the full parameter count p.
# LoRA exploits this by restricting delta_W to a rank-r subspace.

np.random.seed(0)
m, n = 64, 64
p = m * n
T = 50

# One FIXED 5-dim subspace shared by every gradient snapshot.
# (Drawing fresh directions inside the loop would give each step its own
# subspace and destroy the low-dimensional structure this cell demonstrates.)
true_dirs = np.random.randn(5, p)
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)

grads = []
for t in range(T):
    # combination of the shared directions + noise, kept small enough that
    # the rank-5 structure dominates the spectrum
    g = (np.random.randn(5) @ true_dirs) + 0.01 * np.random.randn(p)
    grads.append(g)

G = np.array(grads)          # T x p
Gram = G @ G.T / T           # T x T
eigvals_gram = np.sort(np.linalg.eigvalsh(Gram))[::-1]
total_energy = eigvals_gram.sum()
cumulative = np.cumsum(eigvals_gram) / total_energy

print(f'Full parameter count p = {p:,}')
print(f'Gradient snapshots T = {T}')
print(f'\nTop-10 eigenvalues of gradient Gram matrix:')
for i in range(10):
    print(f'  lambda_{i+1:2d} = {eigvals_gram[i]:8.4f}  cumulative variance = {cumulative[i]:.4f}')

dim_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f'\n90% of gradient energy in top-{dim_90} directions (full dim p = {p:,})')
print(f'Compression ratio: {p / dim_90:.1f}x')

print(f'\n--- LoRA rank analysis ---')
print(f'{"r":>4}  {"params":>10}  {"fraction":>10}  {"covers 90% gradient subspace?":>32}')
for r in [1, 2, 4, 5, 8, 16, 32]:
    params_lora = r * (m + n)
    frac = params_lora / p
    covers = 'YES' if r >= dim_90 else 'no'
    print(f'  r={r:>2}        {params_lora:>10,}  {frac:>10.4f}  {covers}')

print(f'\nKey: the intrinsic gradient dimension (approx {dim_90}) sets the minimum')
print(f'LoRA rank needed to capture 90% of the gradient update information.')
print(f'Using r < {dim_90} loses information; using r > {dim_90} wastes capacity.')

Code cell 64

# === 12.3  Concept directions in embedding space ===
#
# The linear representation hypothesis: concepts are encoded as directions
# (1-D subspaces) in the residual stream.
# Classic: gender direction in word2vec;  king - man + woman = queen.

np.random.seed(7)
d = 16

# Fixed semantic directions (random unit vectors in R^d)
gender_dir  = np.random.randn(d)
gender_dir /= np.linalg.norm(gender_dir)
royalty_dir = np.random.randn(d)
royalty_dir -= royalty_dir @ gender_dir * gender_dir   # orthogonalise
royalty_dir /= np.linalg.norm(royalty_dir)

base = np.random.randn(d) * 0.5
noise = lambda: np.random.randn(d) * 0.05

king   = base + 2.0 * gender_dir  + 2.0 * royalty_dir + noise()
queen  = base - 2.0 * gender_dir  + 2.0 * royalty_dir + noise()
man    = base + 2.0 * gender_dir                       + noise()
woman  = base - 2.0 * gender_dir                       + noise()
prince = base + 2.0 * gender_dir  + 1.0 * royalty_dir + noise()

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print('=== Embedding arithmetic: king - man + woman approx queen ===')
analogy = king - man + woman
print(f'  cos(king-man+woman, queen) = {cosine(analogy, queen):+.4f}  (should be near +1)')
print(f'  cos(king-man+woman, king)  = {cosine(analogy, king):+.4f}')
print(f'  cos(king-man+woman, man)   = {cosine(analogy, man):+.4f}')

print('\n=== Recovering the gender direction via difference of means ===')
male_words   = np.array([king, man, prince])
female_words = np.array([queen, woman])
gender_est = male_words.mean(0) - female_words.mean(0)
gender_est /= np.linalg.norm(gender_est)
print(f'  cos(estimated, true gender direction) = {cosine(gender_est, gender_dir):.4f}')

print('\n=== Concept erasure: project out the gender direction ===')
def project_out(v, direction):
    return v - np.dot(v, direction) * direction

king_d  = project_out(king,  gender_est)
queen_d = project_out(queen, gender_est)
man_d   = project_out(man,   gender_est)
woman_d = project_out(woman, gender_est)

print('  After projecting out gender direction:')
print(f'    cos(king_debiased,  queen_debiased) = {cosine(king_d, queen_d):+.4f}')
print(f'    cos(king_debiased,  man_debiased)   = {cosine(king_d, man_d):+.4f}')
print(f'    cos(man_debiased,   woman_debiased) = {cosine(man_d, woman_d):+.4f}')
print('  Gender contrast (man vs woman) is erased; royalty contrast remains.')

print('\n=== Superposition: more concepts than dimensions ===')
np.random.seed(3)
n_features, d_emb = 40, 16
feat = np.random.randn(n_features, d_emb)
feat /= np.linalg.norm(feat, axis=1, keepdims=True)
G_feat = feat @ feat.T
np.fill_diagonal(G_feat, 0)
avg_int = np.abs(G_feat).sum() / (n_features * (n_features - 1))
print(f'  {n_features} features in d={d_emb} space')
print(f'  Average |cos(fi, fj)| = {avg_int:.4f}  (0 = orthogonal, impossible here)')
print(f'  Since {n_features} > {d_emb}, features MUST be non-orthogonal (superposed).')
print(f'  Interference is unavoidable -- this is the superposition hypothesis.')

Code cell 65

# === 12.4  Mechanistic interpretability: OV and QK subspace circuits ===
#
# Attention head h:
#   W_Q, W_K in R^{d x d_k}  -- query/key projections
#   W_V in R^{d x d_v}, W_O in R^{d_v x d}  -- value/output projections
#
# QK circuit: score(i,j) = x_i^T W_Q W_K^T x_j
#   Rank of score matrix <= d_k regardless of sequence length.
# OV circuit: output = softmax(scores) V W_O
#   Head writes to col(W_O^T) -- a d_v-dim subspace of R^d.

np.random.seed(11)
d, d_k, d_v, seq = 32, 8, 8, 6

W_Q = np.random.randn(d, d_k) * 0.1
W_K = np.random.randn(d, d_k) * 0.1
W_V = np.random.randn(d, d_v) * 0.1
W_O = np.random.randn(d_v, d) * 0.1
X   = np.random.randn(seq, d)

QK_mat  = W_Q @ W_K.T                          # d x d
rank_QK = np.linalg.matrix_rank(QK_mat)
scores  = X @ QK_mat @ X.T                      # seq x seq

print(f'=== QK Circuit ===')
print(f'  W_Q W_K^T: shape {QK_mat.shape}, rank = {rank_QK} (bound: d_k = {d_k})')
print(f'  Score matrix S = X (W_Q W_K^T) X^T: shape {scores.shape}, rank = {np.linalg.matrix_rank(scores)}')
print(f'  Key: {d}-dim residual stream compressed to {d_k} dims for attention comparison.')

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

attn_weights = softmax(scores / np.sqrt(d_k))
V_proj       = X @ W_V                          # seq x d_v
head_out     = attn_weights @ V_proj @ W_O       # seq x d

OV_mat  = W_O.T @ W_V.T                         # d x d
rank_OV = np.linalg.matrix_rank(OV_mat)

print(f'\n=== OV Circuit ===')
print(f'  W_O^T W_V^T: shape {OV_mat.shape}, rank = {rank_OV} (bound: d_v = {d_v})')
write_dim = np.linalg.matrix_rank(W_O.T)
read_dim  = np.linalg.matrix_rank(W_V)
print(f'  Write subspace col(W_O^T): dim = {write_dim} of {d}')
print(f'  Read  subspace col(W_V):   dim = {read_dim}  of {d}')
print(f'  Bottleneck: {d} -> {d_k} (QK) -> score -> {d_v} (OV) -> {d} (write)')

print(f'\n=== Principal angles between write subspaces of two independent heads ===')
np.random.seed(22)
W_O2 = np.random.randn(d_v, d) * 0.1
U1 = np.linalg.svd(W_O.T,  full_matrices=False)[0]
U2 = np.linalg.svd(W_O2.T, full_matrices=False)[0]
cos_angles = np.linalg.svd(U1.T @ U2, compute_uv=False)
angles_deg = np.degrees(np.arccos(np.clip(cos_angles, -1, 1)))
overlap = np.mean(cos_angles**2)
print(f'  cos(theta_i): {cos_angles}')
print(f'  angles (deg): {angles_deg}')
print(f'  Mean cos^2 = {overlap:.4f}  (0=orthogonal/independent, 1=identical heads)')
print(f'  Orthogonal write subspaces => heads do not interfere => can prune either safely.')

Code cell 66

# === 12.5  Subspace fine-tuning: LoRA rank vs approximation quality ===
#
# If the true weight update DW_true has intrinsic rank r_true,
# the best rank-r LoRA approximation captures all signal when r >= r_true.
# Eckart-Young: best rank-r approximation uses truncated SVD.

np.random.seed(5)
m, n = 256, 256
true_rank = 8
lora_ranks = [1, 2, 4, 8, 16, 32, 64]

# Construct a true rank-true_rank update + small noise
U_t = np.random.randn(m, true_rank)
V_t = np.random.randn(n, true_rank)
DW_true = U_t @ V_t.T + 0.01 * np.random.randn(m, n)

fro_true = np.linalg.norm(DW_true, 'fro')
U_s, s_s, Vt_s = np.linalg.svd(DW_true, full_matrices=False)

print(f'Weight matrix: R^{{{m} x {n}}},  full params = {m*n:,}')
print(f'True update rank = {true_rank}')
print(f'Singular values of true DW (top 12): {s_s[:12]}')

print(f'\n{"LoRA r":>8}  {"Params":>10}  {"Fraction":>10}  {"Relative error":>16}')
for r in lora_ranks:
    DW_r = (U_s[:, :r] * s_s[:r]) @ Vt_s[:r, :]
    err  = np.linalg.norm(DW_true - DW_r, 'fro') / fro_true
    pars = r * (m + n)
    print(f'  r={r:>3}      {pars:>10,}  {pars/(m*n):>10.4f}  {err:>16.6f}')

print(f'\nAt r = {true_rank} = true rank, error drops to noise level.')
print(f'r < {true_rank}: information lost.  r > {true_rank}: capacity wasted.')

print(f'\n=== Fine-tuning method comparison (all are subspace constraints) ===')
rows = [
    ('LoRA',    'DW = B A^T, B in R^{mxr}, A in R^{nxr}',   'r(m+n)'),
    ('AdaLoRA', 'DW = P Lambda Q^T, prune small sing vals',  'varies by layer'),
    ('IA3',     'multiply activations by learned scalar',     '~1 per layer'),
    ('BitFit',  'fine-tune bias vectors only',               'sum of biases'),
    ('DoRA',    'DW = magnitude * direction  (rank-r dir)',   'r(m+n)+m'),
]
print(f'{"Method":>10}  {"Constraint":>50}  {"Params":>20}')
for name, desc, pars in rows:
    print(f'  {name:>8}  {desc:>50}  {pars:>20}')
print(f'\nAll methods restrict DW to a structured low-dimensional subspace of R^{{mxn}}.')
print(f'The choice of subspace structure (rank, sparsity, magnitude/direction) is the design.')

13. Invariant Subspaces and Spectral Theory

A subspace $W \subseteq V$ is invariant under a linear map $T : V \to V$ if $T(W) \subseteq W$. Invariant subspaces are the key to understanding what a linear map does: every invariant subspace is a channel along which $T$ acts independently.

Eigenspaces are the simplest invariant subspaces. If $T\mathbf{v} = \lambda \mathbf{v}$, then $\operatorname{span}\{\mathbf{v}\}$ is a 1-D invariant subspace.

The Spectral Theorem says every symmetric matrix $A = A^\top$ admits an orthogonal decomposition:

$\mathbb{R}^n = E(\lambda_1) \oplus E(\lambda_2) \oplus \cdots \oplus E(\lambda_k)$

and $A = Q \Lambda Q^\top$ where $Q$ is orthogonal and $\Lambda$ is diagonal.

SVD extends this to non-square matrices: $A = U \Sigma V^\top$ pairs orthonormal directions in the domain with orthonormal directions in the codomain ($A\mathbf{v}_i = \sigma_i \mathbf{u}_i$), giving a complete subspace decomposition of both spaces.

For AI: covariance matrices, Hessians, and Gram matrices are symmetric — the Spectral Theorem always applies. PCA is exactly the Spectral Theorem applied to the covariance matrix. LoRA targets the top invariant directions of the gradient covariance.

Code cell 68

# === 13.1  Invariant subspaces: eigenspaces and the Spectral Theorem ===

np.random.seed(17)

# --- Example 1: rotation has no real 1-D invariant subspace ---
theta = np.pi / 4
R_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
print('=== 2x2 rotation matrix (theta = pi/4) ===')
print(f'  R =\n{R_rot}')
eigvals_rot = np.linalg.eigvals(R_rot)
print(f'  Eigenvalues (complex): {eigvals_rot}')
print(f'  No real eigenvalues => no real 1-D invariant subspace.')
print(f'  The entire plane rotates; no direction is preserved.\n')

# --- Example 2: symmetric matrix => orthogonal invariant eigenspaces ---
A_sym = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
eigvals_s, eigvecs_s = np.linalg.eigh(A_sym)   # guaranteed real for symmetric

print('=== Symmetric 3x3 matrix: spectral decomposition ===')
print(f'  Eigenvalues: {eigvals_s}')
print(f'  Eigenvectors Q:\n{eigvecs_s}\n')

for i in range(3):
    v  = eigvecs_s[:, i]
    Av = A_sym @ v
    ok = np.allclose(Av, eigvals_s[i] * v)
    print(f'  E(lambda={eigvals_s[i]:.4f}): A v = lambda v?  {ok}  (1-D invariant subspace)')

print(f'\n  Q^T Q (should be I):\n{eigvecs_s.T @ eigvecs_s}\n')

# Spectral decomposition A = Q Lambda Q^T
Lam = np.diag(eigvals_s)
A_recon = eigvecs_s @ Lam @ eigvecs_s.T
print(f'  A = Q Lambda Q^T  reconstruction error: {np.max(np.abs(A_sym - A_recon)):.2e}')

# As sum of rank-1 projections
A_sum = sum(eigvals_s[i] * np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3))
print(f'  A = sum lambda_i q_i q_i^T  error: {np.max(np.abs(A_sym - A_sum)):.2e}')

# === 13.2  R^n = E(lambda1) + E(lambda2) + ... ===
print(f'\n=== Spectral Theorem: R^3 = E(l1) oplus E(l2) oplus E(l3) ===')
projs = [np.outer(eigvecs_s[:, i], eigvecs_s[:, i]) for i in range(3)]
print(f'  P1+P2+P3 = I?  error: {np.max(np.abs(sum(projs) - np.eye(3))):.2e}')

x = np.array([1.0, 2.0, -1.0])
Ax_spec = sum(eigvals_s[i] * projs[i] @ x for i in range(3))
print(f'  Ax via eigenspace decomposition: {Ax_spec}')
print(f'  A @ x directly:                  {A_sym @ x}')
print(f'  Match: {np.allclose(Ax_spec, A_sym @ x)}')
print(f'\n  A acts on each eigenspace independently -- scaling by lambda_i.')

Code cell 69

# === 13.3  SVD: complete subspace decomposition of any linear map ===
#
# A = U Sigma V^T:
#   Domain:   R^n = row(A) [v1..vr] (+) null(A) [v_{r+1}..v_n]
#   Codomain: R^m = col(A) [u1..ur] (+) null(A^T) [u_{r+1}..u_m]
#   A maps v_i -> sigma_i u_i (bijection between row(A) and col(A))
#   A maps null(A) -> 0

A_svd = np.array([
    [3.0, 1.0, 0.0, 2.0],
    [1.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 1.0]
])  # 3 x 4

U, s, Vt = np.linalg.svd(A_svd, full_matrices=True)
V = Vt.T
r = np.linalg.matrix_rank(A_svd)
m_s, n_s = A_svd.shape

print(f'=== SVD of A in R^{{{m_s} x {n_s}}} ===')
print(f'  rank = {r},  singular values = {s}')

print(f'\n=== Domain R^{n_s}: row(A) oplus null(A) ===')
print(f'  row(A)  = span{{v1,...,v{r}}}    dim = {r}')
print(f'  null(A) = span{{v{r+1},...,v{n_s}}} dim = {n_s - r}')

print(f'\n=== Codomain R^{m_s}: col(A) oplus null(A^T) ===')
print(f'  col(A)    = span{{u1,...,u{r}}}    dim = {r}')
print(f'  null(A^T) = span{{u{r+1},...,u{m_s}}} dim = {m_s - r}')

print(f'\n=== A is a bijection: row(A) -> col(A) ===')
for i in range(r):
    vi   = V[:, i]
    Avi  = A_svd @ vi
    s_ui = s[i] * U[:, i]
    print(f'  A v{i+1} = sigma{i+1} u{i+1}?  {np.allclose(Avi, s_ui)}   sigma{i+1} = {s[i]:.4f}')

print(f'\n=== A maps null(A) to 0 ===')
for i in range(r, n_s):
    print(f'  ||A v{i+1}|| = {np.linalg.norm(A_svd @ V[:, i]):.2e}')

print(f'\n=== Low-rank approximation (Eckart-Young) ===')
print(f'{"rank k":>8}  {"Frobenius error":>16}  {"Variance captured":>20}')
for k in range(r + 1):
    if k == 0:
        err = np.linalg.norm(A_svd, 'fro')
        var = 0.0
    else:
        Ak  = sum(s[i] * np.outer(U[:, i], V[:, i]) for i in range(k))
        err = np.linalg.norm(A_svd - Ak, 'fro')
        var = 1.0 - err**2 / np.sum(s**2)
    print(f'  k={k}       {err:>16.4f}  {var:>20.4f}')

Code cell 70

# === 13.4  Schur decomposition and nested invariant subspaces ===
#
# Every square A admits  A = Q T Q^*  where Q orthogonal, T upper triangular.
# First k columns of Q span a k-dim invariant subspace of A.
# For symmetric A, T is diagonal (= spectral theorem).

from scipy.linalg import schur

print('=== Schur decomposition of a non-symmetric matrix ===')
A_sch = np.array([
    [4.0, 3.0, 2.0],
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 3.0]
])

T_s, Q_s = schur(A_sch)
print(f'A =\n{A_sch}')
print(f'\nSchur T (upper triangular):\n{T_s}')
print(f'\nSchur Q (orthonormal columns):\n{Q_s}')
print(f'\nReconstruction A = Q T Q^T error: {np.max(np.abs(A_sch - Q_s @ T_s @ Q_s.T)):.2e}')

eigvals_d = np.sort(np.linalg.eigvals(A_sch).real)
eigvals_t = np.sort(np.diag(T_s).real)
print(f'\nEigenvalues from eigvals: {eigvals_d}')
print(f'Eigenvalues from diag(T): {eigvals_t}')
print(f'Match: {np.allclose(eigvals_d, eigvals_t)}')

print(f'\n=== Nested invariant subspaces from Schur vectors ===')
for k in range(1, 4):
    Qk   = Q_s[:, :k]
    AQk  = A_sch @ Qk
    proj = Qk @ Qk.T @ AQk
    is_inv = np.allclose(AQk, proj)
    print(f'  span{{q1,...,q{k}}} invariant under A?  {is_inv}')

print(f'\n=== Symmetric case: Schur = spectral (T diagonal) ===')
A_sym3 = np.array([[5.0, 2.0], [2.0, 3.0]])
T_sym, Q_sym = schur(A_sym3)
print(f'  T for symmetric A:\n{T_sym}')
print(f'  T is diagonal: {np.allclose(T_sym, np.diag(np.diag(T_sym)))}')

print(f'\n=== Connection to AI ===')
print(f'  Hessian H of training loss is symmetric => spectral theorem applies.')
print(f'  Top-k Schur/eigenspace of H = high-curvature subspace for 2nd-order optimisers.')
print(f'  K-FAC / Shampoo / SOAP approximate H^{{-1}} g using structured invariant subspaces.')
print(f'  Natural gradient = gradient in metric defined by the Fisher (symmetric PSD).')

14. Common Mistakes

| Mistake | Why it's wrong | Fix |
| --- | --- | --- |
| "Union of two subspaces is a subspace" | $W_1 \cup W_2$ fails closure under addition: $(1,0)+(0,1)=(1,1)$ is on neither coordinate axis. | Use the sum $W_1 + W_2$, which IS a subspace. |
| "An affine subspace is a subspace" | $\{x : Ax=b\}$ with $b \neq 0$ does not contain $\mathbf{0}$. | An affine subspace is a coset of a linear subspace — parallel but offset. A linear subspace must contain $\mathbf{0}$. |
| "Closure under addition alone makes a subspace" | $\mathbb{Z}^n$ is closed under addition but not scalar multiplication: $\pi \cdot (1,0)=(\pi,0) \notin \mathbb{Z}^n$. | Check all three conditions: $\mathbf{0} \in W$, closure under $+$, closure under scalar multiplication. |
| "Span unchanged $\Rightarrow$ the new vector is zero" | $\mathbf{v}_{k+1}$ could be any linear combination of the others — dependent, not zero. | Span equality means $\mathbf{v}_{k+1}$ is redundant (linearly dependent), not necessarily $\mathbf{0}$. |
| "$\dim(W_1 \cap W_2) = \dim(W_1)+\dim(W_2)-\dim(V)$" | The intersection dimension depends on how the subspaces are oriented, not just on their sizes. | The correct identity is for the sum: $\dim(W_1+W_2) = \dim(W_1)+\dim(W_2)-\dim(W_1 \cap W_2)$. |
| "The complement of a subspace is a subspace" | $V \setminus W$ does not contain $\mathbf{0}$ and is not closed under addition. | The orthogonal complement $W^\perp$ IS a subspace. Always say $W^\perp$, not "complement". |
| "All bases have the same vectors" | Infinitely many valid bases exist for $\mathbb{R}^n$. | All bases have the same count ($= \dim(V)$); the actual vectors vary freely. |
| "$\text{null}(AB) = \text{null}(B)$" | $\text{null}(B) \subseteq \text{null}(AB)$ always; equality can fail if $A$ is not injective on $\text{col}(B)$. | $\text{null}(AB) \supseteq \text{null}(B)$; equality iff $\text{null}(A) \cap \text{col}(B) = \{\mathbf{0}\}$. |
| "Same dimension $\Rightarrow$ same subspace" | The $x$-axis and $y$-axis in $\mathbb{R}^2$ both have $\dim = 1$ but are different subspaces. | Equal dimension means isomorphic; actual equality requires containment in both directions. |
| "Gram-Schmidt works on any set of vectors" | If a vector is already in the span of the previous ones, $\mathbf{u}_j = \mathbf{0}$ and cannot be normalised. | First extract a linearly independent subset; discard any vector that produces $\mathbf{u}_j \approx \mathbf{0}$. |

16. Why This Matters for AI (2026 Perspective)

| Aspect | Impact |
| --- | --- |
| Residual stream as shared vector space | The residual stream $\mathbf{x} \in \mathbb{R}^d$ is the central shared vector space of a transformer. Every attention head and MLP layer adds vectors to this space. Understanding subspace decompositions of $\mathbb{R}^d$ IS understanding how transformers route and store information. |
| LoRA and gradient subspaces | LoRA's premise: weight updates $\Delta W$ lie in a low-dimensional subspace. Aghajanyan et al. (2020) showed the fine-tuning gradient subspace has dimension $\ll p$. Choosing LoRA rank $r$ = choosing the subspace dimension to search in. |
| Superposition and polysemanticity | LLMs represent more features than dimensions by storing them as non-orthogonal directions. Features share subspace dimensions, causing polysemanticity. This is the superposition hypothesis (Elhage et al. 2022). |
| MLA and KV subspace compression | DeepSeek-V3's Multi-head Latent Attention compresses KV projections to a rank-$r$ subspace ($r \ll d$), enabling a 5.75$\times$ memory reduction. Subspace dimensionality IS the design variable. |
| Mechanistic interpretability | Every circuit in mechanistic interpretability is a composition of subspace operations: heads read from and write to specific subspaces of $\mathbb{R}^d$ via their $W_Q, W_K, W_V, W_O$ matrices. |
| Representation collapse prevention | VICReg, Barlow Twins, BYOL prevent collapse by ensuring embeddings span a high-dimensional subspace. Collapse = subspace dimension $\to 0$; these losses maximise it (see the sketch after this table). |
| Generalisation and implicit bias | Gradient descent on overparameterised models converges to minimum-norm solutions supported in a low-dimensional subspace of parameter space. Generalisation is explained by subspace geometry. |
| Orthogonality and head diversity | Attention heads writing to orthogonal subspaces do not interfere. Head redundancy = subspace overlap. Orthogonality = independence = safe pruning. |
| Second-order optimisation | The Hessian concentrates spectral mass in a low-dimensional high-curvature subspace. K-FAC, Shampoo, and SOAP exploit this. Natural gradient aligns updates with the curved subspace. |
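A toy numeric for the representation-collapse row (an added sketch; "effective rank" here is the entropy-based measure exp(H) of the normalised singular-value spectrum, one common choice rather than the metric any particular method prescribes):

# === Sketch (added): detecting representation collapse via effective rank ===
import numpy as np

def effective_rank(Z):
    s = np.linalg.svd(Z - Z.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-np.sum(p * np.log(p))))   # exp of spectral entropy

rng = np.random.default_rng(0)
n, d = 512, 64
Z_healthy = rng.standard_normal((n, d))                     # spans ~d directions
u = rng.standard_normal(d)
Z_collapsed = rng.standard_normal((n, 1)) @ u[None, :]      # all rows along one direction
Z_collapsed += 0.01 * rng.standard_normal((n, d))           # tiny jitter

print(f"effective rank, healthy embeddings:   {effective_rank(Z_healthy):6.1f} / {d}")
print(f"effective rank, collapsed embeddings: {effective_rank(Z_collapsed):6.1f} / {d}")
print("Anti-collapse losses push the spanned subspace dimension back up.")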

Conceptual Bridge

Vector spaces and subspaces are the geometry underlying all of linear algebra. Every concept — span, independence, basis, dimension, rank, orthogonality, projections, eigenvalues — is ultimately a statement about the structure of subspaces.

The abstract framework (eight axioms) is what makes the theory universal: the same theorems that govern arrows in a plane also govern polynomials, matrices, functions, and probability distributions. When you recognise a new mathematical object as a vector space, you immediately inherit centuries of theorems for free.

For AI in 2026, subspaces are the architectural primitives of transformers, the design variables of efficient fine-tuning, the diagnostic language of mechanistic interpretability, and the theoretical foundation of generalisation.

Next: Eigenvalues, Eigenvectors, and Matrix Decompositions — where invariant subspaces of a linear map are revealed through its spectrum. An eigenvector spans a 1-D invariant subspace; the spectral theorem decomposes any symmetric matrix into orthogonal invariant eigenspaces; the SVD extends this to the complete subspace structure of any rectangular matrix.