
Linear Transformations: Appendix C (The Geometry of Linear Maps) through Appendix M (Self-Assessment Checklist)

Appendix C: The Geometry of Linear Maps - A Deep Dive

C.1 How Linear Maps Deform Space

To deeply understand a linear map $T: \mathbb{R}^n \to \mathbb{R}^m$, we track how it deforms geometric objects.

Ellipsoids to ellipsoids. The image of the unit sphere $S^{n-1} = \{\mathbf{x} : \lVert\mathbf{x}\rVert = 1\}$ under a full-rank map $A$ is an ellipsoid whose semi-axes have lengths equal to the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$ of $A$, pointing in the directions of the first $n$ left singular vectors $\mathbf{u}_1, \ldots, \mathbf{u}_n$.

This is the geometric content of the SVD: $A = U\Sigma V^\top$ means:

  1. $V^\top$: rotate the input so the "natural input directions" $\mathbf{v}_i$ align with the coordinate axes.
  2. $\Sigma$: stretch each axis $i$ by $\sigma_i$.
  3. $U$: rotate the output to place the stretched axes along the $\mathbf{u}_i$ directions.

The sphere becomes an ellipsoid. The "shape" of the ellipsoid is completely described by the singular values.
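
A quick numerical check of this picture (a sketch in NumPy; the matrix A is an arbitrary example, not taken from the text): map the unit circle through a $2 \times 2$ matrix, confirm that the extreme radii of the image match the singular values, and that the eigenvalues of the Gram matrix $A^\top A$ are their squares.

```python
import numpy as np

# Arbitrary full-rank example matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Points on the unit circle S^1.
theta = np.linspace(0, 2 * np.pi, 2000)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 2000)

image = A @ circle                       # image of the unit circle
radii = np.linalg.norm(image, axis=0)    # distance of each image point from 0

U, S, Vt = np.linalg.svd(A)
print(radii.max(), S[0])    # longest semi-axis  ~= sigma_1
print(radii.min(), S[1])    # shortest semi-axis ~= sigma_2

# Eigenvalues of the Gram matrix A^T A are the squared singular values.
gram_eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(gram_eigs, S**2))      # True
```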

Volume scaling. The volume of the image of a set $S \subseteq \mathbb{R}^n$ under $A$ is $|\det(A)| \cdot \operatorname{vol}(S)$ (when $A$ is square). More precisely, $|\det(A)| = \prod_i \sigma_i$, the product of all singular values, which equals the volume of the image of the unit cube.

For a rank-deficient map ($\det(A) = 0$), the image has lower dimension, so its $n$-dimensional volume is 0. The map "collapses" $n$-dimensional space onto a lower-dimensional flat object.

Angles. Unless $A$ is orthogonal, linear maps change angles. The angle $\theta$ between $\mathbf{u}$ and $\mathbf{v}$ satisfies:

$$\cos\theta = \frac{\mathbf{u}^\top\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$$

but $\cos\angle(A\mathbf{u}, A\mathbf{v}) = \frac{(A\mathbf{u})^\top(A\mathbf{v})}{\lVert A\mathbf{u}\rVert\,\lVert A\mathbf{v}\rVert} = \frac{\mathbf{u}^\top A^\top A\mathbf{v}}{\lVert A\mathbf{u}\rVert\,\lVert A\mathbf{v}\rVert}$.

The matrix $G = A^\top A$ (the Gram matrix) determines how $A$ distorts inner products. The eigenvalues of $G$ are $\sigma_i^2$ (the squared singular values).

C.2 Interpreting the Four Fundamental Subspaces Geometrically

Given $T: \mathbb{R}^n \to \mathbb{R}^m$ with matrix $A$ and rank $r$:

The row space $\operatorname{row}(A)$: This is the "input directions that survive" - the $r$-dimensional subspace of $\mathbb{R}^n$ that $T$ maps faithfully (injectively) onto the column space. Any $\mathbf{x} \in \operatorname{row}(A)$ is "noticed" by $T$.

The null space $\operatorname{null}(A)$: This is the "input directions that are killed" - the $(n-r)$-dimensional subspace of $\mathbb{R}^n$ that $T$ maps to zero. Any $\mathbf{x} \in \operatorname{null}(A)$ is "invisible" to $T$.

The decomposition $\mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A)$: Every input $\mathbf{x}$ splits uniquely as $\mathbf{x} = \mathbf{x}_r + \mathbf{x}_n$, where $\mathbf{x}_r$ is in the row space (the "signal" part) and $\mathbf{x}_n$ is in the null space (the "noise" invisible to $T$).

The column space $\operatorname{col}(A)$: The $r$-dimensional subspace of $\mathbb{R}^m$ that $T$ can actually reach. Solutions to $A\mathbf{x} = \mathbf{b}$ exist iff $\mathbf{b} \in \operatorname{col}(A)$.

The left null space $\operatorname{null}(A^\top)$: The $(m-r)$-dimensional complement of $\operatorname{col}(A)$ in $\mathbb{R}^m$. Directions in the left null space are unreachable by $T$.

COMPLETE PICTURE OF THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                            R^m (codomain)
  ----------------------------------------------------------

  +---------------------+    T(x) = Ax    +---------------------+
  |    row space        | --------------> |    column space     |
  |   (dim = r)         |   isomorphism   |    (dim = r)        |
  |                     |                 |                     |
  |   -------------     |                 |   -------------     |
  |                     |                 |                     |
  |    null space       | --------------> |    left null space  |
  |   (dim = n-r)       |    maps to 0    |    (dim = m-r)      |
  +---------------------+                 +---------------------+
   (the two blocks are orthogonal          (the two blocks are orthogonal
    complements in R^n)                     complements in R^m)

  Every x = (row space part) + (null space part)
  T sees only the row space part.

========================================================================

For AI (linear systems / least squares): When fitting a model $A\mathbf{w} = \mathbf{b}$ with more constraints than parameters ($m > n$), the system is overdetermined. An exact solution exists only if $\mathbf{b} \in \operatorname{col}(A)$. If not, the least-squares solution minimizes $\lVert A\mathbf{w} - \mathbf{b}\rVert^2$ - finding the projection of $\mathbf{b}$ onto $\operatorname{col}(A)$ and then solving in the row space.
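
A small sketch of this geometry (NumPy; the data are random placeholders): solve an overdetermined system by least squares and verify that $A\hat{\mathbf{w}}$ is exactly the orthogonal projection of $\mathbf{b}$ onto $\operatorname{col}(A)$, with the residual orthogonal to the column space.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # 10 constraints, 3 parameters (m > n)
b = rng.normal(size=10)

# Least-squares solution: minimizes ||Aw - b||^2.
w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonal projection onto col(A), built from the normal equations.
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A @ w_hat, P @ b))          # True: A w_hat is the projection
print(np.allclose(A.T @ (b - A @ w_hat), 0))  # residual is orthogonal to col(A)
```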

C.3 Linear Maps and Information Theory

The rank-nullity theorem has an information-theoretic interpretation.

Rank = information preserved. A linear map of rank $r$ preserves at most $r$ "dimensions" of information from the input. The remaining $n - r$ dimensions are destroyed.

Mutual information. For a Gaussian input $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, I_n)$ observed through unit-variance Gaussian noise, $\mathbf{y} = A\mathbf{x} + \boldsymbol{\eta}$ with $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, I_m)$ (without a noise floor, the mutual information of a deterministic continuous map would be infinite), the mutual information:

$$I(\mathbf{x}; \mathbf{y}) = \frac{1}{2} \sum_{i=1}^r \log(1 + \sigma_i^2)$$

depends only on the singular values - not on the specific directions. The null space of $A$ contributes zero mutual information.

Compression. If we want to compress $\mathbf{x} \in \mathbb{R}^n$ to $\mathbf{z} \in \mathbb{R}^r$ via a linear map $C \in \mathbb{R}^{r \times n}$, then for isotropic input every rank-$r$ orthonormal projection is equally informative under the noisy-observation model above - there are no preferred directions to keep. For structured data with covariance $\Sigma$, the optimal compression is PCA - projecting onto the top $r$ eigenvectors of $\Sigma$.

This is why PCA is the optimal linear compressor under mean-squared error: it maximizes the retained variance (information) for any fixed rank $r$.
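
A sketch of PCA as rank-$r$ linear compression (NumPy; the synthetic anisotropic data are an arbitrary example): project onto the top principal directions of the sample covariance and compare reconstruction error against a random orthonormal projection of the same rank.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with strongly anisotropic covariance.
X = rng.normal(size=(500, 10)) * np.array([5, 3, 2, 1, .5, .3, .2, .1, .05, .01])

r = 3
# Top-r eigenvectors of the sample covariance = top-r right singular vectors.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
C_pca = Vt[:r]                    # optimal rank-r compressor (PCA)

Q, _ = np.linalg.qr(rng.normal(size=(10, r)))
C_rand = Q.T                      # random rank-r orthonormal compressor

def mse(C):
    Xc = X - X.mean(axis=0)
    return np.mean((Xc - Xc @ C.T @ C) ** 2)   # project, reconstruct, compare

print(mse(C_pca) < mse(C_rand))   # True: PCA retains more variance
```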

C.4 Generalization of Linear Maps: Tensors

The concept of a linear map generalizes to multilinear maps and tensors in ways that are directly relevant to deep learning.

Bilinear maps and matrices of bilinear forms. A bilinear form $B: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ can be written as $B(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top M \mathbf{y}$ for a matrix $M \in \mathbb{R}^{n \times m}$. The bilinear form is:

  • Symmetric if $M = M^\top$ (and $n = m$): $B(\mathbf{x}, \mathbf{y}) = B(\mathbf{y}, \mathbf{x})$.
  • Positive definite if $B(\mathbf{x}, \mathbf{x}) > 0$ for $\mathbf{x} \neq \mathbf{0}$: inner products are positive definite symmetric bilinear forms.

Multilinear maps. A $k$-linear map $T: V_1 \times \cdots \times V_k \to W$ is linear in each argument separately. The space of $k$-linear maps on $\mathbb{R}^n$ is the space of tensors of order $k$.

For AI: The multi-head attention score $\mathbf{q}^\top W_{\text{QK}} \mathbf{k}$ is a bilinear form parameterized by $W_{\text{QK}} = W_Q^\top W_K$. Analyzing this bilinear form via the SVD ($W_{\text{QK}} = U\Sigma V^\top$) reveals what "patterns" each attention head is sensitive to: the left singular vectors $U$ are "what queries to look for" and the right singular vectors $V$ are "what keys are being matched against."
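
A sketch of this analysis (NumPy; the dimensions d_model and d_head and the random weights are placeholders standing in for trained parameters): form $W_{\text{QK}} = W_Q^\top W_K$, take its SVD, and check that the bilinear score decomposes into a sum over singular directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head = 64, 16
# Projection weights, stored as (d_head, d_model) so W_QK = W_Q^T W_K.
W_Q = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)

W_QK = W_Q.T @ W_K          # (d_model, d_model) bilinear form; rank <= d_head
U, S, Vt = np.linalg.svd(W_QK)
print((S > 1e-10).sum())    # numerical rank: at most d_head = 16

q, k = rng.normal(size=d_model), rng.normal(size=d_model)
score = q @ W_QK @ k
# Same score via singular directions: sum_i s_i (q . u_i)(v_i . k).
score_svd = sum(S[i] * (q @ U[:, i]) * (Vt[i] @ k) for i in range(len(S)))
print(np.allclose(score, score_svd))   # True
```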



Appendix D: Computational Methods and Numerical Considerations

D.1 Computing the Kernel via Row Reduction

Given $A \in \mathbb{R}^{m \times n}$, finding a basis for $\ker(A)$ requires solving $A\mathbf{x} = \mathbf{0}$.

Algorithm (Gaussian Elimination to RREF):

  1. Apply row operations to reduce $A$ to reduced row echelon form (RREF).
  2. Identify pivot columns (columns with leading 1s in RREF) and free columns (all other columns).
  3. For each free variable, set it to 1 and all other free variables to 0, then solve for the pivot variables.
  4. Each such solution is one basis vector for $\ker(A)$.

Example. $A = \begin{bmatrix} 1 & 2 & -1 & 0 \\ 2 & 4 & -2 & 3 \\ -1 & -2 & 1 & 2 \end{bmatrix}$.

Reduce toward RREF: $R_2 \leftarrow R_2 - 2R_1$, $R_3 \leftarrow R_3 + R_1$:

$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 2 \end{bmatrix}$$

Then $R_3 \leftarrow R_3 - \frac{2}{3} R_2$, $R_2 \leftarrow \frac{1}{3}R_2$:

$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

$R_1 \leftarrow R_1 - 0 \cdot R_2$: nothing to do - this is already the RREF.

Pivot columns: 1 and 4. Free variables: $x_2$ and $x_3$.

Setting $x_2 = 1, x_3 = 0$: $x_4 = 0$ and $x_1 + 2(1) - 1(0) = 0 \Rightarrow x_1 = -2$. Basis vector: $(-2, 1, 0, 0)^\top$.

Setting $x_2 = 0, x_3 = 1$: $x_4 = 0$ and $x_1 + 0 - 1 = 0 \Rightarrow x_1 = 1$. Basis vector: $(1, 0, 1, 0)^\top$.

Null space basis: $\ker(A) = \operatorname{span}\left\{(-2,1,0,0)^\top, (1,0,1,0)^\top\right\}$. Nullity = 2.

Rank-nullity: $n = 4$, rank = 2 (two pivots), nullity = 2. Check: $2 + 2 = 4$. OK
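
The same computation in code (a sketch assuming SymPy and SciPy are available; SymPy row-reduces exactly, while SciPy's SVD-based null_space returns an orthonormal basis for the same subspace):

```python
import numpy as np
import sympy as sp
from scipy.linalg import null_space

A = np.array([[ 1,  2, -1, 0],
              [ 2,  4, -2, 3],
              [-1, -2,  1, 2]])

# Exact kernel basis via RREF (matches the hand computation).
kernel = sp.Matrix(A).nullspace()
print([list(v) for v in kernel])   # [[-2, 1, 0, 0], [1, 0, 1, 0]]

# Orthonormal kernel basis via SVD; spans the same 2-dimensional subspace.
N = null_space(A)
print(N.shape[1])                  # 2 (the nullity)
print(np.allclose(A @ N, 0))       # True
```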

D.2 Numerical Stability of Basis Computations

Computing the null space or column space numerically requires care because floating-point arithmetic can introduce small errors.

The SVD approach (recommended). Instead of row reduction, compute the SVD $A = U\Sigma V^\top$. Then:

  • $\ker(A)$ = span of the columns of $V$ corresponding to zero singular values (or singular values below a threshold $\epsilon$).
  • $\operatorname{col}(A)$ = span of the columns of $U$ corresponding to nonzero singular values.

The SVD-based approach is numerically stable because orthonormal bases ($U$ and $V$) are well-conditioned.

Numerical rank. For floating-point matrices, "zero" singular values appear as small but nonzero values. The numerical rank with threshold $\epsilon$ is:

$$\operatorname{rank}_\epsilon(A) = |\{i : \sigma_i > \epsilon \cdot \sigma_1\}|$$

A common choice is $\epsilon = n \cdot \varepsilon_{\text{machine}}$ (on the order of $10^{-13}$ for double precision at moderate $n$). numpy.linalg.matrix_rank uses a default threshold of this form.

Why this matters: In practice, a matrix with theoretical rank $r$ may appear to have rank $r + 5$ due to measurement noise. The SVD reveals the "intrinsic" rank through the gap in the singular values.
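
A quick demonstration (NumPy; the sizes and noise level are arbitrary): a rank-3 matrix plus tiny noise looks full-rank to the default tolerance, but the singular-value gap exposes the intrinsic rank, and an explicit threshold above the noise floor recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
# Exact rank-3 matrix: product of (20 x 3) and (3 x 15) factors.
A = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))
A_noisy = A + 1e-10 * rng.normal(size=A.shape)

s = np.linalg.svd(A_noisy, compute_uv=False)
print(np.linalg.matrix_rank(A_noisy))            # 15: the noise inflates the rank
print(s[2] / s[3])                               # huge ratio: the spectral gap
print(np.linalg.matrix_rank(A_noisy, tol=1e-6))  # 3: threshold above the noise
```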

D.3 Efficient Change of Basis Computations

Naive approach: Compute $P^{-1}AP$ directly. For $n \times n$ matrices, this costs $O(n^3)$ (two matrix multiplications plus one matrix inversion).

Better approach when $P$ is orthogonal: If $P$ is orthogonal ($P^{-1} = P^\top$), then $P^\top AP$ still costs $O(n^3)$ but with better constants than the general $P^{-1}AP$ (no matrix inversion needed).

Eigendecomposition case: When $A = P\Lambda P^{-1}$, computing $A^k = P\Lambda^k P^{-1}$ requires only $O(n)$ operations to form $\Lambda^k$ (raise each diagonal entry to the $k$-th power), plus two $O(n^2)$ matrix-vector multiplications to apply $P$ and $P^{-1}$ when $A^k$ acts on a vector.
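
A sketch (NumPy/SciPy; the matrix is a random diagonalizable example): compute $\exp(At)$ via eigendecomposition and compare against scipy.linalg.expm.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
t = 0.7

# Diagonalize; eigenvalues of a general real matrix may be complex.
lam, P = np.linalg.eig(A)
expAt_diag = P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)

# Imaginary parts are numerical noise for a real A.
print(np.allclose(expAt_diag.real, expm(A * t)))  # True
```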

D.4 The Rank-Revealing QR Decomposition

Standard QR decomposition $A = QR$ doesn't directly reveal rank. The rank-revealing QR (RRQR) uses column pivoting:

$$AP = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}$$

where $P$ is a permutation matrix and $R_{22}$ is "small" (its Frobenius norm bounds how far $A$ is from rank $r$). The columns of $Q_1$ form a basis for $\operatorname{col}(A)$.

RRQR is preferred over the SVD when only a basis for the column space (not the singular values themselves) is needed, as it is typically a few times faster.
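
A sketch with SciPy's pivoted QR (scipy.linalg.qr with pivoting=True; the test matrix is a random rank-3 example): the diagonal of $R$ drops sharply after position $r$, and the first $r$ columns of $Q$ span $\operatorname{col}(A)$.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(5)
A = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 8))   # rank 3

Q, R, piv = qr(A, pivoting=True)
d = np.abs(np.diag(R))
print(d / d[0])     # entries 0-2 are O(1); entries 3+ are ~1e-16

r = 3
Q1 = Q[:, :r]
# Q1 spans col(A): projecting A onto span(Q1) reproduces A.
print(np.allclose(Q1 @ (Q1.T @ A), A))   # True
```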


Appendix E: Connections to Other Fields

E.1 Linear Maps in Physics

In quantum mechanics, operators on Hilbert spaces are infinite-dimensional linear maps. The Hamiltonian $\hat{H}$, momentum $\hat{p}$, and position $\hat{x}$ are linear operators. The Schrödinger equation $i\hbar\, \partial_t |\psi\rangle = \hat{H}|\psi\rangle$ is a linear ODE on the Hilbert space of quantum states.

The spectral theorem for self-adjoint operators (the quantum generalization of symmetric matrix diagonalization) guarantees that observables have real eigenvalues (the possible measurement outcomes) and that the eigenfunctions form a complete orthonormal basis.

For AI: Transformers share suggestive mathematical parallels with quantum mechanics: both involve inner products of state vectors (attention scores), superposition (linear combinations of basis states), and entanglement-like correlations. The linear algebra of quantum mechanics and of transformers both live in the framework of linear maps on inner product (Hilbert) spaces.

E.2 Linear Maps in Topology: Homology

In algebraic topology, chain complexes are sequences of vector spaces connected by linear maps:

$$\cdots \xrightarrow{\partial_3} C_2 \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0$$

where $\partial_{k-1} \circ \partial_k = 0$ (the boundary of a boundary is zero - exactly the condition $\ker(\partial_{k-1}) \supseteq \operatorname{im}(\partial_k)$). The homology groups $H_k = \ker(\partial_k) / \operatorname{im}(\partial_{k+1})$ measure "holes" in topological spaces.

Persistent homology, used in topological data analysis (TDA), applies this to point cloud data to find features that persist across scales. It's used in ML for analyzing data manifolds, protein structure prediction, and understanding neural network loss landscapes.

E.3 Linear Maps in Signal Processing

The Discrete Fourier Transform (DFT) is a linear map $F: \mathbb{C}^n \to \mathbb{C}^n$ with matrix entries $F_{kj} = e^{-2\pi i jk/n}$. The DFT matrix is unitary up to scaling: $F^* F = nI$, so $F/\sqrt{n}$ is unitary.

Convolution is linear - convolving a signal with a kernel is a linear map - and in the Fourier domain it becomes pointwise multiplication. This is the key to making CNNs efficient: convolution is a structured linear map with shared weights (translation equivariance), and the Fourier transform diagonalizes the convolution operator.

For AI: The fast Fourier transform (FFT) costs $O(n \log n)$ instead of the $O(n^2)$ of a full DFT matrix multiply, by exploiting the structure of the DFT linear map (it factors into $O(\log n)$ sparse matrices). Similarly, FlashAttention speeds up attention by exploiting the structure of the attention linear map to minimize memory bandwidth.
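
A check of the convolution theorem (NumPy; the signals are random placeholders): circular convolution computed directly matches pointwise multiplication in the Fourier domain.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 128
x = rng.normal(size=n)   # signal
h = rng.normal(size=n)   # kernel

# Direct circular convolution: O(n^2).
direct = np.array([sum(h[k] * x[(t - k) % n] for k in range(n))
                   for t in range(n)])

# Fourier domain: convolution becomes pointwise multiplication, O(n log n).
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(direct, via_fft))   # True
```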



Appendix F: The Algebra of Linear Maps - Structural Results

F.1 The Space of Linear Maps is a Vector Space

We noted briefly that $\mathcal{L}(V, W)$ is a vector space. Let's make this precise and compute its dimension.

Operations: For $S, T \in \mathcal{L}(V, W)$ and $c \in \mathbb{F}$:

  • $(S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v})$
  • $(cT)(\mathbf{v}) = c \cdot T(\mathbf{v})$
  • The zero element is the zero map $O(\mathbf{v}) = \mathbf{0}$ for all $\mathbf{v}$.

Dimension. If $\dim(V) = n$ and $\dim(W) = m$, then:

$$\dim(\mathcal{L}(V, W)) = mn$$

Proof: Every linear map $T: V \to W$ is determined by $n$ vectors in $W$ (the images of the basis vectors), each in an $m$-dimensional space. The natural isomorphism is $\mathcal{L}(V, W) \cong W^n \cong \mathbb{R}^{mn}$ (as vector spaces), which corresponds to the identification with $m \times n$ matrices. $\square$

Basis for $\mathcal{L}(\mathbb{R}^n, \mathbb{R}^m)$. The standard basis consists of the $mn$ maps $E_{ij}: \mathbb{R}^n \to \mathbb{R}^m$ defined by $E_{ij}(\mathbf{e}_k) = \delta_{jk}\mathbf{e}_i$. In matrix form, $E_{ij}$ is the matrix with 1 in position $(i,j)$ and 0 elsewhere.

F.2 Composition Gives $\mathcal{L}(V, V)$ an Algebra Structure

When $V = W$, linear maps $T: V \to V$ can be composed. The space $\mathcal{L}(V, V)$ (the endomorphisms of $V$) is a ring under composition (it is also a vector space - together, an algebra).

Properties of composition in $\mathcal{L}(V, V)$:

  • Associative: $(RS)T = R(ST)$
  • Identity: $I \circ T = T \circ I = T$
  • Distributive: $R(S + T) = RS + RT$ and $(R+S)T = RT + ST$
  • NOT commutative: $RS \neq SR$ in general

Matrix polynomials. For $T \in \mathcal{L}(V,V)$, we can form $p(T) = a_0 I + a_1 T + a_2 T^2 + \cdots + a_k T^k$ for any polynomial $p$. This is well-defined because we can add and compose linear maps.

Cayley-Hamilton theorem. Every linear operator $T$ satisfies its own characteristic polynomial: $p_T(T) = 0$, where $p_T(\lambda) = \det(\lambda I - T)$.

For AI: The spectral approach to recurrent networks analyzes the long-run behavior of $T^k \mathbf{v}$ as $k \to \infty$. If $T$ has eigenvalues $|\lambda_i| < 1$, then $T^k \to 0$ (stable memory decay). If any $|\lambda_i| > 1$, the recurrence explodes. This spectral stability analysis is the foundation of designing stable RNNs (LSTM and GRU use gating to control the effective eigenvalue spectrum of the recurrence).
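
A small demonstration (NumPy; the matrix is a random example rescaled to a chosen spectral radius): iterating $T^k\mathbf{v}$ decays when the spectral radius is below 1 and explodes when it is above.

```python
import numpy as np

rng = np.random.default_rng(7)

def rescale_to_radius(M, rho):
    """Rescale M so its spectral radius (max |eigenvalue|) equals rho."""
    return M * (rho / np.abs(np.linalg.eigvals(M)).max())

M = rng.normal(size=(8, 8))
v = rng.normal(size=8)

for rho in (0.9, 1.1):
    T = rescale_to_radius(M, rho)
    w = v.copy()
    for _ in range(200):
        w = T @ w
    print(rho, np.linalg.norm(w))   # 0.9 -> ~0 (decay); 1.1 -> huge (explosion)
```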

F.3 Quotient Maps and Projections

The quotient space. Given $T: V \to W$ with kernel $K = \ker(T)$, the quotient space $V/K$ consists of the equivalence classes $[\mathbf{v}] = \mathbf{v} + K = \{\mathbf{v} + \mathbf{k} : \mathbf{k} \in K\}$.

$V/K$ is a vector space of dimension $\dim(V) - \dim(K) = \operatorname{rank}(T)$.

The first isomorphism theorem. Every linear map $T: V \to W$ factors as:

$$V \xrightarrow{\pi} V/\ker(T) \xrightarrow{\tilde{T}} \operatorname{im}(T) \hookrightarrow W$$

where $\pi$ is the quotient map ($\pi(\mathbf{v}) = [\mathbf{v}]$) and $\tilde{T}$ is an isomorphism from $V/\ker(T)$ to $\operatorname{im}(T)$.

This is the coordinate-free statement of the rank-nullity theorem: $V/\ker(T) \cong \operatorname{im}(T)$.

Geometric meaning. The quotient map $\pi$ "collapses" the null space to a point, then $\tilde{T}$ acts faithfully (injectively) on the resulting space. Any linear map splits into: collapse (project out the null space) + inject faithfully into the codomain.

For AI: In contrastive learning (SimCLR, MoCo), the projection head maps representations to a lower-dimensional space. This is a linear (or nonlinear) quotient map - it deliberately collapses some dimensions (those corresponding to nuisance factors like image augmentation) while preserving the semantically meaningful directions. The first isomorphism theorem says: what survives in the image is exactly what was not collapsed.

F.4 Dual Bases and the Canonical Isomorphism

We saw that $\dim(V^*) = \dim(V)$, so $V \cong V^*$. But this isomorphism is non-canonical - it depends on the choice of basis.

With an inner product. When $V$ is an inner product space (like $\mathbb{R}^n$ with the standard dot product), there is a canonical isomorphism $\phi: V \to V^*$ defined by:

$$\phi(\mathbf{v}) = \langle \mathbf{v}, \cdot \rangle$$

That is, $\phi(\mathbf{v})$ is the linear functional that takes $\mathbf{w} \mapsto \langle \mathbf{v}, \mathbf{w}\rangle$.

This isomorphism is canonical because it doesn't depend on any choice of basis - it uses only the inner product structure. When we identify $\nabla f(\mathbf{x}) \in \mathbb{R}^n$ as a column vector (primal) rather than a row vector (dual), we are implicitly using this canonical isomorphism via the standard inner product.

For AI: On non-Euclidean spaces (manifolds of probability distributions, manifolds of neural network weights under the Fisher metric), the identification $V \cong V^*$ is NO longer trivial - gradients and velocity vectors live in different spaces. The natural gradient method corrects for this by using the Fisher information matrix $F$ as the metric: $\tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L}$. This maps the gradient (a covector) to a tangent vector using the Riemannian metric instead of the Euclidean metric.


Appendix G: Linear Maps in Practice - Worked Problems

G.1 Verifying Linearity: Systematic Approach

Problem: Is $T: \mathbb{R}^{2\times 2} \to \mathbb{R}$ defined by $T(A) = \operatorname{tr}(A)$ linear?

Check additivity: $T(A + B) = \operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B) = T(A) + T(B)$. OK

Check homogeneity: $T(cA) = \operatorname{tr}(cA) = c\operatorname{tr}(A) = cT(A)$. OK

Conclusion: $T$ is linear. Its matrix (viewing $\mathbb{R}^{2\times 2}$ with basis $\{E_{11}, E_{12}, E_{21}, E_{22}\}$):

$$\operatorname{tr}(E_{11}) = 1, \quad \operatorname{tr}(E_{12}) = 0, \quad \operatorname{tr}(E_{21}) = 0, \quad \operatorname{tr}(E_{22}) = 1$$

So $[T] = (1, 0, 0, 1)$ as a $1 \times 4$ matrix. Kernel = $\{A : \operatorname{tr}(A) = 0\}$ (the trace-zero matrices), of dimension 3.

Problem: Is $T: \mathbb{R}^n \to \mathbb{R}$ defined by $T(\mathbf{x}) = \lVert\mathbf{x}\rVert_2$ linear?

Check homogeneity: $T(c\mathbf{x}) = \lVert c\mathbf{x}\rVert = |c|\, \lVert\mathbf{x}\rVert$. For $c = -1$: $T(-\mathbf{x}) = \lVert\mathbf{x}\rVert = T(\mathbf{x})$, but linearity requires $T(-\mathbf{x}) = -T(\mathbf{x})$.

For $\mathbf{x} \neq \mathbf{0}$: $T(\mathbf{x}) > 0$ but $-T(\mathbf{x}) < 0$. So $T(-\mathbf{x}) \neq -T(\mathbf{x})$.

Conclusion: $T$ is NOT linear (the norm fails homogeneity due to the absolute value).
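
These checks translate into a quick numerical probe (NumPy; random inputs rather than a proof - but a failed probe is a definitive counterexample):

```python
import numpy as np

rng = np.random.default_rng(8)

def probe_linearity(T, shape, trials=100):
    """Probe T(ax + by) == a*T(x) + b*T(y) on random inputs."""
    for _ in range(trials):
        x, y = rng.normal(size=shape), rng.normal(size=shape)
        a, b = rng.normal(), rng.normal()
        if not np.isclose(T(a * x + b * y), a * T(x) + b * T(y)):
            return False
    return True

print(probe_linearity(np.trace, (2, 2)))        # True: trace passes
print(probe_linearity(np.linalg.norm, (5,)))    # False: norm fails
```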

G.2 Finding the Kernel: Four Approaches

For $A = \begin{bmatrix} 1 & -1 & 2 \\ 2 & -2 & 4 \end{bmatrix}$ (the rows are multiples of each other), find $\ker(A)$:

Approach 1: Row reduction. RREF: $\begin{bmatrix} 1 & -1 & 2 \\ 0 & 0 & 0 \end{bmatrix}$. Free variables $x_2 = s$, $x_3 = t$. Then $x_1 = s - 2t$.

$\ker(A) = \operatorname{span}\{(1,1,0)^\top, (-2,0,1)^\top\}$.

Approach 2: Inspection. The columns satisfy $\mathbf{a}_2 = -\mathbf{a}_1$ and $\mathbf{a}_3 = 2\mathbf{a}_1$. From $\mathbf{a}_2 = -\mathbf{a}_1$ we get $\mathbf{a}_1 + \mathbf{a}_2 = \mathbf{0}$, so $(1,1,0)^\top \in \ker(A)$. From $\mathbf{a}_3 = 2\mathbf{a}_1$ we get $2\mathbf{a}_1 + 0\mathbf{a}_2 - \mathbf{a}_3 = \mathbf{0}$, so $(2,0,-1)^\top \in \ker(A)$ - or equivalently $(-2,0,1)^\top$.

Approach 3: SVD. Compute the SVD of $A$; null space vectors are the right singular vectors with zero (or near-zero) singular values.

Approach 4: Random sampling + orthogonalization. Sample many vectors, project out the row space, and keep those with zero image (useful when $A$ is very large).

G.3 Composition of Transforms in a Graphics Pipeline

A 3D object is processed through a graphics pipeline using compositions of affine maps:

  1. Model matrix $M$: transform from object coordinates to world coordinates (rotation, scale, translation).
  2. View matrix $V$: transform from world coordinates to camera coordinates (rotation + translation).
  3. Projection matrix $P$: from camera coordinates to clip coordinates (perspective projection).

The combined transform: $\mathbf{p}_{\text{clip}} = P \cdot V \cdot M \cdot \mathbf{p}_{\text{object}}$ (in homogeneous coordinates).

Composition order matters. Reading right to left: first apply $M$, then $V$, then $P$. The matrix product $PVM$ can be precomputed once per frame (not per vertex), saving $O(|\text{vertices}|)$ matrix multiplications.

This is the same principle as "avoid recomputing shared prefixes" in transformer KV-caching: the KV cache stores the projections $K = XW_K$, $V = XW_V$ for all past tokens, so they don't need to be recomputed when generating each new token.
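
A minimal sketch of the pipeline idea in homogeneous coordinates (NumPy; the specific transforms are arbitrary examples, and the "projection" is an orthographic placeholder rather than a real perspective matrix):

```python
import numpy as np

def translation(tx, ty, tz):
    T = np.eye(4); T[:3, 3] = [tx, ty, tz]; return T

def scale(s):
    S = np.eye(4); S[:3, :3] *= s; return S

def rotation_z(theta):
    R = np.eye(4)
    c, s = np.cos(theta), np.sin(theta)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

M = translation(1, 0, 0) @ rotation_z(np.pi / 4) @ scale(2)   # model
V = translation(0, 0, -5)                                     # view (placeholder)
P = np.eye(4)                                                 # orthographic placeholder

PVM = P @ V @ M                            # precompute once per frame
p_obj = np.array([1.0, 1.0, 0.0, 1.0])     # homogeneous point

# Reading right to left: M applies first, then V, then P.
print(np.allclose(PVM @ p_obj, P @ (V @ (M @ p_obj))))   # True
```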



Appendix H: Linear Maps in Optimization and Training

H.1 The Gradient as a Linear Map

In optimization, we minimize a loss function $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla \mathcal{L}(\mathbf{w})$ tells us the direction of steepest ascent. But more precisely:

The directional derivative of $\mathcal{L}$ at $\mathbf{w}$ in direction $\mathbf{d}$ is:

$$D_{\mathbf{d}} \mathcal{L}(\mathbf{w}) = \lim_{t \to 0} \frac{\mathcal{L}(\mathbf{w} + t\mathbf{d}) - \mathcal{L}(\mathbf{w})}{t} = \nabla \mathcal{L}(\mathbf{w})^\top \mathbf{d}$$

This is a linear functional in $\mathbf{d}$: it is the dual vector $\nabla \mathcal{L}(\mathbf{w})^\top \in (\mathbb{R}^n)^*$.

Gradient descent in its pure form: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \mathcal{L}(\mathbf{w}_t)$. This uses the Euclidean identification of the gradient (a covector) with a primal vector.

Natural gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta F(\mathbf{w}_t)^{-1} \nabla \mathcal{L}(\mathbf{w}_t)$, where $F$ is the Fisher information matrix. This uses the correct metric on the manifold of probability distributions (the Fisher-Rao metric) to convert the covector gradient to a tangent vector.
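
A finite-difference check of the directional-derivative formula (NumPy; the quadratic loss is an arbitrary example with a known gradient):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 6
Q = rng.normal(size=(n, n)); Q = Q @ Q.T + np.eye(n)   # SPD matrix
b = rng.normal(size=n)

loss = lambda w: 0.5 * w @ Q @ w - b @ w   # L(w) = 0.5 w^T Q w - b^T w
grad = lambda w: Q @ w - b                 # its exact gradient

w = rng.normal(size=n)
d = rng.normal(size=n)
t = 1e-6

# Finite-difference directional derivative vs. grad(w)^T d.
fd = (loss(w + t * d) - loss(w)) / t
print(np.isclose(fd, grad(w) @ d, rtol=1e-4, atol=1e-4))   # True
```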

H.2 The Hessian as a Bilinear Map

The Hessian $H\mathcal{L}(\mathbf{w}) \in \mathbb{R}^{n \times n}$ is the matrix of second derivatives:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}$$

But more abstractly, the Hessian is a bilinear form $B: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$:

$$B(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top H \mathbf{v} = \text{the rate of change, in direction } \mathbf{u}\text{, of the directional derivative of } \mathcal{L} \text{ in direction } \mathbf{v}$$

The Hessian determines the curvature of the loss landscape:

  • Positive definite Hessian ($\mathbf{u}^\top H\mathbf{u} > 0$ for all $\mathbf{u} \neq \mathbf{0}$): the point is a local minimum (at a critical point, where the gradient vanishes).
  • Indefinite Hessian (has both positive and negative eigenvalues): the point is a saddle point.
  • The ratio of largest to smallest eigenvalue is the condition number $\kappa(H)$ - it governs how slowly gradient descent converges.

For AI: Modern optimizers (Adam, AdaGrad) approximate curvature-related quantities. Adam's second-moment estimate $\hat{v}_t$ tracks the elementwise squared gradient - a diagonal surrogate for curvature (the diagonal of the empirical Fisher, which in turn approximates the Hessian diagonal near a minimum). Dividing the gradient by $\sqrt{\hat{v}_t}$ is a rough diagonal analogue of Newton's method (which applies $H^{-1}$). This is one reason Adam often converges much faster than SGD on ill-conditioned problems.
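
A demonstration of conditioning (NumPy; a 2-D quadratic example): gradient descent on a well-conditioned quadratic converges quickly, while a large $\kappa(H)$ forces many more iterations at a stable step size.

```python
import numpy as np

def gd_steps_to_converge(H, eta, tol=1e-8, max_iter=100_000):
    """Run gradient descent on L(w) = 0.5 w^T H w from a fixed start."""
    w = np.array([1.0, 1.0])
    for k in range(max_iter):
        w = w - eta * (H @ w)
        if np.linalg.norm(w) < tol:
            return k
    return max_iter

for kappa in (2.0, 100.0):
    H = np.diag([1.0, kappa])        # condition number = kappa
    eta = 1.0 / kappa                # = 1/lambda_max: stable step size;
                                     # the slow mode contracts by (1 - 1/kappa)
    print(kappa, gd_steps_to_converge(H, eta))   # iterations grow with kappa
```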

H.3 Weight Matrices as Linear Maps: Training Dynamics

The neural tangent kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide neural networks and shows that their training dynamics under gradient flow are governed by a linear system:

$$\dot{\mathbf{y}}_t = -\eta \, \Theta(\mathbf{X}, \mathbf{X}) (\mathbf{y}_t - \mathbf{y}^*)$$

where $\Theta(\mathbf{X}, \mathbf{X})$ is the NTK matrix (constant in the infinite-width limit). This is an ODE with a constant linear map $\Theta$ - so its solution is $\mathbf{y}_t = \mathbf{y}^* + e^{-\eta \Theta t}(\mathbf{y}_0 - \mathbf{y}^*)$.

The eigenvalues of $\Theta$ determine which output directions are learned quickly (large eigenvalues -> fast convergence) and which slowly (small eigenvalues -> slow convergence). This is linear algebra - specifically, the spectral decomposition of a positive semidefinite linear map.

H.4 Gradient Flow through Linear Layers

Consider a linear layer $\mathbf{y} = W\mathbf{x}$ with loss $\mathcal{L}$. The gradient with respect to the weight matrix:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{x}^\top = \boldsymbol{\delta}\, \mathbf{x}^\top$$

where $\boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \in \mathbb{R}^m$ is the "error signal" (the upstream gradient). The gradient $\frac{\partial \mathcal{L}}{\partial W}$ is an outer product - a rank-1 matrix.

This means per-sample gradient updates are always rank-1. For a mini-batch of $B$ samples, the gradient is:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{B}\sum_{b=1}^B \boldsymbol{\delta}_b \mathbf{x}_b^\top$$

A sum of $B$ rank-1 matrices - so the gradient has rank at most $B$. For large models with batch size $B \ll n$, the gradient is a very low-rank update to the weight matrix. This low-rank structure of gradients is the empirical justification for gradient low-rank projection methods (GaLore, 2024).
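
A direct check of the rank bound (NumPy; the shapes and data are arbitrary): build a batch gradient as a sum of outer products and confirm its rank never exceeds the batch size.

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, B = 64, 128, 4   # output dim, input dim, batch size

deltas = rng.normal(size=(B, m))   # upstream gradients, one per sample
xs = rng.normal(size=(B, n))       # inputs, one per sample

# Batch gradient: average of B rank-1 outer products delta_b x_b^T.
grad_W = sum(np.outer(d, x) for d, x in zip(deltas, xs)) / B

print(grad_W.shape)                    # (64, 128)
print(np.linalg.matrix_rank(grad_W))   # 4 = B << min(m, n)
```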


Appendix I: Reference Tables

I.1 Linear Map Properties at a Glance

| Property | Condition | Matrix Equivalent | Geometric Meaning |
|---|---|---|---|
| Linear | $T(a\mathbf{u}+b\mathbf{v}) = aT\mathbf{u}+bT\mathbf{v}$ | Any $m \times n$ matrix | Preserves addition and scaling |
| Injective | $\ker(T) = \{\mathbf{0}\}$ | Full column rank | No two inputs map to the same output |
| Surjective | $\operatorname{im}(T) = W$ | Full row rank | Every output is reachable |
| Bijective (isomorphism) | Both injective and surjective | Square, full rank, invertible | Perfect correspondence |
| Orthogonal | $T^\top T = I$ | $A^\top A = I$, orthonormal columns | Preserves lengths and angles |
| Unitary | $T^* T = I$ (complex) | $A^* A = I$, orthonormal columns | Complex analogue of orthogonal |
| Projection | $T^2 = T$ | $A^2 = A$ | Idempotent: applying twice = once |
| Symmetric | $T = T^\top$ | $A = A^\top$ | Self-adjoint; diagonalizable by the spectral theorem |
| Positive definite | $\langle T\mathbf{v}, \mathbf{v}\rangle > 0$ | $\mathbf{x}^\top A\mathbf{x} > 0$ for $\mathbf{x} \neq \mathbf{0}$ | All eigenvalues positive; curvature at a minimum |
| Normal | $T T^\top = T^\top T$ | $A A^\top = A^\top A$ | Diagonalizable by a unitary matrix |
| Nilpotent | $T^k = 0$ for some $k$ | $A^k = 0$ | Powers eventually vanish; all eigenvalues 0 |
| Involution | $T^2 = I$ | $A^2 = I$ | Self-inverse (like Householder reflections) |

I.2 Rank and Dimension Formulas

| Formula | Statement |
|---|---|
| $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$ | Rank-nullity theorem for $T: V \to W$ |
| $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$ | Row rank equals column rank |
| $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$ | Rank cannot increase under composition |
| $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$ | Subadditivity of rank |
| $\dim(S + T) = \dim(S) + \dim(T) - \dim(S \cap T)$ | Inclusion-exclusion for subspaces |
| $\dim(V/W) = \dim(V) - \dim(W)$ (for a subspace $W$) | Dimension of a quotient space |
| $\operatorname{rank}(A^\top A) = \operatorname{rank}(A)$ | The Gram matrix has the same rank |
| $\operatorname{rank}(P) = \operatorname{tr}(P)$ (for a projection $P$) | Rank = trace for idempotent matrices |

I.3 AI Applications Cross-Reference

| Linear Map Concept | Where It Appears in AI | Mathematical Role |
|---|---|---|
| $W\mathbf{x} + \mathbf{b}$ (affine map) | Every neural layer | Pre-activation computation |
| $Q = XW_Q$ (linear projection) | Attention mechanism | Projects to the query subspace |
| $\Delta W = BA$ (low-rank) | LoRA fine-tuning | Rank-$r$ weight update |
| $J_f = \frac{\partial f}{\partial \mathbf{x}}$ (Jacobian) | Backpropagation | Chain rule at each layer |
| $W^\top \boldsymbol{\delta}$ (transpose map) | Backward pass | Dual map of the forward pass |
| $P = QQ^\top$ (projection) | Layer norm, attention | Projects onto a subspace |
| $W_U \mathbf{h}$ (linear map) | Unembedding (logit computation) | Representation to vocabulary |
| $e^{i\theta} \cdot \mathbf{q}$ (rotation in $\mathbb{C}$) | RoPE positional encoding | Positional rotation |
| $F^{-1}\nabla\mathcal{L}$ (metric-adjusted gradient) | Natural gradient / Adam | Riemannian gradient |
| $\Theta = J^\top J$ (Gram matrix of the Jacobian) | Neural tangent kernel | Training dynamics |


Appendix J: Proofs of Key Results

J.1 Proof: $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$

This is a fundamental result that deserves a careful proof.

Theorem. For any matrix $A \in \mathbb{R}^{m \times n}$, the column rank (the dimension of the column space) equals the row rank (the dimension of the row space).

Proof (via RREF). Let $A$ have RREF $R$ (obtained by row operations, which don't change the row space but can change the column space). In $R$:

  • The nonzero rows are linearly independent (each has a leading 1 not shared by any other row).
  • The number of nonzero rows = number of pivot columns = rank.

So row rank = column rank = number of pivots in the RREF. $\square$

Alternative proof (via SVD). The SVD $A = U\Sigma V^\top$ has $\operatorname{rank}(A)$ nonzero singular values. The column space of $A$ is spanned by $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ (the first $r$ left singular vectors). The row space of $A$ (= the column space of $A^\top$) is spanned by $\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ (the first $r$ right singular vectors). Both have dimension $r$ = the number of nonzero singular values. $\square$

J.2 Proof: Kernel and Image are Subspaces

Theorem. For any linear map $T: V \to W$, both $\ker(T)$ and $\operatorname{im}(T)$ are subspaces (of $V$ and $W$ respectively).

Proof for $\ker(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W$, so $\mathbf{0}_V \in \ker(T)$.
  2. Closed under addition: Let $\mathbf{u}, \mathbf{v} \in \ker(T)$. Then $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}$, so $\mathbf{u} + \mathbf{v} \in \ker(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{v} \in \ker(T)$, $c \in \mathbb{F}$. Then $T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}$, so $c\mathbf{v} \in \ker(T)$. $\square$

Proof for $\operatorname{im}(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W \in \operatorname{im}(T)$.
  2. Closed under addition: Let $\mathbf{w}_1, \mathbf{w}_2 \in \operatorname{im}(T)$, so $\mathbf{w}_i = T(\mathbf{v}_i)$ for some $\mathbf{v}_i \in V$. Then $\mathbf{w}_1 + \mathbf{w}_2 = T(\mathbf{v}_1) + T(\mathbf{v}_2) = T(\mathbf{v}_1 + \mathbf{v}_2) \in \operatorname{im}(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{w} = T(\mathbf{v}) \in \operatorname{im}(T)$, $c \in \mathbb{F}$. Then $c\mathbf{w} = cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T)$. $\square$

J.3 Proof: Composition of Linear Maps is Linear

Theorem. If $S: V \to W$ and $T: U \to V$ are linear, then $S \circ T: U \to W$ is linear.

Proof:

$$(S \circ T)(a\mathbf{u}_1 + b\mathbf{u}_2) = S(T(a\mathbf{u}_1 + b\mathbf{u}_2))$$
$$= S(aT(\mathbf{u}_1) + bT(\mathbf{u}_2)) \quad \text{($T$ is linear)}$$
$$= aS(T(\mathbf{u}_1)) + bS(T(\mathbf{u}_2)) \quad \text{($S$ is linear)}$$
$$= a(S \circ T)(\mathbf{u}_1) + b(S \circ T)(\mathbf{u}_2) \quad \square$$

J.4 Proof: The Dual Map is Linear

Theorem. If $T: V \to W$ is linear, then $T^\top: W^* \to V^*$ defined by $(T^\top f)(\mathbf{v}) = f(T(\mathbf{v}))$ is linear.

Proof:

  • $(T^\top(af + bg))(\mathbf{v}) = (af + bg)(T(\mathbf{v})) = af(T(\mathbf{v})) + bg(T(\mathbf{v}))$
  • $= a(T^\top f)(\mathbf{v}) + b(T^\top g)(\mathbf{v}) = (aT^\top f + bT^\top g)(\mathbf{v})$

So $T^\top(af + bg) = aT^\top f + bT^\top g$. $\square$

J.5 Proof: Invertibility Criterion for Finite-Dimensional Spaces

Theorem. Let $T: V \to V$ be a linear map on a finite-dimensional space $V$. Then the following are equivalent:

  1. $T$ is injective (one-to-one)
  2. $T$ is surjective (onto)
  3. $T$ is bijective (invertible)
  4. $\ker(T) = \{\mathbf{0}\}$
  5. $\operatorname{rank}(T) = \dim(V)$

Proof: $(1) \Leftrightarrow (4)$: $T$ is injective iff $\ker(T) = \{\mathbf{0}\}$ (standard).

$(4) \Leftrightarrow (5)$: By rank-nullity: $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$. Nullity = 0 iff rank = $\dim(V)$.

$(5) \Leftrightarrow (2)$: Rank = $\dim(\operatorname{im}(T))$. Since $\operatorname{im}(T) \subseteq V$ and both are finite-dimensional, $\dim(\operatorname{im}(T)) = \dim(V)$ iff $\operatorname{im}(T) = V$ (a subspace of equal dimension must be the whole space).

$(3) \Leftrightarrow (1)\ \&\ (2)$: by the definition of bijective. $\square$

Important: This equivalence only holds for maps $T: V \to V$ with the same domain and codomain. For $T: \mathbb{R}^m \to \mathbb{R}^n$ with $m \neq n$, injective and surjective are NOT equivalent: one of them is impossible given the dimension constraint (injectivity fails whenever $m > n$; surjectivity fails whenever $m < n$).



Appendix K: Additional AI Case Studies

K.1 Mechanistic Interpretability via Linear Maps

Mechanistic interpretability (MI) aims to reverse-engineer neural networks by understanding what computation each component performs. Linear map theory is central to this enterprise.

The residual stream as a communication bus. In transformer models, each layer reads from and writes to a shared "residual stream" $\mathbf{x} \in \mathbb{R}^d$. Each attention head and MLP layer contributes an additive update:

$$\mathbf{x}_{\ell+1} = \mathbf{x}_\ell + \underbrace{\text{Attn}_\ell(\mathbf{x}_\ell)}_{\text{attn. update}} + \underbrace{\text{MLP}_\ell(\mathbf{x}_\ell)}_{\text{MLP update}}$$

Each update is (approximately) a low-rank linear map from the residual stream back to itself. The attention update's linear part is $W_O W_V$ (the "OV circuit"); the MLP's linear part is $W_{\text{out}} W_{\text{in}}$ after linearizing the activation.

SVD of the OV circuit. The matrix $W_O W_V \in \mathbb{R}^{d \times d}$ can be analyzed via SVD. Its singular values reveal how strongly the attention head modifies the residual stream, and its singular vectors reveal which directions it reads from and writes to. Heads with near-zero singular values are "inattentive" - they barely modify the residual stream regardless of the attention pattern.

Subspace decomposition. The full set of $L \times H$ attention heads (for $L$ layers, $H$ heads per layer) collectively forms a large linear map from the input to the residual stream updates. The total update is a sum of $LH$ low-rank linear maps. Understanding the structure of this sum - which heads are redundant, which are essential - is a central goal of circuit-level MI.
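
A sketch of the OV-circuit analysis (NumPy; random weights stand in for a trained head, and the dimensions d_model and d_head are assumed hyperparameters): the rank of $W_O W_V$ is capped by the head dimension, and the singular values quantify the head's write strength.

```python
import numpy as np

rng = np.random.default_rng(11)
d_model, d_head = 256, 32

# Per-head value and output projections (placeholders for trained weights).
W_V = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_model, d_head)) / np.sqrt(d_head)

OV = W_O @ W_V                      # (d_model, d_model) map on the residual stream
s = np.linalg.svd(OV, compute_uv=False)

print(np.linalg.matrix_rank(OV))    # 32: rank capped by d_head
print(s[0], s[d_head - 1])          # the head's nonzero spectrum
print(s[d_head])                    # ~0: nothing beyond the head's rank
```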

K.2 Linear Algebra of Diffusion Models

Diffusion models (DDPM, score matching) add Gaussian noise to data and learn to denoise. The forward process is an affine map:

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\, \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

This is an affine interpolation between the data $\mathbf{x}_0$ and pure noise. The coefficient $\sqrt{\bar\alpha_t}$ scales the data, and $\sqrt{1-\bar\alpha_t}$ scales the noise.

The denoising objective. The neural network $\epsilon_\theta(\mathbf{x}_t, t)$ estimates $\boldsymbol{\varepsilon}$ (the noise) from the noisy input. Inverting the forward affine map with this estimate gives the optimal estimator of the clean data (the Tweedie/DDPM estimator):

$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}$$

which is a linear function of $\mathbf{x}_t$ and $\epsilon_\theta(\mathbf{x}_t, t)$. The diffusion process itself is a composition of affine maps in the forward direction, and the learned reverse process approximately inverts these affine maps.
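
A sanity check of this affine structure (NumPy; alpha_bar is an arbitrary point on the noise schedule, and a perfect noise oracle stands in for the trained network): with the true $\boldsymbol{\varepsilon}$, the estimator recovers $\mathbf{x}_0$ exactly.

```python
import numpy as np

rng = np.random.default_rng(12)
d = 10
x0 = rng.normal(size=d)      # "data" point
alpha_bar = 0.37             # arbitrary point on the noise schedule

eps = rng.normal(size=d)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps  # forward affine map

# Oracle denoiser: returns the exact noise (a trained eps_theta approximates this).
eps_hat = eps
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

print(np.allclose(x0_hat, x0))   # True: the affine map is exactly inverted
```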

K.3 State Space Models as Linear Dynamical Systems

Structured State Space Models (S4, Mamba, RWKV) compute their state updates via linear recurrences:

$$\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t, \qquad \mathbf{y}_t = C\mathbf{h}_t + D\mathbf{x}_t$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times d}$, $C \in \mathbb{R}^{d \times N}$, $D \in \mathbb{R}^{d \times d}$ are (possibly input-dependent) matrices.

The state transition $\mathbf{h} \mapsto A\mathbf{h} + B\mathbf{x}$ is a linear dynamical system - the fundamental object of study in control theory and signal processing.

Key linear algebra results for SSMs:

  1. Eigenvalues of $A$ determine memory. If $|\lambda_i(A)| < 1$ for all $i$, the system has bounded memory decay. If any $|\lambda_i| > 1$, the state can grow unboundedly.

  2. Diagonalization for efficiency. If $A = P\Lambda P^{-1}$, the recurrence decouples into $N$ independent scalar recurrences - each computable independently. S4 uses structured $A$ (diagonal plus low-rank, the DPLR parameterization) so the whole output sequence can be computed in parallel as a convolution.

  3. The convolution view. Unrolling the recurrence: $\mathbf{y}_t = \sum_{\tau=0}^t C A^{t-\tau} B \mathbf{x}_\tau + D\mathbf{x}_t$. The impulse response $C A^k B$ is a sequence of matrix powers - analyzable by the spectral decomposition of $A$. (A numerical check follows this list.)

  4. Mamba's selectivity. Mamba makes $A, B, C$ input-dependent: $A(\mathbf{x}), B(\mathbf{x}), C(\mathbf{x})$. The recurrence becomes bilinear in $(\mathbf{h}, \mathbf{x})$, not purely linear. Linearizing around typical inputs gives a locally linear system analyzable by the tools of this section.
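
A minimal sketch of the recurrence/convolution equivalence (NumPy; the system matrices are small random examples, rescaled to spectral radius below 1 for stability): unrolling the recurrence matches the impulse-response sum from item 3.

```python
import numpy as np

rng = np.random.default_rng(13)
N, d, T = 4, 2, 20

A = rng.normal(size=(N, N))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()   # spectral radius 0.9: stable
B = rng.normal(size=(N, d))
C = rng.normal(size=(d, N))
D = rng.normal(size=(d, d))
x = rng.normal(size=(T, d))

# Recurrent view, with indexing matching the unrolled formula above:
# h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t, h_{-1} = 0.
h = np.zeros(N)
y_rec = []
for t in range(T):
    h = A @ h + B @ x[t]
    y_rec.append(C @ h + D @ x[t])
y_rec = np.array(y_rec)

# Convolution view: y_t = sum_tau C A^(t-tau) B x_tau + D x_t.
y_conv = np.array([
    sum(C @ np.linalg.matrix_power(A, t - tau) @ B @ x[tau]
        for tau in range(t + 1)) + D @ x[t]
    for t in range(T)
])

print(np.allclose(y_rec, y_conv))   # True
```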


Appendix L: Further Reading

Primary References

  1. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer. - The definitive abstract treatment of linear maps. Goes from axioms to spectral theory without matrices until chapter 10. Highly recommended for conceptual depth.

  2. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. - Computational and applied focus. Excellent for four fundamental subspaces and applications.

  3. Horn, R. & Johnson, C. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. - Comprehensive advanced reference. Proofs of all major results, including Cayley-Hamilton, spectral theorems, singular values.

  4. Trefethen, L. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Gold standard for computational linear algebra and stability.

AI-Focused References

  1. Vaswani, A. et al. (2017). "Attention is All You Need." NeurIPS. - The original transformer paper; read the attention mechanism as linear projections.

  2. Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. - LoRA rank-nullity argument.

  3. Elhage, N. et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. - OV and QK circuits as linear maps; residual stream as communication bus.

  4. Park, K. et al. (2023). "The Linear Representation Hypothesis and the Geometry of Large Language Models." - Linear features in transformer representations.

  5. Jacot, A. et al. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. - Training dynamics via linear maps (NTK theory).

  6. Gu, A. et al. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. - SSMs as linear dynamical systems.


This section is part of the Math for LLMs curriculum - a systematic treatment of the mathematics underlying modern large language models.


Appendix M: Self-Assessment Checklist

After completing this section, you should be able to answer the following questions without notes.

Conceptual Understanding

  • Q1. State the two axioms of a linear map. What is the fastest way to show a map is NOT linear?

  • Q2. What is the kernel of a linear map? Prove it is a subspace.

  • Q3. State the rank-nullity theorem. Give an example where rank = 2, nullity = 3. What can you say about the domain and codomain dimensions?

  • Q4. Why is a linear map from $V$ to $V$ injective if and only if it is surjective? Why does this fail for maps between spaces of different dimensions?

  • Q5. What is the change-of-basis formula? If $A = P^{-1}BP$, what relationship does that establish between the maps represented by $A$ and $B$?

  • Q6. What is an orthogonal projection? How do you verify that a matrix $P$ is a projection? What extra property makes it an orthogonal projection?

  • Q7. What is the Jacobian matrix? For $f: \mathbb{R}^3 \to \mathbb{R}^2$, what is the shape of $J_f$?

  • Q8. In the backward pass of backpropagation, why do we multiply by $W^\top$ rather than $W$?

  • Q9. What makes an affine map different from a linear map? How do you represent an affine map as a linear map (in one higher dimension)?

Computational Skills

  • C1. Given a matrix $A$, find a basis for $\ker(A)$ using row reduction.

  • C2. Given a linear map $T$ defined on a non-standard basis, write its matrix in that basis.

  • C3. Given two bases $\mathcal{B}$ and $\mathcal{B}'$, compute the change-of-basis matrix $P$ and use it to transform the matrix of $T$.

  • C4. For a rank-$r$ update $\Delta W = BA$, compute the null space dimension and verify it numerically.

  • C5. Compute the Jacobian of a given vector-valued function (e.g., softmax, elementwise ReLU, an affine map composed with sigmoid).

AI Connections

  • AI1. Explain why LoRA (low-rank adaptation) is more parameter-efficient than full fine-tuning, using the language of rank and nullity.

  • AI2. Describe the forward pass of a multi-head attention layer as a sequence of linear maps. Which operations are linear, which are bilinear, and which are nonlinear?

  • AI3. What is the linear representation hypothesis? Why does it matter for interpretability research?

  • AI4. Why does a purely linear deep network (no activations) collapse to a single linear map, regardless of depth?

  • AI5. How does the dual map relate to backpropagation? What mathematical object is the "gradient" in the strict sense?

