Math for LLMs - Notes
Advanced Linear Algebra

Linear Transformations

"The matrix is not the territory. It is a coordinate representation of a linear map - and the map exists independently of any basis you choose to describe it."

Overview

Linear transformations are the language in which all of machine learning is written. Every forward pass through a neural network is a sequence of linear maps punctuated by nonlinearities. Every attention head projects queries, keys, and values through learned linear transformations. Every gradient computation traces the chain rule backward through compositions of Jacobians - which are themselves linear maps. Understanding linear transformations at the abstract level - beyond matrices, beyond coordinates - is what separates a practitioner who can apply formulas from one who can reason about why those formulas work.

This section develops the full theory of linear transformations between vector spaces. We begin with the geometric intuition: what does a linear map do to space? We then build the formal machinery - kernel, image, rank-nullity theorem, matrix representation in arbitrary bases, change of basis, isomorphisms, and composition - that makes this intuition precise. We extend to affine maps (which include bias terms), to the Jacobian as the canonical linear approximation of a nonlinear function, and to the dual space, where gradients live.

Throughout, the AI connections are not decorative but structural: LoRA is a rank-r linear map composition; attention is a sequence of linear projections; backpropagation is reverse-mode accumulation of Jacobians. By the end of this section, you will see matrix multiplication not as arithmetic, but as function composition - the most powerful organizing principle in applied mathematics.

Prerequisites

Companion Notebooks

Notebook | Description
theory.ipynb | Interactive exploration of linear maps: kernel/image computation, change-of-basis visualization, Jacobian experiments, LoRA rank analysis
exercises.ipynb | 8 graded exercises from rank-nullity proofs through softmax Jacobian and LoRA fine-tuning analysis

Learning Objectives

After completing this section, you will be able to:

  • State the two axioms of a linear transformation and derive all immediate consequences
  • Compute the kernel and image of any linear map and verify the rank-nullity theorem
  • Construct the matrix representation of a linear map in any pair of bases
  • Derive the change-of-basis formula [T]_{\mathcal{B}'} = P^{-1}[T]_{\mathcal{B}} P and use it to diagonalize operators
  • Identify and characterize projection, rotation, reflection, shear, and low-rank maps geometrically
  • Extend linear maps to affine maps via homogeneous coordinates
  • Compute the Jacobian of a vector-valued function and interpret backpropagation as Jacobian composition
  • Explain why LoRA, attention, and linear probing are instances of linear map theory

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Matrix Representation of a Linear Map
  4. Composition, Invertibility, and Isomorphisms
  5. Special Classes of Linear Transformations
  6. Dual Spaces and Transposes
  7. Affine Transformations
  8. The Jacobian as Linear Approximation
  9. Applications in Machine Learning

1. Intuition

1.1 Three Views of a Matrix

A matrix A \in \mathbb{R}^{m \times n} admits three distinct but equivalent interpretations, and fluent practitioners shift between them effortlessly.

View 1: Data container. The matrix is a rectangular array of numbers - m rows, n columns. This is the computational view. We use it when we care about entries: A_{ij}, the element in row i, column j.

View 2: Column geometry. The matrix is a collection of n column vectors in \mathbb{R}^m. The product A\mathbf{x} is then a linear combination: A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n, where \mathbf{a}_i is the i-th column. This view makes the column space visible and illuminates when A\mathbf{x} = \mathbf{b} has solutions.

View 3: Linear transformation. The matrix defines a function T: \mathbb{R}^n \to \mathbb{R}^m by T(\mathbf{x}) = A\mathbf{x}. This is the abstract, coordinate-free view. The map T exists independently of any coordinate representation - the matrix A is merely its description in a particular pair of bases (the standard bases of \mathbb{R}^n and \mathbb{R}^m).

THREE VIEWS OF THE MATRIX A
========================================================================

  [2  1]         Column geometry:          Linear map:
  [0  3]         A = [a_1 | a_2]           T: R^2 -> R^3
  [1 -1]                                   T(x) = Ax

  Data container      Span{a_1, a_2}       Sends basis vectors
  A_21 = 0             = column space      e_1 -> col 1 of A
  A_32 = -1                                e_2 -> col 2 of A

  All three describe the same mathematical object.

========================================================================
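
A quick numerical check of the three views - a minimal NumPy sketch (the matrix and vector are the illustrative values from the diagram, not from the companion notebooks):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, -1.0]])   # 3x2: T maps R^2 -> R^3
x = np.array([4.0, -2.0])

# View 1: data container - individual entries
assert A[1, 0] == 0.0 and A[2, 1] == -1.0

# View 2: column geometry - Ax is a linear combination of the columns
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(A @ x, combo)

# View 3: linear map - the j-th column is T(e_j)
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert np.allclose(A @ e1, A[:, 0]) and np.allclose(A @ e2, A[:, 1])
```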

For AI: In a transformer, the weight matrix WQRdk×dW_Q \in \mathbb{R}^{d_k \times d} is simultaneously all three: raw parameters to be optimized (View 1), a basis for the query subspace (View 2), and a linear projection from the residual stream to query space (View 3). Understanding which view you're using prevents confusion about what "the attention head is doing."

1.2 Geometric Action of Linear Maps

What does T(x)=AxT(\mathbf{x}) = A\mathbf{x} do to space? The two axioms - additivity and homogeneity - impose strong geometric constraints:

  1. The origin is fixed. T(0)=A0=0T(\mathbf{0}) = A\mathbf{0} = \mathbf{0} always. A linear map cannot translate; it cannot move the origin. This is the most important geometric fact about linear transformations.

  2. Lines through the origin stay lines. If x(t)=tv\mathbf{x}(t) = t\mathbf{v} is a line, then T(x(t))=tT(v)T(\mathbf{x}(t)) = tT(\mathbf{v}) is a line through the origin in the output space.

  3. Parallel lines stay parallel. If y1y2=v\mathbf{y}_1 - \mathbf{y}_2 = \mathbf{v} (same direction), then T(y1)T(y2)=T(v)T(\mathbf{y}_1) - T(\mathbf{y}_2) = T(\mathbf{v}) (still same direction).

  4. Grid lines go to grid lines, equally spaced. This is the classic 3Blue1Brown visualization: apply AA to the integer grid of R2\mathbb{R}^2 and you get a (possibly skewed) grid in R2\mathbb{R}^2.

  5. The unit square maps to a parallelogram. The parallelogram spanned by the columns of AA is the image of the unit square under TT.

What can a linear map do? Depending on the matrix:

  • Rotate space (orthogonal matrix, det=1\det = 1)
  • Reflect space (orthogonal matrix, det=1\det = -1)
  • Scale dimensions (diagonal matrix)
  • Shear space (triangular matrix)
  • Project onto a subspace (idempotent: P2=PP^2 = P)
  • Collapse space to a lower dimension (rank-deficient matrix)

What a linear map cannot do: translate (no origin-shifting), apply nonlinear distortion (no bending, folding, or curving of lines).

1.3 Why Linearity is Special

The superposition principle is the reason linear algebra is tractable. For a linear map TT:

T ⁣(i=1ncivi)=i=1nciT(vi)T\!\left(\sum_{i=1}^n c_i \mathbf{v}_i\right) = \sum_{i=1}^n c_i T(\mathbf{v}_i)

This means: to know TT on all of VV, you only need to know TT on a basis. A basis has dim(V)\dim(V) elements. The map on an infinite space is encoded in finitely many column vectors. This is extraordinary compression: infinite -> finite.

Linearity vs nonlinearity in neural networks. A deep neural network with no nonlinearities is just a single linear transformation: WLWL1W1x=WeffxW_L W_{L-1} \cdots W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}. Multiple linear layers collapse to one. The nonlinearities (ReLU, GELU, SiLU) are what allow composition to create genuinely new representational power. The linear layers provide the parameterized directions; the nonlinear activations provide the expressive capacity.

Linearity in analysis. Many of the hardest problems in mathematics and ML become tractable when restricted to linear functions: linear regression has a closed-form solution, linear systems have complete theory, spectral analysis of linear operators is well-understood. The strategy of "linearize, solve, interpret" recurs throughout calculus, optimization, and signal processing.

For AI: The linear representation hypothesis (Elhage et al. 2022, Park et al. 2023) conjectures that many high-level features in LLMs are encoded as directions in representation space - i.e., they are linear features. If true, this means the crucial structure of LLM computation is linear, and all the machinery of this section applies directly to understanding what models are doing.

1.4 Historical Timeline

Year | Person | Contribution
1844 | Hermann Grassmann | Die lineale Ausdehnungslehre - first abstract treatment of linear spaces
1855 | Arthur Cayley | Matrix algebra as a formal system; composition of matrices
1888 | Giuseppe Peano | First rigorous axiomatization of vector spaces
1902 | Henri Lebesgue | Integration as a linear functional on function spaces
1904-1910 | Hilbert, Riesz, Fischer | Infinite-dimensional linear algebra; Hilbert spaces; spectral theory
1929 | John von Neumann | Operator theory; linear transformations on Hilbert spaces
1940s | Alan Turing, von Neumann | Linear algebra for numerical computation; matrix algorithms
1986 | Rumelhart, Hinton, Williams | Backpropagation - chains of Jacobians as the training algorithm
2017 | Vaswani et al. | Attention = linear projections + scaled dot product; transformers
2021 | Hu et al. (LoRA) | Low-rank linear map updates for efficient fine-tuning
2022- | Elhage, Park et al. | Linear representation hypothesis; mechanistic interpretability

2. Formal Definitions

2.1 The Two Axioms and All Their Consequences

Definition (Linear Transformation). Let VV and WW be vector spaces over the same field F\mathbb{F} (typically R\mathbb{R} or C\mathbb{C}). A function T:VWT: V \to W is a linear transformation (also called a linear map or homomorphism) if it satisfies:

  1. Additivity: T(u+v)=T(u)+T(v)T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) for all u,vV\mathbf{u}, \mathbf{v} \in V
  2. Homogeneity: T(cv)=cT(v)T(c\mathbf{v}) = cT(\mathbf{v}) for all vV\mathbf{v} \in V, cFc \in \mathbb{F}

These two axioms are often combined into the single condition:

T(au+bv)=aT(u)+bT(v)for all u,vV,  a,bFT(a\mathbf{u} + b\mathbf{v}) = aT(\mathbf{u}) + bT(\mathbf{v}) \quad \text{for all } \mathbf{u}, \mathbf{v} \in V, \; a, b \in \mathbb{F}

Immediate consequences (each follows directly from the axioms):

Proposition 2.1.1 (Zero maps to zero). T(0V)=0WT(\mathbf{0}_V) = \mathbf{0}_W.

Proof: T(0)=T(0v)=0T(v)=0T(\mathbf{0}) = T(0 \cdot \mathbf{v}) = 0 \cdot T(\mathbf{v}) = \mathbf{0} for any vV\mathbf{v} \in V. \square

This is a universal test: if T(0)0T(\mathbf{0}) \neq \mathbf{0}, then TT is not linear. Translation (T(x)=x+bT(\mathbf{x}) = \mathbf{x} + \mathbf{b}, b0\mathbf{b} \neq \mathbf{0}) fails immediately.

Proposition 2.1.2 (Negatives are preserved). T(v)=T(v)T(-\mathbf{v}) = -T(\mathbf{v}).

Proof: T(v)=T((1)v)=(1)T(v)=T(v)T(-\mathbf{v}) = T((-1)\mathbf{v}) = (-1)T(\mathbf{v}) = -T(\mathbf{v}). \square

Proposition 2.1.3 (General linear combinations). For any v1,,vkV\mathbf{v}_1, \ldots, \mathbf{v}_k \in V and c1,,ckFc_1, \ldots, c_k \in \mathbb{F}:

T ⁣(i=1kcivi)=i=1kciT(vi)T\!\left(\sum_{i=1}^k c_i \mathbf{v}_i\right) = \sum_{i=1}^k c_i T(\mathbf{v}_i)

Proof: By induction using additivity and homogeneity. \square

Proposition 2.1.4 (Determined by basis images). If {b1,,bn}\{\mathbf{b}_1, \ldots, \mathbf{b}_n\} is a basis for VV, then TT is completely determined by T(b1),,T(bn)T(\mathbf{b}_1), \ldots, T(\mathbf{b}_n). Moreover, for any choice of w1,,wnW\mathbf{w}_1, \ldots, \mathbf{w}_n \in W, there exists a unique linear map TT with T(bi)=wiT(\mathbf{b}_i) = \mathbf{w}_i.

This is the fundamental construction theorem. It means the matrix of TT (in standard coordinates) is:

A=[T(e1)T(e2)T(en)]A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

The set of all linear maps T:VWT: V \to W is denoted L(V,W)\mathcal{L}(V, W) or Hom(V,W)\operatorname{Hom}(V, W). It is itself a vector space under pointwise operations: (S+T)(v)=S(v)+T(v)(S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v}) and (cT)(v)=cT(v)(cT)(\mathbf{v}) = cT(\mathbf{v}).

2.2 Kernel of a Linear Map

Definition (Kernel). The kernel (or null space) of a linear map T:VWT: V \to W is:

ker(T)={vV:T(v)=0W}\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}_W\}

It is the set of all inputs that TT maps to zero - the "lost information" of the map.

Theorem 2.2.1. ker(T)\ker(T) is a subspace of VV.

Proof:

  • Zero: T(0)=0T(\mathbf{0}) = \mathbf{0}, so 0ker(T)\mathbf{0} \in \ker(T).
  • Closure under addition: If T(u)=T(v)=0T(\mathbf{u}) = T(\mathbf{v}) = \mathbf{0}, then T(u+v)=T(u)+T(v)=0+0=0T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}.
  • Closure under scaling: If T(v)=0T(\mathbf{v}) = \mathbf{0}, then T(cv)=cT(v)=c0=0T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}. \square

Geometric meaning. ker(T)\ker(T) is the subspace that TT "collapses to zero." If TT is the projection onto the xyxy-plane, then ker(T)\ker(T) is the zz-axis. If TT is differentiation of polynomials, ker(T)\ker(T) is the constant polynomials.

Theorem 2.2.2 (Injectivity criterion). TT is injective (one-to-one) if and only if ker(T)={0}\ker(T) = \{\mathbf{0}\}.

Proof. (\Rightarrow) If TT is injective and T(v)=0=T(0)T(\mathbf{v}) = \mathbf{0} = T(\mathbf{0}), then v=0\mathbf{v} = \mathbf{0}. (\Leftarrow) If ker(T)={0}\ker(T) = \{\mathbf{0}\} and T(u)=T(v)T(\mathbf{u}) = T(\mathbf{v}), then T(uv)=0T(\mathbf{u} - \mathbf{v}) = \mathbf{0}, so uv=0\mathbf{u} - \mathbf{v} = \mathbf{0}, i.e., u=v\mathbf{u} = \mathbf{v}. \square

The dimension of the kernel, dim(ker(T))\dim(\ker(T)), is called the nullity of TT.

2.3 Image of a Linear Map

Definition (Image). The image (or range) of a linear map T:VWT: V \to W is:

im(T)={T(v):vV}=T(V)\operatorname{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\} = T(V)

It is the set of all possible outputs - the "reachable" part of WW.

Theorem 2.3.1. im(T)\operatorname{im}(T) is a subspace of WW.

Proof:

  • Zero: T(0)=0im(T)T(\mathbf{0}) = \mathbf{0} \in \operatorname{im}(T).
  • Closure under addition: T(u)+T(v)=T(u+v)im(T)T(\mathbf{u}) + T(\mathbf{v}) = T(\mathbf{u} + \mathbf{v}) \in \operatorname{im}(T).
  • Closure under scaling: cT(v)=T(cv)im(T)cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T). \square

Theorem 2.3.2. im(T)\operatorname{im}(T) is the column space of the matrix AA representing TT.

Proof: T(x)=Ax=x1a1++xnanT(\mathbf{x}) = A\mathbf{x} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n, which is exactly the span of the columns of AA. \square

Theorem 2.3.3 (Surjectivity criterion). T:VWT: V \to W is surjective (onto) if and only if im(T)=W\operatorname{im}(T) = W.

The dimension of the image, dim(im(T))\dim(\operatorname{im}(T)), is called the rank of TT, denoted rank(T)\operatorname{rank}(T).

2.4 The Rank-Nullity Theorem

This is one of the most elegant and useful results in linear algebra, connecting the three fundamental dimensions of a linear map.

Theorem 2.4.1 (Rank-Nullity Theorem). Let T:VWT: V \to W be a linear map with VV finite-dimensional. Then:

dim(V)=dim(ker(T))+dim(im(T))=nullity(T)+rank(T)\dim(V) = \dim(\ker(T)) + \dim(\operatorname{im}(T)) = \operatorname{nullity}(T) + \operatorname{rank}(T)

Proof. Let {k1,,kp}\{\mathbf{k}_1, \ldots, \mathbf{k}_p\} be a basis for ker(T)\ker(T) (so p=nullity(T)p = \operatorname{nullity}(T)). Extend this to a basis for all of VV: {k1,,kp,b1,,bq}\{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\} where p+q=dim(V)p + q = \dim(V).

We claim {T(b1),,T(bq)}\{T(\mathbf{b}_1), \ldots, T(\mathbf{b}_q)\} is a basis for im(T)\operatorname{im}(T).

Spanning: For any wim(T)\mathbf{w} \in \operatorname{im}(T), write w=T(v)\mathbf{w} = T(\mathbf{v}) for some v=ciki+djbj\mathbf{v} = \sum c_i \mathbf{k}_i + \sum d_j \mathbf{b}_j. Then w=T(v)=ciT(ki)+djT(bj)=djT(bj)\mathbf{w} = T(\mathbf{v}) = \sum c_i T(\mathbf{k}_i) + \sum d_j T(\mathbf{b}_j) = \sum d_j T(\mathbf{b}_j).

Linear independence: If djT(bj)=0\sum d_j T(\mathbf{b}_j) = \mathbf{0}, then T(djbj)=0T(\sum d_j \mathbf{b}_j) = \mathbf{0}, so djbjker(T)=span{k1,,kp}\sum d_j \mathbf{b}_j \in \ker(T) = \operatorname{span}\{\mathbf{k}_1, \ldots, \mathbf{k}_p\}. But {k1,,kp,b1,,bq}\{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\} is linearly independent, so all dj=0d_j = 0. \square

Intuition: The rank-nullity theorem says TT "uses" its input dimensions in two ways: some dimensions are collapsed to zero (nullity), and the rest are faithfully transmitted to the output (rank). Total in = lost + kept.

RANK-NULLITY THEOREM
========================================================================

  T: V -----------------------------------> W
       |
       +-- ker(T) ---> {0}
       |   (nullity)   collapsed
       |
       +-- "rest" ---> im(T), a subspace of W
           (rank)      transmitted

  dim(V)    =    nullity(T)    +    rank(T)
  [total]     [collapsed dims]  [surviving dims]

  Example: T: R^5 -> R^3, rank 2
  -> nullity = 5 - 2 = 3
  -> 3 dimensions collapse, 2 survive

========================================================================

Example. T: \mathbb{R}^4 \to \mathbb{R}^3 defined by A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. Rank = 3 (full row rank). Nullity = 4 - 3 = 1. The null space is \ker(T) = \operatorname{span}\{(-1, -1, 1, 0)^\top\}.
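
The theorem is easy to verify numerically. A minimal NumPy-only sketch for the example above (the null space is read off from the right singular vectors with zero singular value):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])   # the example above: R^4 -> R^3
n = A.shape[1]

rank = np.linalg.matrix_rank(A)
nullity = n - rank
print(rank, nullity)                    # 3, 1  ->  rank + nullity = dim(V) = 4

# Null space basis: right singular vectors whose singular values are ~0
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T                # columns span ker(T)
print(null_basis.ravel())               # proportional to (-1, -1, 1, 0)
assert np.allclose(A @ null_basis, 0.0)
```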

For AI: In LoRA, a weight update ΔW=BA\Delta W = BA with BRd×rB \in \mathbb{R}^{d \times r}, ARr×kA \in \mathbb{R}^{r \times k} has rank at most rr. By rank-nullity, its kernel has dimension at least krk - r. When rkr \ll k, the update leaves most of the input space unchanged - it only "speaks to" an rr-dimensional subspace.

2.5 Examples and Non-Examples

Linear transformations:

Map | Domain -> Codomain | Kernel | Image
T(\mathbf{x}) = A\mathbf{x} (matrix mult.) | \mathbb{R}^n \to \mathbb{R}^m | null space of A | column space of A
T(f) = f' (differentiation) | \mathcal{P}_n \to \mathcal{P}_{n-1} | constants | all polynomials of degree \leq n-1
T(f) = \int_0^x f(t)\,dt (integration) | \mathcal{C}[0,1] \to \mathcal{C}[0,1] | \{\mathbf{0}\} | functions vanishing at 0
T(\mathbf{x}) = \mathbf{0} (zero map) | V \to W | all of V | \{\mathbf{0}\}
T(\mathbf{x}) = \mathbf{x} (identity) | V \to V | \{\mathbf{0}\} | all of V
T(x, y) = (x, 0) (projection) | \mathbb{R}^2 \to \mathbb{R}^2 | y-axis | x-axis
T(A) = \operatorname{tr}(A) (trace) | \mathbb{R}^{n\times n} \to \mathbb{R} | trace-zero matrices | \mathbb{R}

Non-linear maps (and why they fail):

Map | Linearity failure | Test
T(\mathbf{x}) = \mathbf{x} + \mathbf{b} (\mathbf{b} \neq \mathbf{0}) | T(\mathbf{0}) = \mathbf{b} \neq \mathbf{0} | Zero test
T(x) = x^2 | T(1+1) = 4 \neq T(1) + T(1) = 2 | Additivity
T(\mathbf{x}) = \lVert\mathbf{x}\rVert (norm) | T(-\mathbf{e}_1) = 1 \neq -T(\mathbf{e}_1) = -1 | Homogeneity
T(\mathbf{x}) = \operatorname{softmax}(\mathbf{x}) | T(2\mathbf{x}) \neq 2T(\mathbf{x}); softmax is invariant to adding a constant to every coordinate, not to scaling | Homogeneity
T(\mathbf{x}) = \operatorname{ReLU}(\mathbf{x}) | T(\mathbf{u} + \mathbf{v}) \neq T(\mathbf{u}) + T(\mathbf{v}) in general | Additivity
T(\mathbf{x}) = \mathbf{x} \odot \mathbf{x} (elementwise square) | T(c\mathbf{x}) = c^2\, \mathbf{x} \odot \mathbf{x} \neq c\,(\mathbf{x} \odot \mathbf{x}) | Homogeneity

Note on ReLU: Though ReLU is not linear, it is piecewise linear - linear on each orthant. This means neural networks with ReLU are piecewise linear functions, which is a key fact for understanding their behavior.
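
These failure modes can be checked mechanically. A small sketch of a numerical linearity test (random probes; the helper name `looks_linear` is ours, not from the notebooks, and passing random probes is necessary but not sufficient for linearity):

```python
import numpy as np

def looks_linear(T, n, trials=100, tol=1e-8):
    """Probe additivity and homogeneity at random points (necessary, not sufficient)."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=n)
        c = rng.normal()
        if not np.allclose(T(u + v), T(u) + T(v), atol=tol):
            return False
        if not np.allclose(T(c * u), c * T(u), atol=tol):
            return False
    return True

A = np.random.default_rng(1).normal(size=(3, 4))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(looks_linear(lambda x: A @ x, 4))            # True  - matrix map
print(looks_linear(lambda x: np.maximum(x, 0), 4)) # False - ReLU breaks additivity
print(looks_linear(softmax, 4))                    # False - softmax breaks homogeneity
print(looks_linear(lambda x: x + 1.0, 4))          # False - translation fails the zero test
```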


3. Matrix Representation of a Linear Map

3.1 The Standard Basis Construction

Over Rn\mathbb{R}^n and Rm\mathbb{R}^m with standard bases, every linear map T:RnRmT: \mathbb{R}^n \to \mathbb{R}^m corresponds to a unique matrix ARm×nA \in \mathbb{R}^{m \times n}.

Construction. The jj-th column of AA is T(ej)T(\mathbf{e}_j), the image of the jj-th standard basis vector:

A=[T(e1)T(e2)T(en)]A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

Why this works: For any x=j=1nxjej\mathbf{x} = \sum_{j=1}^n x_j \mathbf{e}_j:

T(x)=T ⁣(jxjej)=jxjT(ej)=AxT(\mathbf{x}) = T\!\left(\sum_j x_j \mathbf{e}_j\right) = \sum_j x_j T(\mathbf{e}_j) = A\mathbf{x}

The matrix is the complete, coordinate-encoded description of TT.

Example. Find the matrix for the 2D2D counterclockwise rotation by angle θ\theta.

T(e1)=(cosθ,sinθ)T(\mathbf{e}_1) = (\cos\theta, \sin\theta)^\top and T(e2)=(sinθ,cosθ)T(\mathbf{e}_2) = (-\sin\theta, \cos\theta)^\top, giving:

Rθ=[cosθsinθsinθcosθ]R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

Example. Find the matrix for differentiation D:P3P2D: \mathcal{P}_3 \to \mathcal{P}_2 using the bases {1,x,x2,x3}\{1, x, x^2, x^3\} and {1,x,x2}\{1, x, x^2\}.

D(1)=0D(1) = 0, D(x)=1D(x) = 1, D(x2)=2xD(x^2) = 2x, D(x3)=3x2D(x^3) = 3x^2. In the output basis:

[D]BC=[010000200003][D]_{\mathcal{B}}^{\mathcal{C}} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}

This is a 3 \times 4 matrix, reflecting D: \mathcal{P}_3 \to \mathcal{P}_2 (4-dimensional domain, 3-dimensional codomain).
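
The construction "column j = T(e_j)" translates directly into code. A minimal sketch that builds the differentiation matrix above by applying D to each basis polynomial (coefficients ordered as (1, x, x^2, x^3); the helper is illustrative, not from the notebooks):

```python
import numpy as np

def D(coeffs):
    """Differentiate a cubic given by coefficients (c0, c1, c2, c3) for 1, x, x^2, x^3."""
    c0, c1, c2, c3 = coeffs
    return np.array([c1, 2 * c2, 3 * c3])   # coefficients for 1, x, x^2

# Matrix of D: apply it to each basis vector and stack the results as columns
basis = np.eye(4)
M = np.column_stack([D(e) for e in basis])
print(M)
# [[0. 1. 0. 0.]
#  [0. 0. 2. 0.]
#  [0. 0. 0. 3.]]

# Check on p(x) = 5 - x + 2x^3  ->  p'(x) = -1 + 6x^2
p = np.array([5.0, -1.0, 0.0, 2.0])
assert np.allclose(M @ p, D(p))
```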

3.2 Representation in Arbitrary Bases

When VV and WW have non-standard bases, the matrix of TT depends on those bases.

Setup. Let B={b1,,bn}\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} be a basis for VV and C={c1,,cm}\mathcal{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_m\} be a basis for WW.

Coordinate vectors. For vV\mathbf{v} \in V, write v=jαjbj\mathbf{v} = \sum_j \alpha_j \mathbf{b}_j. The coordinate vector is [v]B=(α1,,αn)Rn[\mathbf{v}]_{\mathcal{B}} = (\alpha_1, \ldots, \alpha_n)^\top \in \mathbb{R}^n.

The matrix of TT in bases (B,C)(\mathcal{B}, \mathcal{C}). Express each T(bj)T(\mathbf{b}_j) in the basis C\mathcal{C}:

T(bj)=i=1maijciT(\mathbf{b}_j) = \sum_{i=1}^m a_{ij} \mathbf{c}_i

The matrix [T]BC=(aij)Rm×n[T]_{\mathcal{B}}^{\mathcal{C}} = (a_{ij}) \in \mathbb{R}^{m \times n} satisfies:

[T(v)]C=[T]BC[v]B[T(\mathbf{v})]_{\mathcal{C}} = [T]_{\mathcal{B}}^{\mathcal{C}} \, [\mathbf{v}]_{\mathcal{B}}

The commutative diagram:

  v in V  ----------------T----------------->  T(v) in W
     |                                             |
     | [.]_B (coordinates in B)                    | [.]_C (coordinates in C)
     |                                             |
     v                                             v
  [v]_B -------[T]_B^C (matrix mult.)-------->  [T(v)]_C

The matrix [T]BC[T]_{\mathcal{B}}^{\mathcal{C}} is the bridge between coordinates: it takes B\mathcal{B}-coordinates of input to C\mathcal{C}-coordinates of output.

Example. Let T:R2R2T: \mathbb{R}^2 \to \mathbb{R}^2 be the map T(x,y)=(x+y,xy)T(x, y) = (x + y, x - y). In the non-standard basis B={(1,1),(1,1)}\mathcal{B} = \{(1, 1), (1, -1)\}:

T(1,1)=(2,0)=1(1,1)+1(1,1)T(1,1) = (2, 0) = 1 \cdot (1,1) + 1 \cdot (1,-1), so column 1 is (1,1)(1, 1)^\top. T(1,1)=(0,2)=1(1,1)+(1)(1,1)T(1,-1) = (0, 2) = 1 \cdot (1,1) + (-1) \cdot (1,-1), so column 2 is (1,1)(1, -1)^\top.

[T]BB=[1111][T]_{\mathcal{B}}^{\mathcal{B}} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

This is diagonal-like: in the basis B\mathcal{B}, TT acts as a simple scaling on each basis vector - exactly the diagonalization idea.

3.3 The Change-of-Basis Matrix

Definition. Let B={b1,,bn}\mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} and B={b1,,bn}\mathcal{B}' = \{\mathbf{b}'_1, \ldots, \mathbf{b}'_n\} be two bases for the same space VV. The change-of-basis matrix from B\mathcal{B}' to B\mathcal{B} is:

P=[[b1]B[b2]B[bn]B]P = \begin{bmatrix} [\mathbf{b}'_1]_{\mathcal{B}} & [\mathbf{b}'_2]_{\mathcal{B}} & \cdots & [\mathbf{b}'_n]_{\mathcal{B}} \end{bmatrix}

Each column is the B\mathcal{B}-coordinate vector of the corresponding new basis vector.

Key property: If [v]B[\mathbf{v}]_{\mathcal{B}'} are the B\mathcal{B}'-coordinates of v\mathbf{v}, then the B\mathcal{B}-coordinates are:

[v]B=P[v]B[\mathbf{v}]_{\mathcal{B}} = P \, [\mathbf{v}]_{\mathcal{B}'}

The change-of-basis formula for TT. If [T]B[T]_{\mathcal{B}} is the matrix of TT in the basis B\mathcal{B}, and PP is the change-of-basis matrix from B\mathcal{B}' to B\mathcal{B}, then:

[T]B=P1[T]BP[T]_{\mathcal{B}'} = P^{-1} [T]_{\mathcal{B}} P

Derivation:

[T(v)]B=P1[T(v)]B=P1[T]B[v]B=P1[T]BP[v]B[T(\mathbf{v})]_{\mathcal{B}'} = P^{-1} [T(\mathbf{v})]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} [\mathbf{v}]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} P [\mathbf{v}]_{\mathcal{B}'}

Worked example. Let T: \mathbb{R}^2 \to \mathbb{R}^2 have matrix A = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} in the standard basis. In the basis \mathcal{B}' = \{(1, 0), (1, 1)\}:

P = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad P^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}

P^{-1} A P = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & -2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix}

Here [T]_{\mathcal{B}'} happens to equal A: this particular P commutes with A, so conjugation leaves the matrix unchanged - the map looks the same in the sheared basis.
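
A numerical check of the worked example - a minimal NumPy sketch (np.linalg.inv stands in for the symbolic P^{-1}; the test vector is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 3.0]])            # matrix of T in the standard basis
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])            # columns = new basis vectors (1,0), (1,1)

A_prime = np.linalg.inv(P) @ A @ P    # matrix of T in the basis B'
print(A_prime)                        # [[3. 1.], [0. 3.]] - here P happens to commute with A

# The two representations agree: convert coordinates, apply, convert back
v = np.array([2.0, 5.0])                       # standard coordinates
v_prime = np.linalg.inv(P) @ v                 # B'-coordinates
assert np.allclose(P @ (A_prime @ v_prime), A @ v)
```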

3.4 Similarity Transformations

Definition. Two matrices A,BRn×nA, B \in \mathbb{R}^{n \times n} are similar (ABA \sim B) if there exists an invertible PP such that B=P1APB = P^{-1}AP.

Geometrically: AA and BB represent the same linear map in different bases.

Invariants under similarity (properties that don't change when you change basis):

Invariant | Formula | Meaning
Eigenvalues | \lambda unchanged | Spectrum is basis-independent
Determinant | \det(P^{-1}AP) = \det(A) | Volume scaling is basis-independent
Trace | \operatorname{tr}(P^{-1}AP) = \operatorname{tr}(A) | Sum of eigenvalues
Rank | \operatorname{rank}(P^{-1}AP) = \operatorname{rank}(A) | Dimension of image
Characteristic polynomial | \det(\lambda I - P^{-1}AP) = \det(\lambda I - A) | Entire spectrum

Diagonalization is change of basis. When AA is diagonalizable with eigenvectors v1,,vn\mathbf{v}_1, \ldots, \mathbf{v}_n, the matrix P=[v1vn]P = [\mathbf{v}_1 | \cdots | \mathbf{v}_n] gives P1AP=Λ=diag(λ1,,λn)P^{-1}AP = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n). We are choosing the basis of eigenvectors, in which the map acts by simple scaling.

Forward reference: Eigenvalues and Eigenvectors

The full theory of diagonalization, the spectral theorem, and when a matrix is diagonalizable is developed in 01: Eigenvalues and Eigenvectors. Here we note only that diagonalization is a special case of change of basis.


4. Composition, Invertibility, and Isomorphisms

4.1 Composition is Matrix Multiplication

Let T:UVT: U \to V and S:VWS: V \to W be linear maps. Their composition ST:UWS \circ T: U \to W defined by (ST)(u)=S(T(u))(S \circ T)(\mathbf{u}) = S(T(\mathbf{u})) is also linear.

Theorem 4.1.1. If TT has matrix AA (in standard bases) and SS has matrix BB, then STS \circ T has matrix BABA.

Proof: (ST)(x)=S(T(x))=S(Ax)=B(Ax)=(BA)x(S \circ T)(\mathbf{x}) = S(T(\mathbf{x})) = S(A\mathbf{x}) = B(A\mathbf{x}) = (BA)\mathbf{x}. \square

This is the fundamental theorem connecting composition of functions to matrix multiplication. It explains:

  • Non-commutativity: STTSS \circ T \neq T \circ S in general (different domains/codomains, or BAABBA \neq AB for square matrices).
  • Associativity: (RS)T=R(ST)(R \circ S) \circ T = R \circ (S \circ T) - matches matrix associativity.
  • Deep learning: A network f=fLf1f = f_L \circ \cdots \circ f_1 is a composition. The forward pass computes this product. The backward pass (backprop) uses the chain rule, which is exactly the composition of Jacobians.

Collapse of linear-only networks. If fi(x)=Wixf_i(\mathbf{x}) = W_i \mathbf{x} (all layers linear, no activations):

f(x)=WLW2W1x=Weffxf(\mathbf{x}) = W_L \cdots W_2 W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}

where Weff=WLW1W_{\text{eff}} = W_L \cdots W_1 is a single matrix. No depth benefit without nonlinearity.
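
The collapse is easy to see numerically: stacking linear layers without activations produces exactly one effective matrix. A minimal sketch (shapes chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))    # layer 1: R^8  -> R^16
W2 = rng.normal(size=(32, 16))   # layer 2: R^16 -> R^32
W3 = rng.normal(size=(4, 32))    # layer 3: R^32 -> R^4

x = rng.normal(size=8)
layer_by_layer = W3 @ (W2 @ (W1 @ x))
W_eff = W3 @ W2 @ W1             # composition = product (note the order)
assert np.allclose(layer_by_layer, W_eff @ x)
print(W_eff.shape)               # (4, 8): a single linear map R^8 -> R^4
```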

4.2 Injectivity, Surjectivity, and Bijectivity

Definitions. A linear map T:VWT: V \to W is:

  • Injective (one-to-one) if T(u)=T(v)u=vT(\mathbf{u}) = T(\mathbf{v}) \Rightarrow \mathbf{u} = \mathbf{v}
  • Surjective (onto) if for every wW\mathbf{w} \in W, vV\exists\, \mathbf{v} \in V with T(v)=wT(\mathbf{v}) = \mathbf{w}
  • Bijective if both injective and surjective

Criteria via rank-nullity:

Property | Condition | Matrix equivalent
Injective | \ker(T) = \{\mathbf{0}\}, i.e., nullity = 0 | Columns of A are linearly independent
Surjective | \operatorname{im}(T) = W, i.e., rank = \dim(W) | Rows of A span \mathbb{R}^m; full row rank
Bijective | Both; requires \dim(V) = \dim(W) and full rank | A is square and invertible

Dimension constraints:

  • If dim(V)<dim(W)\dim(V) < \dim(W): TT cannot be surjective (rank dim(V)<dim(W)\leq \dim(V) < \dim(W)).
  • If dim(V)>dim(W)\dim(V) > \dim(W): TT cannot be injective (nullity =dim(V)rankdim(V)dim(W)>0= \dim(V) - \text{rank} \geq \dim(V) - \dim(W) > 0).
  • If dim(V)=dim(W)\dim(V) = \dim(W): injective \Leftrightarrow surjective \Leftrightarrow bijective.

4.3 Isomorphisms

Definition. A bijective linear map T:VWT: V \to W is called an isomorphism. If such a map exists, VV and WW are isomorphic, written VWV \cong W.

Isomorphic spaces are "the same" as vector spaces - they have the same algebraic structure. An isomorphism is a relabeling of elements that preserves all linear operations.

Theorem 4.3.1 (Fundamental Classification). Two finite-dimensional vector spaces over the same field are isomorphic if and only if they have the same dimension.

Consequence: Every nn-dimensional vector space over R\mathbb{R} is isomorphic to Rn\mathbb{R}^n. The space Pn\mathcal{P}_n of polynomials of degree n\leq n (dimension n+1n+1) is isomorphic to Rn+1\mathbb{R}^{n+1}. The space R2×2\mathbb{R}^{2 \times 2} (dimension 4) is isomorphic to R4\mathbb{R}^4.

For AI: The residual stream of a transformer (a vector in Rd\mathbb{R}^d) is isomorphic to any other dd-dimensional space. Features are not inherently "in Rd\mathbb{R}^d" - they live in an abstract vector space and Rd\mathbb{R}^d is just one coordinate representation. The linear representation hypothesis says this space has meaningful geometric structure independent of the coordinate system.

Properties of isomorphisms:

  • The inverse T1:WVT^{-1}: W \to V is also an isomorphism.
  • Isomorphism preserves all vector space structure: subspaces, linear independence, bases, dimension.
  • Isomorphisms form a group under composition.

4.4 The Inverse Map

Theorem 4.4.1. If T:VWT: V \to W is an isomorphism (bijective linear map), then T1:WVT^{-1}: W \to V is also a linear map.

Proof: For w1,w2W\mathbf{w}_1, \mathbf{w}_2 \in W, let vi=T1(wi)\mathbf{v}_i = T^{-1}(\mathbf{w}_i), so T(vi)=wiT(\mathbf{v}_i) = \mathbf{w}_i. Then:

T(av1+bv2)=aT(v1)+bT(v2)=aw1+bw2T(a\mathbf{v}_1 + b\mathbf{v}_2) = aT(\mathbf{v}_1) + bT(\mathbf{v}_2) = a\mathbf{w}_1 + b\mathbf{w}_2

So T1(aw1+bw2)=av1+bv2=aT1(w1)+bT1(w2)T^{-1}(a\mathbf{w}_1 + b\mathbf{w}_2) = a\mathbf{v}_1 + b\mathbf{v}_2 = aT^{-1}(\mathbf{w}_1) + bT^{-1}(\mathbf{w}_2). \square

Matrix inverse. If TT has matrix AA (square, full rank), then T1T^{-1} has matrix A1A^{-1}.

Left and right inverses for non-square maps. When T:RnRmT: \mathbb{R}^n \to \mathbb{R}^m is injective but not surjective (n<mn < m), there is no full inverse. However:

  • Left inverse: LL such that LT=IVL \circ T = I_V. Exists iff TT is injective. Formula: L=(AA)1AL = (A^\top A)^{-1} A^\top (left pseudo-inverse).
  • Right inverse: RR such that TR=IWT \circ R = I_W. Exists iff TT is surjective. Formula: R=A(AA)1R = A^\top (A A^\top)^{-1} (right pseudo-inverse).
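
A small numerical sketch of the left inverse for an injective, non-surjective map (a tall matrix with independent columns; the random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))              # T: R^3 -> R^5, injective (independent columns)
L = np.linalg.inv(A.T @ A) @ A.T         # left pseudo-inverse, shape (3, 5)

assert np.allclose(L @ A, np.eye(3))     # L o T = identity on the domain
print(np.allclose(A @ L, np.eye(5)))     # False: T is not surjective, so no right inverse
```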

Forward reference: Moore-Penrose Pseudo-Inverse

The general pseudo-inverse A+A^+, defined via SVD as A+=VΣ+UA^+ = V\Sigma^+ U^\top, handles all cases uniformly and gives the least-squares solution when an exact solution doesn't exist. Full treatment in 02: Singular Value Decomposition.

4.5 The Four Fundamental Subspaces via Linear Maps

Every linear map T:VWT: V \to W (with V=RnV = \mathbb{R}^n, W=RmW = \mathbb{R}^m, matrix AA) defines four fundamental subspaces:

Subspace | Definition | Lives in | Dimension
Column space \operatorname{col}(A) | \operatorname{im}(T) | \mathbb{R}^m | r = \operatorname{rank}(A)
Null space \operatorname{null}(A) | \ker(T) | \mathbb{R}^n | n - r
Row space \operatorname{row}(A) | \operatorname{im}(T^\top) | \mathbb{R}^n | r
Left null space \operatorname{null}(A^\top) | \ker(T^\top) | \mathbb{R}^m | m - r

Orthogonality relations (proven using the dual map - see 6):

null(A)row(A),null(A)col(A)\operatorname{null}(A) \perp \operatorname{row}(A), \quad \operatorname{null}(A^\top) \perp \operatorname{col}(A)

This splits Rn=row(A)null(A)\mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A) and Rm=col(A)null(A)\mathbb{R}^m = \operatorname{col}(A) \oplus \operatorname{null}(A^\top).

THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                       R^m (codomain)
  +--------------------+             +--------------------+
  | row space          |---T--->     | column space       |
  | (dim r)            |             | (dim r)            |
  +--------------------+             +--------------------+
  | null space         |---T--->     | left null space    |
  | (dim n-r)          |    {0}      | (dim m-r)          |
  +--------------------+             +--------------------+
   row space _|_ null space           column space _|_ left null space
   (orthogonal complements in R^n)    (orthogonal complements in R^m)

========================================================================

5. Special Classes of Linear Transformations

5.1 Projection Operators

Definition. A linear map P:VVP: V \to V is a projection if it is idempotent: P2=PP^2 = P.

Idempotency means "doing it twice is the same as doing it once." Once you project onto a subspace, you're already there - projecting again does nothing.

Theorem 5.1.1. If PP is a projection, then:

  • im(P)=ker(IP)\operatorname{im}(P) = \ker(I - P): the image of PP is the fixed-point set.
  • ker(P)=im(IP)\ker(P) = \operatorname{im}(I - P): the kernel of PP is the image of the complementary projection.
  • V=im(P)ker(P)V = \operatorname{im}(P) \oplus \ker(P) (direct sum decomposition).

Proof of last claim: Any vV\mathbf{v} \in V decomposes as v=Pv+(IP)v\mathbf{v} = P\mathbf{v} + (I-P)\mathbf{v}, where Pvim(P)P\mathbf{v} \in \operatorname{im}(P) and (IP)vker(P)(I-P)\mathbf{v} \in \ker(P) (since P(IP)v=(PP2)v=0P(I-P)\mathbf{v} = (P-P^2)\mathbf{v} = \mathbf{0}). Uniqueness follows from: if v=u+w\mathbf{v} = \mathbf{u} + \mathbf{w} with uim(P)\mathbf{u} \in \operatorname{im}(P), wker(P)\mathbf{w} \in \ker(P), then Pv=Pu+Pw=uP\mathbf{v} = P\mathbf{u} + P\mathbf{w} = \mathbf{u}. \square

Rank and trace. For a projection: rank(P)=tr(P)\operatorname{rank}(P) = \operatorname{tr}(P) (eigenvalues are only 0 and 1; trace = sum of eigenvalues = number of 1s = rank).

Orthogonal vs oblique projections. An orthogonal projection satisfies additionally P=PP = P^\top (i.e., PP is symmetric). In this case:

  • ker(P)=im(P)\ker(P) = \operatorname{im}(P)^\perp - the kernel is the orthogonal complement of the image.
  • Formula: P=A(AA)1AP = A(A^\top A)^{-1}A^\top for projection onto col(A)\operatorname{col}(A).

An oblique projection has P2=PP^2 = P but PPP \neq P^\top - it projects along a subspace that is not orthogonal to the target.
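
A numerical sketch of an orthogonal projection onto col(A), checking idempotency, symmetry, and rank = trace (A is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))                    # project onto a 2D subspace of R^6
P = A @ np.linalg.inv(A.T @ A) @ A.T           # orthogonal projection onto col(A)

assert np.allclose(P @ P, P)                   # idempotent: P^2 = P
assert np.allclose(P, P.T)                     # symmetric: orthogonal projection
assert np.isclose(np.trace(P), np.linalg.matrix_rank(P))   # trace = rank = 2

v = rng.normal(size=6)
residual = v - P @ v
assert np.allclose(A.T @ residual, 0.0)        # residual is orthogonal to col(A)
```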

Forward reference: Orthogonal Projections

The full theory of orthogonal projections - including the projection formula, orthogonal decompositions, and the relationship to least squares - is in 05: Orthogonality and Orthonormality.

For AI: Attention heads in transformers apply projections onto query/key/value subspaces. The head-specific projection matrices WQ,WK,WVW_Q, W_K, W_V define low-dimensional subspaces of the residual stream. The attention mechanism then computes weighted combinations within these projected spaces. Projection is also central to layer normalization (projecting onto the unit sphere in the mean-subtracted subspace).

5.2 Rotations and Reflections

Orthogonal transformations are linear maps T:RnRnT: \mathbb{R}^n \to \mathbb{R}^n that preserve inner products:

T(u),T(v)=u,vfor all u,v\langle T(\mathbf{u}), T(\mathbf{v}) \rangle = \langle \mathbf{u}, \mathbf{v} \rangle \quad \text{for all } \mathbf{u}, \mathbf{v}

Equivalently, their matrices satisfy AA=IA^\top A = I, i.e., A1=AA^{-1} = A^\top.

The set of all n×nn \times n orthogonal matrices forms the orthogonal group O(n)O(n).

Determinant splits them into two classes:

  • det(A)=+1\det(A) = +1: proper rotations - preserve orientation. These form SO(n)SO(n), the special orthogonal group.
  • det(A)=1\det(A) = -1: improper rotations (reflections, or rotation + reflection) - reverse orientation.

2D rotation by angle θ\theta:

Rθ=[cosθsinθsinθcosθ]SO(2)R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in SO(2)

Properties: RθRϕ=Rθ+ϕR_\theta R_\phi = R_{\theta+\phi}, Rθ1=Rθ=RθR_\theta^{-1} = R_{-\theta} = R_\theta^\top.

3D rotations are parametrized by an axis n^\hat{\mathbf{n}} and angle θ\theta (Rodrigues' formula):

R=I+sin(θ)K+(1cosθ)K2R = I + \sin(\theta)\, K + (1 - \cos\theta)\, K^2

where KK is the skew-symmetric matrix with Kv=n^×vK\mathbf{v} = \hat{\mathbf{n}} \times \mathbf{v}.

Householder reflections. Given a unit vector n^\hat{\mathbf{n}}, the reflection through the hyperplane orthogonal to n^\hat{\mathbf{n}} is:

H=I2n^n^H = I - 2\hat{\mathbf{n}}\hat{\mathbf{n}}^\top

Properties: H=HH = H^\top, H2=IH^2 = I, det(H)=1\det(H) = -1.
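
A sketch of a Householder reflection and its defining properties (the unit normal is chosen at random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.normal(size=3)
n /= np.linalg.norm(n)                    # unit normal to the reflecting hyperplane
H = np.eye(3) - 2.0 * np.outer(n, n)      # H = I - 2 n n^T

assert np.allclose(H, H.T)                # symmetric
assert np.allclose(H @ H, np.eye(3))      # involutive: H^2 = I, so H is orthogonal
assert np.isclose(np.linalg.det(H), -1.0) # improper: orientation-reversing
assert np.allclose(H @ n, -n)             # the normal itself is flipped
```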

For AI: Rotary Position Embedding (RoPE), used in LLaMA, GPT-NeoX, and Gemma, encodes positional information via rotation matrices applied blockwise to query and key vectors. The rotation angle depends on position, and relative positions appear as relative rotations, which interact cleanly with the dot-product attention score.

5.3 Shear and Scaling Maps

Diagonal scaling: T(x)=DxT(\mathbf{x}) = D\mathbf{x} where D=diag(d1,,dn)D = \operatorname{diag}(d_1, \ldots, d_n). Scales each coordinate independently. Kernel = {0}\{\mathbf{0}\} if all di0d_i \neq 0. det(D)=di\det(D) = \prod d_i.

Shear maps in R2\mathbb{R}^2: the horizontal shear by factor kk is:

Sk=[1k01],Sk(x,y)=(x+ky,y)S_k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}, \quad S_k(x, y) = (x + ky, y)

det(Sk)=1\det(S_k) = 1 - shear preserves area. Sk1=SkS_k^{-1} = S_{-k}.

Geometrically: the xx-axis is fixed; each horizontal line slides by a distance proportional to its height. Parallelograms are tilted but area is preserved.

Elementary matrices (from row operations) are all either shears, row swaps, or scalings:

  • Row scaling by cc: Eii(c)E_{ii}(c) - multiply row ii by cc
  • Row swap: EijE_{ij} - swap rows ii and jj
  • Row addition: Eij(c)E_{ij}(c) - add cc times row jj to row ii (this is a shear)

Every invertible matrix is a product of elementary matrices - Gaussian elimination as composition of linear maps.

5.4 The Geometry of Low-Rank Maps

A rank-kk linear map T:RnRmT: \mathbb{R}^n \to \mathbb{R}^m with k<nk < n collapses the domain: it maps all of Rn\mathbb{R}^n onto a kk-dimensional subspace of Rm\mathbb{R}^m, while sending an (nk)(n-k)-dimensional subspace (the null space) to zero.

Decomposition via SVD preview. Any rank-kk matrix admits:

A=i=1kσiuiviA = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

This is a sum of kk rank-1 outer products. Each rank-1 term σiuivi\sigma_i \mathbf{u}_i \mathbf{v}_i^\top maps everything in the direction vi\mathbf{v}_i to the direction ui\mathbf{u}_i, scaled by σi\sigma_i.

Forward reference: SVD and Low-Rank Approximation

The full theory of SVD - including the Eckart-Young theorem (best rank-kk approximation) and the geometric interpretation of singular values - is in 02: Singular Value Decomposition.

LoRA preview. A rank-rr update ΔW=BA\Delta W = BA (with BRd×rB \in \mathbb{R}^{d \times r}, ARr×kA \in \mathbb{R}^{r \times k}, rmin(d,k)r \ll \min(d,k)) is a composition of two low-rank linear maps:

  1. A:RkRrA: \mathbb{R}^k \to \mathbb{R}^r compresses the input to rr dimensions.
  2. B:RrRdB: \mathbb{R}^r \to \mathbb{R}^d expands back to dd dimensions.

The effective map ΔW=BA\Delta W = BA has rank r\leq r, so it only modifies the network's behavior along rr directions in the input space. The hypothesis underlying LoRA is that the task-relevant weight changes lie in a low-dimensional subspace - a statement directly about the geometry of linear maps.


6. Dual Spaces and Transposes

6.1 The Dual Space

Definition. Given a vector space VV over F\mathbb{F}, the dual space VV^* is the set of all linear maps from VV to F\mathbb{F}:

V=L(V,F)={f:VFf is linear}V^* = \mathcal{L}(V, \mathbb{F}) = \{f: V \to \mathbb{F} \mid f \text{ is linear}\}

Elements of VV^* are called linear functionals or covectors or one-forms.

VV^* is itself a vector space under pointwise addition (f+g)(v)=f(v)+g(v)(f+g)(\mathbf{v}) = f(\mathbf{v}) + g(\mathbf{v}) and scalar multiplication (cf)(v)=cf(v)(cf)(\mathbf{v}) = c \cdot f(\mathbf{v}).

The dual basis. If B={e1,,en}\mathcal{B} = \{\mathbf{e}_1, \ldots, \mathbf{e}_n\} is a basis for VV, then the dual basis B={e1,,en}\mathcal{B}^* = \{e^1, \ldots, e^n\} is defined by:

ei(ej)=δij={1i=j0ije^i(\mathbf{e}_j) = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}

The dual basis is a basis for VV^*, so dim(V)=dim(V)\dim(V^*) = \dim(V).

Key example: Row vectors. In Rn\mathbb{R}^n, a linear functional is any map f(x)=axf(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} for some fixed aRn\mathbf{a} \in \mathbb{R}^n. So (Rn)Rn(\mathbb{R}^n)^* \cong \mathbb{R}^n, but the identification a\mathbf{a}^\top (row vector) \leftrightarrow a\mathbf{a} (column vector) is coordinate-dependent. A row vector a\mathbf{a}^\top is intrinsically an element of (Rn)(\mathbb{R}^n)^*, not of Rn\mathbb{R}^n.

Why the distinction matters. When you write ax\mathbf{a}^\top \mathbf{x}, you're applying the dual vector a(Rn)\mathbf{a}^\top \in (\mathbb{R}^n)^* to the vector xRn\mathbf{x} \in \mathbb{R}^n. This is a pairing between a space and its dual - not a dot product of two vectors in the same space. In Riemannian geometry and general relativity, this distinction is essential. In ML, it matters for understanding gradients.

6.2 The Dual Map and the Transpose

Definition. Given T:VWT: V \to W, the dual map (or transpose) T:WVT^\top: W^* \to V^* is defined by:

(Tf)(v)=f(T(v))for fW,vV(T^\top f)(\mathbf{v}) = f(T(\mathbf{v})) \quad \text{for } f \in W^*, \mathbf{v} \in V

In words: to apply TfT^\top f to v\mathbf{v}, first apply TT to v\mathbf{v}, then apply ff to the result.

Matrix of the dual map. If TT has matrix AA in standard coordinates, then TT^\top has matrix AA^\top (the matrix transpose). The notation "TT^\top" for the dual map is intentional and consistent.

Domain and codomain switch. T:VWT: V \to W means T:WVT^\top: W^* \to V^*. The transpose reverses the direction. This is why:

  • Composing on the left: (ST)=TS(ST)^\top = T^\top S^\top (order reversal).
  • In backpropagation: if the forward pass multiplies by WW, the backward pass multiplies by WW^\top.

Kernel of the transpose = left null space:

ker(T)={wW:Tw=0}=null(A)\ker(T^\top) = \{\mathbf{w} \in W : T^\top \mathbf{w} = \mathbf{0}\} = \operatorname{null}(A^\top)

This is the left null space of AA, which is orthogonal to col(A)\operatorname{col}(A).

Annihilators. The annihilator of a subspace SVS \subseteq V is S0={fV:f(s)=0sS}S^0 = \{f \in V^* : f(\mathbf{s}) = 0 \, \forall \mathbf{s} \in S\}. Key identities:

ker(T)=(im(T))0,im(T)=(ker(T))0\ker(T^\top) = (\operatorname{im}(T))^0, \qquad \operatorname{im}(T^\top) = (\ker(T))^0

6.3 Gradients Live in the Dual Space

This connection between linear maps and gradients is one of the deepest in all of applied mathematics, and is directly relevant to understanding backpropagation.

The gradient as a linear functional. For a differentiable function f:RnRf: \mathbb{R}^n \to \mathbb{R}, the derivative at x\mathbf{x} is a linear functional Dfx:RnRDf_{\mathbf{x}}: \mathbb{R}^n \to \mathbb{R}, defined by:

Dfx(h)=limt0f(x+th)f(x)t=f(x)hDf_{\mathbf{x}}(\mathbf{h}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x})^\top \mathbf{h}

The derivative DfxDf_{\mathbf{x}} is an element of (Rn)(\mathbb{R}^n)^* - a covector, represented as a row vector f(x)\nabla f(\mathbf{x})^\top.

The gradient vector f(x)Rn\nabla f(\mathbf{x}) \in \mathbb{R}^n is obtained by identifying (Rn)(\mathbb{R}^n)^* with Rn\mathbb{R}^n via the standard inner product. This identification is coordinate-dependent: on a curved manifold (like a constraint surface or the space of probability distributions), gradients and vectors must be treated differently.

Backpropagation via dual maps. Consider a composed network y=Wx\mathbf{y} = W\mathbf{x} (one linear layer, loss =L(y)\ell = L(\mathbf{y})). The gradient with respect to x\mathbf{x} is:

x=Wy\frac{\partial \ell}{\partial \mathbf{x}} = W^\top \frac{\partial \ell}{\partial \mathbf{y}}

This is exactly the dual map TT^\top applied to the incoming gradient y\frac{\partial \ell}{\partial \mathbf{y}}. Backpropagation propagates gradients backward through the transpose of each weight matrix. The chain of transposes in the backward pass is the dual map composition:

(WLW1)=W1WL(W_L \cdots W_1)^\top = W_1^\top \cdots W_L^\top
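
This is directly observable in autograd. A minimal PyTorch sketch checking that the gradient flowing into x is W^T times the gradient flowing out of y (the shapes and loss are arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 3)
x = torch.randn(3, requires_grad=True)

y = W @ x                      # forward: linear map
loss = (y ** 3).sum()          # any downstream scalar loss
loss.backward()

grad_y = 3 * y.detach() ** 2   # d(loss)/dy, computed by hand
assert torch.allclose(x.grad, W.T @ grad_y)   # backward pass applies the dual map W^T
```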

For AI: Modern deep learning frameworks (PyTorch, JAX) compute gradients using automatic differentiation, which is precisely the evaluation of dual maps via the chain rule. Every .backward() call accumulates contributions via transpose weight matrices - it is applying the adjoint (dual) of the forward linear map.


7. Affine Transformations

7.1 Beyond Linearity: Affine Maps

A linear map fixes the origin: T(0)=0T(\mathbf{0}) = \mathbf{0}. But many practical transformations in geometry and ML need to shift the origin - they include a translation component.

Definition. A function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is an affine transformation if it has the form:

f(x)=Ax+bf(\mathbf{x}) = A\mathbf{x} + \mathbf{b}

where ARm×nA \in \mathbb{R}^{m \times n} is a matrix (the linear part) and bRm\mathbf{b} \in \mathbb{R}^m is a vector (the translation).

Affine maps are linear maps composed with a translation. They are NOT linear (unless b=0\mathbf{b} = \mathbf{0}) - they fail the zero test: f(0)=b0f(\mathbf{0}) = \mathbf{b} \neq \mathbf{0} when b0\mathbf{b} \neq \mathbf{0}.

Affine subspaces. The image of Rn\mathbb{R}^n under an affine map is an affine subspace - a translate of a linear subspace. Solutions to Ax=bA\mathbf{x} = \mathbf{b} (when they exist) form an affine subspace: x0+ker(A)\mathbf{x}_0 + \ker(A).

Composition of affine maps. If f(x)=Ax+bf(\mathbf{x}) = A\mathbf{x} + \mathbf{b} and g(y)=Cy+dg(\mathbf{y}) = C\mathbf{y} + \mathbf{d}, then:

(gf)(x)=g(Ax+b)=C(Ax+b)+d=CAx+(Cb+d)(g \circ f)(\mathbf{x}) = g(A\mathbf{x} + \mathbf{b}) = C(A\mathbf{x} + \mathbf{b}) + \mathbf{d} = CA\mathbf{x} + (C\mathbf{b} + \mathbf{d})

The composition is affine: linear part CACA, translation Cb+dC\mathbf{b} + \mathbf{d}.

7.2 Homogeneous Coordinates

The elegant trick to make affine maps linear is to lift to one higher dimension.

Definition. The homogeneous coordinates of xRn\mathbf{x} \in \mathbb{R}^n are x~=(x1)Rn+1\tilde{\mathbf{x}} = \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \in \mathbb{R}^{n+1}.

The augmented matrix. The affine map f(x)=Ax+bf(\mathbf{x}) = A\mathbf{x} + \mathbf{b} becomes:

f~(x~)=[Ab01](x1)=(Ax+b1)\tilde{f}(\tilde{\mathbf{x}}) = \begin{bmatrix} A & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = \begin{pmatrix} A\mathbf{x} + \mathbf{b} \\ 1 \end{pmatrix}

Now f~\tilde{f} is a linear map in Rn+1\mathbb{R}^{n+1}. The last coordinate is always 1 for "proper" points.

Key benefit: composition becomes matrix multiplication. Two affine maps M~1,M~2\tilde{M}_1, \tilde{M}_2 compose as:

M~2M~1=[A2b201][A1b101]=[A2A1A2b1+b201]\tilde{M}_2 \tilde{M}_1 = \begin{bmatrix} A_2 & \mathbf{b}_2 \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{bmatrix} A_1 & \mathbf{b}_1 \\ \mathbf{0}^\top & 1 \end{bmatrix} = \begin{bmatrix} A_2 A_1 & A_2\mathbf{b}_1 + \mathbf{b}_2 \\ \mathbf{0}^\top & 1 \end{bmatrix}

No special cases needed - composition is just matrix multiplication.

Inverse of an affine map:

[Ab01]1=[A1A1b01]\begin{bmatrix} A & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & -A^{-1}\mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix}

(Valid when AA is invertible.)
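
A sketch of the homogeneous-coordinate trick: lift two affine maps to (n+1)x(n+1) matrices and check that matrix multiplication reproduces their composition (the helper `lift` and the random parameters are illustrative):

```python
import numpy as np

def lift(A, b):
    """Embed the affine map x -> Ax + b as an (m+1)x(n+1) matrix."""
    m, n = A.shape
    M = np.zeros((m + 1, n + 1))
    M[:m, :n], M[:m, n], M[m, n] = A, b, 1.0
    return M

rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
A2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)
x = rng.normal(size=2)

direct = A2 @ (A1 @ x + b1) + b2              # g(f(x)) computed directly
x_tilde = np.append(x, 1.0)                   # homogeneous coordinates
lifted = lift(A2, b2) @ lift(A1, b1) @ x_tilde
assert np.allclose(lifted[:2], direct) and np.isclose(lifted[2], 1.0)
```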

In computer graphics: Every rigid body transformation (rotation + translation) is an affine map, and sequences of transformations are composed by matrix multiplication in homogeneous coordinates. This is the foundation of 3D graphics pipelines (OpenGL, Vulkan).

7.3 Neural Network Layers as Affine Maps

A single fully-connected layer computes:

z=Wx+b,a=σ(z)\mathbf{z} = W\mathbf{x} + \mathbf{b}, \quad \mathbf{a} = \sigma(\mathbf{z})

The pre-activation z=Wx+b\mathbf{z} = W\mathbf{x} + \mathbf{b} is an affine map from Rdin\mathbb{R}^{d_{\text{in}}} to Rdout\mathbb{R}^{d_{\text{out}}}. In homogeneous coordinates:

z~=[Wb01]x~\tilde{\mathbf{z}} = \begin{bmatrix} W & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix} \tilde{\mathbf{x}}

Why bias matters. Without bias (b=0\mathbf{b} = \mathbf{0}), each layer is linear and the hyperplanes separating classes must pass through the origin. With bias, the decision boundary can be placed anywhere. This is the geometric reason bias terms dramatically increase expressive power.

Multi-layer without activations. A network zL=WL(W2(W1x+b1)+b2)+bL\mathbf{z}_L = W_L(\cdots W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 \cdots) + \mathbf{b}_L is still affine:

zL=Weffx+beff\mathbf{z}_L = W_{\text{eff}}\mathbf{x} + \mathbf{b}_{\text{eff}}

Multiple affine layers compose to a single affine layer. Depth without nonlinearity gives no expressive benefit.

BatchNorm as affine rescaling. After normalizing to zero mean and unit variance, BatchNorm applies a learned affine transformation γz+β\gamma \mathbf{z} + \beta (elementwise). This is an affine map on the normalized activations, restoring the capacity to represent any desired scale and shift.

Embedding layers. An embedding layer E \in \mathbb{R}^{|V| \times d} maps token indices to vectors. Selecting the i-th embedding is equivalent to multiplying by the one-hot vector \mathbf{e}_i: E^\top \mathbf{e}_i picks out the i-th row of E. This is linear in the one-hot representation. The unembedding W_U \in \mathbb{R}^{d \times |V|} maps representations to logits: \mathbf{l} = W_U^\top \mathbf{h}, a pure linear map.


8. The Jacobian as Linear Approximation

8.1 Linearization of Nonlinear Maps

Every differentiable function is "locally linear" - at any point, it looks like a linear map to first order. This principle underlies calculus, optimization, and all of numerical analysis.

Definition (Total Derivative). A function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m is differentiable at x\mathbf{x} if there exists a linear map Dfx:RnRmDf_{\mathbf{x}}: \mathbb{R}^n \to \mathbb{R}^m such that:

limh0f(x+h)f(x)Dfx(h)h=0\lim_{\lVert\mathbf{h}\rVert \to 0} \frac{\lVert f(\mathbf{x} + \mathbf{h}) - f(\mathbf{x}) - Df_{\mathbf{x}}(\mathbf{h})\rVert}{\lVert\mathbf{h}\rVert} = 0

The linear map DfxDf_{\mathbf{x}} is the total derivative or Frechet derivative of ff at x\mathbf{x}. The first-order approximation is:

f(x+h)f(x)+Dfx(h)f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + Df_{\mathbf{x}}(\mathbf{h})

For small h\mathbf{h}, the function looks like an affine map: constant f(x)f(\mathbf{x}) plus linear correction Dfx(h)Df_{\mathbf{x}}(\mathbf{h}).

Geometric meaning. At each point x\mathbf{x}, the total derivative DfxDf_{\mathbf{x}} is the best linear approximation to ff near x\mathbf{x}. It maps directions (tangent vectors at x\mathbf{x}) to directions (tangent vectors at f(x)f(\mathbf{x})). This is the pushforward of vectors.

8.2 The Jacobian Matrix

Definition. The matrix representation of DfxDf_{\mathbf{x}} in standard coordinates is the Jacobian matrix:

Jf(x)=[f1x1f1xnfmx1fmxn]Rm×nJ_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}

Entry (i,j)(i, j): Jij=fixjJ_{ij} = \frac{\partial f_i}{\partial x_j} - how the ii-th output changes with the jj-th input.

Shape mnemonic: output dimension ×\times input dimension: (m×n)(m \times n) for f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m.

Special cases:

  • f:RnRf: \mathbb{R}^n \to \mathbb{R} (scalar function): Jacobian is the row gradient fR1×n\nabla f^\top \in \mathbb{R}^{1 \times n}.
  • f:RRmf: \mathbb{R} \to \mathbb{R}^m (curve): Jacobian is the column tangent vector f(t)Rmf'(t) \in \mathbb{R}^m.
  • f:RnRnf: \mathbb{R}^n \to \mathbb{R}^n (same dim): JfJ_f is square; det(Jf)\det(J_f) is the local volume scaling.

The Jacobian of softmax. For s=softmax(z)\mathbf{s} = \operatorname{softmax}(\mathbf{z}) where si=ezi/kezks_i = e^{z_i} / \sum_k e^{z_k}:

Jsoftmax(z)=diag(s)ssJ_{\operatorname{softmax}}(\mathbf{z}) = \operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top

This is a rank-deficient matrix (rank n1n-1) because softmax outputs sum to 1: the Jacobian has a constant vector 1\mathbf{1} in its null space.
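
A numerical check of the softmax Jacobian formula against finite differences, plus its rank deficiency - a minimal NumPy sketch (the input vector is arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0, 0.3])
s = softmax(z)
J = np.diag(s) - np.outer(s, s)                 # analytic Jacobian

# Central-difference Jacobian, column by column
eps = 1e-6
J_fd = np.column_stack([(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
assert np.allclose(J, J_fd, atol=1e-6)

print(np.linalg.matrix_rank(J))                 # 3 = n - 1
assert np.allclose(J @ np.ones(4), 0.0)         # the constant vector lies in the null space
```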

The Jacobian of ReLU. For a=ReLU(z)\mathbf{a} = \operatorname{ReLU}(\mathbf{z}) (elementwise):

JReLU(z)=diag(1[z>0])J_{\operatorname{ReLU}}(\mathbf{z}) = \operatorname{diag}(\mathbf{1}[\mathbf{z} > 0])

A diagonal matrix with 1s where zi>0z_i > 0 and 0s elsewhere. ReLU's Jacobian is a projection - it zeroes out the gradient for dead neurons.

The Jacobian of layer normalization. More complex: layer norm applies mean subtraction, variance normalization, and an affine transformation. Its Jacobian is a projection onto a specific hyperplane (orthogonal to the constant vector), scaled and shifted.

8.3 Chain Rule = Composition of Jacobians

The chain rule for vector-valued functions states: if f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m and g:RmRpg: \mathbb{R}^m \to \mathbb{R}^p, then:

Jgf(x)=Jg(f(x))Jf(x)J_{g \circ f}(\mathbf{x}) = J_g(f(\mathbf{x})) \cdot J_f(\mathbf{x})

This is matrix multiplication of Jacobians. The composition of two differentiable maps has a Jacobian equal to the matrix product of their individual Jacobians (evaluated at the appropriate points).
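
The identity J_{g \circ f} = J_g J_f can be checked directly with autograd's Jacobian utility. A minimal PyTorch sketch with small nonlinear maps chosen purely for illustration:

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
A = torch.randn(4, 3)
B = torch.randn(2, 4)

f = lambda x: torch.tanh(A @ x)         # f: R^3 -> R^4
g = lambda y: B @ (y ** 2)              # g: R^4 -> R^2

x = torch.randn(3)
J_f = jacobian(f, x)                    # 4 x 3
J_g = jacobian(g, f(x))                 # 2 x 4, evaluated at f(x)
J_gf = jacobian(lambda t: g(f(t)), x)   # 2 x 3

assert torch.allclose(J_gf, J_g @ J_f, atol=1e-5)   # chain rule = Jacobian product
```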

Backpropagation is reverse-mode Jacobian accumulation. For a network y=fLf1(x)\mathbf{y} = f_L \circ \cdots \circ f_1(\mathbf{x}) with loss =L(y)\ell = L(\mathbf{y}):

x=Jf1JfLyL\frac{\partial \ell}{\partial \mathbf{x}} = J_{f_1}^\top \cdots J_{f_L}^\top \nabla_{\mathbf{y}} L

Reading right to left: start with yL\nabla_{\mathbf{y}} L, multiply by JfLJ_{f_L}^\top, then JfL1J_{f_{L-1}}^\top, etc. At each step, we multiply by the transpose Jacobian of a layer - which is the dual map of that layer's linear approximation.

Computational graph view. Each node in the computation graph stores its local Jacobian. Forward pass evaluates the functions; backward pass multiplies the transpose Jacobians (right to left). This is the mathematical content of autograd.

BACKPROPAGATION AS JACOBIAN CHAIN
========================================================================

  Forward:  x --J_1--> z_1 --J_2--> z_2 -- ... --J_L--> y --> loss

  Backward: dL/dx <--J_1^T-- dL/dz_1 <--J_2^T-- dL/dz_2 <-- ... <--J_L^T-- dL/dy

  Each backward step:  (grad at input) = J^T x (grad at output)
                     = (dual map) applied to incoming gradient

========================================================================

Why Jacobians matter for training dynamics:

  • Large Jacobian singular values -> exploding gradients
  • Small Jacobian singular values -> vanishing gradients
  • The spectral norm of JfJ_f measures how much the linear approximation of ff can amplify inputs

For AI: Gradient clipping, careful weight initialization (Xavier, He), and residual connections all address the Jacobian conditioning problem. Residual connections add an identity Jacobian contribution: Jx+F(x)=I+JFJ_{x + F(x)} = I + J_F, which prevents the singular values from collapsing to zero (vanishing gradients).


9. Applications in Machine Learning

9.1 Attention as Linear Projections

The scaled dot-product attention mechanism (Vaswani et al., 2017) is built entirely from linear transformations applied to a sequence of token representations.

Setup. Given an input sequence XRL×dX \in \mathbb{R}^{L \times d} (L tokens, d-dimensional residual stream), three learned projection matrices WQ,WKRd×dkW_Q, W_K \in \mathbb{R}^{d \times d_k} and WVRd×dvW_V \in \mathbb{R}^{d \times d_v} define:

Q=XWQ,K=XWK,V=XWVQ = XW_Q, \quad K = XW_K, \quad V = XW_V

Each is a linear transformation: QQ projects each token into "query space", KK into "key space", VV into "value space".

Attention scores. The attention pattern α=softmax(QK/dk)\alpha = \operatorname{softmax}(QK^\top / \sqrt{d_k}) is a soft selection matrix. For a fixed query vector q\mathbf{q}, the attention scores qK\mathbf{q}^\top K^\top are dot products in key space - a linear functional applied to each key.

Output. Attn(Q,K,V)=αV\operatorname{Attn}(Q,K,V) = \alpha V. The output is a weighted linear combination of value vectors - a linear operation parameterized by α\alpha. The output projection WORdv×dW_O \in \mathbb{R}^{d_v \times d} maps back to the residual stream: another linear map.
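A minimal NumPy sketch of the single-head computation just described (all sizes are illustrative):

```python
import numpy as np
rng = np.random.default_rng(0)

L_tok, d, d_k, d_v = 6, 32, 8, 8                      # illustrative sizes
X   = rng.standard_normal((L_tok, d))                  # token representations (rows)
W_Q = rng.standard_normal((d, d_k))
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # three linear projections of the same tokens

scores = Q @ K.T / np.sqrt(d_k)                        # bilinear interaction of queries and keys
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)             # row-wise softmax: a soft selection matrix

out = alpha @ V                                        # weighted linear combination of value vectors
print(out.shape)                                       # (6, 8)
```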

Multi-head attention. Each head hh applies its own projection matrices (WQh,WKh,WVh)(W_Q^h, W_K^h, W_V^h), computes attention independently, and the results are concatenated and projected:

MHA(X)=Concat(head1,,headH)WO\operatorname{MHA}(X) = \operatorname{Concat}(\operatorname{head}_1, \ldots, \operatorname{head}_H) W_O

This is a composition and concatenation of linear maps. The expressivity comes from the multiplicative interaction QKQK^\top (which is bilinear, not linear) - but conditioned on fixed α\alpha, the rest is linear.

For AI: The "OV circuit" and "QK circuit" decomposition (Elhage et al., 2021) analyzes transformer attention by studying the linear maps WOWVW_OW_V (value writing) and WQWKW_QW_K^\top (key-query matching) separately. This is possible precisely because attention is compositionally linear.

9.2 LoRA: Fine-Tuning via Low-Rank Composition

Low-Rank Adaptation (Hu et al., 2021) is one of the most important parameter-efficient fine-tuning methods, and its design is a direct application of linear map theory.

Setup. Freeze the pretrained weight W0Rd×kW_0 \in \mathbb{R}^{d \times k}. Add a trainable low-rank update:

W=W0+ΔW=W0+BAW = W_0 + \Delta W = W_0 + BA

where BRd×rB \in \mathbb{R}^{d \times r}, ARr×kA \in \mathbb{R}^{r \times k}, and rmin(d,k)r \ll \min(d, k).

Rank-nullity interpretation. The map ΔW=BA\Delta W = BA has rank at most rr. By rank-nullity:

dim(ker(ΔW))kr\dim(\ker(\Delta W)) \geq k - r

The update only changes the network's behavior along at most rr directions in the kk-dimensional input space. When r=8r = 8 and k=4096k = 4096, only 8/40960.2%8/4096 \approx 0.2\% of input directions are affected - the update is extremely sparse in "direction space."

The two-layer interpretation. The update ΔW=BA\Delta W = BA is a composition of two maps:

  1. A:RkRrA: \mathbb{R}^k \to \mathbb{R}^r - compression to rank-rr space
  2. B:RrRdB: \mathbb{R}^r \to \mathbb{R}^d - expansion back to full dimension

Initializing AA randomly and B=0B = 0 ensures ΔW=0\Delta W = 0 at the start of fine-tuning, so the model starts from the pretrained weights.

Parameter count. ΔW\Delta W would have dkdk parameters. BABA uses dr+rk=r(d+k)dr + rk = r(d+k) parameters. Savings ratio: r(d+k)dk=rd+rk2rmin(d,k)\frac{r(d+k)}{dk} = \frac{r}{d} + \frac{r}{k} \approx \frac{2r}{\min(d,k)}. For r=8r=8, d=k=4096d=k=4096: savings factor of 256×\approx 256\times.
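A small numerical confirmation of the rank cap and the parameter count, using the smaller sizes from Exercise 7 so the check runs instantly:

```python
import numpy as np
rng = np.random.default_rng(0)

d = k = 512
r = 4
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
dW = B @ A                                    # the LoRA update Delta W = BA

print(np.linalg.matrix_rank(dW))              # 4: the rank is capped at r
sv = np.linalg.svd(dW, compute_uv=False)
print(np.sum(sv > 1e-8 * sv[0]))              # only r of the 512 singular values are nonzero

full, lora = d * k, r * (d + k)
print(full, lora, full / lora)                # 262144 vs 4096 parameters -> 64x fewer
```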

Extensions: DoRA (Weight-Decomposition Low-Rank Adaptation, 2024) decomposes WW into magnitude and direction, applying LoRA only to the direction component. GaLore (2024) applies LoRA-style updates to gradients rather than weights.

9.3 Linear Probes and the Linear Representation Hypothesis

Linear probing tests whether a feature is linearly decodable from a model's representations. Given representations {hi}\{\mathbf{h}_i\} and labels {yi}\{y_i\}, train a linear classifier:

y^=sign(wh+b)\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{h} + b)

If the probe achieves high accuracy, the feature is linearly represented - it corresponds to a direction w\mathbf{w} in representation space.
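A self-contained sketch of such a probe on synthetic embeddings (this mirrors Exercise 8; the least-squares probe is a stand-in for logistic regression):

```python
import numpy as np
rng = np.random.default_rng(0)

n, dim = 500, 64
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)                 # the hypothesized feature direction
labels = rng.choice([-1.0, 1.0], size=n)
H = np.outer(labels, direction) + 0.5 * rng.standard_normal((n, dim))   # embeddings: +/- direction + noise

# Least-squares linear probe (a minimal stand-in for logistic regression).
w, *_ = np.linalg.lstsq(H, labels, rcond=None)
print(np.mean(np.sign(H @ w) == labels))               # probe accuracy, close to 1.0

# The top principal component of the centered embeddings recovers the feature direction.
_, _, Vt = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)
print(abs(Vt[0] @ direction))                          # cosine similarity, close to 1.0
```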

The Linear Representation Hypothesis (Mikolov et al., 2013; Elhage et al., 2022; Park et al., 2023) states that high-level features (sentiment, syntax, factual attributes, world models) are encoded as directions in the residual stream - i.e., as linear features.

Evidence:

  • Word2vec arithmetic: v(king)v(man)+v(woman)v(queen)\mathbf{v}(\text{king}) - \mathbf{v}(\text{man}) + \mathbf{v}(\text{woman}) \approx \mathbf{v}(\text{queen}). Semantic relationships are linear offsets.
  • Steering vectors: adding cdc\mathbf{d} to all residual stream activations controls model behavior (e.g., "banana" direction, sentiment directions).
  • Probing studies: most tested syntactic and semantic features are linearly decodable.

Superposition. When there are more features than dimensions, the model stores features in near-orthogonal directions that partially overlap (superposition). This is still linear representation - just with interference.

For AI: If the linear representation hypothesis holds broadly, then:

  • Linear algebra provides the right toolkit for model interpretability.
  • Interventions on model behavior reduce to vector addition in representation space.
  • Feature extraction is a linear map - PCA/SVD on activations finds meaningful directions.

9.4 Embedding and Unembedding

Token embeddings. The embedding matrix WERV×dW_E \in \mathbb{R}^{|V| \times d} maps vocabulary indices to dd-dimensional vectors. Indexing row ii of WEW_E is equivalent to the linear map eiWE\mathbf{e}_i^\top W_E (one-hot selection). The embedding layer is linear in the one-hot representation.

Unembedding. The unembedding matrix WURd×VW_U \in \mathbb{R}^{d \times |V|} maps residual stream vectors to logits over the vocabulary:

\mathbf{l} = W_U^\top \mathbf{h} \in \mathbb{R}^{|V|}

This is a pure linear map. The logit for token v is l_v = \mathbf{w}_{U,v}^\top \mathbf{h} - a dot product (linear functional) between the unembedding direction \mathbf{w}_{U,v} (the v-th column of W_U) and the residual stream.

Logit lens. Applying WUW_U to intermediate residual stream states (before the final layer) gives "early predictions" - showing what the model is computing at each layer. This technique (Nostalgebraist, 2020) is possible because unembedding is linear.

Tied embeddings. Many models (GPT-2, LLaMA variants) use WU=WEW_U = W_E^\top - the same matrix for both embedding and unembedding. This enforces consistency: the most likely next token vv after seeing context h\mathbf{h} is the one whose embedding WE[v]W_E[v] has the highest dot product with h\mathbf{h} - i.e., argmaxvWE[v]h\operatorname{argmax}_v W_E[v] \cdot \mathbf{h}.
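A toy sketch of the tied read-out (vocabulary size, width, and the residual state are made up; with tied weights the logits are simply W_E \mathbf{h}):

```python
import numpy as np
rng = np.random.default_rng(0)

V, d = 1000, 64                                  # toy vocabulary size and residual-stream width
W_E = rng.standard_normal((V, d)) / np.sqrt(d)   # embedding matrix; tied unembedding is W_U = W_E^T
h = rng.standard_normal(d)                       # a residual-stream state (e.g., at an intermediate layer)

logits = W_E @ h                                 # l = W_U^T h = W_E h: one dot product per vocabulary item
top5 = np.argsort(logits)[-5:][::-1]             # "logit lens" read-out: most likely tokens at this point
print(top5, logits[top5])
```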


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Assuming T(\mathbf{0}) \neq \mathbf{0} is possible for a linear map | Homogeneity immediately gives T(\mathbf{0}) = T(0\cdot\mathbf{v}) = 0\cdot T(\mathbf{v}) = \mathbf{0}. Any map with T(\mathbf{0}) \neq \mathbf{0} is affine or nonlinear. | The zero-test is the fastest way to rule out linearity. |
| 2 | Treating translation as linear | f(\mathbf{x}) = \mathbf{x} + \mathbf{b} fails additivity: f(\mathbf{u}+\mathbf{v}) = \mathbf{u}+\mathbf{v}+\mathbf{b} \neq f(\mathbf{u})+f(\mathbf{v}) = \mathbf{u}+\mathbf{v}+2\mathbf{b}. | Neural network layers are affine (W\mathbf{x}+\mathbf{b}), not linear. This matters for composing and inverting. |
| 3 | Confusing the rank of a map with the rank of a matrix | Rank is basis-independent (it is the dimension of the image), but the matrix depends on the basis. | \operatorname{rank}(T) = \operatorname{rank}([T]_{\mathcal{B}}^{\mathcal{C}}) for any bases - rank is an invariant of the map, not of a particular matrix. |
| 4 | Misapplying rank-nullity | Rank-nullity involves the domain dimension, not the codomain. For T: \mathbb{R}^5 \to \mathbb{R}^3, rank + nullity = 5, not 3. | Identify \dim(V) explicitly before applying the theorem. |
| 5 | Assuming AB = BA | Matrix multiplication (= composition of linear maps) is generally not commutative. Even for square matrices, AB \neq BA in general. | Non-commutativity is the default. Commutativity (e.g., diagonal matrices, polynomial functions of a single matrix) is the special case. |
| 6 | Confusing similar matrices with equal matrices | A \sim B means they represent the same map in different bases - same eigenvalues, trace, determinant. But A \neq B in general. | P^{-1}AP = A only when P commutes with A (e.g., P = I); in general, similar matrices are different matrices for the same map. |
| 7 | Thinking the kernel is trivial when A is tall | A tall matrix A \in \mathbb{R}^{m \times n} with m > n can still have a non-trivial null space if its columns are linearly dependent. | Compute \operatorname{rank}(A). Nullity = n - \operatorname{rank}(A), regardless of whether m > n. |
| 8 | Applying the inverse when only a one-sided inverse exists | A non-square A \in \mathbb{R}^{m \times n} (m \neq n) cannot have a two-sided inverse; A^{-1} is undefined. | Use the pseudo-inverse A^+ for the least-squares solution. |
| 9 | Forgetting that the Jacobian is a linear map, not just a list of partial derivatives | The Jacobian matrix is the coordinate representation of the total derivative Df_{\mathbf{x}}. The partial derivatives give its columns but do not by themselves prove differentiability. | Differentiability requires the linear approximation error to vanish; existence plus continuity of the partials is a sufficient condition. |
| 10 | Treating the gradient as a vector in the same space | Strictly, \nabla f(\mathbf{x}) \in (\mathbb{R}^n)^* (the dual space). In Euclidean space with the standard basis, the identification (\mathbb{R}^n)^* \cong \mathbb{R}^n makes this invisible, but for optimization on manifolds or with non-Euclidean metrics, treating gradients as primal vectors gives wrong answers. | Treat the gradient as a covector when working with Fisher information, natural gradient, or Riemannian optimization. |
| 11 | Assuming all bijective functions are linear isomorphisms | A function can be bijective but not linear: f: \mathbb{R} \to \mathbb{R}, f(x) = x^3 is bijective but f(1+1) = 8 \neq f(1) + f(1) = 2. | Isomorphisms must be both bijective AND linear. Check both conditions. |
| 12 | Forgetting that \operatorname{im}(T) lives in W, not in V | The image is a subspace of the codomain W, not the domain V. The null space is in V. | Draw the diagram: T: V \to W, \ker(T) \subseteq V, \operatorname{im}(T) \subseteq W. |

11. Exercises

Exercise 1 * - Kernel and Image Computation

Goal: Given explicit matrices, compute kernel and image and verify rank-nullity.

(a) For A=[123456789]A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}: find a basis for ker(A)\ker(A), a basis for im(A)\operatorname{im}(A), and verify rank + nullity = 3.

(b) For B=[100010]B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} (projection to first two coordinates): identify kernel and image without computation.

(c) For the differentiation map D:P3P2D: \mathcal{P}_3 \to \mathcal{P}_2 with matrix as in 3.1: find kernel and image dimensions.

Exercise 2 * - Matrix of a Linear Map in Non-Standard Basis

Goal: Construct the matrix of a given transformation in a specified basis.

Let T:R2R2T: \mathbb{R}^2 \to \mathbb{R}^2 be reflection across the line y=xy = x (i.e., T(x,y)=(y,x)T(x,y) = (y,x)). Basis B={(1,1)/2,(1,1)/2}\mathcal{B} = \{(1,1)/\sqrt{2}, (1,-1)/\sqrt{2}\}.

(a) Write the standard matrix AA of TT in the standard basis.

(b) Compute the change-of-basis matrix PP from B\mathcal{B} to the standard basis.

(c) Compute [T]B=P1AP[T]_{\mathcal{B}} = P^{-1}AP. What is notable about this result?

(d) Explain geometrically why TT is diagonal in the basis B\mathcal{B}.

Exercise 3 * - Rank-Nullity and Linear Systems

Goal: Use rank-nullity to understand solution sets of linear systems.

(a) AR4×6A \in \mathbb{R}^{4 \times 6} has rank 3. What is the nullity? How many free variables are there in Ax=0A\mathbf{x} = \mathbf{0}?

(b) BR6×4B \in \mathbb{R}^{6 \times 4} has rank 4. Is Bx=bB\mathbf{x} = \mathbf{b} always solvable? What is the nullity of BB?

(c) Prove: if T:VVT: V \to V is a linear map on a finite-dimensional space, then TT is injective if and only if TT is surjective.

Exercise 4 ** - Projection Operator Construction

Goal: Build an orthogonal projection onto a given subspace and verify idempotency.

Let S=span{(1,2,0),(0,1,1)}R3S = \operatorname{span}\{(1, 2, 0), (0, 1, 1)\} \subseteq \mathbb{R}^3.

(a) Orthonormalize the spanning set using Gram-Schmidt.

(b) Construct the projection matrix P=QQP = QQ^\top where QQ has the orthonormal basis as columns.

(c) Verify: P2=PP^2 = P, P=PP = P^\top, and rank(P)=tr(P)=2\operatorname{rank}(P) = \operatorname{tr}(P) = 2.

(d) Compute (IP)(I-P) and verify it is also a projection. What does it project onto?

Exercise 5 ** - Jacobian of Softmax

Goal: Derive and verify the Jacobian of the softmax function.

(a) Derive sizj\frac{\partial s_i}{\partial z_j} for s=softmax(z)\mathbf{s} = \operatorname{softmax}(\mathbf{z}), handling the cases i=ji = j and iji \neq j separately.

(b) Show that the Jacobian is J=diag(s)ssJ = \operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top.

(c) Verify that J1=0J\mathbf{1} = \mathbf{0} (the Jacobian kills constant vectors). Why does this make sense?

(d) Show that the rank of JJ is at most n1n-1.

Exercise 6 ** - Affine Map Composition via Homogeneous Coordinates

Goal: Compose affine transformations using augmented matrices.

In R2\mathbb{R}^2, let:

  • f_1: rotation by 45^\circ, then translation by (1, 0)
  • f_2: scaling by 2 in both dimensions, then translation by (0, 1)

(a) Write augmented 3×33 \times 3 matrices M~1\tilde{M}_1 and M~2\tilde{M}_2 for f1f_1 and f2f_2.

(b) Compute M~2M~1\tilde{M}_2 \tilde{M}_1 (apply f1f_1 then f2f_2). What are the effective rotation, scale, and translation?

(c) Apply the composition to the point (1,0)(1, 0). Verify by applying f1f_1 then f2f_2 directly.

Exercise 7 *** - LoRA Rank Analysis

Goal: Analyze the geometry of low-rank weight updates.

Let W0R512×512W_0 \in \mathbb{R}^{512 \times 512} be a pretrained weight. A LoRA update with rank r=4r = 4 is ΔW=BA\Delta W = BA where BR512×4B \in \mathbb{R}^{512 \times 4}, AR4×512A \in \mathbb{R}^{4 \times 512}.

(a) What is rank(ΔW)\operatorname{rank}(\Delta W)? What is the nullity of ΔW\Delta W?

(b) Generate random BB and AA and numerically verify: rank(BA)4\operatorname{rank}(BA) \leq 4.

(c) For a fixed input xR512\mathbf{x} \in \mathbb{R}^{512}, compute ΔWx=BAx\Delta W \mathbf{x} = BA\mathbf{x}. Show this is in col(B)\operatorname{col}(B).

(d) Compute the singular values of ΔW\Delta W. How many are nonzero? What does this tell you about the effective dimension of the update?

(e) Compare the number of trainable parameters: 512×512512 \times 512 vs r(512+512)r(512 + 512) for r=4,8,16,32r = 4, 8, 16, 32.

Exercise 8 *** - Linear Probing and the Linear Representation Hypothesis

Goal: Empirically test whether a concept is linearly represented in embedding space.

(a) Generate synthetic embeddings for binary sentiment: positive embeddings clustered near d\mathbf{d}, negative near d-\mathbf{d} for a fixed direction d\mathbf{d}, plus noise.

(b) Train a linear probe (logistic regression on top of embeddings) and measure accuracy.

(c) Apply PCA to the embeddings. Show that the first principal component aligns with d\mathbf{d}.

(d) Add a "superposition" condition: embed two independent binary features in d=3d = 3 dimensions. Show that both features can be linearly decoded but with interference.

(e) Compute the mutual coherence of the feature directions. How does it relate to probe accuracy?


12. Why This Matters for AI (2026 Perspective)

| Concept | AI Application | Why It Matters |
| --- | --- | --- |
| Linear map axioms | Neural layer computation | Every forward pass is a composition of linear maps + activations; understanding linearity separates what the layer does from what nonlinearity adds |
| Kernel and image | Information compression | The null space of a weight matrix is "dead information" - inputs in \ker(W) produce no signal. Attention heads have low-rank structure that defines their effective null space |
| Rank-nullity theorem | LoRA, model compression | LoRA exploits rank-nullity: a rank-r update has a (k-r)-dimensional null space, leaving most of the input space unaffected - this is why it works with few parameters |
| Change of basis | Diagonalization, eigenfeatures | The eigenvalue decomposition of weight matrices (studied in mechanistic interpretability) is a change to the "natural basis" in which the layer acts by simple scaling |
| Composition = multiplication | Deep network analysis | The effective weight of a k-layer linear network is one matrix W_L \cdots W_1 - depth without nonlinearity adds no expressivity. All depth benefits require nonlinearity |
| Projection operators | Attention heads, layer norm | Attention heads project onto query/key/value subspaces; layer norm projects onto the hyperplane of mean-zero vectors; understanding projections clarifies what information is preserved |
| Affine maps + bias | Universal approximation | Bias terms are essential for shifting decision boundaries. Without bias, the model can represent only linear functions, not general affine ones. Universal approximation requires affine layers |
| Jacobian and chain rule | Backpropagation | Every .backward() call is Jacobian-matrix multiplication via the chain rule. Gradient explosion/vanishing is about Jacobian singular value growth/decay through layers |
| Dual maps and transposes | Gradient computation | The backward pass uses transpose weight matrices - these are the dual maps. Natural gradient and Fisher information methods exploit the geometry of the dual space |
| Linear representation hypothesis | Mechanistic interpretability | If features are linear, activation patching, steering vectors, and linear probing all work. This is why "linear algebra for interpretability" (e.g., representation engineering, logit lens) is a coherent research program |

13. Conceptual Bridge

Looking Back

This section builds on two foundational pillars from earlier in the curriculum.

From Chapter 2 06: Vector Spaces and Subspaces: We studied the axioms of abstract vector spaces (closure, associativity, identity, inverse, distributivity) and their subspaces (spans, null spaces, column spaces). Linear transformations are the maps between these abstract structures - the morphisms of the category of vector spaces. The four fundamental subspaces (column space, null space, row space, left null space) that were defined for matrices are now understood as im(T)\operatorname{im}(T), ker(T)\ker(T), im(T)\operatorname{im}(T^\top), and ker(T)\ker(T^\top) - intrinsic properties of the map, not the matrix.

From 01: Eigenvalues and Eigenvectors and 02: SVD: Both eigendecomposition and SVD are studied as decompositions of linear maps into simple pieces. Eigendecomposition finds a basis in which TT acts by scaling. SVD finds two orthonormal bases (for domain and codomain) in which TT acts by scaling. These are special cases of the general change-of-basis machinery developed here in 3.

From 03: PCA: PCA uses the linear structure of the data covariance matrix - the covariance C=XX/(n1)C = X^\top X / (n-1) is built from linear maps - to find the principal directions via SVD. The whitening transform and PCA projection are linear maps, and their geometry (dimension reduction, preserved variance) follows directly from the rank-nullity theorem.

Looking Forward

The abstract machinery of linear transformations will appear throughout the rest of the curriculum in concrete technical forms.

05: Orthogonality and Orthonormality: Gram-Schmidt is an algorithm that constructs a specific linear map (change of basis to an orthonormal basis). QR decomposition A=QRA = QR factors a linear map as an orthogonal map QQ followed by a triangular map RR. Orthogonal projections (5.1 here) are studied in depth there.

06: Matrix Norms: The spectral norm A2=σ1(A)\lVert A \rVert_2 = \sigma_1(A) measures how much a linear map can amplify vectors. The nuclear norm A=σi\lVert A \rVert_* = \sum \sigma_i measures the "effective rank." These norms quantify properties of the linear map that are invisible from the matrix entries.

Chapter 4: Calculus Fundamentals: The Jacobian (8 here) is the bridge between linear algebra and calculus. Multivariable calculus is essentially the study of how linear maps (Jacobians) approximate smooth nonlinear maps. The Hessian is the second-order analogue - a bilinear map that measures curvature. The implicit function theorem, inverse function theorem, and change-of-variables formula all rely on the Jacobian being an invertible linear map at the point of interest.

Curriculum Position

POSITION IN THE MATH FOR LLMS CURRICULUM
========================================================================

  Chapter 1: Mathematical Foundations
  Chapter 2: Linear Algebra Basics
    +-- Vectors, Matrix Operations, Systems of Equations
    +-- Determinants, Matrix Rank
    +-- Vector Spaces ----------------------------+
                                                  | prerequisite
  Chapter 3: Advanced Linear Algebra              |
    +-- 01-Eigenvalues --------------------------+
    +-- 02-SVD ---------------------------------+
    +-- 03-PCA ---------------------------------+
    |                                             |
    +-- 04-Linear Transformations <--------------+
    |      (YOU ARE HERE)
    |      Kernel, Image, Rank-Nullity
    |      Matrix Representation, Change of Basis
    |      Composition, Isomorphisms, Jacobian
    |      Affine Maps, Dual Spaces, AI Applications
    |                                             |
    +-- 05-Orthogonality <----------------------+
    +-- 06-Matrix Norms <-----------------------+
    +-- 07-Positive Definite Matrices           |
    +-- 08-Matrix Decompositions               |
                                                v
  Chapter 4: Calculus Fundamentals
    (Jacobians and chain rule developed further)

========================================================================

The unifying theme: Every major algorithm in deep learning is a composition of linear maps and nonlinearities. Understanding this section means understanding the language in which every neural network, every attention mechanism, every gradient computation, and every interpretability method is written. The matrix is not the territory - but knowing how to move between coordinate representations (change of basis), how to measure what a map collapses (kernel and rank), how to compose maps (matrix multiplication), and how to invert them (isomorphisms and pseudo-inverses) gives you the full algebraic toolkit to reason about any linear system you encounter in machine learning.




Appendix A: Extended Examples and Computations

A.1 Computing the Matrix of Differentiation

We work out the differentiation operator in full detail to build intuition for linear maps on function spaces.

Setting. Let V=P3={a0+a1x+a2x2+a3x3}V = \mathcal{P}_3 = \{a_0 + a_1 x + a_2 x^2 + a_3 x^3\} with basis BV={1,x,x2,x3}\mathcal{B}_V = \{1, x, x^2, x^3\} and W=P2W = \mathcal{P}_2 with basis BW={1,x,x2}\mathcal{B}_W = \{1, x, x^2\}.

Define D:VWD: V \to W by D(p)=pD(p) = p' (differentiation).

Step 1: Apply DD to each basis vector of VV.

  • D(1)=0=01+0x+0x2D(1) = 0 = 0 \cdot 1 + 0 \cdot x + 0 \cdot x^2 -> coordinate vector (0,0,0)(0, 0, 0)^\top
  • D(x)=1=11+0x+0x2D(x) = 1 = 1 \cdot 1 + 0 \cdot x + 0 \cdot x^2 -> coordinate vector (1,0,0)(1, 0, 0)^\top
  • D(x2)=2x=01+2x+0x2D(x^2) = 2x = 0 \cdot 1 + 2 \cdot x + 0 \cdot x^2 -> coordinate vector (0,2,0)(0, 2, 0)^\top
  • D(x3)=3x2=01+0x+3x2D(x^3) = 3x^2 = 0 \cdot 1 + 0 \cdot x + 3 \cdot x^2 -> coordinate vector (0,0,3)(0, 0, 3)^\top

Step 2: Assemble into matrix.

[D]BVBW=[010000200003][D]_{\mathcal{B}_V}^{\mathcal{B}_W} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}

Step 3: Use the matrix. Differentiate p(x)=2+3xx2+4x3p(x) = 2 + 3x - x^2 + 4x^3. In BV\mathcal{B}_V-coordinates: [p]BV=(2,3,1,4)[p]_{\mathcal{B}_V} = (2, 3, -1, 4)^\top.

[D(p)]BW=[010000200003](2314)=(3212)[D(p)]_{\mathcal{B}_W} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix} \begin{pmatrix} 2 \\ 3 \\ -1 \\ 4 \end{pmatrix} = \begin{pmatrix} 3 \\ -2 \\ 12 \end{pmatrix}

So D(p)=32x+12x2D(p) = 3 - 2x + 12x^2. Check: (2+3xx2+4x3)=32x+12x2(2 + 3x - x^2 + 4x^3)' = 3 - 2x + 12x^2. OK

Kernel of DD: The null space is {p:p=0}={constants}=span{1}\{p : p' = 0\} = \{\text{constants}\} = \operatorname{span}\{1\}. Dimension 1.

Image of DD: All polynomials of degree 2\leq 2 (since every qP2q \in \mathcal{P}_2 equals (p)(p)' for some pP3p \in \mathcal{P}_3). Dimension 3.

Rank-nullity check: dim(P3)=4=1+3=nullity+rank\dim(\mathcal{P}_3) = 4 = 1 + 3 = \text{nullity} + \text{rank}. OK
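The same computation in NumPy, acting on coefficient vectors (a direct transcription of the worked example):

```python
import numpy as np

# Matrix of D: P_3 -> P_2 in the monomial bases {1, x, x^2, x^3} and {1, x, x^2}.
D = np.array([[0., 1., 0., 0.],
              [0., 0., 2., 0.],
              [0., 0., 0., 3.]])

p = np.array([2., 3., -1., 4.])                   # coefficients of 2 + 3x - x^2 + 4x^3
print(D @ p)                                      # [ 3. -2. 12.]  i.e.  3 - 2x + 12x^2

rank = np.linalg.matrix_rank(D)
print(rank, D.shape[1] - rank)                    # rank 3, nullity 1; 3 + 1 = 4 = dim(P_3)
```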

A.2 Change of Basis: Full Worked Example

Problem. Let T: \mathbb{R}^2 \to \mathbb{R}^2 rotate vectors by 30^\circ counterclockwise. Express T in the rotated basis \mathcal{B}' = \{R_{45^\circ}\mathbf{e}_1, R_{45^\circ}\mathbf{e}_2\}.

Step 1: Standard matrix of T.

A = R_{30^\circ} = \begin{bmatrix} \cos 30^\circ & -\sin 30^\circ \\ \sin 30^\circ & \cos 30^\circ \end{bmatrix} = \begin{bmatrix} \sqrt{3}/2 & -1/2 \\ 1/2 & \sqrt{3}/2 \end{bmatrix}

Step 2: Change-of-basis matrix. The new basis vectors, in standard coordinates:

\mathbf{b}'_1 = R_{45^\circ}\mathbf{e}_1 = (\cos 45^\circ, \sin 45^\circ)^\top = (1/\sqrt{2}, 1/\sqrt{2})^\top

\mathbf{b}'_2 = R_{45^\circ}\mathbf{e}_2 = (-\sin 45^\circ, \cos 45^\circ)^\top = (-1/\sqrt{2}, 1/\sqrt{2})^\top

P = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} = R_{45^\circ}

Step 3: New matrix.

[T]_{\mathcal{B}'} = P^{-1} A P = R_{-45^\circ} R_{30^\circ} R_{45^\circ}

Since all 2D rotations commute (they rotate about the same axis): R_{-45^\circ} R_{30^\circ} R_{45^\circ} = R_{30^\circ}.

Key insight: A rotation in 2D has the same matrix in every orthonormal basis (since all such matrices are just RθR_\theta). This is because RθR_\theta commutes with all rotations: RαRθRα1=RθR_\alpha R_\theta R_\alpha^{-1} = R_\theta.
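A two-line numerical confirmation (angles chosen as in the example):

```python
import numpy as np

def R(deg):
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

A, P = R(30), R(45)                                # T = rotation by 30 deg; new basis columns = R(45) e_i
print(np.allclose(np.linalg.inv(P) @ A @ P, A))    # True: the matrix is unchanged in the rotated basis
```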

A.3 The Null Space as a Subspace: Visualization

For A=[121242]A = \begin{bmatrix} 1 & 2 & -1 \\ 2 & 4 & -2 \end{bmatrix} (rows are multiples), let's find ker(A)\ker(A).

Row reduce: R2R22R1R_2 \leftarrow R_2 - 2R_1:

[121000]\begin{bmatrix} 1 & 2 & -1 \\ 0 & 0 & 0 \end{bmatrix}

Free variables: x2=sx_2 = s, x3=tx_3 = t (free). Back-substitute: x1=2s+tx_1 = -2s + t.

ker(A)={s(210)+t(101):s,tR}\ker(A) = \left\{ s \begin{pmatrix} -2 \\ 1 \\ 0 \end{pmatrix} + t \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} : s, t \in \mathbb{R} \right\}

This is a plane through the origin in R3\mathbb{R}^3. The map TA:R3R2T_A: \mathbb{R}^3 \to \mathbb{R}^2 collapses this entire plane to zero. By rank-nullity: rank =1= 1, nullity =2= 2, and 1+2=3=dim(R3)1 + 2 = 3 = \dim(\mathbb{R}^3). OK

The image of TAT_A is span{(1,2)}\operatorname{span}\{(1,2)^\top\} - a line in R2\mathbb{R}^2 - since the two columns of AA are (1,2)(1,2)^\top and (2,4)=2(1,2)(2,4)^\top = 2 \cdot (1,2)^\top, so rank = 1.


Appendix B: Abstract Linear Algebra Perspective

B.1 The Category of Vector Spaces

In the language of category theory (which provides a unifying framework for all of mathematics):

  • Objects: Vector spaces over a field F\mathbb{F}
  • Morphisms: Linear maps between them
  • Composition: Function composition (= matrix multiplication)
  • Identity morphism: The identity map idV(v)=v\operatorname{id}_V(\mathbf{v}) = \mathbf{v} (= identity matrix II)

This forms a category denoted VectF\mathbf{Vect}_\mathbb{F}.

Isomorphisms in the category are exactly the invertible linear maps - this matches our definition of isomorphism in 4.3.

Functors are maps between categories that preserve the categorical structure. The "matrix representation" is a functor from VectF\mathbf{Vect}_\mathbb{F} (with chosen bases) to the category MatF\mathbf{Mat}_\mathbb{F} of matrices. Changing bases corresponds to a natural transformation.

This perspective matters for ML because: neural network architectures are themselves categorical structures (composition of morphisms), and categorical thinking helps reason about when two architectures are "equivalent" (related by isomorphisms).

B.2 Infinite-Dimensional Extensions

The theory extends to infinite-dimensional spaces, where the familiar finite-dimensional results require modification.

Bounded linear operators. On Hilbert spaces (infinite-dimensional inner product spaces), the right notion of "linear transformation" is a bounded linear operator T:HHT: H \to H satisfying TvMv\lVert T\mathbf{v}\rVert \leq M\lVert\mathbf{v}\rVert for some M<M < \infty. Unbounded operators (like differentiation on L2L^2) require careful domain specification.

The spectral theorem for compact operators. A compact self-adjoint operator on a Hilbert space has a countable set of eigenvalues λ1λ20\lambda_1 \geq \lambda_2 \geq \cdots \to 0 and an orthonormal basis of eigenvectors. This is the infinite-dimensional analogue of diagonalization.

Functional analysis. The study of linear maps on infinite-dimensional spaces is called functional analysis. Key results (Hahn-Banach, open mapping theorem, closed graph theorem) parallel finite-dimensional results but require additional technical hypotheses.

For AI: Neural network function classes are subsets of infinite-dimensional function spaces (L2L^2 or Sobolev spaces). Understanding the "size" and "complexity" of these classes uses infinite-dimensional linear algebra - e.g., kernel methods and Gaussian processes operate in reproducing kernel Hilbert spaces (RKHS), and neural tangent kernel theory analyzes infinitely wide networks using spectral theory of linear operators.

B.3 The Tensor Product and Multilinear Maps

Bilinear maps. A map B:V×WUB: V \times W \to U is bilinear if it is linear in each argument separately. The attention score B(q,k)=qkB(\mathbf{q}, \mathbf{k}) = \mathbf{q}^\top \mathbf{k} is bilinear (linear in q\mathbf{q}, linear in k\mathbf{k}, but not linear jointly: B(cq,ck)=c2B(q,k)cB(q,k)B(c\mathbf{q}, c\mathbf{k}) = c^2 B(\mathbf{q}, \mathbf{k}) \neq c B(\mathbf{q}, \mathbf{k}) in general).

The tensor product VWV \otimes W is the universal space for bilinear maps: every bilinear map B:V×WUB: V \times W \to U factors through a unique linear map B~:VWU\tilde{B}: V \otimes W \to U. This is why tensors (in the ML sense: multi-dimensional arrays) are called tensors - they represent multilinear maps.

For AI: The key operation in self-attention, qk=q,k\mathbf{q}^\top \mathbf{k} = \langle \mathbf{q}, \mathbf{k} \rangle, is a bilinear form. The matrix WW in the "weight matrix" formulation of attention (qWQKk\mathbf{q}^\top W_{\text{QK}} \mathbf{k}) makes this explicit: WQK=WQWKW_{\text{QK}} = W_Q^\top W_K is the matrix of the bilinear form. This is the reason attention is more expressive than standard linear transformations - it computes a bilinear (quadratic) function of the input.



Appendix C: The Geometry of Linear Maps - A Deep Dive

C.1 How Linear Maps Deform Space

To deeply understand a linear map T:RnRmT: \mathbb{R}^n \to \mathbb{R}^m, we track how it deforms geometric objects.

Ellipsoids to ellipsoids. The image of the unit sphere Sn1={x:x=1}S^{n-1} = \{\mathbf{x} : \lVert\mathbf{x}\rVert = 1\} under a full-rank map AA is an ellipsoid whose semi-axes have lengths equal to the singular values σ1σ2σn>0\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0 of AA, pointing in the directions of the left singular vectors u1,,um\mathbf{u}_1, \ldots, \mathbf{u}_m.

This is the geometric content of the SVD: A=UΣVA = U\Sigma V^\top means:

  1. VV^\top: rotate the input so the "natural input directions" vi\mathbf{v}_i align with the coordinate axes.
  2. Σ\Sigma: stretch each axis ii by σi\sigma_i.
  3. UU: rotate the output to place the stretched axes along the ui\mathbf{u}_i directions.

The sphere becomes an ellipsoid. The "shape" of the ellipsoid is completely described by the singular values.

Volume scaling. The volume of the image of a set SRnS \subseteq \mathbb{R}^n under AA is det(A)vol(S)|\det(A)| \cdot \operatorname{vol}(S) (when AA is square). More precisely, det(A)=σi|\det(A)| = \prod \sigma_i = product of all singular values = volume of the image of the unit cube.

For a rank-deficient map (det(A)=0\det(A) = 0), the image has lower dimension - and mm-dimensional volume = 0. The map "collapses" nn-dimensional space to a lower-dimensional flat object.

Angles. Unless AA is orthogonal, linear maps change angles. The angle θ\theta between u\mathbf{u} and v\mathbf{v} satisfies:

cosθ=uvuv\cos\theta = \frac{\mathbf{u}^\top\mathbf{v}}{\lVert\mathbf{u}\rVert\lVert\mathbf{v}\rVert}

but cos(Au,Av)=(Au)(Av)AuAv=uAAvAuAv\cos\angle(A\mathbf{u}, A\mathbf{v}) = \frac{(A\mathbf{u})^\top(A\mathbf{v})}{\lVert A\mathbf{u}\rVert\lVert A\mathbf{v}\rVert} = \frac{\mathbf{u}^\top A^\top A\mathbf{v}}{\lVert A\mathbf{u}\rVert\lVert A\mathbf{v}\rVert}.

The matrix G=AAG = A^\top A (the Gram matrix) determines how AA distorts inner products. Eigenvalues of GG are σi2\sigma_i^2 (squared singular values).
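A short check of both geometric facts on a random matrix (NumPy; the size is arbitrary):

```python
import numpy as np
rng = np.random.default_rng(0)

A = rng.standard_normal((3, 3))
s = np.linalg.svd(A, compute_uv=False)                       # singular values, descending

eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]             # eigenvalues of the Gram matrix
print(np.allclose(eig, s**2))                                # True: eig(A^T A) = sigma_i^2
print(np.isclose(abs(np.linalg.det(A)), np.prod(s)))         # True: |det A| = product of singular values
```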

C.2 Interpreting the Four Fundamental Subspaces Geometrically

Given T:RnRmT: \mathbb{R}^n \to \mathbb{R}^m with matrix AA and rank rr:

The row space row(A)\operatorname{row}(A): This is the "input directions that survive" - the rr-dimensional subspace of Rn\mathbb{R}^n that TT maps faithfully (injectively) onto the column space. Any xrow(A)\mathbf{x} \in \operatorname{row}(A) is "noticed" by TT.

The null space null(A)\operatorname{null}(A): This is the "input directions that are killed" - the (nr)(n-r)-dimensional subspace of Rn\mathbb{R}^n that TT maps to zero. Any xnull(A)\mathbf{x} \in \operatorname{null}(A) is "invisible" to TT.

The decomposition Rn=row(A)null(A)\mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A): Every input x\mathbf{x} splits uniquely as x=xr+xn\mathbf{x} = \mathbf{x}_r + \mathbf{x}_n where xr\mathbf{x}_r is in the row space (the "signal" part) and xn\mathbf{x}_n is in the null space (the "noise" invisible to TT).

The column space col(A)\operatorname{col}(A): The rr-dimensional subspace of Rm\mathbb{R}^m that TT can actually reach. Solutions to Ax=bA\mathbf{x} = \mathbf{b} exist iff bcol(A)\mathbf{b} \in \operatorname{col}(A).

The left null space null(A)\operatorname{null}(A^\top): The (mr)(m-r)-dimensional complement of col(A)\operatorname{col}(A) in Rm\mathbb{R}^m. Directions in the left null space are unreachable by TT.

COMPLETE PICTURE OF THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                                    R^m (codomain)
  ---------------------------------------------------------

  +---------------------+    T(x) = Ax    +---------------------+
  |    row space        | -------------->  |    column space     |
  |   (dim = r)         |  isomorphism    |    (dim = r)        |
  |                     |                 |                     |
  |   -------------     |                 |   -------------     |
  |                     |                 |                     |
  |    null space       | -------------->  |    left null space  |
  |   (dim = n-r)       |    maps to 0    |    (dim = m-r)      |
  +---------------------+                 +---------------------+
         ^ orthogonal complement                       ^ orthogonal complement

  Every x = (row space part) + (null space part)
  T sees only the row space part.

========================================================================

For AI (linear systems / least squares): When fitting a model Aw=bA\mathbf{w} = \mathbf{b} with more constraints than parameters (m>nm > n), the system is overdetermined. A solution exists only if bcol(A)\mathbf{b} \in \operatorname{col}(A). If not, the least-squares solution minimizes Awb2\lVert A\mathbf{w} - \mathbf{b}\rVert^2 - finding the projection of b\mathbf{b} onto col(A)\operatorname{col}(A) and then solving in the row space.

C.3 Linear Maps and Information Theory

The rank-nullity theorem has an information-theoretic interpretation.

Rank = information preserved. A linear map of rank rr preserves at most rr "dimensions" of information from the input. The remaining nrn - r dimensions are destroyed.

Mutual information. For a Gaussian input \mathbf{x} \sim \mathcal{N}(\mathbf{0}, I_n) observed through A with additive unit-variance Gaussian noise, \mathbf{y} = A\mathbf{x} + \boldsymbol{\varepsilon}, the mutual information is

I(\mathbf{x}; \mathbf{y}) = \frac{1}{2} \sum_{i=1}^r \log(1 + \sigma_i^2)

It depends only on the singular values of A - not on the specific directions. The null space of A contributes zero mutual information. (The noise term matters: for a deterministic continuous map the mutual information is not finite, so the noisy channel is the right model here.)

Compression. If we compress \mathbf{x} \in \mathbb{R}^n to \mathbf{z} = C\mathbf{x} \in \mathbb{R}^r via a linear map C \in \mathbb{R}^{r \times n}, at most r dimensions of information survive, and which directions survive is determined by the row space of C. For data with covariance \Sigma, the rank-r compression that retains the most variance projects onto the top-r eigenvectors of \Sigma - exactly PCA.

This is why PCA is the optimal linear compressor under mean-squared error: it maximizes the retained variance (information) for any fixed rank rr.

C.4 Generalization of Linear Maps: Tensors

The concept of a linear map generalizes to multilinear maps and tensors in ways that are directly relevant to deep learning.

Bilinear maps and matrices of bilinear forms. A bilinear form B:Rn×RmRB: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R} can be written as B(x,y)=xMyB(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top M \mathbf{y} for a matrix MRn×mM \in \mathbb{R}^{n \times m}. The bilinear form is:

  • Symmetric if M=MM = M^\top (and n=mn = m): B(x,y)=B(y,x)B(\mathbf{x}, \mathbf{y}) = B(\mathbf{y}, \mathbf{x}).
  • Positive definite if B(x,x)>0B(\mathbf{x}, \mathbf{x}) > 0 for x0\mathbf{x} \neq \mathbf{0}: inner products are positive definite symmetric bilinear forms.

Multilinear maps. A kk-linear map T:V1××VkWT: V_1 \times \cdots \times V_k \to W is linear in each argument separately. The space of kk-linear maps on Rn\mathbb{R}^n is the space of tensors of order kk.

For AI: The multi-head attention score qWQKk\mathbf{q}^\top W_{\text{QK}} \mathbf{k} is a bilinear form parameterized by WQK=WQWKW_{\text{QK}} = W_Q^\top W_K. Understanding bilinear forms via their eigendecomposition (WQK=UΛVW_{\text{QK}} = U\Lambda V^\top by SVD) reveals what "patterns" each attention head is sensitive to: the left singular vectors UU are "what queries to look for" and the right singular vectors VV are "what keys are being matched against."



Appendix D: Computational Methods and Numerical Considerations

D.1 Computing the Kernel via Row Reduction

Given ARm×nA \in \mathbb{R}^{m \times n}, finding a basis for ker(A)\ker(A) requires solving Ax=0A\mathbf{x} = \mathbf{0}.

Algorithm (Gaussian Elimination to RREF):

  1. Apply row operations to reduce AA to reduced row echelon form (RREF).
  2. Identify pivot columns (columns with leading 1s in RREF) and free columns (all other columns).
  3. For each free variable, set it to 1 and all other free variables to 0, then solve for pivot variables.
  4. Each such solution is one basis vector for ker(A)\ker(A).

Example. A=[121024231212]A = \begin{bmatrix} 1 & 2 & -1 & 0 \\ 2 & 4 & -2 & 3 \\ -1 & -2 & 1 & 2 \end{bmatrix}.

RREF: R2R22R1R_2 \leftarrow R_2 - 2R_1, R3R3+R1R_3 \leftarrow R_3 + R_1:

[121000030002]\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 2 \end{bmatrix}

R3R323R2R_3 \leftarrow R_3 - \frac{2}{3} R_2, R213R2R_2 \leftarrow \frac{1}{3}R_2:

[121000010000]\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}

R1R10R2R_1 \leftarrow R_1 - 0 \cdot R_2: already done.

Pivot columns: 1 and 4. Free variables: x2x_2 and x3x_3.

Setting x2=1,x3=0x_2 = 1, x_3 = 0: x4=0x_4 = 0, x1+2(1)1(0)=0x1=2x_1 + 2(1) - 1(0) = 0 \Rightarrow x_1 = -2. Basis vector: (2,1,0,0)(-2, 1, 0, 0)^\top.

Setting x2=0,x3=1x_2 = 0, x_3 = 1: x4=0x_4 = 0, x1+01=0x1=1x_1 + 0 - 1 = 0 \Rightarrow x_1 = 1. Basis vector: (1,0,1,0)(1, 0, 1, 0)^\top.

Null space basis: ker(A)=span{(2,1,0,0),(1,0,1,0)}\ker(A) = \operatorname{span}\left\{(-2,1,0,0)^\top, (1,0,1,0)^\top\right\}. Nullity = 2.

Rank-nullity: n=4n = 4, rank = 2 (two pivots), nullity = 2. Check: 2+2=42 + 2 = 4. OK
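The same example via the SVD route described in D.2 below; the SVD returns an orthonormal basis of the same null space, not the particular vectors found by hand:

```python
import numpy as np

A = np.array([[ 1.,  2., -1., 0.],
              [ 2.,  4., -2., 3.],
              [-1., -2.,  1., 2.]])

U, S, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * S[0]
rank = int(np.sum(S > tol))
null_basis = Vt[rank:]                        # right singular vectors for the (near-)zero singular values

print(rank, A.shape[1] - rank)                # rank 2, nullity 2
print(np.allclose(A @ null_basis.T, 0.0))     # every basis vector is annihilated by A
```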

D.2 Numerical Stability of Basis Computations

Computing the null space or column space numerically requires care because floating-point arithmetic can introduce small errors.

The SVD approach (recommended). Instead of row reduction, compute the SVD: A=UΣVA = U\Sigma V^\top. Then:

  • ker(A)\ker(A) = span of columns of VV corresponding to zero singular values (or singular values below a threshold ϵ\epsilon).
  • col(A)\operatorname{col}(A) = span of columns of UU corresponding to nonzero singular values.

The SVD-based approach is numerically stable because orthonormal bases (UU and VV) are well-conditioned.

Numerical rank. For floating-point matrices, "zero" singular values appear as small but nonzero values. The numerical rank with threshold ϵ\epsilon is:

rankϵ(A)={i:σi>ϵσ1}\operatorname{rank}_\epsilon(A) = |\{i : \sigma_i > \epsilon \cdot \sigma_1\}|

A common choice is ϵ=nmachine epsilon\epsilon = n \cdot \text{machine epsilon} (about 101310^{-13} for double precision). numpy.linalg.matrix_rank uses a default threshold based on machine epsilon.

Why this matters: In practice, a matrix with theoretical rank rr may appear to have rank r+5r + 5 due to measurement noise. The SVD reveals the "intrinsic" rank through the gap in singular values.
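A small synthetic demonstration (the noise level and the relative threshold are chosen purely for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

# A matrix of "true" rank 3, contaminated by small measurement noise.
M = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50)) \
    + 1e-8 * rng.standard_normal((100, 50))

s = np.linalg.svd(M, compute_uv=False)
print(s[:5])                                  # three O(10) values, then a drop of many orders of magnitude

print(np.sum(s > 1e-6 * s[0]))                # numerical rank 3, using a relative threshold inside the gap
print(np.linalg.matrix_rank(M))               # likely 50: the default eps-scale tolerance counts the noise
```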

D.3 Efficient Change of Basis Computations

Naive approach: Compute P1APP^{-1}AP directly. For n×nn \times n matrices, this costs O(n3)O(n^3) (two matrix multiplications plus one matrix inversion).

Better approach when PP is orthogonal: If PP is orthogonal (P1=PP^{-1} = P^\top), then PAPP^\top AP costs only O(n3)O(n^3) but with better constants than general P1APP^{-1}AP (no matrix inversion needed).

Eigendecomposition case: When A = P\Lambda P^{-1}, applying A^k to a vector via A^k\mathbf{v} = P\,\Lambda^k\,(P^{-1}\mathbf{v}) needs only O(n) work to form \Lambda^k (raise each diagonal entry to the k-th power), plus two O(n^2) matrix-vector products with P^{-1} and P - far cheaper than k dense matrix-matrix multiplications.

For AI: Computing exp(At)\exp(At) (matrix exponential, important for continuous-time state space models like Mamba/S4) is done by diagonalizing: exp(At)=Pexp(Λt)P1\exp(At) = P\exp(\Lambda t)P^{-1}, where exp(Λt)\exp(\Lambda t) is diagonal with entries eλite^{\lambda_i t}.
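A sketch of the diagonalization route, checked against SciPy's expm (assuming SciPy is available; a generic real matrix is diagonalizable over the complex numbers):

```python
import numpy as np
from scipy.linalg import expm                  # reference implementation for comparison

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
t = 0.3

lam, P = np.linalg.eig(A)                      # A = P diag(lam) P^{-1}
via_eig = (P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)).real

print(np.allclose(via_eig, expm(A * t)))       # True
```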

D.4 The Rank-Revealing QR Decomposition

Standard QR decomposition A=QRA = QR doesn't directly reveal rank. The rank-revealing QR (RRQR) uses column pivoting:

AP=QR=[Q1Q2][R11R120R22]AP = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}

where PP is a permutation matrix, and R22R_{22} is "small" (its Frobenius norm bounds how far AA is from rank-rr). The columns of Q1Q_1 form a basis for col(A)\operatorname{col}(A).

RRQR is preferred over SVD when only a basis for the column space (not the singular values themselves) is needed, as it is about 3×3\times faster.
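A sketch using SciPy's column-pivoted QR (scipy.linalg.qr with pivoting=True); the test matrix is constructed to have rank 4:

```python
import numpy as np
from scipy.linalg import qr                    # SciPy's QR supports column pivoting

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 30))   # rank 4 by construction

Q, R, piv = qr(A, mode='economic', pivoting=True)
print(np.abs(np.diag(R))[:6])                  # four sizeable values, then a collapse: the rank is revealed

Q1 = Q[:, :4]                                  # orthonormal basis for col(A)
print(np.allclose(Q1 @ (Q1.T @ A), A))         # True: A lies entirely in span(Q1)
```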


Appendix E: Connections to Other Fields

E.1 Linear Maps in Physics

In quantum mechanics, operators on Hilbert spaces are infinite-dimensional linear maps. The Hamiltonian H^\hat{H}, momentum p^\hat{p}, and position x^\hat{x} are linear operators. The Schrodinger equation itψ=H^ψi\hbar \partial_t |\psi\rangle = \hat{H}|\psi\rangle is a linear ODE on the Hilbert space of quantum states.

The spectral theorem for self-adjoint operators (the quantum generalization of symmetric matrix diagonalization) guarantees that observables have real eigenvalues (the possible measurement outcomes) and that the eigenfunctions form a complete orthonormal basis.

For AI: Transformers share surprising mathematical parallels with quantum mechanics: both involve attention-like mechanisms (inner products of states), superposition (linear combinations of basis states), and entanglement-like correlations. The linear algebra of quantum mechanics and of transformers both live in the framework of linear maps on Hilbert spaces.

E.2 Linear Maps in Topology: Homomorphisms

In algebraic topology, chain complexes are sequences of vector spaces connected by linear maps:

3C22C11C000\cdots \xrightarrow{\partial_3} C_2 \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0

where k1k=0\partial_{k-1} \circ \partial_k = 0 (boundary of a boundary is zero - exactly the condition ker(k1)im(k)\ker(\partial_{k-1}) \supseteq \operatorname{im}(\partial_k)). The homology groups Hk=ker(k)/im(k+1)H_k = \ker(\partial_k) / \operatorname{im}(\partial_{k+1}) measure "holes" in topological spaces.

Persistent homology, used in topological data analysis (TDA), applies this to point cloud data to find features that persist across scales. It's used in ML for analyzing data manifolds, protein structure prediction, and understanding neural network loss landscapes.

E.3 Linear Maps in Signal Processing

The Discrete Fourier Transform (DFT) is a linear map F: \mathbb{C}^n \to \mathbb{C}^n with matrix entries F_{kj} = e^{-2\pi i jk/n}. The DFT matrix satisfies F^* F = nI, so F/\sqrt{n} is unitary.

Convolution is linear - convolving a signal with a kernel is a linear map - and in the Fourier domain it becomes pointwise multiplication. This is the key to making CNNs efficient: convolution is a structured linear map with shared weights (translation equivariance), and the Fourier transform diagonalizes the convolution operator.

For AI: The fast Fourier transform (FFT) is O(nlogn)O(n \log n) instead of O(n2)O(n^2) for the full DFT matrix multiply, by exploiting the structure (sparsity in a different basis) of the DFT linear map. Similarly, FlashAttention speeds up attention by exploiting the structure of the attention linear map to minimize memory bandwidth.



Appendix F: The Algebra of Linear Maps - Structural Results

F.1 The Space of Linear Maps is a Vector Space

We noted briefly that L(V,W)\mathcal{L}(V, W) is a vector space. Let's make this precise and compute its dimension.

Operations: For S,TL(V,W)S, T \in \mathcal{L}(V, W) and cFc \in \mathbb{F}:

  • (S+T)(v)=S(v)+T(v)(S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v})
  • (cT)(v)=cT(v)(cT)(\mathbf{v}) = c \cdot T(\mathbf{v})
  • The zero element is the zero map O(v)=0O(\mathbf{v}) = \mathbf{0} for all v\mathbf{v}.

Dimension. If dim(V)=n\dim(V) = n and dim(W)=m\dim(W) = m, then:

dim(L(V,W))=mn\dim(\mathcal{L}(V, W)) = mn

Proof: Every linear map T: V \to W is determined by the n images of the basis vectors, each lying in an m-dimensional space. This gives an isomorphism \mathcal{L}(V, W) \cong W^n \cong \mathbb{R}^{mn} (as vector spaces), which is precisely the identification with m \times n matrices. \square

Basis for L(Rn,Rm)\mathcal{L}(\mathbb{R}^n, \mathbb{R}^m). The standard basis consists of the mnmn maps Eij:RnRmE_{ij}: \mathbb{R}^n \to \mathbb{R}^m defined by Eij(ek)=δjkeiE_{ij}(\mathbf{e}_k) = \delta_{jk}\mathbf{e}_i. In matrix form, EijE_{ij} is the matrix with 1 in position (i,j)(i,j) and 0 elsewhere.

F.2 Composition Gives L(V,V)\mathcal{L}(V, V) an Algebra Structure

When V=WV = W, linear maps T:VVT: V \to V can be composed. The space L(V,V)\mathcal{L}(V, V) (endomorphisms of VV) is a ring under composition (it is also a vector space - together, an algebra).

Properties of composition in L(V,V)\mathcal{L}(V, V):

  • Associative: (RS)T=R(ST)(RS)T = R(ST)
  • Identity: IT=TI=TI \circ T = T \circ I = T
  • Distributive: R(S+T)=RS+RTR(S + T) = RS + RT and (R+S)T=RT+ST(R+S)T = RT + ST
  • NOT commutative: RSSRRS \neq SR in general

Matrix polynomials. For TL(V,V)T \in \mathcal{L}(V,V), we can form p(T)=a0I+a1T+a2T2++akTkp(T) = a_0 I + a_1 T + a_2 T^2 + \cdots + a_k T^k for any polynomial pp. This is well-defined because we can add and compose linear maps.

Cayley-Hamilton theorem. Every linear operator TT satisfies its own characteristic polynomial: pT(T)=0p_T(T) = 0, where pT(λ)=det(λIT)p_T(\lambda) = \det(\lambda I - T).

For AI: The spectral approach to recurrent networks analyzes the long-run behavior of TkvT^k \mathbf{v} as kk \to \infty. If TT has eigenvalues λi<1|\lambda_i| < 1, then Tk0T^k \to 0 (stable memory decay). If any λi>1|\lambda_i| > 1, the recurrence explodes. This spectral stability analysis is the foundation of designing stable RNNs (LSTM, GRU use gating to control the effective eigenvalue spectrum of the recurrence).
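A toy recurrence illustrating the two regimes (the matrices are random and rescaled to a chosen spectral radius; this is an illustration, not a model of any particular RNN):

```python
import numpy as np
rng = np.random.default_rng(0)

def spectral_radius(T):
    return np.max(np.abs(np.linalg.eigvals(T)))

W = rng.standard_normal((32, 32))
stable   = 0.9 * W / spectral_radius(W)        # all |lambda_i| <= 0.9
unstable = 1.1 * W / spectral_radius(W)        # largest |lambda_i| = 1.1

v = rng.standard_normal(32)
for name, T in [("stable", stable), ("unstable", unstable)]:
    x = v.copy()
    for _ in range(100):
        x = T @ x                              # iterate the linear recurrence x_{k+1} = T x_k
    print(name, np.linalg.norm(x))             # decays toward 0 vs. grows without bound
```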

F.3 Quotient Maps and Projections

The quotient space. Given T:VWT: V \to W with kernel K=ker(T)K = \ker(T), the quotient space V/KV/K consists of equivalence classes [v]=v+K={v+k:kK}[\mathbf{v}] = \mathbf{v} + K = \{\mathbf{v} + \mathbf{k} : \mathbf{k} \in K\}.

V/KV/K is a vector space of dimension dim(V)dim(K)=rank(T)\dim(V) - \dim(K) = \operatorname{rank}(T).

The first isomorphism theorem. Every linear map T:VWT: V \to W factors as:

VπV/ker(T)T~im(T)WV \xrightarrow{\pi} V/\ker(T) \xrightarrow{\tilde{T}} \operatorname{im}(T) \hookrightarrow W

where π\pi is the quotient map (π(v)=[v]\pi(\mathbf{v}) = [\mathbf{v}]) and T~\tilde{T} is an isomorphism from V/ker(T)V/\ker(T) to im(T)\operatorname{im}(T).

This is the coordinate-free statement of the rank-nullity theorem: V/ker(T)im(T)V/\ker(T) \cong \operatorname{im}(T).

Geometric meaning. The quotient map π\pi "collapses" the null space to a point, then T~\tilde{T} acts faithfully (injectively) on the resulting space. Any linear map splits into: collapse (project out the null space) + inject faithfully into the codomain.

For AI: In contrastive learning (SimCLR, MoCo), the projection head maps representations to a lower-dimensional space. This is a linear (or nonlinear) quotient map - it deliberately collapses some dimensions (those corresponding to nuisance factors like image augmentation) while preserving the semantically meaningful directions. The first isomorphism theorem says: what survives in the image is exactly what was not collapsed.

F.4 Dual Bases and the Canonical Isomorphism

We saw that dim(V)=dim(V)\dim(V^*) = \dim(V), so VVV \cong V^*. But this isomorphism is non-canonical - it depends on the choice of basis.

With an inner product. When VV is an inner product space (like Rn\mathbb{R}^n with the standard dot product), there is a canonical isomorphism ϕ:VV\phi: V \to V^* defined by:

ϕ(v)=v,\phi(\mathbf{v}) = \langle \mathbf{v}, \cdot \rangle

That is, ϕ(v)\phi(\mathbf{v}) is the linear functional that takes wv,w\mathbf{w} \mapsto \langle \mathbf{v}, \mathbf{w}\rangle.

This isomorphism is canonical because it doesn't depend on any choice of basis - it uses only the inner product structure. When we identify f(x)Rn\nabla f(\mathbf{x}) \in \mathbb{R}^n as a column vector (primal) rather than a row vector (dual), we are implicitly using this canonical isomorphism via the standard inner product.

For AI: On non-Euclidean spaces (manifolds of probability distributions, manifolds of neural network weights under the Fisher metric), the identification V \cong V^* is no longer automatic - gradients and velocity vectors live in different spaces. The natural gradient method corrects for this by using the Fisher information matrix F as the metric: \tilde{\nabla}\mathcal{L}(\theta) = F^{-1}\nabla_\theta\mathcal{L}(\theta). This converts the gradient (a covector) into a tangent vector using the Riemannian metric instead of the Euclidean one.


Appendix G: Linear Maps in Practice - Worked Problems

G.1 Verifying Linearity: Systematic Approach

Problem: Is T:R2×2RT: \mathbb{R}^{2\times 2} \to \mathbb{R} defined by T(A)=tr(A)T(A) = \operatorname{tr}(A) linear?

Check additivity: T(A+B)=tr(A+B)=tr(A)+tr(B)=T(A)+T(B)T(A + B) = \operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B) = T(A) + T(B). OK

Check homogeneity: T(cA)=tr(cA)=ctr(A)=cT(A)T(cA) = \operatorname{tr}(cA) = c\operatorname{tr}(A) = cT(A). OK

Conclusion: TT is linear. Its matrix (viewing R2×2\mathbb{R}^{2\times 2} with basis {E11,E12,E21,E22}\{E_{11}, E_{12}, E_{21}, E_{22}\}):

tr(E11)=1,tr(E12)=0,tr(E21)=0,tr(E22)=1\operatorname{tr}(E_{11}) = 1, \quad \operatorname{tr}(E_{12}) = 0, \quad \operatorname{tr}(E_{21}) = 0, \quad \operatorname{tr}(E_{22}) = 1

So [T]=(1,0,0,1)[T] = (1, 0, 0, 1) as a 1×41 \times 4 matrix. Kernel = {A:tr(A)=0}\{A : \operatorname{tr}(A) = 0\} (trace-zero matrices), dimension 3.

Problem: Is T:RnRT: \mathbb{R}^n \to \mathbb{R} defined by T(x)=x2T(\mathbf{x}) = \lVert\mathbf{x}\rVert_2 linear?

Check homogeneity: T(cx)=cx=cxT(c\mathbf{x}) = \lVert c\mathbf{x}\rVert = |c| \lVert\mathbf{x}\rVert. For c=1c = -1: T(x)=x=T(x)T(-\mathbf{x}) = \lVert\mathbf{x}\rVert = T(\mathbf{x}), but linearity requires T(x)=T(x)T(-\mathbf{x}) = -T(\mathbf{x}).

For x0\mathbf{x} \neq \mathbf{0}: T(x)>0T(\mathbf{x}) > 0 but T(x)<0-T(\mathbf{x}) < 0. So T(x)T(x)T(-\mathbf{x}) \neq -T(\mathbf{x}).

Conclusion: TT is NOT linear (the norm fails homogeneity due to the absolute value).

G.2 Finding the Kernel: Four Approaches

For A=[112224]A = \begin{bmatrix} 1 & -1 & 2 \\ 2 & -2 & 4 \end{bmatrix} (rows are multiples), find ker(A)\ker(A):

Approach 1: Row reduction. RREF: [112000]\begin{bmatrix} 1 & -1 & 2 \\ 0 & 0 & 0 \end{bmatrix}. Free variables x2=sx_2 = s, x3=tx_3 = t. Then x1=s2tx_1 = s - 2t.

ker(A)=span{(1,1,0),(2,0,1)}\ker(A) = \operatorname{span}\{(1,1,0)^\top, (-2,0,1)^\top\}.

Approach 2: Inspection. The columns satisfy \mathbf{a}_2 = -\mathbf{a}_1 and \mathbf{a}_3 = 2\mathbf{a}_1. From \mathbf{a}_1 + \mathbf{a}_2 = \mathbf{0} we get (1, 1, 0)^\top \in \ker(A), and from 2\mathbf{a}_1 - \mathbf{a}_3 = \mathbf{0} we get (2, 0, -1)^\top \in \ker(A) - equivalently (-2, 0, 1)^\top, matching the row-reduction answer. (A common slip: (1, -1, 0)^\top is not in the kernel, since A(1, -1, 0)^\top = \mathbf{a}_1 - \mathbf{a}_2 = 2\mathbf{a}_1 \neq \mathbf{0}.)

Approach 3: SVD. Compute SVD of AA; null space vectors are the right singular vectors with zero (or near-zero) singular values.

Approach 4: Random sampling + orthogonalization. Sample many vectors, project out the row space, keep those with zero image (useful when AA is very large).

G.3 Composition of Transforms in a Graphics Pipeline

A 3D object is processed through a graphics pipeline using compositions of affine maps:

  1. Model matrix MM: transform from object coordinates to world coordinates (rotation, scale, translation).
  2. View matrix VV: transform from world coordinates to camera coordinates (rotation + translation).
  3. Projection matrix PP: from camera coordinates to clip coordinates (perspective projection).

The combined transform: pclip=PVMpobject\mathbf{p}_{\text{clip}} = P \cdot V \cdot M \cdot \mathbf{p}_{\text{object}} (in homogeneous coordinates).

Composition order matters. In the product PVM, the map nearest the vector acts first: apply M, then V, then P. The matrix product PVM can be precomputed once per frame (not per vertex), saving O(|\text{vertices}|) matrix multiplications.

This is the same principle as "avoid recomputing shared prefixes" in transformer KV-caching: the KV cache stores the projected keys and values K = XW_K, V = XW_V for all past tokens, so they don't need to be recomputed when generating each new token.



Appendix H: Linear Maps in Optimization and Training

H.1 The Gradient as a Linear Map

In optimization, we minimize a loss function $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla \mathcal{L}(\mathbf{w})$ tells us the direction of steepest ascent. But more precisely:

The directional derivative of $\mathcal{L}$ at $\mathbf{w}$ in direction $\mathbf{d}$ is:

$$
D_{\mathbf{d}} \mathcal{L}(\mathbf{w}) = \lim_{t \to 0} \frac{\mathcal{L}(\mathbf{w} + t\mathbf{d}) - \mathcal{L}(\mathbf{w})}{t} = \nabla \mathcal{L}(\mathbf{w})^\top \mathbf{d}
$$

This is a linear functional in $\mathbf{d}$: it is the dual vector $\nabla \mathcal{L}(\mathbf{w})^\top \in (\mathbb{R}^n)^*$.
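A quick finite-difference check of this identity on a toy quadratic loss (a NumPy sketch; the matrix $Q$, the point $\mathbf{w}$, and the direction $\mathbf{d}$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Q = rng.standard_normal((n, n)); Q = Q @ Q.T + np.eye(n)   # SPD matrix for a toy quadratic loss
b = rng.standard_normal(n)

loss = lambda w: 0.5 * w @ Q @ w - b @ w
grad = lambda w: Q @ w - b                                  # analytic gradient

w, d = rng.standard_normal(n), rng.standard_normal(n)
t = 1e-6
fd = (loss(w + t * d) - loss(w)) / t                        # finite-difference directional derivative
print(fd, grad(w) @ d)                                      # the two numbers agree to ~1e-5
```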

Gradient descent in its pure form: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \mathcal{L}(\mathbf{w}_t)$. This uses the Euclidean identification of the gradient (a covector) with a primal vector.

Natural gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta F(\mathbf{w}_t)^{-1} \nabla \mathcal{L}(\mathbf{w}_t)$, where $F$ is the Fisher information matrix. This uses the correct metric on the manifold of probability distributions (the Fisher-Rao metric) to convert the covector gradient into a tangent vector.

H.2 The Hessian as a Bilinear Map

The Hessian $H\mathcal{L}(\mathbf{w}) \in \mathbb{R}^{n \times n}$ is the matrix of second derivatives:

$$
H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}
$$

But more abstractly, the Hessian is a bilinear form $B: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$:

$$
B(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top H \mathbf{v} = \text{rate of change of the directional derivative of } \mathcal{L} \text{ in direction } \mathbf{v}, \text{ taken in direction } \mathbf{u}
$$

The Hessian determines the curvature of the loss landscape:

  • Positive definite Hessian ($\mathbf{u}^\top H\mathbf{u} > 0$ for all $\mathbf{u} \neq \mathbf{0}$): a stationary point with this property is a local minimum.
  • Indefinite Hessian (both positive and negative eigenvalues): a stationary point with this property is a saddle point.
  • The ratio of the largest to smallest eigenvalue is the condition number $\kappa(H)$ - it governs how slowly gradient descent converges.

For AI: Modern optimizers (Adam, AdaGrad) approximate Hessian-related quantities. Adam's second moment estimate $\hat{v}_t$ tracks coordinate-wise squared gradients, a rough proxy for the diagonal curvature (via the empirical Fisher / Gauss-Newton view of $H$). Dividing the gradient by $\sqrt{\hat{v}_t}$ is a crude diagonal approximation to Newton's method (which applies $H^{-1}$). This is why Adam often converges much faster than SGD on ill-conditioned problems.
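A toy illustration of why curvature-aware scaling helps on an ill-conditioned quadratic (a sketch that mimics diagonal preconditioning, not Adam's exact update rule):

```python
import numpy as np

H = np.diag([100.0, 1.0])                  # ill-conditioned quadratic 0.5 * w^T H w, kappa = 100
grad = lambda w: H @ w

w_gd = np.array([1.0, 1.0])                # plain gradient descent
w_pre = np.array([1.0, 1.0])               # diagonally preconditioned descent
eta = 0.01                                  # near the largest stable step size for plain GD

for _ in range(100):
    w_gd = w_gd - eta * grad(w_gd)
    w_pre = w_pre - 0.5 * grad(w_pre) / np.diag(H)   # divide each coordinate by its curvature

print("plain GD:      ", w_gd)              # the low-curvature coordinate is still far from 0
print("preconditioned:", w_pre)             # both coordinates shrink at the same fast rate
```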

H.3 Weight Matrices as Linear Maps: Training Dynamics

The neural tangent kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide neural networks and shows that their training dynamics under gradient flow are governed by a linear system:

$$
\dot{\mathbf{y}}_t = -\eta \, \Theta(\mathbf{X}, \mathbf{X}) (\mathbf{y}_t - \mathbf{y}^*)
$$

where $\Theta(\mathbf{X}, \mathbf{X})$ is the NTK matrix (constant in the infinite-width limit). This is an ODE with a constant linear map $\Theta$ - so its solution is $\mathbf{y}_t = \mathbf{y}^* + e^{-\eta \Theta t}(\mathbf{y}_0 - \mathbf{y}^*)$.

The eigenvalues of $\Theta$ determine which output directions are learned quickly (large eigenvalues -> fast convergence) and which slowly (small eigenvalues -> slow convergence). This is linear algebra - specifically, the spectral decomposition of a positive semidefinite linear map.
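A small spectral-decomposition check of this solution, with a random PSD stand-in for $\Theta$ (a sketch; the matrix and the residual are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((8, 4))
Theta = J @ J.T                             # synthetic PSD "NTK" over 8 training points (rank <= 4)

eta, t = 0.1, 5.0
lam, V = np.linalg.eigh(Theta)
exp_minus = V @ np.diag(np.exp(-eta * lam * t)) @ V.T   # e^{-eta * Theta * t} via eigendecomposition

r0 = rng.standard_normal(8)                 # initial residual y_0 - y*
r_t = exp_minus @ r0

# per-eigendirection decay: large eigenvalues are fit fast, zero eigenvalues are never fit
print(np.c_[lam, np.abs(V.T @ r0), np.abs(V.T @ r_t)])
```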

H.4 Gradient Flow through Linear Layers

Consider a linear layer $\mathbf{y} = W\mathbf{x}$ with loss $\mathcal{L}$. The gradient with respect to the weight matrix:

$$
\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{x}^\top = \boldsymbol{\delta} \mathbf{x}^\top
$$

where $\boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \in \mathbb{R}^m$ is the "error signal" (upstream gradient). The gradient $\frac{\partial \mathcal{L}}{\partial W}$ is an outer product - a rank-1 matrix.

This means per-sample gradient updates are always rank-1. For a mini-batch of $B$ samples, the gradient is:

$$
\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{B}\sum_{b=1}^B \boldsymbol{\delta}_b \mathbf{x}_b^\top
$$

A sum of $B$ rank-1 matrices - so the gradient has rank at most $B$. For large models with batch size $B \ll n$, the gradient is a very low-rank update to the weight matrix. This low-rank structure of gradients is the empirical justification for gradient low-rank projection methods (GaLore, 2024).
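A quick NumPy check of this rank bound, with made-up layer sizes and batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, B = 64, 128, 8                        # output dim, input dim, batch size (B << min(m, n))

deltas = rng.standard_normal((B, m))        # upstream gradients delta_b, one per sample
xs = rng.standard_normal((B, n))            # layer inputs x_b, one per sample

grad_W = deltas.T @ xs / B                  # (1/B) * sum_b delta_b x_b^T, shape (m, n)
print(np.linalg.matrix_rank(grad_W))        # at most B = 8, despite the 64 x 128 entries
```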


Appendix I: Reference Tables

I.1 Linear Map Properties at a Glance

| Property | Condition | Matrix Equivalent | Geometric Meaning |
|---|---|---|---|
| Linear | $T(a\mathbf{u}+b\mathbf{v}) = aT\mathbf{u}+bT\mathbf{v}$ | Any $m \times n$ matrix | Preserves addition and scaling |
| Injective | $\ker(T) = \{\mathbf{0}\}$ | Full column rank | No two inputs map to the same output |
| Surjective | $\operatorname{im}(T) = W$ | Full row rank | Every output is reachable |
| Bijective (isomorphism) | Both injective and surjective | Square, full rank, invertible | Perfect correspondence |
| Orthogonal | $T^\top T = I$ | $A^\top A = I$, orthonormal columns | Preserves lengths and angles |
| Unitary | $T^* T = I$ (complex) | $A^* A = I$, unitary columns | Complex analogue of orthogonal |
| Projection | $T^2 = T$ | $A^2 = A$ | Idempotent: applying twice = applying once |
| Symmetric | $T = T^\top$ | $A = A^\top$ | Self-adjoint; diagonalizable by the spectral theorem |
| Positive definite | $\langle T\mathbf{v}, \mathbf{v}\rangle > 0$ | $\mathbf{x}^\top A\mathbf{x} > 0$ for $\mathbf{x} \neq \mathbf{0}$ | All eigenvalues positive; curvature at a minimum |
| Normal | $T T^\top = T^\top T$ | $A A^\top = A^\top A$ | Diagonalizable by a unitary matrix |
| Nilpotent | $T^k = 0$ for some $k$ | $A^k = 0$ | Powers eventually vanish; all eigenvalues 0 |
| Involution | $T^2 = I$ | $A^2 = I$ | Self-inverse (like Householder reflections) |

I.2 Rank and Dimension Formulas

| Formula | Statement |
|---|---|
| $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$ | Rank-nullity theorem for $T: V \to W$ |
| $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$ | Row rank equals column rank |
| $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$ | Rank cannot increase under composition |
| $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$ | Subadditivity of rank |
| $\dim(S + T) = \dim(S) + \dim(T) - \dim(S \cap T)$ | Inclusion-exclusion for subspaces |
| $\dim(V/W) = \dim(V) - \dim(W)$ (for a subspace $W$) | Dimension of the quotient space |
| $\operatorname{rank}(A^\top A) = \operatorname{rank}(A)$ | The Gram matrix has the same rank |
| $\operatorname{rank}(P) = \operatorname{tr}(P)$ (for a projection $P$) | Rank = trace for idempotent matrices |

I.3 AI Applications Cross-Reference

| Linear Map Concept | Where It Appears in AI | Mathematical Role |
|---|---|---|
| $W\mathbf{x} + \mathbf{b}$ (affine map) | Every neural layer | Pre-activation computation |
| $Q = XW_Q$ (linear projection) | Attention mechanism | Projects to the query subspace |
| $\Delta W = BA$ (low-rank) | LoRA fine-tuning | Rank-$r$ weight update |
| $J_f = \frac{\partial f}{\partial \mathbf{x}}$ (Jacobian) | Backpropagation | Chain rule at each layer |
| $W^\top \boldsymbol{\delta}$ (transpose map) | Backward pass | Dual map of the forward pass |
| $P = QQ^\top$ (projection) | Layer norm, attention | Projects onto a subspace |
| $W_U \mathbf{h}$ (linear map) | Unembedding (logit computation) | Representation to vocabulary |
| $e^{i\theta} \cdot \mathbf{q}$ (rotation in $\mathbb{C}$) | RoPE positional encoding | Positional rotation |
| $F^{-1}\nabla\mathcal{L}$ (metric-adjusted gradient) | Natural gradient / Adam | Riemannian gradient |
| $\Theta = J^\top J$ (Gram matrix of the Jacobian) | Neural tangent kernel | Training dynamics |


Appendix J: Proofs of Key Results

J.1 Proof: $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$

This is a fundamental result that deserves a careful proof.

Theorem. For any matrix $A \in \mathbb{R}^{m \times n}$, the column rank (dimension of the column space) equals the row rank (dimension of the row space).

Proof (via RREF). Let $A$ have RREF $R$. Row operations do not change the row space, and they preserve linear dependence relations among the columns, so $A$ and $R$ have the same row rank and the same column rank. In $R$:

  • The nonzero rows are linearly independent (each has a leading 1 in a column where no other row has an entry), so the row rank equals the number of nonzero rows.
  • The pivot columns are linearly independent and every non-pivot column is a combination of them, so the column rank equals the number of pivot columns.

The number of nonzero rows equals the number of pivots, so row rank = column rank. $\square$

Alternative proof (via SVD). The SVD $A = U\Sigma V^\top$ has $\operatorname{rank}(A)$ nonzero singular values. The column space of $A$ is spanned by $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ (the first $r$ left singular vectors), dimension $r$. The row space of $A$ (= column space of $A^\top$) is spanned by $\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ (the first $r$ right singular vectors), dimension $r$. Both have dimension $r$ = the number of nonzero singular values. $\square$

J.2 Proof: Kernel and Image are Subspaces

Theorem. For any linear map $T: V \to W$, both $\ker(T)$ and $\operatorname{im}(T)$ are subspaces (of $V$ and $W$ respectively).

Proof for $\ker(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W$, so $\mathbf{0}_V \in \ker(T)$.
  2. Closed under addition: Let $\mathbf{u}, \mathbf{v} \in \ker(T)$. Then $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}$, so $\mathbf{u} + \mathbf{v} \in \ker(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{v} \in \ker(T)$, $c \in \mathbb{F}$. Then $T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}$, so $c\mathbf{v} \in \ker(T)$. $\square$

Proof for $\operatorname{im}(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W \in \operatorname{im}(T)$.
  2. Closed under addition: Let $\mathbf{w}_1, \mathbf{w}_2 \in \operatorname{im}(T)$, so $\mathbf{w}_i = T(\mathbf{v}_i)$ for some $\mathbf{v}_i \in V$. Then $\mathbf{w}_1 + \mathbf{w}_2 = T(\mathbf{v}_1) + T(\mathbf{v}_2) = T(\mathbf{v}_1 + \mathbf{v}_2) \in \operatorname{im}(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{w} = T(\mathbf{v}) \in \operatorname{im}(T)$, $c \in \mathbb{F}$. Then $c\mathbf{w} = cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T)$. $\square$

J.3 Proof: Composition of Linear Maps is Linear

Theorem. If $S: V \to W$ and $T: U \to V$ are linear, then $S \circ T: U \to W$ is linear.

Proof:

$$
\begin{aligned}
(S \circ T)(a\mathbf{u}_1 + b\mathbf{u}_2) &= S(T(a\mathbf{u}_1 + b\mathbf{u}_2)) \\
&= S(aT(\mathbf{u}_1) + bT(\mathbf{u}_2)) && \text{($T$ is linear)} \\
&= aS(T(\mathbf{u}_1)) + bS(T(\mathbf{u}_2)) && \text{($S$ is linear)} \\
&= a(S \circ T)(\mathbf{u}_1) + b(S \circ T)(\mathbf{u}_2). && \square
\end{aligned}
$$

J.4 Proof: The Dual Map is Linear

Theorem. If $T: V \to W$ is linear, then $T^\top: W^* \to V^*$ defined by $(T^\top f)(\mathbf{v}) = f(T(\mathbf{v}))$ is linear.

Proof:

  • $(T^\top(af + bg))(\mathbf{v}) = (af + bg)(T(\mathbf{v})) = af(T(\mathbf{v})) + bg(T(\mathbf{v}))$
  • $= a(T^\top f)(\mathbf{v}) + b(T^\top g)(\mathbf{v}) = (aT^\top f + bT^\top g)(\mathbf{v})$

So $T^\top(af + bg) = aT^\top f + bT^\top g$. $\square$

J.5 Proof: Invertibility Criterion for Finite-Dimensional Spaces

Theorem. Let $T: V \to V$ be a linear map on a finite-dimensional space $V$. Then the following are equivalent:

  1. $T$ is injective (one-to-one)
  2. $T$ is surjective (onto)
  3. $T$ is bijective (invertible)
  4. $\ker(T) = \{\mathbf{0}\}$
  5. $\operatorname{rank}(T) = \dim(V)$

Proof: $(1) \Leftrightarrow (4)$: $T$ is injective iff $\ker(T) = \{\mathbf{0}\}$ (standard).

$(4) \Leftrightarrow (5)$: By rank-nullity, $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$. Nullity $= 0$ iff rank $= \dim(V)$.

$(5) \Leftrightarrow (2)$: Rank $= \dim(\operatorname{im}(T))$. Since $\operatorname{im}(T) \subseteq V$ and both are finite-dimensional, $\dim(\operatorname{im}(T)) = \dim(V)$ iff $\operatorname{im}(T) = V$ (a subspace of equal dimension must be the whole space).

$(3) \Leftrightarrow (1)\,\&\,(2)$: by definition of bijective. $\square$

Important: This equivalence only holds for maps $T: V \to V$ with the same (finite-dimensional) domain and codomain. For $T: \mathbb{R}^m \to \mathbb{R}^n$ with $m \neq n$, injective and surjective are NOT equivalent (one of them is impossible given the dimension constraint).



Appendix K: Additional AI Case Studies

K.1 Mechanistic Interpretability via Linear Maps

Mechanistic interpretability (MI) aims to reverse-engineer neural networks by understanding what computation each component performs. Linear map theory is central to this enterprise.

The residual stream as a communication bus. In transformer models, each layer reads from and writes to a shared "residual stream" $\mathbf{x} \in \mathbb{R}^d$. Each attention head and MLP layer contributes an additive update:

$$
\mathbf{x}_{\ell+1} = \mathbf{x}_\ell + \underbrace{\text{Attn}_\ell(\mathbf{x}_\ell)}_{\text{attn. update}} + \underbrace{\text{MLP}_\ell(\mathbf{x}_\ell)}_{\text{MLP update}}
$$

Each update is (approximately) a low-rank linear map from the residual stream back to itself. The attention update's linear part is $W_O W_V$ (the "OV circuit"); the MLP's linear part is $W_{\text{out}} W_{\text{in}}$ after linearizing the activation.

SVD of the OV circuit. The matrix $W_O W_V \in \mathbb{R}^{d \times d}$ can be analyzed via SVD. Its singular values reveal how strongly the attention head modifies the residual stream, and its singular vectors reveal which directions it reads from and writes to. Heads with near-zero singular values are "inattentive" - they barely modify the residual stream regardless of the attention pattern.
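A sketch of this analysis on random weights (the sizes, and the shape convention $W_V \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$, $W_O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 512, 64                   # illustrative sizes

W_V = rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)   # reads from the residual stream
W_O = rng.standard_normal((d_model, d_head)) / np.sqrt(d_head)    # writes back to the residual stream

OV = W_O @ W_V                               # d_model x d_model map, rank <= d_head
s = np.linalg.svd(OV, compute_uv=False)

print("rank bound:", d_head, " numerical rank:", int(np.sum(s > 1e-8)))
print("top-5 singular values:", np.round(s[:5], 3))   # how strongly this head can write
```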

Subspace decomposition. The full set of $L \times H$ attention heads (for $L$ layers, $H$ heads per layer) collectively forms a large linear map from the input to the residual stream updates. The total update is a sum of $LH$ low-rank linear maps. Understanding the structure of this sum - which heads are redundant, which are essential - is a central goal of circuit-level MI.

K.2 Linear Algebra of Diffusion Models

Diffusion models (DDPM, score matching) add Gaussian noise to data and learn to denoise. The forward process is an affine map:

$$
\mathbf{x}_t = \sqrt{\bar\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\, \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)
$$

This is an affine interpolation between the data $\mathbf{x}_0$ and pure noise. The coefficient $\sqrt{\bar\alpha_t}$ scales the data, and $\sqrt{1-\bar\alpha_t}$ scales the noise.

The denoising objective. The neural network $\epsilon_\theta(\mathbf{x}_t, t)$ estimates $\boldsymbol{\varepsilon}$ (the noise) from the noisy input. Near a data point $\mathbf{x}_0$, this estimator is approximately a linear function of $\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0$; the Tweedie formula gives the optimal estimate of the clean data as:

$$
\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}
$$

which is a linear function of $\mathbf{x}_t$ and $\epsilon_\theta(\mathbf{x}_t, t)$. The diffusion process itself is a composition of affine maps in the forward direction, and the learned reverse process approximately inverts these affine maps.
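A minimal check that the forward noising and this reconstruction are mutually inverse affine maps when the noise estimate is exact (a sketch; the data point and schedule value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)                # a "data point"
alpha_bar = 0.3                              # cumulative schedule value at some step t

eps = rng.standard_normal(16)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps      # forward (affine) noising

eps_hat = eps                                # pretend the network predicts the noise exactly
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
print(np.allclose(x0_hat, x0))               # True: the reconstruction is affine in (x_t, eps_hat)
```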

K.3 State Space Models as Linear Dynamical Systems

Structured State Space Models (S4, Mamba, RWKV) compute their state updates via linear recurrences:

$$
\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t
$$

$$
\mathbf{y}_t = C\mathbf{h}_t + D\mathbf{x}_t
$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times d}$, $C \in \mathbb{R}^{d \times N}$, $D \in \mathbb{R}^{d \times d}$ are (possibly input-dependent) matrices.

The state transition $\mathbf{h} \mapsto A\mathbf{h} + B\mathbf{x}$ is a linear dynamical system - the fundamental object of study in control theory and signal processing.

Key linear algebra results for SSMs (a small numerical check follows the list):

  1. Eigenvalues of $A$ determine memory. If $|\lambda_i(A)| < 1$ for all $i$, the contribution of past inputs decays geometrically (fading memory). If any $|\lambda_i| > 1$, the state can grow without bound.

  2. Diagonalization for efficiency. If $A = P\Lambda P^{-1}$, the recurrence decouples into $N$ independent scalar recurrences - each computable independently. S4 uses a structured $A$ (diagonal-plus-low-rank, DPLR; later variants use purely diagonal $A$) so the whole sequence can be computed in parallel as a convolution via the FFT.

  3. The convolution view. Unrolling the recurrence: $\mathbf{y}_t = \sum_{\tau=0}^t C A^{t-\tau} B \mathbf{x}_\tau + D\mathbf{x}_t$. The impulse response $C A^k B$ is a sequence of matrix powers - analyzable by the spectral decomposition of $A$.

  4. Mamba's selectivity. Mamba makes $A, B, C$ input-dependent: $A(\mathbf{x}), B(\mathbf{x}), C(\mathbf{x})$. The recurrence becomes bilinear in $(\mathbf{h}, \mathbf{x})$, not purely linear. Linearizing around typical inputs gives a locally linear system analyzable with the tools of this section.
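A minimal NumPy check that the recurrent and convolution views agree, with a diagonal $A$ whose eigenvalues lie inside the unit circle (the matrices, $D = 0$, and the convention of updating the state before emitting the output are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 4, 1, 20                          # state size, input size, sequence length

A = np.diag([0.9, 0.5, -0.3, 0.99])         # |eigenvalues| < 1: fading memory
B = rng.standard_normal((N, d))
C = rng.standard_normal((d, N))
D = np.zeros((d, d))
x = rng.standard_normal((T, d))

# recurrent view: scan over time steps
h, y_rec = np.zeros(N), []
for t in range(T):
    h = A @ h + B @ x[t]
    y_rec.append(C @ h + D @ x[t])
y_rec = np.array(y_rec)

# convolution view: y_t = sum_{tau <= t} C A^{t-tau} B x_tau
y_conv = np.array([sum(C @ np.linalg.matrix_power(A, t - tau) @ B @ x[tau]
                       for tau in range(t + 1)) for t in range(T)])
print(np.allclose(y_rec, y_conv))           # True
```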


Appendix L: Further Reading

Primary References

  1. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer. - The definitive abstract treatment of linear maps. Goes from axioms to spectral theory without matrices until chapter 10. Highly recommended for conceptual depth.

  2. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. - Computational and applied focus. Excellent for four fundamental subspaces and applications.

  3. Horn, R. & Johnson, C. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. - Comprehensive advanced reference. Proofs of all major results, including Cayley-Hamilton, spectral theorems, singular values.

  4. Trefethen, L. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Gold standard for computational linear algebra and stability.

AI-Focused References

  1. Vaswani, A. et al. (2017). "Attention is All You Need." NeurIPS. - The original transformer paper; read the attention mechanism as linear projections.

  2. Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. - LoRA rank-nullity argument.

  3. Elhage, N. et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. - OV and QK circuits as linear maps; residual stream as communication bus.

  4. Park, K. et al. (2023). "The Linear Representation Hypothesis and the Geometry of Large Language Models." - Linear features in transformer representations.

  5. Jacot, A. et al. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. - Training dynamics via linear maps (NTK theory).

  6. Gu, A. et al. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. - SSMs as linear dynamical systems.




Appendix M: Self-Assessment Checklist

After completing this section, you should be able to answer the following questions without notes.

Conceptual Understanding

  • Q1. State the two axioms of a linear map. What is the fastest way to show a map is NOT linear?

  • Q2. What is the kernel of a linear map? Prove it is a subspace.

  • Q3. State the rank-nullity theorem. Give an example where rank = 2, nullity = 3. What can you say about the domain and codomain dimensions?

  • Q4. Why is a linear map from $V$ to $V$ injective if and only if it is surjective? Why does this fail for maps between spaces of different dimensions?

  • Q5. What is the change-of-basis formula? If $A = P^{-1}BP$, what relationship does that establish between the maps represented by $A$ and $B$?

  • Q6. What is an orthogonal projection? How do you verify that a matrix $P$ is a projection? What two extra properties make it an orthogonal projection?

  • Q7. What is the Jacobian matrix? For $f: \mathbb{R}^3 \to \mathbb{R}^2$, what is the shape of $J_f$?

  • Q8. In the backward pass of backpropagation, why do we multiply by $W^\top$ rather than $W$?

  • Q9. What makes an affine map different from a linear map? How do you represent an affine map as a linear map (in one higher dimension)?

Computational Skills

  • C1. Given a matrix $A$, find a basis for $\ker(A)$ using row reduction.

  • C2. Given a linear map $T$ defined on a non-standard basis, write its matrix in that basis.

  • C3. Given two bases $\mathcal{B}$ and $\mathcal{B}'$, compute the change-of-basis matrix $P$ and use it to transform the matrix of $T$.

  • C4. For a rank-$r$ update $\Delta W = BA$, compute the null space dimension and verify it numerically.

  • C5. Compute the Jacobian of a given vector-valued function (e.g., softmax, elementwise ReLU, an affine map composed with sigmoid).

AI Connections

  • AI1. Explain why LoRA (low-rank adaptation) is more parameter-efficient than full fine-tuning, using the language of rank and nullity.

  • AI2. Describe the forward pass of a multi-head attention layer as a sequence of linear maps. Which operations are linear, which are bilinear, and which are nonlinear?

  • AI3. What is the linear representation hypothesis? Why does it matter for interpretability research?

  • AI4. Why does a purely linear deep network (no activations) collapse to a single linear map, regardless of depth?

  • AI5. How does the dual map relate to backpropagation? What mathematical object is the "gradient" in the strict sense?


Appendix N: Notation Summary for This Section

| Symbol | Meaning |
|---|---|
| $T: V \to W$ | Linear transformation from $V$ to $W$ |
| $\ker(T)$ | Kernel (null space) of $T$ |
| $\operatorname{im}(T)$ | Image (range) of $T$ |
| $\operatorname{rank}(T)$ | Dimension of $\operatorname{im}(T)$ |
| $\operatorname{nullity}(T)$ | Dimension of $\ker(T)$ |
| $[T]_{\mathcal{B}}^{\mathcal{C}}$ | Matrix of $T$ from basis $\mathcal{B}$ to basis $\mathcal{C}$ |
| $P$ | Change-of-basis matrix |
| $T^\top$ | Dual (transpose) map |
| $V^*$ | Dual space of $V$ |
| $Df_{\mathbf{x}}$ | Total derivative (Fréchet derivative) of $f$ at $\mathbf{x}$ |
| $J_f(\mathbf{x})$ | Jacobian matrix of $f$ at $\mathbf{x}$ |
| $\mathcal{L}(V, W)$ | Space of all linear maps from $V$ to $W$ |
| $V \cong W$ | $V$ and $W$ are isomorphic |
| $V/K$ | Quotient space of $V$ modulo the subspace $K$ |
| $A \sim B$ | $A$ and $B$ are similar matrices ($B = P^{-1}AP$) |
| $P^2 = P$ | Projection (idempotent) |
| $A^\top A = I$ | Orthogonal matrix |
| $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ | Affine map |
| $\Delta W = BA$ | LoRA low-rank update, $\operatorname{rank} \leq r$ |

Appendix O: Linear Maps and Symmetry

O.1 Equivariant Maps

A linear map $T: V \to W$ is equivariant with respect to a group $G$ if, for every $g \in G$ and every $\mathbf{v} \in V$:

$$
T(\rho_V(g)\mathbf{v}) = \rho_W(g)\, T(\mathbf{v})
$$

where $\rho_V: G \to GL(V)$ and $\rho_W: G \to GL(W)$ are representations of $G$ on $V$ and $W$.

Intuitively: "applying the group action then the map = applying the map then the group action." The map commutes with the symmetry.

Examples:

  • Translation equivariance: on a vector space, $T(\mathbf{v} + \mathbf{c}) = T(\mathbf{v}) + T(\mathbf{c})$ is just additivity, so every linear map is automatically equivariant to the translation group acting on its input space.
  • Rotation equivariance: $T(R\mathbf{v}) = RT(\mathbf{v})$ for all rotations $R$. In 3D this forces $T = \lambda I$ for some scalar $\lambda$ (Schur's lemma applied to the rotation representation).
  • Permutation equivariance: $T(P\mathbf{v}) = PT(\mathbf{v})$ for all permutation matrices $P$. This forces $T$ to be a sum of a "same-position" term and a "mean-field" term - which is why mean pooling and attention with tied weights are permutation equivariant.

For AI: CNNs achieve translation equivariance by using convolutional (shared-weight) linear maps. Equivariant graph neural networks use permutation-equivariant maps. Geometric deep learning is the systematic study of building neural networks as compositions of equivariant linear maps. Transformer attention (without positional encoding) is permutation equivariant - adding positional encodings explicitly breaks this symmetry.
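A numerical check that a positional-encoding-free self-attention block is permutation equivariant (a sketch with random features and a single tied value/output matrix `W`, which is a simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                # n set elements, d features each
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))

def attn_no_pos(X):
    """Single-head self-attention with a tied value/output matrix and no positional encoding."""
    scores = X @ X.T / np.sqrt(d)
    A = np.exp(scores); A /= A.sum(axis=1, keepdims=True)   # row-wise softmax
    return A @ X @ W

P = np.eye(n)[rng.permutation(n)]          # random permutation matrix

# permuting the input rows permutes the output rows in exactly the same way
print(np.allclose(attn_no_pos(P @ X), P @ attn_no_pos(X)))   # True
```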

O.2 Schur's Lemma and Irreducible Representations

Schur's Lemma. Let $T: V \to V$ be a linear map that commutes with all maps in an irreducible representation of a group $G$ (i.e., $T\rho(g) = \rho(g)T$ for all $g \in G$). Then $T = \lambda I$ for some scalar $\lambda$.

This powerful result says: the only linear maps that commute with all symmetries of an irreducible representation are scalar multiples of the identity. This constrains the form of equivariant maps.

Application: If attention weights must be equivariant to the representation of a certain symmetry group acting on the heads, Schur's lemma constrains the possible attention patterns.

O.3 Representation Theory Preview

Representation theory studies how groups act on vector spaces via linear maps. Every group representation $\rho: G \to GL(V)$ is a group homomorphism - a map that takes group elements to invertible linear maps, preserving the group structure:

$$
\rho(gh) = \rho(g)\rho(h) \quad \text{(composition respects group multiplication)}
$$

This is the language in which equivariant neural networks (E(3)-equivariant networks for molecular property prediction, SE(3)-equivariant networks for robotics, permutation-equivariant networks for sets) are designed. The "weights" of an equivariant linear layer are constrained to be equivariant - and representation theory tells you exactly what form these weights can take.


Appendix P: Quick Reference - Common Linear Maps in $\mathbb{R}^2$ and $\mathbb{R}^3$

Common $2 \times 2$ Linear Maps

| Transformation | Matrix | Properties |
|---|---|---|
| Rotation by $\theta$ | $\begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix}$ | Orthogonal, $\det=1$, $\lambda = e^{\pm i\theta}$ |
| Reflection across the $x$-axis | $\begin{pmatrix}1 & 0 \\ 0 & -1\end{pmatrix}$ | Symmetric, $\det=-1$, $\lambda = \pm 1$ |
| Reflection across $y=x$ | $\begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}$ | Symmetric, $\det=-1$, $\lambda = \pm 1$ |
| Horizontal shear by $k$ | $\begin{pmatrix}1 & k \\ 0 & 1\end{pmatrix}$ | $\det=1$, $\lambda = 1$ (double) |
| Scaling by $(a, b)$ | $\begin{pmatrix}a & 0 \\ 0 & b\end{pmatrix}$ | Symmetric, $\det=ab$, $\lambda = a, b$ |
| Projection onto the $x$-axis | $\begin{pmatrix}1 & 0 \\ 0 & 0\end{pmatrix}$ | Symmetric, idempotent, $\det=0$, $\lambda=0,1$ |
| Projection onto $y=x$ | $\frac{1}{2}\begin{pmatrix}1 & 1 \\ 1 & 1\end{pmatrix}$ | Symmetric, idempotent, $\lambda=0,1$ |
| Zero map | $\begin{pmatrix}0 & 0 \\ 0 & 0\end{pmatrix}$ | Rank 0, $\det=0$, $\lambda=0$ |
| Identity | $\begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix}$ | All eigenvalues 1, $\det=1$ |

All these are linear maps. To make them affine (include translation), append a row and column in homogeneous form.
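A quick NumPy verification of a few rows of this table (the rotation angle and shear amount are arbitrary):

```python
import numpy as np

theta = 0.7
maps = {
    "rotation":     np.array([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta),  np.cos(theta)]]),
    "reflection_x": np.diag([1.0, -1.0]),
    "shear":        np.array([[1.0, 2.0], [0.0, 1.0]]),
    "projection_x": np.diag([1.0, 0.0]),
}

for name, A in maps.items():
    eig = np.linalg.eigvals(A)
    print(f"{name:12s} det={np.linalg.det(A):+.2f}  "
          f"orthogonal={np.allclose(A.T @ A, np.eye(2))}  "
          f"idempotent={np.allclose(A @ A, A)}  eigenvalues={np.round(eig, 2)}")
```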

Common $3 \times 3$ Linear Maps

| Transformation | Description | Key Properties |
|---|---|---|
| Rotation around the $z$-axis | $R_z(\theta)$: rotates the $xy$-plane, fixes $z$ | Orthogonal, $\det=1$ |
| Reflection across the $xy$-plane | $\operatorname{diag}(1,1,-1)$ | Symmetric, $\det=-1$ |
| Projection onto the $xy$-plane | $\operatorname{diag}(1,1,0)$ | Symmetric, idempotent, rank 2 |
| Householder reflection | $I - 2\mathbf{n}\mathbf{n}^\top$ | Symmetric, $\det=-1$, $\lambda = 1$ (mult. 2) and $-1$ |
| Scaling | $\operatorname{diag}(a,b,c)$ | Diagonal; eigenvalues are $a,b,c$ |
| Shear | $I + s\,\mathbf{e}_i\mathbf{e}_j^\top$ ($i \neq j$) | $\det=1$; all eigenvalues 1 |

End of Linear Transformations section. Continue to 05: Orthogonality and Orthonormality.