Linear Transformations, Part 1: Intuition (Section 1) to Dual Spaces and Transposes (Section 6)

1. Intuition

1.1 Three Views of a Matrix

A matrix A \in \mathbb{R}^{m \times n} admits three distinct but equivalent interpretations, and fluent practitioners shift between them effortlessly.

View 1: Data container. The matrix is a rectangular array of numbers - m rows, n columns. This is the computational view. We use it when we care about entries: A_{ij}, the element in row i, column j.

View 2: Column geometry. The matrix is a collection of n column vectors in \mathbb{R}^m. The product A\mathbf{x} is then a linear combination: A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n, where \mathbf{a}_i is the i-th column. This view makes the column space visible and illuminates when A\mathbf{x} = \mathbf{b} has solutions.

View 3: Linear transformation. The matrix defines a function T: \mathbb{R}^n \to \mathbb{R}^m by T(\mathbf{x}) = A\mathbf{x}. This is the abstract, coordinate-free view. The map T exists independently of any coordinate representation - the matrix A is merely its description in a particular pair of bases (the standard bases of \mathbb{R}^n and \mathbb{R}^m).

THREE VIEWS OF THE MATRIX A
========================================================================

  [2  1]         Column geometry:          Linear map:
  [0  3]         A = [a_1 | a_2]           T: R^2 -> R^3
  [1 -1]                                   T(x) = Ax

  Data container      Span{a_1, a_2}       Sends basis vectors
  A_21 = 0             = column space      e_1 -> col 1 of A
  A_32 = -1                                e_2 -> col 2 of A

  All three describe the same mathematical object.

========================================================================
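
The following is a minimal NumPy sketch (illustrative, not part of the lesson) that exercises all three views on the 3 x 2 matrix from the diagram:

```python
import numpy as np

# The 3x2 matrix from the diagram above.
A = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, -1.0]])
x = np.array([2.0, -1.0])

# View 1: data container -- individual entries (0-indexed here).
print(A[1, 0], A[2, 1])          # 0.0 -1.0

# View 2: column geometry -- Ax is a combination of the columns.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

# View 3: linear map -- apply T(x) = Ax directly.
Tx = A @ x

print(np.allclose(combo, Tx))    # True
```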

For AI: In a transformer, the weight matrix W_Q \in \mathbb{R}^{d_k \times d} is simultaneously all three: raw parameters to be optimized (View 1), a basis for the query subspace (View 2), and a linear projection from the residual stream to query space (View 3). Understanding which view you're using prevents confusion about what "the attention head is doing."

1.2 Geometric Action of Linear Maps

What does T(\mathbf{x}) = A\mathbf{x} do to space? The two axioms - additivity and homogeneity - impose strong geometric constraints:

  1. The origin is fixed. T(\mathbf{0}) = A\mathbf{0} = \mathbf{0} always. A linear map cannot translate; it cannot move the origin. This is the most important geometric fact about linear transformations.

  2. Lines through the origin stay lines. If \mathbf{x}(t) = t\mathbf{v} is a line through the origin, then T(\mathbf{x}(t)) = tT(\mathbf{v}) is a line through the origin in the output space.

  3. Parallel lines stay parallel. If \mathbf{y}_1 - \mathbf{y}_2 = \mathbf{v} (same direction), then T(\mathbf{y}_1) - T(\mathbf{y}_2) = T(\mathbf{v}) (still the same direction).

  4. Grid lines go to grid lines, equally spaced. This is the classic 3Blue1Brown visualization: apply A to the integer grid of \mathbb{R}^2 and you get a (possibly skewed) grid in \mathbb{R}^2.

  5. The unit square maps to a parallelogram. The parallelogram spanned by the columns of A is the image of the unit square under T.

What can a linear map do? Depending on the matrix:

  • Rotate space (orthogonal matrix, \det = 1)
  • Reflect space (orthogonal matrix, \det = -1)
  • Scale dimensions (diagonal matrix)
  • Shear space (triangular matrix)
  • Project onto a subspace (idempotent: P^2 = P)
  • Collapse space to a lower dimension (rank-deficient matrix)

What a linear map cannot do: translate (no origin-shifting), apply nonlinear distortion (no bending, folding, or curving of lines).

1.3 Why Linearity is Special

The superposition principle is the reason linear algebra is tractable. For a linear map T:

T\!\left(\sum_{i=1}^n c_i \mathbf{v}_i\right) = \sum_{i=1}^n c_i T(\mathbf{v}_i)

This means: to know T on all of V, you only need to know T on a basis. A basis has \dim(V) elements. The map on an infinite space is encoded in finitely many column vectors. This is extraordinary compression: infinite -> finite.

Linearity vs nonlinearity in neural networks. A deep neural network with no nonlinearities is just a single linear transformation: W_L W_{L-1} \cdots W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}. Multiple linear layers collapse to one. The nonlinearities (ReLU, GELU, SiLU) are what allow composition to create genuinely new representational power. The linear layers provide the parameterized directions; the nonlinear activations provide the expressive capacity.

Linearity in analysis. Many of the hardest problems in mathematics and ML become tractable when restricted to linear functions: linear regression has a closed-form solution, linear systems have complete theory, spectral analysis of linear operators is well-understood. The strategy of "linearize, solve, interpret" recurs throughout calculus, optimization, and signal processing.

For AI: The linear representation hypothesis (Elhage et al. 2022, Park et al. 2023) conjectures that many high-level features in LLMs are encoded as directions in representation space - i.e., they are linear features. If true, this means the crucial structure of LLM computation is linear, and all the machinery of this section applies directly to understanding what models are doing.

1.4 Historical Timeline

Year | Person | Contribution
1844 | Hermann Grassmann | Die lineale Ausdehnungslehre - first abstract treatment of linear spaces
1855 | Arthur Cayley | Matrix algebra as formal system; composition of matrices
1888 | Giuseppe Peano | First rigorous axiomatization of vector spaces
1902 | Henri Lebesgue | Integration as a linear functional on function spaces
1904-1910 | Hilbert, Riesz, Fischer | Infinite-dimensional linear algebra; Hilbert spaces; spectral theory
1929 | John von Neumann | Operator theory; linear transformations on Hilbert spaces
1940s | Alan Turing, von Neumann | Linear algebra for numerical computation; matrix algorithms
1986 | Rumelhart, Hinton, Williams | Backpropagation - chains of Jacobians as the training algorithm
2017 | Vaswani et al. | Attention = linear projections + scaled dot product; transformers
2021 | Hu et al. (LoRA) | Low-rank linear map updates for efficient fine-tuning
2022- | Elhage, Park et al. | Linear representation hypothesis; mechanistic interpretability

2. Formal Definitions

2.1 The Two Axioms and All Their Consequences

Definition (Linear Transformation). Let V and W be vector spaces over the same field \mathbb{F} (typically \mathbb{R} or \mathbb{C}). A function T: V \to W is a linear transformation (also called a linear map or homomorphism) if it satisfies:

  1. Additivity: T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) for all \mathbf{u}, \mathbf{v} \in V
  2. Homogeneity: T(c\mathbf{v}) = cT(\mathbf{v}) for all \mathbf{v} \in V, c \in \mathbb{F}

These two axioms are often combined into the single condition:

T(a\mathbf{u} + b\mathbf{v}) = aT(\mathbf{u}) + bT(\mathbf{v}) \quad \text{for all } \mathbf{u}, \mathbf{v} \in V, \; a, b \in \mathbb{F}
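
The combined condition also suggests a quick numerical sanity check. The sketch below (illustrative only; the helper name looks_linear is made up) tests it on random inputs - a pass is evidence, not a proof, of linearity:

```python
import numpy as np

def looks_linear(T, n, trials=100, seed=0):
    """Test T(a*u + b*v) == a*T(u) + b*T(v) on random vectors and scalars."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=n)
        a, b = rng.normal(size=2)
        if not np.allclose(T(a * u + b * v), a * T(u) + b * T(v)):
            return False
    return True

A = np.array([[1.0, 2.0], [0.0, -1.0]])
print(looks_linear(lambda x: A @ x, 2))       # True: matrix multiplication
print(looks_linear(lambda x: x + 1.0, 2))     # False: translation fails the zero test
```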

Immediate consequences (each follows directly from the axioms):

Proposition 2.1.1 (Zero maps to zero). T(\mathbf{0}_V) = \mathbf{0}_W.

Proof: T(\mathbf{0}) = T(0 \cdot \mathbf{v}) = 0 \cdot T(\mathbf{v}) = \mathbf{0} for any \mathbf{v} \in V. \square

This is a universal test: if T(\mathbf{0}) \neq \mathbf{0}, then T is not linear. Translation (T(\mathbf{x}) = \mathbf{x} + \mathbf{b} with \mathbf{b} \neq \mathbf{0}) fails immediately.

Proposition 2.1.2 (Negatives are preserved). T(-\mathbf{v}) = -T(\mathbf{v}).

Proof: T(-\mathbf{v}) = T((-1)\mathbf{v}) = (-1)T(\mathbf{v}) = -T(\mathbf{v}). \square

Proposition 2.1.3 (General linear combinations). For any \mathbf{v}_1, \ldots, \mathbf{v}_k \in V and c_1, \ldots, c_k \in \mathbb{F}:

T\!\left(\sum_{i=1}^k c_i \mathbf{v}_i\right) = \sum_{i=1}^k c_i T(\mathbf{v}_i)

Proof: By induction using additivity and homogeneity. \square

Proposition 2.1.4 (Determined by basis images). If \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} is a basis for V, then T is completely determined by T(\mathbf{b}_1), \ldots, T(\mathbf{b}_n). Moreover, for any choice of \mathbf{w}_1, \ldots, \mathbf{w}_n \in W, there exists a unique linear map T with T(\mathbf{b}_i) = \mathbf{w}_i.

This is the fundamental construction theorem. It means the matrix of T (in standard coordinates) is:

A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

The set of all linear maps T: V \to W is denoted \mathcal{L}(V, W) or \operatorname{Hom}(V, W). It is itself a vector space under pointwise operations: (S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v}) and (cT)(\mathbf{v}) = cT(\mathbf{v}).

2.2 Kernel of a Linear Map

Definition (Kernel). The kernel (or null space) of a linear map T: V \to W is:

\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}_W\}

It is the set of all inputs that T maps to zero - the "lost information" of the map.

Theorem 2.2.1. \ker(T) is a subspace of V.

Proof:

  • Zero: T(\mathbf{0}) = \mathbf{0}, so \mathbf{0} \in \ker(T).
  • Closure under addition: If T(\mathbf{u}) = T(\mathbf{v}) = \mathbf{0}, then T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}.
  • Closure under scaling: If T(\mathbf{v}) = \mathbf{0}, then T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}. \square

Geometric meaning. \ker(T) is the subspace that T "collapses to zero." If T is the projection onto the xy-plane, then \ker(T) is the z-axis. If T is differentiation of polynomials, \ker(T) is the constant polynomials.

Theorem 2.2.2 (Injectivity criterion). T is injective (one-to-one) if and only if \ker(T) = \{\mathbf{0}\}.

Proof. (\Rightarrow) If T is injective and T(\mathbf{v}) = \mathbf{0} = T(\mathbf{0}), then \mathbf{v} = \mathbf{0}. (\Leftarrow) If \ker(T) = \{\mathbf{0}\} and T(\mathbf{u}) = T(\mathbf{v}), then T(\mathbf{u} - \mathbf{v}) = \mathbf{0}, so \mathbf{u} - \mathbf{v} = \mathbf{0}, i.e., \mathbf{u} = \mathbf{v}. \square

The dimension of the kernel, \dim(\ker(T)), is called the nullity of T.

2.3 Image of a Linear Map

Definition (Image). The image (or range) of a linear map T: V \to W is:

\operatorname{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\} = T(V)

It is the set of all possible outputs - the "reachable" part of W.

Theorem 2.3.1. \operatorname{im}(T) is a subspace of W.

Proof:

  • Zero: T(\mathbf{0}) = \mathbf{0} \in \operatorname{im}(T).
  • Closure under addition: T(\mathbf{u}) + T(\mathbf{v}) = T(\mathbf{u} + \mathbf{v}) \in \operatorname{im}(T).
  • Closure under scaling: cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T). \square

Theorem 2.3.2. \operatorname{im}(T) is the column space of the matrix A representing T.

Proof: T(\mathbf{x}) = A\mathbf{x} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n, which is exactly the span of the columns of A. \square

Theorem 2.3.3 (Surjectivity criterion). T: V \to W is surjective (onto) if and only if \operatorname{im}(T) = W.

The dimension of the image, \dim(\operatorname{im}(T)), is called the rank of T, denoted \operatorname{rank}(T).

2.4 The Rank-Nullity Theorem

This is one of the most elegant and useful results in linear algebra, connecting the three fundamental dimensions of a linear map.

Theorem 2.4.1 (Rank-Nullity Theorem). Let T: V \to W be a linear map with V finite-dimensional. Then:

\dim(V) = \dim(\ker(T)) + \dim(\operatorname{im}(T)) = \operatorname{nullity}(T) + \operatorname{rank}(T)

Proof. Let \{\mathbf{k}_1, \ldots, \mathbf{k}_p\} be a basis for \ker(T) (so p = \operatorname{nullity}(T)). Extend this to a basis for all of V: \{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\}, where p + q = \dim(V).

We claim \{T(\mathbf{b}_1), \ldots, T(\mathbf{b}_q)\} is a basis for \operatorname{im}(T).

Spanning: For any \mathbf{w} \in \operatorname{im}(T), write \mathbf{w} = T(\mathbf{v}) for some \mathbf{v} = \sum c_i \mathbf{k}_i + \sum d_j \mathbf{b}_j. Then \mathbf{w} = T(\mathbf{v}) = \sum c_i T(\mathbf{k}_i) + \sum d_j T(\mathbf{b}_j) = \sum d_j T(\mathbf{b}_j), since each T(\mathbf{k}_i) = \mathbf{0}.

Linear independence: If \sum d_j T(\mathbf{b}_j) = \mathbf{0}, then T(\sum d_j \mathbf{b}_j) = \mathbf{0}, so \sum d_j \mathbf{b}_j \in \ker(T) = \operatorname{span}\{\mathbf{k}_1, \ldots, \mathbf{k}_p\}. But \{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\} is linearly independent, so all d_j = 0. \square

Intuition: The rank-nullity theorem says T "uses" its input dimensions in two ways: some dimensions are collapsed to zero (nullity), and the rest are faithfully transmitted to the output (rank). Total in = lost + kept.

RANK-NULLITY THEOREM
========================================================================

  T: V -----------------------------------> W
     |
     +-- ker(T)  ---> {0}            (nullity: collapsed dimensions)
     |
     +-- "rest"  ---> im(T) in W     (rank: transmitted dimensions)

  dim(V)    =    nullity(T)      +      rank(T)
  [total]      [collapsed dims]     [surviving dims]

  Example: T: R^5 -> R^3 with rank 2
  -> nullity = 5 - 2 = 3
  -> 3 dimensions collapse, 2 survive

========================================================================

Example. T: \mathbb{R}^4 \to \mathbb{R}^3 defined by A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. Rank = 3 (full row rank). Nullity = 4 - 3 = 1. The null space is \ker(T) = \operatorname{span}\{(-1, -1, 1, 0)^\top\}.
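
A quick numerical confirmation of this example (a NumPy sketch; the 1e-12 threshold is an arbitrary tolerance for "nonzero singular value"):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Rank and nullity via the SVD: rank = number of nonzero singular values.
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
nullity = A.shape[1] - rank
print(rank, nullity, rank + nullity == A.shape[1])       # 3 1 True

# Rows of Vt beyond the rank span ker(A); here that is a single direction.
k = Vt[rank:].T                                          # shape (4, 1)
print(np.allclose(A @ k, 0))                             # True
print(np.allclose(A @ np.array([-1.0, -1.0, 1.0, 0.0]), 0))   # True
```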

For AI: In LoRA, a weight update \Delta W = BA with B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k} has rank at most r. By rank-nullity, its kernel has dimension at least k - r. When r \ll k, the update leaves most of the input space unchanged - it only "speaks to" an r-dimensional subspace.

2.5 Examples and Non-Examples

Linear transformations:

Map | Domain -> Codomain | Kernel | Image
T(\mathbf{x}) = A\mathbf{x} (matrix mult.) | \mathbb{R}^n \to \mathbb{R}^m | null space of A | column space of A
T(f) = f' (differentiation) | \mathcal{P}_n \to \mathcal{P}_{n-1} | constants | all polynomials of degree \leq n-1
T(f) = \int_0^x f(t)\,dt (integration) | \mathcal{C}[0,1] \to \mathcal{C}[0,1] | \{\mathbf{0}\} | functions vanishing at 0
T(\mathbf{x}) = \mathbf{0} (zero map) | V \to W | all of V | \{\mathbf{0}\}
T(\mathbf{x}) = \mathbf{x} (identity) | V \to V | \{\mathbf{0}\} | all of V
T(x, y) = (x, 0) (projection) | \mathbb{R}^2 \to \mathbb{R}^2 | y-axis | x-axis
T(A) = \operatorname{tr}(A) (trace) | \mathbb{R}^{n\times n} \to \mathbb{R} | trace-zero matrices | \mathbb{R}

Non-linear maps (and why they fail):

Map | Linearity failure | Test
T(\mathbf{x}) = \mathbf{x} + \mathbf{b} (\mathbf{b} \neq \mathbf{0}) | T(\mathbf{0}) = \mathbf{b} \neq \mathbf{0} | Zero test
T(x) = x^2 | T(1+1) = 4 \neq T(1) + T(1) = 2 | Additivity
T(\mathbf{x}) = \lVert\mathbf{x}\rVert (norm) | T(-\mathbf{e}_1) = 1 \neq -T(\mathbf{e}_1) = -1 (fails for negative scalars) | Homogeneity
T(\mathbf{x}) = \operatorname{softmax}(\mathbf{x}) | T(2\mathbf{x}) \neq 2T(\mathbf{x}); softmax outputs always sum to 1 | Homogeneity
T(\mathbf{x}) = \operatorname{ReLU}(\mathbf{x}) | T(\mathbf{u} + \mathbf{v}) \neq T(\mathbf{u}) + T(\mathbf{v}) in general | Additivity
T(\mathbf{x}) = \mathbf{x} \odot \mathbf{x} (elementwise square) | T(c\mathbf{x}) = c^2\, \mathbf{x} \odot \mathbf{x} \neq c\,(\mathbf{x} \odot \mathbf{x}) | Homogeneity

Note on ReLU: Though ReLU is not linear, it is piecewise linear - linear on each orthant. This means neural networks with ReLU are piecewise linear functions, which is a key fact for understanding their behavior.
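
The failures in the table are easy to exhibit numerically. A short NumPy sketch (illustrative values chosen arbitrarily):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
u = np.array([1.0, -2.0])
v = np.array([-3.0, 5.0])

# ReLU fails additivity:
print(relu(u + v), relu(u) + relu(v))              # [0. 3.]  vs  [1. 5.]

# The elementwise square fails homogeneity:
sq = lambda x: x * x
print(sq(2 * u), 2 * sq(u))                        # [ 4. 16.]  vs  [ 2.  8.]

# A matrix map passes both (up to floating point):
A = np.array([[1.0, 2.0], [0.0, -1.0]])
print(np.allclose(A @ (u + v), A @ u + A @ v))     # True
print(np.allclose(A @ (3.0 * u), 3.0 * (A @ u)))   # True
```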


3. Matrix Representation of a Linear Map

3.1 The Standard Basis Construction

Over \mathbb{R}^n and \mathbb{R}^m with standard bases, every linear map T: \mathbb{R}^n \to \mathbb{R}^m corresponds to a unique matrix A \in \mathbb{R}^{m \times n}.

Construction. The j-th column of A is T(\mathbf{e}_j), the image of the j-th standard basis vector:

A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

Why this works: For any \mathbf{x} = \sum_{j=1}^n x_j \mathbf{e}_j:

T(\mathbf{x}) = T\!\left(\sum_j x_j \mathbf{e}_j\right) = \sum_j x_j T(\mathbf{e}_j) = A\mathbf{x}

The matrix is the complete, coordinate-encoded description of T.

Example. Find the matrix for the 2D counterclockwise rotation by angle \theta.

T(\mathbf{e}_1) = (\cos\theta, \sin\theta)^\top and T(\mathbf{e}_2) = (-\sin\theta, \cos\theta)^\top, giving:

R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

Example. Find the matrix for differentiation D: \mathcal{P}_3 \to \mathcal{P}_2 using the bases \{1, x, x^2, x^3\} and \{1, x, x^2\}.

D(1) = 0, D(x) = 1, D(x^2) = 2x, D(x^3) = 3x^2. In the output basis:

[D]_{\mathcal{B}}^{\mathcal{C}} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}

This is a 3 \times 4 matrix, reflecting D: \mathcal{P}_3 \to \mathcal{P}_2.
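
A minimal NumPy sketch (coefficient vectors store the constant term first; illustrative only) applying this differentiation matrix to a concrete polynomial:

```python
import numpy as np

# Matrix of D: P_3 -> P_2 in the monomial bases {1, x, x^2, x^3} and {1, x, x^2}.
D = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])

# p(x) = 5 - x + 2x^2 + 4x^3, stored as coefficients (constant term first).
p = np.array([5.0, -1.0, 2.0, 4.0])

# p'(x) = -1 + 4x + 12x^2
print(D @ p)        # [-1.  4. 12.]
```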

3.2 Representation in Arbitrary Bases

When V and W have non-standard bases, the matrix of T depends on those bases.

Setup. Let \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} be a basis for V and \mathcal{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_m\} be a basis for W.

Coordinate vectors. For \mathbf{v} \in V, write \mathbf{v} = \sum_j \alpha_j \mathbf{b}_j. The coordinate vector is [\mathbf{v}]_{\mathcal{B}} = (\alpha_1, \ldots, \alpha_n)^\top \in \mathbb{R}^n.

The matrix of T in bases (\mathcal{B}, \mathcal{C}). Express each T(\mathbf{b}_j) in the basis \mathcal{C}:

T(\mathbf{b}_j) = \sum_{i=1}^m a_{ij} \mathbf{c}_i

The matrix [T]_{\mathcal{B}}^{\mathcal{C}} = (a_{ij}) \in \mathbb{R}^{m \times n} satisfies:

[T(\mathbf{v})]_{\mathcal{C}} = [T]_{\mathcal{B}}^{\mathcal{C}} \, [\mathbf{v}]_{\mathcal{B}}

The commutative diagram:

  v in V   ------------------T------------------>   T(v) in W
     |                                                  |
     |  take coordinates in B                           |  take coordinates in C
     v                                                  v
  [v]_B   -------[T]_B^C  (matrix mult.)--------->   [T(v)]_C

The matrix [T]_{\mathcal{B}}^{\mathcal{C}} is the bridge between coordinates: it takes \mathcal{B}-coordinates of the input to \mathcal{C}-coordinates of the output.

Example. Let T: \mathbb{R}^2 \to \mathbb{R}^2 be the map T(x, y) = (x + y, x - y). In the non-standard basis \mathcal{B} = \{(1, 1), (1, -1)\}:

T(1,1) = (2, 0) = 1 \cdot (1,1) + 1 \cdot (1,-1), so column 1 is (1, 1)^\top. T(1,-1) = (0, 2) = 1 \cdot (1,1) + (-1) \cdot (1,-1), so column 2 is (1, -1)^\top.

[T]_{\mathcal{B}}^{\mathcal{B}} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

The point: the same map has different matrices in different bases. Choosing a basis that makes the matrix as simple as possible - ideally diagonal, by using a basis of eigenvectors when one exists - is exactly the diagonalization idea.

3.3 The Change-of-Basis Matrix

Definition. Let \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} and \mathcal{B}' = \{\mathbf{b}'_1, \ldots, \mathbf{b}'_n\} be two bases for the same space V. The change-of-basis matrix from \mathcal{B}' to \mathcal{B} is:

P = \begin{bmatrix} [\mathbf{b}'_1]_{\mathcal{B}} & [\mathbf{b}'_2]_{\mathcal{B}} & \cdots & [\mathbf{b}'_n]_{\mathcal{B}} \end{bmatrix}

Each column is the \mathcal{B}-coordinate vector of the corresponding new basis vector.

Key property: If [\mathbf{v}]_{\mathcal{B}'} are the \mathcal{B}'-coordinates of \mathbf{v}, then the \mathcal{B}-coordinates are:

[\mathbf{v}]_{\mathcal{B}} = P \, [\mathbf{v}]_{\mathcal{B}'}

The change-of-basis formula for T. If [T]_{\mathcal{B}} is the matrix of T in the basis \mathcal{B}, and P is the change-of-basis matrix from \mathcal{B}' to \mathcal{B}, then:

[T]_{\mathcal{B}'} = P^{-1} [T]_{\mathcal{B}} P

Derivation:

[T(\mathbf{v})]_{\mathcal{B}'} = P^{-1} [T(\mathbf{v})]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} [\mathbf{v}]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} P [\mathbf{v}]_{\mathcal{B}'}

Worked example. Let T: \mathbb{R}^2 \to \mathbb{R}^2 have matrix A = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} in the standard basis. In the basis \mathcal{B}' = \{(1, 0), (1, 1)\}:

P = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad P^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}

P^{-1} A P = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & -2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix}

In this particular basis the matrix comes out unchanged - not every change of basis simplifies a matrix. The numeric check below confirms the computation.
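
A NumPy sketch of the same computation (illustrative only):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 3.0]])
# Columns of P are the new basis vectors (1,0) and (1,1) in standard coordinates.
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])

print(np.linalg.inv(P) @ A @ P)
# [[3. 1.]
#  [0. 3.]]   -- the matrix of T in the basis B'
```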

3.4 Similarity Transformations

Definition. Two matrices A, B \in \mathbb{R}^{n \times n} are similar (A \sim B) if there exists an invertible P such that B = P^{-1}AP.

Geometrically: A and B represent the same linear map in different bases.

Invariants under similarity (properties that don't change when you change basis):

Invariant | Formula | Meaning
Eigenvalues | \lambda unchanged | Spectrum is basis-independent
Determinant | \det(P^{-1}AP) = \det(A) | Volume scaling is basis-independent
Trace | \operatorname{tr}(P^{-1}AP) = \operatorname{tr}(A) | Sum of eigenvalues
Rank | \operatorname{rank}(P^{-1}AP) = \operatorname{rank}(A) | Dimension of image
Characteristic polynomial | \det(\lambda I - P^{-1}AP) = \det(\lambda I - A) | Entire spectrum

Diagonalization is change of basis. When A is diagonalizable with eigenvectors \mathbf{v}_1, \ldots, \mathbf{v}_n, the matrix P = [\mathbf{v}_1 | \cdots | \mathbf{v}_n] gives P^{-1}AP = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n). We are choosing the basis of eigenvectors, in which the map acts by simple scaling.
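
The invariants in the table can be spot-checked numerically. A NumPy sketch (a random P is generically invertible; comparisons hold up to roundoff):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
P = rng.normal(size=(4, 4))               # generically invertible
B = np.linalg.inv(P) @ A @ P              # similar to A

print(np.isclose(np.trace(A), np.trace(B)))              # True
print(np.isclose(np.linalg.det(A), np.linalg.det(B)))    # True
print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B))   # True
# Same characteristic polynomial (np.poly returns its coefficients).
print(np.allclose(np.poly(A), np.poly(B)))               # True
```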

Forward reference: Eigenvalues and Eigenvectors

The full theory of diagonalization, the spectral theorem, and when a matrix is diagonalizable is developed in 01: Eigenvalues and Eigenvectors. Here we note only that diagonalization is a special case of change of basis.


4. Composition, Invertibility, and Isomorphisms

4.1 Composition is Matrix Multiplication

Let T: U \to V and S: V \to W be linear maps. Their composition S \circ T: U \to W, defined by (S \circ T)(\mathbf{u}) = S(T(\mathbf{u})), is also linear.

Theorem 4.1.1. If T has matrix A (in standard bases) and S has matrix B, then S \circ T has matrix BA.

Proof: (S \circ T)(\mathbf{x}) = S(T(\mathbf{x})) = S(A\mathbf{x}) = B(A\mathbf{x}) = (BA)\mathbf{x}. \square

This is the fundamental theorem connecting composition of functions to matrix multiplication. It explains:

  • Non-commutativity: S \circ T \neq T \circ S in general (different domains/codomains, or BA \neq AB for square matrices).
  • Associativity: (R \circ S) \circ T = R \circ (S \circ T) - matches matrix associativity.
  • Deep learning: A network f = f_L \circ \cdots \circ f_1 is a composition. The forward pass computes this product. The backward pass (backprop) uses the chain rule, which is exactly the composition of Jacobians.

Collapse of linear-only networks. If f_i(\mathbf{x}) = W_i \mathbf{x} (all layers linear, no activations):

f(\mathbf{x}) = W_L \cdots W_2 W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}

where W_{\text{eff}} = W_L \cdots W_1 is a single matrix. No depth benefit without nonlinearity.
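
A NumPy sketch of the collapse (layer sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(32, 16))
W2 = rng.normal(size=(64, 32))
W3 = rng.normal(size=(8, 64))
x = rng.normal(size=16)

# Layer-by-layer forward pass with no activations ...
deep = W3 @ (W2 @ (W1 @ x))
# ... equals a single effective matrix applied once.
W_eff = W3 @ W2 @ W1
print(np.allclose(deep, W_eff @ x))   # True
```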

4.2 Injectivity, Surjectivity, and Bijectivity

Definitions. A linear map T: V \to W is:

  • Injective (one-to-one) if T(\mathbf{u}) = T(\mathbf{v}) \Rightarrow \mathbf{u} = \mathbf{v}
  • Surjective (onto) if for every \mathbf{w} \in W there exists \mathbf{v} \in V with T(\mathbf{v}) = \mathbf{w}
  • Bijective if both injective and surjective

Criteria via rank-nullity:

Property | Condition | Matrix equivalent
Injective | \ker(T) = \{\mathbf{0}\}, i.e., nullity = 0 | Columns of A are linearly independent
Surjective | \operatorname{im}(T) = W, i.e., rank = \dim(W) | Full row rank; the columns of A span \mathbb{R}^m
Bijective | Both; requires \dim(V) = \dim(W) and full rank | A is square and invertible

Dimension constraints:

  • If \dim(V) < \dim(W): T cannot be surjective (rank \leq \dim(V) < \dim(W)).
  • If \dim(V) > \dim(W): T cannot be injective (nullity = \dim(V) - \operatorname{rank} \geq \dim(V) - \dim(W) > 0).
  • If \dim(V) = \dim(W): injective \Leftrightarrow surjective \Leftrightarrow bijective.

4.3 Isomorphisms

Definition. A bijective linear map T: V \to W is called an isomorphism. If such a map exists, V and W are isomorphic, written V \cong W.

Isomorphic spaces are "the same" as vector spaces - they have the same algebraic structure. An isomorphism is a relabeling of elements that preserves all linear operations.

Theorem 4.3.1 (Fundamental Classification). Two finite-dimensional vector spaces over the same field are isomorphic if and only if they have the same dimension.

Consequence: Every n-dimensional vector space over \mathbb{R} is isomorphic to \mathbb{R}^n. The space \mathcal{P}_n of polynomials of degree \leq n (dimension n+1) is isomorphic to \mathbb{R}^{n+1}. The space \mathbb{R}^{2 \times 2} (dimension 4) is isomorphic to \mathbb{R}^4.

For AI: The residual stream of a transformer (a vector in \mathbb{R}^d) is isomorphic to any other d-dimensional space. Features are not inherently "in \mathbb{R}^d" - they live in an abstract vector space, and \mathbb{R}^d is just one coordinate representation. The linear representation hypothesis says this space has meaningful geometric structure independent of the coordinate system.

Properties of isomorphisms:

  • The inverse T^{-1}: W \to V is also an isomorphism.
  • Isomorphisms preserve all vector space structure: subspaces, linear independence, bases, dimension.
  • The isomorphisms from a space to itself (its automorphisms) form a group under composition.

4.4 The Inverse Map

Theorem 4.4.1. If T: V \to W is an isomorphism (bijective linear map), then T^{-1}: W \to V is also a linear map.

Proof: For \mathbf{w}_1, \mathbf{w}_2 \in W, let \mathbf{v}_i = T^{-1}(\mathbf{w}_i), so T(\mathbf{v}_i) = \mathbf{w}_i. Then:

T(a\mathbf{v}_1 + b\mathbf{v}_2) = aT(\mathbf{v}_1) + bT(\mathbf{v}_2) = a\mathbf{w}_1 + b\mathbf{w}_2

So T^{-1}(a\mathbf{w}_1 + b\mathbf{w}_2) = a\mathbf{v}_1 + b\mathbf{v}_2 = aT^{-1}(\mathbf{w}_1) + bT^{-1}(\mathbf{w}_2). \square

Matrix inverse. If T has matrix A (square, full rank), then T^{-1} has matrix A^{-1}.

Left and right inverses for non-square maps. When T: \mathbb{R}^n \to \mathbb{R}^m is injective but not surjective (n < m), there is no full inverse. However:

  • Left inverse: L such that L \circ T = I_V. Exists iff T is injective. Formula: L = (A^\top A)^{-1} A^\top (left pseudo-inverse).
  • Right inverse: R such that T \circ R = I_W. Exists iff T is surjective. Formula: R = A^\top (A A^\top)^{-1} (right pseudo-inverse).
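
A NumPy sketch of both formulas (random Gaussian matrices are generically full rank, so the required inverses exist here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tall matrix: injective (full column rank) but not surjective.
A = rng.normal(size=(5, 3))
L = np.linalg.inv(A.T @ A) @ A.T          # left pseudo-inverse
print(np.allclose(L @ A, np.eye(3)))      # True; note A @ L is NOT the identity on R^5

# Wide matrix: surjective (full row rank) but not injective.
B = rng.normal(size=(3, 5))
R = B.T @ np.linalg.inv(B @ B.T)          # right pseudo-inverse
print(np.allclose(B @ R, np.eye(3)))      # True
```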

Forward reference: Moore-Penrose Pseudo-Inverse

The general pseudo-inverse A^+, defined via the SVD as A^+ = V\Sigma^+ U^\top, handles all cases uniformly and gives the least-squares solution when an exact solution doesn't exist. Full treatment in 02: Singular Value Decomposition.

4.5 The Four Fundamental Subspaces via Linear Maps

Every linear map T: V \to W (with V = \mathbb{R}^n, W = \mathbb{R}^m, and matrix A) defines four fundamental subspaces:

Subspace | Definition | Lives in | Dimension
Column space \operatorname{col}(A) | \operatorname{im}(T) | \mathbb{R}^m | r = \operatorname{rank}(A)
Null space \operatorname{null}(A) | \ker(T) | \mathbb{R}^n | n - r
Row space \operatorname{row}(A) | \operatorname{im}(T^\top) | \mathbb{R}^n | r
Left null space \operatorname{null}(A^\top) | \ker(T^\top) | \mathbb{R}^m | m - r

Orthogonality relations (proven using the dual map - see Section 6):

\operatorname{null}(A) \perp \operatorname{row}(A), \quad \operatorname{null}(A^\top) \perp \operatorname{col}(A)

This splits \mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A) and \mathbb{R}^m = \operatorname{col}(A) \oplus \operatorname{null}(A^\top).

THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                        R^m (codomain)
  +--------------------+              +--------------------+
  | row space          |  ---T--->    | column space       |
  | (dim r)            |              | (dim r)            |
  +--------------------+              +--------------------+
  | null space         |  ---T--->    | left null space    |
  | (dim n-r)          |    {0}       | (dim m-r)          |
  +--------------------+              +--------------------+
   orthogonal complements              orthogonal complements
   within R^n                          within R^m

========================================================================
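
A NumPy sketch that extracts all four subspaces from the SVD and checks the orthogonality relations (the rank-2 construction and the 1e-10 threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 6))   # generically a rank-2, 4x6 matrix
m, n = A.shape

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))

row_space  = Vt[:r].T        # basis of row(A)     in R^n
null_space = Vt[r:].T        # basis of null(A)    in R^n
col_space  = U[:, :r]        # basis of col(A)     in R^m
left_null  = U[:, r:]        # basis of null(A^T)  in R^m

print(np.allclose(row_space.T @ null_space, 0))   # null(A) is orthogonal to row(A)
print(np.allclose(col_space.T @ left_null, 0))    # null(A^T) is orthogonal to col(A)
print(r, n - r, m - r)                            # dimensions: r, n-r, m-r  -> 2 4 2
```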

5. Special Classes of Linear Transformations

5.1 Projection Operators

Definition. A linear map P: V \to V is a projection if it is idempotent: P^2 = P.

Idempotency means "doing it twice is the same as doing it once." Once you project onto a subspace, you're already there - projecting again does nothing.

Theorem 5.1.1. If P is a projection, then:

  • \operatorname{im}(P) = \ker(I - P): the image of P is the fixed-point set.
  • \ker(P) = \operatorname{im}(I - P): the kernel of P is the image of the complementary projection.
  • V = \operatorname{im}(P) \oplus \ker(P) (direct sum decomposition).

Proof of the last claim: Any \mathbf{v} \in V decomposes as \mathbf{v} = P\mathbf{v} + (I-P)\mathbf{v}, where P\mathbf{v} \in \operatorname{im}(P) and (I-P)\mathbf{v} \in \ker(P) (since P(I-P)\mathbf{v} = (P-P^2)\mathbf{v} = \mathbf{0}). Uniqueness follows: if \mathbf{v} = \mathbf{u} + \mathbf{w} with \mathbf{u} \in \operatorname{im}(P) and \mathbf{w} \in \ker(P), then P\mathbf{v} = P\mathbf{u} + P\mathbf{w} = \mathbf{u}. \square

Rank and trace. For a projection, \operatorname{rank}(P) = \operatorname{tr}(P): the eigenvalues are only 0 and 1, and the trace (sum of eigenvalues) counts the 1s, which is the rank.

Orthogonal vs oblique projections. An orthogonal projection additionally satisfies P = P^\top (i.e., P is symmetric). In this case:

  • \ker(P) = \operatorname{im}(P)^\perp - the kernel is the orthogonal complement of the image.
  • Formula: P = A(A^\top A)^{-1}A^\top for projection onto \operatorname{col}(A).

An oblique projection has P^2 = P but P \neq P^\top - it projects along a subspace that is not orthogonal to the target.
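
A NumPy sketch building an orthogonal projection and checking the properties above (the 5x2 shape is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 2))               # generically full column rank

# Orthogonal projection onto col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))              # idempotent: P^2 = P
print(np.allclose(P, P.T))                # symmetric: an orthogonal projection
print(np.isclose(np.trace(P), 2.0))       # rank(P) = tr(P) = dim col(A)

v = rng.normal(size=5)
print(np.allclose(P @ (P @ v), P @ v))    # projecting twice = projecting once
```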

Forward reference: Orthogonal Projections

The full theory of orthogonal projections - including the projection formula, orthogonal decompositions, and the relationship to least squares - is in 05: Orthogonality and Orthonormality.

For AI: Attention heads in transformers apply projections onto query/key/value subspaces. The head-specific projection matrices W_Q, W_K, W_V define low-dimensional subspaces of the residual stream. The attention mechanism then computes weighted combinations within these projected spaces. Projection also appears in layer normalization: the mean-subtraction step is an orthogonal projection onto the subspace orthogonal to the all-ones direction, after which the result is rescaled onto the unit sphere.

5.2 Rotations and Reflections

Orthogonal transformations are linear maps T: \mathbb{R}^n \to \mathbb{R}^n that preserve inner products:

\langle T(\mathbf{u}), T(\mathbf{v}) \rangle = \langle \mathbf{u}, \mathbf{v} \rangle \quad \text{for all } \mathbf{u}, \mathbf{v}

Equivalently, their matrices satisfy A^\top A = I, i.e., A^{-1} = A^\top.

The set of all n \times n orthogonal matrices forms the orthogonal group O(n).

Determinant splits them into two classes:

  • \det(A) = +1: proper rotations - preserve orientation. These form SO(n), the special orthogonal group.
  • \det(A) = -1: improper rotations (reflections, or rotation + reflection) - reverse orientation.

2D rotation by angle \theta:

R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in SO(2)

Properties: R_\theta R_\phi = R_{\theta+\phi}, \quad R_\theta^{-1} = R_{-\theta} = R_\theta^\top.

3D rotations are parametrized by an axis \hat{\mathbf{n}} and angle \theta (Rodrigues' formula):

R = I + \sin(\theta)\, K + (1 - \cos\theta)\, K^2

where K is the skew-symmetric matrix with K\mathbf{v} = \hat{\mathbf{n}} \times \mathbf{v}.

Householder reflections. Given a unit vector \hat{\mathbf{n}}, the reflection through the hyperplane orthogonal to \hat{\mathbf{n}} is:

H = I - 2\hat{\mathbf{n}}\hat{\mathbf{n}}^\top

Properties: H = H^\top, \quad H^2 = I, \quad \det(H) = -1.
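
A NumPy sketch of a Householder reflection and its stated properties (the normal vector is an arbitrary example):

```python
import numpy as np

n = np.array([1.0, 2.0, -2.0])
n = n / np.linalg.norm(n)                 # unit normal of the mirror hyperplane

H = np.eye(3) - 2.0 * np.outer(n, n)

print(np.allclose(H, H.T))                # symmetric
print(np.allclose(H @ H, np.eye(3)))      # involution: H^2 = I
print(np.isclose(np.linalg.det(H), -1.0)) # orientation-reversing
print(np.allclose(H @ n, -n))             # the normal direction flips sign
```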

For AI: Rotary Position Embedding (RoPE), used in LLaMA, GPT-NeoX, and Gemma, encodes positional information via rotation matrices applied blockwise to query and key vectors. The rotation angle depends on position, and relative positions appear as relative rotations, which interact cleanly with the dot-product attention score.

5.3 Shear and Scaling Maps

Diagonal scaling: T(\mathbf{x}) = D\mathbf{x} where D = \operatorname{diag}(d_1, \ldots, d_n). Scales each coordinate independently. Kernel = \{\mathbf{0}\} if all d_i \neq 0. \det(D) = \prod_i d_i.

Shear maps in \mathbb{R}^2: the horizontal shear by factor k is:

S_k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}, \quad S_k(x, y) = (x + ky, y)

\det(S_k) = 1 - shear preserves area. S_k^{-1} = S_{-k}.

Geometrically: the x-axis is fixed; each horizontal line slides by a distance proportional to its height. Parallelograms are tilted but area is preserved.

Elementary matrices (from row operations) are all either shears, row swaps, or scalings:

  • Row scaling by c: E_{ii}(c) - multiply row i by c
  • Row swap: E_{ij} - swap rows i and j
  • Row addition: E_{ij}(c) - add c times row j to row i (this is a shear)

Every invertible matrix is a product of elementary matrices - Gaussian elimination as composition of linear maps.

5.4 The Geometry of Low-Rank Maps

A rank-k linear map T: \mathbb{R}^n \to \mathbb{R}^m with k < n collapses the domain: it maps all of \mathbb{R}^n onto a k-dimensional subspace of \mathbb{R}^m, while sending an (n-k)-dimensional subspace (the null space) to zero.

Decomposition via SVD preview. Any rank-k matrix admits:

A = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

This is a sum of k rank-1 outer products. Each rank-1 term \sigma_i \mathbf{u}_i \mathbf{v}_i^\top maps everything in the direction \mathbf{v}_i to the direction \mathbf{u}_i, scaled by \sigma_i.
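
A NumPy sketch rebuilding a low-rank matrix from its rank-1 SVD terms (the rank-3 construction and 1e-10 threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 8))   # generically a rank-3, 6x8 matrix

U, s, Vt = np.linalg.svd(A)
k = int(np.sum(s > 1e-10))                # numerical rank, here 3

# Rebuild A as a sum of k rank-1 outer products sigma_i * u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
print(k, np.allclose(A, A_rebuilt))       # 3 True
```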

Forward reference: SVD and Low-Rank Approximation

The full theory of SVD - including the Eckart-Young theorem (best rank-k approximation) and the geometric interpretation of singular values - is in 02: Singular Value Decomposition.

LoRA preview. A rank-r update \Delta W = BA (with B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, r \ll \min(d,k)) is a composition of two low-rank linear maps:

  1. A: \mathbb{R}^k \to \mathbb{R}^r compresses the input to r dimensions.
  2. B: \mathbb{R}^r \to \mathbb{R}^d expands back to d dimensions.

The effective map \Delta W = BA has rank \leq r, so it only modifies the network's behavior along r directions in the input space. The hypothesis underlying LoRA is that the task-relevant weight changes lie in a low-dimensional subspace - a statement directly about the geometry of linear maps.
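
A NumPy sketch of the rank bound (the dimensions d, k, r are made-up illustrative values, not from any real model):

```python
import numpy as np

d, k, r = 64, 48, 4
rng = np.random.default_rng(6)
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

delta_W = B @ A
print(np.linalg.matrix_rank(delta_W))          # 4: rank is at most r

# Any direction in ker(A) (dimension k - r = 44) is untouched by the update.
_, _, Vt = np.linalg.svd(A)
v_null = Vt[-1]                                # one direction in ker(A)
print(np.allclose(delta_W @ v_null, 0))        # True
```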


6. Dual Spaces and Transposes

6.1 The Dual Space

Definition. Given a vector space V over \mathbb{F}, the dual space V^* is the set of all linear maps from V to \mathbb{F}:

V^* = \mathcal{L}(V, \mathbb{F}) = \{f: V \to \mathbb{F} \mid f \text{ is linear}\}

Elements of V^* are called linear functionals, covectors, or one-forms.

V^* is itself a vector space under pointwise addition (f+g)(\mathbf{v}) = f(\mathbf{v}) + g(\mathbf{v}) and scalar multiplication (cf)(\mathbf{v}) = c \cdot f(\mathbf{v}).

The dual basis. If \mathcal{B} = \{\mathbf{e}_1, \ldots, \mathbf{e}_n\} is a basis for V, then the dual basis \mathcal{B}^* = \{e^1, \ldots, e^n\} is defined by:

e^i(\mathbf{e}_j) = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}

The dual basis is a basis for V^*, so \dim(V^*) = \dim(V).

Key example: Row vectors. In \mathbb{R}^n, a linear functional is any map f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} for some fixed \mathbf{a} \in \mathbb{R}^n. So (\mathbb{R}^n)^* \cong \mathbb{R}^n, but the identification \mathbf{a}^\top (row vector) \leftrightarrow \mathbf{a} (column vector) is coordinate-dependent. A row vector \mathbf{a}^\top is intrinsically an element of (\mathbb{R}^n)^*, not of \mathbb{R}^n.

Why the distinction matters. When you write \mathbf{a}^\top \mathbf{x}, you're applying the dual vector \mathbf{a}^\top \in (\mathbb{R}^n)^* to the vector \mathbf{x} \in \mathbb{R}^n. This is a pairing between a space and its dual - not a dot product of two vectors in the same space. In Riemannian geometry and general relativity, this distinction is essential. In ML, it matters for understanding gradients.
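
For a concrete (non-standard) basis of \mathbb{R}^n, the dual basis functionals are the rows of the inverse of the basis matrix. A NumPy sketch (the basis vectors are an arbitrary example):

```python
import numpy as np

# A basis of R^2, stored as the columns of M.
b1, b2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
M = np.column_stack([b1, b2])

# The dual basis functionals are the rows of M^{-1}:
# row i applied to b_j gives the Kronecker delta.
M_inv = np.linalg.inv(M)
print(np.allclose(M_inv @ M, np.eye(2)))     # e^i(b_j) = delta_ij

# Applying the dual basis to a vector recovers its coordinates in {b1, b2}.
v = 3.0 * b1 - 2.0 * b2
print(M_inv @ v)                             # [ 3. -2.]
```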

6.2 The Dual Map and the Transpose

Definition. Given T: V \to W, the dual map (or transpose) T^\top: W^* \to V^* is defined by:

(T^\top f)(\mathbf{v}) = f(T(\mathbf{v})) \quad \text{for } f \in W^*, \; \mathbf{v} \in V

In words: to apply T^\top f to \mathbf{v}, first apply T to \mathbf{v}, then apply f to the result.

Matrix of the dual map. If T has matrix A in standard coordinates, then T^\top has matrix A^\top (the matrix transpose). The notation "T^\top" for the dual map is intentional and consistent.

Domain and codomain switch. T: V \to W means T^\top: W^* \to V^*. The transpose reverses the direction. This is why:

  • Composition reverses order: (ST)^\top = T^\top S^\top.
  • In backpropagation: if the forward pass multiplies by W, the backward pass multiplies by W^\top.

Kernel of the transpose = left null space:

\ker(T^\top) = \{\mathbf{w} \in W : T^\top \mathbf{w} = \mathbf{0}\} = \operatorname{null}(A^\top)

This is the left null space of A, which is orthogonal to \operatorname{col}(A).

Annihilators. The annihilator of a subspace S \subseteq V is S^0 = \{f \in V^* : f(\mathbf{s}) = 0 \text{ for all } \mathbf{s} \in S\}. Key identities:

\ker(T^\top) = (\operatorname{im}(T))^0, \qquad \operatorname{im}(T^\top) = (\ker(T))^0

6.3 Gradients Live in the Dual Space

This connection between linear maps and gradients is one of the deepest in all of applied mathematics, and is directly relevant to understanding backpropagation.

The gradient as a linear functional. For a differentiable function f: \mathbb{R}^n \to \mathbb{R}, the derivative at \mathbf{x} is a linear functional Df_{\mathbf{x}}: \mathbb{R}^n \to \mathbb{R}, defined by:

Df_{\mathbf{x}}(\mathbf{h}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x})^\top \mathbf{h}

The derivative Df_{\mathbf{x}} is an element of (\mathbb{R}^n)^* - a covector, represented as a row vector \nabla f(\mathbf{x})^\top.

The gradient vector \nabla f(\mathbf{x}) \in \mathbb{R}^n is obtained by identifying (\mathbb{R}^n)^* with \mathbb{R}^n via the standard inner product. This identification is coordinate-dependent: on a curved manifold (like a constraint surface or the space of probability distributions), gradients and vectors must be treated differently.

Backpropagation via dual maps. Consider a network with one linear layer \mathbf{y} = W\mathbf{x} and loss \ell = L(\mathbf{y}). The gradient with respect to \mathbf{x} is:

\frac{\partial \ell}{\partial \mathbf{x}} = W^\top \frac{\partial \ell}{\partial \mathbf{y}}

This is exactly the dual map T^\top applied to the incoming gradient \frac{\partial \ell}{\partial \mathbf{y}}. Backpropagation propagates gradients backward through the transpose of each weight matrix. The chain of transposes in the backward pass is the dual map composition:

(W_L \cdots W_1)^\top = W_1^\top \cdots W_L^\top
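
A NumPy sketch of the single-layer case, checking the "multiply by W^\top" rule against a finite-difference gradient (the loss L(\mathbf{y}) = \lVert\mathbf{y}\rVert^2 and all sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)

def loss(x):
    y = W @ x
    return np.sum(y ** 2)              # L(y) = ||y||^2, so dL/dy = 2y

# Backprop: dL/dx = W^T (dL/dy)
g_y = 2.0 * (W @ x)
g_x = W.T @ g_y

# Compare against a central finite-difference approximation of the gradient.
eps = 1e-6
g_num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(5)])
print(np.allclose(g_x, g_num, atol=1e-4))      # True
```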

For AI: Modern deep learning frameworks (PyTorch, JAX) compute gradients using automatic differentiation, which is precisely the evaluation of dual maps via the chain rule. Every .backward() call accumulates contributions via transpose weight matrices - it is applying the adjoint (dual) of the forward linear map.

