Linear Transformations, Part 1: Intuition (Section 1) to Dual Spaces and Transposes (Section 6)

1. Intuition

1.1 Three Views of a Matrix

A matrix A \in \mathbb{R}^{m \times n} admits three distinct but equivalent interpretations, and fluent practitioners shift between them effortlessly.

View 1: Data container. The matrix is a rectangular array of numbers - m rows, n columns. This is the computational view. We use it when we care about entries: A_{ij}, the element in row i, column j.

View 2: Column geometry. The matrix is a collection of n column vectors in \mathbb{R}^m. The product A\mathbf{x} is then a linear combination: A\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n, where \mathbf{a}_i is the i-th column. This view makes the column space visible and illuminates when A\mathbf{x} = \mathbf{b} has solutions.

View 3: Linear transformation. The matrix defines a function T: \mathbb{R}^n \to \mathbb{R}^m by T(\mathbf{x}) = A\mathbf{x}. This is the abstract, coordinate-free view. The map T exists independently of any coordinate representation - the matrix A is merely its description in a particular pair of bases (the standard bases of \mathbb{R}^n and \mathbb{R}^m).

THREE VIEWS OF THE MATRIX A
========================================================================

  [2  1]         Column geometry:          Linear map:
  [0  3]         A = [a_1 | a_2]           T: R^2 -> R^3
  [1 -1]                                   T(x) = Ax

  Data container      Span{a_1, a_2}       Sends basis vectors
  A_21 = 0             = column space      e_1 -> col 1 of A
  A_32 = -1                                e_2 -> col 2 of A

  All three describe the same mathematical object.

========================================================================
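
The following is a minimal NumPy sketch (illustrative, not part of the lesson) that exercises all three views on the 3 x 2 matrix from the diagram:

```python
import numpy as np

# The 3x2 matrix from the diagram above.
A = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, -1.0]])
x = np.array([2.0, -1.0])

# View 1: data container -- individual entries (0-indexed here).
print(A[1, 0], A[2, 1])          # 0.0 -1.0

# View 2: column geometry -- Ax is a combination of the columns.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

# View 3: linear map -- apply T(x) = Ax directly.
Tx = A @ x

print(np.allclose(combo, Tx))    # True
```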

For AI: In a transformer, the weight matrix W_Q \in \mathbb{R}^{d_k \times d} is simultaneously all three: raw parameters to be optimized (View 1), a basis for the query subspace (View 2), and a linear projection from the residual stream to query space (View 3). Understanding which view you're using prevents confusion about what "the attention head is doing."

1.2 Geometric Action of Linear Maps

What does T(\mathbf{x}) = A\mathbf{x} do to space? The two axioms - additivity and homogeneity - impose strong geometric constraints:

  1. The origin is fixed. T(\mathbf{0}) = A\mathbf{0} = \mathbf{0} always. A linear map cannot translate; it cannot move the origin. This is the most important geometric fact about linear transformations.

  2. Lines through the origin stay lines. If \mathbf{x}(t) = t\mathbf{v} is a line through the origin, then T(\mathbf{x}(t)) = tT(\mathbf{v}) is a line through the origin in the output space.

  3. Parallel lines stay parallel. If \mathbf{y}_1 - \mathbf{y}_2 = \mathbf{v} (same direction), then T(\mathbf{y}_1) - T(\mathbf{y}_2) = T(\mathbf{v}) (still the same direction).

  4. Grid lines go to grid lines, equally spaced. This is the classic 3Blue1Brown visualization: apply A to the integer grid of \mathbb{R}^2 and you get a (possibly skewed) grid in \mathbb{R}^2.

  5. The unit square maps to a parallelogram. The parallelogram spanned by the columns of A is the image of the unit square under T.

What can a linear map do? Depending on the matrix:

  • Rotate space (orthogonal matrix, \det = 1)
  • Reflect space (orthogonal matrix, \det = -1)
  • Scale dimensions (diagonal matrix)
  • Shear space (triangular matrix)
  • Project onto a subspace (idempotent: P^2 = P)
  • Collapse space to a lower dimension (rank-deficient matrix)

What a linear map cannot do: translate (no origin-shifting), apply nonlinear distortion (no bending, folding, or curving of lines).

1.3 Why Linearity is Special

The superposition principle is the reason linear algebra is tractable. For a linear map T:

T\!\left(\sum_{i=1}^n c_i \mathbf{v}_i\right) = \sum_{i=1}^n c_i T(\mathbf{v}_i)

This means: to know T on all of V, you only need to know T on a basis. A basis has \dim(V) elements. The map on an infinite space is encoded in finitely many column vectors. This is extraordinary compression: infinite -> finite.

Linearity vs nonlinearity in neural networks. A deep neural network with no nonlinearities is just a single linear transformation: W_L W_{L-1} \cdots W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}. Multiple linear layers collapse to one. The nonlinearities (ReLU, GELU, SiLU) are what allow composition to create genuinely new representational power. The linear layers provide the parameterized directions; the nonlinear activations provide the expressive capacity.

Linearity in analysis. Many of the hardest problems in mathematics and ML become tractable when restricted to linear functions: linear regression has a closed-form solution, linear systems have complete theory, spectral analysis of linear operators is well-understood. The strategy of "linearize, solve, interpret" recurs throughout calculus, optimization, and signal processing.

For AI: The linear representation hypothesis (Elhage et al. 2022, Park et al. 2023) conjectures that many high-level features in LLMs are encoded as directions in representation space - i.e., they are linear features. If true, this means the crucial structure of LLM computation is linear, and all the machinery of this section applies directly to understanding what models are doing.

1.4 Historical Timeline

Year | Person | Contribution
1844 | Hermann Grassmann | Die lineale Ausdehnungslehre - first abstract treatment of linear spaces
1855 | Arthur Cayley | Matrix algebra as formal system; composition of matrices
1888 | Giuseppe Peano | First rigorous axiomatization of vector spaces
1902 | Henri Lebesgue | Integration as a linear functional on function spaces
1904-1910 | Hilbert, Riesz, Fischer | Infinite-dimensional linear algebra; Hilbert spaces; spectral theory
1929 | John von Neumann | Operator theory; linear transformations on Hilbert spaces
1940s | Alan Turing, von Neumann | Linear algebra for numerical computation; matrix algorithms
1986 | Rumelhart, Hinton, Williams | Backpropagation - chains of Jacobians as the training algorithm
2017 | Vaswani et al. | Attention = linear projections + scaled dot product; transformers
2021 | Hu et al. (LoRA) | Low-rank linear map updates for efficient fine-tuning
2022- | Elhage, Park et al. | Linear representation hypothesis; mechanistic interpretability

2. Formal Definitions

2.1 The Two Axioms and All Their Consequences

Definition (Linear Transformation). Let V and W be vector spaces over the same field \mathbb{F} (typically \mathbb{R} or \mathbb{C}). A function T: V \to W is a linear transformation (also called a linear map or homomorphism) if it satisfies:

  1. Additivity: T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) for all \mathbf{u}, \mathbf{v} \in V
  2. Homogeneity: T(c\mathbf{v}) = cT(\mathbf{v}) for all \mathbf{v} \in V, c \in \mathbb{F}

These two axioms are often combined into the single condition:

T(a\mathbf{u} + b\mathbf{v}) = aT(\mathbf{u}) + bT(\mathbf{v}) \quad \text{for all } \mathbf{u}, \mathbf{v} \in V, \; a, b \in \mathbb{F}
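
The combined condition also suggests a quick numerical sanity check. The sketch below (illustrative only; the helper name looks_linear is made up) tests it on random inputs - a pass is evidence, not a proof, of linearity:

```python
import numpy as np

def looks_linear(T, n, trials=100, seed=0):
    """Test T(a*u + b*v) == a*T(u) + b*T(v) on random vectors and scalars."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=n), rng.normal(size=n)
        a, b = rng.normal(size=2)
        if not np.allclose(T(a * u + b * v), a * T(u) + b * T(v)):
            return False
    return True

A = np.array([[1.0, 2.0], [0.0, -1.0]])
print(looks_linear(lambda x: A @ x, 2))       # True: matrix multiplication
print(looks_linear(lambda x: x + 1.0, 2))     # False: translation fails the zero test
```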

Immediate consequences (each follows directly from the axioms):

Proposition 2.1.1 (Zero maps to zero). T(\mathbf{0}_V) = \mathbf{0}_W.

Proof: T(\mathbf{0}) = T(0 \cdot \mathbf{v}) = 0 \cdot T(\mathbf{v}) = \mathbf{0} for any \mathbf{v} \in V. \square

This is a universal test: if T(\mathbf{0}) \neq \mathbf{0}, then T is not linear. Translation (T(\mathbf{x}) = \mathbf{x} + \mathbf{b} with \mathbf{b} \neq \mathbf{0}) fails immediately.

Proposition 2.1.2 (Negatives are preserved). T(-\mathbf{v}) = -T(\mathbf{v}).

Proof: T(-\mathbf{v}) = T((-1)\mathbf{v}) = (-1)T(\mathbf{v}) = -T(\mathbf{v}). \square

Proposition 2.1.3 (General linear combinations). For any \mathbf{v}_1, \ldots, \mathbf{v}_k \in V and c_1, \ldots, c_k \in \mathbb{F}:

T\!\left(\sum_{i=1}^k c_i \mathbf{v}_i\right) = \sum_{i=1}^k c_i T(\mathbf{v}_i)

Proof: By induction using additivity and homogeneity. \square

Proposition 2.1.4 (Determined by basis images). If \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} is a basis for V, then T is completely determined by T(\mathbf{b}_1), \ldots, T(\mathbf{b}_n). Moreover, for any choice of \mathbf{w}_1, \ldots, \mathbf{w}_n \in W, there exists a unique linear map T with T(\mathbf{b}_i) = \mathbf{w}_i.

This is the fundamental construction theorem. It means the matrix of T (in standard coordinates) is:

A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

The set of all linear maps T: V \to W is denoted \mathcal{L}(V, W) or \operatorname{Hom}(V, W). It is itself a vector space under pointwise operations: (S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v}) and (cT)(\mathbf{v}) = cT(\mathbf{v}).

2.2 Kernel of a Linear Map

Definition (Kernel). The kernel (or null space) of a linear map T: V \to W is:

\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}_W\}

It is the set of all inputs that T maps to zero - the "lost information" of the map.

Theorem 2.2.1. \ker(T) is a subspace of V.

Proof:

  • Zero: T(\mathbf{0}) = \mathbf{0}, so \mathbf{0} \in \ker(T).
  • Closure under addition: If T(\mathbf{u}) = T(\mathbf{v}) = \mathbf{0}, then T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}.
  • Closure under scaling: If T(\mathbf{v}) = \mathbf{0}, then T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}. \square

Geometric meaning. \ker(T) is the subspace that T "collapses to zero." If T is the projection onto the xy-plane, then \ker(T) is the z-axis. If T is differentiation of polynomials, \ker(T) is the constant polynomials.

Theorem 2.2.2 (Injectivity criterion). T is injective (one-to-one) if and only if \ker(T) = \{\mathbf{0}\}.

Proof. (\Rightarrow) If T is injective and T(\mathbf{v}) = \mathbf{0} = T(\mathbf{0}), then \mathbf{v} = \mathbf{0}. (\Leftarrow) If \ker(T) = \{\mathbf{0}\} and T(\mathbf{u}) = T(\mathbf{v}), then T(\mathbf{u} - \mathbf{v}) = \mathbf{0}, so \mathbf{u} - \mathbf{v} = \mathbf{0}, i.e., \mathbf{u} = \mathbf{v}. \square

The dimension of the kernel, \dim(\ker(T)), is called the nullity of T.

2.3 Image of a Linear Map

Definition (Image). The image (or range) of a linear map T: V \to W is:

\operatorname{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\} = T(V)

It is the set of all possible outputs - the "reachable" part of W.

Theorem 2.3.1. \operatorname{im}(T) is a subspace of W.

Proof:

  • Zero: T(\mathbf{0}) = \mathbf{0} \in \operatorname{im}(T).
  • Closure under addition: T(\mathbf{u}) + T(\mathbf{v}) = T(\mathbf{u} + \mathbf{v}) \in \operatorname{im}(T).
  • Closure under scaling: cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T). \square

Theorem 2.3.2. \operatorname{im}(T) is the column space of the matrix A representing T.

Proof: T(\mathbf{x}) = A\mathbf{x} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n, which is exactly the span of the columns of A. \square

Theorem 2.3.3 (Surjectivity criterion). T: V \to W is surjective (onto) if and only if \operatorname{im}(T) = W.

The dimension of the image, \dim(\operatorname{im}(T)), is called the rank of T, denoted \operatorname{rank}(T).

2.4 The Rank-Nullity Theorem

This is one of the most elegant and useful results in linear algebra, connecting the three fundamental dimensions of a linear map.

Theorem 2.4.1 (Rank-Nullity Theorem). Let T: V \to W be a linear map with V finite-dimensional. Then:

\dim(V) = \dim(\ker(T)) + \dim(\operatorname{im}(T)) = \operatorname{nullity}(T) + \operatorname{rank}(T)

Proof. Let \{\mathbf{k}_1, \ldots, \mathbf{k}_p\} be a basis for \ker(T) (so p = \operatorname{nullity}(T)). Extend this to a basis for all of V: \{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\}, where p + q = \dim(V).

We claim \{T(\mathbf{b}_1), \ldots, T(\mathbf{b}_q)\} is a basis for \operatorname{im}(T).

Spanning: For any \mathbf{w} \in \operatorname{im}(T), write \mathbf{w} = T(\mathbf{v}) for some \mathbf{v} = \sum c_i \mathbf{k}_i + \sum d_j \mathbf{b}_j. Then \mathbf{w} = T(\mathbf{v}) = \sum c_i T(\mathbf{k}_i) + \sum d_j T(\mathbf{b}_j) = \sum d_j T(\mathbf{b}_j), since each T(\mathbf{k}_i) = \mathbf{0}.

Linear independence: If \sum d_j T(\mathbf{b}_j) = \mathbf{0}, then T(\sum d_j \mathbf{b}_j) = \mathbf{0}, so \sum d_j \mathbf{b}_j \in \ker(T) = \operatorname{span}\{\mathbf{k}_1, \ldots, \mathbf{k}_p\}. But \{\mathbf{k}_1, \ldots, \mathbf{k}_p, \mathbf{b}_1, \ldots, \mathbf{b}_q\} is linearly independent, so all d_j = 0. \square

Intuition: The rank-nullity theorem says T "uses" its input dimensions in two ways: some dimensions are collapsed to zero (nullity), and the rest are faithfully transmitted to the output (rank). Total in = lost + kept.

RANK-NULLITY THEOREM
========================================================================

  T: V -----------------------------------> W
     |
     +-- ker(T)  ---> {0}            (nullity: collapsed dimensions)
     |
     +-- "rest"  ---> im(T) in W     (rank: transmitted dimensions)

  dim(V)    =    nullity(T)      +      rank(T)
  [total]      [collapsed dims]     [surviving dims]

  Example: T: R^5 -> R^3 with rank 2
  -> nullity = 5 - 2 = 3
  -> 3 dimensions collapse, 2 survive

========================================================================

Example. T: \mathbb{R}^4 \to \mathbb{R}^3 defined by A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. Rank = 3 (full row rank). Nullity = 4 - 3 = 1. The null space is \ker(T) = \operatorname{span}\{(-1, -1, 1, 0)^\top\}.
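
A quick numerical confirmation of this example (a NumPy sketch; the 1e-12 threshold is an arbitrary tolerance for "nonzero singular value"):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# Rank and nullity via the SVD: rank = number of nonzero singular values.
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
nullity = A.shape[1] - rank
print(rank, nullity, rank + nullity == A.shape[1])       # 3 1 True

# Rows of Vt beyond the rank span ker(A); here that is a single direction.
k = Vt[rank:].T                                          # shape (4, 1)
print(np.allclose(A @ k, 0))                             # True
print(np.allclose(A @ np.array([-1.0, -1.0, 1.0, 0.0]), 0))   # True
```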

For AI: In LoRA, a weight update \Delta W = BA with B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k} has rank at most r. By rank-nullity, its kernel has dimension at least k - r. When r \ll k, the update leaves most of the input space unchanged - it only "speaks to" an r-dimensional subspace.

2.5 Examples and Non-Examples

Linear transformations:

Map | Domain -> Codomain | Kernel | Image
T(\mathbf{x}) = A\mathbf{x} (matrix mult.) | \mathbb{R}^n \to \mathbb{R}^m | null space of A | column space of A
T(f) = f' (differentiation) | \mathcal{P}_n \to \mathcal{P}_{n-1} | constants | all polynomials of degree \leq n-1
T(f) = \int_0^x f(t)\,dt (integration) | \mathcal{C}[0,1] \to \mathcal{C}[0,1] | \{\mathbf{0}\} | functions vanishing at 0
T(\mathbf{x}) = \mathbf{0} (zero map) | V \to W | all of V | \{\mathbf{0}\}
T(\mathbf{x}) = \mathbf{x} (identity) | V \to V | \{\mathbf{0}\} | all of V
T(x, y) = (x, 0) (projection) | \mathbb{R}^2 \to \mathbb{R}^2 | y-axis | x-axis
T(A) = \operatorname{tr}(A) (trace) | \mathbb{R}^{n\times n} \to \mathbb{R} | trace-zero matrices | \mathbb{R}

Non-linear maps (and why they fail):

Map | Linearity failure | Test
T(\mathbf{x}) = \mathbf{x} + \mathbf{b} (\mathbf{b} \neq \mathbf{0}) | T(\mathbf{0}) = \mathbf{b} \neq \mathbf{0} | Zero test
T(x) = x^2 | T(1+1) = 4 \neq T(1) + T(1) = 2 | Additivity
T(\mathbf{x}) = \lVert\mathbf{x}\rVert (norm) | T(-\mathbf{e}_1) = 1 \neq -T(\mathbf{e}_1) = -1 (fails for negative scalars) | Homogeneity
T(\mathbf{x}) = \operatorname{softmax}(\mathbf{x}) | T(2\mathbf{x}) \neq 2T(\mathbf{x}); softmax outputs always sum to 1 | Homogeneity
T(\mathbf{x}) = \operatorname{ReLU}(\mathbf{x}) | T(\mathbf{u} + \mathbf{v}) \neq T(\mathbf{u}) + T(\mathbf{v}) in general | Additivity
T(\mathbf{x}) = \mathbf{x} \odot \mathbf{x} (elementwise square) | T(c\mathbf{x}) = c^2\, \mathbf{x} \odot \mathbf{x} \neq c\,(\mathbf{x} \odot \mathbf{x}) | Homogeneity

Note on ReLU: Though ReLU is not linear, it is piecewise linear - linear on each orthant. This means neural networks with ReLU are piecewise linear functions, which is a key fact for understanding their behavior.
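
The failures in the table are easy to exhibit numerically. A short NumPy sketch (illustrative values chosen arbitrarily):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
u = np.array([1.0, -2.0])
v = np.array([-3.0, 5.0])

# ReLU fails additivity:
print(relu(u + v), relu(u) + relu(v))              # [0. 3.]  vs  [1. 5.]

# The elementwise square fails homogeneity:
sq = lambda x: x * x
print(sq(2 * u), 2 * sq(u))                        # [ 4. 16.]  vs  [ 2.  8.]

# A matrix map passes both (up to floating point):
A = np.array([[1.0, 2.0], [0.0, -1.0]])
print(np.allclose(A @ (u + v), A @ u + A @ v))     # True
print(np.allclose(A @ (3.0 * u), 3.0 * (A @ u)))   # True
```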


3. Matrix Representation of a Linear Map

3.1 The Standard Basis Construction

Over \mathbb{R}^n and \mathbb{R}^m with standard bases, every linear map T: \mathbb{R}^n \to \mathbb{R}^m corresponds to a unique matrix A \in \mathbb{R}^{m \times n}.

Construction. The j-th column of A is T(\mathbf{e}_j), the image of the j-th standard basis vector:

A = \begin{bmatrix} T(\mathbf{e}_1) & T(\mathbf{e}_2) & \cdots & T(\mathbf{e}_n) \end{bmatrix}

Why this works: For any \mathbf{x} = \sum_{j=1}^n x_j \mathbf{e}_j:

T(\mathbf{x}) = T\!\left(\sum_j x_j \mathbf{e}_j\right) = \sum_j x_j T(\mathbf{e}_j) = A\mathbf{x}

The matrix is the complete, coordinate-encoded description of T.

Example. Find the matrix for the 2D counterclockwise rotation by angle \theta.

T(\mathbf{e}_1) = (\cos\theta, \sin\theta)^\top and T(\mathbf{e}_2) = (-\sin\theta, \cos\theta)^\top, giving:

R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

Example. Find the matrix for differentiation D: \mathcal{P}_3 \to \mathcal{P}_2 using the bases \{1, x, x^2, x^3\} and \{1, x, x^2\}.

D(1) = 0, D(x) = 1, D(x^2) = 2x, D(x^3) = 3x^2. In the output basis:

[D]_{\mathcal{B}}^{\mathcal{C}} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}

This is a 3 \times 4 matrix, reflecting D: \mathcal{P}_3 \to \mathcal{P}_2.
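
A minimal NumPy sketch (coefficient vectors store the constant term first; illustrative only) applying this differentiation matrix to a concrete polynomial:

```python
import numpy as np

# Matrix of D: P_3 -> P_2 in the monomial bases {1, x, x^2, x^3} and {1, x, x^2}.
D = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])

# p(x) = 5 - x + 2x^2 + 4x^3, stored as coefficients (constant term first).
p = np.array([5.0, -1.0, 2.0, 4.0])

# p'(x) = -1 + 4x + 12x^2
print(D @ p)        # [-1.  4. 12.]
```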

3.2 Representation in Arbitrary Bases

When V and W have non-standard bases, the matrix of T depends on those bases.

Setup. Let \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} be a basis for V and \mathcal{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_m\} be a basis for W.

Coordinate vectors. For \mathbf{v} \in V, write \mathbf{v} = \sum_j \alpha_j \mathbf{b}_j. The coordinate vector is [\mathbf{v}]_{\mathcal{B}} = (\alpha_1, \ldots, \alpha_n)^\top \in \mathbb{R}^n.

The matrix of T in bases (\mathcal{B}, \mathcal{C}). Express each T(\mathbf{b}_j) in the basis \mathcal{C}:

T(\mathbf{b}_j) = \sum_{i=1}^m a_{ij} \mathbf{c}_i

The matrix [T]_{\mathcal{B}}^{\mathcal{C}} = (a_{ij}) \in \mathbb{R}^{m \times n} satisfies:

[T(\mathbf{v})]_{\mathcal{C}} = [T]_{\mathcal{B}}^{\mathcal{C}} \, [\mathbf{v}]_{\mathcal{B}}

The commutative diagram:

  v in V   ------------------T------------------>   T(v) in W
     |                                                  |
     |  take coordinates in B                           |  take coordinates in C
     v                                                  v
  [v]_B   -------[T]_B^C  (matrix mult.)--------->   [T(v)]_C

The matrix [T]_{\mathcal{B}}^{\mathcal{C}} is the bridge between coordinates: it takes \mathcal{B}-coordinates of the input to \mathcal{C}-coordinates of the output.

Example. Let T: \mathbb{R}^2 \to \mathbb{R}^2 be the map T(x, y) = (x + y, x - y). In the non-standard basis \mathcal{B} = \{(1, 1), (1, -1)\}:

T(1,1) = (2, 0) = 1 \cdot (1,1) + 1 \cdot (1,-1), so column 1 is (1, 1)^\top. T(1,-1) = (0, 2) = 1 \cdot (1,1) + (-1) \cdot (1,-1), so column 2 is (1, -1)^\top.

[T]_{\mathcal{B}}^{\mathcal{B}} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

The point: the same map has different matrices in different bases. Choosing a basis that makes the matrix as simple as possible - ideally diagonal, by using a basis of eigenvectors when one exists - is exactly the diagonalization idea.

3.3 The Change-of-Basis Matrix

Definition. Let \mathcal{B} = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} and \mathcal{B}' = \{\mathbf{b}'_1, \ldots, \mathbf{b}'_n\} be two bases for the same space V. The change-of-basis matrix from \mathcal{B}' to \mathcal{B} is:

P = \begin{bmatrix} [\mathbf{b}'_1]_{\mathcal{B}} & [\mathbf{b}'_2]_{\mathcal{B}} & \cdots & [\mathbf{b}'_n]_{\mathcal{B}} \end{bmatrix}

Each column is the \mathcal{B}-coordinate vector of the corresponding new basis vector.

Key property: If [\mathbf{v}]_{\mathcal{B}'} are the \mathcal{B}'-coordinates of \mathbf{v}, then the \mathcal{B}-coordinates are:

[\mathbf{v}]_{\mathcal{B}} = P \, [\mathbf{v}]_{\mathcal{B}'}

The change-of-basis formula for T. If [T]_{\mathcal{B}} is the matrix of T in the basis \mathcal{B}, and P is the change-of-basis matrix from \mathcal{B}' to \mathcal{B}, then:

[T]_{\mathcal{B}'} = P^{-1} [T]_{\mathcal{B}} P

Derivation:

[T(\mathbf{v})]_{\mathcal{B}'} = P^{-1} [T(\mathbf{v})]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} [\mathbf{v}]_{\mathcal{B}} = P^{-1} [T]_{\mathcal{B}} P [\mathbf{v}]_{\mathcal{B}'}

Worked example. Let T: \mathbb{R}^2 \to \mathbb{R}^2 have matrix A = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} in the standard basis. In the basis \mathcal{B}' = \{(1, 0), (1, 1)\}:

P = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad P^{-1} = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}

P^{-1} A P = \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & -2 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 1 \\ 0 & 3 \end{bmatrix}

In this particular basis the matrix comes out unchanged - not every change of basis simplifies a matrix. The numeric check below confirms the computation.
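
A NumPy sketch of the same computation (illustrative only):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 3.0]])
# Columns of P are the new basis vectors (1,0) and (1,1) in standard coordinates.
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])

print(np.linalg.inv(P) @ A @ P)
# [[3. 1.]
#  [0. 3.]]   -- the matrix of T in the basis B'
```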

3.4 Similarity Transformations

Definition. Two matrices A, B \in \mathbb{R}^{n \times n} are similar (A \sim B) if there exists an invertible P such that B = P^{-1}AP.

Geometrically: A and B represent the same linear map in different bases.

Invariants under similarity (properties that don't change when you change basis):

Invariant | Formula | Meaning
Eigenvalues | \lambda unchanged | Spectrum is basis-independent
Determinant | \det(P^{-1}AP) = \det(A) | Volume scaling is basis-independent
Trace | \operatorname{tr}(P^{-1}AP) = \operatorname{tr}(A) | Sum of eigenvalues
Rank | \operatorname{rank}(P^{-1}AP) = \operatorname{rank}(A) | Dimension of image
Characteristic polynomial | \det(\lambda I - P^{-1}AP) = \det(\lambda I - A) | Entire spectrum

Diagonalization is change of basis. When A is diagonalizable with eigenvectors \mathbf{v}_1, \ldots, \mathbf{v}_n, the matrix P = [\mathbf{v}_1 | \cdots | \mathbf{v}_n] gives P^{-1}AP = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n). We are choosing the basis of eigenvectors, in which the map acts by simple scaling.
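
The invariants in the table can be spot-checked numerically. A NumPy sketch (a random P is generically invertible; comparisons hold up to roundoff):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
P = rng.normal(size=(4, 4))               # generically invertible
B = np.linalg.inv(P) @ A @ P              # similar to A

print(np.isclose(np.trace(A), np.trace(B)))              # True
print(np.isclose(np.linalg.det(A), np.linalg.det(B)))    # True
print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B))   # True
# Same characteristic polynomial (np.poly returns its coefficients).
print(np.allclose(np.poly(A), np.poly(B)))               # True
```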

Forward reference: Eigenvalues and Eigenvectors

The full theory of diagonalization, the spectral theorem, and when a matrix is diagonalizable is developed in 01: Eigenvalues and Eigenvectors. Here we note only that diagonalization is a special case of change of basis.


4. Composition, Invertibility, and Isomorphisms

4.1 Composition is Matrix Multiplication

Let T: U \to V and S: V \to W be linear maps. Their composition S \circ T: U \to W, defined by (S \circ T)(\mathbf{u}) = S(T(\mathbf{u})), is also linear.

Theorem 4.1.1. If T has matrix A (in standard bases) and S has matrix B, then S \circ T has matrix BA.

Proof: (S \circ T)(\mathbf{x}) = S(T(\mathbf{x})) = S(A\mathbf{x}) = B(A\mathbf{x}) = (BA)\mathbf{x}. \square

This is the fundamental theorem connecting composition of functions to matrix multiplication. It explains:

  • Non-commutativity: S \circ T \neq T \circ S in general (different domains/codomains, or BA \neq AB for square matrices).
  • Associativity: (R \circ S) \circ T = R \circ (S \circ T) - matches matrix associativity.
  • Deep learning: A network f = f_L \circ \cdots \circ f_1 is a composition. The forward pass computes this product. The backward pass (backprop) uses the chain rule, which is exactly the composition of Jacobians.

Collapse of linear-only networks. If f_i(\mathbf{x}) = W_i \mathbf{x} (all layers linear, no activations):

f(\mathbf{x}) = W_L \cdots W_2 W_1 \mathbf{x} = W_{\text{eff}} \mathbf{x}

where W_{\text{eff}} = W_L \cdots W_1 is a single matrix. No depth benefit without nonlinearity.
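
A NumPy sketch of the collapse (layer sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(32, 16))
W2 = rng.normal(size=(64, 32))
W3 = rng.normal(size=(8, 64))
x = rng.normal(size=16)

# Layer-by-layer forward pass with no activations ...
deep = W3 @ (W2 @ (W1 @ x))
# ... equals a single effective matrix applied once.
W_eff = W3 @ W2 @ W1
print(np.allclose(deep, W_eff @ x))   # True
```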

4.2 Injectivity, Surjectivity, and Bijectivity

Definitions. A linear map T: V \to W is:

  • Injective (one-to-one) if T(\mathbf{u}) = T(\mathbf{v}) \Rightarrow \mathbf{u} = \mathbf{v}
  • Surjective (onto) if for every \mathbf{w} \in W there exists \mathbf{v} \in V with T(\mathbf{v}) = \mathbf{w}
  • Bijective if both injective and surjective

Criteria via rank-nullity:

Property | Condition | Matrix equivalent
Injective | \ker(T) = \{\mathbf{0}\}, i.e., nullity = 0 | Columns of A are linearly independent
Surjective | \operatorname{im}(T) = W, i.e., rank = \dim(W) | Full row rank; the columns of A span \mathbb{R}^m
Bijective | Both; requires \dim(V) = \dim(W) and full rank | A is square and invertible

Dimension constraints:

  • If \dim(V) < \dim(W): T cannot be surjective (rank \leq \dim(V) < \dim(W)).
  • If \dim(V) > \dim(W): T cannot be injective (nullity = \dim(V) - \operatorname{rank} \geq \dim(V) - \dim(W) > 0).
  • If \dim(V) = \dim(W): injective \Leftrightarrow surjective \Leftrightarrow bijective.

4.3 Isomorphisms

Definition. A bijective linear map T: V \to W is called an isomorphism. If such a map exists, V and W are isomorphic, written V \cong W.

Isomorphic spaces are "the same" as vector spaces - they have the same algebraic structure. An isomorphism is a relabeling of elements that preserves all linear operations.

Theorem 4.3.1 (Fundamental Classification). Two finite-dimensional vector spaces over the same field are isomorphic if and only if they have the same dimension.

Consequence: Every n-dimensional vector space over \mathbb{R} is isomorphic to \mathbb{R}^n. The space \mathcal{P}_n of polynomials of degree \leq n (dimension n+1) is isomorphic to \mathbb{R}^{n+1}. The space \mathbb{R}^{2 \times 2} (dimension 4) is isomorphic to \mathbb{R}^4.

For AI: The residual stream of a transformer (a vector in \mathbb{R}^d) is isomorphic to any other d-dimensional space. Features are not inherently "in \mathbb{R}^d" - they live in an abstract vector space, and \mathbb{R}^d is just one coordinate representation. The linear representation hypothesis says this space has meaningful geometric structure independent of the coordinate system.

Properties of isomorphisms:

  • The inverse T^{-1}: W \to V is also an isomorphism.
  • Isomorphisms preserve all vector space structure: subspaces, linear independence, bases, dimension.
  • The isomorphisms from a space to itself (its automorphisms) form a group under composition.

4.4 The Inverse Map

Theorem 4.4.1. If T: V \to W is an isomorphism (bijective linear map), then T^{-1}: W \to V is also a linear map.

Proof: For \mathbf{w}_1, \mathbf{w}_2 \in W, let \mathbf{v}_i = T^{-1}(\mathbf{w}_i), so T(\mathbf{v}_i) = \mathbf{w}_i. Then:

T(a\mathbf{v}_1 + b\mathbf{v}_2) = aT(\mathbf{v}_1) + bT(\mathbf{v}_2) = a\mathbf{w}_1 + b\mathbf{w}_2

So T^{-1}(a\mathbf{w}_1 + b\mathbf{w}_2) = a\mathbf{v}_1 + b\mathbf{v}_2 = aT^{-1}(\mathbf{w}_1) + bT^{-1}(\mathbf{w}_2). \square

Matrix inverse. If T has matrix A (square, full rank), then T^{-1} has matrix A^{-1}.

Left and right inverses for non-square maps. When T: \mathbb{R}^n \to \mathbb{R}^m is injective but not surjective (n < m), there is no full inverse. However:

  • Left inverse: L such that L \circ T = I_V. Exists iff T is injective. Formula: L = (A^\top A)^{-1} A^\top (left pseudo-inverse).
  • Right inverse: R such that T \circ R = I_W. Exists iff T is surjective. Formula: R = A^\top (A A^\top)^{-1} (right pseudo-inverse).
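
A NumPy sketch of both formulas (random Gaussian matrices are generically full rank, so the required inverses exist here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tall matrix: injective (full column rank) but not surjective.
A = rng.normal(size=(5, 3))
L = np.linalg.inv(A.T @ A) @ A.T          # left pseudo-inverse
print(np.allclose(L @ A, np.eye(3)))      # True; note A @ L is NOT the identity on R^5

# Wide matrix: surjective (full row rank) but not injective.
B = rng.normal(size=(3, 5))
R = B.T @ np.linalg.inv(B @ B.T)          # right pseudo-inverse
print(np.allclose(B @ R, np.eye(3)))      # True
```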

Forward reference: Moore-Penrose Pseudo-Inverse

The general pseudo-inverse A^+, defined via the SVD as A^+ = V\Sigma^+ U^\top, handles all cases uniformly and gives the least-squares solution when an exact solution doesn't exist. Full treatment in 02: Singular Value Decomposition.

4.5 The Four Fundamental Subspaces via Linear Maps

Every linear map T: V \to W (with V = \mathbb{R}^n, W = \mathbb{R}^m, and matrix A) defines four fundamental subspaces:

Subspace | Definition | Lives in | Dimension
Column space \operatorname{col}(A) | \operatorname{im}(T) | \mathbb{R}^m | r = \operatorname{rank}(A)
Null space \operatorname{null}(A) | \ker(T) | \mathbb{R}^n | n - r
Row space \operatorname{row}(A) | \operatorname{im}(T^\top) | \mathbb{R}^n | r
Left null space \operatorname{null}(A^\top) | \ker(T^\top) | \mathbb{R}^m | m - r

Orthogonality relations (proven using the dual map - see Section 6):

\operatorname{null}(A) \perp \operatorname{row}(A), \quad \operatorname{null}(A^\top) \perp \operatorname{col}(A)

This splits \mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A) and \mathbb{R}^m = \operatorname{col}(A) \oplus \operatorname{null}(A^\top).

THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                        R^m (codomain)
  +--------------------+              +--------------------+
  | row space          |  ---T--->    | column space       |
  | (dim r)            |              | (dim r)            |
  +--------------------+              +--------------------+
  | null space         |  ---T--->    | left null space    |
  | (dim n-r)          |    {0}       | (dim m-r)          |
  +--------------------+              +--------------------+
   orthogonal complements              orthogonal complements
   within R^n                          within R^m

========================================================================
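
A NumPy sketch that extracts all four subspaces from the SVD and checks the orthogonality relations (the rank-2 construction and the 1e-10 threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 6))   # generically a rank-2, 4x6 matrix
m, n = A.shape

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))

row_space  = Vt[:r].T        # basis of row(A)     in R^n
null_space = Vt[r:].T        # basis of null(A)    in R^n
col_space  = U[:, :r]        # basis of col(A)     in R^m
left_null  = U[:, r:]        # basis of null(A^T)  in R^m

print(np.allclose(row_space.T @ null_space, 0))   # null(A) is orthogonal to row(A)
print(np.allclose(col_space.T @ left_null, 0))    # null(A^T) is orthogonal to col(A)
print(r, n - r, m - r)                            # dimensions: r, n-r, m-r  -> 2 4 2
```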

5. Special Classes of Linear Transformations

5.1 Projection Operators

Definition. A linear map P: V \to V is a projection if it is idempotent: P^2 = P.

Idempotency means "doing it twice is the same as doing it once." Once you project onto a subspace, you're already there - projecting again does nothing.

Theorem 5.1.1. If P is a projection, then:

  • \operatorname{im}(P) = \ker(I - P): the image of P is the fixed-point set.
  • \ker(P) = \operatorname{im}(I - P): the kernel of P is the image of the complementary projection.
  • V = \operatorname{im}(P) \oplus \ker(P) (direct sum decomposition).

Proof of the last claim: Any \mathbf{v} \in V decomposes as \mathbf{v} = P\mathbf{v} + (I-P)\mathbf{v}, where P\mathbf{v} \in \operatorname{im}(P) and (I-P)\mathbf{v} \in \ker(P) (since P(I-P)\mathbf{v} = (P-P^2)\mathbf{v} = \mathbf{0}). Uniqueness follows: if \mathbf{v} = \mathbf{u} + \mathbf{w} with \mathbf{u} \in \operatorname{im}(P) and \mathbf{w} \in \ker(P), then P\mathbf{v} = P\mathbf{u} + P\mathbf{w} = \mathbf{u}. \square

Rank and trace. For a projection, \operatorname{rank}(P) = \operatorname{tr}(P): the eigenvalues are only 0 and 1, and the trace (sum of eigenvalues) counts the 1s, which is the rank.

Orthogonal vs oblique projections. An orthogonal projection additionally satisfies P = P^\top (i.e., P is symmetric). In this case:

  • \ker(P) = \operatorname{im}(P)^\perp - the kernel is the orthogonal complement of the image.
  • Formula: P = A(A^\top A)^{-1}A^\top for projection onto \operatorname{col}(A).

An oblique projection has P^2 = P but P \neq P^\top - it projects along a subspace that is not orthogonal to the target.
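
A NumPy sketch building an orthogonal projection and checking the properties above (the 5x2 shape is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 2))               # generically full column rank

# Orthogonal projection onto col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))              # idempotent: P^2 = P
print(np.allclose(P, P.T))                # symmetric: an orthogonal projection
print(np.isclose(np.trace(P), 2.0))       # rank(P) = tr(P) = dim col(A)

v = rng.normal(size=5)
print(np.allclose(P @ (P @ v), P @ v))    # projecting twice = projecting once
```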

Forward reference: Orthogonal Projections

The full theory of orthogonal projections - including the projection formula, orthogonal decompositions, and the relationship to least squares - is in 05: Orthogonality and Orthonormality.

For AI: Attention heads in transformers apply projections onto query/key/value subspaces. The head-specific projection matrices W_Q, W_K, W_V define low-dimensional subspaces of the residual stream. The attention mechanism then computes weighted combinations within these projected spaces. Projection also appears in layer normalization: the mean-subtraction step is an orthogonal projection onto the subspace orthogonal to the all-ones direction, after which the result is rescaled onto the unit sphere.

5.2 Rotations and Reflections

Orthogonal transformations are linear maps T: \mathbb{R}^n \to \mathbb{R}^n that preserve inner products:

\langle T(\mathbf{u}), T(\mathbf{v}) \rangle = \langle \mathbf{u}, \mathbf{v} \rangle \quad \text{for all } \mathbf{u}, \mathbf{v}

Equivalently, their matrices satisfy A^\top A = I, i.e., A^{-1} = A^\top.

The set of all n \times n orthogonal matrices forms the orthogonal group O(n).

Determinant splits them into two classes:

  • \det(A) = +1: proper rotations - preserve orientation. These form SO(n), the special orthogonal group.
  • \det(A) = -1: improper rotations (reflections, or rotation + reflection) - reverse orientation.

2D rotation by angle \theta:

R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in SO(2)

Properties: R_\theta R_\phi = R_{\theta+\phi}, \quad R_\theta^{-1} = R_{-\theta} = R_\theta^\top.

3D rotations are parametrized by an axis \hat{\mathbf{n}} and angle \theta (Rodrigues' formula):

R = I + \sin(\theta)\, K + (1 - \cos\theta)\, K^2

where K is the skew-symmetric matrix with K\mathbf{v} = \hat{\mathbf{n}} \times \mathbf{v}.

Householder reflections. Given a unit vector \hat{\mathbf{n}}, the reflection through the hyperplane orthogonal to \hat{\mathbf{n}} is:

H = I - 2\hat{\mathbf{n}}\hat{\mathbf{n}}^\top

Properties: H = H^\top, \quad H^2 = I, \quad \det(H) = -1.
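
A NumPy sketch of a Householder reflection and its stated properties (the normal vector is an arbitrary example):

```python
import numpy as np

n = np.array([1.0, 2.0, -2.0])
n = n / np.linalg.norm(n)                 # unit normal of the mirror hyperplane

H = np.eye(3) - 2.0 * np.outer(n, n)

print(np.allclose(H, H.T))                # symmetric
print(np.allclose(H @ H, np.eye(3)))      # involution: H^2 = I
print(np.isclose(np.linalg.det(H), -1.0)) # orientation-reversing
print(np.allclose(H @ n, -n))             # the normal direction flips sign
```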

For AI: Rotary Position Embedding (RoPE), used in LLaMA, GPT-NeoX, and Gemma, encodes positional information via rotation matrices applied blockwise to query and key vectors. The rotation angle depends on position, and relative positions appear as relative rotations, which interact cleanly with the dot-product attention score.

5.3 Shear and Scaling Maps

Diagonal scaling: T(\mathbf{x}) = D\mathbf{x} where D = \operatorname{diag}(d_1, \ldots, d_n). Scales each coordinate independently. Kernel = \{\mathbf{0}\} if all d_i \neq 0. \det(D) = \prod_i d_i.

Shear maps in \mathbb{R}^2: the horizontal shear by factor k is:

S_k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}, \quad S_k(x, y) = (x + ky, y)

\det(S_k) = 1 - shear preserves area. S_k^{-1} = S_{-k}.

Geometrically: the x-axis is fixed; each horizontal line slides by a distance proportional to its height. Parallelograms are tilted but area is preserved.

Elementary matrices (from row operations) are all either shears, row swaps, or scalings:

  • Row scaling by c: E_{ii}(c) - multiply row i by c
  • Row swap: E_{ij} - swap rows i and j
  • Row addition: E_{ij}(c) - add c times row j to row i (this is a shear)

Every invertible matrix is a product of elementary matrices - Gaussian elimination as composition of linear maps.

5.4 The Geometry of Low-Rank Maps

A rank-k linear map T: \mathbb{R}^n \to \mathbb{R}^m with k < n collapses the domain: it maps all of \mathbb{R}^n onto a k-dimensional subspace of \mathbb{R}^m, while sending an (n-k)-dimensional subspace (the null space) to zero.

Decomposition via SVD preview. Any rank-k matrix admits:

A = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

This is a sum of k rank-1 outer products. Each rank-1 term \sigma_i \mathbf{u}_i \mathbf{v}_i^\top maps everything in the direction \mathbf{v}_i to the direction \mathbf{u}_i, scaled by \sigma_i.
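
A NumPy sketch rebuilding a low-rank matrix from its rank-1 SVD terms (the rank-3 construction and 1e-10 threshold are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 8))   # generically a rank-3, 6x8 matrix

U, s, Vt = np.linalg.svd(A)
k = int(np.sum(s > 1e-10))                # numerical rank, here 3

# Rebuild A as a sum of k rank-1 outer products sigma_i * u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))
print(k, np.allclose(A, A_rebuilt))       # 3 True
```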

Forward reference: SVD and Low-Rank Approximation

The full theory of SVD - including the Eckart-Young theorem (best rank-k approximation) and the geometric interpretation of singular values - is in 02: Singular Value Decomposition.

LoRA preview. A rank-r update \Delta W = BA (with B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, r \ll \min(d,k)) is a composition of two low-rank linear maps:

  1. A: \mathbb{R}^k \to \mathbb{R}^r compresses the input to r dimensions.
  2. B: \mathbb{R}^r \to \mathbb{R}^d expands back to d dimensions.

The effective map \Delta W = BA has rank \leq r, so it only modifies the network's behavior along r directions in the input space. The hypothesis underlying LoRA is that the task-relevant weight changes lie in a low-dimensional subspace - a statement directly about the geometry of linear maps.
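
A NumPy sketch of the rank bound (the dimensions d, k, r are made-up illustrative values, not from any real model):

```python
import numpy as np

d, k, r = 64, 48, 4
rng = np.random.default_rng(6)
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

delta_W = B @ A
print(np.linalg.matrix_rank(delta_W))          # 4: rank is at most r

# Any direction in ker(A) (dimension k - r = 44) is untouched by the update.
_, _, Vt = np.linalg.svd(A)
v_null = Vt[-1]                                # one direction in ker(A)
print(np.allclose(delta_W @ v_null, 0))        # True
```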


6. Dual Spaces and Transposes

6.1 The Dual Space

Definition. Given a vector space V over \mathbb{F}, the dual space V^* is the set of all linear maps from V to \mathbb{F}:

V^* = \mathcal{L}(V, \mathbb{F}) = \{f: V \to \mathbb{F} \mid f \text{ is linear}\}

Elements of V^* are called linear functionals, covectors, or one-forms.

V^* is itself a vector space under pointwise addition (f+g)(\mathbf{v}) = f(\mathbf{v}) + g(\mathbf{v}) and scalar multiplication (cf)(\mathbf{v}) = c \cdot f(\mathbf{v}).

The dual basis. If \mathcal{B} = \{\mathbf{e}_1, \ldots, \mathbf{e}_n\} is a basis for V, then the dual basis \mathcal{B}^* = \{e^1, \ldots, e^n\} is defined by:

e^i(\mathbf{e}_j) = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}

The dual basis is a basis for V^*, so \dim(V^*) = \dim(V).

Key example: Row vectors. In \mathbb{R}^n, a linear functional is any map f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} for some fixed \mathbf{a} \in \mathbb{R}^n. So (\mathbb{R}^n)^* \cong \mathbb{R}^n, but the identification \mathbf{a}^\top (row vector) \leftrightarrow \mathbf{a} (column vector) is coordinate-dependent. A row vector \mathbf{a}^\top is intrinsically an element of (\mathbb{R}^n)^*, not of \mathbb{R}^n.

Why the distinction matters. When you write \mathbf{a}^\top \mathbf{x}, you're applying the dual vector \mathbf{a}^\top \in (\mathbb{R}^n)^* to the vector \mathbf{x} \in \mathbb{R}^n. This is a pairing between a space and its dual - not a dot product of two vectors in the same space. In Riemannian geometry and general relativity, this distinction is essential. In ML, it matters for understanding gradients.
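
For a concrete (non-standard) basis of \mathbb{R}^n, the dual basis functionals are the rows of the inverse of the basis matrix. A NumPy sketch (the basis vectors are an arbitrary example):

```python
import numpy as np

# A basis of R^2, stored as the columns of M.
b1, b2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
M = np.column_stack([b1, b2])

# The dual basis functionals are the rows of M^{-1}:
# row i applied to b_j gives the Kronecker delta.
M_inv = np.linalg.inv(M)
print(np.allclose(M_inv @ M, np.eye(2)))     # e^i(b_j) = delta_ij

# Applying the dual basis to a vector recovers its coordinates in {b1, b2}.
v = 3.0 * b1 - 2.0 * b2
print(M_inv @ v)                             # [ 3. -2.]
```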

6.2 The Dual Map and the Transpose

Definition. Given T: V \to W, the dual map (or transpose) T^\top: W^* \to V^* is defined by:

(T^\top f)(\mathbf{v}) = f(T(\mathbf{v})) \quad \text{for } f \in W^*, \; \mathbf{v} \in V

In words: to apply T^\top f to \mathbf{v}, first apply T to \mathbf{v}, then apply f to the result.

Matrix of the dual map. If T has matrix A in standard coordinates, then T^\top has matrix A^\top (the matrix transpose). The notation "T^\top" for the dual map is intentional and consistent.

Domain and codomain switch. T: V \to W means T^\top: W^* \to V^*. The transpose reverses the direction. This is why:

  • Composition reverses order: (ST)^\top = T^\top S^\top.
  • In backpropagation: if the forward pass multiplies by W, the backward pass multiplies by W^\top.

Kernel of the transpose = left null space:

\ker(T^\top) = \{\mathbf{w} \in W : T^\top \mathbf{w} = \mathbf{0}\} = \operatorname{null}(A^\top)

This is the left null space of A, which is orthogonal to \operatorname{col}(A).

Annihilators. The annihilator of a subspace S \subseteq V is S^0 = \{f \in V^* : f(\mathbf{s}) = 0 \text{ for all } \mathbf{s} \in S\}. Key identities:

\ker(T^\top) = (\operatorname{im}(T))^0, \qquad \operatorname{im}(T^\top) = (\ker(T))^0

6.3 Gradients Live in the Dual Space

This connection between linear maps and gradients is one of the deepest in all of applied mathematics, and is directly relevant to understanding backpropagation.

The gradient as a linear functional. For a differentiable function f: \mathbb{R}^n \to \mathbb{R}, the derivative at \mathbf{x} is a linear functional Df_{\mathbf{x}}: \mathbb{R}^n \to \mathbb{R}, defined by:

Df_{\mathbf{x}}(\mathbf{h}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{h}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x})^\top \mathbf{h}

The derivative Df_{\mathbf{x}} is an element of (\mathbb{R}^n)^* - a covector, represented as a row vector \nabla f(\mathbf{x})^\top.

The gradient vector \nabla f(\mathbf{x}) \in \mathbb{R}^n is obtained by identifying (\mathbb{R}^n)^* with \mathbb{R}^n via the standard inner product. This identification is coordinate-dependent: on a curved manifold (like a constraint surface or the space of probability distributions), gradients and vectors must be treated differently.

Backpropagation via dual maps. Consider a network with one linear layer \mathbf{y} = W\mathbf{x} and loss \ell = L(\mathbf{y}). The gradient with respect to \mathbf{x} is:

\frac{\partial \ell}{\partial \mathbf{x}} = W^\top \frac{\partial \ell}{\partial \mathbf{y}}

This is exactly the dual map T^\top applied to the incoming gradient \frac{\partial \ell}{\partial \mathbf{y}}. Backpropagation propagates gradients backward through the transpose of each weight matrix. The chain of transposes in the backward pass is the dual map composition:

(W_L \cdots W_1)^\top = W_1^\top \cdots W_L^\top
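
A NumPy sketch of the single-layer case, checking the "multiply by W^\top" rule against a finite-difference gradient (the loss L(\mathbf{y}) = \lVert\mathbf{y}\rVert^2 and all sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)

def loss(x):
    y = W @ x
    return np.sum(y ** 2)              # L(y) = ||y||^2, so dL/dy = 2y

# Backprop: dL/dx = W^T (dL/dy)
g_y = 2.0 * (W @ x)
g_x = W.T @ g_y

# Compare against a central finite-difference approximation of the gradient.
eps = 1e-6
g_num = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(5)])
print(np.allclose(g_x, g_num, atol=1e-4))      # True
```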

For AI: Modern deep learning frameworks (PyTorch, JAX) compute gradients using automatic differentiation, which is precisely the evaluation of dual maps via the chain rule. Every .backward() call accumulates contributions via transpose weight matrices - it is applying the adjoint (dual) of the forward linear map.

