
Linear Transformations: Appendix C (The Geometry of Linear Maps) through Appendix M (Self-Assessment Checklist)

Appendix C: The Geometry of Linear Maps - A Deep Dive

C.1 How Linear Maps Deform Space

To deeply understand a linear map $T: \mathbb{R}^n \to \mathbb{R}^m$, we track how it deforms geometric objects.

Ellipsoids to ellipsoids. The image of the unit sphere $S^{n-1} = \{\mathbf{x} : \lVert\mathbf{x}\rVert = 1\}$ under a full-rank map $A$ is an ellipsoid whose semi-axes have lengths equal to the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0$ of $A$, pointing in the directions of the first $n$ left singular vectors $\mathbf{u}_1, \ldots, \mathbf{u}_n$.

This is the geometric content of the SVD: $A = U\Sigma V^\top$ means:

  1. $V^\top$: rotate the input so the "natural input directions" $\mathbf{v}_i$ align with the coordinate axes.
  2. $\Sigma$: stretch each axis $i$ by $\sigma_i$.
  3. $U$: rotate the output to place the stretched axes along the $\mathbf{u}_i$ directions.

The sphere becomes an ellipsoid. The "shape" of the ellipsoid is completely described by the singular values.
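
A quick numerical check of this picture (a sketch in NumPy; the matrix A is an arbitrary example, not taken from the text): map the unit circle through a $2 \times 2$ matrix, confirm that the extreme radii of the image match the singular values, and that the eigenvalues of the Gram matrix $A^\top A$ are their squares.

```python
import numpy as np

# Arbitrary full-rank example matrix.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Points on the unit circle S^1.
theta = np.linspace(0, 2 * np.pi, 2000)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 2000)

image = A @ circle                       # image of the unit circle
radii = np.linalg.norm(image, axis=0)    # distance of each image point from 0

U, S, Vt = np.linalg.svd(A)
print(radii.max(), S[0])    # longest semi-axis  ~= sigma_1
print(radii.min(), S[1])    # shortest semi-axis ~= sigma_2

# Eigenvalues of the Gram matrix A^T A are the squared singular values.
gram_eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(gram_eigs, S**2))      # True
```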

Volume scaling. The volume of the image of a set $S \subseteq \mathbb{R}^n$ under $A$ is $|\det(A)| \cdot \operatorname{vol}(S)$ (when $A$ is square). More precisely, $|\det(A)| = \prod_i \sigma_i$, the product of all singular values, which equals the volume of the image of the unit cube.

For a rank-deficient map ($\det(A) = 0$), the image has lower dimension, so its $n$-dimensional volume is 0. The map "collapses" $n$-dimensional space onto a lower-dimensional flat object.

Angles. Unless $A$ is orthogonal, linear maps change angles. The angle $\theta$ between $\mathbf{u}$ and $\mathbf{v}$ satisfies:

$$\cos\theta = \frac{\mathbf{u}^\top\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}$$

but $\cos\angle(A\mathbf{u}, A\mathbf{v}) = \frac{(A\mathbf{u})^\top(A\mathbf{v})}{\lVert A\mathbf{u}\rVert\,\lVert A\mathbf{v}\rVert} = \frac{\mathbf{u}^\top A^\top A\mathbf{v}}{\lVert A\mathbf{u}\rVert\,\lVert A\mathbf{v}\rVert}$.

The matrix $G = A^\top A$ (the Gram matrix) determines how $A$ distorts inner products. The eigenvalues of $G$ are $\sigma_i^2$ (the squared singular values).

C.2 Interpreting the Four Fundamental Subspaces Geometrically

Given $T: \mathbb{R}^n \to \mathbb{R}^m$ with matrix $A$ and rank $r$:

The row space $\operatorname{row}(A)$: This is the "input directions that survive" - the $r$-dimensional subspace of $\mathbb{R}^n$ that $T$ maps faithfully (injectively) onto the column space. Any $\mathbf{x} \in \operatorname{row}(A)$ is "noticed" by $T$.

The null space $\operatorname{null}(A)$: This is the "input directions that are killed" - the $(n-r)$-dimensional subspace of $\mathbb{R}^n$ that $T$ maps to zero. Any $\mathbf{x} \in \operatorname{null}(A)$ is "invisible" to $T$.

The decomposition $\mathbb{R}^n = \operatorname{row}(A) \oplus \operatorname{null}(A)$: Every input $\mathbf{x}$ splits uniquely as $\mathbf{x} = \mathbf{x}_r + \mathbf{x}_n$, where $\mathbf{x}_r$ is in the row space (the "signal" part) and $\mathbf{x}_n$ is in the null space (the "noise" invisible to $T$).

The column space $\operatorname{col}(A)$: The $r$-dimensional subspace of $\mathbb{R}^m$ that $T$ can actually reach. Solutions to $A\mathbf{x} = \mathbf{b}$ exist iff $\mathbf{b} \in \operatorname{col}(A)$.

The left null space $\operatorname{null}(A^\top)$: The $(m-r)$-dimensional complement of $\operatorname{col}(A)$ in $\mathbb{R}^m$. Directions in the left null space are unreachable by $T$.

COMPLETE PICTURE OF THE FOUR FUNDAMENTAL SUBSPACES
========================================================================

  R^n (domain)                            R^m (codomain)
  ----------------------------------------------------------

  +---------------------+    T(x) = Ax    +---------------------+
  |    row space        | --------------> |    column space     |
  |   (dim = r)         |   isomorphism   |    (dim = r)        |
  |                     |                 |                     |
  |   -------------     |                 |   -------------     |
  |                     |                 |                     |
  |    null space       | --------------> |    left null space  |
  |   (dim = n-r)       |    maps to 0    |    (dim = m-r)      |
  +---------------------+                 +---------------------+
   (the two blocks are orthogonal          (the two blocks are orthogonal
    complements in R^n)                     complements in R^m)

  Every x = (row space part) + (null space part)
  T sees only the row space part.

========================================================================

For AI (linear systems / least squares): When fitting a model $A\mathbf{w} = \mathbf{b}$ with more constraints than parameters ($m > n$), the system is overdetermined. An exact solution exists only if $\mathbf{b} \in \operatorname{col}(A)$. If not, the least-squares solution minimizes $\lVert A\mathbf{w} - \mathbf{b}\rVert^2$ - finding the projection of $\mathbf{b}$ onto $\operatorname{col}(A)$ and then solving in the row space.
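
A small sketch of this geometry (NumPy; the data are random placeholders): solve an overdetermined system by least squares and verify that $A\hat{\mathbf{w}}$ is exactly the orthogonal projection of $\mathbf{b}$ onto $\operatorname{col}(A)$, with the residual orthogonal to the column space.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # 10 constraints, 3 parameters (m > n)
b = rng.normal(size=10)

# Least-squares solution: minimizes ||Aw - b||^2.
w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonal projection onto col(A), built from the normal equations.
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A @ w_hat, P @ b))          # True: A w_hat is the projection
print(np.allclose(A.T @ (b - A @ w_hat), 0))  # residual is orthogonal to col(A)
```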

C.3 Linear Maps and Information Theory

The rank-nullity theorem has an information-theoretic interpretation.

Rank = information preserved. A linear map of rank $r$ preserves at most $r$ "dimensions" of information from the input. The remaining $n - r$ dimensions are destroyed.

Mutual information. For a Gaussian input $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, I_n)$ observed through unit-variance Gaussian noise, $\mathbf{y} = A\mathbf{x} + \boldsymbol{\eta}$ with $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, I_m)$ (without a noise floor, the mutual information of a deterministic continuous map would be infinite), the mutual information:

$$I(\mathbf{x}; \mathbf{y}) = \frac{1}{2} \sum_{i=1}^r \log(1 + \sigma_i^2)$$

depends only on the singular values - not on the specific directions. The null space of $A$ contributes zero mutual information.

Compression. If we want to compress $\mathbf{x} \in \mathbb{R}^n$ to $\mathbf{z} \in \mathbb{R}^r$ via a linear map $C \in \mathbb{R}^{r \times n}$, then for isotropic input every rank-$r$ orthonormal projection is equally informative under the noisy-observation model above - there are no preferred directions to keep. For structured data with covariance $\Sigma$, the optimal compression is PCA - projecting onto the top $r$ eigenvectors of $\Sigma$.

This is why PCA is the optimal linear compressor under mean-squared error: it maximizes the retained variance (information) for any fixed rank $r$.
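
A sketch of PCA as rank-$r$ linear compression (NumPy; the synthetic anisotropic data are an arbitrary example): project onto the top principal directions of the sample covariance and compare reconstruction error against a random orthonormal projection of the same rank.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with strongly anisotropic covariance.
X = rng.normal(size=(500, 10)) * np.array([5, 3, 2, 1, .5, .3, .2, .1, .05, .01])

r = 3
# Top-r eigenvectors of the sample covariance = top-r right singular vectors.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
C_pca = Vt[:r]                    # optimal rank-r compressor (PCA)

Q, _ = np.linalg.qr(rng.normal(size=(10, r)))
C_rand = Q.T                      # random rank-r orthonormal compressor

def mse(C):
    Xc = X - X.mean(axis=0)
    return np.mean((Xc - Xc @ C.T @ C) ** 2)   # project, reconstruct, compare

print(mse(C_pca) < mse(C_rand))   # True: PCA retains more variance
```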

C.4 Generalization of Linear Maps: Tensors

The concept of a linear map generalizes to multilinear maps and tensors in ways that are directly relevant to deep learning.

Bilinear maps and matrices of bilinear forms. A bilinear form $B: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ can be written as $B(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top M \mathbf{y}$ for a matrix $M \in \mathbb{R}^{n \times m}$. The bilinear form is:

  • Symmetric if $M = M^\top$ (and $n = m$): $B(\mathbf{x}, \mathbf{y}) = B(\mathbf{y}, \mathbf{x})$.
  • Positive definite if $B(\mathbf{x}, \mathbf{x}) > 0$ for $\mathbf{x} \neq \mathbf{0}$: inner products are positive definite symmetric bilinear forms.

Multilinear maps. A $k$-linear map $T: V_1 \times \cdots \times V_k \to W$ is linear in each argument separately. The space of $k$-linear maps on $\mathbb{R}^n$ is the space of tensors of order $k$.

For AI: The multi-head attention score $\mathbf{q}^\top W_{\text{QK}} \mathbf{k}$ is a bilinear form parameterized by $W_{\text{QK}} = W_Q^\top W_K$. Analyzing this bilinear form via the SVD ($W_{\text{QK}} = U\Sigma V^\top$) reveals what "patterns" each attention head is sensitive to: the left singular vectors $U$ are "what queries to look for" and the right singular vectors $V$ are "what keys are being matched against."
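
A sketch of this analysis (NumPy; the dimensions d_model and d_head and the random weights are placeholders standing in for trained parameters): form $W_{\text{QK}} = W_Q^\top W_K$, take its SVD, and check that the bilinear score decomposes into a sum over singular directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head = 64, 16
# Projection weights, stored as (d_head, d_model) so W_QK = W_Q^T W_K.
W_Q = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)

W_QK = W_Q.T @ W_K          # (d_model, d_model) bilinear form; rank <= d_head
U, S, Vt = np.linalg.svd(W_QK)
print((S > 1e-10).sum())    # numerical rank: at most d_head = 16

q, k = rng.normal(size=d_model), rng.normal(size=d_model)
score = q @ W_QK @ k
# Same score via singular directions: sum_i s_i (q . u_i)(v_i . k).
score_svd = sum(S[i] * (q @ U[:, i]) * (Vt[i] @ k) for i in range(len(S)))
print(np.allclose(score, score_svd))   # True
```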



Appendix D: Computational Methods and Numerical Considerations

D.1 Computing the Kernel via Row Reduction

Given $A \in \mathbb{R}^{m \times n}$, finding a basis for $\ker(A)$ requires solving $A\mathbf{x} = \mathbf{0}$.

Algorithm (Gaussian Elimination to RREF):

  1. Apply row operations to reduce $A$ to reduced row echelon form (RREF).
  2. Identify pivot columns (columns with leading 1s in RREF) and free columns (all other columns).
  3. For each free variable, set it to 1 and all other free variables to 0, then solve for the pivot variables.
  4. Each such solution is one basis vector for $\ker(A)$.

Example. $A = \begin{bmatrix} 1 & 2 & -1 & 0 \\ 2 & 4 & -2 & 3 \\ -1 & -2 & 1 & 2 \end{bmatrix}$.

Reduce toward RREF: $R_2 \leftarrow R_2 - 2R_1$, $R_3 \leftarrow R_3 + R_1$:

$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 2 \end{bmatrix}$$

Then $R_3 \leftarrow R_3 - \frac{2}{3} R_2$, $R_2 \leftarrow \frac{1}{3}R_2$:

$$\begin{bmatrix} 1 & 2 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

$R_1 \leftarrow R_1 - 0 \cdot R_2$: nothing to do - this is already the RREF.

Pivot columns: 1 and 4. Free variables: $x_2$ and $x_3$.

Setting $x_2 = 1, x_3 = 0$: $x_4 = 0$ and $x_1 + 2(1) - 1(0) = 0 \Rightarrow x_1 = -2$. Basis vector: $(-2, 1, 0, 0)^\top$.

Setting $x_2 = 0, x_3 = 1$: $x_4 = 0$ and $x_1 + 0 - 1 = 0 \Rightarrow x_1 = 1$. Basis vector: $(1, 0, 1, 0)^\top$.

Null space basis: $\ker(A) = \operatorname{span}\left\{(-2,1,0,0)^\top, (1,0,1,0)^\top\right\}$. Nullity = 2.

Rank-nullity: $n = 4$, rank = 2 (two pivots), nullity = 2. Check: $2 + 2 = 4$. OK
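
The same computation in code (a sketch assuming SymPy and SciPy are available; SymPy row-reduces exactly, while SciPy's SVD-based null_space returns an orthonormal basis for the same subspace):

```python
import numpy as np
import sympy as sp
from scipy.linalg import null_space

A = np.array([[ 1,  2, -1, 0],
              [ 2,  4, -2, 3],
              [-1, -2,  1, 2]])

# Exact kernel basis via RREF (matches the hand computation).
kernel = sp.Matrix(A).nullspace()
print([list(v) for v in kernel])   # [[-2, 1, 0, 0], [1, 0, 1, 0]]

# Orthonormal kernel basis via SVD; spans the same 2-dimensional subspace.
N = null_space(A)
print(N.shape[1])                  # 2 (the nullity)
print(np.allclose(A @ N, 0))       # True
```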

D.2 Numerical Stability of Basis Computations

Computing the null space or column space numerically requires care because floating-point arithmetic can introduce small errors.

The SVD approach (recommended). Instead of row reduction, compute the SVD $A = U\Sigma V^\top$. Then:

  • $\ker(A)$ = span of the columns of $V$ corresponding to zero singular values (or singular values below a threshold $\epsilon$).
  • $\operatorname{col}(A)$ = span of the columns of $U$ corresponding to nonzero singular values.

The SVD-based approach is numerically stable because orthonormal bases ($U$ and $V$) are well-conditioned.

Numerical rank. For floating-point matrices, "zero" singular values appear as small but nonzero values. The numerical rank with threshold $\epsilon$ is:

$$\operatorname{rank}_\epsilon(A) = |\{i : \sigma_i > \epsilon \cdot \sigma_1\}|$$

A common choice is $\epsilon = n \cdot \varepsilon_{\text{machine}}$ (on the order of $10^{-13}$ for double precision at moderate $n$). numpy.linalg.matrix_rank uses a default threshold of this form.

Why this matters: In practice, a matrix with theoretical rank $r$ may appear to have rank $r + 5$ due to measurement noise. The SVD reveals the "intrinsic" rank through the gap in the singular values.
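
A quick demonstration (NumPy; the sizes and noise level are arbitrary): a rank-3 matrix plus tiny noise looks full-rank to the default tolerance, but the singular-value gap exposes the intrinsic rank, and an explicit threshold above the noise floor recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
# Exact rank-3 matrix: product of (20 x 3) and (3 x 15) factors.
A = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))
A_noisy = A + 1e-10 * rng.normal(size=A.shape)

s = np.linalg.svd(A_noisy, compute_uv=False)
print(np.linalg.matrix_rank(A_noisy))            # 15: the noise inflates the rank
print(s[2] / s[3])                               # huge ratio: the spectral gap
print(np.linalg.matrix_rank(A_noisy, tol=1e-6))  # 3: threshold above the noise
```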

D.3 Efficient Change of Basis Computations

Naive approach: Compute $P^{-1}AP$ directly. For $n \times n$ matrices, this costs $O(n^3)$ (two matrix multiplications plus one matrix inversion).

Better approach when $P$ is orthogonal: If $P$ is orthogonal ($P^{-1} = P^\top$), then $P^\top AP$ still costs $O(n^3)$ but with better constants than the general $P^{-1}AP$ (no matrix inversion needed).

Eigendecomposition case: When $A = P\Lambda P^{-1}$, computing $A^k = P\Lambda^k P^{-1}$ requires only $O(n)$ operations to form $\Lambda^k$ (raise each diagonal entry to the $k$-th power), plus two $O(n^2)$ matrix-vector multiplications to apply $P$ and $P^{-1}$ when $A^k$ acts on a vector.
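
A sketch (NumPy/SciPy; the matrix is a random diagonalizable example): compute $\exp(At)$ via eigendecomposition and compare against scipy.linalg.expm.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
t = 0.7

# Diagonalize; eigenvalues of a general real matrix may be complex.
lam, P = np.linalg.eig(A)
expAt_diag = P @ np.diag(np.exp(lam * t)) @ np.linalg.inv(P)

# Imaginary parts are numerical noise for a real A.
print(np.allclose(expAt_diag.real, expm(A * t)))  # True
```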

D.4 The Rank-Revealing QR Decomposition

Standard QR decomposition $A = QR$ doesn't directly reveal rank. The rank-revealing QR (RRQR) uses column pivoting:

$$AP = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}$$

where $P$ is a permutation matrix and $R_{22}$ is "small" (its Frobenius norm bounds how far $A$ is from rank $r$). The columns of $Q_1$ form a basis for $\operatorname{col}(A)$.

RRQR is preferred over the SVD when only a basis for the column space (not the singular values themselves) is needed, as it is typically a few times faster.
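
A sketch with SciPy's pivoted QR (scipy.linalg.qr with pivoting=True; the test matrix is a random rank-3 example): the diagonal of $R$ drops sharply after position $r$, and the first $r$ columns of $Q$ span $\operatorname{col}(A)$.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(5)
A = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 8))   # rank 3

Q, R, piv = qr(A, pivoting=True)
d = np.abs(np.diag(R))
print(d / d[0])     # entries 0-2 are O(1); entries 3+ are ~1e-16

r = 3
Q1 = Q[:, :r]
# Q1 spans col(A): projecting A onto span(Q1) reproduces A.
print(np.allclose(Q1 @ (Q1.T @ A), A))   # True
```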


Appendix E: Connections to Other Fields

E.1 Linear Maps in Physics

In quantum mechanics, operators on Hilbert spaces are infinite-dimensional linear maps. The Hamiltonian $\hat{H}$, momentum $\hat{p}$, and position $\hat{x}$ are linear operators. The Schrödinger equation $i\hbar\, \partial_t |\psi\rangle = \hat{H}|\psi\rangle$ is a linear ODE on the Hilbert space of quantum states.

The spectral theorem for self-adjoint operators (the quantum generalization of symmetric matrix diagonalization) guarantees that observables have real eigenvalues (the possible measurement outcomes) and that the eigenfunctions form a complete orthonormal basis.

For AI: Transformers share suggestive mathematical parallels with quantum mechanics: both involve inner products of state vectors (attention scores), superposition (linear combinations of basis states), and entanglement-like correlations. The linear algebra of quantum mechanics and of transformers both live in the framework of linear maps on inner product (Hilbert) spaces.

E.2 Linear Maps in Topology: Homology

In algebraic topology, chain complexes are sequences of vector spaces connected by linear maps:

$$\cdots \xrightarrow{\partial_3} C_2 \xrightarrow{\partial_2} C_1 \xrightarrow{\partial_1} C_0 \xrightarrow{\partial_0} 0$$

where $\partial_{k-1} \circ \partial_k = 0$ (the boundary of a boundary is zero - exactly the condition $\ker(\partial_{k-1}) \supseteq \operatorname{im}(\partial_k)$). The homology groups $H_k = \ker(\partial_k) / \operatorname{im}(\partial_{k+1})$ measure "holes" in topological spaces.

Persistent homology, used in topological data analysis (TDA), applies this to point cloud data to find features that persist across scales. It's used in ML for analyzing data manifolds, protein structure prediction, and understanding neural network loss landscapes.

E.3 Linear Maps in Signal Processing

The Discrete Fourier Transform (DFT) is a linear map $F: \mathbb{C}^n \to \mathbb{C}^n$ with matrix entries $F_{kj} = e^{-2\pi i jk/n}$. The DFT matrix is unitary up to scaling: $F^* F = nI$, so $F/\sqrt{n}$ is unitary.

Convolution is linear - convolving a signal with a kernel is a linear map - and in the Fourier domain it becomes pointwise multiplication. This is the key to making CNNs efficient: convolution is a structured linear map with shared weights (translation equivariance), and the Fourier transform diagonalizes the convolution operator.

For AI: The fast Fourier transform (FFT) costs $O(n \log n)$ instead of the $O(n^2)$ of a full DFT matrix multiply, by exploiting the structure of the DFT linear map (it factors into $O(\log n)$ sparse matrices). Similarly, FlashAttention speeds up attention by exploiting the structure of the attention linear map to minimize memory bandwidth.
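
A check of the convolution theorem (NumPy; the signals are random placeholders): circular convolution computed directly matches pointwise multiplication in the Fourier domain.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 128
x = rng.normal(size=n)   # signal
h = rng.normal(size=n)   # kernel

# Direct circular convolution: O(n^2).
direct = np.array([sum(h[k] * x[(t - k) % n] for k in range(n))
                   for t in range(n)])

# Fourier domain: convolution becomes pointwise multiplication, O(n log n).
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(direct, via_fft))   # True
```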



Appendix F: The Algebra of Linear Maps - Structural Results

F.1 The Space of Linear Maps is a Vector Space

We noted briefly that $\mathcal{L}(V, W)$ is a vector space. Let's make this precise and compute its dimension.

Operations: For $S, T \in \mathcal{L}(V, W)$ and $c \in \mathbb{F}$:

  • $(S + T)(\mathbf{v}) = S(\mathbf{v}) + T(\mathbf{v})$
  • $(cT)(\mathbf{v}) = c \cdot T(\mathbf{v})$
  • The zero element is the zero map $O(\mathbf{v}) = \mathbf{0}$ for all $\mathbf{v}$.

Dimension. If $\dim(V) = n$ and $\dim(W) = m$, then:

$$\dim(\mathcal{L}(V, W)) = mn$$

Proof: Every linear map $T: V \to W$ is determined by $n$ vectors in $W$ (the images of the basis vectors), each in an $m$-dimensional space. The natural isomorphism is $\mathcal{L}(V, W) \cong W^n \cong \mathbb{R}^{mn}$ (as vector spaces), which corresponds to the identification with $m \times n$ matrices. $\square$

Basis for $\mathcal{L}(\mathbb{R}^n, \mathbb{R}^m)$. The standard basis consists of the $mn$ maps $E_{ij}: \mathbb{R}^n \to \mathbb{R}^m$ defined by $E_{ij}(\mathbf{e}_k) = \delta_{jk}\mathbf{e}_i$. In matrix form, $E_{ij}$ is the matrix with 1 in position $(i,j)$ and 0 elsewhere.

F.2 Composition Gives $\mathcal{L}(V, V)$ an Algebra Structure

When $V = W$, linear maps $T: V \to V$ can be composed. The space $\mathcal{L}(V, V)$ (the endomorphisms of $V$) is a ring under composition (it is also a vector space - together, an algebra).

Properties of composition in $\mathcal{L}(V, V)$:

  • Associative: $(RS)T = R(ST)$
  • Identity: $I \circ T = T \circ I = T$
  • Distributive: $R(S + T) = RS + RT$ and $(R+S)T = RT + ST$
  • NOT commutative: $RS \neq SR$ in general

Matrix polynomials. For $T \in \mathcal{L}(V,V)$, we can form $p(T) = a_0 I + a_1 T + a_2 T^2 + \cdots + a_k T^k$ for any polynomial $p$. This is well-defined because we can add and compose linear maps.

Cayley-Hamilton theorem. Every linear operator $T$ satisfies its own characteristic polynomial: $p_T(T) = 0$, where $p_T(\lambda) = \det(\lambda I - T)$.

For AI: The spectral approach to recurrent networks analyzes the long-run behavior of $T^k \mathbf{v}$ as $k \to \infty$. If $T$ has eigenvalues $|\lambda_i| < 1$, then $T^k \to 0$ (stable memory decay). If any $|\lambda_i| > 1$, the recurrence explodes. This spectral stability analysis is the foundation of designing stable RNNs (LSTM and GRU use gating to control the effective eigenvalue spectrum of the recurrence).
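
A small demonstration (NumPy; the matrix is a random example rescaled to a chosen spectral radius): iterating $T^k\mathbf{v}$ decays when the spectral radius is below 1 and explodes when it is above.

```python
import numpy as np

rng = np.random.default_rng(7)

def rescale_to_radius(M, rho):
    """Rescale M so its spectral radius (max |eigenvalue|) equals rho."""
    return M * (rho / np.abs(np.linalg.eigvals(M)).max())

M = rng.normal(size=(8, 8))
v = rng.normal(size=8)

for rho in (0.9, 1.1):
    T = rescale_to_radius(M, rho)
    w = v.copy()
    for _ in range(200):
        w = T @ w
    print(rho, np.linalg.norm(w))   # 0.9 -> ~0 (decay); 1.1 -> huge (explosion)
```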

F.3 Quotient Maps and Projections

The quotient space. Given $T: V \to W$ with kernel $K = \ker(T)$, the quotient space $V/K$ consists of the equivalence classes $[\mathbf{v}] = \mathbf{v} + K = \{\mathbf{v} + \mathbf{k} : \mathbf{k} \in K\}$.

$V/K$ is a vector space of dimension $\dim(V) - \dim(K) = \operatorname{rank}(T)$.

The first isomorphism theorem. Every linear map $T: V \to W$ factors as:

$$V \xrightarrow{\pi} V/\ker(T) \xrightarrow{\tilde{T}} \operatorname{im}(T) \hookrightarrow W$$

where $\pi$ is the quotient map ($\pi(\mathbf{v}) = [\mathbf{v}]$) and $\tilde{T}$ is an isomorphism from $V/\ker(T)$ to $\operatorname{im}(T)$.

This is the coordinate-free statement of the rank-nullity theorem: $V/\ker(T) \cong \operatorname{im}(T)$.

Geometric meaning. The quotient map $\pi$ "collapses" the null space to a point, then $\tilde{T}$ acts faithfully (injectively) on the resulting space. Any linear map splits into: collapse (project out the null space) + inject faithfully into the codomain.

For AI: In contrastive learning (SimCLR, MoCo), the projection head maps representations to a lower-dimensional space. This is a linear (or nonlinear) quotient map - it deliberately collapses some dimensions (those corresponding to nuisance factors like image augmentation) while preserving the semantically meaningful directions. The first isomorphism theorem says: what survives in the image is exactly what was not collapsed.

F.4 Dual Bases and the Canonical Isomorphism

We saw that $\dim(V^*) = \dim(V)$, so $V \cong V^*$. But this isomorphism is non-canonical - it depends on the choice of basis.

With an inner product. When $V$ is an inner product space (like $\mathbb{R}^n$ with the standard dot product), there is a canonical isomorphism $\phi: V \to V^*$ defined by:

$$\phi(\mathbf{v}) = \langle \mathbf{v}, \cdot \rangle$$

That is, $\phi(\mathbf{v})$ is the linear functional that takes $\mathbf{w} \mapsto \langle \mathbf{v}, \mathbf{w}\rangle$.

This isomorphism is canonical because it doesn't depend on any choice of basis - it uses only the inner product structure. When we identify $\nabla f(\mathbf{x}) \in \mathbb{R}^n$ as a column vector (primal) rather than a row vector (dual), we are implicitly using this canonical isomorphism via the standard inner product.

For AI: On non-Euclidean spaces (manifolds of probability distributions, manifolds of neural network weights under the Fisher metric), the identification $V \cong V^*$ is NO longer trivial - gradients and velocity vectors live in different spaces. The natural gradient method corrects for this by using the Fisher information matrix $F$ as the metric: $\tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L}$. This maps the gradient (a covector) to a tangent vector using the Riemannian metric instead of the Euclidean metric.


Appendix G: Linear Maps in Practice - Worked Problems

G.1 Verifying Linearity: Systematic Approach

Problem: Is $T: \mathbb{R}^{2\times 2} \to \mathbb{R}$ defined by $T(A) = \operatorname{tr}(A)$ linear?

Check additivity: $T(A + B) = \operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B) = T(A) + T(B)$. OK

Check homogeneity: $T(cA) = \operatorname{tr}(cA) = c\operatorname{tr}(A) = cT(A)$. OK

Conclusion: $T$ is linear. Its matrix (viewing $\mathbb{R}^{2\times 2}$ with basis $\{E_{11}, E_{12}, E_{21}, E_{22}\}$):

$$\operatorname{tr}(E_{11}) = 1, \quad \operatorname{tr}(E_{12}) = 0, \quad \operatorname{tr}(E_{21}) = 0, \quad \operatorname{tr}(E_{22}) = 1$$

So $[T] = (1, 0, 0, 1)$ as a $1 \times 4$ matrix. Kernel = $\{A : \operatorname{tr}(A) = 0\}$ (the trace-zero matrices), of dimension 3.

Problem: Is $T: \mathbb{R}^n \to \mathbb{R}$ defined by $T(\mathbf{x}) = \lVert\mathbf{x}\rVert_2$ linear?

Check homogeneity: $T(c\mathbf{x}) = \lVert c\mathbf{x}\rVert = |c|\, \lVert\mathbf{x}\rVert$. For $c = -1$: $T(-\mathbf{x}) = \lVert\mathbf{x}\rVert = T(\mathbf{x})$, but linearity requires $T(-\mathbf{x}) = -T(\mathbf{x})$.

For $\mathbf{x} \neq \mathbf{0}$: $T(\mathbf{x}) > 0$ but $-T(\mathbf{x}) < 0$. So $T(-\mathbf{x}) \neq -T(\mathbf{x})$.

Conclusion: $T$ is NOT linear (the norm fails homogeneity due to the absolute value).
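
These checks translate into a quick numerical probe (NumPy; random inputs rather than a proof - but a failed probe is a definitive counterexample):

```python
import numpy as np

rng = np.random.default_rng(8)

def probe_linearity(T, shape, trials=100):
    """Probe T(ax + by) == a*T(x) + b*T(y) on random inputs."""
    for _ in range(trials):
        x, y = rng.normal(size=shape), rng.normal(size=shape)
        a, b = rng.normal(), rng.normal()
        if not np.isclose(T(a * x + b * y), a * T(x) + b * T(y)):
            return False
    return True

print(probe_linearity(np.trace, (2, 2)))        # True: trace passes
print(probe_linearity(np.linalg.norm, (5,)))    # False: norm fails
```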

G.2 Finding the Kernel: Four Approaches

For $A = \begin{bmatrix} 1 & -1 & 2 \\ 2 & -2 & 4 \end{bmatrix}$ (the rows are multiples of each other), find $\ker(A)$:

Approach 1: Row reduction. RREF: $\begin{bmatrix} 1 & -1 & 2 \\ 0 & 0 & 0 \end{bmatrix}$. Free variables $x_2 = s$, $x_3 = t$. Then $x_1 = s - 2t$.

$\ker(A) = \operatorname{span}\{(1,1,0)^\top, (-2,0,1)^\top\}$.

Approach 2: Inspection. The columns satisfy $\mathbf{a}_2 = -\mathbf{a}_1$ and $\mathbf{a}_3 = 2\mathbf{a}_1$. From $\mathbf{a}_2 = -\mathbf{a}_1$ we get $\mathbf{a}_1 + \mathbf{a}_2 = \mathbf{0}$, so $(1,1,0)^\top \in \ker(A)$. From $\mathbf{a}_3 = 2\mathbf{a}_1$ we get $2\mathbf{a}_1 + 0\mathbf{a}_2 - \mathbf{a}_3 = \mathbf{0}$, so $(2,0,-1)^\top \in \ker(A)$ - or equivalently $(-2,0,1)^\top$.

Approach 3: SVD. Compute the SVD of $A$; null space vectors are the right singular vectors with zero (or near-zero) singular values.

Approach 4: Random sampling + orthogonalization. Sample many vectors, project out the row space, and keep those with zero image (useful when $A$ is very large).

G.3 Composition of Transforms in a Graphics Pipeline

A 3D object is processed through a graphics pipeline using compositions of affine maps:

  1. Model matrix $M$: transform from object coordinates to world coordinates (rotation, scale, translation).
  2. View matrix $V$: transform from world coordinates to camera coordinates (rotation + translation).
  3. Projection matrix $P$: from camera coordinates to clip coordinates (perspective projection).

The combined transform: $\mathbf{p}_{\text{clip}} = P \cdot V \cdot M \cdot \mathbf{p}_{\text{object}}$ (in homogeneous coordinates).

Composition order matters. Reading right to left: first apply $M$, then $V$, then $P$. The matrix product $PVM$ can be precomputed once per frame (not per vertex), saving $O(|\text{vertices}|)$ matrix multiplications.

This is the same principle as "avoid recomputing shared prefixes" in transformer KV-caching: the KV cache stores the projections $K = XW_K$, $V = XW_V$ for all past tokens, so they don't need to be recomputed when generating each new token.
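
A minimal sketch of the pipeline idea in homogeneous coordinates (NumPy; the specific transforms are arbitrary examples, and the "projection" is an orthographic placeholder rather than a real perspective matrix):

```python
import numpy as np

def translation(tx, ty, tz):
    T = np.eye(4); T[:3, 3] = [tx, ty, tz]; return T

def scale(s):
    S = np.eye(4); S[:3, :3] *= s; return S

def rotation_z(theta):
    R = np.eye(4)
    c, s = np.cos(theta), np.sin(theta)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

M = translation(1, 0, 0) @ rotation_z(np.pi / 4) @ scale(2)   # model
V = translation(0, 0, -5)                                     # view (placeholder)
P = np.eye(4)                                                 # orthographic placeholder

PVM = P @ V @ M                            # precompute once per frame
p_obj = np.array([1.0, 1.0, 0.0, 1.0])     # homogeneous point

# Reading right to left: M applies first, then V, then P.
print(np.allclose(PVM @ p_obj, P @ (V @ (M @ p_obj))))   # True
```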



Appendix H: Linear Maps in Optimization and Training

H.1 The Gradient as a Linear Map

In optimization, we minimize a loss function $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla \mathcal{L}(\mathbf{w})$ tells us the direction of steepest ascent. But more precisely:

The directional derivative of $\mathcal{L}$ at $\mathbf{w}$ in direction $\mathbf{d}$ is:

$$D_{\mathbf{d}} \mathcal{L}(\mathbf{w}) = \lim_{t \to 0} \frac{\mathcal{L}(\mathbf{w} + t\mathbf{d}) - \mathcal{L}(\mathbf{w})}{t} = \nabla \mathcal{L}(\mathbf{w})^\top \mathbf{d}$$

This is a linear functional in $\mathbf{d}$: it is the dual vector $\nabla \mathcal{L}(\mathbf{w})^\top \in (\mathbb{R}^n)^*$.

Gradient descent in its pure form: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \mathcal{L}(\mathbf{w}_t)$. This uses the Euclidean identification of the gradient (a covector) with a primal vector.

Natural gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta F(\mathbf{w}_t)^{-1} \nabla \mathcal{L}(\mathbf{w}_t)$, where $F$ is the Fisher information matrix. This uses the correct metric on the manifold of probability distributions (the Fisher-Rao metric) to convert the covector gradient to a tangent vector.
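
A finite-difference check of the directional-derivative formula (NumPy; the quadratic loss is an arbitrary example with a known gradient):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 6
Q = rng.normal(size=(n, n)); Q = Q @ Q.T + np.eye(n)   # SPD matrix
b = rng.normal(size=n)

loss = lambda w: 0.5 * w @ Q @ w - b @ w   # L(w) = 0.5 w^T Q w - b^T w
grad = lambda w: Q @ w - b                 # its exact gradient

w = rng.normal(size=n)
d = rng.normal(size=n)
t = 1e-6

# Finite-difference directional derivative vs. grad(w)^T d.
fd = (loss(w + t * d) - loss(w)) / t
print(np.isclose(fd, grad(w) @ d, rtol=1e-4, atol=1e-4))   # True
```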

H.2 The Hessian as a Bilinear Map

The Hessian $H\mathcal{L}(\mathbf{w}) \in \mathbb{R}^{n \times n}$ is the matrix of second derivatives:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}$$

But more abstractly, the Hessian is a bilinear form $B: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$:

$$B(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top H \mathbf{v} = \text{the rate of change, in direction } \mathbf{u}\text{, of the directional derivative of } \mathcal{L} \text{ in direction } \mathbf{v}$$

The Hessian determines the curvature of the loss landscape:

  • Positive definite Hessian ($\mathbf{u}^\top H\mathbf{u} > 0$ for all $\mathbf{u} \neq \mathbf{0}$): the point is a local minimum (at a critical point, where the gradient vanishes).
  • Indefinite Hessian (has both positive and negative eigenvalues): the point is a saddle point.
  • The ratio of largest to smallest eigenvalue is the condition number $\kappa(H)$ - it governs how slowly gradient descent converges.

For AI: Modern optimizers (Adam, AdaGrad) approximate curvature-related quantities. Adam's second-moment estimate $\hat{v}_t$ tracks the elementwise squared gradient - a diagonal surrogate for curvature (the diagonal of the empirical Fisher, which in turn approximates the Hessian diagonal near a minimum). Dividing the gradient by $\sqrt{\hat{v}_t}$ is a rough diagonal analogue of Newton's method (which applies $H^{-1}$). This is one reason Adam often converges much faster than SGD on ill-conditioned problems.
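
A demonstration of conditioning (NumPy; a 2-D quadratic example): gradient descent on a well-conditioned quadratic converges quickly, while a large $\kappa(H)$ forces many more iterations at a stable step size.

```python
import numpy as np

def gd_steps_to_converge(H, eta, tol=1e-8, max_iter=100_000):
    """Run gradient descent on L(w) = 0.5 w^T H w from a fixed start."""
    w = np.array([1.0, 1.0])
    for k in range(max_iter):
        w = w - eta * (H @ w)
        if np.linalg.norm(w) < tol:
            return k
    return max_iter

for kappa in (2.0, 100.0):
    H = np.diag([1.0, kappa])        # condition number = kappa
    eta = 1.0 / kappa                # = 1/lambda_max: stable step size;
                                     # the slow mode contracts by (1 - 1/kappa)
    print(kappa, gd_steps_to_converge(H, eta))   # iterations grow with kappa
```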

H.3 Weight Matrices as Linear Maps: Training Dynamics

The neural tangent kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide neural networks and shows that their training dynamics under gradient flow are governed by a linear system:

$$\dot{\mathbf{y}}_t = -\eta \, \Theta(\mathbf{X}, \mathbf{X}) (\mathbf{y}_t - \mathbf{y}^*)$$

where $\Theta(\mathbf{X}, \mathbf{X})$ is the NTK matrix (constant in the infinite-width limit). This is an ODE with a constant linear map $\Theta$ - so its solution is $\mathbf{y}_t = \mathbf{y}^* + e^{-\eta \Theta t}(\mathbf{y}_0 - \mathbf{y}^*)$.

The eigenvalues of $\Theta$ determine which output directions are learned quickly (large eigenvalues -> fast convergence) and which slowly (small eigenvalues -> slow convergence). This is linear algebra - specifically, the spectral decomposition of a positive semidefinite linear map.

H.4 Gradient Flow through Linear Layers

Consider a linear layer $\mathbf{y} = W\mathbf{x}$ with loss $\mathcal{L}$. The gradient with respect to the weight matrix:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \mathbf{x}^\top = \boldsymbol{\delta}\, \mathbf{x}^\top$$

where $\boldsymbol{\delta} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \in \mathbb{R}^m$ is the "error signal" (the upstream gradient). The gradient $\frac{\partial \mathcal{L}}{\partial W}$ is an outer product - a rank-1 matrix.

This means per-sample gradient updates are always rank-1. For a mini-batch of $B$ samples, the gradient is:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{B}\sum_{b=1}^B \boldsymbol{\delta}_b \mathbf{x}_b^\top$$

A sum of $B$ rank-1 matrices - so the gradient has rank at most $B$. For large models with batch size $B \ll n$, the gradient is a very low-rank update to the weight matrix. This low-rank structure of gradients is the empirical justification for gradient low-rank projection methods (GaLore, 2024).
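
A direct check of the rank bound (NumPy; the shapes and data are arbitrary): build a batch gradient as a sum of outer products and confirm its rank never exceeds the batch size.

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, B = 64, 128, 4   # output dim, input dim, batch size

deltas = rng.normal(size=(B, m))   # upstream gradients, one per sample
xs = rng.normal(size=(B, n))       # inputs, one per sample

# Batch gradient: average of B rank-1 outer products delta_b x_b^T.
grad_W = sum(np.outer(d, x) for d, x in zip(deltas, xs)) / B

print(grad_W.shape)                    # (64, 128)
print(np.linalg.matrix_rank(grad_W))   # 4 = B << min(m, n)
```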


Appendix I: Reference Tables

I.1 Linear Map Properties at a Glance

| Property | Condition | Matrix Equivalent | Geometric Meaning |
|---|---|---|---|
| Linear | $T(a\mathbf{u}+b\mathbf{v}) = aT\mathbf{u}+bT\mathbf{v}$ | Any $m \times n$ matrix | Preserves addition and scaling |
| Injective | $\ker(T) = \{\mathbf{0}\}$ | Full column rank | No two inputs map to the same output |
| Surjective | $\operatorname{im}(T) = W$ | Full row rank | Every output is reachable |
| Bijective (isomorphism) | Both injective and surjective | Square, full rank, invertible | Perfect correspondence |
| Orthogonal | $T^\top T = I$ | $A^\top A = I$, orthonormal columns | Preserves lengths and angles |
| Unitary | $T^* T = I$ (complex) | $A^* A = I$, orthonormal columns | Complex analogue of orthogonal |
| Projection | $T^2 = T$ | $A^2 = A$ | Idempotent: applying twice = once |
| Symmetric | $T = T^\top$ | $A = A^\top$ | Self-adjoint; diagonalizable by the spectral theorem |
| Positive definite | $\langle T\mathbf{v}, \mathbf{v}\rangle > 0$ | $\mathbf{x}^\top A\mathbf{x} > 0$ for $\mathbf{x} \neq \mathbf{0}$ | All eigenvalues positive; curvature at a minimum |
| Normal | $T T^\top = T^\top T$ | $A A^\top = A^\top A$ | Diagonalizable by a unitary matrix |
| Nilpotent | $T^k = 0$ for some $k$ | $A^k = 0$ | Powers eventually vanish; all eigenvalues 0 |
| Involution | $T^2 = I$ | $A^2 = I$ | Self-inverse (like Householder reflections) |

I.2 Rank and Dimension Formulas

| Formula | Statement |
|---|---|
| $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$ | Rank-nullity theorem for $T: V \to W$ |
| $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$ | Row rank equals column rank |
| $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$ | Rank cannot increase under composition |
| $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$ | Subadditivity of rank |
| $\dim(S + T) = \dim(S) + \dim(T) - \dim(S \cap T)$ | Inclusion-exclusion for subspaces |
| $\dim(V/W) = \dim(V) - \dim(W)$ (for a subspace $W$) | Dimension of a quotient space |
| $\operatorname{rank}(A^\top A) = \operatorname{rank}(A)$ | The Gram matrix has the same rank |
| $\operatorname{rank}(P) = \operatorname{tr}(P)$ (for a projection $P$) | Rank = trace for idempotent matrices |

I.3 AI Applications Cross-Reference

| Linear Map Concept | Where It Appears in AI | Mathematical Role |
|---|---|---|
| $W\mathbf{x} + \mathbf{b}$ (affine map) | Every neural layer | Pre-activation computation |
| $Q = XW_Q$ (linear projection) | Attention mechanism | Projects to the query subspace |
| $\Delta W = BA$ (low-rank) | LoRA fine-tuning | Rank-$r$ weight update |
| $J_f = \frac{\partial f}{\partial \mathbf{x}}$ (Jacobian) | Backpropagation | Chain rule at each layer |
| $W^\top \boldsymbol{\delta}$ (transpose map) | Backward pass | Dual map of the forward pass |
| $P = QQ^\top$ (projection) | Layer norm, attention | Projects onto a subspace |
| $W_U \mathbf{h}$ (linear map) | Unembedding (logit computation) | Representation to vocabulary |
| $e^{i\theta} \cdot \mathbf{q}$ (rotation in $\mathbb{C}$) | RoPE positional encoding | Positional rotation |
| $F^{-1}\nabla\mathcal{L}$ (metric-adjusted gradient) | Natural gradient / Adam | Riemannian gradient |
| $\Theta = J^\top J$ (Gram matrix of the Jacobian) | Neural tangent kernel | Training dynamics |


Appendix J: Proofs of Key Results

J.1 Proof: $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$

This is a fundamental result that deserves a careful proof.

Theorem. For any matrix $A \in \mathbb{R}^{m \times n}$, the column rank (the dimension of the column space) equals the row rank (the dimension of the row space).

Proof (via RREF). Let $A$ have RREF $R$ (obtained by row operations, which don't change the row space but can change the column space). In $R$:

  • The nonzero rows are linearly independent (each has a leading 1 not shared by any other row).
  • The number of nonzero rows = number of pivot columns = rank.

So row rank = column rank = number of pivots in the RREF. $\square$

Alternative proof (via SVD). The SVD $A = U\Sigma V^\top$ has $\operatorname{rank}(A)$ nonzero singular values. The column space of $A$ is spanned by $\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ (the first $r$ left singular vectors). The row space of $A$ (= the column space of $A^\top$) is spanned by $\{\mathbf{v}_1, \ldots, \mathbf{v}_r\}$ (the first $r$ right singular vectors). Both have dimension $r$ = the number of nonzero singular values. $\square$

J.2 Proof: Kernel and Image are Subspaces

Theorem. For any linear map $T: V \to W$, both $\ker(T)$ and $\operatorname{im}(T)$ are subspaces (of $V$ and $W$ respectively).

Proof for $\ker(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W$, so $\mathbf{0}_V \in \ker(T)$.
  2. Closed under addition: Let $\mathbf{u}, \mathbf{v} \in \ker(T)$. Then $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}) = \mathbf{0} + \mathbf{0} = \mathbf{0}$, so $\mathbf{u} + \mathbf{v} \in \ker(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{v} \in \ker(T)$, $c \in \mathbb{F}$. Then $T(c\mathbf{v}) = cT(\mathbf{v}) = c\mathbf{0} = \mathbf{0}$, so $c\mathbf{v} \in \ker(T)$. $\square$

Proof for $\operatorname{im}(T)$:

  1. Non-empty: $T(\mathbf{0}_V) = \mathbf{0}_W \in \operatorname{im}(T)$.
  2. Closed under addition: Let $\mathbf{w}_1, \mathbf{w}_2 \in \operatorname{im}(T)$, so $\mathbf{w}_i = T(\mathbf{v}_i)$ for some $\mathbf{v}_i \in V$. Then $\mathbf{w}_1 + \mathbf{w}_2 = T(\mathbf{v}_1) + T(\mathbf{v}_2) = T(\mathbf{v}_1 + \mathbf{v}_2) \in \operatorname{im}(T)$.
  3. Closed under scalar multiplication: Let $\mathbf{w} = T(\mathbf{v}) \in \operatorname{im}(T)$, $c \in \mathbb{F}$. Then $c\mathbf{w} = cT(\mathbf{v}) = T(c\mathbf{v}) \in \operatorname{im}(T)$. $\square$

J.3 Proof: Composition of Linear Maps is Linear

Theorem. If $S: V \to W$ and $T: U \to V$ are linear, then $S \circ T: U \to W$ is linear.

Proof:

$$(S \circ T)(a\mathbf{u}_1 + b\mathbf{u}_2) = S(T(a\mathbf{u}_1 + b\mathbf{u}_2))$$
$$= S(aT(\mathbf{u}_1) + bT(\mathbf{u}_2)) \quad \text{($T$ is linear)}$$
$$= aS(T(\mathbf{u}_1)) + bS(T(\mathbf{u}_2)) \quad \text{($S$ is linear)}$$
$$= a(S \circ T)(\mathbf{u}_1) + b(S \circ T)(\mathbf{u}_2) \quad \square$$

J.4 Proof: The Dual Map is Linear

Theorem. If $T: V \to W$ is linear, then $T^\top: W^* \to V^*$ defined by $(T^\top f)(\mathbf{v}) = f(T(\mathbf{v}))$ is linear.

Proof:

  • $(T^\top(af + bg))(\mathbf{v}) = (af + bg)(T(\mathbf{v})) = af(T(\mathbf{v})) + bg(T(\mathbf{v}))$
  • $= a(T^\top f)(\mathbf{v}) + b(T^\top g)(\mathbf{v}) = (aT^\top f + bT^\top g)(\mathbf{v})$

So $T^\top(af + bg) = aT^\top f + bT^\top g$. $\square$

J.5 Proof: Invertibility Criterion for Finite-Dimensional Spaces

Theorem. Let $T: V \to V$ be a linear map on a finite-dimensional space $V$. Then the following are equivalent:

  1. $T$ is injective (one-to-one)
  2. $T$ is surjective (onto)
  3. $T$ is bijective (invertible)
  4. $\ker(T) = \{\mathbf{0}\}$
  5. $\operatorname{rank}(T) = \dim(V)$

Proof: $(1) \Leftrightarrow (4)$: $T$ is injective iff $\ker(T) = \{\mathbf{0}\}$ (standard).

$(4) \Leftrightarrow (5)$: By rank-nullity: $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V)$. Nullity = 0 iff rank = $\dim(V)$.

$(5) \Leftrightarrow (2)$: Rank = $\dim(\operatorname{im}(T))$. Since $\operatorname{im}(T) \subseteq V$ and both are finite-dimensional, $\dim(\operatorname{im}(T)) = \dim(V)$ iff $\operatorname{im}(T) = V$ (a subspace of equal dimension must be the whole space).

$(3) \Leftrightarrow (1)\ \&\ (2)$: by the definition of bijective. $\square$

Important: This equivalence only holds for maps $T: V \to V$ with the same domain and codomain. For $T: \mathbb{R}^m \to \mathbb{R}^n$ with $m \neq n$, injective and surjective are NOT equivalent: one of them is impossible given the dimension constraint (injectivity fails whenever $m > n$; surjectivity fails whenever $m < n$).



Appendix K: Additional AI Case Studies

K.1 Mechanistic Interpretability via Linear Maps

Mechanistic interpretability (MI) aims to reverse-engineer neural networks by understanding what computation each component performs. Linear map theory is central to this enterprise.

The residual stream as a communication bus. In transformer models, each layer reads from and writes to a shared "residual stream" $\mathbf{x} \in \mathbb{R}^d$. Each attention head and MLP layer contributes an additive update:

$$\mathbf{x}_{\ell+1} = \mathbf{x}_\ell + \underbrace{\text{Attn}_\ell(\mathbf{x}_\ell)}_{\text{attn. update}} + \underbrace{\text{MLP}_\ell(\mathbf{x}_\ell)}_{\text{MLP update}}$$

Each update is (approximately) a low-rank linear map from the residual stream back to itself. The attention update's linear part is $W_O W_V$ (the "OV circuit"); the MLP's linear part is $W_{\text{out}} W_{\text{in}}$ after linearizing the activation.

SVD of the OV circuit. The matrix $W_O W_V \in \mathbb{R}^{d \times d}$ can be analyzed via SVD. Its singular values reveal how strongly the attention head modifies the residual stream, and its singular vectors reveal which directions it reads from and writes to. Heads with near-zero singular values are "inattentive" - they barely modify the residual stream regardless of the attention pattern.

Subspace decomposition. The full set of $L \times H$ attention heads (for $L$ layers, $H$ heads per layer) collectively forms a large linear map from the input to the residual stream updates. The total update is a sum of $LH$ low-rank linear maps. Understanding the structure of this sum - which heads are redundant, which are essential - is a central goal of circuit-level MI.
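
A sketch of the OV-circuit analysis (NumPy; random weights stand in for a trained head, and the dimensions d_model and d_head are assumed hyperparameters): the rank of $W_O W_V$ is capped by the head dimension, and the singular values quantify the head's write strength.

```python
import numpy as np

rng = np.random.default_rng(11)
d_model, d_head = 256, 32

# Per-head value and output projections (placeholders for trained weights).
W_V = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_model, d_head)) / np.sqrt(d_head)

OV = W_O @ W_V                      # (d_model, d_model) map on the residual stream
s = np.linalg.svd(OV, compute_uv=False)

print(np.linalg.matrix_rank(OV))    # 32: rank capped by d_head
print(s[0], s[d_head - 1])          # the head's nonzero spectrum
print(s[d_head])                    # ~0: nothing beyond the head's rank
```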

K.2 Linear Algebra of Diffusion Models

Diffusion models (DDPM, score matching) add Gaussian noise to data and learn to denoise. The forward process is an affine map:

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\, \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

This is an affine interpolation between the data $\mathbf{x}_0$ and pure noise. The coefficient $\sqrt{\bar\alpha_t}$ scales the data, and $\sqrt{1-\bar\alpha_t}$ scales the noise.

The denoising objective. The neural network $\epsilon_\theta(\mathbf{x}_t, t)$ estimates $\boldsymbol{\varepsilon}$ (the noise) from the noisy input. Inverting the forward affine map with this estimate gives the optimal estimator of the clean data (the Tweedie/DDPM estimator):

$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar\alpha_t}}$$

which is a linear function of $\mathbf{x}_t$ and $\epsilon_\theta(\mathbf{x}_t, t)$. The diffusion process itself is a composition of affine maps in the forward direction, and the learned reverse process approximately inverts these affine maps.
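
A sanity check of this affine structure (NumPy; alpha_bar is an arbitrary point on the noise schedule, and a perfect noise oracle stands in for the trained network): with the true $\boldsymbol{\varepsilon}$, the estimator recovers $\mathbf{x}_0$ exactly.

```python
import numpy as np

rng = np.random.default_rng(12)
d = 10
x0 = rng.normal(size=d)      # "data" point
alpha_bar = 0.37             # arbitrary point on the noise schedule

eps = rng.normal(size=d)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps  # forward affine map

# Oracle denoiser: returns the exact noise (a trained eps_theta approximates this).
eps_hat = eps
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

print(np.allclose(x0_hat, x0))   # True: the affine map is exactly inverted
```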

K.3 State Space Models as Linear Dynamical Systems

Structured State Space Models (S4, Mamba, RWKV) compute their state updates via linear recurrences:

$$\mathbf{h}_{t+1} = A\mathbf{h}_t + B\mathbf{x}_t, \qquad \mathbf{y}_t = C\mathbf{h}_t + D\mathbf{x}_t$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times d}$, $C \in \mathbb{R}^{d \times N}$, $D \in \mathbb{R}^{d \times d}$ are (possibly input-dependent) matrices.

The state transition $\mathbf{h} \mapsto A\mathbf{h} + B\mathbf{x}$ is a linear dynamical system - the fundamental object of study in control theory and signal processing.

Key linear algebra results for SSMs:

  1. Eigenvalues of $A$ determine memory. If $|\lambda_i(A)| < 1$ for all $i$, the system has bounded memory decay. If any $|\lambda_i| > 1$, the state can grow unboundedly.

  2. Diagonalization for efficiency. If $A = P\Lambda P^{-1}$, the recurrence decouples into $N$ independent scalar recurrences - each computable independently. S4 uses structured $A$ (diagonal plus low-rank, the DPLR parameterization) so the whole output sequence can be computed in parallel as a convolution.

  3. The convolution view. Unrolling the recurrence: $\mathbf{y}_t = \sum_{\tau=0}^t C A^{t-\tau} B \mathbf{x}_\tau + D\mathbf{x}_t$. The impulse response $C A^k B$ is a sequence of matrix powers - analyzable by the spectral decomposition of $A$. (A numerical check follows this list.)

  4. Mamba's selectivity. Mamba makes $A, B, C$ input-dependent: $A(\mathbf{x}), B(\mathbf{x}), C(\mathbf{x})$. The recurrence becomes bilinear in $(\mathbf{h}, \mathbf{x})$, not purely linear. Linearizing around typical inputs gives a locally linear system analyzable by the tools of this section.
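
A minimal sketch of the recurrence/convolution equivalence (NumPy; the system matrices are small random examples, rescaled to spectral radius below 1 for stability): unrolling the recurrence matches the impulse-response sum from item 3.

```python
import numpy as np

rng = np.random.default_rng(13)
N, d, T = 4, 2, 20

A = rng.normal(size=(N, N))
A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()   # spectral radius 0.9: stable
B = rng.normal(size=(N, d))
C = rng.normal(size=(d, N))
D = rng.normal(size=(d, d))
x = rng.normal(size=(T, d))

# Recurrent view, with indexing matching the unrolled formula above:
# h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t, h_{-1} = 0.
h = np.zeros(N)
y_rec = []
for t in range(T):
    h = A @ h + B @ x[t]
    y_rec.append(C @ h + D @ x[t])
y_rec = np.array(y_rec)

# Convolution view: y_t = sum_tau C A^(t-tau) B x_tau + D x_t.
y_conv = np.array([
    sum(C @ np.linalg.matrix_power(A, t - tau) @ B @ x[tau]
        for tau in range(t + 1)) + D @ x[t]
    for t in range(T)
])

print(np.allclose(y_rec, y_conv))   # True
```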


Appendix L: Further Reading

Primary References

  1. Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer. - The definitive abstract treatment of linear maps. Goes from axioms to spectral theory without matrices until chapter 10. Highly recommended for conceptual depth.

  2. Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. - Computational and applied focus. Excellent for four fundamental subspaces and applications.

  3. Horn, R. & Johnson, C. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. - Comprehensive advanced reference. Proofs of all major results, including Cayley-Hamilton, spectral theorems, singular values.

  4. Trefethen, L. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Gold standard for computational linear algebra and stability.

AI-Focused References

  1. Vaswani, A. et al. (2017). "Attention is All You Need." NeurIPS. - The original transformer paper; read the attention mechanism as linear projections.

  2. Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. - LoRA rank-nullity argument.

  3. Elhage, N. et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. - OV and QK circuits as linear maps; residual stream as communication bus.

  4. Park, K. et al. (2023). "The Linear Representation Hypothesis and the Geometry of Large Language Models." - Linear features in transformer representations.

  5. Jacot, A. et al. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. - Training dynamics via linear maps (NTK theory).

  6. Gu, A. et al. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. - SSMs as linear dynamical systems.


This section is part of the Math for LLMs curriculum - a systematic treatment of the mathematics underlying modern large language models.


Appendix M: Self-Assessment Checklist

After completing this section, you should be able to answer the following questions without notes.

Conceptual Understanding

  • Q1. State the two axioms of a linear map. What is the fastest way to show a map is NOT linear?

  • Q2. What is the kernel of a linear map? Prove it is a subspace.

  • Q3. State the rank-nullity theorem. Give an example where rank = 2, nullity = 3. What can you say about the domain and codomain dimensions?

  • Q4. Why is a linear map from $V$ to $V$ injective if and only if it is surjective? Why does this fail for maps between spaces of different dimensions?

  • Q5. What is the change-of-basis formula? If $A = P^{-1}BP$, what relationship does that establish between the maps represented by $A$ and $B$?

  • Q6. What is an orthogonal projection? How do you verify that a matrix $P$ is a projection? What extra property makes it an orthogonal projection?

  • Q7. What is the Jacobian matrix? For $f: \mathbb{R}^3 \to \mathbb{R}^2$, what is the shape of $J_f$?

  • Q8. In the backward pass of backpropagation, why do we multiply by $W^\top$ rather than $W$?

  • Q9. What makes an affine map different from a linear map? How do you represent an affine map as a linear map (in one higher dimension)?

Computational Skills

  • C1. Given a matrix $A$, find a basis for $\ker(A)$ using row reduction.

  • C2. Given a linear map $T$ defined on a non-standard basis, write its matrix in that basis.

  • C3. Given two bases $\mathcal{B}$ and $\mathcal{B}'$, compute the change-of-basis matrix $P$ and use it to transform the matrix of $T$.

  • C4. For a rank-$r$ update $\Delta W = BA$, compute the null space dimension and verify it numerically.

  • C5. Compute the Jacobian of a given vector-valued function (e.g., softmax, elementwise ReLU, an affine map composed with sigmoid).

AI Connections

  • AI1. Explain why LoRA (low-rank adaptation) is more parameter-efficient than full fine-tuning, using the language of rank and nullity.

  • AI2. Describe the forward pass of a multi-head attention layer as a sequence of linear maps. Which operations are linear, which are bilinear, and which are nonlinear?

  • AI3. What is the linear representation hypothesis? Why does it matter for interpretability research?

  • AI4. Why does a purely linear deep network (no activations) collapse to a single linear map, regardless of depth?

  • AI5. How does the dual map relate to backpropagation? What mathematical object is the "gradient" in the strict sense?

