
Linear Transformations: Part 7: Affine Transformations to Appendix B: Abstract Linear Algebra Perspective

7. Affine Transformations

7.1 Beyond Linearity: Affine Maps

A linear map fixes the origin: $T(\mathbf{0}) = \mathbf{0}$. But many practical transformations in geometry and ML need to shift the origin - they include a translation component.

Definition. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is an affine transformation if it has the form:

$$f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$$

where $A \in \mathbb{R}^{m \times n}$ is a matrix (the linear part) and $\mathbf{b} \in \mathbb{R}^m$ is a vector (the translation).

Affine maps are linear maps composed with a translation. They are NOT linear (unless $\mathbf{b} = \mathbf{0}$) - they fail the zero test: $f(\mathbf{0}) = \mathbf{b} \neq \mathbf{0}$ when $\mathbf{b} \neq \mathbf{0}$.

Affine subspaces. The image of $\mathbb{R}^n$ under an affine map is an affine subspace - a translate of a linear subspace. Solutions to $A\mathbf{x} = \mathbf{b}$ (when they exist) form an affine subspace: $\mathbf{x}_0 + \ker(A)$.

Composition of affine maps. If $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ and $g(\mathbf{y}) = C\mathbf{y} + \mathbf{d}$, then:

$$(g \circ f)(\mathbf{x}) = g(A\mathbf{x} + \mathbf{b}) = C(A\mathbf{x} + \mathbf{b}) + \mathbf{d} = CA\mathbf{x} + (C\mathbf{b} + \mathbf{d})$$

The composition is affine: linear part $CA$, translation $C\mathbf{b} + \mathbf{d}$.

7.2 Homogeneous Coordinates

The elegant trick to make affine maps linear is to lift to one higher dimension.

Definition. The homogeneous coordinates of $\mathbf{x} \in \mathbb{R}^n$ are $\tilde{\mathbf{x}} = \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \in \mathbb{R}^{n+1}$.

The augmented matrix. The affine map $f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$ becomes:

$$\tilde{f}(\tilde{\mathbf{x}}) = \begin{bmatrix} A & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = \begin{pmatrix} A\mathbf{x} + \mathbf{b} \\ 1 \end{pmatrix}$$

Now $\tilde{f}$ is a linear map on $\mathbb{R}^{n+1}$. The last coordinate is always 1 for "proper" points.

Key benefit: composition becomes matrix multiplication. Two affine maps $\tilde{M}_1, \tilde{M}_2$ compose as:

$$\tilde{M}_2 \tilde{M}_1 = \begin{bmatrix} A_2 & \mathbf{b}_2 \\ \mathbf{0}^\top & 1 \end{bmatrix} \begin{bmatrix} A_1 & \mathbf{b}_1 \\ \mathbf{0}^\top & 1 \end{bmatrix} = \begin{bmatrix} A_2 A_1 & A_2\mathbf{b}_1 + \mathbf{b}_2 \\ \mathbf{0}^\top & 1 \end{bmatrix}$$

No special cases needed - composition is just matrix multiplication.

Inverse of an affine map:

$$\begin{bmatrix} A & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & -A^{-1}\mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix}$$

(Valid when $A$ is invertible.)

In computer graphics: Every rigid body transformation (rotation + translation) is an affine map, and sequences of transformations are composed by matrix multiplication in homogeneous coordinates. This is the foundation of 3D graphics pipelines (OpenGL, Vulkan).
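To make the block formulas concrete, here is a minimal NumPy sketch (the augment helper and the specific rotation, scale, and translation values are illustrative choices, not from the lesson) that composes two affine maps by multiplying their augmented matrices and checks the block formula for the inverse:

```python
import numpy as np

def augment(A, b):
    """Build the (n+1)x(n+1) homogeneous-coordinate matrix [[A, b], [0, 1]]."""
    n = A.shape[0]
    M = np.eye(n + 1)
    M[:n, :n] = A
    M[:n, n] = b
    return M

# Two illustrative affine maps in R^2 (values chosen arbitrarily):
theta = np.radians(30)
A1 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])   # f1: rotate by 30 degrees ...
b1 = np.array([1.0, 0.0])                          # ... then translate by (1, 0)
A2 = 2.0 * np.eye(2)                               # f2: scale by 2 ...
b2 = np.array([0.0, 1.0])                          # ... then translate by (0, 1)

M1, M2 = augment(A1, b1), augment(A2, b2)

# Composition g(f(x)) is just the matrix product M2 @ M1.
M = M2 @ M1
x = np.array([1.0, 2.0, 1.0])                      # homogeneous coordinates of (1, 2)
direct = A2 @ (A1 @ x[:2] + b1) + b2               # apply f1 then f2 explicitly
assert np.allclose((M @ x)[:2], direct)

# Block formula for the inverse: [[A^{-1}, -A^{-1} b], [0, 1]].
M1_inv = augment(np.linalg.inv(A1), -np.linalg.inv(A1) @ b1)
assert np.allclose(M1_inv, np.linalg.inv(M1))
print("composition and inverse block formulas verified")
```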

7.3 Neural Network Layers as Affine Maps

A single fully-connected layer computes:

$$\mathbf{z} = W\mathbf{x} + \mathbf{b}, \quad \mathbf{a} = \sigma(\mathbf{z})$$

The pre-activation $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ is an affine map from $\mathbb{R}^{d_{\text{in}}}$ to $\mathbb{R}^{d_{\text{out}}}$. In homogeneous coordinates:

$$\tilde{\mathbf{z}} = \begin{bmatrix} W & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{bmatrix} \tilde{\mathbf{x}}$$

Why bias matters. Without bias ($\mathbf{b} = \mathbf{0}$), each layer is linear and the hyperplanes separating classes must pass through the origin. With bias, the decision boundary can be placed anywhere. This is the geometric reason bias terms dramatically increase expressive power.

Multi-layer without activations. A network $\mathbf{z}_L = W_L(\cdots W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 \cdots) + \mathbf{b}_L$ is still affine:

$$\mathbf{z}_L = W_{\text{eff}}\mathbf{x} + \mathbf{b}_{\text{eff}}$$

Multiple affine layers compose to a single affine layer. Depth without nonlinearity gives no expressive benefit.
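A small sketch of the collapse argument, using NumPy and arbitrary layer widths (the sizes and random weights are assumptions for illustration only): fold a stack of bias-carrying affine layers into a single effective weight and bias, then check it against a layer-by-layer forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 8, 8, 3]                       # arbitrary layer widths, for illustration
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]

# Fold the stack into one affine map: z_L = W_eff x + b_eff
W_eff, b_eff = np.eye(dims[0]), np.zeros(dims[0])
for W, b in zip(Ws, bs):
    W_eff, b_eff = W @ W_eff, W @ b_eff + b

x = rng.normal(size=dims[0])
z = x
for W, b in zip(Ws, bs):                  # layer-by-layer forward pass, no activations
    z = W @ z + b

assert np.allclose(z, W_eff @ x + b_eff)  # identical to the single collapsed layer
```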

BatchNorm as affine rescaling. After normalizing to zero mean and unit variance, BatchNorm applies a learned affine transformation $\gamma \mathbf{z} + \beta$ (elementwise). This is an affine map on the normalized activations, restoring the capacity to represent any desired scale and shift.

Embedding layers. An embedding layer $E \in \mathbb{R}^{|V| \times d}$ maps token indices to vectors. Selecting the $i$-th embedding is equivalent to multiplying by the one-hot vector $\mathbf{e}_i$: $\mathbf{e}_i^\top E = E_{i,:}$ (the $i$-th row). This is linear in the one-hot representation. The unembedding $W_U \in \mathbb{R}^{d \times |V|}$ maps representations to logits: $\mathbf{l} = W_U^\top \mathbf{h}$, a pure linear map.


8. The Jacobian as Linear Approximation

8.1 Linearization of Nonlinear Maps

Every differentiable function is "locally linear" - at any point, it looks like a linear map to first order. This principle underlies calculus, optimization, and all of numerical analysis.

Definition (Total Derivative). A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $\mathbf{x}$ if there exists a linear map $Df_{\mathbf{x}}: \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{\lVert\mathbf{h}\rVert \to 0} \frac{\lVert f(\mathbf{x} + \mathbf{h}) - f(\mathbf{x}) - Df_{\mathbf{x}}(\mathbf{h})\rVert}{\lVert\mathbf{h}\rVert} = 0$$

The linear map $Df_{\mathbf{x}}$ is the total derivative or Fréchet derivative of $f$ at $\mathbf{x}$. The first-order approximation is:

$$f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + Df_{\mathbf{x}}(\mathbf{h})$$

For small $\mathbf{h}$, the function looks like an affine map: the constant $f(\mathbf{x})$ plus the linear correction $Df_{\mathbf{x}}(\mathbf{h})$.

Geometric meaning. At each point $\mathbf{x}$, the total derivative $Df_{\mathbf{x}}$ is the best linear approximation to $f$ near $\mathbf{x}$. It maps directions (tangent vectors at $\mathbf{x}$) to directions (tangent vectors at $f(\mathbf{x})$). This is the pushforward of vectors.
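The definition can be checked numerically. The sketch below uses a toy nonlinear map and a finite-difference Jacobian (both are illustrative assumptions, not part of the lesson) and watches the error of the first-order approximation shrink faster than the size of the perturbation:

```python
import numpy as np

def f(x):
    # A toy nonlinear map R^2 -> R^2, used only for illustration.
    return np.array([np.sin(x[0]) * x[1], x[0] ** 2 + np.exp(x[1])])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian: column j is (f(x + eps e_j) - f(x)) / eps."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([0.7, -0.3])
J = jacobian_fd(f, x)
for scale in [1e-1, 1e-2, 1e-3]:
    h = scale * np.array([1.0, 2.0])
    err = np.linalg.norm(f(x + h) - (f(x) + J @ h))
    print(f"|h| = {np.linalg.norm(h):.1e}   first-order error = {err:.2e}")
# The error shrinks roughly like |h|^2: near x, f behaves like an affine map.
```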

8.2 The Jacobian Matrix

Definition. The matrix representation of $Df_{\mathbf{x}}$ in standard coordinates is the Jacobian matrix:

$$J_f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}$$

Entry $(i, j)$: $J_{ij} = \frac{\partial f_i}{\partial x_j}$ - how the $i$-th output changes with the $j$-th input.

Shape mnemonic: output dimension $\times$ input dimension, i.e. $(m \times n)$ for $f: \mathbb{R}^n \to \mathbb{R}^m$.

Special cases:

  • $f: \mathbb{R}^n \to \mathbb{R}$ (scalar function): the Jacobian is the row gradient $\nabla f^\top \in \mathbb{R}^{1 \times n}$.
  • $f: \mathbb{R} \to \mathbb{R}^m$ (curve): the Jacobian is the column tangent vector $f'(t) \in \mathbb{R}^m$.
  • $f: \mathbb{R}^n \to \mathbb{R}^n$ (same dimension): $J_f$ is square; $\det(J_f)$ is the local volume scaling factor.

The Jacobian of softmax. For $\mathbf{s} = \operatorname{softmax}(\mathbf{z})$ where $s_i = e^{z_i} / \sum_k e^{z_k}$:

$$J_{\operatorname{softmax}}(\mathbf{z}) = \operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$$

This is a rank-deficient matrix (rank $n-1$) because softmax outputs sum to 1: the Jacobian has the constant vector $\mathbf{1}$ in its null space.
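A quick numerical check of the formula (the input z is an arbitrary example): build diag(s) - s s^T, verify that the all-ones vector lies in its null space, and confirm the rank is n - 1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift by max(z) for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5, -1.0])    # arbitrary example logits
s = softmax(z)
J = np.diag(s) - np.outer(s, s)        # Jacobian of softmax at z

print(J @ np.ones_like(z))             # ~0: shifting every logit equally changes nothing
print(np.linalg.matrix_rank(J))        # n - 1 = 3: rank deficient by exactly one
```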

The Jacobian of ReLU. For $\mathbf{a} = \operatorname{ReLU}(\mathbf{z})$ (elementwise):

$$J_{\operatorname{ReLU}}(\mathbf{z}) = \operatorname{diag}(\mathbf{1}[\mathbf{z} > 0])$$

A diagonal matrix with 1s where $z_i > 0$ and 0s elsewhere. ReLU's Jacobian is a projection - it zeroes out the gradient for dead neurons.

The Jacobian of layer normalization. This one is more involved: layer norm applies mean subtraction, variance normalization, and an affine transformation. Its Jacobian is a projection onto a specific hyperplane (orthogonal to the constant vector), scaled and shifted.

8.3 Chain Rule = Composition of Jacobians

The chain rule for vector-valued functions states: if $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$, then:

$$J_{g \circ f}(\mathbf{x}) = J_g(f(\mathbf{x})) \cdot J_f(\mathbf{x})$$

This is matrix multiplication of Jacobians. The composition of two differentiable maps has a Jacobian equal to the matrix product of their individual Jacobians (evaluated at the appropriate points).

Backpropagation is reverse-mode Jacobian accumulation. For a network $\mathbf{y} = f_L \circ \cdots \circ f_1(\mathbf{x})$ with loss $\ell = L(\mathbf{y})$:

$$\frac{\partial \ell}{\partial \mathbf{x}} = J_{f_1}^\top \cdots J_{f_L}^\top \nabla_{\mathbf{y}} L$$

Reading right to left: start with $\nabla_{\mathbf{y}} L$, multiply by $J_{f_L}^\top$, then $J_{f_{L-1}}^\top$, and so on. At each step, we multiply by the transpose Jacobian of a layer - which is the dual map of that layer's linear approximation.

Computational graph view. Each node in the computation graph stores its local Jacobian. Forward pass evaluates the functions; backward pass multiplies the transpose Jacobians (right to left). This is the mathematical content of autograd.

BACKPROPAGATION AS JACOBIAN CHAIN
========================================================================

  Forward:  x --J_1--> z_1 --J_2--> z_2 -- ... --J_L--> y ---> ell

  Backward: d(ell)/dx <--J_1^T-- d(ell)/dz_1 <--J_2^T-- d(ell)/dz_2 <-- ... <--J_L^T-- grad_y L

  Each backward step:  (grad at input) = J^T x (grad at output)
                                       = (dual map) applied to incoming gradient

========================================================================
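A minimal sketch of this accumulation, assuming a toy two-layer network with tanh and a squared-error loss (the architecture and random weights are illustrative assumptions): the backward pass is written literally as a chain of transpose-Jacobian products and validated against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))   # toy weights (random stand-ins)
y_target = rng.normal(size=2)

def forward(x):
    z1 = W1 @ x
    a1 = np.tanh(z1)                    # layer 1: affine + tanh
    y = W2 @ a1                         # layer 2: affine
    loss = 0.5 * np.sum((y - y_target) ** 2)
    return z1, y, loss

x = rng.normal(size=3)
z1, y, loss = forward(x)

# Backward pass: multiply transpose Jacobians from the loss back to the input.
g = y - y_target                        # dL/dy
g = W2.T @ g                            # through layer 2: Jacobian is W2, so apply W2^T
g = (1 - np.tanh(z1) ** 2) * g          # through tanh: diagonal Jacobian
grad_x = W1.T @ g                       # through layer 1: apply W1^T

# Finite-difference check of dL/dx.
eps, fd = 1e-6, np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    fd[i] = (forward(x + e)[-1] - forward(x - e)[-1]) / (2 * eps)
assert np.allclose(grad_x, fd, atol=1e-5)
```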

Why Jacobians matter for training dynamics:

  • Large Jacobian singular values -> exploding gradients
  • Small Jacobian singular values -> vanishing gradients
  • The spectral norm of $J_f$ measures how much the linear approximation of $f$ can amplify inputs

For AI: Gradient clipping, careful weight initialization (Xavier, He), and residual connections all address the Jacobian conditioning problem. Residual connections add an identity Jacobian contribution: $J_{\mathbf{x} + F(\mathbf{x})} = I + J_F$, which prevents the singular values from collapsing to zero (vanishing gradients).


9. Applications in Machine Learning

9.1 Attention as Linear Projections

The scaled dot-product attention mechanism (Vaswani et al., 2017) is built entirely from linear transformations applied to a sequence of token representations.

Setup. Given an input sequence $X \in \mathbb{R}^{L \times d}$ ($L$ tokens, $d$-dimensional residual stream), three learned projection matrices $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ define:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Each is a linear transformation: $W_Q$ projects each token into "query space", $W_K$ into "key space", $W_V$ into "value space".

Attention scores. The attention pattern $\alpha = \operatorname{softmax}(QK^\top / \sqrt{d_k})$ is a soft selection matrix. For a fixed query vector $\mathbf{q}$, the attention scores $\mathbf{q}^\top K^\top$ are dot products in key space - a linear functional applied to each key.

Output. $\operatorname{Attn}(Q, K, V) = \alpha V$. The output is a weighted linear combination of value vectors - a linear operation parameterized by $\alpha$. The output projection $W_O \in \mathbb{R}^{d_v \times d}$ maps back to the residual stream: another linear map.

Multi-head attention. Each head $h$ applies its own projection matrices $(W_Q^h, W_K^h, W_V^h)$, computes attention independently, and the results are concatenated and projected:

$$\operatorname{MHA}(X) = \operatorname{Concat}(\operatorname{head}_1, \ldots, \operatorname{head}_H)\, W_O$$

This is a composition and concatenation of linear maps. The expressivity comes from the multiplicative interaction $QK^\top$ (which is bilinear, not linear) - but conditioned on a fixed $\alpha$, the rest is linear.
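A minimal single-head sketch with random weights (the sizes and the combined W_OV matrix are illustrative assumptions, not the lesson's notation): once the attention pattern alpha has been computed, everything else is a linear map of X.

```python
import numpy as np

rng = np.random.default_rng(2)
L_tokens, d, d_k, d_v = 4, 8, 8, 8                 # illustrative sizes
X = rng.normal(size=(L_tokens, d))                 # random stand-in for token states
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
W_V, W_O = rng.normal(size=(d, d_v)), rng.normal(size=(d_v, d))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # three linear projections of the tokens
alpha = softmax_rows(Q @ K.T / np.sqrt(d_k))       # the only nonlinearity in the block
out = alpha @ V @ W_O                              # with alpha fixed, this is linear in X

# OV circuit: the single linear map each attended token passes through.
W_OV = W_V @ W_O
assert np.allclose(out, alpha @ X @ W_OV)
```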

For AI: The "OV circuit" and "QK circuit" decomposition (Elhage et al., 2021) analyzes transformer attention by studying the linear maps $W_O W_V$ (value writing) and $W_Q W_K^\top$ (key-query matching) separately. This is possible precisely because attention is compositionally linear.

9.2 LoRA: Fine-Tuning via Low-Rank Composition

Low-Rank Adaptation (Hu et al., 2021) is one of the most important parameter-efficient fine-tuning methods, and its design is a direct application of linear map theory.

Setup. Freeze the pretrained weight $W_0 \in \mathbb{R}^{d \times k}$. Add a trainable low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

Rank-nullity interpretation. The map $\Delta W = BA$ has rank at most $r$. By rank-nullity:

$$\dim(\ker(\Delta W)) \geq k - r$$

The update only changes the network's behavior along at most $r$ directions in the $k$-dimensional input space. When $r = 8$ and $k = 4096$, only $8/4096 \approx 0.2\%$ of input directions are affected - the update is extremely sparse in "direction space."

The two-layer interpretation. The update $\Delta W = BA$ is a composition of two maps:

  1. $A: \mathbb{R}^k \to \mathbb{R}^r$ - compression to a rank-$r$ space
  2. $B: \mathbb{R}^r \to \mathbb{R}^d$ - expansion back to the full dimension

Initializing $A$ randomly and $B = 0$ ensures $\Delta W = 0$ at the start of fine-tuning, so the model starts from the pretrained weights.

Parameter count. $\Delta W$ would have $dk$ parameters. $BA$ uses $dr + rk = r(d+k)$ parameters. Savings ratio: $\frac{r(d+k)}{dk} = \frac{r}{d} + \frac{r}{k} \approx \frac{2r}{\min(d,k)}$. For $r = 8$, $d = k = 4096$: a savings factor of $\approx 256\times$.
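A quick numerical illustration of both points (random matrices; the smaller matrix size for the rank check is an assumption made only to keep the example cheap):

```python
import numpy as np

# Parameter savings for the shapes quoted in the text.
d = k = 4096
for r in [4, 8, 16, 32]:
    full, lora = d * k, r * (d + k)
    print(f"r={r:2d}: {lora:>8,} LoRA params vs {full:,} full ({full // lora}x fewer)")

# Rank bound, on a smaller example so the check runs instantly.
rng = np.random.default_rng(3)
r, d_small = 8, 256
B = rng.normal(size=(d_small, r))
A = rng.normal(size=(r, d_small))
delta_W = B @ A
print(np.linalg.matrix_rank(delta_W))   # at most r = 8
# Every output delta_W @ x lies in col(B), an r-dimensional subspace.
```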

Extensions: DoRA (Weight-Decomposed Low-Rank Adaptation, 2024) decomposes $W$ into a magnitude and a direction component, applying LoRA only to the direction. GaLore (2024) applies low-rank projections to the gradients rather than to the weights.

9.3 Linear Probes and the Linear Representation Hypothesis

Linear probing tests whether a feature is linearly decodable from a model's representations. Given representations $\{\mathbf{h}_i\}$ and labels $\{y_i\}$, train a linear classifier:

$$\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{h} + b)$$

If the probe achieves high accuracy, the feature is linearly represented - it corresponds to a direction $\mathbf{w}$ in representation space.

The Linear Representation Hypothesis (Mikolov et al., 2013; Elhage et al., 2022; Park et al., 2023) states that high-level features (sentiment, syntax, factual attributes, world models) are encoded as directions in the residual stream - i.e., as linear features.

Evidence:

  • Word2vec arithmetic: $\mathbf{v}(\text{king}) - \mathbf{v}(\text{man}) + \mathbf{v}(\text{woman}) \approx \mathbf{v}(\text{queen})$. Semantic relationships are linear offsets.
  • Steering vectors: adding $c\mathbf{d}$ to all residual stream activations controls model behavior (e.g., "banana" direction, sentiment directions).
  • Probing studies: most tested syntactic and semantic features are linearly decodable.

Superposition. When there are more features than dimensions, the model stores features in nearly orthogonal directions that partially overlap (superposition). This is still linear representation - just with interference.
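A toy probing experiment on synthetic embeddings (the planted direction, the noise scale, and the least-squares probe are simplifying assumptions; a logistic-regression probe is the more standard choice in practice):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 16, 500
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)      # the planted "sentiment" direction (synthetic)

labels = rng.choice([-1.0, 1.0], size=n)
H = labels[:, None] * direction + 0.5 * rng.normal(size=(n, d))   # embeddings = signal + noise

# Linear probe fit by least squares (a minimal stand-in for logistic regression).
X = np.hstack([H, np.ones((n, 1))])         # append a bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)
acc = np.mean(np.sign(X @ w) == labels)
print(f"probe accuracy: {acc:.1%}")

# The learned probe direction should align with the planted direction.
cos = np.dot(w[:d] / np.linalg.norm(w[:d]), direction)
print(f"cosine(probe weights, planted direction) = {cos:.3f}")
```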

For AI: If the linear representation hypothesis holds broadly, then:

  • Linear algebra provides the right toolkit for model interpretability.
  • Interventions on model behavior reduce to vector addition in representation space.
  • Feature extraction is a linear map - PCA/SVD on activations finds meaningful directions.

9.4 Embedding and Unembedding

Token embeddings. The embedding matrix $W_E \in \mathbb{R}^{|V| \times d}$ maps vocabulary indices to $d$-dimensional vectors. Indexing row $i$ of $W_E$ is equivalent to applying the linear map $\mathbf{e}_i^\top W_E$ (one-hot selection). The embedding layer is linear in the one-hot representation.

Unembedding. The unembedding matrix $W_U \in \mathbb{R}^{d \times |V|}$ maps residual stream vectors to logits over the vocabulary:

$$\mathbf{l} = W_U^\top \mathbf{h} \in \mathbb{R}^{|V|}$$

This is a pure linear map. The logit for token $v$ is $l_v = \mathbf{w}_{U,v}^\top \mathbf{h}$ - a dot product (linear functional) between the unembedding direction $\mathbf{w}_{U,v}$ (the $v$-th column of $W_U$) and the residual stream.

Logit lens. Applying $W_U$ to intermediate residual stream states (before the final layer) gives "early predictions" - showing what the model is computing at each layer. This technique (Nostalgebraist, 2020) is possible because unembedding is linear.

Tied embeddings. Some models (e.g., GPT-2) use $W_U = W_E^\top$ - the same matrix for both embedding and unembedding. This enforces consistency: the most likely next token $v$ after seeing context $\mathbf{h}$ is the one whose embedding $W_E[v]$ has the highest dot product with $\mathbf{h}$ - i.e., $\operatorname{argmax}_v W_E[v] \cdot \mathbf{h}$.
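A tiny sketch with random matrices standing in for a trained model (vocabulary size and width are arbitrary assumptions): with tied weights, unembedding a residual-stream state amounts to asking which embedding row points most strongly along it.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab, d = 100, 32
W_E = rng.normal(size=(vocab, d))       # embedding matrix: one row per token (random stand-in)
W_U = W_E.T                             # tied unembedding, shape (d, vocab)

h = rng.normal(size=d)                  # a residual-stream state (random stand-in)
logits = W_U.T @ h                      # l_v = <w_{U,v}, h>, one logit per vocabulary item
predicted = int(np.argmax(logits))

# With tied weights this is exactly "which embedding row has the largest dot product with h".
assert predicted == int(np.argmax(W_E @ h))
```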


10. Common Mistakes

Each entry lists the mistake, why it's wrong, and the fix.

1. Assuming $T(\mathbf{0}) \neq \mathbf{0}$ is possible for a linear map. Why it's wrong: homogeneity immediately gives $T(\mathbf{0}) = T(0 \cdot \mathbf{v}) = 0 \cdot T(\mathbf{v}) = \mathbf{0}$; any map with $T(\mathbf{0}) \neq \mathbf{0}$ is affine or nonlinear. Fix: the zero test is the fastest way to rule out linearity.

2. Treating translation as linear. Why it's wrong: $f(\mathbf{x}) = \mathbf{x} + \mathbf{b}$ fails additivity: $f(\mathbf{u}+\mathbf{v}) = \mathbf{u}+\mathbf{v}+\mathbf{b} \neq f(\mathbf{u})+f(\mathbf{v}) = \mathbf{u}+\mathbf{v}+2\mathbf{b}$. Fix: neural network layers are affine ($W\mathbf{x}+\mathbf{b}$), not linear; this matters for composing and inverting.

3. Confusing the rank of a map with the rank of a matrix. Why it's wrong: rank is basis-independent (it is the dimension of the image), but the matrix depends on the basis. Fix: $\operatorname{rank}(T) = \operatorname{rank}([T]_{\mathcal{B}}^{\mathcal{C}})$ for any bases - rank is an invariant of the map, not of a particular matrix.

4. Misapplying rank-nullity. Why it's wrong: rank-nullity applies to the domain dimension, not the codomain. For $T: \mathbb{R}^5 \to \mathbb{R}^3$, rank + nullity = 5, not 3. Fix: identify $\dim(V)$ explicitly before applying the theorem.

5. Assuming $AB = BA$. Why it's wrong: matrix multiplication (= composition of linear maps) is generally not commutative; even for square matrices, $AB \neq BA$ in general. Fix: non-commutativity is the default; commutativity (e.g., diagonal matrices, polynomial functions of a matrix) is the special case.

6. Confusing similar matrices with equal matrices. Why it's wrong: $A \sim B$ means they represent the same map in different bases - same eigenvalues, trace, determinant - but $A \neq B$ in general. Fix: similar matrices are equal only if the change-of-basis matrix $P$ is the identity.

7. Thinking the kernel is trivial when $A$ is tall. Why it's wrong: a tall matrix $A \in \mathbb{R}^{m \times n}$ with $m > n$ can still have a non-trivial null space if its columns are linearly dependent. Fix: compute $\operatorname{rank}(A)$; nullity $= n - \operatorname{rank}(A)$, regardless of whether $m > n$.

8. Applying the inverse when only a one-sided inverse exists. Why it's wrong: a non-square $A \in \mathbb{R}^{m \times n}$ ($m \neq n$) cannot have a two-sided inverse; $A^{-1}$ is undefined for such matrices. Fix: use the pseudo-inverse $A^+$ for the least-squares solution.

9. Forgetting that the Jacobian is a linear map, not just a table of partial derivatives. Why it's wrong: the Jacobian matrix is the coordinate representation of the total derivative $Df_{\mathbf{x}}$; the partial derivatives give its entries but do not by themselves prove differentiability. Fix: differentiability requires the linear approximation error to go to zero - the partials must exist, and continuity of the partials is the standard sufficient condition.

10. Treating the gradient as a vector in the same space. Why it's wrong: strictly, $\nabla f(\mathbf{x}) \in (\mathbb{R}^n)^*$ (the dual space). In Euclidean space with the standard basis, the identification $(\mathbb{R}^n)^* \cong \mathbb{R}^n$ makes this invisible, but for optimization on manifolds or with non-Euclidean metrics, treating gradients as primal vectors gives wrong answers. Fix: use the gradient as a covector when working with Fisher information, natural gradient, or Riemannian optimization.

11. Assuming all bijective functions are linear isomorphisms. Why it's wrong: a function can be bijective but not linear; e.g., $f: \mathbb{R} \to \mathbb{R}$, $f(x) = x^3$ is bijective but not linear ($f(1+1) = 8 \neq f(1) + f(1) = 2$). Fix: isomorphisms must be both bijective AND linear - check both conditions.

12. Forgetting that $\operatorname{im}(T)$ lives in $W$, not in $V$. Why it's wrong: the image is a subspace of the codomain $W$, not the domain $V$; the null space lives in $V$. Fix: draw the diagram: $T: V \to W$, $\ker(T) \subseteq V$, $\operatorname{im}(T) \subseteq W$.

11. Exercises

Exercise 1 * - Kernel and Image Computation

Goal: Given explicit matrices, compute kernel and image and verify rank-nullity.

(a) For $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$: find a basis for $\ker(A)$, a basis for $\operatorname{im}(A)$, and verify rank + nullity = 3.

(b) For $B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$ (projection onto the first two coordinates): identify the kernel and image without computation.

(c) For the differentiation map $D: \mathcal{P}_3 \to \mathcal{P}_2$ with matrix as in 3.1: find the kernel and image dimensions.

Exercise 2 * - Matrix of a Linear Map in Non-Standard Basis

Goal: Construct the matrix of a given transformation in a specified basis.

Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ be reflection across the line $y = x$ (i.e., $T(x, y) = (y, x)$). Basis $\mathcal{B} = \{(1,1)/\sqrt{2},\ (1,-1)/\sqrt{2}\}$.

(a) Write the standard matrix $A$ of $T$ in the standard basis.

(b) Compute the change-of-basis matrix $P$ from $\mathcal{B}$ to the standard basis.

(c) Compute $[T]_{\mathcal{B}} = P^{-1}AP$. What is notable about this result?

(d) Explain geometrically why $T$ is diagonal in the basis $\mathcal{B}$.

Exercise 3 * - Rank-Nullity and Linear Systems

Goal: Use rank-nullity to understand solution sets of linear systems.

(a) $A \in \mathbb{R}^{4 \times 6}$ has rank 3. What is the nullity? How many free variables are there in $A\mathbf{x} = \mathbf{0}$?

(b) $B \in \mathbb{R}^{6 \times 4}$ has rank 4. Is $B\mathbf{x} = \mathbf{b}$ always solvable? What is the nullity of $B$?

(c) Prove: if $T: V \to V$ is a linear map on a finite-dimensional space, then $T$ is injective if and only if $T$ is surjective.

Exercise 4 ** - Projection Operator Construction

Goal: Build an orthogonal projection onto a given subspace and verify idempotency.

Let $S = \operatorname{span}\{(1, 2, 0), (0, 1, 1)\} \subseteq \mathbb{R}^3$.

(a) Orthonormalize the spanning set using Gram-Schmidt.

(b) Construct the projection matrix $P = QQ^\top$ where $Q$ has the orthonormal basis as columns.

(c) Verify: $P^2 = P$, $P = P^\top$, and $\operatorname{rank}(P) = \operatorname{tr}(P) = 2$.

(d) Compute $(I - P)$ and verify it is also a projection. What does it project onto?

Exercise 5 ** - Jacobian of Softmax

Goal: Derive and verify the Jacobian of the softmax function.

(a) Derive $\frac{\partial s_i}{\partial z_j}$ for $\mathbf{s} = \operatorname{softmax}(\mathbf{z})$, handling the cases $i = j$ and $i \neq j$ separately.

(b) Show that the Jacobian is $J = \operatorname{diag}(\mathbf{s}) - \mathbf{s}\mathbf{s}^\top$.

(c) Verify that $J\mathbf{1} = \mathbf{0}$ (the Jacobian kills constant vectors). Why does this make sense?

(d) Show that the rank of $J$ is at most $n - 1$.

Exercise 6 ** - Affine Map Composition via Homogeneous Coordinates

Goal: Compose affine transformations using augmented matrices.

In $\mathbb{R}^2$, let:

  • $f_1$: rotation by $45^\circ$, then translation by $(1, 0)$
  • $f_2$: scaling by $2$ in both dimensions, then translation by $(0, 1)$

(a) Write the augmented $3 \times 3$ matrices $\tilde{M}_1$ and $\tilde{M}_2$ for $f_1$ and $f_2$.

(b) Compute $\tilde{M}_2 \tilde{M}_1$ (apply $f_1$ then $f_2$). What are the effective rotation, scale, and translation?

(c) Apply the composition to the point $(1, 0)$. Verify by applying $f_1$ then $f_2$ directly.

Exercise 7 *** - LoRA Rank Analysis

Goal: Analyze the geometry of low-rank weight updates.

Let $W_0 \in \mathbb{R}^{512 \times 512}$ be a pretrained weight. A LoRA update with rank $r = 4$ is $\Delta W = BA$ where $B \in \mathbb{R}^{512 \times 4}$, $A \in \mathbb{R}^{4 \times 512}$.

(a) What is $\operatorname{rank}(\Delta W)$? What is the nullity of $\Delta W$?

(b) Generate random $B$ and $A$ and numerically verify: $\operatorname{rank}(BA) \leq 4$.

(c) For a fixed input $\mathbf{x} \in \mathbb{R}^{512}$, compute $\Delta W \mathbf{x} = BA\mathbf{x}$. Show this is in $\operatorname{col}(B)$.

(d) Compute the singular values of $\Delta W$. How many are nonzero? What does this tell you about the effective dimension of the update?

(e) Compare the number of trainable parameters: $512 \times 512$ vs $r(512 + 512)$ for $r = 4, 8, 16, 32$.

Exercise 8 *** - Linear Probing and the Linear Representation Hypothesis

Goal: Empirically test whether a concept is linearly represented in embedding space.

(a) Generate synthetic embeddings for binary sentiment: positive embeddings clustered near $\mathbf{d}$, negative near $-\mathbf{d}$ for a fixed direction $\mathbf{d}$, plus noise.

(b) Train a linear probe (logistic regression on top of the embeddings) and measure accuracy.

(c) Apply PCA to the embeddings. Show that the first principal component aligns with $\mathbf{d}$.

(d) Add a "superposition" condition: embed two independent binary features in $d = 3$ dimensions. Show that both features can be linearly decoded, but with interference.

(e) Compute the mutual coherence of the feature directions. How does it relate to probe accuracy?


12. Why This Matters for AI (2026 Perspective)

Each entry pairs a concept with its AI application and why it matters.

  • Linear map axioms - neural layer computation: every forward pass is a composition of linear maps and activations; understanding linearity separates what the layer does from what the nonlinearity adds.
  • Kernel and image - information compression: the null space of a weight matrix is "dead information" - inputs in $\ker(W)$ produce no signal. Attention heads have low-rank structure that defines their effective null space.
  • Rank-nullity theorem - LoRA, model compression: LoRA exploits rank-nullity: a rank-$r$ update has at least a $(k-r)$-dimensional null space, leaving most of the input space unaffected - this is why it works with few parameters.
  • Change of basis - diagonalization, eigenfeatures: the eigenvalue decomposition of weight matrices (studied in mechanistic interpretability) is a change to the "natural basis" in which the layer acts by simple scaling.
  • Composition = multiplication - deep network analysis: the effective weight of a $k$-layer linear network is the single matrix $W_L \cdots W_1$ - depth without nonlinearity has zero benefit; all depth benefits require nonlinearity.
  • Projection operators - attention heads, layer norm: attention heads project onto query/key/value subspaces; layer norm projects onto the hyperplane of mean-zero vectors; understanding projections clarifies what information is preserved.
  • Affine maps + bias - universal approximation: bias terms are essential for shifting decision boundaries. Without bias, the model cannot represent affine functions - only linear ones. Universal approximation requires affine layers.
  • Jacobian and chain rule - backpropagation: every .backward() call is Jacobian-matrix multiplication via the chain rule. Gradient explosion/vanishing is about Jacobian singular value growth/decay through layers.
  • Dual maps and transposes - gradient computation: the backward pass uses transpose weight matrices - these are the dual maps. Natural gradient and Fisher information matrix methods exploit the geometry of the dual space.
  • Linear representation hypothesis - mechanistic interpretability: if features are linear, activation patching, steering vectors, and linear probing all work. This is why "linear algebra for interpretability" (e.g., representation engineering, logit lens) is a coherent research program.

13. Conceptual Bridge

Looking Back

This section builds on two foundational pillars from earlier in the curriculum.

From Chapter 2, 06: Vector Spaces and Subspaces: We studied the axioms of abstract vector spaces (closure, associativity, identity, inverse, distributivity) and their subspaces (spans, null spaces, column spaces). Linear transformations are the maps between these abstract structures - the morphisms of the category of vector spaces. The four fundamental subspaces (column space, null space, row space, left null space) that were defined for matrices are now understood as $\operatorname{im}(T)$, $\ker(T)$, $\operatorname{im}(T^\top)$, and $\ker(T^\top)$ - intrinsic properties of the map, not the matrix.

From 01: Eigenvalues and Eigenvectors and 02: SVD: Both eigendecomposition and SVD are decompositions of linear maps into simple pieces. Eigendecomposition finds a basis in which $T$ acts by scaling. SVD finds two orthonormal bases (for domain and codomain) in which $T$ acts by scaling. These are special cases of the general change-of-basis machinery developed here in Section 3.

From 03: PCA: PCA uses the linear structure of the data covariance matrix - the covariance $C = X^\top X / (n-1)$ is built from linear maps - to find the principal directions via SVD. The whitening transform and PCA projection are linear maps, and their geometry (dimension reduction, preserved variance) follows directly from the rank-nullity theorem.

Looking Forward

The abstract machinery of linear transformations will appear throughout the rest of the curriculum in concrete technical forms.

05: Orthogonality and Orthonormality: Gram-Schmidt is an algorithm that constructs a specific linear map (change of basis to an orthonormal basis). QR decomposition $A = QR$ factors a linear map as an orthogonal map $Q$ followed by a triangular map $R$. Orthogonal projections (Section 5.1 here) are studied in depth there.

06: Matrix Norms: The spectral norm $\lVert A \rVert_2 = \sigma_1(A)$ measures how much a linear map can amplify vectors. The nuclear norm $\lVert A \rVert_* = \sum_i \sigma_i$ measures the "effective rank." These norms quantify properties of the linear map that are invisible from the matrix entries.

Chapter 4: Calculus Fundamentals: The Jacobian (Section 8 here) is the bridge between linear algebra and calculus. Multivariable calculus is essentially the study of how linear maps (Jacobians) approximate smooth nonlinear maps. The Hessian is the second-order analogue - a bilinear map that measures curvature. The implicit function theorem, inverse function theorem, and change-of-variables formula all rely on the Jacobian being an invertible linear map at the point of interest.

Curriculum Position

POSITION IN THE MATH FOR LLMS CURRICULUM
========================================================================

  Chapter 1: Mathematical Foundations
  Chapter 2: Linear Algebra Basics
    +-- Vectors, Matrix Operations, Systems of Equations
    +-- Determinants, Matrix Rank
    +-- Vector Spaces ----------------------------+
                                                  | prerequisite
  Chapter 3: Advanced Linear Algebra              |
    +-- 01-Eigenvalues --------------------------+
    +-- 02-SVD ---------------------------------+
    +-- 03-PCA ---------------------------------+
    |                                             |
    +-- 04-Linear Transformations <--------------+
    |      (YOU ARE HERE)
    |      Kernel, Image, Rank-Nullity
    |      Matrix Representation, Change of Basis
    |      Composition, Isomorphisms, Jacobian
    |      Affine Maps, Dual Spaces, AI Applications
    |                                             |
    +-- 05-Orthogonality <----------------------+
    +-- 06-Matrix Norms <-----------------------+
    +-- 07-Positive Definite Matrices           |
    +-- 08-Matrix Decompositions               |
                                               down
  Chapter 4: Calculus Fundamentals
    (Jacobians and chain rule developed further)

========================================================================

The unifying theme: Every major algorithm in deep learning is a composition of linear maps and nonlinearities. Understanding this section means understanding the language in which every neural network, every attention mechanism, every gradient computation, and every interpretability method is written. The matrix is not the territory - but knowing how to move between coordinate representations (change of basis), how to measure what a map collapses (kernel and rank), how to compose maps (matrix multiplication), and how to invert them (isomorphisms and pseudo-inverses) gives you the full algebraic toolkit to reason about any linear system you encounter in machine learning.




Appendix A: Extended Examples and Computations

A.1 Computing the Matrix of Differentiation

We work out the differentiation operator in full detail to build intuition for linear maps on function spaces.

Setting. Let $V = \mathcal{P}_3 = \{a_0 + a_1 x + a_2 x^2 + a_3 x^3\}$ with basis $\mathcal{B}_V = \{1, x, x^2, x^3\}$ and $W = \mathcal{P}_2$ with basis $\mathcal{B}_W = \{1, x, x^2\}$.

Define $D: V \to W$ by $D(p) = p'$ (differentiation).

Step 1: Apply $D$ to each basis vector of $V$.

  • $D(1) = 0 = 0 \cdot 1 + 0 \cdot x + 0 \cdot x^2$ -> coordinate vector $(0, 0, 0)^\top$
  • $D(x) = 1 = 1 \cdot 1 + 0 \cdot x + 0 \cdot x^2$ -> coordinate vector $(1, 0, 0)^\top$
  • $D(x^2) = 2x = 0 \cdot 1 + 2 \cdot x + 0 \cdot x^2$ -> coordinate vector $(0, 2, 0)^\top$
  • $D(x^3) = 3x^2 = 0 \cdot 1 + 0 \cdot x + 3 \cdot x^2$ -> coordinate vector $(0, 0, 3)^\top$

Step 2: Assemble into a matrix.

$$[D]_{\mathcal{B}_V}^{\mathcal{B}_W} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$

Step 3: Use the matrix. Differentiate $p(x) = 2 + 3x - x^2 + 4x^3$. In $\mathcal{B}_V$-coordinates: $[p]_{\mathcal{B}_V} = (2, 3, -1, 4)^\top$.

$$[D(p)]_{\mathcal{B}_W} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix} \begin{pmatrix} 2 \\ 3 \\ -1 \\ 4 \end{pmatrix} = \begin{pmatrix} 3 \\ -2 \\ 12 \end{pmatrix}$$

So $D(p) = 3 - 2x + 12x^2$. Check: $(2 + 3x - x^2 + 4x^3)' = 3 - 2x + 12x^2$. OK

Kernel of $D$: The null space is $\{p : p' = 0\} = \{\text{constants}\} = \operatorname{span}\{1\}$. Dimension 1.

Image of $D$: All polynomials of degree $\leq 2$ (since every $q \in \mathcal{P}_2$ equals $p'$ for some $p \in \mathcal{P}_3$). Dimension 3.

Rank-nullity check: $\dim(\mathcal{P}_3) = 4 = 1 + 3 = \text{nullity} + \text{rank}$. OK
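The same computation can be reproduced in NumPy (a short sketch; the matrix and the coefficient vector are exactly the monomial-basis coordinates used above):

```python
import numpy as np

# Matrix of d/dx from P_3 (basis 1, x, x^2, x^3) to P_2 (basis 1, x, x^2), as derived above.
D = np.array([[0, 1, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 3]], dtype=float)

p = np.array([2, 3, -1, 4], dtype=float)    # p(x) = 2 + 3x - x^2 + 4x^3
print(D @ p)                                # [ 3. -2. 12.]  ->  3 - 2x + 12x^2

rank = np.linalg.matrix_rank(D)
nullity = D.shape[1] - rank
print(rank, nullity, rank + nullity)        # 3, 1, 4: rank-nullity checks out
```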

A.2 Change of Basis: Full Worked Example

Problem. Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ rotate vectors by $30^\circ$ counterclockwise. Express $T$ in the rotated basis $\mathcal{B}' = \{R_{45^\circ}\mathbf{e}_1, R_{45^\circ}\mathbf{e}_2\}$.

Step 1: Standard matrix of $T$.

$$A = R_{30^\circ} = \begin{bmatrix} \cos 30^\circ & -\sin 30^\circ \\ \sin 30^\circ & \cos 30^\circ \end{bmatrix} = \begin{bmatrix} \sqrt{3}/2 & -1/2 \\ 1/2 & \sqrt{3}/2 \end{bmatrix}$$

Step 2: Change-of-basis matrix. The new basis vectors, in standard coordinates:

$$\mathbf{b}'_1 = R_{45^\circ}\mathbf{e}_1 = (\cos 45^\circ, \sin 45^\circ)^\top = (1/\sqrt{2}, 1/\sqrt{2})^\top$$

$$\mathbf{b}'_2 = R_{45^\circ}\mathbf{e}_2 = (-\sin 45^\circ, \cos 45^\circ)^\top = (-1/\sqrt{2}, 1/\sqrt{2})^\top$$

$$P = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{bmatrix} = R_{45^\circ}$$

Step 3: New matrix.

$$[T]_{\mathcal{B}'} = P^{-1} A P = R_{-45^\circ} R_{30^\circ} R_{45^\circ}$$

Since 2D rotation matrices commute (they all rotate about the same axis): $R_{-45^\circ} R_{30^\circ} R_{45^\circ} = R_{30^\circ}$.

Key insight: A 2D rotation has the same matrix in every orthonormal basis of the same orientation (i.e., any rotated basis), because $R_\theta$ commutes with all rotations: $R_\alpha R_\theta R_\alpha^{-1} = R_\theta$.
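A short numerical confirmation (NumPy sketch with the same angles as above) that conjugating the 30-degree rotation by the 45-degree change of basis gives back the same matrix:

```python
import numpy as np

def rot(deg):
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

A = rot(30)                      # the map T, in the standard basis
P = rot(45)                      # columns are the new basis vectors b'_1, b'_2
T_new = np.linalg.inv(P) @ A @ P

assert np.allclose(T_new, A)     # same matrix: 2D rotations commute
```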

A.3 The Null Space as a Subspace: Visualization

For $A = \begin{bmatrix} 1 & 2 & -1 \\ 2 & 4 & -2 \end{bmatrix}$ (the rows are multiples of each other), let's find $\ker(A)$.

Row reduce with $R_2 \leftarrow R_2 - 2R_1$:

$$\begin{bmatrix} 1 & 2 & -1 \\ 0 & 0 & 0 \end{bmatrix}$$

Free variables: $x_2 = s$, $x_3 = t$. Back-substitute: $x_1 = -2s + t$.

$$\ker(A) = \left\{ s \begin{pmatrix} -2 \\ 1 \\ 0 \end{pmatrix} + t \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} : s, t \in \mathbb{R} \right\}$$

This is a plane through the origin in $\mathbb{R}^3$. The map $T_A: \mathbb{R}^3 \to \mathbb{R}^2$ collapses this entire plane to zero. By rank-nullity: rank $= 1$, nullity $= 2$, and $1 + 2 = 3 = \dim(\mathbb{R}^3)$. OK

The image of $T_A$ is $\operatorname{span}\{(1,2)^\top\}$ - a line in $\mathbb{R}^2$ - since all three columns of $A$, namely $(1,2)^\top$, $(2,4)^\top = 2(1,2)^\top$, and $(-1,-2)^\top = -(1,2)^\top$, are multiples of $(1,2)^\top$, so rank $= 1$.


Appendix B: Abstract Linear Algebra Perspective

B.1 The Category of Vector Spaces

In the language of category theory (which provides a unifying framework for all of mathematics):

  • Objects: vector spaces over a field $\mathbb{F}$
  • Morphisms: linear maps between them
  • Composition: function composition (= matrix multiplication)
  • Identity morphism: the identity map $\operatorname{id}_V(\mathbf{v}) = \mathbf{v}$ (= the identity matrix $I$)

This forms a category denoted $\mathbf{Vect}_\mathbb{F}$.

Isomorphisms in the category are exactly the invertible linear maps - this matches our definition of isomorphism in 4.3.

Functors are maps between categories that preserve the categorical structure. The "matrix representation" is a functor from $\mathbf{Vect}_\mathbb{F}$ (with chosen bases) to the category $\mathbf{Mat}_\mathbb{F}$ of matrices. Changing bases corresponds to a natural transformation.

This perspective matters for ML because neural network architectures are themselves categorical structures (compositions of morphisms), and categorical thinking helps reason about when two architectures are "equivalent" (related by isomorphisms).

B.2 Infinite-Dimensional Extensions

The theory extends to infinite-dimensional spaces, where the familiar finite-dimensional results require modification.

Bounded linear operators. On Hilbert spaces (infinite-dimensional inner product spaces), the right notion of "linear transformation" is a bounded linear operator $T: H \to H$ satisfying $\lVert T\mathbf{v}\rVert \leq M\lVert\mathbf{v}\rVert$ for some $M < \infty$. Unbounded operators (like differentiation on $L^2$) require careful domain specification.

The spectral theorem for compact operators. A compact self-adjoint operator on a Hilbert space has a countable set of eigenvalues with $|\lambda_1| \geq |\lambda_2| \geq \cdots \to 0$ and an orthonormal basis of eigenvectors. This is the infinite-dimensional analogue of diagonalization.

Functional analysis. The study of linear maps on infinite-dimensional spaces is called functional analysis. Key results (Hahn-Banach, open mapping theorem, closed graph theorem) parallel finite-dimensional results but require additional technical hypotheses.

For AI: Neural network function classes are subsets of infinite-dimensional function spaces ($L^2$ or Sobolev spaces). Understanding the "size" and "complexity" of these classes uses infinite-dimensional linear algebra - e.g., kernel methods and Gaussian processes operate in reproducing kernel Hilbert spaces (RKHS), and neural tangent kernel theory analyzes infinitely wide networks using the spectral theory of linear operators.

B.3 The Tensor Product and Multilinear Maps

Bilinear maps. A map $B: V \times W \to U$ is bilinear if it is linear in each argument separately. The attention score $B(\mathbf{q}, \mathbf{k}) = \mathbf{q}^\top \mathbf{k}$ is bilinear (linear in $\mathbf{q}$, linear in $\mathbf{k}$, but not linear jointly: $B(c\mathbf{q}, c\mathbf{k}) = c^2 B(\mathbf{q}, \mathbf{k}) \neq c B(\mathbf{q}, \mathbf{k})$ in general).

The tensor product $V \otimes W$ is the universal space for bilinear maps: every bilinear map $B: V \times W \to U$ factors through a unique linear map $\tilde{B}: V \otimes W \to U$. This is why tensors (in the ML sense: multi-dimensional arrays) are called tensors - they represent multilinear maps.

For AI: The key operation in self-attention, $\mathbf{q}^\top \mathbf{k} = \langle \mathbf{q}, \mathbf{k} \rangle$, is a bilinear form. The matrix $W_{\text{QK}}$ in the "weight matrix" formulation of attention ($\mathbf{q}^\top W_{\text{QK}} \mathbf{k}$, with $\mathbf{q}, \mathbf{k}$ taken in the residual stream) makes this explicit: $W_{\text{QK}} = W_Q W_K^\top$ is the matrix of the bilinear form. This is the reason attention is more expressive than standard linear transformations - it computes a bilinear (quadratic) function of the input.


