Linear Transformations, Part 1: 1. Intuition to 6. Dual Spaces and Transposes
1. Intuition
1.1 Three Views of a Matrix
A matrix admits three distinct but equivalent interpretations, and fluent practitioners shift between them effortlessly.
View 1: Data container. The matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array of numbers - $m$ rows, $n$ columns. This is the computational view. We use it when we care about entries: $A_{ij}$, the element in row $i$, column $j$.
View 2: Column geometry. The matrix is a collection of $n$ column vectors in $\mathbb{R}^m$. The product $Ax$ is then a linear combination: $Ax = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n$, where $a_j$ is the $j$-th column. This view makes the column space visible and illuminates when $Ax = b$ has solutions.
View 3: Linear transformation. The matrix defines a function $T: \mathbb{R}^n \to \mathbb{R}^m$ by $T(x) = Ax$. This is the abstract, coordinate-free view. The map exists independently of any coordinate representation - the matrix is merely its description in a particular pair of bases (the standard bases of $\mathbb{R}^n$ and $\mathbb{R}^m$).
THREE VIEWS OF THE MATRIX A
========================================================================
      [2  1]       Column geometry:        Linear map:
  A = [0  3]       A = [a_1 | a_2]         T: \mathbb{R}^2 -> \mathbb{R}^3
      [1 -1]                               T(x) = Ax

  Data container   Span{a_1, a_2}          Sends basis vectors:
  A_21 = 0         = column space          e_1 -> col 1 of A
  A_32 = -1                                e_2 -> col 2 of A

          All three describe the same mathematical object.
========================================================================
For AI: In a transformer, the query weight matrix $W_Q$ is simultaneously all three: raw parameters to be optimized (View 1), a basis for the query subspace (View 2), and a linear projection from the residual stream to query space (View 3). Understanding which view you're using prevents confusion about what "the attention head is doing."
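A minimal NumPy sketch of the three views, using the $3 \times 2$ matrix from the diagram above (the input vector is an arbitrary illustration):

```python
import numpy as np

# The 3x2 matrix from the diagram above.
A = np.array([[2.0, 1.0],
              [0.0, 3.0],
              [1.0, -1.0]])
x = np.array([2.0, -1.0])            # an arbitrary input vector

# View 1: data container - individual entries (NumPy is 0-indexed,
# so A[1, 0] is the math entry A_21 and A[2, 1] is A_32).
print(A[1, 0], A[2, 1])              # 0.0 -1.0

# View 2: column geometry - Ax is a linear combination of the columns.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]

# View 3: linear transformation - the function x -> Ax.
Tx = A @ x

print(np.allclose(combo, Tx))        # True: both views give the same vector
```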
1.2 Geometric Action of Linear Maps
What does a linear map $T$ do to space? The two axioms - additivity and homogeneity - impose strong geometric constraints:
- The origin is fixed. $T(0) = 0$ always. A linear map cannot translate; it cannot move the origin. This is the most important geometric fact about linear transformations.
- Lines through the origin stay lines. If $L = \{t v : t \in \mathbb{R}\}$ is a line, then $T(L) = \{t\, T(v) : t \in \mathbb{R}\}$ is a line through the origin in the output space (or the single point $\{0\}$ if $T(v) = 0$).
- Parallel lines stay parallel. If $\ell_1 \parallel \ell_2$ (same direction), then $T(\ell_1) \parallel T(\ell_2)$ (still same direction).
- Grid lines go to grid lines, equally spaced. This is the classic 3Blue1Brown visualization: apply $T$ to the integer grid of $\mathbb{R}^2$ and you get a (possibly skewed) grid in $\mathbb{R}^2$.
- The unit square maps to a parallelogram. The parallelogram spanned by the columns of $A$ is the image of the unit square under $T$.
What can a linear map do? Depending on the matrix:
- Rotate space (orthogonal matrix, $\det = +1$)
- Reflect space (orthogonal matrix, $\det = -1$)
- Scale dimensions (diagonal matrix)
- Shear space (triangular matrix)
- Project onto a subspace (idempotent: $P^2 = P$)
- Collapse space to a lower dimension (rank-deficient matrix)
What a linear map cannot do: translate (no origin-shifting), apply nonlinear distortion (no bending, folding, or curving of lines).
1.3 Why Linearity is Special
The superposition principle is the reason linear algebra is tractable. For a linear map $T$:
$$T(c_1 v_1 + c_2 v_2 + \cdots + c_k v_k) = c_1 T(v_1) + c_2 T(v_2) + \cdots + c_k T(v_k)$$
This means: to know $T$ on all of $\mathbb{R}^n$, you only need to know $T$ on a basis. A basis has $n$ elements. A map defined on infinitely many vectors is encoded in finitely many column vectors. This is extraordinary compression: infinite -> finite.
Linearity vs nonlinearity in neural networks. A deep neural network with no nonlinearities is just a single linear transformation: $x \mapsto W_L \cdots W_2 W_1 x$. Multiple linear layers collapse to one. The nonlinearities (ReLU, GELU, SiLU) are what allow composition to create genuinely new representational power. The linear layers provide the parameterized directions; the nonlinear activations provide the expressive capacity.
Linearity in analysis. Many of the hardest problems in mathematics and ML become tractable when restricted to linear functions: linear regression has a closed-form solution, linear systems have a complete theory, spectral analysis of linear operators is well-understood. The strategy of "linearize, solve, interpret" recurs throughout calculus, optimization, and signal processing.
For AI: The linear representation hypothesis (Elhage et al. 2022, Park et al. 2023) conjectures that many high-level features in LLMs are encoded as directions in representation space - i.e., they are linear features. If true, this means the crucial structure of LLM computation is linear, and all the machinery of this section applies directly to understanding what models are doing.
1.4 Historical Timeline
| Year | Person | Contribution |
|---|---|---|
| 1844 | Hermann Grassmann | Die lineale Ausdehnungslehre - first abstract treatment of linear spaces |
| 1855 | Arthur Cayley | Matrix algebra as formal system; composition of matrices |
| 1888 | Giuseppe Peano | First rigorous axiomatization of vector spaces |
| 1902 | Henri Lebesgue | Integration as a linear functional on function spaces |
| 1904-1910 | Hilbert, Riesz, Fischer | Infinite-dimensional linear algebra; Hilbert spaces; spectral theory |
| 1929 | John von Neumann | Operator theory; linear transformations on Hilbert spaces |
| 1940s | Alan Turing, von Neumann | Linear algebra for numerical computation; matrix algorithms |
| 1986 | Rumelhart, Hinton, Williams | Backpropagation - chains of Jacobians as the training algorithm |
| 2017 | Vaswani et al. | Attention = linear projections + scaled dot product; transformers |
| 2021 | Hu et al. (LoRA) | Low-rank linear map updates for efficient fine-tuning |
| 2022- | Elhage, Park et al. | Linear representation hypothesis; mechanistic interpretability |
2. Formal Definitions
2.1 The Two Axioms and All Their Consequences
Definition (Linear Transformation). Let $V$ and $W$ be vector spaces over the same field $F$ (typically $\mathbb{R}$ or $\mathbb{C}$). A function $T: V \to W$ is a linear transformation (also called a linear map or homomorphism) if it satisfies:
- Additivity: $T(u + v) = T(u) + T(v)$ for all $u, v \in V$
- Homogeneity: $T(cv) = c\,T(v)$ for all $c \in F$, $v \in V$
These two axioms are often combined into the single condition:
$$T(au + bv) = a\,T(u) + b\,T(v) \quad \text{for all } a, b \in F \text{ and } u, v \in V$$
Immediate consequences (each follows directly from the axioms):
Proposition 2.1.1 (Zero maps to zero). $T(0) = 0$.
Proof: $T(0) = T(0 \cdot v) = 0 \cdot T(v) = 0$ for any $v \in V$.
This is a universal test: if $T(0) \neq 0$, then $T$ is not linear. Translation ($T(x) = x + b$, $b \neq 0$) fails immediately.
Proposition 2.1.2 (Negatives are preserved). $T(-v) = -T(v)$.
Proof: $T(-v) = T((-1)\,v) = (-1)\,T(v) = -T(v)$.
Proposition 2.1.3 (General linear combinations). For any $v_1, \dots, v_k \in V$ and scalars $c_1, \dots, c_k \in F$:
$$T\left(\sum_{i=1}^{k} c_i v_i\right) = \sum_{i=1}^{k} c_i\, T(v_i)$$
Proof: By induction using additivity and homogeneity.
Proposition 2.1.4 (Determined by basis images). If $\{v_1, \dots, v_n\}$ is a basis for $V$, then $T$ is completely determined by the values $T(v_1), \dots, T(v_n)$. Moreover, for any choice of vectors $w_1, \dots, w_n \in W$, there exists a unique linear map $T: V \to W$ with $T(v_i) = w_i$.
This is the fundamental construction theorem. It means the matrix of $T: \mathbb{R}^n \to \mathbb{R}^m$ (in standard coordinates) is:
$$A = \big[\, T(e_1) \mid T(e_2) \mid \cdots \mid T(e_n) \,\big]$$
The set of all linear maps $V \to W$ is denoted $\mathcal{L}(V, W)$ or $\operatorname{Hom}(V, W)$. It is itself a vector space under pointwise operations: $(S + T)(v) = S(v) + T(v)$ and $(cT)(v) = c\,T(v)$.
2.2 Kernel of a Linear Map
Definition (Kernel). The kernel (or null space) of a linear map $T: V \to W$ is:
$$\ker(T) = \{\, v \in V : T(v) = 0 \,\}$$
It is the set of all inputs that $T$ maps to zero - the "lost information" of the map.
Theorem 2.2.1. $\ker(T)$ is a subspace of $V$.
Proof:
- Zero: $T(0) = 0$, so $0 \in \ker(T)$.
- Closure under addition: If $u, v \in \ker(T)$, then $T(u + v) = T(u) + T(v) = 0 + 0 = 0$.
- Closure under scaling: If $v \in \ker(T)$, then $T(cv) = c\,T(v) = c \cdot 0 = 0$.
Geometric meaning. $\ker(T)$ is the subspace that "collapses to zero." If $T$ is the projection onto the $xy$-plane, then $\ker(T)$ is the $z$-axis. If $T$ is differentiation of polynomials, $\ker(T)$ is the constant polynomials.
Theorem 2.2.2 (Injectivity criterion). $T$ is injective (one-to-one) if and only if $\ker(T) = \{0\}$.
Proof. ($\Rightarrow$) If $T$ is injective and $T(v) = 0 = T(0)$, then $v = 0$. ($\Leftarrow$) If $\ker(T) = \{0\}$ and $T(u) = T(v)$, then $T(u - v) = 0$, so $u - v = 0$, i.e., $u = v$.
The dimension of the kernel, $\dim \ker(T)$, is called the nullity of $T$.
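A quick numeric illustration (a minimal NumPy sketch, not part of the original example set): the kernel of the projection onto the $xy$-plane can be read off from the SVD as the right singular vectors whose singular value is zero.

```python
import numpy as np

# Projection of R^3 onto the xy-plane (the geometric example above).
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])

# Right singular vectors with (numerically) zero singular value span ker(P).
U, s, Vt = np.linalg.svd(P)
kernel_basis = Vt[s < 1e-10]                 # rows spanning the null space

print(kernel_basis)                          # ~ [[0, 0, 1]] up to sign: the z-axis
print(np.allclose(P @ kernel_basis.T, 0))    # True: these vectors map to 0
```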
2.3 Image of a Linear Map
Definition (Image). The image (or range) of a linear map $T: V \to W$ is:
$$\operatorname{im}(T) = \{\, T(v) : v \in V \,\} \subseteq W$$
It is the set of all possible outputs - the "reachable" part of $W$.
Theorem 2.3.1. $\operatorname{im}(T)$ is a subspace of $W$.
Proof:
- Zero: $0 = T(0) \in \operatorname{im}(T)$.
- Closure under addition: $T(u) + T(v) = T(u + v) \in \operatorname{im}(T)$.
- Closure under scaling: $c\,T(v) = T(cv) \in \operatorname{im}(T)$.
Theorem 2.3.2. $\operatorname{im}(T)$ is the column space of the matrix $A$ representing $T$.
Proof: $\operatorname{im}(T) = \{Ax : x \in \mathbb{R}^n\} = \{x_1 a_1 + \cdots + x_n a_n : x_i \in \mathbb{R}\}$, which is exactly the span of the columns of $A$.
Theorem 2.3.3 (Surjectivity criterion). $T: V \to W$ is surjective (onto) if and only if $\operatorname{im}(T) = W$.
The dimension of the image, $\dim \operatorname{im}(T)$, is called the rank of $T$, denoted $\operatorname{rank}(T)$.
2.4 The Rank-Nullity Theorem
This is one of the most elegant and useful results in linear algebra, connecting the three fundamental dimensions of a linear map.
Theorem 2.4.1 (Rank-Nullity Theorem). Let $T: V \to W$ be a linear map with $V$ finite-dimensional. Then:
$$\dim V = \operatorname{rank}(T) + \operatorname{nullity}(T)$$
Proof. Let $\{u_1, \dots, u_k\}$ be a basis for $\ker(T)$ (so $\operatorname{nullity}(T) = k$). Extend this to a basis for all of $V$: $\{u_1, \dots, u_k, w_1, \dots, w_r\}$ where $k + r = \dim V$.
We claim $\{T(w_1), \dots, T(w_r)\}$ is a basis for $\operatorname{im}(T)$.
Spanning: For any $v \in V$, write $v = a_1 u_1 + \cdots + a_k u_k + b_1 w_1 + \cdots + b_r w_r$ for some scalars. Then $T(v) = b_1 T(w_1) + \cdots + b_r T(w_r)$, since the $u_i$ terms map to zero.
Linear independence: If $c_1 T(w_1) + \cdots + c_r T(w_r) = 0$, then $T(c_1 w_1 + \cdots + c_r w_r) = 0$, so $c_1 w_1 + \cdots + c_r w_r \in \ker(T)$ and is therefore a combination of the $u_j$. But $\{u_1, \dots, u_k, w_1, \dots, w_r\}$ is linearly independent, so all $c_i = 0$. Hence $\operatorname{rank}(T) = r$ and $\dim V = k + r$.
Intuition: The rank-nullity theorem says $T$ "uses" its input dimensions in two ways: some dimensions are collapsed to zero (nullity), and the rest are faithfully transmitted to the output (rank). Total in = lost + kept.
RANK-NULLITY THEOREM
========================================================================
  T: V -----------------------------------> W
     |
     +-- ker(T)  ---> {0}                   (nullity: collapsed)
     |
     +-- "rest"  ---> im(T) \subseteq W     (rank: transmitted)

  dim(V)   =   nullity(T)       +   rank(T)
  [total]      [collapsed dims]     [surviving dims]

  Example: T: \mathbb{R}^5 -> \mathbb{R}^3, rank 2
      -> nullity = 5 - 2 = 3
      -> 3 dimensions collapse, 2 survive
========================================================================
Example. $T: \mathbb{R}^n \to \mathbb{R}^3$ defined by $T(x) = Ax$ for a matrix $A$ with full row rank. Rank = 3 (full row rank). Nullity = $n - 3$. The null space is an $(n - 3)$-dimensional subspace of the domain.
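A numeric sanity check of rank-nullity, using an arbitrary full-row-rank $3 \times 4$ matrix (an illustrative choice, not the specific map of the example):

```python
import numpy as np

# A 3x4 matrix with full row rank (illustrative choice).
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [0.0, 0.0, 1.0, 1.0]])

n = A.shape[1]
rank = np.linalg.matrix_rank(A)

# Nullity = input dimension minus number of nonzero singular values.
s = np.linalg.svd(A, compute_uv=False)
nullity = n - np.count_nonzero(s > 1e-10)

print(rank, nullity, rank + nullity == n)   # 3 1 True
```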
For AI: In LoRA, a weight update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ has rank at most $r$. By rank-nullity, its kernel has dimension at least $d - r$. When $r \ll d$, the update leaves most of the input space unchanged - it only "speaks to" an $r$-dimensional subspace.
2.5 Examples and Non-Examples
Linear transformations:
| Map | Domain -> Codomain | Kernel | Image |
|---|---|---|---|
| $T(x) = Ax$ (matrix mult.) | $\mathbb{R}^n \to \mathbb{R}^m$ | null space of $A$ | column space of $A$ |
| $D(p) = p'$ (differentiation) | $P_n \to P_{n-1}$ | constants | all polynomials of degree $\le n - 1$ |
| $I(f)(x) = \int_0^x f(t)\,dt$ (integration) | $C[0,1] \to C[0,1]$ | $\{0\}$ | functions vanishing at 0 |
| $T(v) = 0$ (zero map) | $V \to W$ | all of $V$ | $\{0\}$ |
| $T(v) = v$ (identity) | $V \to V$ | $\{0\}$ | all of $V$ |
| $T(x, y) = (x, 0)$ (projection) | $\mathbb{R}^2 \to \mathbb{R}^2$ | $y$-axis | $x$-axis |
| $T(A) = \operatorname{tr}(A)$ (trace) | $\mathbb{R}^{n \times n} \to \mathbb{R}$ | trace-zero matrices | $\mathbb{R}$ |
Non-linear maps (and why they fail):
| Map | Linearity failure | Test |
|---|---|---|
| $f(x) = x + b$ ($b \neq 0$) | $f(0) = b \neq 0$ | Zero test |
| $\det: \mathbb{R}^{n \times n} \to \mathbb{R}$ | $\det(A + B) \neq \det(A) + \det(B)$ in general | Additivity |
| $f(x) = \lVert x \rVert$ (norm) | $\lVert cx \rVert = \lvert c \rvert\,\lVert x \rVert$ ... wait, this passes? No: it fails for $c < 0$, since $\lVert -x \rVert \neq -\lVert x \rVert$ | Homogeneity |
| $\mathrm{softmax}(x)$ | $\mathrm{softmax}(cx) \neq c\,\mathrm{softmax}(x)$; softmax outputs always sum to 1, so the output cannot scale | Homogeneity |
| $\mathrm{ReLU}(x) = \max(x, 0)$ | $\mathrm{ReLU}(x + y) \neq \mathrm{ReLU}(x) + \mathrm{ReLU}(y)$ in general | Additivity |
| $f(x) = x \odot x$ (elementwise square) | $f(cx) = c^2 f(x) \neq c\,f(x)$ | Homogeneity |
Note on ReLU: Though ReLU is not linear, it is piecewise linear - linear on each orthant. This means neural networks with ReLU are piecewise linear functions, which is a key fact for understanding their behavior.
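The failures in the table can be verified numerically. A minimal sketch (the test vectors and the constant $c$ are arbitrary choices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

relu = lambda x: np.maximum(x, 0.0)

x = np.array([1.0, -2.0, 0.5])
y = np.array([-1.0, 3.0, 2.0])
c = 2.5

# Zero test: translation by b != 0 sends 0 to b, not to 0.
b = np.array([1.0, 1.0, 1.0])
print(np.allclose(0.0 + b, 0.0))                            # False

# Homogeneity test: softmax(c x) != c * softmax(x).
print(np.allclose(softmax(c * x), c * softmax(x)))          # False

# Additivity test: ReLU(x + y) != ReLU(x) + ReLU(y) in general.
print(np.allclose(relu(x + y), relu(x) + relu(y)))          # False
```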
3. Matrix Representation of a Linear Map
3.1 The Standard Basis Construction
Over $\mathbb{R}^n$ and $\mathbb{R}^m$ with standard bases, every linear map $T: \mathbb{R}^n \to \mathbb{R}^m$ corresponds to a unique matrix $A \in \mathbb{R}^{m \times n}$.
Construction. The $j$-th column of $A$ is $T(e_j)$, the image of the $j$-th standard basis vector:
$$A = \big[\, T(e_1) \mid T(e_2) \mid \cdots \mid T(e_n) \,\big]$$
Why this works: For any $x = x_1 e_1 + x_2 e_2 + \cdots + x_n e_n$:
$$T(x) = x_1 T(e_1) + x_2 T(e_2) + \cdots + x_n T(e_n) = Ax$$
The matrix $A$ is the complete, coordinate-encoded description of $T$.
Example. Find the matrix for the counterclockwise rotation of $\mathbb{R}^2$ by angle $\theta$.
$T(e_1) = (\cos\theta, \sin\theta)$ and $T(e_2) = (-\sin\theta, \cos\theta)$, giving:
$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Example. Find the matrix for differentiation $D: P_3 \to P_2$ using the bases $\{1, x, x^2, x^3\}$ and $\{1, x, x^2\}$.
$D(1) = 0$, $D(x) = 1$, $D(x^2) = 2x$, $D(x^3) = 3x^2$. In the output basis:
$$[D] = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \end{bmatrix}$$
This is a $3 \times 4$ matrix, reflecting $\dim P_3 = 4$ and $\dim P_2 = 3$.
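A small sketch of this construction in NumPy, applying the differentiation matrix to the coefficient vector of an example polynomial (the polynomial is an arbitrary illustration):

```python
import numpy as np

# Matrix of d/dx from P_3 (basis 1, x, x^2, x^3) to P_2 (basis 1, x, x^2).
D = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])

# p(x) = 5 - x + 4x^2 + 2x^3 as a coefficient vector in the input basis.
p = np.array([5.0, -1.0, 4.0, 2.0])

# Coefficients of p'(x) = -1 + 8x + 6x^2 in the output basis.
print(D @ p)        # [-1.  8.  6.]
```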
3.2 Representation in Arbitrary Bases
When $V$ and $W$ have non-standard bases, the matrix of $T$ depends on those bases.
Setup. Let $\mathcal{B} = \{b_1, \dots, b_n\}$ be a basis for $V$ and $\mathcal{C} = \{c_1, \dots, c_m\}$ be a basis for $W$.
Coordinate vectors. For $v \in V$, write $v = \alpha_1 b_1 + \cdots + \alpha_n b_n$. The coordinate vector is $[v]_{\mathcal{B}} = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^n$.
The matrix of $T$ in bases $\mathcal{B}, \mathcal{C}$. Express each $T(b_j)$ in the basis $\mathcal{C}$:
$$T(b_j) = \sum_{i=1}^{m} a_{ij}\, c_i$$
The coefficients $a_{ij}$ form column $j$ of the matrix $[T]_{\mathcal{B}}^{\mathcal{C}}$, which satisfies:
$$[T(v)]_{\mathcal{C}} = [T]_{\mathcal{B}}^{\mathcal{C}}\, [v]_{\mathcal{B}}$$
The commutative diagram:
 v \in V ----------------T-----------------> T(v) \in W
    |                                             |
    | [*]_B (coord in B)                          | [*]_C (coord in C)
    v                                             v
 [v]_B ---------[T]_B^C (matrix mult.)------> [T(v)]_C
The matrix $[T]_{\mathcal{B}}^{\mathcal{C}}$ is the bridge between coordinates: it takes $\mathcal{B}$-coordinates of the input to $\mathcal{C}$-coordinates of the output.
Example. Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ be the map $T(x) = Ax$ with $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$. In the non-standard basis $\mathcal{B} = \{(1, 1), (1, -1)\}$:
$T(1, 1) = (3, 3) = 3\,(1, 1) + 0\,(1, -1)$, so column 1 is $(3, 0)$. $T(1, -1) = (1, -1) = 0\,(1, 1) + 1\,(1, -1)$, so column 2 is $(0, 1)$.
The matrix in this basis is diagonal: in the basis $\mathcal{B}$, $T$ acts as a simple scaling on each basis vector - exactly the diagonalization idea.
3.3 The Change-of-Basis Matrix
Definition. Let $\mathcal{B} = \{b_1, \dots, b_n\}$ and $\mathcal{B}' = \{b'_1, \dots, b'_n\}$ be two bases for the same space $V$. The change-of-basis matrix from $\mathcal{B}'$ to $\mathcal{B}$ is:
$$P = \big[\, [b'_1]_{\mathcal{B}} \mid [b'_2]_{\mathcal{B}} \mid \cdots \mid [b'_n]_{\mathcal{B}} \,\big]$$
Each column is the $\mathcal{B}$-coordinate vector of the corresponding new basis vector.
Key property: If $[v]_{\mathcal{B}'}$ are the $\mathcal{B}'$-coordinates of $v$, then the $\mathcal{B}$-coordinates are:
$$[v]_{\mathcal{B}} = P\, [v]_{\mathcal{B}'}$$
The change-of-basis formula for operators. If $A$ is the matrix of $T: V \to V$ in the basis $\mathcal{B}$, and $P$ is the change-of-basis matrix from $\mathcal{B}'$ to $\mathcal{B}$, then the matrix of $T$ in the basis $\mathcal{B}'$ is:
$$A' = P^{-1} A P$$
Derivation: $[T(v)]_{\mathcal{B}'} = P^{-1}\,[T(v)]_{\mathcal{B}} = P^{-1} A\, [v]_{\mathcal{B}} = P^{-1} A P\, [v]_{\mathcal{B}'}$.
Worked example. Let $T$ have matrix $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ in the standard basis. In the basis $\mathcal{B}' = \{(1, 1), (1, -1)\}$, the change-of-basis matrix is $P = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, and:
$$A' = P^{-1} A P = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}$$
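A numeric check of the change-of-basis formula, using the illustrative matrix and basis from the worked example above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # matrix of T in the standard basis
P = np.array([[1.0, 1.0],
              [1.0, -1.0]])          # columns are the new basis vectors (1,1), (1,-1)

A_prime = np.linalg.inv(P) @ A @ P
print(A_prime)                       # diag(3, 1) up to floating-point noise
```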
3.4 Similarity Transformations
Definition. Two matrices $A, B \in \mathbb{R}^{n \times n}$ are similar ($A \sim B$) if there exists an invertible $P$ such that $B = P^{-1} A P$.
Geometrically: $A$ and $B$ represent the same linear map in different bases.
Invariants under similarity (properties that don't change when you change basis):
| Invariant | Formula | Meaning |
|---|---|---|
| Eigenvalues | $\lambda_i(B) = \lambda_i(A)$ | Spectrum is basis-independent |
| Determinant | $\det(P^{-1} A P) = \det(A)$ | Volume scaling is basis-independent |
| Trace | $\operatorname{tr}(P^{-1} A P) = \operatorname{tr}(A)$ | Sum of eigenvalues |
| Rank | $\operatorname{rank}(P^{-1} A P) = \operatorname{rank}(A)$ | Dimension of image |
| Characteristic polynomial | $\det(tI - B) = \det(tI - A)$ | Entire spectrum |
Diagonalization is change of basis. When $A$ is diagonalizable with eigenvectors $v_1, \dots, v_n$, the matrix $P = [\, v_1 \mid \cdots \mid v_n \,]$ gives $P^{-1} A P = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. We are choosing the basis of eigenvectors, in which the map acts by simple scaling.
Forward reference: Eigenvalues and Eigenvectors
The full theory of diagonalization, the spectral theorem, and when a matrix is diagonalizable is developed in 01: Eigenvalues and Eigenvectors. Here we note only that diagonalization is a special case of change of basis.
4. Composition, Invertibility, and Isomorphisms
4.1 Composition is Matrix Multiplication
Let $T: U \to V$ and $S: V \to W$ be linear maps. Their composition $S \circ T: U \to W$ defined by $(S \circ T)(u) = S(T(u))$ is also linear.
Theorem 4.1.1. If $T$ has matrix $A$ (in standard bases) and $S$ has matrix $B$, then $S \circ T$ has matrix $BA$.
Proof: $(S \circ T)(x) = S(T(x)) = B(Ax) = (BA)x$.
This is the fundamental theorem connecting composition of functions to matrix multiplication. It explains:
- Non-commutativity: $S \circ T \neq T \circ S$ in general (the two orders may have different domains/codomains, and $AB \neq BA$ even for square matrices).
- Associativity: $(R \circ S) \circ T = R \circ (S \circ T)$ - matches matrix associativity $(CB)A = C(BA)$.
- Deep learning: A network $f = f_L \circ \cdots \circ f_2 \circ f_1$ is a composition. The forward pass computes this product. The backward pass (backprop) uses the chain rule, which is exactly the composition of Jacobians.
Collapse of linear-only networks. If $f(x) = W_L(\cdots(W_2(W_1 x)))$ (all layers linear, no activations):
$$f(x) = (W_L \cdots W_2 W_1)\, x = W_{\text{eff}}\, x$$
where $W_{\text{eff}} = W_L \cdots W_2 W_1$ is a single matrix. No depth benefit without nonlinearity.
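A quick NumPy demonstration of the collapse (random weights and an arbitrary input, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

# Forward pass of a 3-layer purely linear "network".
deep = W3 @ (W2 @ (W1 @ x))

# The same map as a single matrix.
W_eff = W3 @ W2 @ W1
print(np.allclose(deep, W_eff @ x))   # True: depth adds nothing without nonlinearity
```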
4.2 Injectivity, Surjectivity, and Bijectivity
Definitions. A linear map $T: V \to W$ is:
- Injective (one-to-one) if $T(u) = T(v)$ implies $u = v$
- Surjective (onto) if for every $w \in W$, there exists $v \in V$ with $T(v) = w$
- Bijective if both injective and surjective
Criteria via rank-nullity:
| Property | Condition | Matrix equivalent |
|---|---|---|
| Injective | $\ker(T) = \{0\}$, i.e., nullity = 0 | Columns of $A$ are linearly independent |
| Surjective | $\operatorname{im}(T) = W$, i.e., rank = $\dim W$ | Columns of $A$ span $\mathbb{R}^m$; full row rank |
| Bijective | Both; requires $\dim V = \dim W$ and full rank | $A$ is square and invertible |
Dimension constraints:
- If $\dim V < \dim W$: $T$ cannot be surjective (rank $\le \dim V < \dim W$).
- If $\dim V > \dim W$: $T$ cannot be injective (nullity $\ge \dim V - \dim W > 0$).
- If $\dim V = \dim W$ (finite): injective $\iff$ surjective $\iff$ bijective.
4.3 Isomorphisms
Definition. A bijective linear map $T: V \to W$ is called an isomorphism. If such a map exists, $V$ and $W$ are isomorphic, written $V \cong W$.
Isomorphic spaces are "the same" as vector spaces - they have the same algebraic structure. An isomorphism is a relabeling of elements that preserves all linear operations.
Theorem 4.3.1 (Fundamental Classification). Two finite-dimensional vector spaces over the same field are isomorphic if and only if they have the same dimension.
Consequence: Every $n$-dimensional vector space over $\mathbb{R}$ is isomorphic to $\mathbb{R}^n$. The space of polynomials of degree $\le n$ (dimension $n + 1$) is isomorphic to $\mathbb{R}^{n+1}$. The space $\mathbb{R}^{2 \times 2}$ of $2 \times 2$ matrices (dimension 4) is isomorphic to $\mathbb{R}^4$.
For AI: The residual stream of a transformer (a vector in $\mathbb{R}^{d_{\text{model}}}$) is isomorphic to any other $d_{\text{model}}$-dimensional space. Features are not inherently "in $\mathbb{R}^{d_{\text{model}}}$" - they live in an abstract vector space and $\mathbb{R}^{d_{\text{model}}}$ is just one coordinate representation. The linear representation hypothesis says this space has meaningful geometric structure independent of the coordinate system.
Properties of isomorphisms:
- The inverse $T^{-1}: W \to V$ is also an isomorphism.
- Isomorphism preserves all vector space structure: subspaces, linear independence, bases, dimension.
- The isomorphisms of a space to itself form a group under composition.
4.4 The Inverse Map
Theorem 4.4.1. If $T: V \to W$ is an isomorphism (bijective linear map), then $T^{-1}: W \to V$ is also a linear map.
Proof: For $w_1, w_2 \in W$ and scalars $a, b$, let $v_i = T^{-1}(w_i)$, so $T(v_i) = w_i$. Then:
$$T(a v_1 + b v_2) = a\,T(v_1) + b\,T(v_2) = a w_1 + b w_2$$
So $T^{-1}(a w_1 + b w_2) = a v_1 + b v_2 = a\,T^{-1}(w_1) + b\,T^{-1}(w_2)$.
Matrix inverse. If $T$ has matrix $A$ (square, full rank), then $T^{-1}$ has matrix $A^{-1}$.
Left and right inverses for non-square maps. When $T: \mathbb{R}^n \to \mathbb{R}^m$ is injective but not surjective ($n < m$), there is no two-sided inverse. However:
- Left inverse: a matrix $L$ with $LA = I_n$. Exists iff $A$ is injective (full column rank). Formula: $L = (A^\top A)^{-1} A^\top$ (left pseudo-inverse).
- Right inverse: a matrix $R$ with $AR = I_m$. Exists iff $A$ is surjective (full row rank). Formula: $R = A^\top (A A^\top)^{-1}$ (right pseudo-inverse).
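A minimal numeric check of the left inverse, assuming a random tall matrix (which has full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))           # tall: injective, not surjective

# Left pseudo-inverse: L A = I_3 because A has full column rank.
L = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(L @ A, np.eye(3)))      # True

# But A L != I_5: no right inverse exists, since A is not surjective.
print(np.allclose(A @ L, np.eye(5)))      # False
```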
Forward reference: Moore-Penrose Pseudo-Inverse
The general pseudo-inverse $A^+$, defined via SVD as $A^+ = V \Sigma^+ U^\top$, handles all cases uniformly and gives the least-squares solution when an exact solution doesn't exist. Full treatment in 02: Singular Value Decomposition.
4.5 The Four Fundamental Subspaces via Linear Maps
Every linear map $T: \mathbb{R}^n \to \mathbb{R}^m$ (with matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$) defines four fundamental subspaces:
| Subspace | Definition | Lives in | Dimension |
|---|---|---|---|
| Column space $C(A)$ | $\{Ax : x \in \mathbb{R}^n\}$ | $\mathbb{R}^m$ | $r$ |
| Null space $N(A)$ | $\{x : Ax = 0\}$ | $\mathbb{R}^n$ | $n - r$ |
| Row space $C(A^\top)$ | $\{A^\top y : y \in \mathbb{R}^m\}$ | $\mathbb{R}^n$ | $r$ |
| Left null space $N(A^\top)$ | $\{y : A^\top y = 0\}$ | $\mathbb{R}^m$ | $m - r$ |
Orthogonality relations (proven using the dual map - see 6):
$$C(A^\top) \perp N(A), \qquad C(A) \perp N(A^\top)$$
This splits $\mathbb{R}^n = C(A^\top) \oplus N(A)$ and $\mathbb{R}^m = C(A) \oplus N(A^\top)$.
THE FOUR FUNDAMENTAL SUBSPACES
========================================================================
  \mathbb{R}^n (domain)                   \mathbb{R}^m (codomain)
  +--------------------+                  +--------------------+
  |     row space      | ------T------->  |    column space    |
  |      (dim r)       |                  |      (dim r)       |
  +--------------------+                  +--------------------+
  |     null space     | ---T---> {0}     |   left null space  |
  |     (dim n-r)      |                  |      (dim m-r)     |
  +--------------------+                  +--------------------+
   <-> \perp                               <-> \perp
   orthogonal complement                   orthogonal complement
========================================================================
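All four subspaces can be computed from one SVD. A minimal NumPy sketch, using an illustrative rank-2 matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0, 1.0],
              [2.0, 4.0, 1.0, 3.0],
              [3.0, 6.0, 1.0, 4.0]])       # rank 2: row 3 = row 1 + row 2

U, s, Vt = np.linalg.svd(A)
r = np.count_nonzero(s > 1e-10)

col_space  = U[:, :r]        # basis of C(A)    in R^3, dim r
left_null  = U[:, r:]        # basis of N(A^T)  in R^3, dim m - r
row_space  = Vt[:r].T        # basis of C(A^T)  in R^4, dim r
null_space = Vt[r:].T        # basis of N(A)    in R^4, dim n - r

print(r, null_space.shape[1], left_null.shape[1])     # 2 2 1
# Orthogonality: row space ⊥ null space, column space ⊥ left null space.
print(np.allclose(row_space.T @ null_space, 0))        # True
print(np.allclose(col_space.T @ left_null, 0))         # True
```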
5. Special Classes of Linear Transformations
5.1 Projection Operators
Definition. A linear map $P: V \to V$ is a projection if it is idempotent: $P^2 = P$.
Idempotency means "doing it twice is the same as doing it once." Once you project onto a subspace, you're already there - projecting again does nothing.
Theorem 5.1.1. If $P$ is a projection, then:
- $\operatorname{im}(P) = \{v : P(v) = v\}$: the image of $P$ is the fixed-point set.
- $\ker(P) = \operatorname{im}(I - P)$: the kernel of $P$ is the image of the complementary projection.
- $V = \operatorname{im}(P) \oplus \ker(P)$ (direct sum decomposition).
Proof of last claim: Any $v$ decomposes as $v = P(v) + (v - P(v))$, where $P(v) \in \operatorname{im}(P)$ and $v - P(v) \in \ker(P)$ (since $P(v - P(v)) = P(v) - P^2(v) = 0$). Uniqueness follows from: if $v = a + b$ with $a \in \operatorname{im}(P)$, $b \in \ker(P)$, then $P(v) = P(a) + P(b) = a$.
Rank and trace. For a projection: $\operatorname{rank}(P) = \operatorname{tr}(P)$ (eigenvalues are only 0 and 1; trace = sum of eigenvalues = number of 1s = rank).
Orthogonal vs oblique projections. An orthogonal projection satisfies additionally $P^\top = P$ (i.e., the matrix is symmetric). In this case:
- $\ker(P) = \operatorname{im}(P)^\perp$ - the kernel is the orthogonal complement of the image.
- Formula: $P = A (A^\top A)^{-1} A^\top$ for projection onto $\operatorname{col}(A)$.
An oblique projection has $P^2 = P$ but $P^\top \neq P$ - it projects along a subspace that is not orthogonal to the target.
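A small numeric check of the orthogonal-projection properties, using an arbitrary $5 \times 2$ matrix whose columns span the target subspace:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))                  # columns span a 2D subspace of R^5

# Orthogonal projection onto col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))                     # idempotent: P^2 = P
print(np.allclose(P, P.T))                       # symmetric: orthogonal projection
print(np.isclose(np.trace(P), np.linalg.matrix_rank(P)))   # True: trace = rank = 2
```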
Forward reference: Orthogonal Projections
The full theory of orthogonal projections - including the projection formula, orthogonal decompositions, and the relationship to least squares - is in 05: Orthogonality and Orthonormality.
For AI: Attention heads in transformers apply projections onto query/key/value subspaces. The head-specific projection matrices $W_Q$, $W_K$, $W_V$ define low-dimensional subspaces of the residual stream. The attention mechanism then computes weighted combinations within these projected spaces. Projection is also central to layer normalization (projecting onto the unit sphere in the mean-subtracted subspace).
5.2 Rotations and Reflections
Orthogonal transformations are linear maps that preserve inner products:
$$\langle T(u), T(v) \rangle = \langle u, v \rangle \quad \text{for all } u, v$$
Equivalently, their matrices satisfy $Q^\top Q = I$, i.e., $Q^{-1} = Q^\top$.
The set of all $n \times n$ orthogonal matrices forms the orthogonal group $O(n)$.
Determinant splits them into two classes:
- $\det Q = +1$: proper rotations - preserve orientation. These form $SO(n)$, the special orthogonal group.
- $\det Q = -1$: improper rotations (reflections, or rotation + reflection) - reverse orientation.
2D rotation by angle $\theta$:
$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Properties: $R_\alpha R_\beta = R_{\alpha + \beta}$, $R_\theta^{-1} = R_{-\theta} = R_\theta^\top$.
3D rotations are parametrized by a unit axis $u$ and angle $\theta$ (Rodrigues' formula):
$$R = I + \sin\theta\, K + (1 - \cos\theta)\, K^2$$
where $K$ is the skew-symmetric matrix with $Kv = u \times v$.
Householder reflections. Given a unit vector $u$, the reflection through the hyperplane orthogonal to $u$ is:
$$H = I - 2 u u^\top$$
Properties: $H = H^\top$, $H^2 = I$, $\det H = -1$.
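A minimal NumPy check of the Householder properties (the vector $u$ is an arbitrary choice):

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)                   # unit normal of the reflecting hyperplane

H = np.eye(3) - 2.0 * np.outer(u, u)

print(np.allclose(H, H.T))                  # symmetric
print(np.allclose(H @ H, np.eye(3)))        # involution: H^2 = I
print(np.isclose(np.linalg.det(H), -1.0))   # det = -1: orientation-reversing
print(np.allclose(H @ u, -u))               # the normal direction is flipped
```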
For AI: Rotary Position Embedding (RoPE), used in LLaMA, GPT-NeoX, and Gemma, encodes positional information via rotation matrices applied blockwise to query and key vectors. The rotation angle depends on position, and relative positions appear as relative rotations, which interact cleanly with the dot-product attention score.
5.3 Shear and Scaling Maps
Diagonal scaling: $T(x) = Dx$ where $D = \operatorname{diag}(d_1, \dots, d_n)$. Scales each coordinate independently. Kernel $= \{0\}$ if all $d_i \neq 0$. $\det D = d_1 d_2 \cdots d_n$.
Shear maps in $\mathbb{R}^2$: the horizontal shear by factor $k$ is:
$$S_k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$$
$\det S_k = 1$ - shear preserves area. $S_k^{-1} = S_{-k}$.
Geometrically: the $x$-axis is fixed; each horizontal line slides by a distance proportional to its height. Parallelograms are tilted but area is preserved.
Elementary matrices (from row operations) are all either shears, row swaps, or scalings:
- Row scaling: $E$ is the identity with the $(i, i)$ entry replaced by $c \neq 0$ - multiply row $i$ by $c$
- Row swap: $E$ is the identity with rows $i$ and $j$ exchanged - swap rows $i$ and $j$
- Row addition: $E = I + c\, e_i e_j^\top$ - add $c$ times row $j$ to row $i$ (this is a shear)
Every invertible matrix is a product of elementary matrices - Gaussian elimination as composition of linear maps.
5.4 The Geometry of Low-Rank Maps
A rank-$r$ linear map $T: \mathbb{R}^n \to \mathbb{R}^m$ with $r < \min(m, n)$ collapses the domain: it maps all of $\mathbb{R}^n$ onto an $r$-dimensional subspace of $\mathbb{R}^m$, while sending an $(n - r)$-dimensional subspace (the null space) to zero.
Decomposition via SVD preview. Any rank-$r$ matrix $A$ admits:
$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top$$
This is a sum of rank-1 outer products. Each rank-1 term maps everything in the direction $v_i$ to the direction $u_i$, scaled by $\sigma_i$.
Forward reference: SVD and Low-Rank Approximation
The full theory of SVD - including the Eckart-Young theorem (best rank-$k$ approximation) and the geometric interpretation of singular values - is in 02: Singular Value Decomposition.
LoRA preview. A rank-$r$ update $\Delta W = BA$ (with $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll d$) is a composition of two low-rank linear maps:
- $A$ compresses the input to $r$ dimensions.
- $B$ expands back to $d_{\text{out}}$ dimensions.
The effective map $BA$ has rank at most $r$, so it only modifies the network's behavior along $r$ directions in the input space. The hypothesis underlying LoRA is that the task-relevant weight changes lie in a low-dimensional subspace - a statement directly about the geometry of linear maps.
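A sketch of the rank bound in NumPy, with hypothetical sizes $d = 64$, $r = 4$ (illustrative values, not any particular model's dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 4                                    # hypothetical width and LoRA rank

W  = rng.standard_normal((d, d))                # stand-in for a frozen weight matrix
B  = rng.standard_normal((d, r))
A  = rng.standard_normal((r, d))
dW = B @ A                                      # low-rank update

print(np.linalg.matrix_rank(dW))                # 4: rank(BA) <= r

# Vectors in ker(A) pass through the updated layer unchanged.
x_in_kernel = np.linalg.svd(A)[2][r:].T @ rng.standard_normal(d - r)
print(np.allclose((W + dW) @ x_in_kernel, W @ x_in_kernel))   # True
```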
6. Dual Spaces and Transposes
6.1 The Dual Space
Definition. Given a vector space $V$ over $F$, the dual space $V^*$ is the set of all linear maps from $V$ to $F$:
$$V^* = \mathcal{L}(V, F) = \{\, \varphi : V \to F \mid \varphi \text{ is linear} \,\}$$
Elements of $V^*$ are called linear functionals or covectors or one-forms.
$V^*$ is itself a vector space under pointwise addition $(\varphi + \psi)(v) = \varphi(v) + \psi(v)$ and scalar multiplication $(c\varphi)(v) = c\,\varphi(v)$.
The dual basis. If $\{v_1, \dots, v_n\}$ is a basis for $V$, then the dual basis $\{\varphi_1, \dots, \varphi_n\}$ is defined by:
$$\varphi_i(v_j) = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
The dual basis is a basis for $V^*$, so $\dim V^* = \dim V$.
Key example: Row vectors. In $\mathbb{R}^n$, a linear functional is any map $x \mapsto a^\top x$ for some fixed $a \in \mathbb{R}^n$. So $(\mathbb{R}^n)^* \cong \mathbb{R}^n$, but the identification (row vector) $\leftrightarrow$ (column vector) is coordinate-dependent. A row vector is intrinsically an element of $(\mathbb{R}^n)^*$, not of $\mathbb{R}^n$.
Why the distinction matters. When you write $a^\top x$, you're applying the dual vector $a^\top$ to the vector $x$. This is a pairing between a space and its dual - not a dot product of two vectors in the same space. In Riemannian geometry and general relativity, this distinction is essential. In ML, it matters for understanding gradients.
6.2 The Dual Map and the Transpose
Definition. Given $T: V \to W$, the dual map (or transpose) $T^*: W^* \to V^*$ is defined by:
$$T^*(\psi) = \psi \circ T, \qquad \text{i.e.,} \qquad (T^* \psi)(v) = \psi(T(v))$$
In words: to apply $T^* \psi$ to $v$, first apply $T$ to $v$, then apply $\psi$ to the result.
Matrix of the dual map. If $T$ has matrix $A$ in standard coordinates, then $T^*$ has matrix $A^\top$ (the matrix transpose). The use of "transpose" for both the dual map and the matrix operation is intentional and consistent.
Domain and codomain switch. $T: V \to W$ means $T^*: W^* \to V^*$. The transpose reverses the direction. This is why:
- Composition reverses order: $(S \circ T)^* = T^* \circ S^*$, matching $(BA)^\top = A^\top B^\top$.
- In backpropagation: if the forward pass multiplies by $W$, the backward pass multiplies by $W^\top$.
Kernel of the transpose = left null space:
$$\ker(A^\top) = \{\, y \in \mathbb{R}^m : A^\top y = 0 \,\} = \{\, y : y^\top A = 0 \,\}$$
This is the left null space of $A$, which is orthogonal to the column space $C(A)$.
Annihilators. The annihilator of a subspace $U \subseteq V$ is $U^0 = \{\varphi \in V^* : \varphi(u) = 0 \text{ for all } u \in U\}$. Key identities:
$$\ker(T^*) = (\operatorname{im} T)^0, \qquad \operatorname{im}(T^*) = (\ker T)^0$$
6.3 Gradients Live in the Dual Space
This connection between linear maps and gradients is one of the deepest in all of applied mathematics, and is directly relevant to understanding backpropagation.
The gradient as a linear functional. For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$, the derivative at $x$ is a linear functional $Df(x): \mathbb{R}^n \to \mathbb{R}$, defined by:
$$f(x + h) = f(x) + Df(x)[h] + o(\|h\|)$$
The derivative is an element of $(\mathbb{R}^n)^*$ - a covector, represented as a row vector $\left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]$.
The gradient vector $\nabla f(x)$ is obtained by identifying $(\mathbb{R}^n)^*$ with $\mathbb{R}^n$ via the standard inner product: $Df(x)[h] = \langle \nabla f(x), h \rangle$. This identification is coordinate-dependent: on a curved manifold (like a constraint surface or the space of probability distributions), gradients and vectors must be treated differently.
Backpropagation via dual maps. Consider a composed network $z = Wx$, $L = \ell(z)$ (one linear layer, loss $\ell$). The gradient with respect to $x$ is:
$$\nabla_x L = W^\top \nabla_z L$$
This is exactly the dual map $W^\top$ applied to the incoming gradient $\nabla_z L$. Backpropagation propagates gradients backward through the transpose of each weight matrix. The chain of transposes in the backward pass is the dual map composition: for a chain of linear layers $z = W_L \cdots W_2 W_1 x$,
$$\nabla_x L = W_1^\top W_2^\top \cdots W_L^\top\, \nabla_z L$$
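A minimal check with PyTorch's autograd (assuming PyTorch is available; the loss is an arbitrary scalar function of $z$): the gradient autograd places in `x.grad` equals the transpose of $W$ applied to the upstream gradient.

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 5)
x = torch.randn(5, requires_grad=True)

z = W @ x                      # forward pass through one linear layer
loss = z.pow(2).sum()          # an arbitrary scalar loss of z
loss.backward()

grad_z = (2 * z).detach()      # dL/dz for this particular loss
manual = W.T @ grad_z          # dual map: W^T applied to the upstream gradient

print(torch.allclose(x.grad, manual))   # True
```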
For AI: Modern deep learning frameworks (PyTorch, JAX) compute gradients using automatic differentiation, which is precisely the evaluation of dual maps via the chain rule. Every .backward() call accumulates contributions via transpose weight matrices - it is applying the adjoint (dual) of the forward linear map.