Appendix C: The Geometry of Linear Maps - A Deep Dive
C.1 How Linear Maps Deform Space
To deeply understand a linear map $T(x) = Ax$, we track how it deforms geometric objects.
Ellipsoids to ellipsoids. The image of the unit sphere under a full-rank map $A$ is an ellipsoid whose semi-axes have lengths equal to the singular values $\sigma_i$ of $A$, pointing in the directions of the left singular vectors $u_i$.
This is the geometric content of the SVD: $A = U \Sigma V^\top$ means:
- $V^\top$: rotate the input so the "natural input directions" (the right singular vectors $v_i$) align with the coordinate axes.
- $\Sigma$: stretch each axis by $\sigma_i$.
- $U$: rotate the output to place the stretched axes along the $u_i$ directions.
The sphere becomes an ellipsoid. The "shape" of the ellipsoid is completely described by the singular values.
Volume scaling. The volume of the image of a set $S$ under $A$ is $|\det A| \cdot \operatorname{vol}(S)$ (when $A$ is square). More precisely, $|\det A| = \sigma_1 \sigma_2 \cdots \sigma_n$ = the product of all singular values = the volume of the image of the unit cube.
For a rank-deficient map ($r < n$), the image has lower dimension $r$ - and its $n$-dimensional volume is 0. The map "collapses" $n$-dimensional space onto a lower-dimensional flat object.
Angles. Unless $A$ is orthogonal, linear maps change angles. The angle $\theta'$ between $Ax$ and $Ay$ satisfies $\cos\theta' = \dfrac{\langle Ax, Ay \rangle}{\|Ax\|\,\|Ay\|} = \dfrac{x^\top A^\top A\, y}{\|Ax\|\,\|Ay\|}$,
which in general differs from the angle between $x$ and $y$.
The matrix $A^\top A$ (the Gram matrix) determines how $A$ distorts inner products. The eigenvalues of $A^\top A$ are $\sigma_i^2$ (the squared singular values).
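A minimal numerical check of these claims (the matrix `A` below is an arbitrary random example, not one from the text): the image of the unit circle has semi-axes equal to the singular values, and $|\det A|$ equals their product.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))          # arbitrary full-rank 2x2 example

# Image of the unit circle under A
theta = np.linspace(0, 2 * np.pi, 10_000)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, N)
image = A @ circle

U, s, Vt = np.linalg.svd(A)

# Semi-axes of the image ellipse = singular values of A
print("max radius of image:", np.linalg.norm(image, axis=0).max())  # ~ s[0]
print("min radius of image:", np.linalg.norm(image, axis=0).min())  # ~ s[1]
print("singular values:    ", s)

# |det A| = product of singular values = area scaling factor
print("|det A| =", abs(np.linalg.det(A)), " prod(sigma) =", s.prod())
```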
C.2 Interpreting the Four Fundamental Subspaces Geometrically
Given $T(x) = Ax$ with matrix $A \in \mathbb{R}^{m \times n}$ and rank $r$:
The row space $C(A^\top) \subseteq \mathbb{R}^n$: This is the "input directions that survive" - the $r$-dimensional subspace of $\mathbb{R}^n$ that maps faithfully (injectively) onto the column space. Any $x$ in the row space is "noticed" by $T$.
The null space $N(A) \subseteq \mathbb{R}^n$: This is the "input directions that are killed" - the $(n-r)$-dimensional subspace of $\mathbb{R}^n$ that maps to zero. Any $x \in N(A)$ is "invisible" to $T$.
The decomposition $\mathbb{R}^n = C(A^\top) \oplus N(A)$: Every input splits uniquely as $x = x_{\text{row}} + x_{\text{null}}$, where $x_{\text{row}}$ is in the row space (the "signal" part) and $x_{\text{null}}$ is in the null space (the "noise" invisible to $T$).
The column space $C(A) \subseteq \mathbb{R}^m$: The $r$-dimensional subspace of $\mathbb{R}^m$ that $T$ can actually reach. Solutions to $Ax = b$ exist iff $b \in C(A)$.
The left null space $N(A^\top) \subseteq \mathbb{R}^m$: The $(m-r)$-dimensional orthogonal complement of $C(A)$ in $\mathbb{R}^m$. Directions in the left null space are unreachable by $T$.
COMPLETE PICTURE OF THE FOUR FUNDAMENTAL SUBSPACES
========================================================================
        R^n (domain)                              R^m (codomain)
---------------------------------------------------------
+---------------------+ T(x) = Ax +---------------------+
| row space | --------------> | column space |
| (dim = r) | isomorphism | (dim = r) |
| | | |
| ------------- | | ------------- |
| | | |
| null space | --------------> | left null space |
| (dim = n-r) | maps to 0 | (dim = m-r) |
+---------------------+ +---------------------+
 (orthogonal complements)                 (orthogonal complements)
Every x = (row space part) + (null space part)
T sees only the row space part.
========================================================================
For AI (linear systems / least squares): When fitting a model with more constraints than parameters ($m > n$), the system is overdetermined. An exact solution exists only if $b \in C(A)$. If not, the least-squares solution minimizes $\|Ax - b\|^2$ - it finds the projection of $b$ onto $C(A)$ and then solves exactly within the row space.
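A short sketch of this projection view of least squares, using a small synthetic overdetermined system (an illustrative assumption, not data from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))     # overdetermined: m=10 constraints, n=3 parameters
b = rng.normal(size=10)          # generic b, almost surely not in C(A)

# Least-squares solution: minimizes ||Ax - b||^2
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# A @ x_ls is the orthogonal projection of b onto the column space C(A)
residual = b - A @ x_ls
print("residual orthogonal to C(A)? ", np.allclose(A.T @ residual, 0))

# Same projection computed directly from an orthonormal basis of C(A) (via SVD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
proj_b = U @ (U.T @ b)
print("A x_ls equals projection of b:", np.allclose(A @ x_ls, proj_b))
```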
C.3 Linear Maps and Information Theory
The rank-nullity theorem has an information-theoretic interpretation.
Rank = information preserved. A linear map of rank $r$ preserves at most $r$ "dimensions" of information from the input. The remaining $n - r$ dimensions are destroyed.
Mutual information. For a Gaussian input $x \sim \mathcal{N}(0, I)$ and output $y = Ax + \varepsilon$ with isotropic Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$, the mutual information
$I(x; y) = \tfrac{1}{2} \sum_i \log\!\left(1 + \sigma_i^2 / \sigma_\varepsilon^2\right)$
depends only on the singular values $\sigma_i$ of $A$ - not on the specific directions. The null space of $A$ contributes zero mutual information.
Compression. If we want to compress $x \in \mathbb{R}^n$ to $z = Wx \in \mathbb{R}^k$ via a linear map $W$, the retained information is maximized by projecting onto the directions of largest variance. For structured data with covariance $\Sigma_x$, the optimal linear compression is PCA - projecting onto the top $k$ eigenvectors of $\Sigma_x$.
This is why PCA is the optimal linear compressor under mean-squared error: it maximizes the retained variance (information) for any fixed rank $k$.
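A quick sketch comparing the variance retained by the PCA projection against a random rank-$k$ projection; the synthetic anisotropic data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 5000, 20, 5
scales = 1.0 / np.arange(1, d + 1)
X = rng.normal(size=(n, d)) * scales          # zero-mean data with a decaying spectrum

Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending order
top_k = eigvecs[:, ::-1][:, :k]               # top-k eigenvectors of the covariance

def retained_variance(W):
    """Variance of the data captured by projecting onto the columns of W."""
    Z = X @ W
    return np.trace(np.cov(Z, rowvar=False))

W_rand, _ = np.linalg.qr(rng.normal(size=(d, k)))   # a random k-dimensional subspace
print("total variance: ", np.trace(Sigma))
print("PCA retains:    ", retained_variance(top_k))
print("random retains: ", retained_variance(W_rand))
```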
C.4 Generalization of Linear Maps: Tensors
The concept of a linear map generalizes to multilinear maps and tensors in ways that are directly relevant to deep learning.
Bilinear maps and matrices of bilinear forms. A bilinear form $B: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ can be written as $B(x, y) = x^\top M y$ for a matrix $M$. The bilinear form is:
- Symmetric if $M = M^\top$: then $B(x, y) = B(y, x)$.
- Positive definite if $B(x, x) > 0$ for $x \neq 0$: inner products are exactly the positive definite symmetric bilinear forms.
Multilinear maps. A $k$-linear map is linear in each argument separately. The space of $k$-linear maps on $\mathbb{R}^n$ is the space of tensors of order $k$.
For AI: The attention score between tokens, $q_i^\top k_j = x_i^\top W_Q^\top W_K\, x_j$, is a bilinear form parameterized by $W_Q^\top W_K$. Understanding this bilinear form via its SVD reveals what "patterns" each attention head is sensitive to: the left singular vectors are "what queries to look for" and the right singular vectors are "what keys are being matched against."
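A minimal sketch of this analysis with randomly initialized (hypothetical) `W_Q` and `W_K`; the point is that the QK circuit $W_Q^\top W_K$ has rank at most the head dimension, and its SVD exposes the matched query/key directions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_head = 64, 8
W_Q = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)   # hypothetical weights
W_K = rng.normal(size=(d_head, d_model)) / np.sqrt(d_model)

# score(x_i, x_j) = (W_Q x_i) . (W_K x_j) = x_i^T (W_Q^T W_K) x_j
M = W_Q.T @ W_K                      # the "QK circuit": d_model x d_model, rank <= d_head

U, s, Vt = np.linalg.svd(M)
print("rank of QK circuit:", np.linalg.matrix_rank(M))   # at most d_head = 8
print("top singular values:", s[:10].round(3))
# Columns of U: "query patterns" the head looks for;
# rows of Vt:   "key patterns" it matches against.
```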
Appendix D: Computational Methods and Numerical Considerations
D.1 Computing the Kernel via Row Reduction
Given $A \in \mathbb{R}^{m \times n}$, finding a basis for $\ker(A)$ requires solving $Ax = 0$.
Algorithm (Gaussian Elimination to RREF):
- Apply row operations to reduce $A$ to reduced row echelon form (RREF).
- Identify pivot columns (columns with leading 1s in the RREF) and free columns (all other columns).
- For each free variable, set it to 1 and all other free variables to 0, then solve for the pivot variables.
- Each such solution is one basis vector for $\ker(A)$.
Example. Take a matrix $A$ with four columns whose RREF has pivots in columns 1 and 4.
Pivot columns: 1 and 4. Free variables: $x_2$ and $x_3$.
Setting $x_2 = 1, x_3 = 0$ and solving for the pivot variables $x_1, x_4$ gives one basis vector.
Setting $x_2 = 0, x_3 = 1$ and solving again gives a second basis vector.
Null space basis: these two vectors. Nullity = 2.
Rank-nullity: $n = 4$, rank = 2 (two pivots), nullity = 2. Check: $2 + 2 = 4$. OK
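A runnable version of this procedure, using SymPy's exact RREF on a small hypothetical matrix (chosen for illustration, not the matrix from the example above) whose pivots also land in columns 1 and 4:

```python
import sympy as sp

# Hypothetical 2x4 matrix with pivots in columns 1 and 4 (illustrative choice)
A = sp.Matrix([[1, 2, 3, 0],
               [0, 0, 0, 1]])

rref, pivot_cols = A.rref()
print("RREF:", rref)                  # this A is already in RREF
print("pivot columns:", pivot_cols)   # (0, 3) -> columns 1 and 4

basis = A.nullspace()                 # one basis vector per free variable
for v in basis:
    print(v.T)                        # (-2, 1, 0, 0) and (-3, 0, 1, 0)

# Rank-nullity check: rank + nullity = number of columns
print(A.rank() + len(basis) == A.cols)   # True: 2 + 2 == 4
```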
D.2 Numerical Stability of Basis Computations
Computing the null space or column space numerically requires care because floating-point arithmetic can introduce small errors.
The SVD approach (recommended). Instead of row reduction, compute the SVD: $A = U \Sigma V^\top$. Then:
- $\ker(A)$ = span of the columns of $V$ corresponding to zero singular values (or singular values below a threshold $\tau$).
- $\operatorname{im}(A)$ = span of the columns of $U$ corresponding to nonzero singular values.
The SVD-based approach is numerically stable because the bases it produces ($U$ and $V$) are orthonormal and hence well-conditioned.
Numerical rank. For floating-point matrices, "zero" singular values appear as small but nonzero values. The numerical rank with threshold $\tau$ is $\operatorname{rank}_\tau(A) = \#\{\, i : \sigma_i > \tau \,\}$.
A common choice is $\tau = \max(m, n) \cdot \varepsilon \cdot \sigma_1$, where $\varepsilon$ is machine epsilon (about $2.2 \times 10^{-16}$ for double precision). numpy.linalg.matrix_rank uses a default threshold of this form.
Why this matters: In practice, a matrix with theoretical rank $r$ may appear numerically full-rank due to measurement noise. The SVD reveals the "intrinsic" rank through a gap in the singular values.
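A sketch of numerical rank detection on a synthetic noisy low-rank matrix (the sizes and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 100, 50, 5
# Exactly rank-r matrix plus small measurement noise
A_clean = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A = A_clean + 1e-8 * rng.normal(size=(m, n))

s = np.linalg.svd(A, compute_uv=False)
print("singular values 1-8:", s[:8])          # clear gap after the 5th

# With the default eps-based tolerance, the noise makes the matrix look full-rank
print("matrix_rank (default tol):", np.linalg.matrix_rank(A))     # 50

# A threshold chosen above the noise floor recovers the intrinsic rank
tau = 1e-6 * s[0]
print("numerical rank (tau above noise):", int(np.sum(s > tau)))  # 5

# Basis for the (numerical) null space: right singular vectors below the threshold
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[np.sum(s > tau):].T
print("null space dimension:", null_basis.shape[1])                # 45
```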
D.3 Efficient Change of Basis Computations
Naive approach: Compute $A' = P^{-1} A P$ directly. For $n \times n$ matrices, this costs $O(n^3)$ (two matrix multiplications plus one matrix inversion).
Better approach when $P$ is orthogonal: If $P$ is orthogonal ($P^{-1} = P^\top$), then $A' = P^\top A P$ still costs $O(n^3)$ but with better constants than the general case (no matrix inversion needed).
Eigendecomposition case: When $A = PDP^{-1}$, computing $A^k x$ requires only $O(n)$ operations to compute $D^k$ (raise each diagonal entry to the $k$-th power), plus two matrix-vector multiplications, by $P^{-1}$ and then $P$.
For AI: Computing $e^{At}$ (the matrix exponential, important for continuous-time state space models like Mamba/S4) is done by diagonalizing: $e^{At} = P e^{Dt} P^{-1}$, where $e^{Dt}$ is diagonal with entries $e^{\lambda_i t}$.
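A minimal check of this diagonalization route against `scipy.linalg.expm`, on a random (hence almost surely diagonalizable) matrix:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
n, t = 4, 0.7
A = rng.normal(size=(n, n))          # generic matrix: diagonalizable with probability 1

# e^{At} via eigendecomposition: A = P D P^{-1}  =>  e^{At} = P e^{Dt} P^{-1}
eigvals, P = np.linalg.eig(A)        # complex in general
expAt_diag = (P @ np.diag(np.exp(eigvals * t)) @ np.linalg.inv(P)).real

# Reference: scipy's matrix exponential
expAt_ref = expm(A * t)
print(np.allclose(expAt_diag, expAt_ref))   # True
```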
D.4 The Rank-Revealing QR Decomposition
Standard QR decomposition doesn't directly reveal rank. The rank-revealing QR (RRQR) uses column pivoting:
$A \Pi = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$, where $\Pi$ is a permutation matrix and $R_{22}$ is "small" (its Frobenius norm bounds how far $A$ is from rank $k$). The first $k$ columns of $Q$ form an (approximate) basis for $C(A)$.
RRQR is preferred over the SVD when only a basis for the column space (not the singular values themselves) is needed, as it is typically a few times faster.
Appendix E: Connections to Other Fields
E.1 Linear Maps in Physics
In quantum mechanics, operators on Hilbert spaces are infinite-dimensional linear maps. The Hamiltonian $\hat{H}$, momentum $\hat{p}$, and position $\hat{x}$ are linear operators. The Schrödinger equation $i\hbar\, \partial_t |\psi\rangle = \hat{H}|\psi\rangle$ is a linear ODE on the Hilbert space of quantum states.
The spectral theorem for self-adjoint operators (the quantum generalization of symmetric matrix diagonalization) guarantees that observables have real eigenvalues (the possible measurement outcomes) and that the eigenfunctions form a complete orthonormal basis.
For AI: Transformers share surprising mathematical parallels with quantum mechanics: both involve attention-like mechanisms (inner products of states), superposition (linear combinations of basis states), and entanglement-like correlations. The linear algebra of quantum mechanics and of transformers both live in the framework of linear maps on Hilbert spaces.
E.2 Linear Maps in Topology: Homology
In algebraic topology, chain complexes are sequences of vector spaces connected by linear maps: $\cdots \to C_{k+1} \xrightarrow{\partial_{k+1}} C_k \xrightarrow{\partial_k} C_{k-1} \to \cdots$
where $\partial_k \circ \partial_{k+1} = 0$ (the boundary of a boundary is zero - exactly the condition $\operatorname{im}(\partial_{k+1}) \subseteq \ker(\partial_k)$). The homology groups $H_k = \ker(\partial_k)/\operatorname{im}(\partial_{k+1})$ measure "holes" in topological spaces.
Persistent homology, used in topological data analysis (TDA), applies this to point cloud data to find features that persist across scales. It's used in ML for analyzing data manifolds, protein structure prediction, and understanding neural network loss landscapes.
E.3 Linear Maps in Signal Processing
The Discrete Fourier Transform (DFT) is a linear map $F: \mathbb{C}^n \to \mathbb{C}^n$ with matrix entries $F_{jk} = \tfrac{1}{\sqrt{n}}\, e^{-2\pi i jk/n}$. The DFT matrix is unitary ($F^* F = I$).
Convolution is linear - convolving a signal with a kernel is a linear map - and in the Fourier domain it becomes pointwise multiplication. This is the key to making CNNs efficient: convolution is a structured linear map with shared weights (translation equivariance), and the Fourier transform diagonalizes the convolution operator.
For AI: The fast Fourier transform (FFT) is $O(n \log n)$ instead of $O(n^2)$ for the full DFT matrix multiply, by exploiting the structure (sparsity in a different basis) of the DFT linear map. Similarly, FlashAttention speeds up attention by exploiting the structure of the attention linear map to minimize memory bandwidth.
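A small sketch showing that the DFT diagonalizes circular convolution - the $O(n^2)$ direct sum and the $O(n \log n)$ FFT route agree (random signal and kernel are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 256
x = rng.normal(size=n)        # signal
k = rng.normal(size=n)        # kernel (same length, circular convolution)

# Direct circular convolution: O(n^2)
direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n)) for i in range(n)])

# Fourier route: convolution becomes pointwise multiplication, O(n log n)
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, via_fft))   # True
```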
Appendix F: The Algebra of Linear Maps - Structural Results
F.1 The Space of Linear Maps is a Vector Space
We noted briefly that $\mathcal{L}(V, W)$, the set of all linear maps from $V$ to $W$, is a vector space. Let's make this precise and compute its dimension.
Operations: For $S, T \in \mathcal{L}(V, W)$ and a scalar $c$: $(S + T)(v) = S(v) + T(v)$ and $(cT)(v) = c\,T(v)$.
- The zero element is the zero map $0(v) = 0$ for all $v$.
Dimension. If $\dim V = n$ and $\dim W = m$, then $\dim \mathcal{L}(V, W) = mn$.
Proof: Every linear map is determined by the $n$ images of the basis vectors of $V$, each a vector in the $m$-dimensional space $W$. The natural isomorphism is $\mathcal{L}(V, W) \cong \mathbb{R}^{m \times n}$ (as vector spaces), which is exactly the identification of linear maps with $m \times n$ matrices.
Basis for $\mathcal{L}(V, W)$. The standard basis consists of the maps $E_{ij}$ defined by $E_{ij}(v_j) = w_i$ and $E_{ij}(v_k) = 0$ for $k \neq j$. In matrix form, $E_{ij}$ is the matrix with 1 in position $(i, j)$ and 0 elsewhere.
F.2 Composition Gives an Algebra Structure
When $V = W$, linear maps can be composed. The space $\mathcal{L}(V) := \mathcal{L}(V, V)$ (endomorphisms of $V$) is a ring under composition (it is also a vector space - together, an algebra).
Properties of composition in $\mathcal{L}(V)$:
- Associative: $(R \circ S) \circ T = R \circ (S \circ T)$
- Identity: $\mathrm{id} \circ T = T \circ \mathrm{id} = T$
- Distributive: $R \circ (S + T) = R \circ S + R \circ T$ and $(S + T) \circ R = S \circ R + T \circ R$
- NOT commutative: $S \circ T \neq T \circ S$ in general
Matrix polynomials. For $T \in \mathcal{L}(V)$, we can form $p(T) = a_k T^k + \cdots + a_1 T + a_0\,\mathrm{id}$ for any polynomial $p$. This is well-defined because we can add and compose linear maps.
Cayley-Hamilton theorem. Every linear operator satisfies its own characteristic polynomial: $p_A(A) = 0$, where $p_A(\lambda) = \det(\lambda I - A)$.
For AI: The spectral approach to recurrent networks analyzes the long-run behavior of $W^k$ as $k \to \infty$. If $W$ has eigenvalues with $|\lambda_i| < 1$ for all $i$, then $W^k \to 0$ (stable memory decay). If any $|\lambda_i| > 1$, the recurrence explodes. This spectral stability analysis is the foundation of designing stable RNNs (LSTM, GRU use gating to control the effective eigenvalue spectrum of the recurrence).
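A quick sketch of this stability criterion: the same random recurrence matrix (an arbitrary example), rescaled to spectral radius 0.9 and 1.1, produces decaying and exploding iterates respectively:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
W = rng.normal(size=(n, n)) / np.sqrt(n)      # random recurrence matrix
rho = max(abs(np.linalg.eigvals(W)))          # spectral radius

for scale, label in [(0.9 / rho, "stable"), (1.1 / rho, "exploding")]:
    M = scale * W                             # rescale so the spectral radius is 0.9 or 1.1
    h = np.ones(n)
    for _ in range(200):                      # iterate h_{t+1} = M h_t
        h = M @ h
    print(f"{label}: spectral radius {max(abs(np.linalg.eigvals(M))):.2f}, "
          f"||h_200|| = {np.linalg.norm(h):.3e}")
```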
F.3 Quotient Maps and Projections
The quotient space. Given $T: V \to W$ with kernel $K = \ker(T)$, the quotient space $V/K$ consists of the equivalence classes $v + K = \{v + k : k \in K\}$.
$V/K$ is a vector space of dimension $\dim V - \dim K$.
The first isomorphism theorem. Every linear map $T: V \to W$ factors as $T = \bar{T} \circ \pi$,
where $\pi: V \to V/\ker(T)$ is the quotient map ($\pi(v) = v + \ker(T)$) and $\bar{T}$ is an isomorphism from $V/\ker(T)$ to $\operatorname{im}(T)$.
This is the coordinate-free statement of the rank-nullity theorem: $\dim V - \dim \ker(T) = \dim \operatorname{im}(T)$.
Geometric meaning. The quotient map $\pi$ "collapses" the null space to a point; the induced map $\bar{T}$ then acts faithfully (injectively) on the resulting space. Any linear map splits into: collapse (project out the null space) + inject faithfully into the codomain.
For AI: In contrastive learning (SimCLR, MoCo), the projection head maps representations to a lower-dimensional space. This is a linear (or nonlinear) quotient map - it deliberately collapses some dimensions (those corresponding to nuisance factors like image augmentation) while preserving the semantically meaningful directions. The first isomorphism theorem says: what survives in the image is exactly what was not collapsed.
F.4 Dual Bases and the Canonical Isomorphism
We saw that $\dim V^* = \dim V$, so $V \cong V^*$. But this isomorphism is non-canonical - it depends on the choice of basis.
With an inner product. When $V$ is an inner product space (like $\mathbb{R}^n$ with the standard dot product), there is a canonical isomorphism $V \to V^*$ defined by $v \mapsto \langle v, \cdot \rangle$.
That is, $v$ is sent to the linear functional that takes $w \mapsto \langle v, w \rangle$.
This isomorphism is canonical because it doesn't depend on any choice of basis - it uses only the inner product structure. When we identify the gradient as a column vector (primal) rather than a row vector (dual), we are implicitly using this canonical isomorphism via the standard inner product.
For AI: On non-Euclidean spaces (manifolds of probability distributions, manifolds of neural network weights under the Fisher metric), the identification $V \cong V^*$ is NO longer trivial - gradients (covectors) and velocity vectors live in different spaces. The natural gradient method corrects for this by using the Fisher information matrix $F$ as the metric: $\tilde{\nabla} L = F^{-1} \nabla L$. This maps the gradient (a covector) to a tangent vector using the Riemannian metric instead of the Euclidean metric.
Appendix G: Linear Maps in Practice - Worked Problems
G.1 Verifying Linearity: Systematic Approach
Problem: Is $T: \mathbb{R}^{2 \times 2} \to \mathbb{R}$ defined by $T(A) = \operatorname{tr}(A)$ linear?
Check additivity: $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$. OK
Check homogeneity: $\operatorname{tr}(cA) = c\,\operatorname{tr}(A)$. OK
Conclusion: $T$ is linear. Its matrix (viewing $\mathbb{R}^{2 \times 2} \cong \mathbb{R}^4$ with basis $E_{11}, E_{12}, E_{21}, E_{22}$):
$[T] = (1, 0, 0, 1)$ as a $1 \times 4$ matrix. Kernel = $\{A : \operatorname{tr}(A) = 0\}$ (trace-zero matrices), dimension 3.
Problem: Is $T: \mathbb{R}^n \to \mathbb{R}$ defined by $T(x) = \|x\|$ linear?
Check homogeneity: $T(cx) = \|cx\| = |c|\,\|x\|$. For $c \geq 0$ this equals $c\,\|x\|$, but linearity requires $T(cx) = c\,T(x)$ for all scalars $c$.
For $c = -1$: $T(-x) = \|x\|$ but $-T(x) = -\|x\|$. So $T(-x) \neq -T(x)$ whenever $x \neq 0$.
Conclusion: $T$ is NOT linear (the norm fails homogeneity due to the absolute value).
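These checks can also be automated numerically. The sketch below spot-checks additivity and homogeneity at random points (a necessary but not sufficient test; the two example maps mirror the trace and norm problems above):

```python
import numpy as np

def looks_linear(T, dim, trials=100, rng=np.random.default_rng(8)):
    """Spot-check additivity and homogeneity at random points."""
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        c = rng.normal()
        if not np.allclose(T(x + y), T(x) + T(y)):
            return False
        if not np.allclose(T(c * x), c * T(x)):
            return False
    return True

trace_map = lambda v: np.trace(v.reshape(2, 2))   # T(A) = tr(A) on flattened 2x2 matrices
norm_map = lambda x: np.linalg.norm(x)            # T(x) = ||x||

print(looks_linear(trace_map, 4))   # True  (trace is linear)
print(looks_linear(norm_map, 4))    # False (norm is not additive; fails homogeneity for c < 0)
```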
G.2 Finding the Kernel: Four Approaches
For a matrix $A$ whose rows are multiples of one another (so $\operatorname{rank}(A) = 1$), find $\ker(A)$:
Approach 1: Row reduction. The RREF has a single pivot row; every non-pivot column gives a free variable, and each free variable yields one basis vector of $\ker(A)$, so $\dim \ker(A) = n - 1$.
Approach 2: Inspection. Spot linear relations among the columns directly: every relation $c_1 a_1 + \cdots + c_n a_n = 0$ between the columns $a_i$ gives the null space vector $(c_1, \dots, c_n)$.
Approach 3: SVD. Compute SVD of ; null space vectors are the right singular vectors with zero (or near-zero) singular values.
Approach 4: Random sampling + orthogonalization. Sample many vectors, project out the row space, keep those with zero image (useful when $A$ is very large).
G.3 Composition of Transforms in a Graphics Pipeline
A 3D object is processed through a graphics pipeline using compositions of affine maps:
- Model matrix : transform from object coordinates to world coordinates (rotation, scale, translation).
- View matrix : transform from world coordinates to camera coordinates (rotation + translation).
- Projection matrix : from camera coordinates to clip coordinates (perspective projection).
The combined transform: $v_{\text{clip}} = P\, V\, M\, v_{\text{object}}$ (in homogeneous coordinates).
Composition order matters. Reading right to left: first apply $M$, then $V$, then $P$. The matrix product $PVM$ can be precomputed once per frame (not per vertex), so each vertex needs a single matrix-vector multiplication instead of three.
This is the same principle as "avoid recomputing shared prefixes" in transformer KV-caching: the cache stores the projected keys and values $K = XW_K$ and $V = XW_V$ for all past tokens, so they don't need to be recomputed when generating each new token.
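A small sketch of composing affine maps in homogeneous coordinates; the pipeline matrices below are placeholder values, not those of any real renderer:

```python
import numpy as np

def affine_4x4(R=np.eye(3), t=np.zeros(3)):
    """Pack a 3x3 linear part and a translation into a 4x4 homogeneous matrix."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M

# Hypothetical pipeline matrices (illustrative stand-ins)
model = affine_4x4(R=2.0 * np.eye(3), t=np.array([1.0, 0.0, 0.0]))   # scale + translate
view  = affine_4x4(t=np.array([0.0, 0.0, -5.0]))                     # camera translation
proj  = np.eye(4)                                                    # identity stand-in

MVP = proj @ view @ model            # precompute once per frame
v = np.array([1.0, 1.0, 1.0, 1.0])   # a vertex in homogeneous coordinates

# One matrix-vector product per vertex instead of three
assert np.allclose(MVP @ v, proj @ (view @ (model @ v)))
print(MVP @ v)
```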
Appendix H: Linear Maps in Optimization and Training
H.1 The Gradient as a Linear Map
In optimization, we minimize a loss function $L: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla L(\theta)$ tells us the direction of steepest ascent. But more precisely:
The directional derivative of $L$ at $\theta$ in direction $v$ is $D_v L(\theta) = \lim_{h \to 0} \dfrac{L(\theta + hv) - L(\theta)}{h} = \langle \nabla L(\theta), v \rangle.$
This is a linear functional in $v$: it is the dual vector (covector) $dL_\theta \in (\mathbb{R}^n)^*$.
Gradient descent in its pure form: $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$. This uses the Euclidean identification of the gradient (a covector) with a primal vector.
Natural gradient descent: $\theta_{t+1} = \theta_t - \eta F^{-1} \nabla L(\theta_t)$, where $F$ is the Fisher information matrix. This uses the correct metric on the manifold of probability distributions (the Fisher-Rao metric) to convert the covector gradient to a tangent vector.
H.2 The Hessian as a Bilinear Map
The Hessian is the matrix of second derivatives $H_{ij} = \dfrac{\partial^2 L}{\partial \theta_i\, \partial \theta_j}$.
But more abstractly, the Hessian is a symmetric bilinear form $(u, v) \mapsto u^\top H v$ - the second-order term in the Taylor expansion of $L$.
The Hessian determines the curvature of the loss landscape:
- Positive definite Hessian ($v^\top H v > 0$ for all $v \neq 0$): the point is a local minimum (given zero gradient).
- Indefinite Hessian (has both positive and negative eigenvalues): the point is a saddle point.
- The ratio of largest to smallest eigenvalue, $\kappa = \lambda_{\max} / \lambda_{\min}$, is the condition number - it governs how slowly gradient descent converges.
For AI: Modern optimizers (Adam, AdaGrad) approximate Hessian-related quantities. Adam's second moment estimate $v_t$ approximates diagonal curvature information. Dividing the gradient by $\sqrt{v_t}$ is a crude diagonal approximation to Newton's method (which divides by the Hessian, i.e., multiplies by $H^{-1}$). This is why Adam often converges much faster than SGD on ill-conditioned problems.
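A toy illustration of the conditioning claim: on a quadratic loss with condition number 100, plain gradient descent crawls along the low-curvature direction while a full Newton step (dividing by the Hessian) converges immediately. The matrix and step size are illustrative assumptions:

```python
import numpy as np

# Quadratic loss L(x) = 1/2 x^T H x with an ill-conditioned Hessian
H = np.diag([100.0, 1.0])            # condition number kappa = 100
x_gd = x_newton = np.array([1.0, 1.0])
lr = 1.0 / 100.0                     # step size limited by the largest eigenvalue

for _ in range(100):
    x_gd = x_gd - lr * (H @ x_gd)                           # plain gradient descent
    x_newton = x_newton - np.linalg.inv(H) @ (H @ x_newton)  # Newton step H^{-1} grad

print("GD after 100 steps:    ", x_gd)      # still far from 0 along the lambda=1 direction
print("Newton after 100 steps:", x_newton)  # converged after the first step
```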
H.3 Weight Matrices as Linear Maps: Training Dynamics
The neural tangent kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide neural networks and shows that their training dynamics under gradient flow (with squared-error loss) are governed by a linear system: $\dfrac{d f_t}{dt} = -\Theta\,(f_t - y),$
where $\Theta$ is the NTK matrix (constant in the infinite-width limit). This is an ODE with a constant linear map - so its solution is $f_t - y = e^{-\Theta t}(f_0 - y)$.
The eigenvalues of $\Theta$ determine which output directions are learned quickly (large eigenvalues -> fast convergence) and which slowly (small eigenvalues -> slow convergence). This is linear algebra - specifically, the spectral decomposition of a positive semidefinite linear map.
H.4 Gradient Flow through Linear Layers
Consider a linear layer $y = Wx$ with loss $L(y)$. The gradient with respect to the weight matrix is $\dfrac{\partial L}{\partial W} = \delta\, x^\top,$
where $\delta = \partial L / \partial y$ is the "error signal" (upstream gradient). The gradient is an outer product - a rank-1 matrix.
This means single-sample gradient updates are always rank-1. For a mini-batch of $B$ samples, the gradient is $\sum_{i=1}^{B} \delta_i x_i^\top$:
a sum of rank-1 matrices - the gradient has rank at most $B$. For large models with batch size $B \ll \min(\text{fan-in}, \text{fan-out})$, the gradient is a very low-rank update to the weight matrix. This low-rank structure of gradients is the empirical justification for gradient low-rank projection methods (GaLore, 2024).
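A direct numerical check of the rank bound, with toy layer sizes and batch size (all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
d_out, d_in, B = 512, 256, 8           # layer sizes and batch size

X = rng.normal(size=(B, d_in))         # batch of inputs
Delta = rng.normal(size=(B, d_out))    # upstream gradients dL/dy for each sample

# Gradient of the weight matrix W (with y = W x), summed over the batch:
#   dL/dW = sum_i delta_i x_i^T  =  Delta^T X
grad_W = Delta.T @ X                   # shape (d_out, d_in)

print(grad_W.shape)                    # (512, 256)
print(np.linalg.matrix_rank(grad_W))   # at most B = 8
```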
Appendix I: Reference Tables
I.1 Linear Map Properties at a Glance
| Property | Condition | Matrix Equivalent | Geometric Meaning |
|---|---|---|---|
| Linear | $T(x+y) = T(x)+T(y)$, $T(cx) = cT(x)$ | Any matrix $A$ | Preserves addition and scaling |
| Injective | $\ker(T) = \{0\}$ | Full column rank | No two inputs map to the same output |
| Surjective | $\operatorname{im}(T) = W$ | Full row rank | Every output is reachable |
| Bijective (isomorphism) | Both injective and surjective | Square, full rank, invertible | Perfect correspondence |
| Orthogonal | Preserves the dot product | $Q^\top Q = I$, orthonormal columns | Preserves lengths and angles |
| Unitary | Preserves the Hermitian inner product (complex) | $U^* U = I$, unitary columns | Complex analogue of orthogonal |
| Projection | $P^2 = P$ | Idempotent matrix | Applying twice = applying once |
| Symmetric | $\langle Ax, y \rangle = \langle x, Ay \rangle$ | $A = A^\top$; diagonalizable by the spectral theorem | Self-adjoint: stretches along orthogonal axes |
| Positive definite | $x^\top A x > 0$ for $x \neq 0$ | All eigenvalues positive | Curvature of a bowl at a minimum |
| Normal | $A A^* = A^* A$ | Diagonalizable by a unitary matrix | Has an orthonormal eigenbasis |
| Nilpotent | $A^k = 0$ for some $k$ | All eigenvalues 0 | Powers eventually vanish |
| Involution | $A^2 = I$ | Self-inverse | Reflection-like (e.g., Householder reflections) |
I.2 Rank and Dimension Formulas
| Formula | Statement |
|---|---|
| $\operatorname{rank}(A) + \operatorname{nullity}(A) = n$ | Rank-nullity theorem for $A \in \mathbb{R}^{m \times n}$ |
| $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$ | Row rank equals column rank |
| $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$ | Rank cannot increase under composition |
| $\operatorname{rank}(A + B) \leq \operatorname{rank}(A) + \operatorname{rank}(B)$ | Subadditivity of rank |
| $\dim(U + W) = \dim U + \dim W - \dim(U \cap W)$ | Inclusion-exclusion for subspaces |
| $\dim(V/U) = \dim V - \dim U$ (for subspace $U \subseteq V$) | Dimension of quotient space |
| $\operatorname{rank}(A^\top A) = \operatorname{rank}(A)$ | Gram matrix has same rank |
| $\operatorname{rank}(P) = \operatorname{tr}(P)$ (for projection $P$) | Rank = trace for idempotent matrices |
I.3 AI Applications Cross-Reference
| Linear Map Concept | Where It Appears in AI | Mathematical Role |
|---|---|---|
| $z = Wx + b$ (affine map) | Every neural layer | Pre-activation computation |
| $Q = X W_Q$ (linear projection) | Attention mechanism | Projects to query subspace |
| $\Delta W = BA$ (low-rank) | LoRA fine-tuning | Rank-$r$ weight update |
| $J_f(x)$ (Jacobian) | Backpropagation | Chain rule at each layer |
| $W^\top$ (transpose map) | Backward pass | Dual map of forward |
| $P^2 = P$ (projection) | Layer norm, attention | Projects onto subspace |
| $\text{logits} = W_U h$ (linear map) | Unembedding (logit computation) | Representation to vocabulary |
| $R_\theta$ (rotation in 2D planes) | RoPE positional encoding | Positional rotation |
| $F^{-1} \nabla L$ (metric-adjusted gradient) | Natural gradient / Adam | Riemannian gradient |
| $\Theta = J J^\top$ (Gram matrix of Jacobian) | Neural tangent kernel | Training dynamics |
Appendix J: Proofs of Key Results
J.1 Proof: $\operatorname{rank}(A) = \operatorname{rank}(A^\top)$
This is a fundamental result that deserves a careful proof.
Theorem. For any matrix $A$, the column rank (dimension of the column space) equals the row rank (dimension of the row space).
Proof (via RREF). Let $A$ have RREF $R$ (obtained by row operations, which don't change the row space but can change the column space). In $R$:
- The nonzero rows are linearly independent (each has a leading 1 not shared by any other row).
- The number of nonzero rows = number of pivot columns = rank.
So row rank = column rank = number of pivots in RREF.
Alternative proof (via SVD). The SVD $A = U \Sigma V^\top$ has $r$ nonzero singular values. The column space of $A$ is spanned by $u_1, \dots, u_r$ (the first $r$ left singular vectors), dimension $r$. The row space of $A$ (= column space of $A^\top$) is spanned by $v_1, \dots, v_r$ (the first $r$ right singular vectors), dimension $r$. Both have dimension $r$ = the number of nonzero singular values.
J.2 Proof: Kernel and Image are Subspaces
Theorem. For any linear map $T: V \to W$, both $\ker(T)$ and $\operatorname{im}(T)$ are subspaces (of $V$ and $W$ respectively).
Proof for $\ker(T)$:
- Non-empty: $T(0) = 0$, so $0 \in \ker(T)$.
- Closed under addition: Let $u, v \in \ker(T)$. Then $T(u + v) = T(u) + T(v) = 0 + 0 = 0$, so $u + v \in \ker(T)$.
- Closed under scalar multiplication: Let $v \in \ker(T)$ and $c$ a scalar. Then $T(cv) = c\,T(v) = c \cdot 0 = 0$, so $cv \in \ker(T)$.
Proof for $\operatorname{im}(T)$:
- Non-empty: $0 = T(0) \in \operatorname{im}(T)$.
- Closed under addition: Let $w_1, w_2 \in \operatorname{im}(T)$, so $w_1 = T(v_1)$ and $w_2 = T(v_2)$ for some $v_1, v_2 \in V$. Then $w_1 + w_2 = T(v_1 + v_2) \in \operatorname{im}(T)$.
- Closed under scalar multiplication: Let $w = T(v) \in \operatorname{im}(T)$ and $c$ a scalar. Then $cw = c\,T(v) = T(cv) \in \operatorname{im}(T)$.
J.3 Proof: Composition of Linear Maps is Linear
Theorem. If $T: U \to V$ and $S: V \to W$ are linear, then $S \circ T: U \to W$ is linear.
Proof: $(S \circ T)(u_1 + u_2) = S(T(u_1) + T(u_2)) = S(T(u_1)) + S(T(u_2)) = (S \circ T)(u_1) + (S \circ T)(u_2)$, and $(S \circ T)(cu) = S(c\,T(u)) = c\,S(T(u)) = c\,(S \circ T)(u)$.
J.4 Proof: The Dual Map is Linear
Theorem. If $T: V \to W$ is linear, then the dual map $T^*: W^* \to V^*$ defined by $T^*(\varphi) = \varphi \circ T$ is linear.
Proof: $T^*(\varphi + \psi) = (\varphi + \psi) \circ T = \varphi \circ T + \psi \circ T = T^*(\varphi) + T^*(\psi)$, and $T^*(c\varphi) = (c\varphi) \circ T = c\,(\varphi \circ T) = c\,T^*(\varphi)$.
So $T^*$ is linear.
J.5 Proof: Invertibility Criterion for Finite-Dimensional Spaces
Theorem. Let $T: V \to V$ be a linear map on a finite-dimensional space $V$. Then the following are equivalent:
- $T$ is injective (one-to-one)
- $T$ is surjective (onto)
- $T$ is bijective (invertible)
Proof: $T$ is injective iff $\ker(T) = \{0\}$, i.e., iff $\operatorname{nullity}(T) = 0$ (standard).
By rank-nullity, $\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim V$, so nullity $= 0$ iff rank $= \dim V$.
Rank $= \dim V$ means $\operatorname{im}(T)$ is a subspace of $V$ with $\dim \operatorname{im}(T) = \dim V$, hence $\operatorname{im}(T) = V$ (a subspace of equal dimension must be the whole space), i.e., $T$ is surjective.
Finally, injective and surjective together mean bijective, by definition.
Important: This equivalence only holds for maps with the same (finite-dimensional) domain and codomain. For $T: \mathbb{R}^n \to \mathbb{R}^m$ with $n \neq m$, injective and surjective are NOT equivalent (one of the two is impossible given the dimension constraint).
Appendix K: Additional AI Case Studies
K.1 Mechanistic Interpretability via Linear Maps
Mechanistic interpretability (MI) aims to reverse-engineer neural networks by understanding what computation each component performs. Linear map theory is central to this enterprise.
The residual stream as a communication bus. In transformer models, each layer reads from and writes to a shared "residual stream" $x \in \mathbb{R}^{d_{\text{model}}}$. Each attention head and MLP layer contributes an additive update to it.
Each update is (approximately) a low-rank linear map from the residual stream back to itself. The attention update's linear part is $W_O W_V$ (the "OV circuit"); the MLP's linear part is $W_{\text{out}} W_{\text{in}}$ after linearizing the activation.
SVD of the OV circuit. The matrix $W_O W_V$ can be analyzed via SVD. Its singular values reveal how strongly the attention head modifies the residual stream, and its singular vectors reveal which directions it reads from and writes to. Heads with near-zero singular values are "inattentive" - they barely modify the residual stream regardless of the attention pattern.
Subspace decomposition. The full set of attention heads (for $L$ layers with $H$ heads per layer) collectively form a large linear map from the input to the residual stream updates. The total update is a sum of low-rank linear maps. Understanding the structure of this sum - which heads are redundant, which are essential - is a central goal of circuit-level MI.
K.2 Linear Algebra of Diffusion Models
Diffusion models (DDPM, score matching) add Gaussian noise to data and learn to denoise. The forward process is an affine map: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, I)$.
This is an affine interpolation between the data and pure noise. The coefficient $\sqrt{\bar{\alpha}_t}$ scales the data, and $\sqrt{1 - \bar{\alpha}_t}$ scales the noise.
The denoising objective. The neural network estimates $\varepsilon$ (the noise) from the noisy input $x_t$. Near a data point, the optimal estimator is approximately an affine function of $x_t$ - Tweedie's formula gives it as $\mathbb{E}[x_0 \mid x_t] = \dfrac{x_t + (1 - \bar{\alpha}_t)\, \nabla_{x_t} \log p(x_t)}{\sqrt{\bar{\alpha}_t}},$
which is an affine function of $x_t$ and the score $\nabla_{x_t} \log p(x_t)$. The diffusion process itself is a composition of affine maps in the forward direction, and the learned reverse process approximately inverts these affine maps.
K.3 State Space Models as Linear Dynamical Systems
Structured State Space Models (S4, Mamba, RWKV) compute their state updates via linear recurrences: $h_t = A h_{t-1} + B u_t, \quad y_t = C h_t + D u_t,$
where $A$, $B$, $C$, $D$ are (possibly input-dependent) matrices.
The state transition is a linear dynamical system - the fundamental object of study in control theory and signal processing.
Key linear algebra results for SSMs:
- Eigenvalues of $A$ determine memory. If $|\lambda_i(A)| < 1$ for all $i$, the system has bounded, geometrically decaying memory. If any $|\lambda_i| > 1$, the state can grow unboundedly.
- Diagonalization for efficiency. If $A = PDP^{-1}$, the recurrence decouples into independent scalar recurrences - each computable independently. S4 uses a diagonal-plus-low-rank (DPLR) structure for parallel computation via convolution.
- The convolution view. Unrolling the recurrence gives $y_t = \sum_{k=0}^{t-1} C A^k B\, u_{t-k} + D u_t$. The impulse response $(CB, CAB, CA^2B, \dots)$ is a sequence of matrix powers - analyzable by the spectral decomposition of $A$ (see the sketch after this list).
- Mamba's selectivity. Mamba makes the parameters input-dependent: $B$, $C$, and the discretization step depend on the current input $u_t$. The recurrence becomes bilinear in the input and the state, not purely linear. Linearization around typical inputs gives a locally linear system analyzable by the tools of this section.
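A minimal sketch (with arbitrary small matrices as assumptions, and no feedthrough term $D$) checking that the recurrent and convolutional views of an SSM produce the same outputs:

```python
import numpy as np

rng = np.random.default_rng(10)
n, T = 4, 50                                          # state size and sequence length
A = 0.9 * np.linalg.qr(rng.normal(size=(n, n)))[0]    # stable state matrix (|eig| = 0.9)
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))
u = rng.normal(size=T)                                # scalar input sequence

# Recurrent view: h_t = A h_{t-1} + B u_t,  y_t = C h_t
h = np.zeros((n, 1))
y_rec = []
for t in range(T):
    h = A @ h + B * u[t]
    y_rec.append((C @ h).item())

# Convolution view: y_t = sum_k (C A^k B) u_{t-k}
kernel = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(T)])
y_conv = [float(np.dot(kernel[: t + 1], u[t::-1])) for t in range(T)]

print(np.allclose(y_rec, y_conv))   # True
```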
Appendix L: Further Reading
Primary References
- Axler, S. (2015). Linear Algebra Done Right (3rd ed.). Springer. The definitive abstract treatment of linear maps; develops the theory from axioms to spectral theory, postponing determinants to the final chapter. Highly recommended for conceptual depth.
- Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. Computational and applied focus; excellent for the four fundamental subspaces and applications.
- Horn, R. & Johnson, C. (2013). Matrix Analysis (2nd ed.). Cambridge University Press. Comprehensive advanced reference, with proofs of all major results, including Cayley-Hamilton, spectral theorems, and singular values.
- Trefethen, L. & Bau, D. (1997). Numerical Linear Algebra. SIAM. The gold standard for computational linear algebra and stability.
AI-Focused References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS. The original transformer paper; read the attention mechanism as linear projections.
- Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. The LoRA rank argument referenced in this section.
- Elhage, N. et al. (2021). "A Mathematical Framework for Transformer Circuits." Anthropic. OV and QK circuits as linear maps; the residual stream as a communication bus.
- Park, K. et al. (2023). "The Linear Representation Hypothesis and the Geometry of Large Language Models." Linear features in transformer representations.
- Jacot, A. et al. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. Training dynamics via linear maps (NTK theory).
- Gu, A. et al. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. SSMs as linear dynamical systems.
This section is part of the Math for LLMs curriculum - a systematic treatment of the mathematics underlying modern large language models.
Appendix M: Self-Assessment Checklist
After completing this section, you should be able to answer the following questions without notes.
Conceptual Understanding
- Q1. State the two axioms of a linear map. What is the fastest way to show a map is NOT linear?
- Q2. What is the kernel of a linear map? Prove it is a subspace.
- Q3. State the rank-nullity theorem. Give an example where rank = 2 and nullity = 3. What can you say about the domain and codomain dimensions?
- Q4. Why is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^n$ injective if and only if it is surjective? Why does this fail for maps between spaces of different dimensions?
- Q5. What is the change-of-basis formula? If $B = P^{-1} A P$, what relationship does that establish between the maps represented by $A$ and $B$?
- Q6. What is an orthogonal projection? How do you verify that a matrix is a projection? What two extra properties make it an orthogonal projection?
- Q7. What is the Jacobian matrix? For $f: \mathbb{R}^n \to \mathbb{R}^m$, what is the shape of $J_f$?
- Q8. In the backward pass of backpropagation, why do we multiply by $W^\top$ rather than $W$?
- Q9. What makes an affine map different from a linear map? How do you represent an affine map as a linear map (in one higher dimension)?
Computational Skills
- C1. Given a matrix $A$, find a basis for $\ker(A)$ using row reduction.
- C2. Given a linear map defined on a non-standard basis, write its matrix in that basis.
- C3. Given two bases $\mathcal{B}$ and $\mathcal{C}$, compute the change-of-basis matrix and use it to transform the matrix of a linear map.
- C4. For a rank-$r$ update $\Delta W = BA$ (as in LoRA), compute the null space dimension and verify it numerically.
- C5. Compute the Jacobian of a given vector-valued function (e.g., softmax, elementwise ReLU, an affine map composed with sigmoid).
AI Connections
- AI1. Explain why LoRA (low-rank adaptation) is more parameter-efficient than full fine-tuning, using the language of rank and nullity.
- AI2. Describe the forward pass of a multi-head attention layer as a sequence of linear maps. Which operations are linear, which are bilinear, and which are nonlinear?
- AI3. What is the linear representation hypothesis? Why does it matter for interpretability research?
- AI4. Why does a purely linear deep network (no activations) collapse to a single linear map, regardless of depth?
- AI5. How does the dual map relate to backpropagation? What mathematical object is the "gradient" in the strict sense?