"Every matrix tells a story of rotation, scaling, and rotation again - the SVD reads that story in full."
Overview
The Singular Value Decomposition (SVD) is the most universally applicable matrix factorisation in all of applied mathematics. Given any matrix - rectangular, rank-deficient, ill-conditioned, whatever - the SVD expresses it as $A = U \Sigma V^\top$: a rotation in the input space ($V^\top$), a coordinate-wise scaling ($\Sigma$), and a rotation in the output space ($U$). The scaling factors $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ are the singular values; they completely determine the linear map's geometry. Unlike the eigendecomposition, which requires a square matrix and may involve complex numbers, the SVD always exists, always produces real non-negative singular values, and works for matrices of any shape.
For AI practitioners in 2026, the SVD is inescapable. Low-Rank Adaptation (LoRA) fine-tunes billion-parameter language models by constraining weight updates to a low-rank subspace - the dominant singular subspace of the gradient matrix. DeepSeek's Multi-Head Latent Attention (MLA) compresses key-value caches by projecting them through a low-rank bottleneck. The Eckart-Young theorem guarantees that the rank-$k$ SVD truncation is the best possible rank-$k$ approximation, which is why SVD underlies recommender systems, image compression, latent semantic analysis, and the pseudoinverse. WeightWatcher diagnoses model quality by examining the singular value spectrum of weight matrices: healthy trained networks have heavy-tailed spectra; undertrained or overtrained networks do not. Every time you compute a spectral norm, solve a least-squares problem, or talk about the "effective rank" of a weight matrix, you are using the SVD.
This section develops the SVD from first principles, proves the Eckart-Young theorem, connects SVD to eigenvalues and the four fundamental subspaces, and builds every major AI application from scratch.
Prerequisites
- Eigenvalues and eigenvectors, spectral theorem for symmetric matrices (Section 03-01)
- Matrix rank, null space, column space (Section 02-05)
- Vector spaces and orthogonality (Section 02-06)
- Matrix transpose, invertibility, determinants (Section 02-02 through 02-04)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive demos: geometric action, Eckart-Young compression, pseudoinverse, randomised SVD, LoRA, image compression, weight matrix analysis |
| exercises.ipynb | 8 graded problems: SVD by hand, four subspaces, low-rank approximation, pseudoinverse, condition number, randomised SVD, LoRA adapter, image compression |
Learning Objectives
After completing this section, you will:
- State the SVD theorem and explain its geometric meaning as three successive linear maps
- Compute the SVD of small matrices by finding the eigendecomposition of $A^\top A$ and $A A^\top$
- Distinguish full SVD, thin SVD, and truncated SVD; know when to use each
- Identify the four fundamental subspaces of $A$ from its SVD
- Prove and apply the Eckart-Young theorem for optimal low-rank approximation
- Compute the Moore-Penrose pseudoinverse and use it to solve least-squares problems
- Express the spectral norm, Frobenius norm, and nuclear norm in terms of singular values
- Implement randomised SVD for large matrices
- Connect SVD to LoRA, MLA, spectral normalisation, recommender systems, and LSA
- Interpret the singular value spectrum of neural network weight matrices
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Computing the SVD
- 4. Properties of the SVD
- 5. Outer Product Form and Rank-1 Decomposition
- 6. SVD in Machine Learning and AI
- 7. Generalised SVD and Extensions
- 8. Common Mistakes
- 9. Exercises
- 10. Why This Matters for AI (2026 Perspective)
- 11. Conceptual Bridge
1. Intuition
1.1 What SVD Is
A matrix $A \in \mathbb{R}^{m \times n}$ represents a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. That map can be arbitrarily complex - it can rotate, stretch, shrink, and project. The SVD breaks this complexity into three primitives that are each individually simple:
- $V^\top$: rotation/reflection in the input space $\mathbb{R}^n$. $V$ is orthogonal ($V^\top V = V V^\top = I$), so $V^\top$ is a rigid transformation that does not change lengths or angles - it just changes the orientation of the coordinate axes.
- $\Sigma$: scaling along coordinate axes. $\Sigma$ is an $m \times n$ diagonal-like matrix with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on the diagonal. It stretches or shrinks each axis independently. If $m \neq n$, it also changes the dimension.
- $U$: rotation/reflection in the output space $\mathbb{R}^m$. $U$ is orthogonal ($U^\top U = U U^\top = I$), another rigid transformation.
The sequence $Ax = U \Sigma V^\top x$ is the "rotate -> scale -> rotate" story of the matrix. The first rotation aligns the input coordinates to the special axes (the right singular vectors $v_i$, columns of $V$). Then each axis $i$ is scaled by $\sigma_i$. Then the result is rotated to the output coordinate system via $U$, whose columns $u_i$ are the left singular vectors.
This decomposition exists for every real (or complex) matrix, regardless of shape, rank, or conditioning. The singular values $\sigma_i$ are unique (though the vectors have sign ambiguity and are non-unique when singular values repeat). This universality is what distinguishes the SVD from the eigendecomposition, which only applies to square matrices and can fail (defective matrices) or become complex.
For AI: When you apply a linear layer $W$ to an input $x$, you are implicitly performing $U \Sigma V^\top x$. The right singular vectors are the "input features" the layer responds to; the singular values are how strongly each feature is amplified; the left singular vectors are the "output features" it writes to. LoRA exploits this by constraining $\Delta W = BA$ where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ - a rank-$r$ update that modifies only the dominant singular directions.
1.2 The Geometric Picture
The clearest geometric picture of the SVD comes from watching what $A$ does to the unit sphere $S^{n-1} = \{x \in \mathbb{R}^n : \|x\|_2 = 1\}$. Apply $A$ to every point on the unit sphere. The result is an ellipsoid in $\mathbb{R}^m$. This is not obvious - it is a theorem - but it follows directly from the SVD:
Step by step:
- $V^\top$ is orthogonal, so it maps the unit sphere to itself (a rigid transformation preserves the sphere).
- $\Sigma$ scales axis $i$ by $\sigma_i$, turning the sphere into an axis-aligned ellipsoid with semi-axes $\sigma_1, \ldots, \sigma_n$.
- $U$ is orthogonal, so it rotates/reflects the ellipsoid without changing its shape.
The singular values are the semi-axis lengths of the output ellipsoid. The right singular vectors $v_i$ are the directions in the input space that map to the semi-axes of the ellipsoid. The left singular vectors $u_i$ are the directions of the ellipsoid's semi-axes in the output space.
GEOMETRIC ACTION OF A \in \mathbb{R}^{m \times n}
========================================================================
Input space \mathbb{R}^n Output space \mathbb{R}^m
v_2 u_2
up up
| Unit sphere |
| * | \sigma_2*u_2
| * * V^T * | *
| * * ----------> * | * A = U \Sigma V^T
| * * * | *
| * ------------------- u_1 ->
-----+--------> v_1 * \sigma_1 *
| * *
| * *
| * * * * *
| (ellipse)
| semi-axes = singular values
Right singular vectors v_i Left singular vectors u_i
= natural input axes = natural output axes
========================================================================
For a $2 \times 2$ matrix with $\sigma_1 \ge \sigma_2$: the unit circle maps to an ellipse with semi-axes $\sigma_1$ and $\sigma_2$. The direction $v_1$ is the input direction that gets stretched the most (by $\sigma_1$); $v_2$ is the direction stretched least (by $\sigma_2$). If $\sigma_2 = 0$, the map collapses one direction to zero - the matrix is rank-1 and maps every vector onto a line.
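The short NumPy sketch below (the matrix values are arbitrary) checks this picture numerically: pushing the unit circle through a $2 \times 2$ matrix produces an ellipse whose largest and smallest radii match $\sigma_1$ and $\sigma_2$.

```python
import numpy as np

# Arbitrary 2x2 example matrix (any choice works)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)

# Sample the unit circle and push it through A
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # shape (2, 1000)
ellipse = A @ circle

radii = np.linalg.norm(ellipse, axis=0)
print("sigma_1, sigma_2:          ", s)
print("max/min radius of image:   ", radii.max(), radii.min())

# The most-stretched input direction is v_1 (first row of Vt)
print("A @ v_1 has length sigma_1:", np.linalg.norm(A @ Vt[0]), s[0])
```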
1.3 Why SVD Is the Most Important Decomposition in AI
The SVD is not just an abstract factorisation - it is the foundational tool for understanding, compressing, and controlling linear maps in neural networks. Here are the core connections:
Low-rank approximation (LoRA, DoRA, MLA). Every weight matrix in a neural network is a linear map. After training, weight update matrices $\Delta W$ tend to have low effective rank - most of the "learning signal" lives in a few singular directions. LoRA exploits this: instead of fine-tuning all of $W$, it learns $\Delta W = BA$ with $r \ll \min(m, n)$. The SVD is used to initialise $B$ and $A$ from the dominant singular subspace of $\Delta W$, or to analyse which singular directions are being adapted. DoRA (Weight-Decomposed Low-Rank Adaptation) further decomposes $W$ into magnitude and direction and adapts direction (via low-rank) and magnitude (via scalar) separately. DeepSeek's MLA uses low-rank projections of key-value matrices to compress the KV cache, reducing per-token memory from $O(d)$ to $O(d_c)$ with $d_c \ll d$.
Spectral normalisation. The spectral norm of a weight matrix is the largest singular value $\sigma_1$. Spectral normalisation (Miyato et al. 2018) divides each weight matrix by its spectral norm, enforcing a Lipschitz-1 constraint on each layer. This stabilises GAN training. The spectral norm is computed via power iteration (equivalent to computing the top singular vector pair) and costs $O(mn)$ per forward pass rather than $O(mn \min(m, n))$ for the full SVD.
Pseudoinverse and least squares. Every linear regression problem $\min_x \|Ax - b\|_2$ has solution $\hat{x} = A^+ b$, where $A^+$ is the Moore-Penrose pseudoinverse. The SVD provides the most numerically stable way to compute this. Ridge regression adds a $\lambda \|x\|_2^2$ regulariser, which in SVD form becomes shrinkage: each singular component is shrunk by factor $\sigma_i^2 / (\sigma_i^2 + \lambda)$.
Matrix compression and efficiency. Images, attention matrices, embedding tables, and gradient matrices all have approximate low-rank structure. The Eckart-Young theorem guarantees the SVD gives the best rank-$k$ approximation. Randomised SVD (Halko et al. 2011) computes this in $O(mnk)$ time rather than $O(mn \min(m, n))$, making it practical for million-dimensional problems.
Understanding model geometry. The singular value spectrum of a weight matrix reveals its "information geometry": a flat spectrum (all $\sigma_i$ roughly equal) means the layer treats all directions equally; a steep spectrum means the layer has discovered dominant directions. WeightWatcher (Martin and Mahoney, 2019-2026) quantifies model quality by fitting power-law distributions to singular value spectra: well-trained models have heavy-tailed spectra $\rho(\lambda) \sim \lambda^{-\alpha}$ with $\alpha$ roughly in the range 2-4; undertrained models have flat spectra; overtrained models have spike/bulk structure.
1.4 SVD vs Eigendecomposition
Students often confuse SVD and eigendecomposition. Here is a precise comparison:
| Property | Eigendecomposition | SVD |
|---|---|---|
| Applicable to | Square matrices only | Any $m \times n$ matrix |
| Eigenvalues/singular values | Can be complex, can be negative | Always real, always $\ge 0$ |
| Basis matrix orthogonal? | Only for symmetric $A$ | Always ($U$ and $V$ orthogonal) |
| Same input/output basis? | Yes (the same $Q$ used twice) | No ($U$ for output, $V$ for input) |
| Existence | Not always (defective matrices; may require $\mathbb{C}$) | Always exists |
| For symmetric PSD $A$ | $A = Q \Lambda Q^\top$, $\lambda_i \ge 0$ | Coincides: $U = V = Q$, $\sigma_i = \lambda_i$ |
| Computation cost | $O(n^3)$ | $O(mn \min(m, n))$ |
| Geometric meaning | Fixed points of the map | Extreme stretching directions |
The key insight: singular values and eigenvalues are generally different quantities. For a square matrix $A$, the eigenvalues of $A$ are not the same as the singular values of $A$ (unless $A$ is symmetric PSD). The singular values of $A$ are the square roots of the eigenvalues of $A^\top A$ (which is always symmetric PSD). For example, the $2 \times 2$ rotation matrix $R(\theta)$ has eigenvalues $e^{\pm i\theta}$ (complex, magnitude 1) but singular values $1, 1$ (all equal - a rotation is an isometry).
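A quick numerical check of this point (the matrices here are illustrative examples, not taken from the notebooks):

```python
import numpy as np

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.linalg.eigvals(R))                  # complex pair e^{+-i theta}, magnitude 1
print(np.linalg.svd(R, compute_uv=False))    # [1., 1.] - a rotation is an isometry

# Non-normal matrix: eigenvalues and singular values differ wildly
B = np.array([[1.0, 100.0],
              [0.0,   1.0]])
print(np.linalg.eigvals(B))                  # [1., 1.]
print(np.linalg.svd(B, compute_uv=False))    # roughly [100.01, 0.01]
```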
1.5 Historical Timeline
| Year | Who | What |
|---|---|---|
| 1873 | Beltrami, Jordan | Independent discovery of SVD for bilinear forms; Beltrami published first, Jordan one month later |
| 1889 | Sylvester | Generalised to arbitrary rectangular real matrices |
| 1907 | Schmidt | Infinite-dimensional version (integral operators); modern functional-analytic foundation |
| 1936 | Eckart & Young | Proved the best low-rank approximation theorem; connected SVD to statistics |
| 1965 | Golub & Reinsch | Practical stable algorithm: bidiagonalisation + implicit QR shift; still in use today |
| 1976 | Lanczos bidiag. | Krylov subspace method for large sparse matrices; precursor to ARPACK |
| 1992 | Berry et al. | Latent Semantic Analysis (LSA) - SVD of term-document matrices for information retrieval |
| 2009-10 | Candès, Recht; Cai et al. | Matrix completion theory; nuclear norm minimisation as convex relaxation of low-rank |
| 2009 | Koren et al. | Winning Netflix Prize solution used SVD-based matrix factorisation |
| 2011 | Halko, Martinsson, Tropp | Randomised SVD - near-linear-time algorithms for large matrices |
| 2017-26 | LoRA era | Hu et al. (LoRA 2021), Liu et al. (DoRA 2024), DeepSeek (MLA 2024) - low-rank SVD in LLMs |
| 2019-26 | Martin & Mahoney | WeightWatcher - power-law singular value spectra as model quality metrics |
2. Formal Definitions
2.1 The SVD Theorem
Theorem (Singular Value Decomposition). Let $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank}(A) = r$. Then there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ of the form
$$\Sigma = \begin{pmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_r) & 0 \\ 0 & 0 \end{pmatrix}$$
with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, such that
$$A = U \Sigma V^\top.$$
The $\sigma_i$ are called the singular values of $A$. The columns $u_i$ of $U$ are the left singular vectors. The columns $v_i$ of $V$ are the right singular vectors.
Proof sketch (constructive). Since $A^\top A \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite, by the Spectral Theorem it has an eigendecomposition $A^\top A = V \Lambda V^\top$ with $V$ orthogonal, $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$. Define $\sigma_i = \sqrt{\lambda_i}$. For each $i$ with $\sigma_i > 0$, define $u_i = A v_i / \sigma_i$. One can verify:
- $u_i^\top u_j = \dfrac{v_i^\top A^\top A v_j}{\sigma_i \sigma_j} = \dfrac{\lambda_j\, v_i^\top v_j}{\sigma_i \sigma_j} = \delta_{ij}$ (orthonormal).
- Extend $\{u_1, \ldots, u_r\}$ to an orthonormal basis of $\mathbb{R}^m$ to complete $U$.
- Then $A v_i = \sigma_i u_i$ for $i \le r$ and $A v_i = 0$ for $i > r$, which compactly gives $A = U \Sigma V^\top$.
Uniqueness. The singular values are unique (they are square roots of eigenvalues of $A^\top A$, which are unique). The singular vectors are unique up to:
- Sign flips: replacing $(u_i, v_i)$ with $(-u_i, -v_i)$ leaves $A$ unchanged.
- Rotation within repeated singular value blocks: if $\sigma_i = \sigma_{i+1}$, any orthogonal rotation within the corresponding subspaces is valid.
2.2 Singular Values
The singular values carry all the magnitude information about $A$. Key properties:
Relation to eigenvalues of $A^\top A$ and $A A^\top$:
$$\sigma_i(A) = \sqrt{\lambda_i(A^\top A)} = \sqrt{\lambda_i(A A^\top)}.$$
The nonzero eigenvalues of $A^\top A$ and $A A^\top$ are identical (even though the matrices have different sizes), and they equal the squared singular values of $A$.
Geometric meaning: $\sigma_1 = \max_{\|x\|_2 = 1} \|Ax\|_2$ (the maximum stretching factor). More generally, $\sigma_k$ is the maximum stretching factor among all directions orthogonal to $v_1, \ldots, v_{k-1}$.
Rank: $\operatorname{rank}(A)$ = number of positive singular values.
Examples:
- Identity $I_n$: all singular values equal 1.
- Scaling matrix $\operatorname{diag}(3, 1, 0)$: singular values are 3, 1, 0; rank = 2.
- Rotation matrix $R(\theta)$: all singular values equal 1 (isometry).
- Projection matrix onto a $k$-dimensional subspace: $k$ singular values equal 1, the rest equal 0.
- All-ones matrix $\mathbf{1}_{n \times n}$: $\sigma_1 = n$, $\sigma_2 = \cdots = \sigma_n = 0$; rank = 1.
2.3 Left and Right Singular Vectors
The left and right singular vectors satisfy a pair of eigenvalue-like equations:
$$A v_i = \sigma_i u_i, \qquad A^\top u_i = \sigma_i v_i.$$
Combining: $A^\top A v_i = \sigma_i^2 v_i$ and $A A^\top u_i = \sigma_i^2 u_i$. So the $v_i$ are eigenvectors of $A^\top A$ and the $u_i$ are eigenvectors of $A A^\top$, both with eigenvalue $\sigma_i^2$.
The relationships can be rewritten as a single symmetric eigenproblem by introducing the augmented matrix:
$$H = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}.$$
This symmetric matrix has eigenvalues $\pm \sigma_i$ (plus zeros). Some SVD algorithms exploit this formulation.
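A short NumPy check of these defining relations on a random matrix (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Defining pair of equations, column by column
print(np.allclose(A @ V, U * s))          # A v_i = sigma_i u_i
print(np.allclose(A.T @ U, V * s))        # A^T u_i = sigma_i v_i

# v_i are eigenvectors of A^T A with eigenvalue sigma_i^2
print(np.allclose(A.T @ A @ V, V * s**2))
```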
2.4 Full, Thin, and Truncated SVD
For an $m \times n$ matrix with $m \ge n$ and rank $r$, there are three common variants:
FULL SVD vs THIN SVD vs TRUNCATED SVD (m > n, rank r)
================================================================
FULL:
A = U \times \Sigma \times V^T
(m\timesn) (m\timesm) (m\timesn) (n\timesn)
[all m [n nonzero [all n
columns] rows] columns]
THIN (economy):
A = U \times \Sigma \times V^T
(m\timesn) (m\timesn) (n\timesn) (n\timesn)
[first n [square [all n
columns] diagonal] columns]
TRUNCATED (rank-k):
A_k = U_k \times \Sigma_k \times V_k^T
(m\timesn) (m\timesk) (k\timesk) (k\timesn)
[first k [top-k [first k
columns] diagonal] columns]
================================================================
Full SVD is the complete factorisation; $U$ ($m \times m$) and $V$ ($n \times n$) are square orthogonal matrices. Storage: $O(m^2 + n^2)$.
Thin SVD (also called economy or compact SVD): keep only the first $n$ columns of $U$ (for $m \ge n$). Since $\Sigma$ has zeros in rows $n+1$ through $m$, the remaining $m - n$ columns of $U$ do not contribute. Storage: $O(mn + n^2)$.
Truncated SVD (also called low-rank SVD): keep only the top $k$ terms ($k < r$). This gives the best rank-$k$ approximation (Eckart-Young theorem, Section 4.2). Storage: $O((m + n)k)$. This is what LoRA exploits.
For AI: In practice, np.linalg.svd(A, full_matrices=False) computes the thin SVD. scipy.sparse.linalg.svds(A, k=k) and sklearn.utils.extmath.randomized_svd(A, n_components=k) compute truncated SVDs efficiently.
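A minimal shape comparison of the three variants (the matrix here is random and its size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
k = 5

# Full SVD: U is 100x100, most of it unused when m >> n
U, s, Vt = np.linalg.svd(A)
print(U.shape, s.shape, Vt.shape)          # (100, 100) (30,) (30, 30)

# Thin (economy) SVD: U is 100x30 - all you need to rebuild A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)          # (100, 30) (30,) (30, 30)
print(np.allclose(A, (U * s) @ Vt))        # exact reconstruction

# Truncated rank-k SVD: keep only the top-k triples
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
print(A_k.shape, np.linalg.matrix_rank(A_k))   # (100, 30) 5
```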
2.5 The Four Fundamental Subspaces
The SVD provides the cleanest way to identify the four fundamental subspaces of (Strang's framework):
| Subspace | Definition | From SVD | Dimension |
|---|---|---|---|
| Column space (range) | $\{Ax : x \in \mathbb{R}^n\}$ | span$\{u_1, \ldots, u_r\}$ | $r$ |
| Left null space | $\{y \in \mathbb{R}^m : A^\top y = 0\}$ | span$\{u_{r+1}, \ldots, u_m\}$ | $m - r$ |
| Row space | $\{A^\top y : y \in \mathbb{R}^m\}$ | span$\{v_1, \ldots, v_r\}$ | $r$ |
| Null space (kernel) | $\{x \in \mathbb{R}^n : Ax = 0\}$ | span$\{v_{r+1}, \ldots, v_n\}$ | $n - r$ |
The column space is spanned by the first $r$ left singular vectors; the null space by the last $n - r$ right singular vectors. These pairs are orthogonal complements: column space $\perp$ left null space in $\mathbb{R}^m$; row space $\perp$ null space in $\mathbb{R}^n$.
For AI: In a linear layer $y = Wx$:
- The row space is the subspace of inputs that "activate" the layer (non-null inputs).
- The null space is the subspace of inputs that are completely ignored.
- The column space is the set of reachable outputs.
- LoRA constrains $\Delta W$ to have column space within a rank-$r$ subspace.
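A small helper that reads all four subspaces off the full SVD; the function name and the rank-2 example matrix below are illustrative, not from the companion notebooks:

```python
import numpy as np

def four_subspaces(A, tol=1e-10):
    """Orthonormal bases of the four fundamental subspaces, read off the full SVD."""
    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > tol * s[0]))        # numerical rank
    return {
        "column space":    U[:, :r],       # span{u_1..u_r}      in R^m
        "left null space": U[:, r:],       # span{u_{r+1}..u_m}  in R^m
        "row space":       Vt[:r].T,       # span{v_1..v_r}      in R^n
        "null space":      Vt[r:].T,       # span{v_{r+1}..v_n}  in R^n
    }

# Rank-2 example: the third row is the sum of the first two
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])
S = four_subspaces(A)
print({name: basis.shape for name, basis in S.items()})
print(np.allclose(A @ S["null space"], 0))     # A annihilates its null space
```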
3. Computing the SVD
3.1 Connection to Eigendecomposition
The most conceptually direct way to compute the SVD exploits the relationships $A^\top A = V \Sigma^\top \Sigma V^\top$ and $A A^\top = U \Sigma \Sigma^\top U^\top$:
Algorithm (eigendecomposition-based SVD):
- Form $A^\top A$ (symmetric PSD, $n \times n$).
- Compute the eigendecomposition $A^\top A = V \Lambda V^\top$ with $V$ orthogonal, $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$.
- Set $\sigma_i = \sqrt{\lambda_i}$ and discard near-zero singular values (numerical rank).
- For each $\sigma_i > 0$: $u_i = A v_i / \sigma_i$.
- Extend $\{u_i\}$ to an orthonormal basis of $\mathbb{R}^m$ (e.g., by Gram-Schmidt or QR).
This works, but it has a critical flaw: forming $A^\top A$ squares the condition number. If $\kappa(A) = 10^7$, then $\kappa(A^\top A) = 10^{14}$. For such an ill-conditioned matrix, forming $A^\top A$ destroys roughly the last 7 decimal digits of information in double precision. The Golub-Reinsch algorithm avoids this.
For small, well-conditioned matrices this approach is perfectly fine and is essentially what np.linalg.eigh(A.T @ A) gives you. For production SVD, use np.linalg.svd, which implements Golub-Reinsch.
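A sketch of the eigendecomposition-based algorithm above, for small well-conditioned matrices only (the function name is illustrative):

```python
import numpy as np

def svd_via_eigh(A, tol=1e-12):
    """Textbook SVD via the eigendecomposition of A^T A.
    Fine for small, well-conditioned matrices; NOT for production
    (forming A^T A squares the condition number)."""
    lam, V = np.linalg.eigh(A.T @ A)          # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1]               # reorder to descending
    lam, V = lam[idx], V[:, idx]
    s = np.sqrt(np.clip(lam, 0.0, None))
    keep = s > tol * s[0]
    U = (A @ V[:, keep]) / s[keep]            # u_i = A v_i / sigma_i
    return U, s[keep], V[:, keep].T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = svd_via_eigh(A)
print(np.allclose(A, (U * s) @ Vt))                                   # reconstructs A
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)[:len(s)]))    # matches np.linalg.svd
```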
3.2 Golub-Reinsch Algorithm
The Golub-Reinsch algorithm (1965) is the standard method implemented in LAPACK and numpy. It proceeds in two phases:
Phase 1: Bidiagonalisation. Using Householder reflectors, reduce $A$ to upper bidiagonal form:
$$A = U_1 B V_1^\top,$$
where $B$ is upper bidiagonal (nonzeros only on the main diagonal and superdiagonal) and $U_1, V_1$ are products of Householder matrices. This costs $O(mn^2)$ operations.
Phase 2: Diagonalisation of $B$. Apply implicit QR iterations with Wilkinson shifts to $B$, converging to diagonal form $\Sigma = U_2^\top B V_2$. Combining: $A = (U_1 U_2) \Sigma (V_1 V_2)^\top$.
The key advantage: bidiagonalisation works directly on $A$ without forming $A^\top A$, so the condition number is not squared. The algorithm is backward stable: the computed factors are the exact SVD of $A + \Delta A$ with $\|\Delta A\| = O(\varepsilon_{\text{mach}}) \|A\|$.
Cost: $O(mn^2)$ for $m \ge n$. This is too expensive for large sparse matrices.
3.3 Randomised SVD
For large matrices where only the top-$k$ singular values/vectors are needed, the randomised SVD (Halko, Martinsson, Tropp 2011) offers a dramatic speedup:
Algorithm:
- Sketch: Draw $\Omega \in \mathbb{R}^{n \times (k+p)}$ with i.i.d. Gaussian entries ($p$ = oversampling, typically 5-10).
- Range sketch: Compute $Y = A \Omega \in \mathbb{R}^{m \times (k+p)}$.
- Orthogonalise: QR decompose $Y = QR$; $Q$ captures the range of $A$.
- Project: Compute $B = Q^\top A$ (a small $(k+p) \times n$ matrix).
- SVD of small matrix: Compute $B = \tilde{U} \Sigma V^\top$ exactly.
- Recover: $U = Q \tilde{U}$. Return the first $k$ columns of $U$, the top $k$ singular values, and the first $k$ rows of $V^\top$.
Cost: $O(mnk)$ - linear in the matrix size! Compare to $O(mn \min(m, n))$ for full SVD.
Error bound: With oversampling $p \ge 2$, the approximation error $\|A - Q Q^\top A\|_2$ exceeds the optimal error $\sigma_{k+1}$ only by a modest factor (depending on $k$, $p$, and $\min(m, n)$) with high probability. In practice the error is much closer to $\sigma_{k+1}$.
Power iteration improvement: Replace $Y = A \Omega$ with $Y = (A A^\top)^q A \Omega$ for small $q$ (1-3 power iteration steps). This dramatically sharpens the approximation for matrices with slowly decaying singular values, at roughly $(q + 1)$ times the sketching cost.
For AI: sklearn.utils.extmath.randomized_svd(A, n_components=k, n_iter=2) implements this. It's what TruncatedSVD uses for large sparse document matrices. PyTorch's torch.svd_lowrank uses randomised SVD for LoRA-related computations.
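A compact sketch of the Halko-Martinsson-Tropp scheme described above; the function signature is illustrative and differs from sklearn's:

```python
import numpy as np

def randomized_svd(A, k, p=5, q=2, seed=0):
    """Randomised truncated SVD. k: target rank, p: oversampling, q: power iterations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))     # Gaussian sketch
    Y = A @ Omega
    for _ in range(q):                          # power iterations sharpen the range
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for range(A)
    B = Q.T @ A                                 # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 300)) @ rng.standard_normal((300, 1000))
U, s, Vt = randomized_svd(A, k=20)
s_exact = np.linalg.svd(A, compute_uv=False)[:20]
print(np.max(np.abs(s - s_exact) / s_exact))    # relative error of the top-20 singular values
```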
3.4 Lanczos Bidiagonalisation
For very large sparse matrices (e.g., millions of rows and columns with only a tiny fraction of nonzero entries), Lanczos bidiagonalisation builds a Krylov subspace for the SVD problem:
Algorithm: Starting from a unit vector $v_1$, alternate between multiplying by $A$ and $A^\top$:
$$\alpha_j u_j = A v_j - \beta_{j-1} u_{j-1}, \qquad \beta_j v_{j+1} = A^\top u_j - \alpha_j v_j.$$
After $k$ steps, this builds a $k \times k$ bidiagonal matrix whose singular values approximate the largest (and sometimes smallest) singular values of $A$. Only matrix-vector products $Av$ and $A^\top u$ are needed - no explicit matrix storage.
This is implemented in scipy.sparse.linalg.svds and is what makes SVD feasible for web-scale matrices (e.g., the Netflix Prize data had roughly $10^8$ rated entries).
3.5 Numerical Considerations
Do not form $A^\top A$ for ill-conditioned problems. As noted in Section 3.1, this squares the condition number. Use np.linalg.svd(A) directly, not np.linalg.eigh(A.T @ A).
Numerical rank. In finite precision, all singular values are slightly nonzero even if the true rank is less than $\min(m, n)$. The numerical rank is the number of singular values greater than a threshold $\tau = \max(m, n) \cdot \varepsilon \cdot \sigma_1$ (where $\varepsilon \approx 2.2 \times 10^{-16}$ for float64). np.linalg.matrix_rank(A) uses exactly this threshold.
One-sided Jacobi SVD achieves close to machine accuracy even for severely ill-conditioned matrices by working directly on the columns of $A$ without any intermediate matrix products. Used when accuracy requirements are extreme (e.g., geodesy, orbit determination).
Mixed precision. In training, gradients are often computed in float16/bfloat16 but SVD is performed in float32 for stability. This matters for gradient-based LoRA rank selection.
NUMERICAL SVD PITFALLS
========================================================================
A has condition kappa: A^T A has condition kappa^2
kappa(A) = 1e7 => kappa(A^T A) = 1e14 (only 2 correct digits
in last eigenvalue!)
SAFE: U, s, Vt = np.linalg.svd(A) <- O(mn min(m,n))
RISKY: eigs = np.linalg.eigh(A.T @ A) <- avoidable precision loss
========================================================================
4. Properties of the SVD
4.1 Norms via Singular Values
The three most important matrix norms all have elegant expressions in terms of singular values:
| Norm | Definition | SVD Expression | Use in AI |
|---|---|---|---|
| Spectral norm $\|A\|_2$ | $\max_{\|x\|_2 = 1} \|Ax\|_2$ | $\sigma_1$ (largest SV) | Lipschitz constant; GAN stability |
| Frobenius norm $\|A\|_F$ | $\sqrt{\sum_{i,j} A_{ij}^2}$ | $\sqrt{\sum_i \sigma_i^2}$ | Weight decay regularisation |
| Nuclear norm $\|A\|_*$ | $\operatorname{tr}\sqrt{A^\top A}$ | $\sum_i \sigma_i$ | Low-rank regularisation; matrix completion |
Spectral norm $\|A\|_2 = \sigma_1$: This is the operator norm - the maximum factor by which $A$ can amplify a vector. For a neural network layer $x \mapsto Wx$, $\|W\|_2$ is the Lipschitz constant. Spectral normalisation enforces $\|W\|_2 = 1$.
Frobenius norm $\|A\|_F$: This is the matrix version of the Euclidean norm. Note $\|A\|_F^2 = \sum_i \sigma_i^2 = \operatorname{tr}(A^\top A)$. L2 weight decay minimises $\|W\|_F^2$, which in SVD terms means penalising all singular values equally.
Nuclear norm $\|A\|_* = \sum_i \sigma_i$: This is the convex envelope (tightest convex relaxation) of the rank function. Minimising $\|A\|_*$ promotes low-rank solutions. Nuclear norm regularisation is used in matrix completion, recommender systems, and multi-task learning.
Norm inequalities: For any $A$ of rank $r$:
$$\|A\|_2 \le \|A\|_F \le \|A\|_* \le \sqrt{r}\, \|A\|_F \le r\, \|A\|_2.$$
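These identities are easy to confirm numerically (random matrix, arbitrary size):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
s = np.linalg.svd(A, compute_uv=False)

spectral  = s[0]
frobenius = np.sqrt(np.sum(s**2))
nuclear   = np.sum(s)

print(np.isclose(spectral,  np.linalg.norm(A, 2)))        # sigma_1
print(np.isclose(frobenius, np.linalg.norm(A, 'fro')))     # sqrt(sum sigma_i^2)
print(np.isclose(nuclear,   np.linalg.norm(A, 'nuc')))     # sum sigma_i
print(spectral <= frobenius <= nuclear)                    # first part of the inequality chain
```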
4.2 The Eckart-Young Theorem
This is one of the most important theorems in applied mathematics - it justifies every low-rank approximation method in machine learning.
Theorem (Eckart-Young, 1936). Let $A = U \Sigma V^\top$ be the SVD with singular values $\sigma_1 \ge \cdots \ge \sigma_r > 0$. Define the rank-$k$ truncation:
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top.$$
Then for both the spectral norm and the Frobenius norm, $A_k$ is the best rank-$k$ approximation: $\|A - A_k\| \le \|A - B\|$ for every $B$ with $\operatorname{rank}(B) \le k$,
with optimal errors:
$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}.$$
Proof sketch (spectral norm). For any rank-$k$ matrix $B$, its null space has dimension at least $n - k$, so by dimension counting ($(k+1) + (n-k) > n$) there exists a unit vector $w$ in the $(k+1)$-dimensional span of $v_1, \ldots, v_{k+1}$ that also lies in the null space of $B$. For this $w$:
$$\|A - B\|_2 \ge \|(A - B)w\|_2 = \|Aw\|_2 \ge \sigma_{k+1},$$
since $w$ is in the span of the top-$(k+1)$ right singular vectors. The bound is achieved by $B = A_k$, since $A - A_k = \sum_{i > k} \sigma_i u_i v_i^\top$ has spectral norm exactly $\sigma_{k+1}$.
Practical significance: The Eckart-Young theorem is the mathematical foundation of:
- PCA: The rank-$k$ SVD of the centred data matrix gives the best rank-$k$ reconstruction (Section 6.1).
- LoRA: The low-rank decomposition $\Delta W \approx BA$ approximates the full gradient update with minimum Frobenius error.
- Image compression: A rank-$k$ image approximation discards all but the top $k$ singular components; Eckart-Young guarantees this is optimal.
- Recommender systems: Matrix factorisation finds the best low-rank model of the user-item rating matrix.
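A numerical confirmation of the theorem on a random matrix: the truncation errors equal $\sigma_{k+1}$ and $\sqrt{\sum_{i>k} \sigma_i^2}$, and a random rank-$k$ competitor does worse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                       # rank-k truncation

# Optimal errors predicted by Eckart-Young
print(np.isclose(np.linalg.norm(A - A_k, 2),    s[k]))
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))

# Any other rank-k matrix does at least as badly (random rank-k competitor)
B = rng.standard_normal((60, k)) @ rng.standard_normal((k, 40))
print(np.linalg.norm(A - B, 'fro') >= np.linalg.norm(A - A_k, 'fro'))
```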
4.3 The Moore-Penrose Pseudoinverse
The Moore-Penrose pseudoinverse $A^+ \in \mathbb{R}^{n \times m}$ is the unique matrix satisfying the four Moore-Penrose conditions:
$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad (A A^+)^\top = A A^+, \qquad (A^+ A)^\top = A^+ A.$$
The SVD gives the explicit formula:
$$A^+ = V \Sigma^+ U^\top,$$
where $\Sigma^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$, leaves zeros as zeros, and is transposed to shape $n \times m$.
Geometric meaning: $A A^+$ is the orthogonal projection onto the column space of $A$; $A^+ A$ is the orthogonal projection onto the row space of $A$.
Solving linear systems:
- Overdetermined ($m > n$, full column rank): The least-squares solution is $\hat{x} = A^+ b = (A^\top A)^{-1} A^\top b$ (normal equations). The minimum residual is $\|(I - A A^+)\, b\|_2$.
- Underdetermined ($m < n$, full row rank): The minimum-norm solution of $Ax = b$ is $\hat{x} = A^+ b = A^\top (A A^\top)^{-1} b$.
- Rank-deficient: $\hat{x} = A^+ b$ gives the minimum-norm least-squares solution.
Ridge regression / Tikhonov regularisation: Adding $\lambda \|x\|_2^2$ to the objective gives:
$$\hat{x}_\lambda = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda}\, (u_i^\top b)\, v_i.$$
Each singular component is attenuated by factor $\sigma_i^2 / (\sigma_i^2 + \lambda)$ - components with $\sigma_i \gg \sqrt{\lambda}$ are kept, components with $\sigma_i \ll \sqrt{\lambda}$ are suppressed. This is spectral shrinkage or Tikhonov regularisation.
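A minimal sketch of both formulas (the helper names are illustrative); the $\lambda = 0$ case reduces to the ordinary least-squares solution:

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse via the thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol * s[0], 1.0 / s, 0.0)      # invert only the nonzero sigmas
    return (Vt.T * s_inv) @ U.T

def ridge_svd(A, b, lam):
    """Ridge solution: each component scaled by sigma_i / (sigma_i^2 + lambda)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = s / (s**2 + lam) * (U.T @ b)
    return Vt.T @ coeffs

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
print(np.allclose(pinv_svd(A), np.linalg.pinv(A)))
print(np.allclose(ridge_svd(A, b, 0.0), np.linalg.lstsq(A, b, rcond=None)[0]))
```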
4.4 SVD and Rank
The SVD provides the most numerically stable definition of rank:
Exact rank: $\operatorname{rank}(A)$ = number of positive singular values.
Numerical rank: In finite precision, count the singular values above a threshold $\tau = \max(m, n) \cdot \varepsilon \cdot \sigma_1$, where typically $\varepsilon$ is machine precision.
Stable rank: The stable rank $\operatorname{srank}(A) = \|A\|_F^2 / \|A\|_2^2 = \sum_i \sigma_i^2 / \sigma_1^2$ is a smooth, noise-robust proxy for rank. It lies in $[1, \operatorname{rank}(A)]$ and equals 1 iff $A$ is rank-1. For LoRA analysis, if $\operatorname{srank}(\Delta W) \ll \min(m, n)$, the update is intrinsically low-rank and can be well-approximated with small $r$.
Effective rank (Roy & Vetterli): $\operatorname{erank}(A) = \exp(H(p))$, where $p_i = \sigma_i^2 / \sum_j \sigma_j^2$ and $H(p) = -\sum_i p_i \log p_i$ is the entropy of the normalised squared singular values. This is a smooth measure that is maximised (= $\operatorname{rank}(A)$) for uniform singular value spectra and minimised (= 1) for rank-1 matrices.
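Both quantities take only a few lines to compute (the function names are illustrative):

```python
import numpy as np

def stable_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**2) / s[0]**2                       # ||A||_F^2 / ||A||_2^2

def effective_rank(A, eps=1e-12):
    s = np.linalg.svd(A, compute_uv=False)
    p = s**2 / np.sum(s**2)                             # normalised squared singular values
    H = -np.sum(p * np.log(p + eps))                    # Shannon entropy
    return np.exp(H)

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))
print(stable_rank(low_rank), effective_rank(low_rank))  # bounded by the true rank (8)
print(stable_rank(rng.standard_normal((200, 100))))     # much larger for a full-rank Gaussian
```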
4.5 Condition Number
The condition number of a matrix with respect to the 2-norm is:
$$\kappa_2(A) = \frac{\sigma_1}{\sigma_{\min}}$$
(where $\sigma_{\min}$ is the smallest positive singular value). For a square invertible matrix, $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2$.
Meaning: If $Ax = b$ and we perturb $b$ by $\delta b$, the relative perturbation in $x$ satisfies:
$$\frac{\|\delta x\|}{\|x\|} \le \kappa_2(A)\, \frac{\|\delta b\|}{\|b\|}.$$
A well-conditioned matrix has $\kappa \approx 1$; a nearly singular matrix has $\kappa \gg 1$.
For AI: The condition number of the Gram matrix $X^\top X$ (or of the Hessian) governs gradient descent convergence: the iteration count scales with $\kappa = \lambda_{\max} / \lambda_{\min}$ for symmetric PSD matrices. Feature normalisation (standardisation, batch norm, layer norm) reduces $\kappa$, accelerating convergence.
Preconditioning: For linear systems $Ax = b$, a preconditioner $M$ transforms the system to $M^{-1} A x = M^{-1} b$ with smaller $\kappa$. K-FAC (Kronecker-Factored Approximate Curvature) is a natural gradient method that preconditions by the Fisher information matrix - equivalent to using a block-diagonal approximation to the Hessian.
4.6 The Procrustes Problem
The orthogonal Procrustes problem asks: given matrices $A, B \in \mathbb{R}^{m \times n}$, find the orthogonal matrix $\Omega$ ($\Omega^\top \Omega = I$) minimising $\|A \Omega - B\|_F$.
Solution: Compute the SVD $A^\top B = U \Sigma V^\top$. Then $\Omega^* = U V^\top$.
Proof: $\|A \Omega - B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2 \operatorname{tr}(\Omega^\top A^\top B)$. Since $\Omega$ is orthogonal, $\|A \Omega\|_F = \|A\|_F$ is constant. Minimising over $\Omega$ means maximising $\operatorname{tr}(\Omega^\top A^\top B)$. Setting $A^\top B = U \Sigma V^\top$, this is $\operatorname{tr}(\Omega^\top U \Sigma V^\top) = \operatorname{tr}(V^\top \Omega^\top U \Sigma) \le \sum_i \sigma_i$, with equality when $V^\top \Omega^\top U = I$, i.e., $\Omega = U V^\top$.
For AI:
- Shape alignment: Procrustes is used in protein structure alignment, morphometric analysis.
- Domain adaptation: Aligning word vector spaces across languages (cross-lingual transfer) uses Procrustes.
- RetNet / linear attention: Some position encoding methods use orthogonal transformations computed via Procrustes.
- Gradient alignment: The Procrustes problem appears in task arithmetic - finding the optimal rotation to align task vectors.
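A minimal Procrustes sketch under the formulation above; the data are synthetic, and the true rotation is known so the answer can be checked:

```python
import numpy as np

def procrustes(A, B):
    """Orthogonal Omega minimising ||A @ Omega - B||_F (solution U V^T from the SVD of A^T B)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # hidden "true" orthogonal map
B = A @ Q + 0.01 * rng.standard_normal((100, 5))    # rotated copy plus noise

Omega = procrustes(A, B)
print(np.allclose(Omega.T @ Omega, np.eye(5)))      # Omega is orthogonal
# Optimal over all orthogonal maps, so at least as good as the hidden truth
print(np.linalg.norm(A @ Omega - B) <= np.linalg.norm(A @ Q - B) + 1e-9)
```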
5. Outer Product Form and Rank-1 Decomposition
5.1 Rank-1 Outer Product Decomposition
The SVD has a beautiful outer product representation: every matrix is a weighted sum of rank-1 matrices:
$$A = \sum_{i=1}^{r} \sigma_i\, u_i v_i^\top.$$
Each term $u_i v_i^\top$ is a rank-1 matrix (outer product of two vectors). The singular values act as importance weights: the first term contributes the most to $A$, the last the least. The truncated sum $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$ is the best rank-$k$ approximation (Eckart-Young).
RANK-1 DECOMPOSITION
========================================================================
A = \sigma_1 * u_1v_1^T + \sigma_2 * u_2v_2^T + ... + \sigma_r * u_rv_r^T
------------- ------------- -------------
rank-1 term rank-1 term rank-1 term
(dominant) (less important) (least important)
Truncating at rank k: discard \sigma_{k+1},...,\sigma_r terms
Error: ||A - A_k||_F = \sqrt(\sigma_{k+1}^2 + ... + \sigma_r^2)
========================================================================
For AI - matrix as a sum of "features": Each rank-1 term is an "atom" - a pattern in the output ($u_i$) correlated with a pattern in the input ($v_i$), with strength $\sigma_i$. In a trained weight matrix $W$:
- The top singular direction ($u_1, v_1$) is the most amplified input-output feature pair.
- Lower singular directions correspond to less important, often noisier associations.
- LoRA's $\Delta W = BA$ with $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ is equivalent to adding $r$ rank-1 terms to $W$.
5.2 Incremental and Streaming SVD
When data arrives in a stream and the SVD must be updated without recomputing from scratch, incremental SVD methods are used. Brand's method (2002) updates an existing rank-$r$ thin SVD after a low-rank (e.g., rank-1) modification of $A$ in time roughly linear in the matrix dimensions, rather than recomputing from scratch in $O(mn \min(m, n))$.
Algorithm sketch (rank-1 update):
- Project the update onto the current bases $U$ and $V$, and isolate the components orthogonal to them.
- Form augmented bidiagonal matrix.
- Compute small SVD and update.
This is used in online recommender systems (user ratings arrive one at a time), PCA on streaming data, and online learning scenarios.
5.3 Symmetric Matrices: SVD Equals Eigendecomposition
For symmetric matrices, the SVD and eigendecomposition coincide (with a sign adjustment for indefinite matrices):
Case 1: Symmetric PSD ($A = A^\top$, all $\lambda_i \ge 0$). The spectral theorem gives $A = Q \Lambda Q^\top$. This is already an SVD with $U = V = Q$ and $\sigma_i = \lambda_i$. Singular values equal eigenvalues.
Case 2: Symmetric but indefinite ($A = A^\top$, mixed signs). Say $A = Q \Lambda Q^\top$ with some $\lambda_i < 0$. Then $\sigma_i = |\lambda_i|$ are the singular values. The sign of $\lambda_i$ is absorbed into $U$: if $\lambda_i < 0$, then $u_i = -q_i$ while $v_i = q_i$ (flip the sign of the left singular vector). So $U \neq V$ for indefinite symmetric matrices.
Consequence: For a symmetric matrix, $\sigma_i = |\lambda_i|$. The singular values are the absolute values of the eigenvalues. If you compute np.linalg.eigvalsh(A) and np.linalg.svd(A, compute_uv=False), the latter will equal the absolute values of the former, sorted in descending order.
6. SVD in Machine Learning and AI
6.1 Low-Rank Approximation and LoRA
LoRA (Low-Rank Adaptation, Hu et al. 2021) is the most widely used parameter-efficient fine-tuning method for large language models. Its mathematical foundation is the Eckart-Young theorem.
The key observation: when fine-tuning a pre-trained model on a new task, the weight updates $\Delta W = W_{\text{ft}} - W_0$ have low intrinsic rank. Rather than storing and updating the full $\Delta W \in \mathbb{R}^{m \times n}$ (which has $mn$ parameters), LoRA parameterises:
$$\Delta W = B A, \qquad B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n}, \quad r \ll \min(m, n).$$
Only $B$ and $A$ are trained (with $A$ initialised from a Gaussian and $B$ initialised to zero so $\Delta W = 0$ at the start). The forward pass becomes $h = W_0 x + \frac{\alpha}{r} B A x$ (the scaling $\alpha / r$ stabilises training).
SVD connection: The best rank-$r$ approximation of a given update $\Delta W = U \Sigma V^\top$ comes from its truncated SVD: $B = U_r \Sigma_r^{1/2}$, $A = \Sigma_r^{1/2} V_r^\top$. In practice, LoRA uses random initialisation (since $\Delta W$ is unknown before training), but SVD is used post-hoc to analyse which singular directions were actually modified.
DoRA (Liu et al. 2024) further decomposes $W = m \cdot \frac{V}{\|V\|_c}$, where $m$ is a learned magnitude vector and $\|V\|_c$ is the column-wise norm. It then applies LoRA to the directional component, achieving better performance per parameter.
MLA (DeepSeek 2024) compresses the KV cache by factoring the key/value projection matrices through a shared low-rank bottleneck: a down-projection to a latent dimension $d_c \ll d$ followed by up-projections for keys and values. This is exactly the rank-$d_c$ SVD-style structure applied to the key/value projections.
Parameters saved by LoRA: Instead of $mn$ parameters for $\Delta W$, LoRA uses $r(m + n)$ parameters. For example, with $m = n = 4096$ and $r = 8$: the full $\Delta W$ has $\approx 16.8$M parameters, LoRA has $65{,}536$ - a $256\times$ reduction.
6.2 Recommender Systems
Recommender systems model the user-item interaction matrix $R \in \mathbb{R}^{m \times n}$, where $R_{ui}$ is user $u$'s rating of item $i$ (or 0 if unrated). The goal is to predict missing entries.
SVD-based collaborative filtering: Approximate $R \approx U_k \Sigma_k V_k^\top$ (truncated to rank $k$), where:
- $U_k$: user latent factors (each user's preferences in $k$-dimensional space)
- $V_k$: item latent factors (each item's properties in $k$-dimensional space)
- $\Sigma_k$: importance of the $k$ latent dimensions
Predicted rating: $\hat{R}_{ui} = \sum_{j=1}^{k} \sigma_j\, U_{uj} V_{ij}$.
Challenge: $R$ is sparse (mostly unobserved). Direct SVD fills missing values with 0, which is wrong. Simon Funk's SGD-SVD (2006, made famous by the Netflix Prize) instead minimises the squared error over observed entries only, $\sum_{(u,i)\,\text{observed}} (R_{ui} - p_u^\top q_i)^2$ plus regularisation, via stochastic gradient descent.
Non-negative Matrix Factorisation (NMF): Constrains the factors to be non-negative elementwise, $R \approx WH$ with $W, H \ge 0$. Gives "parts-based" representations (pixels, topics, users). NMF is a low-rank factorisation like the truncated SVD, but with non-negativity constraints in place of orthogonality.
6.3 Latent Semantic Analysis and Word Vectors
Latent Semantic Analysis (LSA, Deerwester et al. 1990) applies truncated SVD to the term-document matrix $X \in \mathbb{R}^{V \times D}$, where $X_{td}$ = tf-idf weight of term $t$ in document $d$.
The rank-$k$ SVD $X \approx U_k \Sigma_k V_k^\top$ gives:
- $U_k$ (rows): word embeddings in the $k$-dimensional latent semantic space
- $V_k$ (rows): document embeddings
- $\Sigma_k$: importance of each latent dimension
Words that co-occur in similar documents get nearby embeddings. This is one of the first distributed word representations - a precursor to word2vec and GloVe.
Connection to GloVe: GloVe (Pennington et al. 2014) can be interpreted as a shifted, weighted relative of LSA. The GloVe model minimises a weighted Frobenius-norm objective between the log co-occurrence statistics (closely related to pointwise mutual information, PMI) and a low-rank factorisation - structurally an SVD with a different weighting function.
Modern transformer word embeddings are not directly SVD, but the embedding table can be analysed via SVD. Well-trained embeddings typically have a steep singular value spectrum (dominant directions = common syntactic/semantic features; tail = noise).
6.4 Image Compression
The rank-$k$ SVD of a grayscale image $I \in \mathbb{R}^{m \times n}$ gives:
$$I_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top.$$
Storage: $k(m + n + 1)$ numbers vs $mn$ for the full image. Compression ratio: $\frac{mn}{k(m + n + 1)}$.
For example, for a $512 \times 512$ image with $k = 50$: the original stores $262{,}144$ numbers, the compressed form stores $51{,}250$ - roughly a $5\times$ compression. JPEG achieves better compression ratios than SVD at the same quality because it exploits block structure and human visual system properties, but SVD compression is mathematically optimal in the Frobenius-norm sense.
Colour images: Apply SVD independently to each channel (R, G, B), or decompose the $m \times 3n$ matrix formed by stacking the channels. In practice, the colour channels are correlated and the singular value spectra of different channels are similar.
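A sketch of rank-$k$ compression on a synthetic image (the image itself is made up for illustration):

```python
import numpy as np

def svd_compress(img, k):
    """Best rank-k approximation of a 2-D image array (Eckart-Young optimal)."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Synthetic 256x256 "image": smooth gradient plus a sharp rectangular block
x = np.linspace(0, 1, 256)
img = np.outer(x, 1 - x)
img[64:128, 64:192] += 0.5

for k in (1, 5, 20, 50):
    rel_err = np.linalg.norm(img - svd_compress(img, k)) / np.linalg.norm(img)
    storage = k * (img.shape[0] + img.shape[1] + 1)
    print(f"k={k:3d}  relative error={rel_err:.4f}  compression={img.size / storage:.1f}x")
```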
6.5 Pseudoinverse and Least Squares
Overdetermined systems (linear regression): Given $X \in \mathbb{R}^{m \times n}$ (data, $m > n$) and $y \in \mathbb{R}^m$, the least-squares problem $\min_w \|Xw - y\|_2^2$ has solution:
$$\hat{w} = X^+ y = \sum_{i=1}^{r} \frac{u_i^\top y}{\sigma_i}\, v_i.$$
Each component is the projection of $y$ onto the $i$-th left singular vector, divided by the singular value. If $\sigma_i$ is very small (ill-conditioned), this component is amplified enormously - catastrophic if $y$ contains noise.
Ridge regression shrinkage: The regularised solution is:
$$\hat{w}_\lambda = \sum_{i=1}^{r} \frac{\sigma_i}{\sigma_i^2 + \lambda}\, (u_i^\top y)\, v_i.$$
For $\sigma_i \gg \sqrt{\lambda}$: factor $\approx 1 / \sigma_i$ (same as unregularised). For $\sigma_i \ll \sqrt{\lambda}$: factor $\approx \sigma_i / \lambda \to 0$ (component is suppressed). The threshold $\sqrt{\lambda}$ acts as an effective singular value cutoff.
For AI: The normal equations $X^\top X w = X^\top y$ are equivalent but numerically inferior (condition number squared). Always use np.linalg.lstsq(X, y), which uses the SVD internally.
6.6 Spectral Methods for Graphs
For a bipartite graph with two vertex sets of sizes $m$ and $n$ and edge weight matrix $B \in \mathbb{R}^{m \times n}$ (biadjacency matrix), the SVD provides a spectral embedding:
- Left singular vectors $u_i$: embed the row-side vertices
- Right singular vectors $v_i$: embed the column-side vertices
- Singular values: weight each dimension
This is used in spectral co-clustering (Dhillon 2001): simultaneously cluster rows and columns of a data matrix by finding the top-$k$ singular vectors and applying $k$-means to the embedded points.
Connection to symmetric spectral clustering: For a bipartite graph, the SVD of the normalised biadjacency matrix is equivalent to the eigendecomposition of the symmetric Laplacian of the full bipartite graph. The $i$-th singular value of the normalised biadjacency matrix corresponds to the $i$-th eigenvalue of the graph Laplacian.
6.7 Weight Matrix Analysis (WeightWatcher)
WeightWatcher (Martin and Mahoney, 2019) is a tool for diagnosing model quality without test data by analysing the singular value spectra of weight matrices.
Key finding: Well-trained neural networks have weight matrices with heavy-tailed singular value spectra that follow power laws:
$$\rho(\lambda) \sim \lambda^{-\alpha}, \qquad \lambda = \sigma^2,$$
with $\alpha$ roughly in the range 2-4 for high-quality models. This is predicted by random matrix theory (Marchenko-Pastur distribution) for random matrices, with the "learned" part manifesting as the heavy tail.
Diagnostic metrics:
- $\alpha$ (power-law exponent): fit to the tail of the spectral distribution. $\alpha$ roughly in 2-4 indicates well-trained; much larger $\alpha$ (flat, random-like spectrum) indicates undertrained; $\alpha < 2$ indicates overtrained/over-correlated layers.
- Weighted alpha: $\alpha$ averaged across layers, weighted by layer size.
- Stable rank: $\|W\|_F^2 / \|W\|_2^2$ measures effective rank per layer.
For AI: WeightWatcher can predict model quality (test accuracy) from just the weight matrices, without any data. It has been used to select the best checkpoint during training, identify layers that need more training, and detect fine-tuning quality.
6.8 Attention and SVD
The attention weight matrix $A_{\text{attn}} = \operatorname{softmax}(Q K^\top / \sqrt{d_k}) \in \mathbb{R}^{L \times L}$ (for sequence length $L$) has interesting SVD structure in trained transformers:
Low-rank structure of attention: Trained attention matrices are approximately low-rank - most of the attention pattern is captured by a few singular vectors. This has been observed empirically across multiple model families, and the effective rank differs systematically between head types (e.g., induction heads vs copying heads).
SVD-based attention compression: Several works approximate the attention computation using truncated SVDs of $Q$ and $K$: write $Q \approx U_Q \Sigma_Q V_Q^\top$ and $K \approx U_K \Sigma_K V_K^\top$, then $Q K^\top \approx U_Q (\Sigma_Q V_Q^\top V_K \Sigma_K) U_K^\top$. If the middle factor has low rank, the computation is cheaper.
FlashAttention does not directly use SVD but exploits the block structure of attention matrices for I/O-efficient computation. The theoretical analysis of approximation quality in low-rank attention schemes involves the singular value decay of the attention matrix.
MLA's KV compression: DeepSeek's MLA projects keys and values through a low-rank bottleneck of dimension $d_c \ll d$. At inference, only the $d_c$-dimensional latent is cached per token, reducing the KV cache roughly by a factor $d / d_c$. This is exactly a rank-$d_c$ SVD-style projection of the key-value information.
7. Generalised SVD and Extensions
7.1 Generalised SVD (GSVD)
The Generalised SVD (GSVD) factorises a pair of matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times n}$ with the same number of columns:
$$A = U C X^\top, \qquad B = V S X^\top,$$
where $U$, $V$ are orthogonal, $X$ is non-singular, and $C$, $S$ are diagonal with $C^\top C + S^\top S = I$.
Applications:
- Regularised least squares: Solve $\min_x \|Ax - b\|^2 + \lambda \|Lx\|^2$ (Tikhonov with a regularisation operator $L$) via the GSVD of $(A, L)$.
- Generalised eigenproblems: $A^\top A x = \lambda B^\top B x$ is solved via the GSVD of $(A, B)$.
- Fisher LDA: The generalised eigenproblem $S_B w = \lambda S_W w$ (between/within scatter) is a GSVD problem.
- CCA: Canonical Correlation Analysis decomposes the cross-covariance matrix and uses a GSVD-like structure.
NumPy and SciPy do not expose the GSVD through a high-level function; it is available through LAPACK's ggsvd3 routines or third-party packages.
7.2 Tensor SVD and Tucker Decomposition
For tensors (multi-dimensional arrays) $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, several SVD-like decompositions exist:
Tucker Decomposition: $\mathcal{T} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_d U^{(d)}$, where $\mathcal{G}$ is the core tensor and the $U^{(k)}$ are orthogonal factor matrices. The Higher-Order SVD (HOSVD) computes each $U^{(k)}$ from the SVD of the mode-$k$ unfolding of $\mathcal{T}$.
CP Decomposition (CANDECOMP/PARAFAC): $\mathcal{T} \approx \sum_{i=1}^{R} a_i^{(1)} \otimes a_i^{(2)} \otimes \cdots \otimes a_i^{(d)}$ - a sum of rank-1 tensors. Unlike matrix SVD, tensor rank is NP-hard to compute in general, and the best rank-$R$ approximation may not exist (unlike Eckart-Young for matrices).
For AI: Tucker decomposition is used to compress convolutional layers (a 4D weight tensor). MobileNet-style depthwise separable convolutions are an approximate Tucker decomposition. Tensor train (TT) decomposition is used for compressing embedding tables and attention weight matrices in extreme low-resource settings.
7.3 Truncated and Randomised Variants in Practice
| Method | Function | When to Use | Cost |
|---|---|---|---|
| Full SVD | np.linalg.svd(A) | Small matrices, need all singular values | $O(mn \min(m, n))$ |
| Thin SVD | np.linalg.svd(A, full_matrices=False) | Standard case | $O(mn \min(m, n))$ |
| Truncated SVD | scipy.sparse.linalg.svds(A, k=k) | Sparse/large, need top-$k$ | $O(\text{nnz}(A) \cdot k)$ per iteration |
| Randomised SVD | sklearn.utils.extmath.randomized_svd | Dense large, approximate top-$k$ | $O(mnk)$ |
| Incremental SVD | sklearn.decomposition.IncrementalPCA | Streaming data | small cost per batch, independent of total rows |
| PyTorch low-rank | torch.svd_lowrank(A, q=k) | GPU tensors | $O(mnk)$ |
For LoRA initialisation from a checkpoint, the typical workflow is:
- Load the pre-trained weight matrix $W_0$ and the fine-tuned weight matrix $W_{\text{ft}}$.
- Compute $\Delta W = W_{\text{ft}} - W_0$.
- Run
U, s, Vh = np.linalg.svd(delta_W, full_matrices=False).
- Set $B = U[:, :r] \cdot \operatorname{diag}(\sqrt{s_{1:r}})$ and $A = \operatorname{diag}(\sqrt{s_{1:r}}) \cdot Vh[:r, :]$.
- Confirm $\|\Delta W - BA\|_F / \|\Delta W\|_F$ is small (see the sketch below).
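A minimal sketch of the whole workflow with simulated weights; all names, shapes, and the rank-8 ground-truth update are illustrative assumptions:

```python
import numpy as np

def lora_from_svd(delta_W, r):
    """Rank-r factorisation delta_W ~ B @ A via the truncated SVD (Eckart-Young optimal)."""
    U, s, Vh = np.linalg.svd(delta_W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    B = U[:, :r] * sqrt_s            # (m, r)
    A = sqrt_s[:, None] * Vh[:r]     # (r, n)
    return B, A

rng = np.random.default_rng(0)
W0 = rng.standard_normal((512, 512))
# Simulated fine-tuned weights whose update is exactly rank 8
delta_true = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 512))
W_ft = W0 + delta_true

delta_W = W_ft - W0
for r in (2, 8, 32):
    B, A = lora_from_svd(delta_W, r)
    err = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
    print(f"r={r:2d}  relative error={err:.2e}  params: {B.size + A.size} vs {delta_W.size}")
```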
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Treating singular values as eigenvalues | $\sigma_i \neq \lambda_i$ in general; a rotation matrix has complex eigenvalues $e^{\pm i\theta}$ but all $\sigma_i = 1$ | Remember: $\sigma_i = \sqrt{\lambda_i(A^\top A)}$; only equal for symmetric PSD matrices |
| 2 | Forming $A^\top A$ to compute the SVD | Squares the condition number ($\kappa(A^\top A) = \kappa(A)^2$), destroying small singular values | Use np.linalg.svd(A) directly; never np.linalg.eigh(A.T @ A) for the SVD |
| 3 | Assuming $U = V$ for non-symmetric matrices | $U$ and $V$ are different orthogonal matrices (different sizes for rectangular $A$); $U = V$ only if $A$ is symmetric PSD | Check dimensions: $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ for an $m \times n$ matrix |
| 4 | Confusing sign conventions in $U$ and $V$ | Flipping the signs of a pair $(u_i, v_i)$ leaves $A$ unchanged; numpy may return different signs each run | Always check numerically; compare absolute cosine similarity between vectors |
| 5 | Using full SVD when thin SVD suffices | Full SVD computes all $m$ columns of $U$, including $m - n$ columns that don't contribute | Use np.linalg.svd(A, full_matrices=False) for the thin SVD |
| 6 | Claiming rank by counting nonzero singular values in finite precision | In float64, rounding makes essentially all computed $\sigma_i > 0$ even when the true rank is deficient | Use np.linalg.matrix_rank(A) with an appropriate tolerance |
| 7 | Applying LoRA to all layers equally | Not all layers have equally low-rank updates; key/query projections often need higher rank than MLP layers | Allocate rank per layer (e.g., AdaLoRA-style) based on the singular value spectrum of each layer's update |
| 8 | Confusing nuclear norm with Frobenius norm | $\|A\|_* = \sum_i \sigma_i$ vs $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$; minimising the nuclear norm promotes sparsity in the singular values (low rank), not in the matrix entries | Choose the norm based on the goal: low rank -> nuclear norm; small magnitude -> Frobenius norm |
| 9 | Thinking the pseudoinverse solves all linear systems stably | $A^+$ amplifies small-singular-value components; for noisy $b$, the term $(u_i^\top b) / \sigma_i$ is huge when $\sigma_i \approx 0$ | Use ridge regression ($\lambda > 0$) or truncate singular values below a threshold |
| 10 | Using scipy.linalg.svd and numpy.linalg.svd interchangeably without care | Both return (U, s, Vh) with the same convention, but scipy offers more options (drivers, overwrite flags); be consistent | Pick one, and note that Vh is already $V^\top$, not $V$ |
9. Exercises
Exercise 1 * - SVD by Hand (2×2)
Compute the SVD of a given $2 \times 2$ matrix $A$ by hand: (a) Form $A^\top A$ and find its eigenvalues and eigenvectors. (b) Determine $\sigma_1$ and $\sigma_2$. (c) Compute $u_i = A v_i / \sigma_i$ for each $\sigma_i > 0$. (d) Write out $U$, $\Sigma$, $V$, and verify $A = U \Sigma V^\top$. (e) What are the semi-axes of the ellipse $A$ maps the unit circle to?
Exercise 2 * - Four Fundamental Subspaces via SVD
Given a rank-2 matrix $A$: (a) Compute the thin SVD. (b) Identify a basis for the column space, row space, null space, and left null space from the SVD. (c) Verify your null space basis by showing $Av = 0$ for each basis vector $v$. (d) Verify the column space basis by expressing $A$'s columns as linear combinations.
Exercise 3 ** - Eckart-Young Reconstruction Error
For a given matrix $A$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$:
(a) Implement truncated_svd(A, k) that returns the rank-$k$ approximation $A_k$.
(b) Compute the exact spectral norm error $\|A - A_k\|_2$ for each $k$.
(c) Compute the Frobenius norm error for each $k$.
(d) Verify the Eckart-Young bounds: $\|A - A_k\|_2 = \sigma_{k+1}$ and $\|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$.
(e) What is the minimum rank needed to retain 95% of the Frobenius norm of $A$?
Exercise 4 ** - Moore-Penrose Pseudoinverse
Implement pseudoinverse(A, tol=1e-10) using the SVD:
(a) Compute the thin SVD $A = U \Sigma V^\top$.
(b) Invert the nonzero singular values (use tol to determine which are zero).
(c) Return $A^+ = V \Sigma^+ U^\top$.
(d) Verify the four Moore-Penrose conditions for a random matrix.
(e) Use your pseudoinverse to solve an overdetermined system $Ax = b$ and verify it gives the minimum-norm least-squares solution.
Exercise 5 ** - Condition Number and Least Squares Stability
(a) Generate an ill-conditioned matrix $A = U \Sigma V^\top$ with singular values spanning many orders of magnitude (e.g., logarithmically spaced), using random orthogonal $U$ and $V$.
(b) Set $b = A x_{\text{true}} + \varepsilon$ (add small noise $\varepsilon$).
(c) Solve using np.linalg.lstsq (SVD-based) and via the normal equations $A^\top A x = A^\top b$.
(d) Compare the errors $\|\hat{x} - x_{\text{true}}\|$ for both methods.
(e) Repeat with ridge regression for a range of $\lambda$ values and plot the trade-off between fit and regularisation.
Exercise 6 *** - Randomised SVD
Implement randomized_svd(A, k, n_oversampling=5, n_iter=2):
(a) Generate a Gaussian sketch matrix $\Omega \in \mathbb{R}^{n \times (k + p)}$.
(b) Compute the range sketch $Y = (A A^\top)^q A \Omega$ (with n_iter=q power iterations).
(c) Orthogonalise $Y$ via QR.
(d) Project $B = Q^\top A$ and compute its small SVD.
(e) Compare your implementation against np.linalg.svd and sklearn.utils.extmath.randomized_svd on a large test matrix. Report relative Frobenius error vs rank.
Exercise 7 *** - LoRA-Style Low-Rank Adapter
Simulate LoRA fine-tuning:
(a) Create a "pre-trained" weight matrix $W_0$ (random Gaussian).
(b) Create a "true fine-tuned" update $\Delta W$ with rank 4 (use the SVD to construct it).
(c) Implement lora_decompose(delta_W, r) that finds the rank-$r$ factorisation $BA$ minimising $\|\Delta W - BA\|_F$.
(d) Compute the relative approximation error for a range of ranks $r$.
(e) Show that $r = 4$ recovers $\Delta W$ almost exactly (since it is rank-4) while $r < 4$ introduces error. Plot the Eckart-Young error bound.
Exercise 8 *** - Image Compression via SVD
(a) Load or generate a grayscale "image" matrix (use a structured synthetic image with edges and gradients).
(b) Implement svd_compress(I, k) that returns the rank-$k$ approximation.
(c) Compute PSNR (peak signal-to-noise ratio) vs compression ratio for a range of $k$ values.
(d) Plot the singular value spectrum and identify the "knee" where adding more singular values gives diminishing returns.
(e) Compare storage: full image vs rank-$k$ representation. What rank achieves your target compression ratio (e.g., $5\times$)?
10. Why This Matters for AI (2026 Perspective)
| Concept | Impact on AI/LLMs in 2026 | Specific Methods |
|---|---|---|
| Eckart-Young theorem | Every low-rank method is justified by this theorem | LoRA, DoRA, MLA, spectral pruning |
| Truncated SVD | Foundation of parameter-efficient fine-tuning; compress $mn$ parameters to $r(m + n)$ | LoRA (Hu 2021), AdaLoRA (Zhang 2023), QLoRA |
| Randomised SVD | Makes SVD practical at web scale; $O(mnk)$ vs $O(mn \min(m, n))$ | TruncatedSVD in sklearn, torch.svd_lowrank |
| Spectral norm | Lipschitz control of each layer; GAN stability; gradient norm bounding | Spectral normalisation (Miyato 2018), SN-GAN |
| Nuclear norm | Convex relaxation of rank; used in matrix completion and multi-task learning | Matrix completion, inductive matrix completion |
| Pseudoinverse | Numerically stable least squares; basis of linear probing | Linear regression, probing classifiers, OLS |
| Condition number | Explains ill-conditioning in training; motivates normalisation | BatchNorm, LayerNorm, feature scaling, K-FAC |
| SVD of weight matrices | Quality metric for trained models without test data | WeightWatcher, model selection, layer diagnosis |
| SVD of attention matrices | Explains attention head specialisation; guides compression | Attention pruning, low-rank attention, MLA |
| Procrustes problem | Cross-lingual transfer; domain adaptation; task arithmetic | MUSE (cross-lingual), task arithmetic, LoRA merge |
| Singular value spectrum | Power-law tail indicates implicit self-regularisation in SGD | WeightWatcher, alpha power law metric |
| MLA / KV compression | Low-rank projection of K/V matrices reduces inference cost | DeepSeek-V2/V3 MLA, GQA, MQA |
11. Conceptual Bridge
Where we came from: Section 03-01 (Eigenvalues and Eigenvectors) gave us the tools to decompose square matrices into invariant directions and scaling factors. But eigendecomposition is limited: it requires square matrices, may involve complex numbers, and fails for defective matrices. The SVD generalises this to any matrix by abandoning the requirement that input and output live in the same space. Instead of eigenvectors - directions that map to themselves - we have singular vectors: optimal input and output directions paired by the matrix's stretching action.
Where we are going: Section 03-03 (Principal Component Analysis) is, in its essence, the SVD of the centred data matrix. PCA finds the principal axes of data variation by computing the SVD of the centred data matrix $X_c$; the right singular vectors are the principal components, and the singular values (scaled by $1/\sqrt{n-1}$) are the standard deviations in each principal direction. Everything you have learned about the SVD - Eckart-Young, pseudoinverse, truncation - applies directly to PCA.
The SVD as a unifying lens: The SVD connects to every major topic in the rest of this curriculum:
- Matrix norms (Section 03-06): spectral, Frobenius, and nuclear norms are all functions of singular values.
- Positive definite matrices (Section 03-07): a symmetric matrix is PSD iff its singular values equal its eigenvalues and are non-negative.
- Optimisation (Chapter 08): the gradient of the nuclear norm (and of individual singular values) with respect to $A$ involves the singular vectors; nuclear norm minimisation is a semidefinite program solvable via the SVD.
- Probability (Chapter 06): the covariance matrix $\frac{1}{n} X_c^\top X_c$ has an eigendecomposition directly related to the SVD of $X_c$; the Mahalanobis distance uses its inverse, computed stably via the SVD.
CONCEPTUAL MAP - SVD IN THE CURRICULUM
========================================================================
Eigenvalues & Singular Value Principal Component
Eigenvectors --> Decomposition --> Analysis
(03-01) (03-02) <HERE (03-03)
|
+--------------+------------------------------+
| | |
v v v
Matrix Least Squares AI Applications
Norms & Pseudoinverse -----------------
(03-06) (everywhere) LoRA / DoRA / MLA
WeightWatcher
Recommender Sys.
Image Compression
Spectral Norm GAN
========================================================================
The single most important insight about SVD: Every matrix is the composition of three simple operations - rotate, scale, rotate. The singular values tell you how much the matrix stretches space in its principal directions; the singular vectors tell you which directions those are. Understanding this decomposition is understanding every linear operation in every neural network, every data compression, and every least-squares problem in machine learning.
<- Back to Advanced Linear Algebra | <- Eigenvalues | PCA ->