Math for LLMs: Notes

Advanced Linear Algebra / Singular Value Decomposition

"Every matrix tells a story of rotation, scaling, and rotation again - the SVD reads that story in full."

Overview

The Singular Value Decomposition (SVD) is the most universally applicable matrix factorisation in all of applied mathematics. Given any matrix ARm×nA \in \mathbb{R}^{m \times n} - rectangular, rank-deficient, ill-conditioned, whatever - the SVD expresses it as A=UΣVA = U\Sigma V^\top: a rotation in the input space (VV^\top), a coordinate-wise scaling (Σ\Sigma), and a rotation in the output space (UU). The scaling factors σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0 are the singular values; they completely determine the linear map's geometry. Unlike the eigendecomposition, which requires a square matrix and may involve complex numbers, the SVD always exists, always produces real non-negative singular values, and works for matrices of any shape.

For AI practitioners in 2026, the SVD is inescapable. Low-Rank Adaptation (LoRA) fine-tunes billion-parameter language models by constraining weight updates to a low-rank subspace - the dominant singular subspace of the gradient matrix. DeepSeek's Multi-Head Latent Attention (MLA) compresses key-value caches by projecting them through a low-rank bottleneck. The Eckart-Young theorem guarantees that the rank-kk SVD truncation is the best possible rank-kk approximation, which is why SVD underlies recommender systems, image compression, latent semantic analysis, and the pseudoinverse. WeightWatcher diagnoses model quality by examining the singular value spectrum of weight matrices: healthy trained networks have heavy-tailed spectra; undertrained or overtrained networks do not. Every time you compute a spectral norm, solve a least-squares problem, or talk about the "effective rank" of a weight matrix, you are using the SVD.

This section develops the SVD from first principles, proves the Eckart-Young theorem, connects SVD to eigenvalues and the four fundamental subspaces, and builds every major AI application from scratch.

Prerequisites

  • Eigenvalues and eigenvectors, spectral theorem for symmetric matrices (Section 03-01)
  • Matrix rank, null space, column space (Section 02-05)
  • Vector spaces and orthogonality (Section 02-06)
  • Matrix transpose, invertibility, determinants (Section 02-02 through 02-04)

Companion Notebooks

Notebook        | Description
theory.ipynb    | Interactive demos: geometric action, Eckart-Young compression, pseudoinverse, randomised SVD, LoRA, image compression, weight matrix analysis
exercises.ipynb | 8 graded problems: SVD by hand, four subspaces, low-rank approximation, pseudoinverse, condition number, randomised SVD, LoRA adapter, image compression

Learning Objectives

After completing this section, you will:

  • State the SVD theorem and explain its geometric meaning as three successive linear maps
  • Compute the SVD of small matrices by finding the eigendecomposition of AAA^\top A and AAAA^\top
  • Distinguish full SVD, thin SVD, and truncated SVD; know when to use each
  • Identify the four fundamental subspaces of AA from its SVD
  • Prove and apply the Eckart-Young theorem for optimal low-rank approximation
  • Compute the Moore-Penrose pseudoinverse A+A^+ and use it to solve least-squares problems
  • Express the spectral norm, Frobenius norm, and nuclear norm in terms of singular values
  • Implement randomised SVD for large matrices
  • Connect SVD to LoRA, MLA, spectral normalisation, recommender systems, and LSA
  • Interpret the singular value spectrum of neural network weight matrices

1. Intuition

1.1 What SVD Is

A matrix ARm×nA \in \mathbb{R}^{m \times n} represents a linear map from Rn\mathbb{R}^n to Rm\mathbb{R}^m. That map can be arbitrarily complex - it can rotate, stretch, shrink, and project. The SVD breaks this complexity into three primitives that are each individually simple:

A = U \Sigma V^\top
  • VV^\top: rotation/reflection in the input space Rn\mathbb{R}^n. VV is orthogonal (VV=InV^\top V = I_n), so VV^\top is a rigid transformation that does not change lengths or angles - it just changes the orientation of the coordinate axes.
  • Σ\Sigma: scaling along coordinate axes. Σ\Sigma is an m×nm \times n diagonal-like matrix with non-negative entries σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0 on the diagonal. It stretches or shrinks each axis independently. If mnm \neq n, it also changes the dimension.
  • UU: rotation/reflection in the output space Rm\mathbb{R}^m. UU is orthogonal (UU=ImUU^\top = I_m), another rigid transformation.

The sequence VΣUV^\top \to \Sigma \to U is the "rotate -> scale -> rotate" story of the matrix. The first rotation aligns the input coordinates to the special axes (the right singular vectors vi\mathbf{v}_i, columns of VV). Then each axis is scaled by σi\sigma_i. Then the result is rotated to the output coordinate system via UU, whose columns ui\mathbf{u}_i are the left singular vectors.

This decomposition exists for every real (or complex) matrix, regardless of shape, rank, or conditioning. The singular values σi\sigma_i are unique (though the vectors ui,vi\mathbf{u}_i, \mathbf{v}_i have sign ambiguity and are non-unique when singular values repeat). This universality is what distinguishes the SVD from the eigendecomposition, which only applies to square matrices and can fail (defective matrices) or become complex.

For AI: When you apply a linear layer WRdout×dinW \in \mathbb{R}^{d_{out} \times d_{in}} to an input x\mathbf{x}, you are implicitly performing Wx=UΣVxW\mathbf{x} = U\Sigma V^\top \mathbf{x}. The right singular vectors vi\mathbf{v}_i are the "input features" the layer responds to; the singular values σi\sigma_i are how strongly each feature is amplified; the left singular vectors ui\mathbf{u}_i are the "output features" it writes to. LoRA exploits this by constraining ΔW=BA\Delta W = BA where BRdout×rB \in \mathbb{R}^{d_{out} \times r}, ARr×dinA \in \mathbb{R}^{r \times d_{in}} - a rank-rr update that modifies only the dominant rr singular directions.

1.2 The Geometric Picture

The clearest geometric picture of the SVD comes from watching what AA does to the unit sphere Sn1={xRn:x=1}S^{n-1} = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| = 1\}. Apply AA to every point on the unit sphere. The result is an ellipsoid in Rm\mathbb{R}^m. This is not obvious - it is a theorem - but it follows directly from the SVD:

A(S^{n-1}) = U\Sigma V^\top(S^{n-1}) = U\Sigma(S^{n-1}) = U(\text{axis-aligned ellipsoid}) = \text{ellipsoid}

Step by step:

  1. VV^\top is orthogonal, so V(Sn1)=Sn1V^\top(S^{n-1}) = S^{n-1} (rigid transformation preserves the sphere).
  2. Σ\Sigma scales each axis by σi\sigma_i, turning the sphere into an axis-aligned ellipsoid with semi-axes σ1,σ2,\sigma_1, \sigma_2, \ldots.
  3. UU is orthogonal, so it rotates/reflects the ellipsoid without changing its shape.

The singular values σi\sigma_i are the semi-axis lengths of the output ellipsoid. The right singular vectors vi\mathbf{v}_i are the directions in the input space that map to the semi-axes of the ellipsoid. The left singular vectors ui\mathbf{u}_i are the directions of the ellipsoid's semi-axes in the output space.

GEOMETRIC ACTION OF A \in \mathbb{R}^(m\timesn)
========================================================================

  Input space \mathbb{R}^n               Output space \mathbb{R}^m

      v_2                            u_2
       up                             up
       |  Unit sphere                |
       |    *                        |   \sigma_2*u_2
       |  *   *       V^T          * | *
       | *     *  ---------->    *   |   *       A = U \Sigma V^T
       |  *   *                *     |     *
       |    *               ------------------- u_1 ->
  -----+--------> v_1       *         \sigma_1      *
       |                     *               *
       |                       *           *
       |                         * * * * *
       |                        (ellipse)
       |                  semi-axes = singular values

  Right singular vectors v_i    Left singular vectors u_i
  = natural input axes          = natural output axes

========================================================================

For a 2×22 \times 2 matrix with σ1>σ2>0\sigma_1 > \sigma_2 > 0: the unit circle maps to an ellipse with semi-axes σ1\sigma_1 and σ2\sigma_2. The direction v1\mathbf{v}_1 is the input direction that gets stretched the most (by σ1\sigma_1); v2\mathbf{v}_2 is the direction stretched least (by σ2\sigma_2). If σ2=0\sigma_2 = 0, the map collapses one direction to zero - the matrix is rank-1 and maps every vector onto a line.
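
A minimal numpy check of this picture (an illustrative sketch; the 2x2 matrix below is arbitrary): map the unit circle through A and compare the longest and shortest radii of the resulting ellipse with the singular values.

```python
import numpy as np

# An arbitrary full-rank 2x2 example matrix (assumption: any such matrix works here).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Sample the unit circle and push it through A.
theta = np.linspace(0.0, 2.0 * np.pi, 2000)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # shape (2, 2000)
image = A @ circle                                   # the ellipse A(S^1)

# The longest/shortest radii of the image should match sigma_1 and sigma_2.
radii = np.linalg.norm(image, axis=0)
sigma = np.linalg.svd(A, compute_uv=False)
print("max |Ax| on circle:", radii.max(), " sigma_1:", sigma[0])
print("min |Ax| on circle:", radii.min(), " sigma_2:", sigma[1])
```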

1.3 Why SVD Is the Most Important Decomposition in AI

The SVD is not just an abstract factorisation - it is the foundational tool for understanding, compressing, and controlling linear maps in neural networks. Here are the core connections:

Low-rank approximation (LoRA, DoRA, MLA). Every weight matrix in a neural network is a linear map. After training, weight update matrices ΔW\Delta W tend to have low effective rank - most of the "learning signal" lives in a few singular directions. LoRA exploits this: instead of fine-tuning all of WW, it learns ΔW=BA\Delta W = BA with rdr \ll d. The SVD is used to initialise BB and AA from the dominant singular subspace of ΔW\Delta W, or to analyse which singular directions are being adapted. DoRA (Weight-Decomposed Low-Rank Adaptation) further decomposes W=WcW/WcW = \|W\|_c \cdot W/\|W\|_c and adapts direction (via low-rank) and magnitude (via scalar) separately. DeepSeek's MLA uses low-rank projections of key-value matrices to compress the KV cache, reducing memory from O(ndkv)O(n \cdot d_{kv}) to O(nr)O(n \cdot r) with rdkvr \ll d_{kv}.

Spectral normalisation. The spectral norm of a weight matrix W2=σmax(W)\|W\|_2 = \sigma_{\max}(W) is the largest singular value. Spectral normalisation (Miyato et al. 2018) divides each weight matrix by its spectral norm, enforcing a Lipschitz-1 constraint on each layer. This stabilises GAN training. The spectral norm is computed via power iteration (equivalent to computing the top singular vector pair) and is O(mn)O(mn) per forward pass rather than O(mnmin(m,n))O(mn\min(m,n)) for the full SVD.

Pseudoinverse and least squares. Every linear regression problem minxAxb2\min_\mathbf{x} \|A\mathbf{x} - \mathbf{b}\|_2 has solution x^=A+b\hat{\mathbf{x}} = A^+ \mathbf{b} where A+=VΣ+UA^+ = V\Sigma^+U^\top is the Moore-Penrose pseudoinverse. The SVD provides the most numerically stable way to compute this. Ridge regression adds a λI\lambda I regulariser, which in SVD form becomes shrinkage: each singular component is shrunk by factor σi2/(σi2+λ)\sigma_i^2/(\sigma_i^2 + \lambda).

Matrix compression and efficiency. Images, attention matrices, embedding tables, and gradient matrices all have approximate low-rank structure. The Eckart-Young theorem guarantees the SVD gives the best rank-kk approximation. Randomised SVD (Halko et al. 2011) computes this in O(mnk)O(mnk) time rather than O(mnmin(m,n))O(mn\min(m,n)), making it practical for million-dimensional problems.

Understanding model geometry. The singular value spectrum of a weight matrix reveals its "information geometry": a flat spectrum (all σi\sigma_i equal) means the layer treats all directions equally; a steep spectrum means the layer has discovered dominant directions. WeightWatcher (Martin and Mahoney, 2019-2026) quantifies model quality by fitting power-law distributions to singular value spectra: well-trained models have σiiα\sigma_i \sim i^{-\alpha} with α[2,4]\alpha \in [2, 4]; undertrained models have flat spectra; overtrained models have spike/bulk structure.

1.4 SVD vs Eigendecomposition

Students often confuse SVD and eigendecomposition. Here is a precise comparison:

Property                    | Eigendecomposition A = V\Lambda V^{-1}   | SVD A = U\Sigma V^\top
Applicable to               | Square matrices only                      | Any m \times n matrix
Eigenvalues/singular values | Can be complex, can be negative           | Always real, always \geq 0
V is orthogonal?            | Only for symmetric A                      | Always
Same input/output basis?    | Yes (V used twice)                        | No (U for output, V for input)
Existence                   | Not always (defective over \mathbb{C})    | Always exists
For symmetric PSD A         | A = Q\Lambda Q^\top, \lambda_i \geq 0     | Coincides: U = V = Q, \sigma_i = \lambda_i
Computation cost            | O(n^3)                                    | O(mn\min(m,n))
Geometric meaning           | Fixed points of the map                   | Extreme stretching directions

The key insight: singular values and eigenvalues are generally different quantities. For a matrix AA, the eigenvalues of AA are not the same as the singular values of AA (unless AA is symmetric PSD). The singular values of AA are the square roots of the eigenvalues of AAA^\top A (which is always symmetric PSD). For example, the rotation matrix RθR_\theta has eigenvalues e±iθe^{\pm i\theta} (complex, magnitude 1) but singular values σ1=σ2=1\sigma_1 = \sigma_2 = 1 (all equal - a rotation is an isometry).
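
The rotation-matrix example is easy to verify numerically (a quick sketch; the angle is arbitrary): the eigenvalues come out as a complex pair of modulus 1, while both singular values equal 1.

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

eigvals = np.linalg.eigvals(R)                 # complex pair e^{+-i theta}
svals = np.linalg.svd(R, compute_uv=False)     # both equal to 1
print("eigenvalues:     ", eigvals)
print("|eigenvalues|:   ", np.abs(eigvals))
print("singular values: ", svals)
```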

1.5 Historical Timeline

Year     | Who                        | What
1873     | Beltrami, Jordan           | Independent discovery of SVD for bilinear forms; Beltrami published first, Jordan shortly after
1889     | Sylvester                  | Generalised to arbitrary rectangular real matrices
1907     | Schmidt                    | Infinite-dimensional version (integral operators); modern functional-analytic foundation
1936     | Eckart & Young             | Proved the best low-rank approximation theorem; connected SVD to statistics
1965     | Golub & Reinsch            | Practical stable algorithm: bidiagonalisation + implicit QR shift; still in use today
1976     | Lanczos bidiag.            | Krylov subspace method for large sparse matrices; precursor to ARPACK
1992     | Berry et al.               | Latent Semantic Analysis (LSA): SVD of term-document matrices for information retrieval
2009-10  | Candès, Recht; Cai et al.  | Matrix completion theory; nuclear norm minimisation as convex relaxation of low rank
2009     | Koren et al.               | Winning Netflix Prize solution used SVD-based matrix factorisation
2011     | Halko, Martinsson, Tropp   | Randomised SVD: near-linear-time algorithms for large matrices
2021-26  | LoRA era                   | Hu et al. (LoRA 2021), Liu et al. (DoRA 2024), DeepSeek (MLA 2024): low-rank structure in LLMs
2019-26  | Martin & Mahoney           | WeightWatcher: power-law singular value spectra as model quality metrics

2. Formal Definitions

2.1 The SVD Theorem

Theorem (Singular Value Decomposition). Let ARm×nA \in \mathbb{R}^{m \times n} with rank(A)=r\text{rank}(A) = r. Then there exist orthogonal matrices URm×mU \in \mathbb{R}^{m \times m}, VRn×nV \in \mathbb{R}^{n \times n}, and a matrix ΣRm×n\Sigma \in \mathbb{R}^{m \times n} of the form

\Sigma = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_{\min(m,n)} \end{pmatrix}_{m \times n}

with \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_{\min(m,n)}, such that

\boxed{A = U\Sigma V^\top}

The σi\sigma_i are called the singular values of AA. The columns u1,,um\mathbf{u}_1, \ldots, \mathbf{u}_m of UU are the left singular vectors. The columns v1,,vn\mathbf{v}_1, \ldots, \mathbf{v}_n of VV are the right singular vectors.

Proof sketch (constructive). Since AARn×nA^\top A \in \mathbb{R}^{n \times n} is symmetric positive semidefinite, by the Spectral Theorem it has an eigendecomposition AA=VΛVA^\top A = V\Lambda V^\top with Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n), λi0\lambda_i \geq 0. Define σi=λi0\sigma_i = \sqrt{\lambda_i} \geq 0. For each σi>0\sigma_i > 0, define ui=Avi/σi\mathbf{u}_i = A\mathbf{v}_i / \sigma_i. One can verify:

  • uiuj=(Avi/σi)(Avj/σj)=viAAvj/(σiσj)=λjvivj/(σiσj)=δij\mathbf{u}_i^\top \mathbf{u}_j = (A\mathbf{v}_i/\sigma_i)^\top(A\mathbf{v}_j/\sigma_j) = \mathbf{v}_i^\top A^\top A \mathbf{v}_j / (\sigma_i\sigma_j) = \lambda_j \mathbf{v}_i^\top\mathbf{v}_j/(\sigma_i\sigma_j) = \delta_{ij} (orthonormal).
  • Extend {u1,,ur}\{\mathbf{u}_1,\ldots,\mathbf{u}_r\} to an orthonormal basis of Rm\mathbb{R}^m to complete UU.
  • Then Avi=σiuiA\mathbf{v}_i = \sigma_i \mathbf{u}_i for iri \leq r and Avi=0A\mathbf{v}_i = \mathbf{0} for i>ri > r, which compactly gives A=UΣVA = U\Sigma V^\top. \square

Uniqueness. The singular values are unique (they are square roots of eigenvalues of AAA^\top A, which are unique). The singular vectors are unique up to:

  • Sign flips: (ui,vi)(ui,vi)(\mathbf{u}_i, \mathbf{v}_i) \to (-\mathbf{u}_i, -\mathbf{v}_i) leaves A=UΣVA = U\Sigma V^\top unchanged.
  • Rotation within repeated singular value blocks: if σi=σi+1==σi+k\sigma_i = \sigma_{i+1} = \cdots = \sigma_{i+k}, any orthogonal rotation within the corresponding subspaces is valid.

2.2 Singular Values

The singular values σ1σ2σmin(m,n)0\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0 carry all the magnitude information about AA. Key properties:

Relation to eigenvalues of AAA^\top A and AAAA^\top:

\sigma_i^2 = \lambda_i(A^\top A) = \lambda_i(AA^\top) \quad (i = 1, \ldots, r)

The nonzero eigenvalues of AAA^\top A and AAAA^\top are identical (even though the matrices have different sizes), and they equal the squared singular values of AA.

Geometric meaning: σ1=maxx=1Ax\sigma_1 = \max_{\|\mathbf{x}\|=1} \|A\mathbf{x}\| (the maximum stretching factor). More generally, σi\sigma_i is the maximum stretching factor among all directions orthogonal to v1,,vi1\mathbf{v}_1, \ldots, \mathbf{v}_{i-1}.

Rank: rank(A)=r\text{rank}(A) = r = number of positive singular values.

Examples:

  • Identity InI_n: all nn singular values equal 1.
  • Scaling matrix diag(3,1,0)\text{diag}(3, 1, 0): singular values are 3, 1, 0; rank = 2.
  • Rotation matrix RθR_\theta: all singular values equal 1 (isometry).
  • Projection matrix onto a kk-dimensional subspace: kk singular values equal 1, rest equal 0.
  • All-ones matrix 11Rn×n\mathbf{1}\mathbf{1}^\top \in \mathbb{R}^{n\times n}: σ1=n\sigma_1 = n, σ2==σn=0\sigma_2 = \cdots = \sigma_n = 0; rank = 1.

2.3 Left and Right Singular Vectors

The left and right singular vectors satisfy a pair of eigenvalue-like equations:

A\mathbf{v}_i = \sigma_i \mathbf{u}_i \qquad (i = 1, \ldots, r)

A^\top \mathbf{u}_i = \sigma_i \mathbf{v}_i \qquad (i = 1, \ldots, r)

A\mathbf{v}_i = \mathbf{0} \quad (i > r), \qquad A^\top \mathbf{u}_i = \mathbf{0} \quad (i > r)

Combining: AAvi=σi2viA^\top A \mathbf{v}_i = \sigma_i^2 \mathbf{v}_i and AAui=σi2uiAA^\top \mathbf{u}_i = \sigma_i^2 \mathbf{u}_i. So vi\mathbf{v}_i are eigenvectors of AAA^\top A and ui\mathbf{u}_i are eigenvectors of AAAA^\top, both with eigenvalue σi2\sigma_i^2.

The relationships can be rewritten as a single symmetric eigenproblem by introducing the augmented matrix:

\begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix} \begin{pmatrix} \mathbf{u}_i \\ \mathbf{v}_i \end{pmatrix} = \sigma_i \begin{pmatrix} \mathbf{u}_i \\ \mathbf{v}_i \end{pmatrix}

This (m+n) \times (m+n) symmetric matrix has eigenvalues \pm\sigma_1, \pm\sigma_2, \ldots, \pm\sigma_r, 0, \ldots, 0. Some SVD algorithms exploit this formulation.
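
The claim about the augmented matrix is easy to check numerically; here is a small sketch with a random rectangular matrix (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))

# Build the (m+n) x (m+n) symmetric augmented matrix [[0, A], [A^T, 0]].
aug = np.block([[np.zeros((m, m)), A],
                [A.T, np.zeros((n, n))]])

eig = np.sort(np.linalg.eigvalsh(aug))
svals = np.linalg.svd(A, compute_uv=False)
expected = np.sort(np.concatenate([svals, -svals, np.zeros(m + n - 2 * len(svals))]))
print("eigenvalues of augmented matrix:", np.round(eig, 6))
print("+/- singular values of A:      ", np.round(expected, 6))
```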

2.4 Full, Thin, and Truncated SVD

For an m×nm \times n matrix with m>nm > n and rank rnr \leq n, there are three common variants:

FULL SVD vs THIN SVD vs TRUNCATED SVD  (m > n, rank r)
================================================================

FULL:
  A    =    U      \times     \Sigma      \times    V^T
(m\timesn)    (m\timesm)       (m\timesn)       (n\timesn)
         [all m      [n nonzero   [all n
          columns]    rows]        columns]

THIN (economy):
  A    =    U      \times    \Sigma      \times    V^T
(m\timesn)    (m\timesn)       (n\timesn)       (n\timesn)
         [first n   [square      [all n
          columns]   diagonal]    columns]

TRUNCATED (rank-k):
  A_k  =   U_k    \times    \Sigma_k    \times   V_k^T
(m\timesn)    (m\timesk)       (k\timesk)       (k\timesn)
         [first k   [top-k       [first k
          columns]   diagonal]    columns]

================================================================

Full SVD is the complete factorisation; UU and VV are square orthogonal matrices. Storage: O(m2+mn+n2)O(m^2 + mn + n^2).

Thin SVD (also called economy or compact SVD): keep only the first nn columns of UU (for m>nm > n). Since Σ\Sigma has zeros in rows n+1n+1 through mm, these columns of UU do not contribute. Storage: O(mn+n2)O(mn + n^2).

Truncated SVD (also called low-rank SVD): keep only the top kk terms (krk \leq r). This gives the best rank-kk approximation AkA_k (Eckart-Young theorem, Section 4.2). Storage: O((m+n)k)O((m+n)k). This is what LoRA exploits.

For AI: In practice, np.linalg.svd(A, full_matrices=False) computes the thin SVD. scipy.sparse.linalg.svds(A, k=k) and sklearn.utils.extmath.randomized_svd(A, n_components=k) compute truncated SVDs efficiently.
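
The three variants differ only in which columns are kept; a short numpy sketch comparing their shapes and the truncation error (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 100, 30, 5
A = rng.standard_normal((m, n))

# Full SVD: U is m x m (many columns multiply only zero rows of Sigma).
U_full, s, Vt = np.linalg.svd(A, full_matrices=True)
print("full:", U_full.shape, s.shape, Vt.shape)      # (100, 100) (30,) (30, 30)

# Thin (economy) SVD: U is m x n.
U_thin, s, Vt = np.linalg.svd(A, full_matrices=False)
print("thin:", U_thin.shape, s.shape, Vt.shape)      # (100, 30) (30,) (30, 30)

# Truncated SVD: keep only the top-k triplets.
A_k = U_thin[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("rank-%d truncation error (should equal sigma_{k+1} = %.4f): %.4f"
      % (k, s[k], np.linalg.norm(A - A_k, 2)))
```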

2.5 The Four Fundamental Subspaces

The SVD provides the cleanest way to identify the four fundamental subspaces of AA (Strang's framework):

Subspace             | Definition                                   | From SVD                                       | Dimension
Column space (range) | \text{col}(A) = \{A\mathbf{x}\}              | span\{\mathbf{u}_1,\ldots,\mathbf{u}_r\}       | r
Left null space      | \{\mathbf{y}: A^\top\mathbf{y}=\mathbf{0}\}  | span\{\mathbf{u}_{r+1},\ldots,\mathbf{u}_m\}   | m - r
Row space            | \text{col}(A^\top)                           | span\{\mathbf{v}_1,\ldots,\mathbf{v}_r\}       | r
Null space (kernel)  | \{\mathbf{x}: A\mathbf{x}=\mathbf{0}\}       | span\{\mathbf{v}_{r+1},\ldots,\mathbf{v}_n\}   | n - r

The column space is spanned by the first rr left singular vectors; the null space by the last nrn-r right singular vectors. These pairs are orthogonal complements: column space \perp left null space in Rm\mathbb{R}^m; row space \perp null space in Rn\mathbb{R}^n.

For AI: In a linear layer WRdout×dinW \in \mathbb{R}^{d_{out} \times d_{in}}:

  • The row space is the subspace of inputs that "activate" the layer (non-null inputs).
  • The null space is the subspace of inputs that are completely ignored.
  • The column space is the set of reachable outputs.
  • LoRA constrains ΔW\Delta W to have column space within a rank-rr subspace.
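
A sketch that reads the four subspaces off the full SVD of a deliberately rank-deficient matrix (the example matrix is arbitrary):

```python
import numpy as np

# A rank-2 matrix (third row is the sum of the first two).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])
m, n = A.shape

U, s, Vt = np.linalg.svd(A)                    # full SVD
tol = np.finfo(float).eps * max(m, n) * s[0]   # numerical-rank threshold
r = int(np.sum(s > tol))
print("numerical rank:", r)

col_space  = U[:, :r]      # basis of the column space
left_null  = U[:, r:]      # basis of the left null space
row_space  = Vt[:r, :].T   # basis of the row space
null_space = Vt[r:, :].T   # basis of the null space

# A maps the null space to zero, and A^T kills the left null space.
print("||A  @ null_space|| =", np.linalg.norm(A @ null_space))
print("||A.T @ left_null|| =", np.linalg.norm(A.T @ left_null))
# Orthogonal complements: row space is perpendicular to null space.
print("row_space.T @ null_space =", np.round(row_space.T @ null_space, 12))
```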

3. Computing the SVD

3.1 Connection to Eigendecomposition

The most conceptually direct way to compute the SVD exploits the relationship AA=VΣΣVA^\top A = V\Sigma^\top\Sigma V^\top and AA=UΣΣUAA^\top = U\Sigma\Sigma^\top U^\top:

Algorithm (eigendecomposition-based SVD):

  1. Form M=AARn×nM = A^\top A \in \mathbb{R}^{n \times n} (symmetric PSD).
  2. Compute eigendecomposition M=VΛVM = V\Lambda V^\top with Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1,\ldots,\lambda_n), λi0\lambda_i \geq 0.
  3. Set σi=λi\sigma_i = \sqrt{\lambda_i} and discard near-zero singular values (numerical rank).
  4. For each σi>0\sigma_i > 0: ui=Avi/σi\mathbf{u}_i = A\mathbf{v}_i / \sigma_i.
  5. Extend {ui}\{\mathbf{u}_i\} to an orthonormal basis of Rm\mathbb{R}^m (e.g., by Gram-Schmidt or QR).

This works, but it has a critical flaw: forming AAA^\top A squares the condition number. If κ(A)=σ1/σr\kappa(A) = \sigma_1/\sigma_r, then κ(AA)=κ(A)2\kappa(A^\top A) = \kappa(A)^2. For an ill-conditioned matrix (say κ(A)=107\kappa(A) = 10^7), forming AAA^\top A destroys the last 7 decimal digits of information in double precision. The Golub-Reinsch algorithm avoids this.

For small, well-conditioned matrices this approach is perfectly fine and is what np.linalg.eigh(A.T @ A) computes. For production SVD, use np.linalg.svd which implements Golub-Reinsch.
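
A minimal implementation of this construction (a sketch, fine for small and well-conditioned matrices; not a replacement for np.linalg.svd):

```python
import numpy as np

def svd_via_eig(A, tol=1e-12):
    """Thin SVD built from the eigendecomposition of A^T A.

    Only suitable for small, well-conditioned matrices: forming A^T A
    squares the condition number.
    """
    m, n = A.shape
    lam, V = np.linalg.eigh(A.T @ A)          # ascending eigenvalues
    order = np.argsort(lam)[::-1]             # sort descending
    lam, V = lam[order], V[:, order]
    s = np.sqrt(np.clip(lam, 0.0, None))      # singular values
    r = int(np.sum(s > tol * s[0]))           # numerical rank
    U = (A @ V[:, :r]) / s[:r]                # u_i = A v_i / sigma_i
    return U, s[:r], V[:, :r].T

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = svd_via_eig(A)
print("singular values:", np.round(s, 6))
print("reference      :", np.round(np.linalg.svd(A, compute_uv=False), 6))
print("reconstruction error:", np.linalg.norm(A - U @ np.diag(s) @ Vt))
```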

3.2 Golub-Reinsch Algorithm

The Golub-Reinsch algorithm (1965) is the standard method implemented in LAPACK and numpy. It proceeds in two phases:

Phase 1: Bidiagonalisation. Using Householder reflectors, reduce AA to upper bidiagonal form:

A = U_1 B V_1^\top

where BRn×nB \in \mathbb{R}^{n \times n} is upper bidiagonal (nonzeros only on main diagonal and superdiagonal) and U1,V1U_1, V_1 are products of Householder matrices. This costs O(mn2n3/3)O(mn^2 - n^3/3) operations.

Phase 2: Diagonalisation of BB. Apply implicit QR iterations with Wilkinson shifts to BB, converging to diagonal form B=U2ΣV2B = U_2 \Sigma V_2^\top. Combining: A=(U1U2)Σ(V1V2)=UΣVA = (U_1 U_2) \Sigma (V_1 V_2)^\top = U\Sigma V^\top.

The key advantage: bidiagonalisation works directly on AA without forming AAA^\top A, so the condition number is not squared. The algorithm is backward stable with error O(ϵmachA)O(\epsilon_{\text{mach}} \|A\|).

Cost: O(mn2+n3)O(mn^2 + n^3) for mnm \geq n. This is too expensive for large sparse matrices.

3.3 Randomised SVD

For large matrices where only the top-kk singular values/vectors are needed, the randomised SVD (Halko, Martinsson, Tropp 2011) offers a dramatic speedup:

Algorithm:

  1. Sketch: Draw ΩRn×(k+p)\Omega \in \mathbb{R}^{n \times (k+p)} with i.i.d. Gaussian entries (pp = oversampling, typically 5-10).
  2. Range sketch: Compute Y=AΩRm×(k+p)Y = A\Omega \in \mathbb{R}^{m \times (k+p)}.
  3. Orthogonalise: QR decompose Y=QRY = QR; QRm×(k+p)Q \in \mathbb{R}^{m \times (k+p)} captures the range of AA.
  4. Project: Compute B=QAR(k+p)×nB = Q^\top A \in \mathbb{R}^{(k+p) \times n} (small matrix).
  5. SVD of small matrix: Compute B=U^ΣVB = \hat{U}\Sigma V^\top exactly.
  6. Recover: U=QU^U = Q\hat{U}. Return (U[:,:k],Σ[:k,:k],V[:,:k])(U[:, :k], \Sigma[:k,:k], V[:, :k]).

Cost: O(mnk)O(mnk) - linear in the matrix size! Compare to O(mnmin(m,n))O(mn\min(m,n)) for full SVD.

Error bound: With oversampling p5p \geq 5, the approximation error is:

\|A - A_k^{\text{rand}}\|_2 \leq \left(1 + 11\sqrt{(k+p)n}\right) \sigma_{k+1}(A)

with high probability. In practice the error is much closer to σk+1\sigma_{k+1}.

Power iteration improvement: Replace Y=AΩY = A\Omega with Y=(AA)qAΩY = (AA^\top)^q A\Omega for small qq (1-3 power iteration steps). This dramatically sharpens the approximation for matrices with slowly decaying singular values, at cost O(qmnk)O(qmnk).

For AI: sklearn.utils.extmath.randomized_svd(A, n_components=k, n_iter=2) implements this. It's what TruncatedSVD uses for large sparse document matrices. PyTorch's torch.svd_lowrank uses randomised SVD for LoRA-related computations.
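
A compact sketch of the algorithm above, including the optional power iterations (parameter names and the test matrix are illustrative, not a production implementation):

```python
import numpy as np

def randomized_svd(A, k, p=10, q=2, seed=0):
    """Approximate top-k SVD via random sketching (Halko et al.-style sketch).

    p: oversampling, q: number of power iterations.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))        # 1. random test matrix
    Y = A @ Omega                                   # 2. range sketch
    for _ in range(q):                              #    power iterations sharpen the sketch
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                          # 3. orthonormal basis for the range
    B = Q.T @ A                                     # 4. project to a small matrix
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)  # 5. small exact SVD
    U = Q @ U_hat                                   # 6. lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

# Demo on a matrix with a rapidly decaying spectrum.
rng = np.random.default_rng(3)
m, n, k = 500, 300, 10
A = (rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
     + 1e-3 * rng.standard_normal((m, n)))
U, s, Vt = randomized_svd(A, k)
print("approx top singular values:", np.round(s[:5], 4))
print("exact  top singular values:", np.round(np.linalg.svd(A, compute_uv=False)[:5], 4))
```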

3.4 Lanczos Bidiagonalisation

For very large sparse matrices (e.g., AR106×106A \in \mathbb{R}^{10^6 \times 10^6} with O(n)O(n) nonzeros), Lanczos bidiagonalisation builds a Krylov subspace for the SVD problem:

Algorithm: Starting from a unit vector v1\mathbf{v}_1, alternate between multiplying by AA and AA^\top:

\beta_{j+1}\mathbf{u}_{j+1} = A\mathbf{v}_j - \alpha_j\mathbf{u}_j, \qquad \gamma_{j+1}\mathbf{v}_{j+1} = A^\top\mathbf{u}_{j+1} - \beta_{j+1}\mathbf{v}_j

After kk steps, this builds a bidiagonal matrix BkB_k whose singular values approximate the largest (and sometimes smallest) singular values of AA. Only matrix-vector products AxA\mathbf{x} and AyA^\top\mathbf{y} are needed - no explicit matrix storage.

This is implemented in scipy.sparse.linalg.svds and is what makes SVD feasible for web-scale matrices (e.g., the Netflix prize data had 108\sim 10^8 entries).

3.5 Numerical Considerations

Do not form AAA^\top A for ill-conditioned problems. As noted in Section 3.1, this squares the condition number. Use np.linalg.svd(A) directly, not np.linalg.eigh(A.T @ A).

Numerical rank. In finite precision, all singular values are slightly nonzero even if the true rank is less than min(m,n)\min(m,n). The numerical rank is the number of singular values greater than a threshold τ=ϵmachσ1max(m,n)\tau = \epsilon_{\text{mach}} \cdot \sigma_1 \cdot \max(m, n) (where ϵmach2.2×1016\epsilon_{\text{mach}} \approx 2.2 \times 10^{-16} for float64). np.linalg.matrix_rank(A) uses exactly this threshold.

One-sided Jacobi SVD achieves machine accuracy even for matrices with a condition number of 101510^{15} by working directly on the columns of AA without any intermediate matrix products. Used in high-accuracy requirements (e.g., geodesy, orbit determination).

Mixed precision. In training, gradients are often computed in float16/bfloat16 but SVD is performed in float32 for stability. This matters for gradient-based LoRA rank selection.

NUMERICAL SVD PITFALLS
========================================================================

  A has condition kappa:    A^T A has condition kappa^2

  kappa(A) = 1e7  =>  kappa(A^T A) = 1e14   (only 2 correct digits
                                               in last eigenvalue!)

  SAFE:   U, s, Vt = np.linalg.svd(A)         <- O(mn min(m,n))
  RISKY:  eigs = np.linalg.eigh(A.T @ A)      <- avoidable precision loss

========================================================================

4. Properties of the SVD

4.1 Norms via Singular Values

The three most important matrix norms all have elegant expressions in terms of singular values:

Norm                    | Definition                   | SVD Expression            | Use in AI
Spectral norm \|A\|_2   | \max_{\|x\|=1}\|Ax\|         | \sigma_1 (largest SV)     | Lipschitz constant; GAN stability
Frobenius norm \|A\|_F  | \sqrt{\sum_{i,j}A_{ij}^2}    | \sqrt{\sum_i \sigma_i^2}  | Weight decay regularisation
Nuclear norm \|A\|_*    | \sum_i \sigma_i              | \sum_i \sigma_i           | Low-rank regularisation; matrix completion

Spectral norm A2=σ1\|A\|_2 = \sigma_1: This is the operator norm - the maximum factor by which AA can amplify a vector. For a neural network layer WW, W2=σ1(W)\|W\|_2 = \sigma_1(W) is the Lipschitz constant. Spectral normalisation WW/σ1(W)W \to W/\sigma_1(W) enforces W2=1\|W\|_2 = 1.

Frobenius norm AF=iσi2\|A\|_F = \sqrt{\sum_i \sigma_i^2}: This is the matrix version of the Euclidean norm. Note AF2=tr(AA)=iσi2\|A\|_F^2 = \text{tr}(A^\top A) = \sum_i \sigma_i^2. L2 weight decay minimises WF2\|W\|_F^2, which in SVD terms means penalising all singular values equally.

Nuclear norm A=iσi\|A\|_* = \sum_i \sigma_i: This is the convex envelope (tightest convex relaxation) of the rank function. Minimising A\|A\|_* promotes low-rank solutions. Nuclear norm regularisation is used in matrix completion, recommender systems, and multi-task learning.

Norm inequalities: For any ARm×nA \in \mathbb{R}^{m \times n}:

\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\,\|A\|_2 \leq \sqrt{\min(m,n)}\,\|A\|_2, \qquad \|A\|_2 \leq \|A\|_* \leq \sqrt{r}\,\|A\|_F
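
All three norms can be read off a single call to np.linalg.svd; a quick consistency check on a random matrix (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 20))
s = np.linalg.svd(A, compute_uv=False)

spectral  = s[0]                     # ||A||_2
frobenius = np.sqrt(np.sum(s**2))    # ||A||_F
nuclear   = np.sum(s)                # ||A||_*

print("spectral :", spectral,  " vs np.linalg.norm(A, 2)    :", np.linalg.norm(A, 2))
print("frobenius:", frobenius, " vs np.linalg.norm(A, 'fro'):", np.linalg.norm(A, 'fro'))
print("nuclear  :", nuclear,   " vs np.linalg.norm(A, 'nuc'):", np.linalg.norm(A, 'nuc'))

# Norm inequalities from the text.
r = len(s)
assert spectral <= frobenius <= np.sqrt(r) * spectral
assert spectral <= nuclear <= np.sqrt(r) * frobenius
```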

4.2 The Eckart-Young Theorem

This is one of the most important theorems in applied mathematics - it justifies every low-rank approximation method in machine learning.

Theorem (Eckart-Young, 1936). Let A=UΣVA = U\Sigma V^\top be the SVD with singular values σ1σr>0\sigma_1 \geq \cdots \geq \sigma_r > 0. Define the rank-kk truncation:

A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top = U_k \Sigma_k V_k^\top

Then for both the spectral norm and the Frobenius norm:

A_k = \arg\min_{\text{rank}(B) \leq k} \|A - B\|_2 = \arg\min_{\text{rank}(B) \leq k} \|A - B\|_F

with optimal errors:

\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}

Proof sketch (spectral norm). For any rank-kk matrix BB, there exists a unit vector x\mathbf{x} in the (k+1)(k+1)-dimensional span of {v1,,vk+1}\{\mathbf{v}_1,\ldots,\mathbf{v}_{k+1}\} that lies in the null space of BB (by dimension counting: k+1>k=ndim(null(B))k+1 > k = n - \dim(\text{null}(B))). For this x\mathbf{x}:

\|A - B\|_2 \geq \|(A-B)\mathbf{x}\|_2 = \|A\mathbf{x}\|_2 \geq \sigma_{k+1}

since x\mathbf{x} is in the span of the top-k+1k+1 right singular vectors. The bound is achieved by AkA_k since (AAk)vi=0(A - A_k)\mathbf{v}_i = \mathbf{0} for iki \leq k and =σiui= \sigma_i\mathbf{u}_i for i>ki > k. \square
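
A short numerical check of both error formulas (a sketch; the spectrum below is chosen arbitrarily and embedded via random orthogonal factors):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 30
# Build a matrix with a known spectrum via random orthogonal factors.
sigma = np.array([10.0, 5.0, 3.0, 1.0, 0.1] + [0.01] * (n - 5))
U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # 40 x 30, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # 30 x 30, orthogonal
A = U @ np.diag(sigma) @ V.T

for k in (1, 2, 3, 4):
    A_k = U[:, :k] @ np.diag(sigma[:k]) @ V[:, :k].T      # rank-k truncation
    err2 = np.linalg.norm(A - A_k, 2)                      # spectral norm error
    errF = np.linalg.norm(A - A_k, 'fro')                  # Frobenius norm error
    print(f"k={k}: ||A-A_k||_2 = {err2:.4f} (sigma_{k+1} = {sigma[k]:.4f}),"
          f" ||A-A_k||_F = {errF:.4f} (tail = {np.sqrt(np.sum(sigma[k:]**2)):.4f})")
```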

Practical significance: The Eckart-Young theorem is the mathematical foundation of:

  • PCA: The rank-kk SVD of the centred data matrix gives the best rank-kk reconstruction (Section 6.1).
  • LoRA: The low-rank decomposition ΔW=BA\Delta W = BA approximates the full gradient update ΔW\Delta W with minimum Frobenius error.
  • Image compression: A rank-kk image approximation discards all but the top kk singular components; Eckart-Young guarantees this is optimal.
  • Recommender systems: Matrix factorisation finds the best low-rank model of the user-item rating matrix.

4.3 The Moore-Penrose Pseudoinverse

The Moore-Penrose pseudoinverse A+Rn×mA^+ \in \mathbb{R}^{n \times m} is the unique matrix satisfying the four Moore-Penrose conditions:

  1. AA+A=AAA^+A = A
  2. A+AA+=A+A^+AA^+ = A^+
  3. (AA+)=AA+(AA^+)^\top = AA^+
  4. (A+A)=A+A(A^+A)^\top = A^+A

The SVD gives the explicit formula:

\boxed{A^+ = V\Sigma^+U^\top}

where Σ+Rn×m\Sigma^+ \in \mathbb{R}^{n \times m} replaces each nonzero σi\sigma_i with 1/σi1/\sigma_i and leaves zeros as zeros.

Geometric meaning: AA+AA^+ is the orthogonal projection onto the column space of AA; A+AA^+A is the orthogonal projection onto the row space of AA.

Solving linear systems:

  • Overdetermined (m>nm > n, full column rank): The least-squares solution x^=argminAxb2\hat{\mathbf{x}} = \arg\min\|A\mathbf{x}-\mathbf{b}\|_2 is x^=A+b=(AA)1Ab\hat{\mathbf{x}} = A^+\mathbf{b} = (A^\top A)^{-1}A^\top\mathbf{b} (normal equations). The minimum residual is (IAA+)b2\|(I - AA^+)\mathbf{b}\|_2.
  • Underdetermined (m<nm < n, full row rank): The minimum-norm solution x^=argminx2\hat{\mathbf{x}} = \arg\min\|\mathbf{x}\|_2 s.t. Ax=bA\mathbf{x}=\mathbf{b} is x^=A+b=A(AA)1b\hat{\mathbf{x}} = A^+\mathbf{b} = A^\top(AA^\top)^{-1}\mathbf{b}.
  • Rank-deficient: A+bA^+\mathbf{b} gives the minimum-norm least-squares solution.

Ridge regression / Tikhonov regularisation: Adding λx22\lambda\|\mathbf{x}\|_2^2 to the objective gives:

\hat{\mathbf{x}}_\lambda = (A^\top A + \lambda I)^{-1}A^\top\mathbf{b} = V(\Sigma^\top\Sigma + \lambda I)^{-1}\Sigma^\top U^\top \mathbf{b} = \sum_{i=1}^r \frac{\sigma_i}{\sigma_i^2 + \lambda} (\mathbf{u}_i^\top \mathbf{b}) \mathbf{v}_i

Each singular component is attenuated by factor σi2/(σi2+λ)\sigma_i^2/(\sigma_i^2 + \lambda) - components with σiλ\sigma_i \gg \sqrt{\lambda} are kept, components with σiλ\sigma_i \ll \sqrt{\lambda} are suppressed. This is spectral shrinkage or Tikhonov regularisation.
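
A sketch of both formulas, the SVD pseudoinverse and the ridge-shrunk solution (problem sizes and lambda are arbitrary; this illustrates the formulas rather than replacing np.linalg.pinv or lstsq):

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse via the thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol * s[0], 1.0 / s, 0.0)   # invert only nonzero sigma_i
    return Vt.T @ np.diag(s_inv) @ U.T

def ridge_svd(A, b, lam):
    """Ridge solution via spectral shrinkage sigma_i / (sigma_i^2 + lambda)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    shrink = s / (s**2 + lam)
    return Vt.T @ (shrink * (U.T @ b))

rng = np.random.default_rng(6)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

x_ls = pinv_svd(A) @ b
print("vs np.linalg.pinv:", np.linalg.norm(x_ls - np.linalg.pinv(A) @ b))
print("vs np.linalg.lstsq:", np.linalg.norm(x_ls - np.linalg.lstsq(A, b, rcond=None)[0]))
print("ridge (lam=1) norm:", np.linalg.norm(ridge_svd(A, b, 1.0)), "< LS norm:", np.linalg.norm(x_ls))
```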

4.4 SVD and Rank

The SVD provides the most numerically stable definition of rank:

Exact rank: rank(A)\text{rank}(A) = number of positive singular values.

Numerical rank: In finite precision, use rankτ(A)={i:σi>τ}\text{rank}_\tau(A) = |\{i : \sigma_i > \tau\}| where typically τ=ϵmachσ1max(m,n)\tau = \epsilon_{\text{mach}} \cdot \sigma_1 \cdot \max(m,n).

Stable rank: The stable rank sr(A)=AF2/A22=iσi2/σ12\text{sr}(A) = \|A\|_F^2 / \|A\|_2^2 = \sum_i \sigma_i^2 / \sigma_1^2 is a smooth, noise-robust proxy for rank. It lies in [1,r][1, r] and equals 1 iff AA is rank-1. For LoRA analysis, if sr(ΔW)min(m,n)\text{sr}(\Delta W) \ll \min(m,n), the update is intrinsically low-rank and can be well-approximated with small rr.

Effective rank (Roy & Vetterli): er(A)=exp(H(p))\text{er}(A) = \exp(H(p)) where pi=σi2/AF2p_i = \sigma_i^2/\|A\|_F^2 and H(p)=pilogpiH(p) = -\sum p_i \log p_i is the entropy of the normalised squared singular values. This is a smooth measure that is maximised (= rr) for uniform singular value spectra and minimised (= 1) for rank-1 matrices.
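
Both quantities are one-liners given the singular values. A sketch using a noisy low-rank test matrix (purely illustrative; the effective rank follows the definition given in the text):

```python
import numpy as np

def stable_rank(s):
    """||A||_F^2 / ||A||_2^2 from the singular values."""
    return np.sum(s**2) / s[0]**2

def effective_rank(s):
    """exp(entropy of the normalised squared singular values), as defined above."""
    p = s**2 / np.sum(s**2)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(7)
d, r = 256, 4
low_rank = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
noisy = low_rank + 0.05 * rng.standard_normal((d, d))

for name, M in [("exact rank-4", low_rank), ("rank-4 + noise", noisy)]:
    s = np.linalg.svd(M, compute_uv=False)
    print(f"{name:15s} stable rank = {stable_rank(s):6.2f}  effective rank = {effective_rank(s):6.2f}")
```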

4.5 Condition Number

The condition number of a matrix AA with respect to the 2-norm is:

\kappa(A) = \|A\|_2 \cdot \|A^+\|_2 = \frac{\sigma_1}{\sigma_r}

(where σr\sigma_r is the smallest positive singular value). For a square invertible matrix, κ(A)=σ1/σn\kappa(A) = \sigma_1/\sigma_n.

Meaning: If Ax=bA\mathbf{x} = \mathbf{b} and we perturb b\mathbf{b} by δb\delta\mathbf{b}, the relative perturbation in x\mathbf{x} satisfies:

\frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A) \cdot \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|}

A well-conditioned matrix has κ(A)1\kappa(A) \approx 1; a nearly singular matrix has κ(A)1\kappa(A) \gg 1.

For AI: The condition number of the Gram matrix XXX^\top X (or the Hessian H=2LH = \nabla^2\mathcal{L}) governs gradient descent convergence: κ(H)=λmax/λmin=σmax2/σmin2\kappa(H) = \lambda_{\max}/\lambda_{\min} = \sigma_{\max}^2/\sigma_{\min}^2 for symmetric PSD matrices. Feature normalisation (standardisation, batch norm, layer norm) reduces κ(XX)\kappa(X^\top X), accelerating convergence.

Preconditioning: For linear systems A\mathbf{x} = \mathbf{b}, a preconditioner P \approx A transforms the system to P^{-1}A\mathbf{x} = P^{-1}\mathbf{b} with smaller \kappa(P^{-1}A). K-FAC (Kronecker-Factored Approximate Curvature) is a natural gradient method that preconditions the gradient with a Kronecker-factored approximation to the Fisher information matrix - roughly equivalent to using a block-diagonal approximation to the Hessian.

4.6 The Procrustes Problem

The orthogonal Procrustes problem asks: given matrices A,BRm×nA, B \in \mathbb{R}^{m \times n}, find the orthogonal matrix RRm×mR \in \mathbb{R}^{m \times m} (RR=IR^\top R = I) minimising RABF\|RA - B\|_F. The solution is:

Solution: Compute BA=UΣVBA^\top = U\Sigma V^\top (SVD). Then R^=UV\hat{R} = UV^\top.

Proof: RABF2=RAF22RA,BF+BF2\|RA - B\|_F^2 = \|RA\|_F^2 - 2\langle RA, B\rangle_F + \|B\|_F^2. Since RR is orthogonal, RAF=AF\|RA\|_F = \|A\|_F is constant. Minimising over RR means maximising tr(RBA)=tr(RUΣV)\text{tr}(R^\top B A^\top) = \text{tr}(R^\top U\Sigma V^\top). Setting W=VRUW = V^\top R^\top U, this is tr(WΣ)=iWiiσiiσi\text{tr}(W\Sigma) = \sum_i W_{ii}\sigma_i \leq \sum_i\sigma_i with equality when W=IW = I, i.e., R=UVR = UV^\top. \square
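
The closed-form solution is two lines of numpy. A sketch that recovers a known orthogonal map from noisy correspondences (all quantities below are synthetic):

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimising ||R A - B||_F, via the SVD of B A^T."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt

rng = np.random.default_rng(8)
m, n = 3, 50
A = rng.standard_normal((m, n))

# Ground-truth orthogonal map (QR of a random matrix) plus a little noise.
R_true, _ = np.linalg.qr(rng.standard_normal((m, m)))
B = R_true @ A + 0.01 * rng.standard_normal((m, n))

R_hat = procrustes_rotation(A, B)
print("||R_hat - R_true||_F =", np.linalg.norm(R_hat - R_true))
print("||R_hat A - B||_F    =", np.linalg.norm(R_hat @ A - B))
print("orthogonality check  =", np.linalg.norm(R_hat.T @ R_hat - np.eye(m)))
```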

For AI:

  • Shape alignment: Procrustes is used in protein structure alignment, morphometric analysis.
  • Domain adaptation: Aligning word vector spaces across languages (cross-lingual transfer) uses Procrustes.
  • RetNet / linear attention: Some position encoding methods use orthogonal transformations computed via Procrustes.
  • Gradient alignment: The Procrustes problem appears in task arithmetic - finding the optimal rotation to align task vectors.

5. Outer Product Form and Rank-1 Decomposition

5.1 Rank-1 Outer Product Decomposition

The SVD has a beautiful outer product representation: every matrix is a weighted sum of rank-1 matrices:

A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

Each term σiuivi\sigma_i \mathbf{u}_i \mathbf{v}_i^\top is a rank-1 matrix (outer product of two vectors). The singular values σi\sigma_i act as importance weights: the first term contributes the most to AA, the last the least. The truncated sum Ak=i=1kσiuiviA_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top is the best rank-kk approximation (Eckart-Young).

RANK-1 DECOMPOSITION
========================================================================

  A  =  \sigma_1 * u_1v_1^T  +  \sigma_2 * u_2v_2^T  +  ...  +  \sigma_r * u_rv_r^T
        -------------   -------------          -------------
         rank-1 term      rank-1 term            rank-1 term
         (dominant)       (less important)       (least important)

  Truncating at rank k: discard \sigma_{k+1},...,\sigma_r terms
  Error: ||A - A_k||_F = \sqrt(\sigma_{k+1}^2 + ... + \sigma_r^2)

========================================================================

For AI - matrix as a sum of "features": Each rank-1 term σiuivi\sigma_i \mathbf{u}_i\mathbf{v}_i^\top is an "atom" - a pattern in the output (ui\mathbf{u}_i) correlated with a pattern in the input (vi\mathbf{v}_i), with strength σi\sigma_i. In a trained weight matrix WW:

  • The top singular direction (σ1,u1,v1\sigma_1, \mathbf{u}_1, \mathbf{v}_1) is the most amplified input-output feature pair.
  • Lower singular directions correspond to less important, often noisier associations.
  • LoRA's ΔW=BA\Delta W = BA with BRd×rB \in \mathbb{R}^{d\times r}, ARr×kA \in \mathbb{R}^{r\times k} is equivalent to adding rr rank-1 terms to WW.

5.2 Incremental and Streaming SVD

When data arrives in a stream and the SVD must be updated without recomputing from scratch, incremental SVD methods are used. Brand's method (2002) updates the SVD A = U\Sigma V^\top of the current matrix to the SVD of the column-appended matrix [A \mid \mathbf{c}] in O(rn) time (rather than recomputing from scratch in O(mn^2)).

Algorithm sketch (rank-1 update):

  1. Project c\mathbf{c} onto current bases: p=Uc\mathbf{p} = U^\top\mathbf{c}, q=cUp\mathbf{q} = \mathbf{c} - U\mathbf{p}.
  2. Form the small (r+1) \times (r+1) augmented ("broken arrowhead") matrix from \Sigma, \mathbf{p}, and \|\mathbf{q}\|.
  3. Compute small SVD and update.

This is used in online recommender systems (user ratings arrive one at a time), PCA on streaming data, and online learning scenarios.

5.3 Symmetric Matrices: SVD Equals Eigendecomposition

For symmetric matrices, the SVD and eigendecomposition coincide (with a sign adjustment for indefinite matrices):

Case 1: Symmetric PSD (A=AA = A^\top, all λi0\lambda_i \geq 0). The spectral theorem gives A=QΛQA = Q\Lambda Q^\top. This is already an SVD with U=V=QU = V = Q and Σ=Λ\Sigma = \Lambda. Singular values equal eigenvalues.

Case 2: Symmetric but indefinite (A=AA = A^\top, mixed signs). Say A=QΛQA = Q\Lambda Q^\top with some λi<0\lambda_i < 0. Then λi|\lambda_i| are the singular values. The sign of λi\lambda_i is absorbed into UU: if λi<0\lambda_i < 0, then ui=qi\mathbf{u}_i = -\mathbf{q}_i (flip the sign of the left singular vector). So UVU \neq V for indefinite symmetric matrices.

Consequence: For a symmetric matrix, σi=λi\sigma_i = |\lambda_i|. The singular values are the absolute values of the eigenvalues. If you compute np.linalg.eigvalsh(A) and np.linalg.svd(A, compute_uv=False), the latter will equal the absolute values of the former, sorted in descending order.
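
A two-line check of this consequence on a random symmetric indefinite matrix (sketch):

```python
import numpy as np

rng = np.random.default_rng(9)
S = rng.standard_normal((5, 5))
A = (S + S.T) / 2                      # symmetric, generically indefinite

eig = np.linalg.eigvalsh(A)            # real eigenvalues, some negative
sv = np.linalg.svd(A, compute_uv=False)
print("eigenvalues:           ", np.round(np.sort(eig), 4))
print("|eigenvalues| (sorted):", np.round(np.sort(np.abs(eig))[::-1], 4))
print("singular values:       ", np.round(sv, 4))
```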


6. SVD in Machine Learning and AI

6.1 Low-Rank Approximation and LoRA

LoRA (Low-Rank Adaptation, Hu et al. 2021) is the most widely used parameter-efficient fine-tuning method for large language models. Its mathematical foundation is the Eckart-Young theorem.

The key observation: when fine-tuning a pre-trained model on a new task, the weight updates ΔW=WfineWpre\Delta W = W_{\text{fine}} - W_{\text{pre}} have low intrinsic rank. Rather than storing and updating the full ΔWRd×k\Delta W \in \mathbb{R}^{d \times k} (which has dkdk parameters), LoRA parameterises:

\Delta W = BA, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)

Only A and B are trained (with A initialised from a Gaussian and B initialised to zero, so the update is zero at the start of fine-tuning). The forward pass becomes W_0\mathbf{x} + \frac{\alpha}{r} BA\mathbf{x}, where the \alpha/r scaling (with hyperparameter \alpha) keeps the update magnitude comparable across choices of rank.

SVD connection: The best rank-rr initialisation for (B,A)(B, A) comes from the SVD of ΔW\Delta W: B=UrΣr1/2B = U_r\Sigma_r^{1/2}, A=Σr1/2VrA = \Sigma_r^{1/2}V_r^\top. In practice, LoRA uses random initialisation (since ΔW\Delta W is unknown before training), but SVD is used post-hoc to analyse which singular directions were actually modified.

DoRA (Liu et al. 2024) further decomposes: W=mW/WcW = m \cdot W/\|W\|_c where m=Wcm = \|W\|_c is the column-wise norm. It then applies LoRA to the directional component, achieving better performance per parameter.

MLA (DeepSeek 2024) compresses the KV cache by factoring the projection matrices: WK=WCDKWCUKW_K = W_C^{DK} W_C^{UK} where WCDKRdh×cW_C^{DK} \in \mathbb{R}^{d_h \times c} compresses to a latent dimension cdhc \ll d_h. This is exactly the rank-cc SVD structure applied to the key/value projections.

Parameters saved by LoRA: Instead of dkdk parameters for ΔW\Delta W, LoRA uses (d+k)r(d + k)r parameters. For d=k=4096d = k = 4096 and r=8r = 8: full is 16.7M16.7M parameters, LoRA is 65K65K - a 256×256\times reduction.

6.2 Recommender Systems

Recommender systems model the user-item interaction matrix RRnu×niR \in \mathbb{R}^{n_u \times n_i} where RijR_{ij} is user ii's rating of item jj (or 0 if unrated). The goal is to predict missing entries.

SVD-based collaborative filtering: Approximate RUΣVR \approx U\Sigma V^\top (truncated to rank kk), where:

  • URnu×kU \in \mathbb{R}^{n_u \times k}: user latent factors (each user's preferences in kk-dimensional space)
  • VRni×kV \in \mathbb{R}^{n_i \times k}: item latent factors (each item's properties in kk-dimensional space)
  • σi\sigma_i: importance of the ii-th latent dimension

Predicted rating: \hat{R}_{ij} = \sum_{f=1}^k \sigma_f U_{if} V_{jf}.

Challenge: RR is sparse (mostly unobserved). Direct SVD fills missing values with 0, which is wrong. Simon Funk's SGD-SVD (2006, won one of the Netflix Prize progress checks) instead minimises (i,j) observed(RijUiVj)2+λ(UF2+VF2)\sum_{(i,j)\text{ observed}} (R_{ij} - U_i^\top V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2) via stochastic gradient descent.

Non-negative Matrix Factorisation (NMF): Constrains U,V0U, V \geq 0 elementwise. Gives "parts-based" representations (pixels, topics, users). NMF is a constrained SVD with non-negativity.

6.3 Latent Semantic Analysis and Word Vectors

Latent Semantic Analysis (LSA, Deerwester et al. 1990) applies truncated SVD to the term-document matrix XRV×DX \in \mathbb{R}^{|V| \times |D|} where XijX_{ij} = tf-idf weight of term ii in document jj.

The rank-kk SVD XUkΣkVkX \approx U_k\Sigma_k V_k^\top gives:

  • UkRV×kU_k \in \mathbb{R}^{|V| \times k}: word embeddings in the kk-dimensional latent semantic space
  • VkRD×kV_k \in \mathbb{R}^{|D| \times k}: document embeddings
  • Σk\Sigma_k: importance of each latent dimension

Words that co-occur in similar documents get nearby embeddings. This is one of the first distributed word representations - a precursor to word2vec and GloVe.

Connection to GloVe: GloVe (Pennington et al. 2014) can be interpreted as a shifted version of LSA. The GloVe model minimises the Frobenius norm between the log pointwise mutual information (PPMI) matrix and its low-rank factorisation - structurally an SVD with a different weighting function.

Modern transformer word embeddings are not directly SVD, but the embedding table ERV×dE \in \mathbb{R}^{|V| \times d} can be analysed via SVD. Well-trained embeddings typically have a steep singular value spectrum (dominant directions = common syntactic/semantic features; tail = noise).

6.4 Image Compression

The rank-kk SVD of a grayscale image IRm×nI \in \mathbb{R}^{m \times n} gives:

I \approx I_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^\top

Storage: k(m+n+1)k(m + n + 1) numbers vs mnmn for the full image. Compression ratio: mn/[k(m+n+1)]mn / [k(m+n+1)].

For a 512×512512 \times 512 image with k=50k = 50: original stores 262K262K numbers, compressed stores 51K51K - a 5×5\times compression. JPEG achieves better compression ratios than SVD at the same quality because it exploits block structure and human visual system properties, but SVD compression is mathematically optimal in the 2\ell^2 sense.

Colour images: Apply SVD independently to each channel (R, G, B), or decompose the matrix of RGB triples. In practice, the colour channels are correlated and the singular value spectra of different channels are similar.
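
A sketch of rank-k compression on a synthetic grayscale "image" (any 2-D array works; a real photo would be loaded into the same kind of array):

```python
import numpy as np

# Synthetic 512x512 grayscale "image": a smooth gradient plus a few blocks.
m, n = 512, 512
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, m))
I = 0.6 * x + 0.4 * y
I[100:200, 150:350] += 0.5
I[300:450, 50:150] -= 0.3

U, s, Vt = np.linalg.svd(I, full_matrices=False)

for k in (5, 20, 50):
    I_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    storage = k * (m + n + 1)                    # numbers kept by the rank-k factors
    rel_err = np.linalg.norm(I - I_k, 'fro') / np.linalg.norm(I, 'fro')
    print(f"k={k:3d}: storage {storage:7d} vs {m*n} numbers "
          f"({m*n/storage:4.1f}x compression), relative error {rel_err:.4f}")
```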

6.5 Pseudoinverse and Least Squares

Overdetermined systems (linear regression): Given XRn×dX \in \mathbb{R}^{n \times d} (data, ndn \gg d) and yRn\mathbf{y} \in \mathbb{R}^n, the least-squares problem minβXβy2\min_{\boldsymbol{\beta}} \|X\boldsymbol{\beta} - \mathbf{y}\|_2 has solution:

\hat{\boldsymbol{\beta}} = X^+\mathbf{y} = V\Sigma^+U^\top\mathbf{y} = \sum_{i=1}^r \frac{\mathbf{u}_i^\top\mathbf{y}}{\sigma_i}\mathbf{v}_i

Each component (uiy)/σi(\mathbf{u}_i^\top\mathbf{y})/\sigma_i is the projection of y\mathbf{y} onto the ii-th left singular vector, divided by the singular value. If σi\sigma_i is very small (ill-conditioned), this component is amplified enormously - catastrophic if uiy\mathbf{u}_i^\top\mathbf{y} contains noise.

Ridge regression shrinkage: The regularised solution is:

\hat{\boldsymbol{\beta}}_\lambda = \sum_{i=1}^r \frac{\sigma_i}{\sigma_i^2 + \lambda} (\mathbf{u}_i^\top\mathbf{y})\mathbf{v}_i

For σiλ\sigma_i \gg \sqrt{\lambda}: factor σi/σi2=1/σi\to \sigma_i/\sigma_i^2 = 1/\sigma_i (same as unregularised). For σiλ\sigma_i \ll \sqrt{\lambda}: factor σi/λ0\to \sigma_i/\lambda \approx 0 (component is suppressed). The threshold λ\sqrt{\lambda} acts as an effective singular value cutoff.

For AI: The normal equations (XX)β^=Xy(X^\top X)\hat{\boldsymbol{\beta}} = X^\top\mathbf{y} are equivalent but numerically inferior (condition number squared). Always use np.linalg.lstsq(X, y) which uses the SVD internally.

6.6 Spectral Methods for Graphs

For a bipartite graph with vertices U,VU, V and edge weight matrix ARU×VA \in \mathbb{R}^{|U| \times |V|} (biadjacency matrix), the SVD A=UΣVA = U\Sigma V^\top provides a spectral embedding:

  • Left singular vectors UU: embed the UU-side vertices
  • Right singular vectors VV: embed the VV-side vertices
  • Singular values: weight each dimension

This is used in spectral co-clustering (Dhillon 2001): simultaneously cluster rows and columns of a data matrix by finding the top-kk singular vectors and applying kk-means to the embedded points.

Connection to symmetric spectral clustering: For a bipartite graph, the SVD of the normalised biadjacency matrix is equivalent to the eigendecomposition of the symmetric Laplacian of the full bipartite graph. The kk-th singular value of AA corresponds to the (k+1)(k+1)-th eigenvalue of the graph Laplacian.

6.7 Weight Matrix Analysis (WeightWatcher)

WeightWatcher (Martin and Mahoney, 2019) is a tool for diagnosing model quality without test data by analysing the singular value spectra of weight matrices.

Key finding: Well-trained neural networks have weight matrices with heavy-tailed singular value spectra that follow power laws:

\rho(\sigma) \sim \sigma^{-(\alpha+1)}, \quad \sigma_i \sim i^{-1/\alpha}

with α[2,4]\alpha \in [2, 4] for high-quality models. This is predicted by random matrix theory (Marchenko-Pastur distribution) for random matrices, with the "learned" part manifesting as the heavy tail.

Diagnostic metrics:

  • α^\hat{\alpha} (power-law exponent): fit to the tail of the singular value distribution. α^[2,4]\hat{\alpha} \in [2, 4] indicates well-trained; α^>6\hat{\alpha} > 6 indicates undertrained; α^<2\hat{\alpha} < 2 indicates overtrained.
  • Weighted alpha: lα^llog(σ1,l)\sum_l \hat{\alpha}_l \log(\sigma_{1,l}) averaged across layers, weighted by layer size.
  • Stable rank: measures effective rank per layer.

For AI: WeightWatcher can predict model quality (test accuracy) from just the weight matrices, without any data. It has been used to select the best checkpoint during training, identify layers that need more training, and detect fine-tuning quality.

6.8 Attention and SVD

The attention weight matrix A=softmax(QK/dk)Rn×nA = \text{softmax}(QK^\top/\sqrt{d_k}) \in \mathbb{R}^{n \times n} (for sequence length nn) has interesting SVD structure in trained transformers:

Low-rank structure of attention: Trained attention matrices are approximately low-rank - most of the attention pattern is captured by a few singular vectors. This has been observed empirically across multiple model families. The effective rank is roughly O(logn)O(\log n) for induction heads and O(1)O(1) for copying heads.

SVD-based attention compression: Several works approximate the attention computation using truncated SVD of QQ and KK: write Q=UQΣQVQQ = U_Q\Sigma_Q V_Q^\top, K=UKΣKVKK = U_K\Sigma_K V_K^\top, then QK=UQΣQ(VQVK)ΣKUKQK^\top = U_Q\Sigma_Q(V_Q^\top V_K)\Sigma_K U_K^\top. If the middle factor VQVKV_Q^\top V_K has low rank, the computation is cheaper.

FlashAttention does not directly use SVD but exploits the block structure of attention matrices for I/O-efficient computation. The theoretical analysis of FlashAttention's approximation quality involves the singular value decay of the attention matrix.

MLA's KV compression: DeepSeek's MLA projects K,VK, V through a low-rank bottleneck WDKVRd×cW^{DKV} \in \mathbb{R}^{d \times c} (cdc \ll d). At inference, only the cc-dimensional latent cKV=WDKVh\mathbf{c}^{KV} = W^{DKV}\mathbf{h} is cached per token, reducing KV cache by factor d/cd/c. This is exactly a rank-cc SVD-style projection of the key-value information.


7. Generalised SVD and Extensions

7.1 Generalised SVD (GSVD)

The Generalised SVD (GSVD) factorises a pair of matrices (A,B)(A, B) with the same number of columns:

A = U C X^\top, \quad B = V S X^\top

where UU, VV are orthogonal, XX is non-singular, and CC, SS are diagonal with Cii2+Sii2=1C_{ii}^2 + S_{ii}^2 = 1.

Applications:

  • Regularised least squares: Solve minAxb22+λ2Bx22\min\|A\mathbf{x}-\mathbf{b}\|_2^2 + \lambda^2\|B\mathbf{x}\|_2^2 via GSVD.
  • Generalised eigenproblems: Av=λBvAv = \lambda Bv is solved via GSVD of (A,B)(A, B).
  • Fisher LDA: The generalised eigenproblem SBv=λSWvS_B\mathbf{v} = \lambda S_W\mathbf{v} (between/within scatter) is a GSVD problem.
  • CCA: Canonical Correlation Analysis decomposes the cross-covariance matrix and uses a GSVD-like structure.

The GSVD has no high-level SciPy wrapper (scipy.linalg.svd, including the lapack_driver='gesdd' option, computes only the ordinary SVD); it is available directly through LAPACK's dggsvd3 routine.

7.2 Tensor SVD and Tucker Decomposition

For tensors (multi-dimensional arrays) ARn1×n2××nd\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}, several SVD-like decompositions exist:

Tucker Decomposition: AG×1U1×2U2×dUd\mathcal{A} \approx \mathcal{G} \times_1 U_1 \times_2 U_2 \cdots \times_d U_d where GRr1×r2××rd\mathcal{G} \in \mathbb{R}^{r_1 \times r_2 \times \cdots \times r_d} is the core tensor and UkRnk×rkU_k \in \mathbb{R}^{n_k \times r_k} are orthogonal factor matrices. The Higher-Order SVD (HOSVD) computes each UkU_k as the SVD of the mode-kk unfolding of A\mathcal{A}.

CP Decomposition (CANDECOMP/PARAFAC): Ar=1Rar(1)ar(2)ar(d)\mathcal{A} \approx \sum_{r=1}^R \mathbf{a}_r^{(1)} \otimes \mathbf{a}_r^{(2)} \otimes \cdots \otimes \mathbf{a}_r^{(d)} - a sum of rank-1 tensors. Unlike matrix SVD, tensor rank is NP-hard to compute in general, and the best rank-RR approximation may not exist (unlike Eckart-Young for matrices).

For AI: Tucker decomposition is used to compress convolutional layers (a 4D weight tensor). MobileNet-style depthwise separable convolutions are an approximate Tucker decomposition. Tensor train (TT) decomposition is used for compressing embedding tables and attention weight matrices in extreme low-resource settings.

7.3 Truncated and Randomised Variants in Practice

Method           | Function                               | When to Use                              | Cost
Full SVD         | np.linalg.svd(A)                       | Small matrices, need all singular values | O(mn\min(m,n))
Thin SVD         | np.linalg.svd(A, full_matrices=False)  | Standard case                            | O(mn\min(m,n))
Truncated SVD    | scipy.sparse.linalg.svds(A, k=k)       | Sparse/large, need top-k                 | O(mnk)
Randomised SVD   | sklearn.utils.extmath.randomized_svd   | Dense large, approximate top-k           | O(mnk)
Incremental SVD  | sklearn.decomposition.IncrementalPCA   | Streaming data                           | O(nk^2) per batch
PyTorch low-rank | torch.svd_lowrank(A, q=k)              | GPU tensors                              | O(mnk)

For LoRA initialisation from a checkpoint, the typical workflow is:

  1. Load the fine-tuned weight matrix WfineW_{\text{fine}}.
  2. Compute ΔW=WfineW0\Delta W = W_{\text{fine}} - W_0.
  3. Run U, s, Vh = np.linalg.svd(delta_W, full_matrices=False).
  4. Set B = U[:, :r] \cdot \text{diag}(\sqrt{s[:r]}) and A = \text{diag}(\sqrt{s[:r]}) \cdot Vh[:r, :].
  5. Confirm \|BA - \Delta W\|_F / \|\Delta W\|_F is small (as sketched below).
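
A sketch of steps 3-5 in numpy; the matrices here are random stand-ins for W_fine and W_0, and the rank values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
d, k, r = 1024, 1024, 8

# Stand-ins: pretend the fine-tuning update is approximately rank-16 plus noise.
W0 = rng.standard_normal((d, k)) / np.sqrt(k)
delta_W = (rng.standard_normal((d, 16)) @ rng.standard_normal((16, k)) / 16
           + 1e-3 * rng.standard_normal((d, k)))
W_fine = W0 + delta_W

# Steps 2-4: SVD of the update, split the top-r factors symmetrically.
U, s, Vh = np.linalg.svd(W_fine - W0, full_matrices=False)
B = U[:, :r] @ np.diag(np.sqrt(s[:r]))       # (d, r)
A = np.diag(np.sqrt(s[:r])) @ Vh[:r, :]      # (r, k)

# Step 5: relative reconstruction error of the rank-r adapter.
rel_err = np.linalg.norm(B @ A - delta_W, 'fro') / np.linalg.norm(delta_W, 'fro')
print(f"rank-{r} adapter captures the update with relative error {rel_err:.3f}")
print("parameters: full update", d * k, " LoRA factors", (d + k) * r)
```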

8. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Treating singular values as eigenvalues | $\sigma_i \neq \lambda_i$ in general; a rotation matrix has eigenvalues $e^{\pm i\theta}$ but all $\sigma_i = 1$ | Remember $\sigma_i = \sqrt{\lambda_i(A^\top A)}$; the two coincide only for symmetric PSD matrices |
| 2 | Forming $A^\top A$ to compute the SVD | Squares the condition number ($\kappa(A^\top A) = \kappa(A)^2$), destroying small singular values | Use `np.linalg.svd(A)` directly; never `np.linalg.eigh(A.T @ A)` for the SVD |
| 3 | Assuming $U = V$ for non-symmetric matrices | $U$ and $V$ are different orthogonal matrices (different sizes for rectangular $A$); $U = V$ only if $A$ is symmetric PSD | Check dimensions: $U \in \mathbb{R}^{m\times m}$, $V \in \mathbb{R}^{n\times n}$ for an $m\times n$ matrix |
| 4 | Confusing sign conventions in $U, V$ | $(\mathbf{u}_i, \mathbf{v}_i) \to (-\mathbf{u}_i, -\mathbf{v}_i)$ leaves $A = U\Sigma V^\top$ unchanged; different platforms and library versions may return different signs | Always check $A \approx U\Sigma V^\top$ numerically; compare absolute cosine similarity between vectors |
| 5 | Using full SVD when thin SVD suffices | Full SVD computes $U \in \mathbb{R}^{m\times m}$, including $m - r$ columns that contribute nothing | Use `np.linalg.svd(A, full_matrices=False)` for the thin SVD |
| 6 | Reading rank off the nonzero singular values in finite precision | Singular values that should be exactly zero typically come out as tiny positive numbers in float64 | Use `np.linalg.matrix_rank(A)` with an appropriate tolerance $\tau = \epsilon_{\text{mach}} \cdot \sigma_1 \cdot \max(m,n)$ |
| 7 | Applying LoRA to all layers equally | Not all layers have equally low-rank updates; key/query projections often need higher rank than MLP layers | Use AdaLoRA-style analysis to allocate a per-layer rank $r_l$ based on the singular value spectrum of each layer's update |
| 8 | Confusing nuclear norm with Frobenius norm | $\lVert A\rVert_* = \sum_i\sigma_i$ vs $\lVert A\rVert_F = \sqrt{\sum_i\sigma_i^2}$; minimising the nuclear norm promotes sparsity in the singular values (low rank), not in the matrix entries | Choose the norm by goal: low rank -> nuclear norm; small magnitude -> Frobenius norm |
| 9 | Thinking the pseudoinverse solves all linear systems stably | $A^+$ amplifies components along small singular values; for noisy $\mathbf{b}$, the term $(\mathbf{u}_i^\top\mathbf{b})/\sigma_i$ blows up when $\sigma_i \approx 0$ | Use ridge regression ($\lambda > 0$) or truncate singular values below a threshold |
| 10 | Treating `scipy.linalg.svd` and `numpy.linalg.svd` as interchangeable | Both return `(U, s, Vh)` with the same convention, but the scipy version exposes extra options (e.g. a choice of LAPACK driver); mixing them invites confusion | Pick one, and remember that `Vh` is already $V^\top$, not $V$ |
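
Mistakes 2 and 6 are easy to demonstrate numerically. The sketch below (illustrative, not from the notebooks) builds a matrix with known singular values, then shows that the $A^\top A$ route corrupts the small ones while the direct SVD preserves them, and that counting strictly positive singular values overstates the rank.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10

# Build A with known singular values, two of them exactly zero (true rank 8).
sigma = np.array([1e2, 1e1, 1e0, 1e-2, 1e-4, 1e-5, 1e-6, 1e-7, 0.0, 0.0])
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(sigma) @ V.T

# Mistake 2: the Gram-matrix route loses singular values near sqrt(eps) * sigma_1.
s_direct = np.linalg.svd(A, compute_uv=False)
eig = np.linalg.eigvalsh(A.T @ A)[::-1]              # descending order
s_gram = np.sqrt(np.clip(eig, 0.0, None))            # clip tiny negative eigenvalues
print("direct SVD :", s_direct[5:8])                 # 1e-5, 1e-6, 1e-7 recovered
print("via A^T A  :", s_gram[5:8])                   # smallest values drowned in rounding noise

# Mistake 6: counting sigma_i > 0 overstates the rank in floating point.
print("naive rank :", int(np.sum(s_direct > 0)))     # usually 10: round-off keeps them nonzero
print("matrix_rank:", np.linalg.matrix_rank(A))      # 8, using the default tolerance
```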

9. Exercises

Exercise 1 * - SVD by Hand (2×2)

Compute the SVD of $A = \begin{pmatrix}3 & 1\\1 & 3\end{pmatrix}$ by hand: (a) Form $A^\top A$ and find its eigenvalues and eigenvectors. (b) Determine $\sigma_1, \sigma_2$ and $\mathbf{v}_1, \mathbf{v}_2$. (c) Compute $\mathbf{u}_i = A\mathbf{v}_i/\sigma_i$ for each $i$. (d) Write out $U$, $\Sigma$, $V$ and verify $U\Sigma V^\top = A$. (e) What are the semi-axes of the ellipse $A$ maps the unit circle to?

Exercise 2 * - Four Fundamental Subspaces via SVD

Given $A = \begin{pmatrix}1&2&3\\4&5&6\\7&8&9\end{pmatrix}$ (rank 2): (a) Compute the thin SVD. (b) Identify a basis for the column space, row space, null space, and left null space from the SVD. (c) Verify your null space basis by showing $A\mathbf{v} = \mathbf{0}$. (d) Verify the column space basis by expressing $A$'s columns as linear combinations.

Exercise 3 ** - Eckart-Young Reconstruction Error

For the $5\times5$ matrix $A$ with singular values $[10, 5, 3, 1, 0.1]$: (a) Implement `truncated_svd(A, k)` that returns the rank-$k$ approximation $A_k$. (b) Compute the exact spectral norm error $\|A - A_k\|_2$ for $k = 1, 2, 3, 4$. (c) Compute the Frobenius norm error $\|A - A_k\|_F$ for each $k$. (d) Verify the Eckart-Young bound: $\|A - A_k\|_2 = \sigma_{k+1}$. (e) What is the minimum rank needed to retain 95% of the Frobenius norm of $A$?

Exercise 4 ** - Moore-Penrose Pseudoinverse

Implement `pseudoinverse(A, tol=1e-10)` using the SVD: (a) Compute the thin SVD $A = U\Sigma V^\top$. (b) Invert the nonzero singular values (use `tol` to determine which are zero). (c) Return $A^+ = V\Sigma^+U^\top$. (d) Verify the four Moore-Penrose conditions for a random $4\times3$ matrix. (e) Use your pseudoinverse to solve the overdetermined system $A\mathbf{x} = \mathbf{b}$ and verify it gives the minimum-norm least-squares solution.

Exercise 5 ** - Condition Number and Least Squares Stability

(a) Generate an ill-conditioned matrix $A \in \mathbb{R}^{10\times5}$ with singular values $[100, 50, 10, 1, 0.001]$ using random $U, V$. (b) Set $\mathbf{b} = A\mathbf{x}^* + 0.01\boldsymbol{\epsilon}$ (add small noise $\boldsymbol{\epsilon}$). (c) Solve using `np.linalg.lstsq` (SVD-based) and via the normal equations $(A^\top A)^{-1}A^\top\mathbf{b}$. (d) Compare the errors $\|\hat{\mathbf{x}} - \mathbf{x}^*\|$ for both methods. (e) Repeat with ridge regression for $\lambda \in [10^{-6}, 10^2]$ and plot the trade-off between fit and regularisation.

Exercise 6 *** - Randomised SVD

Implement `randomized_svd(A, k, n_oversampling=5, n_iter=2)`: (a) Generate a Gaussian sketch matrix $\Omega \in \mathbb{R}^{n\times(k+p)}$. (b) Compute the range sketch $Y = (AA^\top)^q A\Omega$ (with $q$ = `n_iter` power iterations). (c) Orthogonalise $Y$ via QR. (d) Project and compute the small SVD. (e) Compare your implementation against `np.linalg.svd` and `sklearn.utils.extmath.randomized_svd` on a $500\times300$ matrix. Report relative Frobenius error vs rank.

Exercise 7 *** - LoRA-Style Low-Rank Adapter

Simulate LoRA fine-tuning: (a) Create a "pre-trained" weight matrix $W_0 \in \mathbb{R}^{64\times32}$ (random Gaussian). (b) Create a "true fine-tuned" update $\Delta W^*$ with rank 4 (use the SVD to construct it). (c) Implement `lora_decompose(delta_W, r)` that finds the rank-$r$ factorisation $B, A$ minimising $\|BA - \Delta W\|_F$. (d) Compute the relative approximation error for $r = 1, 2, 4, 8, 16$. (e) Show that $r = 4$ recovers $\Delta W^*$ almost exactly (since it is rank-4) while $r < 4$ introduces error. Plot the Eckart-Young error bound.

Exercise 8 *** - Image Compression via SVD

(a) Load or generate a grayscale "image" matrix $I \in \mathbb{R}^{128\times128}$ (use a structured synthetic image with edges and gradients). (b) Implement `svd_compress(I, k)` that returns the rank-$k$ approximation. (c) Compute PSNR (peak signal-to-noise ratio) vs compression ratio for $k = 1, 5, 10, 20, 50, 100$. (d) Plot the singular value spectrum and identify the "knee" where adding more singular values gives diminishing returns. (e) Compare storage: full image vs rank-$k$ representation. What rank achieves $10\times$ compression?


10. Why This Matters for AI (2026 Perspective)

| Concept | Impact on AI/LLMs in 2026 | Specific Methods |
|---|---|---|
| Eckart-Young theorem | Every low-rank method is justified by this theorem | LoRA, DoRA, MLA, spectral pruning |
| Truncated SVD | Foundation of parameter-efficient fine-tuning; compress $B \cdot 256$ to $B \cdot 1$ parameters | LoRA (Hu 2021), AdaLoRA (Zhang 2023), QLoRA |
| Randomised SVD | Makes SVD practical at web scale; $O(mnk)$ vs $O(mn^2)$ | `TruncatedSVD` in sklearn, `torch.svd_lowrank` |
| Spectral norm | Lipschitz control of each layer; GAN stability; gradient norm bounding | Spectral normalisation (Miyato 2018), SN-GAN |
| Nuclear norm | Convex relaxation of rank; used in matrix completion and multi-task learning | Matrix completion, inductive matrix completion |
| Pseudoinverse | Numerically stable least squares; basis of linear probing | Linear regression, probing classifiers, OLS |
| Condition number | Explains ill-conditioning in training; motivates normalisation | BatchNorm, LayerNorm, feature scaling, K-FAC |
| SVD of weight matrices | Quality metric for trained models without test data | WeightWatcher, model selection, layer diagnosis |
| SVD of attention matrices | Explains attention head specialisation; guides compression | Attention pruning, low-rank attention, MLA |
| Procrustes problem | Cross-lingual transfer; domain adaptation; task arithmetic | MUSE (cross-lingual), task arithmetic, LoRA merge |
| Singular value spectrum | Power-law tail indicates implicit self-regularisation in SGD | WeightWatcher, alpha power-law metric |
| MLA / KV compression | Low-rank projection of K/V matrices reduces inference cost | DeepSeek-V2/V3 MLA, GQA, MQA |

11. Conceptual Bridge

Where we came from: Section 03-01 (Eigenvalues and Eigenvectors) gave us the tools to decompose square matrices into invariant directions and scaling factors. But eigendecomposition is limited: it requires square matrices, may involve complex numbers, and fails for defective matrices. The SVD generalises this to any matrix by abandoning the requirement that input and output live in the same space. Instead of eigenvectors - directions that map to themselves - we have singular vectors: optimal input and output directions paired by the matrix's stretching action.

Where we are going: Section 03-03 (Principal Component Analysis) is, in its essence, the SVD of the centred data matrix. PCA finds the principal axes of data variation by computing the SVD of $\tilde{X} \in \mathbb{R}^{n \times d}$; the right singular vectors are the principal components, and the singular values (suitably scaled) are the standard deviations along each principal direction. Everything you have learned about the SVD - Eckart-Young, pseudoinverse, truncation - applies directly to PCA.
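
A minimal numerical check of this correspondence (illustrative numpy, using the $1/n$ covariance normalisation): the right singular vectors of the centred data matrix match the eigenvectors of the sample covariance, and $\sigma_i^2/n$ match its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated columns
X_tilde = X - X.mean(axis=0)                                   # centre the data

U, s, Vh = np.linalg.svd(X_tilde, full_matrices=False)
cov = X_tilde.T @ X_tilde / n
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # flip to descending

print(np.allclose(s**2 / n, eigvals))                # variances along PCs match
print(np.allclose(np.abs(Vh), np.abs(eigvecs.T)))    # directions match up to sign
```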

The SVD as a unifying lens: The SVD connects to every major topic in the rest of this curriculum:

  • Matrix norms (Section 03-06): spectral, Frobenius, and nuclear norms are all functions of singular values.
  • Positive definite matrices (Section 03-07): a symmetric matrix $A$ is PSD iff its singular values coincide with its eigenvalues, $\sigma_i = \lambda_i \geq 0$.
  • Optimisation (Chapter 08): the subgradient of $\|A\|_*$ with respect to $A$ involves the singular vectors; nuclear norm minimisation can be cast as a semidefinite program and is solved in practice via SVD-based singular value soft-thresholding.
  • Probability (Chapter 06): the covariance matrix $\Sigma = X^\top X / n$ (for centred $X$) has an SVD directly related to the SVD of $X$; the Mahalanobis distance uses $\Sigma^{-1/2} = V\Lambda^{-1/2}V^\top$.

CONCEPTUAL MAP - SVD IN THE CURRICULUM
========================================================================

  Eigenvalues &      Singular Value       Principal Component
  Eigenvectors  -->  Decomposition   -->  Analysis
  (03-01)            (03-02) <HERE        (03-03)
                          |
           +--------------+------------------------------+
           |              |                              |
           v              v                              v
       Matrix          Least Squares              AI Applications
       Norms           & Pseudoinverse            -----------------
       (03-06)         (everywhere)               LoRA / DoRA / MLA
                                                  WeightWatcher
                                                  Recommender Sys.
                                                  Image Compression
                                                  Spectral Norm GAN

========================================================================

The single most important insight about SVD: Every matrix is the composition of three simple operations - rotate, scale, rotate. The singular values tell you how much the matrix stretches space in its principal directions; the singular vectors tell you which directions those are. Understanding this decomposition is understanding every linear operation in every neural network, every data compression, and every least-squares problem in machine learning.

