Matrix Norms and Condition Numbers - Part 1: Intuition to Applications in Machine Learning

1. Intuition

1.1 Why We Need to Measure Matrices

Before introducing any formula, consider five concrete scenarios where "how big is this matrix?" is not a vague question but a precise computational need:

Scenario 1 - Bounding errors in linear systems. You solve $A\mathbf{x} = \mathbf{b}$ on a computer. The computed $\hat{\mathbf{x}}$ satisfies $(A+E)\hat{\mathbf{x}} = \mathbf{b}$ for some rounding-error matrix $E$. How wrong is $\hat{\mathbf{x}}$? The answer involves $\|E\|$ and $\|A^{-1}\|$ - you need norms to bound the damage.

Scenario 2 - Regularization. Your neural network overfits. You add a penalty $\lambda\|W\|^2$ to the loss. Which norm $\|\cdot\|$? The Frobenius norm penalizes every weight equally. The spectral norm penalizes the largest singular value. The nuclear norm encourages low rank. Each choice embeds a different inductive bias.

Scenario 3 - Lipschitz constants. Your network $f$ maps inputs to outputs. How much can a small input perturbation change the output? The Lipschitz constant of a linear layer is exactly $\|W\|_2$. The global Lipschitz constant of a deep network is bounded by $\prod_l \|W^{[l]}\|_2$.

Scenario 4 - Convergence of gradient descent. Gradient descent on $\min_\mathbf{x} \frac{1}{2}\mathbf{x}^\top A\mathbf{x} - \mathbf{b}^\top\mathbf{x}$ converges geometrically with rate $1 - 2/(\kappa+1)$, where $\kappa = \|A\|_2\|A^{-1}\|_2$ is the condition number. A large condition number means slow convergence.

Scenario 5 - Low-rank approximation quality. The Eckart-Young theorem says the best rank-$k$ approximation to $A$ has error $\|A - A_k\|_2 = \sigma_{k+1}$ in the spectral norm and $\|A - A_k\|_F = \sqrt{\sum_{i>k}\sigma_i^2}$ in the Frobenius norm. Without norms, we cannot even state what "best approximation" means.

In every scenario, the choice of norm carries meaning. This section develops the machinery to make those choices deliberate and principled.

1.2 Geometric Picture: Amplification and Shrinkage

A matrix $A \in \mathbb{R}^{m \times n}$ is a linear map. Applied to the unit sphere $\{\mathbf{x} : \|\mathbf{x}\|_2 = 1\}$ in $\mathbb{R}^n$, it produces an ellipsoid in $\mathbb{R}^m$. This follows directly from the SVD:

$$A = U\Sigma V^\top \quad \Rightarrow \quad A\mathbf{x} = U(\Sigma(V^\top\mathbf{x}))$$

$V^\top$ rotates the sphere (orthogonal maps preserve spheres). $\Sigma$ stretches along the coordinate axes by factors $\sigma_1, \ldots, \sigma_r$ (with $\sigma_i > 0$). $U$ rotates the result. The outcome: the unit sphere maps to an ellipsoid whose semi-axes have lengths $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$.

MATRIX NORMS AS ELLIPSOID GEOMETRY
========================================================================

  Unit sphere in R^n            ->  A maps to  ->  Ellipsoid in R^m

   *****                                    +-----------+
  * * * *         A = U Sigma V^T        +-+             +-+
 * * * * *    --------------->         +-+                 +-+
  * * * *                              +--+               +--+
   *****                                 +-+             +-+   <- sigma_2
                                            +-----------+
                              <-------------------------------------->
                                         2 sigma_1  (longest axis)

  Spectral norm  ||A||_2 = sigma_1                           = length of longest semi-axis
  Frobenius norm ||A||_F = sqrt(sigma_1^2 + sigma_2^2 + ...) = "total energy" of axes
  Nuclear norm   ||A||_* = sigma_1 + sigma_2 + ...           = sum of all semi-axis lengths

========================================================================

Each norm captures a different aspect of this ellipsoid:

  • $\|A\|_2 = \sigma_1$: the worst-case amplification - how long is the longest axis?
  • $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$: root-mean-square amplification across all directions
  • $\|A\|_* = \sum_i \sigma_i$: total "size" of all directions (related to rank)

For AI: Weight matrices in neural networks define such ellipsoids. A spectrally normalized layer has $\sigma_1 = 1$ - the unit sphere maps to an ellipsoid whose longest axis has length exactly 1. A Frobenius-regularized layer has bounded total energy across all singular values.
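These three quantities are easy to read off numerically. A minimal NumPy sketch (the matrix A here is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

spectral  = sigma[0]                    # ||A||_2: longest semi-axis
frobenius = np.sqrt(np.sum(sigma**2))   # ||A||_F: total energy of the axes
nuclear   = np.sum(sigma)               # ||A||_*: sum of semi-axis lengths

assert np.isclose(spectral,  np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear,   np.linalg.norm(A, 'nuc'))
```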

1.3 Why Matrix Norms Matter for AI

The table below shows each major AI technique and the norm that is its mathematical core:

AI technique                  | Norm used                                  | Mathematical role
Weight decay / L2 reg         | Frobenius $\|W\|_F^2$                      | Penalizes total squared weight energy
Spectral normalization (GANs) | Spectral $\|W\|_2 = \sigma_1$              | Enforces 1-Lipschitz discriminator
Nuclear norm reg              | Nuclear $\|W\|_* = \sum_i \sigma_i$        | Promotes low-rank weight matrices
Gradient clipping             | Frobenius or L2 of gradient                | Prevents exploding gradients
LoRA / low-rank adaptation    | Implicit nuclear norm                      | Low-rank $\Delta W = BA$ has bounded $\|\cdot\|_*$
Certified robustness          | Per-layer spectral $\prod_l \|W^{[l]}\|_2$ | Bounds Lipschitz constant of network
Condition number in Adam      | $\kappa = \sigma_1/\sigma_n$ of curvature  | Diagonal preconditioning approximates $H^{-1}$
Attention stability           | $\|QK^\top\|_2 / \sqrt{d_k}$               | Prevents softmax saturation
Generalization bounds         | Products of $\|W\|_F$                      | PAC-Bayes and Rademacher complexity

1.4 Historical Timeline

Year      | Contributor      | Contribution
1878      | Frobenius        | Introduced the entry-sum norm (now Frobenius) in matrix theory
1911      | Hilbert          | Trace-class operators in infinite-dimensional spaces
1937      | von Neumann      | Schatten/trace norms; characterization of unitarily invariant norms
1951      | Ky Fan           | Ky Fan k-norms; Fan dominance theorem
1960s     | Golub, Kahan     | Numerical algorithms for singular values and condition numbers
1973      | Rudin            | Functional-analysis formalization; duality of norms
1997      | Trefethen & Bau  | Textbook treatment of conditioning and backward stability in numerical linear algebra
2002      | Fazel            | Nuclear norm relaxation of rank minimization
2009      | Candes & Recht   | Matrix completion via nuclear norm minimization (Netflix Prize theory)
2018      | Miyato et al.    | Spectral normalization for GANs - spectral norm goes mainstream in DL
2021-2024 | LoRA, DoRA, μP   | Mixed-norm methods for efficient fine-tuning of LLMs

2. Norm Axioms

2.1 Abstract Norm Definition

Definition. A norm on a vector space $V$ over $\mathbb{R}$ is a function $\|\cdot\|: V \to \mathbb{R}_{\geq 0}$ satisfying three axioms for all $\mathbf{u}, \mathbf{v} \in V$ and $\alpha \in \mathbb{R}$:

Axiom | Formula                                                                                  | Name
N1    | $\|\mathbf{v}\| \geq 0$, and $\|\mathbf{v}\| = 0 \Leftrightarrow \mathbf{v} = \mathbf{0}$ | Positive definiteness
N2    | $\|\alpha\mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|$                                        | Absolute homogeneity
N3    | $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$                       | Triangle inequality

A function satisfying N1 (without the zero condition), N2, and N3 is a semi-norm. A semi-norm can be zero on nonzero vectors.

Matrix space as a vector space. The space $\mathbb{R}^{m \times n}$ of $m \times n$ real matrices is a vector space of dimension $mn$ under entry-wise addition and scalar multiplication. Any norm on $\mathbb{R}^{mn}$ (after reshaping) gives a valid matrix norm. But not every matrix norm is "compatible" with matrix multiplication - additional properties (submultiplicativity) are often required.

Definition (submultiplicative norm). A matrix norm $\|\cdot\|$ on $\mathbb{R}^{n \times n}$ is submultiplicative (or consistent) if:

$$\|AB\| \leq \|A\|\,\|B\| \quad \text{for all } A, B \in \mathbb{R}^{n \times n}$$

Submultiplicativity is essential for analyzing products of matrices (as arise in deep network Jacobians) and for bounding errors that propagate through multiplications.

Non-example of submultiplicativity. The max-entry norm $\|A\|_{\max} = \max_{ij}|A_{ij}|$ satisfies the three norm axioms but is NOT submultiplicative in general. Take $A = B = \mathbf{1}\mathbf{1}^\top$ (the all-ones $n \times n$ matrix): $\|A\|_{\max} = \|B\|_{\max} = 1$, but $AB = n\,\mathbf{1}\mathbf{1}^\top$, so $\|AB\|_{\max} = n > 1 = \|A\|_{\max}\|B\|_{\max}$.
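A three-line numeric check of this counterexample (a minimal sketch; the size n = 4 is arbitrary):

```python
import numpy as np

n = 4
A = np.ones((n, n))                      # all-ones matrix: max-entry norm = 1
lhs = np.abs(A @ A).max()                # ||AA||_max = n
rhs = np.abs(A).max() * np.abs(A).max()  # ||A||_max * ||A||_max = 1
print(lhs, ">", rhs)                     # 4.0 > 1.0: submultiplicativity fails
```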

2.2 Standard Vector Norms

For $\mathbf{x} \in \mathbb{R}^n$, the standard norms are:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|, \qquad \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}, \qquad \|\mathbf{x}\|_\infty = \max_i |x_i|, \qquad \|\mathbf{x}\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p} \ (p \geq 1)$$

The unit balls $\{\mathbf{x} : \|\mathbf{x}\|_p \leq 1\}$ have characteristic shapes:

UNIT BALLS IN 2D FOR DIFFERENT p-NORMS
========================================================================

  p=1 (diamond)     p=2 (circle)     p=inf (square)     p=1/2 (star)
                                                        [not a norm!]
      *                 ___               +---+
     /|\               /   \             |   |               *
    / | \             /     \            |   |              /|\
  *--+--*            |   *   |           |   |            */   \*
    \ | /             \     /            |   |              \   /
     \|/               \___/             +---+               \|/
      *                                                        *

  p < 1:  concave unit ball, violates the triangle inequality -> NOT a norm
  p >= 1: convex unit ball, satisfies all axioms -> valid norm

========================================================================

Hölder's inequality. For conjugate exponents $1/p + 1/q = 1$ (with $p = 1 \Leftrightarrow q = \infty$):

$$|\langle\mathbf{x},\mathbf{y}\rangle| = \left|\sum_i x_i y_i\right| \leq \|\mathbf{x}\|_p\,\|\mathbf{y}\|_q$$

This generalizes the Cauchy-Schwarz inequality ($p = q = 2$) and is the foundation of dual norm theory.

2.3 Equivalence of Norms in Finite Dimensions

Theorem. On $\mathbb{R}^n$ (or any finite-dimensional normed space), all norms are equivalent: for any two norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$, there exist constants $c, C > 0$ such that:

$$c\,\|\mathbf{x}\|_\alpha \leq \|\mathbf{x}\|_\beta \leq C\,\|\mathbf{x}\|_\alpha \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$

Proof sketch. The unit sphere $\{\mathbf{x} : \|\mathbf{x}\|_\alpha = 1\}$ is compact, and every norm is a continuous function. A continuous function on a compact set attains its extrema; these extrema give the constants $c$ and $C$. $\square$

Specific bounds relating the standard norms on $\mathbb{R}^n$:

Inequality                                                                        | Bound             | Tight when
$\|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1$                                          | factor 1          | $\mathbf{x}$ has one nonzero entry
$\|\mathbf{x}\|_1 \leq \sqrt{n}\,\|\mathbf{x}\|_2$                                | factor $\sqrt{n}$ | all entries equal
$\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \sqrt{n}\,\|\mathbf{x}\|_\infty$ | factor $\sqrt{n}$ | one / all entries equal
$\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_1 \leq n\,\|\mathbf{x}\|_\infty$       | factor $n$        | one / all entries equal

Why this matters: Norm equivalence guarantees that convergence in one norm implies convergence in all others - so the choice of norm for convergence analysis is flexible. However, the constants matter for quantitative bounds, and choosing the right norm can tighten bounds by factors of $\sqrt{n}$ or $n$.

Warning: Norm equivalence fails in infinite dimensions. In $L^p$ function spaces, $L^2 \not\simeq L^\infty$ - this distinction is crucial in functional analysis and infinite-width neural network theory.

2.4 Dual Norms

Definition. The dual norm of $\|\cdot\|$ on $\mathbb{R}^n$ is:

$$\|\mathbf{y}\|_* = \max_{\|\mathbf{x}\| \leq 1} \langle\mathbf{x}, \mathbf{y}\rangle = \max_{\mathbf{x} \neq \mathbf{0}} \frac{\langle\mathbf{x},\mathbf{y}\rangle}{\|\mathbf{x}\|}$$

Key dual pairs:

  • $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are dual: $\|\mathbf{y}\|_\infty = \max_{\|\mathbf{x}\|_1 \leq 1}\langle\mathbf{x},\mathbf{y}\rangle$
  • $\|\cdot\|_2$ is self-dual: $\|\mathbf{y}\|_2 = \max_{\|\mathbf{x}\|_2 \leq 1}\langle\mathbf{x},\mathbf{y}\rangle$ (Cauchy-Schwarz with equality)
  • $\|\cdot\|_p$ and $\|\cdot\|_q$ are dual when $1/p + 1/q = 1$ (Hölder conjugates)

Dual norms for matrices (treating them as vectors under the Frobenius inner product): the dual of the spectral norm is the nuclear norm:

$$\|A\|_* = \max_{\|B\|_2 \leq 1}\langle A, B\rangle_F = \max_{\|B\|_2 \leq 1}\operatorname{tr}(A^\top B)$$

This duality appears in optimization: the subdifferential of $\|A\|_*$ involves matrices with spectral norm $\leq 1$.

For AI: The dual norm appears naturally in proximal gradient methods. The proximal operator of $\lambda\|A\|_*$ (nuclear norm) is soft-thresholding of singular values - a key step in matrix completion algorithms and low-rank optimization.


3. The Frobenius Norm

3.1 Definition and Geometric Meaning

The Frobenius norm treats a matrix as a long vector and applies the Euclidean norm:

$$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)}$$

For $A \in \mathbb{R}^{m \times n}$, it is the $\ell^2$ norm on $\mathbb{R}^{mn}$. This gives the Frobenius norm full access to Euclidean geometry: inner products, projections, and angles between matrices.

The matrix inner product. The Frobenius norm is induced by the inner product:

$$\langle A, B \rangle_F = \operatorname{tr}(A^\top B) = \sum_{i,j} a_{ij} b_{ij}$$

This turns the space $\mathbb{R}^{m \times n}$ into a Hilbert space. Cauchy-Schwarz holds: $|\langle A, B\rangle_F| \leq \|A\|_F \,\|B\|_F$.

Geometric interpretation. $\|A\|_F$ is the "size" of the linear map $A$ when averaged uniformly over the unit sphere. More precisely, if $\mathbf{x}$ is drawn uniformly from the unit sphere $S^{n-1}$, then:

$$\mathbb{E}\left[\|A\mathbf{x}\|_2^2\right] = \frac{\|A\|_F^2}{n}$$

This is the mean squared output energy - the Frobenius norm measures average amplification over all directions.
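A quick Monte Carlo check of this identity (a sketch; the matrix and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

x = rng.standard_normal((n, 100_000))
x /= np.linalg.norm(x, axis=0)       # columns are uniform on the unit sphere

mean_sq = np.mean(np.linalg.norm(A @ x, axis=0) ** 2)
print(mean_sq, np.linalg.norm(A, 'fro') ** 2 / n)   # nearly equal
```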

3.2 SVD Relationship

The deepest formula for the Frobenius norm comes from the singular value decomposition:

$$\|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2} = \|\boldsymbol{\sigma}\|_2$$

where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ are the nonzero singular values and $r = \operatorname{rank}(A)$.

Proof. $\|A\|_F^2 = \operatorname{tr}(A^\top A)$. Since $A = U\Sigma V^\top$, we have $A^\top A = V\Sigma^2 V^\top$. The trace is invariant under similarity, so $\operatorname{tr}(A^\top A) = \operatorname{tr}(\Sigma^2) = \sum_i \sigma_i^2$.

Consequences:

  • $\|A\|_F^2$ equals the sum of squared singular values - a measure of total "energy"
  • $\|A\|_F \geq \|A\|_2 = \sigma_1$ (the Frobenius norm is at least as large as the spectral norm)
  • $\|A\|_F \leq \sqrt{r}\,\|A\|_2$ (the Frobenius norm is bounded by $\sqrt{\text{rank}}$ times the spectral norm)
  • For rank-1 matrices: $\|A\|_F = \|A\|_2 = \sigma_1$ (single singular value)

For AI: The Frobenius norm of a weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ equals $\|\boldsymbol{\sigma}(W)\|_2$ - the $\ell^2$ norm of its singular values. Weight decay (L2 regularization) penalizes $\|W\|_F^2 = \sum_i \sigma_i^2$, encouraging all singular values to shrink uniformly. This is weaker than nuclear norm regularization ($\|\boldsymbol{\sigma}\|_1$), which promotes low-rank structure.

3.3 Key Properties and Inequalities

Submultiplicativity:

$$\|AB\|_F \leq \|A\|_F \,\|B\|_F$$

Proof. $(AB)_{ij} = \mathbf{a}_i^\top \mathbf{b}_j$, where $\mathbf{a}_i^\top$ is the $i$-th row of $A$ and $\mathbf{b}_j$ is the $j$-th column of $B$. By Cauchy-Schwarz, $|(AB)_{ij}|^2 \leq \|\mathbf{a}_i\|^2 \|\mathbf{b}_j\|^2$. Summing over $i,j$: $\|AB\|_F^2 \leq \left(\sum_i \|\mathbf{a}_i\|^2\right)\left(\sum_j \|\mathbf{b}_j\|^2\right) = \|A\|_F^2 \|B\|_F^2$.

Unitary invariance:

$$\|UAV\|_F = \|A\|_F \quad \text{for unitary } U, V$$

This follows from the SVD formula: unitary transformations leave the singular values unchanged.

Comparison with other norms:

Norm         | Symbol                                        | Relationship to $\|A\|_F$
Spectral     | $\|A\|_2$                                     | $\|A\|_2 \leq \|A\|_F \leq \sqrt{r}\,\|A\|_2$
Nuclear      | $\|A\|_*$                                     | $\|A\|_F \leq \|A\|_* \leq \sqrt{r}\,\|A\|_F$
Max entry    | $\|A\|_{\max}$                                | $\|A\|_{\max} \leq \|A\|_F \leq \sqrt{mn}\,\|A\|_{\max}$
Max row norm | $\|A\|_{2,\infty}$ (largest row $\ell^2$ norm) | $\|A\|_{2,\infty} \leq \|A\|_F \leq \sqrt{m}\,\|A\|_{2,\infty}$

3.4 Best Low-Rank Approximation

The Eckart-Young theorem - the fundamental result about low-rank approximation - is stated in terms of both the Frobenius and spectral norms:

Theorem (Eckart-Young, 1936). Let $A = U\Sigma V^\top$ and $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top$. Then:

$$\min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}, \qquad \min_{\operatorname{rank}(B) \leq k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$$

In both norms, the optimal rank-$k$ approximation is obtained by truncating the SVD.

For AI: This theorem is the mathematical foundation of PCA (-> 03-PCA), LoRA, and attention matrix compression. When training a LoRA adapter $W = W_0 + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, the Eckart-Young theorem characterizes the optimum the factored form can reach: the best rank-$r$ approximation to any full fine-tuning update, in both norms, is its truncated SVD.
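Both optimality statements are easy to verify numerically; a sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD

print(np.linalg.norm(A - A_k, 2),     s[k])                        # sigma_{k+1}
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))   # tail energy
```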

3.5 Computing and Differentiating the Frobenius Norm

Computation. In NumPy: np.linalg.norm(A, 'fro') or equivalently np.sqrt(np.sum(A**2)).

Gradient. The squared Frobenius norm is differentiable everywhere; the norm itself is differentiable except at $A = 0$. The gradients with respect to $A$ are:

$$\frac{\partial \|A\|_F^2}{\partial A} = 2A, \qquad \frac{\partial \|A\|_F}{\partial A} = \frac{A}{\|A\|_F}$$

This makes the Frobenius norm easy to differentiate in backpropagation. Weight decay adds $\lambda \|W\|_F^2$ to the loss, contributing the gradient $2\lambda W$ - the standard L2 regularization term.
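A finite-difference sanity check of the second gradient formula (a sketch; the matrix and direction are random):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
G = A / np.linalg.norm(A, 'fro')     # claimed gradient of ||A||_F at A

eps = 1e-6
E = rng.standard_normal(A.shape)     # random perturbation direction
fd = (np.linalg.norm(A + eps * E, 'fro')
      - np.linalg.norm(A - eps * E, 'fro')) / (2 * eps)
print(fd, np.sum(G * E))             # directional derivative, two ways
```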

For AI: Spectral normalization (Miyato 2018) normalizes weight matrices by their spectral norm, not Frobenius. The gradient of the spectral norm is:

$$\frac{\partial \|A\|_2}{\partial A} = \mathbf{u}_1 \mathbf{v}_1^\top$$

the rank-1 outer product of the top left and right singular vectors. This is non-smooth when $\sigma_1$ has multiplicity $> 1$.


4. Induced (Operator) Norms

4.1 The General Definition

A matrix norm is induced (or operator) if it measures the worst-case amplification of a vector norm:

$$\|A\|_{p \to q} = \max_{\mathbf{x} \neq \mathbf{0}} \frac{\|A\mathbf{x}\|_q}{\|\mathbf{x}\|_p} = \max_{\|\mathbf{x}\|_p = 1} \|A\mathbf{x}\|_q$$

The $p \to q$ notation specifies the norms on the input ($p$) and output ($q$) spaces.

Why induced norms matter. The definition captures the most important property of a linear map: its maximum stretch factor. For stability analysis, control theory, neural network Lipschitz bounds, and generalization theory, the induced norm is the right tool.

Submultiplicativity is automatic. Any induced norm satisfies $\|AB\|_{p \to q} \leq \|A\|_{r \to q}\|B\|_{p \to r}$ because:

$$\|AB\mathbf{x}\|_q \leq \|A\|_{r\to q}\|B\mathbf{x}\|_r \leq \|A\|_{r\to q}\|B\|_{p\to r}\|\mathbf{x}\|_p$$

4.2 The Spectral Norm (p = q = 2)

The most important induced norm is the spectral norm (also called operator 2-norm):

$$\|A\|_2 = \|A\|_{2 \to 2} = \max_{\|\mathbf{x}\|_2 = 1} \|A\mathbf{x}\|_2 = \sigma_1(A)$$

Proof that $\|A\|_2 = \sigma_1$. Write $A = U\Sigma V^\top$. Then:

$$\|A\mathbf{x}\|_2^2 = \|U\Sigma V^\top \mathbf{x}\|_2^2 = \|\Sigma V^\top \mathbf{x}\|_2^2 = \|\Sigma \mathbf{y}\|_2^2 = \sum_i \sigma_i^2 y_i^2$$

where $\mathbf{y} = V^\top \mathbf{x}$ has $\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2 = 1$. This is maximized by $\mathbf{y} = \mathbf{e}_1$ (put all weight on $\sigma_1$), giving $\|A\mathbf{x}\|_2^2 = \sigma_1^2$. The maximizer is $\mathbf{x} = \mathbf{v}_1$ (the top right singular vector).

Computing the spectral norm (a runnable sketch appears below):

  • Exact: via the SVD, $\|A\|_2 = \sigma_1$
  • Approximate: power iteration $\mathbf{x}_{k+1} = A^\top A \mathbf{x}_k / \|A^\top A \mathbf{x}_k\|$, whose error decays like $(\sigma_2/\sigma_1)^{2k}$
  • For symmetric PSD matrices: $\|A\|_2 = \lambda_{\max}(A)$

For AI: The spectral norm of a weight matrix $W_l$ in a neural network is the Lipschitz constant of that layer (for activations with Lipschitz constant 1). The overall network Lipschitz constant is bounded by $\prod_l \|W_l\|_2$. Spectral normalization divides each weight matrix by its spectral norm at every step, bounding this product and stabilizing GAN training.
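A minimal power-iteration sketch for estimating $\sigma_1$ (assumes a spectral gap $\sigma_1 > \sigma_2$; the iteration count is an arbitrary choice):

```python
import numpy as np

def spectral_norm_pi(A, iters=200, seed=0):
    """Estimate sigma_1(A) by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        x = A.T @ (A @ x)            # one power-iteration step on A^T A
        x /= np.linalg.norm(x)
    return np.linalg.norm(A @ x)     # ||A v_1|| = sigma_1 at convergence

A = np.random.default_rng(3).standard_normal((50, 30))
print(spectral_norm_pi(A), np.linalg.norm(A, 2))   # nearly equal
```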

4.3 The Matrix 1-Norm and ∞-Norm

Matrix 1-norm ($p = q = 1$, input and output use $\ell^1$):

$$\|A\|_1 = \max_j \sum_i |a_{ij}| = \text{maximum absolute column sum}$$

Matrix ∞-norm ($p = q = \infty$, input and output use $\ell^\infty$):

$$\|A\|_\infty = \max_i \sum_j |a_{ij}| = \text{maximum absolute row sum}$$

Why these formulas. For the 1-norm: $\|A\mathbf{x}\|_1 = \left\|\sum_j x_j \mathbf{a}_j\right\|_1 \leq \sum_j |x_j| \,\|\mathbf{a}_j\|_1$. Since $\sum_j |x_j| = \|\mathbf{x}\|_1 = 1$, the worst case puts all weight ($x_j = 1$) on the column with maximum $\ell^1$ norm.

Examples. For $A = \begin{pmatrix} 1 & 2 \\ -3 & 4 \end{pmatrix}$:

  • Column sums: $|1| + |-3| = 4$ and $|2| + |4| = 6$, so $\|A\|_1 = 6$.
  • Row sums: $|1| + |2| = 3$ and $|-3| + |4| = 7$, so $\|A\|_\infty = 7$.

Duality. $\|A\|_\infty = \|A^\top\|_1$: these two norms are transposes of each other.
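NumPy implements both formulas directly, so the worked example above is easy to confirm:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [-3.0, 4.0]])
print(np.linalg.norm(A, 1))        # 6.0: maximum absolute column sum
print(np.linalg.norm(A, np.inf))   # 7.0: maximum absolute row sum
print(np.linalg.norm(A.T, 1))      # 7.0: duality ||A||_inf = ||A^T||_1
```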

4.4 The 2->1 and ∞->1 Norms (NP-Hard)

Not every induced norm is tractable: computing $\|A\|_{\infty \to 1}$ and $\|A\|_{2 \to 1}$ is NP-hard in general. This is a fundamental computational barrier. (By contrast, $\|A\|_{1 \to q}$ is easy for any $q$: it is the largest $\ell^q$ norm of a column of $A$.)

$$\|A\|_{2 \to 1} = \max_{\|\mathbf{x}\|_2 = 1} \|A\mathbf{x}\|_1$$

No polynomial-time algorithm exists for this unless P = NP. However, there are useful bounds:

$$\frac{1}{\sqrt{n}}\|A\|_F \leq \|A\|_{2\to 1} \leq \sqrt{m}\,\|A\|_F$$

For AI: The NP-hardness of these mixed norms connects to problems in statistics (robustness of PCA) and compressed sensing. Approximation algorithms based on semidefinite programming compute $\|A\|_{\infty \to 1}$ to within a constant factor (related to the Grothendieck constant), and similar relaxations give approximations of $\|A\|_{2\to 1}$.

4.5 Submultiplicativity and Consistency

Definition. A matrix norm $\|\cdot\|$ is:

  • Submultiplicative: $\|AB\| \leq \|A\|\,\|B\|$ (all induced norms satisfy this)
  • Consistent with a vector norm $\|\cdot\|_v$: $\|A\mathbf{x}\|_v \leq \|A\|\,\|\mathbf{x}\|_v$ (induced norms satisfy this by construction)
  • Compatible or subordinate: same as consistent

The Frobenius norm is submultiplicative but NOT induced. Every induced norm satisfies $\|I\| = 1$, but $\|I\|_F = \sqrt{n}$. The Frobenius norm is nonetheless called "compatible" because $\|A\mathbf{x}\|_2 \leq \|A\|_F\,\|\mathbf{x}\|_2$.

Non-example. The max-entry norm $\|A\|_{\max} = \max_{i,j}|a_{ij}|$ satisfies the norm axioms but is NOT submultiplicative: $\|AB\|_{\max}$ can exceed $\|A\|_{\max}\|B\|_{\max}$ (see the all-ones example in Section 2.1).


5. Schatten Norms and the Nuclear Norm

5.1 The Schatten p-Norm Family

The Schatten $p$-norm unifies the spectral, Frobenius, and nuclear norms by applying the vector $\ell^p$ norm to the vector of singular values:

$$\|A\|_{S_p} = \|\boldsymbol{\sigma}(A)\|_p = \left(\sum_{i=1}^r \sigma_i^p\right)^{1/p}$$

where $r = \operatorname{rank}(A)$ and $\sigma_1 \geq \cdots \geq \sigma_r > 0$.

The three canonical Schatten norms:

$p$      | Name               | Formula                                                           | Interpretation
$1$      | Nuclear (trace)    | $\sum_i \sigma_i = \operatorname{tr}(\sqrt{A^\top A})$            | Sum of singular values
$2$      | Frobenius          | $\sqrt{\sum_i \sigma_i^2} = \sqrt{\operatorname{tr}(A^\top A)}$   | $\ell^2$ energy of singular values
$\infty$ | Spectral (operator)| $\max_i \sigma_i = \sigma_1$                                      | Maximum singular value

Ordering. For $p \leq q$: $\|A\|_{S_p} \geq \|A\|_{S_q}$. In particular:

$$\|A\|_2 = \|A\|_{S_\infty} \leq \|A\|_F = \|A\|_{S_2} \leq \|A\|_* = \|A\|_{S_1}$$

with equality throughout exactly for rank-1 matrices; the gaps are largest for full-rank matrices with equal singular values.

Duality. Schatten norms are dual in pairs: $(S_p)^* = S_q$ where $1/p + 1/q = 1$. In particular:

$$\langle A, B\rangle_F \leq \|A\|_{S_p}\,\|B\|_{S_q} \qquad \left(\frac{1}{p}+\frac{1}{q}=1\right)$$

This is Hölder's inequality for singular values.

5.2 The Nuclear Norm

The nuclear norm (also called the trace norm, or the Ky Fan $r$-norm when applied to all $r$ singular values) is:

$$\|A\|_* = \sum_{i=1}^r \sigma_i = \operatorname{tr}\left(\sqrt{A^\top A}\right) = \operatorname{tr}(|A|)$$

where $|A| = \sqrt{A^\top A}$ is the matrix square root (symmetric PSD). For square PSD matrices, $\|A\|_* = \operatorname{tr}(A)$ - hence "trace norm."

The nuclear norm as the convex envelope of rank. This is the key motivation: on the spectral-norm ball $\{A : \|A\|_2 \leq 1\}$, the nuclear norm is the convex envelope of the rank function - the tightest convex lower bound on $\operatorname{rank}(A)$. This makes it the natural convex relaxation of rank minimization: the NP-hard rank objective is replaced by a convex nuclear norm objective.

Dual norm. The nuclear norm is dual to the spectral norm:

$$\|A\|_* = \max_{\|B\|_2 \leq 1} \langle A, B\rangle_F = \max_{\|B\|_2 \leq 1} \operatorname{tr}(A^\top B)$$

This duality is exploited in semidefinite programming formulations of nuclear norm minimization.

5.3 Computing the Nuclear Norm and Proximal Operator

Computation. $\|A\|_* = \sum_i \sigma_i$, computed via the SVD. In NumPy: np.linalg.norm(A, 'nuc').

Subdifferential. The nuclear norm is convex but non-smooth (non-differentiable at rank-deficient matrices). Its subdifferential at $A = U\Sigma V^\top$ (thin SVD) is:

$$\partial\|A\|_* = \{UV^\top + W : \|W\|_2 \leq 1,\ U^\top W = 0,\ WV = 0\}$$

When $A$ has full rank, $UV^\top$ is the unique subgradient.

Proximal operator. The proximal operator of $\lambda\|\cdot\|_*$ (needed for nuclear norm regularization) is singular value soft-thresholding:

$$\operatorname{prox}_{\lambda\|\cdot\|_*}(A) = U \operatorname{diag}\big(\max(\sigma_i - \lambda, 0)\big) V^\top$$

This is the key computational step in matrix completion, robust PCA, and spectral recovery algorithms.

For AI: Matrix completion (the Netflix Prize formulation) seeks the minimum-rank matrix consistent with observed entries. The convex relaxation minimizes the nuclear norm, solvable with the alternating direction method of multipliers (ADMM) using the proximal operator above. LoRA implicitly regularizes toward low nuclear norm through the factored parameterization $W = W_0 + BA$.
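A minimal implementation of singular value soft-thresholding (a sketch; lam plays the role of $\lambda$):

```python
import numpy as np

def svt(A, lam):
    """Proximal operator of lam * ||.||_*: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

A = np.random.default_rng(4).standard_normal((6, 4))
A_low = svt(A, lam=1.0)
# Singular values below lam are zeroed, so the rank typically drops:
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A_low))
```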

5.4 Ky Fan Norms

The Ky Fan $k$-norm is the sum of the top $k$ singular values:

$$\|A\|_{(k)} = \sum_{i=1}^k \sigma_i = \sigma_1 + \sigma_2 + \cdots + \sigma_k$$

Special cases:

  • $k = 1$: $\|A\|_{(1)} = \sigma_1 = \|A\|_2$ (spectral norm)
  • $k = r$: $\|A\|_{(r)} = \sum_i \sigma_i = \|A\|_*$ (nuclear norm)

For any fixed matrix, the Ky Fan norms form a chain: $\|A\|_{(1)} \leq \|A\|_{(2)} \leq \cdots \leq \|A\|_{(r)}$. Between two different matrices, however, the orderings need not agree: $A$ can have a larger $\|\cdot\|_{(1)}$ than $B$ while having a smaller $\|\cdot\|_{(k)}$ for larger $k$.

Variational formula. The Ky Fan $k$-norm has a beautiful characterization:

$$\|A\|_{(k)} = \max_{U_k, V_k \text{ orthonormal}} \operatorname{tr}(U_k^\top A V_k)$$

where the max is over $m \times k$ and $n \times k$ matrices $U_k, V_k$ with orthonormal columns. The optimal choice is $U_k = [\mathbf{u}_1|\cdots|\mathbf{u}_k]$ and $V_k = [\mathbf{v}_1|\cdots|\mathbf{v}_k]$.


6. Unitarily Invariant Norms

6.1 Definition and Characterization

A matrix norm $\|\cdot\|$ is unitarily invariant if:

$$\|UAV\| = \|A\| \quad \text{for all unitary } U, V$$

This means the norm depends only on the singular values of $A$, not on the specific singular vectors.

Von Neumann's characterization. A norm is unitarily invariant if and only if it is a symmetric gauge function applied to the vector of singular values:

$$\|A\| = g(\sigma_1(A), \sigma_2(A), \ldots, \sigma_r(A))$$

where $g: \mathbb{R}^r \to \mathbb{R}$ is a symmetric gauge function - a norm on $\mathbb{R}^r$ that is:

  1. A norm (positive definite, homogeneous, triangle inequality)
  2. Symmetric: $g(\sigma_{\pi(1)}, \ldots, \sigma_{\pi(r)}) = g(\sigma_1, \ldots, \sigma_r)$ for any permutation $\pi$
  3. Monotone: $g(\sigma_1, \ldots, \sigma_r) \geq g(\sigma_1', \ldots, \sigma_r')$ whenever $|\sigma_i| \geq |\sigma_i'|$ for all $i$

Examples of unitarily invariant norms:

  • Frobenius: $g = \ell^2$ on singular values
  • Spectral: $g = \ell^\infty$ on singular values
  • Nuclear: $g = \ell^1$ on singular values
  • Ky Fan $k$: $g(\boldsymbol{\sigma}) = \sigma_1 + \cdots + \sigma_k$
  • Schatten $p$: $g = \ell^p$ on singular values

Non-unitarily-invariant norms:

  • Matrix 1-norm $\|A\|_1$ (depends on column structure)
  • Matrix ∞-norm $\|A\|_\infty$ (depends on row structure)
  • Entry-max norm $\|A\|_{\max}$

6.2 Fan Dominance and Majorization

Majorization. A vector $\mathbf{x} \in \mathbb{R}^n$ majorizes $\mathbf{y} \in \mathbb{R}^n$ (written $\mathbf{x} \succ \mathbf{y}$) if:

$$\sum_{i=1}^k x_{[i]} \geq \sum_{i=1}^k y_{[i]} \ \text{ for all } k = 1, \ldots, n-1, \qquad \sum_{i=1}^n x_{[i]} = \sum_{i=1}^n y_{[i]}$$

where $x_{[1]} \geq \cdots \geq x_{[n]}$ are the decreasing order statistics. If only the partial-sum inequalities are required (dropping the equality of totals), $\mathbf{x}$ is said to weakly majorize $\mathbf{y}$.

Fan's dominance theorem. $\|A\| \leq \|B\|$ for ALL unitarily invariant norms if and only if $\sigma(B)$ weakly majorizes $\sigma(A)$, i.e. every Ky Fan norm is ordered:

$$\sum_{i=1}^k \sigma_i(A) \leq \sum_{i=1}^k \sigma_i(B) \quad \text{for all } k$$

This connects matrix norm comparison to singular value majorization.

6.3 Weyl's Inequality (Preview of Perturbation Theory)

For unitarily invariant norms, Weyl's inequality gives bounds on singular value perturbation:

Theorem (Weyl). If $A, E \in \mathbb{R}^{m \times n}$, then for all $i$:

$$|\sigma_i(A + E) - \sigma_i(A)| \leq \sigma_1(E) = \|E\|_2$$

Each singular value can change by at most $\|E\|_2$ under a perturbation $E$. This will be developed fully in Section 8.

6.4 The Role in Optimization

Unitarily invariant norms appear naturally in matrix optimization because they respect the singular value geometry of the problem. Mirsky's theorem states that the best rank-$k$ approximation in ANY unitarily invariant norm is the truncated SVD - Eckart-Young generalizes beyond the Frobenius and spectral norms.

For AI: The invariance to unitary transformations aligns with key AI structures. Transformers apply learned projections $Q = XW_Q$, $K = XW_K$; if $W_Q, W_K$ undergo orthogonal reparametrization, unitarily invariant norms capture the effective capacity independently of the parameterization choice. This is exploited in analyses of transformer expressivity.


7. Condition Number

7.1 Definition and Intuition

The condition number of a nonsingular matrix $A \in \mathbb{R}^{n \times n}$ (with respect to the $p$-norm) is:

$$\kappa_p(A) = \|A\|_p \,\|A^{-1}\|_p$$

For $p = 2$ (using spectral norms):

$$\kappa_2(A) = \|A\|_2 \,\|A^{-1}\|_2 = \sigma_1(A) \cdot \frac{1}{\sigma_n(A)} = \frac{\sigma_1(A)}{\sigma_n(A)}$$

the ratio of the largest to smallest singular value.

Geometric intuition. The condition number measures the eccentricity of the ellipsoid $\{A\mathbf{x} : \|\mathbf{x}\|_2 \leq 1\}$. A sphere (all $\sigma_i$ equal) gives $\kappa = 1$. A very flat pancake ($\sigma_n \approx 0$) gives $\kappa \to \infty$.

CONDITION NUMBER GEOMETRY
========================================================================

  kappa(A) = 1 (identity)     kappa(A) = 10 (moderate)    kappa(A) -> inf (near-singular)

      ***                        *******                   ---------------------
     *   *                      *       *                  ---------------------
      ***                        *******                   ---------------------

  sigma_1/sigma_2 = 1           sigma_1/sigma_2 = 10       sigma_1/sigma_2 -> inf
  (sphere)                      (ellipse)                  (flat disk -> singular)

========================================================================

7.2 Condition Number and Numerical Stability

The condition number quantifies how sensitive the solution of $A\mathbf{x} = \mathbf{b}$ is to perturbations.

Forward error bound. If $\hat{\mathbf{x}}$ solves $(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}$, then (to first order):

$$\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A)\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|}\right)$$

The relative error in the solution is at most $\kappa(A)$ times the relative error in the data. If the input has $t$ digits of precision, the output has roughly $t - \log_{10}\kappa(A)$ reliable digits.

Example. For $\kappa(A) = 10^6$ and double-precision arithmetic ($\approx 16$ correct digits), the solution has $\approx 10$ reliable digits. For $\kappa(A) = 10^{12}$, only 4 reliable digits remain.
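To see this digit loss concretely, here is a sketch using the notoriously ill-conditioned Hilbert matrix (scipy.linalg.hilbert; the size n = 12 is arbitrary):

```python
import numpy as np
from scipy.linalg import hilbert

n = 12
A = hilbert(n)                       # kappa_2(A) is around 1e16
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)
print(np.linalg.cond(A))             # ~1e16: essentially no digits to spare
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # large error
```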

For AI: The Hessian of the loss, $H = \nabla^2 \mathcal{L}$, has condition number $\kappa(H) = \lambda_{\max}/\lambda_{\min}$. High $\kappa(H)$ means:

  1. Gradient descent requires tiny step sizes (limited by $1/\lambda_{\max}$) but needs many steps (to make progress along $\lambda_{\min}$ directions)
  2. The loss landscape has steep, narrow valleys - the classic "ill-conditioned optimization" problem
  3. Adam and other adaptive optimizers implicitly precondition to reduce the effective $\kappa$

7.3 Properties and Bounds

Key properties:

  1. $\kappa(A) \geq 1$ for all nonsingular $A$ (since $1 = \|I\| = \|AA^{-1}\| \leq \|A\|\,\|A^{-1}\|$)
  2. $\kappa_2(A) = 1$ iff $A$ is a scalar multiple of a unitary matrix
  3. $\kappa(AB) \leq \kappa(A)\,\kappa(B)$ - condition numbers multiply
  4. $\kappa(A) = \kappa(A^{-1})$
  5. $\kappa(\alpha A) = \kappa(A)$ for any scalar $\alpha \neq 0$

Norm dependence. Different norms give different condition numbers, but they agree up to polynomial factors in $n$.

For singular matrices: we define $\kappa(A) = \infty$ (singular matrices are maximally ill-conditioned).

Distance to singularity. The condition number measures how far $A$ is from the nearest singular matrix (using the spectral norm):

$$\min\{\|E\|_2 : A + E \text{ is singular}\} = \sigma_n(A) = \frac{\|A\|_2}{\kappa_2(A)}$$

A matrix with $\kappa_2(A) = 10^6$ has relative distance only $10^{-6}$ to singularity.

7.4 Tikhonov Regularization

When $\kappa(A)$ is too large for reliable computation, Tikhonov regularization artificially improves conditioning:

$$(A^\top A + \lambda I)\hat{\mathbf{x}} = A^\top \mathbf{b}$$

Condition number of the regularized system:

$$\kappa_2(A^\top A + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}$$

Since $\sigma_n^2 + \lambda \geq \lambda > 0$, the regularized system has a finite condition number even when $A$ is singular ($\sigma_n = 0$). As $\lambda$ increases, $\kappa$ decreases toward 1.

Solution bias. The Tikhonov solution has smaller norm than the minimum-norm least-squares solution: $\|\hat{\mathbf{x}}_\lambda\| < \|\hat{\mathbf{x}}_{LS}\|$. This is the bias-variance tradeoff: regularization introduces bias to reduce variance (sensitivity to noise).

For AI: L2 regularization (weight decay) in neural networks is essentially Tikhonov regularization applied to the loss landscape. Adding $\lambda\|W\|_F^2$ to the loss shifts the Hessian eigenvalues by $2\lambda$, bounding $\kappa(H + 2\lambda I) \leq (\lambda_{\max} + 2\lambda)/(2\lambda)$ - improving the condition number of the optimization problem.
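A sketch of how the regularized system's conditioning improves, using the formula above (the decaying spectrum here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 10)) @ np.diag(10.0 ** -np.arange(10))
s = np.linalg.svd(A, compute_uv=False)   # singular values of A

for lam in [0.0, 1e-6, 1e-2]:
    kappa = (s[0]**2 + lam) / (s[-1]**2 + lam)
    print(f"lambda = {lam:g}: kappa(A^T A + lam I) = {kappa:.3e}")
```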

7.5 Estimating Condition Numbers

Computing $\kappa(A)$ exactly requires a full SVD ($O(mn^2)$ for $m \geq n$). For large matrices, approximations are needed.

Power method / Lanczos. Estimate $\sigma_1$ by power iteration on $A^\top A$. Estimate $\sigma_n$ by inverse power iteration (more expensive, since it requires solving linear systems).

LAPACK routines. scipy.linalg.svd plus the ratio of extreme singular values, or np.linalg.cond(A, p) for various $p$-norms.

Randomized estimation. For nonsingular $A \in \mathbb{R}^{n \times n}$: draw $\mathbf{g} \sim \mathcal{N}(0, I)$ and compute $\|A\mathbf{g}\|\,\|A^{-1}\mathbf{g}\| / \|\mathbf{g}\|^2$. Since $\|A\mathbf{g}\| \leq \sigma_1\|\mathbf{g}\|$ and $\|A^{-1}\mathbf{g}\| \leq \|\mathbf{g}\|/\sigma_n$, this is a lower bound on $\kappa_2(A)$ that is informative with high probability, at the cost of one matrix-vector product and one linear solve.
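A sketch of this randomized lower bound (assumes $A$ is invertible; one matvec plus one solve):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))

g = rng.standard_normal(n)
est = (np.linalg.norm(A @ g) * np.linalg.norm(np.linalg.solve(A, g))
       / np.linalg.norm(g) ** 2)
print(est, "<=", np.linalg.cond(A, 2))   # a lower bound on kappa_2(A)
```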


8. Perturbation Theory

8.1 Motivation and Setup

The fundamental question of numerical analysis: If we perturb the data slightly, how much do the outputs change?

For matrix computations, we study:

  • How much do eigenvalues change when $A \to A + E$?
  • How much do singular values change when $A \to A + E$?
  • How much does the solution $\mathbf{x} = A^{-1}\mathbf{b}$ change when $A$ or $\mathbf{b}$ are perturbed?

All answers involve matrix norms.

Setup. Let $A \in \mathbb{R}^{m \times n}$ with SVD $A = U\Sigma V^\top$ and singular values $\sigma_1 \geq \cdots \geq \sigma_p$ ($p = \min(m,n)$). Let $\tilde{A} = A + E$ be the perturbed matrix, with singular values $\tilde{\sigma}_1 \geq \cdots \geq \tilde{\sigma}_p$.

8.2 Weyl's Inequality for Singular Values

Theorem (Weyl's inequality). For any $i$:

$$|\tilde{\sigma}_i - \sigma_i| \leq \|E\|_2 = \sigma_1(E)$$

Proof sketch. By the Courant-Fischer variational formula, $\sigma_i(A) = \min_{\dim U = i-1} \max_{\mathbf{x} \perp U,\, \|\mathbf{x}\| = 1} \|A\mathbf{x}\|$. Adding $E$ changes each candidate value by at most $\|E\mathbf{x}\| \leq \|E\|_2$.

Interpretation: each singular value is a Lipschitz function of the matrix with Lipschitz constant 1 (with respect to the spectral norm):

$$\max_i |\sigma_i(A+E) - \sigma_i(A)| \leq \|E\|_2$$

This makes singular values numerically stable - small matrix perturbations cause small changes.

Stronger form (Mirsky's theorem):

$$\|\boldsymbol{\sigma}(A+E) - \boldsymbol{\sigma}(A)\|_2 \leq \|E\|_F$$

In the Frobenius norm, the whole vector of singular values is also Lipschitz-1. A quick numerical check of both bounds follows.
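Here is that check as a sketch (a random matrix and a small random perturbation):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 5))
E = 0.01 * rng.standard_normal((6, 5))

s  = np.linalg.svd(A,     compute_uv=False)
st = np.linalg.svd(A + E, compute_uv=False)

print(np.max(np.abs(st - s)), "<=", np.linalg.norm(E, 2))       # Weyl
print(np.linalg.norm(st - s), "<=", np.linalg.norm(E, 'fro'))   # Mirsky
```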

8.3 Eigenvalue Perturbation: Bauer-Fike

For symmetric matrices $A$ with eigenvalues $\lambda_i$ (and symmetric perturbations $E$), the analog of Weyl is:

$$|\tilde{\lambda}_i - \lambda_i| \leq \|E\|_2$$

For non-symmetric matrices, the situation is more complex. The Bauer-Fike theorem bounds eigenvalue perturbation for diagonalizable $A = PDP^{-1}$: every eigenvalue $\tilde{\lambda}$ of $A + E$ satisfies

$$\min_j |\tilde{\lambda} - \lambda_j| \leq \kappa_2(P) \,\|E\|_2$$

where $\kappa_2(P) = \|P\|_2\|P^{-1}\|_2$ is the condition number of the eigenvector matrix.

Key insight: Eigenvalues of non-normal matrices can be extremely sensitive to perturbation. The condition number $\kappa_2(P)$ - which can be astronomically large for non-normal matrices - amplifies the perturbation. This is why direct eigenvalue computation for non-normal matrices is numerically dangerous.

For AI: The sensitivity of eigenvalues of non-normal operators appears in RNN/LSTM gradient flow analysis. The eigenvalues of the recurrent weight matrix $W_{hh}$ determine long-term memory, but if $W_{hh}$ is non-normal, small weight perturbations (from noise, finite precision, or gradient updates) can dramatically change the eigenvalues and hence the gradient behavior.

8.4 Forward and Backward Error Analysis

Forward error: $\|\hat{\mathbf{x}} - \mathbf{x}\|$ - how wrong is the computed answer? Backward error: $\min\{\|\delta A\| : (A + \delta A)\hat{\mathbf{x}} = \mathbf{b}\}$ - how much must the input be perturbed for $\hat{\mathbf{x}}$ to be exact?

Fundamental relationship:

$$\text{Forward error} \leq \kappa(A) \times \text{Backward error}$$

An algorithm is backward stable if it produces output $\hat{\mathbf{x}}$ with small backward error. For backward-stable algorithms on problems with a small condition number, the forward error is also small.

Example. Gaussian elimination with partial pivoting is backward stable in practice: the computed solution $\hat{\mathbf{x}}$ satisfies $(A + E)\hat{\mathbf{x}} = \mathbf{b}$ where $\|E\|_F / \|A\|_F \leq c\,n\,\epsilon_{\text{mach}}$ (small backward error). The forward error is then bounded by $\kappa(A) \cdot c\,n\,\epsilon_{\text{mach}}$.


9. Applications in Machine Learning

9.1 Weight Regularization: Frobenius vs Nuclear

Neural network training minimizes a loss with a regularizer:

$$\mathcal{L}(W) = \text{task loss} + \lambda R(W)$$

Frobenius (L2 / weight decay): $R(W) = \|W\|_F^2 = \sum_i \sigma_i^2$

  • Gradient: $\nabla R = 2W$
  • Effect: shrinks all singular values uniformly
  • Induces: no sparsity in singular values; does not promote low rank
  • Used in: essentially all deep learning (Adam, SGD with weight decay)

Nuclear norm: $R(W) = \|W\|_* = \sum_i \sigma_i$

  • Subgradient: $UV^\top$ (top singular vectors)
  • Effect: promotes sparsity in singular values -> low-rank solutions
  • Proximal step: singular value soft-thresholding
  • Used in: matrix completion, robust PCA, some transformer pruning methods

Spectral norm: $R(W) = \|W\|_2 = \sigma_1$

  • Subgradient: $\mathbf{u}_1\mathbf{v}_1^\top$ (rank-1 outer product)
  • Effect: limits the largest singular value; controls the Lipschitz constant
  • Used in: spectral normalization for GAN training (Miyato 2018)

LoRA and the nuclear norm. The LoRA reparameterization $W = W_0 + BA$ (with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$) has an interesting norm structure:

$$\|BA\|_* \leq \|B\|_F\,\|A\|_F$$

The product parameterization implicitly regularizes the nuclear norm of the weight increment, even without explicit regularization.

9.2 Spectral Normalization for GANs

Generative Adversarial Networks (GANs) require the discriminator $D: \mathbb{R}^d \to \mathbb{R}$ to be Lipschitz:

$$|D(\mathbf{x}) - D(\mathbf{y})| \leq L\,\|\mathbf{x} - \mathbf{y}\|_2 \quad \forall \mathbf{x}, \mathbf{y}$$

For a neural network $D = D_L \circ \cdots \circ D_1$ where $D_l(\mathbf{x}) = \phi(W_l\mathbf{x} + \mathbf{b}_l)$, and the activation $\phi$ has Lipschitz constant 1 (ReLU, tanh, etc.):

$$\text{Lip}(D) \leq \prod_{l=1}^L \|W_l\|_2$$

Spectral normalization (Miyato et al., 2018) replaces each $W_l$ with $\hat{W}_l = W_l / \|W_l\|_2$, ensuring each layer has spectral norm exactly 1, hence:

$$\text{Lip}(D_{\text{SN}}) \leq 1$$

Implementation. The spectral norm is estimated via power iteration - a single step per training iteration suffices in practice. A runnable NumPy version of one update (the vector u persists across training steps):

```python
import numpy as np

def spectral_norm_step(W, u):
    """One power-iteration step: estimate sigma_1 and normalize W."""
    v = W.T @ u; v /= np.linalg.norm(v)
    u = W @ v;   u /= np.linalg.norm(u)
    sigma = u @ W @ v
    return W / sigma, u
```

This adds minimal overhead and dramatically stabilizes GAN training.

9.3 Gradient Clipping and Norm Bounds

Gradient clipping prevents gradient explosion by rescaling gradients when their norm exceeds a threshold:

$$\hat{g} = \begin{cases} g & \text{if } \|g\| \leq \tau \\ \tau \, g / \|g\| & \text{if } \|g\| > \tau \end{cases}$$

Choice of norm matters (a minimal sketch follows below):

  • Global L2 clipping: $\|g\| = \|\nabla_\theta \mathcal{L}\|_2$ (all parameters concatenated). Standard in transformers (Vaswani 2017): clip at $\tau = 1.0$.
  • Per-tensor clipping: clip each weight matrix's gradient separately using its spectral norm. More expensive but more principled.
  • Frobenius clipping: clip by $\|\nabla_W \mathcal{L}\|_F$ per layer. Common in modern optimizers.

For AI: The choice of gradient clipping norm affects which singular value components of the gradient are preserved. Clipping by the spectral norm $\|G\|_2$ ensures the maximum singular value doesn't exceed $\tau$, while Frobenius clipping preserves the relative structure of all singular values, scaling them uniformly.
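A minimal global-norm clipping sketch over a list of per-tensor gradients (mirrors what common deep learning frameworks do; all names are illustrative):

```python
import numpy as np

def clip_global_norm(grads, tau=1.0):
    """Rescale gradients so their concatenated L2 norm is at most tau."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, tau / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.random.default_rng(9).standard_normal((10, 10)) for _ in range(3)]
clipped, norm_before = clip_global_norm(grads, tau=1.0)
print(norm_before, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # > tau, = tau
```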

9.4 Low-Rank Approximation and Model Compression

The Eckart-Young theorem justifies truncated SVD for model compression:

Weight matrix compression. For a weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, store the rank-$k$ approximation $W_k = U_k \Sigma_k V_k^\top$ using $k(d_1 + d_2)$ parameters instead of $d_1 d_2$. The approximation error:

$$\|W - W_k\|_F = \sqrt{\sum_{i>k} \sigma_i^2}, \qquad \|W - W_k\|_2 = \sigma_{k+1}$$

Compression ratio. The compression ratio is $r = d_1 d_2 / (k(d_1 + d_2))$; storage is actually saved whenever $k < d_1 d_2 / (d_1 + d_2)$.

LoRA revisited. Instead of post-hoc compression, LoRA trains the low-rank increment directly:

$$W = W_{\text{pretrained}} + \underbrace{B}_{\mathbb{R}^{d \times r}} \underbrace{A}_{\mathbb{R}^{r \times d}}$$

The parameter savings are dramatic: fine-tuning touches only $2rd$ parameters per layer (with rank $r \ll d$) instead of $d^2$.

9.5 Lipschitz Bounds and Generalization

PAC-Bayes generalization bounds connect norms of weight matrices to the generalization gap. For a feedforward network with $L$ layers, such bounds take the schematic form:

$$\text{Generalization gap} \lesssim \frac{1}{\sqrt{n}} \cdot \kappa_{\text{network}} \cdot \prod_{l=1}^L \|W_l\|_F$$

where the precise bound depends on the norm choice. Bartlett et al. (2017) proved a spectrally-normalized margin bound of the form:

$$\text{Generalization gap} \lesssim \frac{B^2}{\gamma^2} \sqrt{\frac{\left(\sum_l \|W_l\|_F^2 / \|W_l\|_2^2\right) \cdot \prod_l \|W_l\|_2^2}{n}}$$

where $B$ bounds the input norm, $\gamma$ is the margin, and $n$ is the training set size.

Key insight: The ratio $\|W\|_F^2 / \|W\|_2^2 = \sum_i (\sigma_i/\sigma_1)^2 \leq r$ - the stable rank - measures effective dimensionality. Matrices with low stable rank generalize better.
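Stable rank is cheap to compute from the singular values; a sketch with a fast-decaying synthetic spectrum:

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2: a smooth surrogate for rank."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s ** 2) / s[0] ** 2

rng = np.random.default_rng(8)
W = rng.standard_normal((100, 50)) @ np.diag(0.5 ** np.arange(50))
print(stable_rank(W))   # much smaller than min(m, n) = 50
```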

9.6 Attention Mechanisms and Low-Rank Structure

The attention mechanism in transformers computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The matrix $S = QK^\top / \sqrt{d_k} \in \mathbb{R}^{T \times T}$ (for sequence length $T$) is the attention logit matrix.

Nuclear norm of attention. Empirically, trained attention matrices have low nuclear norm relative to their Frobenius norm - they exhibit low-rank structure. Dong et al. (2021) ("Attention is not all you need") showed that purely attention-based models without MLP layers collapse to rank 1, connecting to the spectral norm of the attention matrix.

Multi-head Latent Attention (MLA). DeepSeek's MLA architecture (2024) explicitly parameterizes the key-value cache in low-rank form to reduce memory:

$$K = C_{KV} W^K, \qquad V = C_{KV} W^V$$

where $C_{KV} \in \mathbb{R}^{T \times d_c}$ with $d_c \ll d_{\text{model}}$. This is SVD-style compression of the KV cache, motivated by the low-rank structure of attention matrices.

9.7 Implicit Regularization in Deep Learning

Deep learning optimizers (SGD, Adam) exhibit implicit regularization - they converge to solutions with small norms even without explicit regularization.

Observation (Gunasekar et al., 2017). Gradient flow on matrix factorization $\min_{U,V} \|UV^\top - A\|_F^2$ converges (under suitable conditions, e.g. small initialization) to the minimum nuclear norm solution: the dynamics $\dot{U} = -(UV^\top - A)V$, $\dot{V} = -(UV^\top - A)^\top U$ implicitly bias $\|UV^\top\|_*$ downward.

Spectral norm implicit regularization. For overparameterized linear networks (depth $\geq 2$), gradient descent with small step size finds solutions with small spectral norm of the end-to-end product $W_L \cdots W_1$.

For AI: Modern foundation models benefit from understanding these implicit biases:

  • LoRA fine-tuning implicitly penalizes the nuclear norm of the weight increment
  • Adam with decoupled weight decay implicitly regularizes toward solutions with small $\|W\|_F$
  • Gradient clipping implicitly regularizes the spectral norm of gradient steps
