Matrix Norms and Condition Numbers: Appendix J through Summary

Appendix J: Quick Reference - Key Theorems

J.1 The Four Fundamental Theorems of Matrix Norms

Theorem 1 (Equivalence of Matrix Norms). On the finite-dimensional space \mathbb{R}^{m\times n}, all matrix norms are equivalent: for any two norms \|\cdot\|_\alpha and \|\cdot\|_\beta, there exist constants c_1, c_2 > 0 such that:

c_1\|A\|_\alpha \leq \|A\|_\beta \leq c_2\|A\|_\alpha \quad \text{for all } A \in \mathbb{R}^{m\times n}

Implication: Convergence (or divergence) in one norm implies convergence (or divergence) in all norms. Norms differ quantitatively, not qualitatively, in finite dimensions.

Theorem 2 (Eckart-Young). The best rank-k approximation to A in BOTH the Frobenius and spectral norms is the truncated SVD A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. The approximation errors are \sigma_{k+1} (spectral) and \sqrt{\sum_{i>k}\sigma_i^2} (Frobenius).

Implication: Low-rank approximation is solved optimally by SVD. This underpins PCA, LoRA, MLA, and all SVD-based compression methods.
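The theorem is easy to check numerically. A minimal NumPy sketch (the matrix size and rank k = 3 are illustrative choices): the truncated SVD achieves exactly the errors the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

# Truncated SVD: keep the top-k singular triples
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral error is sigma_{k+1}; Frobenius error is the l2 norm of the tail
spec_err = np.linalg.norm(A - A_k, 2)
frob_err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(spec_err, s[k])
assert np.isclose(frob_err, np.sqrt(np.sum(s[k:] ** 2)))
```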

Theorem 3 (Weyl's Inequality). Singular values are Lipschitz-1 functions of the matrix: |\sigma_i(A+E) - \sigma_i(A)| \leq \sigma_1(E) = \|E\|_2 for all i.

Implication: Singular values are numerically stable: small perturbations are guaranteed to produce only small changes. Eigenvalues of non-symmetric matrices do NOT have this guarantee.
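The bound can be verified directly (a NumPy sketch with an arbitrary small perturbation; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 7))
E = 1e-3 * rng.standard_normal((10, 7))  # small perturbation

s_A = np.linalg.svd(A, compute_uv=False)
s_AE = np.linalg.svd(A + E, compute_uv=False)

# Weyl: every singular value moves by at most ||E||_2
assert np.all(np.abs(s_AE - s_A) <= np.linalg.norm(E, 2) + 1e-12)
```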

Theorem 4 (Von Neumann Trace Inequality). |\operatorname{tr}(A^\top B)| \leq \sum_i \sigma_i(A)\sigma_i(B), with equality when A, B share singular vector bases.

Implication: The trace inner product \langle A,B\rangle_F = \operatorname{tr}(A^\top B) is bounded by the inner product of the singular value vectors. This is the fundamental inequality behind nuclear-spectral duality and all Hölder-type bounds for matrix norms.
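Both the inequality and its equality case can be checked numerically (a NumPy sketch; the matrix sizes are arbitrary, and B2 is constructed to share A's singular vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
B = rng.standard_normal((6, 5))

# General case: strict inequality (almost surely) for random matrices
lhs = abs(np.trace(A.T @ B))
rhs = np.sum(np.linalg.svd(A, compute_uv=False) * np.linalg.svd(B, compute_uv=False))
assert lhs <= rhs + 1e-12

# Equality case: B2 shares A's singular vector bases
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B2 = U @ np.diag(s + 1.0) @ Vt
lhs2 = abs(np.trace(A.T @ B2))
rhs2 = np.sum(s * np.linalg.svd(B2, compute_uv=False))
assert np.isclose(lhs2, rhs2)
```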

J.2 Key Formulas at a Glance

Norm relations for A \in \mathbb{R}^{m\times n} of rank r:

\|A\|_2 \leq \|A\|_F \leq \|A\|_* \leq \sqrt{r}\|A\|_F \leq r\|A\|_2
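These relations can be spot-checked numerically. The sketch below verifies the standard chain \|A\|_2 \leq \|A\|_F \leq \|A\|_* \leq \sqrt{r}\|A\|_F \leq r\|A\|_2 for a random matrix (which has full rank almost surely):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 5))
s = np.linalg.svd(A, compute_uv=False)
r = np.sum(s > 1e-12)  # numerical rank (= 5 almost surely here)

spec = s[0]            # spectral norm: largest singular value
frob = np.linalg.norm(s)   # Frobenius norm: l2 norm of singular values
nuc = np.sum(s)        # nuclear norm: sum of singular values

assert spec <= frob <= nuc <= np.sqrt(r) * frob <= r * spec
```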

Condition number: \kappa_2(A) = \sigma_1/\sigma_n (for square nonsingular A); each unit of \log_{10}\kappa costs roughly one decimal digit of precision in double arithmetic.
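The digit-loss rule of thumb shows up on any classically ill-conditioned example. A NumPy sketch using the Hilbert matrix (the exact error observed varies by platform, but only a few of double precision's ~16 digits survive):

```python
import numpy as np

# Hilbert matrices are a classic family of ill-conditioned matrices
n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
kappa = np.linalg.cond(H, 2)  # sigma_1 / sigma_n

# Solve H x = b for a known solution and measure the damage
x_true = np.ones(n)
b = H @ x_true
x_hat = np.linalg.solve(H, b)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

# log10(kappa) is ~13 here, so expect roughly 13 of ~16 digits lost
print(f"log10(kappa) = {np.log10(kappa):.1f}, relative error = {rel_err:.1e}")
```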

Proximal operators:

  • \operatorname{prox}_{\tau\|\cdot\|_*}(A) = U\operatorname{diag}(\max(\sigma_i-\tau,0))V^\top (nuclear norm -> SVT)
  • \operatorname{prox}_{\tau\|\cdot\|_F^2}(A) = A/(1+2\tau) (Frobenius squared -> scaling)
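Both proximal operators fit in a few lines of NumPy. A sketch (svt and prox_frob_sq are illustrative names, not a library API):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_frob_sq(A, tau):
    """Prox of tau * ||.||_F^2: uniform shrinkage toward zero."""
    return A / (1.0 + 2.0 * tau)

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))

# SVT shrinks every singular value by tau and clips at zero (reducing rank)
s_before = np.linalg.svd(A, compute_uv=False)
s_after = np.linalg.svd(svt(A, 0.5), compute_uv=False)
assert np.allclose(s_after, np.maximum(s_before - 0.5, 0.0))

# Frobenius-squared prox is just a rescaling: A / (1 + 2*tau)
assert np.allclose(prox_frob_sq(A, 0.5), A / 2.0)
```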

Dual norm pairs: spectral \leftrightarrow nuclear; Schatten-p \leftrightarrow Schatten-q with 1/p + 1/q = 1; Frobenius is self-dual. (The matrix 1- and \infty-norms are related by transposition, \|A\|_1 = \|A^\top\|_\infty, rather than by duality.)

Stable rank: \rho(A) = \|A\|_F^2/\|A\|_2^2 \in [1, r] - a smooth proxy for rank that controls generalization bounds.
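A minimal sketch of the distinction between stable rank and exact rank (stable_rank is an illustrative helper name): a nearly rank-1 matrix has exact rank 2 but stable rank very close to 1.

```python
import numpy as np

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2 - a smooth, perturbation-robust rank proxy."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** 2) / s[0] ** 2

# Rank-1 matrix plus a tiny perturbation: exact rank jumps to 2,
# but stable rank barely moves from 1
A = np.outer([1.0, 2.0], [3.0, 4.0]) + 1e-3 * np.eye(2)
assert np.linalg.matrix_rank(A) == 2
assert 1.0 <= stable_rank(A) <= 1.01
```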

J.3 Notation Reference

Following the project notation guide (docs/NOTATION_GUIDE.md):

  • \|A\|_F - Frobenius norm of matrix A
  • \|A\|_2 - Spectral (operator 2-) norm of A; equals \sigma_1(A)
  • \|A\|_* - Nuclear (trace) norm of A; equals \sum_i\sigma_i(A)
  • \|A\|_1 - Matrix 1-norm (max absolute column sum)
  • \|A\|_\infty - Matrix \infty-norm (max absolute row sum)
  • \|A\|_{S_p} - Schatten p-norm; \ell^p norm applied to the singular values
  • \|A\|_{(k)} - Ky Fan k-norm; sum of the top-k singular values
  • \kappa_p(A) - Condition number in the p-norm: \|A\|_p\|A^{-1}\|_p
  • \rho(A) - Stable rank: \|A\|_F^2/\|A\|_2^2
  • \sigma_i(A) - i-th singular value (in decreasing order)
  • \boldsymbol{\sigma}(A) - Vector of all singular values
  • \langle A,B\rangle_F - Frobenius inner product: \operatorname{tr}(A^\top B)

<- Back to Advanced Linear Algebra | Next: Linear Transformations ->

Appendix K: Further Reading

K.1 Textbooks

  1. Golub & Van Loan - Matrix Computations (4th ed., 2013). The definitive numerical linear algebra reference. Chapters 2-3 cover matrix norms and condition numbers in full depth.

  2. Horn & Johnson - Matrix Analysis (2nd ed., 2013). Comprehensive theoretical treatment of matrix norms, singular values, and inequalities.

  3. Bhatia - Matrix Analysis (1997). Advanced treatment of matrix inequalities, Schatten norms, and majorization. Chapter IV covers unitarily invariant norms.

  4. Trefethen & Bau - Numerical Linear Algebra (1997). Excellent pedagogical treatment. Lectures 4-5 cover norms; Lectures 11-12 cover condition number theory.

K.2 Foundational Papers

  1. Eckart & Young (1936) - "The approximation of one matrix by another of lower rank." Psychometrika. The Eckart-Young theorem.

  2. Mirsky (1960) - "Symmetric gauge functions and unitarily invariant norms." Quarterly Journal of Mathematics. Unitarily invariant norm characterization via symmetric gauge functions.

  3. Candès & Recht (2009) - "Exact matrix completion via convex optimization." Foundations of Computational Mathematics. Nuclear norm for matrix recovery under incoherence.

  4. Cai, Candès & Shen (2010) - "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization. The SVT proximal algorithm.

K.3 Machine Learning Papers

  1. Miyato et al. (2018) - "Spectral Normalization for Generative Adversarial Networks." ICLR 2018. Spectral norm in GAN training; power iteration for \sigma_1.

  2. Bartlett, Foster & Telgarsky (2017) - "Spectrally-normalized margin bounds for neural networks." NeurIPS 2017. PAC-Bayes generalization bounds via spectral and Frobenius norms.

  3. Gunasekar et al. (2017) - "Implicit Regularization in Matrix Factorization." NeurIPS 2017. Gradient flow on factored parameterization implicitly minimizes nuclear norm.

  4. Hu et al. (2021) - "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. Low-rank weight adaptation; implicit nuclear norm regularization.

  5. Dong et al. (2021) - "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth." ICML 2021. Attention matrix rank analysis via spectral/nuclear norms.

  6. DeepSeek-AI (2024) - "DeepSeek-V2." MLA architecture and nuclear norm-motivated KV cache compression.

K.4 Online Resources

  • Gilbert Strang's Linear Algebra lectures (MIT OpenCourseWare 18.06): Lectures 29-33 cover SVD, norms, and condition numbers with worked examples.
  • Numerical Linear Algebra (Trefethen, Oxford): Freely available course notes that complement the textbook above.
  • Matrix Cookbook (Petersen & Pedersen, 2012): Dense reference for matrix identities and norm formulas. Freely available as a PDF.
  • Convex Optimization (Boyd & Vandenberghe, Stanford): Freely available textbook covering norms, dual norms, and norm-approximation problems as convex optimization.

This section is part of the Math for LLMs curriculum. All notation follows docs/NOTATION_GUIDE.md. For visualization standards, see docs/VISUALIZATION_GUIDE.md.



Summary

Matrix norms are the central quantitative tools of matrix analysis. This section covered:

Core norm families - The five principal matrix norms (Frobenius, spectral, nuclear, matrix-1, matrix-\infty) and the broader Schatten family that unifies them. Each norm captures a different geometric property of a linear map: the Frobenius norm measures RMS stretching; the spectral norm measures maximum stretching; the nuclear norm measures total stretching (sum of singular values).

Induced vs. non-induced - The spectral, matrix-1, and matrix-\infty norms are induced (they arise from vector norms on input/output spaces). The Frobenius norm is not induced but is compatible. Every induced norm is submultiplicative; submultiplicativity is the key property for stability analysis.

Unitarily invariant norms - Norms that depend only on singular values (Frobenius, spectral, nuclear, Schatten, Ky Fan) have a unified theory via symmetric gauge functions. The Von Neumann trace inequality and Mirsky's theorem are the central results.

Condition number - The ratio \sigma_1/\sigma_n quantifies sensitivity to perturbations. Each unit of \log_{10}\kappa costs roughly one decimal digit of precision. Tikhonov regularization reduces \kappa at the cost of introducing systematic bias.

Perturbation theory - Weyl's inequality (singular values are Lipschitz-1 w.r.t. spectral norm) guarantees stability of SVD computations. The Bauer-Fike theorem shows that non-normal eigenvalues can be much less stable.

Machine learning connections - Every major modern ML technique has a matrix norm interpretation: weight decay (Frobenius), spectral normalization (spectral), LoRA (nuclear via factored form), gradient clipping (spectral or Frobenius of gradients), PAC-Bayes bounds (stable rank and Frobenius/spectral), MLA compression (Eckart-Young), and attention analysis (nuclear norm rank collapse).

The key unifying insight: matrix norms reduce infinite-dimensional questions about linear operators to finite-dimensional scalar measurements. This reduction is the mathematical move that makes analysis tractable, computation feasible, and regularization principled.

This understanding of norms as measurement instruments prepares us for the next section on Linear Transformations, where we use norms to study how maps between vector spaces can be classified, composed, and analyzed. The spectral norm of a transformation matrix is precisely its operator norm - the fundamental quantity controlling how much the transformation can stretch vectors. Every topic in the remaining curriculum - optimization convergence (Chapter 8), probabilistic models (Chapter 10), and the mathematics of specific architectures (Chapter 14) - uses matrix norms as a core tool.

The journey from abstract norm axioms to practical tools - power iteration, singular value thresholding, Tikhonov regularization, spectral normalization - illustrates how pure mathematics becomes engineering. Matrix norms are not just theoretical objects but the computational primitives of modern AI.

Matrix norms connect every part of linear algebra - the SVD gives their values, orthogonality explains their invariance, eigenvalues govern condition numbers - and every part of machine learning - norms define regularizers, bound generalization, and stabilize training. They are the language in which the theory and practice of modern AI is written.
