Math for LLMs / Advanced Linear Algebra

Eigenvalues and Eigenvectors

"The eigenvalues of a matrix are not just numbers - they are the heartbeat of the linear map, the frequencies at which it resonates, the rates at which it remembers and forgets."

Overview

Eigenvalues and eigenvectors are the most powerful diagnostic tool in linear algebra. Given a square matrix $A$, an eigenvector is a non-zero vector $\mathbf{v}$ that the matrix maps to a scalar multiple of itself: $A\mathbf{v} = \lambda\mathbf{v}$. The scalar $\lambda$ is the corresponding eigenvalue. A generic vector is rotated when $A$ is applied; eigenvectors are the special directions that survive without rotation - they are the natural axes of the linear map, the directions in which the map acts most simply.

The power of this concept is hard to overstate. The long-run behaviour of any linear dynamical system - whether a Markov chain, a recurrent neural network, or gradient descent on a quadratic loss - is entirely determined by the eigenvalue spectrum. The eigenvector for the largest eigenvalue is where the system ends up; the ratio of the top two eigenvalues controls the convergence rate; a single eigenvalue outside the unit disk triggers instability. Understanding eigenvalues is understanding the future of any linear system.

For AI practitioners in 2026, eigenvalues appear everywhere: PCA decomposes the data covariance matrix into eigenvectors (principal components) and eigenvalues (variances); the Hessian eigenspectrum determines gradient descent convergence and reveals flat vs sharp minima; attention matrices are row-stochastic with Perron-Frobenius eigenvalue 1; LoRA adapts the dominant eigenspace of weight matrices; spectral normalisation bounds the largest singular value to stabilise GAN training; graph neural networks convolve in the eigenspace of the graph Laplacian. This section develops all of these connections from first principles.

Prerequisites

  • Vector spaces, subspaces, span, and linear independence (Section 02-06)
  • Matrix multiplication, transpose, and invertibility (Section 02-02)
  • Determinants and their connection to invertibility (Section 02-04)
  • Systems of linear equations and row reduction / RREF (Section 02-03)
  • Matrix rank and null spaces (Section 02-05)
  • Basic familiarity with complex numbers (for complex eigenvalues)

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive demonstrations: geometric visualisation, power iteration, QR algorithm, spectral theorem, PCA, Hessian spectrum, graph Laplacian, spectral normalisation |
| exercises.ipynb | 8 graded problems: eigenpair verification, characteristic polynomial, diagonalisation, spectral theorem, matrix powers, Rayleigh quotient, PCA from scratch, power iteration |

Learning Objectives

After completing this section, you will:

  • State the eigenvalue equation $A\mathbf{v} = \lambda\mathbf{v}$ and explain its geometric meaning
  • Compute eigenvalues by solving the characteristic equation $\det(\lambda I - A) = 0$
  • Find eigenvectors by computing the null space of $(A - \lambda I)$
  • Distinguish algebraic and geometric multiplicity; identify defective matrices
  • Diagonalise a matrix as $A = V\Lambda V^{-1}$ and use it to compute matrix powers
  • State and apply the spectral theorem for symmetric matrices ($A = Q\Lambda Q^\top$)
  • Characterise positive (semi)definite matrices via their eigenvalue spectrum
  • Apply the Rayleigh quotient and Courant-Fischer theorem
  • Compute the dominant eigenvector via power iteration
  • Connect eigenvalues to PCA, gradient descent convergence, RNN stability, and attention structure
  • Explain the Hessian eigenspectrum and its role in neural network training dynamics
  • Use Gershgorin circles and Weyl inequalities to bound eigenvalues without computing them

Table of Contents

  1. Intuition
  2. Formal Definitions
  3. Computing Eigenvalues and Eigenvectors
  4. Properties of Eigenvalues
  5. Diagonalisation
  6. The Spectral Theorem
  7. Jordan Normal Form

1. Intuition

1.1 What Are Eigenvalues and Eigenvectors?

A linear map $A: \mathbb{R}^n \to \mathbb{R}^n$ generally rotates and stretches every vector it acts on. Apply it to any random vector $\mathbf{x}$ and the result $A\mathbf{x}$ points in a completely different direction. But for certain special non-zero vectors $\mathbf{v}$, something remarkable happens: $A\mathbf{v}$ points in exactly the same direction as $\mathbf{v}$. The map stretches or shrinks $\mathbf{v}$, possibly flips it, but does not rotate it. These special vectors are called eigenvectors of $A$, and the stretching factor is the corresponding eigenvalue $\lambda$:

$$A\mathbf{v} = \lambda\mathbf{v}$$

The eigenvalue $\lambda$ encodes everything about how $A$ acts on $\mathbf{v}$: if $\lambda > 1$ the vector is stretched; if $0 < \lambda < 1$ it is shrunk; if $\lambda < 0$ it is reversed in direction and possibly scaled; if $\lambda = 0$ it is collapsed to the zero vector; if $\lambda = 1$ it is left completely unchanged. The eigenvector itself gives the direction; the eigenvalue gives the magnitude and sign of the action.

The word eigen is German for "own" or "characteristic." Eigenvectors are the characteristic directions that belong intrinsically to the linear map - they are independent of the coordinate system you choose to describe the map. No matter how you rotate your axes, the eigenvectors of $A$ remain the same geometric objects. They are the natural language in which the map speaks most simply.

For AI: The eigenvectors of the sample covariance matrix $C = \tilde{X}^\top \tilde{X}/(n-1)$ are the principal components - the directions of maximum variance in the data. The eigenvalues of the Hessian $\nabla^2 \mathcal{L}$ are the curvatures of the loss landscape in each eigendirection; large eigenvalues correspond to sharp, fast-converging directions, small eigenvalues to flat, slow-converging ones. The eigenvalues of a recurrent weight matrix $W$ determine whether the RNN's hidden state grows, decays, or oscillates as time steps accumulate.
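
A short numerical check makes the definition concrete. The sketch below (assuming NumPy, which the companion notebooks already use) verifies that each eigenpair returned by np.linalg.eig satisfies $A\mathbf{v} = \lambda\mathbf{v}$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)       # columns of eigvecs are eigenvectors

for i in range(len(eigvals)):
    lam, v = eigvals[i], eigvecs[:, i]
    assert np.allclose(A @ v, lam * v)    # the defining equation A v = lambda v
    print(f"lambda = {lam:.4f}, v = {v}")
```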

1.2 The Geometric Picture

The clearest way to see eigenvalues geometrically is to watch what $A$ does to every unit vector simultaneously. Imagine the unit circle $\{\mathbf{x} : \|\mathbf{x}\| = 1\}$ in $\mathbb{R}^2$. Apply $A$ to every point on this circle. The result is an ellipse. The eigenvectors of $A$ are the directions that map to points on the same ray through the origin - vectors that get stretched along their own line.

For a symmetric matrix $A = A^\top$, the eigenvectors are orthogonal to each other. The ellipse's axes align exactly with the eigenvectors, and the lengths of the semi-axes equal the absolute eigenvalues. The map acts as a pure coordinate-wise scaling in the eigenbasis: no coupling, no rotation, just independent stretching of orthogonal directions. This is the content of the Spectral Theorem (Section 6), the most important structural result in applied linear algebra.

For a non-symmetric matrix, the picture is more complex. Eigenvectors may not be orthogonal. They may be complex even for a real matrix (a 2D rotation matrix, for instance, has no real eigenvectors - it rotates every real vector). But the eigenvalues still control the long-run behaviour: repeated application of $A$ amplifies the direction of the largest $|\lambda_i|$ and suppresses all others.

GEOMETRIC ACTION OF A 2×2 MATRIX
========================================================================

  Unit circle                    After applying A
       *                              *  *
     *   *          A             *        *
    *     *    --------->      *              *
     *   *                    (ellipse, axes = eigenvectors,
       *                       semi-axes = |eigenvalues|)

  Most vectors get ROTATED.     Eigenvectors v_1, v_2 stay on
                                the same ray - only stretched.

========================================================================

1.3 Why Eigenvalues Are Central to AI

Eigenvalues are not a piece of abstract machinery that happens to appear in ML - they are the fundamental language of how linear systems evolve, converge, compress, and fail. Here are the core connections:

PCA and covariance structure. Given a centred data matrix $\tilde{X} \in \mathbb{R}^{n \times d}$, the sample covariance is $C = \tilde{X}^\top \tilde{X}/(n-1) \in \mathbb{R}^{d \times d}$, a symmetric positive semidefinite matrix. Its eigendecomposition $C = Q\Lambda Q^\top$ gives eigenvectors $\mathbf{q}_1, \ldots, \mathbf{q}_d$ (the principal components) and eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$ (the variances in each direction). Keeping only the top-$k$ eigenvectors gives the optimal rank-$k$ approximation to the data - the subspace capturing maximum variance.

Gradient descent convergence. For a quadratic loss $\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^\top H \boldsymbol{\theta} - \mathbf{b}^\top \boldsymbol{\theta}$, gradient descent with step size $\eta$ converges iff $|1 - \eta\lambda_i| < 1$ for all eigenvalues $\lambda_i$ of the Hessian $H$. The convergence rate in each eigendirection is $(1 - \eta\lambda_i)^t$. The condition number $\kappa(H) = \lambda_{\max}/\lambda_{\min}$ determines how much slower the slowest direction converges relative to the fastest - large $\kappa$ means gradient descent zigzags and stalls.

Vanishing and exploding gradients. In a deep linear network with the same weight matrix $W$ at each of $L$ layers, the gradient magnitude scales as $\|W^L\| \approx \rho(W)^L$, where $\rho(W) = \max_i |\lambda_i|$ is the spectral radius. If $\rho(W) > 1$: gradients explode. If $\rho(W) < 1$: gradients vanish. The eigenspectrum of $W$ is the fundamental quantity controlling gradient flow through depth.

Attention and Perron-Frobenius. The attention weight matrix $S = \mathrm{softmax}(QK^\top/\sqrt{d_k})$ is row-stochastic. By the Perron-Frobenius theorem, its dominant eigenvalue is exactly 1, with all others satisfying $|\lambda_i| \leq 1$. The structure of the remaining eigenvalues characterises the type of attention pattern - induction heads, name-mover heads, and copying heads each have distinct eigenvalue signatures.

Spectral normalisation and stability. Dividing a weight matrix $W$ by its largest singular value $\sigma_{\max}(W)$ (spectral norm) enforces a Lipschitz constraint of 1 on that layer. This stabilises GAN training and normalising flow invertibility. The largest singular value is itself the square root of the largest eigenvalue of $W^\top W$.

1.4 Eigenvalues as Long-Run Behaviour

The deepest reason eigenvalues matter is that they completely determine the long-run behaviour of any linear dynamical system. If $A\mathbf{v}_i = \lambda_i \mathbf{v}_i$, then repeated application gives:

$$A^t \mathbf{v}_i = \lambda_i^t \mathbf{v}_i$$

Each eigenvector mode grows or decays at rate $\lambda_i^t$. For any initial vector decomposed in the eigenbasis, $\mathbf{x}_0 = \sum_i c_i \mathbf{v}_i$:

$$A^t \mathbf{x}_0 = \sum_{i=1}^n c_i \lambda_i^t \mathbf{v}_i$$

The mode with the largest $|\lambda_i|$ - the dominant eigenvalue $\lambda_1$ - eventually dominates everything: as $t \to \infty$, $A^t \mathbf{x}_0 \approx c_1 \lambda_1^t \mathbf{v}_1$. This is why power iteration works: repeatedly apply $A$ and renormalise, and the iterates converge to the eigenvector for $\lambda_1$ at rate $|\lambda_2/\lambda_1|$ per step. It is also why PageRank works - the steady-state page importance vector is the dominant eigenvector of the web link matrix.

For AI: Training dynamics near a loss minimum are approximately a linear system with $A = I - \eta H$. The eigenvalues of $A$ are $1 - \eta\lambda_i(H)$. The long-run behaviour of the training error is therefore $\sum_i c_i (1-\eta\lambda_i)^t \mathbf{v}_i$ - each Hessian eigendirection decays geometrically at its own rate. The slowest-converging direction is the one with the smallest Hessian eigenvalue (smallest curvature = flattest direction).
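
The dominance effect is easy to see numerically. A minimal sketch, assuming NumPy (the matrix is an arbitrary example, not from the text), that iterates $\mathbf{x}_{t+1} = A\mathbf{x}_t$ and checks alignment with the dominant eigenvector:

```python
import numpy as np

A = np.array([[0.9, 0.3],
              [0.1, 0.7]])                 # eigenvalues 1.0 and 0.6
x = np.array([1.0, 0.0])

eigvals, eigvecs = np.linalg.eig(A)
i = np.argmax(np.abs(eigvals))             # index of the dominant eigenvalue
v1 = eigvecs[:, i]

for _ in range(50):                        # compute x_t = A^t x_0 step by step
    x = A @ x

# up to sign and scale, x now points along the dominant eigenvector
cosine = abs(x @ v1) / (np.linalg.norm(x) * np.linalg.norm(v1))
print(cosine)                              # ~ 1.0
```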

1.5 Historical Timeline

| Year | Who | What |
| --- | --- | --- |
| 1750s | Euler | Special exponential solutions to ODEs; implicit eigenvalue reasoning for linear differential equations |
| 1762 | Lagrange | Reduction of quadratic forms to "principal axes"; normal modes of mechanical systems |
| 1829 | Cauchy | Rigorous proof that real symmetric matrices have real eigenvalues; foundation of spectral theory |
| 1856 | Hermite | Generalisation to complex Hermitian matrices; eigenvalues real for $A = A^*$ |
| 1866 | Kronecker | Formulated the characteristic polynomial $\det(\lambda I - A)$; eigenvalue problem as polynomial root-finding |
| 1870 | Jordan | Jordan normal form; canonical classification of all square matrices up to similarity |
| 1904-10 | Hilbert | Spectral theorem for infinite-dimensional operators; eigenvalues of integral operators; birth of functional analysis |
| 1925-27 | Heisenberg, von Neumann | Quantum mechanics as eigenvalue problems; observable quantities = eigenvalues of Hermitian operators |
| 1950 | Lanczos | Krylov subspace method for extreme eigenvalues of large sparse symmetric matrices |
| 1955 | Wigner | Random matrix theory; semicircle law for eigenvalue distributions of large random symmetric matrices |
| 1961-62 | Francis; Kublanovskaya | QR algorithm - the practical standard for computing all eigenvalues; one of the top 10 algorithms of the 20th century |
| 1965 | Golub-Reinsch | Practical SVD algorithm; singular values as generalised eigenvalues for rectangular matrices |
| 1989 | LeCun et al. | Hessian eigenspectrum of neural networks analysed; curvature-aware learning rate scheduling |
| 2010 | Martens | K-FAC - Kronecker-factored approximate curvature; eigendecomposition of layer-wise Hessian approximations |
| 2017-26 | Transformer era | Spectral analysis of attention matrices; NTK eigenvalues; spectral normalisation in GANs; LoRA via dominant eigenspace |
| 2019-26 | Martin & Mahoney | WeightWatcher - heavy-tailed eigenvalue distributions as quality metrics for trained neural networks |

2. Formal Definitions

2.1 The Eigenvalue Equation

Definition. Let $A \in \mathbb{R}^{n \times n}$ be a square matrix. A scalar $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and a non-zero vector $\mathbf{v} \in \mathbb{C}^n \setminus \{\mathbf{0}\}$ is the corresponding eigenvector if:

$$A\mathbf{v} = \lambda\mathbf{v}$$

Three equivalent ways to state this:

  1. Geometric: $A$ maps $\mathbf{v}$ to a scalar multiple of itself; direction is preserved, only magnitude (and possibly sign) changes.
  2. Algebraic: $(A - \lambda I)\mathbf{v} = \mathbf{0}$ has a non-trivial solution $\mathbf{v} \neq \mathbf{0}$.
  3. Subspace: $\mathbf{v} \in \mathrm{null}(A - \lambda I)$ and $\mathrm{null}(A - \lambda I) \neq \{\mathbf{0}\}$.

The requirement $\mathbf{v} \neq \mathbf{0}$ is essential: the zero vector satisfies $A\mathbf{0} = \lambda\mathbf{0}$ for every $\lambda$, so it carries no information. Any non-zero scalar multiple of an eigenvector is also an eigenvector: $A(\alpha\mathbf{v}) = \alpha A\mathbf{v} = \alpha\lambda\mathbf{v} = \lambda(\alpha\mathbf{v})$. Eigenvectors are therefore not unique - they define a direction, not a specific vector. The set of all eigenvectors for a given $\lambda$, together with $\mathbf{0}$, forms a subspace called the eigenspace (Section 2.4).

Non-examples of eigenvalues:

  • The zero matrix $A = 0$ has every non-zero vector as an eigenvector with $\lambda = 0$, but no non-zero eigenvalue.
  • A rotation matrix in $\mathbb{R}^2$ by angle $\theta \neq 0, \pi$ has no real eigenvalues - it rotates every real vector off its own direction.
  • Non-square matrices have no eigenvalues (the equation $A\mathbf{v} = \lambda\mathbf{v}$ requires $A$ to map $\mathbb{R}^n \to \mathbb{R}^n$).

2.2 The Characteristic Polynomial

For $\lambda$ to be an eigenvalue, $(A - \lambda I)\mathbf{v} = \mathbf{0}$ must have a non-trivial solution. This happens precisely when $A - \lambda I$ is singular, i.e., when its determinant is zero. This gives the characteristic equation:

$$\det(\lambda I - A) = 0$$

The function $p(\lambda) = \det(\lambda I - A)$ is a degree-$n$ polynomial in $\lambda$ called the characteristic polynomial of $A$. Its roots are exactly the eigenvalues of $A$. By convention we use $\det(\lambda I - A)$ (rather than $\det(A - \lambda I)$) so that the leading coefficient is $+1$ (monic polynomial):

$$p(\lambda) = \lambda^n - \mathrm{tr}(A)\,\lambda^{n-1} + \cdots + (-1)^n \det(A)$$

Two key coefficients appear regardless of the matrix:

  • Coefficient of $\lambda^{n-1}$: $-\mathrm{tr}(A) = -\sum_i \lambda_i$ - the sum of the eigenvalues equals the trace.
  • Constant term: $p(0) = (-1)^n \det(A) = (-1)^n \prod_i \lambda_i$ - the product of the eigenvalues equals the determinant exactly; the constant term matches it up to the sign $(-1)^n$.

By the Fundamental Theorem of Algebra, $p(\lambda)$ has exactly $n$ roots over $\mathbb{C}$ (counting multiplicity). Every $n \times n$ real matrix therefore has exactly $n$ eigenvalues in $\mathbb{C}$, though some may be complex and some may be repeated.

For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:

$$p(\lambda) = \lambda^2 - (a+d)\lambda + (ad - bc) = \lambda^2 - \mathrm{tr}(A)\,\lambda + \det(A)$$

$$\lambda_{1,2} = \frac{\mathrm{tr}(A) \pm \sqrt{\mathrm{tr}(A)^2 - 4\det(A)}}{2}$$

The discriminant $\Delta = \mathrm{tr}(A)^2 - 4\det(A)$ determines the nature of the eigenvalues: $\Delta > 0$ gives two distinct real eigenvalues; $\Delta = 0$ gives one repeated real eigenvalue; $\Delta < 0$ gives two complex conjugate eigenvalues.
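
The $2 \times 2$ formula is worth checking against a library solver. A sketch assuming NumPy, with eig2x2 a hypothetical helper name:

```python
import numpy as np

def eig2x2(A):
    """Eigenvalues of a 2x2 matrix from trace and determinant alone."""
    tr, det = np.trace(A), np.linalg.det(A)
    s = np.sqrt(complex(tr**2 - 4*det))    # discriminant decides real vs complex
    return (tr + s) / 2, (tr - s) / 2

print(eig2x2(np.array([[3.0, 1.0], [1.0, 3.0]])))    # (4+0j), (2+0j): Delta > 0
print(eig2x2(np.array([[0.0, -1.0], [1.0, 0.0]])))   # +i, -i: rotation, Delta < 0
print(np.linalg.eigvals(np.array([[3.0, 1.0], [1.0, 3.0]])))   # reference: [4. 2.]
```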

2.3 Algebraic and Geometric Multiplicity

When an eigenvalue $\lambda_i$ appears as a repeated root of $p(\lambda)$, we distinguish two notions of multiplicity:

Algebraic multiplicity $m_a(\lambda_i)$: the number of times $(\lambda - \lambda_i)$ divides $p(\lambda)$; the multiplicity of $\lambda_i$ as a root of the characteristic polynomial.

Geometric multiplicity $m_g(\lambda_i)$: the dimension of the eigenspace $\mathrm{null}(A - \lambda_i I)$; the number of linearly independent eigenvectors for $\lambda_i$.

These two numbers satisfy the fundamental inequality:

$$1 \leq m_g(\lambda_i) \leq m_a(\lambda_i)$$

The lower bound holds because every eigenvalue has at least one eigenvector. For the upper bound: extend a basis of the eigenspace to a basis of $\mathbb{R}^n$; in that basis, $A$ is block upper triangular with a leading $\lambda_i I$ block of size $m_g$, so $(\lambda - \lambda_i)^{m_g}$ divides $p(\lambda)$, forcing $m_g \leq m_a$. When $m_g(\lambda_i) = m_a(\lambda_i)$ for every eigenvalue, the matrix is diagonalisable - it has $n$ linearly independent eigenvectors. When $m_g(\lambda_i) < m_a(\lambda_i)$ for some eigenvalue, the matrix is defective and cannot be diagonalised; it requires the Jordan Normal Form (Section 7).

Example. Consider $A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$. The characteristic polynomial is $(\lambda - 2)^2$, so $\lambda = 2$ has $m_a = 2$. But $A - 2I = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has null space spanned only by $\mathbf{e}_1 = (1,0)^\top$, so $m_g = 1 < m_a = 2$. This matrix is defective.

2.4 Eigenspaces

For eigenvalue λ\lambda, the eigenspace is:

$$E(\lambda) = \mathrm{null}(A - \lambda I) = \{\mathbf{v} \in \mathbb{R}^n : A\mathbf{v} = \lambda\mathbf{v}\}$$

$E(\lambda)$ is always a subspace (it is the null space of $A - \lambda I$). It contains the zero vector and is closed under addition and scalar multiplication. Its dimension equals the geometric multiplicity: $\dim E(\lambda) = m_g(\lambda) = n - \mathrm{rank}(A - \lambda I)$.

Two key properties make eigenspaces structurally special:

Invariance: $E(\lambda)$ is invariant under $A$. If $\mathbf{v} \in E(\lambda)$ then $A\mathbf{v} = \lambda\mathbf{v} \in E(\lambda)$. Applying $A$ keeps you inside the eigenspace - eigenvectors stay eigenvectors.

Independence across eigenvalues: Eigenvectors for different eigenvalues are linearly independent. If $A\mathbf{v} = \lambda_1\mathbf{v}$ and $A\mathbf{w} = \lambda_2\mathbf{w}$ with $\lambda_1 \neq \lambda_2$, then $\mathbf{v}$ and $\mathbf{w}$ cannot be parallel. Proof: suppose $\mathbf{v} = \alpha\mathbf{w}$ for some scalar $\alpha \neq 0$. Then $\lambda_1\mathbf{v} = A\mathbf{v} = \alpha A\mathbf{w} = \alpha\lambda_2\mathbf{w} = \lambda_2\mathbf{v}$, giving $(\lambda_1 - \lambda_2)\mathbf{v} = \mathbf{0}$. Since $\lambda_1 \neq \lambda_2$ and $\mathbf{v} \neq \mathbf{0}$, this is a contradiction. More generally: eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ for distinct eigenvalues $\lambda_1, \ldots, \lambda_k$ are always linearly independent.

For AI: In PCA, the eigenspaces of the covariance matrix are the principal component subspaces. The first eigenspace (for $\lambda_1$, the largest variance) gives the direction of greatest spread. In spectral clustering, the eigenspaces of the graph Laplacian partition the graph. In the analysis of attention heads, the eigenspace for eigenvalue 1 is the fixed-point subspace of the attention map.

2.5 The Spectrum

The spectrum of $A$ is the set of all distinct eigenvalues: $\sigma(A) = \{\lambda \in \mathbb{C} : \det(\lambda I - A) = 0\}$.

The spectral radius is $\rho(A) = \max\{|\lambda| : \lambda \in \sigma(A)\}$ - the largest absolute eigenvalue. It controls the long-run growth rate of $A^t$:

  • $\rho(A) < 1$: $A^t \to 0$ as $t \to \infty$ (all trajectories of $\mathbf{x}_{t+1} = A\mathbf{x}_t$ converge to $\mathbf{0}$)
  • $\rho(A) = 1$: trajectories are bounded (marginal stability, provided the Jordan blocks for eigenvalues with $|\lambda| = 1$ have size 1)
  • $\rho(A) > 1$: some trajectories grow without bound (unstable)

For real matrices, complex eigenvalues always come in conjugate pairs: if $\lambda = a + bi \in \sigma(A)$ then $\bar{\lambda} = a - bi \in \sigma(A)$, with corresponding conjugate eigenvectors. This means real matrices of odd dimension always have at least one real eigenvalue.

For AI: The spectral radius of the recurrent weight matrix $W$ in an RNN determines stability: $\rho(W) < 1$ causes vanishing gradients; $\rho(W) > 1$ causes explosion; $\rho(W) = 1$ is the critical point targeted by orthogonal RNN designs. The spectral radius of the attention matrix $S$ is always 1 (Perron-Frobenius). The spectral radius of $I - \eta H$ (where $H$ is the Hessian) determines gradient descent convergence.


3. Computing Eigenvalues and Eigenvectors

3.1 The Two-Step Process

Computing eigenvalues and eigenvectors follows a universal two-step recipe:

Step 1 - Find eigenvalues: Solve the characteristic equation $\det(\lambda I - A) = 0$. This gives a degree-$n$ polynomial whose roots are the eigenvalues $\lambda_1, \ldots, \lambda_n$ (with algebraic multiplicity).

Step 2 - Find eigenvectors: For each eigenvalue $\lambda_i$, solve the homogeneous linear system $(A - \lambda_i I)\mathbf{v} = \mathbf{0}$. Row-reduce $[A - \lambda_i I \mid \mathbf{0}]$ to RREF. The free variables give the null space basis vectors - these are the eigenvectors for $\lambda_i$.

This procedure is exact for small matrices. For $n \geq 5$, the Abel-Ruffini theorem rules out any general algebraic formula in radicals for the roots of a degree-$n$ polynomial, so numerical methods (Section 3.7) are mandatory in practice.

3.2 The 2×2 Case

For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the characteristic polynomial is $p(\lambda) = \lambda^2 - (a+d)\lambda + (ad-bc)$, giving:

$$\lambda_{1,2} = \frac{(a+d) \pm \sqrt{(a-d)^2 + 4bc}}{2}$$

Worked example. $A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$.

  • $p(\lambda) = \lambda^2 - 6\lambda + 8 = (\lambda-4)(\lambda-2)$; eigenvalues $\lambda_1 = 4$, $\lambda_2 = 2$.
  • For $\lambda_1 = 4$: $A - 4I = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix} \xrightarrow{\text{RREF}} \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}$; eigenvector $\mathbf{v}_1 = (1, 1)^\top$.
  • For $\lambda_2 = 2$: $A - 2I = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \xrightarrow{\text{RREF}} \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$; eigenvector $\mathbf{v}_2 = (1, -1)^\top$.
  • Check: $\lambda_1 + \lambda_2 = 6 = \mathrm{tr}(A)$; $\lambda_1 \lambda_2 = 8 = \det(A)$; $\mathbf{v}_1 \perp \mathbf{v}_2$ (as expected for a symmetric matrix).
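
The same computation can be reproduced symbolically. A sketch assuming SymPy is available (any CAS would do) that mirrors the two-step recipe - characteristic polynomial first, then null spaces:

```python
import sympy as sp

A = sp.Matrix([[3, 1],
               [1, 3]])
lam = sp.symbols('lambda')

# Step 1: characteristic polynomial det(lambda*I - A) and its roots
p = (lam * sp.eye(2) - A).det()
print(sp.factor(p))                        # (lambda - 2)*(lambda - 4)

# Step 2: eigenvectors = null space of (A - lambda*I) for each root
for ev, mult in sp.roots(p, lam).items():
    basis = (A - ev * sp.eye(2)).nullspace()
    print(ev, mult, [list(v) for v in basis])
```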

3.3 The 3×3 and Higher Cases

For a $3 \times 3$ matrix $A$, expand $\det(\lambda I - A)$ by cofactor expansion along any row or column:

$$p(\lambda) = \lambda^3 - \mathrm{tr}(A)\,\lambda^2 + \tfrac{1}{2}\left[(\mathrm{tr}(A))^2 - \mathrm{tr}(A^2)\right]\lambda - \det(A)$$

The coefficient of $\lambda$ equals the sum of the three $2 \times 2$ principal minors $M_{11} + M_{22} + M_{33}$ (where $M_{ii}$ is the determinant of $A$ with row $i$ and column $i$ deleted). Factor the cubic if possible; otherwise use the cubic formula or numerical root-finding.

For $n \geq 4$: expand the determinant directly; factor by guessing integer roots; use the rational root theorem. For $n \geq 5$: numerical methods are mandatory (Section 3.7).

Key shortcut - triangular and diagonal matrices. If $A$ is upper triangular, lower triangular, or diagonal, then $\det(\lambda I - A) = \prod_{i=1}^n (\lambda - A_{ii})$. The eigenvalues are exactly the diagonal entries. No expansion needed.

Key shortcut - block diagonal matrices. If $A = \mathrm{diag}(B_1, B_2, \ldots, B_k)$, then $p(\lambda) = \prod_j p_{B_j}(\lambda)$. The eigenvalues of $A$ are the union of the eigenvalues of the blocks.

3.4 Worked Example - 3×3 Defective Matrix

Let $A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 2 & 1 \\ 0 & 0 & 3 \end{pmatrix}$ (upper triangular).

Step 1: $p(\lambda) = (\lambda-2)^2(\lambda-3)$. Eigenvalues: $\lambda_1 = 2$ (algebraic multiplicity 2), $\lambda_2 = 3$ (multiplicity 1).

Step 2a - eigenvectors for $\lambda_1 = 2$:

$$A - 2I = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix} \xrightarrow{\text{RREF}} \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$$

Free variable: $x_1$; forced: $x_2 = 0$, $x_3 = 0$. Eigenvector: $\mathbf{v}_1 = (1,0,0)^\top$. Geometric multiplicity $m_g(\lambda_1) = 1 < m_a(\lambda_1) = 2$. Defective eigenvalue. The matrix cannot be diagonalised.

Step 2b - eigenvectors for $\lambda_2 = 3$:

$$A - 3I = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \\ 0 & 0 & 0 \end{pmatrix} \xrightarrow{\text{RREF}} \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{pmatrix}$$

Free variable: $x_3$; $x_1 = x_3$, $x_2 = x_3$. Eigenvector: $\mathbf{v}_2 = (1,1,1)^\top$. Geometric multiplicity $m_g(\lambda_2) = 1 = m_a(\lambda_2)$. Non-defective.

Conclusion: This matrix has only 2 linearly independent eigenvectors for 3 eigenvalues (counting multiplicity). Diagonalisation fails; Section 7 covers the Jordan form needed here.
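
The multiplicity bookkeeping is easy to verify numerically, since the geometric multiplicity is $n - \mathrm{rank}(A - \lambda I)$. A sketch assuming NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])

for lam, m_a in [(2.0, 2), (3.0, 1)]:
    # geometric multiplicity = dimension of null(A - lam*I)
    m_g = A.shape[0] - np.linalg.matrix_rank(A - lam * np.eye(3))
    print(f"lambda={lam}: algebraic={m_a}, geometric={m_g}")
# lambda=2.0: algebraic=2, geometric=1  -> defective
# lambda=3.0: algebraic=1, geometric=1
```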

3.5 Eigenvalues of Special Matrices

A table of shortcuts that apply throughout the course:

| Matrix type | Condition | Eigenvalue property |
| --- | --- | --- |
| Triangular (upper/lower) | $A_{ij} = 0$ for $i > j$ (or $i < j$) | Eigenvalues = diagonal entries $A_{ii}$ |
| Diagonal | $A = \mathrm{diag}(d_1,\ldots,d_n)$ | Eigenvalues $= d_i$; eigenvectors $= \mathbf{e}_i$ |
| Symmetric | $A = A^\top$ | All $\lambda_i \in \mathbb{R}$; eigenvectors orthogonal |
| Skew-symmetric | $A = -A^\top$ | All $\lambda_i \in i\mathbb{R}$ (purely imaginary or 0) |
| Orthogonal | $Q^\top Q = I$ | All $\lvert\lambda_i\rvert = 1$ |
| Projection | $P^2 = P$, $P^\top = P$ | $\lambda_i \in \{0, 1\}$ |
| Nilpotent | $A^k = 0$ for some $k$ | All $\lambda_i = 0$ |
| Row-stochastic | All entries $\geq 0$, row sums $= 1$ | $\lambda_1 = 1$; all $\lvert\lambda_i\rvert \leq 1$ |
| PSD | $A \succeq 0$ | All $\lambda_i \geq 0$ |
| PD | $A \succ 0$ | All $\lambda_i > 0$ |

3.6 Power Iteration

Power iteration computes the dominant eigenvector - the eigenvector corresponding to the eigenvalue with largest absolute value - at cost $O(n^2)$ per step (or $O(\mathrm{nnz})$ for sparse matrices).

Algorithm: Given an initial vector $\mathbf{x}_0 \in \mathbb{R}^n$ (random, not orthogonal to $\mathbf{v}_1$):

For k = 1, 2, 3, ...:
    y_k = A x_{k-1}                    (apply A: one matrix-vector multiply)
    x_k = y_k / ||y_k||                (normalise to unit length)
    lambda_k = x_k^T A x_k             (Rayleigh quotient estimate)
    stop when |lambda_k - lambda_{k-1}| < tolerance

Convergence: The iterates $\mathbf{x}_k \to \mathbf{v}_1$ at geometric rate $\lvert\lambda_2/\lambda_1\rvert$ per step. If the spectral gap $\lvert\lambda_1\rvert - \lvert\lambda_2\rvert$ is large, convergence is rapid; if the gap is small, convergence is slow.

Requirements: $\lambda_1$ must be strictly larger in absolute value than all other eigenvalues, and $\mathbf{x}_0$ must have a non-zero component in the $\mathbf{v}_1$ direction (almost surely true for a random initialisation).

For AI: Power iteration underlies PageRank (dominant eigenvector of the link transition matrix), spectral initialisation of word embeddings, and efficient estimation of the spectral norm $\sigma_{\max}(W) = \sqrt{\lambda_{\max}(W^\top W)}$ for spectral normalisation in GANs and normalising flows. One step of power iteration per gradient update costs only $O(mn)$ and gives a running estimate of $\sigma_{\max}$.
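
A compact implementation of the loop above - a sketch assuming NumPy, with power_iteration a hypothetical helper name - including the spectral-norm use case:

```python
import numpy as np

def power_iteration(A, tol=1e-10, max_iter=10_000, seed=0):
    """Dominant eigenpair by repeated multiply-and-normalise."""
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    lam = 0.0
    for _ in range(max_iter):
        y = A @ x
        x = y / np.linalg.norm(y)          # normalise to unit length
        lam_new = x @ A @ x                # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, x

lam, v = power_iteration(np.array([[3.0, 1.0], [1.0, 3.0]]))
print(lam)                                  # ~ 4.0, the dominant eigenvalue

# spectral norm of any W: sqrt of the dominant eigenvalue of W^T W
W = np.array([[1.0, 2.0], [0.0, 1.0]])
sigma_sq, _ = power_iteration(W.T @ W)
print(np.sqrt(sigma_sq), np.linalg.norm(W, 2))   # the two values agree
```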

3.7 The QR Algorithm

The QR algorithm is the practical standard for computing all eigenvalues of a dense matrix. It is one of the top 10 algorithms of the 20th century (Dongarra and Sullivan, 2000) and underlies numpy.linalg.eig, scipy.linalg.eig, and LAPACK's dgeev/dsyev.

Basic QR iteration:

A_0 = A
For k = 1, 2, 3, ...:
    A_{k-1} = Q_k R_k      (QR factorisation of current matrix)
    A_k     = R_k Q_k      (reverse-multiply; A_k is similar to A_{k-1})

Each $A_k = Q_k^\top A_{k-1} Q_k$ is orthogonally similar to $A$ and thus has the same eigenvalues. The sequence $A_k$ converges to an upper triangular (Schur) form whose diagonal entries are the eigenvalues.

Practical QR algorithm: Reduce $A$ to Hessenberg form first (zero below the first sub-diagonal) using Householder reflections - $O(n^3)$ once. QR steps on a Hessenberg matrix cost $O(n^2)$ each. With Wilkinson shifts (shifting by an estimate of the bottom eigenvalue), convergence is typically cubic. Total cost: $O(n^3)$.

For AI: The QR algorithm is called every time you run np.linalg.eig(A) or np.linalg.eigh(A) (the symmetric version). Computing the full eigendecomposition of the Hessian approximation in K-FAC uses block-wise QR. Eigenvalue decomposition of the attention weight matrix $W_O W_V$ for mechanistic interpretability uses np.linalg.eig.
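
A toy version of the unshifted iteration makes the idea concrete - a sketch assuming NumPy; production code adds Hessenberg reduction and shifts, as described above:

```python
import numpy as np

def qr_iteration(A, iters=500):
    """Unshifted QR iteration; each A_k is orthogonally similar to A."""
    Ak = A.copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                       # reverse-multiply: R Q = Q^T A_k Q
    return Ak

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])
Ak = qr_iteration(A)
print(np.round(np.diag(Ak), 6))               # diagonal ~ eigenvalues of A
print(np.sort(np.linalg.eigvalsh(A))[::-1])   # LAPACK reference, descending
```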

3.8 Lanczos Algorithm for Large Sparse Matrices

For large sparse symmetric matrices (e.g., graph Laplacians with $n = 10^6$ nodes), computing all eigenvalues via QR is $O(n^3)$ - completely infeasible. The Lanczos algorithm computes the $k$ extreme eigenvalues (largest or smallest) at cost $O(k \cdot \mathrm{nnz})$, where $\mathrm{nnz}$ is the number of non-zeros.

Core idea: Build a Krylov subspace $\mathcal{K}_k(A, \mathbf{b}) = \mathrm{span}\{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^{k-1}\mathbf{b}\}$. The eigenvalues of the $k \times k$ tridiagonal projection $T_k$ (called Ritz values) converge to the extreme eigenvalues of $A$ as $k$ grows. Typically $k \ll n$.

Algorithm (symmetric Lanczos):

beta_0 = 0, q_0 = 0, q_1 = b/||b||
For j = 1, 2, ..., k:
    z        = A q_j
    alpha_j  = q_j^T z
    z       <- z - alpha_j q_j - beta_{j-1} q_{j-1}
    beta_j   = ||z||;  q_{j+1} = z / beta_j
T_k = tridiag(alpha_1..alpha_k; beta_1..beta_{k-1})

The Ritz values (eigenvalues of $T_k$) converge to the extreme eigenvalues of $A$ at an exponential rate. In practice, re-orthogonalisation is needed to combat the floating-point accumulation of rounding errors.

For AI: Lanczos is used to compute the top-$r$ eigenvectors of large graph Laplacians for spectral clustering and graph neural network positional encodings; to estimate the Hessian eigenspectrum of large neural networks (PyHessian, HessianFlow); to compute dominant eigenvectors of large kernel matrices (Nystrom approximation); and to analyse the spectral structure of attention weight matrices in large transformers.
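
In practice one rarely hand-rolls Lanczos. A sketch assuming SciPy, whose sparse eigensolver eigsh wraps ARPACK's implicitly restarted Lanczos, applied to a large tridiagonal second-difference matrix:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# sparse symmetric test matrix: 1-D discrete Laplacian (second differences)
n = 10_000
L = diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
          offsets=[-1, 0, 1], format='csr')

# a few extreme eigenvalues via Lanczos (k << n); 'LA' = largest algebraic
top = eigsh(L, k=5, which='LA', return_eigenvectors=False)
print(np.sort(top))     # approaches 4.0, the spectral edge of this operator
```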


4. Properties of Eigenvalues

4.1 Trace and Determinant Relations

Two fundamental identities connect eigenvalues to matrix invariants computable without solving the characteristic equation:

$$\sum_{i=1}^n \lambda_i = \mathrm{tr}(A) \qquad \prod_{i=1}^n \lambda_i = \det(A)$$

Proof (trace). The characteristic polynomial is $p(\lambda) = \prod_{i=1}^n (\lambda - \lambda_i)$. Expanding: the coefficient of $\lambda^{n-1}$ is $-\sum_i \lambda_i$. Expanding $\det(\lambda I - A)$ directly: the only contribution to $\lambda^{n-1}$ comes from the product of diagonal terms $(\lambda - A_{11})(\lambda - A_{22})\cdots(\lambda - A_{nn})$, giving $-\mathrm{tr}(A)$ as the coefficient of $\lambda^{n-1}$. Hence $\sum_i \lambda_i = \mathrm{tr}(A)$.

Proof (determinant). $p(0) = \det(-A) = (-1)^n \det(A)$. Also $p(0) = \prod_i (0 - \lambda_i) = (-1)^n \prod_i \lambda_i$. Hence $\det(A) = \prod_i \lambda_i$.

These identities provide quick sanity checks for eigenvalue computations and useful theoretical tools.

For AI: $\mathrm{tr}(\nabla^2 \mathcal{L}) = \sum_i \lambda_i$ is the total curvature of the loss surface; it can be estimated cheaply via Hutchinson's trace estimator ($\mathrm{tr}(H) \approx \mathbf{z}^\top H \mathbf{z}$ averaged over random $\mathbf{z} \sim \mathcal{N}(0,I)$). $\log\det(C) = \sum_i \log\lambda_i$ appears in Gaussian log-likelihoods and normalising flow log-densities. And $\mathrm{tr}(A^\top A) = \sum_i \sigma_i^2 = \|A\|_F^2$; for symmetric PSD matrices, $\sigma_i = \lambda_i$.
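
Both identities, and the Hutchinson estimator, are cheap to check numerically (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
H = B.T @ B                              # symmetric PSD test matrix

eigvals = np.linalg.eigvalsh(H)
print(np.isclose(eigvals.sum(), np.trace(H)))        # sum of eigenvalues = trace
print(np.isclose(eigvals.prod(), np.linalg.det(H)))  # product = determinant

# Hutchinson: tr(H) ~ mean of z^T H z over random z ~ N(0, I)
zs = rng.standard_normal((10_000, 6))
est = np.mean(np.einsum('ij,jk,ik->i', zs, H, zs))   # z_i^T H z_i per sample
print(np.trace(H), est)                              # estimate is close
```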

4.2 Eigenvalues Under Matrix Operations

A powerful feature of eigenvalues: many matrix operations transform eigenvalues in simple, predictable ways - without changing the eigenvectors.

| Operation | Eigenvalues | Eigenvectors | Proof sketch |
| --- | --- | --- | --- |
| $\alpha A$ | $\alpha\lambda_i$ | same $\mathbf{v}_i$ | $(\alpha A)\mathbf{v} = \alpha(A\mathbf{v}) = \alpha\lambda\mathbf{v}$ |
| $A^k$ | $\lambda_i^k$ | same $\mathbf{v}_i$ | Induction: $A^k\mathbf{v} = A^{k-1}(\lambda\mathbf{v}) = \lambda^k\mathbf{v}$ |
| $A^{-1}$ (if it exists) | $1/\lambda_i$ | same $\mathbf{v}_i$ | $A^{-1}\mathbf{v} = A^{-1}(A\mathbf{v}/\lambda) = \mathbf{v}/\lambda$ |
| $p(A) = \sum_k c_k A^k$ | $p(\lambda_i)$ | same $\mathbf{v}_i$ | Linearity + the $A^k$ rule |
| $A^\top$ | same $\lambda_i$ | different (left eigenvectors of $A$) | $\det(\lambda I - A^\top) = \det(\lambda I - A)$ |
| $P^{-1}AP$ | same $\lambda_i$ | $P^{-1}\mathbf{v}_i$ | $\det(\lambda I - P^{-1}AP) = \det(\lambda I - A)$ |
| $A + \mu I$ | $\lambda_i + \mu$ | same $\mathbf{v}_i$ | $(A+\mu I)\mathbf{v} = \lambda\mathbf{v} + \mu\mathbf{v} = (\lambda+\mu)\mathbf{v}$ |

For AI: The eigenvalues of $I - \eta H$ are $1 - \eta\lambda_i(H)$ - this is the key formula for gradient descent convergence. The eigenvalues of $e^{At}$ are $e^{\lambda_i t}$ - this governs neural ODE dynamics. Preconditioning replaces $H$ with $P^{-1}H$ to give better-conditioned eigenvalues.
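
The table's rules, including the $I - \eta H$ shift used in gradient descent analysis, can be confirmed in a few lines (a sketch assuming NumPy; printed orders may vary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])               # eigenvalues 4 and 2
eta = 0.1

print(np.linalg.eigvals(np.eye(2) - eta * A))            # 1 - eta*lambda_i: 0.6, 0.8
print(np.linalg.eigvals(np.linalg.matrix_power(A, 3)))   # lambda_i^3: 64, 8
print(np.linalg.eigvals(np.linalg.inv(A)))               # 1/lambda_i: 0.25, 0.5
print(np.linalg.eigvals(A + 5.0 * np.eye(2)))            # lambda_i + mu: 9, 7
```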

4.3 Eigenvalues of Special Matrices

Symmetric matrices ($A = A^\top$): All eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal; always diagonalisable with an orthonormal eigenbasis. (Full proof in Section 6.)

Positive semidefinite ($A \succeq 0$): All $\lambda_i \geq 0$. Equivalent characterisations: $\mathbf{x}^\top A\mathbf{x} \geq 0$ for all $\mathbf{x}$; $A = B^\top B$ for some $B$; all principal minors non-negative.

Orthogonal ($Q^\top Q = I$): All $\lvert\lambda_i\rvert = 1$. Any real eigenvalues are $\pm 1$; over $\mathbb{C}$, the eigenvalues lie on the unit circle. Proof: $\|Q\mathbf{v}\| = \|\mathbf{v}\|$ (orthogonal maps preserve length) and $\|Q\mathbf{v}\| = |\lambda|\|\mathbf{v}\|$, so $|\lambda| = 1$.

Projection ($P^2 = P$, $P^\top = P$): Eigenvalues $\in \{0, 1\}$. The 1-eigenspace is $\mathrm{col}(P)$; the 0-eigenspace is $\mathrm{null}(P)$. Proof: $P\mathbf{v} = \lambda\mathbf{v}$ implies $P^2\mathbf{v} = \lambda P\mathbf{v} = \lambda^2\mathbf{v}$; but $P^2 = P$, so $\lambda^2 = \lambda$; hence $\lambda \in \{0,1\}$.

Row-stochastic ($A\mathbf{1} = \mathbf{1}$, $A_{ij} \geq 0$): By Perron-Frobenius, the dominant eigenvalue is $\lambda_1 = 1$ with eigenvector $\mathbf{1}$; all other eigenvalues satisfy $\lvert\lambda_i\rvert \leq 1$. The attention weight matrix $S = \mathrm{softmax}(QK^\top/\sqrt{d_k})$ is row-stochastic, so its largest eigenvalue is always 1.

4.4 The Rayleigh Quotient

For a symmetric matrix $A \in \mathbb{R}^{n \times n}$ and any non-zero $\mathbf{x} \in \mathbb{R}^n$, the Rayleigh quotient is:

$$R(A, \mathbf{x}) = \frac{\mathbf{x}^\top A \mathbf{x}}{\mathbf{x}^\top \mathbf{x}}$$

It is bounded between the smallest and largest eigenvalues:

$$\lambda_{\min}(A) \leq R(A, \mathbf{x}) \leq \lambda_{\max}(A)$$

Proof. Write $\mathbf{x} = \sum_i c_i \mathbf{q}_i$ in the orthonormal eigenbasis $\{\mathbf{q}_i\}$. Then $\mathbf{x}^\top A \mathbf{x} = \sum_i \lambda_i c_i^2$ and $\mathbf{x}^\top \mathbf{x} = \sum_i c_i^2$, giving $R = \frac{\sum_i \lambda_i c_i^2}{\sum_i c_i^2}$ - a weighted average of eigenvalues with non-negative weights $c_i^2/\sum_j c_j^2$ summing to 1. A weighted average is bounded by the min and max of its terms. Equality at $\mathbf{x} = \mathbf{q}_1$ gives $R = \lambda_{\max}$; at $\mathbf{x} = \mathbf{q}_n$ it gives $R = \lambda_{\min}$.

This gives the variational characterisation of extreme eigenvalues:

$$\lambda_{\max}(A) = \max_{\mathbf{x} \neq \mathbf{0}} R(A,\mathbf{x}) = \max_{\|\mathbf{x}\|=1} \mathbf{x}^\top A \mathbf{x} \qquad \lambda_{\min}(A) = \min_{\mathbf{x} \neq \mathbf{0}} R(A,\mathbf{x}) = \min_{\|\mathbf{x}\|=1} \mathbf{x}^\top A \mathbf{x}$$

For AI: The PCA objective - find the direction of maximum variance - is exactly $\max_{\|\mathbf{w}\|=1} \mathbf{w}^\top C \mathbf{w} = \lambda_{\max}(C)$, achieved at $\mathbf{w} = \mathbf{q}_1$. The curvature of the loss in direction $\mathbf{d}$ is $R(\nabla^2 \mathcal{L}, \mathbf{d})$; the maximum curvature is $\lambda_{\max}(H)$; the minimum is $\lambda_{\min}(H)$. Rayleigh quotient iteration refines an eigenvector estimate $\mathbf{x}_k$ by shifting: solve $(A - R_k I)\mathbf{y} = \mathbf{x}_k$ and renormalise - this gives cubic convergence to the nearest eigenvector.
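
The bounds and their maximisers are easy to confirm empirically (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                        # symmetric test matrix
eigvals, Q = np.linalg.eigh(A)           # ascending eigenvalues, orthonormal Q

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

# every non-zero vector lands between lambda_min and lambda_max
for _ in range(1_000):
    r = rayleigh(A, rng.standard_normal(5))
    assert eigvals[0] - 1e-12 <= r <= eigvals[-1] + 1e-12

print(rayleigh(A, Q[:, -1]), eigvals[-1])   # maximiser: top eigenvector
print(rayleigh(A, Q[:, 0]), eigvals[0])     # minimiser: bottom eigenvector
```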

4.5 The Courant-Fischer Min-Max Theorem

The Rayleigh quotient gives $\lambda_{\max}$ and $\lambda_{\min}$. The Courant-Fischer theorem generalises this to all eigenvalues. For symmetric $A$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$:

$$\lambda_k = \max_{\dim(W)=k}\ \min_{\mathbf{x} \in W,\, \mathbf{x} \neq \mathbf{0}} R(A,\mathbf{x})$$

Equivalently, $\lambda_k$ is the best minimum Rayleigh quotient you can guarantee by choosing a $k$-dimensional subspace $W$ optimally.

Weyl's inequalities give eigenvalue stability under perturbation: for symmetric $A$ and symmetric perturbation $E$:

$$\lambda_k(A) + \lambda_n(E) \leq \lambda_k(A+E) \leq \lambda_k(A) + \lambda_1(E)$$

In particular, $|\lambda_k(A+E) - \lambda_k(A)| \leq \|E\|_2 = \max_i |\lambda_i(E)|$. Eigenvalues of symmetric matrices are Lipschitz in the matrix entries with constant 1 (in spectral norm).

For AI: Weyl's inequalities bound how much eigenvalues shift when model weights are perturbed (e.g., by quantisation noise or gradient updates). If the perturbation $E$ has $\|E\|_2 \leq \epsilon$, then no eigenvalue moves by more than $\epsilon$. This is why spectral properties of trained models are relatively robust to small weight perturbations.

4.6 Gershgorin Circle Theorem

The Gershgorin Circle Theorem localises eigenvalues using only the matrix entries - no eigenvalue computation required.

Theorem. For $A \in \mathbb{C}^{n \times n}$, every eigenvalue lies in at least one Gershgorin disc:

$$D_i = \left\{\lambda \in \mathbb{C} : |\lambda - A_{ii}| \leq R_i\right\}, \qquad R_i = \sum_{j \neq i} |A_{ij}|$$

Each disc $D_i$ is centred at the diagonal entry $A_{ii}$ with radius equal to the sum of the absolute values of the off-diagonal entries in row $i$. All eigenvalues: $\sigma(A) \subseteq \bigcup_{i=1}^n D_i$.

Proof. Let $\lambda$ be an eigenvalue with eigenvector $\mathbf{v}$. Choose $i$ such that $|v_i| = \|\mathbf{v}\|_\infty$ (the largest component). From $(A - \lambda I)\mathbf{v} = \mathbf{0}$ at row $i$: $(A_{ii} - \lambda)v_i = -\sum_{j \neq i} A_{ij}v_j$. Taking absolute values and dividing by $|v_i| > 0$: $|A_{ii} - \lambda| \leq \sum_{j \neq i} |A_{ij}| \cdot |v_j/v_i| \leq \sum_{j \neq i}|A_{ij}| = R_i$. So $\lambda \in D_i$.

Corollary (diagonal dominance): If $|A_{ii}| > R_i$ for all $i$, then all Gershgorin discs exclude the origin, so $0 \notin \sigma(A)$ - the matrix is invertible.

For AI: Gershgorin circles give fast bounds on the eigenvalue range of attention matrices, Hessians, or weight matrices without computing eigenvalues. If a matrix is symmetric and all its discs lie in the right half-plane ($\mathrm{Re}(\lambda) > 0$), its eigenvalues are all positive, so it is positive definite. Useful for verifying that a discretised ODE system is stable before running the forward pass.
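
Computing the discs takes one line per row. A sketch assuming NumPy, using an arbitrary example matrix:

```python
import numpy as np

A = np.array([[ 4.0, 1.0,  0.5],
              [ 1.0, 3.0,  0.5],
              [ 0.0, 1.0, -2.0]])

centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)   # row sums minus diagonal

for c, r in zip(centres, radii):
    print(f"disc: centre {c}, radius {r}")

# every eigenvalue lies in at least one disc
for lam in np.linalg.eigvals(A):
    assert any(abs(lam - c) <= r + 1e-12 for c, r in zip(centres, radii))
```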


5. Diagonalisation

5.1 The Diagonalisation Theorem

Definition. A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalisable if there exists an invertible matrix $V$ and a diagonal matrix $\Lambda$ such that:

$$A = V\Lambda V^{-1}$$

where $V = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_n]$ has the eigenvectors as columns and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ has the corresponding eigenvalues on the diagonal.

Proof ($\Rightarrow$). If $A\mathbf{v}_i = \lambda_i\mathbf{v}_i$ for $i=1,\ldots,n$ and the $\mathbf{v}_i$ are linearly independent, then $AV = V\Lambda$ (column $i$ of $AV$ is $A\mathbf{v}_i = \lambda_i\mathbf{v}_i$, which equals column $i$ of $V\Lambda$). Since $V$ is invertible (independent columns): $A = V\Lambda V^{-1}$.

Proof ($\Leftarrow$). If $A = V\Lambda V^{-1}$ then $AV = V\Lambda$; column $j$ gives $A\mathbf{v}_j = \lambda_j\mathbf{v}_j$; the columns of $V$ are eigenvectors. Since $V$ is invertible, they are linearly independent.

Key insight: Diagonalisation is a change of basis. In the eigenbasis (coordinates given by $V^{-1}\mathbf{x}$), the matrix $A$ acts as pure coordinate-wise scaling by $\lambda_i$ - no mixing between different directions. This is why diagonalisable matrices are "simple": in their natural basis, they do nothing but stretch.

5.2 When Is A Diagonalisable?

Sufficient condition: If $A$ has $n$ distinct eigenvalues, it is diagonalisable. Proof: eigenvectors for distinct eigenvalues are linearly independent (Section 2.4); $n$ independent vectors in $\mathbb{R}^n$ form a basis; $V$ is invertible.

Necessary and sufficient: $A$ is diagonalisable if and only if $m_g(\lambda_i) = m_a(\lambda_i)$ for every eigenvalue $\lambda_i$. Equivalently: the eigenspaces together span all of $\mathbb{R}^n$.

Always diagonalisable:

  • Symmetric matrices $A = A^\top$ (Spectral Theorem, Section 6)
  • Any matrix with $n$ distinct eigenvalues
  • Normal matrices, satisfying $AA^\top = A^\top A$

Not necessarily diagonalisable:

  • Matrices with repeated eigenvalues where $m_g < m_a$ (defective)
  • The $2 \times 2$ matrix $\begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$ has $\lambda = 2$ with $m_a = 2$, $m_g = 1$ - defective

Non-examples of diagonalisability over $\mathbb{R}$:

  • Any real matrix with complex (non-real) eigenvalues (e.g., rotation matrices) cannot be diagonalised over $\mathbb{R}$, though it can be over $\mathbb{C}$

5.3 Matrix Powers via Diagonalisation

If $A = V\Lambda V^{-1}$, then matrix powers become trivial:

$$A^k = V\Lambda^k V^{-1}, \qquad \text{where } \Lambda^k = \mathrm{diag}(\lambda_1^k, \ldots, \lambda_n^k)$$

Proof. $A^2 = (V\Lambda V^{-1})(V\Lambda V^{-1}) = V\Lambda(V^{-1}V)\Lambda V^{-1} = V\Lambda^2 V^{-1}$. By induction: $A^k = V\Lambda^k V^{-1}$.

Computational gain. Without diagonalisation: computing $A^k$ by repeated matrix multiplication costs $O(n^3 \log k)$ (via fast exponentiation). With diagonalisation: once $V$ and $\Lambda$ are known, $A^k \mathbf{x} = V\Lambda^k V^{-1}\mathbf{x}$ costs $O(n^2)$ per query (three matrix-vector products) for any $k$.

Worked example - Fibonacci numbers. The Fibonacci recurrence $F_{k+1} = F_k + F_{k-1}$ is a linear system: $\begin{pmatrix} F_{k+1} \\ F_k \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^k \begin{pmatrix} 1 \\ 0 \end{pmatrix}$. Diagonalising $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ gives eigenvalues $\phi = (1+\sqrt{5})/2 \approx 1.618$ (the golden ratio) and $\hat\phi = (1-\sqrt{5})/2 \approx -0.618$. Therefore $F_k = (\phi^k - \hat\phi^k)/\sqrt{5}$ - Binet's formula. The dominant eigenvalue $\phi$ controls asymptotic growth: $F_k \approx \phi^k/\sqrt{5}$.

For AI: Matrix powers appear in RNN unrolling ($\mathbf{h}_t = W^t \mathbf{h}_0 + \ldots$), Markov chain mixing ($\pi_t = A^t \pi_0$), and computing the influence of past inputs on current hidden states. The eigendecomposition reveals which "modes" persist long-term (large $\lvert\lambda_i\rvert$) and which decay quickly (small $\lvert\lambda_i\rvert$).
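
Binet's formula drops straight out of the eigendecomposition. A sketch assuming NumPy, with fib a hypothetical helper:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
eigvals, V = np.linalg.eig(A)              # phi and phi-hat
V_inv = np.linalg.inv(V)

def fib(k):
    # A^k = V diag(lambda_i^k) V^{-1}; its (0, 1) entry is F_k
    Ak = V @ np.diag(eigvals ** k) @ V_inv
    return int(round(float(Ak[0, 1])))

print([fib(k) for k in range(1, 11)])      # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```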

5.4 Matrix Functions via Diagonalisation

For a diagonalisable matrix $A = V\Lambda V^{-1}$ and any function $f: \mathbb{C} \to \mathbb{C}$, define:

$$f(A) = V\, f(\Lambda)\, V^{-1} = V\, \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_n))\, V^{-1}$$

This is the natural extension of scalar functions to matrices: apply $f$ to each eigenvalue.

Matrix exponential: $\exp(A) = V\,\mathrm{diag}(e^{\lambda_1},\ldots,e^{\lambda_n})\,V^{-1}$. The solution to $d\mathbf{x}/dt = A\mathbf{x}$ is $\mathbf{x}(t) = \exp(At)\mathbf{x}_0$. For AI: neural ODEs use $\exp(At)$ to propagate hidden states through continuous time; the stability of the ODE is governed by whether $\mathrm{Re}(\lambda_i) < 0$ for all $i$.

Matrix square root: $A^{1/2} = V\,\mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})\,V^{-1}$ - requires $\lambda_i \geq 0$ (valid for PSD matrices). Used in preconditioning: the natural gradient update involves $F^{-1/2}$, where $F$ is the Fisher information matrix.

Log-determinant: $\log\det(A) = \mathrm{tr}(\log A) = \sum_i \log\lambda_i$. Appears in: Gaussian log-likelihoods $-\frac{1}{2}\log\det(\Sigma) - \frac{1}{2}(\mathbf{x}-\mu)^\top\Sigma^{-1}(\mathbf{x}-\mu)$; normalising flow log-densities; variational inference ELBO terms. Efficient estimation via stochastic trace estimators when $n$ is large.

Cayley-Hamilton theorem: Every matrix satisfies its own characteristic polynomial: $p(A) = 0$. This means $A^n$ can be written as a linear combination of $I, A, \ldots, A^{n-1}$ - matrix polynomials can always be reduced to degree at most $n-1$.
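
All of these functions reduce to "apply $f$ to the eigenvalues". A sketch assuming NumPy and SciPy (expm and sqrtm serve as reference implementations):

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def matfunc(A, f):
    """f(A) for symmetric A via the eigendecomposition A = Q diag(lam) Q^T."""
    lam, Q = np.linalg.eigh(A)
    return Q @ np.diag(f(lam)) @ Q.T

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                  # symmetric PD: eigenvalues 3 and 1

print(np.allclose(matfunc(A, np.exp), expm(A)))     # matrix exponential
print(np.allclose(matfunc(A, np.sqrt), sqrtm(A)))   # matrix square root (PSD only)
print(np.isclose(np.log(np.linalg.eigvalsh(A)).sum(),
                 np.linalg.slogdet(A)[1]))          # log-det = sum of log-eigenvalues
```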

5.5 Change of Basis Interpretation

Diagonalisation $A = V\Lambda V^{-1}$ encodes a three-step computation:

DIAGONALISATION AS CHANGE OF BASIS
========================================================================

  Input x  -->  V^{-1} x  -->  Λ (V^{-1} x)  -->  V Λ V^{-1} x  =  A x
              (express in     (scale each       (convert back
               eigenbasis)     coordinate        to standard
                               by λ_i)            basis)

  In the eigenbasis, A acts as pure scaling - no coupling!

========================================================================

This is the fundamental reason diagonalisable matrices are tractable: in the right coordinate system, they decouple completely. The $i$-th eigenvector direction is scaled by $\lambda_i$ and is independent of all other directions. No information leaks between eigenvector components.

For AI: PCA is precisely this change of basis. The eigenvectors $\mathbf{q}_1, \ldots, \mathbf{q}_d$ of the covariance matrix $C$ form a new coordinate system (the principal component basis). In this basis, $C$ is diagonal ($C = Q\Lambda Q^\top$, so $Q^\top C Q = \Lambda$): the features are uncorrelated. Each principal component direction is independent, scaled by its variance $\lambda_i$. Running PCA = expressing your data in the eigenbasis of its covariance.

5.6 Simultaneous Diagonalisation

Two symmetric matrices $A$ and $B$ are simultaneously diagonalisable - they share the same orthonormal eigenbasis - if and only if $AB = BA$ (they commute).

Proof ($\Rightarrow$). If $A = Q\Lambda_A Q^\top$ and $B = Q\Lambda_B Q^\top$ with the same $Q$: $AB = Q\Lambda_A Q^\top Q\Lambda_B Q^\top = Q\Lambda_A\Lambda_B Q^\top = Q\Lambda_B\Lambda_A Q^\top = BA$.

Proof ($\Leftarrow$). If $AB = BA$ and $A\mathbf{v} = \lambda\mathbf{v}$: $A(B\mathbf{v}) = B(A\mathbf{v}) = \lambda(B\mathbf{v})$, so $B\mathbf{v}$ is also in the $\lambda$-eigenspace of $A$. Thus $B$ maps each eigenspace of $A$ into itself. On each eigenspace, $B$ restricts to a symmetric operator and can therefore be diagonalised there. Choose an orthonormal basis for each eigenspace that also diagonalises $B$. Together these form a simultaneous eigenbasis.

For AI: Simultaneous diagonalisation appears in the analysis of optimisers. If the Hessian $H$ and the Fisher information matrix $F$ commute, they share an eigenbasis - in this basis, the natural gradient $F^{-1}\nabla\mathcal{L}$ is a simple eigenvalue-rescaled version of the gradient. K-FAC (Kronecker-Factored Approximate Curvature) builds a structured approximation to $F$ that is easier to diagonalise, enabling efficient natural gradient updates. Attention heads whose key-query weight matrices commute have aligned eigenspaces, simplifying their circuit analysis.


6. The Spectral Theorem

6.1 Statement and Significance

The Spectral Theorem is the crown jewel of applied linear algebra. It guarantees that symmetric matrices have a perfect, complete structure - one that makes them tractable both theoretically and computationally.

Real Spectral Theorem. For any symmetric matrix $A = A^\top \in \mathbb{R}^{n \times n}$:

  1. All eigenvalues are real: $\lambda_i \in \mathbb{R}$.
  2. Eigenvectors for distinct eigenvalues are orthogonal: $\lambda_i \neq \lambda_j \Rightarrow \mathbf{q}_i \perp \mathbf{q}_j$.
  3. There always exist $n$ orthonormal eigenvectors, regardless of multiplicities.
  4. $A$ is orthogonally diagonalisable: $A = Q\Lambda Q^\top$, where $Q$ is orthogonal ($Q^\top Q = QQ^\top = I$) and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is real diagonal.

The significance cannot be overstated. For a general matrix, diagonalisation may fail (the defective case) or require a non-orthogonal $V$ (so $V^{-1} \neq V^\top$, making computations expensive). For symmetric matrices, diagonalisation always works, and $V^{-1} = V^\top$ (free, numerically stable). The eigenbasis is orthonormal - the most natural and well-conditioned basis possible.

Why symmetry matters for AI: Covariance matrices, Gram matrices, Hessians at local minima, graph Laplacians, and kernel matrices are all symmetric PSD. The Spectral Theorem applies to all of them, guaranteeing real non-negative eigenvalues and an orthonormal eigenbasis. This is why PCA, spectral clustering, and natural gradient methods are well-defined and numerically stable.

6.2 Proof Sketch

All eigenvalues real. Let $A\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v} \in \mathbb{C}^n$. Take the complex inner product: $\bar{\mathbf{v}}^\top A\mathbf{v} = \lambda \bar{\mathbf{v}}^\top\mathbf{v} = \lambda\|\mathbf{v}\|^2$. Also, $\bar{\mathbf{v}}^\top A\mathbf{v} = (A\bar{\mathbf{v}})^\top\mathbf{v} = \overline{A\mathbf{v}}^\top\mathbf{v} = \bar{\lambda}\bar{\mathbf{v}}^\top\mathbf{v} = \bar{\lambda}\|\mathbf{v}\|^2$ (using $A = A^\top$ real). So $\lambda = \bar{\lambda}$, meaning $\lambda$ is real. $\square$

Eigenvectors for distinct eigenvalues are orthogonal. Let $A\mathbf{v}_1 = \lambda_1\mathbf{v}_1$ and $A\mathbf{v}_2 = \lambda_2\mathbf{v}_2$ with $\lambda_1 \neq \lambda_2$. Then:

$$\lambda_1\langle\mathbf{v}_1,\mathbf{v}_2\rangle = \langle A\mathbf{v}_1,\mathbf{v}_2\rangle = \langle\mathbf{v}_1,A\mathbf{v}_2\rangle = \lambda_2\langle\mathbf{v}_1,\mathbf{v}_2\rangle$$

(using $\langle A\mathbf{u},\mathbf{v}\rangle = \langle\mathbf{u},A\mathbf{v}\rangle$ - the self-adjointness of $A$). Since $\lambda_1 \neq \lambda_2$: $\langle\mathbf{v}_1,\mathbf{v}_2\rangle = 0$. $\square$

Existence of an orthonormal eigenbasis. By induction on $n$. $A$ has at least one real eigenvalue $\lambda_1$ (proved above). Let $\mathbf{q}_1$ be a unit eigenvector. Consider the restriction of $A$ to $\mathbf{q}_1^\perp$ (the orthogonal complement). This restriction is again symmetric, since $A$ maps $\mathbf{q}_1^\perp$ to itself: if $\mathbf{x} \perp \mathbf{q}_1$ then $\langle A\mathbf{x}, \mathbf{q}_1\rangle = \langle\mathbf{x}, A\mathbf{q}_1\rangle = \lambda_1\langle\mathbf{x},\mathbf{q}_1\rangle = 0$. Apply the induction hypothesis to this $(n-1)$-dimensional restriction. $\square$

6.3 The Spectral Decomposition

From $A = Q\Lambda Q^\top$ with columns $\mathbf{q}_1,\ldots,\mathbf{q}_n$:

$$A = \sum_{i=1}^n \lambda_i\, \mathbf{q}_i \mathbf{q}_i^\top$$

Each term $\lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ is a rank-1 symmetric matrix - a scaled orthogonal projection onto $\mathrm{span}\{\mathbf{q}_i\}$. The full matrix $A$ is a weighted sum of $n$ rank-1 orthogonal projections, with weights equal to the eigenvalues.

The projectors $P_i = \mathbf{q}_i\mathbf{q}_i^\top$ satisfy: $P_i^2 = P_i$ (projection); $P_i^\top = P_i$ (symmetric); $P_iP_j = 0$ for $i \neq j$ (orthogonality); $\sum_i P_i = I$ (completeness). These are the building blocks of the spectral decomposition.

For repeated eigenvalues: If $\lambda$ has multiplicity $k$ with orthonormal eigenvectors $\mathbf{q}_1,\ldots,\mathbf{q}_k$, the contribution is $\lambda \sum_{j=1}^k \mathbf{q}_j\mathbf{q}_j^\top = \lambda P_{E(\lambda)}$ - the eigenvalue times the orthogonal projection onto the eigenspace.

For AI: The spectral decomposition of the covariance matrix $C = \sum_i \lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ reveals PCA directly: each term $\lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ is one principal component scaled by its variance. Keeping the top-$k$ terms gives the best rank-$k$ approximation to $C$ (Eckart-Young). The spectral decomposition of the Hessian $H = \sum_i \lambda_i\mathbf{q}_i\mathbf{q}_i^\top$ decomposes the loss curvature into independent directions: gradient descent along $\mathbf{q}_i$ converges at rate $(1-\eta\lambda_i)^t$, independently of all other directions.
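
The decomposition and the projector identities can be verified directly (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                          # symmetric

lam, Q = np.linalg.eigh(A)
P = [np.outer(Q[:, i], Q[:, i]) for i in range(4)]   # rank-1 projectors q_i q_i^T

# A is the eigenvalue-weighted sum of the projectors
print(np.allclose(A, sum(l * p for l, p in zip(lam, P))))

# idempotent, symmetric, mutually orthogonal, complete
print(np.allclose(P[0] @ P[0], P[0]), np.allclose(P[0].T, P[0]))
print(np.allclose(P[0] @ P[1], np.zeros((4, 4))))
print(np.allclose(sum(P), np.eye(4)))
```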

6.4 Positive (Semi)definite Matrices via the Spectral Theorem

A symmetric matrix AA is positive semidefinite (PSD, written A0A \succeq 0) iff all eigenvalues are non-negative; positive definite (PD, A0A \succ 0) iff all eigenvalues are strictly positive.

Proof. Using the spectral decomposition: xAx=x ⁣(iλiqiqi) ⁣x=iλi(qix)2\mathbf{x}^\top A\mathbf{x} = \mathbf{x}^\top\!\left(\sum_i \lambda_i\mathbf{q}_i\mathbf{q}_i^\top\right)\!\mathbf{x} = \sum_i \lambda_i(\mathbf{q}_i^\top\mathbf{x})^2. This is a sum of λi\lambda_i weighted by non-negative squares (qix)2(\mathbf{q}_i^\top\mathbf{x})^2. The sum is 0\geq 0 for all x\mathbf{x} iff all λi0\lambda_i \geq 0; strictly >0> 0 for all x0\mathbf{x} \neq \mathbf{0} iff all λi>0\lambda_i > 0.

Equivalent characterisations of PSD:

  • A=BBA = B^\top B for some matrix BB (Gram matrix form)
  • All principal minors are non-negative
  • Cholesky decomposition A=LLA = LL^\top exists (with positive diagonal LL for PD)

Condition number: For PD matrix AA: κ2(A)=λmax/λmin\kappa_2(A) = \lambda_{\max}/\lambda_{\min}. Measures how elongated the ellipsoid {x:xAx=1}\{\mathbf{x}: \mathbf{x}^\top A\mathbf{x} = 1\} is. Large κ2\kappa_2: ill-conditioned; gradient descent converges slowly.

For AI: Covariance matrices and Gram matrices are always PSD (they are of the form XXX^\top X). The Hessian at a local minimum is PSD; at a saddle point, it has both positive and negative eigenvalues. Checking PSD via eigenvalues is the standard diagnostic: if np.linalg.eigvalsh(H).min() < 0, the critical point is not a local minimum.
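
A short numpy sanity check of the quadratic-form identity, using a random Gram matrix as an arbitrary PSD example:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
A = B.T @ B                        # Gram matrix form, hence PSD by construction

lam, Q = np.linalg.eigh(A)
assert lam.min() >= -1e-10         # all eigenvalues non-negative up to round-off

# x^T A x = sum_i lambda_i (q_i^T x)^2 for any x
x = rng.standard_normal(3)
assert np.isclose(x @ A @ x, np.sum(lam * (Q.T @ x) ** 2))
```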

6.5 The Complex Spectral Theorem

For complex Hermitian matrices A=A=AˉA = A^* = \bar{A}^\top (conjugate transpose), the same theorem holds over C\mathbb{C}: all eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal under the complex inner product; AA is unitarily diagonalisable: A=UΛUA = U\Lambda U^* where UU is unitary (UU=IU^*U = I).

For AI: Quantum computing represents observables as Hermitian operators; measurement outcomes are eigenvalues (always real). Complex-valued neural networks and unitary RNNs use Hermitian or unitary weight matrices to ensure stable, norm-preserving dynamics. The Fourier transform is a unitary operator whose eigenvalues are complex roots of unity - the eigenfunctions are pure sinusoids.


7. Jordan Normal Form

7.1 Motivation - Defective Matrices

Not every matrix is diagonalisable. When mg(λi)<ma(λi)m_g(\lambda_i) < m_a(\lambda_i) for some eigenvalue, we lack enough eigenvectors to form a basis. The Jordan Normal Form (JNF) is the canonical replacement for diagonalisation - it works for every square matrix over C\mathbb{C}, whether diagonalisable or not.

The JNF tells us exactly what the "obstruction" to diagonalisation is: instead of a pure scaling λ\lambda in each coordinate, we get a scaling λ\lambda plus a "coupling" to the next coordinate - a Jordan block.

7.2 Jordan Blocks

A Jordan block of size kk for eigenvalue λ\lambda is:

Jk(λ)=(λ1000λ1000λ1000λ)Rk×kJ_k(\lambda) = \begin{pmatrix} \lambda & 1 & 0 & \cdots & 0 \\ 0 & \lambda & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda & 1 \\ 0 & 0 & \cdots & 0 & \lambda \end{pmatrix} \in \mathbb{R}^{k \times k}

Diagonal: all λ\lambda. Superdiagonal: all 1. Everything else: 0.

  • J1(λ)=[λ]J_1(\lambda) = [\lambda]: ordinary eigenvalue, non-defective.
  • Jk(λ)J_k(\lambda) for k>1k > 1: exactly one eigenvector e1=(1,0,,0)\mathbf{e}_1 = (1,0,\ldots,0)^\top; remaining k1k-1 vectors are generalised eigenvectors satisfying (AλI)vj=vj1(A-\lambda I)\mathbf{v}_j = \mathbf{v}_{j-1} for j=2,,kj = 2,\ldots,k.

7.3 The Jordan Normal Form

Theorem. For any ACn×nA \in \mathbb{C}^{n \times n}, there exists an invertible PP such that:

A=PJP1,J=(Jk1(λi1)Jk2(λi2))A = PJP^{-1}, \qquad J = \begin{pmatrix} J_{k_1}(\lambda_{i_1}) & & \\ & J_{k_2}(\lambda_{i_2}) & \\ & & \ddots \end{pmatrix}

JJ is block diagonal with Jordan blocks; jkj=n\sum_j k_j = n. For each eigenvalue λ\lambda: the number of Jordan blocks equals mg(λ)m_g(\lambda); the sizes are determined by the rank sequence of (AλI)k(A-\lambda I)^k.

The JNF is the unique canonical form for matrices up to similarity: two matrices are similar (B=PAP1B = PAP^{-1} for some invertible PP) iff they have the same Jordan form (up to reordering blocks).

7.4 Powers of Jordan Blocks

Jk(λ)t=((t0)λt(t1)λt1(t2)λt20(t0)λt(t1)λt100(t0)λt)J_k(\lambda)^t = \begin{pmatrix} \binom{t}{0}\lambda^t & \binom{t}{1}\lambda^{t-1} & \binom{t}{2}\lambda^{t-2} & \cdots \\ 0 & \binom{t}{0}\lambda^t & \binom{t}{1}\lambda^{t-1} & \cdots \\ \vdots & & \ddots & \\ 0 & \cdots & 0 & \binom{t}{0}\lambda^t \end{pmatrix}

The entries grow as (tj)λtjtjλtj/j!\binom{t}{j}\lambda^{t-j} \sim t^j \lambda^{t-j}/j! for fixed jj.

  • λ<1|\lambda| < 1: all entries 0\to 0 despite polynomial factor tk1t^{k-1} - stable convergence.
  • λ=1|\lambda| = 1, k>1k > 1: entries grow polynomially as tk1t^{k-1} - marginal instability.
  • λ>1|\lambda| > 1: exponential growth - unstable.

For AI: An RNN with ρ(W)=1\rho(W) = 1 exactly (at the stability boundary) is dangerous if WW has a Jordan block of size >1> 1 for the unit eigenvalue - the hidden state grows polynomially even though no eigenvalue exceeds 1. LSTM forget gates aim to keep the effective spectral radius close to 1 while avoiding non-trivial Jordan blocks. Gradient clipping and spectral normalisation are practical safeguards against this polynomial growth.
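
The polynomial growth at the stability boundary is easy to observe numerically - a minimal sketch with a size-3 Jordan block for eigenvalue 1:

```python
import numpy as np

# J_3(1): spectral radius exactly 1, but a non-trivial Jordan block
J = np.eye(3) + np.diag([1.0, 1.0], k=1)

# Entries of J^t grow polynomially even though no eigenvalue exceeds 1
for t in [1, 10, 100, 1000]:
    Jt = np.linalg.matrix_power(J, t)
    print(t, Jt[0, -1])            # top-right entry = binom(t, 2) ~ t^2/2
```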


8. Singular Value Decomposition

8.1 SVD as Generalisation of Eigendecomposition

Eigendecomposition A=VΛV1A = V\Lambda V^{-1} requires AA to be square and diagonalisable. The Singular Value Decomposition (SVD) generalises to any matrix ARm×nA \in \mathbb{R}^{m \times n}:

A=UΣVA = U\Sigma V^\top

where URm×mU \in \mathbb{R}^{m \times m} is orthogonal (left singular vectors), ΣRm×n\Sigma \in \mathbb{R}^{m \times n} is "diagonal" with non-negative entries σ1σ2σmin(m,n)0\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0 (singular values), and VRn×nV \in \mathbb{R}^{n \times n} is orthogonal (right singular vectors).

SVD always exists and is essentially unique: when the singular values are distinct, UU and VV are determined up to simultaneous sign flips of each pair (ui,vi)(\mathbf{u}_i, \mathbf{v}_i). It applies to rectangular matrices, rank-deficient matrices, and matrices with repeated or zero singular values - no assumptions needed.

8.2 Connection to Eigenvalues

SVD and eigendecomposition are deeply linked via the symmetric matrices AAA^\top A and AAAA^\top:

  • AARn×nA^\top A \in \mathbb{R}^{n \times n} is symmetric PSD; its eigenvalues are σi2\sigma_i^2; its orthonormal eigenvectors are the right singular vectors vi\mathbf{v}_i (columns of VV): AAvi=σi2viA^\top A\,\mathbf{v}_i = \sigma_i^2\,\mathbf{v}_i.
  • AARm×mAA^\top \in \mathbb{R}^{m \times m} is symmetric PSD; its non-zero eigenvalues are the same σi2\sigma_i^2; its orthonormal eigenvectors are the left singular vectors ui\mathbf{u}_i (columns of UU): AAui=σi2uiAA^\top\,\mathbf{u}_i = \sigma_i^2\,\mathbf{u}_i.
  • The connection: Avi=σiuiA\mathbf{v}_i = \sigma_i\mathbf{u}_i and Aui=σiviA^\top\mathbf{u}_i = \sigma_i\mathbf{v}_i.

For a square symmetric PSD matrix: singular values = eigenvalues; U=V=QU = V = Q (the orthogonal eigenvector matrix); SVD = eigendecomposition.
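
A numpy sketch of these identities on a random rectangular matrix (the size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A are the squared singular values
lam = np.linalg.eigvalsh(A.T @ A)          # ascending order
assert np.allclose(np.sort(s**2), lam)

# A v_i = sigma_i u_i for every singular triplet
for i in range(3):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])
```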

8.3 Thin and Truncated SVD

Full SVD: URm×mU \in \mathbb{R}^{m \times m}, ΣRm×n\Sigma \in \mathbb{R}^{m \times n}, VRn×nV \in \mathbb{R}^{n \times n}. Includes singular vectors for zero singular values (null space directions).

Thin (economy) SVD for mnm \geq n: UnRm×nU_n \in \mathbb{R}^{m \times n}, ΣnRn×n\Sigma_n \in \mathbb{R}^{n \times n} (square), VRn×nV \in \mathbb{R}^{n \times n}. Reconstructs AA exactly but with smaller factors: np.linalg.svd(A, full_matrices=False).

Truncated rank-rr SVD: Keep only the top rr singular triplets:

Ar=i=1rσiuivi=UrΣrVrA_r = \sum_{i=1}^r \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top = U_r\Sigma_r V_r^\top

Eckart-Young Theorem: ArA_r is the best rank-rr approximation to AA in both spectral norm and Frobenius norm:

AAr2=σr+1,AArF=σr+12++σmin(m,n)2\|A - A_r\|_2 = \sigma_{r+1}, \qquad \|A - A_r\|_F = \sqrt{\sigma_{r+1}^2 + \cdots + \sigma_{\min(m,n)}^2}

No other rank-rr matrix is closer to AA. This is the mathematical foundation for LoRA, PCA, image compression, and recommender systems.
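
A compact numpy check of both error formulas (random matrix; the truncation rank r=5 is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # best rank-r approximation

# Spectral-norm error is exactly sigma_{r+1}; Frobenius error is the
# root-sum-square of the discarded singular values
assert np.isclose(np.linalg.norm(A - A_r, 2), s[r])
assert np.isclose(np.linalg.norm(A - A_r, "fro"), np.sqrt(np.sum(s[r:] ** 2)))
```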

8.4 Geometric Interpretation

The SVD A=UΣVA = U\Sigma V^\top decomposes any linear map into three pure operations:

SVD GEOMETRIC DECOMPOSITION
========================================================================

  x  --> V^Tx  ------> \Sigma(V^Tx)  ------> U \Sigma V^T x  =  Ax
     (rotate/       (scale each      (rotate/reflect
      reflect        coordinate       into output
       input          by \sigma_i)          space)
      basis)

  Unit sphere --> (same orientation) --> ellipsoid
                                         semi-axes: \sigma_1u_1, \sigma_2u_2, ...

========================================================================

The unit sphere in Rn\mathbb{R}^n maps to an ellipsoid in Rm\mathbb{R}^m with semi-axes σiui\sigma_i\mathbf{u}_i. The singular values are the lengths of the ellipsoid axes; the right singular vectors vi\mathbf{v}_i are the input directions that map to those axes; the left singular vectors ui\mathbf{u}_i are the output directions.

8.5 SVD and the Four Fundamental Subspaces

SVD reveals all four fundamental subspaces of ARm×nA \in \mathbb{R}^{m \times n} with rank rr simultaneously:

SubspaceBasisSpace
Row space row(A)\text{row}(A)v1,,vr\mathbf{v}_1,\ldots,\mathbf{v}_r (columns of VrV_r)Rn\mathbb{R}^n
Null space null(A)\text{null}(A)vr+1,,vn\mathbf{v}_{r+1},\ldots,\mathbf{v}_nRn\mathbb{R}^n
Column space col(A)\text{col}(A)u1,,ur\mathbf{u}_1,\ldots,\mathbf{u}_r (columns of UrU_r)Rm\mathbb{R}^m
Left null space null(A)\text{null}(A^\top)ur+1,,um\mathbf{u}_{r+1},\ldots,\mathbf{u}_mRm\mathbb{R}^m

The orthogonality of UU and VV gives explicit orthogonal decompositions: Rn=row(A)null(A)\mathbb{R}^n = \text{row}(A) \oplus \text{null}(A) and Rm=col(A)null(A)\mathbb{R}^m = \text{col}(A) \oplus \text{null}(A^\top), with orthonormal bases for each subspace directly from VV and UU.
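
All four subspaces fall out of a single call to np.linalg.svd - a sketch with an explicitly rank-2 matrix built for the purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # 5x4, rank 2

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))          # numerical rank

row_space  = Vt[:r].T               # v_1, ..., v_r
null_space = Vt[r:].T               # v_{r+1}, ..., v_n
col_space  = U[:, :r]               # u_1, ..., u_r
left_null  = U[:, r:]               # u_{r+1}, ..., u_m

assert np.allclose(A @ null_space, 0)       # null space is annihilated by A
assert np.allclose(A.T @ left_null, 0)      # left null space annihilated by A^T
```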

For AI: LoRA represents weight updates as ΔW=BA\Delta W = BA with BRm×rB \in \mathbb{R}^{m \times r}, ARr×nA \in \mathbb{R}^{r \times n}, rmin(m,n)r \ll \min(m,n). The SVD of ΔW\Delta W has at most rr non-zero singular values - it lives in an rr-dimensional subspace of weight space. The Eckart-Young theorem justifies this: if the true weight update is approximately low-rank (as empirically observed), LoRA captures most of its structure with far fewer parameters.


9. Eigendecomposition in AI Applications

9.1 Principal Component Analysis (PCA)

PCA is eigendecomposition of the sample covariance matrix, period. Given centred data X~Rn×d\tilde{X} \in \mathbb{R}^{n \times d} (subtract column means):

C=X~X~n1Rd×d,C=QΛQ,λ1λd0C = \frac{\tilde{X}^\top \tilde{X}}{n-1} \in \mathbb{R}^{d \times d}, \qquad C = Q\Lambda Q^\top, \quad \lambda_1 \geq \cdots \geq \lambda_d \geq 0

The eigenvectors q1,,qd\mathbf{q}_1, \ldots, \mathbf{q}_d are the principal components (directions of decreasing variance). The eigenvalues λi\lambda_i are the principal variances. The projection onto the top-kk components is:

Xpca=X~QkRn×k,Qk=[q1qk]X_\text{pca} = \tilde{X} Q_k \in \mathbb{R}^{n \times k}, \quad Q_k = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_k]

Fraction of variance explained by the top-kk components: i=1kλi/i=1dλi\sum_{i=1}^k \lambda_i / \sum_{i=1}^d \lambda_i. A scree plot of λi\lambda_i vs ii shows how quickly variance falls off - a sharp "elbow" indicates a natural low-rank structure.

Equivalently via SVD: X~=UΣV\tilde{X} = U\Sigma V^\top; principal components = columns of VV; principal variances = σi2/(n1)\sigma_i^2/(n-1).

For AI: PCA of word embedding matrices (e.g., GloVe, word2vec) reveals semantic axes - the top principal components correspond to major semantic dimensions (sentiment, formality, etc.). Activation PCA in transformer residual streams (Cunningham et al. 2023) identifies sparse, interpretable directions corresponding to concepts. PCA of weight matrices diagnoses training pathologies: a sudden drop in the effective rank of weight matrices signals representational collapse.
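
The whole pipeline is a few lines of numpy - a sketch on synthetic correlated data, mirroring Exercise 7:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # correlated data

Xc = X - X.mean(axis=0)                    # centre the columns
C = Xc.T @ Xc / (len(X) - 1)               # sample covariance

lam, Q = np.linalg.eigh(C)
lam, Q = lam[::-1], Q[:, ::-1]             # sort descending

X_pca = Xc @ Q[:, :2]                      # project onto top-2 components
print("variance explained:", lam[:2].sum() / lam.sum())

# SVD route gives identical principal variances
s = np.linalg.svd(Xc, compute_uv=False)
assert np.allclose(s**2 / (len(X) - 1), lam)
```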

9.2 Eigenvalues and Gradient Descent

For a quadratic loss L(θ)=12θHθbθ\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2}\boldsymbol{\theta}^\top H\boldsymbol{\theta} - \mathbf{b}^\top\boldsymbol{\theta} (the local approximation near any critical point), gradient descent with step size η\eta gives:

θt+1=θtη(Hθtb)=(IηH)θt+ηb\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta(H\boldsymbol{\theta}_t - \mathbf{b}) = (I - \eta H)\boldsymbol{\theta}_t + \eta\mathbf{b}

The error et=θtθ\mathbf{e}_t = \boldsymbol{\theta}_t - \boldsymbol{\theta}^* satisfies et+1=(IηH)et\mathbf{e}_{t+1} = (I-\eta H)\mathbf{e}_t. In the eigenbasis of HH (writing et=ici(t)qi\mathbf{e}_t = \sum_i c_i^{(t)}\mathbf{q}_i):

ci(t)=(1ηλi)tci(0)c_i^{(t)} = (1 - \eta\lambda_i)^t\, c_i^{(0)}

Each eigendirection decays independently. Convergence condition: 1ηλi<1|1 - \eta\lambda_i| < 1 for all ii \Leftrightarrow 0<η<2/λmax0 < \eta < 2/\lambda_{\max}.

Optimal step size: η=2/(λmax+λmin)\eta^* = 2/(\lambda_{\max} + \lambda_{\min}), giving convergence rate ρ=(κ1)/(κ+1)\rho = (\kappa - 1)/(\kappa + 1) where κ=λmax/λmin\kappa = \lambda_{\max}/\lambda_{\min} is the condition number.

Ill-conditioning: Large κ\kappa means slow convergence. Gradient descent zigzags: it overshoots in high-curvature directions (large λ\lambda) while barely moving in low-curvature ones (small λ\lambda). Preconditioning (multiply gradient by H1H^{-1} - Newton's method, or an approximation - Adam, K-FAC) aligns step sizes to eigenvalues and achieves κ=1\kappa = 1 in the ideal case.
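
The per-mode decay is directly observable - a sketch in the eigenbasis, with eigenvalues chosen to give condition number 100:

```python
import numpy as np

lam = np.array([0.1, 1.0, 10.0])           # Hessian eigenvalues, kappa = 100
eta = 2.0 / (lam.max() + lam.min())        # optimal fixed step size

c = np.ones(3)                             # initial error in the eigenbasis
for t in [0, 10, 100, 1000]:
    print(t, (1 - eta * lam) ** t * c)     # slowest decay: smallest eigenvalue
```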

9.3 Neural Network Hessian Spectrum

The Hessian 2L\nabla^2\mathcal{L} of a neural network loss at a critical point has a universal spectral structure observed empirically across architectures and datasets:

TYPICAL NEURAL NETWORK HESSIAN EIGENSPECTRUM
========================================================================

  Density
    |
    |####
    |########
    |###########                           -  -
    |################                   -        -
    +------------------------------------------------ \lambda
    0         small bulk           large outliers

  Bulk: \approx Marchenko-Pastur distribution (random matrix theory)
  Outliers: O(1)-O(100) eigenvalues, well-separated from bulk
  Ratio \kappa = \lambda_max / \lambda_min can be 10^6 or larger

========================================================================

  • Bulk eigenvalues: Dense cluster near 0; approximately Marchenko-Pastur distributed; correspond to weakly-learned or unlearned features.
  • Outlier eigenvalues: A small number (O(layers)O(\text{layers})) of large eigenvalues, well-separated from the bulk; correspond to the most-curved loss directions; dominate training dynamics.
  • Evolution during training: Outliers grow during early training as the model fits data; bulk shifts toward zero; the outlier/bulk gap widens.

2026 tools: PyHessian (Yao et al. 2020) estimates the top eigenvalues and eigenvectors of the Hessian using randomised Lanczos; HessianFlow provides gradient-free eigenvalue density estimation. Both are practical for networks with 10810^8 parameters.

9.4 Attention Eigenstructure

The scaled dot-product attention weight matrix S=softmax(QK/dk)Rn×nS = \text{softmax}(QK^\top/\sqrt{d_k}) \in \mathbb{R}^{n \times n} is row-stochastic: all entries are non-negative and each row sums to 1. By the Perron-Frobenius theorem: the dominant eigenvalue is λ1=1\lambda_1 = 1; all other eigenvalues satisfy λi1|\lambda_i| \leq 1; the right eigenvector for λ1\lambda_1 is the all-ones vector 1\mathbf{1} (each row sums to 1, so S1=1S\mathbf{1} = \mathbf{1}), and the corresponding left eigenvector is the stationary distribution of the Markov chain defined by SS.
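
A quick numpy check of this structure, using random logits (the 6x6 size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 6))

# Row-wise softmax gives a row-stochastic matrix
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(S.sum(axis=1), 1.0)

eigvals = np.linalg.eigvals(S)
assert np.isclose(np.abs(eigvals).max(), 1.0)    # dominant eigenvalue is 1
assert np.allclose(S @ np.ones(6), np.ones(6))   # right eigenvector: ones vector
```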

Softmax temperature effects on the spectrum:

  • High temperature (τ\tau \to \infty): S11/nS \to \mathbf{1}\mathbf{1}^\top/n (rank-1 uniform); eigenvalues =1,0,0,,0= 1, 0, 0, \ldots, 0. Completely diffuse attention.
  • Low temperature (τ0\tau \to 0): SS \to permutation matrix; eigenvalues = roots of unity on unit circle. Sharp, deterministic attention.
  • Intermediate: eigenvalues lie on a spectrum between these extremes; their distribution characterises head behaviour.

Mechanistic interpretability: Induction heads (heads that copy from the previous occurrence of a token) have attention matrices approximating a shifted identity - eigenvalues cluster near roots of unity. Name-mover heads have matrices with a few dominant eigenvalues corresponding to "this token attends to its referent." The eigenvalue structure is a fingerprint of the head's function.

9.5 Graph Laplacian and GNNs

For an undirected graph GG with nn nodes, adjacency matrix AA (symmetric, Aij=1A_{ij} = 1 if edge (i,j)(i,j) exists), and degree matrix D=diag(d1,,dn)D = \text{diag}(d_1,\ldots,d_n) where di=jAijd_i = \sum_j A_{ij}, the graph Laplacian is:

L=DAL = D - A

LL is symmetric PSD with eigenvalues 0=λ0λ1λn10 = \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_{n-1}.

Key eigenvalue facts:

  • λ0=0\lambda_0 = 0 always, with eigenvector 1=(1,,1)\mathbf{1} = (1,\ldots,1)^\top (constant on all nodes).
  • The number of zero eigenvalues equals the number of connected components.
  • λ1>0\lambda_1 > 0 iff the graph is connected; λ1\lambda_1 is the Fiedler value (algebraic connectivity): larger λ1\lambda_1 means better connectivity and faster mixing.

Spectral clustering: The bottom-kk eigenvectors q0,,qk1\mathbf{q}_0,\ldots,\mathbf{q}_{k-1} of LL embed each node ii into Rk\mathbb{R}^k via its row in Qk=[q0qk1]Q_k = [\mathbf{q}_0\mid\cdots\mid\mathbf{q}_{k-1}]. Running kk-means on these embeddings clusters the graph into kk groups whose nodes are well-connected internally and weakly connected to the rest - see the sketch below.
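
A minimal sketch on a toy graph - two triangles joined by one edge, so the natural 2-way split is unambiguous:

```python
import numpy as np

# Two triangles joined by a single bridge edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A
lam, Q = np.linalg.eigh(L)

assert np.isclose(lam[0], 0.0)        # lambda_0 = 0, constant eigenvector
fiedler = Q[:, 1]                     # Fiedler vector (algebraic connectivity)
print(np.sign(fiedler))               # sign pattern separates the two triangles
```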

Spectral GNNs: Graph convolution is defined spectrally as f(L)=Qf(Λ)Qf(L) = Qf(\Lambda)Q^\top - a function of LL's eigenvalues applied in the eigenbasis. Polynomial filters h(L)=k=0KckLkh(L) = \sum_{k=0}^K c_k L^k (ChebConv) approximate any spectral filter without computing the full eigendecomposition. This is the mathematical foundation of graph neural networks.

For AI (2026): Graph transformers use Laplacian eigenvectors as positional encodings - the eigenvectors encode the multi-scale geometric structure of the graph, serving as a canonical "position" for each node regardless of graph isomorphism. Standard in molecular property prediction (OGB benchmarks) and code analysis.

9.6 Spectral Analysis of Weight Matrices

WeightWatcher (Martin & Mahoney, 2019-2026) analyses the eigenvalue (or singular value) spectra of weight matrices to predict model quality and generalisation - without any test data.

Heavy-tailed self-regularisation: Well-trained models develop weight matrices whose eigenvalue distributions follow a power law: ρ(λ)λα\rho(\lambda) \sim \lambda^{-\alpha} with α[2,4]\alpha \in [2, 4] for the tail. This deviates from the Marchenko-Pastur distribution (expected for random Gaussian matrices) - the deviation indicates that the model has learned non-random structure.

Alpha metric: Fit a power law to the tail of the singular value distribution of each layer; α2\alpha \approx 2-44 indicates a well-trained layer; α>6\alpha > 6 or a purely MP distribution indicates an undertrained layer.

Stable rank: sr(W)=WF2/W22=iσi2/σ12\text{sr}(W) = \|W\|_F^2 / \|W\|_2^2 = \sum_i \sigma_i^2 / \sigma_1^2. Measures effective dimensionality. Higher stable rank = more uniform singular value distribution = richer representations. Decreasing stable rank during fine-tuning signals representational collapse.

For AI (2026): WeightWatcher deployed at scale to compare LLM checkpoints, guide layer-wise learning rate selection in fine-tuning (layers with lower α\alpha need more adaptation), and identify which layers to include in LoRA adaptation without any downstream evaluation.

9.7 Eigenvalues in Recurrent Networks

The linearised RNN update htWht1+Uxt\mathbf{h}_t \approx W\mathbf{h}_{t-1} + U\mathbf{x}_t is a linear dynamical system. After TT steps from h0\mathbf{h}_0:

hTWTh0+t=0T1WT1tUxt\mathbf{h}_T \approx W^T\mathbf{h}_0 + \sum_{t=0}^{T-1} W^{T-1-t} U\mathbf{x}_t

The gradient of the loss with respect to h0\mathbf{h}_0 involves (W)T(W^\top)^T. By eigendecomposition: WTρ(W)T\|W^T\| \approx \rho(W)^T:

  • ρ(W)<1\rho(W) < 1: gradients vanish exponentially; long-range dependencies are forgotten.
  • ρ(W)>1\rho(W) > 1: gradients explode; training diverges without clipping.
  • ρ(W)=1\rho(W) = 1: marginal stability; gradients preserved in magnitude, but Jordan blocks cause polynomial growth.

Orthogonal RNNs: Constrain WW to be orthogonal (WW=IW^\top W = I) so ρ(W)=1\rho(W) = 1 with all eigenvalues on the unit circle and no Jordan blocks. Used in unitary RNNs (uRNN, EURNN) for long-range sequence modelling.

LSTM/GRU gating: The forget gate ft=σ(Wfht1+Ufxt)\mathbf{f}_t = \sigma(W_f\mathbf{h}_{t-1} + U_f\mathbf{x}_t) dynamically controls the effective spectral radius of the recurrent dynamics: when ft1\mathbf{f}_t \approx 1, the eigenvalue is 1\approx 1 (perfect memory); when ft0\mathbf{f}_t \approx 0, the eigenvalue is 0\approx 0 (full reset). Gating allows the effective spectral radius to be context-dependent.

9.8 Diffusion Models and Score Functions

In a diffusion model, the forward process gradually adds Gaussian noise: q(xtx0)=N(αˉtx0,(1αˉt)I)q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\mathbf{x}_0, (1-\bar\alpha_t)I). Marginalising over data with covariance Σdata\Sigma_\text{data}, the covariance of xt\mathbf{x}_t is αˉtΣdata+(1αˉt)I\bar\alpha_t \Sigma_\text{data} + (1-\bar\alpha_t)I.

Spectral perspective: Each eigenvalue λi\lambda_i of the data covariance Σdata\Sigma_\text{data} is scaled to αˉtλi\bar\alpha_t\lambda_i and receives additive noise 1αˉt1-\bar\alpha_t. High-variance data directions (large λi\lambda_i) survive longer before being drowned by noise; low-variance directions are masked quickly. The score function xlogpt(x)\nabla_\mathbf{x}\log p_t(\mathbf{x}) at noise level tt effectively operates on the "surviving" eigenspectrum.

Optimal noise scheduling: The EDM framework (Karras et al. 2022) uses a continuous noise level σ\sigma parameterisation that sweeps uniformly across the log-eigenvalue spectrum of the data covariance - ensuring the network receives a balanced training signal at each noise level rather than spending too long on high-noise or low-noise regimes.

Flow matching (2022-2026): Constructs a straight-line path between noise distribution N(0,I)\mathcal{N}(0,I) and data distribution in the eigenbasis of the data covariance. Eigenvalue-aware interpolation reduces the number of function evaluations needed for accurate generation. Now the dominant paradigm for image and video generation at scale.


10. Generalised Eigenproblems

10.1 The Generalised Eigenvalue Problem

The standard eigenvalue problem Av=λvA\mathbf{v} = \lambda\mathbf{v} measures eigenvalues relative to the identity. The generalised eigenvalue problem measures them relative to a symmetric positive definite matrix BB:

Av=λBvA\mathbf{v} = \lambda B\mathbf{v}

For symmetric AA and SPD BB: generalised eigenvalues are all real; generalised eigenvectors are BB-orthogonal (viBvj=δij\mathbf{v}_i^\top B\mathbf{v}_j = \delta_{ij}); analogous to the standard spectral theorem.

Reduction to standard form. Cholesky-decompose B=LLB = LL^\top. Substituting w=Lv\mathbf{w} = L^\top\mathbf{v}: (L1AL)w=λw(L^{-1}AL^{-\top})\mathbf{w} = \lambda\mathbf{w}. The transformed matrix A~=L1AL\tilde{A} = L^{-1}AL^{-\top} is symmetric, so the standard spectral theorem applies. This is the numerically preferred approach (implemented in scipy.linalg.eigh(A, B)).
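
A sketch of both routes - the direct solver and the Cholesky reduction - on a random symmetric AA and SPD BB (sizes and seed arbitrary):

```python
import numpy as np
from scipy.linalg import cholesky, eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2            # symmetric
M = rng.standard_normal((4, 4)); B = M @ M.T + 4 * np.eye(4)  # SPD

# Direct generalised solver: A v = lambda B v
lam, V = eigh(A, B)
assert np.allclose(A @ V, B @ V @ np.diag(lam))
assert np.allclose(V.T @ B @ V, np.eye(4))     # eigenvectors are B-orthonormal

# Manual reduction via B = L L^T gives the same spectrum
Lc = cholesky(B, lower=True)
A_tilde = np.linalg.solve(Lc, np.linalg.solve(Lc, A.T).T)   # L^{-1} A L^{-T}
assert np.allclose(np.linalg.eigvalsh(A_tilde), lam)
```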

Variational characterisation. The kk-th generalised eigenvalue satisfies λk=min/max\lambda_k = \min/\max of the generalised Rayleigh quotient RB(A,x)=xAx/xBxR_B(A,\mathbf{x}) = \mathbf{x}^\top A\mathbf{x} / \mathbf{x}^\top B\mathbf{x} over BB-orthogonal subspaces - exactly as in Courant-Fischer.

10.2 AI Applications of Generalised Eigenproblem

Linear Discriminant Analysis (LDA). Find projection w\mathbf{w} maximising between-class variance relative to within-class variance: maxwwSBw/wSWw\max_\mathbf{w} \mathbf{w}^\top S_B\mathbf{w} / \mathbf{w}^\top S_W\mathbf{w} where SBS_B is the between-class scatter matrix and SWS_W is the within-class scatter. This is a generalised Rayleigh quotient; the optimal w\mathbf{w} is the top generalised eigenvector solving SBw=λSWwS_B\mathbf{w} = \lambda S_W\mathbf{w}.

Natural gradient. The natural gradient update θθηF1L\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta F^{-1}\nabla\mathcal{L} requires solving Fδθ=LF\delta\boldsymbol{\theta} = \nabla\mathcal{L} where FF is the Fisher information matrix. This is equivalent to finding the update direction that minimises loss while accounting for the geometry of the parameter manifold. K-FAC (Kronecker-Factored Approximate Curvature) approximates FF as a Kronecker product and diagonalises it layer-wise, enabling efficient natural gradient updates for neural networks.

Canonical Correlation Analysis (CCA). Find directions in two feature spaces that are maximally correlated. Reduces to a generalised eigenproblem involving cross-covariance matrices. Used in multimodal representation learning (aligning image and text embeddings) and multi-view learning.

10.3 Spectral Graph Positional Encodings for Transformers

Standard transformers use sinusoidal positional encodings (fixed functions of position index). For graph-structured data (molecules, code, knowledge graphs), position is not a linear index - it is a graph-theoretic notion.

Solution: Use the bottom-kk eigenvectors of the graph Laplacian LL as positional encodings. Node ii gets encoding PE(i)=[q1(i),,qk(i)]Rk\text{PE}(i) = [\mathbf{q}_1(i), \ldots, \mathbf{q}_k(i)] \in \mathbb{R}^k (its row in the eigenvector matrix). These encodings: (1) are permutation-equivariant - relabelling nodes relabels the encoding consistently; (2) capture multi-scale graph structure - low eigenvectors encode global structure, high eigenvectors encode local; (3) are provably more expressive than message-passing GNNs for distinguishing graph structures.

2026 status: Laplacian eigenvector positional encodings are standard in graph transformers for molecular property prediction (PCQM4Mv2 leaderboard), protein structure analysis, and code understanding graphs.


11. Eigenvalues and Stability Analysis

11.1 Discrete Dynamical Systems

The system xt+1=Axt\mathbf{x}_{t+1} = A\mathbf{x}_t with initial condition x0\mathbf{x}_0 has solution (for diagonalisable A=VΛV1A = V\Lambda V^{-1}):

xt=Atx0=VΛtV1x0=i=1nciλitvi\mathbf{x}_t = A^t\mathbf{x}_0 = V\Lambda^t V^{-1}\mathbf{x}_0 = \sum_{i=1}^n c_i\lambda_i^t\mathbf{v}_i

where c=V1x0\mathbf{c} = V^{-1}\mathbf{x}_0 collects the coordinates of the initial condition in the eigenbasis - the initial loading of each mode.
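
A numpy sketch of the modal solution against direct matrix powers (arbitrary stable 2x2 example):

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])                  # diagonalisable, rho(A) < 1

lam, V = np.linalg.eig(A)
x0 = np.array([1.0, 1.0])
c = np.linalg.solve(V, x0)                  # coefficients in the eigenbasis

t = 25
x_direct = np.linalg.matrix_power(A, t) @ x0
x_modal = (V * lam**t) @ c                  # sum_i c_i lambda_i^t v_i
assert np.allclose(x_direct, x_modal)
```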

Stability classification:

ConditionBehaviourAI example
ρ(A)<1\rho(A) < 1xt0\mathbf{x}_t \to \mathbf{0} (asymptotically stable)GD on strongly convex loss
ρ(A)=1\rho(A) = 1, Jordan size 1xt\mathbf{x}_t bounded (marginally stable)GD at critical step size
ρ(A)=1\rho(A) = 1, Jordan size >1> 1xt\mathbf{x}_t grows polynomiallyRNN at spectral radius = 1 with Jordan block
ρ(A)>1\rho(A) > 1xt\mathbf{x}_t grows exponentially (unstable)Exploding gradients, diverging GD

11.2 Continuous Dynamical Systems and Neural ODEs

The system dx/dt=Axd\mathbf{x}/dt = A\mathbf{x} has solution x(t)=eAtx0=icieλitvi\mathbf{x}(t) = e^{At}\mathbf{x}_0 = \sum_i c_i e^{\lambda_i t}\mathbf{v}_i.

Stability: Asymptotically stable iff Re(λi)<0\text{Re}(\lambda_i) < 0 for all ii; marginally stable iff Re(λi)0\text{Re}(\lambda_i) \leq 0 with Re(λi)=0\text{Re}(\lambda_i) = 0 eigenvalues having Jordan blocks of size 1; unstable if any Re(λi)>0\text{Re}(\lambda_i) > 0.

Neural ODEs (Chen et al. 2018) parameterise dh/dt=fθ(h,t)d\mathbf{h}/dt = f_\theta(\mathbf{h}, t) and use an ODE solver for the forward pass. Stability of the learned dynamics is controlled by the Jacobian fθ/h\partial f_\theta/\partial\mathbf{h} eigenvalues. Contractive neural ODEs enforce Re(λi)<0\text{Re}(\lambda_i) < 0 everywhere, guaranteeing convergence to a fixed point - used for stable generative models and physics-informed networks.

11.3 Spectral Normalisation

Spectral normalisation (Miyato et al. 2018) divides each weight matrix by its spectral norm (largest singular value) at every update step:

W^=W/σmax(W),σmax(W)=W2\hat{W} = W / \sigma_{\max}(W), \qquad \sigma_{\max}(W) = \|W\|_2

This enforces a Lipschitz constant 1\leq 1 for each linear layer. Applying it to all layers makes the entire network 1-Lipschitz: f(x)f(y)xy\|f(\mathbf{x}) - f(\mathbf{y})\| \leq \|\mathbf{x} - \mathbf{y}\|.

Efficient implementation: Estimate σmax\sigma_{\max} via one step of power iteration per training step - cost O(mn)O(mn) for an m×nm \times n matrix. Keep a running estimate u^,v^\hat{\mathbf{u}}, \hat{\mathbf{v}} and update: v^Wu^/Wu^\hat{\mathbf{v}} \leftarrow W^\top\hat{\mathbf{u}}/\|W^\top\hat{\mathbf{u}}\|, u^Wv^/Wv^\hat{\mathbf{u}} \leftarrow W\hat{\mathbf{v}}/\|W\hat{\mathbf{v}}\|; then σu^Wv^\sigma \approx \hat{\mathbf{u}}^\top W\hat{\mathbf{v}}.
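
A sketch of the running estimate; the helper name spectral_norm_estimate is ours, and far more iterations are used than the single per-step update described above, purely so the assertion converges:

```python
import numpy as np

def spectral_norm_estimate(W, u, n_iters=1):
    # Alternate v <- W^T u / ||.||, u <- W v / ||.||; then sigma ~ u^T W v
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v, u                     # estimate and carried-over u

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
u = rng.standard_normal(64)

sigma, u = spectral_norm_estimate(W, u, n_iters=100)
assert np.isclose(sigma, np.linalg.svd(W, compute_uv=False)[0], rtol=1e-3)

W_hat = W / sigma                           # spectrally normalised weight
```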

Used in: GAN discriminators (stable training; eliminates mode collapse); normalising flows (invertibility guarantee); safe RL (bounding policy Lipschitz constant); diffusion model discriminators (2026 standard).

11.4 Gradient Flow and Spectral Bias

The gradient flow (continuous-time limit of gradient descent) satisfies dθ/dt=L(θ)d\boldsymbol{\theta}/dt = -\nabla\mathcal{L}(\boldsymbol{\theta}). Near a quadratic minimum with Hessian H=QΛQH = Q\Lambda Q^\top, each eigendirection decays independently: d(Qe)/dt=Λ(Qe)d(Q^\top\mathbf{e})/dt = -\Lambda(Q^\top\mathbf{e}), so each component ci(t)=ci(0)eλitc_i(t) = c_i(0)e^{-\lambda_i t}.

Spectral bias (frequency principle): In the infinite-width Neural Tangent Kernel (NTK) regime, the gradient flow dynamics are governed by the NTK matrix Θ\Theta. Large NTK eigenvalues correspond to low-frequency target function components - these are learned first and fastest. Small eigenvalues correspond to high-frequency components - learned slowly or not at all. This explains the empirical observation that neural networks learn coarse structure before fine details, and are biased toward smooth functions regardless of architecture.


12. Common Mistakes

#MistakeWhy It's WrongFix
1Using det(AλI)=0\det(A - \lambda I) = 0 vs det(λIA)=0\det(\lambda I - A) = 0 and getting sign errors in the characteristic polynomialBoth define the same eigenvalues, but the polynomial differs by (1)n(-1)^n; using the wrong one gives wrong coefficients for non-eigenvalue computationsUse p(λ)=det(λIA)p(\lambda) = \det(\lambda I - A) consistently (monic convention); eigenvalues are roots of either
2Concluding a matrix is diagonalisable just because it has nn eigenvaluesRepeated eigenvalues can have fewer than nn independent eigenvectors (defective case)Check mg(λi)=ma(λi)m_g(\lambda_i) = m_a(\lambda_i) for each repeated eigenvalue; compute rank(AλiI)\text{rank}(A - \lambda_i I) to verify
3Applying eigendecomposition to non-square matricesThe equation Av=λvA\mathbf{v} = \lambda\mathbf{v} requires A:RnRnA: \mathbb{R}^n \to \mathbb{R}^n (same domain and codomain)Use SVD for rectangular matrices; eigenvalues only for square matrices
4Forgetting that real matrices can have complex eigenvaluesA real matrix with negative discriminant (Δ<0\Delta < 0) has complex conjugate eigenvalue pairs; they are still valid eigenvaluesAlways check the discriminant; complex eigenvalues are meaningful (e.g., rotation matrices)
5Claiming all eigenvalues of AA^\top differ from those of AAAA and AA^\top always have identical eigenvalues (same characteristic polynomial) but different eigenvectorsThe eigenvectors of AA^\top are the left eigenvectors of AA; eigenvalues are the same
6Confusing algebraic and geometric multiplicityAM counts polynomial roots; GM counts independent eigenvectors; they can differ for defective matricesAlways compute GM via dimnull(AλI)=nrank(AλI)\dim\,\text{null}(A - \lambda I) = n - \text{rank}(A - \lambda I); don't assume AM = GM
7Thinking eigenvectors are uniqueEigenvectors define a direction, not a specific vector; any non-zero scalar multiple is also an eigenvectorSpeak of eigenspaces (subspaces), not single eigenvectors; normalise when uniqueness is needed
8Applying the spectral theorem to non-symmetric matricesOnly symmetric (or Hermitian) matrices are guaranteed real eigenvalues and orthogonal eigenvectorsCheck A=AA = A^\top first; for general AA, use Schur decomposition or Jordan form
9Setting step size based on λmax\lambda_{\max} but using the wrong normGradient descent convergence requires η<2/λmax(H)\eta < 2/\lambda_{\max}(H) where HH is the Hessian, not the weight matrixCompute the Hessian or its spectral norm; do not confuse weight matrix eigenvalues with Hessian eigenvalues
10Concluding ρ(A)<1\rho(A) < 1 from tr(A)<n\text{tr}(A) < nTrace = sum of eigenvalues; small trace is compatible with some large and some negative eigenvaluesCompute ρ(A)=maxiλi\rho(A) = \max_i |\lambda_i| from the full spectrum; the trace alone does not bound individual eigenvalue magnitudes

13. Exercises

Exercise 1 * - Eigenpair Verification and 2×2 Computation

(a) Verify that v=(1,2)\mathbf{v} = (1, 2)^\top is an eigenvector of A=(3122)A = \begin{pmatrix}3 & 1\\2 & 2\end{pmatrix} and find the eigenvalue. Check using the trace and determinant.

(b) Compute both eigenvalues and eigenvectors of AA using the characteristic polynomial. Verify λ1+λ2=tr(A)\lambda_1 + \lambda_2 = \text{tr}(A) and λ1λ2=det(A)\lambda_1\lambda_2 = \det(A).

(c) Is AA diagonalisable? Construct VV and Λ\Lambda. Verify A=VΛV1A = V\Lambda V^{-1} numerically.

Exercise 2 * - Characteristic Polynomial and Eigenspaces

(a) Compute the characteristic polynomial of B=(410041002)B = \begin{pmatrix}4&1&0\\0&4&1\\0&0&2\end{pmatrix}.

(b) Find all eigenvalues and their algebraic multiplicities.

(c) For each eigenvalue, compute a basis for the eigenspace. Identify defective eigenvalues.

(d) What is tr(B)\text{tr}(B) and det(B)\det(B)? Verify against the eigenvalues.

Exercise 3 * - Diagonalisation

Let C=(5131)C = \begin{pmatrix}5&-1\\3&1\end{pmatrix}.

(a) Find eigenvalues and eigenvectors. Construct V=[v1v2]V = [\mathbf{v}_1|\mathbf{v}_2] and Λ\Lambda.

(b) Compute V1V^{-1} and verify C=VΛV1C = V\Lambda V^{-1}.

(c) Use diagonalisation to compute C10C^{10} efficiently. Compare against np.linalg.matrix_power(C, 10).

(d) Compute eC=VeΛV1e^C = Ve^{\Lambda}V^{-1}. Compare against scipy.linalg.expm(C).

Exercise 4 ** - Spectral Theorem

Let S=(4221)S = \begin{pmatrix}4&2\\2&1\end{pmatrix}.

(a) Verify SS is symmetric and find its eigenvalues. Are they real and non-negative?

(b) Find orthonormal eigenvectors. Construct the orthogonal matrix QQ and verify QQ=IQ^\top Q = I.

(c) Verify the spectral decomposition: S=λ1q1q1+λ2q2q2S = \lambda_1\mathbf{q}_1\mathbf{q}_1^\top + \lambda_2\mathbf{q}_2\mathbf{q}_2^\top.

(d) Is SS PSD? PD? Compute the condition number κ2(S)=λmax/λmin\kappa_2(S) = \lambda_{\max}/\lambda_{\min}.

Exercise 5 ** - Matrix Powers and Recurrence Relations

Consider the Fibonacci-like recurrence an+2=3an+12ana_{n+2} = 3a_{n+1} - 2a_n with a0=1a_0 = 1, a1=2a_1 = 2.

(a) Write the recurrence as xn+1=Axn\mathbf{x}_{n+1} = A\mathbf{x}_n where xn=(an+1,an)\mathbf{x}_n = (a_{n+1}, a_n)^\top.

(b) Find the eigenvalues and eigenvectors of AA. Diagonalise AA.

(c) Use An=VΛnV1A^n = V\Lambda^n V^{-1} to derive a closed-form formula for ana_n.

(d) Verify for n=0,1,2,3,4n = 0, 1, 2, 3, 4 numerically.

Exercise 6 ** - Rayleigh Quotient and PSD

Let M=(310131013)M = \begin{pmatrix}3&1&0\\1&3&1\\0&1&3\end{pmatrix}.

(a) Compute the eigenvalues of MM using np.linalg.eigvalsh. Verify MM is PD.

(b) Compute R(M,x)R(M, \mathbf{x}) for x=(1,0,0)\mathbf{x} = (1,0,0)^\top, (0,1,0)(0,1,0)^\top, (1,1,1)/3(1,1,1)^\top/\sqrt{3}, and a random unit vector. Verify λminRλmax\lambda_{\min} \leq R \leq \lambda_{\max}.

(c) Find the direction x\mathbf{x} that maximises R(M,x)R(M,\mathbf{x}) - confirm it is the top eigenvector.

(d) What is the condition number κ2(M)\kappa_2(M)? How does it affect the convergence rate of gradient descent on a quadratic loss with Hessian MM?

Exercise 7 *** - PCA from Scratch

Using the provided dataset XR100×5X \in \mathbb{R}^{100 \times 5}:

(a) Centre the data. Compute the sample covariance matrix C=X~X~/99C = \tilde{X}^\top\tilde{X}/99.

(b) Compute the eigendecomposition of CC. Sort eigenvalues in descending order.

(c) Project the data onto the top 2 principal components. Plot the 2D embedding.

(d) Compute the fraction of variance explained by 1, 2, 3 components. How many components explain 95% of variance?

(e) Verify that SVD of X~\tilde{X} gives the same principal components as the eigendecomposition of CC. Explain the connection: principal components = right singular vectors; principal variances = σi2/99\sigma_i^2/99.

Exercise 8 *** - Power Iteration and PageRank

(a) Implement power iteration for the dominant eigenpair of a random 5×55 \times 5 symmetric positive definite matrix AA. Track convergence of λkλtrue|\lambda_k - \lambda_{\text{true}}| vs iteration kk.

(b) Verify the convergence rate is approximately λ2/λ1|\lambda_2/\lambda_1| per iteration (compare theoretical to empirical).

(c) Build a small web graph (6 nodes) with a link matrix LL. Construct the row-stochastic PageRank transition matrix G=0.85L+0.1511/nG = 0.85 L + 0.15 \cdot \mathbf{1}\mathbf{1}^\top/n.

(d) Run power iteration on GG to find the PageRank vector (dominant eigenvector). Compare to np.linalg.eig(G.T).

(e) For attention: Construct a 4×44 \times 4 attention weight matrix SS (softmax of random logits). Compute its eigenvalues. Verify the dominant eigenvalue is 1. Plot the eigenvalue spectrum.


14. Why This Matters for AI (2026 Perspective)

ConceptWhere It Appears in AI/LLMsConcrete Method
Eigendecomposition of covarianceDimensionality reduction, embedding analysisPCA, Kernel PCA
Spectral radius ρ(W)\rho(W)RNN stability, gradient flow controlOrthogonal RNNs, spectral normalisation
Hessian eigenspectrumLoss landscape analysis, learning rate selectionPyHessian, K-FAC, SAM
Perron-Frobenius eigenvalue = 1Attention matrix structure, Markov chain mixingAttention head analysis, mechanistic interpretability
Low-rank eigenspace adaptationParameter-efficient fine-tuningLoRA, DoRA, GaLore
Graph Laplacian eigenvectorsGraph transformer positional encodingsLim et al. 2023, GRIT
SVD / singular valuesWeight compression, stable rank, generalisationWeightWatcher, model pruning
QR algorithmAll eigenvalue computations in numpy/scipynp.linalg.eigh, scipy.linalg.eig
Power iterationDominant eigenvalue estimation, PageRank, spectral normSpectral normalisation (Miyato 2018)
Condition number κ(H)\kappa(H)Gradient descent convergence ratePreconditioning, Adam, AdaGrad
Rayleigh quotientPCA objective, curvature measurementPCA, Fisher LDA, natural gradient
Heavy-tailed eigenvalue distributionModel quality without test dataWeightWatcher alpha metric
NTK eigenvaluesWhich features are learned first (spectral bias)NTK theory, frequency principle

15. Conceptual Bridge

Where This Section Sits

Eigenvalues and eigenvectors are the payoff for everything in Chapters 1 and 2. Vector spaces (Section 02-06) established the abstract setting - subspaces, span, dimension. Matrix rank (Section 02-05) measured the effective dimensionality of a linear map. Determinants (Section 02-04) provided the characteristic polynomial via det(λIA)=0\det(\lambda I - A) = 0. Row reduction (Section 02-03) is the tool for finding null spaces, hence eigenvectors. All of this machinery was built so that we could ask and answer the deepest question about a linear map: what are its natural axes, and how does it scale along them?

What This Section Unlocks

Eigenvalues open three major directions forward in this curriculum. First, Singular Value Decomposition (Section 03-02) is the direct generalisation: eigenvalues of AAA^\top A are the squared singular values; the eigenvectors of AAA^\top A and AAAA^\top are the right and left singular vectors. SVD extends eigendecomposition to rectangular matrices and gives the four fundamental subspaces with optimal orthonormal bases. Second, PCA (Section 03-03) is eigendecomposition of the covariance matrix in practice - the full pipeline from raw data to compressed, interpretable representations. Third, Positive Definite Matrices (Section 03-07) uses the spectral theorem to characterise the geometry of quadratic forms - every optimisation problem in Chapter 8 involves a positive definite Hessian.

The Larger Arc

Stepping back further: eigenvalues connect linear algebra to analysis, probability, and AI in ways that become clearer as the curriculum progresses. The Hessian eigenspectrum (Chapter 8, Optimization) controls whether gradient descent converges, how fast, and to what kind of minimum. The graph Laplacian eigenvalues (Chapter 11, Graph Theory) determine the topology of learned graph representations. The eigenvalues of the Fisher information matrix (Chapter 13, ML-Specific Math) define the natural gradient, the theoretically optimal update direction. The NTK eigenvalues (Chapter 14, Transformers) determine which target function components are learned first. Every time you see a covariance matrix, a Gram matrix, a weight matrix, or a Hessian - you are dealing with a system whose behaviour is ultimately governed by eigenvalues.

POSITION IN CURRICULUM
========================================================================

  02-06 Vector Spaces          02-05 Matrix Rank
  (subspaces, basis)           (null space, column space)
          |                              |
          +--------------+---------------+
                         v
              03-01 EIGENVALUES & EIGENVECTORS   <-- YOU ARE HERE
              (characteristic polynomial,
               diagonalisation, spectral theorem,
               Jordan form, SVD connection)
                         |
          +--------------+-----------------------+
          v              v                        v
    03-02 SVD      03-03 PCA              03-07 Positive
    (singular      (covariance             Definite
     values,        eigenstructure,        Matrices
     Eckart-Young)  dimension reduction)   (Cholesky,
          |              |                 curvature)
          +--------------+------------------------+
                         v
          Ch. 08 Optimization  *  Ch. 11 Graph Theory
          Ch. 13 ML-Specific Math  *  Ch. 14 Transformers

========================================================================

<- Back to Advanced Linear Algebra | Next: Singular Value Decomposition ->