Positive Definite Matrices - Part 1: Intuition (Section 1) through Matrix Square Root and Whitening (Section 5)

1. Intuition

1.1 What Positive Definiteness Captures

The most useful single-sentence definition of a positive definite matrix is: a matrix that behaves like a positive number. To understand what this means, compare scalars and matrices in parallel.

For a scalar a > 0:

  • The equation ax = b has a unique solution x = b/a
  • The function f(x) = ax^2 has a unique minimum at x = 0
  • The scalar has a real square root: \sqrt{a} > 0
  • The product ax \cdot x = ax^2 \geq 0, with equality only at x = 0

For a positive definite matrix A \succ 0:

  • The system A\mathbf{x} = \mathbf{b} has a unique solution \mathbf{x} = A^{-1}\mathbf{b}
  • The function f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} has a unique minimum at \mathbf{x} = \mathbf{0}
  • The matrix has a unique positive definite square root A^{1/2} \succ 0
  • The quadratic form \mathbf{x}^\top A \mathbf{x} > 0 for all \mathbf{x} \neq \mathbf{0}

The quadratic form Q(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} is the central object. Think of it as a machine that takes a vector \mathbf{x} and returns a scalar measuring its "energy" with respect to A. When A is the identity, Q(\mathbf{x}) = \|\mathbf{x}\|^2, the familiar squared Euclidean length. When A is diagonal with positive entries d_1, \ldots, d_n, then Q(\mathbf{x}) = d_1 x_1^2 + \cdots + d_n x_n^2 - a weighted sum of squared coordinates. Positive definiteness requires this energy to be positive for every nonzero direction.

The formal definition requires A to be symmetric. The condition \mathbf{x}^\top A \mathbf{x} > 0 for \mathbf{x} \neq \mathbf{0} is called strict positive definiteness. The relaxed condition \mathbf{x}^\top A \mathbf{x} \geq 0 is positive semidefiniteness - it allows zero energy in some directions (those in the null space of A).
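To make this concrete, here is a minimal NumPy sketch (the matrix and vector are illustrative values, not from the lesson) that evaluates the quadratic form at one vector and classifies A by its eigenvalues:

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # a symmetric matrix to classify
x = np.array([1.0, -3.0])

energy = x @ A @ x                         # the quadratic form x^T A x
eigs = np.linalg.eigvalsh(A)               # eigenvalues of a symmetric matrix, ascending

print(energy)    # 14.0 -> positive for this particular x
print(eigs)      # [1. 3.] -> all positive, so A is positive definite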

For AI: Every twice-differentiable loss function \mathcal{L}(\boldsymbol{\theta}) with a local minimum at \boldsymbol{\theta}^* satisfies \nabla^2 \mathcal{L}(\boldsymbol{\theta}^*) \succeq 0, and the Hessian being positive definite at a critical point is exactly the second-order sufficient condition for a strict local minimum. This is the mathematical content of "the loss landscape is a bowl."

1.2 Geometric Picture: Ellipsoids and Level Sets

The level sets of the quadratic form \mathbf{x}^\top A \mathbf{x} = c reveal the geometry of A directly.

Case 1: A = I (identity). The level set \mathbf{x}^\top I \mathbf{x} = c is \|\mathbf{x}\|^2 = c - a sphere of radius \sqrt{c}.

Case 2: A diagonal, A = \text{diag}(\lambda_1, \lambda_2) with \lambda_1 > \lambda_2 > 0. The level set \lambda_1 x_1^2 + \lambda_2 x_2^2 = 1 is an ellipse with semi-axes 1/\sqrt{\lambda_1} and 1/\sqrt{\lambda_2}. The larger eigenvalue compresses that direction; the smaller eigenvalue stretches it.

Case 3: A general symmetric PD. The level set \mathbf{x}^\top A \mathbf{x} = 1 is an ellipsoid whose axes are aligned with the eigenvectors of A and whose semi-axis lengths are 1/\sqrt{\lambda_i}. This follows from the spectral decomposition A = Q\Lambda Q^\top: substituting \mathbf{y} = Q^\top \mathbf{x} gives \mathbf{y}^\top \Lambda \mathbf{y} = 1, a coordinate-aligned ellipsoid.

Case 4: A indefinite (some positive, some negative eigenvalues). The level set is a hyperboloid - unbounded in the negative-eigenvalue directions. There is no minimum energy.

Case 5: A positive semidefinite (some zero eigenvalues). The level set is an "infinite cylinder" - flat in the null space directions. Energy is zero for all vectors in the null space.
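A small sketch of Case 3 (the matrix is an arbitrary illustrative choice): the eigendecomposition of A gives the ellipse axes of the level set \mathbf{x}^\top A \mathbf{x} = 1 directly.

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # symmetric PD
lam, Q = np.linalg.eigh(A)                 # eigenvalues (ascending) and orthonormal eigenvectors
semi_axes = 1.0 / np.sqrt(lam)             # semi-axis lengths of {x : x^T A x = 1}

print(lam)         # [1. 3.]
print(semi_axes)   # [1.    0.577...] -> longest axis along the eigenvector of the smallest eigenvalue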

QUADRATIC FORM GEOMETRY
========================================================================

  2D quadratic form  Q(x_1,x_2) = x^TAx,  level set Q = 1

  PD (both \lambda > 0):     PSD (one \lambda = 0):    Indefinite:
                        
     +-------+           ---------          /       \
     |  /-\  |           ---------          ---------
     | /   \ |           ---------          \       /
     +-------+           ---------          ---------
     ellipse             parallel lines     hyperbola
     
  Semi-axes \propto 1/\sqrt{\lambda_i}
  Axes aligned with eigenvectors of A

========================================================================

This geometric picture has a direct machine learning reading. The covariance matrix \Sigma of a multivariate Gaussian defines a metric on feature space: (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) is the squared Mahalanobis distance, measuring how many standard deviations \mathbf{x} is from \boldsymbol{\mu} in each principal direction. The level sets of this distance are the ellipsoidal contours of the Gaussian density.

1.3 Why PD Matrices Matter for AI

Positive definiteness is not a technical curiosity - it is a pervasive structural condition in machine learning systems:

Covariance matrices. Any valid probability distribution must have a non-negative variance. For a multivariate random variable \mathbf{x} \in \mathbb{R}^n, the covariance matrix \Sigma = \mathbb{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^\top] is always PSD. For a multivariate Gaussian \mathcal{N}(\boldsymbol{\mu}, \Sigma) with a proper density, we need \Sigma \succ 0 (so \det \Sigma \neq 0 and the density is normalized). Every time a VAE encoder outputs a covariance or diagonal variance, it must produce a PSD (or PD) parameterization.

Fisher information matrix. The Fisher information F(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x} \sim p(\cdot|\boldsymbol{\theta})}[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})^\top] is PSD by construction (it is a covariance matrix of score functions). The natural gradient \tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L} uses the Fisher as a Riemannian metric on parameter space. K-FAC (Kronecker-Factored Approximate Curvature) approximates F with a Kronecker-product PSD matrix for scalable second-order optimization.

Gram matrices and kernels. The inner product matrix G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle for any set of vectors is always PSD. Kernel methods rely on this: a function k(\mathbf{x}, \mathbf{z}) is a valid kernel if and only if its Gram matrix is PSD for any set of inputs (Mercer's theorem). The scaled attention score matrix QK^\top / \sqrt{d_k} in transformers is a (non-symmetric) Gram matrix.

Cholesky in numerical computation. Cholesky decomposition (A = LL^\top) is the fastest algorithm for solving A\mathbf{x} = \mathbf{b} when A is known to be PD - twice as fast as LU. It is the standard backend for Gaussian process inference, multivariate normal sampling, and Bayesian linear regression; NumPy, SciPy, and PyTorch all rely on Cholesky in their routines for PD matrices.

Hessian and second-order methods. At a local minimum \boldsymbol{\theta}^*, the Hessian \nabla^2 \mathcal{L}(\boldsymbol{\theta}^*) \succeq 0. Sharpness-Aware Minimization (SAM) seeks parameters where \lambda_{\max}(\nabla^2 \mathcal{L}) is small, empirically improving generalization. The Gauss-Newton approximation H \approx J^\top J is always PSD and is the basis for K-FAC and natural gradient methods.

1.4 Historical Timeline

Year | Contributor | Development
1801 | Gauss | Quadratic forms in celestial mechanics; method of least squares
1847 | Sylvester | Named "positive definite" forms; leading minor criterion
1852 | Sylvester | Sylvester's law of inertia (signature invariance under congruence)
1878 | Frobenius | Systematic theory of bilinear and quadratic forms
1909 | Mercer | Mercer's theorem linking PSD functions to kernel expansions
1910 | Cholesky | Cholesky decomposition for geodetic calculations (unpublished until 1924)
1934 | Loewner | Matrix ordering (Loewner order) A \succeq B defined rigorously
1955 | Bellman | Positive definite matrices in dynamic programming and control
1990s | Vandenberghe & Boyd | Semidefinite programming (SDP) as efficient optimization
1998 | Scholkopf & Smola | Kernel methods: SVMs via PSD kernel matrices
2013 | Kingma & Welling | VAE reparameterization trick using Cholesky sampling
2015 | Martens & Grosse | K-FAC: Kronecker-factored PSD approximation to Fisher
2024-2026 | Multiple groups | PSD structure in diffusion model covariances, normalizing flows, structured covariance estimation

2. Formal Definitions

2.1 Quadratic Forms

Definition 2.1 (Quadratic Form). Let A \in \mathbb{R}^{n \times n} be a symmetric matrix and \mathbf{x} \in \mathbb{R}^n. The associated quadratic form is the function Q_A : \mathbb{R}^n \to \mathbb{R} defined by

Q_A(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.

Note: every quadratic form can be written with a symmetric matrix. If A is not symmetric, replacing it with (A + A^\top)/2 gives the same quadratic form, since \mathbf{x}^\top A \mathbf{x} = \mathbf{x}^\top \frac{A+A^\top}{2} \mathbf{x} for all \mathbf{x}. We always assume A is symmetric.

Expanding in 2D. For A = \begin{pmatrix} a & b \\ b & d \end{pmatrix} and \mathbf{x} = (x_1, x_2)^\top:

Q_A(\mathbf{x}) = ax_1^2 + 2bx_1x_2 + dx_2^2.

The diagonal entries a, d give the pure squared terms; the off-diagonal b gives the cross term.

Completing the square. For the 1D quadratic q(x) = ax^2 + 2bx + c = a(x + b/a)^2 + (c - b^2/a), the minimum is c - b^2/a, achieved at x = -b/a. The matrix analogue is the Schur complement (Section 6) - the quadratic form \mathbf{x}^\top A \mathbf{x} + 2\mathbf{b}^\top \mathbf{x} + c is minimized at \mathbf{x}^* = -A^{-1}\mathbf{b} (when A \succ 0) with minimum value c - \mathbf{b}^\top A^{-1} \mathbf{b}.
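As a numerical sanity check of the matrix version (A, b, and c are made-up illustrative values), the minimizer and minimum value can be verified directly:

import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])    # PD: 4 > 0 and det = 8 > 0
b = np.array([1.0, -1.0])
c = 5.0

f = lambda x: x @ A @ x + 2 * b @ x + c
x_star = -np.linalg.solve(A, b)            # x* = -A^{-1} b

assert np.isclose(f(x_star), c - b @ np.linalg.solve(A, b))
assert f(x_star + np.array([0.1, -0.2])) > f(x_star)   # any perturbation increases the value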

Connection to bilinear forms. The quadratic form Q_A(\mathbf{x}) is the diagonal of the bilinear form B_A(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top A \mathbf{y}, i.e., Q_A(\mathbf{x}) = B_A(\mathbf{x}, \mathbf{x}). The bilinear form encodes the inner product structure induced by A.

2.2 The Four Cases: PD, PSD, ND, Indefinite

Definition 2.2. Let A \in \mathbb{R}^{n \times n} be symmetric. Then A is:

Name | Symbol | Condition | Colloquial
Positive definite | A \succ 0 | \mathbf{x}^\top A \mathbf{x} > 0 for all \mathbf{x} \neq \mathbf{0} | "bowl"
Positive semidefinite | A \succeq 0 | \mathbf{x}^\top A \mathbf{x} \geq 0 for all \mathbf{x} | "flat bowl"
Negative definite | A \prec 0 | \mathbf{x}^\top A \mathbf{x} < 0 for all \mathbf{x} \neq \mathbf{0} | "dome"
Negative semidefinite | A \preceq 0 | \mathbf{x}^\top A \mathbf{x} \leq 0 for all \mathbf{x} | "flat dome"
Indefinite | - | Q_A takes both positive and negative values | "saddle"

Note: A \prec 0 \Leftrightarrow -A \succ 0. The interesting cases for applications are PD and PSD.

Standard examples:

A = I_n: \mathbf{x}^\top I \mathbf{x} = \|\mathbf{x}\|^2 > 0 for \mathbf{x} \neq \mathbf{0}. So I \succ 0. OK

A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}: Q = 2x_1^2 + 2x_1x_2 + 2x_2^2 = (x_1+x_2)^2 + x_1^2 + x_2^2 \geq x_1^2 + x_2^2 > 0 for \mathbf{x} \neq \mathbf{0}. So A \succ 0. OK

A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}: Q = x_1^2 \geq 0, and Q(\mathbf{e}_2) = 0. So A \succeq 0 but not A \succ 0. (PSD, not PD.)

A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}: Q(1,-1) = 1 - 4 + 1 = -2 < 0, Q(1,0) = 1 > 0. Indefinite.

Non-examples (common mistakes):

A = \begin{pmatrix} 2 & 3 \\ 3 & 1 \end{pmatrix}: looks "positive" but \det A = 2 - 9 = -7 < 0, so indefinite.

A = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}: Q \equiv 0, so PSD but NOT PD.

A = \begin{pmatrix} -1 & 0 \\ 0 & 2 \end{pmatrix}: Q(\mathbf{e}_1) = -1, indefinite.

2.3 Immediate Consequences

If A \succ 0, then many properties follow immediately from the definition:

Proposition 2.3. Let A \succ 0 (symmetric). Then:

  1. Positive diagonal entries. A_{ii} > 0 for all i. Proof: Take \mathbf{x} = \mathbf{e}_i: \mathbf{e}_i^\top A \mathbf{e}_i = A_{ii} > 0.

  2. Positive trace. \text{tr}(A) = \sum_i A_{ii} > 0.

  3. Positive determinant. \det A > 0. (Follows from all eigenvalues being positive; Section 3.1.)

  4. Invertibility. A is invertible, and A^{-1} \succ 0. Proof: If A\mathbf{x} = \mathbf{0} then \mathbf{x}^\top A \mathbf{x} = 0, so \mathbf{x} = \mathbf{0}. For A^{-1}: A^{-1} is symmetric, and \mathbf{y}^\top A^{-1} \mathbf{y} = (A^{-1}\mathbf{y})^\top A (A^{-1}\mathbf{y}) > 0 for \mathbf{y} \neq \mathbf{0}.

  5. Closure under positive scaling. \alpha A \succ 0 for \alpha > 0.

  6. Closure under addition. If A, B \succ 0 then A + B \succ 0. Proof: \mathbf{x}^\top(A+B)\mathbf{x} = \mathbf{x}^\top A\mathbf{x} + \mathbf{x}^\top B\mathbf{x} > 0.

  7. Tikhonov regularization is PD. If A \succeq 0 and \lambda > 0, then A + \lambda I \succ 0. Proof: \mathbf{x}^\top(A + \lambda I)\mathbf{x} \geq \lambda \|\mathbf{x}\|^2 > 0.

  8. Congruence preserves PD. If A \succ 0 and B is invertible, then B^\top A B \succ 0. Proof: \mathbf{x}^\top B^\top A B \mathbf{x} = (B\mathbf{x})^\top A (B\mathbf{x}) > 0 since B\mathbf{x} \neq \mathbf{0} when \mathbf{x} \neq \mathbf{0}.

For AI: Property 4 ensures that covariance matrices can be inverted for precision computations. Property 7 is exactly why adding \lambda I (weight decay, Tikhonov regularization, jitter in Gaussian processes) to a PSD matrix makes it PD and invertible.
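A quick numerical illustration of properties 7 and 8 (all matrices here are just example values):

import numpy as np

A_psd = np.array([[1.0, 1.0], [1.0, 1.0]])           # rank-1: PSD but not PD
print(np.linalg.eigvalsh(A_psd))                      # [0. 2.]
print(np.linalg.eigvalsh(A_psd + 1e-6 * np.eye(2)))   # every eigenvalue shifted up by 1e-6 (property 7)

A_pd = np.array([[2.0, 1.0], [1.0, 2.0]])
B = np.array([[1.0, 2.0], [0.0, 3.0]])                # any invertible B
print(np.linalg.eigvalsh(B.T @ A_pd @ B))             # still all positive (property 8)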

2.4 The Loewner Order

Definition 2.4 (Loewner Order). For symmetric matrices A, B \in \mathbb{R}^{n \times n}, define the Loewner ordering:

A \succeq B \quad \Longleftrightarrow \quad A - B \succeq 0.

Similarly, A \succ B \Leftrightarrow A - B \succ 0. This ordering is:

  • Reflexive: A \succeq A.
  • Antisymmetric: A \succeq B and B \succeq A implies A = B.
  • Transitive: A \succeq B and B \succeq C implies A \succeq C.

However, it is NOT a total order - not every pair of symmetric matrices is comparable. For example, \begin{pmatrix}1&0\\0&0\end{pmatrix} and \begin{pmatrix}0&0\\0&1\end{pmatrix} are neither \succeq nor \preceq each other.

Properties of the Loewner order:

If A \succeq B then:

  • \text{tr}(A) \geq \text{tr}(B) (trace is monotone)
  • \lambda_i(A) \geq \lambda_i(B) for all i when eigenvalues are sorted in decreasing order (Weyl's theorem)
  • C^\top A C \succeq C^\top B C for any C (congruence is monotone)
  • If additionally A, B \succ 0: B^{-1} \succeq A^{-1} (inversion reverses the order!)

The last property is surprising: if A \succeq B \succ 0, then A^{-1} \preceq B^{-1}. Intuition: a larger matrix "moves vectors more," so its inverse "moves them less."

For AI: In Bayesian inference, the posterior covariance is \Sigma_{\text{post}} = (\Sigma_{\text{prior}}^{-1} + X^\top X / \sigma^2)^{-1}. Since we added X^\top X / \sigma^2 \succeq 0 to the prior precision, the posterior precision is larger, so by order-reversal, \Sigma_{\text{post}} \preceq \Sigma_{\text{prior}}. Observing data never increases uncertainty - the Loewner order makes this rigorous.
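A sketch of this order-reversal in the Bayesian-linear-regression setting (the prior, design matrix, and noise level are synthetic placeholders):

import numpy as np

rng = np.random.default_rng(0)
Sigma_prior = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.standard_normal((20, 2))                   # design matrix
sigma2 = 0.25                                      # observation noise variance

precision_post = np.linalg.inv(Sigma_prior) + X.T @ X / sigma2
Sigma_post = np.linalg.inv(precision_post)

# Sigma_prior - Sigma_post should be PSD: its eigenvalues are >= 0 (up to round-off).
print(np.linalg.eigvalsh(Sigma_prior - Sigma_post))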


3. Eigenvalue and Determinantal Characterizations

3.1 Spectral Characterization

The most computationally transparent characterization of positive definiteness uses eigenvalues.

Theorem 3.1 (Spectral Characterization). Let A \in \mathbb{R}^{n \times n} be symmetric with eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n. Then:

A \succ 0 \quad \Longleftrightarrow \quad \lambda_i > 0 \text{ for all } i = 1, \ldots, n.
A \succeq 0 \quad \Longleftrightarrow \quad \lambda_i \geq 0 \text{ for all } i = 1, \ldots, n.

Proof. By the spectral theorem (-> 01: Eigenvalues), A = Q\Lambda Q^\top where Q is orthogonal and \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n). For any \mathbf{x} \neq \mathbf{0}, let \mathbf{y} = Q^\top \mathbf{x} \neq \mathbf{0} (since Q is invertible):

\mathbf{x}^\top A \mathbf{x} = \mathbf{y}^\top \Lambda \mathbf{y} = \sum_{i=1}^n \lambda_i y_i^2.

If all \lambda_i > 0: the sum is positive for \mathbf{y} \neq \mathbf{0}. OK

If some \lambda_k \leq 0: take \mathbf{y} = \mathbf{e}_k (so \mathbf{x} = Q\mathbf{e}_k = \mathbf{q}_k, the k-th eigenvector). Then \mathbf{x}^\top A \mathbf{x} = \lambda_k \leq 0. NO

The characterization for \succeq 0 follows by the same argument with \geq replacing >. \square

Immediate consequences:

  • \text{tr}(A) = \sum \lambda_i > 0 for PD
  • \det A = \prod \lambda_i > 0 for PD
  • \|A\|_2 = \lambda_1 and \|A^{-1}\|_2 = 1/\lambda_n for PD (-> 06: Norms)
  • The condition number \kappa(A) = \lambda_1/\lambda_n measures how "nearly singular" A is
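These consequences are easy to confirm numerically; the sketch below uses the same PD matrix that appears again in the Section 3.2 worked example:

import numpy as np

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
lam = np.linalg.eigvalsh(A)                          # ascending; all positive here

print(np.isclose(lam.sum(), np.trace(A)))            # trace = sum of eigenvalues
print(np.isclose(lam.prod(), np.linalg.det(A)))      # det = product of eigenvalues
print(lam[-1] / lam[0])                               # condition number kappa(A) = lambda_max / lambda_min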

For AI: In Gaussian process regression, the covariance kernel matrix K must be PSD. Verifying K \succeq 0 by computing eigenvalues is expensive; checking a Cholesky decomposition (Section 4) is faster and also provides the factorization needed for inference.

Note on the spectral theorem: This proof uses the spectral theorem for symmetric matrices - that every symmetric real matrix is orthogonally diagonalizable. For the full proof and discussion of spectral theory, see 01: Eigenvalues and Eigenvectors.

3.2 Sylvester's Criterion

Sylvester's criterion provides a purely determinantal test that avoids eigenvalue computation entirely, making it practical for hand calculations and symbolic proofs.

Definition 3.2 (Leading Principal Minors). The k-th leading principal minor of A \in \mathbb{R}^{n \times n} is:

\Delta_k = \det(A[1:k, 1:k]) = \det\begin{pmatrix}A_{11} & \cdots & A_{1k} \\ \vdots & \ddots & \vdots \\ A_{k1} & \cdots & A_{kk}\end{pmatrix}.

So \Delta_1 = A_{11}, \Delta_2 = A_{11}A_{22} - A_{12}^2 (for symmetric A), and \Delta_n = \det A.

Theorem 3.3 (Sylvester's Criterion). A symmetric matrix A \in \mathbb{R}^{n \times n} is positive definite if and only if all leading principal minors are positive:

A \succ 0 \quad \Longleftrightarrow \quad \Delta_k > 0 \text{ for all } k = 1, 2, \ldots, n.

Proof sketch. The key insight is that the leading principal submatrix A_k = A[1:k, 1:k] is itself symmetric, and A \succ 0 implies A_k \succ 0 for every k (restrict the quadratic form to vectors supported on the first k coordinates). The converse uses induction: if \Delta_1, \ldots, \Delta_{n-1} > 0 and \Delta_n > 0, then by the inductive hypothesis the leading (n-1) \times (n-1) submatrix is PD, and the Cholesky factorization can be completed (Section 4 gives the constructive proof). The determinant of a PD matrix equals the product of its squared Cholesky diagonal entries, all positive, so \Delta_n > 0. \square

Working example. Test A = \begin{pmatrix}4 & 2 & 1\\2 & 3 & 0\\1 & 0 & 2\end{pmatrix}:

  • \Delta_1 = 4 > 0 OK
  • \Delta_2 = 4 \cdot 3 - 2^2 = 12 - 4 = 8 > 0 OK
  • \Delta_3 = \det A = 4(6-0) - 2(4-0) + 1(0-3) = 24 - 8 - 3 = 13 > 0 OK

So A \succ 0.
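The same test, automated (a small sketch in which np.linalg.det on the leading blocks plays the role of the hand-computed minors):

import numpy as np

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print(minors)     # approximately [4.0, 8.0, 13.0] -> all positive, so A is PD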

Caution for PSD. Sylvester's criterion does NOT characterize A \succeq 0. A matrix can have all leading principal minors \geq 0 but still not be PSD (a counterexample is A = \begin{pmatrix}0 & 0 \\ 0 & -1\end{pmatrix}: \Delta_1 = 0 \geq 0, \Delta_2 = 0 \geq 0, but Q(\mathbf{e}_2) = -1 < 0). For PSD testing, use all principal minors (not just the leading ones), or use eigenvalues.

For AI: Sylvester's criterion is used in symbolic proofs and in optimization algorithms that maintain PD structure (e.g., proving that a block covariance matrix built by a GP model is PD, by verifying that the diagonal blocks and Schur complements are positive).

3.3 Pivot Characterization

The connection between Gaussian elimination and positive definiteness is deep and computationally significant.

Theorem 3.4 (Pivot Characterization). A symmetric matrix A \in \mathbb{R}^{n \times n} is positive definite if and only if Gaussian elimination without row exchanges produces all positive pivots.

The pivots of A are d_1 = A_{11}, d_2 = \Delta_2/\Delta_1, \ldots, d_k = \Delta_k/\Delta_{k-1}. So A \succ 0 \Leftrightarrow all pivots d_k > 0 \Leftrightarrow all \Delta_k > 0 (Sylvester). The LDL^T factorization makes this explicit: A = LDL^\top where D = \text{diag}(d_1, \ldots, d_n), and A \succ 0 \Leftrightarrow all diagonal entries of D are positive (Section 4.4).

The pivot characterization is how Cholesky "discovers" positive definiteness: the Cholesky algorithm fails (tries to take the square root of a non-positive number) exactly when a pivot is non-positive. This makes Cholesky the standard computational test for positive definiteness: try to factor; failure reveals non-PD structure.

PIVOT CHARACTERIZATION SUMMARY
========================================================================

  A \in \mathbb{R}^{n \times n} symmetric.  The following are equivalent:

  (i)   A \succ 0  (quadratic form positive for all x \neq 0)
  (ii)  All eigenvalues \lambda_i > 0                          [Spectral]
  (iii) All leading principal minors \Delta_k > 0              [Sylvester]
  (iv)  All Gaussian elimination pivots d_k > 0           [Pivot]
  (v)   Cholesky factorization A = LL^T exists with L     [Cholesky]
        lower triangular, L_{ii} > 0

  Each characterization provides a different computational test:
  (ii)  O(n^3): eigenvalue decomposition  <- most expensive
  (iii) O(n^4): compute n determinants  <- symbolic/manual only
  (iv)  O(n^3): Gaussian elimination
  (v)   O(n^3/3): Cholesky             <- fastest in practice

========================================================================

3.4 Certifying PSD vs PD

In numerical practice, the PD/PSD boundary is a source of subtle bugs. Here we describe how to handle the boundary cases.

Near-PSD matrices. A matrix that is theoretically PSD may have small negative eigenvalues due to floating-point errors. For example, if A is constructed as X^\top X but X is nearly rank-deficient, A might have eigenvalues like -10^{-15} due to rounding. The standard fix is to add a small jitter: A_\epsilon = A + \epsilon I for \epsilon \approx 10^{-6} or 10^{-8}.

Numerical rank. For a PSD matrix, the rank equals the number of positive eigenvalues. In finite precision, eigenvalues below roughly \epsilon \cdot \|A\|_2 (machine epsilon times the spectral norm) are treated as zero; numpy.linalg.matrix_rank uses a threshold of this form by default.

Testing PSD in practice:

  1. Try np.linalg.cholesky(A) - succeeds iff A \succ 0 (strictly PD)
  2. For a PSD test: check np.all(np.linalg.eigvalsh(A) >= -tol) with tol = 1e-10
  3. For rank-r PSD: compute np.linalg.matrix_rank(A) and verify r < n

The zero-eigenvalue case. If A \succeq 0 and \text{rank}(A) = r < n, the Cholesky factorization of the full n \times n matrix does not exist. However, the factorization A = LL^\top where L is n \times r (a "thin" Cholesky) does exist and can be computed via pivoted Cholesky. This is used in sparse GP approximations (Nystrom approximation, inducing points).
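A minimal way to obtain such a thin factor without pivoted Cholesky is through the eigendecomposition (a sketch on a synthetic rank-2 matrix; a genuine pivoted Cholesky would be cheaper for large n):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
A = X @ X.T                                   # 5x5 PSD with rank 2

w, Q = np.linalg.eigh(A)
keep = w > 1e-10 * w.max()                    # numerical-rank threshold
L_thin = Q[:, keep] * np.sqrt(w[keep])        # n x r factor with A = L_thin @ L_thin.T

print(L_thin.shape)                           # (5, 2)
assert np.allclose(L_thin @ L_thin.T, A)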

For AI: PyTorch's torch.linalg.cholesky raises torch.linalg.LinAlgError if the matrix is not PD. In practice, diagonal jitter (A + 1e-6 * torch.eye(n)) is the standard workaround in GP implementations and VAE covariance parameterizations.


4. Cholesky Decomposition

Cholesky decomposition is the most important computational tool in this section. Unlike the spectral characterization (eigenvalue decomposition, O(n^3) with a large constant) or LU (general, O(n^3) with pivoting), Cholesky is:

  • Twice as fast as LU for symmetric positive definite systems (\approx n^3/3 vs \approx 2n^3/3 operations)
  • Numerically backward-stable without pivoting (which LU requires for stability)
  • A reveal of the square root: L satisfies LL^\top = A, i.e., L is a "matrix square root"
  • The canonical test for positive definiteness: it exists iff A \succ 0

4.1 The Theorem: Existence and Uniqueness

Theorem 4.1 (Cholesky Factorization). Let A \in \mathbb{R}^{n \times n} be symmetric. Then:

A \succ 0 \quad \Longleftrightarrow \quad \exists!\ \text{lower triangular } L \text{ with positive diagonal such that } A = LL^\top.

This L is called the Cholesky factor of A. The factorization A = LL^\top is the Cholesky decomposition.

Proof of existence (constructive). We prove by induction on n.

Base case (n = 1): A = (a_{11}) with a_{11} > 0 (since A \succ 0). Take L = (\sqrt{a_{11}}).

Inductive step: Write A in block form:

A = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & B \end{pmatrix}

where a_{11} > 0 (Proposition 2.3), \mathbf{a} \in \mathbb{R}^{n-1}, and B \in \mathbb{R}^{(n-1)\times(n-1)}.

Set \ell_{11} = \sqrt{a_{11}} and \boldsymbol{\ell} = \mathbf{a}/\ell_{11}. Then B - \boldsymbol{\ell}\boldsymbol{\ell}^\top = B - \mathbf{a}\mathbf{a}^\top/a_{11} is the Schur complement S = B - \mathbf{a}\, a_{11}^{-1}\, \mathbf{a}^\top.

Since A \succ 0, its Schur complement S \succ 0 (Theorem 6.2). By the inductive hypothesis, S = L_{22}L_{22}^\top for a unique lower triangular L_{22} with positive diagonal.

Then:

L = \begin{pmatrix}\ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22}\end{pmatrix}, \quad LL^\top = \begin{pmatrix}\ell_{11}^2 & \ell_{11}\boldsymbol{\ell}^\top \\ \ell_{11}\boldsymbol{\ell} & \boldsymbol{\ell}\boldsymbol{\ell}^\top + L_{22}L_{22}^\top\end{pmatrix} = \begin{pmatrix}a_{11} & \mathbf{a}^\top \\ \mathbf{a} & B\end{pmatrix} = A.

Proof of uniqueness. Suppose A = L_1 L_1^\top = L_2 L_2^\top with L_1, L_2 lower triangular with positive diagonals. Then L_2^{-1}L_1 = L_2^\top (L_1^\top)^{-1}. The left side is lower triangular and the right side is upper triangular, so both equal the same diagonal matrix D, with positive entries (since L_1, L_2 have positive diagonals). From L_2^{-1}L_1 = D we get L_1 = L_2 D, hence A = L_1 L_1^\top = L_2 D^2 L_2^\top. Comparing with A = L_2 L_2^\top forces D^2 = I, so D = I and L_1 = L_2. \square

Converse. If A = LL^\top with L lower triangular and positive diagonal, then for \mathbf{x} \neq \mathbf{0}: \mathbf{x}^\top A \mathbf{x} = \mathbf{x}^\top L L^\top \mathbf{x} = \|L^\top \mathbf{x}\|^2 > 0 (since L^\top is invertible by its positive diagonal). So A \succ 0. \square

4.2 The Algorithm: Constructing L

The Cholesky algorithm computes L column by column. From A = LL^\top, expanding element by element:

A_{ij} = \sum_{k=1}^{\min(i,j)} L_{ik} L_{jk}.

For the diagonal entry (i,i):

A_{ii} = \sum_{k=1}^{i} L_{ik}^2 = L_{ii}^2 + \sum_{k=1}^{i-1} L_{ik}^2 \implies L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}.

For an off-diagonal entry (i,j) with i > j:

A_{ij} = \sum_{k=1}^{j} L_{ik}L_{jk} = L_{ij}L_{jj} + \sum_{k=1}^{j-1} L_{ik}L_{jk} \implies L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right).

Algorithm 4.2 (Cholesky, column-wise):

Input:  Symmetric A \in \mathbb{R}^{n \times n}
Output: Lower triangular L with A = LL^T, or FAILURE

for j = 1 to n:
    s = A[j,j] - sum(L[j,k]^2 for k = 1..j-1)
    if s \leq 0:
        FAILURE (A is not positive definite)
    L[j,j] = sqrt(s)
    for i = j+1 to n:
        L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k] for k=1..j-1)) / L[j,j]
    for i = 1 to j-1:
        L[i,j] = 0       (upper triangle is zero)
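A direct Python transcription of Algorithm 4.2, for study purposes only (np.linalg.cholesky is what you would call in practice):

import numpy as np

def cholesky_columnwise(A):
    # Teaching sketch of Algorithm 4.2; assumes A is symmetric.
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]
        if s <= 0:
            raise ValueError("A is not positive definite")
        L[j, j] = np.sqrt(s)
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L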

Worked example. Compute the Cholesky factor of A = \begin{pmatrix}4 & 2 & 0\\2 & 5 & 1\\0 & 1 & 3\end{pmatrix}:

Column 1: L_{11} = \sqrt{4} = 2, L_{21} = 2/2 = 1, L_{31} = 0/2 = 0.

Column 2: L_{22} = \sqrt{5 - 1^2} = \sqrt{4} = 2, L_{32} = (1 - 0 \cdot 1)/2 = 1/2.

Column 3: L_{33} = \sqrt{3 - 0^2 - (1/2)^2} = \sqrt{3 - 1/4} = \sqrt{11/4} = \sqrt{11}/2.

L = \begin{pmatrix}2 & 0 & 0 \\ 1 & 2 & 0 \\ 0 & 1/2 & \sqrt{11}/2\end{pmatrix}.

Verify: LL^\top = \begin{pmatrix}4 & 2 & 0\\2 & 5 & 1\\0 & 1 & 3\end{pmatrix} = A. OK
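The library routine reproduces the hand computation (a quick check):

import numpy as np

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
L = np.linalg.cholesky(A)
print(L)                                    # matches the hand-computed factor above
print(np.isclose(L[2, 2], np.sqrt(11) / 2))
assert np.allclose(L @ L.T, A)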

4.3 Complexity and Numerical Properties

Flop count. The Cholesky algorithm requires approximately n^3/3 floating-point operations - precisely half the \approx 2n^3/3 required by LU decomposition. For large n, this factor-of-2 speedup is significant.

Memory. Only the lower triangle of L needs to be stored: n(n+1)/2 entries vs n^2 for general LU. In-place variants overwrite the lower triangle of A with L.

Backward stability. Cholesky is numerically backward-stable without pivoting: the computed \hat{L} satisfies \hat{L}\hat{L}^\top = A + E where \|E\|_2 \leq c_n \epsilon_{\text{mach}} \|A\|_2 for a modest constant c_n. This is stronger than LU without pivoting (which can be unstable). The reason: each step computes \sqrt{\text{positive}}, which cannot amplify errors.

Solving A\mathbf{x} = \mathbf{b}: Factor A = LL^\top, then solve L\mathbf{y} = \mathbf{b} by forward substitution and L^\top\mathbf{x} = \mathbf{y} by backward substitution. Total cost: n^3/3 + 2n^2 (factorization + two triangular solves).
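In SciPy this factor-then-solve pattern is packaged as cho_factor / cho_solve (a sketch on the worked-example matrix with an arbitrary right-hand side):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

c_and_lower = cho_factor(A)      # pay the n^3/3 factorization once ...
x = cho_solve(c_and_lower, b)    # ... then each new right-hand side costs only two triangular solves
assert np.allclose(A @ x, b)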

Failure = non-PD. If at any step the quantity under the square root is non-positive, A is not positive definite. This makes Cholesky the standard computational test for PD structure.

For AI: JAX's jax.scipy.linalg.cholesky, PyTorch's torch.linalg.cholesky, and SciPy's scipy.linalg.cholesky are all efficient LAPACK-backed implementations. In Gaussian process models, the Cholesky solve L \ b (forward substitution) appears in every prediction and log-marginal-likelihood computation.

4.4 LDL^T Factorization

The LDL^T decomposition is a variant of Cholesky that avoids square roots - useful when the matrix is PSD but not strictly PD, or when square root computation is expensive.

Theorem 4.3 (LDL^T Factorization). Every symmetric A \in \mathbb{R}^{n \times n} that admits Gaussian elimination without row swaps can be uniquely factored as:

A = LDL^\top

where L is unit lower triangular (L_{ii} = 1) and D = \text{diag}(d_1, \ldots, d_n).

Moreover, A \succ 0 \Leftrightarrow all diagonal entries d_i > 0.

Relation to Cholesky. If A \succ 0, the LDL^T and Cholesky factorizations are related by:

A = LDL^\top = L\sqrt{D}\sqrt{D}L^\top = (L\sqrt{D})(L\sqrt{D})^\top.

So the Cholesky factor is \hat{L} = L\sqrt{D} where \sqrt{D} = \text{diag}(\sqrt{d_1}, \ldots, \sqrt{d_n}). The LDL^T decomposition computes the d_i directly without taking square roots.

Algorithm 4.4 (LDL^T, column-wise):

for j = 1 to n:
    v[1..j-1] = D[1..j-1] * L[j,1..j-1]   (element-wise)
    D[j] = A[j,j] - dot(L[j,1..j-1], v[1..j-1])
    for i = j+1 to n:
        L[i,j] = (A[i,j] - dot(L[i,1..j-1], v[1..j-1])) / D[j]
    L[j,j] = 1

Worked example. Factor A = \begin{pmatrix}4 & 2\\2 & 5\end{pmatrix}:

d_1 = 4, L_{21} = 2/4 = 1/2, d_2 = 5 - (1/2)^2 \cdot 4 = 5 - 1 = 4.

L = \begin{pmatrix}1 & 0 \\ 1/2 & 1\end{pmatrix}, \quad D = \begin{pmatrix}4 & 0 \\ 0 & 4\end{pmatrix}.

Verify: LDL^\top = \begin{pmatrix}1&0\\1/2&1\end{pmatrix}\begin{pmatrix}4&0\\0&4\end{pmatrix}\begin{pmatrix}1&1/2\\0&1\end{pmatrix} = \begin{pmatrix}4&2\\2&5\end{pmatrix}. OK
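A minimal transcription of Algorithm 4.4 confirms the hand computation (a sketch with no pivoting, so it assumes the factorization exists):

import numpy as np

def ldlt(A):
    # Teaching sketch of Algorithm 4.4 (unpivoted LDL^T).
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        v = d[:j] * L[j, :j]
        d[j] = A[j, j] - L[j, :j] @ v
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ v) / d[j]
    return L, d

L, d = ldlt(np.array([[4.0, 2.0], [2.0, 5.0]]))
print(L)    # [[1.  0. ], [0.5 1. ]]
print(d)    # [4. 4.]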

When to prefer LDL^T over Cholesky:

  • When A is indefinite but you still want a factorization (use Bunch-Kaufman pivoted LDL^T with 2\times 2 pivots, described below)
  • When square root computation is numerically undesirable
  • In interval arithmetic or symbolic computation

Symmetric indefinite systems. The Bunch-Kaufman (BK) algorithm extends LDL^T to indefinite matrices using 1 \times 1 and 2 \times 2 pivots. The LAPACK routine dsytrf implements this.

4.5 Modified Cholesky for Near-PD Matrices

In optimization (especially trust-region methods and quasi-Newton methods), we frequently need to factor a matrix that is "nearly" PD - the Hessian approximation may have small negative eigenvalues due to finite differences or insufficient curvature.

The problem. Standard Cholesky fails (negative pivot) when A is not strictly PD. Rather than failing, we want a PD matrix A + E close to A that can be Cholesky-factored.

Gill-Murray-Wright (GMW) modification. At each pivot step j, if the computed pivot d_j < \delta for a small threshold \delta > 0, set d_j \leftarrow \delta + |d_j| (adding a correction). After factorization, A + E = LL^\top where E = L_{\text{computed}}L_{\text{computed}}^\top - A is a small diagonal correction.

Simple diagonal jitter. The most widely used approach in machine learning is to add \epsilon I before attempting Cholesky:

import numpy as np

def robust_cholesky(A, jitter=1e-6):
    # Attempt Cholesky; if A is not (numerically) PD, retry once with diagonal jitter.
    try:
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        return np.linalg.cholesky(A + jitter * np.eye(len(A)))

This is the standard "GP jitter" pattern. The added \epsilon I corresponds to assuming a small observation noise or numerical regularization.

For AI: GPyTorch (a Gaussian process library built on PyTorch) wraps its kernel matrix Cholesky calls with diagonal jitter and a retry mechanism. A JAX version follows the same pattern with jax.scipy.linalg.cholesky and A += diag_jitter * jnp.eye(n). Understanding why jitter works - it shifts every eigenvalue of a PSD matrix up by \epsilon, making them all at least \epsilon > 0 - is exactly positive definiteness via the spectral characterization.


5. Matrix Square Root and Whitening

5.1 The PSD Square Root

For a scalar a \geq 0, there is a unique non-negative square root \sqrt{a} \geq 0. For a PSD matrix, the same uniqueness holds - but only in the class of PSD matrices.

Theorem 5.1 (PSD Square Root). Let A \in \mathbb{R}^{n \times n} be symmetric and positive semidefinite. Then there exists a unique symmetric positive semidefinite matrix A^{1/2} such that:

(A^{1/2})^2 = A^{1/2} A^{1/2} = A.

If A \succ 0, then A^{1/2} \succ 0.

Proof. By the spectral theorem, A = Q\Lambda Q^\top where \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n) with all \lambda_i \geq 0. Define:

A^{1/2} = Q \Lambda^{1/2} Q^\top, \quad \Lambda^{1/2} = \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}).

Then (A^{1/2})^2 = Q\Lambda^{1/2}Q^\top Q\Lambda^{1/2}Q^\top = Q\Lambda Q^\top = A. This A^{1/2} is symmetric PSD.

Uniqueness: Suppose B^2 = A with B symmetric PSD. Then B and A commute (since BA = B \cdot B^2 = B^3 = B^2 \cdot B = AB), so they share eigenvectors. On each shared eigenspace with eigenvalue \lambda, B must equal \sqrt{\lambda} (the unique non-negative root). So B = Q\Lambda^{1/2}Q^\top = A^{1/2}. \square

Warning: Non-uniqueness without the PSD constraint. The equation B^2 = A for A \succ 0 with distinct eigenvalues has 2^n solutions in the class of symmetric matrices (one can independently choose \pm\sqrt{\lambda_i} for each eigenvalue); with repeated eigenvalues there are infinitely many. The PSD square root A^{1/2} is the unique solution with all non-negative eigenvalues.

5.2 Computing Square Roots

Via eigendecomposition:

A = Q\Lambda Q^\top \implies A^{1/2} = Q\Lambda^{1/2}Q^\top.

This requires the full eigendecomposition, costing O(n^3) with a large constant. For A \succ 0:

import numpy as np

eigvals, Q = np.linalg.eigh(A)   # symmetric eigendecomposition (A assumed symmetric PSD)
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

Via Cholesky (the "left Cholesky factor"). Note: the Cholesky factor L satisfies LL^\top = A but L \neq A^{1/2} in general (unless A is diagonal). The Cholesky factor is a square root of A in the sense A = LL^\top, but it is not symmetric and not equal to the eigendecomposition-based A^{1/2}.

Via Newton's method (for matrix functions): The iteration X_{k+1} = \frac{1}{2}(X_k + X_k^{-1}A) starting from X_0 = I converges to A^{1/2} for A \succ 0. This is the matrix analogue of Heron's method for scalar square roots. Convergence is quadratic once close enough.
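A sketch of that iteration (illustrative only; the plain recurrence can be numerically delicate, and scipy.linalg.sqrtm is the practical choice):

import numpy as np

def sqrtm_newton(A, iters=20):
    # Newton / Heron iteration for the PSD square root; assumes A is symmetric PD.
    X = np.eye(A.shape[0])
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.solve(X, A))   # X <- (X + X^{-1} A) / 2
    return X

A = np.array([[4.0, 2.0], [2.0, 5.0]])
S = sqrtm_newton(A)
assert np.allclose(S @ S, A)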

Via Pade approximants: For A close to I, write A = I + E and use the matrix series A^{1/2} = (I+E)^{1/2} = I + \frac{1}{2}E - \frac{1}{8}E^2 + \cdots, truncated at some order.

5.3 Square Root vs Cholesky

The two "square root" operations serve different purposes:

Property | Cholesky factor L | PSD square root A^{1/2}
Satisfies | A = LL^\top | A = (A^{1/2})^2
Shape | Lower triangular | Symmetric
Uniqueness | Unique (positive diagonal) | Unique (PSD)
Computation | O(n^3/3) | O(n^3) (full eigendecomposition)
Invertibility | L^{-1} is lower triangular | (A^{1/2})^{-1} = A^{-1/2}
Use case | Solving systems, sampling | Mahalanobis distance, whitening

When to use Cholesky: solving A\mathbf{x} = \mathbf{b}, computing \log\det A, sampling from \mathcal{N}(\mathbf{0}, A).

When to use A^{1/2}: defining the Mahalanobis inner product \langle\mathbf{x},\mathbf{y}\rangle_A = \mathbf{x}^\top A^{-1}\mathbf{y} and its norm \|\mathbf{x}\|_A^2 = \mathbf{x}^\top A^{-1}\mathbf{x} = \|A^{-1/2}\mathbf{x}\|^2; whitening transforms; matrix differential equations.
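For instance, sampling from \mathcal{N}(\mathbf{0}, A) needs only the triangular factor, and the log-determinant falls out of the same factor (a sketch with a toy covariance):

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[4.0, 2.0], [2.0, 5.0]])
L = np.linalg.cholesky(A)

z = rng.standard_normal((100_000, 2))
samples = z @ L.T                               # rows ~ N(0, L L^T) = N(0, A)
print(np.cov(samples, rowvar=False))            # close to A

logdet = 2.0 * np.log(np.diag(L)).sum()         # log det A from the Cholesky diagonal
print(np.isclose(logdet, np.log(np.linalg.det(A))))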

5.4 Whitening Transforms

Definition 5.2 (Whitening). Given data \mathbf{x} with mean \boldsymbol{\mu} and covariance \Sigma \succ 0, a whitening transform maps \mathbf{x} \mapsto \mathbf{z} = W(\mathbf{x} - \boldsymbol{\mu}) with W chosen so that the transformed variable \mathbf{z} has covariance I (identity) - it is "white." The canonical choice is W = \Sigma^{-1/2}.

ZCA whitening. ZCA (Zero-phase Component Analysis) whitening uses the symmetric square root: W_{\text{ZCA}} = \Sigma^{-1/2}. The transformed data W_{\text{ZCA}}\mathbf{x} is as close to the original \mathbf{x} as possible while being white. This minimizes the change in "appearance" of the data.

Cholesky whitening. Using W_{\text{Chol}} = L^{-1} (where \Sigma = LL^\top) also produces white data but rotates it differently. This form is computationally cheaper.

Mahalanobis distance. For A = \Sigma^{-1}, the quadratic form \mathbf{x}^\top \Sigma^{-1} \mathbf{x} is the squared Mahalanobis distance from the origin. Geometrically, it measures distance in units of standard deviations, accounting for the correlation structure of \Sigma. In n dimensions, (\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \|\mathbf{x}-\boldsymbol{\mu}\|_{\Sigma^{-1}}^2.
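A compact comparison of the two whitening transforms on synthetic data (all numbers are placeholders; both maps produce identity sample covariance, while ZCA stays closest to the original coordinates):

import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 0.7]])
X = rng.standard_normal((10_000, 3)) @ M.T          # correlated synthetic data
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

w, Q = np.linalg.eigh(Sigma)
W_zca = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T          # symmetric Sigma^{-1/2}
W_chol = np.linalg.inv(np.linalg.cholesky(Sigma))    # L^{-1} with Sigma = L L^T

for W in (W_zca, W_chol):
    Z = (X - mu) @ W.T
    assert np.allclose(np.cov(Z, rowvar=False), np.eye(3))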

For AI: Batch normalization can be viewed as approximate ZCA whitening per mini-batch. More precisely, it normalizes each feature to zero mean and unit variance (diagonal whitening), avoiding the expensive full whitening. The full Mahalanobis distance appears in metric learning (e.g., Siamese networks learning A such that \mathbf{x}^\top A \mathbf{x} is a meaningful distance) and in attention mechanisms where different layers may learn non-isotropic attention kernels.

