Positive Definite Matrices

"Positive definiteness is the matrix condition that makes everything work: optimization has a unique minimum, Gaussians are proper distributions, and Cholesky gives you a square root. It is the difference between a bowl and a saddle."

Overview

Positive definite matrices are the matrices that behave like positive numbers. Just as a positive scalar c > 0 makes the equation cx = b uniquely solvable and the function cx^2 have a unique minimum, a positive definite matrix A \succ 0 makes the system A\mathbf{x} = \mathbf{b} uniquely solvable, the quadratic form \mathbf{x}^\top A \mathbf{x} have a unique minimum at zero, and the decomposition A = LL^\top (Cholesky) exist and be unique. Positive definiteness is the structural condition that separates "well-posed" from "ill-posed" in a precise algebraic sense.

This section develops the full theory of positive definite and positive semidefinite matrices. We begin with quadratic forms and the geometric picture - level sets that form ellipsoids, energy functions with unique bowls - then build the complete toolkit: four equivalent characterizations (eigenvalue, determinantal, pivot, Cholesky), the Cholesky and LDL^T decompositions as the canonical section topic, matrix square roots, the Schur complement and its role in Gaussian conditioning, log-determinants, Gram matrices, and the PSD cone with semidefinite programming.

The AI connections are dense and load-bearing. Every covariance matrix in a multivariate Gaussian must be positive semidefinite - this is not a convenience but a mathematical necessity. The Fisher information matrix is PSD by construction and drives natural gradient methods (K-FAC). The Hessian at a local minimum is PSD. Cholesky factorization is the standard algorithm for sampling from multivariate Gaussians and the reparameterization trick in VAEs. Log-determinants appear in every Gaussian log-likelihood and normalizing flow objective. The attention mechanism computes a scaled Gram matrix. Understanding positive definiteness is understanding the algebraic backbone of probabilistic machine learning.

Prerequisites

Companion Notebooks

Notebook | Description
--- | ---
theory.ipynb | Interactive exploration: Cholesky algorithms, log-det computation, Gram matrices, GPR, VAE reparameterization, SDP
exercises.ipynb | 8 graded exercises from PD verification through Cholesky-based sampling for VAEs

Learning Objectives

After completing this section, you will be able to:

  • State all four equivalent characterizations of positive definiteness and prove the equivalences
  • Distinguish PD (A \succ 0), PSD (A \succeq 0), negative definite, and indefinite with examples
  • Apply Sylvester's criterion to certify positive definiteness without computing eigenvalues
  • Implement the Cholesky algorithm from scratch and analyze its numerical stability
  • Factor A = LDL^\top and explain when to prefer LDL^T over standard Cholesky
  • Compute the matrix square root A^{1/2} and explain its uniqueness
  • Define the Schur complement and use it to test positive definiteness of block matrices
  • Apply the matrix inversion lemma via Schur complements
  • Compute \log\det A via Cholesky and derive the gradient \nabla_A \log\det A = A^{-1}
  • Construct Gram matrices and prove they are PSD; state Mercer's theorem
  • Describe the PSD cone \mathbb{S}_+^n as a convex set and formulate SDPs
  • Apply Cholesky reparameterization to sample from multivariate Gaussians in VAEs
  • Connect the Fisher information matrix, natural gradient, and K-FAC to PD structure



1. Intuition

1.1 What Positive Definiteness Captures

The most useful single-sentence definition of a positive definite matrix is: a matrix that behaves like a positive number. To understand what this means, compare scalars and matrices in parallel.

For a scalar a > 0:

  • The equation ax = b has a unique solution x = b/a
  • The function f(x) = ax^2 has a unique minimum at x = 0
  • The scalar has a real square root: \sqrt{a} > 0
  • The product ax \cdot x = ax^2 \geq 0, with equality only at x = 0

For a positive definite matrix A \succ 0:

  • The system A\mathbf{x} = \mathbf{b} has a unique solution \mathbf{x} = A^{-1}\mathbf{b}
  • The function f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} has a unique minimum at \mathbf{x} = \mathbf{0}
  • The matrix has a unique positive definite square root A^{1/2} \succ 0
  • The quadratic form \mathbf{x}^\top A \mathbf{x} > 0 for all \mathbf{x} \neq \mathbf{0}

The quadratic form Q(x)=xAxQ(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} is the central object. Think of it as a machine that takes a vector x\mathbf{x} and returns a scalar measuring its "energy" with respect to AA. When AA is the identity, Q(x)=x2Q(\mathbf{x}) = \|\mathbf{x}\|^2, the familiar squared Euclidean length. When AA is diagonal with positive entries d1,,dnd_1, \ldots, d_n, then Q(x)=d1x12++dnxn2Q(\mathbf{x}) = d_1 x_1^2 + \cdots + d_n x_n^2 - a weighted sum of squared coordinates. Positive definiteness requires this energy to be positive for every nonzero direction.

The formal definition requires AA to be symmetric. The condition xAx>0\mathbf{x}^\top A \mathbf{x} > 0 for x0\mathbf{x} \neq \mathbf{0} is called strict positive definiteness. The relaxed condition xAx0\mathbf{x}^\top A \mathbf{x} \geq 0 is positive semidefiniteness - it allows zero energy in some directions (those in the null space of AA).

For AI: If a loss function \mathcal{L}(\boldsymbol{\theta}) has a local minimum at \boldsymbol{\theta}^*, then \nabla^2 \mathcal{L}(\boldsymbol{\theta}^*) \succeq 0; conversely, a positive definite Hessian at a critical point is exactly the second-order sufficient condition for a strict local minimum. This is the mathematical content of "the loss landscape is a bowl."
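
The bowl-versus-saddle distinction is easy to see numerically. The following minimal sketch (the two matrices are chosen purely for illustration) samples random directions and evaluates the quadratic form for a PD matrix and an indefinite one:

import numpy as np

# Energy Q(x) = x^T A x for a "bowl" (PD) and a "saddle" (indefinite) matrix.
A_bowl   = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3, 1  -> PD
A_saddle = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3, -1 -> indefinite

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                   # random nonzero directions

Q_bowl   = np.einsum('ij,jk,ik->i', X, A_bowl, X)    # row-wise x^T A x
Q_saddle = np.einsum('ij,jk,ik->i', X, A_saddle, X)

print(Q_bowl.min() > 0)     # True: positive energy in every sampled direction
print(Q_saddle.min() < 0)   # True: some directions have negative energy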

1.2 Geometric Picture: Ellipsoids and Level Sets

The level sets of the quadratic form xAx=c\mathbf{x}^\top A \mathbf{x} = c reveal the geometry of AA directly.

Case 1: A=IA = I (identity). The level set xIx=c\mathbf{x}^\top I \mathbf{x} = c is x2=c\|\mathbf{x}\|^2 = c - a sphere of radius c\sqrt{c}.

Case 2: AA diagonal, A=diag(λ1,λ2)A = \text{diag}(\lambda_1, \lambda_2) with λ1>λ2>0\lambda_1 > \lambda_2 > 0. The level set λ1x12+λ2x22=1\lambda_1 x_1^2 + \lambda_2 x_2^2 = 1 is an ellipse with semi-axes 1/λ11/\sqrt{\lambda_1} and 1/λ21/\sqrt{\lambda_2}. The larger eigenvalue compresses that direction; the smaller eigenvalue stretches it.

Case 3: AA general symmetric PD. The level set xAx=1\mathbf{x}^\top A \mathbf{x} = 1 is an ellipsoid whose axes are aligned with the eigenvectors of AA and whose semi-axis lengths are 1/λi1/\sqrt{\lambda_i}. This follows from the spectral decomposition A=QΛQA = Q\Lambda Q^\top: substituting y=Qx\mathbf{y} = Q^\top \mathbf{x} gives yΛy=1\mathbf{y}^\top \Lambda \mathbf{y} = 1, a coordinate-aligned ellipsoid.

Case 4: AA indefinite (some positive, some negative eigenvalues). The level set is a hyperboloid - unbounded in the negative-eigenvalue directions. There is no minimum energy.

Case 5: AA positive semidefinite (some zero eigenvalues). The level set is an "infinite cylinder" - flat in the null space directions. Energy is zero for all vectors in the null space.

QUADRATIC FORM GEOMETRY
========================================================================

  2D quadratic form  Q(x_1,x_2) = x^TAx,  level set Q = 1

  PD (both \lambda > 0):     PSD (one \lambda = 0):    Indefinite:
                        
     +-------+           ---------          /       \
     |  /-\  |           ---------          ---------
     | /   \ |           ---------          \       /
     +-------+           ---------          ---------
     ellipse             parallel lines     hyperbola
     
  Semi-axes \propto 1/\sqrt\lambda^i
  Axes aligned with eigenvectors of A

========================================================================

This geometric picture has a direct machine learning reading. The covariance matrix Σ\Sigma of a multivariate Gaussian defines a metric on feature space: (xμ)Σ1(xμ)(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) is the squared Mahalanobis distance, measuring how many standard deviations x\mathbf{x} is from μ\boldsymbol{\mu} in each principal direction. The level sets of this distance are the ellipsoidal contours of the Gaussian density.

1.3 Why PD Matrices Matter for AI

Positive definiteness is not a technical curiosity - it is a pervasive structural condition in machine learning systems:

Covariance matrices. Any valid probability distribution must have a non-negative variance. For a multivariate random variable xRn\mathbf{x} \in \mathbb{R}^n, the covariance matrix Σ=E[(xμ)(xμ)]\Sigma = \mathbb{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^\top] is always PSD. For a multivariate Gaussian N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) with a proper density, we need Σ0\Sigma \succ 0 (so detΣ0\det \Sigma \neq 0 and the density is normalized). Every time a VAE encoder outputs a covariance or diagonal variance, it must produce a PSD (or PD) parameterization.

Fisher information matrix. The Fisher information F(θ)=Exp(θ)[θlogp(xθ)θlogp(xθ)]F(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x} \sim p(\cdot|\boldsymbol{\theta})}[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})^\top] is PSD by construction (it is a covariance matrix of score functions). The natural gradient ~L=F1L\tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L} uses the Fisher as a Riemannian metric on parameter space. K-FAC (Kronecker-Factored Approximate Curvature) approximates FF with a Kronecker-product PSD matrix for scalable second-order optimization.

Gram matrices and kernels. The inner product matrix Gij=xi,xjG_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle for any set of vectors is always PSD. Kernel methods rely on this: a function k(x,z)k(\mathbf{x},\mathbf{z}) is a valid kernel if and only if its Gram matrix is PSD for any set of inputs (Mercer's theorem). The scaled attention score matrix QK/dkQK^\top / \sqrt{d_k} in transformers is a (non-symmetric) Gram matrix.

Cholesky in numerical computation. Cholesky decomposition (A=LLA = LL^\top) is the fastest algorithm for solving Ax=bA\mathbf{x} = \mathbf{b} when AA is known to be PD - twice as fast as LU. It is the standard backend for Gaussian process inference, multivariate normal sampling, and Bayesian linear regression. NumPy, SciPy, PyTorch all use Cholesky internally whenever PD structure is detected.

Hessian and second-order methods. At a local minimum θ\boldsymbol{\theta}^*, the Hessian 2L(θ)0\nabla^2 \mathcal{L}(\boldsymbol{\theta}^*) \succeq 0. Sharpness-Aware Minimization (SAM) seeks parameters where λmax(2L)\lambda_{\max}(\nabla^2 \mathcal{L}) is small, empirically improving generalization. The Gauss-Newton approximation HJJH \approx J^\top J is always PSD and is the basis for K-FAC and natural gradient methods.

1.4 Historical Timeline

Year | Contributor | Development
--- | --- | ---
1801 | Gauss | Quadratic forms in celestial mechanics; method of least squares
1847 | Sylvester | Named "positive definite" forms; leading minor criterion
1852 | Sylvester | Sylvester's law of inertia (signature invariance under congruence)
1878 | Frobenius | Systematic theory of bilinear and quadratic forms
1909 | Mercer | Mercer's theorem linking PSD functions to kernel expansions
1910 | Cholesky | Cholesky decomposition for geodetic calculations (unpublished until 1924)
1934 | Loewner | Matrix ordering (Loewner order) A \succeq B defined rigorously
1955 | Bellman | Positive definite matrices in dynamic programming and control
1996 | Vandenberghe & Boyd | Semidefinite programming (SDP) as efficient optimization
1998 | Schölkopf & Smola | Kernel methods: SVMs via PSD kernel matrices
2013 | Kingma & Welling | VAE reparameterization trick using Cholesky sampling
2015 | Martens & Grosse | K-FAC: Kronecker-factored PSD approximation to Fisher
2024-2026 | Multiple groups | PSD structure in diffusion model covariances, normalizing flows, structured covariance estimation

2. Formal Definitions

2.1 Quadratic Forms

Definition 2.1 (Quadratic Form). Let A \in \mathbb{R}^{n \times n} be a symmetric matrix and \mathbf{x} \in \mathbb{R}^n. The associated quadratic form is the function Q_A : \mathbb{R}^n \to \mathbb{R} defined by

Q_A(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.

Note: every quadratic form can be written with a symmetric matrix. If A is not symmetric, replacing it with (A + A^\top)/2 gives the same quadratic form, since \mathbf{x}^\top A \mathbf{x} = \mathbf{x}^\top \frac{A+A^\top}{2} \mathbf{x} for all \mathbf{x}. We always assume A is symmetric.

Expanding in 2D. For A = \begin{pmatrix} a & b \\ b & d \end{pmatrix} and \mathbf{x} = (x_1, x_2)^\top:

Q_A(\mathbf{x}) = ax_1^2 + 2bx_1x_2 + dx_2^2.

The diagonal entries a, d give the pure squared terms; the off-diagonal b gives the cross term.

Completing the square. For the 1D quadratic q(x) = ax^2 + 2bx + c = a(x + b/a)^2 + (c - b^2/a), the minimum is c - b^2/a, achieved at x = -b/a. The matrix analogue is the Schur complement (6) - the quadratic form \mathbf{x}^\top A \mathbf{x} + 2\mathbf{b}^\top \mathbf{x} + c is minimized at \mathbf{x}^* = -A^{-1}\mathbf{b} (when A \succ 0) with minimum value c - \mathbf{b}^\top A^{-1} \mathbf{b}.
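
The matrix version of completing the square is easy to check numerically. A small sketch with an illustrative A, \mathbf{b}, c (not taken from the text): the closed-form minimizer achieves the predicted minimum, and any perturbation does worse.

import numpy as np

# Minimize f(x) = x^T A x + 2 b^T x + c for an illustrative PD matrix A.
A = np.array([[4.0, 2.0], [2.0, 5.0]])
b = np.array([1.0, -1.0])
c = 3.0

f = lambda x: x @ A @ x + 2 * b @ x + c
x_star = -np.linalg.solve(A, b)              # minimizer  x* = -A^{-1} b
f_min  = c - b @ np.linalg.solve(A, b)       # minimum value  c - b^T A^{-1} b

print(np.isclose(f(x_star), f_min))                   # True
print(f(x_star + np.array([0.1, -0.2])) > f_min)      # True: perturbations increase f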

Connection to bilinear forms. The quadratic form Q_A(\mathbf{x}) is the diagonal of the bilinear form B_A(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top A \mathbf{y}, i.e., Q_A(\mathbf{x}) = B_A(\mathbf{x}, \mathbf{x}). The bilinear form encodes the inner product structure induced by A.

2.2 The Four Cases: PD, PSD, ND, Indefinite

Definition 2.2. Let A \in \mathbb{R}^{n \times n} be symmetric. Then A is:

Name | Symbol | Condition | Colloquial
--- | --- | --- | ---
Positive definite | A \succ 0 | \mathbf{x}^\top A \mathbf{x} > 0 for all \mathbf{x} \neq \mathbf{0} | "bowl"
Positive semidefinite | A \succeq 0 | \mathbf{x}^\top A \mathbf{x} \geq 0 for all \mathbf{x} | "flat bowl"
Negative definite | A \prec 0 | \mathbf{x}^\top A \mathbf{x} < 0 for all \mathbf{x} \neq \mathbf{0} | "dome"
Negative semidefinite | A \preceq 0 | \mathbf{x}^\top A \mathbf{x} \leq 0 for all \mathbf{x} | "flat dome"
Indefinite | - | Q_A takes both positive and negative values | "saddle"

Note: A \prec 0 \Leftrightarrow -A \succ 0. The interesting cases for applications are PD and PSD.

Standard examples:

A = I_n: \mathbf{x}^\top I \mathbf{x} = \|\mathbf{x}\|^2 > 0 for \mathbf{x} \neq \mathbf{0}. I \succ 0. OK

A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}: Q = 2x_1^2 + 2x_1x_2 + 2x_2^2 = (x_1+x_2)^2 + x_1^2 + x_2^2 \geq x_1^2 + x_2^2 > 0 for \mathbf{x} \neq \mathbf{0}. \succ 0. OK

A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}: Q = x_1^2 \geq 0, and Q(\mathbf{e}_2) = 0. \succeq 0 but not \succ 0. (PSD, not PD.)

A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}: Q(1,-1) = 1 - 4 + 1 = -2 < 0, Q(1,0) = 1 > 0. Indefinite.

Non-examples (common mistakes):

A = \begin{pmatrix} 2 & 3 \\ 3 & 1 \end{pmatrix}: looks "positive" but \det A = 2 - 9 = -7 < 0, so indefinite.

A = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}: Q \equiv 0, so PSD but NOT PD.

A = \begin{pmatrix} -1 & 0 \\ 0 & 2 \end{pmatrix}: Q(\mathbf{e}_1) = -1, indefinite.
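
These classifications can be checked numerically via the spectral characterization of Section 3.1. A minimal sketch (the helper classify is illustrative, not a library function), run on the example matrices above:

import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)                 # eigenvalues of a symmetric matrix
    if np.all(lam >  tol):  return "positive definite"
    if np.all(lam >= -tol): return "positive semidefinite"
    if np.all(lam <  -tol): return "negative definite"
    if np.all(lam <=  tol): return "negative semidefinite"
    return "indefinite"

for A in ([[2, 1], [1, 2]], [[1, 0], [0, 0]], [[1, 2], [2, 1]], [[2, 3], [3, 1]]):
    print(A, "->", classify(np.array(A, dtype=float)))
# PD, PSD, indefinite, indefinite - matching the hand calculations above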

2.3 Immediate Consequences

If A0A \succ 0, then many properties follow immediately from the definition:

Proposition 2.3. Let A0A \succ 0 (symmetric). Then:

  1. Positive diagonal entries. Aii>0A_{ii} > 0 for all ii. Proof: Take x=ei\mathbf{x} = \mathbf{e}_i: eiAei=Aii>0\mathbf{e}_i^\top A \mathbf{e}_i = A_{ii} > 0.

  2. Positive trace. tr(A)=iAii>0\text{tr}(A) = \sum_i A_{ii} > 0.

  3. Positive determinant. detA>0\det A > 0. (Follows from all eigenvalues positive, 3.1.)

  4. Invertibility. AA is invertible, and A10A^{-1} \succ 0. Proof: If Ax=0A\mathbf{x} = \mathbf{0} then xAx=0\mathbf{x}^\top A \mathbf{x} = 0, so x=0\mathbf{x} = \mathbf{0}. For A1A^{-1}: (A1)(A^{-1}) is symmetric, and yA1y=(A1y)A(A1y)>0\mathbf{y}^\top A^{-1} \mathbf{y} = (A^{-1}\mathbf{y})^\top A (A^{-1}\mathbf{y}) > 0 for y0\mathbf{y} \neq \mathbf{0}.

  5. Closure under positive scaling. αA0\alpha A \succ 0 for α>0\alpha > 0.

  6. Closure under addition. If A,B0A, B \succ 0 then A+B0A + B \succ 0. Proof: x(A+B)x=xAx+xBx>0\mathbf{x}^\top(A+B)\mathbf{x} = \mathbf{x}^\top A\mathbf{x} + \mathbf{x}^\top B\mathbf{x} > 0.

  7. Tikhonov regularization is PD. If A0A \succeq 0 and λ>0\lambda > 0, then A+λI0A + \lambda I \succ 0. Proof: x(A+λI)xλx2>0\mathbf{x}^\top(A + \lambda I)\mathbf{x} \geq \lambda \|\mathbf{x}\|^2 > 0.

  8. Congruence preserves PD. If A0A \succ 0 and BB is invertible, then BAB0B^\top A B \succ 0. Proof: xBABx=(Bx)A(Bx)>0\mathbf{x}^\top B^\top A B \mathbf{x} = (B\mathbf{x})^\top A (B\mathbf{x}) > 0 since Bx0B\mathbf{x} \neq \mathbf{0} when x0\mathbf{x} \neq \mathbf{0}.

For AI: Property 4 ensures that covariance matrices can be inverted for precision computations. Property 7 is exactly why adding λI\lambda I (weight decay, Tikhonov regularization, jitter in Gaussian processes) to a PSD matrix makes it PD and invertible.

2.4 The Loewner Order

Definition 2.4 (Loewner Order). For symmetric matrices A,BRn×nA, B \in \mathbb{R}^{n \times n}, define the Loewner ordering:

A \succeq B \quad \Longleftrightarrow \quad A - B \succeq 0.

Similarly, ABAB0A \succ B \Leftrightarrow A - B \succ 0. This ordering is:

  • Reflexive: AAA \succeq A.
  • Antisymmetric: ABA \succeq B and BAB \succeq A implies A=BA = B.
  • Transitive: ABA \succeq B and BCB \succeq C implies ACA \succeq C.

However, it is NOT a total order - not every pair of symmetric matrices is comparable. For example, (1000)\begin{pmatrix}1&0\\0&0\end{pmatrix} and (0001)\begin{pmatrix}0&0\\0&1\end{pmatrix} are neither \succeq nor \preceq each other.

Properties of the Loewner order:

If ABA \succeq B then:

  • tr(A)tr(B)\text{tr}(A) \geq \text{tr}(B) (trace is monotone)
  • λi(A)λi(B)\lambda_i(A) \geq \lambda_i(B) for all ii when eigenvalues are sorted in decreasing order (Weyl's theorem)
  • CACCBCC^\top A C \succeq C^\top B C for any CC (congruence is monotone)
  • If additionally A,B0A, B \succ 0: B1A1B^{-1} \succeq A^{-1} (inversion reverses the order!)

The last property is surprising: if AB0A \succeq B \succ 0, then A1B1A^{-1} \preceq B^{-1}. Intuition: a larger matrix "moves vectors more," so its inverse "moves them less."

For AI: In Bayesian inference, the posterior covariance is \Sigma_{\text{post}} = (\Sigma_{\text{prior}}^{-1} + X^\top X / \sigma^2)^{-1}. Since we added X^\top X / \sigma^2 \succeq 0 to the prior precision, the posterior precision is larger, so by order-reversal, \Sigma_{\text{post}} \preceq \Sigma_{\text{prior}}. Observing data never increases uncertainty - the Loewner order makes this rigorous.
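
A quick numerical sketch of this order-reversal claim, with an arbitrary prior covariance, design matrix, and noise level chosen purely for illustration:

import numpy as np

# Observing data never increases uncertainty: Sigma_post <= Sigma_prior (Loewner order).
rng = np.random.default_rng(0)
Sigma_prior = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.normal(size=(10, 2))                # design matrix of 10 observations
sigma2 = 0.5                                # observation noise variance

Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_prior) + X.T @ X / sigma2)

# Sigma_prior - Sigma_post should be PSD: all its eigenvalues >= 0.
print(np.linalg.eigvalsh(Sigma_prior - Sigma_post).min() >= -1e-12)   # True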


3. Eigenvalue and Determinantal Characterizations

3.1 Spectral Characterization

The most computationally transparent characterization of positive definiteness uses eigenvalues.

Theorem 3.1 (Spectral Characterization). Let A \in \mathbb{R}^{n \times n} be symmetric with eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n. Then:

A \succ 0 \quad \Longleftrightarrow \quad \lambda_i > 0 \text{ for all } i = 1, \ldots, n. \qquad A \succeq 0 \quad \Longleftrightarrow \quad \lambda_i \geq 0 \text{ for all } i = 1, \ldots, n.

Proof. By the spectral theorem (-> 01: Eigenvalues), A = Q\Lambda Q^\top where Q is orthogonal and \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_n). For any \mathbf{x} \neq \mathbf{0}, let \mathbf{y} = Q^\top \mathbf{x} \neq \mathbf{0} (since Q is invertible):

\mathbf{x}^\top A \mathbf{x} = \mathbf{y}^\top \Lambda \mathbf{y} = \sum_{i=1}^n \lambda_i y_i^2.

If all λi>0\lambda_i > 0: the sum is positive for y0\mathbf{y} \neq \mathbf{0}. OK

If some λk0\lambda_k \leq 0: take y=ek\mathbf{y} = \mathbf{e}_k (so x=Qek=qk\mathbf{x} = Q\mathbf{e}_k = \mathbf{q}_k, the kk-th eigenvector). Then xAx=λk0\mathbf{x}^\top A \mathbf{x} = \lambda_k \leq 0. NO

The characterization for 0\succeq 0 follows by the same argument with \geq replacing >>. \square

Immediate consequences:

  • tr(A)=λi>0\text{tr}(A) = \sum \lambda_i > 0 for PD
  • detA=λi>0\det A = \prod \lambda_i > 0 for PD
  • A2=λ1\|A\|_2 = \lambda_1 and A12=1/λn\|A^{-1}\|_2 = 1/\lambda_n for PD (-> 06: Norms)
  • The condition number κ(A)=λ1/λn\kappa(A) = \lambda_1/\lambda_n measures how "nearly singular" AA is

For AI: In Gaussian process regression, the covariance kernel matrix KK must be PSD. Verifying K0K \succeq 0 by computing eigenvalues is expensive; checking a Cholesky decomposition (4) is faster and also provides the factorization needed for inference.

Note on the spectral theorem: This proof uses the spectral theorem for symmetric matrices - that every symmetric real matrix is orthogonally diagonalizable. For the full proof and discussion of spectral theory, see 01: Eigenvalues and Eigenvectors.

3.2 Sylvester's Criterion

Sylvester's criterion provides a purely determinantal test that avoids eigenvalue computation entirely, making it practical for hand calculations and symbolic proofs.

Definition 3.2 (Leading Principal Minors). The k-th leading principal minor of A \in \mathbb{R}^{n \times n} is:

\Delta_k = \det(A[1:k, 1:k]) = \det\begin{pmatrix}A_{11} & \cdots & A_{1k} \\ \vdots & \ddots & \vdots \\ A_{k1} & \cdots & A_{kk}\end{pmatrix}.

So \Delta_1 = A_{11}, \Delta_2 = A_{11}A_{22} - A_{12}^2 (for symmetric A), and \Delta_n = \det A.

Theorem 3.3 (Sylvester's Criterion). A symmetric matrix A \in \mathbb{R}^{n \times n} is positive definite if and only if all leading principal minors are positive:

A \succ 0 \quad \Longleftrightarrow \quad \Delta_k > 0 \text{ for all } k = 1, 2, \ldots, n.

Proof sketch. The key insight is that the leading principal submatrix Ak=A[1:k,1:k]A_k = A[1:k,1:k] is itself symmetric, and A0A \succ 0 implies Ak0A_k \succ 0 for every kk (restriction to the first kk standard basis vectors). The converse uses induction: if Δ1,,Δn1>0\Delta_1, \ldots, \Delta_{n-1} > 0 and Δn>0\Delta_n > 0, then by the inductive hypothesis all principal leading submatrices are PD, and the Cholesky factorization can be completed (4 gives the constructive proof). The determinant of a PD matrix equals the product of its Cholesky diagonal entries squared, all positive, so Δn>0\Delta_n > 0. \square

Working example. Test A = \begin{pmatrix}4 & 2 & 1\\2 & 3 & 0\\1 & 0 & 2\end{pmatrix}:

  • \Delta_1 = 4 > 0 OK
  • \Delta_2 = 4 \cdot 3 - 2^2 = 12 - 4 = 8 > 0 OK
  • \Delta_3 = \det A = 4(6-0) - 2(4-0) + 1(0-3) = 24 - 8 - 3 = 13 > 0 OK

So A \succ 0.

Caution for PSD. Sylvester's criterion does NOT characterize A \succeq 0. A matrix can have all leading principal minors \geq 0 but still not be PSD (a counterexample is A = \begin{pmatrix}0 & 0 \\ 0 & -1\end{pmatrix}: \Delta_1 = 0 \geq 0, \Delta_2 = 0 \geq 0, but Q(\mathbf{e}_2) = -1 < 0). For PSD testing, use all principal minors (not just leading ones), or use eigenvalues.

For AI: Sylvester's criterion is used in symbolic proofs and in optimization algorithms that maintain PD structure (e.g., proving that a block covariance matrix built by a GP model is PD, by verifying the diagonal blocks and Schur complements are positive).
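
The working example above can be reproduced directly. A small sketch (the helper name leading_principal_minors is illustrative):

import numpy as np

def leading_principal_minors(A):
    """Return [det(A[:1,:1]), det(A[:2,:2]), ..., det(A)]."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, len(A) + 1)]

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
minors = leading_principal_minors(A)
print(np.round(minors, 6))                     # [ 4.  8. 13.] - all positive, so A is PD
print(np.all(np.linalg.eigvalsh(A) > 0))       # True: agrees with the spectral test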

3.3 Pivot Characterization

The connection between Gaussian elimination and positive definiteness is deep and computationally significant.

Theorem 3.4 (Pivot Characterization). A symmetric matrix A \in \mathbb{R}^{n \times n} is positive definite if and only if Gaussian elimination without row exchanges produces all positive pivots.

The pivots of A are d_1 = A_{11}, d_2 = \Delta_2/\Delta_1, \ldots, d_k = \Delta_k/\Delta_{k-1}. So A \succ 0 \Leftrightarrow all pivots d_k > 0 \Leftrightarrow all \Delta_k > 0 (Sylvester). The LDL^T factorization makes this explicit: A = LDL^\top where D = \text{diag}(d_1, \ldots, d_n) and A \succ 0 \Leftrightarrow all diagonal entries of D are positive (4.4).

The pivot characterization is how Cholesky "discovers" positive definiteness: the Cholesky algorithm fails (tries to take the square root of a non-positive number) exactly when a pivot is non-positive. This makes Cholesky the standard computational test for positive definiteness: try to factor; failure reveals non-PD structure.

PIVOT CHARACTERIZATION SUMMARY
========================================================================

  A \in \mathbb{R}^n^x^n symmetric.  The following are equivalent:

  (i)   A \succ 0  (quadratic form positive for all x \neq 0)
  (ii)  All eigenvalues \lambda^i > 0                          [Spectral]
  (iii) All leading principal minors \Delta_k > 0              [Sylvester]
  (iv)  All Gaussian elimination pivots d_k > 0           [Pivot]
  (v)   Cholesky factorization A = LL^T exists with L     [Cholesky]
        lower triangular, L^i^i > 0

  Each characterization provides a different computational test:
  (ii)  O(n^3): eigenvalue decomposition  <- most expensive
  (iii) O(n^4): compute n determinants  <- symbolic/manual only
  (iv)  O(n^3): Gaussian elimination
  (v)   O(n^3/3): Cholesky             <- fastest in practice

========================================================================

3.4 Certifying PSD vs PD

In numerical practice, the PD/PSD boundary is a source of subtle bugs. Here we describe how to handle the boundary cases.

Near-PSD matrices. A matrix that is theoretically PSD may have small negative eigenvalues due to floating-point errors. For example, if AA is constructed as XXX^\top X but XX is nearly rank-deficient, AA might have eigenvalues like 1015-10^{-15} due to rounding. The standard fix is to add a small jitter: Aϵ=A+ϵIA_\epsilon = A + \epsilon I for ϵ106\epsilon \approx 10^{-6} or 10810^{-8}.

Numerical rank. For a PSD matrix, the rank equals the number of positive eigenvalues. In finite precision, eigenvalues below ϵA2\epsilon \cdot \|A\|_2 (machine epsilon times spectral norm) are treated as zero. numpy.linalg.matrix_rank uses this threshold internally.

Testing PSD in practice:

  1. Try np.linalg.cholesky(A) - succeeds iff A0A \succ 0 (strictly PD)
  2. For PSD test: check np.all(np.linalg.eigvalsh(A) >= -tol) with tol = 1e-10
  3. For rank-rr PSD: compute np.linalg.matrix_rank(A) and verify r<nr < n

The zero-eigenvalue case. If A0A \succeq 0 and rank(A)=r<n\text{rank}(A) = r < n, the Cholesky factorization of the full n×nn \times n matrix does not exist. However, the factorization A=LLA = LL^\top where LL is n×rn \times r (a "thin" Cholesky) does exist and can be computed via pivoted Cholesky. This is used in sparse GP approximations (Nystrom approximation, inducing points).

For AI: PyTorch's torch.linalg.cholesky raises torch.linalg.LinAlgError if the matrix is not PD. In practice, diagonal jitter (A + 1e-6 * torch.eye(n)) is the standard workaround in GP implementations and VAE covariance parameterizations.
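
The testing recipes above translate directly into code. A minimal sketch (the helper names is_pd / is_psd are illustrative, not a library API):

import numpy as np

def is_pd(A):
    """Strict PD test: Cholesky succeeds iff A is symmetric positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def is_psd(A, tol=1e-10):
    """PSD test with a tolerance for floating-point noise in the eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

A_pd  = np.array([[2.0, 1.0], [1.0, 2.0]])
A_psd = np.array([[1.0, 0.0], [0.0, 0.0]])
print(is_pd(A_pd), is_pd(A_psd))     # True False
print(is_psd(A_pd), is_psd(A_psd))   # True True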


4. Cholesky Decomposition

Cholesky decomposition is the most important computational tool in this section. Unlike the spectral characterization (eigenvalue decomposition, O(n3)O(n^3) with a large constant) or LU (general, O(n3)O(n^3) with pivoting), Cholesky is:

  • Twice as fast as LU for symmetric positive definite systems (n3/3\approx n^3/3 vs 2n3/3\approx 2n^3/3 operations)
  • Numerically backward-stable without pivoting (which LU requires for stability)
  • Reveals the square root: LL satisfies LL=ALL^\top = A, i.e., LL is a "matrix square root"
  • The canonical test for positive definiteness: it exists iff A0A \succ 0

4.1 The Theorem: Existence and Uniqueness

Theorem 4.1 (Cholesky Factorization). Let A \in \mathbb{R}^{n \times n} be symmetric. Then:

A \succ 0 \quad \Longleftrightarrow \quad \exists! \text{ lower triangular } L \text{ with positive diagonal such that } A = LL^\top.

This L is called the Cholesky factor of A. The factorization A = LL^\top is the Cholesky decomposition.

Proof of existence (constructive). We prove by induction on nn.

Base case (n=1n=1): A=(a11)A = (a_{11}) with a11>0a_{11} > 0 (since A0A \succ 0). Take L=(a11)L = (\sqrt{a_{11}}).

Inductive step: Write AA in block form:

A = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & B \end{pmatrix}

where a11>0a_{11} > 0 (Proposition 2.3), aRn1\mathbf{a} \in \mathbb{R}^{n-1}, and BR(n1)×(n1)B \in \mathbb{R}^{(n-1)\times(n-1)}.

Set 11=a11\ell_{11} = \sqrt{a_{11}}, =a/11\boldsymbol{\ell} = \mathbf{a}/\ell_{11}. Then B=Baa/a11B - \boldsymbol{\ell}\boldsymbol{\ell}^\top = B - \mathbf{a}\mathbf{a}^\top/a_{11} is the Schur complement S=Baa111aS = B - \mathbf{a} a_{11}^{-1} \mathbf{a}^\top.

Since A0A \succ 0, its Schur complement S0S \succ 0 (Theorem 6.2). By the inductive hypothesis, S=L22L22S = L_{22}L_{22}^\top for unique lower triangular L22L_{22} with positive diagonal.

Then:

L = \begin{pmatrix}\ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22}\end{pmatrix}, \quad LL^\top = \begin{pmatrix}\ell_{11}^2 & \ell_{11}\boldsymbol{\ell}^\top \\ \ell_{11}\boldsymbol{\ell} & \boldsymbol{\ell}\boldsymbol{\ell}^\top + L_{22}L_{22}^\top\end{pmatrix} = \begin{pmatrix}a_{11} & \mathbf{a}^\top \\ \mathbf{a} & B\end{pmatrix} = A.

Proof of uniqueness. Suppose A = L_1 L_1^\top = L_2 L_2^\top with L_1, L_2 lower triangular and positive diagonal. Then L_2^{-1}L_1 = L_2^\top (L_1^\top)^{-1}. The left side is lower triangular; the right side is upper triangular. Both sides equal the same matrix, which must be diagonal with positive entries (since L_1, L_2 have positive diagonals). Comparing: L_2^{-1}L_1 = D (diagonal, positive), so L_1 = L_2 D. But then A = L_1 L_1^\top = L_2 D^2 L_2^\top and also A = L_2 L_2^\top, so D^2 = I, so D = I, so L_1 = L_2. \square

Converse. If A = LL^\top with L lower triangular and positive diagonal, then for \mathbf{x} \neq \mathbf{0}: \mathbf{x}^\top A \mathbf{x} = \mathbf{x}^\top L L^\top \mathbf{x} = \|L^\top \mathbf{x}\|^2 > 0 (since L^\top is invertible, its diagonal being positive). So A \succ 0. \square

4.2 The Algorithm: Constructing L

The Cholesky algorithm computes L column by column. From A = LL^\top, expanding element by element:

A_{ij} = \sum_{k=1}^{\min(i,j)} L_{ik} L_{jk}.

For the diagonal entry (i,i):

A_{ii} = \sum_{k=1}^{i} L_{ik}^2 = L_{ii}^2 + \sum_{k=1}^{i-1} L_{ik}^2 \implies L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}.

For off-diagonal entry (i,j) with i > j:

A_{ij} = \sum_{k=1}^{j} L_{ik}L_{jk} = L_{ij}L_{jj} + \sum_{k=1}^{j-1} L_{ik}L_{jk} \implies L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right).

Algorithm 4.2 (Cholesky, column-wise):

Input:  Symmetric A \in \mathbb{R}^n^x^n
Output: Lower triangular L with A = LL^T, or FAILURE

for j = 1 to n:
    s = A[j,j] - sum(L[j,k]^2 for k = 1..j-1)
    if s \leq 0:
        FAILURE (A is not positive definite)
    L[j,j] = sqrt(s)
    for i = j+1 to n:
        L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k] for k=1..j-1)) / L[j,j]
    for i = 1 to j-1:
        L[i,j] = 0       (upper triangle is zero)
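
A direct NumPy translation of Algorithm 4.2, sketched for clarity rather than performance and checked against np.linalg.cholesky on the worked example below:

import numpy as np

def cholesky(A):
    """Column-wise Cholesky: returns lower-triangular L with A = L L^T.

    Raises ValueError if A is not (numerically) positive definite.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]        # pivot after removing earlier columns
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]])
L = cholesky(A)
print(np.allclose(L, np.linalg.cholesky(A)))   # True
print(np.allclose(L @ L.T, A))                 # True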

Worked example. Compute the Cholesky factor of A = \begin{pmatrix}4 & 2 & 0\\2 & 5 & 1\\0 & 1 & 3\end{pmatrix}:

Column 1: L_{11} = \sqrt{4} = 2, L_{21} = 2/2 = 1, L_{31} = 0/2 = 0.

Column 2: L_{22} = \sqrt{5 - 1^2} = \sqrt{4} = 2, L_{32} = (1 - 0\cdot1)/2 = 1/2.

Column 3: L_{33} = \sqrt{3 - 0^2 - (1/2)^2} = \sqrt{3 - 1/4} = \sqrt{11/4} = \sqrt{11}/2.

L = \begin{pmatrix}2 & 0 & 0 \\ 1 & 2 & 0 \\ 0 & 1/2 & \sqrt{11}/2\end{pmatrix}.

Verify: LL^\top = \begin{pmatrix}4 & 2 & 0\\2 & 5 & 1\\0 & 1 & 3\end{pmatrix} = A. OK

4.3 Complexity and Numerical Properties

Flop count. The Cholesky algorithm requires approximately n3/3n^3/3 floating-point operations - precisely half the 2n3/32n^3/3 required by LU decomposition. For large nn, this factor-of-2 speedup is significant.

Memory. Only the lower triangle of LL needs to be stored: n(n+1)/2n(n+1)/2 entries vs n2n^2 for general LU. In-place variants overwrite the lower triangle of AA with LL.

Backward stability. Cholesky is numerically backward-stable without pivoting: the computed L^\hat{L} satisfies L^L^=A+E\hat{L}\hat{L}^\top = A + E where E2cnϵmachA2\|E\|_2 \leq c_n \epsilon_{\text{mach}} \|A\|_2 for a modest constant cnc_n. This is stronger than LU without pivoting (which can be unstable). The reason: each step computes positive\sqrt{\text{positive}}, which cannot amplify errors.

Solving Ax=bA\mathbf{x} = \mathbf{b}: Factor A=LLA = LL^\top, then solve Ly=bL\mathbf{y} = \mathbf{b} by forward substitution and Lx=yL^\top\mathbf{x} = \mathbf{y} by backward substitution. Total cost: n3/3+2n2n^3/3 + 2n^2 (factorization + two triangular solves).

Failure = non-PD. If at any step the quantity under the square root is non-positive, AA is not positive definite. This makes Cholesky the standard computational test for PD structure.

For AI: JAX's jax.scipy.linalg.cholesky, PyTorch's torch.linalg.cholesky, and SciPy's scipy.linalg.cholesky are all efficient LAPACK-backed implementations. In Gaussian process models, the Cholesky solve L \ b (forward substitution) appears in every prediction and log-marginal-likelihood computation.
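
In practice one rarely forms A^{-1} explicitly; the factor-once-then-solve pattern looks like this. A sketch using SciPy's cho_factor / cho_solve on a synthetic SPD matrix (data and jitter values are illustrative):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Solve A x = b for an SPD matrix by factoring once and reusing the factor.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
A = X.T @ X + 1e-3 * np.eye(5)        # SPD by construction (Gram matrix + jitter)
b = rng.normal(size=5)

c, low = cho_factor(A)                 # O(n^3/3) factorization
x = cho_solve((c, low), b)             # two O(n^2) triangular solves
print(np.allclose(A @ x, b))           # True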

4.4 LDL^T Factorization

The LDL^T decomposition is a variant of Cholesky that avoids square roots - useful when the matrix is PSD but not strictly PD, or when square root computation is expensive.

Theorem 4.3 (LDL^T Factorization). Every symmetric A \in \mathbb{R}^{n \times n} that admits Gaussian elimination without row swaps can be uniquely factored as:

A = LDL^\top

where L is unit lower triangular (L_{ii} = 1) and D = \text{diag}(d_1, \ldots, d_n).

Moreover, A \succ 0 \Leftrightarrow all diagonal entries d_i > 0.

Relation to Cholesky. If A \succ 0, the LDL^T and Cholesky factorizations are related by:

A = LDL^\top = L\sqrt{D}\sqrt{D}L^\top = (L\sqrt{D})(L\sqrt{D})^\top.

So the Cholesky factor is \hat{L} = L\sqrt{D} where \sqrt{D} = \text{diag}(\sqrt{d_1},\ldots,\sqrt{d_n}). The LDL^T decomposition computes the d_i directly without taking square roots.

Algorithm 4.4 (LDL^T, column-wise):

for j = 1 to n:
    v[1..j-1] = D[1..j-1] * L[j,1..j-1]   (element-wise)
    D[j] = A[j,j] - dot(L[j,1..j-1], v[1..j-1])
    for i = j+1 to n:
        L[i,j] = (A[i,j] - dot(L[i,1..j-1], v[1..j-1])) / D[j]
    L[j,j] = 1

Worked example. Factor A = \begin{pmatrix}4 & 2\\2 & 5\end{pmatrix}:

d_1 = 4, L_{21} = 2/4 = 1/2, d_2 = 5 - (1/2)^2 \cdot 4 = 5 - 1 = 4.

L = \begin{pmatrix}1 & 0 \\ 1/2 & 1\end{pmatrix}, \quad D = \begin{pmatrix}4 & 0 \\ 0 & 4\end{pmatrix}.

Verify: LDL^\top = \begin{pmatrix}1&0\\1/2&1\end{pmatrix}\begin{pmatrix}4&0\\0&4\end{pmatrix}\begin{pmatrix}1&1/2\\0&1\end{pmatrix} = \begin{pmatrix}4&2\\2&5\end{pmatrix}. OK
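
A square-root-free implementation of Algorithm 4.4, again a minimal sketch (no pivoting, no PD check), verified on the worked example above:

import numpy as np

def ldl(A):
    """Return (L, d) with A = L diag(d) L^T, L unit lower triangular (no square roots)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        v = d[:j] * L[j, :j]                       # D[0:j] * L[j, 0:j], element-wise
        d[j] = A[j, j] - L[j, :j] @ v
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ v) / d[j]
    return L, d

A = np.array([[4.0, 2.0], [2.0, 5.0]])
L, d = ldl(A)
print(L, d)                                        # L = [[1,0],[0.5,1]], d = [4, 4]
print(np.allclose(L @ np.diag(d) @ L.T, A))        # True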

When to prefer LDL^T over Cholesky:

  • When A is indefinite but you still want a factorization (use Bunch-Kaufman pivoted LDL^T with 2 \times 2 pivots)
  • When square root computation is numerically undesirable
  • In interval arithmetic or symbolic computation

Symmetric indefinite systems. The Bunch-Kaufman (BK) algorithm extends LDL^T to indefinite matrices using 1 \times 1 and 2 \times 2 pivots. The LAPACK routine dsytrf implements this.

4.5 Modified Cholesky for Near-PD Matrices

In optimization (especially trust-region methods and quasi-Newton methods), we frequently need to factor a matrix that is "nearly" PD - the Hessian approximation may have small negative eigenvalues due to finite differences or insufficient curvature.

The problem. Standard Cholesky fails (negative pivot) when AA is not strictly PD. Rather than failing, we want a PD matrix A+EA + E close to AA that can be Cholesky-factored.

Gill-Murray-Wright (GMW) modification. At each pivot step jj, if the computed pivot dj<δd_j < \delta for a small threshold δ>0\delta > 0, set djδ+djd_j \leftarrow \delta + |d_j| (adding a correction). After factorization, A+E=LLA + E = LL^\top where E=LcomputedLcomputedAE = L_{\text{computed}}L_{\text{computed}}^\top - A is a small diagonal correction.

Simple diagonal jitter. The most widely used approach in machine learning is to add ϵI\epsilon I before attempting Cholesky:

import numpy as np

def robust_cholesky(A, jitter=1e-6):
    """Cholesky with a diagonal-jitter fallback for numerically near-PD matrices."""
    try:
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        # Shifting the diagonal raises every eigenvalue by `jitter`, making A strictly PD.
        return np.linalg.cholesky(A + jitter * np.eye(len(A)))

This is the standard "GP jitter" pattern. The added ϵI\epsilon I corresponds to assuming a small observation noise or numerical regularization.

For AI: PyTorch GPyTorch (Gaussian Process library) wraps all kernel matrix Cholesky calls with diagonal jitter and a retry mechanism. The JAX implementation uses jax.scipy.linalg.cholesky and adds jitter via A += diag_jitter * jnp.eye(n). Understanding why jitter works - it adds ϵ\epsilon to each eigenvalue, making all eigenvalues at least ϵ>0\epsilon > 0 - is exactly positive definiteness via the spectral characterization.


5. Matrix Square Root and Whitening

5.1 The PSD Square Root

For a scalar a0a \geq 0, there is a unique non-negative square root a0\sqrt{a} \geq 0. For a PSD matrix, the same uniqueness holds - but only in the class of PSD matrices.

Theorem 5.1 (PSD Square Root). Let A \in \mathbb{R}^{n \times n} be symmetric and positive semidefinite. Then there exists a unique symmetric positive semidefinite matrix A^{1/2} such that:

(A^{1/2})^2 = A^{1/2} A^{1/2} = A.

If A \succ 0, then A^{1/2} \succ 0.

Proof. By the spectral theorem, A = Q\Lambda Q^\top where \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n) with all \lambda_i \geq 0. Define:

A^{1/2} = Q \Lambda^{1/2} Q^\top, \quad \Lambda^{1/2} = \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}).

Then (A^{1/2})^2 = Q\Lambda^{1/2}Q^\top Q\Lambda^{1/2}Q^\top = Q\Lambda Q^\top = A. This A^{1/2} is symmetric PSD.

Uniqueness: Suppose B^2 = A with B symmetric PSD. Then B and A commute (since BA = B \cdot B^2 = B^3 = B^2 \cdot B = AB), so they share eigenvectors. On each shared eigenspace with eigenvalue \lambda, B must equal \sqrt{\lambda} (the unique non-negative root). So B = Q\Lambda^{1/2}Q^\top = A^{1/2}. \square

Warning: Non-uniqueness without PSD constraint. The equation B^2 = A for A \succ 0 has 2^n solutions in the class of symmetric matrices when the eigenvalues of A are distinct (one can independently choose \pm\sqrt{\lambda_i} for each eigenvalue); with repeated eigenvalues there are infinitely many. The PSD square root A^{1/2} is the unique solution with all non-negative eigenvalues.

5.2 Computing Square Roots

Via eigendecomposition:

A = Q\Lambda Q^\top \implies A^{1/2} = Q\Lambda^{1/2}Q^\top.

This requires the full eigendecomposition, costing O(n3)O(n^3) with a large constant. For A0A \succ 0:

eigvals, Q = np.linalg.eigh(A)   # symmetric eigendecomposition
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

Via Cholesky (the "left Cholesky factor"). Note: the Cholesky factor LL satisfies LL=ALL^\top = A but LA1/2L \neq A^{1/2} in general (unless AA is diagonal). The Cholesky factor is a square root of AA in the sense A=LLA = LL^\top, but it is not symmetric and not equal to the eigendecomposition-based A1/2A^{1/2}.

Via Newton's method (for matrix functions): The iteration Xk+1=12(Xk+Xk1A)X_{k+1} = \frac{1}{2}(X_k + X_k^{-1}A) starting from X0=IX_0 = I converges to A1/2A^{1/2} for A0A \succ 0. This is the matrix analogue of Heron's method for scalar square roots. Convergence is quadratic once close enough.

Via Pade approximants: For AA close to II, write A=I+EA = I + E and use a matrix series A1/2=(I+E)1/2=I+12E18E2+A^{1/2} = (I+E)^{1/2} = I + \frac{1}{2}E - \frac{1}{8}E^2 + \cdots, truncated at some order.

5.3 Square Root vs Cholesky

The two "square root" operations serve different purposes:

Property | Cholesky factor L | PSD square root A^{1/2}
--- | --- | ---
Satisfies | A = LL^\top | A = (A^{1/2})^2
Shape | Lower triangular | Symmetric
Uniqueness | Unique (positive diagonal) | Unique (PSD)
Computation | O(n^3/3) | O(n^3) (full eigendecomposition)
Invertibility | L^{-1} is lower triangular | (A^{1/2})^{-1} = A^{-1/2}
Use case | Solving systems, sampling | Mahalanobis, whitening

When to use Cholesky: solving A\mathbf{x}=\mathbf{b}, computing \log\det A, sampling from \mathcal{N}(\mathbf{0},A).

When to use A^{1/2}: defining the Mahalanobis inner product \langle\mathbf{x},\mathbf{y}\rangle_{A^{-1}} = \mathbf{x}^\top A^{-1}\mathbf{y}, whose induced squared norm is \mathbf{x}^\top A^{-1}\mathbf{x} = \|A^{-1/2}\mathbf{x}\|^2; whitening transforms; matrix differential equations.
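
The sampling use case is worth making explicit, since it is the mechanism behind Gaussian sampling in VAEs and GPs: if \mathbf{z} \sim \mathcal{N}(\mathbf{0}, I) and \Sigma = LL^\top, then \boldsymbol{\mu} + L\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma). A sketch with an illustrative mean and covariance:

import numpy as np

# Sampling from N(mu, Sigma) with the Cholesky factor: x = mu + L z, z ~ N(0, I).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T
Z = rng.standard_normal(size=(100_000, 2))
X = mu + Z @ L.T                                   # each row is mu + L z

print(np.round(X.mean(axis=0), 2))                 # close to mu
print(np.round(np.cov(X, rowvar=False), 2))        # close to Sigma, since Cov(Lz) = L L^T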

5.4 Whitening Transforms

Definition 5.2 (Whitening). Given data x\mathbf{x} with covariance Σ0\Sigma \succ 0, the whitening transform maps xz=Σ1/2(xμ)\mathbf{x} \mapsto \mathbf{z} = \Sigma^{-1/2}(\mathbf{x} - \boldsymbol{\mu}). The transformed variable z\mathbf{z} has covariance II (identity) - it is "white."

ZCA whitening. The ZCA (Zero-phase Component Analysis) whitening uses the symmetric square root: WZCA=Σ1/2W_{\text{ZCA}} = \Sigma^{-1/2}. The transformed data WZCAxW_{\text{ZCA}}\mathbf{x} is as close to the original x\mathbf{x} as possible while being white. This minimizes the change in "appearance" of the data.

Cholesky whitening. Using WChol=L1W_{\text{Chol}} = L^{-1} (where Σ=LL\Sigma = LL^\top) also produces white data but rotates it differently. This form is computationally cheaper.

Mahalanobis distance. For A=Σ1A = \Sigma^{-1}, the quadratic form xΣ1x\mathbf{x}^\top \Sigma^{-1} \mathbf{x} is the squared Mahalanobis distance from the origin. Geometrically, it measures the distance in "standard deviations units," accounting for the correlation structure of Σ\Sigma. In nn dimensions, (xμ)Σ1(xμ)=(xμ)Σ12(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \|(\mathbf{x}-\boldsymbol{\mu})\|_{\Sigma^{-1}}^2.

For AI: Batch normalization can be viewed as approximate ZCA whitening per mini-batch. More precisely, it normalizes each feature to zero mean and unit variance (diagonal whitening), avoiding the expensive full whitening. The full Mahalanobis distance appears in metric learning (e.g., Siamese networks learning AA such that xAx\mathbf{x}^\top A \mathbf{x} is a meaningful distance) and in attention mechanisms where different layers may learn non-isotropic attention kernels.
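
A minimal comparison of ZCA and Cholesky whitening on synthetic data (the mixing matrix below is arbitrary); both produce an approximately identity sample covariance, they just rotate the data differently:

import numpy as np

# Compare ZCA whitening (symmetric inverse square root) with Cholesky whitening.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.6, 1.0, 0.0],
                   [0.3, 0.2, 0.5]])
X = rng.normal(size=(10_000, 3)) @ mixing.T        # correlated data
Sigma = np.cov(X, rowvar=False)

# ZCA: W = Sigma^{-1/2} via the symmetric eigendecomposition.
lam, Q = np.linalg.eigh(Sigma)
W_zca = Q @ np.diag(1.0 / np.sqrt(lam)) @ Q.T

# Cholesky whitening: W = L^{-1} with Sigma = L L^T.
W_chol = np.linalg.inv(np.linalg.cholesky(Sigma))

for W in (W_zca, W_chol):
    Z = X @ W.T
    print(np.round(np.cov(Z, rowvar=False), 2))    # ~ identity in both cases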


6. Schur Complement

The Schur complement is the "matrix analogue of completing the square" for block matrices. It appears everywhere in probability (Gaussian conditioning), linear algebra (block matrix inversion), and optimization (constraint elimination).

6.1 Definition for Block Matrices

Definition 6.1 (Schur Complement). Let

M = \begin{pmatrix}A & B \\ C & D\end{pmatrix}

be a block matrix with A \in \mathbb{R}^{p \times p} invertible. The Schur complement of A in M is:

S = D - CA^{-1}B \in \mathbb{R}^{q \times q}.

Similarly, if D \in \mathbb{R}^{q \times q} is invertible, the Schur complement of D is A - BD^{-1}C.

Origin: block Gaussian elimination. The Schur complement arises naturally when eliminating the (2,1) block:

M = \begin{pmatrix}A & B \\ C & D\end{pmatrix} = \begin{pmatrix}I & 0 \\ CA^{-1} & I\end{pmatrix}\begin{pmatrix}A & B \\ 0 & D-CA^{-1}B\end{pmatrix}.

The (2,2) block in the upper triangular factor is exactly S = D - CA^{-1}B.

Determinant formula. A key consequence of the block LU:

\det M = \det A \cdot \det(D - CA^{-1}B) = \det A \cdot \det S.

Similarly, \det M = \det D \cdot \det(A - BD^{-1}C) when D is invertible.

Block matrix inverse. Using the Schur complement:

M^{-1} = \begin{pmatrix}A^{-1} + A^{-1}B S^{-1} C A^{-1} & -A^{-1}B S^{-1} \\ -S^{-1}CA^{-1} & S^{-1}\end{pmatrix}

when both A and S = D - CA^{-1}B are invertible.
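
Both the determinant formula and the block-inverse formula are easy to verify numerically. A sketch on a random symmetric PD matrix (sizes chosen arbitrarily):

import numpy as np

# Verify det(M) = det(A) det(S) and the block-inverse formula.
rng = np.random.default_rng(0)
p, q = 3, 2
M = rng.normal(size=(p + q, p + q))
M = M @ M.T + np.eye(p + q)                 # symmetric PD, so all blocks behave nicely

A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]
Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                        # Schur complement of A in M

print(np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S)))   # True

Sinv = np.linalg.inv(S)
Minv = np.block([[Ainv + Ainv @ B @ Sinv @ C @ Ainv, -Ainv @ B @ Sinv],
                 [-Sinv @ C @ Ainv,                   Sinv]])
print(np.allclose(Minv, np.linalg.inv(M)))                                  # True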

6.2 Schur Complement and Positive Definiteness

The Schur complement provides an elegant characterization of block PD matrices.

Theorem 6.2 (Schur PD Criterion). Let M = \begin{pmatrix}A & B \\ B^\top & D\end{pmatrix} be symmetric (so C = B^\top). Then:

M \succ 0 \quad \Longleftrightarrow \quad A \succ 0 \text{ and } S = D - B^\top A^{-1} B \succ 0.

Proof. We use the block Cholesky:

M = \begin{pmatrix}A & B \\ B^\top & D\end{pmatrix} = \begin{pmatrix}I & 0 \\ B^\top A^{-1} & I\end{pmatrix}\begin{pmatrix}A & 0 \\ 0 & S\end{pmatrix}\begin{pmatrix}I & A^{-1}B \\ 0 & I\end{pmatrix}.

The middle factor is block diagonal: (A00S)\begin{pmatrix}A & 0 \\ 0 & S\end{pmatrix}. For any v=(x,y)\mathbf{v} = (\mathbf{x}, \mathbf{y})^\top:

\mathbf{v}^\top M \mathbf{v} = (\mathbf{x} + A^{-1}B\mathbf{y})^\top A (\mathbf{x} + A^{-1}B\mathbf{y}) + \mathbf{y}^\top S \mathbf{y}.

(\Rightarrow): If M0M \succ 0, taking y=0\mathbf{y} = \mathbf{0} shows A0A \succ 0; taking x=A1By\mathbf{x} = -A^{-1}B\mathbf{y} shows ySy>0\mathbf{y}^\top S\mathbf{y} > 0 for y0\mathbf{y} \neq 0, so S0S \succ 0.

(\Leftarrow): If A0A \succ 0 and S0S \succ 0, then both terms are non-negative, and at least one is positive for (x,y)(0,0)(\mathbf{x},\mathbf{y}) \neq (\mathbf{0},\mathbf{0}), so M0M \succ 0. \square

Corollary. M0A0M \succeq 0 \Leftrightarrow A \succeq 0 and S=DBA1B0S = D - B^\top A^{-1}B \succeq 0 (when AA is invertible; otherwise use the rank condition).

SCHUR COMPLEMENT AND BLOCK PD
========================================================================

  M = ( A    B  )  symmetric
      ( B^T   D  )
      
  M \succ 0  <=>  A \succ 0  AND  S = D - B^TA^-^1B \succ 0

  Intuition: "completing the square" in block form
  
  v^TMv = (x + A^-^1By)^TA(x + A^-^1By) + y^TSy
          -----------------------------   -----
          \geq 0 (since A \succ 0)              \geq 0 (since S \succ 0)

========================================================================

6.3 Matrix Inversion Lemma

The Woodbury matrix identity (also called the matrix inversion lemma or Sherman-Morrison-Woodbury formula) is:

(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}

where A \in \mathbb{R}^{n \times n}, C \in \mathbb{R}^{k \times k}, U \in \mathbb{R}^{n \times k}, V \in \mathbb{R}^{k \times n}. This is valuable when k \ll n (low-rank update): instead of inverting an n \times n matrix, invert a k \times k matrix.

Derivation via Schur complement. Consider the block system:

M = \begin{pmatrix}A & U \\ -V & C^{-1}\end{pmatrix}.

The Schur complement of C^{-1} in M is A - U(C^{-1})^{-1}(-V) = A + UCV, while the Schur complement of A is C^{-1} + VA^{-1}U. Applying the block inverse formula to M and comparing the (1,1) blocks yields the Woodbury identity.

Special case (rank-1 update): If U = \mathbf{u}, V = \mathbf{v}^\top, C = c (scalar):

(A + c\mathbf{u}\mathbf{v}^\top)^{-1} = A^{-1} - \frac{c A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1 + c\mathbf{v}^\top A^{-1}\mathbf{u}}.

This is the Sherman-Morrison formula for rank-1 updates.
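
A numerical check of the Woodbury identity on a random low-rank update (all matrices here are arbitrary illustrations); note that the right-hand side only ever inverts k \times k matrices besides the cheap base matrix:

import numpy as np

# Check the Woodbury identity for a low-rank (k << n) update.
rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 3.0, size=n))         # easy-to-invert base matrix
U = rng.normal(size=(n, k))
C = np.eye(k)
V = rng.normal(size=(k, n))

lhs = np.linalg.inv(A + U @ C @ V)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(lhs, rhs))                        # True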

For AI: The Woodbury identity is used in:

  • Gaussian process prediction: posterior covariance (K+σ2I)1(K + \sigma^2 I)^{-1} where K=XXK = X^\top X and σ2I\sigma^2 I is noise; Woodbury allows using n×nn \times n or p×pp \times p (whichever is smaller)
  • LoRA / low-rank adaptation: the effective weight W0+BAW_0 + BA is a rank-rr update; Woodbury allows efficient inversion without materializing the full matrix
  • Kalman filter update step: (P1+HR1H)1(P^{-1} + H^\top R^{-1}H)^{-1} uses Woodbury to avoid inverting large state covariances

6.4 Gaussian Conditioning via Schur Complement

The Schur complement is the algebraic engine behind the conditional distribution of multivariate Gaussians.

Setup. Let \begin{pmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right).

Conditional distribution. The conditional \mathbf{x}_1 | \mathbf{x}_2 = \mathbf{a} is Gaussian:

\mathbf{x}_1 | \mathbf{x}_2 = \mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_{1|2},\, \Sigma_{1|2})

where:

\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{a} - \boldsymbol{\mu}_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

The conditional covariance \Sigma_{1|2} is exactly the Schur complement of \Sigma_{22} in \Sigma. Theorem 6.2 guarantees \Sigma_{1|2} \succ 0 whenever \Sigma \succ 0.

Derivation. Complete the square in the joint density. The exponent:

(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})

factored using the block inverse of \Sigma^{-1} (the "precision matrix") gives the Schur complement form.

For AI: In Gaussian process regression, the predictive distribution at new points \mathbf{x}_* given training observations \mathbf{y} uses exactly this formula:

\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y}, \quad \Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}.

The second term is the Schur complement. Computing it via Cholesky: L = \text{chol}(K_{nn} + \sigma^2 I), then \Sigma_* = K_{**} - (L^{-1}K_{n*})^\top(L^{-1}K_{n*}).
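
A minimal sketch of this computation end to end (zero-mean GP, RBF kernel; the data, lengthscale, and noise level are arbitrary illustrations, not a reference implementation):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
Xs = np.linspace(-3, 3, 5)[:, None]
sigma2 = 0.01

Knn, Ksn, Kss = rbf(X, X), rbf(Xs, X), rbf(Xs, Xs)
L = cholesky(Knn + sigma2 * np.eye(len(X)), lower=True)

alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))   # (Knn + s^2 I)^{-1} y
V = solve_triangular(L, Ksn.T, lower=True)                          # L^{-1} K_{n*}
mu_star = Ksn @ alpha                                                # predictive mean
Sigma_star = Kss - V.T @ V                                           # Schur complement
print(np.round(mu_star, 2))
print(np.all(np.linalg.eigvalsh(Sigma_star) > -1e-10))               # True: valid covariance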


7. Log-Determinant

7.1 Definition and Properties

For a positive definite matrix A \succ 0, the log-determinant is:

\log\det A = \log \prod_{i=1}^n \lambda_i = \sum_{i=1}^n \log \lambda_i.

Since A \succ 0 implies \lambda_i > 0 for all i, this is well-defined. The log-det is defined only on the interior of the PSD cone (i.e., on PD matrices).

Key computation via Cholesky:

\log\det A = \log\det(LL^\top) = \log(\det L)^2 = 2\log\det L = 2\sum_{i=1}^n \log L_{ii}.

Since L_{ii} > 0 (Cholesky diagonal is positive), \log L_{ii} is well-defined. This is the standard computational formula: factor A = LL^\top, then sum the logs of the diagonal entries.

import numpy as np

L = np.linalg.cholesky(A)                       # A = L L^T with L_ii > 0
log_det_A = 2 * np.sum(np.log(np.diag(L)))      # log det A = 2 * sum_i log L_ii

This is numerically more stable than np.log(np.linalg.det(A)) for large matrices, because det can underflow or overflow.

Properties:

  • \log\det(AB) = \log\det A + \log\det B (for any invertible A, B)
  • \log\det(A^{-1}) = -\log\det A
  • \log\det(\alpha A) = n\log\alpha + \log\det A for scalar \alpha > 0
  • \log\det(A) = \text{tr}(\log A) where \log A is the matrix logarithm (eigendecomposition-based)
  • As A \to \partial \mathbb{S}_+^n (boundary, a singular matrix): \log\det A \to -\infty

7.2 Log-Det as a Concave Function

Theorem 7.1. The function f: \mathbb{S}_{++}^n \to \mathbb{R} defined by f(A) = \log\det A is strictly concave on the cone of PD matrices.

Proof. We need to show f(λA+(1λ)B)λf(A)+(1λ)f(B)f(\lambda A + (1-\lambda)B) \geq \lambda f(A) + (1-\lambda) f(B) for A,B0A, B \succ 0 and λ(0,1)\lambda \in (0,1), with equality iff A=BA = B.

Fix A0A \succ 0 and let C=A1/2BA1/20C = A^{-1/2}BA^{-1/2} \succ 0. The eigenvalues of CC are μ1μn>0\mu_1 \geq \cdots \geq \mu_n > 0.

f(\lambda A + (1-\lambda)B) = \log\det(\lambda A + (1-\lambda)B).

Factoring: λA+(1λ)B=A1/2(λI+(1λ)C)A1/2\lambda A + (1-\lambda)B = A^{1/2}(\lambda I + (1-\lambda)C)A^{1/2}, so:

f(\lambda A + (1-\lambda)B) = \log\det A + \sum_{i=1}^n \log(\lambda + (1-\lambda)\mu_i).

Since g(t)=logtg(t) = \log t is strictly concave: log(λ+(1λ)μi)λlog1+(1λ)logμi=(1λ)logμi\log(\lambda + (1-\lambda)\mu_i) \geq \lambda\log 1 + (1-\lambda)\log\mu_i = (1-\lambda)\log\mu_i.

Summing: f(λA+(1λ)B)logdetA+(1λ)ilogμi=λf(A)+(1λ)f(B)f(\lambda A + (1-\lambda)B) \geq \log\det A + (1-\lambda)\sum_i \log\mu_i = \lambda f(A) + (1-\lambda)f(B).

Equality holds iff all μi=1\mu_i = 1, i.e., C=IC = I, i.e., B=AB = A. \square

Consequences of concavity:

  • The log-det has no local maxima except at the global maximum (over any convex domain)
  • Gradient methods for maximizing logdetA\log\det A over a convex set converge to the global maximum
  • The function is differentiable at every PD matrix (the gradient is A1A^{-1}, see 7.3)

7.3 Gradient and Matrix Calculus

Theorem 7.2 (Log-Det Gradient). Let f(A) = \log\det A for A \succ 0. Then:

\frac{\partial \log\det A}{\partial A} = A^{-\top} = A^{-1} \quad (\text{since } A \text{ is symmetric}).

More precisely, in the matrix calculus convention where \partial f/\partial A_{ij} means the (i,j) entry of the gradient matrix:

\left(\frac{\partial \log\det A}{\partial A}\right)_{ij} = (A^{-1})_{ji}.

Proof using Jacobi's formula. Jacobi's formula states d(\det A) = \text{tr}(\text{adj}(A)\, dA) = \det(A)\,\text{tr}(A^{-1}\,dA). Therefore:

d(\log\det A) = \frac{d(\det A)}{\det A} = \text{tr}(A^{-1}\,dA).

Since d(\log\det A) = \langle \nabla_A \log\det A,\, dA\rangle_F = \text{tr}((\nabla_A \log\det A)^\top dA), we read off \nabla_A \log\det A = A^{-\top} = A^{-1} for symmetric A.

Second derivative (Hessian operator): For the function A \mapsto \log\det A:

d^2(\log\det A)[H, H] = -\text{tr}(A^{-1}H A^{-1}H) = -\text{tr}((A^{-1/2}HA^{-1/2})^2) \leq 0.

The negative sign confirms concavity.

Other useful log-det derivatives:

  • \frac{\partial}{\partial \mathbf{a}} \log\det(A + \mathbf{a}\mathbf{a}^\top) = 2(A + \mathbf{a}\mathbf{a}^\top)^{-1}\mathbf{a} (rank-1 update)
  • \frac{\partial}{\partial t} \log\det(A + tB) = \text{tr}((A+tB)^{-1}B)
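
Theorem 7.2 can be sanity-checked with symmetric finite differences: perturbing A by the symmetrized unit matrices and differencing the Cholesky-based log-det reproduces A^{-1}. A minimal sketch (the PD matrix and step size are illustrative):

import numpy as np

# Finite-difference check of d/dA log det A = A^{-1} for a PD matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
A = X.T @ X + 0.5 * np.eye(4)                       # PD by construction

def logdet(M):
    L = np.linalg.cholesky(M)
    return 2.0 * np.sum(np.log(np.diag(L)))

eps = 1e-6
grad = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A); E[i, j] = eps
        P = (E + E.T) / 2                            # symmetric perturbation direction
        grad[i, j] = (logdet(A + P) - logdet(A - P)) / (2 * eps)

print(np.allclose(grad, np.linalg.inv(A), atol=1e-4))   # True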

7.4 Log-Det in Machine Learning

Multivariate Gaussian log-likelihood. For \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma):

\log p(\mathbf{x}) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}).

The term -\frac{1}{2}\log\det\Sigma penalizes large (spread-out) distributions. When fitting \Sigma to data, maximizing the log-likelihood requires differentiating through \log\det\Sigma, using \partial\log\det\Sigma/\partial\Sigma = \Sigma^{-1}.

Gaussian process marginal likelihood. For GP regression with kernel matrix K and noise \sigma^2:

\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top(K+\sigma^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\det(K+\sigma^2I) - \frac{n}{2}\log(2\pi).

Both terms require Cholesky: L = \text{chol}(K + \sigma^2 I), then \log\det = 2\sum\log L_{ii} and the quadratic form via triangular solves. The gradient with respect to kernel hyperparameters uses \partial\log p/\partial\theta = \frac{1}{2}\text{tr}\big((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2I)^{-1})\,\partial K/\partial\theta\big) where \boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}.

Normalizing flows. A normalizing flow defines a bijective mapping f:zxf: \mathbf{z} \mapsto \mathbf{x} where zN(0,I)\mathbf{z} \sim \mathcal{N}(0,I). The log-likelihood of data x\mathbf{x} is:

logp(x)=logpz(f1(x))+logdetJf1\log p(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) + \log|\det J_{f^{-1}}|

where Jf1J_{f^{-1}} is the Jacobian. Efficiently computing logdetJ\log|\det J| is the central computational challenge of normalizing flows. Architectures like RealNVP (triangular Jacobian, logdetJ=logJii\log|\det J| = \sum\log|J_{ii}|) and FFJORD (stochastic trace estimator) are designed specifically to make this computation tractable.

Log-det estimators for large matrices. When KK is very large (e.g., a kernel matrix for millions of training points), exact Cholesky is intractable. Randomized log-det estimators use the identity logdetK=tr(logK)\log\det K = \text{tr}(\log K) and the stochastic trace estimator tr(logK)1mi=1mzi(logK)zi\text{tr}(\log K) \approx \frac{1}{m}\sum_{i=1}^m \mathbf{z}_i^\top (\log K)\mathbf{z}_i with random ziN(0,I)\mathbf{z}_i \sim \mathcal{N}(0,I), computed via Lanczos iterations. This is used in scalable GP libraries (GPyTorch, 2018).
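A toy numerical check of the identity logdetK=tr(logK)\log\det K = \text{tr}(\log K) and of the Hutchinson estimator is sketched below. Here logK\log K is applied exactly through an eigendecomposition, which is O(n3)O(n^3) and only for illustration; scalable implementations apply logK\log K to probe vectors with Lanczos or Chebyshev approximations instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50                                   # matrix size, number of probe vectors

M = rng.standard_normal((n, n))
K = M @ M.T + n * np.eye(n)                      # a PD "kernel" matrix (illustrative)

# Exact reference via Cholesky.
exact = 2.0 * np.log(np.diag(np.linalg.cholesky(K))).sum()

# Apply log(K) exactly through the eigendecomposition (toy only).
w, V = np.linalg.eigh(K)
logK = (V * np.log(w)) @ V.T

# Hutchinson: tr(log K) ~ (1/m) sum_i z_i^T log(K) z_i with z_i ~ N(0, I).
Z = rng.standard_normal((n, m))
est = np.einsum('ij,ij->', Z, logK @ Z) / m

print(exact, est)   # the estimate fluctuates around the exact value; variance shrinks as m grows
```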


8. Gram Matrices and Kernel Connections

8.1 The Gram Matrix Construction

Definition 8.1 (Gram Matrix). Let x1,,xnRd\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d be a collection of vectors. Their Gram matrix is:

GRn×n,Gij=xi,xj=xixj.G \in \mathbb{R}^{n \times n}, \quad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \mathbf{x}_i^\top \mathbf{x}_j.

In matrix form: if X=[x1xn]Rn×dX = [\mathbf{x}_1|\cdots|\mathbf{x}_n]^\top \in \mathbb{R}^{n \times d} (rows are data points), then G=XXG = XX^\top.

Theorem 8.2. Every Gram matrix is positive semidefinite. Moreover, G0G \succ 0 iff x1,,xn\mathbf{x}_1, \ldots, \mathbf{x}_n are linearly independent.

Proof. For any cRn\mathbf{c} \in \mathbb{R}^n:

cGc=i,jciGijcj=i,jcixixjcj=icixi20.\mathbf{c}^\top G \mathbf{c} = \sum_{i,j} c_i G_{ij} c_j = \sum_{i,j} c_i \mathbf{x}_i^\top \mathbf{x}_j c_j = \left\|\sum_i c_i \mathbf{x}_i\right\|^2 \geq 0.

So G0G \succeq 0. Equality holds iff icixi=0\sum_i c_i\mathbf{x}_i = \mathbf{0}, i.e., iff the vectors are linearly dependent. So G0G \succ 0 iff they are linearly independent. \square

Corollary. rank(G)=rank(X)\text{rank}(G) = \text{rank}(X), the number of linearly independent data vectors.

Examples:

  • X=InX = I_n (standard basis): G=In0G = I_n \succ 0.
  • n>dn > d: rank(G)d<n\text{rank}(G) \leq d < n, so G0G \succeq 0 but G⊁0G \not\succ 0.
  • n=dn = d, XX invertible: G0G \succ 0.

8.2 Every PSD Matrix is a Gram Matrix

Theorem 8.3. G0G \succeq 0 if and only if GG is the Gram matrix of some set of vectors in some inner product space.

Proof. (\Leftarrow): proved above. (\Rightarrow): If G0G \succeq 0, let LL be the Cholesky-like factor: G=LLG = LL^\top (where LL may be rectangular, n×rn \times r, r=rank(G)r = \text{rank}(G)). Take xi=L[i,:]\mathbf{x}_i^\top = L[i,:] (the ii-th row of LL). Then Gij=L[i,:]L[j,:]=xixjG_{ij} = L[i,:] \cdot L[j,:]^\top = \mathbf{x}_i^\top \mathbf{x}_j. \square

This is a deep result: the class of PSD matrices and the class of Gram matrices are exactly the same. Any PSD matrix can be "explained" as a matrix of inner products between some set of vectors.

Feature maps. If ϕ:XRd\phi: \mathcal{X} \to \mathbb{R}^d is a feature map, then the Gram matrix Gij=ϕ(xi)ϕ(xj)G_{ij} = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) is PSD. Kernel methods replace ϕ(xi)ϕ(xj)\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) with k(xi,xj)k(\mathbf{x}_i, \mathbf{x}_j) directly, avoiding explicit feature computation.

8.3 Kernel Matrices and Mercer's Theorem

A positive definite kernel is a function k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} such that for every finite set {x1,,xn}X\{x_1,\ldots,x_n\} \subset \mathcal{X}, the Gram matrix Kij=k(xi,xj)K_{ij} = k(x_i, x_j) is PSD.

Mercer's Theorem (informal statement). A continuous, symmetric function k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a positive definite kernel (i.e., produces PSD Gram matrices for any finite set) if and only if there exists a Hilbert space H\mathcal{H} and a feature map ϕ:XH\phi: \mathcal{X} \to \mathcal{H} such that:

k(x,z)=ϕ(x),ϕ(z)H.k(x, z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}.

This is the mathematical foundation of the kernel trick: instead of explicitly computing ϕ(x)\phi(x) (which may be infinite-dimensional), we evaluate k(x,z)k(x,z) directly.

Forward reference: The full treatment of Mercer's theorem, reproducing kernel Hilbert spaces (RKHS), and the kernel trick appears in Chapter 12: Functional Analysis.

Standard PD kernels:

  • Linear kernel: k(x,z)=xzk(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z} (standard Gram matrix)
  • RBF/Gaussian kernel: k(x,z)=exp(xz2/22)k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2 / 2\ell^2)
  • Polynomial kernel: k(x,z)=(xz+c)dk(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d for c0c \geq 0, dNd \in \mathbb{N}
  • Matern kernel: used in GP regression with controllable smoothness

Verifying kernel validity. For a proposed kernel kk, the standard test is:

  1. Compute the n×nn \times n Gram matrix KK for a test set
  2. Check K0K \succeq 0 (e.g., via np.linalg.eigvalsh(K) >= -1e-10)
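A sketch of this test in NumPy (the test points and the tolerance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))                 # 30 test points in R^3

def rbf(X, ell=1.0):
    # RBF kernel matrix exp(-||x - z||^2 / (2 ell^2)).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def is_psd(K, tol=1e-10):
    # Symmetrize first to remove round-off asymmetry, then check the spectrum.
    K = 0.5 * (K + K.T)
    return np.linalg.eigvalsh(K).min() >= -tol

K_lin = X @ X.T                                  # linear kernel: a plain Gram matrix
K_rbf = rbf(X)
# cos(||x - z||) is not a valid kernel in general; it typically fails the test.
K_cos = np.cos(np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1)))

print(is_psd(K_lin), is_psd(K_rbf), is_psd(K_cos))
```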

8.4 Attention Scores as Gram Matrices

The scaled dot-product attention mechanism in transformers computes:

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

where Q,KRn×dkQ, K \in \mathbb{R}^{n \times d_k} are query and key matrices.

The score matrix S=QK/dkRn×nS = QK^\top / \sqrt{d_k} \in \mathbb{R}^{n \times n} is a scaled cross-Gram matrix: its (i,j)(i,j) entry is the inner product between query ii and key jj. Because the two vector sets differ, SS is in general not symmetric, let alone PSD. In the special case of self-attention with tied weights Q=KQ = K, SS is proportional to QQ/dk0QQ^\top / \sqrt{d_k} \succeq 0.

Why the 1/dk1/\sqrt{d_k} scaling. If QQ and KK have independent entries from N(0,1)\mathcal{N}(0,1), then QijKijQ_{ij}K_{ij} has variance 1 and (QK)ij=k=1dkQikKjk(QK^\top)_{ij} = \sum_{k=1}^{d_k} Q_{ik}K_{jk} has variance dkd_k. The 1/dk1/\sqrt{d_k} rescaling brings the variance back to 1, preventing the softmax from saturating into near-one-hot distributions.
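The variance argument is easy to see empirically; a small NumPy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 256, 64

Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

scores_raw = Q @ K.T                         # entries have variance ~ d_k
scores_scaled = scores_raw / np.sqrt(d_k)    # rescaled back to variance ~ 1
print(scores_raw.var(), scores_scaled.var()) # ~64 vs ~1

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Saturation effect: softmax over large-variance scores is much closer to one-hot.
print(softmax(scores_raw[0]).max(), softmax(scores_scaled[0]).max())
```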

Attention as kernel regression. The attention output for query q\mathbf{q} is:

output=j=1nexp(qkj/dk)lexp(qkl/dk)vj.\text{output} = \sum_{j=1}^n \frac{\exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k})}{\sum_l \exp(\mathbf{q}^\top\mathbf{k}_l/\sqrt{d_k})} \mathbf{v}_j.

This is a Nadaraya-Watson kernel regression with the exponential kernel k(q,kj)=exp(qkj/dk)k(\mathbf{q},\mathbf{k}_j) = \exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k}). The attention weights are the normalized kernel similarities, and the output is a kernel-weighted average of values. The exponential dot-product (softmax) kernel can itself be approximated with random feature maps; this is the perspective behind Performer / FAVOR+ linear attention.


9. The PSD Cone and Semidefinite Programming

9.1 The Cone of PSD Matrices

The set of all n×nn \times n symmetric positive semidefinite matrices is denoted S+n\mathbb{S}_+^n (or S0n\mathbb{S}_{\geq 0}^n). It lives inside the vector space Sn\mathbb{S}^n of n×nn \times n real symmetric matrices (which has dimension n(n+1)/2n(n+1)/2).

Theorem 9.1. S+n\mathbb{S}_+^n is a proper convex cone:

  1. Convex: If A,B0A, B \succeq 0 and λ[0,1]\lambda \in [0,1], then λA+(1λ)B0\lambda A + (1-\lambda)B \succeq 0.
  2. Cone: If A0A \succeq 0 and t0t \geq 0, then tA0tA \succeq 0.
  3. Pointed: A0A \succeq 0 and A0A \preceq 0 implies A=0A = 0 (no lines through the origin in the cone).
  4. Closed: S+n\mathbb{S}_+^n is a closed subset of Sn\mathbb{S}^n (limits of PSD sequences are PSD).
  5. Full-dimensional: The interior of S+n\mathbb{S}_+^n is S++n\mathbb{S}_{++}^n (the PD matrices), which is non-empty.

Proof of convexity: For any x\mathbf{x}: x(λA+(1λ)B)x=λxAx+(1λ)xBx0\mathbf{x}^\top(\lambda A + (1-\lambda)B)\mathbf{x} = \lambda\mathbf{x}^\top A\mathbf{x} + (1-\lambda)\mathbf{x}^\top B\mathbf{x} \geq 0. \square

The boundary. The boundary S+n=S+nS++n\partial\mathbb{S}_+^n = \mathbb{S}_+^n \setminus \mathbb{S}_{++}^n consists of PSD matrices with at least one zero eigenvalue. These are rank-deficient PSD matrices. The boundary has lower dimension: the set of rank-rr PSD matrices is a manifold of dimension rnr(r1)/2rn - r(r-1)/2.

Low-dimensional picture. For n=2n=2: S2\mathbb{S}^2 is 3-dimensional (coordinates A11,A12,A22A_{11}, A_{12}, A_{22}). The PSD cone S+2\mathbb{S}_+^2 is the set where A110A_{11} \geq 0, A220A_{22} \geq 0, and A11A22A122A_{11}A_{22} \geq A_{12}^2: a solid ice-cream cone in 3D whose interior is exactly the PD matrices.

Dual cone. The dual of S+n\mathbb{S}_+^n with respect to the Frobenius inner product A,BF=tr(AB)\langle A, B\rangle_F = \text{tr}(AB) is:

(S+n)={BSn:tr(AB)0 for all A0}=S+n.(\mathbb{S}_+^n)^* = \{B \in \mathbb{S}^n : \text{tr}(AB) \geq 0 \text{ for all } A \succeq 0\} = \mathbb{S}_+^n.

The PSD cone is self-dual: (S+n)=S+n(\mathbb{S}_+^n)^* = \mathbb{S}_+^n. This is analogous to the non-negative reals being self-dual.

9.2 Semidefinite Programming

Semidefinite programming (SDP) is the optimization of a linear objective over the intersection of the PSD cone with an affine set:

Standard form SDP (primal):

minXSnC,XF=tr(CX)\min_{X \in \mathbb{S}^n} \quad \langle C, X\rangle_F = \text{tr}(CX) subject toAi,XF=bi,i=1,,m\text{subject to} \quad \langle A_i, X\rangle_F = b_i, \quad i = 1,\ldots,m X0.\quad\quad\quad\quad X \succeq 0.

Here C,A1,,AmSnC, A_1, \ldots, A_m \in \mathbb{S}^n and bRmb \in \mathbb{R}^m are the problem data; XSnX \in \mathbb{S}^n is the decision variable.

Dual SDP:

maxyRmby\max_{\mathbf{y} \in \mathbb{R}^m} \quad \mathbf{b}^\top\mathbf{y} subject toCi=1myiAi0.\text{subject to} \quad C - \sum_{i=1}^m y_i A_i \succeq 0.

Duality. Weak duality always holds: primal valuedual value\text{primal value} \geq \text{dual value}. Strong duality (primal = dual) holds under Slater's condition: if the primal is strictly feasible (some X0X \succ 0 satisfies all constraints), then strong duality holds and the dual optimum is attained.

Relation to other optimization problems. SDP generalizes:

  • Linear programming (LP): LP is SDP with diagonal matrices C,AiC, A_i
  • SOCP (second-order cone programming): SOCP is a special SDP
  • Quadratically constrained QP: Many QCQPs can be lifted to SDPs via semidefinite relaxation

Algorithms. Interior-point (barrier) methods solve SDPs to any fixed accuracy in polynomial time; each iteration requires forming and solving a dense linear system in the mm dual variables, which is what limits problem size in practice. First-order solvers such as SCS trade accuracy for scalability. Standard tools: SCS, MOSEK, with CVXPY as a modelling layer.

9.3 SDP in Machine Learning

Max-cut relaxation (Goemans-Williamson, 1995). The max-cut problem on a graph G=(V,E)G=(V,E) with edge weights wijw_{ij} is NP-hard. The Goemans-Williamson SDP relaxation gives a 0.8780.878-approximation:

maxX012ijwij(1Xij)\max_{X \succeq 0} \quad \frac{1}{2}\sum_{ij} w_{ij}(1 - X_{ij}) s.t.Xii=1,i=1,,n.\text{s.t.} \quad X_{ii} = 1, \quad i=1,\ldots,n.

The solution XX^* satisfies X=VVX^* = VV^\top for some unit vectors v1,,vn\mathbf{v}_1,\ldots,\mathbf{v}_n; random hyperplane rounding recovers a near-optimal cut.
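A minimal sketch of the relaxation plus hyperplane rounding using CVXPY (assumed installed with its default conic solver). The random graph is illustrative, and the objective uses a 1/41/4 factor because the sum below runs over ordered pairs; this matches the 12ij\frac{1}{2}\sum_{ij} above read as a sum over unordered pairs (an assumption about the convention).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 12
W = rng.integers(0, 2, size=(n, n)) * 1.0        # random unweighted graph
W = np.triu(W, 1); W = W + W.T                   # symmetric, zero diagonal

# SDP relaxation: maximize the relaxed cut value over X >= 0 with diag(X) = 1.
X = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X)))
prob = cp.Problem(objective, [cp.diag(X) == 1])
prob.solve()

# Random hyperplane rounding: factor X = V V^T, cut by the sign of V r.
w, U = np.linalg.eigh(X.value)
V = U * np.sqrt(np.clip(w, 0, None))             # rows play the role of the unit vectors v_i
r = rng.standard_normal(n)
signs = np.sign(V @ r)
cut_value = 0.25 * np.sum(W * (1 - np.outer(signs, signs)))
print(prob.value, cut_value)                     # relaxation value >= rounded cut value
```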

Metric learning. Learning a Mahalanobis distance dA(x,z)=((xz)A(xz))1/2d_A(\mathbf{x},\mathbf{z}) = \left((\mathbf{x}-\mathbf{z})^\top A (\mathbf{x}-\mathbf{z})\right)^{1/2} requires A0A \succeq 0. Methods like ITML (Information-Theoretic Metric Learning) and SDML formulate this as an SDP over PSD matrices with constraints that similar pairs are close and dissimilar pairs are far.

Covariance estimation. In high dimensions (p>np > n), the sample covariance S^=1nXX\hat{S} = \frac{1}{n}X^\top X may be singular. Regularized estimation (the graphical lasso) works with the precision matrix Θ=Σ1\Theta = \Sigma^{-1} and adds a sparsity penalty:

\min_{\Theta \succ 0} \left[ \text{tr}(\hat{S}\Theta) - \log\det\Theta + \lambda\|\Theta\|_1 \right]

where S^\hat{S} is the sample covariance and 1\|\cdot\|_1 is the element-wise 1\ell_1 norm. This is not an SDP in standard form, but the PD constraint Θ0\Theta \succ 0 is the core structural requirement, and the logdetΘ-\log\det\Theta term keeps iterates in the interior of the PSD cone.

Certified robustness and fairness. Semidefinite relaxations of neural network verification (certified defenses against adversarial examples) are large-scale SDPs; practical verifiers such as α\alpha,β\beta-CROWN avoid solving them directly and instead use bound propagation with Lagrangian-style relaxations. PSD constraints also appear in convex relaxations used in some algorithmic-fairness formulations (e.g., of disparate-impact constraints).


10. Applications in Machine Learning

10.1 Multivariate Gaussians and Covariance Matrices

The multivariate Gaussian. A random vector xRn\mathbf{x} \in \mathbb{R}^n has the distribution N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) if:

p(x)=1(2π)n/2detΣ1/2exp ⁣(12(xμ)Σ1(xμ)).p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\det\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).

For this density to be a valid (normalized, integrable) probability distribution, Σ\Sigma must be symmetric positive definite. The three requirements:

  1. Symmetry: Σ=Σ\Sigma = \Sigma^\top (covariance is symmetric by definition)
  2. Positive definiteness: Σ0\Sigma \succ 0 ensures detΣ0\det\Sigma \neq 0 (normalizing constant finite) and Σ1\Sigma^{-1} exists (the exponent is a proper quadratic form)
  3. If Σ0\Sigma \succeq 0 but singular: The distribution becomes degenerate - supported on an affine subspace, not all of Rn\mathbb{R}^n

Parameterizing covariances in neural networks. A neural network that outputs a covariance matrix must parameterize it to be PSD. Standard approaches:

  • Diagonal: Σ=diag(exp(s))\Sigma = \text{diag}(\exp(\boldsymbol{s})) where s\boldsymbol{s} is a learned vector. Automatically PD.
  • Cholesky lower triangular: Σ=LL\Sigma = LL^\top where LL has positive diagonal (enforced via softplus on diagonal entries). This is the most expressive parameterization.
  • Low-rank + diagonal: Σ=FF+D\Sigma = FF^\top + D where FRn×kF \in \mathbb{R}^{n \times k} (knk \ll n) and DD diagonal positive. Woodbury allows efficient inversion.

Sampling from N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma): Given ϵN(0,I)\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I):

x=μ+Lϵ\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}

where L=chol(Σ)L = \text{chol}(\Sigma). This is the fundamental sampling algorithm: the Cholesky factor maps isotropic noise to correlated noise.
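In NumPy (a sketch; the 2-D mean and covariance below are the same illustrative values used later in Exercise 8):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])

L = np.linalg.cholesky(Sigma)                    # Sigma = L L^T
eps = rng.standard_normal((100_000, 2))          # isotropic noise, one row per sample
samples = mu + eps @ L.T                         # x = mu + L eps, vectorized over rows

print(samples.mean(axis=0))                      # ~ mu
print(np.cov(samples, rowvar=False))             # ~ Sigma
```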

10.2 Fisher Information Matrix and Natural Gradient

Definition 10.1 (Fisher Information Matrix). For a statistical model p(xθ)p(\mathbf{x}|\boldsymbol{\theta}) with parameter θRd\boldsymbol{\theta} \in \mathbb{R}^d, the Fisher information matrix is:

F(θ)=Exp(θ) ⁣[θlogp(xθ)θlogp(xθ)].F(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x} \sim p(\cdot|\boldsymbol{\theta})}\!\left[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})^\top\right].

PSD proof. FF is the second-moment matrix E[ss]\mathbb{E}[\mathbf{s}\mathbf{s}^\top] of the score s=θlogp(xθ)\mathbf{s} = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta}) (and, since E[s]=0\mathbb{E}[\mathbf{s}] = \mathbf{0} under the usual regularity conditions, also its covariance). Any matrix of the form E[ss]\mathbb{E}[\mathbf{s}\mathbf{s}^\top] is PSD: uE[ss]u=E[(su)2]0\mathbf{u}^\top\mathbb{E}[\mathbf{s}\mathbf{s}^\top]\mathbf{u} = \mathbb{E}[(\mathbf{s}^\top\mathbf{u})^2] \geq 0 for every u\mathbf{u}. In fact, F0F \succ 0 for regular statistical models (identifiable, full rank).

Natural gradient. Ordinary gradient descent θθηθL\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}}\mathcal{L} uses the Euclidean metric on parameter space. The natural gradient uses the Fisher metric:

~L=F(θ)1θL.\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L}.

The natural gradient is invariant to reparameterization of the model - it measures the steepest descent direction with respect to the KL divergence geometry (information geometry).

K-FAC (Kronecker-Factored Approximate Curvature). Martens & Grosse (2015) approximate the Fisher information matrix for neural networks as a block-diagonal matrix, where each block factorizes as a Kronecker product:

FF^=block-diag(A1G1,A2G2,)F \approx \hat{F} = \text{block-diag}(A_1 \otimes G_1, A_2 \otimes G_2, \ldots)

where Al=E[al1al1]A_l = \mathbb{E}[\mathbf{a}_{l-1}\mathbf{a}_{l-1}^\top] (input activation covariance) and Gl=E[δlδl]G_l = \mathbb{E}[\delta_l\delta_l^\top] (pre-activation gradient covariance) for layer ll. Both AlA_l and GlG_l are PSD (covariance matrices). The Kronecker product AlGlA_l \otimes G_l is PSD, and its inverse is (AlGl)1=Al1Gl1(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}, computable cheaply via Cholesky of each factor separately.
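The cheap inverse is easy to verify numerically. The sketch below uses the column-major identity (AG)vec(V)=vec(GVA)(A \otimes G)\,\text{vec}(V) = \text{vec}(G V A^\top); the shapes and random SPD stand-ins are illustrative only, not K-FAC's actual statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    # Random symmetric positive definite stand-in for a covariance factor.
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)

A = random_spd(4)   # stands in for the input-activation covariance (d_in x d_in)
G = random_spd(3)   # stands in for the gradient covariance (d_out x d_out)

V = rng.standard_normal((3, 4))        # a gradient block, shape d_out x d_in
v = V.reshape(-1, order="F")           # column-major vec(V)

# Naive preconditioning: invert the full Kronecker product (cubic in d_in*d_out).
full = np.linalg.solve(np.kron(A, G), v)

# Kronecker identity: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) for symmetric A, G.
cheap = (np.linalg.solve(G, V) @ np.linalg.inv(A)).reshape(-1, order="F")

print(np.allclose(full, cheap))        # True
```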

10.3 Gaussian Process Regression

Gaussian process regression is the Bayesian non-parametric regression method that uses PD kernel matrices as the core computational object.

Model. Place a GP prior: fGP(0,k)f \sim \mathcal{GP}(0, k) where k:X×XRk: \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a PD kernel. Observe noisy outputs y=f(X)+ϵ\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}, ϵN(0,σ2I)\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I).

Prediction. The predictive distribution at new points XX_* is:

fX,X,yN(μ,Σ)f_* | X_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)

where:

μ=Kn(Knn+σ2I)1y\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2I)^{-1}\mathbf{y} Σ=KKn(Knn+σ2I)1Kn\Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2I)^{-1}K_{n*}

and Knn=k(X,X)K_{nn} = k(\mathbf{X}, \mathbf{X}), Kn=k(X,X)K_{*n} = k(X_*, \mathbf{X}), K=k(X,X)K_{**} = k(X_*, X_*).

Computational core: Cholesky. Factor Knn+σ2I=LLK_{nn} + \sigma^2 I = LL^\top. Then:

  • α=L\(L\y)\boldsymbol{\alpha} = L^\top \backslash (L \backslash \mathbf{y}) (two triangular solves)
  • μ=Knα\boldsymbol{\mu}_* = K_{*n}\boldsymbol{\alpha} (matrix-vector product)
  • V=L\KnV = L \backslash K_{n*} (triangular solve, n×nn \times n_* system)
  • Σ=KVV\Sigma_* = K_{**} - V^\top V (Schur complement form)
  • logp(y)=12yαilogLiin2log(2π)\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i\log L_{ii} - \frac{n}{2}\log(2\pi)

The entire GP regression computation is O(n3)O(n^3) via Cholesky - the classic bottleneck for large datasets, motivating sparse GP approximations (inducing points, Nystrom, SVGP).
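A compact NumPy/SciPy sketch of this recipe (the RBF kernel, the noise level, and the toy 1-D data are illustrative choices, not part of the text):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(X1, X2, ell=1.0):
    # Squared-exponential kernel matrix exp(-||x - z||^2 / (2 ell^2)).
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, Xs, noise=0.1):
    # Exact GP regression via one Cholesky factorization of K + sigma^2 I.
    K = rbf(X, X) + noise * np.eye(len(X))
    L = cholesky(K, lower=True)                      # K + sigma^2 I = L L^T
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    Ks = rbf(Xs, X)                                  # K_{*n}
    mu = Ks @ alpha                                  # predictive mean
    V = solve_triangular(L, Ks.T, lower=True)        # L \ K_{n*}
    cov = rbf(Xs, Xs) - V.T @ V                      # Schur-complement form
    logml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(X) * np.log(2 * np.pi)
    return mu, cov, logml

# Tiny usage example on synthetic 1-D data.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(20)
mu, cov, logml = gp_predict(X, y, np.linspace(0, 5, 50)[:, None])
```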

10.4 Hessian and Loss Landscape Sharpness

Second-order characterization of minima. At a critical point L(θ)=0\nabla\mathcal{L}(\boldsymbol{\theta}^*) = 0 of a smooth loss L\mathcal{L}:

  • 2L(θ)0\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succ 0: strict local minimum (the loss bowl is strictly convex)
  • 2L(θ)0\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succeq 0: local minimum or saddle (Hessian is PSD)
  • 2L(θ)\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) indefinite: saddle point

Sharpness. The sharpness of a minimum is λmax(2L(θ))\lambda_{\max}(\nabla^2\mathcal{L}(\boldsymbol{\theta}^*)), the largest eigenvalue of the Hessian. Flat minima (small sharpness) are widely observed to generalize better than sharp ones: to second order, a perturbation θ+δ\boldsymbol{\theta}^* + \boldsymbol{\delta} with δρ\|\boldsymbol{\delta}\| \leq \rho changes the loss by at most 12ρ2λmax(2L)\frac{1}{2}\rho^2 \lambda_{\max}(\nabla^2\mathcal{L}) (Taylor expansion plus the PSD bound δHδλmaxδ2\boldsymbol{\delta}^\top H\boldsymbol{\delta} \leq \lambda_{\max}\|\boldsymbol{\delta}\|^2).

SAM (Sharpness-Aware Minimization). Foret et al. (2021) propose:

minθmaxδρL(θ+δ).\min_{\boldsymbol{\theta}} \max_{\|\boldsymbol{\delta}\|\leq\rho} \mathcal{L}(\boldsymbol{\theta}+\boldsymbol{\delta}).

The inner max finds the worst-case perturbation (solved approximately as δ=ρL/L\boldsymbol{\delta}^* = \rho\, \nabla\mathcal{L}/\|\nabla\mathcal{L}\|). SAM is a first-order approximation to minimizing sharpness and is reported to improve generalization on image and language tasks.

Gauss-Newton and PSD Hessian approximations. The true Hessian 2L\nabla^2\mathcal{L} may be indefinite during training (early stages, overparameterized models). The Gauss-Newton matrix G=JJG = J^\top J (where JJ is the Jacobian of the predictions) is always PSD and is often a better preconditioner. K-FAC uses the Gauss-Newton approximation.

10.5 Reparameterization Trick in VAEs

The variational autoencoder (VAE) requires sampling from a distribution qϕ(zx)=N(μϕ(x),Σϕ(x))q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \Sigma_\phi(\mathbf{x})) where μϕ\boldsymbol{\mu}_\phi and Σϕ\Sigma_\phi are outputs of an encoder neural network, and differentiating through the sampling process.

The reparameterization trick. Instead of sampling zN(μ,Σ)\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) directly (which blocks gradient flow), write:

z=μ+Lϵ,ϵN(0,I)\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)

where L=chol(Σ)L = \text{chol}(\Sigma). Now z\mathbf{z} is a deterministic function of (μ,L,ϵ)(\boldsymbol{\mu}, L, \boldsymbol{\epsilon}), and gradients can flow through μ\boldsymbol{\mu} and LL.

Diagonal VAE (standard). Most VAE implementations use diagonal covariance: Σ=diag(exp(s))\Sigma = \text{diag}(\exp(\mathbf{s})), so L=diag(exp(s/2))L = \text{diag}(\exp(\mathbf{s}/2)) and z=μ+exp(s/2)ϵ\mathbf{z} = \boldsymbol{\mu} + \exp(\mathbf{s}/2) \odot \boldsymbol{\epsilon}.

Full covariance VAE. For a full covariance Cholesky parameterization, the encoder outputs:

  1. Mean μRd\boldsymbol{\mu} \in \mathbb{R}^d
  2. Lower triangular LRd×dL \in \mathbb{R}^{d \times d} with positive diagonal (e.g., Lii=softplus(L~ii)L_{ii} = \text{softplus}(\tilde{L}_{ii}), off-diagonal unconstrained)

Then Σ=LL\Sigma = LL^\top and z=μ+Lϵ\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}. The gradient through LL back to the encoder parameters flows via:

\frac{\partial \mathbf{z}}{\partial\,\text{vec}(L)} = \boldsymbol{\epsilon}^\top \otimes I_d

(entrywise, zi/Lkl=δikϵl\partial z_i/\partial L_{kl} = \delta_{ik}\,\epsilon_l: the gradient of the sample with respect to row ii of LL is just the noise vector ϵ\boldsymbol{\epsilon}, scaled by whatever gradient arrives at ziz_i from the decoder).
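A minimal NumPy sketch of the full-covariance parameterization (the helper names such as build_chol are illustrative; a real VAE would do the same transformations inside a framework with autograd):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def build_chol(raw, d):
    # Map d(d+1)/2 unconstrained encoder outputs to a valid Cholesky factor:
    # fill the lower triangle, then force the diagonal positive via softplus.
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = raw
    L[np.diag_indices(d)] = softplus(L[np.diag_indices(d)])
    return L

def reparam_sample(mu, L, rng):
    # z = mu + L eps with eps ~ N(0, I); deterministic in (mu, L) given eps.
    eps = rng.standard_normal(len(mu))
    return mu + L @ eps

rng = np.random.default_rng(0)
d = 3
raw = rng.standard_normal(d * (d + 1) // 2)   # pretend encoder output
mu = rng.standard_normal(d)
L = build_chol(raw, d)
z = reparam_sample(mu, L, rng)
Sigma = L @ L.T                               # implied covariance, guaranteed PD
```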

For normalizing flows. More expressive VAE variants use normalizing flows for the posterior qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x}). The flow is a sequence of invertible maps; the log-det Jacobian of each map must be computed efficiently. Triangular Jacobians (e.g., masked autoregressive flow / IAF) achieve O(d)O(d) log-det computation.


11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Checking only diagonal entries to test PD | Positive diagonal is necessary but not sufficient. A = \begin{pmatrix}1 & 2 \\ 2 & 1\end{pmatrix} has positive diagonal but is indefinite (\det = -3). | Use Sylvester's criterion (all leading minors > 0) or attempt Cholesky. |
| 2 | Concluding A \succ 0 from \det A > 0 alone | \det > 0 is necessary but not sufficient. \begin{pmatrix}-1 & 0 \\ 0 & -2\end{pmatrix} has \det = 2 > 0 but is negative definite. | Check all leading principal minors, not just the full determinant. |
| 3 | Assuming A^\top A \succ 0 for any A | A^\top A \succeq 0 always, but A^\top A \succ 0 iff A has full column rank. If A has a null vector (A\mathbf{v} = \mathbf{0}), then \mathbf{v}^\top A^\top A \mathbf{v} = 0. | Verify \text{rank}(A) = number of columns. |
| 4 | Confusing the Cholesky factor L with the PSD square root A^{1/2} | L is lower triangular; A^{1/2} is symmetric. Both satisfy "squared = A" but in different senses (LL^\top = A vs (A^{1/2})^2 = A). They are equal only when A is diagonal. | Use L for solving/sampling; use A^{1/2} for Mahalanobis/whitening. |
| 5 | Using np.log(np.linalg.det(A)) for large A | np.linalg.det can overflow/underflow for large n (products of many numbers). | Use Cholesky: 2 * np.sum(np.log(np.diag(np.linalg.cholesky(A)))). |
| 6 | Forgetting Sylvester's criterion does NOT characterize PSD | Sylvester requires all leading principal minors > 0, which gives PD, not PSD. For PSD you need all principal minors (not just leading ones) \geq 0, a much larger set. | For PSD testing, use eigenvalues or attempt pivoted Cholesky. |
| 7 | Inverting the Loewner order incorrectly | Students often think A \succeq B implies A^{-1} \succeq B^{-1}. The correct fact is A \succeq B \succ 0 \Rightarrow B^{-1} \succeq A^{-1} (inversion reverses the order). | Remember: larger matrix -> smaller inverse (as for positive scalars: a \geq b > 0 \Rightarrow 1/a \leq 1/b). |
| 8 | Adding jitter \epsilon I without checking scale | If \lVert A\rVert_2 \approx 10^6 and you add \epsilon = 10^{-6}, the relative jitter is 10^{-12}, effectively zero numerically. | Set jitter proportional to the scale: \epsilon = \delta \cdot \text{tr}(A)/n for some small \delta. |
| 9 | Assuming the kernel matrix stays PD after arbitrary operations | Not every pointwise operation preserves PD structure: e.g., the difference k_1(\mathbf{x},\mathbf{z}) - k_2(\mathbf{x},\mathbf{z}) of two valid kernels may not be PD. | Verify the Gram matrix or use closure properties: sums, products, and exponentials of PD kernels are PD. |
| 10 | Computing the Schur complement with singular A | The Schur complement D - CA^{-1}B requires A invertible. If A is only PSD (rank-deficient), A^{-1} does not exist. | Use the Moore-Penrose pseudoinverse: S = D - CA^\dagger B, though the PD characterization no longer holds as stated. |
| 11 | Treating "positive definite" as a non-symmetric matrix property | PD is only defined for symmetric matrices. A non-symmetric A with all positive eigenvalues can still have \mathbf{x}^\top A\mathbf{x} < 0 for some \mathbf{x}; only the symmetric part (A + A^\top)/2 matters for the quadratic form. | Always symmetrize: replace A with (A + A^\top)/2 before testing PD. |
| 12 | Thinking the PD constraint is automatically satisfied by a NN output | A neural network outputting n(n+1)/2 numbers does not automatically produce a valid lower triangular L with positive diagonal. | Enforce it: use softplus on the diagonal entries; leave off-diagonals unconstrained. Then \Sigma = LL^\top is guaranteed PD. |

12. Exercises

Exercise 1 * - Verify PD Axioms for AA+λIA^\top A + \lambda I

Let ARm×nA \in \mathbb{R}^{m \times n} with mnm \geq n and λ>0\lambda > 0. Define B=AA+λIB = A^\top A + \lambda I.

(a) Prove B0B \succ 0 using the definition xBx>0\mathbf{x}^\top B\mathbf{x} > 0 for x0\mathbf{x} \neq \mathbf{0}.

(b) What is the smallest eigenvalue of BB? Express it in terms of the singular values of AA and λ\lambda.

(c) What happens as λ0\lambda \to 0? Under what condition on AA does AAA^\top A itself become PD?

(d) Show B11λIB^{-1} \preceq \frac{1}{\lambda}I using the Loewner order.

(e) AI connection: Identify B=AA+λIB = A^\top A + \lambda I as the normal equations matrix for ridge regression. Why does λ>0\lambda > 0 guarantee B0B \succ 0 regardless of the rank of AA?


Exercise 2 * - Implement Cholesky from Scratch

(a) Implement the Cholesky algorithm (Algorithm 4.2) in pure NumPy without using np.linalg.cholesky. Your function should return the lower triangular LL satisfying A=LLA = LL^\top, or raise a ValueError if AA is not PD.

(b) Test on A=(41216123743164398)A = \begin{pmatrix}4 & 12 & -16 \\ 12 & 37 & -43 \\ -16 & -43 & 98\end{pmatrix} and verify LL=ALL^\top = A.

(c) Compare the result with np.linalg.cholesky - they should agree to machine precision.

(d) Test what happens when you apply your function to an indefinite matrix A=(1331)A = \begin{pmatrix}1 & 3 \\ 3 & 1\end{pmatrix}.

(e) Efficiency: What is the flop count of your implementation for n×nn \times n input? Compare to LU factorization.


Exercise 3 * - Sylvester's Criterion

Let A(α)=(1α0α2α0α4)A(\alpha) = \begin{pmatrix}1 & \alpha & 0 \\ \alpha & 2 & \alpha \\ 0 & \alpha & 4\end{pmatrix}.

(a) Compute the three leading principal minors Δ1,Δ2,Δ3\Delta_1, \Delta_2, \Delta_3 as functions of α\alpha.

(b) Use Sylvester's criterion to find all values of αR\alpha \in \mathbb{R} for which A(α)0A(\alpha) \succ 0.

(c) At the boundary values of α\alpha (where AA transitions from PD to indefinite), what is the nullspace of AA? Interpret this geometrically.

(d) Verify your answer computationally by plotting the smallest eigenvalue of A(α)A(\alpha) as a function of α\alpha.


Exercise 4 ** - LDL^T Factorization

(a) Implement the LDL^T algorithm (Algorithm 4.4) in NumPy. Your function returns unit lower triangular LL and diagonal DD such that A=LDLA = LDL^\top.

(b) Test on A=(421253136)A = \begin{pmatrix}4 & 2 & 1 \\ 2 & 5 & 3 \\ 1 & 3 & 6\end{pmatrix}. Verify LDL=ALDL^\top = A.

(c) Using your factorization, solve Ax=bA\mathbf{x} = \mathbf{b} for b=(1,2,3)\mathbf{b} = (1, 2, 3)^\top by: (i) solve Ly=bL\mathbf{y} = \mathbf{b}, (ii) solve Dw=yD\mathbf{w} = \mathbf{y}, (iii) solve Lx=wL^\top\mathbf{x} = \mathbf{w}.

(d) Explain why LDL^T avoids the numerical issues of Cholesky when AA is indefinite.

(e) AI connection: Why do second-order optimizers (like KFAC) prefer working with DD (the pivots) rather than D\sqrt{D} (the Cholesky diagonal)?


Exercise 5 ** - Schur Complement and Block PD Test

Let M=(ABBD)M = \begin{pmatrix}A & B \\ B^\top & D\end{pmatrix} where:

A=(4113),B=(12),D=(5).A = \begin{pmatrix}4 & 1 \\ 1 & 3\end{pmatrix}, \quad B = \begin{pmatrix}1 \\ 2\end{pmatrix}, \quad D = \begin{pmatrix}5\end{pmatrix}.

(a) Compute the Schur complement S=DBA1BS = D - B^\top A^{-1}B.

(b) Use Theorem 6.2 to determine whether M0M \succ 0.

(c) Compute detM\det M using detM=detAdetS\det M = \det A \cdot \det S and verify numerically.

(d) Use the block inverse formula to compute M1M^{-1}. Verify MM1=IMM^{-1} = I.

(e) Gaussian conditioning: Interpret MM as a joint covariance matrix Σ=(Σ11Σ12Σ21Σ22)\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}. What is the conditional covariance Σ12\Sigma_{1|2}? Interpret: does conditioning reduce uncertainty?


Exercise 6 ** - Log-Det via Cholesky and Gradient Verification

(a) Compute logdetA\log\det A for A=(310142025)A = \begin{pmatrix}3 & 1 & 0 \\ 1 & 4 & 2 \\ 0 & 2 & 5\end{pmatrix} using: (i) direct eigenvalues, (ii) Cholesky diagonal formula, (iii) np.log(np.linalg.det(A)). Verify all three agree.

(b) Implement a function log_det(A) using Cholesky. Compare speed vs eigenvalue method for n=100,500,1000n = 100, 500, 1000 random PD matrices.

(c) Numerically verify the gradient formula AlogdetA=A1\nabla_A \log\det A = A^{-1} using finite differences: f(A+hEij)f(A)h(A1)ij\frac{f(A + hE_{ij}) - f(A)}{h} \approx (A^{-1})_{ij}.

(d) For the Gaussian log-likelihood: given N=50N = 50 samples from N(0,Σtrue)\mathcal{N}(\mathbf{0}, \Sigma_{\text{true}}) where Σtrue=A\Sigma_{\text{true}} = A from part (a), compute the log-likelihood (Σ)=N2logdetΣN2tr(Σ1S)\ell(\Sigma) = -\frac{N}{2}\log\det\Sigma - \frac{N}{2}\text{tr}(\Sigma^{-1}S) (up to an additive constant), where S=1NixixiS = \frac{1}{N}\sum_i\mathbf{x}_i\mathbf{x}_i^\top is the sample covariance.


Exercise 7 *** - Gram Matrix / Kernel Matrix and Mercer Check

(a) Given 5 points x1,,x5R3\mathbf{x}_1, \ldots, \mathbf{x}_5 \in \mathbb{R}^3 (design your own interesting configuration), compute the Gram matrix G=XXG = XX^\top.

(b) Verify G0G \succeq 0 by checking eigenvalues. Explain why rank(G)3\text{rank}(G) \leq 3.

(c) Implement and test three kernels on these points: (i) linear k(x,z)=xzk(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}, (ii) RBF k(x,z)=exp(xz2/2)k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2/2), (iii) polynomial k(x,z)=(xz+1)2k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z}+1)^2. For each, compute the 5×55 \times 5 kernel matrix and verify PSD.

(d) Now compare two candidate kernels: k(x,z)=exp(xz)k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|) (the Laplacian kernel, which IS a valid PD kernel) and k(x,z)=cos(xz)k(\mathbf{x},\mathbf{z}) = \cos(\|\mathbf{x}-\mathbf{z}\|) (NOT a valid kernel in general). Determine empirically whether each produces a PSD Gram matrix.

(e) AI connection: Implement a simple kernel regression: given training inputs X\mathbf{X} and outputs y\mathbf{y}, compute predictions at test points XX_* using y^=Kn(Knn+0.1I)1y\hat{\mathbf{y}}_* = K_{*n}(K_{nn} + 0.1 I)^{-1}\mathbf{y} (Nadaraya-Watson/GPR predictive mean).


Exercise 8 *** - Cholesky Reparameterization for VAE Sampling

(a) Implement a function sample_mvn(mu, L, n_samples) that samples nn vectors from N(μ,LL)\mathcal{N}(\boldsymbol{\mu}, LL^\top) using the reparameterization trick z=μ+Lϵ\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, ϵN(0,I)\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},I).

(b) Let μ=(1,2)\boldsymbol{\mu} = (1, 2)^\top and Σ=(4223)\Sigma = \begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}. Compute L=chol(Σ)L = \text{chol}(\Sigma). Sample 1000 points, compute the sample mean and covariance, and verify they match μ\boldsymbol{\mu} and Σ\Sigma (up to sampling noise).

(c) Simulate a VAE encoder: the encoder takes a scalar input xx and outputs μϕ(x)=Wx+b\boldsymbol{\mu}_\phi(x) = Wx + b and a lower triangular Lϕ(x)L_\phi(x) (a diagonal LL for simplicity). Sample z\mathbf{z} and pass it through a simple decoder x^=cz\hat{x} = c^\top\mathbf{z}. Compute the ELBO:

ELBO=Ez[logp(xz)]DKL(N(μ,Σ)N(0,I)).\text{ELBO} = \mathbb{E}_{\mathbf{z}}[\log p(x|\mathbf{z})] - D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \| \mathcal{N}(\mathbf{0}, I)).

where DKL(N(μ,Σ)N(0,I))=12[tr(Σ)+μμdlogdetΣ]D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \| \mathcal{N}(\mathbf{0},I)) = \frac{1}{2}[\text{tr}(\Sigma) + \boldsymbol{\mu}^\top\boldsymbol{\mu} - d - \log\det\Sigma].

(d) Verify the KL formula using both the analytical expression and Monte Carlo estimation from samples.

(e) Gradient check: Confirm that the reparameterization trick allows gradients to flow through μ\boldsymbol{\mu} and LL (not ϵ\boldsymbol{\epsilon}) by computing ELBO/μ\partial\text{ELBO}/\partial\boldsymbol{\mu} analytically and comparing with finite differences.


13. Why This Matters for AI (2026 Perspective)

| Concept | Concrete AI/LLM Role | Where / Method |
|---------|----------------------|----------------|
| Positive definite covariance | Valid multivariate Gaussian distributions | VAEs (Kingma & Welling, 2013); diffusion models (DDPM, score matching) |
| Cholesky factorization | Reparameterization trick for differentiable sampling | All variational inference; NUTS-HMC sampler in Pyro, NumPyro |
| LDL^T factorization | Stable Hessian factorization in trust-region optimizers | L-BFGS, trust-region Newton; scipy.optimize |
| Schur complement | Conditional Gaussian inference; GP prediction | GPyTorch, GPflow; Bayesian neural networks |
| Matrix inversion lemma (Woodbury) | Low-rank updates without full matrix inversion | LoRA weight merging; Kalman filter; GP with inducing points |
| Log-determinant | Normalizing constant in Gaussian log-likelihood | GP hyperparameter optimization; normalizing flows (RealNVP, Glow) |
| \nabla\log\det A = A^{-1} | Gradient of log-likelihood w.r.t. covariance | Fitting GP kernel parameters; structured covariance learning |
| Gram matrix / kernel matrix | Kernel methods, self-attention similarity | SVM training; transformer attention; Performer / FAVOR+ |
| Fisher information matrix | Riemannian metric on parameter space | K-FAC optimizer (Martens & Grosse, 2015); natural gradient variational inference |
| PSD Hessian at minima | Second-order optimality condition | SAM (flatness-seeking optimizer); loss landscape analysis |
| Sharpness \lambda_{\max}(H) | Generalization proxy; flat minima theory | SAM, ASAM, Stein Self-Repulsive (2024) |
| PSD cone / SDP | Constrained optimization; certified robustness | \alpha-CROWN, \beta-CROWN neural network verification; fairness constraints |
| Mercer's theorem | Theoretical foundation for kernel machines | SVMs; kernel PCA; connections to infinite-width NTK theory |
| Cholesky reparameterization | Parameterizing full covariance in deep models | Deep covariance estimation; matrix-variate distributions in meta-learning |
| Log-det stochastic estimator | Scalable GP training on millions of points | GPyTorch (Gardner et al., 2018); SVGP; stochastic trace estimation |

14. Conceptual Bridge

Looking Back

This section builds on a foundation laid across the previous six sections of Advanced Linear Algebra. From eigenvalues and eigenvectors (01), we borrow the spectral theorem - the fact that symmetric matrices diagonalize orthogonally - which provides the cleanest proof that positive definiteness is equivalent to positive eigenvalues. From SVD (02), we inherit the language of singular values and the connection between the spectral norm, condition number, and the stability of Cholesky. The orthogonality theory from 05 underpins the proof that QΛQQ\Lambda Q^\top is symmetric (the spectral theorem) and that the PSD square root is well-defined and unique. Matrix norms from 06 give precise bounds on how a PD matrix behaves numerically: the condition number κ(A)=λmax/λmin\kappa(A) = \lambda_{\max}/\lambda_{\min} measures how far AA is from singular, and ill-conditioned PD matrices lead to numerical difficulties in Cholesky.

Most fundamentally, positive definite matrices are the matrices that have all four of the desirable properties we want from a symmetric matrix: uniquely solvable linear systems (invertible), well-defined quadratic energy (positive), computable square root (Cholesky), and monotone response to matrix operations (Loewner order). Understanding this quadrant of symmetry \times positivity is the analytical key to all probabilistic machine learning.

Looking Forward

Positive definite matrices are the entry point to the next major development: 08 (Matrix Decompositions) uses the Cholesky factorization developed here as one member of a family that includes LU (general matrices) and QR (orthogonal factorization). The Cholesky decomposition studied here in full generality is previewed (only briefly) in 08 - by the time students reach 08, the Cholesky algorithm should feel familiar.

The deeper application of positive definiteness comes later in the curriculum. In Chapter 6 (Probability Theory), covariance matrices Σ0\Sigma \succeq 0 appear in every multivariate distribution. In Chapter 8 (Optimization), the second-order conditions for minima require 2L0\nabla^2\mathcal{L} \succeq 0, and convexity of a function is equivalent to its Hessian being PSD everywhere. In Chapter 12 (Functional Analysis), Mercer's theorem and RKHS theory extend PSD kernels to infinite-dimensional inner product spaces. The PSD cone and semidefinite programming previewed in 9 appear extensively in Chapter 8 (Optimization) and in the robotics/control applications in later chapters.

Curriculum Position

POSITIVE DEFINITE MATRICES IN THE CURRICULUM
========================================================================

  Chapter 2: Linear Algebra Basics
  +-- 01 Vectors and Spaces        -> inner products, quadratic forms
  +-- 04 Determinants              -> Sylvester's criterion
  +-- 06 Vector Spaces             -> subspaces, null space
  
  Chapter 3: Advanced Linear Algebra
  +-- 01 Eigenvalues         -> spectral characterization of PD
  +-- 02 SVD                 -> singular values, condition number
  +-- 05 Orthogonality       -> spectral theorem, square root
  +-- 06 Matrix Norms        -> condition number, jitter scale
  +-- 07 Positive Definite   <== YOU ARE HERE
  |     |   Cholesky, LDL^T, Schur, log-det, Gram, SDP
  +-- 08 Matrix Decompositions  -> LU, QR, brief Cholesky recap
  
  Downstream:
  +-- Ch. 6 Probability: Gaussian covariance \Sigma \succ 0
  +-- Ch. 8 Optimization: Hessian \succ 0 at local minima; SDP
  +-- Ch. 12 Functional Analysis: Mercer, RKHS
  +-- Ch. 13 ML Math: K-FAC, VAE, GP, normalizing flows

========================================================================

The central insight to carry forward: positive definiteness is not a restriction but a characterization. A matrix that fails to be PD is not "almost OK" - it is categorically different, with no unique minimum, no valid probability interpretation, no Cholesky factorization. The condition A0A \succ 0 is the boundary between well-posedness and ill-posedness in a wide range of mathematical problems, and recognizing that boundary - and the tools to work with it (Cholesky, Schur, log-det) - is an essential skill for the working ML practitioner.


Appendix A: Closure Properties and Algebraic Operations

A.1 Preserving Positive Definiteness Under Operations

Understanding which operations preserve PD/PSD structure is practically important - it tells you when a composed matrix is guaranteed to be PD without recomputing.

Direct results:

| Operation | Inputs | Result | Condition |
|-----------|--------|--------|-----------|
| \alpha A | A \succ 0, \alpha > 0 | \succ 0 | Always |
| A + B | A, B \succ 0 | \succ 0 | Always |
| A + B | A \succ 0, B \succeq 0 | \succ 0 | Always |
| B^\top A B | A \succ 0, B invertible | \succ 0 | Always |
| B^\top A B | A \succ 0, B full column rank | \succ 0 | Always |
| A^{-1} | A \succ 0 | \succ 0 | Always |
| A^k | A \succ 0, k \in \mathbb{Z} | \succ 0 | Always |
| A^t | A \succ 0, t \in \mathbb{R} | \succ 0 | Matrix power via \exp(t \log A) |
| AB | A, B \succ 0 | NOT necessarily \succeq 0 | Only if AB = BA |
| A \otimes B | A, B \succ 0 | \succ 0 | Always (Kronecker) |
| A \otimes B | A, B \succeq 0 | \succeq 0 | Always (Kronecker) |
| A \odot B (Hadamard) | A, B \succeq 0 | \succeq 0 | Schur product theorem |

The Schur product theorem. The Hadamard (element-wise) product of two PSD matrices is PSD. This is non-obvious! It is used in kernel methods: if k1k_1 and k2k_2 are valid kernels, their pointwise product k1(x,z)k2(x,z)k_1(\mathbf{x},\mathbf{z}) \cdot k_2(\mathbf{x},\mathbf{z}) is also a valid kernel.

Proof of Schur product theorem. If A,B0A, B \succeq 0, write A=iuiuiA = \sum_i \mathbf{u}_i\mathbf{u}_i^\top and B=jvjvjB = \sum_j \mathbf{v}_j\mathbf{v}_j^\top (Gram representations). Then (AB)kl=AklBkl(A \odot B)_{kl} = A_{kl}B_{kl}, which can be expressed as the Gram matrix of vectors uivj\mathbf{u}_i \otimes \mathbf{v}_j (Kronecker products of the generating vectors). Gram matrices are PSD. \square

Why ABAB is not always PSD. Consider A=(1002)A = \begin{pmatrix}1 & 0 \\ 0 & 2\end{pmatrix} and B=(2111)B = \begin{pmatrix}2 & 1 \\ 1 & 1\end{pmatrix}. Both are PD, but AB=(2122)AB = \begin{pmatrix}2 & 1 \\ 2 & 2\end{pmatrix}, which is not symmetric (hence not PSD). The product of symmetric matrices is symmetric iff they commute.

For AI: Kronecker product structure is central to K-FAC: FAlGlF \approx A_l \otimes G_l is PSD since Al,Gl0A_l, G_l \succeq 0. The Schur product theorem ensures that combining PD kernels (feature interactions) produces valid kernels.

A.2 Principal Submatrix Inheritance

Theorem A.1. If A0A \succ 0, every principal submatrix is 0\succ 0. If A0A \succeq 0, every principal submatrix is 0\succeq 0.

A principal submatrix of AA is obtained by deleting rows and columns with the same index set: AS=A[S,S]A_S = A[S, S] for S{1,,n}S \subseteq \{1,\ldots,n\}.

Proof. For any yRS\mathbf{y} \in \mathbb{R}^{|S|}, embed in Rn\mathbb{R}^n via xi=yi\mathbf{x}_i = y_i for iSi \in S and xi=0\mathbf{x}_i = 0 for iSi \notin S. Then yASy=xAx>0\mathbf{y}^\top A_S \mathbf{y} = \mathbf{x}^\top A \mathbf{x} > 0 for y0\mathbf{y} \neq 0 (since x0\mathbf{x} \neq 0). \square

Consequence. Every diagonal entry Aii>0A_{ii} > 0 for A0A \succ 0 (take S={i}S = \{i\}). Every 2×22 \times 2 principal submatrix is PD, so AiiAjj>Aij2A_{ii}A_{jj} > A_{ij}^2 for all iji \neq j (Cauchy-Schwarz-type bound for matrix entries).

A.3 Indefinite Matrices and Saddle Points

For completeness: an indefinite symmetric matrix AA has both positive and negative eigenvalues. The corresponding quadratic form takes positive values in some directions and negative values in others. The level sets are hyperboloids.

Connection to saddle points in optimization. A critical point L(θ)=0\nabla\mathcal{L}(\boldsymbol{\theta}^*) = 0 with indefinite Hessian 2L(θ)\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) is a saddle point - not a local minimum. In neural network optimization:

  • Gradient descent near a saddle point may slow down dramatically (gradient is small, Hessian is indefinite)
  • The negative eigenvalue directions are "escape directions" - following them reduces the loss
  • Stochastic gradient descent (SGD) with noise "escapes" saddles; deterministic gradient descent can get stuck

Sylvester's law of inertia. For a symmetric matrix AA, the signature (p,q,r)(p, q, r) (number of positive, negative, zero eigenvalues) is invariant under congruence transformations ABABA \mapsto B^\top A B (for invertible BB). This means no invertible change of basis can turn a saddle into a minimum - the number of descent directions is a geometric invariant of the quadratic form.


Appendix B: Numerical Methods and Implementation

B.1 Cholesky for Linear System Solving

A primary application of Cholesky is solving Ax=bA\mathbf{x} = \mathbf{b} for PD AA.

Algorithm:

  1. Factor: A=LLA = LL^\top (Cholesky, O(n3/3)O(n^3/3))
  2. Forward solve: Ly=bL\mathbf{y} = \mathbf{b} (substitute from top, O(n2)O(n^2))
  3. Backward solve: Lx=yL^\top\mathbf{x} = \mathbf{y} (substitute from bottom, O(n2)O(n^2))

For multiple right-hand sides B=[b1bk]B = [\mathbf{b}_1|\cdots|\mathbf{b}_k]: factor once, solve kk times at cost O(n3/3+kn2)O(n^3/3 + kn^2) total.

Numerical example: Solve (4223)x=(87)\begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}\mathbf{x} = \begin{pmatrix}8 \\ 7\end{pmatrix}.

Cholesky: L=(2012)L = \begin{pmatrix}2 & 0 \\ 1 & \sqrt{2}\end{pmatrix} (since L11=2L_{11}=2, L21=1L_{21}=1, L22=31=2L_{22}=\sqrt{3-1}=\sqrt{2}).

Forward: 2y1=8y1=42y_1 = 8 \Rightarrow y_1 = 4; y1+2y2=7y2=(74)/2=3/2y_1 + \sqrt{2}y_2 = 7 \Rightarrow y_2 = (7-4)/\sqrt{2} = 3/\sqrt{2}.

Backward: 2x2=3/2x2=3/2\sqrt{2}x_2 = 3/\sqrt{2} \Rightarrow x_2 = 3/2; 2x1+x2=4x1=(43/2)/2=5/42x_1 + x_2 = 4 \Rightarrow x_1 = (4 - 3/2)/2 = 5/4.

Verify: (4223)(5/43/2)=(5+35/2+9/2)=(87)\begin{pmatrix}4&2\\2&3\end{pmatrix}\begin{pmatrix}5/4\\3/2\end{pmatrix} = \begin{pmatrix}5+3\\5/2+9/2\end{pmatrix} = \begin{pmatrix}8\\7\end{pmatrix}. OK
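The same computation with SciPy's packaged routines (a sketch; cho_factor and cho_solve wrap LAPACK's Cholesky factor-and-solve):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([8.0, 7.0])

c, low = cho_factor(A, lower=True)    # one O(n^3/3) factorization
x = cho_solve((c, low), b)            # two O(n^2) triangular solves
print(x)                              # [1.25, 1.5], matching the hand computation above

# Multiple right-hand sides reuse the same factorization.
B = np.column_stack([b, [4.0, 5.0]])
print(cho_solve((c, low), B))
```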

B.2 Cholesky Update and Downdate

When AA changes by a rank-1 update AA+vvA \leftarrow A + \mathbf{v}\mathbf{v}^\top (or downdate AAvvA \leftarrow A - \mathbf{v}\mathbf{v}^\top), it is wasteful to recompute the full Cholesky factorization. Rank-1 Cholesky update algorithms (LINPACK's dchud) update LL in O(n2)O(n^2) time.

Algorithm (rank-1 update, Gill-Golub-Murray-Saunders style):

Given A=LLA = LL^\top, compute LL' such that LL=A+vvL'L'^\top = A + \mathbf{v}\mathbf{v}^\top:

x = copy(v)         (work vector, overwritten in place)
for k = 1 to n:
    r = sqrt(L[k,k]^2 + x[k]^2)
    c = r / L[k,k]
    s = x[k] / L[k,k]
    L[k,k] = r
    L[k+1:n, k] = (L[k+1:n,k] + s*x[k+1:n]) / c
    x[k+1:n]    = c*x[k+1:n] - s*L[k+1:n,k]

This uses Givens rotations (-> 05: Orthogonality) to update LL in place.

For AI: Online learning algorithms that add one data point at a time use rank-1 Cholesky updates to maintain the Cholesky factorization of the Gram matrix. Sequential Bayesian updating (online GP regression) uses rank-1 Cholesky updates at cost O(n2)O(n^2) per update instead of O(n3)O(n^3) full refactorization.
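A NumPy transcription of the update loop above, with a quick correctness check on a random PD matrix (the sizes are illustrative):

```python
import numpy as np

def cholupdate(L, v):
    # Rank-1 update: given lower-triangular L with A = L L^T, return L' such that
    # L' L'^T = A + v v^T. Costs O(n^2), following the rotation scheme sketched above.
    L = L.copy()
    x = v.astype(float).copy()
    n = len(x)
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            L[k+1:, k] = (L[k+1:, k] + s * x[k+1:]) / c
            x[k+1:] = c * x[k+1:] - s * L[k+1:, k]
    return L

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
v = rng.standard_normal(5)
L1 = cholupdate(np.linalg.cholesky(A), v)
print(np.allclose(L1 @ L1.T, A + np.outer(v, v)))  # True
```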

B.3 Condition Number and Numerical Stability of Cholesky

Condition number. For A0A \succ 0, the condition number κ(A)=λmax/λmin\kappa(A) = \lambda_{\max}/\lambda_{\min}. The relative error in solving Ax=bA\mathbf{x} = \mathbf{b} satisfies:

x^xxκ(A)ϵmach.\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A) \cdot \epsilon_{\text{mach}}.

For Cholesky specifically: the backward error bound on the Cholesky factor is AL^L^cnϵmachA\|A - \hat{L}\hat{L}^\top\| \leq c_n\epsilon_{\text{mach}}\|A\|. This means Cholesky is unconditionally backward stable for PD matrices. In contrast, LU without pivoting can have backward errors proportional to LU\|L\|\|U\|, which can be arbitrarily large.

Ill-conditioning warning. For large κ(A)\kappa(A):

  • The log-det computation 2logLii2\sum\log L_{ii} involves some very small LiiL_{ii} (near-zero for near-singular dimensions) - those log terms dominate and may lose digits
  • Linear solves A1bA^{-1}\mathbf{b} lose approximately log10κ(A)\log_{10}\kappa(A) digits of accuracy
  • Cholesky itself remains stable; the error comes from the ill-conditioning of the problem, not the algorithm

Practical rule: If κ(A)10k\kappa(A) \approx 10^k and your floating-point arithmetic carries 16\approx 16 decimal digits (double precision), expect to lose roughly kk digits of accuracy in the solution. Once kk approaches 16, essentially no digits are reliable.

B.4 Sparse Cholesky

For sparse PD matrices (many zero entries, arising in graph-structured models, FEM discretizations, and spatial statistics), general Cholesky is wasteful because it may "fill in" zeros and produce dense LL.

The fill-in problem. Even if AA is sparse, LL may be dense. The fill-in pattern depends on the sparsity structure of AA and the ordering of variables.

Reordering algorithms. The approximate minimum degree (AMD) and nested dissection algorithms reorder the variables (permute APAPA \leftarrow PAP^\top for a permutation matrix PP) to minimize fill-in. The standard library for sparse Cholesky is SuiteSparse (CHOLMOD).

For AI: In Gaussian Markov random fields (GMRFs), the precision matrix (inverse covariance) is sparse. Sparse Cholesky enables efficient inference in spatial models, graph neural networks with Gaussian priors, and structured prediction. GPyTorch's KeOps and PyTorch Sparse backends exploit sparsity for large-scale GP approximations.


Appendix C: Positive Definiteness in Complex Vector Spaces

C.1 Hermitian Positive Definite Matrices

Over C\mathbb{C}, positive definiteness generalizes to Hermitian positive definite (HPD) matrices.

Definition. ACn×nA \in \mathbb{C}^{n \times n} is Hermitian positive definite if:

  • A=AA = A^* (where A=AˉA^* = \bar{A}^\top is the conjugate transpose)
  • xAx>0\mathbf{x}^* A \mathbf{x} > 0 for all xCn{0}\mathbf{x} \in \mathbb{C}^n \setminus \{0\}

(Note: xAx\mathbf{x}^* A \mathbf{x} is always real for Hermitian AA, so the positivity condition makes sense.)

Cholesky for HPD. Every HPD matrix AA has a unique Cholesky factorization A=LLA = LL^* where LL is lower triangular with positive real diagonal entries.

For AI: Complex-valued neural networks (used in signal processing, quantum computing simulation) use HPD covariance matrices. Radar signal processing, MRI reconstruction, and quantum chemistry all work in Cn\mathbb{C}^n with HPD matrices.


Appendix D: Advanced Theory

D.1 The Polar Decomposition and Positive Definite Factors

Every invertible matrix ARn×nA \in \mathbb{R}^{n \times n} has a unique polar decomposition:

A=QPA = QP

where QQ is orthogonal (QQ=IQ^\top Q = I) and P=AA0P = \sqrt{A^\top A} \succ 0 is the unique symmetric positive definite "right polar factor." This is the matrix analogue of the polar form z=eiθzz = e^{i\theta}|z| for complex numbers.

Existence and uniqueness. Using SVD A=UΣVA = U\Sigma V^\top:

A=UΣV=(UV)(VΣV)=QPA = U\Sigma V^\top = (UV^\top)(V\Sigma V^\top) = QP

where Q=UVQ = UV^\top (orthogonal) and P=VΣV0P = V\Sigma V^\top \succ 0 (PSD square root of AAA^\top A).

Uniqueness of PP: P=(AA)1/2P = (A^\top A)^{1/2}, which is the unique PSD square root (Theorem 5.1). Given PP, Q=AP1Q = AP^{-1} is uniquely determined.

For AI: The polar decomposition appears in:

  • Weight matrix orthogonalization: in Muon optimizer (Kosson et al., 2024), gradient updates are "orthogonalized" using the polar factor Q=AP1Q = AP^{-1}, which projects the weight update onto the manifold of orthogonal matrices, maintaining stable singular value spectrum during training
  • Group equivariance: equivariant neural networks (E(n)-equivariant GNNs) use orthogonal matrices as symmetry group elements; the polar decomposition provides a differentiable projection onto O(n)O(n)

Computing the polar factor. Newton iteration: Xk+1=12(Xk+Xk)X_{k+1} = \frac{1}{2}(X_k + X_k^{-\top}) converges to QQ starting from X0=AX_0 = A. Quadratic convergence once close to QQ. Each step costs one matrix inversion.

D.2 Positive Definite Functions and Bochner's Theorem

A function k:RdRk: \mathbb{R}^d \to \mathbb{R} is a positive definite function if for every nn and every set of points x1,,xnRd\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^d, the matrix Kij=k(xixj)K_{ij} = k(\mathbf{x}_i - \mathbf{x}_j) is PSD.

Bochner's theorem. A continuous stationary kernel k(xz)k(\mathbf{x}-\mathbf{z}) is a positive definite function on Rd\mathbb{R}^d if and only if it is the Fourier transform of a non-negative measure μ\mu (spectral density):

k(τ)=Rdeiωτdμ(ω).k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top\boldsymbol{\tau}} d\mu(\boldsymbol{\omega}).

Implications for kernel design:

  • RBF kernel: k(τ)=exp(τ2/22)k(\boldsymbol{\tau}) = \exp(-\|\boldsymbol{\tau}\|^2/2\ell^2) is PD (its Fourier transform is a Gaussian, which is non-negative)
  • Matern kernels: PD for all valid smoothness parameters ν>0\nu > 0
  • Checking validity: A kernel is valid iff its Fourier transform is non-negative - this is the ultimate test

Random Fourier features (Rahimi & Recht, 2007). Bochner's theorem gives a randomized approximation to stationary PD kernels:

k(xz)z(x)z(z),z(x)=2/D[cos(ω1x+b1),,cos(ωDx+bD)]k(\mathbf{x}-\mathbf{z}) \approx z(\mathbf{x})^\top z(\mathbf{z}), \quad z(\mathbf{x}) = \sqrt{2/D}[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}+b_1),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}+b_D)]

where ωiμ\boldsymbol{\omega}_i \sim \mu (the spectral distribution) and biUniform[0,2π]b_i \sim \text{Uniform}[0,2\pi]. This approximates the PD kernel as a finite inner product, enabling scalable kernel methods without forming the full n×nn \times n kernel matrix. Used in Performer transformer attention (FAVOR+, Choromanski et al., 2020).
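A short NumPy sketch of random Fourier features for the RBF kernel with lengthscale =1\ell = 1 (the number of features D and the test points are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, D=2000, ell=1.0):
    # Random Fourier features for exp(-||x - z||^2 / (2 ell^2)).
    # Its spectral measure is N(0, ell^{-2} I), per Bochner's theorem.
    d = X.shape[1]
    W = rng.standard_normal((d, D)) / ell
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.standard_normal((50, 3))
Z = rff_features(X)
K_approx = Z @ Z.T                               # PSD by construction (a Gram matrix)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2)
print(np.abs(K_approx - K_exact).max())          # small, and shrinks as D grows
```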

D.3 The Determinant and Volume

Geometric interpretation of detA\det A for A0A \succ 0.

For A0A \succ 0, the quadratic form xA1x1\mathbf{x}^\top A^{-1} \mathbf{x} \leq 1 defines an ellipsoid EA\mathcal{E}_A in Rn\mathbb{R}^n. Its volume is:

Vol(EA)=πn/2Γ(n/2+1)(detA)1/2.\text{Vol}(\mathcal{E}_A) = \frac{\pi^{n/2}}{\Gamma(n/2+1)} \cdot (\det A)^{1/2}.

So detA\det A measures the "volume" of the ellipsoid associated with AA. Larger detA\det A means a more "spread out" distribution.

In the Gaussian context: The normalizing constant of N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) is (2π)n/2(detΣ)1/2(2\pi)^{n/2}(\det\Sigma)^{1/2}. This is the volume of the "effective support" ellipsoid scaled by (2π)n/2(2\pi)^{n/2}.

D-optimal experimental design. In Bayesian experimental design, the D-optimal criterion selects experiments x1,,xn\mathbf{x}_1,\ldots,\mathbf{x}_n to maximize det(XX+σ2I)\det(X^\top X + \sigma^{-2}I) - the determinant of the posterior precision matrix, which equals 1/det(Σpost)1/\det(\Sigma_{\text{post}}). Maximizing det(posterior precision)\det(\text{posterior precision}) minimizes the volume of the posterior confidence ellipsoid. This is an SDP with XX as the optimization variable.

Maximum entropy Gaussians. Among all distributions with mean μ\boldsymbol{\mu} and covariance Σ\Sigma, the maximum entropy distribution is N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma), with entropy H=n2(1+log(2π))+12logdetΣH = \frac{n}{2}(1+\log(2\pi)) + \frac{1}{2}\log\det\Sigma. Maximizing entropy subject to a trace constraint tr(Σ)c\text{tr}(\Sigma) \leq c gives Σ=(c/n)I\Sigma = (c/n)I (isotropic Gaussian) - the most "uninformed" Gaussian with bounded total variance.

D.4 Matrix Exponential and the PD Manifold

The set of PD matrices S++n\mathbb{S}_{++}^n is an open smooth manifold (in fact, a Riemannian manifold with the affine-invariant metric). This means concepts from differential geometry apply.

The matrix exponential. For any symmetric SSnS \in \mathbb{S}^n, eS=QeΛQS++ne^S = Q e^\Lambda Q^\top \in \mathbb{S}_{++}^n (positive definite, since exponentials of real numbers are positive). The map exp:SnS++n\exp: \mathbb{S}^n \to \mathbb{S}_{++}^n is a bijection - PD matrices are exactly exponentials of symmetric matrices.

Log-Euclidean metric. A simple Riemannian metric on S++n\mathbb{S}_{++}^n (used in diffusion tensor imaging) defines the "distance" between A,B0A, B \succ 0 as:

dlog(A,B)=logAlogBF.d_{\log}(A, B) = \|\log A - \log B\|_F.

This treats PD matrices as if they lived in a flat space via the log map. Computations (means, geodesics, interpolations) become Euclidean operations on logA\log A and logB\log B.

Affine-invariant metric (Riemannian). The more geometrically natural metric:

dAI(A,B)=log(A1/2BA1/2)F=(i=1nlog2λi(A1B))1/2d_{\text{AI}}(A, B) = \|\log(A^{-1/2}B A^{-1/2})\|_F = \left(\sum_{i=1}^n \log^2\lambda_i(A^{-1}B)\right)^{1/2}

is invariant under congruence ACACA \mapsto C^\top A C (affine-invariant), making it suitable for geometric statistics.

For AI: The SPD (symmetric positive definite) manifold appears in:

  • Diffusion tensor MRI: each voxel has a 3×33 \times 3 diffusion tensor S++3\in \mathbb{S}_{++}^3; Riemannian means avoid non-PD artifacts
  • Covariance-based classifiers: classifying neural activation covariance matrices using Riemannian geometry (SPDNet, 2017)
  • Transformer positional encodings with structured covariances: attention with SPD constraints on the key-query metric tensors

D.5 Inequalities for Positive Definite Matrices

Several fundamental inequalities apply specifically to PD matrices:

Hadamard's inequality. For A0A \succ 0:

detAi=1nAii.\det A \leq \prod_{i=1}^n A_{ii}.

Equality iff AA is diagonal. Interpretation: the volume of the ellipsoid defined by AA is at most the product of the diagonal entries (the volumes along coordinate axes).

Minkowski's inequality. For A,B0A, B \succ 0:

(det(A+B))1/n(detA)1/n+(detB)1/n.(\det(A+B))^{1/n} \geq (\det A)^{1/n} + (\det B)^{1/n}.

This expresses the superadditivity of $A \mapsto (\det A)^{1/n}$ on $\mathbb{S}_{++}^n$: it follows from the concavity and degree-1 homogeneity of this map, since $(\det(A+B))^{1/n} = 2\left(\det\tfrac{A+B}{2}\right)^{1/n} \geq (\det A)^{1/n} + (\det B)^{1/n}$.

Ky Fan's inequality. For symmetric $A, B \in \mathbb{S}^n$ and $k \leq n$, with eigenvalues ordered $\lambda_1 \geq \cdots \geq \lambda_n$:

\sum_{i=1}^k \lambda_i(A+B) \leq \sum_{i=1}^k \lambda_i(A) + \sum_{i=1}^k \lambda_i(B).

(The sum of the $k$ largest eigenvalues of $A+B$ is at most the sum from $A$ plus the sum from $B$; for $k = n$ this is just additivity of the trace.)

Anderson's inequality. For A,B,C0A, B, C \succ 0 with C=A+BC = A + B:

C1A1+B1.C^{-1} \preceq A^{-1} + B^{-1}.

Interpretation: if $A$ and $B$ are the precision (information) matrices of two independent experiments, the covariance of the combined experiment, $C^{-1} = (A+B)^{-1}$, is bounded above in the Loewner order by the sum of the individual covariances $A^{-1} + B^{-1}$; in fact it is bounded by each of them alone, since matrix inversion reverses the order.

For AI: Hadamard's inequality provides a bound on the log-det: logdetAilogAii=logAii\log\det A \leq \sum_i \log A_{ii} = \log\prod A_{ii}, useful when diagonal entries are cheaply computed. Minkowski's inequality is used in information-geometric proofs about combining Gaussian priors in Bayesian learning.
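A quick numerical check of the Hadamard log-det bound on a randomly generated PD matrix (the Gram construction guarantees positive definiteness):

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 1e-3 * np.eye(6)                 # PD: Gram matrix plus a small ridge

L = np.linalg.cholesky(A)
log_det = 2 * np.log(np.diag(L)).sum()
hadamard_bound = np.log(np.diag(A)).sum()      # sum_i log A_ii
print(log_det <= hadamard_bound + 1e-12)       # True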


Appendix E: Extended Examples and Case Studies

E.1 Case Study: Covariance Estimation in High Dimensions

One of the most practically important applications of PSD matrix theory is estimating covariance matrices from data when the number of features pp is comparable to or larger than the number of samples nn.

The problem. Given nn samples x1,,xnRp\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^p (zero mean for simplicity), the sample covariance is:

S^=1ni=1nxixi=1nXX.\hat{S} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top = \frac{1}{n}X^\top X.

S^0\hat{S} \succeq 0 always (it's a Gram matrix). But:

  • If n<pn < p: rank(S^)n<p\text{rank}(\hat{S}) \leq n < p, so S^\hat{S} is singular (not PD)
  • Even for npn \geq p: S^\hat{S} can be ill-conditioned (extreme ratio of largest to smallest eigenvalue)

The Marchenko-Pastur law. For n,pn, p \to \infty with n/pγ>1n/p \to \gamma > 1 and xiN(0,Ip)\mathbf{x}_i \sim \mathcal{N}(0, I_p), the empirical eigenvalue distribution of S^\hat{S} converges to the Marchenko-Pastur distribution:

f(\lambda) = \frac{\gamma\,\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\lambda}

for λ[λ,λ+]\lambda \in [\lambda_-, \lambda_+], where λ±=(1±1/γ)2\lambda_\pm = (1 \pm 1/\sqrt{\gamma})^2. This means even under the identity covariance, sample eigenvalues are spread over a wide range - the smallest eigenvalues of S^\hat{S} are much smaller than 1 and the largest are much larger than 1.
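A simulation sketch under the identity covariance, checking that the extreme sample eigenvalues approach the Marchenko-Pastur edges $(1 \pm 1/\sqrt{\gamma})^2$; the sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 500                               # gamma = n/p = 4
X = rng.standard_normal((n, p))                # rows x_i ~ N(0, I_p)
S_hat = X.T @ X / n

eigs = np.linalg.eigvalsh(S_hat)
gamma = n / p
lam_minus, lam_plus = (1 - 1/np.sqrt(gamma))**2, (1 + 1/np.sqrt(gamma))**2
print(eigs.min(), lam_minus)                   # both close to 0.25
print(eigs.max(), lam_plus)                    # both close to 2.25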

Ledoit-Wolf shrinkage. The Ledoit-Wolf estimator regularizes the sample covariance toward a multiple of identity:

Σ^LW=αS^+(1α)μI\hat{\Sigma}_{\text{LW}} = \alpha\hat{S} + (1-\alpha)\mu I

where $\mu = \text{tr}(\hat{S})/p$ (the average eigenvalue) and $\alpha \in [0, 1]$ is chosen to minimize the expected squared Frobenius error. The result is PD whenever $\alpha < 1$ (since $\mu I \succ 0$ and $\hat{S} \succeq 0$).
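A minimal sketch of the shrinkage form with a hand-picked $\alpha$ for illustration (scikit-learn's sklearn.covariance.LedoitWolf estimates the optimal $\alpha$ from the data):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 100                                 # fewer samples than features
X = rng.standard_normal((n, p))
S_hat = X.T @ X / n                            # singular: rank <= n < p

mu = np.trace(S_hat) / p                       # average eigenvalue
alpha = 0.7                                    # illustrative shrinkage intensity
Sigma_lw = alpha * S_hat + (1 - alpha) * mu * np.eye(p)

print(np.linalg.eigvalsh(S_hat).min())         # ~ 0: sample covariance is singular
print(np.linalg.eigvalsh(Sigma_lw).min())      # > 0: shrinkage estimate is PD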

Graphical lasso. Assuming the precision matrix (inverse covariance) Ω=Σ1\Omega = \Sigma^{-1} is sparse (many zeros encode conditional independence), the graphical lasso estimates:

Ω^=argminΩ0[tr(S^Ω)logdetΩ+λΩ1]\hat{\Omega} = \arg\min_{\Omega \succ 0} \left[\text{tr}(\hat{S}\Omega) - \log\det\Omega + \lambda\|\Omega\|_1\right]

where 1\|\cdot\|_1 is the element-wise 1\ell_1 norm. The constraint Ω0\Omega \succ 0 is the PD cone constraint. The objective tr(S^Ω)logdetΩ\text{tr}(\hat{S}\Omega) - \log\det\Omega is the negative Gaussian log-likelihood (up to constants). Optimized by block coordinate descent (glasso algorithm) using the Schur complement for each block update.
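A minimal sketch using scikit-learn's GraphicalLasso on data drawn from a chain-structured (tridiagonal-precision) Gaussian; the ground-truth model and penalty are illustrative:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 8
Omega_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))   # sparse (tridiagonal) precision
Sigma_true = np.linalg.inv(Omega_true)

X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=500)
model = GraphicalLasso(alpha=0.05).fit(X)      # alpha is the l1 penalty weight
print(np.round(model.precision_, 2))           # entries far from the diagonal band are driven toward 0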

For LLMs: Structured sparse attention patterns (local + global attention, as in Longformer and BigBird) impose fixed sparsity on the attention matrix; loosely, this parallels assuming a sparse precision matrix, where zeros encode conditional independence between positions.

E.2 Case Study: Gaussian Process Regression, Step by Step

Let's trace through a complete GP regression computation to see all the PD machinery in action.

Setup. Training data: (x1,y1)=(0,0.5)(x_1, y_1) = (0, 0.5), (x2,y2)=(1,1.2)(x_2, y_2) = (1, 1.2), (x3,y3)=(2,0.8)(x_3, y_3) = (2, 0.8). Test point: x=1.5x_* = 1.5. Kernel: k(x,z)=exp((xz)2/2)k(x,z) = \exp(-(x-z)^2/2) (RBF with length-scale 1). Noise: σ2=0.1\sigma^2 = 0.1.

Step 1: Compute the kernel matrix.

Kij=exp((xixj)2/2).K_{ij} = \exp(-(x_i - x_j)^2/2). K=(1e0.5e2e0.51e0.5e2e0.51)(10.6070.1350.60710.6070.1350.6071).K = \begin{pmatrix}1 & e^{-0.5} & e^{-2} \\ e^{-0.5} & 1 & e^{-0.5} \\ e^{-2} & e^{-0.5} & 1\end{pmatrix} \approx \begin{pmatrix}1 & 0.607 & 0.135 \\ 0.607 & 1 & 0.607 \\ 0.135 & 0.607 & 1\end{pmatrix}.

Step 2: Add noise and Cholesky factor.

K+σ2I=K+0.1I(1.10.6070.1350.6071.10.6070.1350.6071.1).K + \sigma^2 I = K + 0.1I \approx \begin{pmatrix}1.1 & 0.607 & 0.135 \\ 0.607 & 1.1 & 0.607 \\ 0.135 & 0.607 & 1.1\end{pmatrix}.

Compute L=chol(K+0.1I)L = \text{chol}(K + 0.1I). (Details in theory.ipynb.)

Step 3: Predictive mean.

k=[k(x,xi)]i=[exp(2.25/2),exp(0.25/2),exp(0.25/2)][0.325,0.882,0.882]\mathbf{k}_* = [k(x_*, x_i)]_i = [\exp(-2.25/2), \exp(-0.25/2), \exp(-0.25/2)] \approx [0.325, 0.882, 0.882].

α=(K+0.1I)1y=LL1y(via Cholesky solve)\boldsymbol{\alpha} = (K + 0.1I)^{-1}\mathbf{y} = L^{-\top}L^{-1}\mathbf{y} \quad (\text{via Cholesky solve}) μ=kα.\mu_* = \mathbf{k}_*^\top\boldsymbol{\alpha}.

Step 4: Predictive variance.

σ2=k(x,x)k(K+0.1I)1k=1vv,v=L1k.\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top(K+0.1I)^{-1}\mathbf{k}_* = 1 - \mathbf{v}^\top\mathbf{v}, \quad \mathbf{v} = L^{-1}\mathbf{k}_*.

The variance is the Schur complement of the training block $(K + 0.1I)$ in the joint train-test kernel matrix, evaluated at $x_*$.

Interpretation. The Cholesky factorization appears in every step: solving for α\boldsymbol{\alpha}, computing the predictive variance, and computing the log-marginal likelihood for hyperparameter optimization. The PD condition on K+0.1IK + 0.1I guarantees σ20\sigma_*^2 \geq 0 (the Schur complement is non-negative).
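A compact NumPy version of the computation above (theory.ipynb works through the same steps with these numbers):

import numpy as np
from scipy.linalg import cho_solve, solve_triangular

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.2, 0.8])
x_star, sigma2 = 1.5, 0.1

def k(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / 2)    # RBF kernel, length-scale 1

K = k(x, x)
L = np.linalg.cholesky(K + sigma2 * np.eye(3))             # Step 2: Cholesky of K + sigma^2 I

k_star = k(np.array([x_star]), x).ravel()                  # Step 3: cross-covariances k_*
alpha = cho_solve((L, True), y)                            # (K + sigma^2 I)^{-1} y
mu_star = k_star @ alpha                                   # predictive mean

v = solve_triangular(L, k_star, lower=True)                # Step 4: v = L^{-1} k_*
var_star = 1.0 - v @ v                                     # predictive variance (Schur complement)
print(mu_star, var_star)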

E.3 Case Study: The Multivariate Normal in Deep Learning

Parameterizing structured covariances. In deep generative models (VAEs, normalizing flows, diffusion models), we frequently need to parameterize a multivariate Gaussian N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) with Σ\Sigma learnable. The challenge: Σ\Sigma must be PD, and the parameterization must be differentiable.

Approach 1: Diagonal. Σ=diag(σ12,,σd2)\Sigma = \text{diag}(\sigma_1^2,\ldots,\sigma_d^2) with σi=softplus(si)\sigma_i = \text{softplus}(s_i). Free parameters: dd. Correlation structure: none. Used in standard VAE (Kingma & Welling, 2013).

Approach 2: Lower triangular Cholesky. Σ=LL\Sigma = LL^\top where LL is lower triangular:

  • Diagonal: $L_{ii} = \text{softplus}(\tilde{L}_{ii})$ ensures positivity
  • Off-diagonal: $L_{ij}$ for $i > j$ unconstrained (any real number)

Free parameters: $d(d+1)/2$. Full correlation structure. Used in full-covariance VAEs, normalizing flows.

Approach 3: Low-rank + diagonal. $\Sigma = FF^\top + D$ where $F \in \mathbb{R}^{d \times r}$ ($r \ll d$) and $D = \text{diag}(d_1,\ldots,d_d)$ with $d_i > 0$. Free parameters: $dr + d$. Captures $r$ "directions of correlation." Used in structured variational Bayes as a richer alternative to fully factorized (mean-field) approximations.

The KL divergence between Gaussians. For q=N(μq,Σq)q = \mathcal{N}(\boldsymbol{\mu}_q, \Sigma_q) and p=N(μp,Σp)p = \mathcal{N}(\boldsymbol{\mu}_p, \Sigma_p):

DKL(qp)=12[tr(Σp1Σq)+(μpμq)Σp1(μpμq)d+logdetΣpdetΣq].D_{\text{KL}}(q\|p) = \frac{1}{2}\left[\text{tr}(\Sigma_p^{-1}\Sigma_q) + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^\top\Sigma_p^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q) - d + \log\frac{\det\Sigma_p}{\det\Sigma_q}\right].

For p=N(0,I)p = \mathcal{N}(\mathbf{0}, I) (standard VAE prior):

DKL(qN(0,I))=12[tr(Σq)+μqμqdlogdetΣq].D_{\text{KL}}(q\|\mathcal{N}(\mathbf{0},I)) = \frac{1}{2}\left[\text{tr}(\Sigma_q) + \boldsymbol{\mu}_q^\top\boldsymbol{\mu}_q - d - \log\det\Sigma_q\right].

For diagonal Σq=diag(σi2)\Sigma_q = \text{diag}(\sigma_i^2):

DKL=12i=1d[σi2+μq,i21logσi2].D_{\text{KL}} = \frac{1}{2}\sum_{i=1}^d \left[\sigma_i^2 + \mu_{q,i}^2 - 1 - \log\sigma_i^2\right].

For Cholesky Σq=LL\Sigma_q = LL^\top: logdetΣq=2ilogLii\log\det\Sigma_q = 2\sum_i \log L_{ii} and tr(Σq)=LF2\text{tr}(\Sigma_q) = \|L\|_F^2.

All these computations - logdetΣq\log\det\Sigma_q, tr(Σq)\text{tr}(\Sigma_q), Σq1\Sigma_q^{-1} - exploit the Cholesky and log-det tools developed in 4 and 7.
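A minimal NumPy sketch of Approach 2: build $\Sigma_q = LL^\top$ from an unconstrained parameter matrix and evaluate the KL term against $\mathcal{N}(\mathbf{0}, I)$; the parameter values are arbitrary placeholders:

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
d = 3
L_tilde = rng.standard_normal((d, d))          # unconstrained parameters
mu_q = rng.standard_normal(d)

# Cholesky factor: softplus on the diagonal, free strictly-lower-triangular part
L = np.tril(L_tilde, k=-1) + np.diag(softplus(np.diag(L_tilde)))
Sigma_q = L @ L.T                              # PD by construction

log_det = 2 * np.log(np.diag(L)).sum()         # log det Sigma_q
trace = np.sum(L ** 2)                         # tr(Sigma_q) = ||L||_F^2
kl = 0.5 * (trace + mu_q @ mu_q - d - log_det)
print(kl)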

E.4 The Normal Equations and Ridge Regression

Least squares. Given XRn×pX \in \mathbb{R}^{n \times p} (data) and yRn\mathbf{y} \in \mathbb{R}^n (targets), ordinary least squares (OLS) minimizes:

β^=argminβXβy2.\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|X\boldsymbol{\beta} - \mathbf{y}\|^2.

The normal equations: XXβ^=XyX^\top X \hat{\boldsymbol{\beta}} = X^\top\mathbf{y}. The matrix XX0X^\top X \succeq 0 is always PSD (Gram matrix). If XX has full column rank (npn \geq p, rank pp), then XX0X^\top X \succ 0 and the unique solution is β^=(XX)1Xy\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y}.

Ridge regression. Add 2\ell_2 regularization: minimize Xβy2+λβ2\|X\boldsymbol{\beta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\beta}\|^2.

Normal equations: (XX+λI)β^=Xy(X^\top X + \lambda I)\hat{\boldsymbol{\beta}} = X^\top\mathbf{y}.

The matrix A=XX+λI0A = X^\top X + \lambda I \succ 0 for any λ>0\lambda > 0 (by Proposition 2.3.7). This is why ridge regression is always numerically stable: the PD matrix AA has all positive pivots, and Cholesky never fails.

Condition number improvement. κ(XX+λI)=(σ12+λ)/(σp2+λ)\kappa(X^\top X + \lambda I) = (\sigma_1^2 + \lambda)/(\sigma_p^2 + \lambda) where σ1σp0\sigma_1 \geq \cdots \geq \sigma_p \geq 0 are the singular values of XX. For small λ\lambda, κσ12/σp2=κ(X)2\kappa \approx \sigma_1^2/\sigma_p^2 = \kappa(X)^2. As λ\lambda \to \infty, κ1\kappa \to 1 (well-conditioned). Choosing λ\lambda balances bias (regularization) vs numerical stability.

Kernel ridge regression. For nonlinear regression using a PD kernel kk, form the kernel matrix K=k(X,X)K = k(\mathbf{X},\mathbf{X}) and solve (K+λI)α=y(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}. Predictions: f^(x)=kα\hat{f}(\mathbf{x}_*) = \mathbf{k}_*^\top\boldsymbol{\alpha} where k=k(x,X)\mathbf{k}_* = k(\mathbf{x}_*, \mathbf{X}). This is equivalent to GP regression with the kernel kk and noise variance λ\lambda. The PD condition on K+λIK + \lambda I is exactly the Cholesky solvability condition.
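A minimal sketch of the ridge solve via Cholesky; the data are synthetic and the regularization strength is illustrative:

import numpy as np
from scipy.linalg import cho_solve

rng = np.random.default_rng(0)
n, p, lam = 100, 20, 1e-2
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

A = X.T @ X + lam * np.eye(p)                  # PD for any lam > 0
L = np.linalg.cholesky(A)                      # never fails when lam > 0
beta = cho_solve((L, True), X.T @ y)           # ridge solution

print(np.linalg.norm(X @ beta - y))            # training residual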


Appendix F: Quick Reference

F.1 Four Equivalent Characterizations of A0A \succ 0

POSITIVE DEFINITENESS: FOUR EQUIVALENT TESTS
========================================================================

  Let A be an n x n real symmetric matrix. The following are equivalent:

  DEFINITION   x^T A x > 0 for all x != 0               (quadratic form positive)
  SPECTRAL     lambda_i(A) > 0 for all i                (all eigenvalues positive)
  SYLVESTER    Delta_k = det(A[1:k,1:k]) > 0 for all k  (leading minors positive)
  CHOLESKY     A = L L^T for a unique lower triangular L with positive diagonal

  COST:
  Definition  O(n^2) per test vector           <- manual/symbolic
  Spectral    O(n^3) eigendecomposition        <- most informative
  Sylvester   O(n^4) for all n determinants    <- manual/symbolic
  Cholesky    O(n^3/3) factorization           <- fastest in practice

========================================================================

F.2 Key Formulas

| Formula | Notes |
| --- | --- |
| $A = LL^\top$ | Cholesky factorization (exists iff $A \succ 0$; then unique) |
| $A = LDL^\top$ | LDL^T: unit lower triangular $L$, diagonal $D$ |
| $A = QP$, $P = (A^\top A)^{1/2}$ | Polar decomposition: orthogonal $\times$ PSD |
| $\log\det A = 2\sum_i \log L_{ii}$ | Log-det via Cholesky |
| $\nabla_A \log\det A = A^{-1}$ | Matrix calculus: gradient of log-det |
| $S = D - CA^{-1}B$ | Schur complement of $A$ in a block matrix |
| $M \succ 0 \Leftrightarrow A \succ 0,\ S \succ 0$ | Schur PD criterion |
| $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$ | Woodbury identity |
| $\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$ | Reparameterization trick |
| $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu},\Sigma)\parallel\mathcal{N}(0,I)) = \frac{1}{2}[\text{tr}\,\Sigma + \boldsymbol{\mu}^\top\boldsymbol{\mu} - d - \log\det\Sigma]$ | VAE KL term |

F.3 Notation Summary

Following the NOTATION_GUIDE.md conventions:

| Symbol | Meaning |
| --- | --- |
| $A \succ 0$ | $A$ is positive definite |
| $A \succeq 0$ | $A$ is positive semidefinite |
| $A \prec 0$ | $A$ is negative definite |
| $\mathbb{S}^n$ | Space of $n \times n$ symmetric matrices |
| $\mathbb{S}_{++}^n$ | Cone of $n \times n$ PD matrices (interior) |
| $\mathbb{S}_+^n$ | Cone of $n \times n$ PSD matrices |
| $L$ | Cholesky factor (lower triangular, positive diagonal) |
| $A^{1/2}$ | Symmetric PSD square root |
| $A^{-1/2}$ | Inverse PSD square root |
| $\Sigma$ | Covariance matrix ($\succeq 0$) |
| $\Lambda$ | Precision matrix ($= \Sigma^{-1}$, $\succ 0$) |
| $F$ | Fisher information matrix ($\succeq 0$) |
| $S$ | Schur complement |
| $G$ | Gram matrix ($= XX^\top$, $\succeq 0$) |
| $K$ | Kernel matrix ($\succeq 0$) |

F.4 Python Cheatsheet

import numpy as np
import scipy.linalg as la

# === Cholesky factorization ===
L = np.linalg.cholesky(A)          # A = L @ L.T
# Or via scipy (can also compute upper factor):
L = la.cholesky(A, lower=True)

# === LDL^T factorization ===
L_ldl, D_ldl, _ = la.ldl(A)       # A = L @ D @ L.T

# === Log-determinant (numerically stable) ===
L = np.linalg.cholesky(A)
log_det = 2 * np.sum(np.log(np.diag(L)))

# === Solve A x = b via Cholesky ===
L = np.linalg.cholesky(A)
x = la.cho_solve((L, True), b)     # uses LAPACK dpotrs

# === PSD square root ===
eigvals, Q = np.linalg.eigh(A)     # symmetric eigendecomp
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

# === Woodbury: (A + U C V)^{-1} b via the smaller k x k system ===
# (assumes A, U, C, V, b are given, with U n x k, V k x n, C k x k)
Ainv_b = np.linalg.solve(A, b)
Ainv_U = np.linalg.solve(A, U)
small = np.linalg.inv(C) + V @ Ainv_U          # k x k capacitance matrix
x_wood = Ainv_b - Ainv_U @ np.linalg.solve(small, V @ Ainv_b)

# === Reparameterization trick ===
L = np.linalg.cholesky(Sigma)      # Sigma = L @ L.T
eps = np.random.randn(n_samples, d)
z = mu + (L @ eps.T).T             # z ~ N(mu, Sigma)

# === Gram matrix ===
G = X @ X.T                        # G = X X^T, PSD

# === Check PD ===
try:
    np.linalg.cholesky(A)
    is_pd = True
except np.linalg.LinAlgError:
    is_pd = False

# === Check PSD ===
eigvals = np.linalg.eigvalsh(A)
is_psd = np.all(eigvals >= -1e-10)

Appendix G: Proofs and Extensions

G.1 Complete Proof of the Cholesky Existence Theorem

We present the full inductive proof more carefully, making each step explicit.

Theorem (Full Cholesky Existence). ARn×nA \in \mathbb{R}^{n \times n} symmetric. A0A \succ 0 if and only if A=LLA = LL^\top for a unique lower triangular LL with positive diagonal.

Proof by strong induction on nn.

Base case n=1n=1: A=(a)A = (a) with a>0a > 0 (since A0A \succ 0 iff quadratic form ax2>0ax^2 > 0 for x0x \neq 0 iff a>0a > 0). Take L=(a)L = (\sqrt{a}). Unique since >0\ell > 0 and 2=a\ell^2 = a gives =a\ell = \sqrt{a}.

Inductive step: Assume the result holds for all (n1)×(n1)(n-1) \times (n-1) PD matrices. Let ARn×nA \in \mathbb{R}^{n \times n} be PD. Write:

A=(a11aaA^)A = \begin{pmatrix}a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A}\end{pmatrix}

where a11>0a_{11} > 0 (Proposition 2.3, item 1), aRn1\mathbf{a} \in \mathbb{R}^{n-1}, and A^R(n1)×(n1)\hat{A} \in \mathbb{R}^{(n-1)\times(n-1)}.

Define 11=a11>0\ell_{11} = \sqrt{a_{11}} > 0 and =a/11Rn1\boldsymbol{\ell} = \mathbf{a}/\ell_{11} \in \mathbb{R}^{n-1}.

Compute the Schur complement: A~=A^=A^aa111a\tilde{A} = \hat{A} - \boldsymbol{\ell}\boldsymbol{\ell}^\top = \hat{A} - \mathbf{a}a_{11}^{-1}\mathbf{a}^\top.

Claim: A~0\tilde{A} \succ 0.

Proof of claim: For any yRn1\mathbf{y} \in \mathbb{R}^{n-1} with y0\mathbf{y} \neq \mathbf{0}, set x=(a111ayy)Rn\mathbf{x} = \begin{pmatrix}-a_{11}^{-1}\mathbf{a}^\top\mathbf{y} \\ \mathbf{y}\end{pmatrix} \in \mathbb{R}^n. Note x0\mathbf{x} \neq \mathbf{0} (the second block is y0\mathbf{y} \neq \mathbf{0}). Then:

xAx=a11(a111ay)22aya111ay+yA^y=y(A^aa111a)y=yA~y.\mathbf{x}^\top A \mathbf{x} = a_{11}(a_{11}^{-1}\mathbf{a}^\top\mathbf{y})^2 - 2\mathbf{a}^\top\mathbf{y} \cdot a_{11}^{-1}\mathbf{a}^\top\mathbf{y} + \mathbf{y}^\top\hat{A}\mathbf{y} = \mathbf{y}^\top(\hat{A} - \mathbf{a}a_{11}^{-1}\mathbf{a}^\top)\mathbf{y} = \mathbf{y}^\top\tilde{A}\mathbf{y}.

Since A0A \succ 0: xAx>0\mathbf{x}^\top A\mathbf{x} > 0, so yA~y>0\mathbf{y}^\top\tilde{A}\mathbf{y} > 0. Since y0\mathbf{y} \neq 0 was arbitrary, A~0\tilde{A} \succ 0. \square (claim)

By the inductive hypothesis, A~=L22L22\tilde{A} = L_{22}L_{22}^\top for a unique lower triangular L22R(n1)×(n1)L_{22} \in \mathbb{R}^{(n-1)\times(n-1)} with positive diagonal.

Set L=(110L22)L = \begin{pmatrix}\ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22}\end{pmatrix}. Then LL is lower triangular with positive diagonal (11,(L22)11,,(L22)n1,n1)(\ell_{11}, (L_{22})_{11}, \ldots, (L_{22})_{n-1,n-1}) and:

LL=(110L22)(110L22)=(1121111+L22L22)=(a11aaA^)=A.LL^\top = \begin{pmatrix}\ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22}\end{pmatrix}\begin{pmatrix}\ell_{11} & \boldsymbol{\ell}^\top \\ \mathbf{0} & L_{22}^\top\end{pmatrix} = \begin{pmatrix}\ell_{11}^2 & \ell_{11}\boldsymbol{\ell}^\top \\ \ell_{11}\boldsymbol{\ell} & \boldsymbol{\ell}\boldsymbol{\ell}^\top + L_{22}L_{22}^\top\end{pmatrix} = \begin{pmatrix}a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A}\end{pmatrix} = A.

Uniqueness. Suppose A=L1L1=L2L2A = L_1L_1^\top = L_2L_2^\top with L1,L2L_1, L_2 lower triangular and positive diagonal. Then M:=L21L1M := L_2^{-1}L_1 (product of lower triangular matrices, lower triangular) satisfies MM=IMM^\top = I. An orthogonal lower triangular matrix must be diagonal. Since L1,L2L_1, L_2 have positive diagonal, MM has positive diagonal. MM=IMM^\top = I for diagonal MM gives M=IM = I, so L1=L2L_1 = L_2. \square
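The inductive construction above translates directly into code; a recursive sketch (written for exposition, not efficiency) that peels off one row and column per step:

import numpy as np

def cholesky_recursive(A):
    # Mirrors the proof: factor the 1x1 corner, then recurse on the Schur complement.
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.array([[np.sqrt(A[0, 0])]])
    l11 = np.sqrt(A[0, 0])
    ell = A[1:, 0] / l11
    A_tilde = A[1:, 1:] - np.outer(ell, ell)   # Schur complement, PD by the claim above
    L22 = cholesky_recursive(A_tilde)
    L = np.zeros_like(A)
    L[0, 0] = l11
    L[1:, 0] = ell
    L[1:, 1:] = L22
    return L

A = np.array([[9.0, 3, -3], [3, 17, -1], [-3, -1, 12]])
L = cholesky_recursive(A)
print(np.allclose(L @ L.T, A), np.allclose(L, np.linalg.cholesky(A)))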

G.2 Proof That the Fisher Information Is PSD

Theorem. For a regular statistical model $p(\mathbf{x}|\boldsymbol{\theta})$, the Fisher information matrix $F(\boldsymbol{\theta}) = \mathbb{E}[\mathbf{s}(\mathbf{x};\boldsymbol{\theta})\mathbf{s}(\mathbf{x};\boldsymbol{\theta})^\top]$, where $\mathbf{s} = \nabla_{\boldsymbol{\theta}}\log p$ is the score function, is PSD.

Proof. For any vRd\mathbf{v} \in \mathbb{R}^d:

vFv=E[vssv]=E[(vs)2]0.\mathbf{v}^\top F\mathbf{v} = \mathbb{E}[\mathbf{v}^\top\mathbf{s}\mathbf{s}^\top\mathbf{v}] = \mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] \geq 0.

(The expectation of a squared real random variable is non-negative.) \square

When is F0F \succ 0? FF is PD iff E[(vs)2]>0\mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] > 0 for all v0\mathbf{v} \neq 0. This holds iff the score function vθlogp(xθ)0\mathbf{v}^\top\nabla_{\boldsymbol{\theta}}\log p(\mathbf{x}|\boldsymbol{\theta}) \neq 0 with positive probability for every v0\mathbf{v} \neq 0 - the model is "identifiable" in all directions. Singular FF indicates structural non-identifiability (two parameters produce identical distributions).
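A quick Monte Carlo check for the Gaussian model $\mathcal{N}(\mu, \sigma^2)$ with $\boldsymbol{\theta} = (\mu, \sigma)$: the empirical Fisher matrix built from outer products of scores has non-negative eigenvalues (the exact values are $1/\sigma^2$ and $2/\sigma^2$); the sample size is illustrative.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2
x = rng.normal(mu, sigma, size=50_000)

# Score of N(mu, sigma^2) with respect to (mu, sigma)
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu)**2 / sigma**3 - 1 / sigma
scores = np.stack([s_mu, s_sigma], axis=1)

F_hat = scores.T @ scores / len(x)             # empirical E[s s^T]
print(np.linalg.eigvalsh(F_hat))               # both eigenvalues >= 0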

G.3 Derivatives and the Matrix-Valued Chain Rule

For completeness, we derive the key matrix calculus formulas used in 7.3.

Jacobi's formula. For differentiable A(t)A(t):

ddtdetA(t)=detA(t)tr(A(t)1A˙(t)).\frac{d}{dt}\det A(t) = \det A(t) \cdot \text{tr}(A(t)^{-1}\dot{A}(t)).

Proof: Using the adjugate matrix (cofactor expansion), detA=jAij(adjA)ji\det A = \sum_j A_{ij}(\text{adj}A)_{ji}. Differentiating with respect to AijA_{ij} gives (adjA)ji=(detA)(A1)ji(\text{adj}A)_{ji} = (\det A)(A^{-1})_{ji} (Cramer's rule). By the chain rule: d(detA)/dt=ij(adjA)jiA˙ij=det(A)ij(A1)jiA˙ij=det(A)tr(A1A˙)d(\det A)/dt = \sum_{ij}(\text{adj}A)_{ji}\dot{A}_{ij} = \det(A)\sum_{ij}(A^{-1})_{ji}\dot{A}_{ij} = \det(A)\,\text{tr}(A^{-1}\dot{A}).

Log-det gradient. dlogdetA=d(detA)/detA=tr(A1dA)d\log\det A = d(\det A)/\det A = \text{tr}(A^{-1}dA). Since tr(A1dA)=A,dAF\text{tr}(A^{-1}dA) = \langle A^{-\top}, dA\rangle_F, the gradient of logdet\log\det at AA is A=A1A^{-\top} = A^{-1} (for symmetric AA).

Trace-inverse gradient. For f(A)=tr(A1B)f(A) = \text{tr}(A^{-1}B): df=tr(A1dAA1B)=tr(A1BA1dA)=(A1BA1),dAFdf = \text{tr}(-A^{-1}dA\,A^{-1}B) = -\text{tr}(A^{-1}BA^{-1}dA) = -\langle (A^{-1}BA^{-1})^\top, dA\rangle_F. So Atr(A1B)=(A1BA1)=ABA\nabla_A\text{tr}(A^{-1}B) = -(A^{-1}BA^{-1})^\top = -A^{-\top}B^\top A^{-\top}.

For AI - GP hyperparameter gradients: The gradient of the GP log-marginal-likelihood with respect to a kernel hyperparameter θ\theta is:

logp(yθ)θ=12tr ⁣[(αα(K+σ2I)1)Kθ]\frac{\partial\log p(\mathbf{y}|\theta)}{\partial\theta} = \frac{1}{2}\text{tr}\!\left[(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2I)^{-1})\frac{\partial K}{\partial\theta}\right]

where $\boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}$. This uses $\nabla_K\log\det K = K^{-1}$ and $\nabla_K\text{tr}(K^{-1}S) = -(K^{-1}SK^{-1})^\top$. Efficiently computed via Cholesky: with $K + \sigma^2 I = LL^\top$, obtain $\boldsymbol{\alpha}$ by two triangular solves, and compute $\text{tr}\big((K+\sigma^2 I)^{-1}\,\partial K/\partial\theta\big) = \text{tr}(W)$ where $W = L^{-1}(\partial K/\partial\theta)L^{-\top}$ (two triangular solves with a matrix right-hand side).
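A sketch of this gradient for the length-scale of an RBF kernel; the data, kernel, and helper names are illustrative, not from the text:

import numpy as np
from scipy.linalg import cho_solve

def rbf(x, ell):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

def lml_and_grad(x, y, ell, sigma2):
    n = len(x)
    K = rbf(x, ell) + sigma2 * np.eye(n)                         # K + sigma^2 I
    L = np.linalg.cholesky(K)
    alpha = cho_solve((L, True), y)

    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

    dK = rbf(x, ell) * (x[:, None] - x[None, :]) ** 2 / ell**3   # dK / d(ell)
    Kinv_dK = cho_solve((L, True), dK)                           # (K + sigma^2 I)^{-1} dK
    grad = 0.5 * (alpha @ dK @ alpha - np.trace(Kinv_dK))
    return lml, grad

x = np.array([0.0, 1.0, 2.0]); y = np.array([0.5, 1.2, 0.8])
print(lml_and_grad(x, y, ell=1.0, sigma2=0.1))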


Appendix H: Summary and Further Reading

H.1 Core Theorems Summary

| Theorem | Statement | Reference |
| --- | --- | --- |
| Spectral characterization | $A \succ 0 \Leftrightarrow$ all eigenvalues $> 0$ | 3.1 |
| Sylvester's criterion | $A \succ 0 \Leftrightarrow$ all leading principal minors $> 0$ | 3.2 |
| Cholesky existence | $A \succ 0 \Leftrightarrow \exists!$ lower triangular $L$ (pos. diag.) with $A = LL^\top$ | 4.1 |
| LDL^T | Symmetric $A \to A = LDL^\top$; $A \succ 0 \Leftrightarrow$ all $d_i > 0$ | 4.4 |
| PSD square root | $A \succeq 0 \Rightarrow \exists!$ symmetric PSD $A^{1/2}$ with $(A^{1/2})^2 = A$ | 5.1 |
| Schur PD criterion | $M = \begin{pmatrix}A & B \\ B^\top & D\end{pmatrix} \succ 0 \Leftrightarrow A \succ 0,\ D - B^\top A^{-1}B \succ 0$ | 6.2 |
| Woodbury identity | $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$ | 6.3 |
| Log-det concavity | $f(A) = \log\det A$ is strictly concave on $\mathbb{S}_{++}^n$ | 7.2 |
| Log-det gradient | $\nabla_A \log\det A = A^{-1}$ | 7.3 |
| Gram matrix PSD | $G = XX^\top \succeq 0$ always; PD iff rows of $X$ are linearly independent | 8.1 |
| Schur product | Hadamard product of PSD matrices is PSD | Appendix A |
| Hadamard inequality | $\det A \leq \prod_i A_{ii}$ for $A \succ 0$ | Appendix D |
| Log-det via Cholesky | $\log\det A = 2\sum_i \log L_{ii}$ | 7.1 |

H.2 Further Reading

  1. Textbooks:

    • Golub & Van Loan, Matrix Computations (4th ed., 2013) - Chapters 4, 7: definitive reference on Cholesky, LDL^T algorithms
    • Higham, Accuracy and Stability of Numerical Algorithms (2002) - backward stability proofs for Cholesky, modified Cholesky
    • Boyd & Vandenberghe, Convex Optimization (2004) - Chapter 4: SDP, PSD cone; freely available online
    • Bernstein, Matrix Mathematics (2nd ed., 2009) - comprehensive collection of PD matrix identities
  2. Research papers:

    • Cholesky (1924, posthumous publication via Benoit): original algorithm for geodetic calculations
    • Gill, Murray & Wright (1981): modified Cholesky for optimization
    • Vandenberghe & Boyd (1996): semidefinite programming tutorial
    • Martens & Grosse (2015): K-FAC - Kronecker-factored natural gradient
    • Gardner et al. (2018): GPyTorch - scalable GP with stochastic log-det estimation
    • Foret et al. (2021): SAM - sharpness-aware minimization
  3. Online resources:

    • Petersen & Pedersen, The Matrix Cookbook - practical matrix calculus identities
    • NumPy/SciPy docs: np.linalg.cholesky, scipy.linalg.cholesky, scipy.linalg.ldl
    • GPyTorch documentation: scalable GP inference with Cholesky
    • CVXPY documentation: SDP examples and semidefinite programming
  4. Next sections:

    • 08: Matrix Decompositions


End of 07: Positive Definite Matrices. Next: 08: Matrix Decompositions


Appendix I: Worked Examples - Complete Solutions

I.1 Full 3×3 Cholesky Example with Verification

Compute L=chol(A)L = \text{chol}(A) for A=(93331713112)A = \begin{pmatrix}9 & 3 & -3 \\ 3 & 17 & -1 \\ -3 & -1 & 12\end{pmatrix}.

Step 1: L11=9=3L_{11} = \sqrt{9} = 3.

Step 2: L21=A21/L11=3/3=1L_{21} = A_{21}/L_{11} = 3/3 = 1. L31=A31/L11=3/3=1L_{31} = A_{31}/L_{11} = -3/3 = -1.

Step 3: L22=A22L212=171=16=4L_{22} = \sqrt{A_{22} - L_{21}^2} = \sqrt{17 - 1} = \sqrt{16} = 4.

Step 4: L32=(A32L31L21)/L22=(1(1)(1))/4=0/4=0L_{32} = (A_{32} - L_{31}L_{21})/L_{22} = (-1 - (-1)(1))/4 = 0/4 = 0.

Step 5: L33=A33L312L322=1210=11L_{33} = \sqrt{A_{33} - L_{31}^2 - L_{32}^2} = \sqrt{12 - 1 - 0} = \sqrt{11}.

L=(3001401011).L = \begin{pmatrix}3 & 0 & 0 \\ 1 & 4 & 0 \\ -1 & 0 & \sqrt{11}\end{pmatrix}.

Verification:

LL=(3001401011)(3110400011)=(93331713112)=A.LL^\top = \begin{pmatrix}3&0&0\\1&4&0\\-1&0&\sqrt{11}\end{pmatrix}\begin{pmatrix}3&1&-1\\0&4&0\\0&0&\sqrt{11}\end{pmatrix} = \begin{pmatrix}9&3&-3\\3&17&-1\\-3&-1&12\end{pmatrix} = A. \checkmark

Also: logdetA=2(log3+log4+log11)=2(1.099+1.386+1.199)=2(3.684)=7.368\log\det A = 2(\log 3 + \log 4 + \log\sqrt{11}) = 2(1.099 + 1.386 + 1.199) = 2(3.684) = 7.368.
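Verifying the same numbers in NumPy:

import numpy as np

A = np.array([[9.0, 3, -3], [3, 17, -1], [-3, -1, 12]])
L = np.linalg.cholesky(A)
print(L)                                       # [[3, 0, 0], [1, 4, 0], [-1, 0, sqrt(11)]]
print(2 * np.log(np.diag(L)).sum())            # log det A, approximately 7.368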

I.2 Schur Complement and Conditional Gaussian

Let Σ=(4225)\Sigma = \begin{pmatrix}4 & 2 \\ 2 & 5\end{pmatrix} be the covariance of (X1,X2)N(0,Σ)(X_1, X_2)^\top \sim \mathcal{N}(\mathbf{0}, \Sigma).

Schur complement of Σ11\Sigma_{11}:

S=Σ22Σ21Σ111Σ12=52142=51=4.S = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 5 - 2 \cdot \frac{1}{4} \cdot 2 = 5 - 1 = 4.

Conditional distribution X1X2=aX_1 | X_2 = a:

μ12=0+Σ12Σ221(a0)=25a.\mu_{1|2} = 0 + \Sigma_{12}\Sigma_{22}^{-1}(a - 0) = \frac{2}{5}a. σ122=Σ11Σ12Σ221Σ21=42152=445=165.\sigma_{1|2}^2 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = 4 - 2 \cdot \frac{1}{5} \cdot 2 = 4 - \frac{4}{5} = \frac{16}{5}.

Note: the unconditional variance of X1X_1 is σ12=4\sigma_1^2 = 4. The conditional variance 16/5=3.2<416/5 = 3.2 < 4 - observing X2X_2 reduces uncertainty about X1X_1 (as guaranteed by the Loewner order, 2.4). The correlation ρ=2/45=2/20=1/50.447\rho = 2/\sqrt{4 \cdot 5} = 2/\sqrt{20} = 1/\sqrt{5} \approx 0.447 explains the moderate uncertainty reduction.
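A short NumPy check of these numbers:

import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 5.0]])

schur_of_11 = Sigma[1, 1] - Sigma[1, 0] * Sigma[0, 1] / Sigma[0, 0]      # 4.0
cond_var = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]         # 16/5 = 3.2
cond_slope = Sigma[0, 1] / Sigma[1, 1]                                   # 2/5: mean of X1 | X2 = a is 0.4 a
print(schur_of_11, cond_var, cond_slope)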