Positive Definite Matrices - Part 1: Sections 1 (Intuition) to 5 (Matrix Square Root and Whitening)
1. Intuition
1.1 What Positive Definiteness Captures
The most useful single-sentence definition of a positive definite matrix is: a matrix that behaves like a positive number. To understand what this means, compare scalars and matrices in parallel.
For a scalar $a > 0$:
- The equation $ax = b$ has a unique solution $x = b/a$
- The function $f(x) = \frac{1}{2}ax^2 - bx$ has a unique minimum at $x^* = b/a$
- The scalar has a real square root: $\sqrt{a} > 0$
- The product $x \cdot a \cdot x = ax^2 \geq 0$, with equality only at $x = 0$
For a positive definite matrix $A$:
- The system $Ax = b$ has a unique solution $x = A^{-1}b$
- The function $f(x) = \frac{1}{2}x^T A x - b^T x$ has a unique minimum at $x^* = A^{-1}b$
- The matrix has a unique positive definite square root $A^{1/2}$
- The quadratic form $x^T A x > 0$ for all $x \neq 0$
The quadratic form $Q(x) = x^T A x$ is the central object. Think of it as a machine that takes a vector $x$ and returns a scalar measuring its "energy" with respect to $A$. When $A$ is the identity, $x^T I x = \|x\|^2$, the familiar squared Euclidean length. When $A$ is diagonal with positive entries $d_1, \dots, d_n$, then $x^T A x = \sum_i d_i x_i^2$ - a weighted sum of squared coordinates. Positive definiteness requires this energy to be positive for every nonzero direction.
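A quick numerical illustration of the "energy" view (a minimal sketch; the matrices and vector are arbitrary examples):

```python
import numpy as np

def energy(A, x):
    """Quadratic form Q(x) = x^T A x."""
    return float(x @ A @ x)

x = np.array([3.0, -1.0])

# Identity: the energy is the squared Euclidean length.
print(energy(np.eye(2), x))        # 10.0 = 3^2 + (-1)^2

# Diagonal with positive entries: a weighted sum of squared coordinates.
D = np.diag([2.0, 5.0])
print(energy(D, x))                # 23.0 = 2*9 + 5*1
```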
The formal definition requires $A$ to be symmetric. The condition $x^T A x > 0$ for all $x \neq 0$ is called strict positive definiteness. The relaxed condition $x^T A x \geq 0$ is positive semidefiniteness - it allows zero energy in some directions (those in the null space of $A$).
For AI: Every twice-differentiable loss function $L(\theta)$ with a local minimum at $\theta^*$ satisfies $\nabla^2 L(\theta^*) \succeq 0$ there. The Hessian being positive definite at a critical point is exactly the second-order sufficient condition for a local minimum. This is the mathematical content of "the loss landscape is a bowl."
1.2 Geometric Picture: Ellipsoids and Level Sets
The level sets $\{x : x^T A x = c\}$ of the quadratic form reveal the geometry of $A$ directly.
Case 1: $A = I$ (identity). The level set is $\|x\|^2 = c$ - a sphere of radius $\sqrt{c}$.
Case 2: $A = \mathrm{diag}(\lambda_1, \lambda_2)$ diagonal, with $\lambda_1 > \lambda_2 > 0$. The level set $\lambda_1 x_1^2 + \lambda_2 x_2^2 = 1$ is an ellipse with semi-axes $1/\sqrt{\lambda_1}$ and $1/\sqrt{\lambda_2}$. The larger eigenvalue compresses that direction; the smaller eigenvalue stretches it.
Case 3: $A$ general symmetric PD. The level set $x^T A x = 1$ is an ellipsoid whose axes are aligned with the eigenvectors of $A$ and whose semi-axis lengths are $1/\sqrt{\lambda_i}$. This follows from the spectral decomposition $A = Q\Lambda Q^T$: substituting $y = Q^T x$ gives $\sum_i \lambda_i y_i^2 = 1$, a coordinate-aligned ellipsoid.
Case 4: $A$ indefinite (some positive, some negative eigenvalues). The level set is a hyperboloid - unbounded in the negative-eigenvalue directions. There is no minimum energy.
Case 5: $A$ positive semidefinite (some zero eigenvalues). The level set is an "infinite cylinder" - flat in the null space directions. Energy is zero for all vectors in the null space.
QUADRATIC FORM GEOMETRY
========================================================================
2D quadratic form Q(x_1, x_2) = x^T A x, level set Q = 1

  PD (both \lambda > 0):   ellipse
  PSD (one \lambda = 0):   pair of parallel lines
  Indefinite:              hyperbola

  Semi-axes \propto 1 / \sqrt{\lambda_i}
  Axes aligned with eigenvectors of A
========================================================================
This geometric picture has a direct machine learning reading. The covariance matrix $\Sigma$ of a multivariate Gaussian defines a metric on feature space: $(x - \mu)^T \Sigma^{-1} (x - \mu)$ is the squared Mahalanobis distance, measuring how many standard deviations $x$ is from $\mu$ in each principal direction. The level sets of this distance are the ellipsoidal contours of the Gaussian density.
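For example, a minimal sketch of the squared Mahalanobis distance (the covariance and points here are arbitrary assumptions for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # an example covariance (PD)
mu = np.array([1.0, -1.0])
x  = np.array([2.0,  0.5])

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
d = x - mu
d2 = d @ np.linalg.solve(Sigma, d)    # solve rather than forming the inverse
print(d2)
```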
1.3 Why PD Matrices Matter for AI
Positive definiteness is not a technical curiosity - it is a pervasive structural condition in machine learning systems:
Covariance matrices. Any valid probability distribution must have non-negative variance in every direction. For a multivariate random variable $X \in \mathbb{R}^n$, the covariance matrix $\Sigma = \mathbb{E}[(X - \mu)(X - \mu)^T]$ is always PSD. For a multivariate Gaussian with a proper density, we need $\Sigma \succ 0$ (so $\Sigma^{-1}$ exists and the density is normalized). Every time a VAE encoder outputs a covariance or diagonal variance, it must produce a PSD (or PD) parameterization.
Fisher information matrix. The Fisher information $F(\theta) = \mathbb{E}\big[\nabla_\theta \log p(x \mid \theta)\,\nabla_\theta \log p(x \mid \theta)^T\big]$ is PSD by construction (it is a covariance matrix of score functions). The natural gradient $F^{-1}\nabla_\theta L$ uses the Fisher as a Riemannian metric on parameter space. K-FAC (Kronecker-Factored Approximate Curvature) approximates $F$ with a Kronecker-product PSD matrix for scalable second-order optimization.
Gram matrices and kernels. The inner product matrix $G_{ij} = \langle v_i, v_j \rangle$ for any set of vectors $v_1, \dots, v_m$ is always PSD. Kernel methods rely on this: a function $k(x, x')$ is a valid kernel if and only if its Gram matrix is PSD for any set of inputs (Mercer's theorem). The scaled attention score matrix $QK^T/\sqrt{d_k}$ in transformers is a (non-symmetric) Gram matrix.
Cholesky in numerical computation. Cholesky decomposition ($A = LL^T$) is the fastest algorithm for solving $Ax = b$ when $A$ is known to be PD - twice as fast as LU. It is the standard backend for Gaussian process inference, multivariate normal sampling, and Bayesian linear regression. NumPy, SciPy, and PyTorch all dispatch to Cholesky-based routines when PD structure is declared (e.g., scipy.linalg.solve with assume_a='pos').
Hessian and second-order methods. At a local minimum $\theta^*$, the Hessian $\nabla^2 L(\theta^*) \succeq 0$. Sharpness-Aware Minimization (SAM) seeks parameters where the sharpness (related to the largest Hessian eigenvalue $\lambda_{\max}$) is small, empirically improving generalization. The Gauss-Newton approximation $J^T J$ is always PSD and is the basis for K-FAC and natural gradient methods.
1.4 Historical Timeline
| Year | Contributor | Development |
|---|---|---|
| 1801 | Gauss | Quadratic forms in celestial mechanics; method of least squares |
| 1847 | Sylvester | Named "positive definite" forms; leading minor criterion |
| 1852 | Sylvester | Sylvester's law of inertia (signature invariance under congruence) |
| 1878 | Frobenius | Systematic theory of bilinear and quadratic forms |
| 1909 | Mercer | Mercer's theorem linking PSD kernels to eigenfunction expansions |
| 1910 | Cholesky | Cholesky decomposition for geodetic calculations (unpublished until 1924) |
| 1934 | Loewner | Matrix ordering (Loewner order) defined rigorously |
| 1955 | Bellman | Positive definite matrices in dynamic programming and control |
| 1996 | Vandenberghe & Boyd | Semidefinite programming (SDP) as efficient optimization |
| 1998 | Scholkopf & Smola | Kernel methods: SVMs via PSD kernel matrices |
| 2013 | Kingma & Welling | VAE reparameterization trick using Cholesky sampling |
| 2015 | Martens & Grosse | K-FAC: Kronecker-factored PSD approximation to Fisher |
| 2024-2026 | Multiple groups | PSD structure in diffusion model covariances, normalizing flows, structured covariance estimation |
2. Formal Definitions
2.1 Quadratic Forms
Definition 2.1 (Quadratic Form). Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix and $x \in \mathbb{R}^n$. The associated quadratic form is the function $Q_A : \mathbb{R}^n \to \mathbb{R}$ defined by

$$Q_A(x) = x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, x_i x_j.$$

Note: every quadratic form can be written with a symmetric matrix. If $M$ is not symmetric, replacing it with $A = \frac{1}{2}(M + M^T)$ gives the same quadratic form, since $x^T M x = x^T M^T x$ for all $x$. We always assume $A$ is symmetric.
Expanding in 2D. For $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ and $x = (x_1, x_2)^T$:

$$x^T A x = a x_1^2 + 2b\, x_1 x_2 + c x_2^2.$$

The diagonal entries $a, c$ give the pure squared terms; the off-diagonal $b$ gives the cross term $2b\,x_1 x_2$.
Completing the square. For the 1D quadratic $f(x) = \frac{1}{2}ax^2 - bx$ with $a > 0$, the minimum is $-b^2/(2a)$, achieved at $x^* = b/a$. The matrix analogue is the Schur complement (6) - the quadratic form $f(x) = \frac{1}{2}x^T A x - b^T x$ is minimized at $x^* = A^{-1}b$ (when $A \succ 0$) with minimum value $-\frac{1}{2}b^T A^{-1} b$.
Connection to bilinear forms. The quadratic form is the diagonal of the bilinear form $B(x, y) = x^T A y$, i.e., $Q_A(x) = B(x, x)$. The bilinear form encodes the inner product structure induced by $A$.
2.2 The Four Cases: PD, PSD, ND, Indefinite
Definition 2.2. Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A$ is:
| Name | Symbol | Condition | Colloquial |
|---|---|---|---|
| Positive definite | $A \succ 0$ | $x^T A x > 0$ for all $x \neq 0$ | "bowl" |
| Positive semidefinite | $A \succeq 0$ | $x^T A x \geq 0$ for all $x$ | "flat bowl" |
| Negative definite | $A \prec 0$ | $x^T A x < 0$ for all $x \neq 0$ | "dome" |
| Negative semidefinite | $A \preceq 0$ | $x^T A x \leq 0$ for all $x$ | "flat dome" |
| Indefinite | - | $x^T A x$ takes both positive and negative values | "saddle" |
Note: $A \succ 0$ implies $A \succeq 0$. The interesting cases for applications are PD and PSD.
Standard examples:
$A = I$: $x^T I x = \|x\|^2 > 0$ for $x \neq 0$. $I \succ 0$. OK
$A = \mathrm{diag}(d_1, \dots, d_n)$ with all $d_i > 0$: $\sum_i d_i x_i^2 > 0$ for $x \neq 0$. $A \succ 0$. OK
$A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$: $x^T A x = (x_1 + x_2)^2 \geq 0$, and $= 0$ at $x = (1, -1)^T$. $A \succeq 0$ but not $A \succ 0$. (PSD, not PD.)
$A = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$: $Q(e_1) = 1 > 0$, $Q(e_2) = -1 < 0$. Indefinite.
Non-examples (common mistakes):
: looks "positive" but , so indefinite.
: , so PSD but NOT PD.
: , indefinite.
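These classifications can be checked numerically with the spectral test of 3.1 (a minimal sketch; the matrices mirror the illustrative examples above):

```python
import numpy as np

def classify(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "PD"
    if np.all(lam >= -tol):
        return "PSD (not PD)"
    if np.all(lam < -tol):
        return "ND"
    return "indefinite"

print(classify(np.eye(2)))                          # PD
print(classify(np.array([[1., 1.], [1., 1.]])))     # PSD (not PD)
print(classify(np.array([[1., 2.], [2., 1.]])))     # indefinite
print(classify(np.array([[0., 1.], [1., 0.]])))     # indefinite
```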
2.3 Immediate Consequences
If $A \succ 0$, then many properties follow immediately from the definition:
Proposition 2.3. Let $A \succ 0$ (symmetric). Then:
1. Positive diagonal entries. $a_{ii} > 0$ for all $i$. Proof: Take $x = e_i$: $e_i^T A e_i = a_{ii} > 0$.
2. Positive trace. $\mathrm{tr}(A) = \sum_i a_{ii} > 0$.
3. Positive determinant. $\det A > 0$. (Follows from all eigenvalues positive, 3.1.)
4. Invertibility. $A$ is invertible, and $A^{-1} \succ 0$. Proof: If $Ax = 0$ then $x^T A x = 0$, so $x = 0$. For $A^{-1} \succ 0$: $A^{-1}$ is symmetric, and $y^T A^{-1} y = (A^{-1}y)^T A (A^{-1}y) > 0$ for $y \neq 0$.
5. Closure under positive scaling. $cA \succ 0$ for $c > 0$.
6. Closure under addition. If $A, B \succ 0$ then $A + B \succ 0$. Proof: $x^T(A + B)x = x^T A x + x^T B x > 0$ for $x \neq 0$.
7. Tikhonov regularization is PD. If $A \succeq 0$ and $\lambda > 0$, then $A + \lambda I \succ 0$. Proof: $x^T(A + \lambda I)x = x^T A x + \lambda \|x\|^2 \geq \lambda \|x\|^2 > 0$ for $x \neq 0$.
8. Congruence preserves PD. If $A \succ 0$ and $C$ is invertible, then $C^T A C \succ 0$. Proof: $x^T C^T A C x = (Cx)^T A (Cx) > 0$ since $Cx \neq 0$ when $x \neq 0$.
For AI: Property 4 ensures that covariance matrices can be inverted for precision computations. Property 7 is exactly why adding $\lambda I$ (weight decay, Tikhonov regularization, jitter in Gaussian processes) to a PSD matrix makes it PD and invertible.
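A small numerical demonstration of Property 7 (a sketch; the rank-deficient matrix is an arbitrary example):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # rank 1, so X^T X is PSD but singular
A = X.T @ X

print(np.linalg.eigvalsh(A))           # one eigenvalue is (numerically) zero
lam = 1e-3
print(np.linalg.eigvalsh(A + lam * np.eye(2)))   # every eigenvalue shifted up by lam: now PD
```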
2.4 The Loewner Order
Definition 2.4 (Loewner Order). For symmetric matrices $A, B \in \mathbb{R}^{n \times n}$, define the Loewner ordering:

$$A \succeq B \iff A - B \succeq 0.$$

Similarly, $A \succ B \iff A - B \succ 0$. This ordering is:
- Reflexive: $A \succeq A$.
- Antisymmetric: $A \succeq B$ and $B \succeq A$ implies $A = B$.
- Transitive: $A \succeq B$ and $B \succeq C$ implies $A \succeq C$.
However, it is NOT a total order - not every pair of symmetric matrices is comparable. For example, $A = \mathrm{diag}(2, 1)$ and $B = \mathrm{diag}(1, 2)$ satisfy neither $A \succeq B$ nor $B \succeq A$ (their difference is indefinite).
Properties of the Loewner order:
If $A \succeq B$ then:
- $\mathrm{tr}(A) \geq \mathrm{tr}(B)$ (trace is monotone)
- $\lambda_k(A) \geq \lambda_k(B)$ for all $k$ when eigenvalues are sorted in decreasing order (Weyl's theorem)
- $C^T A C \succeq C^T B C$ for any $C$ (congruence is monotone)
- If additionally $B \succ 0$: $B^{-1} \succeq A^{-1}$ (inversion reverses the order!)
The last property is surprising: if $A \succeq B \succ 0$, then $A^{-1} \preceq B^{-1}$. Intuition: a larger matrix "moves vectors more," so its inverse "moves them less."
For AI: In Bayesian linear-Gaussian inference, the posterior covariance is $\Sigma_{\text{post}} = (\Sigma_{\text{prior}}^{-1} + X^T X/\sigma^2)^{-1}$. Since we added the PSD matrix $X^T X/\sigma^2$ to the prior precision, the posterior precision is larger, so by order-reversal, $\Sigma_{\text{post}} \preceq \Sigma_{\text{prior}}$. Observing data never increases uncertainty - the Loewner order makes this rigorous.
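This can be checked numerically: verify that $\Sigma_{\text{prior}} - \Sigma_{\text{post}}$ is PSD (a sketch with an arbitrary design matrix and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))           # design matrix (arbitrary example)
sigma2 = 0.5                          # observation noise variance

Sigma_prior = np.eye(d)
Sigma_post  = np.linalg.inv(np.linalg.inv(Sigma_prior) + X.T @ X / sigma2)

# Loewner order: Sigma_post <= Sigma_prior  <=>  Sigma_prior - Sigma_post is PSD.
diff = Sigma_prior - Sigma_post
print(np.linalg.eigvalsh(diff))       # all non-negative (up to rounding)
```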
3. Eigenvalue and Determinantal Characterizations
3.1 Spectral Characterization
The most computationally transparent characterization of positive definiteness uses eigenvalues.
Theorem 3.1 (Spectral Characterization). Let $A$ be symmetric with eigenvalues $\lambda_1, \dots, \lambda_n$. Then:

$$A \succ 0 \iff \lambda_i > 0 \text{ for all } i, \qquad A \succeq 0 \iff \lambda_i \geq 0 \text{ for all } i.$$

Proof. By the spectral theorem (-> 01: Eigenvalues), $A = Q\Lambda Q^T$ where $Q$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. For any $x \neq 0$, let $y = Q^T x \neq 0$ (since $Q$ is invertible):

$$x^T A x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2.$$

If all $\lambda_i > 0$: the sum is positive for $y \neq 0$. OK
If some $\lambda_j \leq 0$: take $y = e_j$ (so $x = Qe_j = q_j$, the $j$-th eigenvector). Then $x^T A x = \lambda_j \leq 0$. NO
The characterization for $A \succeq 0$ follows by the same argument with $\geq$ replacing $>$.
Immediate consequences:
- $\det A = \prod_i \lambda_i > 0$ for PD $A$
- $\mathrm{tr}(A) = \sum_i \lambda_i > 0$ for PD $A$
- $\|A\|_2 = \lambda_{\max}$ and $\|A^{-1}\|_2 = 1/\lambda_{\min}$ for PD $A$ (-> 06: Norms)
- The condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ measures how "nearly singular" $A$ is
For AI: In Gaussian process regression, the covariance kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ must be PSD. Verifying this by computing eigenvalues is expensive; checking a Cholesky decomposition (4) is faster and also provides the factorization needed for inference.
Note on the spectral theorem: This proof uses the spectral theorem for symmetric matrices - that every symmetric real matrix is orthogonally diagonalizable. For the full proof and discussion of spectral theory, see 01: Eigenvalues and Eigenvectors.
3.2 Sylvester's Criterion
Sylvester's criterion provides a purely determinantal test that avoids eigenvalue computation entirely, making it practical for hand calculations and symbolic proofs.
Definition 3.2 (Leading Principal Minors). The $k$-th leading principal minor of $A$ is the determinant of its top-left $k \times k$ submatrix:

$$\Delta_k = \det \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix}.$$

So $\Delta_1 = a_{11}$, $\Delta_2 = a_{11}a_{22} - a_{12}^2$ (for symmetric $A$), and $\Delta_n = \det A$.
Theorem 3.3 (Sylvester's Criterion). A symmetric matrix $A$ is positive definite if and only if all leading principal minors are positive:

$$A \succ 0 \iff \Delta_k > 0 \text{ for } k = 1, \dots, n.$$

Proof sketch. The key insight is that the $k \times k$ leading principal submatrix $A_k$ is itself symmetric, and $A \succ 0$ implies $A_k \succ 0$ for every $k$ (restrict the quadratic form to vectors supported on the first $k$ standard basis vectors). The converse uses induction: if $\Delta_1, \dots, \Delta_n > 0$, then by the inductive hypothesis all leading principal submatrices up to size $n - 1$ are PD, and the Cholesky factorization can be completed (4 gives the constructive proof). The determinant of a PD matrix equals the product of its Cholesky diagonal entries squared, all positive, so each $\Delta_k = \det A_k > 0$.
Working example. Test $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$:
- $\Delta_1 = 4 > 0$ OK
- $\Delta_2 = 4 \cdot 5 - 2^2 = 16 > 0$ OK
- $\Delta_3 = \det A = 64 > 0$ OK
So $A \succ 0$.
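The leading principal minors can also be computed directly (a minimal sketch using the same illustrative matrix):

```python
import numpy as np

A = np.array([[4., 2., 2.],
              [2., 5., 3.],
              [2., 3., 6.]])

# Leading principal minors: Delta_k = det of the top-left k x k block.
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print(minors)     # [4.0, 16.0, 64.0] -> all positive, so A is PD by Sylvester's criterion
```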
Caution for PSD. Sylvester's criterion does NOT characterize $A \succeq 0$. A matrix can have all leading principal minors $\geq 0$ but still not be PSD (a counterexample is $A = \begin{pmatrix} 0 & 0 \\ 0 & -1 \end{pmatrix}$: $\Delta_1 = 0$, $\Delta_2 = 0$, but $e_2^T A e_2 = -1 < 0$). For PSD testing, use all principal minors (not just leading ones), or use eigenvalues.
For AI: Sylvester's criterion is used in symbolic proofs and in optimization algorithms that maintain PD structure (e.g., proving that a block covariance matrix built by a GP model is PD, by verifying the diagonal blocks and Schur complements are positive).
3.3 Pivot Characterization
The connection between Gaussian elimination and positive definiteness is deep and computationally significant.
Theorem 3.4 (Pivot Characterization). A symmetric matrix is positive definite if and only if Gaussian elimination without row exchanges produces all positive pivots.
The pivots of $A$ are related to the leading minors by $d_1 = \Delta_1$, $d_2 = \Delta_2/\Delta_1$, $d_3 = \Delta_3/\Delta_2$, and in general $d_k = \Delta_k/\Delta_{k-1}$. So all pivots $> 0$ $\iff$ all $\Delta_k > 0$ (Sylvester). The LDL^T factorization makes this explicit: $A = LDL^T$ where $L$ is unit lower triangular and all diagonal entries of $D$ (the pivots) are positive (4.4).
The pivot characterization is how Cholesky "discovers" positive definiteness: the Cholesky algorithm fails (tries to take the square root of a non-positive number) exactly when a pivot is non-positive. This makes Cholesky the standard computational test for positive definiteness: try to factor; failure reveals non-PD structure.
PIVOT CHARACTERIZATION SUMMARY
========================================================================
A \in \mathbb{R}^{n x n} symmetric. The following are equivalent:
 (i)   A \succ 0                 (quadratic form positive for all x \neq 0)
 (ii)  All eigenvalues \lambda_i > 0                    [Spectral]
 (iii) All leading principal minors \Delta_k > 0        [Sylvester]
 (iv)  All Gaussian elimination pivots d_k > 0          [Pivot]
 (v)   Cholesky factorization A = LL^T exists with L    [Cholesky]
       lower triangular, L_ii > 0
Each characterization provides a different computational test:
(ii) O(n^3): eigenvalue decomposition <- most expensive
(iii) O(n^4): compute n determinants <- symbolic/manual only
(iv) O(n^3): Gaussian elimination
(v) O(n^3/3): Cholesky <- fastest in practice
========================================================================
3.4 Certifying PSD vs PD
In numerical practice, the PD/PSD boundary is a source of subtle bugs. Here we describe how to handle the boundary cases.
Near-PSD matrices. A matrix that is theoretically PSD may have small negative eigenvalues due to floating-point errors. For example, if $K$ is constructed as $K = XX^T$ but $X$ is nearly rank-deficient, $K$ might have eigenvalues like $-10^{-15}$ due to rounding. The standard fix is to add a small jitter: $K + \varepsilon I$ for $\varepsilon = 10^{-8}$ or $10^{-6}$.
Numerical rank. For a PSD matrix, the rank equals the number of positive eigenvalues. In finite precision, eigenvalues below roughly $\varepsilon_{\text{mach}} \cdot \|A\|_2$ (machine epsilon times the spectral norm) are treated as zero. numpy.linalg.matrix_rank uses a threshold of this form internally.
Testing PSD in practice:
- Try np.linalg.cholesky(A) - it succeeds iff $A \succ 0$ (strictly PD)
- For a PSD test: check np.all(np.linalg.eigvalsh(A) >= -tol) with tol = 1e-10
- For rank-$r$ PSD: compute np.linalg.matrix_rank(A) and verify it equals $r$
The zero-eigenvalue case. If $A \succeq 0$ and $\mathrm{rank}(A) = r < n$, the Cholesky factorization of the full matrix does not exist. However, the factorization $A = LL^T$ where $L$ is $n \times r$ (a "thin" Cholesky) does exist and can be computed via pivoted Cholesky. This is used in sparse GP approximations (Nystrom approximation, inducing points).
For AI: PyTorch's torch.linalg.cholesky raises torch.linalg.LinAlgError if the matrix is not PD. In practice, diagonal jitter (A + 1e-6 * torch.eye(n)) is the standard workaround in GP implementations and VAE covariance parameterizations.
4. Cholesky Decomposition
Cholesky decomposition is the most important computational tool in this section. Unlike the spectral characterization (eigenvalue decomposition, $O(n^3)$ with a large constant) or LU (general, with pivoting), Cholesky is:
- Twice as fast as LU for symmetric positive definite systems ($n^3/3$ vs $2n^3/3$ operations)
- Numerically backward-stable without pivoting (which LU requires for stability)
- Reveals the square root: $L$ satisfies $LL^T = A$, i.e., $L$ is a "matrix square root" of $A$
- The canonical test for positive definiteness: the factorization exists iff $A \succ 0$
4.1 The Theorem: Existence and Uniqueness
Theorem 4.1 (Cholesky Factorization). Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A \succ 0$ if and only if there exists a unique lower triangular matrix $L$ with strictly positive diagonal entries such that

$$A = LL^T.$$

$L$ is called the Cholesky factor of $A$. The factorization $A = LL^T$ is the Cholesky decomposition.
Proof of existence (constructive). We prove by induction on $n$.
Base case ($n = 1$): $A = (a_{11})$ with $a_{11} > 0$ (since $A \succ 0$). Take $L = (\sqrt{a_{11}})$.
Inductive step: Write $A$ in block form:

$$A = \begin{pmatrix} a_{11} & b^T \\ b & \tilde{A} \end{pmatrix},$$

where $a_{11} > 0$ (Proposition 2.3), $b \in \mathbb{R}^{n-1}$, and $\tilde{A} \in \mathbb{R}^{(n-1)\times(n-1)}$ is symmetric.
Set $\ell_{11} = \sqrt{a_{11}}$ and $\ell = b/\sqrt{a_{11}}$. Then $\tilde{A} - \ell\ell^T$ is the Schur complement $S = \tilde{A} - bb^T/a_{11}$.
Since $A \succ 0$, its Schur complement $S \succ 0$ (Theorem 6.2). By the inductive hypothesis, $S = L_S L_S^T$ for a unique lower triangular $L_S$ with positive diagonal.
Then:

$$A = \begin{pmatrix} \ell_{11} & 0 \\ \ell & L_S \end{pmatrix}\begin{pmatrix} \ell_{11} & \ell^T \\ 0 & L_S^T \end{pmatrix} = LL^T.$$

Proof of uniqueness. Suppose $A = L_1 L_1^T = L_2 L_2^T$ with $L_1, L_2$ lower triangular and positive diagonal. Then $L_2^{-1}L_1 = L_2^T L_1^{-T}$. The left side is lower triangular; the right side is upper triangular. Both sides equal the same matrix $D$, which must be diagonal with positive entries (since $L_1, L_2$ have positive diagonals). Comparing: $L_1 = L_2 D$, so $A = L_2 D^2 L_2^T$. But $A = L_2 L_2^T$, so $D^2 = I$, so $D = I$ (diagonal positive), so $L_1 = L_2$.
Converse. If $A = LL^T$ with $L$ lower triangular and positive diagonal, then for $x \neq 0$: $x^T A x = \|L^T x\|^2 > 0$ (since $L^T x \neq 0$, because $L$ is invertible by positive diagonal). So $A \succ 0$.
4.2 The Algorithm: Constructing L
The Cholesky algorithm computes $L$ column by column. From $A = LL^T$, expanding element by element:
For the diagonal entry $a_{jj}$:

$$a_{jj} = \sum_{k=1}^{j} \ell_{jk}^2 \quad\Longrightarrow\quad \ell_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} \ell_{jk}^2}.$$

For the off-diagonal entry $a_{ij}$ with $i > j$:

$$a_{ij} = \sum_{k=1}^{j} \ell_{ik}\ell_{jk} \quad\Longrightarrow\quad \ell_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} \ell_{ik}\ell_{jk}}{\ell_{jj}}.$$
Algorithm 4.2 (Cholesky, column-wise):
Input:  Symmetric A \in \mathbb{R}^{n x n}
Output: Lower triangular L with A = LL^T, or FAILURE

for j = 1 to n:
    s = A[j,j] - sum(L[j,k]^2 for k = 1..j-1)
    if s \leq 0:
        FAILURE (A is not positive definite)
    L[j,j] = sqrt(s)
    for i = j+1 to n:
        L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k] for k = 1..j-1)) / L[j,j]
    for i = 1 to j-1:
        L[i,j] = 0    (upper triangle is zero)
Worked example. Compute the Cholesky factor of $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$:
Column 1: $\ell_{11} = \sqrt{4} = 2$, $\ell_{21} = 2/2 = 1$, $\ell_{31} = 2/2 = 1$.
Column 2: $\ell_{22} = \sqrt{5 - 1^2} = 2$, $\ell_{32} = (3 - 1 \cdot 1)/2 = 1$.
Column 3: $\ell_{33} = \sqrt{6 - 1^2 - 1^2} = 2$.
Verify: $L = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 0 \\ 1 & 1 & 2 \end{pmatrix}$ and $LL^T = A$. OK
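A direct translation of Algorithm 4.2 into NumPy, checked against numpy.linalg.cholesky on the example above (a minimal sketch):

```python
import numpy as np

def cholesky(A):
    """Column-wise Cholesky factorization; raises ValueError if A is not PD."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j, j] - np.dot(L[j, :j], L[j, :j])
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
L = cholesky(A)
print(L)                                         # lower triangular, diagonal (2, 2, 2)
print(np.allclose(L, np.linalg.cholesky(A)))     # True
print(np.allclose(L @ L.T, A))                   # True
```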
4.3 Complexity and Numerical Properties
Flop count. The Cholesky algorithm requires approximately $n^3/3$ floating-point operations - precisely half the $2n^3/3$ required by LU decomposition. For large $n$, this factor-of-2 speedup is significant.
Memory. Only the lower triangle of $A$ needs to be stored: $n(n+1)/2$ entries vs $n^2$ for general LU. In-place variants overwrite the lower triangle of $A$ with $L$.
Backward stability. Cholesky is numerically backward-stable without pivoting: the computed $\hat{L}$ satisfies $\hat{L}\hat{L}^T = A + E$ where $\|E\| \leq c_n\,\varepsilon_{\text{mach}}\,\|A\|$ for a modest constant $c_n$. This is stronger than LU without pivoting (which can be unstable). The reason: the identity $\sum_k \ell_{jk}^2 = a_{jj}$ bounds every entry of $L$ by the diagonal of $A$, so the factorization cannot amplify errors.
Solving $Ax = b$: Factor $A = LL^T$, then solve $Ly = b$ by forward substitution and $L^T x = y$ by backward substitution. Total cost: $n^3/3 + O(n^2)$ (factorization + two triangular solves).
Failure = non-PD. If at any step the quantity under the square root is non-positive, is not positive definite. This makes Cholesky the standard computational test for PD structure.
For AI: JAX's jax.scipy.linalg.cholesky, PyTorch's torch.linalg.cholesky, and SciPy's scipy.linalg.cholesky are all efficient LAPACK-backed implementations. In Gaussian process models, the Cholesky solve L \ b (forward substitution) appears in every prediction and log-marginal-likelihood computation.
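For solving $Ax = b$ with a PD matrix, SciPy exposes the factor-then-solve pattern directly (a minimal sketch, reusing the example matrix from 4.2):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
b = np.array([1., 2., 3.])

c, low = cho_factor(A)           # factor once (~n^3/3 flops)
x = cho_solve((c, low), b)       # two triangular solves (~2n^2 flops)
print(np.allclose(A @ x, b))     # True
```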
4.4 LDL^T Factorization
The LDL^T decomposition is a variant of Cholesky that avoids square roots - useful when the matrix is PSD but not strictly PD, or when square root computation is expensive.
Theorem 4.3 (LDL^T Factorization). Every symmetric $A$ that admits Gaussian elimination without row swaps can be uniquely factored as

$$A = LDL^T,$$

where $L$ is unit lower triangular ($\ell_{ii} = 1$) and $D = \mathrm{diag}(d_1, \dots, d_n)$.
Moreover, $A \succ 0$ if and only if all diagonal entries $d_i > 0$.
Relation to Cholesky. If $A \succ 0$, the LDL^T and Cholesky factorizations are related by:

$$A = LDL^T = (LD^{1/2})(LD^{1/2})^T.$$

So the Cholesky factor is $LD^{1/2}$ where $D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \dots, \sqrt{d_n})$. The LDL^T decomposition computes the $d_i$ directly without taking square roots.
Algorithm 4.4 (LDL^T, column-wise):
for j = 1 to n:
    v[1..j-1] = D[1..j-1] * L[j,1..j-1]    (element-wise)
    D[j] = A[j,j] - dot(L[j,1..j-1], v[1..j-1])
    for i = j+1 to n:
        L[i,j] = (A[i,j] - dot(L[i,1..j-1], v[1..j-1])) / D[j]
    L[j,j] = 1
Worked example. Factor the same matrix $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$ from 4.2:
$d_1 = 4$, $\ell_{21} = \ell_{31} = 1/2$; $d_2 = 5 - 4(1/2)^2 = 4$, $\ell_{32} = (3 - (1/2)(2))/4 = 1/2$; $d_3 = 6 - 4(1/2)^2 - 4(1/2)^2 = 4$.
Verify: $L = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 1/2 & 1/2 & 1 \end{pmatrix}$, $D = \mathrm{diag}(4, 4, 4)$, and $LDL^T = A$; the Cholesky factor $LD^{1/2}$ matches the $L$ found in 4.2. OK
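SciPy provides an LDL^T routine (a minimal sketch; for this PD matrix the permutation is trivial):

```python
import numpy as np
from scipy.linalg import ldl

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
lu, d, perm = ldl(A, lower=True)       # lu @ d @ lu.T reconstructs A; lu[perm] is unit lower triangular
print(np.diag(d))                      # [4. 4. 4.] -> all pivots positive, so A is PD
print(np.allclose(lu @ d @ lu.T, A))   # True
```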
When to prefer LDL^T over Cholesky:
- When $A$ is indefinite but you still want a factorization (use pivoted LDL^T with $1 \times 1$ and $2 \times 2$ pivots)
- When square root computation is numerically undesirable
- In interval arithmetic or symbolic computation
Symmetric indefinite systems. The Bunch-Kaufman (BK) algorithm extends LDL^T to indefinite matrices using $1 \times 1$ and $2 \times 2$ pivots. The LAPACK routine dsytrf implements this.
4.5 Modified Cholesky for Near-PD Matrices
In optimization (especially trust-region methods and quasi-Newton methods), we frequently need to factor a matrix that is "nearly" PD - the Hessian approximation may have small negative eigenvalues due to finite differences or insufficient curvature.
The problem. Standard Cholesky fails (negative pivot) when $A$ is not strictly PD. Rather than failing, we want a PD matrix close to $A$ that can be Cholesky-factored.
Gill-Murray-Wright (GMW) modification. At each pivot step $j$, if the computed pivot $s < \delta$ for a small threshold $\delta > 0$, set $s \leftarrow \delta$ (adding a correction to the diagonal). After factorization, $LL^T = A + E$ where $E$ is a small diagonal correction.
Simple diagonal jitter. The most widely used approach in machine learning is to add $\varepsilon I$ before attempting Cholesky:
```python
import numpy as np

def robust_cholesky(A, jitter=1e-6):
    # Try the exact factorization first; fall back to adding diagonal jitter.
    try:
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        return np.linalg.cholesky(A + jitter * np.eye(len(A)))
```
This is the standard "GP jitter" pattern. The added corresponds to assuming a small observation noise or numerical regularization.
For AI: PyTorch GPyTorch (Gaussian Process library) wraps all kernel matrix Cholesky calls with diagonal jitter and a retry mechanism. The JAX implementation uses jax.scipy.linalg.cholesky and adds jitter via A += diag_jitter * jnp.eye(n). Understanding why jitter works - it adds to each eigenvalue, making all eigenvalues at least - is exactly positive definiteness via the spectral characterization.
5. Matrix Square Root and Whitening
5.1 The PSD Square Root
For a scalar $a \geq 0$, there is a unique non-negative square root $\sqrt{a}$. For a PSD matrix, the same uniqueness holds - but only in the class of PSD matrices.
Theorem 5.1 (PSD Square Root). Let $A$ be symmetric and positive semidefinite. Then there exists a unique symmetric positive semidefinite matrix $A^{1/2}$ such that:

$$(A^{1/2})^2 = A.$$

If $A \succ 0$, then $A^{1/2} \succ 0$.
Proof. By the spectral theorem, $A = Q\Lambda Q^T$ where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with all $\lambda_i \geq 0$. Define:

$$A^{1/2} = Q\Lambda^{1/2}Q^T, \qquad \Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n}).$$

Then $(A^{1/2})^2 = Q\Lambda^{1/2}Q^T Q\Lambda^{1/2}Q^T = Q\Lambda Q^T = A$. This $A^{1/2}$ is symmetric PSD.
Uniqueness: Suppose $B^2 = A$ with $B$ symmetric PSD. Then $B$ and $A$ commute (since $BA = B \cdot B^2 = B^2 \cdot B = AB$), so they share eigenvectors. On each shared eigenspace of $A$ with eigenvalue $\lambda$, $B$ must act as $\sqrt{\lambda}$ (the unique non-negative root). So $B = A^{1/2}$.
Warning: Non-uniqueness without PSD constraint. The equation $B^2 = A$ for $A \succ 0$ has $2^n$ symmetric solutions when the eigenvalues of $A$ are distinct (one can independently choose $\pm\sqrt{\lambda_i}$ for each eigenvalue). The PSD square root is the unique solution with all non-negative eigenvalues.
5.2 Computing Square Roots
Via eigendecomposition: $A^{1/2} = Q\,\mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})\,Q^T$.
This requires the full eigendecomposition, costing $O(n^3)$ with a large constant. For symmetric $A$:
eigvals, Q = np.linalg.eigh(A) # symmetric eigendecomposition
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
Via Cholesky (the "left Cholesky factor"). Note: the Cholesky factor satisfies but in general (unless is diagonal). The Cholesky factor is a square root of in the sense , but it is not symmetric and not equal to the eigendecomposition-based .
Via Newton's method (for matrix functions): The iteration starting from converges to for . This is the matrix analogue of Heron's method for scalar square roots. Convergence is quadratic once close enough.
Via Pade approximants: For close to , write and use a matrix series , truncated at some order.
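A sketch of the Newton iteration described above (illustrative only; in production code one would typically use the eigendecomposition or scipy.linalg.sqrtm):

```python
import numpy as np

def sqrtm_newton(A, iters=20):
    """Newton iteration X <- (X + X^{-1} A) / 2, started at X = I."""
    X = np.eye(A.shape[0])
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.solve(X, A))
    return X

A = np.array([[2., 1.], [1., 2.]])   # a small PD example
S = sqrtm_newton(A)
print(np.allclose(S @ S, A))         # True (up to rounding)
```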
5.3 Square Root vs Cholesky
The two "square root" operations serve different purposes:
| Property | Cholesky factor $L$ | PSD square root $A^{1/2}$ |
|---|---|---|
| Satisfies | $LL^T = A$ | $(A^{1/2})^2 = A$ |
| Shape | Lower triangular | Symmetric |
| Uniqueness | Unique (positive diagonal) | Unique (PSD) |
| Computation | $n^3/3$ flops | $O(n^3)$ (full eigendecomposition) |
| Invertibility | $L^{-1}$ is lower triangular | $A^{-1/2} = Q\Lambda^{-1/2}Q^T$ is symmetric |
| Use case | Solving systems, sampling | Mahalanobis, whitening |
When to use Cholesky: solving $Ax = b$, computing $\log\det A = 2\sum_i \log \ell_{ii}$, sampling from $\mathcal{N}(\mu, \Sigma)$.
When to use $A^{1/2}$: defining the Mahalanobis inner product $\langle x, y \rangle_A = x^T A y = (A^{1/2}x)^T(A^{1/2}y)$; whitening transforms; matrix differential equations.
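For example, sampling from $\mathcal{N}(\mu, \Sigma)$ uses the Cholesky factor (a minimal sketch with an assumed covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
mu    = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
z = rng.standard_normal((10000, 2))        # standard normal samples
samples = mu + z @ L.T                     # cov(L z) = L I L^T = Sigma

print(np.cov(samples, rowvar=False))       # approximately Sigma
```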
5.4 Whitening Transforms
Definition 5.2 (Whitening). Given data $x$ with covariance $\Sigma \succ 0$, a whitening transform maps $x \mapsto Wx$ with $W\Sigma W^T = I$. The transformed variable has covariance $I$ (identity) - it is "white."
ZCA whitening. The ZCA (Zero-phase Component Analysis) whitening uses the symmetric square root: $W = \Sigma^{-1/2}$. The transformed data is as close to the original as possible while being white. This minimizes the change in "appearance" of the data.
Cholesky whitening. Using $W = L^{-1}$ (where $\Sigma = LL^T$) also produces white data but rotates it differently. This form is computationally cheaper.
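A sketch comparing the two whitening transforms on synthetic data (the mixing matrix below is an arbitrary assumption; both transforms produce identity covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 0.0],
              [1.2, 0.5]])                       # arbitrary mixing -> correlated data
X = rng.normal(size=(5000, 2)) @ M
Sigma = np.cov(X, rowvar=False)

# ZCA whitening: W = Sigma^{-1/2} via the symmetric square root.
lam, Q = np.linalg.eigh(Sigma)
W_zca = Q @ np.diag(lam ** -0.5) @ Q.T

# Cholesky whitening: W = L^{-1} where Sigma = L L^T.
L = np.linalg.cholesky(Sigma)
W_chol = np.linalg.inv(L)

for W in (W_zca, W_chol):
    Z = X @ W.T
    print(np.cov(Z, rowvar=False))               # approximately the identity in both cases
```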
Mahalanobis distance. For $\Sigma \succ 0$, the quadratic form $x^T\Sigma^{-1}x$ is the squared Mahalanobis distance from the origin. Geometrically, it measures the distance in "standard deviation units," accounting for the correlation structure of $\Sigma$. In whitened coordinates $z = \Sigma^{-1/2}x$, it is simply $\|z\|^2$.
For AI: Batch normalization can be viewed as approximate whitening per mini-batch. More precisely, it normalizes each feature to zero mean and unit variance (diagonal whitening), avoiding the expensive full whitening. The full Mahalanobis distance appears in metric learning (e.g., Siamese networks learning a matrix $M \succeq 0$ such that $(x - y)^T M (x - y)$ is a meaningful distance) and in attention mechanisms where different layers may learn non-isotropic attention kernels.