Positive Definite Matrices - Part 1: Sections 1 (Intuition) to 5 (Matrix Square Root and Whitening)
1. Intuition
1.1 What Positive Definiteness Captures
The most useful single-sentence definition of a positive definite matrix is: a matrix that behaves like a positive number. To understand what this means, compare scalars and matrices in parallel.
For a scalar $a > 0$:
- The equation $ax = b$ has a unique solution $x = b/a$
- The function $f(x) = \frac{1}{2}ax^2 - bx$ has a unique minimum at $x^* = b/a$
- The scalar has a real square root: $\sqrt{a} > 0$
- The product $x \cdot a \cdot x = ax^2 \geq 0$, with equality only at $x = 0$
For a positive definite matrix $A$:
- The system $Ax = b$ has a unique solution $x = A^{-1}b$
- The function $f(x) = \frac{1}{2}x^T A x - b^T x$ has a unique minimum at $x^* = A^{-1}b$
- The matrix has a unique positive definite square root $A^{1/2}$
- The quadratic form $x^T A x > 0$ for all $x \neq 0$
The quadratic form $Q(x) = x^T A x$ is the central object. Think of it as a machine that takes a vector $x$ and returns a scalar measuring its "energy" with respect to $A$. When $A$ is the identity, $x^T I x = \|x\|^2$, the familiar squared Euclidean length. When $A$ is diagonal with positive entries $d_1, \dots, d_n$, then $x^T A x = \sum_i d_i x_i^2$ - a weighted sum of squared coordinates. Positive definiteness requires this energy to be positive for every nonzero direction.
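A quick numerical illustration of the "energy" view (a minimal sketch; the matrices and vector are arbitrary examples):

```python
import numpy as np

def energy(A, x):
    """Quadratic form Q(x) = x^T A x."""
    return float(x @ A @ x)

x = np.array([3.0, -1.0])

# Identity: the energy is the squared Euclidean length.
print(energy(np.eye(2), x))        # 10.0 = 3^2 + (-1)^2

# Diagonal with positive entries: a weighted sum of squared coordinates.
D = np.diag([2.0, 5.0])
print(energy(D, x))                # 23.0 = 2*9 + 5*1
```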
The formal definition requires $A$ to be symmetric. The condition $x^T A x > 0$ for all $x \neq 0$ is called strict positive definiteness. The relaxed condition $x^T A x \geq 0$ is positive semidefiniteness - it allows zero energy in some directions (those in the null space of $A$).
For AI: Every twice-differentiable loss function $L(\theta)$ with a local minimum at $\theta^*$ satisfies $\nabla^2 L(\theta^*) \succeq 0$ there. The Hessian being positive definite at a critical point is exactly the second-order sufficient condition for a local minimum. This is the mathematical content of "the loss landscape is a bowl."
1.2 Geometric Picture: Ellipsoids and Level Sets
The level sets $\{x : x^T A x = c\}$ of the quadratic form reveal the geometry of $A$ directly.
Case 1: $A = I$ (identity). The level set is $\|x\|^2 = c$ - a sphere of radius $\sqrt{c}$.
Case 2: $A = \mathrm{diag}(\lambda_1, \lambda_2)$ diagonal, with $\lambda_1 > \lambda_2 > 0$. The level set $\lambda_1 x_1^2 + \lambda_2 x_2^2 = 1$ is an ellipse with semi-axes $1/\sqrt{\lambda_1}$ and $1/\sqrt{\lambda_2}$. The larger eigenvalue compresses that direction; the smaller eigenvalue stretches it.
Case 3: $A$ general symmetric PD. The level set $x^T A x = 1$ is an ellipsoid whose axes are aligned with the eigenvectors of $A$ and whose semi-axis lengths are $1/\sqrt{\lambda_i}$. This follows from the spectral decomposition $A = Q\Lambda Q^T$: substituting $y = Q^T x$ gives $\sum_i \lambda_i y_i^2 = 1$, a coordinate-aligned ellipsoid.
Case 4: $A$ indefinite (some positive, some negative eigenvalues). The level set is a hyperboloid - unbounded in the negative-eigenvalue directions. There is no minimum energy.
Case 5: $A$ positive semidefinite (some zero eigenvalues). The level set is an "infinite cylinder" - flat in the null space directions. Energy is zero for all vectors in the null space.
QUADRATIC FORM GEOMETRY
========================================================================
2D quadratic form Q(x_1, x_2) = x^T A x, level set Q = 1

  PD (both \lambda > 0):   ellipse
  PSD (one \lambda = 0):   pair of parallel lines
  Indefinite:              hyperbola

  Semi-axes \propto 1 / \sqrt{\lambda_i}
  Axes aligned with eigenvectors of A
========================================================================
This geometric picture has a direct machine learning reading. The covariance matrix $\Sigma$ of a multivariate Gaussian defines a metric on feature space: $(x - \mu)^T \Sigma^{-1} (x - \mu)$ is the squared Mahalanobis distance, measuring how many standard deviations $x$ is from $\mu$ in each principal direction. The level sets of this distance are the ellipsoidal contours of the Gaussian density.
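For example, a minimal sketch of the squared Mahalanobis distance (the covariance and points here are arbitrary assumptions for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # an example covariance (PD)
mu = np.array([1.0, -1.0])
x  = np.array([2.0,  0.5])

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
d = x - mu
d2 = d @ np.linalg.solve(Sigma, d)    # solve rather than forming the inverse
print(d2)
```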
1.3 Why PD Matrices Matter for AI
Positive definiteness is not a technical curiosity - it is a pervasive structural condition in machine learning systems:
Covariance matrices. Any valid probability distribution must have non-negative variance in every direction. For a multivariate random variable $X \in \mathbb{R}^n$, the covariance matrix $\Sigma = \mathbb{E}[(X - \mu)(X - \mu)^T]$ is always PSD. For a multivariate Gaussian with a proper density, we need $\Sigma \succ 0$ (so $\Sigma^{-1}$ exists and the density is normalized). Every time a VAE encoder outputs a covariance or diagonal variance, it must produce a PSD (or PD) parameterization.
Fisher information matrix. The Fisher information $F(\theta) = \mathbb{E}\big[\nabla_\theta \log p(x \mid \theta)\,\nabla_\theta \log p(x \mid \theta)^T\big]$ is PSD by construction (it is a covariance matrix of score functions). The natural gradient $F^{-1}\nabla_\theta L$ uses the Fisher as a Riemannian metric on parameter space. K-FAC (Kronecker-Factored Approximate Curvature) approximates $F$ with a Kronecker-product PSD matrix for scalable second-order optimization.
Gram matrices and kernels. The inner product matrix $G_{ij} = \langle v_i, v_j \rangle$ for any set of vectors $v_1, \dots, v_m$ is always PSD. Kernel methods rely on this: a function $k(x, x')$ is a valid kernel if and only if its Gram matrix is PSD for any set of inputs (Mercer's theorem). The scaled attention score matrix $QK^T/\sqrt{d_k}$ in transformers is a (non-symmetric) Gram matrix.
Cholesky in numerical computation. Cholesky decomposition ($A = LL^T$) is the fastest algorithm for solving $Ax = b$ when $A$ is known to be PD - twice as fast as LU. It is the standard backend for Gaussian process inference, multivariate normal sampling, and Bayesian linear regression. NumPy, SciPy, and PyTorch all dispatch to Cholesky-based routines when PD structure is declared (e.g., scipy.linalg.solve with assume_a='pos').
Hessian and second-order methods. At a local minimum $\theta^*$, the Hessian $\nabla^2 L(\theta^*) \succeq 0$. Sharpness-Aware Minimization (SAM) seeks parameters where the sharpness (related to the largest Hessian eigenvalue $\lambda_{\max}$) is small, empirically improving generalization. The Gauss-Newton approximation $J^T J$ is always PSD and is the basis for K-FAC and natural gradient methods.
1.4 Historical Timeline
| Year | Contributor | Development |
|---|---|---|
| 1801 | Gauss | Quadratic forms in celestial mechanics; method of least squares |
| 1847 | Sylvester | Named "positive definite" forms; leading minor criterion |
| 1852 | Sylvester | Sylvester's law of inertia (signature invariance under congruence) |
| 1878 | Frobenius | Systematic theory of bilinear and quadratic forms |
| 1909 | Mercer | Mercer's theorem linking PSD kernels to eigenfunction expansions |
| 1910 | Cholesky | Cholesky decomposition for geodetic calculations (unpublished until 1924) |
| 1934 | Loewner | Matrix ordering (Loewner order) defined rigorously |
| 1955 | Bellman | Positive definite matrices in dynamic programming and control |
| 1996 | Vandenberghe & Boyd | Semidefinite programming (SDP) as efficient optimization |
| 1998 | Scholkopf & Smola | Kernel methods: SVMs via PSD kernel matrices |
| 2013 | Kingma & Welling | VAE reparameterization trick using Cholesky sampling |
| 2015 | Martens & Grosse | K-FAC: Kronecker-factored PSD approximation to Fisher |
| 2024-2026 | Multiple groups | PSD structure in diffusion model covariances, normalizing flows, structured covariance estimation |
2. Formal Definitions
2.1 Quadratic Forms
Definition 2.1 (Quadratic Form). Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix and $x \in \mathbb{R}^n$. The associated quadratic form is the function $Q_A : \mathbb{R}^n \to \mathbb{R}$ defined by

$$Q_A(x) = x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, x_i x_j.$$

Note: every quadratic form can be written with a symmetric matrix. If $M$ is not symmetric, replacing it with $A = \frac{1}{2}(M + M^T)$ gives the same quadratic form, since $x^T M x = x^T M^T x$ for all $x$. We always assume $A$ is symmetric.
Expanding in 2D. For $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ and $x = (x_1, x_2)^T$:

$$x^T A x = a x_1^2 + 2b\, x_1 x_2 + c x_2^2.$$

The diagonal entries $a, c$ give the pure squared terms; the off-diagonal $b$ gives the cross term $2b\,x_1 x_2$.
Completing the square. For the 1D quadratic $f(x) = \frac{1}{2}ax^2 - bx$ with $a > 0$, the minimum is $-b^2/(2a)$, achieved at $x^* = b/a$. The matrix analogue is the Schur complement (6) - the quadratic form $f(x) = \frac{1}{2}x^T A x - b^T x$ is minimized at $x^* = A^{-1}b$ (when $A \succ 0$) with minimum value $-\frac{1}{2}b^T A^{-1} b$.
Connection to bilinear forms. The quadratic form is the diagonal of the bilinear form $B(x, y) = x^T A y$, i.e., $Q_A(x) = B(x, x)$. The bilinear form encodes the inner product structure induced by $A$.
2.2 The Four Cases: PD, PSD, ND, Indefinite
Definition 2.2. Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A$ is:
| Name | Symbol | Condition | Colloquial |
|---|---|---|---|
| Positive definite | $A \succ 0$ | $x^T A x > 0$ for all $x \neq 0$ | "bowl" |
| Positive semidefinite | $A \succeq 0$ | $x^T A x \geq 0$ for all $x$ | "flat bowl" |
| Negative definite | $A \prec 0$ | $x^T A x < 0$ for all $x \neq 0$ | "dome" |
| Negative semidefinite | $A \preceq 0$ | $x^T A x \leq 0$ for all $x$ | "flat dome" |
| Indefinite | - | $x^T A x$ takes both positive and negative values | "saddle" |
Note: $A \succ 0$ implies $A \succeq 0$. The interesting cases for applications are PD and PSD.
Standard examples:
$A = I$: $x^T I x = \|x\|^2 > 0$ for $x \neq 0$. $I \succ 0$. OK
$A = \mathrm{diag}(d_1, \dots, d_n)$ with all $d_i > 0$: $\sum_i d_i x_i^2 > 0$ for $x \neq 0$. $A \succ 0$. OK
$A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$: $x^T A x = (x_1 + x_2)^2 \geq 0$, and $= 0$ at $x = (1, -1)^T$. $A \succeq 0$ but not $A \succ 0$. (PSD, not PD.)
$A = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$: $Q(e_1) = 1 > 0$, $Q(e_2) = -1 < 0$. Indefinite.
Non-examples (common mistakes):
: looks "positive" but , so indefinite.
: , so PSD but NOT PD.
: , indefinite.
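These classifications can be checked numerically with the spectral test of 3.1 (a minimal sketch; the matrices mirror the illustrative examples above):

```python
import numpy as np

def classify(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "PD"
    if np.all(lam >= -tol):
        return "PSD (not PD)"
    if np.all(lam < -tol):
        return "ND"
    return "indefinite"

print(classify(np.eye(2)))                          # PD
print(classify(np.array([[1., 1.], [1., 1.]])))     # PSD (not PD)
print(classify(np.array([[1., 2.], [2., 1.]])))     # indefinite
print(classify(np.array([[0., 1.], [1., 0.]])))     # indefinite
```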
2.3 Immediate Consequences
If $A \succ 0$, then many properties follow immediately from the definition:
Proposition 2.3. Let $A \succ 0$ (symmetric). Then:
1. Positive diagonal entries. $a_{ii} > 0$ for all $i$. Proof: Take $x = e_i$: $e_i^T A e_i = a_{ii} > 0$.
2. Positive trace. $\mathrm{tr}(A) = \sum_i a_{ii} > 0$.
3. Positive determinant. $\det A > 0$. (Follows from all eigenvalues positive, 3.1.)
4. Invertibility. $A$ is invertible, and $A^{-1} \succ 0$. Proof: If $Ax = 0$ then $x^T A x = 0$, so $x = 0$. For $A^{-1} \succ 0$: $A^{-1}$ is symmetric, and $y^T A^{-1} y = (A^{-1}y)^T A (A^{-1}y) > 0$ for $y \neq 0$.
5. Closure under positive scaling. $cA \succ 0$ for $c > 0$.
6. Closure under addition. If $A, B \succ 0$ then $A + B \succ 0$. Proof: $x^T(A + B)x = x^T A x + x^T B x > 0$ for $x \neq 0$.
7. Tikhonov regularization is PD. If $A \succeq 0$ and $\lambda > 0$, then $A + \lambda I \succ 0$. Proof: $x^T(A + \lambda I)x = x^T A x + \lambda \|x\|^2 \geq \lambda \|x\|^2 > 0$ for $x \neq 0$.
8. Congruence preserves PD. If $A \succ 0$ and $C$ is invertible, then $C^T A C \succ 0$. Proof: $x^T C^T A C x = (Cx)^T A (Cx) > 0$ since $Cx \neq 0$ when $x \neq 0$.
For AI: Property 4 ensures that covariance matrices can be inverted for precision computations. Property 7 is exactly why adding $\lambda I$ (weight decay, Tikhonov regularization, jitter in Gaussian processes) to a PSD matrix makes it PD and invertible.
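A small numerical demonstration of Property 7 (a sketch; the rank-deficient matrix is an arbitrary example):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # rank 1, so X^T X is PSD but singular
A = X.T @ X

print(np.linalg.eigvalsh(A))           # one eigenvalue is (numerically) zero
lam = 1e-3
print(np.linalg.eigvalsh(A + lam * np.eye(2)))   # every eigenvalue shifted up by lam: now PD
```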
2.4 The Loewner Order
Definition 2.4 (Loewner Order). For symmetric matrices $A, B \in \mathbb{R}^{n \times n}$, define the Loewner ordering:

$$A \succeq B \iff A - B \succeq 0.$$

Similarly, $A \succ B \iff A - B \succ 0$. This ordering is:
- Reflexive: $A \succeq A$.
- Antisymmetric: $A \succeq B$ and $B \succeq A$ implies $A = B$.
- Transitive: $A \succeq B$ and $B \succeq C$ implies $A \succeq C$.
However, it is NOT a total order - not every pair of symmetric matrices is comparable. For example, $A = \mathrm{diag}(2, 1)$ and $B = \mathrm{diag}(1, 2)$ satisfy neither $A \succeq B$ nor $B \succeq A$ (their difference is indefinite).
Properties of the Loewner order:
If $A \succeq B$ then:
- $\mathrm{tr}(A) \geq \mathrm{tr}(B)$ (trace is monotone)
- $\lambda_k(A) \geq \lambda_k(B)$ for all $k$ when eigenvalues are sorted in decreasing order (Weyl's theorem)
- $C^T A C \succeq C^T B C$ for any $C$ (congruence is monotone)
- If additionally $B \succ 0$: $B^{-1} \succeq A^{-1}$ (inversion reverses the order!)
The last property is surprising: if $A \succeq B \succ 0$, then $A^{-1} \preceq B^{-1}$. Intuition: a larger matrix "moves vectors more," so its inverse "moves them less."
For AI: In Bayesian linear-Gaussian inference, the posterior covariance is $\Sigma_{\text{post}} = (\Sigma_{\text{prior}}^{-1} + X^T X/\sigma^2)^{-1}$. Since we added the PSD matrix $X^T X/\sigma^2$ to the prior precision, the posterior precision is larger, so by order-reversal, $\Sigma_{\text{post}} \preceq \Sigma_{\text{prior}}$. Observing data never increases uncertainty - the Loewner order makes this rigorous.
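This can be checked numerically: verify that $\Sigma_{\text{prior}} - \Sigma_{\text{post}}$ is PSD (a sketch with an arbitrary design matrix and noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))           # design matrix (arbitrary example)
sigma2 = 0.5                          # observation noise variance

Sigma_prior = np.eye(d)
Sigma_post  = np.linalg.inv(np.linalg.inv(Sigma_prior) + X.T @ X / sigma2)

# Loewner order: Sigma_post <= Sigma_prior  <=>  Sigma_prior - Sigma_post is PSD.
diff = Sigma_prior - Sigma_post
print(np.linalg.eigvalsh(diff))       # all non-negative (up to rounding)
```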
3. Eigenvalue and Determinantal Characterizations
3.1 Spectral Characterization
The most computationally transparent characterization of positive definiteness uses eigenvalues.
Theorem 3.1 (Spectral Characterization). Let $A$ be symmetric with eigenvalues $\lambda_1, \dots, \lambda_n$. Then:

$$A \succ 0 \iff \lambda_i > 0 \text{ for all } i, \qquad A \succeq 0 \iff \lambda_i \geq 0 \text{ for all } i.$$

Proof. By the spectral theorem (-> 01: Eigenvalues), $A = Q\Lambda Q^T$ where $Q$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. For any $x \neq 0$, let $y = Q^T x \neq 0$ (since $Q$ is invertible):

$$x^T A x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2.$$

If all $\lambda_i > 0$: the sum is positive for $y \neq 0$. OK
If some $\lambda_j \leq 0$: take $y = e_j$ (so $x = Qe_j = q_j$, the $j$-th eigenvector). Then $x^T A x = \lambda_j \leq 0$. NO
The characterization for $A \succeq 0$ follows by the same argument with $\geq$ replacing $>$.
Immediate consequences:
- $\det A = \prod_i \lambda_i > 0$ for PD $A$
- $\mathrm{tr}(A) = \sum_i \lambda_i > 0$ for PD $A$
- $\|A\|_2 = \lambda_{\max}$ and $\|A^{-1}\|_2 = 1/\lambda_{\min}$ for PD $A$ (-> 06: Norms)
- The condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ measures how "nearly singular" $A$ is
For AI: In Gaussian process regression, the covariance kernel matrix $K$ with $K_{ij} = k(x_i, x_j)$ must be PSD. Verifying this by computing eigenvalues is expensive; checking a Cholesky decomposition (4) is faster and also provides the factorization needed for inference.
Note on the spectral theorem: This proof uses the spectral theorem for symmetric matrices - that every symmetric real matrix is orthogonally diagonalizable. For the full proof and discussion of spectral theory, see 01: Eigenvalues and Eigenvectors.
3.2 Sylvester's Criterion
Sylvester's criterion provides a purely determinantal test that avoids eigenvalue computation entirely, making it practical for hand calculations and symbolic proofs.
Definition 3.2 (Leading Principal Minors). The $k$-th leading principal minor of $A$ is the determinant of its top-left $k \times k$ submatrix:

$$\Delta_k = \det \begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & \ddots & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix}.$$

So $\Delta_1 = a_{11}$, $\Delta_2 = a_{11}a_{22} - a_{12}^2$ (for symmetric $A$), and $\Delta_n = \det A$.
Theorem 3.3 (Sylvester's Criterion). A symmetric matrix $A$ is positive definite if and only if all leading principal minors are positive:

$$A \succ 0 \iff \Delta_k > 0 \text{ for } k = 1, \dots, n.$$

Proof sketch. The key insight is that the $k \times k$ leading principal submatrix $A_k$ is itself symmetric, and $A \succ 0$ implies $A_k \succ 0$ for every $k$ (restrict the quadratic form to vectors supported on the first $k$ standard basis vectors). The converse uses induction: if $\Delta_1, \dots, \Delta_n > 0$, then by the inductive hypothesis all leading principal submatrices up to size $n - 1$ are PD, and the Cholesky factorization can be completed (4 gives the constructive proof). The determinant of a PD matrix equals the product of its Cholesky diagonal entries squared, all positive, so each $\Delta_k = \det A_k > 0$.
Working example. Test $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$:
- $\Delta_1 = 4 > 0$ OK
- $\Delta_2 = 4 \cdot 5 - 2^2 = 16 > 0$ OK
- $\Delta_3 = \det A = 64 > 0$ OK
So $A \succ 0$.
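The leading principal minors can also be computed directly (a minimal sketch using the same illustrative matrix):

```python
import numpy as np

A = np.array([[4., 2., 2.],
              [2., 5., 3.],
              [2., 3., 6.]])

# Leading principal minors: Delta_k = det of the top-left k x k block.
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print(minors)     # [4.0, 16.0, 64.0] -> all positive, so A is PD by Sylvester's criterion
```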
Caution for PSD. Sylvester's criterion does NOT characterize $A \succeq 0$. A matrix can have all leading principal minors $\geq 0$ but still not be PSD (a counterexample is $A = \begin{pmatrix} 0 & 0 \\ 0 & -1 \end{pmatrix}$: $\Delta_1 = 0$, $\Delta_2 = 0$, but $e_2^T A e_2 = -1 < 0$). For PSD testing, use all principal minors (not just leading ones), or use eigenvalues.
For AI: Sylvester's criterion is used in symbolic proofs and in optimization algorithms that maintain PD structure (e.g., proving that a block covariance matrix built by a GP model is PD, by verifying the diagonal blocks and Schur complements are positive).
3.3 Pivot Characterization
The connection between Gaussian elimination and positive definiteness is deep and computationally significant.
Theorem 3.4 (Pivot Characterization). A symmetric matrix is positive definite if and only if Gaussian elimination without row exchanges produces all positive pivots.
The pivots of $A$ are related to the leading minors by $d_1 = \Delta_1$, $d_2 = \Delta_2/\Delta_1$, $d_3 = \Delta_3/\Delta_2$, and in general $d_k = \Delta_k/\Delta_{k-1}$. So all pivots $> 0$ $\iff$ all $\Delta_k > 0$ (Sylvester). The LDL^T factorization makes this explicit: $A = LDL^T$ where $L$ is unit lower triangular and all diagonal entries of $D$ (the pivots) are positive (4.4).
The pivot characterization is how Cholesky "discovers" positive definiteness: the Cholesky algorithm fails (tries to take the square root of a non-positive number) exactly when a pivot is non-positive. This makes Cholesky the standard computational test for positive definiteness: try to factor; failure reveals non-PD structure.
PIVOT CHARACTERIZATION SUMMARY
========================================================================
A \in \mathbb{R}^{n x n} symmetric. The following are equivalent:
 (i)   A \succ 0                 (quadratic form positive for all x \neq 0)
 (ii)  All eigenvalues \lambda_i > 0                    [Spectral]
 (iii) All leading principal minors \Delta_k > 0        [Sylvester]
 (iv)  All Gaussian elimination pivots d_k > 0          [Pivot]
 (v)   Cholesky factorization A = LL^T exists with L    [Cholesky]
       lower triangular, L_ii > 0
Each characterization provides a different computational test:
(ii) O(n^3): eigenvalue decomposition <- most expensive
(iii) O(n^4): compute n determinants <- symbolic/manual only
(iv) O(n^3): Gaussian elimination
(v) O(n^3/3): Cholesky <- fastest in practice
========================================================================
3.4 Certifying PSD vs PD
In numerical practice, the PD/PSD boundary is a source of subtle bugs. Here we describe how to handle the boundary cases.
Near-PSD matrices. A matrix that is theoretically PSD may have small negative eigenvalues due to floating-point errors. For example, if $K$ is constructed as $K = XX^T$ but $X$ is nearly rank-deficient, $K$ might have eigenvalues like $-10^{-15}$ due to rounding. The standard fix is to add a small jitter: $K + \varepsilon I$ for $\varepsilon = 10^{-8}$ or $10^{-6}$.
Numerical rank. For a PSD matrix, the rank equals the number of positive eigenvalues. In finite precision, eigenvalues below roughly $\varepsilon_{\text{mach}} \cdot \|A\|_2$ (machine epsilon times the spectral norm) are treated as zero. numpy.linalg.matrix_rank uses a threshold of this form internally.
Testing PSD in practice:
- Try np.linalg.cholesky(A) - it succeeds iff $A \succ 0$ (strictly PD)
- For a PSD test: check np.all(np.linalg.eigvalsh(A) >= -tol) with tol = 1e-10
- For rank-$r$ PSD: compute np.linalg.matrix_rank(A) and verify it equals $r$
The zero-eigenvalue case. If $A \succeq 0$ and $\mathrm{rank}(A) = r < n$, the Cholesky factorization of the full matrix does not exist. However, the factorization $A = LL^T$ where $L$ is $n \times r$ (a "thin" Cholesky) does exist and can be computed via pivoted Cholesky. This is used in sparse GP approximations (Nystrom approximation, inducing points).
For AI: PyTorch's torch.linalg.cholesky raises torch.linalg.LinAlgError if the matrix is not PD. In practice, diagonal jitter (A + 1e-6 * torch.eye(n)) is the standard workaround in GP implementations and VAE covariance parameterizations.
4. Cholesky Decomposition
Cholesky decomposition is the most important computational tool in this section. Unlike the spectral characterization (eigenvalue decomposition, $O(n^3)$ with a large constant) or LU (general, with pivoting), Cholesky is:
- Twice as fast as LU for symmetric positive definite systems ($n^3/3$ vs $2n^3/3$ operations)
- Numerically backward-stable without pivoting (which LU requires for stability)
- Reveals the square root: $L$ satisfies $LL^T = A$, i.e., $L$ is a "matrix square root" of $A$
- The canonical test for positive definiteness: the factorization exists iff $A \succ 0$
4.1 The Theorem: Existence and Uniqueness
Theorem 4.1 (Cholesky Factorization). Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A \succ 0$ if and only if there exists a unique lower triangular matrix $L$ with strictly positive diagonal entries such that

$$A = LL^T.$$

$L$ is called the Cholesky factor of $A$. The factorization $A = LL^T$ is the Cholesky decomposition.
Proof of existence (constructive). We prove by induction on $n$.
Base case ($n = 1$): $A = (a_{11})$ with $a_{11} > 0$ (since $A \succ 0$). Take $L = (\sqrt{a_{11}})$.
Inductive step: Write $A$ in block form:

$$A = \begin{pmatrix} a_{11} & b^T \\ b & \tilde{A} \end{pmatrix},$$

where $a_{11} > 0$ (Proposition 2.3), $b \in \mathbb{R}^{n-1}$, and $\tilde{A} \in \mathbb{R}^{(n-1)\times(n-1)}$ is symmetric.
Set $\ell_{11} = \sqrt{a_{11}}$ and $\ell = b/\sqrt{a_{11}}$. Then $\tilde{A} - \ell\ell^T$ is the Schur complement $S = \tilde{A} - bb^T/a_{11}$.
Since $A \succ 0$, its Schur complement $S \succ 0$ (Theorem 6.2). By the inductive hypothesis, $S = L_S L_S^T$ for a unique lower triangular $L_S$ with positive diagonal.
Then:

$$A = \begin{pmatrix} \ell_{11} & 0 \\ \ell & L_S \end{pmatrix}\begin{pmatrix} \ell_{11} & \ell^T \\ 0 & L_S^T \end{pmatrix} = LL^T.$$

Proof of uniqueness. Suppose $A = L_1 L_1^T = L_2 L_2^T$ with $L_1, L_2$ lower triangular and positive diagonal. Then $L_2^{-1}L_1 = L_2^T L_1^{-T}$. The left side is lower triangular; the right side is upper triangular. Both sides equal the same matrix $D$, which must be diagonal with positive entries (since $L_1, L_2$ have positive diagonals). Comparing: $L_1 = L_2 D$, so $A = L_2 D^2 L_2^T$. But $A = L_2 L_2^T$, so $D^2 = I$, so $D = I$ (diagonal positive), so $L_1 = L_2$.
Converse. If $A = LL^T$ with $L$ lower triangular and positive diagonal, then for $x \neq 0$: $x^T A x = \|L^T x\|^2 > 0$ (since $L^T x \neq 0$, because $L$ is invertible by positive diagonal). So $A \succ 0$.
4.2 The Algorithm: Constructing L
The Cholesky algorithm computes $L$ column by column. From $A = LL^T$, expanding element by element:
For the diagonal entry $a_{jj}$:

$$a_{jj} = \sum_{k=1}^{j} \ell_{jk}^2 \quad\Longrightarrow\quad \ell_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} \ell_{jk}^2}.$$

For the off-diagonal entry $a_{ij}$ with $i > j$:

$$a_{ij} = \sum_{k=1}^{j} \ell_{ik}\ell_{jk} \quad\Longrightarrow\quad \ell_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} \ell_{ik}\ell_{jk}}{\ell_{jj}}.$$
Algorithm 4.2 (Cholesky, column-wise):
Input:  Symmetric A \in \mathbb{R}^{n x n}
Output: Lower triangular L with A = LL^T, or FAILURE

for j = 1 to n:
    s = A[j,j] - sum(L[j,k]^2 for k = 1..j-1)
    if s \leq 0:
        FAILURE (A is not positive definite)
    L[j,j] = sqrt(s)
    for i = j+1 to n:
        L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k] for k = 1..j-1)) / L[j,j]
    for i = 1 to j-1:
        L[i,j] = 0    (upper triangle is zero)
Worked example. Compute the Cholesky factor of $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$:
Column 1: $\ell_{11} = \sqrt{4} = 2$, $\ell_{21} = 2/2 = 1$, $\ell_{31} = 2/2 = 1$.
Column 2: $\ell_{22} = \sqrt{5 - 1^2} = 2$, $\ell_{32} = (3 - 1 \cdot 1)/2 = 1$.
Column 3: $\ell_{33} = \sqrt{6 - 1^2 - 1^2} = 2$.
Verify: $L = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 0 \\ 1 & 1 & 2 \end{pmatrix}$ and $LL^T = A$. OK
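A direct translation of Algorithm 4.2 into NumPy, checked against numpy.linalg.cholesky on the example above (a minimal sketch):

```python
import numpy as np

def cholesky(A):
    """Column-wise Cholesky factorization; raises ValueError if A is not PD."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j, j] - np.dot(L[j, :j], L[j, :j])
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
L = cholesky(A)
print(L)                                         # lower triangular, diagonal (2, 2, 2)
print(np.allclose(L, np.linalg.cholesky(A)))     # True
print(np.allclose(L @ L.T, A))                   # True
```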
4.3 Complexity and Numerical Properties
Flop count. The Cholesky algorithm requires approximately $n^3/3$ floating-point operations - precisely half the $2n^3/3$ required by LU decomposition. For large $n$, this factor-of-2 speedup is significant.
Memory. Only the lower triangle of $A$ needs to be stored: $n(n+1)/2$ entries vs $n^2$ for general LU. In-place variants overwrite the lower triangle of $A$ with $L$.
Backward stability. Cholesky is numerically backward-stable without pivoting: the computed $\hat{L}$ satisfies $\hat{L}\hat{L}^T = A + E$ where $\|E\| \leq c_n\,\varepsilon_{\text{mach}}\,\|A\|$ for a modest constant $c_n$. This is stronger than LU without pivoting (which can be unstable). The reason: the identity $\sum_k \ell_{jk}^2 = a_{jj}$ bounds every entry of $L$ by the diagonal of $A$, so the factorization cannot amplify errors.
Solving $Ax = b$: Factor $A = LL^T$, then solve $Ly = b$ by forward substitution and $L^T x = y$ by backward substitution. Total cost: $n^3/3 + O(n^2)$ (factorization + two triangular solves).
Failure = non-PD. If at any step the quantity under the square root is non-positive, is not positive definite. This makes Cholesky the standard computational test for PD structure.
For AI: JAX's jax.scipy.linalg.cholesky, PyTorch's torch.linalg.cholesky, and SciPy's scipy.linalg.cholesky are all efficient LAPACK-backed implementations. In Gaussian process models, the Cholesky solve L \ b (forward substitution) appears in every prediction and log-marginal-likelihood computation.
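For solving $Ax = b$ with a PD matrix, SciPy exposes the factor-then-solve pattern directly (a minimal sketch, reusing the example matrix from 4.2):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
b = np.array([1., 2., 3.])

c, low = cho_factor(A)           # factor once (~n^3/3 flops)
x = cho_solve((c, low), b)       # two triangular solves (~2n^2 flops)
print(np.allclose(A @ x, b))     # True
```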
4.4 LDL^T Factorization
The LDL^T decomposition is a variant of Cholesky that avoids square roots - useful when the matrix is PSD but not strictly PD, or when square root computation is expensive.
Theorem 4.3 (LDL^T Factorization). Every symmetric $A$ that admits Gaussian elimination without row swaps can be uniquely factored as

$$A = LDL^T,$$

where $L$ is unit lower triangular ($\ell_{ii} = 1$) and $D = \mathrm{diag}(d_1, \dots, d_n)$.
Moreover, $A \succ 0$ if and only if all diagonal entries $d_i > 0$.
Relation to Cholesky. If $A \succ 0$, the LDL^T and Cholesky factorizations are related by:

$$A = LDL^T = (LD^{1/2})(LD^{1/2})^T.$$

So the Cholesky factor is $LD^{1/2}$ where $D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \dots, \sqrt{d_n})$. The LDL^T decomposition computes the $d_i$ directly without taking square roots.
Algorithm 4.4 (LDL^T, column-wise):
for j = 1 to n:
    v[1..j-1] = D[1..j-1] * L[j,1..j-1]    (element-wise)
    D[j] = A[j,j] - dot(L[j,1..j-1], v[1..j-1])
    for i = j+1 to n:
        L[i,j] = (A[i,j] - dot(L[i,1..j-1], v[1..j-1])) / D[j]
    L[j,j] = 1
Worked example. Factor the same matrix $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$ from 4.2:
$d_1 = 4$, $\ell_{21} = \ell_{31} = 1/2$; $d_2 = 5 - 4(1/2)^2 = 4$, $\ell_{32} = (3 - (1/2)(2))/4 = 1/2$; $d_3 = 6 - 4(1/2)^2 - 4(1/2)^2 = 4$.
Verify: $L = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 1/2 & 1/2 & 1 \end{pmatrix}$, $D = \mathrm{diag}(4, 4, 4)$, and $LDL^T = A$; the Cholesky factor $LD^{1/2}$ matches the $L$ found in 4.2. OK
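SciPy provides an LDL^T routine (a minimal sketch; for this PD matrix the permutation is trivial):

```python
import numpy as np
from scipy.linalg import ldl

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
lu, d, perm = ldl(A, lower=True)       # lu @ d @ lu.T reconstructs A; lu[perm] is unit lower triangular
print(np.diag(d))                      # [4. 4. 4.] -> all pivots positive, so A is PD
print(np.allclose(lu @ d @ lu.T, A))   # True
```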
When to prefer LDL^T over Cholesky:
- When $A$ is indefinite but you still want a factorization (use pivoted LDL^T with $1 \times 1$ and $2 \times 2$ pivots)
- When square root computation is numerically undesirable
- In interval arithmetic or symbolic computation
Symmetric indefinite systems. The Bunch-Kaufman (BK) algorithm extends LDL^T to indefinite matrices using $1 \times 1$ and $2 \times 2$ pivots. The LAPACK routine dsytrf implements this.
4.5 Modified Cholesky for Near-PD Matrices
In optimization (especially trust-region methods and quasi-Newton methods), we frequently need to factor a matrix that is "nearly" PD - the Hessian approximation may have small negative eigenvalues due to finite differences or insufficient curvature.
The problem. Standard Cholesky fails (negative pivot) when $A$ is not strictly PD. Rather than failing, we want a PD matrix close to $A$ that can be Cholesky-factored.
Gill-Murray-Wright (GMW) modification. At each pivot step $j$, if the computed pivot $s < \delta$ for a small threshold $\delta > 0$, set $s \leftarrow \delta$ (adding a correction to the diagonal). After factorization, $LL^T = A + E$ where $E$ is a small diagonal correction.
Simple diagonal jitter. The most widely used approach in machine learning is to add $\varepsilon I$ before attempting Cholesky:
```python
import numpy as np

def robust_cholesky(A, jitter=1e-6):
    # Try the exact factorization first; fall back to adding diagonal jitter.
    try:
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        return np.linalg.cholesky(A + jitter * np.eye(len(A)))
```
This is the standard "GP jitter" pattern. The added corresponds to assuming a small observation noise or numerical regularization.
For AI: PyTorch GPyTorch (Gaussian Process library) wraps all kernel matrix Cholesky calls with diagonal jitter and a retry mechanism. The JAX implementation uses jax.scipy.linalg.cholesky and adds jitter via A += diag_jitter * jnp.eye(n). Understanding why jitter works - it adds to each eigenvalue, making all eigenvalues at least - is exactly positive definiteness via the spectral characterization.
5. Matrix Square Root and Whitening
5.1 The PSD Square Root
For a scalar $a \geq 0$, there is a unique non-negative square root $\sqrt{a}$. For a PSD matrix, the same uniqueness holds - but only in the class of PSD matrices.
Theorem 5.1 (PSD Square Root). Let $A$ be symmetric and positive semidefinite. Then there exists a unique symmetric positive semidefinite matrix $A^{1/2}$ such that:

$$(A^{1/2})^2 = A.$$

If $A \succ 0$, then $A^{1/2} \succ 0$.
Proof. By the spectral theorem, $A = Q\Lambda Q^T$ where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with all $\lambda_i \geq 0$. Define:

$$A^{1/2} = Q\Lambda^{1/2}Q^T, \qquad \Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n}).$$

Then $(A^{1/2})^2 = Q\Lambda^{1/2}Q^T Q\Lambda^{1/2}Q^T = Q\Lambda Q^T = A$. This $A^{1/2}$ is symmetric PSD.
Uniqueness: Suppose $B^2 = A$ with $B$ symmetric PSD. Then $B$ and $A$ commute (since $BA = B \cdot B^2 = B^2 \cdot B = AB$), so they share eigenvectors. On each shared eigenspace of $A$ with eigenvalue $\lambda$, $B$ must act as $\sqrt{\lambda}$ (the unique non-negative root). So $B = A^{1/2}$.
Warning: Non-uniqueness without PSD constraint. The equation $B^2 = A$ for $A \succ 0$ has $2^n$ symmetric solutions when the eigenvalues of $A$ are distinct (one can independently choose $\pm\sqrt{\lambda_i}$ for each eigenvalue). The PSD square root is the unique solution with all non-negative eigenvalues.
5.2 Computing Square Roots
Via eigendecomposition: $A^{1/2} = Q\,\mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})\,Q^T$.
This requires the full eigendecomposition, costing $O(n^3)$ with a large constant. For symmetric $A$:
eigvals, Q = np.linalg.eigh(A) # symmetric eigendecomposition
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
Via Cholesky (the "left Cholesky factor"). Note: the Cholesky factor satisfies but in general (unless is diagonal). The Cholesky factor is a square root of in the sense , but it is not symmetric and not equal to the eigendecomposition-based .
Via Newton's method (for matrix functions): The iteration starting from converges to for . This is the matrix analogue of Heron's method for scalar square roots. Convergence is quadratic once close enough.
Via Pade approximants: For close to , write and use a matrix series , truncated at some order.
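A sketch of the Newton iteration described above (illustrative only; in production code one would typically use the eigendecomposition or scipy.linalg.sqrtm):

```python
import numpy as np

def sqrtm_newton(A, iters=20):
    """Newton iteration X <- (X + X^{-1} A) / 2, started at X = I."""
    X = np.eye(A.shape[0])
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.solve(X, A))
    return X

A = np.array([[2., 1.], [1., 2.]])   # a small PD example
S = sqrtm_newton(A)
print(np.allclose(S @ S, A))         # True (up to rounding)
```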
5.3 Square Root vs Cholesky
The two "square root" operations serve different purposes:
| Property | Cholesky factor $L$ | PSD square root $A^{1/2}$ |
|---|---|---|
| Satisfies | $LL^T = A$ | $(A^{1/2})^2 = A$ |
| Shape | Lower triangular | Symmetric |
| Uniqueness | Unique (positive diagonal) | Unique (PSD) |
| Computation | $n^3/3$ flops | $O(n^3)$ (full eigendecomposition) |
| Invertibility | $L^{-1}$ is lower triangular | $A^{-1/2} = Q\Lambda^{-1/2}Q^T$ is symmetric |
| Use case | Solving systems, sampling | Mahalanobis, whitening |
When to use Cholesky: solving $Ax = b$, computing $\log\det A = 2\sum_i \log \ell_{ii}$, sampling from $\mathcal{N}(\mu, \Sigma)$.
When to use $A^{1/2}$: defining the Mahalanobis inner product $\langle x, y \rangle_A = x^T A y = (A^{1/2}x)^T(A^{1/2}y)$; whitening transforms; matrix differential equations.
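For example, sampling from $\mathcal{N}(\mu, \Sigma)$ uses the Cholesky factor (a minimal sketch with an assumed covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
mu    = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
z = rng.standard_normal((10000, 2))        # standard normal samples
samples = mu + z @ L.T                     # cov(L z) = L I L^T = Sigma

print(np.cov(samples, rowvar=False))       # approximately Sigma
```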
5.4 Whitening Transforms
Definition 5.2 (Whitening). Given data $x$ with covariance $\Sigma \succ 0$, a whitening transform maps $x \mapsto Wx$ with $W\Sigma W^T = I$. The transformed variable has covariance $I$ (identity) - it is "white."
ZCA whitening. The ZCA (Zero-phase Component Analysis) whitening uses the symmetric square root: $W = \Sigma^{-1/2}$. The transformed data is as close to the original as possible while being white. This minimizes the change in "appearance" of the data.
Cholesky whitening. Using $W = L^{-1}$ (where $\Sigma = LL^T$) also produces white data but rotates it differently. This form is computationally cheaper.
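A sketch comparing the two whitening transforms on synthetic data (the mixing matrix below is an arbitrary assumption; both transforms produce identity covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 0.0],
              [1.2, 0.5]])                       # arbitrary mixing -> correlated data
X = rng.normal(size=(5000, 2)) @ M
Sigma = np.cov(X, rowvar=False)

# ZCA whitening: W = Sigma^{-1/2} via the symmetric square root.
lam, Q = np.linalg.eigh(Sigma)
W_zca = Q @ np.diag(lam ** -0.5) @ Q.T

# Cholesky whitening: W = L^{-1} where Sigma = L L^T.
L = np.linalg.cholesky(Sigma)
W_chol = np.linalg.inv(L)

for W in (W_zca, W_chol):
    Z = X @ W.T
    print(np.cov(Z, rowvar=False))               # approximately the identity in both cases
```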
Mahalanobis distance. For $\Sigma \succ 0$, the quadratic form $x^T\Sigma^{-1}x$ is the squared Mahalanobis distance from the origin. Geometrically, it measures the distance in "standard deviation units," accounting for the correlation structure of $\Sigma$. In whitened coordinates $z = \Sigma^{-1/2}x$, it is simply $\|z\|^2$.
For AI: Batch normalization can be viewed as approximate whitening per mini-batch. More precisely, it normalizes each feature to zero mean and unit variance (diagonal whitening), avoiding the expensive full whitening. The full Mahalanobis distance appears in metric learning (e.g., Siamese networks learning a matrix $M \succeq 0$ such that $(x - y)^T M (x - y)$ is a meaningful distance) and in attention mechanisms where different layers may learn non-isotropic attention kernels.