"Positive definiteness is the matrix condition that makes everything work: optimization has a unique minimum, Gaussians are proper distributions, and Cholesky gives you a square root. It is the difference between a bowl and a saddle."
Overview
Positive definite matrices are the matrices that behave like positive numbers. Just as a positive scalar $a$ makes the equation $ax = b$ uniquely solvable and the function $\frac{1}{2}ax^2 - bx$ have a unique minimum, a positive definite matrix $A$ makes the system $Ax = b$ uniquely solvable, the quadratic form $x^TAx$ have a unique minimum at zero, and the decomposition (Cholesky) exist and be unique. Positive definiteness is the structural condition that separates "well-posed" from "ill-posed" in a precise algebraic sense.
This section develops the full theory of positive definite and positive semidefinite matrices. We begin with quadratic forms and the geometric picture - level sets that form ellipsoids, energy functions with unique bowls - then build the complete toolkit: four equivalent characterizations (eigenvalue, determinantal, pivot, Cholesky), the Cholesky and LDL^T decompositions as the canonical section topic, matrix square roots, the Schur complement and its role in Gaussian conditioning, log-determinants, Gram matrices, and the PSD cone with semidefinite programming.
The AI connections are dense and load-bearing. Every covariance matrix in a multivariate Gaussian must be positive semidefinite - this is not a convenience but a mathematical necessity. The Fisher information matrix is PSD by construction and drives natural gradient methods (K-FAC). The Hessian at a local minimum is PSD. Cholesky factorization is the standard algorithm for sampling from multivariate Gaussians and the reparameterization trick in VAEs. Log-determinants appear in every Gaussian log-likelihood and normalizing flow objective. The attention mechanism computes a scaled Gram matrix. Understanding positive definiteness is understanding the algebraic backbone of probabilistic machine learning.
Prerequisites
- Eigenvalues, eigenvectors, spectral theorem for symmetric matrices - 01: Eigenvalues and Eigenvectors
- Singular value decomposition - 02: SVD
- Inner products, orthogonality, orthogonal matrices - 05: Orthogonality
- Matrix norms, condition numbers - 06: Matrix Norms
- Determinants - Chapter 2 04: Determinants
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive exploration: Cholesky algorithms, log-det computation, Gram matrices, GPR, VAE reparameterization, SDP |
| exercises.ipynb | 8 graded exercises from PD verification through Cholesky-based sampling for VAEs |
Learning Objectives
After completing this section, you will be able to:
- State all four equivalent characterizations of positive definiteness and prove the equivalences
- Distinguish PD ($\succ 0$), PSD ($\succeq 0$), negative definite, and indefinite matrices with examples
- Apply Sylvester's criterion to certify positive definiteness without computing eigenvalues
- Implement the Cholesky algorithm from scratch and analyze its numerical stability
- Factor $A = LDL^T$ and explain when to prefer LDL^T over standard Cholesky
- Compute the matrix square root and explain its uniqueness
- Define the Schur complement and use it to test positive definiteness of block matrices
- Apply the matrix inversion lemma via Schur complements
- Compute $\log\det A$ via Cholesky and derive the gradient $\nabla_A \log\det A = A^{-1}$
- Construct Gram matrices and prove they are PSD; state Mercer's theorem
- Describe the PSD cone as a convex set and formulate SDPs
- Apply Cholesky reparameterization to sample from multivariate Gaussians in VAEs
- Connect the Fisher information matrix, natural gradient, and K-FAC to PD structure
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Eigenvalue and Determinantal Characterizations
- 4. Cholesky Decomposition
- 5. Matrix Square Root and Whitening
- 6. Schur Complement
- 7. Log-Determinant
- 8. Gram Matrices and Kernel Connections
- 9. The PSD Cone and Semidefinite Programming
- 10. Applications in Machine Learning
- 11. Common Mistakes
- 12. Exercises
- 13. Why This Matters for AI (2026 Perspective)
- 14. Conceptual Bridge
1. Intuition
1.1 What Positive Definiteness Captures
The most useful single-sentence definition of a positive definite matrix is: a matrix that behaves like a positive number. To understand what this means, compare scalars and matrices in parallel.
For a scalar $a > 0$:
- The equation $ax = b$ has a unique solution $x = b/a$
- The function $f(x) = \frac{1}{2}ax^2 - bx$ has a unique minimum at $x^* = b/a$
- The scalar has a real square root: $\sqrt{a} > 0$
- The product $a x \cdot x = ax^2 \geq 0$, with equality only at $x = 0$
For a positive definite matrix $A$:
- The system $Ax = b$ has a unique solution $x = A^{-1}b$
- The function $f(x) = \frac{1}{2}x^TAx - b^Tx$ has a unique minimum at $x^* = A^{-1}b$
- The matrix has a unique positive definite square root $A^{1/2}$
- The quadratic form $x^TAx > 0$ for all $x \neq 0$
The quadratic form $Q_A(x) = x^TAx$ is the central object. Think of it as a machine that takes a vector $x$ and returns a scalar measuring its "energy" with respect to $A$. When $A$ is the identity, $Q(x) = \|x\|^2$, the familiar squared Euclidean length. When $A$ is diagonal with positive entries $d_1, \dots, d_n$, then $Q(x) = \sum_i d_i x_i^2$ - a weighted sum of squared coordinates. Positive definiteness requires this energy to be positive for every nonzero direction.
The formal definition requires $A$ to be symmetric. The condition $x^TAx > 0$ for all $x \neq 0$ is called strict positive definiteness. The relaxed condition $x^TAx \geq 0$ is positive semidefiniteness - it allows zero energy in some directions (those in the null space of $A$).
For AI: Every twice-differentiable loss function $L(\theta)$ that has a strict local minimum at $\theta^*$ satisfies $\nabla^2 L(\theta^*) \succeq 0$. The Hessian being positive definite at a critical point is exactly the second-order sufficient condition for a local minimum. This is the mathematical content of "the loss landscape is a bowl."
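A minimal numeric sketch of the bowl-vs-saddle distinction (the two matrices below are illustrative, not from the text): evaluate the quadratic form on random directions and compare a PD matrix with an indefinite one.

```python
import numpy as np

rng = np.random.default_rng(0)

A_pd = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1 and 3: a "bowl"
A_indef = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1: a "saddle"

for name, A in [("PD", A_pd), ("indefinite", A_indef)]:
    xs = rng.standard_normal((1000, 2))
    q = np.einsum("ni,ij,nj->n", xs, A, xs)   # x^T A x for each sample
    print(name, "min quadratic form:", q.min(), "eigenvalues:", np.linalg.eigvalsh(A))
```

The PD matrix never produces a negative energy; the indefinite one does, which is exactly the saddle behavior described above.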
1.2 Geometric Picture: Ellipsoids and Level Sets
The level sets of the quadratic form reveal the geometry of directly.
Case 1: $A = I$ (identity). The level set $\{x : x^Tx = c\}$ is a sphere of radius $\sqrt{c}$.
Case 2: $A = \mathrm{diag}(\lambda_1, \lambda_2)$ with $\lambda_1 > \lambda_2 > 0$. The level set $\lambda_1 x_1^2 + \lambda_2 x_2^2 = 1$ is an ellipse with semi-axes $1/\sqrt{\lambda_1}$ and $1/\sqrt{\lambda_2}$. The larger eigenvalue compresses that direction; the smaller eigenvalue stretches it.
Case 3: $A$ general symmetric PD. The level set $\{x : x^TAx = 1\}$ is an ellipsoid whose axes are aligned with the eigenvectors of $A$ and whose semi-axis lengths are $1/\sqrt{\lambda_i}$. This follows from the spectral decomposition $A = Q\Lambda Q^T$: substituting $y = Q^Tx$ gives $\sum_i \lambda_i y_i^2 = 1$, a coordinate-aligned ellipsoid.
Case 4: indefinite (some positive, some negative eigenvalues). The level set is a hyperboloid - unbounded in the negative-eigenvalue directions. There is no minimum energy.
Case 5: positive semidefinite (some zero eigenvalues). The level set is an "infinite cylinder" - flat in the null space directions. Energy is zero for all vectors in the null space.
QUADRATIC FORM GEOMETRY
========================================================================
2D quadratic form Q(x_1, x_2) = x^T A x, level set Q = 1

  PD (both \lambda > 0)     PSD (one \lambda = 0)     Indefinite
      ellipse                pair of parallel lines    hyperbola

  Semi-axes \propto 1/\sqrt{\lambda_i}
  Axes aligned with the eigenvectors of A
========================================================================
This geometric picture has a direct machine learning reading. The covariance matrix $\Sigma$ of a multivariate Gaussian defines a metric on feature space: $(x - \mu)^T\Sigma^{-1}(x - \mu)$ is the squared Mahalanobis distance, measuring how many standard deviations $x$ is from $\mu$ in each principal direction. The level sets of this distance are the ellipsoidal contours of the Gaussian density.
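A short sketch of that reading (the covariance below is an assumed example): the Mahalanobis distance discounts directions in which the distribution has large variance, so it can differ sharply from the Euclidean distance.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.5], [1.5, 1.0]])   # anisotropic, correlated covariance
x = np.array([2.0, 0.0])

diff = x - mu
maha_sq = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu), no explicit inverse
eucl_sq = diff @ diff
print(f"squared Mahalanobis: {maha_sq:.3f}, squared Euclidean: {eucl_sq:.3f}")
```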
1.3 Why PD Matrices Matter for AI
Positive definiteness is not a technical curiosity - it is a pervasive structural condition in machine learning systems:
Covariance matrices. Any valid probability distribution must have a non-negative variance in every direction. For a multivariate random variable $x$, the covariance matrix $\Sigma = \mathbb{E}[(x - \mu)(x - \mu)^T]$ is always PSD. For a multivariate Gaussian with a proper density, we need $\Sigma \succ 0$ (so $\Sigma^{-1}$ exists and the density is normalized). Every time a VAE encoder outputs a covariance or diagonal variance, it must produce a PSD (or PD) parameterization.
Fisher information matrix. The Fisher information $F$ is PSD by construction (it is a covariance matrix of score functions). The natural gradient uses the Fisher as a Riemannian metric on parameter space. K-FAC (Kronecker-Factored Approximate Curvature) approximates $F$ with a Kronecker-product PSD matrix for scalable second-order optimization.
Gram matrices and kernels. The inner product matrix $G_{ij} = x_i^Tx_j$ for any set of vectors is always PSD. Kernel methods rely on this: a function is a valid kernel if and only if its Gram matrix is PSD for any set of inputs (Mercer's theorem). The scaled attention score matrix $QK^T/\sqrt{d_k}$ in transformers is a (non-symmetric) Gram matrix.
Cholesky in numerical computation. Cholesky decomposition ($A = LL^T$) is the fastest algorithm for solving $Ax = b$ when $A$ is known to be PD - twice as fast as LU. It is the standard backend for Gaussian process inference, multivariate normal sampling, and Bayesian linear regression. NumPy, SciPy, and PyTorch use Cholesky internally whenever PD structure is declared (e.g., scipy.linalg.solve(A, b, assume_a='pos')).
Hessian and second-order methods. At a local minimum $\theta^*$, the Hessian $\nabla^2 L(\theta^*) \succeq 0$. Sharpness-Aware Minimization (SAM) seeks parameters where the largest Hessian eigenvalue is small, empirically improving generalization. The Gauss-Newton approximation $J^TJ$ is always PSD and is the basis for K-FAC and natural gradient methods.
1.4 Historical Timeline
| Year | Contributor | Development |
|---|---|---|
| 1801 | Gauss | Quadratic forms in celestial mechanics; method of least squares |
| 1847 | Sylvester | Named "positive definite" forms; leading minor criterion |
| 1852 | Sylvester | Sylvester's law of inertia (signature invariance under congruence) |
| 1878 | Frobenius | Systematic theory of bilinear and quadratic forms |
| 1910 | Cholesky | Cholesky decomposition for geodetic calculations (unpublished until 1924) |
| 1909 | Mercer | Mercer's theorem linking PSD functions to kernel expansions |
| 1934 | Loewner | Matrix ordering (Loewner order) defined rigorously |
| 1955 | Bellman | Positive definite matrices in dynamic programming and control |
| 1996 | Vandenberghe & Boyd | Semidefinite programming (SDP) as efficient optimization |
| 1998 | Scholkopf & Smola | Kernel methods: SVMs via PSD kernel matrices |
| 2013 | Kingma & Welling | VAE reparameterization trick using Cholesky sampling |
| 2015 | Martens & Grosse | K-FAC: Kronecker-factored PSD approximation to Fisher |
| 2024-2026 | Multiple groups | PSD structure in diffusion model covariances, normalizing flows, structured covariance estimation |
2. Formal Definitions
2.1 Quadratic Forms
Definition 2.1 (Quadratic Form). Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix and $x \in \mathbb{R}^n$. The associated quadratic form is the function $Q_A : \mathbb{R}^n \to \mathbb{R}$ defined by
$Q_A(x) = x^TAx = \sum_{i,j} A_{ij}x_ix_j$
Note: every quadratic form can be written with a symmetric matrix. If $A$ is not symmetric, replacing it with $\frac{1}{2}(A + A^T)$ gives the same quadratic form, since $x^TAx = x^TA^Tx$ for all $x$. We always assume $A$ is symmetric.
Expanding in 2D. For $A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ and $x = (x_1, x_2)^T$:
$x^TAx = a x_1^2 + 2b x_1x_2 + c x_2^2$
The diagonal entries give the pure squared terms; the off-diagonal gives the cross term.
Completing the square. For the 1D quadratic $f(x) = \frac{1}{2}ax^2 - bx$ with $a > 0$, the minimum is $-b^2/(2a)$, achieved at $x^* = b/a$. The matrix analogue is the Schur complement (Section 6) - the quadratic form $f(x) = \frac{1}{2}x^TAx - b^Tx$ is minimized at $x^* = A^{-1}b$ (when $A \succ 0$) with minimum value $-\frac{1}{2}b^TA^{-1}b$.
Connection to bilinear forms. The quadratic form is the diagonal of the bilinear form $B(x, y) = x^TAy$, i.e., $Q_A(x) = B(x, x)$. The bilinear form encodes the inner product structure induced by $A$.
2.2 The Four Cases: PD, PSD, ND, Indefinite
Definition 2.2. Let $A$ be symmetric. Then $A$ is:
| Name | Symbol | Condition | Colloquial |
|---|---|---|---|
| Positive definite | $A \succ 0$ | $x^TAx > 0$ for all $x \neq 0$ | "bowl" |
| Positive semidefinite | $A \succeq 0$ | $x^TAx \geq 0$ for all $x$ | "flat bowl" |
| Negative definite | $A \prec 0$ | $x^TAx < 0$ for all $x \neq 0$ | "dome" |
| Negative semidefinite | $A \preceq 0$ | $x^TAx \leq 0$ for all $x$ | "flat dome" |
| Indefinite | - | $x^TAx$ takes both positive and negative values | "saddle" |
Note: $A \succ 0$ implies $A \succeq 0$; PD is the strict version of PSD. The interesting cases for applications are PD and PSD.
Standard examples:
: for . . OK
: for . . OK
: , and . but not . (PSD, not PD.)
: , . Indefinite.
Non-examples (common mistakes):
: looks "positive" but , so indefinite.
: , so PSD but NOT PD.
: , indefinite.
2.3 Immediate Consequences
If , then many properties follow immediately from the definition:
Proposition 2.3. Let $A \succ 0$ (symmetric). Then:
1. Positive diagonal entries. $A_{ii} > 0$ for all $i$. Proof: Take $x = e_i$: $e_i^TAe_i = A_{ii} > 0$.
2. Positive trace. $\mathrm{tr}(A) = \sum_i A_{ii} > 0$.
3. Positive determinant. $\det(A) > 0$. (Follows from all eigenvalues positive, 3.1.)
4. Invertibility. $A$ is invertible, and $A^{-1} \succ 0$. Proof: If $Ax = 0$ then $x^TAx = 0$, so $x = 0$. For $A^{-1}$: it is symmetric, and $y^TA^{-1}y = (A^{-1}y)^TA(A^{-1}y) > 0$ for $y \neq 0$.
5. Closure under positive scaling. $cA \succ 0$ for $c > 0$.
6. Closure under addition. If $A, B \succ 0$ then $A + B \succ 0$. Proof: $x^T(A + B)x = x^TAx + x^TBx > 0$.
7. Tikhonov regularization is PD. If $A \succeq 0$ and $\lambda > 0$, then $A + \lambda I \succ 0$. Proof: $x^T(A + \lambda I)x = x^TAx + \lambda\|x\|^2 \geq \lambda\|x\|^2 > 0$ for $x \neq 0$.
8. Congruence preserves PD. If $A \succ 0$ and $C$ is invertible, then $C^TAC \succ 0$. Proof: $x^TC^TACx = (Cx)^TA(Cx) > 0$ since $Cx \neq 0$ when $x \neq 0$.
For AI: Property 4 ensures that covariance matrices can be inverted for precision computations. Property 7 is exactly why adding $\lambda I$ (weight decay, Tikhonov regularization, jitter in Gaussian processes) to a PSD matrix makes it PD and invertible.
2.4 The Loewner Order
Definition 2.4 (Loewner Order). For symmetric matrices $A, B$, define the Loewner ordering:
$A \succeq B \iff A - B \succeq 0$
Similarly, $A \succ B \iff A - B \succ 0$. This ordering is:
- Reflexive: $A \succeq A$.
- Antisymmetric: $A \succeq B$ and $B \succeq A$ implies $A = B$.
- Transitive: $A \succeq B$ and $B \succeq C$ implies $A \succeq C$.
However, it is NOT a total order - not every pair of symmetric matrices is comparable. For example, $\mathrm{diag}(2, 1)$ and $\mathrm{diag}(1, 2)$ are neither $\succeq$ nor $\preceq$ each other.
Properties of the Loewner order:
If $A \succeq B$ then:
- $\mathrm{tr}(A) \geq \mathrm{tr}(B)$ (trace is monotone)
- $\lambda_k(A) \geq \lambda_k(B)$ for all $k$ when eigenvalues are sorted in decreasing order (Weyl's theorem)
- $C^TAC \succeq C^TBC$ for any $C$ (congruence is monotone)
- If additionally $B \succ 0$: $A^{-1} \preceq B^{-1}$ (inversion reverses the order!)
The last property is surprising: if $A \succeq B \succ 0$, then $A^{-1} \preceq B^{-1}$. Intuition: a larger matrix "moves vectors more," so its inverse "moves them less."
For AI: In Bayesian inference, the posterior covariance is $\Sigma_{\text{post}} = (\Sigma_{\text{prior}}^{-1} + \text{data precision})^{-1}$. Since we added a PSD matrix to the prior precision, the posterior precision is larger, so by order-reversal, $\Sigma_{\text{post}} \preceq \Sigma_{\text{prior}}$. Observing data never increases uncertainty - the Loewner order makes this rigorous.
3. Eigenvalue and Determinantal Characterizations
3.1 Spectral Characterization
The most computationally transparent characterization of positive definiteness uses eigenvalues.
Theorem 3.1 (Spectral Characterization). Let $A$ be symmetric with eigenvalues $\lambda_1, \dots, \lambda_n$. Then:
$A \succ 0 \iff \lambda_i > 0$ for all $i$, and $A \succeq 0 \iff \lambda_i \geq 0$ for all $i$.
Proof. By the spectral theorem (-> 01: Eigenvalues), $A = Q\Lambda Q^T$ where $Q$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. For any $x \neq 0$, let $y = Q^Tx \neq 0$ (since $Q$ is invertible):
$x^TAx = y^T\Lambda y = \sum_i \lambda_i y_i^2$
If all $\lambda_i > 0$: the sum is positive for $y \neq 0$. OK
If some $\lambda_j \leq 0$: take $y = e_j$ (so $x = Qe_j = q_j$, the $j$-th eigenvector). Then $x^TAx = \lambda_j \leq 0$. NO
The characterization for $\succeq 0$ follows by the same argument with $\geq$ replacing $>$.
Immediate consequences:
- $\det(A) = \prod_i \lambda_i > 0$ for PD $A$
- $\mathrm{tr}(A) = \sum_i \lambda_i > 0$ for PD $A$
- $\|A\|_2 = \lambda_{\max}$ and $\|A^{-1}\|_2 = 1/\lambda_{\min}$ for PD $A$ (-> 06: Norms)
- The condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ measures how "nearly singular" $A$ is
For AI: In Gaussian process regression, the covariance kernel matrix must be PSD. Verifying by computing eigenvalues is expensive; checking a Cholesky decomposition (4) is faster and also provides the factorization needed for inference.
Note on the spectral theorem: This proof uses the spectral theorem for symmetric matrices - that every symmetric real matrix is orthogonally diagonalizable. For the full proof and discussion of spectral theory, see 01: Eigenvalues and Eigenvectors.
3.2 Sylvester's Criterion
Sylvester's criterion provides a purely determinantal test that avoids eigenvalue computation entirely, making it practical for hand calculations and symbolic proofs.
Definition 3.2 (Leading Principal Minors). The $k$-th leading principal minor of $A$ is the determinant of the top-left $k \times k$ submatrix:
$\Delta_k = \det(A_{1:k,\,1:k})$
So $\Delta_1 = a_{11}$, $\Delta_2 = a_{11}a_{22} - a_{12}^2$ (for symmetric $A$), and $\Delta_n = \det(A)$.
Theorem 3.3 (Sylvester's Criterion). A symmetric matrix $A$ is positive definite if and only if all leading principal minors are positive:
$A \succ 0 \iff \Delta_k > 0$ for $k = 1, \dots, n$.
Proof sketch. The key insight is that the $k$-th leading principal submatrix $A_k$ is itself symmetric, and $A \succ 0$ implies $A_k \succ 0$ for every $k$ (restriction to the first $k$ standard basis vectors). The converse uses induction: if $\Delta_1, \dots, \Delta_n > 0$ and $A_{n-1} \succ 0$, then by the inductive hypothesis all leading principal submatrices are PD, and the Cholesky factorization can be completed (Section 4 gives the constructive proof). The determinant of a PD matrix equals the product of its Cholesky diagonal entries squared, all positive, so $\Delta_n = \det(A) > 0$.
Working example. Test :
- OK
- OK
- OK
So .
Caution for PSD. Sylvester's criterion does NOT characterize $A \succeq 0$. A matrix can have all leading principal minors $\geq 0$ but still not be PSD (a counterexample is $A = \mathrm{diag}(0, -1)$: $\Delta_1 = 0$, $\Delta_2 = 0$, but $e_2^TAe_2 = -1 < 0$). For PSD testing, use all principal minors (not just leading ones), or use eigenvalues.
For AI: Sylvester's criterion is used in symbolic proofs and in optimization algorithms that maintain PD structure (e.g., proving that a block covariance matrix built by a GP model is PD, by verifying the diagonal blocks and Schur complements are positive).
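A minimal sketch of the criterion in code (the tridiagonal test matrix is illustrative): compute every leading principal minor and check positivity.

```python
import numpy as np

def leading_principal_minors(A):
    """Return [det(A[:1,:1]), det(A[:2,:2]), ..., det(A)]."""
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
minors = leading_principal_minors(A)
print("leading minors:", np.round(minors, 6))        # 2, 3, 4
print("PD by Sylvester:", all(m > 0 for m in minors))
```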
3.3 Pivot Characterization
The connection between Gaussian elimination and positive definiteness is deep and computationally significant.
Theorem 3.4 (Pivot Characterization). A symmetric matrix is positive definite if and only if Gaussian elimination without row exchanges produces all positive pivots.
The pivots of $A$ are $d_1 = \Delta_1$, $d_2 = \Delta_2/\Delta_1$, ..., $d_k = \Delta_k/\Delta_{k-1}$. So all pivots are positive iff all $\Delta_k > 0$ (Sylvester). The LDL^T factorization makes this explicit: $A = LDL^T$ where $D = \mathrm{diag}(d_1, \dots, d_n)$ collects the pivots, and all diagonal entries of $D$ are positive iff $A \succ 0$ (4.4).
The pivot characterization is how Cholesky "discovers" positive definiteness: the Cholesky algorithm fails (tries to take the square root of a non-positive number) exactly when a pivot is non-positive. This makes Cholesky the standard computational test for positive definiteness: try to factor; failure reveals non-PD structure.
PIVOT CHARACTERIZATION SUMMARY
========================================================================
A \in \mathbb{R}^{n \times n} symmetric. The following are equivalent:
(i) A \succ 0 (quadratic form positive for all x \neq 0)
(ii) All eigenvalues \lambda_i > 0 [Spectral]
(iii) All leading principal minors \Delta_k > 0 [Sylvester]
(iv) All Gaussian elimination pivots d_k > 0 [Pivot]
(v) Cholesky factorization A = LL^T exists with L [Cholesky]
lower triangular, L_ii > 0
Each characterization provides a different computational test:
(ii) O(n^3): eigenvalue decomposition <- most expensive
(iii) O(n^4): compute n determinants <- symbolic/manual only
(iv) O(n^3): Gaussian elimination
(v) O(n^3/3): Cholesky <- fastest in practice
========================================================================
3.4 Certifying PSD vs PD
In numerical practice, the PD/PSD boundary is a source of subtle bugs. Here we describe how to handle the boundary cases.
Near-PSD matrices. A matrix that is theoretically PSD may have small negative eigenvalues due to floating-point errors. For example, if $K$ is constructed as a Gram or kernel matrix but is nearly rank-deficient, it might have eigenvalues like $-10^{-12}$ due to rounding. The standard fix is to add a small jitter: replace $K$ with $K + \epsilon I$ for a small $\epsilon$ (e.g., $10^{-8}$ to $10^{-6}$ relative to the scale of $K$).
Numerical rank. For a PSD matrix, the rank equals the number of positive eigenvalues. In finite precision, eigenvalues below roughly machine epsilon times the spectral norm (scaled by dimension) are treated as zero. numpy.linalg.matrix_rank uses this kind of threshold internally.
Testing PSD in practice:
- Try np.linalg.cholesky(A) - succeeds iff $A \succ 0$ (strictly PD)
- For a PSD test: check np.all(np.linalg.eigvalsh(A) >= -tol) with tol = 1e-10
- For rank-$r$ PSD: compute np.linalg.matrix_rank(A) and verify it equals $r$
The zero-eigenvalue case. If $A \succeq 0$ and $\mathrm{rank}(A) = r < n$, the Cholesky factorization of the full matrix does not exist. However, the factorization $A = LL^T$ where $L$ is $n \times r$ (a "thin" Cholesky) does exist and can be computed via pivoted Cholesky. This is used in sparse GP approximations (Nystrom approximation, inducing points).
For AI: PyTorch's torch.linalg.cholesky raises torch.linalg.LinAlgError if the matrix is not PD. In practice, diagonal jitter (A + 1e-6 * torch.eye(n)) is the standard workaround in GP implementations and VAE covariance parameterizations.
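A compact sketch of the two tests listed above (the tolerance is illustrative):

```python
import numpy as np

def is_pd(A):
    """Strict PD test: Cholesky succeeds iff the symmetric matrix A is PD."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

def is_psd(A, tol=1e-10):
    """PSD test: all eigenvalues of the symmetrized matrix >= -tol."""
    return bool(np.all(np.linalg.eigvalsh((A + A.T) / 2) >= -tol))

A = np.array([[1.0, 1.0], [1.0, 1.0]])   # rank-1: PSD but not PD
print(is_pd(A), is_psd(A))               # False True
```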
4. Cholesky Decomposition
Cholesky decomposition is the most important computational tool in this section. Unlike the spectral characterization (eigenvalue decomposition, $O(n^3)$ with a large constant) or LU (general, with pivoting), Cholesky is:
- Twice as fast as LU for symmetric positive definite systems ($n^3/3$ vs $2n^3/3$ operations)
- Numerically backward-stable without pivoting (which LU requires for stability)
- Reveals the square root: $L$ satisfies $LL^T = A$, i.e., $L$ is a "matrix square root" of $A$
- The canonical test for positive definiteness: it exists iff $A \succ 0$
4.1 The Theorem: Existence and Uniqueness
Theorem 4.1 (Cholesky Factorization). Let $A$ be symmetric. Then $A \succ 0$ if and only if there exists a unique lower triangular matrix $L$ with positive diagonal entries such that $A = LL^T$.
The matrix $L$ is called the Cholesky factor of $A$. The factorization $A = LL^T$ is the Cholesky decomposition.
Proof of existence (constructive). We prove by induction on $n$.
Base case ($n = 1$): $A = (a_{11})$ with $a_{11} > 0$ (since $A \succ 0$). Take $L = (\sqrt{a_{11}})$.
Inductive step: Write $A$ in block form:
$A = \begin{pmatrix} a_{11} & b^T \\ b & C \end{pmatrix}$
where $a_{11} > 0$ (Proposition 2.3), $b \in \mathbb{R}^{n-1}$, and $C \in \mathbb{R}^{(n-1) \times (n-1)}$ is symmetric.
Set $\ell_{11} = \sqrt{a_{11}}$, $\ell = b/\sqrt{a_{11}}$. Then $C - \ell\ell^T = C - bb^T/a_{11}$ is the Schur complement of $a_{11}$ in $A$.
Since $A \succ 0$, its Schur complement $S = C - bb^T/a_{11} \succ 0$ (Theorem 6.2). By the inductive hypothesis, $S = L_SL_S^T$ for unique lower triangular $L_S$ with positive diagonal.
Then:
$A = \begin{pmatrix} \ell_{11} & 0 \\ \ell & L_S \end{pmatrix}\begin{pmatrix} \ell_{11} & \ell^T \\ 0 & L_S^T \end{pmatrix} = LL^T$
Proof of uniqueness. Suppose $A = L_1L_1^T = L_2L_2^T$ with $L_1, L_2$ lower triangular and positive diagonal. Then $L_2^{-1}L_1 = L_2^TL_1^{-T}$. The left side is lower triangular; the right side is upper triangular. Both sides equal the same matrix, which must be diagonal with positive entries (since $L_1, L_2$ have positive diagonals); call it $D$. Comparing: $L_1 = L_2D$ (diagonal positive), and $D = L_2^TL_1^{-T} = (L_1^{-1}L_2)^T = D^{-1}$. But $D = D^{-1}$ means $D^2 = I$, so $D = I$, so $L_1 = L_2$.
Converse. If $A = LL^T$ with $L$ lower triangular and positive diagonal, then for $x \neq 0$: $x^TAx = \|L^Tx\|^2 > 0$ (since $L^T$ is invertible by its positive diagonal, $L^Tx \neq 0$). So $A \succ 0$.
4.2 The Algorithm: Constructing L
The Cholesky algorithm computes $L$ column by column. From $A = LL^T$, expanding element by element:
For the diagonal entry $(j, j)$:
$L_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}$
For the off-diagonal entry $(i, j)$ with $i > j$:
$L_{ij} = \frac{1}{L_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right)$
Algorithm 4.2 (Cholesky, column-wise):
Input: Symmetric A \in \mathbb{R}^{n \times n}
Output: Lower triangular L with A = LL^T, or FAILURE
for j = 1 to n:
s = A[j,j] - sum(L[j,k]^2 for k = 1..j-1)
if s \leq 0:
FAILURE (A is not positive definite)
L[j,j] = sqrt(s)
for i = j+1 to n:
L[i,j] = (A[i,j] - sum(L[i,k]*L[j,k] for k=1..j-1)) / L[j,j]
for i = 1 to j-1:
L[i,j] = 0 (upper triangle is zero)
Worked example. Compute the Cholesky factor of :
Column 1: , , .
Column 2: , .
Column 3: .
Verify: . OK
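The following is a from-scratch sketch of Algorithm 4.2 in NumPy (the 3x3 test matrix is illustrative, not the worked example above):

```python
import numpy as np

def cholesky(A):
    """Return lower triangular L with A = L L^T, or raise if A is not PD."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]          # pivot minus accumulated squares
        if s <= 0:
            raise np.linalg.LinAlgError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
L = cholesky(A)
print(np.allclose(L @ L.T, A), np.allclose(L, np.linalg.cholesky(A)))  # True True
```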
4.3 Complexity and Numerical Properties
Flop count. The Cholesky algorithm requires approximately $n^3/3$ floating-point operations - precisely half the $2n^3/3$ required by LU decomposition. For large $n$, this factor-of-2 speedup is significant.
Memory. Only the lower triangle of $A$ needs to be stored: $n(n+1)/2$ entries vs $n^2$ for general LU. In-place variants overwrite the lower triangle of $A$ with $L$.
Backward stability. Cholesky is numerically backward-stable without pivoting: the computed $\hat{L}$ satisfies $\hat{L}\hat{L}^T = A + E$ where $\|E\| \leq c_n\,\epsilon_{\text{mach}}\,\|A\|$ for a modest constant $c_n$. This is stronger than LU without pivoting (which can be unstable). The reason: each step computes $L_{jj} = \sqrt{a_{jj} - \sum_k L_{jk}^2}$, so the entries of $L$ are bounded by the diagonal of $A$ and cannot amplify errors.
Solving $Ax = b$: Factor $A = LL^T$, then solve $Ly = b$ by forward substitution and $L^Tx = y$ by backward substitution. Total cost: $n^3/3 + O(n^2)$ (factorization + two triangular solves).
Failure = non-PD. If at any step the quantity under the square root is non-positive, is not positive definite. This makes Cholesky the standard computational test for PD structure.
For AI: JAX's jax.scipy.linalg.cholesky, PyTorch's torch.linalg.cholesky, and SciPy's scipy.linalg.cholesky are all efficient LAPACK-backed implementations. In Gaussian process models, the Cholesky solve L \ b (forward substitution) appears in every prediction and log-marginal-likelihood computation.
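A short sketch of the factor-once / solve-cheaply pattern using SciPy's Cholesky helpers (the matrix and right-hand side are random, illustrative data):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 5
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)      # construct a well-conditioned PD matrix
b = rng.standard_normal(n)

c, low = cho_factor(A)           # factor once: ~n^3/3 flops
x = cho_solve((c, low), b)       # two triangular solves: ~n^2 flops
print(np.allclose(A @ x, b))     # True
```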
4.4 LDL^T Factorization
The LDL^T decomposition is a variant of Cholesky that avoids square roots - useful when the matrix is PSD but not strictly PD, or when square root computation is expensive.
Theorem 4.3 (LDL^T Factorization). Every symmetric $A$ that admits Gaussian elimination without row swaps can be uniquely factored as:
$A = LDL^T$
where $L$ is unit lower triangular ($L_{ii} = 1$) and $D = \mathrm{diag}(d_1, \dots, d_n)$.
Moreover, $A \succ 0$ if and only if all diagonal entries $d_i > 0$.
Relation to Cholesky. If $A \succ 0$, the LDL^T and Cholesky factorizations are related by:
$A = LDL^T = (LD^{1/2})(LD^{1/2})^T$
So the Cholesky factor is $L_{\text{chol}} = LD^{1/2}$ where $D^{1/2} = \mathrm{diag}(\sqrt{d_1}, \dots, \sqrt{d_n})$. The LDL^T decomposition computes the $d_i$ directly without taking square roots.
Algorithm 4.4 (LDL^T, column-wise):
for j = 1 to n:
v[1..j-1] = D[1..j-1] * L[j,1..j-1] (element-wise)
D[j] = A[j,j] - dot(L[j,1..j-1], v[1..j-1])
for i = j+1 to n:
L[i,j] = (A[i,j] - dot(L[i,1..j-1], v[1..j-1])) / D[j]
L[j,j] = 1
Worked example. Factor :
, , .
Verify: . OK
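A from-scratch sketch of Algorithm 4.4, reusing the same illustrative matrix as the Cholesky sketch so the relation $L_{\text{chol}} = LD^{1/2}$ can be checked:

```python
import numpy as np

def ldlt(A):
    """Return (L, d) with A = L diag(d) L^T, L unit lower triangular."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        v = d[:j] * L[j, :j]                       # D[k] * L[j,k] for k < j
        d[j] = A[j, j] - L[j, :j] @ v
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ v) / d[j]
    return L, d

A = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
L, d = ldlt(A)
print(np.allclose(L @ np.diag(d) @ L.T, A), d)                        # True, positive pivots
print(np.allclose(L @ np.diag(np.sqrt(d)), np.linalg.cholesky(A)))    # L_chol = L D^{1/2}
```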
When to prefer LDL^T over Cholesky:
- When $A$ is indefinite but you still want a factorization (use pivoted LDL^T with $1 \times 1$ and $2 \times 2$ pivots)
- When square root computation is numerically undesirable
- In interval arithmetic or symbolic computation
Symmetric indefinite systems. The Bunch-Kaufman (BK) algorithm extends LDL^T to indefinite matrices using $1 \times 1$ and $2 \times 2$ pivots. The LAPACK routine dsytrf implements this.
4.5 Modified Cholesky for Near-PD Matrices
In optimization (especially trust-region methods and quasi-Newton methods), we frequently need to factor a matrix that is "nearly" PD - the Hessian approximation may have small negative eigenvalues due to finite differences or insufficient curvature.
The problem. Standard Cholesky fails (negative pivot) when $A$ is not strictly PD. Rather than failing, we want a PD matrix close to $A$ that can be Cholesky-factored.
Gill-Murray-Wright (GMW) modification. At each pivot step $j$, if the computed pivot $d_j < \delta$ for a small threshold $\delta > 0$, set $d_j = \delta$ (adding a correction to the diagonal). After factorization, $LDL^T = A + E$ where $E$ is a small diagonal correction.
Simple diagonal jitter. The most widely used approach in machine learning is to add $\epsilon I$ before attempting Cholesky:
import numpy as np

def robust_cholesky(A, jitter=1e-6):
    """Cholesky with a diagonal-jitter fallback for nearly-PD matrices."""
    try:
        return np.linalg.cholesky(A)
    except np.linalg.LinAlgError:
        # retry after shifting every eigenvalue up by `jitter`
        return np.linalg.cholesky(A + jitter * np.eye(len(A)))
This is the standard "GP jitter" pattern. The added $\epsilon I$ corresponds to assuming a small observation noise or numerical regularization.
For AI: GPyTorch (a PyTorch Gaussian process library) wraps all kernel matrix Cholesky calls with diagonal jitter and a retry mechanism. A JAX implementation uses jax.scipy.linalg.cholesky and adds jitter via A += diag_jitter * jnp.eye(n). Understanding why jitter works - it adds $\epsilon$ to each eigenvalue, making all eigenvalues at least $\lambda_{\min} + \epsilon$ - is exactly positive definiteness via the spectral characterization.
5. Matrix Square Root and Whitening
5.1 The PSD Square Root
For a scalar $a \geq 0$, there is a unique non-negative square root $\sqrt{a}$. For a PSD matrix, the same uniqueness holds - but only in the class of PSD matrices.
Theorem 5.1 (PSD Square Root). Let $A$ be symmetric and positive semidefinite. Then there exists a unique symmetric positive semidefinite matrix $A^{1/2}$ such that:
$(A^{1/2})^2 = A$
If $A \succ 0$, then $A^{1/2} \succ 0$.
Proof. By the spectral theorem, $A = Q\Lambda Q^T$ where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with all $\lambda_i \geq 0$. Define:
$A^{1/2} = Q\Lambda^{1/2}Q^T$, where $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})$
Then $(A^{1/2})^2 = Q\Lambda^{1/2}Q^TQ\Lambda^{1/2}Q^T = Q\Lambda Q^T = A$. This matrix is symmetric PSD.
Uniqueness: Suppose $B^2 = A$ with $B$ symmetric PSD. Then $B$ and $A$ commute (since $BA = B^3 = AB$), so they share eigenvectors. On each shared eigenspace of $A$ with eigenvalue $\lambda$, $B$ must equal $\sqrt{\lambda}$ times the identity (the unique non-negative root). So $B = A^{1/2}$.
Warning: Non-uniqueness without PSD constraint. The equation $B^2 = A$ for $A \succ 0$ with distinct eigenvalues has $2^n$ solutions in the class of symmetric matrices (one can independently choose $\pm\sqrt{\lambda_i}$ for each eigenvalue). The PSD square root is the unique solution with all non-negative eigenvalues.
5.2 Computing Square Roots
Via eigendecomposition:
$A^{1/2} = Q\,\mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n})\,Q^T$
This requires the full eigendecomposition, costing $O(n^3)$ with a large constant. For a symmetric PSD A:
eigvals, Q = np.linalg.eigh(A) # symmetric eigendecomposition A = Q diag(eigvals) Q^T
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T # PSD square root A^{1/2}
Via Cholesky (the "left Cholesky factor"). Note: the Cholesky factor $L$ satisfies $LL^T = A$ but $L \neq A^{1/2}$ in general (unless $A$ is diagonal). The Cholesky factor is a square root of $A$ in the sense $LL^T = A$, but it is not symmetric and not equal to the eigendecomposition-based $A^{1/2}$.
Via Newton's method (for matrix functions): The iteration $X_{k+1} = \frac{1}{2}(X_k + X_k^{-1}A)$, starting from $X_0 = I$, converges to $A^{1/2}$ for $A \succ 0$. This is the matrix analogue of Heron's method for scalar square roots. Convergence is quadratic once close enough.
Via Pade approximants: For $A$ close to $I$, write $A = I + E$ and use the matrix series $\sqrt{I + E} = I + \tfrac{1}{2}E - \tfrac{1}{8}E^2 + \cdots$, truncated at some order.
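A sketch of the Newton iteration against the eigendecomposition reference (illustrative 2x2 input; note the simplified iteration can lose accuracy for ill-conditioned matrices):

```python
import numpy as np

def sqrtm_newton(A, iters=20):
    """X_{k+1} = (X_k + X_k^{-1} A) / 2, starting from X_0 = I (A assumed PD)."""
    X = np.eye(A.shape[0])
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.solve(X, A))
    return X

A = np.array([[4.0, 1.0], [1.0, 3.0]])
X = sqrtm_newton(A)
print(np.allclose(X @ X, A))                          # True

w, Q = np.linalg.eigh(A)                              # reference: Q diag(sqrt(w)) Q^T
print(np.allclose(X, Q @ np.diag(np.sqrt(w)) @ Q.T))  # True: same PSD square root
```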
5.3 Square Root vs Cholesky
The two "square root" operations serve different purposes:
| Property | Cholesky factor $L$ | PSD square root $A^{1/2}$ |
|---|---|---|
| Satisfies | $LL^T = A$ | $(A^{1/2})^2 = A$ |
| Shape | Lower triangular | Symmetric |
| Uniqueness | Unique (positive diagonal) | Unique (PSD) |
| Computation | $n^3/3$ (Cholesky) | $O(n^3)$ (full eigendecomposition) |
| Invertibility | $L^{-1}$ is lower triangular | $A^{-1/2} = Q\Lambda^{-1/2}Q^T$ is symmetric |
| Use case | Solving systems, sampling | Mahalanobis, whitening |
When to use Cholesky: solving $Ax = b$, computing $\log\det A$, sampling from $\mathcal{N}(\mu, \Sigma)$.
When to use $A^{1/2}$: defining the Mahalanobis inner product $\langle x, y\rangle_\Sigma = (\Sigma^{-1/2}x)^T(\Sigma^{-1/2}y) = x^T\Sigma^{-1}y$; whitening transforms; matrix differential equations.
5.4 Whitening Transforms
Definition 5.2 (Whitening). Given data $x$ with covariance $\Sigma \succ 0$, a whitening transform maps $x \mapsto Wx$ with $W\Sigma W^T = I$. The transformed variable has covariance $I$ (identity) - it is "white."
ZCA whitening. The ZCA (Zero-phase Component Analysis) whitening uses the symmetric square root: $W = \Sigma^{-1/2}$. The transformed data is as close to the original as possible while being white. This minimizes the change in "appearance" of the data.
Cholesky whitening. Using $W = L^{-1}$ (where $\Sigma = LL^T$) also produces white data but rotates it differently. This form is computationally cheaper.
Mahalanobis distance. For $x \sim \mathcal{N}(0, \Sigma)$, the quadratic form $x^T\Sigma^{-1}x$ is the squared Mahalanobis distance from the origin. Geometrically, it measures the distance in "standard deviation units," accounting for the correlation structure of $\Sigma$. In $n$ dimensions, $x^T\Sigma^{-1}x \sim \chi^2_n$.
For AI: Batch normalization can be viewed as approximate ZCA whitening per mini-batch. More precisely, it normalizes each feature to zero mean and unit variance (diagonal whitening), avoiding the expensive full whitening. The full Mahalanobis distance appears in metric learning (e.g., Siamese networks learning an $M \succeq 0$ such that $d(x, y) = \sqrt{(x - y)^TM(x - y)}$ is a meaningful distance) and in attention mechanisms where different layers may learn non-isotropic attention kernels.
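A sketch comparing the two whitening transforms on synthetic data (the covariance below is an assumed example); both make the sample covariance approximately the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3
Sigma_true = np.array([[2.0, 0.8, 0.3], [0.8, 1.5, 0.2], [0.3, 0.2, 1.0]])
X = rng.multivariate_normal(np.zeros(d), Sigma_true, size=n)

Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / n

w, Q = np.linalg.eigh(Sigma)
W_zca = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T   # ZCA: W = Sigma^{-1/2}
L = np.linalg.cholesky(Sigma)
W_chol = np.linalg.inv(L)                     # Cholesky whitening: W = L^{-1}

for W in (W_zca, W_chol):
    Z = Xc @ W.T
    print(np.round(Z.T @ Z / n, 2))           # both approximately the identity
```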
6. Schur Complement
The Schur complement is the "matrix analogue of completing the square" for block matrices. It appears everywhere in probability (Gaussian conditioning), linear algebra (block matrix inversion), and optimization (constraint elimination).
6.1 Definition for Block Matrices
Definition 6.1 (Schur Complement). Let
$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$
be a block matrix with $A$ invertible. The Schur complement of $A$ in $M$ is:
$S = D - CA^{-1}B$
Similarly, if $D$ is invertible, the Schur complement of $D$ is $A - BD^{-1}C$.
Origin: block Gaussian elimination. The Schur complement arises naturally when eliminating the $C$ block:
$M = \begin{pmatrix} I & 0 \\ CA^{-1} & I \end{pmatrix}\begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix}$
The $(2,2)$ block in the upper triangular factor is exactly $S = D - CA^{-1}B$.
Determinant formula. A key consequence of the block LU:
$\det M = \det(A)\,\det(D - CA^{-1}B)$
Similarly, $\det M = \det(D)\,\det(A - BD^{-1}C)$ when $D$ is invertible.
Block matrix inverse. Using the Schur complement $S = D - CA^{-1}B$:
$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1}BS^{-1}CA^{-1} & -A^{-1}BS^{-1} \\ -S^{-1}CA^{-1} & S^{-1} \end{pmatrix}$
when both $A$ and $S$ are invertible.
6.2 Schur Complement and Positive Definiteness
The Schur complement provides an elegant characterization of block PD matrices.
Theorem 6.2 (Schur PD Criterion). Let $M = \begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$ be symmetric (so $A = A^T$, $D = D^T$) with $A$ invertible. Then:
$M \succ 0 \iff A \succ 0$ and $S = D - B^TA^{-1}B \succ 0$
Proof. We use the block Cholesky-like congruence:
$M = \begin{pmatrix} I & 0 \\ B^TA^{-1} & I \end{pmatrix}\begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}$
The middle factor is block diagonal: $\mathrm{diag}(A, S)$. For any $v = (x, y)$:
$v^TMv = (x + A^{-1}By)^TA(x + A^{-1}By) + y^TSy$
($\Rightarrow$): If $M \succ 0$, taking $y = 0$ shows $A \succ 0$; taking $x = -A^{-1}By$ shows $y^TSy = v^TMv > 0$ for $y \neq 0$, so $S \succ 0$.
($\Leftarrow$): If $A \succ 0$ and $S \succ 0$, then both terms are non-negative, and at least one is positive for $v \neq 0$, so $M \succ 0$.
Corollary. $M \succeq 0 \iff A \succeq 0$ and $D - B^TA^{-1}B \succeq 0$ (when $A$ is invertible; otherwise use the pseudoinverse together with a range condition).
SCHUR COMPLEMENT AND BLOCK PD
========================================================================
M = ( A B ) symmetric
( B^T D )
M \succ 0 <=> A \succ 0 AND S = D - B^T A^{-1} B \succ 0
Intuition: "completing the square" in block form
v^T M v = (x + A^{-1} B y)^T A (x + A^{-1} B y) + y^T S y
----------------------------------- -------
\geq 0 (since A \succ 0) \geq 0 (since S \succ 0)
========================================================================
6.3 Matrix Inversion Lemma
The Woodbury matrix identity (also called the matrix inversion lemma or Sherman-Morrison-Woodbury formula) is:
$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}$
where $A \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $C \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{k \times n}$. This is valuable when $k \ll n$ (low-rank update): instead of inverting an $n \times n$ matrix, invert a $k \times k$ matrix.
Derivation via Schur complement. Consider the block matrix:
$\begin{pmatrix} A & U \\ V & -C^{-1} \end{pmatrix}$
The Schur complement of the $(1,1)$ block is $-C^{-1} - VA^{-1}U$ (with some sign manipulation). The Schur complement of the $(2,2)$ block is $A + UCV$. Equating the two expressions for the $(1,1)$ block of the inverse given by the block inverse formula yields the Woodbury identity.
Special case (rank-1 update): If $U = u$, $V = v^T$, $C = 1$ (scalar):
$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}$
This is the Sherman-Morrison formula for rank-1 updates.
For AI: The Woodbury identity is used in:
- Gaussian process prediction: the posterior involves $(K + \sigma^2 I)^{-1}$ where $K$ is the kernel matrix and $\sigma^2$ is the observation noise; with a low-rank (inducing point) approximation, Woodbury lets you invert the small $m \times m$ matrix instead of the $n \times n$ one, whichever is smaller
- LoRA / low-rank adaptation: the effective weight $W + BA$ is a rank-$r$ update; Woodbury allows efficient inversion or solves without materializing the full matrix
- Kalman filter update step: uses Woodbury to avoid inverting large state covariances
6.4 Gaussian Conditioning via Schur Complement
The Schur complement is the algebraic engine behind the conditional distribution of multivariate Gaussians.
Setup. Let $\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right)$.
Conditional distribution. The conditional is Gaussian:
$x \mid y \sim \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
where:
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \qquad \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
The conditional covariance is exactly the Schur complement of $\Sigma_{yy}$ in the joint covariance. Theorem 6.2 guarantees $\Sigma_{x|y} \succ 0$ whenever the joint covariance is PD.
Derivation. Complete the square in the joint density. The exponent:
$-\frac{1}{2}\begin{pmatrix} x - \mu_x \\ y - \mu_y \end{pmatrix}^T\Sigma^{-1}\begin{pmatrix} x - \mu_x \\ y - \mu_y \end{pmatrix}$
factored using the block inverse of $\Sigma$ (the "precision matrix" is $\Sigma^{-1}$) gives the Schur complement form.
For AI: In Gaussian process regression, the predictive distribution at new points $X_*$ given training observations $(X, y)$ uses exactly this formula:
$\mu_* = K_*^T(K + \sigma^2 I)^{-1}y, \qquad \Sigma_* = K_{**} - K_*^T(K + \sigma^2 I)^{-1}K_*$
The second term is the Schur complement. Computing it via Cholesky: factor $K + \sigma^2 I = LL^T$, solve $v = L^{-1}K_*$, then $\Sigma_* = K_{**} - v^Tv$.
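A minimal sketch of the conditioning formula on an assumed 3-dimensional joint covariance (all numbers are illustrative):

```python
import numpy as np

mu = np.array([0.0, 0.0, 0.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
ix, iy = [0], [1, 2]                     # condition the first coordinate on the last two
y_obs = np.array([0.5, -1.0])

Sxx = Sigma[np.ix_(ix, ix)]
Sxy = Sigma[np.ix_(ix, iy)]
Syy = Sigma[np.ix_(iy, iy)]

K = Sxy @ np.linalg.inv(Syy)                 # "gain" matrix Sigma_xy Sigma_yy^{-1}
mu_cond = mu[ix] + K @ (y_obs - mu[iy])      # conditional mean
Sigma_cond = Sxx - K @ Sxy.T                 # Schur complement: conditional covariance
print(mu_cond, Sigma_cond)                   # conditioning shrinks the variance
```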
7. Log-Determinant
7.1 Definition and Properties
For a positive definite matrix $A$, the log-determinant is:
$\log\det A = \log\prod_{i=1}^n \lambda_i = \sum_{i=1}^n \log\lambda_i$
Since $A \succ 0$ implies $\lambda_i > 0$ for all $i$, this is well-defined. The log-det is defined only on the interior of the PSD cone (i.e., on PD matrices).
Key computation via Cholesky:
$\log\det A = \log\det(LL^T) = 2\sum_{i=1}^n \log L_{ii}$
Since $L_{ii} > 0$ (Cholesky diagonal is positive), each logarithm is well-defined. This is the standard computational formula: factor $A = LL^T$, then sum the logs of the diagonal entries.
L = np.linalg.cholesky(A) # A = L L^T, L lower triangular
log_det_A = 2 * np.sum(np.log(np.diag(L))) # log det A = 2 * sum_i log L_ii
This is numerically more stable than np.log(np.linalg.det(A)) for large matrices, because det can underflow or overflow.
Properties:
- $\log\det(A^{-1}) = -\log\det A$, and $\log\det(C^TAC) = \log\det A + 2\log|\det C|$ (for any invertible $C$)
- $\log\det(cA) = n\log c + \log\det A$ for scalar $c > 0$
- $\log\det A = \mathrm{tr}(\log A)$, where $\log A$ is the matrix logarithm (eigendecomposition-based)
- As $A$ approaches the boundary of the PSD cone (i.e., $\lambda_{\min} \to 0$, a singular matrix): $\log\det A \to -\infty$
7.2 Log-Det as a Concave Function
Theorem 7.1. The function $f$ defined by $f(A) = \log\det A$ is strictly concave on the cone of PD matrices.
Proof. We need to show $f((1-t)A + tB) \geq (1-t)f(A) + tf(B)$ for $A, B \succ 0$ and $t \in (0, 1)$, with equality iff $A = B$.
Fix $A, B \succ 0$ and let $M = A^{-1/2}BA^{-1/2} \succ 0$. The eigenvalues of $M$ are $\mu_1, \dots, \mu_n > 0$.
Factoring: $(1-t)A + tB = A^{1/2}\left((1-t)I + tM\right)A^{1/2}$, so:
$\log\det\left((1-t)A + tB\right) = \log\det A + \sum_i \log\left((1-t) + t\mu_i\right)$
Since $\log$ is strictly concave: $\log((1-t) + t\mu_i) \geq (1-t)\log 1 + t\log\mu_i = t\log\mu_i$.
Summing: $\log\det((1-t)A + tB) \geq \log\det A + t\log\det M = (1-t)\log\det A + t\log\det B$.
Equality holds iff all $\mu_i = 1$, i.e., $M = I$, i.e., $A = B$.
Consequences of concavity:
- The log-det has no local maxima except at the global maximum (over any convex domain)
- Gradient methods for maximizing over a convex set converge to the global maximum
- The function is differentiable at every PD matrix (the gradient is $A^{-1}$, see 7.3)
7.3 Gradient and Matrix Calculus
Theorem 7.2 (Log-Det Gradient). Let $f(A) = \log\det A$ for $A \succ 0$. Then:
$\nabla_A \log\det A = A^{-1}$
More precisely, in the matrix calculus convention where $(\nabla_A f)_{ij} = \partial f/\partial A_{ij}$ means the $(i, j)$ entry of the gradient matrix:
$\frac{\partial \log\det A}{\partial A_{ij}} = (A^{-1})_{ji} = (A^{-1})_{ij}$ for symmetric $A$
Proof using Jacobi's formula. Jacobi's formula states $\frac{d}{dt}\det A(t) = \det A(t)\,\mathrm{tr}\!\left(A(t)^{-1}\frac{dA}{dt}\right)$. Therefore:
$d\log\det A = \frac{d\det A}{\det A} = \mathrm{tr}(A^{-1}\,dA)$
Since $df = \mathrm{tr}\!\left((\nabla_A f)^TdA\right)$, we read off $\nabla_A\log\det A = A^{-T} = A^{-1}$ for symmetric $A$.
Second derivative (Hessian operator): For the function $f(A) = \log\det A$, the second directional derivative along a symmetric direction $V$ is:
$\frac{d^2}{dt^2}\log\det(A + tV)\Big|_{t=0} = -\mathrm{tr}\!\left(A^{-1}VA^{-1}V\right) \leq 0$
The negative sign confirms concavity.
Other useful log-det derivatives:
- $\log\det(A + uu^T) = \log\det A + \log(1 + u^TA^{-1}u)$ (matrix determinant lemma; rank-1 update)
7.4 Log-Det in Machine Learning
Multivariate Gaussian log-likelihood. For $x \sim \mathcal{N}(\mu, \Sigma)$:
$\log p(x) = -\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) - \frac{1}{2}\log\det\Sigma - \frac{n}{2}\log 2\pi$
The $-\frac{1}{2}\log\det\Sigma$ term penalizes large (spread-out) distributions. When fitting $\Sigma$ to data, maximizing the log-likelihood requires differentiating through $\log\det\Sigma$, using $\nabla_\Sigma\log\det\Sigma = \Sigma^{-1}$.
Gaussian process marginal likelihood. For GP regression with kernel matrix $K$ and noise $\sigma^2$:
$\log p(y \mid X) = -\frac{1}{2}y^T(K + \sigma^2 I)^{-1}y - \frac{1}{2}\log\det(K + \sigma^2 I) - \frac{n}{2}\log 2\pi$
Both terms require Cholesky: factor $K + \sigma^2 I = LL^T$, then $\log\det = 2\sum_i\log L_{ii}$ and the quadratic form via triangular solves. The gradient with respect to a kernel hyperparameter $\theta$ uses $\frac{\partial}{\partial\theta}\log p(y \mid X) = \frac{1}{2}\mathrm{tr}\!\left((\alpha\alpha^T - (K + \sigma^2 I)^{-1})\frac{\partial K}{\partial\theta}\right)$ where $\alpha = (K + \sigma^2 I)^{-1}y$.
Normalizing flows. A normalizing flow defines a bijective mapping $x = f(z)$ where $z \sim \mathcal{N}(0, I)$. The log-likelihood of data $x$ is:
$\log p(x) = \log p_z\!\left(f^{-1}(x)\right) + \log\left|\det J_{f^{-1}}(x)\right|$
where $J_{f^{-1}}$ is the Jacobian of the inverse map. Efficiently computing $\log|\det J|$ is the central computational challenge of normalizing flows. Architectures like RealNVP (triangular Jacobian, $O(n)$ log-det) and FFJORD (stochastic trace estimator) are designed specifically to make this computation tractable.
Log-det estimators for large matrices. When $A$ is very large (e.g., a kernel matrix for millions of training points), exact Cholesky is intractable. Randomized log-det estimators use the identity $\log\det A = \mathrm{tr}(\log A)$ and the stochastic (Hutchinson) trace estimator $\mathrm{tr}(\log A) \approx \frac{1}{m}\sum_{j=1}^m z_j^T(\log A)z_j$ with random probe vectors $z_j$, where $(\log A)z_j$ is computed via Lanczos iterations. This is used in scalable GP libraries (GPyTorch, 2018).
8. Gram Matrices and Kernel Connections
8.1 The Gram Matrix Construction
Definition 8.1 (Gram Matrix). Let $x_1, \dots, x_m \in \mathbb{R}^d$ be a collection of vectors. Their Gram matrix $G \in \mathbb{R}^{m \times m}$ is:
$G_{ij} = x_i^Tx_j$
In matrix form: if $X \in \mathbb{R}^{m \times d}$ (rows are data points), then $G = XX^T$.
Theorem 8.2. Every Gram matrix is positive semidefinite. Moreover, $G \succ 0$ iff $x_1, \dots, x_m$ are linearly independent.
Proof. For any $c \in \mathbb{R}^m$:
$c^TGc = \sum_{i,j} c_ic_j\,x_i^Tx_j = \left\|\sum_i c_ix_i\right\|^2 \geq 0$
So $G \succeq 0$. Equality holds iff $\sum_i c_ix_i = 0$ for some $c \neq 0$, i.e., iff the vectors are linearly dependent. So $G \succ 0$ iff they are linearly independent.
Corollary. $\mathrm{rank}(G) = \mathrm{rank}(X)$, the number of linearly independent data vectors.
Examples:
- (standard basis): .
- : , so but .
- , invertible: .
8.2 Every PSD Matrix is a Gram Matrix
Theorem 8.3. $A \succeq 0$ if and only if $A$ is the Gram matrix of some set of vectors in some inner product space.
Proof. ($\Leftarrow$): proved above. ($\Rightarrow$): If $A \succeq 0$ with $\mathrm{rank}(A) = r$, let $B$ be a Cholesky-like factor: $A = BB^T$ (where $B$ may be rectangular, $B \in \mathbb{R}^{n \times r}$, e.g., a truncated eigendecomposition factor $Q_r\Lambda_r^{1/2}$ or a pivoted Cholesky factor). Take $x_i$ to be the $i$-th row of $B$. Then $A_{ij} = x_i^Tx_j$.
This is a deep result: the class of PSD matrices and the class of Gram matrices are exactly the same. Any PSD matrix can be "explained" as a matrix of inner products between some set of vectors.
Feature maps. If $\phi : \mathcal{X} \to \mathbb{R}^D$ is a feature map, then the Gram matrix $K_{ij} = \phi(x_i)^T\phi(x_j)$ is PSD. Kernel methods replace $\phi(x_i)^T\phi(x_j)$ with a kernel evaluation $k(x_i, x_j)$ directly, avoiding explicit feature computation.
8.3 Kernel Matrices and Mercer's Theorem
A positive definite kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for every finite set $\{x_1, \dots, x_m\} \subset \mathcal{X}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD.
Mercer's Theorem (informal statement). A continuous, symmetric function $k$ is a positive definite kernel (i.e., produces PSD Gram matrices for any finite set) if and only if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that:
$k(x, y) = \langle\phi(x), \phi(y)\rangle_{\mathcal{H}}$
This is the mathematical foundation of the kernel trick: instead of explicitly computing $\phi(x)$ (which may be infinite-dimensional), we evaluate $k(x, y)$ directly.
Forward reference: The full treatment of Mercer's theorem, reproducing kernel Hilbert spaces (RKHS), and the kernel trick appears in Chapter 12: Functional Analysis.
Standard PD kernels:
- Linear kernel: $k(x, y) = x^Ty$ (standard Gram matrix)
- RBF/Gaussian kernel: $k(x, y) = \exp\!\left(-\|x - y\|^2/(2\ell^2)\right)$
- Polynomial kernel: $k(x, y) = (x^Ty + c)^d$ for $c \geq 0$, $d \in \mathbb{N}$
- Matern kernel: used in GP regression with controllable smoothness
Verifying kernel validity. For a proposed kernel $k$, the standard empirical test is:
- Compute the Gram matrix $K_{ij} = k(x_i, x_j)$ for a representative test set
- Check $\lambda_{\min}(K) \geq 0$ up to numerical tolerance (e.g., via np.all(np.linalg.eigvalsh(K) >= -1e-10))
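A sketch of that test for the RBF kernel (random test points, illustrative lengthscale):

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T     # pairwise squared distances
    return np.exp(-d2 / (2 * lengthscale**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = rbf_kernel(X)
eigs = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigs.min())                 # tiny, possibly slightly negative
print("PSD up to tolerance:", bool(np.all(eigs >= -1e-10)))
```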
8.4 Attention Scores as Gram Matrices
The scaled dot-product attention mechanism in transformers computes:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
where $Q, K \in \mathbb{R}^{n \times d_k}$ are query and key matrices.
The score matrix $S = QK^T/\sqrt{d_k}$ is a scaled Gram matrix (but with different row-spaces for $Q$ and $K$, so not necessarily symmetric or PSD). In the special case of self-attention with tied weights $W_Q = W_K$, $S$ is proportional to the PSD Gram matrix $XW_QW_Q^TX^T$.
Why the $1/\sqrt{d_k}$ scaling. If $q$ and $k$ have independent entries from $\mathcal{N}(0, 1)$, then each product $q_ik_i$ has variance 1 and the dot product $q^Tk$ has variance $d_k$. The rescaling by $1/\sqrt{d_k}$ brings the variance back to 1, preventing the softmax from saturating into near-one-hot distributions.
Attention as kernel regression. The attention output for query $q$ is:
$\mathrm{out}(q) = \sum_j \frac{\exp(q^Tk_j/\sqrt{d_k})}{\sum_l \exp(q^Tk_l/\sqrt{d_k})}\,v_j$
This is a Nadaraya-Watson kernel regression with an exponential kernel $\kappa(q, k_j) = \exp(q^Tk_j/\sqrt{d_k})$. The attention weights are the normalized kernel similarities, and the output is a kernel-weighted average of values. The exponential of the dot product is related to the RBF kernel (by the Johnson-Lindenstrauss / random Fourier features perspective used in Performer / FAVOR+).
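A small sketch of the scaling argument (random queries and keys with illustrative dimensions): the raw scores have variance close to $d_k$, the scaled scores close to 1, and the unscaled softmax saturates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 256, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

raw = Q @ K.T
scaled = raw / np.sqrt(d_k)
print("var(raw scores)   ~ d_k:", raw.var())      # ~64
print("var(scaled scores) ~ 1 :", scaled.var())   # ~1

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

print("max weight (raw):   ", softmax(raw).max())      # close to 1 (saturated)
print("max weight (scaled):", softmax(scaled).max())   # much smaller
```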
9. The PSD Cone and Semidefinite Programming
9.1 The Cone of PSD Matrices
The set of all symmetric positive semidefinite $n \times n$ matrices is denoted $\mathcal{S}^n_+$ (or $\mathbb{S}^n_+$). It lives inside the vector space $\mathcal{S}^n$ of real symmetric matrices (which has dimension $n(n+1)/2$).
Theorem 9.1. $\mathcal{S}^n_+$ is a proper convex cone:
- Convex: If $A, B \succeq 0$ and $\theta \in [0, 1]$, then $\theta A + (1-\theta)B \succeq 0$.
- Cone: If $A \succeq 0$ and $c \geq 0$, then $cA \succeq 0$.
- Pointed: $A \succeq 0$ and $-A \succeq 0$ implies $A = 0$ (no lines through the origin in the cone).
- Closed: $\mathcal{S}^n_+$ is a closed subset of $\mathcal{S}^n$ (limits of PSD sequences are PSD).
- Full-dimensional: The interior of $\mathcal{S}^n_+$ is the set of PD matrices, which is non-empty.
Proof of convexity: For any $x$: $x^T(\theta A + (1-\theta)B)x = \theta\,x^TAx + (1-\theta)\,x^TBx \geq 0$.
The boundary. The boundary consists of PSD matrices with at least one zero eigenvalue. These are rank-deficient PSD matrices. The boundary has lower dimension: the set of rank-$r$ PSD matrices ($r < n$) is a manifold of dimension $nr - r(r-1)/2$.
Low-dimensional picture. For $n = 2$: $\mathcal{S}^2$ is 3-dimensional (coordinates $(a, b, c)$ for $\begin{pmatrix} a & b \\ b & c \end{pmatrix}$). The PSD cone is the set where $a \geq 0$, $c \geq 0$, and $ac \geq b^2$ - an ice cream cone in 3D, whose interior is the PD matrices.
Dual cone. The dual of $\mathcal{S}^n_+$ with respect to the Frobenius inner product $\langle X, Y\rangle = \mathrm{tr}(XY)$ is:
$(\mathcal{S}^n_+)^* = \{Y \in \mathcal{S}^n : \mathrm{tr}(XY) \geq 0 \text{ for all } X \succeq 0\}$
The PSD cone is self-dual: $(\mathcal{S}^n_+)^* = \mathcal{S}^n_+$. This is analogous to the non-negative reals being self-dual.
9.2 Semidefinite Programming
Semidefinite programming (SDP) is the optimization of a linear objective over the intersection of the PSD cone with an affine set:
Standard form SDP (primal):
minimize $\langle C, X\rangle$ subject to $\langle A_i, X\rangle = b_i$ for $i = 1, \dots, m$, and $X \succeq 0$.
Here $C, A_1, \dots, A_m \in \mathcal{S}^n$ and $b \in \mathbb{R}^m$ are the problem data; $X \in \mathcal{S}^n$ is the decision variable.
Dual SDP:
maximize $b^Ty$ subject to $\sum_{i=1}^m y_iA_i \preceq C$.
Duality. Weak duality always holds: $b^Ty \leq \langle C, X\rangle$ for any primal-dual feasible pair. Strong duality (primal = dual) holds under Slater's condition: if the primal is strictly feasible (some $X \succ 0$ satisfies all constraints), then strong duality holds and the dual optimum is attained.
Relation to other optimization problems. SDP generalizes:
- Linear programming (LP): LP is SDP with diagonal matrices
- SOCP (second-order cone programming): SOCP is a special SDP
- Quadratically constrained QP: Many QCQPs can be lifted to SDPs via semidefinite relaxation
Algorithms. Interior-point methods (barrier methods) solve SDPs in polynomial time; the per-iteration cost grows quickly with the matrix size $n$ and the number of constraints $m$. Standard solvers: SCS, MOSEK, CVXPY (modelling layer).
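A small SDP sketch using CVXPY (assumes cvxpy and its bundled default solver are installed; the input matrix is illustrative): project an indefinite "almost correlation" matrix onto the PSD matrices with unit diagonal.

```python
import numpy as np
import cvxpy as cp

C = np.array([[ 1.0,  0.9,  0.7],
              [ 0.9,  1.0, -0.9],
              [ 0.7, -0.9,  1.0]])        # symmetric, unit diagonal, but indefinite

X = cp.Variable((3, 3), PSD=True)         # decision variable constrained to the PSD cone
prob = cp.Problem(cp.Minimize(cp.norm(X - C, "fro")),
                  [cp.diag(X) == 1])
prob.solve()
print("nearest correlation matrix:\n", np.round(X.value, 3))
print("min eigenvalue:", np.linalg.eigvalsh(X.value).min())   # >= 0 up to solver tolerance
```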
9.3 SDP in Machine Learning
Max-cut relaxation (Goemans-Williamson, 1995). The max-cut problem on a graph with edge weights $w_{ij}$ is NP-hard. The Goemans-Williamson SDP relaxation gives a 0.878-approximation:
maximize $\frac{1}{4}\sum_{i,j} w_{ij}(1 - X_{ij})$ subject to $X_{ii} = 1$ and $X \succeq 0$.
The solution satisfies $X_{ij} = v_i^Tv_j$ for some unit vectors $v_i$; random hyperplane rounding recovers a near-optimal cut.
Metric learning. Learning a Mahalanobis distance $d_M(x, y) = \sqrt{(x - y)^TM(x - y)}$ requires $M \succeq 0$. Methods like ITML (Information-Theoretic Metric Learning) and SDML formulate this as an SDP over PSD matrices with constraints that similar pairs are close and dissimilar pairs are far.
Covariance estimation. In high dimensions ($d > n$), the sample covariance may be singular. Regularized covariance estimation (graphical lasso, precision matrix estimation) adds a sparsity penalty:
maximize $\log\det\Theta - \mathrm{tr}(S\Theta) - \lambda\|\Theta\|_1$ over $\Theta \succ 0$,
where $S$ is the sample covariance and $\|\Theta\|_1$ is the element-wise $\ell_1$ norm. This is not an SDP directly, but the PSD constraint is the core structural requirement.
Fairness constraints. In algorithmic fairness, PSD constraints arise as necessary conditions for disparate impact compliance. Certified defenses against adversarial examples (semidefinite relaxations of neural network verification) are large-scale SDPs; solvers like α,β-CROWN reformulate them via Lagrangian relaxation.
10. Applications in Machine Learning
10.1 Multivariate Gaussians and Covariance Matrices
The multivariate Gaussian. A random vector $x \in \mathbb{R}^n$ has the distribution $\mathcal{N}(\mu, \Sigma)$ if:
$p(x) = \frac{1}{(2\pi)^{n/2}(\det\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$
For this density to be a valid (normalized, integrable) probability distribution, $\Sigma$ must be symmetric positive definite. The three requirements:
- Symmetry: $\Sigma = \Sigma^T$ (covariance is symmetric by definition)
- Positive definiteness: $\Sigma \succ 0$ ensures $\det\Sigma > 0$ (normalizing constant finite) and $\Sigma^{-1}$ exists (the exponent is a proper quadratic form)
- If $\Sigma \succeq 0$ but singular: The distribution becomes degenerate - supported on an affine subspace, not all of $\mathbb{R}^n$
Parameterizing covariances in neural networks. A neural network that outputs a covariance matrix must parameterize it to be PSD. Standard approaches:
- Diagonal: $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)$ where $\sigma$ is a learned vector. Automatically PD (when the $\sigma_i$ are positive).
- Cholesky lower triangular: $\Sigma = LL^T$ where $L$ has positive diagonal (enforced via softplus on diagonal entries). This is the most expressive parameterization.
- Low-rank + diagonal: $\Sigma = UU^T + D$ where $U \in \mathbb{R}^{n \times k}$ ($k \ll n$) and $D$ diagonal positive. Woodbury allows efficient inversion.
Sampling from $\mathcal{N}(\mu, \Sigma)$: Given $\Sigma = LL^T$:
$x = \mu + Lz$, $z \sim \mathcal{N}(0, I)$
where $L$ is the Cholesky factor. This is the fundamental sampling algorithm: the Cholesky factor maps isotropic noise to correlated noise.
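A minimal sketch of that recipe (assumed mean and covariance); the empirical moments of the samples recover $\mu$ and $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)
eps = rng.standard_normal((10000, 2))
samples = mu + eps @ L.T            # each row is mu + L @ eps_i

print(np.round(samples.mean(axis=0), 2))              # ~ mu
print(np.round(np.cov(samples, rowvar=False), 2))     # ~ Sigma
```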
10.2 Fisher Information Matrix and Natural Gradient
Definition 10.1 (Fisher Information Matrix). For a statistical model $p_\theta(x)$ with parameter $\theta \in \mathbb{R}^p$, the Fisher information matrix is:
$F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta\log p_\theta(x)\,\nabla_\theta\log p_\theta(x)^T\right]$
PSD proof. $F$ is a covariance matrix of the score function $s(x) = \nabla_\theta\log p_\theta(x)$: it is $\mathbb{E}[ss^T]$ where $s$ is a random vector (with zero mean under regularity conditions). Any matrix of the form $\mathbb{E}[ss^T]$ is PSD (it is the expected outer product). In fact, $F \succ 0$ for regular statistical models (identifiable, full rank).
Natural gradient. Ordinary gradient descent uses the Euclidean metric on parameter space. The natural gradient uses the Fisher metric:
$\tilde\nabla_\theta L = F(\theta)^{-1}\nabla_\theta L$
The natural gradient is invariant to reparameterization of the model - it measures the steepest descent direction with respect to the KL divergence geometry (information geometry).
K-FAC (Kronecker-Factored Approximate Curvature). Martens & Grosse (2015) approximate the Fisher information matrix for neural networks as a block-diagonal matrix, where each block factorizes as a Kronecker product:
$F_\ell \approx A_\ell \otimes G_\ell$
where $A_\ell$ (input activation covariance) and $G_\ell$ (pre-activation gradient covariance) are the factors for layer $\ell$. Both $A_\ell$ and $G_\ell$ are PSD (covariance matrices). The Kronecker product of PSD matrices is PSD, and its inverse is $A_\ell^{-1} \otimes G_\ell^{-1}$, computable cheaply via Cholesky of each small factor separately.
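A tiny numeric check of the Kronecker identity that makes K-FAC cheap (random PD factors, illustrative sizes): only the small factors ever need inverting.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A @ A.T + np.eye(3)   # PD "activation" factor
G = rng.standard_normal((4, 4)); G = G @ G.T + np.eye(4)   # PD "gradient" factor

lhs = np.linalg.inv(np.kron(A, G))
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))          # (A (x) G)^{-1} = A^{-1} (x) G^{-1}
print(np.allclose(lhs, rhs))   # True
```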
10.3 Gaussian Process Regression
Gaussian process regression is the Bayesian non-parametric regression method that uses PD kernel matrices as the core computational object.
Model. Place a GP prior: $f \sim \mathcal{GP}(0, k)$ where $k$ is a PD kernel. Observe noisy outputs $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
Prediction. The predictive distribution at new points $X_*$ is:
$f_* \mid X, y, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)$
where:
$\mu_* = K_*^T(K + \sigma^2 I)^{-1}y, \qquad \Sigma_* = K_{**} - K_*^T(K + \sigma^2 I)^{-1}K_*$
and $K = k(X, X)$, $K_* = k(X, X_*)$, $K_{**} = k(X_*, X_*)$.
Computational core: Cholesky. Factor $K + \sigma^2 I = LL^T$. Then:
- $\alpha = L^{-T}(L^{-1}y)$ (two triangular solves)
- $\mu_* = K_*^T\alpha$ (matrix-vector product)
- $v = L^{-1}K_*$ (triangular solve, $n \times n_*$ system)
- $\Sigma_* = K_{**} - v^Tv$ (Schur complement form)
The entire GP regression computation is $O(n^3)$ via Cholesky - the classic bottleneck for large datasets, motivating sparse GP approximations (inducing points, Nystrom, SVGP).
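A sketch of the Cholesky recipe above on synthetic 1D data (RBF kernel; lengthscale, noise level, and data are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(A, B, ell=0.5):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 20))
y = np.sin(X) + 0.1 * rng.standard_normal(20)
Xs = np.linspace(-3, 3, 100)
sigma2 = 0.01

K = rbf(X, X) + sigma2 * np.eye(len(X))
Ks = rbf(X, Xs)                                  # n x n_* cross-covariance
Kss = rbf(Xs, Xs)

L = cholesky(K, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
mu_star = Ks.T @ alpha                           # predictive mean
v = solve_triangular(L, Ks, lower=True)
cov_star = Kss - v.T @ v                         # predictive covariance (Schur form)
logdet = 2 * np.sum(np.log(np.diag(L)))
log_marginal = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)
print(mu_star.shape, cov_star.shape, round(log_marginal, 2))
```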
10.4 Hessian and Loss Landscape Sharpness
Second-order characterization of minima. At a critical point $\theta^*$ of a smooth loss $L$:
- $\nabla^2 L(\theta^*) \succ 0$: strict local minimum (the loss bowl is locally strictly convex)
- $\nabla^2 L(\theta^*) \succeq 0$: local minimum or degenerate saddle (Hessian is PSD - the necessary condition)
- $\nabla^2 L(\theta^*)$ indefinite: saddle point
Sharpness. The sharpness of a minimum is $\lambda_{\max}(\nabla^2 L(\theta^*))$ - the largest eigenvalue of the Hessian. Flat minima (small sharpness) generalize better than sharp minima: a small perturbation $\delta$ with $\|\delta\| \leq \rho$ changes the loss by at most approximately $\frac{1}{2}\lambda_{\max}\rho^2$ (by Taylor expansion and the PSD bound $\delta^TH\delta \leq \lambda_{\max}\|\delta\|^2$).
SAM (Sharpness-Aware Minimization). Foret et al. (2021) propose:
$\min_\theta\ \max_{\|\epsilon\| \leq \rho} L(\theta + \epsilon)$
The inner max finds the worst-case perturbation (solved approximately as $\epsilon^* \approx \rho\,\nabla L/\|\nabla L\|$). SAM is a first-order approximation to minimizing sharpness and is reported to improve generalization on image and language tasks.
Gauss-Newton and PSD Hessian approximations. The true Hessian may be indefinite during training (early stages, overparameterized models). The Gauss-Newton matrix $J^TJ$ (where $J$ is the Jacobian of the predictions with respect to the parameters) is always PSD and is often a better preconditioner. K-FAC uses the Gauss-Newton approximation.
10.5 Reparameterization Trick in VAEs
The variational autoencoder (VAE) requires sampling from a distribution where and are outputs of an encoder neural network, and differentiating through the sampling process.
The reparameterization trick. Instead of sampling $z \sim \mathcal{N}(\mu, \Sigma)$ directly (which blocks gradient flow), write:
$z = \mu + L\epsilon, \qquad \Sigma = LL^T$
where $\epsilon \sim \mathcal{N}(0, I)$. Now $z$ is a deterministic function of $(\mu, L, \epsilon)$, and gradients can flow through $\mu$ and $L$.
Diagonal VAE (standard). Most VAE implementations use diagonal covariance: $\Sigma = \mathrm{diag}(\sigma^2)$, so $L = \mathrm{diag}(\sigma)$ and $z = \mu + \sigma \odot \epsilon$.
Full covariance VAE. For a full covariance Cholesky parameterization, the encoder outputs:
- Mean $\mu(x)$
- Lower triangular $L(x)$ with positive diagonal (e.g., diagonal entries via softplus, off-diagonal unconstrained)
Then $z = \mu + L\epsilon$ and $\Sigma = LL^T$. The gradient through $L$ back to the encoder parameters flows via:
$\frac{\partial z_i}{\partial L_{ij}} = \epsilon_j$
(the outer product of the sampled noise with the identity).
For normalizing flows. More expressive VAE variants use normalizing flows for the posterior $q(z \mid x)$. The flow is a sequence of invertible maps; the log-det Jacobian of each map must be computed efficiently. Triangular Jacobians (e.g., masked autoregressive flow / IAF) achieve $O(n)$ log-det computation.
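A NumPy-only sketch of the full-covariance reparameterization (in a real VAE the unconstrained outputs below would come from the encoder network and gradients would flow via autograd; everything here is illustrative):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
d = 3
mu = rng.standard_normal(d)              # stand-in for the encoder mean output
raw = rng.standard_normal((d, d))        # stand-in for unconstrained encoder output

L = np.tril(raw, k=-1)                           # strictly lower triangle: unconstrained
L[np.diag_indices(d)] = softplus(np.diag(raw))   # positive diagonal via softplus

eps = rng.standard_normal((10000, d))
z = mu + eps @ L.T                       # z = mu + L eps, deterministic in (mu, L, eps)

Sigma = L @ L.T                          # implied covariance, PD by construction
print(np.round(np.cov(z, rowvar=False), 2))
print(np.round(Sigma, 2))                # the two should approximately match
```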
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Checking only diagonal entries to test PD | Positive diagonal is necessary but not sufficient. $\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$ has positive diagonal but is indefinite ($\det = -3 < 0$). | Use Sylvester's criterion (all leading minors > 0) or attempt Cholesky. |
| 2 | Concluding $A \succ 0$ from $\det A > 0$ alone | $\det A > 0$ is necessary but not sufficient. $A = -I$ in even dimension has $\det A > 0$ but is negative definite. | Check all leading principal minors, not just the full determinant. |
| 3 | Assuming $A^TA \succ 0$ for any $A$ | $A^TA \succeq 0$ always, but $A^TA \succ 0$ iff $A$ has full column rank. If $A$ has a null vector ($Av = 0$, $v \neq 0$), then $v^TA^TAv = 0$. | Verify $\mathrm{rank}(A)$ equals the number of columns. |
| 4 | Confusing the Cholesky factor $L$ with the PSD square root $A^{1/2}$ | $L$ is lower triangular; $A^{1/2}$ is symmetric. Both satisfy "squared = $A$" but in different senses ($LL^T = A$ vs $(A^{1/2})^2 = A$). They are equal only when $A$ is diagonal. | Use $L$ for solving/sampling; use $A^{1/2}$ for Mahalanobis/whitening. |
| 5 | Using np.log(np.linalg.det(A)) for large | np.linalg.det can overflow/underflow for large (products of many numbers). | Use Cholesky: 2 * np.sum(np.log(np.diag(np.linalg.cholesky(A)))). |
| 6 | Forgetting Sylvester's criterion does NOT characterize PSD | Sylvester requires all leading principal minors > 0, which gives PD not PSD. For PSD, you need all principal minors (not just leading ones) - a much larger set. | For PSD testing, use eigenvalues or attempt pivoted Cholesky. |
| 7 | Inverting the Loewner order incorrectly | Students often think $A \succeq B \succ 0$ implies $A^{-1} \succeq B^{-1}$. The correct fact is $A^{-1} \preceq B^{-1}$ (inversion reverses the order). | Remember: larger matrix -> smaller inverse (as for positive scalars: $a \geq b > 0 \Rightarrow 1/a \leq 1/b$). |
| 8 | Adding jitter without checking scale | If the entries of $K$ are many orders of magnitude larger than the jitter, the relative correction is negligible - effectively zero numerically. | Set jitter proportional to the scale: $\epsilon \cdot \mathrm{mean}(\mathrm{diag}(K))$ for some small $\epsilon$ (e.g., $10^{-6}$). |
| 9 | Assuming the kernel matrix stays PD after arbitrary operations | Not every pointwise operation on PD kernels preserves PD structure. E.g., the difference $k_1 - k_2$ of two valid kernels may not be PD. | Verify the Gram matrix or use closure properties: sum/product/exponentiation of PD kernels are PD. |
| 10 | Computing the Schur complement with singular $A$ | The Schur complement $D - B^TA^{-1}B$ requires $A$ invertible. If $A$ is only PSD (rank-deficient), $A^{-1}$ does not exist. | Use the Moore-Penrose pseudoinverse: $D - B^TA^{\dagger}B$, though the PD characterization no longer holds as stated (a range condition is also needed). |
| 11 | Treating "positive definite" as a non-symmetric matrix property | PD is only defined for symmetric matrices. An arbitrary matrix with all positive eigenvalues may not have a positive quadratic form (complex eigenvalues, non-symmetric). | Always symmetrize: replace with before testing PD. |
| 12 | Thinking PD constraint is automatically satisfied by a NN output | A neural network outputting numbers does not automatically produce a valid lower triangular with positive diagonal. | Enforce: use softplus on the diagonal entries; leave off-diagonal unconstrained. Then is guaranteed PD. |
12. Exercises
Exercise 1 * - Verify PD Axioms for $X^TX + \lambda I$
Let $X \in \mathbb{R}^{m \times n}$ and $\lambda > 0$. Define $A = X^TX + \lambda I$.
(a) Prove $A \succ 0$ using the definition $x^TAx > 0$ for $x \neq 0$.
(b) What is the smallest eigenvalue of $A$? Express it in terms of the singular values of $X$ and $\lambda$.
(c) What happens as $\lambda \to 0$? Under what condition on $X$ does $X^TX$ itself become PD?
(d) Show $A \succeq \lambda I$ using the Loewner order.
(e) AI connection: Identify $A$ as the normal equations matrix for ridge regression. Why does $\lambda > 0$ guarantee a unique ridge solution regardless of the rank of $X$?
Exercise 2 * - Implement Cholesky from Scratch
(a) Implement the Cholesky algorithm (Algorithm 4.2) in pure NumPy without using np.linalg.cholesky. Your function should return the lower triangular satisfying , or raise a ValueError if is not PD.
(b) Test on and verify .
(c) Compare the result with np.linalg.cholesky - they should agree to machine precision.
(d) Test what happens when you apply your function to an indefinite matrix .
(e) Efficiency: What is the flop count of your implementation for input? Compare to LU factorization.
Exercise 3 * - Sylvester's Criterion
Let .
(a) Compute the three leading principal minors as functions of .
(b) Use Sylvester's criterion to find all values of for which .
(c) At the boundary values of (where transitions from PD to indefinite), what is the nullspace of ? Interpret this geometrically.
(d) Verify your answer computationally by plotting the smallest eigenvalue of as a function of .
Exercise 4 ** - LDL^T Factorization
(a) Implement the LDL^T algorithm (Algorithm 4.4) in NumPy. Your function returns unit lower triangular and diagonal such that .
(b) Test on . Verify .
(c) Using your factorization, solve for by: (i) solve , (ii) solve , (iii) solve .
(d) Explain why LDL^T avoids the numerical issues of Cholesky when is indefinite.
(e) AI connection: Why do second-order optimizers (like K-FAC) prefer working with $D$ (the pivots) rather than $\mathrm{diag}(L)$ (the Cholesky diagonal)?
Exercise 5 ** - Schur Complement and Block PD Test
Let where:
(a) Compute the Schur complement .
(b) Use Theorem 6.2 to determine whether .
(c) Compute using and verify numerically.
(d) Use the block inverse formula to compute . Verify .
(e) Gaussian conditioning: Interpret as a joint covariance matrix . What is the conditional covariance ? Interpret: does conditioning reduce uncertainty?
Exercise 6 ** - Log-Det via Cholesky and Gradient Verification
(a) Compute for using: (i) direct eigenvalues, (ii) Cholesky diagonal formula, (iii) np.log(np.linalg.det(A)). Verify all three agree.
(b) Implement a function log_det(A) using Cholesky. Compare speed vs eigenvalue method for random PD matrices.
(c) Numerically verify the gradient formula using finite differences: .
(d) For the Gaussian log-likelihood: given 50 samples from where from part (a), compute the log-likelihood where is the sample covariance.
Exercise 7 *** - Gram Matrix / Kernel Matrix and Mercer Check
(a) Given 5 points (design your own interesting configuration), compute the Gram matrix .
(b) Verify by checking eigenvalues. Explain why .
(c) Implement and test three kernels on these points: (i) linear , (ii) RBF , (iii) polynomial . For each, compute the kernel matrix and verify PSD.
(d) Now try the "not-a-kernel" (Laplacian kernel IS a valid kernel) vs (NOT always a valid kernel). Determine empirically whether each produces a PSD Gram matrix.
(e) AI connection: Implement a simple kernel regression: given training inputs and outputs , compute predictions at test points using (Nadaraya-Watson/GPR predictive mean).
Exercise 8 *** - Cholesky Reparameterization for VAE Sampling
(a) Implement a function sample_mvn(mu, L, n_samples) that samples vectors from using the reparameterization trick , .
(b) Let and . Compute . Sample 1000 points, compute the sample mean and covariance, and verify they match and (up to sampling noise).
(c) Simulate a VAE encoder: the encoder takes a scalar input and outputs and a lower triangular (a diagonal for simplicity). Sample and pass it through a simple decoder . Compute the ELBO:
where .
(d) Verify the KL formula using both the analytical expression and Monte Carlo estimation from samples.
(e) Gradient check: Confirm that the reparameterization trick allows gradients to flow through and (not ) by computing analytically and comparing with finite differences.
13. Why This Matters for AI (2026 Perspective)
| Concept | Concrete AI/LLM Role | Where / Method |
|---|---|---|
| Positive definite covariance | Valid multivariate Gaussian distributions | VAEs (Kingma & Welling, 2013); diffusion models (DDPM, score matching) |
| Cholesky factorization | Reparameterization trick for differentiable sampling | All variational inference; NUTS-HMC sampler in Pyro, NumPyro |
| LDL^T factorization | Stable Hessian factorization in trust-region optimizers | L-BFGS, trust-region Newton; scipy.optimize |
| Schur complement | Conditional Gaussian inference; GP prediction | GPyTorch, GPflow; Bayesian neural networks |
| Matrix inversion lemma (Woodbury) | Low-rank updates without full matrix inversion | LoRA weight merging; Kalman filter; GP with inducing points |
| Log-determinant | Normalizing constant in Gaussian log-likelihood | GP hyperparameter optimization; normalizing flows (RealNVP, Glow) |
| Gradient of log-likelihood w.r.t. covariance | Fitting GP kernel parameters; structured covariance learning | |
| Gram matrix / kernel matrix | Kernel methods, self-attention similarity | SVM training; transformer attention; Performer / FAVOR+ |
| Fisher information matrix | Riemannian metric on parameter space | K-FAC optimizer (Google Brain, 2015); natural gradient variational inference |
| PSD Hessian at minima | Second-order optimality condition | SAM (flatness-seeking optimizer); loss landscape analysis |
| Sharpness | Generalization proxy; flat minima theory | SAM, ASAM, Stein Self-Repulsive (2024) |
| PSD cone / SDP | Constrained optimization; certified robustness | α-CROWN, β-CROWN neural network verification; fairness constraints |
| Mercer's theorem | Theoretical foundation for kernel machines | SVMs; kernel PCA; connections to infinite-width NTK theory |
| Cholesky reparameterization | Parameterizing full covariance in deep models | Deep covariance estimation; Matrix-variate distributions in meta-learning |
| Log-det stochastic estimator | Scalable GP training on millions of points | GPyTorch (Gardner et al., 2018); SVGP; stochastic trace estimation |
14. Conceptual Bridge
Looking Back
This section builds on a foundation laid across the previous six sections of Advanced Linear Algebra. From eigenvalues and eigenvectors (01), we borrow the spectral theorem - the fact that symmetric matrices diagonalize orthogonally - which provides the cleanest proof that positive definiteness is equivalent to positive eigenvalues. From SVD (02), we inherit the language of singular values and the connection between the spectral norm, condition number, and the stability of Cholesky. The orthogonality theory from 05 underpins the proof that $A^{1/2}$ is symmetric (via the spectral theorem) and that the PSD square root is well-defined and unique. Matrix norms from 06 give precise bounds on how a PD matrix behaves numerically: the condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ measures how far $A$ is from singular, and ill-conditioned PD matrices lead to numerical difficulties in Cholesky.
Most fundamentally, positive definite matrices are the matrices that have all four of the desirable properties we want from a symmetric matrix: uniquely solvable linear systems (invertible), well-defined quadratic energy (positive), computable square root (Cholesky), and monotone response to matrix operations (Loewner order). Understanding this combination of symmetry and positivity is the analytical key to all probabilistic machine learning.
Looking Forward
Positive definite matrices are the entry point to the next major development: 08 (Matrix Decompositions) uses the Cholesky factorization developed here as one member of a family that also includes LU (general matrices) and QR (orthogonal factorization). The Cholesky decomposition is treated here in full generality and only briefly revisited in 08 - by the time students reach 08, the Cholesky algorithm should feel familiar.
The deeper application of positive definiteness comes later in the curriculum. In Chapter 6 (Probability Theory), covariance matrices appear in every multivariate distribution. In Chapter 8 (Optimization), the second-order conditions for a minimum require $\nabla^2 f(x^*) \succeq 0$ (necessary) and $\nabla^2 f(x^*) \succ 0$ (sufficient), and convexity of a function is equivalent to its Hessian being PSD everywhere. In Chapter 12 (Functional Analysis), Mercer's theorem and RKHS theory extend PSD kernels to infinite-dimensional inner product spaces. The PSD cone and semidefinite programming previewed in Section 9 appear extensively in Chapter 8 (Optimization) and in the robotics/control applications in later chapters.
Curriculum Position
POSITIVE DEFINITE MATRICES IN THE CURRICULUM
========================================================================
Chapter 2: Linear Algebra Basics
+-- 01 Vectors and Spaces -> inner products, quadratic forms
+-- 04 Determinants -> Sylvester's criterion
+-- 06 Vector Spaces -> subspaces, null space
Chapter 3: Advanced Linear Algebra
+-- 01 Eigenvalues -> spectral characterization of PD
+-- 02 SVD -> singular values, condition number
+-- 05 Orthogonality -> spectral theorem, square root
+-- 06 Matrix Norms -> condition number, jitter scale
+-- 07 Positive Definite <== YOU ARE HERE
| | Cholesky, LDL^T, Schur, log-det, Gram, SDP
+-- 08 Matrix Decompositions -> LU, QR, brief Cholesky recap
Downstream:
+-- Ch. 6 Probability: Gaussian covariance Σ ≻ 0
+-- Ch. 8 Optimization: Hessian ≻ 0 at local minima; SDP
+-- Ch. 12 Functional Analysis: Mercer, RKHS
+-- Ch. 13 ML Math: K-FAC, VAE, GP, normalizing flows
========================================================================
The central insight to carry forward: positive definiteness is not a restriction but a characterization. A matrix that fails to be PD is not "almost OK" - it is categorically different, with no unique minimum, no valid probability interpretation, no Cholesky factorization. The condition $A \succ 0$ is the boundary between well-posedness and ill-posedness in a wide range of mathematical problems, and recognizing that boundary - and the tools to work with it (Cholesky, Schur, log-det) - is an essential skill for the working ML practitioner.
Appendix A: Closure Properties and Algebraic Operations
A.1 Preserving Positive Definiteness Under Operations
Understanding which operations preserve PD/PSD structure is practically important - it tells you when a composed matrix is guaranteed to be PD without recomputing.
Direct results:
| Operation | Inputs | Result | Condition |
|---|---|---|---|
| $A + B$ | $A \succ 0$, $B \succ 0$ | $\succ 0$ | Always |
| $A^{-1}$ | $A \succ 0$ | $\succ 0$ | Always |
| $cA$ | $A \succ 0$, $c > 0$ | $\succ 0$ | Always |
| $B^T A B$ | $A \succ 0$, $B$ invertible | $\succ 0$ | Always |
| $B^T A B$ | $A \succ 0$, $B$ full column rank | $\succ 0$ | Always |
| $X^T X$ | any $X$ | $\succeq 0$ | Always |
| $A^{1/2}$ | $A \succeq 0$ | $\succeq 0$ | Always |
| $A^t$ | $A \succ 0$, $t \in \mathbb{R}$ | $\succ 0$ | Matrix power via $\exp(t \log A)$ |
| $AB$ | $A \succ 0$, $B \succ 0$ | NOT necessarily PSD (generally not symmetric) | Only if $AB = BA$ |
| $A \otimes B$ | $A \succ 0$, $B \succ 0$ | $\succ 0$ | Always (Kronecker product) |
| $A \oplus B$ | $A \succ 0$, $B \succ 0$ | $\succ 0$ | Always (Kronecker sum / block diagonal) |
| $A \circ B$ | $A \succeq 0$, $B \succeq 0$ | $\succeq 0$ (Hadamard) | Schur product theorem |
The Schur product theorem. The Hadamard (element-wise) product of two PSD matrices is PSD. This is non-obvious! It is used in kernel methods: if $k_1$ and $k_2$ are valid kernels, their pointwise product $k_1 k_2$ is also a valid kernel.
Proof of Schur product theorem. If $A, B \succeq 0$, write $A = X^T X$ and $B = Y^T Y$ (Gram representations), with columns $x_i$ of $X$ and $y_i$ of $Y$. Then $(A \circ B)_{ij} = (x_i^T x_j)(y_i^T y_j) = (x_i \otimes y_i)^T (x_j \otimes y_j)$, so $A \circ B$ is the Gram matrix of the vectors $x_i \otimes y_i$ (Kronecker products of the generating vectors). Gram matrices are PSD.
Why $AB$ is not always PSD. Take, for instance, $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}$. Both are PD, but $AB = \begin{pmatrix} 2 & 3 \\ 1 & 6 \end{pmatrix}$, which is not symmetric (hence not PSD). The product of symmetric matrices is symmetric iff they commute.
For AI: Kronecker product structure is central to K-FAC: the blockwise Fisher approximation $F \approx A \otimes G$ is PSD because $A \succeq 0$ and $G \succeq 0$ imply $A \otimes G \succeq 0$. The Schur product theorem ensures that combining PD kernels (feature interactions) produces valid kernels.
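A quick numerical check of both facts, assuming only NumPy (the random Gram construction is illustrative): the Hadamard product of two PSD matrices has non-negative eigenvalues, while their ordinary matrix product is generally not even symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two PSD matrices built as Gram matrices X^T X and Y^T Y
X, Y = rng.standard_normal((5, 4)), rng.standard_normal((5, 4))
A, B = X.T @ X, Y.T @ Y

# Hadamard (element-wise) product stays PSD: Schur product theorem
print(np.linalg.eigvalsh(A * B).min() >= -1e-10)   # True

# Ordinary matrix product is generally not even symmetric
print(np.allclose(A @ B, (A @ B).T))               # False (almost surely)
```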
A.2 Principal Submatrix Inheritance
Theorem A.1. If $A \succ 0$, every principal submatrix of $A$ is $\succ 0$. If $A \succeq 0$, every principal submatrix is $\succeq 0$.
A principal submatrix $A_S$ of $A$ is obtained by deleting rows and columns with the same index set: $A_S = (a_{ij})_{i,j \in S}$ for $S \subseteq \{1, \dots, n\}$.
Proof. For any $y \in \mathbb{R}^{|S|}$, $y \neq 0$, embed $y$ in $\mathbb{R}^n$ via $x_i = y_i$ for $i \in S$ and $x_i = 0$ for $i \notin S$. Then $y^T A_S y = x^T A x > 0$ (since $x \neq 0$).
Consequence. Every diagonal entry $a_{ii} > 0$ for $A \succ 0$ (take $S = \{i\}$). Every $2 \times 2$ principal submatrix is PD, so $a_{ij}^2 < a_{ii} a_{jj}$ for all $i \neq j$ (Cauchy-Schwarz-type bound for matrix entries).
A.3 Indefinite Matrices and Saddle Points
For completeness: an indefinite symmetric matrix has both positive and negative eigenvalues. The corresponding quadratic form takes positive values in some directions and negative values in others. The level sets are hyperboloids.
Connection to saddle points in optimization. A critical point with indefinite Hessian is a saddle point - not a local minimum. In neural network optimization:
- Gradient descent near a saddle point may slow down dramatically (gradient is small, Hessian is indefinite)
- The negative eigenvalue directions are "escape directions" - following them reduces the loss
- Stochastic gradient descent (SGD) with noise "escapes" saddles; deterministic gradient descent can get stuck
Sylvester's law of inertia. For a symmetric matrix $A$, the signature (number of positive, negative, zero eigenvalues) is invariant under congruence transformations $A \mapsto C^T A C$ (for invertible $C$). This means no invertible change of basis can turn a saddle into a minimum - the number of descent directions is a geometric invariant of the quadratic form.
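A minimal NumPy sketch of the escape-direction picture (the indefinite matrix below is an arbitrary stand-in for a Hessian at a saddle): moving along the eigenvector of the negative eigenvalue strictly decreases the quadratic form.

```python
import numpy as np

# An indefinite symmetric matrix standing in for a Hessian at a saddle point
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                        # [-1.  2.] -> indefinite: a saddle

# The eigenvector of the negative eigenvalue is an escape direction:
# f(x) = 1/2 x^T H x strictly decreases along it.
v = eigvecs[:, 0]                     # eigenvector for lambda = -1
for t in (0.1, 0.5, 1.0):
    x = t * v
    print(t, 0.5 * x @ H @ x)         # -0.005, -0.125, -0.5
```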
Appendix B: Numerical Methods and Implementation
B.1 Cholesky for Linear System Solving
A primary application of Cholesky is solving $Ax = b$ for PD $A$.
Algorithm:
- Factor: $A = LL^T$ (Cholesky, $O(n^3/3)$ flops)
- Forward solve: $Ly = b$ (substitute from the top, $O(n^2)$)
- Backward solve: $L^T x = y$ (substitute from the bottom, $O(n^2)$)
For $k$ multiple right-hand sides $b_1, \dots, b_k$: factor once, solve $k$ times at $O(n^3/3 + k n^2)$ cost total.
Numerical example: Solve $\begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix} x = \begin{pmatrix} 6 \\ 9 \end{pmatrix}$.
Cholesky: $L = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}$ (since $\ell_{11} = \sqrt{4} = 2$, $\ell_{21} = 2/2 = 1$, $\ell_{22} = \sqrt{5 - 1} = 2$).
Forward: $2 y_1 = 6 \Rightarrow y_1 = 3$; $y_1 + 2 y_2 = 9 \Rightarrow y_2 = 3$.
Backward: $2 x_2 = 3 \Rightarrow x_2 = 1.5$; $2 x_1 + x_2 = 3 \Rightarrow x_1 = 0.75$.
Verify: $Ax = (4 \cdot 0.75 + 2 \cdot 1.5,\; 2 \cdot 0.75 + 5 \cdot 1.5) = (6, 9) = b$. OK
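The same solve in NumPy/SciPy, using the matrix from the example above; `cho_factor`/`cho_solve` call the LAPACK routines, and the explicit triangular solves mirror the forward/backward steps.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

A = np.array([[4.0, 2.0],
              [2.0, 5.0]])
b = np.array([6.0, 9.0])

# One-call route (LAPACK potrf + potrs)
x = cho_solve(cho_factor(A, lower=True), b)

# Explicit forward/backward substitution, matching the hand computation
L = np.linalg.cholesky(A)                           # A = L L^T
y = solve_triangular(L, b, lower=True)              # L y = b
x_manual = solve_triangular(L.T, y, lower=False)    # L^T x = y

print(x, x_manual)                                  # both [0.75 1.5]
```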
B.2 Cholesky Update and Downdate
When $A$ changes by a rank-1 update $A + vv^T$ (or downdate $A - vv^T$), it is wasteful to recompute the full Cholesky factorization. Rank-1 Cholesky update algorithms (LINPACK's dchud) update $L$ in $O(n^2)$ time.
Algorithm (rank-1 update, Gill-Golub-Murray-Saunders style):
Given $A = LL^T$ and $v$, compute $\tilde{L}$ such that $\tilde{L}\tilde{L}^T = A + vv^T$:
x = copy(v)
for k = 1 to n:
    r = sqrt(L[k,k]^2 + x[k]^2)
    c = r / L[k,k]
    s = x[k] / L[k,k]
    L[k,k] = r
    L[k+1:n, k] = (L[k+1:n,k] + s*x[k+1:n]) / c
    x[k+1:n] = c*x[k+1:n] - s*L[k+1:n,k]
This uses Givens rotations (-> 05: Orthogonality) to update $L$ in place.
For AI: Online learning algorithms that add one data point at a time use rank-1 Cholesky updates to maintain the Cholesky factorization of the Gram matrix. Sequential Bayesian updating (online GP regression) uses rank-1 Cholesky updates at $O(n^2)$ cost per update instead of an $O(n^3)$ full refactorization.
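A direct NumPy transcription of the update (a sketch: the helper name `chol_update` and the test matrix are ours), verified against a full refactorization of $A + vv^T$.

```python
import numpy as np

def chol_update(L, v):
    """Return L_new with L_new @ L_new.T == L @ L.T + outer(v, v), in O(n^2)."""
    L, x = L.copy(), np.asarray(v, dtype=float).copy()
    n = x.size
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # a PD test matrix
v = rng.standard_normal(5)

L_new = chol_update(np.linalg.cholesky(A), v)
print(np.allclose(L_new @ L_new.T, A + np.outer(v, v)))   # True
```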
B.3 Condition Number and Numerical Stability of Cholesky
Condition number. For $A \succ 0$, the condition number is $\kappa(A) = \lambda_{\max}(A)/\lambda_{\min}(A)$. The relative error in solving $Ax = b$ in floating point satisfies roughly:
$$\frac{\|\hat{x} - x\|}{\|x\|} \lesssim \kappa(A)\, \epsilon_{\text{machine}}.$$
For Cholesky specifically: the computed factor $\hat{L}$ satisfies $\hat{L}\hat{L}^T = A + E$ with $\|E\| = O(n\, \epsilon_{\text{machine}} \|A\|)$. This means Cholesky is unconditionally backward stable for PD matrices. In contrast, LU without pivoting can have backward errors proportional to the growth factor, which can be exponentially large.
Ill-conditioning warning. For large $\kappa(A)$:
- The log-det computation involves some very small $\ell_{kk}$ (near-zero for near-singular directions) - those log terms dominate and may lose digits
- Linear solves lose approximately $\log_{10} \kappa(A)$ digits of accuracy
- Cholesky itself remains stable; the error comes from the ill-conditioning of the problem, not the algorithm
Practical rule: If $\kappa(A) \approx 10^k$ and your floating-point arithmetic has about 16 decimal digits (double precision), you lose roughly $k$ digits in the solution. For $\kappa(A) \gtrsim 10^{16}$, the solution has no reliable digits.
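The digit-loss rule can be made concrete with the Hilbert matrix, a classic ill-conditioned PD family available as `scipy.linalg.hilbert`: Cholesky succeeds without complaint, but the solution error grows with $\kappa(A)$.

```python
import numpy as np
from scipy.linalg import hilbert, cho_factor, cho_solve

for n in (4, 8, 10):
    A = hilbert(n)                        # notoriously ill-conditioned PD matrix
    x_true = np.ones(n)
    b = A @ x_true
    x = cho_solve(cho_factor(A), b)       # Cholesky itself still succeeds
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(f"n={n:2d}  cond={np.linalg.cond(A):.1e}  rel. error={rel_err:.1e}")
```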
B.4 Sparse Cholesky
For sparse PD matrices (many zero entries, arising in graph-structured models, FEM discretizations, and spatial statistics), general Cholesky is wasteful because it may "fill in" zeros and produce a dense $L$.
The fill-in problem. Even if $A$ is sparse, $L$ may be dense. The fill-in pattern depends on the sparsity structure of $A$ and the ordering of variables.
Reordering algorithms. The approximate minimum degree (AMD) and nested dissection algorithms reorder the variables (permute $A$ to $PAP^T$ for a permutation matrix $P$) to minimize fill-in. The standard library for sparse Cholesky is SuiteSparse (CHOLMOD).
For AI: In Gaussian Markov random fields (GMRFs), the precision matrix (inverse covariance) is sparse. Sparse Cholesky enables efficient inference in spatial models, graph neural networks with Gaussian priors, and structured prediction. GPyTorch's KeOps and PyTorch Sparse backends exploit sparsity for large-scale GP approximations.
Appendix C: Positive Definiteness in Complex Vector Spaces
C.1 Hermitian Positive Definite Matrices
Over , positive definiteness generalizes to Hermitian positive definite (HPD) matrices.
Definition. $A \in \mathbb{C}^{n \times n}$ is Hermitian positive definite if:
- $A = A^H$ (where $A^H = \bar{A}^T$ is the conjugate transpose)
- $x^H A x > 0$ for all $x \in \mathbb{C}^n$, $x \neq 0$
(Note: $x^H A x$ is always real for Hermitian $A$, so the positivity condition makes sense.)
Cholesky for HPD. Every HPD matrix has a unique Cholesky factorization $A = LL^H$ where $L$ is lower triangular with positive real diagonal entries.
For AI: Complex-valued neural networks (used in signal processing, quantum computing simulation) use HPD covariance matrices. Radar signal processing, MRI reconstruction, and quantum chemistry all work in with HPD matrices.
Appendix D: Advanced Theory
D.1 The Polar Decomposition and Positive Definite Factors
Every invertible matrix $A \in \mathbb{R}^{n \times n}$ has a unique polar decomposition:
$$A = QP,$$
where $Q$ is orthogonal ($Q^T Q = I$) and $P = (A^T A)^{1/2}$ is the unique symmetric positive definite "right polar factor." This is the matrix analogue of the polar form $z = e^{i\theta} r$ for complex numbers.
Existence and uniqueness. Using the SVD $A = U \Sigma V^T$:
$$A = (UV^T)(V \Sigma V^T) = QP,$$
where $Q = UV^T$ (orthogonal) and $P = V \Sigma V^T$ (the PSD square root of $A^T A$).
Uniqueness of $P$: $P^2 = V \Sigma^2 V^T = A^T A$, so $P$ is the unique PSD square root of $A^T A$ (Theorem 5.1). Given $P$, $Q = A P^{-1}$ is uniquely determined.
For AI: The polar decomposition appears in:
- Weight matrix orthogonalization: in the Muon optimizer (Kosson et al., 2024), gradient updates are "orthogonalized" using the polar factor $Q$, which projects the weight update onto the manifold of orthogonal matrices, maintaining a stable singular value spectrum during training
- Group equivariance: equivariant neural networks (E(n)-equivariant GNNs) use orthogonal matrices as symmetry group elements; the polar decomposition provides a differentiable projection onto $O(n)$
Computing the polar factor. Newton iteration: $X_{k+1} = \tfrac{1}{2}(X_k + X_k^{-T})$ converges to $Q$ starting from $X_0 = A$. Quadratic convergence once close to $Q$. Each step costs one matrix inversion.
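A minimal check, assuming SciPy is available: `scipy.linalg.polar` returns the factors directly, and the Newton iteration above converges to the same orthogonal factor for an invertible test matrix.

```python
import numpy as np
from scipy.linalg import polar

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # invertible with probability 1

# Library route (computed via the SVD internally): A = Q P
Q, P = polar(A)
print(np.allclose(Q @ P, A), np.allclose(Q.T @ Q, np.eye(4)))   # True True

# Newton iteration for the orthogonal factor: X_{k+1} = (X_k + X_k^{-T}) / 2
X = A.copy()
for _ in range(30):
    X = 0.5 * (X + np.linalg.inv(X).T)
print(np.allclose(X, Q))                 # converges to the same Q
```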
D.2 Positive Definite Functions and Bochner's Theorem
A function $k(\delta)$, with $\delta = x - x'$, is a positive definite function if for every $n$ and every set of points $x_1, \dots, x_n$, the matrix $K$ with $K_{ij} = k(x_i - x_j)$ is PSD.
Bochner's theorem. A continuous stationary kernel $k(x, x') = k(x - x')$ is a positive definite function on $\mathbb{R}^d$ if and only if it is the Fourier transform of a non-negative finite measure $\mu$ (spectral density):
$$k(\delta) = \int_{\mathbb{R}^d} e^{i \omega^T \delta} \, d\mu(\omega).$$
Implications for kernel design:
- RBF kernel: $k(\delta) = \exp(-\|\delta\|^2 / 2\ell^2)$ is PD (its Fourier transform is a Gaussian, which is non-negative)
- Matern kernels: PD for all valid smoothness parameters
- Checking validity: A kernel is valid iff its Fourier transform is non-negative - this is the ultimate test
Random Fourier features (Rahimi & Recht, 2007). Bochner's theorem gives a randomized approximation to stationary PD kernels:
$$k(x - x') \approx \phi(x)^T \phi(x'), \qquad \phi(x) = \sqrt{\tfrac{2}{m}} \big[ \cos(\omega_i^T x + b_i) \big]_{i=1}^m,$$
where $\omega_i \sim p(\omega)$ (the spectral distribution) and $b_i \sim \mathrm{Uniform}[0, 2\pi]$. This approximates the PD kernel as a finite inner product, enabling scalable kernel methods without forming the full kernel matrix. Used in Performer transformer attention (FAVOR+, Choromanski et al., 2020).
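A small NumPy sketch of this approximation for the unit-length-scale RBF kernel (whose spectral density is a standard Gaussian); the sample sizes are arbitrary and the error shrinks as the number of features $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 2000                  # n points in d dimensions, m random features
X = rng.standard_normal((n, d))

# Exact RBF kernel K_ij = exp(-||x_i - x_j||^2 / 2), length-scale 1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq_dists)

# Random Fourier features: omega ~ N(0, I) is this kernel's spectral distribution
omega = rng.standard_normal((d, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)
Phi = np.sqrt(2.0 / m) * np.cos(X @ omega + b)
K_approx = Phi @ Phi.T                  # PSD by construction (a Gram matrix)

print(np.abs(K_exact - K_approx).max())  # small; shrinks as m grows
```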
D.3 The Determinant and Volume
Geometric interpretation of $\det A$ for $A \succ 0$.
For $A \succ 0$, the quadratic form level set $\{x : x^T A x \leq 1\}$ is an ellipsoid in $\mathbb{R}^n$. Its volume is:
$$\mathrm{Vol} = \frac{V_n}{\sqrt{\det A}}, \qquad V_n = \text{volume of the unit ball in } \mathbb{R}^n.$$
So $\det \Sigma$ measures the "volume" of the ellipsoid associated with a covariance $\Sigma$. Larger $\det \Sigma$ means a more "spread out" distribution.
In the Gaussian context: The normalizing constant of $\mathcal{N}(\mu, \Sigma)$ is $(2\pi)^{-n/2} (\det \Sigma)^{-1/2}$. This is the volume of the "effective support" ellipsoid scaled by $(2\pi)^{-n/2}$.
D-optimal experimental design. In Bayesian experimental design, the D-optimal criterion selects experiments to maximize $\det \Lambda$ - the determinant of the posterior precision matrix, which equals $1/\det \Sigma$. Maximizing $\det \Lambda$ minimizes the volume of the posterior confidence ellipsoid. This is an SDP with $\Lambda \succeq 0$ as the optimization variable.
Maximum entropy Gaussians. Among all distributions with mean $\mu$ and covariance $\Sigma$, the maximum entropy distribution is $\mathcal{N}(\mu, \Sigma)$, with entropy $\tfrac{1}{2} \log \det(2\pi e \Sigma)$. Maximizing entropy subject to a trace constraint on $\Sigma$ gives $\Sigma = \sigma^2 I$ (isotropic Gaussian) - the most "uninformed" Gaussian with bounded total variance.
D.4 Matrix Exponential and the PD Manifold
The set $\mathbb{S}^n_{++}$ of PD matrices is an open smooth manifold (in fact, a Riemannian manifold with the affine-invariant metric). This means concepts from differential geometry apply.
The matrix exponential. For any symmetric $S$, $\exp(S) = \sum_{k \geq 0} S^k/k!$ is positive definite (its eigenvalues $e^{\lambda_i(S)}$ are exponentials of real numbers, hence positive). The map $\exp : \mathbb{S}^n \to \mathbb{S}^n_{++}$ is a bijection - PD matrices are exactly exponentials of symmetric matrices.
Log-Euclidean metric. A simple Riemannian metric on $\mathbb{S}^n_{++}$ (used in diffusion tensor imaging) defines the "distance" between $A, B \succ 0$ as:
$$d_{\mathrm{LE}}(A, B) = \|\log A - \log B\|_F.$$
This treats PD matrices as if they lived in a flat space via the log map. Computations (means, geodesics, interpolations) become Euclidean operations on $\log A$ and $\log B$.
Affine-invariant metric (Riemannian). The more geometrically natural metric:
$$d_{\mathrm{AI}}(A, B) = \|\log(A^{-1/2} B A^{-1/2})\|_F$$
is invariant under congruence $A \mapsto C^T A C$, $B \mapsto C^T B C$ (affine-invariant), making it suitable for geometric statistics.
For AI: The SPD (symmetric positive definite) manifold appears in:
- Diffusion tensor MRI: each voxel has a diffusion tensor ; Riemannian means avoid non-PD artifacts
- Covariance-based classifiers: classifying neural activation covariance matrices using Riemannian geometry (SPDNet, 2017)
- Transformer positional encodings with structured covariances: attention with SPD constraints on the key-query metric tensors
D.5 Inequalities for Positive Definite Matrices
Several fundamental inequalities apply specifically to PD matrices:
Hadamard's inequality. For $A \succ 0$:
$$\det A \leq \prod_{i=1}^n a_{ii}.$$
Equality iff $A$ is diagonal. Interpretation: the volume of the ellipsoid defined by $A$ is at most the product of the diagonal entries (the volumes along coordinate axes).
Minkowski's determinant inequality. For $A, B \succ 0$:
$$\det(A + B)^{1/n} \geq \det(A)^{1/n} + \det(B)^{1/n}.$$
This is the matrix analogue of Minkowski's inequality for scalars. It follows from the concavity of $A \mapsto \det(A)^{1/n}$ on $\mathbb{S}^n_{++}$.
Ky Fan inequality. For $A, B \succ 0$ and $0 \leq \theta \leq 1$:
$$\log \det\big(\theta A + (1 - \theta) B\big) \geq \theta \log \det A + (1 - \theta) \log \det B.$$
(The sum of the log-eigenvalues of the mixture $\theta A + (1-\theta)B$ is at least the corresponding mixture of the sums from $A$ and from $B$ - this is exactly the concavity of $\log \det$ from 7.2.)
Anderson's inequality. For $\Sigma_1, \Sigma_2 \succ 0$:
$$(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \preceq \Sigma_1 \quad \text{and} \quad (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1} \preceq \Sigma_2.$$
Interpretation: combining two experiments (adding their information matrices) gives more information than either alone - the combined precision matrix is the sum of the individual inverse covariances, so the combined covariance is dominated by each individual covariance.
For AI: Hadamard's inequality provides a cheap upper bound on the log-det: $\log \det A \leq \sum_i \log a_{ii}$, useful when only the diagonal entries are cheaply computed. Minkowski's inequality is used in information-geometric proofs about combining Gaussian priors in Bayesian learning.
Appendix E: Extended Examples and Case Studies
E.1 Case Study: Covariance Estimation in High Dimensions
One of the most practically important applications of PSD matrix theory is estimating covariance matrices from data when the number of features is comparable to or larger than the number of samples .
The problem. Given $n$ samples $x_1, \dots, x_n \in \mathbb{R}^d$ (zero mean for simplicity), the sample covariance is:
$$S = \frac{1}{n} \sum_{i=1}^n x_i x_i^T.$$
$S \succeq 0$ always (it's a Gram matrix). But:
- If $d > n$: $\mathrm{rank}(S) \leq n < d$, so $S$ is singular (not PD)
- Even for $n \geq d$: $S$ can be ill-conditioned (extreme ratio of largest to smallest eigenvalue)
The Marchenko-Pastur law. For $x_i \sim \mathcal{N}(0, I_d)$ with $n, d \to \infty$ and $d/n \to \gamma \in (0, 1)$, the empirical eigenvalue distribution of $S$ converges to the Marchenko-Pastur distribution:
$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi \gamma \lambda}$$
for $\lambda \in [\lambda_-, \lambda_+]$, where $\lambda_\pm = (1 \pm \sqrt{\gamma})^2$. This means that even under the identity covariance, sample eigenvalues are spread over a wide range - the smallest eigenvalues of $S$ are much smaller than 1 and the largest are much larger than 1.
Ledoit-Wolf shrinkage. The Ledoit-Wolf estimator regularizes the sample covariance toward a multiple of the identity:
$$\hat{\Sigma}_{\mathrm{LW}} = (1 - \alpha)\, S + \alpha\, \mu I,$$
where $\mu = \mathrm{tr}(S)/d$ (the average eigenvalue) and $\alpha \in (0, 1]$ is chosen to minimize mean squared error under the Frobenius norm. The result is always PD (since $\mu > 0$ and $\alpha > 0$).
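A sketch of the effect, assuming scikit-learn is installed (its `LedoitWolf` estimator picks the shrinkage weight automatically): with fewer samples than features the sample covariance is singular, while the shrunk estimate is strictly PD.

```python
import numpy as np
from sklearn.covariance import LedoitWolf   # assumes scikit-learn is available

rng = np.random.default_rng(0)
n, d = 50, 100                       # fewer samples than features
X = rng.standard_normal((n, d))      # true covariance is the identity

S = X.T @ X / n                      # sample covariance: rank <= n < d, singular
lw = LedoitWolf().fit(X)             # shrinkage toward mu * I

print(np.linalg.eigvalsh(S).min())                    # ~0: not PD
print(np.linalg.eigvalsh(lw.covariance_).min() > 0)   # True: strictly PD
print(lw.shrinkage_)                                  # estimated shrinkage weight
```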
Graphical lasso. Assuming the precision matrix $\Theta = \Sigma^{-1}$ (inverse covariance) is sparse (many zeros encode conditional independence), the graphical lasso estimates:
$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \Big[ \mathrm{tr}(S \Theta) - \log \det \Theta + \lambda \|\Theta\|_1 \Big],$$
where $\|\Theta\|_1$ is the element-wise $\ell_1$ norm. The constraint $\Theta \succ 0$ is the PD cone constraint. The smooth part of the objective is the negative Gaussian log-likelihood (up to constants). Optimized by block coordinate descent (glasso algorithm) using the Schur complement for each block update.
For LLMs: Large language models trained with structured sparse attention patterns (local + global attention, as in Longformer and BigBird) can be understood as learning sparse precision matrices of the attention distributions.
E.2 Case Study: Gaussian Process Regression, Step by Step
Let's trace through a complete GP regression computation to see all the PD machinery in action.
Setup. Training inputs $x_1, \dots, x_n$ with targets $y \in \mathbb{R}^n$, a test point $x_*$, the RBF kernel $k(x, x') = \exp(-(x - x')^2/2)$ with length-scale 1, and noise variance $\sigma^2$. (Concrete numbers are worked out in theory.ipynb.)
Step 1: Compute the kernel matrix.
$K_{ij} = k(x_i, x_j)$, and the test vector $k_* \in \mathbb{R}^n$ with $(k_*)_i = k(x_*, x_i)$.
Step 2: Add noise and Cholesky factor.
$K_y = K + \sigma^2 I = LL^T$. The noise term guarantees $K_y \succ 0$ even when $K$ is only PSD.
Step 3: Predictive mean.
Solve $K_y \alpha = y$ with two triangular solves against $L$; then $\mu_* = k_*^T \alpha$.
Step 4: Predictive variance.
$\sigma_*^2 = k(x_*, x_*) - k_*^T K_y^{-1} k_*$, computed as $k(x_*, x_*) - \|L^{-1} k_*\|^2$ with one more triangular solve. The variance is the Schur complement of $K_y$ in the joint covariance of $(y, f_*)$ evaluated at $x_*$.
Interpretation. The Cholesky factorization appears in every step: solving for $\alpha$, computing the predictive variance, and computing the log-marginal likelihood for hyperparameter optimization. The PD condition on $K_y$ guarantees $\sigma_*^2 \geq 0$ (the Schur complement is non-negative).
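The full computation in NumPy/SciPy, with an arbitrary small dataset standing in for the training points (the helper `rbf` and the specific numbers are illustrative, not the section's worked values).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(a, b):
    """RBF kernel with length-scale 1 between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

x = np.array([-1.5, 0.0, 1.0, 2.5])          # illustrative training inputs
y = np.array([-1.0, 0.2, 0.9, 0.1])          # illustrative targets
x_star, sigma2 = np.array([0.5]), 0.1

K_y = rbf(x, x) + sigma2 * np.eye(len(x))    # Steps 1-2: kernel + noise
L = cholesky(K_y, lower=True)                # K_y = L L^T

# Step 3: predictive mean mu_* = k_*^T K_y^{-1} y via two triangular solves
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
k_star = rbf(x, x_star)[:, 0]
mu_star = k_star @ alpha

# Step 4: predictive variance k(x_*,x_*) - k_*^T K_y^{-1} k_*  (a Schur complement)
v = solve_triangular(L, k_star, lower=True)
var_star = 1.0 - v @ v                       # k(x_*, x_*) = 1 for this kernel

print(mu_star, var_star)                     # var_star >= 0 because K_y is PD
```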
E.3 Case Study: The Multivariate Normal in Deep Learning
Parameterizing structured covariances. In deep generative models (VAEs, normalizing flows, diffusion models), we frequently need to parameterize a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ with $\Sigma$ learnable. The challenge: $\Sigma$ must be PD, and the parameterization must be differentiable.
Approach 1: Diagonal. $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$ with $\sigma_i = \exp(s_i)$ for unconstrained $s_i$. Free parameters: $d$. Correlation structure: none. Used in the standard VAE (Kingma & Welling, 2013).
Approach 2: Lower triangular Cholesky. $\Sigma = LL^T$ where $L$ is lower triangular:
- Diagonal: $\ell_{ii} = \exp(s_{ii})$ (or softplus) ensures positivity
- Off-diagonal: $\ell_{ij} = s_{ij}$ for unconstrained $s_{ij}$ (any real number)
Free parameters: $d(d+1)/2$. Full correlation structure. Used in full-covariance VAEs, normalizing flows.
Approach 3: Low-rank + diagonal. $\Sigma = FF^T + D$ where $F \in \mathbb{R}^{d \times k}$ ($k \ll d$) and $D = \mathrm{diag}(d_1, \dots, d_d)$ with $d_i > 0$. Free parameters: $dk + d$. Captures "directions of correlation." Used in structured variational Bayes, extending mean-field approximations.
The KL divergence between Gaussians. For $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$:
$$\mathrm{KL} = \frac{1}{2} \left[ \mathrm{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \log \frac{\det \Sigma_1}{\det \Sigma_0} \right].$$
For $\mathcal{N}(0, I)$ (standard VAE prior):
$$\mathrm{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \left[ \mathrm{tr}(\Sigma) + \mu^T \mu - d - \log \det \Sigma \right].$$
For diagonal $\Sigma$:
$$\mathrm{KL} = \frac{1}{2} \sum_{i=1}^d \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right).$$
For Cholesky $\Sigma = LL^T$: $\mathrm{tr}(\Sigma) = \|L\|_F^2$ and $\log \det \Sigma = 2 \sum_i \log \ell_{ii}$.
All these computations - $\mathrm{tr}(\Sigma)$, $\log \det \Sigma$, the quadratic term - exploit the Cholesky and log-det tools developed in Sections 4 and 7.
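A short NumPy sanity check of the Cholesky form of the KL term against the direct formula; the random factor $L$ is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Cholesky parameterization of q = N(mu, Sigma) with Sigma = L L^T
mu = rng.standard_normal(d)
L = np.tril(rng.standard_normal((d, d)))
np.fill_diagonal(L, np.exp(np.diag(L)))      # positive diagonal -> valid factor
Sigma = L @ L.T

# KL(q || N(0, I)) using only Cholesky quantities:
# tr(Sigma) = ||L||_F^2 and log det Sigma = 2 sum_i log L_ii
kl_chol = 0.5 * (np.sum(L ** 2) + mu @ mu - d - 2 * np.sum(np.log(np.diag(L))))

# Cross-check against the direct formula
sign, logdet = np.linalg.slogdet(Sigma)
kl_direct = 0.5 * (np.trace(Sigma) + mu @ mu - d - logdet)

print(np.isclose(kl_chol, kl_direct))        # True
```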
E.4 The Normal Equations and Ridge Regression
Least squares. Given $X \in \mathbb{R}^{n \times d}$ (data) and $y \in \mathbb{R}^n$ (targets), ordinary least squares (OLS) minimizes:
$$\min_w \|Xw - y\|^2.$$
The normal equations: $X^T X w = X^T y$. The matrix $X^T X$ is always PSD (a Gram matrix). If $X$ has full column rank ($n \geq d$, rank $d$), then $X^T X \succ 0$ and the unique solution is $w = (X^T X)^{-1} X^T y$.
Ridge regression. Add regularization: minimize $\|Xw - y\|^2 + \lambda \|w\|^2$.
Normal equations: $(X^T X + \lambda I)\, w = X^T y$.
The matrix $X^T X + \lambda I \succ 0$ for any $\lambda > 0$ (by Proposition 2.3.7). This is why ridge regression is always numerically stable: the PD matrix has all positive pivots, and Cholesky never fails.
Condition number improvement. $\kappa(X^T X + \lambda I) = \dfrac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}$, where $\sigma_i$ are the singular values of $X$. For small $\lambda$, this is $\approx \kappa(X)^2$. As $\lambda \to \infty$, $\kappa \to 1$ (well-conditioned). Choosing $\lambda$ balances bias (regularization) vs numerical stability.
Kernel ridge regression. For nonlinear regression using a PD kernel $k$, form the kernel matrix $K_{ij} = k(x_i, x_j)$ and solve $(K + \lambda I)\alpha = y$. Predictions: $f(x_*) = k_*^T \alpha$ where $(k_*)_i = k(x_*, x_i)$. This is equivalent to GP regression with the kernel $k$ and noise variance $\lambda$. The PD condition on $K + \lambda I$ is exactly the Cholesky solvability condition.
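A sketch of the ridge normal equations solved with Cholesky; the nearly collinear design matrix is constructed only to make the conditioning improvement visible.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 1e-2
X = rng.standard_normal((n, d))
X[:, -1] = X[:, 0] + 1e-4 * rng.standard_normal(n)   # nearly collinear columns
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Ridge normal equations: (X^T X + lam I) w = X^T y; the matrix is PD for lam > 0
A = X.T @ X + lam * np.eye(d)
w = cho_solve(cho_factor(A, lower=True), X.T @ y)

print(f"cond(X^T X)         = {np.linalg.cond(X.T @ X):.2e}")
print(f"cond(X^T X + lam I) = {np.linalg.cond(A):.2e}")   # much smaller
print(np.allclose(A @ w, X.T @ y))                        # True
```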
Appendix F: Quick Reference
F.1 Four Equivalent Characterizations of
POSITIVE DEFINITENESS: FOUR EQUIVALENT TESTS
========================================================================
Let A in R^(n x n) be symmetric. The following are equivalent:
DEFINITION   x^T A x > 0 for all x != 0              (quadratic form positive)
SPECTRAL     lambda_i(A) > 0 for all i               (all eigenvalues positive)
SYLVESTER    Delta_k = det(A[1:k,1:k]) > 0 for all k (leading minors positive)
CHOLESKY     exists a unique lower triangular L with positive diagonal: A = LL^T
COST:
Definition   O(n^2) per test vector        <- manual/symbolic
Spectral     O(n^3) eigendecomposition     <- most informative
Sylvester    O(n^4) for all n determinants <- manual/symbolic
Cholesky     O(n^3/3) factorization        <- fastest in practice
========================================================================
F.2 Key Formulas
| Formula | Notes |
|---|---|
| $A = LL^T$ | Cholesky factorization (exists iff $A \succ 0$) |
| $A = LDL^T$ | LDL^T: unit lower triangular $\times$ diagonal |
| $A = QP$ | Polar: orthogonal $\times$ PD |
| $\log \det A = 2 \sum_i \log \ell_{ii}$ | Log-det via Cholesky |
| $\nabla_A \log \det A = A^{-1}$ | Matrix calculus: gradient of log-det |
| $S = D - B^T A^{-1} B$ | Schur complement of $A$ in $\begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$ |
| $M \succ 0 \iff A \succ 0$ and $D - B^T A^{-1} B \succ 0$ | Schur PD criterion |
| $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$ | Woodbury identity |
| $z = \mu + L\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ | Reparameterization trick |
| $\mathrm{KL} = \tfrac{1}{2}\big[\mathrm{tr}(\Sigma) + \mu^T \mu - d - \log \det \Sigma\big]$ | VAE KL term |
F.3 Notation Summary
Following the NOTATION_GUIDE.md conventions:
| Symbol | Meaning |
|---|---|
| $A \succ 0$ | $A$ is positive definite |
| $A \succeq 0$ | $A$ is positive semidefinite |
| $A \prec 0$ | $A$ is negative definite |
| $\mathbb{S}^n$ | Space of symmetric $n \times n$ matrices |
| $\mathbb{S}^n_{++}$ | Cone of PD matrices (interior) |
| $\mathbb{S}^n_{+}$ | Cone of PSD matrices |
| $L$ | Cholesky factor (lower triangular, positive diagonal) |
| $A^{1/2}$ | Symmetric PSD square root |
| $A^{-1/2}$ | Inverse PSD square root |
| $\Sigma$ | Covariance matrix ($\Sigma \succeq 0$) |
| $\Lambda$ | Precision matrix ($\Lambda = \Sigma^{-1}$, $\Lambda \succ 0$) |
| $F$ | Fisher information matrix ($F \succeq 0$) |
| $S = D - B^T A^{-1} B$ | Schur complement |
| $G$ | Gram matrix ($G = XX^T$, $G \succeq 0$) |
| $K$ | Kernel matrix ($K_{ij} = k(x_i, x_j)$) |
F.4 Python Cheatsheet
import numpy as np
import scipy.linalg as la
# === Cholesky factorization ===
L = np.linalg.cholesky(A) # A = L @ L.T
# Or via scipy (can also compute upper factor):
L = la.cholesky(A, lower=True)
# === LDL^T factorization ===
L_ldl, D_ldl, _ = la.ldl(A) # A = L @ D @ L.T
# === Log-determinant (numerically stable) ===
L = np.linalg.cholesky(A)
log_det = 2 * np.sum(np.log(np.diag(L)))
# === Solve A x = b via Cholesky ===
L = np.linalg.cholesky(A)
x = la.cho_solve((L, True), b) # uses LAPACK dpotrs
# === PSD square root ===
eigvals, Q = np.linalg.eigh(A) # symmetric eigendecomp
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
# === Woodbury: (A + U C V)^{-1} ===
# via np.linalg.solve on the smaller system
# === Reparameterization trick ===
L = np.linalg.cholesky(Sigma) # Sigma = L @ L.T
eps = np.random.randn(n_samples, d)
z = mu + (L @ eps.T).T # z ~ N(mu, Sigma)
# === Gram matrix ===
G = X @ X.T # G = X X^T, PSD
# === Check PD ===
try:
np.linalg.cholesky(A)
is_pd = True
except np.linalg.LinAlgError:
is_pd = False
# === Check PSD ===
eigvals = np.linalg.eigvalsh(A)
is_psd = np.all(eigvals >= -1e-10)
Appendix G: Proofs and Extensions
G.1 Complete Proof of the Cholesky Existence Theorem
We present the full inductive proof more carefully, making each step explicit.
Theorem (Full Cholesky Existence). Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A \succ 0$ if and only if $A = LL^T$ for a unique lower triangular $L$ with positive diagonal.
Proof by strong induction on $n$.
Base case $n = 1$: $A = [a]$ with $a > 0$ (since $A \succ 0$ iff the quadratic form $a x^2 > 0$ for $x \neq 0$ iff $a > 0$). Take $L = [\sqrt{a}]$. Unique, since $\ell^2 = a$ and $\ell > 0$ give $\ell = \sqrt{a}$.
Inductive step: Assume the result holds for all $(n-1) \times (n-1)$ PD matrices. Let $A \in \mathbb{R}^{n \times n}$ be PD. Write:
$$A = \begin{pmatrix} a_{11} & b^T \\ b & C \end{pmatrix},$$
where $a_{11} > 0$ (Proposition 2.3, item 1), $b \in \mathbb{R}^{n-1}$, and $C \in \mathbb{R}^{(n-1) \times (n-1)}$ is symmetric.
Define $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{21} = b / \ell_{11}$.
Compute the Schur complement: $S = C - \frac{1}{a_{11}} b b^T$.
Claim: $S \succ 0$.
Proof of claim: For any $y \in \mathbb{R}^{n-1}$ with $y \neq 0$, set $x = \left( -\frac{b^T y}{a_{11}},\; y \right)$. Note $x \neq 0$ (the second block is $y \neq 0$). Then:
$$x^T A x = a_{11} \left( \frac{b^T y}{a_{11}} \right)^2 - 2\,\frac{(b^T y)^2}{a_{11}} + y^T C y = y^T C y - \frac{(b^T y)^2}{a_{11}} = y^T S y.$$
Since $A \succ 0$: $x^T A x > 0$, so $y^T S y > 0$. Since $y$ was arbitrary, $S \succ 0$. (claim)
By the inductive hypothesis, $S = L_S L_S^T$ for a unique lower triangular $L_S$ with positive diagonal.
Set $L = \begin{pmatrix} \ell_{11} & 0 \\ \ell_{21} & L_S \end{pmatrix}$. Then $L$ is lower triangular with positive diagonal and:
$$L L^T = \begin{pmatrix} \ell_{11}^2 & \ell_{11} \ell_{21}^T \\ \ell_{11} \ell_{21} & \ell_{21} \ell_{21}^T + L_S L_S^T \end{pmatrix} = \begin{pmatrix} a_{11} & b^T \\ b & \frac{1}{a_{11}} b b^T + S \end{pmatrix} = A.$$
Uniqueness. Suppose $A = L_1 L_1^T = L_2 L_2^T$ with $L_1, L_2$ lower triangular and positive diagonal. Then $M = L_2^{-1} L_1$ (a product of lower triangular matrices, hence lower triangular) satisfies $M M^T = L_2^{-1} L_1 L_1^T L_2^{-T} = L_2^{-1} A L_2^{-T} = I$. An orthogonal lower triangular matrix must be diagonal. Since $L_1, L_2$ have positive diagonal, $M$ has positive diagonal. $M M^T = I$ for diagonal $M$ with positive diagonal gives $M = I$, so $L_1 = L_2$.
G.2 Proof That the Fisher Information Is PSD
Theorem. For a regular statistical model $p(x; \theta)$, the Fisher information matrix $F(\theta) = \mathbb{E}_{x \sim p(\cdot;\theta)}\big[ s(x;\theta)\, s(x;\theta)^T \big]$, where $s(x;\theta) = \nabla_\theta \log p(x;\theta)$ is the score function, is PSD.
Proof. For any $v$:
$$v^T F v = \mathbb{E}\big[ v^T s(x;\theta)\, s(x;\theta)^T v \big] = \mathbb{E}\big[ (v^T s(x;\theta))^2 \big] \geq 0.$$
(The expectation of a squared real random variable is non-negative.)
When is $F \succ 0$? $F$ is PD iff $\mathbb{E}[(v^T s)^2] > 0$ for all $v \neq 0$. This holds iff the score function satisfies $v^T s(x;\theta) \neq 0$ with positive probability for every $v \neq 0$ - the model is "identifiable" in all directions. Singular $F$ indicates structural non-identifiability (two parameters produce identical distributions).
G.3 Derivatives and the Matrix-Valued Chain Rule
For completeness, we derive the key matrix calculus formulas used in 7.3.
Jacobi's formula. For differentiable $A(t)$:
$$\frac{d}{dt} \det A(t) = \det A(t)\, \mathrm{tr}\!\left( A(t)^{-1} \frac{dA}{dt} \right).$$
Proof: Using the cofactor expansion, $\det A = \sum_j a_{ij} C_{ij}$ for any fixed row $i$, so $\partial \det A / \partial a_{ij} = C_{ij} = (\mathrm{adj}\, A)_{ji}$, and $A^{-1} = \mathrm{adj}(A) / \det A$ (Cramer's rule). By the chain rule: $\frac{d}{dt} \det A = \sum_{ij} \frac{\partial \det A}{\partial a_{ij}} \frac{d a_{ij}}{dt} = \mathrm{tr}\!\left( \mathrm{adj}(A) \frac{dA}{dt} \right) = \det A\, \mathrm{tr}\!\left( A^{-1} \frac{dA}{dt} \right)$.
Log-det gradient. $\frac{d}{dt} \log \det A = \mathrm{tr}\!\left( A^{-1} \frac{dA}{dt} \right)$. Since $\partial \log \det A / \partial A_{ij} = (A^{-T})_{ij}$, the gradient of $\log \det$ at $A$ is $A^{-1}$ (for symmetric $A$).
Trace-inverse gradient. For $f(A) = \mathrm{tr}(A^{-1} B)$: $df = -\mathrm{tr}(A^{-1}\, dA\, A^{-1} B)$. So $\nabla_A\, \mathrm{tr}(A^{-1} B) = -A^{-1} B A^{-1}$ (for symmetric $A$, $B$).
For AI - GP hyperparameter gradients: The gradient of the GP log-marginal-likelihood with respect to a kernel hyperparameter $\theta$ is:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{2} \mathrm{tr}\!\left( \big( \alpha \alpha^T - K_y^{-1} \big) \frac{\partial K_y}{\partial \theta} \right),$$
where $\alpha = K_y^{-1} y$. This uses both the log-det gradient and the trace-inverse gradient. Efficiently computed via Cholesky: form $K_y^{-1} \frac{\partial K_y}{\partial \theta}$ column by column with triangular solves, then take the trace.
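A sketch of this gradient for the length-scale of a 1-D RBF kernel, checked against finite differences (the data and the hyperparameter value are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)
y = np.sin(x) + 0.1 * rng.standard_normal(20)
sq = (x[:, None] - x[None, :]) ** 2
sigma2, ell = 0.05, 1.3

def log_marginal(ell):
    K_y = np.exp(-0.5 * sq / ell ** 2) + sigma2 * np.eye(len(x))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(K_y, y)
    # log det K_y = 2 * sum_i log L_ii
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2 * np.pi))

# Analytic gradient: 1/2 tr((alpha alpha^T - K_y^{-1}) dK_y/d ell)
K = np.exp(-0.5 * sq / ell ** 2)
K_y = K + sigma2 * np.eye(len(x))
alpha = np.linalg.solve(K_y, y)
dK = K * (sq / ell ** 3)                          # derivative of the RBF kernel
grad = 0.5 * np.trace((np.outer(alpha, alpha) - np.linalg.inv(K_y)) @ dK)

# Finite-difference check
h = 1e-5
fd = (log_marginal(ell + h) - log_marginal(ell - h)) / (2 * h)
print(grad, fd)                                   # the two values agree
```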
Appendix H: Summary and Further Reading
H.1 Core Theorems Summary
| Theorem | Statement | Reference |
|---|---|---|
| Spectral characterization | $A \succ 0 \iff$ all eigenvalues $\lambda_i > 0$ | 3.1 |
| Sylvester's criterion | $A \succ 0 \iff$ all leading principal minors $\Delta_k > 0$ | 3.2 |
| Cholesky existence | $A \succ 0 \iff \exists!$ lower triangular $L$ (pos. diag.) with $A = LL^T$ | 4.1 |
| LDL^T | Symmetric $A \succ 0 \iff A = LDL^T$ with all $d_i > 0$ | 4.4 |
| PSD square root | $A \succeq 0 \Rightarrow \exists!$ symmetric PSD $A^{1/2}$ with $(A^{1/2})^2 = A$ | 5.1 |
| Schur PD criterion | $\begin{pmatrix} A & B \\ B^T & D \end{pmatrix} \succ 0 \iff A \succ 0$ and $D - B^T A^{-1} B \succ 0$ | 6.2 |
| Woodbury identity | $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$ | 6.3 |
| Log-det concavity | $\log \det$ is strictly concave on $\mathbb{S}^n_{++}$ | 7.2 |
| Log-det gradient | $\nabla_A \log \det A = A^{-1}$ | 7.3 |
| Gram matrix PSD | $XX^T \succeq 0$ always; $\succ 0$ iff rows of $X$ are linearly independent | 8.1 |
| Schur product | Hadamard product of PSD matrices is PSD | Appendix A |
| Hadamard inequality | $\det A \leq \prod_i a_{ii}$ for $A \succ 0$ | Appendix D |
| Log-det Cholesky | $\log \det A = 2 \sum_i \log \ell_{ii}$ | 7.1 |
H.2 Further Reading
Textbooks:
- Golub & Van Loan, Matrix Computations (4th ed., 2013) - Chapters 4, 7: definitive reference on Cholesky, LDL^T algorithms
- Higham, Accuracy and Stability of Numerical Algorithms (2002) - backward stability proofs for Cholesky, modified Cholesky
- Boyd & Vandenberghe, Convex Optimization (2004) - Chapter 4: SDP, PSD cone; freely available online
- Bernstein, Matrix Mathematics (2nd ed., 2009) - comprehensive collection of PD matrix identities
Research papers:
- Cholesky (1924, posthumous publication via Benoit): original algorithm for geodetic calculations
- Gill, Murray & Wright (1981): modified Cholesky for optimization
- Vandenberghe & Boyd (1996): semidefinite programming tutorial
- Martens & Grosse (2015): K-FAC - Kronecker-factored natural gradient
- Gardner et al. (2018): GPyTorch - scalable GP with stochastic log-det estimation
- Foret et al. (2021): SAM - sharpness-aware minimization
Online resources:
- Petersen & Pedersen, The Matrix Cookbook - practical matrix calculus identities
- NumPy/SciPy docs: np.linalg.cholesky, scipy.linalg.cholesky, scipy.linalg.ldl
- GPyTorch documentation: scalable GP inference with Cholesky
- CVXPY documentation: SDP examples and semidefinite programming
Next sections:
- 08: Matrix Decompositions - LU, QR, SVD algorithms (Cholesky briefly revisited in algorithmic context)
- Chapter 6: Probability Theory - multivariate Gaussians, covariance matrices
- Chapter 8: Optimization - second-order methods, convexity, SDP
- Chapter 12: Functional Analysis - RKHS, Mercer's theorem, kernel methods
End of 07: Positive Definite Matrices. Next: 08: Matrix Decompositions
Appendix I: Worked Examples - Complete Solutions
I.1 Full 3×3 Cholesky Example with Verification
Compute $A = LL^T$ for $A = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix}$.
Step 1: $\ell_{11} = \sqrt{a_{11}} = \sqrt{4} = 2$.
Step 2: $\ell_{21} = a_{21}/\ell_{11} = 2/2 = 1$. $\ell_{31} = a_{31}/\ell_{11} = 2/2 = 1$.
Step 3: $\ell_{22} = \sqrt{a_{22} - \ell_{21}^2} = \sqrt{5 - 1} = 2$.
Step 4: $\ell_{32} = (a_{32} - \ell_{31}\ell_{21})/\ell_{22} = (3 - 1)/2 = 1$.
Step 5: $\ell_{33} = \sqrt{a_{33} - \ell_{31}^2 - \ell_{32}^2} = \sqrt{6 - 1 - 1} = 2$.
Verification:
$$L L^T = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 2 & 0 \\ 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} 2 & 1 & 1 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{pmatrix} = \begin{pmatrix} 4 & 2 & 2 \\ 2 & 5 & 3 \\ 2 & 3 & 6 \end{pmatrix} = A.$$
Also: $\det A = (\ell_{11} \ell_{22} \ell_{33})^2 = (2 \cdot 2 \cdot 2)^2 = 64$ and $\log \det A = 2(\log 2 + \log 2 + \log 2) = 6 \log 2$.
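The same computation verified with NumPy, using the matrix above.

```python
import numpy as np

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
L = np.linalg.cholesky(A)
print(L)                                              # [[2,0,0],[1,2,0],[1,1,2]]
print(np.allclose(L @ L.T, A))                        # True
print(np.prod(np.diag(L)) ** 2, np.linalg.det(A))     # 64.0  ~64.0
```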
I.2 Schur Complement and Conditional Gaussian
Let $\Sigma = \begin{pmatrix} 4 & 1.2 \\ 1.2 & 1 \end{pmatrix}$ be the covariance of $(X_1, X_2)$.
Schur complement of $\sigma_{22} = 1$:
$$\Sigma / \sigma_{22} = \sigma_{11} - \sigma_{12}\, \sigma_{22}^{-1}\, \sigma_{21} = 4 - \frac{1.2^2}{1} = 2.56.$$
Conditional distribution $X_1 \mid X_2 = x_2$:
$$X_1 \mid X_2 = x_2 \;\sim\; \mathcal{N}\!\left( \mu_1 + \frac{1.2}{1}(x_2 - \mu_2),\; 2.56 \right).$$
Note: the unconditional variance of $X_1$ is $4$. The conditional variance is $2.56 < 4$ - observing $X_2$ reduces uncertainty about $X_1$ (as guaranteed by the Loewner order, 2.4). The correlation $\rho = 1.2/(2 \cdot 1) = 0.6$ explains the moderate uncertainty reduction: $2.56 = 4(1 - \rho^2)$.