
Positive Definite Matrices: Part 6: Schur Complement to 11. Common Mistakes

6. Schur Complement

The Schur complement is the "matrix analogue of completing the square" for block matrices. It appears everywhere in probability (Gaussian conditioning), linear algebra (block matrix inversion), and optimization (constraint elimination).

6.1 Definition for Block Matrices

Definition 6.1 (Schur Complement). Let

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$

be a block matrix with $A \in \mathbb{R}^{p \times p}$ invertible. The Schur complement of $A$ in $M$ is:

$$S = D - CA^{-1}B \in \mathbb{R}^{q \times q}.$$

Similarly, if $D \in \mathbb{R}^{q \times q}$ is invertible, the Schur complement of $D$ is $A - BD^{-1}C$.

Origin: block Gaussian elimination. The Schur complement arises naturally when eliminating the $(2,1)$ block:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ CA^{-1} & I \end{pmatrix}\begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix}.$$

The $(2,2)$ block in the upper triangular factor is exactly $S = D - CA^{-1}B$.

Determinant formula. A key consequence of the block LU:

$$\det M = \det A \cdot \det(D - CA^{-1}B) = \det A \cdot \det S.$$

Similarly, $\det M = \det D \cdot \det(A - BD^{-1}C)$ when $D$ is invertible.

Block matrix inverse. Using the Schur complement:

$$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1}B S^{-1} C A^{-1} & -A^{-1}B S^{-1} \\ -S^{-1}CA^{-1} & S^{-1} \end{pmatrix}$$

when both $A$ and $S = D - CA^{-1}B$ are invertible.
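A quick numpy check (the block sizes and random seed below are arbitrary test choices, not from the lesson) of the determinant formula and the $(2,2)$ block of the block inverse:

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
M = rng.standard_normal((p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)          # make M symmetric positive definite
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

S = D - C @ np.linalg.solve(A, B)              # Schur complement of A in M

print(np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S)))   # det M = det A * det S
print(np.allclose(np.linalg.inv(M)[p:, p:], np.linalg.inv(S)))             # (2,2) block of M^{-1} is S^{-1}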

6.2 Schur Complement and Positive Definiteness

The Schur complement provides an elegant characterization of block PD matrices.

Theorem 6.2 (Schur PD Criterion). Let $M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$ be symmetric (so $C = B^\top$). Then:

$$M \succ 0 \quad \Longleftrightarrow \quad A \succ 0 \ \text{ and } \ S = D - B^\top A^{-1} B \succ 0.$$

Proof. We use the block Cholesky:

$$M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ B^\top A^{-1} & I \end{pmatrix}\begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}.$$

The middle factor is block diagonal. For any $\mathbf{v} = (\mathbf{x}, \mathbf{y})^\top$:

$$\mathbf{v}^\top M \mathbf{v} = (\mathbf{x} + A^{-1}B\mathbf{y})^\top A (\mathbf{x} + A^{-1}B\mathbf{y}) + \mathbf{y}^\top S \mathbf{y}.$$

($\Rightarrow$): If $M \succ 0$, taking $\mathbf{y} = \mathbf{0}$ shows $A \succ 0$; taking $\mathbf{x} = -A^{-1}B\mathbf{y}$ shows $\mathbf{y}^\top S\mathbf{y} > 0$ for $\mathbf{y} \neq \mathbf{0}$, so $S \succ 0$.

($\Leftarrow$): If $A \succ 0$ and $S \succ 0$, then both terms are non-negative, and at least one is positive for $(\mathbf{x},\mathbf{y}) \neq (\mathbf{0},\mathbf{0})$, so $M \succ 0$. $\square$

Corollary. $M \succeq 0 \Leftrightarrow A \succeq 0$ and $S = D - B^\top A^{-1}B \succeq 0$ (when $A$ is invertible; otherwise use the rank condition).

SCHUR COMPLEMENT AND BLOCK PD
========================================================================

  M = ( A     B )   symmetric
      ( B^T   D )

  M ≻ 0  <=>  A ≻ 0  AND  S = D - B^T A^-1 B ≻ 0

  Intuition: "completing the square" in block form

  v^T M v = (x + A^-1 B y)^T A (x + A^-1 B y)  +  y^T S y
            -------------------------------      -------
               ≥ 0 (since A ≻ 0)                 ≥ 0 (since S ≻ 0)

========================================================================

6.3 Matrix Inversion Lemma

The Woodbury matrix identity (also called the matrix inversion lemma or Sherman-Morrison-Woodbury formula) is:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$$

where $A \in \mathbb{R}^{n \times n}$, $C \in \mathbb{R}^{k \times k}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$. This is valuable when $k \ll n$ (low-rank update): instead of inverting an $n \times n$ matrix, invert a $k \times k$ matrix.

Derivation via Schur complement. Consider the block system:

$$M = \begin{pmatrix} A & U \\ -V & C^{-1} \end{pmatrix}.$$

The Schur complement of $C^{-1}$ in $M$ is $A - U(C^{-1})^{-1}(-V) = A + UCV$; the Schur complement of $A$ is $C^{-1} + VA^{-1}U$. Equating the two expressions the block inverse formula gives for the $(1,1)$ block of $M^{-1}$ yields the Woodbury identity.

Special case (rank-1 update): If $U = \mathbf{u}$, $V = \mathbf{v}^\top$, $C = c$ (scalar):

$$(A + c\,\mathbf{u}\mathbf{v}^\top)^{-1} = A^{-1} - \frac{c\, A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1 + c\,\mathbf{v}^\top A^{-1}\mathbf{u}}.$$

This is the Sherman-Morrison formula for rank-1 updates.
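A small numeric sanity check of the Sherman-Morrison formula; the matrix, vectors, and scalar below are arbitrary test values:

import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)     # any invertible A works here
u, v = rng.standard_normal(n), rng.standard_normal(n)
c = 0.7

A_inv = np.linalg.inv(A)
denom = 1.0 + c * v @ A_inv @ u
updated_inv = A_inv - c * np.outer(A_inv @ u, v @ A_inv) / denom   # Sherman-Morrison

print(np.allclose(updated_inv, np.linalg.inv(A + c * np.outer(u, v))))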

For AI: The Woodbury identity is used in:

  • Gaussian process prediction: the posterior involves $(K + \sigma^2 I)^{-1}$; with a linear kernel $K = XX^\top$, Woodbury lets you work with whichever of the $n \times n$ or $p \times p$ formulations is smaller
  • LoRA / low-rank adaptation: the effective weight $W_0 + BA$ is a rank-$r$ update; Woodbury allows efficient inversion without materializing the full matrix
  • Kalman filter update step: $(P^{-1} + H^\top R^{-1}H)^{-1}$ uses Woodbury to avoid inverting large state covariances

6.4 Gaussian Conditioning via Schur Complement

The Schur complement is the algebraic engine behind the conditional distribution of multivariate Gaussians.

Setup. Let $\begin{pmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right)$.

Conditional distribution. The conditional $\mathbf{x}_1 \mid \mathbf{x}_2 = \mathbf{a}$ is Gaussian:

$$\mathbf{x}_1 \mid \mathbf{x}_2 = \mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_{1|2},\, \Sigma_{1|2})$$

where:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{a} - \boldsymbol{\mu}_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

The conditional covariance $\Sigma_{1|2}$ is exactly the Schur complement of $\Sigma_{22}$ in $\Sigma$. Theorem 6.2 guarantees $\Sigma_{1|2} \succ 0$ whenever $\Sigma \succ 0$.

Derivation. Complete the square in the joint density. The exponent:

$$(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

factored using the block inverse of $\Sigma^{-1}$ (the "precision matrix") gives the Schur complement form.

For AI: In Gaussian process regression, the predictive distribution at new points $\mathbf{x}_*$ given training observations $\mathbf{y}$ uses exactly this formula:

$$\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}.$$

The second term is the Schur complement. Computing it via Cholesky: $L = \text{chol}(K_{nn} + \sigma^2 I)$, then $\Sigma_* = K_{**} - (L^{-1}K_{n*})^\top(L^{-1}K_{n*})$.
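A short numpy sketch of Gaussian conditioning (toy dimensions and a random joint covariance, assumed for illustration); the conditional covariance is the Schur complement of $\Sigma_{22}$:

import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 2, 3
Z = rng.standard_normal((n1 + n2, n1 + n2))
Sigma = Z @ Z.T + (n1 + n2) * np.eye(n1 + n2)       # joint covariance, PD
mu = rng.standard_normal(n1 + n2)

S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
mu1, mu2 = mu[:n1], mu[n1:]
a = rng.standard_normal(n2)                         # observed value of x_2

mu_cond = mu1 + S12 @ np.linalg.solve(S22, a - mu2)          # conditional mean
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)           # Schur complement of S22

print(mu_cond)
print(np.all(np.linalg.eigvalsh(Sigma_cond) > 0))            # PD, as Theorem 6.2 guarantees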


7. Log-Determinant

7.1 Definition and Properties

For a positive definite matrix $A \succ 0$, the log-determinant is:

$$\log\det A = \log \prod_{i=1}^n \lambda_i = \sum_{i=1}^n \log \lambda_i.$$

Since $A \succ 0$ implies $\lambda_i > 0$ for all $i$, this is well-defined. The log-det is defined only on the interior of the PSD cone (i.e., on PD matrices).

Key computation via Cholesky:

$$\log\det A = \log\det(LL^\top) = \log(\det L)^2 = 2\log\det L = 2\sum_{i=1}^n \log L_{ii}.$$

Since $L_{ii} > 0$ (the Cholesky diagonal is positive), $\log L_{ii}$ is well-defined. This is the standard computational formula: factor $A = LL^\top$, then sum the logs of the diagonal entries.

import numpy as np

L = np.linalg.cholesky(A)                         # A = L L^T, L lower triangular with positive diagonal
log_det_A = 2 * np.sum(np.log(np.diag(L)))        # log det A = 2 * sum_i log L_ii

This is numerically more stable than np.log(np.linalg.det(A)) for large matrices, because det can underflow or overflow.

Properties:

  • $\log\det(AB) = \log\det A + \log\det B$ (for any invertible $A, B$)
  • $\log\det(A^{-1}) = -\log\det A$
  • $\log\det(\alpha A) = n\log\alpha + \log\det A$ for scalar $\alpha > 0$
  • $\log\det(A) = \text{tr}(\log A)$, where $\log A$ is the matrix logarithm (eigendecomposition-based)
  • As $A \to \partial \mathbb{S}_+^n$ (the boundary, i.e., a singular matrix): $\log\det A \to -\infty$

7.2 Log-Det as a Concave Function

Theorem 7.1. The function $f: \mathbb{S}_{++}^n \to \mathbb{R}$ defined by $f(A) = \log\det A$ is strictly concave on the cone of PD matrices.

Proof. We need to show $f(\lambda A + (1-\lambda)B) \geq \lambda f(A) + (1-\lambda) f(B)$ for $A, B \succ 0$ and $\lambda \in (0,1)$, with equality iff $A = B$.

Fix $A \succ 0$ and let $C = A^{-1/2}BA^{-1/2} \succ 0$, with eigenvalues $\mu_1 \geq \cdots \geq \mu_n > 0$.

Factoring $\lambda A + (1-\lambda)B = A^{1/2}(\lambda I + (1-\lambda)C)A^{1/2}$ gives:

$$f(\lambda A + (1-\lambda)B) = \log\det A + \sum_{i=1}^n \log(\lambda + (1-\lambda)\mu_i).$$

Since $g(t) = \log t$ is strictly concave: $\log(\lambda + (1-\lambda)\mu_i) \geq \lambda\log 1 + (1-\lambda)\log\mu_i = (1-\lambda)\log\mu_i$.

Summing: $f(\lambda A + (1-\lambda)B) \geq \log\det A + (1-\lambda)\sum_i \log\mu_i = \lambda f(A) + (1-\lambda)f(B)$.

Equality holds iff all $\mu_i = 1$, i.e., $C = I$, i.e., $B = A$. $\square$

Consequences of concavity:

  • The log-det has no local maxima except at the global maximum (over any convex domain)
  • Gradient methods for maximizing $\log\det A$ over a convex set converge to the global maximum
  • The function is differentiable at every PD matrix (the gradient is $A^{-1}$, see 7.3)

7.3 Gradient and Matrix Calculus

Theorem 7.2 (Log-Det Gradient). Let $f(A) = \log\det A$ for $A \succ 0$. Then:

$$\frac{\partial \log\det A}{\partial A} = A^{-\top} = A^{-1} \quad (\text{since } A \text{ is symmetric}).$$

More precisely, in the matrix calculus convention where $\partial f/\partial A_{ij}$ is the $(i,j)$ entry of the gradient matrix:

$$\left(\frac{\partial \log\det A}{\partial A}\right)_{ij} = (A^{-1})_{ji}.$$

Proof using Jacobi's formula. Jacobi's formula states $d(\det A) = \text{tr}(\text{adj}(A)\, dA) = \det(A)\,\text{tr}(A^{-1}\,dA)$. Therefore:

$$d(\log\det A) = \frac{d(\det A)}{\det A} = \text{tr}(A^{-1}\,dA).$$

Since $d(\log\det A) = \langle \nabla_A \log\det A,\, dA\rangle_F = \text{tr}((\nabla_A \log\det A)^\top dA)$, we read off $\nabla_A \log\det A = A^{-\top} = A^{-1}$ for symmetric $A$.

Second derivative (Hessian operator): For the function $A \mapsto \log\det A$:

$$d^2(\log\det A)[H, H] = -\text{tr}(A^{-1}H A^{-1}H) = -\text{tr}\big((A^{-1/2}HA^{-1/2})^2\big) \leq 0.$$

The negative sign confirms concavity.

Other useful log-det derivatives:

  • $\frac{\partial}{\partial \mathbf{a}} \log\det(A + \mathbf{a}\mathbf{a}^\top) = 2(A + \mathbf{a}\mathbf{a}^\top)^{-1}\mathbf{a}$ (rank-1 update)
  • $\frac{\partial}{\partial t} \log\det(A + tB) = \text{tr}\big((A+tB)^{-1}B\big)$ (see the numeric check after this list)
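A finite-difference check (illustrative, with an arbitrary PD test matrix) that the differential of $\log\det A$ in a symmetric direction $H$ is $\text{tr}(A^{-1}H)$, i.e., that the gradient is $A^{-1}$:

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)                         # symmetric PD test matrix

H = rng.standard_normal((n, n))
H = (H + H.T) / 2                                   # symmetric perturbation direction
eps = 1e-6

def logdet(M):
    return 2 * np.sum(np.log(np.diag(np.linalg.cholesky(M))))

directional = (logdet(A + eps * H) - logdet(A - eps * H)) / (2 * eps)
print(np.isclose(directional, np.trace(np.linalg.inv(A) @ H)))   # matches tr(A^{-1} H)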

7.4 Log-Det in Machine Learning

Multivariate Gaussian log-likelihood. For $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$\log p(\mathbf{x}) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}).$$

The term $-\frac{1}{2}\log\det\Sigma$ penalizes large (spread-out) distributions. When fitting $\Sigma$ to data, maximizing the log-likelihood requires differentiating through $\log\det\Sigma$, using $\partial\log\det\Sigma/\partial\Sigma = \Sigma^{-1}$.

Gaussian process marginal likelihood. For GP regression with kernel matrix $K$ and noise variance $\sigma^2$:

$$\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top(K+\sigma^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\det(K+\sigma^2I) - \frac{n}{2}\log(2\pi).$$

Both terms require Cholesky: $L = \text{chol}(K + \sigma^2 I)$, then $\log\det = 2\sum\log L_{ii}$ and the quadratic form via triangular solves; a code sketch follows. The gradient with respect to a kernel hyperparameter $\theta$ is $\partial\log p/\partial\theta = \frac{1}{2}\text{tr}\big((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2I)^{-1})\,\partial K/\partial\theta\big)$ where $\boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}$.
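A minimal sketch of this computation (the RBF kernel, data, and hyperparameters below are made up for illustration):

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_log_marginal_likelihood(X, y, noise=0.1, lengthscale=1.0):
    n = len(y)
    K = rbf_kernel(X, X, lengthscale) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)                       # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))            # -(1/2) log det = -sum_i log L_ii
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(gp_log_marginal_likelihood(X, y))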

Normalizing flows. A normalizing flow defines a bijective mapping $f: \mathbf{z} \mapsto \mathbf{x}$ where $\mathbf{z} \sim \mathcal{N}(0,I)$. The log-likelihood of data $\mathbf{x}$ is:

$$\log p(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) + \log\big|\det J_{f^{-1}}\big|$$

where $J_{f^{-1}}$ is the Jacobian of the inverse map. Efficiently computing $\log|\det J|$ is the central computational challenge of normalizing flows. Architectures like RealNVP (triangular Jacobian, $\log|\det J| = \sum\log|J_{ii}|$) and FFJORD (stochastic trace estimator) are designed specifically to make this computation tractable.

Log-det estimators for large matrices. When $K$ is very large (e.g., a kernel matrix for millions of training points), exact Cholesky is intractable. Randomized log-det estimators use the identity $\log\det K = \text{tr}(\log K)$ together with the stochastic trace estimator $\text{tr}(\log K) \approx \frac{1}{m}\sum_{i=1}^m \mathbf{z}_i^\top (\log K)\mathbf{z}_i$ with random $\mathbf{z}_i \sim \mathcal{N}(0,I)$, where $(\log K)\mathbf{z}_i$ is computed via Lanczos iterations. This is used in scalable GP libraries (GPyTorch, 2018).
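An illustrative sketch of the stochastic trace identity (here $\log K$ is formed by a dense eigendecomposition purely to demonstrate the estimator; scalable implementations instead apply $\log K$ to the probe vectors via Lanczos):

import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 500
Z = rng.standard_normal((n, n))
K = Z @ Z.T / n + np.eye(n)                         # a PD test matrix

w, V = np.linalg.eigh(K)
logK = (V * np.log(w)) @ V.T                        # matrix logarithm of K (dense, for demo only)

probes = rng.standard_normal((m, n))                # z_i ~ N(0, I)
estimate = np.mean(np.einsum('ij,jk,ik->i', probes, logK, probes))   # mean of z_i^T (log K) z_i

exact = 2 * np.sum(np.log(np.diag(np.linalg.cholesky(K))))
print(estimate, exact)                              # the estimate fluctuates around the exact value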


8. Gram Matrices and Kernel Connections

8.1 The Gram Matrix Construction

Definition 8.1 (Gram Matrix). Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be a collection of vectors. Their Gram matrix is:

$$G \in \mathbb{R}^{n \times n}, \quad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \mathbf{x}_i^\top \mathbf{x}_j.$$

In matrix form: if $X = [\mathbf{x}_1|\cdots|\mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$ (rows are data points), then $G = XX^\top$.

Theorem 8.2. Every Gram matrix is positive semidefinite. Moreover, $G \succ 0$ iff $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are linearly independent.

Proof. For any $\mathbf{c} \in \mathbb{R}^n$:

$$\mathbf{c}^\top G \mathbf{c} = \sum_{i,j} c_i G_{ij} c_j = \sum_{i,j} c_i \mathbf{x}_i^\top \mathbf{x}_j c_j = \left\|\sum_i c_i \mathbf{x}_i\right\|^2 \geq 0.$$

So $G \succeq 0$. The quadratic form vanishes for some $\mathbf{c} \neq \mathbf{0}$ iff $\sum_i c_i\mathbf{x}_i = \mathbf{0}$, i.e., iff the vectors are linearly dependent. So $G \succ 0$ iff they are linearly independent. $\square$

Corollary. $\text{rank}(G) = \text{rank}(X)$, the number of linearly independent data vectors.

Examples:

  • $X = I_n$ (standard basis): $G = I_n \succ 0$.
  • $n > d$: $\text{rank}(G) \leq d < n$, so $G \succeq 0$ but $G \not\succ 0$.
  • $n = d$, $X$ invertible: $G \succ 0$. (A numeric check follows this list.)
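A small numpy illustration of Theorem 8.2 and its corollary, on arbitrary toy data:

import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 3
X = rng.standard_normal((n, d))                     # n = 6 points in R^3 (rows)
G = X @ X.T                                         # Gram matrix

print(np.all(np.linalg.eigvalsh(G) > -1e-10))       # PSD
print(np.linalg.matrix_rank(G), np.linalg.matrix_rank(X))   # both 3: G is singular since n > d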

8.2 Every PSD Matrix is a Gram Matrix

Theorem 8.3. $G \succeq 0$ if and only if $G$ is the Gram matrix of some set of vectors in some inner product space.

Proof. ($\Leftarrow$): proved above. ($\Rightarrow$): If $G \succeq 0$, let $L$ be a Cholesky-like factor: $G = LL^\top$, where $L$ may be rectangular, $n \times r$ with $r = \text{rank}(G)$. Take $\mathbf{x}_i^\top = L[i,:]$ (the $i$-th row of $L$). Then $G_{ij} = L[i,:]\, L[j,:]^\top = \mathbf{x}_i^\top \mathbf{x}_j$. $\square$

This is a deep result: the class of PSD matrices and the class of Gram matrices are exactly the same. Any PSD matrix can be "explained" as a matrix of inner products between some set of vectors.

Feature maps. If $\phi: \mathcal{X} \to \mathbb{R}^d$ is a feature map, then the Gram matrix $G_{ij} = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ is PSD. Kernel methods replace $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ with $k(\mathbf{x}_i, \mathbf{x}_j)$ directly, avoiding explicit feature computation.

8.3 Kernel Matrices and Mercer's Theorem

A positive definite kernel is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for every finite set $\{x_1,\ldots,x_n\} \subset \mathcal{X}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD.

Mercer's Theorem (informal statement). A continuous, symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel (i.e., produces PSD Gram matrices for every finite set) if and only if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}$ such that:

$$k(x, z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}.$$

This is the mathematical foundation of the kernel trick: instead of explicitly computing $\phi(x)$ (which may be infinite-dimensional), we evaluate $k(x,z)$ directly.

Forward reference: The full treatment of Mercer's theorem, reproducing kernel Hilbert spaces (RKHS), and the kernel trick appears in Chapter 12: Functional Analysis.

Standard PD kernels:

  • Linear kernel: $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}$ (standard Gram matrix)
  • RBF/Gaussian kernel: $k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2 / 2\ell^2)$
  • Polynomial kernel: $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d$ for $c \geq 0$, $d \in \mathbb{N}$
  • Matern kernel: used in GP regression with controllable smoothness

Verifying kernel validity. For a proposed kernel $k$, the standard test is (see the sketch after this list):

  1. Compute the $n \times n$ Gram matrix $K$ for a test set
  2. Check $K \succeq 0$ (e.g., via np.linalg.eigvalsh(K) >= -1e-10)
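A sketch of this test for the RBF kernel (the test points are arbitrary):

import numpy as np

def rbf(x, z, lengthscale=1.0):
    return np.exp(-np.sum((x - z)**2) / (2 * lengthscale**2))

rng = np.random.default_rng(7)
pts = rng.standard_normal((50, 4))                  # 50 arbitrary test points in R^4
K = np.array([[rbf(x, z) for z in pts] for x in pts])

print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # True: RBF Gram matrices are PSD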

8.4 Attention Scores as Gram Matrices

The scaled dot-product attention mechanism in transformers computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where $Q, K \in \mathbb{R}^{n \times d_k}$ are the query and key matrices.

The score matrix $S = QK^\top / \sqrt{d_k} \in \mathbb{R}^{n \times n}$ is a scaled cross-Gram matrix: its entries are inner products between query vectors and key vectors, which come from two different sets, so $S$ is not necessarily PSD (or even symmetric). In the special case of self-attention with tied weights $Q = K$, $S$ is proportional to $QQ^\top / \sqrt{d_k} \succeq 0$.

Why the $1/\sqrt{d_k}$ scaling. If $Q$ and $K$ have independent entries from $\mathcal{N}(0,1)$, then each product $Q_{ik}K_{jk}$ has variance 1 and $(QK^\top)_{ij} = \sum_{k=1}^{d_k} Q_{ik}K_{jk}$ has variance $d_k$. The $1/\sqrt{d_k}$ rescaling brings the variance back to 1, preventing the softmax from saturating into near-one-hot distributions.

Attention as kernel regression. The attention output for query $\mathbf{q}$ is:

$$\text{output} = \sum_{j=1}^n \frac{\exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k})}{\sum_l \exp(\mathbf{q}^\top\mathbf{k}_l/\sqrt{d_k})}\, \mathbf{v}_j.$$

This is a Nadaraya-Watson kernel regression with the exponential kernel $k(\mathbf{q},\mathbf{k}_j) = \exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k})$. The attention weights are the normalized kernel similarities, and the output is a kernel-weighted average of values. The exponential of the dot product is related to the RBF kernel (the Johnson-Lindenstrauss / random Fourier features perspective used in Performer / FAVOR+).
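A minimal numpy sketch (shapes and values made up) of scaled dot-product attention, written to expose the Gram-matrix / kernel-smoothing structure:

import numpy as np

rng = np.random.default_rng(8)
n, d_k, d_v = 5, 8, 4
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = Q @ K.T / np.sqrt(d_k)                          # scaled cross-Gram matrix of queries and keys
W = np.exp(S - S.max(axis=1, keepdims=True))        # numerically stable row-wise softmax...
W = W / W.sum(axis=1, keepdims=True)                # ...normalized kernel weights
output = W @ V                                      # kernel-weighted average of the values

print(output.shape)                                 # (5, 4)
print(np.allclose(W.sum(axis=1), 1.0))              # each row of attention weights sums to 1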


9. The PSD Cone and Semidefinite Programming

9.1 The Cone of PSD Matrices

The set of all $n \times n$ symmetric positive semidefinite matrices is denoted $\mathbb{S}_+^n$ (or $\mathbb{S}_{\geq 0}^n$). It lives inside the vector space $\mathbb{S}^n$ of $n \times n$ real symmetric matrices, which has dimension $n(n+1)/2$.

Theorem 9.1. $\mathbb{S}_+^n$ is a proper convex cone:

  1. Convex: If $A, B \succeq 0$ and $\lambda \in [0,1]$, then $\lambda A + (1-\lambda)B \succeq 0$.
  2. Cone: If $A \succeq 0$ and $t \geq 0$, then $tA \succeq 0$.
  3. Pointed: $A \succeq 0$ and $A \preceq 0$ implies $A = 0$ (the cone contains no lines through the origin).
  4. Closed: $\mathbb{S}_+^n$ is a closed subset of $\mathbb{S}^n$ (limits of PSD sequences are PSD).
  5. Full-dimensional: The interior of $\mathbb{S}_+^n$ is $\mathbb{S}_{++}^n$ (the PD matrices), which is non-empty.

Proof of convexity: For any $\mathbf{x}$: $\mathbf{x}^\top(\lambda A + (1-\lambda)B)\mathbf{x} = \lambda\mathbf{x}^\top A\mathbf{x} + (1-\lambda)\mathbf{x}^\top B\mathbf{x} \geq 0$. $\square$

The boundary. The boundary $\partial\mathbb{S}_+^n = \mathbb{S}_+^n \setminus \mathbb{S}_{++}^n$ consists of PSD matrices with at least one zero eigenvalue, i.e., rank-deficient PSD matrices. The boundary has lower dimension: the set of rank-$r$ PSD matrices is a manifold of dimension $rn - r(r-1)/2$.

Low-dimensional picture. For $n=2$: $\mathbb{S}^2$ is 3-dimensional (coordinates $A_{11}, A_{12}, A_{22}$). The PSD cone $\mathbb{S}_+^2$ is the set where $A_{11} \geq 0$, $A_{22} \geq 0$, and $A_{11}A_{22} \geq A_{12}^2$ - a solid "ice cream cone" in 3D, with the PD matrices forming its interior.

Dual cone. The dual of $\mathbb{S}_+^n$ with respect to the Frobenius inner product $\langle A, B\rangle_F = \text{tr}(AB)$ is:

$$(\mathbb{S}_+^n)^* = \{B \in \mathbb{S}^n : \text{tr}(AB) \geq 0 \text{ for all } A \succeq 0\} = \mathbb{S}_+^n.$$

The PSD cone is self-dual: $(\mathbb{S}_+^n)^* = \mathbb{S}_+^n$. This is analogous to the non-negative reals being self-dual.

9.2 Semidefinite Programming

Semidefinite programming (SDP) is the optimization of a linear objective over the intersection of the PSD cone with an affine set:

Standard form SDP (primal):

$$\min_{X \in \mathbb{S}^n} \ \langle C, X\rangle_F = \text{tr}(CX) \qquad \text{subject to} \quad \langle A_i, X\rangle_F = b_i, \ i = 1,\ldots,m, \qquad X \succeq 0.$$

Here $C, A_1, \ldots, A_m \in \mathbb{S}^n$ and $\mathbf{b} \in \mathbb{R}^m$ are the problem data; $X \in \mathbb{S}^n$ is the decision variable.

Dual SDP:

$$\max_{\mathbf{y} \in \mathbb{R}^m} \ \mathbf{b}^\top\mathbf{y} \qquad \text{subject to} \quad C - \sum_{i=1}^m y_i A_i \succeq 0.$$

Duality. Weak duality always holds: $\text{primal value} \geq \text{dual value}$. Strong duality (primal = dual) holds under Slater's condition: if the primal is strictly feasible (some $X \succ 0$ satisfies all constraints), then strong duality holds and the dual optimum is attained.

Relation to other optimization problems. SDP generalizes:

  • Linear programming (LP): LP is SDP with diagonal matrices $C, A_i$
  • SOCP (second-order cone programming): SOCP is a special SDP
  • Quadratically constrained QP: Many QCQPs can be lifted to SDPs via semidefinite relaxation

Algorithms. Interior-point methods (barrier methods) solve SDPs in polynomial time: $O(m^{1.5} n^3)$ per iteration for an $m$-constraint, $n \times n$ SDP. Standard solvers: SCS, MOSEK, CVXPY (modelling layer).

9.3 SDP in Machine Learning

Max-cut relaxation (Goemans-Williamson, 1995). The max-cut problem on a graph $G=(V,E)$ with edge weights $w_{ij}$ is NP-hard. The Goemans-Williamson SDP relaxation gives a $0.878$-approximation:

$$\max_{X \succeq 0} \ \frac{1}{2}\sum_{ij} w_{ij}(1 - X_{ij}) \qquad \text{s.t.} \quad X_{ii} = 1, \ i=1,\ldots,n.$$

The solution $X^*$ satisfies $X^* = VV^\top$ for some unit vectors $\mathbf{v}_1,\ldots,\mathbf{v}_n$; random hyperplane rounding recovers a near-optimal cut.
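A hedged sketch of the relaxation plus hyperplane rounding using CVXPY (the random graph, weights, and default solver choice are illustrative assumptions, not part of the lesson):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(9)
n = 8
W = rng.integers(0, 2, size=(n, n)).astype(float)   # random symmetric 0/1 edge weights
W = np.triu(W, 1); W = W + W.T

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [X[i, i] == 1 for i in range(n)]
# (1/2) sum_{i<j} w_ij (1 - X_ij), written over the full symmetric weight matrix
objective = cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X)))
cp.Problem(objective, constraints).solve()

# Random hyperplane rounding: factor X* = V V^T, cut by the sign of V r
w_eig, U = np.linalg.eigh(X.value)
Vfac = U * np.sqrt(np.clip(w_eig, 0, None))
signs = np.sign(Vfac @ rng.standard_normal(n))
cut_value = 0.25 * np.sum(W * (1 - np.outer(signs, signs)))
print(cut_value)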

Metric learning. Learning a Mahalanobis distance $d_A(\mathbf{x},\mathbf{z}) = \big((\mathbf{x}-\mathbf{z})^\top A (\mathbf{x}-\mathbf{z})\big)^{1/2}$ requires $A \succeq 0$. Methods like ITML (Information-Theoretic Metric Learning) and SDML formulate this as an SDP over PSD matrices with constraints that similar pairs are close and dissimilar pairs are far.

Covariance estimation. In high dimensions ($p > n$), the sample covariance $\hat{\Sigma} = \frac{1}{n}X^\top X$ may be singular. Regularized covariance estimation (graphical lasso, precision matrix estimation) adds a sparsity penalty on the precision matrix $\Theta = \Sigma^{-1}$:

$$\min_{\Theta \succ 0} \left[ \text{tr}(\hat{S}\,\Theta) - \log\det\Theta + \lambda\|\Theta\|_1 \right]$$

where $\hat{S}$ is the sample covariance and $\|\cdot\|_1$ is the element-wise $\ell_1$ norm. This is not an SDP directly, but the PSD constraint $\Theta \succ 0$ is the core structural requirement.

Fairness constraints. In algorithmic fairness, PSD constraints arise as necessary conditions for disparate impact compliance. Certified defenses against adversarial examples (semidefinite relaxations of neural network verification) are large-scale SDPs; solvers like $\alpha$-CROWN reformulate them via Lagrangian relaxation.


10. Applications in Machine Learning

10.1 Multivariate Gaussians and Covariance Matrices

The multivariate Gaussian. A random vector $\mathbf{x} \in \mathbb{R}^n$ has the distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ if:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}(\det\Sigma)^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).$$

For this density to be a valid (normalized, integrable) probability distribution, $\Sigma$ must be symmetric positive definite. The three requirements:

  1. Symmetry: $\Sigma = \Sigma^\top$ (covariance is symmetric by definition)
  2. Positive definiteness: $\Sigma \succ 0$ ensures $\det\Sigma \neq 0$ (finite normalizing constant) and that $\Sigma^{-1}$ exists (the exponent is a proper quadratic form)
  3. If $\Sigma \succeq 0$ but singular: the distribution becomes degenerate - supported on an affine subspace, not all of $\mathbb{R}^n$

Parameterizing covariances in neural networks. A neural network that outputs a covariance matrix must parameterize it to be PSD. Standard approaches:

  • Diagonal: $\Sigma = \text{diag}(\exp(\boldsymbol{s}))$ where $\boldsymbol{s}$ is a learned vector. Automatically PD.
  • Cholesky lower triangular: $\Sigma = LL^\top$ where $L$ has positive diagonal (enforced via softplus on the diagonal entries). This is the most expressive parameterization.
  • Low-rank + diagonal: $\Sigma = FF^\top + D$ where $F \in \mathbb{R}^{n \times k}$ ($k \ll n$) and $D$ is diagonal and positive. Woodbury allows efficient inversion.

Sampling from $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$: Given $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$:

$$\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$$

where $L = \text{chol}(\Sigma)$. This is the fundamental sampling algorithm: the Cholesky factor maps isotropic noise to correlated noise.
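A sketch of this recipe on a made-up 3-dimensional covariance; the empirical covariance of the samples should come out close to $\Sigma$:

import numpy as np

rng = np.random.default_rng(10)
n, num_samples = 3, 100_000
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                     # target covariance, PD

L = np.linalg.cholesky(Sigma)
eps = rng.standard_normal((num_samples, n))         # eps ~ N(0, I)
samples = mu + eps @ L.T                            # x = mu + L eps, one sample per row

print(np.allclose(np.cov(samples.T), Sigma, atol=0.1))   # empirical covariance close to Sigma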

10.2 Fisher Information Matrix and Natural Gradient

Definition 10.1 (Fisher Information Matrix). For a statistical model $p(\mathbf{x}|\boldsymbol{\theta})$ with parameter $\boldsymbol{\theta} \in \mathbb{R}^d$, the Fisher information matrix is:

$$F(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x} \sim p(\cdot|\boldsymbol{\theta})}\!\left[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})^\top\right].$$

PSD proof. $F$ is the covariance matrix of the (zero-mean) score function $\mathbf{s} = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})$: it has the form $\mathbb{E}[\mathbf{s}\mathbf{s}^\top]$, and any expected outer product is PSD. In fact, $F \succ 0$ for regular statistical models (identifiable, full rank).

Natural gradient. Ordinary gradient descent $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}}\mathcal{L}$ uses the Euclidean metric on parameter space. The natural gradient uses the Fisher metric:

$$\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L}.$$

The natural gradient is invariant to reparameterization of the model - it measures the steepest descent direction with respect to the KL divergence geometry (information geometry).

K-FAC (Kronecker-Factored Approximate Curvature). Martens & Grosse (2015) approximate the Fisher information matrix for neural networks as a block-diagonal matrix, where each block factorizes as a Kronecker product:

$$F \approx \hat{F} = \text{block-diag}(A_1 \otimes G_1,\ A_2 \otimes G_2,\ \ldots)$$

where $A_l = \mathbb{E}[\mathbf{a}_{l-1}\mathbf{a}_{l-1}^\top]$ (input activation covariance) and $G_l = \mathbb{E}[\boldsymbol{\delta}_l\boldsymbol{\delta}_l^\top]$ (pre-activation gradient covariance) for layer $l$. Both $A_l$ and $G_l$ are PSD (covariance matrices). The Kronecker product $A_l \otimes G_l$ is PSD, and its inverse is $(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}$, computable cheaply via Cholesky of each factor separately.
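A quick numpy check (toy factor sizes assumed) of the Kronecker inverse identity that K-FAC relies on:

import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((3, 3)); A = A @ A.T + 3 * np.eye(3)   # PD factor A_l
G = rng.standard_normal((4, 4)); G = G @ G.T + 4 * np.eye(4)   # PD factor G_l

lhs = np.linalg.inv(np.kron(A, G))                  # invert the 12 x 12 Kronecker product
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))   # invert the small factors separately
print(np.allclose(lhs, rhs))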

10.3 Gaussian Process Regression

Gaussian process regression is a Bayesian non-parametric regression method whose core computational object is a PD kernel matrix.

Model. Place a GP prior $f \sim \mathcal{GP}(0, k)$ where $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a PD kernel. Observe noisy outputs $\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$.

Prediction. The predictive distribution at new points $X_*$ is:

$$f_* \mid X_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$

where:

$$\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2I)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2I)^{-1}K_{n*}$$

and $K_{nn} = k(\mathbf{X}, \mathbf{X})$, $K_{*n} = k(X_*, \mathbf{X})$, $K_{**} = k(X_*, X_*)$.

Computational core: Cholesky. Factor $K_{nn} + \sigma^2 I = LL^\top$. Then:

  • $\boldsymbol{\alpha} = L^\top \backslash (L \backslash \mathbf{y})$ (two triangular solves)
  • $\boldsymbol{\mu}_* = K_{*n}\boldsymbol{\alpha}$ (matrix-vector product)
  • $V = L \backslash K_{n*}$ (triangular solve, $n \times n_*$ system)
  • $\Sigma_* = K_{**} - V^\top V$ (Schur complement form)
  • $\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i\log L_{ii} - \frac{n}{2}\log(2\pi)$

The entire GP regression computation is $O(n^3)$ via Cholesky - the classic bottleneck for large datasets, motivating sparse GP approximations (inducing points, Nystrom, SVGP).
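A compact numpy sketch of this recipe (toy RBF kernel and data assumed; np.linalg.solve is used on the triangular factor for brevity, where scipy.linalg.solve_triangular would exploit the structure):

import numpy as np

def rbf(X1, X2, ell=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(12)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
Xs = np.linspace(-3, 3, 50)[:, None]
sigma = 0.1

K_nn, K_sn, K_ss = rbf(X, X), rbf(Xs, X), rbf(Xs, Xs)
L = np.linalg.cholesky(K_nn + sigma**2 * np.eye(len(y)))     # K_nn + sigma^2 I = L L^T
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # two solves with the triangular factor
mu_star = K_sn @ alpha                                       # predictive mean
Vmat = np.linalg.solve(L, K_sn.T)                            # V = L \ K_ns
Sigma_star = K_ss - Vmat.T @ Vmat                            # Schur-complement form
log_marg = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)
print(mu_star.shape, Sigma_star.shape, log_marg)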

10.4 Hessian and Loss Landscape Sharpness

Second-order characterization of minima. At a critical point $\nabla\mathcal{L}(\boldsymbol{\theta}^*) = 0$ of a smooth loss $\mathcal{L}$:

  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succ 0$: strict local minimum (the loss is locally a strictly convex bowl)
  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succeq 0$: local minimum or saddle (Hessian is PSD)
  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*)$ indefinite: saddle point

Sharpness. The sharpness of a minimum is $\lambda_{\max}(\nabla^2\mathcal{L}(\boldsymbol{\theta}^*))$ - the largest eigenvalue of the Hessian. Flat minima (small sharpness) are widely reported to generalize better than sharp minima: a small perturbation $\boldsymbol{\theta}^* + \boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\| \leq \rho$ changes the loss by at most about $\frac{1}{2}\rho^2 \lambda_{\max}(\nabla^2\mathcal{L})$ (by a second-order Taylor expansion and the bound $\boldsymbol{\delta}^\top H\boldsymbol{\delta} \leq \lambda_{\max}\|\boldsymbol{\delta}\|^2$).

SAM (Sharpness-Aware Minimization). Foret et al. (2021) propose:

$$\min_{\boldsymbol{\theta}} \max_{\|\boldsymbol{\delta}\|\leq\rho} \mathcal{L}(\boldsymbol{\theta}+\boldsymbol{\delta}).$$

The inner max finds the worst-case perturbation (solved approximately as $\boldsymbol{\delta}^* = \rho\, \nabla\mathcal{L}/\|\nabla\mathcal{L}\|$). SAM is a first-order approximation to minimizing sharpness and is reported to improve generalization on image and language tasks.

Gauss-Newton and PSD Hessian approximations. The true Hessian $\nabla^2\mathcal{L}$ may be indefinite during training (early stages, overparameterized models). The Gauss-Newton matrix $G = J^\top J$ (where $J$ is the Jacobian of the predictions, for a squared-error loss) is always PSD and is often a better preconditioner. K-FAC uses a Gauss-Newton-style approximation.

10.5 Reparameterization Trick in VAEs

The variational autoencoder (VAE) requires sampling from a distribution $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \Sigma_\phi(\mathbf{x}))$, where $\boldsymbol{\mu}_\phi$ and $\Sigma_\phi$ are outputs of an encoder neural network, and differentiating through the sampling process.

The reparameterization trick. Instead of sampling $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ directly (which blocks gradient flow), write:

$$\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

where $L = \text{chol}(\Sigma)$. Now $\mathbf{z}$ is a deterministic function of $(\boldsymbol{\mu}, L, \boldsymbol{\epsilon})$, and gradients can flow through $\boldsymbol{\mu}$ and $L$.

Diagonal VAE (standard). Most VAE implementations use a diagonal covariance: $\Sigma = \text{diag}(\exp(\mathbf{s}))$, so $L = \text{diag}(\exp(\mathbf{s}/2))$ and $\mathbf{z} = \boldsymbol{\mu} + \exp(\mathbf{s}/2) \odot \boldsymbol{\epsilon}$.

Full covariance VAE. For a full covariance Cholesky parameterization, the encoder outputs:

  1. Mean $\boldsymbol{\mu} \in \mathbb{R}^d$
  2. Lower triangular $L \in \mathbb{R}^{d \times d}$ with positive diagonal (e.g., $L_{ii} = \text{softplus}(\tilde{L}_{ii})$, off-diagonal unconstrained)

Then $\Sigma = LL^\top$ and $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$. The gradient through $L$ back to the encoder parameters flows via:

$$\frac{\partial \mathbf{z}}{\partial \text{vec}(L)} = \frac{\partial(L\boldsymbol{\epsilon})}{\partial \text{vec}(L)} = \boldsymbol{\epsilon}^\top \otimes I_d$$

(the Kronecker product of the sampled noise $\boldsymbol{\epsilon}$ with the identity, in the column-stacking vectorization convention; entrywise, $\partial z_i/\partial L_{jk} = \delta_{ij}\,\epsilon_k$).
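A hedged numpy sketch of the full-covariance reparameterization; the raw "encoder outputs" below are stand-in random arrays, not a real encoder:

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(13)
d = 4
mu = rng.standard_normal(d)                         # stand-in for the encoder mean output
L_raw = rng.standard_normal((d, d))                 # stand-in for the unconstrained encoder output

L = np.tril(L_raw, -1) + np.diag(softplus(np.diag(L_raw)))   # lower triangular, positive diagonal
Sigma = L @ L.T                                     # guaranteed PD since the diagonal of L is > 0

eps = rng.standard_normal(d)                        # eps ~ N(0, I)
z = mu + L @ eps                                    # reparameterized sample

print(np.all(np.linalg.eigvalsh(Sigma) > 0), z.shape)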

For normalizing flows. More expressive VAE variants use normalizing flows for the posterior $q_\phi(\mathbf{z}|\mathbf{x})$. The flow is a sequence of invertible maps; the log-det Jacobian of each map must be computed efficiently. Triangular Jacobians (e.g., masked autoregressive flow / IAF) achieve $O(d)$ log-det computation.


11. Common Mistakes

1. Checking only diagonal entries to test PD.
   Why it's wrong: positive diagonal is necessary but not sufficient. $A = \begin{pmatrix}1 & 2 \\ 2 & 1\end{pmatrix}$ has positive diagonal but is indefinite ($\det = -3$).
   Fix: use Sylvester's criterion (all leading principal minors > 0) or attempt Cholesky.

2. Concluding $A \succ 0$ from $\det A > 0$ alone.
   Why it's wrong: $\det > 0$ is necessary but not sufficient. $\begin{pmatrix}-1 & 0 \\ 0 & -2\end{pmatrix}$ has $\det = 2 > 0$ but is negative definite.
   Fix: check all leading principal minors, not just the full determinant.

3. Assuming $A^\top A \succ 0$ for any $A$.
   Why it's wrong: $A^\top A \succeq 0$ always, but $A^\top A \succ 0$ iff $A$ has full column rank. If $A$ has a null vector ($A\mathbf{v} = \mathbf{0}$), then $\mathbf{v}^\top A^\top A \mathbf{v} = 0$.
   Fix: verify that $\text{rank}(A)$ equals the number of columns.

4. Confusing the Cholesky factor $L$ with the PSD square root $A^{1/2}$.
   Why it's wrong: $L$ is lower triangular; $A^{1/2}$ is symmetric. Both satisfy "squared = $A$" but in different senses ($LL^\top = A$ vs $(A^{1/2})^2 = A$). They are equal only when $A$ is diagonal.
   Fix: use $L$ for solving/sampling; use $A^{1/2}$ for Mahalanobis/whitening.

5. Using np.log(np.linalg.det(A)) for large $A$.
   Why it's wrong: np.linalg.det can overflow/underflow for large $n$ (it is a product of many numbers).
   Fix: use Cholesky: 2 * np.sum(np.log(np.diag(np.linalg.cholesky(A)))).

6. Forgetting that Sylvester's criterion does NOT characterize PSD.
   Why it's wrong: Sylvester requires all leading principal minors > 0, which gives PD, not PSD. For PSD you need all principal minors (not just leading ones) $\geq 0$ - a much larger set.
   Fix: for PSD testing, use eigenvalues or attempt pivoted Cholesky.

7. Inverting the Loewner order incorrectly.
   Why it's wrong: students often think $A \succeq B$ implies $A^{-1} \succeq B^{-1}$. The correct fact is $A \succeq B \succ 0 \Rightarrow B^{-1} \succeq A^{-1}$ (inversion reverses the order).
   Fix: remember: larger matrix -> smaller inverse (as for positive scalars: $a \geq b > 0 \Rightarrow 1/a \leq 1/b$).

8. Adding jitter $\epsilon I$ without checking scale.
   Why it's wrong: if $\|A\|_2 \approx 10^6$ and you add $\epsilon = 10^{-6}$, the relative jitter is $10^{-12}$, effectively zero numerically.
   Fix: set the jitter proportional to the scale: $\epsilon = \delta \cdot \text{tr}(A)/n$ for some small $\delta$.

9. Assuming the kernel matrix stays PD after arbitrary operations.
   Why it's wrong: not every operation on PD kernels preserves positive definiteness. E.g., the difference $k_1(\mathbf{x},\mathbf{z}) - k_2(\mathbf{x},\mathbf{z})$ may not be PD.
   Fix: verify the Gram matrix, or use closure properties: sums, products, and positive scalings/exponentiations of PD kernels are PD.

10. Computing the Schur complement with singular $A$.
    Why it's wrong: the Schur complement $D - CA^{-1}B$ requires $A$ invertible. If $A$ is only PSD (rank-deficient), $A^{-1}$ does not exist.
    Fix: use the Moore-Penrose pseudoinverse: $S = D - CA^\dagger B$, though the PD characterization no longer holds as stated.

11. Treating "positive definite" as a property of non-symmetric matrices.
    Why it's wrong: in this text PD is defined for symmetric matrices. A non-symmetric matrix can have all eigenvalues with positive real part (they may even be complex) and still have $\mathbf{x}^\top A\mathbf{x} < 0$ for some $\mathbf{x}$.
    Fix: always symmetrize: replace $A$ with $(A + A^\top)/2$ before testing PD.

12. Thinking the PD constraint is automatically satisfied by a NN output.
    Why it's wrong: a neural network outputting $n(n+1)/2$ numbers does not automatically produce a valid lower triangular $L$ with positive diagonal.
    Fix: enforce it: use softplus on the diagonal entries and leave the off-diagonal unconstrained. Then $\Sigma = LL^\top$ is guaranteed PD.
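A small helper in the spirit of the "attempt Cholesky" fixes above (illustrative; the symmetry tolerance is an arbitrary choice):

import numpy as np

def is_symmetric_pd(A, tol=1e-10):
    # Returns True if A is (numerically) symmetric and positive definite.
    if not np.allclose(A, A.T, atol=tol):
        return False
    try:
        np.linalg.cholesky(A)                       # succeeds iff A is PD (up to rounding)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_symmetric_pd(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
print(is_symmetric_pd(np.array([[1.0, 2.0], [2.0, 1.0]])))   # False: indefinite (mistake 1 above)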
