
Positive Definite Matrices: Part 6: Schur Complement to 11. Common Mistakes

6. Schur Complement

The Schur complement is the "matrix analogue of completing the square" for block matrices. It appears everywhere in probability (Gaussian conditioning), linear algebra (block matrix inversion), and optimization (constraint elimination).

6.1 Definition for Block Matrices

Definition 6.1 (Schur Complement). Let

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$

be a block matrix with $A \in \mathbb{R}^{p \times p}$ invertible. The Schur complement of $A$ in $M$ is:

$$S = D - CA^{-1}B \in \mathbb{R}^{q \times q}.$$

Similarly, if $D \in \mathbb{R}^{q \times q}$ is invertible, the Schur complement of $D$ is $A - BD^{-1}C$.

Origin: block Gaussian elimination. The Schur complement arises naturally when eliminating the $(2,1)$ block:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ CA^{-1} & I \end{pmatrix}\begin{pmatrix} A & B \\ 0 & D - CA^{-1}B \end{pmatrix}.$$

The $(2,2)$ block in the upper triangular factor is exactly $S = D - CA^{-1}B$.

Determinant formula. A key consequence of the block LU:

$$\det M = \det A \cdot \det(D - CA^{-1}B) = \det A \cdot \det S.$$

Similarly, $\det M = \det D \cdot \det(A - BD^{-1}C)$ when $D$ is invertible.

Block matrix inverse. Using the Schur complement:

$$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1}B S^{-1} C A^{-1} & -A^{-1}B S^{-1} \\ -S^{-1}CA^{-1} & S^{-1} \end{pmatrix}$$

when both $A$ and $S = D - CA^{-1}B$ are invertible.
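A quick numpy check (the block sizes and random seed below are arbitrary test choices, not from the lesson) of the determinant formula and the $(2,2)$ block of the block inverse:

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
M = rng.standard_normal((p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)          # make M symmetric positive definite
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

S = D - C @ np.linalg.solve(A, B)              # Schur complement of A in M

print(np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S)))   # det M = det A * det S
print(np.allclose(np.linalg.inv(M)[p:, p:], np.linalg.inv(S)))             # (2,2) block of M^{-1} is S^{-1}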

6.2 Schur Complement and Positive Definiteness

The Schur complement provides an elegant characterization of block PD matrices.

Theorem 6.2 (Schur PD Criterion). Let $M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$ be symmetric (so $C = B^\top$). Then:

$$M \succ 0 \quad \Longleftrightarrow \quad A \succ 0 \ \text{ and } \ S = D - B^\top A^{-1} B \succ 0.$$

Proof. We use the block Cholesky:

$$M = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ B^\top A^{-1} & I \end{pmatrix}\begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}\begin{pmatrix} I & A^{-1}B \\ 0 & I \end{pmatrix}.$$

The middle factor is block diagonal. For any $\mathbf{v} = (\mathbf{x}, \mathbf{y})^\top$:

$$\mathbf{v}^\top M \mathbf{v} = (\mathbf{x} + A^{-1}B\mathbf{y})^\top A (\mathbf{x} + A^{-1}B\mathbf{y}) + \mathbf{y}^\top S \mathbf{y}.$$

($\Rightarrow$): If $M \succ 0$, taking $\mathbf{y} = \mathbf{0}$ shows $A \succ 0$; taking $\mathbf{x} = -A^{-1}B\mathbf{y}$ shows $\mathbf{y}^\top S\mathbf{y} > 0$ for $\mathbf{y} \neq \mathbf{0}$, so $S \succ 0$.

($\Leftarrow$): If $A \succ 0$ and $S \succ 0$, then both terms are non-negative, and at least one is positive for $(\mathbf{x},\mathbf{y}) \neq (\mathbf{0},\mathbf{0})$, so $M \succ 0$. $\square$

Corollary. $M \succeq 0 \Leftrightarrow A \succeq 0$ and $S = D - B^\top A^{-1}B \succeq 0$ (when $A$ is invertible; otherwise use the rank condition).

SCHUR COMPLEMENT AND BLOCK PD
========================================================================

  M = ( A     B )   symmetric
      ( B^T   D )

  M ≻ 0  <=>  A ≻ 0  AND  S = D - B^T A^-1 B ≻ 0

  Intuition: "completing the square" in block form

  v^T M v = (x + A^-1 B y)^T A (x + A^-1 B y)  +  y^T S y
            -------------------------------      -------
               ≥ 0 (since A ≻ 0)                 ≥ 0 (since S ≻ 0)

========================================================================

6.3 Matrix Inversion Lemma

The Woodbury matrix identity (also called the matrix inversion lemma or Sherman-Morrison-Woodbury formula) is:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$$

where $A \in \mathbb{R}^{n \times n}$, $C \in \mathbb{R}^{k \times k}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$. This is valuable when $k \ll n$ (low-rank update): instead of inverting an $n \times n$ matrix, invert a $k \times k$ matrix.

Derivation via Schur complement. Consider the block system:

$$M = \begin{pmatrix} A & U \\ -V & C^{-1} \end{pmatrix}.$$

The Schur complement of $C^{-1}$ in $M$ is $A - U(C^{-1})^{-1}(-V) = A + UCV$; the Schur complement of $A$ is $C^{-1} + VA^{-1}U$. Equating the two expressions the block inverse formula gives for the $(1,1)$ block of $M^{-1}$ yields the Woodbury identity.

Special case (rank-1 update): If $U = \mathbf{u}$, $V = \mathbf{v}^\top$, $C = c$ (scalar):

$$(A + c\,\mathbf{u}\mathbf{v}^\top)^{-1} = A^{-1} - \frac{c\, A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1 + c\,\mathbf{v}^\top A^{-1}\mathbf{u}}.$$

This is the Sherman-Morrison formula for rank-1 updates.
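A small numeric sanity check of the Sherman-Morrison formula; the matrix, vectors, and scalar below are arbitrary test values:

import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)     # any invertible A works here
u, v = rng.standard_normal(n), rng.standard_normal(n)
c = 0.7

A_inv = np.linalg.inv(A)
denom = 1.0 + c * v @ A_inv @ u
updated_inv = A_inv - c * np.outer(A_inv @ u, v @ A_inv) / denom   # Sherman-Morrison

print(np.allclose(updated_inv, np.linalg.inv(A + c * np.outer(u, v))))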

For AI: The Woodbury identity is used in:

  • Gaussian process prediction: the posterior involves $(K + \sigma^2 I)^{-1}$; with a linear kernel $K = XX^\top$, Woodbury lets you work with whichever of the $n \times n$ or $p \times p$ formulations is smaller
  • LoRA / low-rank adaptation: the effective weight $W_0 + BA$ is a rank-$r$ update; Woodbury allows efficient inversion without materializing the full matrix
  • Kalman filter update step: $(P^{-1} + H^\top R^{-1}H)^{-1}$ uses Woodbury to avoid inverting large state covariances

6.4 Gaussian Conditioning via Schur Complement

The Schur complement is the algebraic engine behind the conditional distribution of multivariate Gaussians.

Setup. Let $\begin{pmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right)$.

Conditional distribution. The conditional $\mathbf{x}_1 \mid \mathbf{x}_2 = \mathbf{a}$ is Gaussian:

$$\mathbf{x}_1 \mid \mathbf{x}_2 = \mathbf{a} \sim \mathcal{N}(\boldsymbol{\mu}_{1|2},\, \Sigma_{1|2})$$

where:

$$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{a} - \boldsymbol{\mu}_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

The conditional covariance $\Sigma_{1|2}$ is exactly the Schur complement of $\Sigma_{22}$ in $\Sigma$. Theorem 6.2 guarantees $\Sigma_{1|2} \succ 0$ whenever $\Sigma \succ 0$.

Derivation. Complete the square in the joint density. The exponent:

$$(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

factored using the block inverse of $\Sigma^{-1}$ (the "precision matrix") gives the Schur complement form.

For AI: In Gaussian process regression, the predictive distribution at new points $\mathbf{x}_*$ given training observations $\mathbf{y}$ uses exactly this formula:

$$\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}.$$

The second term is the Schur complement. Computing it via Cholesky: $L = \text{chol}(K_{nn} + \sigma^2 I)$, then $\Sigma_* = K_{**} - (L^{-1}K_{n*})^\top(L^{-1}K_{n*})$.
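A short numpy sketch of Gaussian conditioning (toy dimensions and a random joint covariance, assumed for illustration); the conditional covariance is the Schur complement of $\Sigma_{22}$:

import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 2, 3
Z = rng.standard_normal((n1 + n2, n1 + n2))
Sigma = Z @ Z.T + (n1 + n2) * np.eye(n1 + n2)       # joint covariance, PD
mu = rng.standard_normal(n1 + n2)

S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
mu1, mu2 = mu[:n1], mu[n1:]
a = rng.standard_normal(n2)                         # observed value of x_2

mu_cond = mu1 + S12 @ np.linalg.solve(S22, a - mu2)          # conditional mean
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)           # Schur complement of S22

print(mu_cond)
print(np.all(np.linalg.eigvalsh(Sigma_cond) > 0))            # PD, as Theorem 6.2 guarantees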


7. Log-Determinant

7.1 Definition and Properties

For a positive definite matrix $A \succ 0$, the log-determinant is:

$$\log\det A = \log \prod_{i=1}^n \lambda_i = \sum_{i=1}^n \log \lambda_i.$$

Since $A \succ 0$ implies $\lambda_i > 0$ for all $i$, this is well-defined. The log-det is defined only on the interior of the PSD cone (i.e., on PD matrices).

Key computation via Cholesky:

$$\log\det A = \log\det(LL^\top) = \log(\det L)^2 = 2\log\det L = 2\sum_{i=1}^n \log L_{ii}.$$

Since $L_{ii} > 0$ (the Cholesky diagonal is positive), $\log L_{ii}$ is well-defined. This is the standard computational formula: factor $A = LL^\top$, then sum the logs of the diagonal entries.

import numpy as np

L = np.linalg.cholesky(A)                         # A = L L^T, L lower triangular with positive diagonal
log_det_A = 2 * np.sum(np.log(np.diag(L)))        # log det A = 2 * sum_i log L_ii

This is numerically more stable than np.log(np.linalg.det(A)) for large matrices, because det can underflow or overflow.

Properties:

  • $\log\det(AB) = \log\det A + \log\det B$ (for any invertible $A, B$)
  • $\log\det(A^{-1}) = -\log\det A$
  • $\log\det(\alpha A) = n\log\alpha + \log\det A$ for scalar $\alpha > 0$
  • $\log\det(A) = \text{tr}(\log A)$, where $\log A$ is the matrix logarithm (eigendecomposition-based)
  • As $A \to \partial \mathbb{S}_+^n$ (the boundary, i.e., a singular matrix): $\log\det A \to -\infty$

7.2 Log-Det as a Concave Function

Theorem 7.1. The function $f: \mathbb{S}_{++}^n \to \mathbb{R}$ defined by $f(A) = \log\det A$ is strictly concave on the cone of PD matrices.

Proof. We need to show $f(\lambda A + (1-\lambda)B) \geq \lambda f(A) + (1-\lambda) f(B)$ for $A, B \succ 0$ and $\lambda \in (0,1)$, with equality iff $A = B$.

Fix $A \succ 0$ and let $C = A^{-1/2}BA^{-1/2} \succ 0$, with eigenvalues $\mu_1 \geq \cdots \geq \mu_n > 0$.

Factoring $\lambda A + (1-\lambda)B = A^{1/2}(\lambda I + (1-\lambda)C)A^{1/2}$ gives:

$$f(\lambda A + (1-\lambda)B) = \log\det A + \sum_{i=1}^n \log(\lambda + (1-\lambda)\mu_i).$$

Since $g(t) = \log t$ is strictly concave: $\log(\lambda + (1-\lambda)\mu_i) \geq \lambda\log 1 + (1-\lambda)\log\mu_i = (1-\lambda)\log\mu_i$.

Summing: $f(\lambda A + (1-\lambda)B) \geq \log\det A + (1-\lambda)\sum_i \log\mu_i = \lambda f(A) + (1-\lambda)f(B)$.

Equality holds iff all $\mu_i = 1$, i.e., $C = I$, i.e., $B = A$. $\square$

Consequences of concavity:

  • The log-det has no local maxima except at the global maximum (over any convex domain)
  • Gradient methods for maximizing $\log\det A$ over a convex set converge to the global maximum
  • The function is differentiable at every PD matrix (the gradient is $A^{-1}$, see 7.3)

7.3 Gradient and Matrix Calculus

Theorem 7.2 (Log-Det Gradient). Let $f(A) = \log\det A$ for $A \succ 0$. Then:

$$\frac{\partial \log\det A}{\partial A} = A^{-\top} = A^{-1} \quad (\text{since } A \text{ is symmetric}).$$

More precisely, in the matrix calculus convention where $\partial f/\partial A_{ij}$ is the $(i,j)$ entry of the gradient matrix:

$$\left(\frac{\partial \log\det A}{\partial A}\right)_{ij} = (A^{-1})_{ji}.$$

Proof using Jacobi's formula. Jacobi's formula states $d(\det A) = \text{tr}(\text{adj}(A)\, dA) = \det(A)\,\text{tr}(A^{-1}\,dA)$. Therefore:

$$d(\log\det A) = \frac{d(\det A)}{\det A} = \text{tr}(A^{-1}\,dA).$$

Since $d(\log\det A) = \langle \nabla_A \log\det A,\, dA\rangle_F = \text{tr}((\nabla_A \log\det A)^\top dA)$, we read off $\nabla_A \log\det A = A^{-\top} = A^{-1}$ for symmetric $A$.

Second derivative (Hessian operator): For the function $A \mapsto \log\det A$:

$$d^2(\log\det A)[H, H] = -\text{tr}(A^{-1}H A^{-1}H) = -\text{tr}\big((A^{-1/2}HA^{-1/2})^2\big) \leq 0.$$

The negative sign confirms concavity.

Other useful log-det derivatives:

  • $\frac{\partial}{\partial \mathbf{a}} \log\det(A + \mathbf{a}\mathbf{a}^\top) = 2(A + \mathbf{a}\mathbf{a}^\top)^{-1}\mathbf{a}$ (rank-1 update)
  • $\frac{\partial}{\partial t} \log\det(A + tB) = \text{tr}\big((A+tB)^{-1}B\big)$ (see the numeric check after this list)
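A finite-difference check (illustrative, with an arbitrary PD test matrix) that the differential of $\log\det A$ in a symmetric direction $H$ is $\text{tr}(A^{-1}H)$, i.e., that the gradient is $A^{-1}$:

import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)                         # symmetric PD test matrix

H = rng.standard_normal((n, n))
H = (H + H.T) / 2                                   # symmetric perturbation direction
eps = 1e-6

def logdet(M):
    return 2 * np.sum(np.log(np.diag(np.linalg.cholesky(M))))

directional = (logdet(A + eps * H) - logdet(A - eps * H)) / (2 * eps)
print(np.isclose(directional, np.trace(np.linalg.inv(A) @ H)))   # matches tr(A^{-1} H)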

7.4 Log-Det in Machine Learning

Multivariate Gaussian log-likelihood. For $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$\log p(\mathbf{x}) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}).$$

The term $-\frac{1}{2}\log\det\Sigma$ penalizes large (spread-out) distributions. When fitting $\Sigma$ to data, maximizing the log-likelihood requires differentiating through $\log\det\Sigma$, using $\partial\log\det\Sigma/\partial\Sigma = \Sigma^{-1}$.

Gaussian process marginal likelihood. For GP regression with kernel matrix $K$ and noise variance $\sigma^2$:

$$\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top(K+\sigma^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\det(K+\sigma^2I) - \frac{n}{2}\log(2\pi).$$

Both terms require Cholesky: $L = \text{chol}(K + \sigma^2 I)$, then $\log\det = 2\sum\log L_{ii}$ and the quadratic form via triangular solves; a code sketch follows. The gradient with respect to a kernel hyperparameter $\theta$ is $\partial\log p/\partial\theta = \frac{1}{2}\text{tr}\big((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2I)^{-1})\,\partial K/\partial\theta\big)$ where $\boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}$.
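A minimal sketch of this computation (the RBF kernel, data, and hyperparameters below are made up for illustration):

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_log_marginal_likelihood(X, y, noise=0.1, lengthscale=1.0):
    n = len(y)
    K = rbf_kernel(X, X, lengthscale) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)                       # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))            # -(1/2) log det = -sum_i log L_ii
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(gp_log_marginal_likelihood(X, y))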

Normalizing flows. A normalizing flow defines a bijective mapping $f: \mathbf{z} \mapsto \mathbf{x}$ where $\mathbf{z} \sim \mathcal{N}(0,I)$. The log-likelihood of data $\mathbf{x}$ is:

$$\log p(\mathbf{x}) = \log p_z(f^{-1}(\mathbf{x})) + \log\big|\det J_{f^{-1}}\big|$$

where $J_{f^{-1}}$ is the Jacobian of the inverse map. Efficiently computing $\log|\det J|$ is the central computational challenge of normalizing flows. Architectures like RealNVP (triangular Jacobian, $\log|\det J| = \sum\log|J_{ii}|$) and FFJORD (stochastic trace estimator) are designed specifically to make this computation tractable.

Log-det estimators for large matrices. When $K$ is very large (e.g., a kernel matrix for millions of training points), exact Cholesky is intractable. Randomized log-det estimators use the identity $\log\det K = \text{tr}(\log K)$ together with the stochastic trace estimator $\text{tr}(\log K) \approx \frac{1}{m}\sum_{i=1}^m \mathbf{z}_i^\top (\log K)\mathbf{z}_i$ with random $\mathbf{z}_i \sim \mathcal{N}(0,I)$, where $(\log K)\mathbf{z}_i$ is computed via Lanczos iterations. This is used in scalable GP libraries (GPyTorch, 2018).
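An illustrative sketch of the stochastic trace identity (here $\log K$ is formed by a dense eigendecomposition purely to demonstrate the estimator; scalable implementations instead apply $\log K$ to the probe vectors via Lanczos):

import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 500
Z = rng.standard_normal((n, n))
K = Z @ Z.T / n + np.eye(n)                         # a PD test matrix

w, V = np.linalg.eigh(K)
logK = (V * np.log(w)) @ V.T                        # matrix logarithm of K (dense, for demo only)

probes = rng.standard_normal((m, n))                # z_i ~ N(0, I)
estimate = np.mean(np.einsum('ij,jk,ik->i', probes, logK, probes))   # mean of z_i^T (log K) z_i

exact = 2 * np.sum(np.log(np.diag(np.linalg.cholesky(K))))
print(estimate, exact)                              # the estimate fluctuates around the exact value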


8. Gram Matrices and Kernel Connections

8.1 The Gram Matrix Construction

Definition 8.1 (Gram Matrix). Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be a collection of vectors. Their Gram matrix is:

$$G \in \mathbb{R}^{n \times n}, \quad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle = \mathbf{x}_i^\top \mathbf{x}_j.$$

In matrix form: if $X = [\mathbf{x}_1|\cdots|\mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$ (rows are data points), then $G = XX^\top$.

Theorem 8.2. Every Gram matrix is positive semidefinite. Moreover, $G \succ 0$ iff $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are linearly independent.

Proof. For any $\mathbf{c} \in \mathbb{R}^n$:

$$\mathbf{c}^\top G \mathbf{c} = \sum_{i,j} c_i G_{ij} c_j = \sum_{i,j} c_i \mathbf{x}_i^\top \mathbf{x}_j c_j = \left\|\sum_i c_i \mathbf{x}_i\right\|^2 \geq 0.$$

So $G \succeq 0$. The quadratic form vanishes for some $\mathbf{c} \neq \mathbf{0}$ iff $\sum_i c_i\mathbf{x}_i = \mathbf{0}$, i.e., iff the vectors are linearly dependent. So $G \succ 0$ iff they are linearly independent. $\square$

Corollary. $\text{rank}(G) = \text{rank}(X)$, the number of linearly independent data vectors.

Examples:

  • $X = I_n$ (standard basis): $G = I_n \succ 0$.
  • $n > d$: $\text{rank}(G) \leq d < n$, so $G \succeq 0$ but $G \not\succ 0$.
  • $n = d$, $X$ invertible: $G \succ 0$. (A numeric check follows this list.)
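A small numpy illustration of Theorem 8.2 and its corollary, on arbitrary toy data:

import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 3
X = rng.standard_normal((n, d))                     # n = 6 points in R^3 (rows)
G = X @ X.T                                         # Gram matrix

print(np.all(np.linalg.eigvalsh(G) > -1e-10))       # PSD
print(np.linalg.matrix_rank(G), np.linalg.matrix_rank(X))   # both 3: G is singular since n > d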

8.2 Every PSD Matrix is a Gram Matrix

Theorem 8.3. $G \succeq 0$ if and only if $G$ is the Gram matrix of some set of vectors in some inner product space.

Proof. ($\Leftarrow$): proved above. ($\Rightarrow$): If $G \succeq 0$, let $L$ be a Cholesky-like factor: $G = LL^\top$, where $L$ may be rectangular, $n \times r$ with $r = \text{rank}(G)$. Take $\mathbf{x}_i^\top = L[i,:]$ (the $i$-th row of $L$). Then $G_{ij} = L[i,:]\, L[j,:]^\top = \mathbf{x}_i^\top \mathbf{x}_j$. $\square$

This is a deep result: the class of PSD matrices and the class of Gram matrices are exactly the same. Any PSD matrix can be "explained" as a matrix of inner products between some set of vectors.

Feature maps. If $\phi: \mathcal{X} \to \mathbb{R}^d$ is a feature map, then the Gram matrix $G_{ij} = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ is PSD. Kernel methods replace $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$ with $k(\mathbf{x}_i, \mathbf{x}_j)$ directly, avoiding explicit feature computation.

8.3 Kernel Matrices and Mercer's Theorem

A positive definite kernel is a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for every finite set $\{x_1,\ldots,x_n\} \subset \mathcal{X}$, the Gram matrix $K_{ij} = k(x_i, x_j)$ is PSD.

Mercer's Theorem (informal statement). A continuous, symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive definite kernel (i.e., produces PSD Gram matrices for every finite set) if and only if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi: \mathcal{X} \to \mathcal{H}$ such that:

$$k(x, z) = \langle \phi(x), \phi(z) \rangle_{\mathcal{H}}.$$

This is the mathematical foundation of the kernel trick: instead of explicitly computing $\phi(x)$ (which may be infinite-dimensional), we evaluate $k(x,z)$ directly.

Forward reference: The full treatment of Mercer's theorem, reproducing kernel Hilbert spaces (RKHS), and the kernel trick appears in Chapter 12: Functional Analysis.

Standard PD kernels:

  • Linear kernel: $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}$ (standard Gram matrix)
  • RBF/Gaussian kernel: $k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2 / 2\ell^2)$
  • Polynomial kernel: $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d$ for $c \geq 0$, $d \in \mathbb{N}$
  • Matern kernel: used in GP regression with controllable smoothness

Verifying kernel validity. For a proposed kernel $k$, the standard test is (see the sketch after this list):

  1. Compute the $n \times n$ Gram matrix $K$ for a test set
  2. Check $K \succeq 0$ (e.g., via np.linalg.eigvalsh(K) >= -1e-10)
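A sketch of this test for the RBF kernel (the test points are arbitrary):

import numpy as np

def rbf(x, z, lengthscale=1.0):
    return np.exp(-np.sum((x - z)**2) / (2 * lengthscale**2))

rng = np.random.default_rng(7)
pts = rng.standard_normal((50, 4))                  # 50 arbitrary test points in R^4
K = np.array([[rbf(x, z) for z in pts] for x in pts])

print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # True: RBF Gram matrices are PSD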

8.4 Attention Scores as Gram Matrices

The scaled dot-product attention mechanism in transformers computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where $Q, K \in \mathbb{R}^{n \times d_k}$ are the query and key matrices.

The score matrix $S = QK^\top / \sqrt{d_k} \in \mathbb{R}^{n \times n}$ is a scaled cross-Gram matrix: its entries are inner products between query vectors and key vectors, which come from two different sets, so $S$ is not necessarily PSD (or even symmetric). In the special case of self-attention with tied weights $Q = K$, $S$ is proportional to $QQ^\top / \sqrt{d_k} \succeq 0$.

Why the $1/\sqrt{d_k}$ scaling. If $Q$ and $K$ have independent entries from $\mathcal{N}(0,1)$, then each product $Q_{ik}K_{jk}$ has variance 1 and $(QK^\top)_{ij} = \sum_{k=1}^{d_k} Q_{ik}K_{jk}$ has variance $d_k$. The $1/\sqrt{d_k}$ rescaling brings the variance back to 1, preventing the softmax from saturating into near-one-hot distributions.

Attention as kernel regression. The attention output for query $\mathbf{q}$ is:

$$\text{output} = \sum_{j=1}^n \frac{\exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k})}{\sum_l \exp(\mathbf{q}^\top\mathbf{k}_l/\sqrt{d_k})}\, \mathbf{v}_j.$$

This is a Nadaraya-Watson kernel regression with the exponential kernel $k(\mathbf{q},\mathbf{k}_j) = \exp(\mathbf{q}^\top\mathbf{k}_j/\sqrt{d_k})$. The attention weights are the normalized kernel similarities, and the output is a kernel-weighted average of values. The exponential of the dot product is related to the RBF kernel (the Johnson-Lindenstrauss / random Fourier features perspective used in Performer / FAVOR+).
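A minimal numpy sketch (shapes and values made up) of scaled dot-product attention, written to expose the Gram-matrix / kernel-smoothing structure:

import numpy as np

rng = np.random.default_rng(8)
n, d_k, d_v = 5, 8, 4
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = Q @ K.T / np.sqrt(d_k)                          # scaled cross-Gram matrix of queries and keys
W = np.exp(S - S.max(axis=1, keepdims=True))        # numerically stable row-wise softmax...
W = W / W.sum(axis=1, keepdims=True)                # ...normalized kernel weights
output = W @ V                                      # kernel-weighted average of the values

print(output.shape)                                 # (5, 4)
print(np.allclose(W.sum(axis=1), 1.0))              # each row of attention weights sums to 1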


9. The PSD Cone and Semidefinite Programming

9.1 The Cone of PSD Matrices

The set of all $n \times n$ symmetric positive semidefinite matrices is denoted $\mathbb{S}_+^n$ (or $\mathbb{S}_{\geq 0}^n$). It lives inside the vector space $\mathbb{S}^n$ of $n \times n$ real symmetric matrices, which has dimension $n(n+1)/2$.

Theorem 9.1. $\mathbb{S}_+^n$ is a proper convex cone:

  1. Convex: If $A, B \succeq 0$ and $\lambda \in [0,1]$, then $\lambda A + (1-\lambda)B \succeq 0$.
  2. Cone: If $A \succeq 0$ and $t \geq 0$, then $tA \succeq 0$.
  3. Pointed: $A \succeq 0$ and $A \preceq 0$ implies $A = 0$ (the cone contains no lines through the origin).
  4. Closed: $\mathbb{S}_+^n$ is a closed subset of $\mathbb{S}^n$ (limits of PSD sequences are PSD).
  5. Full-dimensional: The interior of $\mathbb{S}_+^n$ is $\mathbb{S}_{++}^n$ (the PD matrices), which is non-empty.

Proof of convexity: For any $\mathbf{x}$: $\mathbf{x}^\top(\lambda A + (1-\lambda)B)\mathbf{x} = \lambda\mathbf{x}^\top A\mathbf{x} + (1-\lambda)\mathbf{x}^\top B\mathbf{x} \geq 0$. $\square$

The boundary. The boundary $\partial\mathbb{S}_+^n = \mathbb{S}_+^n \setminus \mathbb{S}_{++}^n$ consists of PSD matrices with at least one zero eigenvalue, i.e., rank-deficient PSD matrices. The boundary has lower dimension: the set of rank-$r$ PSD matrices is a manifold of dimension $rn - r(r-1)/2$.

Low-dimensional picture. For $n=2$: $\mathbb{S}^2$ is 3-dimensional (coordinates $A_{11}, A_{12}, A_{22}$). The PSD cone $\mathbb{S}_+^2$ is the set where $A_{11} \geq 0$, $A_{22} \geq 0$, and $A_{11}A_{22} \geq A_{12}^2$ - a solid "ice cream cone" in 3D, with the PD matrices forming its interior.

Dual cone. The dual of $\mathbb{S}_+^n$ with respect to the Frobenius inner product $\langle A, B\rangle_F = \text{tr}(AB)$ is:

$$(\mathbb{S}_+^n)^* = \{B \in \mathbb{S}^n : \text{tr}(AB) \geq 0 \text{ for all } A \succeq 0\} = \mathbb{S}_+^n.$$

The PSD cone is self-dual: $(\mathbb{S}_+^n)^* = \mathbb{S}_+^n$. This is analogous to the non-negative reals being self-dual.

9.2 Semidefinite Programming

Semidefinite programming (SDP) is the optimization of a linear objective over the intersection of the PSD cone with an affine set:

Standard form SDP (primal):

$$\min_{X \in \mathbb{S}^n} \ \langle C, X\rangle_F = \text{tr}(CX) \qquad \text{subject to} \quad \langle A_i, X\rangle_F = b_i, \ i = 1,\ldots,m, \qquad X \succeq 0.$$

Here $C, A_1, \ldots, A_m \in \mathbb{S}^n$ and $\mathbf{b} \in \mathbb{R}^m$ are the problem data; $X \in \mathbb{S}^n$ is the decision variable.

Dual SDP:

$$\max_{\mathbf{y} \in \mathbb{R}^m} \ \mathbf{b}^\top\mathbf{y} \qquad \text{subject to} \quad C - \sum_{i=1}^m y_i A_i \succeq 0.$$

Duality. Weak duality always holds: $\text{primal value} \geq \text{dual value}$. Strong duality (primal = dual) holds under Slater's condition: if the primal is strictly feasible (some $X \succ 0$ satisfies all constraints), then strong duality holds and the dual optimum is attained.

Relation to other optimization problems. SDP generalizes:

  • Linear programming (LP): LP is SDP with diagonal matrices $C, A_i$
  • SOCP (second-order cone programming): SOCP is a special SDP
  • Quadratically constrained QP: Many QCQPs can be lifted to SDPs via semidefinite relaxation

Algorithms. Interior-point methods (barrier methods) solve SDPs in polynomial time: $O(m^{1.5} n^3)$ per iteration for an $m$-constraint, $n \times n$ SDP. Standard solvers: SCS, MOSEK, CVXPY (modelling layer).

9.3 SDP in Machine Learning

Max-cut relaxation (Goemans-Williamson, 1995). The max-cut problem on a graph $G=(V,E)$ with edge weights $w_{ij}$ is NP-hard. The Goemans-Williamson SDP relaxation gives a $0.878$-approximation:

$$\max_{X \succeq 0} \ \frac{1}{2}\sum_{ij} w_{ij}(1 - X_{ij}) \qquad \text{s.t.} \quad X_{ii} = 1, \ i=1,\ldots,n.$$

The solution $X^*$ satisfies $X^* = VV^\top$ for some unit vectors $\mathbf{v}_1,\ldots,\mathbf{v}_n$; random hyperplane rounding recovers a near-optimal cut.
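A hedged sketch of the relaxation plus hyperplane rounding using CVXPY (the random graph, weights, and default solver choice are illustrative assumptions, not part of the lesson):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(9)
n = 8
W = rng.integers(0, 2, size=(n, n)).astype(float)   # random symmetric 0/1 edge weights
W = np.triu(W, 1); W = W + W.T

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [X[i, i] == 1 for i in range(n)]
# (1/2) sum_{i<j} w_ij (1 - X_ij), written over the full symmetric weight matrix
objective = cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X)))
cp.Problem(objective, constraints).solve()

# Random hyperplane rounding: factor X* = V V^T, cut by the sign of V r
w_eig, U = np.linalg.eigh(X.value)
Vfac = U * np.sqrt(np.clip(w_eig, 0, None))
signs = np.sign(Vfac @ rng.standard_normal(n))
cut_value = 0.25 * np.sum(W * (1 - np.outer(signs, signs)))
print(cut_value)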

Metric learning. Learning a Mahalanobis distance $d_A(\mathbf{x},\mathbf{z}) = \big((\mathbf{x}-\mathbf{z})^\top A (\mathbf{x}-\mathbf{z})\big)^{1/2}$ requires $A \succeq 0$. Methods like ITML (Information-Theoretic Metric Learning) and SDML formulate this as an SDP over PSD matrices with constraints that similar pairs are close and dissimilar pairs are far.

Covariance estimation. In high dimensions ($p > n$), the sample covariance $\hat{\Sigma} = \frac{1}{n}X^\top X$ may be singular. Regularized covariance estimation (graphical lasso, precision matrix estimation) adds a sparsity penalty on the precision matrix $\Theta = \Sigma^{-1}$:

$$\min_{\Theta \succ 0} \left[ \text{tr}(\hat{S}\,\Theta) - \log\det\Theta + \lambda\|\Theta\|_1 \right]$$

where $\hat{S}$ is the sample covariance and $\|\cdot\|_1$ is the element-wise $\ell_1$ norm. This is not an SDP directly, but the PSD constraint $\Theta \succ 0$ is the core structural requirement.

Fairness constraints. In algorithmic fairness, PSD constraints arise as necessary conditions for disparate impact compliance. Certified defenses against adversarial examples (semidefinite relaxations of neural network verification) are large-scale SDPs; solvers like $\alpha$-CROWN reformulate them via Lagrangian relaxation.


10. Applications in Machine Learning

10.1 Multivariate Gaussians and Covariance Matrices

The multivariate Gaussian. A random vector $\mathbf{x} \in \mathbb{R}^n$ has the distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ if:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}(\det\Sigma)^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).$$

For this density to be a valid (normalized, integrable) probability distribution, $\Sigma$ must be symmetric positive definite. The three requirements:

  1. Symmetry: $\Sigma = \Sigma^\top$ (covariance is symmetric by definition)
  2. Positive definiteness: $\Sigma \succ 0$ ensures $\det\Sigma \neq 0$ (finite normalizing constant) and that $\Sigma^{-1}$ exists (the exponent is a proper quadratic form)
  3. If $\Sigma \succeq 0$ but singular: the distribution becomes degenerate - supported on an affine subspace, not all of $\mathbb{R}^n$

Parameterizing covariances in neural networks. A neural network that outputs a covariance matrix must parameterize it to be PSD. Standard approaches:

  • Diagonal: $\Sigma = \text{diag}(\exp(\boldsymbol{s}))$ where $\boldsymbol{s}$ is a learned vector. Automatically PD.
  • Cholesky lower triangular: $\Sigma = LL^\top$ where $L$ has positive diagonal (enforced via softplus on the diagonal entries). This is the most expressive parameterization.
  • Low-rank + diagonal: $\Sigma = FF^\top + D$ where $F \in \mathbb{R}^{n \times k}$ ($k \ll n$) and $D$ is diagonal and positive. Woodbury allows efficient inversion.

Sampling from $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$: Given $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$:

$$\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$$

where $L = \text{chol}(\Sigma)$. This is the fundamental sampling algorithm: the Cholesky factor maps isotropic noise to correlated noise.
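A sketch of this recipe on a made-up 3-dimensional covariance; the empirical covariance of the samples should come out close to $\Sigma$:

import numpy as np

rng = np.random.default_rng(10)
n, num_samples = 3, 100_000
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                     # target covariance, PD

L = np.linalg.cholesky(Sigma)
eps = rng.standard_normal((num_samples, n))         # eps ~ N(0, I)
samples = mu + eps @ L.T                            # x = mu + L eps, one sample per row

print(np.allclose(np.cov(samples.T), Sigma, atol=0.1))   # empirical covariance close to Sigma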

10.2 Fisher Information Matrix and Natural Gradient

Definition 10.1 (Fisher Information Matrix). For a statistical model $p(\mathbf{x}|\boldsymbol{\theta})$ with parameter $\boldsymbol{\theta} \in \mathbb{R}^d$, the Fisher information matrix is:

$$F(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x} \sim p(\cdot|\boldsymbol{\theta})}\!\left[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})^\top\right].$$

PSD proof. $F$ is the covariance matrix of the (zero-mean) score function $\mathbf{s} = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta})$: it has the form $\mathbb{E}[\mathbf{s}\mathbf{s}^\top]$, and any expected outer product is PSD. In fact, $F \succ 0$ for regular statistical models (identifiable, full rank).

Natural gradient. Ordinary gradient descent $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}}\mathcal{L}$ uses the Euclidean metric on parameter space. The natural gradient uses the Fisher metric:

$$\tilde{\nabla}\mathcal{L} = F(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L}.$$

The natural gradient is invariant to reparameterization of the model - it measures the steepest descent direction with respect to the KL divergence geometry (information geometry).

K-FAC (Kronecker-Factored Approximate Curvature). Martens & Grosse (2015) approximate the Fisher information matrix for neural networks as a block-diagonal matrix, where each block factorizes as a Kronecker product:

$$F \approx \hat{F} = \text{block-diag}(A_1 \otimes G_1,\ A_2 \otimes G_2,\ \ldots)$$

where $A_l = \mathbb{E}[\mathbf{a}_{l-1}\mathbf{a}_{l-1}^\top]$ (input activation covariance) and $G_l = \mathbb{E}[\boldsymbol{\delta}_l\boldsymbol{\delta}_l^\top]$ (pre-activation gradient covariance) for layer $l$. Both $A_l$ and $G_l$ are PSD (covariance matrices). The Kronecker product $A_l \otimes G_l$ is PSD, and its inverse is $(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}$, computable cheaply via Cholesky of each factor separately.
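A quick numpy check (toy factor sizes assumed) of the Kronecker inverse identity that K-FAC relies on:

import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((3, 3)); A = A @ A.T + 3 * np.eye(3)   # PD factor A_l
G = rng.standard_normal((4, 4)); G = G @ G.T + 4 * np.eye(4)   # PD factor G_l

lhs = np.linalg.inv(np.kron(A, G))                  # invert the 12 x 12 Kronecker product
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))   # invert the small factors separately
print(np.allclose(lhs, rhs))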

10.3 Gaussian Process Regression

Gaussian process regression is a Bayesian non-parametric regression method whose core computational object is a PD kernel matrix.

Model. Place a GP prior $f \sim \mathcal{GP}(0, k)$ where $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a PD kernel. Observe noisy outputs $\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$.

Prediction. The predictive distribution at new points $X_*$ is:

$$f_* \mid X_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$

where:

$$\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2I)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2I)^{-1}K_{n*}$$

and $K_{nn} = k(\mathbf{X}, \mathbf{X})$, $K_{*n} = k(X_*, \mathbf{X})$, $K_{**} = k(X_*, X_*)$.

Computational core: Cholesky. Factor $K_{nn} + \sigma^2 I = LL^\top$. Then:

  • $\boldsymbol{\alpha} = L^\top \backslash (L \backslash \mathbf{y})$ (two triangular solves)
  • $\boldsymbol{\mu}_* = K_{*n}\boldsymbol{\alpha}$ (matrix-vector product)
  • $V = L \backslash K_{n*}$ (triangular solve, $n \times n_*$ system)
  • $\Sigma_* = K_{**} - V^\top V$ (Schur complement form)
  • $\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i\log L_{ii} - \frac{n}{2}\log(2\pi)$

The entire GP regression computation is $O(n^3)$ via Cholesky - the classic bottleneck for large datasets, motivating sparse GP approximations (inducing points, Nystrom, SVGP).
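A compact numpy sketch of this recipe (toy RBF kernel and data assumed; np.linalg.solve is used on the triangular factor for brevity, where scipy.linalg.solve_triangular would exploit the structure):

import numpy as np

def rbf(X1, X2, ell=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(12)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
Xs = np.linspace(-3, 3, 50)[:, None]
sigma = 0.1

K_nn, K_sn, K_ss = rbf(X, X), rbf(Xs, X), rbf(Xs, Xs)
L = np.linalg.cholesky(K_nn + sigma**2 * np.eye(len(y)))     # K_nn + sigma^2 I = L L^T
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # two solves with the triangular factor
mu_star = K_sn @ alpha                                       # predictive mean
Vmat = np.linalg.solve(L, K_sn.T)                            # V = L \ K_ns
Sigma_star = K_ss - Vmat.T @ Vmat                            # Schur-complement form
log_marg = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)
print(mu_star.shape, Sigma_star.shape, log_marg)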

10.4 Hessian and Loss Landscape Sharpness

Second-order characterization of minima. At a critical point $\nabla\mathcal{L}(\boldsymbol{\theta}^*) = 0$ of a smooth loss $\mathcal{L}$:

  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succ 0$: strict local minimum (the loss is locally a strictly convex bowl)
  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*) \succeq 0$: local minimum or saddle (Hessian is PSD)
  • $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*)$ indefinite: saddle point

Sharpness. The sharpness of a minimum is $\lambda_{\max}(\nabla^2\mathcal{L}(\boldsymbol{\theta}^*))$ - the largest eigenvalue of the Hessian. Flat minima (small sharpness) are widely reported to generalize better than sharp minima: a small perturbation $\boldsymbol{\theta}^* + \boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\| \leq \rho$ changes the loss by at most about $\frac{1}{2}\rho^2 \lambda_{\max}(\nabla^2\mathcal{L})$ (by a second-order Taylor expansion and the bound $\boldsymbol{\delta}^\top H\boldsymbol{\delta} \leq \lambda_{\max}\|\boldsymbol{\delta}\|^2$).

SAM (Sharpness-Aware Minimization). Foret et al. (2021) propose:

$$\min_{\boldsymbol{\theta}} \max_{\|\boldsymbol{\delta}\|\leq\rho} \mathcal{L}(\boldsymbol{\theta}+\boldsymbol{\delta}).$$

The inner max finds the worst-case perturbation (solved approximately as $\boldsymbol{\delta}^* = \rho\, \nabla\mathcal{L}/\|\nabla\mathcal{L}\|$). SAM is a first-order approximation to minimizing sharpness and is reported to improve generalization on image and language tasks.

Gauss-Newton and PSD Hessian approximations. The true Hessian $\nabla^2\mathcal{L}$ may be indefinite during training (early stages, overparameterized models). The Gauss-Newton matrix $G = J^\top J$ (where $J$ is the Jacobian of the predictions, for a squared-error loss) is always PSD and is often a better preconditioner. K-FAC uses a Gauss-Newton-style approximation.

10.5 Reparameterization Trick in VAEs

The variational autoencoder (VAE) requires sampling from a distribution $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \Sigma_\phi(\mathbf{x}))$, where $\boldsymbol{\mu}_\phi$ and $\Sigma_\phi$ are outputs of an encoder neural network, and differentiating through the sampling process.

The reparameterization trick. Instead of sampling $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ directly (which blocks gradient flow), write:

$$\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

where $L = \text{chol}(\Sigma)$. Now $\mathbf{z}$ is a deterministic function of $(\boldsymbol{\mu}, L, \boldsymbol{\epsilon})$, and gradients can flow through $\boldsymbol{\mu}$ and $L$.

Diagonal VAE (standard). Most VAE implementations use a diagonal covariance: $\Sigma = \text{diag}(\exp(\mathbf{s}))$, so $L = \text{diag}(\exp(\mathbf{s}/2))$ and $\mathbf{z} = \boldsymbol{\mu} + \exp(\mathbf{s}/2) \odot \boldsymbol{\epsilon}$.

Full covariance VAE. For a full covariance Cholesky parameterization, the encoder outputs:

  1. Mean $\boldsymbol{\mu} \in \mathbb{R}^d$
  2. Lower triangular $L \in \mathbb{R}^{d \times d}$ with positive diagonal (e.g., $L_{ii} = \text{softplus}(\tilde{L}_{ii})$, off-diagonal unconstrained)

Then $\Sigma = LL^\top$ and $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$. The gradient through $L$ back to the encoder parameters flows via:

$$\frac{\partial \mathbf{z}}{\partial \text{vec}(L)} = \frac{\partial(L\boldsymbol{\epsilon})}{\partial \text{vec}(L)} = \boldsymbol{\epsilon}^\top \otimes I_d$$

(the Kronecker product of the sampled noise $\boldsymbol{\epsilon}$ with the identity, in the column-stacking vectorization convention; entrywise, $\partial z_i/\partial L_{jk} = \delta_{ij}\,\epsilon_k$).
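A hedged numpy sketch of the full-covariance reparameterization; the raw "encoder outputs" below are stand-in random arrays, not a real encoder:

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(13)
d = 4
mu = rng.standard_normal(d)                         # stand-in for the encoder mean output
L_raw = rng.standard_normal((d, d))                 # stand-in for the unconstrained encoder output

L = np.tril(L_raw, -1) + np.diag(softplus(np.diag(L_raw)))   # lower triangular, positive diagonal
Sigma = L @ L.T                                     # guaranteed PD since the diagonal of L is > 0

eps = rng.standard_normal(d)                        # eps ~ N(0, I)
z = mu + L @ eps                                    # reparameterized sample

print(np.all(np.linalg.eigvalsh(Sigma) > 0), z.shape)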

For normalizing flows. More expressive VAE variants use normalizing flows for the posterior $q_\phi(\mathbf{z}|\mathbf{x})$. The flow is a sequence of invertible maps; the log-det Jacobian of each map must be computed efficiently. Triangular Jacobians (e.g., masked autoregressive flow / IAF) achieve $O(d)$ log-det computation.


11. Common Mistakes

1. Checking only diagonal entries to test PD.
   Why it's wrong: positive diagonal is necessary but not sufficient. $A = \begin{pmatrix}1 & 2 \\ 2 & 1\end{pmatrix}$ has positive diagonal but is indefinite ($\det = -3$).
   Fix: use Sylvester's criterion (all leading principal minors > 0) or attempt Cholesky.

2. Concluding $A \succ 0$ from $\det A > 0$ alone.
   Why it's wrong: $\det > 0$ is necessary but not sufficient. $\begin{pmatrix}-1 & 0 \\ 0 & -2\end{pmatrix}$ has $\det = 2 > 0$ but is negative definite.
   Fix: check all leading principal minors, not just the full determinant.

3. Assuming $A^\top A \succ 0$ for any $A$.
   Why it's wrong: $A^\top A \succeq 0$ always, but $A^\top A \succ 0$ iff $A$ has full column rank. If $A$ has a null vector ($A\mathbf{v} = \mathbf{0}$), then $\mathbf{v}^\top A^\top A \mathbf{v} = 0$.
   Fix: verify that $\text{rank}(A)$ equals the number of columns.

4. Confusing the Cholesky factor $L$ with the PSD square root $A^{1/2}$.
   Why it's wrong: $L$ is lower triangular; $A^{1/2}$ is symmetric. Both satisfy "squared = $A$" but in different senses ($LL^\top = A$ vs $(A^{1/2})^2 = A$). They are equal only when $A$ is diagonal.
   Fix: use $L$ for solving/sampling; use $A^{1/2}$ for Mahalanobis/whitening.

5. Using np.log(np.linalg.det(A)) for large $A$.
   Why it's wrong: np.linalg.det can overflow/underflow for large $n$ (it is a product of many numbers).
   Fix: use Cholesky: 2 * np.sum(np.log(np.diag(np.linalg.cholesky(A)))).

6. Forgetting that Sylvester's criterion does NOT characterize PSD.
   Why it's wrong: Sylvester requires all leading principal minors > 0, which gives PD, not PSD. For PSD you need all principal minors (not just leading ones) $\geq 0$ - a much larger set.
   Fix: for PSD testing, use eigenvalues or attempt pivoted Cholesky.

7. Inverting the Loewner order incorrectly.
   Why it's wrong: students often think $A \succeq B$ implies $A^{-1} \succeq B^{-1}$. The correct fact is $A \succeq B \succ 0 \Rightarrow B^{-1} \succeq A^{-1}$ (inversion reverses the order).
   Fix: remember: larger matrix -> smaller inverse (as for positive scalars: $a \geq b > 0 \Rightarrow 1/a \leq 1/b$).

8. Adding jitter $\epsilon I$ without checking scale.
   Why it's wrong: if $\|A\|_2 \approx 10^6$ and you add $\epsilon = 10^{-6}$, the relative jitter is $10^{-12}$, effectively zero numerically.
   Fix: set the jitter proportional to the scale: $\epsilon = \delta \cdot \text{tr}(A)/n$ for some small $\delta$.

9. Assuming the kernel matrix stays PD after arbitrary operations.
   Why it's wrong: not every operation on PD kernels preserves positive definiteness. E.g., the difference $k_1(\mathbf{x},\mathbf{z}) - k_2(\mathbf{x},\mathbf{z})$ may not be PD.
   Fix: verify the Gram matrix, or use closure properties: sums, products, and positive scalings/exponentiations of PD kernels are PD.

10. Computing the Schur complement with singular $A$.
    Why it's wrong: the Schur complement $D - CA^{-1}B$ requires $A$ invertible. If $A$ is only PSD (rank-deficient), $A^{-1}$ does not exist.
    Fix: use the Moore-Penrose pseudoinverse: $S = D - CA^\dagger B$, though the PD characterization no longer holds as stated.

11. Treating "positive definite" as a property of non-symmetric matrices.
    Why it's wrong: in this text PD is defined for symmetric matrices. A non-symmetric matrix can have all eigenvalues with positive real part (they may even be complex) and still have $\mathbf{x}^\top A\mathbf{x} < 0$ for some $\mathbf{x}$.
    Fix: always symmetrize: replace $A$ with $(A + A^\top)/2$ before testing PD.

12. Thinking the PD constraint is automatically satisfied by a NN output.
    Why it's wrong: a neural network outputting $n(n+1)/2$ numbers does not automatically produce a valid lower triangular $L$ with positive diagonal.
    Fix: enforce it: use softplus on the diagonal entries and leave the off-diagonal unconstrained. Then $\Sigma = LL^\top$ is guaranteed PD.
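A small helper in the spirit of the "attempt Cholesky" fixes above (illustrative; the symmetry tolerance is an arbitrary choice):

import numpy as np

def is_symmetric_pd(A, tol=1e-10):
    # Returns True if A is (numerically) symmetric and positive definite.
    if not np.allclose(A, A.T, atol=tol):
        return False
    try:
        np.linalg.cholesky(A)                       # succeeds iff A is PD (up to rounding)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_symmetric_pd(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
print(is_symmetric_pd(np.array([[1.0, 2.0], [2.0, 1.0]])))   # False: indefinite (mistake 1 above)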
