Positive Definite Matrices, Part 4: Appendix G (Proofs and Extensions) to Appendix I (Worked Examples: Complete Solutions)

Appendix G: Proofs and Extensions

G.1 Complete Proof of the Cholesky Existence Theorem

We present the full inductive proof more carefully, making each step explicit.

Theorem (Full Cholesky Existence). Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A \succ 0$ if and only if $A = LL^\top$ for a unique lower triangular $L$ with positive diagonal.

Proof by strong induction on $n$.

Base case $n = 1$: $A = (a)$ with $a > 0$ (since $A \succ 0$ iff the quadratic form $ax^2 > 0$ for $x \neq 0$, iff $a > 0$). Take $L = (\sqrt{a})$. This is unique, since $\ell > 0$ and $\ell^2 = a$ force $\ell = \sqrt{a}$.

Inductive step: Assume the result holds for all $(n-1) \times (n-1)$ PD matrices. Let $A \in \mathbb{R}^{n \times n}$ be PD. Write:

$$A = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A} \end{pmatrix}$$

where $a_{11} > 0$ (Proposition 2.3, item 1), $\mathbf{a} \in \mathbb{R}^{n-1}$, and $\hat{A} \in \mathbb{R}^{(n-1)\times(n-1)}$.

Define $\ell_{11} = \sqrt{a_{11}} > 0$ and $\boldsymbol{\ell} = \mathbf{a}/\ell_{11} \in \mathbb{R}^{n-1}$.

Compute the Schur complement: $\tilde{A} = \hat{A} - \boldsymbol{\ell}\boldsymbol{\ell}^\top = \hat{A} - \mathbf{a}\,a_{11}^{-1}\mathbf{a}^\top$.

Claim: $\tilde{A} \succ 0$.

Proof of claim: For any $\mathbf{y} \in \mathbb{R}^{n-1}$ with $\mathbf{y} \neq \mathbf{0}$, set $\mathbf{x} = \begin{pmatrix} -a_{11}^{-1}\mathbf{a}^\top\mathbf{y} \\ \mathbf{y} \end{pmatrix} \in \mathbb{R}^n$. Note $\mathbf{x} \neq \mathbf{0}$ (its second block is $\mathbf{y} \neq \mathbf{0}$). Then:

$$\mathbf{x}^\top A \mathbf{x} = a_{11}(a_{11}^{-1}\mathbf{a}^\top\mathbf{y})^2 - 2\,\mathbf{a}^\top\mathbf{y} \cdot a_{11}^{-1}\mathbf{a}^\top\mathbf{y} + \mathbf{y}^\top\hat{A}\mathbf{y} = \mathbf{y}^\top(\hat{A} - \mathbf{a}\,a_{11}^{-1}\mathbf{a}^\top)\mathbf{y} = \mathbf{y}^\top\tilde{A}\mathbf{y}.$$

Since $A \succ 0$, $\mathbf{x}^\top A\mathbf{x} > 0$, so $\mathbf{y}^\top\tilde{A}\mathbf{y} > 0$. Since $\mathbf{y} \neq \mathbf{0}$ was arbitrary, $\tilde{A} \succ 0$. $\square$ (claim)

By the inductive hypothesis, $\tilde{A} = L_{22}L_{22}^\top$ for a unique lower triangular $L_{22} \in \mathbb{R}^{(n-1)\times(n-1)}$ with positive diagonal.

Set $L = \begin{pmatrix} \ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22} \end{pmatrix}$. Then $L$ is lower triangular with positive diagonal $(\ell_{11}, (L_{22})_{11}, \ldots, (L_{22})_{n-1,n-1})$, and:

$$LL^\top = \begin{pmatrix} \ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22} \end{pmatrix}\begin{pmatrix} \ell_{11} & \boldsymbol{\ell}^\top \\ \mathbf{0} & L_{22}^\top \end{pmatrix} = \begin{pmatrix} \ell_{11}^2 & \ell_{11}\boldsymbol{\ell}^\top \\ \ell_{11}\boldsymbol{\ell} & \boldsymbol{\ell}\boldsymbol{\ell}^\top + L_{22}L_{22}^\top \end{pmatrix} = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A} \end{pmatrix} = A.$$

Uniqueness. Suppose $A = L_1L_1^\top = L_2L_2^\top$ with $L_1, L_2$ lower triangular with positive diagonals. Then $M := L_2^{-1}L_1$ (a product of lower triangular matrices, hence lower triangular) satisfies $MM^\top = L_2^{-1}L_1L_1^\top L_2^{-\top} = L_2^{-1}L_2L_2^\top L_2^{-\top} = I$. An orthogonal lower triangular matrix must be diagonal. Since $L_1, L_2$ have positive diagonals, $M$ has positive diagonal. $MM^\top = I$ for diagonal $M$ gives $M = I$, so $L_1 = L_2$. $\square$
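The inductive step is itself an algorithm: peel off the first pivot, factor the Schur complement, assemble $L$. Below is a minimal NumPy sketch of that recursion (illustration only; in practice use np.linalg.cholesky). The test matrix is the one factored by hand in Appendix I.1.

```python
import numpy as np

def cholesky_recursive(A):
    """Cholesky factor via the recursion in the proof of G.1."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if A[0, 0] <= 0:
        raise ValueError("matrix is not positive definite")
    l11 = np.sqrt(A[0, 0])                      # ell_11 = sqrt(a_11)
    if n == 1:
        return np.array([[l11]])
    a = A[1:, 0]                                # first column below the pivot
    ell = a / l11                               # ell = a / ell_11
    A_tilde = A[1:, 1:] - np.outer(ell, ell)    # Schur complement
    L22 = cholesky_recursive(A_tilde)           # inductive hypothesis
    L = np.zeros((n, n))
    L[0, 0] = l11
    L[1:, 0] = ell
    L[1:, 1:] = L22
    return L

A = np.array([[9., 3., -3.], [3., 17., -1.], [-3., -1., 12.]])
L = cholesky_recursive(A)
assert np.allclose(L @ L.T, A)                  # reconstruction check
assert np.allclose(L, np.linalg.cholesky(A))    # matches the library factor
```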

G.2 Proof That the Fisher Information Is PSD

Theorem. For a regular statistical model $p(\mathbf{x} \mid \boldsymbol{\theta})$, the Fisher information matrix $F(\boldsymbol{\theta}) = \mathbb{E}[\mathbf{s}(\mathbf{x};\boldsymbol{\theta})\,\mathbf{s}(\mathbf{x};\boldsymbol{\theta})^\top]$, where $\mathbf{s} = \nabla_{\boldsymbol{\theta}}\log p$ is the score function, is PSD.

Proof. For any $\mathbf{v} \in \mathbb{R}^d$:

$$\mathbf{v}^\top F\mathbf{v} = \mathbb{E}[\mathbf{v}^\top\mathbf{s}\,\mathbf{s}^\top\mathbf{v}] = \mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] \geq 0.$$

(The expectation of a squared real random variable is non-negative.) $\square$

When is $F \succ 0$? $F$ is PD iff $\mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] > 0$ for all $\mathbf{v} \neq \mathbf{0}$. This holds iff the directional score $\mathbf{v}^\top\nabla_{\boldsymbol{\theta}}\log p(\mathbf{x} \mid \boldsymbol{\theta})$ is nonzero with positive probability for every $\mathbf{v} \neq \mathbf{0}$ - the model is "identifiable" in every direction. A singular $F$ indicates structural non-identifiability: two distinct parameter values produce identical distributions.

G.3 Derivatives and the Matrix-Valued Chain Rule

For completeness, we derive the key matrix calculus formulas used in 7.3.

Jacobi's formula. For differentiable, invertible $A(t)$:

$$\frac{d}{dt}\det A(t) = \det A(t) \cdot \operatorname{tr}\!\big(A(t)^{-1}\dot{A}(t)\big).$$

Proof: By the cofactor expansion along row $i$, $\det A = \sum_j A_{ij}(\operatorname{adj}A)_{ji}$, and $(\operatorname{adj}A)_{ji}$ does not depend on $A_{ij}$, so $\partial(\det A)/\partial A_{ij} = (\operatorname{adj}A)_{ji} = (\det A)(A^{-1})_{ji}$ (Cramer's rule). By the chain rule: $d(\det A)/dt = \sum_{ij}(\operatorname{adj}A)_{ji}\dot{A}_{ij} = \det(A)\sum_{ij}(A^{-1})_{ji}\dot{A}_{ij} = \det(A)\,\operatorname{tr}(A^{-1}\dot{A})$.

Log-det gradient. $d\log\det A = d(\det A)/\det A = \operatorname{tr}(A^{-1}dA)$. Since $\operatorname{tr}(A^{-1}dA) = \langle A^{-\top}, dA\rangle_F$, the gradient of $\log\det$ at $A$ is $A^{-\top} = A^{-1}$ (for symmetric $A$).

Trace-inverse gradient. For $f(A) = \operatorname{tr}(A^{-1}B)$: using $d(A^{-1}) = -A^{-1}\,dA\,A^{-1}$, we get $df = \operatorname{tr}(-A^{-1}\,dA\,A^{-1}B) = -\operatorname{tr}(A^{-1}BA^{-1}\,dA) = -\langle (A^{-1}BA^{-1})^\top, dA\rangle_F$. So $\nabla_A\operatorname{tr}(A^{-1}B) = -(A^{-1}BA^{-1})^\top = -A^{-\top}B^\top A^{-\top}$.
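These identities are easy to sanity-check numerically. A minimal sketch (the random test matrices are invented for the check), comparing each closed-form directional derivative against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric PD matrix A and a symmetric direction dA.
X = rng.normal(size=(4, 4))
A = X @ X.T + 4 * np.eye(4)
dA = rng.normal(size=(4, 4))
dA = (dA + dA.T) / 2
eps = 1e-6

# Jacobi / log-det: d logdet(A)[dA] = tr(A^{-1} dA)
lhs = (np.linalg.slogdet(A + eps * dA)[1]
       - np.linalg.slogdet(A - eps * dA)[1]) / (2 * eps)
rhs = np.trace(np.linalg.solve(A, dA))
print(lhs, rhs)                     # agree to ~1e-9

# Trace-inverse: d tr(A^{-1} B)[dA] = -tr(A^{-1} B A^{-1} dA)
B = rng.normal(size=(4, 4))
f = lambda M: np.trace(np.linalg.solve(M, B))   # tr(M^{-1} B)
lhs = (f(A + eps * dA) - f(A - eps * dA)) / (2 * eps)
Ainv = np.linalg.inv(A)
rhs = -np.trace(Ainv @ B @ Ainv @ dA)
print(lhs, rhs)                     # agree to ~1e-9
```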

For AI - GP hyperparameter gradients: The gradient of the GP log-marginal-likelihood with respect to a kernel hyperparameter $\theta$ is:

$$\frac{\partial\log p(\mathbf{y} \mid \theta)}{\partial\theta} = \frac{1}{2}\operatorname{tr}\!\left[\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2 I)^{-1}\right)\frac{\partial K}{\partial\theta}\right]$$

where $\boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}$. This uses $\nabla_K\log\det K = K^{-1}$ and $\nabla_K\operatorname{tr}(K^{-1}S) = -(K^{-1}SK^{-1})^\top$. It is computed efficiently via Cholesky: factor $K+\sigma^2I = LL^\top$, obtain $\boldsymbol{\alpha}$ by two triangular solves, and, when $\partial K/\partial\theta$ is available in factored PSD form $RR^\top$, form $V = L^{-1}R$ (one triangular solve), so that $\operatorname{tr}\big((K+\sigma^2I)^{-1}\,\partial K/\partial\theta\big) = \operatorname{tr}(V^\top V) = \|V\|_F^2$.
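A minimal sketch of this computation (the RBF kernel, its lengthscale derivative, and the synthetic data are assumptions made for the example); the analytic gradient is checked against a finite difference:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X, lengthscale):
    """RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 lengthscale^2))."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-0.5 * D2 / lengthscale**2), D2

def log_marginal_and_grad(X, y, lengthscale, sigma2):
    n = len(y)
    K, D2 = rbf_kernel(X, lengthscale)
    c, low = cho_factor(K + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve((c, low), y)            # (K + sigma^2 I)^{-1} y
    logdet = 2 * np.log(np.diag(c)).sum()     # log-det via Cholesky (7.1)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

    dK = K * D2 / lengthscale**3              # dK / d(lengthscale) for RBF
    Ks_inv = cho_solve((c, low), np.eye(n))   # (K + sigma^2 I)^{-1}
    # tr[(alpha alpha^T - Ks_inv) dK] as an elementwise sum (both symmetric)
    grad = 0.5 * ((np.outer(alpha, alpha) - Ks_inv) * dK).sum()
    return lml, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

ell, s2, eps = 1.2, 0.01, 1e-6
lml, grad = log_marginal_and_grad(X, y, ell, s2)
fd = (log_marginal_and_grad(X, y, ell + eps, s2)[0]
      - log_marginal_and_grad(X, y, ell - eps, s2)[0]) / (2 * eps)
print(grad, fd)   # analytic gradient matches the finite difference
```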


Appendix H: Summary and Further Reading

H.1 Core Theorems Summary

| Theorem | Statement | Reference |
|---|---|---|
| Spectral characterization | $A \succ 0 \Leftrightarrow$ all eigenvalues $> 0$ | 3.1 |
| Sylvester's criterion | $A \succ 0 \Leftrightarrow$ all leading principal minors $> 0$ | 3.2 |
| Cholesky existence | $A \succ 0 \Leftrightarrow \exists!$ lower triangular $L$ (pos. diag.) with $A = LL^\top$ | 4.1 |
| LDL$^\top$ | Symmetric $A \to A = LDL^\top$; $A \succ 0 \Leftrightarrow$ all $d_i > 0$ | 4.4 |
| PSD square root | $A \succeq 0 \Rightarrow \exists!$ symmetric PSD $A^{1/2}$ with $(A^{1/2})^2 = A$ | 5.1 |
| Schur PD criterion | $M = \begin{pmatrix}A & B\\ B^\top & D\end{pmatrix} \succ 0 \Leftrightarrow A \succ 0$ and $D - B^\top A^{-1}B \succ 0$ | 6.2 |
| Woodbury identity | $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$ | 6.3 |
| Log-det concavity | $f(A) = \log\det A$ is strictly concave on $\mathbb{S}_{++}^n$ | 7.2 |
| Log-det gradient | $\nabla_A\log\det A = A^{-1}$ | 7.3 |
| Gram matrix PSD | $G = XX^\top \succeq 0$ always; PD iff rows of $X$ are linearly independent | 8.1 |
| Schur product | Hadamard product of PSD matrices is PSD | Appendix A |
| Hadamard inequality | $\det A \leq \prod_i A_{ii}$ for $A \succ 0$ | Appendix D |
| Log-det via Cholesky | $\log\det A = 2\sum_i \log L_{ii}$ | 7.1 |

H.2 Further Reading

  1. Textbooks:

    • Golub & Van Loan, Matrix Computations (4th ed., 2013) - Chapters 4, 7: definitive reference on Cholesky, LDL^T algorithms
    • Higham, Accuracy and Stability of Numerical Algorithms (2002) - backward stability proofs for Cholesky, modified Cholesky
    • Boyd & Vandenberghe, Convex Optimization (2004) - Chapter 4: SDP, PSD cone; freely available online
    • Bernstein, Matrix Mathematics (2nd ed., 2009) - comprehensive collection of PD matrix identities
  2. Research papers:

    • Cholesky (1924, posthumous publication via Benoit): original algorithm for geodetic calculations
    • Gill, Murray & Wright (1981): modified Cholesky for optimization
    • Vandenberghe & Boyd (1996): semidefinite programming tutorial
    • Martens & Grosse (2015): K-FAC - Kronecker-factored natural gradient
    • Gardner et al. (2018): GPyTorch - scalable GP with stochastic log-det estimation
    • Foret et al. (2021): SAM - sharpness-aware minimization
  3. Online resources:

    • Petersen & Pedersen, The Matrix Cookbook - practical matrix calculus identities
    • NumPy/SciPy docs: np.linalg.cholesky, scipy.linalg.cholesky, scipy.linalg.ldl
    • GPyTorch documentation: scalable GP inference with Cholesky
    • CVXPY documentation: SDP examples and semidefinite programming
  4. Next section:

    • 08: Matrix Decompositions


End of 07: Positive Definite Matrices.


Appendix I: Worked Examples - Complete Solutions

I.1 Full 3×3 Cholesky Example with Verification

Compute $L = \operatorname{chol}(A)$ for

$$A = \begin{pmatrix} 9 & 3 & -3 \\ 3 & 17 & -1 \\ -3 & -1 & 12 \end{pmatrix}.$$

Step 1: $L_{11} = \sqrt{9} = 3$.

Step 2: $L_{21} = A_{21}/L_{11} = 3/3 = 1$; $L_{31} = A_{31}/L_{11} = -3/3 = -1$.

Step 3: $L_{22} = \sqrt{A_{22} - L_{21}^2} = \sqrt{17 - 1} = \sqrt{16} = 4$.

Step 4: $L_{32} = (A_{32} - L_{31}L_{21})/L_{22} = (-1 - (-1)(1))/4 = 0/4 = 0$.

Step 5: $L_{33} = \sqrt{A_{33} - L_{31}^2 - L_{32}^2} = \sqrt{12 - 1 - 0} = \sqrt{11}$.

$$L = \begin{pmatrix} 3 & 0 & 0 \\ 1 & 4 & 0 \\ -1 & 0 & \sqrt{11} \end{pmatrix}.$$

Verification:

$$LL^\top = \begin{pmatrix} 3 & 0 & 0 \\ 1 & 4 & 0 \\ -1 & 0 & \sqrt{11} \end{pmatrix}\begin{pmatrix} 3 & 1 & -1 \\ 0 & 4 & 0 \\ 0 & 0 & \sqrt{11} \end{pmatrix} = \begin{pmatrix} 9 & 3 & -3 \\ 3 & 17 & -1 \\ -3 & -1 & 12 \end{pmatrix} = A. \quad\checkmark$$

Also: $\log\det A = 2(\log 3 + \log 4 + \log\sqrt{11}) = 2(1.099 + 1.386 + 1.199) = 2(3.684) = 7.368$.
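The same computation as a quick NumPy check (a minimal sketch; np.linalg.cholesky returns the lower triangular factor):

```python
import numpy as np

A = np.array([[9., 3., -3.],
              [3., 17., -1.],
              [-3., -1., 12.]])

L = np.linalg.cholesky(A)           # lower triangular Cholesky factor
print(L)                            # [[3, 0, 0], [1, 4, 0], [-1, 0, sqrt(11)]]

assert np.allclose(L @ L.T, A)      # reconstruction check
logdet = 2 * np.log(np.diag(L)).sum()
print(logdet)                       # ~7.368, matching the hand computation
```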

I.2 Schur Complement and Conditional Gaussian

Let $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}$ be the covariance of $(X_1, X_2)^\top \sim \mathcal{N}(\mathbf{0}, \Sigma)$.

Schur complement of $\Sigma_{11}$:

$$S = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 5 - 2 \cdot \tfrac{1}{4} \cdot 2 = 5 - 1 = 4.$$

Conditional distribution $X_1 \mid X_2 = a$:

$$\mu_{1|2} = 0 + \Sigma_{12}\Sigma_{22}^{-1}(a - 0) = \tfrac{2}{5}a, \qquad \sigma_{1|2}^2 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = 4 - 2 \cdot \tfrac{1}{5} \cdot 2 = 4 - \tfrac{4}{5} = \tfrac{16}{5}.$$

Note: the unconditional variance of $X_1$ is $\sigma_1^2 = 4$. The conditional variance $16/5 = 3.2 < 4$: observing $X_2$ reduces uncertainty about $X_1$ (as guaranteed by the Loewner order, 2.4). The correlation $\rho = 2/\sqrt{4 \cdot 5} = 2/\sqrt{20} = 1/\sqrt{5} \approx 0.447$ explains why the uncertainty reduction is only moderate.
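These conditional moments can be checked by simulation. A minimal sketch (the slab width and sample count are arbitrary choices for the example), estimating the mean and variance of $X_1$ given $X_2 \approx a$ by keeping draws whose $X_2$ lands near $a$:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4., 2.], [2., 5.]])

# Sample (X1, X2) and condition on X2 in a thin slab around a.
X = rng.multivariate_normal([0., 0.], Sigma, size=2_000_000)
a = 1.0
x1 = X[np.abs(X[:, 1] - a) < 0.02, 0]

print(x1.mean())   # ~ (2/5) * a = 0.4
print(x1.var())    # ~ 16/5 = 3.2
```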
