Positive Definite Matrices, Part 4: Appendix G (Proofs and Extensions) to Appendix I (Worked Examples: Complete Solutions)

Appendix G: Proofs and Extensions

G.1 Complete Proof of the Cholesky Existence Theorem

We present the full inductive proof more carefully, making each step explicit.

Theorem (Full Cholesky Existence). Let $A \in \mathbb{R}^{n \times n}$ be symmetric. Then $A \succ 0$ if and only if $A = LL^\top$ for a unique lower triangular $L$ with positive diagonal.

Proof by strong induction on $n$.

Base case $n = 1$: $A = (a)$ with $a > 0$ (since $A \succ 0$ iff the quadratic form $ax^2 > 0$ for $x \neq 0$, iff $a > 0$). Take $L = (\sqrt{a})$. This is unique, since $\ell > 0$ and $\ell^2 = a$ force $\ell = \sqrt{a}$.

Inductive step: Assume the result holds for all $(n-1) \times (n-1)$ PD matrices. Let $A \in \mathbb{R}^{n \times n}$ be PD. Write:

$$A = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A} \end{pmatrix}$$

where $a_{11} > 0$ (Proposition 2.3, item 1), $\mathbf{a} \in \mathbb{R}^{n-1}$, and $\hat{A} \in \mathbb{R}^{(n-1)\times(n-1)}$.

Define $\ell_{11} = \sqrt{a_{11}} > 0$ and $\boldsymbol{\ell} = \mathbf{a}/\ell_{11} \in \mathbb{R}^{n-1}$.

Compute the Schur complement: $\tilde{A} = \hat{A} - \boldsymbol{\ell}\boldsymbol{\ell}^\top = \hat{A} - \mathbf{a}\,a_{11}^{-1}\mathbf{a}^\top$.

Claim: $\tilde{A} \succ 0$.

Proof of claim: For any $\mathbf{y} \in \mathbb{R}^{n-1}$ with $\mathbf{y} \neq \mathbf{0}$, set $\mathbf{x} = \begin{pmatrix} -a_{11}^{-1}\mathbf{a}^\top\mathbf{y} \\ \mathbf{y} \end{pmatrix} \in \mathbb{R}^n$. Note $\mathbf{x} \neq \mathbf{0}$ (its second block is $\mathbf{y} \neq \mathbf{0}$). Then:

$$\mathbf{x}^\top A \mathbf{x} = a_{11}(a_{11}^{-1}\mathbf{a}^\top\mathbf{y})^2 - 2\,\mathbf{a}^\top\mathbf{y} \cdot a_{11}^{-1}\mathbf{a}^\top\mathbf{y} + \mathbf{y}^\top\hat{A}\mathbf{y} = \mathbf{y}^\top(\hat{A} - \mathbf{a}\,a_{11}^{-1}\mathbf{a}^\top)\mathbf{y} = \mathbf{y}^\top\tilde{A}\mathbf{y}.$$

Since $A \succ 0$, $\mathbf{x}^\top A\mathbf{x} > 0$, so $\mathbf{y}^\top\tilde{A}\mathbf{y} > 0$. Since $\mathbf{y} \neq \mathbf{0}$ was arbitrary, $\tilde{A} \succ 0$. $\square$ (claim)

By the inductive hypothesis, $\tilde{A} = L_{22}L_{22}^\top$ for a unique lower triangular $L_{22} \in \mathbb{R}^{(n-1)\times(n-1)}$ with positive diagonal.

Set $L = \begin{pmatrix} \ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22} \end{pmatrix}$. Then $L$ is lower triangular with positive diagonal $(\ell_{11}, (L_{22})_{11}, \ldots, (L_{22})_{n-1,n-1})$, and:

$$LL^\top = \begin{pmatrix} \ell_{11} & \mathbf{0}^\top \\ \boldsymbol{\ell} & L_{22} \end{pmatrix}\begin{pmatrix} \ell_{11} & \boldsymbol{\ell}^\top \\ \mathbf{0} & L_{22}^\top \end{pmatrix} = \begin{pmatrix} \ell_{11}^2 & \ell_{11}\boldsymbol{\ell}^\top \\ \ell_{11}\boldsymbol{\ell} & \boldsymbol{\ell}\boldsymbol{\ell}^\top + L_{22}L_{22}^\top \end{pmatrix} = \begin{pmatrix} a_{11} & \mathbf{a}^\top \\ \mathbf{a} & \hat{A} \end{pmatrix} = A.$$

Uniqueness. Suppose $A = L_1L_1^\top = L_2L_2^\top$ with $L_1, L_2$ lower triangular with positive diagonals. Then $M := L_2^{-1}L_1$ (a product of lower triangular matrices, hence lower triangular) satisfies $MM^\top = L_2^{-1}L_1L_1^\top L_2^{-\top} = L_2^{-1}L_2L_2^\top L_2^{-\top} = I$. An orthogonal lower triangular matrix must be diagonal. Since $L_1, L_2$ have positive diagonals, $M$ has positive diagonal. $MM^\top = I$ for diagonal $M$ gives $M = I$, so $L_1 = L_2$. $\square$
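The inductive step is itself an algorithm: peel off the first pivot, factor the Schur complement, assemble $L$. Below is a minimal NumPy sketch of that recursion (illustration only; in practice use np.linalg.cholesky). The test matrix is the one factored by hand in Appendix I.1.

```python
import numpy as np

def cholesky_recursive(A):
    """Cholesky factor via the recursion in the proof of G.1."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if A[0, 0] <= 0:
        raise ValueError("matrix is not positive definite")
    l11 = np.sqrt(A[0, 0])                      # ell_11 = sqrt(a_11)
    if n == 1:
        return np.array([[l11]])
    a = A[1:, 0]                                # first column below the pivot
    ell = a / l11                               # ell = a / ell_11
    A_tilde = A[1:, 1:] - np.outer(ell, ell)    # Schur complement
    L22 = cholesky_recursive(A_tilde)           # inductive hypothesis
    L = np.zeros((n, n))
    L[0, 0] = l11
    L[1:, 0] = ell
    L[1:, 1:] = L22
    return L

A = np.array([[9., 3., -3.], [3., 17., -1.], [-3., -1., 12.]])
L = cholesky_recursive(A)
assert np.allclose(L @ L.T, A)                  # reconstruction check
assert np.allclose(L, np.linalg.cholesky(A))    # matches the library factor
```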

G.2 Proof That the Fisher Information Is PSD

Theorem. For a regular statistical model $p(\mathbf{x} \mid \boldsymbol{\theta})$, the Fisher information matrix $F(\boldsymbol{\theta}) = \mathbb{E}[\mathbf{s}(\mathbf{x};\boldsymbol{\theta})\,\mathbf{s}(\mathbf{x};\boldsymbol{\theta})^\top]$, where $\mathbf{s} = \nabla_{\boldsymbol{\theta}}\log p$ is the score function, is PSD.

Proof. For any $\mathbf{v} \in \mathbb{R}^d$:

$$\mathbf{v}^\top F\mathbf{v} = \mathbb{E}[\mathbf{v}^\top\mathbf{s}\,\mathbf{s}^\top\mathbf{v}] = \mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] \geq 0.$$

(The expectation of a squared real random variable is non-negative.) $\square$

When is $F \succ 0$? $F$ is PD iff $\mathbb{E}[(\mathbf{v}^\top\mathbf{s})^2] > 0$ for all $\mathbf{v} \neq \mathbf{0}$. This holds iff the directional score $\mathbf{v}^\top\nabla_{\boldsymbol{\theta}}\log p(\mathbf{x} \mid \boldsymbol{\theta})$ is nonzero with positive probability for every $\mathbf{v} \neq \mathbf{0}$ - the model is "identifiable" in every direction. A singular $F$ indicates structural non-identifiability: two distinct parameter values produce identical distributions.

G.3 Derivatives and the Matrix-Valued Chain Rule

For completeness, we derive the key matrix calculus formulas used in 7.3.

Jacobi's formula. For differentiable, invertible $A(t)$:

$$\frac{d}{dt}\det A(t) = \det A(t) \cdot \operatorname{tr}\!\big(A(t)^{-1}\dot{A}(t)\big).$$

Proof: By the cofactor expansion along row $i$, $\det A = \sum_j A_{ij}(\operatorname{adj}A)_{ji}$, and $(\operatorname{adj}A)_{ji}$ does not depend on $A_{ij}$, so $\partial(\det A)/\partial A_{ij} = (\operatorname{adj}A)_{ji} = (\det A)(A^{-1})_{ji}$ (Cramer's rule). By the chain rule: $d(\det A)/dt = \sum_{ij}(\operatorname{adj}A)_{ji}\dot{A}_{ij} = \det(A)\sum_{ij}(A^{-1})_{ji}\dot{A}_{ij} = \det(A)\,\operatorname{tr}(A^{-1}\dot{A})$.

Log-det gradient. $d\log\det A = d(\det A)/\det A = \operatorname{tr}(A^{-1}dA)$. Since $\operatorname{tr}(A^{-1}dA) = \langle A^{-\top}, dA\rangle_F$, the gradient of $\log\det$ at $A$ is $A^{-\top} = A^{-1}$ (for symmetric $A$).

Trace-inverse gradient. For $f(A) = \operatorname{tr}(A^{-1}B)$: using $d(A^{-1}) = -A^{-1}\,dA\,A^{-1}$, we get $df = \operatorname{tr}(-A^{-1}\,dA\,A^{-1}B) = -\operatorname{tr}(A^{-1}BA^{-1}\,dA) = -\langle (A^{-1}BA^{-1})^\top, dA\rangle_F$. So $\nabla_A\operatorname{tr}(A^{-1}B) = -(A^{-1}BA^{-1})^\top = -A^{-\top}B^\top A^{-\top}$.
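These identities are easy to sanity-check numerically. A minimal sketch (the random test matrices are invented for the check), comparing each closed-form directional derivative against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric PD matrix A and a symmetric direction dA.
X = rng.normal(size=(4, 4))
A = X @ X.T + 4 * np.eye(4)
dA = rng.normal(size=(4, 4))
dA = (dA + dA.T) / 2
eps = 1e-6

# Jacobi / log-det: d logdet(A)[dA] = tr(A^{-1} dA)
lhs = (np.linalg.slogdet(A + eps * dA)[1]
       - np.linalg.slogdet(A - eps * dA)[1]) / (2 * eps)
rhs = np.trace(np.linalg.solve(A, dA))
print(lhs, rhs)                     # agree to ~1e-9

# Trace-inverse: d tr(A^{-1} B)[dA] = -tr(A^{-1} B A^{-1} dA)
B = rng.normal(size=(4, 4))
f = lambda M: np.trace(np.linalg.solve(M, B))   # tr(M^{-1} B)
lhs = (f(A + eps * dA) - f(A - eps * dA)) / (2 * eps)
Ainv = np.linalg.inv(A)
rhs = -np.trace(Ainv @ B @ Ainv @ dA)
print(lhs, rhs)                     # agree to ~1e-9
```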

For AI - GP hyperparameter gradients: The gradient of the GP log-marginal-likelihood with respect to a kernel hyperparameter $\theta$ is:

$$\frac{\partial\log p(\mathbf{y} \mid \theta)}{\partial\theta} = \frac{1}{2}\operatorname{tr}\!\left[\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - (K+\sigma^2 I)^{-1}\right)\frac{\partial K}{\partial\theta}\right]$$

where $\boldsymbol{\alpha} = (K+\sigma^2I)^{-1}\mathbf{y}$. This uses $\nabla_K\log\det K = K^{-1}$ and $\nabla_K\operatorname{tr}(K^{-1}S) = -(K^{-1}SK^{-1})^\top$. It is computed efficiently via Cholesky: factor $K+\sigma^2I = LL^\top$, obtain $\boldsymbol{\alpha}$ by two triangular solves, and, when $\partial K/\partial\theta$ is available in factored PSD form $RR^\top$, form $V = L^{-1}R$ (one triangular solve), so that $\operatorname{tr}\big((K+\sigma^2I)^{-1}\,\partial K/\partial\theta\big) = \operatorname{tr}(V^\top V) = \|V\|_F^2$.
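A minimal sketch of this computation (the RBF kernel, its lengthscale derivative, and the synthetic data are assumptions made for the example); the analytic gradient is checked against a finite difference:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X, lengthscale):
    """RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 lengthscale^2))."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-0.5 * D2 / lengthscale**2), D2

def log_marginal_and_grad(X, y, lengthscale, sigma2):
    n = len(y)
    K, D2 = rbf_kernel(X, lengthscale)
    c, low = cho_factor(K + sigma2 * np.eye(n), lower=True)
    alpha = cho_solve((c, low), y)            # (K + sigma^2 I)^{-1} y
    logdet = 2 * np.log(np.diag(c)).sum()     # log-det via Cholesky (7.1)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

    dK = K * D2 / lengthscale**3              # dK / d(lengthscale) for RBF
    Ks_inv = cho_solve((c, low), np.eye(n))   # (K + sigma^2 I)^{-1}
    # tr[(alpha alpha^T - Ks_inv) dK] as an elementwise sum (both symmetric)
    grad = 0.5 * ((np.outer(alpha, alpha) - Ks_inv) * dK).sum()
    return lml, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

ell, s2, eps = 1.2, 0.01, 1e-6
lml, grad = log_marginal_and_grad(X, y, ell, s2)
fd = (log_marginal_and_grad(X, y, ell + eps, s2)[0]
      - log_marginal_and_grad(X, y, ell - eps, s2)[0]) / (2 * eps)
print(grad, fd)   # analytic gradient matches the finite difference
```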


Appendix H: Summary and Further Reading

H.1 Core Theorems Summary

| Theorem | Statement | Reference |
|---|---|---|
| Spectral characterization | $A \succ 0 \Leftrightarrow$ all eigenvalues $> 0$ | 3.1 |
| Sylvester's criterion | $A \succ 0 \Leftrightarrow$ all leading principal minors $> 0$ | 3.2 |
| Cholesky existence | $A \succ 0 \Leftrightarrow \exists!$ lower triangular $L$ (pos. diag.) with $A = LL^\top$ | 4.1 |
| LDL$^\top$ | Symmetric $A \to A = LDL^\top$; $A \succ 0 \Leftrightarrow$ all $d_i > 0$ | 4.4 |
| PSD square root | $A \succeq 0 \Rightarrow \exists!$ symmetric PSD $A^{1/2}$ with $(A^{1/2})^2 = A$ | 5.1 |
| Schur PD criterion | $M = \begin{pmatrix}A & B\\ B^\top & D\end{pmatrix} \succ 0 \Leftrightarrow A \succ 0$ and $D - B^\top A^{-1}B \succ 0$ | 6.2 |
| Woodbury identity | $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$ | 6.3 |
| Log-det concavity | $f(A) = \log\det A$ is strictly concave on $\mathbb{S}_{++}^n$ | 7.2 |
| Log-det gradient | $\nabla_A\log\det A = A^{-1}$ | 7.3 |
| Gram matrix PSD | $G = XX^\top \succeq 0$ always; PD iff rows of $X$ are linearly independent | 8.1 |
| Schur product | Hadamard product of PSD matrices is PSD | Appendix A |
| Hadamard inequality | $\det A \leq \prod_i A_{ii}$ for $A \succ 0$ | Appendix D |
| Log-det via Cholesky | $\log\det A = 2\sum_i \log L_{ii}$ | 7.1 |

H.2 Further Reading

  1. Textbooks:

    • Golub & Van Loan, Matrix Computations (4th ed., 2013) - Chapters 4, 7: definitive reference on Cholesky, LDL^T algorithms
    • Higham, Accuracy and Stability of Numerical Algorithms (2002) - backward stability proofs for Cholesky, modified Cholesky
    • Boyd & Vandenberghe, Convex Optimization (2004) - Chapter 4: SDP, PSD cone; freely available online
    • Bernstein, Matrix Mathematics (2nd ed., 2009) - comprehensive collection of PD matrix identities
  2. Research papers:

    • Cholesky (1924, posthumous publication via Benoit): original algorithm for geodetic calculations
    • Gill, Murray & Wright (1981): modified Cholesky for optimization
    • Vandenberghe & Boyd (1996): semidefinite programming tutorial
    • Martens & Grosse (2015): K-FAC - Kronecker-factored natural gradient
    • Gardner et al. (2018): GPyTorch - scalable GP with stochastic log-det estimation
    • Foret et al. (2021): SAM - sharpness-aware minimization
  3. Online resources:

    • Petersen & Pedersen, The Matrix Cookbook - practical matrix calculus identities
    • NumPy/SciPy docs: np.linalg.cholesky, scipy.linalg.cholesky, scipy.linalg.ldl
    • GPyTorch documentation: scalable GP inference with Cholesky
    • CVXPY documentation: SDP examples and semidefinite programming
  4. Next section:

    • 08: Matrix Decompositions


End of 07: Positive Definite Matrices.


Appendix I: Worked Examples - Complete Solutions

I.1 Full 3×3 Cholesky Example with Verification

Compute $L = \operatorname{chol}(A)$ for

$$A = \begin{pmatrix} 9 & 3 & -3 \\ 3 & 17 & -1 \\ -3 & -1 & 12 \end{pmatrix}.$$

Step 1: $L_{11} = \sqrt{9} = 3$.

Step 2: $L_{21} = A_{21}/L_{11} = 3/3 = 1$; $L_{31} = A_{31}/L_{11} = -3/3 = -1$.

Step 3: $L_{22} = \sqrt{A_{22} - L_{21}^2} = \sqrt{17 - 1} = \sqrt{16} = 4$.

Step 4: $L_{32} = (A_{32} - L_{31}L_{21})/L_{22} = (-1 - (-1)(1))/4 = 0/4 = 0$.

Step 5: $L_{33} = \sqrt{A_{33} - L_{31}^2 - L_{32}^2} = \sqrt{12 - 1 - 0} = \sqrt{11}$.

$$L = \begin{pmatrix} 3 & 0 & 0 \\ 1 & 4 & 0 \\ -1 & 0 & \sqrt{11} \end{pmatrix}.$$

Verification:

$$LL^\top = \begin{pmatrix} 3 & 0 & 0 \\ 1 & 4 & 0 \\ -1 & 0 & \sqrt{11} \end{pmatrix}\begin{pmatrix} 3 & 1 & -1 \\ 0 & 4 & 0 \\ 0 & 0 & \sqrt{11} \end{pmatrix} = \begin{pmatrix} 9 & 3 & -3 \\ 3 & 17 & -1 \\ -3 & -1 & 12 \end{pmatrix} = A. \quad\checkmark$$

Also: $\log\det A = 2(\log 3 + \log 4 + \log\sqrt{11}) = 2(1.099 + 1.386 + 1.199) = 2(3.684) = 7.368$.
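The same computation as a quick NumPy check (a minimal sketch; np.linalg.cholesky returns the lower triangular factor):

```python
import numpy as np

A = np.array([[9., 3., -3.],
              [3., 17., -1.],
              [-3., -1., 12.]])

L = np.linalg.cholesky(A)           # lower triangular Cholesky factor
print(L)                            # [[3, 0, 0], [1, 4, 0], [-1, 0, sqrt(11)]]

assert np.allclose(L @ L.T, A)      # reconstruction check
logdet = 2 * np.log(np.diag(L)).sum()
print(logdet)                       # ~7.368, matching the hand computation
```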

I.2 Schur Complement and Conditional Gaussian

Let $\Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}$ be the covariance of $(X_1, X_2)^\top \sim \mathcal{N}(\mathbf{0}, \Sigma)$.

Schur complement of $\Sigma_{11}$:

$$S = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 5 - 2 \cdot \tfrac{1}{4} \cdot 2 = 5 - 1 = 4.$$

Conditional distribution $X_1 \mid X_2 = a$:

$$\mu_{1|2} = 0 + \Sigma_{12}\Sigma_{22}^{-1}(a - 0) = \tfrac{2}{5}a, \qquad \sigma_{1|2}^2 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = 4 - 2 \cdot \tfrac{1}{5} \cdot 2 = 4 - \tfrac{4}{5} = \tfrac{16}{5}.$$

Note: the unconditional variance of $X_1$ is $\sigma_1^2 = 4$. The conditional variance $16/5 = 3.2 < 4$: observing $X_2$ reduces uncertainty about $X_1$ (as guaranteed by the Loewner order, 2.4). The correlation $\rho = 2/\sqrt{4 \cdot 5} = 2/\sqrt{20} = 1/\sqrt{5} \approx 0.447$ explains why the uncertainty reduction is only moderate.
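These conditional moments can be checked by simulation. A minimal sketch (the slab width and sample count are arbitrary choices for the example), estimating the mean and variance of $X_1$ given $X_2 \approx a$ by keeping draws whose $X_2$ lands near $a$:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4., 2.], [2., 5.]])

# Sample (X1, X2) and condition on X2 in a thin slab around a.
X = rng.multivariate_normal([0., 0.], Sigma, size=2_000_000)
a = 1.0
x1 = X[np.abs(X[:, 1] - a) < 0.02, 0]

print(x1.mean())   # ~ (2/5) * a = 0.4
print(x1.var())    # ~ 16/5 = 3.2
```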
