Uniqueness. Suppose $A = L_1 L_1^\top = L_2 L_2^\top$ with $L_1, L_2$ lower triangular with positive diagonals. Then $M := L_2^{-1} L_1$ (a product of lower triangular matrices, hence lower triangular) satisfies $MM^\top = I$. An orthogonal lower triangular matrix must be diagonal: $M^\top = M^{-1}$ is upper triangular (as a transpose) and lower triangular (as the inverse of a lower triangular matrix), hence diagonal, and so is $M$. Since $L_1, L_2$ have positive diagonals, $M$ has a positive diagonal. For diagonal $M$, $MM^\top = I$ forces $M_{ii}^2 = 1$, so $M = I$ and $L_1 = L_2$. □
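A numerical illustration of uniqueness (a sketch with a random test matrix, not part of the proof): since the factor is unique, any correct algorithm must return the same $L$, so NumPy's routine and a hand-rolled Cholesky-Crout loop agree to machine precision.

```python
# Uniqueness in practice: NumPy's Cholesky and a hand-rolled
# Cholesky-Crout loop return the identical factor L (random PD test matrix).
import numpy as np

rng = np.random.default_rng(5)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)        # symmetric positive definite

L1 = np.linalg.cholesky(A)

L2 = np.zeros((n, n))
for j in range(n):
    # diagonal entry: sqrt of what remains after subtracting earlier columns
    L2[j, j] = np.sqrt(A[j, j] - L2[j, :j] @ L2[j, :j])
    for i in range(j + 1, n):
        L2[i, j] = (A[i, j] - L2[i, :j] @ L2[j, :j]) / L2[j, j]

print(np.max(np.abs(L1 - L2)))     # ~1e-16: the factor is unique
```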
G.2 Proof That the Fisher Information Is PSD
Theorem. For a regular statistical model $p(x \mid \theta)$, the Fisher information matrix $F(\theta) = \mathbb{E}\left[s(x;\theta)\, s(x;\theta)^\top\right]$, where $s = \nabla_\theta \log p$ is the score function, is PSD.
Proof. For any $v \in \mathbb{R}^d$:
$$v^\top F v = \mathbb{E}[v^\top s s^\top v] = \mathbb{E}\left[(v^\top s)^2\right] \ge 0.$$
(The expectation of a squared real random variable is non-negative.) □
When is $F \succ 0$? $F$ is PD iff $\mathbb{E}[(v^\top s)^2] > 0$ for all $v \neq 0$. This holds iff for every $v \neq 0$ the directional score $v^\top \nabla_\theta \log p(x \mid \theta)$ is nonzero with positive probability, i.e., the model is identifiable in all directions. A singular $F$ indicates structural non-identifiability: distinct parameter values produce identical distributions.
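A minimal Monte Carlo sketch of both claims (the Gaussian model and the redundant parameterization below are illustrative choices, not from the text):

```python
# Estimate F(theta) = E[s s^T] for N(mu, sigma^2) and verify it is PSD,
# then repeat with a redundant mean parameterization mu = a + b to see a
# singular (rank-deficient) Fisher matrix.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=100_000)

# Score of log N(x | mu, sigma): derivatives w.r.t. mu and sigma.
s = np.stack([(x - mu) / sigma**2,
              ((x - mu)**2 - sigma**2) / sigma**3])   # shape (2, n)
F = (s @ s.T) / x.size                                # MC estimate of E[s s^T]
print(np.linalg.eigvalsh(F))        # both eigenvalues > 0 (identifiable)

# Redundant parameterization mu = a + b: the two score components coincide,
# so F has rank 1 -- the direction v = (1, -1) is non-identifiable.
s_red = np.stack([(x - mu) / sigma**2,
                  (x - mu) / sigma**2])
F_red = (s_red @ s_red.T) / x.size
print(np.linalg.eigvalsh(F_red))    # one eigenvalue ~ 0
```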
G.3 Derivatives and the Matrix-Valued Chain Rule
For completeness, we derive the key matrix calculus formulas used in 7.3.
Jacobi's formula. For differentiable $A(t)$:
$$\frac{d}{dt} \det A(t) = \det A(t) \cdot \operatorname{tr}\!\left(A(t)^{-1} \dot{A}(t)\right).$$
Proof: Using the adjugate matrix (cofactor expansion along row $i$), $\det A = \sum_j A_{ij} (\operatorname{adj} A)_{ji}$. Differentiating with respect to $A_{ij}$ gives $\partial \det A / \partial A_{ij} = (\operatorname{adj} A)_{ji} = (\det A)(A^{-1})_{ji}$ (Cramer's rule). By the chain rule:
$$\frac{d \det A}{dt} = \sum_{ij} (\operatorname{adj} A)_{ji} \dot{A}_{ij} = \det A \sum_{ij} (A^{-1})_{ji} \dot{A}_{ij} = \det A \, \operatorname{tr}(A^{-1} \dot{A}).$$
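A quick finite-difference check of Jacobi's formula (a sketch on the random matrix path $A(t) = A_0 + t A_1$, chosen purely for illustration):

```python
# Finite-difference check of d/dt det A(t) = det A(t) * tr(A(t)^{-1} Adot(t)).
import numpy as np

rng = np.random.default_rng(1)
A0, A1 = rng.standard_normal((2, 4, 4))   # A(t) = A0 + t*A1, so Adot = A1
det = lambda t: np.linalg.det(A0 + t * A1)

t, h = 0.3, 1e-6
fd = (det(t + h) - det(t - h)) / (2 * h)  # central difference
A = A0 + t * A1
exact = np.linalg.det(A) * np.trace(np.linalg.solve(A, A1))
print(fd, exact)                          # agree to several digits
```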
Log-det gradient. $d \log\det A = d(\det A)/\det A = \operatorname{tr}(A^{-1} dA)$. Since $\operatorname{tr}(A^{-1} dA) = \langle A^{-\top}, dA \rangle_F$, the gradient of $\log\det$ at $A$ is $A^{-\top} = A^{-1}$ (for symmetric $A$).
Trace-inverse gradient. For $f(A) = \operatorname{tr}(A^{-1} B)$:
$$df = \operatorname{tr}(-A^{-1}\, dA\, A^{-1} B) = -\operatorname{tr}(A^{-1} B A^{-1}\, dA) = -\left\langle (A^{-1} B A^{-1})^\top, dA \right\rangle_F.$$
So $\nabla_A \operatorname{tr}(A^{-1} B) = -(A^{-1} B A^{-1})^\top = -A^{-\top} B^\top A^{-\top}$.
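Both gradient formulas can be verified the same way; the following sketch (random symmetric PD test data, an illustrative setup) compares each against a central difference:

```python
# Finite-difference checks of the log-det and trace-inverse gradients.
import numpy as np

rng = np.random.default_rng(2)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric PD
B = rng.standard_normal((n, n))
E = rng.standard_normal((n, n))      # perturbation direction
h = 1e-6

Ainv = np.linalg.inv(A)

# d logdet(A)[E] should equal <A^{-T}, E>_F (= <A^{-1}, E>_F for symmetric A)
logdet = lambda M: np.linalg.slogdet(M)[1]
fd = (logdet(A + h * E) - logdet(A - h * E)) / (2 * h)
print(fd, np.sum(Ainv.T * E))

# d tr(A^{-1}B)[E] should equal <-(A^{-1} B A^{-1})^T, E>_F
f = lambda M: np.trace(np.linalg.solve(M, B))
fd = (f(A + h * E) - f(A - h * E)) / (2 * h)
G = -(Ainv @ B @ Ainv).T
print(fd, np.sum(G * E))
```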
For AI - GP hyperparameter gradients: The gradient of the GP log-marginal-likelihood with respect to a kernel hyperparameter $\theta$ is
$$\frac{\partial}{\partial \theta} \log p(y \mid \theta) = \frac{1}{2} \operatorname{tr}\!\left[\left(\alpha \alpha^\top - (K + \sigma^2 I)^{-1}\right) \frac{\partial K}{\partial \theta}\right],$$
where $\alpha = (K + \sigma^2 I)^{-1} y$. This uses $\nabla_K \log\det K = K^{-1}$ and $\nabla_K \operatorname{tr}(K^{-1} S) = -(K^{-1} S K^{-1})^\top$. It is computed efficiently via the Cholesky factorization $K + \sigma^2 I = LL^\top$: $\alpha$ costs two triangular solves, and since both factors inside the trace are symmetric, the trace is the Frobenius inner product $\langle \alpha\alpha^\top - (K + \sigma^2 I)^{-1}, \partial K/\partial\theta \rangle_F$, an elementwise sum. When $\partial K/\partial \theta$ admits a factorization $RR^\top$, the term $\operatorname{tr}[(K + \sigma^2 I)^{-1} \partial K/\partial \theta]$ reduces to $\|V\|_F^2$ with $V = L^{-1} R$ (one triangular solve).
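A sketch of this computation for an assumed 1-D RBF kernel $k(x, x') = \exp(-(x - x')^2 / (2\ell^2))$ (the kernel, data, and hyperparameter names are illustrative, not from the text), with a finite-difference check of the gradient:

```python
# Cholesky-based gradient of the GP log-marginal-likelihood w.r.t. the
# lengthscale l of an assumed RBF kernel, verified by finite differences.
import numpy as np
from scipy.linalg import cholesky, cho_solve

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)
l, sigma = 1.2, 0.1

D2 = (x[:, None] - x[None, :])**2          # squared distances
K = np.exp(-D2 / (2 * l**2))
Ky = K + sigma**2 * np.eye(len(x))

L = cholesky(Ky, lower=True)
alpha = cho_solve((L, True), y)            # (K + sigma^2 I)^{-1} y
Ky_inv = cho_solve((L, True), np.eye(len(x)))

dK_dl = K * D2 / l**3                      # elementwise derivative of K w.r.t. l
# tr[(alpha alpha^T - Ky^{-1}) dK/dl] as a Frobenius inner product
grad = 0.5 * np.sum((np.outer(alpha, alpha) - Ky_inv) * dK_dl)

def lml(l_):
    """Log-marginal-likelihood; -0.5*logdet = -sum(log diag(L))."""
    Ky_ = np.exp(-D2 / (2 * l_**2)) + sigma**2 * np.eye(len(x))
    L_ = cholesky(Ky_, lower=True)
    a_ = cho_solve((L_, True), y)
    return -0.5 * y @ a_ - np.log(np.diag(L_)).sum() - 0.5 * len(x) * np.log(2 * np.pi)

h = 1e-6
print(grad, (lml(l + h) - lml(l - h)) / (2 * h))   # should agree closely
```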
Appendix H: Summary and Further Reading
H.1 Core Theorems Summary
| Theorem | Statement | Reference |
| --- | --- | --- |
| Spectral characterization | $A \succ 0 \Leftrightarrow$ all eigenvalues $> 0$ | 3.1 |
| Sylvester's criterion | $A \succ 0 \Leftrightarrow$ all leading principal minors $> 0$ | 3.2 |
| Cholesky existence | $A \succ 0 \Leftrightarrow \exists!$ lower triangular $L$ (pos. diag.) with $A = LL^\top$ | 4.1 |
| LDL^T | Symmetric $A \to A = LDL^\top$; $A \succ 0 \Leftrightarrow$ all $d_i > 0$ | 4.4 |
| PSD square root | $A \succeq 0 \Rightarrow \exists!$ symmetric PSD $A^{1/2}$ with $(A^{1/2})^2 = A$ | 5.1 |
| Schur PD criterion | $M = \left(\begin{smallmatrix} A & B \\ B^\top & D \end{smallmatrix}\right) \succ 0 \Leftrightarrow A \succ 0$ and $D - B^\top A^{-1} B \succ 0$ | 6.2 |
| Woodbury identity | $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$ | 6.3 |
| Log-det concavity | $f(A) = \log\det A$ is strictly concave on $S_{++}^n$ | 7.2 |
| Log-det gradient | $\nabla_A \log\det A = A^{-1}$ | 7.3 |
| Gram matrix PSD | $G = XX^\top \succeq 0$ always; PD iff rows of $X$ are linearly independent | 8.1 |
| Schur product | Hadamard product of PSD matrices is PSD | Appendix A |
| Hadamard inequality | $\det A \le \prod_i A_{ii}$ for $A \succ 0$ | Appendix D |
| Log-det Cholesky | $\log\det A = 2 \sum_i \log L_{ii}$ | 7.1 |
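Two of the rows above are easy to spot-check numerically; the following sketch (random test matrices, purely illustrative) verifies the Woodbury identity and the log-det-via-Cholesky formula:

```python
# Numerical spot-checks of the Woodbury identity and log det A = 2*sum(log L_ii).
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                  # symmetric PD
U = rng.standard_normal((n, k))
C = np.eye(k)
V = U.T

# Woodbury: (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
lhs = np.linalg.inv(A + U @ C @ V)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ai @ U) @ V @ Ai
print(np.max(np.abs(lhs - rhs)))             # ~1e-15

# Log-det via Cholesky
L = np.linalg.cholesky(A)
print(np.linalg.slogdet(A)[1], 2 * np.log(np.diag(L)).sum())
```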
H.2 Further Reading
Textbooks:
Golub & Van Loan, Matrix Computations (4th ed., 2013) - Chapters 4 and 7: the definitive reference on Cholesky and LDL^T algorithms
Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002) - backward stability proofs for Cholesky and modified Cholesky
Note: the unconditional variance of $X_1$ is $\sigma_1^2 = 4$. The conditional variance $16/5 = 3.2 < 4$: observing $X_2$ reduces uncertainty about $X_1$ (as guaranteed by the Loewner order, 2.4). The correlation $\rho = 2/\sqrt{4 \cdot 5} = 2/\sqrt{20} = 1/\sqrt{5} \approx 0.447$ explains the moderate uncertainty reduction.
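The arithmetic in the note follows from the Schur complement; a short check, assuming the covariance $\Sigma = \left(\begin{smallmatrix} 4 & 2 \\ 2 & 5 \end{smallmatrix}\right)$ inferred from the quoted numbers:

```python
# Conditional variance of X1 given X2 as a Schur complement of Sigma.
# Sigma is inferred from the note (sigma_1^2 = 4, sigma_2^2 = 5, cov = 2).
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 5.0]])
cond_var = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(cond_var, rho)    # 3.2, 0.4472...
```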