
Positive Definite Matrices, Part 3: 12. Exercises through Appendix F: Quick Reference

12. Exercises

Exercise 1 * - Verify PD Axioms for $A^\top A + \lambda I$

Let $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ and $\lambda > 0$. Define $B = A^\top A + \lambda I$.

(a) Prove $B \succ 0$ using the definition $\mathbf{x}^\top B\mathbf{x} > 0$ for $\mathbf{x} \neq \mathbf{0}$.

(b) What is the smallest eigenvalue of $B$? Express it in terms of the singular values of $A$ and $\lambda$.

(c) What happens as $\lambda \to 0$? Under what condition on $A$ does $A^\top A$ itself become PD?

(d) Show $B^{-1} \preceq \frac{1}{\lambda}I$ using the Loewner order.

(e) AI connection: Identify $B = A^\top A + \lambda I$ as the normal equations matrix for ridge regression. Why does $\lambda > 0$ guarantee $B \succ 0$ regardless of the rank of $A$?


Exercise 2 * - Implement Cholesky from Scratch

(a) Implement the Cholesky algorithm (Algorithm 4.2) in pure NumPy without using np.linalg.cholesky. Your function should return the lower triangular $L$ satisfying $A = LL^\top$, or raise a ValueError if $A$ is not PD.

(b) Test on $A = \begin{pmatrix}4 & 12 & -16 \\ 12 & 37 & -43 \\ -16 & -43 & 98\end{pmatrix}$ and verify $LL^\top = A$.

(c) Compare the result with np.linalg.cholesky - they should agree to machine precision.

(d) Test what happens when you apply your function to an indefinite matrix $A = \begin{pmatrix}1 & 3 \\ 3 & 1\end{pmatrix}$.

(e) Efficiency: What is the flop count of your implementation for an $n \times n$ input? Compare to LU factorization.


Exercise 3 * - Sylvester's Criterion

Let $A(\alpha) = \begin{pmatrix}1 & \alpha & 0 \\ \alpha & 2 & \alpha \\ 0 & \alpha & 4\end{pmatrix}$.

(a) Compute the three leading principal minors $\Delta_1, \Delta_2, \Delta_3$ as functions of $\alpha$.

(b) Use Sylvester's criterion to find all values of $\alpha \in \mathbb{R}$ for which $A(\alpha) \succ 0$.

(c) At the boundary values of $\alpha$ (where $A$ transitions from PD to indefinite), what is the nullspace of $A$? Interpret this geometrically.

(d) Verify your answer computationally by plotting the smallest eigenvalue of $A(\alpha)$ as a function of $\alpha$.


Exercise 4 ** - LDL^T Factorization

(a) Implement the LDL^T algorithm (Algorithm 4.4) in NumPy. Your function returns unit lower triangular $L$ and diagonal $D$ such that $A = LDL^\top$.

(b) Test on $A = \begin{pmatrix}4 & 2 & 1 \\ 2 & 5 & 3 \\ 1 & 3 & 6\end{pmatrix}$. Verify $LDL^\top = A$.

(c) Using your factorization, solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{b} = (1, 2, 3)^\top$ by: (i) solve $L\mathbf{y} = \mathbf{b}$, (ii) solve $D\mathbf{w} = \mathbf{y}$, (iii) solve $L^\top\mathbf{x} = \mathbf{w}$.

(d) Explain why LDL^T avoids the numerical issues of Cholesky when $A$ is indefinite.

(e) AI connection: Why do second-order optimizers (like K-FAC) prefer working with $D$ (the pivots) rather than $\sqrt{D}$ (the Cholesky diagonal)?


Exercise 5 ** - Schur Complement and Block PD Test

Let $M = \begin{pmatrix}A & B \\ B^\top & D\end{pmatrix}$ where:

$$A = \begin{pmatrix}4 & 1 \\ 1 & 3\end{pmatrix}, \quad B = \begin{pmatrix}1 \\ 2\end{pmatrix}, \quad D = \begin{pmatrix}5\end{pmatrix}.$$

(a) Compute the Schur complement $S = D - B^\top A^{-1}B$.

(b) Use Theorem 6.2 to determine whether $M \succ 0$.

(c) Compute $\det M$ using $\det M = \det A \cdot \det S$ and verify numerically.

(d) Use the block inverse formula to compute $M^{-1}$. Verify $MM^{-1} = I$.

(e) Gaussian conditioning: Interpret $M$ as a joint covariance matrix $\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$. What is the conditional covariance $\Sigma_{1|2}$? Interpret: does conditioning reduce uncertainty?


Exercise 6 ** - Log-Det via Cholesky and Gradient Verification

(a) Compute $\log\det A$ for $A = \begin{pmatrix}3 & 1 & 0 \\ 1 & 4 & 2 \\ 0 & 2 & 5\end{pmatrix}$ using: (i) direct eigenvalues, (ii) the Cholesky diagonal formula, (iii) np.log(np.linalg.det(A)). Verify all three agree.

(b) Implement a function log_det(A) using Cholesky. Compare its speed against the eigenvalue method for $n = 100, 500, 1000$ random PD matrices.

(c) Numerically verify the gradient formula $\nabla_A \log\det A = A^{-1}$ using finite differences: $\frac{f(A + hE_{ij}) - f(A)}{h} \approx (A^{-1})_{ij}$.

(d) For the Gaussian log-likelihood: given 50 samples from $\mathcal{N}(\mathbf{0}, \Sigma_{\text{true}})$ where $\Sigma_{\text{true}} = A$ from part (a), compute the log-likelihood $\ell(\Sigma) = -\frac{n}{2}\log\det\Sigma - \frac{1}{2}\mathrm{tr}(\Sigma^{-1}S)$ where $S = \frac{1}{n}\sum_i\mathbf{x}_i\mathbf{x}_i^\top$ is the sample covariance.


Exercise 7 *** - Gram Matrix / Kernel Matrix and Mercer Check

(a) Given 5 points $\mathbf{x}_1, \ldots, \mathbf{x}_5 \in \mathbb{R}^3$ (design your own interesting configuration), compute the Gram matrix $G = XX^\top$.

(b) Verify $G \succeq 0$ by checking eigenvalues. Explain why $\mathrm{rank}(G) \leq 3$.

(c) Implement and test three kernels on these points: (i) linear $k(\mathbf{x},\mathbf{z}) = \mathbf{x}^\top\mathbf{z}$, (ii) RBF $k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|^2/2)$, (iii) polynomial $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z}+1)^2$. For each, compute the $5 \times 5$ kernel matrix and verify PSD.

(d) Now test two candidate kernels: $k(\mathbf{x},\mathbf{z}) = \exp(-\|\mathbf{x}-\mathbf{z}\|)$ (the Laplacian kernel, which IS a valid kernel) and $k(\mathbf{x},\mathbf{z}) = \cos(\|\mathbf{x}-\mathbf{z}\|)$ (NOT always a valid kernel). Determine empirically whether each produces a PSD Gram matrix.

(e) AI connection: Implement a simple kernel regression: given training inputs $X$ and outputs $\mathbf{y}$, compute predictions at test points $X_*$ using $\hat{\mathbf{y}}_* = K_{*n}(K_{nn} + 0.1 I)^{-1}\mathbf{y}$ (the kernel ridge / GP predictive mean).


Exercise 8 *** - Cholesky Reparameterization for VAE Sampling

(a) Implement a function sample_mvn(mu, L, n_samples) that samples $n$ vectors from $\mathcal{N}(\boldsymbol{\mu}, LL^\top)$ using the reparameterization trick $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},I)$.

(b) Let $\boldsymbol{\mu} = (1, 2)^\top$ and $\Sigma = \begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}$. Compute $L = \mathrm{chol}(\Sigma)$. Sample 1000 points, compute the sample mean and covariance, and verify they match $\boldsymbol{\mu}$ and $\Sigma$ (up to sampling noise).

(c) Simulate a VAE encoder: the encoder takes a scalar input $x$ and outputs $\boldsymbol{\mu}_\phi(x) = Wx + b$ and a lower triangular $L_\phi(x)$ (a diagonal $L$ for simplicity). Sample $\mathbf{z}$ and pass it through a simple decoder $\hat{x} = c^\top\mathbf{z}$. Compute the ELBO:

$$\text{ELBO} = \mathbb{E}_{\mathbf{z}}[\log p(x|\mathbf{z})] - D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \,\|\, \mathcal{N}(\mathbf{0}, I)),$$

where $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \Sigma) \,\|\, \mathcal{N}(\mathbf{0},I)) = \frac{1}{2}[\mathrm{tr}(\Sigma) + \boldsymbol{\mu}^\top\boldsymbol{\mu} - d - \log\det\Sigma]$.

(d) Verify the KL formula using both the analytical expression and Monte Carlo estimation from samples.

(e) Gradient check: Confirm that the reparameterization trick allows gradients to flow through $\boldsymbol{\mu}$ and $L$ (not $\boldsymbol{\epsilon}$) by computing $\partial\,\text{ELBO}/\partial\boldsymbol{\mu}$ analytically and comparing with finite differences.


13. Why This Matters for AI (2026 Perspective)

| Concept | Concrete AI/LLM Role | Where / Method |
|---|---|---|
| Positive definite covariance | Valid multivariate Gaussian distributions | VAEs (Kingma & Welling, 2013); diffusion models (DDPM, score matching) |
| Cholesky factorization | Reparameterization trick for differentiable sampling | All variational inference; NUTS-HMC sampler in Pyro, NumPyro |
| LDL^T factorization | Stable Hessian factorization in trust-region optimizers | L-BFGS, trust-region Newton; scipy.optimize |
| Schur complement | Conditional Gaussian inference; GP prediction | GPyTorch, GPflow; Bayesian neural networks |
| Matrix inversion lemma (Woodbury) | Low-rank updates without full matrix inversion | LoRA weight merging; Kalman filter; GP with inducing points |
| Log-determinant | Normalizing constant in Gaussian log-likelihood | GP hyperparameter optimization; normalizing flows (RealNVP, Glow) |
| $\nabla\log\det A = A^{-1}$ | Gradient of log-likelihood w.r.t. covariance | Fitting GP kernel parameters; structured covariance learning |
| Gram matrix / kernel matrix | Kernel methods, self-attention similarity | SVM training; transformer attention; Performer / FAVOR+ |
| Fisher information matrix | Riemannian metric on parameter space | K-FAC optimizer (Google Brain, 2015); natural gradient variational inference |
| PSD Hessian at minima | Second-order optimality condition | SAM (flatness-seeking optimizer); loss landscape analysis |
| Sharpness $\lambda_{\max}(H)$ | Generalization proxy; flat minima theory | SAM, ASAM, Stein Self-Repulsive (2024) |
| PSD cone / SDP | Constrained optimization; certified robustness | $\alpha$-CROWN, $\beta$-CROWN neural network verification; fairness constraints |
| Mercer's theorem | Theoretical foundation for kernel machines | SVMs; kernel PCA; connections to infinite-width NTK theory |
| Cholesky reparameterization | Parameterizing full covariance in deep models | Deep covariance estimation; matrix-variate distributions in meta-learning |
| Log-det stochastic estimator | Scalable GP training on millions of points | GPyTorch (Gardner et al., 2018); SVGP; stochastic trace estimation |

14. Conceptual Bridge

Looking Back

This section builds on a foundation laid across the previous six sections of Advanced Linear Algebra. From eigenvalues and eigenvectors (01), we borrow the spectral theorem - the fact that symmetric matrices diagonalize orthogonally - which provides the cleanest proof that positive definiteness is equivalent to positive eigenvalues. From SVD (02), we inherit the language of singular values and the connection between the spectral norm, condition number, and the stability of Cholesky. The orthogonality theory from 05 underpins the spectral decomposition $A = Q\Lambda Q^\top$ and the fact that the PSD square root is well defined and unique. Matrix norms from 06 give precise bounds on how a PD matrix behaves numerically: the condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ measures how far $A$ is from singular, and ill-conditioned PD matrices lead to numerical difficulties in Cholesky.

Most fundamentally, positive definite matrices are the symmetric matrices with all four of the properties we want: uniquely solvable linear systems (invertible), well-defined quadratic energy (positive), a computable square root (Cholesky), and a monotone response to matrix operations (Loewner order). Understanding this intersection of symmetry and positivity is the analytical key to all probabilistic machine learning.

Looking Forward

Positive definite matrices are the entry point to the next major development: 08 (Matrix Decompositions) treats the Cholesky factorization developed here as one member of a family that includes LU (general matrices) and QR (orthogonal factorization). The Cholesky decomposition studied here in full generality is revisited only briefly in 08 - by the time students reach 08, the algorithm should feel familiar.

The deeper applications of positive definiteness come later in the curriculum. In Chapter 6 (Probability Theory), covariance matrices $\Sigma \succeq 0$ appear in every multivariate distribution. In Chapter 8 (Optimization), the second-order conditions for minima require $\nabla^2\mathcal{L} \succeq 0$, and convexity of a function is equivalent to its Hessian being PSD everywhere. In Chapter 12 (Functional Analysis), Mercer's theorem and RKHS theory extend PSD kernels to infinite-dimensional inner product spaces. The PSD cone and semidefinite programming previewed in Section 9 appear extensively in Chapter 8 (Optimization) and in the robotics/control applications of later chapters.

Curriculum Position

POSITIVE DEFINITE MATRICES IN THE CURRICULUM
========================================================================

  Chapter 2: Linear Algebra Basics
  +-- 01 Vectors and Spaces        -> inner products, quadratic forms
  +-- 04 Determinants              -> Sylvester's criterion
  +-- 06 Vector Spaces             -> subspaces, null space
  
  Chapter 3: Advanced Linear Algebra
  +-- 01 Eigenvalues         -> spectral characterization of PD
  +-- 02 SVD                 -> singular values, condition number
  +-- 05 Orthogonality       -> spectral theorem, square root
  +-- 06 Matrix Norms        -> condition number, jitter scale
  +-- 07 Positive Definite   <== YOU ARE HERE
  |     |   Cholesky, LDL^T, Schur, log-det, Gram, SDP
  +-- 08 Matrix Decompositions  -> LU, QR, brief Cholesky recap
  
  Downstream:
  +-- Ch. 6 Probability: Gaussian covariance \Sigma \succ 0
  +-- Ch. 8 Optimization: Hessian \succ 0 at local minima; SDP
  +-- Ch. 12 Functional Analysis: Mercer, RKHS
  +-- Ch. 13 ML Math: K-FAC, VAE, GP, normalizing flows

========================================================================

The central insight to carry forward: positive definiteness is not a restriction but a characterization. A matrix that fails to be PD is not "almost OK" - it is categorically different, with no unique minimum, no valid probability interpretation, no Cholesky factorization. The condition $A \succ 0$ is the boundary between well-posedness and ill-posedness in a wide range of mathematical problems, and recognizing that boundary - and the tools to work with it (Cholesky, Schur, log-det) - is an essential skill for the working ML practitioner.


Appendix A: Closure Properties and Algebraic Operations

A.1 Preserving Positive Definiteness Under Operations

Understanding which operations preserve PD/PSD structure is practically important - it tells you when a composed matrix is guaranteed to be PD without recomputing.

Direct results:

| Operation | Inputs | Result | Condition |
|---|---|---|---|
| $\alpha A$ | $A \succ 0$, $\alpha > 0$ | $\succ 0$ | Always |
| $A + B$ | $A, B \succ 0$ | $\succ 0$ | Always |
| $A + B$ | $A \succ 0$, $B \succeq 0$ | $\succ 0$ | Always |
| $B^\top A B$ | $A \succ 0$, $B$ invertible | $\succ 0$ | Always |
| $B^\top A B$ | $A \succ 0$, $B$ full column rank | $\succ 0$ | Always |
| $A^{-1}$ | $A \succ 0$ | $\succ 0$ | Always |
| $A^k$ | $A \succ 0$, $k \in \mathbb{Z}$ | $\succ 0$ | Always |
| $A^t$ | $A \succ 0$, $t \in \mathbb{R}$ | $\succ 0$ | Matrix power via $\exp(t \log A)$ |
| $AB$ | $A, B \succ 0$ | NOT necessarily $\succeq 0$ | Only if $AB = BA$ |
| $A \otimes B$ | $A, B \succ 0$ | $\succ 0$ | Always (Kronecker) |
| $A \otimes B$ | $A, B \succeq 0$ | $\succeq 0$ | Always (Kronecker) |
| $A \odot B$ (Hadamard) | $A, B \succeq 0$ | $\succeq 0$ | Schur product theorem |

The Schur product theorem. The Hadamard (element-wise) product of two PSD matrices is PSD. This is non-obvious! It is used in kernel methods: if $k_1$ and $k_2$ are valid kernels, their pointwise product $k_1(\mathbf{x},\mathbf{z}) \cdot k_2(\mathbf{x},\mathbf{z})$ is also a valid kernel.

Proof of the Schur product theorem. If $A, B \succeq 0$, write $A = \sum_i \mathbf{u}_i\mathbf{u}_i^\top$ and $B = \sum_j \mathbf{v}_j\mathbf{v}_j^\top$ (Gram representations). Then $(A \odot B)_{kl} = A_{kl}B_{kl}$, which can be expressed as a Gram matrix built from the vectors $\mathbf{u}_i \otimes \mathbf{v}_j$ (Kronecker products of the generating vectors). Gram matrices are PSD. $\square$

Why $AB$ is not always PSD. Consider $A = \begin{pmatrix}1 & 0 \\ 0 & 2\end{pmatrix}$ and $B = \begin{pmatrix}2 & 1 \\ 1 & 1\end{pmatrix}$. Both are PD, but $AB = \begin{pmatrix}2 & 1 \\ 2 & 2\end{pmatrix}$, which is not symmetric (hence not PSD). The product of symmetric matrices is symmetric iff they commute.

For AI: Kronecker product structure is central to K-FAC: $F \approx A_l \otimes G_l$ is PSD since $A_l, G_l \succeq 0$. The Schur product theorem ensures that combining PD kernels (feature interactions) produces valid kernels.
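
A quick numerical check of these closure claims - a minimal sketch, with random Gram matrices as the PSD inputs and an arbitrary matrix size:

import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    # Random PSD matrix built as a Gram matrix B @ B.T
    B = rng.standard_normal((n, n))
    return B @ B.T

A, B = random_psd(4), random_psd(4)

# Hadamard (element-wise) product of PSD matrices is PSD (Schur product theorem)
print(np.linalg.eigvalsh(A * B).min())          # >= 0 up to rounding

# Kronecker product of PSD matrices is PSD
print(np.linalg.eigvalsh(np.kron(A, B)).min())  # >= 0 up to rounding

# The ordinary product AB is generally not even symmetric
AB = A @ B
print(np.allclose(AB, AB.T))                    # usually False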

A.2 Principal Submatrix Inheritance

Theorem A.1. If $A \succ 0$, every principal submatrix is $\succ 0$. If $A \succeq 0$, every principal submatrix is $\succeq 0$.

A principal submatrix of $A$ is obtained by deleting rows and columns with the same index set: $A_S = A[S, S]$ for $S \subseteq \{1,\ldots,n\}$.

Proof. For any $\mathbf{y} \in \mathbb{R}^{|S|}$, embed it in $\mathbb{R}^n$ via $x_i = y_i$ for $i \in S$ and $x_i = 0$ for $i \notin S$. Then $\mathbf{y}^\top A_S \mathbf{y} = \mathbf{x}^\top A \mathbf{x} > 0$ for $\mathbf{y} \neq \mathbf{0}$ (since $\mathbf{x} \neq \mathbf{0}$). $\square$

Consequence. Every diagonal entry $A_{ii} > 0$ for $A \succ 0$ (take $S = \{i\}$). Every $2 \times 2$ principal submatrix is PD, so $A_{ii}A_{jj} > A_{ij}^2$ for all $i \neq j$ (a Cauchy-Schwarz-type bound for matrix entries).

A.3 Indefinite Matrices and Saddle Points

For completeness: an indefinite symmetric matrix $A$ has both positive and negative eigenvalues. The corresponding quadratic form takes positive values in some directions and negative values in others. The level sets are hyperboloids.

Connection to saddle points in optimization. A critical point $\nabla\mathcal{L}(\boldsymbol{\theta}^*) = \mathbf{0}$ with indefinite Hessian $\nabla^2\mathcal{L}(\boldsymbol{\theta}^*)$ is a saddle point - not a local minimum. In neural network optimization:

  • Gradient descent near a saddle point may slow down dramatically (gradient is small, Hessian is indefinite)
  • The negative eigenvalue directions are "escape directions" - following them reduces the loss
  • Stochastic gradient descent (SGD) with noise "escapes" saddles; deterministic gradient descent can get stuck

Sylvester's law of inertia. For a symmetric matrix $A$, the signature $(p, q, r)$ (the numbers of positive, negative, and zero eigenvalues) is invariant under congruence transformations $A \mapsto B^\top A B$ (for invertible $B$). This means no invertible change of basis can turn a saddle into a minimum - the number of descent directions is a geometric invariant of the quadratic form.
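
A minimal numerical illustration of the law of inertia (the diagonal test matrix and the random congruence matrix below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
A = np.diag([2.0, 0.5, -1.0, 0.0])        # signature: 2 positive, 1 negative, 1 zero
B = rng.standard_normal((4, 4))           # invertible with probability 1

def inertia(M, tol=1e-9):
    lam = np.linalg.eigvalsh(M)
    return (int((lam > tol).sum()), int((lam < -tol).sum()), int((np.abs(lam) <= tol).sum()))

print(inertia(A))            # (2, 1, 1)
print(inertia(B.T @ A @ B))  # same (2, 1, 1): congruence preserves the signature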


Appendix B: Numerical Methods and Implementation

B.1 Cholesky for Linear System Solving

A primary application of Cholesky is solving $A\mathbf{x} = \mathbf{b}$ for PD $A$.

Algorithm:

  1. Factor: $A = LL^\top$ (Cholesky, $O(n^3/3)$)
  2. Forward solve: $L\mathbf{y} = \mathbf{b}$ (substitute from the top, $O(n^2)$)
  3. Backward solve: $L^\top\mathbf{x} = \mathbf{y}$ (substitute from the bottom, $O(n^2)$)

For multiple right-hand sides $B = [\mathbf{b}_1|\cdots|\mathbf{b}_k]$: factor once, solve $k$ times, at total cost $O(n^3/3 + kn^2)$.

Numerical example: Solve $\begin{pmatrix}4 & 2 \\ 2 & 3\end{pmatrix}\mathbf{x} = \begin{pmatrix}8 \\ 7\end{pmatrix}$.

Cholesky: $L = \begin{pmatrix}2 & 0 \\ 1 & \sqrt{2}\end{pmatrix}$ (since $L_{11}=2$, $L_{21}=1$, $L_{22}=\sqrt{3-1}=\sqrt{2}$).

Forward: $2y_1 = 8 \Rightarrow y_1 = 4$; $y_1 + \sqrt{2}\,y_2 = 7 \Rightarrow y_2 = (7-4)/\sqrt{2} = 3/\sqrt{2}$.

Backward: $\sqrt{2}\,x_2 = 3/\sqrt{2} \Rightarrow x_2 = 3/2$; $2x_1 + x_2 = 4 \Rightarrow x_1 = (4 - 3/2)/2 = 5/4$.

Verify: $\begin{pmatrix}4&2\\2&3\end{pmatrix}\begin{pmatrix}5/4\\3/2\end{pmatrix} = \begin{pmatrix}5+3\\5/2+9/2\end{pmatrix} = \begin{pmatrix}8\\7\end{pmatrix}$. OK
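
The same $2 \times 2$ solve in code - a sketch using scipy.linalg.cho_factor / cho_solve; any PD system is handled the same way:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([8.0, 7.0])

c, low = cho_factor(A, lower=True)   # factor once: A = L L^T
x = cho_solve((c, low), b)           # forward + backward substitution
print(x)                             # [1.25 1.5 ]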

B.2 Cholesky Update and Downdate

When $A$ changes by a rank-1 update $A \leftarrow A + \mathbf{v}\mathbf{v}^\top$ (or downdate $A \leftarrow A - \mathbf{v}\mathbf{v}^\top$), it is wasteful to recompute the full Cholesky factorization. Rank-1 Cholesky update algorithms (LINPACK's dchud) update $L$ in $O(n^2)$ time.

Algorithm (rank-1 update, Gill-Golub-Murray-Saunders style):

Given $A = LL^\top$, compute $L'$ such that $L'L'^\top = A + \mathbf{v}\mathbf{v}^\top$:

x = copy of v        (the downdate variant instead begins with the forward solve L p = v)
for k = 1 to n:
    r = sqrt(L[k,k]^2 + x[k]^2)
    c = r / L[k,k]
    s = x[k] / L[k,k]
    L[k,k] = r
    L[k+1:n, k] = (L[k+1:n,k] + s*x[k+1:n]) / c
    x[k+1:n]    = c*x[k+1:n] - s*L[k+1:n,k]

This uses Givens rotations (-> 05: Orthogonality) to update LL in place.
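
A self-contained NumPy sketch of the loop above, checked against a full refactorization (the function name chol_rank1_update is ours, not a library routine):

import numpy as np

def chol_rank1_update(L, v):
    # Return L' with L' L'^T = L L^T + v v^T, in O(n^2) work
    L, x = L.copy(), v.astype(float).copy()
    n = L.shape[0]
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        L[k+1:, k] = (L[k+1:, k] + s * x[k+1:]) / c
        x[k+1:] = c * x[k+1:] - s * L[k+1:, k]
    return L

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A @ A.T + 5 * np.eye(5)   # a PD test matrix
v = rng.standard_normal(5)
L_up = chol_rank1_update(np.linalg.cholesky(A), v)
print(np.allclose(L_up @ L_up.T, A + np.outer(v, v)))          # True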

For AI: Online learning algorithms that add one data point at a time use rank-1 Cholesky updates to maintain the Cholesky factor of the Gram matrix. Sequential Bayesian updating (online GP regression) uses rank-1 updates at cost $O(n^2)$ per update instead of $O(n^3)$ for a full refactorization.

B.3 Condition Number and Numerical Stability of Cholesky

Condition number. For $A \succ 0$, the condition number is $\kappa(A) = \lambda_{\max}/\lambda_{\min}$. The relative error in solving $A\mathbf{x} = \mathbf{b}$ satisfies:

$$\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A) \cdot \epsilon_{\text{mach}}.$$

For Cholesky specifically: the backward error bound on the Cholesky factor is $\|A - \hat{L}\hat{L}^\top\| \leq c_n\,\epsilon_{\text{mach}}\,\|A\|$. This means Cholesky is unconditionally backward stable for PD matrices. In contrast, LU without pivoting can have backward errors proportional to $\|L\|\|U\|$, which can be exponentially large.

Ill-conditioning warning. For large $\kappa(A)$:

  • The log-det computation $2\sum\log L_{ii}$ involves some very small $L_{ii}$ (near zero along near-singular directions) - those log terms dominate and may lose digits
  • Linear solves $A^{-1}\mathbf{b}$ lose approximately $\log_{10}\kappa(A)$ digits of accuracy
  • Cholesky itself remains stable; the error comes from the ill-conditioning of the problem, not the algorithm

Practical rule: If $\kappa(A) \approx 10^k$ and your floating-point arithmetic carries $\approx 16$ decimal digits (double precision), you lose roughly $k$ digits in the solution. For $k$ approaching 16, the solution has essentially no reliable digits.
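
A small experiment illustrating the digit-loss rule (Hilbert matrices are just one convenient family of ill-conditioned PD matrices; the sizes are arbitrary):

import numpy as np
from scipy.linalg import hilbert

for n in (5, 8, 12):
    A = hilbert(n)                      # PD but very ill-conditioned
    x_true = np.ones(n)
    b = A @ x_true
    x_hat = np.linalg.solve(A, b)
    rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(f"n={n:2d}  kappa={np.linalg.cond(A):.1e}  relative error={rel_err:.1e}")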

B.4 Sparse Cholesky

For sparse PD matrices (many zero entries, arising in graph-structured models, FEM discretizations, and spatial statistics), general Cholesky is wasteful because it may "fill in" zeros and produce a dense $L$.

The fill-in problem. Even if $A$ is sparse, $L$ may be dense. The fill-in pattern depends on the sparsity structure of $A$ and the ordering of the variables.

Reordering algorithms. The approximate minimum degree (AMD) and nested dissection algorithms reorder the variables (permute $A \leftarrow PAP^\top$ for a permutation matrix $P$) to minimize fill-in. The standard library for sparse Cholesky is SuiteSparse (CHOLMOD).

For AI: In Gaussian Markov random fields (GMRFs), the precision matrix (inverse covariance) is sparse. Sparse Cholesky enables efficient inference in spatial models, graph neural networks with Gaussian priors, and structured prediction. GPyTorch's KeOps and PyTorch Sparse backends exploit sparsity for large-scale GP approximations.


Appendix C: Positive Definiteness in Complex Vector Spaces

C.1 Hermitian Positive Definite Matrices

Over $\mathbb{C}$, positive definiteness generalizes to Hermitian positive definite (HPD) matrices.

Definition. $A \in \mathbb{C}^{n \times n}$ is Hermitian positive definite if:

  • $A = A^*$ (where $A^* = \bar{A}^\top$ is the conjugate transpose)
  • $\mathbf{x}^* A \mathbf{x} > 0$ for all $\mathbf{x} \in \mathbb{C}^n \setminus \{\mathbf{0}\}$

(Note: $\mathbf{x}^* A \mathbf{x}$ is always real for Hermitian $A$, so the positivity condition makes sense.)

Cholesky for HPD. Every HPD matrix $A$ has a unique Cholesky factorization $A = LL^*$ where $L$ is lower triangular with positive real diagonal entries.

For AI: Complex-valued neural networks (used in signal processing and quantum computing simulation) use HPD covariance matrices. Radar signal processing, MRI reconstruction, and quantum chemistry all work in $\mathbb{C}^n$ with HPD matrices.
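
A minimal check that NumPy's Cholesky handles the Hermitian case (the construction $BB^* + I$ below is just one way to build an HPD test matrix):

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
A = B @ B.conj().T + np.eye(3)            # Hermitian positive definite

L = np.linalg.cholesky(A)                 # lower triangular Cholesky factor
print(np.allclose(L @ L.conj().T, A))     # True
print(np.diag(L).real)                    # diagonal is real and positive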


Appendix D: Advanced Theory

D.1 The Polar Decomposition and Positive Definite Factors

Every invertible matrix $A \in \mathbb{R}^{n \times n}$ has a unique polar decomposition:

$$A = QP$$

where $Q$ is orthogonal ($Q^\top Q = I$) and $P = \sqrt{A^\top A} \succ 0$ is the unique symmetric positive definite "right polar factor." This is the matrix analogue of the polar form $z = e^{i\theta}|z|$ for complex numbers.

Existence and uniqueness. Using the SVD $A = U\Sigma V^\top$:

$$A = U\Sigma V^\top = (UV^\top)(V\Sigma V^\top) = QP$$

where $Q = UV^\top$ (orthogonal) and $P = V\Sigma V^\top \succ 0$ (the PSD square root of $A^\top A$, which is PD since $A$ is invertible).

Uniqueness of $P$: $P = (A^\top A)^{1/2}$, which is the unique PSD square root (Theorem 5.1). Given $P$, $Q = AP^{-1}$ is uniquely determined.

For AI: The polar decomposition appears in:

  • Weight matrix orthogonalization: in the Muon optimizer (2024), gradient updates are "orthogonalized" by replacing the update with its polar factor $Q = AP^{-1}$, which projects the weight update onto the manifold of orthogonal matrices and maintains a stable singular value spectrum during training
  • Group equivariance: equivariant neural networks (E(n)-equivariant GNNs) use orthogonal matrices as symmetry group elements; the polar decomposition provides a differentiable projection onto $O(n)$

Computing the polar factor. The Newton iteration $X_{k+1} = \frac{1}{2}(X_k + X_k^{-\top})$ converges to $Q$ starting from $X_0 = A$, with quadratic convergence once close to $Q$. Each step costs one matrix inversion.
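
A sketch comparing the SVD route with the Newton iteration for the orthogonal polar factor (the matrix size and iteration count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))                 # invertible with probability 1

# Polar factor via SVD: A = U S V^T  =>  Q = U V^T, P = V S V^T
U, S, Vt = np.linalg.svd(A)
Q_svd = U @ Vt
P = Vt.T @ np.diag(S) @ Vt

# Polar factor via Newton iteration: X <- (X + X^{-T}) / 2
X = A.copy()
for _ in range(20):
    X = 0.5 * (X + np.linalg.inv(X).T)

print(np.allclose(X, Q_svd))                    # True: both give the orthogonal factor
print(np.allclose(Q_svd @ P, A))                # True: A = Q P
print(np.allclose(X.T @ X, np.eye(4)))          # True: Q is orthogonal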

D.2 Positive Definite Functions and Bochner's Theorem

A function $k: \mathbb{R}^d \to \mathbb{R}$ is a positive definite function if for every $n$ and every set of points $\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^d$, the matrix $K_{ij} = k(\mathbf{x}_i - \mathbf{x}_j)$ is PSD.

Bochner's theorem. A continuous stationary kernel $k(\mathbf{x}-\mathbf{z})$ is a positive definite function on $\mathbb{R}^d$ if and only if it is the Fourier transform of a non-negative measure $\mu$ (the spectral density):

$$k(\boldsymbol{\tau}) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top\boldsymbol{\tau}}\, d\mu(\boldsymbol{\omega}).$$

Implications for kernel design:

  • RBF kernel: $k(\boldsymbol{\tau}) = \exp(-\|\boldsymbol{\tau}\|^2/2\ell^2)$ is PD (its Fourier transform is a Gaussian, which is non-negative)
  • Matérn kernels: PD for all valid smoothness parameters $\nu > 0$
  • Checking validity: a stationary kernel is valid iff its Fourier transform is non-negative - this is the ultimate test

Random Fourier features (Rahimi & Recht, 2007). Bochner's theorem gives a randomized approximation to stationary PD kernels:

$$k(\mathbf{x}-\mathbf{z}) \approx z(\mathbf{x})^\top z(\mathbf{z}), \quad z(\mathbf{x}) = \sqrt{2/D}\,[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}+b_1),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}+b_D)]^\top$$

where $\boldsymbol{\omega}_i \sim \mu$ (the spectral distribution) and $b_i \sim \text{Uniform}[0,2\pi]$. This approximates the PD kernel as a finite inner product, enabling scalable kernel methods without forming the full $n \times n$ kernel matrix. Used in Performer transformer attention (FAVOR+, Choromanski et al., 2020).
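
A random Fourier feature sketch for the unit-length-scale RBF kernel (the feature dimension D = 2000 is an arbitrary accuracy/speed trade-off):

import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 2000
X = rng.standard_normal((5, d))                        # a few input points

# The RBF spectral density for length-scale 1 is N(0, I): sample frequencies and phases
W = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)             # feature map, shape (5, D)

K_rff = Z @ Z.T                                        # approximate kernel matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2.0)                            # exact RBF kernel
print(np.abs(K_rff - K_exact).max())                   # small; shrinks as D grows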

D.3 The Determinant and Volume

Geometric interpretation of $\det A$ for $A \succ 0$.

For $A \succ 0$, the set $\{\mathbf{x} : \mathbf{x}^\top A^{-1} \mathbf{x} \leq 1\}$ defines an ellipsoid $\mathcal{E}_A$ in $\mathbb{R}^n$. Its volume is:

$$\text{Vol}(\mathcal{E}_A) = \frac{\pi^{n/2}}{\Gamma(n/2+1)} \, (\det A)^{1/2}.$$

So $\det A$ measures the "volume" of the ellipsoid associated with $A$. Larger $\det A$ means a more "spread out" distribution.

In the Gaussian context: The normalizing constant of $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ is $(2\pi)^{n/2}(\det\Sigma)^{1/2}$. This is the volume of the "effective support" ellipsoid scaled by $(2\pi)^{n/2}$.

D-optimal experimental design. In Bayesian experimental design, the D-optimal criterion selects experiments $\mathbf{x}_1,\ldots,\mathbf{x}_n$ to maximize $\det(X^\top X + \sigma^{-2}I)$ - the determinant of the posterior precision matrix, which equals $1/\det(\Sigma_{\text{post}})$. Maximizing the determinant of the posterior precision minimizes the volume of the posterior confidence ellipsoid. Convex relaxations of this design problem (over design weights) can be posed as semidefinite programs.

Maximum entropy Gaussians. Among all distributions with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, the maximum entropy distribution is $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$, with entropy $H = \frac{n}{2}(1+\log(2\pi)) + \frac{1}{2}\log\det\Sigma$. Maximizing entropy subject to a trace constraint $\text{tr}(\Sigma) \leq c$ gives $\Sigma = (c/n)I$ (an isotropic Gaussian) - the most "uninformed" Gaussian with bounded total variance.

D.4 Matrix Exponential and the PD Manifold

The set of PD matrices $\mathbb{S}_{++}^n$ is an open smooth manifold (in fact, a Riemannian manifold under the affine-invariant metric). This means concepts from differential geometry apply.

The matrix exponential. For any symmetric $S = Q\Lambda Q^\top \in \mathbb{S}^n$, $e^S = Q e^\Lambda Q^\top \in \mathbb{S}_{++}^n$ (positive definite, since exponentials of real numbers are positive). The map $\exp: \mathbb{S}^n \to \mathbb{S}_{++}^n$ is a bijection - PD matrices are exactly the exponentials of symmetric matrices.

Log-Euclidean metric. A simple Riemannian metric on $\mathbb{S}_{++}^n$ (used in diffusion tensor imaging) defines the "distance" between $A, B \succ 0$ as:

$$d_{\log}(A, B) = \|\log A - \log B\|_F.$$

This treats PD matrices as if they lived in a flat space via the log map. Computations (means, geodesics, interpolations) become Euclidean operations on $\log A$ and $\log B$.

Affine-invariant metric (Riemannian). The more geometrically natural metric

$$d_{\text{AI}}(A, B) = \|\log(A^{-1/2}B A^{-1/2})\|_F = \left(\sum_{i=1}^n \log^2\lambda_i(A^{-1}B)\right)^{1/2}$$

is invariant under simultaneous congruence $A \mapsto C^\top A C$, $B \mapsto C^\top B C$ (hence "affine-invariant"), making it suitable for geometric statistics.
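
A sketch computing both distances for two random PD matrices (the helper names and matrix sizes are ours; matrix logarithms and inverse square roots are taken via eigendecompositions):

import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

def logm_spd(A):
    # Matrix logarithm of an SPD matrix via its eigendecomposition
    lam, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.log(lam)) @ Q.T

A, B = random_spd(4), random_spd(4)

d_log_euclidean = np.linalg.norm(logm_spd(A) - logm_spd(B), "fro")

lam, Q = np.linalg.eigh(A)
A_inv_sqrt = Q @ np.diag(lam ** -0.5) @ Q.T
d_affine_inv = np.linalg.norm(logm_spd(A_inv_sqrt @ B @ A_inv_sqrt), "fro")

print(d_log_euclidean, d_affine_inv)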

For AI: The SPD (symmetric positive definite) manifold appears in:

  • Diffusion tensor MRI: each voxel holds a $3 \times 3$ diffusion tensor in $\mathbb{S}_{++}^3$; Riemannian means avoid non-PD artifacts
  • Covariance-based classifiers: classifying neural activation covariance matrices using Riemannian geometry (SPDNet, 2017)
  • Transformer positional encodings with structured covariances: attention with SPD constraints on the key-query metric tensors

D.5 Inequalities for Positive Definite Matrices

Several fundamental inequalities apply specifically to PD matrices:

Hadamard's inequality. For $A \succ 0$:

$$\det A \leq \prod_{i=1}^n A_{ii}.$$

Equality holds iff $A$ is diagonal. Interpretation: the volume of the ellipsoid defined by $A$ is at most the product of the diagonal entries (the per-axis scales).

Minkowski's determinant inequality. For $A, B \succ 0$:

$$(\det(A+B))^{1/n} \geq (\det A)^{1/n} + (\det B)^{1/n}.$$

This says that $A \mapsto (\det A)^{1/n}$ is superadditive; it follows from the concavity of that map on $\mathbb{S}_{++}^n$.

Ky Fan's inequality (concavity of log-det). For $A, B \succ 0$ and $\lambda \in [0,1]$:

$$\log\det(\lambda A + (1-\lambda)B) \geq \lambda\log\det A + (1-\lambda)\log\det B.$$

(The function $\log\det$ is concave on the PD cone; this is the inequality behind many Gaussian information-theoretic bounds.)

Anderson's inequality. For $A, B \succ 0$ with $C = A + B$:

$$C^{-1} \preceq A^{-1} + B^{-1}.$$

Interpretation: the precision of a sum of independent sources, $(A+B)^{-1}$, is dominated (in the Loewner order) by the sum of the individual precisions $A^{-1} + B^{-1}$.

For AI: Hadamard's inequality provides a cheap upper bound on the log-det, $\log\det A \leq \sum_i \log A_{ii}$, useful when only the diagonal entries are cheaply available. Minkowski's inequality is used in information-geometric arguments about combining Gaussian priors in Bayesian learning.
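
These inequalities are easy to check numerically - a sketch on random PD matrices (sizes and the mixing weight lambda are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 4

def make_pd():
    B = rng.standard_normal((n, n))
    return B @ B.T + np.eye(n)

A, B = make_pd(), make_pd()

print(np.linalg.det(A) <= np.prod(np.diag(A)))                   # Hadamard

lhs = np.linalg.det(A + B) ** (1 / n)
rhs = np.linalg.det(A) ** (1 / n) + np.linalg.det(B) ** (1 / n)
print(lhs >= rhs)                                                # Minkowski

lam = 0.3
lhs = np.linalg.slogdet(lam * A + (1 - lam) * B)[1]
rhs = lam * np.linalg.slogdet(A)[1] + (1 - lam) * np.linalg.slogdet(B)[1]
print(lhs >= rhs)                                                # log-det concavity (Ky Fan)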


Appendix E: Extended Examples and Case Studies

E.1 Case Study: Covariance Estimation in High Dimensions

One of the most practically important applications of PSD matrix theory is estimating covariance matrices from data when the number of features $p$ is comparable to, or larger than, the number of samples $n$.

The problem. Given $n$ samples $\mathbf{x}_1,\ldots,\mathbf{x}_n \in \mathbb{R}^p$ (zero mean for simplicity), the sample covariance is:

$$\hat{S} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top = \frac{1}{n}X^\top X.$$

$\hat{S} \succeq 0$ always (it is a Gram matrix). But:

  • If $n < p$: $\text{rank}(\hat{S}) \leq n < p$, so $\hat{S}$ is singular (not PD)
  • Even for $n \geq p$: $\hat{S}$ can be ill-conditioned (extreme ratio of largest to smallest eigenvalue)

The Marchenko-Pastur law. For $n, p \to \infty$ with $n/p \to \gamma > 1$ and $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, I_p)$, the empirical eigenvalue distribution of $\hat{S}$ converges to the Marchenko-Pastur distribution:

$$f(\lambda) = \frac{\gamma\,\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\lambda}$$

for $\lambda \in [\lambda_-, \lambda_+]$, where $\lambda_\pm = (1 \pm 1/\sqrt{\gamma})^2$. Even under an identity population covariance, the sample eigenvalues therefore spread over a wide interval - the smallest eigenvalues of $\hat{S}$ sit well below 1 and the largest well above 1.

Ledoit-Wolf shrinkage. The Ledoit-Wolf estimator regularizes the sample covariance toward a multiple of the identity:

$$\hat{\Sigma}_{\text{LW}} = \alpha\hat{S} + (1-\alpha)\mu I$$

where $\mu = \text{tr}(\hat{S})/p$ (the average eigenvalue) and $\alpha$ is chosen to minimize mean squared error in the Frobenius norm. The result is always PD (since $\mu I \succ 0$ and $\hat{S} \succeq 0$).

Graphical lasso. Assuming the precision matrix (inverse covariance) $\Omega = \Sigma^{-1}$ is sparse (its zeros encode conditional independence), the graphical lasso estimates:

$$\hat{\Omega} = \arg\min_{\Omega \succ 0} \left[\text{tr}(\hat{S}\Omega) - \log\det\Omega + \lambda\|\Omega\|_1\right]$$

where $\|\cdot\|_1$ is the element-wise $\ell_1$ norm. The constraint $\Omega \succ 0$ is the PD cone constraint. The objective $\text{tr}(\hat{S}\Omega) - \log\det\Omega$ is the negative Gaussian log-likelihood (up to constants). It is optimized by block coordinate descent (the glasso algorithm), using the Schur complement for each block update.
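
A sketch of both estimators using scikit-learn (assuming sklearn is available; the data sizes and the regularization strength alpha=0.1 are arbitrary):

import numpy as np
from sklearn.covariance import LedoitWolf, GraphicalLasso

rng = np.random.default_rng(0)
n, p = 40, 60                                   # fewer samples than features
X = rng.standard_normal((n, p))

Xc = X - X.mean(0)
S_hat = Xc.T @ Xc / n
print(np.linalg.eigvalsh(S_hat).min())          # ~0: the sample covariance is singular

lw = LedoitWolf().fit(X)
print(np.linalg.eigvalsh(lw.covariance_).min()) # > 0: shrinkage restores positive definiteness

# Graphical lasso, on a better-conditioned problem (n > p here)
Xs = rng.standard_normal((200, 10))
gl = GraphicalLasso(alpha=0.1).fit(Xs)
print((np.abs(gl.precision_) < 1e-8).mean())    # fraction of (near-)zero entries in the estimated precision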

For LLMs: Large language models trained with structured sparse attention patterns (local + global attention, as in Longformer and BigBird) can be loosely viewed as imposing a sparse, precision-matrix-like dependency structure on the attention distributions.

E.2 Case Study: Gaussian Process Regression, Step by Step

Let's trace through a complete GP regression computation to see all the PD machinery in action.

Setup. Training data: $(x_1, y_1) = (0, 0.5)$, $(x_2, y_2) = (1, 1.2)$, $(x_3, y_3) = (2, 0.8)$. Test point: $x_* = 1.5$. Kernel: $k(x,z) = \exp(-(x-z)^2/2)$ (RBF with length-scale 1). Noise: $\sigma^2 = 0.1$.

Step 1: Compute the kernel matrix.

$$K_{ij} = \exp(-(x_i - x_j)^2/2), \qquad K = \begin{pmatrix}1 & e^{-0.5} & e^{-2} \\ e^{-0.5} & 1 & e^{-0.5} \\ e^{-2} & e^{-0.5} & 1\end{pmatrix} \approx \begin{pmatrix}1 & 0.607 & 0.135 \\ 0.607 & 1 & 0.607 \\ 0.135 & 0.607 & 1\end{pmatrix}.$$

Step 2: Add noise and Cholesky factor.

$$K + \sigma^2 I = K + 0.1I \approx \begin{pmatrix}1.1 & 0.607 & 0.135 \\ 0.607 & 1.1 & 0.607 \\ 0.135 & 0.607 & 1.1\end{pmatrix}.$$

Compute $L = \text{chol}(K + 0.1I)$. (Details in theory.ipynb.)

Step 3: Predictive mean.

$$\mathbf{k}_* = [k(x_*, x_i)]_i = [\exp(-2.25/2),\ \exp(-0.25/2),\ \exp(-0.25/2)] \approx [0.325,\ 0.882,\ 0.882].$$

$$\boldsymbol{\alpha} = (K + 0.1I)^{-1}\mathbf{y} = L^{-\top}L^{-1}\mathbf{y} \quad (\text{via Cholesky solves}), \qquad \mu_* = \mathbf{k}_*^\top\boldsymbol{\alpha}.$$

Step 4: Predictive variance.

$$\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top(K+0.1I)^{-1}\mathbf{k}_* = 1 - \mathbf{v}^\top\mathbf{v}, \quad \mathbf{v} = L^{-1}\mathbf{k}_*.$$

The variance is the Schur complement of $(K + 0.1I)$ evaluated at $x_*$.

Interpretation. The Cholesky factorization appears in every step: solving for $\boldsymbol{\alpha}$, computing the predictive variance, and computing the log-marginal likelihood for hyperparameter optimization. The PD condition on $K + 0.1I$ guarantees $\sigma_*^2 \geq 0$ (the Schur complement is non-negative).
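
The whole case study fits in a few lines of NumPy/SciPy - a sketch reproducing the setup above:

import numpy as np
from scipy.linalg import cho_solve, solve_triangular

X_train = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.2, 0.8])
x_star, sigma2 = 1.5, 0.1

k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)          # RBF kernel, length-scale 1

K = k(X_train[:, None], X_train[None, :])             # 3 x 3 kernel matrix
k_star = k(x_star, X_train)                           # ~[0.325, 0.882, 0.882]

L = np.linalg.cholesky(K + sigma2 * np.eye(3))
alpha = cho_solve((L, True), y)                       # (K + sigma^2 I)^{-1} y
mu_star = k_star @ alpha                              # predictive mean

v = solve_triangular(L, k_star, lower=True)           # v = L^{-1} k_*
var_star = k(x_star, x_star) - v @ v                  # predictive variance (Schur complement)
print(mu_star, var_star)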

E.3 Case Study: The Multivariate Normal in Deep Learning

Parameterizing structured covariances. In deep generative models (VAEs, normalizing flows, diffusion models), we frequently need to parameterize a multivariate Gaussian $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ with $\Sigma$ learnable. The challenge: $\Sigma$ must be PD, and the parameterization must be differentiable.

Approach 1: Diagonal. $\Sigma = \text{diag}(\sigma_1^2,\ldots,\sigma_d^2)$ with $\sigma_i = \text{softplus}(s_i)$. Free parameters: $d$. Correlation structure: none. Used in the standard VAE (Kingma & Welling, 2013).

Approach 2: Lower triangular Cholesky. $\Sigma = LL^\top$ where $L$ is lower triangular:

  • Diagonal: $L_{ii} = \text{softplus}(\tilde{L}_{ii})$ ensures positivity
  • Off-diagonal: $L_{ij}$ for $i > j$ unconstrained (any real number)

Free parameters: $d(d+1)/2$. Full correlation structure. Used in full-covariance VAEs and normalizing flows.

Approach 3: Low-rank + diagonal. $\Sigma = FF^\top + D$ where $F \in \mathbb{R}^{d \times r}$ ($r \ll d$) and $D = \text{diag}(d_1,\ldots,d_d)$ with $d_i > 0$. Free parameters: $dr + d$. Captures $r$ "directions of correlation." Used in structured variational Bayes and mean-field refinements.

The KL divergence between Gaussians. For $q = \mathcal{N}(\boldsymbol{\mu}_q, \Sigma_q)$ and $p = \mathcal{N}(\boldsymbol{\mu}_p, \Sigma_p)$:

$$D_{\text{KL}}(q\,\|\,p) = \frac{1}{2}\left[\text{tr}(\Sigma_p^{-1}\Sigma_q) + (\boldsymbol{\mu}_p-\boldsymbol{\mu}_q)^\top\Sigma_p^{-1}(\boldsymbol{\mu}_p-\boldsymbol{\mu}_q) - d + \log\frac{\det\Sigma_p}{\det\Sigma_q}\right].$$

For $p = \mathcal{N}(\mathbf{0}, I)$ (the standard VAE prior):

$$D_{\text{KL}}(q\,\|\,\mathcal{N}(\mathbf{0},I)) = \frac{1}{2}\left[\text{tr}(\Sigma_q) + \boldsymbol{\mu}_q^\top\boldsymbol{\mu}_q - d - \log\det\Sigma_q\right].$$

For diagonal $\Sigma_q = \text{diag}(\sigma_i^2)$:

$$D_{\text{KL}} = \frac{1}{2}\sum_{i=1}^d \left[\sigma_i^2 + \mu_{q,i}^2 - 1 - \log\sigma_i^2\right].$$

For Cholesky $\Sigma_q = LL^\top$: $\log\det\Sigma_q = 2\sum_i \log L_{ii}$ and $\text{tr}(\Sigma_q) = \|L\|_F^2$.

All these computations - $\log\det\Sigma_q$, $\text{tr}(\Sigma_q)$, $\Sigma_q^{-1}$ - exploit the Cholesky and log-det tools developed in Sections 4 and 7.
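
A sketch of the KL-to-standard-normal term for a Cholesky-parameterized posterior, checked against a Monte Carlo estimate (the specific mu, L, and sample count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
L = np.array([[1.2, 0.0],
              [0.4, 0.8]])                     # Sigma_q = L L^T
d = len(mu)

# Analytical KL( N(mu, L L^T) || N(0, I) ) from Cholesky-friendly quantities
log_det = 2 * np.log(np.diag(L)).sum()
kl_exact = 0.5 * (np.sum(L ** 2) + mu @ mu - d - log_det)

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ] with reparameterized samples
eps = rng.standard_normal((200_000, d))
z = mu + eps @ L.T
log_q = -0.5 * (eps ** 2).sum(1) - 0.5 * log_det - 0.5 * d * np.log(2 * np.pi)
log_p = -0.5 * (z ** 2).sum(1) - 0.5 * d * np.log(2 * np.pi)
print(kl_exact, (log_q - log_p).mean())        # agree up to Monte Carlo noise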

E.4 The Normal Equations and Ridge Regression

Least squares. Given $X \in \mathbb{R}^{n \times p}$ (data) and $\mathbf{y} \in \mathbb{R}^n$ (targets), ordinary least squares (OLS) minimizes:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|X\boldsymbol{\beta} - \mathbf{y}\|^2.$$

The normal equations are $X^\top X \hat{\boldsymbol{\beta}} = X^\top\mathbf{y}$. The matrix $X^\top X \succeq 0$ is always PSD (it is a Gram matrix). If $X$ has full column rank ($n \geq p$, rank $p$), then $X^\top X \succ 0$ and the unique solution is $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y}$.

Ridge regression. Add $\ell_2$ regularization: minimize $\|X\boldsymbol{\beta}-\mathbf{y}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$.

Normal equations: $(X^\top X + \lambda I)\hat{\boldsymbol{\beta}} = X^\top\mathbf{y}$.

The matrix $A = X^\top X + \lambda I \succ 0$ for any $\lambda > 0$ (by Proposition 2.3.7). This is why ridge regression is always numerically stable: the PD matrix $A$ has all positive pivots, and Cholesky never fails.

Condition number improvement. $\kappa(X^\top X + \lambda I) = (\sigma_1^2 + \lambda)/(\sigma_p^2 + \lambda)$, where $\sigma_1 \geq \cdots \geq \sigma_p \geq 0$ are the singular values of $X$. For small $\lambda$, $\kappa \approx \sigma_1^2/\sigma_p^2 = \kappa(X)^2$. As $\lambda \to \infty$, $\kappa \to 1$ (well-conditioned). Choosing $\lambda$ balances bias (regularization) against numerical stability.

Kernel ridge regression. For nonlinear regression with a PD kernel $k$, form the kernel matrix $K = k(\mathbf{X},\mathbf{X})$ and solve $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$. Predictions: $\hat{f}(\mathbf{x}_*) = \mathbf{k}_*^\top\boldsymbol{\alpha}$ where $\mathbf{k}_* = k(\mathbf{x}_*, \mathbf{X})$. This is equivalent to GP regression with kernel $k$ and noise variance $\lambda$. The PD condition on $K + \lambda I$ is exactly the Cholesky solvability condition.
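
A sketch of ridge regression solved through the PD normal equations, plus the kernel version (the synthetic data and lambda are arbitrary):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1e-2
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Ridge: (X^T X + lam I) beta = X^T y  -- always PD, so Cholesky never fails
A = X.T @ X + lam * np.eye(p)
beta = cho_solve(cho_factor(A, lower=True), X.T @ y)

# Kernel ridge with an RBF kernel: (K + lam I) alpha = y
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
alpha = cho_solve(cho_factor(K + lam * np.eye(n), lower=True), y)
y_fit = K @ alpha                                 # in-sample predictions
print(beta[:3], np.abs(y_fit - y).mean())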


Appendix F: Quick Reference

F.1 Four Equivalent Characterizations of $A \succ 0$

POSITIVE DEFINITENESS: FOUR EQUIVALENT TESTS
========================================================================

  Let A in R^{n x n} be symmetric. The following are equivalent:

  DEFINITION   for all x != 0:  x^T A x > 0                      (quadratic form positive)
  SPECTRAL     for all i:       lambda_i(A) > 0                  (all eigenvalues positive)
  SYLVESTER    for all k:       Delta_k = det(A[1:k,1:k]) > 0    (leading minors positive)
  CHOLESKY     unique lower triangular L with positive diagonal: A = L L^T

  COST:
  Definition  O(n) per test vector            <- manual/symbolic
  Spectral    O(n^3) eigendecomposition        <- most informative
  Sylvester   O(n^4) for all n determinants    <- manual/symbolic
  Cholesky    O(n^3/3) factorization           <- fastest in practice

========================================================================

F.2 Key Formulas

| Formula | Notes |
|---|---|
| $A = LL^\top$ | Cholesky factorization (exists iff $A \succ 0$) |
| $A = LDL^\top$ | LDL^T: unit lower triangular times diagonal |
| $A = QP$, $P = (A^\top A)^{1/2}$ | Polar: orthogonal times PD |
| $\log\det A = 2\sum_i\log L_{ii}$ | Log-det via Cholesky |
| $\nabla_A\log\det A = A^{-1}$ | Matrix calculus: gradient of log-det |
| $S = D - CA^{-1}B$ | Schur complement of $A$ in a block matrix |
| $M \succ 0 \Leftrightarrow A \succ 0,\ S \succ 0$ | Schur PD criterion |
| $(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1}$ | Woodbury identity |
| $\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},I)$ | Reparameterization trick |
| $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu},\Sigma)\,\Vert\,\mathcal{N}(\mathbf{0},I)) = \frac{1}{2}[\text{tr}\,\Sigma + \Vert\boldsymbol{\mu}\Vert^2 - d - \log\det\Sigma]$ | VAE KL term |

F.3 Notation Summary

Following the NOTATION_GUIDE.md conventions:

| Symbol | Meaning |
|---|---|
| $A \succ 0$ | $A$ is positive definite |
| $A \succeq 0$ | $A$ is positive semidefinite |
| $A \prec 0$ | $A$ is negative definite |
| $\mathbb{S}^n$ | Space of $n \times n$ symmetric matrices |
| $\mathbb{S}_{++}^n$ | Cone of $n \times n$ PD matrices (interior) |
| $\mathbb{S}_+^n$ | Cone of $n \times n$ PSD matrices |
| $L$ | Cholesky factor (lower triangular, positive diagonal) |
| $A^{1/2}$ | Symmetric PSD square root |
| $A^{-1/2}$ | Inverse PSD square root |
| $\Sigma$ | Covariance matrix ($\succeq 0$) |
| $\Lambda$ | Precision matrix ($= \Sigma^{-1}$, $\succ 0$) |
| $F$ | Fisher information matrix ($\succeq 0$) |
| $S$ | Schur complement |
| $G$ | Gram matrix ($= XX^\top$, $\succeq 0$) |
| $K$ | Kernel matrix ($\succeq 0$) |

F.4 Python Cheatsheet

import numpy as np
import scipy.linalg as la

# === Cholesky factorization ===
L = np.linalg.cholesky(A)          # A = L @ L.T
# Or via scipy (can also compute upper factor):
L = la.cholesky(A, lower=True)

# === LDL^T factorization ===
L_ldl, D_ldl, _ = la.ldl(A)       # A = L @ D @ L.T

# === Log-determinant (numerically stable) ===
L = np.linalg.cholesky(A)
log_det = 2 * np.sum(np.log(np.diag(L)))

# === Solve A x = b via Cholesky ===
L = np.linalg.cholesky(A)
x = la.cho_solve((L, True), b)     # uses LAPACK dpotrs

# === PSD square root ===
eigvals, Q = np.linalg.eigh(A)     # symmetric eigendecomp
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

# === Woodbury: apply (A + U C V)^{-1} to b via the smaller k x k system ===
# (assumes U: n x k, C: k x k, V: k x n are already defined)
Ainv_b = np.linalg.solve(A, b)
Ainv_U = np.linalg.solve(A, U)
x_wb = Ainv_b - Ainv_U @ np.linalg.solve(np.linalg.inv(C) + V @ Ainv_U, V @ Ainv_b)

# === Reparameterization trick ===
L = np.linalg.cholesky(Sigma)      # Sigma = L @ L.T
eps = np.random.randn(n_samples, d)
z = mu + (L @ eps.T).T             # z ~ N(mu, Sigma)

# === Gram matrix ===
G = X @ X.T                        # G = X X^T, PSD

# === Check PD ===
try:
    np.linalg.cholesky(A)
    is_pd = True
except np.linalg.LinAlgError:
    is_pd = False

# === Check PSD ===
eigvals = np.linalg.eigvalsh(A)
is_psd = np.all(eigvals >= -1e-10)
