Positive Definite Matrices: Part 6: Schur Complement to 11. Common Mistakes
6. Schur Complement
The Schur complement is the "matrix analogue of completing the square" for block matrices. It appears everywhere in probability (Gaussian conditioning), linear algebra (block matrix inversion), and optimization (constraint elimination).
6.1 Definition for Block Matrices
Definition 6.1 (Schur Complement). Let
M = [ A  B ; C  D ]
be a block matrix with A ∈ R^{p×p} invertible. The Schur complement of A in M is:
S = D − C A^{-1} B ∈ R^{q×q}.
Similarly, if D∈Rq×q is invertible, the Schur complement of D is A−BD−1C.
Origin: block Gaussian elimination. The Schur complement arises naturally when eliminating the (2,1) block:
M = [ A  B ; C  D ] = [ I  0 ; C A^{-1}  I ] · [ A  B ; 0  D − C A^{-1} B ].
The (2,2) block in the upper triangular factor is exactly S=D−CA−1B.
Determinant formula. A key consequence of the block LU:
det M = det A · det(D − C A^{-1} B) = det A · det S.
Similarly, det M = det D · det(A − B D^{-1} C) when D is invertible.
Block matrix inverse. Using the Schur complement:
M^{-1} = [ A^{-1} + A^{-1} B S^{-1} C A^{-1}   −A^{-1} B S^{-1} ; −S^{-1} C A^{-1}   S^{-1} ]
when both A and S=D−CA−1B are invertible.
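A minimal NumPy sketch checking the determinant formula and the (2,2) block of the inverse on a random symmetric PD matrix (the block sizes and seed are illustrative, not from the text):

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
M = rng.standard_normal((p + q, p + q))
M = M @ M.T + (p + q) * np.eye(p + q)          # symmetric PD, so A is invertible
A, B = M[:p, :p], M[:p, p:]
C, D = M[p:, :p], M[p:, p:]

S = D - C @ np.linalg.solve(A, B)              # Schur complement of A in M

# det M = det A * det S
print(np.allclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(S)))

# the bottom-right block of M^{-1} equals S^{-1}
M_inv = np.linalg.inv(M)
print(np.allclose(M_inv[p:, p:], np.linalg.inv(S)))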
6.2 Schur Complement and Positive Definiteness
The Schur complement provides an elegant characterization of block PD matrices.
Theorem 6.2 (Schur PD Criterion). Let M = [ A  B ; B⊤  D ] be symmetric (so C = B⊤). Then:
M ≻ 0 ⟺ A ≻ 0 and S = D − B⊤ A^{-1} B ≻ 0.
Proof. We use the block Cholesky:
M = [ A  B ; B⊤  D ] = [ I  0 ; B⊤A^{-1}  I ] · [ A  0 ; 0  S ] · [ I  A^{-1}B ; 0  I ].
The middle factor is block diagonal: [ A  0 ; 0  S ]. For any v = (x, y)⊤:
v⊤Mv = (x + A^{-1}By)⊤ A (x + A^{-1}By) + y⊤Sy.
(⇒): If M ≻ 0, taking y = 0 shows A ≻ 0; taking x = −A^{-1}By shows y⊤Sy > 0 for every y ≠ 0, so S ≻ 0.
(⇐): If A ≻ 0 and S ≻ 0, then both terms are non-negative, and at least one is strictly positive whenever (x, y) ≠ (0, 0), so M ≻ 0. □
Corollary. M ⪰ 0 ⇔ A ⪰ 0 and S = D − B⊤A^{-1}B ⪰ 0 (when A is invertible; otherwise use the rank condition).
SCHUR COMPLEMENT AND BLOCK PD
========================================================================
M = ( A    B )   symmetric
    ( B^T  D )

M ≻ 0  <=>  A ≻ 0  AND  S = D - B^T A^-1 B ≻ 0

Intuition: "completing the square" in block form

v^T M v = (x + A^-1 B y)^T A (x + A^-1 B y)  +  y^T S y
          ----------------------------------    ---------
             ≥ 0 (since A ≻ 0)                  ≥ 0 (since S ≻ 0)
========================================================================
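A small sketch comparing the Schur criterion of Theorem 6.2 with a direct eigenvalue test (the test matrix and seed are illustrative):

import numpy as np

def is_pd(X):
    # symmetric PD test via eigenvalues
    return np.all(np.linalg.eigvalsh(X) > 0)

rng = np.random.default_rng(1)
n, p = 6, 4
M = rng.standard_normal((n, n))
M = M @ M.T + n * np.eye(n)                    # symmetric PD test matrix
A, B, D = M[:p, :p], M[:p, p:], M[p:, p:]
S = D - B.T @ np.linalg.solve(A, B)            # Schur complement of A

# Theorem 6.2: M ≻ 0  <=>  A ≻ 0 and S ≻ 0
print(is_pd(M), is_pd(A) and is_pd(S))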
6.3 Matrix Inversion Lemma
The Woodbury matrix identity (also called the matrix inversion lemma or Sherman-Morrison-Woodbury formula) is:
(A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
where A∈Rn×n, C∈Rk×k, U∈Rn×k, V∈Rk×n. This is valuable when k≪n (low-rank update): instead of inverting an n×n matrix, invert a k×k matrix.
Derivation via Schur complement. Consider the block system:
M = [ A  U ; −V  C^{-1} ].
The Schur complement of C^{-1} in M is A − U (C^{-1})^{-1} (−V) = A + UCV; the Schur complement of A is C^{-1} − (−V) A^{-1} U = C^{-1} + V A^{-1} U. Equating the two expressions that the block inverse formula gives for the (1,1) block of M^{-1} yields the Woodbury identity.
Special case (rank-1 update): If U=u, V=v⊤, C=c (scalar):
(A + c·uv⊤)^{-1} = A^{-1} − (c · A^{-1} u v⊤ A^{-1}) / (1 + c · v⊤ A^{-1} u).
This is the Sherman-Morrison formula for rank-1 updates.
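A minimal check of the Sherman-Morrison formula against a direct inverse (random data, illustrative only):

import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # generically invertible A
u, v = rng.standard_normal(n), rng.standard_normal(n)
c = 0.7

A_inv = np.linalg.inv(A)
# Sherman-Morrison: (A + c u v^T)^{-1} = A^{-1} - c A^{-1} u v^T A^{-1} / (1 + c v^T A^{-1} u)
update = c * np.outer(A_inv @ u, v @ A_inv) / (1.0 + c * v @ A_inv @ u)
lhs = np.linalg.inv(A + c * np.outer(u, v))
print(np.allclose(lhs, A_inv - update))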
For AI: The Woodbury identity is used in:
Gaussian process prediction: the posterior requires (K + σ²I)^{-1}, where σ²I is the noise term; for a linear kernel K = XX⊤, Woodbury lets you invert either the n×n or the p×p matrix, whichever is smaller
LoRA / low-rank adaptation: the effective weight W0+BA is a rank-r update; Woodbury allows efficient inversion without materializing the full matrix
Kalman filter update step: (P^{-1} + H⊤R^{-1}H)^{-1} uses Woodbury to avoid inverting large state covariances
6.4 Gaussian Conditioning via Schur Complement
The Schur complement is the algebraic engine behind the conditional distribution of multivariate Gaussians.
Setup. Let [ x_1 ; x_2 ] ∼ N( [ μ_1 ; μ_2 ], [ Σ_11  Σ_12 ; Σ_21  Σ_22 ] ).
Conditional distribution. The conditional x_1 | x_2 = a is Gaussian:
x_1 | x_2 = a ∼ N( μ_1 + Σ_12 Σ_22^{-1} (a − μ_2),  Σ_11 − Σ_12 Σ_22^{-1} Σ_21 ).
The conditional covariance Σ_11 − Σ_12 Σ_22^{-1} Σ_21 is exactly the Schur complement of Σ_22 in the joint covariance.
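A sketch of Gaussian conditioning via the Schur complement (the joint mean, covariance, and conditioning value below are illustrative):

import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
idx1, idx2 = [0], [1, 2]                       # condition x_1 on x_2 = a
a = np.array([1.2, 1.8])

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

cond_mean = mu[idx1] + S12 @ np.linalg.solve(S22, a - mu[idx2])
cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)   # Schur complement of Sigma_22
print(cond_mean, cond_cov)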
7. The Log-Determinant
7.1 Computing Log-Det via Cholesky
For A ≻ 0 with Cholesky factorization A = LL⊤, det A = (∏_i L_ii)², so log det A = 2 ∑_i log L_ii. Since L_ii > 0 (the Cholesky diagonal is positive), log L_ii is well-defined. This is the standard computational formula: factor A = LL⊤, then sum the logs of the diagonal entries.
import numpy as np

# log det A via Cholesky: A = L L^T, so log det A = 2 * sum(log L_ii)
L = np.linalg.cholesky(A)
log_det_A = 2 * np.sum(np.log(np.diag(L)))
This is numerically more stable than np.log(np.linalg.det(A)) for large matrices, because the determinant itself can underflow or overflow even when its logarithm is perfectly representable (np.linalg.slogdet addresses the same issue).
Properties:
logdet(AB) = logdetA + logdetB (for A, B ≻ 0, or more generally whenever both determinants are positive)
logdet(A−1)=−logdetA
logdet(αA)=nlogα+logdetA for scalar α>0
logdet(A)=tr(logA) where logA is the matrix logarithm (eigendecomposition-based)
As A→∂S+n (boundary, a singular matrix): logdetA→−∞
7.2 Log-Det as a Concave Function
Theorem 7.1. The function f:S++n→R defined by f(A)=logdetA is strictly concave on the cone of PD matrices.
Proof. We need to show f(λA+(1−λ)B)≥λf(A)+(1−λ)f(B) for A,B≻0 and λ∈(0,1), with equality iff A=B.
Fix A≻0 and let C=A−1/2BA−1/2≻0. The eigenvalues of C are μ1≥⋯≥μn>0.
f(λA+(1−λ)B)=logdet(λA+(1−λ)B).
Factoring: λA + (1−λ)B = A^{1/2}(λI + (1−λ)C)A^{1/2}, so:
f(λA + (1−λ)B) = log det A + ∑_{i=1}^n log(λ + (1−λ)μ_i).
Since g(t) = log t is strictly concave: log(λ + (1−λ)μ_i) ≥ λ log 1 + (1−λ) log μ_i = (1−λ) log μ_i, with equality iff μ_i = 1. Summing over i and adding log det A gives f(λA + (1−λ)B) ≥ log det A + (1−λ) log det C = λ f(A) + (1−λ) f(B) (using log det C = log det B − log det A), with equality iff every μ_i = 1, i.e. iff B = A. □
Multivariate Gaussian log-likelihood. For x∼N(μ,Σ):
log p(x) = −(n/2) log(2π) − (1/2) log det Σ − (1/2)(x−μ)⊤ Σ^{-1} (x−μ).
The term −(1/2) log det Σ penalizes large (spread-out) distributions. When fitting Σ to data, maximizing the log-likelihood requires differentiating through log det Σ, using ∂ log det Σ / ∂Σ = Σ^{-1}.
Gaussian process marginal likelihood. For GP regression with kernel matrix K and noise σ²:
log p(y) = −(1/2) y⊤(K + σ²I)^{-1} y − (1/2) log det(K + σ²I) − (n/2) log(2π).
Both terms require Cholesky: L = chol(K + σ²I), then log det = 2 ∑ log L_ii and the quadratic form via triangular solves. The gradient with respect to kernel hyperparameters uses ∂ log p / ∂θ = (1/2) tr[(αα⊤ − (K + σ²I)^{-1}) ∂K/∂θ] where α = (K + σ²I)^{-1} y.
Normalizing flows. A normalizing flow defines a bijective mapping f:z↦x where z∼N(0,I). The log-likelihood of data x is:
log p(x) = log p_z(f^{-1}(x)) + log |det J_{f^{-1}}(x)|
where Jf−1 is the Jacobian. Efficiently computing log∣detJ∣ is the central computational challenge of normalizing flows. Architectures like RealNVP (triangular Jacobian, log∣detJ∣=∑log∣Jii∣) and FFJORD (stochastic trace estimator) are designed specifically to make this computation tractable.
Log-det estimators for large matrices. When K is very large (e.g., a kernel matrix for millions of training points), exact Cholesky is intractable. Randomized log-det estimators use the identity log det K = tr(log K) and the stochastic trace estimator tr(log K) ≈ (1/m) ∑_{i=1}^m z_i⊤ (log K) z_i with random probes z_i ∼ N(0, I), computed via Lanczos iterations. This is used in scalable GP libraries (GPyTorch, 2018).
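A toy illustration of the estimator log det K = tr(log K) ≈ (1/m) ∑ z_i⊤(log K)z_i. For clarity the matrix logarithm is formed exactly via an eigendecomposition; scalable implementations replace this with Lanczos matrix-vector products. The matrix, probe count, and seed are illustrative:

import numpy as np

rng = np.random.default_rng(3)
n, m = 50, 200
X = rng.standard_normal((n, n))
K = X @ X.T / n + np.eye(n)                    # PD test matrix

# exact log K via eigendecomposition (illustration only; Lanczos in practice)
w, V = np.linalg.eigh(K)
logK = (V * np.log(w)) @ V.T

Z = rng.standard_normal((n, m))                # Gaussian probe vectors
estimate = np.mean(np.sum(Z * (logK @ Z), axis=0))
exact = 2 * np.sum(np.log(np.diag(np.linalg.cholesky(K))))
print(estimate, exact)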
8. Gram Matrices and Kernel Connections
8.1 The Gram Matrix Construction
Definition 8.1 (Gram Matrix). Let x1,…,xn∈Rd be a collection of vectors. Their Gram matrix is:
G∈Rn×n,Gij=⟨xi,xj⟩=xi⊤xj.
In matrix form: if X=[x1∣⋯∣xn]⊤∈Rn×d (rows are data points), then G=XX⊤.
Theorem 8.2. Every Gram matrix is positive semidefinite. Moreover, G≻0 iff x1,…,xn are linearly independent.
Proof. For any c ∈ R^n, c⊤Gc = ∑_{i,j} c_i c_j x_i⊤x_j = ‖∑_i c_i x_i‖² ≥ 0, so G ⪰ 0. Equality holds iff ∑_i c_i x_i = 0, i.e., iff the vectors are linearly dependent. So G ≻ 0 iff they are linearly independent. □
Corollary.rank(G)=rank(X), the number of linearly independent data vectors.
Examples:
X=In (standard basis): G=In≻0.
n > d: rank(G) ≤ d < n, so G ⪰ 0 but G is not PD (it has a nontrivial null space).
n=d, X invertible: G≻0.
8.2 Every PSD Matrix is a Gram Matrix
Theorem 8.3.G⪰0 if and only if G is the Gram matrix of some set of vectors in some inner product space.
Proof. (⇐): proved above. (⇒): If G⪰0, let L be the Cholesky-like factor: G=LL⊤ (where L may be rectangular, n×r, r=rank(G)). Take xi⊤=L[i,:] (the i-th row of L). Then Gij=L[i,:]⋅L[j,:]⊤=xi⊤xj. □
This is a deep result: the class of PSD matrices and the class of Gram matrices are exactly the same. Any PSD matrix can be "explained" as a matrix of inner products between some set of vectors.
Feature maps. If ϕ:X→Rd is a feature map, then the Gram matrix Gij=ϕ(xi)⊤ϕ(xj) is PSD. Kernel methods replace ϕ(xi)⊤ϕ(xj) with k(xi,xj) directly, avoiding explicit feature computation.
8.3 Kernel Matrices and Mercer's Theorem
A positive definite kernel is a function k:X×X→R such that for every finite set {x1,…,xn}⊂X, the Gram matrix Kij=k(xi,xj) is PSD.
Mercer's Theorem (informal statement). A continuous, symmetric function k:X×X→R is a positive definite kernel (i.e., produces PSD Gram matrices for any finite set) if and only if there exists a Hilbert space H and a feature map ϕ:X→H such that:
k(x,z)=⟨ϕ(x),ϕ(z)⟩H.
This is the mathematical foundation of the kernel trick: instead of explicitly computing ϕ(x) (which may be infinite-dimensional), we evaluate k(x,z) directly.
Forward reference: The full treatment of Mercer's theorem, reproducing kernel Hilbert spaces (RKHS), and the kernel trick appears in Chapter 12: Functional Analysis.
Standard PD kernels:
Linear kernel: k(x,z)=x⊤z (standard Gram matrix)
RBF/Gaussian kernel: k(x,z) = exp(−‖x−z‖² / (2ℓ²))
Polynomial kernel: k(x,z)=(x⊤z+c)d for c≥0, d∈N
Matern kernel: used in GP regression with controllable smoothness
Verifying kernel validity. For a proposed kernel k, the standard test is:
Compute the n×n Gram matrix K for a test set
Check K ⪰ 0 numerically (e.g., np.all(np.linalg.eigvalsh(K) >= -1e-10)); note this is evidence for the chosen test set, not a proof of validity for all inputs
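A sketch of this numerical check for the RBF kernel (the test points, dimension, and length scale are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))               # 30 test points in R^5
ell = 1.5

# RBF Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 ell^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * ell ** 2))

eigvals = np.linalg.eigvalsh(K)
print(np.all(eigvals >= -1e-10))               # PSD up to numerical tolerance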
8.4 Attention Scores as Gram Matrices
The scaled dot-product attention mechanism in transformers computes:
Attention(Q, K, V) = softmax(QK⊤ / √d_k) V
where Q,K∈Rn×dk are query and key matrices.
The score matrix S = QK⊤/√d_k ∈ R^{n×n} is a scaled Gram-like matrix (queries and keys come from different projections, so it is not necessarily symmetric or PSD). In the special case of self-attention with tied weights Q = K, S is proportional to QQ⊤/√d_k ⪰ 0.
Why the 1/√d_k scaling. If Q and K have independent entries from N(0,1), then each product Q_ik K_jk has variance 1 and (QK⊤)_ij = ∑_{k=1}^{d_k} Q_ik K_jk has variance d_k. Dividing by √d_k brings the variance of the scores back to 1, preventing the softmax from saturating into near-one-hot distributions.
Attention as kernel regression. The attention output for a query q is:
output(q) = ∑_j [ exp(q⊤k_j/√d_k) / ∑_{j'} exp(q⊤k_{j'}/√d_k) ] v_j.
This is a Nadaraya-Watson kernel regression with an exponential kernel k(q, k_j) = exp(q⊤k_j/√d_k). The attention weights are the normalized kernel similarities, and the output is a kernel-weighted average of values. The exponential of the dot product is related to the RBF kernel (by the Johnson-Lindenstrauss / random Fourier features perspective used in Performer / FAVOR+).
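A minimal NumPy sketch of scaled dot-product attention written as a kernel-weighted average (the shapes and random data are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n, d_k, d_v = 4, 8, 3
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

scores = Q @ K.T / np.sqrt(d_k)                # scaled Gram-like score matrix
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax = normalized kernel similarities
output = weights @ V                           # kernel-weighted average of the values
print(output.shape)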
9. The PSD Cone and Semidefinite Programming
9.1 The Cone of PSD Matrices
The set of all n×n symmetric positive semidefinite matrices is denoted S+n (or S≥0n). It lives inside the vector space Sn of n×n real symmetric matrices (which has dimension n(n+1)/2).
Theorem 9.1.S+n is a proper convex cone:
Convex: If A,B⪰0 and λ∈[0,1], then λA+(1−λ)B⪰0.
Cone: If A⪰0 and t≥0, then tA⪰0.
Pointed:A⪰0 and A⪯0 implies A=0 (no lines through the origin in the cone).
Closed:S+n is a closed subset of Sn (limits of PSD sequences are PSD).
Full-dimensional: The interior of S+n is S++n (the PD matrices), which is non-empty.
Proof of convexity: For any x: x⊤(λA+(1−λ)B)x=λx⊤Ax+(1−λ)x⊤Bx≥0. □
The boundary. The boundary ∂S+n=S+n∖S++n consists of PSD matrices with at least one zero eigenvalue. These are rank-deficient PSD matrices. The boundary has lower dimension: the set of rank-r PSD matrices is a manifold of dimension rn−r(r−1)/2.
Low-dimensional picture. For n = 2: S^2 is 3-dimensional (coordinates A_11, A_12, A_22). The PSD cone S_+^2 is the set where A_11 ≥ 0, A_22 ≥ 0, and A_11 A_22 ≥ A_12², a rotated "ice cream cone" in 3D whose interior is the set of PD matrices.
Dual cone. The dual of S+n with respect to the Frobenius inner product ⟨A,B⟩F=tr(AB) is:
(S+n)∗={B∈Sn:tr(AB)≥0 for all A⪰0}=S+n.
The PSD cone is self-dual: (S+n)∗=S+n. This is analogous to the non-negative reals being self-dual.
9.2 Semidefinite Programming
Semidefinite programming (SDP) is the optimization of a linear objective over the intersection of the PSD cone with an affine set:
min_{X ∈ S^n} ⟨C, X⟩  subject to  ⟨A_i, X⟩ = b_i, i = 1, …, m,  X ⪰ 0,
where ⟨A, B⟩ = tr(AB). Here C, A_1, …, A_m ∈ S^n and b ∈ R^m are the problem data; X ∈ S^n is the decision variable.
Dual SDP:
max_{y ∈ R^m} b⊤y  subject to  C − ∑_{i=1}^m y_i A_i ⪰ 0.
Duality. Weak duality always holds: primal value≥dual value. Strong duality (primal = dual) holds under Slater's condition: if the primal is strictly feasible (some X≻0 satisfies all constraints), then strong duality holds and the dual optimum is attained.
Relation to other optimization problems. SDP generalizes:
Linear programming (LP): LP is SDP with diagonal matrices C,Ai
SOCP (second-order cone programming): SOCP is a special SDP
Quadratically constrained QP: Many QCQPs can be lifted to SDPs via semidefinite relaxation
Algorithms. Interior-point (barrier) methods solve SDPs in polynomial time, with a per-iteration cost on the order of O(m^{1.5} n³) for an m-constraint, n×n SDP. Standard solvers: SCS, MOSEK; CVXPY is a modelling layer on top of such solvers.
9.3 SDP in Machine Learning
Max-cut relaxation (Goemans-Williamson, 1995). The max-cut problem on a graph G=(V,E) with edge weights wij is NP-hard. The Goemans-Williamson SDP relaxation gives a 0.878-approximation:
max_{X ⪰ 0} (1/2) ∑_{ij} w_ij (1 − X_ij)  subject to  X_ii = 1, i = 1, …, n.
The solution X∗ satisfies X∗=VV⊤ for some unit vectors v1,…,vn; random hyperplane rounding recovers a near-optimal cut.
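A sketch of this relaxation on a small made-up graph, using CVXPY for the SDP and random hyperplane rounding afterwards (assumes cvxpy with an SDP-capable solver such as SCS is installed; the graph and weights are illustrative):

import numpy as np
import cvxpy as cp

# small weighted graph: symmetric weight matrix W (illustrative)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = W.shape[0]

# SDP relaxation: maximize (1/2) sum_ij w_ij (1 - X_ij), X_ii = 1, X PSD
X = cp.Variable((n, n), PSD=True)
objective = cp.Maximize(0.5 * cp.sum(cp.multiply(W, 1 - X)))
prob = cp.Problem(objective, [cp.diag(X) == 1])
prob.solve()

# random hyperplane rounding: X* = V V^T, assign sides by sign(V r)
w_eig, U = np.linalg.eigh(X.value)
V = U @ np.diag(np.sqrt(np.clip(w_eig, 0, None)))
r = np.random.default_rng(6).standard_normal(n)
cut = np.sign(V @ r)
print(prob.value, cut)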
Metric learning. Learning a Mahalanobis distance d_A(x, z) = [(x−z)⊤A(x−z)]^{1/2} requires A ⪰ 0. Methods like ITML (Information-Theoretic Metric Learning) and SDML formulate this as an SDP over PSD matrices with constraints that similar pairs are close and dissimilar pairs are far.
Covariance estimation. In high dimensions (p > n), the sample covariance Σ̂ = (1/n) X⊤X may be singular. Regularized covariance estimation (graphical lasso, precision-matrix estimation) adds a sparsity penalty on the precision matrix Θ = Σ^{-1}:
min_{Θ ≻ 0} [ tr(Σ̂ Θ) − log det Θ + λ ‖Θ‖_1 ]
where Σ̂ is the sample covariance and ‖·‖_1 is the element-wise ℓ1 norm. This is not an SDP in standard form, but the positive definiteness constraint Θ ≻ 0 (equivalently Σ ≻ 0) is the core structural requirement.
Fairness and verification constraints. PSD constraints also appear in some algorithmic-fairness formulations. Certified defenses against adversarial examples (semidefinite relaxations of neural-network verification) are large-scale SDPs; practical verifiers such as α,β-CROWN avoid solving the full SDP, relying on optimized bound propagation and Lagrangian-style relaxations instead.
10. Applications in Machine Learning
10.1 Multivariate Gaussians and Covariance Matrices
The multivariate Gaussian. A random vector x∈Rn has the distribution N(μ,Σ) if:
p(x) = 1 / ( (2π)^{n/2} (det Σ)^{1/2} ) · exp( −(1/2)(x−μ)⊤ Σ^{-1} (x−μ) ).
For this density to be a valid (normalized, integrable) probability distribution, Σ must be symmetric positive definite. The three requirements:
Symmetry:Σ=Σ⊤ (covariance is symmetric by definition)
Positive definiteness: Σ ≻ 0 ensures det Σ > 0 (the normalizing constant is finite) and that Σ^{-1} exists (the exponent is a proper quadratic form)
If Σ⪰0 but singular: The distribution becomes degenerate - supported on an affine subspace, not all of Rn
Parameterizing covariances in neural networks. A neural network that outputs a covariance matrix must parameterize it to be PSD. Standard approaches:
Diagonal:Σ=diag(exp(s)) where s is a learned vector. Automatically PD.
Cholesky lower triangular:Σ=LL⊤ where L has positive diagonal (enforced via softplus on diagonal entries). This is the most expressive parameterization.
Low-rank + diagonal:Σ=FF⊤+D where F∈Rn×k (k≪n) and D diagonal positive. Woodbury allows efficient inversion.
Sampling from N(μ,Σ): Given ϵ∼N(0,I):
x=μ+Lϵ
where L=chol(Σ). This is the fundamental sampling algorithm: the Cholesky factor maps isotropic noise to correlated noise.
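A minimal sketch of Cholesky-based sampling (the mean and covariance below are illustrative):

import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 2.0, 0.5],
                  [0.1, 0.5, 1.5]])

L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
eps = rng.standard_normal((3, 10000))          # isotropic noise, columns are samples
samples = mu[:, None] + L @ eps                # x = mu + L eps

print(np.cov(samples))                         # should be close to Sigma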
10.2 Fisher Information Matrix and Natural Gradient
Definition 10.1 (Fisher Information Matrix). For a statistical model p(x∣θ) with parameter θ∈Rd, the Fisher information matrix is:
F(θ)=Ex∼p(⋅∣θ)[∇θlogp(x∣θ)∇θlogp(x∣θ)⊤].
PSD proof.F is a covariance matrix of the score function ∇θlogp(x∣θ): it is E[ss⊤] where s is a random vector. Any matrix of the form E[ss⊤] is PSD (it is the expected outer product). In fact, F≻0 for regular statistical models (identifiable, full rank).
Natural gradient. Ordinary gradient descent θ←θ−η∇θL uses the Euclidean metric on parameter space. The natural gradient uses the Fisher metric:
∇~L=F(θ)−1∇θL.
The natural gradient is invariant to reparameterization of the model - it measures the steepest descent direction with respect to the KL divergence geometry (information geometry).
K-FAC (Kronecker-Factored Approximate Curvature). Martens & Grosse (2015) approximate the Fisher information matrix for neural networks as a block-diagonal matrix, where each block factorizes as a Kronecker product:
F≈F^=block-diag(A1⊗G1,A2⊗G2,…)
where Al=E[al−1al−1⊤] (input activation covariance) and Gl=E[δlδl⊤] (pre-activation gradient covariance) for layer l. Both Al and Gl are PSD (covariance matrices). The Kronecker product Al⊗Gl is PSD, and its inverse is (Al⊗Gl)−1=Al−1⊗Gl−1, computable cheaply via Cholesky of each factor separately.
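A small check of the Kronecker inverse identity K-FAC relies on, (A ⊗ G)^{-1} = A^{-1} ⊗ G^{-1} (random PD factors of illustrative sizes):

import numpy as np

rng = np.random.default_rng(8)

def random_pd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

A, G = random_pd(3), random_pd(4)
lhs = np.linalg.inv(np.kron(A, G))             # invert the full Kronecker product
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))   # invert the small factors separately
print(np.allclose(lhs, rhs))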
10.3 Gaussian Process Regression
Gaussian process regression is the Bayesian non-parametric regression method that uses PD kernel matrices as the core computational object.
Model. Place a GP prior: f∼GP(0,k) where k:X×X→R is a PD kernel. Observe noisy outputs y=f(X)+ϵ, ϵ∼N(0,σ2I).
Prediction. Write K = k(X, X), K_* = k(X, X_*), and K_** = k(X_*, X_*). The predictive distribution at new points X_* is Gaussian with
mean μ_* = K_*⊤ (K + σ²I)^{-1} y  and  covariance Σ_* = K_** − K_*⊤ (K + σ²I)^{-1} K_*.
The entire GP regression computation is O(n³) via Cholesky - the classic bottleneck for large datasets, motivating sparse GP approximations (inducing points, Nystrom, SVGP).
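A sketch of GP prediction with an RBF kernel using one Cholesky factorization and triangular solves (the data, kernel, length scale, and noise level are illustrative; uses SciPy for the triangular solves):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(A, B, ell=1.0):
    # squared-distance RBF kernel between row sets A and B
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * ell ** 2))

rng = np.random.default_rng(9)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3, 3, 50)[:, None]
sigma2 = 0.01

K = rbf(X, X) + sigma2 * np.eye(len(X))        # K + sigma^2 I
K_s = rbf(X, X_star)
K_ss = rbf(X_star, X_star)

L = cholesky(K, lower=True)                    # K + sigma^2 I = L L^T
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
mean = K_s.T @ alpha                           # K_*^T (K + sigma^2 I)^{-1} y
v = solve_triangular(L, K_s, lower=True)
cov = K_ss - v.T @ v                           # K_** - K_*^T (K + sigma^2 I)^{-1} K_*
print(mean.shape, cov.shape)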
10.4 Hessian and Loss Landscape Sharpness
Second-order characterization of minima. At a critical point ∇L(θ∗)=0 of a smooth loss L:
∇2L(θ∗)≻0: strict local minimum (the loss bowl is strictly convex)
∇2L(θ∗)⪰0: local minimum or saddle (Hessian is PSD)
∇2L(θ∗) indefinite: saddle point
Sharpness. The sharpness of a minimum is λ_max(∇²L(θ∗)), the largest eigenvalue of the Hessian. Flat minima (small sharpness) are widely reported to generalize better than sharp minima: a small perturbation θ∗ + δ with ‖δ‖ ≤ ρ changes the loss by at most (ρ²/2)·λ_max(∇²L) to second order (by Taylor expansion and the PSD bound δ⊤Hδ ≤ λ_max‖δ‖²).
SAM (Sharpness-Aware Minimization). Foret et al. (2021) propose:
min_θ  max_{‖δ‖ ≤ ρ}  L(θ + δ).
The inner max finds the worst-case perturbation (solved approximately as δ∗=ρ∇L/∥∇L∥). SAM is a first-order approximation to minimizing sharpness and is reported to improve generalization on image and language tasks.
Gauss-Newton and PSD Hessian approximations. The true Hessian ∇2L may be indefinite during training (early stages, overparameterized models). The Gauss-Newton matrix G=J⊤J (where J is the Jacobian of the predictions) is always PSD and is often a better preconditioner. K-FAC uses the Gauss-Newton approximation.
10.5 Reparameterization Trick in VAEs
The variational autoencoder (VAE) requires sampling from a distribution qϕ(z∣x)=N(μϕ(x),Σϕ(x)) where μϕ and Σϕ are outputs of an encoder neural network, and differentiating through the sampling process.
The reparameterization trick. Instead of sampling z∼N(μ,Σ) directly (which blocks gradient flow), write:
z=μ+Lϵ,ϵ∼N(0,I)
where L=chol(Σ). Now z is a deterministic function of (μ,L,ϵ), and gradients can flow through μ and L.
Diagonal VAE (standard). Most VAE implementations use diagonal covariance: Σ=diag(exp(s)), so L=diag(exp(s/2)) and z=μ+exp(s/2)⊙ϵ.
Full covariance VAE. For a full covariance Cholesky parameterization, the encoder outputs:
Mean μ∈Rd
Lower triangular L∈Rd×d with positive diagonal (e.g., Lii=softplus(L~ii), off-diagonal unconstrained)
Then Σ=LL⊤ and z=μ+Lϵ. The gradient through L back to the encoder parameters flows via:
∂z/∂L = ∂(Lϵ)/∂L, with ∂z_i/∂L_jk = δ_ij ϵ_k
(the outer-product structure of the sampled noise ϵ with the identity, ϵ ⊗ I_d).
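A minimal sketch of the full-covariance parameterization: raw outputs are mapped to a valid Cholesky factor with a softplus diagonal, then z = μ + Lϵ (the raw outputs here are random stand-ins for a real encoder):

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(10)
d = 4
mu = rng.standard_normal(d)                    # encoder mean output (stand-in)
raw = rng.standard_normal((d, d))              # raw encoder outputs for L (stand-in)

# lower triangular with strictly positive diagonal => Sigma = L L^T is PD
L = np.tril(raw, k=-1) + np.diag(softplus(np.diag(raw)))
Sigma = L @ L.T

eps = rng.standard_normal(d)
z = mu + L @ eps                               # reparameterized sample, differentiable in mu and L
print(np.all(np.linalg.eigvalsh(Sigma) > 0))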
For normalizing flows. More expressive VAE variants use normalizing flows for the posterior qϕ(z∣x). The flow is a sequence of invertible maps; the log-det Jacobian of each map must be computed efficiently. Triangular Jacobians (e.g., masked autoregressive flow / IAF) achieve O(d) log-det computation.
11. Common Mistakes
1. Checking only diagonal entries to test PD.
   Why it's wrong: Positive diagonal is necessary but not sufficient. A = [1 2; 2 1] has positive diagonal but is indefinite (det = −3).
   Fix: Use Sylvester's criterion (all leading principal minors > 0) or attempt Cholesky.

2. Concluding A ≻ 0 from det A > 0 alone.
   Why it's wrong: det > 0 is necessary but not sufficient. [−1 0; 0 −2] has det = 2 > 0 but is negative definite.
   Fix: Check all leading principal minors, not just the full determinant.

3. Assuming A⊤A ≻ 0 for any A.
   Why it's wrong: A⊤A ⪰ 0 always, but A⊤A ≻ 0 iff A has full column rank. If A has a null vector (Av = 0), then v⊤A⊤Av = 0.
   Fix: Verify rank(A) = number of columns.

4. Confusing the Cholesky factor L with the PSD square root A^{1/2}.
   Why it's wrong: L is lower triangular; A^{1/2} is symmetric. Both satisfy "squared = A" but in different senses (LL⊤ = A vs (A^{1/2})² = A). They are equal only when A is diagonal.
   Fix: Use L for solving/sampling; use A^{1/2} for Mahalanobis/whitening.

5. Using np.log(np.linalg.det(A)) for large A.
   Why it's wrong: np.linalg.det can overflow/underflow for large n (it is a product of many numbers).
   Fix: Use Cholesky: 2 * np.sum(np.log(np.diag(np.linalg.cholesky(A)))).

6. Forgetting that Sylvester's criterion does NOT characterize PSD.
   Why it's wrong: Sylvester's criterion requires all leading principal minors > 0, which characterizes PD, not PSD. For PSD you need all principal minors (not just the leading ones) ≥ 0, a much larger set of conditions.
   Fix: For PSD testing, use eigenvalues or attempt a pivoted Cholesky.

7. Inverting the Loewner order incorrectly.
   Why it's wrong: It is tempting to conclude that A ⪰ B implies A^{-1} ⪰ B^{-1}. The correct fact is A ⪰ B ≻ 0 ⇒ B^{-1} ⪰ A^{-1} (inversion reverses the order).
   Fix: Remember that matrix inversion is order-reversing on PD matrices.

8. Adding a fixed jitter ϵI without accounting for the scale of A.
   Why it's wrong: If ‖A‖₂ ≈ 10⁶ and you add ϵ = 10⁻⁶, the relative jitter is 10⁻¹², effectively zero numerically.
   Fix: Set the jitter proportional to the scale: ϵ = δ · tr(A)/n for some small δ.

9. Assuming the kernel matrix stays PD after arbitrary operations.
   Why it's wrong: Not every operation on PD kernels preserves positive definiteness; e.g., the difference k1(x, z) − k2(x, z) of two PD kernels may not be PD.
   Fix: Verify the Gram matrix or use closure properties: sums, products, and exponentials of PD kernels are PD.

10. Computing a Schur complement with singular A.
    Why it's wrong: The Schur complement D − CA^{-1}B requires A invertible. If A is only PSD (rank-deficient), A^{-1} does not exist.
    Fix: Use the Moore-Penrose pseudoinverse: S = D − CA†B, though the PD characterization no longer holds as stated.

11. Treating "positive definite" as a property of arbitrary (non-symmetric) matrices.
    Why it's wrong: PD is defined here for symmetric matrices. The quadratic form x⊤Ax depends only on the symmetric part (A + A⊤)/2, and a non-symmetric A with all-positive eigenvalues can still have an indefinite symmetric part.
    Fix: Always symmetrize: replace A with (A + A⊤)/2 before testing PD.

12. Thinking a PD constraint is automatically satisfied by a neural network's output.
    Why it's wrong: A neural network outputting n(n+1)/2 raw numbers does not automatically produce a valid lower triangular L with positive diagonal.
    Fix: Enforce it: use softplus on the diagonal entries and leave the off-diagonal entries unconstrained. Then Σ = LL⊤ is guaranteed PD.