Math for LLMs

Determinants

Linear Algebra Basics / Determinants

Notes

"A determinant turns an entire linear transformation into one number without throwing away its most important geometry: invertibility, orientation, and volume change."

Overview

Among all the quantities attached to a square matrix, the determinant is the most compressed and the most deceptive. It is only one scalar, but it simultaneously encodes whether a matrix is invertible, whether it preserves or reverses orientation, how it scales area or volume, and how its eigenvalues multiply together. That is why determinants feel both elementary and deep: the formulas look concrete, but the ideas connect linear algebra, multivariable calculus, probability, geometry, and modern machine learning.

At a geometric level, the determinant answers a simple question:

What happens to volume when the linear map $x \mapsto Ax$ acts on space?

If $A$ maps the unit square, unit cube, or unit hypercube to a parallelogram, parallelepiped, or higher-dimensional analogue, the signed volume of that image is exactly $\det(A)$. The absolute value tells you the volume scaling factor. The sign tells you whether the transformation preserves or flips orientation.

At an algebraic level, the same number answers equally fundamental questions:

  • Is the matrix invertible?
  • Are its columns linearly independent?
  • What is the constant term of its characteristic polynomial?
  • What is the product of its eigenvalues?

For machine learning, determinants are not decorative theory. They appear operationally in:

  • normalising flows through $\log|\det J|$
  • multivariate Gaussian likelihoods through $\log\det(\Sigma)$
  • Gaussian process marginal likelihoods through covariance log-determinants
  • information geometry through Fisher-metric volume terms
  • stability analysis through eigenvalue products and Jacobian determinants
  • structured matrix updates through determinant identities such as the matrix determinant lemma

This chapter therefore treats determinants in four intertwined ways:

  • geometric meaning
  • formal definitions
  • efficient computation
  • AI-relevant applications

The goal is not to memorize formulas in isolation. It is to understand why all determinant formulas are really statements about the same object seen from different angles.

Prerequisites

  • Matrix multiplication, transpose, and inverse
  • Systems of linear equations and row reduction
  • Rank, linear dependence, and eigenvalue basics
  • Comfort with basic multivariable calculus notation

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive determinant computation, geometric volume intuition, log-det examples, and AI-motivated demos |
| exercises.ipynb | Guided practice on cofactor expansion, characteristic polynomials, determinant identities, log-determinants, and applications |

Learning Objectives

After completing this chapter, you should be able to:

  • Explain the determinant as signed volume scaling and orientation change
  • Compute determinants using the Leibniz formula, cofactor expansion, and LU-based elimination
  • Use determinant properties correctly under row operations, products, transpose, similarity, and scaling
  • Connect determinants to invertibility, rank, eigenvalues, and characteristic polynomials
  • Derive and use the adjugate identity and Cramer's rule
  • Compute stable log-determinants for SPD matrices and general square matrices
  • Explain why triangular Jacobians make normalising flows tractable
  • Use determinant identities such as the matrix determinant lemma, Sylvester's theorem, and Schur complements
  • Interpret determinant-based quantities in Gaussian models, GPs, DPPs, and information geometry


1. Intuition

1.1 What Is a Determinant?

The determinant is a function

$$\det : \mathbb{R}^{n \times n} \to \mathbb{R}$$

that assigns a single scalar to every square matrix. The remarkable fact is not that such a function exists. The remarkable fact is how much it knows.

From one number, we can tell:

  • whether the matrix is invertible
  • whether its columns are linearly independent
  • how it scales $n$-dimensional volume
  • whether it preserves or flips orientation
  • what the product of its eigenvalues is

So while a matrix has $n^2$ entries, the determinant distills its most global linear effect into one scalar.

The determinant should be thought of as answering this geometric question:

Take a unit box in n dimensions.
Apply the linear map A.

How much does its signed volume change?

If the answer is zero, the transformation crushes space into a lower-dimensional object. If the answer is non-zero, the transformation preserves dimension and therefore remains invertible.

This is why

$$\det(A) = 0 \iff A \text{ is singular}$$

is not an isolated theorem. It is a geometric inevitability.

1.2 The Geometric Picture - Volume and Orientation

In two dimensions, the determinant of

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

is

$$\det(A) = ad - bc.$$

If the columns of $A$ are the vectors

$$u = \begin{pmatrix} a \\ c \end{pmatrix}, \qquad v = \begin{pmatrix} b \\ d \end{pmatrix},$$

then $|\det(A)|$ is exactly the area of the parallelogram spanned by $u$ and $v$.

2D picture

v
^
|      /
|     /
|    /    parallelogram area = |det([u v])|
|   /___
|  /   /
| /   /
|/___/------> u

The sign matters too.

  • $\det(A) > 0$: orientation is preserved
  • $\det(A) < 0$: orientation is reversed
  • $\det(A) = 0$: the two column vectors are parallel, so the parallelogram collapses to a line

In three dimensions, the same idea becomes the signed volume of the parallelepiped spanned by the three columns.

In $n$ dimensions, nothing conceptually changes. The determinant is the signed $n$-dimensional volume scaling factor of the linear map.

This is one of the most important cases where geometric intuition scales cleanly from low dimension to high dimension.
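
As a quick numerical check of this picture, here is a minimal NumPy sketch (illustrative only, not taken from the companion notebooks) comparing $\det$ of a $2 \times 2$ matrix with the signed area spanned by its columns.

```python
import numpy as np

# Columns of A span a parallelogram; det(A) is its signed area.
u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])
A = np.column_stack([u, v])

print(np.linalg.det(A))                            # 5.0: positive, orientation preserved

# Swapping the columns flips orientation but keeps the magnitude.
print(np.linalg.det(np.column_stack([v, u])))      # -5.0

# Parallel columns collapse the parallelogram to a line.
print(np.linalg.det(np.column_stack([u, 2 * u])))  # 0.0 (up to rounding)
```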

1.3 Why Determinants Matter for AI

Determinants are not just a classical linear algebra topic that happens to show up occasionally in machine learning. They sit inside several major AI computations.

Normalising flows

The change-of-variables formula uses the Jacobian determinant:

$$\log p_X(x) = \log p_Z(f^{-1}(x)) + \log \left| \det \frac{\partial f^{-1}}{\partial x} \right|.$$

The entire architecture design of coupling flows, autoregressive flows, Glow-style invertible convolutions, and CNFs is about making this determinant or log-determinant tractable.

Multivariate Gaussian models

For

$$x \sim \mathcal{N}(\mu, \Sigma),$$

the density contains the factor

$$\det(\Sigma)^{-1/2}.$$

This is the normalisation term that makes the density integrate to one. Gaussian processes, Bayesian linear regression, Kalman filtering, and many variational models depend on this.

Spectral structure

Eigenvalues are defined by the equation

$$\det(\lambda I - A) = 0.$$

So the entry point to eigenvalue theory is itself determinant theory.

Optimization and stability

The Hessian determinant appears in second-derivative tests. Jacobian determinants help diagnose local invertibility, singularity, and stability in implicit or dynamical models.

1.4 The Determinant as a Function

The determinant is best understood not just by formulas, but by its defining properties.

It is the unique function satisfying:

  1. Multilinearity in the columns
  2. Alternating behaviour under column swaps
  3. Normalization on the identity

That is:

  • linear in each column separately
  • zero if two columns coincide
  • sign flips when two columns are swapped
  • $\det(I) = 1$

These properties are so strong that they determine the determinant uniquely.

This perspective is powerful because it explains why so many facts about determinants are inevitable:

  • swapping rows changes sign
  • triangular determinants are products of diagonal entries
  • equal or dependent columns force determinant zero
  • elimination operations preserve or track determinant in predictable ways

So instead of thinking "the determinant is a complicated formula," it is better to think:

The determinant is the unique alternating multilinear volume form
normalised to be 1 on the identity basis.

1.5 Historical Timeline

  • Seki Takakazu and Leibniz both developed determinant-like expressions in the late 17th century.
  • Cramer gave the first widely recognised determinant-based explicit rule for solving linear systems.
  • Vandermonde and Laplace developed systematic formulas and expansions.
  • Cauchy established crucial algebraic properties such as multiplicativity.
  • Jacobi connected determinants to calculus through the Jacobian.
  • The modern matrix viewpoint then made determinants part of a broader algebraic theory of linear transformations.
  • In the 20th and 21st centuries, determinants moved into operator theory, probability, random matrix theory, and computational ML through log-determinants and Jacobian-based likelihood models.

Historically, determinants were first used to solve systems. Only later did their true geometric meaning become central. In modern ML, both roles are alive at once.


2. Formal Definitions

2.1 The Leibniz Formula

For a square matrix $A \in \mathbb{R}^{n \times n}$, the determinant can be defined by the Leibniz formula:

$$\det(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^n a_{i,\sigma(i)}.$$

This looks intimidating at first, but the pattern is precise:

  • choose one entry from each row
  • choose one entry from each column
  • multiply them
  • assign a sign based on the parity of the corresponding permutation
  • sum over all permutations

For $n = 2$, there are only two permutations, so we recover

$$\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.$$

For $n = 3$, there are six permutations, producing the usual six-term formula.

The Leibniz formula is exact and conceptually complete, but computationally terrible for large $n$, since it has $n!$ terms.
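
To make the permutation structure concrete, here is a small sketch (illustrative only, and deliberately factorial-cost) that evaluates the Leibniz formula directly and compares it with NumPy's LU-based routine.

```python
import itertools
import numpy as np

def leibniz_det(A):
    """Determinant via the Leibniz formula: sum over all permutations."""
    n = A.shape[0]
    total = 0.0
    for perm in itertools.permutations(range(n)):
        # Parity of the permutation (inversion count) gives the sign of the term.
        inversions = sum(
            1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j]
        )
        prod = 1.0
        for i in range(n):
            prod *= A[i, perm[i]]
        total += (-1) ** inversions * prod
    return total

A = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]])
print(leibniz_det(A), np.linalg.det(A))  # both ~ 25.0
```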

2.2 The Permutation Group and Signs

The sign in the Leibniz formula comes from permutation parity.

The symmetric group $S_n$ contains all permutations of $\{1, \dots, n\}$. Each permutation can be written as a product of transpositions, and its sign is

$$\operatorname{sgn}(\sigma) = (-1)^k,$$

where $k$ is the number of transpositions in such a decomposition.

What matters is not the decomposition itself but its parity. Even permutations always have sign $+1$, odd permutations always have sign $-1$.

For $n = 3$, the six permutations split into:

  • three even permutations
  • three odd permutations

This is why the $3 \times 3$ determinant formula has three positive and three negative terms.

The determinant therefore depends not only on which row-column products are chosen, but on the parity structure of those choices.

2.3 The Axiomatic Definition

The cleanest abstract definition is:

The determinant is the unique function

$$\det : \mathbb{R}^{n \times n} \to \mathbb{R}$$

such that:

  1. Multilinearity
     For each column separately,
     $$\det(\dots, \alpha u + \beta v, \dots) = \alpha \det(\dots, u, \dots) + \beta \det(\dots, v, \dots)$$
  2. Alternating property
     Swapping any two columns changes the sign:
     $$\det(\dots, c_i, \dots, c_j, \dots) = -\det(\dots, c_j, \dots, c_i, \dots)$$
  3. Normalization
     $$\det(I) = 1$$

This definition is mathematically elegant because all the familiar determinant formulas follow from it.

It also makes uniqueness believable: expand every column in the standard basis, apply multilinearity, observe that alternating kills every term with repeated basis vectors, and only permutation terms survive.

2.4 Equivalent Characterisations

The determinant can be described in several equivalent ways:

  • $\det(A) = 0$ iff the columns of $A$ are linearly dependent
  • $\det(A) \neq 0$ iff $A$ is invertible
  • $\det(A)$ is the signed volume scaling of the linear map
  • $\det(A)$ is the product of the eigenvalues, counting algebraic multiplicity
  • for triangular matrices, $\det(A)$ is the product of diagonal entries

These are not unrelated facts. They are different manifestations of the same underlying object.

2.5 Cofactor Definition (Recursive)

Another exact definition is recursive.

Delete row $i$ and column $j$ from $A$ to obtain the minor matrix $M_{ij}$. Its determinant is the minor associated with $(i,j)$. The cofactor is

$$C_{ij} = (-1)^{i+j}\det(M_{ij}).$$

Then expansion along row $i$ gives

$$\det(A) = \sum_{j=1}^n a_{ij} C_{ij},$$

and expansion along column $j$ gives

$$\det(A) = \sum_{i=1}^n a_{ij} C_{ij}.$$

The sign pattern is the familiar checkerboard:

+  -  +  -  ...
-  +  -  +  ...
+  -  +  -  ...
-  +  -  +  ...

This definition is theoretically useful and perfect for symbolic manipulation or small matrices, but again computationally poor for large $n$.
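
A direct recursive implementation of this definition is a useful reference point, even though it is exponential; the following is a hedged sketch intended for small matrices only.

```python
import numpy as np

def cofactor_det(A):
    """Determinant by cofactor expansion along the first row (exponential cost)."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * cofactor_det(minor)
    return total

A = np.array([[1.0, 2.0, 0.0], [3.0, 1.0, 1.0], [0.0, 2.0, 1.0]])
print(cofactor_det(A), np.linalg.det(A))  # both ~ -7.0
```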


3. Computing Determinants

3.1 The 2x2 Determinant

For

A=(abcd),A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},

the determinant is

det(A)=adbc.\det(A)=ad-bc.

This is the simplest nontrivial determinant and already captures all the core geometry:

  • if adbc=0ad-bc=0, the columns are parallel
  • if adbc>0ad-bc>0, orientation is preserved
  • if adbc<0ad-bc<0, orientation is reversed

Geometrically, this is the signed area of the parallelogram spanned by the two columns.

3.2 The 3x3 Determinant - Sarrus' Rule

For a $3 \times 3$ matrix,

$$A = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix},$$

the determinant is

$$\det(A) = aei + bfg + cdh - ceg - bdi - afh.$$

Sarrus' rule is a mnemonic for this formula:

a b c | a b
d e f | d e
g h i | g h

positive diagonals:  aei + bfg + cdh
negative diagonals:  ceg + afh + bdi

Important warning:

Sarrus' rule works only for 3x3 matrices.
It does not generalise.

3.3 Cofactor Expansion - Worked Example (4x4)

For larger symbolic determinants, cofactor expansion is practical only when the matrix has a good row or column with many zeros.

Suppose

$$A = \begin{pmatrix} 1 & 2 & 0 & 1 \\ 0 & 3 & 0 & 2 \\ 0 & 0 & 4 & 1 \\ 0 & 0 & 0 & 5 \end{pmatrix}.$$

This matrix is already upper triangular, so we should not expand at all; we should use the triangular rule. But if we did cofactor-expand along the first column, only one term would survive.

This example teaches the real lesson:

The best determinant method depends on structure.
Zeros are opportunities.
Triangular form is the goal.

3.4 Gaussian Elimination Method (Practical)

For numerical work, the practical determinant algorithm is elimination.

If Gaussian elimination with partial pivoting gives

$$PA = LU,$$

then

$$\det(A) = \det(P^{-1})\det(L)\det(U).$$

Now:

  • $\det(L) = 1$ for unit lower triangular $L$
  • $\det(U) = \prod_i U_{ii}$
  • $\det(P) = (-1)^k$, where $k$ is the number of row swaps

Therefore

$$\det(A) = (-1)^k \prod_i U_{ii}.$$

This reduces determinant computation from factorial cost to cubic cost:

$$O(n!) \quad \longrightarrow \quad O(n^3).$$

This is why every serious determinant routine for moderate or large dense matrices is LU-based.
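
The sketch below (assuming SciPy is available; names and sizes are illustrative) reads the determinant off an LU factorisation with partial pivoting, tracking the sign contributed by the permutation.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# scipy.linalg.lu returns P, L, U with A = P @ L @ U (L unit lower triangular).
P, L, U = lu(A)

# det(P) is +1 or -1 depending on the parity of the row swaps.
sign = np.linalg.det(P)
det_from_lu = sign * np.prod(np.diag(U))

print(det_from_lu, np.linalg.det(A))   # agree up to floating-point error
```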

3.5 Determinant of Triangular Matrices

If $A$ is triangular, then

$$\det(A) = \prod_{i=1}^n a_{ii}.$$

The reason is simple. In the Leibniz formula, any non-identity permutation must pick at least one entry above or below the diagonal where the triangular matrix has a zero. So only the identity permutation survives.

This makes diagonal, upper triangular, lower triangular, and block triangular matrices especially nice.

3.6 Special Formulas

Several determinant formulas recur constantly in applications.

Block triangular

$$\det \begin{pmatrix} A & B \\ 0 & D \end{pmatrix} = \det(A)\det(D).$$

Block diagonal

$$\det(\operatorname{diag}(A_1, \dots, A_k)) = \prod_{j=1}^k \det(A_j).$$

Matrix determinant lemma

For invertible $A$ and vectors $u, v$,

$$\det(A + uv^\top) = (1 + v^\top A^{-1} u)\det(A).$$

Schur complement formula

If $A$ is invertible, then

$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A)\det(D - CA^{-1}B).$$

These formulas matter because they turn large determinants into smaller ones and are central in GP updates, low-rank corrections, block systems, and structured models.
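
These identities are easy to sanity-check numerically. The following minimal sketch (random matrices and sizes are illustrative assumptions) verifies the matrix determinant lemma and the block triangular formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # comfortably invertible
u = rng.standard_normal(n)
v = rng.standard_normal(n)

# Matrix determinant lemma: det(A + u v^T) = (1 + v^T A^{-1} u) det(A).
lhs = np.linalg.det(A + np.outer(u, v))
rhs = (1.0 + v @ np.linalg.solve(A, u)) * np.linalg.det(A)
print(np.isclose(lhs, rhs))   # True

# Block triangular: det([[A, B], [0, D]]) = det(A) det(D).
B = rng.standard_normal((n, n))
D = rng.standard_normal((n, n))
M = np.block([[A, B], [np.zeros((n, n)), D]])
print(np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(D)))  # True
```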


4. Properties of Determinants

4.1 Multiplicativity

The defining algebraic property is

$$\det(AB) = \det(A)\det(B).$$

This is one of the most powerful identities in linear algebra.

Immediate consequences:

$$\det(A^{-1}) = \frac{1}{\det(A)}, \qquad \det(A^k) = \det(A)^k,$$

and for scalar $\alpha$,

$$\det(\alpha A) = \alpha^n \det(A).$$

The last formula is often misremembered. The exponent $n$ appears because scaling the entire matrix by $\alpha$ scales each of the $n$ columns by $\alpha$.

4.2 Transpose Invariance

Determinant is unchanged by transpose:

$$\det(A^\top) = \det(A).$$

This means every row-based statement has a corresponding column-based statement and vice versa.

So:

  • multilinearity holds in the rows as well as the columns
  • swapping two rows also changes sign
  • cofactor expansion works along any row or any column

4.3 Row and Column Operations

Determinants respond to elementary operations in a very controlled way.

Swap two rows

  • determinant changes sign

Scale one row by $\alpha$

  • determinant is multiplied by $\alpha$

Add a multiple of one row to another

  • determinant is unchanged

The last fact is what makes elimination so useful for determinant computation. Row replacement simplifies the matrix without altering the determinant.

The same statements hold for column operations by transpose invariance.

4.4 Determinant of Products of Special Matrices

If $Q$ is orthogonal, then

$$\det(Q) = \pm 1.$$

So orthogonal matrices preserve volume magnitude, though not necessarily orientation.

If $P$ is invertible, then

$$\det(P^{-1}AP) = \det(A).$$

So determinant is similarity-invariant. It depends on the linear map itself, not on the particular basis representation.

This matters conceptually for ML: changing basis in representation space does not change the determinant of the underlying linear operator.

4.5 Linear Dependence Test

The determinant detects full-rank failure:

$$\det(A) = 0 \iff \text{columns of } A \text{ are linearly dependent} \iff \operatorname{rank}(A) < n.$$

This is one reason determinants became historically tied to system solving. A zero determinant means the square system cannot have a unique solution.

But there is also a numerical warning:

det(A) close to 0 does not reliably mean "numerically close to singular."

A tiny determinant may simply come from global scaling or large dimension. Condition number, not determinant magnitude, is the correct practical test for near-singularity.

4.6 Determinant and Eigenvalues

For any square matrix,

$$\det(A) = \prod_{i=1}^n \lambda_i,$$

where the eigenvalues are counted with algebraic multiplicity over $\mathbb{C}$.

This is one of the deepest bridges in the subject:

  • determinant is defined from entries
  • eigenvalues are defined spectrally
  • the product formula connects the two exactly

For symmetric positive definite matrices, all eigenvalues are positive, so the determinant is positive. For orthogonal matrices, the eigenvalues lie on the unit circle, so the determinant has absolute value $1$.

5. The Characteristic Polynomial and Eigenvalues

5.1 Definition of the Characteristic Polynomial

Given a square matrix $A \in \mathbb{R}^{n \times n}$, its characteristic polynomial is

$$p_A(\lambda) = \det(\lambda I - A).$$

This is a degree-$n$ polynomial in the scalar variable $\lambda$. It is monic, meaning the coefficient of $\lambda^n$ is $1$.

Expanding it gives

$$p_A(\lambda) = \lambda^n - \operatorname{tr}(A)\lambda^{n-1} + \cdots + (-1)^n \det(A).$$

Several facts are packed into that one line:

  • the trace is the sum of eigenvalues
  • the determinant is the product of eigenvalues
  • the intermediate coefficients are symmetric polynomials in the eigenvalues

So determinants do not merely help with eigenvalues. They define the polynomial whose roots are the eigenvalues.

5.2 Finding Eigenvalues via the Characteristic Polynomial

A scalar $\lambda$ is an eigenvalue of $A$ exactly when there exists a nonzero vector $v$ such that

$$Av = \lambda v.$$

Rearranging,

$$(A - \lambda I)v = 0.$$

This homogeneous system has a nontrivial solution exactly when $A - \lambda I$ is singular, so

$$\lambda \text{ is an eigenvalue} \iff \det(A - \lambda I) = 0.$$

That determinant equation is called the characteristic equation.

Matrix entries
    ->
det(lambda I - A)
    ->
characteristic polynomial
    ->
roots = eigenvalues

For $2 \times 2$ matrices, this leads to a quadratic. For $3 \times 3$, a cubic. Beyond that, the polynomial remains theoretically central, but direct symbolic root-finding quickly becomes the wrong computational tool.

In practical numerical linear algebra, one does not compute eigenvalues by first expanding the characteristic polynomial. One uses QR-like iterative algorithms. The determinant remains conceptually foundational, even when it is not computationally front-and-center.
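
As a small numerical illustration (not a recommended eigenvalue algorithm), one can compare the roots of the characteristic polynomial with the eigenvalues returned by a standard solver.

```python
import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])

# Coefficients of det(lambda I - A), highest degree first: [1, -tr(A), det(A)] for 2x2.
coeffs = np.poly(A)
print(coeffs)                # [ 1. -7. 10.]

print(np.roots(coeffs))      # roots 5 and 2 (order may differ)
print(np.linalg.eigvals(A))  # same values via a QR-based routine
```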

5.3 The Cayley-Hamilton Theorem

One of the most beautiful consequences of the characteristic polynomial is the Cayley-Hamilton theorem:

Every square matrix satisfies its own characteristic polynomial.

If

$$p_A(\lambda) = \lambda^n + c_{n-1}\lambda^{n-1} + \cdots + c_1\lambda + c_0,$$

then

$$p_A(A) = A^n + c_{n-1}A^{n-1} + \cdots + c_1 A + c_0 I = 0.$$

For a $2 \times 2$ matrix,

$$p_A(\lambda) = \lambda^2 - \operatorname{tr}(A)\lambda + \det(A),$$

so Cayley-Hamilton becomes

$$A^2 - \operatorname{tr}(A)A + \det(A)I = 0.$$

If $\det(A) \neq 0$, this can be rearranged to express the inverse:

$$A^{-1} = \frac{\operatorname{tr}(A)I - A}{\det(A)}.$$

That formula is rarely the best numerical method, but it is conceptually revealing. It says the determinant is not merely a scalar summary. It also enters explicit polynomial identities satisfied by the matrix itself.
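
A direct numerical check of the $2 \times 2$ case (a minimal sketch):

```python
import numpy as np

A = np.array([[4.0, 2.0], [1.0, 3.0]])
tr, det = np.trace(A), np.linalg.det(A)

# Cayley-Hamilton in 2D: A^2 - tr(A) A + det(A) I = 0.
residual = A @ A - tr * A + det * np.eye(2)
print(np.allclose(residual, 0))           # True

# The rearranged inverse formula (conceptual, not the preferred numerical route).
A_inv = (tr * np.eye(2) - A) / det
print(np.allclose(A_inv @ A, np.eye(2)))  # True
```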

5.4 Characteristic Polynomial Examples

Some standard examples are worth memorising because they calibrate intuition.

Identity matrix

$$p_I(\lambda) = \det(\lambda I - I) = (\lambda - 1)^n.$$

All eigenvalues are $1$, so $\det(I) = 1$.

Zero matrix

$$p_0(\lambda) = \lambda^n.$$

All eigenvalues are $0$, so $\det(0) = 0$.

Projection matrix

If $P^2 = P$ and $\operatorname{rank}(P) = r$, the eigenvalues are $0$ and $1$, so

$$p_P(\lambda) = \lambda^{n-r}(\lambda - 1)^r$$

and

$$\det(P) = 0$$

unless $P = I$.

Rotation matrix in 2D

For

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},$$

the determinant is

$$\cos^2\theta + \sin^2\theta = 1.$$

So rotations preserve area and orientation.

5.5 Resolvent and Green's Function

The resolvent of a matrix is

$$R(\lambda) = (\lambda I - A)^{-1},$$

defined whenever $\lambda I - A$ is invertible.

That means

$$R(\lambda) \text{ exists} \iff \det(\lambda I - A) \neq 0 \iff \lambda \text{ is not an eigenvalue of } A.$$

So the determinant detects exactly where the resolvent breaks down.

This matters in spectral analysis because the poles of the resolvent occur at eigenvalues. In PDEs and operator theory, resolvents lead to Green's functions. In matrix analysis, they give a clean way to think about spectral separation: if $\lambda$ is close to an eigenvalue, the resolvent norm tends to become large.

In machine learning, this viewpoint appears indirectly in:

  • stability of recurrent and iterative models
  • spectral filtering methods
  • graph diffusion operators
  • continuous-time linear systems

The determinant is the scalar object that tells you when the resolvent is allowed to exist.

Scope of this section: Section 5 covers the determinantal side of eigenvalue theory - how the characteristic polynomial is defined, why its roots are eigenvalues, and what Cayley-Hamilton says about matrix polynomials. The full eigenvalue story (computation, geometric multiplicity, diagonalization, spectral theorem, Jordan form, applications in gradient dynamics and transformers) is the canonical subject of the next chapter.

-> Full treatment: Eigenvalues and Eigenvectors

6. Cofactor Matrix and Adjugate

6.1 The Cofactor Matrix

For each entry $a_{ij}$ of an $n \times n$ matrix $A$, delete row $i$ and column $j$. The determinant of what remains is the minor $M_{ij}$.

The corresponding cofactor is

$$C_{ij} = (-1)^{i+j} M_{ij}.$$

The alternating sign pattern is the familiar checkerboard:

+  -  +  -  ...
-  +  -  +  ...
+  -  +  -  ...
-  +  -  +  ...

The matrix of these cofactors is the cofactor matrix.

Why does this matter? Because cofactors package every cofactor expansion at once. They are not just bookkeeping devices. They are the entries of the gradient of the determinant and the building blocks of the inverse formula.

6.2 The Adjugate (Classical Adjoint)

The adjugate of $A$, written $\operatorname{adj}(A)$, is the transpose of the cofactor matrix:

$$\operatorname{adj}(A)_{ij} = C_{ji}.$$

Its key identity is

$$A\operatorname{adj}(A) = \operatorname{adj}(A)A = \det(A)I.$$

This is one of the most important identities in the chapter.

Why does it work?

  • on the diagonal, you recover the cofactor expansion of the determinant
  • off the diagonal, you get the determinant of a matrix with two equal rows, which vanishes

So when $\det(A) \neq 0$,

$$A^{-1} = \frac{\operatorname{adj}(A)}{\det(A)}.$$

For a $2 \times 2$ matrix,

$$\operatorname{adj} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$

which reproduces the standard inverse formula.

6.3 Cramer's Rule

Suppose $Ax = b$ and $\det(A) \neq 0$. Let $A_i$ be the matrix obtained by replacing column $i$ of $A$ with the right-hand side vector $b$. Then

$$x_i = \frac{\det(A_i)}{\det(A)}.$$

This is Cramer's rule.

Its computational value is low for large systems, but its theoretical value is high:

  • it gives an explicit formula for each coordinate of the solution
  • it proves uniqueness immediately when $\det(A) \neq 0$
  • it shows solutions depend rationally on the data

In modern numerical work, Cramer's rule is almost never used for solving systems. LU or QR is the right tool. But Cramer's rule remains important in theory, symbolic algebra, and derivations involving parameter dependence.

6.4 Derivative of the Determinant

A major reason determinants matter in machine learning is that they differentiate cleanly.

For each entry,

$$\frac{\partial \det(A)}{\partial A_{ij}} = C_{ij}.$$

In matrix form,

$$\nabla_A \det(A) = \operatorname{adj}(A)^T.$$

If $A$ is invertible, using $\operatorname{adj}(A) = \det(A)A^{-1}$ gives

$$\nabla_A \det(A) = \det(A)A^{-T}.$$

The log-determinant is even cleaner:

$$\nabla_A \log|\det(A)| = A^{-T}.$$

This formula appears constantly in:

  • normalising flow training
  • Gaussian process hyperparameter optimisation
  • covariance learning
  • information geometry

There is also a useful scalar derivative identity with respect to parameters:

$$\frac{d}{d\theta}\log\det A(\theta) = \operatorname{tr}\!\left(A(\theta)^{-1}\frac{dA(\theta)}{d\theta}\right),$$

assuming $A(\theta)$ stays invertible.

This converts a difficult-looking derivative of a determinant into a trace of a matrix product, which is much easier to manipulate analytically and computationally.
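
The identity $\nabla_A \log|\det(A)| = A^{-T}$ can be checked with finite differences; the sketch below is illustrative and uses a small, well-conditioned matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)

def logabsdet(M):
    # Stable log|det(M)| via sign-and-log-determinant.
    return np.linalg.slogdet(M)[1]

# Central finite-difference gradient of log|det(A)| with respect to each entry.
eps = 1e-6
grad_fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_fd[i, j] = (logabsdet(A + E) - logabsdet(A - E)) / (2 * eps)

print(np.allclose(grad_fd, np.linalg.inv(A).T, atol=1e-5))  # True
```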

7. Determinants and Geometric Transformations

7.1 Area and Volume

The cleanest geometric interpretation of the determinant is volume scaling.

If the columns of $A$ are the vectors $v_1, \dots, v_n$, then

$$|\det(A)|$$

is the volume of the parallelepiped spanned by those columns.

In 2D:
columns -> parallelogram
|det|   -> area

In 3D:
columns -> parallelepiped
|det|   -> volume

In nD:
columns -> n-dimensional parallelotope
|det|   -> n-volume

This interpretation is not just intuition. It is the reason the determinant appears in the change-of-variables theorem from multivariable calculus.

7.2 Orientation

Absolute value gives size change. The sign gives orientation.

Two ordered bases of $\mathbb{R}^n$ have the same orientation if the change-of-basis matrix between them has positive determinant, and opposite orientation if the determinant is negative.

So:

  • $\det(A) > 0$: orientation preserved
  • $\det(A) < 0$: orientation reversed
  • $\det(A) = 0$: orientation is no longer meaningful because the map collapses dimension

det > 0   preserve handedness
det < 0   flip handedness
det = 0   flatten space

This is why reflections have determinant $-1$ while rotations have determinant $+1$.

7.3 Specific Transformations and Their Determinants

Some standard transformations are worth learning as templates.

Uniform scaling

For $\alpha I$ in $\mathbb{R}^{n \times n}$,

$$\det(\alpha I) = \alpha^n.$$

Scaling each axis by $\alpha$ scales volume by $\alpha^n$.

Rotation

Any proper rotation has determinant $+1$. It preserves both volume and orientation.

Reflection

A reflection has determinant $-1$. It preserves volume magnitude but flips orientation.

Shear

A shear matrix typically has determinant $1$. It distorts shape but preserves volume.

Projection

A nontrivial projection has determinant $0$, since it collapses at least one dimension.

These examples are operationally useful because they let you interpret determinants before calculating them.

7.4 The Cross Product via Determinants

In $\mathbb{R}^3$, the cross product can be written formally as a determinant:

$$u \times v = \begin{vmatrix} e_1 & e_2 & e_3 \\ u_1 & u_2 & u_3 \\ v_1 & v_2 & v_3 \end{vmatrix}.$$

Expanding along the first row yields

$$u \times v = (u_2 v_3 - u_3 v_2)e_1 - (u_1 v_3 - u_3 v_1)e_2 + (u_1 v_2 - u_2 v_1)e_3.$$

The magnitude satisfies

$$\|u \times v\| = \|u\|\,\|v\|\sin\theta,$$

which is the area of the parallelogram spanned by $u$ and $v$.

So determinant structure is hiding inside the cross product too. In three dimensions, oriented area and determinant algebra become the same story told in two different languages.

7.5 Gram Determinant

If $v_1, \dots, v_k \in \mathbb{R}^n$, their Gram matrix is

$$G_{ij} = \langle v_i, v_j \rangle.$$

If $V = [v_1 \ \cdots \ v_k]$, then

$$G = V^T V.$$

The determinant of $G$ satisfies:

  • $\det(G) \geq 0$
  • $\det(G) = 0$ iff the vectors are linearly dependent
  • $\sqrt{\det(G)}$ is the $k$-dimensional volume of the parallelepiped spanned by the vectors

This is a subtle but important extension:

  • ordinary determinant measures volume when the spanning vectors live in the same dimension as the space
  • Gram determinant measures the intrinsic volume of $k$ vectors inside a possibly larger ambient space

That distinction matters in high-dimensional ML, where one often studies a small set of vectors inside a very large representation space.

8. Determinants in Special Matrix Classes

8.1 Diagonal and Triangular Matrices

For diagonal and triangular matrices, determinant computation collapses to the simplest possible formula:

$$\det(A) = \prod_{i=1}^n a_{ii}.$$

For diagonal matrices this is obvious from the Leibniz formula: only the identity permutation contributes.

For triangular matrices the same reasoning applies. Any non-identity permutation must select at least one off-diagonal zero, so every non-identity term vanishes.

This fact is why LU factorisation is so powerful. Once a matrix has been reduced to triangular form, determinant computation becomes just a signed product of pivots.

A --elimination--> U

det(A) = sign_from_swaps * product(diagonal of U)

8.2 Orthogonal Matrices

If $Q$ is orthogonal, then

$$Q^T Q = I.$$

Taking determinants gives

$$\det(Q)^2 = \det(I) = 1,$$

so

$$\det(Q) \in \{+1, -1\}.$$

This means orthogonal matrices preserve volume magnitude exactly.

  • $\det(Q) = +1$: rotation-type behaviour
  • $\det(Q) = -1$: reflection-type behaviour

This is one reason orthogonal initialisation is so useful in deep learning. A matrix with singular values near $1$ avoids exploding or vanishing signal magnitude, and the determinant provides the most global version of that statement: no overall volume collapse or explosion occurs when $|\det(Q)| = 1$.

8.3 Symmetric Positive Definite Matrices

If $A$ is symmetric positive definite (SPD), then all eigenvalues are positive, so

$$\det(A) = \prod_{i=1}^n \lambda_i > 0.$$

This ensures the log-determinant is real:

$$\log\det(A) = \sum_{i=1}^n \log\lambda_i.$$

If $A = LL^T$ is the Cholesky factorisation, then

$$\det(A) = \det(L)^2 = \left(\prod_{i=1}^n L_{ii}\right)^2$$

and therefore

$$\log\det(A) = 2\sum_{i=1}^n \log L_{ii}.$$

This identity is central in:

  • Gaussian likelihoods
  • Gaussian process marginal likelihoods
  • kernel methods
  • covariance estimation

It is numerically far better than computing the determinant directly and then taking a logarithm.
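
A minimal sketch of the Cholesky route for an SPD matrix, assuming NumPy only and an arbitrarily constructed covariance-like matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
Sigma = X.T @ X / 200 + 1e-3 * np.eye(5)    # SPD covariance-like matrix

L = np.linalg.cholesky(Sigma)               # Sigma = L L^T, L lower triangular
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

sign, logdet_ref = np.linalg.slogdet(Sigma)
print(np.isclose(logdet_chol, logdet_ref))  # True
```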

8.4 Vandermonde Matrix

The Vandermonde matrix associated with numbers $x_1, \dots, x_n$ is

$$V = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \\ x_1^2 & x_2^2 & \cdots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{n-1} & x_2^{n-1} & \cdots & x_n^{n-1} \end{pmatrix}.$$

Its determinant is

$$\det(V) = \prod_{1 \leq i < j \leq n} (x_j - x_i).$$

So:

  • it is zero exactly when two nodes coincide
  • it is nonzero exactly when polynomial interpolation on distinct nodes is unique

This is one of the great closed-form determinant formulas in classical linear algebra.

8.5 Circulant Matrices

A circulant matrix is determined entirely by its first row, and each later row is a cyclic shift of the previous one.

These matrices are diagonalised by the discrete Fourier transform (DFT) matrix, so their eigenvalues are given by the Fourier transform of the first row. Therefore

$$\det(C) = \prod_{k=1}^n \lambda_k,$$

where those $\lambda_k$ are Fourier-domain quantities.

This is a useful example because it shows how structure converts determinant computation from generic $O(n^3)$ work into something closer to FFT cost.

In ML, circulant and convolution-like structure appears in:

  • convolutional kernels
  • FFT-based linear layers
  • structured state-space models
  • fast kernel methods

8.6 Tridiagonal Matrices

For a tridiagonal matrix, determinants satisfy a recurrence relation rather than requiring full elimination.

If the diagonal entries are $a_i$, upper diagonal entries $b_i$, and lower diagonal entries $c_i$, then the determinant $d_n$ of the leading $n \times n$ block satisfies

$$d_n = a_n d_{n-1} - b_{n-1} c_{n-1} d_{n-2},$$

with appropriate initial conditions.

This reduces the cost from cubic to linear time for that special structure.

That matters in PDE discretisations, Kalman-style banded systems, and any structured model where nearest-neighbour interactions dominate.
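
The recurrence translates directly into a linear-time routine; this is a hedged sketch assuming 0-indexed arrays for the three diagonals.

```python
import numpy as np

def tridiag_det(a, b, c):
    """Determinant of a tridiagonal matrix with main diagonal a,
    superdiagonal b, and subdiagonal c (len(b) == len(c) == len(a) - 1)."""
    d_prev2, d_prev1 = 1.0, a[0]            # d_0 = 1, d_1 = a_1
    for k in range(1, len(a)):
        d_k = a[k] * d_prev1 - b[k - 1] * c[k - 1] * d_prev2
        d_prev2, d_prev1 = d_prev1, d_k
    return d_prev1

a = np.array([2.0, 2.0, 2.0, 2.0])
b = np.array([-1.0, -1.0, -1.0])
c = np.array([-1.0, -1.0, -1.0])

T = np.diag(a) + np.diag(b, 1) + np.diag(c, -1)
print(tridiag_det(a, b, c), np.linalg.det(T))   # both 5.0
```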

9. Determinantal Identities

9.1 The Matrix Determinant Lemma

For invertible $A$ and vectors $u, v$,

$$\det(A + uv^T) = \left(1 + v^T A^{-1} u\right)\det(A).$$

This is the matrix determinant lemma.

It says a rank-1 perturbation of a matrix changes the determinant by a scalar correction factor rather than requiring a full recomputation.

That is already useful on its own, but the deeper lesson is structural:

full n x n determinant
    +
low-rank update
        ->
small correction problem

For low-rank updates $UV^T$ with $U, V \in \mathbb{R}^{n \times k}$, the identity generalises to

$$\det(A + UV^T) = \det(A)\det(I_k + V^T A^{-1} U).$$

Now an $n \times n$ determinant becomes a $k \times k$ determinant, which is a massive computational win when $k \ll n$.

9.2 Sylvester's Determinant Theorem

If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, then

$$\det(I_m + AB) = \det(I_n + BA).$$

The matrices on the two sides do not even have the same size, yet the determinants agree.

This is a profoundly useful identity because it allows you to move the determinant to the smaller side.

If $m \ll n$, compute the left side. If $n \ll m$, compute the right side.

In ML this matters whenever a covariance, Hessian approximation, or low-rank adapter can be written as "identity plus low-rank product".
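
Numerically, the identity is easy to confirm, and the size asymmetry is the whole point; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 500                                 # m << n: the left side is the cheap one
A = rng.standard_normal((m, n)) / np.sqrt(n)  # scaled to keep entries O(1)
B = rng.standard_normal((n, m)) / np.sqrt(n)

small = np.linalg.det(np.eye(m) + A @ B)      # 3 x 3 determinant
large = np.linalg.det(np.eye(n) + B @ A)      # 500 x 500 determinant

print(np.isclose(small, large))               # True, up to floating-point error
```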

9.3 Weinstein-Aronszajn Identity

Closely related identities let us factor determinant changes under perturbation:

$$\det(A - B) = \det(A)\det(I - A^{-1}B),$$

whenever $A$ is invertible.

This is conceptually the same move:

  • pull out the large, known matrix
  • reduce the new determinant to a perturbation around identity

The identity is especially useful when $A^{-1}B$ is low-rank, small in norm, or has special structure.

9.4 Schur Complement and Block Determinants

For a block matrix

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$

if $A$ is invertible, then

$$\det(M) = \det(A)\det(D - CA^{-1}B).$$

The matrix

$$D - CA^{-1}B$$

is the Schur complement of $A$ in $M$.

This identity is everywhere in applied mathematics because it converts a large determinant into:

  • determinant of a block
  • determinant of a smaller corrected block

It underlies block Gaussian elimination, conditional Gaussians, saddle-point systems, and many structured probabilistic models.

9.5 Cauchy-Binet Formula

The Cauchy-Binet formula generalises $\det(AB) = \det(A)\det(B)$ to rectangular matrices.

If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$ with $m \leq n$, then

$$\det(AB) = \sum_{\substack{S \subseteq \{1, \dots, n\} \\ |S| = m}} \det(A_{:,S})\det(B_{S,:}).$$

This looks technical, but its meaning is geometric: the determinant of the composed map can be decomposed into contributions from all $m$-dimensional coordinate selections.

It appears naturally in:

  • volume identities
  • exterior algebra
  • determinantal point process theory
  • low-rank approximation arguments

10. Log-Determinants in Machine Learning

10.1 Why Log-Determinant?

Determinants grow or shrink exponentially in dimension. That makes raw determinant values numerically fragile.

For example,

$$\det(2I_{1000}) = 2^{1000}, \qquad \det(0.5\,I_{1000}) = 0.5^{1000}.$$

One overflows; the other underflows.

The log-determinant fixes this:

$$\log|\det(2I_{1000})| = 1000\log 2,$$

which is perfectly manageable.

This is why modern probabilistic ML almost always uses $\log\det$ rather than $\det$ directly.

There is also an optimization reason:

$$\nabla_A \log|\det(A)| = A^{-T}$$

is much cleaner than differentiating the determinant itself.

10.2 Normalising Flows

Normalising flows define an invertible map

$$x = f(z)$$

that transforms a simple base distribution into a more complex one.

The change-of-variables formula says

$$\log p_X(x) = \log p_Z(z) - \log\left|\det\frac{\partial f}{\partial z}\right|.$$

So every flow model lives or dies by the cost of computing

$$\log|\det J_f(z)|.$$

This is not an implementation detail. It is the central architectural constraint.

If the Jacobian is dense and unstructured, the cost is generically cubic in dimension. That is too expensive for large models. Therefore flow architectures are designed so the Jacobian is:

  • triangular
  • block triangular
  • diagonal plus structured corrections
  • tractable via traces in continuous-time settings

10.3 Architectures Enabling Efficient Log-Det

There are several standard design patterns.

Autoregressive flows

Each output depends only on earlier inputs, so the Jacobian is triangular. For triangular matrices,

$$\log|\det J| = \sum_i \log|J_{ii}|.$$

This turns an $O(n^3)$ problem into an $O(n)$ one.

Coupling layers

Part of the input is copied, while the rest is scaled and shifted using functions of the copied part. The Jacobian becomes block triangular, so again the log-determinant is just a sum over easy diagonal terms.

Invertible 1x1 convolutions

Glow-style models use learned invertible channel mixing. If $W$ is parameterised with LU structure, then

$$\log|\det W|$$

can be computed from the diagonal of the triangular factor.

Continuous normalising flows

Instead of computing a full determinant of a Jacobian, one uses the instantaneous identity

$$\frac{d}{dt}\log p(z(t)) = -\operatorname{tr}\left(\frac{\partial v}{\partial z}\right),$$

and estimates traces stochastically.

The pattern is always the same:

generic Jacobian -> too expensive
structured Jacobian -> cheap log-det
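
To make the triangular-Jacobian point concrete, here is a minimal affine coupling layer sketch; the scale and shift functions below are hypothetical stand-ins for the small neural networks a real flow would use.

```python
import numpy as np

def coupling_forward(z, scale_fn, shift_fn, d):
    """Affine coupling: copy z[:d], transform z[d:] elementwise.
    The Jacobian is block triangular with diagonal [1, ..., 1, exp(s)]."""
    z1, z2 = z[:d], z[d:]
    s, t = scale_fn(z1), shift_fn(z1)      # depend only on the copied part
    x = np.concatenate([z1, z2 * np.exp(s) + t])
    log_det_J = np.sum(s)                  # sum of logs of the diagonal entries
    return x, log_det_J

# Stand-in "networks": any smooth functions of z1 illustrate the Jacobian argument.
scale_fn = lambda z1: np.tanh(z1)          # hypothetical scale network
shift_fn = lambda z1: 0.5 * z1             # hypothetical shift network

z = np.array([0.3, -1.2, 0.7, 2.0])
x, log_det_J = coupling_forward(z, scale_fn, shift_fn, d=2)
print(x, log_det_J)
```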

10.4 Multivariate Gaussian Log-Likelihood

For

$$x \sim \mathcal{N}(\mu, \Sigma),$$

the log-density is

$$\log p(x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det(\Sigma) - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu).$$

The determinant term is the normalization factor. Geometrically, it measures how spread out the Gaussian ellipsoid is.

Large determinant:

  • covariance ellipsoid has large volume
  • density is more diffuse

Small determinant:

  • covariance ellipsoid is narrow or nearly degenerate
  • density is more concentrated

For SPD covariance matrices, Cholesky gives

$$\log\det(\Sigma) = 2\sum_i \log L_{ii}.$$

That is the standard numerically stable implementation.
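
A hedged sketch of that Cholesky-based evaluation (scipy.stats.multivariate_normal is used only as a cross-check; the covariance values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_logpdf(x, mu, Sigma):
    """Multivariate normal log-density via a Cholesky factor of Sigma."""
    n = len(x)
    L = np.linalg.cholesky(Sigma)
    # Solve L alpha = (x - mu) instead of forming Sigma^{-1} explicitly.
    alpha = np.linalg.solve(L, x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2 * np.pi) + logdet + alpha @ alpha)

mu = np.zeros(3)
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
x = np.array([0.2, -0.4, 1.0])

print(gaussian_logpdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).logpdf(x))   # should match
```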

10.5 Gaussian Process Marginal Likelihood

Gaussian processes require the log marginal likelihood

$$\log p(y \mid X, \theta) = -\frac{1}{2}y^T K_\theta^{-1} y - \frac{1}{2}\log\det(K_\theta) - \frac{n}{2}\log(2\pi),$$

where $K_\theta$ is the kernel matrix plus observation noise.

This creates two hard matrix tasks:

  • solve a linear system with $K_\theta$
  • compute $\log\det(K_\theta)$

For exact GP inference, Cholesky is the classical answer. For large-scale approximate GP methods, stochastic trace and Lanczos-style log-det estimators become essential.

This is one of the cleanest places in modern ML where determinant theory, numerical linear algebra, and probabilistic modelling meet directly.

10.6 Information-Theoretic Role of Log-Det

For a Gaussian random vector,

$$H(X) = \frac{1}{2}\log\det(2\pi e\,\Sigma).$$

So differential entropy is directly controlled by the log-determinant of the covariance.

This gives log-det a genuine information-theoretic meaning:

  • larger log-det -> more spread -> larger entropy
  • smaller log-det -> less spread -> lower entropy

Related quantities also appear in:

  • mutual information formulas for Gaussians
  • Bayesian experimental design
  • feature diversity regularisation
  • Fisher information geometry

So when ML objectives contain a log-determinant, they are often measuring some combination of volume, uncertainty, diversity, or information content.

11. Determinants in Advanced Topics

11.1 Jacobian Determinant in Calculus

For a differentiable map

$$f : \mathbb{R}^n \to \mathbb{R}^n,$$

the Jacobian matrix is

$$J_f(x) = \left[\frac{\partial f_i}{\partial x_j}\right].$$

Its determinant measures the local volume scaling of the map near the point $x$.

  • $|\det J_f(x)| > 1$: local expansion
  • $|\det J_f(x)| < 1$: local contraction
  • $\det J_f(x) = 0$: local singularity

The inverse function theorem says:

$$\det J_f(x) \neq 0 \implies f \text{ is locally invertible near } x.$$

That theorem is the nonlinear analogue of the fact that a square matrix is invertible exactly when its determinant is nonzero.

11.2 Functional Determinants

In infinite-dimensional analysis, determinants generalise to operators.

One important example is the Fredholm determinant, written formally as

$$\det(I + K) = \prod_i (1 + \lambda_i),$$

for suitable trace-class operators $K$.

This idea appears in:

  • PDE and operator theory
  • statistical physics
  • quantum field theory
  • continuous-time probabilistic models

In ML, the finite-dimensional determinant story survives in approximate form through Jacobian traces, spectral sums, and operator-inspired kernels.

11.3 Determinantal Point Processes

A determinantal point process (DPP) is a probability distribution over subsets where

$$P(Y = S) \propto \det(K_S)$$

for a positive semidefinite kernel matrix $K$ and principal submatrix $K_S$.

Why determinant? Because $\det(K_S)$ measures the volume spanned by the feature embeddings of the selected items. Large determinant means the selected items are both high-quality and diverse.

This creates repulsion:

  • redundant items have similar feature vectors
  • similar vectors reduce the determinant
  • diverse sets get higher probability

That makes DPPs natural for:

  • diverse retrieval
  • extractive summarisation
  • representative subset selection
  • active learning

11.4 Random Matrix Theory and Determinants

Random matrix theory studies eigenvalue distributions of random matrices, and determinants appear all over the subject because the joint eigenvalue density often contains Vandermonde-type determinant factors.

This matters for modern ML because spectra of trained weight matrices often show structured deviations from classical random baselines. Those deviations are informative about:

  • effective rank
  • heavy-tailed structure
  • noise versus signal
  • generalisation-related geometry

Even when one is not computing determinants directly, determinant identities live in the background of spectral density theory.

11.5 Determinants in Stability Analysis

For linear dynamics

$$\dot{x} = Ax$$

or discrete dynamics

$$x_{t+1} = Ax_t,$$

the eigenvalues of $A$ control stability, and the determinant gives their product.

That makes determinant a coarse but meaningful summary of total expansion or contraction.

In recurrent models, one often cares more directly about singular values than determinants, but the determinant still carries interpretable global information:

  • $|\det(A)| \gg 1$: strong global volume expansion
  • $|\det(A)| \ll 1$: strong global contraction
  • $|\det(A)| = 1$: overall volume preserved

This is why orthogonal and unitary constructions are associated with stable signal propagation.

12. Computational Considerations

12.1 Algorithms Comparison

In practice, determinant computation is not about formulas first. It is about choosing the right factorisation for the matrix class.

| Method | Cost | Stability | Best use |
| --- | --- | --- | --- |
| Leibniz formula | $O(n!)$ | Exact but combinatorial | Only tiny symbolic cases |
| Cofactor expansion | Exponential in general | Fine for hand work | Small matrices, many zeros |
| LU factorisation | $O(n^3)$ | Good with pivoting | General dense square matrices |
| Cholesky | $O(n^3)$ setup, cheap log-det after | Excellent for SPD | Covariance and kernel matrices |
| Eigenvalue product | $O(n^3)$ | Fine if spectrum already needed | Spectral analysis |

The practical rule is simple:

  • hand computation -> cofactor / structure
  • code -> LU or Cholesky

12.2 Log-Determinant Computation

Never compute a determinant and then take its logarithm if numerical stability matters.

Instead use:

  • LU-based slogdet for general matrices
  • Cholesky-based formulas for SPD matrices

Conceptually:

det(A)       -> can overflow / underflow
log|det(A)|  -> stable scale
sign + logabsdet -> safest representation

That is why libraries such as NumPy, SciPy, PyTorch, and JAX expose sign-and-log-determinant APIs rather than encouraging raw determinant use in probabilistic objectives.
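
A minimal illustration of the stable route versus the raw determinant, using NumPy's sign-and-log-determinant API:

```python
import numpy as np

A = 0.1 * np.eye(500)

print(np.linalg.det(A))          # underflows to 0.0
sign, logabsdet = np.linalg.slogdet(A)
print(sign, logabsdet)           # 1.0, 500 * log(0.1) ~ -1151.3
```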

12.3 Gradient of Log-Determinant in Autograd

Autodiff frameworks implement

$$\nabla_A \log|\det(A)| = A^{-T}$$

through stable matrix factorizations rather than symbolic expansion.

This matters because a naive determinant implementation would be:

  • slow
  • unstable
  • disastrous for gradients

In practice, gradient flow through log-det is usually routed through LU, QR, or Cholesky internals depending on matrix structure.

12.4 Stochastic Log-Det Estimation

For very large SPD matrices, exact $O(n^3)$ log-det is too expensive.

A standard trick is

$$\log\det(A) = \operatorname{tr}(\log A).$$

Then instead of forming $\log A$ explicitly, one estimates the trace stochastically using random probe vectors and polynomial or Lanczos approximations.

This leads to methods such as:

  • Hutchinson trace estimation
  • stochastic Lanczos quadrature
  • Chebyshev-based trace approximations

These methods are critical in scalable Gaussian process toolkits because they replace dense factorisations with repeated matrix-vector products.
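
A toy sketch of the idea, using Hutchinson probes with Rademacher vectors. For clarity it forms the dense matrix logarithm explicitly via scipy.linalg.logm, which a real large-scale implementation would replace with Lanczos or Chebyshev approximations of $\log(A)$ acting on vectors.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(5)
n = 300
X = rng.standard_normal((n, n))
A = X @ X.T / n + np.eye(n)        # SPD test matrix

logA = logm(A).real                # dense log(A): for demonstration only
num_probes = 50
estimates = []
for _ in range(num_probes):
    z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
    estimates.append(z @ (logA @ z))      # unbiased estimate of tr(log A)

print(np.mean(estimates))                 # stochastic estimate of log det(A)
print(np.linalg.slogdet(A)[1])            # exact log-det for comparison
```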

12.5 Determinants with Low-Rank Structure

Low-rank structure is determinant gold.

If

$$A \mapsto A + UV^T$$

with rank $k \ll n$, then the determinant lemma turns an $n \times n$ problem into a $k \times k$ one.

That is the same general efficiency principle behind many ML approximations:

  • low-rank covariance updates
  • inducing-point approximations
  • LoRA-style matrix updates
  • adapter-style structured perturbations

The chapter theme is repeating itself:

generic matrix -> expensive
structured matrix -> determinant becomes tractable

13. Determinants and Linear System Theory

13.1 Invertibility and the Determinant

For a square matrix,

$$A \text{ invertible} \iff \det(A) \neq 0.$$

This is the determinant's most famous theorem, but it should be understood as the synthesis of many equivalent statements:

$$\det(A) \neq 0 \iff \operatorname{rank}(A) = n \iff \operatorname{null}(A) = \{0\} \iff A^{-1} \text{ exists}.$$

So the determinant is not one test among many. It is one gateway into the entire equivalence class of invertibility statements.

13.2 Cramer's Rule and Explicit Formulas

Cramer's rule gives explicit formulas for the coordinates of the solution to

$$Ax = b.$$

That makes determinant theory historically inseparable from linear systems. Before modern numerical linear algebra, determinants were studied partly because they gave exact symbolic solution formulas.

Today the computational message is different:

  • Cramer's rule explains
  • LU solves

The determinant remains conceptually central even when it is not the fastest numerical tool.

13.3 Determinant Conditions for Solution Uniqueness

If $A$ is square:

  • $\det(A) \neq 0$ -> unique solution for every $b$
  • $\det(A) = 0$ -> not uniquely solvable for every $b$

But note the subtlety:

$$\det(A) = 0$$

does not by itself tell you whether a particular system has:

  • no solution
  • infinitely many solutions

For that, one must compare the rank of $A$ with the rank of the augmented matrix.

So determinants are decisive for invertibility, but not sufficient by themselves to classify every singular system.

13.4 Characteristic Polynomial and Eigenvalue Systems

The eigenvalue equation

$$\det(A - \lambda I) = 0$$

is itself a determinant-based system condition.

That is the bridge to the next chapter:

  • determinants tell you when a shifted matrix becomes singular
  • those singular shifts are exactly the eigenvalues
  • decomposition theory begins there

14. Common Mistakes

| Mistake | Why it is wrong | Fix |
| --- | --- | --- |
| $\det(A + B) = \det(A) + \det(B)$ | Determinant is not linear in the whole matrix | Use multilinearity one row/column at a time only |
| $\det(AB) = \det(A) + \det(B)$ | Determinant is multiplicative, not additive | Remember $\det(AB) = \det(A)\det(B)$ |
| $\det(2A) = 2\det(A)$ | Scaling every row by 2 scales the determinant by $2^n$ | Use $\det(\alpha A) = \alpha^n \det(A)$ |
| $\det(A)$ approximately 0 means numerically singular | Determinant magnitude depends on scale and dimension | Use the condition number to diagnose near-singularity |
| Sarrus' rule works for 4x4 | It only works for 3x3 matrices | Use cofactor expansion or LU beyond 3x3 |
| $\det(A) > 0$ means all eigenvalues are positive | Only the product is positive; negative eigenvalues can come in pairs | Check all eigenvalues or SPD criteria |
| Adding a multiple of one row changes the determinant | Row replacement leaves the determinant unchanged | Track only swaps and scalings |
| $\det(A) = 0$ tells me whether $Ax = b$ has no solution or infinitely many | It only tells you the matrix is singular | Use the ranks of $A$ and $[A \mid b]$ for classification |
| $\log(\det(A))$ is always real | Not if the determinant is negative or the matrix is not SPD | Use $\log\lvert\det(A)\rvert$ or slogdet unless SPD is guaranteed |
| A large determinant always means good conditioning | Conditioning depends on the singular value ratio, not the product alone | Use singular values or the condition number |

15. Exercises

  1. Computing determinants. Compute the determinant of each matrix using the most efficient method and justify the method choice:

     • $A = \begin{pmatrix} 5 & 3 \\ 2 & 4 \end{pmatrix}$
     • $B = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 4 \end{pmatrix}$
     • $C = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$
     • $D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 2 & 3 & 0 & 0 \\ 4 & 5 & 6 & 0 \\ 7 & 8 & 9 & 10 \end{pmatrix}$
     • $E = \operatorname{diag}(2, -3, 1, 4, -1)$

  2. Property verification. Let

     $$A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & -1 \\ 2 & 0 \end{pmatrix}.$$

     Verify:

     • $\det(AB) = \det(A)\det(B) = \det(BA)$
     • $\det(A + B) \neq \det(A) + \det(B)$
     • $\det(3A) = 3^2 \det(A)$
     • $\det(A^T) = \det(A)$

  3. Characteristic polynomial. For

     $$A = \begin{pmatrix} 4 & 2 \\ 1 & 3 \end{pmatrix},$$

     compute:

     • the characteristic polynomial
     • the eigenvalues
     • eigenvectors
     • a direct verification of Cayley-Hamilton

  4. Geometric interpretation. In $\mathbb{R}^2$, let

     $$u = \begin{pmatrix} 3 \\ 1 \end{pmatrix}, \qquad v = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.$$

     Find the area of the spanned parallelogram, then apply:

     • a rotation
     • a reflection
     • a scaling by factor 3

     and track what happens to the determinant and orientation in each case.

  5. Cofactors and adjugate. For

     $$A = \begin{pmatrix} 1 & 2 & 0 \\ 3 & 1 & 1 \\ 0 & 2 & 1 \end{pmatrix},$$

     compute:

     • $\det(A)$ by two different cofactor expansions
     • the full cofactor matrix
     • $\operatorname{adj}(A)$
     • $A^{-1}$ from the adjugate identity

  6. Determinant identities

     • Use the matrix determinant lemma on a diagonal matrix plus a rank-1 update
     • Verify Sylvester's theorem on a small rectangular example
     • Verify the block determinant formula on a block triangular matrix
     • Compute a Schur complement determinant directly and by formula

  7. Log-det for flows. Consider a 3D coupling transformation with triangular Jacobian. Write its Jacobian explicitly, identify the diagonal entries, and derive the formula for $\log|\det J|$.

  8. SPD and Gaussian computation. For a symmetric positive definite covariance matrix:

     • compute a Cholesky factor
     • derive $\log\det(\Sigma)$ from the diagonal of the factor
     • evaluate the Gaussian log-likelihood for a sample vector

  9. Numerical instability. Compare:

     • direct determinant computation of $0.1\,I_{50}$
     • stable sign/log-determinant computation

     Explain why the raw determinant is a poor numerical representation.

16. Why This Matters for AI (2026 Edition)

| Aspect | Impact |
| --- | --- |
| Normalising flows | The log-determinant of the Jacobian is the core term in the likelihood |
| Multivariate Gaussians | Covariance normalization and entropy both depend on log-det |
| Gaussian processes | Marginal likelihood combines linear solves and log-determinants |
| Eigenvalue theory | The characteristic equation is a determinant equation |
| Invertible networks | Local and global invertibility are determinant statements |
| Information geometry | Fisher-metric volume terms involve determinants and log-determinants |
| DPP-based diversity | Determinants score subset diversity via spanned volume |
| Stability analysis | Determinants summarize total volume expansion or contraction of dynamics |
| Low-rank updates | Determinant lemmas turn expensive recomputation into small auxiliary problems |
| Numerical ML systems | Stable slogdet, Cholesky log-det, and stochastic estimators are production tools |

The bigger picture is that determinants are one of the places where algebra and probability truly lock together. In deep learning you often care less about a matrix entry-by-entry and more about what the matrix does globally: does it preserve information, collapse dimensions, distort probability mass, or amplify uncertainty? Determinants answer precisely those questions.

17. Conceptual Bridge

The determinant is where matrix theory stops being just arithmetic and becomes geometry.

  • In linear algebra, it detects invertibility and dependence.
  • In geometry, it measures volume and orientation.
  • In calculus, it becomes the Jacobian factor in change of variables.
  • In probability, it becomes the normalization term for Gaussian families and invertible generative models.

That is why this chapter sits exactly between matrix operations and spectral theory. Determinants summarize a matrix globally, but they also open the door to a finer structural story:

$$\det(\lambda I - A) = 0.$$

That single equation launches the study of eigenvalues, eigenspaces, diagonalisation, spectral decompositions, PCA, SVD, and much of modern representation analysis in ML.

Matrix entries
    ->
determinant
    ->
invertibility / volume / orientation
    ->
det(lambda I - A)
    ->
eigenvalues and decomposition theory

Next: Eigenvalues, Eigenvectors, and Matrix Decompositions.

References