Matrix Decompositions, Part 1: Intuition to Cholesky Factorization (Computational Recap)

1. Intuition

1.1 The Factorization Paradigm

Every problem in computational linear algebra is, at its heart, a search for structure. When you factor a matrix into simpler pieces - triangular, orthogonal, diagonal - you transform a hard problem into a sequence of easy ones. The word "easy" is precise: triangular systems can be solved in O(n^2) operations by substitution; orthogonal transformations are numerically perfect (they preserve norms exactly); diagonal systems are trivially solvable. A decomposition is a way of peeling back complexity to reveal the structural skeleton underneath.

This perspective explains why factorizations are not merely computational conveniences. They are mathematical revelations. LU factorization reveals which rows are "heavy" (large pivots dominate) and which are "light" (small pivots signal near-singularity). QR factorization reveals the orthogonal complement of the column space via the trailing zero rows of R in the full factorization. Cholesky factorization reveals the "matrix square root" L that encodes the geometry of a positive definite quadratic form. Each decomposition is a lens tuned to a different structural feature.

The central insight: nearly every matrix algorithm of practical importance runs in O(n^3) time, dominated by the factorization phase. After factorization, the actual problem-solving (applying L^{-1}, U^{-1}, Q^\top) costs only O(n^2). This is the fundamental return on investment: spend O(n^3) once to factor, then spend O(n^2) as many times as needed for different right-hand sides.
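
A minimal SciPy sketch of this pay-once, solve-many pattern (the matrix size and loop count are arbitrary choices for illustration):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 1000))

lu, piv = lu_factor(A)              # O(n^3): factor once
for _ in range(50):                 # O(n^2) per right-hand side afterwards
    b = rng.standard_normal(1000)
    x = lu_solve((lu, piv), b)      # permutation + two triangular solves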

For AI: this principle appears in natural gradient methods (factor the Fisher information matrix once per outer loop, solve multiple gradient updates cheaply), in Gaussian process regression (factor the covariance matrix once, evaluate posterior at many test points cheaply), and in iterative solvers (factor a preconditioner once, apply it repeatedly to accelerate convergence).

1.2 Triangular Systems: The Foundation

Before understanding factorizations, you must understand why triangular systems are considered "solved." A lower triangular system L\mathbf{x} = \mathbf{b} with L_{ii} \neq 0 can be solved row by row from top to bottom:

x_1 = b_1 / L_{11}, \qquad x_2 = (b_2 - L_{21}x_1) / L_{22}, \qquad x_i = \left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right) / L_{ii}

This is forward substitution: each unknown is determined uniquely by those already computed. The computation requires roughly n^2/2 multiplications and n^2/2 additions - O(n^2) total. Similarly, an upper triangular system U\mathbf{x} = \mathbf{b} is solved by backward substitution from bottom to top.

The key properties of triangular systems:

  1. Existence and uniqueness: a triangular system has a unique solution if and only if all diagonal entries are nonzero.
  2. Numerical stability: forward/backward substitution are backward stable - the computed solution satisfies a nearby triangular system. Accuracy degrades when the triangular factor is ill-conditioned, and a tiny diagonal entry is the usual culprit.
  3. Parallelism: the sequential dependency (each unknown requires all previous) limits parallelism, but blocked implementations can exploit level-2 BLAS parallelism within panels.
  4. Structural exploitation: sparse triangular systems (band, arrowhead, bidiagonal) can be solved in O(nk) for bandwidth k.

Non-example: a general dense n \times n system A\mathbf{x} = \mathbf{b} cannot be solved by substitution directly - it requires O(n^3) work. The entire purpose of LU, QR, and Cholesky is to reduce the general case to triangular cases.

1.3 Three Canonical Decompositions

The three decompositions in this section partition the space of structured matrix problems:

Decomposition | Matrix class | Factored form | Primary use
LU | General square, non-singular | PA = LU | Solving A\mathbf{x} = \mathbf{b}, computing \det(A)
QR | Any rectangular m \times n, m \geq n | A = QR | Least squares, rank determination, eigenvalue algorithms
Cholesky | Symmetric positive definite | A = LL^\top | SPD systems, Gaussian sampling, log-det computation

Why three? Each exploits a different structural property. LU uses no special structure beyond invertibility; it is the most general and therefore requires more care (pivoting for stability). QR uses orthogonality, which is the most numerically pristine structure - Q has condition number 1 by definition. Cholesky uses both symmetry and positive definiteness, which allows working with only half the matrix and guarantees the factorization exists without pivoting.

The hierarchy of stability: Cholesky > QR > LU (partially pivoted) > LU (no pivoting). As you sacrifice structure, you need more algorithmic sophistication to maintain numerical reliability.

1.4 Why This Matters for AI

Matrix factorizations are not background infrastructure - they are in the critical path of the most computationally expensive operations in modern AI.

Gaussian process regression (used in Bayesian optimization, uncertainty quantification, and hyperparameter tuning) requires computing K^{-1}\mathbf{y} and \log\det K for an n \times n kernel matrix K. Both require Cholesky factorization. A single GP training step on n = 10{,}000 points costs O(n^3) = 10^{12} operations - factorization is the bottleneck.
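
A hedged NumPy/SciPy sketch of that workload (the RBF kernel, jitter value, and problem size are illustrative assumptions, not any GP library's API):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
# Hypothetical RBF kernel matrix with a small diagonal jitter for numerical safety
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists) + 1e-6 * np.eye(len(X))
y = rng.standard_normal(len(X))

c, low = cho_factor(K, lower=True)            # O(n^3), done once
alpha = cho_solve((c, low), y)                # K^{-1} y in O(n^2)
logdet = 2.0 * np.log(np.diag(c)).sum()       # log det K from the Cholesky diagonal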

Second-order optimization (Newton's method, natural gradient, K-FAC) requires solving Hessian or Fisher information systems H\boldsymbol{\delta} = \mathbf{g}. For K-FAC (Kronecker-Factored Approximate Curvature, Martens & Grosse 2015), the Fisher approximation is block-diagonal with Kronecker structure, and each block is inverted via Cholesky - thousands of small Cholesky factorizations per training step.

Linear probing and fine-tuning - fitting a linear model on top of frozen features - is a least-squares problem. Numerically reliable implementations use QR (e.g., numpy.linalg.lstsq calls LAPACK's dgelsd which uses divide-and-conquer SVD, or dgelsy which uses column-pivoted QR).

Differentiating through factorizations is required whenever a factorization appears inside a differentiable computational graph. PyTorch's torch.linalg.cholesky_solve and JAX's jax.lax.linalg.cholesky support automatic differentiation through the factorization via the implicit function theorem.

Randomized methods (LoRA, sketching, randomized preconditioning) use approximate LU and QR factorizations on sketched matrices. Understanding the exact factorizations is prerequisite to understanding their randomized variants.

1.5 Historical Timeline

Year | Contribution | Author(s)
1809 | Gaussian elimination for geodesy (published 1810) | Carl Friedrich Gauss
1938 | LU factorization formalized as a matrix decomposition | Tadeusz Banachiewicz
1945 | Systematic analysis of elimination and error growth | Alan Turing
1954 | Givens rotations for QR | Wallace Givens
1958 | Householder reflections; modern QR algorithm | Alston Householder
1961 | QR algorithm for eigenvalues (QR iteration) | Francis; Kublanovskaya
1971 | ALGOL Handbook programs for matrix problems (LAPACK precursor) | Wilkinson, Reinsch
1979 | LINPACK (first standardized library) | Dongarra et al.
1987 | LAPACK + BLAS-3 blocked algorithms | Anderson, Bai, Dongarra et al.
1995 | Randomized SVD (sketch-then-factor) | Frieze, Kannan, Vempala
2011 | Randomized low-rank approximation framework | Halko, Martinsson, Tropp
2015 | K-FAC: Cholesky for Fisher information blocks | Martens & Grosse
2022 | LoRA: low-rank adaptation of weight updates | Hu et al.
2024 | FlashAttention-3: blocked QR-like tiling for attention | Shah et al.

2. Background: Triangular Systems

2.1 Forward Substitution

Definition. Given a lower triangular matrix L \in \mathbb{R}^{n \times n} with L_{ii} \neq 0 for all i, and \mathbf{b} \in \mathbb{R}^n, forward substitution computes \mathbf{x} satisfying L\mathbf{x} = \mathbf{b} via:

x_i = \frac{1}{L_{ii}}\left(b_i - \sum_{j=1}^{i-1} L_{ij} x_j\right), \quad i = 1, 2, \ldots, n

Algorithm (row-oriented):

for i = 1 to n:
    x[i] = b[i]
    for j = 1 to i-1:
        x[i] -= L[i,j] * x[j]
    x[i] /= L[i,i]

Complexity: The inner loop runs 0, 1, \ldots, n-1 times, giving \sum_{i=1}^{n}(i-1) = n(n-1)/2 multiplications and the same number of additions. Total: roughly n^2 floating-point operations, i.e. O(n^2).

Numerical properties: Forward substitution is backward stable: the computed solution satisfies a nearby triangular system. The forward error, however, scales with the condition number \kappa(L) of the triangular factor, not with the size of the intermediate values x_j. The main source of numerical difficulty is a small diagonal entry L_{ii}: dividing by a near-zero diagonal strongly amplifies any error already present in the right-hand side.

Unit lower triangular: When L_{ii} = 1 for all i (as in the LU factorization, where L is unit lower triangular by convention), the divisions are trivial and the algorithm reduces to:

for i = 1 to n:
    x[i] = b[i] - sum(L[i,1:i-1] * x[1:i-1])

This eliminates n divisions, slightly improving performance and stability.
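
A runnable version of the row-oriented algorithm above, checked against SciPy (a minimal sketch; the shifted diagonal just keeps the pivots well away from zero):

import numpy as np
from scipy.linalg import solve_triangular

def forward_substitution(L, b):
    """Row-oriented forward substitution for lower triangular L with nonzero diagonal."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)
b = rng.standard_normal(5)
print(np.allclose(forward_substitution(L, b), solve_triangular(L, b, lower=True)))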

For AI: Forward substitution appears in the final step of any matrix factorization-based solver. In neural network training with second-order methods (Newton, K-FAC), the natural gradient direction \boldsymbol{\delta} = F^{-1}\nabla\mathcal{L} is extracted from the Cholesky factor L of the Fisher matrix F by two triangular solves: forward substitution with L followed by backward substitution with L^\top.

2.2 Backward Substitution

Definition. Given an upper triangular U \in \mathbb{R}^{n \times n} with U_{ii} \neq 0, backward substitution computes \mathbf{x} satisfying U\mathbf{x} = \mathbf{b} via:

x_i = \frac{1}{U_{ii}}\left(b_i - \sum_{j=i+1}^{n} U_{ij} x_j\right), \quad i = n, n-1, \ldots, 1

The algorithm proceeds from the last equation (which has a single unknown, x_n = b_n / U_{nn}) backward to the first. Complexity and numerical properties mirror forward substitution.

Column-oriented backward substitution: An equivalent formulation updates the remaining components of \mathbf{b} as each x_i is resolved:

for i = n downto 1:
    x[i] = b[i] / U[i,i]
    b[1:i-1] -= x[i] * U[1:i-1, i]

This accesses U by columns rather than rows, which can improve cache efficiency on column-major storage (Fortran, LAPACK convention).

Connection to matrix inverse: The columns of U^{-1} can be computed by solving n triangular systems U\mathbf{x}^{(k)} = \mathbf{e}_k, one per standard basis vector. This is how scipy.linalg.solve_triangular computes U^{-1} when called with a matrix right-hand side.
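
A small sketch of that column-by-column idea, passing the identity as a matrix right-hand side (sizes are arbitrary):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
U = np.triu(rng.standard_normal((4, 4))) + 4 * np.eye(4)

# Solving U x^(k) = e_k for every basis vector at once: the RHS is the identity matrix
U_inv = solve_triangular(U, np.eye(4), lower=False)
print(np.allclose(U @ U_inv, np.eye(4)))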

2.3 Gaussian Elimination Revisited

Gaussian elimination is the process of reducing a matrix A to upper triangular form by applying elementary row operations (EROs). Each ERO of type "add c times row j to row i (with i > j)" corresponds to left-multiplication by an elementary lower triangular matrix (an elementary elimination matrix or Frobenius matrix):

E_{ij} = I - m_{ij}\mathbf{e}_i\mathbf{e}_j^\top, \quad m_{ij} = A_{ij}/A_{jj}

The multiplier m_{ij} is the ratio that zeros out entry (i,j). Applying all elimination matrices to reduce A to U gives:

E_{n,n-1} \cdots E_{31} E_{21} A = U

Crucially, the product of elementary lower triangular matrices is lower triangular, and their inverses are trivially:

E_{ij}^{-1} = I + m_{ij}\mathbf{e}_i\mathbf{e}_j^\top

So the product of inverses, L = E_{21}^{-1} E_{31}^{-1} \cdots E_{n,n-1}^{-1}, is unit lower triangular with the multipliers m_{ij} sitting in their natural positions. This gives:

A = LU

The multipliers go directly into L: one of the most elegant facts in numerical linear algebra is that the multipliers m_{ij} used during elimination appear, without any further computation, as the subdiagonal entries L_{ij} of the factor L. No separate computation of L is required - it is built up in place during elimination.
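
That fact is easy to verify numerically; below is a minimal unpivoted LU sketch (demo only - production code always pivots, as discussed in 3.3-3.4):

import numpy as np

def lu_nopivot(A):
    """Textbook unpivoted LU: the multipliers m_ik are stored straight into L."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]             # multipliers for column k
        A[k+1:, :] -= np.outer(L[k+1:, k], A[k, :])   # eliminate below the pivot
    return L, np.triu(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)) + 5 * np.eye(5)       # shifted diagonal avoids zero pivots
L, U = lu_nopivot(M)
print(np.allclose(L @ U, M))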


3. LU Factorization

3.1 LU Decomposition Theorem

Theorem (LU factorization). Let A \in \mathbb{R}^{n \times n}. Then A has an LU factorization A = LU (with L unit lower triangular and U upper triangular) if and only if all leading principal submatrices A_k = A[1:k, 1:k] for k = 1, \ldots, n-1 are non-singular.

Proof sketch. The condition A_k non-singular for all k ensures that no zero pivot is encountered during Gaussian elimination. At step k, the pivot is U_{kk} = \det(A_k)/\det(A_{k-1}), so all pivots are nonzero iff all leading minors are nonzero. Conversely, if Gaussian elimination completes without zero pivots, the multipliers are finite and L, U are well-defined. \square

Non-examples:

  • A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} fails: A_{11} = 0 (a zero pivot at step 1).
  • A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}: A_{11} = 1 \neq 0 but \det(A) = 0, so the factorization runs but U_{22} = 0 (singular).
  • Any singular matrix: A = LU \Rightarrow \det(A) = \det(L)\det(U) = \prod_i U_{ii}, so \det(A) = 0 iff some U_{ii} = 0.

Uniqueness: The factorization with L_{ii} = 1 (unit lower triangular) is unique when it exists. Without the normalization, one could scale L and U by any diagonal matrix and its inverse.

The factorization in-place: LU computation overwrites A: the upper triangle stores U, the strict lower triangle stores the multipliers (i.e., the strict lower triangle of L, since L_{ii} = 1 need not be stored). This halves the memory requirement.

3.2 The LU Algorithm: Outer Product Form

The standard view of LU elimination is column-by-column. But the outer product form (also called rank-1 update form) exposes the structure more clearly and maps naturally to BLAS-3 blocked implementations.

At step k of the factorization, we have already reduced columns 1, \ldots, k-1. The remaining (n-k+1) \times (n-k+1) submatrix A^{(k)} (the Schur complement after k-1 steps) is updated by:

A^{(k+1)} = A^{(k)}[k+1:n,\, k+1:n] - \mathbf{l}^{(k)} \mathbf{u}^{(k)\top}

where \mathbf{l}^{(k)} = A^{(k)}[k+1:n,\, k] / A^{(k)}_{kk} (the multipliers column) and \mathbf{u}^{(k)\top} = A^{(k)}[k,\, k+1:n] (the pivot row). This is a rank-1 update of the trailing submatrix.

Algorithm:

for k = 1 to n-1:
    # Compute multipliers
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]    # size (n-k)
    # Rank-1 update of trailing submatrix
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]   # BLAS-2: ger
    U[k, k:n] = A[k, k:n]

Complexity: Step k costs (n-k) divisions and (n-k)^2 multiply-adds. Total:

\sum_{k=1}^{n-1} \left[(n-k) + (n-k)^2\right] = \frac{n(n-1)}{2} + \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3} \text{ multiply-adds}

Counting each multiply-add as two floating-point operations, LU factorization costs about \frac{2n^3}{3} flops. For comparison: solving A\mathbf{x} = \mathbf{b} after factorization costs 2n^2 (two triangular solves). The ratio n/3 means factorization dominates for large n.

For AI: the rank-1 update structure maps directly to GPU tensor cores. Blocked implementations reformulate the trailing submatrix update as a matrix-matrix multiply (BLAS-3: DGEMM), which achieves near-peak GPU throughput.

3.3 Failure Without Pivoting

Naive LU (no pivoting) fails catastrophically when:

  1. Zero pivots: If A_{kk}^{(k)} = 0 during step k, the division by the pivot is undefined. This happens even for non-singular matrices that happen to have a zero leading principal minor.

  2. Small pivots - numerical catastrophe: Consider:

A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}, \quad \varepsilon = 10^{-16}

Naive LU gives multiplier m_{21} = 1/\varepsilon = 10^{16}, and U_{22} = 1 - 10^{16} \cdot 1 \approx -10^{16} (in floating-point, this overwhelms the exact value 1 - 1/\varepsilon = (\varepsilon - 1)/\varepsilon). The computed solution \hat{\mathbf{x}} can have relative error of order 1 even though the underlying problem is well-conditioned.

  3. Growth factor explosion: The growth factor \rho_n is defined as:

\rho_n = \frac{\max_{i,j,k} |A_{ij}^{(k)}|}{\max_{i,j} |A_{ij}|}

Without pivoting, \rho_n can grow as 2^{n-1} in the worst case (Wilkinson's matrix), making the computed U useless.

Example of catastrophic failure:

A = \begin{pmatrix} 10^{-20} & 1 \\ 1 & 2 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 1 \\ 3 \end{pmatrix}

Exact solution: \mathbf{x} \approx (1, 1)^\top. Naive LU in IEEE double precision gives multiplier m_{21} = 10^{20}, yielding U_{22} = 2 - 10^{20} \approx -10^{20}: the original entry 2 is swamped entirely, and the computed solution is garbage.
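
A hedged NumPy sketch of this failure: naive elimination loses x_1 completely, while a pivoted solver recovers the answer (the numbers follow the example above):

import numpy as np

A = np.array([[1e-20, 1.0],
              [1.0,   2.0]])
b = np.array([1.0, 3.0])

# Naive (unpivoted) elimination: the multiplier 1e20 swamps everything it touches
m21 = A[1, 0] / A[0, 0]           # 1e20
U22 = A[1, 1] - m21 * A[0, 1]     # 2 - 1e20 rounds to -1e20: the "2" is gone
y2 = b[1] - m21 * b[0]
x2 = y2 / U22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print(x1, x2)                     # approximately (0.0, 1.0): x1 is completely wrong

print(np.linalg.solve(A, b))      # partial pivoting: approximately (1.0, 1.0)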

For AI: PyTorch's torch.linalg.lu uses partial pivoting by default. numpy.linalg.solve calls LAPACK dgesv which uses partial pivoting. Never use unpivoted LU for numerical computation.

3.4 Partial Pivoting: PA = LU

Idea: Before processing column k, swap row k with the row (among k, k+1, \ldots, n) having the largest absolute value in column k. This ensures |m_{ij}| \leq 1 for all multipliers.

Algorithm with partial pivoting:

for k = 1 to n-1:
    # Find pivot row
    p = argmax_{i >= k} |A[i, k]|
    swap rows k and p in A (and record in permutation P)
    # Now A[k,k] is the largest in column k below
    L[k+1:n, k] = A[k+1:n, k] / A[k, k]       # |L[i,k]| <= 1
    A[k+1:n, k+1:n] -= L[k+1:n, k] * A[k, k+1:n]

Result: PA = LU where P is a permutation matrix encoding all row swaps. The factor L satisfies |L_{ij}| \leq 1 for all i > j.

Growth factor with partial pivoting: The theoretical bound is \rho_n \leq 2^{n-1} (the same as with no pivoting!), but in practice the growth factor is almost always small (empirically \rho_n \approx n^{2/3} for random matrices), and matrices that approach the worst case are essentially never encountered outside contrived examples.

Stability theorem (Wilkinson 1961): Partial pivoting produces a computed factorization \hat{L}\hat{U} such that:

PA = \hat{L}\hat{U} + E, \quad \|E\|_\infty \leq 8n^3 u \cdot \rho_n \cdot \|A\|_\infty

where u is the unit roundoff (e.g., u = 2^{-53} for IEEE double). For typical \rho_n, this bound is well within practical tolerance.

Storage of the permutation: P is stored as a pivot vector \mathbf{p} \in \mathbb{Z}^{n-1} where p_k is the row index selected at step k. Applying P to \mathbf{b} before the triangular solve costs O(n).
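
SciPy exposes the explicit-matrix form of the pivoted factorization; a quick sketch (note that scipy.linalg.lu returns factors satisfying A = P L U, i.e. P^\top A = LU in the convention used above):

import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

P, L, U = lu(A)                                  # A = P @ L @ U
print(np.allclose(P @ L @ U, A))
print(np.abs(np.tril(L, -1)).max() <= 1.0)       # partial pivoting keeps |L_ij| <= 1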

Non-example: Partial pivoting does NOT guarantee that \rho_n is small. For

A = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}

both candidate pivots in column 1 have magnitude 1, so no swap is needed; after one elimination step U_{22} = 2 and \rho_2 = 2. The pattern generalizes: Wilkinson's matrix (ones on the diagonal and in the last column, -1 below the diagonal) satisfies the partial-pivoting rule at every step, yet its last column doubles at each step, giving \rho_n = 2^{n-1}.

3.5 Complete Pivoting: PAQ = LU

Complete pivoting selects both the row and column with the globally largest absolute entry at each step:

At step k: find (p, q) = \arg\max_{i \geq k,\, j \geq k} |A_{ij}|, then swap row k \leftrightarrow p and column k \leftrightarrow q.

Result: PAQ = LU where P and Q are permutation matrices (row and column respectively).

Growth factor bound: Complete pivoting satisfies the tighter bound:

\rho_n \leq \left(n \cdot 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}\right)^{1/2}

which grows much more slowly than 2^{n-1} but is still superlinear. Numerically, complete pivoting is extremely stable - no practical example of large growth is known.

Cost: Finding the global maximum at each step requires examining the full trailing submatrix, adding O(n^2) comparisons per step and O(n^3) comparisons in total - a substantial constant-factor increase in total work over partial pivoting. This is the reason complete pivoting is rarely used in practice.

When to use complete pivoting:

  • When stability is paramount (e.g., certified computation, interval arithmetic)
  • When the matrix is suspected to be rank-deficient (complete pivoting exposes rank)
  • When solving very ill-conditioned systems where partial pivoting is known to fail

For AI: Complete pivoting is rarely used in mainstream ML workflows. Partial pivoting suffices for almost all practical problems. The theoretical importance of complete pivoting is in establishing lower bounds and in the analysis of rank-revealing factorizations.

3.6 Rook Pivoting

Rook pivoting is a middle ground between partial and complete pivoting. At step k:

  1. Find the largest element in column k (as in partial pivoting) - say row p.
  2. Check if |A_{pk}| is also the largest in row p. If yes, use (p, k) as pivot.
  3. If not, find the largest in row p, swap columns, and repeat until convergence.

The name comes from the chess rook: the pivot search alternates between column and row moves until it finds a position that is simultaneously the largest in its row and column - a "rook position."

Properties:

  • Growth factor: \rho_n \leq 2^{n-1} in theory, but in practice the growth is much smaller than with partial pivoting.
  • Cost: O(n^2) additional work per step in the worst case but typically O(n) in practice (convergence in 1-2 alternations).
  • Rank-revealing: Rook pivoting reveals rank approximately as well as complete pivoting.

Rank detection (Foster 1997): Rook pivoting is rank-revealing in practice: for a matrix of numerical rank r, the pivot magnitudes drop sharply after position r, so the jump in the diagonal of U at position r+1 is clearly visible.

3.7 LU Stability Analysis and Backward Error

The definitive framework for understanding LU stability is backward error analysis, developed by Wilkinson in the 1960s.

Definition (backward error). The backward error of a computed solution \hat{\mathbf{x}} to A\mathbf{x} = \mathbf{b} is the smallest perturbation \delta A, \delta\mathbf{b} such that:

(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}

A method is backward stable if \|\delta A\| / \|A\| and \|\delta\mathbf{b}\| / \|\mathbf{b}\| are of order u (machine epsilon) regardless of the input.

Backward error theorem for LU with partial pivoting:

Theorem (Wilkinson 1961). Let \hat{\mathbf{x}} be the solution computed by LU with partial pivoting in IEEE double precision. Then:

(A + \delta A)\hat{\mathbf{x}} = \mathbf{b}, \quad \|\delta A\|_\infty \leq c_n u \cdot \rho_n \cdot \|A\|_\infty

where c_n = O(n^2) is a modest polynomial constant and u = 2^{-53} \approx 1.1 \times 10^{-16}.

Forward error bound: The forward error satisfies:

\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_\infty}{\|\mathbf{x}\|_\infty} \leq c_n u \cdot \rho_n \cdot \kappa_\infty(A) + O(u^2)

where \kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty is the condition number. This is the key identity: numerical error = roundoff \times growth factor \times condition number.

Implication for practice: If \kappa(A) \approx 10^k in double precision, you lose about k digits of accuracy. With u \approx 10^{-16}, you have roughly 16 - k correct digits in \hat{\mathbf{x}}.
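
A small experiment that makes the digit-loss rule concrete, using the notoriously ill-conditioned Hilbert matrix (the specific size is an arbitrary choice):

import numpy as np
from scipy.linalg import hilbert

n = 10
A = hilbert(n)                 # condition number around 1e13
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)  # LU with partial pivoting under the hood
kappa = np.linalg.cond(A)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"cond(A) ~ 1e{np.log10(kappa):.0f}, relative error ~ 1e{np.log10(rel_err):.0f}")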

Backward stability of QR: QR (Householder) is backward stable with growth factor \rho_n = 1 - Householder reflectors are orthogonal and preserve norms exactly. This is why QR is preferred for ill-conditioned systems.

3.8 Blocked LU for Cache Efficiency

Modern hardware has a deep memory hierarchy (L1: 32 KB, L2: 256 KB, L3: 8 MB, DRAM: effectively unbounded). The key metric for algorithmic performance is arithmetic intensity: floating-point operations per byte of data moved. BLAS operations have different intensities:

Operation | BLAS level | Flops | Data | Intensity
\mathbf{y} \leftarrow A\mathbf{x} (DGEMV) | 2 | 2n^2 | n^2 | O(1)
C \leftarrow AB + C (DGEMM) | 3 | 2n^3 | n^2 | O(n)

Blocked LU reformulates the algorithm so that the dominant computation is DGEMM (matrix-matrix multiply), achieving O(n) arithmetic intensity and near-peak GPU/CPU throughput.

Block algorithm (block size b):

for k = 0 to n/b - 1:
    # Factor the current panel A[kb:n, kb:(k+1)b] (unblocked, BLAS-2 inside the panel)
    P_k, L_panel, U_kk = LU_factor(A[kb:n, kb:(k+1)b])       # O(n b^2)
    # L_panel stacks L_kk (b x b, unit lower triangular) on top of L_below
    L_kk, L_below = L_panel[0:b, :], L_panel[b:, :]
    # Apply the panel's row permutation to the rest of the matrix
    A[kb:n, (k+1)b:n] = P_k @ A[kb:n, (k+1)b:n]
    # Solve for the U panel to the right (triangular solve, BLAS-3: DTRSM)
    U_right = solve_lower_triangular(L_kk, A[kb:(k+1)b, (k+1)b:n])
    # Update trailing submatrix (BLAS-3: DGEMM)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ U_right               # this is the bottleneck

The trailing submatrix update (A -= L_below @ U_right) is a pure DGEMM and dominates the total computation. On a modern CPU/GPU with peak DGEMM performance, this block formulation achieves 80-95% of theoretical peak.

LAPACK routine: DGETRF implements blocked LU with partial pivoting. The block size b is tuned automatically per platform via the ilaenv routine.

For AI: Blocked LU is the algorithm behind PyTorch's torch.linalg.lu_factor, jax.scipy.linalg.lu, and scipy's lu_factor. For GPU, cuSOLVER implements batched blocked LU for thousands of small systems simultaneously - critical for K-FAC training.

3.9 Solving Ax = b via LU

Given the LU factorization PA = LU, solving A\mathbf{x} = \mathbf{b} proceeds in three steps, plus an optional refinement:

  1. Apply permutation: \mathbf{c} = P\mathbf{b} (O(n), just index lookups)
  2. Forward solve: solve L\mathbf{y} = \mathbf{c} (O(n^2), forward substitution)
  3. Backward solve: solve U\mathbf{x} = \mathbf{y} (O(n^2), backward substitution)
  4. (Optional) Iterative refinement: compute \mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}, solve for the correction \delta\mathbf{x}, update

The total cost after factorization is 2n^2 flops - negligible compared to the \frac{2n^3}{3} factorization cost for large n.

Multiple right-hand sides: For k different vectors \mathbf{b}_1, \ldots, \mathbf{b}_k, one factorization costing \frac{2n^3}{3} is followed by k triangular solves, each costing 2n^2. This is the key economic argument for LU: amortize the factorization cost over many solves.

Computing the determinant: \det(A) = \det(P^{-1})\det(L)\det(U) = (-1)^s \prod_{i=1}^n U_{ii}, where s is the number of row swaps. Since L is unit lower triangular, \det(L) = 1.

Computing A^{-1}: Solve A\mathbf{x}^{(k)} = \mathbf{e}_k for k = 1, \ldots, n (one factorization, n triangular solves). But explicitly forming A^{-1} is almost always unnecessary and should be avoided - solve the system directly instead.
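
One way to recover the determinant from SciPy's factored form, as a hedged sketch (the swap count is read off the pivot vector returned by lu_factor):

import numpy as np
from scipy.linalg import lu_factor

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

lu, piv = lu_factor(A)                          # piv[i] = row exchanged with row i
swaps = int(np.sum(piv != np.arange(6)))        # number of actual interchanges
det_from_lu = (-1.0) ** swaps * np.prod(np.diag(lu))
print(np.isclose(det_from_lu, np.linalg.det(A)))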

3.10 Rank-Deficient LU

When A is singular or numerically rank-deficient, the LU factorization (even with pivoting) produces a U with some diagonal entries near zero. The factorization still "works" mechanically, but it produces U_{kk} \approx 0 at the position k corresponding to a dependent column.

Signature of rank deficiency:

\operatorname{rank}(A) = r \iff U_{11}, \ldots, U_{rr} \text{ are "large" and } U_{r+1,r+1}, \ldots, U_{nn} \approx 0

Problem: With partial pivoting, the threshold for "approximately zero" is ambiguous. If |U_{kk}| / |U_{11}| < \varepsilon_{\text{tol}}, we declare the matrix to have numerical rank < k. But the choice of \varepsilon_{\text{tol}} is problem-dependent.

Rank-revealing LU: Column-pivoted LU (complete or rook pivoting) provides better rank revelation: the diagonal entries of U decay in a manner that reflects the true rank structure.

Better alternative for rank-deficient problems: Use rank-revealing QR (RRQR, Section 4.6) or the SVD. The SVD provides the definitive rank determination via singular value thresholding - but costs O(n^3), just as LU does. RRQR provides a cheaper, nearly-as-reliable alternative.

For AI: Neural network weight matrices often have approximately low rank (Hu et al., 2022, argue that fine-tuning weight updates have low intrinsic rank). Rank-deficient LU can detect this structure, but RRQR or randomized SVD are preferred in practice.


4. QR Factorization: Computational Algorithms

4.1 From Gram-Schmidt to Algorithms

Recall from 05: The QR factorization A = QR of an m \times n matrix (m \geq n) decomposes A into an orthonormal factor Q \in \mathbb{R}^{m \times n} (or m \times m for the full QR) and an upper triangular factor R \in \mathbb{R}^{n \times n}. This was constructed via Gram-Schmidt orthogonalization in 05.

The problem with classical Gram-Schmidt: Classical Gram-Schmidt (CGS) is mathematically correct but numerically unstable for ill-conditioned matrices. The computed Q can lose orthogonality catastrophically: for a matrix with condition number 10^8, the computed Q from CGS may have \|Q^\top Q - I\|_F \approx 10^8 \cdot u \approx 10^{-8}, causing significant errors in subsequent computations.

Modified Gram-Schmidt (MGS): Reorders the orthogonalization to reduce error accumulation. MGS is numerically equivalent to classical QR of a slightly perturbed matrix, giving \|Q^\top Q - I\|_F = O(\kappa(A) \cdot u). Still insufficient for highly ill-conditioned problems.

Householder and Givens: Both achieve \|Q^\top Q - I\|_F = O(u) regardless of \kappa(A), making them the preferred algorithms for production use. The key is that they build Q as a product of exactly orthogonal elementary transformations (reflectors or rotations), never losing orthogonality during construction.

For AI: numpy.linalg.qr and scipy.linalg.qr use LAPACK DGEQRF (Householder QR). torch.linalg.qr similarly calls cuSOLVER which uses Householder on GPU. Understanding the algorithmic basis helps you interpret condition number warnings and numerical precision behavior.

4.2 Householder Reflections

Definition. A Householder reflector (or Householder transformation) is a matrix of the form:

H = I - 2\frac{\mathbf{v}\mathbf{v}^\top}{\mathbf{v}^\top\mathbf{v}} = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top

where \mathbf{v} \in \mathbb{R}^m is a nonzero Householder vector. The matrix H is symmetric (H = H^\top) and orthogonal (H^\top H = H^2 = I), so H^{-1} = H - it is its own inverse.

Geometric interpretation: H reflects vectors across the hyperplane perpendicular to \mathbf{v}. Every point \mathbf{x} in the hyperplane satisfies H\mathbf{x} = \mathbf{x} (invariant). Every point c\mathbf{v} along the \mathbf{v} direction satisfies H(c\mathbf{v}) = -c\mathbf{v} (negated).

Key property: Given any vector \mathbf{a} \in \mathbb{R}^m, there exists a Householder reflector H such that H\mathbf{a} = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2\, \mathbf{e}_1. That is, H maps \mathbf{a} to a multiple of the first standard basis vector.

Construction (numerical form): Given \mathbf{a} \in \mathbb{R}^m to be zeroed below its first entry:

\alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2, \qquad \mathbf{v} = \mathbf{a} - \alpha\mathbf{e}_1 = \begin{pmatrix} a_1 - \alpha \\ a_2 \\ \vdots \\ a_m \end{pmatrix}, \qquad H = I - \frac{2}{\|\mathbf{v}\|_2^2}\mathbf{v}\mathbf{v}^\top, \quad H\mathbf{a} = \alpha\mathbf{e}_1

The sign convention (critical for stability): We choose \alpha = -\operatorname{sign}(a_1)\|\mathbf{a}\|_2 to avoid cancellation in computing v_1 = a_1 - \alpha. If a_1 > 0, choosing \alpha = -\|\mathbf{a}\|_2 gives v_1 = a_1 + \|\mathbf{a}\|_2 > a_1, avoiding the catastrophic subtraction that would occur with \alpha = +\|\mathbf{a}\|_2.

Cost of applying H: Computing H\mathbf{x} directly costs O(m^2) (matrix-vector product), but using the formula H\mathbf{x} = \mathbf{x} - 2\mathbf{v}(\mathbf{v}^\top\mathbf{x})/(\mathbf{v}^\top\mathbf{v}) costs only O(m) - first compute the scalar s = \mathbf{v}^\top\mathbf{x}, then update \mathbf{x} \leftarrow \mathbf{x} - (2s/\|\mathbf{v}\|^2)\mathbf{v}. This is the implicit representation of H.
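
A short NumPy check of the construction above (the test vector is arbitrary and householder_vector is a hypothetical helper name):

import numpy as np

def householder_vector(a):
    """Return (v, alpha) with H = I - 2 v v^T / (v^T v) mapping a to alpha * e1."""
    norm_a = np.linalg.norm(a)
    alpha = -np.sign(a[0]) * norm_a if a[0] != 0 else -norm_a
    v = a.astype(float).copy()
    v[0] -= alpha
    return v, alpha

a = np.array([3.0, 1.0, 2.0])
v, alpha = householder_vector(a)
H = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
print(H @ a)                           # -> [alpha, 0, 0]
print(np.allclose(H @ H, np.eye(3)))   # H is its own inverse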

4.3 Householder QR Algorithm

Algorithm: Apply Householder reflectors successively to zero out subdiagonal entries of each column.

for k = 1 to n:                                # n columns
    # Construct H_k to zero out A[k+1:m, k]
    a = A[k:m, k]                              # current column
    alpha = -sign(a[0]) * norm(a)
    v = a.copy(); v[0] -= alpha
    v /= norm(v)                               # normalized Householder vector
    # Apply H_k = I - 2 v v^T implicitly to A[k:m, k:n]
    A[k:m, k:n] -= 2 * outer(v, v.T @ A[k:m, k:n])
    # Store v below the diagonal (simplified form of the LAPACK convention,
    # which instead scales v so that v[0] = 1 and keeps a separate scalar tau)
    A[k+1:m, k] = v[1:]

After n steps, A has been overwritten with R in the upper triangle and the Householder vectors in the lower triangle (the implicit storage of Q).

Implicit Q representation: The full Q matrix is Q = H_1 H_2 \cdots H_n and has size m \times m. Forming Q explicitly costs O(mn^2) additional work and O(mn) storage. LAPACK's DORGQR computes the explicit Q when needed (e.g., for the thin QR, only the first n columns of Q are needed).

Complexity:

  • Applying H_k to the trailing submatrix: O((m-k)(n-k)) per step
  • Total: \sum_{k=1}^{n} O((m-k)(n-k)) \approx 2mn^2 - \frac{2n^3}{3} flops
  • For square m = n: \frac{4n^3}{3} flops (vs \frac{2n^3}{3} for LU)

Numerical stability: Householder QR is backward stable. The computed \hat{Q}, \hat{R} satisfy:

A + E = \hat{Q}\hat{R}, \quad \|E\|_F \leq c_{mn} u \|A\|_F

with c_{mn} a modest polynomial constant and u machine epsilon. The orthogonality error satisfies \|\hat{Q}^\top\hat{Q} - I\|_F = O(u), independent of \kappa(A).

LAPACK: DGEQRF computes the Householder QR in blocked form (block Householder, using the WY representation). DORMQR applies Q or Q^\top to a matrix without forming Q explicitly.

For AI: The Householder algorithm is used in PyTorch's torch.linalg.qr, in scipy's linalg.qr, and in LAPACK DGEQRF. It is the foundation for numerically reliable QR-based least-squares solvers used in linear probing, weight matrix analysis, and second-order methods.

4.4 Givens Rotations

Definition. A Givens rotation G(i,j,\theta) \in \mathbb{R}^{n \times n} is the identity matrix with a 2 \times 2 rotation embedded in rows and columns i and j:

G(i,j,\theta) = \begin{pmatrix} 1 & & & & \\ & c & \cdots & s & \\ & \vdots & \ddots & \vdots & \\ & -s & \cdots & c & \\ & & & & 1 \end{pmatrix}

with c = \cos\theta and s = \sin\theta placed so that c sits in positions (i,i) and (j,j), s in position (i,j), and -s in position (j,i).

Key property: G(i,j,\theta) is orthogonal. Left-multiplying \mathbf{x} by G rotates the (x_i, x_j) plane by angle \theta.

Zeroing a specific entry: Given \mathbf{x} with entries x_i and x_j, choose:

r = \sqrt{x_i^2 + x_j^2}, \quad c = x_i/r, \quad s = x_j/r

Then (G\mathbf{x})_i = r and (G\mathbf{x})_j = 0 - the rotation zeros out x_j while folding its magnitude into the new i-th entry r.

Numerically stable computation (LAPACK DLARTG):

if x_j == 0:
    c = 1, s = 0, r = x_i
elif |x_j| > |x_i|:
    t = x_i / x_j; s = 1/sqrt(1+t^2); c = s*t; r = x_j/s
else:
    t = x_j / x_i; c = 1/sqrt(1+t^2); s = c*t; r = x_i/c

This avoids overflow and underflow in computing \sqrt{x_i^2 + x_j^2}.

Cost: Applying G(i,j,\theta) on the left touches only rows i and j, costing about 6 flops per affected column (roughly 6n for an m \times n matrix). Constructing G costs O(1).
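
A tiny NumPy sketch of constructing and applying one rotation (givens is a hypothetical helper, following the sign convention above):

import numpy as np

def givens(xi, xj):
    """Return c, s such that [[c, s], [-s, c]] @ [xi, xj] = [r, 0]."""
    r = np.hypot(xi, xj)
    if r == 0.0:
        return 1.0, 0.0
    return xi / r, xj / r

x = np.array([3.0, 4.0])
c, s = givens(x[0], x[1])
G = np.array([[c, s], [-s, c]])
print(G @ x)                             # -> [5., 0.]
print(np.allclose(G @ G.T, np.eye(2)))   # rotations are orthogonal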

For AI: Givens rotations appear in updating QR factorizations when a new row is appended (streaming QR), which is used in online learning and in updating Cholesky factors after rank-1 additions (Cholesky rank-1 update, relevant to incremental GP regression).

4.5 Givens QR and Sparse Matrices

Algorithm: Apply Givens rotations sequentially to zero out the subdiagonal entries of A column by column. To zero A_{ij} (i > j), apply G(j, i, \theta) on the left.

For a dense m \times n matrix, the total number of Givens rotations needed is mn - n(n+1)/2 \approx mn, each costing O(n) flops (only two rows are touched) - total O(mn^2), roughly 50% more arithmetic than Householder QR for dense matrices.

Advantage for sparse matrices: Each Givens rotation touches exactly two rows and two columns. If A has a known sparsity pattern, Givens rotations can be sequenced to minimize fill-in (new nonzeros created by the transformation). In contrast, a single Householder reflector H_k is a rank-1 update that touches an entire panel, potentially creating dense fill.

Banded matrices: For a banded matrix with bandwidth b, Givens QR costs O(nb^2) instead of O(nb^2 + n^2 b) for Householder, and the resulting R remains banded. This is critical for signal processing and finite-element computations.

Online/streaming QR: When a new row \mathbf{a}^\top is appended to A, the existing QR factorization can be updated using a single sequence of n Givens rotations, costing O(n^2) rather than recomputing from scratch in O(mn^2).

For AI: Streaming QR via Givens is used in online learning settings where data arrives sequentially. It also underlies incremental SVD algorithms used for streaming PCA (e.g., in continual learning, where new tasks arrive without access to past data).

4.6 Column-Pivoted QR (RRQR)

Motivation: Standard QR doesn't reveal rank. The diagonal entries of R decay, but not necessarily in a way that makes rank determination reliable. Column-pivoted QR (RRQR) reorders the columns of A to expose rank structure in R.

Algorithm: At step k, before computing the k-th Householder reflector, swap column k with the column j \geq k having maximum 2-norm among the remaining columns. This produces:

AP = QR

where P is a permutation and R has the property that |R_{kk}| \geq |R_{k+1,k+1}| \geq \cdots \geq |R_{nn}|.

Rank estimation: If A has numerical rank r, then |R_{rr}| / |R_{r+1,r+1}| is large (theoretically \geq \sigma_r / \sigma_{r+1}, practically much larger). The threshold \tau for declaring rank is:

r = \max\{k : |R_{kk}| > \tau \cdot |R_{11}|\}

RRQR theorem (Golub 1965, Chandrasekaran & Ipsen 1994): The strong RRQR algorithm guarantees:

\sigma_k(A)^2 \leq \sigma_k(R_{11})^2 \cdot \left(1 + f(k, n-k, r)\right), \quad f = O(n)

ensuring that R_{11} (the leading r \times r block) captures essentially all of the energy of the matrix.

Connection to SVD: RRQR is a cheap (O(mn^2)) approximation to the SVD. It provides the same rank information and a good basis for the column space, but singular values only approximately. For exact singular values, use the SVD at O(mn^2 + n^3) cost.
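
A hedged sketch of rank estimation via SciPy's column-pivoted QR (the synthetic rank-10 matrix, the noise level, and the tolerance are all illustrative choices):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# Synthetic 200 x 50 matrix of numerical rank 10, plus tiny noise
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))
A += 1e-10 * rng.standard_normal(A.shape)

R, piv = qr(A, mode='r', pivoting=True)       # column-pivoted QR, R factor only
d = np.abs(np.diag(R))
tol = 1e-8
rank_est = int(np.sum(d > tol * d[0]))
print(rank_est)                               # expect 10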

For AI:

  • LoRA (Hu et al., 2022): Low-Rank Adaptation uses rank-r decompositions of weight updates \Delta W = BA. Choosing r requires rank estimation - RRQR is one approach.
  • DoRA (Liu et al., 2024): Decomposes weight matrices into magnitude and direction; RRQR is used to identify the principal components.
  • Intrinsic dimensionality: RRQR-based rank estimation reveals the intrinsic dimensionality of weight matrices, guiding compression decisions.

4.7 Tall-Skinny QR (TSQR)

Motivation: For matrices A \in \mathbb{R}^{m \times n} with m \gg n (tall and skinny - e.g., m = 10^9, n = 100), standard Householder QR communicates O(n^2) data between levels of the memory hierarchy at each of n steps, totaling O(n^3) words of communication. For distributed or GPU computation, this is the bottleneck.

TSQR algorithm (Demmel et al., 2008):

  1. Local factorization: Partition A into P panels A_1, \ldots, A_P (one per processor/GPU block).
  2. Local QR: Factor each A_i = Q_i R_i independently (no communication).
  3. Reduction tree: Stack pairs of local factors, \begin{pmatrix} R_1 \\ R_2 \end{pmatrix}, and factor the stack by QR to get R_{12}. Repeat up the tree.
  4. Result: R is the final upper triangular factor. Q can be recovered by traversing the reduction tree backward.

Communication cost: TSQR communicates O(n^2 \log P) words (vs O(mn \cdot n/b) for blocked Householder), achieving communication optimality.

For AI:

  • Distributed training: TSQR is used in distributed least-squares and in computing the QR of gradient matrices across multiple GPUs.
  • Randomized methods: TSQR underlies the "sketch-then-QR" approach for randomized low-rank factorization.
  • Attention-efficient computation: The block-tile structure of FlashAttention-3 (Shah et al., 2024) is inspired by the communication-optimal blocking patterns of TSQR.

4.8 QR for Least Squares

Problem: Given A \in \mathbb{R}^{m \times n} with m > n (overdetermined) and \mathbf{b} \in \mathbb{R}^m, find \mathbf{x}^* minimizing \|A\mathbf{x} - \mathbf{b}\|_2^2.

Method 1 (Normal equations): \mathbf{x}^* = (A^\top A)^{-1} A^\top \mathbf{b}. Form A^\top A, factor via Cholesky, solve. Cost: O(mn^2 + n^3). Problem: \kappa(A^\top A) = \kappa(A)^2 - squaring the condition number doubles the digits lost.

Method 2 (QR): Factor A = QR (thin QR), then:

\|A\mathbf{x} - \mathbf{b}\|_2^2 = \|QR\mathbf{x} - \mathbf{b}\|_2^2 = \|R\mathbf{x} - Q^\top\mathbf{b}\|_2^2 + \|(I - QQ^\top)\mathbf{b}\|_2^2

Minimizing over \mathbf{x}: solve R\mathbf{x} = Q^\top\mathbf{b} (upper triangular, backward substitution). Cost: O(mn^2) for the QR, O(mn) for Q^\top\mathbf{b}, O(n^2) for the backsolve. The condition number of R is \kappa(A), not \kappa(A)^2.
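
A minimal NumPy/SciPy sketch of Method 2, checked against the library least-squares solver (sizes are arbitrary):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

Q, R = np.linalg.qr(A)                            # thin QR: Q is 100 x 5, R is 5 x 5
x_qr = solve_triangular(R, Q.T @ b, lower=False)  # backward substitution
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_qr, x_ref))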

Stability comparison:

Method | Condition amplification | Cost | When to use
Normal equations + Cholesky | \kappa(A)^2 | O(mn^2 + n^3) | Well-conditioned A, m \gg n
QR (Householder) | \kappa(A) | O(mn^2) | General purpose, numerically safe
RRQR | \kappa(A), reveals rank | O(mn^2) | Rank-deficient or near-singular A
SVD (truncated) | None (sets small SVs to zero) | O(mn^2 + n^3) | Ill-conditioned, rank-deficient

For AI:

  • Linear probing: Fitting a linear classifier on top of frozen features is an overdetermined least-squares problem. sklearn.linear_model.LinearRegression uses LAPACK DGELSD (divide-and-conquer SVD) or DGELSY (column-pivoted QR).
  • Ridge regression: \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_2^2 is equivalent to the augmented least-squares system \begin{pmatrix} A \\ \sqrt{\lambda} I \end{pmatrix} \mathbf{x} = \begin{pmatrix} \mathbf{b} \\ \mathbf{0} \end{pmatrix}, which QR solves stably.
  • Attention weight regression: Computing attention pattern regression (e.g., for mechanistic interpretability) uses QR-based least squares.

5. Cholesky Factorization (Computational Recap)

5.1 Brief Overview

Full treatment: The complete theory of Cholesky factorization - existence proofs, LDL^T, modified Cholesky, log-determinant, connection to PSD cone - is in 07: Positive Definite Matrices. This section covers only the computational aspects not covered in 07: the relationship to LU, blocked algorithms, and LAPACK routines.

For a symmetric positive definite matrix A \succ 0, the Cholesky factorization is:

A = LL^\top

where L is lower triangular with positive diagonal entries L_{ii} > 0. The factorization exists and is unique for every A \succ 0 (Theorem from 07).

Why Cholesky for SPD systems:

  1. Efficiency: Cholesky costs \frac{n^3}{3} flops - exactly half of LU's \frac{2n^3}{3} - because symmetry halves the work.
  2. Storage: Only the lower triangle needs to be stored (n(n+1)/2 entries vs n^2 for LU).
  3. No pivoting needed: Positive definiteness guarantees L_{ii} > 0 at every step without any row interchanges.
  4. Stability: Cholesky is backward stable for SPD matrices without any pivoting.

5.2 Cholesky as Specialized LU

The connection between Cholesky and LU: for a symmetric positive definite A, the LU factorization (without pivoting, which exists since all leading principal minors of A are positive) gives A = LU. But A = A^\top implies LU = U^\top L^\top, so U = DL^\top where D = \operatorname{diag}(U_{11}, \ldots, U_{nn}). Thus A = LDL^\top (the LDL^T form). Since A \succ 0, all diagonal entries D_{ii} > 0, and we can define \tilde{L} = LD^{1/2} to get A = \tilde{L}\tilde{L}^\top - the Cholesky factorization.

Cholesky algorithm (jth column):

L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right), \quad i = j+1, \ldots, n

The argument of the square root must be positive - this is guaranteed by positive definiteness.
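
The column formulas translate directly into code; a minimal unblocked sketch, verified against NumPy (the SPD test matrix is constructed ad hoc):

import numpy as np

def cholesky_unblocked(A):
    """Column-by-column Cholesky: returns lower triangular L with A = L @ L.T."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]
        if s <= 0.0:
            raise np.linalg.LinAlgError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)          # symmetric positive definite by construction
print(np.allclose(cholesky_unblocked(A), np.linalg.cholesky(A)))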

Cost analysis:

  • Column j costs: 1 square root + O(j) flops for L_{jj}, plus (n-j) \cdot O(j) flops for the L_{ij}
  • Total: \sum_{j=1}^{n} O(nj) = O(n^3/3), exactly half of LU

5.3 LDL^T for Indefinite Systems

For symmetric indefinite matrices (not necessarily PD), Cholesky cannot be applied directly (negative square roots). The LDL^T factorization computes:

PAP^\top = LDL^\top

where L is unit lower triangular, D is block diagonal with 1 \times 1 and 2 \times 2 blocks (to handle negative eigenvalues), and P is a permutation.

Bunch-Kaufman pivoting: The 2 \times 2 blocks in D capture pairs of eigenvalues of opposite sign, avoiding the square root altogether. The pivoting strategy (Bunch & Kaufman 1977) ensures |L_{ij}| \leq (1 + \sqrt{17})/8 \approx 0.64 - a stability bound analogous to |L_{ij}| \leq 1 for partial pivoting in LU.

Applications in ML:

  • Indefinite Hessians: At saddle points (common early in neural network training), the Hessian is indefinite. LDL^T factors it without the positive-definiteness requirement, enabling second-order descent directions even at saddle points.
  • Modified Cholesky: Adding a diagonal shift \delta I to make A + \delta I \succ 0 before Cholesky factorization is the standard approach in L-BFGS-B and quasi-Newton methods. See 07 for the modified Cholesky algorithm.

LAPACK routine: DSYTRF implements LDL^T with Bunch-Kaufman pivoting.
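SciPy exposes this route through scipy.linalg.ldl, which wraps the same LAPACK factorization; a quick sketch on a small indefinite matrix (note that SciPy folds the permutation into the returned factor, so L @ D @ L.T reproduces A directly):

import numpy as np
from scipy.linalg import ldl

# A symmetric indefinite matrix (it has eigenvalues of both signs)
A = np.array([[ 2.0,  1.0,  0.0],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  2.0,  1.0]])

L, D, perm = ldl(A, lower=True)        # Bunch-Kaufman LDL^T
print(np.allclose(L @ D @ L.T, A))     # permutation is embedded in L
print(np.round(D, 3))                  # block diagonal: 1x1 and/or 2x2 blocks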

5.4 Blocked Cholesky and LAPACK dpotrf

Blocked Cholesky follows the same blocked structure as blocked LU. For block size b:

for k = 0 to n/b - 1:
    # Factor the diagonal block
    L_kk = cholesky(A[kb:(k+1)b, kb:(k+1)b])   # O(b^3)
    # Solve for the panel below: L_below @ L_kk^T = A[(k+1)b:n, kb:(k+1)b]   (DTRSM)
    L_below = solve_triangular(L_kk, A[(k+1)b:n, kb:(k+1)b].T, lower=True).T
    # Update trailing submatrix (symmetric rank-b update: DSYRK)
    A[(k+1)b:n, (k+1)b:n] -= L_below @ L_below.T

The trailing submatrix update (A -= L_below @ L_below.T) is a symmetric rank-b update, computed by the BLAS routine DSYRK (symmetric rank-k update), which exploits symmetry to halve the work relative to DGEMM.

LAPACK routine DPOTRF: The production Cholesky routine. On entry: the upper or lower triangle of A. On exit: L (or U for upper Cholesky). Returns an error flag INFO (0 for success; k > 0 if the k-th leading minor is not positive definite).
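
The INFO flag is visible from Python through SciPy's low-level LAPACK wrappers; a small sketch:

import numpy as np
from scipy.linalg.lapack import dpotrf

A_spd = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
c, info = dpotrf(A_spd, lower=1)
print(info)        # 0: success, c holds the Cholesky factor in its lower triangle

A_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])
c, info = dpotrf(A_indef, lower=1)
print(info)        # 2: the 2nd leading minor is not positive definite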

For AI: DPOTRF is called by:

  • scipy.linalg.cholesky (wraps LAPACK)
  • numpy.linalg.cholesky (wraps LAPACK)
  • torch.linalg.cholesky (calls cuSOLVER cusolverDnDpotrf on GPU)
  • Every GP regression library (GPyTorch, GPflow, etc.) for computing K^{-1}\mathbf{y} and \log\det K
