Matrix Decompositions: Part 6: Numerical Stability and Error Analysis to Appendix J: Extended Example - GP Hyperparameter Optimization

6. Numerical Stability and Error Analysis

6.1 Forward vs Backward Error

Forward error: The actual difference between the computed answer $\hat{\mathbf{x}}$ and the true answer $\mathbf{x}^*$:

$$\text{forward error} = \|\hat{\mathbf{x}} - \mathbf{x}^*\| / \|\mathbf{x}^*\|$$

Backward error: The smallest perturbation to the input data that makes $\hat{\mathbf{x}}$ an exact solution:

$$\text{backward error} = \min\{\|\delta A\| / \|A\| : (A + \delta A)\hat{\mathbf{x}} = \mathbf{b}\}$$

The fundamental inequality:

$$\text{forward error} \leq \kappa(A) \cdot \text{backward error} + O(u^2)$$

This separates the algorithm's contribution (backward error, controlled by stability) from the problem's inherent difficulty (the condition number $\kappa(A)$, intrinsic to the matrix). A backward-stable algorithm does the best possible: its forward error cannot be guaranteed better than roughly $\kappa(A) \cdot u$.

Why backward error is the right measure: A backward-stable algorithm produces the exact solution to a slightly perturbed problem. If the input data itself has error of order $u$ (which it does - all real data carries measurement or representation error), then backward-stable computation introduces no additional error beyond what the input uncertainty already implies.

Example: Computing $\hat{x} = (1.0 + 10^{-16}) / (1.0 + 10^{-16}) = 1.0$ in floating-point. The forward error is zero, and the backward error is also zero (the computation happens to be exact). Now compute $\hat{x} = 10^{-16} / 10^{-17} = 10.0$, where the true answer is $10.0 + O(10^{-16})$: the forward error is tiny, and so is the backward error.
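
A minimal NumPy sketch (on a synthetic random system, not an example from the text) that makes the separation concrete: the backward error of a library solve sits near the rounding level $u$, while the forward error is inflated by roughly $\kappa(A)$. The Rigal-Gaches normwise backward-error formula $\|\mathbf{r}\|/(\|A\|\,\|\hat{\mathbf{x}}\| + \|\mathbf{b}\|)$ is assumed here.

import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)        # LU with partial pivoting under the hood

# Forward error: distance from the known true solution
forward_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

# Normwise backward error (Rigal-Gaches): smallest relative perturbation
# of (A, b) for which x_hat is an exact solution
r = b - A @ x_hat
backward_err = np.linalg.norm(r) / (np.linalg.norm(A, 2) * np.linalg.norm(x_hat)
                                    + np.linalg.norm(b))

kappa = np.linalg.cond(A)
print(f"backward error {backward_err:.2e}  (~u)")
print(f"forward error  {forward_err:.2e}  (<= ~kappa * backward = {kappa * backward_err:.2e})")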

6.2 The Fundamental Theorem of Backward Stability

Theorem (Backward Stability, informal). An algorithm for computing $f(x)$ is backward stable if for every input $x$, the computed output $\hat{y}$ satisfies:

$$\hat{y} = f(x + \delta x) \quad \text{for some } \|\delta x\| \leq c \cdot u \cdot \|x\|$$

where $c$ is a modest polynomial in the problem size and $u$ is machine epsilon.

Corollary: For a backward stable algorithm, the forward error satisfies:

$$\frac{\|\hat{y} - f(x)\|}{\|f(x)\|} \leq \kappa_f \cdot c \cdot u$$

where $\kappa_f$ is the condition number of $f$. The condition number determines how much the backward perturbation $\delta x$ is amplified into forward error.

LU with partial pivoting: backward stable up to the growth factor $\rho_n$. Not backward stable in the worst case (since $\rho_n$ can reach $2^{n-1}$), but backward stable in practice (since $\rho_n$ rarely exceeds roughly $n^{2/3}$ for typical inputs).

Householder QR: backward stable unconditionally. The computed $\hat{Q}$ satisfies $\|\hat{Q}^\top\hat{Q} - I\| = O(u)$ regardless of $\kappa(A)$.

Cholesky (for SPD matrices): backward stable unconditionally. No pivoting is needed; positive definiteness guarantees numerical stability.

6.3 Stability Comparison: LU vs QR vs Cholesky

| Property | LU (partial pivot) | Householder QR | Cholesky (SPD) |
|---|---|---|---|
| Growth factor $\rho_n$ | $\leq 2^{n-1}$ (worst case) | 1 (exact) | 1 (exact) |
| Backward stable? | In practice, yes | Unconditionally | Unconditionally |
| Orthogonality loss | N/A | $O(u)$ | N/A |
| Requires pivoting? | Yes (partial pivot) | No (column pivot for RRQR) | No |
| Cost (leading term) | $\frac{2}{3}n^3$ | $2mn^2 - \frac{2}{3}n^3$ | $\frac{1}{3}n^3$ |
| Memory | $n^2$ | $mn$ (+ $n^2$ for $R$) | $\frac{n(n+1)}{2}$ |
| Handles all square matrices? | Yes (nonsingular) | Yes | No (SPD only) |

Rule of thumb:

  • Have a symmetric positive definite matrix? Use Cholesky (twice as fast as LU, always stable).
  • Have a general square system? Use LU with partial pivoting (standard default in all libraries).
  • Have an overdetermined least-squares problem? Use Householder QR.
  • Need rank determination or have a rank-deficient matrix? Use RRQR or SVD.
  • Have an ill-conditioned system where every digit matters? Use QR (avoids condition number squaring).

6.4 Mixed Precision and Iterative Refinement

Mixed precision computation: Modern GPUs (NVIDIA A100, H100) support FP64, FP32, FP16, and BF16 at different throughputs, typically in the ratio 1:2:8:8 (FP64 : FP32 : FP16 : BF16). Training AI models almost exclusively uses FP16 or BF16 to maximize throughput, but this reduces precision from $u \approx 10^{-16}$ (FP64) to $u \approx 10^{-3}$ (FP16).

Iterative refinement algorithm:

1. Factor PA ≈ LU in low precision (FP16 or FP32)
2. Solve A x_0 = b using L, U in low precision          -> x_0
3. For k = 0, 1, 2, ...:
   a. Compute residual r_k = b - A x_k      (in HIGH precision, FP64)
   b. Solve L U delta_k = P r_k             (in LOW precision - cheap)
   c. Update x_{k+1} = x_k + delta_k
4. Stop when ||r_k|| / ||b|| < eps_target

Convergence: Iterative refinement converges in $O(1)$ iterations if the factorization precision is high enough relative to $\kappa(A)$. Specifically, if $\kappa(A) \cdot u_{\text{factor}} < 1$, refinement converges to FP64 precision in 2-3 steps.

NVIDIA cuSOLVER IR (Iterative Refinement): Implements exactly this pattern: BF16 factorization (8x throughput vs FP64) + FP64 residual computation + BF16 solve for correction. Used in NVIDIA's deep learning solver libraries and in the mixed-precision training of large language models.

For AI: NVIDIA's "GMRES-IR" (GMRES Iterative Refinement) in cuSOLVER applies this to poorly conditioned systems arising in transformer inference optimization. K-FAC implementations use this to factor Fisher information blocks quickly in FP16 and refine to FP32 accuracy.

6.5 Ill-Conditioned Systems and Regularization

When $\kappa(A)$ is very large, even backward-stable algorithms produce solutions with large forward error. The cure is regularization - changing the problem to a nearby well-conditioned one.

Tikhonov regularization: Replace $A\mathbf{x} = \mathbf{b}$ with $\min_\mathbf{x} \|A\mathbf{x} - \mathbf{b}\|^2 + \lambda\|\mathbf{x}\|^2$. The solution is $\mathbf{x}_\lambda = (A^\top A + \lambda I)^{-1} A^\top \mathbf{b}$. The regularized condition number is:

$$\kappa(A^\top A + \lambda I) = \frac{\sigma_1^2 + \lambda}{\sigma_n^2 + \lambda}$$

which decreases as $\lambda$ increases. At $\lambda = \sigma_n^2$ the condition number is $\approx (\sigma_1/\sigma_n)^2 / 2 = \kappa^2/2$; at $\lambda = \sigma_1^2$ it is $\approx 2$.
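
A small sketch (synthetic matrix and shapes chosen for illustration) that evaluates the formula above for a few values of $\lambda$, showing the regularized condition number falling from $\approx \kappa^2/2$ at $\lambda = \sigma_n^2$ to $\approx 2$ at $\lambda = \sigma_1^2$.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 50))
s = np.linalg.svd(A, compute_uv=False)       # sigma_1 >= ... >= sigma_n

for lam in [0.0, s[-1]**2, 1.0, s[0]**2]:
    kappa_reg = (s[0]**2 + lam) / (s[-1]**2 + lam)
    print(f"lambda = {lam:10.3e}   kappa(A^T A + lambda I) = {kappa_reg:10.3e}")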

Truncated factorizations: Set all singular values (or diagonal entries of $R$ or $U$) below a threshold to zero. This is the basis of truncated SVD and RRQR-based pseudoinverse computation.

For AI: The Adam optimizer's $\epsilon$ parameter (default $10^{-8}$) acts as Tikhonov regularization of the per-parameter curvature (second-moment) estimate. K-FAC's damping parameter $\lambda$ similarly regularizes the Fisher information matrix to prevent ill-conditioned natural-gradient steps.


7. Blocked Algorithms and High-Performance Computing

7.1 Cache Hierarchy and Algorithm Design

Modern computer architecture has a deep memory hierarchy:

| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 cache | 32-64 KB | 4 cycles | 1 TB/s |
| L2 cache | 256-512 KB | 12 cycles | 400 GB/s |
| L3 cache | 8-32 MB | 40 cycles | 200 GB/s |
| DRAM | 16-512 GB | 100+ cycles | 50 GB/s |
| GPU HBM | 16-80 GB | 80 cycles | 2 TB/s |

The roofline model: An algorithm's performance is limited by the minimum of:

  1. Peak arithmetic throughput (FLOP/s)
  2. Memory bandwidth $\times$ arithmetic intensity (flops/byte)

A naive matrix-vector multiply $\mathbf{y} = A\mathbf{x}$ reads $n^2 + 2n$ words and performs $2n^2$ flops - arithmetic intensity $O(1)$. It is memory-bandwidth bound: DRAM bandwidth, not the arithmetic units, limits performance.

A matrix-matrix multiply $C = AB$ reads $2n^2$ words and performs $2n^3$ flops - arithmetic intensity $O(n)$. For large $n$ it is compute bound and can achieve near-peak FLOP/s.

The key insight for blocked algorithms: Reformulate the bottleneck step as a matrix-matrix multiply (BLAS-3), not a matrix-vector multiply (BLAS-2). This raises the arithmetic intensity from $O(1)$ to $O(n)$, giving up to an $O(n)$-fold speedup on modern hardware.
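
A rough timing sketch (numbers are machine-dependent; no warm-up or averaging is done) that usually shows the gap in delivered FLOP/s between a BLAS-2 matrix-vector product and a BLAS-3 matrix-matrix product of the same matrix.

import time
import numpy as np

n = 4096
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

t0 = time.perf_counter(); A @ x; t_mv = time.perf_counter() - t0
t0 = time.perf_counter(); A @ B; t_mm = time.perf_counter() - t0

print(f"matvec (BLAS-2): {2 * n**2 / t_mv / 1e9:8.1f} GFLOP/s  (bandwidth bound)")
print(f"matmul (BLAS-3): {2 * n**3 / t_mm / 1e9:8.1f} GFLOP/s  (compute bound)")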

7.2 Blocked Factorizations

All three factorizations (LU, QR, Cholesky) can be reformulated as blocked algorithms that expose BLAS-3 matrix-matrix multiplies as the dominant computation.

Block LU (Panel + Update):

A = [A_{11} A_{12}]    Factor: [L_{11}   0  ] [U_{11} U_{12}]
    [A_{21} A_{22}]            [L_{21}  L_{22}] [  0   U_{22}]

Step 1: Factor A_11 = L_{11} U_{11}           (small, BLAS-2)
Step 2: L_{21} = A_{21} U_{11}^{-1}           (DTRSM, BLAS-3)
Step 3: U_{12} = L_{11}^{-1} A_{12}           (DTRSM, BLAS-3)  
Step 4: A_{22} -= L_{21} U_{12}               (DGEMM, BLAS-3) <- dominates
Step 5: Recurse on A_{22}

WY representation for blocked Householder QR: A product of $b$ Householder reflectors $H_1 \cdots H_b$ can be written as $I - WY^\top$ with $W, Y \in \mathbb{R}^{m \times b}$. Applying $I - WY^\top$ to a matrix $C$ costs $O(mbn)$ using BLAS-3 DGEMM. LAPACK's DGEQRF uses this representation for blocked Householder QR.

Performance: On a modern CPU with AVX-512, blocked DGEMM achieves $>80\%$ of peak FLOP/s for $n > 500$. Without blocking (using BLAS-2), performance drops to $<10\%$ of peak. This roughly $8\times$ speedup from blocking is why production factorization libraries use block sizes of 64-256.

7.3 LAPACK Routine Reference

| Routine | Operation | Notes |
|---|---|---|
| DGETRF | LU with partial pivoting ($PA = LU$) | Returns pivot indices; in-place |
| DGETRS | Solve $AX = B$ using DGETRF output | Multiple right-hand sides |
| DGETRI | Compute $A^{-1}$ from DGETRF output | Rarely needed; use DGETRS instead |
| DGEQRF | QR factorization (Householder) | Returns H vectors; blocked (WY) |
| DORMQR | Apply $Q$ or $Q^\top$ from DGEQRF | Without forming Q explicitly |
| DORGQR | Form explicit Q from DGEQRF output | Costs $O(mn^2)$ extra |
| DGELSY | Least squares via column-pivoted QR | Rank-deficient safe |
| DPOTRF | Cholesky ($A = LL^\top$ or $A = U^\top U$) | Returns an INFO error if not SPD |
| DPOTRS | Solve $AX = B$ using DPOTRF output | SPD matrices only |
| DSYTRF | $LDL^\top$ for symmetric indefinite matrices | Bunch-Kaufman pivoting |
| DSYTRS | Solve using DSYTRF output | |

LAPACK naming convention: D = double precision, S = single, Z = complex double; GE = general, PO = positive definite, SY = symmetric; TRF = triangular factorization, TRS = triangular solve, TRI = triangular inverse.

In Python: scipy.linalg provides direct LAPACK access via scipy.linalg.lapack.dgetrf, etc. High-level wrappers (scipy.linalg.lu, scipy.linalg.qr, scipy.linalg.cholesky) call these automatically.

7.4 Sparse Factorizations

For sparse matrices (e.g., from finite-element discretization, graph Laplacians, or neural network weight matrices with structured sparsity), dense factorizations waste memory and computation on zeros.

Fill-in: When a sparse matrix is factored, the LL and UU factors typically contain more nonzeros than AA itself. The additional nonzeros are called fill-in.

Reordering to minimize fill-in:

  • Minimum degree ordering (AMD): Greedy heuristic - eliminate the variable connected to fewest others first. Approximate minimum degree (AMD) is the production algorithm.
  • Nested dissection: Recursively partition the matrix graph using separator sets; theoretically optimal for regular 2D grids ($O(n^{1.5})$ fill vs $O(n^2)$ without reordering).

Supernodal methods: CHOLMOD (Davis & Hager, 2009) uses "supernodes" - groups of columns with identical sparsity pattern - and applies dense BLAS-3 operations to each supernode, recovering high arithmetic intensity.

For AI:

  • Graph neural networks: The adjacency matrix of a graph is sparse. Solving GNN systems requires sparse Cholesky or LU.
  • Attention patterns: Sparse attention (Longformer, BigBird) creates structured-sparse weight matrices. Sparse LU factorization is used in some attention-efficient inference methods.
  • Finite-element physics simulation: Differentiable simulation requires backpropagating through sparse LU/Cholesky, implemented in JAX using jax.experimental.sparse.
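
A minimal sketch of a sparse direct solve with scipy.sparse.linalg.splu on a synthetic 1-D Laplacian (matrix and sizes are made up for illustration); comparing nonzero counts shows the fill-in incurred by the factorization.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100_000
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csc")   # tridiagonal Laplacian
b = np.ones(n)

lu = spla.splu(A)              # sparse LU with a fill-reducing column permutation
x = lu.solve(b)
print("residual:", np.linalg.norm(A @ x - b))
print("nnz(A) =", A.nnz, "  nnz(L) + nnz(U) =", lu.L.nnz + lu.U.nnz)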

8. Applications in Machine Learning

8.1 Least Squares via QR

Setup: The standard supervised learning problem with linear model $\hat{y} = \mathbf{w}^\top \mathbf{x}$ over $n$ training examples ($X \in \mathbb{R}^{n \times d}$, $\mathbf{y} \in \mathbb{R}^n$) reduces to the least-squares problem:

$$\mathbf{w}^* = \arg\min_\mathbf{w} \|X\mathbf{w} - \mathbf{y}\|_2^2$$

QR solution procedure:

  1. Factor $X = QR$ (thin QR, $Q \in \mathbb{R}^{n \times d}$, $R \in \mathbb{R}^{d \times d}$)
  2. Compute $\hat{\mathbf{y}} = Q^\top \mathbf{y}$ (the coordinates of $\mathbf{y}$ in the orthonormal basis for the column space of $X$)
  3. Solve $R\mathbf{w}^* = \hat{\mathbf{y}}$ by backward substitution
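
A minimal sketch of these three steps on synthetic data; the normal-equations solution is computed only for comparison, not as a recommended method.

import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
n, d = 10_000, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

Q, R = la.qr(X, mode='economic')                     # 1. thin QR
w_qr = la.solve_triangular(R, Q.T @ y, lower=False)  # 2.-3. project and backsolve

w_ne = np.linalg.solve(X.T @ X, X.T @ y)             # normal equations (kappa^2!)
print("difference vs normal equations:", np.linalg.norm(w_qr - w_ne))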

Why QR over normal equations in practice:

  • Normal equations form $X^\top X$, which has condition number $\kappa(X)^2$. For typical neural-network feature matrices ($\kappa \sim 10^4$-$10^6$), this squares to $10^8$-$10^{12}$, losing 8-12 digits of precision in double precision.
  • QR maintains condition number $\kappa(X)$ throughout - roughly twice as many correct digits.

Ridge regression via augmented QR: $\min \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2$ is equivalent to ordinary least squares on the augmented data

$$\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_d \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix}$$

Factor $\tilde{X} = QR$ and solve $R\mathbf{w} = Q^\top \tilde{\mathbf{y}}$. This is numerically superior to forming $(X^\top X + \lambda I)$ explicitly.

For AI: Linear probing (evaluating feature quality by fitting a linear classifier) and prompt tuning (fitting a linear head on frozen features) both reduce to this problem. The sklearn.linear_model.Ridge with solver='svd' or 'cholesky' uses exactly these methods.

8.2 Newton's Method and Second-Order Optimization

Newton's method for minimizing $\mathcal{L}(\boldsymbol{\theta})$ applies the update:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - H_t^{-1} \nabla\mathcal{L}(\boldsymbol{\theta}_t)$$

where $H_t = \nabla^2\mathcal{L}(\boldsymbol{\theta}_t)$ is the Hessian. This requires solving the linear system $H_t \boldsymbol{\delta} = \nabla\mathcal{L}$ at each step - a direct application of LU or Cholesky factorization.

Cost per step: $O(p^3)$ where $p = |\boldsymbol{\theta}|$. For GPT-3 with $p = 175 \times 10^9$, this is astronomically expensive. Full Newton is infeasible for large neural networks.

K-FAC (Kronecker-Factored Approximate Curvature; Martens & Grosse, 2015): Approximates the Fisher information matrix (a PSD proxy for the Hessian) as a block-diagonal matrix of Kronecker products:

$$F \approx \bigoplus_l A_l \otimes G_l$$

where $A_l \in \mathbb{R}^{d_l \times d_l}$ and $G_l \in \mathbb{R}^{d_{l+1} \times d_{l+1}}$ are computed from activations and pre-activation gradients respectively. Each block is inverted via the Kronecker identity:

$$(A_l \otimes G_l)^{-1} = A_l^{-1} \otimes G_l^{-1}$$

with $A_l^{-1}$ and $G_l^{-1}$ computed via Cholesky factorization of each factor separately.

Cost per K-FAC step: $O(\sum_l (d_l^3 + d_{l+1}^3))$ - a sum over layers, each requiring two Cholesky factorizations of small matrices. For a 10-layer network with hidden dimension 1024: $\approx 10 \times 2 \times 10^9 = 2 \times 10^{10}$ flops, feasible on modern hardware.
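
A sketch of one K-FAC block step for a single fully connected layer, with made-up batch statistics and a hypothetical damping value; the Kronecker-factored inverse is applied as two Cholesky solves, and the exact ordering of the $A$ and $G$ factors depends on the chosen vec convention.

import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
d_in, d_out, batch = 1024, 1024, 256

a = rng.standard_normal((batch, d_in))         # layer inputs (activations)
g = rng.standard_normal((batch, d_out))        # backpropagated pre-activation gradients
grad_W = g.T @ a / batch                       # raw gradient of the weight matrix

damping = 1e-3                                 # hypothetical damping
A = a.T @ a / batch + damping * np.eye(d_in)   # activation Kronecker factor
G = g.T @ g / batch + damping * np.eye(d_out)  # gradient Kronecker factor

# (A kron G)^{-1} vec(grad_W) corresponds to G^{-1} grad_W A^{-1};
# both inverses are applied via Cholesky solves, never formed explicitly.
cA, cG = la.cho_factor(A), la.cho_factor(G)
nat_grad = la.cho_solve(cG, la.cho_solve(cA, grad_W.T).T)
print(nat_grad.shape)                          # (d_out, d_in), same as grad_W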

Shampoo (Gupta et al., 2018; Anil et al., 2020): Maintains full-matrix preconditioners per layer:

$$G_t = G_{t-1} + g_t g_t^\top, \quad P_t = G_t^{-1/4}$$

The inverse fourth root $G_t^{-1/4}$ requires an eigendecomposition or iterative methods. The recent Distributed Shampoo (Google Brain, 2022) runs this in FP32 on accelerators, with Cholesky-based Newton iterations for the inverse matrix root.

8.3 Gaussian Process Inference at Scale

GP regression setup: Given observations $\mathbf{y} = f(X) + \boldsymbol{\varepsilon}$ with $f \sim \mathcal{GP}(0, k)$ and $\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I)$, the posterior predictive mean and covariance at test points $X_*$ are:

$$\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y}$$

$$\Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}$$

where $K_{nn} \in \mathbb{R}^{n \times n}$ is the training kernel matrix and $K_{*n} \in \mathbb{R}^{n_* \times n}$ is the test-train kernel matrix.

Cholesky for GP: The matrix $K_{nn} + \sigma^2 I$ is SPD (the positive noise term ensures this). The Cholesky factorization

$$K_{nn} + \sigma^2 I = LL^\top$$

enables efficient computation of:

  1. $\boldsymbol{\alpha} = (K_{nn} + \sigma^2 I)^{-1}\mathbf{y}$: one forward and one backward solve with $L$, costing $O(n^3)$ once (for the factorization) and $O(n^2)$ per additional right-hand side
  2. $\log p(\mathbf{y} \mid X, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i \log L_{ii} - \frac{n}{2}\log 2\pi$: the log marginal likelihood used for hyperparameter optimization

Bottleneck: For $n = 10{,}000$ training points, $K_{nn}$ is $10^4 \times 10^4$ and Cholesky costs $\frac{(10^4)^3}{3} \approx 3 \times 10^{11}$ flops - seconds on a GPU, but prohibitive for $n = 10^6$.
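
A minimal GP-regression sketch on synthetic 1-D data, using one Cholesky factorization for the solve, the predictive variance, and the log marginal likelihood (kernel, data, and hyperparameter values are all made up).

import numpy as np
import scipy.linalg as la

def rbf(X1, X2, ell=1.0, sf=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
Xs = np.linspace(-3, 3, 100)[:, None]
sigma2 = 0.1**2

K = rbf(X, X) + sigma2 * np.eye(len(X))
L = la.cholesky(K, lower=True)                     # K_nn + sigma^2 I = L L^T
alpha = la.cho_solve((L, True), y)                 # (K_nn + sigma^2 I)^{-1} y

Ks = rbf(Xs, X)
mu = Ks @ alpha                                    # posterior mean
V = la.solve_triangular(L, Ks.T, lower=True)
var = np.diag(rbf(Xs, Xs)) - np.sum(V**2, axis=0)  # posterior variance (diagonal)

loglik = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)
print("log marginal likelihood:", loglik)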

Scalable GP methods:

  • Inducing point methods (Titsias, 2009): Introduce $m \ll n$ inducing points; compute an $m \times m$ Cholesky instead of $n \times n$. GPyTorch implements this as gpytorch.models.ApproximateGP.
  • SKI/KISS-GP (Wilson & Nickisch, 2015): Structured kernel interpolation - exploit grid structure to decompose $K_{nn}$ as a Kronecker product, enabling $O(n)$ Cholesky-like solves.
  • CG + Lanczos: Conjugate-gradient-based methods (Gardner et al., 2018) avoid explicit Cholesky by solving the system iteratively; used in GPyTorch's LinearCG backend.

For AI: GP regression is the inference engine of Bayesian optimization (BoHB, Lindauer et al., 2022), which is used for hyperparameter tuning of large language models. Every BO iteration requires a Cholesky factorization.

8.4 Backpropagation Through Factorizations

Differentiating through Cholesky: If $A = LL^\top$ and $\ell = f(L)$ is a scalar loss, then:

$$\frac{\partial \ell}{\partial A} = L^{-\top}\,\Phi\!\left(L^\top \frac{\partial \ell}{\partial L}\right)L^{-1}$$

where $\Phi$ extracts the lower triangular part: $\Phi(M)_{ij} = M_{ij}$ for $i > j$, $\frac{1}{2}M_{ii}$ for $i = j$, and $0$ for $i < j$.

This formula (Iain Murray, 2016) is implemented in PyTorch's torch.linalg.cholesky with backward() support.

Differentiating through LU: For $A = P^{-1}LU$ and a loss that depends on the solution $\mathbf{x} = A^{-1}\mathbf{b}$:

$$\frac{\partial \ell}{\partial A} = -A^{-\top}\frac{\partial \ell}{\partial \mathbf{x}}\mathbf{x}^\top = -\boldsymbol{\lambda}\mathbf{x}^\top$$

where $\boldsymbol{\lambda} = A^{-\top}(\partial\ell/\partial\mathbf{x})$ is the adjoint vector (computed via triangular solves with the stored factors).

Implicit differentiation: Rather than differentiating through the factorization algorithm itself (which would unroll all $O(n^3)$ operations), implicit differentiation differentiates the solution condition $A\mathbf{x}^* = \mathbf{b}$:

$$A \frac{\partial\mathbf{x}^*}{\partial\boldsymbol{\theta}} = \frac{\partial\mathbf{b}}{\partial\boldsymbol{\theta}} - \frac{\partial A}{\partial\boldsymbol{\theta}}\mathbf{x}^*$$

This requires solving one additional linear system per output - $O(n^2)$ using the already-computed factorization.
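
A sketch of first-order sensitivity via implicit differentiation, reusing an existing LU factorization (the perturbation directions dA and db are arbitrary test inputs).

import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well conditioned for the demo
b = rng.standard_normal(n)
dA = 1e-6 * rng.standard_normal((n, n))            # perturbation direction in A
db = np.zeros(n)

lu, piv = la.lu_factor(A)
x = la.lu_solve((lu, piv), b)

# Implicit differentiation: A dx = db - dA x, an O(n^2) solve with the stored factors
dx = la.lu_solve((lu, piv), db - dA @ x)

# Finite-difference check
x_pert = la.solve(A + dA, b + db)
print("relative mismatch:", np.linalg.norm((x_pert - x) - dx) / np.linalg.norm(dx))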

For AI: Differentiating through linear solvers is central to:

  • Meta-learning: MAML and its variants differentiate through inner-loop optimization steps
  • Differentiable physics: FEA simulation (e.g., PhiML) requires differentiating through sparse LU solves
  • Constrained optimization: Interior-point methods for differentiable convex optimization
  • Gaussian processes in end-to-end learning: Differentiating the GP marginal likelihood w.r.t. kernel parameters requires differentiating through Cholesky

8.5 Randomized Factorizations

Motivation: For matrices where only a low-rank approximation is needed, exact factorization costs $O(mn^2)$ - wasteful if the rank $r \ll n$. Randomized methods achieve $O(mnr)$ or even $O(mn \log r)$ using random projections.

Randomized QR (sketch-and-apply):

  1. Sketch: $Y = A\Omega$ where $\Omega \in \mathbb{R}^{n \times (r+p)}$ is a random Gaussian matrix ($p \approx 10$ oversampling columns)
  2. Orthogonalize: Factor $Y = QR$ (thin QR of the sketch)
  3. Project: $B = Q^\top A$ (project $A$ onto the sketched subspace)
  4. Factor: $B = \tilde{Q}\tilde{R}$ (thin QR of the $(r+p) \times n$ matrix $B$)
  5. Output: $A \approx (Q\tilde{Q})\tilde{R}$ - a rank-$(r+p)$ QR approximation

Cost: Steps 1-4 cost $O(mn(r+p))$, negligible compared to $O(mn^2)$ for exact QR.

Error bound (Halko-Martinsson-Tropp, 2011): With probability $\geq 1 - 6p^{-p}$:

$$\|A - QQ^\top A\|_2 \leq \left(1 + 11\sqrt{r+p}\cdot\sqrt{\min(m,n)}\right)\sigma_{r+1}(A)$$

The approximation is near-optimal - within a polynomial factor of the Eckart-Young lower bound $\sigma_{r+1}(A)$.
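
A runnable sketch of steps 1-5 (synthetic low-rank test matrix; the rank, oversampling, and sizes are arbitrary), showing that the approximation error lands near $\sigma_{r+1}(A)$.

import numpy as np

def randomized_qr(A, r, p=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, r + p))   # 1. sketch
    Q, _ = np.linalg.qr(A @ Omega)            # 2. orthonormal basis for the sketch
    B = Q.T @ A                               # 3. project
    Qt, Rt = np.linalg.qr(B)                  # 4. thin QR of the (r+p) x n matrix
    return Q @ Qt, Rt                         # 5. A ~ (Q Qt) Rt

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((2000, 100)))
V, _ = np.linalg.qr(rng.standard_normal((300, 100)))
s = np.concatenate([np.logspace(2, 0, 20), 1e-8 * np.ones(80)])
A = (U * s) @ V.T                             # sharp spectral drop after rank 20

Qhat, Rhat = randomized_qr(A, r=20)
print("approximation error:", np.linalg.norm(A - Qhat @ Rhat, 2))
print("sigma_{r+1}:        ", s[20])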

Randomized LU (Yu et al., 2018): Applies random column sampling to produce a structured LU factorization useful for rank-deficient matrices.

Connection to LoRA (Hu et al., 2022): Low-Rank Adaptation of language models uses rank-$r$ updates $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. Initializing $B$ via randomized QR of the pretrained weight matrix gives a structured initialization that preserves the principal components.

DoRA (Liu et al., 2024): Decomposes $W = m \cdot (W/\|W\|_c)$ where $\|W\|_c$ is the column-wise norm. The directional component $W/\|W\|_c$ is adapted via LoRA. This uses the QR decomposition of each weight column.

GaLore (Zhao et al., 2024): Gradient Low-Rank Projection - projects gradients onto their principal subspace via randomized SVD before applying Adam updates. The projection matrix is recomputed via randomized QR every 200 steps.


9. Common Mistakes

| # | Mistake | Why it's wrong | Fix |
|---|---|---|---|
| 1 | Using LU without pivoting for numerical computation | Zero or small pivots cause catastrophic cancellation; the growth factor can reach $2^{n-1}$ | Always use partial pivoting (scipy.linalg.lu, LAPACK dgetrf) |
| 2 | Forming $A^{-1}$ explicitly to solve $A\mathbf{x} = \mathbf{b}$ | Computing $A^{-1}$ costs $O(n^3)$ and introduces additional round-off; $A^{-1}\mathbf{b}$ is less accurate than the triangular solve | Use scipy.linalg.solve or lu_solve with the stored factorization |
| 3 | Applying Cholesky to a non-SPD matrix | Negative square roots cause NaN; the algorithm may "succeed" silently with complex values or incorrect results | Check np.linalg.eigvalsh(A).min() > 0 or catch the LinAlgError from a failed Cholesky |
| 4 | Using normal equations for ill-conditioned least squares | $\kappa(A^\top A) = \kappa(A)^2$ - doubles the digits lost; for $\kappa \sim 10^6$, you lose all precision | Use scipy.linalg.lstsq (QR-based) or explicitly factor via Householder QR |
| 5 | Confusing thin and full QR | Full QR: $Q \in \mathbb{R}^{m \times m}$, $R \in \mathbb{R}^{m \times n}$. Thin QR: $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$. Using the wrong one gives wrong dimensions | Use mode='economic' in scipy.linalg.qr or mode='reduced' (the default) in numpy.linalg.qr for thin QR |
| 6 | Ignoring the growth factor in stability analysis | Partial pivoting has theoretical growth factor $2^{n-1}$; while rare in practice, adversarial inputs exist | For certified computation, use complete pivoting or switch to Householder QR |
| 7 | Re-factoring $A$ for each right-hand side | LU/QR factorization costs $O(n^3)$; each additional solve costs only $O(n^2)$ | Cache the factorization (lu_factor returns LU; qr returns Q, R) and call lu_solve or a backsolve repeatedly |
| 8 | Forgetting the permutation $P$ in $PA = LU$ | Solving $L(U\mathbf{x}) = \mathbf{b}$ instead of $L(U\mathbf{x}) = P\mathbf{b}$ gives the wrong answer | Always apply $P\mathbf{b}$ first: b = b[piv], or use lu_solve which handles the pivots automatically |
| 9 | Using classical Gram-Schmidt for QR instead of Householder | CGS loses orthogonality for ill-conditioned matrices; $\|\hat{Q}^\top\hat{Q} - I\| = O(\kappa(A) \cdot u)$ | Use scipy.linalg.qr (Householder) or numpy.linalg.qr |
| 10 | Misinterpreting rank from the LU diagonal | LU (even with partial pivoting) doesn't give a reliable rank estimate; the pivot threshold is ambiguous | Use RRQR (scipy.linalg.qr(A, pivoting=True)) or the SVD for rank determination |
| 11 | Ignoring mixed-precision issues in GPU factorizations | FP16 has $u = 5 \times 10^{-4}$; for $\kappa(A) > 10^4$, an FP16 factorization gives essentially no correct digits | Use FP32 or FP64 for factorizations; apply iterative refinement if FP16 is needed for speed |
| 12 | Differentiating through a factorization naively by unrolling | Unrolling $O(n^3)$ elimination steps creates $O(n^3)$ graph edges - memory explosion | Use implicit differentiation through the solution condition $A\mathbf{x} = \mathbf{b}$; PyTorch's linalg.solve handles this automatically |

10. Exercises

Exercise 1 (*): Implement forward and backward substitution.

(a) Write a function forward_sub(L, b) that solves $L\mathbf{x} = \mathbf{b}$ for unit lower triangular $L$ (no division needed).

(b) Write backward_sub(U, b) that solves $U\mathbf{x} = \mathbf{b}$ for upper triangular $U$.

(c) Verify your implementations against scipy.linalg.solve_triangular on a $5 \times 5$ system.

(d) Measure the accuracy: for a random $100 \times 100$ lower triangular system with entries from $\mathcal{N}(0,1)$, compute the residual $\|L\hat{\mathbf{x}} - \mathbf{b}\|_\infty / \|\mathbf{b}\|_\infty$ and compare it to machine epsilon.

Exercise 2 (*): Implement naive LU and demonstrate failure.

(a) Implement lu_naive(A) that performs Gaussian elimination without pivoting, returning $(L, U)$.

(b) For the matrix $A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 2 \end{pmatrix}$ with $\varepsilon = 10^{-16}$, compute $\hat{\mathbf{x}} = U^{-1}(L^{-1}\mathbf{b})$ for $\mathbf{b} = (1, 3)^\top$ using your naive LU.

(c) Compare to numpy.linalg.solve. Report the relative error $\|\hat{\mathbf{x}} - \mathbf{x}^*\|_\infty / \|\mathbf{x}^*\|_\infty$.

(d) Explain the failure in terms of the growth factor. What is $\rho_2$ for this matrix?

Exercise 3 (*): Implement LU with partial pivoting and verify $PA = LU$.

(a) Implement lu_pivot(A) returning $(P, L, U)$ where $P$ is stored as a permutation vector.

(b) Verify the factorization: compute $\|PA - LU\|_F$ and confirm it is $O(n \cdot u \cdot \|A\|_F)$.

(c) Use your factorization to solve a $50 \times 50$ random system and compare its accuracy to the naive LU.

(d) Compute $\det(A)$ from the LU factorization as $(-1)^s \prod_i U_{ii}$ and verify against numpy.linalg.det.

Exercise 4 (**): Implement a Householder reflector and apply it.

(a) Implement householder_vector(a) that computes the Householder vector $\mathbf{v}$ such that $H\mathbf{a} = \alpha\mathbf{e}_1$, with the correct sign convention to avoid cancellation.

(b) Implement householder_apply(v, C) that applies $H = I - 2\mathbf{v}\mathbf{v}^\top/\|\mathbf{v}\|^2$ to a matrix $C$ using the implicit formula (not the full $m \times m$ matrix).

(c) Verify: for a random vector $\mathbf{a} \in \mathbb{R}^5$, confirm that $H\mathbf{a} = \alpha\mathbf{e}_1$ (entries 2-5 should be zero to machine precision).

(d) Show that the naive computation (I - 2*v*v.T/norm(v)**2) @ a gives the same result but is less efficient. Time both for a large vector.

Exercise 5 (**): Implement Householder QR and compare to scipy.

(a) Implement householder_qr(A) that returns the upper triangular factor $R$ and stores the Householder vectors.

(b) Implement recover_Q(householder_vecs, m) to form the explicit $Q$ matrix.

(c) Verify $\|A - QR\|_F \leq c \cdot u \cdot \|A\|_F$ and $\|Q^\top Q - I\|_F \leq c \cdot u$ for a $50 \times 30$ matrix.

(d) Compare to scipy.linalg.qr. Verify they give identical $R$ (up to sign conventions) and identical orthogonality.

(e) For an ill-conditioned $A$ with $\kappa(A) = 10^8$, compare the orthogonality $\|Q^\top Q - I\|_F$ of your Householder QR against classical Gram-Schmidt.

Exercise 6 (**): Rank estimation via column-pivoted QR.

(a) Create a rank-4 matrix: $A = USV^\top$ where $S = \operatorname{diag}(100, 10, 1, 0.1, 0.001, 0.001, 0.001)$ and $U, V$ are random orthogonal.

(b) Compute QR with column pivoting: Q, R, P = scipy.linalg.qr(A, pivoting=True).

(c) Plot the diagonal entries $|R_{11}|, \ldots, |R_{77}|$ on a log scale. Identify the rank gap.

(d) Implement rank estimation: declare rank $r$ when $|R_{rr}|/|R_{11}| > \tau$ for threshold $\tau = 10^{-3}$.

(e) Compare to SVD-based rank estimation. Which is more reliable? Which is faster?

Exercise 7 (***): GP regression with Cholesky.

(a) Implement rbf_kernel(X1, X2, ell, sf) computing $k(x_i, x_j) = s_f^2 \exp(-\|x_i - x_j\|^2 / (2\ell^2))$.

(b) Generate 100 training points from $f(x) = \sin(x)$ with $\mathcal{N}(0, 0.1^2)$ noise.

(c) Implement gp_predict(X_train, y_train, X_test, ell, sf, sigma_n) using Cholesky factorization for all linear solves (no explicit matrix inverse).

(d) Compute the log marginal likelihood $\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^\top K^{-1}\mathbf{y} - \frac{1}{2}\log\det K - \frac{n}{2}\log 2\pi$ using the Cholesky factor: $\log\det K = 2\sum_i \log L_{ii}$.

(e) Optimize $(\ell, s_f)$ via grid search maximizing the log marginal likelihood. Plot the GP posterior mean and the $\pm 2\sigma$ credible interval.

Exercise 8 (***): Differentiating through Cholesky - gradient of the log-determinant.

(a) For an SPD matrix $A$, the gradient of $\log\det A$ w.r.t. $A$ is $A^{-1}$. Verify this numerically: compute $\nabla_A \log\det A$ via finite differences and via $L^{-\top}L^{-1}$ (using Cholesky), and compare.

(b) Implement cholesky_grad(L, dL) computing the gradient of a scalar function $f(L)$ (with given $\partial f/\partial L = \mathtt{dL}$) w.r.t. $A$, using the Iain Murray (2016) formula.

(c) Use your gradient to implement gradient descent on $\mathcal{L}(A) = -\log\det A + \operatorname{tr}(A)$ (the negative log-likelihood of a Wishart distribution). Starting from $A = 5I$, verify convergence to $A = I$ (the analytic minimum).

(d) Compare to PyTorch's automatic differentiation through torch.linalg.cholesky and torch.logdet. Verify that the gradients match to $10^{-5}$ relative error.


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
|---|---|
| LU with partial pivoting | Solving Hessian systems in Newton/quasi-Newton methods; computing inverses and log-determinants in probabilistic ML |
| Householder QR | Foundation of least-squares linear probing; basis for eigenvalue algorithms (QR iteration); stable weight-matrix analysis |
| Column-pivoted QR (RRQR) | Rank estimation of weight matrices in LoRA, DoRA; detecting intrinsic dimensionality of neural representations |
| Cholesky factorization (computational) | GP regression at scale; K-FAC training; VAE/diffusion sampling; Fisher information inversion |
| Blocked algorithms (BLAS-3) | 80-95% of peak GPU throughput for all factorizations; the reason FlashAttention uses tile-based blocking |
| Backward error analysis | Understanding precision limits in FP16/BF16 training; justifying mixed precision with iterative refinement |
| Iterative refinement | cuSOLVER IR: FP16 factorization + FP64 refinement for numerically reliable inference at high throughput |
| Sparse factorizations | GNN training on large graphs; differentiable physics simulation; structured attention patterns |
| QR for least squares | Linear probing, ridge regression, prompt-tuning accuracy; avoiding condition-number squaring in the normal equations |
| Differentiation through factorizations | End-to-end differentiable GP/Bayesian models; implicit differentiation in meta-learning; constraint-based optimization |
| TSQR | Communication-optimal QR for distributed training; tile-based blocking in FlashAttention-3 |
| Randomized LU/QR | LoRA initialization; GaLore gradient projection; sketch-and-solve preconditioning for large-scale optimization |
| K-FAC with Cholesky | Natural-gradient training of large transformers; 2-5x faster convergence than Adam at equivalent compute |
| GP marginal likelihood optimization | Bayesian hyperparameter tuning (BoHB); automated ML pipeline optimization |
| Rank-deficient LU / RRQR | Intrinsic dimensionality measurement of LLM weight updates; guided rank selection for compression |

12. Conceptual Bridge

From 07 (Positive Definite Matrices): Section 07 developed the theory of positive definite matrices and gave the full treatment of the Cholesky factorization: existence proof (the Cholesky product theorem), the LDL^T variant, the connection to log-determinants, and the PSD cone. This section builds directly on that theory, treating Cholesky as one of three production-grade computational tools alongside LU and QR. The student who understands 07's theory will see 08's Cholesky discussion as pure application - taking the mathematical theorem and implementing it efficiently.

The computational turn: Sections 01-07 of Chapter 3 developed mathematical structures: eigenvalues, SVD, PCA, linear maps, orthogonality, norms, and positive definiteness. Section 08 marks a turn toward the computational: algorithms, numerical stability, blocking, LAPACK. This turn is not a descent in abstraction but a different kind of mathematics - the mathematics of floating-point arithmetic, error propagation, and algorithm design under hardware constraints. Both kinds are essential for AI/ML practice.

Into Chapter 4 (Calculus): The matrix factorizations developed here are the computational backbone of calculus-based optimization. The Hessian matrix $H_f(\mathbf{x})$ (Chapter 4, 05) is positive definite at local minima - and therefore amenable to Cholesky. Newton's method (Chapter 4, 06) requires solving Hessian systems via LU or Cholesky at each step. The Jacobian matrix (Chapter 4, 04) appears in the normal equations for nonlinear least squares (Gauss-Newton, Levenberg-Marquardt), which are solved via QR. The machinery of 08 is the engine that makes calculus-based optimization computable.

Into Chapter 8 (Optimization): The optimization chapter (Chapter 8) develops gradient descent, Newton's method, interior-point methods, and constrained optimization - all of which require the factorizations of 08 in their inner loops. The K-FAC connection to natural gradient optimization, the QR connection to least-squares problems, and the Cholesky connection to GP-based Bayesian optimization are all elaborated with full optimization context in Chapter 8.

MATRIX DECOMPOSITIONS IN THE CURRICULUM
========================================================================

  Chapter 2: Linear Algebra Basics
  +-- 02 Matrix Operations -------- brief LU/QR/Cholesky preview
  +-- 04 Determinants ------------ det(A) = \prod U_ii from LU

  Chapter 3: Advanced Linear Algebra
  +-- 01 Eigenvalues -------------- QR iteration uses Householder QR
  +-- 02 SVD ---------------------- thin QR as step in SVD algorithm
  +-- 05 Orthogonality ------------ QR theory (Gram-Schmidt)
  +-- 07 Positive Definite -------- Cholesky THEORY (full treatment)
  +-- 08 Matrix Decompositions <-- YOU ARE HERE
       +-- LU (canonical home)      PA = LU with partial pivoting
       +-- QR algorithms            Householder, Givens, RRQR, TSQR
       +-- Cholesky (computational) Blocked dpotrf, LDL^T for indef.

  Chapter 4: Calculus Fundamentals
  +-- 04 Jacobian ----------------- Gauss-Newton needs QR
  +-- 05 Hessian ------------------ Cholesky at local minima

  Chapter 8: Optimization
  +-- Newton's method -------------- LU/Cholesky for Hessian systems
  +-- K-FAC ------------------------ Kronecker-Cholesky per layer
  +-- Interior point methods ------- Cholesky at each Newton step

  ML Applications (load-bearing)
  +-- GP regression ---------------- Cholesky (O(n^3) bottleneck)
  +-- K-FAC / Shampoo -------------- blocked Cholesky per layer
  +-- LoRA / GaLore ---------------- randomized QR for rank selection
  +-- Linear probing --------------- Householder QR least squares
  +-- Bayesian optimization -------- GP + Cholesky every step

========================================================================

The three factorizations of this section - LU, QR, and Cholesky - are not abstract mathematical objects. They are the algorithms that run inside every call to numpy.linalg.solve, scipy.linalg.qr, torch.linalg.cholesky, and LAPACK's dgesv, dgeqrf, dpotrf. Understanding them at this level - theorem, algorithm, numerical stability, and AI application - completes the computational linear algebra foundation needed for everything that follows.



Appendix A: Complete Algorithms and Pseudocode

A.1 LU with Partial Pivoting - Full Pseudocode

The following pseudocode gives the complete in-place LU factorization with partial pivoting. The matrix $A$ is overwritten: the strict lower triangle stores the multipliers $m_{ij}$ (the subdiagonal entries of $L$, since $L_{ii} = 1$ is implicit), and the upper triangle plus diagonal stores $U$.

Algorithm LU_PARTIAL_PIVOT(A):
    Input:  A \in \mathbb{R}^{n\timesn}
    Output: Modified A (lower: L multipliers, upper+diag: U), pivot vector piv

    piv = [1, 2, ..., n]   # permutation record

    for k = 1 to n-1:
        # Step 1: Find pivot in column k (rows k to n)
        max_val = |A[k,k]|
        max_row = k
        for i = k+1 to n:
            if |A[i,k]| > max_val:
                max_val = |A[i,k]|
                max_row = i

        # Step 2: Swap rows k and max_row
        if max_row \neq k:
            swap A[k, :] <-> A[max_row, :]
            swap piv[k] <-> piv[max_row]

        # Step 3: Check for (near-)singular pivot
        if |A[k,k]| < \epsilon:
            warn("Near-singular pivot at step k")

        # Step 4: Compute multipliers and update
        for i = k+1 to n:
            A[i,k] = A[i,k] / A[k,k]          # multiplier m_{ik}
            for j = k+1 to n:
                A[i,j] -= A[i,k] * A[k,j]     # rank-1 update

    return A, piv

Solve Ax = b using LU_PARTIAL_PIVOT:
    Compute A, piv = LU_PARTIAL_PIVOT(A)
    Apply permutation: b = b[piv]
    Forward solve Lx = b (L has unit diagonal, ignore A[k,k]):
        for i = 1 to n:
            for j = 1 to i-1:
                b[i] -= A[i,j] * b[j]
    Backward solve Ux = b:
        for i = n downto 1:
            for j = i+1 to n:
                b[i] -= A[i,j] * b[j]
            b[i] /= A[i,i]
    return b

Implementation notes:

  • The inner loop for j = k+1 to n: A[i,j] -= A[i,k] * A[k,j] is a SAXPY operation (scalar a times x plus y) - BLAS level 1, vectorizable.
  • Applying step 4 over all $i$ simultaneously is a rank-1 update A[k+1:n, k+1:n] -= m u^T - BLAS level 2 (DGER).
  • The blocked version batches $b$ steps into a panel and uses DGEMM for the trailing update.
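
A direct NumPy translation of the pseudocode above (a teaching sketch, not a replacement for LAPACK's dgetrf); the factored matrix and pivot vector are used exactly as in the solve procedure.

import numpy as np

def lu_partial_pivot(A):
    """In-place LU with partial pivoting: multipliers below the diagonal, U on and above."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))          # pivot search in column k
        if p != k:
            A[[k, p], :] = A[[p, k], :]
            piv[[k, p]] = piv[[p, k]]
        if A[k, k] == 0.0:
            continue                                  # (near-)singular column, skip
        A[k+1:, k] /= A[k, k]                         # multipliers m_ik
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # rank-1 trailing update
    return A, piv

def lu_pp_solve(LU, piv, b):
    x = b[piv].astype(float)
    n = LU.shape[0]
    for i in range(n):                                # forward solve (unit diagonal)
        x[i] -= LU[i, :i] @ x[:i]
    for i in range(n - 1, -1, -1):                    # backward solve
        x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
LU, piv = lu_partial_pivot(A)
print(np.linalg.norm(A @ lu_pp_solve(LU, piv, b) - b))   # small residual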

A.2 Householder QR - Full Pseudocode

Algorithm HOUSEHOLDER_QR(A):
    Input:  A \in \mathbb{R}^{m\timesn}, m \geq n
    Output: Modified A (upper triangle: R; lower triangle: Householder vectors),
            beta vector (scaling factors)

    beta = zeros(n)

    for k = 1 to n:
        # Extract column k from row k to m
        x = A[k:m, k]                        # length (m-k+1)

        # Compute Householder vector
        sigma = norm(x)
        if sigma == 0: continue               # column already zero

        # Sign convention: ensure v[0] and x[0] have opposite signs
        alpha = -sign(x[0]) * sigma           # target value
        v = copy(x)
        v[0] -= alpha
        beta_k = 2 / (v^T v)                  # = 2 / ||v||^2

        # Apply H_k to trailing submatrix A[k:m, k:n]
        # H_k A = A - (beta_k * v)(v^T A) = A - beta_k * v * (v^T A)
        w = beta_k * (v^T @ A[k:m, k:n])     # row vector, length (n-k+1)
        A[k:m, k:n] -= outer(v, w)

        # Store Householder vector in lower triangle
        A[k+1:m, k] = v[1:]                  # store v[1:] (v[0] = 1 implicit)
        beta[k] = beta_k

        # A[k,k] now equals alpha = -sign(x[0]) * norm(x)

    return A, beta

Recover Q (thin, first n columns):
    Q = I_{m\timesn}
    for k = n downto 1:
        # Reconstruct v from stored lower triangle
        v = [1; A[k+1:m, k]]                 # prepend implicit 1
        beta_k = beta[k]
        # Apply H_k to Q[k:m, k:n]
        w = beta_k * (v^T @ Q[k:m, k:n])
        Q[k:m, k:n] -= outer(v, w)
    return Q
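
A compact NumPy translation of the Householder QR pseudocode above (a teaching sketch; the stored-vector layout differs slightly from LAPACK's), compared against numpy.linalg.qr up to row signs.

import numpy as np

def householder_qr(A):
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.zeros((m, n))
    beta = np.zeros(n)
    for k in range(n):
        x = A[k:, k]
        sigma = np.linalg.norm(x)
        if sigma == 0.0:
            continue
        alpha = -np.sign(x[0]) * sigma if x[0] != 0 else -sigma   # avoid cancellation
        v = x.copy()
        v[0] -= alpha
        b = 2.0 / (v @ v)
        A[k:, k:] -= b * np.outer(v, v @ A[k:, k:])    # apply H_k to the trailing block
        V[k:, k] = v / v[0]                            # store with v[0] = 1 convention
        beta[k] = b * v[0]**2
    return np.triu(A[:n, :]), V, beta

def apply_qt(V, beta, y):
    """Compute Q^T y without forming Q."""
    y = y.astype(float).copy()
    for k in range(V.shape[1]):
        v = V[k:, k]
        y[k:] -= beta[k] * v * (v @ y[k:])
    return y

A = np.random.default_rng(0).standard_normal((50, 30))
R, V, beta = householder_qr(A)
_, R_ref = np.linalg.qr(A)
print(np.linalg.norm(np.abs(R) - np.abs(R_ref)))       # agree up to row signs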

A.3 Givens QR for Banded Matrices

For a matrix $A$ with upper and lower bandwidth $k_u$ and $k_\ell$ respectively:

Algorithm GIVENS_QR_BANDED(A, ku, kl):
    for j = 1 to n:              # column
        for i = j+kl downto j+1: # eliminate subdiagonal entries in bandwidth
            # Compute Givens rotation to zero A[i,j] using A[i-1,j]
            c, s, r = compute_givens(A[i-1,j], A[i,j])
            A[i,j] = 0
            A[i-1,j] = r
            # Apply G to remaining columns in bandwidth
            for jj = j+1 to min(j+ku+kl, n):
                temp = c*A[i-1,jj] + s*A[i,jj]
                A[i,jj] = -s*A[i-1,jj] + c*A[i,jj]
                A[i-1,jj] = temp
    return R (upper banded), Q (product of Givens rotations)

Complexity: $O(nk_\ell(k_\ell + k_u))$ - linear in $n$ for fixed bandwidth. This is the algorithm used in tridiagonal eigensolvers.


Appendix B: Deeper Theoretical Results

B.1 The Fundamental Theorem of Existence for LU

Theorem (Existence of LU, precise version). Let $A \in \mathbb{R}^{n \times n}$. The following are equivalent:

  1. $A$ has an LU factorization with $L$ unit lower triangular.
  2. All leading principal minors $\det(A_k) \neq 0$ for $k = 1, \ldots, n-1$.
  3. Gaussian elimination completes without encountering a zero pivot.

Proof of (1) $\Leftrightarrow$ (2): We use induction. Base case $n = 1$: trivial. Inductive step: write $A = \begin{pmatrix} A_{n-1} & \mathbf{b} \\ \mathbf{c}^\top & d \end{pmatrix}$. If $A_{n-1}$ has an LU factorization $A_{n-1} = L_{n-1}U_{n-1}$ (by induction, iff all leading minors of $A_{n-1}$ are nonzero), then:

$$A = \begin{pmatrix} L_{n-1} & \mathbf{0} \\ \mathbf{c}^\top U_{n-1}^{-1} & 1 \end{pmatrix} \begin{pmatrix} U_{n-1} & L_{n-1}^{-1}\mathbf{b} \\ \mathbf{0}^\top & s \end{pmatrix}$$

where $s = d - \mathbf{c}^\top A_{n-1}^{-1}\mathbf{b}$ is the Schur complement (see 07). This gives a valid LU factorization with all blocks well-defined. $\square$

Corollary: The Schur complement $s = d - \mathbf{c}^\top A_{n-1}^{-1}\mathbf{b}$ is the last pivot and equals $\det(A)/\det(A_{n-1})$. The pivot sequence in Gaussian elimination therefore directly encodes the leading minors via their ratios.

B.2 The Backward Error Theorem (Precise)

Theorem (Wilkinson, 1963). Let $\hat{L}, \hat{U}$ be the factors computed by Gaussian elimination with partial pivoting, applied to $PA$ (after row interchanges). Then there exists a matrix $E$ such that:

$$PA + E = \hat{L}\hat{U}, \qquad |E_{ij}| \leq n \cdot u \cdot g_n \cdot \max_{i,j}|A_{ij}| \quad \text{for all } i, j$$

where $u$ is unit roundoff and $g_n = \max_{i,j,k}|A_{ij}^{(k)}| / \max_{i,j}|A_{ij}|$ is the growth factor.

Corollary (Solution accuracy). The computed solution $\hat{\mathbf{x}}$ to $A\mathbf{x} = \mathbf{b}$ satisfies:

$$(A + \delta A)\hat{\mathbf{x}} = \mathbf{b}, \quad \|\delta A\|_\infty \leq c(n) \cdot u \cdot g_n \cdot \|A\|_\infty$$

The forward error bound then follows from the perturbation lemma:

$$\frac{\|\hat{\mathbf{x}} - \mathbf{x}^*\|_\infty}{\|\mathbf{x}^*\|_\infty} \leq \kappa_\infty(A) \cdot c(n)\, u\, g_n + O(u^2)$$

The fundamental lesson: error $=$ condition number $\times$ backward error. The algorithm controls the backward error; the problem controls the condition number.

B.3 Optimality of Householder QR

Theorem. Among all algorithms that compute the QR factorization via elementary orthogonal transformations, Householder QR minimizes the number of arithmetic operations asymptotically.

Sketch: Each Householder reflector zeros $m - k$ entries in column $k$ using $\approx 2(m-k)$ flops (inner product + SAXPY). Any other elementary orthogonal transformation (e.g., a sequence of Givens rotations) requires at least as many flops per zero created; Householder maximizes the work done per transformation.

Why Givens is sometimes preferred: Despite higher total flop count, Givens rotations are preferred for:

  • Sparse matrices (avoid creating fill-in in the reflection direction)
  • Parallel computation (rotations on disjoint row pairs can be parallelized)
  • Streaming data (new rows can be incorporated one at a time)

B.4 Perturbation Theory for Triangular Factorizations

Theorem (Stewart, 1977). If $A = LU$ (LU without pivoting, assuming it exists) and the perturbed matrix $A + \delta A$ has factors $L + \delta L$ and $U + \delta U$, then:

$$\frac{\|\delta L\|_F}{\|L\|_F},\ \frac{\|\delta U\|_F}{\|U\|_F} = O\!\left(\kappa(L)\,\kappa(U) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right)$$

The condition numbers $\kappa(L)$ and $\kappa(U)$ can independently be much larger than $\kappa(A)$, explaining why small perturbations to $A$ can cause large perturbations to $L$ and $U$ separately.

For QR: The Householder factors $Q_k$ have $\kappa(Q_k) = 1$ exactly (they are orthogonal). Therefore:

$$\frac{\|\delta R\|_F}{\|R\|_F} = O\!\left(\kappa(R) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right) = O\!\left(\kappa(A) \cdot \frac{\|\delta A\|_F}{\|A\|_F}\right)$$

The QR factors are perturbed in proportion to $\kappa(A)$, not $\kappa(A)^2$ as with the normal equations.


Appendix C: Connections to Other Decompositions

C.1 LU and the Spectral Decomposition

The LU factorization of $A$ is not directly related to the eigendecomposition $A = Q\Lambda Q^{-1}$, but the two share a common ancestor: block triangularization. The Schur decomposition $A = UTU^*$ (with $T$ upper triangular and $U$ unitary) is a complex-field generalization in which $T$ is upper triangular with the eigenvalues on its diagonal - the "LU" of the spectral world.

The QR algorithm for eigenvalues alternates QR factorizations and similarity transformations to drive $A$ toward upper triangular (Schur) form. Each QR step is one Householder QR followed by one matrix-matrix product. After convergence, the diagonal of $T$ gives the eigenvalues. This is the most important use of Householder QR in all of numerical linear algebra.

C.2 Cholesky and the SVD

For a symmetric positive definite matrix, the Cholesky factorization $A = LL^\top$ and the SVD/eigendecomposition $A = U\Lambda U^\top$ (symmetric PD, so $U$ is orthogonal and $\Lambda$ has positive diagonal) are related through an orthogonal factor:

$$L = U\Lambda^{1/2} Q$$

where $Q$ is the orthogonal factor in the QR decomposition $\Lambda^{1/2}U^\top = QR$ (equivalently, $L = R^\top$). Both $L$ and $U\Lambda^{1/2}$ are "square roots" of $A$ in the sense $MM^\top = A$; they differ only by an orthogonal transformation on the right. The key point is that Cholesky is a non-orthogonal (triangular) factorization while the SVD is orthogonal; they capture different aspects of the same structure.

Practical consequence: The Cholesky factor $L$ is not related to the eigenvectors of $A$ in a simple way. Computing the symmetric matrix square root $A^{1/2}$ (which requires the eigenvectors) is more expensive than Cholesky.

C.3 QR and the Gram-Schmidt-Cholesky Identity

The Gram-Schmidt process applied to the columns of $A$ produces the thin QR factorization $A = QR$. The normal equations then give $A^\top A = R^\top Q^\top Q R = R^\top R$ - a Cholesky factorization of the Gram matrix! This identity $A^\top A = R^\top R$ explains why:

  • QR of $A$ and Cholesky of $A^\top A$ produce the same $R$ (up to the signs of its rows)
  • Computing the QR of $A$ is equivalent to computing the Cholesky of $A^\top A$ in exact arithmetic
  • In floating-point, QR is more stable (it works with $\kappa(A)$) while Cholesky of $A^\top A$ works with $\kappa(A)^2$
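
A short numerical check of the identity (synthetic matrix): the $R$ from a thin QR of $A$ matches the Cholesky factor of $A^\top A$ up to the signs of its rows.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30))

_, R_qr = np.linalg.qr(A)                      # R from thin QR of A
R_chol = np.linalg.cholesky(A.T @ A).T         # upper factor of A^T A = R^T R

signs = np.sign(np.diag(R_qr))                 # Cholesky fixes a positive diagonal
print(np.linalg.norm(signs[:, None] * R_chol - R_qr))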

C.4 CUR and Interpolative Decompositions

Beyond LU, QR, and Cholesky, more recent work has developed structure-revealing factorizations that select actual columns and rows of $A$:

CUR decomposition: $A \approx CUR$ where $C$ contains a subset of the columns of $A$, $R$ contains a subset of the rows, and $U$ is a small "bridge" matrix. Unlike LU/QR, $C$ and $R$ are actual data columns and rows - interpretable and memory-efficient.

Interpolative decomposition (ID): $A \approx A_{:,J} B$ where $J$ is a subset of column indices and $B$ is a well-conditioned matrix. The ID is computed via column-pivoted QR: the pivot columns from RRQR form $J$.

For AI: CUR and ID are used in interpretability research (identifying the "canonical" data points or features that the model attends to), in structured pruning (removing weight columns corresponding to small pivots), and in efficient attention (selecting key "skeleton" positions for linear-complexity attention).


Appendix D: Implementation Reference

D.1 Python/NumPy/SciPy API

import numpy as np
import scipy.linalg as la

# === LU Factorization ===
# scipy.linalg.lu (returns P, L, U as separate matrices)
P, L, U = la.lu(A)
# verify: A \approx P @ L @ U

# scipy.linalg.lu_factor (returns compact (LU, piv) for solving)
lu, piv = la.lu_factor(A)
x = la.lu_solve((lu, piv), b)     # solves Ax = b

# === QR Factorization ===
# Full QR (Q is m x m)
Q, R = la.qr(A)                   # default: full
# Thin/economy QR (Q is m x n for an m x n matrix A)
Q, R = la.qr(A, mode='economic')

# Column-pivoted QR (rank-revealing)
Q, R, P = la.qr(A, pivoting=True)
# verify: A[:, P] \approx Q @ R  (P is permutation array)
rank_est = np.sum(np.abs(np.diag(R)) > 1e-10 * np.abs(R[0,0]))

# Apply Q^T without forming Q
# (use the raw tau output for LAPACK-level efficiency)

# === Cholesky ===
L = la.cholesky(A, lower=True)    # A = L L^T
x = la.cho_solve((L, True), b)    # solves Ax = b

# === Triangular solves ===
x = la.solve_triangular(L, b, lower=True)   # solves Lx = b
x = la.solve_triangular(U, b, lower=False)  # solves Ux = b

D.2 PyTorch API

import torch

A = torch.randn(n, n, dtype=torch.float64)
A = A @ A.T + torch.eye(n) * 0.1  # make SPD

# LU
LU, pivots = torch.linalg.lu_factor(A)
x = torch.linalg.lu_solve(LU, pivots, b.unsqueeze(-1)).squeeze(-1)

# QR
Q, R = torch.linalg.qr(A, mode='reduced')  # thin QR

# Cholesky
L = torch.linalg.cholesky(A)
x = torch.cholesky_solve(b.unsqueeze(-1), L).squeeze(-1)

# All support autograd:
A.requires_grad_(True)
L = torch.linalg.cholesky(A)
loss = torch.logdet(A)
loss.backward()   # computes grad via implicit differentiation

D.3 JAX API

import jax
import jax.numpy as jnp
import jax.scipy.linalg as jla

# LU (packed factors + pivots; gradient support for LU is limited in standard JAX)
lu, piv = jla.lu_factor(A)

# QR
Q, R = jnp.linalg.qr(A, mode='reduced')

# Cholesky (supports JIT, vmap, grad)
L = jnp.linalg.cholesky(A)
# Solve via triangular
y = jla.solve_triangular(L, b, lower=True)
x = jla.solve_triangular(L.T, y, lower=False)

# Differentiating log-det through Cholesky:
def log_det_spd(A):
    L = jnp.linalg.cholesky(A)
    return 2 * jnp.sum(jnp.log(jnp.diag(L)))

grad_A = jax.grad(log_det_spd)(A)  # should equal A^{-1}

D.4 Performance Benchmarks (n = 1000, FP64)

| Operation | NumPy (CPU) | PyTorch (CPU) | PyTorch (A100 GPU) |
|---|---|---|---|
| LU factorization | 85 ms | 78 ms | 3 ms |
| QR factorization | 210 ms | 195 ms | 8 ms |
| Cholesky | 35 ms | 32 ms | 1.5 ms |
| Triangular solve | 12 ms | 10 ms | 0.5 ms |
| Matrix multiply (DGEMM) | 40 ms | 38 ms | 0.8 ms |

Approximate values on typical hardware (2024). GPU speedups increase for larger nn as arithmetic intensity grows.


Appendix E: Error Accumulation in Practice

E.1 Condition Number Estimation

Computing the exact condition number $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_1/\sigma_n$ requires computing all the singular values - $O(n^3)$ work. LAPACK provides cheap condition-number estimators based on norm estimation:

LAPACK DGECON: Given the LU factorization of $A$ (already computed for solving), estimates $\kappa_1(A) = \|A\|_1 \|A^{-1}\|_1$ in $O(n^2)$ via a sequence of triangular solves. The estimate is typically within a factor of 10 of the true condition number.

Rule of thumb: If DGECON returns $\kappa \approx 10^k$, you lose approximately $k$ digits of precision. In FP64 ($u \approx 10^{-16}$) you have about $16 - k$ correct digits; in FP32 ($u \approx 10^{-7}$), about $7 - k$.

For AI: PyTorch's torch.linalg.cond computes the exact condition number via the SVD. For a quick check without a full SVD: torch.linalg.matrix_norm(A) * torch.linalg.matrix_norm(torch.linalg.inv(A)).

E.2 Diagnosis of Numerical Failures

| Symptom | Likely cause | Diagnosis | Fix |
|---|---|---|---|
| LinAlgError: Singular matrix | Zero pivot encountered | $A$ is singular or nearly so | Check np.linalg.matrix_rank(A) |
| nan in output | Zero or tiny pivot in LU | No pivoting, tiny diagonal | Use partial pivoting |
| Solution with large residual $\|A\hat{x}-b\| \gg u\|A\|\|\hat{x}\|$ | High condition number | Compute $\kappa(A)$ | Regularize or use iterative refinement |
| LinAlgError from cholesky | Matrix not SPD | Eigenvalues not all positive | Check np.linalg.eigvalsh(A).min() |
| Q not orthogonal: $\|Q^\top Q - I\| \gg u$ | Classical Gram-Schmidt used | Compute $\|Q^\top Q-I\|_F$ | Switch to Householder QR |
| Slow factorization | Block size mismatch | Profile the BLAS calls | Tune the block size; use BLAS-3 routines |

E.3 Iterative Refinement Implementation

def iterative_refinement(A, b, max_iter=3, tol=1e-14):
    """
    Solve Ax = b with mixed-precision iterative refinement:
    factor and correct in FP32, accumulate and compute residuals in FP64.
    """
    import numpy as np
    import scipy.linalg as la

    # Factor in FP32 (low precision - cheap)
    A32 = A.astype(np.float32)
    lu, piv = la.lu_factor(A32)

    # Initial solve in FP32, promoted to FP64 for accumulation
    x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    for k in range(max_iter):
        # Residual in FP64 (high precision)
        r = b - A @ x
        r_norm = np.linalg.norm(r) / np.linalg.norm(b)
        print(f"  Iter {k}: relative residual = {r_norm:.3e}")

        if r_norm < tol:
            break

        # Correction solve in FP32 (cheap, reuses the factorization)
        d = la.lu_solve((lu, piv), r.astype(np.float32))
        x = x + d.astype(np.float64)

    return x

Appendix F: The Geometry of Factorizations

F.1 LU as a Basis Change

The LU factorization $PA = LU$ can be interpreted geometrically. The unit lower triangular matrix $L$ represents a shearing transformation - it maps the standard basis $\{\mathbf{e}_i\}$ to the columns of $L$, which form a "lower triangular" basis. The upper triangular $U$ then holds the coordinates of the rows of $A$ in this new basis.

Concretely: each step of Gaussian elimination adds multiples of row $k$ to the rows below it, which is a shear in the row space of $A$. The cumulative effect is a change to a triangular basis - the essence of LU.

F.2 QR as a Rotation to Triangular Form

The Householder QR factorization $A = QR$ is geometrically a rotation of the column space of $A$ to align with the coordinate axes. Each Householder reflector $H_k$ reflects the $k$-th "residual" column to align with $\mathbf{e}_k$, zeroing out all entries below the diagonal.

After $n$ reflections, the matrix $Q = H_1 H_2 \cdots H_n$ is a composition of orthogonal reflections - itself orthogonal - that has rotated $A$ to upper triangular form $R$.

Geometric insight: The columns of $Q$ form an orthonormal basis for the column space of $A$. The entries of $R$ give the coordinates of the original columns of $A$ in this orthonormal basis. This is the Gram-Schmidt process, done via reflections rather than projections.

F.3 Cholesky as the Matrix Square Root Factorization

For $A \succ 0$, the Cholesky factor $L$ satisfies $A = LL^\top$ - making $L$ a "matrix square root" of $A$ in a specific sense. Unlike the symmetric square root $A^{1/2} = Q\Lambda^{1/2}Q^\top$ (which is symmetric), the Cholesky factor $L$ is lower triangular.

Geometric interpretation: The linear map $T(\mathbf{x}) = L\mathbf{x}$ transforms the unit ball $\{\mathbf{x} : \|\mathbf{x}\| \leq 1\}$ into the ellipsoid $\{\mathbf{y} : \mathbf{y}^\top A^{-1} \mathbf{y} \leq 1\}$ - the level set of the quadratic form $q(\mathbf{y}) = \mathbf{y}^\top A^{-1}\mathbf{y}$. This is why Cholesky factorization is the foundation of sampling from $\mathcal{N}(\boldsymbol{\mu}, A)$: if $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$, then $L\mathbf{z} + \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}, LL^\top) = \mathcal{N}(\boldsymbol{\mu}, A)$.

Uniqueness: The Cholesky factor $L$ with positive diagonal entries is unique for every $A \succ 0$. The symmetric square root $A^{1/2}$ is also unique and PSD, but it is not the same as $L$ (they differ by an orthogonal factor).

F.4 Triangular Factorizations and the Flag Manifold

The classical factorizations can be read as decompositions of the group GL_n(\mathbb{R}) of invertible n \times n matrices:

  • LU factorizations correspond to the decomposition GLn=BB+GL_n = B_- \cdot B_+ where BB_- (lower triangular) and B+B_+ (upper triangular) are Borel subgroups. The "generic" element of GLnGL_n has a unique LU (the open Bruhat cell).
  • QR factorizations correspond to GLn=O(n)B+GL_n = O(n) \cdot B_+ where O(n)O(n) is the orthogonal group. This decomposition is always unique (no exceptional cases, unlike LU).
  • The flag manifold Fl(n)=GLn/B+Fl(n) = GL_n / B_+ parameterizes complete flags V1V2Vn=RnV_1 \subset V_2 \subset \cdots \subset V_n = \mathbb{R}^n and is the natural geometric setting for both LU and QR.

This connection to Lie group theory underlies the deep relationship between matrix factorizations and the geometry of symmetric spaces - a topic developed in Chapter 12 (Functional Analysis).


Appendix G: Randomized Algorithms - Deeper Theory

G.1 The Randomized SVD via QR

The Halko-Martinsson-Tropp (2011) algorithm computes a rank-rr approximation of ARm×nA \in \mathbb{R}^{m \times n} in O(mnr)O(mnr) operations:

import numpy as np

def randomized_svd(A, r, p=10):
    # Oversampling: target rank r + p (p ~ 10)
    Omega = np.random.randn(A.shape[1], r + p)         # Gaussian sketch matrix, n x (r+p)
    Y = A @ Omega                                      # Sketch: Y is m x (r+p)
    Q, _ = np.linalg.qr(Y)                             # Orthonormalize: Q is m x (r+p)
    B = Q.T @ A                                        # Project: B is (r+p) x n
    U, S, Vt = np.linalg.svd(B, full_matrices=False)   # SVD of the small matrix
    U = Q @ U                                          # Lift back to the full space
    return U[:, :r], S[:r], Vt[:r, :]

Error bound: With probability 1δ\geq 1 - \delta:

AUΣV2(1+rp1+er+ppmin(m,n)r)σr+1(A)\|A - U\Sigma V^\top\|_2 \leq \left(1 + \sqrt{\frac{r}{p-1}} + \frac{e\sqrt{r+p}}{p}\sqrt{\min(m,n) - r}\right)\sigma_{r+1}(A)

For p=10p = 10: error is 2σr+1(A)\approx 2\sigma_{r+1}(A) with probability >99%> 99\%.

Power iteration improvement: For matrices with slowly decaying singular values, applying Y=(AA)qAΩY = (AA^\top)^q A\Omega (with q=1q = 1 or 22 power iterations) dramatically improves accuracy:

AA^r2(σr+1(A)σr(A))2q ⁣ ⁣(base error)\|A - \hat{A}_r\|_2 \leq \left(\frac{\sigma_{r+1}(A)}{\sigma_r(A)}\right)^{2q}\!\!\cdot \text{(base error)}

For AI - GaLore (Zhao et al., 2024): GaLore uses randomized SVD to project gradient matrices GtRm×nG_t \in \mathbb{R}^{m \times n} onto their principal rr-dimensional subspace every T=200T = 200 steps. The projection matrix PtRn×rP_t \in \mathbb{R}^{n \times r} is computed via randomized SVD of the gradient, then Adam is applied to the projected gradients GtPtRm×rG_t P_t \in \mathbb{R}^{m \times r}. This reduces optimizer memory by n/rn/r while maintaining training quality.
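
A schematic sketch of this projection step (the function name and oversampling value are illustrative assumptions, not the GaLore reference implementation):

import numpy as np

def project_gradient(G, r, p=8):
    """Rank-r right-projection of a gradient matrix G (m x n), GaLore-style."""
    Omega = np.random.randn(G.shape[1], r + p)       # Gaussian sketch
    Q, _ = np.linalg.qr(G @ Omega)                   # basis for the dominant range of G
    _, _, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    P = Vt[:r].T                                     # n x r projection matrix
    return P, G @ P                                  # projected gradient, m x r, fed to Adam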

G.2 Structured Random Matrices

Instead of dense Gaussian Ω\Omega, structured random matrices enable faster sketching:

Matrix | Sketch cost | Randomness | Error bound
Dense Gaussian | O(mnr) | Full i.i.d. | Optimal
SRFT (subsampled random Fourier) | O(mn\log r) | Hadamard + sampling | Near-optimal
Sparse embedding | O(\mathrm{nnz}(A) \cdot r) | Hash functions | Near-optimal for sparse A
CountSketch | O(\mathrm{nnz}(A)) | Hash functions | Suboptimal but fast

SRFT (a fast Johnson-Lindenstrauss transform): \Omega = \sqrt{n/r} \cdot DFS, where D is a random diagonal \pm 1 matrix, F is the DFT matrix, and S restricts to r coordinates sampled uniformly at random. Applying \Omega to a row of A costs O(n\log n) via the FFT.
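
A small sketch of an SRFT-style sketch applied via the FFT (the helper name is illustrative; a real-valued variant would use a Hadamard transform in place of the complex DFT):

import numpy as np

def srft_sketch(A, ell, seed=0):
    """Compute Y = A @ Omega with Omega = sqrt(n/ell) * D F S, using the FFT."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=n)           # D: random diagonal +-1 signs
    cols = rng.choice(n, size=ell, replace=False)     # S: ell coordinates sampled uniformly
    F = np.fft.fft(A * signs, axis=1) / np.sqrt(n)    # rows of (A D) times the unitary DFT
    return np.sqrt(n / ell) * F[:, cols]              # complex m x ell sketch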


Appendix H: Software Ecosystem

H.1 BLAS and LAPACK Hierarchy

User Code (Python/Julia/MATLAB)
    ↓
SciPy / NumPy / PyTorch / JAX (Python wrappers)
    ↓
LAPACK (high-level: LU, QR, Cholesky, eigensolvers)
    ↓
BLAS Level 3 (DGEMM, DSYRK, DTRSM - blocked, cache-optimal)
BLAS Level 2 (DGEMV, DGER - matrix-vector)
BLAS Level 1 (DDOT, DAXPY, DNRM2 - vector)
    ↓
Hardware-optimized BLAS (MKL, OpenBLAS, cuBLAS, BLIS)
    ↓
CPU/GPU Tensor Cores / AVX-512 / AMX tiles

Key implementations:

  • Intel MKL: Optimized for Intel CPUs; 2-3x faster than reference BLAS for blocked operations
  • OpenBLAS: Open source, near-MKL performance on x86; default in many Linux distributions
  • BLIS (BLAS-like Library Instantiation Software): Framework for portable high-performance BLAS; used by many modern CPU vendors
  • cuBLAS: NVIDIA GPU BLAS; powers all PyTorch/JAX GPU linear algebra
  • cuSOLVER: NVIDIA GPU LAPACK; powers torch.linalg on NVIDIA hardware
  • rocBLAS / rocSOLVER: AMD GPU equivalents

H.2 Choosing the Right Routine

Need to solve Ax = b?
+-- A is SPD? -> DPOTRF (Cholesky) + DPOTRS
+-- A is symmetric indefinite? -> DSYTRF (LDL^T) + DSYTRS
+-- A is general square? -> DGESV (= DGETRF + DGETRS, partial pivot)
+-- A is triangular? -> DTRTRS (triangular solve directly)

Need to solve min ||Ax - b||?
+-- A is full-rank, well-conditioned? -> DGELS (QR-based least squares)
+-- A may be rank-deficient? -> DGELSY (column-pivoted QR)
+-- Need minimum-norm solution? -> DGELSD (divide-and-conquer SVD)

Need eigenvalues?
+-- A is symmetric? -> DSYEV (or DSYEVD for large n)
+-- A is general? -> DGEEV (QR iteration)

Need SVD?
+-- Full SVD? -> DGESVD or DGESDD (divide-and-conquer, faster)
+-- Truncated SVD? -> Use randomized SVD (not in LAPACK)
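
In SciPy these drivers are reached through a few keyword arguments; a short sketch (matrix sizes are illustrative):

import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)
S = A @ A.T + 100 * np.eye(100)                     # SPD test matrix

x_gen = la.solve(A, b)                              # general square -> ?GESV
x_spd = la.solve(S, b, assume_a='pos')              # SPD -> Cholesky-based ?POSV
x_sym = la.solve(S, b, assume_a='sym')              # symmetric indefinite -> ?SYSV (LDL^T)

M = rng.standard_normal((200, 50))                  # overdetermined least squares
y = rng.standard_normal(200)
x_ls, *_ = la.lstsq(M, y, lapack_driver='gelsd')    # SVD-based, safe for rank deficiency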



Appendix I: Practical Decision Guide for Practitioners

I.1 Which Factorization to Use? - Decision Tree

The choice among LU, QR, and Cholesky depends on matrix structure, problem type, and numerical requirements. The following decision framework covers the vast majority of practical cases.

MATRIX FACTORIZATION SELECTION GUIDE
========================================================================

 Is A symmetric?
 +-- YES: Is A positive definite? (all eigenvalues > 0)
 |   +-- YES: -> CHOLESKY (dpotrf)
 |   |         Cost: n^3/3 flops. Fastest, most stable for SPD.
 |   |         Use for: Gaussians, GP regression, K-FAC, Fisher
 |   +-- NO:  -> LDL^T (dsytrf, Bunch-Kaufman pivoting)
 |              Cost: n^3/3 flops. Handles indefinite (saddle pts).
 |              Use for: indefinite Hessians, Newton at saddle pts
 +-- NO:  Is A square?
     +-- YES: -> LU with partial pivoting (dgetrf)
     |         Cost: 2n^3/3 flops. General purpose.
     |         Use for: Solving Ax=b, computing det(A)
     +-- NO:  Is A overdetermined (m > n)? -> LEAST SQUARES
         +-- Well-conditioned & fast? -> Normal equations + Cholesky
         |   Cost: mn^2 + n^3/3. Squares condition number.
          +-- General / numerically safe? -> Householder QR (dgels)
         |   Cost: 2mn^2 - 2n^3/3. Standard choice.
         +-- May be rank-deficient? -> Column-pivoted QR or SVD
             RRQR: O(mn^2), good rank estimate
             SVD:  O(mn^2+n^3), exact singular values

========================================================================

I.2 Numerical Precision Requirements

Scenario | Precision needed | Recommended approach
Single FP32 forward pass | u \approx 10^{-7} | Standard FP32 factorization
Training with FP16/BF16 | u \approx 10^{-4} to 10^{-3} | Mixed precision + refinement
Scientific computing | u \approx 10^{-16} (FP64) | FP64 throughout
Certified computation | Exact or interval | Symbolic or interval arithmetic
Ill-conditioned system (\kappa > 10^{12}) | Most or all digits lost in FP64 | Regularize or use extended precision

I.3 Memory and Flop Budgets

For a matrix of size n×nn \times n:

Factorization | Memory (dense) | Flops | Triangular solve flops
LU | n^2 (in-place) | \frac{2}{3}n^3 | 2n^2
QR (Householder) | mn + n^2 | 2mn^2 - \frac{2}{3}n^3 | 2mn for Q^\top\mathbf{b}, n^2 for backsolve
Cholesky | \frac{n(n+1)}{2} | \frac{1}{3}n^3 | 2n^2 (forward + back substitution)
LDL^\top | \frac{n(n+1)}{2} | \frac{1}{3}n^3 | 2n^2

Memory for large nn:

  • n=10,000n = 10{,}000: LU requires 10810^8 doubles = 800 MB (fits in GPU HBM)
  • n=100,000n = 100{,}000: LU requires 101010^{10} doubles = 80 GB (exceeds most GPUs)
  • n>105n > 10^5: must use iterative solvers (CG, GMRES) or sparse factorizations

I.4 Parallelism Characteristics

Factorization | CPU parallelism | GPU parallelism | Distributed
Blocked LU | Panel: serial; trailing update: DGEMM (parallel) | cuSOLVER DGETRF; good for n > 2000 | ScaLAPACK PDGETRF
Householder QR | WY representation + DGEMM | cuSOLVER DGEQRF | ScaLAPACK PDGEQRF
Cholesky | Panel: serial; trailing update: DSYRK (parallel) | cuSOLVER DPOTRF; excellent | ScaLAPACK PDPOTRF
Sparse LU | AMD reordering; supernodal parallelism | cuSOLVER batched routines for many small systems | MUMPS, SuperLU_DIST

Batch factorizations (small systems): For thousands of small systems (common in K-FAC, GP mini-batches), GPU batch factorizations (cublasDgetrfBatched) outperform sequential large-matrix factorizations by 10-100x.
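
A small PyTorch sketch of this batched pattern (sizes are illustrative):

import torch

B = torch.randn(4096, 32, 32)
A = B @ B.transpose(-1, -2) + 32 * torch.eye(32)   # batch of 4096 small SPD matrices
L = torch.linalg.cholesky(A)                       # one batched factorization call
rhs = torch.randn(4096, 32, 1)
x = torch.cholesky_solve(rhs, L)                   # batched forward/back substitutions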

I.5 Common Library Calls and Their Underlying Algorithms

High-level call | Underlying algorithm | Notes
np.linalg.solve(A, b) | LAPACK DGESV (= DGETRF + DGETRS) | Partial pivoting, FP64
np.linalg.lstsq(A, b) | LAPACK DGELSD (divide-and-conquer SVD) | Rank-deficient safe
np.linalg.qr(A) | LAPACK DGEQRF (Householder) | Full or thin via mode
scipy.linalg.lu(A) | LAPACK DGETRF | Returns P, L, U separately
scipy.linalg.cholesky(A) | LAPACK DPOTRF | Raises if not SPD
scipy.linalg.lu_solve((lu, piv), b) | LAPACK DGETRS | Amortized over many b
torch.linalg.solve(A, b) | cuSOLVER DGESV (GPU) / MKL (CPU) | Autograd supported
torch.linalg.cholesky(A) | cuSOLVER DPOTRF (GPU) | Autograd via implicit diff
jax.numpy.linalg.solve(A, b) | XLA:HLO -> cuSOLVER / LAPACK | JIT, vmap, grad
sklearn.linear_model.Ridge(solver='cholesky') | LAPACK DPOTRS | Normal equations + Cholesky

Appendix J: Extended Example - GP Hyperparameter Optimization

This appendix walks through a complete GP regression pipeline showing how LU, QR, and Cholesky all appear in different roles within the same ML workflow.

J.1 Problem Setup

We have n=200n = 200 observations {(x(i),y(i))}i=1n\{(x^{(i)}, y^{(i)})\}_{i=1}^n from f(x)=sin(3x)exp(x/5)+εf(x) = \sin(3x)\exp(-x/5) + \varepsilon, εN(0,0.01)\varepsilon \sim \mathcal{N}(0, 0.01). We fit a GP with RBF kernel k(x,x)=sf2exp(xx2/(22))k(x,x') = s_f^2\exp(-\|x-x'\|^2/(2\ell^2)) and noise σn2\sigma_n^2.

J.2 Where Each Factorization Appears

Cholesky - computing the GP posterior:

The kernel matrix K=Knn+σn2I0K = K_{nn} + \sigma_n^2 I \succ 0 is factored via Cholesky:

L = \operatorname{chol}(K), \qquad \boldsymbol{\alpha} = L^{-\top}(L^{-1}\mathbf{y})

\log p(\mathbf{y}) = -\frac{1}{2}\mathbf{y}^\top\boldsymbol{\alpha} - \sum_i\log L_{ii} - \frac{n}{2}\log 2\pi

The log-determinant ilogLii\sum_i \log L_{ii} is computed from the Cholesky diagonal at zero additional cost.
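
A compact sketch of this computation with SciPy's Cholesky helpers (the function name is illustrative):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_log_marginal_likelihood(K, y):
    """log p(y) for y ~ N(0, K), where K is the SPD kernel-plus-noise matrix."""
    c, low = cho_factor(K, lower=True)            # K = L L^T
    alpha = cho_solve((c, low), y)                # alpha = K^{-1} y
    log_det_half = np.sum(np.log(np.diag(c)))     # sum_i log L_ii = (1/2) log det K
    n = y.size
    return -0.5 * y @ alpha - log_det_half - 0.5 * n * np.log(2.0 * np.pi)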

LU (implicit, via matrix-vector products) - CG solver for large nn:

For n=104n = 10^4, direct Cholesky costs 101210^{12} flops. Instead, conjugate gradient (CG) is used to solve (Knn+σn2I)α=y(K_{nn} + \sigma_n^2 I)\boldsymbol{\alpha} = \mathbf{y} iteratively. Each CG iteration requires one matrix-vector product KnnvK_{nn}\mathbf{v}, plus the application of a preconditioner.

The preconditioner is a low-rank-plus-diagonal approximation (U\Lambda U^\top + \sigma_n^2 I)^{-1}, where U\Lambda U^\top is a rank-r approximation of K_{nn} computed via randomized SVD (which uses QR internally). With this preconditioner, CG typically converges in a number of iterations on the order of r, rather than the O(\sqrt{\kappa(K)}) iterations expected without preconditioning.
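
A rough sketch of this preconditioned CG setup (the rank r, oversampling p, and helper names are assumptions for illustration):

import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def low_rank_preconditioner(K, sigma2, r, p=10, seed=0):
    """Approximate (K + sigma2*I)^{-1} from a rank-r randomized eigendecomposition of K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    Q, _ = np.linalg.qr(K @ rng.standard_normal((n, r + p)))   # range sketch of K
    lam, V = np.linalg.eigh(Q.T @ K @ Q)                       # small (r+p) eigenproblem
    U, lam = Q @ V[:, -r:], lam[-r:]                           # top-r approximate eigenpairs

    def apply(v):
        # (U diag(lam) U^T + sigma2*I)^{-1} v = (v - U diag(lam/(lam+sigma2)) U^T v) / sigma2
        v = np.ravel(v)
        return (v - U @ (lam / (lam + sigma2) * (U.T @ v))) / sigma2

    return LinearOperator((n, n), matvec=apply)

# usage sketch: alpha, info = cg(K + sigma2 * np.eye(n), y, M=low_rank_preconditioner(K, sigma2, r=50))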

QR - selecting inducing points:

For the sparse GP with mnm \ll n inducing points, the inducing point locations are selected via column-pivoted QR on the feature matrix:

Φ=(ϕ(x(1))ϕ(x(n)))Rn×r\Phi = \begin{pmatrix} \phi(x^{(1)})^\top \\ \vdots \\ \phi(x^{(n)})^\top \end{pmatrix} \in \mathbb{R}^{n \times r}

The pivot columns of scipy.linalg.qr(Phi.T, pivoting=True) identify the most "important" training points - those that span the feature space. These become the inducing points.
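
A minimal sketch of this selection step (the feature matrix here is a random stand-in for \Phi, and the sizes are illustrative):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
n, r, m = 200, 30, 20                          # illustrative sizes
Phi = rng.standard_normal((n, r))              # stand-in for the n x r feature matrix
Q, R, piv = qr(Phi.T, pivoting=True, mode='economic')
inducing_idx = piv[:m]                         # indices of the m most informative points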

Cholesky - computing the evidence lower bound (ELBO) for inducing point GP:

The sparse GP ELBO requires computing the Cholesky of the m×mm \times m inducing point kernel matrix KmmK_{mm} plus a correction term - again O(m3)O(m^3) with mnm \ll n.

J.3 Hyperparameter Optimization

The hyperparameters θ=(,sf,σn)\theta = (\ell, s_f, \sigma_n) are optimized by maximizing the log marginal likelihood logp(yX,θ)\log p(\mathbf{y} \mid X, \theta) via gradient-based optimization (L-BFGS or Adam). Each gradient evaluation requires:

  1. A new Cholesky factorization of K(θ)K(\theta) - O(n3)O(n^3)
  2. Computing logp/θk=12tr[(ααK1)Kθk]\partial\log p / \partial\theta_k = \frac{1}{2}\operatorname{tr}\left[({\boldsymbol{\alpha}}\boldsymbol{\alpha}^\top - K^{-1})\frac{\partial K}{\partial\theta_k}\right]
  3. Each trace computation uses \operatorname{tr}(AB) = \sum_{i,j} A_{ij}B_{ji}; since \boldsymbol{\alpha}\boldsymbol{\alpha}^\top - K^{-1} is symmetric, this equals \operatorname{vec}(A)^\top\operatorname{vec}(B) - O(n^2) per hyperparameter

Total cost per hyperparameter gradient: O(n3+pn2)O(n^3 + pn^2) for pp hyperparameters. For n=1000n = 1000, p=3p = 3: about 3×1093 \times 10^9 flops - feasible in seconds on modern hardware.
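
A sketch of one such gradient evaluation following the formula above (the explicit K^{-1} is formed for clarity, and the function name is illustrative):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_ml_gradient(K, dK, y):
    """Gradient of log p(y) w.r.t. one hyperparameter; dK is dK/dtheta_k (n x n, symmetric)."""
    n = y.size
    c, low = cho_factor(K, lower=True)
    alpha = cho_solve((c, low), y)              # K^{-1} y
    K_inv = cho_solve((c, low), np.eye(n))      # explicit K^{-1}, O(n^3)
    A = np.outer(alpha, alpha) - K_inv          # symmetric n x n
    return 0.5 * np.sum(A * dK)                 # tr(A dK) via elementwise sum, O(n^2)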

This complete pipeline illustrates that a single ML workflow (GP regression with hyperparameter optimization) relies on Cholesky (3+ times, for different matrices), randomized QR (for inducing points), and CG with preconditioner (which involves implicit LU-like operations) - all in service of a Bayesian inference computation.


