Math for LLMs - Part 3
Advanced Linear Algebra / Matrix Decompositions

Matrix Decompositions: Appendix K: Matrix Decompositions in the 2026 AI Landscape to Appendix M: Exercises - Further Problems

Appendix K: Matrix Decompositions in the 2026 AI Landscape

K.1 Inference-Time Decompositions

As language models move toward longer context (1M+ tokens), the attention computation's quadratic cost $O(n^2 d)$ has become a primary bottleneck. Several recent methods use matrix decompositions at inference time:

PagedAttention (Kwon et al., 2023): Manages KV cache memory using block-based paging inspired by virtual memory. The KV cache consists of large matrices $K, V \in \mathbb{R}^{n \times d}$; PagedAttention tiles these into fixed-size "pages" and maintains a block table mapping logical blocks to physical blocks, analogous to a virtual-memory page table.

MLA (Multi-head Latent Attention, DeepSeek, 2024): Compresses the KV cache via low-rank projection: $K = K_c U_K^\top$, where $K_c$ is a compressed latent and $U_K$ is a fixed projection matrix. The projection basis can be initialized and orthogonalized via a QR factorization of the projection matrix.
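A minimal numerical sketch of the latent-compression idea. The dimensions are illustrative (not DeepSeek's actual sizes), and a random orthonormal projection obtained via QR stands in for the learned $U_K$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, c = 1024, 128, 16             # sequence length, head dim, latent dim

K = rng.standard_normal((n, d))     # full key matrix

# Fixed projection with orthonormal columns (obtained here via QR)
U, _ = np.linalg.qr(rng.standard_normal((d, c)))   # (d, c), U^T U = I
K_c = K @ U                          # cached latent: n*c floats instead of n*d

K_approx = K_c @ U.T                 # reconstruct keys on the fly
print(K_c.size / K.size)             # 0.125 (8x smaller cache)
```

For a random $K$ the reconstruction is lossy; the point of MLA is that the model is trained so the keys actually live near this low-dimensional subspace.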

Speculative decoding with draft models: A small draft model proposes several tokens that the full model then verifies in a single batched forward pass, replacing a sequence of matrix-vector products with one matrix-matrix product - the same BLAS-3 restructuring that underlies blocked factorizations.

K.2 Training-Time Decompositions

GaLore (Zhao et al., 2024): Gradient Low-Rank Projection. For a weight matrix $W \in \mathbb{R}^{m \times n}$, the gradient $G_t$ is projected onto a rank-$r$ subspace via:

$\tilde{G}_t = G_t P_t, \quad P_t \in \mathbb{R}^{n \times r}$

where $P_t$ comes from the right singular vectors of $G_t$ (computed with a randomized SVD). Adam is applied to $\tilde{G}_t$ (memory: $mr$ instead of $mn$), then the update is lifted back: $\Delta W = \widetilde{\Delta W}_t P_t^\top$. The SVD is recomputed every $T = 200$ steps.
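The projected update can be sketched in a few lines. This toy version uses a full SVD and a plain gradient step in place of GaLore's randomized SVD and Adam, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4

W = rng.standard_normal((m, n))      # weight matrix
G = rng.standard_normal((m, n))      # stand-in for the full gradient

# Projection from the top-r right singular vectors of G.
# (GaLore refreshes P via randomized SVD every T steps; a full SVD
#  is used here for clarity.)
_, _, Vt = np.linalg.svd(G, full_matrices=False)
P = Vt[:r].T                          # (n, r), orthonormal columns

G_tilde = G @ P                       # projected gradient, (m, r): memory m*r
lr = 1e-2
W -= lr * (G_tilde @ P.T)             # lift the update back to (m, n)

print(G_tilde.shape)                  # (64, 4)
```

The optimizer state (here implicit; for Adam, the first and second moments) is kept at the $(m, r)$ size, which is where the memory saving comes from.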

Fira (Chen et al., 2024): Extends GaLore with a closed-form gradient approximation that avoids the SVD step entirely, using a QR-based approximation to the projection.

SOAP (Vyas et al., 2024): Combines Shampoo's Kronecker-structured preconditioner with Adam's update rule. Each layer's preconditioner is maintained as $G_l \in \mathbb{R}^{d_l \times d_l}$ with iterative Cholesky-based Newton steps for the matrix square root.

K.3 Model Compression via Decompositions

SliceGPT (Ashkboos et al., 2024): Compresses transformer weights by applying orthogonal slice operations: a structured pruning where the weight matrix is decomposed as $W = Q\tilde{W}$ and the orthogonal factor $Q$ is absorbed into the adjacent layer normalization.

ASVD (Yuan et al., 2023): Activation-aware SVD for LLM compression. Instead of SVD of $W$ directly, computes SVD of $WX$ (weight matrix multiplied by a typical activation matrix), finding a better rank-$r$ approximation for the actual inference distribution. Uses randomized QR for efficiency.

QuIP (Chee et al., 2023) and QuIP# (Tseng et al., 2024): Incoherence-based quantization. A random orthogonal matrix $Q$ (Hadamard-based) is applied to $W$ before quantization: $\hat{W} = Q \cdot \text{round}(Q^\top W / s)$. The orthogonal preprocessing ensures that no single weight has disproportionate magnitude, enabling aggressive quantization. The QR connection: $Q$ is chosen as a randomized Hadamard transform, equivalent to the sketch matrix in randomized QR.
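A toy version of the incoherence trick. Here a dense random orthogonal $Q$ (QR of a Gaussian matrix) stands in for the randomized Hadamard transform, and simple uniform rounding replaces QuIP's actual quantizer:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W = rng.standard_normal((d, d))
W[0, 0] = 25.0                         # one outlier weight

# Random orthogonal Q (QR of a Gaussian matrix stands in for the
# randomized Hadamard transform used by QuIP#).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

Wr = Q.T @ W                           # rotated weights: outlier energy is spread out
s = np.abs(Wr).max() / 7               # 4-bit-style scale (integer levels -7..7)
W_hat = Q @ (np.round(Wr / s) * s)     # quantize in the rotated basis, rotate back

err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```

Without the rotation, the single outlier would force a scale $s$ large enough to swamp every other weight; after rotation the outlier's energy is distributed across many entries.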

K.4 Numerical Linear Algebra for AI Accelerators

Tensor Cores and Mixed Precision: NVIDIA A100/H100 Tensor Cores perform $D = AB + C$ where $A, B$ are FP16/BF16 and $C, D$ are FP32, achieving 312 TFLOPS (FP16) vs 19.5 TFLOPS (FP64) on the A100. All modern factorization libraries exploit this:

  • cuSOLVER IRS (Iterative Refinement Solver): FP16 factor + FP32 residual + FP32 correction - full FP32 accuracy at near-FP16 throughput
  • cuBLAS Tensor Core GEMM: All blocked factorizations (LU, QR, Cholesky trailing updates) use TC-GEMM for the dominant computation

Intel AMX (Advanced Matrix Extensions): New Intel architecture extensions (Sapphire Rapids, 2023) add hardware matrix tile registers and tile-based GEMM instructions. BLIS and MKL exploit AMX for a $2\times$ speedup over AVX-512 on blocked factorizations.

AMD MI300X: 192 GB HBM3 + 5.3 TB/s bandwidth enables an $n = 65{,}000$ Cholesky factorization entirely in GPU memory - sufficient for large GP regression without blocking across devices.


Appendix L: Summary of Key Results

For quick reference, here are the key theorems and facts from this section:

LU Factorization:

  • Exists iff all leading principal minors $\det(A_k) \neq 0$ for $k = 1, \ldots, n-1$.
  • With partial pivoting: $PA = LU$; $|L_{ij}| \leq 1$; cost $\frac{2}{3}n^3$ flops.
  • Backward error: $\|E\|_\infty \leq c_n u \rho_n \|A\|_\infty$ where $\rho_n$ is the growth factor.
  • Growth factor: $\rho_n \leq 2^{n-1}$ (theoretical), $\approx n^{2/3}$ (typical).
  • Determinant: $\det(A) = (-1)^s \prod_i U_{ii}$ where $s$ = number of row swaps.
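The LU facts above are easy to check numerically. A quick sketch using scipy.linalg.lu, which returns $P, L, U$ with $A = PLU$:

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))

P, L, U = lu(A)                       # scipy's convention: A = P @ L @ U

assert np.allclose(P @ L @ U, A)
assert np.abs(L).max() <= 1 + 1e-12   # partial pivoting bounds |L_ij| <= 1

# det(A) = (-1)^s * prod(U_ii); the sign (-1)^s is exactly det(P)
det_lu = np.linalg.det(P) * np.prod(np.diag(U))
print(det_lu, np.linalg.det(A))       # the two values agree
```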

QR Factorization:

  • Always exists for any $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.
  • Householder QR: backward stable with $\|\hat{Q}^\top\hat{Q} - I\|_F = O(u)$, independent of $\kappa(A)$.
  • Cost: $2mn^2 - \frac{2}{3}n^3$ flops (vs $\frac{2}{3}n^3$ for LU when $m = n$).
  • QR for least squares: error amplification $\approx \kappa(A)$ (vs $\kappa(A)^2$ for the normal equations).
  • RRQR: the diagonal of $R$ estimates singular values; $|R_{kk}|$ reveals rank.
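The orthogonality and rank-revealing claims can be verified with scipy.linalg.qr, which supports column pivoting. A sketch on a synthetic matrix of exact rank 5:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
m, n, r = 50, 20, 5

# Build an exactly rank-r matrix
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

Q, R, piv = qr(A, pivoting=True)      # QR with column pivoting (RRQR)

# Householder Q is orthogonal to machine precision, regardless of kappa(A)
assert np.allclose(Q.T @ Q, np.eye(m), atol=1e-12)

# |R_kk| collapses to roundoff level past the true rank
d = np.abs(np.diag(R))
rank_est = int(np.sum(d / d[0] > 1e-10))
print(rank_est)                       # 5
```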

Cholesky Factorization:

  • Exists iff $A$ is symmetric with $A \succ 0$ (equivalently: all eigenvalues positive).
  • Cost: $\frac{1}{3}n^3$ flops (half of LU); unconditionally backward stable without pivoting.
  • Log-determinant: $\log\det A = 2\sum_i \log L_{ii}$ - exact, $O(n)$ after the Cholesky factorization.
  • Gradient: $\nabla_A \log\det A = A^{-1}$ - computed from the Cholesky factor via triangular solves ($O(n^2)$ per column, $O(n^3)$ for the full inverse).
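The log-determinant and gradient facts can be checked directly. A sketch using numpy.linalg.cholesky and scipy.linalg.cho_solve on a randomly generated SPD matrix:

```python
import numpy as np
from scipy.linalg import cho_solve

rng = np.random.default_rng(4)
n = 30
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)             # SPD by construction

L = np.linalg.cholesky(A)               # A = L L^T
logdet = 2.0 * np.sum(np.log(np.diag(L)))

sign, ref = np.linalg.slogdet(A)        # reference value
assert sign == 1.0
assert np.allclose(logdet, ref)

# Gradient of log det: solve A X = I via triangular solves with L
grad = cho_solve((L, True), np.eye(n))  # equals A^{-1}
assert np.allclose(A @ grad, np.eye(n), atol=1e-8)
```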

Backward Stability:

  • Forward error $\leq \kappa(A) \times$ backward error (fundamental inequality).
  • Backward stability hierarchy: Cholesky = Householder QR $\succ$ LU with partial pivoting $\succ$ LU without pivoting.
  • Iterative refinement: recovers FP64 accuracy from an FP16 factorization in $O(1)$ iterations if $\kappa \cdot u_{16} < 1$.
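The iterative-refinement claim can be demonstrated with FP32 standing in for FP16 (NumPy has no low-precision LU). A sketch with scipy.linalg.lu_factor:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(5)
n = 200
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# Factor once in low precision (FP32 stands in for FP16 here)
lu32 = lu_factor(A.astype(np.float32))
x = lu_solve(lu32, b.astype(np.float32)).astype(np.float64)

for _ in range(3):
    r = b - A @ x                                   # residual in FP64
    d = lu_solve(lu32, r.astype(np.float32))        # correction via the cheap factor
    x += d.astype(np.float64)

err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(err)        # near FP64 accuracy despite the FP32 factorization
```

Each refinement step reuses the low-precision factorization; only the $O(n^2)$ residual is computed in full precision, which is why the scheme runs at near-low-precision throughput.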



Appendix M: Exercises - Further Problems

M.1 Theoretical Extensions

Problem M.1. Show that for any invertible $A \in \mathbb{R}^{n \times n}$, there exists a permutation matrix $P$ such that $PA$ has an LU factorization. (Hint: show that there always exists a row ordering such that all leading principal minors of $PA$ are nonzero.)

Problem M.2. Prove the identity $\det(A) = \det(L)\det(U) = \prod_{i=1}^n U_{ii}$ using the LU factorization. Extend to show that $\det(PA) = (-1)^s \det(A)$ where $s$ is the number of transpositions in $P$.

Problem M.3. For $A \in \mathbb{R}^{n \times n}$ with Cholesky factorization $A = LL^\top$, prove that the Schur complement of $A_{11}$ in $A$ equals $L_{22}L_{22}^\top$, where $L_{22}$ is the lower-right $(n-1) \times (n-1)$ block of $L$. Conclude that the Schur complement of any SPD matrix is SPD.

Problem M.4. Show that the Householder reflector $H = I - 2\mathbf{v}\mathbf{v}^\top/\|\mathbf{v}\|^2$ satisfies: (a) $H = H^\top$ (symmetric); (b) $H^\top H = I$ (orthogonal); (c) $\det(H) = -1$ (orientation-reversing); (d) $H^2 = I$ (involutory - its own inverse).
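Before proving the four properties, it can help to confirm them numerically for a random reflector:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
v = rng.standard_normal(n)
H = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)   # Householder reflector

assert np.allclose(H, H.T)                       # (a) symmetric
assert np.allclose(H.T @ H, np.eye(n))           # (b) orthogonal
assert np.isclose(np.linalg.det(H), -1.0)        # (c) det = -1
assert np.allclose(H @ H, np.eye(n))             # (d) involutory
```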

Problem M.5. Prove that QR with column pivoting finds the best rank-$r$ approximation in the following sense: if $\operatorname{rank}(A) = r$ and $P$ is chosen so that $|R_{11}| \geq |R_{22}| \geq \cdots$, then the leading $r \times r$ block $R_{11}$ of $R$ satisfies $\sigma_r(R_{11}) \geq \sigma_r(A) / \sqrt{1 + r(n-r)}$.

M.2 Computational Challenges

Challenge M.6 (Batched Cholesky). Implement batched Cholesky factorization for $k = 1000$ matrices of size $50 \times 50$. Compare three implementations:

  • Python loop over scipy.linalg.cholesky
  • NumPy batched: np.linalg.cholesky on a stacked $(k, n, n)$ array
  • torch.linalg.cholesky on CPU with batch dimension

Measure time and throughput (matrices/second). Extrapolate to the K-FAC setting where $k = 10{,}000$ and $n = 128$.

Challenge M.7 (Iterative Refinement). Construct a matrix $A$ with condition number $\kappa \approx 10^{12}$ (near the FP64 limit). Solve $A\mathbf{x} = \mathbf{b}$ using (a) direct LU in FP64 and (b) simulated FP32 LU followed by 3 steps of FP64 residual refinement. Measure the relative error of each, then compare and explain the improvement using backward error theory.

Challenge M.8 (Rank-revealing QR in practice). Download a real dataset (e.g., the UCI Breast Cancer dataset, $n = 569$, $d = 30$ features). Apply RRQR to the feature matrix $X$ and identify the effective rank (threshold: $|R_{kk}|/|R_{11}| < 10^{-3}$). Verify by comparing to the number of significant singular values of $X$.
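A sketch of the RRQR-vs-SVD comparison on synthetic data with a known effective rank; downloading the UCI dataset is left to the exercise, and any feature matrix can be substituted for X:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(8)
n, d, r = 569, 30, 6

# Synthetic stand-in for the dataset: rank-r structure plus small noise
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
X += 1e-6 * rng.standard_normal((n, d))

_, R, _ = qr(X, mode='economic', pivoting=True)        # RRQR
d_R = np.abs(np.diag(R))
eff_rank = int(np.sum(d_R / d_R[0] > 1e-3))            # |R_kk|/|R_11| threshold

# Cross-check against the number of significant singular values
s = np.linalg.svd(X, compute_uv=False)
svd_rank = int(np.sum(s / s[0] > 1e-3))
print(eff_rank, svd_rank)
```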


References

  1. Golub, G. H. & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press. - The definitive reference for all material in this section.

  2. Trefethen, L. N. & Bau, D. (1997). Numerical Linear Algebra. SIAM. - Exceptionally clear treatment of Householder QR and backward error analysis.

  3. Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. - Comprehensive analysis of numerical stability; source for all error bounds cited here.

  4. Demmel, J. W. (1997). Applied Numerical Linear Algebra. SIAM. - Excellent treatment of LU pivoting strategies and condition estimation.

  5. Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4), 339-342. - Original Householder reflector paper.

  6. Wilkinson, J. H. (1961). Error analysis of direct methods of matrix inversion. Journal of the ACM, 8(3), 281-330. - Foundational backward error analysis for LU.

  7. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288. - Randomized QR/SVD framework.

  8. Martens, J. & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. ICML 2015. - K-FAC: Cholesky for Fisher information blocks.

  9. Hu, E. J. et al. (2022). LoRA: Low-rank adaptation of large language models. ICLR 2022. - Low-rank decomposition for fine-tuning.

  10. Zhao, J. et al. (2024). GaLore: Memory-efficient LLM training by gradient low-rank projection. ICML 2024. - Randomized SVD/QR for gradient compression.

  11. Bunch, J. R. & Kaufman, L. (1977). Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137), 163-179. - LDL^T with Bunch-Kaufman pivoting.

  12. Anderson, E. et al. (1999). LAPACK Users' Guide (3rd ed.). SIAM. - LAPACK routine reference and algorithmic details.

  13. Gardner, J. R. et al. (2018). GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. NeurIPS 2018. - CG-based GP inference for large-scale regression.

  14. Demmel, J., Grigori, L., Hoemmen, M., & Langou, J. (2008). Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal on Scientific Computing, 34(1), A206-A239. - TSQR and communication-avoiding algorithms.

  15. Shah, J. et al. (2024). FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv:2407.08608. - Block-tiled attention exploiting BLAS-3 structure.

