"If vectors are the language of representation, matrices are the language of computation. A modern neural network is what happens when you apply a very large number of matrix operations, very quickly, to very large collections of vectors."
Overview
The previous chapter introduced vectors, subspaces, bases, norms, projections, and the geometry of linear algebra. This chapter turns that geometry into computation. A matrix is not only a rectangular array of numbers. It is the concrete representation of a linear map between finite-dimensional vector spaces, and matrix operations are the computational procedures that make linear algebra operational.
This matters immediately in machine learning. A dense layer is a matrix-vector multiply plus bias. A batch of activations passing through a layer is a matrix-matrix multiply. Self-attention is built from matrix multiplies, elementwise operations, transposes, and softmax. Backpropagation is dominated by transposed matrix multiplies and outer products. Numerical stability, hardware utilization, low-rank adaptation, quantization, and efficient inference are all matrix-operation questions.
This chapter therefore moves in three directions at once:
- the algebra of matrices as mathematical objects
- the numerical algorithms used to compute with them
- the AI interpretation of those operations in modern models
The goal is not only to know the rules of matrix arithmetic, but to understand why those rules explain how transformers run on GPUs, why certain decompositions are preferred in practice, and why low-rank structure appears so often in large-scale deep learning.
Prerequisites
- Basic algebra
- Familiarity with vectors in $\mathbb{R}^n$
- Basic knowledge of linear combinations, basis, rank, and subspaces
- Comfort with summation notation
Learning Objectives
After completing this chapter, you should be able to:
- Read and manipulate matrix notation fluently
- Distinguish elementwise, algebraic, and decomposition-based matrix operations
- Carry out matrix addition, scalar multiplication, transpose, trace, and matrix multiply correctly
- Reason about shape compatibility and debug dimension errors quickly
- Explain matrix multiplication from row, column, outer-product, and block perspectives
- Understand when inverses exist and when pseudo-inverses are required
- Connect matrix operations directly to forward passes, backpropagation, attention, LoRA, quantization, and hardware efficiency
- Interpret matrix conditioning and numerical stability in AI systems
- Recognise which matrix decomposition is appropriate for a given problem and know where to find its full treatment
Table of Contents
- Matrix Operations
- Overview
- Prerequisites
- Learning Objectives
- Table of Contents
- 1. Intuition
- 2. Matrix Basics and Notation
- 3. Elementwise and Scalar Operations
- 4. Matrix Multiplication
- 5. Special Matrix Products
- 6. Matrix Inverse
- 7. Moore-Penrose Pseudo-Inverse
- 8. Matrix Decompositions - Preview
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI
- 12. Conceptual Bridge
- References
1. Intuition
1.1 What Are Matrix Operations?
A matrix is a rectangular array of numbers arranged in rows and columns:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}$$
But that description is only the surface. The deeper point is that every linear map $T : \mathbb{R}^n \to \mathbb{R}^m$ is represented by a unique matrix $A \in \mathbb{R}^{m \times n}$ such that
$$T(x) = Ax \quad \text{for all } x \in \mathbb{R}^n.$$
So matrix operations are the arithmetic of linear maps:
- matrix addition adds linear transformations
- matrix multiplication composes linear transformations
- identity represents "do nothing"
- inverse represents "undo the transformation"
- transpose reorganizes rows and columns and, in the real inner-product case, corresponds to the adjoint
- decompositions expose hidden structure inside the transformation
If vectors are the states of a linear system, matrices are the machines that act on those states.
Vectors are states.
Matrices are transformations.
x in R^n --A--> y in R^m
Matrix operations tell us how transformations
combine, invert, factor, and approximate each other.
1.2 Why Matrix Operations Are Central to AI
In deep learning, almost every major computation is a matrix operation.
For a dense layer with weights $W \in \mathbb{R}^{m \times n}$, bias $b \in \mathbb{R}^{m}$, and input $x \in \mathbb{R}^{n}$:
$$y = Wx + b$$
For a batch of inputs $X \in \mathbb{R}^{B \times n}$:
$$Y = XW^{T} + b$$
That is matrix-matrix multiply plus broadcasting.
In scaled dot-product attention (Vaswani et al., 2017):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ are matrices. Two matrix multiplies dominate the work: $QK^{T}$ and the product of the softmaxed scores with $V$.
Backpropagation is also matrix arithmetic. If $y = Wx$ and $L$ is a scalar loss, then:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{T}, \qquad \frac{\partial L}{\partial x} = W^{T}\, \frac{\partial L}{\partial y}$$
Weight updates are elementwise matrix operations:
$$W \leftarrow W - \eta\, \frac{\partial L}{\partial W}$$
So the forward pass, backward pass, and optimization step are all matrix-operation stories. Once models become large enough, performance becomes a question of how well those operations map to GPU kernels, memory hierarchies, and accelerator hardware.
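To make the shapes concrete, here is a minimal NumPy sketch of one dense layer's forward pass, backward pass, and gradient step for a batch of inputs. The layer sizes and variable names are illustrative assumptions, not part of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

B, n, m = 32, 64, 16          # batch size, input dim, output dim (arbitrary)
X = rng.normal(size=(B, n))    # batch of inputs, one row per example
W = rng.normal(size=(m, n))    # weight matrix
b = np.zeros(m)                # bias vector

# Forward pass: matrix-matrix multiply plus broadcast bias
Y = X @ W.T + b                # shape (B, m)

# Suppose the upstream gradient dL/dY is given (here: a dummy value)
dY = rng.normal(size=(B, m))

# Backward pass: transposed multiplies, i.e. accumulated outer products over the batch
dW = dY.T @ X                  # shape (m, n)
db = dY.sum(axis=0)            # shape (m,)
dX = dY @ W                    # shape (B, n)

# Gradient step: an elementwise (AXPY-style) update
lr = 1e-2
W -= lr * dW
b -= lr * db
```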
1.3 The Computational Centrality of Matrix Multiply
The single most important primitive in deep learning systems is GEMM: general matrix-matrix multiplication. Vendor libraries such as cuBLAS on NVIDIA GPUs and oneDNN on CPUs exist largely to make GEMM fast and hardware-efficient. Dense neural workloads spend much of their time doing repeated multiply-accumulate operations over large arrays, so hardware is built around this pattern:
- NVIDIA Tensor Cores specialize in tiled matrix multiply
- TPUs organize compute as dense systolic-array linear algebra
- many convolutions are lowered to GEMM-style operations
- attention implementations are optimized around matmul and memory reuse
FlashAttention makes this especially clear. The original FlashAttention paper (Dao et al., 2022) did not change the mathematical definition of attention; it reorganized the computation to reduce IO between high-bandwidth memory and on-chip SRAM by tiling the operation better. FlashAttention-3 (Dao et al., 2024) pushed this further on H100 hardware, reporting FP16 throughput up to about 740 TFLOPs/s, roughly 75% utilization of peak tensor-core performance on that kernel.
This is one of the main systems lessons of modern AI:
- if your model is slow, the issue is often not "too much math"
- it is often "the matrix math is scheduled poorly"
- bad tiling, awkward shapes, poor layout, or excessive memory traffic can waste expensive hardware
Why matrix multiply dominates
model layer
-> matrix multiply
-> nonlinearity
-> matrix multiply
-> normalization
-> matrix multiply
Most of the FLOPs live in the multiply stages.
That is why hardware and libraries are optimized around GEMM.
1.4 The Map from Linear Algebra to Computation
It helps to translate abstract language directly into computational language.
| Abstract concept | Computational object |
|---|---|
| Linear map | Matrix |
| Composition | Matrix product |
| Identity map | Identity matrix |
| Inverse map | Matrix inverse |
| Adjoint in the real case | Transpose |
| Projection onto a subspace | Projection matrix |
| Change of basis | Similarity transform |
| Orthonormal basis | Orthogonal matrix $Q$ with $Q^{T}Q = I$ |
| Spectral decomposition | Eigendecomposition |
| Best rank-$k$ approximation | Truncated SVD |
This dictionary is the bridge between theory and implementation. It explains why so many later algorithms are specialized ways of multiplying, factoring, or solving with matrices.
1.5 Taxonomy of Matrix Operations
The subject is easier to navigate if matrix operations are grouped by what they preserve or compute.
MATRIX OPERATIONS
|
|-- Elementwise operations
| |-- addition / subtraction
| |-- scalar multiplication
| `-- Hadamard product
|
|-- Structure-preserving operations
| |-- transpose
| |-- conjugate transpose
| `-- trace
|
|-- Algebraic operations
| |-- matrix multiplication
| |-- matrix-vector multiply
| `-- outer product
|
|-- Solving and inverting
| |-- inverse
| |-- pseudo-inverse
| `-- linear system solvers
|
|-- Decompositions
| |-- LU
| |-- QR
| |-- eigendecomposition
| |-- SVD
| `-- Cholesky
|
`-- Derived quantities
|-- determinant
|-- rank
|-- norms
`-- condition number
These categories overlap, but the split is useful. It separates simple entrywise arithmetic from the deeper operations that reveal structure or solve numerical problems.
1.6 Historical Timeline
Matrix operations did not appear all at once. The modern subject is the result of several centuries of algebra, geometry, and numerical computation.
| Period | Development | Why it matters |
|---|---|---|
| Ancient China | Elimination methods in Nine Chapters on the Mathematical Art | Proto-Gaussian elimination |
| 17th-18th centuries | Determinants, Cramer's rule, linear systems | Early algebraic treatment of systems |
| 1809 | Gauss formalizes least squares and elimination | Linear systems and approximation become computational sciences |
| 1850s | Cayley and Sylvester develop matrix algebra | Matrices become algebraic objects |
| Late 19th century | Frobenius and spectral ideas | Eigenvalues and structure become central |
| Early 20th century | Gram-Schmidt and orthogonalization | Stable basis construction emerges |
| Mid 20th century | Householder, Golub, Reinsch, Wilkinson | Modern numerical linear algebra is born |
| 1990s | LAPACK | Reference dense linear algebra software stack |
| 2000s onward | cuBLAS and GPU linear algebra | Matrix operations move to accelerators |
| 2017 onward | Transformers and attention | GEMM becomes the dominant workload in AI |
| 2020s | FlashAttention, Tensor Cores, FP8 | Matrix arithmetic is co-designed with hardware |
The historical arc is instructive. Matrix operations began as abstract mathematics, matured into numerical algorithms, and then became the central workload of modern AI hardware.
2. Matrix Basics and Notation
2.1 Definition and Notation
A matrix $A \in \mathbb{R}^{m \times n}$ has $m$ rows and $n$ columns:
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}$$
Standard notation:
- $a_{ij}$ or $A_{ij}$ is the entry in row $i$, column $j$
- the $i$-th row is written $A_{i,:}$
- the $j$-th column is written $A_{:,j}$
- each row is an element of $\mathbb{R}^{n}$
- each column is an element of $\mathbb{R}^{m}$
That distinction matters. Matrix multiplication can be understood as rows meeting columns through dot products.
Shape and indexing
columns
j=1 j=2 j=n
+--------------------+
i=1 | a11 a12 ... a1n |
i=2 | a21 a22 ... a2n |
... | ... ... ... ... |
i=m | am1 am2 ... amn |
+--------------------+
rows
Block notation is also important. A matrix can be partitioned into submatrices:
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$
This is not just notation. Block structure is how large matrix computations are actually implemented on modern hardware.
2.2 Special Matrices
Many algorithms simplify once a matrix has special structure.
Square matrix
A matrix with the same number of rows and columns.
Zero matrix
The additive identity; every entry is zero.
Identity matrix
It satisfies $AI = A$ and $IA = A$ when shapes are compatible.
Diagonal matrix
Multiplying by a diagonal matrix is coordinatewise scaling.
Upper and lower triangular matrices
Upper triangular matrices have zeros below the diagonal; lower triangular matrices have zeros above it.
Symmetric matrix
A matrix with $A^{T} = A$. Symmetric matrices have real eigenvalues and orthogonal eigenspaces for distinct eigenvalues.
Skew-symmetric matrix
A matrix with $A^{T} = -A$. Its diagonal entries must be zero.
Orthogonal matrix
Its columns form an orthonormal basis, and $Q^{T}Q = QQ^{T} = I$, so $Q^{-1} = Q^{T}$.
Positive definite matrix
A symmetric matrix with $x^{T}Ax > 0$ for every nonzero $x$. All eigenvalues are positive.
Positive semidefinite matrix
A symmetric matrix with $x^{T}Ax \ge 0$ for every $x$. All eigenvalues are nonnegative.
These classes are not merely labels. They determine which algorithms are valid:
- SPD matrices admit Cholesky decomposition
- orthogonal matrices make inversion cheap
- triangular matrices make solving systems cheap
- symmetric matrices make spectral analysis easier
Some important matrix shapes
Identity Diagonal Upper triangular
[1 0 0] [d1 0 0 ] [a b c]
[0 1 0] [0 d2 0 ] [0 d e]
[0 0 1] [0 0 d3] [0 0 f]
Symmetric Lower triangular
[a b c] [a 0 0]
[b d e] [b c 0]
[c e f] [d e f]
2.3 Matrix Shape and Compatibility
Shape checking is the first line of defense in matrix reasoning.
Addition requires identical shape:
$$A, B \in \mathbb{R}^{m \times n} \;\Rightarrow\; A + B \in \mathbb{R}^{m \times n}$$
Multiplication requires matching inner dimensions:
$$A \in \mathbb{R}^{m \times p},\; B \in \mathbb{R}^{p \times n} \;\Rightarrow\; AB \in \mathbb{R}^{m \times n}$$
The inner dimension disappears because it is the contracted summation index.
Compatibility rule for multiplication
(m x p) times (p x n) -> (m x n)
A B AB
inner dimensions must match
outer dimensions survive
This sounds elementary, but in practice it is one of the most important debugging tools in machine learning. If a tensor program fails, a shape mismatch is one of the first things to check.
Broadcasting is different from matrix multiplication. In libraries such as NumPy and PyTorch, dimensions of size 1 can be expanded logically to match other shapes. That is a convenience rule for elementwise operations, not a new form of matrix algebra.
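As a quick illustration of the difference, here is a small NumPy snippet (shapes chosen arbitrarily) that contrasts broadcasting of a bias against an actual matrix multiply; the array names are placeholders.

```python
import numpy as np

A = np.ones((4, 3))        # shape (4, 3)
B = np.ones((3, 5))        # shape (3, 5)
b = np.arange(3.0)         # shape (3,)

C = A @ B                  # matrix multiply: inner dims 3 and 3 match -> shape (4, 5)
D = A + b                  # broadcasting: b is logically expanded to (4, 3), elementwise add
# A + B would raise a ValueError: (4, 3) and (3, 5) are not broadcast-compatible
print(C.shape, D.shape)    # (4, 5) (4, 3)
```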
2.4 Memory Layout
A matrix is conceptually two-dimensional, but in memory it is stored as a linear sequence of numbers. The layout matters for performance.
In row-major order (common in C-style conventions), entries of the same row are stored contiguously. For a matrix with shape $m \times n$, the zero-based offset of entry $(i, j)$ is approximately
$$\text{offset}(i, j) = i \cdot n + j$$
In column-major order (common in Fortran and classical BLAS/LAPACK conventions), entries of the same column are stored contiguously:
$$\text{offset}(i, j) = i + j \cdot m$$
2 x 3 matrix
[a b c]
[d e f]
row-major memory: a b c d e f
column-major memory: a d b e c f
This matters because contiguous memory access is cache-friendly. A mathematically identical algorithm can run much faster or much slower depending on layout and traversal order.
The transpose illustrates the difference between mathematics and storage. Mathematically,
$$(A^{T})_{ij} = a_{ji}$$
But in high-performance libraries, a transpose often does not require moving data immediately. It can be represented by changing shape and stride metadata, so the same underlying memory is interpreted with swapped indexing. In PyTorch, a 2D .T operation typically returns a view with transposed strides rather than a full data copy.
This matters in AI because operations such as in attention rely heavily on fast transposed access patterns. Optimized systems try to get the mathematical effect of transpose without paying the full physical cost of reshuffling memory.
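The following sketch (NumPy, with made-up shapes) shows how a transpose can be a pure metadata change: the strides swap while the underlying buffer is shared. The same idea underlies PyTorch's .T view.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)     # 2 x 3, row-major layout
print(A.strides)                      # (24, 8): 24 bytes to the next row, 8 to the next column

At = A.T                              # transpose as a view: no data is copied
print(At.strides)                     # (8, 24): the strides are simply swapped
print(np.shares_memory(A, At))        # True: both arrays use the same buffer

A[0, 1] = 100.0
print(At[1, 0])                       # 100.0: the view sees the change immediately
```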
3. Elementwise and Scalar Operations
3.1 Matrix Addition and Subtraction
If $A, B \in \mathbb{R}^{m \times n}$, then matrix addition and subtraction are defined entrywise:
$$(A + B)_{ij} = a_{ij} + b_{ij}, \qquad (A - B)_{ij} = a_{ij} - b_{ij}$$
The rules mirror vector addition:
- commutativity: $A + B = B + A$
- associativity: $(A + B) + C = A + (B + C)$
- additive identity: $A + 0 = A$
- additive inverse: $A + (-A) = 0$
The operation cost is $O(mn)$ because every entry must be touched once. On modern GPUs, matrix addition is usually memory-bandwidth bound rather than compute-bound. There is very little arithmetic per byte moved.
In deep learning, matrix addition appears constantly:
- adding biases
- adding residual connections
- updating weights
- combining branches in computation graphs
3.2 Scalar Multiplication
If $c \in \mathbb{R}$ and $A \in \mathbb{R}^{m \times n}$, then
$$(cA)_{ij} = c \, a_{ij}$$
Every entry is scaled by the same scalar.
The familiar linearity properties hold:
$$c(A + B) = cA + cB, \qquad (c + d)A = cA + dA, \qquad (cd)A = c(dA)$$
Together with addition, scalar multiplication defines affine updates such as
$$A \leftarrow A + \alpha B$$
In numerical linear algebra, the operation
$$Y \leftarrow \alpha X + Y$$
is so common that BLAS gives it a dedicated name: AXPY. Gradient descent is exactly this pattern:
$$W \leftarrow W - \eta \, \nabla_W L$$
where $\eta$ is a scalar learning rate and $\nabla_W L$ is a gradient matrix or tensor.
3.3 Hadamard Elementwise Product
The Hadamard product of two same-shaped matrices $A, B \in \mathbb{R}^{m \times n}$ is
$$(A \odot B)_{ij} = a_{ij} \, b_{ij}$$
This is elementwise multiplication. It is not matrix multiplication.
Hadamard product vs matrix product
same shapes required inner dimensions required
A (m x n) A (m x p)
B (m x n) B (p x n)
A odot B -> (m x n) AB -> (m x n)
no summation sum over shared index
entry by entry row-column dot products
Properties:
- commutative: $A \odot B = B \odot A$
- associative: $(A \odot B) \odot C = A \odot (B \odot C)$
- distributive over addition: $A \odot (B + C) = A \odot B + A \odot C$
But it does not obey the same algebra as matrix multiplication.
The Hadamard product is central in AI because modern networks use gates all over the place:
- LSTM and GRU gates
- dropout masks
- attention masks and padding masks
- SwiGLU and related gated feed-forward activations
- elementwise modulation in conditional models
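A minimal NumPy sketch of the masking pattern, with invented shapes: a padding mask applied as a Hadamard product zeros out contributions from padded positions without changing the shape of the activations.

```python
import numpy as np

rng = np.random.default_rng(0)

H = rng.normal(size=(2, 5, 8))              # (batch, sequence, hidden) activations
mask = np.array([[1, 1, 1, 0, 0],           # first sequence: 3 real tokens, 2 padding
                 [1, 1, 1, 1, 1]], float)   # second sequence: no padding

# Hadamard product with the broadcast mask: padded positions become exactly zero
H_masked = H * mask[:, :, None]             # mask expands to (2, 5, 1) -> (2, 5, 8)

print(H_masked.shape)                       # (2, 5, 8), same shape as H
print(np.allclose(H_masked[0, 3:], 0.0))    # True
```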
3.4 Trace
For a square matrix $A \in \mathbb{R}^{n \times n}$, the trace is
$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}$$
Key properties:
Linearity
$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B), \qquad \mathrm{tr}(cA) = c \, \mathrm{tr}(A)$$
Transpose invariance
$$\mathrm{tr}(A^{T}) = \mathrm{tr}(A)$$
Cyclic permutation
For compatible matrices:
$$\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$$
This does not mean arbitrary reordering is allowed.
Frobenius inner product
$$\langle A, B \rangle_F = \mathrm{tr}(A^{T} B) = \sum_{i,j} a_{ij} b_{ij}$$
Eigenvalue sum
If $A$ is diagonalizable, then
$$\mathrm{tr}(A) = \sum_{i} \lambda_i$$
In AI, trace appears in Frobenius norms, covariance identities, kernel formulas, and matrix-calculus derivations.
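These identities are easy to sanity-check numerically; the following NumPy snippet (random matrices, arbitrary sizes) verifies the cyclic-permutation and Frobenius-inner-product identities.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

# Cyclic permutation: the trace is invariant under cycling the factors
t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
print(np.allclose(t1, t2), np.allclose(t2, t3))          # True True

# Frobenius inner product: tr(X^T Y) equals the sum of elementwise products
X = rng.normal(size=(3, 4))
Y = rng.normal(size=(3, 4))
print(np.allclose(np.trace(X.T @ Y), np.sum(X * Y)))     # True
```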
3.5 Transpose
The transpose of $A \in \mathbb{R}^{m \times n}$ is the matrix $A^{T} \in \mathbb{R}^{n \times m}$ defined by
$$(A^{T})_{ij} = a_{ji}$$
Rows become columns and columns become rows.
Basic properties:
$$(A^{T})^{T} = A, \qquad (A + B)^{T} = A^{T} + B^{T}, \qquad (cA)^{T} = c A^{T}$$
The reverse-order rule is fundamental:
$$(AB)^{T} = B^{T} A^{T}$$
So
$$(ABC)^{T} = C^{T} B^{T} A^{T}$$
This is the algebraic reason backpropagation has the structure it does. Reverse-mode differentiation repeatedly moves backward through compositions, and transposes inherit the same reverse-order behavior.
Transpose flips orientation
A: m rows, n cols
A^T: n rows, m cols
row of A -> column of A^T
column of A -> row of A^T
4. Matrix Multiplication
4.1 Definition
If
$$A \in \mathbb{R}^{m \times p}, \qquad B \in \mathbb{R}^{p \times n},$$
then the matrix product $C = AB \in \mathbb{R}^{m \times n}$ is defined by
$$c_{ij} = \sum_{k=1}^{p} a_{ik} \, b_{kj}$$
Equivalently, the entry $c_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$:
$$c_{ij} = A_{i,:} \cdot B_{:,j}$$
How one entry is formed
row i of A: [ a_i1 a_i2 ... a_ip ]
column j of B: [ b_1j
b_2j
...
b_pj ]
dot product -> c_ij
This definition is exactly what composition of linear maps requires. If $x$ is first transformed by $B$ and then by $A$, the combined transformation is $AB$:
$$(AB)x = A(Bx)$$
4.2 Four Interpretations of Matrix Multiply
One of the best ways to understand matrix multiply is to see four different but equivalent interpretations.
Row perspective
Each row of $C = AB$ is the corresponding row of $A$ acting on $B$:
$$C_{i,:} = A_{i,:} \, B$$
Column perspective
Each column of $C$ is $A$ applied to the corresponding column of $B$:
$$C_{:,j} = A \, B_{:,j}$$
Outer-product perspective
Let $a_k$ be the $k$-th column of $A$ and let $b_k^{T}$ be the $k$-th row of $B$. Then
$$AB = \sum_{k=1}^{p} a_k \, b_k^{T}$$
This is one of the most important views for low-rank structure and gradients.
Block perspective
If compatible blocks are used, block multiplication works exactly like scalar algebra with block products in place of scalar products.
One multiply, four views
rows meet columns
columns get transformed
outer products get accumulated
blocks combine hierarchically
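The outer-product view is worth verifying once by hand; the NumPy snippet below (random matrices, arbitrary shapes) accumulates rank-1 terms and confirms they reproduce the ordinary product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 4, 3, 5
A = rng.normal(size=(m, p))
B = rng.normal(size=(p, n))

# Outer-product view: AB is a sum of p rank-1 matrices
C = np.zeros((m, n))
for k in range(p):
    C += np.outer(A[:, k], B[k, :])     # column k of A times row k of B

print(np.allclose(C, A @ B))            # True
```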
4.3 Properties of Matrix Multiplication
Matrix multiplication obeys several essential rules.
Associativity
$$(AB)C = A(BC)$$
Distributivity
$$A(B + C) = AB + AC, \qquad (A + B)C = AC + BC$$
Identity
$$AI = A, \qquad IA = A$$
Zero behavior
$$A0 = 0, \qquad 0A = 0$$
But
$$AB = 0 \;\text{ does not imply }\; A = 0 \text{ or } B = 0.$$
Transpose reversal
$$(AB)^{T} = B^{T} A^{T}$$
Rank inequality
$$\mathrm{rank}(AB) \le \min\big(\mathrm{rank}(A), \mathrm{rank}(B)\big)$$
The rank cannot exceed the bottleneck rank of the factors.
4.4 Non-Commutativity and Consequences
In general,
$$AB \neq BA$$
Sometimes one product is defined and the other is not. Even when both are defined and square, they are usually different.
A standard counterexample is
$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}.$$
Then
$$AB = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \neq \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} = BA.$$
This has many consequences:
- the order of layer operations matters
- transpose and inverse formulas reverse order
- gradient derivations must preserve order carefully
- symmetric times symmetric is not automatically symmetric
- changing parenthesization changes computational cost even if the answer is unchanged
For example, in transformer notation:
$$(QK^{T})^{T} = K Q^{T},$$
and that identity only works because transpose reverses the order of the factors.
4.5 Computational Complexity
For
$$A \in \mathbb{R}^{m \times p}, \qquad B \in \mathbb{R}^{p \times n},$$
the naive algorithm performs about
$$2mnp$$
floating-point operations if multiply-adds are counted separately.
For square $n \times n$ matrices, that becomes $2n^{3}$, i.e. $O(n^{3})$.
Strassen's algorithm reduces the exponent to about
$$\log_2 7 \approx 2.807,$$
and later algorithms reduce it further in theory. But these methods bring larger constants, more complicated memory behavior, and weaker numerical properties. They are mathematically important, but they are not the default engine of deep learning workloads.
Another practical optimization problem is matrix-chain order. Because multiplication is associative, the result does not change when the parenthesization changes, but the cost can change drastically.
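A toy cost calculation makes the parenthesization point concrete. The shapes below are invented; the flop counts use the $2mnp$ estimate from above.

```python
def matmul_flops(m: int, p: int, n: int) -> int:
    """Approximate flop count for an (m x p) times (p x n) multiply."""
    return 2 * m * p * n

# A: 1000 x 10, B: 10 x 1000, C: 1000 x 10  (shapes chosen to exaggerate the effect)
# (A B) C : form a 1000 x 1000 intermediate, then shrink it
cost_left = matmul_flops(1000, 10, 1000) + matmul_flops(1000, 1000, 10)
# A (B C) : form a 10 x 10 intermediate, then expand it
cost_right = matmul_flops(10, 1000, 10) + matmul_flops(1000, 10, 10)

print(cost_left)    # 40000000
print(cost_right)   # 400000: same result, roughly 100x cheaper
```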
4.6 GEMM on GPU: Hardware Perspective
On accelerators, matrix multiply is implemented by tiling the problem into fragments that fit into fast on-chip memory and specialized multiply-accumulate units.
At a high level:
- load a tile of and a tile of from HBM
- stage them into shared memory or registers
- run many fused multiply-add operations on the tile
- accumulate partial sums
- write the output tile back
Tiled GEMM
HBM -> shared memory -> registers -> multiply-accumulate -> write back
Do more arithmetic per byte loaded.
That is the whole performance game.
NVIDIA's H100 product specifications list up to 989 TFLOPS tensor-core TF32 performance, 1,979 TFLOPS tensor-core BF16/FP16 performance, 3,958 TFLOPS tensor-core FP8 performance, and 3.35 TB/s HBM bandwidth for the SXM variant. That gap between compute throughput and memory bandwidth is why naive memory movement can destroy performance.
Large dense GEMMs usually become compute-bound if tiled well. Small or awkwardly shaped matrix multiplies often do not.
FlashAttention is a direct application of this principle. The algorithm is faster not because it changes the mathematics of attention, but because it reorganizes the matrix operations and reductions to reduce expensive memory traffic between HBM and SRAM while keeping arithmetic dense and local (Dao et al., 2022; Dao et al., 2024).
4.7 Matrix-Vector Multiply GEMV
Matrix-vector multiply
$$y = Ax, \qquad A \in \mathbb{R}^{m \times n}, \; x \in \mathbb{R}^{n},$$
is a special case of matrix multiplication where one dimension is 1.
It costs about $2mn$ floating-point operations, but unlike large GEMM it has poor arithmetic intensity. You read the matrix, read the vector, and produce one output vector. There is much less reuse.
This is why GEMV is usually memory-bandwidth bound.
The AI consequence is important:
- batched training prefers GEMM because batching increases reuse
- autoregressive decoding often degenerates into many GEMV-like operations because the batch or active sequence dimension is small
- single-sequence decoding therefore achieves far lower hardware utilization than large-batch training
Batching converts many GEMV workloads back toward GEMM, which is why serving systems aggressively batch requests whenever latency constraints allow it.
5. Special Matrix Products
5.1 Outer Product
If $u \in \mathbb{R}^{m}$ and $v \in \mathbb{R}^{n}$, then the outer product is
$$uv^{T} \in \mathbb{R}^{m \times n}, \qquad (uv^{T})_{ij} = u_i v_j$$
Every column of $uv^{T}$ is a scalar multiple of $u$, and every row is a scalar multiple of $v^{T}$. Therefore, if neither vector is zero, the matrix has rank 1.
Outer product
u = [u1] v^T = [v1 v2 v3]
[u2]
[u3]
uv^T =
[u1v1 u1v2 u1v3]
[u2v1 u2v2 u2v3]
[u3v1 u3v2 u3v3]
Outer products matter because they are the atoms of low-rank matrix structure. Any rank-$r$ matrix can be written as a sum of $r$ outer products.
They also appear directly in AI:
- dense-layer gradient for one sample: $\nabla_W L = \delta \, x^{T}$, an outer product of the upstream gradient and the input
- LoRA updates as sums of rank-1 terms
- covariance and second-moment estimates as sums of outer products
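A short NumPy check of the rank-1 claim and of the sum-of-outer-products form of a low-rank matrix; sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)
v = rng.normal(size=7)

M = np.outer(u, v)                         # 5 x 7 outer product
print(np.linalg.matrix_rank(M))            # 1

# A rank-2 matrix as a sum of two outer products
u2 = rng.normal(size=5)
v2 = rng.normal(size=7)
M2 = np.outer(u, v) + np.outer(u2, v2)
print(np.linalg.matrix_rank(M2))           # 2 (with probability 1 for random vectors)
```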
5.2 Kronecker Product
If
$$A \in \mathbb{R}^{m \times n}, \qquad B \in \mathbb{R}^{p \times q},$$
then the Kronecker product $A \otimes B \in \mathbb{R}^{mp \times nq}$ is the block matrix
$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}$$
defined by scaling a full copy of $B$ by each entry of $A$.
Important properties:
$$(A \otimes B)(C \otimes D) = (AC) \otimes (BD), \qquad (A \otimes B)^{T} = A^{T} \otimes B^{T}, \qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$
(the last identity assuming both inverses exist).
The Kronecker product appears whenever a large linear structure decomposes across independent dimensions. In AI it is especially important in second-order optimization approximations such as K-FAC.
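NumPy exposes the Kronecker product directly as np.kron; the snippet below (small random matrices) checks the shape rule and the mixed-product property stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(3, 2))
D = rng.normal(size=(5, 4))

K = np.kron(A, B)
print(K.shape)                       # (8, 15) = (2*4, 3*5)

# Mixed-product property: (A kron B)(C kron D) = (AC) kron (BD)
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))         # True
```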
5.3 Khatri-Rao Product
The Khatri-Rao product is the columnwise Kronecker product. If
$$A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{p \times k}$$
with the same number of columns, then
$$A \odot_{\mathrm{KR}} B = \begin{bmatrix} a_1 \otimes b_1 & a_2 \otimes b_2 & \cdots & a_k \otimes b_k \end{bmatrix} \in \mathbb{R}^{mp \times k}$$
It is more specialized than the Kronecker product, but it shows up in multilinear algebra, tensor factorization, and certain bilinear model derivations.
5.4 Block Matrix Multiplication
Block matrices let us reason about large matrices hierarchically. If compatible blocks are used, multiplication works exactly like scalar algebra except that scalar multiplication is replaced by matrix multiplication.
For example:
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}$$
This is the correct abstraction for tiled GPU GEMM, distributed matrix multiply, tensor parallelism, and block-structured model design.
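Here is a deliberately simple blocked matrix multiply in NumPy. It is a sketch of the tiling idea, not a performant kernel: real GEMM implementations run the same loop structure out of registers and shared memory. The tile size is an arbitrary choice.

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Multiply A (m x p) by B (p x n) one tile at a time, accumulating partial sums."""
    m, p = A.shape
    p2, n = B.shape
    assert p == p2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for k in range(0, p, tile):
                # Block algebra works like scalar algebra:
                # C block (i, j) += A block (i, k) @ B block (k, j)
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 300))
B = rng.normal(size=(300, 150))
print(np.allclose(blocked_matmul(A, B), A @ B))   # True
```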
6. Matrix Inverse
6.1 Definition and Existence
A square matrix $A$ is invertible if there exists a matrix $A^{-1}$ such that
$$A A^{-1} = A^{-1} A = I$$
This means $A^{-1}$ undoes the transformation performed by $A$.
x --A--> y
y --A^{-1}--> x
Applying A and then A^{-1} returns the original vector.
For a square matrix $A \in \mathbb{R}^{n \times n}$, the following are equivalent:
- $A$ is invertible
- the columns of $A$ are linearly independent
- the rows of $A$ are linearly independent
- $Ax = 0$ has only the trivial solution $x = 0$
- 0 is not an eigenvalue of $A$
6.2 Properties of Inverse
If $A$ is invertible, then the inverse is unique. Other core properties are:
$$(A^{-1})^{-1} = A, \qquad (AB)^{-1} = B^{-1}A^{-1}, \qquad (A^{T})^{-1} = (A^{-1})^{T}, \qquad (cA)^{-1} = \tfrac{1}{c}A^{-1} \;\; (c \neq 0)$$
Again, inverse reverses order just like transpose.
6.3 Explicit Formula for 2x2 Inverse
For
$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix},$$
the inverse exists if and only if
$$\det(A) = ad - bc \neq 0$$
When it exists:
$$A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
This is worth memorizing for intuition, but it is not how serious matrix inversion is done for large matrices.
6.4 Adjugate Formula General Case
For a general square matrix, one classical formula is
$$A^{-1} = \frac{1}{\det(A)} \, \mathrm{adj}(A),$$
where $\mathrm{adj}(A)$ is the adjugate matrix obtained by transposing the cofactor matrix.
This formula is theoretically elegant but computationally poor. It explains structure; it is not the preferred practical algorithm.
6.5 Computing Inverses in Practice
There are several standard ways to compute with inverses or inverse-like effects.
Gaussian elimination
Augment the matrix with the identity:
$$[\, A \mid I \,]$$
and row-reduce until it becomes
$$[\, I \mid A^{-1} \,]$$
This is conceptually direct and leads to $O(n^{3})$ cost.
LU-based methods
Factor
$$A = LU$$
or, with pivoting,
$$PA = LU$$
Then solve the triangular systems needed to reconstruct the inverse column by column.
But in practice, the most important rule is:
To solve $Ax = b$, do not compute $A^{-1}$ explicitly unless you have a very unusual reason.
Instead, solve the linear system directly using LU, QR, or Cholesky.
Why?
- fewer operations
- better numerical stability
- better library support
- avoids forming an object you usually do not need
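In NumPy that rule looks like the following; the matrix here is random and well-conditioned by construction, so both approaches give the same answer, but np.linalg.solve is the pattern to reach for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.normal(size=(n, n)) + n * np.eye(n)    # diagonally dominant, hence well-conditioned
b = rng.normal(size=n)

x_solve = np.linalg.solve(A, b)        # factorizes A (LU with pivoting) and solves directly
x_inv = np.linalg.inv(A) @ b           # forms the explicit inverse first: avoid in practice

print(np.allclose(x_solve, x_inv))                 # True here, but solve is cheaper and more stable
print(np.linalg.norm(A @ x_solve - b))             # residual near machine precision
```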
6.6 Numerical Stability and Condition Number
Some matrices are easy to invert numerically. Others are technically invertible but catastrophically unstable in floating-point arithmetic.
The condition number measures this sensitivity:
$$\kappa(A) = \|A\| \, \|A^{-1}\|$$
In the 2-norm,
$$\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
Interpretation:
- $\kappa(A) \approx 1$: well-conditioned
- $\kappa(A) \gg 1$: ill-conditioned
If $A$ is ill-conditioned, small perturbations in data or rounding error can produce large perturbations in the solution.
This matters directly in AI:
- poorly conditioned Hessians slow optimization
- badly scaled feature matrices make regression unstable
- covariance matrices can become numerically troublesome
- second-order methods require careful conditioning
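A small NumPy experiment (with an artificially near-singular matrix) shows the amplification effect: a tiny perturbation of the right-hand side moves the solution by a large relative amount when the condition number is large.

```python
import numpy as np

eps = 1e-8
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps]])          # nearly singular
print(np.linalg.cond(A))                  # roughly 4e8: very ill-conditioned

b = np.array([2.0, 2.0])
x = np.linalg.solve(A, b)

db = np.array([0.0, 1e-6])                # tiny perturbation of b
x_pert = np.linalg.solve(A, b + db)

rel_input = np.linalg.norm(db) / np.linalg.norm(b)
rel_output = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(rel_input, rel_output)              # the output change is many orders of magnitude larger
```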
7. Moore-Penrose Pseudo-Inverse
7.1 Motivation
Ordinary inverse is only defined for square, full-rank matrices. But many practical matrices are rectangular or rank-deficient. So we need a generalized inverse.
The Moore-Penrose pseudo-inverse, written $A^{+}$, extends inverse-like behavior to every matrix $A \in \mathbb{R}^{m \times n}$.
It solves two kinds of problems especially well:
- least-squares solutions when $Ax = b$ has no exact solution
- minimum-norm solutions when there are infinitely many exact solutions
7.2 Definition via SVD
Let
$$A = U \Sigma V^{T}$$
be the SVD of $A$, where the nonzero singular values are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Then
$$A^{+} = V \Sigma^{+} U^{T},$$
where $\Sigma^{+}$ is obtained by transposing $\Sigma$ and replacing each nonzero singular value $\sigma_i$ with $1/\sigma_i$.
If $A$ is square and invertible, then $A^{+} = A^{-1}$.
7.3 Moore-Penrose Conditions
The pseudo-inverse is characterized uniquely by the four Penrose conditions:
$$A A^{+} A = A, \qquad A^{+} A A^{+} = A^{+}, \qquad (A A^{+})^{T} = A A^{+}, \qquad (A^{+} A)^{T} = A^{+} A$$
Two especially important consequences are:
- $A A^{+}$ is the orthogonal projector onto $\mathrm{col}(A)$
- $A^{+} A$ is the orthogonal projector onto $\mathrm{row}(A) = \mathrm{col}(A^{T})$
7.4 Least Squares Solution
Suppose $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$.
If $m > n$, the system $Ax = b$ is overdetermined and there is usually no exact solution. The pseudo-inverse returns the least-squares solution:
$$x^{\star} = A^{+} b$$
If $A$ has full column rank, this becomes
$$x^{\star} = (A^{T} A)^{-1} A^{T} b$$
This solves
$$\min_x \; \|Ax - b\|_2$$
Geometrically, $A x^{\star}$ is the projection of $b$ onto the column space of $A$.
If $m < n$, there are typically infinitely many exact solutions. Then $A^{+} b$ returns the minimum-norm solution.
Least squares picture
b
/|
/ |
/ | residual
/ |
-----*----+----------> col(A)
A x_star
A x_star is the closest reachable vector to b.
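The following NumPy sketch (random overdetermined system) compares three equivalent routes to the least-squares solution: the pseudo-inverse, the normal equations, and np.linalg.lstsq, which is what one would normally use.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
A = rng.normal(size=(m, n))            # tall matrix: overdetermined system
b = rng.normal(size=m)

x_pinv = np.linalg.pinv(A) @ b                          # pseudo-inverse (computed via SVD)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)            # normal equations (full column rank)
x_lstsq, residuals, rank, svals = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_pinv, x_normal), np.allclose(x_pinv, x_lstsq))   # True True
print(np.linalg.norm(A @ x_lstsq - b))                  # residual norm: nonzero, but minimal
```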
7.5 Pseudo-Inverse Properties
Important identities:
$$(A^{+})^{+} = A, \qquad (A^{T})^{+} = (A^{+})^{T}$$
In general,
$$(AB)^{+} \neq B^{+} A^{+}$$
The pseudo-inverse behaves like inverse in some ways, but not all.
8. Matrix Decompositions - Preview
Matrix decompositions factor into structured products that expose different kinds of mathematical structure. Each decomposition in this section has a dedicated home elsewhere in the curriculum where it is treated in full - with proofs, algorithms, and worked examples. What follows is a navigational preview: enough context to understand which decomposition applies and why, plus explicit forward links to the full treatment.
Decomposition decision table
| Goal | Use |
|---|---|
| Solve $Ax = b$ for general square $A$ | LU (Gaussian elimination) |
| Solve $Ax = b$ for symmetric positive definite $A$ | Cholesky |
| Least squares: $\min_x \|Ax - b\|_2$ | QR |
| Eigenvalues and eigenvectors of square $A$ | Eigendecomposition |
| Best rank-$k$ approximation, pseudo-inverse, all cases | SVD |
8.1 Determinant
The determinant of a square matrix is a scalar measuring the signed volume-scaling factor of the linear map. When $|\det(A)| > 1$ the map expands volume; when $|\det(A)| < 1$ it compresses; when $\det(A) = 0$ the map is singular - it collapses the full space onto a lower-dimensional subspace, meaning $Ax = b$ has no unique solution.
The determinant is also the product of all eigenvalues and the reciprocal of the product of the eigenvalues of $A^{-1}$. In practice, explicit determinant computation is rarely done by direct cofactor expansion (which costs $O(n!)$) - instead it is read off from an LU or Cholesky factorization as a product of diagonal entries.
For AI: Log-determinants appear in Gaussian log-likelihoods through the $\log \det \Sigma$ term, in the ELBO of variational inference, and in normalizing flow log-likelihoods as the log absolute Jacobian determinant.
-> Full treatment: Determinants
8.2 LU Decomposition
LU factorization writes $PA = LU$, where $P$ is a permutation (row-reordering) matrix, $L$ is lower triangular with ones on the diagonal, and $U$ is upper triangular. It is the computational form of Gaussian elimination. Once computed in $O(n^{3})$, each new right-hand side can be solved in $O(n^{2})$ via forward substitution on $L$ then back substitution on $U$.
For AI: LU appears in second-order optimizers (K-FAC, Shampoo), Kalman filter updates, and continuous normalizing flows that require solving stiff ODEs. It is also how numpy.linalg.solve and scipy.linalg.solve are implemented internally.
-> Full treatment: Matrix Decompositions
8.3 QR Decomposition
QR factorization writes $A = QR$, where $Q$ is orthogonal ($Q^{T}Q = I$) and $R$ is upper triangular. It is numerically more stable than LU for least-squares problems because multiplying by $Q$ preserves norms. The main algorithms are Gram-Schmidt orthogonalization (conceptually clear), Householder reflectors (numerically stable, standard in practice), and Givens rotations (good for sparse matrices).
For AI: Orthogonal weight initialization (Saxe et al. 2013) uses QR. The QR algorithm is the standard method for computing all eigenvalues of a dense matrix. Weight orthogonalization regularizers (Brock et al.) apply QR repeatedly during training.
-> Full treatment: Matrix Decompositions
8.4 Eigendecomposition
For a diagonalizable matrix $A$, the eigendecomposition is $A = V \Lambda V^{-1}$, where the columns of $V$ are eigenvectors and the diagonal matrix $\Lambda$ holds the eigenvalues. It reveals the invariant directions of the linear map and makes matrix powers trivial: $A^{k} = V \Lambda^{k} V^{-1}$.
Not every square matrix is diagonalizable. Defective matrices (with repeated eigenvalues and insufficient eigenvectors) require the Jordan normal form. For symmetric real matrices, the spectral theorem guarantees real eigenvalues and an orthonormal eigenbasis, making $V$ orthogonal and giving $A = Q \Lambda Q^{T}$.
For AI: Eigenspectrum analysis of the Hessian explains sharpness and flat minima. Adam's diagonal preconditioning approximates the Fisher information matrix eigenspectrum. Vanishing and exploding gradients in RNNs are a direct consequence of the dominant eigenvalue of the recurrence matrix exceeding or falling below 1.
-> Full treatment: Eigenvalues and Eigenvectors
8.5 Singular Value Decomposition
The SVD is the universal matrix factorization: for any $A \in \mathbb{R}^{m \times n}$,
$$A = U \Sigma V^{T},$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$. Unlike eigendecomposition, SVD exists for every matrix - rectangular, rank-deficient, or otherwise.
Key consequences: the best rank-$k$ approximation is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T}$ (Eckart-Young theorem); the pseudo-inverse is $A^{+} = V \Sigma^{+} U^{T}$; the condition number is $\kappa_2(A) = \sigma_{\max}/\sigma_{\min}$; and the four fundamental subspaces (column space, null space, row space, left null space) are read directly from $U$ and $V$.
For AI: LoRA decomposes parameter updates as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ - a direct consequence of the SVD low-rank viewpoint. Mechanistic interpretability uses SVD to decompose attention OV circuits into rank-1 components. Randomised SVD (Halko et al. 2011) makes low-rank approximation tractable for billion-parameter weight matrices.
-> Full treatment: Singular Value Decomposition
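A minimal NumPy illustration of the Eckart-Young idea: truncate the SVD of a random low-rank-plus-noise matrix and compare reconstruction error at different ranks. All sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, true_rank = 100, 80, 5

# Build a matrix that is approximately rank 5, plus small noise
A = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))
A += 0.01 * rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k: int) -> np.ndarray:
    """Best rank-k approximation: keep the k largest singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

for k in (1, 5, 20):
    err = np.linalg.norm(A - rank_k_approx(k)) / np.linalg.norm(A)
    print(k, err)       # the error drops sharply once k reaches the true rank
```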
8.6 Cholesky Decomposition
For a symmetric positive definite (SPD) matrix $A$, Cholesky gives $A = LL^{T}$ where $L$ is lower triangular with positive diagonal. It is approximately twice as fast as LU for SPD systems (exploiting symmetry), requires no pivoting, and serves as a numerical test of positive definiteness: Cholesky fails if and only if $A$ is not SPD.
For AI: Gaussian process inference, Bayesian neural network covariance updates, Kalman filter covariance propagation, and second-order optimizer preconditioning (natural gradient methods) all require efficient SPD linear solves - Cholesky is the standard tool. It also appears in sampling from multivariate Gaussians: to draw $x \sim \mathcal{N}(\mu, \Sigma)$, compute $\Sigma = LL^{T}$ then return $x = \mu + Lz$ where $z \sim \mathcal{N}(0, I)$.
-> Full treatment: Matrix Decompositions
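The Gaussian-sampling recipe above translates directly into a few NumPy lines; the covariance here is an arbitrary SPD matrix built for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + 0.1 * np.eye(3)        # SPD by construction

L = np.linalg.cholesky(Sigma)            # Sigma = L L^T, raises LinAlgError if not SPD

# Draw 10,000 samples x = mu + L z with z ~ N(0, I)
z = rng.normal(size=(3, 10_000))
x = mu[:, None] + L @ z

print(np.abs(np.cov(x) - Sigma).max())   # small: the sample covariance approaches Sigma
```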
9. Common Mistakes
| Mistake | Why it is wrong | Better rule |
|---|---|---|
| "Matrix multiplication is commutative." | In general $AB \neq BA$, and sometimes only one product is even defined. | Track order carefully. |
| "$AB = 0$ means one factor must be zero." | Matrices have zero divisors. | Zero product does not imply zero factor. |
| "$(AB)^{-1} = A^{-1}B^{-1}$." | Inverse reverses order. | $(AB)^{-1} = B^{-1}A^{-1}$. |
| "If $A^{2} = 0$, then $A = 0$." | Nilpotent matrices exist. | Matrix powers do not behave like scalar powers. |
| "Product of symmetric matrices is symmetric." | $(AB)^{T} = B^{T}A^{T} = BA$, which equals $AB$ only if $A$ and $B$ commute. | $AB$ is symmetric iff $AB = BA$ when both are symmetric. |
| "Every square matrix has an inverse." | A square matrix can still be singular. | Check rank, determinant, or null space first. |
| "To solve $Ax = b$, compute $A^{-1}b$." | Explicit inversion is usually slower and less stable than solving directly. | Use a linear solver or factorization. |
| "Row rank and column rank may differ." | They are always equal. | Rank is one number, not two competing ones. |
| "Every square matrix has an eigendecomposition." | Some matrices are defective and not diagonalizable. | SVD always exists; eigendecomposition does not. |
| "Hadamard and matrix product are basically the same." | One is entrywise, one contracts over an index. | Treat them as different operations. |
10. Exercises
These exercises are designed to force both algebraic fluency and computational interpretation.
- Matrix arithmetic and properties
Let
Compute:
- and , and verify they differ
- and
- using the formula
- , , and
- and
- Shape compatibility and broadcasting. Let , , , , .
- which of , , , , , are defined? State each resulting shape.
- compute the shape of and verify it equals
- is defined? What about ?
- compute - what kind of object is the result?
- write as a sum of rank-1 outer products (column-of- times row-of-)
- Trace and inner-product identities. Let .
- prove from the definition
- show that (the Frobenius inner product)
- use to write
- for , , verify numerically
- is in general? Prove or give a counterexample.
- Block matrix multiply and the outer-product form. Partition and as blocks of size .
- write the block matrix product formula
- for with , , , and , compute by block multiply and verify it equals
- express a rank-2 matrix as a matrix product and state the shapes
- explain why every rank- matrix can be written as a product of an matrix and an matrix
- for a LoRA update with and , compute the parameter count versus full
- Pseudo-inverse and least squares
Let
Solve:
- show why the system is overdetermined
- compute
- compute the residual norm
- verify the pseudo-inverse formula using SVD
- explain why is a projection matrix
- Attention as matrix operations
Let $Q$, $K$, and $V$ be the projected matrices for a single attention head.
- write down the shapes in the attention formula
- explain why is
- apply row-wise softmax conceptually to obtain attention weights
- compute the output shape of
- explain why each row of the attention matrix sums to 1
- if , determine whether is symmetric and whether the softmaxed matrix must still be symmetric
- Conditioning and numerical stability. Let for small .
- compute symbolically as a function of
- compute explicitly using the inverse formula
- compute the condition number using the spectral norm (largest singular value) - describe what happens as
- explain why a large condition number means that a small perturbation in causes a large perturbation in the solution of
- for , estimate the number of digits of precision lost in the solution relative to 64-bit floating-point (which carries roughly 16 significant decimal digits)
- LoRA and low-rank structure
Let and let a LoRA update use rank .
- compute the number of parameters in full
- compute the number of parameters in a rank-8 factorization
- express the update as a sum of outer products
- interpret a singular-value spectrum such as
- explain how truncated SVD could compress the learned update further
11. Why This Matters for AI
| Aspect | Why matrix operations matter |
|---|---|
| Every forward pass | Transformer layers are dominated by matrix multiplies: projections, attention scores, attention outputs, and feed-forward blocks. |
| Hardware efficiency | GPU and accelerator utilization is largely a GEMM scheduling problem. Batching, tiling, and layout determine whether tensor units are actually used well. |
| Backpropagation | Gradient formulas are matrix operations with transposes, outer products, and reductions. Reverse-order transpose rules are built directly into reverse-mode differentiation. |
| Low-rank adaptation | LoRA is matrix factorization applied to parameter updates. Its practicality is a direct consequence of low-rank structure and the SVD viewpoint. |
| MLA-style attention | DeepSeek-V2 reports that Multi-head Latent Attention compresses KV cache heavily and increases generation throughput by caching a low-rank latent representation instead of full key-value states. |
| Numerical stability | Condition number, decomposition choice, and matrix norms affect whether optimization remains stable or blows up. |
| Quantization | Post-training quantization, compensation, and calibration all rely on matrix norms, least squares, and low-rank corrections. |
| Optimizers | Methods such as K-FAC and Shampoo are explicitly matrix-based, using structured curvature approximations, matrix roots, or Kronecker factors. |
| Interpretability | SVD and eigenspectral analysis of weight matrices, OV circuits, and activations expose dominant directions and effective rank. |
| Inference latency | Training is GEMM-heavy and compute-dense, while single-sequence decoding often behaves more like bandwidth-limited GEMV. Understanding the matrix regime explains the performance difference. |
The short version is that matrix operations are not a chapter you pass through on the way to AI. They are the operational core of AI itself.
12. Conceptual Bridge
Vectors and spaces gave us the geometry of linear algebra. Matrix operations gave us the computational rules for acting on that geometry. Every abstract idea from the previous chapter now has an executable form:
- subspaces become column spaces, row spaces, null spaces, and projections
- orthogonality becomes QR and Cholesky
- low-dimensional approximation becomes SVD truncation
- invertibility becomes solving linear systems
- composition becomes matrix multiplication
This is the bridge to the next topics:
vectors and spaces
-> matrix operations
-> spectral structure and decompositions
-> probability and statistics
-> optimization and dynamics
-> deep learning systems
If this chapter leaves you with one durable mental model, let it be this:
neural networks are not mysterious collections of tensor APIs; they are structured compositions of matrix operations, and the quality, speed, and stability of those operations determine what the model can learn and how efficiently it can run.
References
Core Linear Algebra and Numerical Linear Algebra
- Sheldon Axler, Linear Algebra Done Right: https://linear.axler.net/
- MIT 18.06 Linear Algebra materials: https://web.mit.edu/18.06/www/
- LAPACK official site: https://www.netlib.org/lapack/
- NVIDIA cuBLAS documentation: https://docs.nvidia.com/cuda/cublas/
- Halko, Martinsson, and Tropp (2011), Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions: https://authors.library.caltech.edu/27399/
Transformers, Attention, and Low-Rank Adaptation
- Ashish Vaswani et al. (2017), Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Tri Dao et al. (2022), FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: https://arxiv.org/abs/2205.14135
- Tri Dao et al. (2024), FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision: https://arxiv.org/abs/2407.08608
- Edward J. Hu et al. (2021/2022), LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685
Large-Model Systems and Hardware
- NVIDIA H100 product specifications: https://www.nvidia.com/en-eu/data-center/h100/
- NVIDIA Hopper architecture overview: https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
- DeepSeek-AI (2024), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: https://arxiv.org/abs/2405.04434
- DeepSeek-AI (2024/2025), DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437