Math for LLMs

Matrix Operations

Linear Algebra Basics / Matrix Operations

"If vectors are the language of representation, matrices are the language of computation. A modern neural network is what happens when you apply a very large number of matrix operations, very quickly, to very large collections of vectors."

Overview

The previous chapter introduced vectors, subspaces, bases, norms, projections, and the geometry of linear algebra. This chapter turns that geometry into computation. A matrix is not only a rectangular array of numbers. It is the concrete representation of a linear map between finite-dimensional vector spaces, and matrix operations are the computational procedures that make linear algebra operational.

This matters immediately in machine learning. A dense layer is a matrix-vector multiply plus bias. A batch of activations passing through a layer is a matrix-matrix multiply. Self-attention is built from matrix multiplies, elementwise operations, transposes, and softmax. Backpropagation is dominated by transposed matrix multiplies and outer products. Numerical stability, hardware utilization, low-rank adaptation, quantization, and efficient inference are all matrix-operation questions.

This chapter therefore moves in three directions at once:

  • the algebra of matrices as mathematical objects
  • the numerical algorithms used to compute with them
  • the AI interpretation of those operations in modern models

The goal is not only to know the rules of matrix arithmetic, but to understand why those rules explain how transformers run on GPUs, why certain decompositions are preferred in practice, and why low-rank structure appears so often in large-scale deep learning.

Prerequisites

  • Basic algebra
  • Familiarity with vectors in R^n
  • Basic knowledge of linear combinations, basis, rank, and subspaces
  • Comfort with summation notation

Learning Objectives

After completing this chapter, you should be able to:

  • Read and manipulate matrix notation fluently
  • Distinguish elementwise, algebraic, and decomposition-based matrix operations
  • Carry out matrix addition, scalar multiplication, transpose, trace, and matrix multiply correctly
  • Reason about shape compatibility and debug dimension errors quickly
  • Explain matrix multiplication from row, column, outer-product, and block perspectives
  • Understand when inverses exist and when pseudo-inverses are required
  • Connect matrix operations directly to forward passes, backpropagation, attention, LoRA, quantization, and hardware efficiency
  • Interpret matrix conditioning and numerical stability in AI systems
  • Recognise which matrix decomposition is appropriate for a given problem and know where to find its full treatment


1. Intuition

1.1 What Are Matrix Operations?

A matrix is a rectangular array of numbers arranged in rows and columns:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \in \mathbb{R}^{m \times n}.

But that description is only the surface. The deeper point is that every linear map

T : \mathbb{R}^n \to \mathbb{R}^m

is represented by a unique matrix A \in \mathbb{R}^{m \times n} such that

T(x) = Ax.

So matrix operations are the arithmetic of linear maps:

  • matrix addition adds linear transformations
  • matrix multiplication composes linear transformations
  • identity represents "do nothing"
  • inverse represents "undo the transformation"
  • transpose reorganizes rows and columns and, in the real inner-product case, corresponds to the adjoint
  • decompositions expose hidden structure inside the transformation

If vectors are the states of a linear system, matrices are the machines that act on those states.

Vectors are states.
Matrices are transformations.

x in R^n  --A-->  y in R^m

Matrix operations tell us how transformations
combine, invert, factor, and approximate each other.

1.2 Why Matrix Operations Are Central to AI

In deep learning, almost every major computation is a matrix operation.

For a dense layer with weights W \in \mathbb{R}^{m \times n}, bias b \in \mathbb{R}^m, and input x \in \mathbb{R}^n:

y = Wx + b.

For a batch of inputs X \in \mathbb{R}^{B \times n}:

Y = XW^\top + b.

That is matrix-matrix multiply plus broadcasting.
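
As a concrete illustration, here is a minimal NumPy sketch of that batched dense layer with made-up shapes; it is not a framework implementation, just a shape check:

import numpy as np

B, n, m = 4, 3, 5                 # batch size, input dim, output dim (illustrative values)
X = np.random.randn(B, n)         # batch of inputs, one example per row
W = np.random.randn(m, n)         # weight matrix of the dense layer
b = np.random.randn(m)            # bias vector

Y = X @ W.T + b                   # Y = X W^T + b, shape (B, m); b broadcasts across rows
print(Y.shape)                    # (4, 5)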

In scaled dot-product attention (Vaswani et al., 2017):

S = \frac{QK^\top}{\sqrt{d_k}}, \qquad A = \mathrm{softmax}(S), \qquad O = AV,

where Q, K, V are matrices. Two matrix multiplies dominate the work: QK^\top and AV.
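
The shapes can be checked directly. This is a minimal NumPy sketch of single-head attention with illustrative dimensions, not an optimized implementation:

import numpy as np

T, d_k = 4, 2                            # sequence length and head dimension (illustrative)
Q = np.random.randn(T, d_k)
K = np.random.randn(T, d_k)
V = np.random.randn(T, d_k)

S = Q @ K.T / np.sqrt(d_k)               # attention scores, shape (T, T)
A = np.exp(S - S.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)     # row-wise softmax; each row sums to 1
O = A @ V                                # output, shape (T, d_k)
print(S.shape, A.shape, O.shape)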

Backpropagation is also matrix arithmetic. If Y = XW^\top, then:

\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial Y}\right)^\top X, \qquad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W.

Weight updates are elementwise matrix operations:

W \leftarrow W - \eta \nabla_W.

So the forward pass, backward pass, and optimization step are all matrix-operation stories. Once models become large enough, performance becomes a question of how well those operations map to GPU kernels, memory hierarchies, and accelerator hardware.

1.3 The Computational Centrality of Matrix Multiply

The single most important primitive in deep learning systems is GEMM: general matrix-matrix multiplication. Vendor libraries such as cuBLAS on NVIDIA GPUs and oneDNN on CPUs exist largely to make GEMM fast and hardware-efficient. Dense neural workloads spend much of their time doing repeated multiply-accumulate operations over large arrays, so hardware is built around this pattern:

  • NVIDIA Tensor Cores specialize in tiled matrix multiply
  • TPUs organize compute as dense systolic-array linear algebra
  • many convolutions are lowered to GEMM-style operations
  • attention implementations are optimized around matmul and memory reuse

FlashAttention makes this especially clear. The original FlashAttention paper (Dao et al., 2022) did not change the mathematical definition of attention; it reorganized the computation to reduce IO between high-bandwidth memory and on-chip SRAM by tiling the operation better. FlashAttention-3 (Dao et al., 2024) pushed this further on H100 hardware, reporting FP16 throughput up to about 740 TFLOPs/s, roughly 75% utilization of peak tensor-core performance on that kernel.

This is one of the main systems lessons of modern AI:

  • if your model is slow, the issue is often not "too much math"
  • it is often "the matrix math is scheduled poorly"
  • bad tiling, awkward shapes, poor layout, or excessive memory traffic can waste expensive hardware

Why matrix multiply dominates

model layer
   -> matrix multiply
   -> nonlinearity
   -> matrix multiply
   -> normalization
   -> matrix multiply

Most of the FLOPs live in the multiply stages.
That is why hardware and libraries are optimized around GEMM.

1.4 The Map from Linear Algebra to Computation

It helps to translate abstract language directly into computational language.

| Abstract concept | Computational object |
| --- | --- |
| Linear map T : \mathbb{R}^n \to \mathbb{R}^m | Matrix A \in \mathbb{R}^{m \times n} |
| Composition T_2 \circ T_1 | Matrix product A_2 A_1 |
| Identity map | Identity matrix I |
| Inverse map | Matrix inverse A^{-1} |
| Adjoint in the real case | Transpose A^\top |
| Projection onto a subspace | Projection matrix P |
| Change of basis | Similarity transform B^{-1}AB |
| Orthonormal basis | Orthogonal matrix Q with Q^\top Q = I |
| Spectral decomposition | Eigendecomposition A = V \Lambda V^{-1} |
| Best rank-r approximation | Truncated SVD |

This dictionary is the bridge between theory and implementation. It explains why so many later algorithms are specialized ways of multiplying, factoring, or solving with matrices.

1.5 Taxonomy of Matrix Operations

The subject is easier to navigate if matrix operations are grouped by what they preserve or compute.

MATRIX OPERATIONS
|
|-- Elementwise operations
|   |-- addition / subtraction
|   |-- scalar multiplication
|   `-- Hadamard product
|
|-- Structure-preserving operations
|   |-- transpose
|   |-- conjugate transpose
|   `-- trace
|
|-- Algebraic operations
|   |-- matrix multiplication
|   |-- matrix-vector multiply
|   `-- outer product
|
|-- Solving and inverting
|   |-- inverse
|   |-- pseudo-inverse
|   `-- linear system solvers
|
|-- Decompositions
|   |-- LU
|   |-- QR
|   |-- eigendecomposition
|   |-- SVD
|   `-- Cholesky
|
`-- Derived quantities
    |-- determinant
    |-- rank
    |-- norms
    `-- condition number

These categories overlap, but the split is useful. It separates simple entrywise arithmetic from the deeper operations that reveal structure or solve numerical problems.

1.6 Historical Timeline

Matrix operations did not appear all at once. The modern subject is the result of several centuries of algebra, geometry, and numerical computation.

| Period | Development | Why it matters |
| --- | --- | --- |
| Ancient China | Elimination methods in Nine Chapters on the Mathematical Art | Proto-Gaussian elimination |
| 17th-18th centuries | Determinants, Cramer's rule, linear systems | Early algebraic treatment of systems |
| 1809 | Gauss formalizes least squares and elimination | Linear systems and approximation become computational sciences |
| 1850s | Cayley and Sylvester develop matrix algebra | Matrices become algebraic objects |
| Late 19th century | Frobenius and spectral ideas | Eigenvalues and structure become central |
| Early 20th century | Gram-Schmidt and orthogonalization | Stable basis construction emerges |
| Mid 20th century | Householder, Golub, Reinsch, Wilkinson | Modern numerical linear algebra is born |
| 1990s | LAPACK | Reference dense linear algebra software stack |
| 2000s onward | cuBLAS and GPU linear algebra | Matrix operations move to accelerators |
| 2017 onward | Transformers and attention | GEMM becomes the dominant workload in AI |
| 2020s | FlashAttention, Tensor Cores, FP8 | Matrix arithmetic is co-designed with hardware |

The historical arc is instructive. Matrix operations began as abstract mathematics, matured into numerical algorithms, and then became the central workload of modern AI hardware.


2. Matrix Basics and Notation

2.1 Definition and Notation

A matrix A \in \mathbb{R}^{m \times n} has m rows and n columns:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.

Standard notation:

  • A_{ij} or (A)_{ij} is the entry in row i, column j
  • the i-th row is written A_{i \cdot}
  • the j-th column is written A_{\cdot j}
  • each row is an element of \mathbb{R}^n
  • each column is an element of \mathbb{R}^m

That distinction matters. Matrix multiplication can be understood as rows meeting columns through dot products.

Shape and indexing

          columns
        j=1  j=2      j=n
      +--------------------+
i=1   | a11  a12  ... a1n |
i=2   | a21  a22  ... a2n |
 ...  | ...  ...  ... ... |
i=m   | am1  am2  ... amn |
      +--------------------+
           rows

Block notation is also important. A matrix can be partitioned into submatrices:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.

This is not just notation. Block structure is how large matrix computations are actually implemented on modern hardware.

2.2 Special Matrices

Many algorithms simplify once a matrix has special structure.

Square matrix
A matrix with the same number of rows and columns.

Zero matrix
The additive identity; every entry is zero.

Identity matrix

I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.

It satisfies AI = IA = A when shapes are compatible.

Diagonal matrix

D = \mathrm{diag}(d_1, \ldots, d_n).

Multiplying by a diagonal matrix is coordinatewise scaling.

Upper and lower triangular matrices
Upper triangular matrices have zeros below the diagonal; lower triangular matrices have zeros above it.

Symmetric matrix

A^\top = A.

Symmetric matrices have real eigenvalues and orthogonal eigenspaces for distinct eigenvalues.

Skew-symmetric matrix

A^\top = -A.

Its diagonal entries must be zero.

Orthogonal matrix

Q^\top Q = QQ^\top = I.

Its columns form an orthonormal basis, and Q^{-1} = Q^\top.

Positive definite matrix

x^\top A x > 0 \quad \text{for all } x \neq 0.

All eigenvalues are positive.

Positive semidefinite matrix

x^\top A x \geq 0 \quad \text{for all } x.

All eigenvalues are nonnegative.

These classes are not merely labels. They determine which algorithms are valid:

  • SPD matrices admit Cholesky decomposition
  • orthogonal matrices make inversion cheap
  • triangular matrices make solving systems cheap
  • symmetric matrices make spectral analysis easier

Some important matrix shapes

Identity           Diagonal           Upper triangular

[1 0 0]            [d1 0  0 ]         [a b c]
[0 1 0]            [0  d2 0 ]         [0 d e]
[0 0 1]            [0  0  d3]         [0 0 f]

Symmetric          Lower triangular

[a b c]            [a 0 0]
[b d e]            [b c 0]
[c e f]            [d e f]

2.3 Matrix Shape and Compatibility

Shape checking is the first line of defense in matrix reasoning.

Addition requires identical shape:

A, B \in \mathbb{R}^{m \times n} \quad \Longrightarrow \quad A + B \in \mathbb{R}^{m \times n}.

Multiplication requires matching inner dimensions:

A \in \mathbb{R}^{m \times p}, \qquad B \in \mathbb{R}^{p \times n} \quad \Longrightarrow \quad AB \in \mathbb{R}^{m \times n}.

The inner dimension p disappears because it is the contracted summation index.

Compatibility rule for multiplication

(m x p)  times  (p x n)  ->  (m x n)

   A               B              AB

inner dimensions must match
outer dimensions survive

This sounds elementary, but in practice it is one of the most important debugging tools in machine learning. If a tensor program fails, a shape mismatch is one of the first things to check.

Broadcasting is different from matrix multiplication. In libraries such as NumPy and PyTorch, dimensions of size 1 can be expanded logically to match other shapes. That is a convenience rule for elementwise operations, not a new form of matrix algebra.
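
A short NumPy sketch of the distinction (shapes are illustrative): broadcasting stretches size-1 dimensions for elementwise work, while @ contracts an inner dimension:

import numpy as np

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
row = np.random.randn(1, 4)

print((A @ B).shape)        # (3, 2): inner dimension 4 is contracted
print((A + row).shape)      # (3, 4): the size-1 axis of `row` is broadcast across rows
# A + B would raise a ValueError: (3, 4) and (4, 2) are not broadcast-compatible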

2.4 Memory Layout

A matrix is conceptually two-dimensional, but in memory it is stored as a linear sequence of numbers. The layout matters for performance.

In row-major order (common in C-style conventions), entries of the same row are stored contiguously. For a matrix with shape (m, n), the offset of entry (i, j), counted in elements from the start of the array, is

\text{offset} = i \cdot n + j.

In column-major order (common in Fortran and classical BLAS/LAPACK conventions), entries of the same column are stored contiguously:

\text{offset} = i + j \cdot m.

2 x 3 matrix

[a b c]
[d e f]

row-major memory:    a b c d e f
column-major memory: a d b e c f

This matters because contiguous memory access is cache-friendly. A mathematically identical algorithm can run much faster or much slower depending on layout and traversal order.

The transpose illustrates the difference between mathematics and storage. Mathematically,

(A^\top)_{ij} = A_{ji}.

But in high-performance libraries, a transpose often does not require moving data immediately. It can be represented by changing shape and stride metadata, so the same underlying memory is interpreted with swapped indexing. In PyTorch, a 2D .T operation typically returns a view with transposed strides rather than a full data copy.

This matters in AI because operations such as QK^\top in attention rely heavily on fast transposed access patterns. Optimized systems try to get the mathematical effect of transpose without paying the full physical cost of reshuffling memory.
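
The stride view can be inspected directly. A minimal NumPy sketch (NumPy exposes the same idea through .strides; the byte counts assume float64):

import numpy as np

A = np.arange(6, dtype=np.float64).reshape(2, 3)   # row-major by default

print(A.strides)      # (24, 8): moving one row skips 3 * 8 bytes, one column skips 8 bytes
print(A.T.strides)    # (8, 24): the transpose swaps the strides instead of copying data
print(A.T.base is A)  # True: A.T is a view onto the same memory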


3. Elementwise and Scalar Operations

3.1 Matrix Addition and Subtraction

If A, B \in \mathbb{R}^{m \times n}, then matrix addition and subtraction are defined entrywise:

(A \pm B)_{ij} = A_{ij} \pm B_{ij}.

The rules mirror vector addition:

  • commutativity: A + B = B + A
  • associativity: (A + B) + C = A + (B + C)
  • additive identity: A + 0 = A
  • additive inverse: A + (-A) = 0

The operation cost is O(mn) because every entry must be touched once. On modern GPUs, matrix addition is usually memory-bandwidth bound rather than compute-bound. There is very little arithmetic per byte moved.

In deep learning, matrix addition appears constantly:

  • adding biases
  • adding residual connections
  • updating weights
  • combining branches in computation graphs

3.2 Scalar Multiplication

If \alpha \in \mathbb{R} and A \in \mathbb{R}^{m \times n}, then

(\alpha A)_{ij} = \alpha A_{ij}.

Every entry is scaled by the same scalar.

The familiar linearity properties hold:

  • \alpha(A + B) = \alpha A + \alpha B
  • (\alpha + \beta)A = \alpha A + \beta A
  • \alpha(\beta A) = (\alpha \beta)A
  • 1 \cdot A = A

Together with addition, scalar multiplication defines affine updates such as

\alpha A + \beta B.

In numerical linear algebra, the operation

Y \leftarrow \alpha X + Y

is so common that BLAS gives it a dedicated name: AXPY. Gradient descent is exactly this pattern:

\theta \leftarrow \theta - \eta g,

where \eta is a scalar learning rate and g is a gradient matrix or tensor.

3.3 Hadamard Elementwise Product

The Hadamard product of two same-shaped matrices A, B \in \mathbb{R}^{m \times n} is

(A \odot B)_{ij} = A_{ij} B_{ij}.

This is elementwise multiplication. It is not matrix multiplication.

Hadamard product vs matrix product

same shapes required              inner dimensions required

A (m x n)                         A (m x p)
B (m x n)                         B (p x n)

A odot B  ->  (m x n)             AB      ->  (m x n)

no summation                      sum over shared index
entry by entry                    row-column dot products

Properties:

  • commutative: A \odot B = B \odot A
  • associative: (A \odot B) \odot C = A \odot (B \odot C)
  • distributive over addition: A \odot (B + C) = A \odot B + A \odot C

But it does not obey the same algebra as matrix multiplication.

The Hadamard product is central in AI because modern networks use gates all over the place:

  • LSTM and GRU gates
  • dropout masks
  • attention masks and padding masks
  • SwiGLU and related gated feed-forward activations
  • elementwise modulation in conditional models
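
A minimal NumPy sketch contrasting the two products and showing a gate or mask in use (all values illustrative):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])

print(A * B)     # Hadamard product: entrywise, shape unchanged
print(A @ B)     # matrix product: row-column dot products

mask = np.array([[1.0, 0.0], [1.0, 1.0]])   # e.g. a dropout or padding mask
print(A * mask)  # gating: masked entries are zeroed elementwise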

3.4 Trace

For a square matrix A \in \mathbb{R}^{n \times n}, the trace is

\mathrm{tr}(A) = \sum_{i=1}^n A_{ii}.

Key properties:

Linearity

\mathrm{tr}(\alpha A + \beta B) = \alpha\,\mathrm{tr}(A) + \beta\,\mathrm{tr}(B).

Transpose invariance

\mathrm{tr}(A^\top) = \mathrm{tr}(A).

Cyclic permutation

For compatible matrices:

\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB).

This does not mean arbitrary reordering is allowed.

Frobenius inner product

\mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij} = \langle A, B \rangle_F.

Eigenvalue sum

If A is diagonalizable, then

\mathrm{tr}(A) = \sum_i \lambda_i.

In AI, trace appears in Frobenius norms, covariance identities, kernel formulas, and matrix-calculus derivations.
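
These identities are easy to sanity-check numerically. A minimal NumPy sketch (random matrices, so agreement is up to floating-point error):

import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 3)), rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                # cyclic property
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))        # cyclic shift, not arbitrary reordering
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))                # Frobenius inner product
print(np.isclose(np.trace(A), np.sum(np.linalg.eigvals(A)).real))  # trace = sum of eigenvalues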

3.5 Transpose

The transpose of A \in \mathbb{R}^{m \times n} is the matrix A^\top \in \mathbb{R}^{n \times m} defined by

(A^\top)_{ij} = A_{ji}.

Rows become columns and columns become rows.

Basic properties:

  • (A^\top)^\top = A
  • (A + B)^\top = A^\top + B^\top
  • (\alpha A)^\top = \alpha A^\top
  • (AB)^\top = B^\top A^\top

The reverse-order rule is fundamental:

((AB)^\top)_{ij} = (AB)_{ji} = \sum_k A_{jk} B_{ki} = \sum_k (B^\top)_{ik} (A^\top)_{kj} = (B^\top A^\top)_{ij}.

So

(AB)^\top = B^\top A^\top.

This is the algebraic reason backpropagation has the structure it does. Reverse-mode differentiation repeatedly moves backward through compositions, and transposes inherit the same reverse-order behavior.
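
A quick numerical check of the reverse-order rule (a minimal NumPy sketch with arbitrary shapes):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

lhs = (A @ B).T            # shape (4, 2)
rhs = B.T @ A.T            # same shape, same values
print(np.allclose(lhs, rhs))   # True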

Transpose flips orientation

A:      m rows, n cols
A^T:    n rows, m cols

row of A      -> column of A^T
column of A   -> row of A^T

4. Matrix Multiplication

4.1 Definition

If

A \in \mathbb{R}^{m \times p}, \qquad B \in \mathbb{R}^{p \times n},

then the matrix product C = AB \in \mathbb{R}^{m \times n} is defined by

C_{ij} = \sum_{k=1}^p A_{ik} B_{kj}.

Equivalently, the (i, j) entry is the dot product of row i of A with column j of B:

C_{ij} = A_{i \cdot} \cdot B_{\cdot j}.

How one entry is formed

row i of A:      [ a_i1  a_i2  ...  a_ip ]
column j of B:   [ b_1j
                   b_2j
                   ...
                   b_pj ]

dot product -> c_ij

This definition is exactly what composition of linear maps requires. If x is first transformed by B and then by A, the combined transformation is (AB)x.

4.2 Four Interpretations of Matrix Multiply

One of the best ways to understand matrix multiply is to see four different but equivalent interpretations.

Row perspective

C_{i \cdot} = A_{i \cdot} B.

Each row of C is the corresponding row of A acting on B.

Column perspective

C_{\cdot j} = A B_{\cdot j}.

Each column of C is A applied to the corresponding column of B.

Outer-product perspective

Let a_k = A_{\cdot k} be the k-th column of A and let b_k^\top = B_{k \cdot} be the k-th row of B. Then

AB = \sum_{k=1}^p a_k b_k^\top.

This is one of the most important views for low-rank structure and gradients.

Block perspective

If compatible blocks are used, block multiplication works exactly like scalar algebra with block products in place of scalar products.

One multiply, four views

rows meet columns
columns get transformed
outer products get accumulated
blocks combine hierarchically
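
The outer-product view in particular can be verified directly. A minimal NumPy sketch: accumulating rank-1 terms reproduces the full product:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Sum of p rank-1 outer products: column k of A times row k of B
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))
print(np.allclose(C_outer, A @ B))   # True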

4.3 Properties of Matrix Multiplication

Matrix multiplication obeys several essential rules.

Associativity

(AB)C = A(BC).

Distributivity

A(B + C) = AB + AC, \qquad (A + B)C = AC + BC.

Identity

AI = IA = A.

Zero behavior

A0 = 0, \qquad 0A = 0.

But

AB = 0 \not\Rightarrow A = 0 \text{ or } B = 0.

Transpose reversal

(AB)^\top = B^\top A^\top.

Rank inequality

\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B)).

The rank cannot exceed the bottleneck rank of the factors.

4.4 Non-Commutativity and Consequences

In general,

AB \neq BA.

Sometimes one product is defined and the other is not. Even when both are defined and square, they are usually different.

A standard counterexample is

A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}.

Then

AB = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad BA = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.

This has many consequences:

  • the order of layer operations matters
  • transpose and inverse formulas reverse order
  • gradient derivations must preserve order carefully
  • symmetric times symmetric is not automatically symmetric
  • changing parenthesization changes computational cost even if the answer is unchanged

For example, in transformer notation:

(XW_Q)(XW_K)^\top = X W_Q W_K^\top X^\top,

and that identity only works because transpose reverses the order inside the second factor.

4.5 Computational Complexity

For

C = AB, \qquad A \in \mathbb{R}^{m \times p}, \qquad B \in \mathbb{R}^{p \times n},

the naive algorithm performs about

2mnp - mn

floating-point operations if multiplies and additions are counted separately.

For square matrices, that becomes O(n^3).

Strassen's algorithm reduces the exponent to about

O(n^{2.807}),

and later algorithms reduce it further in theory. But these methods bring larger constants, more complicated memory behavior, and weaker numerical properties. They are mathematically important, but they are not the default engine of deep learning workloads.

Another practical optimization problem is matrix-chain order. Because multiplication is associative, the result does not change when the parenthesization changes, but the cost can change drastically.
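
A small illustration of that cost gap (illustrative sizes; the multiply-add count 2mnp follows the formula from Section 4.5):

def matmul_flops(m, p, n):
    """Approximate FLOPs for an (m x p) @ (p x n) product."""
    return 2 * m * p * n

# Chain A (1000 x 10), B (10 x 1000), C (1000 x 10): same result, very different cost.
cost_left = matmul_flops(1000, 10, 1000) + matmul_flops(1000, 1000, 10)   # (AB)C
cost_right = matmul_flops(10, 1000, 10) + matmul_flops(1000, 10, 10)      # A(BC)
print(cost_left, cost_right)   # about 40,000,000 vs 400,000 FLOPs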

4.6 GEMM on GPU: Hardware Perspective

On accelerators, matrix multiply is implemented by tiling the problem into fragments that fit into fast on-chip memory and specialized multiply-accumulate units.

At a high level:

  1. load a tile of A and a tile of B from HBM
  2. stage them into shared memory or registers
  3. run many fused multiply-add operations on the tile
  4. accumulate partial sums
  5. write the output tile back

Tiled GEMM

HBM -> shared memory -> registers -> multiply-accumulate -> write back

Do more arithmetic per byte loaded.
That is the whole performance game.

NVIDIA's H100 product specifications list up to 989 TFLOPS tensor-core TF32 performance, 1,979 TFLOPS tensor-core BF16/FP16 performance, 3,958 TFLOPS tensor-core FP8 performance, and 3.35 TB/s HBM bandwidth for the SXM variant. That gap between compute throughput and memory bandwidth is why naive memory movement can destroy performance.

Large dense GEMMs usually become compute-bound if tiled well. Small or awkwardly shaped matrix multiplies often do not.

FlashAttention is a direct application of this principle. The algorithm is faster not because it changes the mathematics of attention, but because it reorganizes the matrix operations and reductions to reduce expensive memory traffic between HBM and SRAM while keeping arithmetic dense and local (Dao et al., 2022; Dao et al., 2024).

4.7 Matrix-Vector Multiply GEMV

Matrix-vector multiply

y = Av, \qquad A \in \mathbb{R}^{m \times n}, \qquad v \in \mathbb{R}^n,

is a special case of matrix multiplication where one dimension is 1.

It costs about 2mn floating-point operations, but unlike large GEMM it has poor arithmetic intensity. You read the matrix, read the vector, and produce one output vector. There is much less reuse.

This is why GEMV is usually memory-bandwidth bound.

The AI consequence is important:

  • batched training prefers GEMM because batching increases reuse
  • autoregressive decoding often degenerates into many GEMV-like operations because the batch or active sequence dimension is small
  • single-sequence decoding therefore achieves far lower hardware utilization than large-batch training

Batching converts many GEMV workloads back toward GEMM, which is why serving systems aggressively batch requests whenever latency constraints allow it.


5. Special Matrix Products

5.1 Outer Product

If u \in \mathbb{R}^m and v \in \mathbb{R}^n, then the outer product is

uv^\top \in \mathbb{R}^{m \times n}, \qquad (uv^\top)_{ij} = u_i v_j.

Every column of uv^\top is a scalar multiple of u, and every row is a scalar multiple of v^\top. Therefore, if neither vector is zero, the matrix has rank 1.

Outer product

u = [u1]          v^T = [v1 v2 v3]
    [u2]
    [u3]

uv^T =

[u1v1 u1v2 u1v3]
[u2v1 u2v2 u2v3]
[u3v1 u3v2 u3v3]

Outer products matter because they are the atoms of low-rank matrix structure. Any rank-r matrix can be written as a sum of r outer products.

They also appear directly in AI:

  • dense-layer gradient for one sample: \nabla_W = \delta x^\top
  • LoRA updates as sums of rank-1 terms
  • covariance and second-moment estimates as sums of outer products
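
A minimal NumPy sketch of the first bullet: the single-sample gradient of a dense layer is a rank-1 outer product (shapes and names are illustrative):

import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(4)        # input to the layer
delta = rng.standard_normal(3)    # gradient of the loss w.r.t. the layer's pre-activation output

grad_W = np.outer(delta, x)       # rank-1 gradient, shape (3, 4)
print(np.linalg.matrix_rank(grad_W))   # 1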

5.2 Kronecker Product

If

A \in \mathbb{R}^{m \times n}, \qquad B \in \mathbb{R}^{p \times q},

then the Kronecker product is the block matrix

A \otimes B \in \mathbb{R}^{mp \times nq}

defined by

A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B & \cdots & a_{1n} B \\ a_{21} B & a_{22} B & \cdots & a_{2n} B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} B & a_{m2} B & \cdots & a_{mn} B \end{pmatrix}.

Important properties:

(A \otimes B)(C \otimes D) = (AC) \otimes (BD), \qquad (A \otimes B)^\top = A^\top \otimes B^\top,
\mathrm{rank}(A \otimes B) = \mathrm{rank}(A)\,\mathrm{rank}(B), \qquad \mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B).

The Kronecker product appears whenever a large linear structure decomposes across independent dimensions. In AI it is especially important in second-order optimization approximations such as K-FAC.
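
NumPy provides np.kron, so the mixed-product identity above can be checked directly (a minimal sketch with small random matrices):

import numpy as np

rng = np.random.default_rng(4)
A, C = rng.standard_normal((2, 3)), rng.standard_normal((3, 2))
B, D = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))

lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)
print(np.allclose(lhs, rhs))   # True: (A kron B)(C kron D) = (AC) kron (BD)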

5.3 Khatri-Rao Product

The Khatri-Rao product is the columnwise Kronecker product. If

A = [a_1 \,|\, a_2 \,|\, \cdots \,|\, a_n], \qquad B = [b_1 \,|\, b_2 \,|\, \cdots \,|\, b_n],

with the same number of columns, then

A \odot_{KR} B = [a_1 \otimes b_1 \,|\, a_2 \otimes b_2 \,|\, \cdots \,|\, a_n \otimes b_n].

It is more specialized than the Kronecker product, but it shows up in multilinear algebra, tensor factorization, and certain bilinear model derivations.

5.4 Block Matrix Multiplication

Block matrices let us reason about large matrices hierarchically. If compatible blocks are used, multiplication works exactly like scalar algebra except that scalar multiplication is replaced by matrix multiplication.

For example:

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}.

This is the correct abstraction for tiled GPU GEMM, distributed matrix multiply, tensor parallelism, and block-structured model design.


6. Matrix Inverse

6.1 Definition and Existence

A square matrix A \in \mathbb{R}^{n \times n} is invertible if there exists a matrix A^{-1} such that

AA^{-1} = A^{-1}A = I.

This means A^{-1} undoes the transformation performed by A.

x --A--> y
y --A^{-1}--> x

Applying A and then A^{-1} returns the original vector.

For a square matrix A, the following are equivalent:

  • A is invertible
  • \det(A) \neq 0
  • \mathrm{rank}(A) = n
  • the columns of A are linearly independent
  • the rows of A are linearly independent
  • Ax = 0 has only the trivial solution
  • 0 is not an eigenvalue of A

6.2 Properties of Inverse

If A is invertible, then the inverse is unique. Other core properties are:

(A^{-1})^{-1} = A, \qquad (AB)^{-1} = B^{-1} A^{-1}, \qquad (A^\top)^{-1} = (A^{-1})^\top,
(\alpha A)^{-1} = \frac{1}{\alpha} A^{-1} \quad \text{for } \alpha \neq 0, \qquad \det(A^{-1}) = \frac{1}{\det(A)}.

Again, inverse reverses order just like transpose.

6.3 Explicit Formula for 2x2 Inverse

For

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},

the inverse exists if and only if

ad - bc \neq 0.

When it exists:

A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.

This is worth memorizing for intuition, but it is not how serious matrix inversion is done for large matrices.

6.4 Adjugate Formula General Case

For a general square matrix, one classical formula is

A^{-1} = \frac{\mathrm{adj}(A)}{\det(A)},

where \mathrm{adj}(A) is the adjugate matrix obtained by transposing the cofactor matrix.

This formula is theoretically elegant but computationally poor. It explains structure; it is not the preferred practical algorithm.

6.5 Computing Inverses in Practice

There are several standard ways to compute with inverses or inverse-like effects.

Gaussian elimination

Augment the matrix with the identity:

[A \,|\, I]

and row-reduce until it becomes

[I \,|\, A^{-1}].

This is conceptually direct and leads to O(n^3) cost.

LU-based methods

Factor

A = LU

or, with pivoting,

PA = LU.

Then solve the triangular systems needed to reconstruct the inverse column by column.

But in practice, the most important rule is:

To solve Ax = b, do not compute x = A^{-1}b explicitly unless you have a very unusual reason.

Instead, solve the linear system directly using LU, QR, or Cholesky.

Why?

  • fewer operations
  • better numerical stability
  • better library support
  • avoids forming an object you usually do not need
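
A minimal NumPy sketch of this rule of thumb: both paths give the same answer on a well-conditioned system, but the solver is the recommended pattern (np.linalg.solve uses an LU factorization internally):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

x_solve = np.linalg.solve(A, b)        # preferred: factor and solve
x_inv = np.linalg.inv(A) @ b           # discouraged: forms A^{-1} explicitly
print(np.allclose(x_solve, x_inv))     # True here, but solve is cheaper and more stable in general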

6.6 Numerical Stability and Condition Number

Some matrices are easy to invert numerically. Others are technically invertible but catastrophically unstable in floating-point arithmetic.

The condition number measures this sensitivity:

\kappa(A) = \|A\| \, \|A^{-1}\|.

In the 2-norm,

\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}.

Interpretation:

  • \kappa(A) \approx 1: well-conditioned
  • \kappa(A) \gg 1: ill-conditioned

If A is ill-conditioned, small perturbations in data or rounding error can produce large perturbations in the solution.

This matters directly in AI:

  • poorly conditioned Hessians slow optimization
  • badly scaled feature matrices make regression unstable
  • covariance matrices can become numerically troublesome
  • second-order methods require careful conditioning
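
A minimal NumPy sketch of conditioning in action, using a nearly singular matrix like the one in the exercises (the value of epsilon is illustrative):

import numpy as np

eps = 1e-10
A = np.array([[1.0, 1.0], [1.0, 1.0 + eps]])

print(np.linalg.cond(A))            # roughly 4 / eps, i.e. about 4e10: badly ill-conditioned

b = np.array([2.0, 2.0])
b_perturbed = b + np.array([0.0, 1e-8])   # tiny change in the right-hand side
x1 = np.linalg.solve(A, b)
x2 = np.linalg.solve(A, b_perturbed)
print(np.linalg.norm(x2 - x1))      # large change in the solution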

7. Moore-Penrose Pseudo-Inverse

7.1 Motivation

Ordinary inverse is only defined for square, full-rank matrices. But many practical matrices are rectangular or rank-deficient. So we need a generalized inverse.

The Moore-Penrose pseudo-inverse, written A^+, extends inverse-like behavior to every matrix A \in \mathbb{R}^{m \times n}.

It solves two kinds of problems especially well:

  • least-squares solutions when Ax = b has no exact solution
  • minimum-norm solutions when there are infinitely many exact solutions

7.2 Definition via SVD

Let

A = U \Sigma V^\top

be the SVD of A, where the nonzero singular values are \sigma_1, \ldots, \sigma_r. Then

A^+ = V \Sigma^+ U^\top,

where

\Sigma^+_{ii} = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \neq 0 \\ 0 & \text{if } \sigma_i = 0. \end{cases}

If A is square and invertible, then A^+ = A^{-1}.

7.3 Moore-Penrose Conditions

The pseudo-inverse is characterized uniquely by the four Penrose conditions:

  1. AA^+A = A
  2. A^+AA^+ = A^+
  3. (AA^+)^\top = AA^+
  4. (A^+A)^\top = A^+A

Two especially important consequences are:

  • AA^+ is the orthogonal projector onto \mathrm{col}(A)
  • A^+A is the orthogonal projector onto \mathrm{row}(A)

7.4 Least Squares Solution

Suppose A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m.

If m > n, the system is overdetermined and there is usually no exact solution. The pseudo-inverse returns the least-squares solution:

x^\star = A^+ b.

If A has full column rank, this becomes

x^\star = (A^\top A)^{-1} A^\top b.

This solves

\min_x \|Ax - b\|_2^2.

Geometrically, Ax^\star is the projection of b onto the column space of A.

If m < n, there are typically infinitely many exact solutions. Then A^+ b returns the minimum-norm solution.

Least squares picture

          b
         /|
        / |
       /  | residual
      /   |
-----*----+----------> col(A)
   A x_star

A x_star is the closest reachable vector to b.
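
A minimal NumPy sketch using the overdetermined system from the exercises: np.linalg.lstsq and the pseudo-inverse give the same least-squares solution, and the residual is orthogonal to the columns of A:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_pinv, x_lstsq))        # True

residual = b - A @ x_pinv
print(np.allclose(A.T @ residual, 0.0))    # True: residual is orthogonal to col(A)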

7.5 Pseudo-Inverse Properties

Important identities:

(A^+)^+ = A, \qquad (A^\top)^+ = (A^+)^\top, \qquad (\alpha A)^+ = \frac{1}{\alpha} A^+ \quad \text{for } \alpha \neq 0, \qquad \mathrm{rank}(A^+) = \mathrm{rank}(A).

In general,

(AB)^+ \neq B^+ A^+.

The pseudo-inverse behaves like inverse in some ways, but not all.


8. Matrix Decompositions - Preview

Matrix decompositions factor AA into structured products that expose different kinds of mathematical structure. Each decomposition in this section has a dedicated home elsewhere in the curriculum where it is treated in full - with proofs, algorithms, and worked examples. What follows is a navigational preview: enough context to understand which decomposition applies and why, plus explicit forward links to the full treatment.

Decomposition decision table

| Goal | Use |
| --- | --- |
| Solve Ax = b for general square A | LU (Gaussian elimination) |
| Solve Ax = b for symmetric positive definite A | Cholesky |
| Least squares: \min_x \lVert Ax - b \rVert_2 | QR |
| Eigenvalues and eigenvectors of square A | Eigendecomposition |
| Best rank-k approximation, pseudo-inverse, all cases | SVD |


8.1 Determinant

The determinant \det(A) of a square matrix A \in \mathbb{R}^{n \times n} is a scalar measuring the signed volume-scaling factor of the linear map. When |\det(A)| > 1 the map expands volume; when |\det(A)| < 1 it compresses; when \det(A) = 0 the map is singular - it collapses the full space onto a lower-dimensional subspace, meaning Ax = b has no unique solution.

The determinant is also the product of all eigenvalues and the reciprocal of the product of the eigenvalues of A^{-1}. In practice, explicit determinant computation is rarely done by direct expansion (which costs O(n!)) - instead it is read off from an LU or Cholesky factorization as a product of diagonal entries.

For AI: Log-determinants appear in Gaussian log-likelihoods \log p(x) \propto -\tfrac{1}{2}\log\det\Sigma - \tfrac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu), in the ELBO of variational inference, and in normalizing flow log-likelihoods as the log absolute Jacobian determinant.

-> Full treatment: Determinants


8.2 LU Decomposition

LU factorization writes PA = LU, where P is a permutation (row-reordering) matrix, L is lower triangular with ones on the diagonal, and U is upper triangular. It is the computational form of Gaussian elimination. Once computed in O(n^3), each new right-hand side b can be solved in O(n^2) via forward substitution on Ly = Pb then back substitution on Ux = y.

For AI: LU appears in second-order optimizers (K-FAC, Shampoo), Kalman filter updates, and continuous normalizing flows that require solving stiff ODEs. It is also how numpy.linalg.solve and scipy.linalg.solve are implemented internally.

-> Full treatment: Matrix Decompositions


8.3 QR Decomposition

QR factorization writes A = QR where Q \in \mathbb{R}^{m \times m} is orthogonal (Q^\top Q = I) and R \in \mathbb{R}^{m \times n} is upper triangular. It is numerically more stable than LU for least-squares problems because Q preserves norms. The main algorithms are Gram-Schmidt orthogonalization (conceptually clear), Householder reflectors (numerically stable, standard in practice), and Givens rotations (good for sparse matrices).

For AI: Orthogonal weight initialization (Saxe et al. 2013) uses QR. The QR algorithm is the standard method for computing all eigenvalues of a dense matrix. Weight orthogonalization regularizers (Brock et al.) apply QR repeatedly during training.

-> Full treatment: Matrix Decompositions


8.4 Eigendecomposition

For a diagonalizable matrix A \in \mathbb{R}^{n \times n}, the eigendecomposition is A = V\Lambda V^{-1}, where the columns of V are eigenvectors and \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n) holds the eigenvalues. It reveals the invariant directions of the linear map and makes matrix powers trivial: A^k = V\Lambda^k V^{-1}.

Not every square matrix is diagonalizable. Defective matrices (with repeated eigenvalues and insufficient eigenvectors) require the Jordan normal form. For symmetric real matrices, the spectral theorem guarantees real eigenvalues and an orthonormal eigenbasis, making V^{-1} = V^\top and giving A = V\Lambda V^\top.

For AI: Eigenspectrum analysis of the Hessian explains sharpness and flat minima. Adam's diagonal preconditioning approximates the Fisher information matrix eigenspectrum. Vanishing and exploding gradients in RNNs are a direct consequence of the dominant eigenvalue of the recurrence matrix exceeding or falling below 1.

-> Full treatment: Eigenvalues and Eigenvectors


8.5 Singular Value Decomposition

The SVD is the universal matrix factorization: for any A \in \mathbb{R}^{m \times n},

A = U \Sigma V^\top

where U \in \mathbb{R}^{m \times m} and V \in \mathbb{R}^{n \times n} are orthogonal, and \Sigma \in \mathbb{R}^{m \times n} is diagonal with non-negative singular values \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0. Unlike eigendecomposition, SVD exists for every matrix - rectangular, rank-deficient, or otherwise.

Key consequences: the best rank-k approximation is A_k = U_k \Sigma_k V_k^\top (Eckart-Young theorem); the pseudo-inverse is A^+ = V \Sigma^+ U^\top; the condition number is \sigma_1 / \sigma_{\min}; and the four fundamental subspaces (column space, null space, row space, left null space) are read directly from U and V.
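
A minimal NumPy sketch of the Eckart-Young statement: truncating the SVD at rank k gives the closest rank-k matrix in Frobenius norm (sizes are illustrative):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation

print(np.linalg.matrix_rank(A_k))                                           # 2
print(np.allclose(np.linalg.norm(A - A_k), np.sqrt(np.sum(s[k:] ** 2))))    # error = sqrt of discarded sigma^2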

For AI: LoRA decomposes parameter updates as \Delta W = BA with B \in \mathbb{R}^{m \times r}, A \in \mathbb{R}^{r \times n}, r \ll \min(m, n) - a direct consequence of the SVD low-rank viewpoint. Mechanistic interpretability uses SVD to decompose attention OV circuits into rank-1 components. Randomised SVD (Halko et al. 2011) makes low-rank approximation tractable for billion-parameter weight matrices.

-> Full treatment: Singular Value Decomposition


8.6 Cholesky Decomposition

For a symmetric positive definite (SPD) matrix A, Cholesky gives A = LL^\top where L is lower triangular with positive diagonal. It is approximately twice as fast as LU for SPD systems (exploiting symmetry), requires no pivoting, and serves as a numerical test of positive definiteness: Cholesky fails if and only if A is not SPD.

For AI: Gaussian process inference, Bayesian neural network covariance updates, Kalman filter covariance propagation, and second-order optimizer preconditioning (natural gradient methods) all require efficient SPD linear solves - Cholesky is the standard tool. It also appears in sampling from multivariate Gaussians: to draw x \sim \mathcal{N}(\mu, \Sigma), compute L = \mathrm{chol}(\Sigma) then return \mu + Lz where z \sim \mathcal{N}(0, I).
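
A minimal NumPy sketch of that sampling recipe (a small hand-picked SPD covariance; np.linalg.cholesky returns the lower-triangular factor):

import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite

L = np.linalg.cholesky(Sigma)                 # Sigma = L @ L.T
z = rng.standard_normal((2, 100_000))         # standard normal samples
x = mu[:, None] + L @ z                       # samples from N(mu, Sigma)

print(np.cov(x))                              # close to Sigma
print(x.mean(axis=1))                         # close to mu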

-> Full treatment: Matrix Decompositions

9. Common Mistakes

| Mistake | Why it is wrong | Better rule |
| --- | --- | --- |
| "Matrix multiplication is commutative." | In general AB \neq BA, and sometimes only one product is even defined. | Track order carefully. |
| "AB = 0 means one factor must be zero." | Matrices have zero divisors. | Zero product does not imply zero factor. |
| "(AB)^{-1} = A^{-1}B^{-1}." | Inverse reverses order. | (AB)^{-1} = B^{-1}A^{-1}. |
| "If A^2 = 0, then A = 0." | Nilpotent matrices exist. | Matrix powers do not behave like scalar powers. |
| "Product of symmetric matrices is symmetric." | (AB)^\top = BA, which equals AB only if A and B commute. | AB is symmetric iff AB = BA when both are symmetric. |
| "Every square matrix has an inverse." | A square matrix can still be singular. | Check rank, determinant, or null space first. |
| "To solve Ax = b, compute A^{-1}b." | Explicit inversion is usually slower and less stable than solving directly. | Use a linear solver or factorization. |
| "Row rank and column rank may differ." | They are always equal. | Rank is one number, not two competing ones. |
| "Every square matrix has an eigendecomposition." | Some matrices are defective and not diagonalizable. | SVD always exists; eigendecomposition does not. |
| "Hadamard and matrix product are basically the same." | One is entrywise, one contracts over an index. | Treat them as different operations. |

10. Exercises

These exercises are designed to force both algebraic fluency and computational interpretation.

  1. Matrix arithmetic and properties
    Let
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.

Compute:

  • (a) AB and BA, and verify they differ
  • (b) (AB)^\top and B^\top A^\top
  • (c) A^{-1} using the 2 \times 2 formula
  • (d) \det(A), \det(B), and \det(AB)
  • (e) \mathrm{tr}(AB) and \mathrm{tr}(BA)
  2. Shape compatibility and broadcasting. Let A \in \mathbb{R}^{3 \times 4}, B \in \mathbb{R}^{4 \times 2}, C \in \mathbb{R}^{3 \times 2}, u \in \mathbb{R}^4, v \in \mathbb{R}^3.

    • (a) which of AB, BA, AC, CA, BC, CB are defined? State each resulting shape.
    • (b) compute the shape of (AB)^\top and verify it equals B^\top A^\top
    • (c) is A^\top v defined? What about Au?
    • (d) compute Au - what kind of object is the result?
    • (e) write AB as a sum of 4 rank-1 outer products (column-of-A times row-of-B)
  3. Trace and inner-product identities. Let A, B \in \mathbb{R}^{n \times n}.

    • (a) prove \mathrm{tr}(AB) = \mathrm{tr}(BA) from the definition \mathrm{tr}(M) = \sum_i M_{ii}
    • (b) show that \mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij} (the Frobenius inner product)
    • (c) use (b) to write \|A\|_F^2 = \mathrm{tr}(A^\top A)
    • (d) for A = \begin{pmatrix}1&2\\3&4\end{pmatrix}, B = \begin{pmatrix}0&1\\1&0\end{pmatrix}, verify \mathrm{tr}(AB) = \mathrm{tr}(BA) numerically
    • (e) is \mathrm{tr}(ABC) = \mathrm{tr}(BAC) in general? Prove or give a counterexample.
  4. Block matrix multiply and the outer-product form. Partition A \in \mathbb{R}^{4 \times 4} and B \in \mathbb{R}^{4 \times 4} as 2 \times 2 blocks of size 2 \times 2.

    • (a) write the block matrix product formula (AB)_{IJ} = \sum_K A_{IK} B_{KJ}
    • (b) for A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} with A_{11} = I_2, A_{12} = 0, A_{21} = 0, A_{22} = 2I_2 and B = I_4, compute AB by block multiply and verify it equals A
    • (c) express a rank-2 matrix M = u_1 v_1^\top + u_2 v_2^\top as a matrix product [u_1\ u_2][v_1\ v_2]^\top and state the shapes
    • (d) explain why every rank-r matrix can be written as a product of an m \times r matrix and an r \times n matrix
    • (e) for a LoRA update \Delta W = BA with B \in \mathbb{R}^{512 \times 8} and A \in \mathbb{R}^{8 \times 512}, compute the parameter count versus full W \in \mathbb{R}^{512 \times 512}
  5. Pseudo-inverse and least squares
    Let

A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.

Solve:

  • (a) show why the system is overdetermined
  • (b) compute (A^\top A)^{-1} A^\top b
  • (c) compute the residual norm
  • (d) verify the pseudo-inverse formula using SVD
  • (e) explain why AA^+ is a projection matrix
  6. Attention as matrix operations
    Let Q, K, V \in \mathbb{R}^{4 \times 2} for a single attention head.

    • (a) write down the shapes in S = QK^\top / \sqrt{2}
    • (b) explain why S is 4 \times 4
    • (c) apply row-wise softmax conceptually to obtain attention weights
    • (d) compute the output shape of AV
    • (e) explain why each row of the attention matrix sums to 1
    • (f) if Q = K, determine whether S is symmetric and whether the softmaxed matrix must still be symmetric
  7. Conditioning and numerical stability. Let A = \begin{pmatrix}1 & 1 \\ 1 & 1+\epsilon\end{pmatrix} for small \epsilon > 0.

    • (a) compute \det(A) symbolically as a function of \epsilon
    • (b) compute A^{-1} explicitly using the 2 \times 2 inverse formula
    • (c) compute the condition number \kappa(A) = \|A\|\,\|A^{-1}\| using the spectral norm (largest singular value) - describe what happens as \epsilon \to 0
    • (d) explain why a large condition number means that a small perturbation \delta b in b causes a large perturbation in the solution x of Ax = b
    • (e) for \epsilon = 10^{-8}, estimate the number of digits of precision lost in the solution relative to 64-bit floating point (which has \approx 16 significant digits)
  8. LoRA and low-rank structure
    Let W \in \mathbb{R}^{512 \times 512} and let a LoRA update use rank r = 8.

    • (a) compute the number of parameters in full W
    • (b) compute the number of parameters in a rank-8 factorization
    • (c) express the update as a sum of outer products
    • (d) interpret a singular-value spectrum such as [5.2, 4.8, 0.1, 0.05, 0.02, 0.01, 0.003, 0.001]
    • (e) explain how truncated SVD could compress the learned update further

11. Why This Matters for AI

| Aspect | Why matrix operations matter |
| --- | --- |
| Every forward pass | Transformer layers are dominated by matrix multiplies: projections, attention scores, attention outputs, and feed-forward blocks. |
| Hardware efficiency | GPU and accelerator utilization is largely a GEMM scheduling problem. Batching, tiling, and layout determine whether tensor units are actually used well. |
| Backpropagation | Gradient formulas are matrix operations with transposes, outer products, and reductions. Reverse-order transpose rules are built directly into reverse-mode differentiation. |
| Low-rank adaptation | LoRA is matrix factorization applied to parameter updates. Its practicality is a direct consequence of low-rank structure and the SVD viewpoint. |
| MLA-style attention | DeepSeek-V2 reports that Multi-head Latent Attention compresses the KV cache heavily and increases generation throughput by caching a low-rank latent representation instead of full key-value states. |
| Numerical stability | Condition number, decomposition choice, and matrix norms affect whether optimization remains stable or blows up. |
| Quantization | Post-training quantization, compensation, and calibration all rely on matrix norms, least squares, and low-rank corrections. |
| Optimizers | Methods such as K-FAC and Shampoo are explicitly matrix-based, using structured curvature approximations, matrix roots, or Kronecker factors. |
| Interpretability | SVD and eigenspectral analysis of weight matrices, OV circuits, and activations expose dominant directions and effective rank. |
| Inference latency | Training is GEMM-heavy and compute-dense, while single-sequence decoding often behaves more like bandwidth-limited GEMV. Understanding the matrix regime explains the performance difference. |

The short version is that matrix operations are not a chapter you pass through on the way to AI. They are the operational core of AI itself.


12. Conceptual Bridge

Vectors and spaces gave us the geometry of linear algebra. Matrix operations gave us the computational rules for acting on that geometry. Every abstract idea from the previous chapter now has an executable form:

  • subspaces become column spaces, row spaces, null spaces, and projections
  • orthogonality becomes QR and Cholesky
  • low-dimensional approximation becomes SVD truncation
  • invertibility becomes solving linear systems
  • composition becomes matrix multiplication

This is the bridge to the next topics:

vectors and spaces
    -> matrix operations
    -> spectral structure and decompositions
    -> probability and statistics
    -> optimization and dynamics
    -> deep learning systems

If this chapter leaves you with one durable mental model, let it be this:

neural networks are not mysterious collections of tensor APIs; they are structured compositions of matrix operations, and the quality, speed, and stability of those operations determine what the model can learn and how efficiently it can run.


References

Core Linear Algebra and Numerical Linear Algebra

Transformers, Attention, and Low-Rank Adaptation

Large-Model Systems and Hardware

