"A model is useful only when its constraints can be satisfied. Systems of equations are the language in which those constraints become explicit, and solving them is the point where mathematical structure turns into actual computation."
Overview
A single equation imposes a single condition. A system of equations imposes many conditions at once. That shift from "one condition" to "all of these conditions simultaneously" is the core step from elementary algebra to modern computational mathematics. It is also the step that makes linear algebra operational in AI.
In geometric language, each equation cuts down the set of allowable points. In algebraic language, each equation adds a constraint. In computational language, the entire problem becomes: find the variable assignment that satisfies the full constraint set, or determine that no such assignment exists. Everything that follows in numerical linear algebra, optimization, probabilistic inference, and deep learning depends on this viewpoint.
For machine learning, systems of equations appear everywhere:
- linear regression solves the normal equations
- second-order optimization solves Hessian systems
- Gaussian processes solve large symmetric positive definite systems
- constrained optimization solves KKT systems
- deep equilibrium models solve nonlinear fixed-point systems
- training itself searches for solutions to the gradient equations
This chapter therefore treats systems of equations in three linked ways:
- as geometric intersections of constraints
- as algebraic objects described by rank, null spaces, and consistency
- as computational problems requiring direct or iterative solvers
The linear case is the foundation because it is both completely understood and universally useful. The nonlinear case matters because every realistic optimization problem eventually becomes nonlinear. Together they explain why solving equations is central to both classical scientific computing and modern AI.
Prerequisites
- Basic algebra and multivariable notation
- Familiarity with vectors, matrices, rank, null spaces, and subspaces
- Comfort with matrix multiplication and transpose
- Basic understanding of norms and least squares
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive examples for row reduction, least squares, conditioning, and iterative solvers |
| exercises.ipynb | Guided practice on consistency, Gaussian elimination, least squares, Newton updates, and AI-flavored systems |
Learning Objectives
After completing this chapter, you should be able to:
- Translate a system of equations into matrix form
- Diagnose whether a system has zero, one, or infinitely many solutions
- Use rank conditions to characterize consistency and uniqueness
- Perform Gaussian elimination and interpret REF / RREF correctly
- Separate homogeneous and inhomogeneous solution structure
- Understand least squares as projection onto a column space
- Distinguish direct solvers from iterative solvers and know when each is appropriate
- Explain Newton's method as repeated solution of linearized systems
- Interpret constrained optimization through KKT systems and saddle-point structure
- Connect all of the above directly to linear regression, attention, Gaussian processes, DEQs, and second-order optimization
Table of Contents
- Systems of Equations
- Overview
- Prerequisites
- Companion Notebooks
- Learning Objectives
- Table of Contents
- 1. Intuition
- 2. Linear Systems - Formulation
- 3. Gaussian Elimination and Row Reduction
- 4. Existence, Uniqueness, and the Rank Conditions
- 5. Special Linear Systems
- 6. Least Squares Problems
- 7. Iterative Methods for Linear Systems
- 8. Nonlinear Systems of Equations
- 9. Constrained Systems and Lagrangians
- 10. Systems Arising Directly in AI
- 11. Overdetermined and Underdetermined Systems in Practice
- 12. Numerical Methods and Stability
- 13. Systems of Equations in Optimisation
- 14. Common Mistakes
- 15. Exercises
- 16. Why This Matters for AI (2026 Perspective)
- 17. Conceptual Bridge
- References
1. Intuition
1.1 What Are Systems of Equations?
A system of equations is a collection of two or more equations that must all be satisfied at the same time. The word "simultaneously" matters more than anything else. One equation restricts the variables somewhat. Many equations restrict them together, and the solution is whatever survives all of those restrictions at once.
For example, consider two linear equations in two variables, say $x + y = 3$ and $x - y = 1$. The first equation alone allows infinitely many points, namely every point on a line. The second equation also allows infinitely many points, again a line. Solving the system means finding the point or points that lie on both lines at the same time.
This generalises immediately:
- one equation in two unknowns gives a curve or a line in $\mathbb{R}^2$
- one equation in three unknowns gives a surface or a plane in $\mathbb{R}^3$
- one linear equation in $n$ unknowns gives a hyperplane in $\mathbb{R}^n$
- a full system gives the intersection of all those constraint sets
Linear systems are special because each equation has degree $1$. Their geometry is flat, their algebra is complete, and their computational methods are highly developed. Nonlinear systems are harder because the constraints can curve, fold, branch, or fail to intersect in complicated ways.
In AI, both matter:
- linear systems appear directly in least squares, Gaussian processes, Kalman filters, and second-order methods
- nonlinear systems appear in optimization, fixed-point models, equilibrium conditions, KKT systems, and PDE-constrained learning
1.2 Why Systems of Equations Are Central to AI
Many AI computations that seem unrelated are really system-solving problems in disguise.
Least squares and linear regression
If we want to minimize

$$\min_x \|Ax - b\|_2^2,$$

then the minimizer satisfies the normal equations

$$A^\top A x = A^\top b.$$

So even the most basic linear regression problem reduces to a linear system solve.
Optimization
At a stationary point of a loss $L(\theta)$,

$$\nabla L(\theta) = 0.$$

This is a system of $n$ equations in $n$ unknowns when $\theta \in \mathbb{R}^n$. Newton's method solves this nonlinear system by repeatedly solving linear systems involving the Hessian.
Probabilistic inference
Gaussian processes, Kalman filters, and many Bayesian update formulas require solving systems such as

$$K \alpha = y,$$

where $K$ is a covariance or kernel matrix. The main computational bottleneck is often not "doing Bayesian inference" in the abstract, but solving this linear system accurately and efficiently.
Attention and normalization
Softmax enforces constraints: non-negativity and sum-to-one. That is not a linear system, but it is still a simultaneous constraint-satisfaction problem at every query position.
Constrained generation and RLHF
Grammar-constrained decoding, schema following, and KL-constrained policy updates are all system-like problems. They involve satisfying many conditions at once, exactly the mentality of equation systems and constrained optimization.
1.3 The Geometric Picture
The geometry is the fastest way to build intuition.
One linear equation in two variables,

$$a_1 x_1 + a_2 x_2 = b,$$

defines a line in $\mathbb{R}^2$.

One linear equation in three variables,

$$a_1 x_1 + a_2 x_2 + a_3 x_3 = b,$$

defines a plane in $\mathbb{R}^3$.

In $\mathbb{R}^n$, one equation

$$a_1 x_1 + \cdots + a_n x_n = b$$

defines a hyperplane, which is an affine subspace of codimension $1$.
One equation cuts space once.
R^2: line
R^3: plane
R^n: hyperplane
Many equations = many cuts
Solution = common intersection
The three basic outcomes are:
TWO-LINE PICTURE IN R^2
1. Unique solution
\ /
\/
/\
/ \
2. No solution
-----
-----
3. Infinitely many solutions
=====
=====
The same logic scales to hyperplanes in higher dimension.
This same geometry survives in abstract form in higher dimensions:
- unique solution means the constraints cut down to a single point
- infinitely many solutions means the constraints leave a whole direction or family of directions unconstrained
- no solution means the constraints are incompatible
1.4 Linear vs Nonlinear Systems
The distinction between linear and nonlinear systems is structural, not cosmetic.
LINEAR SYSTEMS NONLINEAR SYSTEMS
-------------------------------------------------------------
Constraints: hyperplanes Constraints: curves/surfaces/manifolds
Superposition: yes Superposition: no
Theory: complete Theory: problem-dependent
Algorithms: elimination, QR, Algorithms: Newton, fixed-point,
Cholesky, CG homotopy, continuation
Solution set: affine subspace Solution set: isolated roots,
or empty branches, manifolds, empty
Linearity means two things happen:
- the geometry is flat
- the algebra closes under addition and scalar multiplication in a controlled way
That is why the linear case admits the clean rank-based theory:
- consistency is determined by column space membership
- uniqueness is determined by null-space triviality
- general solutions are particular solution plus null-space directions
None of this survives in the same complete form for nonlinear systems. But nonlinear methods almost always depend on linearization, which is why linear systems remain the local language of almost all nonlinear computation.
1.5 The Three Cases and Their Geometry
For a linear system $Ax = b$, there are only three possible outcomes.
Case 1: unique solution
- the system is consistent
- the constraints are independent enough to determine every variable
- algebraically, the null space is trivial: $N(A) = \{0\}$
- when $A$ is square, this means $A$ is invertible
Case 2: infinitely many solutions
- the system is consistent
- but there are free directions left
- algebraically, $N(A)$ is nontrivial
- the full solution set is

$$\{\, x_p + x_h : x_h \in N(A) \,\},$$

so it is an affine subspace
Case 3: no solution
- the right-hand side is incompatible with the column space of $A$
- equivalently, $\operatorname{rank}([A \mid b]) > \operatorname{rank}(A)$
- geometrically, the hyperplanes do not all meet
SOLUTION SET SHAPE
unique solution infinite solutions no solution
point affine family empty set
o o----o----o (none)
| | |
line / plane / etc.
The entire chapter is really about characterising these three cases precisely, then solving them efficiently.
1.6 Historical Timeline
- Babylonian mathematics already solved small systems by elimination-like reasoning in commercial and geometric contexts.
- The Chinese Nine Chapters on the Mathematical Art described the fangcheng method, a tabular elimination scheme that is recognisably proto-Gaussian elimination.
- Cramer introduced determinant-based formulas for square systems in the 18th century.
- Gauss systematised elimination and least squares in astronomical computation.
- Jordan refined the full reduction process now called Gauss-Jordan elimination.
- The 19th century matrix viewpoint, due to Cayley, Sylvester, Frobenius, and others, turned system solving into matrix algebra.
- The 20th century added numerical stability, pivoting, factorisation methods, and large-scale iterative solvers.
- Modern AI inherits all of this: dense solvers for medium-scale exact problems, sparse / iterative solvers for large systems, and structured approximations for trillion-parameter optimization.
The historical pattern is important: the subject developed first from geometry and astronomy, then from algebra, then from numerical analysis, and now from large-scale machine computation. AI sits in the fourth stage, but it still depends on the first three.
2. Linear Systems - Formulation
2.1 Standard Form
A system of $m$ linear equations in $n$ unknowns has the form

$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= b_1 \\ &\;\;\vdots \\ a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n &= b_m \end{aligned}$$

The compact matrix form is

$$Ax = b,$$

where

- $A \in \mathbb{R}^{m \times n}$ is the coefficient matrix
- $x \in \mathbb{R}^n$ is the unknown vector
- $b \in \mathbb{R}^m$ is the right-hand side
Ax = b
m x n n x 1 m x 1
[coeffs] [unknowns] = [targets]
This notation compresses the entire system into one object. It also reveals the key question immediately: is $b$ in the column space of $A$?

If yes, the system is consistent. If no, it is not.
2.2 Augmented Matrix
The augmented matrix packages the coefficients and the right-hand side together:

$$[A \mid b].$$

Why do this? Because row operations affect the equations and the right-hand side at the same time. Working with $[A \mid b]$ keeps the bookkeeping honest.
[A | b]
coefficient columns | right-hand side
-------------------------------------
a11 a12 ... a1n | b1
a21 a22 ... a2n | b2
... ... ... ... | ...
am1 am2 ... amn | bm
In practice:
- row-reduce $[A \mid b]$
- inspect the pivots and any contradiction rows
- read off the solution type directly
2.3 Existence and Uniqueness - The Fundamental Criterion
The basic theorem is the Rouché-Capelli criterion:

- the system is consistent iff $\operatorname{rank}(A) = \operatorname{rank}([A \mid b])$
- the system has a unique solution iff $\operatorname{rank}(A) = \operatorname{rank}([A \mid b]) = n$
- the system has infinitely many solutions iff $\operatorname{rank}(A) = \operatorname{rank}([A \mid b]) < n$
- the system is inconsistent iff $\operatorname{rank}(A) < \operatorname{rank}([A \mid b])$
This is the complete classification.
Why it works:
- $\operatorname{rank}(A)$ counts how many independent coefficient constraints there are
- $\operatorname{rank}([A \mid b])$ counts how many independent constraints remain after including the targets
- if the augmented rank is larger, then $b$ introduced an extra independent demand that the columns of $A$ cannot satisfy
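The criterion above translates directly into code. The following NumPy sketch (the function name `classify_system` is ours, not a library API) classifies a system by comparing the two ranks:

```python
import numpy as np

def classify_system(A, b, tol=1e-10):
    """Classify Ax = b via the Rouche-Capelli criterion.
    Returns 'unique', 'infinite', or 'inconsistent'."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    r = np.linalg.matrix_rank(A, tol=tol)
    r_aug = np.linalg.matrix_rank(np.hstack([A, b]), tol=tol)
    if r_aug > r:
        return "inconsistent"          # b adds an unsatisfiable demand
    return "unique" if r == A.shape[1] else "infinite"

# Two independent lines meeting in one point:
print(classify_system([[1, 1], [1, -1]], [3, 1]))   # unique
# Second equation repeats the first:
print(classify_system([[1, 1], [2, 2]], [3, 6]))    # infinite
# Parallel lines with different offsets:
print(classify_system([[1, 1], [1, 1]], [3, 5]))    # inconsistent
```

Note that numerical rank requires a tolerance; in floating point, "rank" is always a judgment about which singular values count as zero.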
2.4 Homogeneous vs Inhomogeneous
A system is homogeneous if $b = 0$:

$$Ax = 0.$$

It is inhomogeneous if $b \neq 0$:

$$Ax = b.$$

The homogeneous case is special because it is always consistent: $x = 0$ always works.

Its solution set is exactly the null space:

$$N(A) = \{\, x \in \mathbb{R}^n : Ax = 0 \,\}.$$

This is a subspace, not merely an affine set. Its dimension is

$$\dim N(A) = n - \operatorname{rank}(A)$$

by rank-nullity.

The inhomogeneous case is built from the homogeneous one. If $x_p$ is any particular solution to $Ax = b$, then every solution has the form

$$x = x_p + x_h, \qquad x_h \in N(A).$$
This is one of the most important formulas in the chapter. It says:
general solution
= one concrete solution
+ every homogeneous direction that changes nothing
So the homogeneous system describes the freedom left over after one particular solution is found.
2.5 Types by Shape of A
The shape of $A$ already tells you a lot.

Square system: $m = n$

- same number of equations as unknowns
- generically unique when full rank
- singular if rank drops

Overdetermined system: $m > n$

- more equations than unknowns
- generically inconsistent
- solved approximately by least squares

Underdetermined system: $m < n$

- fewer equations than unknowns
- generically infinitely many solutions if consistent
- often choose the minimum-norm solution

Exactly determined full-rank square system: $m = n$, $\operatorname{rank}(A) = n$

- the classical "nice" case
- unique exact solution
- direct methods like LU are natural
In ML, overdetermined and underdetermined cases are at least as important as square ones:
- regression and curve fitting are overdetermined
- overparameterized learning is underdetermined
- inverse problems are often ill-posed even when square
3. Gaussian Elimination and Row Reduction
3.1 Elementary Row Operations
There are three row operations that preserve the solution set:
- swap two rows: $R_i \leftrightarrow R_j$
- scale a row by a non-zero scalar: $R_i \to c R_i$, $c \neq 0$
- replace a row with itself plus a multiple of another row: $R_i \to R_i + c R_j$
These are not arbitrary tricks. Each corresponds to a legitimate equation manipulation:
- swapping changes equation order
- scaling keeps the same solution set because dividing back is possible
- adding one equation to another creates an equivalent equation
Each elementary row operation can also be represented as left multiplication by an elementary matrix. That matters later because elimination is really matrix multiplication in disguise.
3.2 Row Echelon Form REF
A matrix is in row echelon form if:
- all zero rows are at the bottom
- each pivot is to the right of the pivot above it
- everything below each pivot is zero
Example:

$$\begin{bmatrix} 1 & 2 & 0 & 3 \\ 0 & 0 & 4 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

is in REF.
The pivots tell you which variables are constrained directly. The non-pivot columns are free-variable columns.
Pivot columns -> determined variables
Free columns -> free variables / parameters
Forward elimination always produces an REF. It is not unique, but the pivot structure encodes the rank and the free-variable count.
3.3 Reduced Row Echelon Form RREF
RREF imposes two extra conditions:
- each pivot equals $1$
- each pivot is the only non-zero entry in its column
Example:

$$\begin{bmatrix} 1 & 2 & 0 & 3 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
This form is stronger than REF and, crucially, unique for a given matrix.
That uniqueness makes RREF a conceptual gold standard: once you have it, the solution structure can be read off immediately. In practice, though, solving via REF plus back substitution is usually cheaper than carrying the full matrix all the way to RREF.
3.4 Gaussian Elimination Algorithm
Gaussian elimination has two conceptual stages.
Forward elimination
- choose a pivot column
- select a pivot row
- eliminate entries below the pivot
- move to the next submatrix
Back substitution
- solve the last pivot variable first
- substitute upward one row at a time
FORWARD ELIMINATION
* * * *
* * * * -> * * * *
* * * * 0 * * *
* * * * 0 0 * *
0 0 0 *
BACK SUBSTITUTION
solve bottom row first, then move upward
For an $n \times n$ dense system:

- forward elimination costs on the order of $\tfrac{2}{3} n^3$ arithmetic operations
- back substitution costs on the order of $n^2$
- total cost is therefore $O(n^3)$
This cubic cost is why direct dense methods become expensive at scale, and why iterative methods matter for huge sparse systems.
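The two stages can be written out compactly. Below is a minimal NumPy sketch (the function name `gaussian_solve` is ours) of forward elimination with partial pivoting followed by back substitution; production code would instead call a library routine such as `numpy.linalg.solve`:

```python
import numpy as np

def gaussian_solve(A, b):
    """Solve Ax = b for square A: forward elimination with partial
    pivoting, then back substitution."""
    A = np.asarray(A, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(b)
    for k in range(n):                          # forward elimination
        p = k + np.argmax(np.abs(A[k:, k]))     # partial pivoting: largest pivot
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]               # multiplier for row i
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # back substitution, bottom up
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)
print(np.allclose(gaussian_solve(A, b), np.linalg.solve(A, b)))   # True
```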
3.5 Partial and Complete Pivoting
In exact arithmetic, any non-zero pivot is fine. In floating-point arithmetic, that is false. Choosing a tiny pivot can amplify rounding error badly.
That is why practical elimination uses pivoting.
No pivoting
- simplest conceptually
- dangerous numerically
- may divide by a very small number and create huge multipliers
Partial pivoting
- in the current column, swap in the row with the largest absolute pivot candidate
- this is the default in most dense direct solvers
- it gives strong practical stability at low extra cost
Complete pivoting
- search over the remaining submatrix, not just the current column
- swap both rows and columns
- slightly more stable in theory, but more expensive and less common in practice
PIVOTING IDEA
Bad pivot:
0.000001 * * divide by tiny number -> big multipliers -> error growth
5 * *
7 * *
After partial pivoting:
7 * *
5 * *
0.000001 * *
For AI, pivoting matters whenever system solves occur inside a larger algorithm:
- Gaussian process regression
- second-order optimization
- implicit differentiation
- structured probabilistic inference
It is easy to think "the math says invertible, so we are safe." Numerical analysis says otherwise. Invertible is not the same as well-conditioned.
3.6 Reading Solutions from RREF
Once $[A \mid b]$ is in RREF, the solution set can be read directly.
Inconsistency
If a row becomes

$$[\, 0 \;\; 0 \;\; \cdots \;\; 0 \mid c \,], \qquad c \neq 0,$$

then the system says $0 = c$, which is impossible. No solution exists.
Pivot variables
Variables corresponding to pivot columns are determined by the equations.
Free variables
Variables corresponding to non-pivot columns can be chosen freely. They become parameters.
The general solution is then

$$x = x_p + t_1 v_1 + \cdots + t_k v_k,$$

where

- $x_p$ is one particular solution
- $v_1, \dots, v_k$ form a basis for $N(A)$
This is not just a computational trick. It is the structural theorem of linear systems expressed in coordinates.
3.7 Worked Example - Underdetermined System
Consider

$$\begin{aligned} x_1 + x_2 + x_3 + x_4 &= 4 \\ x_1 + 2x_2 + x_3 + 2x_4 &= 6 \end{aligned}$$

Its augmented matrix is

$$\left[\begin{array}{cccc|c} 1 & 1 & 1 & 1 & 4 \\ 1 & 2 & 1 & 2 & 6 \end{array}\right]$$

Apply

$$R_2 \to R_2 - R_1$$

to get

$$\left[\begin{array}{cccc|c} 1 & 1 & 1 & 1 & 4 \\ 0 & 1 & 0 & 1 & 2 \end{array}\right]$$

Then

$$R_1 \to R_1 - R_2$$

gives

$$\left[\begin{array}{cccc|c} 1 & 0 & 1 & 0 & 2 \\ 0 & 1 & 0 & 1 & 2 \end{array}\right]$$

So the pivot columns are $1$ and $2$, while columns $3$ and $4$ are free.

Let

$$x_3 = s, \qquad x_4 = t.$$

Then row $1$ gives

$$x_1 = 2 - s,$$

and row $2$ gives

$$x_2 = 2 - t.$$

Therefore

$$x = \begin{bmatrix} 2 \\ 2 \\ 0 \\ 0 \end{bmatrix} + s \begin{bmatrix} -1 \\ 0 \\ 1 \\ 0 \end{bmatrix} + t \begin{bmatrix} 0 \\ -1 \\ 0 \\ 1 \end{bmatrix}$$
This displays the exact structure:
- one particular solution
- plus a $2$-dimensional null-space family
That is the canonical shape of a consistent underdetermined linear system.
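This kind of row reduction can be checked mechanically. A short SymPy sketch (the coefficients below are chosen for illustration) row-reduces the augmented matrix of a small underdetermined system and reports the pivot columns:

```python
from sympy import Matrix

# Augmented matrix of a 2-equation, 4-unknown system:
#   x1 +  x2 + x3 +  x4 = 4
#   x1 + 2x2 + x3 + 2x4 = 6
M = Matrix([[1, 1, 1, 1, 4],
            [1, 2, 1, 2, 6]])

R, pivots = M.rref()
print(R)        # the (unique) RREF of the augmented matrix
print(pivots)   # (0, 1): pivots in columns 1 and 2, so x3 and x4 are free
```

SymPy works in exact rational arithmetic, so `rref()` is reliable for small hand-checkable examples; for large floating-point matrices, rank decisions should go through the SVD instead.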
4. Existence, Uniqueness, and the Rank Conditions
4.1 Rank and Solution Structure - Complete Characterisation
For $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank}(A) = r$, the possibilities can be summarised as follows.

| rank(A) | rank([A \| b]) | relation to n | solution set |
|---|---|---|---|
| $r$ | $r$ | $r = n$ (full column rank) | unique solution |
| $r$ | $r$ | $r < n$ (consistent but rank-deficient) | infinitely many solutions |
| $r$ | $> r$ (augmented rank larger) | any | no solution |
More explicitly:
- if $r = n$ and the system is consistent, then there are no free variables
- if $r < n$ and the system is consistent, then there are $n - r$ free variables
- if $\operatorname{rank}([A \mid b]) > \operatorname{rank}(A)$, then the system is inconsistent
This is complete. There are no other cases.
4.2 Geometric Interpretation of Each Case
It is useful to combine algebra and geometry explicitly.
Square full-rank system
- $m = n$, $\operatorname{rank}(A) = n$
- hyperplanes meet in a single point
- algebraically: no null directions remain
Consistent overdetermined system
- possible only if extra equations are redundant or implied
- geometrically: many hyperplanes still share the same point
Generic overdetermined system
- more constraints than degrees of freedom
- usually inconsistent
- motivates least squares rather than exact solving
Underdetermined system
- too few independent constraints to determine every variable
- solutions form an affine family
- dimension of the family is $n - r$
DIMENSION COUNT
n unknown directions
- r independent constraints
--------------------------
n-r remaining free directions
This count is one of the simplest and most useful geometric heuristics in all of linear algebra.
4.3 The Fundamental Theorem of Linear Algebra Applied to Systems
The four fundamental subspaces of $A$ give the cleanest picture of when $Ax = b$ is solvable:

- $\operatorname{col}(A)$ - the set of achievable right-hand sides
- $N(A)$ - the set of solutions to $Ax = 0$
- $\operatorname{row}(A)$ - the input directions that are not annihilated
- $N(A^\top)$ - the left null space; $b$ cannot have a component in this subspace for solvability

Reading solvability from these subspaces:

- $Ax = b$ is consistent iff $b \in \operatorname{col}(A)$
- if $x_p$ is any particular solution, every solution is $x_p + x_h$ for some $x_h \in N(A)$
- if $b$ has a component in $N(A^\top)$, the system is inconsistent

The direct-sum decomposition $\mathbb{R}^m = \operatorname{col}(A) \oplus N(A^\top)$ is what makes the least-squares projection of Section 6 well-defined: we project $b$ onto $\operatorname{col}(A)$ and discard the $N(A^\top)$ component.

Scope note: This section uses the four fundamental subspaces as a practical tool for understanding $Ax = b$. Their rigorous definitions, orthogonality relationships, and dimensional identities are the canonical subject of a later section.

-> Full treatment: Vector Spaces and Subspaces (Chapter 7)
4.4 The Fredholm Alternative
The Fredholm alternative packages the previous paragraph into a clean either-or statement:
For a given $A$ and $b$, exactly one of the following is true:

- $Ax = b$ has a solution, or
- there exists $y$ such that $A^\top y = 0$ and $y^\top b \neq 0$

Interpretation:

- if $A^\top y = 0$, then $y$ lies in the left null space $N(A^\top)$
- if $y^\top b \neq 0$, then $b$ has a component along $y$, orthogonal to the column space
- therefore $b$ cannot lie entirely in $\operatorname{col}(A)$
This theorem is conceptually deeper than it first appears. It says inconsistency is not mysterious. It is witnessed by an explicit certificate that proves the right-hand side points outside the achievable subspace.
4.5 Sensitivity and Condition Number
Not all solvable systems are equally safe to solve numerically.
Suppose $Ax = b$ and the right-hand side is perturbed to $b + \delta b$. Then the solution changes to $x + \delta x$ satisfying

$$A(x + \delta x) = b + \delta b.$$

Subtracting gives

$$A \, \delta x = \delta b,$$

so

$$\delta x = A^{-1} \delta b$$

when $A$ is invertible.

This leads to the standard error bound

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \, \frac{\|\delta b\|}{\|b\|},$$

where

$$\kappa(A) = \|A\| \, \|A^{-1}\|$$

is the condition number in the chosen norm.
Meaning:
- $\kappa(A) \approx 1$ means the system is well-conditioned
- large $\kappa(A)$ means the system strongly amplifies perturbations
small data error --times--> condition number --gives--> solution error
In ML this matters constantly:
- Hessian systems can be very ill-conditioned
- kernel matrices can be nearly singular
- normal equations square the condition number
- regularization often helps more by improving conditioning than by changing the exact minimizer
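The amplification effect is easy to see numerically. In the sketch below (a deliberately near-singular $2 \times 2$ example of our own construction), a perturbation of size $10^{-4}$ in $b$ moves the solution by a full unit:

```python
import numpy as np

# Two nearly parallel lines: the system is invertible but ill-conditioned.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])

x = np.linalg.solve(A, b)          # exact solution is (1, 1)
kappa = np.linalg.cond(A)          # condition number, roughly 4e4 here

# Tiny perturbation of the right-hand side...
x_pert = np.linalg.solve(A, b + np.array([0.0, 1e-4]))
# ...moves the solution from (1, 1) all the way to (0, 2).
print(kappa)
print(x, x_pert)
```

The relative input error here is about $5 \times 10^{-5}$, while the relative output error is of order $1$, consistent with the bound $\kappa(A) \approx 4 \times 10^4$.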
5. Special Linear Systems
5.1 Triangular Systems
Triangular systems are the cheap end of linear solving.
For a lower triangular system $Lx = b$:

$$x_i = \frac{1}{\ell_{ii}} \left( b_i - \sum_{j < i} \ell_{ij} x_j \right), \qquad i = 1, \dots, n.$$

This is forward substitution.

For an upper triangular system $Ux = b$:

$$x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j > i} u_{ij} x_j \right), \qquad i = n, \dots, 1.$$

This is back substitution.

Each costs $O(n^2)$, much cheaper than an $O(n^3)$ factorisation. That is why methods such as LU and Cholesky are so useful: they reduce the hard problem to triangular solves.
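The two substitution formulas translate line for line into code. A minimal NumPy sketch (function names are ours; libraries expose this as, e.g., SciPy's `solve_triangular`):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L, top row first. O(n^2)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def back_substitution(U, b):
    """Solve Ux = b for upper triangular U, bottom row first. O(n^2)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(0)
L = np.tril(rng.standard_normal((6, 6))) + 6 * np.eye(6)  # well-conditioned
U = np.triu(rng.standard_normal((6, 6))) + 6 * np.eye(6)
b = rng.standard_normal(6)
print(np.allclose(L @ forward_substitution(L, b), b))   # True
print(np.allclose(U @ back_substitution(U, b), b))      # True
```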
5.2 Symmetric Positive Definite Systems
If $A$ is symmetric positive definite (SPD), then

$$A = A^\top \quad \text{and} \quad x^\top A x > 0 \ \text{for all } x \neq 0.$$

This is an especially pleasant class:

- eigenvalues are positive
- the system $Ax = b$ has a unique solution
- Cholesky factorisation applies:

$$A = L L^\top$$

with $L$ lower triangular
Cholesky is faster and more stable than generic LU in this setting, and it avoids pivoting.
SPD systems appear everywhere in ML:
- Gram matrices
- kernel matrices (with regularization)
- covariance matrices
- normal equations when $A$ has full column rank
- local quadratic models near well-behaved minima
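A Cholesky-based solve reduces an SPD system to two triangular solves. A minimal NumPy sketch (the Gram-plus-ridge construction below is our own illustrative choice of SPD matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
A = X.T @ X + 0.1 * np.eye(3)    # Gram matrix plus ridge term: SPD
b = rng.standard_normal(3)

L = np.linalg.cholesky(A)        # A = L L^T, L lower triangular
# Two triangular solves replace one general solve:
y = np.linalg.solve(L, b)        # forward:  L y = b
x = np.linalg.solve(L.T, y)      # backward: L^T x = y

print(np.allclose(A @ x, b))     # True
```

For repeated solves with the same $A$ and many right-hand sides, factor once and reuse $L$; the per-solve cost then drops to $O(n^2)$.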
5.3 Diagonal and Block Diagonal Systems
If $A = D$ is diagonal, then solving

$$Dx = b$$

is trivial:

$$x_i = b_i / d_{ii}.$$

This is $O(n)$.
If the matrix is block diagonal, then each block can be solved independently.
[B1 0 0 ] [x1] [b1]
[0 B2 0 ] [x2] = [b2]
[0 0 B3] [x3] [b3]
solve B1 x1 = b1, B2 x2 = b2, B3 x3 = b3 separately
This separability is a major computational win. Many preconditioners and structured models try to approximate a difficult system by something diagonal or block diagonal precisely because those systems are cheap to solve.
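The separability is easy to demonstrate: solve each block on its own and compare against assembling the full matrix. A short sketch (block sizes chosen arbitrarily; assumes SciPy is available for `block_diag`):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
sizes = (2, 3, 2)
# Diagonal shift keeps each block comfortably invertible.
blocks = [rng.standard_normal((k, k)) + 3 * np.eye(k) for k in sizes]
rhs = [rng.standard_normal(k) for k in sizes]

# Solve each block independently -- never touch the other blocks.
x_blocks = [np.linalg.solve(B, b) for B, b in zip(blocks, rhs)]

# Check against the assembled 7x7 block-diagonal system.
A = block_diag(*blocks)
x_full = np.linalg.solve(A, np.concatenate(rhs))
print(np.allclose(np.concatenate(x_blocks), x_full))   # True
```

The cost drops from one $O((\sum k_i)^3)$ solve to a sum of $O(k_i^3)$ solves, and the blocks can be dispatched in parallel.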
5.4 Sparse Linear Systems
A sparse matrix has relatively few non-zero entries. For large systems, sparsity changes everything.
Dense viewpoint:
- storage: $O(n^2)$
- factorisation: $O(n^3)$

Sparse viewpoint:

- storage: $O(\mathrm{nnz})$, the number of non-zeros
- matrix-vector multiply: often $O(\mathrm{nnz})$
- iterative methods become viable
The main subtlety is fill-in: elimination may create new non-zero entries where zeros previously existed. Sparse direct solvers therefore spend significant effort on reordering the matrix to reduce fill-in before factorisation.
In AI, sparsity appears in:
- graph neural networks
- sparse attention patterns
- recommender systems
- PDE discretisations in scientific ML
- structured factor graphs and graphical models
5.5 Toeplitz and Structured Systems
Some matrices are not sparse but are still highly structured.
Toeplitz matrix
- constant along diagonals
- entry $(i, j)$ depends only on $i - j$
Circulant matrix
- special Toeplitz form
- diagonalised by the discrete Fourier transform
- solvable via FFT in near-linear time
Vandermonde matrix
- arises in polynomial interpolation
- numerically delicate but structurally special
Convolution is a good AI example. A 1D convolution can be written as multiplication by a Toeplitz matrix. In practice we do not materialise that matrix, but the structured-system viewpoint is still correct and useful.
6. Least Squares Problems
6.1 Motivation and Formulation
When $Ax = b$ is overdetermined, exact consistency is unlikely. The next best question is: which $x$ makes $Ax$ as close to $b$ as possible?

This gives the least-squares problem:

$$\min_x \|Ax - b\|_2^2.$$

Geometrically:

- $Ax$ ranges over the column space of $A$
- we want the point in $\operatorname{col}(A)$ closest to $b$
- therefore $Ax^*$ is the orthogonal projection of $b$ onto $\operatorname{col}(A)$

The residual

$$r^* = b - Ax^*$$

is orthogonal to the column space at the optimum.
b
|\
| \
| \ residual r*
| \
| o = projection of b onto col(A)
|
+--------------------------> col(A)
Least squares is therefore not merely "approximate solving." It is orthogonal projection in disguise.
6.2 Normal Equations
Differentiate the least-squares objective:

$$f(x) = \|Ax - b\|_2^2.$$

Then

$$\nabla f(x) = 2 A^\top A x - 2 A^\top b.$$

At the minimizer,

$$\nabla f(x) = 0,$$

which gives the normal equations

$$A^\top A x = A^\top b.$$

If $A$ has full column rank, then $A^\top A$ is SPD and

$$x = (A^\top A)^{-1} A^\top b.$$

This formula is exact and important, but also numerically dangerous when $A$ is ill-conditioned because

$$\kappa(A^\top A) = \kappa(A)^2.$$
6.3 QR-Based Least Squares
If $A = QR$ is the thin QR factorisation, with

- $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns
- $R \in \mathbb{R}^{n \times n}$ upper triangular

then

$$\|Ax - b\|_2^2 = \|Rx - Q^\top b\|_2^2 + \|(I - QQ^\top) b\|_2^2.$$

Because $Q$ preserves lengths on its column space, the least-squares problem reduces to

$$Rx = Q^\top b.$$

This is an upper triangular solve.
Advantages over normal equations:
- avoids forming
- avoids squaring the condition number
- numerically more stable
For practical least squares, QR is usually the safer default.
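Both routes are a few lines in NumPy. The sketch below (random data, well-conditioned by construction) solves the same problem via the normal equations and via thin QR, and checks the orthogonality of the residual:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

# Normal equations: exact in theory, but squares the condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# QR route: reduce to an upper triangular solve against Q^T b.
Q, R = np.linalg.qr(A)               # thin QR by default
x_qr = np.linalg.solve(R, Q.T @ b)

print(np.allclose(x_ne, x_qr))                # True for well-conditioned A
# Residual is orthogonal to the column space at the optimum:
print(np.allclose(A.T @ (b - A @ x_qr), 0))   # True
```

For an ill-conditioned $A$ the two answers would diverge, with the QR route staying accurate longer; that is exactly the $\kappa(A)^2$ effect.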
6.4 Regularised Least Squares
When $A$ is nearly singular, rank-deficient, or noisy, add an $\ell_2$ penalty:

$$\min_x \|Ax - b\|_2^2 + \lambda \|x\|_2^2.$$

This is Tikhonov regularisation or ridge regression.

The optimality condition becomes

$$(A^\top A + \lambda I)\, x = A^\top b.$$

This does two things at once:

- shrinks the solution toward the origin
- improves conditioning by shifting small eigenvalues upward
In ML, this is not a side topic. Weight decay is exactly this geometry applied to parameter fitting.
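The shrinkage effect is visible directly in the closed form. A minimal sketch (the function name `ridge_solve` is ours; random illustrative data):

```python
import numpy as np

def ridge_solve(A, b, lam):
    """Tikhonov / ridge solution of (A^T A + lam I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)

x0 = ridge_solve(A, b, 0.0)      # plain least squares (lam = 0)
x1 = ridge_solve(A, b, 10.0)     # shrunk toward the origin
print(np.linalg.norm(x1) < np.linalg.norm(x0))   # True
```

The solution norm decreases monotonically as $\lambda$ grows, while the eigenvalues of $A^\top A + \lambda I$ are all shifted up by $\lambda$, which is the conditioning improvement.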
6.5 Weighted Least Squares
If some observations are more reliable than others, use a positive diagonal weight matrix $W$:

$$\min_x \; (Ax - b)^\top W (Ax - b).$$

This produces the weighted normal equations

$$A^\top W A \, x = A^\top W b.$$

If $W = \operatorname{diag}(w_1, \dots, w_m)$, then each residual is penalised according to its importance.
This is natural when:
- measurement errors have unequal variances
- some samples are trusted more than others
- robust procedures iteratively reweight points
6.6 Total Least Squares
Ordinary least squares assumes is exact and only is noisy. That is often unrealistic.
Total least squares allows perturbations to both:

$$(A + \Delta A)\, x = b + \Delta b.$$

The goal is to minimise the size of the joint perturbation

$$\|[\, \Delta A \;\; \Delta b \,]\|_F.$$

The classical solution uses the SVD of the augmented data matrix $[\, A \;\; b \,]$. The smallest singular direction identifies the smallest correction that makes the system exactly consistent.
This matters when the design matrix itself was measured or estimated with error, which is common in scientific inference and noisy representation learning.
7. Iterative Methods for Linear Systems
7.1 Why Iterative Methods?
Direct solvers are excellent when matrices are moderate in size and dense. But they do not scale indefinitely.
If $n = 10^6$, then:

- storing a dense matrix would require on the order of $10^{12}$ entries
- a dense direct factorisation would require on the order of $10^{18}$ arithmetic operations

That is not realistic.
Iterative methods change the question. Instead of computing the exact answer in finitely many elimination steps, they produce a sequence

$$x_0, x_1, x_2, \dots$$

that ideally converges to the true solution.

The stopping rule is usually based on the residual:

$$r_k = b - A x_k.$$

When $\|r_k\|$ is small enough, stop.
The key computational benefit is that many iterative methods only need:
- matrix-vector products
- vector inner products
- inexpensive preconditioner applications
That makes them ideal for large sparse systems where factorising would be too expensive.
7.2 Stationary Iterative Methods
A broad stationary-method template begins by splitting

$$A = M - N,$$

where $M$ is chosen so that solving $Mz = c$ is easy.

Then iterate:

$$x_{k+1} = M^{-1} \left( N x_k + b \right).$$

Convergence depends on the spectral radius:

$$\rho(M^{-1} N) < 1.$$

Jacobi

Take $M = D$, the diagonal of $A$.

All components use only old values, so Jacobi is naturally parallel.

Gauss-Seidel

Use the lower triangular part of $A$ (including the diagonal) as the easy system. Updated values are used immediately:

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right).$$

This often converges faster than Jacobi but is less parallel-friendly.

SOR

Successive over-relaxation adds a parameter $\omega$:

$$x_i^{(k+1)} = (1 - \omega)\, x_i^{(k)} + \omega\, x_{i,\mathrm{GS}}^{(k+1)}.$$

When $\omega$ is well chosen, convergence can improve dramatically.
Stationary methods are conceptually important, but for large modern workloads Krylov methods usually dominate in practice.
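Jacobi is short enough to write in full. A minimal NumPy sketch (the function name and the small test matrix are ours; the matrix is strictly diagonally dominant, which guarantees convergence):

```python
import numpy as np

def jacobi(A, b, iters=200):
    """Jacobi iteration: x_{k+1} = D^{-1} (b - (A - D) x_k)."""
    d = np.diag(A)
    R = A - np.diag(d)                  # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / d             # all components updated from old values
    return x

# Strictly diagonally dominant, so the iteration converges.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(jacobi(A, b), np.linalg.solve(A, b)))   # True
```

Note the update touches every component simultaneously using only the previous iterate, which is exactly why Jacobi parallelises so well.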
7.3 Conjugate Gradient Method
For SPD systems, conjugate gradient (CG) is the central iterative method.
It solves

$$Ax = b, \qquad A \ \text{SPD},$$

by minimising the quadratic

$$\phi(x) = \tfrac{1}{2} x^\top A x - b^\top x$$

over increasingly rich Krylov subspaces

$$\mathcal{K}_k = \operatorname{span}\{ r_0, A r_0, A^2 r_0, \dots, A^{k-1} r_0 \}.$$

The algorithm is:

$$r_0 = b - A x_0, \qquad p_0 = r_0,$$

and then for each step:

$$\alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}, \qquad x_{k+1} = x_k + \alpha_k p_k, \qquad r_{k+1} = r_k - \alpha_k A p_k,$$

$$\beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \qquad p_{k+1} = r_{k+1} + \beta_k p_k.$$

Important facts:

- in exact arithmetic, CG terminates in at most $n$ steps
- in practice it is useful because it often converges much sooner
- convergence improves when eigenvalues are clustered

The classical bound is

$$\|x_k - x^*\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x^*\|_A,$$

where $\kappa = \lambda_{\max}(A) / \lambda_{\min}(A)$.
So conditioning directly controls iteration count.
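The recurrences above fit in a few lines, and the implementation makes the "matvecs plus dot products only" claim concrete. A minimal sketch (the SPD test matrix is our own construction; production code would use `scipy.sparse.linalg.cg`):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """CG for SPD A, using only matvecs and inner products."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x                        # residual
    p = r.copy()                         # search direction
    rs = r @ r
    for _ in range(max_iter or 2 * n):   # exact arithmetic needs at most n
        Ap = A @ p
        alpha = rs / (p @ Ap)            # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p        # A-conjugate direction update
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)            # SPD by construction
b = rng.standard_normal(20)
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True
```

Notice that `A` only ever appears in the product `A @ p`; any black-box matvec, sparse or matrix-free, can be substituted.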
7.4 Preconditioning
If conditioning controls convergence, the natural question is: can we transform the system so that it is better conditioned, without changing its solution?

That is the purpose of preconditioning.

Choose $M$ such that:

- applying $M^{-1}$ is cheap
- $M^{-1} A$ is better conditioned than $A$

Then solve the transformed system

$$M^{-1} A x = M^{-1} b$$

instead.

Common examples:

- Jacobi preconditioner: $M = \operatorname{diag}(A)$
- incomplete LU (ILU)
- incomplete Cholesky for SPD systems
- multigrid preconditioners for PDE-like problems
Preconditioning is often the difference between an iterative solver that is unusably slow and one that converges in a practical number of steps.
In AI terms, normalization layers and adaptive optimizers play a closely related geometric role: they reshape the effective conditioning of the optimization problem.
7.5 GMRES for Non-Symmetric Systems
Conjugate gradient requires SPD structure. For general non-symmetric systems, a standard choice is GMRES: Generalised Minimal Residual.
GMRES constructs an orthonormal basis for the Krylov space using the Arnoldi process and chooses the iterate whose residual norm is minimal over that subspace.
Conceptually:
- build an orthonormal basis for the Krylov subspace $\mathcal{K}_k$
- solve a small least-squares problem each iteration
- obtain the best residual achievable in that subspace
Benefits:
- applies to very general systems
- robust in many settings
Costs:
- memory grows with iteration count because the Krylov basis grows
- restarted GMRES($m$) is often used to limit storage
GMRES is a good example of a recurring pattern in numerical linear algebra: turn a hard large system into a sequence of small projection problems.
7.6 Krylov Subspace Methods Summary
| Method | Matrix type | Cost per iteration | Memory | Typical use |
|---|---|---|---|---|
| Jacobi | broad but fragile | sparse matvec + vector ops | low | simple baseline, parallel smoothing |
| Gauss-Seidel | broad but sequential | sparse matvec-like updates | low | small systems, relaxation |
| CG | SPD | matvec + a few dot products | low | kernel systems, normal equations, SPD problems |
| MINRES | symmetric indefinite | matvec + vector ops | low | saddle-like symmetric systems |
| GMRES | general non-singular | matvec + orthogonalisation | growing | non-symmetric large systems |
| LSQR | least squares / rectangular | bidiagonalisation-based | low | large sparse least squares |
The main lesson is not to memorise solver names. It is to match system structure to algorithm structure.
8. Nonlinear Systems of Equations
8.1 Formulation
A nonlinear system has the form
$F(x) = 0, \qquad F : \mathbb{R}^n \to \mathbb{R}^m,$
or in components
$f_i(x_1, \dots, x_n) = 0, \qquad i = 1, \dots, m.$
Unlike the linear case, the solution set may be:
- empty
- a finite set of isolated points
- a curve or surface of solutions
- multiple disconnected branches
There is no complete analogue of the rank classification above. Structure still matters, but now local differential information becomes central.
8.2 Newton's Method for Nonlinear Systems
Newton's method linearises the system around the current iterate $x_k$:
$F(x) \approx F(x_k) + J(x_k)(x - x_k),$
where $J(x_k)$ is the Jacobian.
Set the linear approximation to zero:
$F(x_k) + J(x_k)\,\Delta x = 0.$
So the Newton step solves the linear system
$J(x_k)\,\Delta x = -F(x_k)$
and updates
$x_{k+1} = x_k + \Delta x.$
This is one of the great unifying ideas of applied mathematics:
nonlinear problem
-> local linearisation
-> solve linear system
-> update
-> repeat
Near a simple root, Newton convergence is quadratic, which is extremely fast. But it can fail badly if:
- the initial guess is poor
- the Jacobian is singular or nearly singular
- the function is not smooth enough
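A minimal NumPy sketch of the loop, using a hypothetical test system (a circle intersected with a hyperbola) not taken from the text:

```python
import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0,   # circle of radius 2
                     x * y - 1.0])        # hyperbola xy = 1

def J(v):                                 # Jacobian of F
    x, y = v
    return np.array([[2 * x, 2 * y],
                     [y,     x    ]])

v = np.array([2.0, 0.5])                  # reasonable initial guess
for _ in range(10):
    step = np.linalg.solve(J(v), -F(v))   # Newton step: J dx = -F
    v = v + step
    if np.linalg.norm(F(v)) < 1e-12:
        break
```

From this starting point the residual collapses in a handful of iterations, which is the quadratic convergence in action; a poor starting point can behave very differently.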
8.3 Quasi-Newton Methods
Newton's method is powerful but expensive because each step requires the full Jacobian and a system solve.
Quasi-Newton methods approximate the Jacobian (or the Hessian in optimization form) using update formulas instead of recomputing it exactly.
For root-finding, Broyden's method updates an approximate Jacobian with a rank-one correction chosen to satisfy the latest secant relation.
In optimization, the best-known analogues are BFGS and L-BFGS. They trade exact second-order information for lower cost while preserving much of Newton's geometric advantage.
The recurring theme is the same as elsewhere in AI systems engineering:
- exact structure is too expensive
- structured approximation is often the practical win
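Broyden's secant update can be sketched as follows on the same kind of hypothetical test system as above; seeding the approximation with one exact Jacobian evaluation is a common practical choice, assumed here.

```python
import numpy as np

def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x * y - 1.0])

def J(v):                        # exact Jacobian, used only once at the start
    x, y = v
    return np.array([[2 * x, 2 * y], [y, x]])

v = np.array([2.0, 0.5])
B = J(v)                         # initial Jacobian approximation
for _ in range(50):
    s = np.linalg.solve(B, -F(v))            # quasi-Newton step
    v_new = v + s
    dF = F(v_new) - F(v)
    # rank-one secant update: choose B so that B_new @ s == dF
    B = B + np.outer(dF - B @ s, s) / (s @ s)
    v = v_new
    if np.linalg.norm(F(v)) < 1e-12:
        break
```

Each iteration costs one function evaluation and one solve with the current approximation, never a fresh Jacobian.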
8.4 Fixed-Point Iteration
Sometimes we rewrite the system
$F(x) = 0$
as
$x = G(x)$
and iterate
$x_{k+1} = G(x_k).$
If $G$ is a contraction, Banach's fixed-point theorem guarantees convergence to a unique fixed point.
This sounds abstract, but the idea is everywhere:
- value iteration in reinforcement learning
- expectation-maximization style self-consistency updates
- implicit layers and deep equilibrium models
- iterative normalisation or message-passing updates
Fixed-point iteration is usually slower than Newton near a solution, but it may be simpler, cheaper, or more robust globally.
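The classic one-line example: $G(x) = \cos(x)$ is a contraction near its fixed point, so naive iteration converges from a wide range of starting values.

```python
import math

# G(x) = cos(x) is a contraction near its fixed point (|G'(x)| < 1 there)
x = 1.0
for _ in range(100):
    x_new = math.cos(x)
    if abs(x_new - x) < 1e-12:
        x = x_new
        break
    x = x_new
# converges to x ~ 0.739085, the unique solution of x = cos(x)
```

Convergence here is linear with rate roughly $|\sin(x^{\ast})| \approx 0.67$ per step, noticeably slower than Newton would be on the same equation.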
8.5 Continuation and Homotopy Methods
What if the target nonlinear system is too hard to solve directly?
Continuation methods embed it in a family:
$H(x, t) = 0, \qquad t \in [0, 1],$
such that:
- $H(x, 0) = 0$ is easy
- $H(x, 1) = 0$ is the true problem
Then track the solution path as $t$ moves from $0$ to $1$.
This is especially powerful for:
- polynomial systems with many roots
- bifurcation analysis
- problems where direct Newton iteration fails from naive initialisation
The conceptual point is valuable even if one never implements homotopy directly: difficult problems can often be solved by deforming from an easier one.
8.6 The Gradient Equations as a Nonlinear System
Training a model means minimising $L(\theta)$. The stationary condition is
$\nabla L(\theta) = 0.$
That is a nonlinear system in the parameters $\theta$.
Gradient descent is therefore not separate from system solving. It is one particular iterative scheme for approximately solving the gradient equations.
Newton's method for optimization goes one step further:
$H(\theta_k)\,\Delta\theta = -\nabla L(\theta_k),$
where $H = \nabla^2 L$ is the Hessian.
So each Newton step is a linear-system solve inside a nonlinear-system solve. This nesting is fundamental to modern optimization theory.
9. Constrained Systems and Lagrangians
9.1 Equality-Constrained Systems
Suppose we want to minimise $f(x)$ subject to constraints
$h(x) = 0.$
Introduce Lagrange multipliers $\lambda$ and define
$\mathcal{L}(x, \lambda) = f(x) + \lambda^\top h(x).$
The first-order conditions are
$\nabla_x \mathcal{L}(x, \lambda) = 0, \qquad h(x) = 0.$
This is itself a larger system of equations in the primal variables and the multipliers.
This is the central message:
optimization with constraints
->
system of equations in (x, lambda)
9.2 Saddle-Point Systems
Linearised KKT systems often take the block form
$\begin{pmatrix} H & A^\top \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} b \\ c \end{pmatrix}.$
These are saddle-point systems:
- not positive definite
- often indefinite
- structured rather than arbitrary
They require specialised linear algebra.
A common strategy is block elimination using the Schur complement. If $H$ is invertible, eliminate $x$ first and solve a reduced system for $\lambda$.
From
$Hx + A^\top \lambda = b$
we get
$x = H^{-1}(b - A^\top \lambda).$
Substituting into
$Ax = c$
gives
$A H^{-1}(b - A^\top \lambda) = c,$
so
$A H^{-1} A^\top \lambda = A H^{-1} b - c.$
The matrix
$S = A H^{-1} A^\top$
is the Schur complement. Once $\lambda$ is known, recover $x$ from the first block equation.
This block viewpoint is essential in large constrained problems because it turns one large structure into smaller pieces that can be interpreted and solved more intelligently.
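The elimination steps above can be sketched directly in NumPy on random synthetic blocks (invented for illustration), and checked against a direct solve of the full saddle-point system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
H = rng.standard_normal((n, n))
H = H @ H.T + n * np.eye(n)                 # SPD (1,1) block
A = rng.standard_normal((m, n))             # full-row-rank constraint block
b = rng.standard_normal(n)
c = rng.standard_normal(m)

# Schur complement S = A H^{-1} A^T, built via solves, not an explicit inverse
HinvAT = np.linalg.solve(H, A.T)
Hinvb = np.linalg.solve(H, b)
S = A @ HinvAT
lam = np.linalg.solve(S, A @ Hinvb - c)     # reduced system for lambda
x = np.linalg.solve(H, b - A.T @ lam)       # back-substitute for x

# check against solving the full saddle-point system at once
K = np.block([[H, A.T], [A, np.zeros((m, m))]])
xl = np.linalg.solve(K, np.concatenate([b, c]))
```

The reduced system is only $m \times m$, which is where the "smaller pieces" payoff comes from when $m \ll n$.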
9.3 KKT Systems in AI
Constraint systems appear in many AI settings:
- support vector machines through quadratic programming
- KL-constrained policy optimization and RLHF-style objectives
- simplex-constrained probability vectors
- structured prediction with equality or inequality constraints
Even when an ML paper does not say "system of equations," KKT conditions are often doing the hidden work.
For example, the softmax distribution can be derived from an entropy-regularised constrained optimization problem with the simplex constraint
$\sum_i p_i = 1, \qquad p_i \ge 0.$
The multiplier associated with this normalisation is exactly the kind of object introduced by the Lagrangian formalism.
9.4 Inequality Constraints and Linear Programming
If the constraints are inequalities,
$g(x) \le 0,$
then feasibility itself becomes a constraint system.
Linear programming studies
$\min_x\; c^\top x \quad \text{subject to} \quad Ax \le b, \; x \ge 0.$
The geometry changes from affine subspaces to convex polytopes, but the system perspective remains:
- equations describe faces
- inequalities carve out feasible regions
- KKT conditions add complementary slackness
This viewpoint matters in AI for decoding constraints, sparse optimization, and relaxations of combinatorial prediction problems.
10. Systems Arising Directly in AI
10.1 Normal Equations in Linear Regression
Given data matrix $X$ and target vector $y$, least squares solves
$\min_w \|Xw - y\|_2^2.$
The normal equations are
$X^\top X\, w = X^\top y.$
This is the canonical ML example of a system emerging from an optimization problem. Ridge regression modifies it to
$(X^\top X + \lambda I)\, w = X^\top y,$
which is both statistically and numerically better behaved when the feature matrix is close to rank deficient.
Linear probes on frozen embeddings are exactly this calculation.
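The calculation fits in a few lines of NumPy on synthetic data (invented for illustration); the QR/SVD-based `lstsq` route is the numerically preferred one for harder problems.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

# ordinary least squares via the normal equations X^T X w = X^T y
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
# same solution via lstsq (orthogonalisation-based, better conditioned)
w_qr, *_ = np.linalg.lstsq(X, y, rcond=None)

# ridge: the regularised normal equations (X^T X + lambda I) w = X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```

On this well-conditioned example both routes agree; on a nearly rank-deficient $X$ they can differ meaningfully, which is the point of the earlier conditioning discussion.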
10.2 Attention Normalisation as Constraint System
For one attention row, raw scores $s_j$ are transformed into weights
$a_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}.$
The resulting vector satisfies:
- $a_j \ge 0$ for every $j$
- $\sum_j a_j = 1$
So every attention row is a constrained weighting problem over the value vectors. Softmax is the smooth mechanism that enforces those constraints.
The final output is
$o = \sum_j a_j v_j,$
which is a convex combination. System language clarifies what attention is doing: it solves for admissible weights under normalization constraints, then forms a structured linear combination.
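One attention row can be sketched in a few lines (the scores and value vectors below are invented for illustration):

```python
import numpy as np

def softmax(s):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])     # raw scores for one query position
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # value vectors as rows
a = softmax(scores)                     # simplex-constrained weights
out = a @ V                             # convex combination of the values
```

The two constraints are enforced by construction: exponentials make the weights non-negative, and the normalising sum makes them total one.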
10.3 Deep Equilibrium Models DEQ
A deep equilibrium model defines its hidden state $z^{\ast}$ implicitly through
$z^{\ast} = f(z^{\ast}, x).$
This is a nonlinear fixed-point system.
Forward pass:
- solve $z = f(z, x)$ for $z^{\ast}$ iteratively
Backward pass:
- use implicit differentiation, which again requires solving a linear system involving $I - \partial f/\partial z$ or a related operator
More concretely, if
$z^{\ast} = f(z^{\ast}, x),$
then differentiating implicitly gives
$\frac{\partial z^{\ast}}{\partial x} = \Bigl(I - \frac{\partial f}{\partial z}\Bigr)^{-1} \frac{\partial f}{\partial x}.$
So the backward pass avoids storing an explicit deep stack of activations, but only because it replaces that memory cost with a linear solve in the implicit Jacobian operator.
DEQs are a clean example of how system solving is not only an analysis tool but the actual definition of the model computation.
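A toy version of both passes, using a hypothetical single-layer map $f(z, x) = \tanh(Wz + x)$ with small weights so that $f$ is a contraction:

```python
import numpy as np

rng = np.random.default_rng(2)
W = 0.1 * rng.standard_normal((3, 3))   # small weights keep f a contraction
x = rng.standard_normal(3)

def f(z):
    return np.tanh(W @ z + x)           # hypothetical single-layer DEQ map

# forward pass: fixed-point iteration z <- f(z)
z = np.zeros(3)
for _ in range(200):
    z = f(z)

# backward pass sketch: dz*/dx = (I - df/dz)^{-1} df/dx
# for f = tanh(Wz + x): df/dz = D @ W and df/dx = D, with D = diag(1 - tanh^2)
D = np.diag(1.0 - np.tanh(W @ z + x) ** 2)
dz_dx = np.linalg.solve(np.eye(3) - D @ W, D)
```

Real DEQs use accelerated fixed-point solvers and matrix-free linear solves, but the structure (iterate forward, solve a Jacobian system backward) is exactly this.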
10.4 Gaussian Processes and Linear System Solves
In Gaussian process regression, the posterior mean involves
$\mu_{\ast} = k_{\ast}^\top (K + \sigma^2 I)^{-1} y,$
where $K$ is the kernel matrix.
Rather than explicitly invert the matrix, one solves
$(K + \sigma^2 I)\,\alpha = y.$
Because the matrix is SPD, Cholesky is the natural direct method. For larger problems, iterative methods such as CG with kernel-structure-aware approximations become important.
In practical Bayesian optimization and uncertainty-aware modelling, the system solve is often the dominant cost.
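The Cholesky route can be sketched as follows, with an assumed unit-lengthscale RBF kernel and synthetic 1-D data (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
Xtr = rng.standard_normal((20, 1))          # 1-D training inputs
y = np.sin(Xtr[:, 0])

def rbf(A, B):
    """Squared-exponential kernel, unit lengthscale (an assumption here)."""
    d2 = (A[:, None, 0] - B[None, :, 0]) ** 2
    return np.exp(-0.5 * d2)

sigma2 = 1e-2                               # observation noise variance
K = rbf(Xtr, Xtr) + sigma2 * np.eye(20)     # SPD system matrix K + sigma^2 I

# solve (K + sigma^2 I) alpha = y via Cholesky, never forming an inverse
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
```

The two triangular solves replace the explicit inverse; `np.linalg.solve` does not exploit triangularity, so in practice one would use a dedicated triangular solve (e.g. SciPy's `solve_triangular`).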
10.5 Newton's Method for Transformer Training
A full Newton step for a large model would solve
$H(\theta)\,\Delta\theta = -\nabla L(\theta)$
with
$H(\theta) = \nabla^2 L(\theta).$
This is not feasible at LLM scale because:
- $H$ is far too large to store
- even matrix-vector access can be expensive
Modern second-order or quasi-second-order optimizers therefore approximate the system structure:
- diagonal approximations such as Adam-like methods
- Kronecker-factored approximations such as K-FAC
- matrix preconditioners such as Shampoo and related variants
The common idea is always the same: do not ignore the linear system, approximate it intelligently.
For example:
- K-FAC approximates layerwise curvature by Kronecker factors, turning one huge inverse into smaller matrix inverses or Cholesky solves
- Shampoo maintains matrix preconditioners for row and column structure, effectively solving a block-shaped system in a better metric
- recent methods such as SOAP stabilise these preconditioners further while keeping their cost tractable at scale
This is why second-order methods should be understood as linear-system design problems rather than as isolated optimizer tricks.
10.6 Physics-Informed Neural Networks PINNs
Physics-informed neural networks train a network so that it satisfies a differential equation system approximately.
Typical ingredients:
- PDE residual equations
- boundary conditions
- initial conditions
The loss is built from these residuals, so training attempts to solve the underlying equation system through function approximation.
PINNs matter here because they make the analogy literal: neural networks are being used as approximate solvers for systems of equations coming from physics.
11. Overdetermined and Underdetermined Systems in Practice
11.1 The Underdetermined Case and Implicit Bias
In modern machine learning, underdetermined systems are not pathological. They are normal.
If a model has many more parameters than effectively independent constraints, then the stationary equations admit many solutions. The question is no longer "does a solution exist?"
It becomes "which of the many solutions does the algorithm actually find?"
For linear models trained from small or zero initialisation, gradient descent often converges to the minimum-norm interpolating solution. That is an example of implicit bias: the algorithm selects one point from a large affine family without that preference being written explicitly in the objective.
This matters for generalisation. In overparameterized learning, the geometry of the solution set is not enough; the algorithm's path through that set matters too.
11.2 Minimum Norm Solutions and Regularisation
For an underdetermined but consistent system $Ax = b$ with full row rank,
the minimum Euclidean norm solution is
$x^{\ast} = A^\top (A A^\top)^{-1} b.$
This is the Moore-Penrose pseudo-inverse solution.
It is the special member of the solution family
$x = x_p + z, \qquad z \in \operatorname{null}(A),$
that is orthogonal to the null space.
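The formula can be checked directly against NumPy's pseudo-inverse on a random underdetermined system (invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 5))   # underdetermined: 2 equations, 5 unknowns
b = rng.standard_normal(2)

# minimum-norm solution via x = A^T (A A^T)^{-1} b, computed with a solve
x_mn = A.T @ np.linalg.solve(A @ A.T, b)
# the same point via the Moore-Penrose pseudo-inverse
x_pinv = np.linalg.pinv(A) @ b
```

Both routes land on the same point of the three-dimensional affine solution family: the one closest to the origin.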
Other norms lead to other preferences:
- $\ell_1$ minimisation promotes sparsity
- nuclear norm minimisation promotes low rank
- explicit penalties reshape which member of a feasible family is preferred
The practical lesson is simple: once exact uniqueness is gone, the norm or regulariser you choose becomes part of the definition of the problem.
11.3 Structured Underdetermined Systems in Language Models
Language-model generation contains many structured underdetermined choices.
At each decoding step:
- many token distributions satisfy the simplex constraint
- many sequences satisfy a grammar or schema
- many continuation paths have similar local likelihood
The system is not linear, but the principle is similar: there are many admissible solutions, and the algorithm plus constraints determine which one is selected.
Examples:
- beam search chooses among multiple high-scoring structured candidates
- constrained decoding solves a combinatorial feasibility problem over token sequences
- watermarking schemes add extra statistical constraints to the output distribution
11.4 The System Perspective on Self-Attention
For one query position $i$, attention computes
$o_i = \sum_j a_{ij} v_j$
with
$a_{ij} = \operatorname{softmax}_j\!\left( \frac{q_i^\top k_j}{\sqrt{d_k}} \right), \qquad a_{ij} \ge 0, \qquad \sum_j a_{ij} = 1.$
So the output lies in the convex hull of the value vectors. This is a constrained coefficient-selection problem:
- unconstrained linear combination would allow arbitrary signed weights
- attention allows only simplex-constrained weights
That restriction is part of why attention is stable and interpretable as "allocation of focus" rather than arbitrary recombination.
12. Numerical Methods and Stability
12.1 Floating-Point Arithmetic and Rounding Errors
Every real computation on a machine is finite-precision computation. Solving systems of equations therefore means solving them with rounding error.
Each arithmetic operation introduces a relative perturbation on the order of machine epsilon:
- FP64: about $2 \times 10^{-16}$
- FP32: about $10^{-7}$
- BF16: about $4 \times 10^{-3}$ in relative mantissa precision
The actual solution error depends on both:
- the stability of the algorithm
- the conditioning of the problem
This distinction is foundational:
- a stable algorithm can still produce a poor answer on an ill-conditioned problem
- an unstable algorithm can fail even on a well-conditioned problem
12.2 Iterative Refinement
Suppose we computed an approximate solution $\hat{x}$. We can form the residual
$r = b - A\hat{x}$
and solve a correction system
$A\,\delta = r.$
Then update
$\hat{x} \leftarrow \hat{x} + \delta.$
This is iterative refinement.
Why it works:
- the factorisation of can be reused
- the residual often reveals the remaining error accurately
- low-precision factorisation plus higher-precision residual computation can recover surprisingly high accuracy
In mixed precision form, a typical pattern is:
- factorise or approximately solve in low precision
- compute the residual in higher precision
- solve the correction equation using the stored factorisation
- update the solution and repeat
This is mathematically classical and computationally modern at the same time.
Modern mixed-precision solvers rely heavily on this principle. The same mentality appears in mixed-precision training: cheap low-precision main computation, with strategically preserved higher-precision corrective structure.
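The mixed-precision pattern can be sketched as follows on a synthetic well-conditioned system (invented for illustration); a real implementation would factorise the low-precision matrix once and reuse the factors, whereas this sketch simply calls `solve` again.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)   # well-conditioned
b = rng.standard_normal(50)

# "low precision": do the main solve entirely in float32
A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for _ in range(3):
    r = b - A @ x                                    # residual in float64
    d = np.linalg.solve(A32, r.astype(np.float32))   # correction solve in float32
    x = x + d.astype(np.float64)                     # update in float64
```

A handful of cheap corrections recovers near-float64 accuracy from a float32 solve, which is the whole appeal of the technique.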
12.3 Pivoting Strategies and Numerical Stability
Pivoting is not merely a convenience; it is a stability mechanism.
If a matrix is diagonally dominant or structurally well behaved, elimination may be stable even without pivoting. But in general, partial pivoting is the practical baseline because it controls growth in the elimination process.
The high-level rule is:
algebraic solvability
!=
numerical reliability
This distinction explains why numerically aware implementations almost never use the classroom version of Gaussian elimination verbatim.
12.4 Preconditioning for Ill-Conditioned Systems
Ill-conditioning hurts twice:
- it amplifies perturbations in direct solving
- it slows iterative convergence
Preconditioning attacks the geometry. Instead of accepting the original coordinate system of the problem, it rescales or reshapes the system into one that behaves better numerically.
Left preconditioning: $M^{-1} A x = M^{-1} b$
Right preconditioning: $A M^{-1} u = b$, then $x = M^{-1} u$
Split preconditioning applies transforms on both sides: $M_1^{-1} A M_2^{-1} u = M_1^{-1} b$, with $x = M_2^{-1} u$.
The point is never to find a perfect inverse. It is to find a cheap approximation that dramatically improves the effective spectrum.
12.5 Condition Number and Practical Impact
Condition number translates geometry into operational risk.
If $\kappa(A)$ is large, then:
- small perturbations in data can produce large perturbations in the solution
- low-precision arithmetic becomes dangerous
- iterative methods need more steps
- regularization becomes more attractive
In AI this explains many engineering interventions:
- damping Hessian systems in second-order methods
- adding ridge terms to kernel matrices
- scaling attention scores by $1/\sqrt{d_k}$
- normalizing activations and gradients
All of these can be interpreted as attempts to avoid badly conditioned computational subproblems.
13. Systems of Equations in Optimisation
13.1 Gradient Descent as Approximate System Solving
Optimization can be read as equation solving:
$\min_\theta L(\theta) \;\longleftrightarrow\; \nabla L(\theta) = 0.$
Gradient descent iterates
$\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k).$
A fixed point of this update satisfies
$\nabla L(\theta^{\ast}) = 0.$
So gradient descent is one way of seeking a root of the gradient field.
For quadratic losses, this connection becomes explicit: gradient descent is an iterative solver for the normal equations.
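That claim can be verified numerically: on synthetic least-squares data (invented for illustration), gradient descent converges to exactly the point the normal equations define.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((30, 3))
y = rng.standard_normal(30)

# direct solve of the normal equations X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on L(w) = 0.5 * ||Xw - y||^2
w = np.zeros(3)
eta = 1.0 / np.linalg.norm(X.T @ X, 2)   # step size 1/L (L = largest eigenvalue)
for _ in range(5000):
    w = w - eta * X.T @ (X @ w - y)      # gradient of the quadratic loss
```

The iterate count needed depends on the spectrum of $X^\top X$, which previews the convergence analysis in 13.4.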
13.2 Newton's Method and the Hessian System
Newton's method in optimization solves
$H(\theta_k)\,\Delta\theta = -\nabla L(\theta_k)$
and updates
$\theta_{k+1} = \theta_k + \Delta\theta.$
Near a well-behaved optimum, this gives quadratic convergence. But for large neural networks it is limited by the same obstacle as every other second-order method: the Hessian system is enormous.
That is why exact Newton is rare at scale, while approximate curvature methods are common.
13.3 The Role of Linear System Solves in Second-Order Methods
Second-order methods differ mainly in how they approximate or factor the curvature system.
K-FAC
- approximates curvature blockwise with Kronecker structure
- converts one huge system into many smaller structured solves
- for a layer gradient $G$, the preconditioned update has the schematic form $A^{-1} G\, S^{-1},$
where each Kronecker factor is estimated from activation or gradient covariances
Shampoo
- keeps matrix-valued preconditioners per parameter block
- uses matrix roots and inverse roots to precondition updates
- replaces scalar learning-rate rescaling with geometry-aware matrix rescaling
SOAP and related recent preconditioners
- push matrix preconditioning further while retaining practical scalability
- aim to capture more curvature structure than diagonal methods without paying full Hessian cost
These methods are not mysterious if you view them through the system lens. Each is a different compromise between:
- fidelity to the true Hessian system
- memory cost
- solve cost
- numerical stability
13.4 Convergence Analysis via Linear Systems
For the quadratic loss
$L(w) = \tfrac{1}{2}\|Xw - y\|_2^2$
we have
$\nabla L(w) = X^\top (Xw - y),$
so the stationary condition is
$X^\top X\, w = X^\top y.$
Thus the local convergence of first-order methods is governed by the spectrum of $X^\top X$.
This is why eigenvalues, singular values, and condition numbers are not separate subjects from optimization. They govern the difficulty of the linear systems sitting underneath it.
14. Common Mistakes
| Mistake | Why it is wrong | Fix |
|---|---|---|
| "If $A$ is square, the system must have a unique solution." | Square only means the matrix is square. It does not imply full rank. Singular square systems may have no solution or infinitely many. | Check rank or determinant; uniqueness requires full rank. |
| "Gaussian elimination is only for square systems." | Row reduction works for any matrix and is one of the best tools for understanding rectangular systems. | Use elimination on any augmented matrix to reveal rank and solution structure. |
| "Homogeneous systems might have no solution." | $x = 0$ always solves $Ax = 0$. | The real question is whether non-zero solutions exist. |
| "Least squares solves the original system exactly." | Least squares solves the closest projection problem, not the exact system unless $b$ lies in the column space of $A$. | Always inspect the residual. |
| "Normal equations are always fine because the formula is simple." | Forming $X^\top X$ squares the condition number and can destroy accuracy. | Prefer QR or SVD for harder least-squares problems. |
| "A small residual means a highly accurate solution." | For ill-conditioned systems, the residual can be small while the actual solution error is large. | Consider conditioning as well as residual size. |
| "Free variables can be ignored after finding one solution." | Free variables encode the entire family of exact solutions. | Write the full affine solution: particular plus null-space basis. |
| "Iterative methods always converge if you run them long enough." | Convergence depends on matrix structure and the chosen method. Some iterations diverge. | Check assumptions such as SPD, diagonal dominance, or spectral-radius conditions. |
| "The inverse is the natural way to solve every linear system." | Explicit inversion is usually slower and less stable than factorisation-based solving. | Solve directly using LU, QR, or Cholesky as appropriate. |
| "Optimization is separate from systems of equations." | Stationarity, KKT conditions, Newton steps, and fixed points are all equation systems. | Reframe optimization algorithms through their associated systems. |
15. Exercises
- Row reduction and solution types: for each given augmented matrix, reduce to RREF, classify the system, and write the full solution set.
- Particular solution and null space: for the given system $Ax = b$: (a) find the RREF, (b) identify pivot and free variables, (c) find a particular solution, (d) find a basis for $\operatorname{null}(A)$, and (e) verify rank-nullity.
- Least squares: with the given $X$ and $y$, derive the normal equations, solve for $w$, compute the residual, verify orthogonality to the column space, and then compare with ridge regularisation using a penalty $\lambda > 0$.
- Conditioning: for the given matrix $A$, compute the determinant, solve $Ax = b$, perturb $b$ slightly, and quantify how much the solution changes. Interpret the result using the condition number.
- Newton's method for a nonlinear system: solve the given system by deriving the Jacobian and performing two Newton steps from the given starting point.
- Iterative methods: for the given SPD system, solve using Cholesky, then compare Jacobi, Gauss-Seidel, and two steps of CG.
- Constrained system: minimise the given objective subject to the given equality constraint. Write the Lagrangian, derive the KKT system, solve it, and then repeat with the given additional constraint.
- AI application - attention as a constrained coefficient system: with the given query, key, and value matrices,
compute the score matrix, the softmax weights, and the output matrix. Interpret each attention row as a simplex-constrained weighting problem.
16. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| Linear regression and probing | Fitting linear probes, calibration heads, and readout layers reduces to exact or regularised linear systems. |
| Newton and second-order optimisation | K-FAC, Shampoo, SOAP, and related optimizers are structured approximations to very large curvature-system solves. |
| Overparameterisation | Modern deep networks often operate in underdetermined regimes, so the geometry of the solution set and the implicit bias of the solver matter. |
| Deep equilibrium models | Forward pass solves a nonlinear fixed-point system; backward pass solves an implicit linear system. |
| Gaussian processes and Bayesian inference | Kernel methods are fundamentally system-solving pipelines, usually with SPD matrices. |
| Constrained decoding and RLHF | Normalization, KL constraints, and structured generation can all be expressed as coupled constraint systems. |
| Numerical stability | Condition numbers explain why damping, normalization, scaling, and mixed precision are operational necessities rather than cosmetic choices. |
| PINNs and scientific AI | Solving PDE systems with neural surrogates makes equation-solving central to modern AI-for-science workflows. |
| Sparse and efficient attention | Approximate attention mechanisms preserve or relax the original attention system structure to gain scale. |
| Interpretability | Many attribution and decomposition techniques boil down to solving linear systems for component contributions. |
17. Conceptual Bridge
Systems of equations are where abstract algebra, geometry, and computation become inseparable.
- vector spaces told us what the objects are
- matrix operations told us how those objects are transformed
- systems of equations tell us how to recover unknowns from constraints
The linear case gives the complete reference model:
- rank tells us how many independent constraints exist
- null spaces tell us what freedom remains
- elimination and factorisation tell us how to solve
- least squares tells us what to do when exact compatibility fails
The nonlinear case adds the genuinely hard phenomena:
- multiple solutions
- bifurcations
- local linearisation
- fixed points
- constrained stationary conditions
This is exactly the bridge needed for the next major topic. Once a system has been written as $Ax = b$, the next question is often: what intrinsic structure does the map $A$ itself have, independent of any particular right-hand side?
That is the spectral question, and it leads naturally to eigenvalues, eigenvectors, and decompositions.
Vectors and Spaces -> geometry
Matrix Operations -> computation
Systems of Equations -> solving
Eigenvalues / SVD -> structure
Probability -> uncertainty
Optimisation -> dynamics
LLM Mathematics -> all of them together
References
- Gilbert Strang, MIT 18.06 Linear Algebra course materials and notes: https://web.mit.edu/18.06/www/
- Richard Barrett et al., Templates for the Solution of Linear Systems: https://www.netlib.org/templates/templates.html
- Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, 1997.
- Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th ed., Johns Hopkins University Press, 2013.
- Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain: https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
- Y. Saad and M. H. Schultz, "GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems": https://epubs.siam.org/doi/10.1137/0907058
- Ashish Vaswani et al., "Attention Is All You Need" (2017): https://arxiv.org/abs/1706.03762
- Shaojie Bai, J. Zico Kolter, and Vladlen Koltun, "Deep Equilibrium Models" (2019): https://arxiv.org/abs/1909.01377
- James Martens and Roger Grosse, "Optimizing Neural Networks with Kronecker-factored Approximate Curvature" (ICML 2015): https://proceedings.mlr.press/v37/martens15.html
- Vineet Gupta et al., "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018): https://proceedings.mlr.press/v80/gupta18a.html
- Nikhil Vyas et al., "SOAP: Improving and Stabilizing Shampoo using Adam" (2024): https://arxiv.org/abs/2409.11321
- Rafael Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023): https://arxiv.org/abs/2305.18290
- Maziar Raissi, Paris Perdikaris, and George E. Karniadakis, "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations" (2019): https://www.sciencedirect.com/science/article/pii/S0021999118307125
- NVIDIA cuSOLVER documentation: https://docs.nvidia.com/cuda/cusolver/
- NVIDIA cuSPARSE documentation: https://docs.nvidia.com/cuda/cusparse/