Math for LLMs

Linear Algebra Basics / Systems of Equations - Notes

"A model is useful only when its constraints can be satisfied. Systems of equations are the language in which those constraints become explicit, and solving them is the point where mathematical structure turns into actual computation."

Overview

A single equation imposes a single condition. A system of equations imposes many conditions at once. That shift from "one condition" to "all of these conditions simultaneously" is the core step from elementary algebra to modern computational mathematics. It is also the step that makes linear algebra operational in AI.

In geometric language, each equation cuts down the set of allowable points. In algebraic language, each equation adds a constraint. In computational language, the entire problem becomes: find the variable assignment that satisfies the full constraint set, or determine that no such assignment exists. Everything that follows in numerical linear algebra, optimization, probabilistic inference, and deep learning depends on this viewpoint.

For machine learning, systems of equations appear everywhere:

  • linear regression solves the normal equations
  • second-order optimization solves Hessian systems
  • Gaussian processes solve large symmetric positive definite systems
  • constrained optimization solves KKT systems
  • deep equilibrium models solve nonlinear fixed-point systems
  • training itself searches for solutions to the gradient equations $\nabla L(\theta) = 0$

This chapter therefore treats systems of equations in three linked ways:

  • as geometric intersections of constraints
  • as algebraic objects described by rank, null spaces, and consistency
  • as computational problems requiring direct or iterative solvers

The linear case is the foundation because it is both completely understood and universally useful. The nonlinear case matters because every realistic optimization problem eventually becomes nonlinear. Together they explain why solving equations is central to both classical scientific computing and modern AI.

Prerequisites

  • Basic algebra and multivariable notation
  • Familiarity with vectors, matrices, rank, null spaces, and subspaces
  • Comfort with matrix multiplication and transpose
  • Basic understanding of norms and least squares

Companion Notebooks

Notebook          Description
theory.ipynb      Interactive examples for row reduction, least squares, conditioning, and iterative solvers
exercises.ipynb   Guided practice on consistency, Gaussian elimination, least squares, Newton updates, and AI-flavored systems

Learning Objectives

After completing this chapter, you should be able to:

  • Translate a system of equations into matrix form $Ax = b$
  • Diagnose whether a system has zero, one, or infinitely many solutions
  • Use rank conditions to characterize consistency and uniqueness
  • Perform Gaussian elimination and interpret REF / RREF correctly
  • Separate homogeneous and inhomogeneous solution structure
  • Understand least squares as projection onto a column space
  • Distinguish direct solvers from iterative solvers and know when each is appropriate
  • Explain Newton's method as repeated solution of linearized systems
  • Interpret constrained optimization through KKT systems and saddle-point structure
  • Connect all of the above directly to linear regression, attention, Gaussian processes, DEQs, and second-order optimization

Table of Contents

  1. Intuition
  2. Linear Systems - Formulation
  3. Gaussian Elimination and Row Reduction
  4. Existence, Uniqueness, and the Rank Conditions
  5. Special Linear Systems
  6. Least Squares Problems
  7. Iterative Methods for Linear Systems
  8. Nonlinear Systems of Equations
  9. Constrained Systems and Lagrangians
  10. Systems Arising Directly in AI
  11. Overdetermined and Underdetermined Systems in Practice
  12. Numerical Methods and Stability
  13. Systems of Equations in Optimisation

1. Intuition

1.1 What Are Systems of Equations?

A system of equations is a collection of two or more equations that must all be satisfied at the same time. The word "simultaneously" matters more than anything else. One equation restricts the variables somewhat. Many equations restrict them together, and the solution is whatever survives all of those restrictions at once.

For example, in two variables:

$$\begin{aligned} x + y &= 3 \\ 2x - y &= 0 \end{aligned}$$

the first equation alone allows infinitely many points, namely every point on a line. The second equation also allows infinitely many points, again a line. Solving the system means finding the point or points that lie on both lines at the same time.

This generalises immediately:

  • one equation in two unknowns gives a curve or a line in $\mathbb{R}^2$
  • one equation in three unknowns gives a surface or a plane in $\mathbb{R}^3$
  • one linear equation in $n$ unknowns gives a hyperplane in $\mathbb{R}^n$
  • a full system gives the intersection of all those constraint sets

Linear systems are special because each equation has degree $1$. Their geometry is flat, their algebra is complete, and their computational methods are highly developed. Nonlinear systems are harder because the constraints can curve, fold, branch, or fail to intersect in complicated ways.

In AI, both matter:

  • linear systems appear directly in least squares, Gaussian processes, Kalman filters, and second-order methods
  • nonlinear systems appear in optimization, fixed-point models, equilibrium conditions, KKT systems, and PDE-constrained learning

1.2 Why Systems of Equations Are Central to AI

Many AI computations that seem unrelated are really system-solving problems in disguise.

Least squares and linear regression

If we want to minimize

$$\|Ax - b\|_2^2,$$

then the minimizer satisfies the normal equations

$$A^\top A x = A^\top b.$$

So even the most basic linear regression problem ends as a linear system solve.

Optimization

At a stationary point of a loss $L(\theta)$,

$$\nabla L(\theta) = 0.$$

This is a system of $p$ equations in $p$ unknowns when $\theta \in \mathbb{R}^p$. Newton's method solves this nonlinear system by repeatedly solving linear systems involving the Hessian.

Probabilistic inference

Gaussian processes, Kalman filters, and many Bayesian update formulas require solving systems such as

$$K \alpha = y,$$

where $K$ is a covariance or kernel matrix. The main computational bottleneck is often not "doing Bayesian inference" in the abstract, but solving this linear system accurately and efficiently.

Attention and normalization

Softmax enforces constraints: non-negativity and sum-to-one. That is not a linear system, but it is still a simultaneous constraint-satisfaction problem at every query position.

Constrained generation and RLHF

Grammar-constrained decoding, schema following, and KL-constrained policy updates are all system-like problems. They involve satisfying many conditions at once, exactly the mentality of equation systems and constrained optimization.

1.3 The Geometric Picture

The geometry is the fastest way to build intuition.

One linear equation in two variables,

$$a_1 x_1 + a_2 x_2 = b,$$

defines a line in $\mathbb{R}^2$.

One linear equation in three variables,

$$a_1 x_1 + a_2 x_2 + a_3 x_3 = b,$$

defines a plane in $\mathbb{R}^3$.

In $\mathbb{R}^n$, one equation

$$a^\top x = b$$

defines a hyperplane, which is an affine subspace of codimension $1$.

One equation cuts space once.

R^2:  line
R^3:  plane
R^n:  hyperplane

Many equations = many cuts
Solution = common intersection

The three basic outcomes are:

TWO-LINE PICTURE IN R^2

1. Unique solution

   \  /
    \/
    /\
   /  \

2. No solution

   -----
   -----

3. Infinitely many solutions

   =====
   =====

The same logic scales to hyperplanes in higher dimension.

This same geometry survives in abstract form in higher dimensions:

  • unique solution means the constraints cut down to a single point
  • infinitely many solutions means the constraints leave a whole direction or family of directions unconstrained
  • no solution means the constraints are incompatible

1.4 Linear vs Nonlinear Systems

The distinction between linear and nonlinear systems is structural, not cosmetic.

LINEAR SYSTEMS                    NONLINEAR SYSTEMS
-------------------------------------------------------------
Constraints: hyperplanes          Constraints: curves/surfaces/manifolds
Superposition: yes                Superposition: no
Theory: complete                  Theory: problem-dependent
Algorithms: elimination, QR,      Algorithms: Newton, fixed-point,
            Cholesky, CG                     homotopy, continuation
Solution set: affine subspace     Solution set: isolated roots,
            or empty                          branches, manifolds, empty

Linearity means two things happen:

  1. the geometry is flat
  2. the algebra closes under addition and scalar multiplication in a controlled way

That is why the linear case admits the clean rank-based theory:

  • consistency is determined by column space membership
  • uniqueness is determined by null-space triviality
  • general solutions are particular solution plus null-space directions

None of this survives in the same complete form for nonlinear systems. But nonlinear methods almost always depend on linearization, which is why linear systems remain the local language of almost all nonlinear computation.

1.5 The Three Cases and Their Geometry

For a linear system Ax=bAx = b, there are only three possible outcomes.

Case 1: unique solution

  • the system is consistent
  • the constraints are independent enough to determine every variable
  • algebraically, the null space is trivial: $\mathrm{null}(A) = \{0\}$
  • when $A$ is square, this means $\mathrm{rank}(A) = n$

Case 2: infinitely many solutions

  • the system is consistent
  • but there are free directions left
  • algebraically, $\mathrm{null}(A)$ is nontrivial
  • the full solution set is
$$x = x_p + x_h, \qquad x_h \in \mathrm{null}(A),$$

so it is an affine subspace

Case 3: no solution

  • the right-hand side $b$ is incompatible with the column space of $A$
  • equivalently, $b \notin \mathrm{col}(A)$
  • geometrically, the hyperplanes do not all meet

SOLUTION SET SHAPE

unique solution         infinite solutions        no solution
      point               affine family             empty set

      o                    o----o----o                (none)
                           |    |    |
                           line / plane / etc.

The entire chapter is really about characterising these three cases precisely, then solving them efficiently.

1.6 Historical Timeline

  • Babylonian mathematics already solved small systems by elimination-like reasoning in commercial and geometric contexts.
  • The Chinese Nine Chapters on the Mathematical Art described the fangcheng method, a tabular elimination scheme that is recognisably proto-Gaussian elimination.
  • Cramer introduced determinant-based formulas for square systems in the 18th century.
  • Gauss systematised elimination and least squares in astronomical computation.
  • Jordan refined the full reduction process now called Gauss-Jordan elimination.
  • The 19th century matrix viewpoint, due to Cayley, Sylvester, Frobenius, and others, turned system solving into matrix algebra.
  • The 20th century added numerical stability, pivoting, factorisation methods, and large-scale iterative solvers.
  • Modern AI inherits all of this: dense solvers for medium-scale exact problems, sparse / iterative solvers for large systems, and structured approximations for trillion-parameter optimization.

The historical pattern is important: the subject developed first from geometry and astronomy, then from algebra, then from numerical analysis, and now from large-scale machine computation. AI sits in the fourth stage, but it still depends on the first three.


2. Linear Systems - Formulation

2.1 Standard Form

A system of $m$ linear equations in $n$ unknowns has the form

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\;\;\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m.
\end{aligned}$$

The compact matrix form is

$$Ax = b,$$

where

  • $A \in \mathbb{R}^{m \times n}$ is the coefficient matrix
  • $x \in \mathbb{R}^n$ is the unknown vector
  • $b \in \mathbb{R}^m$ is the right-hand side

Ax = b

  m x n      n x 1    m x 1

[coeffs]  [unknowns] = [targets]

This notation compresses the entire system into one object. It also reveals the key question immediately:

Is $b$ in the column space of $A$?

If yes, the system is consistent. If no, it is not.

2.2 Augmented Matrix

The augmented matrix packages the coefficients and the right-hand side together:

$$[A \mid b] = \begin{pmatrix} a_{11} & \cdots & a_{1n} & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ a_{m1} & \cdots & a_{mn} & b_m \end{pmatrix}.$$

Why do this? Because row operations affect the equations and the right-hand side at the same time. Working with $[A \mid b]$ keeps the bookkeeping honest.

[A | b]

coefficient columns | right-hand side
-------------------------------------
a11 a12 ... a1n    | b1
a21 a22 ... a2n    | b2
... ... ... ...    | ...
am1 am2 ... amn    | bm

In practice:

  • row-reduce $[A \mid b]$
  • inspect the pivots and any contradiction rows
  • read off the solution type directly

2.3 Existence and Uniqueness - The Fundamental Criterion

The basic theorem is the Rouché-Capelli criterion:

  • the system is consistent iff
$$\mathrm{rank}(A) = \mathrm{rank}([A \mid b])$$
  • the system has a unique solution iff
$$\mathrm{rank}(A) = \mathrm{rank}([A \mid b]) = n$$
  • the system has infinitely many solutions iff
$$\mathrm{rank}(A) = \mathrm{rank}([A \mid b]) < n$$
  • the system is inconsistent iff
$$\mathrm{rank}(A) < \mathrm{rank}([A \mid b]).$$

This is the complete classification.

Why it works:

  • $\mathrm{rank}(A)$ counts how many independent coefficient constraints there are
  • $\mathrm{rank}([A \mid b])$ counts how many independent constraints remain after including the targets
  • if the augmented rank is larger, then $b$ introduced an extra independent demand that the columns of $A$ cannot satisfy
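
This criterion is easy to check numerically. A minimal NumPy sketch (the example matrix and right-hand side are the underdetermined system worked out in Section 3.7):

```python
import numpy as np

def classify_system(A, b, tol=1e-10):
    """Classify Ax = b via the rank criterion (Rouché-Capelli)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    n = A.shape[1]
    rank_A = np.linalg.matrix_rank(A, tol=tol)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]), tol=tol)
    if rank_A < rank_Ab:
        return "inconsistent: no solution"
    if rank_A == n:
        return "consistent: unique solution"
    return f"consistent: infinitely many solutions ({n - rank_A} free variables)"

A = np.array([[1.0, 2.0, 0.0, 1.0],
              [2.0, 4.0, 1.0, 3.0],
              [0.0, 0.0, 1.0, 1.0]])
b = np.array([1.0, 3.0, 1.0])
print(classify_system(A, b))   # consistent: infinitely many solutions (2 free variables)
```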

2.4 Homogeneous vs Inhomogeneous

A system is homogeneous if $b = 0$:

$$Ax = 0.$$

It is inhomogeneous if $b \neq 0$:

$$Ax = b.$$

The homogeneous case is special because it is always consistent: $x = 0$ always works.

Its solution set is exactly the null space:

$$\{x : Ax = 0\} = \mathrm{null}(A).$$

This is a subspace, not merely an affine set. Its dimension is

$$\dim(\mathrm{null}(A)) = n - \mathrm{rank}(A),$$

by rank-nullity.

The inhomogeneous case is built from the homogeneous one. If $x_p$ is any particular solution to $Ax = b$, then every solution has the form

$$x = x_p + x_h, \qquad x_h \in \mathrm{null}(A).$$

This is one of the most important formulas in the chapter. It says:

general solution
= one concrete solution
+ every homogeneous direction that changes nothing

So the homogeneous system describes the freedom left over after one particular solution is found.

2.5 Types by Shape of A

The shape of $A$ already tells you a lot.

Square system: $m = n$

  • same number of equations as unknowns
  • generically unique when full rank
  • singular if rank drops

Overdetermined system: $m > n$

  • more equations than unknowns
  • generically inconsistent
  • solved approximately by least squares

Underdetermined system: $m < n$

  • fewer equations than unknowns
  • generically infinitely many solutions if consistent
  • often choose the minimum-norm solution

Exactly determined full-rank square system

  • the classical "nice" case
  • unique exact solution
  • direct methods like LU are natural

In ML, overdetermined and underdetermined cases are at least as important as square ones:

  • regression and curve fitting are overdetermined
  • overparameterized learning is underdetermined
  • inverse problems are often ill-posed even when square

3. Gaussian Elimination and Row Reduction

3.1 Elementary Row Operations

There are three row operations that preserve the solution set:

  1. swap two rows: $R_i \leftrightarrow R_j$
  2. scale a row by a non-zero scalar: $R_i \leftarrow \alpha R_i$
  3. replace a row with itself plus a multiple of another row: $R_i \leftarrow R_i + \alpha R_j$

These are not arbitrary tricks. Each corresponds to a legitimate equation manipulation:

  • swapping changes equation order
  • scaling keeps the same solution set because dividing back is possible
  • adding one equation to another creates an equivalent equation

Each elementary row operation can also be represented as left multiplication by an elementary matrix. That matters later because elimination is really matrix multiplication in disguise.

3.2 Row Echelon Form REF

A matrix is in row echelon form if:

  1. all zero rows are at the bottom
  2. each pivot is to the right of the pivot above it
  3. everything below each pivot is zero

Example:

$$\begin{pmatrix} 1 & 2 & 0 & 3 \\ 0 & 1 & 4 & -1 \\ 0 & 0 & 2 & 5 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

is in REF.

The pivots tell you which variables are constrained directly. The non-pivot columns are free-variable columns.

Pivot columns  -> determined variables
Free columns   -> free variables / parameters

Forward elimination always produces an REF. It is not unique, but the pivot structure encodes the rank and the free-variable count.

3.3 Reduced Row Echelon Form RREF

RREF imposes two extra conditions:

  1. each pivot equals $1$
  2. each pivot is the only non-zero entry in its column

Example:

$$\begin{pmatrix} 1 & 0 & 3 & 0 \\ 0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

This form is stronger than REF and, crucially, unique for a given matrix.

That uniqueness makes RREF a conceptual gold standard: once you have it, the solution structure can be read off immediately. In practice, though, solving via REF plus back substitution is usually cheaper than carrying the full matrix all the way to RREF.

3.4 Gaussian Elimination Algorithm

Gaussian elimination has two conceptual stages.

Forward elimination

  • choose a pivot column
  • select a pivot row
  • eliminate entries below the pivot
  • move to the next submatrix

Back substitution

  • solve the last pivot variable first
  • substitute upward one row at a time

FORWARD ELIMINATION

* * * *
* * * *   ->   * * * *
* * * *        0 * * *
* * * *        0 0 * *
                0 0 0 *

BACK SUBSTITUTION

solve bottom row first, then move upward

For an $n \times n$ dense system:

  • forward elimination costs on the order of $\tfrac{1}{3}n^3$ arithmetic units
  • back substitution costs on the order of $n^2$
  • total cost is therefore $O(n^3)$

This cubic cost is why direct dense methods become expensive at scale, and why iterative methods matter for huge sparse systems.

3.5 Partial and Complete Pivoting

In exact arithmetic, any non-zero pivot is fine. In floating-point arithmetic, that is false. Choosing a tiny pivot can amplify rounding error badly.

That is why practical elimination uses pivoting.

No pivoting

  • simplest conceptually
  • dangerous numerically
  • may divide by a very small number and create huge multipliers

Partial pivoting

  • in the current column, swap in the row with the largest absolute pivot candidate
  • this is the default in most dense direct solvers
  • it gives strong practical stability at low extra cost

Complete pivoting

  • search over the remaining submatrix, not just the current column
  • swap both rows and columns
  • slightly more stable in theory, but more expensive and less common in practice

PIVOTING IDEA

Bad pivot:

0.000001   *   *        divide by tiny number -> big multipliers -> error growth
  5        *   *
  7        *   *

After partial pivoting:

  7        *   *
  5        *   *
0.000001   *   *

For AI, pivoting matters whenever system solves occur inside a larger algorithm:

  • Gaussian process regression
  • second-order optimization
  • implicit differentiation
  • structured probabilistic inference

It is easy to think "the math says invertible, so we are safe." Numerical analysis says otherwise. Invertible is not the same as well-conditioned.
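
To make the algorithm concrete, here is a compact NumPy sketch of forward elimination with partial pivoting followed by back substitution. It is written for readability, not performance; production code would rely on a library factorisation such as scipy.linalg.lu_factor, and the example matrix is made up:

```python
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting,
    then back substitution on the resulting upper triangular system."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = A.shape[0]

    # Forward elimination: at each step, swap in the largest pivot candidate.
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))
        if np.abs(A[p, k]) < 1e-14:
            raise ValueError("matrix is singular to working precision")
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]

    # Back substitution: solve the last variable first, move upward.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[1e-6, 1.0, 1.0], [5.0, 2.0, 1.0], [7.0, 1.0, 3.0]])
b = np.array([2.0, 8.0, 11.0])
print(np.allclose(gauss_solve(A, b), np.linalg.solve(A, b)))  # True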

3.6 Reading Solutions from RREF

Once [Ab][A \mid b] is in RREF, the solution set can be read directly.

Inconsistency

If a row becomes

$$[\,0 \ \ 0 \ \ \cdots \ \ 0 \mid c\,], \qquad c \neq 0,$$

then the system says $0 = c$, which is impossible. No solution exists.

Pivot variables

Variables corresponding to pivot columns are determined by the equations.

Free variables

Variables corresponding to non-pivot columns can be chosen freely. They become parameters.

The general solution is then

$$x = x_p + t_1 v_1 + \cdots + t_k v_k,$$

where

  • $x_p$ is one particular solution
  • $v_1, \ldots, v_k$ form a basis for $\mathrm{null}(A)$
  • $k = n - \mathrm{rank}(A)$

This is not just a computational trick. It is the structural theorem of linear systems expressed in coordinates.

3.7 Worked Example - Underdetermined System

Consider

$$\begin{pmatrix} 1 & 2 & 0 & 1 \\ 2 & 4 & 1 & 3 \\ 0 & 0 & 1 & 1 \end{pmatrix} x = \begin{pmatrix} 1 \\ 3 \\ 1 \end{pmatrix}.$$

Its augmented matrix is

$$\left[ \begin{array}{cccc|c} 1 & 2 & 0 & 1 & 1 \\ 2 & 4 & 1 & 3 & 3 \\ 0 & 0 & 1 & 1 & 1 \end{array} \right].$$

Apply

$$R_2 \leftarrow R_2 - 2R_1$$

to get

$$\left[ \begin{array}{cccc|c} 1 & 2 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{array} \right].$$

Then

$$R_3 \leftarrow R_3 - R_2$$

gives

$$\left[ \begin{array}{cccc|c} 1 & 2 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right].$$

So the pivot columns are $1$ and $3$, while columns $2$ and $4$ are free.

Let

$$x_2 = t, \qquad x_4 = s.$$

Then row $2$ gives

$$x_3 + x_4 = 1 \;\Rightarrow\; x_3 = 1 - s,$$

and row $1$ gives

$$x_1 + 2x_2 + x_4 = 1 \;\Rightarrow\; x_1 = 1 - 2t - s.$$

Therefore

$$x = \begin{pmatrix} 1 - 2t - s \\ t \\ 1 - s \\ s \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix} + t \begin{pmatrix} -2 \\ 1 \\ 0 \\ 0 \end{pmatrix} + s \begin{pmatrix} -1 \\ 0 \\ -1 \\ 1 \end{pmatrix}.$$

This displays the exact structure:

  • one particular solution
  • plus a $2$-dimensional null-space family

That is the canonical shape of a consistent underdetermined linear system.
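
The same reduction can be checked mechanically. A short SymPy sketch; rref returns the reduced form together with the pivot column indices:

```python
import sympy as sp

A_aug = sp.Matrix([[1, 2, 0, 1, 1],
                   [2, 4, 1, 3, 3],
                   [0, 0, 1, 1, 1]])

rref, pivots = A_aug.rref()
print(pivots)   # (0, 2): pivot columns 1 and 3 in 1-based counting
print(rref)     # matches the reduced augmented matrix computed by hand above
```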


4. Existence, Uniqueness, and the Rank Conditions

4.1 Rank and Solution Structure - Complete Characterisation

For $Ax = b$ with $A \in \mathbb{R}^{m \times n}$ and rank $r$, the possibilities can be summarised as follows.

rank(A)   rank([A|b])   relation to n                    solution set
r = n     r = n         full column rank                 unique solution
r < n     r < n         consistent but rank-deficient    infinitely many solutions
r         r + 1         augmented rank larger            no solution

More explicitly:

  • if $\mathrm{rank}(A) = \mathrm{rank}([A \mid b]) = n$, then there are no free variables
  • if $\mathrm{rank}(A) = \mathrm{rank}([A \mid b]) < n$, then there are $n - r$ free variables
  • if $\mathrm{rank}(A) < \mathrm{rank}([A \mid b])$, then the system is inconsistent

This is complete. There are no other cases.

4.2 Geometric Interpretation of Each Case

It is useful to combine algebra and geometry explicitly.

Square full-rank system

  • $m = n$, rank $= n$
  • hyperplanes meet in a single point
  • algebraically: no null directions remain

Consistent overdetermined system

  • $m > n$
  • possible only if extra equations are redundant or implied
  • geometrically: many hyperplanes still share the same point

Generic overdetermined system

  • more constraints than degrees of freedom
  • usually inconsistent
  • motivates least squares rather than exact solving

Underdetermined system

  • too few independent constraints to determine every variable
  • solutions form an affine family
  • dimension of the family is $n - r$

DIMENSION COUNT

n unknown directions
- r independent constraints
--------------------------
n-r remaining free directions

This count is one of the simplest and most useful geometric heuristics in all of linear algebra.

4.3 The Fundamental Theorem of Linear Algebra Applied to Systems

The four fundamental subspaces of $A \in \mathbb{R}^{m \times n}$ give the cleanest picture of when $Ax = b$ is solvable:

  • $\mathrm{col}(A) \subseteq \mathbb{R}^m$ - the set of achievable right-hand sides
  • $\mathrm{null}(A) \subseteq \mathbb{R}^n$ - the set of solutions to $Ax = 0$
  • $\mathrm{row}(A) \subseteq \mathbb{R}^n$ - the input directions that are not annihilated
  • $\mathrm{null}(A^\top) \subseteq \mathbb{R}^m$ - the left null space; directions $b$ cannot lie in for solvability

Reading solvability from these subspaces:

  • $Ax = b$ is consistent $\iff$ $b \in \mathrm{col}(A)$
  • if $x_p$ is any particular solution, every solution is $x_p + v$ for some $v \in \mathrm{null}(A)$
  • $b$ has a component in $\mathrm{null}(A^\top)$ $\iff$ the system is inconsistent

The direct-sum decomposition $\mathbb{R}^m = \mathrm{col}(A) \oplus \mathrm{null}(A^\top)$ is what makes the least-squares projection of Section 6 well-defined: we project $b$ onto $\mathrm{col}(A)$ and discard the $\mathrm{null}(A^\top)$ component.

Scope note: This section uses the four fundamental subspaces as a practical tool for understanding Ax=bAx = b. Their rigorous definitions, orthogonality relationships, and dimensional identities are the canonical subject of a later section.

-> Full treatment: Vector Spaces and Subspaces 7

4.4 The Fredholm Alternative

The Fredholm alternative packages the previous paragraph into a clean either-or statement:

For a given $A$ and $b$, exactly one of the following is true:

  1. $Ax = b$ has a solution, or
  2. there exists $y$ such that
$$A^\top y = 0 \qquad \text{and} \qquad y^\top b \neq 0.$$

Interpretation:

  • if $A^\top y = 0$, then $y$ lies in the left null space
  • if $y^\top b \neq 0$, then $b$ has a component orthogonal to the column space
  • therefore $b$ cannot lie entirely in $\mathrm{col}(A)$

This theorem is conceptually deeper than it first appears. It says inconsistency is not mysterious. It is witnessed by an explicit certificate $y$ that proves the right-hand side points outside the achievable subspace.

4.5 Sensitivity and Condition Number

Not all solvable systems are equally safe to solve numerically.

Suppose $Ax = b$ and the right-hand side is perturbed to $b + \delta b$. Then the solution changes to $x + \delta x$ satisfying

$$A(x + \delta x) = b + \delta b.$$

Subtracting $Ax = b$ gives

$$A \, \delta x = \delta b,$$

so

$$\delta x = A^{-1} \delta b$$

when $A$ is invertible.

This leads to the standard error bound

$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \, \frac{\|\delta b\|}{\|b\|},$$

where

$$\kappa(A) = \|A\| \, \|A^{-1}\|$$

is the condition number in the chosen norm.

Meaning:

  • $\kappa(A) \approx 1$ means the system is well-conditioned
  • large $\kappa(A)$ means the system strongly amplifies perturbations

small data error  --times-->  condition number  --gives-->  solution error

In ML this matters constantly:

  • Hessian systems can be very ill-conditioned
  • kernel matrices can be nearly singular
  • normal equations square the condition number
  • regularization often helps more by improving conditioning than by changing the exact minimizer
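
A small NumPy illustration of this amplification, using the classically ill-conditioned Hilbert matrix as a made-up example:

```python
import numpy as np
from scipy.linalg import hilbert

# The 10x10 Hilbert matrix is a standard ill-conditioned test case.
A = hilbert(10)
x_true = np.ones(10)
b = A @ x_true

print(f"condition number: {np.linalg.cond(A):.2e}")   # roughly 1e13

# A tiny perturbation of b produces a disproportionately large change in x.
rng = np.random.default_rng(0)
db = 1e-10 * rng.standard_normal(10)
x_pert = np.linalg.solve(A, b + db)
print(np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true))
```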

5. Special Linear Systems

5.1 Triangular Systems

Triangular systems are the cheap end of linear solving.

For a lower triangular system $Lx = b$:

$$x_i = \frac{1}{\ell_{ii}} \left( b_i - \sum_{j=1}^{i-1} \ell_{ij} x_j \right), \qquad i = 1,2,\ldots,n.$$

This is forward substitution.

For an upper triangular system $Ux = b$:

$$x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right), \qquad i = n,n-1,\ldots,1.$$

This is back substitution.

Each costs $O(n^2)$, much cheaper than factorisation. That is why methods such as LU and Cholesky are so useful: they reduce the hard problem to triangular solves.
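
Both substitutions are short enough to write directly; a minimal NumPy sketch with a made-up lower triangular matrix:

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lx = b for lower triangular L in O(n^2) operations."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

def back_substitution(U, b):
    """Solve Ux = b for upper triangular U in O(n^2) operations."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, -1.0, 5.0]])
b = np.array([2.0, 5.0, 7.0])
print(np.allclose(L @ forward_substitution(L, b), b))   # True
```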

5.2 Symmetric Positive Definite Systems

If $A$ is symmetric positive definite (SPD), then

$$A^\top = A \qquad \text{and} \qquad x^\top A x > 0 \text{ for all } x \neq 0.$$

This is an especially pleasant class:

  • eigenvalues are positive
  • the system has a unique solution
  • Cholesky factorisation applies:
$$A = LL^\top$$

with $L$ lower triangular

Cholesky is faster and more stable than generic LU in this setting, and it avoids pivoting.

SPD systems appear everywhere in ML:

  • Gram matrices
  • kernel matrices (with regularization)
  • covariance matrices
  • normal equations when AA has full column rank
  • local quadratic models near well-behaved minima
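
A minimal sketch of the SPD pattern, using SciPy's Cholesky routines on a made-up Gram matrix with a small ridge term:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
A = X.T @ X + 1e-3 * np.eye(5)        # SPD by construction: Gram matrix + ridge
b = rng.standard_normal(5)

c, low = cho_factor(A)                # A = L L^T, no pivoting needed for SPD
x = cho_solve((c, low), b)            # two triangular solves instead of an inverse
print(np.allclose(A @ x, b))          # True
```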

5.3 Diagonal and Block Diagonal Systems

If $D$ is diagonal, then solving

$$Dx = b$$

is trivial:

$$x_i = \frac{b_i}{d_i}.$$

This is $O(n)$.

If the matrix is block diagonal, then each block can be solved independently.

[B1  0   0 ] [x1]   [b1]
[0   B2  0 ] [x2] = [b2]
[0   0   B3] [x3]   [b3]

solve B1 x1 = b1, B2 x2 = b2, B3 x3 = b3 separately

This separability is a major computational win. Many preconditioners and structured models try to approximate a difficult system by something diagonal or block diagonal precisely because those systems are cheap to solve.

5.4 Sparse Linear Systems

A sparse matrix has relatively few non-zero entries. For large systems, sparsity changes everything.

Dense viewpoint:

  • storage: $O(n^2)$
  • factorisation: $O(n^3)$

Sparse viewpoint:

  • storage: $O(\mathrm{nnz})$
  • matrix-vector multiply: often $O(\mathrm{nnz})$
  • iterative methods become viable

The main subtlety is fill-in: elimination may create new non-zero entries where zeros previously existed. Sparse direct solvers therefore spend significant effort on reordering the matrix to reduce fill-in before factorisation.

In AI, sparsity appears in:

  • graph neural networks
  • sparse attention patterns
  • recommender systems
  • PDE discretisations in scientific ML
  • structured factor graphs and graphical models

5.5 Toeplitz and Structured Systems

Some matrices are not sparse but are still highly structured.

Toeplitz matrix

  • constant along diagonals
  • $A_{ij}$ depends only on $i-j$

Circulant matrix

  • special Toeplitz form
  • diagonalised by the discrete Fourier transform
  • solvable via FFT in near-linear time

Vandermonde matrix

  • arises in polynomial interpolation
  • numerically delicate but structurally special

Convolution is a good AI example. A 1D convolution can be written as multiplication by a Toeplitz matrix. In practice we do not materialise that matrix, but the structured-system viewpoint is still correct and useful.


6. Least Squares Problems

6.1 Motivation and Formulation

When Ax=bAx = b is overdetermined, exact consistency is unlikely. The next best question is:

Which x makes Ax as close as possible to b?\text{Which } x \text{ makes } Ax \text{ as close as possible to } b?

This gives the least-squares problem:

x=argminxRnAxb22.x^\star = \arg\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2.

Geometrically:

  • AxAx ranges over the column space of AA
  • we want the point in col(A)\mathrm{col}(A) closest to bb
  • therefore AxAx^\star is the orthogonal projection of bb onto col(A)\mathrm{col}(A)

The residual

r=bAxr^\star = b - Ax^\star

is orthogonal to the column space at the optimum.

b
|\
| \
|  \   residual r*
|   \
|    o  = projection of b onto col(A)
|
+--------------------------> col(A)

Least squares is therefore not merely "approximate solving." It is orthogonal projection in disguise.

6.2 Normal Equations

Differentiate the least-squares objective:

$$f(x) = \|Ax - b\|_2^2 = (Ax-b)^\top (Ax-b).$$

Then

$$\nabla f(x) = 2A^\top (Ax-b).$$

At the minimizer,

$$A^\top(Ax-b) = 0,$$

which gives the normal equations

$$A^\top A x = A^\top b.$$

If $A$ has full column rank, then $A^\top A$ is SPD and

$$x^\star = (A^\top A)^{-1} A^\top b.$$

This formula is exact and important, but also numerically dangerous when $A$ is ill-conditioned because

$$\kappa(A^\top A) = \kappa(A)^2.$$

That squared condition number is the main reason serious numerical work often avoids solving least squares through normal equations directly.

6.3 QR-Based Least Squares

If $A = QR$ is the thin QR factorisation, with

  • $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns
  • $R \in \mathbb{R}^{n \times n}$ upper triangular

then

$$\|Ax - b\|_2 = \|QRx - b\|_2.$$

Because $Q$ preserves lengths on its column space, the least-squares problem reduces to

$$Rx = Q^\top b.$$

This is an upper triangular solve.

Advantages over normal equations:

  • avoids forming AAA^\top A
  • avoids squaring the condition number
  • numerically more stable

For practical least squares, QR is usually the safer default.
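
A small NumPy/SciPy sketch comparing the two routes on a made-up, well-conditioned problem, where they agree; the difference between them shows up when $\kappa(A)$ is large:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = rng.standard_normal(200)

# Normal equations: forms A^T A and therefore squares the condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Thin QR: reduces the problem to one upper triangular solve.
Q, R = np.linalg.qr(A)                  # Q: 200x5, R: 5x5 upper triangular
x_qr = solve_triangular(R, Q.T @ b)

print(np.allclose(x_ne, x_qr))                                   # True here
print(np.allclose(x_qr, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```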

6.4 Regularised Least Squares

When $A$ is nearly singular, rank-deficient, or noisy, add an $L^2$ penalty:

$$x_\lambda^\star = \arg\min_x \|Ax-b\|_2^2 + \lambda \|x\|_2^2.$$

This is Tikhonov regularisation or ridge regression.

The optimality condition becomes

$$(A^\top A + \lambda I)x = A^\top b.$$

This does two things at once:

  • shrinks the solution toward the origin
  • improves conditioning by shifting small eigenvalues upward

In ML, this is not a side topic. Weight decay is exactly this geometry applied to parameter fitting.

6.5 Weighted Least Squares

If some observations are more reliable than others, use a positive diagonal weight matrix $W$:

$$x^\star = \arg\min_x (Ax-b)^\top W (Ax-b).$$

This produces the weighted normal equations

$$A^\top W A x = A^\top W b.$$

If $W = \mathrm{diag}(w_1,\ldots,w_m)$, then each residual is penalised according to its importance.

This is natural when:

  • measurement errors have unequal variances
  • some samples are trusted more than others
  • robust procedures iteratively reweight points

6.6 Total Least Squares

Ordinary least squares assumes $A$ is exact and only $b$ is noisy. That is often unrealistic.

Total least squares allows perturbations to both:

$$(A + \delta A)x = b + \delta b.$$

The goal is to minimise the size of the joint perturbation

$$\|[\delta A \mid \delta b]\|_F.$$

The classical solution uses the SVD of the augmented data matrix $[A \mid b]$. The smallest singular direction identifies the smallest correction that makes the system exactly consistent.

This matters when the design matrix itself was measured or estimated with error, which is common in scientific inference and noisy representation learning.


7. Iterative Methods for Linear Systems

7.1 Why Iterative Methods?

Direct solvers are excellent when matrices are moderate in size and dense. But they do not scale indefinitely.

If $n = 10^6$, then:

  • storing a dense matrix would require on the order of $10^{12}$ entries
  • a dense direct factorisation would require on the order of $10^{18}$ arithmetic operations

That is not realistic.

Iterative methods change the question. Instead of computing the exact answer in finitely many elimination steps, they produce a sequence

$$x_0, x_1, x_2, \ldots$$

that ideally converges to the true solution.

The stopping rule is usually based on the residual:

$$r_k = b - A x_k.$$

When $\|r_k\|$ is small enough, stop.

The key computational benefit is that many iterative methods only need:

  • matrix-vector products
  • vector inner products
  • inexpensive preconditioner applications

That makes them ideal for large sparse systems where factorising $A$ would be too expensive.

7.2 Stationary Iterative Methods

A broad stationary-method template begins by splitting

$$A = M - N,$$

where $M$ is chosen so that solving $My = z$ is easy.

Then iterate:

$$M x_{k+1} = N x_k + b, \qquad x_{k+1} = M^{-1} N x_k + M^{-1} b.$$

Convergence depends on the spectral radius:

$$\rho(M^{-1}N) < 1.$$

Jacobi

Take $M = D$, the diagonal of $A$.

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(k)} \right).$$

All components use only old values, so Jacobi is naturally parallel.

Gauss-Seidel

Use the lower triangular part as the easy system. Updated values are used immediately:

$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)} \right).$$

This often converges faster than Jacobi but is less parallel-friendly.

SOR

Successive over-relaxation adds a parameter $\omega$:

$$x^{(k+1)} = (1-\omega)x^{(k)} + \omega x_{\text{GS}}^{(k+1)}.$$

When $\omega$ is well chosen, convergence can improve dramatically.

Stationary methods are conceptually important, but for large modern workloads Krylov methods usually dominate in practice.

7.3 Conjugate Gradient Method

For SPD systems, conjugate gradient (CG) is the central iterative method.

It solves

$$Ax = b$$

by minimising the quadratic

$$f(x) = \frac{1}{2}x^\top A x - b^\top x$$

over increasingly rich Krylov subspaces

$$\mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{k-1}r_0\}.$$

The algorithm is:

$$r_0 = b - A x_0, \qquad p_0 = r_0$$

and then for each step:

$$\begin{aligned}
\alpha_k &= \frac{r_k^\top r_k}{p_k^\top A p_k}, \\
x_{k+1} &= x_k + \alpha_k p_k, \\
r_{k+1} &= r_k - \alpha_k A p_k, \\
\beta_k &= \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \\
p_{k+1} &= r_{k+1} + \beta_k p_k.
\end{aligned}$$

Important facts:

  • in exact arithmetic, CG terminates in at most $n$ steps
  • in practice it is useful because it often converges much sooner
  • convergence improves when eigenvalues are clustered

The classical bound is

$$\|x_k - x^\star\|_A \le 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^k \|x_0 - x^\star\|_A,$$

where $\kappa = \kappa(A)$.

So conditioning directly controls iteration count.
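
A plain CG implementation is only a few lines. This NumPy sketch follows the update formulas above and checks the result against a direct solve on a made-up SPD system:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Plain CG for an SPD matrix A; needs only matrix-vector products with A."""
    n = b.shape[0]
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)        # SPD and reasonably well conditioned
b = rng.standard_normal(100)
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True
```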

7.4 Preconditioning

If conditioning controls convergence, the natural question is:

Can we transform the system into one with a smaller condition number?

That is the purpose of preconditioning.

Choose $M \approx A$ such that:

  • applying $M^{-1}$ is cheap
  • $M^{-1}A$ is better conditioned than $A$

Then solve the transformed system instead.

Common examples:

  • Jacobi preconditioner: $M = \mathrm{diag}(A)$
  • incomplete LU (ILU)
  • incomplete Cholesky for SPD systems
  • multigrid preconditioners for PDE-like problems

Preconditioning is often the difference between an iterative solver that is unusably slow and one that converges in a practical number of steps.

In AI terms, normalization layers and adaptive optimizers play a closely related geometric role: they reshape the effective conditioning of the optimization problem.

7.5 GMRES for Non-Symmetric Systems

Conjugate gradient requires SPD structure. For general non-symmetric systems, a standard choice is GMRES: Generalised Minimal Residual.

GMRES constructs an orthonormal basis for the Krylov space using the Arnoldi process and chooses the iterate whose residual norm is minimal over that subspace.

Conceptually:

  • build a basis for $\mathcal{K}_k(A, r_0)$
  • solve a small least-squares problem each iteration
  • obtain the best residual achievable in that subspace

Benefits:

  • applies to very general systems
  • robust in many settings

Costs:

  • memory grows with iteration count because the Krylov basis grows
  • restarted GMRES($m$) is often used to limit storage

GMRES is a good example of a recurring pattern in numerical linear algebra: turn a hard large system into a sequence of small projection problems.

7.6 Krylov Subspace Methods Summary

Method        Matrix type                   Cost per iteration            Memory    Typical use
Jacobi        broad but fragile             sparse matvec + vector ops    low       simple baseline, parallel smoothing
Gauss-Seidel  broad but sequential          sparse matvec-like updates    low       small systems, relaxation
CG            SPD                           matvec + a few dot products   low       kernel systems, normal equations, SPD problems
MINRES        symmetric indefinite          matvec + vector ops           low       saddle-like symmetric systems
GMRES         general non-singular          matvec + orthogonalisation    growing   non-symmetric large systems
LSQR          least squares / rectangular   bidiagonalisation-based       low       large sparse least squares

The main lesson is not to memorise solver names. It is to match system structure to algorithm structure.


8. Nonlinear Systems of Equations

8.1 Formulation

A nonlinear system has the form

$$F(x) = 0, \qquad F : \mathbb{R}^n \to \mathbb{R}^n,$$

or in components

$$\begin{aligned}
F_1(x_1,\ldots,x_n) &= 0, \\
F_2(x_1,\ldots,x_n) &= 0, \\
&\;\;\vdots \\
F_n(x_1,\ldots,x_n) &= 0.
\end{aligned}$$

Unlike the linear case, the solution set may be:

  • empty
  • a finite set of isolated points
  • a curve or surface of solutions
  • multiple disconnected branches

There is no complete analogue of the rank classification above. Structure still matters, but now local differential information becomes central.

8.2 Newton's Method for Nonlinear Systems

Newton's method linearises the system around the current iterate $x_k$:

$$F(x) \approx F(x_k) + J_F(x_k)(x-x_k),$$

where $J_F(x_k)$ is the Jacobian.

Set the linear approximation to zero:

$$F(x_k) + J_F(x_k)\Delta x_k \approx 0.$$

So the Newton step solves the linear system

$$J_F(x_k)\Delta x_k = -F(x_k),$$

and updates

$$x_{k+1} = x_k + \Delta x_k.$$

This is one of the great unifying ideas of applied mathematics:

nonlinear problem
   -> local linearisation
   -> solve linear system
   -> update
   -> repeat

Near a simple root, Newton convergence is quadratic, which is extremely fast. But it can fail badly if:

  • the initial guess is poor
  • the Jacobian is singular or nearly singular
  • the function is not smooth enough
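
A minimal NumPy sketch of this loop on a made-up two-equation system (a circle intersected with a hyperbola); the functions and starting point are illustrative choices:

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0: each step solves J(x_k) dx = -F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        dx = np.linalg.solve(J(x), -Fx)   # the linear system at the heart of the step
        x = x + dx
    return x

# Toy system: x^2 + y^2 = 4 and xy = 1.
F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0] * v[1] - 1.0])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [v[1], v[0]]])

root = newton_system(F, J, x0=[2.0, 0.5])
print(root, F(root))   # residual close to zero
```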

8.3 Quasi-Newton Methods

Newton's method is powerful but expensive because each step requires the full Jacobian and a system solve.

Quasi-Newton methods approximate the Jacobian (or the Hessian in optimization form) using update formulas instead of recomputing it exactly.

For root-finding, Broyden's method updates an approximate Jacobian with a rank-one correction chosen to satisfy the latest secant relation.

In optimization, the best-known analogues are BFGS and L-BFGS. They trade exact second-order information for lower cost while preserving much of Newton's geometric advantage.

The recurring theme is the same as elsewhere in AI systems engineering:

  • exact structure is too expensive
  • structured approximation is often the practical win

8.4 Fixed-Point Iteration

Sometimes we rewrite the system

$$F(x) = 0$$

as

$$x = G(x)$$

and iterate

$$x_{k+1} = G(x_k).$$

If $G$ is a contraction, Banach's fixed-point theorem guarantees convergence to a unique fixed point.

This sounds abstract, but the idea is everywhere:

  • value iteration in reinforcement learning
  • expectation-maximization style self-consistency updates
  • implicit layers and deep equilibrium models
  • iterative normalisation or message-passing updates

Fixed-point iteration is usually slower than Newton near a solution, but it may be simpler, cheaper, or more robust globally.
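
A minimal sketch, assuming a hand-picked contraction map $G$ (its Jacobian entries are bounded by $0.5$, so the iteration converges):

```python
import numpy as np

def fixed_point(G, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{k+1} = G(x_k) until the update is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = G(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# A contraction: partial derivatives are at most 0.5 in magnitude everywhere.
G = lambda v: np.array([0.5 * np.cos(v[1]), 0.5 * np.sin(v[0]) + 0.1])
z = fixed_point(G, x0=[0.0, 0.0])
print(z, np.linalg.norm(G(z) - z))   # fixed point, residual near zero
```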

8.5 Continuation and Homotopy Methods

What if the target nonlinear system is too hard to solve directly?

Continuation methods embed it in a family:

$$H(x,t) = 0, \qquad t \in [0,1],$$

such that:

  • $H(x,0)=0$ is easy
  • $H(x,1)=F(x)=0$ is the true problem

Then track the solution path as $t$ moves from $0$ to $1$.

This is especially powerful for:

  • polynomial systems with many roots
  • bifurcation analysis
  • problems where direct Newton iteration fails from naive initialisation

The conceptual point is valuable even if one never implements homotopy directly: difficult problems can often be solved by deforming from an easier one.

8.6 The Gradient Equations as a Nonlinear System

Training a model means minimising $L(\theta)$. The stationary condition is

$$\nabla_\theta L(\theta^\star) = 0.$$

That is a nonlinear system in the parameters.

Gradient descent is therefore not separate from system solving. It is one particular iterative scheme for approximately solving the gradient equations.

Newton's method for optimization goes one step further:

$$H(\theta_k)\Delta \theta_k = -\nabla L(\theta_k),$$

where $H$ is the Hessian.

So each Newton step is a linear-system solve inside a nonlinear-system solve. This nesting is fundamental to modern optimization theory.


9. Constrained Systems and Lagrangians

9.1 Equality-Constrained Systems

Suppose we want to minimise $f(x)$ subject to constraints

$$g(x) = 0, \qquad g : \mathbb{R}^n \to \mathbb{R}^m.$$

Introduce Lagrange multipliers $\lambda \in \mathbb{R}^m$ and define

$$\mathcal{L}(x,\lambda) = f(x) + \lambda^\top g(x).$$

The first-order conditions are

$$\nabla_x \mathcal{L}(x,\lambda) = 0, \qquad g(x) = 0.$$

This is itself a larger system of equations in the primal variables and the multipliers.

This is the central message:

optimization with constraints
   ->
system of equations in (x, lambda)

9.2 Saddle-Point Systems

Linearised KKT systems often take the block form

$$\begin{pmatrix} A & B^\top \\ B & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} f \\ g \end{pmatrix}.$$

These are saddle-point systems:

  • not positive definite
  • often indefinite
  • structured rather than arbitrary

They require specialised linear algebra.

A common strategy is block elimination using the Schur complement. If $A$ is invertible, eliminate $x$ first and solve a reduced system for $\lambda$.

From

$$Ax + B^\top \lambda = f,$$

we get

$$x = A^{-1}(f - B^\top \lambda).$$

Substituting into

$$Bx = g$$

gives

$$BA^{-1}(f - B^\top \lambda) = g,$$

so

$$(BA^{-1}B^\top)\lambda = BA^{-1}f - g.$$

The matrix

$$S = BA^{-1}B^\top$$

is the Schur complement. Once $\lambda$ is known, recover $x$ from the first block equation.

This block viewpoint is essential in large constrained problems because it turns one large structure into smaller pieces that can be interpreted and solved more intelligently.
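
A small NumPy sketch of this block elimination on a made-up saddle-point system, checked against solving the full block matrix directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # SPD (1,1) block
B = rng.standard_normal((m, n))
f = rng.standard_normal(n)
g = rng.standard_normal(m)

# Block elimination via the Schur complement S = B A^{-1} B^T.
A_inv_f = np.linalg.solve(A, f)
A_inv_Bt = np.linalg.solve(A, B.T)
S = B @ A_inv_Bt
lam = np.linalg.solve(S, B @ A_inv_f - g)
x = A_inv_f - A_inv_Bt @ lam           # x = A^{-1}(f - B^T lambda)

# Compare against the full saddle-point system.
K = np.block([[A, B.T], [B, np.zeros((m, m))]])
xy = np.linalg.solve(K, np.concatenate([f, g]))
print(np.allclose(np.concatenate([x, lam]), xy))   # True
```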

9.3 KKT Systems in AI

Constraint systems appear in many AI settings:

  • support vector machines through quadratic programming
  • KL-constrained policy optimization and RLHF-style objectives
  • simplex-constrained probability vectors
  • structured prediction with equality or inequality constraints

Even when an ML paper does not say "system of equations," KKT conditions are often doing the hidden work.

For example, the softmax distribution can be derived from an entropy-regularised constrained optimization problem with the simplex constraint

$$\sum_i p_i = 1.$$

The multiplier associated with this normalisation is exactly the kind of object introduced by the Lagrangian formalism.

9.4 Inequality Constraints and Linear Programming

If the constraints are inequalities,

$$Ax \le b, \qquad Cx = d,$$

then feasibility itself becomes a constraint system.

Linear programming studies

$$\min_x c^\top x \quad \text{subject to } Ax \le b, \ Cx = d.$$

The geometry changes from affine subspaces to convex polytopes, but the system perspective remains:

  • equations describe faces
  • inequalities carve out feasible regions
  • KKT conditions add complementary slackness

This viewpoint matters in AI for decoding constraints, sparse optimization, and relaxations of combinatorial prediction problems.


10. Systems Arising Directly in AI

10.1 Normal Equations in Linear Regression

Given data matrix $X \in \mathbb{R}^{n \times d}$ and target vector $y \in \mathbb{R}^n$, least squares solves

$$\min_w \|Xw - y\|_2^2.$$

The normal equations are

$$X^\top X w = X^\top y.$$

This is the canonical ML example of a system emerging from an optimization problem. Ridge regression modifies it to

$$(X^\top X + \lambda I)w = X^\top y,$$

which is both statistically and numerically better behaved when the feature matrix is close to rank deficient.

Linear probes on frozen embeddings are exactly this calculation.

10.2 Attention Normalisation as Constraint System

For one attention row, raw scores $s_j$ are transformed into weights

$$\alpha_j = \frac{e^{s_j}}{\sum_k e^{s_k}}.$$

The resulting vector $\alpha$ satisfies:

  • $\alpha_j \ge 0$
  • $\sum_j \alpha_j = 1$

So every attention row is a constrained weighting problem over the value vectors. Softmax is the smooth mechanism that enforces those constraints.

The final output is

$$o = \sum_j \alpha_j v_j,$$

which is a convex combination. System language clarifies what attention is doing: it solves for admissible weights under normalization constraints, then forms a structured linear combination.

10.3 Deep Equilibrium Models DEQ

A deep equilibrium model defines its hidden state implicitly through

$$z^\star = f_\theta(z^\star, x).$$

This is a nonlinear fixed-point system.

Forward pass:

  • solve for $z^\star$ iteratively

Backward pass:

  • use implicit differentiation, which again requires solving a linear system involving $(I - J_f^\top)$ or a related operator

More concretely, if

$$z^\star = f_\theta(z^\star, x),$$

then differentiating implicitly gives

$$\left(I - \frac{\partial f_\theta}{\partial z}(z^\star, x)\right)\frac{\partial z^\star}{\partial \theta} = \frac{\partial f_\theta}{\partial \theta}(z^\star, x).$$

So the backward pass avoids storing an explicit deep stack of activations, but only because it replaces that memory cost with a linear solve in the implicit Jacobian operator.

DEQs are a clean example of how system solving is not only an analysis tool but the actual definition of the model computation.

10.4 Gaussian Processes and Linear System Solves

In Gaussian process regression, the posterior mean involves

$$(K + \sigma^2 I)^{-1} y,$$

where $K$ is the kernel matrix.

Rather than explicitly invert the matrix, one solves

$$(K + \sigma^2 I)\alpha = y.$$

Because the matrix is SPD, Cholesky is the natural direct method. For larger problems, iterative methods such as CG with kernel-structure-aware approximations become important.

In practical Bayesian optimization and uncertainty-aware modelling, the system solve is often the dominant cost.

10.5 Newton's Method for Transformer Training

A full Newton step for a large model would solve

$$H \Delta \theta = -g,$$

with

  • $g = \nabla L(\theta)$
  • $H = \nabla^2 L(\theta)$

This is not feasible at LLM scale because:

  • $H$ is far too large to store
  • even matrix-vector access can be expensive

Modern second-order or quasi-second-order optimizers therefore approximate the system structure:

  • diagonal approximations such as Adam-like methods
  • Kronecker-factored approximations such as K-FAC
  • matrix preconditioners such as Shampoo and related variants

The common idea is always the same: do not ignore the linear system, approximate it intelligently.

For example:

  • K-FAC approximates layerwise curvature by Kronecker factors, turning one huge inverse into smaller matrix inverses or Cholesky solves
  • Shampoo maintains matrix preconditioners for row and column structure, effectively solving a block-shaped system in a better metric
  • recent methods such as SOAP stabilise these preconditioners further while keeping their cost tractable at scale

This is why second-order methods should be understood as linear-system design problems rather than as isolated optimizer tricks.

10.6 Physics-Informed Neural Networks PINNs

Physics-informed neural networks train a network $u_\theta$ so that it satisfies a differential equation system approximately.

Typical ingredients:

  • PDE residual equations
  • boundary conditions
  • initial conditions

The loss is built from these residuals, so training attempts to solve the underlying equation system through function approximation.

PINNs matter here because they make the analogy literal: neural networks are being used as approximate solvers for systems of equations coming from physics.


11. Overdetermined and Underdetermined Systems in Practice

11.1 The Underdetermined Case and Implicit Bias

In modern machine learning, underdetermined systems are not pathological. They are normal.

If a model has many more parameters than effectively independent constraints, then the stationary equations admit many solutions. The question is no longer:

Is there a solution?

It becomes:

Which solution does the algorithm choose?

For linear models trained from small or zero initialisation, gradient descent often converges to the minimum-norm interpolating solution. That is an example of implicit bias: the algorithm selects one point from a large affine family without that preference being written explicitly in the objective.

This matters for generalisation. In overparameterized learning, the geometry of the solution set is not enough; the algorithm's path through that set matters too.

11.2 Minimum Norm Solutions and Regularisation

For an underdetermined but consistent system with full row rank,

$$Ax = b, \qquad A \in \mathbb{R}^{m \times n}, \qquad m < n,$$

the minimum Euclidean norm solution is

$$x^\star = A^\top (AA^\top)^{-1} b = A^+ b.$$

This is the Moore-Penrose pseudo-inverse solution.

It is the special member of the solution family

$$x = x_p + x_h$$

that is orthogonal to the null space.

Other norms lead to other preferences:

  • $L^1$ minimisation promotes sparsity
  • nuclear norm minimisation promotes low rank
  • explicit penalties reshape which member of a feasible family is preferred

The practical lesson is simple: once exact uniqueness is gone, the norm or regulariser you choose becomes part of the definition of the problem.
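
A quick NumPy check of these identities on a made-up underdetermined system with full row rank:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))          # underdetermined; full row rank with probability 1
b = rng.standard_normal(3)

# Minimum Euclidean norm solution: x = A^T (A A^T)^{-1} b = A^+ b.
x_min = A.T @ np.linalg.solve(A @ A.T, b)
print(np.allclose(A @ x_min, b))                                  # consistent
print(np.allclose(x_min, np.linalg.pinv(A) @ b))                  # matches the pseudo-inverse
print(np.allclose(x_min, np.linalg.lstsq(A, b, rcond=None)[0]))   # lstsq returns the same point
```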

11.3 Structured Underdetermined Systems in Language Models

Language-model generation contains many structured underdetermined choices.

At each decoding step:

  • many token distributions satisfy the simplex constraint
  • many sequences satisfy a grammar or schema
  • many continuation paths have similar local likelihood

The system is not linear, but the principle is similar: there are many admissible solutions, and the algorithm plus constraints determine which one is selected.

Examples:

  • beam search chooses among multiple high-scoring structured candidates
  • constrained decoding solves a combinatorial feasibility problem over token sequences
  • watermarking schemes add extra statistical constraints to the output distribution

11.4 The System Perspective on Self-Attention

For one query position $i$, attention computes

$$O_i = \sum_j \alpha_{ij} V_j,$$

with

$$\alpha_{ij} \ge 0, \qquad \sum_j \alpha_{ij} = 1.$$

So the output lies in the convex hull of the value vectors. This is a constrained coefficient-selection problem:

  • unconstrained linear combination would allow arbitrary signed weights
  • attention allows only simplex-constrained weights

That restriction is part of why attention is stable and interpretable as "allocation of focus" rather than arbitrary recombination.


12. Numerical Methods and Stability

12.1 Floating-Point Arithmetic and Rounding Errors

Every real computation on a machine is finite-precision computation. Solving systems of equations therefore means solving them with rounding error.

Each arithmetic operation introduces a relative perturbation on the order of machine epsilon:

  • FP64: about $10^{-16}$
  • FP32: about $10^{-7}$
  • BF16: about $10^{-3}$ in relative mantissa precision

The actual solution error depends on both:

  • the stability of the algorithm
  • the conditioning of the problem

This distinction is foundational:

  • a stable algorithm can still produce a poor answer on an ill-conditioned problem
  • an unstable algorithm can fail even on a well-conditioned problem

12.2 Iterative Refinement

Suppose we computed an approximate solution $\tilde{x}$. We can form the residual

$$r = b - A \tilde{x}$$

and solve a correction system

$$Ad = r.$$

Then update

$$\tilde{x} \leftarrow \tilde{x} + d.$$

This is iterative refinement.

Why it works:

  • the factorisation of A can be reused
  • the residual often reveals the remaining error accurately
  • low-precision factorisation plus higher-precision residual computation can recover surprisingly high accuracy

In mixed precision form, a typical pattern is:

  1. factorise or approximately solve in low precision
  2. compute the residual in higher precision
  3. solve the correction equation using the stored factorisation
  4. update the solution and repeat

This is mathematically classical and computationally modern at the same time.

Modern mixed-precision solvers rely heavily on this principle. The same mentality appears in mixed-precision training: cheap low-precision main computation, with strategically preserved higher-precision corrective structure.
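A minimal sketch of the mixed-precision pattern, assuming SciPy for the LU factorisation and using an illustrative random system: factorise once in FP32, compute residuals in FP64, and reuse the low-precision factorisation for each correction.

```python
# Minimal sketch of mixed-precision iterative refinement: factorise once in
# FP32, compute residuals in FP64, reuse the FP32 factorisation for each
# correction solve. Problem size and iteration count are illustrative.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A.astype(np.float32))        # cheap low-precision factorisation
x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

for _ in range(5):
    r = b - A @ x                                 # residual computed in FP64
    d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
    x = x + d                                     # corrected solution

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # near FP64-level residual
```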

12.3 Pivoting Strategies and Numerical Stability

Pivoting is not merely a convenience; it is a stability mechanism.

If a matrix is diagonally dominant or structurally well behaved, elimination may be stable even without pivoting. But in general, partial pivoting is the practical baseline because it controls growth in the elimination process.

The high-level rule is:

algebraic solvability ≠ numerical reliability

This distinction explains why numerically aware implementations almost never use the classroom version of Gaussian elimination verbatim.
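The classic two-by-two example makes the point concrete. The naive elimination below is the classroom algorithm with a tiny pivot left in place; np.linalg.solve, which applies partial pivoting, is unaffected.

```python
# Minimal sketch: elimination without pivoting fails on a tiny pivot even
# though the system itself is well conditioned; a pivoting solver is fine.
import numpy as np

eps = 1e-20
A = np.array([[eps, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0])
# The exact solution is approximately x = (1, 1).

# Naive elimination using the tiny pivot:
m = A[1, 0] / A[0, 0]              # enormous multiplier ~1e20
u22 = A[1, 1] - m * A[0, 1]        # catastrophic loss of the "1"
y2 = b[1] - m * b[0]
x2 = y2 / u22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print(x1, x2)                      # x1 is badly wrong (typically 0.0)

# Partial pivoting, as used by library solvers, recovers the answer:
print(np.linalg.solve(A, b))       # ~[1, 1]
```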

12.4 Preconditioning for Ill-Conditioned Systems

Ill-conditioning hurts twice:

  • it amplifies perturbations in direct solving
  • it slows iterative convergence

Preconditioning attacks the geometry. Instead of accepting the original coordinate system of the problem, it rescales or reshapes the system into one that behaves better numerically.

Left preconditioning:

M^{-1} A x = M^{-1} b.

Right preconditioning:

A M y = b, \qquad x = M y.

Split preconditioning applies transforms on both sides.

The point is never to find a perfect inverse. It is to find a cheap approximation that dramatically improves the effective spectrum.
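A minimal sketch of the idea with a Jacobi (diagonal) preconditioner on an illustrative badly scaled SPD matrix: the preconditioned system has a far smaller condition number even though M is trivial to apply.

```python
# Minimal sketch: diagonal (Jacobi) preconditioning of a badly scaled SPD
# matrix. The scaling pattern is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(rng.uniform(1.0, 2.0, n)) @ Q.T   # well-conditioned SPD core
s = 10.0 ** rng.uniform(-3, 3, n)                 # wildly different variable scales
A = np.diag(s) @ A @ np.diag(s)                   # badly scaled SPD matrix

M_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(A)))   # Jacobi preconditioner M = diag(A)
A_prec = M_inv_sqrt @ A @ M_inv_sqrt              # split-preconditioned system

print(f"kappa(A)                 = {np.linalg.cond(A):.2e}")
print(f"kappa of preconditioned  = {np.linalg.cond(A_prec):.2e}")
```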

12.5 Condition Number and Practical Impact

Condition number translates geometry into operational risk.

If \kappa(A) is large, then:

  • small perturbations in data can produce large perturbations in the solution
  • low-precision arithmetic becomes dangerous
  • iterative methods need more steps
  • regularization becomes more attractive

In AI this explains many engineering interventions:

  • damping Hessian systems in second-order methods
  • adding ridge terms to kernel matrices
  • scaling attention scores by \sqrt{d_k}
  • normalizing activations and gradients

All of these can be interpreted as attempts to avoid badly conditioned computational subproblems.
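A small illustration of the amplification effect, using a nearly singular two-by-two matrix chosen for the purpose: a tiny relative change in b produces a relative change in x roughly \kappa(A) times larger.

```python
# Minimal sketch: the relative error in x can be up to kappa(A) times the
# relative perturbation in b. The near-singular matrix is illustrative.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])
x = np.linalg.solve(A, b)                  # ~[1, 1]

db = np.array([0.0, 1e-4])                 # tiny change in the data
x_pert = np.linalg.solve(A, b + db)

rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(f"kappa(A)     = {np.linalg.cond(A):.1e}")           # ~4e4
print(f"input change = {rel_in:.1e}, output change = {rel_out:.1e}")
```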


13. Systems of Equations in Optimisation

13.1 Gradient Descent as Approximate System Solving

Optimization can be read as equation solving:

\nabla L(\theta) = 0.

Gradient descent iterates

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t).

A fixed point of this update satisfies

\theta_{t+1} = \theta_t \quad \Longleftrightarrow \quad \nabla L(\theta_t) = 0.

So gradient descent is one way of seeking a root of the gradient field.

For quadratic losses, this connection becomes explicit: gradient descent is an iterative solver for the normal equations.
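A minimal sketch of that equivalence on illustrative random data: plain gradient descent on the quadratic loss converges to the same point as a direct solve of the normal equations.

```python
# Minimal sketch: gradient descent on a least-squares loss converges to the
# solution of the normal equations. Data and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)

theta_normal = np.linalg.solve(A.T @ A, A.T @ b)   # direct solve of A^T A theta = A^T b

theta = np.zeros(5)
eta = 1.0 / np.linalg.norm(A, 2) ** 2              # step size below 2 / lambda_max(A^T A)
for _ in range(5000):
    theta -= eta * A.T @ (A @ theta - b)           # each gradient step is one solver iteration

print(np.linalg.norm(theta - theta_normal))        # ~0
print(np.linalg.norm(A.T @ (A @ theta - b)))       # gradient ~0 at the fixed point
```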

13.2 Newton's Method and the Hessian System

Newton's method in optimization solves

H(\theta)\Delta \theta = -\nabla L(\theta)

and updates

\theta \leftarrow \theta + \Delta \theta.

Near a well-behaved optimum, this gives quadratic convergence. But for large neural networks it is limited by the same obstacle as every other second-order method: the Hessian system is enormous.

That is why exact Newton is rare at scale, while approximate curvature methods are common.
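A minimal sketch of Newton's method on a tiny L2-regularised logistic regression with synthetic data; each iteration is exactly one solve of the Hessian system. The data, labels, and regulariser are illustrative.

```python
# Minimal sketch: each Newton step for a small regularised logistic regression
# is one linear solve with the Hessian. Data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(40) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
lam = 1e-2

w = np.zeros(3)
for _ in range(10):
    p = 1.0 / (1.0 + np.exp(-X @ w))                          # predicted probabilities
    grad = X.T @ (p - y) + lam * w                            # gradient of the loss
    H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(3)  # Hessian of the loss
    step = np.linalg.solve(H, grad)                           # the Hessian system H d = grad
    w = w - step                                              # Newton update

print(w)
print(np.linalg.norm(X.T @ (1 / (1 + np.exp(-X @ w)) - y) + lam * w))  # gradient ~0
```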

13.3 The Role of Linear System Solves in Second-Order Methods

Second-order methods differ mainly in how they approximate or factor the curvature system.

K-FAC

  • approximates curvature blockwise with Kronecker structure
  • converts one huge system into many smaller structured solves
  • for a layer gradient G, the preconditioned update has the schematic form (a small sketch appears at the end of this subsection)
\Delta W \approx G_{\text{left}}^{-1} \, G \, G_{\text{right}}^{-1}

where each side is estimated from activations or gradient covariances

Shampoo

  • keeps matrix-valued preconditioners per parameter block
  • uses matrix roots and inverse roots to precondition updates
  • replaces scalar learning-rate rescaling with geometry-aware matrix rescaling

SOAP and related recent preconditioners

  • push matrix preconditioning further while retaining practical scalability
  • aim to capture more curvature structure than diagonal methods without paying full Hessian cost

These methods are not mysterious if you view them through the system lens. Each is a different compromise between:

  • fidelity to the true Hessian system
  • memory cost
  • solve cost
  • numerical stability
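As referenced in the K-FAC item above, here is a minimal sketch of a Kronecker-factored preconditioned update. The covariance factors are built from random stand-ins for activations and output gradients, purely to show the shapes and the two small solves that replace one enormous curvature solve.

```python
# Minimal sketch of a K-FAC-style Kronecker-factored preconditioned update
# Delta W ~ G_left^{-1} G G_right^{-1}. The covariance factors use random
# stand-ins for activations and gradients; sizes and damping are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 64, 32, 256
G = rng.standard_normal((d_out, d_in))               # gradient for one weight matrix

acts = rng.standard_normal((batch, d_in))            # stand-in layer inputs
grads = rng.standard_normal((batch, d_out))          # stand-in output gradients
damping = 1e-3

A_cov = acts.T @ acts / batch + damping * np.eye(d_in)     # input-side factor
S_cov = grads.T @ grads / batch + damping * np.eye(d_out)  # output-side factor

# Two small solves replace one (d_in * d_out) x (d_in * d_out) curvature solve,
# and no explicit inverse is formed.
right = np.linalg.solve(A_cov, G.T).T                # G A_cov^{-1}
delta_W = np.linalg.solve(S_cov, right)              # S_cov^{-1} (G A_cov^{-1})
print(delta_W.shape)                                 # (32, 64), same shape as G
```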

13.4 Convergence Analysis via Linear Systems

For the quadratic loss

L(\theta) = \frac{1}{2}\|A\theta - b\|_2^2,

we have

\nabla L(\theta) = A^\top(A\theta - b),

so the stationary condition is

A^\top A \theta = A^\top b.

Thus the local convergence of first-order methods is governed by the spectrum of A^\top A.

This is why eigenvalues, singular values, and condition numbers are not separate subjects from optimization. They govern the difficulty of the linear systems sitting underneath it.


14. Common Mistakes

| Mistake | Why it is wrong | Fix |
| --- | --- | --- |
| "If m = n, the system must have a unique solution." | Square only means the matrix is square. It does not imply full rank. Singular square systems may have no solution or infinitely many. | Check rank or determinant; uniqueness requires full rank. |
| "Gaussian elimination is only for square systems." | Row reduction works for any m \times n matrix and is one of the best tools for understanding rectangular systems. | Use elimination on any augmented matrix to reveal rank and solution structure. |
| "Homogeneous systems might have no solution." | x = 0 always solves Ax = 0. | The real question is whether non-zero solutions exist. |
| "Least squares solves the original system exactly." | Least squares solves the closest projection problem, not the exact system unless b \in \mathrm{col}(A). | Always inspect the residual. |
| "Normal equations are always fine because the formula is simple." | Forming A^\top A squares the condition number and can destroy accuracy. | Prefer QR or SVD for harder least-squares problems. |
| "A small residual means a highly accurate solution." | For ill-conditioned systems, the residual can be small while the actual solution error is large. | Consider conditioning as well as residual size. |
| "Free variables can be ignored after finding one solution." | Free variables encode the entire family of exact solutions. | Write the full affine solution: particular plus null-space basis. |
| "Iterative methods always converge if you run them long enough." | Convergence depends on matrix structure and the chosen method. Some iterations diverge. | Check assumptions such as SPD, diagonal dominance, or spectral-radius conditions. |
| "The inverse is the natural way to solve every linear system." | Explicit inversion is usually slower and less stable than factorisation-based solving. | Solve Ax = b directly using LU, QR, or Cholesky as appropriate. |
| "Optimization is separate from systems of equations." | Stationarity, KKT conditions, Newton steps, and fixed points are all equation systems. | Reframe optimization algorithms through their associated systems. |

15. Exercises

  1. Row reduction and solution types: for each augmented matrix, reduce to RREF, classify the system, and write the full solution set: (a) \begin{bmatrix} 1 & 2 & -1 & \vline & 4 \\ 2 & 4 & -2 & \vline & 8 \\ 3 & 6 & -3 & \vline & 12 \end{bmatrix} (b) \begin{bmatrix} 1 & 0 & 2 & \vline & 3 \\ 0 & 1 & -1 & \vline & 2 \\ 1 & 1 & 1 & \vline & 6 \end{bmatrix} (c) \begin{bmatrix} 1 & 2 & \vline & 3 \\ 2 & 4 & \vline & 7 \\ 0 & 1 & \vline & 1 \end{bmatrix} (d) \begin{bmatrix} 1 & 1 & 1 & \vline & 6 \\ 0 & 1 & 2 & \vline & 7 \\ 1 & 0 & -1 & \vline & 1 \end{bmatrix}

  2. Particular solution and null space: for

A = \begin{pmatrix} 1 & 2 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 2 & 1 & 2 \end{pmatrix},

(a) find the RREF, (b) identify pivot and free variables, (c) find a particular solution to Ax = (1,1,2)^\top, (d) find a basis for \mathrm{null}(A), and (e) verify rank-nullity.

  3. Least squares: with
X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \qquad y = \begin{pmatrix} 2 \\ 3 \\ 3.5 \\ 5 \end{pmatrix},

derive the normal equations, solve for w^\star, compute the residual, verify orthogonality to the column space, and then compare with ridge regularisation using \lambda = 0.5.

  4. Conditioning: for
A = \begin{pmatrix} 1000 & 999 \\ 999 & 998 \end{pmatrix},

compute the determinant, solve Ax = b, perturb b slightly, and quantify how much the solution changes. Interpret the result using the condition number.

  5. Newton's method for a nonlinear system: solve
F(x,y) = (x^2 + y^2 - 1,\ x - y^2) = 0

by deriving the Jacobian and performing two Newton steps from (1,1).

  6. Iterative methods: for
A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix},

solve using Cholesky, then compare Jacobi, Gauss-Seidel, and two steps of CG.

  7. Constrained system: minimise f(x,y) = x^2 + 2y^2 subject to x + y = 3. Write the Lagrangian, derive the KKT system, solve it, and then repeat with the extra constraint x - y = 1.

  8. AI application - attention as a constrained coefficient system: with

Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \quad K = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix},

compute the score matrix, the softmax weights, and the output matrix. Interpret each attention row as a simplex-constrained weighting problem.


16. Why This Matters for AI (2026 Perspective)

| Aspect | Impact |
| --- | --- |
| Linear regression and probing | Fitting linear probes, calibration heads, and readout layers reduces to exact or regularised linear systems. |
| Newton and second-order optimisation | K-FAC, Shampoo, SOAP, and related optimizers are structured approximations to very large curvature-system solves. |
| Overparameterisation | Modern deep networks often operate in underdetermined regimes, so the geometry of the solution set and the implicit bias of the solver matter. |
| Deep equilibrium models | Forward pass solves a nonlinear fixed-point system; backward pass solves an implicit linear system. |
| Gaussian processes and Bayesian inference | Kernel methods are fundamentally system-solving pipelines, usually with SPD matrices. |
| Constrained decoding and RLHF | Normalization, KL constraints, and structured generation can all be expressed as coupled constraint systems. |
| Numerical stability | Condition numbers explain why damping, normalization, scaling, and mixed precision are operational necessities rather than cosmetic choices. |
| PINNs and scientific AI | Solving PDE systems with neural surrogates makes equation-solving central to modern AI-for-science workflows. |
| Sparse and efficient attention | Approximate attention mechanisms preserve or relax the original attention system structure to gain scale. |
| Interpretability | Many attribution and decomposition techniques boil down to solving linear systems for component contributions. |

17. Conceptual Bridge

Systems of equations are where abstract algebra, geometry, and computation become inseparable.

  • vector spaces told us what the objects are
  • matrix operations told us how those objects are transformed
  • systems of equations tell us how to recover unknowns from constraints

The linear case gives the complete reference model:

  • rank tells us how many independent constraints exist
  • null spaces tell us what freedom remains
  • elimination and factorisation tell us how to solve
  • least squares tells us what to do when exact compatibility fails

The nonlinear case adds the genuinely hard phenomena:

  • multiple solutions
  • bifurcations
  • local linearisation
  • fixed points
  • constrained stationary conditions

This is exactly the bridge needed for the next major topic. Once a system has been written as Ax = b, the next question is often:

What does the structure of A tell us about the solve?

That is the spectral question, and it leads naturally to eigenvalues, eigenvectors, and decompositions.

Vectors and Spaces      -> geometry
Matrix Operations       -> computation
Systems of Equations    -> solving
Eigenvalues / SVD       -> structure
Probability             -> uncertainty
Optimisation            -> dynamics
LLM Mathematics         -> all of them together

References