"A proof is a finite sequence of logical steps that establishes a truth beyond any possible doubt. Unlike experiments, which can always be overturned by a new observation, a mathematical proof is eternal - valid for all cases, for all time."
Overview
A proof technique is a general strategy for establishing that a mathematical statement is true - not probably true, not true in every case we checked, but necessarily true as a consequence of axioms, definitions, and previously established results. Proof techniques are the algorithms of mathematical reasoning: given a goal, which technique do you apply? Given a structure, which argument do you use?
For AI practitioners, proof techniques are not optional academic exercises. Every convergence theorem for gradient descent, every generalisation bound in PAC learning, every correctness argument for attention mechanisms, every hardness result in complexity theory - all rest on specific proof strategies. Without fluency in these strategies, theoretical machine learning papers are inaccessible, and the foundations of why models work (or fail) remain opaque.
This chapter covers the complete landscape of proof techniques, from elementary direct proofs through probabilistic existence arguments and analytic convergence proofs, with emphasis on the patterns that recur throughout ML theory.
Prerequisites
- Propositional and predicate logic (from Sets and Logic module)
- Basic set operations (union, intersection, complement)
- Familiarity with functions, injectivity, surjectivity
- Elementary number theory concepts (even, odd, prime, divisibility)
- Basic probability (for Sections 10, 13)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive demonstrations: proof verification, induction chains, probabilistic arguments |
| exercises.ipynb | Practice problems: direct proofs, contradiction, induction, counting, ML proof patterns |
Learning Objectives
After completing this section, you will:
- Identify and apply the correct proof technique for any mathematical statement
- Write rigorous direct proofs, proofs by contradiction, and proofs by contrapositive
- Master mathematical induction (ordinary, strong, and structural)
- Understand and apply the probabilistic method for existence proofs
- Use counting arguments, pigeonhole principle, and double counting
- Write \epsilon-\delta proofs for limits, continuity, and convergence
- Recognise and apply standard ML proof patterns: union bound, concentration inequalities, PAC learning, reduction proofs
- Identify common proof errors and evaluate the validity of AI-generated proofs
- Read and understand proofs in theoretical ML papers
Table of Contents
- Proof Techniques
- Overview
- Prerequisites
- Companion Notebooks
- Learning Objectives
- Table of Contents
- 1. Intuition
- 2. Direct Proof
- 3. Proof by Construction
- 4. Proof by Contrapositive
- 5. Proof by Contradiction
- 6. Proof by Cases
- 7. Mathematical Induction
- 8. Strong Induction
- 9. Structural Induction
- 10. The Probabilistic Method
- 11. Counting Arguments
- 12. Epsilon-Delta and Analytic Arguments
- 13. Proof Patterns in ML Theory
- 14. Common Mistakes in Proofs
- 15. Exercises
- 16. Why This Matters for AI
- 17. Conceptual Bridge
1. Intuition
1.1 What Are Proof Techniques?
A proof technique is a general strategy for establishing that a mathematical statement is true beyond any possible doubt. Unlike empirical evidence (which can always be overturned by a new counterexample) or intuition (which is frequently wrong), a mathematical proof is an absolute guarantee - valid for all cases, for all time.
Proof techniques are the algorithms of mathematical reasoning: given a goal, which technique do you apply? Given a structure, which argument do you use? Mastering proof techniques means mastering the ability to move from "I believe this is true" to "I can prove this is true and explain exactly why."
For AI: proofs establish correctness of algorithms, validity of bounds, convergence of optimisers, and soundness of theoretical guarantees. Without proof techniques, theoretical ML is inaccessible.
1.2 Why Proof Techniques Matter for AI
| AI Domain | Proof Technique Required | Example |
|---|---|---|
| Convergence proofs | Telescoping, induction, \epsilon-\delta | Proving gradient descent converges to a stationary point |
| Generalisation bounds | Probabilistic method, union bound | PAC learning, VC dimension, Rademacher complexity |
| Algorithm correctness | Direct proof, induction | Proving BPE terminates; proving softmax is well-defined |
| Complexity theory | Reduction proofs, contradiction | Showing neural network verification is NP-hard |
| LLM reasoning evaluation | All techniques | Evaluating whether an LLM's "proof" is logically valid |
| Formal verification | All techniques | Lean, Coq, Isabelle verify AI system properties |
1.3 The Landscape of Proof Techniques
+==============================================================+
| PROOF TECHNIQUES |
+==============================================================+
| |
| Direct Methods |
| +-- Direct proof (assume P; derive Q) |
| +-- Proof by construction (exhibit the object) |
| +-- Proof by exhaustion (check all cases) |
| |
| Indirect Methods |
| +-- Contrapositive (prove \negQ -> \negP instead of P -> Q) |
| +-- Contradiction (assume \negP; derive \perp) |
| +-- Proof by cases (partition; prove each) |
| |
| Inductive Methods |
| +-- Mathematical induction (base + step) |
| +-- Strong induction (use all previous cases) |
| +-- Structural induction (induct on recursive structure) |
| |
| Probabilistic Methods |
| +-- Probabilistic method (show Pr[property] > 0) |
| +-- Expectation argument (E[X] implies existence) |
| +-- Lovasz Local Lemma (avoid rare bad events) |
| |
| Algebraic Methods |
| +-- Counting arguments (double counting; bijection) |
| +-- Pigeonhole principle (n+1 items in n bins) |
| +-- Generating functions (encode combinatorics) |
| |
| Analytic Methods |
| +-- \epsilon-\delta arguments (limits; continuity; convergence) |
| +-- Compactness arguments (Heine-Borel; sequential) |
| +-- Fixed point arguments (Banach; Brouwer; Kakutani) |
| |
+==============================================================+
1.4 The Structure of a Mathematical Statement
Every theorem has the form: "If [hypotheses], then [conclusion]."
Formally:
- Hypotheses: conditions assumed to hold; the "given" part
- Conclusion: what must be shown to follow; the "prove" part
A proof is a finite sequence of logically valid steps from hypotheses to conclusion. Each step follows from axioms, definitions, previously proved results, or hypotheses.
Key skill: identifying what you are allowed to assume and what you must derive.
+===================================================+
| ANATOMY OF A THEOREM |
+===================================================+
| |
| "If n is even, then n^2 is even" |
| --------- ---------- |
| hypothesis conclusion |
| (assume this) (derive this) |
| |
| Proof = chain of valid steps: |
| |
| Assume n is even |
| -> n = 2k for some integer k (definition) |
| -> n^2 = (2k)^2 = 4k^2 (algebra) |
| -> n^2 = 2(2k^2) (factor) |
| -> n^2 is even (definition) |
| |
| Each arrow: justified by a named rule. |
| |
+===================================================+
1.5 Reading and Writing Proofs
Reading a proof: identify hypotheses -> identify conclusion -> trace each step -> check each logical inference -> verify completeness.
Writing a proof: state what you are proving -> state your strategy -> execute strategy -> end with explicit conclusion.
Common proof structure markers:
| Marker | Purpose |
|---|---|
| "Assume..." / "Suppose..." | Introduce hypothesis or assumption |
| "Let..." | Introduce a variable or object |
| "Since..." / "Because..." / "By..." | Justify a step |
| "Therefore..." / "Thus..." / "Hence..." | Conclude a step |
| "We have shown..." / "This completes the proof" | Explicit closure |
| "QED" / "[]" / "QED" | End of proof |
| "Case 1: ... Case 2: ..." | Proof by cases |
| "Base case: ... Inductive step: ..." | Induction |
1.6 Historical Timeline
| Date | Milestone | Significance |
|---|---|---|
| ~300 BCE | Euclid's Elements | First systematic proofs; 467 propositions from 5 axioms |
| ~250 BCE | Archimedes | Method of exhaustion; proto-integration |
| 9th-13th c. | Arab mathematicians | Algebraic proofs; al-Khwarizmi |
| 1821 | Cauchy | Rigorous \epsilon-\delta definitions of limits |
| 1860s | Weierstrass | Formal epsilon-delta proofs |
| 1874 | Cantor | First uncountability proof; $\mathbb{R}$ is uncountable (the diagonal argument followed in 1891) |
| 1900 | Hilbert | Programme to formalise all mathematics |
| 1901 | Russell | Paradox via self-reference |
| 1931 | Godel | Incompleteness; limits of formal systems |
| 1976 | Appel & Haken | Four-colour theorem; computer-assisted proof |
| 1995 | Wiles | Fermat's Last Theorem; 129-page proof |
| 2000s- | Coq, Lean, Isabelle | Machine-verified proofs |
| 2024 | AlphaProof (DeepMind) | LLM + Lean; silver medal at IMO 2024 |
2. Direct Proof
2.1 The Strategy
To prove $P \Rightarrow Q$:
- Assume $P$ is true
- Through a sequence of logical steps - each justified by definitions, axioms, or previously proved results - derive $Q$
- Conclude that $Q$ follows from $P$
This is the most natural proof technique. Try it first for any implication. It works best when the hypothesis gives you something concrete to work with and the conclusion has a clear definition to aim for.
2.2 Template
+==============================================+
| DIRECT PROOF TEMPLATE |
+==============================================+
| |
| Proof: |
| Assume [P]. |
| [Step 1: apply definition/theorem] |
| [Step 2: algebraic or logical step] |
| ... |
| [Final step: arrive at Q] |
| Therefore [Q]. [] |
| |
+==============================================+
2.3 Worked Example - Even Times Even Is Even
Theorem. If $m$ and $n$ are both even integers, then $mn$ is even.
Proof.
Assume $m$ and $n$ are both even.
By definition of even: $m = 2a$ and $n = 2b$ for some integers $a, b$.
Then: $mn = (2a)(2b) = 4ab = 2(2ab)$.
Since $2ab \in \mathbb{Z}$, $mn$ is of the form $2k$ with $k \in \mathbb{Z}$.
Therefore $mn$ is even. $\square$
Dissection: Every step names its justification - definition of even, integer closure under multiplication, definition of even again. No hand-waving.
2.4 Worked Example - If n^2 Is Even Then n Is Even (Direct Proof Fails)
Theorem. If $n^2$ is even, then $n$ is even.
Attempted direct proof: Assume $n^2$ is even. Then $n^2 = 2k$ for some $k \in \mathbb{Z}$. So $n = \sqrt{2k}$... stuck. The square root of an even number is not obviously an integer, let alone obviously even.
Lesson: When direct proof stalls, switch technique. This theorem is proved cleanly by contrapositive (Section 4).
2.5 Worked Example - Sum of Continuous Functions Is Continuous
Theorem. If $f$ and $g$ are continuous at $a$, then $f + g$ is continuous at $a$.
Proof.
Assume $f$ and $g$ are continuous at $a$. Let $\epsilon > 0$.
We need to find $\delta > 0$ such that $|x - a| < \delta$ implies $|(f + g)(x) - (f + g)(a)| < \epsilon$.
By continuity of $f$: there exists $\delta_1 > 0$ such that $|x - a| < \delta_1$ implies $|f(x) - f(a)| < \epsilon/2$.
By continuity of $g$: there exists $\delta_2 > 0$ such that $|x - a| < \delta_2$ implies $|g(x) - g(a)| < \epsilon/2$.
Let $\delta = \min(\delta_1, \delta_2)$.
Then $|x - a| < \delta$ implies: $|(f + g)(x) - (f + g)(a)| \leq |f(x) - f(a)| + |g(x) - g(a)| < \epsilon/2 + \epsilon/2 = \epsilon$.
Therefore $f + g$ is continuous at $a$. $\square$
Structure: Uses both hypotheses; constructs $\delta$ from $\delta_1$ and $\delta_2$; the $\epsilon/2$ trick ensures the sum stays below $\epsilon$ after applying the triangle inequality.
2.6 Direct Proof in AI Contexts
Theorem. The output of a self-attention head lies in the convex hull of the value vectors.
Proof.
Let $\alpha_1, \dots, \alpha_n$ be the attention weights and $v_1, \dots, v_n$ the value vectors.
- $\alpha_i \geq 0$ for all $i$ (softmax outputs are non-negative)
- $\sum_{i=1}^{n} \alpha_i = 1$ (softmax outputs sum to 1)
- $\text{output} = \sum_{i=1}^{n} \alpha_i v_i$ (output is a weighted sum)
By definition of convex hull: a point is in $\mathrm{conv}(v_1, \dots, v_n)$ if and only if it is a convex combination (non-negative weights summing to 1) of the $v_i$.
Therefore $\text{output} \in \mathrm{conv}(v_1, \dots, v_n)$. $\square$
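A quick numerical illustration of the convex-combination property (a sketch with assumed toy shapes, in the spirit of the companion notebooks; it checks the claim on random data rather than proving it):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=5)            # attention logits for 5 positions (assumed toy sizes)
V = rng.normal(size=(5, 3))            # 5 value vectors in R^3

alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
output = alpha @ V                              # attention output

print(np.all(alpha >= 0), np.isclose(alpha.sum(), 1.0))   # True True
print(output)   # a convex combination of the rows of V, hence in their convex hull
```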
Theorem. Cross-entropy loss $L(z) = -z_y + \log \sum_j e^{z_j}$ is convex in the logits $z$.
Proof.
- $-z_y$ is linear in $z$, hence convex
- $\log \sum_j e^{z_j}$ is the log-sum-exp function, which is convex (standard result: log-sum-exp is convex, and precomposition with an affine map preserves convexity)
- Sum of convex functions is convex
Therefore $L(z)$ is convex. $\square$
3. Proof by Construction
3.1 The Strategy
To prove $\exists x\, P(x)$: exhibit a specific $x$ and verify that $P(x)$ holds.
A constructive proof provides an algorithm - the proof itself tells you how to find the object. This stands in contrast to non-constructive existence proofs (Section 5), which prove something exists without showing you how to find it.
Constructive proofs are strictly more informative: they tell you not just that something exists, but what it is.
3.2 Template
+==============================================+
| CONSTRUCTIVE PROOF TEMPLATE |
+==============================================+
| |
| Proof: |
| [Define or describe object x explicitly] |
| We claim x satisfies [P]. |
| [Verify each required property of x] |
| Therefore \existsx satisfying P. [] |
| |
+==============================================+
3.3 Worked Example - Prime Between n and 2n
Theorem (Bertrand's Postulate). For every integer $n \geq 1$, there exists a prime $p$ with $n < p \leq 2n$.
This is proved by careful counting (Chebyshev's and later Erdos's proof): one bounds the central binomial coefficient $\binom{2n}{n}$ from above and below, showing that a prime in $(n, 2n]$ must exist. The argument establishes existence by showing that the alternative (no prime in the range) leads to a bound violation; it is constructive only in the weak sense that a prime can then be found by finite search, even though the proof never names one.
3.4 Worked Example - Constructing a Bijection
Theorem. $|\mathbb{Z}| = |\mathbb{N}|$ (the integers and natural numbers have the same cardinality).
Proof (by construction). Exhibit an explicit bijection $f: \mathbb{N} \to \mathbb{Z}$: $f(n) = n/2$ if $n$ is even, $f(n) = -(n+1)/2$ if $n$ is odd.
This maps: $0 \mapsto 0,\ 1 \mapsto -1,\ 2 \mapsto 1,\ 3 \mapsto -2,\ 4 \mapsto 2,\ \dots$
Verify injective: Different natural numbers map to different integers (each integer appears exactly once in the sequence above).
Verify surjective: Every integer is hit:
- $0 = f(0)$ OK
- positive $k = f(2k)$ OK
- negative $-k = f(2k - 1)$ OK
Therefore $f$ is a bijection; $|\mathbb{Z}| = |\mathbb{N}|$. $\square$
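A minimal sketch of this bijection in code (assuming the convention $0 \mapsto 0,\ 1 \mapsto -1,\ 2 \mapsto 1, \dots$ used above), with a finite-range check of injectivity and surjectivity:

```python
def f(n: int) -> int:
    """The bijection N -> Z from the proof: evens go to k, odds to -k."""
    return n // 2 if n % 2 == 0 else -(n + 1) // 2

values = [f(n) for n in range(11)]
print(values)                              # [0, -1, 1, -2, 2, -3, 3, -4, 4, -5, 5]
assert len(set(values)) == len(values)     # injective on this range
assert set(values) >= set(range(-5, 6))    # every integer in [-5, 5] is hit
```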
3.5 Worked Example - Neural Network Approximation
Theorem. For the target step function $f(x) = \mathbb{1}[x \geq \tfrac{1}{2}]$ on $[0, 1]$ and any transition width $\delta > 0$, there exists a 1-hidden-layer ReLU network that matches $f$ exactly outside the interval $(\tfrac{1}{2} - \tfrac{\delta}{2},\ \tfrac{1}{2} + \tfrac{\delta}{2})$.
Proof (by construction).
Define the network:
$\hat{f}(x) = \mathrm{ReLU}\!\left(\tfrac{x - (1/2 - \delta/2)}{\delta}\right)$ for a parameter $\delta > 0$. This is a single-neuron ReLU network with weight $1/\delta$ and bias $-(1/2 - \delta/2)/\delta$.
Behaviour:
- For $x \leq \tfrac{1}{2} - \tfrac{\delta}{2}$: the pre-activation is $\leq 0$, so $\hat{f}(x) = 0 = f(x)$ OK
- For $x$ in the transition zone $(\tfrac{1}{2} - \tfrac{\delta}{2}, \tfrac{1}{2} + \tfrac{\delta}{2})$: $\hat{f}$ ramps linearly from 0 towards 1 (error at most 1, confined to a zone of width $\delta$)
- For $x \geq \tfrac{1}{2} + \tfrac{\delta}{2}$: $\hat{f}(x) = \tfrac{x - (1/2 - \delta/2)}{\delta} \geq 1$, and it keeps growing
The single ReLU overshoots 1 well above the step, so this simple construction works near the transition but needs adjustment for the full interval. A better construction uses two ReLU units to create a step:
$\hat{f}(x) = \mathrm{ReLU}\!\left(\tfrac{x - (1/2 - \delta/2)}{\delta}\right) - \mathrm{ReLU}\!\left(\tfrac{x - (1/2 + \delta/2)}{\delta}\right)$
This clips the ramp to $[0, 1]$ and approximates the step function with transition width $\delta$. For any $\delta > 0$, the approximation error is zero outside $(\tfrac{1}{2} - \tfrac{\delta}{2},\ \tfrac{1}{2} + \tfrac{\delta}{2})$. $\square$
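A numerical sanity check of the two-ReLU construction (assuming the step target and the parameterisation sketched above; illustrative only):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def step_net(x, delta=0.01):
    # two-ReLU ramp: 0 left of the transition zone, 1 right of it
    lo, hi = 0.5 - delta / 2, 0.5 + delta / 2
    return relu((x - lo) / delta) - relu((x - hi) / delta)

x = np.linspace(0, 1, 10_001)
target = (x >= 0.5).astype(float)
outside_zone = np.abs(x - 0.5) > 0.005          # exclude the width-delta transition zone
print(np.abs(step_net(x) - target)[outside_zone].max())   # ~0 up to floating point
```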
3.6 Constructive vs Non-Constructive - A Key Distinction
| | Constructive | Non-Constructive |
|---|---|---|
| What it gives | The object itself | Proof that it exists |
| Algorithm? | Yes - the proof is the algorithm | No - existence without method |
| Informativeness | Very high | Moderate |
| Typical technique | Direct construction | Contradiction, probabilistic method |
| AI relevance | Algorithms, architectures | Performance bounds, impossibility results |
Example - Normal Numbers:
- Constructive: "Let " (Champernowne's constant). This is provably normal in base 10 by construction.
- Non-constructive: "A normal number must exist because the set of non-normal numbers has Lebesgue measure zero." Proves existence without exhibiting the number.
AI relevance: Constructive proofs correspond to implementable algorithms. Non-constructive proofs establish performance bounds without telling you how to achieve them.
4. Proof by Contrapositive
4.1 The Strategy
Logical equivalence: $P \Rightarrow Q \;\equiv\; \neg Q \Rightarrow \neg P$
To prove $P \Rightarrow Q$, you may equivalently prove $\neg Q \Rightarrow \neg P$. This is the contrapositive.
When to use: when the negation of $Q$ gives more useful information than $P$ itself. Often, $\neg Q$ hands you a concrete object to work with, while the original hypothesis gives you little to manipulate directly.
Critical distinction from contradiction: Contrapositive proves a specific alternate implication, $\neg Q \Rightarrow \neg P$. Contradiction assumes $P \wedge \neg Q$ and derives any false statement $\bot$. They are different techniques (see Section 5.7).
4.2 Template
+======================================================+
| CONTRAPOSITIVE PROOF TEMPLATE |
+======================================================+
| |
| Proof (by contrapositive): |
| We prove the contrapositive: assume \negQ. |
| [Derive \negP through logical steps] |
| Therefore \negP. |
| By contrapositive equivalence, P -> Q. [] |
| |
+======================================================+
4.3 Worked Example - If n^2 Is Even Then n Is Even
Theorem. For $n \in \mathbb{Z}$, if $n^2$ is even then $n$ is even.
Contrapositive: If $n$ is odd, then $n^2$ is odd.
Proof (by contrapositive).
Assume $n$ is odd. Then $n = 2k + 1$ for some $k \in \mathbb{Z}$, and $n^2 = (2k+1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1$.
Since $2k^2 + 2k \in \mathbb{Z}$, $n^2$ is of the form $2m + 1$; therefore $n^2$ is odd.
By contrapositive equivalence: if $n^2$ is even, then $n$ is even. $\square$
Compare with Section 2.4: The direct proof got stuck at $n = \sqrt{2k}$. The contrapositive transforms "$n^2$ even $\Rightarrow$ $n$ even" (hard direction) into "$n$ odd $\Rightarrow$ $n^2$ odd" (easy direction with concrete algebraic manipulation).
4.4 Worked Example - Irrational Product
Theorem. If $x$ is rational with $x \neq 0$ and $y$ is irrational, then $xy$ is irrational.
Contrapositive: If $xy$ is rational, then $x = 0$, or $x$ is irrational, or $y$ is rational - an awkward three-way disjunction.
Proof (by contradiction).
Assume $x \in \mathbb{Q}$, $x \neq 0$, and $y \notin \mathbb{Q}$. Suppose for contradiction that $xy$ is rational: $xy = p/q$ with $p, q \in \mathbb{Z}$, $q \neq 0$. Since $x$ is rational and non-zero, write $x = a/b$ with non-zero integers $a, b$. Then $y = \frac{xy}{x} = \frac{pb}{qa}$, which is rational - contradicting $y$ irrational.
Actually, this theorem is more naturally proved by contradiction - showing that technique selection requires judgment. The contrapositive reformulation doesn't simplify the argument. Not every theorem is best proved by contrapositive.
4.5 Contrapositive in AI Contexts
Theorem. If a neural network $f$ is not Lipschitz continuous, then it is not robust to adversarial examples.
Contrapositive: If $f$ is robust to adversarial examples (every perturbation of norm at most $\delta$ changes the output by less than $\epsilon$), then $f$ satisfies a Lipschitz-type bound with constant $\epsilon / \delta$ at that scale.
This direction is often easier to prove: robustness directly gives the Lipschitz bound.
Theorem. If $\nabla f(x) \neq 0$, then $x$ is not a local minimum.
Contrapositive: If $x$ is a local minimum, then $\nabla f(x) = 0$.
This is the standard first-order necessary condition for optimality, and the contrapositive direction is the one typically proved: at a local minimum, every directional derivative must be non-negative, which forces $\nabla f(x) = 0$.
4.6 Recognising When to Use Contrapositive
| Sign | Explanation |
|---|---|
| Conclusion involves a negation | "x is not...", "cannot be...", "there is no..." |
| Negating $Q$ gives concrete structure | $\neg Q$ hands you an object to work with |
| Direct proof gets stuck | Reformulating as $\neg Q \Rightarrow \neg P$ opens new paths |
| Hypothesis has existential content | Proving $\neg P$ may be simpler than deriving $Q$ |
Test: Write out $\neg P$ and $\neg Q$. If $\neg Q \Rightarrow \neg P$ seems more natural to argue than $P \Rightarrow Q$, use the contrapositive.
5. Proof by Contradiction
5.1 The Strategy
To prove $P$: assume $\neg P$; derive a contradiction - a statement known to be false, or both a statement and its negation simultaneously.
Valid by the law of excluded middle: $P \vee \neg P$. If $\neg P$ leads to a contradiction $\bot$, then $P$ must hold.
When to use: when $P$ has the form "X does not exist" or "X is impossible"; when no positive construction is available; when the assumption $\neg P$ has powerful consequences to exploit.
5.2 Template
+======================================================+
| CONTRADICTION PROOF TEMPLATE |
+======================================================+
| |
| Proof (by contradiction): |
| Assume for contradiction that \negP. |
| [Derive consequences of \negP] |
| [Arrive at statement C and \negC simultaneously] |
| This is a contradiction. |
| Therefore P must hold. [] |
| |
+======================================================+
5.3 Worked Example - \sqrt2 Is Irrational
Theorem. $\sqrt{2} \notin \mathbb{Q}$.
Proof (by contradiction).
Assume for contradiction that $\sqrt{2} \in \mathbb{Q}$.
Then $\sqrt{2} = p/q$ for some $p, q \in \mathbb{Z}$, $q \neq 0$, with $\gcd(p, q) = 1$ (fraction in lowest terms).
Squaring: $2 = p^2/q^2$, so $p^2 = 2q^2$.
$p^2$ is even. By the result of Section 4.3: $p$ is even. So $p = 2k$ for some $k \in \mathbb{Z}$.
Substituting: $(2k)^2 = 2q^2$, so $4k^2 = 2q^2$, hence $q^2 = 2k^2$.
$q^2$ is even. By the same result: $q$ is even.
But both $p$ and $q$ even contradicts $\gcd(p, q) = 1$.
Contradiction. Therefore $\sqrt{2} \notin \mathbb{Q}$. $\square$
Note how the proof builds on Section 4.3: The irrationality of $\sqrt{2}$ uses the lemma "$n^2$ even $\Rightarrow$ $n$ even" as a sub-result. This is typical - complex proofs assemble simpler proven facts.
5.4 Worked Example - Infinitely Many Primes (Euclid)
Theorem. There are infinitely many primes.
Proof (by contradiction).
Assume for contradiction there are only finitely many primes: $p_1, p_2, \dots, p_k$.
Construct $N = p_1 p_2 \cdots p_k + 1$.
$N > 1$. Every integer greater than 1 has a prime factor. Let $p$ be a prime factor of $N$.
$p$ must be one of $p_1, \dots, p_k$ (by assumption, these are all the primes).
But $p \mid N$ and $p \mid p_1 p_2 \cdots p_k$, so $p \mid (N - p_1 p_2 \cdots p_k) = 1$.
No prime divides 1. Contradiction.
Therefore there are infinitely many primes.
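A small illustration of Euclid's construction (not the proof itself - it just shows that for one finite list of primes, $N$ has a prime factor outside the list):

```python
from math import prod

def smallest_prime_factor(n: int) -> int:
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n

primes = [2, 3, 5, 7, 11, 13]
N = prod(primes) + 1                  # 30031 = 59 * 509
p = smallest_prime_factor(N)
print(N, p, p in primes)              # 30031 59 False
```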
5.5 Worked Example - No Largest Real Number
Theorem. There is no largest real number.
Proof (by contradiction).
Assume for contradiction that a largest real number $M$ exists: $x \leq M$ for all $x \in \mathbb{R}$.
Consider $M + 1 \in \mathbb{R}$.
$M + 1 > M$, contradicting $M$ being the largest.
Contradiction. Therefore no largest real number exists.
5.6 Non-Constructive Existence via Contradiction
One can prove "" by contradiction: assume ; derive a contradiction. This proves existence without constructing the object.
Example: The Intermediate Value Theorem proves without constructing .
AI relevance: Existence of optimal parameters minimising loss can be proved via compactness (if loss is continuous on a compact set) - but the proof gives no algorithm for finding . This is why optimisation theory is a separate discipline from existence theory.
5.7 Contradiction vs Contrapositive - Clarifying the Difference
These two techniques are frequently confused. They are not the same:
| | Contrapositive | Contradiction |
|---|---|---|
| Goal | Prove $P \Rightarrow Q$ | Prove $P$ (or $P \Rightarrow Q$) |
| Assume | $\neg Q$ | $\neg P$ (or $P \wedge \neg Q$) |
| Derive | $\neg P$ (specifically) | Any contradiction $\bot$ |
| Conclusion | $P \Rightarrow Q$ (by logical equivalence) | $P$ (since $\neg P$ led to $\bot$) |
| When best | $\neg Q \Rightarrow \neg P$ is natural to argue | $\neg P$ has powerful consequences to exploit |
Contrapositive is more structured (you know exactly what to derive: $\neg P$). Contradiction is more flexible (any absurdity suffices), but can make proofs harder to follow.
5.8 Contradiction in AI Contexts
Theorem. Cross-entropy loss with a hard target attains no minimum over the logits: the infimum is 0, but no finite logit vector achieves it.
Proof (by contradiction).
Assume some $z^*$ achieves the minimum: $L(z^*) \leq L(z)$ for all $z$.
For any finite $z$, the softmax probability of the target class satisfies $p_y = \frac{e^{z_y}}{\sum_j e^{z_j}} < 1$, so $L(z) = -\log p_y > 0$; in particular $L(z^*) > 0$.
Now fix all $z_j$ for $j \neq y$ and increase $z_y$. Then $L(z) = \log\!\left(1 + \sum_{j \neq y} e^{z_j - z_y}\right) \to 0$ as $z_y \to \infty$.
So for $z_y$ large enough, $L(z) < L(z^*)$, contradicting the minimality of $z^*$.
Therefore no minimiser exists. $\square$
More precisely: the loss approaches 0 but never equals it. There is no $z$ achieving $L(z) = 0$; the infimum $\inf_z L(z) = 0$ is not attained.
Theorem. A single ReLU hidden unit cannot represent the XOR function.
Proof (by contradiction).
Assume a network with one ReLU hidden unit computes XOR. The hidden unit computes $h(x) = \mathrm{ReLU}(w^\top x + b)$, a piecewise linear function with at most 2 linear pieces (separated by the hyperplane $w^\top x + b = 0$). The output $y(x) = v\, h(x) + c$ is also piecewise linear with at most 2 pieces.
XOR on $\{0, 1\}^2$ requires: $y(0,0) = 0$, $y(0,1) = 1$, $y(1,0) = 1$, $y(1,1) = 0$. But two linear pieces cannot fit these four points correctly - XOR is not linearly separable, so any half-plane places at least one point that must output 0 and one that must output 1 on the same side. Contradiction. $\square$
6. Proof by Cases (Exhaustion)
6.1 The Strategy
To prove $P$, partition the universe into exhaustive, mutually exclusive cases $C_1, \dots, C_n$ such that $C_1 \vee \cdots \vee C_n$ covers all possibilities. Prove $P$ holds in each case separately.
Valid because: $C_1 \vee \cdots \vee C_n$ together with $C_1 \Rightarrow P, \dots, C_n \Rightarrow P$ yield $P$.
When to use: when no single argument covers all cases, but the problem has a natural partition (parity, sign, relative order, etc.).
6.2 Template
+======================================================+
| PROOF BY CASES TEMPLATE |
+======================================================+
| |
| Proof (by cases): |
| Let x be arbitrary. We consider all cases. |
| |
| Case 1: [C_1 holds] |
| [Prove P under assumption C_1] |
| |
| Case 2: [C_2 holds] |
| [Prove P under assumption C_2] |
| |
| ... |
| Since C_1 \vee C_2 \vee ... \vee C_n is exhaustive, P. [] |
| |
+======================================================+
6.3 Worked Example - Triangle Inequality for Absolute Value
Theorem. For all $x, y \in \mathbb{R}$: $|x + y| \leq |x| + |y|$.
Proof (by cases).
Case 1: $x + y \geq 0$.
$|x + y| = x + y \leq |x| + |y|$ since $x \leq |x|$ and $y \leq |y|$.
Case 2: $x + y < 0$.
$|x + y| = -(x + y) = (-x) + (-y) \leq |x| + |y|$ since $-x \leq |x|$ and $-y \leq |y|$.
Both cases yield $|x + y| \leq |x| + |y|$. $\square$
6.4 Worked Example - n^2 mod 4
Theorem. For all $n \in \mathbb{Z}$, $n^2 \bmod 4 \in \{0, 1\}$.
Proof (by cases on $n \bmod 4$).
Case 1: $n = 4k$. Then $n^2 = 16k^2 = 4(4k^2)$, so $n^2 \bmod 4 = 0$.
Case 2: $n = 4k + 1$. Then $n^2 = 16k^2 + 8k + 1 = 4(4k^2 + 2k) + 1$, so $n^2 \bmod 4 = 1$.
Case 3: $n = 4k + 2$. Then $n^2 = 16k^2 + 16k + 4 = 4(4k^2 + 4k + 1)$, so $n^2 \bmod 4 = 0$.
Case 4: $n = 4k + 3$. Then $n^2 = 16k^2 + 24k + 9 = 4(4k^2 + 6k + 2) + 1$, so $n^2 \bmod 4 = 1$.
All cases give $n^2 \bmod 4 \in \{0, 1\}$. $\square$
6.5 Cases in AI - Activation Function Analysis
Theorem. The ReLU function $\sigma(x) = \max(0, x)$ is convex.
Proof (by cases).
For convexity: $\sigma(\lambda x + (1 - \lambda) y) \leq \lambda \sigma(x) + (1 - \lambda) \sigma(y)$ for $\lambda \in [0, 1]$.
Let $z = \lambda x + (1 - \lambda) y$.
Case 1: $z \leq 0$.
$\sigma(z) = 0 \leq \lambda \sigma(x) + (1 - \lambda) \sigma(y)$ since the RHS is $\geq 0$.
Case 2: $z > 0$.
$\sigma(z) = z = \lambda x + (1 - \lambda) y \leq \lambda \sigma(x) + (1 - \lambda) \sigma(y)$ (using $x \leq \sigma(x)$ and $y \leq \sigma(y)$).
Both cases confirm convexity.
6.6 Without Loss of Generality (WLOG)
When cases are symmetric, you can reduce the number by stating "without loss of generality" (WLOG).
Example: Proving a symmetric property of a pair $x, y$: WLOG assume $x \leq y$; then only this ordering needs to be handled. The case $x > y$ is symmetric with the roles of $x$ and $y$ swapped.
Caution: WLOG is only valid when the cases are truly symmetric under relabelling. A common mistake is claiming WLOG when the problem is not symmetric.
7. Mathematical Induction
7.1 The Strategy
To prove $P(n)$ holds for all $n \geq n_0$ (typically $n_0 = 0$ or $n_0 = 1$):
- Base case: Prove $P(n_0)$.
- Inductive step: Prove $P(k) \Rightarrow P(k+1)$ for every $k \geq n_0$.
Then $P(n)$ holds for all $n \geq n_0$ by the well-ordering principle (or equivalently, the axiom of induction for $\mathbb{N}$).
Induction is the foundational proof technique for discrete structures and is heavily used in analysing algorithms, data structures, and recursive models.
7.2 Template
+======================================================+
| INDUCTION PROOF TEMPLATE |
+======================================================+
| |
| Proof (by induction on n): |
| |
| Base case (n = n_0): |
| [Verify P(n_0) directly] |
| |
| Inductive step: |
| Assume P(k) for some k \geq n_0. (IH) |
| [Prove P(k+1) using the inductive |
| hypothesis P(k)] |
| |
| By induction, P(n) for all n \geq n_0. [] |
| |
+======================================================+
Important: In the inductive step, you assume $P(k)$ (the "inductive hypothesis," IH) and prove $P(k+1)$. This is not circular - you are proving the conditional statement $P(k) \Rightarrow P(k+1)$, not $P(k+1)$ outright.
7.3 Worked Example - Sum Formula
Theorem. $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ for all $n \geq 1$.
Proof (by induction on $n$).
Base case ($n = 1$):
LHS: $\sum_{i=1}^{1} i = 1$. RHS: $\frac{1 \cdot 2}{2} = 1$. OK
Inductive step:
Assume $\sum_{i=1}^{k} i = \frac{k(k+1)}{2}$ for some $k \geq 1$. (IH)
Then: $\sum_{i=1}^{k+1} i = \sum_{i=1}^{k} i + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{k(k+1) + 2(k+1)}{2} = \frac{(k+1)(k+2)}{2}$
This is $\frac{(k+1)\bigl((k+1)+1\bigr)}{2}$, which is $P(k+1)$. OK
By induction, the formula holds for all $n \geq 1$. $\square$
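A finite sanity check of the closed form (useful for catching algebra slips; it is not a substitute for the induction):

```python
for n in range(1, 1001):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("closed form verified for n = 1..1000")
```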
7.4 Worked Example - Power Set Cardinality
Theorem. A set with $n$ elements has $2^n$ subsets.
Proof (by induction on $n$).
Base case ($n = 0$):
$\emptyset$ has one subset: itself. $2^0 = 1$. OK
Inductive step:
Assume a set with $k$ elements has $2^k$ subsets. (IH)
Let $S$ be a set with $k + 1$ elements. Pick $x \in S$ and let $S' = S \setminus \{x\}$, so $|S'| = k$.
Every subset of $S$ either contains $x$ or does not:
- Subsets not containing $x$: exactly the subsets of $S'$. There are $2^k$ of these (by IH).
- Subsets containing $x$: formed by taking a subset of $S'$ and adding $x$. There are $2^k$ of these.
Total: $2^k + 2^k = 2^{k+1}$. OK
By induction, a set with $n$ elements has $2^n$ subsets for all $n \geq 0$. $\square$
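A quick enumeration check of $|\mathcal{P}(S)| = 2^{|S|}$ for small $n$:

```python
from itertools import combinations

def num_subsets(n: int) -> int:
    # count subsets of an n-element set by enumerating all sizes k = 0..n
    return sum(1 for k in range(n + 1) for _ in combinations(range(n), k))

for n in range(9):
    assert num_subsets(n) == 2 ** n
print("|P(S)| = 2^|S| verified for |S| = 0..8")
```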
7.5 Induction for Neural Network Depth
Theorem. A ReLU network with $L$ hidden layers (one hidden unit per layer) computes a piecewise linear function with at most $2^L$ linear regions.
Proof (by induction on $L$).
Base case ($L = 1$):
One ReLU hidden unit: $h(x) = \mathrm{ReLU}(wx + b)$. This produces at most 2 linear pieces (one where $wx + b \leq 0$ and one where $wx + b > 0$). $2 = 2^1$. OK
Inductive step:
Assume a network with $k$ layers (one unit each) has at most $2^k$ linear regions. (IH)
Adding layer $k + 1$ with one ReLU: the new ReLU unit can at most double the number of regions by "folding" each existing piece. The fold happens at the ReLU's breakpoint, splitting at most every existing region in two.
Maximum new count: $2 \cdot 2^k = 2^{k+1}$. OK
By induction, the bound is $2^L$ for $L$ layers. $\square$
Remark: With $w$ units per layer, the bound grows exponentially in the depth $L$ (and polynomially in the width $w$), explaining the exponential expressiveness of depth - a key insight behind deep learning.
7.6 Common Induction Mistakes
| Mistake | Example | Why It Fails |
|---|---|---|
| Forgetting the base case | "Assume $P(k)$, prove $P(k+1)$" - but never check $P(n_0)$ | The chain has no starting anchor |
| Wrong base case | Prove $P(1)$ but claim the result for all $n \geq 0$ | May miss that $P(0)$ needs separate attention |
| Assuming what you prove | "Assume $P(k+1)$" instead of proving it | Circular reasoning |
| Not using the IH | Proving $P(k+1)$ from scratch, never invoking $P(k)$ | Valid but not induction (just direct proof) |
| Wrong inductive "direction" | Assume $P(k+1)$, prove $P(k)$ | Proves the implication in the wrong direction; does not establish $P(n)$ for all $n \geq n_0$ |
8. Strong Induction
8.1 The Strategy
Strong induction (also called complete induction or course-of-values induction) strengthens the inductive hypothesis: instead of assuming only $P(k)$, you assume $P(n_0), P(n_0 + 1), \dots, P(k)$ - that is, $P(j)$ for all $n_0 \leq j \leq k$ - and then prove $P(k+1)$.
Logical equivalence: Strong induction is equivalent to ordinary induction on the statement "for all $j \leq n$, $P(j)$." It is not a stronger proof system, just a more convenient packaging.
When to use: when $P(k+1)$ depends on $P(k-1)$, $P(\lfloor (k+1)/2 \rfloor)$, or some other earlier case - not just the immediately preceding one.
8.2 Template
+======================================================+
| STRONG INDUCTION TEMPLATE |
+======================================================+
| |
| Proof (by strong induction on n): |
| |
| Base case(s) (n = n_0, ..., n_0+r): |
| [Verify P(n_0), ..., P(n_0+r) directly] |
| |
| Inductive step: |
| Assume P(j) for all n_0 \leq j \leq k. (SIH) |
| [Prove P(k+1) using any/all of |
| P(n_0), P(n_0+1), ..., P(k)] |
| |
| By strong induction, P(n) for all n \geq n_0. [] |
| |
+======================================================+
Note: Strong induction often requires multiple base cases because the inductive step may reach back several positions.
8.3 Worked Example - Fundamental Theorem of Arithmetic
Theorem. Every integer $n \geq 2$ can be written as a product of primes.
Proof (by strong induction on $n$).
Base case ($n = 2$):
2 is prime, so it is (trivially) a product of primes. OK
Inductive step:
Assume every integer $m$ with $2 \leq m \leq k$ can be written as a product of primes. (SIH)
Consider $k + 1$. Two cases:
Case 1: $k + 1$ is prime. Then it is a product of one prime. OK
Case 2: $k + 1$ is composite. Then $k + 1 = ab$ for some $2 \leq a, b \leq k$.
By the SIH, $a$ and $b$ are products of primes.
Therefore $k + 1 = ab$ is a product of primes. OK
By strong induction, every $n \geq 2$ is a product of primes. $\square$
Why standard induction fails: the factorisation of a composite $k + 1$ requires $P(a)$ and $P(b)$ where $a, b \leq k$, but not necessarily $P(k)$ specifically.
8.4 Worked Example - Binary Representation
Theorem. Every positive integer $n$ has a binary representation: $n = b_m 2^m + b_{m-1} 2^{m-1} + \dots + b_1 2 + b_0$ with each $b_i \in \{0, 1\}$.
Proof (by strong induction on $n$).
Base case ($n = 1$): $1 = 1 \cdot 2^0$, so $b_0 = 1$. OK
Inductive step:
Assume all integers $1 \leq m \leq k$ have binary representations. (SIH)
Consider $k + 1$.
Case 1: $k + 1$ is even. Then $k + 1 = 2m$ where $1 \leq m \leq k$. By SIH, $m = \sum_i b_i 2^i$. So $k + 1 = \sum_i b_i 2^{i+1}$, a valid binary representation (shift all digits left one position). OK
Case 2: $k + 1$ is odd. Then $k + 1 = 2m + 1$ where $1 \leq m \leq k$ (for $k + 1 \geq 3$; $k + 1 = 1$ is the base case). By SIH, $m = \sum_i b_i 2^i$. So $k + 1 = \sum_i b_i 2^{i+1} + 1$, i.e., shift left and set the last bit to 1. OK
By strong induction, every positive integer has a binary representation.
AI relevance: Binary representations are fundamental to digital computation. This proof legitimises the binary encoding that underlies all neural network implementations.
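A recursive converter that mirrors the strong induction (even case shifts digits left; odd case shifts and sets the last bit), checked against Python's built-in `bin`:

```python
def to_binary(n: int) -> str:
    assert n >= 1
    if n == 1:                        # base case
        return "1"
    if n % 2 == 0:                    # n = 2m: digits of m, shifted left
        return to_binary(n // 2) + "0"
    return to_binary(n // 2) + "1"    # n = 2m + 1: digits of m, then a 1

for n in range(1, 1000):
    assert to_binary(n) == bin(n)[2:]
print("matches Python's bin() for n = 1..999")
```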
8.5 Strong Induction in AI - Recursive Network Correctness
Theorem. A tree-structured recursive neural network correctly computes the representation of any tree of depth $d$ if (a) it correctly handles leaves and (b) whenever it correctly handles subtrees, its composition function correctly handles their parent.
Proof (by strong induction on $d$).
Base case ($d = 0$): Leaf nodes. By assumption (a), the network correctly computes their representations. OK
Inductive step: Assume the network correctly represents all trees of depth $\leq k$. (SIH)
A tree of depth $k + 1$ has a root whose children are roots of subtrees of depth $\leq k$. By SIH, these subtrees are correctly represented. By assumption (b), the composition function correctly computes the parent's representation from its children's representations. OK
By strong induction, the network correctly handles all trees.
9. Structural Induction
9.1 The Strategy
Structural induction generalises mathematical induction from natural numbers to inductively defined structures: lists, trees, formulas, programs, and other recursive data types.
The idea: an inductively defined set has base constructors (creating atomic elements) and recursive constructors (building larger structures from smaller ones). Structural induction mirrors this:
- Base case: Prove $P(c)$ for each base constructor $c$.
- Inductive step: For each recursive constructor, assume $P$ holds for all sub-structures (structural inductive hypothesis, SIH); prove $P$ for the constructed structure.
9.2 Template
+======================================================+
| STRUCTURAL INDUCTION TEMPLATE |
+======================================================+
| |
| Proof (by structural induction): |
| |
| Base case: [For each base constructor c_0] |
| [Verify P(c_0)] |
| |
| Inductive step: [For each recursive |
| constructor C(x_1, ..., x_m)] |
| Assume P(x_1), ..., P(x_m). (SIH) |
| [Prove P(C(x_1, ..., x_m))] |
| |
| By structural induction, P holds for all |
| elements of the inductively defined set. [] |
| |
+======================================================+
9.3 Induction on Lists
Definition (Lists). The set of lists over a type $A$ is inductively defined:
- Base: $[\,]$ (the empty list) is a list.
- Recursive: If $a \in A$ and $\ell$ is a list, then $a :: \ell$ (cons) is a list.
Theorem. $\mathrm{length}(\mathrm{append}(xs, ys)) = \mathrm{length}(xs) + \mathrm{length}(ys)$,
where $\mathrm{length}([\,]) = 0$, $\mathrm{length}(a :: \ell) = 1 + \mathrm{length}(\ell)$, and $\mathrm{append}([\,], ys) = ys$, $\mathrm{append}(a :: \ell, ys) = a :: \mathrm{append}(\ell, ys)$.
Proof (by structural induction on $xs$).
Base case ($xs = [\,]$):
$\mathrm{length}(\mathrm{append}([\,], ys)) = \mathrm{length}(ys) = 0 + \mathrm{length}(ys) = \mathrm{length}([\,]) + \mathrm{length}(ys)$. OK
Inductive step ($xs = a :: \ell$):
Assume $\mathrm{length}(\mathrm{append}(\ell, ys)) = \mathrm{length}(\ell) + \mathrm{length}(ys)$. (SIH)
Then $\mathrm{length}(\mathrm{append}(a :: \ell, ys)) = \mathrm{length}(a :: \mathrm{append}(\ell, ys)) = 1 + \mathrm{length}(\mathrm{append}(\ell, ys)) = 1 + \mathrm{length}(\ell) + \mathrm{length}(ys) = \mathrm{length}(a :: \ell) + \mathrm{length}(ys)$. OK By structural induction, the result holds for all lists $xs$. $\square$
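A sketch of the recursive definitions in code (lists represented as nested cons pairs - an assumed toy representation), with a finite check of the identity:

```python
NIL = None                      # the empty list

def cons(a, rest):              # the recursive constructor
    return (a, rest)

def length(xs):
    return 0 if xs is NIL else 1 + length(xs[1])

def append(xs, ys):
    return ys if xs is NIL else cons(xs[0], append(xs[1], ys))

def from_py(items):             # helper: build a cons list from a Python list
    out = NIL
    for a in reversed(items):
        out = cons(a, out)
    return out

xs, ys = from_py([1, 2, 3]), from_py([4, 5])
assert length(append(xs, ys)) == length(xs) + length(ys) == 5
print("length(append(xs, ys)) == length(xs) + length(ys)")
```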
9.4 Induction on Binary Trees
Definition (Binary trees).
- Base: $\mathrm{Leaf}$ is a binary tree.
- Recursive: If $T_1$ and $T_2$ are binary trees, then $\mathrm{Node}(T_1, T_2)$ is a binary tree.
Theorem. A binary tree with $n$ internal nodes has $n + 1$ leaves.
Proof (by structural induction).
Base case ($T = \mathrm{Leaf}$):
0 internal nodes, 1 leaf. $0 + 1 = 1$. OK
Inductive step ($T = \mathrm{Node}(T_1, T_2)$):
Assume $T_1$ has $n_1$ internal nodes and $n_1 + 1$ leaves (SIH). Similarly $T_2$ has $n_2$ internal nodes and $n_2 + 1$ leaves (SIH).
$T$ has $n_1 + n_2 + 1$ internal nodes (the $+1$ is the new root).
Leaves: $(n_1 + 1) + (n_2 + 1) = (n_1 + n_2 + 1) + 1$. OK
By structural induction, a tree with $n$ internal nodes has $n + 1$ leaves. $\square$
AI relevance: Decision trees, parse trees in NLP, and hierarchical attention structures are all binary/n-ary trees. Properties proved by structural induction apply directly.
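A quick structural-recursion check of the identity on randomly generated trees (an assumed representation: `None` is a leaf, a pair is an internal node; illustrative only):

```python
import random

def random_tree(depth):
    if depth == 0 or random.random() < 0.3:
        return None                                  # leaf
    return (random_tree(depth - 1), random_tree(depth - 1))   # internal node

def leaves(t):
    return 1 if t is None else leaves(t[0]) + leaves(t[1])

def internal(t):
    return 0 if t is None else 1 + internal(t[0]) + internal(t[1])

random.seed(0)
for _ in range(1000):
    t = random_tree(6)
    assert leaves(t) == internal(t) + 1
print("leaves == internal + 1 on 1000 random trees")
```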
9.5 Induction on Computation Graphs
Definition (Computation graph for automatic differentiation).
- Base: An input variable $x_i$ is a computation node.
- Recursive: If $u_1, \dots, u_m$ are computation nodes and $g$ is a differentiable function, then $v = g(u_1, \dots, u_m)$ is a computation node.
Theorem (Correctness of backpropagation). For any output node $y$ and any input $x_i$, the backpropagation algorithm correctly computes $\partial y / \partial x_i$.
Proof sketch (by structural induction on computation nodes).
Base case: $v = x_i$ is an input node. Then $\partial v / \partial x_i = 1$ and $\partial v / \partial x_j = 0$ for $j \neq i$. Backpropagation initialises these correctly. OK
Inductive step: $v = g(u_1, \dots, u_m)$.
Assume backprop correctly computes $\partial u_j / \partial x_i$ for all $j$ and all inputs $x_i$. (SIH)
By the chain rule:
$\dfrac{\partial v}{\partial x_i} = \sum_{j=1}^{m} \dfrac{\partial g}{\partial u_j} \cdot \dfrac{\partial u_j}{\partial x_i}$. Backpropagation computes this sum: the local gradients $\partial g / \partial u_j$ are computed at the node, and the $\partial u_j / \partial x_i$ are correctly computed by the SIH. Their products and sum are correctly accumulated. OK
By structural induction, backpropagation is correct for all computation graphs.
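A minimal reverse-mode autodiff sketch that mirrors this induction: each node stores its parents and local gradients, and the backward pass accumulates chain-rule products over all paths. (An assumed toy design; real frameworks use a topological-order pass for efficiency.)

```python
import math

class Node:
    """A computation node: value, (parent, local_gradient) pairs, accumulated grad."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, list(parents), 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def exp(a):
    v = math.exp(a.value)
    return Node(v, [(a, v)])

def backward(node, upstream=1.0):
    # Accumulate d(output)/d(node): sum of chain-rule products over all paths.
    node.grad += upstream
    for parent, local in node.parents:
        backward(parent, local * upstream)

x, y = Node(2.0), Node(3.0)
z = add(mul(x, y), exp(x))        # z = x*y + e^x
backward(z)
print(x.grad, y.grad)             # dz/dx = y + e^2 ≈ 10.389, dz/dy = x = 2.0
```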
9.6 When to Use Which Induction Variant
| Variant | Domain | Inductive Hypothesis | Use When |
|---|---|---|---|
| Standard | $\mathbb{N}$ | $P(k)$ | $P(k+1)$ depends only on $P(k)$ |
| Strong | $\mathbb{N}$ | $P(n_0), \dots, P(k)$ | $P(k+1)$ depends on earlier cases |
| Structural | Inductive types | $P$ for all sub-structures | Domain is recursively defined |
| Transfinite | Ordinals | $P(\beta)$ for all $\beta < \alpha$ | Infinite/uncountable well-ordered sets |
10. The Probabilistic Method
10.1 The Strategy
To prove that an object with property $P$ exists, define a probability distribution over a space of candidate objects. Show that the probability of $P$ is strictly positive: $\Pr[P(X)] > 0$. Then an object with property $P$ must exist (otherwise the probability would be 0).
This technique, pioneered by Paul Erdos, is profoundly non-constructive: it proves existence without exhibiting a specific example. Yet it is one of the most powerful tools in combinatorics, computer science, and - increasingly - machine learning theory.
10.2 Template
+======================================================+
| PROBABILISTIC METHOD TEMPLATE |
+======================================================+
| |
| Proof (by the probabilistic method): |
| |
| Define a probability space \Omega over candidates. |
| Let X be a random candidate drawn from \Omega. |
| Compute E[cost(X)] or Pr[P(X)]. |
| |
| Show that E[cost(X)] < threshold |
| (or Pr[P(X)] > 0). |
| |
| Therefore \exists x \in \Omega with cost(x) < threshold |
| (or satisfying P). [] |
| |
+======================================================+
Key principle: If a random variable $X$ has expected value $\mu$, then some outcome satisfies $X \geq \mu$ (and some outcome satisfies $X \leq \mu$). This is the first moment method.
10.3 Worked Example - Ramsey Lower Bound
Theorem (Erdos, 1947). $R(k, k) > 2^{k/2}$ for all $k \geq 3$ (the diagonal Ramsey number exceeds $2^{k/2}$).
Proof (by the probabilistic method).
Consider the complete graph on $n = \lfloor 2^{k/2} \rfloor$ vertices. Colour each edge red or blue independently with probability $1/2$.
For any set $S$ of $k$ vertices: $\Pr[S \text{ induces a monochromatic } K_k] = 2 \cdot 2^{-\binom{k}{2}} = 2^{1 - \binom{k}{2}}$ (factor 2 for red or blue).
By union bound over all $\binom{n}{k}$ choices of $S$:
$\Pr[\text{some monochromatic } K_k] \leq \binom{n}{k} \cdot 2^{1 - \binom{k}{2}}$.
Using $\binom{n}{k} \leq \frac{n^k}{k!}$:
For $n = \lfloor 2^{k/2} \rfloor$ and $k \geq 3$: $\binom{n}{k} \cdot 2^{1 - \binom{k}{2}} \leq \frac{2^{k^2/2}}{k!} \cdot 2^{1 - k(k-1)/2} = \frac{2^{1 + k/2}}{k!} < 1$.
Since the probability of a monochromatic $K_k$ is strictly less than 1, there exists a 2-colouring with no monochromatic $K_k$. Therefore $R(k, k) > 2^{k/2}$. $\square$
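A quick computation of the union-bound quantity $\binom{n}{k} \cdot 2^{1 - \binom{k}{2}}$ at $n = \lfloor 2^{k/2} \rfloor$, confirming it stays below 1 (a numerical companion to the estimate above, not a replacement for it):

```python
from math import comb, floor

for k in range(3, 16):
    n = floor(2 ** (k / 2))
    bound = comb(n, k) * 2 ** (1 - comb(k, 2))
    print(k, n, f"{bound:.3e}")      # stays below 1 for every k >= 3
```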
10.4 Random Hyperplane Rounding (AI Application)
Theorem (Goemans-Williamson, 1995). The random hyperplane rounding algorithm achieves a 0.878-approximation for MAX-CUT.
Proof sketch (probabilistic method).
Solve the SDP relaxation: assign unit vectors $v_i$ to vertices. For each edge $(i, j)$, the relaxation value is $\frac{1 - v_i \cdot v_j}{2}$.
Rounding: Choose a random hyperplane through the origin (uniformly random unit normal $r$). Assign vertex $i$ to side $A$ if $r \cdot v_i \geq 0$, else side $B$.
Edge $(i, j)$ is cut iff $v_i$ and $v_j$ land on opposite sides, which happens with probability $\frac{\theta_{ij}}{\pi}$, where $\theta_{ij}$ is the angle between $v_i$ and $v_j$.
The key inequality: $\frac{\theta}{\pi} \geq 0.878 \cdot \frac{1 - \cos\theta}{2}$ for all $\theta \in [0, \pi]$.
Therefore: $\mathbb{E}[\text{cut}] = \sum_{(i,j) \in E} \frac{\theta_{ij}}{\pi} \geq 0.878 \sum_{(i,j) \in E} \frac{1 - v_i \cdot v_j}{2} = 0.878 \cdot \text{SDP} \geq 0.878 \cdot \text{OPT}$.
By linearity of expectation, there exists a hyperplane achieving a cut of value at least $0.878 \cdot \text{OPT}$.
AI connection: Approximation algorithms via random rounding are conceptually related to random projection methods (Johnson-Lindenstrauss lemma, locality-sensitive hashing) used in approximate nearest neighbor search for embedding retrieval.
10.5 The Probabilistic Method in Deep Learning Theory
Theorem (Random initialisation has non-zero gradients - informal). For a randomly initialised deep network with ReLU activations, with probability 1 the gradient is non-zero at initialisation.
Proof sketch.
The set of parameter values $\theta$ where $\nabla_\theta L(\theta) = 0$ exactly is a measure-zero subset of parameter space $\mathbb{R}^d$ (it is the zero set of a non-trivial analytic function of the parameters, assuming the loss is analytic). A continuous distribution over parameters assigns probability zero to any measure-zero set. Therefore, with probability 1, the initialisation lies outside this set.
This is why random initialisation "works": it almost surely places us at a point where gradient descent can make progress.
11. Counting and Combinatorial Arguments
11.1 The Strategy
Prove identities or existence results by counting the same set in two different ways (double counting), or by establishing bijections between sets, or by applying combinatorial inequalities (pigeonhole principle, inclusion-exclusion).
11.2 Double Counting
Principle: If you count the elements of a set in two different ways and get expressions $A$ and $B$, then $A = B$.
Theorem (Handshaking Lemma). In any graph $G = (V, E)$: $\sum_{v \in V} \deg(v) = 2|E|$.
Proof (by double counting).
Count the set of incidences $I = \{(v, e) : v \in V,\ e \in E,\ v \text{ is an endpoint of } e\}$.
Method 1: For each vertex $v$, it contributes $\deg(v)$ pairs. Total: $\sum_{v \in V} \deg(v)$.
Method 2: For each edge $e = \{u, w\}$, it contributes exactly 2 pairs: $(u, e)$ and $(w, e)$. Total: $2|E|$.
Same set, two counts: $\sum_{v \in V} \deg(v) = 2|E|$. $\square$
AI relevance: Graph neural networks (GNNs) aggregate messages along edges. The handshaking lemma ensures that the total number of message-passing operations per layer equals $2|E| = \sum_v \deg(v)$, which determines the computational cost of one GNN layer.
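A quick check of the lemma on a randomly generated graph:

```python
import random

random.seed(0)
n = 50
edges = {(i, j) for i in range(n) for j in range(i + 1, n) if random.random() < 0.1}
deg = {v: 0 for v in range(n)}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
assert sum(deg.values()) == 2 * len(edges)
print(sum(deg.values()), "=", 2 * len(edges))
```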
11.3 The Pigeonhole Principle
Principle: If $n$ items are placed into $m$ containers and $n > m$, then some container holds more than one item.
Generalised: If $n$ items are placed into $m$ containers, some container holds at least $\lceil n/m \rceil$ items.
Theorem. In any set of 13 real numbers, at least two have a difference whose fractional part is less than $1/12$.
Proof (by pigeonhole).
Partition $[0, 1)$ into 12 intervals: $[0, \tfrac{1}{12}), [\tfrac{1}{12}, \tfrac{2}{12}), \dots, [\tfrac{11}{12}, 1)$.
Given 13 numbers, consider their fractional parts. By pigeonhole (13 numbers, 12 intervals), at least two fractional parts lie in the same interval; say $\{x\} \geq \{y\}$. Their difference has fractional part $\{x\} - \{y\} < 1/12$. $\square$
11.4 Pigeonhole in AI - Collision Arguments
Theorem. Any hash function $h: U \to T$ with $|U| > |T|$ has collisions.
Proof (by pigeonhole).
There are $|U|$ items (the inputs) and $|T|$ containers (the hash values). By pigeonhole, some two distinct inputs $x \neq y$ satisfy $h(x) = h(y)$. $\square$
Corollary for embeddings: An embedding $e: V \to \mathbb{R}^d$ stored at finite precision with $b$ bits per dimension has at most $2^{bd}$ distinct representable outputs, so it cannot be injective on any vocabulary $V$ with $|V| > 2^{bd}$.
Theorem. Any fixed-length tokeniser mapping variable-length strings to fixed-size vocabularies must have collisions (multiple strings mapping to the same token sequence of bounded length).
11.5 Inclusion-Exclusion
Principle. $\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{i} |A_i| - \sum_{i < j} |A_i \cap A_j| + \sum_{i < j < k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$.
AI application - Estimating vocabulary coverage:
Given documents $D_1, \dots, D_n$ with vocabularies $V_1, \dots, V_n$: $|V_1 \cup \cdots \cup V_n| = \sum_i |V_i| - \sum_{i < j} |V_i \cap V_j| + \cdots$
In practice, the higher-order terms are estimated by sampling. This is used in corpus analysis for NLP.
11.6 Bijective Proofs
Principle: To prove $|A| = |B|$, exhibit an explicit bijection $f: A \to B$.
Theorem. The number of subsets of $\{1, \dots, n\}$ of even size equals the number of odd size (for $n \geq 1$).
Proof (by bijection).
Fix element 1. Define $f(S) = S \,\triangle\, \{1\}$ (symmetric difference: add 1 if absent, remove it if present).
$f$ maps even-sized subsets to odd-sized subsets and vice versa. $f$ is its own inverse ($f(f(S)) = S$), so it is a bijection.
Therefore: the even-sized and odd-sized subsets are equinumerous, $2^{n-1}$ of each. $\square$
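A finite verification of the bijection $S \mapsto S \,\triangle\, \{1\}$ on all subsets of a small ground set:

```python
from itertools import combinations

n = 6
ground = range(1, n + 1)
subsets = [frozenset(c) for k in range(n + 1) for c in combinations(ground, k)]

def f(S):
    return S ^ frozenset({1})          # symmetric difference with {1}

assert all(f(f(S)) == S for S in subsets)                    # f is an involution
assert all(len(f(S)) % 2 != len(S) % 2 for S in subsets)     # f flips parity of |S|
even = sum(len(S) % 2 == 0 for S in subsets)
print(even, len(subsets) - even)       # 32 32, i.e. 2^(n-1) of each parity
```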
12. Epsilon-Delta and Analytic Proofs
12.1 The \epsilon-\delta Framework
In analysis, many proofs require showing that a quantity can be made arbitrarily small. The standard framework: for every $\epsilon > 0$, exhibit a $\delta > 0$ (or a threshold $N$) such that whenever the input is within $\delta$ of its target (or the index exceeds $N$), the quantity of interest is within $\epsilon$ of its limit.
The key challenge is constructing $\delta$ (or $N$) as a function of $\epsilon$ (and sometimes other parameters).
12.2 Convergence of Gradient Descent (Fixed Step Size)
Theorem. Let $f: \mathbb{R}^d \to \mathbb{R}$ be convex and $L$-smooth ($\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y$). Gradient descent $x_{t+1} = x_t - \eta \nabla f(x_t)$ with step size $\eta = 1/L$ satisfies: $f(x_T) - f(x^*) \leq \dfrac{L \|x_0 - x^*\|^2}{2T}$.
Proof.
By $L$-smoothness, for any $x, y$: $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$.
Set $x = x_t$ and $y = x_{t+1} = x_t - \eta \nabla f(x_t)$: $f(x_{t+1}) \leq f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{L \eta^2}{2}\|\nabla f(x_t)\|^2$.
With $\eta = 1/L$: $f(x_{t+1}) \leq f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$.
By convexity: $f(x_t) \leq f(x^*) + \nabla f(x_t)^\top (x_t - x^*)$, so: $f(x_{t+1}) - f(x^*) \leq \nabla f(x_t)^\top (x_t - x^*) - \frac{1}{2L}\|\nabla f(x_t)\|^2 = \frac{L}{2}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right)$.
Summing the per-step bound over $t = 0, \dots, T-1$, the right-hand side telescopes: $\sum_{t=0}^{T-1}\left(f(x_{t+1}) - f(x^*)\right) \leq \frac{L}{2}\|x_0 - x^*\|^2$. Since $f(x_{t+1}) \leq f(x_t)$, the left-hand side is at least $T\left(f(x_T) - f(x^*)\right)$, giving $f(x_T) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2T}$. $\square$
This is an $O(1/T)$ convergence rate.
Reading this proof: Note the structure - establish a per-step improvement bound, then telescope the sum. This pattern recurs throughout optimisation theory.
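A numerical check of the $O(1/T)$ bound on a convex quadratic, which is $L$-smooth with $L = \lambda_{\max}(A)$ (an assumed random instance; illustrative rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                 # symmetric positive definite
L = np.linalg.eigvalsh(A).max()         # smoothness constant of f(x) = 0.5 x^T A x

def f(x):
    return 0.5 * x @ A @ x              # convex, minimiser x* = 0, f* = 0

x0 = rng.normal(size=5)
x, eta = x0.copy(), 1.0 / L
for T in range(1, 201):
    x = x - eta * (A @ x)               # gradient step: grad f(x) = A x
    assert f(x) <= L * (x0 @ x0) / (2 * T) + 1e-12
print("f(x_T) - f* <= L ||x0 - x*||^2 / (2T) held for T = 1..200")
```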
12.3 PAC-Learning Bound (Finite Hypothesis Classes)
Theorem. Let $\mathcal{H}$ be a finite hypothesis class. For any distribution $\mathcal{D}$, any target concept, and any $\epsilon, \delta > 0$: if $m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$, then with probability at least $1 - \delta$, any $h \in \mathcal{H}$ consistent with $m$ i.i.d. samples has error at most $\epsilon$.
Proof.
Fix a "bad" hypothesis $h$ with $\mathrm{err}(h) > \epsilon$. The probability it is consistent with one sample: $\Pr[\text{consistent}] \leq 1 - \epsilon$.
Over $m$ i.i.d. samples: $\Pr[h \text{ consistent with all } m] \leq (1 - \epsilon)^m \leq e^{-\epsilon m}$.
Union bound over all bad hypotheses (at most $|\mathcal{H}|$ of them): $\Pr[\text{some bad } h \text{ is consistent}] \leq |\mathcal{H}| e^{-\epsilon m}$.
Set this $\leq \delta$: $m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$. $\square$
Significance: This is the foundational bound in computational learning theory. It connects sample complexity to hypothesis class size logarithmically - explaining why large model families need more data.
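The sample-size formula from the proof, as a small helper (the specific numbers below are hypothetical, for illustration):

```python
from math import ceil, log

def pac_sample_size(h_size: int, eps: float, delta: float) -> int:
    """m >= (1/eps) * (ln|H| + ln(1/delta)) from the proof above."""
    return ceil((log(h_size) + log(1 / delta)) / eps)

# e.g. one million hypotheses, 1% error, 95% confidence
print(pac_sample_size(10**6, eps=0.01, delta=0.05))   # 1682
```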
12.4 Uniform Convergence via \epsilon-Nets
Theorem ($\epsilon$-net argument). For a hypothesis class with VC-dimension $d$, $m = O\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$ samples suffice for uniform convergence of empirical risk to true risk within $\epsilon$.
Proof idea.
The $\epsilon$-net approach discretises the infinite hypothesis class: find a finite $\epsilon$-net $\mathcal{N}$ (of size determined by the VC dimension) such that every $h \in \mathcal{H}$ is close to some $h' \in \mathcal{N}$. Apply the finite class bound to $\mathcal{N}$.
The key technical tool is the symmetrisation argument (Vapnik-Chervonenkis): replace the true risk with an empirical risk on a "ghost sample," then bound the probability of large deviations using Rademacher complexity or growth function bounds.
This is the theoretical foundation of generalisation bounds in machine learning.
13. Proof Patterns in Machine Learning
13.1 Overview
ML proofs recurrently use a small number of structural patterns. Recognising these patterns helps you read and construct proofs more efficiently.
+======================================================+
| ML PROOF PATTERN LANDSCAPE |
+======================================================+
| |
| Bound-Based ---- Union bound + per-element |
| probability control |
| |
| Descent-Based ---- Per-step improvement |
| + telescoping sum |
| |
| Convexity ---- Jensen's inequality / |
| definition verification |
| |
| Symmetry ---- Rademacher / permutation |
| arguments |
| |
| Information ---- Data processing inequality / |
| mutual information bounds |
| |
+======================================================+
13.2 The Union Bound Pattern
Structure: To bound the probability of a "bad event", decompose it into sub-events $B_1, \dots, B_n$ and apply: $\Pr\left[\bigcup_{i=1}^{n} B_i\right] \leq \sum_{i=1}^{n} \Pr[B_i]$.
Used in: PAC-learning, uniform convergence, multi-task bounds, reward model error analysis.
Pro tip: If the union bound is too loose (sum exceeds 1), try:
- Chernoff/Hoeffding bounds for tighter individual terms
- \epsilon-net arguments to reduce the number of terms
- Symmetrisation to avoid direct union bounding
13.3 The Descent Lemma Pattern
Structure:
- Show $f(x_{t+1}) \leq f(x_t) - \Delta_t$ for some positive quantity $\Delta_t$ (the per-step improvement).
- Telescope: $\sum_{t=0}^{T-1} \Delta_t \leq f(x_0) - f(x_T) \leq f(x_0) - f^*$.
- Conclude: $\min_{t < T} \Delta_t \leq \frac{f(x_0) - f^*}{T}$, which yields a convergence rate once $\Delta_t$ is related to the optimality gap or gradient norm.
Used in: Gradient descent convergence, SGD analysis, Adam convergence, policy gradient convergence.
13.4 The Convexity Verification Pattern
Structure: To prove is convex, verify one of:
- Definition: $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$
- First-order: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$
- Second-order: $\nabla^2 f(x) \succeq 0$ for all $x$ (Hessian is PSD)
Example - Cross-entropy is convex in logits: the Hessian of the log-sum-exp term is $\nabla^2_z \log \sum_j e^{z_j} = \mathrm{diag}(p) - p p^\top$ with $p = \mathrm{softmax}(z)$.
This is $\mathrm{diag}(p) - p p^\top$, which is the covariance matrix of a categorical distribution - always PSD.
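A numerical PSD check of $\mathrm{diag}(p) - p p^\top$ for random logits (eigenvalues non-negative up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    z = rng.normal(scale=3.0, size=10)
    p = np.exp(z - z.max())
    p /= p.sum()                               # softmax(z), computed stably
    H = np.diag(p) - np.outer(p, p)            # Hessian of log-sum-exp at z
    assert np.linalg.eigvalsh(H).min() > -1e-12
print("diag(p) - p p^T was PSD for 100 random logit vectors")
```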
13.5 The Symmetrisation Argument
Structure: To bound $\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_S(h)|$ (true vs empirical risk):
- Introduce a "ghost" sample $S'$ of equal size.
- Replace $R(h)$ with the ghost empirical risk $\hat{R}_{S'}(h)$ (close to it with high probability by concentration).
- The supremum is now a function of two finite samples.
- Introduce Rademacher variables $\sigma_i \in \{\pm 1\}$: swapping the $i$-th entries of $S$ and $S'$ leaves the distribution unchanged.
- Bound the Rademacher complexity $\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{\sigma}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$.
This converts an infinite supremum into a computable quantity and is the basis of Rademacher complexity generalisation bounds.
13.6 Proof Strategy Selection Guide
| I want to show... | Technique | Key tool |
|---|---|---|
| Algorithm converges | Descent lemma | Per-step bound + telescope |
| Model generalises | Union bound / Rademacher | PAC / VC / Rademacher |
| Loss is well-behaved | Convexity verification | Hessian PSD check |
| Architecture has capacity | Structural induction | Induction on depth/width |
| Existence of good model | Probabilistic method | Random construction |
| Lower bound on complexity | Contradiction / counting | Information-theoretic |
| Two quantities are equal | Bijection / double counting | Combinatorial identity |
| Gradient computation is correct | Structural induction | Induction on computation graph |
14. Common Mistakes and Pitfalls
14.1 Affirming the Consequent
Fallacy: From $P \Rightarrow Q$ and $Q$, conclude $P$.
Example: "If a model overfits, training loss is low. Training loss is low. Therefore the model overfits." - Invalid. Low training loss could be from a well-generalising model.
Correct reasoning: $P \Rightarrow Q$ and $Q$ tells you nothing about $P$. Only $P \Rightarrow Q$ and $\neg Q$ lets you conclude $\neg P$ (modus tollens).
14.2 Denying the Antecedent
Fallacy: From $P \Rightarrow Q$ and $\neg P$, conclude $\neg Q$.
Example: "If learning rate is too high, training diverges. Learning rate is not too high. Therefore training does not diverge." - Invalid. Training can diverge for other reasons (exploding gradients, numerical issues).
14.3 Confusing Necessary and Sufficient Conditions
| | Necessary | Sufficient |
|---|---|---|
| Form | $Q \Rightarrow P$ | $P \Rightarrow Q$ |
| Meaning | $P$ is required for $Q$ | $P$ guarantees $Q$ |
| Negation | $\neg P \Rightarrow \neg Q$ (without $P$, $Q$ fails) | $\neg P$ says nothing about $Q$ |
Example: "Gradient = 0 is necessary for a minimum, but not sufficient." (Saddle points also have zero gradient.)
14.4 Proof by Example (Invalid)
Verifying $P(n)$ for specific values of $n$ does not prove $\forall n\, P(n)$.
Famous counterexample: $n^2 + n + 41$ is prime for $n = 0, 1, \dots, 39$ - but for $n = 40$ it equals $1681 = 41^2$, which is not prime.
In ML: "The model works on the test set" does not prove "the model works on all inputs." This is precisely why generalisation theory exists.
14.5 Circular Reasoning
Using the conclusion as a premise (possibly through a chain of implications that loops back).
Example: "Neural networks are universal approximators because they can approximate any function because they are universal approximators."
Detection: Trace the chain of reasoning. If any step assumes (even implicitly) what you are trying to prove, the argument is circular.
14.6 Misusing Induction
"All horses are the same colour" fallacy:
"Claim: Any group of horses are the same colour.
Base: . One horse has one colour. OK
Inductive step: Assume any horses are the same colour. Given horses :
- : same colour by IH.
- : same colour by IH.
- These overlap in , so all are the same colour."
The bug: When , the overlap - there is no overlap! The inductive step fails at .
Lesson: Always check that the inductive step works at the boundary ().
14.7 Quantifier Errors
| Written | Intended | Problem |
|---|---|---|
| $\forall \epsilon > 0\ \exists N:\ n \geq N \Rightarrow \lvert a_n - L \rvert < \epsilon$ | For each $\epsilon$, find $N$ | Correct (convergence) |
| $\exists N\ \forall \epsilon > 0:\ n \geq N \Rightarrow \lvert a_n - L \rvert < \epsilon$ | One $N$ works for all $\epsilon$ | Wrong (much stronger) |
In ML context:
- ": with samples, error " - correct PAC statement.
- ": with samples, error " - would mean finite samples give zero error, which is false.
The order of quantifiers matters. Swapping and changes the meaning entirely.
15. Practice Exercises
15.1 Direct Proof Exercises
Exercise 1. Prove: the sum of two odd integers is even.
Exercise 2. Prove: if $a \mid b$ and $b \mid c$, then $a \mid c$ (divisibility is transitive).
Exercise 3. Prove: .
15.2 Contrapositive Exercises
Exercise 4. Prove by contrapositive: if $n^2$ is odd, then $n$ is odd.
Exercise 5. Prove by contrapositive: if $f$ is not Lipschitz continuous, then $f'$ is unbounded (for $f$ differentiable on $\mathbb{R}$).
15.3 Contradiction Exercises
Exercise 6. Prove by contradiction: $\sqrt{3}$ is irrational.
Exercise 7. Prove by contradiction: there is no smallest positive rational number.
Exercise 8. Prove: is irrational.
15.4 Induction Exercises
Exercise 9. Prove by induction: .
Exercise 10. Prove by induction: a binary tree of height $h$ has at most $2^{h+1} - 1$ nodes.
Exercise 11. Prove by strong induction: every positive integer can be expressed as a sum of distinct powers of 2 (binary representation with no repeats).
15.5 Cases and Construction Exercises
Exercise 12. Prove by cases: .
Exercise 13. Construct a bijection between $\{0, 1\}^n$ (binary strings of length $n$) and $\mathcal{P}(\{1, \dots, n\})$ (subsets of $\{1, \dots, n\}$).
15.6 Counting and Probabilistic Exercises
Exercise 14. Use the pigeonhole principle to show: in any group of 5 integers, some pair has the same remainder mod 4.
Exercise 15. Use the handshaking lemma to prove: in any graph, the number of vertices with odd degree is even.
15.7 ML-Specific Exercises
Exercise 16. Prove that softmax output lies in the probability simplex: $\sum_i \mathrm{softmax}(z)_i = 1$ and $\mathrm{softmax}(z)_i > 0$ for all $i$.
Exercise 17. Prove by induction on network depth: the composition of functions that are Lipschitz with constants $L_1, \dots, L_k$ is Lipschitz with constant $L_1 L_2 \cdots L_k$.
Exercise 18. Prove by contradiction: a continuous function on a closed bounded interval attains its maximum (use sequences and the Bolzano-Weierstrass theorem).
15.8 Solution Sketches
Exercise 1 solution sketch:
Let $m = 2a + 1$ and $n = 2b + 1$. Then $m + n = 2a + 2b + 2 = 2(a + b + 1)$. Since $a + b + 1 \in \mathbb{Z}$, the sum is even.
Exercise 6 solution sketch:
Assume $\sqrt{3} = p/q$ with $\gcd(p, q) = 1$. Then $3 = p^2/q^2$, so $p^2 = 3q^2$, hence $3 \mid p$ (since 3 is prime). Write $p = 3k$: $9k^2 = 3q^2$, so $q^2 = 3k^2$. But then $3 \mid q$, and $3 \mid p$ together with $3 \mid q$ contradicts $\gcd(p, q) = 1$.
Exercise 9 solution sketch:
Base (): . OK
IH: Assume .
. OK
Exercise 16 solution sketch:
Positivity: $e^{z_i} > 0$ for all $z_i$, and $\sum_j e^{z_j} > 0$, so $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} > 0$.
Sum to 1: $\sum_i \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{\sum_i e^{z_i}}{\sum_j e^{z_j}} = 1$.
16. Why This Matters for AI
16.1 Proofs as Guarantees
Machine learning operates in a landscape of empirical results. Proofs provide the theoretical guardrails:
| What proofs guarantee | Example |
|---|---|
| Correctness | Backpropagation correctly computes gradients (structural induction, 9.5) |
| Convergence | Gradient descent reaches the optimum at rate $O(1/T)$ (analytic proof, 12.2) |
| Generalisation | PAC bounds on sample complexity (union bound + counting, 12.3) |
| Expressiveness | Universal approximation - networks can represent any continuous function |
| Impossibility | No Free Lunch - no single algorithm dominates on all problems |
| Robustness | Lipschitz bounds <-> adversarial robustness (contrapositive, 4.5) |
Without proofs, we are running experiments in the dark. With proofs, we know why things work (or fail) and when guarantees hold.
16.2 Proofs and Algorithms
There is a deep connection between proofs and algorithms (the Curry-Howard correspondence in its broadest sense):
| Proof concept | Algorithm concept |
|---|---|
| Constructive existence proof | Algorithm that finds the object |
| Inductive proof | Recursive algorithm |
| Case analysis | Branching / conditional logic |
| Contradiction | Exception handling / invariant violation |
| Bound derivation | Complexity analysis |
When you prove a convergence rate of $O(1/T)$, you are simultaneously describing the algorithm (gradient descent) and its guarantee. When you prove existence by construction, you are writing a program.
16.3 Proof Literacy for ML Practitioners
You don't need to prove new theorems to benefit from proof literacy. Knowing how to read proofs lets you:
- Understand assumptions: Every theorem has hypotheses. Knowing them tells you when the guarantee applies to your model and when it doesn't.
- Spot gaps in arguments: "This architecture is better because it has more parameters" - is that a proof? What's the implicit theorem? What are the hidden assumptions?
- Debug theoretical claims: Papers sometimes contain proof errors. Proof literacy lets you verify claims before building on them.
- Transfer results: If you understand why a theorem holds (not just that it holds), you can adapt it to new settings.
16.4 Modern Frontiers - AI for Proofs
The field is coming full circle: AI systems are now being used to discover and verify proofs:
| System | Achievement |
|---|---|
| AlphaProof (DeepMind, 2024) | Solved IMO-level problems via reinforcement learning + formal verification |
| Lean / Mathlib | Formal proof library with growing ML-generated contributions |
| LLEMMA (2023) | Language model fine-tuned for mathematical reasoning |
| Autoformalization | Translating informal proofs to formal (verifiable) proofs via LLMs |
| ATP systems | Automated theorem provers (Vampire, E, Z3) integrated with ML heuristics |
The ability to formalise and verify proofs is becoming a key AI capability, and understanding proof techniques is essential for building and evaluating these systems.
17. Conceptual Bridge
17.1 From Proof Strategy to ML Reasoning
This chapter sits underneath the rest of the curriculum in a quiet but important way. Later topics will look like they are about vectors, derivatives, optimization, probability, or model architecture, but the moment a theorem appears, the same proof patterns return:
- direct proof for algebraic identities and gradient derivations
- contradiction for impossibility results and lower bounds
- induction for sequence length, layer depth, and recursive structure
- construction for algorithms, witnesses, and counterexamples
Proof techniques are therefore not separate from applied ML mathematics. They are the reasoning templates that make later chapters readable.
17.2 What Comes Next
As you move into linear algebra, calculus, probability, and optimization, try to read every result with three questions in mind: what is assumed, what is being concluded, and why this proof strategy fits the shape of the claim. That habit turns proof literacy into practical engineering judgment, because it helps you distinguish empirical patterns from actual guarantees.
References
- Velleman, D.J. How to Prove It: A Structured Approach. Cambridge University Press, 3rd ed., 2019.
- Hammack, R. Book of Proof. Virginia Commonwealth University, 3rd ed., 2018. (Free online)
- Alon, N. & Spencer, J.H. The Probabilistic Method. Wiley, 4th ed., 2016.
- Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- Boyd, S. & Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
- Goemans, M.X. & Williamson, D.P. "Improved approximation algorithms for maximum cut..." JACM 42(6), 1995.
- Erdos, P. "Some remarks on the theory of graphs." Bulletin of the AMS 53, 1947.
- Nesterov, Y. Introductory Lectures on Convex Optimization. Springer, 2004.
- Trinh, T. et al. "Solving olympiad geometry without human demonstrations." Nature 625, 2024.
<- Previous: Einstein Summation and Index Notation | Home | Next: Vectors and Spaces ->