Math for LLMs - Mathematical Foundations / Sets and Logic

Sets and Logic, Part 2: Propositional Logic (5) to Advanced Topics (12)


5. Propositional Logic

Propositional logic is the simplest complete system of logical reasoning. It studies how the truth of compound statements depends on the truth of their components, without looking inside those components. It is the foundation for Boolean circuits, attention mask logic, SAT solvers, and the systematic reasoning that predicate logic (6) and proof techniques (7) build upon.

5.1 Propositions

A proposition is a declarative statement that is either true (T, 1) or false (F, 0). Never both, never neither.

Examples of propositions:

  • "The sky is blue" - true proposition
  • "2 + 2 = 5" - false proposition
  • "LLaMA-3 has 8B parameters" - true proposition
  • "Every even integer greater than 2 is the sum of two primes" - proposition (truth value unknown; Goldbach's conjecture)

Not propositions:

  • "What is your name?" - question, no truth value
  • "Close the door" - command, no truth value
  • "x > 3" - depends on x; becomes a proposition only when x is specified (this is a predicate, see 6)

Propositional variables: Letters p, q, r, s, … stand for unspecified propositions. Each has a truth value - either T or F.

5.2 Logical Connectives

Connectives combine propositions into compound statements. There are six standard connectives:

Negation (NOT): ¬p

"Not p." Flips the truth value.

p | ¬p
T | F
F | T

AI: Bitwise NOT in attention masks; complement of a token set.

Conjunction (AND): p ∧ q

"p and q." True only when both are true.

p | q | p ∧ q
T | T | T
T | F | F
F | T | F
F | F | F

AI: Combining attention masks (token must be unmasked AND within window); intersecting constraints in structured generation.
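To make the mask-combination point concrete, here is a minimal pure-Python sketch. The 6-token size, the window width, and the convention that m[i][j] is True when query i may attend to key j are all illustrative assumptions of this example:

```python
# Sketch: combining a causal mask and a sliding-window mask with AND.
# Convention (assumed): entry m[i][j] is True iff query i may attend to key j.
n, window = 6, 3

causal = [[j <= i for j in range(n)] for i in range(n)]            # no lookahead
sliding = [[abs(i - j) < window for j in range(n)] for i in range(n)]

# A key is visible iff it is unmasked in BOTH masks: p ∧ q, entrywise.
combined = [[causal[i][j] and sliding[i][j] for j in range(n)] for i in range(n)]

# Query 5 sees only keys 3, 4, 5: causal (j <= 5) AND in-window (|5 - j| < 3).
print([j for j in range(n) if combined[5][j]])  # → [3, 4, 5]
```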

Disjunction (OR): p ∨ q

"p or q." True when at least one is true (inclusive or).

p | q | p ∨ q
T | T | T
T | F | T
F | T | T
F | F | F

AI: Union of token sets in structured generation; "match either pattern" in retrieval.

Exclusive Or (XOR): p ⊕ q

"p xor q." True when exactly one is true.

p | q | p ⊕ q
T | T | F
T | F | T
F | T | T
F | F | F

AI: Symmetric difference of sets; error detection (XOR of prediction and ground truth bits); XOR is the simplest function not linearly separable - a single perceptron cannot learn it (Minsky & Papert 1969).

Implication (IF...THEN): p → q

"If p then q." Also read "p implies q." False only when p is true and q is false - a true hypothesis cannot lead to a false conclusion.

p | q | p → q
T | T | T
T | F | F
F | T | T
F | F | T

Vacuous truth: When p is false, p → q is always true regardless of q. This is often counterintuitive but essential:

  • "If pigs fly, then the moon is green" is true (vacuously)
  • The reason: p → q promises "whenever p holds, q holds." If p never holds, the promise is never violated, so it is true by default.

Vacuous truth in mathematics and AI:

  • "∀x ∈ ∅, property P(x) holds" - true, because there is nothing to check
  • "∅ ⊆ A" - true, because every element of ∅ (there are none) is in A
  • "If the model reaches 100% accuracy, then it has generalised" - vacuously true if the model never reaches 100%

Biconditional (IF AND ONLY IF): p ↔ q

"p if and only if q" (abbreviated "iff"). True when p and q have the same truth value.

p | q | p ↔ q
T | T | T
T | F | F
F | T | F
F | F | T

Equivalently: p ↔ q ≡ (p → q) ∧ (q → p).

AI: Set equality (A = B ⟺ A ⊆ B ∧ B ⊆ A); defining exactly when a condition holds.

5.3 Operator Precedence

When multiple connectives appear without parentheses, the standard precedence (highest first) is:

¬  >  ∧  >  ∨  >  →  >  ↔

Examples:

  • ¬p ∧ q means (¬p) ∧ q, not ¬(p ∧ q)
  • p ∨ q → r means (p ∨ q) → r, not p ∨ (q → r)
  • p ∧ q ∨ r means (p ∧ q) ∨ r, not p ∧ (q ∨ r)

Best practice: Always use parentheses when there's any ambiguity. Clarity is more important than brevity.

5.4 Tautologies, Contradictions, and Contingencies

Tautology: A formula that is true for all possible truth value assignments to its variables. Logically necessary - cannot possibly be false.

Tautology | Name
p ∨ ¬p | Law of excluded middle
p → p | Self-implication
(p ∧ q) → p | Simplification
(p → q) ↔ (¬p ∨ q) | Implication equivalence

Contradiction (Unsatisfiable): A formula that is false for all truth value assignments.

Contradiction | Why
p ∧ ¬p | Cannot be both true and false
(p → q) ∧ p ∧ ¬q | Modus ponens violated

Contingency: A formula that is true for some assignments and false for others.

  • p ∧ q - true when both true, false otherwise
  • p → q - false only when p is T and q is F

Satisfiability: A formula is satisfiable if at least one truth assignment makes it true. Tautologies and contingencies are satisfiable; contradictions are not.

The SAT problem: Given a propositional formula φ, is φ satisfiable? This is the canonical NP-complete problem (Cook-Levin theorem, 1971). Despite worst-case intractability, modern SAT solvers handle millions of variables - they are critical for AI constraint solving, formal verification, and automated planning.
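Brute-force enumeration of all 2^n assignments is hopeless at scale, but it is a faithful executable definition of satisfiability. A minimal sketch; the (name, polarity) literal encoding and the function name are assumptions of this example:

```python
from itertools import product

def satisfiable(variables, clauses):
    """Brute-force SAT check. clauses is a CNF: each clause is a list of
    (name, polarity) literals; polarity False means the literal is negated."""
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        # CNF is true iff every clause has at least one true literal.
        if all(any(assign[v] == pol for v, pol in clause) for clause in clauses):
            return True
    return False

# (p ∨ q) ∧ (¬p ∨ q) ∧ (p ∨ ¬q) is satisfied by p = q = T
print(satisfiable(["p", "q"], [[("p", True), ("q", True)],
                               [("p", False), ("q", True)],
                               [("p", True), ("q", False)]]))   # → True

# p ∧ ¬p is a contradiction: no assignment works
print(satisfiable(["p"], [[("p", True)], [("p", False)]]))      # → False
```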

5.5 Logical Equivalence

Two formulas φ and ψ are logically equivalent (written φ ≡ ψ) if they have the same truth value for every possible truth assignment. Equivalently: φ ↔ ψ is a tautology.

Here are the fundamental equivalences - the "algebra rules" of logic:

Double Negation:

¬¬p ≡ p

De Morgan's Laws:

¬(p ∧ q) ≡ ¬p ∨ ¬q
¬(p ∨ q) ≡ ¬p ∧ ¬q

These are the most used laws in practice. Notice they mirror the set-theoretic De Morgan's Laws (3.6) exactly - this is the Boolean algebra correspondence (5.7).

Implication Equivalences:

p → q ≡ ¬p ∨ q
p → q ≡ ¬q → ¬p   (contrapositive)
¬(p → q) ≡ p ∧ ¬q

The contrapositive equivalence is used constantly in proofs (see 7.3). It says "p implies q" is the same as "not q implies not p." Distinguish it from the converse (q → p), which is NOT equivalent to p → q.

Distributive Laws:

p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)
p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r)

Absorption Laws:

p ∧ (p ∨ q) ≡ p
p ∨ (p ∧ q) ≡ p

Idempotent Laws:

p ∧ p ≡ p
p ∨ p ≡ p

Identity Laws:

p ∧ T ≡ p
p ∨ F ≡ p

Domination Laws:

p ∧ F ≡ F
p ∨ T ≡ T

Complement Laws:

p ∧ ¬p ≡ F
p ∨ ¬p ≡ T
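Every one of these equivalences can be verified mechanically by enumerating truth assignments. A small sketch; the helper names are illustrative:

```python
from itertools import product

def equivalent(f, g, nvars):
    """Two formulas are logically equivalent iff they agree on every assignment."""
    return all(f(*bits) == g(*bits)
               for bits in product([False, True], repeat=nvars))

implies = lambda p, q: (not p) or q           # p → q ≡ ¬p ∨ q

# De Morgan: ¬(p ∧ q) ≡ ¬p ∨ ¬q
print(equivalent(lambda p, q: not (p and q),
                 lambda p, q: (not p) or (not q), 2))       # → True

# Contrapositive: p → q ≡ ¬q → ¬p
print(equivalent(lambda p, q: implies(p, q),
                 lambda p, q: implies(not q, not p), 2))    # → True

# The converse is NOT equivalent: p → q vs q → p
print(equivalent(lambda p, q: implies(p, q),
                 lambda p, q: implies(q, p), 2))            # → False
```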

5.6 Normal Forms

Literal: A variable or its negation (p or ¬p).

Clause: A disjunction of literals (p ∨ ¬q ∨ r).

Term: A conjunction of literals (p ∧ ¬q ∧ r).

Conjunctive Normal Form (CNF)

A formula in CNF is a conjunction of clauses - an AND of ORs:

(p ∨ q ∨ ¬r) ∧ (¬p ∨ s) ∧ (r ∨ ¬s)

Every propositional formula can be converted to an equivalent CNF. SAT solvers require CNF input - the DPLL and CDCL algorithms operate exclusively on CNF.

Disjunctive Normal Form (DNF)

A formula in DNF is a disjunction of terms - an OR of ANDs:

(p ∧ q ∧ ¬r) ∨ (¬p ∧ s) ∨ (r ∧ ¬s)

Every formula can be converted to DNF. A DNF is satisfiable iff any single term is satisfiable (no complementary literals in a term), which is easy to check - but the conversion to DNF may be exponentially large.

Conversion to CNF (Algorithm)

Given an arbitrary formula φ:

  1. Eliminate biconditionals: p ↔ q ⟶ (p → q) ∧ (q → p)
  2. Eliminate implications: p → q ⟶ ¬p ∨ q
  3. Push negations inward (De Morgan's + double negation):
    • ¬(p ∧ q) ⟶ ¬p ∨ ¬q
    • ¬(p ∨ q) ⟶ ¬p ∧ ¬q
    • ¬¬p ⟶ p
  4. Distribute ∨ over ∧: p ∨ (q ∧ r) ⟶ (p ∨ q) ∧ (p ∨ r)

Example: Convert p → (q ∧ r) to CNF:

  • Step 2: ¬p ∨ (q ∧ r)
  • Step 4: (¬p ∨ q) ∧ (¬p ∨ r) - this is CNF ✓
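The four steps can be sketched as a recursive rewriter. This is a minimal illustration rather than a production converter; the tuple encoding of formulas and the function names are assumptions of this sketch:

```python
# Formulas as nested tuples: a variable is a string; ("not", f),
# ("and", f, g), ("or", f, g), ("imp", f, g), ("iff", f, g).

def to_nnf(f):
    """Steps 1-3: remove ↔ and →, then push ¬ inward (negation normal form)."""
    if isinstance(f, str):
        return f
    op = f[0]
    if op == "iff":                        # p ↔ q ⟶ (p → q) ∧ (q → p)
        return to_nnf(("and", ("imp", f[1], f[2]), ("imp", f[2], f[1])))
    if op == "imp":                        # p → q ⟶ ¬p ∨ q
        return to_nnf(("or", ("not", f[1]), f[2]))
    if op == "not":
        g = f[1]
        if isinstance(g, str):
            return ("not", g)
        if g[0] == "not":                  # ¬¬p ⟶ p
            return to_nnf(g[1])
        if g[0] == "and":                  # ¬(p ∧ q) ⟶ ¬p ∨ ¬q
            return to_nnf(("or", ("not", g[1]), ("not", g[2])))
        if g[0] == "or":                   # ¬(p ∨ q) ⟶ ¬p ∧ ¬q
            return to_nnf(("and", ("not", g[1]), ("not", g[2])))
        return to_nnf(("not", to_nnf(g)))  # ¬(→), ¬(↔): rewrite inside first
    return (op, to_nnf(f[1]), to_nnf(f[2]))

def distribute(f):
    """Step 4: distribute ∨ over ∧ until the formula is an AND of ORs."""
    if isinstance(f, str) or f[0] == "not":
        return f
    a, b = distribute(f[1]), distribute(f[2])
    if f[0] == "and":
        return ("and", a, b)
    if not isinstance(a, str) and a[0] == "and":   # (x ∧ y) ∨ b
        return ("and", distribute(("or", a[1], b)), distribute(("or", a[2], b)))
    if not isinstance(b, str) and b[0] == "and":   # a ∨ (x ∧ y)
        return ("and", distribute(("or", a, b[1])), distribute(("or", a, b[2])))
    return ("or", a, b)

# p → (q ∧ r) becomes (¬p ∨ q) ∧ (¬p ∨ r), matching the worked example
print(distribute(to_nnf(("imp", "p", ("and", "q", "r")))))
# → ('and', ('or', ('not', 'p'), 'q'), ('or', ('not', 'p'), 'r'))
```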

5.7 Boolean Algebra

The laws of propositional logic form an algebraic structure called a Boolean algebra: (B, ∧, ∨, ¬, 0, 1) satisfying all the equivalences listed in 5.5.

The fundamental connection - Stone Representation Theorem: Every Boolean algebra is isomorphic to an algebra of sets:

Logic | Sets | Python
∧ (AND) | ∩ (intersection) | &
∨ (OR) | ∪ (union) | "|"
¬ (NOT) | complement | ~
F (false) | ∅ | set()
T (true) | U | universal set
→ (implies) | ⊆ | <=
↔ (iff) | = | ==

This is why De Morgan's laws look the same for sets and for logic - they are the same, in the precise algebraic sense. Boolean algebra unifies propositional logic, set operations, and circuit design under a single framework.
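The correspondence can be played out directly on Python sets. The universe U = {0, …, 9} and the sets A, B below are arbitrary examples chosen for this sketch:

```python
# The logic-sets dictionary from the table, exercised on concrete sets.
U = frozenset(range(10))        # the universe plays the role of T
A = frozenset({1, 2, 3})
B = frozenset({3, 4})

print(A & B == frozenset({3}))            # ∧ ↔ ∩ : True
print(A | B == frozenset({1, 2, 3, 4}))   # ∨ ↔ ∪ : True
print((U - A) & A == frozenset())         # complement behaves like ¬ : True
print(A <= U)                             # → ↔ ⊆ : True (everything "implies" T)

# De Morgan reads identically for sets: ¬(A ∩ B) = ¬A ∪ ¬B
print(U - (A & B) == (U - A) | (U - B))   # → True
```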

AI connections:

  • Attention masks as Boolean matrices: An attention mask is a Boolean matrix M ∈ {0, 1}^(n×n) where M_ij = 1 iff query i attends to key j. Combining masks (causal AND sliding window) is Boolean AND.
  • Structured generation FSMs: The transition function of a finite state machine operates on Boolean states. Valid token sets at each state are computed via Boolean operations.
  • Circuit complexity of neural networks: A ReLU network computes a piecewise linear function; each linear region corresponds to a Boolean pattern of active/inactive neurons - the network's behaviour is described by a Boolean function over activation patterns.

6. Predicate Logic (First-Order Logic)

Propositional logic treats statements as atomic units - it cannot look inside them. Predicate logic (FOL) opens them up, exposing the objects, properties, and quantifiers that make mathematical reasoning powerful. Every formal specification in AI - from type systems to knowledge bases to verification conditions - ultimately rests on predicate logic.

6.1 Predicates and Quantifiers - Motivation

The statement "every prime greater than 2 is odd" cannot be expressed in propositional logic. We need:

  1. A way to talk about individual objects (numbers, tokens, vectors)
  2. A way to express properties of objects ("is prime," "is odd")
  3. A way to say "for all" or "there exists"

Predicate logic provides all three.

6.2 Syntax of FOL

Terms: The objects we talk about.

  • Constants: Specific objects (0, 1, π, [PAD], [CLS])
  • Variables: Placeholders for objects (x, y, z)
  • Function symbols: Operations that produce objects (f(x), x + y, embed(t))

Predicates (relation symbols): Properties of or relationships between objects that evaluate to true or false.

  • Unary: Prime(x), IsToken(t), Positive(x)
  • Binary: x < y, SameCluster(a, b), Attends(i, j)
  • n-ary: Between(x, y, z), meaning y is between x and z

Quantifiers: The two pillars of FOL.

Universal Quantifier: ∀ ("for all")

∀x P(x)

"For every x, P(x) holds." True when P(a) is true for every object a in the domain.

Examples:

  • ∀x ∈ ℝ, x² ≥ 0 - "every real number squared is non-negative"
  • ∀t ∈ V, embed(t) ∈ ℝ^d - "every token maps to a d-dimensional vector"
  • ∀i, j ∈ [n], i > j ⟹ M_ij = 0 - the causal mask definition

Existential Quantifier: ∃ ("there exists")

∃x P(x)

"There exists an x such that P(x) holds." True when P(a) is true for at least one object a in the domain.

Examples:

  • ∃x ∈ ℝ, x² = 2 - "√2 exists (in ℝ)"
  • ∃t ∈ V, P(next = t | context) > 0.5 - "there's a token with probability above 0.5"
  • ∃θ, L(θ) = 0 - "a global minimum exists" (not always true!)

Unique Existential: ∃! ("there exists exactly one")

∃!x P(x) ≡ ∃x [P(x) ∧ ∀y (P(y) → y = x)]

"There is exactly one x satisfying P(x)." This is a defined symbol - it is syntactic sugar for the expression on the right.

6.3 Free and Bound Variables

A variable x is bound in a formula if it appears within the scope of a quantifier ∀x or ∃x. Otherwise, it is free.

  • In ∀x (x > y): x is bound, y is free
  • In ∃x P(x, y) ∧ Q(z): x is bound, y and z are free
  • In ∀x ∃y (x + y = 0): both x and y are bound

A sentence (closed formula) has no free variables - it has a definite truth value. A formula with free variables is a predicate - it becomes a sentence only when specific values are substituted for the free variables or when they are quantified.

AI connection: In a loss function L(θ) = (1/N) Σ_{i=1}^{N} ℓ(f_θ(x_i), y_i), the index i is bound (by the sum, a disguised ∀), while θ is free - this is why we optimise over θ.

6.4 Negating Quantifiers

Negation interacts with quantifiers via two fundamental rules:

¬∀x P(x) ≡ ∃x ¬P(x)

"Not everything satisfies P" ≡ "Something fails to satisfy P."

¬∃x P(x) ≡ ∀x ¬P(x)

"Nothing satisfies P" ≡ "Everything fails to satisfy P."

These are the quantifier analogues of De Morgan's laws. The pattern generalises:

  • Negation swaps ∀ and ∃
  • Negation swaps ∧ and ∨
  • Negation swaps ∩ and ∪

AI example: "Not all tokens are attended to" \equiv "There exists a token that is not attended to":

¬∀j Attends(i, j) ≡ ∃j ¬Attends(i, j)
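Over a finite domain, Python's all and any behave exactly like ∀ and ∃, so the rule can be tested directly. The attends relation and the domain below are made-up examples for this sketch:

```python
# ¬∀ ≡ ∃¬ over a finite domain, using all/any as ∀/∃.
domain = range(8)
attends = lambda i, j: abs(i - j) <= 2   # hypothetical attention relation
i = 0

lhs = not all(attends(i, j) for j in domain)   # ¬∀j Attends(i, j)
rhs = any(not attends(i, j) for j in domain)   # ∃j ¬Attends(i, j)

# Token 0 does not attend to token 5, so both sides are True.
print(lhs, rhs)  # → True True
```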

6.5 Nested Quantifiers

Quantifiers can be nested. The order matters for mixed quantifiers:

∀x ∃y P(x, y)  ≠  ∃y ∀x P(x, y)

The first says: "for every x, there is a (possibly different) y such that P(x, y)." The second says: "there is a single y that works for all x."

Comparison with examples:

Statement | Formal | True?
"For every real, there's a larger one" | ∀x ∈ ℝ ∃y ∈ ℝ (y > x) | True
"There's a real larger than all reals" | ∃y ∈ ℝ ∀x ∈ ℝ (y > x) | False
"Every continuous function on [a, b] attains its max" | ∀f ∈ C[a, b] ∃x* ∈ [a, b] ∀x ∈ [a, b], f(x*) ≥ f(x) | True (EVT)
"For every ε > 0, there exists δ > 0 ..." | ∀ε ∃δ ∀x ... | Continuity definition

Order rule: Swapping ∀ and ∃ is valid in only one direction:

∃y ∀x P(x, y) ⟹ ∀x ∃y P(x, y)

The converse is NOT generally true. A single universal y is a strictly stronger claim than an individual y(x) for each x.

AI example - convergence of SGD:

  • ∀ε > 0 ∃T ∀t > T, |L(θ_t) - L*| < ε means "the loss converges to L*"
  • Here T depends on ε - this is the ∀∃ pattern
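The order-sensitivity is easy to see on finite domains, again reading all/any as ∀/∃. The predicate y = x + 1 and the domains below are illustrative choices for this sketch:

```python
# ∀x ∃y P(x, y) vs ∃y ∀x P(x, y) on finite domains.
X = range(9)    # x ∈ {0, …, 8}
Y = range(10)   # y ∈ {0, …, 9}
P = lambda x, y: y == x + 1

forall_exists = all(any(P(x, y) for y in Y) for x in X)   # ∀x ∃y P(x, y)
exists_forall = any(all(P(x, y) for x in X) for y in Y)   # ∃y ∀x P(x, y)

# Each x has its own witness y = x + 1, but no single y works for every x.
print(forall_exists, exists_forall)  # → True False
```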

6.6 Translating Between English and FOL

Translating natural language to FOL is a critical skill for formal verification, knowledge representation, and understanding what mathematical theorems actually say.

Systematic approach:

  1. Identify the domain (what are the objects?)
  2. Identify the predicates (what properties/relations?)
  3. Identify the quantifier structure (\forall, \exists, their order)
  4. Write the formal expression
  5. Check by reading it back in English

Practice translations:

English | FOL
"All vectors in S have unit norm" | ∀v ∈ S, ‖v‖ = 1
"Some weight is negative" | ∃w ∈ W, w < 0
"No gradient is zero" | ∀g ∈ G, g ≠ 0, or equivalently ¬∃g ∈ G, g = 0
"If all gradients vanish, we're at a critical point" | (∀i, ∇_i f = 0) → CriticalPoint(θ)
"For every layer, there exists a neuron that fires" | ∀ℓ ∃n Fires(ℓ, n)
"There's a learning rate that works for all tasks" | ∃η ∀task Converges(η, task) (probably false!)

6.7 FOL in AI Contexts

Knowledge graphs store facts as FOL ground atoms: LocatedIn(Paris, France), InstanceOf(GPT-4, LLM). Rules are FOL implications: ∀x, y, z [LocatedIn(x, y) ∧ LocatedIn(y, z) → LocatedIn(x, z)].

Description logics (OWL, used in ontologies) are decidable fragments of FOL. They balance expressiveness with computational tractability - reasoning in full FOL is undecidable.

Program specification: Hoare logic uses FOL to specify pre/post-conditions: {P} program {Q} means "if P holds before, Q holds after."

LLM reasoning: When an LLM performs chain-of-thought reasoning, it is (approximately) performing FOL inference - applying modus ponens, instantiating universals, existential witnesses. The fidelity of this approximation is an active research question.


7. Proof Techniques

This chapter only needs a working preview of proof methods so later sections can state theorems precisely and you can follow the logic of an argument without being surprised by the structure. The full treatment lives in Proof Techniques, where each method is developed in much more depth.

For now, the right goal is recognition:

  • What kind of claim is being proved?
  • What assumptions are being used?
  • Why is a specific proof strategy a natural fit?

7.1 What is a Proof?

A proof is a finite chain of justified steps from assumptions, definitions, axioms, and previously established results to a conclusion. In this section, sets and logic give us the vocabulary; proofs tell us how that vocabulary is used to certify statements rather than merely suggest them.

That distinction matters in AI. A numerical experiment may suggest that an optimizer converges or that a masking rule preserves causality, but a proof tells you exactly under which assumptions the conclusion must hold.

7.2 Direct Proof

In a direct proof, you start from the hypothesis and push forward to the conclusion. This is the default style for statements of the form "P → Q" when the path from premises to result is transparent.

Example: to show that the sum of two even integers is even, write them as 2a and 2b, add them, and observe the result is again divisible by 2. This style mirrors many derivations in ML, where you begin with model assumptions and algebraically derive a bound or identity.

7.3 Proof by Contrapositive

Sometimes the forward direction is awkward, but the logically equivalent statement ¬Q → ¬P is easier. That is proof by contrapositive, and it is why the implication equivalence in 5.5 matters operationally, not just symbolically.

Classic example: to prove "if n² is even, then n is even," it is simpler to show "if n is odd, then n² is odd." In theory-heavy AI writing, contraposition shows up in impossibility and lower-bound arguments where failure of the conclusion exposes useful structure in the assumptions.

7.4 Proof by Contradiction

In proof by contradiction, you assume the statement is false and derive an impossibility. This is especially common when proving that some object cannot exist, that a property must hold universally, or that a certain finite description is impossible.

Famous examples include proving that √2 is irrational and that there are infinitely many primes. In AI-flavored mathematics, contradiction often appears when ruling out pathological counterexamples or showing that a supposed optimum or invariant would violate earlier assumptions.

7.5 Proof by Cases

When a domain naturally splits into a few exhaustive possibilities, you can prove a statement separately in each branch. This is proof by cases.

For example, the claim |x| ≥ 0 is easiest by splitting into x ≥ 0 and x < 0. The same structure appears in algorithms and model analysis whenever behavior differs across regimes such as active vs inactive ReLU units, interior vs boundary points, or short-context vs long-context cases.

7.6 Mathematical Induction

Induction proves statements indexed by the natural numbers. You establish a base case, then show that if the claim holds at step k, it also holds at step k + 1. Once both parts are in place, the result propagates to all later indices.

This matters for AI because many objects are recursive or sequential: token prefixes, layers in a stack, rollout horizons, and sequence length arguments are all natural induction targets. We will use the full machinery in the dedicated proof section when sequence-model reasoning becomes more central.

7.7 Existence and Uniqueness Proofs

An existence proof shows that at least one object with the desired property exists. A uniqueness proof shows there can be at most one. Together, they establish that there is exactly one such object.

This distinction is everywhere in optimization and model theory. "A minimizer exists" and "the minimizer is unique" are very different statements, with different consequences for learning dynamics and interpretation.

7.8 Proof by Construction

A constructive proof does more than assert existence: it explicitly builds the object. This is especially valuable in computer science and AI because a constructive argument often doubles as an algorithm.

For instance, if a theorem says there exists a network, partition, mask, or encoding with certain properties, a constructive proof tells you how to produce one. That is why constructive arguments are often more actionable than purely existential ones.

7.9 Why This Is Only a Preview

At this stage, proof techniques should feel like a toolbox you can recognize, not yet one you have fully mastered. The important bridge is:

  • logic gives the form of valid inference
  • sets provide the objects and predicates we talk about
  • proofs combine both into reliable mathematical arguments

When you are ready to go deeper, continue with Proof Techniques, where these methods are expanded with full templates, worked examples, and more explicit AI applications.


8. Axiomatic Set Theory (ZFC)

Naive set theory (2) is intuitive but broken - Russell's Paradox (2.5) shows that unrestricted set formation leads to contradictions. ZFC (Zermelo-Fraenkel axioms with the Axiom of Choice) is the standard rigorous foundation for all of modern mathematics. Every mathematical object - numbers, functions, spaces, probability measures, neural networks - can be built from ZFC sets.

8.1 Why Axioms?

Axioms serve two purposes:

  1. Prevent paradoxes: By restricting which sets exist, ZFC avoids Russell's Paradox and similar contradictions.
  2. Ensure consistency: All mathematical reasoning ultimately reduces to statements provable from these axioms.

The axioms tell you exactly which sets you are allowed to construct and what operations you may perform on them. Nothing more, nothing less.

8.2 The ZFC Axioms

There are 9 axioms (some formulations give 8 or 10; the exact count depends on grouping). Here they are, with intuition and notation.

Axiom 1: Extensionality

∀A ∀B [∀x (x ∈ A ↔ x ∈ B) → A = B]

Two sets are equal if and only if they have the same elements. A set is determined entirely by its membership - there is no "hidden structure."

Consequence: {1, 2, 3} = {3, 1, 2} = {1, 1, 2, 3}. Order and repetition are irrelevant.

Axiom 2: Empty Set

∃∅ ∀x (x ∉ ∅)

There exists a set with no elements. By Extensionality, this set is unique - there is only one empty set.

Axiom 3: Pairing

∀a ∀b ∃C ∀x (x ∈ C ↔ x = a ∨ x = b)

For any two objects a and b, the set {a, b} exists. This lets us form pairs, which underpins ordered pairs and Cartesian products.

Axiom 4: Union

∀F ∃U ∀x (x ∈ U ↔ ∃A ∈ F, x ∈ A)

For any collection of sets F, there exists a set U = ⋃F containing exactly the elements that belong to at least one set in F.

Axiom 5: Power Set

∀A ∃P ∀B (B ∈ P ↔ B ⊆ A)

For any set A, the power set P(A) exists. This is the set of all subsets of A, and |P(A)| = 2^|A| for finite A.
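For finite sets, the power set and its 2^|A| count can be computed directly. A small sketch using the standard library; the function name is an illustrative choice:

```python
from itertools import combinations

def powerset(A):
    """All subsets of A, as frozensets (so subsets can live inside a set)."""
    items = list(A)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}

P = powerset({1, 2, 3})
print(len(P))                                           # → 8, i.e. 2**3
print(frozenset() in P and frozenset({1, 2, 3}) in P)   # → True: ∅ and A itself
```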

Axiom 6: Separation (Specification / Comprehension)

∀A ∃B ∀x (x ∈ B ↔ x ∈ A ∧ φ(x))

For any set A and any property φ, the set {x ∈ A | φ(x)} exists. Crucially, you can only separate out elements from an existing set - you cannot form {x | φ(x)} without a bounding set A.

This is what blocks Russell's Paradox. The "set of all sets that don't contain themselves" would require {x | x ∉ x} - but without a bounding set A, Separation doesn't let you form it.

Axiom 7: Infinity

∃I [∅ ∈ I ∧ ∀x (x ∈ I → x ∪ {x} ∈ I)]

There exists an infinite set - specifically, a set containing ∅, {∅}, {∅, {∅}}, and so on forever. This axiom guarantees the existence of the natural numbers (and prevents mathematics from being stuck in a finite world).

Axiom 8: Replacement

∀A ∀F [(F is a function) → ∃B, B = {F(x) | x ∈ A}]

If F is a definable function and A is a set, then the image {F(x) | x ∈ A} is a set. This is like Separation but more powerful - it lets you apply a function to every element and collect the results.

AI analogy: Replacement is map. Given a set (dataset) and a function (model), the outputs form a set (predictions).

Axiom 9: Foundation (Regularity)

∀A [A ≠ ∅ → ∃x ∈ A (x ∩ A = ∅)]

Every non-empty set A has an element disjoint from A. This prevents circular membership chains (a ∈ b ∈ a) and self-containing sets (a ∈ a). Sets are "well-founded" - they are built from the bottom up (from ∅).

8.3 The Axiom of Choice (AC)

∀F [∅ ∉ F → ∃f : F → ⋃F, ∀A ∈ F, f(A) ∈ A]

For any collection F of non-empty sets, there exists a choice function f that selects exactly one element from each set. This seems obvious but has deep consequences.

Why it's controversial:

  • It's non-constructive: it asserts the existence of a choice function without building one
  • It implies paradoxical results: the Banach-Tarski paradox (decompose a sphere into 5 pieces and reassemble them into two spheres of the same size)
  • It implies the well-ordering theorem: every set can be well-ordered (including ℝ, despite having no known explicit well-ordering)

Why it's accepted:

  • Without AC, many theorems fail (every vector space has a basis; Tychonoff's theorem; Hahn-Banach theorem)
  • ZFC (with AC) has not been shown inconsistent
  • AC is independent of ZF: both ZF + AC and ZF + ¬AC are consistent (if ZF is consistent), by Gödel (1938) and Cohen (1963)

AI applications of AC: Most machine learning theory implicitly uses AC:

  • "There exists a minimiser of the loss function" - may require AC for non-compact spaces
  • "Every vector space has a basis" - needs AC for infinite-dimensional spaces (function spaces in kernel methods)
  • In practice, everything is finite-dimensional, so AC is not needed for actual computations

8.4 Constructing Numbers from Sets

One of the most remarkable achievements of axiomatic set theory is building all number systems from pure sets:

Natural numbers (von Neumann construction):

0 = ∅,  1 = {∅},  2 = {∅, {∅}},  3 = {∅, {∅}, {∅, {∅}}},  …

The rule: n + 1 = n ∪ {n}. Each natural number is the set of all smaller natural numbers: n = {0, 1, …, n-1}, so |n| = n.
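The von Neumann construction can be carried out literally with frozensets; a minimal sketch:

```python
def von_neumann(n):
    """Build the von Neumann natural n as a pure set, via n + 1 = n ∪ {n}."""
    s = frozenset()          # 0 = ∅
    for _ in range(n):
        s = s | {s}          # successor: the set so far, plus itself as an element
    return s

three = von_neumann(3)
# 3 = {0, 1, 2}: it contains exactly the smaller naturals, and |3| = 3.
print(len(three))                                       # → 3
print(all(von_neumann(k) in three for k in range(3)))   # → True
```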

Integers: ℤ is constructed as equivalence classes of pairs of natural numbers: [(a, b)] represents a - b.

Rationals: ℚ is constructed as equivalence classes of pairs of integers: [(p, q)] represents p/q where q ≠ 0.

Reals: ℝ is constructed via Dedekind cuts or Cauchy sequences of rationals.

The takeaway: Every mathematical object - including the real numbers that neural network weights live in - is built from the empty set using ZFC axioms. This is why set theory is called the "foundation of mathematics."

8.5 Ordinals and Cardinals (Brief)

Ordinal numbers extend the natural numbers to describe the "length" of well-orderings:

0, 1, 2, …, ω, ω+1, ω+2, …, ω·2, …, ω², …, ω^ω, …, ε₀, …

Here ω is the first infinite ordinal (the "ordinal type" of ℕ). Ordinals are used to measure the "height" of sets in the cumulative hierarchy.

Cardinal numbers measure the "size" of sets. For finite sets, the cardinal and ordinal notions coincide (|S| = n means S has n elements). For infinite sets, cardinality is defined via bijections - two sets have the same cardinality iff a bijection exists between them. More on this in 10.


9. Logic in Computation and AI

This section bridges the pure theory of 5-8 to the concrete computational world. The key idea: logic is not just a tool for proofs - it is the foundation of computation itself.

9.1 Boolean Circuits

Every digital computer is built from logic gates - physical implementations of the logical connectives:

Gate | Logic | Symbol | Output
AND | p ∧ q | & | 1 iff both inputs are 1
OR | p ∨ q | "|" | 1 iff at least one input is 1
NOT | ¬p | ~ | Flips the bit
NAND | ¬(p ∧ q) | - | 0 iff both inputs are 1
XOR | p ⊕ q | ^ | 1 iff inputs differ

Functional completeness: {AND, OR, NOT} can compute any Boolean function. Even more remarkably, {NAND} alone is functionally complete - every Boolean function can be built from NAND gates.
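The NAND constructions can be demonstrated by exhaustive check over all inputs; a minimal sketch:

```python
# NAND alone suffices: NOT, AND, OR built purely from NAND.
nand = lambda p, q: not (p and q)

NOT = lambda p: nand(p, p)                      # ¬p = p NAND p
AND = lambda p, q: nand(nand(p, q), nand(p, q)) # p ∧ q = ¬(p NAND q)
OR  = lambda p, q: nand(nand(p, p), nand(q, q)) # p ∨ q = ¬p NAND ¬q

# Exhaustive check against the truth tables - only 4 input pairs exist.
for p in (False, True):
    for q in (False, True):
        assert NOT(p) == (not p)
        assert AND(p, q) == (p and q)
        assert OR(p, q) == (p or q)
print("NAND constructions match the truth tables")
```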

From gates to GPUs: GPU cores perform massively parallel Boolean and arithmetic operations. The floating-point units, comparison circuits, and memory addressing logic in a GPU are all Boolean circuits. When you run torch.matmul(A, B), underneath it is billions of transistors implementing Boolean functions.

9.2 SAT Solving

The Boolean satisfiability problem (SAT): given a propositional formula in CNF, is there a truth assignment that makes it true?

DPLL Algorithm (Davis-Putnam-Logemann-Loveland, 1962):

  1. Unit propagation: If a clause has only one unassigned literal left (all its others are false), that literal must be set true
  2. Pure literal elimination: If a variable appears with only one polarity, set it to satisfy those clauses
  3. Branching: Pick an unset variable, try both truth values, recurse
  4. Backtracking: If a contradiction is found, undo the last choice
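
The steps above can be sketched as a miniature DPLL solver (our own illustrative sketch: it implements unit propagation, branching, and backtracking, but omits pure-literal elimination and all the engineering of real solvers). Clauses are frozensets of integer literals, where a positive integer means the variable is true and a negative one means it is false:

```python
def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool} for a CNF formula, or None."""
    if assignment is None:
        assignment = {}
    # 1. Unit propagation: a single-literal clause forces that literal
    unit = next((next(iter(c)) for c in clauses if len(c) == 1), None)
    while unit is not None:
        assignment[abs(unit)] = unit > 0
        new_clauses = []
        for c in clauses:
            if unit in c:
                continue                  # clause already satisfied
            if -unit in c:
                c = c - {-unit}           # falsified literal drops out
                if not c:
                    return None           # empty clause: conflict, backtrack
            new_clauses.append(c)
        clauses = new_clauses
        unit = next((next(iter(c)) for c in clauses if len(c) == 1), None)
    if not clauses:
        return assignment                 # every clause satisfied
    # 3./4. Branching with backtracking: try both values of one variable
    lit = next(iter(clauses[0]))
    for choice in (lit, -lit):
        result = dpll(clauses + [frozenset({choice})], dict(assignment))
        if result is not None:
            return result
    return None

cnf = [frozenset({1, 2}), frozenset({-1, 2}), frozenset({-2, 3})]
model = dpll(cnf)
assert model is not None
# Check the model really satisfies every clause
assert all(any((l > 0) == model.get(abs(l), False) for l in c) for c in cnf)
assert dpll([frozenset({1}), frozenset({-1})]) is None  # unsatisfiable
```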

CDCL (Conflict-Driven Clause Learning): Modern SAT solvers extend DPLL with learned clauses - when a conflict occurs, they analyse why and add a clause preventing the same conflict pattern. This is a form of learning from mistakes.

SAT in AI:

  • Constraint satisfaction: Many combinatorial AI problems reduce to SAT
  • Planning: Classical AI planning (SATPLAN) encodes action sequences as SAT instances
  • Formal verification: Proving neural network properties (robustness, fairness) reduces to SAT/SMT
  • Hardware design: Chip verification uses SAT solvers extensively
  • Structured generation: Ensuring LLM output satisfies a grammar can be framed as constraint satisfaction

9.3 Resolution and Automated Reasoning

Resolution is a single inference rule that is complete for refuting CNF formulas:

$$\frac{(A \vee p) \quad (\neg p \vee B)}{A \vee B}$$

If one clause contains $p$ and another contains $\neg p$, their resolvent combines the remaining literals.

Completeness (Robinson, 1965): A set of clauses is unsatisfiable if and only if the empty clause ($\bot$) can be derived by repeated resolution.

This is the backbone of Prolog and logic programming - computation as proof search.
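
A one-function implementation makes the resolvent concrete (a sketch; representing literals as signed integers is our choice, as in the DPLL example):

```python
def resolve(c1: frozenset, c2: frozenset, p: int) -> frozenset:
    """Resolve two clauses (frozensets of int literals) on variable p.

    Requires p in one clause and -p in the other; the resolvent keeps
    all remaining literals from both clauses.
    """
    assert p in c1 and -p in c2
    return (c1 - {p}) | (c2 - {-p})

# (A or p) and (not-p or B) resolve to (A or B); here A = x1, B = x2, p = x3
assert resolve(frozenset({1, 3}), frozenset({-3, 2}), 3) == frozenset({1, 2})
# Resolving {p} against {not-p} yields the empty clause - a refutation
assert resolve(frozenset({3}), frozenset({-3}), 3) == frozenset()
```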

9.4 Type Theory and Programming

Type systems are deductive systems closely related to logic. The Curry-Howard correspondence establishes a deep isomorphism:

| Logic | Type Theory | Programming |
|---|---|---|
| Proposition | Type | Specification |
| Proof | Term (program) | Implementation |
| $P \to Q$ (implication) | Function type $P \to Q$ | Function from $P$ to $Q$ |
| $P \wedge Q$ (conjunction) | Product type $P \times Q$ | Pair/tuple |
| $P \vee Q$ (disjunction) | Sum type $P + Q$ | Tagged union / either |
| $\forall x\, P(x)$ | Dependent function $\Pi_{x:A} P(x)$ | Generic/polymorphic function |
| $\exists x\, P(x)$ | Dependent pair $\Sigma_{x:A} P(x)$ | Record with witness |
| $\bot$ (false) | Empty type | Unreachable code |
| Proof of $P$ | Value of type $P$ | Running program |

Consequence: A valid type-checked program is a proof of its type signature. Type errors are logical contradictions.
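
The correspondence is visible even in Python's (non-dependent) type hints. In this hypothetical sketch, each typed function is a "proof" of the proposition its signature encodes - conjunction as a pair, disjunction as a union, implication as a function:

```python
from typing import Callable, Tuple, TypeVar, Union

P = TypeVar("P")
Q = TypeVar("Q")
R = TypeVar("R")

# Proof of (P and Q) -> P: project the first component of a product type
def and_elim_left(proof: Tuple[P, Q]) -> P:
    return proof[0]

# Proof of P -> (P or Q): inject into a sum type (tagged union)
def or_intro_left(proof: P) -> Union[P, Q]:
    return proof

# Proof of (P -> Q) and (Q -> R) -> (P -> R): function composition
def compose(f: Callable[[P], Q], g: Callable[[Q], R]) -> Callable[[P], R]:
    return lambda p: g(f(p))
```

Python's type checker is far from a proof assistant, but the shape of the argument is the same: constructing a well-typed term is constructing a proof.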

AI connection - type-safe tensors: Projects like jaxtyping and beartype enforce tensor shape constraints at the type level: Float[Array, "batch seq d_model"]. A shape mismatch is caught as a type error - i.e., a logical contradiction between what the function promises and what it receives.

9.5 Modal Logic (Brief)

Modal logic extends propositional logic with necessity ($\square$) and possibility ($\diamond$):

  • $\square P$: "$P$ is necessarily true" (true in all possible worlds)
  • $\diamond P$: "$P$ is possibly true" (true in some possible world)

In AI:

  • Epistemic logic: $K_a P$ means "agent $a$ knows $P$." Used in multi-agent systems.
  • Temporal logic: $\square P$ means "$P$ always holds" (safety); $\diamond P$ means "$P$ eventually holds" (liveness). Used in model checking and formal verification.
  • Deontic logic: What an AI system "should" do - the logic of obligations and permissions, relevant to AI alignment.

9.6 Description Logics and Knowledge Representation

Description logics (DLs) are decidable fragments of FOL designed for knowledge representation:

  • Concepts (unary predicates): Person, Model, LargeLanguageModel
  • Roles (binary predicates): trainedBy, hasParameter, outputs
  • Constructors: $C \sqcap D$ (intersection), $C \sqcup D$ (union), $\neg C$ (complement), $\forall R.C$ (all $R$-successors are in $C$), $\exists R.C$ (some $R$-successor is in $C$)

DLs underpin the Web Ontology Language (OWL), used in knowledge graphs, biomedical ontologies, and semantic web. The reasoning complexity depends on which constructors are included - a careful balance between expressiveness and decidability.

9.7 LLM Reasoning as Approximate Logic

When an LLM performs chain-of-thought (CoT) reasoning, it is (approximately) performing logical inference:

| Logical step | LLM behaviour | Fidelity |
|---|---|---|
| Modus ponens ($P, P \to Q \vdash Q$) | "Since $P$ and $P$ implies $Q$, therefore $Q$" | High for simple cases |
| Universal instantiation ($\forall x\, P(x) \vdash P(a)$) | "Since all X are Y, this X is Y" | Moderate |
| Contradiction detection | "But this contradicts..." | Fragile |
| Multi-step chains | Chaining 5+ inference steps | Degrades rapidly |

Open research questions:

  • Can LLMs perform reliable deductive reasoning, or only pattern-matched approximation?
  • Can we formally verify LLM reasoning chains against logical rules?
  • Can symbolic logic modules be integrated with neural networks (neuro-symbolic AI)?
  • How does the "chain-of-thought" prompt format affect logical correctness?

Current evidence suggests LLMs are surprisingly good at local logical steps but struggle with long chains and quantifier reasoning (especially nested $\forall\exists$ patterns). This is precisely the gap that formal methods aim to fill.


10. Cardinality and Infinite Sets

Cardinality answers the question: how big is a set? For finite sets the answer is obvious - just count. For infinite sets, the answer is one of the most profound discoveries in mathematics: not all infinities are the same size.

10.1 Finite Cardinality

If $A$ is finite, $|A| = n$ means there exists a bijection $f: A \to \{1, 2, \ldots, n\}$. Counting is bijection.

Key counting facts for AI:

| Formula | Name | AI Example |
|---|---|---|
| $\vert A \cup B \vert = \vert A \vert + \vert B \vert - \vert A \cap B \vert$ | Inclusion-exclusion | Deduplicating token sets |
| $\vert A \times B \vert = \vert A \vert \cdot \vert B \vert$ | Product rule | Size of joint state space |
| $\vert \mathcal{P}(A) \vert = 2^{\vert A \vert}$ | Power set | Number of feature subsets |
| $\vert A^B \vert = \vert A \vert^{\vert B \vert}$ | Exponential | Number of functions from $B$ to $A$ |
| $n! = n \cdot (n-1) \cdots 1$ | Factorial | Permutation count |
| $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ | Binomial | Ways to choose $k$ items from $n$ |
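
All six identities can be verified by brute-force enumeration on small sets using only the standard library (a quick sanity-check sketch, not part of the lesson):

```python
from itertools import combinations, permutations, product
from math import comb, factorial

A, B = {1, 2, 3}, {3, 4}

# Inclusion-exclusion
assert len(A | B) == len(A) + len(B) - len(A & B)
# Product rule: size of the Cartesian product
assert len(set(product(A, B))) == len(A) * len(B)
# Power set has 2^|A| subsets
power_set = [s for k in range(len(A) + 1) for s in combinations(A, k)]
assert len(power_set) == 2 ** len(A)
# Functions B -> A: each of the |B| inputs independently maps into A
assert len(list(product(A, repeat=len(B)))) == len(A) ** len(B)
# Factorial counts permutations; binomial counts k-subsets
assert len(list(permutations(A))) == factorial(len(A))
assert len(list(combinations(A, 2))) == comb(3, 2)
```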

10.2 Comparing Infinite Sets - Bijections

Definition: Two sets $A$ and $B$ have the same cardinality ($|A| = |B|$) if and only if there exists a bijection $f: A \to B$.

This is the only sensible definition for infinite sets - we cannot "count" them, but we can pair their elements one-to-one.

Example - $|\mathbb{N}| = |\mathbb{Z}|$:

This seems wrong - $\mathbb{Z}$ "has twice as many elements." But pair them:

$$0 \mapsto 0, \quad 1 \mapsto 1, \quad 2 \mapsto -1, \quad 3 \mapsto 2, \quad 4 \mapsto -2, \quad 5 \mapsto 3, \quad \ldots$$

Formally: $f(n) = \begin{cases} (n+1)/2 & \text{if } n \text{ is odd} \\ -n/2 & \text{if } n \text{ is even} \end{cases}$

This is a bijection, so $|\mathbb{N}| = |\mathbb{Z}|$. Infinity doesn't work like finite numbers.
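
A quick computational check of the listed pairing (odd naturals go to the positives, even naturals to zero and the negatives):

```python
def f(n: int) -> int:
    """Bijection N -> Z: 0->0, 1->1, 2->-1, 3->2, 4->-2, ..."""
    return (n + 1) // 2 if n % 2 == 1 else -(n // 2)

assert [f(n) for n in range(6)] == [0, 1, -1, 2, -2, 3]

# Sanity-check injectivity and surjectivity on a finite window:
# n = 0..2000 should hit every integer in [-1000, 1000] exactly once.
image = {f(n) for n in range(2001)}
assert len(image) == 2001                 # no collisions -> injective here
assert image == set(range(-1000, 1001))   # every target integer is reached
```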

10.3 Countable Sets

A set $A$ is countably infinite if $|A| = |\mathbb{N}|$ - i.e., there exists a bijection $A \to \mathbb{N}$. A set is countable if it is finite or countably infinite.

Equivalently, $A$ is countable iff its elements can be listed as a sequence $a_1, a_2, a_3, \ldots$ (possibly terminating).

Countable sets include:

| Set | Why countable |
|---|---|
| $\mathbb{N}$ | By definition |
| $\mathbb{Z}$ | Bijection shown above |
| $\mathbb{Q}$ | Cantor's zig-zag (diagonalisation of the $\frac{p}{q}$ table) |
| $\mathbb{N} \times \mathbb{N}$ | Cantor pairing function: $(m, n) \mapsto \frac{(m+n)(m+n+1)}{2} + m$ |
| All finite strings over a finite alphabet $\Sigma$ | $\Sigma^* = \bigcup_{n=0}^\infty \Sigma^n$, a countable union of finite sets |
| All programs (source code) | Programs are finite strings |
| All rational polynomials | Countable coefficients, finite degree |
| Algebraic numbers | Roots of polynomials with integer coefficients |
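
The Cantor pairing function is easy to test on a finite grid (a sketch; the inverse map is omitted). It walks the diagonals $m + n = 0, 1, 2, \ldots$, assigning consecutive codes:

```python
def pair(m: int, n: int) -> int:
    """Cantor pairing function: a bijection N x N -> N."""
    return (m + n) * (m + n + 1) // 2 + m

# Distinct inputs give distinct outputs on a 50 x 50 grid (injectivity check)
codes = {pair(m, n) for m in range(50) for n in range(50)}
assert len(codes) == 2500

# Walking the diagonals m + n = 0, 1, 2, 3 enumerates 0, 1, 2, ..., 9 in order
assert [pair(m, d - m) for d in range(4) for m in range(d + 1)] == list(range(10))
```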

Key theorem: A countable union of countable sets is countable:

$$A_1, A_2, A_3, \ldots \text{ countable} \implies \bigcup_{i=1}^\infty A_i \text{ countable}$$

AI consequence: The set of all possible tokenised prompts (finite strings over a finite vocabulary $V$) is countable. There are $\aleph_0$ possible inputs to an LLM.

10.4 Uncountable Sets - Cantor's Diagonal Argument

Theorem (Cantor, 1891). $\mathbb{R}$ is uncountable - there is no bijection $\mathbb{N} \to \mathbb{R}$.

Proof (diagonal argument). We show that even the interval $(0, 1)$ is uncountable. Assume, for contradiction, that $(0, 1)$ is countable. Then we can list all real numbers in $(0, 1)$:

$$\begin{aligned} r_1 &= 0.d_{11}d_{12}d_{13}d_{14}\ldots \\ r_2 &= 0.d_{21}d_{22}d_{23}d_{24}\ldots \\ r_3 &= 0.d_{31}d_{32}d_{33}d_{34}\ldots \\ &\;\,\vdots \end{aligned}$$

Construct $r^* = 0.d_1^* d_2^* d_3^* \ldots$ where $d_n^*$ is any digit $\neq d_{nn}$ (and $\neq 0, 9$ to avoid alternative decimal representations).

Then $r^* \in (0, 1)$ but $r^* \neq r_n$ for every $n$ (they differ at the $n$-th digit). This contradicts the assumption that all reals in $(0, 1)$ were listed. $\square$
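
The diagonal construction can be run on any finite list of digit expansions (our sketch; choosing digits 4 and 5 is one conventional way to stay away from 0 and 9):

```python
def diagonal(digit_rows):
    """Given a list of decimal-digit rows, build digits of a number that
    differs from the n-th listed number at its n-th digit.
    We pick 5 unless the diagonal digit is 5, in which case 4."""
    return [5 if row[i] != 5 else 4 for i, row in enumerate(digit_rows)]

listed = [
    [1, 4, 1, 5],   # 0.1415...
    [5, 5, 5, 5],   # 0.5555...
    [0, 0, 1, 0],   # 0.0010...
    [9, 9, 9, 5],   # 0.9995...
]
d = diagonal(listed)
assert d == [5, 4, 5, 4]
# The constructed number differs from every listed number on the diagonal
assert all(d[i] != listed[i][i] for i in range(len(listed)))
```

Of course, a program can only handle a finite prefix; the theorem's force is that the same recipe works against any infinite list.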

We write $|\mathbb{N}| = \aleph_0$ ("aleph-null"), the first infinite cardinal. The set $\mathbb{R}$ has cardinality $|\mathbb{R}| = 2^{\aleph_0} = \mathfrak{c}$ (the cardinality of the continuum), which is strictly larger.

10.5 Cantor's Theorem (Generalised)

Theorem. For any set $A$: $|A| < |\mathcal{P}(A)|$.

No set is as large as its power set. This holds for every set, finite or infinite.

Proof sketch. The map $a \mapsto \{a\}$ injects $A$ into $\mathcal{P}(A)$, so $|A| \leq |\mathcal{P}(A)|$. To show strict inequality, suppose $f: A \to \mathcal{P}(A)$ is a bijection. Let $D = \{a \in A \mid a \notin f(a)\}$. Then $D \in \mathcal{P}(A)$, so $D = f(d)$ for some $d$ (by surjectivity). Is $d \in D$?

  • If $d \in D$, then $d \notin f(d) = D$ - contradiction.
  • If $d \notin D$, then $d \in f(d) = D$ - contradiction.

Either way, contradiction. So no such bijection exists. $\square$
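
For finite sets, both the inequality and the diagonal set $D$ from the proof can be checked directly (a sketch; the attempted map `f` below is an arbitrary made-up example):

```python
from itertools import chain, combinations

def power_set(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, k) for k in range(len(s) + 1))]

# |P(A)| = 2^|A| > |A| for every finite A, as the theorem demands
for n in range(6):
    assert len(power_set(set(range(n)))) == 2 ** n > n

# The diagonal set D = {a : a not in f(a)} defeats this attempted surjection f
A = {0, 1, 2}
f = {0: frozenset(), 1: frozenset({0, 1}), 2: frozenset({1, 2})}
D = frozenset(a for a in A if a not in f[a])
assert D not in f.values()   # D is a subset of A that f misses
```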

Consequence - the infinite tower:

$$|\mathbb{N}| < |\mathcal{P}(\mathbb{N})| < |\mathcal{P}(\mathcal{P}(\mathbb{N}))| < \cdots$$

There is no largest infinity. The hierarchy of infinities never ends - indeed, the collection of all cardinals is too large to form a set at all.

10.6 The Continuum Hypothesis

Claim (Cantor, 1878): There is no set $S$ with $|\mathbb{N}| < |S| < |\mathbb{R}|$.

In other words, there is no "intermediate" infinity between the countable and the continuum.

Resolution (Gödel 1940, Cohen 1963): The Continuum Hypothesis is independent of ZFC - it can neither be proved nor disproved from the standard axioms. Both ZFC + CH and ZFC + $\neg$CH are consistent (if ZFC is).

This is one of the most surprising results in mathematics: there are meaningful mathematical questions that our axioms simply cannot answer.

10.7 Cardinality and AI

| Object | Cardinality | Consequence |
|---|---|---|
| Finite vocabulary $V$ | $\vert V \vert$ (finite, e.g., 32,000) | Discrete, tractable |
| All prompts $V^*$ | $\aleph_0$ (countable) | Enumerable in principle |
| All real-valued weight vectors $\mathbb{R}^d$ | $\mathfrak{c}$ | Uncountable - can't enumerate models |
| All functions $\mathbb{R}^d \to \mathbb{R}$ | $\mathfrak{c}^{\mathfrak{c}} > \mathfrak{c}$ | Vastly larger than the set of computable functions |
| All computable functions | $\aleph_0$ | Programs are countable strings |

The gap: There are uncountably many functions but only countably many programs (and neural network architectures, at any fixed precision). Most functions are not computable - and most mathematical objects are not representable by any model. This is a fundamental limitation.


11. Logic and Set Theory in Machine Learning

This section shows how the abstract tools of 2-10 appear concretely in modern ML.

11.1 Formal Languages and Automata

Formal language theory is set theory applied to strings:

  • An alphabet $\Sigma$ is a finite set of symbols
  • A string over $\Sigma$ is a finite sequence of symbols from $\Sigma$
  • A language $L$ is a set of strings: $L \subseteq \Sigma^*$

Connection to AI:

| Concept | ML Instantiation |
|---|---|
| Alphabet $\Sigma$ | Vocabulary $V$ (set of tokens) |
| String | Token sequence (prompt or completion) |
| Language $L$ | Set of valid outputs |
| Grammar | Rules defining valid outputs (JSON schema, Python syntax) |
| Automaton (DFA/NFA) | Constrained decoding state machine |
| Regular language | Languages recognisable by finite automata |
| Context-free language | Languages parseable by pushdown automata |

Structured generation constrains an LLM to produce only strings from a target language $L$. At each decoding step, the valid next tokens form a set $V_{\text{valid}} \subseteq V$, computed by the automaton/parser. Tokens outside $V_{\text{valid}}$ are masked (logits set to $-\infty$). This is Boolean logic applied to decoding.
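
A minimal sketch of that masking step (the token names and scores below are made up for illustration; real decoders operate on tensor logits, not dicts):

```python
import math

def mask_logits(logits: dict, valid: set) -> dict:
    """Tokens outside the valid set get -inf, so softmax gives them probability 0."""
    return {tok: (z if tok in valid else -math.inf) for tok, z in logits.items()}

def softmax(logits: dict) -> dict:
    m = max(logits.values())                       # subtract max for stability
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Suppose the grammar only allows '{' or '[' as the first character of JSON
logits = {"{": 1.0, "[": 0.5, "hello": 3.0, ")": 2.0}
probs = softmax(mask_logits(logits, valid={"{", "["}))
assert probs["hello"] == 0.0 and probs[")"] == 0.0   # invalid tokens eliminated
assert abs(sum(probs.values()) - 1.0) < 1e-12        # still a distribution
```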

11.2 Sets in Probability

The entire framework of probability theory is built on set theory (Kolmogorov axioms):

| Probability concept | Set-theoretic object |
|---|---|
| Sample space $\Omega$ | Set of all possible outcomes |
| Event $A$ | Subset of $\Omega$ ($A \subseteq \Omega$) |
| $P(A \cup B)$ | Probability of $A$ or $B$ |
| $P(A \cap B)$ | Probability of $A$ and $B$ |
| $P(A^c)$ | Probability of not $A$; equals $1 - P(A)$ |
| $\sigma$-algebra $\mathcal{F}$ | Collection of measurable subsets of $\Omega$ |
| Independence: $P(A \cap B) = P(A)P(B)$ | Multiplicativity under intersection |

Measure theory (the rigorous foundation of probability) relies heavily on ZFC: $\sigma$-algebras are sets of sets, measures are set functions, and the Axiom of Choice is needed for results like the Vitali set (a non-measurable set).

11.3 Sets in Linear Algebra

| Linear algebra concept | Set structure |
|---|---|
| Vector space $V$ | Set with addition and scalar multiplication |
| Subspace $W \leq V$ | Subset that is itself a vector space |
| Span | $\text{span}(S) = \{$all linear combinations of $S\}$ - a set |
| Basis | Linearly independent spanning set |
| Eigenspace | $E_\lambda = \{v \mid Av = \lambda v\}$ - defined by set-builder notation |
| Column space | $\text{Col}(A) = \{Ax \mid x \in \mathbb{R}^n\}$ - image of $A$ (a set) |
| Null space | $\text{Null}(A) = \{x \mid Ax = 0\}$ - kernel of $A$ (a set) |

11.4 Logic in Training

Loss functions encode logical conditions:

| Condition | Logic | Loss |
|---|---|---|
| Output $=$ target | Equality | MSE: $\lVert f_\theta(x) - y \rVert^2$ |
| Output matches distribution | $P_\theta \approx P_{\text{data}}$ | KL divergence, cross-entropy |
| Margin condition | $y_i (\mathbf{w} \cdot \mathbf{x}_i) \geq 1$ | Hinge loss: $\max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i))$ |
| Constraint satisfaction | $g(\theta) \leq 0$ | Lagrangian: $\mathcal{L} + \lambda g(\theta)$ |

Training as logic: gradient descent searches for $\theta$ satisfying $\nabla \mathcal{L}(\theta) = 0$ - this is an existential claim: $\exists \theta^*\, \nabla \mathcal{L}(\theta^*) = 0$.

Early stopping as logic: "Stop when the validation loss has not decreased for $k$ epochs" is a quantified condition: $\forall t \in [T, T+k],\, \mathcal{L}_{\text{val}}^{(t)} \geq \mathcal{L}_{\text{val}}^{(T)}$.
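
That quantified condition translates directly into a predicate over the validation-loss history (a sketch; real trainers usually track a best-so-far value with a tolerance):

```python
def should_stop(val_losses: list, patience: int) -> bool:
    """Stop when none of the last `patience` losses improved on the best
    loss seen before that window - i.e. the universally quantified
    condition 'loss has not decreased for patience epochs' holds."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

assert not should_stop([1.0, 0.9, 0.8], patience=3)                  # too early
assert should_stop([1.0, 0.9, 0.8, 0.85, 0.81, 0.9], patience=3)    # plateaued
assert not should_stop([1.0, 0.9, 0.8, 0.85, 0.75, 0.9], patience=3) # improved
```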

11.5 Sets in Model Architecture

Transformer attention as set operations:

The key insight of attention is that it operates on a set of key-value pairs (modulo positional encoding). Self-attention is permutation-equivariant: it treats the input as a set, not a sequence.

  • Query: "What am I looking for?"
  • Key set: $K = \{k_1, k_2, \ldots, k_n\}$ - the set of keys to match against
  • Value set: $V = \{v_1, v_2, \ldots, v_n\}$ - the set of values to aggregate
  • Attention mask: $M \subseteq [n] \times [n]$ - the set of allowed (query, key) pairs

Combining masks:

  • Causal AND local: $M_{\text{causal}} \cap M_{\text{local}}$ - intersection (Boolean AND)
  • Multi-head union: $\bigcup_h M_h$ - each head sees a different subset of positions
  • Padding mask: $M_{\text{pad}} = \{(i, j) \mid j \text{ is not a padding token}\}$
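
Treating masks literally as sets of allowed index pairs, mask combination really is set intersection (a sketch; the sequence length and window size 3 are arbitrary choices):

```python
n = 6  # sequence length

# Masks as sets of allowed (query, key) index pairs
causal = {(i, j) for i in range(n) for j in range(n) if j <= i}
local = {(i, j) for i in range(n) for j in range(n) if abs(i - j) < 3}

combined = causal & local          # intersection = Boolean AND of the masks

assert (5, 4) in combined          # recent past position: allowed by both
assert (5, 0) not in combined      # too far back: blocked by the local window
assert (0, 5) not in combined      # future position: blocked by causality
# Equivalent closed form: causal pairs within the window
assert combined == {(i, j) for (i, j) in causal if i - j < 3}
```

In a real implementation the sets become Boolean tensors and the intersection becomes an elementwise AND, but the set-theoretic picture is identical.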

11.6 Formal Verification of ML Systems

Logic provides tools to prove properties of ML systems rather than merely test them:

| Property | Formal statement | Method |
|---|---|---|
| Robustness | $\forall x',\, \lVert x' - x \rVert < \varepsilon \to \text{class}(x') = \text{class}(x)$ | SMT solving, MILP |
| Fairness | $P(\hat{y} \mid A=0) = P(\hat{y} \mid A=1)$ | Statistical testing + formal constraints |
| Safety | $\forall \text{input},\, \text{output} \in \text{SafeSet}$ | Reachability analysis |
| Monotonicity | $x_i \leq x_i' \to f(x) \leq f(x')$ | Lattice-based verification |

Certified defences use logical proofs (often via abstract interpretation or interval arithmetic) to guarantee that no adversarial perturbation within an $\varepsilon$-ball can change the classification. This is the gold standard for ML safety - moving from empirical robustness to provable robustness.


12. Advanced Topics

12.1 Gödel's Incompleteness Theorems

The most profound results in mathematical logic, with deep implications for AI.

First Incompleteness Theorem (Gödel, 1931): Any consistent formal system $F$ capable of expressing basic arithmetic contains a statement $G$ such that:

  • $G$ is true (in the standard model of arithmetic)
  • $G$ is not provable in $F$
  • $\neg G$ is not provable in $F$

In other words, there are true mathematical statements that cannot be proved. No axiom system - ZFC, or any extension - can be both consistent and complete for arithmetic.

Gödel's self-referential trick: Gödel constructs $G$ to say, essentially, "I am not provable in system $F$." If $G$ were provable, it would be false (since it says it's unprovable), contradicting consistency. If $\neg G$ were provable, then $G$ would be provable too (since $\neg G$ asserts exactly that $G$ is provable), again contradicting consistency. So neither $G$ nor $\neg G$ is provable.

Second Incompleteness Theorem: No consistent system $F$ capable of expressing arithmetic can prove its own consistency.

$$F \text{ consistent} \implies F \not\vdash \text{Con}(F)$$

AI implications:

  • No AI system (which is a formal system) can verify all true mathematical statements - there will always be truths beyond its reach
  • The claim "this AI is guaranteed safe" cannot be proved within the AI's own framework (a version of the second theorem)
  • Gödel's theorems do NOT mean mathematics is broken - they mean it is inexhaustible

12.2 The Halting Problem

Theorem (Turing, 1936). There is no algorithm that can determine, for every program $P$ and input $I$, whether $P$ halts on $I$.

Proof (by contradiction). Suppose $H(P, I)$ is a program that returns "halts" or "loops" for any $P, I$. Define:

$$D(P) = \begin{cases} \text{loop forever} & \text{if } H(P, P) = \text{"halts"} \\ \text{halt} & \text{if } H(P, P) = \text{"loops"} \end{cases}$$

What does $D(D)$ do?

  • If $H(D, D) =$ "halts" -> $D(D)$ loops forever. But $H$ said it halts. Contradiction.
  • If $H(D, D) =$ "loops" -> $D(D)$ halts. But $H$ said it loops. Contradiction.

So $H$ cannot exist. $\square$

Connections to AI:

  • This is the computation analogue of Russell's Paradox - self-reference plus negation creates paradox
  • You cannot write a "universal bug detector" that catches all infinite loops
  • Theorem provers and code verifiers must be incomplete (they may time out or return "unknown")
  • LLM code generation cannot guarantee termination of generated code

12.3 Zorn's Lemma

Zorn's Lemma (equivalent to the Axiom of Choice and the Well-Ordering Theorem):

If every chain in a partially ordered set $(P, \leq)$ has an upper bound in $P$, then $P$ has a maximal element.

A chain is a totally ordered subset; a maximal element $m$ is one with no element strictly above it ($\nexists x \in P,\, x > m$).

Applications in mathematics:

  • Every vector space has a basis (even infinite-dimensional ones)
  • Every ideal in a ring is contained in a maximal ideal
  • Every filter can be extended to an ultrafilter

AI relevance: Zorn's Lemma is used in the theoretical foundations of kernel methods (Hilbert space bases) and functional analysis (Hahn-Banach theorem), which underlies support vector machines and reproducing kernel Hilbert spaces.

12.4 Fuzzy Logic

Classical logic is crisp: $T$ or $F$. Fuzzy logic (Zadeh, 1965) allows truth values in $[0, 1]$:

$$\mu_A(x) \in [0, 1]$$

where $\mu_A(x)$ is the degree of membership of $x$ in fuzzy set $A$.

Fuzzy operations:

| Classical | Fuzzy |
|---|---|
| $A \cap B$ | $\mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x))$ |
| $A \cup B$ | $\mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x))$ |
| $\neg A$ | $\mu_{\neg A}(x) = 1 - \mu_A(x)$ |
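
The min/max/complement operations collapse to classical logic on crisp values but behave differently in between (a small sketch; the value 0.6 is arbitrary):

```python
def f_and(a: float, b: float) -> float:   # fuzzy intersection (min t-norm)
    return min(a, b)

def f_or(a: float, b: float) -> float:    # fuzzy union (max t-conorm)
    return max(a, b)

def f_not(a: float) -> float:             # fuzzy complement
    return 1.0 - a

# On crisp values {0, 1} these reduce exactly to classical AND/OR/NOT
for p in (0.0, 1.0):
    for q in (0.0, 1.0):
        assert f_and(p, q) == float(bool(p) and bool(q))
        assert f_or(p, q) == float(bool(p) or bool(q))
    assert f_not(p) == float(not bool(p))

# Unlike classical logic, excluded middle fails for intermediate values:
assert f_or(0.6, f_not(0.6)) == 0.6   # p OR (NOT p) < 1
```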

AI connections:

  • Softmax outputs are like fuzzy membership values - a token has degree-of-membership in each class
  • Attention weights $\alpha_{ij} \in [0, 1]$ are fuzzy: not binary "attend or not" but "how much to attend"
  • Fuzzy control systems (early AI) used fuzzy rules: "IF temperature IS high THEN fan IS fast"
  • Modern neural networks have largely replaced fuzzy systems, but the conceptual framework persists in the soft/differentiable operations that make gradient descent possible

12.5 Second-Order Logic (Brief)

First-order logic quantifies over objects: $\forall x, \exists y, \ldots$

Second-order logic quantifies over properties (sets, predicates, functions): $\forall P, \exists f, \ldots$

Example - Completeness of $\mathbb{R}$:

  • First-order: CANNOT express "every bounded set has a least upper bound" (needs quantification over sets)
  • Second-order: $\forall S \subseteq \mathbb{R}\, [\,S \neq \emptyset \wedge \exists M\, \forall x \in S\, (x \leq M) \to S \text{ has a least upper bound}\,]$

Trade-off:

  • First-order logic: semi-decidable (complete proof systems exist), but less expressive
  • Second-order logic: more expressive, but no complete proof system exists (incompleteness is "worse")

In AI, first-order logic is preferred because it has better computational properties. Description logics, Datalog, and Prolog all work within FOL or its fragments.

