Math for LLMs

Sets and Logic

Mathematical Foundations / Sets and Logic

Notes

"Sets tell us what objects exist in a mathematical world; logic tells us what must be true about them."

Overview

Sets and logic are the foundation beneath every other mathematical object in this repository. Before we can speak precisely about vectors, functions, probabilities, losses, or neural network layers, we need a language for collections, membership, relations, predicates, and valid inference.

In AI, these ideas show up constantly. A vocabulary is a set, an attention mask is a relation on token positions, a training rule is a logical implication, and a theorem about optimization or generalization is only as strong as the quantifiers and assumptions that define it. This section develops that language so later chapters can use it rigorously rather than implicitly.

Prerequisites

  • Basic arithmetic and symbolic notation from Number Systems
  • Comfort reading simple mathematical statements and equations
  • Familiarity with elementary programming concepts such as conditions and Boolean values

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive set operations, truth tables, logical normal forms, and AI-flavored demonstrations |
| exercises.ipynb | Guided practice on sets, relations, quantifiers, predicates, and proof structure |

Learning Objectives

After completing this section, you will:

  • Define sets, subsets, relations, and Cartesian products precisely
  • Distinguish naive set intuition from axiomatic safeguards against paradox
  • Translate English mathematical statements into propositional and predicate logic
  • Work fluently with quantifiers, implication, equivalence, and normal forms
  • Recognize how set and logical structure appears in ML pipelines and model design
  • Explain cardinality, countability, and why they matter for computation and learning theory
  • Read formal statements in later chapters without losing track of hypotheses and conclusions
  • Use sets and logic as the conceptual bridge to functions, proof techniques, probability, and beyond

Table of Contents

  1. Intuition
  2. Naive Set Theory
  3. Set Operations
  4. Relations
  5. Propositional Logic
  6. Predicate Logic (First-Order Logic)
  7. Proof Techniques
  8. Axiomatic Set Theory (ZFC)
  9. Logic in Computation and AI
  10. Cardinality and Infinite Sets
  11. Logic and Set Theory in Machine Learning
  12. Advanced Topics
  13. Common Mistakes
  14. Exercises
  15. Why This Matters for AI
  16. Conceptual Bridge

1. Intuition

1.1 What Are Sets and Logic?

Set theory is the mathematical language for talking about collections of objects; logic is the mathematical language for talking about truth and valid reasoning. Together they form the foundation of all mathematics: every other mathematical structure - numbers, functions, probability, linear algebra - is built on sets and governed by logic.

A set answers the question: what objects belong to this category?

Logic answers the question: given what we know, what can we validly conclude?

These are not abstract curiosities. When you tokenise a sentence, the vocabulary $V$ is a set. When you compute attention, the mask selects a subset of positions. When you train a model, the loss function maps from a set of parameters to real numbers. When you evaluate whether a model's output is "correct", you're applying logical predicates. Sets and logic are not prerequisites you study once and forget - they are the operational language of every equation in this curriculum.

For AI and LLMs specifically:

  • Sets formalise vocabularies, token sequences, and model outputs
  • Logic underpins reasoning, constraint satisfaction, knowledge representation, and formal verification of AI systems
  • Every probability distribution is defined over a set with a logical structure (sigma-algebra)
  • Neural network verification expresses safety properties as logical formulas

1.2 Why Sets and Logic Matter for AI

| AI Concept | Set/Logic Foundation |
|---|---|
| Tokenisation | Vocabulary $V$ is a set; the set of all token sequences $V^*$ is a formal language |
| Probability | A probability space is built on sets: events are subsets of the sample space $\Omega$ |
| Neural network architecture | Each layer is a function between sets (domain $\mathbb{R}^{d_{in}}$ and codomain $\mathbb{R}^{d_{out}}$) |
| Attention masks | A mask selects a subset of positions; the causal mask is the relation $\{(i,j) \mid i \geq j\}$ |
| Structured generation | Constrained decoding enforces logical constraints on output tokens at each step |
| Knowledge graphs | Entities form sets; relations are subsets of Cartesian products of those sets; reasoning is predicate logic |
| Formal verification | Proving properties of AI systems requires predicate logic and SAT/SMT solvers |
| Training data | Deduplication is a set operation; a dataset is a multiset of examples |
| Retrieval (RAG) | A query retrieves the subset of a corpus satisfying a relevance predicate |
| Type checking | Tensor shape inference is applied type theory; it prevents dimension mismatches |

1.3 The Hierarchy of Mathematical Foundations

Every mathematical structure used in AI rests on a hierarchy, and sets and logic form the very bottom:

Layer 7: Machine Learning     (SGD, backprop, transformers)
Layer 6: Probability          (distributions, Bayes, entropy)
Layer 5: Analysis             (continuity, limits, derivatives)
Layer 4: Linear Algebra       (vectors, matrices, eigenvalues)
Layer 3: Algebra              (groups, rings, fields)
Layer 2: Arithmetic           (natural numbers, integers, reals)
Layer 1: Set Theory (ZFC)     (what objects exist? what collections?)
Layer 0: Logic                (what is valid reasoning? what follows from what?)

Each layer is defined in terms of the layers below. Natural numbers are constructed from sets (von Neumann ordinals). Real numbers are equivalence classes of Cauchy sequences of rationals. Vector spaces are sets with operations satisfying axioms stated in predicate logic. Probability measures are functions from sets to $[0,1]$. The entire edifice stands on sets and logic.

This is not historical accident - it is the result of 150 years of foundational work from Cantor, Frege, Russell, Hilbert, Gödel, and Turing to make mathematics rigorous and unambiguous.

1.4 Formal vs Informal Mathematics

Informal mathematics is what most research papers use: proofs in natural language, intuitions supported by diagrams, appeals to "clearly" and "obviously". This is adequate for communication between mathematicians who share common training and conventions.

Formal mathematics is where every step is justified by an explicit rule, and proofs are machine-checkable. Nothing is taken for granted:

| Aspect | Informal | Formal |
|---|---|---|
| Language | Natural language + notation | Formal logic (FOL or type theory) |
| Proof verification | Human reviewer | Machine (proof assistant) |
| Ambiguity | Tolerated | Forbidden |
| Tools | Paper, LaTeX | Lean 4, Coq, Isabelle, Agda |
| Scale | Research papers | Verified libraries (Mathlib ~150K theorems) |

Proof assistants encode logic and set theory precisely and verify proofs automatically. They are no longer academic curiosities:

  • Lean 4 (Microsoft Research): Mathlib library; used by AlphaProof (Google DeepMind, 2024) to reach silver-medal standard at the IMO
  • Coq (INRIA): CompCert verified C compiler; Four Colour Theorem formalised
  • Isabelle (Cambridge/Munich): seL4 verified operating system kernel
  • AI and formal mathematics: LLMs increasingly used to assist with formal proofs - Lean Copilot, Hypertree Proof Search, AlphaProof. Understanding sets and logic is prerequisite for understanding what LLMs are attempting when doing mathematical reasoning

1.5 Historical Timeline

The development of sets and logic spans 2,400 years. Here are the critical milestones:

| Year | Person | Contribution |
|---|---|---|
| ~350 BCE | Aristotle | Syllogistic logic - first formal system of deductive reasoning |
| 1679 | Leibniz | Calculus ratiocinator - dream of a universal logical calculus |
| 1847 | Boole | The Mathematical Analysis of Logic - Boolean algebra; logic as algebra |
| 1874 | Cantor | Set theory - infinite sets, cardinality, continuum hypothesis |
| 1879 | Frege | Begriffsschrift - first complete formal logic system; predicate calculus |
| 1901 | Russell | Russell's Paradox - naive set theory is inconsistent |
| 1908 | Zermelo | Axiomatic set theory - foundation for modern mathematics |
| 1910-13 | Russell & Whitehead | Principia Mathematica - attempt to reduce all mathematics to logic |
| 1922-30s | Fraenkel, Skolem | ZFC axioms completed - standard foundation of modern mathematics |
| 1929 | Gödel | Completeness theorem - every valid FOL formula has a proof |
| 1931 | Gödel | Incompleteness theorems - no consistent system can prove all true arithmetic |
| 1936 | Church & Turing | Lambda calculus and Turing machines - computability theory |
| 1936 | Tarski | Formal semantics - truth defined precisely for formal languages |
| 1965 | Zadeh | Fuzzy logic - graded truth values in $[0,1]$ |
| 1972 | Martin-Löf | Dependent type theory - constructive foundations connecting logic to programming |
| 2005-10 | - | SAT solvers industrialised - automated reasoning becomes practical |
| 2024 | Google DeepMind | AlphaProof - neural theorem prover reaches IMO silver-medal standard |

1.6 The Two Pillars

Mathematics rests on two pillars, and understanding their relationship is essential:

                        MATHEMATICS
                            |
        +-------------------+-------------------+
        |                                       |
     LOGIC                                 SET THEORY
     (What is valid reasoning?)            (What objects exist?)
        |                                       |
     Propositional Logic                   Naive Set Theory
        ↓                                       ↓
     Predicate Logic (FOL)                 Axiomatic (ZFC)
        ↓                                       ↓
     Modal / Temporal Logic                Type Theory
        |                                       |
        +-------------------+-------------------+
                            |
              Both together: the language in which
              all of AI's mathematical foundations
                        are written

Logic provides the rules of valid inference - what follows from what. It tells you that if "all transformers have attention" and "GPT is a transformer", then "GPT has attention" - and it tells you exactly why this inference is valid (universal instantiation plus modus ponens).

Set theory provides the objects - the collections, the spaces, the structures. It tells you what $\mathbb{R}^d$ is (a Cartesian product of $d$ copies of $\mathbb{R}$), what a function is (a special kind of relation, which is a subset of a Cartesian product), and what a probability space is (a triple of sets and a measure).

Neither is sufficient alone. Logic without sets has nothing to reason about. Sets without logic have no way to draw conclusions. Together they form the complete foundation.


2. Naive Set Theory

2.1 What Is a Set?

A set is an unordered collection of distinct objects called elements (or members). This is the simplest and most fundamental concept in all of mathematics.

Membership notation:

  • $a \in A$ means "$a$ is an element of $A$" (read: "$a$ is in $A$")
  • $b \notin A$ means "$b$ is not an element of $A$" (read: "$b$ is not in $A$")

Example:

$$A = \{1, 2, 3\}$$

Then $1 \in A$, $2 \in A$, $3 \in A$, but $4 \notin A$ and $0 \notin A$.

The fundamental principle - Axiom of Extensionality: A set is entirely determined by its members. Two sets are equal if and only if they have exactly the same elements:

$$A = B \iff \forall x\,(x \in A \iff x \in B)$$

This means:

  • Order does not matter: $\{1, 2, 3\} = \{3, 1, 2\} = \{2, 3, 1\}$
  • Repetition does not matter: $\{1, 1, 2\} = \{1, 2\}$ (sets have distinct elements)

Important distinction: Sets have no notion of order or multiplicity. If you need order, use a sequence (tuple). If you need multiplicity, use a multiset (see 2.6). In ML, training batches are typically ordered sequences, not sets - but the mathematical specification of "training data" is a multiset (same example can appear multiple times).
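
These distinctions are easy to experiment with, since Python's built-in types mirror them directly (a minimal sketch; `set`, `tuple`, and `collections.Counter` play the roles of set, sequence, and multiset):

```python
from collections import Counter

# Sets ignore order and multiplicity.
assert {1, 2, 3} == {3, 1, 2}      # order does not matter
assert {1, 1, 2} == {1, 2}         # repetition collapses

# If order matters, use a sequence (tuple).
assert (1, 2) != (2, 1)

# If multiplicity matters, use a multiset (Counter).
assert Counter([1, 1, 2]) != Counter([1, 2])
```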

2.2 Describing Sets

There are three standard ways to describe which elements belong to a set:

Roster notation (listing): Enumerate elements explicitly inside braces.

| Example | Description |
|---|---|
| $\{1, 2, 3, 4, 5\}$ | Finite set with 5 elements |
| $\{a, b, c\}$ | Set of three letters |
| $\{2, 4, 6, 8, \ldots\}$ | Infinite set with a clear pattern (use carefully) |

Set-builder notation (predicate): Describe the property that members satisfy.

$$A = \{x \mid P(x)\} \quad \text{or} \quad A = \{x : P(x)\}$$

Read: "the set of all $x$ such that $P(x)$ holds."

| Set-builder | Meaning |
|---|---|
| $\{x \in \mathbb{Z} \mid x \bmod 2 = 0\}$ | Even integers |
| $\{x \in \mathbb{R} \mid x > 0\}$ | Positive real numbers |
| $\{t \in \Sigma^* \mid t \text{ is a valid token}\}$ | Vocabulary as a set |
| $\{(i,j) \in [n]^2 \mid i \geq j\}$ | Causal attention mask positions |

Parametric description: Define elements by explicit construction.

| Parametric | Meaning |
|---|---|
| $\{2k \mid k \in \mathbb{Z}\}$ | Even integers |
| $\{1/n \mid n \in \mathbb{N},\, n > 0\}$ | Reciprocals of positive integers |
| $\{W_q x \mid x \in \mathbb{R}^d\}$ | Range of a linear map $W_q$ |
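
Set-builder and parametric descriptions translate almost symbol-for-symbol into Python set comprehensions, provided we restrict to a finite range (a sketch; the ranges are illustrative):

```python
# {x in Z | x mod 2 = 0}, restricted to a finite window of integers:
evens = {x for x in range(-10, 11) if x % 2 == 0}

# Parametric form {2k | k in Z}, same window:
evens_param = {2 * k for k in range(-5, 6)}
assert evens == evens_param

# Causal-mask positions {(i, j) in [n]^2 | i >= j}, here n = 4:
n = 4
causal = {(i, j) for i in range(n) for j in range(n) if i >= j}
assert (2, 0) in causal and (0, 2) not in causal
```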

2.3 Special Sets

The following sets are used so frequently that they have special symbols:

| Symbol | Name | Elements |
|---|---|---|
| $\emptyset = \{\}$ | Empty set | No elements |
| $\mathbb{N}$ | Natural numbers | $\{0, 1, 2, 3, \ldots\}$ (sometimes starting from 1) |
| $\mathbb{Z}$ | Integers | $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$ |
| $\mathbb{Q}$ | Rationals | $\{p/q \mid p, q \in \mathbb{Z},\, q \neq 0\}$ |
| $\mathbb{R}$ | Real numbers | All points on the number line |
| $\mathbb{C}$ | Complex numbers | $\{a + bi \mid a, b \in \mathbb{R},\, i^2 = -1\}$ |
| $U$ | Universal set | All objects under consideration (context-dependent) |

The containment chain:

$$\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}$$

Each step adds new elements: $\mathbb{Z}$ adds negatives, $\mathbb{Q}$ adds fractions, $\mathbb{R}$ adds irrationals ($\sqrt{2}$, $\pi$, $e$), $\mathbb{C}$ adds imaginary numbers.

Critical distinctions with the empty set:

| Expression | Elements | Cardinality |
|---|---|---|
| $\emptyset = \{\}$ | None | 0 |
| $\{\emptyset\} = \{\{\}\}$ | One element: $\emptyset$ | 1 |
| $\{\emptyset, \{\emptyset\}\}$ | Two elements: $\emptyset$ and $\{\emptyset\}$ | 2 |

Warning: $\emptyset \neq \{\emptyset\}$. The empty set has no elements. The set containing the empty set has one element (which happens to be the empty set). This distinction is not pedantic - it is exactly how natural numbers are constructed from sets in ZFC (see 8.3).

2.4 Subset and Superset

Subset: $A$ is a subset of $B$ (written $A \subseteq B$) if every element of $A$ is also an element of $B$:

$$A \subseteq B \iff \forall x\,(x \in A \implies x \in B)$$

Proper subset: $A$ is a proper subset of $B$ (written $A \subsetneq B$ or $A \subset B$) if $A \subseteq B$ and $A \neq B$ - that is, $B$ has at least one element not in $A$.

Superset: $A \supseteq B$ means $B \subseteq A$ (read: "$A$ is a superset of $B$").

Properties of the subset relation:

| Property | Statement | Meaning |
|---|---|---|
| Reflexivity | $A \subseteq A$ | Every set is a subset of itself |
| Antisymmetry | $A \subseteq B$ and $B \subseteq A \implies A = B$ | Standard proof technique for set equality |
| Transitivity | $A \subseteq B$ and $B \subseteq C \implies A \subseteq C$ | Subset chains compose |
| Empty set | $\emptyset \subseteq A$ for every set $A$ | Vacuously true |

The antisymmetry property gives us the standard method for proving two sets are equal: show $A \subseteq B$ (every element of $A$ is in $B$) and $B \subseteq A$ (every element of $B$ is in $A$). This "double containment" proof is used constantly in mathematics.

Vacuous truth explanation: Why is $\emptyset \subseteq A$ true? The definition says: for every $x$, if $x \in \emptyset$ then $x \in A$. Since nothing is in $\emptyset$, the "if" clause is never satisfied, so the implication is vacuously true. There is no counterexample - we would need an $x$ that is in $\emptyset$ but not in $A$, and no such $x$ exists. See 5.2 for more on vacuous truth and implication.
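
Python's comparison operators on `set` implement exactly these subset relations, so the properties above can be checked directly (a small sketch):

```python
A = {1, 2}
B = {1, 2, 3}

assert A <= B                 # A is a subset of B
assert A < B                  # proper subset: subset and not equal
assert B <= B and not B < B   # reflexive, but never a proper subset of itself

# The empty set is vacuously a subset of every set.
assert set() <= A

# Double containment proves equality.
C = {2, 1}
assert A <= C and C <= A and A == C
```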

2.5 Russell's Paradox

Naive comprehension (the principle that doomed early set theory): For any property $P$, there exists a set $\{x \mid P(x)\}$. This seems natural - "the set of all things with property $P$" - but it leads to disaster.

Russell's construction (1901): Define

$$R = \{x \mid x \notin x\}$$

the set of all sets that don't contain themselves. Now ask: is $R \in R$?

Case 1: Assume $R \in R$. Then $R$ satisfies its own membership condition, which says $R \notin R$. Contradiction.

Case 2: Assume $R \notin R$. Then $R$ satisfies the condition $x \notin x$, so by the definition of $R$, we have $R \in R$. Contradiction.

Either way we get a contradiction. Naive set theory is inconsistent.

The resolution: We cannot form "the set of all sets" or unrestricted collections. The axiom of Separation (in ZFC) restricts comprehension: you can only form $\{x \in A \mid P(x)\}$ for an already-existing set $A$. You cannot form a set "from scratch" using just a property - you must separate from an existing set. This blocks Russell's paradox because there is no universal set of all sets from which to separate.

This is not merely a historical curiosity. Russell's paradox is the reason mathematics has axioms for set theory at all, and understanding it is prerequisite for understanding ZFC (8), type theory (9.4), and the limits of formal systems (12.1).

2.6 Multisets (Bags)

Standard sets require distinct elements: $\{1, 1, 2\} = \{1, 2\}$. But in practice, we often need to track multiplicity - how many times each element appears.

A multiset (or bag) generalises sets by assigning each element a multiplicity (count). Formally, a multiset over a base set $A$ is a function $m: A \to \mathbb{N}_0$, where $m(a)$ gives the number of times $a$ appears.

Notation: $\{1, 1, 2, 3, 3, 3\}$ as a multiset has $m(1) = 2$, $m(2) = 1$, $m(3) = 3$.

Multiset operations:

| Operation | Definition | Example |
|---|---|---|
| Union (max) | $(A \uplus B)(x) = \max(m_A(x), m_B(x))$ | $\{1,1,2\} \uplus \{1,2,2\} = \{1,1,2,2\}$ |
| Sum (add) | $(A + B)(x) = m_A(x) + m_B(x)$ | $\{1,1,2\} + \{1,2,2\} = \{1,1,1,2,2,2\}$ |
| Intersection (min) | $(A \cap B)(x) = \min(m_A(x), m_B(x))$ | $\{1,1,2\} \cap \{1,2,2\} = \{1,2\}$ |

AI applications of multisets:

  • Training data: A dataset is a multiset of examples - the same document may appear multiple times (data repetition, which affects training dynamics)
  • Token counting: Term frequency in TF-IDF and BM25 is a multiset operation - how many times each word appears matters
  • Gradient accumulation: Sum of gradient multiset over a batch: each micro-batch contributes a gradient, and we sum (multiset addition)
  • Batch statistics: BatchNorm computes mean and variance of a multiset of activations per channel
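
The three multiset operations in the table correspond directly to `collections.Counter`'s `|`, `+`, and `&` operators (a minimal sketch):

```python
from collections import Counter

A = Counter([1, 1, 2])   # multiset {1, 1, 2}
B = Counter([1, 2, 2])   # multiset {1, 2, 2}

assert A | B == Counter([1, 1, 2, 2])          # union: max of multiplicities
assert A + B == Counter([1, 1, 1, 2, 2, 2])    # sum: add multiplicities
assert A & B == Counter([1, 2])                # intersection: min of multiplicities
```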

3. Set Operations

Set operations are the algebraic toolkit for combining, comparing, and transforming sets. Every one of these operations appears directly in modern AI systems: attention masks use complement and intersection, retrieval uses union and difference, and probability theory uses all of them.

3.1 Union

$$A \cup B = \{x \mid x \in A \text{ or } x \in B\}$$

The union contains every element that belongs to at least one of the two sets. The "or" is inclusive - elements in both sets are included (once, since sets don't repeat).

Visual: Think of taking both sets and pouring them together. Any element from either set goes in.

Properties of union:

| Property | Statement | Why |
|---|---|---|
| Commutativity | $A \cup B = B \cup A$ | "or" is commutative |
| Associativity | $(A \cup B) \cup C = A \cup (B \cup C)$ | Can chain without parentheses |
| Identity | $A \cup \emptyset = A$ | Adding nothing changes nothing |
| Idempotence | $A \cup A = A$ | Duplicates collapse in sets |
| Domination | $A \cup U = U$ | Everything plus anything is everything |

Generalised union: For a collection of sets $\{A_i\}_{i \in I}$:

$$\bigcup_{i \in I} A_i = \{x \mid x \in A_i \text{ for some } i \in I\}$$

AI examples:

  • Vocabulary merging: When combining tokenisers from different models, $V_{\text{merged}} = V_1 \cup V_2$
  • Search results: "Documents matching query A OR query B" = $\text{Results}(A) \cup \text{Results}(B)$
  • Multi-dataset training: Training set = $D_1 \cup D_2 \cup \cdots \cup D_k$

3.2 Intersection

$$A \cap B = \{x \mid x \in A \text{ and } x \in B\}$$

The intersection contains only elements that belong to both sets simultaneously.

Properties of intersection:

| Property | Statement | Why |
|---|---|---|
| Commutativity | $A \cap B = B \cap A$ | "and" is commutative |
| Associativity | $(A \cap B) \cap C = A \cap (B \cap C)$ | Can chain without parentheses |
| Identity | $A \cap U = A$ | Intersecting with everything keeps all |
| Idempotence | $A \cap A = A$ | Set intersected with itself is itself |
| Annihilation | $A \cap \emptyset = \emptyset$ | Intersecting with nothing gives nothing |

Disjoint sets: Two sets $A$ and $B$ are disjoint if $A \cap B = \emptyset$ - they share no elements.

Generalised intersection: For a collection $\{A_i\}_{i \in I}$:

$$\bigcap_{i \in I} A_i = \{x \mid x \in A_i \text{ for all } i \in I\}$$

AI examples:

  • Common tokens: Tokens shared by two vocabularies: $V_1 \cap V_2$
  • Data deduplication: Documents appearing in both train and test: $D_{\text{train}} \cap D_{\text{test}}$ (should be $\emptyset$ for valid evaluation!)
  • Attention intersection: Tokens attended by both head A and head B

Distributive laws (union and intersection interact):

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$$

These mirror the distributive laws of multiplication over addition ($a \cdot (b + c) = ab + ac$), which is no coincidence - Boolean algebra (5.7) makes these analogies precise.

3.3 Set Difference

$$A \setminus B = A - B = \{x \mid x \in A \text{ and } x \notin B\}$$

Elements of $A$ that are not in $B$. Think of "removing" $B$'s elements from $A$.

Properties of set difference:

| Property | Statement |
|---|---|
| Not commutative | $A \setminus B \neq B \setminus A$ in general |
| Identity | $A \setminus \emptyset = A$ (removing nothing changes nothing) |
| Self-annihilation | $A \setminus A = \emptyset$ (removing everything leaves nothing) |
| Monotonicity | $A \setminus B \subseteq A$ (can only reduce the original set) |
| Disjointness | $(A \setminus B) \cap B = \emptyset$ (what remains shares nothing with what was removed) |

Relationship to other operations:

$$A \setminus B = A \cap \bar{B}$$

Difference is intersection with complement (this is often the most useful way to think about it).

AI examples:

  • Vocabulary difference: Tokens in GPT-4's vocabulary but not LLaMA's: $V_{\text{GPT4}} \setminus V_{\text{LLaMA}}$
  • Data filtering: Training data after removing test examples: $D_{\text{train}} \setminus D_{\text{test}}$
  • Feature selection: Selected features = all features minus dropped features

3.4 Symmetric Difference

$$A \triangle B = (A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B)$$

Elements in exactly one of the two sets - not in both.

Properties of symmetric difference:

| Property | Statement | Why |
|---|---|---|
| Commutativity | $A \triangle B = B \triangle A$ | Definition is symmetric |
| Associativity | $(A \triangle B) \triangle C = A \triangle (B \triangle C)$ | Can chain |
| Identity | $A \triangle \emptyset = A$ | XOR with nothing = original |
| Self-inverse | $A \triangle A = \emptyset$ | XOR with self cancels |
| Criterion | $A \triangle B = \emptyset \iff A = B$ | Zero difference means equal |

Algebraic structure: The symmetric difference operation makes the power set $\mathcal{P}(U)$ into an abelian group with $\emptyset$ as identity. Combined with intersection as multiplication, $(\mathcal{P}(U), \triangle, \cap)$ forms a Boolean ring - connecting set theory to abstract algebra.

AI examples:

  • Dataset drift detection: Tokens that changed between vocabulary v1 and v2: $V_1 \triangle V_2$
  • Model comparison: Features used by model A XOR model B
  • Edit distance: Symmetric difference of character sets is a crude string similarity measure

3.5 Complement

$$\bar{A} = A' = U \setminus A = \{x \in U \mid x \notin A\}$$

Everything in the universal set $U$ that is not in $A$. The complement depends on what $U$ is (the context must specify it).

Properties of complement:

| Property | Statement |
|---|---|
| Double complement | $\overline{\bar{A}} = A$ |
| Complement of $\emptyset$ | $\bar{\emptyset} = U$ |
| Complement of $U$ | $\bar{U} = \emptyset$ |
| Union with complement | $A \cup \bar{A} = U$ |
| Intersection with complement | $A \cap \bar{A} = \emptyset$ |

AI example: An attention mask $M$ selects positions to attend to. The complement $\bar{M}$ gives the positions that are masked out (set to $-\infty$ before softmax). In causal masking, the complement is the set of "future" positions that must not be attended to.

3.6 De Morgan's Laws for Sets

These are the most important algebraic identities for sets. They connect union and intersection through complementation:

First law - complement of union = intersection of complements:

$$\overline{A \cup B} = \bar{A} \cap \bar{B}$$

"Not in A-or-B" is the same as "not in A and not in B."

Second law - complement of intersection = union of complements:

$$\overline{A \cap B} = \bar{A} \cup \bar{B}$$

"Not in A-and-B" is the same as "not in A or not in B."

Proof of the first law (by double containment):

($\subseteq$): Let $x \in \overline{A \cup B}$. Then $x \notin A \cup B$, which means $x \notin A$ and $x \notin B$ (if $x$ were in either, it would be in the union). Hence $x \in \bar{A}$ and $x \in \bar{B}$, so $x \in \bar{A} \cap \bar{B}$.

($\supseteq$): Let $x \in \bar{A} \cap \bar{B}$. Then $x \notin A$ and $x \notin B$. Hence $x$ is in neither $A$ nor $B$, so $x \notin A \cup B$, which means $x \in \overline{A \cup B}$.

Both containments hold, so $\overline{A \cup B} = \bar{A} \cap \bar{B}$. $\square$

Generalised De Morgan's Laws:

$$\overline{\bigcup_i A_i} = \bigcap_i \bar{A}_i \qquad \qquad \overline{\bigcap_i A_i} = \bigcup_i \bar{A}_i$$

AI examples:

  • Attention mask logic: "Not attending to position A or position B" = "not attending to A and not attending to B"
  • Retrieval negation: "Documents not matching (query1 OR query2)" = "documents not matching query1 AND not matching query2"
  • Filter composition: Negating a disjunctive filter becomes conjunctive
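
Because both laws are identities between finite sets, they can be brute-force checked over any small universe (a sketch; the choices of $U$, $A$, and $B$ are arbitrary):

```python
U = set(range(10))
A = {x for x in U if x % 2 == 0}   # evens
B = {x for x in U if x < 5}        # small numbers

def comp(S):
    """Complement relative to the universe U."""
    return U - S

# Complement of union = intersection of complements.
assert comp(A | B) == comp(A) & comp(B)
# Complement of intersection = union of complements.
assert comp(A & B) == comp(A) | comp(B)
```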

3.7 Cartesian Product

$$A \times B = \{(a, b) \mid a \in A,\, b \in B\}$$

The set of all ordered pairs where the first element comes from $A$ and the second from $B$. Unlike sets, order matters in pairs: $(a, b) \neq (b, a)$ unless $a = b$.

Cardinality: $|A \times B| = |A| \cdot |B|$

Not commutative: $A \times B \neq B \times A$ (different ordered pairs, unless $A = B$).

Higher products:

  • $A \times B \times C = \{(a, b, c) \mid a \in A,\, b \in B,\, c \in C\}$ - ordered triples
  • $A^n = \underbrace{A \times A \times \cdots \times A}_{n \text{ times}}$ - $n$-tuples of elements from $A$

AI examples - Cartesian products are everywhere:

| Expression | Meaning |
|---|---|
| $\mathbb{R}^d = \underbrace{\mathbb{R} \times \cdots \times \mathbb{R}}_{d}$ | $d$-dimensional embedding space |
| $V^n$ | All possible token sequences of length $n$ |
| $\mathbb{R}^{n \times d_k} \times \mathbb{R}^{n \times d_k}$ | Domain of attention: (queries, keys) |
| $[n] \times [n]$ | All position pairs for the attention matrix |
| $\Theta = \mathbb{R}^p$ | Parameter space of a model with $p$ parameters |

Formal definition of ordered pair (Kuratowski): $(a, b) = \{\{a\}, \{a, b\}\}$. This reduces ordered pairs to pure sets - the pair $(a, b)$ is a set that encodes the order by treating $a$ as the "singleton" element.
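
The product itself, its cardinality rule, and the $V^n$ construction can all be checked with `itertools.product` (a sketch with toy sets):

```python
from itertools import product

A = {"a", "b"}
B = {0, 1, 2}

AxB = set(product(A, B))
assert len(AxB) == len(A) * len(B)    # |A x B| = |A| * |B|

# Order matters: A x B and B x A contain different pairs.
assert AxB != set(product(B, A))

# V^n: all token sequences of length n over a toy vocabulary V.
V = {"x", "y"}
assert len(set(product(V, repeat=3))) == len(V) ** 3
```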

3.8 Power Set

$$\mathcal{P}(A) = \{B \mid B \subseteq A\}$$

The set of all subsets of $A$, including $\emptyset$ and $A$ itself.

Cardinality: $|\mathcal{P}(A)| = 2^{|A|}$ for finite $A$.

The reason: for each element of $A$, you independently choose "include" or "exclude" - this gives $2$ choices for each of the $|A|$ elements, hence $2^{|A|}$ subsets in total.

Examples:

| Set $A$ | $\mathcal{P}(A)$ | $\lvert\mathcal{P}(A)\rvert$ |
|---|---|---|
| $\emptyset$ | $\{\emptyset\}$ | $2^0 = 1$ |
| $\{1\}$ | $\{\emptyset, \{1\}\}$ | $2^1 = 2$ |
| $\{1, 2\}$ | $\{\emptyset, \{1\}, \{2\}, \{1,2\}\}$ | $2^2 = 4$ |
| $\{1, 2, 3\}$ | 8 subsets (including $\emptyset$ and $\{1,2,3\}$) | $2^3 = 8$ |

Cantor's theorem (critical): The power set is always strictly larger than the original set, even for infinite sets:

$$|A| < |\mathcal{P}(A)| \quad \text{for all sets } A$$

This means there are always more subsets than elements - a fact that generates the entire hierarchy of infinite cardinalities (see 10.4).

AI implications:

  • If the vocabulary has $|V| = 32{,}000$ tokens, then $|\mathcal{P}(V)| = 2^{32{,}000}$ - the number of possible token subsets is astronomically larger than the vocabulary itself
  • Structured generation constrains output to a subset of $V$ at each step - it selects one element of $\mathcal{P}(V)$
  • The exponential blowup of power sets is why exhaustive search over subsets is intractable, driving the need for heuristic and approximate algorithms throughout AI
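
Enumerating $\mathcal{P}(A)$ size by size with `itertools.combinations` makes the $2^{|A|}$ count concrete (a sketch; practical only for tiny sets, for exactly the blowup reason above):

```python
from itertools import chain, combinations

def power_set(A):
    """Return all subsets of A as a set of frozensets, built size by size."""
    elems = list(A)
    subsets = chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1))
    return {frozenset(s) for s in subsets}

P = power_set({1, 2, 3})
assert len(P) == 2 ** 3              # |P(A)| = 2^|A|
assert frozenset() in P              # the empty set is included
assert frozenset({1, 2, 3}) in P     # A itself is included
```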

3.9 Operations Summary Table

For quick reference, here is the complete table of set operations with their logical counterparts:

| Set Operation | Notation | Logic Equivalent | Python |
|---|---|---|---|
| Union | $A \cup B$ | $\vee$ (OR) | `A \| B` or `A.union(B)` |
| Intersection | $A \cap B$ | $\wedge$ (AND) | `A & B` or `A.intersection(B)` |
| Difference | $A \setminus B$ | $\wedge \neg$ (AND NOT) | `A - B` or `A.difference(B)` |
| Symmetric difference | $A \triangle B$ | $\oplus$ (XOR) | `A ^ B` or `A.symmetric_difference(B)` |
| Complement | $\bar{A}$ | $\neg$ (NOT) | `U - A` |
| Cartesian product | $A \times B$ | - | `itertools.product(A, B)` |
| Power set | $\mathcal{P}(A)$ | - | `itertools.combinations` for each size |
| Subset test | $A \subseteq B$ | $\implies$ | `A <= B` or `A.issubset(B)` |
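
The Python column above can be exercised in one short session (a sketch with toy sets):

```python
A, B = {1, 2, 3}, {3, 4}
U = {1, 2, 3, 4, 5}   # universal set for the complement row

assert A | B == {1, 2, 3, 4}    # union
assert A & B == {3}             # intersection
assert A - B == {1, 2}          # difference
assert A ^ B == {1, 2, 4}       # symmetric difference
assert U - A == {4, 5}          # complement relative to U
assert {1, 2} <= A              # subset test
```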

4. Relations

Relations generalise the concept of "connection" between objects. Functions, orderings, equivalences, and even neural network computations are all special kinds of relations. Understanding relations precisely is what separates informal intuition from rigorous mathematics.

4.1 Binary Relations

A binary relation $R$ from set $A$ to set $B$ is a subset of the Cartesian product:

$$R \subseteq A \times B$$

If $(a, b) \in R$, we write $aRb$ or $R(a, b)$ and read "$a$ is related to $b$ by $R$."

A relation on $A$ is a relation from $A$ to itself: $R \subseteq A \times A$.

Examples:

| Relation | Sets | Definition |
|---|---|---|
| $\leq$ on $\mathbb{R}$ | $\mathbb{R} \times \mathbb{R}$ | $(x, y) \in {\leq}$ iff $x$ is less than or equal to $y$ |
| Parent-of | Person $\times$ Person | $(p, c)$ iff $p$ is a parent of $c$ |
| Token adjacency | $V \times V$ | $(t_1, t_2) \in R$ iff $t_1$ can precede $t_2$ in valid text |
| Attention mask | $[n] \times [n]$ | $(i, j) \in \text{mask}$ iff the query at position $i$ attends to the key at position $j$ |
| Similarity | $V \times V$ | $(w_1, w_2) \in R$ iff $\cos(e_{w_1}, e_{w_2}) > \tau$ |

A relation can be represented as:

  • A set of pairs (the mathematical definition)
  • A matrix (Boolean matrix $M$ where $M_{ij} = 1$ iff $iRj$; this is the attention mask representation)
  • A directed graph (nodes = elements; edge from $a$ to $b$ iff $aRb$)
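
The three representations are interchangeable; a sketch converting the causal attention mask from a set of pairs to its Boolean-matrix form:

```python
n = 4
# Set-of-pairs representation: query i may attend to key j iff i >= j.
R = {(i, j) for i in range(n) for j in range(n) if i >= j}

# Matrix representation: M[i][j] = 1 iff (i, j) is in R.
M = [[1 if (i, j) in R else 0 for j in range(n)] for i in range(n)]

assert M[2][0] == 1 and M[0][2] == 0           # lower-triangular pattern
assert all(M[i][i] == 1 for i in range(n))     # every position attends to itself
```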

4.2 Properties of Relations

Let $R$ be a relation on set $A$ (i.e., $R \subseteq A \times A$). There are six fundamental properties a relation may or may not have:

Reflexivity: Every element is related to itself.

$$\forall a \in A:\, (a, a) \in R$$

  • ✓ $\leq$ is reflexive ($x \leq x$ for all $x$)
  • ✗ $<$ is not reflexive ($x \not< x$)
  • AI: identity attention (a token attends to itself) is a reflexive relation

Irreflexivity: No element is related to itself.

aA:(a,a)R\forall a \in A:\, (a, a) \notin R
  • OK << is irreflexive
  • OK "parent-of" is irreflexive (no one is their own parent)
  • A relation can be neither reflexive nor irreflexive (some self-loops, not all)

Symmetry: If aa is related to bb, then bb is related to aa.

a,bA:(a,b)R    (b,a)R\forall a, b \in A:\, (a, b) \in R \implies (b, a) \in R
  • OK "is sibling of" is symmetric
  • OK "has same length as" is symmetric
  • NO \leq is not symmetric (353 \leq 5 but 5≰35 \not\leq 3)
  • AI: cosine similarity is symmetric; attention is NOT symmetric (query->key direction)

Antisymmetry: If aa is related to bb AND bb is related to aa, then a=ba = b.

a,bA:(a,b)R(b,a)R    a=b\forall a, b \in A:\, (a, b) \in R \wedge (b, a) \in R \implies a = b
  • OK \leq is antisymmetric (xyx \leq y and yxy \leq x implies x=yx = y)
  • OK \subseteq on sets is antisymmetric
  • Note: antisymmetry \neq "not symmetric". A relation can be both symmetric and antisymmetric (e.g., equality)

Asymmetry: If aa is related to bb, then bb is NOT related to aa.

a,bA:(a,b)R    (b,a)R\forall a, b \in A:\, (a, b) \in R \implies (b, a) \notin R
  • OK << is asymmetric
  • OK "parent-of" is asymmetric
  • Asymmetry implies irreflexivity (if RR were reflexive, (a,a)R(a, a) \in R would require (a,a)R(a, a) \notin R)

Transitivity: If aa is related to bb and bb is related to cc, then aa is related to cc.

a,b,cA:(a,b)R(b,c)R    (a,c)R\forall a, b, c \in A:\, (a, b) \in R \wedge (b, c) \in R \implies (a, c) \in R
  • OK \leq is transitive
  • OK "ancestor-of" is transitive
  • NO "friend-of" is not transitive (my friend's friend need not be my friend)
  • AI: causal mask transitivity - if position ii attends to jj and jj attends to kk, should ii attend to kk? In a causal mask, yes (by transitivity of \geq)
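For finite relations stored as sets of pairs, each of these properties is directly checkable by translating its quantified definition into `all(...)`. A minimal sketch (names are illustrative):

```python
from itertools import product

def is_reflexive(R, A):
    return all((a, a) in R for a in A)

def is_symmetric(R, A):
    return all((b, a) in R for (a, b) in R)

def is_antisymmetric(R, A):
    return all(a == b for (a, b) in R if (b, a) in R)

def is_transitive(R, A):
    return all((a, c) in R for (a, b) in R for (b2, c) in R if b == b2)

A = {1, 2, 3}
leq = {(a, b) for a, b in product(A, repeat=2) if a <= b}

# <= is reflexive, antisymmetric, and transitive, but not symmetric
print(is_reflexive(leq, A), is_symmetric(leq, A),
      is_antisymmetric(leq, A), is_transitive(leq, A))
# -> True False True True
```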

4.3 Equivalence Relations

A relation RR on AA is an equivalence relation if it is:

  1. Reflexive: aRaaRa for all aa
  2. Symmetric: aRb    bRaaRb \implies bRa
  3. Transitive: aRbbRc    aRcaRb \wedge bRc \implies aRc

Equivalence relations capture the idea of "sameness" or "interchangeability" - elements related by an equivalence relation are considered equivalent for the purpose at hand.

Examples of equivalence relations:

| Relation | Domain | Equivalence classes |
|---|---|---|
| Equality $=$ | Any set | Each class is a singleton $\{a\}$ |
| Same remainder mod $n$ | $\mathbb{Z}$ | $n$ classes: $[0], [1], \ldots, [n-1]$ |
| Same length | Strings | Strings of equal length grouped together |
| Same BPE encoding | Character sequences | Sequences mapping to the same token(s) |
| Same prediction | Inputs | Inputs producing identical model output |

Equivalence class of aa:

[a]={bAaRb}[a] = \{b \in A \mid aRb\}

All elements equivalent to aa form the equivalence class [a][a].

Partition theorem: An equivalence relation on AA partitions AA into disjoint equivalence classes:

  • Every element belongs to exactly one equivalence class
  • The union of all classes equals AA: a[a]=A\bigcup_a [a] = A
  • Distinct classes are disjoint: [a][b]    [a][b]=[a] \neq [b] \implies [a] \cap [b] = \emptyset

Quotient set: A/R={[a]aA}A / R = \{[a] \mid a \in A\} - the set of all equivalence classes.

AI example - tokenisation as equivalence: BPE tokenisation defines an equivalence relation on character sequences. Two character sequences are "equivalent" if they merge to the same token. The equivalence classes are precisely the tokens. The vocabulary VV is the quotient set Σ/BPE\Sigma^* / \sim_{\text{BPE}} (restricted to the merge rules).
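The partition theorem is easy to see computationally: grouping a finite set by an equivalence relation always yields disjoint classes covering the whole set. A sketch (the helper name and the mod-3 example are illustrative; correctness relies on `related` actually being reflexive, symmetric, and transitive):

```python
def equivalence_classes(A, related):
    """Partition A into classes of the equivalence relation `related`.

    Comparing against one representative per class suffices because
    symmetry and transitivity propagate relatedness through the class.
    """
    classes = []
    for a in A:
        for cls in classes:
            if related(a, next(iter(cls))):
                cls.add(a)
                break
        else:
            classes.append({a})
    return classes

classes = equivalence_classes(range(10), lambda a, b: a % 3 == b % 3)
print(classes)  # three classes, one per remainder mod 3
```

The list of classes is exactly the quotient set $\{0,\ldots,9\}/\!\sim$, with three elements: the residue classes $[0]$, $[1]$, $[2]$.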

4.4 Partial Orders

A relation RR on AA is a partial order if it is:

  1. Reflexive: aRaaRa for all aa
  2. Antisymmetric: aRbbRa    a=baRb \wedge bRa \implies a = b
  3. Transitive: aRbbRc    aRcaRb \wedge bRc \implies aRc

A partial order is written \leq or \preceq conventionally. The pair (A,)(A, \leq) is called a partially ordered set (poset).

"Partial" means: Not all pairs need be comparable. Some elements may be incomparable - neither aba \leq b nor bab \leq a holds.

Total order (linear order): A partial order where additionally

a,bA:aRb or bRa\forall a, b \in A:\, aRb \text{ or } bRa

All pairs are comparable. Total orders arrange elements in a single line.

Examples:

| Relation | Domain / definition | Type | Incomparable? |
|---|---|---|---|
| $\leq$ on $\mathbb{R}$ | Real numbers | Total order | No |
| Lexicographic on strings | $\Sigma^*$ | Total order | No |
| $\subseteq$ on $\mathcal{P}(A)$ | Power set | Partial order | Yes: $\{1\}$ and $\{2\}$ |
| Divisibility on $\mathbb{N}$ | $a \mid b$ iff $a$ divides $b$ | Partial order | Yes: $2$ and $3$ |
| Prefix relation | $s \leq t$ iff $s$ is a prefix of $t$ | Partial order | Yes: "ab" and "ba" |

Hasse diagram: A graphical representation of a finite poset. Draw aa above bb with an edge if a>ba > b and there is no element between them (no element cc with a>c>ba > c > b). This gives a compact picture of the partial order structure.

AI connection - dependency parsing: In a syntactic parse tree, token ii dominates token jj if jj is a descendant of ii. This dominance relation is a partial order on sentence tokens. Similarly, in DAG-structured computation graphs, the execution order is a partial order on operations.
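Incomparability is what makes an order "partial", and for a finite poset it can be enumerated directly. A sketch using the divisibility order (the domain is chosen for illustration):

```python
# Divisibility partial order on a small set of naturals
A = [1, 2, 3, 4, 6, 12]
divides = {(a, b) for a in A for b in A if b % a == 0}

# Incomparable pairs: neither a | b nor b | a holds
incomparable = sorted({tuple(sorted((a, b))) for a in A for b in A
                       if (a, b) not in divides and (b, a) not in divides})
print(incomparable)  # -> [(2, 3), (3, 4), (4, 6)]
```

In a total order this list would be empty; here the three incomparable pairs are exactly the "side-by-side" elements a Hasse diagram would draw at the same height.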

4.5 Functions as Relations

A function f:ABf: A \to B is a special case of a relation - a relation fA×Bf \subseteq A \times B satisfying the condition that every element of AA appears in exactly one pair:

aA:!bB:(a,b)f\forall a \in A:\, \exists!\, b \in B:\, (a, b) \in f

We write f(a)=bf(a) = b to mean (a,b)f(a, b) \in f.

Terminology:

| Term | Symbol | Meaning |
|---|---|---|
| Domain | $\text{dom}(f) = A$ | Set of inputs |
| Codomain | $\text{cod}(f) = B$ | Set of allowed outputs |
| Range (image) | $\text{ran}(f) = \{f(a) \mid a \in A\}$ | Actual outputs ($\subseteq B$) |

Types of functions:

| Type | Definition | Meaning |
|---|---|---|
| Injective (one-to-one) | $f(a_1) = f(a_2) \implies a_1 = a_2$ | Distinct inputs give distinct outputs |
| Surjective (onto) | $\text{ran}(f) = B$ | Every output is hit |
| Bijective | Injective and surjective | Perfect pairing between $A$ and $B$ |

AI examples - functions are the core abstraction:

| Function | Type signature | Properties |
|---|---|---|
| Tokeniser | $\Sigma^* \to V^*$ | Not injective (multiple strings map to the same tokens) |
| Embedding | $V \to \mathbb{R}^d$ | Injective (each token gets a unique vector) |
| LM head | $\mathbb{R}^d \to \mathbb{R}^{\vert V \vert}$ | Linear map (matrix multiplication) |
| Attention | $\mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ | Highly non-linear |
| Softmax | $\mathbb{R}^k \to \Delta^{k-1}$ | Surjective onto probability simplex |
| argmax | $\mathbb{R}^k \to [k]$ | Not injective (ties); not differentiable |
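For a function on a finite domain, injectivity and surjectivity reduce to counting and set comparison. A minimal sketch (the parity example is illustrative):

```python
def is_injective(f, A):
    """Injective iff no two inputs share an image."""
    images = [f(a) for a in A]
    return len(images) == len(set(images))

def is_surjective(f, A, B):
    """Surjective iff the image equals the whole codomain B."""
    return {f(a) for a in A} == set(B)

A = [0, 1, 2, 3]
B = [0, 1]
parity = lambda x: x % 2

print(is_injective(parity, A))      # False: 0 and 2 both map to 0
print(is_surjective(parity, A, B))  # True: both outputs are hit
```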

4.6 Composition and Inverse

Composition: Given f:ABf: A \to B and g:BCg: B \to C, the composition gf:ACg \circ f: A \to C is defined by:

(gf)(a)=g(f(a))(g \circ f)(a) = g(f(a))

Read right to left: "apply ff first, then gg."

Properties of composition:

  • Associativity: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f
  • Identity: fidA=ff \circ \text{id}_A = f and idBf=f\text{id}_B \circ f = f, where idA(a)=a\text{id}_A(a) = a
  • Not commutative: gffgg \circ f \neq f \circ g in general

Inverse: If f:ABf: A \to B is bijective, there exists a unique f1:BAf^{-1}: B \to A satisfying:

f1(f(a))=aandf(f1(b))=bf^{-1}(f(a)) = a \quad \text{and} \quad f(f^{-1}(b)) = b

Only bijections have inverses. Injective-but-not-surjective functions have left inverses; surjective-but-not-injective functions have right inverses.

AI example - neural networks as function composition: A deep neural network is fundamentally a composition of functions:

y^=fLfL1f2f1(x)\hat{y} = f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1(x)

Each fif_i is one layer (linear map + activation). Deep learning is deeply compositional - the power comes from composing many simple functions. The chain rule of calculus (backpropagation) exploits this compositional structure: x(gf)=gffx\frac{\partial}{\partial x}(g \circ f) = \frac{\partial g}{\partial f} \cdot \frac{\partial f}{\partial x}.
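A small sketch of composition as it appears in deep learning, with two toy "layers" (the functions and the `compose` helper are illustrative, not a real framework API):

```python
from functools import reduce

def compose(*fs):
    """compose(f, g)(x) == f(g(x)) -- rightmost function applied first."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)

# Toy layers: an affine map followed by a ReLU-style nonlinearity
affine = lambda x: 2 * x - 1
relu = lambda x: max(0.0, x)

net = compose(relu, affine)  # apply affine first, then relu
print(net(3))  # relu(2*3 - 1) = 5
print(net(0))  # relu(-1) = 0.0
```

Associativity is what lets frameworks treat a whole stack of layers as a single function; non-commutativity is why layer order matters.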


5. Propositional Logic

Propositional logic is the simplest complete system of logical reasoning. It studies how the truth of compound statements depends on the truth of their components, without looking inside those components. It is the foundation for Boolean circuits, attention mask logic, SAT solvers, and the systematic reasoning that predicate logic (6) and proof techniques (7) build upon.

5.1 Propositions

A proposition is a declarative statement that is either true (TT, 1) or false (FF, 0). Never both, never neither.

Examples of propositions:

  • "The sky is blue" - true proposition
  • "2+2=52 + 2 = 5" - false proposition
  • "LLaMA-3 has 8B parameters" - true proposition
  • "Every even integer greater than 2 is the sum of two primes" - proposition (truth value unknown; Goldbach's conjecture)

Not propositions:

  • "What is your name?" - question, no truth value
  • "Close the door" - command, no truth value
  • "x>3x > 3" - depends on xx; becomes a proposition only when xx is specified (this is a predicate, see 6)

Propositional variables: Letters p,q,r,s,p, q, r, s, \ldots stand for unspecified propositions. Each has a truth value - either TT or FF.

5.2 Logical Connectives

Connectives combine propositions into compound statements. There are six standard connectives:

Negation (NOT): ¬p\neg p

"Not pp." Flips the truth value.

| $p$ | $\neg p$ |
|---|---|
| T | F |
| F | T |

AI: Bitwise NOT in attention masks; complement of a token set.

Conjunction (AND): pqp \wedge q

"pp and qq." True only when both are true.

| $p$ | $q$ | $p \wedge q$ |
|---|---|---|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | F |

AI: Combining attention masks (token must be unmasked AND within window); intersecting constraints in structured generation.

Disjunction (OR): pqp \vee q

"pp or qq." True when at least one is true (inclusive or).

| $p$ | $q$ | $p \vee q$ |
|---|---|---|
| T | T | T |
| T | F | T |
| F | T | T |
| F | F | F |

AI: Union of token sets in structured generation; "match either pattern" in retrieval.

Exclusive Or (XOR): pqp \oplus q

"pp xor qq." True when exactly one is true.

| $p$ | $q$ | $p \oplus q$ |
|---|---|---|
| T | T | F |
| T | F | T |
| F | T | T |
| F | F | F |

AI: Symmetric difference of sets; error detection (XOR of prediction and ground truth bits); XOR is the simplest function not linearly separable - a single perceptron cannot learn it (Minsky & Papert 1969).

Implication (IF...THEN): pqp \to q

"If pp then qq." Also read "pp implies qq." False only when pp is true and qq is false - a true hypothesis cannot lead to a false conclusion.

| $p$ | $q$ | $p \to q$ |
|---|---|---|
| T | T | T |
| T | F | F |
| F | T | T |
| F | F | T |

Vacuous truth: When pp is false, pqp \to q is always true regardless of qq. This is often counterintuitive but essential:

  • "If pigs fly, then the moon is green" is true (vacuously)
  • The reason: pqp \to q promises "whenever pp holds, qq holds." If pp never holds, the promise is never violated, so it's true by default.

Vacuous truth in mathematics and AI:

  • "x\forall x \in \emptyset, property P(x)P(x) holds" - true, because there's nothing to check
  • "A\emptyset \subseteq A" - true, because every element of \emptyset (there are none) is in AA
  • "If the model reaches 100% accuracy, then it has generalised" - vacuously true if the model never reaches 100%

Biconditional (IF AND ONLY IF): pqp \leftrightarrow q

"pp if and only if qq" (abbreviated "iff"). True when pp and qq have the same truth value.

| $p$ | $q$ | $p \leftrightarrow q$ |
|---|---|---|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | T |

Equivalently: pq(pq)(qp)p \leftrightarrow q \equiv (p \to q) \wedge (q \to p).

AI: Set equality (A=B    ABBAA = B \iff A \subseteq B \wedge B \subseteq A); defining exactly when a condition holds.
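All of the truth tables above can be generated mechanically by enumerating assignments with `itertools.product`. A sketch (the `truth_table` helper is illustrative):

```python
from itertools import product

def truth_table(formula, n_vars):
    """Every truth assignment together with the formula's value."""
    return [(vals, formula(*vals))
            for vals in product([True, False], repeat=n_vars)]

# Implication encoded as not-p-or-q (see the equivalence in 5.5)
implies = lambda p, q: (not p) or q
for (p, q), value in truth_table(implies, 2):
    print(p, q, value)
# Only the row p=True, q=False evaluates to False, matching the table above.
```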

5.3 Operator Precedence

When multiple connectives appear without parentheses, the standard precedence (highest first) is:

¬>>>>\neg \quad > \quad \wedge \quad > \quad \vee \quad > \quad \to \quad > \quad \leftrightarrow

Examples:

  • ¬pq\neg p \wedge q means (¬p)q(\neg p) \wedge q, not ¬(pq)\neg(p \wedge q)
  • pqrp \vee q \to r means (pq)r(p \vee q) \to r, not p(qr)p \vee (q \to r)
  • pqrp \wedge q \vee r means (pq)r(p \wedge q) \vee r, not p(qr)p \wedge (q \vee r)

Best practice: Always use parentheses when there's any ambiguity. Clarity is more important than brevity.

5.4 Tautologies, Contradictions, and Contingencies

Tautology: A formula that is true for all possible truth value assignments to its variables. Logically necessary - cannot possibly be false.

| Tautology | Name |
|---|---|
| $p \vee \neg p$ | Law of excluded middle |
| $p \to p$ | Self-implication |
| $(p \wedge q) \to p$ | Simplification |
| $(p \to q) \leftrightarrow (\neg p \vee q)$ | Implication equivalence |

Contradiction (Unsatisfiable): A formula that is false for all truth value assignments.

| Contradiction | Why |
|---|---|
| $p \wedge \neg p$ | Cannot be both true and false |
| $(p \to q) \wedge p \wedge \neg q$ | Modus ponens violated |

Contingency: A formula that is true for some assignments and false for others.

  • pqp \wedge q - true when both true, false otherwise
  • pqp \to q - false only when TFT \to F

Satisfiability: A formula is satisfiable if at least one truth assignment makes it true. Tautologies and contingencies are satisfiable; contradictions are not.

The SAT problem: Given a propositional formula ϕ\phi, is ϕ\phi satisfiable? This is the canonical NP-complete problem (Cook-Levin theorem, 1971). Despite worst-case intractability, modern SAT solvers handle millions of variables - they are critical for AI constraint solving, formal verification, and automated planning.
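For a handful of variables, satisfiability can be decided by brute force over all $2^n$ assignments; real SAT solvers are vastly cleverer, but the specification is exactly this. A sketch (helper names illustrative):

```python
from itertools import product

def is_satisfiable(formula, n_vars):
    """Brute-force SAT: does any of the 2^n assignments make it true?"""
    return any(formula(*vals)
               for vals in product([True, False], repeat=n_vars))

# (p -> q) ^ p ^ ~q -- the "modus ponens violated" contradiction from above
contradiction = lambda p, q: ((not p) or q) and p and (not q)
excluded_middle = lambda p: p or (not p)

print(is_satisfiable(contradiction, 2))    # False: unsatisfiable
print(is_satisfiable(excluded_middle, 1))  # True: a tautology is satisfiable
```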

5.5 Logical Equivalence

Two formulas ϕ\phi and ψ\psi are logically equivalent (written ϕψ\phi \equiv \psi) if they have the same truth value for every possible truth assignment. Equivalently: ϕψ\phi \leftrightarrow \psi is a tautology.

Here are the fundamental equivalences - the "algebra rules" of logic:

Double Negation:

¬¬pp\neg\neg p \equiv p

De Morgan's Laws:

¬(pq)¬p¬q\neg(p \wedge q) \equiv \neg p \vee \neg q ¬(pq)¬p¬q\neg(p \vee q) \equiv \neg p \wedge \neg q

These are the most used laws in practice. Notice they mirror the set-theoretic De Morgan's Laws (3.6) exactly - this is the Boolean algebra correspondence (5.7).

Implication Equivalences:

pq¬pqp \to q \equiv \neg p \vee q pq¬q¬p(contrapositive)p \to q \equiv \neg q \to \neg p \quad \text{(contrapositive)} ¬(pq)p¬q\neg(p \to q) \equiv p \wedge \neg q

The contrapositive equivalence is used constantly in proofs (see 7.3). It says "pp implies qq" is the same as "not qq implies not pp." Distinguish from the converse (qpq \to p), which is NOT equivalent to pqp \to q.

Distributive Laws:

p(qr)(pq)(pr)p \wedge (q \vee r) \equiv (p \wedge q) \vee (p \wedge r) p(qr)(pq)(pr)p \vee (q \wedge r) \equiv (p \vee q) \wedge (p \vee r)

Absorption Laws:

p(pq)pp \wedge (p \vee q) \equiv p p(pq)pp \vee (p \wedge q) \equiv p

Idempotent Laws:

pppp \wedge p \equiv p pppp \vee p \equiv p

Identity Laws:

pTpp \wedge T \equiv p pFpp \vee F \equiv p

Domination Laws:

pFFp \wedge F \equiv F pTTp \vee T \equiv T

Complement Laws:

p¬pFp \wedge \neg p \equiv F p¬pTp \vee \neg p \equiv T
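Each of these equivalences can be verified exhaustively, since $\phi \equiv \psi$ just means agreement on every assignment. A sketch (the `equivalent` helper is illustrative):

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Logical equivalence: same truth value under every assignment."""
    return all(f(*v) == g(*v)
               for v in product([True, False], repeat=n_vars))

# De Morgan: ~(p ^ q) == ~p v ~q
print(equivalent(lambda p, q: not (p and q),
                 lambda p, q: (not p) or (not q), 2))  # True

# Contrapositive: p -> q == ~q -> ~p
print(equivalent(lambda p, q: (not p) or q,
                 lambda p, q: (not (not q)) or (not p), 2))  # True

# The converse q -> p is NOT equivalent to p -> q
print(equivalent(lambda p, q: (not p) or q,
                 lambda p, q: (not q) or p, 2))  # False
```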

5.6 Normal Forms

Literal: A variable or its negation (pp or ¬p\neg p).

Clause: A disjunction of literals (p¬qrp \vee \neg q \vee r).

Term: A conjunction of literals (p¬qrp \wedge \neg q \wedge r).

Conjunctive Normal Form (CNF)

A formula in CNF is a conjunction of clauses - an AND of ORs:

(pq¬r)(¬ps)(r¬s)(p \vee q \vee \neg r) \wedge (\neg p \vee s) \wedge (r \vee \neg s)

Every propositional formula can be converted to an equivalent CNF. SAT solvers require CNF input - the DPLL and CDCL algorithms operate exclusively on CNF.

Disjunctive Normal Form (DNF)

A formula in DNF is a disjunction of terms - an OR of ANDs:

(pq¬r)(¬ps)(r¬s)(p \wedge q \wedge \neg r) \vee (\neg p \wedge s) \vee (r \wedge \neg s)

Every formula can be converted to DNF. A DNF is satisfiable iff any single term is satisfiable (no complementary literals in a term), which is easy to check - but the conversion to DNF may be exponentially large.

Conversion to CNF (Algorithm)

Given an arbitrary formula ϕ\phi:

  1. Eliminate biconditionals: pq(pq)(qp)p \leftrightarrow q \longrightarrow (p \to q) \wedge (q \to p)
  2. Eliminate implications: pq¬pqp \to q \longrightarrow \neg p \vee q
  3. Push negations inward (De Morgan's + double negation):
    • ¬(pq)¬p¬q\neg(p \wedge q) \longrightarrow \neg p \vee \neg q
    • ¬(pq)¬p¬q\neg(p \vee q) \longrightarrow \neg p \wedge \neg q
    • ¬¬pp\neg\neg p \longrightarrow p
  4. Distribute \vee over \wedge: p(qr)(pq)(pr)p \vee (q \wedge r) \longrightarrow (p \vee q) \wedge (p \vee r)

Example: Convert p(qr)p \to (q \wedge r) to CNF:

  • Step 2: ¬p(qr)\neg p \vee (q \wedge r)
  • Step 4: (¬pq)(¬pr)(\neg p \vee q) \wedge (\neg p \vee r) - this is CNF OK
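The conversion can be sanity-checked the same way any equivalence can: by comparing truth tables. A sketch verifying the worked example above:

```python
from itertools import product

original = lambda p, q, r: (not p) or (q and r)          # p -> (q ^ r)
cnf = lambda p, q, r: ((not p) or q) and ((not p) or r)  # (~p v q) ^ (~p v r)

assignments = list(product([True, False], repeat=3))
print(all(original(*v) == cnf(*v) for v in assignments))  # True
```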

5.7 Boolean Algebra

The laws of propositional logic form an algebraic structure called a Boolean algebra: (B,,,¬,0,1)(B, \wedge, \vee, \neg, 0, 1) satisfying all the equivalences listed in 5.5.

The fundamental connection - Stone Representation Theorem: Every Boolean algebra is isomorphic to an algebra of sets:

| Logic | Sets | Python |
|---|---|---|
| $\wedge$ (AND) | $\cap$ (intersection) | `&` |
| $\vee$ (OR) | $\cup$ (union) | `\|` |
| $\neg$ (NOT) | complement | `~` |
| $F$ (false) | $\emptyset$ | `set()` |
| $T$ (true) | $U$ | universal set |
| $\to$ (implies) | $\subseteq$ | `<=` |
| $\leftrightarrow$ (iff) | $=$ | `==` |

This is why De Morgan's laws look the same for sets and for logic - they are the same, in the precise algebraic sense. Boolean algebra unifies propositional logic, set operations, and circuit design under a single framework.

AI connections:

  • Attention masks as Boolean matrices: An attention mask is a Boolean matrix M{0,1}n×nM \in \{0, 1\}^{n \times n} where Mij=1M_{ij} = 1 iff query ii attends to key jj. Combining masks (causal AND sliding window) is Boolean AND.
  • Structured generation FSMs: The transition function of a finite state machine operates on Boolean states. Valid token sets at each state are computed via Boolean operations.
  • Circuit complexity of neural networks: A ReLU network computes a piecewise linear function; each linear region corresponds to a Boolean pattern of active/inactive neurons - the network's behaviour is described by a Boolean function over activation patterns.
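The first bullet can be sketched directly: combining a causal mask with a sliding-window mask is elementwise Boolean AND. The sizes and window width below are illustrative:

```python
n = 6
window = 3  # sliding-window width (illustrative)

# Causal mask: query i may attend to key j iff j <= i
causal = [[j <= i for j in range(n)] for i in range(n)]

# Local mask: key j lies within `window` positions before (or at) query i
local = [[i - j < window for j in range(n)] for i in range(n)]

# Combined mask = elementwise AND: logic's ^, i.e. set intersection of pairs
combined = [[c and w for c, w in zip(crow, wrow)]
            for crow, wrow in zip(causal, local)]

print([int(x) for x in combined[4]])  # -> [0, 0, 1, 1, 1, 0]
```

Position 4 sees only positions 2 through 4: causality cuts off the future, and the window cuts off the distant past.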

6. Predicate Logic (First-Order Logic)

Propositional logic treats statements as atomic units - it cannot look inside them. Predicate logic (FOL) opens them up, exposing the objects, properties, and quantifiers that make mathematical reasoning powerful. Every formal specification in AI - from type systems to knowledge bases to verification conditions - ultimately rests on predicate logic.

6.1 Predicates and Quantifiers - Motivation

The statement "every prime greater than 2 is odd" cannot be expressed in propositional logic. We need:

  1. A way to talk about individual objects (numbers, tokens, vectors)
  2. A way to express properties of objects ("is prime," "is odd")
  3. A way to say "for all" or "there exists"

Predicate logic provides all three.

6.2 Syntax of FOL

Terms: The objects we talk about.

  • Constants: Specific objects (00, 11, π\pi, [PAD], [CLS])
  • Variables: Placeholders for objects (xx, yy, zz)
  • Function symbols: Operations that produce objects (f(x)f(x), x+yx + y, embed(t)\text{embed}(t))

Predicates (relation symbols): Properties of or relationships between objects that evaluate to true or false.

  • Unary: Prime(x)\text{Prime}(x), IsToken(t)\text{IsToken}(t), Positive(x)\text{Positive}(x)
  • Binary: x<yx < y, SameCluster(a,b)\text{SameCluster}(a, b), Attends(i,j)\text{Attends}(i, j)
  • nn-ary: Between(x,y,z)\text{Between}(x, y, z) meaning yy is between xx and zz

Quantifiers: The two pillars of FOL.

Universal Quantifier: \forall ("for all")

xP(x)\forall x\, P(x)

"For every xx, P(x)P(x) holds." True when P(a)P(a) is true for every object aa in the domain.

Examples:

  • xR,x20\forall x \in \mathbb{R},\, x^2 \geq 0 - "every real number squared is non-negative"
  • tV,embed(t)Rd\forall t \in V,\, \text{embed}(t) \in \mathbb{R}^d - "every token maps to a dd-dimensional vector"
  • i,j[n],i>j    Mij=0\forall i, j \in [n],\, i > j \implies M_{ij} = 0 - causal mask definition

Existential Quantifier: \exists ("there exists")

xP(x)\exists x\, P(x)

"There exists an xx such that P(x)P(x) holds." True when P(a)P(a) is true for at least one object aa in the domain.

Examples:

  • xR,x2=2\exists x \in \mathbb{R},\, x^2 = 2 - "2\sqrt{2} exists (in R\mathbb{R})"
  • tV,P(next=tcontext)>0.5\exists t \in V,\, P(\text{next} = t \mid \text{context}) > 0.5 - "there's a token with probability above 0.5"
  • θ,L(θ)=0\exists \theta,\, \mathcal{L}(\theta) = 0 - "a global minimum exists" (not always true!)

Unique Existential: !\exists! ("there exists exactly one")

!xP(x)x[P(x)y(P(y)y=x)]\exists! x\, P(x) \equiv \exists x\, [P(x) \wedge \forall y\, (P(y) \to y = x)]

"There is exactly one xx satisfying P(x)P(x)." This is a defined symbol - it's syntactic sugar for the expression on the right.

6.3 Free and Bound Variables

A variable xx is bound in a formula if it appears within the scope of a quantifier x\forall x or x\exists x. Otherwise, it is free.

  • In x(x>y)\forall x\, (x > y): xx is bound, yy is free
  • In xP(x,y)Q(z)\exists x\, P(x, y) \wedge Q(z): xx is bound, yy and zz are free
  • In xy(x+y=0)\forall x\, \exists y\, (x + y = 0): both xx and yy are bound

A sentence (closed formula) has no free variables - it has a definite truth value. A formula with free variables is a predicate - it becomes a sentence only when specific values are substituted for the free variables or when they are quantified.

AI connection: In a loss function L(θ)=1Ni=1N(fθ(xi),yi)\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i), the index ii is bound (by the sum, a disguised \forall), while θ\theta is free - this is why we optimise over θ\theta.

6.4 Negating Quantifiers

Negation interacts with quantifiers via two fundamental rules:

¬xP(x)x¬P(x)\neg \forall x\, P(x) \equiv \exists x\, \neg P(x)

"Not everything satisfies PP" \equiv "Something fails to satisfy PP."

¬xP(x)x¬P(x)\neg \exists x\, P(x) \equiv \forall x\, \neg P(x)

"Nothing satisfies PP" \equiv "Everything fails to satisfy PP."

These are the quantifier analogues of De Morgan's laws. The pattern generalises:

  • Negation swaps \forall and \exists
  • Negation swaps \wedge and \vee
  • Negation swaps \cap and \cup

AI example: "Not all tokens are attended to" \equiv "There exists a token that is not attended to":

¬jAttends(i,j)j¬Attends(i,j)\neg \forall j\, \text{Attends}(i, j) \equiv \exists j\, \neg\text{Attends}(i, j)
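Over a finite domain, these negation rules are exactly the relationship between Python's `all` and `any`. A sketch with a toy attendance vector (illustrative data):

```python
# Toy data: for a fixed query i, does it attend to key j?
attends = [True, False, True]

# ~forall j P(j)  ==  exists j ~P(j)
assert (not all(attends)) == any(not p for p in attends)

# ~exists j P(j)  ==  forall j ~P(j)
assert (not any(attends)) == all(not p for p in attends)

print("both De Morgan quantifier rules hold")
```

These identities hold for every Boolean list, which is the finite-domain shadow of the general FOL equivalences.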

6.5 Nested Quantifiers

Quantifiers can be nested. The order matters for mixed quantifiers:

xyP(x,y)yxP(x,y)\forall x\, \exists y\, P(x, y) \quad \neq \quad \exists y\, \forall x\, P(x, y)

The first says: "for every xx, there is (possibly different) yy such that P(x,y)P(x, y)." The second says: "there is a single yy that works for all xx."

Comparison with examples:

| Statement | Formal | True? |
|---|---|---|
| "For every real, there's a larger one" | $\forall x \in \mathbb{R}\, \exists y \in \mathbb{R}\, (y > x)$ | True |
| "There's a real larger than all reals" | $\exists y \in \mathbb{R}\, \forall x \in \mathbb{R}\, (y > x)$ | False |
| "Every continuous function on $[a,b]$ attains its max" | $\forall f \in C[a,b]\, \exists x^* \in [a,b]\, \forall x \in [a,b]\, f(x^*) \geq f(x)$ | True (EVT) |
| "For every $\varepsilon > 0$, there exists $\delta > 0$ ..." | $\forall \varepsilon\, \exists \delta\, \forall x\, \ldots$ | Continuity def. |

Order rule: Swapping \forall and \exists is valid only if one direction of implication holds:

yxP(x,y)    xyP(x,y)\exists y\, \forall x\, P(x, y) \implies \forall x\, \exists y\, P(x, y)

The converse is NOT generally true. A single universal yy is a strictly stronger claim than an individual y(x)y(x) for each xx.

AI example - convergence of SGD:

  • ε>0Tt>TL(θt)L<ε\forall \varepsilon > 0\, \exists T\, \forall t > T\, |\mathcal{L}(\theta_t) - \mathcal{L}^*| < \varepsilon means "the loss converges to L\mathcal{L}^*"
  • Here TT depends on ε\varepsilon - this is the \forall\exists pattern
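Nested quantifiers over finite domains translate into nested `all`/`any`, which makes the order-dependence concrete. A sketch with the successor predicate (domains chosen for illustration):

```python
X = range(5)
Y = range(6)
P = lambda x, y: y == x + 1  # "y is the successor of x"

# forall x exists y: each x gets its own witness y = x + 1
forall_exists = all(any(P(x, y) for y in Y) for x in X)

# exists y forall x: one single y would have to be every x's successor
exists_forall = any(all(P(x, y) for x in X) for y in Y)

print(forall_exists, exists_forall)  # -> True False
```

The witness $y$ may depend on $x$ in the first formula but not in the second, which is exactly why $\exists\forall$ is the strictly stronger claim.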

6.6 Translating Between English and FOL

Translating natural language to FOL is a critical skill for formal verification, knowledge representation, and understanding what mathematical theorems actually say.

Systematic approach:

  1. Identify the domain (what are the objects?)
  2. Identify the predicates (what properties/relations?)
  3. Identify the quantifier structure (\forall, \exists, their order)
  4. Write the formal expression
  5. Check by reading it back in English

Practice translations:

| English | FOL |
|---|---|
| "All vectors in $S$ have unit norm" | $\forall v \in S,\, \lVert v \rVert = 1$ |
| "Some weight is negative" | $\exists w \in W,\, w < 0$ |
| "No gradient is zero" | $\forall g \in G,\, g \neq 0$ or $\neg\exists g \in G,\, g = 0$ |
| "If all gradients vanish, we're at a critical point" | $(\forall i,\, \nabla_i f = 0) \to \text{CriticalPoint}(\theta)$ |
| "For every layer, there exists a neuron that fires" | $\forall \ell\, \exists n\, \text{Fires}(\ell, n)$ |
| "There's a learning rate that works for all tasks" | $\exists \eta\, \forall \text{task}\, \text{Converges}(\eta, \text{task})$ (probably false!) |

6.7 FOL in AI Contexts

Knowledge graphs store facts as FOL ground atoms: LocatedIn(Paris,France)\text{LocatedIn}(\text{Paris}, \text{France}), InstanceOf(GPT-4,LLM)\text{InstanceOf}(\text{GPT-4}, \text{LLM}). Rules are FOL implications: x,y,z[LocatedIn(x,y)LocatedIn(y,z)LocatedIn(x,z)]\forall x, y, z\, [\text{LocatedIn}(x, y) \wedge \text{LocatedIn}(y, z) \to \text{LocatedIn}(x, z)].

Description logics (OWL, used in ontologies) are decidable fragments of FOL. They balance expressiveness with computational tractability - reasoning in full FOL is undecidable.

Program specification: Hoare logic uses FOL to specify pre/post-conditions: {P}program{Q}\{P\}\, \text{program}\, \{Q\} means "if PP holds before, QQ holds after."

LLM reasoning: When an LLM performs chain-of-thought reasoning, it is (approximately) performing FOL inference - applying modus ponens, instantiating universals, existential witnesses. The fidelity of this approximation is an active research question.


7. Proof Techniques

This chapter only needs a working preview of proof methods so later sections can state theorems precisely and you can follow the logic of an argument without being surprised by the structure. The full treatment lives in Proof Techniques, where each method is developed in much more depth.

For now, the right goal is recognition:

  • What kind of claim is being proved?
  • What assumptions are being used?
  • Why is a specific proof strategy a natural fit?

7.1 What is a Proof?

A proof is a finite chain of justified steps from assumptions, definitions, axioms, and previously established results to a conclusion. In this section, sets and logic give us the vocabulary; proofs tell us how that vocabulary is used to certify statements rather than merely suggest them.

That distinction matters in AI. A numerical experiment may suggest that an optimizer converges or that a masking rule preserves causality, but a proof tells you exactly under which assumptions the conclusion must hold.

7.2 Direct Proof

In a direct proof, you start from the hypothesis and push forward to the conclusion. This is the default style for statements of the form "PQP \to Q" when the path from premises to result is transparent.

Example: to show that the sum of two even integers is even, write them as 2a2a and 2b2b, add them, and observe the result is again divisible by 2. This style mirrors many derivations in ML, where you begin with model assumptions and algebraically derive a bound or identity.

7.3 Proof by Contrapositive

Sometimes the forward direction is awkward, but the logically equivalent statement ¬Q¬P\neg Q \to \neg P is easier. That is proof by contrapositive, and it is why the implication equivalence in 5.5 matters operationally, not just symbolically.

Classic example: to prove "if n2n^2 is even, then nn is even," it is simpler to show "if nn is odd, then n2n^2 is odd." In theory-heavy AI writing, contraposition shows up in impossibility and lower-bound arguments where failure of the conclusion exposes useful structure in the assumptions.

7.4 Proof by Contradiction

In proof by contradiction, you assume the statement is false and derive an impossibility. This is especially common when proving that some object cannot exist, that a property must hold universally, or that a certain finite description is impossible.

Famous examples include proving that 2\sqrt{2} is irrational and that there are infinitely many primes. In AI-flavored mathematics, contradiction often appears when ruling out pathological counterexamples or showing that a supposed optimum or invariant would violate earlier assumptions.

7.5 Proof by Cases

When a domain naturally splits into a few exhaustive possibilities, you can prove a statement separately in each branch. This is proof by cases.

For example, the claim x0|x| \geq 0 is easiest by splitting into x0x \geq 0 and x<0x < 0. The same structure appears in algorithms and model analysis whenever behavior differs across regimes such as active vs inactive ReLU units, interior vs boundary points, or short-context vs long-context cases.

7.6 Mathematical Induction

Induction proves statements indexed by the natural numbers. You establish a base case, then show that if the claim holds at step kk, it also holds at step k+1k+1. Once both parts are in place, the result propagates to all later indices.

This matters for AI because many objects are recursive or sequential: token prefixes, layers in a stack, rollout horizons, and sequence length arguments are all natural induction targets. We will use the full machinery in the dedicated proof section when sequence-model reasoning becomes more central.

7.7 Existence and Uniqueness Proofs

An existence proof shows that at least one object with the desired property exists. A uniqueness proof shows there can be at most one. Together, they establish that there is exactly one such object.

This distinction is everywhere in optimization and model theory. "A minimizer exists" and "the minimizer is unique" are very different statements, with different consequences for learning dynamics and interpretation.

7.8 Proof by Construction

A constructive proof does more than assert existence: it explicitly builds the object. This is especially valuable in computer science and AI because a constructive argument often doubles as an algorithm.

For instance, if a theorem says there exists a network, partition, mask, or encoding with certain properties, a constructive proof tells you how to produce one. That is why constructive arguments are often more actionable than purely existential ones.

7.9 Why This Is Only a Preview

At this stage, proof techniques should feel like a toolbox you can recognize, not yet one you have fully mastered. The important bridge is:

  • logic gives the form of valid inference
  • sets provide the objects and predicates we talk about
  • proofs combine both into reliable mathematical arguments

When you are ready to go deeper, continue with Proof Techniques, where these methods are expanded with full templates, worked examples, and more explicit AI applications.


8. Axiomatic Set Theory (ZFC)

Naive set theory (2) is intuitive but broken - Russell's Paradox (2.5) shows that unrestricted set formation leads to contradictions. ZFC (Zermelo-Fraenkel axioms with the Axiom of Choice) is the standard rigorous foundation for all of modern mathematics. Every mathematical object - numbers, functions, spaces, probability measures, neural networks - can be built from ZFC sets.

8.1 Why Axioms?

Axioms serve two purposes:

  1. Prevent paradoxes: By restricting which sets exist, ZFC avoids Russell's Paradox and similar contradictions.
  2. Ensure consistency: All mathematical reasoning ultimately reduces to statements provable from these axioms.

The axioms tell you exactly which sets you are allowed to construct and what operations you may perform on them. Nothing more, nothing less.

8.2 The ZFC Axioms

There are 9 axioms (some formulations give 8 or 10; the exact count depends on grouping). Here they are, with intuition and notation.

Axiom 1: Extensionality

\forall A\, \forall B\, [\forall x\, (x \in A \leftrightarrow x \in B) \to A = B]

Two sets are equal if and only if they have the same elements. A set is determined entirely by its membership - there is no "hidden structure."

Consequence: \{1, 2, 3\} = \{3, 1, 2\} = \{1, 1, 2, 3\}. Order and repetition are irrelevant.

Axiom 2: Empty Set

\exists \emptyset\, \forall x\, (x \notin \emptyset)

There exists a set with no elements. By Extensionality, this set is unique - there is only one empty set.

Axiom 3: Pairing

\forall a\, \forall b\, \exists C\, \forall x\, (x \in C \leftrightarrow x = a \vee x = b)

For any two objects a and b, the set \{a, b\} exists. This lets us form pairs, which underpins ordered pairs and Cartesian products.

Axiom 4: Union

\forall \mathcal{F}\, \exists U\, \forall x\, (x \in U \leftrightarrow \exists A \in \mathcal{F},\, x \in A)

For any collection of sets \mathcal{F}, there exists a set U = \bigcup \mathcal{F} containing exactly the elements that belong to at least one set in \mathcal{F}.

Axiom 5: Power Set

\forall A\, \exists P\, \forall B\, (B \in P \leftrightarrow B \subseteq A)

For any set A, the power set \mathcal{P}(A) exists. This is the set of all subsets of A, and |\mathcal{P}(A)| = 2^{|A|} for finite A.

Axiom 6: Separation (Specification / Comprehension)

\forall A\, \exists B\, \forall x\, (x \in B \leftrightarrow x \in A \wedge \phi(x))

For any set A and any property \phi, the set \{x \in A \mid \phi(x)\} exists. Crucially, you can only separate out elements from an existing set - you cannot form \{x \mid \phi(x)\} without a bounding set A.

This is what blocks Russell's Paradox. The "set of all sets that don't contain themselves" would require \{x \mid x \notin x\} - but without a bounding set A, Separation doesn't let you form it.
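
Python's set comprehensions enforce the same discipline; a small illustrative sketch (phi is an arbitrary predicate chosen for the example):

```python
# Separation in miniature: {x ∈ A | phi(x)} always filters an existing set A.
A = set(range(20))

def phi(x):
    # An arbitrary predicate -- here, "x is even".
    return x % 2 == 0

B = {x for x in A if phi(x)}  # bounded comprehension, as Separation requires

# Python has no unbounded form "{x for x if phi(x)}" -- the analogue of the
# unrestricted comprehension that Russell's Paradox exploits.
```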

Axiom 7: Infinity

\exists I\, [\emptyset \in I \wedge \forall x\, (x \in I \to x \cup \{x\} \in I)]

There exists an infinite set - specifically, a set containing \emptyset, \{\emptyset\}, \{\emptyset, \{\emptyset\}\}, and so on forever. This axiom guarantees the existence of the natural numbers (and prevents mathematics from being stuck in a finite world).

Axiom 8: Replacement

\forall A\, \forall F\, [\text{($F$ is a function)} \to \exists B\, B = \{F(x) \mid x \in A\}]

If F is a definable function and A is a set, then the image \{F(x) \mid x \in A\} is a set. This is like Separation but more powerful - it lets you apply a function to every element and collect the results.

AI analogy: Replacement is map. Given a set (dataset) and a function (model), the outputs form a set (predictions).

Axiom 9: Foundation (Regularity)

\forall A\, [A \neq \emptyset \to \exists x \in A\, (x \cap A = \emptyset)]

Every non-empty set A has an element disjoint from A. This prevents circular membership chains (a \in b \in a) and self-containing sets (a \in a). Sets are "well-founded" - they are built from the bottom up, starting at \emptyset.

8.3 The Axiom of Choice (AC)

\forall \mathcal{F}\, [\emptyset \notin \mathcal{F} \to \exists f: \mathcal{F} \to \bigcup \mathcal{F}\, \forall A \in \mathcal{F},\, f(A) \in A]

For any collection of non-empty sets, there exists a choice function f that selects exactly one element from each set. This seems obvious but has deep consequences.

Why it's controversial:

  • It's non-constructive: it asserts the existence of a choice function without building one
  • It implies paradoxical results: Banach-Tarski paradox (decompose a sphere into 5 pieces and reassemble into two spheres of the same size)
  • It implies the well-ordering theorem: every set can be well-ordered (including \mathbb{R}, despite having no known explicit well-ordering)

Why it's accepted:

  • Without AC, many theorems fail (every vector space has a basis; Tychonoff's theorem; Hahn-Banach theorem)
  • ZFC (with AC) has not been shown inconsistent
  • AC is independent of ZF: both ZF+AC and ZF+\negAC are consistent (if ZF is consistent), by Gödel (1938) and Cohen (1963)

AI applications of AC: Most machine learning theory implicitly uses AC:

  • "There exists a minimiser of the loss function" - may require AC for non-compact spaces
  • "Every vector space has a basis" - needs AC for infinite-dimensional spaces (function spaces in kernel methods)
  • In practice, everything is finite-dimensional, so AC is not needed for actual computations

8.4 Constructing Numbers from Sets

One of the most remarkable achievements of axiomatic set theory is building all number systems from pure sets:

Natural numbers (von Neumann construction):

0 = \emptyset, \quad 1 = \{\emptyset\}, \quad 2 = \{\emptyset, \{\emptyset\}\}, \quad 3 = \{\emptyset, \{\emptyset\}, \{\emptyset, \{\emptyset\}\}\}, \quad \ldots

The rule: n + 1 = n \cup \{n\}. Each natural number is the set of all smaller natural numbers: n = \{0, 1, \ldots, n-1\}, so |n| = n.
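
The construction is easy to mimic with Python frozensets; a sketch:

```python
def von_neumann(n):
    """Build the von Neumann natural n = {0, 1, ..., n-1} as a frozenset."""
    current = frozenset()              # 0 = the empty set
    for _ in range(n):
        current = current | {current}  # successor: n + 1 = n ∪ {n}
    return current

# Each natural k < n is both an element and a subset of n.
two, three = von_neumann(2), von_neumann(3)
```

Note that `len(von_neumann(n)) == n`, matching |n| = n above.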

Integers: \mathbb{Z} is constructed as equivalence classes of pairs of natural numbers: [(a, b)] represents a - b.

Rationals: \mathbb{Q} is constructed as equivalence classes of pairs of integers: [(p, q)] represents p/q where q \neq 0.

Reals: \mathbb{R} is constructed via Dedekind cuts or Cauchy sequences of rationals.

The takeaway: Every mathematical object - including the real numbers that neural network weights live in - is built from the empty set using the ZFC axioms. This is why set theory is called the "foundation of mathematics."

8.5 Ordinals and Cardinals (Brief)

Ordinal numbers extend the natural numbers to describe the "length" of well-orderings:

0, 1, 2, \ldots, \omega, \omega + 1, \omega + 2, \ldots, \omega \cdot 2, \ldots, \omega^2, \ldots, \omega^\omega, \ldots, \varepsilon_0, \ldots

Here \omega is the first infinite ordinal (the order type of \mathbb{N}). Ordinals are used to measure the "height" of sets in the cumulative hierarchy.

Cardinal numbers measure the "size" of sets. For finite sets, the cardinal and ordinal notions coincide (|S| = n means S has n elements). For infinite sets, cardinality is defined via bijections - two sets have the same cardinality iff a bijection exists between them. More on this in 10.


9. Logic in Computation and AI

This section bridges the pure theory of 5-8 to the concrete computational world. The key idea: logic is not just a tool for proofs - it is the foundation of computation itself.

9.1 Boolean Circuits

Every digital computer is built from logic gates - physical implementations of the logical connectives:

| Gate | Logic | Symbol | Output |
|------|-------|--------|--------|
| AND  | p \wedge q | & | 1 iff both inputs are 1 |
| OR   | p \vee q | \| | 1 iff at least one input is 1 |
| NOT  | \neg p | ~ | Flips the bit |
| NAND | \neg(p \wedge q) | - | 0 iff both inputs are 1 |
| XOR  | p \oplus q | ^ | 1 iff inputs differ |

Functional completeness: \{AND, OR, NOT\} can compute any Boolean function. Even more remarkably, \{NAND\} alone is functionally complete - every Boolean function can be built from NAND gates.
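
Functional completeness of NAND can be verified by brute force over all inputs; a small sketch:

```python
def nand(p, q):
    return not (p and q)

# The three basic connectives built from NAND alone (functional completeness).
def not_(p):    return nand(p, p)
def and_(p, q): return nand(nand(p, q), nand(p, q))
def or_(p, q):  return nand(not_(p), not_(q))

bits = (False, True)
and_ok = all(and_(p, q) == (p and q) for p in bits for q in bits)
or_ok  = all(or_(p, q) == (p or q) for p in bits for q in bits)
not_ok = all(not_(p) == (not p) for p in bits)
```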

From gates to GPUs: GPU cores perform massively parallel Boolean and arithmetic operations. The floating-point units, comparison circuits, and memory addressing logic in a GPU are all Boolean circuits. When you run torch.matmul(A, B), underneath it is billions of transistors implementing Boolean functions.

9.2 SAT Solving

The Boolean satisfiability problem (SAT): given a propositional formula in CNF, is there a truth assignment that makes it true?

DPLL Algorithm (Davis-Putnam-Logemann-Loveland, 1962):

  1. Unit propagation: If a not-yet-satisfied clause has exactly one unassigned literal, that literal must be true
  2. Pure literal elimination: If a variable appears with only one polarity, set it to satisfy those clauses
  3. Branching: Pick an unset variable, try both truth values, recurse
  4. Backtracking: If a contradiction is found, undo the last choice
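
Steps 1, 3, and 4 above fit in a short recursive sketch (pure-literal elimination omitted for brevity; clauses are lists of nonzero ints, with -v standing for \neg v):

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL sketch: unit propagation, branching, backtracking."""
    if assignment is None:
        assignment = {}
    # Simplify under the current assignment: drop satisfied clauses,
    # remove falsified literals from the rest.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None  # empty clause: conflict, backtrack
        simplified.append(rest)
    if not simplified:
        return assignment  # every clause satisfied
    # Unit propagation: a one-literal clause forces its variable.
    for clause in simplified:
        if len(clause) == 1:
            l = clause[0]
            return dpll(clauses, {**assignment, abs(l): l > 0})
    # Branch on the first unset variable, trying both truth values.
    v = abs(simplified[0][0])
    for value in (True, False):
        result = dpll(clauses, {**assignment, v: value})
        if result is not None:
            return result
    return None
```

For example, `dpll([[1], [-1]])` returns `None` (unsatisfiable), while `dpll([[1, 2], [-1, 2], [1, -2]])` returns a satisfying assignment.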

CDCL (Conflict-Driven Clause Learning): Modern SAT solvers extend DPLL with learned clauses - when a conflict occurs, they analyse why and add a clause preventing the same conflict pattern. This is a form of learning from mistakes.

SAT in AI:

  • Constraint satisfaction: Many combinatorial AI problems reduce to SAT
  • Planning: Classical AI planning (SATPLAN) encodes action sequences as SAT instances
  • Formal verification: Proving neural network properties (robustness, fairness) reduces to SAT/SMT
  • Hardware design: Chip verification uses SAT solvers extensively
  • Structured generation: Ensuring LLM output satisfies a grammar can be framed as constraint satisfaction

9.3 Resolution and Automated Reasoning

Resolution is a single inference rule that is complete for refuting CNF formulas:

\frac{(A \vee p) \quad (\neg p \vee B)}{A \vee B}

If one clause contains p and another contains \neg p, their resolvent combines the remaining literals.

Completeness (Robinson, 1965): A set of clauses is unsatisfiable if and only if the empty clause (\bot) can be derived by repeated resolution.

This is the backbone of Prolog and logic programming - computation as proof search.
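
With clauses represented as frozensets of signed integers, the resolution rule is a one-liner; a sketch refuting \{p\}, \{\neg p \vee q\}, \{\neg q\}:

```python
def resolve(c1, c2, p):
    """From (A ∨ p) and (¬p ∨ B), infer the resolvent (A ∨ B)."""
    assert p in c1 and -p in c2
    return (c1 - {p}) | (c2 - {-p})

c1 = frozenset({1})      # p
c2 = frozenset({-1, 2})  # ¬p ∨ q
c3 = frozenset({-2})     # ¬q

step1 = resolve(c1, c2, 1)     # derives q
step2 = resolve(step1, c3, 2)  # derives the empty clause: unsatisfiable
```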

9.4 Type Theory and Programming

Type systems are deductive systems closely related to logic. The Curry-Howard correspondence establishes a deep isomorphism:

| Logic | Type Theory | Programming |
|-------|-------------|-------------|
| Proposition | Type | Specification |
| Proof | Term (program) | Implementation |
| P \to Q (implication) | Function type P \to Q | Function from P to Q |
| P \wedge Q (conjunction) | Product type P \times Q | Pair/tuple |
| P \vee Q (disjunction) | Sum type P + Q | Tagged union / either |
| \forall x\, P(x) | Dependent function \Pi_{x:A} P(x) | Generic/polymorphic function |
| \exists x\, P(x) | Dependent pair \Sigma_{x:A} P(x) | Record with witness |
| \bot (false) | Empty type | Unreachable code |
| Proof of P | Value of type P | Running program |

Consequence: A valid type-checked program is a proof of its type signature. Type errors are logical contradictions.

AI connection - type-safe tensors: Projects like jaxtyping and beartype enforce tensor shape constraints at the type level: Float[Array, "batch seq d_model"]. A shape mismatch is caught as a type error - i.e., a logical contradiction between what the function promises and what it receives.

9.5 Modal Logic (Brief)

Modal logic extends propositional logic with necessity (\square) and possibility (\diamond):

  • \square P: "P is necessarily true" (true in all possible worlds)
  • \diamond P: "P is possibly true" (true in some possible world)

In AI:

  • Epistemic logic: K_a P means "agent a knows P." Used in multi-agent systems.
  • Temporal logic: \square P means "P always holds" (safety); \diamond P means "P eventually holds" (liveness). Used in model checking and formal verification.
  • Deontic logic: What an AI system "should" do - the logic of obligations and permissions, relevant to AI alignment.

9.6 Description Logics and Knowledge Representation

Description logics (DLs) are decidable fragments of FOL designed for knowledge representation:

  • Concepts (unary predicates): Person, Model, LargeLanguageModel
  • Roles (binary predicates): trainedBy, hasParameter, outputs
  • Constructors: C \sqcap D (intersection), C \sqcup D (union), \neg C (complement), \forall R.C (all R-successors are in C), \exists R.C (some R-successor is in C)

DLs underpin the Web Ontology Language (OWL), used in knowledge graphs, biomedical ontologies, and semantic web. The reasoning complexity depends on which constructors are included - a careful balance between expressiveness and decidability.

9.7 LLM Reasoning as Approximate Logic

When an LLM performs chain-of-thought (CoT) reasoning, it is (approximately) performing logical inference:

| Logical step | LLM behaviour | Fidelity |
|--------------|---------------|----------|
| Modus ponens (P, P \to Q \vdash Q) | "Since P and P implies Q, therefore Q" | High for simple cases |
| Universal instantiation (\forall x\, P(x) \vdash P(a)) | "Since all X are Y, this X is Y" | Moderate |
| Contradiction detection | "But this contradicts..." | Fragile |
| Multi-step chains | Chaining 5+ inference steps | Degrades rapidly |

Open research questions:

  • Can LLMs perform reliable deductive reasoning, or only pattern-matched approximation?
  • Can we formally verify LLM reasoning chains against logical rules?
  • Can symbolic logic modules be integrated with neural networks (neuro-symbolic AI)?
  • How does the "chain-of-thought" prompt format affect logical correctness?

Current evidence suggests LLMs are surprisingly good at local logical steps but struggle with long chains and quantifier reasoning (especially nested \forall\exists patterns). This is precisely the gap that formal methods aim to fill.


10. Cardinality and Infinite Sets

Cardinality answers the question: how big is a set? For finite sets the answer is obvious - just count. For infinite sets, the answer is one of the most profound discoveries in mathematics: not all infinities are the same size.

10.1 Finite Cardinality

If A is finite, |A| = n means there exists a bijection f: A \to \{1, 2, \ldots, n\}. Counting is bijection.

Key counting facts for AI:

| Formula | Name | AI Example |
|---------|------|------------|
| \vert A \cup B \vert = \vert A \vert + \vert B \vert - \vert A \cap B \vert | Inclusion-exclusion | Deduplicating token sets |
| \vert A \times B \vert = \vert A \vert \cdot \vert B \vert | Product rule | Size of joint state space |
| \vert \mathcal{P}(A) \vert = 2^{\vert A \vert} | Power set | Number of feature subsets |
| \vert A^B \vert = \vert A \vert^{\vert B \vert} | Exponential | Number of functions from B to A |
| n! = n \cdot (n-1) \cdots 1 | Factorial | Permutation count |
| \binom{n}{k} = \frac{n!}{k!(n-k)!} | Binomial | Ways to choose k items from n |

10.2 Comparing Infinite Sets - Bijections

Definition: Two sets A and B have the same cardinality (|A| = |B|) if and only if there exists a bijection f: A \to B.

This is the only sensible definition for infinite sets - we cannot "count" them, but we can pair their elements one-to-one.

Example - |\mathbb{N}| = |\mathbb{Z}|:

This seems wrong - \mathbb{Z} "has twice as many elements." But pair them:

0 \mapsto 0, \quad 1 \mapsto 1, \quad 2 \mapsto -1, \quad 3 \mapsto 2, \quad 4 \mapsto -2, \quad 5 \mapsto 3, \quad \ldots

Formally: f(n) = \begin{cases} -n/2 & \text{if } n \text{ is even} \\ (n+1)/2 & \text{if } n \text{ is odd} \end{cases}

This is a bijection, so |\mathbb{N}| = |\mathbb{Z}|. Infinity doesn't work like finite numbers.
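
A quick computational check of the pairing listed above (0 \mapsto 0, 1 \mapsto 1, 2 \mapsto -1, \ldots):

```python
def f(n):
    """Send even naturals to 0, -1, -2, ... and odd naturals to 1, 2, 3, ..."""
    return -(n // 2) if n % 2 == 0 else (n + 1) // 2

first_images = [f(n) for n in range(6)]

# The first 2k+1 naturals map bijectively onto {-k, ..., k}.
k = 100
image = {f(n) for n in range(2 * k + 1)}
```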

10.3 Countable Sets

A set A is countably infinite if |A| = |\mathbb{N}| - i.e., there exists a bijection A \to \mathbb{N}. A set is countable if it is finite or countably infinite.

Equivalently, A is countable iff its elements can be listed as a sequence a_1, a_2, a_3, \ldots (possibly terminating).

Countable sets include:

| Set | Why countable |
|-----|---------------|
| \mathbb{N} | By definition |
| \mathbb{Z} | Bijection shown above |
| \mathbb{Q} | Cantor's zig-zag enumeration of the \frac{p}{q} table |
| \mathbb{N} \times \mathbb{N} | Cantor pairing function: (m, n) \mapsto \frac{(m+n)(m+n+1)}{2} + m |
| All finite strings over a finite alphabet \Sigma | \Sigma^* = \bigcup_{n=0}^\infty \Sigma^n, a countable union of finite sets |
| All programs (source code) | Programs are finite strings |
| All rational polynomials | Countable coefficients, finite degree |
| Algebraic numbers | Roots of polynomials with integer coefficients |
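
The Cantor pairing function is easy to verify on a finite grid; a sketch:

```python
def cantor_pair(m, n):
    """Cantor pairing (m, n) -> (m+n)(m+n+1)/2 + m, a bijection N x N -> N."""
    return (m + n) * (m + n + 1) // 2 + m

# Distinct pairs always get distinct codes (injectivity on a finite grid);
# on all of N x N the map is in fact a bijection.
codes = {cantor_pair(m, n) for m in range(50) for n in range(50)}
```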

Key theorem: A countable union of countable sets is countable:

A_1, A_2, A_3, \ldots \text{ countable} \implies \bigcup_{i=1}^\infty A_i \text{ countable}

AI consequence: The set of all possible tokenised prompts (finite strings over a finite vocabulary V) is countable. There are \aleph_0 possible inputs to an LLM.

10.4 Uncountable Sets - Cantor's Diagonal Argument

Theorem (Cantor, 1891). \mathbb{R} is uncountable - there is no bijection \mathbb{N} \to \mathbb{R}.

Proof (diagonal argument). We show that even the interval (0, 1) is uncountable. Assume, for contradiction, that (0, 1) is countable. Then we can list all real numbers in (0, 1):

r_1 = 0.d_{11}d_{12}d_{13}d_{14}\ldots
r_2 = 0.d_{21}d_{22}d_{23}d_{24}\ldots
r_3 = 0.d_{31}d_{32}d_{33}d_{34}\ldots
\vdots

Construct r^* = 0.d_1^* d_2^* d_3^* \ldots where d_n^* is any digit \neq d_{nn} (and \neq 0, 9 to avoid alternative decimal representations).

Then r^* \in (0, 1) but r^* \neq r_n for every n (they differ at the n-th digit). This contradicts the assumption that all reals in (0, 1) were listed. \square

The notation |\mathbb{N}| = \aleph_0 ("aleph-null") denotes the first infinite cardinal. The set \mathbb{R} has cardinality |\mathbb{R}| = 2^{\aleph_0} = \mathfrak{c} (the cardinality of the continuum), which is strictly larger.

10.5 Cantor's Theorem (Generalised)

Theorem. For any set A: |A| < |\mathcal{P}(A)|.

No set is as large as its power set. This holds for every set, finite or infinite.

Proof sketch. The map a \mapsto \{a\} injects A into \mathcal{P}(A), so |A| \leq |\mathcal{P}(A)|. To show strict inequality, suppose f: A \to \mathcal{P}(A) is a bijection. Let D = \{a \in A \mid a \notin f(a)\}. Then D \in \mathcal{P}(A), so D = f(d) for some d (by surjectivity). Is d \in D?

  • If d \in D, then d \notin f(d) = D - contradiction.
  • If d \notin D, then d \in f(d) = D - contradiction.

Either way, contradiction. So no such bijection exists. \square
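
For a tiny set, the diagonal construction can be checked exhaustively: every map f: A \to \mathcal{P}(A) misses its diagonal set D. A sketch:

```python
from itertools import combinations, product

def powerset(xs):
    xs = list(xs)
    return [frozenset(c)
            for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

def diagonal(A, f):
    """Cantor's D = {a in A | a not in f(a)} -- never in the image of f."""
    return frozenset(a for a in A if a not in f[a])

# Exhaustive check for A = {0, 1}: for every map f: A -> P(A),
# D escapes the image, so no f is surjective and |A| < |P(A)|.
A = [0, 1]
subs = powerset(A)
all_escape = all(
    diagonal(A, dict(zip(A, images))) not in images
    for images in product(subs, repeat=len(A))
)
```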

Consequence - the infinite tower:

|\mathbb{N}| < |\mathcal{P}(\mathbb{N})| < |\mathcal{P}(\mathcal{P}(\mathbb{N}))| < \cdots

There is no largest infinity. The hierarchy of infinities is itself infinite (and larger than any infinity in the hierarchy).

10.6 The Continuum Hypothesis

Claim (Cantor, 1878): There is no set S with |\mathbb{N}| < |S| < |\mathbb{R}|.

In other words, there is no "intermediate" infinity between the countable and the continuum.

Resolution (Gödel 1940, Cohen 1963): The Continuum Hypothesis is independent of ZFC - it can neither be proved nor disproved from the standard axioms. Both ZFC + CH and ZFC + \negCH are consistent (if ZFC is).

This is one of the most surprising results in mathematics: there are meaningful mathematical questions that our axioms simply cannot answer.

10.7 Cardinality and AI

| Object | Cardinality | Consequence |
|--------|-------------|-------------|
| Finite vocabulary V | \vert V \vert (finite, e.g., 32,000) | Discrete, tractable |
| All prompts V^* | \aleph_0 (countable) | Enumerable in principle |
| All real-valued weight vectors \mathbb{R}^d | \mathfrak{c} | Uncountable - can't enumerate models |
| All functions \mathbb{R}^d \to \mathbb{R} | \mathfrak{c}^{\mathfrak{c}} > \mathfrak{c} | Vastly larger than the set of computable functions |
| All computable functions | \aleph_0 | Programs are countable strings |

The gap: There are uncountably many functions but only countably many programs (and neural network architectures, at any fixed precision). Most functions are not computable - and most mathematical objects are not representable by any model. This is a fundamental limitation.


11. Logic and Set Theory in Machine Learning

This section shows how the abstract tools of 2-10 appear concretely in modern ML.

11.1 Formal Languages and Automata

Formal language theory is set theory applied to strings:

  • An alphabet \Sigma is a finite set of symbols
  • A string over \Sigma is a finite sequence of symbols from \Sigma
  • A language L is a set of strings: L \subseteq \Sigma^*

Connection to AI:

| Concept | ML Instantiation |
|---------|------------------|
| Alphabet \Sigma | Vocabulary V (set of tokens) |
| String | Token sequence (prompt or completion) |
| Language L | Set of valid outputs |
| Grammar | Rules defining valid outputs (JSON schema, Python syntax) |
| Automaton (DFA/NFA) | Constrained decoding state machine |
| Regular language | Languages recognisable by finite automata |
| Context-free language | Languages parseable by pushdown automata |

Structured generation constrains an LLM to produce only strings from a target language L. At each decoding step, the valid next tokens form a set V_{\text{valid}} \subseteq V, computed by the automaton/parser. Tokens outside V_{\text{valid}} are masked (logits set to -\infty). This is Boolean logic applied to decoding.
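
A minimal sketch of one masked decoding step, with a toy vocabulary, made-up logits, and a hand-supplied valid-token set (in a real system the automaton or parser computes it):

```python
import math

vocab = ["{", "}", '"key"', ":", "42", "hello"]
logits = [2.0, 0.5, 1.0, 0.3, -1.0, 3.0]  # illustrative raw scores

valid_next = {"{", '"key"'}  # tokens the grammar allows in this state

# Tokens outside the valid set get logit -inf, so they can never be sampled.
masked = [
    logit if token in valid_next else -math.inf
    for token, logit in zip(vocab, logits)
]
best = vocab[max(range(len(vocab)), key=lambda i: masked[i])]
```

Even though "hello" has the highest raw logit, masking forces the choice into the valid set.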

11.2 Sets in Probability

The entire framework of probability theory is built on set theory (Kolmogorov axioms):

| Probability concept | Set-theoretic object |
|---------------------|----------------------|
| Sample space \Omega | Set of all possible outcomes |
| Event A | Subset of \Omega (A \subseteq \Omega) |
| P(A \cup B) | Probability of A or B |
| P(A \cap B) | Probability of A and B |
| P(A^c) | Probability of not A; equals 1 - P(A) |
| \sigma-algebra \mathcal{F} | Collection of measurable subsets of \Omega |
| Independence: P(A \cap B) = P(A)P(B) | Multiplicativity under intersection |

Measure theory (the rigorous foundation of probability) relies heavily on ZFC: \sigma-algebras are sets of sets, measures are set functions, and the Axiom of Choice is needed for results like the Vitali set (a non-measurable set).

11.3 Sets in Linear Algebra

| Linear algebra concept | Set structure |
|------------------------|---------------|
| Vector space V | Set with addition and scalar multiplication |
| Subspace W \leq V | Subset that is itself a vector space |
| Span | \text{span}(S) = \{\text{all linear combinations of } S\} - a set |
| Basis | Linearly independent spanning set |
| Eigenspace | E_\lambda = \{v \mid Av = \lambda v\} - defined by set-builder notation |
| Column space | \text{Col}(A) = \{Ax \mid x \in \mathbb{R}^n\} - the image of A (a set) |
| Null space | \text{Null}(A) = \{x \mid Ax = 0\} - the kernel of A (a set) |

11.4 Logic in Training

Loss functions encode logical conditions:

| Condition | Logic | Loss |
|-----------|-------|------|
| Output equals target | Equality | MSE: \Vert f_\theta(x) - y \Vert^2 |
| Output matches distribution | P_\theta \approx P_{\text{data}} | KL divergence, cross-entropy |
| Margin condition | y_i (\mathbf{w} \cdot \mathbf{x}_i) \geq 1 | Hinge loss: \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i)) |
| Constraint satisfaction | g(\theta) \leq 0 | Lagrangian: \mathcal{L} + \lambda g(\theta) |

Training as logic: gradient descent searches for \theta satisfying \nabla \mathcal{L}(\theta) = 0 - this is an existential claim: \exists \theta^*\, \nabla \mathcal{L}(\theta^*) = 0.

Early stopping as logic: "Stop when the validation loss has not decreased for k epochs" is a quantified condition: \forall t \in [T, T+k],\, \mathcal{L}_{\text{val}}^{(t)} \geq \mathcal{L}_{\text{val}}^{(T)}.
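
One reasonable reading of that condition as code, a sketch (real trainers usually track a patience counter incrementally instead):

```python
def should_stop(val_losses, k):
    """Stop iff the last k epochs all failed to improve on the best
    validation loss seen before them -- the quantified condition above."""
    if len(val_losses) <= k:
        return False
    best_before = min(val_losses[:-k])
    return all(loss >= best_before for loss in val_losses[-k:])
```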

11.5 Sets in Model Architecture

Transformer attention as set operations:

The key insight of attention is that it operates on a set of key-value pairs (modulo positional encoding). Self-attention is permutation-equivariant: it treats the input as a set, not a sequence.

  • Query: "What am I looking for?"
  • Key set: K = \{k_1, k_2, \ldots, k_n\} - the set of keys to match against
  • Value set: V = \{v_1, v_2, \ldots, v_n\} - the set of values to aggregate
  • Attention mask: M \subseteq [n] \times [n] - the set of allowed (query, key) pairs

Combining masks:

  • Causal AND local: M_{\text{causal}} \cap M_{\text{local}} - intersection (Boolean AND)
  • Multi-head union: \bigcup_h M_h - each head sees a different subset of positions
  • Padding mask: M_{\text{pad}} = \{(i, j) \mid j \text{ is not a padding token}\}
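
Mask combination really is set intersection; a small sketch for a toy sequence length n = 4 and a window of size 1:

```python
n = 4  # toy sequence length
positions = [(i, j) for i in range(n) for j in range(n)]

causal = {(i, j) for i, j in positions if j <= i}          # no looking ahead
local = {(i, j) for i, j in positions if abs(i - j) <= 1}  # window of size 1

combined = causal & local  # Boolean AND: a pair is allowed iff both masks allow it
```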

11.6 Formal Verification of ML Systems

Logic provides tools to prove properties of ML systems rather than merely test them:

| Property | Formal statement | Method |
|----------|------------------|--------|
| Robustness | \forall x',\, \Vert x' - x \Vert < \varepsilon \to \text{class}(x') = \text{class}(x) | SMT solving, MILP |
| Fairness | P(\hat{y} \mid A=0) = P(\hat{y} \mid A=1) | Statistical testing + formal constraints |
| Safety | \forall \text{input},\, \text{output} \in \text{SafeSet} | Reachability analysis |
| Monotonicity | x_i \leq x_i' \to f(x) \leq f(x') | Lattice-based verification |

Certified defences use logical proofs (often via abstract interpretation or interval arithmetic) to guarantee that no adversarial perturbation within an \varepsilon-ball can change the classification. This is the gold standard for ML safety - moving from empirical robustness to provable robustness.


12. Advanced Topics

12.1 Gödel's Incompleteness Theorems

The most profound results in mathematical logic, with deep implications for AI.

First Incompleteness Theorem (Gödel, 1931): Any consistent formal system F capable of expressing basic arithmetic contains a statement G such that:

  • G is true (in the standard model of arithmetic)
  • G is not provable in F
  • \neg G is not provable in F

In other words, there are true mathematical statements that cannot be proved. No axiom system - ZFC, or any extension - can be both consistent and complete for arithmetic.

Gödel's self-referential trick: Gödel constructs G to say, essentially, "I am not provable in system F." If G were provable, it would be false (since it says it is unprovable), contradicting consistency. If \neg G were provable, the system would be proving a falsehood - G really is unprovable, hence true - again a contradiction (strictly, this direction needs a slightly stronger assumption than plain consistency). So neither G nor \neg G is provable.

Second Incompleteness Theorem: No consistent system F capable of expressing arithmetic can prove its own consistency.

F \text{ consistent} \implies F \not\vdash \text{Con}(F)

AI implications:

  • No AI system (which is a formal system) can verify all true mathematical statements - there will always be truths beyond its reach
  • The claim "this AI is guaranteed safe" cannot be proved within the AI's own framework (a version of the second theorem)
  • Gödel's theorems do NOT mean mathematics is broken - they mean it is inexhaustible

12.2 The Halting Problem

Theorem (Turing, 1936). There is no algorithm that can determine, for every program P and input I, whether P halts on I.

Proof (by contradiction). Suppose H(P, I) is a program that returns "halts" or "loops" for any P, I. Define:

D(P) = \begin{cases} \text{loop forever} & \text{if } H(P, P) = \text{"halts"} \\ \text{halt} & \text{if } H(P, P) = \text{"loops"} \end{cases}

What does D(D) do?

  • If H(D, D) = "halts" -> D(D) loops forever. But H said it halts. Contradiction.
  • If H(D, D) = "loops" -> D(D) halts. But H said it loops. Contradiction.

So H cannot exist. \square

Connections to AI:

  • This is the computation analogue of Russell's Paradox - self-reference plus negation creates paradox
  • You cannot write a "universal bug detector" that catches all infinite loops
  • Theorem provers and code verifiers must be incomplete (they may time out or return "unknown")
  • LLM code generation cannot guarantee termination of generated code

12.3 Zorn's Lemma

Zorn's Lemma (equivalent to the Axiom of Choice and the Well-Ordering Theorem):

If every chain in a partially ordered set (P, \leq) has an upper bound in P, then P has a maximal element.

A chain is a totally ordered subset; a maximal element m is one with no element strictly above it (\nexists x \in P,\, x > m).

Applications in mathematics:

  • Every vector space has a basis (even infinite-dimensional ones)
  • Every ideal in a ring is contained in a maximal ideal
  • Every filter can be extended to an ultrafilter

AI relevance: Zorn's Lemma is used in the theoretical foundations of kernel methods (Hilbert space bases) and functional analysis (Hahn-Banach theorem), which underlies support vector machines and reproducing kernel Hilbert spaces.

12.4 Fuzzy Logic

Classical logic is crisp: T or F. Fuzzy logic (Zadeh, 1965) allows truth values in [0, 1]:

\mu_A(x) \in [0, 1]

where \mu_A(x) is the degree of membership of x in the fuzzy set A.

Fuzzy operations:

| Classical | Fuzzy |
|-----------|-------|
| A \cap B | \mu_{A \cap B}(x) = \min(\mu_A(x), \mu_B(x)) |
| A \cup B | \mu_{A \cup B}(x) = \max(\mu_A(x), \mu_B(x)) |
| \neg A | \mu_{\neg A}(x) = 1 - \mu_A(x) |
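
A minimal sketch of the Zadeh operators; on crisp values \{0, 1\} they reduce to classical logic:

```python
def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)
def f_not(a):    return 1 - a

# On crisp values {0, 1} the fuzzy operations agree with the classical ones.
crisp = (0, 1)
matches_classical = all(
    f_and(p, q) == (p and q) and f_or(p, q) == (p or q)
    for p in crisp for q in crisp
)
```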

AI connections:

  • Softmax outputs are like fuzzy membership values - a token has a degree of membership in each class
  • Attention weights \alpha_{ij} \in [0, 1] are fuzzy: not binary "attend or not" but "how much to attend"
  • Fuzzy control systems (early AI) used fuzzy rules: "IF temperature IS high THEN fan IS fast"
  • Modern neural networks have largely replaced fuzzy systems, but the conceptual framework persists in the soft/differentiable operations that make gradient descent possible

12.5 Second-Order Logic (Brief)

First-order logic quantifies over objects: \forall x, \exists y, \ldots

Second-order logic quantifies over properties (sets, predicates, functions): \forall P, \exists f, \ldots

Example - completeness of \mathbb{R}:

  • First-order: CANNOT express "every bounded set has a least upper bound" (this needs quantification over sets)
  • Second-order: \forall S \subseteq \mathbb{R}\, [S \neq \emptyset \wedge \exists M\, \forall x \in S\, (x \leq M) \to \exists \sup S]

Trade-off:

  • First-order logic: semi-decidable (complete proof systems exist), but less expressive
  • Second-order logic: more expressive, but no complete proof system exists (incompleteness is "worse")

In AI, first-order logic is preferred because it has better computational properties. Description logics, Datalog, and Prolog all work within FOL or its fragments.


13. Common Mistakes and Misconceptions

#MistakeWhy it's wrongCorrect version
| # | Mistake | Why It's Wrong | Correct Understanding |
|---|---------|----------------|-----------------------|
| 1 | Confusing $\in$ and $\subseteq$ | "$3 \in \{1,2,3\}$" (membership) vs "$\{3\} \subseteq \{1,2,3\}$" (subset) | $a \in A$ = element belongs; $B \subseteq A$ = every element of $B$ is in $A$ |
| 2 | $\{1,2\} \in \{1,2,3\}$ | The set $\{1,2\}$ is not an element of $\{1,2,3\}$; its elements are $1$, $2$, $3$ | $\{1,2\} \subseteq \{1,2,3\}$ OK, but $\{1,2\} \in \{1,2,3\}$ NO |
| 3 | $\emptyset = \{\emptyset\}$ | $\emptyset$ has 0 elements; $\{\emptyset\}$ has 1 element (namely $\emptyset$) | $\emptyset \neq \{\emptyset\}$; $\vert \emptyset \vert = 0$, $\vert \{\emptyset\} \vert = 1$ |
| 4 | "A universal set exists" | In ZFC there is no set of all sets (it leads to Russell's paradox) | Work relative to a fixed domain $U$ (a set large enough for your purpose) |
| 5 | Confusing $\to$ and $\leftrightarrow$ | $p \to q$ does NOT mean $q \to p$ (the converse) | Implication is one-directional; the biconditional is both |
| 6 | "Vacuous truth is a bug" | $F \to Q$ is always $T$ - this is a feature, not a bug | Vacuous truth prevents edge-case contradictions; it is why $\emptyset \subseteq A$ for all $A$ |
| 7 | Converse = contrapositive | $q \to p$ (converse) $\neq$ $\neg q \to \neg p$ (contrapositive) | Contrapositive $\equiv$ original; the converse is NOT equivalent |
| 8 | Swapping $\forall$ and $\exists$ | $\forall x \, \exists y \, P(x,y) \neq \exists y \, \forall x \, P(x,y)$ | Quantifier order matters; only $\exists y \forall x \Rightarrow \forall x \exists y$ (not the reverse) |
| 9 | "Countable = small" | $\mathbb{Q}$ is countable but dense in $\mathbb{R}$; countable sets can be structurally complex | Countable means "no bigger than $\mathbb{N}$" - still infinite |
| 10 | Forgetting the induction hypothesis | Claiming $P(k+1)$ without using $P(k)$ | The inductive step MUST use the hypothesis $P(k)$ (or $P(1), \ldots, P(k)$ for strong induction) |
| 11 | Using induction without a base case | Proving only the inductive step | Without a base case the chain has no starting point; the "proof" proves nothing |
| 12 | Confusing "not proven" with "false" | Gödel: some true statements are unprovable | Unprovability $\neq$ falsity; independence results are about axiom limitations |
| 13 | $A - B = B - A$ | Set difference is NOT symmetric (unlike symmetric difference) | $\{1,2,3\} - \{2,3,4\} = \{1\}$; $\{2,3,4\} - \{1,2,3\} = \{4\}$ |
| 14 | $\vert \mathcal{P}(A) \vert = \vert A \vert$ for infinite $A$ | Cantor's theorem: $\vert A \vert < \vert \mathcal{P}(A) \vert$ always, even for infinite sets | There is no bijection between $A$ and $\mathcal{P}(A)$ - ever |
| 15 | Treating attention as symmetric | $\text{Attends}(i,j)$ does NOT imply $\text{Attends}(j,i)$; attention is a directed relation | Attention is asymmetric: the query-to-key direction matters |
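Several of these pitfalls can be checked mechanically with Python's built-in `set` and `frozenset` types; a quick sketch (`<=` is Python's subset operator, and `frozenset` is used because only hashable objects can be tested for membership):

```python
A = {1, 2, 3}

# Mistakes 1-2: membership vs subset
assert 3 in A                          # 3 is an element of A
assert {1, 2} <= A                     # {1, 2} is a subset of A
assert frozenset({1, 2}) not in A      # ...but {1, 2} is not an element of A

# Mistake 3: the empty set vs the set containing the empty set
empty = frozenset()
assert len(empty) == 0 and len({empty}) == 1

# Mistake 13: set difference is not symmetric
assert {1, 2, 3} - {2, 3, 4} == {1}
assert {2, 3, 4} - {1, 2, 3} == {4}
```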

14. Exercises

Exercise 1: Set Operations (Warm-up)

Given $A = \{1, 2, 3, 4, 5\}$, $B = \{3, 4, 5, 6, 7\}$, $C = \{5, 6, 7, 8, 9\}$, compute:

  1. $A \cup B$
  2. $A \cap B \cap C$
  3. $A - B$
  4. $A \triangle B$ (symmetric difference)
  5. $(A \cup B) \cap C$
  6. $A \cap (B \cup C)$
  7. Verify De Morgan's law: $(A \cup B)^c = A^c \cap B^c$ (with $U = \{1,\ldots,9\}$)
  8. $\mathcal{P}(\{1,2,3\})$ - list all elements
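Python's set operators mirror this notation almost symbol for symbol, so you can verify your hand computations (a sketch; `^` is symmetric difference, and `U - X` plays the role of the complement $X^c$ relative to the domain):

```python
from itertools import combinations

A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}
C = {5, 6, 7, 8, 9}
U = set(range(1, 10))          # the domain {1, ..., 9}

results = [A | B, A & B & C, A - B, A ^ B, (A | B) & C, A & (B | C)]
for r in results:
    print(sorted(r))           # compare against your answers to items 1-6

# Item 7: De Morgan's law relative to U
assert U - (A | B) == (U - A) & (U - B)

# Item 8: power set of {1, 2, 3} has 2^3 = 8 elements
S = [1, 2, 3]
powerset = [set(c) for k in range(len(S) + 1) for c in combinations(S, k)]
assert len(powerset) == 8
```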

Exercise 2: Relations

Let $A = \{1, 2, 3, 4\}$ and $R = \{(1,1), (1,2), (2,1), (2,2), (3,3), (4,4), (3,4), (4,3)\}$.

  1. Is $R$ reflexive? Symmetric? Transitive?
  2. Is $R$ an equivalence relation? If yes, find the equivalence classes.
  3. Draw the relation as a directed graph.
  4. Write $R$ as a Boolean matrix.
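A relation on a finite set is just a set of pairs, so its properties can be checked by exhaustive search; a sketch for verifying your answers to items 1-2:

```python
A = {1, 2, 3, 4}
R = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (4, 4), (3, 4), (4, 3)}

reflexive  = all((a, a) in R for a in A)
symmetric  = all((b, a) in R for (a, b) in R)
transitive = all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

print(reflexive, symmetric, transitive)
if reflexive and symmetric and transitive:
    # Equivalence classes: [a] = {b : (a, b) in R}
    classes = {frozenset(b for b in A if (a, b) in R) for a in A}
    print(classes)
```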

Exercise 3: Partial Orders

Consider the divisibility relation $\mid$ on the set $S = \{1, 2, 3, 4, 6, 12\}$ ($a \mid b$ iff $a$ divides $b$).

  1. Verify that $\mid$ is a partial order on $S$ (check reflexivity, antisymmetry, transitivity).
  2. Draw the Hasse diagram.
  3. Find all maximal elements, minimal elements, greatest element, least element.
  4. Is $(S, \mid)$ a total order? Why or why not?
  5. Find all chains of maximum length.
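The partial-order axioms can likewise be brute-forced on this six-element set; a sketch for items 1 and 4:

```python
S = [1, 2, 3, 4, 6, 12]
R = {(a, b) for a in S for b in S if b % a == 0}      # (a, b) in R  iff  a | b

reflexive     = all((a, a) in R for a in S)
antisymmetric = all(a == b for (a, b) in R if (b, a) in R)
transitive    = all((a, c) in R for (a, b) in R for (b2, c) in R if b == b2)
total         = all((a, b) in R or (b, a) in R for a in S for b in S)

print(reflexive, antisymmetric, transitive)   # all three hold for a partial order
print(total)                                  # a total order needs every pair comparable
```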

Exercise 4: Propositional Logic

  1. Construct truth tables for:
    • $(p \to q) \wedge (q \to r) \to (p \to r)$ - is this a tautology?
    • $(p \wedge q) \to p$ - simplification law
    • $\neg(p \to q) \leftrightarrow (p \wedge \neg q)$
  2. Prove or disprove: $(p \to q) \equiv (\neg q \to \neg p)$ (using truth tables).
  3. Convert to CNF: $p \to (q \vee r)$.
  4. Convert to DNF: $\neg(p \wedge q) \vee r$.
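Truth tables over a handful of variables are easy to enumerate with `itertools.product`; a sketch that checks the first formula in item 1 and the law in item 2:

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

# Hypothetical syllogism: ((p -> q) and (q -> r)) -> (p -> r)
assert all(
    implies(implies(p, q) and implies(q, r), implies(p, r))
    for p, q, r in product([False, True], repeat=3)
)  # true in all 8 rows, hence a tautology

# Contrapositive law: (p -> q) has the same truth table as (~q -> ~p)
assert all(
    implies(p, q) == implies(not q, not p)
    for p, q in product([False, True], repeat=2)
)
```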

Exercise 5: Predicate Logic

Translate to FOL (define your predicates):

  1. "Every transformer has at least one attention head."
  2. "No model with zero parameters can learn."
  3. "There exists a learning rate that makes every batch converge."
  4. "For every ε>0\varepsilon > 0, there exists NN such that for all n>Nn > N, anL<ε|a_n - L| < \varepsilon."
  5. Negate statement (3) formally and interpret in English.
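Over a finite domain, `all` and `any` play the roles of $\forall$ and $\exists$, which makes the order-sensitivity in these translations concrete; a toy sketch with an invented predicate $P(x, y) \equiv x + y = 4$:

```python
X = range(5)

def P(x, y):
    return x + y == 4     # toy predicate, chosen only for illustration

forall_exists = all(any(P(x, y) for y in X) for x in X)   # ∀x ∃y P(x, y)
exists_forall = any(all(P(x, y) for x in X) for y in X)   # ∃y ∀x P(x, y)
print(forall_exists, exists_forall)   # True False: the two quantifier orders disagree
```

For every $x$ there is a matching $y$ (namely $4 - x$), but no single $y$ works for all $x$ at once - exactly the asymmetry behind mistake 8 in the table above.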

Exercise 6: Proof Practice

Prove each of the following using the specified technique:

  1. Direct proof: If $n$ is odd, then $n^2$ is odd.
  2. Contrapositive: If $n^2$ is divisible by 3, then $n$ is divisible by 3.
  3. Contradiction: $\log_2 3$ is irrational.
  4. Induction: For all $n \geq 1$: $\sum_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}$.
  5. Cases: For all integers $n$: $n^3 - n$ is divisible by 6.
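A numerical spot-check is not a proof, but it is a cheap sanity test before you attempt one; a sketch for items 4 and 5:

```python
for n in range(1, 100):
    # Item 4: sum-of-squares formula
    assert sum(i * i for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6
    # Item 5: n^3 - n is divisible by 6 (check negative n as well)
    assert (n ** 3 - n) % 6 == 0 and ((-n) ** 3 - (-n)) % 6 == 0
print("formulas hold for n = 1..99")
```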

Exercise 7: Cardinality

  1. Prove that $|(0, 1)| = |\mathbb{R}|$ by constructing an explicit bijection.
  2. Prove that $|\mathbb{N} \times \mathbb{N}| = |\mathbb{N}|$ using the Cantor pairing function.
  3. Explain why the set of all Python programs is countable.
  4. Explain why the set of all functions $\mathbb{N} \to \{0, 1\}$ is uncountable. (Hint: relate to $\mathcal{P}(\mathbb{N})$.)
  5. A vocabulary has $|V| = 32{,}000$ tokens. How many distinct prompts of length exactly $n$ exist? Of length at most $n$?
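For item 2, the Cantor pairing function has a short closed form; the sketch below checks injectivity on a finite grid and also counts prompts for item 5 (assuming prompts have length at least 1):

```python
def cantor_pair(m, n):
    """The Cantor pairing function: a bijection N x N -> N."""
    return (m + n) * (m + n + 1) // 2 + n

# Injectivity on a finite grid: 10,000 pairs give 10,000 distinct codes
codes = {cantor_pair(m, n) for m in range(100) for n in range(100)}
assert len(codes) == 100 * 100

# Item 5: |V|^n prompts of length exactly n; sum over k for "at most n"
V, n = 32_000, 3
exactly_n = V ** n
at_most_n = sum(V ** k for k in range(1, n + 1))   # add k = 0 to count the empty prompt
```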

Exercise 8: AI Applications

  1. Attention mask logic: A sliding window attention mask allows query $i$ to attend to key $j$ iff $|i - j| \leq w$. A causal mask allows $i$ to attend to $j$ iff $j \leq i$. Write the combined mask as a set: $M = \{(i, j) \mid \ldots\}$. Is $M$ an equivalence relation? A partial order?

  2. Tokeniser as equivalence: Define an equivalence relation on $\{a, b, c\}^*$ where two strings are equivalent iff they map to the same token sequence under the BPE merge $ab \to X$. List the equivalence classes (up to length 3).

  3. Loss convergence as FOL: Write the formal statement "SGD converges to a critical point" using quantifiers. Then write its negation and interpret what it means for training to fail.

  4. Function composition: A model has: embedding $E: V \to \mathbb{R}^{768}$, encoder $T: \mathbb{R}^{n \times 768} \to \mathbb{R}^{n \times 768}$ (12 layers), and head $H: \mathbb{R}^{768} \to \mathbb{R}^{|V|}$. Write the full model as a composition. Is the composition associative? Does the order matter?
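Item 1's combined mask can be written directly as a set comprehension and probed for relation properties; a toy sketch with invented sizes `n = 8` and `w = 2`:

```python
n, w = 8, 2   # sequence length and window size (toy values)

# Combined mask: query i may attend to key j iff |i - j| <= w AND j <= i
M = {(i, j) for i in range(n) for j in range(n) if abs(i - j) <= w and j <= i}

reflexive     = all((i, i) in M for i in range(n))
symmetric     = all((j, i) in M for (i, j) in M)
antisymmetric = all(i == j for (i, j) in M if (j, i) in M)
print(reflexive, symmetric, antisymmetric)
# Reflexive and antisymmetric but not symmetric, so not an equivalence relation.
# Transitivity also fails for small w: (4, 2) and (2, 0) are in M but (4, 0) is not,
# so M is not a partial order either.
```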


15. Why This Matters for AI

15.1 Summary of Connections

| Topic in This Chapter | Where It Appears in AI/ML | Chapter Reference |
|---|---|---|
| Sets, subsets, operations | Data partitioning (train/val/test), vocabularies, token sets | 2-3 |
| Relations | Attention masks, knowledge graphs, dependency parsing | 4 |
| Equivalence relations | Tokenisation, clustering, quotient spaces | 4.3 |
| Partial orders | Computation graphs, dependency ordering, beam search | 4.4 |
| Functions | Neural network layers, loss functions, activations | 4.5 |
| Composition | Deep networks as function chains, backpropagation | 4.6 |
| Propositional logic | Boolean masks, circuit design, SAT solving | 5 |
| Truth tables, CNF | Constraint satisfaction, structured generation | 5.2, 5.6 |
| Predicate logic (FOL) | Formal specifications, knowledge representation | 6 |
| Quantifiers | Convergence definitions, generalization bounds | 6.2-6.5 |
| Proof techniques | Understanding theoretical papers, convergence proofs | 7 |
| Induction | Proofs about recursive algorithms and sequence models | 7.6 |
| ZFC axioms | Why mathematics is consistent; foundational security | 8 |
| Axiom of Choice | Existence of bases, minima in non-compact spaces | 8.3 |
| Cardinality | Expressiveness limits, countability of programs | 10 |
| Diagonal argument | Impossibility results, undecidability | 10.4, 12.2 |
| Gödel's theorems | Limits of formal verification and AI reasoning | 12.1 |
| Formal verification | Provable robustness, safety guarantees | 11.6 |

16. Conceptual Bridge

16.1 What to Study Next

Now that you have the foundations of sets and logic, the natural next steps are:

  1. Functions and Mappings (Chapter 03): Deepens the treatment of functions from 4.5 - injectivity, surjectivity, inverse functions, and their role in defining neural network layers precisely.

  2. Linear Algebra (Chapters 02-03): Sets and logic provide the language for defining vector spaces, subspaces, and linear maps rigorously. Every "for all" and "there exists" in a linear algebra textbook uses the predicate logic from 6.

  3. Calculus (Chapters 04-05): The $\varepsilon$-$\delta$ definition of continuity ($\forall \varepsilon > 0, \exists \delta > 0, \forall x, |x - a| < \delta \to |f(x) - f(a)| < \varepsilon$) is a nested quantifier statement from 6.5. Optimisation theory builds on this.

  4. Probability Theory (Chapters 06-07): Built entirely on measure theory, which is set theory on steroids. Events are sets, probability is a set function, and conditional probability uses set intersection.

  5. Information Theory (Chapter 09): Cross-entropy and KL divergence are defined using sets (probability distributions), logarithms (built from real numbers, which are built from sets), and summations (which are disguised quantifiers).

16.2 The Big Picture

     +-------------------------------------------------+
     |                  THIS CHAPTER                   |
     |                                                 |
     |    Sets ---- Logic ---- Proofs ---- Axioms      |
     |     |          |          |           |         |
     |     v          v          v           v         |
     |   Data       Masks      Theory   Foundations    |
     |  structures  & rules    papers     of math      |
     |     |          |          |           |         |
     |     +----------+----------+-----------+         |
     |                      |                          |
     |                      v                          |
     |             +-----------------+                 |
     |             |   EVERYTHING    |                 |
     |             |   IN MATH/CS    |                 |
     |             |   BUILDS HERE   |                 |
     |             +-----------------+                 |
     +-------------------------------------------------+

Sets and logic are the lingua franca of mathematics. Every mathematical definition, theorem, and proof uses the language developed in this chapter. When you encounter a statement like:

"For every ε>0\varepsilon > 0, there exists a neural network NN with at most n(ε)n(\varepsilon) neurons such that Nf<ε\|N - f\|_\infty < \varepsilon"

You now know:

  • $\forall \varepsilon > 0$ - universal quantifier (6.2)
  • $\exists N$ - existential quantifier (6.2)
  • $\|N - f\|_\infty < \varepsilon$ - a predicate (6.1)
  • The whole statement is a nested $\forall\exists$ claim (6.5)
  • It can be proved by construction (7.8) or contradiction (7.4)
  • The set of all such networks has a specific cardinality (10)

You can read mathematics now. The rest of this course builds on this foundation.