Sets and Logic: Part 1: Intuition to 4. Relations
1. Intuition
1.1 What Are Sets and Logic?
Set theory is the mathematical language for talking about collections of objects; logic is the mathematical language for talking about truth and valid reasoning. Together they form the foundation of all mathematics: every other mathematical structure - numbers, functions, probability, linear algebra - is built on sets and governed by logic.
A set answers the question: what objects belong to this category?
Logic answers the question: given what we know, what can we validly conclude?
These are not abstract curiosities. When you tokenise a sentence, the vocabulary is a set. When you compute attention, the mask selects a subset of positions. When you train a model, the loss function maps from a set of parameters to real numbers. When you evaluate whether a model's output is "correct", you're applying logical predicates. Sets and logic are not prerequisites you study once and forget - they are the operational language of every equation in this curriculum.
For AI and LLMs specifically:
- Sets formalise vocabularies, token sequences, and model outputs
- Logic underpins reasoning, constraint satisfaction, knowledge representation, and formal verification of AI systems
- Every probability distribution is defined over a set with a logical structure (sigma-algebra)
- Neural network verification expresses safety properties as logical formulas
1.2 Why Sets and Logic Matter for AI
| AI Concept | Set/Logic Foundation |
|---|---|
| Tokenisation | Vocabulary is a set; the set of all token sequences is a formal language |
| Probability | A probability space is built on sets: events are subsets of sample space |
| Neural network architecture | Each layer is a function between sets (domain and codomain) |
| Attention masks | A mask selects a subset of positions; causal mask is a specific set relation |
| Structured generation | Constrained decoding enforces logical constraints on output tokens at each step |
| Knowledge graphs | Entities are sets; relations are functions between sets; reasoning is predicate logic |
| Formal verification | Proving properties of AI systems requires predicate logic and SAT/SMT solvers |
| Training data | Deduplication is a set operation; a dataset is a multiset of examples |
| Retrieval (RAG) | A query retrieves a subset of a corpus satisfying a relevance predicate |
| Type checking | Tensor shape inference is a form of type theory; it prevents dimension mismatches |
1.3 The Hierarchy of Mathematical Foundations
Every mathematical structure used in AI rests on a hierarchy, and sets and logic form the very bottom:
Layer 7: Machine Learning (SGD, backprop, transformers)
Layer 6: Probability (distributions, Bayes, entropy)
Layer 5: Analysis (continuity, limits, derivatives)
Layer 4: Linear Algebra (vectors, matrices, eigenvalues)
Layer 3: Algebra (groups, rings, fields)
Layer 2: Arithmetic (natural numbers, integers, reals)
Layer 1: Set Theory (ZFC) (what objects exist? what collections?)
Layer 0: Logic (what is valid reasoning? what follows from what?)
Each layer is defined in terms of the layers below. Natural numbers are constructed from sets (von Neumann ordinals). Real numbers are equivalence classes of Cauchy sequences of rationals. Vector spaces are sets with operations satisfying axioms stated in predicate logic. Probability measures are functions from sets of events to [0, 1]. The entire edifice stands on sets and logic.
This is not historical accident - it is the result of 150 years of foundational work from Cantor, Frege, Russell, Hilbert, Gödel, and Turing to make mathematics rigorous and unambiguous.
1.4 Formal vs Informal Mathematics
Informal mathematics is what most research papers use: proofs in natural language, intuitions supported by diagrams, appeals to "clearly" and "obviously". This is adequate for communication between mathematicians who share common training and conventions.
Formal mathematics is where every step is justified by an explicit rule, and proofs are machine-checkable. Nothing is taken for granted:
| Aspect | Informal | Formal |
|---|---|---|
| Language | Natural language + notation | Formal logic (FOL or type theory) |
| Proof verification | Human reviewer | Machine (proof assistant) |
| Ambiguity | Tolerated | Forbidden |
| Tools | Paper, LaTeX | Lean 4, Coq, Isabelle, Agda |
| Scale | Research papers | Verified libraries (Mathlib ~150K theorems) |
Proof assistants encode logic and set theory precisely and verify proofs automatically. They are no longer academic curiosities:
- Lean 4 (Microsoft Research): Mathlib library; used by AlphaProof (Google DeepMind, 2024) to win IMO gold medals
- Coq (INRIA): CompCert verified C compiler; Four Colour Theorem formalised
- Isabelle (Cambridge/Munich): seL4 verified operating system kernel
- AI and formal mathematics: LLMs increasingly used to assist with formal proofs - Lean Copilot, Hypertree Proof Search, AlphaProof. Understanding sets and logic is prerequisite for understanding what LLMs are attempting when doing mathematical reasoning
1.5 Historical Timeline
The development of sets and logic spans 2,400 years. Here are the critical milestones:
| Year | Person | Contribution |
|---|---|---|
| ~350 BCE | Aristotle | Syllogistic logic - first formal system of deductive reasoning |
| 1679 | Leibniz | Calculus ratiocinator - dream of universal logical calculus |
| 1847 | Boole | The Mathematical Analysis of Logic - Boolean algebra; logic as algebra |
| 1874 | Cantor | Set theory - infinite sets, cardinality, continuum hypothesis |
| 1879 | Frege | Begriffsschrift - first complete formal logic system; predicate calculus |
| 1901 | Russell | Russell's Paradox - naive set theory is inconsistent |
| 1908 | Zermelo | Axiomatic set theory - foundation for modern mathematics |
| 1910-13 | Russell & Whitehead | Principia Mathematica - attempt to reduce all mathematics to logic |
| 1922-30s | Fraenkel, Skolem | ZFC axioms completed - standard foundation of modern mathematics |
| 1929 | Gödel | Completeness theorem - every valid FOL formula has a proof |
| 1931 | Gödel | Incompleteness theorems - no consistent formal system can prove all true statements of arithmetic |
| 1936 | Church & Turing | Lambda calculus and Turing machines - computability theory |
| 1936 | Tarski | Formal semantics - truth defined precisely for formal languages |
| 1965 | Zadeh | Fuzzy logic - graded truth values in [0, 1] |
| 1972 | Martin-Löf | Dependent type theory - constructive foundations connecting logic to programming |
| 2005-10 | - | SAT solvers industrialised - automated reasoning becomes practical |
| 2024 | Google DeepMind | AlphaProof - neural theorem prover achieves IMO gold medal level |
1.6 The Two Pillars
Mathematics rests on two pillars, and understanding their relationship is essential:
                        MATHEMATICS
                             |
         +-------------------+-------------------+
         |                                       |
       LOGIC                                SET THEORY
(What is valid reasoning?)           (What objects exist?)
         |                                       |
 Propositional Logic                    Naive Set Theory
         ↓                                       ↓
 Predicate Logic (FOL)                  Axiomatic (ZFC)
         ↓                                       ↓
 Modal / Temporal Logic                    Type Theory
         |                                       |
         +-------------------+-------------------+
                             |
            Both together: the language in which
            all of AI's mathematical foundations
                        are written
Logic provides the rules of valid inference - what follows from what. It tells you that if "all transformers have attention" and "GPT is a transformer", then "GPT has attention" - and it tells you exactly why this inference is valid (universal instantiation plus modus ponens).
Set theory provides the objects - the collections, the spaces, the structures. It tells you what ℝⁿ is (a Cartesian product of n copies of ℝ), what a function is (a special kind of relation, which is a subset of a Cartesian product), and what a probability space is (a triple of a sample space, a set of events, and a measure).
Neither is sufficient alone. Logic without sets has nothing to reason about. Sets without logic have no way to draw conclusions. Together they form the complete foundation.
2. Naive Set Theory
2.1 What Is a Set?
A set is an unordered collection of distinct objects called elements (or members). This is the simplest and most fundamental concept in all of mathematics.
Membership notation:
- means " is an element of " (read: " is in ")
- means " is not an element of " (read: " is not in ")
Example:
Then , , , but , .
The fundamental principle - Axiom of Extensionality: A set is entirely determined by its members. Two sets are equal if and only if they have exactly the same elements:
This means:
- Order does not matter:
- Repetition does not matter: (sets have distinct elements)
Important distinction: Sets have no notion of order or multiplicity. If you need order, use a sequence (tuple). If you need multiplicity, use a multiset (see 2.6). In ML, training batches are typically ordered sequences, not sets - but the mathematical specification of "training data" is a multiset (same example can appear multiple times).
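Python's built-in `set` type obeys exactly these rules, so the distinctions above can be checked directly; a minimal sketch:

```python
# Sets in Python mirror mathematical sets: unordered, distinct elements.
A = {1, 2, 3}
B = {3, 1, 2}      # same elements, different order
C = {1, 1, 2, 3}   # duplicates collapse on construction

print(A == B)      # True: order does not matter (extensionality)
print(A == C)      # True: repetition does not matter
print(2 in A)      # True: membership test, 2 ∈ A
print(5 not in A)  # True: non-membership, 5 ∉ A

# When order matters, use a tuple (sequence) instead:
print((1, 2, 3) == (3, 1, 2))  # False
```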
2.2 Describing Sets
There are three standard ways to describe which elements belong to a set:
Roster notation (listing): Enumerate elements explicitly inside braces.
| Example | Description |
|---|---|
| {1, 2, 3, 4, 5} | Finite set with 5 elements |
| {a, b, c} | Set of three letters |
| {0, 2, 4, 6, ...} | Infinite set with clear pattern (use carefully) |
Set-builder notation (predicate): Describe the property that members satisfy.
{x : P(x)}
Read: "the set of all x such that P(x) holds."
| Set-builder | Meaning |
|---|---|
| {n ∈ ℤ : n is divisible by 2} | Even integers |
| {x ∈ ℝ : x > 0} | Positive real numbers |
| {t : t is a token produced by the tokeniser} | Vocabulary as a set |
| {(i, j) : j ≤ i} | Causal attention mask positions |
Parametric description: Define elements by explicit construction.
| Parametric | Meaning |
|---|---|
| {2k : k ∈ ℤ} | Even integers |
| {1/n : n ∈ ℤ⁺} | Reciprocals of positive integers |
| {Ax : x ∈ ℝⁿ} | Range of a linear map |
2.3 Special Sets
The following sets are used so frequently that they have special symbols:
| Symbol | Name | Elements |
|---|---|---|
| ∅ | Empty set | No elements |
| ℕ | Natural numbers | {0, 1, 2, 3, ...} (sometimes starting from 1) |
| ℤ | Integers | {..., -2, -1, 0, 1, 2, ...} |
| ℚ | Rationals | {p/q : p, q ∈ ℤ, q ≠ 0} |
| ℝ | Real numbers | All points on the number line |
| ℂ | Complex numbers | {a + bi : a, b ∈ ℝ} |
| U | Universal set | All objects under consideration (context-dependent) |
The containment chain:
ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ
Each step adds new elements: ℤ adds negatives, ℚ adds fractions, ℝ adds irrationals (√2, π, e), ℂ adds imaginary numbers.
Critical distinctions with the empty set:
| Expression | Elements | Cardinality |
|---|---|---|
| ∅ | None | 0 |
| {∅} | One element: ∅ | 1 |
| {∅, {∅}} | Two elements: ∅ and {∅} | 2 |
Warning: ∅ ≠ {∅}. The empty set has no elements. The set containing the empty set has one element (which happens to be the empty set). This distinction is not pedantic - it is exactly how natural numbers are constructed from sets in ZFC (see 8.3).
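The ∅ versus {∅} distinction can be made concrete in Python; a minimal sketch (Python sets cannot contain mutable sets, so `frozenset` is used for the nesting):

```python
# ∅ vs {∅}: the empty set has no elements; the set containing it has one.
empty = frozenset()                  # ∅
contains_empty = frozenset({empty})  # {∅}

print(len(empty))               # 0
print(len(contains_empty))      # 1
print(empty == contains_empty)  # False: ∅ ≠ {∅}
print(empty in contains_empty)  # True:  ∅ ∈ {∅}
```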
2.4 Subset and Superset
Subset: A is a subset of B (written A ⊆ B) if every element of A is also an element of B:
A ⊆ B ⇔ ∀x (x ∈ A ⇒ x ∈ B)
Proper subset: A is a proper subset of B (written A ⊂ B or A ⊊ B) if A ⊆ B and A ≠ B - that is, B has at least one element not in A.
Superset: B ⊇ A means A ⊆ B (read: "B is a superset of A").
Properties of the subset relation:
| Property | Statement | Meaning |
|---|---|---|
| Reflexivity | A ⊆ A | Every set is a subset of itself |
| Antisymmetry | A ⊆ B and B ⊆ A ⇒ A = B | Standard proof technique for set equality |
| Transitivity | A ⊆ B and B ⊆ C ⇒ A ⊆ C | Subset chains compose |
| Empty set | ∅ ⊆ A for every set A | Vacuously true |
The antisymmetry property gives us the standard method for proving two sets are equal: show A ⊆ B (every element of A is in B) and B ⊆ A (every element of B is in A). This "double containment" proof is used constantly in mathematics.
Vacuous truth explanation: Why is ∅ ⊆ A true? The definition says: for every x, if x ∈ ∅ then x ∈ A. Since nothing is in ∅, the "if" clause is never satisfied, so the implication is vacuously true. There is no counterexample - we would need an x that is in ∅ but not in A, and no such x exists. See 5.2 for more on vacuous truth and implication.
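Python exposes the subset relation as comparison operators, and the double-containment technique can be checked mechanically; a minimal sketch (the sets X and Y are illustrative):

```python
A = {1, 2}
B = {1, 2, 3}

print(A <= B)      # True: A ⊆ B (subset)
print(A < B)       # True: A ⊂ B (proper subset)
print(set() <= A)  # True: ∅ ⊆ A, vacuously

# Double containment as an equality check: the same set described two ways.
X = {n for n in range(10) if n % 2 == 0}  # evens below 10, by predicate
Y = {2 * k for k in range(5)}             # evens below 10, parametrically
print(X <= Y and Y <= X)  # True: both containments hold
print(X == Y)             # True: hence the sets are equal
```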
2.5 Russell's Paradox
Naive comprehension (the principle that doomed early set theory): For any property P, there exists a set {x : P(x)}. This seems natural - "the set of all things with property P" - but it leads to disaster.
Russell's construction (1901): Define
R = {x : x ∉ x}
The set of all sets that don't contain themselves. Now ask: is R ∈ R?
Case 1: Assume R ∈ R. Then R satisfies its own membership condition, which says R ∉ R. Contradiction.
Case 2: Assume R ∉ R. Then R satisfies the condition x ∉ x, so by the definition of R, we have R ∈ R. Contradiction.
Either way we get a contradiction. Naive set theory is inconsistent.
The resolution: We cannot form "the set of all sets" or unrestricted collections. The axiom of Separation (in ZFC) restricts comprehension: you can only form {x ∈ A : P(x)} for an already-existing set A. You cannot form a set "from scratch" using just a property - you must separate from an existing set. This blocks Russell's paradox because there is no universal set of all sets from which to separate.
This is not merely a historical curiosity. Russell's paradox is the reason mathematics has axioms for set theory at all, and understanding it is prerequisite for understanding ZFC (8), type theory (9.4), and the limits of formal systems (12.1).
2.6 Multisets (Bags)
Standard sets require distinct elements: {a, a, b} = {a, b}. But in practice, we often need to track multiplicity - how many times each element appears.
A multiset (or bag) generalises sets by assigning each element a multiplicity (count). Formally, a multiset over a base set S is a function m: S → ℕ where m(x) gives the number of times x appears.
Notation: {a, a, b} as a multiset has m(a) = 2, m(b) = 1, and m(x) = 0 for every other x.
Multiset operations:
| Operation | Definition | Example ({a, a, b} with {a, b, b}) |
|---|---|---|
| Union (max) | m(x) = max(m₁(x), m₂(x)) | {a, a, b, b} |
| Sum (add) | m(x) = m₁(x) + m₂(x) | {a, a, a, b, b, b} |
| Intersection (min) | m(x) = min(m₁(x), m₂(x)) | {a, b} |
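These three multiset operations map directly onto Python's `collections.Counter`; a minimal sketch:

```python
from collections import Counter

A = Counter("aab")  # multiset {a, a, b}: m(a)=2, m(b)=1
B = Counter("abb")  # multiset {a, b, b}: m(a)=1, m(b)=2

print(A | B)  # union (max):        m(a)=2, m(b)=2
print(A + B)  # sum (add):          m(a)=3, m(b)=3
print(A & B)  # intersection (min): m(a)=1, m(b)=1
```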
AI applications of multisets:
- Training data: A dataset is a multiset of examples - the same document may appear multiple times (data repetition, which affects training dynamics)
- Token counting: Term frequency in TF-IDF and BM25 is a multiset operation - how many times each word appears matters
- Gradient accumulation: Sum of gradient multiset over a batch: each micro-batch contributes a gradient, and we sum (multiset addition)
- Batch statistics: BatchNorm computes mean and variance of a multiset of activations per channel
3. Set Operations
Set operations are the algebraic toolkit for combining, comparing, and transforming sets. Every one of these operations appears directly in modern AI systems: attention masks use complement and intersection, retrieval uses union and difference, and probability theory uses all of them.
3.1 Union
The union A ∪ B = {x : x ∈ A or x ∈ B} contains every element that belongs to at least one of the two sets. The "or" is inclusive - elements in both sets are included (once, since sets don't repeat).
Visual: Think of taking both sets and pouring them together. Any element from either set goes in.
Properties of union:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A ∪ B = B ∪ A | "or" is commutative |
| Associativity | (A ∪ B) ∪ C = A ∪ (B ∪ C) | Can chain without parentheses |
| Identity | A ∪ ∅ = A | Adding nothing changes nothing |
| Idempotence | A ∪ A = A | Duplicates collapse in sets |
| Domination | A ∪ U = U | Everything plus anything is everything |
Generalised union: For a collection of sets {Aᵢ : i ∈ I}:
⋃ᵢ Aᵢ = {x : x ∈ Aᵢ for some i ∈ I}
AI examples:
- Vocabulary merging: When combining tokenisers from different models, V = V₁ ∪ V₂
- Search results: "Documents matching query A OR query B" = D_A ∪ D_B
- Multi-dataset training: Training set = D₁ ∪ D₂ ∪ ... ∪ Dₖ
3.2 Intersection
The intersection A ∩ B = {x : x ∈ A and x ∈ B} contains only elements that belong to both sets simultaneously.
Properties of intersection:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A ∩ B = B ∩ A | "and" is commutative |
| Associativity | (A ∩ B) ∩ C = A ∩ (B ∩ C) | Can chain without parentheses |
| Identity | A ∩ U = A | Intersecting with everything keeps all |
| Idempotence | A ∩ A = A | Set intersected with itself is itself |
| Annihilation | A ∩ ∅ = ∅ | Intersecting with nothing gives nothing |
Disjoint sets: Two sets A and B are disjoint if A ∩ B = ∅ - they share no elements.
Generalised intersection: For a collection {Aᵢ : i ∈ I}:
⋂ᵢ Aᵢ = {x : x ∈ Aᵢ for every i ∈ I}
AI examples:
- Common tokens: Tokens shared by two vocabularies: V₁ ∩ V₂
- Data deduplication: Documents appearing in both train and test: D_train ∩ D_test (should be ∅ for valid evaluation!)
- Attention intersection: Tokens attended by both head A and head B
Distributive laws (union and intersection interact):
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
These mirror the distributive law of multiplication over addition (a(b + c) = ab + ac), which is no coincidence - Boolean algebra (5.7) makes these analogies precise.
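Union, intersection, and both distributive laws can be checked on concrete sets; a minimal sketch (the three sets are illustrative):

```python
A, B, C = {1, 2, 3}, {3, 4}, {4, 5}

print(A | B == {1, 2, 3, 4})  # True: union
print(A & B == {3})           # True: intersection

# Distributive laws, verified on these particular sets:
print(A & (B | C) == (A & B) | (A & C))  # True
print(A | (B & C) == (A | B) & (A | C))  # True
```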
3.3 Set Difference
The set difference A \ B = {x : x ∈ A and x ∉ B} contains the elements of A that are not in B. Think of "removing" B's elements from A.
Properties of set difference:
| Property | Statement | Meaning |
|---|---|---|
| Not commutative | A \ B ≠ B \ A | Not equal in general |
| Identity | A \ ∅ = A | Removing nothing changes nothing |
| Annihilation | A \ U = ∅ | Removing everything leaves nothing |
| Reduction | A \ B ⊆ A | Can only reduce the original set |
| Disjointness | (A \ B) ∩ B = ∅ | What remains shares nothing with what was removed |
Relationship to other operations:
A \ B = A ∩ Bᶜ
Difference is intersection with complement (this is often the most useful way to think about it).
AI examples:
- Vocabulary difference: Tokens in GPT-4's vocabulary but not LLaMA's: V_GPT-4 \ V_LLaMA
- Data filtering: Training data after removing test examples: D_train \ D_test
- Feature selection: Selected features = all features minus dropped features
3.4 Symmetric Difference
The symmetric difference A △ B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B) contains the elements in exactly one of the two sets - not in both.
Properties of symmetric difference:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A △ B = B △ A | Definition is symmetric |
| Associativity | (A △ B) △ C = A △ (B △ C) | Can chain |
| Identity | A △ ∅ = A | XOR with nothing = original |
| Self-inverse | A △ A = ∅ | XOR with self cancels |
| Criterion | A △ B = ∅ ⇔ A = B | Zero difference means equal |
Algebraic structure: The symmetric difference operation makes the power set 𝒫(U) into an abelian group with ∅ as identity (each set is its own inverse). Combined with intersection as multiplication, (𝒫(U), △, ∩) forms a Boolean ring - connecting set theory to abstract algebra.
AI examples:
- Dataset drift detection: Tokens that changed between vocabulary v1 and v2: V₁ △ V₂
- Model comparison: Features used by model A XOR model B
- Edit distance: Symmetric difference of character sets is a crude string similarity measure
3.5 Complement
The complement Aᶜ = U \ A = {x ∈ U : x ∉ A} contains everything in the universal set U that is not in A. The complement depends on what U is (must specify context).
Properties of complement:
| Property | Statement |
|---|---|
| Double complement | (Aᶜ)ᶜ = A |
| Complement of U | Uᶜ = ∅ |
| Complement of ∅ | ∅ᶜ = U |
| Union with complement | A ∪ Aᶜ = U |
| Intersection with complement | A ∩ Aᶜ = ∅ |
AI example: An attention mask M selects positions to attend to. The complement Mᶜ gives the positions that are masked out (set to -∞ before softmax). In causal masking, the complement is the "future" positions that must not be attended to.
3.6 De Morgan's Laws for Sets
These are the most important algebraic identities for sets. They connect union and intersection through complementation:
First law - complement of union = intersection of complements:
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
"Not in A-or-B" is the same as "not in A and not in B."
Second law - complement of intersection = union of complements:
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
"Not in A-and-B" is the same as "not in A or not in B."
Proof of the first law (by double containment):
(⊆): Let x ∈ (A ∪ B)ᶜ. Then x ∉ A ∪ B, which means x ∉ A and x ∉ B (if x were in either, it would be in the union). Hence x ∈ Aᶜ and x ∈ Bᶜ, so x ∈ Aᶜ ∩ Bᶜ.
(⊇): Let x ∈ Aᶜ ∩ Bᶜ. Then x ∉ A and x ∉ B. Hence x is in neither set, so x ∉ A ∪ B, which means x ∈ (A ∪ B)ᶜ.
Both containments hold, so (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
Generalised De Morgan's Laws:
(⋃ᵢ Aᵢ)ᶜ = ⋂ᵢ Aᵢᶜ and (⋂ᵢ Aᵢ)ᶜ = ⋃ᵢ Aᵢᶜ
AI examples:
- Attention mask logic: "Not attending to position A or position B" = "not attending to A and not attending to B"
- Retrieval negation: "Documents not matching (query1 OR query2)" = "documents not matching query1 AND not matching query2"
- Filter composition: Negating a disjunctive filter becomes conjunctive
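Both laws can be verified on concrete finite sets, with complement taken relative to an explicit universal set; a minimal sketch (U, A, B are illustrative):

```python
U = set(range(10))  # universal set for this check - must be made explicit
A, B = {1, 2, 3}, {3, 4, 5}

def complement(S):
    """Complement relative to U: Sᶜ = U \\ S."""
    return U - S

# First law:  (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
print(complement(A | B) == complement(A) & complement(B))  # True
# Second law: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
print(complement(A & B) == complement(A) | complement(B))  # True
```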
3.7 Cartesian Product
The Cartesian product A × B = {(a, b) : a ∈ A, b ∈ B} is the set of all ordered pairs where the first element comes from A and the second from B. Unlike sets, order matters in pairs: (a, b) ≠ (b, a) unless a = b.
Cardinality: |A × B| = |A| · |B|
Not commutative: A × B ≠ B × A (different ordered pairs, unless A = B).
Higher products:
- A × B × C - ordered triples
- Aⁿ = A × A × ... × A - n-tuples of elements from A
AI examples - Cartesian products are everywhere:
| Expression | Meaning |
|---|---|
| ℝᵈ | d-dimensional embedding space |
| Vⁿ | All possible token sequences of length n |
| ℝᵈ × ℝᵈ | Domain of attention scores: (queries, keys) |
| {1, ..., n} × {1, ..., n} | All position pairs for attention matrix |
| ℝᵖ | Parameter space of model with p parameters |
Formal definition of ordered pair (Kuratowski): (a, b) = {{a}, {a, b}}. This reduces ordered pairs to pure sets - the pair is a set that encodes the order: the first coordinate is the element of the singleton {a}.
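The product and its cardinality rule can be checked with `itertools.product`; a minimal sketch (the two sets are illustrative):

```python
from itertools import product

A, B = {0, 1}, {'x', 'y'}

AxB = set(product(A, B))  # A × B as a set of ordered pairs
print(sorted(AxB))        # [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]
print(len(AxB) == len(A) * len(B))  # True: |A × B| = |A| · |B|

# Order matters inside pairs, so A × B ≠ B × A:
print(set(product(A, B)) == set(product(B, A)))  # False
```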
3.8 Power Set
The power set 𝒫(A) = {S : S ⊆ A} is the set of all subsets of A, including ∅ and A itself.
Cardinality: |𝒫(A)| = 2^|A| for finite A.
The reason: for each element of A, you independently choose "include" or "exclude" - this gives 2 choices for each of the |A| elements, hence 2^|A| subsets total.
Example:
| Set | 𝒫(A) | Count |
|---|---|---|
| {a, b, c} | {∅, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}} | 8 subsets (including ∅ and A) |
Cantor's theorem (critical): The power set is always strictly larger than the original set, even for infinite sets:
|A| < |𝒫(A)|
This means there are always more subsets than elements - a fact that generates the entire hierarchy of infinite cardinalities (see 10.4).
AI implications:
- If the vocabulary V has |V| tokens, then |𝒫(V)| = 2^|V| - the number of possible token subsets is astronomically larger than the vocabulary itself
- Structured generation constrains output to a subset of V at each step - it selects one element of 𝒫(V)
- The exponential blowup of power sets is why exhaustive search over subsets is intractable, driving the need for heuristic and approximate algorithms throughout AI
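The "combinations for each size" construction mentioned in the summary table below can be written out explicitly; a minimal sketch:

```python
from itertools import chain, combinations

def power_set(s):
    """All subsets of s as frozensets, built size by size (r = 0 .. |s|)."""
    items = list(s)
    return {frozenset(c)
            for c in chain.from_iterable(
                combinations(items, r) for r in range(len(items) + 1))}

P = power_set({'a', 'b', 'c'})
print(len(P))  # 8 = 2^3, confirming |𝒫(A)| = 2^|A|
print(frozenset() in P)                 # True: ∅ ∈ 𝒫(A)
print(frozenset({'a', 'b', 'c'}) in P)  # True: A ∈ 𝒫(A)
```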
3.9 Operations Summary Table
For quick reference, here is the complete table of set operations with their logical counterparts:
| Set Operation | Notation | Logic Equivalent | Python |
|---|---|---|---|
| Union | A ∪ B | ∨ (OR) | A \| B or A.union(B) |
| Intersection | A ∩ B | ∧ (AND) | A & B or A.intersection(B) |
| Difference | A \ B | ∧ ¬ (AND NOT) | A - B or A.difference(B) |
| Symmetric difference | A △ B | ⊕ (XOR) | A ^ B or A.symmetric_difference(B) |
| Complement | Aᶜ | ¬ (NOT) | U - A |
| Cartesian product | A × B | - | itertools.product(A, B) |
| Power set | 𝒫(A) | - | itertools.combinations for each size |
| Subset test | A ⊆ B | ⇒ (implication) | A <= B or A.issubset(B) |
4. Relations
Relations generalise the concept of "connection" between objects. Functions, orderings, equivalences, and even neural network computations are all special kinds of relations. Understanding relations precisely is what separates informal intuition from rigorous mathematics.
4.1 Binary Relations
A binary relation R from set A to set B is a subset of the Cartesian product:
R ⊆ A × B
If (a, b) ∈ R, we write aRb or R(a, b) and read "a is related to b by R."
A relation on A is a relation from A to itself: R ⊆ A × A.
Examples:
| Relation | Sets | Definition |
|---|---|---|
| ≤ on ℝ | ℝ × ℝ | (x, y) ∈ R iff x is less than or equal to y |
| Parent-of | Person × Person | (p, c) ∈ R iff p is parent of c |
| Token adjacency | V × V | (t₁, t₂) ∈ R iff t₁ can precede t₂ in valid text |
| Attention mask | {1, ..., n} × {1, ..., n} | (i, j) ∈ M iff query at position i attends to key at position j |
| Similarity | ℝᵈ × ℝᵈ | (u, v) ∈ R iff cos(u, v) exceeds a threshold |
A relation can be represented as:
- A set of pairs (the mathematical definition)
- A matrix (Boolean matrix M where Mᵢⱼ = 1 iff (aᵢ, aⱼ) ∈ R; this is the attention mask representation)
- A directed graph (nodes = elements; edge from a to b iff aRb)
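The set-of-pairs and Boolean-matrix representations are interchangeable; a minimal sketch using a causal mask on 4 positions (the j ≤ i convention is the standard decoder-style mask, assumed here):

```python
n = 4
# As a set of pairs: query position i may attend to key position j iff j ≤ i.
R = {(i, j) for i in range(n) for j in range(n) if j <= i}

# As a Boolean matrix - the attention-mask representation: M[i][j] = 1 iff (i, j) ∈ R.
M = [[1 if (i, j) in R else 0 for j in range(n)] for i in range(n)]
for row in M:
    print(row)  # lower-triangular: [1,0,0,0], [1,1,0,0], [1,1,1,0], [1,1,1,1]
```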
4.2 Properties of Relations
Let R be a relation on set A (i.e., R ⊆ A × A). There are six fundamental properties a relation may or may not have:
Reflexivity: ∀a ∈ A, aRa. Every element is related to itself.
- ✓ ≤ is reflexive (x ≤ x for all x)
- ✗ < is not reflexive (x < x never holds)
- AI: identity attention (token attends to itself) is a reflexive relation
Irreflexivity: ∀a ∈ A, ¬(aRa). No element is related to itself.
- ✓ < is irreflexive
- ✓ "parent-of" is irreflexive (no one is their own parent)
- A relation can be neither reflexive nor irreflexive (some self-loops, not all)
Symmetry: aRb ⇒ bRa. If a is related to b, then b is related to a.
- ✓ "is sibling of" is symmetric
- ✓ "has same length as" is symmetric
- ✗ ≤ is not symmetric (2 ≤ 3 but not 3 ≤ 2)
- AI: cosine similarity is symmetric; attention is NOT symmetric (query → key direction)
Antisymmetry: aRb and bRa ⇒ a = b. If a is related to b AND b is related to a, then a = b.
- ✓ ≤ is antisymmetric (x ≤ y and y ≤ x implies x = y)
- ✓ ⊆ on sets is antisymmetric
- Note: antisymmetry ≠ "not symmetric". A relation can be both symmetric and antisymmetric (e.g., equality)
Asymmetry: aRb ⇒ ¬(bRa). If a is related to b, then b is NOT related to a.
- ✓ < is asymmetric
- ✓ "parent-of" is asymmetric
- Asymmetry implies irreflexivity (if aRa held, asymmetry would require ¬(aRa))
Transitivity: aRb and bRc ⇒ aRc. If a is related to b and b is related to c, then a is related to c.
- ✓ ≤ is transitive
- ✓ "ancestor-of" is transitive
- ✗ "friend-of" is not transitive (my friend's friend need not be my friend)
- AI: causal mask transitivity - if position i attends to j and j attends to k, should i attend to k? In a causal mask, yes (by transitivity of ≤)
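For finite relations stored as sets of pairs, each property above is a direct quantifier check; a minimal sketch testing ≤ on a small set:

```python
def is_reflexive(R, A):
    return all((a, a) in R for a in A)

def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)

def is_antisymmetric(R):
    return all(a == b for (a, b) in R if (b, a) in R)

def is_transitive(R):
    # For every aRb and bRc, require aRc.
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

A = {1, 2, 3}
leq = {(a, b) for a in A for b in A if a <= b}  # ≤ restricted to A

print(is_reflexive(leq, A))   # True
print(is_symmetric(leq))      # False: 1 ≤ 2 but not 2 ≤ 1
print(is_antisymmetric(leq))  # True
print(is_transitive(leq))     # True
```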
4.3 Equivalence Relations
A relation R on A is an equivalence relation if it is:
- Reflexive: aRa for all a ∈ A
- Symmetric: aRb ⇒ bRa
- Transitive: aRb and bRc ⇒ aRc
Equivalence relations capture the idea of "sameness" or "interchangeability" - elements related by an equivalence relation are considered equivalent for the purpose at hand.
Examples of equivalence relations:
| Relation | Domain | Equivalence classes |
|---|---|---|
| Equality | Any set | Each class is a singleton |
| Same remainder mod n | ℤ | n classes: [0], [1], ..., [n-1] |
| Same length | Strings | Strings of equal length grouped together |
| Same BPE encoding | Character sequences | Sequences mapping to same token(s) |
| Same prediction | Inputs | Inputs producing identical model output |
Equivalence class of a:
[a] = {x ∈ A : xRa}
All elements equivalent to a form the equivalence class [a].
Partition theorem: An equivalence relation on A partitions A into disjoint equivalence classes:
- Every element belongs to exactly one equivalence class
- The union of all classes equals A: ⋃ₐ [a] = A
- Distinct classes are disjoint: [a] ≠ [b] ⇒ [a] ∩ [b] = ∅
Quotient set: A/R = {[a] : a ∈ A} - the set of all equivalence classes.
AI example - tokenisation as equivalence: BPE tokenisation defines an equivalence relation on character sequences. Two character sequences are "equivalent" if they merge to the same token. The equivalence classes are precisely the tokens. The vocabulary is the quotient set (restricted to the merge rules).
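The partition theorem can be seen concretely by grouping a finite set into classes; a minimal sketch using "same remainder mod 3" on {0, ..., 9}:

```python
from collections import defaultdict

A = range(10)
n = 3  # equivalence: x R y iff x ≡ y (mod 3)

# Each class [x] is determined by the canonical representative x mod n.
classes = defaultdict(set)
for x in A:
    classes[x % n].add(x)

print({r: sorted(c) for r, c in classes.items()})
# {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}

# Partition check: the classes cover A (and are pairwise disjoint by construction).
print(set().union(*classes.values()) == set(A))  # True
```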
4.4 Partial Orders
A relation R on A is a partial order if it is:
- Reflexive: aRa for all a ∈ A
- Antisymmetric: aRb and bRa ⇒ a = b
- Transitive: aRb and bRc ⇒ aRc
A partial order is written ≤ or ⪯ conventionally. The pair (A, ≤) is called a partially ordered set (poset).
"Partial" means: Not all pairs need be comparable. Some elements may be incomparable - neither a ≤ b nor b ≤ a holds.
Total order (linear order): A partial order where additionally
∀a, b ∈ A: a ≤ b or b ≤ a
All pairs are comparable. Total orders arrange elements in a single line.
Examples:
| Relation | Domain | Type | Incomparable? |
|---|---|---|---|
| ≤ on ℝ | Real numbers | Total order | No |
| Lexicographic order | Strings | Total order | No |
| ⊆ on 𝒫(S) | Power set | Partial order | Yes: {1} and {2} |
| Divisibility on ℤ⁺ | a R b iff a divides b | Partial order | Yes: 2 and 3 |
| Prefix relation | Strings; s R t iff s is a prefix of t | Partial order | Yes: "ab" and "ba" |
Hasse diagram: A graphical representation of a finite poset. Draw b above a with an edge if a < b and there is no element between them (no c with a < c < b). This gives a compact picture of the partial order structure.
AI connection - dependency parsing: In a syntactic parse tree, token dominates token if is a descendant of . This dominance relation is a partial order on sentence tokens. Similarly, in DAG-structured computation graphs, the execution order is a partial order on operations.
4.5 Functions as Relations
A function f: A → B is a special case of a relation - a relation f ⊆ A × B satisfying the condition that every element of A appears in exactly one pair:
∀a ∈ A, there is exactly one b ∈ B with (a, b) ∈ f
We write f(a) = b to mean (a, b) ∈ f.
Terminology:
| Term | Symbol | Meaning |
|---|---|---|
| Domain | A | Set of inputs |
| Codomain | B | Set of allowed outputs |
| Range (image) | f(A) = {f(a) : a ∈ A} | Actual outputs (f(A) ⊆ B) |
Types of functions:
| Type | Definition | Meaning |
|---|---|---|
| Injective (one-to-one) | f(a₁) = f(a₂) ⇒ a₁ = a₂ | Distinct inputs → distinct outputs |
| Surjective (onto) | ∀b ∈ B, ∃a ∈ A with f(a) = b | Every output is hit |
| Bijective | Injective and surjective | Perfect pairing between A and B |
AI examples - functions are the core abstraction:
| Function | Type Signature | Properties |
|---|---|---|
| Tokeniser | strings → V* | Not injective (multiple strings → same tokens) |
| Embedding | V → ℝᵈ | Injective (each token gets unique vector) |
| LM head | ℝᵈ → ℝ^\|V\| | Linear map (matrix multiplication) |
| Attention | ℝ^(n×d) → ℝ^(n×d) | Highly non-linear |
| Softmax | ℝⁿ → probability simplex | Maps onto the interior of the probability simplex |
| argmax | ℝ^\|V\| → V | Not injective (ties); not differentiable |
4.6 Composition and Inverse
Composition: Given f: A → B and g: B → C, the composition g ∘ f: A → C is defined by:
(g ∘ f)(x) = g(f(x))
Read right to left: "apply f first, then g."
Properties of composition:
- Associativity: h ∘ (g ∘ f) = (h ∘ g) ∘ f
- Identity: f ∘ id_A = f and id_B ∘ f = f, where id_A(x) = x
- Not commutative: g ∘ f ≠ f ∘ g in general
Inverse: If f: A → B is bijective, there exists a unique f⁻¹: B → A satisfying:
f⁻¹ ∘ f = id_A and f ∘ f⁻¹ = id_B
Only bijections have inverses. Injective-but-not-surjective functions have left inverses; surjective-but-not-injective functions have right inverses.
AI example - neural networks as function composition: A deep neural network is fundamentally a composition of functions:
f = f_L ∘ f_(L-1) ∘ ... ∘ f₁
Each fᵢ is one layer (linear map + activation). Deep learning is deeply compositional - the power comes from composing many simple functions. The chain rule of calculus (backpropagation) exploits this compositional structure: the derivative of the composition is the product of the derivatives of the layers.
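A toy sketch of this compositional structure, with scalar "layers" standing in for real linear maps and activations (the layer functions here are illustrative, not an actual network):

```python
def compose(*fs):
    """compose(f, g, h)(x) = f(g(h(x))) - applied right to left, like f ∘ g ∘ h."""
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

f1 = lambda x: 2 * x      # "layer 1": scale (stand-in for a linear map)
f2 = lambda x: x + 1      # "layer 2": shift (stand-in for a bias)
f3 = lambda x: max(x, 0)  # "layer 3": ReLU activation

net = compose(f3, f2, f1)  # net = f3 ∘ f2 ∘ f1
print(net(3))   # f3(f2(f1(3))) = max(2*3 + 1, 0) = 7
print(net(-3))  # max(2*(-3) + 1, 0) = 0

# Composition is associative but not commutative:
print(compose(f2, f1)(3))  # 2*3 + 1 = 7
print(compose(f1, f2)(3))  # 2*(3 + 1) = 8
```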