Sets and Logic: Part 1: Intuition to 4. Relations
1. Intuition
1.1 What Are Sets and Logic?
Set theory is the mathematical language for talking about collections of objects; logic is the mathematical language for talking about truth and valid reasoning. Together they form the foundation of all mathematics: every other mathematical structure - numbers, functions, probability, linear algebra - is built on sets and governed by logic.
A set answers the question: what objects belong to this category?
Logic answers the question: given what we know, what can we validly conclude?
These are not abstract curiosities. When you tokenise a sentence, the vocabulary is a set. When you compute attention, the mask selects a subset of positions. When you train a model, the loss function maps from a set of parameters to real numbers. When you evaluate whether a model's output is "correct", you're applying logical predicates. Sets and logic are not prerequisites you study once and forget - they are the operational language of every equation in this curriculum.
For AI and LLMs specifically:
- Sets formalise vocabularies, token sequences, and model outputs
- Logic underpins reasoning, constraint satisfaction, knowledge representation, and formal verification of AI systems
- Every probability distribution is defined over a set with a logical structure (sigma-algebra)
- Neural network verification expresses safety properties as logical formulas
1.2 Why Sets and Logic Matter for AI
| AI Concept | Set/Logic Foundation |
|---|---|
| Tokenisation | Vocabulary is a set; the set of all token sequences is a formal language |
| Probability | A probability space is built on sets: events are subsets of sample space |
| Neural network architecture | Each layer is a function between sets (domain and codomain) |
| Attention masks | A mask selects a subset of positions; causal mask is a specific set relation |
| Structured generation | Constrained decoding enforces logical constraints on output tokens at each step |
| Knowledge graphs | Entities are sets; relations are functions between sets; reasoning is predicate logic |
| Formal verification | Proving properties of AI systems requires predicate logic and SAT/SMT solvers |
| Training data | Deduplication is a set operation; a dataset is a multiset of examples |
| Retrieval (RAG) | A query retrieves a subset of a corpus satisfying a relevance predicate |
| Type checking | Tensor shape inference is a form of type theory; it prevents dimension mismatches |
1.3 The Hierarchy of Mathematical Foundations
Every mathematical structure used in AI rests on a hierarchy, and sets and logic form the very bottom:
Layer 7: Machine Learning (SGD, backprop, transformers)
Layer 6: Probability (distributions, Bayes, entropy)
Layer 5: Analysis (continuity, limits, derivatives)
Layer 4: Linear Algebra (vectors, matrices, eigenvalues)
Layer 3: Algebra (groups, rings, fields)
Layer 2: Arithmetic (natural numbers, integers, reals)
Layer 1: Set Theory (ZFC) (what objects exist? what collections?)
Layer 0: Logic (what is valid reasoning? what follows from what?)
Each layer is defined in terms of the layers below. Natural numbers are constructed from sets (von Neumann ordinals). Real numbers are equivalence classes of Cauchy sequences of rationals. Vector spaces are sets with operations satisfying axioms stated in predicate logic. Probability measures are functions from sets of events to [0, 1]. The entire edifice stands on sets and logic.
This is not historical accident - it is the result of 150 years of foundational work from Cantor, Frege, Russell, Hilbert, Gödel, and Turing to make mathematics rigorous and unambiguous.
1.4 Formal vs Informal Mathematics
Informal mathematics is what most research papers use: proofs in natural language, intuitions supported by diagrams, appeals to "clearly" and "obviously". This is adequate for communication between mathematicians who share common training and conventions.
Formal mathematics is where every step is justified by an explicit rule, and proofs are machine-checkable. Nothing is taken for granted:
| Aspect | Informal | Formal |
|---|---|---|
| Language | Natural language + notation | Formal logic (FOL or type theory) |
| Proof verification | Human reviewer | Machine (proof assistant) |
| Ambiguity | Tolerated | Forbidden |
| Tools | Paper, LaTeX | Lean 4, Coq, Isabelle, Agda |
| Scale | Research papers | Verified libraries (Mathlib ~150K theorems) |
Proof assistants encode logic and set theory precisely and verify proofs automatically. They are no longer academic curiosities:
- Lean 4 (Microsoft Research): Mathlib library; used by AlphaProof (Google DeepMind, 2024) to win IMO gold medals
- Coq (INRIA): CompCert verified C compiler; Four Colour Theorem formalised
- Isabelle (Cambridge/Munich): seL4 verified operating system kernel
- AI and formal mathematics: LLMs increasingly used to assist with formal proofs - Lean Copilot, Hypertree Proof Search, AlphaProof. Understanding sets and logic is prerequisite for understanding what LLMs are attempting when doing mathematical reasoning
1.5 Historical Timeline
The development of sets and logic spans 2,400 years. Here are the critical milestones:
| Year | Person | Contribution |
|---|---|---|
| ~350 BCE | Aristotle | Syllogistic logic - first formal system of deductive reasoning |
| 1679 | Leibniz | Calculus ratiocinator - dream of universal logical calculus |
| 1847 | Boole | The Mathematical Analysis of Logic - Boolean algebra; logic as algebra |
| 1874 | Cantor | Set theory - infinite sets, cardinality, continuum hypothesis |
| 1879 | Frege | Begriffsschrift - first complete formal logic system; predicate calculus |
| 1901 | Russell | Russell's Paradox - naive set theory is inconsistent |
| 1908 | Zermelo | Axiomatic set theory - foundation for modern mathematics |
| 1910-13 | Russell & Whitehead | Principia Mathematica - attempt to reduce all mathematics to logic |
| 1922-30s | Fraenkel, Skolem | ZFC axioms completed - standard foundation of modern mathematics |
| 1929 | Gödel | Completeness theorem - every valid FOL formula has a proof |
| 1931 | Gödel | Incompleteness theorems - no consistent formal system can prove all true statements of arithmetic |
| 1936 | Church & Turing | Lambda calculus and Turing machines - computability theory |
| 1936 | Tarski | Formal semantics - truth defined precisely for formal languages |
| 1965 | Zadeh | Fuzzy logic - graded truth values in [0, 1] |
| 1972 | Martin-Löf | Dependent type theory - constructive foundations connecting logic to programming |
| 2005-10 | - | SAT solvers industrialised - automated reasoning becomes practical |
| 2024 | Google DeepMind | AlphaProof - neural theorem prover achieves IMO gold medal level |
1.6 The Two Pillars
Mathematics rests on two pillars, and understanding their relationship is essential:
                        MATHEMATICS
                             |
         +-------------------+-------------------+
         |                                       |
       LOGIC                                SET THEORY
(What is valid reasoning?)           (What objects exist?)
         |                                       |
 Propositional Logic                    Naive Set Theory
         ↓                                       ↓
 Predicate Logic (FOL)                  Axiomatic (ZFC)
         ↓                                       ↓
 Modal / Temporal Logic                    Type Theory
         |                                       |
         +-------------------+-------------------+
                             |
            Both together: the language in which
            all of AI's mathematical foundations
                        are written
Logic provides the rules of valid inference - what follows from what. It tells you that if "all transformers have attention" and "GPT is a transformer", then "GPT has attention" - and it tells you exactly why this inference is valid (universal instantiation plus modus ponens).
Set theory provides the objects - the collections, the spaces, the structures. It tells you what ℝⁿ is (a Cartesian product of n copies of ℝ), what a function is (a special kind of relation, which is a subset of a Cartesian product), and what a probability space is (a triple of a sample space, a set of events, and a measure).
Neither is sufficient alone. Logic without sets has nothing to reason about. Sets without logic have no way to draw conclusions. Together they form the complete foundation.
2. Naive Set Theory
2.1 What Is a Set?
A set is an unordered collection of distinct objects called elements (or members). This is the simplest and most fundamental concept in all of mathematics.
Membership notation:
- means " is an element of " (read: " is in ")
- means " is not an element of " (read: " is not in ")
Example:
Then , , , but , .
The fundamental principle - Axiom of Extensionality: A set is entirely determined by its members. Two sets are equal if and only if they have exactly the same elements:
This means:
- Order does not matter:
- Repetition does not matter: (sets have distinct elements)
Important distinction: Sets have no notion of order or multiplicity. If you need order, use a sequence (tuple). If you need multiplicity, use a multiset (see 2.6). In ML, training batches are typically ordered sequences, not sets - but the mathematical specification of "training data" is a multiset (same example can appear multiple times).
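Python's built-in `set` type obeys exactly these rules, so the distinctions above can be checked directly; a minimal sketch:

```python
# Sets in Python mirror mathematical sets: unordered, distinct elements.
A = {1, 2, 3}
B = {3, 1, 2}      # same elements, different order
C = {1, 1, 2, 3}   # duplicates collapse on construction

print(A == B)      # True: order does not matter (extensionality)
print(A == C)      # True: repetition does not matter
print(2 in A)      # True: membership test, 2 ∈ A
print(5 not in A)  # True: non-membership, 5 ∉ A

# When order matters, use a tuple (sequence) instead:
print((1, 2, 3) == (3, 1, 2))  # False
```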
2.2 Describing Sets
There are three standard ways to describe which elements belong to a set:
Roster notation (listing): Enumerate elements explicitly inside braces.
| Example | Description |
|---|---|
| {1, 2, 3, 4, 5} | Finite set with 5 elements |
| {a, b, c} | Set of three letters |
| {0, 2, 4, 6, ...} | Infinite set with clear pattern (use carefully) |
Set-builder notation (predicate): Describe the property that members satisfy.
{x : P(x)}
Read: "the set of all x such that P(x) holds."
| Set-builder | Meaning |
|---|---|
| {n ∈ ℤ : n is divisible by 2} | Even integers |
| {x ∈ ℝ : x > 0} | Positive real numbers |
| {t : t is a token produced by the tokeniser} | Vocabulary as a set |
| {(i, j) : j ≤ i} | Causal attention mask positions |
Parametric description: Define elements by explicit construction.
| Parametric | Meaning |
|---|---|
| {2k : k ∈ ℤ} | Even integers |
| {1/n : n ∈ ℤ⁺} | Reciprocals of positive integers |
| {Ax : x ∈ ℝⁿ} | Range of a linear map |
2.3 Special Sets
The following sets are used so frequently that they have special symbols:
| Symbol | Name | Elements |
|---|---|---|
| ∅ | Empty set | No elements |
| ℕ | Natural numbers | {0, 1, 2, 3, ...} (sometimes starting from 1) |
| ℤ | Integers | {..., -2, -1, 0, 1, 2, ...} |
| ℚ | Rationals | {p/q : p, q ∈ ℤ, q ≠ 0} |
| ℝ | Real numbers | All points on the number line |
| ℂ | Complex numbers | {a + bi : a, b ∈ ℝ} |
| U | Universal set | All objects under consideration (context-dependent) |
The containment chain:
ℕ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ
Each step adds new elements: ℤ adds negatives, ℚ adds fractions, ℝ adds irrationals (√2, π, e), ℂ adds imaginary numbers.
Critical distinctions with the empty set:
| Expression | Elements | Cardinality |
|---|---|---|
| ∅ | None | 0 |
| {∅} | One element: ∅ | 1 |
| {∅, {∅}} | Two elements: ∅ and {∅} | 2 |
Warning: ∅ ≠ {∅}. The empty set has no elements. The set containing the empty set has one element (which happens to be the empty set). This distinction is not pedantic - it is exactly how natural numbers are constructed from sets in ZFC (see 8.3).
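The ∅ versus {∅} distinction can be made concrete in Python; a minimal sketch (Python sets cannot contain mutable sets, so `frozenset` is used for the nesting):

```python
# ∅ vs {∅}: the empty set has no elements; the set containing it has one.
empty = frozenset()                  # ∅
contains_empty = frozenset({empty})  # {∅}

print(len(empty))               # 0
print(len(contains_empty))      # 1
print(empty == contains_empty)  # False: ∅ ≠ {∅}
print(empty in contains_empty)  # True:  ∅ ∈ {∅}
```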
2.4 Subset and Superset
Subset: A is a subset of B (written A ⊆ B) if every element of A is also an element of B:
A ⊆ B ⇔ ∀x (x ∈ A ⇒ x ∈ B)
Proper subset: A is a proper subset of B (written A ⊂ B or A ⊊ B) if A ⊆ B and A ≠ B - that is, B has at least one element not in A.
Superset: B ⊇ A means A ⊆ B (read: "B is a superset of A").
Properties of the subset relation:
| Property | Statement | Meaning |
|---|---|---|
| Reflexivity | A ⊆ A | Every set is a subset of itself |
| Antisymmetry | A ⊆ B and B ⊆ A ⇒ A = B | Standard proof technique for set equality |
| Transitivity | A ⊆ B and B ⊆ C ⇒ A ⊆ C | Subset chains compose |
| Empty set | ∅ ⊆ A for every set A | Vacuously true |
The antisymmetry property gives us the standard method for proving two sets are equal: show A ⊆ B (every element of A is in B) and B ⊆ A (every element of B is in A). This "double containment" proof is used constantly in mathematics.
Vacuous truth explanation: Why is ∅ ⊆ A true? The definition says: for every x, if x ∈ ∅ then x ∈ A. Since nothing is in ∅, the "if" clause is never satisfied, so the implication is vacuously true. There is no counterexample - we would need an x that is in ∅ but not in A, and no such x exists. See 5.2 for more on vacuous truth and implication.
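Python exposes the subset relation as comparison operators, and the double-containment technique can be checked mechanically; a minimal sketch (the sets X and Y are illustrative):

```python
A = {1, 2}
B = {1, 2, 3}

print(A <= B)      # True: A ⊆ B (subset)
print(A < B)       # True: A ⊂ B (proper subset)
print(set() <= A)  # True: ∅ ⊆ A, vacuously

# Double containment as an equality check: the same set described two ways.
X = {n for n in range(10) if n % 2 == 0}  # evens below 10, by predicate
Y = {2 * k for k in range(5)}             # evens below 10, parametrically
print(X <= Y and Y <= X)  # True: both containments hold
print(X == Y)             # True: hence the sets are equal
```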
2.5 Russell's Paradox
Naive comprehension (the principle that doomed early set theory): For any property P, there exists a set {x : P(x)}. This seems natural - "the set of all things with property P" - but it leads to disaster.
Russell's construction (1901): Define
R = {x : x ∉ x}
The set of all sets that don't contain themselves. Now ask: is R ∈ R?
Case 1: Assume R ∈ R. Then R satisfies its own membership condition, which says R ∉ R. Contradiction.
Case 2: Assume R ∉ R. Then R satisfies the condition x ∉ x, so by the definition of R, we have R ∈ R. Contradiction.
Either way we get a contradiction. Naive set theory is inconsistent.
The resolution: We cannot form "the set of all sets" or unrestricted collections. The axiom of Separation (in ZFC) restricts comprehension: you can only form {x ∈ A : P(x)} for an already-existing set A. You cannot form a set "from scratch" using just a property - you must separate from an existing set. This blocks Russell's paradox because there is no universal set of all sets from which to separate.
This is not merely a historical curiosity. Russell's paradox is the reason mathematics has axioms for set theory at all, and understanding it is prerequisite for understanding ZFC (8), type theory (9.4), and the limits of formal systems (12.1).
2.6 Multisets (Bags)
Standard sets require distinct elements: {a, a, b} = {a, b}. But in practice, we often need to track multiplicity - how many times each element appears.
A multiset (or bag) generalises sets by assigning each element a multiplicity (count). Formally, a multiset over a base set S is a function m: S → ℕ where m(x) gives the number of times x appears.
Notation: {a, a, b} as a multiset has m(a) = 2, m(b) = 1, and m(x) = 0 for every other x.
Multiset operations:
| Operation | Definition | Example ({a, a, b} with {a, b, b}) |
|---|---|---|
| Union (max) | m(x) = max(m₁(x), m₂(x)) | {a, a, b, b} |
| Sum (add) | m(x) = m₁(x) + m₂(x) | {a, a, a, b, b, b} |
| Intersection (min) | m(x) = min(m₁(x), m₂(x)) | {a, b} |
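These three multiset operations map directly onto Python's `collections.Counter`; a minimal sketch:

```python
from collections import Counter

A = Counter("aab")  # multiset {a, a, b}: m(a)=2, m(b)=1
B = Counter("abb")  # multiset {a, b, b}: m(a)=1, m(b)=2

print(A | B)  # union (max):        m(a)=2, m(b)=2
print(A + B)  # sum (add):          m(a)=3, m(b)=3
print(A & B)  # intersection (min): m(a)=1, m(b)=1
```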
AI applications of multisets:
- Training data: A dataset is a multiset of examples - the same document may appear multiple times (data repetition, which affects training dynamics)
- Token counting: Term frequency in TF-IDF and BM25 is a multiset operation - how many times each word appears matters
- Gradient accumulation: Sum of gradient multiset over a batch: each micro-batch contributes a gradient, and we sum (multiset addition)
- Batch statistics: BatchNorm computes mean and variance of a multiset of activations per channel
3. Set Operations
Set operations are the algebraic toolkit for combining, comparing, and transforming sets. Every one of these operations appears directly in modern AI systems: attention masks use complement and intersection, retrieval uses union and difference, and probability theory uses all of them.
3.1 Union
The union A ∪ B = {x : x ∈ A or x ∈ B} contains every element that belongs to at least one of the two sets. The "or" is inclusive - elements in both sets are included (once, since sets don't repeat).
Visual: Think of taking both sets and pouring them together. Any element from either set goes in.
Properties of union:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A ∪ B = B ∪ A | "or" is commutative |
| Associativity | (A ∪ B) ∪ C = A ∪ (B ∪ C) | Can chain without parentheses |
| Identity | A ∪ ∅ = A | Adding nothing changes nothing |
| Idempotence | A ∪ A = A | Duplicates collapse in sets |
| Domination | A ∪ U = U | Everything plus anything is everything |
Generalised union: For a collection of sets {Aᵢ : i ∈ I}:
⋃ᵢ Aᵢ = {x : x ∈ Aᵢ for some i ∈ I}
AI examples:
- Vocabulary merging: When combining tokenisers from different models, V = V₁ ∪ V₂
- Search results: "Documents matching query A OR query B" = D_A ∪ D_B
- Multi-dataset training: Training set = D₁ ∪ D₂ ∪ ... ∪ Dₖ
3.2 Intersection
The intersection A ∩ B = {x : x ∈ A and x ∈ B} contains only elements that belong to both sets simultaneously.
Properties of intersection:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A ∩ B = B ∩ A | "and" is commutative |
| Associativity | (A ∩ B) ∩ C = A ∩ (B ∩ C) | Can chain without parentheses |
| Identity | A ∩ U = A | Intersecting with everything keeps all |
| Idempotence | A ∩ A = A | Set intersected with itself is itself |
| Annihilation | A ∩ ∅ = ∅ | Intersecting with nothing gives nothing |
Disjoint sets: Two sets A and B are disjoint if A ∩ B = ∅ - they share no elements.
Generalised intersection: For a collection {Aᵢ : i ∈ I}:
⋂ᵢ Aᵢ = {x : x ∈ Aᵢ for every i ∈ I}
AI examples:
- Common tokens: Tokens shared by two vocabularies: V₁ ∩ V₂
- Data deduplication: Documents appearing in both train and test: D_train ∩ D_test (should be ∅ for valid evaluation!)
- Attention intersection: Tokens attended by both head A and head B
Distributive laws (union and intersection interact):
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
These mirror the distributive law of multiplication over addition (a(b + c) = ab + ac), which is no coincidence - Boolean algebra (5.7) makes these analogies precise.
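Union, intersection, and both distributive laws can be checked on concrete sets; a minimal sketch (the three sets are illustrative):

```python
A, B, C = {1, 2, 3}, {3, 4}, {4, 5}

print(A | B == {1, 2, 3, 4})  # True: union
print(A & B == {3})           # True: intersection

# Distributive laws, verified on these particular sets:
print(A & (B | C) == (A & B) | (A & C))  # True
print(A | (B & C) == (A | B) & (A | C))  # True
```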
3.3 Set Difference
The set difference A \ B = {x : x ∈ A and x ∉ B} contains the elements of A that are not in B. Think of "removing" B's elements from A.
Properties of set difference:
| Property | Statement | Meaning |
|---|---|---|
| Not commutative | A \ B ≠ B \ A | Not equal in general |
| Identity | A \ ∅ = A | Removing nothing changes nothing |
| Annihilation | A \ U = ∅ | Removing everything leaves nothing |
| Reduction | A \ B ⊆ A | Can only reduce the original set |
| Disjointness | (A \ B) ∩ B = ∅ | What remains shares nothing with what was removed |
Relationship to other operations:
A \ B = A ∩ Bᶜ
Difference is intersection with complement (this is often the most useful way to think about it).
AI examples:
- Vocabulary difference: Tokens in GPT-4's vocabulary but not LLaMA's: V_GPT-4 \ V_LLaMA
- Data filtering: Training data after removing test examples: D_train \ D_test
- Feature selection: Selected features = all features minus dropped features
3.4 Symmetric Difference
The symmetric difference A △ B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B) contains the elements in exactly one of the two sets - not in both.
Properties of symmetric difference:
| Property | Statement | Why |
|---|---|---|
| Commutativity | A △ B = B △ A | Definition is symmetric |
| Associativity | (A △ B) △ C = A △ (B △ C) | Can chain |
| Identity | A △ ∅ = A | XOR with nothing = original |
| Self-inverse | A △ A = ∅ | XOR with self cancels |
| Criterion | A △ B = ∅ ⇔ A = B | Zero difference means equal |
Algebraic structure: The symmetric difference operation makes the power set 𝒫(U) into an abelian group with ∅ as identity (each set is its own inverse). Combined with intersection as multiplication, (𝒫(U), △, ∩) forms a Boolean ring - connecting set theory to abstract algebra.
AI examples:
- Dataset drift detection: Tokens that changed between vocabulary v1 and v2: V₁ △ V₂
- Model comparison: Features used by model A XOR model B
- Edit distance: Symmetric difference of character sets is a crude string similarity measure
3.5 Complement
The complement Aᶜ = U \ A = {x ∈ U : x ∉ A} contains everything in the universal set U that is not in A. The complement depends on what U is (must specify context).
Properties of complement:
| Property | Statement |
|---|---|
| Double complement | (Aᶜ)ᶜ = A |
| Complement of U | Uᶜ = ∅ |
| Complement of ∅ | ∅ᶜ = U |
| Union with complement | A ∪ Aᶜ = U |
| Intersection with complement | A ∩ Aᶜ = ∅ |
AI example: An attention mask M selects positions to attend to. The complement Mᶜ gives the positions that are masked out (set to -∞ before softmax). In causal masking, the complement is the "future" positions that must not be attended to.
3.6 De Morgan's Laws for Sets
These are the most important algebraic identities for sets. They connect union and intersection through complementation:
First law - complement of union = intersection of complements:
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
"Not in A-or-B" is the same as "not in A and not in B."
Second law - complement of intersection = union of complements:
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
"Not in A-and-B" is the same as "not in A or not in B."
Proof of the first law (by double containment):
(⊆): Let x ∈ (A ∪ B)ᶜ. Then x ∉ A ∪ B, which means x ∉ A and x ∉ B (if x were in either, it would be in the union). Hence x ∈ Aᶜ and x ∈ Bᶜ, so x ∈ Aᶜ ∩ Bᶜ.
(⊇): Let x ∈ Aᶜ ∩ Bᶜ. Then x ∉ A and x ∉ B. Hence x is in neither set, so x ∉ A ∪ B, which means x ∈ (A ∪ B)ᶜ.
Both containments hold, so (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
Generalised De Morgan's Laws:
(⋃ᵢ Aᵢ)ᶜ = ⋂ᵢ Aᵢᶜ and (⋂ᵢ Aᵢ)ᶜ = ⋃ᵢ Aᵢᶜ
AI examples:
- Attention mask logic: "Not attending to position A or position B" = "not attending to A and not attending to B"
- Retrieval negation: "Documents not matching (query1 OR query2)" = "documents not matching query1 AND not matching query2"
- Filter composition: Negating a disjunctive filter becomes conjunctive
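Both laws can be verified on concrete finite sets, with complement taken relative to an explicit universal set; a minimal sketch (U, A, B are illustrative):

```python
U = set(range(10))  # universal set for this check - must be made explicit
A, B = {1, 2, 3}, {3, 4, 5}

def complement(S):
    """Complement relative to U: Sᶜ = U \\ S."""
    return U - S

# First law:  (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
print(complement(A | B) == complement(A) & complement(B))  # True
# Second law: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
print(complement(A & B) == complement(A) | complement(B))  # True
```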
3.7 Cartesian Product
The Cartesian product A × B = {(a, b) : a ∈ A, b ∈ B} is the set of all ordered pairs where the first element comes from A and the second from B. Unlike sets, order matters in pairs: (a, b) ≠ (b, a) unless a = b.
Cardinality: |A × B| = |A| · |B|
Not commutative: A × B ≠ B × A (different ordered pairs, unless A = B).
Higher products:
- A × B × C - ordered triples
- Aⁿ = A × A × ... × A - n-tuples of elements from A
AI examples - Cartesian products are everywhere:
| Expression | Meaning |
|---|---|
| ℝᵈ | d-dimensional embedding space |
| Vⁿ | All possible token sequences of length n |
| ℝᵈ × ℝᵈ | Domain of attention scores: (queries, keys) |
| {1, ..., n} × {1, ..., n} | All position pairs for attention matrix |
| ℝᵖ | Parameter space of model with p parameters |
Formal definition of ordered pair (Kuratowski): (a, b) = {{a}, {a, b}}. This reduces ordered pairs to pure sets - the pair is a set that encodes the order: the first coordinate is the element of the singleton {a}.
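The product and its cardinality rule can be checked with `itertools.product`; a minimal sketch (the two sets are illustrative):

```python
from itertools import product

A, B = {0, 1}, {'x', 'y'}

AxB = set(product(A, B))  # A × B as a set of ordered pairs
print(sorted(AxB))        # [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]
print(len(AxB) == len(A) * len(B))  # True: |A × B| = |A| · |B|

# Order matters inside pairs, so A × B ≠ B × A:
print(set(product(A, B)) == set(product(B, A)))  # False
```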
3.8 Power Set
The power set 𝒫(A) = {S : S ⊆ A} is the set of all subsets of A, including ∅ and A itself.
Cardinality: |𝒫(A)| = 2^|A| for finite A.
The reason: for each element of A, you independently choose "include" or "exclude" - this gives 2 choices for each of the |A| elements, hence 2^|A| subsets total.
Example:
| Set | 𝒫(A) | Count |
|---|---|---|
| {a, b, c} | {∅, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}} | 8 subsets (including ∅ and A) |
Cantor's theorem (critical): The power set is always strictly larger than the original set, even for infinite sets:
|A| < |𝒫(A)|
This means there are always more subsets than elements - a fact that generates the entire hierarchy of infinite cardinalities (see 10.4).
AI implications:
- If the vocabulary V has |V| tokens, then |𝒫(V)| = 2^|V| - the number of possible token subsets is astronomically larger than the vocabulary itself
- Structured generation constrains output to a subset of V at each step - it selects one element of 𝒫(V)
- The exponential blowup of power sets is why exhaustive search over subsets is intractable, driving the need for heuristic and approximate algorithms throughout AI
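The "combinations for each size" construction mentioned in the summary table below can be written out explicitly; a minimal sketch:

```python
from itertools import chain, combinations

def power_set(s):
    """All subsets of s as frozensets, built size by size (r = 0 .. |s|)."""
    items = list(s)
    return {frozenset(c)
            for c in chain.from_iterable(
                combinations(items, r) for r in range(len(items) + 1))}

P = power_set({'a', 'b', 'c'})
print(len(P))  # 8 = 2^3, confirming |𝒫(A)| = 2^|A|
print(frozenset() in P)                 # True: ∅ ∈ 𝒫(A)
print(frozenset({'a', 'b', 'c'}) in P)  # True: A ∈ 𝒫(A)
```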
3.9 Operations Summary Table
For quick reference, here is the complete table of set operations with their logical counterparts:
| Set Operation | Notation | Logic Equivalent | Python |
|---|---|---|---|
| Union | A ∪ B | ∨ (OR) | A \| B or A.union(B) |
| Intersection | A ∩ B | ∧ (AND) | A & B or A.intersection(B) |
| Difference | A \ B | ∧ ¬ (AND NOT) | A - B or A.difference(B) |
| Symmetric difference | A △ B | ⊕ (XOR) | A ^ B or A.symmetric_difference(B) |
| Complement | Aᶜ | ¬ (NOT) | U - A |
| Cartesian product | A × B | - | itertools.product(A, B) |
| Power set | 𝒫(A) | - | itertools.combinations for each size |
| Subset test | A ⊆ B | ⇒ (implication) | A <= B or A.issubset(B) |
4. Relations
Relations generalise the concept of "connection" between objects. Functions, orderings, equivalences, and even neural network computations are all special kinds of relations. Understanding relations precisely is what separates informal intuition from rigorous mathematics.
4.1 Binary Relations
A binary relation R from set A to set B is a subset of the Cartesian product:
R ⊆ A × B
If (a, b) ∈ R, we write aRb or R(a, b) and read "a is related to b by R."
A relation on A is a relation from A to itself: R ⊆ A × A.
Examples:
| Relation | Sets | Definition |
|---|---|---|
| ≤ on ℝ | ℝ × ℝ | (x, y) ∈ R iff x is less than or equal to y |
| Parent-of | Person × Person | (p, c) ∈ R iff p is parent of c |
| Token adjacency | V × V | (t₁, t₂) ∈ R iff t₁ can precede t₂ in valid text |
| Attention mask | {1, ..., n} × {1, ..., n} | (i, j) ∈ M iff query at position i attends to key at position j |
| Similarity | ℝᵈ × ℝᵈ | (u, v) ∈ R iff cos(u, v) exceeds a threshold |
A relation can be represented as:
- A set of pairs (the mathematical definition)
- A matrix (Boolean matrix M where Mᵢⱼ = 1 iff (aᵢ, aⱼ) ∈ R; this is the attention mask representation)
- A directed graph (nodes = elements; edge from a to b iff aRb)
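The set-of-pairs and Boolean-matrix representations are interchangeable; a minimal sketch using a causal mask on 4 positions (the j ≤ i convention is the standard decoder-style mask, assumed here):

```python
n = 4
# As a set of pairs: query position i may attend to key position j iff j ≤ i.
R = {(i, j) for i in range(n) for j in range(n) if j <= i}

# As a Boolean matrix - the attention-mask representation: M[i][j] = 1 iff (i, j) ∈ R.
M = [[1 if (i, j) in R else 0 for j in range(n)] for i in range(n)]
for row in M:
    print(row)  # lower-triangular: [1,0,0,0], [1,1,0,0], [1,1,1,0], [1,1,1,1]
```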
4.2 Properties of Relations
Let R be a relation on set A (i.e., R ⊆ A × A). There are six fundamental properties a relation may or may not have:
Reflexivity: ∀a ∈ A, aRa. Every element is related to itself.
- ✓ ≤ is reflexive (x ≤ x for all x)
- ✗ < is not reflexive (x < x never holds)
- AI: identity attention (token attends to itself) is a reflexive relation
Irreflexivity: ∀a ∈ A, ¬(aRa). No element is related to itself.
- ✓ < is irreflexive
- ✓ "parent-of" is irreflexive (no one is their own parent)
- A relation can be neither reflexive nor irreflexive (some self-loops, not all)
Symmetry: aRb ⇒ bRa. If a is related to b, then b is related to a.
- ✓ "is sibling of" is symmetric
- ✓ "has same length as" is symmetric
- ✗ ≤ is not symmetric (2 ≤ 3 but not 3 ≤ 2)
- AI: cosine similarity is symmetric; attention is NOT symmetric (query → key direction)
Antisymmetry: aRb and bRa ⇒ a = b. If a is related to b AND b is related to a, then a = b.
- ✓ ≤ is antisymmetric (x ≤ y and y ≤ x implies x = y)
- ✓ ⊆ on sets is antisymmetric
- Note: antisymmetry ≠ "not symmetric". A relation can be both symmetric and antisymmetric (e.g., equality)
Asymmetry: aRb ⇒ ¬(bRa). If a is related to b, then b is NOT related to a.
- ✓ < is asymmetric
- ✓ "parent-of" is asymmetric
- Asymmetry implies irreflexivity (if aRa held, asymmetry would require ¬(aRa))
Transitivity: aRb and bRc ⇒ aRc. If a is related to b and b is related to c, then a is related to c.
- ✓ ≤ is transitive
- ✓ "ancestor-of" is transitive
- ✗ "friend-of" is not transitive (my friend's friend need not be my friend)
- AI: causal mask transitivity - if position i attends to j and j attends to k, should i attend to k? In a causal mask, yes (by transitivity of ≤)
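For finite relations stored as sets of pairs, each property above is a direct quantifier check; a minimal sketch testing ≤ on a small set:

```python
def is_reflexive(R, A):
    return all((a, a) in R for a in A)

def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)

def is_antisymmetric(R):
    return all(a == b for (a, b) in R if (b, a) in R)

def is_transitive(R):
    # For every aRb and bRc, require aRc.
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

A = {1, 2, 3}
leq = {(a, b) for a in A for b in A if a <= b}  # ≤ restricted to A

print(is_reflexive(leq, A))   # True
print(is_symmetric(leq))      # False: 1 ≤ 2 but not 2 ≤ 1
print(is_antisymmetric(leq))  # True
print(is_transitive(leq))     # True
```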
4.3 Equivalence Relations
A relation R on A is an equivalence relation if it is:
- Reflexive: aRa for all a ∈ A
- Symmetric: aRb ⇒ bRa
- Transitive: aRb and bRc ⇒ aRc
Equivalence relations capture the idea of "sameness" or "interchangeability" - elements related by an equivalence relation are considered equivalent for the purpose at hand.
Examples of equivalence relations:
| Relation | Domain | Equivalence classes |
|---|---|---|
| Equality | Any set | Each class is a singleton |
| Same remainder mod n | ℤ | n classes: [0], [1], ..., [n-1] |
| Same length | Strings | Strings of equal length grouped together |
| Same BPE encoding | Character sequences | Sequences mapping to same token(s) |
| Same prediction | Inputs | Inputs producing identical model output |
Equivalence class of a:
[a] = {x ∈ A : xRa}
All elements equivalent to a form the equivalence class [a].
Partition theorem: An equivalence relation on A partitions A into disjoint equivalence classes:
- Every element belongs to exactly one equivalence class
- The union of all classes equals A: ⋃ₐ [a] = A
- Distinct classes are disjoint: [a] ≠ [b] ⇒ [a] ∩ [b] = ∅
Quotient set: A/R = {[a] : a ∈ A} - the set of all equivalence classes.
AI example - tokenisation as equivalence: BPE tokenisation defines an equivalence relation on character sequences. Two character sequences are "equivalent" if they merge to the same token. The equivalence classes are precisely the tokens. The vocabulary is the quotient set (restricted to the merge rules).
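The partition theorem can be seen concretely by grouping a finite set into classes; a minimal sketch using "same remainder mod 3" on {0, ..., 9}:

```python
from collections import defaultdict

A = range(10)
n = 3  # equivalence: x R y iff x ≡ y (mod 3)

# Each class [x] is determined by the canonical representative x mod n.
classes = defaultdict(set)
for x in A:
    classes[x % n].add(x)

print({r: sorted(c) for r, c in classes.items()})
# {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}

# Partition check: the classes cover A (and are pairwise disjoint by construction).
print(set().union(*classes.values()) == set(A))  # True
```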
4.4 Partial Orders
A relation R on A is a partial order if it is:
- Reflexive: aRa for all a ∈ A
- Antisymmetric: aRb and bRa ⇒ a = b
- Transitive: aRb and bRc ⇒ aRc
A partial order is written ≤ or ⪯ conventionally. The pair (A, ≤) is called a partially ordered set (poset).
"Partial" means: Not all pairs need be comparable. Some elements may be incomparable - neither a ≤ b nor b ≤ a holds.
Total order (linear order): A partial order where additionally
∀a, b ∈ A: a ≤ b or b ≤ a
All pairs are comparable. Total orders arrange elements in a single line.
Examples:
| Relation | Domain | Type | Incomparable? |
|---|---|---|---|
| ≤ on ℝ | Real numbers | Total order | No |
| Lexicographic order | Strings | Total order | No |
| ⊆ on 𝒫(S) | Power set | Partial order | Yes: {1} and {2} |
| Divisibility on ℤ⁺ | a R b iff a divides b | Partial order | Yes: 2 and 3 |
| Prefix relation | Strings; s R t iff s is a prefix of t | Partial order | Yes: "ab" and "ba" |
Hasse diagram: A graphical representation of a finite poset. Draw b above a with an edge if a < b and there is no element between them (no c with a < c < b). This gives a compact picture of the partial order structure.
AI connection - dependency parsing: In a syntactic parse tree, token dominates token if is a descendant of . This dominance relation is a partial order on sentence tokens. Similarly, in DAG-structured computation graphs, the execution order is a partial order on operations.
4.5 Functions as Relations
A function f: A → B is a special case of a relation - a relation f ⊆ A × B satisfying the condition that every element of A appears in exactly one pair:
∀a ∈ A, there is exactly one b ∈ B with (a, b) ∈ f
We write f(a) = b to mean (a, b) ∈ f.
Terminology:
| Term | Symbol | Meaning |
|---|---|---|
| Domain | A | Set of inputs |
| Codomain | B | Set of allowed outputs |
| Range (image) | f(A) = {f(a) : a ∈ A} | Actual outputs (f(A) ⊆ B) |
Types of functions:
| Type | Definition | Meaning |
|---|---|---|
| Injective (one-to-one) | f(a₁) = f(a₂) ⇒ a₁ = a₂ | Distinct inputs → distinct outputs |
| Surjective (onto) | ∀b ∈ B, ∃a ∈ A with f(a) = b | Every output is hit |
| Bijective | Injective and surjective | Perfect pairing between A and B |
AI examples - functions are the core abstraction:
| Function | Type Signature | Properties |
|---|---|---|
| Tokeniser | strings → V* | Not injective (multiple strings → same tokens) |
| Embedding | V → ℝᵈ | Injective (each token gets unique vector) |
| LM head | ℝᵈ → ℝ^\|V\| | Linear map (matrix multiplication) |
| Attention | ℝ^(n×d) → ℝ^(n×d) | Highly non-linear |
| Softmax | ℝⁿ → probability simplex | Maps onto the interior of the probability simplex |
| argmax | ℝ^\|V\| → V | Not injective (ties); not differentiable |
4.6 Composition and Inverse
Composition: Given f: A → B and g: B → C, the composition g ∘ f: A → C is defined by:
(g ∘ f)(x) = g(f(x))
Read right to left: "apply f first, then g."
Properties of composition:
- Associativity: h ∘ (g ∘ f) = (h ∘ g) ∘ f
- Identity: f ∘ id_A = f and id_B ∘ f = f, where id_A(x) = x
- Not commutative: g ∘ f ≠ f ∘ g in general
Inverse: If f: A → B is bijective, there exists a unique f⁻¹: B → A satisfying:
f⁻¹ ∘ f = id_A and f ∘ f⁻¹ = id_B
Only bijections have inverses. Injective-but-not-surjective functions have left inverses; surjective-but-not-injective functions have right inverses.
AI example - neural networks as function composition: A deep neural network is fundamentally a composition of functions:
f = f_L ∘ f_(L-1) ∘ ... ∘ f₁
Each fᵢ is one layer (linear map + activation). Deep learning is deeply compositional - the power comes from composing many simple functions. The chain rule of calculus (backpropagation) exploits this compositional structure: the derivative of the composition is the product of the derivatives of the layers.
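A toy sketch of this compositional structure, with scalar "layers" standing in for real linear maps and activations (the layer functions here are illustrative, not an actual network):

```python
def compose(*fs):
    """compose(f, g, h)(x) = f(g(h(x))) - applied right to left, like f ∘ g ∘ h."""
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

f1 = lambda x: 2 * x      # "layer 1": scale (stand-in for a linear map)
f2 = lambda x: x + 1      # "layer 2": shift (stand-in for a bias)
f3 = lambda x: max(x, 0)  # "layer 3": ReLU activation

net = compose(f3, f2, f1)  # net = f3 ∘ f2 ∘ f1
print(net(3))   # f3(f2(f1(3))) = max(2*3 + 1, 0) = 7
print(net(-3))  # max(2*(-3) + 1, 0) = 0

# Composition is associative but not commutative:
print(compose(f2, f1)(3))  # 2*3 + 1 = 7
print(compose(f1, f2)(3))  # 2*(3 + 1) = 8
```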