Part 1Math for LLMs

Sets And Logic: Part 1 - Intuition To 4 Relations

Mathematical Foundations / Sets And Logic

Concept Lesson
Beginner
26 min

Learning Objective

Understand 1 Intuition To 4 Relations well enough to explain it, recognize it in Math for LLMs, and apply it in a small task.

Why It Matters

1 Intuition To 4 Relations gives you the math vocabulary behind model behavior, optimization, and LLM reasoning.

IntuitionRelationsWhat Are Sets And LogicWhy Sets And Logic Matter For AIThe Hierarchy Of Mathematical Foundations
Private notes
0/8000

Notes stay private to your browser until account sync is configured.

Part 1
23 min read18 headingsSplit lesson page

Lesson overview | Lesson overview | Next part

Sets and Logic: Part 1: Intuition to 4. Relations

1. Intuition

1.1 What Are Sets and Logic?

Set theory is the mathematical language for talking about collections of objects; logic is the mathematical language for talking about truth and valid reasoning. Together they form the foundation of all mathematics: every other mathematical structure - numbers, functions, probability, linear algebra - is built on sets and governed by logic.

A set answers the question: what objects belong to this category?

Logic answers the question: given what we know, what can we validly conclude?

These are not abstract curiosities. When you tokenise a sentence, the vocabulary VV is a set. When you compute attention, the mask selects a subset of positions. When you train a model, the loss function maps from a set of parameters to real numbers. When you evaluate whether a model's output is "correct", you're applying logical predicates. Sets and logic are not prerequisites you study once and forget - they are the operational language of every equation in this curriculum.

For AI and LLMs specifically:

  • Sets formalise vocabularies, token sequences, and model outputs
  • Logic underpins reasoning, constraint satisfaction, knowledge representation, and formal verification of AI systems
  • Every probability distribution is defined over a set with a logical structure (sigma-algebra)
  • Neural network verification expresses safety properties as logical formulas

1.2 Why Sets and Logic Matter for AI

AI ConceptSet/Logic Foundation
TokenisationVocabulary VV is a set; the set of all token sequences VV^* is a formal language
ProbabilityA probability space is built on sets: events are subsets of sample space Ω\Omega
Neural network architectureEach layer is a function between sets (domain Rdin\mathbb{R}^{d_{in}} and codomain Rdout\mathbb{R}^{d_{out}})
Attention masksA mask selects a subset of positions; causal mask is a specific set relation {(i,j)ij}\{(i,j) \mid i \geq j\}
Structured generationConstrained decoding enforces logical constraints on output tokens at each step
Knowledge graphsEntities are sets; relations are functions between sets; reasoning is predicate logic
Formal verificationProving properties of AI systems requires predicate logic and SAT/SMT solvers
Training dataDeduplication is a set operation; a dataset is a multiset of examples
Retrieval (RAG)A query retrieves a subset of a corpus satisfying a relevance predicate
Type checkingTensor shape inference is type theory; preventing dimension mismatches

1.3 The Hierarchy of Mathematical Foundations

Every mathematical structure used in AI rests on a hierarchy, and sets and logic form the very bottom:

Layer 7: Machine Learning     (SGD, backprop, transformers)
Layer 6: Probability          (distributions, Bayes, entropy)
Layer 5: Analysis             (continuity, limits, derivatives)
Layer 4: Linear Algebra       (vectors, matrices, eigenvalues)
Layer 3: Algebra              (groups, rings, fields)
Layer 2: Arithmetic           (natural numbers, integers, reals)
Layer 1: Set Theory (ZFC)     (what objects exist? what collections?)
Layer 0: Logic                (what is valid reasoning? what follows from what?)

Each layer is defined in terms of the layers below. Natural numbers are constructed from sets (von Neumann ordinals). Real numbers are equivalence classes of Cauchy sequences of rationals. Vector spaces are sets with operations satisfying axioms stated in predicate logic. Probability measures are functions from sets to [0,1][0,1]. The entire edifice stands on sets and logic.

This is not historical accident - it is the result of 150 years of foundational work from Cantor, Frege, Russell, Hilbert, Godel, and Turing to make mathematics rigorous and unambiguous.

1.4 Formal vs Informal Mathematics

Informal mathematics is what most research papers use: proofs in natural language, intuitions supported by diagrams, appeals to "clearly" and "obviously". This is adequate for communication between mathematicians who share common training and conventions.

Formal mathematics is where every step is justified by an explicit rule, and proofs are machine-checkable. Nothing is taken for granted:

AspectInformalFormal
LanguageNatural language + notationFormal logic (FOL or type theory)
Proof verificationHuman reviewerMachine (proof assistant)
AmbiguityToleratedForbidden
ToolsPaper, LaTeXLean 4, Coq, Isabelle, Agda
ScaleResearch papersVerified libraries (Mathlib ~150K theorems)

Proof assistants encode logic and set theory precisely and verify proofs automatically. They are no longer academic curiosities:

  • Lean 4 (Microsoft Research): Mathlib library; used by AlphaProof (Google DeepMind, 2024) to win IMO gold medals
  • Coq (INRIA): CompCert verified C compiler; Four Colour Theorem formalised
  • Isabelle (Cambridge/Munich): seL4 verified operating system kernel
  • AI and formal mathematics: LLMs increasingly used to assist with formal proofs - Lean Copilot, Hypertree Proof Search, AlphaProof. Understanding sets and logic is prerequisite for understanding what LLMs are attempting when doing mathematical reasoning

1.5 Historical Timeline

The development of sets and logic spans 2,400 years. Here are the critical milestones:

YearPersonContribution
~350 BCEAristotleSyllogistic logic - first formal system of deductive reasoning
1679LeibnizCalculus ratiocinator - dream of universal logical calculus
1847BooleThe Mathematical Analysis of Logic - Boolean algebra; logic as algebra
1874CantorSet theory - infinite sets, cardinality, continuum hypothesis
1879FregeBegriffsschrift - first complete formal logic system; predicate calculus
1901RussellRussell's Paradox - naive set theory is inconsistent
1908ZermeloAxiomatic set theory - foundation for modern mathematics
1910-13Russell & WhiteheadPrincipia Mathematica - attempt to reduce all mathematics to logic
1922-30sFraenkel, SkolemZFC axioms completed - standard foundation of modern mathematics
1929GodelCompleteness theorem - every valid FOL formula has a proof
1931GodelIncompleteness theorems - no consistent system can prove all true arithmetic
1936Church & TuringLambda calculus and Turing machines - computability theory
1936TarskiFormal semantics - truth defined precisely for formal languages
1965ZadehFuzzy logic - graded truth values in [0,1][0,1]
1972Martin-LofDependent type theory - constructive foundations connecting logic to programming
2005-10-SAT solvers industrialised - automated reasoning becomes practical
2024Google DeepMindAlphaProof - neural theorem prover achieves IMO gold medal level

1.6 The Two Pillars

Mathematics rests on two pillars, and understanding their relationship is essential:

                        MATHEMATICS
                            |
        +-------------------+-------------------+
        |                                       |
   LOGIC                                  SET THEORY
   (What is valid reasoning?)             (What objects exist?)
        |                                       |
   Propositional Logic                    Naive Set Theory
        down                                       down
   Predicate Logic (FOL)                  Axiomatic (ZFC)
        down                                       down
   Modal / Temporal Logic                 Type Theory
        |                                       |
        +-------------------+-------------------+
                            |
              Both together: the language in which
              all of AI's mathematical foundations
                        are written

Logic provides the rules of valid inference - what follows from what. It tells you that if "all transformers have attention" and "GPT is a transformer", then "GPT has attention" - and it tells you exactly why this inference is valid (universal instantiation plus modus ponens).

Set theory provides the objects - the collections, the spaces, the structures. It tells you what Rd\mathbb{R}^d is (a Cartesian product of dd copies of R\mathbb{R}), what a function is (a special kind of relation, which is a subset of a Cartesian product), and what a probability space is (a triple of sets and a measure).

Neither is sufficient alone. Logic without sets has nothing to reason about. Sets without logic have no way to draw conclusions. Together they form the complete foundation.


2. Naive Set Theory

2.1 What Is a Set?

A set is an unordered collection of distinct objects called elements (or members). This is the simplest and most fundamental concept in all of mathematics.

Membership notation:

  • aAa \in A means "aa is an element of AA" (read: "aa is in AA")
  • bAb \notin A means "bb is not an element of AA" (read: "bb is not in AA")

Example:

A={1,2,3}A = \{1, 2, 3\}

Then 1A1 \in A, 2A2 \in A, 3A3 \in A, but 4A4 \notin A, 0A0 \notin A.

The fundamental principle - Axiom of Extensionality: A set is entirely determined by its members. Two sets are equal if and only if they have exactly the same elements:

A=B    x(xA    xB)A = B \iff \forall x\,(x \in A \iff x \in B)

This means:

  • Order does not matter: {1,2,3}={3,1,2}={2,3,1}\{1, 2, 3\} = \{3, 1, 2\} = \{2, 3, 1\}
  • Repetition does not matter: {1,1,2}={1,2}\{1, 1, 2\} = \{1, 2\} (sets have distinct elements)

Important distinction: Sets have no notion of order or multiplicity. If you need order, use a sequence (tuple). If you need multiplicity, use a multiset (see 2.6). In ML, training batches are typically ordered sequences, not sets - but the mathematical specification of "training data" is a multiset (same example can appear multiple times).

2.2 Describing Sets

There are three standard ways to describe which elements belong to a set:

Roster notation (listing): Enumerate elements explicitly inside braces.

ExampleDescription
{1,2,3,4,5}\{1, 2, 3, 4, 5\}Finite set with 5 elements
{a,b,c}\{a, b, c\}Set of three letters
{2,4,6,8,}\{2, 4, 6, 8, \ldots\}Infinite set with clear pattern (use carefully)

Set-builder notation (predicate): Describe the property that members satisfy.

A={xP(x)}orA={x:P(x)}A = \{x \mid P(x)\} \quad \text{or} \quad A = \{x : P(x)\}

Read: "the set of all xx such that P(x)P(x) holds."

Set-builderMeaning
{xZxmod2=0}\{x \in \mathbb{Z} \mid x \bmod 2 = 0\}Even integers
{xRx>0}\{x \in \mathbb{R} \mid x > 0\}Positive real numbers
{tΣt is a valid token}\{t \in \Sigma^* \mid t \text{ is a valid token}\}Vocabulary as a set
{(i,j)[n]2ij}\{(i,j) \in [n]^2 \mid i \geq j\}Causal attention mask positions

Parametric description: Define elements by explicit construction.

ParametricMeaning
{2kkZ}\{2k \mid k \in \mathbb{Z}\}Even integers
{1/nnN,n>0}\{1/n \mid n \in \mathbb{N},\, n > 0\}Reciprocals of positive integers
{WqxxRd}\{W_q x \mid x \in \mathbb{R}^d\}Range of a linear map WqW_q

2.3 Special Sets

The following sets are used so frequently that they have special symbols:

SymbolNameElements
={}\emptyset = \{\}Empty setNo elements
N\mathbb{N}Natural numbers{0,1,2,3,}\{0, 1, 2, 3, \ldots\} (sometimes starting from 1)
Z\mathbb{Z}Integers{,2,1,0,1,2,}\{\ldots, -2, -1, 0, 1, 2, \ldots\}
Q\mathbb{Q}Rationals{p/qp,qZ,q0}\{p/q \mid p, q \in \mathbb{Z},\, q \neq 0\}
R\mathbb{R}Real numbersAll points on the number line
C\mathbb{C}Complex numbers{a+bia,bR,i2=1}\{a + bi \mid a, b \in \mathbb{R},\, i^2 = -1\}
UUUniversal setAll objects under consideration (context-dependent)

The containment chain:

NZQRC\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}

Each step adds new elements: Z\mathbb{Z} adds negatives, Q\mathbb{Q} adds fractions, R\mathbb{R} adds irrationals (2\sqrt{2}, π\pi, ee), C\mathbb{C} adds imaginary numbers.

Critical distinctions with the empty set:

ExpressionElementsCardinality
={}\emptyset = \{\}None0
{}={{}}\{\emptyset\} = \{\{\}\}One element: \emptyset1
{,{}}\{\emptyset, \{\emptyset\}\}Two elements: \emptyset and {}\{\emptyset\}2

Warning: {}\emptyset \neq \{\emptyset\}. The empty set has no elements. The set containing the empty set has one element (which happens to be the empty set). This distinction is not pedantic - it is exactly how natural numbers are constructed from sets in ZFC (see 8.3).

2.4 Subset and Superset

Subset: AA is a subset of BB (written ABA \subseteq B) if every element of AA is also an element of BB:

AB    x(xA    xB)A \subseteq B \iff \forall x\,(x \in A \implies x \in B)

Proper subset: AA is a proper subset of BB (written ABA \subsetneq B or ABA \subset B) if ABA \subseteq B and ABA \neq B - that is, BB has at least one element not in AA.

Superset: ABA \supseteq B means BAB \subseteq A (read: "AA is a superset of BB").

Properties of the subset relation:

PropertyStatementMeaning
ReflexivityAAA \subseteq AEvery set is a subset of itself
AntisymmetryABA \subseteq B and BA    A=BB \subseteq A \implies A = BStandard proof technique for set equality
TransitivityABA \subseteq B and BC    ACB \subseteq C \implies A \subseteq CSubset chains compose
Empty setA\emptyset \subseteq A for every set AAVacuously true

The antisymmetry property gives us the standard method for proving two sets are equal: show ABA \subseteq B (every element of AA is in BB) and BAB \subseteq A (every element of BB is in AA). This "double containment" proof is used constantly in mathematics.

Vacuous truth explanation: Why is A\emptyset \subseteq A true? The definition says: for every xx, if xx \in \emptyset then xAx \in A. Since nothing is in \emptyset, the "if" clause is never satisfied, so the implication is vacuously true. There is no counterexample - we would need an xx that is in \emptyset but not in AA, and no such xx exists. See 5.2 for more on vacuous truth and implication.

2.5 Russell's Paradox

Naive comprehension (the principle that doomed early set theory): For any property PP, there exists a set {xP(x)}\{x \mid P(x)\}. This seems natural - "the set of all things with property PP" - but it leads to disaster.

Russell's construction (1901): Define

R={xxx}R = \{x \mid x \notin x\}

The set of all sets that don't contain themselves. Now ask: is RRR \in R?

Case 1: Assume RRR \in R. Then RR satisfies its own membership condition, which says RRR \notin R. Contradiction.

Case 2: Assume RRR \notin R. Then RR satisfies the condition xxx \notin x, so by the definition of RR, we have RRR \in R. Contradiction.

Either way we get a contradiction. Naive set theory is inconsistent.

The resolution: We cannot form "the set of all sets" or unrestricted collections. The axiom of Separation (in ZFC) restricts comprehension: you can only form {xAP(x)}\{x \in A \mid P(x)\} for an already-existing set AA. You cannot form a set "from scratch" using just a property - you must separate from an existing set. This blocks Russell's paradox because there is no universal set of all sets from which to separate.

This is not merely a historical curiosity. Russell's paradox is the reason mathematics has axioms for set theory at all, and understanding it is prerequisite for understanding ZFC (8), type theory (9.4), and the limits of formal systems (12.1).

2.6 Multisets (Bags)

Standard sets require distinct elements: {1,1,2}={1,2}\{1, 1, 2\} = \{1, 2\}. But in practice, we often need to track multiplicity - how many times each element appears.

A multiset (or bag) generalises sets by assigning each element a multiplicity (count). Formally, a multiset over a base set AA is a function m:AN0m: A \to \mathbb{N}_0 where m(a)m(a) gives the number of times aa appears.

Notation: {1,1,2,3,3,3}\{1, 1, 2, 3, 3, 3\} as a multiset has m(1)=2m(1) = 2, m(2)=1m(2) = 1, m(3)=3m(3) = 3.

Multiset operations:

OperationDefinitionExample
Union (max)(AB)(x)=max(mA(x),mB(x))(A \uplus B)(x) = \max(m_A(x), m_B(x)){1,1,2}{1,2,2}={1,1,2,2}\{1,1,2\} \uplus \{1,2,2\} = \{1,1,2,2\}
Sum (add)(A+B)(x)=mA(x)+mB(x)(A + B)(x) = m_A(x) + m_B(x){1,1,2}+{1,2,2}={1,1,1,2,2,2}\{1,1,2\} + \{1,2,2\} = \{1,1,1,2,2,2\}
Intersection (min)(AB)(x)=min(mA(x),mB(x))(A \cap B)(x) = \min(m_A(x), m_B(x)){1,1,2}{1,2,2}={1,2}\{1,1,2\} \cap \{1,2,2\} = \{1,2\}

AI applications of multisets:

  • Training data: A dataset is a multiset of examples - the same document may appear multiple times (data repetition, which affects training dynamics)
  • Token counting: Term frequency in TF-IDF and BM25 is a multiset operation - how many times each word appears matters
  • Gradient accumulation: Sum of gradient multiset over a batch: each micro-batch contributes a gradient, and we sum (multiset addition)
  • Batch statistics: BatchNorm computes mean and variance of a multiset of activations per channel

3. Set Operations

Set operations are the algebraic toolkit for combining, comparing, and transforming sets. Every one of these operations appears directly in modern AI systems: attention masks use complement and intersection, retrieval uses union and difference, and probability theory uses all of them.

3.1 Union

AB={xxA or xB}A \cup B = \{x \mid x \in A \text{ or } x \in B\}

The union contains every element that belongs to at least one of the two sets. The "or" is inclusive - elements in both sets are included (once, since sets don't repeat).

Visual: Think of taking both sets and pouring them together. Any element from either set goes in.

Properties of union:

PropertyStatementWhy
CommutativityAB=BAA \cup B = B \cup A"or" is commutative
Associativity(AB)C=A(BC)(A \cup B) \cup C = A \cup (B \cup C)Can chain without parentheses
IdentityA=AA \cup \emptyset = AAdding nothing changes nothing
IdempotenceAA=AA \cup A = ADuplicates collapse in sets
DominationAU=UA \cup U = UEverything plus anything is everything

Generalised union: For a collection of sets {Ai}iI\{A_i\}_{i \in I}:

iIAi={xxAi for some iI}\bigcup_{i \in I} A_i = \{x \mid x \in A_i \text{ for some } i \in I\}

AI examples:

  • Vocabulary merging: When combining tokenisers from different models, Vmerged=V1V2V_{\text{merged}} = V_1 \cup V_2
  • Search results: "Documents matching query A OR query B" = Results(A)Results(B)\text{Results}(A) \cup \text{Results}(B)
  • Multi-dataset training: Training set = D1D2DkD_1 \cup D_2 \cup \ldots \cup D_k

3.2 Intersection

AB={xxA and xB}A \cap B = \{x \mid x \in A \text{ and } x \in B\}

The intersection contains only elements that belong to both sets simultaneously.

Properties of intersection:

PropertyStatementWhy
CommutativityAB=BAA \cap B = B \cap A"and" is commutative
Associativity(AB)C=A(BC)(A \cap B) \cap C = A \cap (B \cap C)Can chain without parentheses
IdentityAU=AA \cap U = AIntersecting with everything keeps all
IdempotenceAA=AA \cap A = ASet intersected with itself is itself
AnnihilationA=A \cap \emptyset = \emptysetIntersecting with nothing gives nothing

Disjoint sets: Two sets AA and BB are disjoint if AB=A \cap B = \emptyset - they share no elements.

Generalised intersection: For a collection {Ai}iI\{A_i\}_{i \in I}:

iIAi={xxAi for all iI}\bigcap_{i \in I} A_i = \{x \mid x \in A_i \text{ for all } i \in I\}

AI examples:

  • Common tokens: Tokens shared by two vocabularies: V1V2V_1 \cap V_2
  • Data deduplication: Documents appearing in both train and test: DtrainDtestD_{\text{train}} \cap D_{\text{test}} (should be \emptyset for valid evaluation!)
  • Attention intersection: Tokens attended by both head A and head B

Distributive laws (union and intersection interact):

A(BC)=(AB)(AC)A \cap (B \cup C) = (A \cap B) \cup (A \cap C) A(BC)=(AB)(AC)A \cup (B \cap C) = (A \cup B) \cap (A \cup C)

These mirror the distributive laws of multiplication over addition (a(b+c)=ab+aca \cdot (b + c) = ab + ac), which is no coincidence - Boolean algebra (5.7) makes these analogies precise.

3.3 Set Difference

AB=AB={xxA and xB}A \setminus B = A - B = \{x \mid x \in A \text{ and } x \notin B\}

Elements of AA that are not in BB. Think of "removing" BB's elements from AA.

Properties of set difference:

PropertyStatement
Not commutativeABBAA \setminus B \neq B \setminus A in general
A=AA \setminus \emptyset = ARemoving nothing changes nothing
AA=A \setminus A = \emptysetRemoving everything leaves nothing
ABAA \setminus B \subseteq ACan only reduce the original set
(AB)B=(A \setminus B) \cap B = \emptysetWhat remains shares nothing with what was removed

Relationship to other operations:

AB=ABˉA \setminus B = A \cap \bar{B}

Difference is intersection with complement (this is often the most useful way to think about it).

AI examples:

  • Vocabulary difference: Tokens in GPT-4's vocabulary but not LLaMA's: VGPT4VLLaMAV_{\text{GPT4}} \setminus V_{\text{LLaMA}}
  • Data filtering: Training data after removing test examples: DtrainDtestD_{\text{train}} \setminus D_{\text{test}}
  • Feature selection: Selected features = all features minus dropped features

3.4 Symmetric Difference

AB=(AB)(BA)=(AB)(AB)A \triangle B = (A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B)

Elements in exactly one of the two sets - not in both.

Properties of symmetric difference:

PropertyStatementWhy
CommutativityAB=BAA \triangle B = B \triangle ADefinition is symmetric
Associativity(AB)C=A(BC)(A \triangle B) \triangle C = A \triangle (B \triangle C)Can chain
IdentityA=AA \triangle \emptyset = AXOR with nothing = original
Self-inverseAA=A \triangle A = \emptysetXOR with self cancels
CriterionAB=    A=BA \triangle B = \emptyset \iff A = BZero difference means equal

Algebraic structure: The symmetric difference operation makes the power set P(U)\mathcal{P}(U) into an abelian group with \emptyset as identity. Combined with intersection as multiplication, (P(U),,)(\mathcal{P}(U), \triangle, \cap) forms a Boolean ring - connecting set theory to abstract algebra.

AI examples:

  • Dataset drift detection: Tokens that changed between vocabulary v1 and v2: V1V2V_1 \triangle V_2
  • Model comparison: Features used by model A XOR model B
  • Edit distance: Symmetric difference of character sets is a crude string similarity measure

3.5 Complement

Aˉ=A=UA={xUxA}\bar{A} = A' = U \setminus A = \{x \in U \mid x \notin A\}

Everything in the universal set UU that is not in AA. The complement depends on what UU is (must specify context).

Properties of complement:

PropertyStatement
Double complementAˉ=A\overline{\bar{A}} = A
Complement of \emptysetˉ=U\bar{\emptyset} = U
Complement of UUUˉ=\bar{U} = \emptyset
Union with complementAAˉ=UA \cup \bar{A} = U
Intersection with complementAAˉ=A \cap \bar{A} = \emptyset

AI example: An attention mask MM selects positions to attend to. The complement Mˉ\bar{M} gives the positions that are masked out (set to -\infty before softmax). In causal masking, the complement is the "future" positions that must not be attended to.

3.6 De Morgan's Laws for Sets

These are the most important algebraic identities for sets. They connect union and intersection through complementation:

First law - complement of union = intersection of complements:

AB=AˉBˉ\overline{A \cup B} = \bar{A} \cap \bar{B}

"Not in A-or-B" is the same as "not in A and not in B."

Second law - complement of intersection = union of complements:

AB=AˉBˉ\overline{A \cap B} = \bar{A} \cup \bar{B}

"Not in A-and-B" is the same as "not in A or not in B."

Proof of the first law (by double containment):

(\subseteq): Let xABx \in \overline{A \cup B}. Then xABx \notin A \cup B, which means xAx \notin A and xBx \notin B (if xx were in either, it would be in the union). Hence xAˉx \in \bar{A} and xBˉx \in \bar{B}, so xAˉBˉx \in \bar{A} \cap \bar{B}.

(\supseteq): Let xAˉBˉx \in \bar{A} \cap \bar{B}. Then xAx \notin A and xBx \notin B. Hence xx is not in AA or BB, so xABx \notin A \cup B, which means xABx \in \overline{A \cup B}.

Both containments hold, so AB=AˉBˉ\overline{A \cup B} = \bar{A} \cap \bar{B}. \square

Generalised De Morgan's Laws:

iAi=iAˉiiAi=iAˉi\overline{\bigcup_i A_i} = \bigcap_i \bar{A}_i \qquad \qquad \overline{\bigcap_i A_i} = \bigcup_i \bar{A}_i

AI examples:

  • Attention mask logic: "Not attending to position A or position B" = "not attending to A and not attending to B"
  • Retrieval negation: "Documents not matching (query1 OR query2)" = "documents not matching query1 AND not matching query2"
  • Filter composition: Negating a disjunctive filter becomes conjunctive

3.7 Cartesian Product

A×B={(a,b)aA,bB}A \times B = \{(a, b) \mid a \in A,\, b \in B\}

The set of all ordered pairs where the first element comes from AA and the second from BB. Unlike sets, order matters in pairs: (a,b)(b,a)(a, b) \neq (b, a) unless a=ba = b.

Cardinality: A×B=A×B|A \times B| = |A| \times |B|

Not commutative: A×BB×AA \times B \neq B \times A (different ordered pairs, unless A=BA = B).

Higher products:

  • A×B×C={(a,b,c)aA,bB,cC}A \times B \times C = \{(a, b, c) \mid a \in A,\, b \in B,\, c \in C\} - ordered triples
  • An=A×A××An timesA^n = \underbrace{A \times A \times \cdots \times A}_{n \text{ times}} - nn-tuples of elements from AA

AI examples - Cartesian products are everywhere:

ExpressionMeaning
Rd=R××Rd\mathbb{R}^d = \underbrace{\mathbb{R} \times \cdots \times \mathbb{R}}_{d}dd-dimensional embedding space
VnV^nAll possible token sequences of length nn
Rn×dk×Rn×dk\mathbb{R}^{n \times d_k} \times \mathbb{R}^{n \times d_k}Domain of attention: (queries, keys)
[n]×[n][n] \times [n]All position pairs for attention matrix
Θ=Rp\Theta = \mathbb{R}^pParameter space of model with pp parameters

Formal definition of ordered pair (Kuratowski): (a,b)={{a},{a,b}}(a, b) = \{\{a\}, \{a, b\}\}. This reduces ordered pairs to pure sets - the pair (a,b)(a, b) is a set that encodes the order by treating aa as the "singleton" element.

3.8 Power Set

P(A)={BBA}\mathcal{P}(A) = \{B \mid B \subseteq A\}

The set of all subsets of AA, including \emptyset and AA itself.

Cardinality: P(A)=2A|\mathcal{P}(A)| = 2^{|A|} for finite AA.

The reason: for each element of AA, you independently choose "include" or "exclude" - this gives 22 choices for each of the A|A| elements, hence 2A2^{|A|} subsets total.

Examples:

Set AAP(A)\mathcal{P}(A)P(A)\|\mathcal{P}(A)\|
\emptyset{}\{\emptyset\}20=12^0 = 1
{1}\{1\}{,{1}}\{\emptyset, \{1\}\}21=22^1 = 2
{1,2}\{1, 2\}{,{1},{2},{1,2}}\{\emptyset, \{1\}, \{2\}, \{1,2\}\}22=42^2 = 4
{1,2,3}\{1, 2, 3\}8 subsets (including \emptyset and {1,2,3}\{1,2,3\})23=82^3 = 8

Cantor's theorem (critical): The power set is always strictly larger than the original set, even for infinite sets:

A<P(A)for all sets A|A| < |\mathcal{P}(A)| \quad \text{for all sets } A

This means there are always more subsets than elements - a fact that generates the entire hierarchy of infinite cardinalities (see 10.4).

AI implications:

  • If vocabulary V=32,000|V| = 32{,}000, then P(V)=232,000|\mathcal{P}(V)| = 2^{32{,}000} - the number of possible token subsets is astronomically larger than the vocabulary itself
  • Structured generation constrains output to a subset of VV at each step - it selects one element of P(V)\mathcal{P}(V)
  • The exponential blowup of power sets is why exhaustive search over subsets is intractable, driving the need for heuristic and approximate algorithms throughout AI

3.9 Operations Summary Table

For quick reference, here is the complete table of set operations with their logical counterparts:

Set OperationNotationLogic EquivalentPython
UnionABA \cup B\vee (OR)A | B or A.union(B)
IntersectionABA \cap B\wedge (AND)A & B or A.intersection(B)
DifferenceABA \setminus B¬\wedge \neg (AND NOT)A - B or A.difference(B)
Symmetric differenceABA \triangle B\oplus (XOR)A ^ B or A.symmetric_difference(B)
ComplementAˉ\bar{A}¬\neg (NOT)U - A
Cartesian productA×BA \times B-itertools.product(A, B)
Power setP(A)\mathcal{P}(A)-itertools.combinations for each size
Subset testABA \subseteq B    \impliesA <= B or A.issubset(B)

4. Relations

Relations generalise the concept of "connection" between objects. Functions, orderings, equivalences, and even neural network computations are all special kinds of relations. Understanding relations precisely is what separates informal intuition from rigorous mathematics.

4.1 Binary Relations

A binary relation RR from set AA to set BB is a subset of the Cartesian product:

RA×BR \subseteq A \times B

If (a,b)R(a, b) \in R, we write aRbaRb or R(a,b)R(a, b) and read "aa is related to bb by RR."

A relation on AA is a relation from AA to itself: RA×AR \subseteq A \times A.

Examples:

RelationSetsDefinition
\leq on R\mathbb{R}R×R\mathbb{R} \times \mathbb{R}(x,y)(x, y) \in \leq iff xx is less than or equal to yy
Parent-ofPerson ×\times Person(p,c)(p, c) iff pp is parent of cc
Token adjacencyV×VV \times V(t1,t2)R(t_1, t_2) \in R iff t1t_1 can precede t2t_2 in valid text
Attention mask[n]×[n][n] \times [n](i,j)mask(i, j) \in \text{mask} iff query at position ii attends to key at position jj
SimilarityV×VV \times V(w1,w2)R(w_1, w_2) \in R iff cos(ew1,ew2)>τ\cos(e_{w_1}, e_{w_2}) > \tau

A relation can be represented as:

  • A set of pairs (the mathematical definition)
  • A matrix (Boolean matrix MM where Mij=1M_{ij} = 1 iff iRjiRj; this is the attention mask representation)
  • A directed graph (nodes = elements; edge from aa to bb iff aRbaRb)

4.2 Properties of Relations

Let RR be a relation on set AA (i.e., RA×AR \subseteq A \times A). There are six fundamental properties a relation may or may not have:

Reflexivity: Every element is related to itself.

aA:(a,a)R\forall a \in A:\, (a, a) \in R
  • OK \leq is reflexive (xxx \leq x for all xx)
  • NO << is not reflexive (xxx \not< x)
  • AI: identity attention (token attends to itself) is a reflexive relation

Irreflexivity: No element is related to itself.

aA:(a,a)R\forall a \in A:\, (a, a) \notin R
  • OK << is irreflexive
  • OK "parent-of" is irreflexive (no one is their own parent)
  • A relation can be neither reflexive nor irreflexive (some self-loops, not all)

Symmetry: If aa is related to bb, then bb is related to aa.

a,bA:(a,b)R    (b,a)R\forall a, b \in A:\, (a, b) \in R \implies (b, a) \in R
  • OK "is sibling of" is symmetric
  • OK "has same length as" is symmetric
  • NO \leq is not symmetric (353 \leq 5 but 5≰35 \not\leq 3)
  • AI: cosine similarity is symmetric; attention is NOT symmetric (query->key direction)

Antisymmetry: If aa is related to bb AND bb is related to aa, then a=ba = b.

a,bA:(a,b)R(b,a)R    a=b\forall a, b \in A:\, (a, b) \in R \wedge (b, a) \in R \implies a = b
  • OK \leq is antisymmetric (xyx \leq y and yxy \leq x implies x=yx = y)
  • OK \subseteq on sets is antisymmetric
  • Note: antisymmetry \neq "not symmetric". A relation can be both symmetric and antisymmetric (e.g., equality)

Asymmetry: If aa is related to bb, then bb is NOT related to aa.

a,bA:(a,b)R    (b,a)R\forall a, b \in A:\, (a, b) \in R \implies (b, a) \notin R
  • OK << is asymmetric
  • OK "parent-of" is asymmetric
  • Asymmetry implies irreflexivity (if RR were reflexive, (a,a)R(a, a) \in R would require (a,a)R(a, a) \notin R)

Transitivity: If aa is related to bb and bb is related to cc, then aa is related to cc.

a,b,cA:(a,b)R(b,c)R    (a,c)R\forall a, b, c \in A:\, (a, b) \in R \wedge (b, c) \in R \implies (a, c) \in R
  • OK \leq is transitive
  • OK "ancestor-of" is transitive
  • NO "friend-of" is not transitive (my friend's friend need not be my friend)
  • AI: causal mask transitivity - if position ii attends to jj and jj attends to kk, should ii attend to kk? In a causal mask, yes (by transitivity of \geq)

4.3 Equivalence Relations

A relation RR on AA is an equivalence relation if it is:

  1. Reflexive: aRaaRa for all aa
  2. Symmetric: aRb    bRaaRb \implies bRa
  3. Transitive: aRbbRc    aRcaRb \wedge bRc \implies aRc

Equivalence relations capture the idea of "sameness" or "interchangeability" - elements related by an equivalence relation are considered equivalent for the purpose at hand.

Examples of equivalence relations:

RelationDomainEquivalence classes
Equality ==Any setEach class is a singleton {a}\{a\}
Same remainder mod nnZ\mathbb{Z}nn classes: [0],[1],,[n1][0], [1], \ldots, [n-1]
Same lengthStringsStrings of equal length grouped together
Same BPE encodingCharacter sequencesSequences mapping to same token(s)
Same predictionInputsInputs producing identical model output

Equivalence class of aa:

[a]={bAaRb}[a] = \{b \in A \mid aRb\}

All elements equivalent to aa form the equivalence class [a][a].

Partition theorem: An equivalence relation on AA partitions AA into disjoint equivalence classes:

  • Every element belongs to exactly one equivalence class
  • The union of all classes equals AA: a[a]=A\bigcup_a [a] = A
  • Distinct classes are disjoint: [a][b]    [a][b]=[a] \neq [b] \implies [a] \cap [b] = \emptyset

Quotient set: A/R={[a]aA}A / R = \{[a] \mid a \in A\} - the set of all equivalence classes.

AI example - tokenisation as equivalence: BPE tokenisation defines an equivalence relation on character sequences. Two character sequences are "equivalent" if they merge to the same token. The equivalence classes are precisely the tokens. The vocabulary VV is the quotient set Σ/BPE\Sigma^* / \sim_{\text{BPE}} (restricted to the merge rules).

4.4 Partial Orders

A relation RR on AA is a partial order if it is:

  1. Reflexive: aRaaRa for all aa
  2. Antisymmetric: aRbbRa    a=baRb \wedge bRa \implies a = b
  3. Transitive: aRbbRc    aRcaRb \wedge bRc \implies aRc

A partial order is written \leq or \preceq conventionally. The pair (A,)(A, \leq) is called a partially ordered set (poset).

"Partial" means: Not all pairs need be comparable. Some elements may be incomparable - neither aba \leq b nor bab \leq a holds.

Total order (linear order): A partial order where additionally

a,bA:aRb or bRa\forall a, b \in A:\, aRb \text{ or } bRa

All pairs are comparable. Total orders arrange elements in a single line.

Examples:

RelationDomainTypeIncomparable?
\leq on R\mathbb{R}Real numbersTotal orderNo
Lexicographic on stringsΣ\Sigma^*Total orderNo
\subseteq on P(A)\mathcal{P}(A)Power setPartial orderYes: {1}\{1\} and {2}\{2\}
Divisibility on N\mathbb{N}aba \mid b iff aa divides bbPartial orderYes: 22 and 33
Prefix relationsts \leq t iff ss is prefix of ttPartial orderYes: "ab" and "ba"

Hasse diagram: A graphical representation of a finite poset. Draw aa above bb with an edge if a>ba > b and there is no element between them (no element cc with a>c>ba > c > b). This gives a compact picture of the partial order structure.

AI connection - dependency parsing: In a syntactic parse tree, token ii dominates token jj if jj is a descendant of ii. This dominance relation is a partial order on sentence tokens. Similarly, in DAG-structured computation graphs, the execution order is a partial order on operations.

4.5 Functions as Relations

A function f:ABf: A \to B is a special case of a relation - a relation fA×Bf \subseteq A \times B satisfying the condition that every element of AA appears in exactly one pair:

aA:!bB:(a,b)f\forall a \in A:\, \exists!\, b \in B:\, (a, b) \in f

We write f(a)=bf(a) = b to mean (a,b)f(a, b) \in f.

Terminology:

TermSymbolMeaning
Domaindom(f)=A\text{dom}(f) = ASet of inputs
Codomaincod(f)=B\text{cod}(f) = BSet of allowed outputs
Range (image)ran(f)={f(a)aA}\text{ran}(f) = \{f(a) \mid a \in A\}Actual outputs (B\subseteq B)

Types of functions:

TypeDefinitionMeaning
Injective (one-to-one)f(a1)=f(a2)    a1=a2f(a_1) = f(a_2) \implies a_1 = a_2Distinct inputs -> distinct outputs
Surjective (onto)ran(f)=B\text{ran}(f) = BEvery output is hit
BijectiveInjective and surjectivePerfect pairing between AA and BB

AI examples - functions are the core abstraction:

FunctionType SignatureProperties
TokeniserΣV\Sigma^* \to V^*Not injective (multiple strings -> same tokens)
EmbeddingVRdV \to \mathbb{R}^dInjective (each token gets unique vector)
LM headRdRV\mathbb{R}^d \to \mathbb{R}^{\vert V \vert}Linear map (matrix multiplication)
AttentionRn×d×Rn×d×Rn×dRn×d\mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}Highly non-linear
SoftmaxRkΔk1\mathbb{R}^k \to \Delta^{k-1}Surjective onto probability simplex
argmaxRk[k]\mathbb{R}^k \to [k]Not injective (ties); not differentiable

4.6 Composition and Inverse

Composition: Given f:ABf: A \to B and g:BCg: B \to C, the composition gf:ACg \circ f: A \to C is defined by:

(gf)(a)=g(f(a))(g \circ f)(a) = g(f(a))

Read right to left: "apply ff first, then gg."

Properties of composition:

  • Associativity: h(gf)=(hg)fh \circ (g \circ f) = (h \circ g) \circ f
  • Identity: fidA=ff \circ \text{id}_A = f and idBf=f\text{id}_B \circ f = f, where idA(a)=a\text{id}_A(a) = a
  • Not commutative: gffgg \circ f \neq f \circ g in general

Inverse: If f:ABf: A \to B is bijective, there exists a unique f1:BAf^{-1}: B \to A satisfying:

f1(f(a))=aandf(f1(b))=bf^{-1}(f(a)) = a \quad \text{and} \quad f(f^{-1}(b)) = b

Only bijections have inverses. Injective-but-not-surjective functions have left inverses; surjective-but-not-injective functions have right inverses.

AI example - neural networks as function composition: A deep neural network is fundamentally a composition of functions:

y^=fLfL1f2f1(x)\hat{y} = f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1(x)

Each fif_i is one layer (linear map + activation). Deep learning is deeply compositional - the power comes from composing many simple functions. The chain rule of calculus (backpropagation) exploits this compositional structure: x(gf)=gffx\frac{\partial}{\partial x}(g \circ f) = \frac{\partial g}{\partial f} \cdot \frac{\partial f}{\partial x}.


Skill Check

Test this lesson

Answer 4 quick questions to lock in the lesson and feed your adaptive practice queue.

--
Score
0/4
Answered
Not attempted
Status
1

Which module does this lesson belong to?

2

Which section is covered in this lesson content?

3

Which term is most central to this lesson?

4

What is the best way to use this lesson for real learning?

Your answers save locally first, then sync when account storage is available.
Practice queue