Random Graphs, Part 1: From Intuition to Graphons

1. Intuition

1.1 What Is a Random Graph?

A random graph is not a single graph but a probability distribution over graphs. When we write G \sim G(n, p), we mean: take n labeled vertices and independently include each of the \binom{n}{2} possible edges with probability p. The specific graph G is a sample from this distribution.

This is a powerful abstraction. Real-world networks - social graphs, citation networks, protein-protein interaction maps, the web - cannot be written down in closed form. They arise from complex processes involving millions of actors. Random graph models capture the statistical regularity of these networks without requiring knowledge of every individual edge.

Key insight: Many properties of real-world networks can be characterized by just a few parameters (average degree, clustering coefficient, community structure), and random graph models are the mathematical objects that interpolate between these summary statistics and the full combinatorial structure of the graph.

For AI and ML, random graphs matter in at least three ways:

  1. Benchmark generation: We generate synthetic graphs from random models to test GNN algorithms across controlled conditions (the GraphWorld framework uses SBM).
  2. Graph generation models: GRAN, GDSS, and DiGress learn to sample from distributions over graphs - essentially learning a graphon.
  3. Network analysis: Training data graphs (social networks, citation networks, knowledge graphs) have structure that random graph models help characterize and exploit.

1.2 Why Random Graphs Matter for AI

The connection between random graphs and modern AI is deeper than benchmark generation:

Transformer attention is a random graph. In a language model with nn tokens and sparse attention, the attention pattern defines a random-looking bipartite graph. The connectivity of this graph determines information flow - whether distant tokens can influence each other and how quickly information propagates. Results from random graph theory (connectivity thresholds, small-world phenomena) directly predict when transformers can or cannot capture long-range dependencies.

GNN expressiveness depends on graph structure. The WL hierarchy and GIN's expressiveness guarantees depend on what graphs the model will see. If training graphs are sampled from an SBM, the GNN faces a specific community detection problem that has known information-theoretic limits - limits that bound what no GNN can achieve regardless of architecture.

Overparameterized networks are sparse graphs. The lottery ticket hypothesis says that dense networks contain sparse subnetworks (tickets) that train just as well. These sparse subnetworks have random-graph-like structure: they appear near the percolation threshold of the dense network.

Graphons = graph neural network limits. The mathematical theory of graphons (limit objects of dense graph sequences) is the same theory that gives GNNs their universality results. A GNN that is continuous in the cut metric topology of graphons is provably expressive on all graphs in the same graphon equivalence class.

1.3 The Four Canonical Models

THE FOUR CANONICAL RANDOM GRAPH MODELS
========================================================================

  Model           Parameters          Key property
  ---------------------------------------------------------------------
  Erdos-Renyi     n, p (or n, m)      Phase transition at p = 1/n
  G(n,p)          Independent edges   Poisson degree distribution
                                      Thresholds for all monotone props

  Watts-Strogatz  n, k, \beta             High clustering + short paths
  Small-World     Ring lattice +      Models social/neural networks
                  random rewiring     Navigability (Kleinberg)

  Barabasi-Albert n, m_0, m            Power-law degree distribution
  Scale-Free      Preferential        Hubs, robustness to random failure
                  attachment          Most real-world networks

  Stochastic      k, n_1,...,n_k, B     Community structure
  Block Model     Block membership    Kesten-Stigum threshold
  (SBM)           + probability matrix Benchmark for GNN node classif.

========================================================================

Each model captures one dominant feature of real-world networks:

  • ER captures mathematical tractability and threshold phenomena
  • WS captures the small-world property (short paths, high clustering)
  • BA captures heterogeneity (a few highly-connected hubs, many low-degree nodes)
  • SBM captures community structure (dense intra-community, sparse inter-community)

Real networks typically exhibit several of these properties simultaneously. Hybrid models (e.g., SBM with degree correction, or preferential attachment with community structure) better match empirical data.

1.4 Historical Timeline: 1959-2026

RANDOM GRAPH THEORY: 1959-2026
========================================================================

  1959  Erdos & Renyi introduce G(n,m); first random graph paper
  1960  Erdos & Renyi prove giant component phase transition
  1961  Erdos & Renyi prove connectivity threshold p = log(n)/n
  1984  Bollobas writes first comprehensive textbook on random graphs
  1995  Chung & Lu: random graphs with given expected degrees
  1998  Watts & Strogatz: small-world networks (Nature, ~36,000 cites)
  1999  Barabasi & Albert: scale-free networks (Science, ~50,000 cites)
  2001  Lovasz & Szegedy begin graphon theory (published 2006)
  2001  Girvan & Newman: modularity and community detection
  2007  Lovasz & Szegedy: limits of dense graph sequences (JCTB)
  2011  Mossel, Neeman, Sly: Kesten-Stigum threshold (stochastic proof)
  2014  Abbe & Sandon: exact recovery threshold for SBM
  2015  You, Ying, Leskovec: GraphRNN graph generation
  2018  Keriven & Peyre: graphon neural networks
  2020  GraphWorld: benchmark generation via SBM
  2022  Rusch, Bronstein, Mishra: gradient over-squashing theory
  2023  DiGress: discrete diffusion for graph generation
  2024  Graphon attention transformers (sparse attention limits)
  2026  Graphon-based GNN universality: completeness results

========================================================================

1.5 Phase Transitions: The Central Metaphor

The most striking feature of random graphs is the phase transition: a sudden, dramatic change in global structure as a parameter crosses a critical threshold. This is not merely a quantitative change but a qualitative one - the graph transitions from one phase (many small components) to another (one giant component plus small components).

Physical analogy: In ferromagnetism, a material transitions from disordered (many small magnetic domains) to ordered (one aligned domain) as temperature drops below the Curie point. The mathematics is identical: both are percolation phase transitions.

The critical exponent phenomenon: Near the critical point p_c = 1/n, the largest component has size \Theta(n^{2/3}) - a polynomial scale intermediate between the subcritical O(\log n) and supercritical \Theta(n) regimes. This n^{2/3} scaling is a universal critical exponent that appears in many other combinatorial and physical phase transitions.

For AI: Phase transitions in random graphs correspond to phase transitions in learning. A GNN trained to detect community structure in SBMs undergoes a computational phase transition at the Kesten-Stigum threshold - above it, polynomial-time algorithms succeed; below it, no efficient algorithm can (conditional on computational hardness conjectures). This is the rigorous mathematical statement of why some graph learning problems are fundamentally hard.


2. Probability on Graphs: Formal Setup

2.1 Graph Probability Spaces

Let Gn\mathcal{G}_n denote the set of all labeled graphs on vertex set [n]={1,2,,n}[n] = \{1, 2, \ldots, n\}. A random graph model is a probability measure μ\mu on Gn\mathcal{G}_n.

Definition (Random Graph): A random graph GG on nn vertices is a random variable taking values in Gn\mathcal{G}_n, defined on some probability space (Ω,F,P)(\Omega, \mathcal{F}, \mathbb{P}).

The key examples:

G(n,p)G(n,p) (Erdos-Renyi): Each edge {u,v}([n]2)\{u,v\} \in \binom{[n]}{2} is included independently with probability pp. The probability of a specific graph gg with mm edges is:

\mathbb{P}[G = g] = p^m (1-p)^{\binom{n}{2} - m}

G(n,m)G(n,m) (fixed-edge): Uniform over all graphs with exactly mm edges:

\mathbb{P}[G = g] = \binom{\binom{n}{2}}{m}^{-1} \cdot \mathbf{1}[|E(g)| = m]

Equivalence: G(n,p)G(n,p) and G(n,m)G(n,m) with m=p(n2)m = p\binom{n}{2} have the same asymptotic behavior for most properties. More precisely, G(n,p)G(n,p) conditioned on having exactly mm edges is G(n,m)G(n,m).

Graph properties as events: A graph property P\mathcal{P} is a set of graphs closed under isomorphism (it depends only on structure, not labeling). We study the event {GP}\{G \in \mathcal{P}\} and its probability as nn \to \infty.

2.2 With High Probability (w.h.p.)

Definition (w.h.p.): We say G(n,p)G(n,p) satisfies property P\mathcal{P} with high probability (w.h.p.) if:

\lim_{n \to \infty} \mathbb{P}[G(n,p) \in \mathcal{P}] = 1

This is distinct from "always" - there will always be rare samples that violate the property. But in the limit, the measure of the failure set goes to zero.

Notation conventions:

  • f(n) = o(g(n)): f/g \to 0
  • f(n) = \omega(g(n)): f/g \to \infty
  • f(n) = \Theta(g(n)): c_1 g \le f \le c_2 g for constants c_1, c_2 > 0
  • f(n) \sim g(n): f/g \to 1

Example (Isolated vertices): In G(n,p) with p = c/n, the probability that a given vertex v is isolated is (1-p)^{n-1} \sim e^{-c}, so the expected number of isolated vertices is n e^{-c}. For c = \ln n this expectation equals 1 and the count is approximately Poisson(1); once c exceeds (1+\epsilon)\ln n, the expectation tends to 0 and there are no isolated vertices w.h.p.

Markov's inequality (First Moment Method): If X0X \ge 0 and E[X]0\mathbb{E}[X] \to 0, then X=0X = 0 w.h.p. (since P[X1]E[X]0\mathbb{P}[X \ge 1] \le \mathbb{E}[X] \to 0). This proves upper bounds on the probability of properties.

Second moment method: If E[X2]/E[X]21\mathbb{E}[X^2] / \mathbb{E}[X]^2 \to 1, then X>0X > 0 w.h.p. (Paley-Zygmund inequality: P[X>0]E[X]2/E[X2]\mathbb{P}[X > 0] \ge \mathbb{E}[X]^2 / \mathbb{E}[X^2]). This proves lower bounds.

2.3 Monotone Properties and Threshold Functions

Definition (Monotone property): A graph property P\mathcal{P} is monotone increasing if whenever GPG \in \mathcal{P} and GGG' \supset G (additional edges), then GPG' \in \mathcal{P}. Examples: connectivity, containing a triangle, having a giant component.

Theorem (Bollobas-Thomason, 1987): Every non-trivial monotone graph property has a threshold function p(n)p^*(n) such that:

\lim_{n \to \infty} \mathbb{P}[G(n,p) \in \mathcal{P}] = \begin{cases} 0 & \text{if } p/p^* \to 0 \\ 1 & \text{if } p/p^* \to \infty \end{cases}

Bollobas and Thomason proved this with a coupling argument. Later sharp-threshold theory (Friedgut-Kalai, based on edge influences) shows that many monotone properties have an even stronger property: the window over which \mathbb{P}[\mathcal{P}] rises from near 0 to near 1 has width o(p^*), because no single edge has outsized influence.

For AI: Sharp thresholds are the theoretical explanation for why GNN performance can change dramatically as graph density or community strength crosses a critical value. A model that performs at 90% accuracy might drop to random guessing when a graph parameter decreases by a factor of 2.

2.4 First and Second Moment Methods

These are the two workhorses for proving w.h.p. statements.

First Moment (Upper Bound):

To show P\mathcal{P} holds w.h.p., find a "bad event" BB and show E[1B]0\mathbb{E}[\mathbf{1}_B] \to 0:

\mathbb{P}[B] = \mathbb{E}[\mathbf{1}_B] \to 0

Second Moment (Lower Bound):

To show X>0X > 0 w.h.p. (where X=iXiX = \sum_i X_i counts copies of some structure):

Paley-Zygmund inequality: \mathbb{P}[X > 0] \ge \frac{\mathbb{E}[X]^2}{\mathbb{E}[X^2]}

Expanding: E[X2]=i,jE[XiXj]\mathbb{E}[X^2] = \sum_{i,j} \mathbb{E}[X_i X_j], so we need to bound correlations between pairs of copies.

Example (Triangles): Let XX = number of triangles in G(n,p)G(n,p).

  • \mathbb{E}[X] = \binom{n}{3} p^3 \sim n^3 p^3 / 6
  • For p = c/n: \mathbb{E}[X] \sim c^3/6

The triangle threshold is p^* = n^{-1}: for p \ll 1/n there are no triangles w.h.p.; for p \gg 1/n triangles exist w.h.p.
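
As a quick sanity check, a minimal simulation sketch (using networkx; the graph size, trial count, and values of c below are illustrative choices, not from the lesson) compares \binom{n}{3}p^3 against an empirical triangle count:

```python
import networkx as nx
from math import comb

n, trials = 500, 20

for c in [0.5, 1.0, 3.0]:            # average degree c, so p = c/n
    p = c / n
    expected = comb(n, 3) * p**3     # E[X] = C(n,3) p^3 ~ c^3 / 6
    observed = sum(
        sum(nx.triangles(nx.gnp_random_graph(n, p, seed=s)).values()) / 3
        for s in range(trials)
    ) / trials
    print(f"c = {c:3.1f}   E[triangles] = {expected:6.3f}   simulated = {observed:6.3f}")
```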


3. Erdos-Renyi Model

3.1 Definitions: G(n,p) and G(n,m)

The Erdos-Renyi model is the simplest and most mathematically tractable random graph model. Its beauty lies in the complete independence of edges, which makes exact calculations feasible.

Definition G(n,p)G(n,p): The random graph on [n][n] where each edge is independently present with probability p[0,1]p \in [0,1].

Statistics:

  • E[E]=(n2)pn2p/2\mathbb{E}[|E|] = \binom{n}{2} p \approx n^2 p / 2
  • Var[E]=(n2)p(1p)\text{Var}[|E|] = \binom{n}{2} p(1-p)
  • E[deg(v)]=(n1)pnp\mathbb{E}[\deg(v)] = (n-1)p \approx np for each vertex vv

Definition G(n,m)G(n,m): The random graph on [n][n] drawn uniformly from all graphs with exactly mm edges.

Regime classification (by average degree c = (n-1)p \approx np):

| Regime | p range | c range | Largest component |
|---|---|---|---|
| Subcritical | p < 1/n | c < 1 | O(\log n) |
| Critical | p = 1/n | c = 1 | \Theta(n^{2/3}) |
| Supercritical | p > 1/n | c > 1 | \Theta(n) |
| Connected | p \ge \ln(n)/n | c \ge \ln n | n (whole graph) |

For AI: When analyzing message passing in sparse GNNs on ER graphs, the regime determines whether information can flow globally. Below criticality (c<1c < 1), most nodes are in isolated small components - the GNN can only see local structure. Above criticality, there's a giant component enabling global information flow.
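
The regime table can be checked empirically. A minimal sketch (networkx; n and the values of c are illustrative):

```python
import networkx as nx

n = 20_000
for c in [0.5, 1.0, 2.0, 5.0]:                 # average degree; p = c/n
    G = nx.gnp_random_graph(n, c / n, seed=0)
    largest = max(len(cc) for cc in nx.connected_components(G))
    print(f"c = {c:3.1f}   largest component = {largest:6d}   ({largest / n:.3f} of n)")
```

For c = 0.5 the largest component stays tiny (logarithmic in n), while for c = 2 and c = 5 it contains a constant fraction of all nodes, as the table predicts.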

3.2 Degree Distribution: Poisson Limit

Theorem (Poisson Degree Distribution): In G(n,p)G(n,p) with p=c/np = c/n, the degree of a fixed vertex vv satisfies:

\deg(v) \sim \text{Binomial}(n-1, c/n) \xrightarrow{d} \text{Poisson}(c) \text{ as } n \to \infty

Proof sketch: deg(v)=uvXuv\deg(v) = \sum_{u \neq v} X_{uv} where XuvBernoulli(c/n)X_{uv} \sim \text{Bernoulli}(c/n) are i.i.d. The sum of n1n-1 independent Bernoullis with success probability c/nc/n converges to Poisson(cc) by the law of small numbers (Poisson limit theorem).

Generating function approach: The probability generating function of Binomial(n1,c/n)\text{Binomial}(n-1, c/n) is:

G_{B}(z) = \left(1 - \frac{c}{n} + \frac{cz}{n}\right)^{n-1} \to e^{c(z-1)} = G_{\text{Poi}(c)}(z)

Consequences of Poisson degree distribution:

  1. Exponential tail: \mathbb{P}[\deg(v) = k] = e^{-c} c^k / k!, so degrees are concentrated near c
  2. Max degree: \max_v \deg(v) \sim \frac{\log n}{\log \log n} (sublinear in n)
  3. No hubs: Unlike scale-free networks, sparse ER graphs have no nodes of degree \Omega(\sqrt{n})

Contrast with scale-free: Power-law degree distributions \mathbb{P}[\deg = k] \propto k^{-\gamma} have polynomial tails and allow hubs of degree \Theta(n^{1/(\gamma-1)}). This is why real networks (preferential attachment) look so different from ER.
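
The Poisson limit is easy to verify numerically. A minimal sketch (networkx/numpy; n and c are illustrative) compares the empirical degree histogram of a sparse ER graph to Poisson(c):

```python
import numpy as np
import networkx as nx
from math import exp, factorial

n, c = 10_000, 4.0
G = nx.gnp_random_graph(n, c / n, seed=1)
degrees = np.array([d for _, d in G.degree()])

for k in range(9):
    empirical = np.mean(degrees == k)
    poisson = exp(-c) * c**k / factorial(k)
    print(f"k = {k}   empirical = {empirical:.4f}   Poisson({c}) = {poisson:.4f}")
```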

3.3 Giant Component Phase Transition

This is the central theorem of random graph theory - one of the most beautiful results in all of combinatorics.

Theorem (Erdos-Renyi, 1960): Let c>0c > 0 be a constant and p=c/np = c/n. Let L1(G)L_1(G) denote the size of the largest connected component of GG(n,p)G \sim G(n,p).

  1. Subcritical (c<1c < 1): L1/lnn1/I(c)L_1 / \ln n \to 1/I(c) w.h.p., where I(c)=c1lnc>0I(c) = c - 1 - \ln c > 0. In particular, L1=O(logn)L_1 = O(\log n).

  2. Critical (c=1c = 1): L1/n2/3Θ(1)L_1 / n^{2/3} \to \Theta(1) in distribution (scaling limit is Brownian motion related).

  3. Supercritical (c>1c > 1): L1/nβ(c)L_1 / n \to \beta(c) w.h.p., where β(c)\beta(c) is the unique solution in (0,1)(0,1) to:

\beta = 1 - e^{-c\beta}

The second-largest component has size O(logn)O(\log n).

The survival probability β(c)\beta(c): The equation β=1ecβ\beta = 1 - e^{-c\beta} has a probabilistic interpretation. Think of each vertex independently joining the giant component with probability β\beta. A vertex joins if it has at least one neighbor that also joins. If neighbors join with probability β\beta, and there are Poisson(c)\text{Poisson}(c) neighbors, the probability of having at least one joining neighbor is 1ecβ1 - e^{-c\beta} - hence the fixed-point equation.

Derivation via branching processes: A key technique is to compare component exploration with a branching process. Starting from vertex vv, reveal its neighbors one by one. Each neighbor independently has Poisson(c)\text{Poisson}(c) additional neighbors (in the limit). This is a Galton-Watson branching process with offspring distribution Poisson(c)\text{Poisson}(c).

A Galton-Watson process with mean offspring cc survives (generates an infinite tree) with probability β\beta satisfying β=1ecβ\beta = 1 - e^{-c\beta}. Below c=1c = 1 (mean offspring <1< 1), the process dies out almost surely - hence subcritical. Above c=1c = 1, there's a positive probability of survival - hence the giant component.

Size of the giant component:

| c | β(c) (fraction of nodes in the giant component) |
|---|---|
| 0.5 | 0 (subcritical) |
| 1.0 | 0 (critical; largest component is only Θ(n^{2/3})) |
| 1.5 | 0.583 |
| 2.0 | 0.797 |
| 3.0 | 0.940 |
| 5.0 | 0.993 |
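
The β(c) values in this table come straight from the fixed-point equation. A minimal sketch of the iteration (pure Python; the helper name giant_fraction is ours):

```python
import math

def giant_fraction(c, iters=500):
    """Iterate beta <- 1 - exp(-c * beta) starting from beta = 1.
    For c <= 1 the only fixed point in [0, 1] is 0 (convergence is slow exactly at c = 1)."""
    beta = 1.0
    for _ in range(iters):
        beta = 1.0 - math.exp(-c * beta)
    return beta

for c in [0.5, 1.5, 2.0, 3.0, 5.0]:
    print(f"c = {c:3.1f}   beta(c) = {giant_fraction(c):.3f}")
```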

For AI: GNN expressiveness on random graphs undergoes a similar phase transition. In the subcritical regime, the WL algorithm (and GINs) see disconnected local neighborhoods - they cannot distinguish non-isomorphic components. In the supercritical regime, the global component provides rich structural information.

3.4 Connectivity Threshold

Theorem (Erdos-Renyi, 1961): Let p=(lnn+c)/np = (\ln n + c) / n for a constant cRc \in \mathbb{R}. Then:

\lim_{n \to \infty} \mathbb{P}[G(n,p) \text{ is connected}] = e^{-e^{-c}}

In particular:

  • If pln(n)/np \ll \ln(n)/n: GG is disconnected w.h.p.
  • If p=ln(n)/np = \ln(n)/n: GG is connected with probability e10.368e^{-1} \approx 0.368
  • If pln(n)/np \gg \ln(n)/n: GG is connected w.h.p.

Proof sketch (upper bound via first moment):

Let XkX_k = number of components of size kn/2k \le n/2. The bottleneck is k=1k=1 (isolated vertices). A vertex vv is isolated iff none of its n1n-1 potential edges are present:

\mathbb{P}[v \text{ isolated}] = (1-p)^{n-1} = \left(1 - \frac{\ln n + c}{n}\right)^{n-1} \to e^{-(\ln n + c)} = \frac{e^{-c}}{n}

Expected isolated vertices: E[X1]=necn=ec\mathbb{E}[X_1] = n \cdot \frac{e^{-c}}{n} = e^{-c}.

By a Poisson approximation argument, X1Poisson(ec)X_1 \to \text{Poisson}(e^{-c}) in distribution, so:

\mathbb{P}[X_1 = 0] \to e^{-e^{-c}}

Connectivity fails iff X1>0X_1 > 0 or there's a larger isolated component. One shows larger components disappear before isolated vertices, so connectivity threshold == isolated vertex disappearance threshold.

Sharp threshold: The window is p=(lnn+ω(1))/np = (\ln n + \omega(1))/n - any diverging ω\omega suffices. This is an unusually sharp threshold (polynomial-width thresholds are more common).
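
A small simulation sketch (networkx; n, the trial count, and the values of c are illustrative) shows the connectivity probability approaching the double-exponential limit:

```python
import math
import networkx as nx

n, trials = 2_000, 200
for c in [-1.0, 0.0, 1.0, 2.0]:
    p = (math.log(n) + c) / n
    simulated = sum(
        nx.is_connected(nx.gnp_random_graph(n, p, seed=s)) for s in range(trials)
    ) / trials
    limit = math.exp(-math.exp(-c))
    print(f"c = {c:+.1f}   simulated = {simulated:.3f}   e^(-e^(-c)) = {limit:.3f}")
```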

3.5 Subgraph Counts and Triangle Thresholds

Definition: A subgraph count for a fixed graph HH is XH=X_H = number of labeled copies of HH in G(n,p)G(n,p).

Expectation:

\mathbb{E}[X_H] = \frac{n!}{(n - |V(H)|)!} \cdot \frac{1}{|\text{Aut}(H)|} \cdot p^{|E(H)|}

For dense subgraphs, this simplifies to Θ(nV(H)pE(H))\Theta(n^{|V(H)|} p^{|E(H)|}).

Threshold for HH: By the first and second moment methods, XH>0X_H > 0 w.h.p. iff pn1/m(H)p \gg n^{-1/m(H)}, where:

m(H) = \max_{H' \subseteq H, |V(H')| \ge 1} \frac{|E(H')|}{|V(H')|}

is the maximum edge density |E(H')| / |V(H')| over all subgraphs H' of H (the density of its densest subgraph).

Triangle threshold: For H = K_3 (the triangle), m(H) = 3/3 = 1, so the threshold is p^* = n^{-1}.

| Subgraph H | \|V\| | \|E\| | m(H) | Threshold p^* |
|---|---|---|---|---|
| Edge | 2 | 1 | 1/2 | n^{-2} |
| Path P_3 | 3 | 2 | 2/3 | n^{-3/2} |
| Triangle K_3 | 3 | 3 | 1 | n^{-1} |
| 4-clique K_4 | 4 | 6 | 3/2 | n^{-2/3} |
| 5-clique K_5 | 5 | 10 | 2 | n^{-1/2} |

Janson's inequality: Provides exponential concentration for subgraph counts when E[XH]\mathbb{E}[X_H] is large:

\mathbb{P}[X_H = 0] \le \exp\left(-\frac{\mathbb{E}[X_H]^2}{2\Delta}\right)

where Δ=HHP[HHG]\Delta = \sum_{H' \sim H''} \mathbb{P}[H' \cup H'' \subseteq G] sums over pairs sharing at least one edge.

3.6 Diameter and Distances

Theorem: In G(n,p)G(n,p) with c=np>1c = np > 1 (supercritical), the diameter of the giant component is:

\text{diam} \sim \frac{\log n}{\log c}

w.h.p. More precisely, diam=(1+o(1))logn/logc\text{diam} = (1 + o(1)) \log n / \log c.

Why? The giant component behaves like a random tree with branching factor cc. Starting from any node, the neighborhood of radius rr has size cr\sim c^r. It reaches nn when crnc^r \approx n, i.e., rlogn/logcr \approx \log n / \log c.

Characteristic path length: This O(logn)O(\log n) diameter is the mathematical explanation for the six degrees of separation phenomenon. With n=109n = 10^9 (Facebook) and c100c \approx 100 (100 friends), the diameter is log(109)/log(100)4.5\log(10^9) / \log(100) \approx 4.5 - indeed about 4-6 hops.

Contrast with small-world: In the ring lattice (before rewiring), diameter is n/(2k)=Ω(n)n/(2k) = \Omega(n) - linear. Watts-Strogatz introduces just β\beta fraction of random rewirings to drop this to O(logn)O(\log n) while maintaining high clustering.

3.7 Spectral Properties of G(n,p)

Expected adjacency matrix: \mathbb{E}[\mathbf{A}] = p(\mathbf{1}\mathbf{1}^\top - \mathbf{I}_n), a rank-1 perturbation of -p\mathbf{I}.

Spectral decomposition: Write A=E[A]+(AE[A])\mathbf{A} = \mathbb{E}[\mathbf{A}] + (\mathbf{A} - \mathbb{E}[\mathbf{A}]). The fluctuation matrix W=AE[A]\mathbf{W} = \mathbf{A} - \mathbb{E}[\mathbf{A}] is a Wigner matrix (symmetric, zero-diagonal, i.i.d. entries above diagonal).

Theorem (Furedi-Komlos, 1981): For G(n,p)G(n,p) with pp constant, the eigenvalues of A\mathbf{A} satisfy:

  • \lambda_1 \sim np (an outlier, whose eigenvector is close to the constant direction \mathbf{1}/\sqrt{n})
  • \lambda_2, \ldots, \lambda_n lie in [-(2+\epsilon)\sqrt{np(1-p)}, (2+\epsilon)\sqrt{np(1-p)}] w.h.p.

The bulk eigenvalues follow Wigner's semicircle law with radius 2\sqrt{np(1-p)}.

For spectral clustering: The spectral gap \lambda_1 - \lambda_2 \approx np - 2\sqrt{np} determines how easily we can detect the leading eigenvector (which, in the SBM, encodes the community structure).


4. Watts-Strogatz Small-World Model

4.1 Construction and Parameters

The Watts-Strogatz (WS) model was introduced in 1998 to explain a paradox: real-world networks (social, neural, power grid) have both high clustering AND short path lengths. Regular lattices have high clustering but long paths; ER random graphs have short paths but low clustering. WS interpolates between them.

Construction (Watts-Strogatz, 1998):

  1. Start with a ring lattice: nn nodes arranged in a circle, each connected to kk nearest neighbors (k/2k/2 on each side). Assume nklnnn \gg k \gg \ln n.
  2. Rewire: For each edge (i,j)(i, j) with j>ij > i in the ring, with probability β\beta replace jj with a uniformly random node jij' \ne i (avoiding duplicate edges).

Parameters:

  • nn: number of nodes
  • kk: initial degree (even integer, typically k{4,6,10}k \in \{4, 6, 10\})
  • β[0,1]\beta \in [0,1]: rewiring probability
    • β=0\beta = 0: regular ring lattice (high clustering, long paths)
    • β=1\beta = 1: approximately ER random graph (low clustering, short paths)
    • β0.01\beta \approx 0.01-0.10.1: small-world regime (high clustering, short paths!)

For AI: Neural network connection graphs and attention patterns often exhibit small-world structure. The feedforward layers of transformers have short path lengths (like random graphs) while local attention heads maintain clustered connections (like lattices). Understanding this structure informs efficient attention design.
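
A minimal sketch (networkx; n, k, and the β values are illustrative) traces how clustering and path length change with the rewiring probability:

```python
import networkx as nx

n, k = 1_000, 10
for beta in [0.0, 0.01, 0.1, 1.0]:
    # connected_watts_strogatz_graph retries until the rewired graph is connected,
    # so average_shortest_path_length is well defined
    G = nx.connected_watts_strogatz_graph(n, k, beta, seed=0)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    print(f"beta = {beta:5.3f}   C = {C:.3f}   L = {L:5.2f}")
```

Already at β = 0.01 the path length collapses toward its random-graph value while the clustering stays close to the lattice value; this is the small-world regime quantified in 4.3 and 4.4.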

4.2 Clustering Coefficient

Definition: The local clustering coefficient of vertex vv with degree dvd_v is:

C_v = \frac{|\{(u,w) \in E : u,w \in N(v)\}|}{\binom{d_v}{2}}

the fraction of vv's possible neighbor-pairs that are also connected.

Global clustering coefficient: C = \frac{1}{n} \sum_v C_v (average over vertices).

Ring lattice (β=0\beta = 0): Each vertex has kk neighbors (the k/2k/2 on each side of the ring). Neighbors of vv include all nodes within distance k/2k/2. A pair of vv's neighbors (u,w)(u,w) are connected iff they are within distance k/2k/2 of each other.

For large kk, the fraction of connected neighbor pairs is approximately:

C_{\text{lattice}} = \frac{3(k-2)}{4(k-1)} \approx \frac{3}{4} \text{ for large } k

WS model: For small β\beta, the clustering coefficient is approximately:

C(\beta) \approx C(0) \cdot (1 - \beta)^3 = \frac{3(k-2)}{4(k-1)} (1 - \beta)^3

This is because a triangle involving vv requires all three edges to survive rewiring, and each edge survives with probability (1β)(1-\beta).

Random graph (β=1\beta = 1): For ER with pk/np \approx k/n:

C_{\text{random}} \approx k/n \ll 1

Key observation: For small β\beta (say β=0.05\beta = 0.05), C(β)0.75(0.95)30.64C(\beta) \approx 0.75 \cdot (0.95)^3 \approx 0.64 - still very high, close to the lattice value.

4.3 Average Path Length

Ring lattice (β=0\beta = 0): The shortest path between two nodes separated by rr positions in the ring is r/(k/2)\lceil r / (k/2) \rceil. The average path length is approximately n/(2k)n/(2k), which grows linearly with nn.

WS model (0<β<10 < \beta < 1): The average path length drops dramatically with rewiring. Even a tiny β\beta introduces long-range shortcuts that drastically reduce path lengths.

Heuristic analysis (Newman-Watts): The typical inter-component distance after rewiring is:

L(\beta) \approx \frac{n}{k} \cdot f(nk\beta/2)

where f(u)log(u)/uf(u) \sim \log(u)/u for large uu. For β2/(nk)\beta \gg 2/(nk), the path length transitions from Θ(n)\Theta(n) to Θ(logn)\Theta(\log n).

Numerical example (n=1000n = 1000, k=10k = 10):

| β | C(β) | L(β) |
|---|---|---|
| 0 | 0.667 | 50 |
| 0.001 | 0.660 | 20 |
| 0.01 | 0.640 | 8 |
| 0.1 | 0.528 | 5 |
| 0.5 | 0.195 | 4 |
| 1.0 | 0.010 | 3 |

The small-world regime (β0.01\beta \approx 0.01-0.050.05) achieves high CC and low LL simultaneously.

4.4 The Small-World Regime

Watts-Strogatz property: A graph is said to have the small-world property if:

  1. CCrandom=k/nC \gg C_{\text{random}} = k/n (much more clustered than ER)
  2. LLrandom=log(n)/log(k)L \approx L_{\text{random}} = \log(n)/\log(k) (path length comparable to ER)

These two conditions are simultaneously satisfied for WS with β[Ω(1/(nk)),O(1)]\beta \in [\Omega(1/(nk)), O(1)].

Empirical validation:

Watts and Strogatz validated the model against three real networks:

| Network | n | k (avg degree) | C_actual | C_random | L_actual | L_random |
|---|---|---|---|---|---|---|
| Film actors | 225,226 | 61 | 0.79 | 0.00027 | 3.65 | 2.99 |
| Power grid (western US) | 4,941 | 2.67 | 0.080 | 0.005 | 18.7 | 12.4 |
| C. elegans neural network | 282 | 14 | 0.28 | 0.05 | 2.65 | 2.25 |

All three networks have high clustering (much above ER random graph level) but short average path lengths (comparable to ER). This is the hallmark of small-world structure.

4.5 Navigability and Kleinberg's Grid

Watts and Strogatz showed that small-world graphs have short paths, but Kleinberg (2000) asked: can nodes find these short paths using only local information?

Kleinberg's model: Start with a 2D grid of n=k×kn = k \times k nodes. Each node vv has edges to all grid neighbors within distance rr (local structure) plus one long-range link to node uu chosen with probability proportional to d(v,u)sd(v,u)^{-s}.

Theorem (Kleinberg, 2000):

  • If s=2s = 2 (exponent matches grid dimension): greedy routing finds paths of length O((logn)2)O((\log n)^2) w.h.p.
  • If s2s \ne 2: any decentralized algorithm requires Ω(nδ)\Omega(n^{\delta}) steps for some δ>0\delta > 0.

Interpretation: The power-law exponent s=ds = d (where dd is the grid dimension) uniquely enables efficient decentralized navigation. Real social networks approximate this condition, explaining how people can navigate social networks in few steps even with only local knowledge.

For AI: This connects to hierarchical attention in transformers. FlashAttention-2 and related methods achieve efficient attention by exploiting the fact that attention scores decay with token distance - a continuous analog of Kleinberg's inverse power-law long-range links.

4.6 Social Network Evidence

Small-world structure has been empirically validated across many domains:

  • Social networks: Milgram's 1967 experiment found ~6 hops between strangers in the US; Facebook (2016) found average path length 3.57 for 1.6 billion users.
  • Neural connectomes: C. elegans (302 neurons), mouse visual cortex, human brain fMRI networks all show high clustering and short paths.
  • Internet topology: ASN-level and router-level graphs exhibit small-world properties.
  • Citation networks: Academic citation graphs have C0.3C \approx 0.3-0.50.5, well above the ER baseline.

Limitations: WS does not generate power-law degree distributions. Real networks often exhibit BOTH small-world AND scale-free properties - requiring hybrid models.


5. Barabasi-Albert Scale-Free Model

5.1 Preferential Attachment

The Barabasi-Albert (BA) model was introduced in 1999 to explain a universal observation: degree distributions in real-world networks follow a power law P(k)kγP(k) \propto k^{-\gamma} with γ2\gamma \approx 2-33. This is incompatible with the exponential tails of ER's Poisson distribution.

The explanation: networks grow over time, and new nodes prefer to attach to high-degree nodes. This "rich get richer" mechanism generates hubs.

Construction (Barabasi-Albert):

  1. Start with m0mm_0 \ge m nodes with arbitrary connections.
  2. At each time step t=m0+1,m0+2,,nt = m_0+1, m_0+2, \ldots, n:
    • Add one new node vtv_t
    • Connect vtv_t to mm existing nodes, chosen independently with probability proportional to their current degree:
\Pi(v_t \to u) = \frac{\deg(u)}{\sum_{w} \deg(w)}

The denominator wdeg(w)=2E\sum_w \deg(w) = 2|E| (handshaking lemma).

Why "preferential attachment"? This mimics real-world network growth:

  • Web pages link to popular pages (already have many in-links)
  • Scientists cite highly-cited papers
  • Airports add routes to hubs

For AI: GNN training graphs from large knowledge bases (Wikidata, Freebase) exhibit scale-free degree distributions. GNN aggregation functions must handle this heterogeneity - a degree-10 node and a degree-1000 node require different treatment.
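
A minimal sketch (networkx/numpy; the graph size, m, and the crude log-log tail fit are illustrative, not a rigorous power-law estimator) generates a BA graph and checks that the degree tail decays roughly like k^{-3}:

```python
import numpy as np
import networkx as nx

n, m = 50_000, 3
G = nx.barabasi_albert_graph(n, m, seed=0)
degrees = np.array([d for _, d in G.degree()])

# CCDF P[deg >= k]; if P(k) ~ k^-3 then the CCDF decays like k^-2
ks = np.arange(m, degrees.max() + 1)
ccdf = np.array([np.mean(degrees >= k) for k in ks])
mask = ccdf > 0
slope, _ = np.polyfit(np.log(ks[mask]), np.log(ccdf[mask]), 1)
print(f"max degree = {degrees.max()}   CCDF slope = {slope:.2f}   implied gamma = {1 - slope:.2f}")
```

A naive least-squares fit on a log-log plot is known to be biased; careful power-law fitting uses maximum-likelihood estimators, but the heavy tail (and the hubs) are visible even in this sketch.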

5.2 Power-Law Degree Distribution

Theorem (Barabasi-Albert, 1999; rigorous: Bollobas et al., 2001): For GBA(n,m)G \sim \text{BA}(n, m), as nn \to \infty, the degree distribution satisfies:

\mathbb{P}[\deg(v) = k] \to \frac{2m(m+1)}{k(k+1)(k+2)} \sim \frac{2m^2}{k^3} \text{ for large } k

This is a power law with exponent γ=3\gamma = 3:

P(k) \propto k^{-3}

Scale-free: A distribution is called scale-free if P(k)kγP(k) \propto k^{-\gamma} for some γ>1\gamma > 1. Scale-free means the distribution looks the same at all scales (self-similar under rescaling).

Moments: For P(k)kγP(k) \propto k^{-\gamma}:

  • E[k]<\mathbb{E}[k] < \infty iff γ>2\gamma > 2
  • E[k2]<\mathbb{E}[k^2] < \infty iff γ>3\gamma > 3

For BA with γ=3\gamma = 3: mean degree is finite, but variance diverges. This has profound implications:

  • No effective epidemic threshold: diseases spread to the whole network for any transmission rate >0>0
  • Robustness to random failure: removing random nodes rarely disconnects the graph (most nodes have low degree)
  • Vulnerability to targeted attack: removing the few hubs disconnects the graph immediately

5.3 Mean-Field Analysis

Mean-field derivation: Let ki(t)k_i(t) be the degree of node ii at time tt. Treating kik_i as a continuous variable and taking expectations:

\frac{\partial k_i}{\partial t} = m \cdot \Pi(i) = m \cdot \frac{k_i}{\sum_j k_j}

Since jkj=2mt\sum_j k_j = 2mt (each new node adds mm edges, contributing 2m2m to total degree sum):

\frac{\partial k_i}{\partial t} = \frac{k_i}{2t}

Solution: With initial condition ki(ti)=mk_i(t_i) = m (node ii added at time tit_i with mm initial edges):

k_i(t) = m \sqrt{t / t_i}

Degree distribution from mean-field: Node ii has degree >k> k at time tt iff ki(t)>kk_i(t) > k, i.e., ti<m2t/k2t_i < m^2 t / k^2. Since nodes are added uniformly at rate 1 per step:

\mathbb{P}[k_i(t) > k] = \mathbb{P}[t_i < m^2 t / k^2] = \frac{m^2 t / k^2}{t} = \frac{m^2}{k^2}

Therefore:

P(k) = -\frac{d}{dk}\mathbb{P}[k_i(t) > k] = \frac{2m^2}{k^3}

This reproduces the power law P(k)k3P(k) \propto k^{-3}, consistent with the rigorous result.

Hub formation: The oldest nodes grow fastest (degree t/ti\propto \sqrt{t/t_i}). Node ii added at time ti=1t_i = 1 has degree mn\sim m\sqrt{n} at time nn - a hub with Θ(n)\Theta(\sqrt{n}) degree.

5.4 Robustness and Fragility

Random failure: If each node is independently removed with probability qq, what is the critical qq^* above which the giant component disappears?

For a network with degree distribution P(k)P(k), the critical fraction for percolation is determined by the Molloy-Reed criterion:

\frac{\mathbb{E}[k^2]}{\mathbb{E}[k]} > 2

For scale-free networks with γ3\gamma \le 3 (including BA with γ=3\gamma = 3): E[k2]=\mathbb{E}[k^2] = \infty, so this criterion always holds. No matter how many nodes are randomly removed, a giant component persists (in infinite networks). In finite BA networks, the giant component persists for qq close to 1.

Targeted attack: If the highest-degree nodes are removed first, the network is highly vulnerable. Removing even 5% of the highest-degree nodes can destroy the giant component of a BA network.

Implication for AI: Neural network pruning that removes random weights (random failure) is much safer than pruning by magnitude (targeted attack) - magnitude-based pruning could inadvertently target the "hubs" of the implicit neural network graph.
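
The asymmetry can be illustrated directly. A minimal sketch (networkx; the graph size and the 5% removal fraction are illustrative, and the helper function is ours) compares the surviving giant-component fraction under random versus targeted removal:

```python
import random
import networkx as nx

def giant_fraction_after_removal(G, removed):
    H = G.copy()
    H.remove_nodes_from(removed)
    return max(len(cc) for cc in nx.connected_components(H)) / G.number_of_nodes()

G = nx.barabasi_albert_graph(10_000, 3, seed=0)
k = int(0.05 * G.number_of_nodes())                      # remove 5% of nodes

hubs = sorted(G.nodes, key=G.degree, reverse=True)[:k]   # targeted attack on the hubs
random.seed(0)
uniform = random.sample(list(G.nodes), k)                # random failure

print("targeted removal, giant fraction:", round(giant_fraction_after_removal(G, hubs), 3))
print("random removal,   giant fraction:", round(giant_fraction_after_removal(G, uniform), 3))
```

Exact numbers depend on n, m, and the removal fraction, but targeted removal consistently shrinks and fragments the giant component far more than random removal does.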

5.5 Scale-Free Networks in the Wild

Evidence for scale-free: Many networks show approximate power-law degree distributions:

  • World Wide Web: γin2.1\gamma_{\text{in}} \approx 2.1, γout2.7\gamma_{\text{out}} \approx 2.7
  • Internet (AS-level): γ2.1\gamma \approx 2.1
  • Protein-protein interaction: γ2.4\gamma \approx 2.4
  • Actor collaboration: γ2.3\gamma \approx 2.3

Controversy (2019): Broido & Clauset (2019) argued that true power laws are rare - most claimed scale-free networks fit log-normal or other heavy-tailed distributions better. The debate continues, but the practical point stands: real networks have much heavier degree tails than ER Poisson.

GNN implications: Degree-heterogeneous graphs require careful normalization in GCN-style aggregation. GraphSAGE's neighborhood sampling provides approximate uniformity; PNA (Principal Neighbourhood Aggregation) uses multiple aggregators (mean, max, std) to handle degree heterogeneity robustly.


6. Stochastic Block Model

6.1 Definition and Parameters

The Stochastic Block Model (SBM) is the canonical random graph model for community structure. It is the workhorse of GNN benchmark generation and the model for which information-theoretic limits of community detection are fully understood.

Definition (SBM): Given parameters:

  • nn: number of nodes
  • kk: number of communities/blocks
  • σ:[n][k]\sigma: [n] \to [k]: community assignment vector
  • B \in [0,1]^{k \times k}: symmetric block probability matrix (B_{rs} is the probability of an edge between communities r and s)

The SBM generates a random graph GSBM(n,k,σ,B)G \sim \text{SBM}(n, k, \sigma, B) by independently including each edge (u,v)(u,v) with probability Bσ(u),σ(v)B_{\sigma(u), \sigma(v)}.

Symmetric SBM (planted partition): All communities have equal size n/kn/k; Brr=pB_{rr} = p (within-community) and Brs=qB_{rs} = q for rsr \ne s (between-community), with p>qp > q.

Sparse SBM: p=a/np = a/n, q=b/nq = b/n with a>ba > b constants. This is the regime studied in information-theoretic limits.

Why SBM matters for AI: Node classification on real graphs (citation networks, social networks) essentially asks: given an observed graph GG with node features, recover the community labels σ\sigma. SBM provides the ground truth for studying when this is possible and what algorithms can achieve it.
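
A minimal sampling sketch for the sparse symmetric two-block SBM (using networkx's stochastic_block_model; n, a, and b are illustrative):

```python
import networkx as nx

n, a, b = 1_000, 12.0, 3.0                        # within/between average-degree parameters
sizes = [n // 2, n // 2]
probs = [[a / n, b / n],
         [b / n, a / n]]                           # block probability matrix B

G = nx.stochastic_block_model(sizes, probs, seed=0)
labels = [0] * sizes[0] + [1] * sizes[1]           # ground-truth communities sigma
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges",
      "(expected average degree about", (a + b) / 2, ")")
```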

6.2 Community Detection: Information-Theoretic Limits

Three regimes for symmetric 2-block SBM (k=2k = 2, equal communities):

With p=a/np = a/n, q=b/nq = b/n:

  1. Exact recovery (a>ba > b large enough): Recover σ\sigma exactly (up to global permutation) with high probability. Threshold: ab>2\sqrt{a} - \sqrt{b} > \sqrt{2}.

  2. Weak recovery / detection: Identify communities better than chance (correlated with ground truth). Threshold (Kesten-Stigum): (ab)2>2(a+b)(a-b)^2 > 2(a+b).

  3. Impossible: Below the Kesten-Stigum threshold, no algorithm can do better than random guessing.

Kesten-Stigum threshold: The SNR condition (ab)2>2(a+b)(a-b)^2 > 2(a+b) can be rewritten as:

\frac{(a-b)^2}{2(a+b)} > 1

The left side is the signal-to-noise ratio: (ab)(a-b) measures community signal, a+b\sqrt{a+b} measures noise (average degree). When SNR <1< 1, the noise overwhelms the signal.

Spectral interpretation: The second eigenvalue of the adjacency matrix satisfies λ2(ab)/nn/2=(ab)/2\lambda_2 \approx (a-b)/n \cdot n/2 = (a-b)/2. The spectral radius of the noise (Wigner semicircle) is 2(a+b)/22\sqrt{(a+b)/2}. The signal eigenvalue emerges from the noise bulk iff (ab)/2>2(a+b)/2(a-b)/2 > 2\sqrt{(a+b)/2}, i.e., (ab)2>8(a+b)/2=4(a+b)(a-b)^2 > 8(a+b)/2 = 4(a+b) - slightly different from KS due to different normalization, but same qualitative conclusion.

For GNNs: GNNs trained on SBM graphs face exactly this detection problem. Below the Kesten-Stigum threshold, no GNN - regardless of depth, width, or architecture - can reliably classify nodes into communities. This is the hard limit on what graph learning can achieve.

kk-block SBM: For kk communities, the threshold generalizes. The second eigenvalue of BB determines the computational threshold.

6.3 Degree-Corrected SBM

Real-world networks have heterogeneous degree distributions within communities. The DC-SBM adds degree parameters:

Definition (DC-SBM): Node vv has a degree parameter θv>0\theta_v > 0. The probability of edge (u,v)(u,v) is:

\mathbb{P}[(u,v) \in E] = \theta_u \theta_v B_{\sigma(u), \sigma(v)}

Under DC-SBM, the expected degree of node v is \theta_v \sum_u \theta_u B_{\sigma(v), \sigma(u)}. Drawing the degree parameters \theta_v from a power-law distribution generates a scale-free SBM, combining community structure with hub-and-spoke topology.

Why DC-SBM matters: Citation networks and social networks have communities AND degree heterogeneity. Standard spectral clustering fails on DC-SBM because high-degree nodes dominate the leading eigenvectors. Regularized spectral clustering or normalized Laplacian-based methods are needed.

6.4 SBM as GNN Benchmark Generator

GraphWorld (2022): A Google framework that generates GNN benchmarks by sampling SBM parameters from a distribution over:

  • Number of communities k[2,20]k \in [2, 20]
  • Community sizes (balanced or Zipf-distributed)
  • Within-block density pp
  • Between-block density qq

By sampling (a,b) pairs both above and below the Kesten-Stigum threshold, GraphWorld generates benchmarks that are systematically hard or easy for GNNs.

Result: Different GNN architectures excel in different SBM regimes:

  • GCN: best for homophilic SBM (dense intra-community)
  • GraphSAGE: robust across densities
  • GIN: best near the Kesten-Stigum threshold (maximally expressive)
  • MixHop: best for heterophilic SBM (dense inter-community)

This validates the theoretical prediction that no single GNN architecture dominates all graph types.


7. Spectral Properties of Random Graphs

7.1 Wigner's Semicircle Law

Wigner's semicircle law is a fundamental result in random matrix theory that describes the bulk eigenvalue distribution of random symmetric matrices.

Setup: Let Wn=1nMnW_n = \frac{1}{\sqrt{n}} M_n where MnM_n is a real symmetric n×nn \times n random matrix with:

  • Diagonal entries Mii=0M_{ii} = 0
  • Off-diagonal entries Mij=MjiM_{ij} = M_{ji} i.i.d. with mean 0 and variance σ2\sigma^2

Theorem (Wigner, 1955): The empirical spectral distribution of WnW_n:

\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i(W_n)}

converges weakly in probability to the semicircle law:

\mu_{sc}(dx) = \frac{1}{2\pi\sigma^2}\sqrt{4\sigma^2 - x^2} \cdot \mathbf{1}_{|x| \le 2\sigma} \, dx

Support: [2σ,2σ][-2\sigma, 2\sigma] - all eigenvalues of WnW_n lie in this interval asymptotically.

For G(n,p): the centered and scaled adjacency matrix W = (A - p\mathbf{1}\mathbf{1}^\top) / \sqrt{np(1-p)} has bulk eigenvalues following the semicircle law on [-2, 2] (i.e., \sigma = 1).

The outlier eigenvalue λ1np\lambda_1 \approx np comes from the mean matrix p11p\mathbf{1}\mathbf{1}^\top - it pokes out of the bulk.

For SBM: The adjacency matrix of an SBM decomposes as A=Aˉ+WA = \bar{A} + W where Aˉ=E[A]\bar{A} = \mathbb{E}[A] is the block-structured mean matrix (rank kk) and WW is a Wigner-like noise matrix. The kk outlier eigenvalues of Aˉ\bar{A} emerge from the semicircle bulk and encode the community structure, provided the signal-to-noise ratio exceeds the Kesten-Stigum threshold.
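
A minimal numerical check of the outlier-plus-semicircle picture for G(n,p) (numpy/networkx; n and p are illustrative):

```python
import numpy as np
import networkx as nx

n, p = 2_000, 0.05
A = nx.to_numpy_array(nx.gnp_random_graph(n, p, seed=0))
eigs = np.linalg.eigvalsh(A)                       # sorted ascending

print("largest eigenvalue        :", round(eigs[-1], 1), "  (theory ~ np =", n * p, ")")
print("second-largest eigenvalue :", round(eigs[-2], 1),
      "  (bulk edge ~ 2*sqrt(np(1-p)) =", round(2 * np.sqrt(n * p * (1 - p)), 1), ")")
```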

7.2 Spectral Gap and Community Detection

Spectral algorithm for SBM:

  1. Compute the leading kk eigenvectors of AA (or normalized Laplacian LsymL_{sym})
  2. Form the n×kn \times k embedding matrix UU
  3. Run kk-means on rows of UU to recover community assignments

Why does this work? For the planted partition SBM with p=a/np = a/n and q=b/nq = b/n:

  • Leading eigenvalue: \lambda_1 \approx (a + (k-1)b)/k for the k-block symmetric SBM (equal to (a+b)/2 when k = 2)
  • Second eigenvalue: \lambda_2 \approx (a-b)/2 (the signal eigenvalue, for k = 2)
  • Bulk edge: \lambda_{\text{bulk}} \approx \sqrt{(a+b)/2}

The signal eigenvalue separates from the bulk iff |(a-b)/2| > \sqrt{(a+b)/2}, which is exactly the Kesten-Stigum condition for k = 2.

Davis-Kahan bound (next subsection): If signal and noise eigenvalues are separated, the eigenvectors of AA are close to those of E[A]\mathbb{E}[A], enabling accurate community recovery.
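
A minimal sketch of the spectral algorithm on a two-block SBM (numpy/networkx; the parameters are chosen, illustratively, well above the Kesten-Stigum threshold, and a sign split stands in for k-means since k = 2):

```python
import numpy as np
import networkx as nx

n, a, b = 2_000, 20.0, 4.0                   # (a-b)^2 = 256 > 2(a+b) = 48: above KS
sizes = [n // 2, n // 2]
G = nx.stochastic_block_model(sizes, [[a / n, b / n], [b / n, a / n]], seed=0)
labels = np.array([0] * sizes[0] + [1] * sizes[1])

A = nx.to_numpy_array(G)
vals, vecs = np.linalg.eigh(A)
v2 = vecs[:, -2]                             # eigenvector of the second-largest eigenvalue
pred = (v2 > 0).astype(int)                  # its sign pattern estimates the split

accuracy = max(np.mean(pred == labels), np.mean(pred != labels))   # labels defined up to swap
print(f"fraction of nodes classified correctly: {accuracy:.3f}")
```

In the sparse regime, raw adjacency spectral clustering degrades close to the threshold (high-degree vertices localize the eigenvectors); regularized Laplacians or non-backtracking operators are the standard remedies.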

7.3 Davis-Kahan Theorem and Perturbation Bounds

Theorem (Davis-Kahan, 1970): Let A=Aˉ+WA = \bar{A} + W where Aˉ\bar{A} is symmetric with eigenvalue λ\lambda and eigenvector u\mathbf{u}, and WW is a perturbation with Wopδ\|W\|_{op} \le \delta. Let δ\delta' be the gap between λ\lambda and all other eigenvalues of Aˉ\bar{A}. Then the corresponding eigenvector u^\hat{\mathbf{u}} of AA satisfies:

\sin\angle(\hat{\mathbf{u}}, \mathbf{u}) \le \frac{\delta}{\delta' - \delta}

Application to SBM: For the 2-block SBM:

  • Signal gap: \delta' = \lambda_2(\bar{A}) - \lambda_3(\bar{A}) = (a-b)/2
  • Noise: \|W\|_{op} \le 2\sqrt{(a+b)/2} \cdot (1 + o(1)) (the Wigner semicircle radius)
  • Angle error: \sin\angle(\hat{\mathbf{u}}_2, \mathbf{u}_2) \le \frac{2\sqrt{(a+b)/2}}{(a-b)/2 - 2\sqrt{(a+b)/2}}

This is o(1) iff (a-b)/2 > 2\sqrt{(a+b)/2}, which matches the Kesten-Stigum condition up to a constant factor. When it holds, the estimated eigenvector is close to the ground-truth community indicator vector, and k-means succeeds.

For GNNs: Davis-Kahan explains why GNNs with spectral-style message passing (GCN) can do community detection above the threshold but fail below it. It also shows that adding node features (which contribute additional signal to Aˉ\bar{A}) can push GNNs above the threshold even when the graph structure alone is insufficient.

7.4 Laplacian Spectrum of G(n,p)

Normalized Laplacian: Lsym=D1/2(DA)D1/2=ID1/2AD1/2L_{sym} = D^{-1/2}(D-A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}.

For G(n,p)G(n,p) with pp constant:

  • Eigenvalues of LsymL_{sym} lie in [0,2][0, 2] (always)
  • λ1(Lsym)=0\lambda_1(L_{sym}) = 0 always (corresponding to 1/n\mathbf{1}/\sqrt{n})
  • For p fixed: the nontrivial eigenvalues concentrate near 1 (within O(1/\sqrt{np}) of it); in particular \lambda_2(L_{sym}) \to 1

Algebraic connectivity (Fiedler value): λ2(L)\lambda_2(L) (unnormalized Laplacian) is the Fiedler value, which controls:

  • Convergence rate of random walks (mixing time 1/λ2\propto 1/\lambda_2)
  • Robustness (edge connectivity λ2\ge \lambda_2)
  • Cheeger constant approximation: λ2/2h(G)2λ2\lambda_2/2 \le h(G) \le \sqrt{2\lambda_2}

For G(n,p)G(n,p) with p=c/np = c/n, c>1c > 1: λ2(L)c2c+O(1/n)\lambda_2(L) \approx c - 2\sqrt{c} + O(1/n) for the giant component.


8. Graphons: The Infinite-Size Limit

8.1 Dense Graph Sequences and the Cut Metric

As graph size nn \to \infty, what is the "limit" of a sequence of dense graphs? This question leads to graphon theory - the measure-theoretic framework that unifies all dense random graph models.

Dense graph sequence: A sequence GnG_n of graphs with V(Gn)=n|V(G_n)| = n and E(Gn)=Θ(n2)|E(G_n)| = \Theta(n^2) (constant edge density).

Cut distance: The cut norm between two symmetric functions f,g:[0,1]2Rf, g: [0,1]^2 \to \mathbb{R} is:

\|f - g\|_{\square} = \sup_{S, T \subseteq [0,1]} \left| \int_{S \times T} (f(x,y) - g(x,y)) \, dx \, dy \right|

The cut metric between graphs GG and HH on the same vertex set is d(G,H)=WGWHd_\square(G, H) = \|W_G - W_H\|_\square where WG(x,y)W_G(x,y) is the step function encoding the adjacency of GG.

After quotienting by relabelings (graph isomorphisms), we get the cut distance δ(G,H)\delta_\square(G, H).

8.2 Graphon Definition and Sampling

Definition (Graphon): A graphon is a symmetric measurable function W:[0,1]2[0,1]W: [0,1]^2 \to [0,1].

Intuition: Think of W(x,y)W(x,y) as the "connection probability" between two abstract node types xx and yy, where node types are drawn uniformly from [0,1][0,1].

Graphon sampling: To sample an nn-node graph from graphon WW:

  1. Draw ξ1,,ξnUniform[0,1]\xi_1, \ldots, \xi_n \sim \text{Uniform}[0,1] i.i.d. (latent node types)
  2. Include edge (i,j)(i,j) with probability W(ξi,ξj)W(\xi_i, \xi_j), independently

Examples of graphons and their corresponding random graph models:

| Graphon W(x,y) | Corresponding model |
|---|---|
| W(x,y) = p (constant) | Erdos-Renyi G(n,p) |
| W(x,y) = \mathbf{1}[x + y > 1] | Half-graph |
| W(x,y) = p if \lfloor kx \rfloor = \lfloor ky \rfloor, else q | k-block SBM |
| W(x,y) = x \cdot y | Product graphon (Chung-Lu style) |
| W(x,y) = \min(x,y) | Threshold model |

Lovasz-Szegedy theorem (2006): Every convergent sequence of dense graphs (in the cut metric) converges to a graphon. Conversely, every graphon arises as the limit of a graph sequence. The space of graphons with the cut metric is compact.

For AI: Graphons are the natural limiting objects for graph neural networks. A GNN ff maps graphs to outputs. If ff is continuous in the cut metric, then f(Gn)f(W)f(G_n) \to f(W) whenever GnWG_n \to W in cut distance. This is precisely the condition for a GNN to be "robust" to graph size changes.
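
Graphon sampling is a few lines of numpy. A minimal sketch (the helper name sample_from_graphon is ours; the product graphon from the table above is used as the example):

```python
import numpy as np

def sample_from_graphon(W, n, seed=0):
    """Draw latent types xi ~ Uniform[0,1], then include edge (i,j) with probability W(xi, xj)."""
    rng = np.random.default_rng(seed)
    xi = rng.uniform(size=n)
    P = W(xi[:, None], xi[None, :])            # matrix of connection probabilities
    U = rng.uniform(size=(n, n))
    A = np.triu(U < P, k=1)                    # upper triangle only, no self-loops
    return (A | A.T).astype(int), xi

W_product = lambda x, y: x * y                 # the product graphon W(x,y) = xy
A, xi = sample_from_graphon(W_product, 2_000)
density = A.sum() / (A.shape[0] * (A.shape[0] - 1))
print("edge density:", round(density, 3), "  theory: t(K_2, W) = 1/4")
```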

8.3 Homomorphism Densities and Graph Parameters

Homomorphism density: For graphs FF and GG, the homomorphism density is:

t(F, G) = \frac{\text{hom}(F, G)}{|V(G)|^{|V(F)|}}

where hom(F,G)\text{hom}(F, G) counts graph homomorphisms FGF \to G.

Examples:

  • t(K2,G)t(K_2, G) = edge density
  • t(K3,G)t(K_3, G) = triangle density
  • t(C4,G)t(C_4, G) = 4-cycle density

Graphon version: For a graphon WW and graph FF with kk vertices:

t(F, W) = \int_{[0,1]^k} \prod_{(i,j) \in E(F)} W(x_i, x_j) \, d\mathbf{x}

Theorem (Lovasz-Szegedy): Two graphons WW and WW' are equivalent (isomorphic in the graphon sense) iff t(F,W)=t(F,W)t(F, W) = t(F, W') for all graphs FF.

Graph parameters from homomorphism densities:

  • Triangle density: t(K3,W)=W(x,y)W(y,z)W(x,z)dxdydzt(K_3, W) = \int W(x,y)W(y,z)W(x,z) \, dx\, dy\, dz
  • Clustering coefficient: related to t(K3,W)/t(P2,W)3/2t(K_3, W) / t(P_2, W)^{3/2}

For ML: Substructure counting (homomorphism density estimation) is a key primitive for graph feature extraction. Ring GNNs and structural GNNs compute approximate homomorphism densities; this is one way to understand what graph structure GNNs learn.
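
Homomorphism densities of a graphon are plain integrals, so Monte Carlo estimation is straightforward. A minimal sketch (numpy; the helper name, the two example graphons, and the sample count are illustrative):

```python
import numpy as np

def triangle_density(W, samples=200_000, seed=0):
    """Monte Carlo estimate of t(K_3, W) = E[W(X,Y) W(Y,Z) W(X,Z)] with X, Y, Z ~ Uniform[0,1]."""
    rng = np.random.default_rng(seed)
    x, y, z = rng.uniform(size=(3, samples))
    return float(np.mean(W(x, y) * W(y, z) * W(x, z)))

W_const = lambda x, y: np.full_like(x, 0.3)     # constant graphon: Erdos-Renyi G(n, 0.3)
W_product = lambda x, y: x * y                  # product graphon

print("t(K3, W = 0.3):", round(triangle_density(W_const), 4), "  exact:", 0.3**3)
print("t(K3, W = xy) :", round(triangle_density(W_product), 4), "  exact: 1/27 =", round(1 / 27, 4))
```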

8.4 Graphon Neural Networks

Definition (Graphon Neural Network, Keriven & Peyre, 2019): A graphon neural network is a sequence of functions fnf_n (on nn-node graphs) that converge to a limiting function fWf_W on graphons. Formally, fnf_n is a GNN architecture such that as GnWG_n \to W in cut metric, fn(Gn)fW(W)f_n(G_n) \to f_W(W).

Key result: GCN-style message passing with the propagation operator TWh(x)=01W(x,y)h(y)dyT_W h(x) = \int_0^1 W(x,y) h(y) \, dy defines a graphon neural network. The discrete GCN is a finite approximation.

Universality of graphon NNs: Maron et al. (2019) show that equivariant graph networks are universal approximators on graphons - any graphon property that is measurable can be approximated to arbitrary accuracy.

Limitation: Universality on graphons is density-dependent. For SPARSE graph sequences (edge density 0\to 0), graphons become trivial (the zero graphon), and a different limit theory (graphexes, local limits) is needed.

Forward reference: Graphon theory connects directly to functional analysis - the operator TWh(x)=W(x,y)h(y)dyT_W h(x) = \int W(x,y)h(y) \, dy is an integral operator on L2[0,1]L^2[0,1]. Spectral theory of compact operators (Hilbert-Schmidt theorem) governs its eigenvalue decomposition. -> Full treatment: Functional Analysis

