
Random Graphs: Part 9 (Applications in Machine Learning) to Appendix K (Notation Reference)

9. Applications in Machine Learning

9.1 GraphWorld: Benchmark Generation from SBM

Problem: GNN papers often report results on 3-4 standard benchmarks (Cora, Citeseer, OGBN-Arxiv). These benchmarks may not represent the diversity of graph structures encountered in practice.

GraphWorld (Palowitch et al., 2022): A benchmark generation framework that:

  1. Parameterizes the SBM space $(k, n, p, q, \text{features})$
  2. Samples thousands of graph instances across this parameter space
  3. Evaluates GNN architectures across the full parameter landscape

Key findings:

  • No single GNN architecture dominates across all SBM parameters
  • GCN is best on homophilic dense graphs; GIN is best near the detection threshold
  • The Kesten-Stigum threshold accurately predicts when ALL GNNs fail
  • Node feature quality (signal-to-noise ratio in features) often matters more than graph structure

For practitioners: When evaluating a GNN, generate benchmarks from an SBM sweep to characterize the algorithm's operating regime, rather than relying on a few fixed benchmarks.
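
A minimal sketch of such a sweep; the parameter grid is illustrative, not GraphWorld's actual configuration, and a real benchmark would train and evaluate a GNN on each instance rather than print a structural statistic:

import networkx as nx

def sbm_instance(n, k, p_in, p_out, seed=0):
    """Sample a planted-partition SBM with k equal blocks."""
    sizes = [n // k] * k
    probs = [[p_in if i == j else p_out for j in range(k)] for i in range(k)]
    return nx.stochastic_block_model(sizes, probs, seed=seed)

# Sweep a small grid of (p_in, p_out) and record one statistic per instance.
for p_in in (0.05, 0.10, 0.20):
    for p_out in (0.005, 0.01, 0.02):
        G = sbm_instance(600, 3, p_in, p_out)
        giant = max(nx.connected_components(G), key=len)
        print(f"p_in={p_in:.3f} p_out={p_out:.3f} "
              f"giant={len(giant) / G.number_of_nodes():.2f}")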

9.2 Graph Generation: GRAN, GDSS, DiGress

Graph generation models learn to sample new graphs from a distribution. Framed probabilistically: learn a distribution $p_\theta(G)$ that matches a target distribution $p^*(G)$.

GRAN (Graph Recurrent Attention Networks, 2019): Generates graphs node-by-node, at each step attending to previously generated nodes. The attention mechanism implicitly models preferential attachment - recently added high-degree nodes attract more attention, reproducing scale-free structure.

GDSS (Jo et al., 2022): Score-based diffusion model for graphs. Joint diffusion over node features and edge features. Samples new graphs by reversing a diffusion process that gradually adds noise. The score function implicitly learns the SBM-like block structure of training graphs.

DiGress (Vignac et al., 2022): Discrete diffusion - adds/removes edges following a Markov process. Denoising model is a graph transformer that learns to predict the original edge from the noised version. DiGress can generate molecular graphs and large social networks by learning implicit graphon structure.

Random graph theory connection: Graph generation is essentially graphon estimation. Given samples from $p^*(G)$ (real graphs), estimate the underlying graphon $W^*$ such that sampling from $W^*$ approximates $p^*$. The approximation quality is measured by the cut metric $d_\square$.

9.3 LLM Attention as a Random Graph Process

Attention graph: In a transformer with $L$ layers and $H$ heads, the attention pattern at layer $\ell$, head $h$ defines a complete directed graph on tokens where the weight of edge $(i,j)$ is the attention score $\alpha^{(\ell,h)}_{ij}$.

Sparse attention = random graph: Sparse attention mechanisms (Longformer, BigBird, Reformer) explicitly sparsify the attention graph, keeping only $O(n)$ edges rather than $O(n^2)$. The sparsification pattern is often random or pseudo-random.

Random graph analysis of attention:

  • Connectivity: Is the sparse attention graph connected? If not, information cannot flow between disconnected components. ER theory predicts connectivity iff the expected degree is $\ge \ln n$.
  • Small-world: BigBird combines local window attention (ring lattice) with random global tokens (rewiring) and special tokens (hubs). This exactly matches the Watts-Strogatz construction!
  • Expander properties: Expander graphs (high spectral gap) are the best sparse graphs for information flow. Ramanujan graphs achieve the optimal spectral gap - this motivates expander-based sparse attention.

Result: The random graph structure of sparse attention patterns determines the theoretical expressiveness of the transformer. An attention graph that is disconnected or has large diameter cannot capture long-range dependencies regardless of the weight values.
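
A sketch that treats a BigBird-style pattern as an explicit graph and checks the properties above; the window size, number of random edges, and single global token are illustrative choices, not BigBird's actual hyperparameters:

import random
import networkx as nx

def sparse_attention_graph(n, window=3, n_random=2, n_global=1, seed=0):
    """Undirected sketch of a BigBird-style attention pattern:
    local window + random long-range edges + global hub tokens."""
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):                    # local sliding window (ring lattice)
        for d in range(1, window + 1):
            G.add_edge(i, (i + d) % n)
    for i in range(n):                    # random long-range edges (rewiring)
        for _ in range(n_random):
            j = rng.randrange(n)
            if j != i:
                G.add_edge(i, j)
    for g in range(n_global):             # global tokens (hubs)
        for i in range(n):
            if i != g:
                G.add_edge(g, i)
    return G

G = sparse_attention_graph(512)
print(nx.is_connected(G), nx.diameter(G))  # connected, with a small diameter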

9.4 Lottery Ticket Hypothesis and Sparse Subgraphs

Lottery Ticket Hypothesis (Frankle & Carbin, 2019): A randomly initialized dense network contains sparse subnetworks ("winning tickets") that, when trained in isolation from the same initialization, achieve accuracy comparable to the dense network.

Random graph framing: Think of the neural network as a random graph where:

  • Nodes = neurons
  • Edges = weights (including the weight value)
  • Sparsification = edge removal

The winning ticket is a sparse subgraph that retains the connectivity and flow properties of the dense graph. Random graph percolation theory describes when sparse subgraphs retain giant component connectivity.

Percolation interpretation: If we randomly retain each edge with probability $\rho$ (an approximation to magnitude-based pruning), the network retains its giant component iff $\rho > \rho_c$, where $\rho_c$ is the percolation threshold. For ER-like neural network graphs, $\rho_c \approx 1/(np_0)$ where $p_0$ is the original edge density.

Practical implication: Networks can be pruned to 90%+ sparsity (i.e., $\rho \approx 0.1$) while retaining performance, consistent with the existence of a giant percolating subgraph at such densities for typical neural network width/depth ratios.
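
A minimal percolation sketch of this claim, using an ER graph as a stand-in for the network's connectivity graph (all parameters illustrative):

import random
import networkx as nx

def giant_fraction(G):
    return max(len(c) for c in nx.connected_components(G)) / G.number_of_nodes()

n, p0 = 2000, 10 / 2000            # expected degree n*p0 = 10, so rho_c ~ 0.1
G = nx.gnp_random_graph(n, p0, seed=0)
rng = random.Random(0)
for rho in (0.05, 0.1, 0.2, 0.5, 1.0):
    H = G.copy()                   # retain each edge independently with prob rho
    H.remove_edges_from([e for e in G.edges if rng.random() > rho])
    print(rho, round(giant_fraction(H), 3))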

9.5 Social Network Analysis at Scale

Community detection at scale: For billion-node graphs (Facebook, Twitter), exact SBM community detection is computationally infeasible. In practice:

  • Louvain algorithm: greedy modularity maximization, $O(n \log n)$
  • Label propagation: message passing on the graph, converges in $O(\text{diam})$ steps
  • GraphSAGE + semi-supervised: use a few labeled nodes (community labels) to train a GNN

Random graph models as null models: When analyzing a real social network, we ask: "Is the observed community structure more than what we'd expect by chance?" We compare to a null model with the same degree sequence (the configuration model) and test whether the observed modularity exceeds the null expectation.

Link prediction: Predicting missing edges $(u,v)$ in a social graph using random graph models. Under ER: $\mathbb{P}[(u,v) \in E] = p$ (the same for all pairs). Under SBM: $\mathbb{P}[(u,v) \in E] = B_{\sigma(u), \sigma(v)}$ - within-community edges are more likely. GNNs for link prediction learn to approximate this block-structured probability.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing $G(n,p)$ and $G(n,m)$ | $G(n,p)$ has a random edge count; $G(n,m)$ has exactly $m$ edges - they agree asymptotically but differ in finite samples | Use $G(n,p)$ when you want independence; $G(n,m)$ when you want exact count control |
| 2 | Assuming ER is a good null model for all real graphs | ER lacks clustering, hubs, and communities - major features of real networks | Use the configuration model (fixed degree sequence) or SBM as the null model |
| 3 | Saying Barabasi-Albert generates power laws with any exponent | BA always gives $\gamma = 3$; only extensions (fitness, rewiring) change $\gamma$ | Use generalized preferential attachment $\Pi(v) \propto \deg(v)^\alpha$ for $\gamma = 1 + 1/(\alpha - 1)$ |
| 4 | Confusing clustering coefficient with transitivity | Local CC averages over vertices; global transitivity is $3 \times$ triangles / paths of length 2. They differ when the degree distribution is heterogeneous | Use transitivity for the global property; local CC for the per-node property |
| 5 | Thinking the Kesten-Stigum threshold is a computational limit | For the 2-block SBM, KS is an information-theoretic limit - it bounds what ANY algorithm can do, not just efficient ones. Computational hardness (SDP vs AMP) is a separate question | Distinguish between information-theoretic and computational thresholds |
| 6 | Applying graphon theory to sparse graphs | Graphons describe the limit of DENSE sequences (constant edge density). Sparse graphs ($m = O(n)$) have trivial graphon limits (the zero function) | Use local weak limits (Benjamini-Schramm) or graphex theory for sparse sequences |
| 7 | Equating "scale-free" with "power law" | Scale-free means the DEGREE distribution is a power law. Many other distributions (log-normal, Pareto) look similar. Broido & Clauset (2019) show many claimed scale-free networks don't pass rigorous power-law tests | Use maximum likelihood to fit and compare competing distributions (power law, log-normal, exponential) |
| 8 | Ignoring the giant component when computing path lengths | Average path length on a disconnected graph is undefined (infinite) or meaningless without restricting to the giant component | Always compute path lengths within the largest connected component |
| 9 | Assuming WS small-world is scale-free | WS generates Poisson-like degree distributions (each node rewires independently). It has the small-world property but is NOT scale-free | Combine WS with preferential attachment for small-world + scale-free |
| 10 | Using the adjacency matrix spectrum directly for SBM clustering when the graph is sparse | For sparse SBM, the adjacency eigenvalues don't separate cleanly (the semicircle radius is comparable to the signal) | Replace $A$ with $L_{rw} = D^{-1}A$ or the Bethe-Hessian $H(\rho) = (\rho^2-1)I - \rho A + D$ |
| 11 | Treating the WS model as a generative process for new nodes | WS is defined on a fixed set of $n$ nodes with rewiring - it does not define how to add new nodes. It's not a growing network model | Use BA for growing network models; use WS for fixed-size networks |
| 12 | Forgetting that graphon equivalence classes ignore node labels | Two graphons $W$ and $W'$ are equivalent if one is a measure-preserving relabeling of the other. Most graph statistics depend only on the graphon equivalence class | Work with homomorphism densities (relabeling-invariant) when comparing graphons |

11. Exercises

Exercise 1 * - Phase Transition Simulation

Simulate the Erdos-Renyi phase transition computationally. For $n = 2000$ nodes and $p = c/n$ with $c \in [0.5, 3.0]$:

(a) Generate $G(n,p)$ for 20 values of $c$ linearly spaced in $[0.5, 3.0]$.

(b) For each graph, compute the size of the largest connected component $L_1$ and the second largest $L_2$.

(c) Plot $L_1/n$ and $L_2/n$ as functions of $c$. Identify the phase transition point visually.

(d) Overlay the theoretical prediction $\beta(c)$ satisfying $\beta = 1 - e^{-c\beta}$. Compute $\beta(c)$ numerically (Newton's method or bisection) for each $c$.

(e) Compute and plot $L_2/n$. What happens to the second-largest component at criticality? Explain from branching process theory.
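
A starter sketch for parts (a)-(c), assuming networkx is available:

import numpy as np
import networkx as nx

n = 2000
cs = np.linspace(0.5, 3.0, 20)
L1, L2 = [], []
for c in cs:
    G = nx.gnp_random_graph(n, c / n)
    sizes = sorted((len(comp) for comp in nx.connected_components(G)), reverse=True)
    L1.append(sizes[0] / n)
    L2.append((sizes[1] if len(sizes) > 1 else 0) / n)
# Plot L1 and L2 against cs (e.g., with matplotlib) and look for the kink at c = 1.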


Exercise 2 * - Degree Distribution Analysis

(a) Generate $G(n,p)$ with $n = 10000$ and $p = 5/n$. Compute the empirical degree distribution and overlay the theoretical $\text{Poisson}(5)$ PMF.

(b) Generate a BA graph with $n = 10000$ and $m = 3$. Compute the empirical degree distribution on a log-log scale and fit a power law using linear regression on the tail ($k \ge 10$). Report the estimated exponent $\hat{\gamma}$.

(c) Compare the two distributions using a Q-Q plot. What are the key structural differences?

(d) Compute the maximum degree in each model. Derive theoretically why $\max_v \deg(v) = \Theta(\log n / \log \log n)$ for ER and $\Theta(\sqrt{n})$ for BA.


Exercise 3 * - Small-World Analysis

Implement the Watts-Strogatz model from scratch.

(a) Build the ring lattice: $n = 500$ nodes, each connected to its $k = 10$ nearest neighbors.

(b) For $\beta \in \{0, 0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0\}$, rewire each edge independently with probability $\beta$. Compute $C(\beta)$ and $L(\beta)$ for each value.

(c) Plot the normalized clustering coefficient $C(\beta)/C(0)$ and normalized average path length $L(\beta)/L(0)$ on the same plot (log scale for $\beta$). Identify the small-world regime.

(d) Verify that $C(\beta) \approx C(0)(1-\beta)^3$ for small $\beta$. What does this formula say about how clustering degrades with rewiring?
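
A starter sketch for parts (a)-(b), building the ring lattice and rewiring by hand (networkx is used only for the summary statistics):

import random
import networkx as nx

def watts_strogatz(n, k, beta, seed=0):
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):                   # ring lattice: k nearest neighbors
        for d in range(1, k // 2 + 1):
            G.add_edge(i, (i + d) % n)
    for u, v in list(G.edges):           # rewire each edge with probability beta
        if rng.random() < beta:
            w = rng.randrange(n)
            if w != u and not G.has_edge(u, w):
                G.remove_edge(u, v)
                G.add_edge(u, w)
    return G

G = watts_strogatz(500, 10, 0.01)
# Assumes the rewired graph stays connected (typical for these parameters).
print(nx.average_clustering(G), nx.average_shortest_path_length(G))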


Exercise 4 ** - Stochastic Block Model and Community Detection

(a) Sample an SBM with $n = 500$, $k = 2$ equal communities, $p = a/n$, $q = b/n$. Use $(a,b) = (20, 5)$ (above the Kesten-Stigum threshold).

(b) Apply spectral clustering: compute the second eigenvector of the adjacency matrix, threshold at 0 (positive = community 1, negative = community 2), and compute accuracy (fraction of correctly classified nodes, up to community permutation).

(c) Repeat with $(a,b) = (10, 5)$ (near threshold) and $(a,b) = (6, 4)$ (below threshold). How does accuracy vary?

(d) The Kesten-Stigum threshold for the 2-block SBM is $(a-b)^2 = 2(a+b)$. Verify your experimental results are consistent with this threshold.

(e) *** Implement the belief propagation algorithm (AMP / approximate message passing) for the 2-block SBM and compare its accuracy to spectral clustering near the threshold.


Exercise 5 ** - Wigner's Semicircle Law

(a) Generate a Wigner matrix $W_n = (M + M^\top) / \sqrt{2n}$ where $M$ has i.i.d. $\text{Normal}(0,1)$ entries, for $n \in \{50, 200, 500, 1000\}$.

(b) Plot the empirical spectral distribution (histogram of eigenvalues) for each $n$. Overlay the theoretical semicircle density $\frac{1}{2\pi}\sqrt{4 - x^2}$ for $|x| \le 2$.

(c) Now set $W_n = A/\sqrt{np(1-p)}$ where $A$ is the adjacency matrix of $G(n,p)$ with $p = 0.3$, $n = 1000$. Plot the empirical spectral distribution. Identify the outlier eigenvalue corresponding to the average degree.

(d) For the SBM with 2 communities ($n = 1000$, $a = 15$, $b = 3$): plot the eigenvalue spectrum of $A$. Identify which eigenvalues encode community structure and which are bulk noise.


Exercise 6 ** - Graphon Estimation

(a) Generate a sequence of SBM graphs with $n \in \{100, 500, 2000\}$, $k = 3$ communities, and block matrix

$$B = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.7 & 0.2 \\ 0.1 & 0.2 & 0.6 \end{pmatrix}$$

(b) For each graph, sort vertices by community label (oracle information) and display the sorted adjacency matrix as a heatmap. Does it converge to the block graphon as $n$ grows?

(c) Without oracle labels: apply spectral clustering to estimate community labels, then display the sorted (estimated) adjacency matrix. Measure the cut distance $d_\square$ between the estimated and true graphon.

(d) Implement the "histogram graphon estimator": divide $[0,1]^2$ into $k^2$ bins and estimate $W$ by averaging edges within each bin. Compute the $L^2$ error $\|W_{\text{est}} - W^*\|_{L^2}$.


Exercise 7 *** - Giant Component Critical Window

This exercise studies the fine-grained behavior near the critical point $p = 1/n$.

(a) For $p = (1 + \lambda n^{-1/3})/n$ with $\lambda \in [-3, 3]$ and $n \in \{500, 2000, 8000\}$, compute $L_1 / n^{2/3}$ for 50 trials each. Plot the distribution of $L_1/n^{2/3}$ vs $\lambda$.

(b) Observe that the distribution has a universal shape (independent of $n$ for large $n$). This is the Brownian excursion limit - the critical window scaling is $n^{2/3}$ for the component size and $n^{-1/3}$ for the window width (relative to $p = 1/n$).

(c) Compute the mean and standard deviation of $L_1 / n^{2/3}$ as functions of $\lambda$. Show that the mean increases smoothly through $\lambda = 0$ (no jump) at finite $n$.

(d) Compare to the predicted infinite-$n$ limit: for $\lambda > 0$, $\mathbb{E}[L_1/n] \to \beta(1 + \lambda n^{-1/3}) \approx 2\lambda n^{-1/3}$, using $\beta(1+\epsilon) \approx 2\epsilon$ for small $\epsilon$. Verify this scaling.


Exercise 8 *** - Preferential Attachment Dynamics

(a) Implement the Barabasi-Albert preferential attachment model for $n = 5000$ nodes with $m = 2$ edges per new node. Use the efficient alias method or a linear scan for sampling.

(b) After generation, fit the degree distribution tail to a power law $P(k) = C k^{-\gamma}$ using maximum likelihood estimation on $k \ge k_{\min}$ for a suitable $k_{\min}$.

(c) Track the degree of each node over time as the network grows. Plot $k_i(t)$ for nodes added at times $t_i \in \{10, 100, 500, 1000\}$. Verify the mean-field prediction $k_i(t) \approx m\sqrt{t/t_i}$.

(d) Implement "fitness-based" preferential attachment: $\Pi(v) \propto \eta_v \deg(v)$ where $\eta_v \sim \text{Uniform}[0,1]$ is a fixed fitness. Compare the resulting degree distribution to standard BA. Does the power-law exponent change?

(e) Simulate a targeted attack: iteratively remove the highest-degree node. Plot the size of the giant component as a function of the fraction of nodes removed. Compare to random removal. What fraction of nodes must be targeted to destroy the giant component?


12. Why This Matters for AI (2026 Perspective)

| Concept | Impact on AI/ML |
|---------|-----------------|
| ER Phase Transition | Connectivity threshold determines information flow in GNNs on sparse graphs; GCN cannot aggregate across disconnected components, giving a hard limit on performance |
| Poisson Degree Distribution | Most GNN benchmark graphs (Cora, Citeseer) have near-Poisson degrees; GCN's symmetric normalization is optimal for Poisson-degree graphs but suboptimal for scale-free ones |
| Kesten-Stigum Threshold | Hard information-theoretic limit on community detection; no GNN can exceed it on SBM data regardless of architecture, depth, or width |
| Scale-Free Networks | Real knowledge graphs (Wikidata, Freebase) are scale-free; hub nodes with $\Theta(\sqrt{n})$ degree dominate GCN aggregation and require degree-aware normalization or degree bucketing |
| Small-World Structure | Transformer attention graphs have small-world properties; BigBird/Longformer mimic the WS construction (local window + random long-range); designing optimal sparse attention = finding optimal WS parameters |
| Graphons | Theoretical foundation for GNN universality; GNNs that are continuous in the cut metric topology can generalize across graph sizes; graphon theory predicts which graph properties a GNN can and cannot learn |
| Graphon Neural Networks | First provably universal GNN framework; used to prove that message-passing GNNs cannot distinguish non-isomorphic graphs with identical WL certificates |
| Spectral Gap | Controls the convergence rate of GNN over-smoothing (energy decays at a rate set by $\lambda_2$); higher Fiedler value -> faster smoothing -> shallower optimal depth |
| Davis-Kahan | Explains why spectral GNNs (GCN) succeed above the community detection threshold and fail below it; the same theorem underlies learning theory for graph classification |
| Wigner Semicircle | Bulk eigenvalue distribution of noise in graph learning; signal from community structure must exceed the semicircle radius to be learnable - the same condition as Kesten-Stigum |
| Configuration Model | Null model for testing GNN hypotheses: does the GNN learn graph structure, or just the degree sequence? Test by evaluating on configuration model graphs with the same degree sequence |
| Random Graph Generation | DiGress, GDSS, GRAN all learn distributions over graphs - effectively learning graphons; quality is measured by the cut distance between the learned and true graphon |

13. Conceptual Bridge

Backward: What This Builds On

Random graph theory synthesizes several branches of mathematics developed in earlier sections:

From Spectral Graph Theory (04): The Laplacian eigenvalues $\lambda_2, \ldots, \lambda_n$ studied there take on probabilistic meaning here - for random graphs, they become random variables following the Wigner semicircle law. The spectral gap $\lambda_2$ that controls mixing time in deterministic graphs now becomes a random quantity whose distribution determines community detectability.

From Probability Theory (07-Probability-Statistics): All threshold results (giant component, connectivity, community detection) use first and second moment methods, Chernoff bounds, and the Poisson limit theorem. Branching process theory is the probabilistic tool that gives the exact threshold equation $\beta = 1 - e^{-c\beta}$.

From Graph Neural Networks (05): The SBM community detection problem IS the node classification problem that GNNs solve in practice. The Kesten-Stigum threshold gives the hard limit on what any GNN can achieve, while Davis-Kahan shows why spectral GNNs succeed above this threshold.

Forward: What This Enables

Functional Analysis (12): The graphon operator $T_W h(x) = \int_0^1 W(x,y)h(y)\,dy$ is a Hilbert-Schmidt operator on $L^2[0,1]$. Its spectral theory - the Hilbert-Schmidt theorem giving a countable orthonormal eigenfunction expansion - is the infinite-dimensional generalization of the adjacency matrix eigendecomposition. This is the full treatment of graphon operators.

Graph Algorithms (07): Random graph models motivate efficient algorithms: spectral clustering (from SBM theory), Louvain community detection (from modularity theory), and link prediction (from random graph null models). The average-case complexity of graph problems is analyzed using random graph models.

Information Theory (09): The Kesten-Stigum threshold is an information-theoretic limit - it follows from a channel capacity argument. The mutual information between the community labels $\sigma$ and the observed graph $G$ is zero below the threshold, making recovery impossible regardless of computation. This connects random graphs to channel coding theory.

POSITION IN CURRICULUM
========================================================================

  04 Spectral Graph Theory
       |  Laplacians, eigenvalues, Cheeger inequality
       |
       +--> 05 Graph Neural Networks
       |         GCN, GAT, GIN, MPNN, expressiveness
       |
       +--> 06 Random Graphs  <=== YOU ARE HERE
                |
                +- Erdos-Renyi: phase transitions, thresholds
                +- Watts-Strogatz: small-world, navigation
                +- Barabasi-Albert: scale-free, preferential attachment
                +- SBM: communities, Kesten-Stigum threshold
                +- Spectral theory: semicircle law, Davis-Kahan
                +- Graphons: infinite limits, universality
                       |
                       v
                07 Graph Algorithms
                       |
                       v
                12 Functional Analysis <-- graphon operators T_W
                       Hilbert-Schmidt theory, L^2 spectral decomposition

========================================================================

The central insight of this section - that random graph models are not merely toy examples but rigorous frameworks for understanding real network behavior - carries forward into every domain where graphs appear. In machine learning, understanding when community structure is detectable, when information can flow across a sparse graph, and when a graph distribution can be learned from samples are fundamental questions with precise mathematical answers, and those answers come from random graph theory.




Appendix A: Branching Process Theory

A.1 Galton-Watson Branching Processes

A Galton-Watson process is a model of population growth where each individual independently produces offspring according to a fixed offspring distribution $\{p_k\}_{k \ge 0}$.

Formal definition: Let $Z_0 = 1$ (a single ancestor). At generation $t$, if $Z_t = z$, each of the $z$ individuals produces offspring independently according to $\{p_k\}$:

$$Z_{t+1} = \sum_{i=1}^{Z_t} \xi_i^{(t)}$$

where the $\xi_i^{(t)} \sim \{p_k\}$ are i.i.d.

Mean offspring: $\mu = \sum_k k p_k = \mathbb{E}[\xi]$.

Probability generating function (PGF): $\phi(s) = \sum_k p_k s^k = \mathbb{E}[s^\xi]$.

Extinction probability: The probability of ultimate extinction $q = \lim_{t \to \infty} \mathbb{P}[Z_t = 0]$ is the smallest fixed point of $q = \phi(q)$ in $[0,1]$.

Theorem (Extinction):

  • If $\mu \le 1$: $q = 1$ (certain extinction)
  • If $\mu > 1$: $q < 1$ (positive survival probability $= 1 - q$)

Connection to ER: For $G(n,p)$ with $p = c/n$, exploring the component of a vertex proceeds like a branching process where each node has $\text{Binomial}(n-k, c/n) \approx \text{Poisson}(c)$ offspring (unexplored neighbors). The giant component fraction $\beta$ satisfies:

$$1 - \beta = q = e^{-c\beta} = \phi_{\text{Poisson}(c)}(1 - \beta)$$

exactly the fixed-point equation $\beta = 1 - e^{-c\beta}$. A numeric solver follows below.
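
A minimal numeric solver for the fixed-point equation by direct iteration (for $c \le 1$ the iteration collapses to $\beta = 0$, matching certain extinction):

import numpy as np

def giant_fraction(c, iters=200):
    """Iterate beta <- 1 - exp(-c * beta) to its stable fixed point."""
    beta = 0.5                     # any positive starting value works for c > 1
    for _ in range(iters):
        beta = 1 - np.exp(-c * beta)
    return beta

for c in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(c, round(giant_fraction(c), 4))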

A.2 Multi-Type Branching Processes

For the SBM with $k$ communities, the local neighborhood exploration is a multi-type branching process where the type of an individual is its community label.

Offspring matrix: $M_{rs}$ = expected number of type-$s$ offspring from a type-$r$ parent.

For the symmetric 2-block SBM with $p = a/n$, $q = b/n$:

$$M = \frac{1}{2}\begin{pmatrix} a & b \\ b & a \end{pmatrix}$$

Survival theorem: The multi-type branching process survives iff $\rho(M) > 1$, where $\rho(M)$ is the spectral radius.

Eigenvalues of $M$: $\lambda_+ = (a+b)/2$, $\lambda_- = (a-b)/2$.

  • A giant component exists iff $\lambda_+ > 1$, i.e., $a + b > 2$ (average degree condition).
  • Community detection possible iff the "community eigenvalue" $\lambda_- = (a-b)/2 > 1$ ... but this is not quite right. The actual condition involves the non-backtracking matrix.

Non-backtracking operator: The correct spectral condition for SBM community detection uses the non-backtracking (Hashimoto) operator $B$ on directed edges. The eigenvalue of $B$ corresponding to community structure is $(a-b)/2$, and the bulk spectral radius is $\sqrt{(a+b)/2}$. Community detection is possible iff:

$$\frac{a-b}{2} > \sqrt{\frac{a+b}{2}}$$

i.e., $(a-b)^2 > 2(a+b)$ - precisely the Kesten-Stigum threshold.

A.3 Critical Branching Processes

At criticality $\mu = 1$ (Poisson offspring with $c = 1$), the branching process behaves differently:

Yaglom's theorem: Conditioned on survival to generation $t$, the population $Z_t / t$ converges in distribution to an Exponential(1) random variable:

$$\mathbb{P}[Z_t / t > x \mid Z_t > 0] \to e^{-x}$$

For the ER critical window: At $p = 1/n$, the largest component has size $\Theta(n^{2/3})$ and there are $\Theta(n^{1/3})$ such components. The component size distribution follows Yaglom's theorem for the critical Poisson branching process, rescaled by $n^{2/3}$.


Appendix B: Configuration Model

B.1 Definition

The configuration model generates a random graph with a prescribed degree sequence $(d_1, d_2, \ldots, d_n)$.

Construction:

  1. Give vertex $v$ exactly $d_v$ "half-edges" (stubs)
  2. Pair up all $2m = \sum_v d_v$ half-edges uniformly at random
  3. Each pairing creates an edge

Result: A random multigraph (may have self-loops and multi-edges) with the given degree sequence.

Properties:

  • For degree sequences with bounded maximum degree: the expected number of self-loops and multi-edges is $O(1)$, so the probability of producing a simple graph is bounded away from zero
  • Conditioned on simplicity: uniform over all simple graphs with the given degree sequence

Why use it? The configuration model is the correct null model for testing graph hypotheses. Instead of comparing to ER (wrong degree distribution), compare to configuration model (same degree distribution, no other structure). If a property (e.g., high clustering) exceeds what configuration model predicts, it's genuinely structural, not just a degree artifact.

B.2 Giant Component in Configuration Model

For the configuration model with degree distribution $P(k)$:

Molloy-Reed criterion: A giant component exists iff:

$$\sum_k k(k-2) P(k) > 0 \iff \frac{\mathbb{E}[k^2]}{\mathbb{E}[k]} > 2$$

Interpretation: The excess degree distribution $q_k = (k+1)P(k+1)/\mathbb{E}[k]$ governs the branching factor of the exploration process. The process is supercritical (a giant component exists) iff the mean of $q_k$ exceeds 1, i.e., $\mathbb{E}[k(k-1)]/\mathbb{E}[k] > 1$.

For scale-free networks: With $P(k) \propto k^{-\gamma}$, $\mathbb{E}[k^2]$ diverges when $\gamma \le 3$. Hence the Molloy-Reed criterion is always satisfied for BA-style scale-free networks - a giant component exists for arbitrarily sparse scale-free graphs.

For ER: $P(k) = e^{-c} c^k / k!$ (Poisson), so $\mathbb{E}[k^2] = c^2 + c$ and $\mathbb{E}[k] = c$. Molloy-Reed: $(c^2 + c)/c > 2 \Leftrightarrow c + 1 > 2 \Leftrightarrow c > 1$ - exactly the ER threshold. A quick empirical check follows below.
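
A quick empirical check of the criterion on a sampled degree sequence, here the Poisson degrees of an ER graph with $c = 1.5$:

import numpy as np
import networkx as nx

G = nx.gnp_random_graph(5000, 1.5 / 5000, seed=0)
degs = np.array([d for _, d in G.degree()])
ratio = (degs ** 2).mean() / degs.mean()       # E[k^2] / E[k], empirically
print(ratio, "-> giant component expected" if ratio > 2 else "-> subcritical")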

B.3 Clustering in Configuration Model

For the configuration model:

$$C_{\text{conf}} = \frac{(\mathbb{E}[k^2] - \mathbb{E}[k])^2}{n \cdot \mathbb{E}[k]^3} \to 0$$

as $n \to \infty$ (for a fixed degree distribution). The configuration model has a vanishing clustering coefficient - it is locally tree-like.

Real networks vs. configuration model: Comparing observed clustering to $C_{\text{conf}} \approx 0$ tests for genuine clustering beyond degree effects. Small-world networks have $C \gg C_{\text{conf}}$ - they have clustering not explained by the degree distribution alone.


Appendix C: Percolation Theory

C.1 Bond Percolation on Graphs

Bond percolation: Given a graph $G$ and probability $\rho \in [0,1]$, independently retain each edge with probability $\rho$. Let $G_\rho$ denote the resulting random subgraph.

Site percolation: Independently retain each vertex with probability $\rho$.

Critical probability: The percolation threshold $\rho_c$ is the infimum of $\rho$ for which $G_\rho$ has an infinite component (on infinite graphs) or a giant component of size $\Theta(n)$ (on finite graphs).

On the integer lattice $\mathbb{Z}^d$: Exact thresholds:

  • $d = 1$: $\rho_c = 1$ (must keep all edges)
  • $d = 2$ (square lattice): $\rho_c = 1/2$ exactly (by self-duality)
  • $d \ge 3$: $\rho_c < 1$; harder to compute exactly

On ER graphs: Bond percolation on $G(n,p_0)$ with retention probability $\rho$ gives $G(n, \rho p_0)$. The percolation threshold is $\rho_c = 1/(np_0)$, so the giant component survives iff $\rho > 1/(np_0)$, i.e., $\rho np_0 > 1$.

C.2 Connection to Neural Network Pruning

Neural network weight pruning is isomorphic to bond percolation:

  • Dense network graph $G$ (neurons = nodes, weights = edges)
  • Pruning mask $m_{ij} \in \{0,1\}$ with $\mathbb{P}[m_{ij} = 1] = \rho$ (retention probability)
  • Pruned network $G_\rho$ must retain computational connectivity

For the network to maintain its computational capacity, it must retain a giant component. The percolation threshold $\rho_c$ gives the minimum retention rate.

Structured pruning: Head pruning in transformers (removing entire attention heads) is site percolation on the attention-head graph. Magnitude pruning selects edges by weight magnitude - approximately bond percolation with $\rho$ = fraction of weights retained.

Lottery ticket connection: A winning lottery ticket is precisely a giant percolating subgraph that retains the "signal paths" of the original network. The existence of such subgraphs at high sparsity ($\rho \approx 0.1$) is guaranteed by percolation theory for sufficiently wide networks.

C.3 Expanders and Optimal Percolation

Expander graph: A $d$-regular graph $G$ on $n$ nodes with spectral gap $\lambda(G) = d - \lambda_2(A_G)$. A large spectral gap means fast mixing and high robustness.

Percolation on expanders: For a $d$-regular expander, the percolation threshold is $\rho_c \approx 1/(d-1)$. The giant component after percolation at $\rho > \rho_c$ has size $\ge (1 - \epsilon)n$ for small $\epsilon$.

Implication for sparse attention: Expander-based sparse attention (using Ramanujan graphs with spectral gap $\approx d - 2\sqrt{d-1}$) achieves:

  • $O(n)$ edges (efficient)
  • $O(\log n)$ diameter (short paths)
  • Maximum spectral gap (optimal information flow)
  • Robustness to random edge removal (good percolation threshold)

This is why expanders are theoretically optimal sparse attention patterns, even if not used in practice due to implementation complexity.


Appendix D: Threshold Functions - Complete Table

| Property $\mathcal{P}$ | Threshold $p^*(n)$ | Window width | Notes |
|---|---|---|---|
| Contains an edge | $n^{-2}$ | $\Theta(p^*)$ | First property to appear |
| Contains a triangle | $n^{-1}$ | $\Theta(p^*)$ | $K_3$ threshold |
| Contains $K_4$ | $n^{-2/3}$ | $\Theta(p^*)$ | $m(K_4) = 3/2$ |
| Contains $K_r$ | $n^{-2/(r-1)}$ | $\Theta(p^*)$ | $m(K_r) = (r-1)/2$ |
| Giant component | $1/n$ | $\Theta(n^{-4/3})$ | Phase transition (critical window) |
| Connectivity | $\ln(n)/n$ | $1/n$ (sharp!) | Very sharp threshold |
| Diameter $\le 2$ | $\sqrt{\ln(n)/n}$ | $\Theta(p^*)$ | |
| Contains a Hamiltonian cycle | $\ln(n)/n$ | $1/n$ | Same as connectivity! |
| Planarity loss | $1/n$ | $\Theta(1/n)$ | Planarity threshold |
| Chromatic number $> k$ | Problem-dependent | Varies | Open for exact $k$ |

Sharp vs. coarse thresholds:

A threshold $p^*(n)$ is sharp if the transition from probability 0 to probability 1 occurs in a window of width $o(p^*)$. Connectivity and Hamiltonian cycles have sharp thresholds (window width $O(1/n) \ll \ln(n)/n$).

A threshold is coarse if the window is $\Theta(p^*)$. Most subgraph appearance thresholds are coarse - the probability transitions from $\epsilon$ to $1-\epsilon$ over a multiplicative constant change in $p$.

Friedgut's theorem (1999): Every monotone property has a sharp threshold OR can be approximated by a "local" property. Most natural properties (connectivity, $k$-colorability) have sharp thresholds. This is the technical statement of the Bollobas-Thomason heuristic.


Appendix E: Random Geometric Graphs

E.1 Definition

A random geometric graph $G(n, r)$ is constructed by:

  1. Placing $n$ nodes uniformly at random in $[0,1]^2$ (or another metric space)
  2. Connecting two nodes iff their Euclidean distance is $\le r$

Properties:

  • Soft threshold for connectivity: $r^* = \sqrt{\ln n / (\pi n)}$ - the same $\ln n / n$ scaling as ER
  • High clustering: Nodes close to each other share many common neighbors (geometric constraint) - $C \to 1$ as $r \to 0$ with $nr^2$ fixed
  • Bounded degree distribution: All degrees in $[0, \pi n r^2]$
  • No long-range edges: Maximum edge length $= r$, giving a large diameter for small $r$

For AI: Random geometric graphs model sensor networks, robotic swarms, and spatial point processes. They also model attention in vision transformers where tokens correspond to image patches at geometric positions - nearby patches should have high attention, distant patches low.

E.2 Comparison to Other Models

| Property | ER | WS | BA | Geometric |
|---|---|---|---|---|
| Degree dist. | Poisson | Poisson-like | Power law | Bounded |
| Clustering | Low | High | Low | Very high |
| Path length | $O(\log n)$ | $O(\log n)$ | $O(\log n / \log \log n)$ | $O(1/r)$ |
| Spatial | No | No | No | Yes |
| Community struct. | None | None | None | Implicit (spatial) |

E.3 Connection to Kernel Methods

The random geometric graph adjacency matrix is a kernel matrix:

$$A_{ij} = \mathbf{1}[\|x_i - x_j\|_2 \le r] = k(x_i, x_j)$$

with kernel $k(x,y) = \mathbf{1}[\|x-y\| \le r]$ (indicator kernel).

More generally, kernel random graphs use $A_{ij} \sim \text{Bernoulli}(k(x_i, x_j))$ for a kernel function $k: \mathcal{X}^2 \to [0,1]$. This is a graphon with $W(x,y) = k(x,y)$ for node types $x,y \in \mathcal{X}$.

For attention: The softmax attention matrix $\text{softmax}(QK^\top/\sqrt{d})$ is a random kernel matrix where the "positions" are query/key vectors. Random graph theory for kernel random graphs directly applies to the study of information flow in attention layers.
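
A sketch of sampling a kernel random graph $A_{ij} \sim \text{Bernoulli}(k(x_i, x_j))$ with an illustrative Gaussian kernel and scalar latent positions (the bandwidth 0.02 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(size=n)                                 # latent positions in [0, 1]
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)    # kernel values in (0, 1]
U = rng.uniform(size=(n, n))
A = np.triu((U < K).astype(int), 1)                     # Bernoulli draws, upper triangle
A = A + A.T                                             # symmetrize, no self-loops
print(A.sum() // 2, "edges")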


Appendix F: Network Motifs and Subgraph Statistics

F.1 Network Motifs

Definition: A network motif is a subgraph pattern that occurs significantly more often in a real network than in random graphs with the same degree sequence (configuration model null).

Common motifs:

  • Feedforward loop (3-node DAG): overrepresented in gene regulatory networks
  • Bifan (2x2 bipartite): common in neural circuits
  • Cliques ($K_3$, $K_4$): overrepresented in social networks

Detection: For each candidate subgraph $H$, compare $t(H, G_{\text{real}})$ (density in the real graph) to $\mathbb{E}[t(H, G_{\text{config}})]$ (expected density under the null model). Z-score $> 2$: motif. Z-score $< -2$: anti-motif.

For GNNs: Motif counting is a proxy for what GNNs learn. Standard MPNNs (GCN, GraphSAGE) can count triangles but not 4-cycles. Higher-order GNNs (OSAN, subgraph GNNs) can count richer motif sets. The motif profile of a graph determines which GNN architecture is most suitable.

F.2 Triangle Counting at Scale

Exact triangle count: For dense graphs, $T = \frac{1}{6}\text{tr}(A^3)$ using matrix multiplication in $O(n^{2.37})$ time (matrix multiplication exponent). For sparse graphs ($m = O(n)$), algorithms run in $O(m^{3/2})$.

Approximate counting: For massive graphs ($n = 10^9$), use streaming algorithms or random sampling:

  • Wedge sampling: Count triangles by sampling paths of length 2 and checking closure
  • DOULION: Sample edges independently with probability $\rho$; count triangles in the subgraph; scale by $1/\rho^3$

Expected triangle count in $G(n,p)$: $\mathbb{E}[T] = \binom{n}{3} p^3 \approx n^3 p^3 / 6$. A quick check of this formula follows below.

Concentration: For $n^3 p^3 \to \infty$ (i.e., $\mathbb{E}[T] \to \infty$), by Azuma-Hoeffding (Lipschitz condition), $T$ concentrates around its mean with standard deviation $O(\text{mean}/\sqrt{n})$.
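
A quick check of the $\text{tr}(A^3)/6$ identity against the $G(n,p)$ expectation:

import numpy as np
import networkx as nx
from math import comb

n, p = 500, 0.05
G = nx.gnp_random_graph(n, p, seed=0)
A = nx.to_numpy_array(G)
T = np.trace(A @ A @ A) / 6                 # each triangle is counted 6 times in tr(A^3)
print(int(round(T)), comb(n, 3) * p ** 3)   # observed vs expected triangle count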


Appendix G: Information-Theoretic Limits

G.1 Mutual Information and Detection

The detection problem: Given $G \sim \text{SBM}(n, 2, \sigma, p, q)$, recover $\sigma$ (up to a global flip).

Impossible regime: Below the Kesten-Stigum threshold, the per-node mutual information $I(\sigma; G)/n \to 0$ as $n \to \infty$. The graph $G$ then carries asymptotically no extractable information about the community labels, for any algorithm.

Formal statement: For $p = a/n$, $q = b/n$, $(a-b)^2 \le 2(a+b)$:

$$\lim_{n \to \infty} I(\sigma; G) / n = 0$$

The $/n$ normalization is there because both $\sigma$ and $G$ grow with $n$; the mutual information per node goes to 0.

Above threshold: For $(a-b)^2 > 2(a+b)$:

$$\lim_{n \to \infty} I(\sigma; G) / n = h_b\left(\frac{1}{2}\right) - h_b\left(\frac{1+\alpha^*}{2}\right) > 0$$

where $\alpha^*$ is the Bayes-optimal overlap and $h_b$ is the binary entropy function.

G.2 Exact Recovery Threshold

For the exact recovery problem (recover $\sigma$ exactly, not just correlate with it), the threshold is higher:

Theorem (Abbe-Sandon, 2015): For the 2-block SBM with $p = a \ln(n)/n$, $q = b \ln(n)/n$:

$$\text{Exact recovery is possible w.h.p.} \iff \left(\sqrt{a} - \sqrt{b}\right)^2 > 2$$

This is the Chernoff-Hellinger divergence condition between Poisson distributions with means $a$ and $b$.

The three thresholds:

  1. Impossible: $(a-b)^2 \le 2(a+b)$ (with $p = a/n$, $q = b/n$) - no algorithm can detect communities
  2. Weak recovery: $(a-b)^2 > 2(a+b)$ - algorithms exist to partially recover
  3. Exact recovery: $(\sqrt{a} - \sqrt{b})^2 > 2$ (with $p = a\ln(n)/n$, $q = b\ln(n)/n$) - algorithms exist for exact recovery

These thresholds become relevant for GNN practitioners when designing tasks: is the community detection problem in your benchmark achievable at all? Which threshold regime does it fall in?


Appendix H: Practical Algorithms

H.1 Spectral Clustering Pipeline for SBM

INPUT: Adjacency matrix A of SBM graph with k communities
OUTPUT: Community assignments \sigma

1. REGULARIZE: Compute A_reg = A + \tau/n * 11^T (small ridge)
   (Prevents leading eigenvector being dominated by high-degree nodes)

2. NORMALIZE: Compute L_sym = D^{-1/2} A_reg D^{-1/2}

3. EIGENVECTORS: Compute top k eigenvectors U in R^{n x k} of L_sym

4. NORMALIZE ROWS: U_row[i] = U[i] / ||U[i]||_2 (spherical projection)

5. k-MEANS: Run k-means on rows of U_row, get cluster labels \sigma

6. RETURN: \sigma (up to global permutation)

COMPLEXITY: O(n*k/\epsilon^2) for sparse graphs (k power iterations)
ACCURACY: Achieves min error rate above Kesten-Stigum threshold
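
A runnable sketch of this pipeline for the 2-block case in plain numpy; tau = 1 is an arbitrary illustrative choice, and for k = 2 the row-normalization + k-means steps reduce to thresholding the second eigenvector (as in Exercise 4):

import numpy as np
import networkx as nx

def spectral_sbm_two_block(A, tau=1.0):
    """Steps 1-3 of the pipeline; thresholding the second eigenvector
    replaces row normalization + k-means when k = 2."""
    n = A.shape[0]
    A_reg = A + tau / n                       # step 1: A + (tau/n) 11^T
    d = A_reg.sum(axis=1)
    L_sym = A_reg / np.sqrt(np.outer(d, d))   # step 2: D^{-1/2} A_reg D^{-1/2}
    vals, vecs = np.linalg.eigh(L_sym)        # step 3: eigenvectors (ascending order)
    return (vecs[:, -2] > 0).astype(int)      # threshold the 2nd eigenvector at 0

a, b, n = 20, 5, 500
G = nx.stochastic_block_model([n // 2, n // 2],
                              [[a / n, b / n], [b / n, a / n]], seed=0)
labels = spectral_sbm_two_block(nx.to_numpy_array(G))
truth = np.array([0] * (n // 2) + [1] * (n // 2))
acc = max((labels == truth).mean(), ((1 - labels) == truth).mean())
print("accuracy:", acc)                       # up to the global label flip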

H.2 Louvain Community Detection

The Louvain algorithm maximizes modularity $Q$, a measure of community quality:

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(\sigma_i, \sigma_j)$$

Phase 1 (local optimization): For each node $v$, move $v$ to the neighboring community that maximizes $\Delta Q$. Repeat until no improvement.

Phase 2 (aggregation): Collapse each community into a supernode. Edge weights between supernodes = total edges between original communities. Apply Phase 1 to the collapsed graph.

Repeat until no further improvement.

Complexity: $O(n \log n)$ empirically - the fastest practical community detection algorithm.

Limitation: Modularity has a resolution limit: communities smaller than $\sqrt{m}$ edges may not be detected. For fine-grained community structure, use alternatives (Infomap, hierarchical spectral methods). A usage example follows below.
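
In practice Louvain is one call in networkx (version >= 2.8 assumed; the graph here is just an illustrative built-in):

import networkx as nx

G = nx.les_miserables_graph()
communities = nx.community.louvain_communities(G, seed=0)
print(len(communities), [len(c) for c in communities])
print(nx.community.modularity(G, communities))   # the Q value being maximized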

H.3 Fast Giant Component Detection

from collections import deque

def giant_component_fraction(adj, n):
    """Fraction of nodes in the largest connected component.

    adj: adjacency list, adj[v] = iterable of the neighbors of v.
    (For a dense matrix A, pass [[u for u in range(n) if A[v][u]] for v in range(n)].)
    """
    visited = [False] * n
    max_component = 0

    for start in range(n):
        if visited[start]:
            continue
        # BFS from start
        queue = deque([start])
        visited[start] = True
        component_size = 0
        while queue:
            v = queue.popleft()  # O(1); list.pop(0) would cost O(n) per pop
            component_size += 1
            for u in adj[v]:
                if not visited[u]:
                    visited[u] = True
                    queue.append(u)
        max_component = max(max_component, component_size)

    return max_component / n

For sparse adjacency lists (CSR format), BFS runs in $O(n + m)$ - linear in the size of the graph.




Appendix I: Advanced Topics in Random Graphs

I.1 Local Weak Convergence (Benjamini-Schramm)

For sparse graph sequences where the edge density $\to 0$, graphon theory breaks down. The correct limit theory uses local weak convergence.

Definition (Benjamini-Schramm limit): A sequence of graphs $G_n$ converges in the local weak sense to a random rooted graph $(G, \rho)$ if for every rooted graph $(H, v)$ and every $r \ge 0$:

$$\frac{1}{n} \left|\{u \in V(G_n) : B_r(G_n, u) \cong (H, v)\}\right| \to \mathbb{P}[B_r(G, \rho) \cong (H, v)]$$

where $B_r(G, u)$ is the ball of radius $r$ around $u$ in $G$.

For $G(n, c/n)$: The local weak limit is the Galton-Watson tree with $\text{Poisson}(c)$ offspring distribution - an infinite random tree. This confirms the "locally tree-like" structure of sparse ER graphs.

For BA networks: The local weak limit involves correlated degree distributions due to preferential attachment - the limiting tree is not a Galton-Watson tree but a Polya urn tree.

For SBM: The local weak limit is a multi-type Galton-Watson tree where the types are community labels. This is the probabilistic foundation of the belief propagation algorithm for community detection.

I.2 Random Regular Graphs

Definition: A $d$-regular random graph on $n$ vertices is a graph chosen uniformly at random among all $d$-regular simple graphs on $n$ vertices.

Construction (configuration model): Give each vertex $d$ stubs, pair them uniformly at random, and condition on simplicity.

Spectral properties: For $d$-regular random graphs, all nontrivial eigenvalues lie in $[-2\sqrt{d-1} - \epsilon, \, 2\sqrt{d-1} + \epsilon]$ w.h.p. (the trivial eigenvalue is $d$). The Alon-Boppana bound says $\lambda_2 \ge 2\sqrt{d-1} - o(1)$ for any $d$-regular graph. A Ramanujan graph achieves the bound: $\lambda_2 \le 2\sqrt{d-1}$.

Alon conjecture (proved by Friedman, 2008): Almost all $d$-regular random graphs are nearly Ramanujan:

$$\lambda_2(G) \le 2\sqrt{d-1} + \epsilon$$

w.h.p., for any fixed $\epsilon > 0$.

For AI: Expanders (near-Ramanujan regular graphs) provide the optimal sparse attention pattern (a numerical check of the spectral gap follows below):

  • $d$ edges per node (linear total edge count)
  • Diameter $O(\log n)$ (short paths)
  • Spectral gap $\approx d - 2\sqrt{d-1}$ (fast mixing)
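
A numerical check of the near-Ramanujan property (sizes kept small enough for a dense eigensolver):

import numpy as np
import networkx as nx

d, n = 4, 1000
G = nx.random_regular_graph(d, n, seed=0)
eigs = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))
lam2 = max(abs(eigs[0]), eigs[-2])   # largest nontrivial eigenvalue in modulus
print(lam2, 2 * np.sqrt(d - 1))      # should be close to the Ramanujan bound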

I.3 Random Hypergraphs

Real-world networks often have higher-order interactions - a chemistry reaction involves multiple molecules simultaneously. Hypergraphs capture this.

Definition: A hypergraph $H = (V, E)$ where edges $e \in E$ can contain any number of vertices.

Random $k$-uniform hypergraph $G^k(n,p)$: Each $k$-subset of $V$ is an edge independently with probability $p$.

Phase transition: The giant component threshold for $G^k(n,p)$ occurs at $p = 1/\binom{n-1}{k-1}$ (average degree 1). The analysis uses a multi-type branching process.

For AI: Simplicial complex neural networks (SCNNs) and topological data analysis (TDA) use higher-order graph structure. Random simplicial complexes (random clique complexes of ER graphs) model the topological structure of neural network activation spaces.

I.4 Dynamic Random Graphs

Real networks evolve over time. Several models capture network dynamics:

Forest Fire model (Leskovec et al., 2007): A new node contacts a random "ambassador" and copies some of its links. Exhibits densification (average degree increases over time) and shrinking diameter - both observed in real growing networks.

Copying model: Each new node copies $m$ existing edges from a random source node. This generates power-law degree distributions similar to BA but with a different exponent depending on the copy probability.

Edge dynamics: Instead of just adding edges, allow edge rewiring, deletion, and creation. This models temporal networks (who talked to whom at time $t$). The temporal analogs of the clustering coefficient and path length require time-respecting paths.

For AI: Training data graphs (social networks, citation graphs) evolve over time. Distribution shift in graph ML often comes from temporal evolution of the underlying random graph model. A GNN trained on a 2019 citation graph may fail on a 2024 citation graph if the graph generative process has changed.


Appendix J: Mathematical Proofs

J.1 Proof: Giant Component Threshold (Upper Bound)

We prove: for $c < 1$, all components have size $O(\log n)$ w.h.p.

Proof: Let $X_k$ = the number of connected components of size exactly $k$ in $G(n,p)$ with $p = c/n$.

The probability that a specific set $S$ of $k$ vertices forms a component factors into two requirements:

  1. The induced graph on $S$ is connected: by a union bound over the $k^{k-2}$ possible spanning trees on $S$ (Cayley's formula), this has probability $\le k^{k-2} p^{k-1}$.
  2. No edges from $S$ to $V \setminus S$: probability $(1-p)^{k(n-k)}$.

By the union bound over all $\binom{n}{k}$ choices of $S$:

$$\mathbb{E}[X_k] \le \binom{n}{k} \cdot k^{k-2} p^{k-1} \cdot (1-p)^{k(n-k)}$$

For $p = c/n$ and $k \le K \log n$:

$$\mathbb{E}[X_k] \le \frac{n^k}{k!} \cdot k^{k-2} \cdot \left(\frac{c}{n}\right)^{k-1} \cdot e^{-ck(n-k)/n} \approx \frac{n}{c} \cdot \frac{c^k k^{k-2}}{k!} \cdot e^{-ck(1 - k/n)} \lesssim n \cdot \frac{(ce^{1-c})^k}{k^{5/2}}$$

(using Stirling: $k! \approx k^k e^{-k} \sqrt{2\pi k}$, so $k^{k-2}/k! \approx e^k k^{-5/2}/\sqrt{2\pi}$)

For $c < 1$: $ce^{1-c} < 1$, since $f(c) = ce^{1-c}$ satisfies $f(1) = 1$, $f'(1) = 0$, $f''(1) = -1 < 0$, so its maximum value 1 is attained only at $c = 1$. Let $\alpha = ce^{1-c} < 1$.

Summing over $k \ge K \log n$ with $K > 1/\ln(1/\alpha)$:

$$\sum_{k \ge K \log n} \mathbb{E}[X_k] \lesssim n \sum_{k \ge K \log n} \alpha^k \le \frac{n \, \alpha^{K \log n}}{1 - \alpha} \to 0$$

By Markov's inequality, w.h.p. there is no component of size $\ge K \log n$, so the maximum component size is $O(\log n)$. $\square$

J.2 Proof: Poisson Degree Distribution

Claim: For $G(n,p)$ with $p = c/n$, $\deg(v) \xrightarrow{d} \text{Poisson}(c)$.

Proof via PGF: The degree $\deg(v) = \sum_{u \neq v} X_{uv}$ where the $X_{uv} \sim \text{Bernoulli}(c/n)$ are i.i.d.

PGF of $\deg(v)$:

$$G_{\deg}(s) = \mathbb{E}[s^{\deg(v)}] = \prod_{u \neq v} \mathbb{E}[s^{X_{uv}}] = \left(1 - \frac{c}{n} + \frac{cs}{n}\right)^{n-1}$$

Taking $n \to \infty$:

$$\left(1 + \frac{c(s-1)}{n}\right)^{n-1} \to e^{c(s-1)} = \sum_{k=0}^\infty \frac{e^{-c} c^k}{k!} s^k$$

which is the PGF of $\text{Poisson}(c)$.

Since PGF convergence (for $|s| \le 1$) implies distributional convergence (by the continuity theorem for generating functions), $\deg(v) \xrightarrow{d} \text{Poisson}(c)$. $\square$

Multivariate extension: For distinct vertices $v_1, \ldots, v_m$, their degrees are jointly Poisson in the limit, with joint PGF:

$$\mathbb{E}\left[\prod_{i=1}^m s_i^{\deg(v_i)}\right] \to e^{\sum_i c_i(s_i - 1)}$$

where $c_i = c$ for all $i$. But the joint distribution is NOT independent Poisson at finite $n$ (an edge between $v_i$ and $v_j$ contributes to both $\deg(v_i)$ and $\deg(v_j)$). For fixed $m$, the correlation between the degrees of $v_1$ and $v_2$ is of order $p = c/n \to 0$, so they are asymptotically independent.

J.3 Proof sketch: Davis-Kahan Theorem

Simplified version: Let $A = \bar{A} + W$ where $\bar{A}$ has eigenvalue $\bar{\lambda}$ and unit eigenvector $\bar{u}$. Let $\hat{u}$ be the unit eigenvector of $A$ corresponding to the eigenvalue $\hat{\lambda}$ nearest to $\bar{\lambda}$. Let $\delta = \min_{\mu \in \sigma(\bar{A}) \setminus \{\bar{\lambda}\}} |\bar{\lambda} - \mu|$ (the gap to the other eigenvalues).

Claim: $\sin\angle(\hat{u}, \bar{u}) \le \|W\|_{op} / (\delta - \|W\|_{op})$ (assuming $\delta > \|W\|_{op}$).

Proof: Decompose $\hat{u} = \alpha \bar{u} + \beta v$ where $v \perp \bar{u}$, $|\alpha|^2 + |\beta|^2 = 1$, and $\sin\angle(\hat{u}, \bar{u}) = |\beta|$.

From $A\hat{u} = \hat{\lambda}\hat{u}$: $\bar{A}\hat{u} + W\hat{u} = \hat{\lambda}\hat{u}$.

Project onto $\bar{u}^\perp$: $P_{\bar{u}^\perp} \bar{A} P_{\bar{u}^\perp} (\beta v) + P_{\bar{u}^\perp} W \hat{u} = \hat{\lambda} \beta v$.

All eigenvalues of $P_{\bar{u}^\perp} \bar{A} P_{\bar{u}^\perp}$ are at distance $\ge \delta - |\hat{\lambda} - \bar{\lambda}|$ from $\hat{\lambda}$:

$$\|(P_{\bar{u}^\perp} \bar{A} P_{\bar{u}^\perp} - \hat{\lambda} I)^{-1}\|_{op} \le \frac{1}{\delta - |\hat{\lambda} - \bar{\lambda}|}$$

Therefore, using $|\hat{\lambda} - \bar{\lambda}| \le \|W\|_{op}$ (Weyl's inequality):

$$|\beta| = \|(P_{\bar{u}^\perp} \bar{A} P_{\bar{u}^\perp} - \hat{\lambda} I)^{-1} P_{\bar{u}^\perp} W \hat{u}\| \le \frac{\|W\|_{op}}{\delta - \|W\|_{op}}$$

This gives the Davis-Kahan bound. $\square$ A numeric sanity check follows below.
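
A numeric sanity check of the simplified bound on a rank-one-signal-plus-Wigner-noise matrix (the size and the signal strength 5 are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
n = 300
u_bar = np.ones(n) / np.sqrt(n)
A_bar = 5.0 * np.outer(u_bar, u_bar)           # one eigenvalue 5, rest 0: gap delta = 5
M = rng.normal(size=(n, n))
W = (M + M.T) / (2 * np.sqrt(n))               # Wigner noise, ||W||_op ~ sqrt(2)
vals, vecs = np.linalg.eigh(A_bar + W)
u_hat = vecs[:, np.argmax(vals)]               # eigenvector of the top eigenvalue
sin_angle = np.sqrt(max(0.0, 1 - float(u_hat @ u_bar) ** 2))
W_op = np.abs(np.linalg.eigvalsh(W)).max()
print(sin_angle, "<=", W_op / (5.0 - W_op))    # the bound should dominate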


Appendix K: Notation Reference

| Symbol | Meaning | First appears |
|--------|---------|---------------|
| $G(n,p)$ | Erdos-Renyi random graph, $n$ vertices, edge probability $p$ | 3.1 |
| $G(n,m)$ | ER graph with exactly $m$ edges | 3.1 |
| $L_1(G)$ | Size of the largest connected component | 3.3 |
| $\beta(c)$ | Giant component fraction, $\beta = 1 - e^{-c\beta}$ | 3.3 |
| $C_v$ | Local clustering coefficient of vertex $v$ | 4.2 |
| $L$ | Average path length | 4.3 |
| $\Pi(v)$ | Preferential attachment probability of vertex $v$ | 5.1 |
| $\text{SBM}(n, k, \sigma, B)$ | Stochastic Block Model | 6.1 |
| $B \in [0,1]^{k\times k}$ | SBM block probability matrix | 6.1 |
| $\sigma: [n] \to [k]$ | Community assignment | 6.1 |
| $\rho(M)$ | Spectral radius of matrix $M$ | App. A.2 |
| $\mu_{sc}$ | Wigner semicircle measure | 7.1 |
| $d_\square(G, H)$ | Cut metric between graphs $G$ and $H$ | 8.1 |
| $W: [0,1]^2 \to [0,1]$ | Graphon | 8.2 |
| $t(F, W)$ | Homomorphism density of $F$ in $W$ | 8.3 |
| $T_W$ | Graphon integral operator, $T_W h(x) = \int W(x,y)h(y)\,dy$ | 8.4 |
| w.h.p. | With high probability (probability $\to 1$) | 2.2 |
| $o(\cdot), O(\cdot), \Theta(\cdot), \omega(\cdot)$ | Asymptotic notation | 2.2 |
| $I(c) = c - 1 - \ln c$ | Large deviation rate function for ER | 3.3 |
| $\phi(s)$ | Probability generating function | App. A.1 |
| $q_k$ | Excess degree distribution | App. B.2 |

<- Back to Graph Theory | Next: Graph Algorithms ->

