
Graph Neural Networks: Appendix C: Connections to Classical Machine Learning to Appendix H: Quick Reference - Key Formulas

Appendix C: Connections to Classical Machine Learning

C.1 GNNs and Kernel Methods

The GNN's computation can be viewed through the lens of kernel methods. Define the graph neural tangent kernel (GNTK) as the limiting kernel of an infinitely-wide GNN at initialization:

$$\Theta_{\text{GNN}}(G_1, G_2) = \lim_{d \to \infty} \left\langle \nabla_\theta f_\theta(G_1), \nabla_\theta f_\theta(G_2) \right\rangle$$

This connects GNNs to the NTK theory of Jacot et al. (2018) and provides a framework for analyzing GNN training dynamics. For graph classification:

  • Wide GNN $\equiv$ kernel SVM: at infinite width, a GNN with random feature initialization and gradient descent is equivalent to a kernel machine with the GNTK
  • GNTK expressiveness: the GNTK of a GIN-like architecture is exactly as expressive as 1-WL (matching the finite-depth GNN result)

Graph Weisfeiler-Leman kernel (Shervashidze et al., 2011). Before GNNs, graph kernels were the primary ML method for graphs. The WL kernel computes a feature vector from the color histograms at each WL refinement step:

$$K_{\text{WL}}(G_1, G_2) = \sum_{l=0}^{L} \left\langle \phi_l(G_1), \phi_l(G_2) \right\rangle$$

where $\phi_l(G)$ is the histogram of colors at WL iteration $l$. The WL kernel is equivalent to a GIN with a fixed (non-learned) MLP - it is the non-parametric version of GIN.
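
To make the construction concrete, here is a minimal sketch (plain Python; it assumes each graph is given as an adjacency dict plus a dict of hashable initial node labels, and uses Python's built-in hash in place of the shared color dictionary a production implementation would maintain):

```python
from collections import Counter

def wl_histograms(adj, labels, num_iters):
    """Color histograms phi_0, ..., phi_L from WL color refinement.

    adj: dict node -> list of neighbor nodes
    labels: dict node -> initial (hashable) label
    """
    colors = dict(labels)
    hists = [Counter(colors.values())]
    for _ in range(num_iters):
        # New color = hash of (own color, sorted multiset of neighbor colors).
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in adj[v]))))
            for v in adj
        }
        hists.append(Counter(colors.values()))
    return hists

def wl_kernel(adj1, labels1, adj2, labels2, num_iters=3):
    """K_WL(G1, G2) = sum_l <phi_l(G1), phi_l(G2)>."""
    h1 = wl_histograms(adj1, labels1, num_iters)
    h2 = wl_histograms(adj2, labels2, num_iters)
    return sum(
        sum(c1[color] * c2[color] for color in c1)
        for c1, c2 in zip(h1, h2)
    )
```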

For AI: Understanding GNNs as kernel machines allows the use of kernel theory (generalization bounds, bandwidth selection, representer theorem) to analyze GNN behavior - important for the theoretical foundations in Chapter 12 (Functional Analysis) and Chapter 21 (Statistical Learning Theory).

C.2 GNNs and Belief Propagation

Loopy belief propagation (LBP) on probabilistic graphical models (Markov random fields, factor graphs) has exactly the same computational structure as MPNN message passing.

Belief propagation update:

$$\mu_{u \to v}^{(t+1)}(x_v) \propto \sum_{x_u} \psi_{uv}(x_u, x_v) \cdot \phi_u(x_u) \cdot \prod_{w \in \mathcal{N}(u) \setminus v} \mu_{w \to u}^{(t)}(x_u)$$

MPNN message update:

$$\mathbf{m}_{u \to v}^{[l+1]} = M\!\left(\mathbf{h}_v^{[l]}, \mathbf{h}_u^{[l]}, \mathbf{e}_{uv}\right)$$

The analogy:

  • BP messages $\mu_{u \to v}(x_v)$ ↔ GNN messages $\mathbf{m}_{u \to v}$
  • BP potential $\psi_{uv}$ ↔ GNN message function $M$
  • BP product of incoming messages ↔ GNN sum/product aggregation

Implication: GNNs can be viewed as learned approximations to belief propagation. Where BP computes exact marginals on trees (and approximate marginals on loopy graphs), GNNs learn to compute useful representations that may or may not correspond to probabilistic marginals. For tasks that arise from probabilistic inference on graphs (decoding, error correction, satisfiability), architectures inspired by BP (like LDPC decoders implemented as GNNs) achieve state-of-the-art results.
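
As a concrete illustration of the structural parallel, a single sum-product message update on a pairwise MRF fits in a few lines. This is a minimal NumPy sketch; the inputs (a pairwise potential, a unary potential, and a list of incoming messages) are hypothetical placeholders for a small discrete model:

```python
import numpy as np

def bp_message(psi_uv, phi_u, incoming):
    """One sum-product update:
    mu_{u->v}(x_v) ∝ sum_{x_u} psi_uv(x_u, x_v) * phi_u(x_u) * prod of incoming messages.

    psi_uv:   (k, k) pairwise potential indexed [x_u, x_v]
    phi_u:    (k,) unary potential at node u
    incoming: list of (k,) messages from N(u) \ {v} into u
    """
    prod = phi_u.copy()
    for m in incoming:
        prod = prod * m
    msg = psi_uv.T @ prod   # marginalize over x_u
    return msg / msg.sum()  # normalize for numerical stability
```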

C.3 Label Propagation vs GCN

Label propagation (Zhu et al., 2003) is a classical semi-supervised learning method that also propagates information across graph edges. The update rule:

$$F^{(t+1)} = \alpha S F^{(t)} + (1-\alpha) Y$$

where $S = D^{-1/2}AD^{-1/2}$, $F \in \mathbb{R}^{n \times C}$ is the label matrix (soft predictions), $Y$ contains the observed labels (zero rows for unlabeled nodes), and $\alpha \in (0,1)$ controls the balance between propagation and fitting the observed labels.

At convergence: $F^* = (1-\alpha)(I - \alpha S)^{-1} Y$ - a closed-form solution.
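
A minimal dense NumPy sketch of the iteration (for large graphs one would use sparse matrices; the closed form above can be used directly when inverting an $n \times n$ matrix is feasible):

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, num_iters=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y with S = D^{-1/2} A D^{-1/2}.

    A: (n, n) symmetric adjacency matrix
    Y: (n, C) one-hot labels, zero rows for unlabeled nodes
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    F = Y.astype(float).copy()
    for _ in range(num_iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F  # row-wise argmax gives the predicted class
```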

GCN vs label propagation:

  • Label propagation: linear, closed-form, uses only graph structure (no node features)
  • GCN: nonlinear, gradient-based, uses both node features and graph structure
  • For feature-rich graphs: GCN consistently outperforms LP
  • For feature-poor graphs (only node IDs): LP is competitive and much cheaper
  • APPNP (Gasteiger et al., 2019): decouples the GNN feature transformation from the propagation step, applying propagation as a post-processing operation similar to LP. Achieves better performance than GCN on many node classification benchmarks by using more propagation steps without over-smoothing (due to the $\alpha$ mixing coefficient); a minimal propagation sketch follows this list.
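
The APPNP-style propagation reduces to a short loop. This is a minimal NumPy sketch; it assumes the features $H$ have already been produced by the feature-transformation MLP:

```python
import numpy as np

def appnp_propagate(A_hat, H, alpha=0.1, K=10):
    """Personalized-PageRank-style propagation:
    Z^(0) = H;  Z^(k+1) = (1 - alpha) * A_hat @ Z^(k) + alpha * H.

    A_hat: (n, n) normalized adjacency, e.g. D~^{-1/2} A~ D~^{-1/2}
    H:     (n, d) MLP-transformed node features
    """
    Z = H.copy()
    for _ in range(K):
        Z = (1 - alpha) * A_hat @ Z + alpha * H
    return Z
```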

C.4 Random Walk Methods and GNNs

Before GNNs, random walk methods (DeepWalk, Node2Vec) were the standard for learning node embeddings.

DeepWalk (Perozzi et al., 2014). Sample random walks from each node; treat the walk as a "sentence" and apply Skip-Gram (Word2Vec) to learn node embeddings that predict context nodes in the walk.

Node2Vec (Grover & Leskovec, 2016). Biased random walks with a return parameter $p$ and an in-out parameter $q$ to balance BFS-like (structural equivalence) and DFS-like (homophily) exploration.
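
A minimal sketch of the DeepWalk sampling step (uniform walks; Node2Vec would bias the neighbor choice using $p$ and $q$). The resulting walks can be fed to any Skip-Gram implementation as "sentences" of node IDs:

```python
import random

def sample_walks(adj, walk_length=40, walks_per_node=10):
    """Uniform random walks, DeepWalk-style.

    adj: dict node -> list of neighbors
    Returns a list of walks; each walk is a list of node IDs.
    """
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks
```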

Connection to GNNs. The Skip-Gram objective for DeepWalk is equivalent to factorizing the matrix:

$$\log\!\left(\frac{\text{vol}(G)}{T} \sum_{t=1}^{T} A_t \cdot D^{-1}\right)$$

where $A_t$ is the $t$-step transition matrix. This is related to the graph's random walk kernel.

GNNs can incorporate random walk information through RWSE (7.6) or through the unsupervised pretraining objective of GraphSAGE (5.4). Modern practice often combines: pretrain with random walk objectives (fast, scalable), then fine-tune with GNN layers (expressive, task-specific).

C.5 Over-smoothing and Spectral Perspective

From the spectral perspective of 11-04, over-smoothing has a clean frequency interpretation. A graph signal $\mathbf{x}$ can be decomposed into its graph Fourier components $\hat{x}_k = \mathbf{u}_k^\top \mathbf{x}$ (where $\mathbf{u}_k$ is the $k$-th Laplacian eigenvector). The GCN propagation $S = I - L_{\text{sym}}$ acts as a low-pass filter:

$$\widehat{(Sx)}_k = (1 - \lambda_k)\, \hat{x}_k$$

where $\lambda_k$ are the eigenvalues of $L_{\text{sym}}$. After $L$ layers:

$$\widehat{(S^L x)}_k = (1-\lambda_k)^L\, \hat{x}_k$$
  • Lowest frequency ($k=1$, $\lambda_1 = 0$): the constant component, multiplied by a factor of 1 - perfectly preserved
  • Low frequencies ($\lambda_k \approx 0$): slowly varying signals, multiplied by $(1-\lambda_k)^L \approx 1$ - nearly preserved
  • High frequencies ($\lambda_k \approx 2$): rapidly oscillating signals, multiplied by $(1-\lambda_k)^L$, which flips sign each layer and decays toward $0$ - exponentially suppressed

The class label signal is typically a low-frequency signal (connected nodes tend to have the same class in homophilic graphs). But even the label signal is eventually suppressed by many propagation steps - after enough layers, only the zero-frequency constant component survives and all node representations collapse.

Optimal depth. The optimal number of layers LL^* balances two effects:

  1. More layers: larger receptive field, more node context, better long-range information
  2. More layers: stronger smoothing, loss of high-frequency discriminative features

For a graph with Fiedler value $\lambda_2$, the optimal depth is approximately $L^* \approx 1/\lambda_2$ (the mixing time of the random walk). For clustered graphs with small $\lambda_2$ (e.g., $\lambda_2 = 0.01$): $L^* \approx 100$ layers could be used. For expander-like graphs with large $\lambda_2$ (e.g., $\lambda_2 = 0.5$): $L^* \approx 2$ layers is optimal.
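
A small NumPy sketch of this analysis: compute the spectrum of $L_{\text{sym}}$, the per-frequency attenuation $(1-\lambda_k)^L$, and the $1/\lambda_2$ depth heuristic (dense eigendecomposition, so suitable only for small graphs):

```python
import numpy as np

def smoothing_profile(A, num_layers):
    """Per-frequency gain (1 - lambda_k)^L of S = I - L_sym, plus the depth heuristic L* ~ 1/lambda_2.

    A: (n, n) symmetric adjacency matrix of a connected graph
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L_sym)      # ascending: 0 = lambda_1 <= ... <= 2
    gain = (1.0 - eigvals) ** num_layers     # signed gain applied to each Fourier component
    depth_estimate = 1.0 / eigvals[1]        # Fiedler-value heuristic for L*
    return gain, depth_estimate
```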


Appendix D: GNN Training Recipes and Engineering Tricks

D.1 Data Preprocessing for Graph ML

Node feature normalization. Unlike image pixels (naturally in $[0, 255]$), node features in graphs span wildly different scales. Atom features in molecular graphs: atomic number (1-118), formal charge ($-4$ to $+4$), number of hydrogens (0-4). Normalizing to zero mean, unit variance per feature dimension is essential.

Edge feature preprocessing. Bond lengths in molecules range from 1.0 Å (triple bond) to 1.5 Å (single bond). If used as raw edge features, their small absolute range means they have little influence relative to bond-type one-hot features. Apply normalization or use learned embeddings.

Graph-level normalization. For graph regression tasks (predicting molecular energy, which scales with the number of atoms), normalize the target $y_G$ by dividing by $n$ (the atom count). Otherwise, larger molecules trivially have larger predicted values.

Degree-based normalization. When using sum aggregation, node representations can grow with degree. Apply degree normalization as part of preprocessing: scale node features by $d_v^{-1/2}$ before input, matching the GCN normalization.
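
A minimal sketch of the two most common normalizations from this subsection (NumPy; statistics should be computed on the training split only):

```python
import numpy as np

def standardize_features(X, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature dimension."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def degree_scale(X, A):
    """Scale each node's features by d_v^{-1/2} to counteract degree-driven magnitude growth."""
    d = A.sum(axis=1)
    return X * np.where(d > 0, d ** -0.5, 0.0)[:, None]
```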

D.2 Positional Encodings: Practical Considerations

LapPE sign ambiguity. Eigenvectors are defined up to sign: if $\mathbf{u}$ is an eigenvector of $L$, so is $-\mathbf{u}$. Naive use of eigenvectors as PEs means two identical graphs with differently-signed PEs produce different GNN outputs - the output depends on an arbitrary sign choice rather than on the graph itself.

Fix (Lim et al., 2022): during training, randomly flip the sign of each eigenvector independently at each mini-batch. This forces the GNN to be invariant to sign choice. At inference, use a consistent sign (e.g., always choose the sign such that the first non-zero element is positive).
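
The training-time part of the fix is a one-line augmentation; a minimal sketch (lap_pe is an $n \times k$ matrix whose columns are the chosen eigenvectors):

```python
import numpy as np

def random_sign_flip(lap_pe):
    """Flip the sign of each eigenvector column independently (apply per mini-batch)."""
    signs = np.random.choice([-1.0, 1.0], size=lap_pe.shape[1])
    return lap_pe * signs[None, :]
```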

RWSE efficiency. Computing the return probabilities $[(D^{-1}A)^p]_{vv}$ for $p = 1, \ldots, P$ requires matrix powers - expensive for large graphs. Efficient computation: start random walks from node $v$ and estimate the return probability $p_v^{(p)}$ by averaging over $R$ random walkers:

$$\hat{p}_v^{(p)} = \frac{1}{R} \sum_{r=1}^{R} \mathbf{1}\left[\text{walk } r \text{ returns to } v \text{ at step } p\right]$$

With $R = 200$ walkers per node, this gives accurate RWSE estimates in $O(n \cdot P \cdot R)$ time.
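
A minimal Monte Carlo sketch of this estimator (pure Python; a vectorized implementation would simulate all walkers in parallel):

```python
import random
import numpy as np

def rwse_monte_carlo(adj, num_steps, num_walkers=200):
    """Estimate return probabilities p_v^(p) for p = 1..num_steps by averaging over walkers.

    adj: dict node -> list of neighbors
    Returns: dict node -> array of shape (num_steps,)
    """
    rwse = {v: np.zeros(num_steps) for v in adj}
    for v in adj:
        for _ in range(num_walkers):
            cur = v
            for p in range(1, num_steps + 1):
                neighbors = adj[cur]
                if not neighbors:
                    break
                cur = random.choice(neighbors)
                if cur == v:
                    rwse[v][p - 1] += 1.0  # walk returned to v at step p
        rwse[v] /= num_walkers
    return rwse
```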

Eigenvalue degeneracy. If two eigenvalues of $L$ are equal ($\lambda_k = \lambda_{k+1}$), the corresponding eigenspace is two-dimensional and any orthonormal basis for it is valid. This means LapPE is non-unique for graphs with symmetry (e.g., complete graphs, regular graphs). Degenerate eigenvectors encode no structural information - they should either be excluded from PEs or handled with random rotation augmentation.

D.3 Loss Functions for Graph Tasks

Node classification. Cross-entropy over labeled nodes:

$$\mathcal{L} = -\frac{1}{|V_L|}\sum_{v \in V_L} \sum_{c=1}^{C} Y_{vc} \log \hat{Y}_{vc}$$

For class-imbalanced graphs (e.g., rare fraud nodes), use focal loss or weighted cross-entropy.

Link prediction. Binary cross-entropy with negative sampling:

$$\mathcal{L} = -\sum_{(u,v) \in E} \log \sigma(\mathbf{h}_u^\top \mathbf{h}_v) - \sum_{(u,v) \notin E} \log\!\left(1 - \sigma(\mathbf{h}_u^\top \mathbf{h}_v)\right)$$

Negative sampling ratio (how many non-edges per positive edge): 1:1 to 1:5 is standard. Hard negative mining (sampling non-edges that are structurally close to positive edges) improves embedding quality at the cost of more complex sampling.
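
A minimal PyTorch sketch of this loss with uniform negative sampling (it samples random node pairs as negatives and does not filter accidental collisions with true edges, which a careful implementation would):

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(h, pos_edges, num_nodes, neg_ratio=1):
    """Binary cross-entropy on dot-product edge scores with negative sampling.

    h:         (n, d) node embeddings
    pos_edges: (2, E) long tensor of observed edges
    """
    src, dst = pos_edges
    pos_score = (h[src] * h[dst]).sum(dim=-1)

    num_neg = neg_ratio * src.numel()
    neg_src = torch.randint(0, num_nodes, (num_neg,), device=h.device)
    neg_dst = torch.randint(0, num_nodes, (num_neg,), device=h.device)
    neg_score = (h[neg_src] * h[neg_dst]).sum(dim=-1)

    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score), torch.zeros_like(neg_score)])
    return F.binary_cross_entropy_with_logits(scores, labels)
```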

Graph regression. Mean absolute error (MAE) for quantum chemistry (physically meaningful, outlier-robust) or mean squared error (MSE) for property prediction where large errors are particularly bad. Normalize targets per dataset: $(y - \mu_y) / \sigma_y$ for training, and inverse-transform predictions.

Contrastive learning on graphs. Self-supervised GNNs (GraphCL, SimGRACE) maximize agreement between augmented views of the same graph:

$$\mathcal{L}_{\text{NT-Xent}} = -\log \frac{\exp\!\left(\operatorname{sim}(\mathbf{h}_G, \mathbf{h}_G') / \tau\right)}{\sum_{G^- \neq G} \exp\!\left(\operatorname{sim}(\mathbf{h}_G, \mathbf{h}_{G^-}) / \tau\right)}$$

where $\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v} / (\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert)$ is cosine similarity, $\tau$ is the temperature, and $\mathbf{h}_G'$ is the embedding of an augmented view (e.g., edge dropout, node feature masking). Graph augmentations must preserve semantic meaning - dropping a bond in a molecule changes its identity, so augmentation must be domain-aware.
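
A simplified PyTorch sketch of the NT-Xent loss for a batch of graph embeddings, using only cross-view pairs as negatives (full GraphCL-style implementations also include same-view negatives):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1[i] and z2[i] are two augmented views of graph i; other graphs in the batch act as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                              # (B, B) cosine similarities / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, targets)                 # positive pair sits on the diagonal
```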

D.4 Debugging GNNs in Practice

Symptom: GNN performs worse than MLP on node features alone. Cause: the graph structure is not informative for the task (e.g., low homophily), or the GNN is over-smoothing and destroying the discriminative node features. Fix: verify the homophily ratio $h = |\{(u,v) \in E : y_u = y_v\}| / |E|$; if $h < 0.2$, graph structure hurts. Use heterophilic GNNs (FAGCN, H2GCN) that can handle graphs where connected nodes have different labels.
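
Checking the homophily ratio is a two-line computation; a minimal sketch:

```python
def homophily_ratio(edges, labels):
    """h = |{(u, v) in E : y_u = y_v}| / |E| for an edge list and a node -> label mapping."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```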

Symptom: Training loss decreases but validation loss explodes. Cause: overfitting, likely due to GNN memorizing the training graph topology rather than learning transferable patterns. Fix: add dropout on node features (not just edges), reduce model capacity, add weight decay. For transductive semi-supervised learning, ensure the train/val split is performed correctly.

Symptom: All node representations collapse to the same vector. Cause: over-smoothing (too many layers) or numerical instability (exploding/vanishing gradients through $\hat{A}$). Fix: check Dirichlet energy after each layer; use PairNorm or GCNII; add gradient clipping.

Symptom: GAT attention weights are all uniform ($\alpha_{uv} \approx 1/|\mathcal{N}(v)|$). Cause: the attention vector $\mathbf{a}$ has collapsed - the model learned that uniform attention is safest. Often occurs with high learning rates or without proper initialization. Fix: initialize $\mathbf{a}$ with Xavier initialization; use a smaller learning rate for attention parameters; check for gradient vanishing in LeakyReLU (use a larger negative slope, e.g., 0.2).

Symptom: Training is extremely slow. Cause: running GNN on the full large graph every step (full-batch training). Fix: switch to mini-batch training with neighbor sampling (GraphSAGE), graph partitioning (Cluster-GCN), or subgraph sampling (GraphSAINT). Profile GPU utilization - if below 70%, the bottleneck is data loading, not computation.


Appendix E: Benchmark Datasets and Evaluation

E.1 Standard Node Classification Benchmarks

| Dataset | Nodes | Edges | Features | Classes | Homophily | Task |
|---|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 | 0.81 | Transductive node clf |
| CiteSeer | 3,327 | 4,732 | 3,703 | 6 | 0.74 | Transductive node clf |
| PubMed | 19,717 | 44,338 | 500 | 3 | 0.80 | Transductive node clf |
| Amazon-Photo | 7,650 | 119,081 | 745 | 8 | 0.83 | Transductive node clf |
| Chameleon | 2,277 | 36,101 | 2,325 | 5 | 0.23 | Heterophilic node clf |
| Squirrel | 5,201 | 217,073 | 2,089 | 5 | 0.22 | Heterophilic node clf |
| OGB-arxiv | 169,343 | 1,166,243 | 128 | 40 | 0.65 | Large-scale node clf |
| OGB-products | 2,449,029 | 61,859,140 | 100 | 47 | 0.81 | Large-scale node clf |

Evaluation protocol. Standard split for Cora/CiteSeer/PubMed: 20 nodes per class for training, 500 for validation, 1000 for test (following Kipf & Welling, 2017). OGB uses official dataset-specific splits with multiple random seeds for reliable comparison.

E.2 Standard Graph Classification Benchmarks

| Dataset | Graphs | Avg. nodes | Avg. edges | Classes | Domain |
|---|---|---|---|---|---|
| MUTAG | 188 | 17.9 | 19.8 | 2 | Molecular (mutagenicity) |
| PROTEINS | 1,113 | 39.1 | 72.8 | 2 | Protein function |
| IMDB-B | 1,000 | 19.8 | 96.5 | 2 | Social (movie collaboration) |
| COLLAB | 5,000 | 74.5 | 2,457.8 | 3 | Social (research collaboration) |
| OGB-molhiv | 41,127 | 25.5 | 27.5 | 2 | Molecular (HIV inhibition) |
| OGB-molpcba | 437,929 | 26.0 | 28.1 | 128 | Molecular (bioassays) |
| QM9 | 130,831 | 18.0 | 18.6 | - | Molecular (12 quantum properties) |

Evaluation. Graph classification: 10-fold cross-validation accuracy for TU datasets; ROC-AUC for OGB. Molecular regression (QM9): mean absolute error (MAE) per property, with property-dependent target units (Å, eV, debye).

E.3 Long-Range Graph Benchmarks (LRGB)

Introduced by Dwivedi et al. (2022) to specifically test long-range dependency learning - tasks where local GNNs fail but graph transformers succeed.

| Dataset | Size | Task | Long-range property |
|---|---|---|---|
| Peptides-func | 15K graphs, avg 150 nodes | Graph multi-label clf | Functional class depends on full peptide structure |
| Peptides-struct | 15K graphs, avg 150 nodes | Graph regression | 3D structural properties (end-to-end distances) |
| PCQM-Contact | 529K graphs | Link prediction | Molecular contact maps (spatial proximity) |
| COCO-SP | 123K graphs, avg 476 nodes | Node segmentation | Pixel graphs; semantic context is long-range |

GNN vs Graph Transformer gap. On Peptides-func (AP metric):

  • GCN: 0.5930
  • GIN: 0.6621
  • GPS (GINE + Transformer): 0.7470
  • GPS + RWSE: 0.7538

The $\sim 0.09$ gap between GIN and GPS demonstrates the clear advantage of global attention for long-range tasks.

E.4 Choosing the Right Architecture

ARCHITECTURE SELECTION GUIDE
========================================================================

  Task requires long-range dependencies?
  +-- YES -> Graph Transformer (GPS, Graphormer)
  |         or Graph rewiring + MPNN
  +-- NO  -> continue below

  Graph has millions of nodes?
  +-- YES -> GraphSAGE (inductive) or Cluster-GCN/GraphSAINT (large)
  +-- NO  -> continue below

  Need to classify entire graphs?
  +-- YES -> GIN + sum pooling (expressiveness matters)
  |         or DiffPool/SAGPool for hierarchical structure
  +-- NO  -> continue below

  Need interpretable edge importance?
  +-- YES -> GAT or GATv2
  +-- NO  -> continue below

  Simple strong baseline needed first?
  +-- -> 2-layer GCN (fast to implement, often competitive)

  Has rich edge features (bond types, relation types)?
  +-- -> MPNN/GINE with edge message functions

========================================================================

Appendix F: Notation Summary for This Section

| Symbol | Definition | First Used |
|---|---|---|
| $G = (V, E)$ | Undirected graph with vertex set $V$, edge set $E$ | 2.1 |
| $n = \lvert V \rvert$, $m = \lvert E \rvert$ | Number of nodes and edges | 2.1 |
| $X \in \mathbb{R}^{n \times d_v}$ | Node feature matrix; row $v$ is $\mathbf{x}_v$ | 2.1 |
| $\mathcal{N}(v)$ | Neighborhood of node $v$: $\{u : (u,v) \in E\}$ | 1.2 |
| $d_v = \lvert\mathcal{N}(v)\rvert$ | Degree of node $v$ | 2.1 |
| $\mathbf{h}_v^{[l]} \in \mathbb{R}^d$ | Representation of node $v$ at layer $l$ | 3.1 |
| $H^{[l]} \in \mathbb{R}^{n \times d}$ | All node representations at layer $l$ | 4.2 |
| $A \in \{0,1\}^{n \times n}$ | Adjacency matrix | 2.1 |
| $\tilde{A} = A + I_n$ | Adjacency with self-loops | 4.2 |
| $D \in \mathbb{R}^{n \times n}$ | Degree matrix: $D_{ii} = d_i$ | 4.2 |
| $\tilde{D}$ | Degree matrix of $\tilde{A}$ | 4.2 |
| $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ | GCN propagation matrix | 4.2 |
| $L = D - A$ | Unnormalized Laplacian | 8.2 |
| $L_{\text{sym}} = I - D^{-1/2}AD^{-1/2}$ | Normalized (symmetric) Laplacian | 10.2 |
| $E(H) = \operatorname{tr}(H^\top L H)$ | Dirichlet energy of $H$ | 8.2 |
| $\mathbf{m}_v^{[l]} \in \mathbb{R}^{d'}$ | Aggregated message at node $v$, layer $l$ | 3.1 |
| $M^{[l]}(\cdot)$ | Message function at layer $l$ | 3.1 |
| $U^{[l]}(\cdot)$ | Update function at layer $l$ | 3.1 |
| $R(\cdot)$ | Readout function for graph-level tasks | 3.1 |
| $W^{[l]} \in \mathbb{R}^{d \times d'}$ | Weight matrix at layer $l$ | 4.2 |
| $\alpha_{uv} \in [0,1]$ | GAT attention coefficient from $u$ to $v$ | 6.2 |
| $c_v^{(t)}$ | WL color of node $v$ at iteration $t$ | 7.2 |
| $\varepsilon^{[l]}$ | GIN learnable mixing parameter | 7.4 |
| $\operatorname{MAD}(H)$ | Mean Average Distance of node representations | 8.1 |
| $S \in \mathbb{R}^{n \times k}$ | DiffPool soft cluster assignment matrix | 9.3 |
| $\text{PE}_v \in \mathbb{R}^p$ | Positional encoding of node $v$ | 10.2 |
| $U_k \in \mathbb{R}^{n \times k}$ | First $k$ eigenvectors of $L_{\text{sym}}$ | 10.2 |
| $\phi(u,v)$ | Shortest-path distance between $u$ and $v$ | 10.4 |
| $b_\phi$ | Graphormer spatial encoding bias at distance $\phi$ | 10.4 |
| $P \in \Pi_n$ | $n \times n$ permutation matrix | 2.3 |
| $\boldsymbol{\pi} \in \mathbb{R}^n$ | Stationary distribution of random walk on $\tilde{A}$ | 8.1 |
| $\Theta_{\text{GNN}}(G_1, G_2)$ | Graph neural tangent kernel | C.1 |
| $K_{\text{WL}}(G_1, G_2)$ | Weisfeiler-Leman graph kernel | C.1 |

All vectors are column vectors by default ($\mathbf{h}_v \in \mathbb{R}^d$ means $d \times 1$). Matrix norms: $\lVert W \rVert_F$ (Frobenius), $\lVert W \rVert_2$ (spectral). Notation follows docs/NOTATION_GUIDE.md throughout.


Appendix G: Further Reading

Foundational Papers (Must-Read)

  1. Scarselli et al. (2009) - "The Graph Neural Network Model." IEEE Transactions on Neural Networks. Original GNN formulation with fixed-point iteration. Historical foundation.

  2. Defferrard, Bresson & Vandergheynst (2016) - "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering." NeurIPS 2016. ChebNet: polynomial spectral filters, making spectral GNNs computationally feasible.

  3. Kipf & Welling (2017) - "Semi-Supervised Classification with Graph Convolutional Networks." ICLR 2017. GCN: the simplification that started the field. Short, clear, highly cited.

  4. Hamilton, Ying & Leskovec (2017) - "Inductive Representation Learning on Large Graphs." NeurIPS 2017. GraphSAGE: inductive learning, neighbor sampling, unsupervised objectives.

  5. Gilmer et al. (2017) - "Neural Message Passing for Quantum Chemistry." ICML 2017. MPNN framework: the unified abstraction. The formalism most GNN papers use today.

  6. Velickovic et al. (2018) - "Graph Attention Networks." ICLR 2018. GAT: learned attention weights for graphs. Widely deployed in production.

  7. Xu et al. (2019) - "How Powerful are Graph Neural Networks?" ICLR 2019. GIN + WL expressiveness theorem: the theoretical foundation of GNN expressiveness.

Depth Pathologies

  1. Li et al. (2018) - "Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification." AAAI 2018. First formal analysis of over-smoothing.

  2. Alon & Yahav (2021) - "On the Bottleneck of Graph Neural Networks and its Practical Implications." ICLR 2021. Over-squashing: Jacobian analysis and graph rewiring motivation.

  3. Chen et al. (2020) - "Simple and Deep Graph Convolutional Networks." ICML 2020. GCNII: initial residual + identity mapping for training 64-layer GCNs.

Graph Transformers

  1. Ying et al. (2021) - "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. Graphormer: spatial, central, and edge encodings for graph transformers. OGB-LSC winner.

  2. Rampasek et al. (2022) - "Recipe for a General, Powerful, Scalable Graph Transformer." NeurIPS 2022. GPS framework: MPNN + Transformer combination with arbitrary PEs.

  3. Dwivedi et al. (2022) - "Long Range Graph Benchmark." NeurIPS 2022 Datasets. LRGB: benchmarks specifically designed to test long-range dependency learning.

Expressiveness

  1. Morris et al. (2019) - "Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks." AAAI 2019. $k$-GNN: implementing $k$-WL as a neural network.

  2. Brody, Alon & Yahav (2022) - "How Attentive are Graph Attention Networks?" ICLR 2022. GATv2: fixing the static attention problem in GAT.

Applications

  1. Jumper et al. (2021) - "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. AlphaFold 2: the Evoformer and structure module. GNNs at scale for biology.

  2. Ying et al. (2018) - "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD 2018. PinSage: GraphSAGE at 3 billion nodes. Engineering lessons for production GNNs.

  3. Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Microsoft Research. Graph RAG: GNNs + LLMs for document understanding.

Textbooks and Surveys

  1. Hamilton (2020) - Graph Representation Learning. Synthesis Lectures on AI. The definitive textbook on GNNs; covers theory and practice comprehensively.

  2. Wu et al. (2022) - "A Comprehensive Study on Large-Scale Graph Training." NeurIPS 2022. Empirical comparison of full-batch, neighbor sampling, and subgraph sampling methods.


Appendix H: Quick Reference - Key Formulas

GCN

$$H^{[l+1]} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{[l]} W^{[l]}\right), \quad \tilde{A} = A + I, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$$
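
For reference, this update corresponds to a few lines of dense linear algebra; a minimal NumPy sketch (real implementations use sparse matrices and learned weights):

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} H W) with self-loops added."""
    A_tilde = A + np.eye(len(A))
    d = A_tilde.sum(axis=1)              # degrees of A~, all >= 1
    d_inv_sqrt = d ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_hat @ H @ W)
```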

GAT Attention Coefficient

$$\alpha_{uv} = \frac{\exp\!\left(\operatorname{LeakyReLU}\!\left(\mathbf{a}^\top [W\mathbf{h}_v \,\|\, W\mathbf{h}_u]\right)\right)}{\sum_{k \in \mathcal{N}(v)} \exp\!\left(\operatorname{LeakyReLU}\!\left(\mathbf{a}^\top [W\mathbf{h}_v \,\|\, W\mathbf{h}_k]\right)\right)}$$

GATv2 (Dynamic Attention)

$$e_{uv} = \mathbf{a}^\top \operatorname{LeakyReLU}\!\left(W\left[\mathbf{h}_v \,\|\, \mathbf{h}_u\right]\right)$$

GIN

$$\mathbf{h}_v^{[l+1]} = \operatorname{MLP}^{[l]}\!\left((1 + \varepsilon^{[l]})\,\mathbf{h}_v^{[l]} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{[l]}\right)$$

GraphSAGE Update

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(W^{[l]}\left[\mathbf{h}_v^{[l]} \,\Big\|\, \frac{1}{|\mathcal{S}_v|}\sum_{u \in \mathcal{S}_v} \mathbf{h}_u^{[l]}\right]\right)$$

GCNII (Deep GCN with Residual)

$$\mathbf{h}_v^{[l+1]} = \sigma\!\left(\left[(1-\alpha)\hat{A}\mathbf{h}_v^{[l]} + \alpha \mathbf{h}_v^{[0]}\right]\left[(1-\beta)I + \beta W^{[l]}\right]\right)$$

DiffPool Coarsening

$$H^{(l+1)} = S^{(l)\top} Z^{(l)}, \qquad A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)}$$

Dirichlet Energy

$$E(H) = \sum_{(i,j) \in E} \lVert\mathbf{h}_i - \mathbf{h}_j\rVert_2^2 = \operatorname{tr}(H^\top L H)$$

GPS Layer

$$H^{[l+1]} = \operatorname{LayerNorm}\!\left(H^{[l]} + \operatorname{MPNN}^{[l]}(H^{[l]}, A) + \operatorname{Attn}^{[l]}(H^{[l]})\right)$$

Graphormer Attention with Spatial Bias

$$e_{uv} = \frac{\left[(H W_Q)(H W_K)^\top\right]_{uv}}{\sqrt{d_k}} + b_{\phi(u,v)}$$

1-WL Color Refinement

$$c_v^{(t+1)} = \operatorname{HASH}\!\left(c_v^{(t)},\; \left\{\!\left\{\,c_u^{(t)} : u \in \mathcal{N}(v)\,\right\}\!\right\}\right)$$

Unsupervised GraphSAGE Loss

$$\mathcal{L}(u,v) = -\log\sigma\!\left(\mathbf{h}_u^\top \mathbf{h}_v\right) - Q\,\mathbb{E}_{v_n}\!\left[\log\!\left(1 - \sigma\!\left(\mathbf{h}_u^\top \mathbf{h}_{v_n}\right)\right)\right]$$
