Appendix C: Connections to Classical Machine Learning
C.1 GNNs and Kernel Methods
The GNN's computation can be viewed through the lens of kernel methods. Define the graph neural tangent kernel (GNTK) as the limiting kernel of an infinitely wide GNN at initialization:

$$\Theta(G, G') = \mathbb{E}_{\theta}\left[\left\langle \nabla_\theta f_\theta(G),\ \nabla_\theta f_\theta(G') \right\rangle\right]$$

where $f_\theta$ is the GNN's graph-level output and the expectation is over the random initialization of the parameters $\theta$.
This connects GNNs to the NTK theory of Jacot et al. (2018) and provides a framework for analyzing GNN training dynamics. For graph classification:
- Wide GNN kernel SVM: at infinite width, a GNN with random feature initialization and gradient descent is equivalent to a kernel machine with the GNTK
- GNTK expressiveness: the GNTK of a GIN-like architecture is exactly as expressive as 1-WL (matching the finite-depth GNN result)
Graph Weisfeiler-Leman kernel (Shervashidze et al., 2011). Before GNNs, graph kernels were the primary ML method for graphs. The WL kernel computes a feature vector from the color histograms at each WL refinement step:

$$k_{\mathrm{WL}}(G, G') = \sum_{t=0}^{T} \left\langle \phi_t(G),\ \phi_t(G') \right\rangle$$

where $\phi_t(G)$ is the histogram of colors at WL iteration $t$. The WL kernel is equivalent to a GIN with fixed (non-learned) MLPs - it is the non-parametric version of GIN.
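A minimal Python sketch of this kernel under the definitions above (the function names are illustrative; a production implementation such as GraKeL would use perfect hashing and sparse histograms):

```python
import numpy as np
from collections import Counter

def wl_kernel(adj_a, adj_b, labels_a, labels_b, num_iters=3):
    """WL subtree kernel: sum over iterations of histogram inner products.

    adj_*:    adjacency lists (list of neighbor lists)
    labels_*: initial node labels (e.g., atom types), hashable values
    """
    def refine(adj, labels):
        # New color = hash of (own color, sorted multiset of neighbor colors)
        return [hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                for v in range(len(adj))]

    k = 0.0
    for _ in range(num_iters + 1):      # includes the t = 0 histogram
        ha, hb = Counter(labels_a), Counter(labels_b)
        k += sum(ha[c] * hb[c] for c in ha)   # <phi_t(G), phi_t(G')>
        labels_a = refine(adj_a, labels_a)
        labels_b = refine(adj_b, labels_b)
    return k
```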
For AI: Understanding GNNs as kernel machines allows the use of kernel theory (generalization bounds, bandwidth selection, representer theorem) to analyze GNN behavior - important for the theoretical foundations in Chapter 12 (Functional Analysis) and Chapter 21 (Statistical Learning Theory).
C.2 GNNs and Belief Propagation
Loopy belief propagation (LBP) on probabilistic graphical models (Markov random fields, factor graphs) has exactly the same computational structure as MPNN message passing.
Belief propagation update:

$$m_{i \to j}(x_j) = \sum_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)$$

MPNN message update:

$$m_v^{(\ell)} = \bigoplus_{u \in N(v)} \psi^{(\ell)}\!\left(h_v^{(\ell-1)}, h_u^{(\ell-1)}\right)$$

The analogy:
- BP messages $m_{i \to j}$ <-> GNN messages $m_v^{(\ell)}$
- BP pairwise potential $\psi_{ij}$ <-> GNN message function $\psi^{(\ell)}$
- BP product of incoming messages <-> GNN sum/product aggregation
Implication: GNNs can be viewed as learned approximations to belief propagation. Where BP computes exact marginals on trees (and approximate marginals on loopy graphs), GNNs learn to compute useful representations that may or may not correspond to probabilistic marginals. For tasks that arise from probabilistic inference on graphs (decoding, error correction, satisfiability), architectures inspired by BP (like LDPC decoders implemented as GNNs) achieve state-of-the-art results.
C.3 Label Propagation vs GCN
Label propagation (Zhu et al., 2003) is a classical semi-supervised learning method that also propagates information across graph edges. The update rule:

$$Y^{(t+1)} = \alpha S Y^{(t)} + (1 - \alpha) Y^{(0)}$$

where $S = D^{-1/2} A D^{-1/2}$, $Y^{(t)}$ is the label matrix (soft predictions), $Y^{(0)}$ is the observed labels (zero for unlabeled nodes), and $\alpha \in (0, 1)$ controls the balance between propagation and fitting observed labels.

At convergence: $Y^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y^{(0)}$ - a closed-form solution.
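A minimal NumPy sketch of this iteration (dense matrices, so small graphs only; the function name is illustrative):

```python
import numpy as np

def label_propagation(A, Y0, alpha=0.9, num_iters=50):
    """Iterate Y <- alpha * S @ Y + (1 - alpha) * Y0.

    A:  (n, n) dense adjacency matrix
    Y0: (n, C) one-hot observed labels; zero rows for unlabeled nodes
    """
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    Y = Y0.astype(float).copy()
    for _ in range(num_iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y
    # Closed form (small n only): (1 - alpha) * inv(I - alpha * S) @ Y0
```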
GCN vs label propagation:
- Label propagation: linear, closed-form, uses only graph structure (no node features)
- GCN: nonlinear, gradient-based, uses both node features and graph structure
- For feature-rich graphs: GCN consistently outperforms LP
- For feature-poor graphs (only node IDs): LP is competitive and much cheaper
- APPNP (Gasteiger et al., 2019): decouples the GNN feature transformation from the propagation step, applying propagation as a post-processing operation similar to LP (see the sketch below). Achieves better performance than GCN on many node classification benchmarks by using more propagation steps without over-smoothing, thanks to the teleport coefficient $\alpha$ that keeps predictions anchored to the transformed features.
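A sketch of the APPNP-style propagation step (illustrative names; assumes the GCN propagation matrix $\hat{A}$ and the MLP outputs $H$ are already computed):

```python
import numpy as np

def appnp_propagate(A_hat, H, alpha=0.1, num_steps=10):
    """Personalized-PageRank-style propagation applied to MLP outputs.

    A_hat: (n, n) GCN propagation matrix D~^{-1/2} A~ D~^{-1/2}
    H:     (n, C) predictions from a graph-agnostic feature MLP
    alpha: teleport probability; anchors Z to H, preventing over-smoothing
    """
    Z = H.copy()
    for _ in range(num_steps):
        Z = (1.0 - alpha) * (A_hat @ Z) + alpha * H
    return Z
```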
C.4 Random Walk Methods and GNNs
Before GNNs, random walk methods (DeepWalk, Node2Vec) were the standard for learning node embeddings.
DeepWalk (Perozzi et al., 2014). Sample random walks from each node; treat the walk as a "sentence" and apply Skip-Gram (Word2Vec) to learn node embeddings that predict context nodes in the walk.
Node2Vec (Grover & Leskovec, 2016). Biased random walks with parameters $p$ (return parameter) and $q$ (in-out parameter) to balance BFS-like (structural equivalence) and DFS-like (homophily) exploration.
Connection to GNNs. The Skip-Gram objective for DeepWalk is implicitly equivalent to factorizing the matrix:

$$M = \log\!\left(\frac{\operatorname{vol}(G)}{T}\left(\sum_{r=1}^{T} P^{r}\right) D^{-1}\right) - \log b$$

where $P = D^{-1}A$ is the transition matrix and $P^{r}$ is the $r$-step transition matrix, $T$ is the context window size, and $b$ is the number of negative samples. This is related to the graph's random walk kernel.
GNNs can incorporate random walk information through RWSE (7.6) or through the unsupervised pretraining objective of GraphSAGE (5.4). Modern practice often combines: pretrain with random walk objectives (fast, scalable), then fine-tune with GNN layers (expressive, task-specific).
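A minimal DeepWalk-style walk sampler (illustrative; the resulting walks would then be fed as "sentences" to a Skip-Gram trainer such as gensim's Word2Vec):

```python
import numpy as np

def sample_walks(adj, walk_length=40, walks_per_node=10, rng=None):
    """Sample uniform random walks from every node.

    adj: adjacency list (list of neighbor lists).
    Returns a list of walks, each a list of node ids.
    """
    rng = rng or np.random.default_rng(0)
    walks = []
    for start in range(len(adj)):
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:          # dangling node: end the walk early
                    break
                walk.append(int(rng.choice(nbrs)))
            walks.append(walk)
    return walks
```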
C.5 Over-smoothing and Spectral Perspective
From the spectral perspective of 11-04, over-smoothing has a clean frequency interpretation. A graph signal $x$ can be decomposed into its graph Fourier components $\hat{x}_i = u_i^\top x$ (where $u_i$ is the $i$-th Laplacian eigenvector). The GCN propagation acts as a low-pass filter:

$$\hat{A} = I - \tilde{L}_{\mathrm{sym}}, \qquad g(\lambda_i) = 1 - \lambda_i$$

where $\lambda_i$ are the eigenvalues of $\tilde{L}_{\mathrm{sym}}$, the normalized Laplacian of the self-loop graph. After $K$ layers the response is $(1 - \lambda_i)^K$:
- Lowest frequency ($\lambda_1 = 0$, $u_1 \propto \tilde{D}^{1/2}\mathbf{1}$): constant component, amplified by factor $1$ - perfectly preserved
- Low frequencies ($\lambda_i \approx 0$): slowly varying signals, amplified by $(1 - \lambda_i)^K \approx 1$ - nearly preserved
- High frequencies ($\lambda_i$ near $\lambda_{\max}$): rapidly oscillating signals, suppressed by $(1 - \lambda_i)^K \to 0$ - exponentially suppressed
The class label signal is typically a low-frequency signal (connected nodes tend to have the same class in homophilic graphs). But even the label signal is eventually suppressed by many propagation steps - after enough layers, only the zero-frequency constant component survives and all node representations collapse.
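A small NumPy sketch that computes this per-frequency gain for a dense adjacency matrix (the helper name is illustrative):

```python
import numpy as np

def gcn_filter_response(A, num_layers):
    """Per-frequency gain of K GCN propagation steps: (1 - lambda_i)^K,
    where lambda_i are eigenvalues of the self-loop normalized Laplacian."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)
    d = A_tilde.sum(axis=1)
    A_hat = A_tilde / np.sqrt(np.outer(d, d))      # D~^{-1/2} A~ D~^{-1/2}
    lam = np.linalg.eigvalsh(np.eye(n) - A_hat)    # eigenvalues of L~_sym
    return lam, (1.0 - lam) ** num_layers          # gain per frequency

# The gain at lambda = 0 stays exactly 1, while gains at high
# frequencies shrink exponentially as num_layers grows.
```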
Optimal depth. The optimal number of layers balances two effects:
- More layers: larger receptive field, more node context, better long-range information
- More layers: stronger smoothing, loss of high-frequency discriminative features
For a graph with Fiedler value $\lambda_2$ (the smallest non-zero Laplacian eigenvalue), the optimal depth is approximately $L \approx 1/\lambda_2$ (the mixing time of the random walk). For clustered graphs with small $\lambda_2$, many propagation layers can be used before collapse; for expander-like graphs with large $\lambda_2$, only a few layers are optimal.
Appendix D: GNN Training Recipes and Engineering Tricks
D.1 Data Preprocessing for Graph ML
Node feature normalization. Unlike image pixels (naturally in $[0, 1]$), node features in graphs span wildly different scales. Atom features in molecular graphs: atomic number (1-118), formal charge (roughly $-2$ to $+2$), number of hydrogens (0-4). Normalizing to zero mean, unit variance per feature dimension is essential.
Edge feature preprocessing. Bond lengths in molecules range from roughly 1.2 Å (triple bonds) to 1.5 Å (single bonds). If used as raw edge features, their small absolute range means they have little influence relative to bond-type one-hot features. Apply normalization or use learned embeddings.
Graph-level normalization. For graph regression tasks (predicting molecular energy, which scales with the number of atoms), normalize the target by dividing by $n$ (the atom count). Otherwise, larger molecules trivially have larger predicted values.
Degree-based normalization. When using sum aggregation, node representation magnitudes can grow with degree. Apply degree normalization as part of preprocessing: scale node features by $d_v^{-1/2}$ before input, matching the GCN normalization.
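A minimal preprocessing sketch combining the z-score and degree normalizations above (illustrative; assumes dense features and precomputed degrees):

```python
import numpy as np

def preprocess_node_features(X, degrees):
    """Z-score node features per dimension, then apply d_v^{-1/2} scaling.

    X:       (n, d) raw node features
    degrees: (n,) node degrees
    """
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8  # avoid divide-by-zero
    X = (X - mu) / sigma                              # zero mean, unit variance
    scale = np.where(degrees > 0, degrees, 1.0) ** -0.5
    return X * scale[:, None]                         # GCN-style degree scaling
```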
D.2 Positional Encodings: Practical Considerations
LapPE sign ambiguity. Eigenvectors are defined only up to sign: if $u_i$ is an eigenvector of $L$, so is $-u_i$. Naive use of eigenvectors as PEs means two identical graphs with differently-signed PEs produce different GNN outputs - the model is not invariant to this arbitrary sign choice.
Fix (Lim et al., 2022): during training, randomly flip the sign of each eigenvector independently at each mini-batch. This forces the GNN to be invariant to sign choice. At inference, use a consistent sign (e.g., always choose the sign such that the first non-zero element is positive).
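A sketch of both sides of this fix (illustrative helper names):

```python
import numpy as np

def random_sign_flip(lap_pe, rng=None):
    """Training-time augmentation: flip each eigenvector column's sign
    independently so the model becomes invariant to the sign choice."""
    rng = rng or np.random.default_rng()
    signs = rng.choice([-1.0, 1.0], size=lap_pe.shape[1])
    return lap_pe * signs[None, :]

def canonical_signs(lap_pe, eps=1e-8):
    """Inference-time convention: make the first element with magnitude
    above eps positive in every eigenvector column."""
    pe = lap_pe.copy()
    for j in range(pe.shape[1]):
        nz = np.flatnonzero(np.abs(pe[:, j]) > eps)
        if nz.size and pe[nz[0], j] < 0:
            pe[:, j] *= -1
    return pe
```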
RWSE efficiency. Computing $(P^k)_{vv}$ for $k = 1, \dots, K$ requires matrix powers - expensive for large graphs. Efficient computation: start random walks from node $v$; estimate the return probability by averaging over random walkers:

$$\widehat{(P^k)_{vv}} = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\!\left[\text{walker } j \text{ is back at } v \text{ after } k \text{ steps}\right]$$

With $m$ walkers per node, this gives RWSE estimates with error $O(1/\sqrt{m})$ in $O(nmK)$ time.
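A Monte Carlo RWSE sketch along these lines (illustrative; uses adjacency lists and slow pure-Python loops that a real implementation would vectorize or batch):

```python
import numpy as np

def rwse_monte_carlo(adj, num_steps, num_walkers=100, rng=None):
    """Estimate (P^k)_{vv} for k = 1..num_steps by simulating random
    walkers and recording how often each walker returns to its start."""
    rng = rng or np.random.default_rng(0)
    n = len(adj)
    est = np.zeros((n, num_steps))
    for v in range(n):
        for _ in range(num_walkers):
            cur = v
            for step in range(1, num_steps + 1):
                nbrs = adj[cur]
                if not nbrs:
                    break
                cur = int(rng.choice(nbrs))
                if cur == v:
                    est[v, step - 1] += 1.0
    return est / num_walkers   # est[v, k-1] estimates (P^k)_{vv}
```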
Eigenvalue degeneracy. If two eigenvalues of $L$ are equal ($\lambda_i = \lambda_{i+1}$), the corresponding eigenspace is (at least) two-dimensional and any orthonormal basis for it is valid. This means LapPE is non-unique for graphs with symmetry (e.g., complete graphs, regular graphs). Degenerate eigenvectors are not canonically defined - they should either be excluded from PEs or handled with random rotation augmentation.
D.3 Loss Functions for Graph Tasks
Node classification. Cross-entropy over labeled nodes:

$$\mathcal{L} = -\sum_{v \in \mathcal{V}_{\text{train}}} \sum_{c=1}^{C} y_{vc} \log \hat{y}_{vc}$$
For class-imbalanced graphs (e.g., rare fraud nodes), use focal loss or weighted cross-entropy.
Link prediction. Binary cross-entropy with negative sampling:

$$\mathcal{L} = -\sum_{(u,v) \in E^{+}} \log \sigma\!\left(z_u^\top z_v\right) - \sum_{(u,v) \in E^{-}} \log\!\left(1 - \sigma\!\left(z_u^\top z_v\right)\right)$$

where $E^{+}$ are observed edges and $E^{-}$ are sampled non-edges.
Negative sampling ratio (how many non-edges per positive edge): 1:1 to 1:5 is standard. Hard negative mining (sampling non-edges that are structurally close to positive edges) improves embedding quality at the cost of more complex sampling.
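A minimal PyTorch sketch of this loss (illustrative; assumes GNN embeddings `z` and pre-sampled negative edges, here at a 1:1 ratio):

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edges, neg_edges):
    """BCE with negative sampling on dot-product edge scores.

    z:         (n, d) node embeddings from the GNN
    pos_edges: (2, P) indices of observed edges
    neg_edges: (2, Q) indices of sampled non-edges (e.g., Q = P for 1:1)
    """
    pos_score = (z[pos_edges[0]] * z[pos_edges[1]]).sum(dim=-1)
    neg_score = (z[neg_edges[0]] * z[neg_edges[1]]).sum(dim=-1)
    return (F.binary_cross_entropy_with_logits(pos_score,
                                               torch.ones_like(pos_score))
          + F.binary_cross_entropy_with_logits(neg_score,
                                               torch.zeros_like(neg_score)))
```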
Graph regression. Mean absolute error (MAE) for quantum chemistry (physically meaningful, outlier-robust) or mean squared error (MSE) for property prediction where large errors are particularly bad. Normalize targets per-dataset: $y' = (y - \mu)/\sigma$ for training; inverse-transform predictions at evaluation.
Contrastive learning on graphs. Self-supervised GNNs (GraphCL, SimGRACE) maximize agreement between augmented views of the same graph:

$$\mathcal{L}_i = -\log \frac{\exp\!\left(\operatorname{sim}(z_i, z_i') / \tau\right)}{\sum_{j} \exp\!\left(\operatorname{sim}(z_i, z_j') / \tau\right)}$$

where $\operatorname{sim}$ is cosine similarity, $\tau$ is a temperature, and $z_i'$ is the embedding of an augmented view (e.g., edge dropout, node feature masking). Graph augmentations must preserve semantic meaning - dropping a bond in a molecule changes its identity, so augmentation must be domain-aware.
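A sketch of the NT-Xent form of this loss in PyTorch (illustrative; treats matching rows of the two view batches as positives):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """Contrastive loss between two augmented views of a batch of graphs.

    z1, z2: (B, d) graph embeddings of view 1 and view 2 (row i: same graph)
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                             # (B, B) cosine sims
    targets = torch.arange(z1.size(0), device=sim.device)  # diagonal positives
    return F.cross_entropy(sim, targets)
```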
D.4 Debugging GNNs in Practice
Symptom: GNN performs worse than MLP on node features alone. Cause: the graph structure is not informative for the task (e.g., low homophily), or the GNN is over-smoothing and destroying the discriminative node features. Fix: verify the edge homophily ratio $h$ (the fraction of edges connecting same-label nodes); if $h < 0.5$, graph structure likely hurts. Use heterophilic GNNs (FAGCN, H2GCN) that can handle graphs where connected nodes have different labels.
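A one-liner to check the homophily ratio before training (illustrative helper):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label.

    edges:  (2, m) array of edge indices
    labels: (n,) integer node labels
    """
    return float(np.mean(labels[edges[0]] == labels[edges[1]]))
```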
Symptom: Training loss decreases but validation loss explodes. Cause: overfitting, likely due to GNN memorizing the training graph topology rather than learning transferable patterns. Fix: add dropout on node features (not just edges), reduce model capacity, add weight decay. For transductive semi-supervised learning, ensure the train/val split is performed correctly.
Symptom: All node representations collapse to the same vector. Cause: over-smoothing (too many layers) or numerical instability (exploding/vanishing gradients through repeated multiplication by $\hat{A}$). Fix: check the Dirichlet energy after each layer; use PairNorm or GCNII; add gradient clipping.
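A quick Dirichlet-energy probe to run after each layer (illustrative helper):

```python
import numpy as np

def dirichlet_energy(H, edges):
    """E(H) = sum over edges of ||h_u - h_v||^2; near-zero energy after a
    layer signals over-smoothing (representations collapsing together)."""
    diffs = H[edges[0]] - H[edges[1]]
    return float((diffs ** 2).sum())
```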
Symptom: GAT attention weights are all uniform ($\alpha_{vu} \approx 1/|N(v)|$). Cause: the attention vector $a$ has collapsed - the model learned that uniform attention is safest. Often occurs with high learning rates or without proper initialization. Fix: initialize $a$ with Xavier initialization; use a smaller learning rate for attention parameters; check for gradient vanishing in LeakyReLU (use a larger negative slope, e.g., $0.2$).
Symptom: Training is extremely slow. Cause: running GNN on the full large graph every step (full-batch training). Fix: switch to mini-batch training with neighbor sampling (GraphSAGE), graph partitioning (Cluster-GCN), or subgraph sampling (GraphSAINT). Profile GPU utilization - if below 70%, the bottleneck is data loading, not computation.
Appendix E: Benchmark Datasets and Evaluation
E.1 Standard Node Classification Benchmarks
| Dataset | Nodes | Edges | Features | Classes | Homophily | Task |
|---|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 | 0.81 | Transductive node clf |
| CiteSeer | 3,327 | 4,732 | 3,703 | 6 | 0.74 | Transductive node clf |
| PubMed | 19,717 | 44,338 | 500 | 3 | 0.80 | Transductive node clf |
| Amazon-Photo | 7,650 | 119,081 | 745 | 8 | 0.83 | Transductive node clf |
| Chameleon | 2,277 | 36,101 | 2,325 | 5 | 0.23 | Heterophilic node clf |
| Squirrel | 5,201 | 217,073 | 2,089 | 5 | 0.22 | Heterophilic node clf |
| OGB-arxiv | 169,343 | 1,166,243 | 128 | 40 | 0.65 | Large-scale node clf |
| OGB-products | 2,449,029 | 61,859,140 | 100 | 47 | 0.81 | Large-scale node clf |
Evaluation protocol. Standard split for Cora/CiteSeer/PubMed: 20 nodes per class for training, 500 for validation, 1000 for test (following Kipf & Welling, 2017). OGB uses official dataset-specific splits with multiple random seeds for reliable comparison.
E.2 Standard Graph Classification Benchmarks
| Dataset | Graphs | Avg. nodes | Avg. edges | Classes | Domain |
|---|---|---|---|---|---|
| MUTAG | 188 | 17.9 | 19.8 | 2 | Molecular (mutagenicity) |
| PROTEINS | 1,113 | 39.1 | 72.8 | 2 | Protein function |
| IMDB-B | 1,000 | 19.8 | 96.5 | 2 | Social (movie collaboration) |
| COLLAB | 5,000 | 74.5 | 2,457.8 | 3 | Social (research collaboration) |
| OGB-molhiv | 41,127 | 25.5 | 27.5 | 2 | Molecular (HIV inhibition) |
| OGB-molpcba | 437,929 | 26.0 | 28.1 | 128 | Molecular (bioassays) |
| QM9 | 130,831 | 18.0 | 18.6 | - | Molecular (12 quantum properties) |
Evaluation. Graph classification: 10-fold cross-validation accuracy for TU datasets; ROC-AUC for OGB-molhiv and Average Precision for OGB-molpcba. Molecular regression (QM9): mean absolute error (MAE) per property; target units are property-dependent (e.g., eV, Debye, Å).
E.3 Long-Range Graph Benchmarks (LRGB)
Introduced by Dwivedi et al. (2022) to specifically test long-range dependency learning - tasks where local GNNs fail but graph transformers succeed.
| Dataset | Size | Task | Long-range property |
|---|---|---|---|
| Peptides-func | 15.5K graphs, avg 150 nodes | Graph multi-label clf | Functional class depends on full peptide structure |
| Peptides-struct | 15.5K graphs, avg 150 nodes | Graph regression | 3D structural properties (end-to-end distances) |
| PCQM-Contact | 529K graphs | Link prediction | Molecular contact maps (spatial proximity) |
| COCO-SP | 123K graphs, avg 476 nodes | Node segmentation | Pixel graphs; semantic context is long-range |
GNN vs Graph Transformer gap. On Peptides-func (AP metric):
- GCN: 0.5930
- GIN: 0.6621
- GPS (GINE + Transformer): 0.7470
- GPS + RWSE: 0.7538
The gap between GIN and GPS demonstrates the clear advantage of global attention for long-range tasks.
E.4 Choosing the Right Architecture
ARCHITECTURE SELECTION GUIDE
========================================================================
Task requires long-range dependencies?
+-- YES -> Graph Transformer (GPS, Graphormer)
|          or graph rewiring + MPNN
+-- NO  -> continue below
Graph has millions of nodes?
+-- YES -> GraphSAGE (inductive) or Cluster-GCN/GraphSAINT (large)
+-- NO  -> continue below
Need to classify entire graphs?
+-- YES -> GIN + sum pooling (expressiveness matters)
|          or DiffPool/SAGPool for hierarchical structure
+-- NO  -> continue below
Need interpretable edge importance?
+-- YES -> GAT or GATv2
+-- NO  -> continue below
Simple strong baseline needed first?
+-- YES -> 2-layer GCN (fast to implement, often competitive)
Rich edge features (bond types, relation types)?
+-- YES -> MPNN/GINE with edge message functions
========================================================================
Appendix F: Notation Summary for This Section
| Symbol | Definition | First Used |
|---|---|---|
| $G = (V, E)$ | Undirected graph with vertex set $V$, edge set $E$ | 2.1 |
| $n$, $m$ | Number of nodes and edges | 2.1 |
| $X \in \mathbb{R}^{n \times d}$ | Node feature matrix; row $v$ is $x_v$ | 2.1 |
| $N(v)$ | Neighborhood of node $v$: $\{u : (u, v) \in E\}$ | 1.2 |
| $d_v$ | Degree of node $v$ | 2.1 |
| $h_v^{(\ell)}$ | Representation of node $v$ at layer $\ell$ | 3.1 |
| $H^{(\ell)}$ | All node representations at layer $\ell$ | 4.2 |
| $A$ | Adjacency matrix | 2.1 |
| $\tilde{A} = A + I$ | Adjacency with self-loops | 4.2 |
| $D$ | Degree matrix: $D_{vv} = d_v$ | 4.2 |
| $\tilde{D}$ | Degree matrix of $\tilde{A}$ | 4.2 |
| $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ | GCN propagation matrix | 4.2 |
| $L = D - A$ | Unnormalized Laplacian | 8.2 |
| $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ | Normalized (symmetric) Laplacian | 10.2 |
| $E(H)$ | Dirichlet energy of $H$ | 8.2 |
| $m_v^{(\ell)}$ | Aggregated message at node $v$, layer $\ell$ | 3.1 |
| $\psi^{(\ell)}$ | Message function at layer $\ell$ | 3.1 |
| $\phi^{(\ell)}$ | Update function at layer $\ell$ | 3.1 |
| $R$ | Readout function for graph-level tasks | 3.1 |
| $W^{(\ell)}$ | Weight matrix at layer $\ell$ | 4.2 |
| $\alpha_{vu}$ | GAT attention coefficient from $u$ to $v$ | 6.2 |
| $c_v^{(t)}$ | WL color of node $v$ at iteration $t$ | 7.2 |
| $\epsilon$ | GIN learnable mixing parameter | 7.4 |
| $\mathrm{MAD}$ | Mean Average Distance of node representations | 8.1 |
| $S$ | DiffPool soft cluster assignment matrix | 9.3 |
| $p_v$ | Positional encoding of node $v$ | 10.2 |
| $U_k$ | First $k$ eigenvectors of $L$ | 10.2 |
| $d_{\mathrm{SP}}(u, v)$ | Shortest-path distance between $u$ and $v$ | 10.4 |
| $b_d$ | Graphormer spatial encoding bias at distance $d$ | 10.4 |
| $P$ | $n \times n$ permutation matrix | 2.3 |
| $\pi$ | Stationary distribution of random walk on $G$ | 8.1 |
| $\Theta$ | Graph neural tangent kernel | C.1 |
| $k_{\mathrm{WL}}$ | Weisfeiler-Leman graph kernel | C.1 |
All vectors are column vectors by default ($x \in \mathbb{R}^d$ means $x$ is $d \times 1$). Matrix norms: $\|\cdot\|_F$ (Frobenius), $\|\cdot\|_2$ (spectral). Notation follows docs/NOTATION_GUIDE.md throughout.
Appendix G: Further Reading
Foundational Papers (Must-Read)
- Scarselli et al. (2009) - "The Graph Neural Network Model." IEEE Transactions on Neural Networks. Original GNN formulation with fixed-point iteration. Historical foundation.
- Defferrard, Bresson & Vandergheynst (2016) - "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering." NeurIPS 2016. ChebNet: polynomial spectral filters, making spectral GNNs computationally feasible.
- Kipf & Welling (2017) - "Semi-Supervised Classification with Graph Convolutional Networks." ICLR 2017. GCN: the simplification that started the field. Short, clear, highly cited.
- Hamilton, Ying & Leskovec (2017) - "Inductive Representation Learning on Large Graphs." NeurIPS 2017. GraphSAGE: inductive learning, neighbor sampling, unsupervised objectives.
- Gilmer et al. (2017) - "Neural Message Passing for Quantum Chemistry." ICML 2017. MPNN framework: the unified abstraction. The formalism most GNN papers use today.
- Velickovic et al. (2018) - "Graph Attention Networks." ICLR 2018. GAT: learned attention weights for graphs. Widely deployed in production.
- Xu et al. (2019) - "How Powerful are Graph Neural Networks?" ICLR 2019. GIN + WL expressiveness theorem: the theoretical foundation of GNN expressiveness.
Depth Pathologies
- Li et al. (2018) - "Deeper Insights into Graph Convolutional Networks for Semi-Supervised Classification." AAAI 2018. First formal analysis of over-smoothing.
- Alon & Yahav (2021) - "On the Bottleneck of Graph Neural Networks and its Practical Implications." ICLR 2021. Over-squashing: Jacobian analysis and graph rewiring motivation.
- Chen et al. (2020) - "Simple and Deep Graph Convolutional Networks." ICML 2020. GCNII: initial residual + identity mapping for training 64-layer GCNs.
Graph Transformers
- Ying et al. (2021) - "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. Graphormer: spatial, centrality, and edge encodings for graph transformers. OGB-LSC winner.
- Rampasek et al. (2022) - "Recipe for a General, Powerful, Scalable Graph Transformer." NeurIPS 2022. GPS framework: MPNN + Transformer combination with arbitrary PEs.
- Dwivedi et al. (2022) - "Long Range Graph Benchmark." NeurIPS 2022 Datasets. LRGB: benchmarks specifically designed to test long-range dependency learning.
Expressiveness
- Morris et al. (2019) - "Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks." AAAI 2019. $k$-GNN: implementing $k$-WL as a neural network.
- Brody, Alon & Yahav (2022) - "How Attentive are Graph Attention Networks?" ICLR 2022. GATv2: fixing the static attention problem in GAT.
Applications
- Jumper et al. (2021) - "Highly Accurate Protein Structure Prediction with AlphaFold." Nature. AlphaFold 2: the Evoformer and structure module. GNNs at scale for biology.
- Ying et al. (2018) - "Graph Convolutional Neural Networks for Web-Scale Recommender Systems." KDD 2018. PinSage: GraphSAGE at 3 billion nodes. Engineering lessons for production GNNs.
- Edge et al. (2024) - "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." Microsoft Research. Graph RAG: GNNs + LLMs for document understanding.
Textbooks and Surveys
- Hamilton (2020) - Graph Representation Learning. Synthesis Lectures on AI. The definitive textbook on GNNs; covers theory and practice comprehensively.
- Wu et al. (2022) - "A Comprehensive Study on Large-Scale Graph Training." NeurIPS 2022. Empirical comparison of full-batch, neighbor sampling, and subgraph sampling methods.