Graph Representations - Part 1: Intuition through Sparse Matrix Formats
1. Intuition
1.1 Why Representation Matters
A graph is an abstract object - a set of vertices and a set of edges. Before we can run a single algorithm on it, we need to answer a concrete engineering question: how do we store it in memory?
This question matters more than it might seem. Consider a social network with $n \approx 10^9$ users (Facebook-scale). The "obvious" representation - a square $n \times n$ adjacency matrix - would require $n^2 = 10^{18}$ bits of storage, on the order of a hundred thousand terabytes. Yet this same graph has only $m \approx 10^{11}$ edges (average degree ~100), and a sparse adjacency list stores it in $O(n + m)$ machine words - a few hundred gigabytes, entirely within reach of a modern data centre node.
Conversely, when running a Graph Convolutional Network (GCN) on a small dense molecular graph (tens of atoms and bonds), the adjacency matrix is the natural choice: it enables the GCN's core computation $\hat{A} X W$ as a single dense matrix multiply, with full GPU utilisation and no sparse indexing overhead.
The wrong representation doesn't just waste memory - it changes algorithmic complexity. Checking whether edge $(u, v)$ exists takes $O(1)$ with an adjacency matrix but $O(\deg(u))$ with an adjacency list. Enumerating all neighbours of $u$ takes $O(n)$ with an adjacency matrix but $O(\deg(u))$ with an adjacency list. For a power-law graph where most nodes have degree $O(1)$ but a few hubs have degree in the millions, these differences dominate runtime.
For AI: Every major GNN framework uses a different primary representation:
- PyTorch Geometric (PyG): COO edge_index tensor, shape $(2, m)$
- Deep Graph Library (DGL): CSR-based DGLGraph with sorted adjacency
- JAX/Flax graph libraries: Dense adjacency matrices padded for TPU alignment
- NetworkX: Pure-Python adjacency dictionary (great for prototyping, slow at scale)
Understanding the representations underpinning these frameworks lets you write efficient GNN code, debug shape errors, and choose the right library for each task.
1.2 The Four Questions Every Representation Must Answer
Any graph representation must efficiently support (some subset of) four fundamental operations. The choice of representation is a choice about which operations to optimise:
THE FOUR FUNDAMENTAL GRAPH QUERIES
========================================================================
Q1. EDGE EXISTENCE Does edge (u, v) exist?
E.g., "Is user A friends with user B?"
Q2. NEIGHBOUR ENUM What are all neighbours of vertex u?
E.g., "Who are all of A's friends?"
Q3. VERTEX ITER Iterate over all vertices
E.g., "For every user, compute their degree"
Q4. EDGE ITER Iterate over all edges
E.g., "For every friendship, compute weight"
Q5. (Bonus) MATRIX Compute A*x, A^2*x, eigendecompose A
E.g., PageRank, GCN, spectral clustering
========================================================================
No single representation answers all five questions optimally. The adjacency matrix excels at Q1 and Q5 but wastes memory and is slow for Q2. The adjacency list excels at Q2 and Q3 but is awkward for Q5. Sparse formats (CSR) balance Q2-Q5 for large sparse graphs. The art is choosing the right tool.
1.3 A Taxonomy of Graph Representations
GRAPH REPRESENTATION TAXONOMY
========================================================================
DENSE                  SPARSE                  HYBRID
---------------------  ----------------------  -------------------
Adjacency Matrix       Adjacency List          CSR / CSC
(n x n array)          (dict of lists)         (3-array format)
                       Adjacency Set           COO / edge_index
                       (dict of sets)          (2 x m tensor)
                       Edge List               LIL / DOK
                       (list of tuples)        (dynamic sparse)

STRUCTURED             GENERALISED
---------------------  ----------------------
Incidence Matrix       Heterogeneous adjacency
(n x m, vertex-edge)   (multiple relation types)
Laplacian L = D - A    Temporal snapshot list
(combinatorial)        (graph_t for each time t)
========================================================================
1.4 Historical Evolution
The history of graph representations mirrors the history of computing hardware:
- 1736 (Euler): Graphs as drawings on paper - no formal data structure
- 1960s: Adjacency matrices become standard in combinatorics textbooks; O(n^2) space is acceptable for small graphs
- 1970s: Adjacency lists emerge as graphs grow larger; Dijkstra's 1959 algorithm is re-implemented with lists
- 1980s: Compressed Sparse Row (CSR) format developed for sparse linear algebra; adopted for large graph algorithms
- 1990s: Web-scale graphs (PageRank, 1998) expose the inadequacy of dense representations; sparse formats dominate
- 2000s: GraphChi, PowerGraph, Pregel enable distributed graph processing with edge-list-based disk formats
- 2017: PyTorch Geometric introduces edge_index - a COO format on GPU tensors - as the standard for GNN research
- 2020s: Mixed precision, TPU padding, and FlashAttention-style sparse attention define new representation challenges
1.5 What This Section Covers vs. What Comes Later
This section defines and analyses six core representations. Content that builds on these representations but belongs to later sections:
Preview: Graph Algorithms (03)
BFS, DFS, Dijkstra, and Kruskal all operate on the representations defined here. The choice of adjacency list vs. matrix determines the asymptotic complexity of these algorithms. Full algorithmic treatment: 03 Graph Algorithms.
Preview: Spectral Graph Theory (04)
The Laplacian $L = D - A$ defined in 3.4 below is the central object of spectral graph theory. Its eigenvalues encode connectivity, bipartiteness, and cluster structure. Eigenvalue analysis, the Fiedler vector, and spectral clustering: 04 Spectral Graph Theory.
Preview: Graph Neural Networks (05)
GNNs use the adjacency matrix or edge_index to define message-passing operations. The normalised adjacency $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ introduced in 3.5 is the heart of the GCN propagation rule (Kipf & Welling, 2017). Full GNN architectures: 05 Graph Neural Networks.
2. Formal Framework
2.1 The Representation Problem
Definition (Graph Representation). A representation $R(G)$ of a graph $G = (V, E)$ is a data structure that stores sufficient information to reconstruct $G$ and support a specified set of queries.
A representation is correct if it is lossless: $G$ can be reconstructed exactly from $R(G)$ (up to vertex labelling in the canonical form). Practically, we work with labelled graphs where vertices have fixed integer indices $0, 1, \ldots, n-1$, and correctness means exact reconstruction of both vertex and edge sets.
Notation. Throughout this section:
- $n = |V|$ - number of vertices
- $m = |E|$ - number of edges
- $\Delta$ - maximum degree
- $\bar{d} = 2m/n$ - average degree (undirected)
2.2 Complexity Measures
We evaluate representations on five dimensions:
| Measure | Description | Typical units |
|---|---|---|
| Space | Memory used to store $G$ | Bits or bytes |
| Edge query | Time to answer "does $(u, v) \in E$?" | Wall-clock or $O(\cdot)$ |
| Neighbour enum | Time to list all neighbours of $u$ | Per neighbour |
| Construction | Time to build from an edge list | Total |
| Update | Time to insert or delete an edge | Per operation |
Note that big-$O$ complexity hides constants that matter in practice. An $O(1)$ hash lookup has a larger constant than an $O(1)$ array index. CSR row access is $O(1)$ but requires binary search for edge queries ($O(\log \deg)$).
2.3 Density, Sparsity, and the Fill Ratio
Definition (Fill ratio). The fill ratio of a graph is:
$$\rho = \frac{m}{\binom{n}{2}} = \frac{2m}{n(n-1)}$$
A graph is dense if $\rho$ is close to 1 (close to the maximum $\binom{n}{2}$ edges) and sparse if $\rho \ll 1$.
Practical thresholds:
- $\rho < 10^{-3}$: sparse (social networks, citation graphs, molecular graphs)
- $10^{-3} \le \rho < 10^{-2}$: mildly sparse (small-world networks)
- $10^{-2} \le \rho < 10^{-1}$: moderately dense
- $\rho \ge 10^{-1}$: dense (complete graphs, small knowledge graphs)
For AI: Real-world graphs used in ML are almost universally sparse:
- Citation networks: $\rho \approx 10^{-3}$ (Cora: 2708 nodes, 5429 edges)
- Social networks: $\rho \approx 10^{-7}$ to $10^{-5}$
- Molecular graphs: $\rho \approx 10^{-1}$ to $10^{0}$ (small but dense for their size)
- Knowledge graphs: $\rho \approx 10^{-8}$ to $10^{-6}$ (Freebase, Wikidata)
The sparsity of real-world graphs is the fundamental reason why sparse representations are preferred over dense matrices at scale.
3. Adjacency Matrix
3.1 Definition and Properties
Definition (Adjacency Matrix). For a simple undirected graph $G = (V, E)$ with vertex set $V = \{0, 1, \ldots, n-1\}$, the adjacency matrix $A \in \{0, 1\}^{n \times n}$ is defined by:
$$A_{ij} = \begin{cases} 1 & \text{if } \{i, j\} \in E \\ 0 & \text{otherwise} \end{cases}$$
Fundamental properties of the undirected adjacency matrix:
- Symmetry: $A = A^{\top}$, since $\{i, j\} \in E \iff \{j, i\} \in E$
- Zero diagonal: $A_{ii} = 0$ for all $i$ (no self-loops in a simple graph)
- Binary entries: $A_{ij} \in \{0, 1\}$
- Row sum = degree: $\sum_j A_{ij} = \deg(i)$
Worked example. Consider the "diamond" graph with vertices $V = \{0, 1, 2, 3\}$ and edges $E = \{\{0,1\}, \{0,2\}, \{1,2\}, \{1,3\}, \{2,3\}\}$:
0
/ \
1 - 2
\ /
3
$$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$
Degree check: row sums are $(2, 3, 3, 2)$, so $\deg = (2, 3, 3, 2)$. OK
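A minimal NumPy sketch (NumPy assumed available) that builds this matrix and checks the four properties above:

import numpy as np

# Diamond graph: edges {0-1, 0-2, 1-2, 1-3, 2-3}
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
A = np.zeros((n, n), dtype=np.int64)
for u, v in edges:
    A[u, v] = A[v, u] = 1          # undirected: set both entries

assert (A == A.T).all()            # symmetry
assert (np.diag(A) == 0).all()     # zero diagonal
print(A.sum(axis=1))               # row sums = degrees: [2 3 3 2]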
3.2 Directed and Weighted Variants
Directed adjacency matrix. For a directed graph (digraph) with directed edges $(u, v) \in E$:
$$A_{ij} = \begin{cases} 1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$
$A$ is generally not symmetric. Row $i$ encodes out-edges of $i$; column $j$ encodes in-edges of $j$.
- Out-degree of $i$: $\deg^{+}(i) = \sum_j A_{ij}$ (row sum)
- In-degree of $j$: $\deg^{-}(j) = \sum_i A_{ij}$ (column sum)
Weighted adjacency matrix. For a weighted graph with weight function $w : E \to \mathbb{R}$:
$$A_{ij} = \begin{cases} w(i, j) & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$
Common weight semantics:
- Similarity weights ($w > 0$, larger = more similar): social strength, correlation
- Distance weights ($w > 0$, smaller = closer): road networks, molecular bond lengths
- Signed weights ($w \in \mathbb{R}$): financial networks (positive = cooperation, negative = competition)
For AI: Transformer attention defines a weighted digraph where $A_{ij} = \mathrm{softmax}_j\!\left(q_i^{\top} k_j / \sqrt{d_k}\right)$ - a fully connected weighted directed graph over sequence positions. Each attention head defines a different graph on the same token set.
3.3 Self-Loops and Augmented Adjacency
In simple graphs $A_{ii} = 0$. However, GNNs often add self-loops to ensure each node receives its own features during aggregation. The augmented adjacency matrix is:
$$\tilde{A} = A + I$$
where $I$ is the $n \times n$ identity. This sets $\tilde{A}_{ii} = 1$ for all $i$, adding a self-loop at every vertex. The GCN (Kipf & Welling, 2017) uses $\tilde{A}$ in its propagation rule.
Effect on degree: Augmenting with self-loops increases every vertex's degree by 1. If $D$ is the degree matrix of $A$, then the degree matrix of $\tilde{A}$ is $\tilde{D} = D + I$.
3.4 The Degree Matrix and Laplacian
Definition (Degree Matrix). The degree matrix of $G$ is the diagonal matrix:
$$D = \mathrm{diag}\big(\deg(0), \deg(1), \ldots, \deg(n-1)\big)$$
so $D_{ii} = \deg(i)$ and $D_{ij} = 0$ for $i \neq j$. In matrix form: $D = \mathrm{diag}(A \mathbf{1})$.
Definition (Graph Laplacian). The combinatorial Laplacian of $G$ is:
$$L = D - A$$
Properties of $L$:
- Symmetric: $L = L^{\top}$ (for undirected graphs)
- Positive semidefinite: $L \succeq 0$, i.e., $x^{\top} L x \ge 0$ for all $x \in \mathbb{R}^n$
- Zero row sums: $L \mathbf{1} = \mathbf{0}$ (so $\lambda = 0$ is always an eigenvalue)
- Quadratic form: $x^{\top} L x = \sum_{\{i,j\} \in E} (x_i - x_j)^2$ - measures the "roughness" of $x$ on $G$
- Connectivity: $G$ is connected iff the second-smallest eigenvalue $\lambda_2 > 0$ (Fiedler value)
For the diamond graph above:
$$L = D - A = \begin{pmatrix} 2 & -1 & -1 & 0 \\ -1 & 3 & -1 & -1 \\ -1 & -1 & 3 & -1 \\ 0 & -1 & -1 & 2 \end{pmatrix}$$
Proof of the quadratic form identity (Property 4). We verify that $x^{\top} L x = \sum_{\{i,j\} \in E} (x_i - x_j)^2$ for any $x \in \mathbb{R}^n$. Expanding,
$$x^{\top} L x = x^{\top} D x - x^{\top} A x = \sum_i \deg(i)\, x_i^2 - 2 \sum_{\{i,j\} \in E} x_i x_j.$$
In the first sum, vertex $i$ contributes $x_i^2$ exactly $\deg(i)$ times - once per incident edge. Rearranging by edges:
$$x^{\top} L x = \sum_{\{i,j\} \in E} \left( x_i^2 + x_j^2 - 2 x_i x_j \right) = \sum_{\{i,j\} \in E} (x_i - x_j)^2.$$
This identity has a beautiful interpretation: $x^{\top} L x$ measures the total "variation" of the signal $x$ across edges of $G$. If $x$ assigns similar values to adjacent vertices (a "smooth" signal), $x^{\top} L x$ is small. If $x$ assigns very different values to connected vertices, $x^{\top} L x$ is large.
Proof of zero row sums (Property 3): $(L \mathbf{1})_i = \sum_j L_{ij} = D_{ii} - \sum_j A_{ij} = \deg(i) - \deg(i) = 0$.
So $L \mathbf{1} = \mathbf{0}$ always - the all-ones vector is in the null space of $L$, meaning $\lambda = 0$ is always an eigenvalue. For a connected graph, $\lambda = 0$ has multiplicity exactly 1 (the null space is spanned by $\mathbf{1}$). For a graph with $c$ connected components, $\lambda = 0$ has multiplicity $c$.
Worked numerical verification - diamond graph. Take the signal $x = (1, 0, 0, -1)^{\top}$.
Edge-sum form. The edges $\{0,1\}, \{0,2\}, \{1,2\}, \{1,3\}, \{2,3\}$ give squared differences:
- $\{0,1\}$: $(1 - 0)^2 = 1$
- $\{0,2\}$: $(1 - 0)^2 = 1$
- $\{1,2\}$: $(0 - 0)^2 = 0$
- $\{1,3\}$: $(0 - (-1))^2 = 1$
- $\{2,3\}$: $(0 - (-1))^2 = 1$
Sum $= 4$.
Direct matrix form. $L x = (2, 0, 0, -2)^{\top}$, so $x^{\top} L x = 1 \cdot 2 + 0 + 0 + (-1)(-2) = 4$. The two computations agree.
The quadratic form correctly evaluates to 4, confirming that vertices 0 and 3 (which have values $+1$ and $-1$ while being two hops apart) create variation spread across the four boundary edges.
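A short NumPy check of both computations (a sketch; $A$ is the diamond adjacency matrix from above):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
D = np.diag(A.sum(axis=1))
L = D - A

x = np.array([1.0, 0.0, 0.0, -1.0])
quad = x @ L @ x                                   # direct matrix form
edge_sum = sum((x[i] - x[j]) ** 2
               for i, j in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])
print(quad, edge_sum)                              # 4.0 4.0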
The full spectral analysis of $L$ - eigenvalues, Fiedler vector, spectral clustering - belongs to 04. Here we define $L$ as a matrix derived from the graph and establish its basic algebraic properties. See 04 Spectral Graph Theory for the complete treatment.
3.5 Normalised Adjacency
Many GNN architectures use normalised forms of $A$ to prevent numerical instability during iterative propagation. The most common normalisations are:
Symmetric normalisation (GCN, Kipf & Welling 2017):
$$A_{\mathrm{sym}} = D^{-1/2} A D^{-1/2}$$
This is symmetric and has eigenvalues in $[-1, 1]$. The GCN propagation rule with self-loops is:
$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$
where $\tilde{A} = A + I$ and $\tilde{D} = D + I$.
Row normalisation (random walk):
$$P = D^{-1} A$$
$P$ is the transition matrix of the random walk on $G$: $P_{ij}$ is the probability of moving from $i$ to $j$ in one step. Each row sums to 1. GraphSAGE's mean aggregation computes $P x$ implicitly.
Normalised Laplacian:
$$\mathcal{L} = I - D^{-1/2} A D^{-1/2}$$
Eigenvalues of $\mathcal{L}$ lie in $[0, 2]$; the GCN can be derived as a first-order polynomial filter on $\mathcal{L}$ (see 04).
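The three normalisations in NumPy (a sketch for the diamond graph; assumes every degree is non-zero):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)

D_inv_sqrt = np.diag(deg ** -0.5)
A_sym = D_inv_sqrt @ A @ D_inv_sqrt    # symmetric normalisation
P = np.diag(1.0 / deg) @ A             # random-walk normalisation
L_norm = np.eye(4) - A_sym             # normalised Laplacian

print(P.sum(axis=1))                   # rows of P sum to 1: [1. 1. 1. 1.]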
3.6 Space-Time Analysis
| Operation | Complexity | Notes |
|---|---|---|
| Space | $O(n^2)$ | Always - regardless of $m$ |
| Edge query | $O(1)$ | Direct array access |
| Neighbour enum of $u$ | $O(n)$ | Must scan full row |
| Add/remove edge | $O(1)$ | Set one entry |
| Add vertex | $O(n^2)$ | Rebuild matrix |
| Build from edge list | $O(n^2 + m)$ | Initialise + fill |
| Matrix-vector $Ax$ | $O(n^2)$ | Dense BLAS |
| Matrix power $A^k$ | $O(n^3)$ per step | Dense eigendecomposition or repeated multiply |
Break-even point. For the adjacency matrix to be memory-competitive with CSR, the fill ratio must satisfy roughly $\rho \gtrsim 1/2$ (for 32-bit floats and 32-bit indices: dense costs $4n^2$ bytes, CSR about $8 \cdot \mathrm{nnz}$). Real-world graphs almost always have $\rho \ll 10^{-2}$, making sparse formats strictly superior at scale.
3.7 Matrix Operations on Graphs
The adjacency matrix bridges graph theory and linear algebra, enabling powerful matrix-based analyses:
Walk counting: $(A^k)_{ij}$ counts walks of length $k$ from $i$ to $j$ (proved in 01). This gives a matrix-algebraic approach to: triangle counting ($\mathrm{tr}(A^3)/6$), reachability (the non-zero pattern of $(I + A)^{n-1}$, e.g., for DAGs), and PageRank.
PageRank (Page et al., 1999): The stationary distribution $\pi$ of the damped random walk solves:
$$\pi = \alpha P^{\top} \pi + \frac{1 - \alpha}{n} \mathbf{1}$$
where $\alpha$ (typically 0.85) is the damping factor and $P = D^{-1} A$. Solving this requires repeated sparse matrix-vector multiplication - justifying CSR over dense $A$.
Spectral decomposition: $A = Q \Lambda Q^{\top}$ (for symmetric $A$) reveals the graph's eigenspectrum - covered fully in 04.
For AI: The GCN layer computes $H' = \hat{A} H W$. With $n$ nodes, $d$ input features, and $h$ hidden units, this is a sparse-dense matrix multiplication - the dominant computational cost. Efficient sparse implementations (cuSPARSE, PyG's torch_sparse) are essential.
3.8 When to Use the Adjacency Matrix
Use the adjacency matrix when:
- $n \lesssim 10^4$ and the graph is dense ($\rho \gtrsim 10^{-1}$)
- You need $O(1)$ edge existence queries (e.g., building complement graphs)
- You need matrix eigendecomposition or matrix powers
- The graph is static (no insertions/deletions)
- You're using small molecular graphs in a GNN (batched dense matmul is GPU-efficient)
Avoid the adjacency matrix when:
- $n \gtrsim 10^5$ and/or $\rho \ll 10^{-2}$ (sparse graphs at scale)
- Memory is constrained
- You need fast neighbour enumeration for large-degree nodes
- The graph is dynamic (edges added/removed frequently)
3.9 Worked Example: GCN Propagation with the Adjacency Matrix
To ground the theory, let us trace one GCN layer for the diamond graph. We have $n = 4$, feature dimension $d = 4$ (identity features), and output dimension $h = 2$.
Step 1: Construct $\tilde{A} = A + I$:
$$\tilde{A} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$
Step 2: Compute $\tilde{D}$ ($\tilde{D}_{ii} = \tilde{d}_i$, i.e., original degree + 1).
Original degrees: $(2, 3, 3, 2)$, so $\tilde{D} = \mathrm{diag}(3, 4, 4, 3)$.
Step 3: Compute $\tilde{D}^{-1/2} = \mathrm{diag}\big(1/\sqrt{3},\ 1/2,\ 1/2,\ 1/\sqrt{3}\big)$.
Step 4: Compute the normalised propagation matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.
The $(i, j)$ entry is $\hat{A}_{ij} = \tilde{A}_{ij} / \sqrt{\tilde{d}_i \tilde{d}_j}$, so for example:
$$\hat{A}_{00} = \frac{1}{3}, \qquad \hat{A}_{01} = \frac{1}{\sqrt{3 \cdot 4}} \approx 0.289, \qquad \hat{A}_{11} = \frac{1}{4}$$
Step 5: Apply $H' = \hat{A} X W$.
With $X = I$ (identity features) and $W \in \mathbb{R}^{4 \times 2}$ (learned weights), the output is $H' = \hat{A} W$ - each row of $H'$ is a normalised-adjacency-weighted mix of the rows of $W$. Nodes 1 and 2 (highest degree) receive the most averaged signal; nodes 0 and 3 (boundary nodes) receive less mixing.
Key insight: The normalisation ensures that aggregating over a high-degree hub does not blow up the magnitude of the resulting embedding, because each contribution is divided by $\sqrt{\tilde{d}_i \tilde{d}_j}$.
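The whole trace in NumPy (a sketch; the weight matrix W here is random, standing in for learned parameters):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = A.shape[0]

A_tilde = A + np.eye(n)                    # Step 1: add self-loops
d_tilde = A_tilde.sum(axis=1)              # Step 2: augmented degrees (3, 4, 4, 3)
D_inv_sqrt = np.diag(d_tilde ** -0.5)      # Step 3
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # Step 4: normalised propagation matrix

rng = np.random.default_rng(0)
X = np.eye(n)                              # identity features, d = 4
W = rng.standard_normal((4, 2))            # stand-in for learned weights, h = 2
H = np.maximum(A_hat @ X @ W, 0)           # Step 5: one GCN layer with ReLU
print(H.shape)                             # (4, 2)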
4. Adjacency List
4.1 Definition and Python Implementation
Definition (Adjacency List). An adjacency list representation stores, for each vertex $u$, a list of its neighbours: $N(u) = \{v : \{u, v\} \in E\}$.
The full adjacency list is the mapping $u \mapsto N(u)$, a function $V \to \mathcal{P}(V)$ where $\mathcal{P}(V)$ is the power set of $V$. In Python, this is naturally a dictionary of lists:
# Undirected graph: 0-1, 0-2, 1-2, 1-3, 2-3
adj = {
0: [1, 2],
1: [0, 2, 3],
2: [0, 1, 3],
3: [1, 2]
}
For a graph with $n$ vertices and $m$ edges, the total storage is $O(n + m)$ (each undirected edge appears in two lists, giving $2m$ list entries).
Python implementation using defaultdict:
from collections import defaultdict
class Graph:
def __init__(self, directed=False):
self.adj = defaultdict(list)
self.directed = directed
self.n_vertices = 0
self.n_edges = 0
def add_edge(self, u, v, weight=None):
entry = (v, weight) if weight is not None else v
self.adj[u].append(entry)
if not self.directed:
rev = (u, weight) if weight is not None else u
self.adj[v].append(rev)
self.n_edges += 1
self.n_vertices = max(self.n_vertices, u + 1, v + 1)
def neighbours(self, u):
return self.adj[u]
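Usage on the diamond graph (a quick sanity check of the class above):

g = Graph(directed=False)
for u, v in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]:
    g.add_edge(u, v)

print(g.n_vertices, g.n_edges)   # 4 5
print(g.neighbours(1))           # [0, 2, 3]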
4.2 Directed and Weighted Variants
Directed adjacency list. Each vertex stores only its out-neighbours $N^{+}(v) = \{u : (v, u) \in E\}$. To support in-neighbour queries, maintain a separate reverse adjacency list:
adj_out[v] # out-neighbours of v (for forward pass)
adj_in[v] # in-neighbours of v (for backward pass / reverse BFS)
For AI: GNNs often need both directions. In PyG, to_undirected() generates bidirectional edges, and add_self_loops=True adds self-loops during convolution. Message passing typically uses adj_out (source -> destination aggregation).
Weighted adjacency list. Each entry stores a (neighbour, weight) pair:
adj = {
0: [(1, 0.5), (2, 1.2)],
1: [(0, 0.5), (2, 0.8), (3, 2.1)],
...
}
This supports weighted BFS/Dijkstra without conversion overhead.
4.3 Adjacency Set: O(1) Edge Lookup
The standard adjacency list has $O(\deg(u))$ edge lookup. Replacing each list with a hash set yields $O(1)$ average edge lookup:
adj_set = {
0: {1, 2},
1: {0, 2, 3},
2: {0, 1, 3},
3: {1, 2}
}
# O(1) average edge query:
exists = 3 in adj_set[1] # True
Trade-offs of adjacency set vs. list:
| Property | Adjacency List | Adjacency Set |
|---|---|---|
| Edge query | $O(\deg(u))$ | $O(1)$ avg |
| Neighbour iteration | $O(\deg(u))$ | $O(\deg(u))$ |
| Memory | Lower (list overhead) | Higher (hash table) |
| Ordered neighbours | Yes | No |
| Suitable for Dijkstra | Yes (sorted heaps) | Limited |
For graphs where edge existence queries dominate (e.g., link prediction, graph complement computation), adjacency sets are superior.
4.4 Space-Time Analysis
| Operation | Adjacency List | Adjacency Set |
|---|---|---|
| Space | $O(n + m)$ | $O(n + m)$ |
| Edge query | $O(\deg(u))$ | $O(1)$ avg |
| All neighbours of $u$ | $O(\deg(u))$ | $O(\deg(u))$ |
| Add edge | $O(1)$ | $O(1)$ avg |
| Remove edge | $O(\deg(u))$ | $O(1)$ avg |
| Add vertex | $O(1)$ | $O(1)$ |
| Build from edge list | $O(n + m)$ | $O(n + m)$ |
| Matrix-vector $Ax$ | $O(n + m)$ | $O(n + m)$ |
Critical advantage over adjacency matrix: For sparse graphs with $m \ll n^2$, the adjacency list reduces memory from $O(n^2)$ to $O(n + m)$ - a factor of roughly $n^2/m$. For $n = 10^6$ and $m \approx 10^6$, this is the difference between 1 TB and a few MB.
4.5 Cache Performance and Memory Layout
The Python dictionary-of-lists adjacency representation, while correct and flexible, has poor cache locality. Each list is a separate Python object allocated at a random memory address, causing cache misses during graph traversal.
Cache-friendly alternative: flat CSR arrays (see 6.2). CSR packs all adjacency information into two contiguous arrays, dramatically improving cache hit rates for algorithms that traverse many edges.
Memory layout comparison (1M edges):
Python dict-of-lists: ~80 bytes per edge (Python overhead) -> ~80 MB
NumPy CSR arrays: ~8 bytes per edge (int64) -> ~8 MB
For production GNN code, the adjacency list (Python dict) is a prototyping tool; CSR or COO formats are used in actual training.
4.6 When to Use Adjacency Lists
Use adjacency lists when:
- Medium-scale sparse graphs ($n$ up to a few million)
- The primary operations are neighbour enumeration and graph traversal (BFS, DFS)
- The graph is dynamic (edges inserted/deleted frequently)
- You're prototyping an algorithm in Python (NetworkX uses this internally)
- Weighted graphs with varying edge weights
Avoid adjacency lists when:
- You need fast edge existence queries (use adjacency set or adjacency matrix)
- You need matrix operations (convert to CSR or adjacency matrix)
- You're targeting GPU computation (convert to COO / edge_index)
4.7 Adjacency List Implementation Patterns
Beyond the basic Python dictionary, several implementation patterns arise in practice:
Pattern 1 - CSR as a flattened adjacency list. The CSR format (6.2) can be viewed as a memory-efficient adjacency list: col_idx[row_ptr[i]:row_ptr[i+1]] is exactly the neighbour list of vertex $i$, stored in a flat array for cache efficiency. The adjacency list and CSR are conceptually equivalent; CSR is the production-grade, cache-friendly version, as the sketch below shows.
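A minimal sketch of this equivalence (NumPy assumed), reading a neighbour list straight out of the diamond graph's CSR arrays from 6.2:

import numpy as np

col_idx = np.array([1, 2, 0, 2, 3, 0, 1, 3, 1, 2])
row_ptr = np.array([0, 2, 5, 8, 10])

def neighbours(i):
    # contiguous slice = neighbour list of vertex i
    return col_idx[row_ptr[i]:row_ptr[i + 1]]

print(neighbours(1))   # [0 2 3]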
Pattern 2 - Sorted adjacency list. When edge queries are common but an adjacency set is too memory-intensive, keep each neighbour list sorted and use binary search:
import bisect
class SortedAdjList:
def __init__(self, n):
self.adj = [[] for _ in range(n)]
def add_edge(self, u, v):
bisect.insort(self.adj[u], v)
bisect.insort(self.adj[v], u)
def has_edge(self, u, v) -> bool:
idx = bisect.bisect_left(self.adj[u], v)
return idx < len(self.adj[u]) and self.adj[u][idx] == v
Edge lookup is $O(\log \deg(u))$ via binary search. Insertion is $O(\deg(u))$ to shift elements. This is a good middle ground between the adjacency list ($O(\deg)$ lookup) and the adjacency set ($O(1)$ lookup with higher memory).
Pattern 3 - Degree-aware pre-allocation. When the degree sequence is known in advance (e.g., loading from a file), pre-allocate arrays of the correct size to avoid repeated resizing:
import numpy as np

def build_adj_preallocated(n, edges):
    deg = [0] * n
    for u, v in edges:                  # first pass: count degrees
        deg[u] += 1; deg[v] += 1
    adj = [np.empty(deg[i], dtype=np.int32) for i in range(n)]
    ptr = [0] * n
    for u, v in edges:                  # second pass: fill pre-sized arrays
        adj[u][ptr[u]] = v; ptr[u] += 1
        adj[v][ptr[v]] = u; ptr[v] += 1
    return adj
This avoids Python list resizing overhead and produces NumPy arrays for each row - enabling vectorised operations per neighbourhood.
For AI - GraphSAGE neighbour sampling. GraphSAGE (Hamilton et al., 2017) samples a fixed number of neighbours per node at each layer. The implementation uses the adjacency list directly:
import random

def sample_neighbors(adj, node, k):
    nbrs = adj[node]
    if len(nbrs) <= k:
        return nbrs                   # fewer than k neighbours: keep them all
    return random.sample(nbrs, k)     # uniform sample without replacement

This is uniform sampling without replacement. DGL's NeighborSampler wraps the same operation in a C++ implementation over CSR for scalability.
5. Edge List and COO Format
5.1 Edge List
Definition (Edge List). An edge list is the simplest possible graph representation: a sequence of all edges in $E$.
For an undirected unweighted graph: $[(u_1, v_1), (u_2, v_2), \ldots, (u_m, v_m)]$
For a weighted graph: $[(u_1, v_1, w_1), (u_2, v_2, w_2), \ldots, (u_m, v_m, w_m)]$
The edge list stores only what exists - no zero entries, no empty adjacency slots. It is the most compact representation for pure storage ($O(m)$ space) and the most natural format for file I/O.
Standard file formats using edge lists:
- .edges / .txt: One edge per line, space-separated (u v or u v w)
- Matrix Market .mtx: Standard sparse matrix exchange format (used by SNAP, KONECT)
- GraphML: XML-based, supports rich metadata
- Parquet: Columnar binary format for billion-edge datasets (used by DGL, OGB)
For AI: The Open Graph Benchmark (OGB) distributes all datasets as edge lists in CSV/Parquet. Every GNN training pipeline starts by loading an edge list and converting to the working representation.
5.2 COO Format
Definition (COO - Coordinate List Format). The COO format represents a sparse matrix (and hence a graph) as three parallel arrays:
$$\texttt{row}[k] = u_k, \qquad \texttt{col}[k] = v_k, \qquad \texttt{data}[k] = w_k$$
for each edge $k = 0, \ldots, m - 1$.
Graph: 0->1 (w=2.0), 0->3 (w=1.0), 1->2 (w=0.5), 2->3 (w=3.0)
COO:
row = [0, 0, 1, 2]
col = [1, 3, 2, 3]
data = [2.0, 1.0, 0.5, 3.0]
COO is essentially an indexed edge list with separate arrays for rows, columns, and values. It admits duplicate entries (multiple values for the same $(i, j)$ pair are summed when converting to CSR), which is useful for accumulating gradients during GNN backpropagation; the sketch after this list demonstrates it.
Key properties:
- $O(m)$ space
- $O(m)$ edge query (scan all entries)
- $O(1)$ edge insertion (append to all three arrays)
- Conversion to CSR is $O(m \log m)$ (sort by row) or $O(n + m)$ (counting sort)
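The duplicate-summing behaviour in SciPy (a small sketch of the example graph above, with one duplicated entry):

import numpy as np
from scipy import sparse

row = np.array([0, 0, 1, 2, 2])       # entry (2, 3) appears twice
col = np.array([1, 3, 2, 3, 3])
data = np.array([2.0, 1.0, 0.5, 1.0, 2.0])

A = sparse.coo_matrix((data, (row, col)), shape=(4, 4)).tocsr()
print(A[2, 3])   # 3.0 - duplicates summed during conversion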
5.3 PyTorch Geometric edge_index
PyTorch Geometric (PyG) uses a specialised COO format for graph edges:
Definition (edge_index). The edge_index tensor is a torch.Tensor of shape $(2, m)$ and dtype torch.long:
Row 0 contains source nodes; row 1 contains destination nodes. Each column $k$ represents one directed edge $(\texttt{edge\_index}[0, k],\ \texttt{edge\_index}[1, k])$.
# Diamond graph: 0-1, 0-2, 1-2, 1-3, 2-3 (undirected -> bidirectional)
edge_index = torch.tensor([
[0, 1, 0, 2, 1, 2, 1, 3, 2, 3], # sources
[1, 0, 2, 0, 2, 1, 3, 1, 3, 2] # targets
], dtype=torch.long)
Why edge_index and not a sparse matrix? GPU tensor operations are most efficient on contiguous memory with known shapes. The $(2, m)$ layout:
- Fits in a single contiguous GPU tensor
- Supports batched operations across multiple graphs via the batch vector
- Enables scatter_add / scatter_mean for message aggregation without sparse matrix overhead (see the sketch after this list)
- Works natively with PyTorch's autograd for differentiable graph operations
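A minimal sketch of the scatter pattern using plain PyTorch's index_add_ (PyG's scatter utilities behave similarly):

import torch

# Diamond graph, bidirectional edges (from the example above)
edge_index = torch.tensor([
    [0, 1, 0, 2, 1, 2, 1, 3, 2, 3],
    [1, 0, 2, 0, 2, 1, 3, 1, 3, 2]
])
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])   # one feature per node

src, dst = edge_index
messages = x[src]                                        # gather source features, shape (m, 1)
out = torch.zeros_like(x).index_add_(0, dst, messages)   # sum per destination node
print(out.squeeze())                                     # tensor([5., 8., 7., 5.])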
Edge attributes. Optional edge features are stored as a separate edge_attr tensor of shape $(m, d_e)$:
edge_attr = torch.tensor([
[2.0], # weight of edge (0->1)
[1.0], # weight of edge (0->3)
...
], dtype=torch.float) # shape: (m, d_e)
5.4 The PyG Data Object
PyG unifies all graph components into a torch_geometric.data.Data object:
from torch_geometric.data import Data
data = Data(
x = node_features, # shape: (n, d_v) - node feature matrix
edge_index = edge_index, # shape: (2, m) - COO edge list
edge_attr = edge_attr, # shape: (m, d_e) - edge features (optional)
y = labels, # shape: (n,) or (1,) - node or graph labels
pos = positions, # shape: (n, 3) - spatial coordinates (optional)
)
print(data.num_nodes) # n
print(data.num_edges) # m
print(data.is_undirected()) # True if edge_index is symmetric
Batching multiple graphs. When training on a dataset of graphs (molecular property prediction, graph classification), PyG concatenates multiple Data objects into a single Batch:
# Batch of 3 graphs with n1, n2, n3 nodes
batch.x # shape: (n1+n2+n3, d_v)
batch.edge_index # shape: (2, m1+m2+m3), indices shifted per graph
batch.batch # shape: (n1+n2+n3,), maps each node to its graph index
The batch vector enables graph-level pooling: global_mean_pool(batch.x, batch.batch) computes one embedding per graph.
5.5 GPU-Friendly Memory Layout
The choice of COO / edge_index over sparse matrix formats is motivated by GPU architecture:
GPU memory hierarchy:
- L1 cache: ~32 KB per streaming multiprocessor (SM)
- L2 cache: ~40 MB shared
- Global memory: 24-80 GB, high bandwidth (900 GB/s on H100)
Sparse matrix-vector multiplication (SpMV) in CSR format requires indirect memory access (following row pointers then column indices), which causes cache misses. The scatter_add pattern used in PyG's message passing accesses edge_index sequentially, enabling prefetching and coalesced memory access.
For the V100/A100/H100: NVIDIA's cuSPARSE library provides efficient CSR SpMV. PyG's torch_sparse package provides hybrid COO/CSR operations optimised for GNN message passing. At very large scale (billions of edges), systems like DGL use CSR-based graph sampling with pinned memory for CPU-GPU transfer.
5.6 When to Use Edge Lists / COO
Use edge lists / COO when:
- Loading graph data from files (edge lists are the universal exchange format)
- Building a graph incrementally (append edges one by one)
- GPU-based GNN training with PyTorch Geometric
- Batch processing multiple small graphs
- When you need to convert to another format as a first step
Avoid when:
- You need fast neighbour enumeration (convert to CSR first)
- You need edge existence queries (COO is $O(m)$; use adjacency matrix or set)
- Memory is constrained and $m$ is large (COO stores $2m$ indices vs. CSR's $m + n$, nearly a 2x overhead when $m \gg n$)
6. Sparse Matrix Formats
6.1 The Sparsity Problem
A sparse matrix is one with "sufficiently many" zero entries that storing and operating on zeros wastes time and space. For graphs, the adjacency matrix has $n^2$ entries but only $2m$ non-zeros (for undirected $G$). The fill ratio $\rho$ (2.3) measures how non-sparse the matrix is.
Example - CiteSeer citation network:
- $n = 3327$ nodes, $m = 4732$ edges
- Dense adjacency: $n^2 \approx 1.1 \times 10^7$ entries, ~88 MB (float64)
- Non-zeros: $2m = 9464$ entries
- Fill ratio: $\rho \approx 8.5 \times 10^{-4}$ - 99.9% of entries are zero
- Sparse storage (CSR): ~76 KB - over 1000x smaller
Sparse formats exploit this structure: they store only non-zero entries and their indices.
6.2 CSR: Compressed Sparse Row
CSR (Compressed Sparse Row) is the standard sparse matrix format for row-oriented operations. It uses three arrays:
- data[k] - value of the $k$-th non-zero entry
- col_idx[k] - column index of the $k$-th non-zero entry
- row_ptr[i] - index in data/col_idx where row $i$ begins
Construction. For row $i$, its non-zero entries are data[row_ptr[i] : row_ptr[i+1]] and their columns are col_idx[row_ptr[i] : row_ptr[i+1]].
Worked example - diamond graph adjacency matrix:
Non-zeros by row:
Row 0: A[0,1]=1, A[0,2]=1
Row 1: A[1,0]=1, A[1,2]=1, A[1,3]=1
Row 2: A[2,0]=1, A[2,1]=1, A[2,3]=1
Row 3: A[3,1]=1, A[3,2]=1
CSR:
data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
col_idx = [1, 2, 0, 2, 3, 0, 1, 3, 1, 2]
row_ptr = [0, 2, 5, 8, 10]
Reading row 1: data[row_ptr[1]:row_ptr[2]] = data[2:5] = [1, 1, 1] at columns col_idx[2:5] = [0, 2, 3]. Correct: vertex 1 connects to 0, 2, 3. OK
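Verifying these arrays with SciPy (a sketch; csr_matrix exposes them as data, indices, and indptr):

import numpy as np
from scipy import sparse

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
A_csr = sparse.csr_matrix(A)

print(A_csr.indptr)    # [ 0  2  5  8 10]        <- row_ptr
print(A_csr.indices)   # [1 2 0 2 3 0 1 3 1 2]   <- col_idx
print(A_csr.data)      # [1 1 1 1 1 1 1 1 1 1]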
Space: $O(n + m)$ - specifically $n + 1$ integers for row_ptr, $\mathrm{nnz} = 2m$ integers for col_idx, and $\mathrm{nnz}$ values for data (absent for unweighted graphs).
| CSR Operation | Complexity | Notes |
|---|---|---|
| Space | $O(n + m)$ | Optimal for sparse |
| Row neighbours | $O(\deg(i))$ | Contiguous slice |
| Edge query | $O(\deg(i))$ or $O(\log \deg(i))$ | Linear scan, or binary search if sorted |
| SpMV | $O(n + m)$ | Industry standard |
| Add edge | $O(m)$ amortised | Must shift entries |
| Convert from COO | $O(m \log m)$ | Sort by row index |
6.3 CSC: Compressed Sparse Column
CSC (Compressed Sparse Column) is the column-oriented analogue of CSR:
- data[k] - value of the $k$-th non-zero
- row_idx[k] - row index of the $k$-th non-zero
- col_ptr[j] - index in data/row_idx where column $j$ begins
CSC excels at column operations: accessing all in-neighbours of vertex $j$ (i.e., all $i$ such that $A_{ij} \neq 0$) takes $O(\deg^{-}(j))$ - a single contiguous slice of row_idx (see the sketch below).
CSC vs CSR for directed GNNs: In message-passing GNNs, aggregating messages from the in-neighbours of $v$ is the forward-pass operation. CSC organises data by destination node, making this $O(\deg^{-}(v))$ per node. DGL internally uses a CSR/CSC pair for efficient forward and backward passes.
Relationship: CSC of $A$ = CSR of $A^{\top}$. Converting between CSR and CSC is therefore equivalent to matrix transposition, which is $O(n + m)$ for sparse formats.
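In-neighbour access via CSC in SciPy (a sketch on the small directed graph from 5.2; csc_matrix stores indptr/indices by column):

import numpy as np
from scipy import sparse

# Directed graph: 0->1, 0->3, 1->2, 2->3
row = np.array([0, 0, 1, 2])
col = np.array([1, 3, 2, 3])
A_csc = sparse.csc_matrix((np.ones(4), (row, col)), shape=(4, 4))

j = 3
in_nbrs = A_csc.indices[A_csc.indptr[j]:A_csc.indptr[j + 1]]
print(in_nbrs)   # [0 2] - all i with an edge i -> 3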
6.4 LIL: List of Lists
LIL (List of Lists) stores the matrix as a list of rows, where each row is itself a sorted list of (col_idx, value) pairs:
# LIL representation of diamond graph (4 rows)
lil = [
[(1, 1.0), (2, 1.0)], # row 0
[(0, 1.0), (2, 1.0), (3, 1.0)], # row 1
[(0, 1.0), (1, 1.0), (3, 1.0)], # row 2
[(1, 1.0), (2, 1.0)] # row 3
]
Advantages of LIL:
- Efficient row slicing: access row $i$ in $O(1)$, then iterate in $O(\deg(i))$
- Efficient incremental construction: insert an entry with an $O(\log \deg)$ binary search plus an $O(\deg)$ shift
- Good for iterative graph construction where edges are added one by one
Disadvantages:
- Inefficient column access: must scan every row, $O(n + m)$
- No cache-friendly layout (scattered Python objects)
- Must convert to CSR before matrix operations
When to use LIL: When building a sparse graph representation incrementally from a stream of edges. Use scipy.sparse.lil_matrix for construction, then convert to CSR for computation.
6.5 DOK: Dictionary of Keys
DOK stores only non-zero entries in a Python dictionary keyed by (row, col) tuples:
dok = {
(0, 1): 1.0, (0, 2): 1.0,
(1, 0): 1.0, (1, 2): 1.0, (1, 3): 1.0,
(2, 0): 1.0, (2, 1): 1.0, (2, 3): 1.0,
(3, 1): 1.0, (3, 2): 1.0
}
Advantages: $O(1)$ average entry access, insertion, and deletion - ideal for sparse updates and random access patterns.
Disadvantages: High memory overhead (Python dict stores hash, key, value per entry - ~200 bytes vs. ~8 bytes in CSR). Not suitable for large graphs.
When to use DOK: Constructing small sparse matrices with frequent random-access updates (e.g., building a proximity matrix from $k$-NN search results).
6.6 Sparse Matrix-Vector Multiplication
The key numerical operation for graph algorithms and GNNs is sparse matrix-vector multiplication (SpMV): compute $y = Ax$ where $A$ is sparse.
CSR SpMV algorithm:
import numpy as np

def spmv_csr(data, col_idx, row_ptr, x):
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):                             # one output entry per row
        for k in range(row_ptr[i], row_ptr[i+1]):  # non-zeros of row i
            y[i] += data[k] * x[col_idx[k]]
    return y
Time: $O(n + \mathrm{nnz})$ - each non-zero contributes exactly one multiply-add.
Worked trace of CSR SpMV - diamond graph, $x = (1, 2, 3, 4)^{\top}$.
CSR arrays (from 6.2):
data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
col_idx = [1, 2, 0, 2, 3, 0, 1, 3, 1, 2]
row_ptr = [0, 2, 5, 8, 10]
Algorithm trace:
i=0: k in [0,2):
k=0: y[0] += data[0]*x[col_idx[0]] = 1*x[1] = 1*2 = 2
k=1: y[0] += data[1]*x[col_idx[1]] = 1*x[2] = 1*3 = 3
-> y[0] = 5
i=1: k in [2,5):
k=2: y[1] += 1*x[0] = 1*1 = 1
k=3: y[1] += 1*x[2] = 1*3 = 3
k=4: y[1] += 1*x[3] = 1*4 = 4
-> y[1] = 8
i=2: k in [5,8):
k=5: y[2] += 1*x[0] = 1
k=6: y[2] += 1*x[1] = 2
k=7: y[2] += 1*x[3] = 4
-> y[2] = 7
i=3: k in [8,10):
k=8: y[3] += 1*x[1] = 2
k=9: y[3] += 1*x[2] = 3
-> y[3] = 5
Result: $y = (5, 8, 7, 5)^{\top}$.
Verification: Row 1 of $A$ is $(1, 0, 1, 1)$, so $y_1 = x_0 + x_2 + x_3 = 1 + 3 + 4 = 8$. OK
Equivalently, $y_i = \sum_{j \in N(i)} x_j$ - the SpMV computes the sum of feature values over neighbours. This is exactly the sum-aggregation in GNNs without normalisation. The GCN adds the normalisation weights $\hat{A}_{ij}$ as the data array entries (instead of 1s).
GNN message passing as SpMV. The GCN aggregation step $\hat{A} H$ (where $\hat{A} \in \mathbb{R}^{n \times n}$ is sparse and $H \in \mathbb{R}^{n \times d}$ is dense) is a sparse-dense matrix multiply (SpMM). It generalises SpMV from vectors to matrices:
$$(\hat{A} H)_{i,:} = \sum_{j} \hat{A}_{ij}\, H_{j,:}$$
Modern GPU libraries (cuSPARSE, torch_sparse) implement SpMM with significant optimisation: tiling, register blocking, and warp-level parallelism.
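A minimal SpMM sketch with PyTorch's built-in COO sparse tensor (torch.sparse.mm; the diamond adjacency again):

import torch

indices = torch.tensor([[0, 0, 1, 1, 1, 2, 2, 2, 3, 3],
                        [1, 2, 0, 2, 3, 0, 1, 3, 1, 2]])
values = torch.ones(10)
A = torch.sparse_coo_tensor(indices, values, (4, 4))

H = torch.arange(8.0).reshape(4, 2)   # dense feature matrix, d = 2
print(torch.sparse.mm(A, H))          # sparse-dense product, shape (4, 2)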
6.7 SciPy Sparse API
SciPy provides a comprehensive sparse matrix library (scipy.sparse) with seven formats (COO, CSR, CSC, LIL, DOK, DIA, BSR):
from scipy import sparse
# Build from COO
A_coo = sparse.coo_matrix((data, (row, col)), shape=(n, n))
# Convert to CSR (most common for computation)
A_csr = A_coo.tocsr()
# Convert to CSC
A_csc = A_csr.tocsc()
# LIL for incremental construction
A_lil = sparse.lil_matrix((n, n))
A_lil[0, 1] = 1.0
A_csr = A_lil.tocsr() # convert when done building
# SpMV
x = np.random.randn(n)
y = A_csr @ x # optimised sparse C kernel (CSR SpMV)
# Eigenvalues (interface to ARPACK)
from scipy.sparse.linalg import eigsh
eigenvalues, eigenvectors = eigsh(A_csr, k=6, which='SM')
Conversion cost summary:
| From -> To | Cost | Method |
|---|---|---|
| COO -> CSR | $O(m \log m)$ | .tocsr() |
| CSR -> COO | $O(m)$ | .tocoo() |
| CSR -> CSC | $O(n + m)$ | .tocsc() |
| LIL -> CSR | $O(n + m)$ | .tocsr() |
| DOK -> COO | $O(m)$ | .tocoo() |
6.8 Format Comparison Table
SPARSE FORMAT COMPARISON
========================================================================
Format Space EdgeQuery RowAccess ColAccess Insert Best Use
-------------------------------------------------------------------------
CSR O(n+m) O(deg) O(deg) O(m) Slow SpMV, GNN
CSC O(n+m) O(deg) O(m) O(deg) Slow SpMM cols
COO O(m) O(m) O(m) O(m) O(1) I/O, PyG
LIL O(n+m) O(deg) O(deg) Slow O(deg) Build
DOK O(m) O(1)avg Slow Slow O(1) Updates
Dense    O(n^2)   O(1)       O(n)       O(n)       O(1)     n <= 10^4
========================================================================
6.9 Benchmark: Real Memory Usage
To make the trade-offs concrete, consider a Cora-scale citation graph ($n = 2708$, $m = 5429$):
| Format | Arrays stored | Memory | Relative |
|---|---|---|---|
| Dense float32 | $n^2$ values | 29.3 MB | 1x |
| Dense bool (bit-packed) | $n^2$ bits | 0.92 MB | 0.031x |
| CSR int32 (unweighted) | $(n + 1) + 2m$ ints | 54 KB | 0.002x |
| COO int32 | $2m$ ints (each edge once) | 43 KB | 0.0015x |
| PyG edge_index int64 | $2 \times 2m$ ints (both directions) | 174 KB | 0.006x |
CSR gives a 540x memory reduction over dense float32. For billion-edge graphs (OGB-Papers100M: $n \approx 1.1 \times 10^8$, $m \approx 1.6 \times 10^9$), the dense matrix would require $\sim 5 \times 10^{16}$ bytes - physically impossible - while CSR uses $\sim 30$ GB, fitting on a single high-memory node.
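A quick back-of-envelope script reproducing the table's byte counts (a sketch; sizes exclude container overhead):

n, m = 2708, 5429   # Cora-scale

sizes = {
    "dense float32":     4 * n * n,            # 29.3 MB
    "dense bool (bits)": n * n // 8,           # 0.92 MB
    "CSR int32":         4 * ((n + 1) + 2 * m),# 54 KB (row_ptr + col_idx)
    "COO int32":         4 * (2 * m),          # 43 KB (each edge stored once)
    "edge_index int64":  8 * (2 * 2 * m),      # 174 KB (bidirectional, int64)
}
for name, b in sizes.items():
    print(f"{name:18s} {b / 1e6:8.3f} MB")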
Throughput comparison (SpMV, single CPU core, Cora graph):
| Format | SpMV time (ms) | Throughput (GFLOPS) |
|---|---|---|
| Dense (NumPy BLAS) | 0.8 | 0.02 |
| CSR (SciPy) | 0.04 | 0.27 |
| COO (manual) | 0.05 | 0.22 |
Dense SpMV wastes 99.9% of its operations on zero entries. CSR SpMV achieves roughly 13x higher effective throughput (0.27 vs 0.02 GFLOPS) by operating only on the non-zeros.