Theory Notebook
Converted from theory.ipynb for web reading.
§11.02 Graph Representations — Theory Notebook
"The difference between a good algorithm and a great one often comes down to the choice of data structure."
Companion to notes.md. Every representation is implemented, analysed, and connected to GNN frameworks.
Sections
- Setup
- Adjacency Matrix — construction, properties, Laplacian
- Degree Matrix, Laplacian, and Normalised Adjacency
- GCN Propagation Step (worked example)
- Adjacency List — implementations and adjacency set
- Edge List and COO Format
- PyG edge_index format
- Sparse Matrix Formats: CSR, CSC, LIL, DOK
- SpMV: worked trace and benchmark
- Incidence Matrix and L = BB^T
- Representation Conversions
- Heterogeneous and Temporal Graphs
- Choosing a Representation
- Memory Benchmarks
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False
mpl.rcParams.update({
    "figure.figsize": (10, 6),
    "figure.dpi": 120,
    "font.size": 13,
    "axes.titlesize": 15,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
    "legend.framealpha": 0.85,
    "lines.linewidth": 2.0,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "savefig.bbox": "tight",
    "savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
import collections, time
try:
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    HAS_MPL = True
except ImportError:
    HAS_MPL = False
try:
    import networkx as nx
    HAS_NX = True
except ImportError:
    HAS_NX = False
COLORS = {
    'primary': '#0077BB',
    'secondary': '#EE7733',
    'tertiary': '#009988',
    'error': '#CC3311',
    'neutral': '#555555',
    'highlight': '#EE3377',
}
np.set_printoptions(precision=4, suppress=True)
np.random.seed(42)
print('Setup complete. NetworkX:', HAS_NX, ' Matplotlib:', HAS_MPL)
2. Adjacency Matrix
The adjacency matrix is the most fundamental graph representation. For an undirected simple graph, $A \in \{0,1\}^{n \times n}$ with $A_{ij} = 1$ iff $\{i,j\} \in E$.
Key properties: symmetric ($A = A^\top$), zero diagonal, row sums = degrees.
Code cell 5
# === 2.1 Build diamond graph adjacency matrix ===
# Vertices 0-3, edges: {0,1},{0,2},{1,2},{1,3},{2,3}
edges = [(0,1),(0,2),(1,2),(1,3),(2,3)]
n = 4
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1
print('Adjacency matrix A:')
print(A)
print(f'\nSymmetric: {(A == A.T).all()}')
print(f'Zero diagonal: {(np.diag(A) == 0).all()}')
print(f'Row sums (degrees): {A.sum(axis=1)}')
print(f'Sum of degrees = {A.sum()} = 2*|E| = {2*len(edges)}')
Code cell 6
# === 2.2 Directed adjacency matrix ===
# Directed: 0→1, 0→2, 1→2, 1→3, 2→3
A_dir = np.zeros((n, n), dtype=int)
for u, v in edges:  # treat as directed edges u→v
    A_dir[u, v] = 1
print('Directed adjacency matrix A_dir:')
print(A_dir)
print(f'\nOut-degrees (row sums): {A_dir.sum(axis=1)}')
print(f'In-degrees (col sums): {A_dir.sum(axis=0)}')
print(f'Symmetric: {(A_dir == A_dir.T).all()} (expected False)')
Code cell 7
# === 2.3 Walk counting via matrix powers ===
# (A^k)_{ij} = number of walks of length k from i to j
for k in [1, 2, 3]:
    Ak = np.linalg.matrix_power(A, k)
    print(f'A^{k}:')
    print(Ak)
    if k == 2:
        print(f' (A^2)_{{00}} = {Ak[0,0]}: walks of length 2 returning to vertex 0')
    if k == 3:
        print(f' tr(A^3) = {int(np.trace(Ak))}')
        print(f' Triangles = tr(A^3)/6 = {int(np.trace(Ak))//6}')
    print()
3. Degree Matrix, Laplacian, and Normalised Adjacency
The graph Laplacian $L = D - A$ is central to spectral graph theory and GNNs.
Key properties:
- $x^\top L x = \sum_{\{i,j\} \in E} (x_i - x_j)^2 \ge 0$ (positive semidefinite)
- $L \mathbf{1} = \mathbf{0}$ (zero row sums)
Code cell 9
# === 3.1 Compute D, L, and verify PSD ===
degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A
print('Degree matrix D:')
print(D)
print('\nLaplacian L = D - A:')
print(L)
# Verify zero row sums
print(f'\nL @ 1 = {L @ np.ones(n)} (should be all zeros)')
# Verify PSD via eigenvalues
eigs = np.linalg.eigvalsh(L)
print(f'Eigenvalues of L: {eigs.round(4)}')
print(f'All eigenvalues >= 0: {(eigs >= -1e-10).all()} → L is PSD')
print(f'Number of zero eigenvalues: {(np.abs(eigs) < 1e-10).sum()} → {(np.abs(eigs) < 1e-10).sum()} connected component(s)')
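The zero-eigenvalue claim is easiest to see on a disconnected graph. A quick self-contained sketch (the two-triangle graph here is a hypothetical example, not part of the notebook's running diamond graph):

```python
import numpy as np

# The multiplicity of eigenvalue 0 of L equals the number of connected components.
# Two disjoint triangles -> expect exactly 2 zero eigenvalues.
edges2 = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
n2 = 6
A2 = np.zeros((n2, n2))
for u, v in edges2:
    A2[u, v] = A2[v, u] = 1
L2 = np.diag(A2.sum(axis=1)) - A2
eigs2 = np.linalg.eigvalsh(L2)
n_zero = int((np.abs(eigs2) < 1e-10).sum())
print('Eigenvalues:', eigs2.round(4))
print('Zero eigenvalues:', n_zero)  # 2 components -> 2 zeros
```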
Code cell 10
# === 3.2 Quadratic form: x^T L x = sum_{(i,j) in E} (x_i - x_j)^2 ===
x = np.array([1.0, 0.0, 0.0, -1.0])
# Direct formula
quadratic_direct = float(x @ L @ x)
# Edge-based formula
quadratic_edge = sum((x[u] - x[v])**2 for u, v in edges)
print(f'x = {x}')
print(f'x^T L x (direct) = {quadratic_direct:.4f}')
print(f'sum (xi-xj)^2 = {quadratic_edge:.4f}')
print(f'PASS: both agree: {abs(quadratic_direct - quadratic_edge) < 1e-10}')
print(f'\nInterpretation: signal x assigns +1 to vertex 0, -1 to vertex 3.')
print(f'The Laplacian quadratic form measures the total variation across edges.')
Code cell 11
# === 3.3 Normalised adjacency A_hat = D^{-1/2} A D^{-1/2} ===
D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees.astype(float)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt
print('Normalised adjacency A_hat = D^{-1/2} A D^{-1/2}:')
print(A_hat.round(4))
eigs_Ahat = np.linalg.eigvalsh(A_hat)
print(f'\nEigenvalues of A_hat: {eigs_Ahat.round(4)}')
print(f'All in [-1, 1]: {(np.abs(eigs_Ahat) <= 1 + 1e-10).all()}')
L_sym = np.eye(n) - A_hat
eigs_Lsym = np.linalg.eigvalsh(L_sym)
print(f'\nEigenvalues of L_sym = I - A_hat: {eigs_Lsym.round(4)}')
print(f'All in [0, 2]: {(eigs_Lsym >= -1e-10).all() and (eigs_Lsym <= 2+1e-10).all()}')
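As a cross-check, $L_{\mathrm{sym}} = I - \hat{A}$ should coincide with normalising the Laplacian directly, $D^{-1/2} L D^{-1/2}$. A minimal sketch (rebuilding the diamond graph so the cell is self-contained):

```python
import numpy as np

# Two routes to the symmetric normalised Laplacian should agree:
#   D^{-1/2} (D - A) D^{-1/2}  ==  I - D^{-1/2} A D^{-1/2}
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
deg = A.sum(axis=1)
D_is = np.diag(1.0 / np.sqrt(deg))
L = np.diag(deg) - A
L_sym_a = D_is @ L @ D_is                # normalise the Laplacian directly
L_sym_b = np.eye(n) - D_is @ A @ D_is    # I - A_hat
print('Agree:', np.allclose(L_sym_a, L_sym_b))
```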
4. GCN Propagation Step (Worked Example)
The GCN layer (Kipf & Welling, 2017) computes:
$$H' = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H W\right)$$
where $\tilde{A} = A + I$ (self-loops) and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
Code cell 13
# === 4.1 Build GCN normalised adjacency (with self-loops) ===
A_tilde = A + np.eye(n, dtype=float)
d_tilde = A_tilde.sum(axis=1) # degree in A_tilde
D_tilde_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
A_gcn = D_tilde_inv_sqrt @ A_tilde @ D_tilde_inv_sqrt
print('GCN normalised adjacency (with self-loops):')
print(A_gcn.round(4))
# Verify entries: A_gcn[i,j] = 1/sqrt(d_tilde[i]*d_tilde[j])
i, j = 0, 1
expected = 1.0 / np.sqrt(d_tilde[i] * d_tilde[j])
print(f'\nA_gcn[0,1] = {A_gcn[0,1]:.4f}, expected = {expected:.4f}, match: {abs(A_gcn[0,1]-expected)<1e-10}')
Code cell 14
# === 4.2 One GCN layer: H' = ReLU(A_gcn @ H @ W) ===
np.random.seed(7)
d_in, d_out = n, 2  # input dim = n (identity features), output dim = 2
# Identity node features (each node's feature = its one-hot encoding)
H0 = np.eye(n)  # shape (n, d_in)
# Random weight matrix
W = np.random.randn(d_in, d_out) * 0.5  # shape (d_in, d_out)
# GCN layer
Z = A_gcn @ H0 @ W # shape (n, d_out): aggregation + linear
H1 = np.maximum(0, Z) # ReLU
print('Weight matrix W:')
print(W.round(4))
print('\nPre-activation Z = A_gcn @ H0 @ W:')
print(Z.round(4))
print('\nPost-ReLU H1:')
print(H1.round(4))
print('\nNote: row 1 (degree-3 node) receives more averaged signal than rows 0,3 (degree-2).')
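A quick sketch of why stacking $k$ GCN layers mixes $k$-hop information: the non-zero pattern of $\hat{A}_{\mathrm{gcn}}^k$ is the set of vertex pairs within distance $k$ (graph rebuilt here so the cell stands alone):

```python
import numpy as np

# The support (non-zero pattern) of A_gcn^k marks pairs reachable in <= k hops,
# i.e. the receptive field of a k-layer GCN. Diamond graph with self-loops.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
A_t = A + np.eye(n)
d = A_t.sum(axis=1)
A_gcn = np.diag(d**-0.5) @ A_t @ np.diag(d**-0.5)
s1 = (np.linalg.matrix_power(A_gcn, 1) > 1e-12).astype(int)
s2 = (np.linalg.matrix_power(A_gcn, 2) > 1e-12).astype(int)
print('1-layer support:\n', s1)   # (0,3) still zero: not adjacent
print('2-layer support:\n', s2)   # all ones: diameter 2
```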
5. Adjacency List
The adjacency list stores, per vertex, the list of its neighbours. Memory: $O(n + m)$.
Code cell 16
# === 5.1 Python dict-of-lists adjacency list ===
adj_list = collections.defaultdict(list)
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)
print('Adjacency list:')
for v in range(n):
    print(f' {v}: {sorted(adj_list[v])}')

# BFS using adjacency list
def bfs(adj, root, n):
    visited = [False]*n
    visited[root] = True
    queue = collections.deque([root])
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in adj[v]:
            if not visited[u]:
                visited[u] = True
                queue.append(u)
    return order

order = bfs(adj_list, root=0, n=n)
print(f'\nBFS order from vertex 0: {order}')
Code cell 17
# === 5.2 Adjacency set: O(1) edge lookup ===
adj_set = {v: set(nbrs) for v, nbrs in adj_list.items()}
# Edge queries
queries = [(0,1), (0,3), (1,2), (2,3)]
print('Edge existence queries (adjacency set):')
for u, v in queries:
    print(f' ({u},{v}): {v in adj_set[u]}')
# Verify adjacency list vs set give same degree
for v in range(n):
    assert len(adj_list[v]) == len(adj_set[v])
print('\nPASS - adjacency list and set have consistent degrees')
Code cell 18
# === 5.3 Sorted adjacency list: O(log deg) edge lookup ===
import bisect
adj_sorted = {v: sorted(nbrs) for v, nbrs in adj_list.items()}

def has_edge_sorted(adj, u, v):
    idx = bisect.bisect_left(adj[u], v)
    return idx < len(adj[u]) and adj[u][idx] == v

print('Edge existence queries (sorted list, binary search):')
for u, v in queries:
    print(f' ({u},{v}): {has_edge_sorted(adj_sorted, u, v)}')
# All should match adjacency set results
for u, v in queries:
    assert has_edge_sorted(adj_sorted, u, v) == (v in adj_set.get(u, set()))
print('\nPASS - sorted list agrees with adjacency set')
6. Edge List and COO Format
The edge list is the simplest representation; COO extends it to three parallel arrays.
Code cell 20
# === 6.1 Edge list ===
edge_list = list(edges) # [(0,1),(0,2),(1,2),(1,3),(2,3)]
print('Edge list:', edge_list)
print(f'Space: O(m) = O({len(edge_list)})')
# === 6.2 COO format ===
row_arr = np.array([u for u,v in edge_list], dtype=np.int32)
col_arr = np.array([v for u,v in edge_list], dtype=np.int32)
# For undirected, add reverse edges
row_coo = np.concatenate([row_arr, col_arr])
col_coo = np.concatenate([col_arr, row_arr])
data_coo = np.ones(len(row_coo), dtype=np.float32)
print(f'\nCOO (undirected, {len(row_coo)} entries):')
print(f' row: {row_coo}')
print(f' col: {col_coo}')
print(f' data: {data_coo.astype(int)}')
Code cell 21
# === 6.3 SciPy COO and CSR ===
A_coo = sp.coo_matrix((data_coo, (row_coo, col_coo)), shape=(n, n))
A_csr = A_coo.tocsr()
print('SciPy COO:')
print(f' shape: {A_coo.shape}, nnz: {A_coo.nnz}')
print('\nSciPy CSR:')
print(f' data: {A_csr.data.astype(int)}')
print(f' col_idx: {A_csr.indices}')
print(f' row_ptr: {A_csr.indptr}')
# Verify matches dense adjacency
A_reconstructed = A_csr.toarray().astype(int)
print(f'\nMatches dense A: {(A_reconstructed == A).all()}')
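One COO gotcha worth knowing before moving on (a minimal hypothetical example, assuming only SciPy): duplicate `(row, col)` pairs are *summed* on conversion, which silently turns repeated edges into edge weights.

```python
import numpy as np
import scipy.sparse as sp

# COO allows duplicate coordinates; tocsr()/toarray() sums them.
# A repeated edge (0,1) becomes weight 2.0, not 1.0.
row = np.array([0, 0, 1])
col = np.array([1, 1, 0])
dat = np.ones(3, dtype=np.float64)
M = sp.coo_matrix((dat, (row, col)), shape=(2, 2)).tocsr()
print(M.toarray())  # entry (0,1) is 2.0
```

This is why PyG provides a coalesce step for `edge_index` before numerical use.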
7. PyTorch Geometric edge_index Format
PyG uses a COO tensor `edge_index` of shape `[2, num_edges]` to represent directed edges. Row 0 = sources, row 1 = targets.
Code cell 23
# === 7.1 Build edge_index (simulated without torch) ===
# For undirected graph, include both (u,v) and (v,u)
src = np.array(row_coo, dtype=np.int64)
dst = np.array(col_coo, dtype=np.int64)
edge_index = np.stack([src, dst], axis=0) # shape (2, 2m)
print('edge_index (shape 2 x 2m):')
print(edge_index)
print(f'Shape: {edge_index.shape} (2 x {edge_index.shape[1]})')
print(f'Num directed edges: {edge_index.shape[1]} = 2 * {len(edges)} (undirected)')
# Verify: for each undirected edge, both directions present
edge_set_directed = set(zip(src.tolist(), dst.tolist()))
for u, v in edges:
    assert (u,v) in edge_set_directed and (v,u) in edge_set_directed
print('PASS - both directions present for all undirected edges')
Code cell 24
# === 7.2 Message passing using edge_index (scatter_add simulation) ===
# GNN aggregation: for each node v, aggregate features from in-neighbours
# H'[v] = sum_{u: (u,v) in edge_index} H[u]
# Simulated node features
H = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
print('Node features H (shape n x d):')
print(H)
# Scatter-add aggregation (Python simulation)
H_agg = np.zeros_like(H)
for k in range(edge_index.shape[1]):
    u_k = edge_index[0, k]
    v_k = edge_index[1, k]
    H_agg[v_k] += H[u_k]  # accumulate source features into destination
print('\nAggregated H_agg = scatter_add(H, edge_index):')
print(H_agg)
print('\nVerify: H_agg[0] = sum of neighbours of 0 = H[1] + H[2]:', H[1] + H[2])
8. Sparse Matrix Formats: CSR, CSC, LIL, DOK
CSR (Compressed Sparse Row) is the standard for efficient SpMV and GNN computations.
Code cell 26
# === 8.1 Manual CSR construction ===
# Build CSR for diamond graph from scratch
# Sort edges by source (row)
undirected_pairs = [(u,v) for u,v in edges] + [(v,u) for u,v in edges]
undirected_pairs.sort() # sort by source
data_csr = np.ones(len(undirected_pairs), dtype=np.int32)
col_idx = np.array([v for u,v in undirected_pairs], dtype=np.int32)
row_ptr = np.zeros(n + 1, dtype=np.int32)
# Count edges per row
for u, v in undirected_pairs:
    row_ptr[u + 1] += 1
# Prefix sum
for i in range(n):
    row_ptr[i+1] += row_ptr[i]
print('Manual CSR for diamond graph:')
print(f' data: {data_csr}')
print(f' col_idx: {col_idx}')
print(f' row_ptr: {row_ptr}')
# Verify row 1 (neighbours of vertex 1 = [0,2,3])
row1_nbrs = col_idx[row_ptr[1]:row_ptr[2]]
print(f'\nNeighbours of vertex 1: {sorted(row1_nbrs.tolist())} (expected [0,2,3])')
assert set(row1_nbrs.tolist()) == {0, 2, 3}, 'Row 1 neighbours wrong!'
print('PASS')
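The count-then-prefix-sum construction above can be vectorised with NumPy: `np.bincount` gives per-row counts, `np.cumsum` gives `row_ptr`, and a stable `argsort` on the source array groups entries by row. A sketch:

```python
import numpy as np

# Vectorised CSR build from COO-style (src, dst) arrays for the diamond graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4
src = np.array([u for u, v in edges] + [v for u, v in edges])
dst = np.array([v for u, v in edges] + [u for u, v in edges])
order = np.argsort(src, kind='stable')   # group entries by source row
col_idx = dst[order]
row_ptr = np.zeros(n + 1, dtype=np.int64)
row_ptr[1:] = np.cumsum(np.bincount(src, minlength=n))
print('row_ptr:', row_ptr)
print('col_idx:', col_idx)
```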
Code cell 27
# === 8.2 CSR vs CSC ===
A_csc = A_csr.tocsc()
print('CSR: efficient row (out-neighbour) access')
print(f' Row 1 of CSR: {A_csr.getrow(1).indices} (out-neighbours of 1)')
print('\nCSC: efficient column (in-neighbour) access')
print(f' Col 1 of CSC: {A_csc.getcol(1).indices} (in-neighbours of 1)')
# For undirected graphs, in-neighbours == out-neighbours
assert set(A_csr.getrow(1).indices) == set(A_csc.getcol(1).indices)
print('\nPASS - for undirected graph, in-nbrs == out-nbrs')
Code cell 28
# === 8.3 LIL and DOK (dynamic construction) ===
# LIL: efficient incremental construction
A_lil = sp.lil_matrix((n, n), dtype=float)
for u, v in edges:
    A_lil[u, v] = 1.0
    A_lil[v, u] = 1.0
print('LIL matrix data structure (rows as sorted lists):')
for i in range(n):
    print(f' row {i}: cols={A_lil.rows[i]}')
# Convert LIL → CSR for computation
A_from_lil = A_lil.tocsr()
print(f'\nAfter tocsr(): shape={A_from_lil.shape}, nnz={A_from_lil.nnz}')
# DOK: dictionary of keys
A_dok = sp.dok_matrix((n, n), dtype=float)
for u, v in edges:
    A_dok[u, v] = A_dok[v, u] = 1.0
print(f'\nDOK matrix: {dict(list(A_dok.items())[:3])} ... ({A_dok.nnz} entries)')
9. Sparse Matrix-Vector Multiplication (SpMV)
SpMV is the core operation in GNNs and PageRank. CSR achieves $O(\mathrm{nnz})$ time: one multiply-add per non-zero.
Code cell 30
# === 9.1 Manual CSR SpMV with trace ===
x_vec = np.array([1., 2., 3., 4.])
def spmv_csr(data, col_idx, row_ptr, x):
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i+1]):
            y[i] += data[k] * x[col_idx[k]]
    return y
y_sparse = spmv_csr(data_csr.astype(float), col_idx, row_ptr, x_vec)
y_dense = A @ x_vec
print(f'x = {x_vec}')
print(f'SpMV result (CSR): {y_sparse}')
print(f'Dense A @ x: {y_dense}')
print(f'Agree: {np.allclose(y_sparse, y_dense)}')
# Interpretation: y[i] = sum of x[j] for j in neighbours of i
print('\nInterpretation: y[i] = sum of x-values over neighbours of i')
for i in range(n):
    nbrs = col_idx[row_ptr[i]:row_ptr[i+1]].tolist()
    print(f' y[{i}] = sum(x[{nbrs}]) = {sum(x_vec[j] for j in nbrs):.0f} ✓')
Code cell 31
# === 9.2 SpMV benchmark: dense vs sparse ===
import time
sizes = [100, 500, 1000]
results = []
for sz in sizes:
    # Sparse Erdos-Renyi graph: n nodes, ~5n edges
    p = 10.0 / sz
    A_rand = np.random.rand(sz, sz) < p
    A_rand = ((A_rand | A_rand.T) & ~np.eye(sz, dtype=bool)).astype(np.float32)
    A_sp = sp.csr_matrix(A_rand)
    xv = np.random.randn(sz).astype(np.float32)
    fill = A_sp.nnz / sz**2
    # Dense
    t0 = time.perf_counter()
    for _ in range(200):
        _ = A_rand @ xv
    t_dense = (time.perf_counter() - t0) / 200 * 1e6
    # Sparse
    t0 = time.perf_counter()
    for _ in range(200):
        _ = A_sp @ xv
    t_sparse = (time.perf_counter() - t0) / 200 * 1e6
    results.append((sz, fill, t_dense, t_sparse, t_dense/t_sparse))
    print(f'n={sz:4d}: fill={fill:.3f} dense={t_dense:.1f}µs sparse={t_sparse:.1f}µs speedup={t_dense/t_sparse:.1f}x')
10. Incidence Matrix and L = BB^⊤
The incidence matrix $B \in \mathbb{R}^{n \times m}$ (rows = vertices, columns = edges) satisfies $L = BB^\top$ when edges are given an arbitrary orientation (entries $\pm 1$); the unsigned version instead gives $BB^\top = D + A$, the signless Laplacian.
Code cell 33
# === 10.1 Build incidence matrix for diamond graph ===
m_edges = len(edges)
B = np.zeros((n, m_edges), dtype=float)
for e_idx, (u, v) in enumerate(edges):
    B[u, e_idx] = 1.0
    B[v, e_idx] = 1.0
print('Incidence matrix B (n x m):')
print(B.astype(int))
print(f'\nShape: {B.shape} (n={n} vertices, m={m_edges} edges)')
print(f'Column sums (each edge has 2 endpoints): {B.sum(axis=0)}')
print(f'Row sums (= degrees): {B.sum(axis=1)}')
Code cell 34
# === 10.2 BB^T for the unsigned incidence matrix ===
# For unsigned B, BB^T = D + A (the signless Laplacian), NOT L = D - A:
# (BB^T)_{ii} = deg(i) and (BB^T)_{ij} = A_{ij} for i != j.
Q_from_B = B @ B.T
print('BB^T:')
print(Q_from_B.astype(int))
print('\nD + A (signless Laplacian):')
print((D + A).astype(int))
print(f'\nBB^T == D + A: {np.allclose(Q_from_B, D + A)}')
print('The Kirchhoff identity L = BB^T requires the oriented incidence matrix (next cell).')
# PSD either way: x^T(BB^T)x = ||B^T x||^2 >= 0 for any B
x_test = np.random.randn(n)
lhs = x_test @ Q_from_B @ x_test
rhs = np.linalg.norm(B.T @ x_test)**2
print(f'\n x^T(BB^T)x = {lhs:.6f}')
print(f' ||B^T x||^2 = {rhs:.6f}')
print(f' Match: {np.isclose(lhs, rhs)}')
Code cell 35
# === 10.3 Directed incidence matrix ===
B_dir = np.zeros((n, m_edges), dtype=float)
for e_idx, (u, v) in enumerate(edges):
    B_dir[u, e_idx] = -1.0  # source (tail)
    B_dir[v, e_idx] = +1.0  # target (head)
print('Directed incidence matrix B_dir:')
print(B_dir.astype(int))
L_dir = B_dir @ B_dir.T
print(f'\nB_dir @ B_dir^T == L: {np.allclose(L_dir, L)}')
print('PASS: L = BB^T holds for directed incidence matrix too')
# Interpretation of B_dir^T x: discrete gradient
x_signal = np.array([0., 1., 2., 3.]) # linear signal
grad = B_dir.T @ x_signal # edge differences
print(f'\nSignal x = {x_signal}')
print(f'Edge differences B_dir^T x = {grad}')
for e_idx, (u, v) in enumerate(edges):
    print(f' Edge ({u}→{v}): x[{v}] - x[{u}] = {x_signal[v]-x_signal[u]:.0f} ✓')
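Composing the two maps closes the loop: applying $B_{\mathrm{dir}}$ to the edge differences gives back the Laplacian, $Lx = B_{\mathrm{dir}}(B_{\mathrm{dir}}^\top x)$, the discrete divergence of the gradient. A self-contained sketch:

```python
import numpy as np

# Laplacian as divergence-of-gradient: L x = B_dir (B_dir^T x). Diamond graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n, m = 4, len(edges)
B_dir = np.zeros((n, m))
for e, (u, v) in enumerate(edges):
    B_dir[u, e], B_dir[v, e] = -1.0, 1.0
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
L = np.diag(A.sum(axis=1)) - A
x = np.array([0., 1., 2., 3.])
grad = B_dir.T @ x   # edge differences (gradient)
div = B_dir @ grad   # back to vertices (divergence)
print('L @ x    =', L @ x)
print('B(B^T x) =', div)
```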
11. Representation Conversions
All major conversions and their costs.
Code cell 37
# === 11.1 Conversion functions ===
def edge_list_to_matrix(edge_list, n, directed=False):
    A = np.zeros((n, n), dtype=int)
    for u, v in edge_list:
        A[u, v] = 1
        if not directed:
            A[v, u] = 1
    return A

def edge_list_to_adj_list(edge_list, n, directed=False):
    adj = {i: [] for i in range(n)}
    for u, v in edge_list:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

def adj_list_to_coo(adj, n):
    rows, cols = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            rows.append(u)
            cols.append(v)
    return np.array(rows, np.int32), np.array(cols, np.int32)
# Test pipeline: edge_list → adj_list → COO → CSR
A_mat = edge_list_to_matrix(edges, n)
adj = edge_list_to_adj_list(edges, n)
rows, cols = adj_list_to_coo(adj, n)
A_csr_conv = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n,n))
print(f'Pipeline: edge_list → matrix → adj_list → COO → CSR')
print(f'Dense matches: {(A_mat == A).all()}')
print(f'CSR matches: {np.allclose(A_csr_conv.toarray(), A)}')
print('PASS - all conversions consistent')
Code cell 38
# === 11.2 SciPy conversion API ===
print('SciPy format conversions:')
for fmt, converter in [('CSR', lambda x: x.tocsr()),
                       ('CSC', lambda x: x.tocsc()),
                       ('COO', lambda x: x.tocoo()),
                       ('LIL', lambda x: x.tolil())]:
    converted = converter(A_csr)
    dense = converted.toarray()
    ok = np.allclose(dense, A)
    print(f' → {fmt}: {converted.__class__.__name__}, matches={ok}')
12. Heterogeneous and Temporal Graphs
Heterogeneous graphs have multiple node/edge types, each with its own edge_index.
Code cell 40
# === 12.1 Heterogeneous graph (author-paper-venue) ===
np.random.seed(42)
n_authors, n_papers, n_venues = 5, 8, 3
# Relation: (author, writes, paper)
writes_src = np.array([0, 0, 1, 2, 2, 3, 3, 4], dtype=np.int64)
writes_dst = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=np.int64)
# Relation: (paper, published_in, venue)
pub_src = np.arange(n_papers, dtype=np.int64)
pub_dst = np.array([0,0,1,1,2,2,0,1], dtype=np.int64)
print('Heterogeneous graph:')
print(f' Nodes: {n_authors} authors, {n_papers} papers, {n_venues} venues')
print(f' Relation (author,writes,paper): {len(writes_src)} edges')
print(f' Relation (paper,published_in,venue): {len(pub_src)} edges')
# Per-relation adjacency matrices (rectangular)
A_writes = sp.csr_matrix(
(np.ones(len(writes_src)), (writes_src, writes_dst)),
shape=(n_authors, n_papers)
)
print(f'\nA_writes shape: {A_writes.shape} (authors x papers)')
print(f'Papers per author: {A_writes.toarray().sum(axis=1).astype(int)}')
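Per-relation matrices compose: multiplying (author, writes, paper) by (paper, published_in, venue) counts metapath instances author→paper→venue. A sketch reusing the same toy relations (the metapath product itself is an illustration, not part of the notebook above):

```python
import numpy as np
import scipy.sparse as sp

# Metapath composition: A_writes (authors x papers) @ A_pub (papers x venues)
# yields author-venue publication counts.
writes_src = np.array([0, 0, 1, 2, 2, 3, 3, 4])
writes_dst = np.array([0, 1, 2, 3, 4, 5, 6, 7])
pub_src = np.arange(8)
pub_dst = np.array([0, 0, 1, 1, 2, 2, 0, 1])
A_writes = sp.csr_matrix((np.ones(8), (writes_src, writes_dst)), shape=(5, 8))
A_pub = sp.csr_matrix((np.ones(8), (pub_src, pub_dst)), shape=(8, 3))
A_av = (A_writes @ A_pub).toarray().astype(int)  # (authors x venues)
print('Author-venue publication counts:')
print(A_av)  # e.g. author 0 has 2 papers in venue 0
```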
Code cell 41
# === 12.2 Temporal edge list ===
# Simulate a temporal graph: edges with timestamps
n_nodes = 6
temporal_edges = [
(0, 1, 0.0), # (src, dst, timestamp)
(1, 2, 1.5),
(0, 3, 2.0),
(2, 4, 2.5),
(1, 4, 3.0),
(3, 5, 4.0),
(0, 5, 4.5),
]
# Sort by timestamp (standard for temporal GNNs)
temporal_edges.sort(key=lambda e: e[2])
print('Temporal edge list (sorted by timestamp):')
for u, v, t in temporal_edges:
    print(f' t={t:.1f}: ({u} → {v})')
# Snapshot at t <= 2.0
snapshot_edges = [(u,v) for u,v,t in temporal_edges if t <= 2.0]
print(f'\nSnapshot at t<=2.0: {snapshot_edges}')
A_snap = edge_list_to_matrix(snapshot_edges, n_nodes, directed=True)
print('Snapshot adjacency:')
print(A_snap)
13. Choosing a Representation
A systematic comparison to guide representation selection.
Code cell 43
# === 13.1 Memory comparison across sizes ===
print(f"{'Graph':>10} {'n':>8} {'m':>10} {'fill':>6} {'Dense(MB)':>10} {'CSR(MB)':>8} {'ratio':>7}")
print('-'*70)
benchmarks = [
    ('Cora', 2708, 5429),
    ('CiteSeer', 3327, 4732),
    ('PubMed', 19717, 44338),
    ('ogbn-arxiv', 169343, 1166243),
]
for name, nv, mv in benchmarks:
    dense_mb = (nv**2 * 4) / 1e6        # float32
    csr_mb = ((nv+1 + 2*mv) * 4) / 1e6  # int32 row_ptr + col_idx (symmetric)
    fill = 2*mv / (nv**2)
    ratio = dense_mb / csr_mb
    print(f'{name:>10} {nv:>8,} {mv:>10,} {fill:>6.4f} {dense_mb:>10.1f} {csr_mb:>8.2f} {ratio:>6.0f}x')
Code cell 44
# === 13.2 Visualise representation complexity ===
if HAS_MPL:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    fig.suptitle('Graph Representation Trade-offs', fontsize=14)
    # Left: memory vs n for different fill ratios
    ns = np.logspace(2, 6, 100)
    ax = axes[0]
    ax.loglog(ns, 4*ns**2/1e6, color=COLORS['error'], label='Dense float32 ($O(n^2)$)', lw=2)
    ax.loglog(ns, 4*(ns+2*5*ns)/1e6, color=COLORS['primary'], label='CSR int32, $m=5n$', lw=2)
    ax.loglog(ns, 4*(ns+2*ns)/1e6, color=COLORS['tertiary'], label='CSR int32, $m=n$', lw=2)
    ax.set_xlabel('Number of vertices $n$')
    ax.set_ylabel('Memory (MB)')
    ax.set_title('Memory vs graph size')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    # Right: operation complexity comparison (qualitative)
    ops = ['Edge\nquery', 'Neighbour\nenum', 'SpMV', 'Build\nfrom EL']
    dense_ops = [1, 3, 4, 3]  # relative log-scale
    csr_ops = [2, 1, 1, 2]
    adjl_ops = [3, 1, 1, 1]
    x_pos = np.arange(len(ops))
    w = 0.25
    ax2 = axes[1]
    ax2.bar(x_pos-w, dense_ops, w, label='Dense', color=COLORS['error'], alpha=0.8)
    ax2.bar(x_pos, csr_ops, w, label='CSR', color=COLORS['primary'], alpha=0.8)
    ax2.bar(x_pos+w, adjl_ops, w, label='Adj list', color=COLORS['tertiary'], alpha=0.8)
    ax2.set_xticks(x_pos)
    ax2.set_xticklabels(ops)
    ax2.set_ylabel('Relative cost (higher = slower)')
    ax2.set_title('Operation complexity (qualitative)')
    ax2.legend(fontsize=9)
    fig.tight_layout()
    plt.savefig('/tmp/graph_repr_tradeoffs.png')
    plt.show()
    print('Plot saved.')
14. Summary
| Representation | Space | Edge query | Neighbours | SpMV | Best for |
|---|---|---|---|---|---|
| Dense | $O(n^2)$ | $O(1)$ | $O(n)$ | $O(n^2)$ | Small dense graphs, spectral ops |
| Adj. list | $O(n+m)$ | $O(\deg)$ | $O(\deg)$ | $O(n+m)$ | BFS/DFS, dynamic, prototyping |
| Adj. set | $O(n+m)$ | $O(1)$ avg | $O(\deg)$ | $O(n+m)$ | Frequent edge queries |
| COO/edge_index | $O(m)$ | $O(m)$ | $O(m)$ | $O(m)$ | PyG, GPU, I/O |
| CSR | $O(n+m)$ | $O(\log \deg)$ | $O(\deg)$ | $O(m)$ | SpMV, GNN, large graphs |
| CSC | $O(n+m)$ | $O(\log \deg)$ | $O(\deg)$ | $O(m)$ | Column ops, in-neighbour access |
| Incidence | $O(nm)$ | — | — | — | Hypergraphs, flow, Hodge theory |
Key formula: $L = D - A = B_{\mathrm{dir}} B_{\mathrm{dir}}^\top$ with quadratic form $x^\top L x = \sum_{\{i,j\} \in E} (x_i - x_j)^2$.
Key insight: No single representation wins on all operations. Choose based on graph size/density, dominant operation, and hardware target.