Math for LLMs

Convolution Theorem

Fourier Analysis and Signal Processing / Convolution Theorem

"The convolution theorem is the most important theorem in signal processing - it is what makes Fourier analysis a practical computational tool rather than a purely theoretical construct."

  • Alan V. Oppenheim, MIT EECS

Overview

The Convolution Theorem establishes one of the most beautiful and practically consequential dualities in all of mathematics: convolution in the time domain equals pointwise multiplication in the frequency domain. This single identity reduces an O(N2)O(N^2) sliding-dot-product computation to O(NlogN)O(N \log N) FFT-based arithmetic - a transformation that made real-time digital audio processing possible in the 1970s and powers nearly every modern deep learning architecture today.

This section develops the Convolution Theorem rigorously in both its continuous and discrete forms, builds the surrounding theory of linear time-invariant (LTI) systems and filter design, treats efficient block-convolution algorithms (overlap-add, overlap-save), and traces the theorem's consequences through cross-correlation, the Wiener-Khinchin theorem, and deconvolution. The second half connects these classical results to modern AI: convolutional neural networks as learned frequency filters, WaveNet's causal dilated convolutions, state space models (S4, Mamba) as implicit convolution operators, the Hyena hierarchy's parameterized long convolutions, and FNet's replacement of multi-head attention with a single 2D FFT.

The Convolution Theorem is not merely useful - it is a ring isomorphism. The DFT diagonalizes the circular convolution algebra C[Z/NZ]\mathbb{C}[\mathbb{Z}/N\mathbb{Z}], turning a non-commutative-looking operation into trivial pointwise multiplication. Understanding why this works - through the lens of group representation theory and Fourier analysis on abelian groups - gives deep insight into why spectral methods are so pervasive in modern sequence modeling.

Prerequisites

  • DFT and FFT - DFT definition, circular convolution preview, FFT algorithm, O(NlogN)O(N \log N) complexity (Section 20-03)
  • Fourier Transform - continuous FT definition (ξ\xi-convention), Plancherel's theorem, L1L^1 and L2L^2 theory (Section 20-02)
  • Fourier Series - complex exponential basis, Parseval's identity (Section 20-01)
  • Linear algebra - inner products, unitary matrices, change of basis (Section 02)
  • Complex analysis - eiθe^{i\theta}, modulus, argument, roots of unity (Section 01)

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Convolution theorem proofs, LTI frequency response, overlap-add, Wiener filter, CNN filters, S4/Mamba SSM, Hyena, FNet |
| exercises.ipynb | 10 graded problems: convolution by hand through Hyena long-convolution parameterization |

Learning Objectives

After completing this section, you will:

  1. State and prove the Convolution Theorem in both continuous (L1L^1/L2L^2) and discrete (DFT) forms
  2. Distinguish circular convolution from linear convolution and apply the zero-padding correction
  3. Define an LTI system via its impulse response and derive the frequency response H(f)=F(h)(f)H(f) = \mathcal{F}(h)(f)
  4. Analyze the group delay ddfH(f)-\frac{d}{df}\angle H(f) and explain its significance for signal distortion
  5. Compare FIR and IIR filter designs: stability, linear phase, computational cost
  6. Implement the overlap-add algorithm for O(NlogN)O(N \log N) block convolution of long sequences
  7. Prove the cross-correlation theorem and derive the Wiener-Khinchin theorem (PSD = FT of autocorrelation)
  8. Derive the Wiener deconvolution filter from the MSE minimization principle
  9. Interpret CNN convolutional layers as learned LTI frequency filters in the spatial domain
  10. Explain how S4/Mamba SSMs compute output via FFT-based convolution in training mode
  11. Describe Hyena's parameterized long convolutions and their O(NlogN)O(N \log N) complexity
  12. Identify and fix the 10 most common convolution errors in numerical implementations


1. Intuition

1.1 What Is Convolution?

Convolution is a mathematical operation that blends two functions by sliding one over the other and measuring their overlap at each position. More precisely, the linear convolution of ff and gg at position tt is the integral of the pointwise product of ff with a reversed, shifted copy of gg:

(fg)(t)=f(τ)g(tτ)dτ(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

The operation appears in an enormous variety of contexts, often without being recognized as convolution:

  • Image blurring: convolving an image with a Gaussian kernel g(x,y)=e(x2+y2)/2σ2g(x,y) = e^{-(x^2+y^2)/2\sigma^2} produces a smoothed version. Each output pixel is a weighted average of its neighborhood.
  • Running average: a box filter g[n]=1M1[0,M1]g[n] = \frac{1}{M}\mathbf{1}_{[0,M-1]} convolved with a time series produces the MM-point moving average - standard in financial data smoothing.
  • Polynomial multiplication: multiplying (a0+a1x+a2x2)(a_0 + a_1 x + a_2 x^2) by (b0+b1x)(b_0 + b_1 x) gives coefficients that are exactly the linear convolution of [a0,a1,a2][a_0, a_1, a_2] and [b0,b1][b_0, b_1]. This is why the FFT can multiply two degree-NN polynomials in O(NlogN)O(N \log N) operations.
  • Echo simulation: convolving an audio signal with a room impulse response h(t)h(t) (the sound of a clap in an empty hall) produces the "reverberated" signal.
  • Probability: the PDF of the sum Z=X+YZ = X + Y of independent random variables is (pXpY)(z)(p_X * p_Y)(z) - convolution of their PDFs.

The defining algebraic property is that convolution is commutative (fg=gff * g = g * f), associative ((fg)h=f(gh)(f * g) * h = f * (g * h)), and distributive over addition. These properties make it a natural operation in any linear system.

For AI: every convolutional layer in a CNN computes exactly this operation. A 3×33 \times 3 filter WW convolved with feature map XX gives Y=WXY = W * X - the kernel slides over the input, taking inner products at each position. Understanding convolution as sliding-overlap reveals why translation equivariance holds: if the input shifts, the output shifts by the same amount.
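The sliding-overlap picture is easy to see concretely. A minimal NumPy sketch: convolving a short series with a 3-point box kernel produces exactly the 3-point moving average from the list above.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 6.0])
box = np.ones(3) / 3  # 3-point moving-average kernel

# 'valid' keeps only the positions where the kernel fully overlaps x
y = np.convolve(x, box, mode="valid")
print(y)  # averages of each consecutive triple: [7/3, 14/3, 16/3]
```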

1.2 The Central Insight: Frequency Domain Diagonalization

Computing convolution directly costs O(N2)O(N^2) operations: for each of NN output positions, you compute an inner product with NN filter coefficients. For N=106N = 10^6 samples (one second of audio at 1 MHz), this is 101210^{12} multiplications - completely infeasible.

The Convolution Theorem provides a dramatic shortcut. The key observation is that complex exponentials are eigenfunctions of convolution:

(he2πif)(t)=(h(τ)e2πifτdτ)h^(f)=H(f)e2πift(h * e^{2\pi i f \cdot})(t) = \underbrace{\left(\int_{-\infty}^{\infty} h(\tau) e^{-2\pi i f \tau} d\tau\right)}_{\hat{h}(f) = H(f)} \cdot e^{2\pi i f t}

Each complex exponential e2πifte^{2\pi i ft} passes through any LTI system unchanged in frequency - it is only scaled by the complex number H(f)=h^(f)H(f) = \hat{h}(f). This means:

  1. Decompose the input xx into its frequency components via the Fourier transform: X(f)=F(x)(f)X(f) = \mathcal{F}(x)(f)
  2. Multiply each component by the filter's frequency response: Y(f)=H(f)X(f)Y(f) = H(f) \cdot X(f) (pointwise, O(N)O(N))
  3. Reconstruct the output via the inverse Fourier transform: y=F1(Y)y = \mathcal{F}^{-1}(Y)

Steps 1 and 3 each cost O(NlogN)O(N \log N) via the FFT. Step 2 costs O(N)O(N). The total is O(NlogN)O(N \log N) - compared to O(N2)O(N^2) for direct convolution.

CONVOLUTION THEOREM: TWO EQUIVALENT COMPUTATIONS
========================================================================

  TIME DOMAIN                        FREQUENCY DOMAIN
  ----------------------             ----------------------
  x[n], h[n]  (lengths N, M)         X[k] = DFT(x),  H[k] = DFT(h)
       |                                  |
       v                                  v
  y = x * h                          Y[k] = X[k] · H[k]
  O(NM) operations                        |
                                          v
                                     y = IDFT(Y)
                                     O(N log N) via FFT

  Both computations give IDENTICAL output y[n].

========================================================================

The mathematical reason this works is deep: the DFT is a ring isomorphism between (CN,)(\mathbb{C}^N, \circledast) - the ring of sequences under circular convolution - and (CN,)(\mathbb{C}^N, \cdot) - the ring of sequences under pointwise multiplication. Convolution, which looks complicated, becomes trivial multiplication after a change of basis.
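This ring isomorphism can be checked numerically in a few lines, assuming NumPy: the direct circular-convolution sum and the DFT-multiply-IDFT route give the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x, y = rng.standard_normal(N), rng.standard_normal(N)

# direct circular convolution: (x ⊛ y)[n] = sum_m x[m] y[(n - m) mod N]
circ = np.array([sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)])

# frequency-domain route: IDFT(DFT(x) · DFT(y))
freq = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(circ, freq)
```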

1.3 Why This Matters for AI (First Look)

The Convolution Theorem is not merely a classical signal processing tool - it is at the heart of modern deep learning architectures:

  • CNNs: every torch.nn.Conv2d layer computes a learned 2D convolution. For large kernels, PyTorch uses FFT-based convolution internally. The frequency domain perspective explains why early CNN layers learn Gabor-like filters (oriented bandpass filters) and why deep layers respond to semantic content.
  • WaveNet (van den Oord et al., 2016): autoregressive waveform generation using dilated causal convolutions. The receptive field grows exponentially with depth via dilation, reaching 2L12^L - 1 samples with only LL layers.
  • S4 (Gu et al., 2021): represents a sequence model as a linear recurrence ht=Aht1+Bxt\mathbf{h}_t = A\mathbf{h}_{t-1} + B x_t, yt=Chty_t = C\mathbf{h}_t. During training, the recurrence is unrolled into a convolution y=Kˉxy = \bar{\mathbf{K}} * x where Kˉ=[CBˉ,CAˉBˉ,CAˉ2Bˉ,]\bar{\mathbf{K}} = [C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \ldots] is the SSM kernel, computed in O(NlogN)O(N \log N) via FFT.
  • Hyena (Poli et al., 2023): replaces attention with dd learned long convolutions h1,,hdh_1, \ldots, h_d, each parameterized by a neural network. Complexity is O(NlogN)O(N \log N) per layer vs. O(N2)O(N^2) for attention.
  • FNet (Lee-Thorp et al., 2022): replaces multi-head attention with a single 2D FFT mixing operation, achieving 92% of BERT's GLUE score with up to a 7× speedup.

1.4 Historical Timeline

| Year | Event |
|---|---|
| 1821 | Cauchy identifies convolution in complex function theory |
| 1822 | Fourier's Théorie analytique de la chaleur - FT diagonalizes the heat equation (an LTI PDE) |
| 1887 | Heaviside introduces operational calculus; transfer functions for electrical circuits |
| 1920s | Norbert Wiener develops autocorrelation and power spectral density theory |
| 1942 | Wiener invents the optimal Wiener deconvolution filter |
| 1965 | Cooley & Tukey - FFT makes the convolution theorem computationally practical |
| 1970s | Digital signal processing revolution: digital filters, audio processing |
| 1989 | LeCun et al. - LeNet: CNNs for handwritten digit recognition |
| 1998 | LeNet-5 - modern CNN architecture with learned convolutional filters |
| 2012 | AlexNet - deep CNNs dominate ImageNet; GPU-accelerated conv layers |
| 2016 | WaveNet (DeepMind) - dilated causal convolutions for audio generation |
| 2021 | S4 (Stanford) - SSMs as implicit FFT convolutions for long sequences |
| 2022 | FNet (Google) - FFT replaces attention at competitive performance |
| 2023 | Hyena / Mamba - parameterized and selective convolutions for sub-quadratic LLMs |

2. Formal Definitions

2.1 Linear (Aperiodic) Convolution

Definition (Continuous Linear Convolution). For f,gL1(R)f, g \in L^1(\mathbb{R}), the convolution of ff and gg is:

(fg)(t)=f(τ)g(tτ)dτ(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

The integral is well-defined a.e. and fgL1(R)f * g \in L^1(\mathbb{R}) by Young's inequality.

Definition (Discrete Linear Convolution). For sequences x1(Z)x \in \ell^1(\mathbb{Z}) and h1(Z)h \in \ell^1(\mathbb{Z}):

(xh)[n]=k=x[k]h[nk](x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k]

For finite sequences of lengths MM and NN, the output has length M+N1M + N - 1.

Algebraic properties:

  • Commutativity: fg=gff * g = g * f (proof: substitute u=tτu = t - \tau)
  • Associativity: (fg)h=f(gh)(f * g) * h = f * (g * h) (follows from Fubini's theorem)
  • Distributivity: f(g+h)=fg+fhf * (g + h) = f * g + f * h
  • Identity: fδ=ff * \delta = f where δ\delta is the Dirac delta (continuous) or Kronecker delta (discrete)
  • Scaling: (af)g=a(fg)(af) * g = a(f * g) for aCa \in \mathbb{C}

Support rule: If ff is supported on [a1,a2][a_1, a_2] and gg on [b1,b2][b_1, b_2], then fgf * g is supported on [a1+b1,a2+b2][a_1 + b_1, a_2 + b_2]. For finite sequences of lengths MM and NN, the output has exactly M+N1M + N - 1 non-zero samples.

Standard examples:

Example 1 (Gaussian * Gaussian): If f=N(0,σ12)f = \mathcal{N}(0, \sigma_1^2) and g=N(0,σ22)g = \mathcal{N}(0, \sigma_2^2) (as PDFs), then:

(fg)(t)=N(0,σ12+σ22)(t)(f * g)(t) = \mathcal{N}(0, \sigma_1^2 + \sigma_2^2)(t)

Convolution of Gaussians is Gaussian with variance sum. This is the sum-of-independent-normals theorem.
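The variance-sum rule can be spot-checked on a grid, a sketch assuming NumPy: convolving sampled N(0, 1) and N(0, 1.5²) densities (Riemann-sum approximation) yields a density whose variance is 1 + 2.25 = 3.25.

```python
import numpy as np

dt = 0.01
t = np.arange(-10, 10, dt)  # wide enough that truncated tails are negligible

def gauss(t, s):
    return np.exp(-t**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

h = np.convolve(gauss(t, 1.0), gauss(t, 1.5)) * dt  # Riemann-sum convolution
tt = np.arange(len(h)) * dt + 2 * t[0]              # support grid of the output

var = np.sum(tt**2 * h) * dt - (np.sum(tt * h) * dt) ** 2
print(var)  # ≈ 3.25 = 1.0² + 1.5²
```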

Example 2 (rect * rect = triangle): The convolution of two rectangular pulses Π[0,1]\Pi_{[0,1]} is the triangle function Λ\Lambda supported on [0,2][0, 2]. Each rect * rect gives a piecewise-linear tent.

Example 3 (polynomial multiplication): (1+2x)(3+4x)=3+10x+8x2(1 + 2x)(3 + 4x) = 3 + 10x + 8x^2. As convolution: [1,2][3,4]=[3,10,8][1, 2] * [3, 4] = [3, 10, 8].

Non-examples:

  • fgf * g is NOT the pointwise product fgf \cdot g - these are entirely different operations.
  • The convolution of two non-integrable functions (e.g., constants) is not generally well-defined in L1L^1.

2.2 Circular (Periodic) Convolution

Definition (Circular Convolution). For x,yCNx, y \in \mathbb{C}^N, the NN-point circular (periodic) convolution is:

(xy)[n]=m=0N1x[m]y[(nm)modN],n=0,1,,N1(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\, y[(n - m) \bmod N], \quad n = 0, 1, \ldots, N-1

Circular convolution wraps the index modulo NN: sample y[1]y[-1] is the same as y[N1]y[N-1].

The key difference from linear convolution: circular convolution assumes both signals are periodic with period NN. When the input is not actually periodic, the wrap-around creates time-aliasing - contributions from the "end" of the sequence wrap around and contaminate the "beginning" of the output. This produces incorrect results whenever yy is not truly NN-periodic.

Matrix form: Circular convolution is exactly a matrix-vector product with a circulant matrix:

y=Chx,(Ch)ij=h[(ij)modN]y = C_h \mathbf{x}, \qquad (C_h)_{ij} = h[(i-j) \bmod N]

Every circulant matrix is diagonalized by the DFT matrix FNF_N:

Ch=FN1diag(H)FN,H=FNhC_h = F_N^{-1} \operatorname{diag}(H) F_N, \quad H = F_N \mathbf{h}

This is precisely the Convolution Theorem in matrix form.
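The factorization can be verified directly, assuming NumPy: build the circulant matrix for a random h, then reconstruct it as F⁻¹ diag(H) F.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
h = rng.standard_normal(N)

# circulant matrix (C_h)_{ij} = h[(i - j) mod N]
C = np.array([[h[(i - j) % N] for j in range(N)] for i in range(N)])

# DFT matrix: F @ x computes the DFT of x
F = np.fft.fft(np.eye(N))
H = np.fft.fft(h)
C_reconstructed = np.linalg.inv(F) @ np.diag(H) @ F

assert np.allclose(C, C_reconstructed)
```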

When circular = linear: If xx has length MM and hh has length LL, zero-pad both to length NM+L1N \geq M + L - 1. Then NN-point circular convolution equals linear convolution exactly.

2.3 Cross-Correlation and Autocorrelation

Definition (Cross-Correlation). The cross-correlation of ff and gg is:

(fg)(t)=f(τ)g(t+τ)dτ=(f()g)(t)(f \star g)(t) = \int_{-\infty}^{\infty} f^*(\tau)\, g(t + \tau)\, d\tau = (f^*(- \cdot) * g)(t)

Note: cross-correlation is convolution with the time-reversed conjugate of the first argument. It is not commutative: (fg)(t)=(gf)(t)(f \star g)(t) = (g \star f)^*(-t).

Cross-correlation measures the similarity between ff and a shifted copy of gg. The peak of (fg)(t)|(f \star g)(t)| occurs at the lag tt where gg most resembles ff.

Definition (Autocorrelation). The autocorrelation of ff is:

Rff(τ)=(ff)(τ)=f(t)f(t+τ)dtR_{ff}(\tau) = (f \star f)(\tau) = \int_{-\infty}^{\infty} f^*(t)\, f(t + \tau)\, dt

Properties:

  • Rff(0)=f22R_{ff}(0) = \lVert f \rVert_2^2 (maximum at zero lag)
  • Rff(τ)=Rff(τ)R_{ff}(\tau) = R_{ff}^*(-\tau) (Hermitian symmetry)
  • Rff(τ)Rff(0)|R_{ff}(\tau)| \leq R_{ff}(0) for all τ\tau

For AI: Cross-correlation is the operation actually computed in standard "convolution" layers in PyTorch and TensorFlow. The torch.nn.Conv2d layer computes (Wx)(W \star x) (cross-correlation), not (Wx)(W * x) (true convolution). Since the filter weights are learned, this distinction does not affect expressive power - the network simply learns the time-reversed filter.
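The flip relationship between the two operations can be checked numerically, a sketch with NumPy (for complex inputs, np.correlate conjugates its second argument):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([0.5, -1.0, 2.0])

# cross-correlation = convolution with the time-reversed filter
xcorr = np.correlate(a, v, mode="full")
conv_flipped = np.convolve(a, v[::-1], mode="full")

assert np.allclose(xcorr, conv_flipped)
```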

2.4 The Convolution Algebra

The space L1(R)L^1(\mathbb{R}) equipped with convolution is a commutative Banach algebra:

  • Vector space: (L1(R),+,)(L^1(\mathbb{R}), +, \cdot)
  • Algebra: equipped with multiplication (f,g)fg(f, g) \mapsto f * g
  • Banach: complete normed space with fg1f1g1\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1

The discrete version: 1(Z)\ell^1(\mathbb{Z}) with discrete convolution is a Banach algebra under the same submultiplicativity bound.

The cyclic group algebra C[Z/NZ]CN\mathbb{C}[\mathbb{Z}/N\mathbb{Z}] \cong \mathbb{C}^N with circular convolution is a commutative semisimple algebra over C\mathbb{C}. By Maschke's theorem (since char(C)N\operatorname{char}(\mathbb{C}) \nmid N), it decomposes as:

C[Z/NZ]k=0N1Cek\mathbb{C}[\mathbb{Z}/N\mathbb{Z}] \cong \bigoplus_{k=0}^{N-1} \mathbb{C} \cdot e_k

where eke_k are the primitive idempotents - the DFT basis vectors ωNkn\omega_N^{kn}. The DFT is exactly the isomorphism to the direct sum. Convolution in the group algebra becomes pointwise multiplication in the direct sum decomposition - this is the algebraic content of the Convolution Theorem.
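The idempotent property is easy to exhibit numerically: e_k[n] = ω_N^{kn}/N satisfies e_k ⊛ e_k = e_k. A sketch using the FFT route for circular convolution:

```python
import numpy as np

N, k = 8, 3
n = np.arange(N)
e_k = np.exp(2j * np.pi * k * n / N) / N  # primitive idempotent

# circular convolution via the FFT route
circ = np.fft.ifft(np.fft.fft(e_k) * np.fft.fft(e_k))

assert np.allclose(circ, e_k)  # e_k ⊛ e_k = e_k
```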

3. The Convolution Theorem

3.1 Continuous Version

Theorem (Convolution Theorem, Continuous). Let f,gL1(R)f, g \in L^1(\mathbb{R}). Then fgL1(R)f * g \in L^1(\mathbb{R}) and:

F(fg)(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f * g)(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)

where f^(ξ)=f(t)e2πiξtdt\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt (using the ξ\xi-convention consistent with Section 20-02).

Proof. Compute directly using Fubini's theorem (justified by fg1f1g1<\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 < \infty):

F(fg)(ξ)=[f(τ)g(tτ)dτ]e2πiξtdt\mathcal{F}(f * g)(\xi) = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau \right] e^{-2\pi i \xi t}\, dt

Swap integration order and substitute u=tτu = t - \tau:

=f(τ)[g(u)e2πiξ(u+τ)du]dτ=f(τ)e2πiξτdτg^(ξ)=f^(ξ)g^(ξ)= \int_{-\infty}^{\infty} f(\tau) \left[\int_{-\infty}^{\infty} g(u)\, e^{-2\pi i \xi (u+\tau)}\, du \right] d\tau = \int_{-\infty}^{\infty} f(\tau)\, e^{-2\pi i \xi \tau}\, d\tau \cdot \hat{g}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi) \qquad \square

Extension to L2L^2: For fL1L2f \in L^1 \cap L^2 and gL2g \in L^2, the theorem still holds in L2L^2 by a density argument, with fg2f1g2\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 (a special case of Young's inequality).

Corollary (Shift Theorem). If ga(t)=g(ta)g_a(t) = g(t - a) is a shifted version of gg:

g^a(ξ)=e2πiaξg^(ξ)\hat{g}_a(\xi) = e^{-2\pi i a \xi} \hat{g}(\xi)

This follows from the Convolution Theorem applied to gδag * \delta_a where δ^a(ξ)=e2πiaξ\hat{\delta}_a(\xi) = e^{-2\pi i a \xi}.

For AI: Dilated convolutions in WaveNet compute (hxdilated)[n](h * x_{\text{dilated}})[n] where xdilated[n]=x[n/d]x_{\text{dilated}}[n] = x[n/d] for dilation factor dd. In frequency: F(xdilated)(ξ)=dx^(dξ)\mathcal{F}(x_{\text{dilated}})(\xi) = |d| \cdot \hat{x}(d\xi) - dilation in time becomes compression in frequency. Each dilation level captures different frequency bands.

3.2 Discrete Version

Theorem (Convolution Theorem, Discrete). For x,yCNx, y \in \mathbb{C}^N with DFT X=F(x)X = \mathcal{F}(x) and Y=F(y)Y = \mathcal{F}(y):

F(xy)[k]=X[k]Y[k],k=0,1,,N1\mathcal{F}(x \circledast y)[k] = X[k] \cdot Y[k], \quad k = 0, 1, \ldots, N-1

Equivalently: xy=F1(XY)x \circledast y = \mathcal{F}^{-1}(X \cdot Y).

Proof. Compute the DFT of the circular convolution directly:

F(xy)[k]=n=0N1[m=0N1x[m]y[(nm)modN]]ωNnk\mathcal{F}(x \circledast y)[k] = \sum_{n=0}^{N-1} \left[\sum_{m=0}^{N-1} x[m]\, y[(n-m) \bmod N]\right] \omega_N^{-nk}

where ωN=e2πi/N\omega_N = e^{2\pi i / N}. Swap sums and substitute j=(nm)modNj = (n-m) \bmod N:

=m=0N1x[m]ωNmkj=0N1y[j]ωNjk=X[k]Y[k]= \sum_{m=0}^{N-1} x[m] \omega_N^{-mk} \sum_{j=0}^{N-1} y[j]\, \omega_N^{-jk} = X[k] \cdot Y[k] \qquad \square

Normalization bookkeeping. NumPy's np.fft.ifft divides by NN, so the full pipeline is:

Y_k = X_k * H_k          (pointwise, no 1/N)
y = np.fft.ifft(Y_k)     (divides by N automatically)

Avoid double-dividing by NN when mixing manual DFT with NumPy functions.

Algorithm: FFT-based convolution of xx (length MM) with hh (length LL):

  1. Choose NM+L1N \geq M + L - 1, next power of 2: N=2log2(M+L1)N = 2^{\lceil \log_2(M+L-1) \rceil}
  2. Zero-pad: x~[n]=x[n]\tilde{x}[n] = x[n] for n<Mn < M, else 00; similarly h~\tilde{h}
  3. X=FFT(x~)X = \text{FFT}(\tilde{x}), H=FFT(h~)H = \text{FFT}(\tilde{h})
  4. Y[k]=X[k]H[k]Y[k] = X[k] \cdot H[k] for all kk
  5. y=IFFT(Y)y = \text{IFFT}(Y), keep first M+L1M + L - 1 samples

Total cost: 3O(NlogN)3 \cdot O(N \log N) for three FFTs + O(N)O(N) for pointwise multiply.
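The five steps above, as a minimal NumPy sketch (the function name fft_convolve is ours), checked against np.convolve:

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution of real sequences via zero-padded FFTs."""
    M, L = len(x), len(h)
    N = 2 ** int(np.ceil(np.log2(M + L - 1)))  # step 1: FFT size
    X = np.fft.rfft(x, n=N)                    # steps 2-3: zero-pad + FFT
    H = np.fft.rfft(h, n=N)
    Y = X * H                                  # step 4: pointwise product
    return np.fft.irfft(Y, n=N)[:M + L - 1]    # step 5: IFFT, keep M+L-1 samples

rng = np.random.default_rng(2)
x, h = rng.standard_normal(100), rng.standard_normal(7)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```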

3.3 The Multiplication Theorem (Dual)

Theorem (Multiplication / Dual Convolution Theorem). For f,gL1(R)L2(R)f, g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}):

F(fg)(ξ)=(f^g^)(ξ)=f^(η)g^(ξη)dη\mathcal{F}(f \cdot g)(\xi) = (\hat{f} * \hat{g})(\xi) = \int_{-\infty}^{\infty} \hat{f}(\eta)\, \hat{g}(\xi - \eta)\, d\eta

Pointwise multiplication in time = convolution in frequency. This is exact duality with the Convolution Theorem.

Consequences:

  • Windowing: Multiplying a signal by a window w(t)w(t) (e.g., Hann window) convolves its spectrum with w^(ξ)\hat{w}(\xi). This is the spectral leakage mechanism studied in Section 20-03: the wider the window in time, the narrower w^\hat{w} in frequency, and the less leakage.
  • Amplitude modulation: x(t)cos(2πf0t)=x(t)12(e2πif0t+e2πif0t)x(t) \cos(2\pi f_0 t) = x(t) \cdot \frac{1}{2}(e^{2\pi i f_0 t} + e^{-2\pi i f_0 t}) shifts the spectrum to ±f0\pm f_0. In the frequency domain: X^(ξf0)/2+X^(ξ+f0)/2\hat{X}(\xi - f_0)/2 + \hat{X}(\xi + f_0)/2.
  • Parseval's theorem: setting g = \bar{f} (the complex conjugate) and \xi = 0 recovers \int |f(t)|^2 dt = \int |\hat{f}(\xi)|^2 d\xi.

3.4 Parseval and Convolution

Parseval's theorem for convolution. Combining Plancherel and the Convolution Theorem:

fg22=f^(ξ)g^(ξ)2dξ=f^g^22\lVert f * g \rVert_2^2 = \int |\hat{f}(\xi) \hat{g}(\xi)|^2 d\xi = \lVert \hat{f} \cdot \hat{g} \rVert_2^2

Young's convolution inequality. For 1r=1p+1q1\frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1 with 1p,q,r1 \leq p, q, r \leq \infty:

fgrfpgq\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q

Special cases:

  • p=1,q=r=2p = 1, q = r = 2: fg2f1g2\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 (used in filter stability analysis)
  • p=q=1,r=1p = q = 1, r = 1: fg1f1g1\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 (Banach algebra property)
  • p=1,q=,r=p = 1, q = \infty, r = \infty: fgf1g\lVert f * g \rVert_\infty \leq \lVert f \rVert_1 \lVert g \rVert_\infty

Energy interpretation: The energy of the filtered output is bounded by the filter's L1L^1 norm times the input's L2L^2 energy. A filter with h11\lVert h \rVert_1 \leq 1 is energy non-increasing - it cannot amplify energy.
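The discrete analogues of these bounds are easy to spot-check, assuming NumPy (here the norms are the usual vector p-norms):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.standard_normal(50)
g = rng.standard_normal(80)
y = np.convolve(f, g)

# p = 1, q = r = 2:  ||f * g||_2 <= ||f||_1 ||g||_2
assert np.linalg.norm(y, 2) <= np.linalg.norm(f, 1) * np.linalg.norm(g, 2)

# p = q = r = 1:  ||f * g||_1 <= ||f||_1 ||g||_1  (Banach algebra property)
assert np.linalg.norm(y, 1) <= np.linalg.norm(f, 1) * np.linalg.norm(g, 1)
```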

4. Circular vs. Linear Convolution

4.1 The Wrap-Around Problem

Circular convolution assumes periodicity. When you apply the DFT-multiply-IDFT recipe to two finite sequences without zero-padding, you compute circular convolution - and the result is not the same as linear convolution.

The wrap-around artifact. Suppose x = [1, 2, 3] and h = [1, 1] (length 2, zero-padded to N = 3). Linear convolution gives y_lin = [1, 3, 5, 3] (length 4). But the 3-point circular convolution gives:

(xh)[0]=x[0]h[0]+x[2]h[1]=1+3=4(x \circledast h)[0] = x[0]h[0] + x[2]h[1] = 1 + 3 = 4

because the term x[1mod3]=x[2]x[-1 \bmod 3] = x[2] wraps around. The correct linear result would have ylin[0]=1y_{\text{lin}}[0] = 1.

Time-aliasing analogy. Circular convolution on a length-NN DFT is the time-domain analogue of frequency-domain aliasing. Just as undersampling causes spectral copies to overlap (aliasing in frequency), using an insufficient DFT size causes time-domain copies to overlap (aliasing in time).

CIRCULAR vs LINEAR CONVOLUTION
========================================================================

  x = [1, 2, 3],  h = [1, 1]

  LINEAR (correct):                CIRCULAR (N=3, wrap-around error):
  y[0] = 1·1            = 1        y[0] = 1·1 + 3·1 = 4  <- WRONG
  y[1] = 2·1 + 1·1      = 3        y[1] = 2·1 + 1·1 = 3  OK
  y[2] = 3·1 + 2·1      = 5        y[2] = 3·1 + 2·1 = 5  OK
  y[3] =       3·1      = 3        (missing term - aliased into y[0])
  Output length: 4                 Output length: 3

========================================================================
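The same example, reproduced with NumPy: the unpadded FFT route returns the circular (aliased) result, while padding to length M + L - 1 = 4 recovers linear convolution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
h = np.array([1.0, 1.0])

# no padding: 3-point circular convolution (h implicitly zero-padded to 3)
circ = np.fft.ifft(np.fft.fft(x, 3) * np.fft.fft(h, 3)).real
print(circ)  # [4. 3. 5.]  <- y[0] corrupted by wrap-around

# padding to N >= M + L - 1 = 4: exact linear convolution
lin = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real
print(lin)   # [1. 3. 5. 3.]
assert np.allclose(lin, np.convolve(x, h))
```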

4.2 The Zero-Padding Solution

Theorem. Let xCMx \in \mathbb{C}^M and hCLh \in \mathbb{C}^L. If we zero-pad both to length NM+L1N \geq M + L - 1 before computing the DFT, then the first M+L1M + L - 1 samples of the NN-point IDFT of XHX \cdot H equal the linear convolution xlinhx *_{\text{lin}} h.

Why it works. The linear convolution output has M+L1M + L - 1 non-zero samples. Padding to NM+L1N \geq M + L - 1 ensures the circular buffer is large enough that the "wrap-around" tail falls on zero-padded positions - so no aliasing occurs.

Practical recipe:

import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))  # next power of 2
X = np.fft.rfft(x, n=N)
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]  # trim to correct length

Using rfft (real FFT) halves the computation for real-valued inputs.

Power-of-2 choice. Padding to the next power of 2 ensures the Cooley-Tukey FFT uses radix-2 butterflies, maximizing efficiency. For some lengths, mixed-radix FFTs (e.g., N = 3 × 5 × 7) can be as fast or faster; scipy.fft.next_fast_len returns the smallest efficient FFT size at least as large as a given length, and scipy.signal.fftconvolve uses it automatically.

4.3 Cost Analysis

For convolving a signal of length MM with a filter of length LL:

| Method | Flops | When optimal |
|---|---|---|
| Direct linear convolution | O(ML) | L \ll M (short filter) |
| FFT-based (N = M + L - 1) | O(N \log N) | L comparable to M |
| Overlap-add (blocks of size B) | O(M \log L) | M \gg L (long signal, fixed filter) |

Crossover point. FFT convolution beats direct when MLNlog2NML \gtrsim N \log_2 N, approximately:

LNlog2NMlog2N20 (for N=106)L \gtrsim \frac{N \log_2 N}{M} \approx \log_2 N \approx 20 \text{ (for } N = 10^6 \text{)}

For filters longer than 20\sim 20 taps, FFT-based convolution is almost always faster in practice. For very short filters (e.g., 3×33 \times 3 CNN kernels), direct convolution with SIMD instructions often wins due to lower overhead.

5. LTI Systems and Filter Theory

5.1 Linear Time-Invariant Systems

A system H:xy\mathcal{H}: x \mapsto y is linear if it satisfies superposition:

H(ax+bz)=aH(x)+bH(z)a,bC\mathcal{H}(ax + bz) = a\mathcal{H}(x) + b\mathcal{H}(z) \quad \forall a, b \in \mathbb{C}

It is time-invariant if a time shift in input produces the same time shift in output:

if y=H(x), then H(x(τ))=y(τ)τ\text{if } y = \mathcal{H}(x), \text{ then } \mathcal{H}(x(\cdot - \tau)) = y(\cdot - \tau) \quad \forall \tau

Characterization theorem: Every LTI system is completely described by its impulse response h(t)=H(δ)(t)h(t) = \mathcal{H}(\delta)(t). The output for any input xx is:

y(t)=(hx)(t)=h(τ)x(tτ)dτy(t) = (h * x)(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t-\tau)\, d\tau

Causality: h(t)=0h(t) = 0 for t<0t < 0. A causal system cannot depend on future inputs - essential for real-time processing. The convolution simplifies to y(t)=0h(τ)x(tτ)dτy(t) = \int_0^{\infty} h(\tau) x(t - \tau) d\tau.

Stability (BIBO): An LTI system is bounded-input bounded-output (BIBO) stable if and only if:

h1=h(t)dt<\lVert h \rVert_1 = \int_{-\infty}^{\infty} |h(t)|\, dt < \infty

This is the L1L^1 condition - exactly the domain in which the Convolution Theorem holds most cleanly.

For AI: Recurrent neural networks can be analyzed through the LTI lens. A linear RNN ht=Aht1+bxt\mathbf{h}_t = A\mathbf{h}_{t-1} + \mathbf{b} x_t has impulse response h[n]=CAnBh[n] = CA^n B for n0n \geq 0. BIBO stability requires ρ(A)<1\rho(A) < 1 (spectral radius). S4 enforces stability by constraining AA to have negative-real eigenvalues via the HiPPO initialization.
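The unrolling claimed here can be verified directly: a linear recurrence with kernel h[n] = C Aⁿ B produces exactly the convolution of the input with that kernel. A sketch with small random matrices (names A, B, C as in the text):

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 2, 50
A = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)
B = rng.standard_normal(d)
C = rng.standard_normal(d)
x = rng.standard_normal(T)

# recurrent mode: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros(d)
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = C @ h

# convolutional mode: y = K * x with kernel K[n] = C A^n B
K = np.array([C @ np.linalg.matrix_power(A, n) @ B for n in range(T)])
y_conv = np.convolve(x, K)[:T]

assert np.allclose(y_rec, y_conv)
```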

5.2 Frequency Response and Transfer Function

Definition. The frequency response of an LTI system with impulse response hh is:

H(f)=h^(f)=h(t)e2πiftdtH(f) = \hat{h}(f) = \int_{-\infty}^{\infty} h(t)\, e^{-2\pi i f t}\, dt

This is simply the Fourier transform of the impulse response. The Convolution Theorem then gives:

Y(f)=H(f)X(f)Y(f) = H(f) \cdot X(f)

The system applies a complex gain H(f)H(f) to each frequency component:

  • H(f)|H(f)| - the magnitude response (gain/attenuation at frequency ff)
  • H(f)\angle H(f) - the phase response (phase shift at frequency ff)

Group delay. The group delay measures how much each frequency band is delayed:

τg(f)=ddfH(f)\tau_g(f) = -\frac{d}{df} \angle H(f)

A system with constant group delay τg(f)=τ0\tau_g(f) = \tau_0 is said to have linear phase: H(f)=2πfτ0\angle H(f) = -2\pi f \tau_0. This means all frequencies are delayed by the same amount τ0\tau_0 - no dispersion, no distortion of waveform shape.
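Linear phase is easy to confirm for a symmetric FIR filter: with L = 5 symmetric taps, the group delay is (L-1)/2 = 2 samples at every frequency. A sketch assuming scipy.signal (taps chosen so the frequency response has no zeros, where group delay would be undefined):

```python
import numpy as np
from scipy.signal import group_delay

h = np.array([1.0, 3.0, 5.0, 3.0, 1.0])  # symmetric: h[n] = h[L-1-n]
w, gd = group_delay((h, [1.0]))          # FIR: denominator = 1

# constant group delay of (L-1)/2 = 2 samples at all frequencies
assert np.allclose(gd, 2.0)
```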

Frequency response examples:

Filter typeH(f)H(f)Impulse response h(t)h(t)
Ideal lowpass, cutoff fcf_c$\mathbf{1}_{f
Ideal highpass$\mathbf{1}_{f
Gaussian smoothereπ2f2/(2σ2)e^{-\pi^2 f^2 / (2\sigma^2)}σeσ2π2t2/2\sigma e^{-\sigma^2 \pi^2 t^2 / 2} (Gaussian)
Differentiator2πif2\pi i fδ(t)\delta'(t)
Hilbert transformerisgn(f)-i \operatorname{sgn}(f)1πt\frac{1}{\pi t} (Cauchy principal value)

For AI: The Hilbert transformer shifts each positive-frequency component by -90° (multiplication by -i), producing the analytic signal z(t) = x(t) + i\mathcal{H}[x](t). This is used in envelope detection for audio and is related to the complex-valued activations in modern SSM architectures.

5.3 FIR vs. IIR Filters

FIR (Finite Impulse Response): h[n]h[n] has only finitely many non-zero terms (n=0,1,,L1n = 0, 1, \ldots, L-1). Output is a finite weighted sum: y[n]=k=0L1h[k]x[nk]y[n] = \sum_{k=0}^{L-1} h[k] x[n-k].

Advantages:

  • Always BIBO stable (finite sum, bounded output)
  • Can achieve exact linear phase: symmetric FIR (h[n]=h[L1n]h[n] = h[L-1-n]) has linear phase
  • Easy to design: window method, least-squares, Parks-McClellan algorithm

Disadvantages:

  • Long filter needed for sharp frequency cutoffs (e.g., L300L \approx 300 taps for sharp lowpass)
  • High computational cost for long filters (mitigated by FFT convolution)

IIR (Infinite Impulse Response): h[n]h[n] is non-zero for all n0n \geq 0. Described by a rational transfer function:

H(z)=B(z)A(z)=k=0Mbkzk1+k=1NakzkH(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}}

Implemented as a recursive difference equation: y[n]=kbkx[nk]kaky[nk]y[n] = \sum_k b_k x[n-k] - \sum_k a_k y[n-k].

Advantages:

  • Very compact: sharp frequency responses achievable with few coefficients (e.g., 5th-order Butterworth)
  • Classic analog designs: Butterworth (maximally flat), Chebyshev (equiripple), elliptic (optimal)

Disadvantages:

  • Stability not guaranteed: poles of H(z)H(z) must lie inside the unit circle
  • Non-linear phase: group delay is frequency-dependent, causing waveform dispersion
  • Cannot be parallelized easily: recursive computation creates data dependencies

For AI: CNN layers are FIR filters (finite, feedforward). RNNs and LSTMs implement IIR-like behavior (infinite memory via recurrence). S4/Mamba use IIR-structured kernels but compute them in FIR/parallel mode via FFT convolution during training.

5.4 Ideal Filter Prototypes

The four ideal filter types, defined by their frequency response:

| Type | Magnitude response | Application |
|---|---|---|
| Lowpass (LP) | 1 for \|f\| \leq f_c, 0 otherwise | Smoothing, anti-aliasing |
| Highpass (HP) | 0 for \|f\| \leq f_c, 1 otherwise | Edge detection, noise removal |
| Bandpass (BP) | 1 for f_1 \leq \|f\| \leq f_2, 0 otherwise | Channel selection, mel filters |
| Bandstop (BS) | 0 for f_1 \leq \|f\| \leq f_2, 1 otherwise | Notch filter (50/60 Hz hum removal) |

All ideal filters have impulse responses that are sinc\operatorname{sinc}-like - non-causal (extend to -\infty) and non-realizable. Real-world filter design approximates these ideal brick-wall responses.

Gibbs phenomenon. Truncating the infinite sinc impulse response creates the Gibbs overshoot: the truncated filter's frequency response overshoots the ideal by 9%\approx 9\% at the cutoff regardless of truncation length. Windowing the sinc (multiplying by Hann or Kaiser window) reduces the Gibbs ripple at the cost of a wider transition band - the same leakage-resolution trade-off from Section 20-03.
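A minimal numerical sketch of the window method and the Gibbs overshoot (the cutoff f_c = 0.2 and L = 101 taps are illustrative values):

```python
import numpy as np

fc, L = 0.2, 101                              # illustrative cutoff and tap count
n = np.arange(L) - (L - 1) // 2
h_trunc = 2 * fc * np.sinc(2 * fc * n)        # truncated ideal lowpass (rect window)
h_hann = h_trunc * np.hanning(L)              # Hann-windowed version

# Evaluate |H(f)| on a dense grid via zero-padded FFT
H_rect = np.abs(np.fft.rfft(h_trunc, 8192))
H_hann = np.abs(np.fft.rfft(h_hann, 8192))

assert H_rect.max() > 1.05    # Gibbs overshoot (~9%) survives truncation
assert H_hann.max() < 1.01    # Hann window flattens the ripple (wider transition)
```

The rectangular truncation keeps its roughly 9% overshoot no matter how large L is; windowing trades that ripple for a wider transition band.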

6. Efficient Long Convolution

6.1 Overlap-Add Algorithm

The overlap-add (OA) algorithm efficiently convolves a long signal xx of length MM with a short filter hh of length LL by processing xx in blocks.

Algorithm:

  1. Choose block size BB (typically B=4LB = 4L or B=8LB = 8L for efficiency)
  2. Pad hh to length N=B+L1N = B + L - 1, compute H=FFT(h,N)H = \text{FFT}(h, N)
  3. For each block m = 0, 1, 2, …:
     a. Extract x_m = x[mB : (m+1)B] (zero-padded if needed)
     b. Compute X_m = FFT(x_m, N)
     c. Compute Y_m = IFFT(X_m · H, N) - length N = B + L - 1
     d. Add Y_m to the output buffer with overlap: output[m*B : m*B + N] += Y_m

The last L1L - 1 samples of each block output overlap with the first L1L - 1 samples of the next - these overlapping tails are summed (the "add" in overlap-add).

Why it works. Each block-convolution YmY_m contains the correct contribution of block xmx_m to the full output. The convolution theorem guarantees linearity: summing all contributions gives the correct total output.

Total cost: M/B\lceil M/B \rceil FFTs of size N=B+L1N = B + L - 1, each costing O(NlogN)O(N \log N):

Total=O ⁣(MB(B+L)log(B+L))O(MlogL)when BL\text{Total} = O\!\left(\frac{M}{B} \cdot (B + L) \log(B + L)\right) \approx O(M \log L) \quad \text{when } B \approx L

Optimal block size minimizes (B+L)log(B+L)B\frac{(B+L)\log(B+L)}{B}, giving B2LB \approx 2L to 8L8L.
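The algorithm above can be sketched in a few lines and checked against np.convolve (the block size and signal lengths are arbitrary test values):

```python
import numpy as np

def overlap_add(x, h, B):
    """Linear convolution of x and h, processing x in blocks of length B."""
    L = len(h)
    N = B + L - 1                      # FFT size holds one block's full output
    H = np.fft.rfft(h, N)
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), B):
        yb = np.fft.irfft(np.fft.rfft(x[start:start + B], N) * H, N)
        end = min(start + N, len(y))
        y[start:end] += yb[:end - start]   # sum the overlapping L-1 tails
    return y

rng = np.random.default_rng(1)
x, h = rng.standard_normal(1000), rng.standard_normal(64)
assert np.allclose(overlap_add(x, h, B=128), np.convolve(x, h))
```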

6.2 Overlap-Save Algorithm

Overlap-save (OS) is a computationally equivalent alternative that avoids the explicit addition step by saving (reusing) a portion of each input block.

Algorithm:

  1. Initialize a buffer of size N=B+L1N = B + L - 1 filled with zeros
  2. For each new B input samples:
     a. Shift buffer left by B: keep the last L-1 samples from the previous block
     b. Fill buffer positions [L-1, N-1] with the new B input samples
     c. Compute X_m = FFT(buffer, N)
     d. Y_m = IFFT(X_m · H)
     e. Discard the first L-1 samples of Y_m (circular contamination), save the last B samples

The "save" refers to keeping L1L-1 samples from the previous block in the current buffer. The "discard" step removes the circular wrap-around artifacts at the beginning of each block.

Comparison:

| Property | Overlap-Add | Overlap-Save |
|----------|-------------|--------------|
| Computation per block | Same | Same |
| Memory | 2N (input + output buffer) | N (single circular buffer) |
| Implementation | Slightly simpler | Slightly more complex (shift + discard) |
| Real-time streaming | Good | Slightly better |

6.3 Block Convolution in Practice

Choosing block size BB:

  • Too small (BLB \ll L): FFT overhead dominates; most computation is transform, not convolution
  • Too large (BLB \gg L): good efficiency, but large latency (BB samples before any output)
  • Rule of thumb: B=4LB = 4L to 8L8L balances efficiency and latency

scipy.signal.fftconvolve implementation:

import scipy.signal as signal
y = signal.fftconvolve(x, h, mode='full')    # linear conv, output length M+L-1
y = signal.fftconvolve(x, h, mode='same')    # same length as x (centered)
y = signal.fftconvolve(x, h, mode='valid')   # only fully-overlapping part

scipy.signal.oaconvolve for long signals:

y = signal.oaconvolve(x, h, mode='full')    # uses overlap-add internally

This is typically 10-100× faster than fftconvolve when M ≫ L.

For AI: The overlap-add structure appears in neural audio synthesis. In WaveNet, the causal dilated convolutions can be viewed as a sequence of overlapping block convolutions, where each block is processed causally. This is why WaveNet inference is slow (one sample at a time, sequentially) while training is fast (teacher forcing evaluates all output positions in parallel).

7. Cross-Correlation and Power Spectral Density

7.1 Cross-Correlation Theorem

Theorem (Cross-Correlation Theorem). For f,gL2(R)f, g \in L^2(\mathbb{R}):

F(fg)(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f \star g)(\xi) = \hat{f}^*(\xi) \cdot \hat{g}(\xi)

where (fg)(τ)=f(t)g(t+τ)dt(f \star g)(\tau) = \int f^*(t) g(t + \tau) dt is the cross-correlation.

Proof. Write fg=f()gf \star g = f^*(-\cdot) * g (time-reverse and conjugate the first argument). Apply the Convolution Theorem and the time-reversal property F(f())(ξ)=f^(ξ)\mathcal{F}(f(-\cdot))(\xi) = \hat{f}(-\xi):

F(fg)(ξ)=F(f())(ξ)g^(ξ)=f^((ξ))g^(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f \star g)(\xi) = \mathcal{F}(f^*(-\cdot))(\xi) \cdot \hat{g}(\xi) = \hat{f}^*(-(-\xi)) \cdot \hat{g}(\xi) = \hat{f}^*(\xi) \cdot \hat{g}(\xi) \qquad \square

Discrete version. For sequences x,yCNx, y \in \mathbb{C}^N:

(x \star y)[k] = \sum_{n} x^*[n]\, y[(n+k) \bmod N] \quad \text{(circular)}

\mathcal{F}(x \star y)[k] = X^*[k] \cdot Y[k]

In NumPy: np.fft.ifft(np.conj(X) * Y) computes cross-correlation via FFT.
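As a sketch, the FFT route recovers a known delay between two signals (the signal length, delay, and noise level are made-up test values):

```python
import numpy as np

rng = np.random.default_rng(2)
N, true_delay = 512, 37
s = rng.standard_normal(N)
y = np.roll(s, true_delay) + 0.1 * rng.standard_normal(N)   # delayed noisy copy

# Circular cross-correlation r[k] = sum_n s*[n] y[(n+k) mod N] via the theorem
r = np.fft.ifft(np.conj(np.fft.fft(s)) * np.fft.fft(y)).real

assert int(np.argmax(r)) == true_delay     # the correlation peak locates the delay
```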

Applications:

  • Template matching: (hx)(τ)(h \star x)(\tau) peaks where xx most resembles the template hh - used in pattern recognition, face detection, audio fingerprinting (Shazam)
  • Time-delay estimation: the peak of Rxy(τ)|R_{xy}(\tau)| estimates the time delay between two sensors
  • Matched filter: the optimal linear filter for detecting a known signal in white noise is the cross-correlator (see Section 7.3)

7.2 Autocorrelation and Wiener-Khinchin Theorem

Definition. The autocorrelation of ff is Rff(τ)=(ff)(τ)=f(t)f(t+τ)dtR_{ff}(\tau) = (f \star f)(\tau) = \int f^*(t) f(t+\tau) dt.

Theorem (Wiener-Khinchin). For a wide-sense stationary random process X(t)X(t) with autocorrelation RXX(τ)=E[X(t)X(t+τ)]R_{XX}(\tau) = \mathbb{E}[X^*(t)X(t+\tau)], the power spectral density is its Fourier transform:

SXX(f)=F(RXX)(f)=RXX(τ)e2πifτdτS_{XX}(f) = \mathcal{F}(R_{XX})(f) = \int_{-\infty}^{\infty} R_{XX}(\tau)\, e^{-2\pi i f \tau}\, d\tau

Corollary. The total power equals the integral of the PSD:

E[X(t)2]=RXX(0)=SXX(f)df\mathbb{E}[|X(t)|^2] = R_{XX}(0) = \int_{-\infty}^{\infty} S_{XX}(f)\, df

(By Parseval's theorem applied to the autocorrelation.)

Spectral factorization. Since SXX(f)S_{XX}(f) is real and non-negative, it can be written (under the Paley-Wiener condition logSXX(f)df>\int \log S_{XX}(f)\, df > -\infty) as SXX(f)=L(f)2S_{XX}(f) = |L(f)|^2 for a causal, stable filter LL - the spectral factor. This is used in Wiener filter design.

Practical estimation. The periodogram estimates the PSD from NN samples:

S^XX(fk)=1NX[k]2,X=FFT(x)\hat{S}_{XX}(f_k) = \frac{1}{N} |X[k]|^2, \quad X = \text{FFT}(x)

The periodogram is an inconsistent estimator (variance doesn't decrease with NN). Welch's method averages periodograms over overlapping windowed segments:

S^XXWelch(f)=1Km=0K11NFFT(xmw)2\hat{S}_{XX}^{\text{Welch}}(f) = \frac{1}{K} \sum_{m=0}^{K-1} \frac{1}{N} |\text{FFT}(x_m \cdot w)|^2

For AI: Welch's PSD is used in analyzing neural network weight matrices and gradient noise. The gradient PSD reveals dominant frequency scales of gradient oscillations - key for understanding Adam's moment estimation and learning rate scheduling.
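The variance reduction can be sketched on white noise, whose true PSD is flat (the segment length 256 and the factor-of-5 threshold are illustrative):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)
x = rng.standard_normal(16384)            # white noise: the true PSD is flat

f_p, P = signal.periodogram(x)            # single raw periodogram
f_w, W = signal.welch(x, nperseg=256)     # ~127 averaged Hann segments

# Averaging K segments cuts the estimator's standard deviation by ~sqrt(K)
assert np.std(W) < np.std(P) / 5
# Both estimators agree on the average level (averaging adds no bias here)
assert abs(np.mean(W) - np.mean(P)) < 0.2 * np.mean(P)
```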

7.3 Matched Filtering

Problem: Given an observation y(t)=s(t)+n(t)y(t) = s(t) + n(t) where ss is a known signal and nn is white noise with two-sided power spectral density N0/2N_0/2, find the linear filter hh that maximizes the output SNR at detection time t0t_0.

Theorem (Matched Filter). The optimal filter is:

H_{\text{MF}}(f) = \frac{2\, S^*(f)\, e^{-2\pi i f t_0}}{N_0}

i.e., h(t)=2N0s(t0t)h(t) = \frac{2}{N_0} s^*(t_0 - t) - the time-reversed, conjugated, delayed signal (any overall scaling leaves the SNR unchanged). The maximum output SNR is:

SNRmax=2N0S(f)2df=2s22N0\text{SNR}_{\text{max}} = \frac{2}{N_0} \int |S(f)|^2 df = \frac{2 \lVert s \rVert_2^2}{N_0}

The matched filter output is proportional to the cross-correlation of yy with ss.
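A minimal detection sketch, assuming a known broadband template buried in unit-variance white noise at a made-up offset:

```python
import numpy as np

rng = np.random.default_rng(4)
L, N, t0 = 64, 1024, 300
s = rng.standard_normal(L)             # known broadband template
x = rng.standard_normal(N)             # unit-variance white noise
x[t0:t0 + L] += s                      # bury the signal at offset t0

# Matched filter = time-reversed conjugate template; its output equals the
# cross-correlation y[k] = sum_n s*[n] x[k + n]
h = s[::-1].conj()
y = np.convolve(x, h, mode='valid')

assert int(np.argmax(y)) == t0         # output SNR peaks at the true offset
```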

For AI: The attention mechanism in Transformers can be interpreted as a soft matched filter. The query q\mathbf{q} plays the role of a template signal; the keys kj\mathbf{k}_j are candidate matches; softmax(qkj/dk)\text{softmax}(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}) is a soft selection of the best-matching key, analogous to thresholding the cross-correlation peak.

8. Deconvolution

8.1 The Deconvolution Problem

Problem setup. Given the observed signal y=hx+ny = h * x + n where hh is a known (or estimated) blurring/mixing filter, xx is the unknown clean signal, and nn is additive noise, recover xx.

In the frequency domain: Y(f)=H(f)X(f)+N(f)Y(f) = H(f) X(f) + N(f).

Naive inverse filter. The obvious approach is X^(f)=Y(f)/H(f)\hat{X}(f) = Y(f) / H(f). This recovers X(f)X(f) exactly when n=0n = 0, but in the presence of noise:

X^(f)=X(f)+N(f)H(f)\hat{X}(f) = X(f) + \frac{N(f)}{H(f)}

When H(f)0H(f) \approx 0 at some frequencies (the filter attenuates them strongly), the noise term N(f)/H(f)N(f)/H(f) is amplified catastrophically. The inverse filter is numerically ill-conditioned - a small noise level can produce arbitrarily large artifacts.

Example (image deblurring): A Gaussian blur of spatial width σ\sigma has frequency response H(f)=e2π2σ2f2H(f) = e^{-2\pi^2 \sigma^2 f^2}, which decays to near-zero at high frequencies. Dividing by H(f)H(f) there amplifies high-frequency noise by factors of e2π2σ2f2e^{2\pi^2 \sigma^2 f^2}, producing extreme ringing artifacts.

8.2 Wiener Filter

The Wiener filter is the optimal linear filter minimizing the mean squared error E[x(t)x^(t)2]\mathbb{E}[|x(t) - \hat{x}(t)|^2].

Theorem (Wiener Filter). The optimal deconvolution filter in the frequency domain is:

HW(f)=H(f)Sxx(f)H(f)2Sxx(f)+Snn(f)H_W(f) = \frac{H^*(f) S_{xx}(f)}{|H(f)|^2 S_{xx}(f) + S_{nn}(f)}

where Sxx(f)S_{xx}(f) is the PSD of the signal and Snn(f)S_{nn}(f) is the PSD of the noise.

Derivation sketch. We seek G(f)G(f) such that X^(f)=G(f)Y(f)=G(f)(H(f)X(f)+N(f))\hat{X}(f) = G(f) Y(f) = G(f)(H(f)X(f) + N(f)) minimizes E[X(f)X^(f)2]\mathbb{E}[|X(f) - \hat{X}(f)|^2]. Setting the derivative to zero (orthogonality principle):

G(f)=H(f)Sxx(f)H(f)2Sxx(f)+Snn(f)G(f) = \frac{H^*(f) S_{xx}(f)}{|H(f)|^2 S_{xx}(f) + S_{nn}(f)}

Intuition. The Wiener filter trades off between two regimes:

  • When H(f)2Sxx(f)Snn(f)|H(f)|^2 S_{xx}(f) \gg S_{nn}(f) (high SNR): HW(f)1/H(f)H_W(f) \approx 1/H(f) - nearly perfect inversion
  • When H(f)2Sxx(f)Snn(f)|H(f)|^2 S_{xx}(f) \ll S_{nn}(f) (low SNR): HW(f)0H_W(f) \approx 0 - suppress the noisy frequency, don't attempt to invert

The ratio SNR(f)=Sxx(f)H(f)2/Snn(f)\text{SNR}(f) = S_{xx}(f)|H(f)|^2 / S_{nn}(f) governs the transition between inversion and suppression at each frequency.
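Both regimes can be sketched with a circular Gaussian blur and an oracle Wiener filter (the blur width, noise level, and test signal are illustrative assumptions, and the per-DFT-bin PSD convention is a simplification):

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 1024, 0.05
t = np.arange(N)
x = np.sin(2 * np.pi * 5 * t / N) + 0.5 * np.sin(2 * np.pi * 12 * t / N)

# Circularly centered Gaussian blur kernel (std = 4 samples)
h = np.zeros(N)
idx = np.arange(-32, 32)
h[idx % N] = np.exp(-0.5 * (idx / 4.0) ** 2)
h /= h.sum()
H = np.fft.fft(h)

y = np.fft.ifft(H * np.fft.fft(x)).real + sigma * rng.standard_normal(N)
Y = np.fft.fft(y)

# Naive inverse filter: divides by |H| ~ 0 at high frequencies
x_naive = np.fft.ifft(Y / H).real

# Wiener filter with oracle spectra (noise DFT-bin variance is N * sigma^2)
Sxx = np.abs(np.fft.fft(x)) ** 2
Snn = N * sigma ** 2
G = np.conj(H) * Sxx / (np.abs(H) ** 2 * Sxx + Snn)
x_wiener = np.fft.ifft(G * Y).real

err_naive = np.linalg.norm(x_naive - x)
err_wiener = np.linalg.norm(x_wiener - x)
assert err_wiener < 0.1 * np.linalg.norm(x)   # faithful recovery
assert err_naive > 100 * err_wiener           # catastrophic noise amplification
```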

For AI: The Wiener filter principle appears in noise-robust attention. Sparse attention patterns (Longformer, BigBird) can be viewed as suppressing "noisy" long-range attention weights when the query-key similarity is low - a form of adaptive filtering where the "SNR" is the attention score.

8.3 Regularized Deconvolution

When SxxS_{xx} and SnnS_{nn} are unknown, we use Tikhonov regularization:

Hreg(f)=H(f)H(f)2+λH_{\text{reg}}(f) = \frac{H^*(f)}{|H(f)|^2 + \lambda}

The regularization parameter λ>0\lambda > 0 prevents division by near-zero values of H(f)H(f):

  • λ0\lambda \to 0: approaches inverse filter (unstable)
  • λ\lambda \to \infty: output x^0\hat{x} \to 0 (over-regularized)
  • Optimal λ\lambda chosen by cross-validation or the discrepancy principle

L-curve method. Plot x^2\lVert \hat{x} \rVert_2 vs. yhx^2\lVert y - h * \hat{x} \rVert_2 as λ\lambda varies. The "corner" of the L-shaped curve gives a good λ\lambda - balancing solution norm against residual.

Sparse deconvolution. When the clean signal xx is sparse (e.g., spike train, musical onset detection), using L1L^1 regularization instead of L2L^2:

x^=argminxyhx22+λx1\hat{x} = \arg\min_x \lVert y - h * x \rVert_2^2 + \lambda \lVert x \rVert_1

This is the LASSO in the signal processing context and recovers sparse solutions more faithfully than the Wiener filter, which assumes a smooth Gaussian prior on xx.

9. Applications in Machine Learning

9.1 Convolutional Neural Networks

A convolutional layer computes cross-correlation (not true convolution, but the distinction is irrelevant since weights are learned):

(Y)i,j=m,nWm,nXi+m,j+n+b(Y)_{i,j} = \sum_{m,n} W_{m,n} X_{i+m,\, j+n} + b

Frequency perspective: Each learned filter WW is an LTI system with frequency response W^(f)\hat{W}(f). Deep CNNs learn a hierarchy of filters:

  • Layer 1 filters: oriented edges, color gradients (Gabor-like bandpass filters)
  • Layer 2 filters: corners, textures (combinations of edge filters)
  • Layer LL filters: object parts, semantic features

Dilated convolutions expand the receptive field without increasing parameters: (XdW)[n]=kW[k]X[ndk](X *_d W)[n] = \sum_k W[k] X[n - dk]. In frequency: zero-stuffing the kernel compresses the frequency axis, W^d(f)=W^(df)\hat{W}_d(f) = \hat{W}(df), so the filter's response becomes periodic with period fs/df_s/d - dilation creates spectral images of the original response.
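The frequency-domain picture can be checked exactly on the DFT grid: zero-stuffing a kernel by factor d makes its N-point DFT equal the original DFT read at stride d (mod N). The 3-tap kernel below is an arbitrary example:

```python
import numpy as np

w = np.array([0.25, 0.5, 0.25])       # 3-tap smoothing kernel
d, N = 4, 64

# Dilation by d == convolving with the zero-stuffed kernel
w_d = np.zeros((len(w) - 1) * d + 1)
w_d[::d] = w

W = np.fft.fft(w, N)
W_d = np.fft.fft(w_d, N)

# W_d[k] = W[(d*k) mod N]: frequency axis compressed by d, with periodic images
assert np.allclose(W_d, W[(d * np.arange(N)) % N])
```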

1×1 convolutions are pointwise linear transformations across channels - equivalent to applying the same linear map at every spatial position. They mix channel information without spatial mixing.

For AI (2026): Modern vision transformers (ViT) treat patches as tokens. Recent work shows that early ViT layers perform operations similar to CNN filters - local spatial filtering. Conversely, "ConvNeXt" (Liu et al., 2022) shows that pure CNN architectures can match ViT by adopting transformer-inspired design choices (depth-wise convolutions, GELU activations, layer norm).

9.2 Causal Convolution and WaveNet

WaveNet (van den Oord et al., 2016) generates raw audio waveforms autoregressively at 16/24 kHz using dilated causal convolutions.

Causal convolution: (hcausalx)[n]=k=0L1h[k]x[nk](h *_{\text{causal}} x)[n] = \sum_{k=0}^{L-1} h[k] x[n-k] - only past samples contribute. In training, implemented as a padded standard convolution.

Dilated receptive field: With dilation factors d=1,2,4,8,,2L1d = 1, 2, 4, 8, \ldots, 2^{L-1} over LL layers of 2-tap filters, the total receptive field is 1+i=0L12i=2L1 + \sum_{i=0}^{L-1} 2^i = 2^L samples. At 16 kHz with L=10L = 10 layers: receptive field =210=1024= 2^{10} = 1024 samples == 64 ms.

WAVENET DILATED CAUSAL RECEPTIVE FIELD (L=3 layers)
========================================================================

  Input:  x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8]
           |    |    |    |    |    |    |    |    |
  d=1:    +----+    +----+    +----+    +----+    (receptive field: 2)
           |         |         |         |
  d=2:    +---------+    +---------+    (receptive field: 4)
           |              |
  d=4:    +--------------+              (receptive field: 8)

  Total receptive field: 2^3 = 8 samples
  Output y[n] depends on x[n], x[n-1], ..., x[n-7]

========================================================================
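The receptive-field arithmetic can be checked by pushing an impulse through a stack of 2-tap causal dilated filters (the all-ones taps are an arbitrary choice):

```python
import numpy as np

def causal_dilated(x, h, d):
    """y[n] = sum_k h[k] x[n - d*k], treating x[m] = 0 for m < 0."""
    y = np.zeros_like(x)
    for k, hk in enumerate(h):
        if k == 0:
            y += hk * x
        else:
            y[k * d:] += hk * x[:-k * d]
    return y

# Push an impulse through 2-tap filters with dilations 1, 2, 4, 8
y = np.zeros(64); y[0] = 1.0
for d in (1, 2, 4, 8):
    y = causal_dilated(y, np.ones(2), d)

rf = int(np.nonzero(y)[0].max()) + 1
assert rf == 16                        # receptive field = 2^L for L = 4 layers
```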

9.3 State Space Models as Convolution (S4)

The Structured State Space Sequence model (S4) (Gu et al., 2021) bridges recurrent and convolutional views of sequence modeling.

SSM definition. The continuous-time state space model:

h˙(t)=Ah(t)+Bx(t),y(t)=Ch(t)+Dx(t)\dot{\mathbf{h}}(t) = A\mathbf{h}(t) + B x(t), \quad y(t) = C\mathbf{h}(t) + D x(t)

Discretized with step Δ\Delta:

\mathbf{h}_t = \bar{A}\mathbf{h}_{t-1} + \bar{B} x_t, \quad y_t = C\mathbf{h}_t + D x_t

\bar{A} = e^{\Delta A}, \quad \bar{B} = (e^{\Delta A} - I)A^{-1}B

Convolutional mode (training). Unrolling the recurrence gives:

yt=k=0tCAˉkBˉKˉ[k]xtk+Dxty_t = \sum_{k=0}^{t} \underbrace{C \bar{A}^k \bar{B}}_{\bar{K}[k]} x_{t-k} + D x_t

The SSM kernel Kˉ=[CBˉ,  CAˉBˉ,  CAˉ2Bˉ,  ,  CAˉL1Bˉ]\bar{\mathbf{K}} = [C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^2\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}] defines a length-LL FIR filter. Convolution y=Kˉxy = \bar{\mathbf{K}} * x is computed in O(LlogL)O(L \log L) via FFT.

HiPPO initialization. S4 initializes AA as the HiPPO matrix that optimally projects history onto Legendre polynomials. This gives Kˉ\bar{K} long-range decay properties that enable learning dependencies over thousands of steps.

For AI: S4 achieves state-of-the-art performance on the Long Range Arena benchmark, processing sequences of length 16,000 in O(NlogN)O(N \log N) vs. O(N2)O(N^2) for Transformers.
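A minimal sketch of S4's convolutional mode for a diagonal SSM (the state size, step size, and random parameters are illustrative; real S4 uses the HiPPO initialization and omitting the D term is a simplification):

```python
import numpy as np

rng = np.random.default_rng(6)
Nstate, L, dt = 16, 256, 0.01

a = -rng.uniform(0.1, 2.0, Nstate)     # diagonal A with Re(a) < 0: stable
b = rng.standard_normal(Nstate)
c = rng.standard_normal(Nstate)

# ZOH discretization, elementwise because A is diagonal (D term omitted)
a_bar = np.exp(dt * a)
b_bar = (a_bar - 1.0) / a * b

# SSM kernel K[k] = C Abar^k Bbar, applied via zero-padded FFT convolution
K = (c * b_bar) @ (a_bar[:, None] ** np.arange(L))
x = rng.standard_normal(L)
n_fft = 2 * L
y_fft = np.fft.irfft(np.fft.rfft(K, n_fft) * np.fft.rfft(x, n_fft), n_fft)[:L]

# Reference: run the recurrence h_t = Abar h_{t-1} + Bbar x_t, y_t = C h_t
h = np.zeros(Nstate)
y_rec = np.zeros(L)
for t in range(L):
    h = a_bar * h + b_bar * x[t]
    y_rec[t] = c @ h

assert np.allclose(y_fft, y_rec)       # convolutional and recurrent modes agree
```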

9.4 Mamba and Selective State Spaces

Mamba (Gu & Dao, 2023) extends S4 with input-selective SSM parameters: Aˉ(xt),Bˉ(xt),C(xt)\bar{A}(x_t), \bar{B}(x_t), C(x_t) depend on the input token. This breaks the time-invariance of S4's convolution kernel - the filter Kˉ\bar{K} is now input-dependent and changes for every sequence.

Why not use FFT directly: Since Kˉ\bar{K} changes with input, Mamba cannot precompute a single convolution kernel and apply it via FFT. Instead, it uses a hardware-aware parallel scan algorithm (Blelloch 1990) that processes the recurrence in O(N)O(N) time on GPU with memory efficiency via kernel fusion.

Selective mechanism: The input-dependent parameters enable the model to selectively "remember" or "forget" context - analogous to LSTM gates but more flexible. This allows Mamba to match or exceed Transformer performance on language modeling while maintaining sub-quadratic scaling.

9.5 Hyena and Parameterized Long Convolutions

Hyena (Poli et al., 2023) replaces the attention mechanism with a hierarchy of parameterized long convolutions.

Architecture. A Hyena operator of order NN computes:

y=hN(vN(hN1(vN1((h1(v1x))))))y = h_N * (v_N \odot (h_{N-1} * (v_{N-1} \odot (\cdots (h_1 * (v_1 \odot x)) \cdots))))

where each hih_i is a learned long convolution (length == sequence length), vi=Wixv_i = W_i x are linear projections, and \odot is elementwise multiplication.

Parameterization of hih_i: Each long filter is parameterized by a small neural network hi(t)=fθ(t)h_i(t) = f_{\theta}(t) evaluated at integer time positions t=0,1,,L1t = 0, 1, \ldots, L-1. This gives O(1)O(1) parameters per filter regardless of sequence length.

Complexity: Each convolution costs O(NlogN)O(N \log N) via FFT. Total: O(dNlogN)O(dN \log N) per layer for order-dd Hyena vs. O(N2d)O(N^2 d) for attention.

For AI: Hyena reaches 99.9% of attention's performance on CIFAR-10 at sequence length 1024 while being 8× faster. It scales better still to length 64K, where attention becomes prohibitive.

9.6 FNet: Replacing Attention with FFT

FNet (Lee-Thorp et al., 2022) replaces the self-attention sub-layer in a standard Transformer with a 2D FFT mixing layer:

FNetMix(X)=Re(Fseq(Fmodel(X)))\text{FNetMix}(X) = \text{Re}(\mathcal{F}_{\text{seq}}(\mathcal{F}_{\text{model}}(X)))

Two FFTs: one along the sequence dimension, one along the model dimension. No learned parameters in the mixing layer - just a fixed unitary transformation. The FFN sub-layers remain unchanged.
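The mixing layer is two lines of NumPy; the sketch below checks that the nested 1D FFTs match a single 2D FFT and that the map is a fixed linear operator (the tensor shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 16))       # (sequence, model) activations

# FNet mixing: FFT along the model axis, then the sequence axis, keep Re
mixed = np.fft.fft(np.fft.fft(X, axis=-1), axis=0).real

# Identical to a single 2D FFT -- a fixed, parameter-free linear map
assert np.allclose(mixed, np.fft.fft2(X).real)
assert np.allclose(np.fft.fft2(2 * X).real, 2 * mixed)
```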

Results:

  • Achieves 92-97% of BERT's performance on GLUE
  • Training speed: 70% faster than BERT on TPUs, 80% on GPUs (no attention computation)
  • Parameter count: smaller than BERT - the attention weights are gone; all parameters live in the FFN, embeddings, and output projection

Why it works: The 2D FFT performs a global mixing of all positions and all channels simultaneously. It acts like an "equal attention" that attends to all positions - less expressive than learned attention but sufficient for many tasks after fine-tuning.

For AI (2026): FNet is the existence proof that token mixing does not require learned attention. The same insight underlies MLP-Mixer, gMLP, and other mixing architectures that replace O(N2)O(N^2) attention with O(N)O(N) or O(NlogN)O(N \log N) alternatives.

9.7 Filter Banks Preview

Preview: Wavelets and Filter Banks

The Convolution Theorem underlies an even richer structure: filter banks - collections of bandpass filters that partition the spectrum. Convolving a signal with a bank of bandpass filters and subsampling the outputs gives a multi-resolution representation. The perfect reconstruction condition requires that the synthesis filters exactly invert the analysis filters - a constraint directly expressed via the Convolution Theorem.

The Haar wavelet is the simplest perfect-reconstruction filter bank: a lowpass filter h0=[1,1]/2h_0 = [1, 1]/\sqrt{2} and highpass h1=[1,1]/2h_1 = [1, -1]/\sqrt{2}. Iterating the lowpass branch produces the discrete wavelet transform (DWT).

-> Full treatment: Wavelets

10. Advanced Topics

10.1 Convolution on Groups

The Convolution Theorem extends far beyond R\mathbb{R} and Z/NZ\mathbb{Z}/N\mathbb{Z} to any locally compact abelian group GG:

(fg)(x)=Gf(y1x)g(y)dμ(y)(f * g)(x) = \int_G f(y^{-1} x)\, g(y)\, d\mu(y)

where μ\mu is the Haar measure on GG. The Fourier transform on GG diagonalizes this convolution exactly as in the classical case:

fg^(χ)=f^(χ)g^(χ)\widehat{f * g}(\chi) = \hat{f}(\chi) \cdot \hat{g}(\chi)

where χ:GT\chi: G \to \mathbb{T} are the characters (1D unitary representations) of GG.

Examples:

  • G=Z/NZG = \mathbb{Z}/N\mathbb{Z}: DFT, circular convolution
  • G=ZG = \mathbb{Z}: Z-transform, discrete convolution
  • G=RG = \mathbb{R}: Fourier transform, linear convolution
  • G=TG = \mathbb{T} (circle): Fourier series, periodic convolution
  • G=RnG = \mathbb{R}^n: multidimensional FT, image processing

Non-abelian groups. For non-abelian GG (e.g., rotation group SO(3)SO(3), permutation group SnS_n), the Fourier transform takes values in matrix-valued representations, not scalars. Convolution in the group algebra is still diagonalized, but now each "frequency component" is a matrix equation f^(ρ)g^(ρ)=fg^(ρ)\hat{f}(\rho) \cdot \hat{g}(\rho) = \widehat{f * g}(\rho) where ρ\rho is an irreducible representation.

GG-CNNs (Cohen & Welling, 2016): Convolutional networks on Lie groups GG that are equivariant to all group transformations. Each layer computes group convolution (fGh)(g)=Gf(h)W(g1h)dh(f *_G h)(g) = \int_G f(h) W(g^{-1}h)\, dh. The output transforms predictably under GG-transformations, encoding equivariance as an inductive bias rather than learning it from data.

For AI: Group-equivariant CNNs are used in molecular property prediction (G=SO(3)G = SO(3) for 3D rotational equivariance), protein structure prediction (SE(3)-equivariant networks in AlphaFold 2), and medical imaging (rotation/reflection equivariance for tissue classification).

10.2 Spectral Graph Convolution

On a graph G=(V,E)\mathcal{G} = (\mathcal{V}, \mathcal{E}) with NN nodes, the graph Laplacian L=DAL = D - A (where DD is the degree matrix) plays the role of the Laplacian in continuous Fourier analysis. Its eigendecomposition L=UΛUL = U\Lambda U^\top defines the graph Fourier transform:

x^=Ux,x=Ux^\hat{\mathbf{x}} = U^\top \mathbf{x}, \quad \mathbf{x} = U\hat{\mathbf{x}}

Graph convolution:

xGh=U(UxUh)=Udiag(h^)Ux\mathbf{x} *_{\mathcal{G}} \mathbf{h} = U(U^\top \mathbf{x} \odot U^\top \mathbf{h}) = U\operatorname{diag}(\hat{h}) U^\top \mathbf{x}

This is the Convolution Theorem on graphs: spectral multiplication = graph convolution.

Limitations: Computing UU costs O(N3)O(N^3); the resulting filters are non-localized. ChebNet (Defferrard et al., 2016) approximates h^(Λ)\hat{h}(\Lambda) by a Chebyshev polynomial, giving KK-hop local filters. GCN (Kipf & Welling, 2017) uses the first-order approximation h^(Λ)h^0I+h^1Λ~\hat{h}(\Lambda) \approx \hat{h}_0 I + \hat{h}_1 \tilde{\Lambda} (two-parameter filter).
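A sketch of spectral filtering on a 6-node path graph (the heat-kernel filter e^{-λ} is an illustrative choice):

```python
import numpy as np

# Path graph on 6 nodes: L = D - A
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
Lap = np.diag(A.sum(axis=1)) - A

# Graph Fourier transform = expansion in Laplacian eigenvectors
lam, U = np.linalg.eigh(Lap)
assert np.allclose(U @ np.diag(lam) @ U.T, Lap)

x = np.arange(N, dtype=float)
h_hat = np.exp(-lam)                   # lowpass spectral filter (heat kernel)

# Convolution theorem on graphs: filtering = U diag(h_hat) U^T
y = U @ (h_hat * (U.T @ x))
assert np.allclose(y, (U @ np.diag(h_hat) @ U.T) @ x)
```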

-> Full treatment: Spectral Graph Theory

10.3 Young's Convolution Inequality

Theorem (Young's Inequality for Convolutions). Let 1p,q,r1 \leq p, q, r \leq \infty with 1p+1q=1+1r\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}. For fLp(R)f \in L^p(\mathbb{R}) and gLq(R)g \in L^q(\mathbb{R}):

fgrfpgq\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q

Key special cases:

| p | q | r | Interpretation |
|---|---|---|----------------|
| 1 | 1 | 1 | L^1 is a Banach algebra under convolution |
| 1 | 2 | 2 | FIR filtering preserves the L^2 norm (up to a factor ‖h‖_1) |
| 1 | ∞ | ∞ | Bounded signals remain bounded after L^1 filtering |
| 2 | 2 | ∞ | Autocorrelation is bounded |

Proof sketch. Generalized Hölder's inequality applied to the convolution integral; see Lieb & Loss, Analysis, Section 4.2.
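Young's inequality also holds on the group Z with counting measure, so the discrete version can be spot-checked with np.convolve (random test vectors and a few admissible (p, q, r) triples):

```python
import numpy as np

rng = np.random.default_rng(8)
f = rng.standard_normal(100)
g = rng.standard_normal(200)
y = np.convolve(f, g)

def lp(v, p):
    """Discrete l^p norm."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# Admissible triples satisfy 1/p + 1/q = 1 + 1/r
for p, q, r in [(1, 1, 1), (1, 2, 2), (2, 1, 2)]:
    assert lp(y, r) <= lp(f, p) * lp(g, q) + 1e-9
```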

Sharp version (Beckner-Brascamp-Lieb). The sharp constant in Young's inequality is:

\lVert f * g \rVert_r \leq (A_p A_q A_{r'})\, \lVert f \rVert_p \lVert g \rVert_q, \qquad A_s = \left(\frac{s^{1/s}}{s'^{1/s'}}\right)^{1/2}, \quad \frac{1}{s} + \frac{1}{s'} = 1

where r' is the conjugate exponent of r.

Gaussians are the extremizers - they saturate the inequality with equality.

11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Using circular convolution when linear convolution is needed | DFT assumes periodicity; wrap-around artifacts contaminate the output | Zero-pad both sequences to length ≥ M + L - 1 before the DFT |
| 2 | Forgetting to zero-pad to an efficient FFT size | FFT sizes with large prime factors fall back to much slower algorithms | Use N = 2**ceil(log2(M+L-1)) or let scipy.fft choose N |
| 3 | Double-dividing by N | Manually computing (1/N) Σ x[n] ω^{nk} and then calling np.fft.ifft (which already divides by N) | Use either all-manual or all-NumPy; never mix conventions |
| 4 | Confusing convolution with cross-correlation | torch.nn.Conv2d computes cross-correlation (W ⋆ x), not (W * x); for symmetric kernels they are equal, but in general they differ | Convolution flips the kernel, cross-correlation does not; for learned filters it makes no difference (the network learns the flip) |
| 5 | Ignoring phase response | Analyzing only the magnitude \|H(f)\|; nonlinear phase distorts the waveform shape even when the magnitude response is flat | Check the group delay; use linear-phase FIR filters when waveform shape matters |
| 6 | Applying IIR filters in both directions without understanding "zero-phase" | scipy.signal.filtfilt applies a filter twice (forward then backward) to achieve zero phase, but it is non-causal; it cannot be used in real-time streaming | Use filtfilt only for offline analysis; use lfilter with delay compensation for real-time |
| 7 | Using the periodogram directly as a PSD estimate | The periodogram's variance is comparable to its squared mean and does not shrink as N grows (inconsistent estimator) | Use Welch's method (scipy.signal.welch) to average windowed periodogram estimates |
| 8 | Naive inverse filter on attenuated frequencies | Dividing by H(f) ≈ 0 amplifies noise by 1/\|H(f)\| → ∞ | Use the Wiener filter or Tikhonov-regularized deconvolution |
| 9 | Confusing convolution with element-wise multiplication | Writing y = h * x in Python (NumPy) gives element-wise multiply, not convolution | Use np.convolve(x, h) or scipy.signal.fftconvolve(x, h) |
| 10 | Incorrect output length after convolution | Not accounting for the "full", "same", or "valid" output modes, or expecting circular convolution to give M + L - 1 output samples (it gives N) | Know the output lengths: full = M+L-1; same = max(M,L); valid = max(M,L) - min(M,L) + 1 |

12. Exercises

Exercise 1 * - Convolution by Hand Compute the linear convolution of x=[1,2,3,2,1]x = [1, 2, 3, 2, 1] and h=[1,0,1]h = [1, 0, -1] by hand using the sliding-dot-product definition. Verify your answer using np.convolve. Then compute the 5-point circular convolution of the same inputs and identify which output samples differ from the linear convolution result due to wrap-around.

Exercise 2 * - Convolution Theorem Verification For x=[1,2,3,4]x = [1, 2, 3, 4] and h=[1,1,1]h = [1, -1, 1]: (a) Compute the linear convolution directly. (b) Compute it via the FFT route (zero-pad, multiply spectra, IFFT). (c) Verify that both approaches give the same result to machine precision. (d) Verify Parseval: y22=1NY22\lVert y \rVert_2^2 = \frac{1}{N}\lVert Y \rVert_2^2 where Y=DFT(y)Y = \text{DFT}(y).

Exercise 3 * - LTI Frequency Response Design a simple FIR lowpass filter using the window method: (a) Start with the ideal lowpass impulse response hd[n]=2fcsinc(2fcn)h_d[n] = 2f_c \operatorname{sinc}(2f_c n) with cutoff fc=0.2f_c = 0.2 (in normalized frequency). (b) Truncate to L=51L = 51 taps. (c) Multiply by a Hann window. (d) Plot H(f)|H(f)| in dB and measure the passband ripple, stopband attenuation, and transition bandwidth.

Exercise 4 ** - Circular vs. Linear Convolution Demo (a) Convolve a length-64 chirp signal with a length-16 Hann-windowed filter using both direct np.convolve and the 64-point circular DFT method. (b) Show the circular convolution error (difference from linear convolution). (c) Repeat with zero-padding to length 64+161=7964 + 16 - 1 = 79 (or next power of 2 = 128). (d) Verify that the zero-padded circular convolution matches linear convolution exactly.

Exercise 5 ** - Overlap-Add Implementation Implement the overlap-add algorithm from scratch: def overlap_add(x, h, block_size). (a) Process a 1024-sample signal with a 64-sample filter using block size 128. (b) Verify your output matches scipy.signal.fftconvolve(x, h) to machine precision. (c) Measure and compare the runtime of your overlap-add, np.convolve, and scipy.signal.fftconvolve for varying signal lengths.

Exercise 6 ** - Wiener Deconvolution Given a blurred signal y=hx+ny = h * x + n where hh is a Gaussian blur, xx is a test signal, and nn is white Gaussian noise: (a) Implement the naive inverse filter and show the noise amplification. (b) Implement the Wiener filter assuming known SxxS_{xx} and SnnS_{nn}. (c) Show the deconvolution result for three SNR levels (10,20,3010, 20, 30 dB). (d) Compare with regularized Tikhonov deconvolution using the L-curve to select λ\lambda.

Exercise 7 *** - S4 SSM as FFT Convolution Implement the S4 state space model in convolutional mode: (a) Construct the SSM matrices A,B,CA, B, C for a simple diagonal SSM with N=64N = 64 states. (b) Discretize with step Δ=0.01\Delta = 0.01 using the ZOH method: Aˉ=eΔA\bar{A} = e^{\Delta A}, Bˉ=(eΔAI)A1B\bar{B} = (e^{\Delta A} - I)A^{-1}B. (c) Compute the SSM kernel Kˉ=[CBˉ,CAˉBˉ,,CAˉL1Bˉ]\bar{K} = [C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B}] for L=256L = 256. (d) Apply the kernel to a test sequence via FFT convolution. (e) Verify by also running the recurrence directly and comparing outputs.

Exercise 8 *** - Hyena Parameterized Long Convolution Implement a simplified order-2 Hyena operator: (a) Define two learnable filters h1,h2h_1, h_2 of length L=128L = 128, parameterized by a 2-layer MLP evaluated at positions t=0,,L1t = 0, \ldots, L-1. (b) Implement the Hyena forward pass: y=h2(v2(h1(v1x)))y = h_2 * (v_2 \odot (h_1 * (v_1 \odot x))) where vi=Wixv_i = W_i x are learned projections. (c) Verify that the total compute is O(LlogL)O(L \log L) per forward pass. (d) Compare: for L=1024L = 1024, count the FLOPs for Hyena vs. self-attention (O(L2d)O(L^2 d) with d=64d = 64).

13. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Convolution Theorem (f * g ↔ f̂·ĝ) | Reduces O(N^2) convolution to O(N log N) in all CNN layers; directly enables efficient sequence modeling |
| LTI system theory + impulse response | S4/Mamba: SSM = LTI system; training in convolutional mode via precomputed SSM kernel; inference in recurrent mode |
| Circular vs. linear convolution | Padding strategy in all CNN implementations; must zero-pad to avoid wrap-around artifacts in FFT-based conv layers |
| FIR filter design (linear phase) | Dilated causal convolutions in WaveNet are FIR filters; dilation exponentially extends receptive field |
| IIR filter stability (ρ(A) < 1) | Gradient stability in RNNs, LSTMs, S4; HiPPO matrix design for stable long-range memory |
| Cross-correlation = attention (soft matched filter) | Multi-head attention is a learned bank of cross-correlators; QK inner product = cross-correlation score |
| Wiener filter (optimal deconvolution) | Noise-robust attention; NLP denoising objectives; diffusion model score matching as optimal deconvolution |
| Overlap-add / overlap-save | WaveNet training (parallel causal conv); audio codec inference; streaming inference for SSMs |
| Group convolution (equivariance) | SE(3)-equivariant networks in protein structure (AlphaFold 2); molecular dynamics (NequIP, MACE) |
| Spectral graph convolution | GCN and ChebNet derive from graph Laplacian diagonalization; spatial methods (GraphSAGE) approximate it |
| Hyena parameterized convolutions | Sub-quadratic attention replacement; competitive with transformers at context lengths 8K-64K |
| FNet (FFT attention replacement) | Existence proof that fixed global mixing (FFT) achieves 92-97% of BERT; motivates MLP-Mixer, gMLP architectures |

14. Conceptual Bridge

Looking Back

This section builds directly on three prior foundations. From Section 20-01 (Fourier Series), we inherit the idea that periodic signals decompose into sinusoidal harmonics and that these harmonics are orthogonal - the very orthogonality that makes the DFT a unitary transform. From Section 20-02 (Fourier Transform), we carry forward the continuous convolution theorem, Plancherel's theorem, and the L^1/L^2 framework in which these operations are well-defined. From Section 20-03 (DFT and FFT), we take the computational machinery: the DFT matrix, the Cooley-Tukey algorithm, and the concept of circular periodicity that makes the discrete Convolution Theorem exact rather than approximate.

The Convolution Theorem can be seen as the payoff of all the Fourier analysis developed in Sections 20-01 through 20-03: the abstract orthogonal decomposition becomes a practical speedup. The key insight - that complex exponentials are the eigenfunctions of convolution - was already visible in Section 20-01's Fourier series (eigenfunctions of translation operators on the circle) and Section 20-02's Fourier transform (eigenfunctions of all LTI operators). Here we make that insight computationally concrete.

The Core Equation

x \circledast y \;\overset{\mathcal{F}}{\longleftrightarrow}\; X[k] \cdot Y[k]

This identity, proved in Section 3.2, is what connects abstract Fourier analysis to practical engineering and modern deep learning. Every CNN layer, every S4/Mamba layer, every FNet mixer, and every overlap-add audio system computes a consequence of this equation.
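As a sanity check, the identity can be verified numerically in a few lines (a sketch in plain NumPy; the length and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)
y = rng.standard_normal(N)

# Direct circular convolution: (x ⊛ y)[n] = Σ_m x[m] y[(n-m) mod N]  — O(N²)
direct = np.array([sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)])

# Convolution Theorem: x ⊛ y = IDFT(X · Y)  — O(N log N)
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(direct, spectral)
```

The two computations agree to machine precision, which is exactly the statement that the DFT diagonalizes circular convolution.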

Looking Forward

The immediate next step is Section 20-05 (Wavelets), which uses the Convolution Theorem as its foundation. Wavelet analysis overcomes Fourier's fundamental limitation - no time localization - by using short convolutions with oscillatory kernels at multiple scales. The perfect reconstruction filter bank condition (the core technical requirement of wavelets) is a system of equations on the Fourier transforms of the analysis and synthesis filters. Every MRA, DWT, and lifting scheme is a cleverly designed set of convolutions satisfying this spectral condition.

Looking further, this section connects to:

  • Section 11-04 (Spectral Graph Theory): Graph convolution = signal processing on irregular domains; ChebNet, GCN, and spectral clustering all use the graph Laplacian eigendecomposition in the same way the Convolution Theorem uses DFT eigenvectors
  • Section 12-02 (Hilbert Spaces): The Convolution Theorem in L^2 is a statement about isometric isomorphisms of Hilbert spaces; Plancherel's theorem is the isometry; the Convolution Theorem is a multiplicative structure on top
  • Section 17 (Generative Models): Diffusion models can be interpreted as iterative deconvolution - the forward process convolves with progressively wider Gaussians, and the reverse process learns the Wiener deconvolution at each noise level

POSITION IN CURRICULUM
========================================================================

  Section 20-01 Fourier Series       Section 20-02 Fourier Transform
  (periodic, discrete spectrum)      (aperiodic, continuous spectrum)
         |                                  |
         v                                  v
  Section 20-03 DFT and FFT  ------>  Section 20-04 Convolution Theorem  <- HERE
  (discrete, computable)              (theorem + LTI + ML apps)
                                            |
                                            v
                                  Section 20-05 Wavelets
                                  (multi-resolution, time-local)
                                            |
                                            v
                  +-------------------------+-------------------------+
                  |                         |                         |
                  v                         v                         v
       Section 11-04 Spectral    Section 12-02 Hilbert      Section 17 Diffusion
       Graph Theory              Spaces (L2)                Models

========================================================================

Appendix A: The Z-Transform and Discrete-Time LTI Systems

The Z-transform is the discrete-time counterpart of the Laplace transform, and its restriction to the unit circle is the DTFT (discrete-time Fourier transform):

X(z) = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n}, \quad z \in \mathbb{C}

The DFT evaluates the Z-transform at N equally spaced points on the unit circle: z = e^{2\pi i k/N} for k = 0, \ldots, N-1.

Convolution in the Z-domain. The Z-transform of a convolution is the product of the Z-transforms:

\mathcal{Z}(x * h)(z) = X(z) \cdot H(z)

This follows by exactly the same calculation as the continuous Convolution Theorem (substitute u = n - m).

Transfer function. For an IIR filter with difference equation:

y[n] = \sum_{k=0}^{M} b_k x[n-k] - \sum_{k=1}^{N} a_k y[n-k]

the Z-transform gives Y(z)\left(1 + \sum_{k=1}^{N} a_k z^{-k}\right) = X(z) \sum_{k=0}^{M} b_k z^{-k}, so:

H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}} = \frac{B(z)}{A(z)}

Stability from the Z-domain. The BIBO stability condition \lVert h \rVert_1 < \infty translates to: all poles of H(z) (roots of A(z)) must lie strictly inside the unit circle |z| < 1.

Frequency response from the transfer function. Evaluate H(z) on the unit circle: H(e^{2\pi i f}) = H(z)|_{z = e^{2\pi i f}}. This reads the frequency response directly off the pole-zero plot.

A.1 Pole-Zero Analysis

The transfer function H(z) = K \prod_k (z - z_k) / \prod_j (z - p_j) has:

  • Zeros z_k: frequencies where |H(e^{i\omega})| = 0 (nulled by the filter)
  • Poles p_j: resonances where |H(e^{i\omega})| peaks (amplified by the filter)

Notch filter example. To null frequency f_0, place a zero pair at z = e^{\pm 2\pi i f_0}:

H(z) = (z - e^{2\pi i f_0})(z - e^{-2\pi i f_0}) = z^2 - 2\cos(2\pi f_0)z + 1

This creates a perfect null at f_0. Used in power-line interference removal (50/60 Hz notch filters).
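The null is easy to verify numerically (NumPy sketch; f_0 = 0.123 is an arbitrary illustrative frequency):

```python
import numpy as np

f0 = 0.123  # frequency to null, in cycles/sample (illustrative value)

# H(z) = z² - 2cos(2π f0) z + 1, zeros at z = e^{±2πi f0}
coeffs = [1.0, -2.0 * np.cos(2 * np.pi * f0), 1.0]  # descending powers of z

# On the unit circle at f0 the response vanishes (up to rounding)
H_at_f0 = np.polyval(coeffs, np.exp(2j * np.pi * f0))
assert abs(H_at_f0) < 1e-12

# A nearby frequency is not nulled
H_near = np.polyval(coeffs, np.exp(2j * np.pi * (f0 + 0.05)))
assert abs(H_near) > 1e-2
```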

For AI: The eigenvalues of a linear RNN's transition matrix A are the Z-transform poles of its impulse response. Stable initialization requires all eigenvalues of A inside the unit disk. S4's HiPPO matrix has eigenvalues on the imaginary axis (continuous time), which after discretization map to eigenvalues near but inside the unit circle - ensuring stability while preserving long-range information.
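The stability dichotomy is easy to see numerically. The sketch below (plain NumPy; matrix size, seed, and the radii 0.9/1.1 are arbitrary choices) rescales a random matrix to a given spectral radius and watches the impulse response h[n] = A^n B decay or blow up:

```python
import numpy as np

rng = np.random.default_rng(1)

def impulse_norms(A, steps=200):
    """Norms of h[n] = A^n B for the linear recurrence x[n+1] = A x[n] + B u[n]."""
    B = rng.standard_normal(A.shape[0])
    h, out = B.copy(), []
    for _ in range(steps):
        out.append(np.linalg.norm(h))
        h = A @ h
    return np.array(out)

M = rng.standard_normal((8, 8))
rho = max(abs(np.linalg.eigvals(M)))

stable = impulse_norms(M * (0.9 / rho))    # spectral radius 0.9: poles inside unit circle
unstable = impulse_norms(M * (1.1 / rho))  # spectral radius 1.1: poles outside

assert stable[-1] < 1e-3 * stable[0]       # geometric decay
assert unstable[-1] > 1e3 * unstable[0]    # geometric blow-up
```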

Appendix B: Worked Examples

B.1 Convolution Theorem for the Heat Equation

The heat equation \partial_t u = \partial_{xx} u with initial condition u(x, 0) = u_0(x) is solved by convolution with the heat kernel:

u(x, t) = (G_t * u_0)(x), \quad G_t(x) = \frac{1}{\sqrt{4\pi t}} e^{-x^2/(4t)}

Derivation via Fourier transform. Take the FT in x:

\partial_t \hat{u}(\xi, t) = -(2\pi\xi)^2 \hat{u}(\xi, t)

This is a simple ODE in t: \hat{u}(\xi, t) = \hat{u}_0(\xi)\, e^{-(2\pi\xi)^2 t}.

The Convolution Theorem identifies e^{-(2\pi\xi)^2 t} = \hat{G}_t(\xi) (the FT of the Gaussian kernel G_t). Inverting:

u(x, t) = \mathcal{F}^{-1}(\hat{G}_t \cdot \hat{u}_0)(x) = (G_t * u_0)(x) \qquad \square

The Fourier transform diagonalizes the heat operator: each frequency evolves independently, decaying at rate (2\pi\xi)^2. High frequencies decay fastest, blurring sharp features - this is exactly why Gaussian convolution produces smooth outputs.
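The key identity \hat{G}_t(\xi) = e^{-(2\pi\xi)^2 t} can be checked numerically by sampling the (periodized) heat kernel on a circle and comparing its DFT with the analytic transform (NumPy sketch; the grid parameters are arbitrary but chosen so periodization and aliasing errors are negligible):

```python
import numpy as np

L, N, t = 20.0, 256, 0.5
dx = L / N
x = dx * np.arange(N)
xc = np.minimum(x, L - x)               # distance to the origin on the circle
G = np.exp(-xc**2 / (4 * t)) / np.sqrt(4 * np.pi * t)   # periodized heat kernel

xi = np.fft.fftfreq(N, d=dx)            # frequencies matching the DFT bins
G_hat = np.fft.fft(G).real * dx         # Riemann-sum approximation of the FT

assert np.allclose(G_hat, np.exp(-(2 * np.pi * xi)**2 * t), atol=1e-8)
```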

B.2 Verification of Circular vs. Linear Convolution

Let x = [3, 1, 4, 1] (length 4) and h = [1, 2] (length 2).

Linear convolution (length 4 + 2 - 1 = 5):

  • y[0] = 3 \cdot 1 = 3
  • y[1] = 1 \cdot 1 + 3 \cdot 2 = 7
  • y[2] = 4 \cdot 1 + 1 \cdot 2 = 6
  • y[3] = 1 \cdot 1 + 4 \cdot 2 = 9
  • y[4] = 1 \cdot 2 = 2
  • y_{\text{lin}} = [3, 7, 6, 9, 2]

4-point circular convolution (wrong - wrap-around):

  • y[0] = 3 \cdot 1 + 1 \cdot 2 = 5 <- includes the wrapped term x[3] \cdot h[1], wrong
  • y[1] = 1 \cdot 1 + 3 \cdot 2 = 7 PASS
  • y[2] = 4 \cdot 1 + 1 \cdot 2 = 6 PASS
  • y[3] = 1 \cdot 1 + 4 \cdot 2 = 9 PASS
  • y_{\text{circ},4} = [5, 7, 6, 9] - first sample is wrong

8-point circular convolution (zero-padded, correct):

  • Zero-pad x \to [3,1,4,1,0,0,0,0] and h \to [1,2,0,0,0,0,0,0]
  • Circular convolution of the length-8 vectors gives [3,7,6,9,2,0,0,0]
  • Trim to length 5: [3, 7, 6, 9, 2] PASS - matches the linear convolution
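The whole worked example collapses to a few NumPy calls:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0])
h = np.array([1.0, 2.0])

lin = np.convolve(x, h)                                          # linear convolution
circ4 = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real    # wrap-around
circ8 = np.fft.ifft(np.fft.fft(x, 8) * np.fft.fft(h, 8)).real    # zero-padded

assert np.allclose(lin, [3, 7, 6, 9, 2])
assert np.allclose(circ4, [5, 7, 6, 9])     # first sample corrupted by wrap-around
assert np.allclose(circ8[:5], lin)          # padding to >= 5 recovers the linear conv
```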

B.3 Wiener Filter Derivation in Detail

Setup. Observation model: Y(f) = H(f)X(f) + N(f). Goal: estimate \hat{X}(f) = G(f)Y(f) minimizing \mathbb{E}[|X(f) - G(f)Y(f)|^2].

Expansion:

\mathbb{E}[|X - GY|^2] = \mathbb{E}[|X|^2] - G^*\mathbb{E}[XY^*] - G\mathbb{E}[X^*Y] + |G|^2\mathbb{E}[|Y|^2]

Assuming X and N are uncorrelated (\mathbb{E}[X N^*] = 0):

  • \mathbb{E}[XY^*] = H^* S_{XX}
  • \mathbb{E}[|Y|^2] = |H|^2 S_{XX} + S_{NN}

Setting \partial/\partial G^* = 0:

G_{\text{opt}} = \frac{H^* S_{XX}}{|H|^2 S_{XX} + S_{NN}}

This is the Wiener filter. The minimum MSE at frequency f is:

\text{MMSE}(f) = \frac{S_{XX}(f) \cdot S_{NN}(f)}{|H(f)|^2 S_{XX}(f) + S_{NN}(f)} = \frac{S_{XX}(f)}{1 + \text{SNR}(f)}

where \text{SNR}(f) = |H(f)|^2 S_{XX}(f) / S_{NN}(f).
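A synthetic deconvolution experiment illustrates the formula (NumPy sketch; the "oracle" spectra S_{XX} = |X|^2 and S_{NN} = N\sigma^2 are available here only because the signal is synthetic - in practice they must be estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 256, 0.1

# Smooth "true" signal; strong circular Gaussian blur (σ_h = 4 samples)
x = np.fft.irfft(np.fft.rfft(rng.standard_normal(N))
                 * np.exp(-np.arange(N // 2 + 1) / 8.0), n=N)
f = np.fft.rfftfreq(N)
H = np.exp(-2 * np.pi**2 * 4.0**2 * f**2)

X = np.fft.rfft(x)
Y = H * X + np.fft.rfft(sigma * rng.standard_normal(N))   # blurred + noisy observation

naive = np.fft.irfft(Y / H, n=N)              # amplifies noise wherever H ≈ 0
Sxx, Snn = np.abs(X)**2, N * sigma**2         # oracle spectra (illustration only)
G = np.conj(H) * Sxx / (np.abs(H)**2 * Sxx + Snn)
wiener = np.fft.irfft(G * Y, n=N)

assert np.mean((wiener - x)**2) < np.mean((naive - x)**2)
```

The naive inverse is catastrophically worse because the blur's transfer function is astronomically small at high frequencies, exactly the failure mode the Wiener filter's S_{NN} term suppresses.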

Appendix C: Numerical Notes and Implementation Details

C.1 Choosing FFT Library

| Library | Speed | Features | Best for |
| --- | --- | --- | --- |
| numpy.fft | Good | Basic FFT, rfft | Simple use; numpy-native |
| scipy.fft | Better | Handles non-power-of-2 sizes well; workers param | General purpose |
| pyfftw | Best | Multithreaded FFTW backend | Production, speed-critical |
| torch.fft | GPU | Batched FFT on GPU | Neural network layers |
| cuFFT (via cupy) | GPU | CUDA-native | Large batch processing |

Rule: For ML training, use torch.fft.rfft which batches over the first dimensions automatically. For signal processing, use scipy.fft which automatically selects the best algorithm.

C.2 Real-Valued Convolution with rfft

For real inputs x and h, the DFT has conjugate symmetry: X[N-k] = X[k]^*. Only the first N/2 + 1 values are independent. Using numpy.fft.rfft instead of fft:

  • Input: real x of length N
  • Output: complex of length N/2 + 1 (not N)
  • Speedup: \approx 2\times for large N

# Real-valued FFT convolution (linear convolution of real x and h)
import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))  # next power of 2 >= output length
X = np.fft.rfft(x, n=N)   # length N//2 + 1
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]

C.3 Numerical Precision in Convolution

Round-off error. FFT-based convolution introduces floating-point errors of order O(\varepsilon_{\text{mach}} N \log N), where \varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16} for double precision. For N = 2^{20}: error \approx 2.2 \times 10^{-16} \times 2 \times 10^{7} \approx 4 \times 10^{-9} - acceptable for most applications.

Exact convolution. For integer-valued signals, the NTT (Number Theoretic Transform over \mathbb{Z}_p) computes exact convolution without floating-point error - covered in Section 20-03.

C.4 Memory Layout for Batched Convolution

When processing batches of signals in PyTorch:

# Batched FFT convolution: B signals of length L, C channels
import torch

B, C, L, K = 8, 4, 1024, 65          # batch, channels, signal length, filter length
N = L + K - 1                        # padded length: avoids circular wrap-around
x = torch.randn(B, C, L)
h = torch.randn(C, K)                # one filter per channel
X = torch.fft.rfft(x, n=N, dim=-1)   # shape: (B, C, N//2 + 1)
H = torch.fft.rfft(h.unsqueeze(0), n=N, dim=-1)  # (1, C, N//2 + 1): broadcasts over batch
Y = X * H
y = torch.fft.irfft(Y, n=N, dim=-1)[:, :, :L]    # trim back to length L

The dim=-1 argument applies the FFT along the last (time) dimension, leaving batch and channel dimensions intact.

Appendix D: Connections to Other Chapters

D.1 Convolution and Probability

The sum Z = X + Y of independent random variables satisfies p_Z = p_X * p_Y (PDF convolution). This connection explains:

  • Central Limit Theorem via FT: Under mild conditions, \hat{p}_Z(f) = \hat{p}_X(f) \cdot \hat{p}_Y(f) \to e^{-2\pi^2\sigma^2 f^2} (a Gaussian in frequency), so p_Z \to \mathcal{N}(0, \sigma^2) as more terms are added.
  • Characteristic functions: The characteristic function \phi_X(t) = \mathbb{E}[e^{itX}] is the FT of p_X. The Convolution Theorem gives \phi_{X+Y} = \phi_X \cdot \phi_Y directly.
  • Diffusion models: The forward diffusion process adds Gaussian noise iteratively. The noisy marginal q(x_t) = \int q(x_t \mid x_0)\, q(x_0)\, dx_0 is a convolution of the data distribution with a Gaussian kernel.
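The PDF-convolution fact can be seen concretely for two Uniform[0,1] variables, whose sum has the triangular density on [0, 2] (NumPy sketch; the grid step is arbitrary):

```python
import numpy as np

dx = 0.001
t = np.arange(0, 1, dx)
p = np.ones_like(t)                    # Uniform[0,1] density sampled on a grid

pz = np.convolve(p, p) * dx            # numerical p_X * p_Y, supported on [0, 2)
z = np.arange(len(pz)) * dx
triangle = np.where(z <= 1, z, 2 - z)  # analytic density of X + Y

assert np.allclose(pz, triangle, atol=2 * dx)   # agreement up to discretization error
```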

D.2 Convolution and Regularization

Ridge regression solves \min_x \lVert Ax - b \rVert_2^2 + \lambda \lVert x \rVert_2^2. When A is a circulant matrix (i.e., A = C_h for some filter h), ridge regression becomes Tikhonov deconvolution:

\hat{X}(f) = \frac{H^*(f)}{|H(f)|^2 + \lambda} Y(f)

The regularization parameter \lambda plays the same role as in the Wiener filter: it prevents noise amplification at frequencies where H(f) \approx 0.
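The equivalence can be verified directly: solve the ridge normal equations with a dense circulant matrix and compare against the Fourier-domain formula (NumPy sketch; sizes and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 64, 0.1
h = np.r_[rng.standard_normal(8), np.zeros(N - 8)]   # filter, zero-padded to length N

# Circulant matrix A with A[i, j] = h[(i - j) mod N], so A @ x = h ⊛ x
A = h[(np.arange(N)[:, None] - np.arange(N)[None, :]) % N]
x_true = rng.standard_normal(N)
y = A @ x_true + 0.05 * rng.standard_normal(N)

# Direct ridge solution (AᵀA + λI)⁻¹ Aᵀ y  — O(N³)
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

# Tikhonov deconvolution in the Fourier domain  — O(N log N)
H, Y = np.fft.fft(h), np.fft.fft(y)
x_fft = np.fft.ifft(np.conj(H) * Y / (np.abs(H)**2 + lam)).real

assert np.allclose(x_ridge, x_fft)
```

The agreement is a direct consequence of the DFT diagonalizing every circulant matrix.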

D.3 Convolution and Kernel Methods

A shift-invariant kernel k(x - y) defines a feature map \phi: x \mapsto k(x, \cdot) such that k(x, y) = \langle \phi(x), \phi(y) \rangle. By Bochner's theorem, k is a positive definite shift-invariant kernel if and only if \hat{k}(\omega) \geq 0 for all \omega - i.e., k is the FT of a non-negative measure.

Random Fourier Features (Rahimi & Recht, 2007): approximate a shift-invariant kernel via Monte Carlo:

k(x - y) = \mathbb{E}_{\omega \sim \hat{k}}\left[e^{i\omega^\top(x-y)}\right] \approx \frac{1}{D}\sum_{j=1}^D e^{i\omega_j^\top x} \left(e^{i\omega_j^\top y}\right)^*

This approximates the kernel evaluation as a dot product of random Fourier features, reducing kernel SVM training from O(n^3) to roughly O(nD).
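A minimal RFF sketch for the Gaussian kernel, whose spectral measure is itself Gaussian (the dimensions, D, σ, and seed are arbitrary illustrative choices; the real-valued cos(ωᵀx + b) feature is the standard real variant of the complex exponential above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 20000, 1.0

# k(x - y) = exp(-||x - y||² / (2σ²)) has spectral density N(0, σ⁻² I)
omega = rng.standard_normal((D, d)) / sigma
b = rng.uniform(0, 2 * np.pi, D)
z = lambda v: np.sqrt(2.0 / D) * np.cos(omega @ v + b)   # random Fourier feature map

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
approx = z(x) @ z(y)                                     # kernel as a dot product

assert abs(exact - approx) < 0.05                        # Monte Carlo error ~ 1/sqrt(D)
```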

Appendix E: Filter Design Methods

E.1 Window Method for FIR Design

The window method designs a length-L FIR filter approximating an ideal frequency response H_d(f):

  1. Compute the ideal (infinite-length) impulse response: h_d[n] = \int H_d(f)\, e^{2\pi i f n}\, df
  2. Truncate to L taps: h_w[n] = h_d[n] \cdot w[n] for n = 0, \ldots, L-1

The window w[n] controls the trade-off between transition width and stopband attenuation. The Hann window gives about -44 dB stopband; the Kaiser window allows continuous tuning via the parameter \beta:

  • \beta = 0: rectangular (-13 dB)
  • \beta = 5.653: Hann-equivalent (-44 dB)
  • \beta = 8.6: -69 dB stopband

Kaiser window formulas:

w[n] = \frac{I_0(\beta\sqrt{1-(2n/L-1)^2})}{I_0(\beta)}, \quad \beta = \begin{cases} 0.1102(\alpha-8.7) & \alpha > 50 \\ 0.5842(\alpha-21)^{0.4} + 0.07886(\alpha-21) & 21 \leq \alpha \leq 50 \\ 0 & \alpha < 21 \end{cases}

where \alpha is the desired stopband attenuation in dB and I_0 is the modified Bessel function.

Minimum filter length for a Kaiser design with stopband attenuation \alpha dB and transition width \Delta f:

L \approx \frac{\alpha - 8}{2.285 \cdot 2\pi \Delta f}
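The recipe maps directly onto scipy.signal (a sketch; the 60 dB spec, 0.1 transition width, and 0.4 cutoff, all normalized so 1 = Nyquist, are arbitrary):

```python
import numpy as np
from scipy import signal

# Target: lowpass, 60 dB stopband, transition width 0.1 (normalized, 1 = Nyquist)
numtaps, beta = signal.kaiserord(ripple=60, width=0.1)
numtaps += (numtaps % 2 == 0)               # force odd length (type I linear phase)
taps = signal.firwin(numtaps, cutoff=0.4, window=("kaiser", beta))

w, h = signal.freqz(taps, worN=8000)        # w in rad/sample, Nyquist = π
stop = w >= (0.4 + 0.1 / 2) * np.pi         # stopband starts half a transition above cutoff
assert 20 * np.log10(np.abs(h[stop]).max()) < -55   # spec met (with a few dB to spare)
```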

E.2 Parks-McClellan (Equiripple) FIR Design

The Parks-McClellan algorithm finds the optimal equiripple FIR filter minimizing the maximum deviation from the desired response. It uses the Remez exchange algorithm on the Chebyshev (minimax) approximation problem.

Key property: For a given filter length L, Parks-McClellan achieves the minimum transition bandwidth subject to equal ripple in passband and stopband. The equiripple solution is optimal in the L^\infty (minimax) sense.

In Python: scipy.signal.remez(L, bands, desired, weight).

E.3 Butterworth IIR Design

The N-th order Butterworth filter is maximally flat in the passband:

|H(j\Omega)|^2 = \frac{1}{1 + (\Omega/\Omega_c)^{2N}}

  • N = 1: simple RC circuit (6 dB/octave rolloff)
  • N = 2: 12 dB/octave
  • N large: approaches the ideal brick-wall response

Poles lie on a circle of radius \Omega_c in the left half-plane, equally spaced by \pi/N:

p_k = \Omega_c\, e^{i\pi(2k+N-1)/(2N)}, \quad k = 1, \ldots, N

In Python: scipy.signal.butter(N, Wn, btype='low', analog=False).

Trade-off: Butterworth has no passband ripple but slower rolloff than Chebyshev or elliptic designs. For applications where in-band flatness is critical (audio, instrumentation), Butterworth is preferred.
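A short scipy.signal check of the two defining properties - exact -3 dB gain at the cutoff and a monotone (maximally flat) magnitude response (the order and cutoff are arbitrary):

```python
import numpy as np
from scipy import signal

# 4th-order digital Butterworth lowpass, cutoff 0.3 (normalized, 1 = Nyquist)
b, a = signal.butter(4, 0.3)

# By construction the gain is -3 dB (1/√2) exactly at the cutoff frequency
w3, h3 = signal.freqz(b, a, worN=[0.3 * np.pi])
assert np.isclose(abs(h3[0]), 1 / np.sqrt(2), atol=1e-6)

# Maximally flat: |H| decreases monotonically from DC to Nyquist (no ripple)
w, h = signal.freqz(b, a, worN=4000)
assert np.all(np.diff(np.abs(h)) <= 1e-10)
```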

Appendix F: Multidimensional Convolution

F.1 2D Convolution

For 2D signals (images), the 2D linear convolution is:

(f * g)(x, y) = \int\!\!\int f(u, v)\, g(x-u, y-v)\, du\, dv

The 2D Convolution Theorem:

\mathcal{F}_{2D}(f * g)(\xi, \eta) = \hat{f}(\xi, \eta) \cdot \hat{g}(\xi, \eta)

where \hat{f}(\xi, \eta) = \int\!\!\int f(x,y)\, e^{-2\pi i(\xi x + \eta y)}\, dx\, dy.

Separable filters. A 2D filter h(x, y) = h_1(x) h_2(y) is separable - its 2D FT factors as \hat{h}(\xi, \eta) = \hat{h}_1(\xi) \hat{h}_2(\eta). Separable filtering can be implemented as two sequential 1D convolutions:

(h * f)(x, y) = (h_1 *_x (h_2 *_y f))(x, y)

Cost: O(N^2(K_1 + K_2)) vs O(N^2 K_1 K_2) for a non-separable filter. Depthwise separable convolutions in MobileNet use a related factorization: a spatial filter applied independently per channel (depthwise), followed by a 1x1 pointwise convolution to mix channels - cutting the multiply count by roughly a factor of K^2 relative to a full K \times K convolution over all channel pairs.
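The separability identity is easy to confirm with scipy.signal (a sketch; the smoother/gradient taps are illustrative):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
h1 = np.array([1.0, 2.0, 1.0])      # vertical smoother
h2 = np.array([-1.0, 0.0, 1.0])     # horizontal gradient

# Full 2D convolution with the outer-product (separable) kernel: O(N² K₁ K₂)
full2d = signal.convolve2d(img, np.outer(h1, h2))

# Two 1D passes, columns then rows: O(N² (K₁ + K₂))
rows = signal.convolve2d(img, h1[:, None])
both = signal.convolve2d(rows, h2[None, :])

assert np.allclose(full2d, both)
```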

F.2 Image Processing Applications

Common 2D convolution kernels and their frequency responses:

| Kernel | FT behavior | Effect |
| --- | --- | --- |
| Gaussian G_\sigma | Gaussian e^{-2\pi^2\sigma^2(\xi^2+\eta^2)} | Lowpass, smooth |
| Laplacian \nabla^2 | -4\pi^2(\xi^2 + \eta^2) | Highpass, edge enhance |
| Sobel (x-direction) | 2\pi i \xi \cdot \text{window} | Horizontal gradient |
| Box filter \frac{1}{K^2}\mathbf{1}_{K\times K} | \operatorname{sinc}(K\xi)\operatorname{sinc}(K\eta) | Lowpass, boxy artifacts |
| Identity \delta | 1 everywhere | No change |

Unsharp masking: f_{\text{sharp}} = f + \alpha(f - G_\sigma * f) = (1 + \alpha)f - \alpha\, G_\sigma * f. In frequency: H(\xi) = 1 + \alpha(1 - e^{-2\pi^2\sigma^2|\xi|^2}) - boosts high frequencies.

Appendix G: Convolution in Sequence Modeling Timeline

The evolution of convolution in deep learning sequence modeling:

| Year | Model | Convolution Type | Key Innovation |
| --- | --- | --- | --- |
| 1989 | LeNet | 2D spatial conv | Translation equivariance for images |
| 1997 | LSTM | Gated recurrence | Implicitly learns long-range IIR response |
| 2016 | WaveNet | 1D dilated causal conv | Receptive field 2^L with L layers |
| 2016 | TCN | Dilated causal conv + residual | General temporal conv network |
| 2017 | Transformer | Cross-correlation (attention) | Global "convolution" with learned kernels |
| 2021 | S4 | SSM = implicit IIR conv, trained as FIR via FFT | HiPPO initialization for long-range memory |
| 2022 | S4D | Diagonal SSM | Simplification: scalar SSM per channel |
| 2022 | FNet | 2D FFT mixing | Fixed unitary mixing replaces attention |
| 2023 | Hyena | Parameterized long conv | Data-controlled conv; sub-quadratic |
| 2023 | Mamba | Selective SSM | Input-dependent scan; best of RNN + conv |
| 2024 | Mamba-2 | State space duality | Connects SSMs to linear attention |

The trajectory is clear: convolution is progressively replacing or supplementing attention as the primary mixing mechanism in large-scale sequence models, driven by the O(N \log N) vs. O(N^2) complexity advantage at long sequence lengths.

Appendix H: Quick Reference

H.1 Convolution Formulas

| Operation | Formula | Cost |
| --- | --- | --- |
| Linear conv (continuous) | (f*g)(t) = \int f(\tau)g(t-\tau)\,d\tau | O(N^2) direct |
| Linear conv (discrete) | (x*h)[n] = \sum_k x[k]h[n-k] | O(MN) direct |
| Circular conv (discrete) | (x \circledast y)[n] = \sum_{m} x[m]y[(n-m)\bmod N] | O(N \log N) via FFT |
| FFT convolution | zero-pad -> FFT -> multiply -> IFFT | O(N \log N) |
| Cross-correlation | (f \star g)(\tau) = \int f^*(t)g(t+\tau)\,dt | O(N \log N) via FFT |
| Autocorrelation | R_{ff}(\tau) = (f \star f)(\tau) | O(N \log N) via FFT |

H.2 Convolution Theorem Summary

| Domain | Time / Space | Frequency |
| --- | --- | --- |
| Continuous | f * g | \hat{f} \cdot \hat{g} |
| Continuous (dual) | f \cdot g | \hat{f} * \hat{g} |
| Discrete (DFT) | x \circledast y | X[k] \cdot Y[k] |
| Discrete (dual) | x[n] \cdot y[n] | \frac{1}{N}(X \circledast Y)[k] |
| Cross-correlation | f \star g | \hat{f}^*(\xi) \cdot \hat{g}(\xi) |
| Wiener-Khinchin | R_{XX}(\tau) | S_{XX}(f) = \hat{R}_{XX}(f) \geq 0 |

H.3 Filter Types Comparison

| Property | FIR | IIR |
| --- | --- | --- |
| Impulse response | Finite (L taps) | Infinite (recursive) |
| Phase | Can be exactly linear | Non-linear (dispersive) |
| Stability | Always BIBO stable | Only if poles are inside the unit circle |
| Computation | O(L) per sample | O(N_{\text{poles}}) per sample |
| FFT-accelerated | Yes, trivially | Not directly |
| Design methods | Window, Parks-McClellan, least squares | Butterworth, Chebyshev, elliptic |
| ML analogy | CNN, WaveNet, S4 kernel | RNN, LSTM (implicit) |

H.4 Overlap-Add vs. Overlap-Save

| Property | Overlap-Add | Overlap-Save |
| --- | --- | --- |
| Strategy | Block convolution + add tails | Circular buffer + discard head |
| FFT size | B + L - 1 | B + L - 1 |
| Memory | 2N | N |
| Per-sample latency | B samples | B samples |
| Optimal block size | B \approx 4L to 8L | Same |
| scipy function | signal.oaconvolve | (manual implementation) |

H.5 Deconvolution Methods

| Method | Formula | When to use |
| --- | --- | --- |
| Naive inverse | \hat{X} = Y/H | Only when H(f) \neq 0 everywhere and there is no noise |
| Wiener filter | \hat{X} = \frac{H^* S_{XX}}{\lvert H\rvert^2 S_{XX} + S_{NN}}\, Y | Known (or estimated) signal/noise spectra |
| Tikhonov / ridge | \hat{X} = \frac{H^*}{\lvert H\rvert^2 + \lambda}\, Y | Simple regularization with one tunable \lambda |
| LASSO deconv | \min_x \lVert y-h*x\rVert_2^2 + \lambda\lVert x\rVert_1 | Sparse signals (spike trains) |
| Total variation | \min_x \lVert y-h*x\rVert_2^2 + \lambda\,\text{TV}(x) | Piecewise-constant signals |

H.6 Key Inequalities

\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 \quad (L^1 \text{ Banach algebra})

\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 \quad (\text{Young, } p=1,\ q=r=2)

\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q, \quad \tfrac{1}{r} = \tfrac{1}{p} + \tfrac{1}{q} - 1 \quad (\text{Young, general})

|R_{ff}(\tau)| \leq R_{ff}(0) = \lVert f \rVert_2^2 \quad (\text{autocorrelation peaks at zero})

\text{SNR}_{\text{matched}} = \frac{2\lVert s \rVert_2^2}{N_0} \quad (\text{matched filter maximum SNR})

Appendix I: Common Signal Pairs and Their Convolutions

Understanding which signals arise as convolutions of common building blocks builds strong intuition.

I.1 Standard Convolution Pairs

| f | g | f * g |
| --- | --- | --- |
| \mathcal{N}(\mu_1, \sigma_1^2) | \mathcal{N}(\mu_2, \sigma_2^2) | \mathcal{N}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2) |
| \Pi_{[0,T]} (rect pulse) | \Pi_{[0,T]} (rect pulse) | \Lambda_{[0,2T]} (triangle) |
| e^{-at}\mathbf{1}_{t\geq 0} | e^{-bt}\mathbf{1}_{t\geq 0} | \frac{e^{-at} - e^{-bt}}{b-a}\mathbf{1}_{t\geq 0} (if a \neq b) |
| \delta(t - t_0) | g(t) | g(t - t_0) (shift) |
| \delta'(t) | g(t) | g'(t) (derivative) |
| 2f_c\operatorname{sinc}(2f_c t) | 2f_c\operatorname{sinc}(2f_c t) | 2f_c\operatorname{sinc}(2f_c t) (idempotent!) |

The sinc idempotence (h_{\text{LP}} * h_{\text{LP}} = h_{\text{LP}}, with h_{\text{LP}}(t) = 2f_c\operatorname{sinc}(2f_c t) the ideal lowpass impulse response) reflects the fact that the ideal lowpass filter is a projection: applying it twice gives the same result as applying it once. In ML terms, this is analogous to idempotent attention patterns.

I.2 Discrete Convolution Pairs

| x[n] | h[n] | (x * h)[n] |
| --- | --- | --- |
| \delta[n] | h[n] | h[n] (impulse response) |
| u[n] (step) | h[n] | \sum_{k=0}^{n} h[k] (running sum) |
| a^n u[n] | b^n u[n] | \frac{a^{n+1} - b^{n+1}}{a - b}\,u[n] (for a \neq b) |
| [1, 2, 1] | [1, 2, 1] | [1, 4, 6, 4, 1] (binomial coefficients) |
| \cos(2\pi f_0 n) | FIR filter h | \lvert H(f_0)\rvert \cos(2\pi f_0 n + \angle H(f_0)) |

I.3 Cascade and Parallel Systems

Cascade (series): If system 1 has impulse response h_1 and system 2 has h_2, the cascade has impulse response h_1 * h_2. Frequency response: H_{\text{cascade}}(f) = H_1(f) \cdot H_2(f).

Parallel: Impulse response h_1 + h_2. Frequency response: H_1(f) + H_2(f).

Feedback (IIR): If the forward path has response H_1 and the feedback path has H_2:

H_{\text{feedback}}(f) = \frac{H_1(f)}{1 - H_1(f)H_2(f)}

This is the origin of poles in IIR filters: the denominator 1 - H_1 H_2 = 0 defines the feedback resonances.


References

  1. Oppenheim, A.V. & Schafer, R.W. (2009). Discrete-Time Signal Processing, 3rd ed. Prentice Hall. - The definitive textbook; Chapters 2-4 cover LTI systems and Z-transform.
  2. Proakis, J.G. & Manolakis, D.G. (2006). Digital Signal Processing, 4th ed. Pearson. - Thorough treatment of filter design.
  3. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. - Dilated causal convolutions for audio.
  4. Gu, A. et al. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. - S4 SSM as FFT convolution.
  5. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752. - Selective SSM.
  6. Poli, M. et al. (2023). Hyena Hierarchy. ICML 2023. - Parameterized long convolutions.
  7. Lee-Thorp, J. et al. (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL 2022. - FFT as attention substitute.
  8. Cohen, T. & Welling, M. (2016). Group Equivariant Convolutional Networks. ICML 2016. - GG-CNNs.
  9. Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press. - The original Wiener filter monograph (first circulated as a classified report in 1942).
  10. Young, W.H. (1912). On the multiplication of successions of Fourier constants. Proc. R. Soc. London. - Original Young's inequality.