Math for LLMs

Convolution Theorem

Fourier Analysis and Signal Processing / Convolution Theorem

"The convolution theorem is the most important theorem in signal processing - it is what makes Fourier analysis a practical computational tool rather than a purely theoretical construct."

  • Alan V. Oppenheim, MIT EECS

Overview

The Convolution Theorem establishes one of the most beautiful and practically consequential dualities in all of mathematics: convolution in the time domain equals pointwise multiplication in the frequency domain. This single identity reduces an O(N2)O(N^2) sliding-dot-product computation to O(NlogN)O(N \log N) FFT-based arithmetic - a transformation that made real-time digital audio processing possible in the 1970s and powers nearly every modern deep learning architecture today.

This section develops the Convolution Theorem rigorously in both its continuous and discrete forms, builds the surrounding theory of linear time-invariant (LTI) systems and filter design, treats efficient block-convolution algorithms (overlap-add, overlap-save), and traces the theorem's consequences through cross-correlation, the Wiener-Khinchin theorem, and deconvolution. The second half connects these classical results to modern AI: convolutional neural networks as learned frequency filters, WaveNet's causal dilated convolutions, state space models (S4, Mamba) as implicit convolution operators, the Hyena hierarchy's parameterized long convolutions, and FNet's replacement of multi-head attention with a single 2D FFT.

The Convolution Theorem is not merely useful - it is a ring isomorphism. The DFT diagonalizes the circular convolution algebra C[Z/NZ]\mathbb{C}[\mathbb{Z}/N\mathbb{Z}], turning a non-commutative-looking operation into trivial pointwise multiplication. Understanding why this works - through the lens of group representation theory and Fourier analysis on abelian groups - gives deep insight into why spectral methods are so pervasive in modern sequence modeling.

Prerequisites

  • DFT and FFT - DFT definition, circular convolution preview, FFT algorithm, O(NlogN)O(N \log N) complexity (Section 20-03)
  • Fourier Transform - continuous FT definition (ξ\xi-convention), Plancherel's theorem, L1L^1 and L2L^2 theory (Section 20-02)
  • Fourier Series - complex exponential basis, Parseval's identity (Section 20-01)
  • Linear algebra - inner products, unitary matrices, change of basis (Section 02)
  • Complex analysis - eiθe^{i\theta}, modulus, argument, roots of unity (Section 01)

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Convolution theorem proofs, LTI frequency response, overlap-add, Wiener filter, CNN filters, S4/Mamba SSM, Hyena, FNet |
| exercises.ipynb | 10 graded problems: convolution by hand through Hyena long-convolution parameterization |

Learning Objectives

After completing this section, you will:

  1. State and prove the Convolution Theorem in both continuous (L1L^1/L2L^2) and discrete (DFT) forms
  2. Distinguish circular convolution from linear convolution and apply the zero-padding correction
  3. Define an LTI system via its impulse response and derive the frequency response H(f)=F(h)(f)H(f) = \mathcal{F}(h)(f)
  4. Analyze the group delay ddfH(f)-\frac{d}{df}\angle H(f) and explain its significance for signal distortion
  5. Compare FIR and IIR filter designs: stability, linear phase, computational cost
  6. Implement the overlap-add algorithm for O(NlogN)O(N \log N) block convolution of long sequences
  7. Prove the cross-correlation theorem and derive the Wiener-Khinchin theorem (PSD = FT of autocorrelation)
  8. Derive the Wiener deconvolution filter from the MSE minimization principle
  9. Interpret CNN convolutional layers as learned LTI frequency filters in the spatial domain
  10. Explain how S4/Mamba SSMs compute output via FFT-based convolution in training mode
  11. Describe Hyena's parameterized long convolutions and their O(NlogN)O(N \log N) complexity
  12. Identify and fix the 10 most common convolution errors in numerical implementations


1. Intuition

1.1 What Is Convolution?

Convolution is a mathematical operation that blends two functions by sliding one over the other and measuring their overlap at each position. More precisely, the linear convolution of ff and gg at position tt is the integral of the pointwise product of ff with a reversed, shifted copy of gg:

(fg)(t)=f(τ)g(tτ)dτ(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

The operation appears in an enormous variety of contexts, often without being recognized as convolution:

  • Image blurring: convolving an image with a Gaussian kernel g(x,y)=e(x2+y2)/2σ2g(x,y) = e^{-(x^2+y^2)/2\sigma^2} produces a smoothed version. Each output pixel is a weighted average of its neighborhood.
  • Running average: a box filter g[n]=1M1[0,M1]g[n] = \frac{1}{M}\mathbf{1}_{[0,M-1]} convolved with a time series produces the MM-point moving average - standard in financial data smoothing.
  • Polynomial multiplication: multiplying (a0+a1x+a2x2)(a_0 + a_1 x + a_2 x^2) by (b0+b1x)(b_0 + b_1 x) gives coefficients that are exactly the linear convolution of [a0,a1,a2][a_0, a_1, a_2] and [b0,b1][b_0, b_1]. This is why the FFT can multiply two degree-NN polynomials in O(NlogN)O(N \log N) operations.
  • Echo simulation: convolving an audio signal with a room impulse response h(t)h(t) (the sound of a clap in an empty hall) produces the "reverberated" signal.
  • Probability: the PDF of the sum Z=X+YZ = X + Y of independent random variables is (pXpY)(z)(p_X * p_Y)(z) - convolution of their PDFs.

The defining algebraic property is that convolution is commutative (fg=gff * g = g * f), associative ((fg)h=f(gh)(f * g) * h = f * (g * h)), and distributive over addition. These properties make it a natural operation in any linear system.

For AI: every convolutional layer in a CNN computes exactly this operation. A 3×33 \times 3 filter WW convolved with feature map XX gives Y=WXY = W * X - the kernel slides over the input, taking inner products at each position. Understanding convolution as sliding-overlap reveals why translation equivariance holds: if the input shifts, the output shifts by the same amount.
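The sliding-overlap picture is easy to see concretely. A minimal NumPy sketch: convolving a short series with a 3-point box kernel produces exactly the 3-point moving average from the list above.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 6.0])
box = np.ones(3) / 3  # 3-point moving-average kernel

# 'valid' keeps only the positions where the kernel fully overlaps x
y = np.convolve(x, box, mode="valid")
print(y)  # averages of each consecutive triple: [7/3, 14/3, 16/3]
```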

1.2 The Central Insight: Frequency Domain Diagonalization

Computing convolution directly costs O(N2)O(N^2) operations: for each of NN output positions, you compute an inner product with NN filter coefficients. For N=106N = 10^6 samples (one second of audio at 1 MHz), this is 101210^{12} multiplications - completely infeasible.

The Convolution Theorem provides a dramatic shortcut. The key observation is that complex exponentials are eigenfunctions of convolution:

(he2πif)(t)=(h(τ)e2πifτdτ)h^(f)=H(f)e2πift(h * e^{2\pi i f \cdot})(t) = \underbrace{\left(\int_{-\infty}^{\infty} h(\tau) e^{-2\pi i f \tau} d\tau\right)}_{\hat{h}(f) = H(f)} \cdot e^{2\pi i f t}

Each complex exponential e2πifte^{2\pi i ft} passes through any LTI system unchanged in frequency - it is only scaled by the complex number H(f)=h^(f)H(f) = \hat{h}(f). This means:

  1. Decompose the input xx into its frequency components via the Fourier transform: X(f)=F(x)(f)X(f) = \mathcal{F}(x)(f)
  2. Multiply each component by the filter's frequency response: Y(f)=H(f)X(f)Y(f) = H(f) \cdot X(f) (pointwise, O(N)O(N))
  3. Reconstruct the output via the inverse Fourier transform: y=F1(Y)y = \mathcal{F}^{-1}(Y)

Steps 1 and 3 each cost O(NlogN)O(N \log N) via the FFT. Step 2 costs O(N)O(N). The total is O(NlogN)O(N \log N) - compared to O(N2)O(N^2) for direct convolution.

CONVOLUTION THEOREM: TWO EQUIVALENT COMPUTATIONS
========================================================================

  TIME DOMAIN                        FREQUENCY DOMAIN
  ----------------------             ----------------------
  x[n], h[n]  (lengths N, M)         X[k] = DFT(x),  H[k] = DFT(h)
       |                                  |
       v                                  v
  y = x * h                          Y[k] = X[k] · H[k]
  O(NM) operations                        |
                                          v
                                     y = IDFT(Y)
                                     O(N log N) via FFT

  Both computations give IDENTICAL output y[n].

========================================================================

The mathematical reason this works is deep: the DFT is a ring isomorphism between (CN,)(\mathbb{C}^N, \circledast) - the ring of sequences under circular convolution - and (CN,)(\mathbb{C}^N, \cdot) - the ring of sequences under pointwise multiplication. Convolution, which looks complicated, becomes trivial multiplication after a change of basis.
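This ring isomorphism can be checked numerically in a few lines, assuming NumPy: the direct circular-convolution sum and the DFT-multiply-IDFT route give the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x, y = rng.standard_normal(N), rng.standard_normal(N)

# direct circular convolution: (x ⊛ y)[n] = sum_m x[m] y[(n - m) mod N]
circ = np.array([sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)])

# frequency-domain route: IDFT(DFT(x) · DFT(y))
freq = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(circ, freq)
```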

1.3 Why This Matters for AI (First Look)

The Convolution Theorem is not merely a classical signal processing tool - it is at the heart of modern deep learning architectures:

  • CNNs: every torch.nn.Conv2d layer computes a learned 2D convolution. For large kernels, PyTorch uses FFT-based convolution internally. The frequency domain perspective explains why early CNN layers learn Gabor-like filters (oriented bandpass filters) and why deep layers respond to semantic content.
  • WaveNet (van den Oord et al., 2016): autoregressive waveform generation using dilated causal convolutions. The receptive field grows exponentially with depth via dilation, reaching 2L12^L - 1 samples with only LL layers.
  • S4 (Gu et al., 2021): represents a sequence model as a linear recurrence ht=Aht1+Bxt\mathbf{h}_t = A\mathbf{h}_{t-1} + B x_t, yt=Chty_t = C\mathbf{h}_t. During training, the recurrence is unrolled into a convolution y=Kˉxy = \bar{\mathbf{K}} * x where Kˉ=[CBˉ,CAˉBˉ,CAˉ2Bˉ,]\bar{\mathbf{K}} = [C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, \ldots] is the SSM kernel, computed in O(NlogN)O(N \log N) via FFT.
  • Hyena (Poli et al., 2023): replaces attention with dd learned long convolutions h1,,hdh_1, \ldots, h_d, each parameterized by a neural network. Complexity is O(NlogN)O(N \log N) per layer vs. O(N2)O(N^2) for attention.
  • FNet (Lee-Thorp et al., 2022): replaces multi-head attention with a single 2D FFT mixing operation, achieving 92% of BERT's GLUE score with up to a 7× speedup.

1.4 Historical Timeline

| Year | Event |
|---|---|
| 1821 | Cauchy identifies convolution in complex function theory |
| 1822 | Fourier's Théorie analytique de la chaleur - FT diagonalizes the heat equation (an LTI PDE) |
| 1887 | Heaviside introduces operational calculus; transfer functions for electrical circuits |
| 1920s | Norbert Wiener develops autocorrelation and power spectral density theory |
| 1942 | Wiener invents the optimal Wiener deconvolution filter |
| 1965 | Cooley & Tukey - FFT makes the convolution theorem computationally practical |
| 1970s | Digital signal processing revolution: digital filters, audio processing |
| 1989 | LeCun et al. - LeNet: CNNs for handwritten digit recognition |
| 1998 | LeNet-5 - modern CNN architecture with learned convolutional filters |
| 2012 | AlexNet - deep CNNs dominate ImageNet; GPU-accelerated conv layers |
| 2016 | WaveNet (DeepMind) - dilated causal convolutions for audio generation |
| 2021 | S4 (Stanford) - SSMs as implicit FFT convolutions for long sequences |
| 2022 | FNet (Google) - FFT replaces attention at competitive performance |
| 2023 | Hyena / Mamba - parameterized and selective convolutions for sub-quadratic LLMs |

2. Formal Definitions

2.1 Linear (Aperiodic) Convolution

Definition (Continuous Linear Convolution). For f,gL1(R)f, g \in L^1(\mathbb{R}), the convolution of ff and gg is:

(fg)(t)=f(τ)g(tτ)dτ(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

The integral is well-defined a.e. and fgL1(R)f * g \in L^1(\mathbb{R}) by Young's inequality.

Definition (Discrete Linear Convolution). For sequences x1(Z)x \in \ell^1(\mathbb{Z}) and h1(Z)h \in \ell^1(\mathbb{Z}):

(xh)[n]=k=x[k]h[nk](x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k]

For finite sequences of lengths MM and NN, the output has length M+N1M + N - 1.

Algebraic properties:

  • Commutativity: fg=gff * g = g * f (proof: substitute u=tτu = t - \tau)
  • Associativity: (fg)h=f(gh)(f * g) * h = f * (g * h) (follows from Fubini's theorem)
  • Distributivity: f(g+h)=fg+fhf * (g + h) = f * g + f * h
  • Identity: fδ=ff * \delta = f where δ\delta is the Dirac delta (continuous) or Kronecker delta (discrete)
  • Scaling: (af)g=a(fg)(af) * g = a(f * g) for aCa \in \mathbb{C}

Support rule: If ff is supported on [a1,a2][a_1, a_2] and gg on [b1,b2][b_1, b_2], then fgf * g is supported on [a1+b1,a2+b2][a_1 + b_1, a_2 + b_2]. For finite sequences of lengths MM and NN, the output has exactly M+N1M + N - 1 non-zero samples.

Standard examples:

Example 1 (Gaussian * Gaussian): If f=N(0,σ12)f = \mathcal{N}(0, \sigma_1^2) and g=N(0,σ22)g = \mathcal{N}(0, \sigma_2^2) (as PDFs), then:

(fg)(t)=N(0,σ12+σ22)(t)(f * g)(t) = \mathcal{N}(0, \sigma_1^2 + \sigma_2^2)(t)

Convolution of Gaussians is Gaussian with variance sum. This is the sum-of-independent-normals theorem.
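The variance-sum rule can be spot-checked on a grid, a sketch assuming NumPy: convolving sampled N(0, 1) and N(0, 1.5²) densities (Riemann-sum approximation) yields a density whose variance is 1 + 2.25 = 3.25.

```python
import numpy as np

dt = 0.01
t = np.arange(-10, 10, dt)  # wide enough that truncated tails are negligible

def gauss(t, s):
    return np.exp(-t**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

h = np.convolve(gauss(t, 1.0), gauss(t, 1.5)) * dt  # Riemann-sum convolution
tt = np.arange(len(h)) * dt + 2 * t[0]              # support grid of the output

var = np.sum(tt**2 * h) * dt - (np.sum(tt * h) * dt) ** 2
print(var)  # ≈ 3.25 = 1.0² + 1.5²
```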

Example 2 (rect * rect = triangle): The convolution of two rectangular pulses Π[0,1]\Pi_{[0,1]} is the triangle function Λ\Lambda supported on [0,2][0, 2]. Each rect * rect gives a piecewise-linear tent.

Example 3 (polynomial multiplication): (1+2x)(3+4x)=3+10x+8x2(1 + 2x)(3 + 4x) = 3 + 10x + 8x^2. As convolution: [1,2][3,4]=[3,10,8][1, 2] * [3, 4] = [3, 10, 8].

Non-examples:

  • fgf * g is NOT the pointwise product fgf \cdot g - these are entirely different operations.
  • The convolution of two non-integrable functions (e.g., constants) is not generally well-defined in L1L^1.

2.2 Circular (Periodic) Convolution

Definition (Circular Convolution). For x,yCNx, y \in \mathbb{C}^N, the NN-point circular (periodic) convolution is:

(xy)[n]=m=0N1x[m]y[(nm)modN],n=0,1,,N1(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\, y[(n - m) \bmod N], \quad n = 0, 1, \ldots, N-1

Circular convolution wraps the index modulo NN: sample y[1]y[-1] is the same as y[N1]y[N-1].

The key difference from linear convolution: circular convolution assumes both signals are periodic with period NN. When the input is not actually periodic, the wrap-around creates time-aliasing - contributions from the "end" of the sequence wrap around and contaminate the "beginning" of the output. This produces incorrect results whenever yy is not truly NN-periodic.

Matrix form: Circular convolution is exactly a matrix-vector product with a circulant matrix:

y=Chx,(Ch)ij=h[(ij)modN]y = C_h \mathbf{x}, \qquad (C_h)_{ij} = h[(i-j) \bmod N]

Every circulant matrix is diagonalized by the DFT matrix FNF_N:

Ch=FN1diag(H)FN,H=FNhC_h = F_N^{-1} \operatorname{diag}(H) F_N, \quad H = F_N \mathbf{h}

This is precisely the Convolution Theorem in matrix form.
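The factorization can be verified directly, assuming NumPy: build the circulant matrix for a random h, then reconstruct it as F⁻¹ diag(H) F.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
h = rng.standard_normal(N)

# circulant matrix (C_h)_{ij} = h[(i - j) mod N]
C = np.array([[h[(i - j) % N] for j in range(N)] for i in range(N)])

# DFT matrix: F @ x computes the DFT of x
F = np.fft.fft(np.eye(N))
H = np.fft.fft(h)
C_reconstructed = np.linalg.inv(F) @ np.diag(H) @ F

assert np.allclose(C, C_reconstructed)
```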

When circular = linear: If xx has length MM and hh has length LL, zero-pad both to length NM+L1N \geq M + L - 1. Then NN-point circular convolution equals linear convolution exactly.

2.3 Cross-Correlation and Autocorrelation

Definition (Cross-Correlation). The cross-correlation of ff and gg is:

(fg)(t)=f(τ)g(t+τ)dτ=(f()g)(t)(f \star g)(t) = \int_{-\infty}^{\infty} f^*(\tau)\, g(t + \tau)\, d\tau = (f^*(- \cdot) * g)(t)

Note: cross-correlation is convolution with the time-reversed conjugate of the first argument. It is not commutative: (fg)(t)=(gf)(t)(f \star g)(t) = (g \star f)^*(-t).

Cross-correlation measures the similarity between ff and a shifted copy of gg. The peak of (fg)(t)|(f \star g)(t)| occurs at the lag tt where gg most resembles ff.

Definition (Autocorrelation). The autocorrelation of ff is:

Rff(τ)=(ff)(τ)=f(t)f(t+τ)dtR_{ff}(\tau) = (f \star f)(\tau) = \int_{-\infty}^{\infty} f^*(t)\, f(t + \tau)\, dt

Properties:

  • Rff(0)=f22R_{ff}(0) = \lVert f \rVert_2^2 (maximum at zero lag)
  • Rff(τ)=Rff(τ)R_{ff}(\tau) = R_{ff}^*(-\tau) (Hermitian symmetry)
  • Rff(τ)Rff(0)|R_{ff}(\tau)| \leq R_{ff}(0) for all τ\tau

For AI: Cross-correlation is the operation actually computed in standard "convolution" layers in PyTorch and TensorFlow. The torch.nn.Conv2d layer computes (Wx)(W \star x) (cross-correlation), not (Wx)(W * x) (true convolution). Since the filter weights are learned, this distinction does not affect expressive power - the network simply learns the time-reversed filter.
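The flip relationship between the two operations can be checked numerically, a sketch with NumPy (for complex inputs, np.correlate conjugates its second argument):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([0.5, -1.0, 2.0])

# cross-correlation = convolution with the time-reversed filter
xcorr = np.correlate(a, v, mode="full")
conv_flipped = np.convolve(a, v[::-1], mode="full")

assert np.allclose(xcorr, conv_flipped)
```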

2.4 The Convolution Algebra

The space L1(R)L^1(\mathbb{R}) equipped with convolution is a commutative Banach algebra:

  • Vector space: (L1(R),+,)(L^1(\mathbb{R}), +, \cdot)
  • Algebra: equipped with multiplication (f,g)fg(f, g) \mapsto f * g
  • Banach: complete normed space with fg1f1g1\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1

The discrete version: 1(Z)\ell^1(\mathbb{Z}) with discrete convolution is a Banach algebra under the same submultiplicativity bound.

The cyclic group algebra C[Z/NZ]CN\mathbb{C}[\mathbb{Z}/N\mathbb{Z}] \cong \mathbb{C}^N with circular convolution is a commutative semisimple algebra over C\mathbb{C}. By Maschke's theorem (since char(C)N\operatorname{char}(\mathbb{C}) \nmid N), it decomposes as:

C[Z/NZ]k=0N1Cek\mathbb{C}[\mathbb{Z}/N\mathbb{Z}] \cong \bigoplus_{k=0}^{N-1} \mathbb{C} \cdot e_k

where eke_k are the primitive idempotents - the DFT basis vectors ωNkn\omega_N^{kn}. The DFT is exactly the isomorphism to the direct sum. Convolution in the group algebra becomes pointwise multiplication in the direct sum decomposition - this is the algebraic content of the Convolution Theorem.
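The idempotent property is easy to exhibit numerically: e_k[n] = ω_N^{kn}/N satisfies e_k ⊛ e_k = e_k. A sketch using the FFT route for circular convolution:

```python
import numpy as np

N, k = 8, 3
n = np.arange(N)
e_k = np.exp(2j * np.pi * k * n / N) / N  # primitive idempotent

# circular convolution via the FFT route
circ = np.fft.ifft(np.fft.fft(e_k) * np.fft.fft(e_k))

assert np.allclose(circ, e_k)  # e_k ⊛ e_k = e_k
```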

3. The Convolution Theorem

3.1 Continuous Version

Theorem (Convolution Theorem, Continuous). Let f,gL1(R)f, g \in L^1(\mathbb{R}). Then fgL1(R)f * g \in L^1(\mathbb{R}) and:

F(fg)(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f * g)(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)

where f^(ξ)=f(t)e2πiξtdt\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt (using the ξ\xi-convention consistent with Section 20-02).

Proof. Compute directly using Fubini's theorem (justified by fg1f1g1<\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 < \infty):

F(fg)(ξ)=[f(τ)g(tτ)dτ]e2πiξtdt\mathcal{F}(f * g)(\xi) = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau \right] e^{-2\pi i \xi t}\, dt

Swap integration order and substitute u=tτu = t - \tau:

=f(τ)[g(u)e2πiξ(u+τ)du]dτ=f(τ)e2πiξτdτg^(ξ)=f^(ξ)g^(ξ)= \int_{-\infty}^{\infty} f(\tau) \left[\int_{-\infty}^{\infty} g(u)\, e^{-2\pi i \xi (u+\tau)}\, du \right] d\tau = \int_{-\infty}^{\infty} f(\tau)\, e^{-2\pi i \xi \tau}\, d\tau \cdot \hat{g}(\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi) \qquad \square

Extension to L2L^2: For fL1L2f \in L^1 \cap L^2 and gL2g \in L^2, the theorem still holds in L2L^2 by a density argument, with fg2f1g2\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 (a special case of Young's inequality).

Corollary (Shift Theorem). If ga(t)=g(ta)g_a(t) = g(t - a) is a shifted version of gg:

g^a(ξ)=e2πiaξg^(ξ)\hat{g}_a(\xi) = e^{-2\pi i a \xi} \hat{g}(\xi)

This follows from the Convolution Theorem applied to gδag * \delta_a where δ^a(ξ)=e2πiaξ\hat{\delta}_a(\xi) = e^{-2\pi i a \xi}.

For AI: Dilated convolutions in WaveNet compute (hxdilated)[n](h * x_{\text{dilated}})[n] where xdilated[n]=x[n/d]x_{\text{dilated}}[n] = x[n/d] for dilation factor dd. In frequency: F(xdilated)(ξ)=dx^(dξ)\mathcal{F}(x_{\text{dilated}})(\xi) = |d| \cdot \hat{x}(d\xi) - dilation in time becomes compression in frequency. Each dilation level captures different frequency bands.

3.2 Discrete Version

Theorem (Convolution Theorem, Discrete). For x,yCNx, y \in \mathbb{C}^N with DFT X=F(x)X = \mathcal{F}(x) and Y=F(y)Y = \mathcal{F}(y):

F(xy)[k]=X[k]Y[k],k=0,1,,N1\mathcal{F}(x \circledast y)[k] = X[k] \cdot Y[k], \quad k = 0, 1, \ldots, N-1

Equivalently: xy=F1(XY)x \circledast y = \mathcal{F}^{-1}(X \cdot Y).

Proof. Compute the DFT of the circular convolution directly:

F(xy)[k]=n=0N1[m=0N1x[m]y[(nm)modN]]ωNnk\mathcal{F}(x \circledast y)[k] = \sum_{n=0}^{N-1} \left[\sum_{m=0}^{N-1} x[m]\, y[(n-m) \bmod N]\right] \omega_N^{-nk}

where ωN=e2πi/N\omega_N = e^{2\pi i / N}. Swap sums and substitute j=(nm)modNj = (n-m) \bmod N:

=m=0N1x[m]ωNmkj=0N1y[j]ωNjk=X[k]Y[k]= \sum_{m=0}^{N-1} x[m] \omega_N^{-mk} \sum_{j=0}^{N-1} y[j]\, \omega_N^{-jk} = X[k] \cdot Y[k] \qquad \square

Normalization bookkeeping. NumPy's np.fft.ifft divides by NN, so the full pipeline is:

Y_k = X_k * H_k          (pointwise, no 1/N)
y = np.fft.ifft(Y_k)     (divides by N automatically)

Avoid double-dividing by NN when mixing manual DFT with NumPy functions.

Algorithm: FFT-based convolution of xx (length MM) with hh (length LL):

  1. Choose NM+L1N \geq M + L - 1, next power of 2: N=2log2(M+L1)N = 2^{\lceil \log_2(M+L-1) \rceil}
  2. Zero-pad: x~[n]=x[n]\tilde{x}[n] = x[n] for n<Mn < M, else 00; similarly h~\tilde{h}
  3. X=FFT(x~)X = \text{FFT}(\tilde{x}), H=FFT(h~)H = \text{FFT}(\tilde{h})
  4. Y[k]=X[k]H[k]Y[k] = X[k] \cdot H[k] for all kk
  5. y=IFFT(Y)y = \text{IFFT}(Y), keep first M+L1M + L - 1 samples

Total cost: 3O(NlogN)3 \cdot O(N \log N) for three FFTs + O(N)O(N) for pointwise multiply.
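The five steps above, as a minimal NumPy sketch (the function name fft_convolve is ours), checked against np.convolve:

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution of real sequences via zero-padded FFTs."""
    M, L = len(x), len(h)
    N = 2 ** int(np.ceil(np.log2(M + L - 1)))  # step 1: FFT size
    X = np.fft.rfft(x, n=N)                    # steps 2-3: zero-pad + FFT
    H = np.fft.rfft(h, n=N)
    Y = X * H                                  # step 4: pointwise product
    return np.fft.irfft(Y, n=N)[:M + L - 1]    # step 5: IFFT, keep M+L-1 samples

rng = np.random.default_rng(2)
x, h = rng.standard_normal(100), rng.standard_normal(7)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```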

3.3 The Multiplication Theorem (Dual)

Theorem (Multiplication / Dual Convolution Theorem). For f,gL1(R)L2(R)f, g \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}):

F(fg)(ξ)=(f^g^)(ξ)=f^(η)g^(ξη)dη\mathcal{F}(f \cdot g)(\xi) = (\hat{f} * \hat{g})(\xi) = \int_{-\infty}^{\infty} \hat{f}(\eta)\, \hat{g}(\xi - \eta)\, d\eta

Pointwise multiplication in time = convolution in frequency. This is exact duality with the Convolution Theorem.

Consequences:

  • Windowing: Multiplying a signal by a window w(t)w(t) (e.g., Hann window) convolves its spectrum with w^(ξ)\hat{w}(\xi). This is the spectral leakage mechanism studied in Section 20-03: the wider the window in time, the narrower w^\hat{w} in frequency, and the less leakage.
  • Amplitude modulation: x(t)cos(2πf0t)=x(t)12(e2πif0t+e2πif0t)x(t) \cos(2\pi f_0 t) = x(t) \cdot \frac{1}{2}(e^{2\pi i f_0 t} + e^{-2\pi i f_0 t}) shifts the spectrum to ±f0\pm f_0. In the frequency domain: X^(ξf0)/2+X^(ξ+f0)/2\hat{X}(\xi - f_0)/2 + \hat{X}(\xi + f_0)/2.
  • Parseval's theorem: setting g = \bar{f} (the complex conjugate) and \xi = 0 recovers \int |f(t)|^2 dt = \int |\hat{f}(\xi)|^2 d\xi.

3.4 Parseval and Convolution

Parseval's theorem for convolution. Combining Plancherel and the Convolution Theorem:

fg22=f^(ξ)g^(ξ)2dξ=f^g^22\lVert f * g \rVert_2^2 = \int |\hat{f}(\xi) \hat{g}(\xi)|^2 d\xi = \lVert \hat{f} \cdot \hat{g} \rVert_2^2

Young's convolution inequality. For 1r=1p+1q1\frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1 with 1p,q,r1 \leq p, q, r \leq \infty:

fgrfpgq\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q

Special cases:

  • p=1,q=r=2p = 1, q = r = 2: fg2f1g2\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 (used in filter stability analysis)
  • p=q=1,r=1p = q = 1, r = 1: fg1f1g1\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 (Banach algebra property)
  • p=1,q=,r=p = 1, q = \infty, r = \infty: fgf1g\lVert f * g \rVert_\infty \leq \lVert f \rVert_1 \lVert g \rVert_\infty

Energy interpretation: The energy of the filtered output is bounded by the filter's L1L^1 norm times the input's L2L^2 energy. A filter with h11\lVert h \rVert_1 \leq 1 is energy non-increasing - it cannot amplify energy.
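The discrete analogues of these bounds are easy to spot-check, assuming NumPy (here the norms are the usual vector p-norms):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.standard_normal(50)
g = rng.standard_normal(80)
y = np.convolve(f, g)

# p = 1, q = r = 2:  ||f * g||_2 <= ||f||_1 ||g||_2
assert np.linalg.norm(y, 2) <= np.linalg.norm(f, 1) * np.linalg.norm(g, 2)

# p = q = r = 1:  ||f * g||_1 <= ||f||_1 ||g||_1  (Banach algebra property)
assert np.linalg.norm(y, 1) <= np.linalg.norm(f, 1) * np.linalg.norm(g, 1)
```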

4. Circular vs. Linear Convolution

4.1 The Wrap-Around Problem

Circular convolution assumes periodicity. When you apply the DFT-multiply-IDFT recipe to two finite sequences without zero-padding, you compute circular convolution - and the result is not the same as linear convolution.

The wrap-around artifact. Suppose x = [1, 2, 3] and h = [1, 1] (length 2, zero-padded to N = 3). Linear convolution gives y_lin = [1, 3, 5, 3] (length 4). But the 3-point circular convolution gives:

(xh)[0]=x[0]h[0]+x[2]h[1]=1+3=4(x \circledast h)[0] = x[0]h[0] + x[2]h[1] = 1 + 3 = 4

because the term x[1mod3]=x[2]x[-1 \bmod 3] = x[2] wraps around. The correct linear result would have ylin[0]=1y_{\text{lin}}[0] = 1.

Time-aliasing analogy. Circular convolution on a length-NN DFT is the time-domain analogue of frequency-domain aliasing. Just as undersampling causes spectral copies to overlap (aliasing in frequency), using an insufficient DFT size causes time-domain copies to overlap (aliasing in time).

CIRCULAR vs LINEAR CONVOLUTION
========================================================================

  x = [1, 2, 3],  h = [1, 1]

  LINEAR (correct):                CIRCULAR (N=3, wrap-around error):
  y[0] = 1·1            = 1        y[0] = 1·1 + 3·1 = 4  <- WRONG
  y[1] = 2·1 + 1·1      = 3        y[1] = 2·1 + 1·1 = 3  OK
  y[2] = 3·1 + 2·1      = 5        y[2] = 3·1 + 2·1 = 5  OK
  y[3] =       3·1      = 3        (missing term - aliased into y[0])
  Output length: 4                 Output length: 3

========================================================================
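The same example, reproduced with NumPy: the unpadded FFT route returns the circular (aliased) result, while padding to length M + L - 1 = 4 recovers linear convolution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
h = np.array([1.0, 1.0])

# no padding: 3-point circular convolution (h implicitly zero-padded to 3)
circ = np.fft.ifft(np.fft.fft(x, 3) * np.fft.fft(h, 3)).real
print(circ)  # [4. 3. 5.]  <- y[0] corrupted by wrap-around

# padding to N >= M + L - 1 = 4: exact linear convolution
lin = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real
print(lin)   # [1. 3. 5. 3.]
assert np.allclose(lin, np.convolve(x, h))
```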

4.2 The Zero-Padding Solution

Theorem. Let xCMx \in \mathbb{C}^M and hCLh \in \mathbb{C}^L. If we zero-pad both to length NM+L1N \geq M + L - 1 before computing the DFT, then the first M+L1M + L - 1 samples of the NN-point IDFT of XHX \cdot H equal the linear convolution xlinhx *_{\text{lin}} h.

Why it works. The linear convolution output has M+L1M + L - 1 non-zero samples. Padding to NM+L1N \geq M + L - 1 ensures the circular buffer is large enough that the "wrap-around" tail falls on zero-padded positions - so no aliasing occurs.

Practical recipe:

import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))  # next power of 2
X = np.fft.rfft(x, n=N)
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]  # trim to correct length

Using rfft (real FFT) halves the computation for real-valued inputs.

Power-of-2 choice. Padding to the next power of 2 ensures the Cooley-Tukey FFT uses radix-2 butterflies, maximizing efficiency. For some lengths, mixed-radix FFTs (e.g., N = 3 × 5 × 7) can be as fast or faster; scipy.fft.next_fast_len returns the smallest efficient FFT size at least as large as a given length, and scipy.signal.fftconvolve uses it automatically.

4.3 Cost Analysis

For convolving a signal of length MM with a filter of length LL:

| Method | Flops | When optimal |
|---|---|---|
| Direct linear convolution | O(ML) | L \ll M (short filter) |
| FFT-based (N = M + L - 1) | O(N \log N) | L comparable to M |
| Overlap-add (blocks of size B) | O(M \log L) | M \gg L (long signal, fixed filter) |

Crossover point. FFT convolution beats direct when MLNlog2NML \gtrsim N \log_2 N, approximately:

LNlog2NMlog2N20 (for N=106)L \gtrsim \frac{N \log_2 N}{M} \approx \log_2 N \approx 20 \text{ (for } N = 10^6 \text{)}

For filters longer than 20\sim 20 taps, FFT-based convolution is almost always faster in practice. For very short filters (e.g., 3×33 \times 3 CNN kernels), direct convolution with SIMD instructions often wins due to lower overhead.

5. LTI Systems and Filter Theory

5.1 Linear Time-Invariant Systems

A system H:xy\mathcal{H}: x \mapsto y is linear if it satisfies superposition:

H(ax+bz)=aH(x)+bH(z)a,bC\mathcal{H}(ax + bz) = a\mathcal{H}(x) + b\mathcal{H}(z) \quad \forall a, b \in \mathbb{C}

It is time-invariant if a time shift in input produces the same time shift in output:

if y=H(x), then H(x(τ))=y(τ)τ\text{if } y = \mathcal{H}(x), \text{ then } \mathcal{H}(x(\cdot - \tau)) = y(\cdot - \tau) \quad \forall \tau

Characterization theorem: Every LTI system is completely described by its impulse response h(t)=H(δ)(t)h(t) = \mathcal{H}(\delta)(t). The output for any input xx is:

y(t)=(hx)(t)=h(τ)x(tτ)dτy(t) = (h * x)(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t-\tau)\, d\tau

Causality: h(t)=0h(t) = 0 for t<0t < 0. A causal system cannot depend on future inputs - essential for real-time processing. The convolution simplifies to y(t)=0h(τ)x(tτ)dτy(t) = \int_0^{\infty} h(\tau) x(t - \tau) d\tau.

Stability (BIBO): An LTI system is bounded-input bounded-output (BIBO) stable if and only if:

h1=h(t)dt<\lVert h \rVert_1 = \int_{-\infty}^{\infty} |h(t)|\, dt < \infty

This is the L1L^1 condition - exactly the domain in which the Convolution Theorem holds most cleanly.

For AI: Recurrent neural networks can be analyzed through the LTI lens. A linear RNN ht=Aht1+bxt\mathbf{h}_t = A\mathbf{h}_{t-1} + \mathbf{b} x_t has impulse response h[n]=CAnBh[n] = CA^n B for n0n \geq 0. BIBO stability requires ρ(A)<1\rho(A) < 1 (spectral radius). S4 enforces stability by constraining AA to have negative-real eigenvalues via the HiPPO initialization.
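The unrolling claimed here can be verified directly: a linear recurrence with kernel h[n] = C Aⁿ B produces exactly the convolution of the input with that kernel. A sketch with small random matrices (names A, B, C as in the text):

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 2, 50
A = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)
B = rng.standard_normal(d)
C = rng.standard_normal(d)
x = rng.standard_normal(T)

# recurrent mode: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros(d)
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = C @ h

# convolutional mode: y = K * x with kernel K[n] = C A^n B
K = np.array([C @ np.linalg.matrix_power(A, n) @ B for n in range(T)])
y_conv = np.convolve(x, K)[:T]

assert np.allclose(y_rec, y_conv)
```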

5.2 Frequency Response and Transfer Function

Definition. The frequency response of an LTI system with impulse response hh is:

H(f)=h^(f)=h(t)e2πiftdtH(f) = \hat{h}(f) = \int_{-\infty}^{\infty} h(t)\, e^{-2\pi i f t}\, dt

This is simply the Fourier transform of the impulse response. The Convolution Theorem then gives:

Y(f)=H(f)X(f)Y(f) = H(f) \cdot X(f)

The system applies a complex gain H(f)H(f) to each frequency component:

  • H(f)|H(f)| - the magnitude response (gain/attenuation at frequency ff)
  • H(f)\angle H(f) - the phase response (phase shift at frequency ff)

Group delay. The group delay measures how much each frequency band is delayed:

τg(f)=ddfH(f)\tau_g(f) = -\frac{d}{df} \angle H(f)

A system with constant group delay τg(f)=τ0\tau_g(f) = \tau_0 is said to have linear phase: H(f)=2πfτ0\angle H(f) = -2\pi f \tau_0. This means all frequencies are delayed by the same amount τ0\tau_0 - no dispersion, no distortion of waveform shape.
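Linear phase is easy to confirm for a symmetric FIR filter: with L = 5 symmetric taps, the group delay is (L-1)/2 = 2 samples at every frequency. A sketch assuming scipy.signal (taps chosen so the frequency response has no zeros, where group delay would be undefined):

```python
import numpy as np
from scipy.signal import group_delay

h = np.array([1.0, 3.0, 5.0, 3.0, 1.0])  # symmetric: h[n] = h[L-1-n]
w, gd = group_delay((h, [1.0]))          # FIR: denominator = 1

# constant group delay of (L-1)/2 = 2 samples at all frequencies
assert np.allclose(gd, 2.0)
```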

Frequency response examples:

Filter typeH(f)H(f)Impulse response h(t)h(t)
Ideal lowpass, cutoff fcf_c$\mathbf{1}_{f
Ideal highpass$\mathbf{1}_{f
Gaussian smoothereπ2f2/(2σ2)e^{-\pi^2 f^2 / (2\sigma^2)}σeσ2π2t2/2\sigma e^{-\sigma^2 \pi^2 t^2 / 2} (Gaussian)
Differentiator2πif2\pi i fδ(t)\delta'(t)
Hilbert transformerisgn(f)-i \operatorname{sgn}(f)1πt\frac{1}{\pi t} (Cauchy principal value)

For AI: The Hilbert transformer shifts each positive-frequency component by -90° (multiplication by -i), producing the analytic signal z(t) = x(t) + i\mathcal{H}[x](t). This is used in envelope detection for audio and is related to the complex-valued activations in modern SSM architectures.

5.3 FIR vs. IIR Filters

FIR (Finite Impulse Response): h[n]h[n] has only finitely many non-zero terms (n=0,1,,L1n = 0, 1, \ldots, L-1). Output is a finite weighted sum: y[n]=k=0L1h[k]x[nk]y[n] = \sum_{k=0}^{L-1} h[k] x[n-k].

Advantages:

  • Always BIBO stable (finite sum, bounded output)
  • Can achieve exact linear phase: symmetric FIR (h[n]=h[L1n]h[n] = h[L-1-n]) has linear phase
  • Easy to design: window method, least-squares, Parks-McClellan algorithm

Disadvantages:

  • Long filter needed for sharp frequency cutoffs (e.g., L300L \approx 300 taps for sharp lowpass)
  • High computational cost for long filters (mitigated by FFT convolution)

IIR (Infinite Impulse Response): h[n]h[n] is non-zero for all n0n \geq 0. Described by a rational transfer function:

H(z)=B(z)A(z)=k=0Mbkzk1+k=1NakzkH(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}}

Implemented as a recursive difference equation: y[n]=kbkx[nk]kaky[nk]y[n] = \sum_k b_k x[n-k] - \sum_k a_k y[n-k].

Advantages:

  • Very compact: sharp frequency responses achievable with few coefficients (e.g., 5th-order Butterworth)
  • Classic analog designs: Butterworth (maximally flat), Chebyshev (equiripple), elliptic (optimal)

Disadvantages:

  • Stability not guaranteed: poles of H(z)H(z) must lie inside the unit circle
  • Non-linear phase: group delay is frequency-dependent, causing waveform dispersion
  • Cannot be parallelized easily: recursive computation creates data dependencies

For AI: CNN layers are FIR filters (finite, feedforward). RNNs and LSTMs implement IIR-like behavior (infinite memory via recurrence). S4/Mamba use IIR-structured kernels but compute them in FIR/parallel mode via FFT convolution during training.

5.4 Ideal Filter Prototypes

The four ideal filter types, defined by their frequency response:

| Type | Magnitude response | Application |
|---|---|---|
| Lowpass (LP) | 1 for \|f\| \leq f_c, 0 otherwise | Smoothing, anti-aliasing |
| Highpass (HP) | 0 for \|f\| \leq f_c, 1 otherwise | Edge detection, noise removal |
| Bandpass (BP) | 1 for f_1 \leq \|f\| \leq f_2, 0 otherwise | Channel selection, mel filters |
| Bandstop (BS) | 0 for f_1 \leq \|f\| \leq f_2, 1 otherwise | Notch filter (50/60 Hz hum removal) |

All ideal filters have impulse responses that are sinc\operatorname{sinc}-like - non-causal (extend to -\infty) and non-realizable. Real-world filter design approximates these ideal brick-wall responses.

Gibbs phenomenon. Truncating the infinite sinc impulse response creates the Gibbs overshoot: the truncated filter's frequency response overshoots the ideal by 9%\approx 9\% at the cutoff regardless of truncation length. Windowing the sinc (multiplying by Hann or Kaiser window) reduces the Gibbs ripple at the cost of a wider transition band - the same leakage-resolution trade-off from Section 20-03.
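A minimal numerical sketch of the window method and the Gibbs overshoot (the cutoff f_c = 0.2 and L = 101 taps are illustrative values):

```python
import numpy as np

fc, L = 0.2, 101                              # illustrative cutoff and tap count
n = np.arange(L) - (L - 1) // 2
h_trunc = 2 * fc * np.sinc(2 * fc * n)        # truncated ideal lowpass (rect window)
h_hann = h_trunc * np.hanning(L)              # Hann-windowed version

# Evaluate |H(f)| on a dense grid via zero-padded FFT
H_rect = np.abs(np.fft.rfft(h_trunc, 8192))
H_hann = np.abs(np.fft.rfft(h_hann, 8192))

assert H_rect.max() > 1.05    # Gibbs overshoot (~9%) survives truncation
assert H_hann.max() < 1.01    # Hann window flattens the ripple (wider transition)
```

The rectangular truncation keeps its roughly 9% overshoot no matter how large L is; windowing trades that ripple for a wider transition band.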

6. Efficient Long Convolution

6.1 Overlap-Add Algorithm

The overlap-add (OA) algorithm efficiently convolves a long signal xx of length MM with a short filter hh of length LL by processing xx in blocks.

Algorithm:

  1. Choose block size BB (typically B=4LB = 4L or B=8LB = 8L for efficiency)
  2. Pad hh to length N=B+L1N = B + L - 1, compute H=FFT(h,N)H = \text{FFT}(h, N)
  3. For each block m = 0, 1, 2, …:
     a. Extract x_m = x[mB : (m+1)B] (zero-padded if needed)
     b. Compute X_m = FFT(x_m, N)
     c. Compute Y_m = IFFT(X_m · H, N) - length N = B + L - 1
     d. Add Y_m to the output buffer with overlap: output[m*B : m*B + N] += Y_m

The last L1L - 1 samples of each block output overlap with the first L1L - 1 samples of the next - these overlapping tails are summed (the "add" in overlap-add).

Why it works. Each block-convolution YmY_m contains the correct contribution of block xmx_m to the full output. The convolution theorem guarantees linearity: summing all contributions gives the correct total output.

Total cost: M/B\lceil M/B \rceil FFTs of size N=B+L1N = B + L - 1, each costing O(NlogN)O(N \log N):

Total=O ⁣(MB(B+L)log(B+L))O(MlogL)when BL\text{Total} = O\!\left(\frac{M}{B} \cdot (B + L) \log(B + L)\right) \approx O(M \log L) \quad \text{when } B \approx L

Optimal block size minimizes (B+L)log(B+L)B\frac{(B+L)\log(B+L)}{B}, giving B2LB \approx 2L to 8L8L.
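The algorithm above can be sketched in a few lines and checked against np.convolve (the block size and signal lengths are arbitrary test values):

```python
import numpy as np

def overlap_add(x, h, B):
    """Linear convolution of x and h, processing x in blocks of length B."""
    L = len(h)
    N = B + L - 1                      # FFT size holds one block's full output
    H = np.fft.rfft(h, N)
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), B):
        yb = np.fft.irfft(np.fft.rfft(x[start:start + B], N) * H, N)
        end = min(start + N, len(y))
        y[start:end] += yb[:end - start]   # sum the overlapping L-1 tails
    return y

rng = np.random.default_rng(1)
x, h = rng.standard_normal(1000), rng.standard_normal(64)
assert np.allclose(overlap_add(x, h, B=128), np.convolve(x, h))
```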

6.2 Overlap-Save Algorithm

Overlap-save (OS) is a computationally equivalent alternative that avoids the explicit addition step by saving (reusing) a portion of each input block.

Algorithm:

  1. Initialize a buffer of size N=B+L1N = B + L - 1 filled with zeros
  2. For each new B input samples:
     a. Shift buffer left by B: keep the last L-1 samples from the previous block
     b. Fill buffer positions [L-1, N-1] with the new B input samples
     c. Compute X_m = FFT(buffer, N)
     d. Y_m = IFFT(X_m · H)
     e. Discard the first L-1 samples of Y_m (circular contamination), save the last B samples

The "save" refers to keeping L1L-1 samples from the previous block in the current buffer. The "discard" step removes the circular wrap-around artifacts at the beginning of each block.

Comparison:

| Property | Overlap-Add | Overlap-Save |
|----------|-------------|--------------|
| Computation per block | Same | Same |
| Memory | 2N (input + output buffer) | N (single circular buffer) |
| Implementation | Slightly simpler | Slightly more complex (shift + discard) |
| Real-time streaming | Good | Slightly better |

6.3 Block Convolution in Practice

Choosing block size BB:

  • Too small (BLB \ll L): FFT overhead dominates; most computation is transform, not convolution
  • Too large (BLB \gg L): good efficiency, but large latency (BB samples before any output)
  • Rule of thumb: B=4LB = 4L to 8L8L balances efficiency and latency

scipy.signal.fftconvolve implementation:

import scipy.signal as signal
y = signal.fftconvolve(x, h, mode='full')    # linear conv, output length M+L-1
y = signal.fftconvolve(x, h, mode='same')    # same length as x (centered)
y = signal.fftconvolve(x, h, mode='valid')   # only fully-overlapping part

scipy.signal.oaconvolve for long signals:

y = signal.oaconvolve(x, h, mode='full')    # uses overlap-add internally

This is typically 10-100× faster than fftconvolve when M ≫ L.

For AI: The overlap-add structure appears in neural audio synthesis. In WaveNet, the causal dilated convolutions can be viewed as a sequence of overlapping block convolutions, where each block is processed causally. This is why WaveNet inference is slow (one sample at a time, sequentially) while training is fast (teacher forcing evaluates all output positions in parallel).

7. Cross-Correlation and Power Spectral Density

7.1 Cross-Correlation Theorem

Theorem (Cross-Correlation Theorem). For f,gL2(R)f, g \in L^2(\mathbb{R}):

F(fg)(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f \star g)(\xi) = \hat{f}^*(\xi) \cdot \hat{g}(\xi)

where (fg)(τ)=f(t)g(t+τ)dt(f \star g)(\tau) = \int f^*(t) g(t + \tau) dt is the cross-correlation.

Proof. Write fg=f()gf \star g = f^*(-\cdot) * g (time-reverse and conjugate the first argument). Apply the Convolution Theorem and the time-reversal property F(f())(ξ)=f^(ξ)\mathcal{F}(f(-\cdot))(\xi) = \hat{f}(-\xi):

F(fg)(ξ)=F(f())(ξ)g^(ξ)=f^((ξ))g^(ξ)=f^(ξ)g^(ξ)\mathcal{F}(f \star g)(\xi) = \mathcal{F}(f^*(-\cdot))(\xi) \cdot \hat{g}(\xi) = \hat{f}^*(-(-\xi)) \cdot \hat{g}(\xi) = \hat{f}^*(\xi) \cdot \hat{g}(\xi) \qquad \square

Discrete version. For sequences x,yCNx, y \in \mathbb{C}^N:

(x \star y)[k] = \sum_{n} x^*[n]\, y[(n+k) \bmod N] \quad \text{(circular)}

\mathcal{F}(x \star y)[k] = X^*[k] \cdot Y[k]

In NumPy: np.fft.ifft(np.conj(X) * Y) computes cross-correlation via FFT.
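As a sketch, the FFT route recovers a known delay between two signals (the signal length, delay, and noise level are made-up test values):

```python
import numpy as np

rng = np.random.default_rng(2)
N, true_delay = 512, 37
s = rng.standard_normal(N)
y = np.roll(s, true_delay) + 0.1 * rng.standard_normal(N)   # delayed noisy copy

# Circular cross-correlation r[k] = sum_n s*[n] y[(n+k) mod N] via the theorem
r = np.fft.ifft(np.conj(np.fft.fft(s)) * np.fft.fft(y)).real

assert int(np.argmax(r)) == true_delay     # the correlation peak locates the delay
```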

Applications:

  • Template matching: (hx)(τ)(h \star x)(\tau) peaks where xx most resembles the template hh - used in pattern recognition, face detection, audio fingerprinting (Shazam)
  • Time-delay estimation: the peak of Rxy(τ)|R_{xy}(\tau)| estimates the time delay between two sensors
  • Matched filter: the optimal linear filter for detecting a known signal in white noise is the cross-correlator (see Section 7.3)

7.2 Autocorrelation and Wiener-Khinchin Theorem

Definition. The autocorrelation of ff is Rff(τ)=(ff)(τ)=f(t)f(t+τ)dtR_{ff}(\tau) = (f \star f)(\tau) = \int f^*(t) f(t+\tau) dt.

Theorem (Wiener-Khinchin). For a wide-sense stationary random process X(t)X(t) with autocorrelation RXX(τ)=E[X(t)X(t+τ)]R_{XX}(\tau) = \mathbb{E}[X^*(t)X(t+\tau)], the power spectral density is its Fourier transform:

SXX(f)=F(RXX)(f)=RXX(τ)e2πifτdτS_{XX}(f) = \mathcal{F}(R_{XX})(f) = \int_{-\infty}^{\infty} R_{XX}(\tau)\, e^{-2\pi i f \tau}\, d\tau

Corollary. The total power equals the integral of the PSD:

E[X(t)2]=RXX(0)=SXX(f)df\mathbb{E}[|X(t)|^2] = R_{XX}(0) = \int_{-\infty}^{\infty} S_{XX}(f)\, df

(By Parseval's theorem applied to the autocorrelation.)

Spectral factorization. Since SXX(f)S_{XX}(f) is real and non-negative, it can be written (under the Paley-Wiener condition logSXX(f)df>\int \log S_{XX}(f)\, df > -\infty) as SXX(f)=L(f)2S_{XX}(f) = |L(f)|^2 for a causal, stable filter LL - the spectral factor. This is used in Wiener filter design.

Practical estimation. The periodogram estimates the PSD from NN samples:

S^XX(fk)=1NX[k]2,X=FFT(x)\hat{S}_{XX}(f_k) = \frac{1}{N} |X[k]|^2, \quad X = \text{FFT}(x)

The periodogram is an inconsistent estimator (variance doesn't decrease with NN). Welch's method averages periodograms over overlapping windowed segments:

S^XXWelch(f)=1Km=0K11NFFT(xmw)2\hat{S}_{XX}^{\text{Welch}}(f) = \frac{1}{K} \sum_{m=0}^{K-1} \frac{1}{N} |\text{FFT}(x_m \cdot w)|^2

For AI: Welch's PSD is used in analyzing neural network weight matrices and gradient noise. The gradient PSD reveals dominant frequency scales of gradient oscillations - key for understanding Adam's moment estimation and learning rate scheduling.
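The variance reduction can be sketched on white noise, whose true PSD is flat (the segment length 256 and the factor-of-5 threshold are illustrative):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)
x = rng.standard_normal(16384)            # white noise: the true PSD is flat

f_p, P = signal.periodogram(x)            # single raw periodogram
f_w, W = signal.welch(x, nperseg=256)     # ~127 averaged Hann segments

# Averaging K segments cuts the estimator's standard deviation by ~sqrt(K)
assert np.std(W) < np.std(P) / 5
# Both estimators agree on the average level (averaging adds no bias here)
assert abs(np.mean(W) - np.mean(P)) < 0.2 * np.mean(P)
```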

7.3 Matched Filtering

Problem: Given an observation y(t)=s(t)+n(t)y(t) = s(t) + n(t) where ss is a known signal and nn is white noise with two-sided power spectral density N0/2N_0/2, find the linear filter hh that maximizes the output SNR at detection time t0t_0.

Theorem (Matched Filter). The optimal filter is:

H_{\text{MF}}(f) = \frac{2\, S^*(f)\, e^{-2\pi i f t_0}}{N_0}

i.e., h(t)=2N0s(t0t)h(t) = \frac{2}{N_0} s^*(t_0 - t) - the time-reversed, conjugated, delayed signal (any overall scaling leaves the SNR unchanged). The maximum output SNR is:

SNRmax=2N0S(f)2df=2s22N0\text{SNR}_{\text{max}} = \frac{2}{N_0} \int |S(f)|^2 df = \frac{2 \lVert s \rVert_2^2}{N_0}

The matched filter output is proportional to the cross-correlation of yy with ss.
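A minimal detection sketch, assuming a known broadband template buried in unit-variance white noise at a made-up offset:

```python
import numpy as np

rng = np.random.default_rng(4)
L, N, t0 = 64, 1024, 300
s = rng.standard_normal(L)             # known broadband template
x = rng.standard_normal(N)             # unit-variance white noise
x[t0:t0 + L] += s                      # bury the signal at offset t0

# Matched filter = time-reversed conjugate template; its output equals the
# cross-correlation y[k] = sum_n s*[n] x[k + n]
h = s[::-1].conj()
y = np.convolve(x, h, mode='valid')

assert int(np.argmax(y)) == t0         # output SNR peaks at the true offset
```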

For AI: The attention mechanism in Transformers can be interpreted as a soft matched filter. The query q\mathbf{q} plays the role of a template signal; the keys kj\mathbf{k}_j are candidate matches; softmax(qkj/dk)\text{softmax}(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}) is a soft selection of the best-matching key, analogous to thresholding the cross-correlation peak.

8. Deconvolution

8.1 The Deconvolution Problem

Problem setup. Given the observed signal y=hx+ny = h * x + n where hh is a known (or estimated) blurring/mixing filter, xx is the unknown clean signal, and nn is additive noise, recover xx.

In the frequency domain: Y(f)=H(f)X(f)+N(f)Y(f) = H(f) X(f) + N(f).

Naive inverse filter. The obvious approach is X^(f)=Y(f)/H(f)\hat{X}(f) = Y(f) / H(f). This recovers X(f)X(f) exactly when n=0n = 0, but in the presence of noise:

X^(f)=X(f)+N(f)H(f)\hat{X}(f) = X(f) + \frac{N(f)}{H(f)}

When H(f)0H(f) \approx 0 at some frequencies (the filter attenuates them strongly), the noise term N(f)/H(f)N(f)/H(f) is amplified catastrophically. The inverse filter is numerically ill-conditioned - a small noise level can produce arbitrarily large artifacts.

Example (image deblurring): A Gaussian blur of spatial width σ\sigma has frequency response H(f)=e2π2σ2f2H(f) = e^{-2\pi^2 \sigma^2 f^2}, which decays to near-zero at high frequencies. Dividing by H(f)H(f) there amplifies high-frequency noise by factors of e2π2σ2f2e^{2\pi^2 \sigma^2 f^2}, producing extreme ringing artifacts.

8.2 Wiener Filter

The Wiener filter is the optimal linear filter minimizing the mean squared error E[x(t)x^(t)2]\mathbb{E}[|x(t) - \hat{x}(t)|^2].

Theorem (Wiener Filter). The optimal deconvolution filter in the frequency domain is:

HW(f)=H(f)Sxx(f)H(f)2Sxx(f)+Snn(f)H_W(f) = \frac{H^*(f) S_{xx}(f)}{|H(f)|^2 S_{xx}(f) + S_{nn}(f)}

where Sxx(f)S_{xx}(f) is the PSD of the signal and Snn(f)S_{nn}(f) is the PSD of the noise.

Derivation sketch. We seek G(f)G(f) such that X^(f)=G(f)Y(f)=G(f)(H(f)X(f)+N(f))\hat{X}(f) = G(f) Y(f) = G(f)(H(f)X(f) + N(f)) minimizes E[X(f)X^(f)2]\mathbb{E}[|X(f) - \hat{X}(f)|^2]. Setting the derivative to zero (orthogonality principle):

G(f)=H(f)Sxx(f)H(f)2Sxx(f)+Snn(f)G(f) = \frac{H^*(f) S_{xx}(f)}{|H(f)|^2 S_{xx}(f) + S_{nn}(f)}

Intuition. The Wiener filter trades off between two regimes:

  • When H(f)2Sxx(f)Snn(f)|H(f)|^2 S_{xx}(f) \gg S_{nn}(f) (high SNR): HW(f)1/H(f)H_W(f) \approx 1/H(f) - nearly perfect inversion
  • When H(f)2Sxx(f)Snn(f)|H(f)|^2 S_{xx}(f) \ll S_{nn}(f) (low SNR): HW(f)0H_W(f) \approx 0 - suppress the noisy frequency, don't attempt to invert

The ratio SNR(f)=Sxx(f)H(f)2/Snn(f)\text{SNR}(f) = S_{xx}(f)|H(f)|^2 / S_{nn}(f) governs the transition between inversion and suppression at each frequency.
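Both regimes can be sketched with a circular Gaussian blur and an oracle Wiener filter (the blur width, noise level, and test signal are illustrative assumptions, and the per-DFT-bin PSD convention is a simplification):

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 1024, 0.05
t = np.arange(N)
x = np.sin(2 * np.pi * 5 * t / N) + 0.5 * np.sin(2 * np.pi * 12 * t / N)

# Circularly centered Gaussian blur kernel (std = 4 samples)
h = np.zeros(N)
idx = np.arange(-32, 32)
h[idx % N] = np.exp(-0.5 * (idx / 4.0) ** 2)
h /= h.sum()
H = np.fft.fft(h)

y = np.fft.ifft(H * np.fft.fft(x)).real + sigma * rng.standard_normal(N)
Y = np.fft.fft(y)

# Naive inverse filter: divides by |H| ~ 0 at high frequencies
x_naive = np.fft.ifft(Y / H).real

# Wiener filter with oracle spectra (noise DFT-bin variance is N * sigma^2)
Sxx = np.abs(np.fft.fft(x)) ** 2
Snn = N * sigma ** 2
G = np.conj(H) * Sxx / (np.abs(H) ** 2 * Sxx + Snn)
x_wiener = np.fft.ifft(G * Y).real

err_naive = np.linalg.norm(x_naive - x)
err_wiener = np.linalg.norm(x_wiener - x)
assert err_wiener < 0.1 * np.linalg.norm(x)   # faithful recovery
assert err_naive > 100 * err_wiener           # catastrophic noise amplification
```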

For AI: The Wiener filter principle appears in noise-robust attention. Sparse attention patterns (Longformer, BigBird) can be viewed as suppressing "noisy" long-range attention weights when the query-key similarity is low - a form of adaptive filtering where the "SNR" is the attention score.

8.3 Regularized Deconvolution

When SxxS_{xx} and SnnS_{nn} are unknown, we use Tikhonov regularization:

Hreg(f)=H(f)H(f)2+λH_{\text{reg}}(f) = \frac{H^*(f)}{|H(f)|^2 + \lambda}

The regularization parameter λ>0\lambda > 0 prevents division by near-zero values of H(f)H(f):

  • λ0\lambda \to 0: approaches inverse filter (unstable)
  • λ\lambda \to \infty: output x^0\hat{x} \to 0 (over-regularized)
  • Optimal λ\lambda chosen by cross-validation or the discrepancy principle

L-curve method. Plot x^2\lVert \hat{x} \rVert_2 vs. yhx^2\lVert y - h * \hat{x} \rVert_2 as λ\lambda varies. The "corner" of the L-shaped curve gives a good λ\lambda - balancing solution norm against residual.

Sparse deconvolution. When the clean signal xx is sparse (e.g., spike train, musical onset detection), using L1L^1 regularization instead of L2L^2:

x^=argminxyhx22+λx1\hat{x} = \arg\min_x \lVert y - h * x \rVert_2^2 + \lambda \lVert x \rVert_1

This is the LASSO in the signal processing context and recovers sparse solutions more faithfully than the Wiener filter, which assumes a smooth Gaussian prior on xx.

9. Applications in Machine Learning

9.1 Convolutional Neural Networks

A convolutional layer computes cross-correlation (not true convolution, but the distinction is irrelevant since weights are learned):

(Y)i,j=m,nWm,nXi+m,j+n+b(Y)_{i,j} = \sum_{m,n} W_{m,n} X_{i+m,\, j+n} + b

Frequency perspective: Each learned filter WW is an LTI system with frequency response W^(f)\hat{W}(f). Deep CNNs learn a hierarchy of filters:

  • Layer 1 filters: oriented edges, color gradients (Gabor-like bandpass filters)
  • Layer 2 filters: corners, textures (combinations of edge filters)
  • Layer LL filters: object parts, semantic features

Dilated convolutions expand the receptive field without increasing parameters: (XdW)[n]=kW[k]X[ndk](X *_d W)[n] = \sum_k W[k] X[n - dk]. In frequency: zero-stuffing the kernel compresses the frequency axis, W^d(f)=W^(df)\hat{W}_d(f) = \hat{W}(df), so the filter's response becomes periodic with period fs/df_s/d - dilation creates spectral images of the original response.
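The frequency-domain picture can be checked exactly on the DFT grid: zero-stuffing a kernel by factor d makes its N-point DFT equal the original DFT read at stride d (mod N). The 3-tap kernel below is an arbitrary example:

```python
import numpy as np

w = np.array([0.25, 0.5, 0.25])       # 3-tap smoothing kernel
d, N = 4, 64

# Dilation by d == convolving with the zero-stuffed kernel
w_d = np.zeros((len(w) - 1) * d + 1)
w_d[::d] = w

W = np.fft.fft(w, N)
W_d = np.fft.fft(w_d, N)

# W_d[k] = W[(d*k) mod N]: frequency axis compressed by d, with periodic images
assert np.allclose(W_d, W[(d * np.arange(N)) % N])
```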

1×1 convolutions are pointwise linear transformations across channels - equivalent to applying the same linear map at every spatial position. They mix channel information without spatial mixing.

For AI (2026): Modern vision transformers (ViT) treat patches as tokens. Recent work shows that early ViT layers perform operations similar to CNN filters - local spatial filtering. Conversely, "ConvNeXt" (Liu et al., 2022) shows that pure CNN architectures can match ViT by adopting transformer-inspired design choices (depth-wise convolutions, GELU activations, layer norm).

9.2 Causal Convolution and WaveNet

WaveNet (van den Oord et al., 2016) generates raw audio waveforms autoregressively at 16/24 kHz using dilated causal convolutions.

Causal convolution: (hcausalx)[n]=k=0L1h[k]x[nk](h *_{\text{causal}} x)[n] = \sum_{k=0}^{L-1} h[k] x[n-k] - only past samples contribute. In training, implemented as a padded standard convolution.

Dilated receptive field: With dilation factors d=1,2,4,8,,2L1d = 1, 2, 4, 8, \ldots, 2^{L-1} over LL layers of 2-tap filters, the total receptive field is 1+i=0L12i=2L1 + \sum_{i=0}^{L-1} 2^i = 2^L samples. At 16 kHz with L=10L = 10 layers: receptive field =210=1024= 2^{10} = 1024 samples == 64 ms.

WAVENET DILATED CAUSAL RECEPTIVE FIELD (L=3 layers)
========================================================================

  Input:  x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8]
           |    |    |    |    |    |    |    |    |
  d=1:    +----+    +----+    +----+    +----+    (receptive field: 2)
           |         |         |         |
  d=2:    +---------+    +---------+    (receptive field: 4)
           |              |
  d=4:    +--------------+              (receptive field: 8)

  Total receptive field: 2^3 = 8 samples
  Output y[n] depends on x[n], x[n-1], ..., x[n-7]

========================================================================
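The receptive-field arithmetic can be checked by pushing an impulse through a stack of 2-tap causal dilated filters (the all-ones taps are an arbitrary choice):

```python
import numpy as np

def causal_dilated(x, h, d):
    """y[n] = sum_k h[k] x[n - d*k], treating x[m] = 0 for m < 0."""
    y = np.zeros_like(x)
    for k, hk in enumerate(h):
        if k == 0:
            y += hk * x
        else:
            y[k * d:] += hk * x[:-k * d]
    return y

# Push an impulse through 2-tap filters with dilations 1, 2, 4, 8
y = np.zeros(64); y[0] = 1.0
for d in (1, 2, 4, 8):
    y = causal_dilated(y, np.ones(2), d)

rf = int(np.nonzero(y)[0].max()) + 1
assert rf == 16                        # receptive field = 2^L for L = 4 layers
```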

9.3 State Space Models as Convolution (S4)

The Structured State Space Sequence model (S4) (Gu et al., 2021) bridges recurrent and convolutional views of sequence modeling.

SSM definition. The continuous-time state space model:

h˙(t)=Ah(t)+Bx(t),y(t)=Ch(t)+Dx(t)\dot{\mathbf{h}}(t) = A\mathbf{h}(t) + B x(t), \quad y(t) = C\mathbf{h}(t) + D x(t)

Discretized with step Δ\Delta:

\mathbf{h}_t = \bar{A}\mathbf{h}_{t-1} + \bar{B} x_t, \quad y_t = C\mathbf{h}_t + D x_t

\bar{A} = e^{\Delta A}, \quad \bar{B} = (e^{\Delta A} - I)A^{-1}B

Convolutional mode (training). Unrolling the recurrence gives:

yt=k=0tCAˉkBˉKˉ[k]xtk+Dxty_t = \sum_{k=0}^{t} \underbrace{C \bar{A}^k \bar{B}}_{\bar{K}[k]} x_{t-k} + D x_t

The SSM kernel Kˉ=[CBˉ,  CAˉBˉ,  CAˉ2Bˉ,  ,  CAˉL1Bˉ]\bar{\mathbf{K}} = [C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^2\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}] defines a length-LL FIR filter. Convolution y=Kˉxy = \bar{\mathbf{K}} * x is computed in O(LlogL)O(L \log L) via FFT.

HiPPO initialization. S4 initializes AA as the HiPPO matrix that optimally projects history onto Legendre polynomials. This gives Kˉ\bar{K} long-range decay properties that enable learning dependencies over thousands of steps.

For AI: S4 achieves state-of-the-art performance on the Long Range Arena benchmark, processing sequences of length 16,000 in O(NlogN)O(N \log N) vs. O(N2)O(N^2) for Transformers.
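A minimal sketch of S4's convolutional mode for a diagonal SSM (the state size, step size, and random parameters are illustrative; real S4 uses the HiPPO initialization and omitting the D term is a simplification):

```python
import numpy as np

rng = np.random.default_rng(6)
Nstate, L, dt = 16, 256, 0.01

a = -rng.uniform(0.1, 2.0, Nstate)     # diagonal A with Re(a) < 0: stable
b = rng.standard_normal(Nstate)
c = rng.standard_normal(Nstate)

# ZOH discretization, elementwise because A is diagonal (D term omitted)
a_bar = np.exp(dt * a)
b_bar = (a_bar - 1.0) / a * b

# SSM kernel K[k] = C Abar^k Bbar, applied via zero-padded FFT convolution
K = (c * b_bar) @ (a_bar[:, None] ** np.arange(L))
x = rng.standard_normal(L)
n_fft = 2 * L
y_fft = np.fft.irfft(np.fft.rfft(K, n_fft) * np.fft.rfft(x, n_fft), n_fft)[:L]

# Reference: run the recurrence h_t = Abar h_{t-1} + Bbar x_t, y_t = C h_t
h = np.zeros(Nstate)
y_rec = np.zeros(L)
for t in range(L):
    h = a_bar * h + b_bar * x[t]
    y_rec[t] = c @ h

assert np.allclose(y_fft, y_rec)       # convolutional and recurrent modes agree
```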

9.4 Mamba and Selective State Spaces

Mamba (Gu & Dao, 2023) extends S4 with input-selective SSM parameters: Aˉ(xt),Bˉ(xt),C(xt)\bar{A}(x_t), \bar{B}(x_t), C(x_t) depend on the input token. This breaks the time-invariance of S4's convolution kernel - the filter Kˉ\bar{K} is now input-dependent and changes for every sequence.

Why not use FFT directly: Since Kˉ\bar{K} changes with input, Mamba cannot precompute a single convolution kernel and apply it via FFT. Instead, it uses a hardware-aware parallel scan algorithm (Blelloch 1990) that processes the recurrence in O(N)O(N) time on GPU with memory efficiency via kernel fusion.

Selective mechanism: The input-dependent parameters enable the model to selectively "remember" or "forget" context - analogous to LSTM gates but more flexible. This allows Mamba to match or exceed Transformer performance on language modeling while maintaining sub-quadratic scaling.

9.5 Hyena and Parameterized Long Convolutions

Hyena (Poli et al., 2023) replaces the attention mechanism with a hierarchy of parameterized long convolutions.

Architecture. A Hyena operator of order NN computes:

y=hN(vN(hN1(vN1((h1(v1x))))))y = h_N * (v_N \odot (h_{N-1} * (v_{N-1} \odot (\cdots (h_1 * (v_1 \odot x)) \cdots))))

where each hih_i is a learned long convolution (length == sequence length), vi=Wixv_i = W_i x are linear projections, and \odot is elementwise multiplication.

Parameterization of hih_i: Each long filter is parameterized by a small neural network hi(t)=fθ(t)h_i(t) = f_{\theta}(t) evaluated at integer time positions t=0,1,,L1t = 0, 1, \ldots, L-1. This gives O(1)O(1) parameters per filter regardless of sequence length.

Complexity: Each convolution costs O(NlogN)O(N \log N) via FFT. Total: O(dNlogN)O(dN \log N) per layer for order-dd Hyena vs. O(N2d)O(N^2 d) for attention.

For AI: Hyena reaches 99.9% of attention's performance on CIFAR-10 at sequence length 1024 while being 8× faster. It scales better still to length 64K, where attention becomes prohibitive.

9.6 FNet: Replacing Attention with FFT

FNet (Lee-Thorp et al., 2022) replaces the self-attention sub-layer in a standard Transformer with a 2D FFT mixing layer:

FNetMix(X)=Re(Fseq(Fmodel(X)))\text{FNetMix}(X) = \text{Re}(\mathcal{F}_{\text{seq}}(\mathcal{F}_{\text{model}}(X)))

Two FFTs: one along the sequence dimension, one along the model dimension. No learned parameters in the mixing layer - just a fixed unitary transformation. The FFN sub-layers remain unchanged.
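The mixing layer is two lines of NumPy; the sketch below checks that the nested 1D FFTs match a single 2D FFT and that the map is a fixed linear operator (the tensor shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 16))       # (sequence, model) activations

# FNet mixing: FFT along the model axis, then the sequence axis, keep Re
mixed = np.fft.fft(np.fft.fft(X, axis=-1), axis=0).real

# Identical to a single 2D FFT -- a fixed, parameter-free linear map
assert np.allclose(mixed, np.fft.fft2(X).real)
assert np.allclose(np.fft.fft2(2 * X).real, 2 * mixed)
```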

Results:

  • Achieves 92-97% of BERT's performance on GLUE
  • Training speed: 70% faster than BERT on TPUs, 80% on GPUs (no attention computation)
  • Parameter count: smaller than BERT - the attention weights are gone; all parameters live in the FFN, embeddings, and output projection

Why it works: The 2D FFT performs a global mixing of all positions and all channels simultaneously. It acts like an "equal attention" that attends to all positions - less expressive than learned attention but sufficient for many tasks after fine-tuning.

For AI (2026): FNet is the existence proof that token mixing does not require learned attention. The same insight underlies MLP-Mixer, gMLP, and other mixing architectures that replace O(N2)O(N^2) attention with O(N)O(N) or O(NlogN)O(N \log N) alternatives.

9.7 Filter Banks Preview

Preview: Wavelets and Filter Banks

The Convolution Theorem underlies an even richer structure: filter banks - collections of bandpass filters that partition the spectrum. Convolving a signal with a bank of bandpass filters and subsampling the outputs gives a multi-resolution representation. The perfect reconstruction condition requires that the synthesis filters exactly invert the analysis filters - a constraint directly expressed via the Convolution Theorem.

The Haar wavelet is the simplest perfect-reconstruction filter bank: a lowpass filter h0=[1,1]/2h_0 = [1, 1]/\sqrt{2} and highpass h1=[1,1]/2h_1 = [1, -1]/\sqrt{2}. Iterating the lowpass branch produces the discrete wavelet transform (DWT).

-> Full treatment: Wavelets

10. Advanced Topics

10.1 Convolution on Groups

The Convolution Theorem extends far beyond R\mathbb{R} and Z/NZ\mathbb{Z}/N\mathbb{Z} to any locally compact abelian group GG:

(fg)(x)=Gf(y1x)g(y)dμ(y)(f * g)(x) = \int_G f(y^{-1} x)\, g(y)\, d\mu(y)

where μ\mu is the Haar measure on GG. The Fourier transform on GG diagonalizes this convolution exactly as in the classical case:

fg^(χ)=f^(χ)g^(χ)\widehat{f * g}(\chi) = \hat{f}(\chi) \cdot \hat{g}(\chi)

where χ:GT\chi: G \to \mathbb{T} are the characters (1D unitary representations) of GG.

Examples:

  • G=Z/NZG = \mathbb{Z}/N\mathbb{Z}: DFT, circular convolution
  • G=ZG = \mathbb{Z}: Z-transform, discrete convolution
  • G=RG = \mathbb{R}: Fourier transform, linear convolution
  • G=TG = \mathbb{T} (circle): Fourier series, periodic convolution
  • G=RnG = \mathbb{R}^n: multidimensional FT, image processing

Non-abelian groups. For non-abelian GG (e.g., rotation group SO(3)SO(3), permutation group SnS_n), the Fourier transform takes values in matrix-valued representations, not scalars. Convolution in the group algebra is still diagonalized, but now each "frequency component" is a matrix equation f^(ρ)g^(ρ)=fg^(ρ)\hat{f}(\rho) \cdot \hat{g}(\rho) = \widehat{f * g}(\rho) where ρ\rho is an irreducible representation.

GG-CNNs (Cohen & Welling, 2016): Convolutional networks on Lie groups GG that are equivariant to all group transformations. Each layer computes group convolution (fGh)(g)=Gf(h)W(g1h)dh(f *_G h)(g) = \int_G f(h) W(g^{-1}h)\, dh. The output transforms predictably under GG-transformations, encoding equivariance as an inductive bias rather than learning it from data.

For AI: Group-equivariant CNNs are used in molecular property prediction (G=SO(3)G = SO(3) for 3D rotational equivariance), protein structure prediction (SE(3)-equivariant networks in AlphaFold 2), and medical imaging (rotation/reflection equivariance for tissue classification).

10.2 Spectral Graph Convolution

On a graph G=(V,E)\mathcal{G} = (\mathcal{V}, \mathcal{E}) with NN nodes, the graph Laplacian L=DAL = D - A (where DD is the degree matrix) plays the role of the Laplacian in continuous Fourier analysis. Its eigendecomposition L=UΛUL = U\Lambda U^\top defines the graph Fourier transform:

x^=Ux,x=Ux^\hat{\mathbf{x}} = U^\top \mathbf{x}, \quad \mathbf{x} = U\hat{\mathbf{x}}

Graph convolution:

xGh=U(UxUh)=Udiag(h^)Ux\mathbf{x} *_{\mathcal{G}} \mathbf{h} = U(U^\top \mathbf{x} \odot U^\top \mathbf{h}) = U\operatorname{diag}(\hat{h}) U^\top \mathbf{x}

This is the Convolution Theorem on graphs: spectral multiplication = graph convolution.

Limitations: Computing UU costs O(N3)O(N^3); the resulting filters are non-localized. ChebNet (Defferrard et al., 2016) approximates h^(Λ)\hat{h}(\Lambda) by a Chebyshev polynomial, giving KK-hop local filters. GCN (Kipf & Welling, 2017) uses the first-order approximation h^(Λ)h^0I+h^1Λ~\hat{h}(\Lambda) \approx \hat{h}_0 I + \hat{h}_1 \tilde{\Lambda} (two-parameter filter).
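A sketch of spectral filtering on a 6-node path graph (the heat-kernel filter e^{-λ} is an illustrative choice):

```python
import numpy as np

# Path graph on 6 nodes: L = D - A
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
Lap = np.diag(A.sum(axis=1)) - A

# Graph Fourier transform = expansion in Laplacian eigenvectors
lam, U = np.linalg.eigh(Lap)
assert np.allclose(U @ np.diag(lam) @ U.T, Lap)

x = np.arange(N, dtype=float)
h_hat = np.exp(-lam)                   # lowpass spectral filter (heat kernel)

# Convolution theorem on graphs: filtering = U diag(h_hat) U^T
y = U @ (h_hat * (U.T @ x))
assert np.allclose(y, (U @ np.diag(h_hat) @ U.T) @ x)
```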

-> Full treatment: Spectral Graph Theory

10.3 Young's Convolution Inequality

Theorem (Young's Inequality for Convolutions). Let 1p,q,r1 \leq p, q, r \leq \infty with 1p+1q=1+1r\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}. For fLp(R)f \in L^p(\mathbb{R}) and gLq(R)g \in L^q(\mathbb{R}):

fgrfpgq\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q

Key special cases:

| p | q | r | Interpretation |
|---|---|---|----------------|
| 1 | 1 | 1 | L^1 is a Banach algebra under convolution |
| 1 | 2 | 2 | FIR filtering preserves the L^2 norm (up to a factor ‖h‖_1) |
| 1 | ∞ | ∞ | Bounded signals remain bounded after L^1 filtering |
| 2 | 2 | ∞ | Autocorrelation is bounded |

Proof sketch. Generalized Hölder's inequality applied to the convolution integral; see Lieb & Loss, Analysis, Section 4.2.
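Young's inequality also holds on the group Z with counting measure, so the discrete version can be spot-checked with np.convolve (random test vectors and a few admissible (p, q, r) triples):

```python
import numpy as np

rng = np.random.default_rng(8)
f = rng.standard_normal(100)
g = rng.standard_normal(200)
y = np.convolve(f, g)

def lp(v, p):
    """Discrete l^p norm."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# Admissible triples satisfy 1/p + 1/q = 1 + 1/r
for p, q, r in [(1, 1, 1), (1, 2, 2), (2, 1, 2)]:
    assert lp(y, r) <= lp(f, p) * lp(g, q) + 1e-9
```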

Sharp version (Beckner-Brascamp-Lieb). The sharp constant in Young's inequality is:

\lVert f * g \rVert_r \leq (A_p A_q A_{r'})\, \lVert f \rVert_p \lVert g \rVert_q, \qquad A_s = \left(\frac{s^{1/s}}{s'^{1/s'}}\right)^{1/2}, \quad \frac{1}{s} + \frac{1}{s'} = 1

where r' is the conjugate exponent of r.

Gaussians are the extremizers - they saturate the inequality with equality.

11. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Using circular convolution when linear convolution is needed | DFT assumes periodicity; wrap-around artifacts contaminate the output | Zero-pad both sequences to length ≥ M + L - 1 before the DFT |
| 2 | Forgetting to zero-pad to an efficient FFT size | FFT sizes with large prime factors fall back to much slower algorithms | Use N = 2**ceil(log2(M+L-1)) or let scipy.fft choose N |
| 3 | Double-dividing by N | Manually computing (1/N) Σ x[n] ω^{nk} and then calling np.fft.ifft (which already divides by N) | Use either all-manual or all-NumPy; never mix conventions |
| 4 | Confusing convolution with cross-correlation | torch.nn.Conv2d computes cross-correlation (W ⋆ x), not (W * x); for symmetric kernels they are equal, but in general they differ | Convolution flips the kernel, cross-correlation does not; for learned filters it makes no difference (the network learns the flip) |
| 5 | Ignoring phase response | Analyzing only the magnitude \|H(f)\|; nonlinear phase distorts the waveform shape even when the magnitude response is flat | Check the group delay; use linear-phase FIR filters when waveform shape matters |
| 6 | Applying IIR filters in both directions without understanding "zero-phase" | scipy.signal.filtfilt applies a filter twice (forward then backward) to achieve zero phase, but it is non-causal; it cannot be used in real-time streaming | Use filtfilt only for offline analysis; use lfilter with delay compensation for real-time |
| 7 | Using the periodogram directly as a PSD estimate | The periodogram's variance is comparable to its squared mean and does not shrink as N grows (inconsistent estimator) | Use Welch's method (scipy.signal.welch) to average windowed periodogram estimates |
| 8 | Naive inverse filter on attenuated frequencies | Dividing by H(f) ≈ 0 amplifies noise by 1/\|H(f)\| → ∞ | Use the Wiener filter or Tikhonov-regularized deconvolution |
| 9 | Confusing convolution with element-wise multiplication | Writing y = h * x in Python (NumPy) gives element-wise multiply, not convolution | Use np.convolve(x, h) or scipy.signal.fftconvolve(x, h) |
| 10 | Incorrect output length after convolution | Not accounting for the "full", "same", or "valid" output modes, or expecting circular convolution to give M + L - 1 output samples (it gives N) | Know the output lengths: full = M+L-1; same = max(M,L); valid = max(M,L) - min(M,L) + 1 |

12. Exercises

Exercise 1 * - Convolution by Hand Compute the linear convolution of x=[1,2,3,2,1]x = [1, 2, 3, 2, 1] and h=[1,0,1]h = [1, 0, -1] by hand using the sliding-dot-product definition. Verify your answer using np.convolve. Then compute the 5-point circular convolution of the same inputs and identify which output samples differ from the linear convolution result due to wrap-around.

Exercise 2 * - Convolution Theorem Verification For x=[1,2,3,4]x = [1, 2, 3, 4] and h=[1,1,1]h = [1, -1, 1]: (a) Compute the linear convolution directly. (b) Compute it via the FFT route (zero-pad, multiply spectra, IFFT). (c) Verify that both approaches give the same result to machine precision. (d) Verify Parseval: y22=1NY22\lVert y \rVert_2^2 = \frac{1}{N}\lVert Y \rVert_2^2 where Y=DFT(y)Y = \text{DFT}(y).

Exercise 3 * - LTI Frequency Response Design a simple FIR lowpass filter using the window method: (a) Start with the ideal lowpass impulse response hd[n]=2fcsinc(2fcn)h_d[n] = 2f_c \operatorname{sinc}(2f_c n) with cutoff fc=0.2f_c = 0.2 (in normalized frequency). (b) Truncate to L=51L = 51 taps. (c) Multiply by a Hann window. (d) Plot H(f)|H(f)| in dB and measure the passband ripple, stopband attenuation, and transition bandwidth.

Exercise 4 ** - Circular vs. Linear Convolution Demo (a) Convolve a length-64 chirp signal with a length-16 Hann-windowed filter using both direct np.convolve and the 64-point circular DFT method. (b) Show the circular convolution error (difference from linear convolution). (c) Repeat with zero-padding to length 64+161=7964 + 16 - 1 = 79 (or next power of 2 = 128). (d) Verify that the zero-padded circular convolution matches linear convolution exactly.

Exercise 5 ** - Overlap-Add Implementation Implement the overlap-add algorithm from scratch: def overlap_add(x, h, block_size). (a) Process a 1024-sample signal with a 64-sample filter using block size 128. (b) Verify your output matches scipy.signal.fftconvolve(x, h) to machine precision. (c) Measure and compare the runtime of your overlap-add, np.convolve, and scipy.signal.fftconvolve for varying signal lengths.

Exercise 6 ** - Wiener Deconvolution Given a blurred signal y=hx+ny = h * x + n where hh is a Gaussian blur, xx is a test signal, and nn is white Gaussian noise: (a) Implement the naive inverse filter and show the noise amplification. (b) Implement the Wiener filter assuming known SxxS_{xx} and SnnS_{nn}. (c) Show the deconvolution result for three SNR levels (10,20,3010, 20, 30 dB). (d) Compare with regularized Tikhonov deconvolution using the L-curve to select λ\lambda.

Exercise 7 *** - S4 SSM as FFT Convolution Implement the S4 state space model in convolutional mode: (a) Construct the SSM matrices A,B,CA, B, C for a simple diagonal SSM with N=64N = 64 states. (b) Discretize with step Δ=0.01\Delta = 0.01 using the ZOH method: Aˉ=eΔA\bar{A} = e^{\Delta A}, Bˉ=(eΔAI)A1B\bar{B} = (e^{\Delta A} - I)A^{-1}B. (c) Compute the SSM kernel Kˉ=[CBˉ,CAˉBˉ,,CAˉL1Bˉ]\bar{K} = [C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B}] for L=256L = 256. (d) Apply the kernel to a test sequence via FFT convolution. (e) Verify by also running the recurrence directly and comparing outputs.

Exercise 8 *** - Hyena Parameterized Long Convolution Implement a simplified order-2 Hyena operator: (a) Define two learnable filters h1,h2h_1, h_2 of length L=128L = 128, parameterized by a 2-layer MLP evaluated at positions t=0,,L1t = 0, \ldots, L-1. (b) Implement the Hyena forward pass: y=h2(v2(h1(v1x)))y = h_2 * (v_2 \odot (h_1 * (v_1 \odot x))) where vi=Wixv_i = W_i x are learned projections. (c) Verify that the total compute is O(LlogL)O(L \log L) per forward pass. (d) Compare: for L=1024L = 1024, count the FLOPs for Hyena vs. self-attention (O(L2d)O(L^2 d) with d=64d = 64).

13. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Convolution Theorem (f * g ↔ f̂·ĝ) | Reduces O(N^2) convolution to O(N log N) in all CNN layers; directly enables efficient sequence modeling |
| LTI system theory + impulse response | S4/Mamba: SSM = LTI system; training in convolutional mode via precomputed SSM kernel; inference in recurrent mode |
| Circular vs. linear convolution | Padding strategy in all CNN implementations; must zero-pad to avoid wrap-around artifacts in FFT-based conv layers |
| FIR filter design (linear phase) | Dilated causal convolutions in WaveNet are FIR filters; dilation exponentially extends receptive field |
| IIR filter stability (ρ(A) < 1) | Gradient stability in RNNs, LSTMs, S4; HiPPO matrix design for stable long-range memory |
| Cross-correlation = attention (soft matched filter) | Multi-head attention is a learned bank of cross-correlators; QK inner product = cross-correlation score |
| Wiener filter (optimal deconvolution) | Noise-robust attention; NLP denoising objectives; diffusion model score matching as optimal deconvolution |
| Overlap-add / overlap-save | WaveNet training (parallel causal conv); audio codec inference; streaming inference for SSMs |
| Group convolution (equivariance) | SE(3)-equivariant networks in protein structure (AlphaFold 2); molecular dynamics (NequIP, MACE) |
| Spectral graph convolution | GCN and ChebNet derive from graph Laplacian diagonalization; spatial methods (GraphSAGE) approximate it |
| Hyena parameterized convolutions | Sub-quadratic attention replacement; competitive with transformers at context lengths 8K-64K |
| FNet (FFT attention replacement) | Existence proof that fixed global mixing (FFT) achieves 92-97% of BERT; motivates MLP-Mixer, gMLP architectures |

14. Conceptual Bridge

Looking Back

This section builds directly on three prior foundations. From Section 20-01 (Fourier Series), we inherit the idea that periodic signals decompose into sinusoidal harmonics and that these harmonics are orthogonal - the very orthogonality that makes the DFT a unitary transform. From Section 20-02 (Fourier Transform), we carry forward the continuous convolution theorem, Plancherel's theorem, and the L^1/L^2 framework in which these operations are well-defined. From Section 20-03 (DFT and FFT), we take the computational machinery: the DFT matrix, the Cooley-Tukey algorithm, and the concept of circular periodicity that makes the discrete Convolution Theorem exact rather than approximate.

The Convolution Theorem can be seen as the payoff of all the Fourier analysis developed in Sections 20-01 through 20-03: the abstract orthogonal decomposition becomes a practical speedup. The key insight - that complex exponentials are the eigenfunctions of convolution - was already visible in Section 20-01's Fourier series (eigenfunctions of translation operators on the circle) and Section 20-02's Fourier transform (eigenfunctions of all LTI operators). Here we make that insight computationally concrete.

The Core Equation

x \circledast y \;\overset{\mathcal{F}}{\longleftrightarrow}\; X[k] \cdot Y[k]

This identity, proved in Section 3.2, is what connects abstract Fourier analysis to practical engineering and modern deep learning. Every CNN layer, every S4/Mamba layer, every FNet mixer, and every overlap-add audio system computes a consequence of this equation.
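As a sanity check, the identity can be verified numerically in a few lines (a sketch in plain NumPy; the length and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)
y = rng.standard_normal(N)

# Direct circular convolution: (x ⊛ y)[n] = Σ_m x[m] y[(n-m) mod N]  — O(N²)
direct = np.array([sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)])

# Convolution Theorem: x ⊛ y = IDFT(X · Y)  — O(N log N)
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(direct, spectral)
```

The two computations agree to machine precision, which is exactly the statement that the DFT diagonalizes circular convolution.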

Looking Forward

The immediate next step is Section 20-05 (Wavelets), which uses the Convolution Theorem as its foundation. Wavelet analysis overcomes Fourier's fundamental limitation - no time localization - by using short convolutions with oscillatory kernels at multiple scales. The perfect reconstruction filter bank condition (the core technical requirement of wavelets) is a system of equations on the Fourier transforms of the analysis and synthesis filters. Every MRA, DWT, and lifting scheme is a cleverly designed set of convolutions satisfying this spectral condition.

Looking further, this section connects to:

  • Section 11-04 (Spectral Graph Theory): Graph convolution = signal processing on irregular domains; ChebNet, GCN, and spectral clustering all use the graph Laplacian eigendecomposition in the same way the Convolution Theorem uses DFT eigenvectors
  • Section 12-02 (Hilbert Spaces): The Convolution Theorem in L^2 is a statement about isometric isomorphisms of Hilbert spaces; Plancherel's theorem is the isometry; the Convolution Theorem is a multiplicative structure on top
  • Section 17 (Generative Models): Diffusion models can be interpreted as iterative deconvolution - the forward process convolves with progressively wider Gaussians, and the reverse process learns the Wiener deconvolution at each noise level

POSITION IN CURRICULUM
========================================================================

  Section 20-01 Fourier Series       Section 20-02 Fourier Transform
  (periodic, discrete spectrum)      (aperiodic, continuous spectrum)
         |                                  |
         v                                  v
  Section 20-03 DFT and FFT  ------>  Section 20-04 Convolution Theorem  <- HERE
  (discrete, computable)              (theorem + LTI + ML apps)
                                            |
                                            v
                                  Section 20-05 Wavelets
                                  (multi-resolution, time-local)
                                            |
                                            v
                  +-------------------------+-------------------------+
                  |                         |                         |
                  v                         v                         v
       Section 11-04 Spectral    Section 12-02 Hilbert      Section 17 Diffusion
       Graph Theory              Spaces (L2)                Models

========================================================================

Appendix A: The Z-Transform and Discrete-Time LTI Systems

The Z-transform is the discrete-time counterpart of the Laplace transform, and its restriction to the unit circle is the DTFT (discrete-time Fourier transform):

X(z) = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n}, \quad z \in \mathbb{C}

The DFT evaluates the Z-transform at N equally spaced points on the unit circle: z = e^{2\pi i k/N} for k = 0, \ldots, N-1.

Convolution in the Z-domain. The Z-transform of a convolution is the product of the Z-transforms:

\mathcal{Z}(x * h)(z) = X(z) \cdot H(z)

This follows by exactly the same calculation as the continuous Convolution Theorem (substitute u = n - m).

Transfer function. For an IIR filter with difference equation:

y[n] = \sum_{k=0}^{M} b_k x[n-k] - \sum_{k=1}^{N} a_k y[n-k]

the Z-transform gives Y(z)\left(1 + \sum_{k=1}^{N} a_k z^{-k}\right) = X(z) \sum_{k=0}^{M} b_k z^{-k}, so:

H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}} = \frac{B(z)}{A(z)}

Stability from the Z-domain. The BIBO stability condition \lVert h \rVert_1 < \infty translates to: all poles of H(z) (roots of A(z)) must lie strictly inside the unit circle |z| < 1.

Frequency response from the transfer function. Evaluate H(z) on the unit circle: H(e^{2\pi i f}) = H(z)|_{z = e^{2\pi i f}}. This reads the frequency response directly off the pole-zero plot.

A.1 Pole-Zero Analysis

The transfer function H(z) = K \prod_k (z - z_k) / \prod_j (z - p_j) has:

  • Zeros z_k: frequencies where |H(e^{i\omega})| = 0 (nulled by the filter)
  • Poles p_j: resonances where |H(e^{i\omega})| peaks (amplified by the filter)

Notch filter example. To null frequency f_0, place a zero pair at z = e^{\pm 2\pi i f_0}:

H(z) = (z - e^{2\pi i f_0})(z - e^{-2\pi i f_0}) = z^2 - 2\cos(2\pi f_0)z + 1

This creates a perfect null at f_0. Used in power-line interference removal (50/60 Hz notch filters).
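The null is easy to verify numerically (NumPy sketch; f_0 = 0.123 is an arbitrary illustrative frequency):

```python
import numpy as np

f0 = 0.123  # frequency to null, in cycles/sample (illustrative value)

# H(z) = z² - 2cos(2π f0) z + 1, zeros at z = e^{±2πi f0}
coeffs = [1.0, -2.0 * np.cos(2 * np.pi * f0), 1.0]  # descending powers of z

# On the unit circle at f0 the response vanishes (up to rounding)
H_at_f0 = np.polyval(coeffs, np.exp(2j * np.pi * f0))
assert abs(H_at_f0) < 1e-12

# A nearby frequency is not nulled
H_near = np.polyval(coeffs, np.exp(2j * np.pi * (f0 + 0.05)))
assert abs(H_near) > 1e-2
```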

For AI: The eigenvalues of a linear RNN's transition matrix A are the Z-transform poles of its impulse response. Stable initialization requires all eigenvalues of A inside the unit disk. S4's HiPPO matrix has eigenvalues on the imaginary axis (continuous time), which after discretization map to eigenvalues near but inside the unit circle - ensuring stability while preserving long-range information.
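The stability dichotomy is easy to see numerically. The sketch below (plain NumPy; matrix size, seed, and the radii 0.9/1.1 are arbitrary choices) rescales a random matrix to a given spectral radius and watches the impulse response h[n] = A^n B decay or blow up:

```python
import numpy as np

rng = np.random.default_rng(1)

def impulse_norms(A, steps=200):
    """Norms of h[n] = A^n B for the linear recurrence x[n+1] = A x[n] + B u[n]."""
    B = rng.standard_normal(A.shape[0])
    h, out = B.copy(), []
    for _ in range(steps):
        out.append(np.linalg.norm(h))
        h = A @ h
    return np.array(out)

M = rng.standard_normal((8, 8))
rho = max(abs(np.linalg.eigvals(M)))

stable = impulse_norms(M * (0.9 / rho))    # spectral radius 0.9: poles inside unit circle
unstable = impulse_norms(M * (1.1 / rho))  # spectral radius 1.1: poles outside

assert stable[-1] < 1e-3 * stable[0]       # geometric decay
assert unstable[-1] > 1e3 * unstable[0]    # geometric blow-up
```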

Appendix B: Worked Examples

B.1 Convolution Theorem for the Heat Equation

The heat equation \partial_t u = \partial_{xx} u with initial condition u(x, 0) = u_0(x) is solved by convolution with the heat kernel:

u(x, t) = (G_t * u_0)(x), \quad G_t(x) = \frac{1}{\sqrt{4\pi t}} e^{-x^2/(4t)}

Derivation via Fourier transform. Take the FT in x:

\partial_t \hat{u}(\xi, t) = -(2\pi\xi)^2 \hat{u}(\xi, t)

This is a simple ODE in t: \hat{u}(\xi, t) = \hat{u}_0(\xi)\, e^{-(2\pi\xi)^2 t}.

The Convolution Theorem identifies e^{-(2\pi\xi)^2 t} = \hat{G}_t(\xi) (the FT of the Gaussian kernel G_t). Inverting:

u(x, t) = \mathcal{F}^{-1}(\hat{G}_t \cdot \hat{u}_0)(x) = (G_t * u_0)(x) \qquad \square

The Fourier transform diagonalizes the heat operator: each frequency evolves independently, decaying at rate (2\pi\xi)^2. High frequencies decay fastest, blurring sharp features - this is exactly why Gaussian convolution produces smooth outputs.
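The key identity \hat{G}_t(\xi) = e^{-(2\pi\xi)^2 t} can be checked numerically by sampling the (periodized) heat kernel on a circle and comparing its DFT with the analytic transform (NumPy sketch; the grid parameters are arbitrary but chosen so periodization and aliasing errors are negligible):

```python
import numpy as np

L, N, t = 20.0, 256, 0.5
dx = L / N
x = dx * np.arange(N)
xc = np.minimum(x, L - x)               # distance to the origin on the circle
G = np.exp(-xc**2 / (4 * t)) / np.sqrt(4 * np.pi * t)   # periodized heat kernel

xi = np.fft.fftfreq(N, d=dx)            # frequencies matching the DFT bins
G_hat = np.fft.fft(G).real * dx         # Riemann-sum approximation of the FT

assert np.allclose(G_hat, np.exp(-(2 * np.pi * xi)**2 * t), atol=1e-8)
```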

B.2 Verification of Circular vs. Linear Convolution

Let x = [3, 1, 4, 1] (length 4) and h = [1, 2] (length 2).

Linear convolution (length 4 + 2 - 1 = 5):

  • y[0] = 3 \cdot 1 = 3
  • y[1] = 1 \cdot 1 + 3 \cdot 2 = 7
  • y[2] = 4 \cdot 1 + 1 \cdot 2 = 6
  • y[3] = 1 \cdot 1 + 4 \cdot 2 = 9
  • y[4] = 1 \cdot 2 = 2
  • y_{\text{lin}} = [3, 7, 6, 9, 2]

4-point circular convolution (wrong - wrap-around):

  • y[0] = 3 \cdot 1 + 1 \cdot 2 = 5 <- includes the wrapped term x[3] \cdot h[1], wrong
  • y[1] = 1 \cdot 1 + 3 \cdot 2 = 7 PASS
  • y[2] = 4 \cdot 1 + 1 \cdot 2 = 6 PASS
  • y[3] = 1 \cdot 1 + 4 \cdot 2 = 9 PASS
  • y_{\text{circ},4} = [5, 7, 6, 9] - first sample is wrong

8-point circular convolution (zero-padded, correct):

  • Zero-pad x \to [3,1,4,1,0,0,0,0] and h \to [1,2,0,0,0,0,0,0]
  • Circular convolution of the length-8 vectors gives [3,7,6,9,2,0,0,0]
  • Trim to length 5: [3, 7, 6, 9, 2] PASS - matches the linear convolution
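The whole worked example collapses to a few NumPy calls:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0])
h = np.array([1.0, 2.0])

lin = np.convolve(x, h)                                          # linear convolution
circ4 = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real    # wrap-around
circ8 = np.fft.ifft(np.fft.fft(x, 8) * np.fft.fft(h, 8)).real    # zero-padded

assert np.allclose(lin, [3, 7, 6, 9, 2])
assert np.allclose(circ4, [5, 7, 6, 9])     # first sample corrupted by wrap-around
assert np.allclose(circ8[:5], lin)          # padding to >= 5 recovers the linear conv
```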

B.3 Wiener Filter Derivation in Detail

Setup. Observation model: Y(f) = H(f)X(f) + N(f). Goal: estimate \hat{X}(f) = G(f)Y(f) minimizing \mathbb{E}[|X(f) - G(f)Y(f)|^2].

Expansion:

\mathbb{E}[|X - GY|^2] = \mathbb{E}[|X|^2] - G^*\mathbb{E}[XY^*] - G\mathbb{E}[X^*Y] + |G|^2\mathbb{E}[|Y|^2]

Assuming X and N are uncorrelated (\mathbb{E}[X N^*] = 0):

  • \mathbb{E}[XY^*] = H^* S_{XX}
  • \mathbb{E}[|Y|^2] = |H|^2 S_{XX} + S_{NN}

Setting \partial/\partial G^* = 0:

G_{\text{opt}} = \frac{H^* S_{XX}}{|H|^2 S_{XX} + S_{NN}}

This is the Wiener filter. The minimum MSE at frequency f is:

\text{MMSE}(f) = \frac{S_{XX}(f) \cdot S_{NN}(f)}{|H(f)|^2 S_{XX}(f) + S_{NN}(f)} = \frac{S_{XX}(f)}{1 + \text{SNR}(f)}

where \text{SNR}(f) = |H(f)|^2 S_{XX}(f) / S_{NN}(f).
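A synthetic deconvolution experiment illustrates the formula (NumPy sketch; the "oracle" spectra S_{XX} = |X|^2 and S_{NN} = N\sigma^2 are available here only because the signal is synthetic - in practice they must be estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 256, 0.1

# Smooth "true" signal; strong circular Gaussian blur (σ_h = 4 samples)
x = np.fft.irfft(np.fft.rfft(rng.standard_normal(N))
                 * np.exp(-np.arange(N // 2 + 1) / 8.0), n=N)
f = np.fft.rfftfreq(N)
H = np.exp(-2 * np.pi**2 * 4.0**2 * f**2)

X = np.fft.rfft(x)
Y = H * X + np.fft.rfft(sigma * rng.standard_normal(N))   # blurred + noisy observation

naive = np.fft.irfft(Y / H, n=N)              # amplifies noise wherever H ≈ 0
Sxx, Snn = np.abs(X)**2, N * sigma**2         # oracle spectra (illustration only)
G = np.conj(H) * Sxx / (np.abs(H)**2 * Sxx + Snn)
wiener = np.fft.irfft(G * Y, n=N)

assert np.mean((wiener - x)**2) < np.mean((naive - x)**2)
```

The naive inverse is catastrophically worse because the blur's transfer function is astronomically small at high frequencies, exactly the failure mode the Wiener filter's S_{NN} term suppresses.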

Appendix C: Numerical Notes and Implementation Details

C.1 Choosing FFT Library

| Library | Speed | Features | Best for |
| --- | --- | --- | --- |
| numpy.fft | Good | Basic FFT, rfft | Simple use; numpy-native |
| scipy.fft | Better | Handles non-power-of-2 sizes well; workers param | General purpose |
| pyfftw | Best | Multithreaded FFTW backend | Production, speed-critical |
| torch.fft | GPU | Batched FFT on GPU | Neural network layers |
| cuFFT (via cupy) | GPU | CUDA-native | Large batch processing |

Rule: For ML training, use torch.fft.rfft which batches over the first dimensions automatically. For signal processing, use scipy.fft which automatically selects the best algorithm.

C.2 Real-Valued Convolution with rfft

For real inputs x and h, the DFT has conjugate symmetry: X[N-k] = X[k]^*. Only the first N/2 + 1 values are independent. Using numpy.fft.rfft instead of fft:

  • Input: real x of length N
  • Output: complex of length N/2 + 1 (not N)
  • Speedup: \approx 2\times for large N

# Real-valued FFT convolution (linear convolution of real x and h)
import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))  # next power of 2 >= output length
X = np.fft.rfft(x, n=N)   # length N//2 + 1
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]

C.3 Numerical Precision in Convolution

Round-off error. FFT-based convolution introduces floating-point errors of order O(\varepsilon_{\text{mach}} N \log N), where \varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16} for double precision. For N = 2^{20}: error \approx 2.2 \times 10^{-16} \times 2 \times 10^{7} \approx 4 \times 10^{-9} - acceptable for most applications.

Exact convolution. For integer-valued signals, the NTT (Number Theoretic Transform over \mathbb{Z}_p) computes exact convolution without floating-point error - covered in Section 20-03.

C.4 Memory Layout for Batched Convolution

When processing batches of signals in PyTorch:

# Batched FFT convolution: B signals of length L, C channels
import torch

B, C, L, K = 8, 4, 1024, 65          # batch, channels, signal length, filter length
N = L + K - 1                        # padded length: avoids circular wrap-around
x = torch.randn(B, C, L)
h = torch.randn(C, K)                # one filter per channel
X = torch.fft.rfft(x, n=N, dim=-1)   # shape: (B, C, N//2 + 1)
H = torch.fft.rfft(h.unsqueeze(0), n=N, dim=-1)  # (1, C, N//2 + 1): broadcasts over batch
Y = X * H
y = torch.fft.irfft(Y, n=N, dim=-1)[:, :, :L]    # trim back to length L

The dim=-1 argument applies the FFT along the last (time) dimension, leaving batch and channel dimensions intact.

Appendix D: Connections to Other Chapters

D.1 Convolution and Probability

The sum Z = X + Y of independent random variables satisfies p_Z = p_X * p_Y (PDF convolution). This connection explains:

  • Central Limit Theorem via FT: Under mild conditions, \hat{p}_Z(f) = \hat{p}_X(f) \cdot \hat{p}_Y(f) \to e^{-2\pi^2\sigma^2 f^2} (a Gaussian in frequency), so p_Z \to \mathcal{N}(0, \sigma^2) as more terms are added.
  • Characteristic functions: The characteristic function \phi_X(t) = \mathbb{E}[e^{itX}] is the FT of p_X. The Convolution Theorem gives \phi_{X+Y} = \phi_X \cdot \phi_Y directly.
  • Diffusion models: The forward diffusion process adds Gaussian noise iteratively. The noisy marginal q(x_t) = \int q(x_t \mid x_0)\, q(x_0)\, dx_0 is a convolution of the data distribution with a Gaussian kernel.
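The PDF-convolution fact can be seen concretely for two Uniform[0,1] variables, whose sum has the triangular density on [0, 2] (NumPy sketch; the grid step is arbitrary):

```python
import numpy as np

dx = 0.001
t = np.arange(0, 1, dx)
p = np.ones_like(t)                    # Uniform[0,1] density sampled on a grid

pz = np.convolve(p, p) * dx            # numerical p_X * p_Y, supported on [0, 2)
z = np.arange(len(pz)) * dx
triangle = np.where(z <= 1, z, 2 - z)  # analytic density of X + Y

assert np.allclose(pz, triangle, atol=2 * dx)   # agreement up to discretization error
```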

D.2 Convolution and Regularization

Ridge regression solves \min_x \lVert Ax - b \rVert_2^2 + \lambda \lVert x \rVert_2^2. When A is a circulant matrix (i.e., A = C_h for some filter h), ridge regression becomes Tikhonov deconvolution:

\hat{X}(f) = \frac{H^*(f)}{|H(f)|^2 + \lambda} Y(f)

The regularization parameter \lambda plays the same role as in the Wiener filter: it prevents noise amplification at frequencies where H(f) \approx 0.
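The equivalence can be verified directly: solve the ridge normal equations with a dense circulant matrix and compare against the Fourier-domain formula (NumPy sketch; sizes and λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 64, 0.1
h = np.r_[rng.standard_normal(8), np.zeros(N - 8)]   # filter, zero-padded to length N

# Circulant matrix A with A[i, j] = h[(i - j) mod N], so A @ x = h ⊛ x
A = h[(np.arange(N)[:, None] - np.arange(N)[None, :]) % N]
x_true = rng.standard_normal(N)
y = A @ x_true + 0.05 * rng.standard_normal(N)

# Direct ridge solution (AᵀA + λI)⁻¹ Aᵀ y  — O(N³)
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

# Tikhonov deconvolution in the Fourier domain  — O(N log N)
H, Y = np.fft.fft(h), np.fft.fft(y)
x_fft = np.fft.ifft(np.conj(H) * Y / (np.abs(H)**2 + lam)).real

assert np.allclose(x_ridge, x_fft)
```

The agreement is a direct consequence of the DFT diagonalizing every circulant matrix.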

D.3 Convolution and Kernel Methods

A shift-invariant kernel k(x - y) defines a feature map \phi: x \mapsto k(x, \cdot) such that k(x, y) = \langle \phi(x), \phi(y) \rangle. By Bochner's theorem, k is a positive definite shift-invariant kernel if and only if \hat{k}(\omega) \geq 0 for all \omega - i.e., k is the FT of a non-negative measure.

Random Fourier Features (Rahimi & Recht, 2007): approximate a shift-invariant kernel via Monte Carlo:

k(x - y) = \mathbb{E}_{\omega \sim \hat{k}}\left[e^{i\omega^\top(x-y)}\right] \approx \frac{1}{D}\sum_{j=1}^D e^{i\omega_j^\top x} \left(e^{i\omega_j^\top y}\right)^*

This approximates the kernel evaluation as a dot product of random Fourier features, reducing kernel SVM training from O(n^3) to roughly O(nD).
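A minimal RFF sketch for the Gaussian kernel, whose spectral measure is itself Gaussian (the dimensions, D, σ, and seed are arbitrary illustrative choices; the real-valued cos(ωᵀx + b) feature is the standard real variant of the complex exponential above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 20000, 1.0

# k(x - y) = exp(-||x - y||² / (2σ²)) has spectral density N(0, σ⁻² I)
omega = rng.standard_normal((D, d)) / sigma
b = rng.uniform(0, 2 * np.pi, D)
z = lambda v: np.sqrt(2.0 / D) * np.cos(omega @ v + b)   # random Fourier feature map

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y)**2) / (2 * sigma**2))
approx = z(x) @ z(y)                                     # kernel as a dot product

assert abs(exact - approx) < 0.05                        # Monte Carlo error ~ 1/sqrt(D)
```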

Appendix E: Filter Design Methods

E.1 Window Method for FIR Design

The window method designs a length-L FIR filter approximating an ideal frequency response H_d(f):

  1. Compute the ideal (infinite-length) impulse response: h_d[n] = \int H_d(f)\, e^{2\pi i f n}\, df
  2. Truncate to L taps: h_w[n] = h_d[n] \cdot w[n] for n = 0, \ldots, L-1

The window w[n] controls the trade-off between transition width and stopband attenuation. The Hann window gives about -44 dB stopband; the Kaiser window allows continuous tuning via the parameter \beta:

  • \beta = 0: rectangular (-13 dB)
  • \beta = 5.653: Hann-equivalent (-44 dB)
  • \beta = 8.6: -69 dB stopband

Kaiser window formulas:

w[n] = \frac{I_0(\beta\sqrt{1-(2n/L-1)^2})}{I_0(\beta)}, \quad \beta = \begin{cases} 0.1102(\alpha-8.7) & \alpha > 50 \\ 0.5842(\alpha-21)^{0.4} + 0.07886(\alpha-21) & 21 \leq \alpha \leq 50 \\ 0 & \alpha < 21 \end{cases}

where \alpha is the desired stopband attenuation in dB and I_0 is the modified Bessel function.

Minimum filter length for a Kaiser design with stopband attenuation \alpha dB and transition width \Delta f:

L \approx \frac{\alpha - 8}{2.285 \cdot 2\pi \Delta f}
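The recipe maps directly onto scipy.signal (a sketch; the 60 dB spec, 0.1 transition width, and 0.4 cutoff, all normalized so 1 = Nyquist, are arbitrary):

```python
import numpy as np
from scipy import signal

# Target: lowpass, 60 dB stopband, transition width 0.1 (normalized, 1 = Nyquist)
numtaps, beta = signal.kaiserord(ripple=60, width=0.1)
numtaps += (numtaps % 2 == 0)               # force odd length (type I linear phase)
taps = signal.firwin(numtaps, cutoff=0.4, window=("kaiser", beta))

w, h = signal.freqz(taps, worN=8000)        # w in rad/sample, Nyquist = π
stop = w >= (0.4 + 0.1 / 2) * np.pi         # stopband starts half a transition above cutoff
assert 20 * np.log10(np.abs(h[stop]).max()) < -55   # spec met (with a few dB to spare)
```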

E.2 Parks-McClellan (Equiripple) FIR Design

The Parks-McClellan algorithm finds the optimal equiripple FIR filter minimizing the maximum deviation from the desired response. It uses the Remez exchange algorithm on the Chebyshev (minimax) approximation problem.

Key property: For a given filter length L, Parks-McClellan achieves the minimum transition bandwidth subject to equal ripple in passband and stopband. The equiripple solution is optimal in the L^\infty (minimax) sense.

In Python: scipy.signal.remez(L, bands, desired, weight).

E.3 Butterworth IIR Design

The N-th order Butterworth filter is maximally flat in the passband:

|H(j\Omega)|^2 = \frac{1}{1 + (\Omega/\Omega_c)^{2N}}

  • N = 1: simple RC circuit (6 dB/octave rolloff)
  • N = 2: 12 dB/octave
  • N large: approaches the ideal brick-wall response

Poles lie on a circle of radius \Omega_c in the left half-plane, equally spaced by \pi/N:

p_k = \Omega_c\, e^{i\pi(2k+N-1)/(2N)}, \quad k = 1, \ldots, N

In Python: scipy.signal.butter(N, Wn, btype='low', analog=False).

Trade-off: Butterworth has no passband ripple but slower rolloff than Chebyshev or elliptic designs. For applications where in-band flatness is critical (audio, instrumentation), Butterworth is preferred.
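A short scipy.signal check of the two defining properties - exact -3 dB gain at the cutoff and a monotone (maximally flat) magnitude response (the order and cutoff are arbitrary):

```python
import numpy as np
from scipy import signal

# 4th-order digital Butterworth lowpass, cutoff 0.3 (normalized, 1 = Nyquist)
b, a = signal.butter(4, 0.3)

# By construction the gain is -3 dB (1/√2) exactly at the cutoff frequency
w3, h3 = signal.freqz(b, a, worN=[0.3 * np.pi])
assert np.isclose(abs(h3[0]), 1 / np.sqrt(2), atol=1e-6)

# Maximally flat: |H| decreases monotonically from DC to Nyquist (no ripple)
w, h = signal.freqz(b, a, worN=4000)
assert np.all(np.diff(np.abs(h)) <= 1e-10)
```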

Appendix F: Multidimensional Convolution

F.1 2D Convolution

For 2D signals (images), the 2D linear convolution is:

(f * g)(x, y) = \int\!\!\int f(u, v)\, g(x-u, y-v)\, du\, dv

The 2D Convolution Theorem:

\mathcal{F}_{2D}(f * g)(\xi, \eta) = \hat{f}(\xi, \eta) \cdot \hat{g}(\xi, \eta)

where \hat{f}(\xi, \eta) = \int\!\!\int f(x,y)\, e^{-2\pi i(\xi x + \eta y)}\, dx\, dy.

Separable filters. A 2D filter h(x, y) = h_1(x) h_2(y) is separable - its 2D FT factors as \hat{h}(\xi, \eta) = \hat{h}_1(\xi) \hat{h}_2(\eta). Separable filtering can be implemented as two sequential 1D convolutions:

(h * f)(x, y) = (h_1 *_x (h_2 *_y f))(x, y)

Cost: O(N^2(K_1 + K_2)) vs O(N^2 K_1 K_2) for a non-separable filter. Depthwise separable convolutions in MobileNet use a related factorization: a spatial filter applied independently per channel (depthwise), followed by a 1x1 pointwise convolution to mix channels - cutting the multiply count by roughly a factor of K^2 relative to a full K \times K convolution over all channel pairs.
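The separability identity is easy to confirm with scipy.signal (a sketch; the smoother/gradient taps are illustrative):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
h1 = np.array([1.0, 2.0, 1.0])      # vertical smoother
h2 = np.array([-1.0, 0.0, 1.0])     # horizontal gradient

# Full 2D convolution with the outer-product (separable) kernel: O(N² K₁ K₂)
full2d = signal.convolve2d(img, np.outer(h1, h2))

# Two 1D passes, columns then rows: O(N² (K₁ + K₂))
rows = signal.convolve2d(img, h1[:, None])
both = signal.convolve2d(rows, h2[None, :])

assert np.allclose(full2d, both)
```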

F.2 Image Processing Applications

Common 2D convolution kernels and their frequency responses:

| Kernel | FT behavior | Effect |
| --- | --- | --- |
| Gaussian G_\sigma | Gaussian e^{-2\pi^2\sigma^2(\xi^2+\eta^2)} | Lowpass, smooth |
| Laplacian \nabla^2 | -4\pi^2(\xi^2 + \eta^2) | Highpass, edge enhance |
| Sobel (x-direction) | 2\pi i \xi \cdot \text{window} | Horizontal gradient |
| Box filter \frac{1}{K^2}\mathbf{1}_{K\times K} | \operatorname{sinc}(K\xi)\operatorname{sinc}(K\eta) | Lowpass, boxy artifacts |
| Identity \delta | 1 everywhere | No change |

Unsharp masking: f_{\text{sharp}} = f + \alpha(f - G_\sigma * f) = (1 + \alpha)f - \alpha\, G_\sigma * f. In frequency: H(\xi) = 1 + \alpha(1 - e^{-2\pi^2\sigma^2|\xi|^2}) - boosts high frequencies.

Appendix G: Convolution in Sequence Modeling Timeline

The evolution of convolution in deep learning sequence modeling:

| Year | Model | Convolution Type | Key Innovation |
| --- | --- | --- | --- |
| 1989 | LeNet | 2D spatial conv | Translation equivariance for images |
| 1997 | LSTM | Gated recurrence | Implicitly learns long-range IIR response |
| 2016 | WaveNet | 1D dilated causal conv | Receptive field 2^L with L layers |
| 2016 | TCN | Dilated causal conv + residual | General temporal conv network |
| 2017 | Transformer | Cross-correlation (attention) | Global "convolution" with learned kernels |
| 2021 | S4 | SSM = implicit IIR conv, trained as FIR via FFT | HiPPO initialization for long-range memory |
| 2022 | S4D | Diagonal SSM | Simplification: scalar SSM per channel |
| 2022 | FNet | 2D FFT mixing | Fixed unitary mixing replaces attention |
| 2023 | Hyena | Parameterized long conv | Data-controlled conv; sub-quadratic |
| 2023 | Mamba | Selective SSM | Input-dependent scan; best of RNN + conv |
| 2024 | Mamba-2 | State space duality | Connects SSMs to linear attention |

The trajectory is clear: convolution is progressively replacing or supplementing attention as the primary mixing mechanism in large-scale sequence models, driven by the O(N \log N) vs. O(N^2) complexity advantage at long sequence lengths.

Appendix H: Quick Reference

H.1 Convolution Formulas

| Operation | Formula | Cost |
| --- | --- | --- |
| Linear conv (continuous) | (f*g)(t) = \int f(\tau)g(t-\tau)\,d\tau | O(N^2) direct |
| Linear conv (discrete) | (x*h)[n] = \sum_k x[k]h[n-k] | O(MN) direct |
| Circular conv (discrete) | (x \circledast y)[n] = \sum_{m} x[m]y[(n-m)\bmod N] | O(N \log N) via FFT |
| FFT convolution | zero-pad -> FFT -> multiply -> IFFT | O(N \log N) |
| Cross-correlation | (f \star g)(\tau) = \int f^*(t)g(t+\tau)\,dt | O(N \log N) via FFT |
| Autocorrelation | R_{ff}(\tau) = (f \star f)(\tau) | O(N \log N) via FFT |

H.2 Convolution Theorem Summary

| Domain | Time / Space | Frequency |
| --- | --- | --- |
| Continuous | f * g | \hat{f} \cdot \hat{g} |
| Continuous (dual) | f \cdot g | \hat{f} * \hat{g} |
| Discrete (DFT) | x \circledast y | X[k] \cdot Y[k] |
| Discrete (dual) | x[n] \cdot y[n] | \frac{1}{N}(X \circledast Y)[k] |
| Cross-correlation | f \star g | \hat{f}^*(\xi) \cdot \hat{g}(\xi) |
| Wiener-Khinchin | R_{XX}(\tau) | S_{XX}(f) = \hat{R}_{XX}(f) \geq 0 |

H.3 Filter Types Comparison

| Property | FIR | IIR |
| --- | --- | --- |
| Impulse response | Finite (L taps) | Infinite (recursive) |
| Phase | Can be exactly linear | Non-linear (dispersive) |
| Stability | Always BIBO stable | Only if poles are inside the unit circle |
| Computation | O(L) per sample | O(N_{\text{poles}}) per sample |
| FFT-accelerated | Yes, trivially | Not directly |
| Design methods | Window, Parks-McClellan, least squares | Butterworth, Chebyshev, elliptic |
| ML analogy | CNN, WaveNet, S4 kernel | RNN, LSTM (implicit) |

H.4 Overlap-Add vs. Overlap-Save

| Property | Overlap-Add | Overlap-Save |
| --- | --- | --- |
| Strategy | Block convolution + add tails | Circular buffer + discard head |
| FFT size | B + L - 1 | B + L - 1 |
| Memory | 2N | N |
| Per-sample latency | B samples | B samples |
| Optimal block size | B \approx 4L to 8L | Same |
| scipy function | signal.oaconvolve | (manual implementation) |

H.5 Deconvolution Methods

| Method | Formula | When to use |
| --- | --- | --- |
| Naive inverse | \hat{X} = Y/H | Only when H(f) \neq 0 everywhere and there is no noise |
| Wiener filter | \hat{X} = \frac{H^* S_{XX}}{\lvert H\rvert^2 S_{XX} + S_{NN}}\, Y | Known (or estimated) signal/noise spectra |
| Tikhonov / ridge | \hat{X} = \frac{H^*}{\lvert H\rvert^2 + \lambda}\, Y | Simple regularization with one tunable \lambda |
| LASSO deconv | \min_x \lVert y-h*x\rVert_2^2 + \lambda\lVert x\rVert_1 | Sparse signals (spike trains) |
| Total variation | \min_x \lVert y-h*x\rVert_2^2 + \lambda\,\text{TV}(x) | Piecewise-constant signals |

H.6 Key Inequalities

\lVert f * g \rVert_1 \leq \lVert f \rVert_1 \lVert g \rVert_1 \quad (L^1 \text{ Banach algebra})

\lVert f * g \rVert_2 \leq \lVert f \rVert_1 \lVert g \rVert_2 \quad (\text{Young, } p=1,\ q=r=2)

\lVert f * g \rVert_r \leq \lVert f \rVert_p \lVert g \rVert_q, \quad \tfrac{1}{r} = \tfrac{1}{p} + \tfrac{1}{q} - 1 \quad (\text{Young, general})

|R_{ff}(\tau)| \leq R_{ff}(0) = \lVert f \rVert_2^2 \quad (\text{autocorrelation peaks at zero})

\text{SNR}_{\text{matched}} = \frac{2\lVert s \rVert_2^2}{N_0} \quad (\text{matched filter maximum SNR})

Appendix I: Common Signal Pairs and Their Convolutions

Understanding which signals arise as convolutions of common building blocks builds strong intuition.

I.1 Standard Convolution Pairs

| f | g | f * g |
| --- | --- | --- |
| \mathcal{N}(\mu_1, \sigma_1^2) | \mathcal{N}(\mu_2, \sigma_2^2) | \mathcal{N}(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2) |
| \Pi_{[0,T]} (rect pulse) | \Pi_{[0,T]} (rect pulse) | \Lambda_{[0,2T]} (triangle) |
| e^{-at}\mathbf{1}_{t\geq 0} | e^{-bt}\mathbf{1}_{t\geq 0} | \frac{e^{-at} - e^{-bt}}{b-a}\mathbf{1}_{t\geq 0} (if a \neq b) |
| \delta(t - t_0) | g(t) | g(t - t_0) (shift) |
| \delta'(t) | g(t) | g'(t) (derivative) |
| 2f_c\operatorname{sinc}(2f_c t) | 2f_c\operatorname{sinc}(2f_c t) | 2f_c\operatorname{sinc}(2f_c t) (idempotent!) |

The sinc idempotence (h_{\text{LP}} * h_{\text{LP}} = h_{\text{LP}}, with h_{\text{LP}}(t) = 2f_c\operatorname{sinc}(2f_c t) the ideal lowpass impulse response) reflects the fact that the ideal lowpass filter is a projection: applying it twice gives the same result as applying it once. In ML terms, this is analogous to idempotent attention patterns.

I.2 Discrete Convolution Pairs

| x[n] | h[n] | (x * h)[n] |
| --- | --- | --- |
| \delta[n] | h[n] | h[n] (impulse response) |
| u[n] (step) | h[n] | \sum_{k=0}^{n} h[k] (running sum) |
| a^n u[n] | b^n u[n] | \frac{a^{n+1} - b^{n+1}}{a - b}\,u[n] (for a \neq b) |
| [1, 2, 1] | [1, 2, 1] | [1, 4, 6, 4, 1] (binomial coefficients) |
| \cos(2\pi f_0 n) | FIR filter h | \lvert H(f_0)\rvert \cos(2\pi f_0 n + \angle H(f_0)) |

I.3 Cascade and Parallel Systems

Cascade (series): If system 1 has impulse response h_1 and system 2 has h_2, the cascade has impulse response h_1 * h_2. Frequency response: H_{\text{cascade}}(f) = H_1(f) \cdot H_2(f).

Parallel: Impulse response h_1 + h_2. Frequency response: H_1(f) + H_2(f).

Feedback (IIR): If the forward path has response H_1 and the feedback path has H_2:

H_{\text{feedback}}(f) = \frac{H_1(f)}{1 - H_1(f)H_2(f)}

This is the origin of poles in IIR filters: the denominator 1 - H_1 H_2 = 0 defines the feedback resonances.


References

  1. Oppenheim, A.V. & Schafer, R.W. (2009). Discrete-Time Signal Processing, 3rd ed. Prentice Hall. - The definitive textbook; Chapters 2-4 cover LTI systems and Z-transform.
  2. Proakis, J.G. & Manolakis, D.G. (2006). Digital Signal Processing, 4th ed. Pearson. - Thorough treatment of filter design.
  3. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. - Dilated causal convolutions for audio.
  4. Gu, A. et al. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. - S4 SSM as FFT convolution.
  5. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752. - Selective SSM.
  6. Poli, M. et al. (2023). Hyena Hierarchy. ICML 2023. - Parameterized long convolutions.
  7. Lee-Thorp, J. et al. (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL 2022. - FFT as attention substitute.
  8. Cohen, T. & Welling, M. (2016). Group Equivariant Convolutional Networks. ICML 2016. - GG-CNNs.
  9. Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press. - The original Wiener filter monograph (first circulated as a classified report in 1942).
  10. Young, W.H. (1912). On the multiplication of successions of Fourier constants. Proc. R. Soc. London. - Original Young's inequality.