"The convolution theorem is the most important theorem in signal processing - it is what makes Fourier analysis a practical computational tool rather than a purely theoretical construct."
- Alan V. Oppenheim, MIT EECS
Overview
The Convolution Theorem establishes one of the most beautiful and practically consequential dualities in all of mathematics: convolution in the time domain equals pointwise multiplication in the frequency domain. This single identity reduces an $O(N^2)$ sliding-dot-product computation to $O(N \log N)$ FFT-based arithmetic - a transformation that made real-time digital audio processing possible in the 1970s and powers nearly every modern deep learning architecture today.
This section develops the Convolution Theorem rigorously in both its continuous and discrete forms, builds the surrounding theory of linear time-invariant (LTI) systems and filter design, treats efficient block-convolution algorithms (overlap-add, overlap-save), and traces the theorem's consequences through cross-correlation, the Wiener-Khinchin theorem, and deconvolution. The second half connects these classical results to modern AI: convolutional neural networks as learned frequency filters, WaveNet's causal dilated convolutions, state space models (S4, Mamba) as implicit convolution operators, the Hyena hierarchy's parameterized long convolutions, and FNet's replacement of multi-head attention with a single 2D FFT.
The Convolution Theorem is not merely useful - it is a ring isomorphism. The DFT diagonalizes the circular convolution algebra $(\mathbb{C}^N, \circledast)$, turning an intricate-looking operation into trivial pointwise multiplication. Understanding why this works - through the lens of group representation theory and Fourier analysis on abelian groups - gives deep insight into why spectral methods are so pervasive in modern sequence modeling.
Prerequisites
- DFT and FFT - DFT definition, circular convolution preview, FFT algorithm, complexity (Section 20-03)
- Fourier Transform - continuous FT definition (the $e^{-2\pi i f t}$ convention), Plancherel's theorem, and $L^1$/$L^2$ theory (Section 20-02)
- Fourier Series - complex exponential basis, Parseval's identity (Section 20-01)
- Linear algebra - inner products, unitary matrices, change of basis (Section 02)
- Complex analysis - Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$, modulus, argument, roots of unity (Section 01)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Convolution theorem proofs, LTI frequency response, overlap-add, Wiener filter, CNN filters, S4/Mamba SSM, Hyena, FNet |
| exercises.ipynb | 10 graded problems: convolution by hand through Hyena long-convolution parameterization |
Learning Objectives
After completing this section, you will:
- State and prove the Convolution Theorem in both continuous ($L^1$/$L^2$) and discrete (DFT) forms
- Distinguish circular convolution from linear convolution and apply the zero-padding correction
- Define an LTI system via its impulse response and derive the frequency response
- Analyze the group delay and explain its significance for signal distortion
- Compare FIR and IIR filter designs: stability, linear phase, computational cost
- Implement the overlap-add algorithm for block convolution of long sequences
- Prove the cross-correlation theorem and derive the Wiener-Khinchin theorem (PSD = FT of autocorrelation)
- Derive the Wiener deconvolution filter from the MSE minimization principle
- Interpret CNN convolutional layers as learned LTI frequency filters in the spatial domain
- Explain how S4/Mamba SSMs compute output via FFT-based convolution in training mode
- Describe Hyena's parameterized long convolutions and their complexity
- Identify and fix the 10 most common convolution errors in numerical implementations
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. The Convolution Theorem
- 4. Circular vs. Linear Convolution
- 5. LTI Systems and Filter Theory
- 6. Efficient Long Convolution
- 7. Cross-Correlation and Power Spectral Density
- 8. Deconvolution
- 9. Applications in Machine Learning
- 10. Advanced Topics
- 11. Common Mistakes
- 12. Exercises
- 13. Why This Matters for AI (2026 Perspective)
- 14. Conceptual Bridge
1. Intuition
1.1 What Is Convolution?
Convolution is a mathematical operation that blends two functions by sliding one over the other and measuring their overlap at each position. More precisely, the linear convolution of $f$ and $g$ at position $t$ is the integral of the pointwise product of $f$ with a reversed, shifted copy of $g$:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
The operation appears in an enormous variety of contexts, often without being recognized as convolution:
- Image blurring: convolving an image with a Gaussian kernel produces a smoothed version. Each output pixel is a weighted average of its neighborhood.
- Running average: a box filter convolved with a time series produces the $N$-point moving average - standard in financial data smoothing.
- Polynomial multiplication: multiplying $p(x) = \sum_i a_i x^i$ by $q(x) = \sum_j b_j x^j$ gives coefficients $c_k = \sum_i a_i b_{k-i}$ that are exactly the linear convolution of $a$ and $b$. This is why the FFT can multiply two degree-$n$ polynomials in $O(n \log n)$ operations.
- Echo simulation: convolving an audio signal with a room impulse response (the sound of a clap in an empty hall) produces the "reverberated" signal.
- Probability: the PDF of the sum $X + Y$ of independent random variables is $p_X * p_Y$ - the convolution of their PDFs.
The defining algebraic property is that convolution is commutative ($f * g = g * f$), associative ($(f * g) * h = f * (g * h)$), and distributive over addition ($f * (g + h) = f * g + f * h$). These properties make it a natural operation in any linear system.
For AI: every convolutional layer in a CNN computes exactly this operation. A filter $w$ convolved with a feature map $x$ gives $y = w * x$ - the kernel slides over the input, taking inner products at each position. Understanding convolution as sliding-overlap reveals why translation equivariance holds: if the input shifts, the output shifts by the same amount.
1.2 The Central Insight: Frequency Domain Diagonalization
Computing convolution directly costs $O(NM)$ operations: for each of $N$ output positions, you compute an inner product with $M$ filter coefficients. For $N = M = 10^6$ samples (one second of audio at 1 MHz), this is $10^{12}$ multiplications - completely infeasible.
The Convolution Theorem provides a dramatic shortcut. The key observation is that complex exponentials are eigenfunctions of convolution:
$$(h * e_f)(t) = H(f)\, e_f(t), \qquad e_f(t) = e^{2\pi i f t}$$
Each complex exponential passes through any LTI system unchanged in frequency - it is only scaled by the complex number $H(f)$. This means:
- Decompose the input into its frequency components via the Fourier transform: $X = \mathcal{F}\{x\}$
- Multiply each component by the filter's frequency response: $Y(f) = H(f)\, X(f)$ (pointwise, $O(N)$)
- Reconstruct the output via the inverse Fourier transform: $y = \mathcal{F}^{-1}\{Y\}$
Steps 1 and 3 each cost $O(N \log N)$ via the FFT. Step 2 costs $O(N)$. The total is $O(N \log N)$ - compared to $O(N^2)$ for direct convolution.
CONVOLUTION THEOREM: TWO EQUIVALENT COMPUTATIONS
========================================================================
TIME DOMAIN FREQUENCY DOMAIN
---------------------- ----------------------
x[n], h[n] (length N, M) X[k] = DFT(x)
v v
y = x * h                    Y[k] = X[k] H[k]
O(NM) operations v
y = IDFT(Y)
O(N log N) via FFT
Both computations give IDENTICAL output y[n].
========================================================================
The mathematical reason this works is deep: the DFT is a ring isomorphism between $(\mathbb{C}^N, \circledast)$ - the ring of sequences under circular convolution - and $(\mathbb{C}^N, \cdot)$ - the ring of sequences under pointwise multiplication. Convolution, which looks complicated, becomes trivial multiplication after a change of basis.
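A quick numerical check of this isomorphism - a minimal NumPy sketch; the length-8 random vectors are arbitrary test data:
```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x, h = rng.standard_normal(N), rng.standard_normal(N)

# Circular convolution computed straight from the definition
y = np.array([sum(x[m] * h[(n - m) % N] for m in range(N)) for n in range(N)])

# Same result through the frequency domain: multiply spectra, then invert
y_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
assert np.allclose(y, y_fft)   # convolution <-> pointwise multiplication
```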
1.3 Why This Matters for AI (First Look)
The Convolution Theorem is not merely a classical signal processing tool - it is at the heart of modern deep learning architectures:
- CNNs: every torch.nn.Conv2d layer computes a learned 2D convolution. For large kernels, PyTorch can use FFT-based convolution internally. The frequency domain perspective explains why early CNN layers learn Gabor-like filters (oriented bandpass filters) and why deep layers respond to semantic content.
- WaveNet (van den Oord et al., 2016): autoregressive waveform generation using dilated causal convolutions. The receptive field grows exponentially with depth via dilation, reaching $2^L$ samples with only $L$ layers.
- S4 (Gu et al., 2021): represents a sequence model as a linear recurrence $x_k = \bar{A} x_{k-1} + \bar{B} u_k$, $y_k = \bar{C} x_k$. During training, the recurrence is unrolled into a convolution $y = \bar{K} * u$ where $\bar{K} = (\bar{C}\bar{B},\, \bar{C}\bar{A}\bar{B},\, \bar{C}\bar{A}^2\bar{B},\, \ldots)$ is the SSM kernel, computed in $O(L \log L)$ via FFT.
- Hyena (Poli et al., 2023): replaces attention with learned long convolutions, each parameterized by a neural network. Complexity is $O(L \log L)$ per layer vs. $O(L^2)$ for attention.
- FNet (Lee-Thorp et al., 2022): replaces multi-head attention with a single 2D FFT mixing operation, achieving 92% of BERT's GLUE score with a 7× speedup.
1.4 Historical Timeline
| Year | Event |
|---|---|
| 1821 | Cauchy identifies convolution in complex function theory |
| 1822 | Fourier's Théorie analytique de la chaleur - FT diagonalizes the heat equation (an LTI PDE) |
| 1887 | Heaviside introduces operational calculus; transfer functions for electrical circuits |
| 1920s | Norbert Wiener develops autocorrelation and power spectral density theory |
| 1942 | Wiener invents the optimal Wiener deconvolution filter |
| 1965 | Cooley & Tukey - FFT makes convolution theorem computationally practical |
| 1970s | Digital signal processing revolution: digital filters, audio processing |
| 1989 | LeCun et al. - LeNet: CNNs for handwritten digit recognition |
| 1998 | LeNet-5 - modern CNN architecture with learned convolutional filters |
| 2012 | AlexNet - deep CNNs dominate ImageNet; GPU-accelerated conv layers |
| 2016 | WaveNet (DeepMind) - dilated causal convolutions for audio generation |
| 2021 | S4 (Stanford) - SSMs as implicit FFT convolutions for long sequences |
| 2022 | FNet (Google) - FFT replaces attention at competitive performance |
| 2023 | Hyena / Mamba - parameterized and selective convolutions for sub-quadratic LLMs |
2. Formal Definitions
2.1 Linear (Aperiodic) Convolution
Definition (Continuous Linear Convolution). For $f, g \in L^1(\mathbb{R})$, the convolution of $f$ and $g$ is:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
The integral is well-defined a.e. and $\|f * g\|_1 \le \|f\|_1\, \|g\|_1$ by Young's inequality.
Definition (Discrete Linear Convolution). For sequences $x[n]$ and $h[n]$:
$$(x * h)[n] = \sum_{m=-\infty}^{\infty} x[m]\, h[n - m]$$
For finite sequences of lengths $M$ and $L$, the output has length $M + L - 1$.
Algebraic properties:
- Commutativity: $f * g = g * f$ (proof: substitute $\tau \mapsto t - \tau$)
- Associativity: $(f * g) * h = f * (g * h)$ (follows from Fubini's theorem)
- Distributivity: $f * (g + h) = f * g + f * h$
- Identity: $f * \delta = f$, where $\delta$ is the Dirac delta (continuous) or Kronecker delta (discrete)
- Scaling: $(a f) * g = a\,(f * g)$ for $a \in \mathbb{C}$
Support rule: If $f$ is supported on $[a, b]$ and $g$ on $[c, d]$, then $f * g$ is supported on $[a + c,\, b + d]$. For finite sequences of lengths $M$ and $L$, the output has at most $M + L - 1$ non-zero samples.
Standard examples:
Example 1 (Gaussian * Gaussian): If $f = \mathcal{N}(0, \sigma_1^2)$ and $g = \mathcal{N}(0, \sigma_2^2)$ (as PDFs), then:
$$f * g = \mathcal{N}(0,\, \sigma_1^2 + \sigma_2^2)$$
Convolution of Gaussians is Gaussian with variance sum. This is the sum-of-independent-normals theorem.
Example 2 (rect * rect = triangle): The convolution of the unit rectangular pulse $\mathrm{rect}(t) = \mathbf{1}_{|t| \le 1/2}$ with itself is the triangle function $\mathrm{tri}(t) = \max(0,\, 1 - |t|)$ supported on $[-1, 1]$. Each rect * rect gives a piecewise-linear tent.
Example 3 (polynomial multiplication): for instance, $(1 + 2x)(3 + x) = 3 + 7x + 2x^2$. As convolution: $[1, 2] * [3, 1] = [3, 7, 2]$.
Non-examples:
- $f * g$ is NOT the pointwise product $f(t)\, g(t)$ - these are entirely different operations.
- The convolution of two non-integrable functions (e.g., constants) is not generally well-defined in $L^1$.
2.2 Circular (Periodic) Convolution
Definition (Circular Convolution). For $x, h \in \mathbb{C}^N$, the $N$-point circular (periodic) convolution is:
$$(x \circledast h)[n] = \sum_{m=0}^{N-1} x[m]\, h[(n - m) \bmod N]$$
Circular convolution wraps the index modulo $N$: sample $h[-1]$ is the same as $h[N-1]$.
The key difference from linear convolution: circular convolution assumes both signals are periodic with period $N$. When the input is not actually periodic, the wrap-around creates time-aliasing - contributions from the "end" of the sequence wrap around and contaminate the "beginning" of the output. This produces incorrect results whenever the signal is not truly $N$-periodic.
Matrix form: Circular convolution is exactly a matrix-vector product with a circulant matrix $C_h$, whose $(n, m)$ entry is $h[(n - m) \bmod N]$:
$$x \circledast h = C_h\, x$$
Every circulant matrix is diagonalized by the DFT matrix $F$:
$$C_h = F^{-1}\, \mathrm{diag}(F h)\, F$$
This is precisely the Convolution Theorem in matrix form.
When circular = linear: If $x$ has length $M$ and $h$ has length $L$, zero-pad both to length $N \ge M + L - 1$. Then $N$-point circular convolution equals linear convolution exactly.
2.3 Cross-Correlation and Autocorrelation
Definition (Cross-Correlation). The cross-correlation of $x$ and $y$ is:
$$(x \star y)[n] = \sum_{m} \overline{x[m]}\, y[m + n]$$
Note: cross-correlation is convolution with the time-reversed conjugate of the first argument. It is not commutative: $(x \star y)[n] = \overline{(y \star x)[-n]}$.
Cross-correlation measures the similarity between $y$ and a shifted copy of $x$. The peak of $x \star y$ occurs at the lag where $y$ most resembles $x$.
Definition (Autocorrelation). The autocorrelation of $x$ is:
$$r_{xx}[n] = (x \star x)[n] = \sum_m \overline{x[m]}\, x[m + n]$$
Properties:
- $r_{xx}[0] = \|x\|_2^2$ (maximum at zero lag)
- $r_{xx}[-n] = \overline{r_{xx}[n]}$ (Hermitian symmetry)
- $|r_{xx}[n]| \le r_{xx}[0]$ for all $n$
For AI: Cross-correlation is the operation actually computed in standard "convolution" layers in PyTorch and TensorFlow. The torch.nn.Conv2d layer computes $\sum_{u,v} w[u, v]\, x[i + u,\, j + v]$ (cross-correlation), not $\sum_{u,v} w[u, v]\, x[i - u,\, j - v]$ (true convolution). Since the filter weights are learned, this distinction does not affect expressive power - the network simply learns the time-reversed filter.
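A minimal PyTorch sketch of this distinction (the tiny tensors are illustrative only): flipping the kernel turns `conv1d`'s cross-correlation into true convolution, matching `np.convolve`.
```python
import numpy as np
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]).view(1, 1, -1)   # (batch, channel, length)
w = torch.tensor([1.0, 0.5, 0.25]).view(1, 1, -1)

xcorr = F.conv1d(x, w, padding=2).squeeze()                   # what Conv1d actually computes
true_conv = np.convolve([1, 2, 3, 4, 5], [1, 0.5, 0.25])      # linear convolution, length 7

# Flipping the kernel converts cross-correlation into convolution
conv = F.conv1d(x, w.flip(-1), padding=2).squeeze()
assert np.allclose(conv.numpy(), true_conv)
```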
2.4 The Convolution Algebra
The space $L^1(\mathbb{R})$ equipped with convolution is a commutative Banach algebra:
- Vector space: $L^1(\mathbb{R})$ under pointwise addition
- Algebra: equipped with the multiplication $(f, g) \mapsto f * g$
- Banach: complete normed space with $\|f * g\|_1 \le \|f\|_1\, \|g\|_1$
The discrete version: $\ell^1(\mathbb{Z})$ with discrete convolution is a Banach algebra under the same submultiplicativity bound.
The cyclic group algebra $\mathbb{C}[\mathbb{Z}_N]$ with circular convolution is a commutative semisimple algebra over $\mathbb{C}$. By Maschke's theorem (since $\mathrm{char}\,\mathbb{C} = 0$), it decomposes as:
$$\mathbb{C}[\mathbb{Z}_N] \;\cong\; \underbrace{\mathbb{C} \oplus \mathbb{C} \oplus \cdots \oplus \mathbb{C}}_{N \text{ copies}}$$
where the primitive idempotents are (up to normalization) the DFT basis vectors $e_k[n] = e^{2\pi i k n / N}$. The DFT is exactly the isomorphism to the direct sum. Convolution in the group algebra becomes pointwise multiplication in the direct sum decomposition - this is the algebraic content of the Convolution Theorem.
3. The Convolution Theorem
3.1 Continuous Version
Theorem (Convolution Theorem, Continuous). Let $f, g \in L^1(\mathbb{R})$. Then $f * g \in L^1(\mathbb{R})$ and:
$$\widehat{f * g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi)$$
where $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt$ (using the $e^{-2\pi i \xi t}$ convention consistent with Section 20-02).
Proof. Compute $\widehat{f * g}$ directly using Fubini's theorem (justified by $f, g \in L^1$):
$$\widehat{f * g}(\xi) = \int \left( \int f(\tau)\, g(t - \tau)\, d\tau \right) e^{-2\pi i \xi t}\, dt$$
Swap integration order and substitute $u = t - \tau$:
$$= \int f(\tau)\, e^{-2\pi i \xi \tau} \left( \int g(u)\, e^{-2\pi i \xi u}\, du \right) d\tau = \hat{f}(\xi)\, \hat{g}(\xi) \qquad \blacksquare$$
Extension to $L^2$: For $f \in L^1$ and $g \in L^2$, the theorem still holds in $L^2$ by a density argument, with $\|f * g\|_2 \le \|f\|_1\, \|g\|_2$ (a special case of Young's inequality).
Corollary (Shift Theorem). If $g(t) = f(t - t_0)$ is a shifted version of $f$:
$$\hat{g}(\xi) = e^{-2\pi i \xi t_0}\, \hat{f}(\xi)$$
This follows from the Convolution Theorem applied to $g = f * \delta_{t_0}$ where $\hat{\delta}_{t_0}(\xi) = e^{-2\pi i \xi t_0}$.
For AI: Dilated convolutions in WaveNet compute $y[n] = \sum_k h[k]\, x[n - dk]$, i.e., convolution with the filter upsampled by the dilation factor $d$ (with $d - 1$ zeros inserted between taps). In frequency: $H_d(\omega) = H(d\omega)$ - dilation in time becomes compression in frequency. Each dilation level captures different frequency bands.
3.2 Discrete Version
Theorem (Convolution Theorem, Discrete). For $x, h \in \mathbb{C}^N$ with DFTs $X[k]$ and $H[k]$:
$$\mathrm{DFT}\{x \circledast h\}[k] = X[k]\, H[k]$$
Equivalently: $x \circledast h = \mathrm{IDFT}\{X \cdot H\}$.
Proof. Compute the DFT of the circular convolution directly:
$$\mathrm{DFT}\{x \circledast h\}[k] = \sum_{n=0}^{N-1} \left( \sum_{m=0}^{N-1} x[m]\, h[(n - m) \bmod N] \right) \omega^{nk}$$
where $\omega = e^{-2\pi i / N}$. Swap sums and substitute $n' = (n - m) \bmod N$:
$$= \sum_{m=0}^{N-1} x[m]\, \omega^{mk} \sum_{n'=0}^{N-1} h[n']\, \omega^{n'k} = X[k]\, H[k] \qquad \blacksquare$$
Normalization bookkeeping. NumPy's np.fft.ifft divides by $N$, so the full pipeline is:
```python
Y_k = X_k * H_k        # pointwise, no 1/N
y = np.fft.ifft(Y_k)   # divides by N automatically
```
Avoid double-dividing by $N$ when mixing manual DFT code with NumPy functions.
Algorithm: FFT-based convolution of $x$ (length $M$) with $h$ (length $L$):
- Choose $N \ge M + L - 1$, next power of 2: $N = 2^{\lceil \log_2 (M + L - 1) \rceil}$
- Zero-pad: $x_p[n] = x[n]$ for $n < M$, else $0$; similarly $h_p$
- $X = \mathrm{FFT}(x_p)$, $H = \mathrm{FFT}(h_p)$
- $Y[k] = X[k]\, H[k]$ for all $k$
- $y = \mathrm{IFFT}(Y)$, keep the first $M + L - 1$ samples
Total cost: $O(N \log N)$ for three FFTs + $O(N)$ for the pointwise multiply.
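A direct transcription of these five steps as a NumPy sketch (the helper name `fft_convolve` is ours, not a library API):
```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution of x (length M) with h (length L) via the FFT."""
    M, L = len(x), len(h)
    N = 2 ** int(np.ceil(np.log2(M + L - 1)))   # step 1: next power of 2
    X = np.fft.fft(x, n=N)                      # steps 2-3: zero-pad + FFT
    H = np.fft.fft(h, n=N)
    Y = X * H                                   # step 4: pointwise multiply
    y = np.fft.ifft(Y)[:M + L - 1]              # step 5: IFFT, trim
    return y.real if np.isrealobj(x) and np.isrealobj(h) else y

x, h = np.random.randn(1000), np.random.randn(64)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```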
3.3 The Multiplication Theorem (Dual)
Theorem (Multiplication / Dual Convolution Theorem). For $f, g \in L^1 \cap L^2$:
$$\widehat{f \cdot g} = \hat{f} * \hat{g}$$
Pointwise multiplication in time = convolution in frequency. This is exact duality with the Convolution Theorem.
Consequences:
- Windowing: Multiplying a signal by a window $w$ (e.g., a Hann window) convolves its spectrum with $\hat{w}$. This is the spectral leakage mechanism studied in Section 20-03: the wider the window in time, the narrower $\hat{w}$ in frequency, and the less leakage.
- Amplitude modulation: $f(t)\, e^{2\pi i f_0 t}$ shifts the spectrum to $f_0$. In the frequency domain: $\widehat{f(t)\, e^{2\pi i f_0 t}}(\xi) = \hat{f}(\xi - f_0)$.
- Parseval's theorem: Setting $g = \bar{f}$ and evaluating the dual theorem at $\xi = 0$: $\int |f(t)|^2\, dt = \int |\hat{f}(\xi)|^2\, d\xi$.
3.4 Parseval and Convolution
Parseval's theorem for convolution. Combining Plancherel and the Convolution Theorem:
$$\|f * g\|_2^2 = \int |\hat{f}(\xi)|^2\, |\hat{g}(\xi)|^2\, d\xi$$
Young's convolution inequality. For $f \in L^p$, $g \in L^q$ with $\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}$:
$$\|f * g\|_r \le \|f\|_p\, \|g\|_q$$
Special cases:
- $p = 1,\ q = 2,\ r = 2$: $\|f * g\|_2 \le \|f\|_1\, \|g\|_2$ (used in filter stability analysis)
- $p = q = r = 1$: $\|f * g\|_1 \le \|f\|_1\, \|g\|_1$ (Banach algebra property)
- $p = q = 2,\ r = \infty$: $\|f * g\|_\infty \le \|f\|_2\, \|g\|_2$
Energy interpretation: The energy of the filtered output is bounded by the filter's $L^1$ norm times the input's energy: $\|h * x\|_2 \le \|h\|_1\, \|x\|_2$. A filter with $\|h\|_1 \le 1$ is energy non-increasing - it cannot amplify energy.
4. Circular vs. Linear Convolution
4.1 The Wrap-Around Problem
Circular convolution assumes periodicity. When you apply the DFT-multiply-IDFT recipe to two finite sequences without zero-padding, you compute circular convolution - and the result is not the same as linear convolution.
The wrap-around artifact. Suppose $x = [1, 2, 3]$ and $h = [1, 1]$ (lengths 3 and 2). Linear convolution gives $y = [1, 3, 5, 3]$ (length 4). But the 3-point circular convolution gives:
$$y_\circ[0] = x[0]\, h[0] + x[2]\, h[1] = 1 + 3 = 4$$
because the tail term $y[3] = 3$ wraps around into position 0. The correct linear result would have $y[0] = 1$.
Time-aliasing analogy. Circular convolution on a length- DFT is the time-domain analogue of frequency-domain aliasing. Just as undersampling causes spectral copies to overlap (aliasing in frequency), using an insufficient DFT size causes time-domain copies to overlap (aliasing in time).
CIRCULAR vs LINEAR CONVOLUTION
========================================================================
x = [1, 2, 3], h = [1, 1]
LINEAR (correct): CIRCULAR (N=3, wrap-around error):
y[0] = 1·1 = 1             y[0] = 1·1 + 3·1 = 4   <- WRONG
y[1] = 2·1 + 1·1 = 3       y[1] = 2·1 + 1·1 = 3   PASS
y[2] = 3·1 + 2·1 = 5       y[2] = 3·1 + 2·1 = 5   PASS
y[3] = 3·1 = 3             (missing term - aliased into y[0])
Output length: 4 Output length: 3
========================================================================
4.2 The Zero-Padding Solution
Theorem. Let $x$ have length $M$ and $h$ have length $L$. If we zero-pad both to length $N \ge M + L - 1$ before computing the DFT, then the first $M + L - 1$ samples of the $N$-point IDFT of $X \cdot H$ equal the linear convolution $x * h$.
Why it works. The linear convolution output has $M + L - 1$ non-zero samples. Padding to $N \ge M + L - 1$ ensures the circular buffer is large enough that the "wrap-around" tail falls on zero-padded positions - so no aliasing occurs.
Practical recipe:
```python
import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))   # next power of 2
X = np.fft.rfft(x, n=N)
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]  # trim to correct length
```
Using rfft (real FFT) roughly halves the computation for real-valued inputs.
Power-of-2 choice. Padding to the next power of 2 ensures the Cooley-Tukey FFT uses radix-2 butterflies, maximizing efficiency. For some $N$, mixed-radix FFTs (e.g., $N = 2^a 3^b 5^c$) can be as fast or faster; scipy.fft automatically chooses an efficient FFT size.
4.3 Cost Analysis
For convolving a signal of length $M$ with a filter of length $L$:
| Method | Flops | When optimal |
|---|---|---|
| Direct linear convolution | $O(ML)$ | Short filter ($L$ up to ~$10^2$) |
| FFT-based ($N \ge M + L - 1$) | $O(N \log N)$ | $L$ comparable to $M$ |
| Overlap-add (blocks of size $B$) | $O(M \log B)$ | $M \gg L$ (long signal, fixed filter) |
Crossover point. FFT convolution beats direct convolution roughly when $L \gtrsim \log_2 N$. In practice, for filters longer than about 64-128 taps, FFT-based convolution is almost always faster. For very short filters (e.g., $3 \times 3$ CNN kernels), direct convolution with SIMD instructions often wins due to lower overhead.
5. LTI Systems and Filter Theory
5.1 Linear Time-Invariant Systems
A system $T$ is linear if it satisfies superposition:
$$T\{a x_1 + b x_2\} = a\, T\{x_1\} + b\, T\{x_2\}$$
It is time-invariant if a time shift in the input produces the same time shift in the output:
$$y[n] = T\{x\}[n] \;\Longrightarrow\; T\{x[\cdot - n_0]\}[n] = y[n - n_0]$$
Characterization theorem: Every LTI system is completely described by its impulse response $h = T\{\delta\}$. The output for any input is:
$$y[n] = (x * h)[n] = \sum_m x[m]\, h[n - m]$$
Causality: $h[n] = 0$ for $n < 0$. A causal system cannot depend on future inputs - essential for real-time processing. The convolution simplifies to $y[n] = \sum_{m \ge 0} h[m]\, x[n - m]$.
Stability (BIBO): An LTI system is bounded-input bounded-output (BIBO) stable if and only if:
$$\sum_n |h[n]| < \infty$$
This is the $\ell^1$ condition - exactly the domain in which the Convolution Theorem holds most cleanly.
For AI: Recurrent neural networks can be analyzed through the LTI lens. A linear RNN $x_k = A x_{k-1} + B u_k$, $y_k = C x_k$ has impulse response $h[n] = C A^n B$ for $n \ge 0$. BIBO stability requires $\rho(A) < 1$ (spectral radius). S4 enforces stability by constraining $A$ to have negative-real eigenvalues via the HiPPO initialization.
5.2 Frequency Response and Transfer Function
Definition. The frequency response of an LTI system with impulse response $h$ is:
$$H(f) = \sum_n h[n]\, e^{-2\pi i f n}$$
This is simply the Fourier transform (here, the DTFT) of the impulse response. The Convolution Theorem then gives:
$$Y(f) = H(f)\, X(f)$$
The system applies a complex gain to each frequency component:
- $|H(f)|$ - the magnitude response (gain/attenuation at frequency $f$)
- $\angle H(f) = \arg H(f)$ - the phase response (phase shift at frequency $f$)
Group delay. The group delay measures how much each frequency band is delayed:
$$\tau_g(f) = -\frac{1}{2\pi}\, \frac{d}{df} \arg H(f)$$
A system with constant group delay is said to have linear phase: $\arg H(f) = -2\pi f \tau_0$. This means all frequencies are delayed by the same amount - no dispersion, no distortion of waveform shape.
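A short sketch comparing group delays with scipy.signal.group_delay (the filter choices here are arbitrary illustrations): a symmetric FIR filter has constant group delay, while an IIR Butterworth does not.
```python
import numpy as np
from scipy import signal

fir = signal.firwin(31, 0.3)        # symmetric 31-tap FIR lowpass -> linear phase
b, a = signal.butter(4, 0.3)        # 4th-order IIR Butterworth lowpass

w_fir, gd_fir = signal.group_delay((fir, [1.0]))
w_iir, gd_iir = signal.group_delay((b, a))

print(gd_fir[:5])   # constant (31 - 1)/2 = 15 samples at every frequency
print(gd_iir[:5])   # frequency-dependent -> waveform dispersion
```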
Frequency response examples:
| Filter type | $H(f)$ | Impulse response |
|---|---|---|
| Ideal lowpass, cutoff $f_c$ | $\mathbf{1}_{|f| \le f_c}$ | $2 f_c\, \mathrm{sinc}(2 f_c t)$ |
| Ideal highpass | $\mathbf{1}_{|f| > f_c}$ | $\delta(t) - 2 f_c\, \mathrm{sinc}(2 f_c t)$ |
| Gaussian smoother | $e^{-2\pi^2 \sigma^2 f^2}$ | $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-t^2/2\sigma^2}$ (Gaussian) |
| Differentiator | $2\pi i f$ | $\delta'(t)$ |
| Hilbert transformer | $-i\, \mathrm{sgn}(f)$ | $\frac{1}{\pi t}$ (Cauchy principal value) |
For AI: The Hilbert transformer shifts all frequency components by 90°, producing the analytic signal $x_a(t) = x(t) + i\, (\mathcal{H}x)(t)$. This is used in envelope detection for audio and is related to the complex-valued activations in modern SSM architectures.
5.3 FIR vs. IIR Filters
FIR (Finite Impulse Response): $h[n]$ has only finitely many non-zero terms ($h[n] = 0$ outside $0 \le n < L$). Output is a finite weighted sum: $y[n] = \sum_{m=0}^{L-1} h[m]\, x[n - m]$.
Advantages:
- Always BIBO stable (finite sum, bounded output)
- Can achieve exact linear phase: symmetric FIR ($h[n] = h[L - 1 - n]$) has linear phase
- Easy to design: window method, least-squares, Parks-McClellan algorithm
Disadvantages:
- Long filter needed for sharp frequency cutoffs (e.g., hundreds of taps for a sharp lowpass)
- High computational cost for long filters (mitigated by FFT convolution)
IIR (Infinite Impulse Response): $h[n]$ is non-zero for infinitely many $n$. Described by a rational transfer function:
$$H(z) = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{P} b_k z^{-k}}{1 + \sum_{k=1}^{Q} a_k z^{-k}}$$
Implemented as a recursive difference equation: $y[n] = \sum_k b_k\, x[n - k] - \sum_{k \ge 1} a_k\, y[n - k]$.
Advantages:
- Very compact: sharp frequency responses achievable with few coefficients (e.g., 5th-order Butterworth)
- Classic analog designs: Butterworth (maximally flat), Chebyshev (equiripple), elliptic (optimal)
Disadvantages:
- Stability not guaranteed: poles of $H(z)$ must lie inside the unit circle
- Non-linear phase: group delay is frequency-dependent, causing waveform dispersion
- Cannot be parallelized easily: recursive computation creates data dependencies
For AI: CNN layers are FIR filters (finite, feedforward). RNNs and LSTMs implement IIR-like behavior (infinite memory via recurrence). S4/Mamba use IIR-structured kernels but compute them in FIR/parallel mode via FFT convolution during training.
5.4 Ideal Filter Prototypes
The four ideal filter types, defined by their frequency response:
| Type | $H(f)$ | Application |
|------|----------|-------------|
| Lowpass (LP) | $1$ for $|f| \le f_c$, $0$ otherwise | Smoothing, anti-aliasing |
| Highpass (HP) | $1$ for $|f| > f_c$, $0$ otherwise | Edge detection, noise removal |
| Bandpass (BP) | $1$ for $f_1 \le |f| \le f_2$, $0$ otherwise | Channel selection, mel filters |
| Bandstop (BS) | $0$ for $f_1 \le |f| \le f_2$, $1$ otherwise | Notch filter (50/60 Hz hum removal) |
All ideal filters have impulse responses that are sinc-like - non-causal (extending to $t = -\infty$) and non-realizable. Real-world filter design approximates these ideal brick-wall responses.
Gibbs phenomenon. Truncating the infinite sinc impulse response creates the Gibbs overshoot: the truncated filter's frequency response overshoots the ideal by about 9% at the cutoff regardless of truncation length. Windowing the sinc (multiplying by a Hann or Kaiser window) reduces the Gibbs ripple at the cost of a wider transition band - the same leakage-resolution trade-off from Section 20-03.
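A short numerical sketch of the Gibbs overshoot (tap count, cutoff, and FFT size are arbitrary choices): the rectangular-truncated sinc overshoots near the cutoff, while the Hann-windowed version stays flat.
```python
import numpy as np

L, fc = 101, 0.2                                  # taps, normalized cutoff
n = np.arange(L) - (L - 1) / 2
ideal = 2 * fc * np.sinc(2 * fc * n)              # truncated ideal lowpass
hann = ideal * np.hanning(L)                      # windowed -> reduced ripple

for h, name in [(ideal, "rect"), (hann, "hann")]:
    H = np.abs(np.fft.fft(h, 4096))
    # max passband gain over bins below the cutoff (bin 0.2*4096 ~= 819)
    print(name, f"peak passband gain = {H[:800].max():.3f}")
    # rect peaks near 1.09 (Gibbs overshoot); hann stays near 1.00
```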
6. Efficient Long Convolution
6.1 Overlap-Add Algorithm
The overlap-add (OA) algorithm efficiently convolves a long signal $x$ of length $M$ with a short filter $h$ of length $L$ by processing $x$ in blocks.
Algorithm:
- Choose block size $B$ and FFT size $N \ge B + L - 1$ (typically the next power of 2 for efficiency)
- Pad $h$ to length $N$, compute $H = \mathrm{FFT}(h)$
- For each block $m$:
a. Extract $x_m = x[mB : (m+1)B]$ (zero-padded if needed)
b. Compute $X_m = \mathrm{FFT}(x_m)$ (zero-padded to length $N$)
c. Compute $y_m = \mathrm{IFFT}(X_m \cdot H)$ - length $B + L - 1$
d. Add to the output buffer with overlap:
output[m*B : m*B + N] += Y_m
The last $L - 1$ samples of each block output overlap with the first $L - 1$ samples of the next - these overlapping tails are summed (the "add" in overlap-add).
Why it works. Each block-convolution $x_m * h$ contains the correct contribution of block $m$ to the full output. Convolution is linear in its first argument: summing all block contributions gives the correct total output.
Total cost: $\approx M / B$ FFTs of size $N$, each costing $O(N \log N)$:
$$O\!\left(\frac{M}{B}\, N \log N\right) \approx O(M \log B)$$
Optimal block size minimizes the cost per input sample; in practice $B \approx 4L$ to $10L$ works well.
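A minimal overlap-add sketch following these steps (the helper `overlap_add` and its default block size are our illustrative choices):
```python
import numpy as np

def overlap_add(x, h, B=512):
    """Linear convolution of a long x with a short h, block by block."""
    L = len(h)
    N = int(2 ** np.ceil(np.log2(B + L - 1)))    # FFT size per block
    H = np.fft.rfft(h, n=N)                      # filter FFT, computed once
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), B):
        block = x[start:start + B]
        yb = np.fft.irfft(np.fft.rfft(block, n=N) * H, n=N)
        n_out = len(block) + L - 1
        y[start:start + n_out] += yb[:n_out]     # overlapping tails are summed
    return y

x, h = np.random.randn(10_000), np.random.randn(64)
assert np.allclose(overlap_add(x, h), np.convolve(x, h))
```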
6.2 Overlap-Save Algorithm
Overlap-save (OS) is a computationally equivalent alternative that avoids the explicit addition step by saving (reusing) a portion of each input block.
Algorithm:
- Initialize a buffer of size $N$ filled with zeros
- For each new batch of $B = N - L + 1$ input samples: a. Shift the buffer left by $B$: keep the last $L - 1$ samples from the previous block b. Fill buffer positions $L - 1, \ldots, N - 1$ with the new input samples c. Compute $Y = \mathrm{FFT}(\text{buffer}) \cdot H$ d. $y = \mathrm{IFFT}(Y)$ e. Discard the first $L - 1$ samples of $y$ (circular contamination), save the last $B$ samples
The "save" refers to keeping $L - 1$ samples from the previous block in the current buffer. The "discard" step removes the circular wrap-around artifacts at the beginning of each block.
Comparison:
| Property | Overlap-Add | Overlap-Save |
|---|---|---|
| Computation per block | Same | Same |
| Memory | $O(N)$ work buffer + output buffer with summed tails | $O(N)$ (single sliding input buffer) |
| Implementation complexity | Slightly simpler | Slightly more memory-efficient |
| Real-time streaming | Good | Slightly better |
6.3 Block Convolution in Practice
Choosing block size $B$:
- Too small ($B \lesssim L$): FFT overhead dominates; most computation is transform, not convolution
- Too large ($B \gg L$): good efficiency, but large latency ($B$ samples before any output)
- Rule of thumb: $B \approx 4L$ to $10L$ balances efficiency and latency
scipy.signal.fftconvolve implementation:
```python
import scipy.signal as signal

y = signal.fftconvolve(x, h, mode='full')   # linear conv, output length M+L-1
y = signal.fftconvolve(x, h, mode='same')   # same length as x (centered)
y = signal.fftconvolve(x, h, mode='valid')  # only fully-overlapping part
```
scipy.signal.oaconvolve for long signals:
```python
y = signal.oaconvolve(x, h, mode='full')    # uses overlap-add internally
```
This is typically 10-100× faster than fftconvolve when $M \gg L$.
For AI: The overlap-add structure appears in neural audio synthesis. In WaveNet, the causal dilated convolutions can be viewed as a sequence of overlapping block convolutions, where each block is processed causally. This is why WaveNet inference is slow (sequential sample-by-sample generation) but training is fast (parallel over all positions, like any feedforward convolution).
7. Cross-Correlation and Power Spectral Density
7.1 Cross-Correlation Theorem
Theorem (Cross-Correlation Theorem). For $x, y \in \ell^1$:
$$\mathcal{F}\{x \star y\} = \overline{X}\, Y$$
where $(x \star y)[n] = \sum_m \overline{x[m]}\, y[m + n]$ is the cross-correlation.
Proof. Write $x \star y = \tilde{x} * y$ with $\tilde{x}[n] = \overline{x[-n]}$ (time-reverse and conjugate the first argument). Apply the Convolution Theorem and the time-reversal property $\mathcal{F}\{\tilde{x}\} = \overline{X}$:
$$\mathcal{F}\{x \star y\} = \mathcal{F}\{\tilde{x}\}\, \mathcal{F}\{y\} = \overline{X}\, Y \qquad \blacksquare$$
Discrete version. For sequences $x, y \in \mathbb{C}^N$ (circular cross-correlation):
$$\mathrm{DFT}\{x \star y\}[k] = \overline{X[k]}\, Y[k]$$
In NumPy: np.fft.ifft(np.conj(X) * Y) computes cross-correlation via FFT.
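A small sketch of FFT-based cross-correlation for time-delay estimation (the circular shift and random seed are arbitrary test choices):
```python
import numpy as np

rng = np.random.default_rng(1)
N, true_delay = 1024, 37
x = rng.standard_normal(N)
y = np.roll(x, true_delay)                 # y[n] = x[(n - 37) mod N]

X, Y = np.fft.fft(x), np.fft.fft(y)
r = np.fft.ifft(np.conj(X) * Y).real       # circular cross-correlation
print(np.argmax(r))                        # peak lag -> 37, the delay estimate
```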
Applications:
- Template matching: the cross-correlation peaks where the input most resembles the template - used in pattern recognition, face detection, audio fingerprinting (Shazam)
- Time-delay estimation: the peak of $r_{xy}$ estimates the time delay between two sensors
- Matched filter: the optimal linear filter for detecting a known signal in white noise is the cross-correlator (see Section 7.3)
7.2 Autocorrelation and Wiener-Khinchin Theorem
Definition. The autocorrelation of $x$ is $r_{xx}[n] = (x \star x)[n] = \sum_m \overline{x[m]}\, x[m + n]$.
Theorem (Wiener-Khinchin). For a wide-sense stationary random process $X(t)$ with autocorrelation $r_{XX}(\tau) = \mathbb{E}[\overline{X(t)}\, X(t + \tau)]$, the power spectral density is its Fourier transform:
$$S_{XX}(f) = \int_{-\infty}^{\infty} r_{XX}(\tau)\, e^{-2\pi i f \tau}\, d\tau$$
Corollary. The total power equals the integral of the PSD:
$$\mathbb{E}[|X(t)|^2] = r_{XX}(0) = \int S_{XX}(f)\, df$$
(By Parseval's theorem applied to the autocorrelation.)
Spectral factorization. Since $S_{XX}(f)$ is non-negative and real, one can always write $S_{XX}(f) = |H(f)|^2$ for a causal filter $H$ - the spectral factor. This is used in Wiener filter design.
Practical estimation. The periodogram estimates the PSD from $N$ samples:
$$\hat{S}(f_k) = \frac{1}{N}\, |X[k]|^2$$
The periodogram is an inconsistent estimator (its variance doesn't decrease with $N$). Welch's method averages periodograms over overlapping windowed segments:
$$\hat{S}_{\text{Welch}}(f_k) = \frac{1}{K} \sum_{j=1}^{K} \hat{S}_j(f_k)$$
For AI: Welch's PSD is used in analyzing neural network weight matrices and gradient noise. The gradient PSD reveals dominant frequency scales of gradient oscillations - key for understanding Adam's moment estimation and learning rate scheduling.
7.3 Matched Filtering
Problem: Given an observation $r(t) = s(t) + n(t)$ where $s$ is a known signal and $n$ is white noise with spectral density $N_0 / 2$, find the linear filter that maximizes the output SNR at detection time $T$.
Theorem (Matched Filter). The optimal filter is:
$$H(f) = \overline{S(f)}\, e^{-2\pi i f T}$$
i.e., $h(t) = \overline{s(T - t)}$ - the time-reversed, conjugated, delayed signal. The maximum output SNR is:
$$\mathrm{SNR}_{\max} = \frac{2 E_s}{N_0}, \qquad E_s = \int |s(t)|^2\, dt$$
The matched filter output is proportional to the cross-correlation of the observation $r$ with the template $s$.
For AI: The attention mechanism in Transformers can be interpreted as a soft matched filter. The query $q$ plays the role of a template signal; the keys $k_i$ are candidate matches; $\mathrm{softmax}(q^\top k_i / \sqrt{d})$ is a soft selection of the best-matching key, analogous to thresholding the cross-correlation peak.
8. Deconvolution
8.1 The Deconvolution Problem
Problem setup. Given the observed signal $y = h * x + n$, where $h$ is a known (or estimated) blurring/mixing filter, $x$ is the unknown clean signal, and $n$ is additive noise, recover $x$.
In the frequency domain: $Y(f) = H(f)\, X(f) + N(f)$.
Naive inverse filter. The obvious approach is $\hat{X}(f) = Y(f) / H(f)$. This recovers $X$ exactly when $N = 0$, but in the presence of noise:
$$\hat{X}(f) = X(f) + \frac{N(f)}{H(f)}$$
When $|H(f)| \approx 0$ at some frequencies (the filter attenuates them strongly), the noise term is amplified catastrophically. The inverse filter is numerically ill-conditioned - a small noise level can produce arbitrarily large artifacts.
Example (image deblurring): A Gaussian blur $H(f)$ decays to near-zero at high frequencies. Dividing by $H(f)$ at high frequencies amplifies high-frequency noise by enormous factors, producing extreme ringing artifacts.
8.2 Wiener Filter
The Wiener filter is the optimal linear filter minimizing the mean squared error $\mathbb{E}\left[ |\hat{x} - x|^2 \right]$.
Theorem (Wiener Filter). The optimal deconvolution filter in the frequency domain is:
$$W(f) = \frac{\overline{H(f)}\, S_{xx}(f)}{|H(f)|^2\, S_{xx}(f) + S_{nn}(f)}$$
where $S_{xx}$ is the PSD of the signal and $S_{nn}$ is the PSD of the noise.
Derivation sketch. We seek $W$ such that $\hat{X} = W\, Y$ minimizes the MSE. Setting the derivative of the MSE with respect to $\overline{W}$ to zero (orthogonality principle) yields the formula above; the full computation is in Appendix B.3.
Intuition. The Wiener filter trades off between two regimes:
- When $S_{xx} \gg S_{nn}$ (high SNR): $W \approx 1 / H$ - nearly perfect inversion
- When $S_{xx} \ll S_{nn}$ (low SNR): $W \approx 0$ - suppress the noisy frequency, don't attempt to invert
The ratio $S_{nn}(f) / S_{xx}(f)$ governs the transition between inversion and suppression at each frequency.
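A minimal Wiener deconvolution sketch under simplifying assumptions (circular blur model; the signal spectrum is approximated by a single flat constant; all signal choices are illustrative):
```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
N = 512
x = np.zeros(N); x[100:110] = 1.0; x[300] = 2.0         # toy piecewise signal
g = signal.windows.gaussian(31, std=4); g /= g.sum()    # Gaussian blur kernel
h = np.zeros(N); h[:31] = g                             # circular forward model
sigma = 0.01

H = np.fft.fft(h)
y = np.fft.ifft(np.fft.fft(x) * H).real + sigma * rng.standard_normal(N)
Y = np.fft.fft(y)

naive = np.fft.ifft(Y / H).real                # blows up where |H(f)| ~ 0
snr = np.var(x) / sigma**2                     # crude flat-spectrum SNR estimate
W = np.conj(H) / (np.abs(H)**2 + 1.0 / snr)    # Wiener filter with S_nn/S_xx = 1/snr
x_hat = np.fft.ifft(W * Y).real                # stable reconstruction of x
```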
For AI: The Wiener filter principle appears in noise-robust attention. Sparse attention patterns (Longformer, BigBird) can be viewed as suppressing "noisy" long-range attention weights when the query-key similarity is low - a form of adaptive filtering where the "SNR" is the attention score.
8.3 Regularized Deconvolution
When $S_{xx}$ and $S_{nn}$ are unknown, we use Tikhonov regularization:
$$\hat{X}(f) = \frac{\overline{H(f)}}{|H(f)|^2 + \lambda}\, Y(f)$$
The regularization parameter $\lambda > 0$ prevents division by near-zero values of $|H(f)|$:
- $\lambda \to 0$: approaches the inverse filter (unstable)
- $\lambda \to \infty$: output $\to 0$ (over-regularized)
- Optimal $\lambda$ chosen by cross-validation or the discrepancy principle
L-curve method. Plot $\log \|\hat{x}_\lambda\|$ vs. $\log \|y - h * \hat{x}_\lambda\|$ as $\lambda$ varies. The "corner" of the L-shaped curve gives a good $\lambda$ - balancing solution norm against residual.
Sparse deconvolution. When the clean signal is sparse (e.g., a spike train, musical onset detection), use $\ell_1$ regularization instead of $\ell_2$:
$$\min_x\; \|y - h * x\|_2^2 + \lambda\, \|x\|_1$$
This is the LASSO in the signal processing context and recovers sparse solutions more faithfully than the Wiener filter, which assumes a smooth Gaussian prior on $x$.
9. Applications in Machine Learning
9.1 Convolutional Neural Networks
A convolutional layer computes cross-correlation (not true convolution, but the distinction is irrelevant since weights are learned):
$$y[i, j] = \sum_{u, v} w[u, v]\, x[i + u,\, j + v] + b$$
Frequency perspective: Each learned filter $w$ is an LTI system with frequency response $W(f_1, f_2)$. Deep CNNs learn a hierarchy of filters:
- Layer 1 filters: oriented edges, color gradients (Gabor-like bandpass filters)
- Layer 2 filters: corners, textures (combinations of edge filters)
- Deeper layer filters: object parts, semantic features
Dilated convolutions expand the receptive field without increasing parameters: $y[n] = \sum_k w[k]\, x[n - dk]$. In frequency: inserting $d - 1$ zeros between taps compresses the filter's spectrum by the factor $d$, creating spectral images at multiples of $1/d$ - the dilation images the filter spectrum.
1×1 convolutions are pointwise linear transformations across channels - equivalent to applying a separate MLP at each spatial position. They mix channel information without spatial mixing.
For AI (2026): Modern vision transformers (ViT) treat patches as tokens. Recent work shows that early ViT layers perform operations similar to CNN filters - local spatial filtering. Conversely, "ConvNeXt" (Liu et al., 2022) shows that pure CNN architectures can match ViT by adopting transformer-inspired design choices (depth-wise convolutions, GELU activations, layer norm).
9.2 Causal Convolution and WaveNet
WaveNet (van den Oord et al., 2016) generates raw audio waveforms autoregressively at 16/24 kHz using dilated causal convolutions.
Causal convolution: $y[n] = \sum_{k \ge 0} h[k]\, x[n - k]$ - only past samples contribute. In training, implemented as a left-padded standard convolution.
Dilated receptive field: With dilation factors $1, 2, 4, \ldots, 2^{L-1}$ over $L$ layers, the total receptive field is $2^L$ samples using only $L$ layers of 2-tap filters. At 16 kHz with $L = 10$ layers of 2 taps: receptive field = $2^{10} = 1024$ samples = 64 ms.
WAVENET DILATED CAUSAL RECEPTIVE FIELD (L=3 layers)
========================================================================
Input: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8]
| | | | | | | | |
d=1: +----+ +----+ +----+ +----+ (receptive field: 2)
| | | |
d=2: +---------+ +---------+ (receptive field: 4)
| |
d=4: +--------------+ (receptive field: 8)
Total receptive field: 2^3 = 8 samples
Output y[n] depends on x[n], x[n-1], ..., x[n-7]
========================================================================
9.3 State Space Models as Convolution (S4)
The Structured State Space Sequence model (S4) (Gu et al., 2021) bridges recurrent and convolutional views of sequence modeling.
SSM definition. The continuous-time state space model:
$$\dot{x}(t) = A\, x(t) + B\, u(t), \qquad y(t) = C\, x(t)$$
Discretized with step $\Delta$:
$$x_k = \bar{A}\, x_{k-1} + \bar{B}\, u_k, \qquad y_k = \bar{C}\, x_k$$
Convolutional mode (training). Unrolling the recurrence gives:
$$y = \bar{K} * u, \qquad \bar{K} = \left( \bar{C}\bar{B},\; \bar{C}\bar{A}\bar{B},\; \bar{C}\bar{A}^2\bar{B},\; \ldots,\; \bar{C}\bar{A}^{L-1}\bar{B} \right)$$
The SSM kernel $\bar{K}$ defines a length-$L$ FIR filter. The convolution is computed in $O(L \log L)$ via FFT.
HiPPO initialization. S4 initializes $A$ as the HiPPO matrix that optimally projects history onto Legendre polynomials. This gives long-range decay properties that enable learning dependencies over thousands of steps.
For AI: S4 achieves state-of-the-art performance on the Long Range Arena benchmark, processing sequences of length 16,000 in $O(L \log L)$ vs. $O(L^2)$ for Transformers.
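A toy diagonal SSM in convolutional mode (parameter choices are arbitrary; this sketches the idea, not the full S4 parameterization), checking the FFT convolution against the unrolled recurrence:
```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 4, 256                                    # state size, sequence length
A = np.exp(-0.1 + 0.5j * np.arange(1, n + 1))    # stable discrete poles, |A| < 1
B = rng.standard_normal(n)
C = rng.standard_normal(n)
u = rng.standard_normal(L)

# SSM kernel K[k] = C A^k B (diagonal A makes matrix powers elementwise)
K = np.array([(C * A**k * B).sum().real for k in range(L)])

# Convolutional mode: y = K * u via FFT, zero-padded to avoid wrap-around
N = 2 * L
y_conv = np.fft.ifft(np.fft.fft(K, n=N) * np.fft.fft(u, n=N)).real[:L]

# Recurrent mode: x_k = A x_{k-1} + B u_k, y_k = C x_k
x_state, y_rec = np.zeros(n, dtype=complex), np.zeros(L)
for k in range(L):
    x_state = A * x_state + B * u[k]
    y_rec[k] = (C * x_state).sum().real

assert np.allclose(y_conv, y_rec)   # both modes produce identical output
```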
9.4 Mamba and Selective State Spaces
Mamba (Gu & Dao, 2023) extends S4 with input-selective SSM parameters: $\Delta, B, C$ depend on the input token. This breaks the time-invariance of S4's convolution kernel - the filter is now input-dependent and changes for every sequence.
Why not use FFT directly: Since the parameters change with the input, Mamba cannot precompute a single convolution kernel and apply it via FFT. Instead, it uses a hardware-aware parallel scan algorithm (Blelloch 1990) that processes the recurrence in $O(L)$ work (logarithmic depth) on GPU, with memory efficiency via kernel fusion.
Selective mechanism: The input-dependent parameters enable the model to selectively "remember" or "forget" context - analogous to LSTM gates but more flexible. This allows Mamba to match or exceed Transformer performance on language modeling while maintaining sub-quadratic scaling.
9.5 Hyena and Parameterized Long Convolutions
Hyena (Poli et al., 2023) replaces the attention mechanism with a hierarchy of parameterized long convolutions.
Architecture. A Hyena operator of order $N$ interleaves long convolutions with elementwise gating:
$$z_1 = v, \qquad z_{i+1} = x_i \odot (h_i * z_i), \qquad y = z_{N+1}$$
where each $h_i$ is a learned long convolution (length comparable to the sequence length), $v, x_1, \ldots, x_N$ are linear projections of the input, and $\odot$ is elementwise multiplication.
Parameterization of $h_i$: Each long filter is parameterized by a small neural network evaluated at integer time positions $t = 0, 1, \ldots, L - 1$. This gives a parameter count independent of sequence length.
Complexity: Each convolution costs $O(L \log L)$ via FFT. Total: $O(N L \log L)$ per layer for an order-$N$ Hyena vs. $O(L^2)$ for attention.
For AI: Hyena reaches 99.9% of attention's performance on CIFAR-10 at sequence length 1024 while being 8× faster. It scales better to length 64K, where attention becomes prohibitive.
9.6 FNet: Replacing Attention with FFT
FNet (Lee-Thorp et al., 2022) replaces the self-attention sub-layer in a standard Transformer with a 2D FFT mixing layer:
$$y = \Re\!\left( \mathcal{F}_{\text{seq}}\!\left( \mathcal{F}_{\text{hidden}}(x) \right) \right)$$
Two FFTs: one along the sequence dimension, one along the model (hidden) dimension. No learned parameters in the mixing layer - just a fixed unitary transformation. The FFN sub-layers remain unchanged.
Results:
- Achieves 92-97% of BERT's performance on GLUE
- Training speed: 7× faster than BERT on TPUs (no attention computation)
- Parameter count: same as BERT (all parameters are in FFN, embeddings, output projection)
Why it works: The 2D FFT performs a global mixing of all positions and all channels simultaneously. It acts like an "equal attention" that attends to all positions - less expressive than learned attention but sufficient for many tasks after fine-tuning.
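The entire mixing sub-layer fits in a few lines of PyTorch - a sketch; FNet's production implementation also handles padding and dtype details:
```python
import torch

def fnet_mixing(x):
    """FNet token mixing: FFT over hidden dim, then over sequence dim."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 128, 64)   # (batch, seq_len, d_model)
y = fnet_mixing(x)            # same shape; no learned parameters in the mixer
```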
For AI (2026): FNet is the existence proof that token mixing does not require learned attention. This insight motivated MLP-Mixer, gMLP, and other mixing architectures that replace attention with $O(L \log L)$ or $O(L)$ alternatives.
9.7 Filter Banks Preview
Preview: Wavelets and Filter Banks
The Convolution Theorem underlies an even richer structure: filter banks - collections of bandpass filters that partition the spectrum. Convolving a signal with a bank of bandpass filters and subsampling the outputs gives a multi-resolution representation. The perfect reconstruction condition requires that the synthesis filters exactly invert the analysis filters - a constraint directly expressed via the Convolution Theorem.
The Haar wavelet is the simplest perfect-reconstruction filter bank: a lowpass filter $h = \frac{1}{\sqrt{2}}[1, 1]$ and highpass $g = \frac{1}{\sqrt{2}}[1, -1]$. Iterating the lowpass branch produces the discrete wavelet transform (DWT).
-> Full treatment: Wavelets
10. Advanced Topics
10.1 Convolution on Groups
The Convolution Theorem extends far beyond $\mathbb{R}$ and $\mathbb{Z}_N$ to any locally compact abelian group $G$:
$$(f * g)(x) = \int_G f(y)\, g(y^{-1} x)\, d\mu(y)$$
where $\mu$ is the Haar measure on $G$. The Fourier transform on $G$ diagonalizes this convolution exactly as in the classical case:
$$\widehat{f * g}(\chi) = \hat{f}(\chi)\, \hat{g}(\chi)$$
where $\chi$ are the characters (1D unitary representations) of $G$.
Examples:
- $\mathbb{Z}_N$: DFT, circular convolution
- $\mathbb{Z}$: Z-transform / DTFT, discrete convolution
- $\mathbb{R}$: Fourier transform, linear convolution
- $\mathbb{T}$ (circle): Fourier series, periodic convolution
- $\mathbb{R}^d$: multidimensional FT, image processing
Non-abelian groups. For non-abelian $G$ (e.g., the rotation group $SO(3)$, the permutation group $S_n$), the Fourier transform takes values in matrix-valued representations, not scalars. Convolution in the group algebra is still diagonalized, but now each "frequency component" is a matrix equation $\widehat{f * g}(\rho) = \hat{f}(\rho)\, \hat{g}(\rho)$ where $\rho$ is an irreducible representation.
G-CNNs (Cohen & Welling, 2016): Convolutional networks on groups that are equivariant to all group transformations. Each layer computes the group convolution $(f * \psi)(g) = \sum_{h \in G} f(h)\, \psi(g^{-1} h)$. The output transforms predictably under $G$-transformations, encoding equivariance as an inductive bias rather than learning it from data.
For AI: Group-equivariant CNNs are used in molecular property prediction ($SO(3)$ for 3D rotational equivariance), protein structure prediction (SE(3)-equivariant networks in AlphaFold 2), and medical imaging (rotation/reflection equivariance for tissue classification).
10.2 Spectral Graph Convolution
On a graph with $n$ nodes, the graph Laplacian $L = D - A$ (where $D$ is the degree matrix and $A$ the adjacency matrix) plays the role of the Laplacian in continuous Fourier analysis. Its eigendecomposition $L = U \Lambda U^\top$ defines the graph Fourier transform: $\hat{x} = U^\top x$.
Graph convolution:
$$x *_G g = U\, \hat{g}(\Lambda)\, U^\top x$$
This is the Convolution Theorem on graphs: spectral multiplication = graph convolution.
Limitations: Computing $U$ costs $O(n^3)$; the resulting filters are non-localized. ChebNet (Defferrard et al., 2016) approximates $\hat{g}(\Lambda)$ by a degree-$K$ Chebyshev polynomial, giving $K$-hop local filters. GCN (Kipf & Welling, 2017) uses the first-order approximation (two-parameter filter).
-> Full treatment: Spectral Graph Theory
10.3 Young's Convolution Inequality
Theorem (Young's Inequality for Convolutions). Let $p, q, r \in [1, \infty]$ with $\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}$. For $f \in L^p$ and $g \in L^q$:
$$\|f * g\|_r \le \|f\|_p\, \|g\|_q$$
Key special cases:
| $p$ | $q$ | $r$ | Interpretation |
|---|---|---|---|
| 1 | 1 | 1 | $L^1$ is a Banach algebra under convolution |
| 1 | 2 | 2 | FIR filtering preserves the $L^2$ norm (up to $\|h\|_1$) |
| 1 | $\infty$ | $\infty$ | Bounded signals remain bounded after filtering |
| 2 | 2 | $\infty$ | Autocorrelation is bounded |
Proof sketch. Generalized Hölder's inequality applied to the convolution integral; see Lieb & Loss, "Analysis," Section 4.2.
Sharp version (Beckner-Brascamp-Lieb). The sharp constant in Young's inequality on $\mathbb{R}^d$ is:
$$\|f * g\|_r \le (A_p A_q A_{r'})^d\, \|f\|_p\, \|g\|_q, \qquad A_p = \left( \frac{p^{1/p}}{p'^{\,1/p'}} \right)^{1/2}$$
where $p'$ denotes the conjugate exponent. Gaussians are the extremizers - they saturate the inequality with equality.
11. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using circular convolution when linear convolution is needed | DFT assumes periodicity; wrap-around artifacts contaminate the output | Zero-pad both sequences to length $N \ge M + L - 1$ before the DFT |
| 2 | Forgetting to zero-pad to an efficient FFT size | Awkward FFT sizes can fall back on slower mixed-radix algorithms | Use N = 2**ceil(log2(M+L-1)) or let scipy.fft choose N |
| 3 | Double-dividing by $N$ | Manually applying the $1/N$ factor and then calling np.fft.ifft (which already divides by $N$) | Use either all-manual or all-NumPy; never mix conventions |
| 4 | Confusing convolution with cross-correlation | torch.nn.Conv2d computes cross-correlation $\sum_k w[k]\, x[n + k]$, not $\sum_k w[k]\, x[n - k]$; for symmetric kernels they are equal, but in general they differ | Cross-correlation flips one input; for learned filters it makes no difference (the network learns the flip) |
| 5 | Ignoring the phase response | Analyzing only $\|H(f)\|$ misses phase distortion; non-linear phase disperses the waveform even when the magnitude response is perfect | Check the group delay; use linear-phase FIR filters when waveform shape matters |
| 6 | Applying IIR filters in both directions without understanding "zero-phase" | scipy.signal.filtfilt applies a filter twice (forward then backward) to achieve zero phase, but it is non-causal; cannot be used in real-time streaming | Use filtfilt only for offline analysis; use lfilter with delay compensation for real-time |
| 7 | Using the periodogram directly as a PSD estimate | The periodogram has variance equal to its mean (inconsistent estimator) - it does not converge as $N$ grows | Use Welch's method (scipy.signal.welch) to average windowed periodogram estimates |
| 8 | Naive inverse filter on attenuated frequencies | Dividing by $H(f)$ amplifies noise by $1/\|H(f)\|$ wherever the filter attenuates | Use the Wiener filter or Tikhonov-regularized deconvolution |
| 9 | Confusing convolution with element-wise multiplication | Writing y = h * x in Python (NumPy) gives element-wise multiply, not convolution | Use np.convolve(x, h) or scipy.signal.fftconvolve(x, h) for convolution |
| 10 | Incorrect output length after convolution | Not accounting for the "full", "same", or "valid" output modes, or expecting circular convolution to give $M + L - 1$ output samples (it gives $N$) | Know the output lengths: full = $M + L - 1$; same = $M$; valid = $M - L + 1$ |
12. Exercises
Exercise 1 * - Convolution by Hand
Compute the linear convolution of a short signal and filter (e.g., $x = [1, 2, 3, 4, 5]$ and $h = [1, 1]$) by hand using the sliding-dot-product definition. Verify your answer using np.convolve. Then compute the 5-point circular convolution of the same inputs and identify which output samples differ from the linear convolution result due to wrap-around.
Exercise 2 * - Convolution Theorem Verification For two random sequences $x$ and $h$ (say, lengths 32 and 8): (a) Compute the linear convolution directly. (b) Compute it via the FFT route (zero-pad, multiply spectra, IFFT). (c) Verify that both approaches give the same result to machine precision. (d) Verify Parseval: $\sum_n |y[n]|^2 = \frac{1}{N} \sum_k |Y[k]|^2$ where $Y = \mathrm{DFT}_N(y)$.
Exercise 3 * - LTI Frequency Response Design a simple FIR lowpass filter using the window method: (a) Start with the ideal lowpass impulse response $h_d[n] = 2 f_c\, \mathrm{sinc}(2 f_c n)$ with cutoff $f_c$ (in normalized frequency). (b) Truncate to $L$ taps. (c) Multiply by a Hann window. (d) Plot $|H(f)|$ in dB and measure the passband ripple, stopband attenuation, and transition bandwidth.
Exercise 4 ** - Circular vs. Linear Convolution Demo
(a) Convolve a length-64 chirp signal with a length-16 Hann-windowed filter using both direct np.convolve and the 64-point circular DFT method. (b) Show the circular convolution error (difference from linear convolution). (c) Repeat with zero-padding to length $64 + 16 - 1 = 79$ (or the next power of 2, 128). (d) Verify that the zero-padded circular convolution matches linear convolution exactly.
Exercise 5 ** - Overlap-Add Implementation
Implement the overlap-add algorithm from scratch: def overlap_add(x, h, block_size). (a) Process a 1024-sample signal with a 64-sample filter using block size 128. (b) Verify your output matches scipy.signal.fftconvolve(x, h) to machine precision. (c) Measure and compare the runtime of your overlap-add, np.convolve, and scipy.signal.fftconvolve for varying signal lengths.
Exercise 6 ** - Wiener Deconvolution Given a blurred signal $y = h * x + n$ where $h$ is a Gaussian blur, $x$ is a test signal, and $n$ is white Gaussian noise: (a) Implement the naive inverse filter and show the noise amplification. (b) Implement the Wiener filter assuming known $S_{xx}$ and $S_{nn}$. (c) Show the deconvolution result for three SNR levels (e.g., 10, 20, 40 dB). (d) Compare with regularized Tikhonov deconvolution using the L-curve to select $\lambda$.
Exercise 7 *** - S4 SSM as FFT Convolution Implement the S4 state space model in convolutional mode: (a) Construct the SSM matrices for a simple diagonal SSM with $n$ states. (b) Discretize with step $\Delta$ using the ZOH method: $\bar{A} = e^{\Delta A}$, $\bar{B} = A^{-1}(e^{\Delta A} - I)\, B$. (c) Compute the SSM kernel $\bar{K}[k] = \bar{C} \bar{A}^k \bar{B}$ for $k = 0, \ldots, L - 1$. (d) Apply the kernel to a test sequence via FFT convolution. (e) Verify by also running the recurrence directly and comparing outputs.
Exercise 8 *** - Hyena Parameterized Long Convolution Implement a simplified order-2 Hyena operator: (a) Define two learnable filters of length $L$, each parameterized by a 2-layer MLP evaluated at positions $t = 0, \ldots, L - 1$. (b) Implement the Hyena forward pass: $y = x_2 \odot (h_2 * (x_1 \odot (h_1 * v)))$ where $v, x_1, x_2$ are learned projections. (c) Verify that the total compute is $O(L \log L)$ per forward pass. (d) Compare: for $L = 4096$, count the FLOPs for Hyena vs. self-attention ($O(L^2 d)$ with $d = 64$).
13. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Convolution Theorem ($\mathcal{F}\{x * h\} = X \cdot H$) | Reduces $O(N^2)$ convolution to $O(N \log N)$ in all CNN layers; directly enables efficient sequence modeling |
| LTI system theory + impulse response | S4/Mamba: SSM = LTI system; training in convolutional mode via precomputed SSM kernel; inference in recurrent mode |
| Circular vs. linear convolution | Padding strategy in all CNN implementations; must zero-pad to avoid wrap-around artifacts in FFT-based conv layers |
| FIR filter design (linear phase) | Dilated causal convolutions in WaveNet are FIR filters; dilation exponentially extends receptive field |
| IIR filter stability ($\rho(A) < 1$) | Gradient stability in RNNs, LSTMs, S4; HiPPO matrix design for stable long-range memory |
| Cross-correlation = attention (soft matched filter) | Multi-head attention is a learned bank of cross-correlators; QK inner product = cross-correlation score |
| Wiener filter (optimal deconvolution) | Noise-robust attention; NLP denoising objectives; diffusion model score matching as optimal deconvolution |
| Overlap-add / overlap-save | WaveNet training (parallel causal conv); audio codec inference; streaming inference for SSMs |
| Group convolution (equivariance) | SE(3)-equivariant networks in protein structure (AlphaFold 2); molecular dynamics (NequIP, MACE) |
| Spectral graph convolution | GCN, ChebNet, GraphSAGE - all use graph convolution derived from graph Laplacian diagonalization |
| Hyena parameterized convolutions | Sub-quadratic LLM attention replacement; competitive with transformers at context lengths 8K-64K |
| FNet (FFT attention replacement) | Existence proof that fixed global mixing (FFT) achieves 92% of BERT; motivates Mlp-Mixer, gMLP architectures |
14. Conceptual Bridge
Looking Back
This section builds directly on three prior foundations. From Section 20-01 (Fourier Series), we inherit the idea that periodic signals decompose into sinusoidal harmonics and that these harmonics are orthogonal - the very orthogonality that makes the DFT a unitary transform. From Section 20-02 (Fourier Transform), we carry forward the continuous convolution theorem, Plancherel's theorem, and the $L^1$/$L^2$ framework in which these operations are well-defined. From Section 20-03 (DFT and FFT), we take the computational machinery: the DFT matrix, the Cooley-Tukey algorithm, and the concept of circular periodicity that makes the discrete Convolution Theorem exact rather than approximate.
The Convolution Theorem can be seen as the payoff of all the Fourier analysis developed in Sections 20-01 through 20-03: the abstract orthogonal decomposition becomes a practical speedup. The key insight - that complex exponentials are eigenfunctions of convolution - was already visible in Section 20-01's Fourier series (eigenfunctions of translation operators on the circle) and Section 20-02's FT (eigenfunctions of all LTI operators). Here we make that insight computationally concrete.
The Core Equation
$$\mathrm{DFT}\{x \circledast h\}[k] = X[k]\, H[k]$$
This identity, proved in Section 3.2, is what connects abstract Fourier analysis to practical engineering and modern deep learning. Every CNN layer, every S4/Mamba layer, every FNet mixer, and every overlap-add audio system computes a consequence of this equation.
Looking Forward
The immediate next step is Section 20-05 (Wavelets), which uses the Convolution Theorem as its foundation. Wavelet analysis overcomes Fourier's fundamental limitation - no time localization - by using short convolutions with oscillatory kernels at multiple scales. The perfect reconstruction filter bank condition (the core technical requirement of wavelets) is a system of equations on the Fourier transforms of the analysis and synthesis filters. Every MRA, DWT, and lifting scheme is a cleverly designed set of convolutions satisfying this spectral condition.
Looking further, this section connects to:
- Section 11-04 (Spectral Graph Theory): Graph convolution = signal processing on irregular domains; ChebNet, GCN, and spectral clustering all use the graph Laplacian eigendecomposition in the same way the Convolution Theorem uses DFT eigenvectors
- Section 12-02 (Hilbert Spaces): The Convolution Theorem in $L^2$ is a statement about isometric isomorphisms of Hilbert spaces; Plancherel's theorem is the isometry; the Convolution Theorem is a multiplicative structure on top
- Section 17 (Generative Models): Diffusion models can be interpreted as iterative deconvolution - the forward process convolves with progressively wider Gaussians, and the reverse process learns the Wiener deconvolution at each noise level
POSITION IN CURRICULUM
========================================================================
Section 20-01 Fourier Series Section 20-02 Fourier Transform
(periodic, discrete spectrum) (aperiodic, continuous spectrum)
v v
Section 20-03 DFT and FFT ----------- Section 20-04 Convolution Theorem <- HERE
(discrete, computable) (theorem + LTI + ML apps)
v
Section 20-05 Wavelets
(multi-resolution, time-local)
v
+--------------------+--------------------+
v v v
Section 11-04 Spectral Section 12-02 Hilbert Section 17 Diffusion
Graph Theory Spaces (L2) Models
========================================================================
Appendix A: The Z-Transform and Discrete-Time LTI Systems
The Z-transform is the discrete-time counterpart of the Laplace transform, and its restriction to the unit circle $z = e^{i\omega}$ is the DTFT (discrete-time Fourier transform):
$$X(z) = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n}$$
The DFT evaluates the Z-transform at $N$ equally spaced points on the unit circle: $X[k] = X(z)\big|_{z = e^{2\pi i k / N}}$ for $k = 0, \ldots, N - 1$.
Convolution in the Z-domain. The Z-transform of the convolution is the product of Z-transforms:
$$\mathcal{Z}\{x * h\}(z) = X(z)\, H(z)$$
This follows by exactly the same calculation as the continuous Convolution Theorem (substitute $n' = n - m$).
Transfer function. For an IIR filter with difference equation:
$$y[n] = \sum_{k=0}^{P} b_k\, x[n - k] - \sum_{k=1}^{Q} a_k\, y[n - k]$$
The Z-transform gives: $A(z)\, Y(z) = B(z)\, X(z)$, so:
$$H(z) = \frac{Y(z)}{X(z)} = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{P} b_k z^{-k}}{1 + \sum_{k=1}^{Q} a_k z^{-k}}$$
Stability from Z-domain. The BIBO stability condition $\sum_n |h[n]| < \infty$ translates to: all poles of $H(z)$ (roots of $A(z)$) must lie strictly inside the unit circle $|z| = 1$.
Frequency response from transfer function. Evaluate on the unit circle: $H(e^{i\omega}) = H(z)\big|_{z = e^{i\omega}}$. This gives the frequency response directly from the pole-zero plot.
A.1 Pole-Zero Analysis
The transfer function has:
- Zeros $z_k$: frequencies where $H(z_k) = 0$ (nulled by the filter)
- Poles $p_k$: resonances where $|H|$ peaks (amplified by the filter)
Notch filter example. To null frequency $\omega_0$, place a conjugate zero pair at $z = e^{\pm i \omega_0}$:
$$H(z) = \left( 1 - e^{i \omega_0} z^{-1} \right) \left( 1 - e^{-i \omega_0} z^{-1} \right)$$
This creates a perfect null at $\omega_0$. Used in power-line interference removal (50/60 Hz notch filter).
For AI: The poles of a linear RNN are the Z-transform poles of its impulse response. Stable initialization requires all eigenvalues of $A$ inside the unit disk. S4's HiPPO matrix has eigenvalues on the imaginary axis (continuous time), which after discretization map to eigenvalues near but inside the unit circle - ensuring stability while preserving long-range information.
Appendix B: Worked Examples
B.1 Convolution Theorem for the Heat Equation
The heat equation $\partial_t u = \partial_x^2 u$ with initial condition $u(x, 0) = f(x)$ is solved by convolution with the heat kernel:
$$u(x, t) = (f * G_t)(x), \qquad G_t(x) = \frac{1}{\sqrt{4\pi t}}\, e^{-x^2 / 4t}$$
Derivation via Fourier transform. Take the FT in $x$:
$$\partial_t \hat{u}(\xi, t) = -4\pi^2 \xi^2\, \hat{u}(\xi, t)$$
This is a simple ODE in $t$: $\hat{u}(\xi, t) = e^{-4\pi^2 \xi^2 t}\, \hat{f}(\xi)$.
The Convolution Theorem identifies $e^{-4\pi^2 \xi^2 t} = \hat{G}_t(\xi)$ (the FT of the Gaussian kernel $G_t$). Inverting: $u(\cdot, t) = f * G_t$.
The Fourier transform diagonalizes the heat operator: each frequency evolves independently, decaying at rate $e^{-4\pi^2 \xi^2 t}$. High frequencies decay fastest - blurring sharp features. This is exactly why Gaussian convolution produces smooth outputs.
B.2 Verification of Circular vs. Linear Convolution
Let $x = [1, 2, 3, 4]$ (length 4) and $h = [1, 1]$ (length 2), for example.
Linear convolution (length $4 + 2 - 1 = 5$):
$$x * h = [1,\; 3,\; 5,\; 7,\; 4]$$
4-point circular convolution (wrong - wrap-around):
- $y[0] = 1 \cdot 1 + 4 \cdot 1 = 5$ <- includes the wrapped tail sample, wrong
- $y[1] = 2 \cdot 1 + 1 \cdot 1 = 3$ PASS
- $y[2] = 3 \cdot 1 + 2 \cdot 1 = 5$ PASS
- $y[3] = 4 \cdot 1 + 3 \cdot 1 = 7$ PASS
- Result: $[5, 3, 5, 7]$ - the first sample is wrong
8-point circular convolution (zero-padded, correct):
- Zero-pad $x \to [1, 2, 3, 4, 0, 0, 0, 0]$ and $h \to [1, 1, 0, 0, 0, 0, 0, 0]$
- Circular convolution of the length-8 vectors gives $[1, 3, 5, 7, 4, 0, 0, 0]$
- Trim to length 5: $[1, 3, 5, 7, 4]$ PASS matches linear convolution
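A quick NumPy check of the three computations above, using the same example values:
```python
import numpy as np

x, h = np.array([1.0, 2, 3, 4]), np.array([1.0, 1])
print(np.convolve(x, h))                                    # [1. 3. 5. 7. 4.]

y4 = np.fft.ifft(np.fft.fft(x, 4) * np.fft.fft(h, 4)).real
print(y4)                                                   # [5. 3. 5. 7.]  wrap-around

y8 = np.fft.ifft(np.fft.fft(x, 8) * np.fft.fft(h, 8)).real
print(y8[:5])                                               # [1. 3. 5. 7. 4.]  matches
```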
B.3 Wiener Filter Derivation in Detail
Setup. Observation model: $Y(f) = H(f)\, X(f) + N(f)$. Desired: a linear estimator $\hat{X}(f) = W(f)\, Y(f)$ minimizing the MSE at each frequency.
Expansion:
$$\mathrm{MSE}(f) = \mathbb{E}\left[ |W Y - X|^2 \right] = |W H - 1|^2\, S_{xx} + |W|^2\, S_{nn} + \text{cross terms}$$
Assuming $x$ and $n$ are uncorrelated ($\mathbb{E}[X \overline{N}] = 0$), the cross terms vanish:
$$\mathrm{MSE}(f) = |W H - 1|^2\, S_{xx} + |W|^2\, S_{nn}$$
Setting $\partial\, \mathrm{MSE} / \partial \overline{W} = 0$:
$$(W H - 1)\, \overline{H}\, S_{xx} + W\, S_{nn} = 0 \;\Longrightarrow\; W = \frac{\overline{H}\, S_{xx}}{|H|^2\, S_{xx} + S_{nn}}$$
This is the Wiener filter. The minimum MSE at frequency $f$ is:
$$\mathrm{MMSE}(f) = \frac{S_{xx}\, S_{nn}}{|H|^2\, S_{xx} + S_{nn}} = \frac{S_{xx}}{1 + \mathrm{SNR}(f)}$$
where $\mathrm{SNR}(f) = |H(f)|^2\, S_{xx}(f) / S_{nn}(f)$.
Appendix C: Numerical Notes and Implementation Details
C.1 Choosing FFT Library
| Library | Speed | Features | Best for |
|---|---|---|---|
| `numpy.fft` | Good | Basic FFT, rfft | Simple use; numpy-native |
| `scipy.fft` | Better | Handles non-power-of-2 sizes well, workers param | General purpose |
| `pyfftw` | Best (CPU) | Multithreaded FFTW backend | Production speed-critical |
| `torch.fft` | GPU | Batched FFT on GPU | Neural network layers |
| `cuFFT` (via cupy) | GPU | CUDA-native | Large batch processing |
Rule: For ML training, use torch.fft.rfft which batches over the first dimensions automatically. For signal processing, use scipy.fft which automatically selects the best algorithm.
C.2 Real-Valued Convolution with rfft
For real inputs $x$ and $h$, the DFT has conjugate symmetry: $X[N - k] = \overline{X[k]}$. Only the first $N/2 + 1$ values are independent. Using numpy.fft.rfft instead of fft:
- Input: real $x$ of length $N$
- Output: complex of length $N/2 + 1$ (not $N$)
- Speedup: roughly 2× for large $N$
```python
# Real-valued FFT convolution
import numpy as np

N = 2**int(np.ceil(np.log2(len(x) + len(h) - 1)))
X = np.fft.rfft(x, n=N)   # length N//2 + 1
H = np.fft.rfft(h, n=N)
y = np.fft.irfft(X * H, n=N)[:len(x) + len(h) - 1]
```
C.3 Numerical Precision in Convolution
Round-off error. FFT-based convolution introduces floating-point errors of order $O(\varepsilon \log N)$, where $\varepsilon \approx 2.2 \times 10^{-16}$ for double precision. For $N = 10^6$: relative error on the order of $10^{-15}$ - acceptable for most applications.
Exact convolution. For integer-valued signals, the NTT (Number Theoretic Transform over $\mathbb{Z}_p$) computes exact convolution without floating-point error - covered in Section 20-03.
C.4 Memory Layout for Batched Convolution
When processing batches of signals in PyTorch:
```python
# Batch of B signals of length L, C channels
import torch

x = torch.randn(B, C, L)
X = torch.fft.rfft(x, n=N, dim=-1)                # shape: (B, C, N//2+1)
H = torch.fft.rfft(h.unsqueeze(0), n=N, dim=-1)   # broadcast over batch
Y = X * H
y = torch.fft.irfft(Y, n=N, dim=-1)[:, :, :L]
```
The dim=-1 argument applies the FFT along the last (time) dimension, leaving batch and channel dimensions intact.
Appendix D: Connections to Other Chapters
D.1 Convolution and Probability
The sum $Z = X + Y$ of independent random variables satisfies $p_Z = p_X * p_Y$ (PDF convolution). This connection explains:
- Central Limit Theorem via FT: For the sum $S_n$ of $n$ i.i.d. variables, $\hat{p}_{S_n} = \hat{p}^{\,n}$; under mild conditions the (normalized) product approaches a Gaussian in frequency, so $p_{S_n}$ approaches a Gaussian as more terms are added.
- Characteristic functions: The characteristic function $\varphi_X(t) = \mathbb{E}[e^{itX}]$ is the FT of $p_X$ (up to sign convention). The convolution theorem gives $\varphi_{X+Y} = \varphi_X\, \varphi_Y$ directly.
- Diffusion models: The forward diffusion process adds Gaussian noise iteratively. The noisy marginal is a convolution of the data distribution with a Gaussian kernel: $p_t = p_0 * \mathcal{N}(0, \sigma_t^2)$.
D.2 Convolution and Regularization
Ridge regression solves $\min_x \|A x - y\|_2^2 + \lambda \|x\|_2^2$. When $A$ is a circulant matrix (i.e., $A x = h \circledast x$ for some filter $h$), ridge regression becomes Tikhonov deconvolution:
$$\hat{X}(f) = \frac{\overline{H(f)}}{|H(f)|^2 + \lambda}\, Y(f)$$
The regularization parameter $\lambda$ has the same role as in the Wiener filter: it prevents noise amplification at frequencies where $|H(f)| \approx 0$.
D.3 Convolution and Kernel Methods
A shift-invariant kernel $k(x, y) = \kappa(x - y)$ defines a feature map $\phi$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$. By Bochner's theorem, $\kappa$ is a positive definite shift-invariant kernel if and only if $\hat{\kappa}(\omega) \ge 0$ for all $\omega$ - i.e., $\kappa$ is the FT of a non-negative measure.
Random Fourier Features (Rahimi & Recht, 2007): approximate a shift-invariant kernel via Monte Carlo:
$$\kappa(x - y) \approx \frac{2}{D} \sum_{j=1}^{D} \cos(\omega_j^\top x + b_j)\, \cos(\omega_j^\top y + b_j), \qquad \omega_j \sim \hat{\kappa}, \quad b_j \sim \mathrm{Unif}[0, 2\pi]$$
This approximates the kernel evaluation as a dot product of random Fourier features, reducing kernel SVM training from $O(n^2)$ to $O(n D)$.
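A minimal RFF sketch for the Gaussian kernel $e^{-\|x - y\|^2 / 2}$, whose spectral measure is the standard normal (the dimension and feature count are arbitrary illustration choices):
```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000                                    # input dim, number of features

W = rng.standard_normal((d, D))                   # spectral samples: omega ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, D)                  # random phases

def rff(x):
    """Random Fourier feature map z(x) with E[z(x)·z(y)] = k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)              # a nearby point
approx = rff(x) @ rff(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2)
print(approx, exact)                              # agree to roughly 1/sqrt(D)
```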
Appendix E: Filter Design Methods
E.1 Window Method for FIR Design
The window method designs a length-$L$ FIR filter approximating an ideal frequency response $H_d(f)$:
- Compute the ideal (infinite-length) impulse response: $h_d[n] = \int_{-1/2}^{1/2} H_d(f)\, e^{2\pi i f n}\, df$
- Truncate to $L$ taps and apply a window: $h[n] = h_d[n]\, w[n]$ for $|n| \le (L - 1)/2$
The window controls the trade-off between transition width and stopband attenuation. Using a Hann window gives roughly 44 dB stopband attenuation; the Kaiser window allows continuous tuning via its parameter $\beta$:
- $\beta = 0$: rectangular (about 21 dB)
- $\beta \approx 5$: Hann-like (about 44 dB)
- $\beta \approx 8.6$: about 90 dB stopband
Kaiser window formulas:
$$\beta = \begin{cases} 0.1102\,(A - 8.7) & A > 50 \\ 0.5842\,(A - 21)^{0.4} + 0.07886\,(A - 21) & 21 \le A \le 50 \\ 0 & A < 21 \end{cases}$$
where $A$ is the desired stopband attenuation in dB and the window uses $I_0$, the modified Bessel function.
Minimum filter length for a Kaiser design with stopband attenuation $A$ dB and transition width $\Delta\omega$ (rad/sample):
$$L \approx \frac{A - 8}{2.285\, \Delta\omega} + 1$$
E.2 Parks-McClellan (Equiripple) FIR Design
The Parks-McClellan algorithm finds the optimal equiripple FIR filter minimizing the maximum deviation from the desired response. It uses the Remez exchange algorithm on the Chebyshev equiripple approximation problem.
Key property: For a given filter length $L$, Parks-McClellan achieves the minimum transition bandwidth subject to equal ripple in the passband and stopband. The Chebyshev equiripple solution is optimal in the $L_\infty$ (minimax) sense.
In Python: scipy.signal.remez(numtaps, bands, desired, weight).
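A minimal usage sketch with illustrative band edges (frequencies as fractions of the sample rate, fs = 1):

```python
# Sketch: equiripple lowpass with scipy.signal.remez.
import numpy as np
from scipy import signal

h = signal.remez(numtaps=101, bands=[0, 0.20, 0.25, 0.5], desired=[1, 0])

w, H = signal.freqz(h, worN=2048, fs=1.0)
pb = np.abs(H[w <= 0.20])
print(f"passband ripple: {pb.max() - pb.min():.2e} (equal ripple by construction)")
```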
E.3 Butterworth IIR Design
The $N$-th order Butterworth filter is maximally flat in the passband:

$$|H(\omega)|^2 = \frac{1}{1 + (\omega/\omega_c)^{2N}}$$
- $N = 1$: simple RC circuit (6 dB/octave rolloff)
- $N = 2$: 12 dB/octave
- $N$ large: approaches the ideal brick-wall response
Poles lie on a circle of radius $\omega_c$ in the left half-plane, equally spaced by $\pi/N$:

$$s_k = \omega_c\, e^{\,i\pi(2k + N - 1)/(2N)}, \qquad k = 1, \dots, N$$
In Python: scipy.signal.butter(N, Wn, btype='low', analog=False).
Trade-off: Butterworth has no passband ripple but slower rolloff than Chebyshev or elliptic. For applications where in-band flatness is critical (audio, instrumentation), Butterworth is preferred.
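A usage sketch: a 4th-order Butterworth lowpass applied forward-backward (`sosfiltfilt`) for zero phase; the order, cutoff, and two-tone test signal are illustrative:

```python
# Sketch: Butterworth lowpass with zero-phase (forward-backward) filtering.
import numpy as np
from scipy import signal

fs = 1000.0
sos = signal.butter(4, 50.0, btype='low', fs=fs, output='sos')  # 50 Hz cutoff

t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
y = signal.sosfiltfilt(sos, x)          # filtfilt squares |H| and cancels phase

resid = y - np.sin(2 * np.pi * 10 * t)  # 10 Hz tone should pass through intact
print(f"residual RMS after filtering: {np.sqrt(np.mean(resid**2)):.1e}")
```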
Appendix F: Multidimensional Convolution
F.1 2D Convolution
For 2D signals (images), the 2D linear convolution is:

$$(f * h)[m, n] = \sum_{i}\sum_{j} f[i, j]\, h[m - i,\, n - j]$$
The 2D Convolution Theorem:

$$\mathcal{F}\{f * h\}[k_1, k_2] = F[k_1, k_2]\, H[k_1, k_2]$$

where $F = \mathcal{F}\{f\}$ and $H = \mathcal{F}\{h\}$ are 2D DFTs.
Separable filters. A 2D filter $h[m, n] = h_1[m]\, h_2[n]$ is separable - its 2D FT factors as $H[k_1, k_2] = H_1[k_1]\, H_2[k_2]$. Separable filtering can be implemented as two sequential 1D convolutions: filter the rows with $h_2$, then the columns with $h_1$.
Cost: $O(MN(K_1 + K_2))$ vs. $O(MN K_1 K_2)$ for non-separable, for an $M \times N$ image and $K_1 \times K_2$ kernel. Depthwise separable convolutions in MobileNet exploit a related factorization: apply one $K \times K$ spatial filter per channel, then mix channels with a $1 \times 1$ convolution - reducing parameters from $K^2 C_{\text{in}} C_{\text{out}}$ to $K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}$, roughly a factor of $K^2$ for a $K \times K$ filter.
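A numpy/scipy sketch verifying that two 1D passes reproduce the full 2D convolution with the outer-product kernel (image size and kernel length are illustrative):

```python
# Sketch: separable Gaussian blur as two 1D passes vs. one 2D convolution.
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)
img = rng.standard_normal((256, 256))
g = signal.windows.gaussian(15, std=2.5)
g /= g.sum()                                        # unit-DC 1D Gaussian

full2d = signal.convolve2d(img, np.outer(g, g), mode='same')   # O(K^2) per pixel
rows = signal.convolve2d(img, g[None, :], mode='same')         # row pass ...
sep = signal.convolve2d(rows, g[:, None], mode='same')         # ... then column pass

print(f"max difference: {np.max(np.abs(full2d - sep)):.1e}")   # ~ machine epsilon
```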
F.2 Image Processing Applications
Common 2D convolution kernels and their frequency responses:
| Kernel | FT behavior | Effect |
|---|---|---|
| Gaussian | Gaussian | Lowpass, smooth |
| Laplacian | $\propto -(k_1^2 + k_2^2)$ near DC | Highpass, edge enhance |
| Sobel (x-direction) | $\propto i k_1$ (gradient), smoothed in $y$ | Horizontal gradient |
| Box filter | 2D sinc (oscillating sidelobes) | Lowpass, boxy artifacts |
| Identity ($\delta$) | $1$ everywhere | No change |
Unsharp masking: $y = x + \alpha\,(x - g * x)$, where $g$ is a Gaussian blur kernel. In frequency: $Y = \left[1 + \alpha\,(1 - G)\right] X$ - boosts high frequencies, where $G \approx 0$, by a factor $1 + \alpha$.
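A sketch with `scipy.ndimage.gaussian_filter`; $\alpha$, $\sigma$, and the synthetic image are illustrative:

```python
# Sketch: unsharp masking and its exact detail-boost factor.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(4)
img = rng.standard_normal((128, 128)).cumsum(axis=0).cumsum(axis=1)

alpha, sigma = 1.5, 3.0
blurred = ndimage.gaussian_filter(img, sigma=sigma)
sharpened = img + alpha * (img - blurred)       # y = x + alpha (x - g*x)

# sharpened - blurred = (1 + alpha)(img - blurred): detail scaled by 1 + alpha
boost = np.linalg.norm(sharpened - blurred) / np.linalg.norm(img - blurred)
print(f"detail boost factor: {boost:.2f}")      # prints 2.50 = 1 + alpha
```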
Appendix G: Convolution in Sequence Modeling Timeline
The evolution of convolution in deep learning sequence modeling:
| Year | Model | Convolution Type | Key Innovation |
|---|---|---|---|
| 1989 | LeNet | 2D spatial conv | Translation equivariance for images |
| 1997 | LSTM | Gated recurrence | Implicitly learns long-range IIR response |
| 2016 | WaveNet | 1D dilated causal conv | Receptive field $O(2^L)$ with $L$ layers |
| 2016 | TCN | Dilated causal conv + residual | General temporal conv network |
| 2017 | Transformer | Cross-correlation (attention) | Global "convolution" with learned kernels |
| 2021 | S4 | SSM = implicit IIR conv, trained as FIR via FFT | HiPPO initialization for long-range |
| 2022 | S4D | Diagonal SSM | Simplification: scalar SSM per channel |
| 2022 | FNet | 2D FFT mixing | Fixed unitary conv replaces attention |
| 2023 | Hyena | Parameterized long conv | Data-controlled conv length; sub-quadratic |
| 2023 | Mamba | Selective SSM | Input-dependent scan; best of RNN+conv |
| 2024 | Mamba-2 | State space duality | Connects SSM to linear attention |
The trajectory is clear: convolution is progressively replacing or supplementing attention as the primary mixing mechanism in large-scale sequence models, driven by the $O(N \log N)$ vs. $O(N^2)$ complexity advantage at long sequence lengths.
Appendix H: Quick Reference
H.1 Convolution Formulas
| Operation | Formula | Cost |
|---|---|---|
| Linear conv (continuous) | $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$ | - |
| Linear conv (discrete) | $y[n] = \sum_{m} x[m]\, h[n - m]$ | $O(NM)$ direct |
| Circular conv (discrete) | $y[n] = \sum_{m=0}^{N-1} x[m]\, h[(n - m) \bmod N]$ | $O(N \log N)$ via FFT |
| FFT convolution | zero-pad -> FFT -> multiply -> IFFT | $O(N \log N)$ |
| Cross-correlation | $(x \star h)[n] = \sum_{m} \overline{x[m]}\, h[m + n]$ | $O(N \log N)$ via FFT |
| Autocorrelation | $r_{xx}[n] = (x \star x)[n]$ | $O(N \log N)$ via FFT |
H.2 Convolution Theorem Summary
| Domain | Time / Space | Frequency |
|---|---|---|
| Continuous | $(f * g)(t)$ | $\hat{f}(\xi)\, \hat{g}(\xi)$ |
| Continuous (dual) | $f(t)\, g(t)$ | $(\hat{f} * \hat{g})(\xi)$ |
| Discrete (DFT) | $(x \circledast h)[n]$ | $X[k]\, H[k]$ |
| Discrete (dual) | $x[n]\, h[n]$ | $\frac{1}{N}(X \circledast H)[k]$ |
| Cross-correlation | $(x \star h)[n]$ | $\overline{X[k]}\, H[k]$ |
| Wiener-Khinchin | $R_{XX}(\tau)$ (autocorrelation) | $S_{XX}(f) = \mathcal{F}\{R_{XX}\}(f)$ (PSD) |
H.3 Filter Types Comparison
| Property | FIR | IIR |
|---|---|---|
| Impulse response | Finite ($L$ taps) | Infinite (recursive) |
| Phase | Can be exactly linear | Non-linear (dispersive) |
| Stability | Always BIBO stable | Only if poles inside unit circle |
| Computation | $O(L)$ per sample | $O(p)$ per sample (order-$p$ recursion) |
| FFT-accelerated | Yes, trivially | Not directly |
| Design methods | Window, Parks-McClellan, LS | Butterworth, Chebyshev, elliptic |
| ML analogy | CNN, WaveNet, S4 kernel | RNN, LSTM (implicit) |
H.4 Overlap-Add vs. Overlap-Save
| Property | Overlap-Add | Overlap-Save |
|---|---|---|
| Strategy | Block convolution + add tails | Circular buffer + discard head |
| FFT size | $N \geq L_{\text{block}} + L_h - 1$ | $N$, blocks overlap by $L_h - 1$ |
| Memory | $O(N)$ | $O(N)$ |
| Per-sample latency | $\approx$ one block ($L_{\text{block}}$ samples) | $\approx$ one block ($N - L_h + 1$ samples) |
| Optimal block size | several times $L_h$ (commonly $4 L_h$ to $8 L_h$) | Same |
| scipy function | signal.oaconvolve | (manual implementation) |
H.5 Deconvolution Methods
| Method | Formula | When to use |
|---|---|---|
| Naive inverse | $\hat{X} = Y / H$ | Only when $H[k] \neq 0$ everywhere, no noise |
| Wiener filter | $\hat{X} = \dfrac{H^* S_{xx}}{|H|^2 S_{xx} + S_{nn}}\, Y$ | Known signal and noise PSDs |
| Tikhonov / ridge | $\hat{X} = \dfrac{H^*}{|H|^2 + \lambda}\, Y$ | Simple, single parameter $\lambda$ |
| LASSO deconv | $\min_x \|h * x - y\|_2^2 + \lambda \|x\|_1$ | Sparse signals (spike trains) |
| Total variation | $\min_x \|h * x - y\|_2^2 + \lambda \|\nabla x\|_1$ | Piecewise-constant signals |
H.6 Key Inequalities

Young's inequality: $\|f * g\|_r \leq \|f\|_p\, \|g\|_q$ for $\frac{1}{p} + \frac{1}{q} = 1 + \frac{1}{r}$. Special cases: $\|f * g\|_1 \leq \|f\|_1\, \|g\|_1$ and $\|f * g\|_\infty \leq \|f\|_2\, \|g\|_2$ (Cauchy-Schwarz).
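A numerical spot-check of the $\ell_1$ case (signal lengths are arbitrary):

```python
# Sketch: spot-checking the l1 case of Young's inequality on random signals.
import numpy as np

rng = np.random.default_rng(5)
f = rng.standard_normal(256)
g = rng.standard_normal(300)

lhs = np.abs(np.convolve(f, g)).sum()       # ||f * g||_1
rhs = np.abs(f).sum() * np.abs(g).sum()     # ||f||_1 ||g||_1
print(f"{lhs:.1f} <= {rhs:.1f}: {lhs <= rhs}")
```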
Appendix I: Common Signal Pairs and Their Convolutions
Understanding which signals arise as convolutions of common building blocks builds strong intuition.
I.1 Standard Convolution Pairs
| $f$ | $g$ | $f * g$ |
|---|---|---|
| $\mathrm{rect}(t)$ (rect pulse) | $\mathrm{rect}(t)$ (rect pulse) | $\mathrm{tri}(t)$ (triangle) |
| $e^{-at} u(t)$ | $e^{-bt} u(t)$ | $\dfrac{e^{-at} - e^{-bt}}{b - a}\, u(t)$ (if $a \neq b$) |
| $f(t)$ | $\delta(t - t_0)$ | $f(t - t_0)$ (shift) |
| $f(t)$ | $\delta'(t)$ | $f'(t)$ (derivative) |
| $\mathrm{sinc}(t)$ | $\mathrm{sinc}(t)$ | $\mathrm{sinc}(t)$ (idempotent!) |
The sinc idempotence reflects the fact that the ideal lowpass filter is a projection: applying it twice gives the same result as once. In ML terms, this is analogous to idempotent attention patterns.
I.2 Discrete Convolution Pairs
| $x[n]$ | $h[n]$ | $(x * h)[n]$ |
|---|---|---|
| $\delta[n]$ | $h[n]$ | $h[n]$ (impulse response) |
| $x[n]$ | $u[n]$ (step) | $\sum_{m \leq n} x[m]$ (running sum) |
| $[1, 1]$ | $[1, 1]$ ($n$-fold self-convolution) | $\binom{n}{k}$ (binomial expansion) |
| $x[n]$ | FIR filter $b[n]$ | $y[n] = \sum_k b[k]\, x[n - k]$ |
I.3 Cascade and Parallel Systems
Cascade (series): If system 1 has impulse response $h_1$ and system 2 has $h_2$, the cascade has impulse response $h = h_1 * h_2$. Frequency response: $H(\omega) = H_1(\omega)\, H_2(\omega)$.
Parallel: Impulse response $h = h_1 + h_2$. Frequency response: $H(\omega) = H_1(\omega) + H_2(\omega)$.
Feedback (IIR): If the forward path has response $G(\omega)$ and the feedback path has $F(\omega)$:

$$H(\omega) = \frac{G(\omega)}{1 + G(\omega)\, F(\omega)}$$

This is the origin of poles in IIR filters: the zeros of the denominator $1 + G F$ define the feedback resonances.
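A sketch confirming the cascade rule numerically: convolving two FIR impulse responses multiplies their frequency responses (the two lowpass filters are arbitrary illustrative choices):

```python
# Sketch: cascade rule H = H1 * H2 checked numerically.
import numpy as np
from scipy import signal

h1 = signal.firwin(31, 0.4)
h2 = signal.firwin(31, 0.2)
h_cascade = np.convolve(h1, h2)             # series connection: h = h1 * h2

w, H1 = signal.freqz(h1, worN=1024)
_, H2 = signal.freqz(h2, worN=1024)
_, Hc = signal.freqz(h_cascade, worN=1024)
print(f"max |Hc - H1 H2| = {np.max(np.abs(Hc - H1 * H2)):.1e}")  # ~1e-16
```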
References
- Oppenheim, A.V. & Schafer, R.W. (2009). Discrete-Time Signal Processing, 3rd ed. Prentice Hall. - The definitive textbook; Chapters 2-4 cover LTI systems and Z-transform.
- Proakis, J.G. & Manolakis, D.G. (2006). Digital Signal Processing, 4th ed. Pearson. - Thorough treatment of filter design.
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. - Dilated causal convolutions for audio.
- Gu, A. et al. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. - S4 SSM as FFT convolution.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752. - Selective SSM.
- Poli, M. et al. (2023). Hyena Hierarchy. ICML 2023. - Parameterized long convolutions.
- Lee-Thorp, J. et al. (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL 2022. - FFT as attention substitute.
- Cohen, T. & Welling, M. (2016). Group Equivariant Convolutional Networks. ICML 2016. - $G$-CNNs.
- Wiener, N. (1942). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press. - Original Wiener filter paper.
- Young, W.H. (1912). On the multiplication of successions of Fourier constants. Proc. R. Soc. London. - Original Young's inequality.