Discrete Fourier Transform and FFT: Appendix I: DFT and Signal Reconstruction to Appendix R: Quick-Reference Formula Sheet
Appendix I: DFT and Signal Reconstruction
I.1 Bandlimited Interpolation
Given DFT coefficients $X_0, \dots, X_{N-1}$ (retaining only the lowest $K$ frequencies), the bandlimited interpolation of the corresponding signal to any intermediate time $t$ (not necessarily a sample point) is:

$$x(t) = \frac{1}{N} \sum_{|k| \le K} X_k \, e^{2\pi i k t / N}, \qquad t \in [0, N)$$

where $X_{-k}$ denotes $X_{N-k}$.
This is the continuous analog of what FNO does (Section 8.2): it parameterizes a function by its low-frequency DFT coefficients, and evaluates it at any desired spatial resolution.
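The zero-padding trick behind bandlimited interpolation can be sketched in a few lines of NumPy (a minimal illustration; the function name and the 4x upsampling factor are arbitrary choices):

```python
import numpy as np

def bandlimited_interpolate(x, upsample=4):
    """Evaluate the trigonometric (bandlimited) interpolant of a real signal x
    on a grid `upsample` times finer, by zero-padding its DFT."""
    N = len(x)
    assert N % 2 == 0
    M = upsample * N
    X = np.fft.fft(x)
    Xpad = np.zeros(M, dtype=complex)
    Xpad[:N // 2] = X[:N // 2]               # positive frequencies
    Xpad[M - N // 2 + 1:] = X[N // 2 + 1:]   # negative frequencies
    Xpad[N // 2] = X[N // 2] / 2             # split the Nyquist bin in two
    Xpad[M - N // 2] = X[N // 2] / 2
    return np.fft.ifft(Xpad).real * (M / N)  # rescale for the longer IDFT

N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)            # 3 cycles over N samples
y = bandlimited_interpolate(x, upsample=4)   # 64 points on the same cosine
```

The interpolant passes exactly through the original samples and, because the input was bandlimited, through the underlying continuous cosine at every intermediate point.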
I.2 Compressed Sensing and Sparse Recovery
Problem (Compressed Sensing). A signal $x \in \mathbb{C}^N$ is $s$-sparse in the DFT domain: only $s \ll N$ Fourier coefficients are nonzero. Can we recover $x$ from $m < N$ measurements?
Answer (Candes, Romberg, Tao, 2006; Donoho, 2006). If $m = O(s \log N)$ random measurements $y = \Phi x$ are taken, $x$ can be exactly recovered (with high probability) by solving the $\ell_1$ minimization:

$$\min_{z} \|\hat{z}\|_1 \quad \text{subject to} \quad \Phi z = y$$

where $\Phi \in \mathbb{C}^{m \times N}$ is the random measurement matrix and $\hat{z}$ denotes the DFT of $z$. This is dramatically below the $m = N$ requirement of classical sampling.
For AI: compressed sensing underlies MRI reconstruction (6-8x speedup by acquiring roughly 1/6 of the k-space data), compressed attention (attending to sparse subsets of frequency components), and efficient LLM inference when activations are sparse in the Fourier domain. The theoretical guarantees connect directly to the RIP (Restricted Isometry Property) of random partial DFT matrices.
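A full $\ell_1$ solver needs an optimization package, but the phenomenon is easy to demonstrate with orthogonal matching pursuit, a simple greedy sparse-recovery algorithm used here as an illustrative stand-in for the $\ell_1$ program (all sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, s, m = 64, 2, 24                  # signal length, sparsity, measurements

# Build a signal that is s-sparse in the DFT domain
X = np.zeros(N, dtype=complex)
support = rng.choice(N, size=s, replace=False)
X[support] = rng.standard_normal(s) + 1j * rng.standard_normal(s)
x = np.fft.ifft(X)

# m random time-domain samples: y = A @ X, with A = rows of the IDFT matrix
rows = rng.choice(N, size=m, replace=False)
A = np.fft.ifft(np.eye(N), axis=0)[rows, :]
y = x[rows]

# Orthogonal matching pursuit: greedily pick DFT bins, refit by least squares
sel, r = [], y.copy()
for _ in range(s):
    sel.append(int(np.argmax(np.abs(A.conj().T @ r))))  # most correlated bin
    coef, *_ = np.linalg.lstsq(A[:, sel], y, rcond=None)
    r = y - A[:, sel] @ coef

X_hat = np.zeros(N, dtype=complex)
X_hat[sel] = coef
```

With 24 measurements of a length-64 signal, the 2 active DFT bins and their coefficients are recovered essentially exactly, far below the $N$ samples classical sampling would demand.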
I.3 Phase Retrieval
Problem. Given only $|X_k|^2$ (the power spectrum, without phase), recover $x_n$.
Phase retrieval is fundamentally ill-posed in 1D (any signal and its time-reversed, time-shifted version have the same power spectrum). In 2D (images), phase retrieval becomes uniquely solvable (generically) due to oversampling constraints.
Algorithms: Gerchberg-Saxton (1972), HIO (Fienup, 1982), and modern deep learning approaches (phase retrieval nets).
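The alternating-projection idea behind these algorithms fits in a few lines; below is a minimal error-reduction (Gerchberg-Saxton-style) sketch for a 1D signal with a known support constraint (all sizes and the iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, S = 128, 32                           # signal length, known support size

x_true = np.zeros(N)
x_true[:S] = rng.standard_normal(S)      # signal lives on the first S samples
mag = np.abs(np.fft.fft(x_true))         # measured magnitudes (phase lost)

def fourier_error(x):
    return np.linalg.norm(np.abs(np.fft.fft(x)) - mag)

# Alternate projections: impose measured magnitudes in the Fourier domain,
# then impose the support (and realness) constraint in the time domain.
x = rng.standard_normal(N)
errs = []
for _ in range(200):
    X = np.fft.fft(x)
    X = mag * np.exp(1j * np.angle(X))   # keep estimated phase, fix magnitude
    x = np.fft.ifft(X).real
    x[S:] = 0.0                          # enforce known support
    errs.append(fourier_error(x))
```

Error reduction is monotone (the Fourier-magnitude mismatch never increases), which is why Fienup's HIO was introduced: it escapes the stagnation that plain error reduction often suffers.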
For AI: X-ray crystallography uses phase retrieval to determine protein structure. Modern protein structure prediction (AlphaFold 2, RoseTTAFold) sidesteps phase retrieval by directly predicting 3D coordinates from sequence and co-evolutionary information.
Appendix J: Worked ML Examples
J.1 Building a Mel Spectrogram from Scratch
This section works through the complete mathematics behind the mel spectrogram used in Whisper, step by step.
Step 1: Mel scale. The mel scale converts frequency $f$ (Hz) to perceived pitch $m$ (mels):

$$m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

The mel scale is approximately linear below 1 kHz (300 Hz $\approx$ 402 mel; 700 Hz $\approx$ 781 mel) and logarithmic above 1 kHz, matching the frequency resolution of the human auditory system (the cochlea's place-frequency mapping).
Step 2: Filterbank design. For $M$ mel filters covering $[f_{\min}, f_{\max}]$:
- Convert the frequency range to mels: $m_{\min} = m(f_{\min})$, $m_{\max} = m(f_{\max})$
- Place $M + 2$ equally-spaced mel points: $m_i = m_{\min} + i \cdot \dfrac{m_{\max} - m_{\min}}{M + 1}$, $i = 0, \dots, M + 1$
- Convert back to Hz: $f_i = 700\left(10^{m_i / 2595} - 1\right)$
- For filterbank $j$ ($1 \le j \le M$), define the triangular filter:

$$H_j(k) = \max\left(0, \; \min\left(\frac{f_k - f_{j-1}}{f_j - f_{j-1}}, \; \frac{f_{j+1} - f_k}{f_{j+1} - f_j}\right)\right)$$

where $f_k = k f_s / N_{\text{FFT}}$ are the DFT bin frequencies.
Step 3: Power accumulation. For each STFT frame $t$:

$$S_{\text{mel}}(j, t) = \sum_k H_j(k) \, |X(k, t)|^2$$

Step 4: Log compression.

$$\text{log\_mel}(j, t) = \log_{10}\left(\max\left(S_{\text{mel}}(j, t), \; 10^{-10}\right)\right)$$

The floor prevents $\log(0)$. Whisper then clamps values to within 8 of the maximum and rescales via $(\text{log\_mel} + 4)/4$ (an empirically chosen normalization that centers the values near $0$ for typical speech).
Why these choices?
- 80 mel bins: empirically optimal for large-scale ASR (more bins -> marginal improvement; fewer -> information loss)
- 0-8000 Hz: human speech information is concentrated here; extending above 8 kHz adds noise without significant ASR benefit
- 25 ms window / 10 ms hop: matches the typical duration of speech phonemes (20-200 ms)
- Log compression: matches perceptual loudness; prevents high-amplitude phonemes from dominating the feature representation
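The steps above can be implemented from scratch in a few dozen lines of NumPy (a sketch using Whisper-like parameters; it is not Whisper's exact `audio.py` code, which also handles padding, clamping, and normalization):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    # n_mels + 2 points equally spaced on the mel axis, converted back to Hz
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bin_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft   # DFT bin centers
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(1, n_mels + 1):
        left, center, right = hz_pts[j - 1], hz_pts[j], hz_pts[j + 1]
        up = (bin_freqs - left) / (center - left)        # rising edge
        down = (right - bin_freqs) / (right - center)    # falling edge
        fb[j - 1] = np.maximum(0.0, np.minimum(up, down))
    return fb

fb = mel_filterbank()

# One frame of the pipeline: window -> power spectrum -> mel -> log
frame = np.hanning(400) * np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
power = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log10(np.maximum(fb @ power, 1e-10))
```

For a full spectrogram this per-frame computation is repeated every hop (160 samples), producing the 80-bin columns Whisper consumes.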
J.2 FNO Architecture: Complete Mathematical Description
The Fourier Neural Operator (Li et al., 2021) architecture for 1D problems:
Input: discretized function $a \in \mathbb{R}^{N \times d_a}$ (input function on $N$ points with $d_a$ channels). Output: discretized function $u \in \mathbb{R}^{N \times d_u}$ (solution on $N$ points with $d_u$ channels).
Architecture (4 FNO layers + output projection):
Layer 0: Input lifting
  a in R^{N x d_a} -> v in R^{N x d_v}   (learnable MLP)
Layers 1-4: FNO blocks
  v <- FNO-Block(v)
Layer 5: Output projection
  v in R^{N x d_v} -> u in R^{N x d_u}   (learnable MLP)
FNO Block:
FNO-Block(v):
  1. Spectral branch:
     V = DFT(v)                  # shape: (N, d_v), complex
     V_trunc = V[:K, :]          # keep lowest K modes
     V_out[k] = R[k] @ V_trunc[k]   # R in C^{K x d_v x d_v}: one learned d_v x d_v matrix per mode
     V_pad = [V_out; 0, ..., 0]  # zero-pad back to N modes
     v_spectral = IDFT(V_pad)    # shape: (N, d_v)
  2. Local branch:
     v_local = v @ W             # W in R^{d_v x d_v}: learned pointwise weight
  3. Combine:
     v_new = sigma(v_spectral + v_local)   # sigma = GeLU
  return v_new
Key design choices:
- Retained modes $K$: a small number of modes suffices for smooth PDE solutions; a larger $K$ is used for more complex solutions (e.g., turbulence)
- Channels $d_v$: the width of the latent representation, analogous to a transformer's hidden dimension
- Weight sharing: the same $R$ is used across all spatial positions (translation equivariance via pointwise multiplication in the frequency domain)
- Resolution-invariance: the FNO can be trained on $N$ grid points and evaluated on $N' \neq N$ by changing the DFT size; the spectral weights $R$ and local weights $W$ are independent of $N$
Parameter count comparison (1D, sequence length $N$, $K$ retained modes, $d_v$ channels):
| Layer | Dense global convolution | FNO spectral layer |
|---|---|---|
| Parameters | $N d_v^2$ | $K d_v^2$ (complex) |
| FLOPs | $O(N^2 d_v^2)$ | $O(N \log N \cdot d_v + K d_v^2)$ |
With $K \ll N$, the FNO spectral layer has far fewer parameters and far fewer FLOPs per layer than a fully global convolution.
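The spectral branch of the FNO block is compact enough to sketch directly in NumPy (an illustrative sketch with arbitrary sizes, using a random weight tensor in place of trained weights; real implementations such as `neuraloperator` use PyTorch):

```python
import numpy as np

def fno_spectral_layer(v, R):
    """One FNO spectral branch: DFT -> truncate to K modes -> per-mode
    channel mixing with complex weights R -> zero-pad -> inverse DFT.
    v: (N, d_v) real array; R: (K, d_v, d_v) complex array."""
    N, d_v = v.shape
    K = R.shape[0]
    V = np.fft.rfft(v, axis=0)                 # (N//2 + 1, d_v), complex
    V_out = np.einsum('kij,kj->ki', R, V[:K])  # mix channels per kept mode
    V_pad = np.zeros((N // 2 + 1, d_v), dtype=complex)
    V_pad[:K] = V_out                          # zero out truncated modes
    return np.fft.irfft(V_pad, n=N, axis=0)    # back to (N, d_v), real

rng = np.random.default_rng(0)
N, d_v, K = 256, 8, 16
v = rng.standard_normal((N, d_v))
R = rng.standard_normal((K, d_v, d_v)) + 1j * rng.standard_normal((K, d_v, d_v))
out = fno_spectral_layer(v, R)
```

Because `R` only touches the first `K` modes, the same weights apply unchanged to inputs of any resolution, which is the resolution-invariance property described above.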
J.3 Monarch Matrix Decomposition
A Monarch matrix of size $N \times N$ (with $N = m^2$) is defined as:

$$M = P \, L \, P^{\top} R$$

where:
- $P$ is a fixed permutation (the "interleaving" or stride permutation)
- $L$ is block-diagonal with $m$ blocks of size $m \times m$
- $R$ is block-diagonal with $m$ blocks of size $m \times m$
Connection to FFT. The radix-2 FFT stage factors as:

$$F_N = (F_2 \otimes I_{N/2}) \, D_N \, (I_2 \otimes F_{N/2}) \, \Pi_N$$

where $D_N$ is a diagonal twiddle-factor matrix, $\Pi_N$ is the even-odd permutation, and $\otimes$ is the Kronecker product. Each butterfly stage is block-diagonal up to a fixed permutation - exactly the Monarch structure. Thus the FFT is a special Monarch matrix with specific (non-learned) parameter values.
Monarch parameterization: instead of fixing the block entries to be DFT matrices, let $L$ and $R$ be learnable complex block-diagonal matrices. This gives $2m \cdot m^2 = 2N^{3/2}$ parameters (choosing $m = \sqrt{N}$ blocks), versus $N^2$ for a dense matrix.
Expressiveness. Monarch matrices can represent any circulant matrix (by choosing and appropriately) and therefore any convolution. They can also represent some attention patterns and are more expressive than diagonal matrices or individual FFT layers.
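The structured product can be applied without ever materializing the dense matrix, using reshapes for the interleaving permutation (a sketch with an arbitrary size and random blocks; permutation conventions vary slightly between papers, so the dense matrix below encodes exactly the convention the function implements):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
m = 4
n = m * m                                     # Monarch size n = m^2

L_blocks = rng.standard_normal((m, m, m))     # m blocks of size m x m
R_blocks = rng.standard_normal((m, m, m))

def monarch_apply(x):
    """y = P @ blockdiag(L) @ P @ blockdiag(R) @ x, where P is the
    self-inverse stride-m ("interleaving") permutation, implemented
    with reshapes: O(n^{3/2}) work instead of O(n^2)."""
    y = x.reshape(m, m)
    y = np.einsum('bij,bj->bi', R_blocks, y)  # block-diagonal R
    y = y.T                                   # permutation P (transpose)
    y = np.einsum('bij,bj->bi', L_blocks, y)  # block-diagonal L
    y = y.T                                   # permutation P again
    return y.reshape(-1)

# Dense equivalent of the same product, for verification
perm = np.arange(n).reshape(m, m).T.ravel()
P = np.eye(n)[perm]
M_dense = P @ block_diag(*L_blocks) @ P @ block_diag(*R_blocks)
x = rng.standard_normal(n)
```

The two block-diagonal tensors together hold $2m^3 = 2n^{3/2}$ parameters, matching the count quoted above.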
Appendix K: Reference Summary
K.1 Key Formulas
| Formula | Expression |
|---|---|
| DFT | $X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i kn/N}$, $k = 0, \dots, N-1$ |
| IDFT | $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{2\pi i kn/N}$ |
| DFT matrix | $F_{kn} = e^{-2\pi i kn/N}$, $F^{-1} = \frac{1}{N} F^*$ |
| Parseval | $\sum_n \lvert x_n \rvert^2 = \frac{1}{N} \sum_k \lvert X_k \rvert^2$ |
| Circular convolution | $(x \circledast y)_n = \sum_m x_m \, y_{(n-m) \bmod N} \;\leftrightarrow\; X_k Y_k$ |
| Circular shift | $x_{(n-m) \bmod N} \;\leftrightarrow\; e^{-2\pi i km/N} X_k$ |
| Conjugate symmetry | $x$ real $\Rightarrow X_{N-k} = X_k^*$ |
| FFT complexity | $\frac{N}{2}\log_2 N$ complex multiplications, $N \log_2 N$ complex additions |
| Frequency resolution | $\Delta f = f_s / N$ |
| Nyquist limit | $f_{\max} = f_s / 2$ |
| STFT | $X(m, k) = \sum_n x_n \, w_{n - mH} \, e^{-2\pi i kn/N}$ |
| Mel scale | $m(f) = 2595 \log_{10}(1 + f/700)$ |
K.2 Quick Comparison: FT vs DFT
| Property | Continuous FT | DFT |
|---|---|---|
| Domain | $\mathbb{R}$ (or $L^1/L^2$ functions) | $\mathbb{Z}_N$ (finite sequences) |
| Output | Continuous spectrum on $\mathbb{R}$ | $N$ complex values |
| Inversion | Integral formula | Exact, finite sum |
| Shift | Linear: $x(t - t_0) \leftrightarrow e^{-2\pi i f t_0} X(f)$ | Circular: $x_{(n-m) \bmod N} \leftrightarrow e^{-2\pi i km/N} X_k$ |
| Convolution | Linear convolution | Circular convolution |
| Parseval | $\int \lvert x(t)\rvert^2 \, dt = \int \lvert X(f)\rvert^2 \, df$ | $\sum_n \lvert x_n\rvert^2 = \frac{1}{N}\sum_k \lvert X_k\rvert^2$ |
| Unitarity | FT on $L^2$: unitary with factor 1 | $F/\sqrt{N}$ is unitary |
| Computation | Cannot compute exactly (integral) | FFT: $O(N \log N)$ |
| Leakage | No finite truncation | Leakage from windowing |
Appendix L: Case Studies
L.1 Whisper Large-v3: Preprocessing Bug Hunt
A common real-world bug in Whisper-based systems illustrates why understanding the DFT pipeline matters.
Symptom: transcription quality degrades for audio from a different microphone than training data, even though the content is identical.
Root cause: the new microphone captures at 44.1 kHz (not 16 kHz). Naive resampling using librosa.load(file, sr=16000) with default settings may introduce aliasing artifacts or produce a slightly different frequency axis. Specifically:
- Aliasing from a missing anti-aliasing filter: if the 44.1 kHz signal is downsampled by the factor $44.1/16 \approx 2.76$ without applying a proper lowpass filter at 8 kHz (the Nyquist frequency for 16 kHz), frequencies between 8 kHz and 22.05 kHz alias into the 0-8 kHz band, corrupting the mel spectrogram.
- Frequency axis mismatch: if the mel filterbanks were designed for exactly 16 kHz sampling, resampling to 15,994 Hz (a common off-by-one error in resampling libraries) shifts all frequency bin centers by a fraction of a percent - too small to affect single-frequency detection but enough to corrupt the precise triangular filterbank overlaps.
Fix: explicitly verify sample rate after loading (assert sr == 16000), use scipy.signal.resample_poly with explicit anti-aliasing, and validate the mel filterbank output against a known reference signal.
Lesson: every stage of the FFT pipeline - sampling rate, window length, hop size, FFT size, filterbank design - has exact numerical requirements. The DFT is unforgiving of approximations.
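The recommended fix can be sketched concretely (illustrative tone frequencies; `resample_poly` applies its anti-aliasing lowpass automatically, which is exactly the safeguard the bug above lacked):

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
t = np.arange(sr_in) / sr_in                   # one second of audio
# 1 kHz tone (in band) plus a 10 kHz tone (above the 8 kHz output Nyquist);
# without anti-aliasing, 10 kHz would alias to |16 - 10| = 6 kHz
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 10000 * t)

# 44100/16000 reduces to 441/160; the polyphase resampler lowpass-filters
# at the output Nyquist before decimating
y = resample_poly(x, up=160, down=441)

mag = np.abs(np.fft.rfft(y))                   # 1 Hz bins, since len(y) == 16000
```

Checking the output spectrum, the in-band tone survives at 1 kHz while nothing appears at the 6 kHz alias location, which is precisely the validation-against-a-reference-signal step suggested above.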
L.2 FNO for Climate Modeling: FourCastNet
FourCastNet (Pathak et al., 2022, NVIDIA) uses a variant of the FNO architecture to predict global weather at 6-hour intervals. It replaces the standard FNO spectral layer with an AFNO (Adaptive Fourier Neural Operator) block:
AFNO Layer:
1. Patchify: divide the input grid (720 x 1440) into patches -> tokens
2. DFT along the spatial dimensions (2D FFT)
3. Mix frequencies with learned complex weights (block-diagonal for efficiency)
4. IDFT -> back to the spatial domain
5. MLP for channel mixing
Key difference from FNO: AFNO uses token mixing in frequency space with a block-diagonal structure, reducing mixing parameters from $O(d^2)$ to $O(d \cdot s)$, where $d$ is the channel dimension and $s$ is the block size.
Results:
- Forecasts 10-day global weather in 2 seconds (vs 4 hours for NWP models like ECMWF)
- Competitive with ECMWF operational forecasts on 2-week horizons for key variables (temperature at 500 hPa, wind at 850 hPa)
- Trained on 40 years of ERA5 reanalysis data (3.8 TB)
Why frequency domain? Atmospheric dynamics are dominated by large-scale planetary waves (low spatial frequencies) - the Rossby waves that govern jet streams and storm tracks. Truncating to a modest number of Fourier modes in each dimension captures the overwhelming majority of the variance in ERA5 data. Higher-frequency turbulence below the grid scale is parameterized separately.
L.3 STFT Denoising in Production Audio
Modern audio denoising systems (used in Zoom, Teams, Discord) use STFT-based processing:
Architecture:
- STFT with Hann window (1024 samples, hop 256) -> magnitude spectrogram and phase
- Neural network (small RNN or transformer) estimates a time-frequency noise mask $M(f, t) \in [0, 1]$
- Apply mask: $\hat{S}(f, t) = M(f, t) \cdot X(f, t)$ (keep speech, suppress noise)
- Reconstruct waveform via ISTFT with the original noisy phase (overlap-add)
Why preserve phase: human hearing is relatively insensitive to the phase spectrum of speech, but phase-inconsistent reconstruction (e.g., Griffin-Lim iterations from magnitude alone) introduces audible artifacts. Most commercial denoisers use the noisy signal's phase directly - a pragmatic approximation.
Real-time constraint: at 48 kHz with 1024-sample windows and a 256-sample hop, a new frame arrives every 5.33 ms. A DNN processing one frame must complete well within that budget on the target hardware (often a small DSP or CPU), which sharply limits the network size.
The "STFT as a learned feature extractor" question: in principle, one could learn the analysis window jointly with the denoising network (a learnable STFT). In practice, fixed Hann windows almost always match or outperform learned windows on denoising benchmarks, because the Hann window's known good time-frequency tradeoff is hard to improve upon with a finite training set.
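The mask-and-resynthesize pipeline round-trips cleanly through SciPy's STFT/ISTFT pair. The sketch below substitutes an oracle frequency mask for the neural network (the tone frequency, noise level, and 100 Hz mask width are arbitrary illustrative choices), keeping the noisy phase exactly as production systems do:

```python
import numpy as np
from scipy.signal import stft, istft

sr = 48000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(len(t))

# 1024-sample Hann window, hop 256 (noverlap = 1024 - 256 = 768)
f, frames, Z = stft(noisy, fs=sr, window='hann', nperseg=1024, noverlap=768)

# Stand-in for the neural mask: keep bins within 100 Hz of the tone
mask = (np.abs(f[:, None] - 440.0) < 100.0).astype(float)
Z_hat = mask * Z                      # masked spectrum, noisy phase retained

_, x_hat = istft(Z_hat, fs=sr, window='hann', nperseg=1024, noverlap=768)
```

Because the mask zeroes the noise across almost the entire 24 kHz band while passing the speech-bearing bins, the reconstruction error against the clean signal drops sharply even though the phase is untouched.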
Appendix M: Connections to Other Mathematical Areas
M.1 Harmonic Analysis on Finite Abelian Groups
The DFT on $\mathbb{Z}_N$ (the cyclic group of integers modulo $N$) is a special case of harmonic analysis on locally compact abelian (LCA) groups. For a finite abelian group $G$, the Pontryagin dual $\hat{G}$ is the group of characters (group homomorphisms from $G$ to the unit circle). For $G = \mathbb{Z}_N$: $\hat{G} \cong \mathbb{Z}_N$, with characters $\chi_k(n) = e^{2\pi i kn/N}$ - exactly the DFT basis.
The abstract Fourier transform on $G$ is:

$$\hat{f}(\chi) = \sum_{g \in G} f(g) \, \overline{\chi(g)}$$

For $G = \mathbb{Z}_N$: this recovers the DFT exactly. For $G = \mathbb{R}$ (with the sum replaced by an integral): this recovers the continuous FT. The same theorems - Pontryagin duality, Plancherel's theorem, Poisson summation - hold in all cases.
This unification is developed fully in Section 12 Functional Analysis, connecting the DFT to representation theory of compact groups, wavelet theory, and the Fourier analysis of non-commutative groups (non-abelian harmonic analysis).
M.2 Number-Theoretic Applications
Quadratic reciprocity via DFT. The Gauss sum plays a central role in proofs of quadratic reciprocity. For an odd prime $p$:

$$g = \sum_{n=0}^{p-1} \left(\frac{n}{p}\right) e^{2\pi i n/p} = \sum_{n=0}^{p-1} e^{2\pi i n^2/p}$$

This is a DFT-like object: in the second form, $x_n = 1$ for all $n$ and the kernel is $e^{2\pi i n^2/p}$ instead of $e^{2\pi i kn/p}$.
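Gauss's sign theorem ($g = \sqrt{p}$ for $p \equiv 1 \bmod 4$, $g = i\sqrt{p}$ for $p \equiv 3 \bmod 4$) is easy to check numerically (a small sketch; `legendre` computes the Legendre symbol via Euler's criterion):

```python
import numpy as np

def legendre(n, p):
    """Legendre symbol (n|p) via Euler's criterion: n^((p-1)/2) mod p."""
    r = pow(n % p, (p - 1) // 2, p)
    return 0 if r == 0 else (1 if r == 1 else -1)

p = 7                                   # an odd prime with p % 4 == 3
g = sum(legendre(n, p) * np.exp(2j * np.pi * n / p) for n in range(p))
# g^2 = (-1)^((p-1)/2) * p, so for p = 7 the sum collapses to i * sqrt(7)
```

Running this for several primes makes the dichotomy between $p \equiv 1$ and $p \equiv 3 \pmod 4$ visible directly from the DFT-like sum.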
Integer factorization. Shor's quantum algorithm (1994) for integer factorization relies on the quantum Fourier transform (QFT): a quantum circuit that applies the DFT to a quantum superposition state in $O(\log^2 N)$ gate operations. The classical FFT requires $O(N \log N)$ operations; the QFT requires only $O(\log^2 N)$ (though its output is accessible only through measurement). Shor's algorithm uses the QFT to find the period of $f(x) = a^x \bmod N$, which reveals the prime factors of $N$ in polynomial time.
Appendix N: Spectral Analysis in Practice - Extended Examples
N.1 Frequency Estimation: Sub-Bin Accuracy
A key practical problem: given noisy measurements $x_n = A \cos(2\pi f_0 n / f_s + \phi) + \varepsilon_n$, estimate $f_0$, $A$, and $\phi$ with sub-bin accuracy.
Step 1: Coarse estimate. Compute the DFT magnitude and find $k^* = \arg\max_k |X_k|$. This gives $\hat{f}_0 = k^* f_s / N$ with error up to half a bin, $f_s / 2N$.
Step 2: Parabolic interpolation. Fit a parabola to $|X_{k^*-1}|$, $|X_{k^*}|$, $|X_{k^*+1}|$:

$$\delta = \frac{1}{2} \cdot \frac{|X_{k^*-1}| - |X_{k^*+1}|}{|X_{k^*-1}| - 2|X_{k^*}| + |X_{k^*+1}|}, \qquad \hat{f}_0 = (k^* + \delta)\frac{f_s}{N}$$

This achieves roughly 10x better accuracy than the coarse estimate at reasonable SNR.
Step 3: Quinn's estimator. For even better accuracy, use the complex DFT values directly: form $\alpha_1 = \mathrm{Re}(X_{k^*-1}/X_{k^*})$ and $\alpha_2 = \mathrm{Re}(X_{k^*+1}/X_{k^*})$, set $\delta_1 = \alpha_1/(1 - \alpha_1)$ and $\delta_2 = -\alpha_2/(1 - \alpha_2)$, and take $\delta = \delta_2$ if both are positive, otherwise $\delta = \delta_1$.
Quinn's estimator is unbiased and achieves near-Cramér-Rao-bound accuracy.
Step 4: Amplitude and phase. Once $\hat{f}_0$ is estimated, the amplitude is $\hat{A} \approx 2|X_{k^*}|/N$ (for a one-sided real spectrum), corrected for the window's coherent gain, and the phase is $\hat{\phi} = \arg X_{k^*}$.
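Steps 1-2 can be demonstrated end to end (a sketch with arbitrary test parameters; the tone is placed deliberately 0.3 bins off-grid so the coarse estimate alone would be wrong by 0.3 bins):

```python
import numpy as np

fs, N = 1000.0, 1024
k_true, delta_true = 50, 0.3
f_true = (k_true + delta_true) * fs / N       # deliberately between two bins

n = np.arange(N)
x = np.cos(2 * np.pi * f_true * n / fs)
X = np.abs(np.fft.rfft(x * np.hanning(N)))    # Hann window limits leakage

k = int(np.argmax(X))                         # Step 1: coarse, nearest bin
a, b, c = X[k - 1], X[k], X[k + 1]
delta = 0.5 * (a - c) / (a - 2 * b + c)       # Step 2: parabolic vertex offset
f_hat = (k + delta) * fs / N
```

The refined estimate lands within a small fraction of a bin of the true frequency; interpolating the log-magnitude instead of the magnitude reduces the residual bias further.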
N.2 Detecting Harmonics in Speech
Speech is approximately quasi-periodic: voiced sounds (vowels) have a fundamental frequency (pitch) $f_0$ and harmonics at $2f_0, 3f_0, \dots$ The harmonic structure is visible in the DFT as a series of peaks at integer multiples of $f_0$.
HPS (Harmonic Product Spectrum) algorithm:
- Compute the DFT magnitude $|X_k|$
- Form the product spectrum: $P(k) = \prod_{r=1}^{R} |X_{rk}|$
- Estimate $\hat{k}_0 = \arg\max_k P(k)$, i.e. $\hat{f}_0 = \hat{k}_0 f_s / N$
The HPS effectively downsamples the spectrum by factors $r = 1, 2, \dots, R$ and takes the product - aligning the harmonic peaks at the fundamental while suppressing spurious peaks between them.
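HPS is a one-loop algorithm in NumPy (a sketch on a synthetic voiced-like frame; the 220 Hz fundamental, harmonic rolloff, and $R = 4$ are arbitrary illustrative choices):

```python
import numpy as np

fs, N = 16000, 8000                       # 2 Hz frequency bins
f0 = 220.0                                # true fundamental
n = np.arange(N)
# voiced-like frame: fundamental plus harmonics with decaying amplitudes
x = sum(0.8 ** (h - 1) * np.sin(2 * np.pi * h * f0 * n / fs)
        for h in range(1, 6))
X = np.abs(np.fft.rfft(x * np.hanning(N)))

R = 4                                     # number of downsampled copies
P = X[: len(X) // R].copy()
for r in range(2, R + 1):
    P *= X[::r][: len(P)]                 # multiply r-fold downsampled spectra

f0_hat = int(np.argmax(P)) * fs / N
```

The downsampled copies line up their peaks only at the fundamental bin, so the product spectrum has a single dominant maximum there even though the strongest raw peak could be a harmonic.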
In AI: pitch estimation (fundamental frequency detection) is a key component of speech processing pipelines for voice synthesis (VALL-E, SoundStorm), emotion recognition, and singing voice synthesis. Modern deep-learning pitch detectors (CREPE, PYIN, PENN) outperform HPS but rely on spectral representations derived from the STFT.
N.3 The Spectrogram in Transformer-Based Audio Models
The dominant paradigm for audio language models (AudioLM, VALL-E, MusicLM, Stable Audio) is:
Audio waveform
  -> FFT -> mel spectrogram (continuous representation)
  -> VQ-VAE or EnCodec -> discrete tokens
  -> language model (transformer) -> next-token prediction
  -> decoder -> reconstructed mel spectrogram
  -> vocoder (HiFi-GAN, WaveGrad) -> audio waveform
The FFT and mel spectrogram serve as both the input representation for the encoder and the target representation for the decoder. The transformer operates entirely in the token space derived from the mel spectrogram.
Why not train end-to-end on waveforms? The waveform representation at 16-44 kHz requires sequence lengths of 16,000-44,000 samples per second - millions of tokens for a few minutes of audio, far beyond current transformer context lengths. The mel spectrogram with a 10 ms frame shift compresses by a factor of 160 at 16 kHz (16,000 samples/s -> 100 frames/s) while preserving the perceptually relevant structure. This compression is what makes transformer-based audio generation feasible.
Appendix O: Summary of AI Model Connections
| AI System | DFT Role | Key Parameters |
|---|---|---|
| OpenAI Whisper | FFT -> mel spectrogram as input | $f_s = 16$ kHz, $N_{\text{FFT}} = 400$, hop $= 160$, 25 ms window, 80 mel bins |
| FNet (Google) | Full 2D DFT replaces attention | DFT along sequence AND embedding dims; no learned weights |
| FNO (Caltech) | Truncated spectral convolution | low-frequency modes only; global PDE operator learning |
| FourCastNet (NVIDIA) | AFNO block for weather prediction | 2D FFT on lat/lon grid; 73 variables |
| FlashFFTConv | Butterfly-structured long convolution | Monarch factorization; FFT mapped onto Tensor Cores |
| Monarch (Together AI) | FFT butterfly as trainable transform | $O(N^{3/2})$ parameters; used as attention surrogate |
| VALL-E | Codec (EnCodec) uses STFT loss | Multi-resolution STFT loss for perceptual quality |
| AudioLM | Semantic/acoustic tokenization | EnCodec uses mel-scale frequency band processing |
| CREPE (pitch) | CNN pitch detector on audio frames | $f_s = 16$ kHz, 1024-sample (64 ms) frames |
| HiFi-GAN (vocoder) | Multi-period discriminator | Multi-resolution STFT loss; discriminates real vs fake |
| S4/Mamba | Spectral view of state space models | SSM recurrence equivalent to causal convolution (FFT in training) |
| RoPE (LLaMA) | Frequency-domain position encoding | Complex rotation = modulation in the DFT sense (Section 3.3) |
Appendix P: Proofs and Derivations
P.1 Derivation of Cooley-Tukey for General Radix
For $N = N_1 N_2$, write $n = N_2 n_1 + n_2$ and $k = k_1 + N_1 k_2$ with $n_1, k_1 \in \{0, \dots, N_1 - 1\}$ and $n_2, k_2 \in \{0, \dots, N_2 - 1\}$:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i nk/N} = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2} \, e^{-2\pi i (N_2 n_1 + n_2)(k_1 + N_1 k_2)/N}$$

Expanding the exponent:

$$(N_2 n_1 + n_2)(k_1 + N_1 k_2) = N_2 n_1 k_1 + N n_1 k_2 + n_2 k_1 + N_1 n_2 k_2$$

Since $e^{-2\pi i N n_1 k_2 / N} = 1$, the $N n_1 k_2$ term vanishes. Grouping:

$$X_{k_1 + N_1 k_2} = \sum_{n_2=0}^{N_2-1} \left[ e^{-2\pi i n_2 k_1 / N} \left( \sum_{n_1=0}^{N_1-1} x_{N_2 n_1 + n_2} \, e^{-2\pi i n_1 k_1 / N_1} \right) \right] e^{-2\pi i n_2 k_2 / N_2}$$

The inner sum is an $N_1$-point DFT (for each fixed $n_2$), and the outer sum is an $N_2$-point DFT (for each fixed $k_1$). The twiddle factors $e^{-2\pi i n_2 k_1 / N}$ multiply between the two stages.
Total operations: $N_2$ DFTs of size $N_1$ + $N_1$ DFTs of size $N_2$ + $N$ twiddle multiplications; applied recursively, this gives $O(N \log N)$ by the master recurrence.
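Specializing to $N_1 = 2$ gives the familiar radix-2 recursion, which fits in a few lines (a textbook sketch for power-of-two lengths, checked against `np.fft.fft`):

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 decimation-in-time Cooley-Tukey FFT; len(x) must be a power of 2."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_recursive(x[0::2])    # N/2-point DFT of even-indexed samples
    odd = fft_recursive(x[1::2])     # N/2-point DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    t = twiddle * odd                # twiddle multiplication between stages
    return np.concatenate([even + t, even - t])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
X1 = fft_recursive(x)
```

Each level of recursion does $O(N)$ work and there are $\log_2 N$ levels, which is exactly the operation count derived above.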
P.2 The Uncertainty Principle for the DFT (Donoho-Stark Theorem)
Theorem (Donoho & Stark, 1989). For any nonzero $x \in \mathbb{C}^N$:

$$|\mathrm{supp}(x)| \cdot |\mathrm{supp}(\hat{x})| \ge N$$

where $\mathrm{supp}(x)$ is the set of indices at which $x$ is nonzero and $\hat{x}$ is its DFT.
Proof sketch. Suppose $x$ is supported on a set $T$ with $|T| = N_t$ and $\hat{x}$ on a set $S$ with $|S| = N_f$. Restricting the DFT to rows $S$ and columns $T$ gives an $N_f \times N_t$ submatrix of $F$ that determines $\hat{x}$ on $S$ from $x$ on $T$. A counting argument using the rank of such submatrices of $F$ shows that a nonzero $x$ can survive both support restrictions only if $N_t N_f \ge N$.
Corollary. A signal cannot have fewer than $\sqrt{N}$ nonzero samples AND fewer than $\sqrt{N}$ nonzero DFT coefficients simultaneously. This is the discrete analog of the Heisenberg uncertainty principle.
For AI: compressed sensing exploits this theorem: a signal supported on $s$ DFT bins can be recovered from $O(s \log N)$ time-domain measurements - far fewer than $N$. It also motivates FNO's truncation to low-frequency modes: smooth solutions to PDEs have concentrated, effectively small Fourier support.
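The bound is tight: a "picket fence" of impulses spaced $\sqrt{N}$ apart transforms into another picket fence, achieving equality (a quick numerical check):

```python
import numpy as np

N, p = 64, 8                      # N = p * p: impulse train with period p
x = np.zeros(N)
x[::p] = 1.0                      # exactly 8 nonzero samples
X = np.fft.fft(x)

Nt = int(np.count_nonzero(np.abs(x) > 1e-9))   # time-domain support size
Nf = int(np.count_nonzero(np.abs(X) > 1e-9))   # frequency-domain support size
```

Here $N_t = N_f = 8$ and $N_t N_f = 64 = N$: the DFT of a period-$p$ impulse train is another impulse train, with spikes every $N/p$ bins.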
P.3 Aliasing Formula: Proof via Poisson Summation
When $x(t)$ is sampled at rate $f_s = 1/T$ (below the Nyquist rate for bandlimited $x$), the spectrum of the samples is:

$$X_{\text{sampled}}(f) = \frac{1}{T} \sum_{m=-\infty}^{\infty} X(f - m f_s)$$

This is the aliased spectrum: the true spectrum summed over all frequency shifts by integer multiples of $f_s$. When $X(f)$ is nonzero outside $[-f_s/2, f_s/2]$, adjacent copies overlap and contaminate each other - this is aliasing.
Proof. Starting from the Poisson summation formula (Section 02-7.5):

$$\sum_{n=-\infty}^{\infty} g(nT) = \frac{1}{T} \sum_{m=-\infty}^{\infty} \hat{g}\left(\frac{m}{T}\right)$$

Setting $g(t) = x(t) e^{-2\pi i f t}$, so that $\hat{g}(\nu) = X(\nu + f)$, and reindexing the sum over $m$:

$$\sum_{n} x(nT) \, e^{-2\pi i f nT} = \frac{1}{T} \sum_{m} X\left(f - \frac{m}{T}\right)$$

The left side is the DTFT of the sampled signal; the right side is the periodized spectrum. The DFT is obtained by sampling this DTFT at $f = k f_s / N$, giving the result.
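Aliasing is easy to exhibit numerically: a 7 Hz tone sampled at 10 Hz (Nyquist limit 5 Hz) is indistinguishable from a 3 Hz tone, exactly the $|f - f_s|$ copy predicted by the formula (illustrative frequencies):

```python
import numpy as np

fs, N = 10.0, 1000                  # sample at 10 Hz for 100 seconds
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 7.0 * t)     # 7 Hz tone, above the 5 Hz Nyquist limit

X = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N, d=1 / fs)
f_peak = freqs[np.argmax(X)]        # the energy shows up at |7 - 10| = 3 Hz
```

No amount of post-processing can undo this: once the copies of the spectrum overlap, the 7 Hz and 3 Hz interpretations of the samples are identical.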
Appendix Q: Further Reading
Primary References
- Cooley, J.W. and Tukey, J.W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation, 19(90), 297-301. - The original FFT paper; just a few pages, historically indispensable.
- Oppenheim, A.V. and Schafer, R.W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice Hall. - The standard DSP reference; chapters 8-10 cover the DFT, FFT, and windowing comprehensively.
- Strang, G. (1999). "The Discrete Cosine Transform." SIAM Review, 41(1), 135-147. - Beautiful exposition connecting the FFT, DCT, and linear algebra.
- Frigo, M. and Johnson, S.G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE, 93(2), 216-231. - FFTW's adaptive algorithm-selection strategy.
- Li, Z. et al. (2021). "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR 2021. - FNO architecture and theory.
- Dao, T. et al. (2022). "Monarch: Expressive Structured Matrices for Efficient and Accurate Training." ICML 2022. - Butterfly/Monarch matrix parameterization.
- Radford, A. et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. - Whisper architecture and mel spectrogram preprocessing.
- Donoho, D.L. and Stark, P.B. (1989). "Uncertainty Principles and Signal Recovery." SIAM Journal on Applied Mathematics, 49(3), 906-931. - The discrete uncertainty principle.
Online Resources
- NumPy FFT documentation: `numpy.fft` module reference with complete API
- SciPy signal processing: `scipy.signal.spectrogram`, `scipy.signal.welch`, `scipy.fft`
- Whisper preprocessing code: `openai/whisper` GitHub repository, `audio.py`
- FNO implementation: `neuraloperator/neuraloperator` GitHub repository
Appendix R: Quick-Reference Formula Sheet
| Symbol | Meaning | Value / Formula |
|---|---|---|
| $\omega_N$ | Principal $N$-th root of unity | $e^{-2\pi i/N}$ |
| $X_k$ | $k$-th DFT coefficient | $\sum_{n=0}^{N-1} x_n \, \omega_N^{kn}$ |
| $x_n$ | IDFT reconstruction | $\frac{1}{N} \sum_{k=0}^{N-1} X_k \, \omega_N^{-kn}$ |
| $F$ | DFT matrix | $F_{kn} = \omega_N^{kn}$, so $F^{-1} = \frac{1}{N} F^*$ |
| $\Delta f$ | Frequency bin width | $f_s / N$ |
| $f_{\text{Nyq}}$ | Nyquist frequency | $f_s / 2$ |
| $N_{\min}$ | Minimum $N$ to resolve $\Delta f$ | $N \ge f_s / \Delta f$ |
| $X(m, k)$ | STFT coefficient | $\sum_n x_n \, w_{n - mH} \, \omega_N^{kn}$ |
| COLA | Overlap-add constant | $\sum_m w(n - mH) = \text{const}$ for all $n$ |
| $X_{N-k} = X_k^*$ | Conjugate symmetry (real $x$) | holds for $k = 1, \dots, N - 1$ |
Window cheat-sheet
| Window | Main lobe (bins) | First sidelobe (dB) | Use case |
|---|---|---|---|
| Rectangular | 2 | -13 | Broadband noise, transients |
| Hann | 4 | -31 | General purpose |
| Hamming | 4 | -43 | Speech processing |
| Blackman | 6 | -58 | Weak sinusoids near strong ones |
| Kaiser | variable | adjustable via $\beta$ | Precision filter design |
FFT complexity
Compared to the naive DFT: the FFT reduces the cost from $O(N^2)$ to $O(N \log N)$ operations - for $N = 2^{20} \approx 10^6$, a speedup of roughly $N / \log_2 N \approx 50{,}000\times$.