Part 3
Discrete Fourier Transform and FFT: Appendix I: DFT and Signal Reconstruction to Appendix R: Quick-Reference Formula Sheet

Appendix I: DFT and Signal Reconstruction

I.1 Bandlimited Interpolation

Given DFT coefficients $X[0], \ldots, X[K]$ (retaining only the lowest $K+1$ frequencies), the bandlimited interpolation of the corresponding signal at any intermediate time $t$ (not necessarily a sample point) is:

$$\hat{x}(t) = \frac{1}{N}\sum_{k=0}^{K} X[k]\, e^{2\pi i k t/N} + \text{conjugate terms}$$

This is the continuous analog of what FNO does (Section 8.2): it parameterizes a function by its low-frequency DFT coefficients, and evaluates it at any desired spatial resolution.
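As a concrete illustration, here is a minimal NumPy sketch of this interpolation for a real signal (the function name and the explicit conjugate-pair loop are our own, assuming $K < N/2$ so the Nyquist mode needs no special handling):

```python
import numpy as np

def bandlimited_interp(x, K, t):
    """Evaluate the K-mode bandlimited interpolant of a real signal x at times t.

    Keeps DFT modes 0..K together with their conjugate partners X[N-k] = X[k]^*,
    so the result is real.  Assumes K < N/2.
    """
    N = len(x)
    X = np.fft.fft(x)
    t = np.atleast_1d(t).astype(float)
    out = np.full(t.shape, X[0].real / N)          # DC term
    for k in range(1, K + 1):
        # X[k] e^{2 pi i k t/N} plus its conjugate partner, folded into a real part
        out += (2.0 / N) * np.real(X[k] * np.exp(2j * np.pi * k * t / N))
    return out

# A pure mode-3 cosine is reproduced exactly with K = 3 at the sample points ...
N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)
xt = bandlimited_interp(x, 3, n)
# ... and can now be evaluated between samples, e.g. at t = 0.5.
mid = bandlimited_interp(x, 3, 0.5)
```
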

I.2 Compressed Sensing and Sparse Recovery

Problem (Compressed Sensing). A signal $\mathbf{x} \in \mathbb{C}^N$ is $s$-sparse in the DFT domain: only $s \ll N$ Fourier coefficients are nonzero. Can we recover $\mathbf{x}$ from $m \ll N$ measurements?

Answer (Candès, Romberg, Tao, 2006; Donoho, 2006). If $m \geq C \cdot s \log(N/s)$ random measurements are taken, $\mathbf{x}$ can be exactly recovered (with high probability) by solving the $\ell^1$ minimization:

$$\min_{\hat{\mathbf{x}}} \lVert \hat{\mathbf{x}} \rVert_1 \quad \text{subject to} \quad A\hat{\mathbf{x}} = \mathbf{b}$$

where $A$ is the random measurement matrix. This is dramatically below the $m \geq N$ requirement of classical sampling.

For AI: compressed sensing underlies MRI reconstruction (6-8$\times$ speedup by acquiring ~1/6 of $k$-space data), compressed attention (attending to sparse subsets of frequency components), and efficient LLM inference when activations are sparse in the Fourier domain. The theoretical guarantees connect directly to the RIP (Restricted Isometry Property) of randomly subsampled DFT matrices.

I.3 Phase Retrieval

Problem. Given only $|X[k]|^2$ (the power spectrum, without phase), recover $\mathbf{x}$.

Phase retrieval is fundamentally ill-posed in 1D (a signal and its time-reversed, time-shifted versions share the same power spectrum). In 2D (images), phase retrieval becomes generically uniquely solvable thanks to oversampling constraints.

Algorithms: Gerchberg-Saxton (1972), HIO (Fienup, 1982), and modern deep learning approaches (phase retrieval nets).

For AI: X-ray crystallography uses phase retrieval to determine protein structure. Modern protein structure prediction (AlphaFold 2, RoseTTAFold) sidesteps phase retrieval by directly predicting 3D coordinates from sequence and co-evolutionary information.


Appendix J: Worked ML Examples

J.1 Building a Mel Spectrogram from Scratch

This section works through the complete mathematics behind the mel spectrogram used in Whisper, step by step.

Step 1: Mel scale. The mel scale converts frequency $f$ (Hz) to perceived pitch $m$ (mels):

$$m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \quad f(m) = 700\left(10^{m/2595} - 1\right)$$

The mel scale is approximately linear below 1 kHz (300 Hz ≈ 402 mel; 700 Hz ≈ 781 mel) and logarithmic above 1 kHz, matching the frequency resolution of the human auditory system (the cochlea's place-frequency mapping).
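The two conversion formulas are exact inverses and are easy to check numerically (the function names `hz_to_mel` / `mel_to_hz` are our own):

```python
import numpy as np

def hz_to_mel(f):
    # m(f) = 2595 log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # f(m) = 700 (10^(m/2595) - 1), the exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

m700 = hz_to_mel(700.0)                 # about 781 mel
roundtrip = mel_to_hz(hz_to_mel(440.0)) # back to 440 Hz
```
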

Step 2: Filterbank design. For $M = 80$ mel filters covering $[f_{\min}, f_{\max}] = [0, 8000]\,\text{Hz}$:

  1. Convert the frequency range to mels: $[m_{\min}, m_{\max}] \approx [0, 2840]\,\text{mels}$
  2. Place $M+2 = 82$ equally spaced mel points: $m_0 < m_1 < \cdots < m_{81}$
  3. Convert back to Hz: $f_j = f(m_j)$ for $j = 0, \ldots, 81$
  4. For filter $m$ ($m = 1, \ldots, 80$), define the triangular filter:

$$H_m[k] = \begin{cases} \dfrac{f_k - f_{m-1}}{f_m - f_{m-1}} & f_{m-1} \leq f_k \leq f_m \\[1ex] \dfrac{f_{m+1} - f_k}{f_{m+1} - f_m} & f_m \leq f_k \leq f_{m+1} \\[1ex] 0 & \text{otherwise} \end{cases}$$

where $f_k = k f_s / N$ are the DFT bin frequencies.

Step 3: Power accumulation. For each STFT frame $l$:

$$\text{mel\_spec}[m, l] = \sum_{k=0}^{N/2} H_m[k] \cdot |X[k, l]|^2$$

Step 4: Log compression.

$$\text{log\_mel}[m, l] = \log_{10}\!\left(\max\!\left(\text{mel\_spec}[m, l],\, 10^{-10}\right)\right)$$

The floor $10^{-10}$ prevents $\log(0)$. Whisper's normalization then clamps values to within 8 of the spectrogram's maximum and rescales by $(x + 4)/4$, an empirically chosen transform that centers the values near $[-1, 1]$ for typical speech.
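Steps 1-4 combine into a self-contained NumPy sketch (our own minimal implementation with $N=512$, hop 160, and 80 bins as assumptions; not Whisper's actual code, which also applies the normalization just described):

```python
import numpy as np

def mel_filterbank(n_mels=80, n_fft=512, fs=16000, fmin=0.0, fmax=8000.0):
    """Triangular mel filterbank of shape (n_mels, n_fft//2 + 1), per Steps 1-2."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = hz(np.linspace(mel(fmin), mel(fmax), n_mels + 2))   # 82 edge freqs in Hz
    bins = np.arange(n_fft // 2 + 1) * fs / n_fft               # DFT bin frequencies
    fb = np.zeros((n_mels, len(bins)))
    for m in range(1, n_mels + 1):
        lo, center, hi = edges[m - 1], edges[m], edges[m + 1]
        rising = (bins - lo) / (center - lo)
        falling = (hi - bins) / (hi - center)
        fb[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(x, n_fft=512, hop=160, n_mels=80, fs=16000):
    """Steps 3-4: windowed STFT power -> mel weighting -> log compression."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[l * hop:l * hop + n_fft] * w for l in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # (frames, n_fft//2+1)
    mel_spec = power @ mel_filterbank(n_mels, n_fft, fs).T      # (frames, n_mels)
    return np.log10(np.maximum(mel_spec, 1e-10)).T              # (n_mels, frames)

fs = 16000
t = np.arange(fs) / fs                                          # one second of audio
lm = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

A 440 Hz tone (≈550 mel) should light up a band in the lower fifth of the 80 mel channels.
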

Why these choices?

  • 80 mel bins: empirically optimal for large-scale ASR (more bins -> marginal improvement; fewer -> information loss)
  • 0-8000 Hz: human speech information is concentrated here; extending to 8+ kHz adds noise without significant ASR benefit
  • 25 ms window / 10 ms hop: matches the typical duration of speech phonemes (20-200 ms)
  • Log compression: matches perceptual loudness; prevents high-amplitude phonemes from dominating the feature representation

J.2 FNO Architecture: Complete Mathematical Description

The Fourier Neural Operator (Li et al., 2021) architecture for 1D problems:

Input: discretized function $\mathbf{a} \in \mathbb{R}^{N \times d_a}$ (input function on $N$ points with $d_a$ channels)
Output: discretized function $\mathbf{u} \in \mathbb{R}^{N \times d_u}$ (solution on $N$ points with $d_u$ channels)

Architecture (4 FNO layers + output projection):

Layer 0: Input lifting
  a ∈ R^{N × d_a}  →  v ∈ R^{N × d_v}   (learnable MLP)

Layers 1-4: FNO blocks
  v ← FNO-Block(v)

Layer 5: Output projection
  v ∈ R^{N × d_v}  →  u ∈ R^{N × d_u}   (learnable MLP)

FNO Block:

FNO-Block(v):
    1. Spectral branch:
       V = DFT(v)                   # shape: (N, d_v) complex
       V_trunc = V[:K, :]           # keep lowest K modes
       V_out = R @ V_trunc          # R ∈ C^{K × d_v × d_v}: learned weights
       V_pad = [V_out; 0, ..., 0]   # zero-pad back to N modes
       v_spectral = IDFT(V_pad)     # shape: (N, d_v)

    2. Local branch:
       v_local = W @ v              # W ∈ R^{d_v × d_v}: learned weights

    3. Combine:
       v_new = σ(v_spectral + v_local)   # σ = GeLU

    return v_new
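A minimal NumPy sketch of one such block (our own illustration; real FNO implementations batch this and learn R and W by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(s):
    # tanh approximation of GeLU
    return 0.5 * s * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (s + 0.044715 * s ** 3)))

def fno_block(v, R, W, K):
    """One FNO block on a 1D grid. v: (N, d_v) real, R: (K, d_v, d_v) complex, W: (d_v, d_v)."""
    N = v.shape[0]
    V = np.fft.rfft(v, axis=0)                       # spectral branch: (N//2+1, d_v)
    V_out = np.zeros_like(V)
    V_out[:K] = np.einsum('kij,kj->ki', R, V[:K])    # mix channels of the lowest K modes
    v_spectral = np.fft.irfft(V_out, n=N, axis=0)    # modes above K stay zeroed out
    v_local = v @ W.T                                # local (pointwise) branch
    return gelu(v_spectral + v_local)

N, d_v, K = 64, 8, 16
v = rng.standard_normal((N, d_v))
R = (rng.standard_normal((K, d_v, d_v)) + 1j * rng.standard_normal((K, d_v, d_v))) / d_v
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)
out = fno_block(v, R, W, K)
# Resolution invariance: the same R, W apply unchanged on a 4x finer grid.
out_fine = fno_block(rng.standard_normal((4 * N, d_v)), R, W, K)
```
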

Key design choices:

  • $K = 16$ retained modes: sufficient for smooth PDE solutions; higher $K$ is used for more complex solutions (turbulence: $K = 32$)
  • $d_v = 64$ channels: comparable to modern transformer hidden dimensions
  • Weight sharing: the same $R$ is used across all spatial positions (translation equivariance in the frequency domain)
  • Resolution invariance: the FNO can be trained on $N = 64$ and evaluated on $N = 256$ by changing the DFT size; the spectral weights $R$ and local weights $W$ are independent of $N$

Parameter count comparison (1D, $N = 1024$, $K = 16$, $d_v = 64$):

| Layer | Dense global convolution | FNO spectral layer |
|---|---|---|
| Parameters | $N \times d_v^2 = 4.2 \times 10^6$ | $K \times d_v^2 \times 2 = 131{,}072$ (complex) |
| FLOPs | $O(N^2 d_v^2) = 4 \times 10^9$ | $O(N d_v \log N + K d_v^2) = 6.7 \times 10^6$ |

The FNO spectral layer has $32\times$ fewer parameters and $600\times$ fewer FLOPs per layer than a fully global convolution.

J.3 Monarch Matrix Decomposition

A Monarch matrix of size $N = N_1 N_2$ is defined as:

$$M = P L P^\top R$$

where:

  • $P \in \{0,1\}^{N \times N}$ is a fixed permutation (the "interleaving permutation")
  • $L = \operatorname{diag}(L_0, L_1, \ldots, L_{N_1-1})$ is block-diagonal with $N_1$ blocks of size $N_2 \times N_2$
  • $R = \operatorname{diag}(R_0, R_1, \ldots, R_{N_2-1})$ is block-diagonal with $N_2$ blocks of size $N_1 \times N_1$

Connection to FFT. The FFT butterfly matrix factors as:

$$F_N = (I_{N_1} \otimes F_{N_2}) \cdot D \cdot (F_{N_1} \otimes I_{N_2}) \cdot P_{\text{bit-reverse}}$$

where $D$ is a diagonal twiddle-factor matrix and $\otimes$ is the Kronecker product. Each butterfly stage is a block-diagonal matrix - exactly the Monarch structure. Thus the FFT is a special Monarch matrix with specific parameter values.

Monarch parameterization: instead of fixing the block entries to be DFT matrices, let $L_i$ and $R_j$ be learnable complex matrices. This gives $O(N(N_1 + N_2)) = O(N\sqrt{N})$ parameters (choosing $N_1 = N_2 = \sqrt{N}$).

Expressiveness. Monarch matrices can represent any circulant matrix (by choosing $L_i$ and $R_j$ appropriately) and therefore any convolution. They can also represent some attention patterns and are more expressive than diagonal matrices or individual FFT layers.
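A dense NumPy sketch of the factorization $M = P L P^\top R$ (our own construction with small $N_1 = N_2 = 4$; production code multiplies via batched reshapes rather than materializing these matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
N1 = N2 = 4
N = N1 * N2

def blockdiag(blocks):
    """Assemble square blocks into a block-diagonal matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        out[i:i + b.shape[0], i:i + b.shape[0]] = b
        i += b.shape[0]
    return out

# Interleaving permutation: index n1*N2 + n2  ->  n2*N1 + n1
P = np.zeros((N, N))
for n1 in range(N1):
    for n2 in range(N2):
        P[n2 * N1 + n1, n1 * N2 + n2] = 1.0

L = blockdiag([rng.standard_normal((N2, N2)) for _ in range(N1)])  # N1 blocks of N2 x N2
R = blockdiag([rng.standard_normal((N1, N1)) for _ in range(N2)])  # N2 blocks of N1 x N1
M = P @ L @ P.T @ R

# Learnable entries: N1*N2^2 + N2*N1^2 = N*(N1 + N2), versus N^2 for a dense matrix
params = N1 * N2 ** 2 + N2 * N1 ** 2
```
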


Appendix K: Reference Summary

K.1 Key Formulas

| Formula | Expression |
|---|---|
| DFT | $X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$, $\quad\omega_N = e^{2\pi i/N}$ |
| IDFT | $x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk}$ |
| DFT matrix | $(F_N)_{kn} = \omega_N^{-kn}$, $\quad F_N F_N^* = N I$ |
| Parseval | $\sum_n \lvert x[n]\rvert^2 = \frac{1}{N}\sum_k \lvert X[k]\rvert^2$ |
| Circular convolution | $(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\,y[(n-m)\bmod N]$ |
| Circular shift | $x[(n-m)\bmod N] \leftrightarrow \omega_N^{-mk} X[k]$ |
| Conjugate symmetry | $x[n]\in\mathbb{R} \Rightarrow X[N-k] = X[k]^*$ |
| FFT complexity | $\frac{N}{2}\log_2 N$ complex multiplications, $N\log_2 N$ additions |
| Frequency resolution | $\Delta f = f_s/N = 1/T$ |
| Nyquist limit | $f_{\max} = f_s/2$ |
| STFT | $S[l,k] = \sum_{m=0}^{M-1} x[lH+m]\,w[m]\,\omega_N^{-mk}$ |
| Mel scale | $m = 2595\log_{10}(1 + f/700)$ |
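All of these identities can be checked numerically against `numpy.fft` in a few lines (a verification sketch, not part of the lesson's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
x = rng.standard_normal(N)
y = rng.standard_normal(N)
X, Y = np.fft.fft(x), np.fft.fft(y)
k = np.arange(N)

# Parseval: sum |x|^2 = (1/N) sum |X|^2
assert np.isclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2) / N)

# Circular convolution <-> pointwise product of DFTs
circ = np.array([np.sum(x * np.roll(y[::-1], n + 1)) for n in range(N)])
assert np.allclose(circ, np.fft.ifft(X * Y).real)

# Circular shift by m <-> omega_N^{-mk} phase factor
m = 5
assert np.allclose(np.fft.fft(np.roll(x, m)), np.exp(-2j * np.pi * m * k / N) * X)

# Conjugate symmetry for real x: X[N-k] = X[k]^*
assert np.allclose(X[(N - k) % N], np.conj(X))
```
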

K.2 Quick Comparison: FT vs DFT

| Property | Continuous FT | DFT |
|---|---|---|
| Domain | $L^1(\mathbb{R})$ or $L^2(\mathbb{R})$ | $\mathbb{C}^N$ (finite sequences) |
| Output | Continuous spectrum on $\mathbb{R}$ | $N$ complex values |
| Inversion | Integral formula | Exact, finite sum |
| Shift | Linear: $e^{-2\pi i\xi\tau}$ phase factor | Circular: $\omega_N^{-mk}$ phase factor |
| Convolution | Linear convolution | Circular convolution |
| Parseval | $\int\lvert f\rvert^2 = \int\lvert\hat{f}\rvert^2$ | $\sum\lvert x\rvert^2 = \frac{1}{N}\sum\lvert X\rvert^2$ |
| Unitarity | FT on $L^2$ is unitary (Plancherel) | $\frac{1}{\sqrt{N}}F_N$ is unitary |
| Computation | No exact algorithm in general | FFT: $O(N\log N)$ |
| Leakage | None without truncation | Leakage from finite windowing |

Appendix L: Case Studies

L.1 Whisper Large-v3: Preprocessing Bug Hunt

A common real-world bug in Whisper-based systems illustrates why understanding the DFT pipeline matters.

Symptom: transcription quality degrades for audio from a different microphone than training data, even though the content is identical.

Root cause: the new microphone captures at 44.1 kHz (not 16 kHz). Naive resampling using librosa.load(file, sr=16000) with default settings may introduce aliasing artifacts or produce a slightly different frequency axis. Specifically:

  1. Aliasing from incorrect anti-aliasing filter: if the 44.1 kHz signal is downsampled by a factor of $\approx 2.76$ without applying a proper lowpass filter at 8 kHz (the Nyquist frequency for 16 kHz), frequencies between 8 kHz and 22.05 kHz alias into the 0-8 kHz band, corrupting the mel spectrogram.

  2. Frequency axis mismatch: if the mel filterbanks were designed for exactly 16 kHz sampling, resampling to 15,994 Hz (a common off-by-one error in resampling libraries) shifts all frequency bin centers by $\approx 0.04\%$ - too small to affect single-frequency detection but enough to corrupt the precise triangular filterbank overlaps.

Fix: explicitly verify sample rate after loading (assert sr == 16000), use scipy.signal.resample_poly with explicit anti-aliasing, and validate the mel filterbank output against a known reference signal.
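A sketch of that fix (the helper name is our own; `resample_poly` applies its own anti-aliasing lowpass): resampling a tone that lies above the target Nyquist should suppress it rather than alias it down to 4 kHz.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(x, sr, target_sr=16000):
    """Polyphase resampling; resample_poly includes an anti-aliasing lowpass."""
    if sr == target_sr:
        return np.asarray(x, dtype=np.float32)
    g = np.gcd(sr, target_sr)                 # 44100 -> 16000 uses up=160, down=441
    return resample_poly(x, target_sr // g, sr // g).astype(np.float32)

# A 12 kHz tone at 44.1 kHz lies above the 8 kHz Nyquist of the target rate.
sr = 44100
t = np.arange(sr) / sr
bad_tone = np.sin(2 * np.pi * 12000 * t)
y = resample_to_16k(bad_tone, sr)
assert len(y) == 16000                        # the "verify after loading" habit
residual = np.abs(np.fft.rfft(y)).max()      # energy left anywhere in 0-8 kHz
```

With proper anti-aliasing, `residual` is tiny compared to the spectral peak of an in-band tone passed through the same path.
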

Lesson: every stage of the FFT pipeline - sampling rate, window length, hop size, FFT size, filterbank design - has exact numerical requirements. The DFT is unforgiving of approximations.

L.2 FNO for Climate Modeling: FourCastNet

FourCastNet (Pathak et al., 2022, NVIDIA) uses a variant of the FNO architecture to predict global weather at 6-hour intervals. It replaces the standard FNO spectral layer with an AFNO (Adaptive Fourier Neural Operator) block:

AFNO Layer:
  1. Patchify: divide input grid (720 × 1440) into patches → tokens
  2. DFT along spatial dimensions (2D FFT)
  3. Mix frequencies with learned complex weights (block-diagonal for efficiency)
  4. IDFT → back to the spatial domain
  5. MLP for channel mixing

Key difference from FNO: AFNO mixes tokens in frequency space with a block-diagonal weight structure, reducing parameters from $O(K^2 d^2)$ to $O(K d^2 / B)$, where $B$ is the block size.

Results:

  • Forecasts 10-day global weather in 2 seconds (vs 4 hours for NWP models like ECMWF)
  • Competitive with ECMWF operational forecasts on 2-week horizons for key variables (temperature at 500 hPa, wind at 850 hPa)
  • Trained on 40 years of ERA5 reanalysis data (3.8 TB)

Why frequency domain? Atmospheric dynamics are dominated by large-scale planetary waves (low spatial frequencies) - the Rossby waves that govern jet streams and storm tracks. Truncating to $K = 32$ modes in each dimension captures $> 95\%$ of the variance in ERA5 data. Higher-frequency turbulence below the grid scale is parameterized separately.

L.3 STFT Denoising in Production Audio

Modern audio denoising systems (used in Zoom, Teams, Discord) use STFT-based processing:

Architecture:

  1. STFT with Hann window (1024 samples, hop 256) → magnitude $|X[l,k]|$ and phase $\angle X[l,k]$
  2. A neural network (small RNN or transformer) estimates the noise mask $M[l,k] \in [0,1]$
  3. Apply the mask: $\hat{X}[l,k] = M[l,k] \cdot X[l,k]$ (keep speech, suppress noise)
  4. Reconstruct the waveform via ISTFT with the original phase (overlap-add)

Why preserve phase: human perception is relatively insensitive to phase errors in speech, but phase-inconsistent reconstruction (e.g. Griffin-Lim iterations) introduces audible artifacts. Most commercial denoisers use the noisy signal's phase directly - a pragmatic approximation.

Real-time constraint: at 48 kHz with 1024-sample windows and a 256-sample hop, one frame arrives every 5.33 ms. A DNN processing one frame must complete in $< 5\,\text{ms}$ on the target hardware (often a small DSP or CPU). This limits the network size to $\sim 10^5$ parameters.

Could the STFT be replaced by a learned feature extractor? In principle, one could learn the analysis window jointly with the denoising network (a learnable STFT). In practice, fixed Hann windows almost always match or outperform learned windows on denoising benchmarks: the Hann window's well-understood time-frequency resolution is hard to improve upon with a finite training set.


Appendix M: Connections to Other Mathematical Areas

M.1 Harmonic Analysis on Finite Abelian Groups

The DFT on $\mathbb{Z}_N$ (the cyclic group of integers modulo $N$) is a special case of harmonic analysis on locally compact abelian (LCA) groups. For a finite abelian group $G$, the Pontryagin dual $\hat{G}$ is the group of characters (group homomorphisms $\chi: G \to S^1$). For $G = \mathbb{Z}_N$: $\hat{G} \cong \mathbb{Z}_N$ with characters $\chi_k(n) = \omega_N^{kn}$ - exactly the DFT basis.

The abstract Fourier transform on GG is:

$$\hat{f}(\chi) = \sum_{g \in G} f(g)\,\overline{\chi(g)}$$

For $G = \mathbb{Z}_N$ this recovers the DFT exactly; for $G = \mathbb{R}$ (with the sum replaced by an integral) it recovers the continuous FT. The same theorems - Pontryagin duality, Plancherel's theorem, Poisson summation - hold in all cases.

This unification is developed fully in Section 12 Functional Analysis, connecting the DFT to representation theory of compact groups, wavelet theory, and the Fourier analysis of non-commutative groups (non-abelian harmonic analysis).

M.2 Number-Theoretic Applications

Quadratic reciprocity via DFT. The Gauss sum $G(a, N) = \sum_{n=0}^{N-1} e^{2\pi i a n^2/N}$ plays a central role in the proof of quadratic reciprocity. For prime $p$:

$$G(1, p) = \begin{cases} \sqrt{p} & p \equiv 1 \pmod 4 \\ i\sqrt{p} & p \equiv 3 \pmod 4 \end{cases}$$

This is a DFT-like object: $G(a, N) = X[a]$ where $x[n] = 1$ for all $n$ and the kernel is $\omega_N^{an^2}$ instead of $\omega_N^{an}$.

Integer factorization. Shor's quantum algorithm (1994) for integer factorization relies on the quantum Fourier transform (QFT): a quantum circuit that applies the DFT to a quantum superposition state in $O((\log N)^2)$ gate operations, versus $O(N \log N)$ operations for the classical FFT. Shor's algorithm uses the QFT to find the period of $k \mapsto a^k \bmod N$, which reveals the prime factors of $N$ in polynomial time.


Appendix N: Spectral Analysis in Practice - Extended Examples

N.1 Frequency Estimation: Sub-Bin Accuracy

A key practical problem: given noisy measurements $y[n] = A\cos(2\pi f_0 n/f_s + \phi) + \varepsilon[n]$, estimate $f_0$, $A$, and $\phi$ with sub-bin accuracy.

Step 1: Coarse estimate. Compute the DFT magnitude $|Y[k]|$ and find $k_{\max} = \arg\max_k |Y[k]|$. This gives $\hat{f}_0^{(1)} = k_{\max} f_s / N$ with error up to $\pm \Delta f/2 = \pm f_s/(2N)$.

Step 2: Parabolic interpolation. Fit a parabola to $|Y[k_{\max}-1]|$, $|Y[k_{\max}]|$, $|Y[k_{\max}+1]|$:

$$\hat{k} = k_{\max} + \frac{|Y[k_{\max}+1]| - |Y[k_{\max}-1]|}{2\left(2|Y[k_{\max}]| - |Y[k_{\max}-1]| - |Y[k_{\max}+1]|\right)}$$

This achieves approximately $10\times$ better accuracy than the coarse estimate for SNR $> 20$ dB.

Step 3: Quinn's Estimator. For even better accuracy, use the complex DFT values directly:

$$\hat{k} = k_{\max} + \operatorname{Re}\!\left[\frac{Y[k_{\max}+1]/Y[k_{\max}]}{1 - Y[k_{\max}+1]/Y[k_{\max}]}\right]$$

Quinn's estimator is unbiased and achieves accuracy near the Cramér-Rao bound.

Step 4: Amplitude and phase. Once $\hat{k}$ is estimated, the amplitude is $\hat{A} = 2|Y[\hat{k}]|/N$ (for a one-sided real spectrum), corrected for the window's coherent gain, and the phase is $\hat{\phi} = \angle Y[\hat{k}]$.
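Steps 1-2 in NumPy (our own sketch; a Hann window is applied first, and the accuracy figures above assume reasonable SNR):

```python
import numpy as np

def estimate_freq(y, fs):
    """Coarse argmax estimate (Step 1) refined by parabolic interpolation (Step 2)."""
    N = len(y)
    Y = np.abs(np.fft.rfft(y * np.hanning(N)))
    k = int(np.argmax(Y[1:-1])) + 1               # skip the DC and Nyquist edges
    a, b, c = Y[k - 1], Y[k], Y[k + 1]
    delta = 0.5 * (c - a) / (2 * b - a - c)       # vertex of the fitted parabola
    return (k + delta) * fs / N

fs, N = 16000.0, 1024
f0 = 440.0                                        # 440 / (fs/N) = 28.16: off-bin
n = np.arange(int(N))
f_hat = estimate_freq(np.cos(2 * np.pi * f0 * n / fs + 0.3), fs)
# the coarse estimate alone would give 28 * 15.625 = 437.5 Hz; f_hat lands much closer
```
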

N.2 Detecting Harmonics in Speech

Speech is approximately quasi-periodic: voiced sounds (vowels) have a fundamental frequency $F_0$ (pitch) and harmonics at $2F_0, 3F_0, \ldots$ The harmonic structure is visible in the DFT as a series of peaks at integer multiples of $F_0$.

HPS (Harmonic Product Spectrum) algorithm:

  1. Compute the DFT magnitude $|X[k]|$
  2. Form the product spectrum: $P[k] = \prod_{h=1}^{H} |X[hk]|$
  3. Estimate $F_0 = f_s \cdot \arg\max_k P[k] / N$

The HPS effectively downsamples the spectrum by factors $1, 2, \ldots, H$ and takes the product - aligning the harmonic peaks while suppressing noise between them.
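A minimal HPS sketch (our own; we use $H = 3$ because the toy signal below contains only three harmonics - with $H$ larger than the number of harmonics present, the product at the true $F_0$ collapses toward zero):

```python
import numpy as np

def hps_pitch(x, fs, H=3, fmin=60.0, fmax=500.0):
    """Harmonic Product Spectrum pitch estimate, searched over a plausible F0 range."""
    N = len(x)
    X = np.abs(np.fft.rfft(x * np.hanning(N)))
    kmax = len(X) // H
    P = np.ones(kmax)
    for h in range(1, H + 1):
        P *= X[::h][:kmax]                        # X[::h][k] == X[h*k]
    lo = max(1, int(fmin * N / fs))
    hi = min(kmax, int(fmax * N / fs))
    k0 = lo + int(np.argmax(P[lo:hi]))
    return k0 * fs / N

fs, N = 16000, 4096
n = np.arange(N)
f0 = 200.0
# a voiced-speech-like signal: three harmonics with decaying amplitudes
x = sum(a * np.sin(2 * np.pi * h * f0 * n / fs) for h, a in [(1, 1.0), (2, 0.6), (3, 0.4)])
pitch = hps_pitch(x, fs)
```

The estimate is quantized to the bin width $f_s/N \approx 3.9$ Hz, so it lands near 200 Hz rather than exactly on it.
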

In AI: pitch estimation (fundamental frequency detection) is a key component of speech processing pipelines for voice synthesis (VALL-E, SoundStorm), emotion recognition, and singing voice synthesis. Modern deep-learning pitch detectors (CREPE, PYIN, PENN) outperform HPS but rely on spectral representations derived from the STFT.

N.3 The Spectrogram in Transformer-Based Audio Models

The dominant paradigm for audio language models (AudioLM, VALL-E, MusicLM, Stable Audio) is:

Audio waveform
    ↓ FFT → mel spectrogram (continuous representation)
    ↓ VQ-VAE or EnCodec → discrete tokens
    ↓ Language model (transformer) → next-token prediction
    ↓ Decoder → reconstructed mel spectrogram
    ↓ Vocoder (HiFi-GAN, WaveGrad) → audio waveform

The FFT and mel spectrogram serve as both the input representation for the encoder and the target representation for the decoder. The transformer operates entirely in the token space derived from the mel spectrogram.

Why not train end-to-end on waveforms? The waveform representation at 16-44 kHz requires sequence lengths of $10^4$-$10^5$ samples per second - far beyond current transformer context lengths for more than a few seconds of audio. The mel spectrogram with a 10 ms frame shift compresses by a factor of $\sim 160\times$ (at 16 kHz) while preserving the perceptually relevant structure. This compression is what makes transformer-based audio generation feasible.


Appendix O: Summary of AI Model Connections

| AI System | DFT Role | Key Parameters |
|---|---|---|
| OpenAI Whisper | FFT → mel spectrogram as input | $f_s=16$ kHz, $M=400$, $H=160$, $N=512$, 80 mel bins |
| FNet (Google) | Full 2D DFT replaces attention | DFT along sequence AND embedding dims; no learned weights |
| FNO (Caltech) | Truncated spectral convolution | $K=16$-$32$ modes; global PDE operator learning |
| FourCastNet (NVIDIA) | AFNO block for weather prediction | 2D FFT on $720\times1440$ lat/lon grid; 73 variables |
| FlashFFTConv | Butterfly-structured long convolution | Monarch factorization; $O(N\log N)$ with Tensor Cores |
| Monarch (Together AI) | FFT butterfly as trainable transform | $O(N\sqrt{N})$ parameters; used as attention surrogate |
| VALL-E | Codec (EnCodec) uses STFT loss | Multi-resolution STFT loss for perceptual quality |
| AudioLM | Semantic/acoustic tokenization | EnCodec uses mel-scale frequency band processing |
| CREPE (pitch) | Spectrogram input to CNN | $n_{\text{fft}}=1024$, $h=10$ ms; pitch detection on mel spec |
| HiFi-GAN (vocoder) | Multi-period discriminator | Multi-resolution STFT loss; discriminates real vs fake |
| S4/Mamba | Spectral view of state-space models | SSM recurrence equivalent to causal convolution (FFT in training) |
| RoPE (LLaMA) | Frequency-domain position encoding | Complex rotation = modulation in the DFT sense (Section 3.3) |

Appendix P: Proofs and Derivations

P.1 Derivation of Cooley-Tukey for General Radix

For $N = N_1 \cdot N_2$, write $n = N_2 n_1 + n_2$ and $k = N_1 k_2 + k_1$ with $n_1, k_1 \in [N_1]$, $n_2, k_2 \in [N_2]$:

$$X[k] = \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1} x[N_2 n_1 + n_2]\, \omega_N^{-(N_2 n_1 + n_2)(N_1 k_2 + k_1)}$$

Expanding the exponent: $-(N_2 n_1 + n_2)(N_1 k_2 + k_1) = -N_1 N_2 n_1 k_2 - N_2 n_1 k_1 - N_1 n_2 k_2 - n_2 k_1$

Since $\omega_N^{N_1 N_2} = 1$, the first term vanishes. Grouping:

$$X[N_1 k_2 + k_1] = \sum_{n_2=0}^{N_2-1} \omega_{N_2}^{-n_2 k_2} \left[\underbrace{\omega_N^{-n_2 k_1}}_{\text{twiddle}} \sum_{n_1=0}^{N_1-1} \omega_{N_1}^{-n_1 k_1}\, x[N_2 n_1 + n_2]\right]$$

The inner sum is an $N_1$-point DFT (one for each fixed $n_2$), and the outer sum is an $N_2$-point DFT (one for each fixed $k_1$). The twiddle factors $\omega_N^{-n_2 k_1}$ multiply between the two stages.

Total operations: $N_2$ DFTs of size $N_1$, plus $N_1$ DFTs of size $N_2$, plus $N_1 N_2$ twiddle multiplications; applied recursively, this yields $O(N \log N)$ by the master recurrence.
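The derivation translates directly into code: one decomposition step, with the sub-DFTs delegated to `numpy.fft` (our own sketch):

```python
import numpy as np

def ct_fft(x, N1, N2):
    """One Cooley-Tukey step for N = N1*N2, using np.fft for the sub-DFTs."""
    N = N1 * N2
    xr = np.asarray(x).reshape(N1, N2)             # xr[n1, n2] = x[N2*n1 + n2]
    inner = np.fft.fft(xr, axis=0)                 # N2 inner DFTs of size N1 -> [k1, n2]
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    twiddled = inner * np.exp(-2j * np.pi * k1 * n2 / N)   # twiddle factors
    outer = np.fft.fft(twiddled, axis=1)           # N1 outer DFTs of size N2 -> [k1, k2]
    X = np.empty(N, dtype=complex)
    k2 = np.arange(N2)[None, :]
    X[N1 * k2 + k1] = outer                        # X[N1*k2 + k1] = outer[k1, k2]
    return X

rng = np.random.default_rng(3)
x = rng.standard_normal(12) + 1j * rng.standard_normal(12)
```

Running `ct_fft(x, 3, 4)` (or `ct_fft(x, 4, 3)`) matches `np.fft.fft(x)` to machine precision.
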

P.2 The Uncertainty Principle for the DFT (Donoho-Stark Theorem)

Theorem (Donoho & Stark, 1989). For any nonzero $\mathbf{x} \in \mathbb{C}^N$:

$$|\operatorname{supp}(\mathbf{x})| \cdot |\operatorname{supp}(\mathbf{X})| \geq N$$

where $\operatorname{supp}(\mathbf{x}) = \{n : x[n] \neq 0\}$ is the support of $\mathbf{x}$.

Proof sketch. Suppose $|\operatorname{supp}(\mathbf{x})| = S$ and $|\operatorname{supp}(\mathbf{X})| = F$. Restricting the DFT to the rows in $\operatorname{supp}(\mathbf{X})$ and the columns in $\operatorname{supp}(\mathbf{x})$ defines an $F \times S$ submatrix $A$ of $\frac{1}{\sqrt{N}}F_N$ satisfying $A\,\mathbf{x}_{|S} = \mathbf{X}_{|F}/\sqrt{N}$. A counting argument using the rank of such submatrices gives $SF \geq N$.

Corollary. A signal cannot have fewer than $\sqrt{N}$ nonzero samples AND fewer than $\sqrt{N}$ nonzero DFT coefficients simultaneously. This is the discrete analog of the Heisenberg uncertainty principle.

For AI: compressed sensing exploits this theorem: a signal supported on $S$ DFT bins can be recovered from $m \geq C S\log N$ time-domain measurements - far fewer than $N$. This justifies FNO's use of $K \ll N$ Fourier modes: smooth solutions to PDEs have small Fourier support.
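The bound is tight: a Dirac comb with spacing $\sqrt{N}$ has exactly $\sqrt{N}$ nonzero samples and $\sqrt{N}$ nonzero DFT coefficients (a quick numerical check, our own):

```python
import numpy as np

# The Dirac comb with spacing sqrt(N) meets the bound with equality:
# sqrt(N) nonzero samples, sqrt(N) nonzero DFT coefficients, product exactly N.
N = 16
x = np.zeros(N)
x[::4] = 1.0                                  # support {0, 4, 8, 12}
X = np.fft.fft(x)
supp_x = int(np.count_nonzero(np.abs(x) > 1e-9))
supp_X = int(np.count_nonzero(np.abs(X) > 1e-9))
```
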

P.3 Aliasing Formula: Proof via Poisson Summation

When $x(t)$ is sampled at rate $f_s$ (possibly below the Nyquist rate for a bandlimited $x$), the DFT of the samples is:

$$X_s[k] = \frac{1}{T_s}\sum_{n=-\infty}^{\infty} \hat{x}\!\left(\frac{k}{N T_s} - \frac{n}{T_s}\right) = \frac{1}{T_s}\sum_{n=-\infty}^{\infty} \hat{x}(k\Delta f - n f_s)$$

This is the aliased spectrum: the true spectrum $\hat{x}$ summed over all frequency shifts by integer multiples of $f_s$. When $\hat{x}$ is nonzero outside $[-f_s/2, f_s/2)$, adjacent copies overlap and contaminate each other - this is aliasing.

Proof. Starting from the Poisson summation formula (Section 02-7.5):

$$\sum_{n=-\infty}^{\infty} f(t - nT) = \frac{1}{T}\sum_{k=-\infty}^{\infty} \hat{f}(k/T)\, e^{2\pi i k t/T}$$

Setting $f(t) = x(t)\, e^{-2\pi i \xi t}$ and $T = 1/f_s$, and evaluating at $t=0$:

$$\sum_n x(n/f_s)\, e^{-2\pi i\xi n/f_s} = f_s \sum_k \hat{x}(\xi - kf_s)$$

The left side is the DTFT of the sampled signal; the right side is the periodized spectrum. Sampling this DTFT at $\xi = k f_s/N$ gives the DFT, yielding the result.
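A quick numerical illustration of the formula (our own): a 7 kHz tone sampled at 10 kHz lands at the aliased frequency $|7 - 10| = 3$ kHz.

```python
import numpy as np

fs, N = 10_000, 1000                      # sampling rate 10 kHz, Nyquist 5 kHz
n = np.arange(N)
x = np.cos(2 * np.pi * 7000 * n / fs)     # 7 kHz tone: above the Nyquist limit
spec = np.abs(np.fft.rfft(x))
k_peak = int(np.argmax(spec))
f_apparent = k_peak * fs / N              # the energy shows up at 3000 Hz, not 7000
```
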


Appendix Q: Further Reading

Primary References

  1. Cooley, J.W. and Tukey, J.W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation, 19(90), 297-301. - The original FFT paper; just a few pages, historically indispensable.

  2. Oppenheim, A.V. and Schafer, R.W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice Hall. - The standard reference for DSP; chapters 8-10 cover DFT, FFT, and windowing comprehensively.

  3. Strang, G. (1999). "The Discrete Cosine Transform." SIAM Review, 41(1), 135-147. - Beautiful exposition connecting the FFT, DCT, and linear algebra.

  4. Frigo, M. and Johnson, S.G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE, 93(2), 216-231. - FFTW's adaptive algorithm selection strategy.

  5. Li, Z. et al. (2021). "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR 2021. - FNO architecture and theory.

  6. Dao, T. et al. (2022). "Monarch: Expressive Structured Matrices for Efficient and Accurate Training." ICML 2022. - Butterfly matrix parameterization.

  7. Radford, A. et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. - Whisper architecture and mel spectrogram preprocessing.

  8. Donoho, D.L. and Stark, P.B. (1989). "Uncertainty Principles and Signal Recovery." SIAM Journal on Applied Mathematics, 49(3), 906-931. - Discrete uncertainty principle.

Online Resources

  • NumPy FFT documentation: numpy.fft module reference with complete API
  • SciPy signal processing: scipy.signal.spectrogram, scipy.signal.welch, scipy.fft
  • Whisper preprocessing code: openai/whisper GitHub repository, audio.py
  • FNO implementation: neuraloperator/neuraloperator GitHub repository

Appendix R: Quick-Reference Formula Sheet

| Symbol | Meaning | Value / Formula |
|---|---|---|
| $\omega_N$ | Principal $N$-th root of unity | $e^{2\pi i/N}$ |
| $X[k]$ | $k$-th DFT coefficient | $\sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$ |
| $x[n]$ | IDFT reconstruction | $\frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk}$ |
| $F_N$ | DFT matrix | $(F_N)_{kn} = \omega_N^{-kn}$, so $F_N F_N^* = NI$ |
| $\Delta f$ | Frequency bin width | $f_s/N = 1/T$ |
| $f_{\text{Nyq}}$ | Nyquist frequency | $f_s/2$ |
| $N_{\min}$ | Minimum $N$ to resolve $\Delta f_0$ | $\lceil f_s/\Delta f_0 \rceil$ |
| $S[l,k]$ | STFT coefficient | $\sum_{m} x[lH+m]\,w[m]\,\omega_N^{-mk}$ |
| COLA | Overlap-add constant | $\sum_l w[n-lH] = C$ for all $n$ |
| $X[N-k]$ | Conjugate symmetry (real $x$) | $X[k]^*$ |

Window cheat-sheet

| Window | Main lobe (bins) | First sidelobe (dB) | Use case |
|---|---|---|---|
| Rectangular | 2 | -13 | Broadband noise, transients |
| Hann | 4 | -31.5 | General purpose |
| Hamming | 4 | -42.7 | Speech processing |
| Blackman | 6 | -58 | Weak sinusoids near strong ones |
| Kaiser ($\beta=8.6$) | variable | -69 | Precision filter design |

FFT complexity

$$T(N) = 2T(N/2) + O(N) \implies T(N) = O(N \log_2 N)$$

Compared to the naive DFT at $N = 10^6$: from $10^{12}$ down to $2 \times 10^7$ operations - a $50\,000\times$ speedup.

Key identities

$$x \circledast y \;\overset{\mathcal{F}}{\longleftrightarrow}\; X[k] \cdot Y[k] \qquad \text{(see Section 04)}$$

$$\sum_{n=0}^{N-1}|x[n]|^2 = \frac{1}{N}\sum_{k=0}^{N-1}|X[k]|^2 \qquad \text{(Parseval)}$$

$$X[k] = X[k+N] \qquad \text{(periodicity in frequency)}$$

$$x[n] = x[n+N] \qquad \text{(periodicity in time, implicit)}$$
