Discrete Fourier Transform and FFT: Part 7: Short-Time Fourier Transform to Appendix H: The FFT in Modern Hardware
7. Short-Time Fourier Transform
7.1 Motivation: Time-Varying Spectra
The DFT and continuous FT assume the signal's spectral content is stationary - constant over the entire observation window. This assumption fails catastrophically for real-world signals:
- Speech: phonemes last 20-100 ms, with frequency content changing at each phoneme boundary
- Music: pitch and timbre evolve continuously; a melody is precisely a time-varying spectrum
- EEG: neural oscillations appear and disappear; seizure onset is detected by spectral changes
- Radar: target velocity changes the Doppler shift over time
The Short-Time Fourier Transform (STFT) addresses this by computing a sequence of DFTs on overlapping short segments of the signal. The result is a two-dimensional representation: time on one axis, frequency on the other.
7.2 STFT Definition
Definition (STFT). The Short-Time Fourier Transform of signal $x[n]$ with analysis window $w[n]$ of length $L$ and hop size $H$ is:

$$X[m, k] = \sum_{n=0}^{L-1} x[mH + n]\, w[n]\, e^{-i 2\pi k n / N_{\text{FFT}}}, \qquad k = 0, 1, \dots, N_{\text{FFT}} - 1$$

where:
- $m$ is the frame index (time axis); $mH$ is the starting sample of frame $m$
- $H$ is the hop size (number of samples between consecutive frames, $H \le L$)
- $N_{\text{FFT}}$ is the FFT size (zero-padding if $N_{\text{FFT}} > L$)
- $M = \lfloor (N_x - L)/H \rfloor + 1$ is the number of frames for a signal of length $N_x$
Parameters and their effects:
| Parameter | Effect |
|---|---|
| Window length $L$ | Longer $L$ -> better frequency resolution, worse time resolution |
| Hop size $H$ | Smaller $H$ -> more frames (denser time axis), more computation |
| FFT size $N_{\text{FFT}}$ | $N_{\text{FFT}} > L$: zero-padding for a denser frequency grid (not more resolution) |
| Window type | Hann (default), Hamming, Blackman - sidelobe/resolution tradeoff (Section 5.2) |
Whisper's STFT parameters: $L = 400$ samples (25 ms), $H = 160$ samples (10 ms), $N_{\text{FFT}} = 512$, window = Hann. At 16 kHz, this gives 257 frequency bins and one frame every 10 ms.
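The framing-and-FFT loop above is short enough to write directly. Below is a minimal NumPy sketch using Whisper-like parameters; the function name `stft` and the framing details are illustrative, not Whisper's exact implementation.

```python
import numpy as np

def stft(x, win_len=400, hop=160, n_fft=512):
    """Naive STFT: Hann-window each frame, zero-pad to n_fft, take the real FFT."""
    w = np.hanning(win_len)                           # analysis window
    n_frames = 1 + (len(x) - win_len) // hop          # number of complete frames
    frames = np.stack([x[m * hop : m * hop + win_len] * w
                       for m in range(n_frames)])
    return np.fft.rfft(frames, n=n_fft, axis=1).T     # shape (n_fft//2 + 1, n_frames)

fs = 16000
t = np.arange(fs) / fs                                # 1 second of audio
X = stft(np.sin(2 * np.pi * 1000 * t))                # 1 kHz test tone
print(X.shape)                                        # (257, 98): 257 bins, ~1 frame / 10 ms
```

The 1 kHz tone lands at bin $1000 \cdot 512 / 16000 = 32$, which is easy to verify from the output.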
7.3 The Spectrogram
Definition (Spectrogram). The spectrogram is the squared magnitude of the STFT:

$$S[m, k] = \left| X[m, k] \right|^2$$
It represents the power (not amplitude) of each frequency at each time frame. Displayed as a heatmap (time on x-axis, frequency on y-axis, color = power), it visualizes how the frequency content of the signal evolves over time.
Log-power spectrogram: $10 \log_{10} S[m, k]$, in decibels. The logarithmic scale compresses the dynamic range and matches the perceptual sensitivity of human hearing (roughly logarithmic in both intensity and frequency).
Mel spectrogram: sum the spectrogram over mel-frequency filterbanks (triangular filters spaced on the mel scale, which is a perceptual frequency scale logarithmic above 1 kHz). Used in Whisper, most modern ASR systems, and many audio AI models. Full details in Section 8.1.
Chirp spectrogram (example): a signal sweeping linearly from $f_0$ to $f_1$ shows a diagonal stripe in the spectrogram - the instantaneous frequency increases linearly with time. This illustrates how the STFT tracks time-varying frequency content.
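The chirp example can be reproduced with `scipy.signal.stft`; the sweep parameters below (200 Hz to 3200 Hz over one second at 8 kHz) are illustrative choices, not from the text.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(fs) / fs                             # 1 second
x = np.sin(2 * np.pi * (200 * t + 1500 * t**2))    # instantaneous freq: 200 + 3000*t Hz
f, frames, Z = stft(x, fs=fs, window='hann', nperseg=256, noverlap=128)
S = np.abs(Z) ** 2                                 # spectrogram (power)
peak_freq = f[np.argmax(S, axis=0)]                # dominant frequency per frame
print(peak_freq[3], peak_freq[-4])                 # ridge climbs across the spectrogram
```

The per-frame peak frequency traces the diagonal stripe: it rises from roughly 200 Hz toward roughly 3200 Hz.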
7.4 Uncertainty Principle for STFT
The STFT cannot escape the Heisenberg uncertainty principle from Section 02. For the windowed transform with a window of duration $\Delta t$ and bandwidth $\Delta f$:

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

For a Hann window of length $L$ samples (duration $\Delta t = L / f_s$), the effective bandwidth is $\Delta f \approx 1.44\, f_s / L$ (the -3 dB bandwidth of the Hann spectral window). The time-frequency resolution product is therefore fixed:

$$\Delta t \cdot \Delta f \approx 1.44 \quad \text{(independent of } L\text{)}$$
Practical consequence: you cannot simultaneously achieve:
- Fine time resolution (short window, small $L$)
- Fine frequency resolution (long window, large $L$)
For speech recognition (Whisper): $L = 400$ samples -> $\Delta t = 25$ ms, $\Delta f = f_s / L = 40$ Hz. For musical pitch tracking (distinguishing semitones near 100 Hz, which are spaced about 6 Hz apart): $\Delta f \lesssim 6$ Hz requires $L \gtrsim 2700$ samples at 16 kHz (a window of roughly 0.17 s). The STFT is inadequate for simultaneously good time and frequency resolution across scales - this is exactly why wavelets exist (Section 05 Wavelets).
7.5 Synthesis and Perfect Reconstruction
Overlap-Add synthesis: Given STFT frames $X[m, k]$ (possibly modified), reconstruct by inverse-transforming each frame, windowing, and summing the overlapped results:

$$\hat{x}[n] = \sum_m g[n - mH] \cdot \big( \mathrm{IDFT}\{ X[m, \cdot] \} \big)[n - mH]$$

where $w$ is the analysis window and $g$ is the synthesis window.
COLA (Constant Overlap-Add) condition: for perfect reconstruction of an unmodified STFT:

$$\sum_m w[n - mH] \, g[n - mH] = C \quad \text{for all } n$$

Hann window with $H = L/2$ (50% overlap) and trivial synthesis window $g = 1$: $\sum_m w[n - mH] = 1$ for all $n$ - COLA satisfied.
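The 50%-overlap Hann claim can be checked numerically. The sketch below uses the periodic Hann variant (for which the overlapped sum is exactly constant; NumPy's `np.hanning` is the symmetric variant, whose sum is constant only up to $O(1/L)$):

```python
import numpy as np

L, H = 512, 256                                    # Hann window, 50% overlap
n = np.arange(L)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / L)          # periodic Hann
acc = np.zeros(20 * H + L)
for m in range(0, 20 * H, H):
    acc[m : m + L] += w                            # overlap-add the shifted windows
interior = acc[L : -L]                             # edges have partial overlap; ignore them
print(interior.min(), interior.max())              # both ~1.0: COLA holds
```

The identity behind this is $w[n] + w[n + L/2] = 1 - \tfrac{1}{2}\big(\cos\theta + \cos(\theta + \pi)\big) = 1$.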
Modified Discrete Cosine Transform (MDCT): audio codecs (AAC, Vorbis, Opus) use the MDCT - a cosine-based, critically sampled transform with 50% overlap (no redundancy). The MDCT achieves exact reconstruction via Time-Domain Aliasing Cancellation (TDAC), which is why modern audio codecs can achieve high quality at low bit rates. Full filter bank theory is developed in Section 05 Wavelets.
Forward reference: The STFT and MDCT are the simplest multi-resolution analyses. Wavelets generalize this by allowing different window lengths at different frequencies - short windows at high frequencies (good time resolution) and long windows at low frequencies (good frequency resolution). See Section 20-05 Wavelets for the full treatment.
8. Applications in Machine Learning
8.1 Whisper: FFT Pipeline for Speech Recognition
OpenAI Whisper (Radford et al., 2022) is a family of encoder-decoder transformers (up to 1.55 B parameters) trained on 680,000 hours of multilingual speech. Its input is not raw audio but an 80-channel log-mel spectrogram computed by a fixed FFT pipeline.
The Whisper preprocessing pipeline:
WHISPER AUDIO PREPROCESSING PIPELINE
========================================================================
Raw audio (PCM, 16 kHz)
|
1. Resample to 16 kHz (if needed)
|
2. Pad/trim to 30-second segments (480,000 samples)
|
3. STFT: window=Hann(400), hop=160, N_FFT=512
-> Complex spectrogram: shape (257, T_frames)
|
4. Power spectrogram: |STFT|^2 -> shape (257, T_frames)
|
5. Apply 80 mel filterbanks (triangular, mel-spaced, 0-8000 Hz)
-> Mel power spectrogram: shape (80, T_frames)
|
6. Log compression: log10(max(mel_spec, 1e-10))
|
7. Normalize: clamp to (max - 8), then (log_mel + 4) / 4 (empirically chosen constants)
|
Transformer input: shape (80, 3000) for 30-second clip
(3000 frames x 10 ms/frame = 30 seconds)
========================================================================
Mel filterbanks: The mel scale maps frequency $f$ (Hz) to perceived pitch $m$ (mel):

$$m(f) = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right)$$

Filterbanks are triangular filters equally spaced on the mel scale, covering the range 0-8000 Hz. Human speech information is concentrated below 8 kHz; the mel scale matches the roughly logarithmic frequency resolution of the cochlea.
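The standard textbook construction of the mel filterbank is a few lines of NumPy. This is the generic recipe, not necessarily bit-identical to Whisper's precomputed filterbank:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=512, fs=16000, fmax=8000):
    """Triangular filters with peaks equally spaced on the mel scale over 0..fmax."""
    mel_pts = np.linspace(0.0, hz_to_mel(fmax), n_mels + 2)   # n_mels peaks + 2 edges
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                          # rising edge
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                         # falling edge
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)     # (80, 257)
```

Applying `fb @ power_spectrogram` collapses the 257 linear-frequency bins to 80 mel channels.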
Why log-mel spectrograms? (1) Log compression matches the roughly logarithmic perception of loudness (Stevens' power law). (2) Log mel spectrograms are approximately translation-equivariant in the log-frequency dimension - a pitch shift is a translation in log-frequency. (3) Empirically, they produce much better ASR results than raw spectrograms or MFCCs (Mel-Frequency Cepstral Coefficients) for large-scale transformer models.
For AI: Every audio transformer (Whisper, AudioPaLM, MusicGen, AudioLM) uses a similar FFT -> mel spectrogram pipeline as its front-end. Understanding the pipeline is essential for debugging audio preprocessing bugs, which are among the most common sources of silent failure in audio AI systems.
8.2 Fourier Neural Operator (FNO)
The Fourier Neural Operator (Li et al., 2021, NeurIPS) is a neural architecture for learning solution operators of PDEs. Instead of learning to map one function to another pointwise, the FNO learns in the frequency domain.
Problem setting: given an input function $a(x)$ (e.g., the initial condition of a PDE), learn the operator $\mathcal{G}: a \mapsto u$, where $u$ is the solution (e.g., the PDE solution at time $T$).
FNO layer: each FNO layer applies:

$$v_{t+1}(x) = \sigma\Big( W v_t(x) + \mathcal{F}^{-1}\big( R_\phi \cdot \mathcal{F}(v_t) \big)(x) \Big)$$

where:
- $\mathcal{F}$ is the DFT along the spatial dimension
- $R_\phi$ is a learned complex weight tensor (parametrized by $\phi$)
- $k_{\max}$ is the number of retained frequency modes (truncation threshold)
- $W$ is a local linear transform (pointwise)
- $\sigma$ is a nonlinearity
Key innovation: truncate to the lowest $k_{\max}$ Fourier modes before applying $R_\phi$:

$$\big( R_\phi \cdot \mathcal{F}(v_t) \big)[k] = \begin{cases} R_\phi[k] \, \mathcal{F}(v_t)[k] & |k| \le k_{\max} \\ 0 & \text{otherwise} \end{cases}$$

This implements a global convolution with a low-frequency kernel - learning only the large-scale structure of the operator, discarding high-frequency noise. For 2D problems (e.g., Navier-Stokes), truncate to $k_{\max}$ modes in each dimension.
Computational cost: $O(N \log N)$ for the DFT plus $O(k_{\max})$ per channel pair for the spectral convolution, vs. $O(N^2)$ for a full global convolution. Since $k_{\max} \ll N$ in practice, the FNO spectral layer is far cheaper than full attention at long sequence lengths.
For AI: FNO was the first neural architecture to demonstrate that learning in frequency space outperforms learning in physical space for many PDE tasks. It has been extended to weather prediction (Pathak et al., 2022 "FourCastNet"), fluid dynamics, and climate modeling. The core idea - truncate, learn in frequency, invert - generalizes the classic Galerkin method from numerical PDE.
8.3 Monarch Matrices and Structured FFT
Dao et al. (2022, "Monarch: Expressive Structured Matrices for Efficient and Accurate Training") observed that the FFT matrix can be factored as a product of sparse butterfly matrices:

$$F_N = B_{\log_2 N} \cdots B_2 \, B_1 \, P$$

where $P$ is the bit-reversal permutation and each $B_s$ is a block-diagonal matrix of butterfly factors. Each $B_s$ has exactly $2N$ non-zero entries (two per row), giving $O(N)$ parameters per stage and $O(N \log N)$ total parameters for the full factorization.
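One Cooley-Tukey step of this factorization can be verified numerically: $F_N = B_N \,(I_2 \otimes F_{N/2})\, P_N$, where $P_N$ is the even-odd sorting permutation and $B_N$ is a sparse butterfly stage. A small NumPy check (variable names are ours):

```python
import numpy as np

def dft_matrix(N):
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

N = 8
F = dft_matrix(N)

# Even-odd sorting permutation: x -> [x[0], x[2], ..., x[1], x[3], ...]
P = np.zeros((N, N))
P[np.arange(N), np.concatenate([np.arange(0, N, 2), np.arange(1, N, 2)])] = 1

D = np.diag(np.exp(-2j * np.pi * np.arange(N // 2) / N))   # twiddle factors
I = np.eye(N // 2)
B = np.block([[I, D], [I, -D]])                             # sparse butterfly stage

F_rec = B @ np.kron(np.eye(2), dft_matrix(N // 2)) @ P      # one recursion level
assert np.allclose(F, F_rec)
```

Recursing on the two $F_{N/2}$ blocks yields the full product of $\log_2 N$ sparse stages.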
Monarch parameterization: instead of fixing the butterfly structure to be the FFT, learn the weights:

$$M = P_1 \, L \, P_2 \, R$$

where $L$ and $R$ are block-diagonal "butterfly-like" matrices and $P_1, P_2$ are fixed permutations. This gives a learnable structured matrix with $O(N^{3/2})$ parameters (between $O(N)$ for diagonal and $O(N^2)$ for dense).
Applications:
- Efficient attention: Monarch matrices can approximate attention maps in subquadratic time, vs. $O(N^2)$ for dense attention
- Dense-layer replacement: replacing standard linear layers with Monarch matrices in transformers gives roughly 2x speedup with minimal accuracy loss
- FFT-based initialization: Monarch matrices initialized as FFT butterfly factors learn sparse frequency representations naturally
For AI: The butterfly matrix decomposition connects signal processing (FFT structure) to deep learning (efficient matrix approximation). Models such as Monarch Mixer and several emerging efficient transformer variants build on butterfly-structured operators as efficient replacements for dense weight matrices.
8.4 FlashFFTConv and Long-Range Convolutions
Dao et al. (2023, "FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores") addresses a key bottleneck in long-context sequence modeling: computing long convolutions efficiently on modern GPU hardware.
The problem: State Space Models (SSMs) like S4 (Gu et al., 2022) and Mamba (Gu & Dao, 2023) compute long convolutions in $O(N \log N)$ time using the FFT-based convolution theorem. However, standard FFT implementations are memory-bandwidth-bound and do not efficiently utilize the Tensor Cores (matrix-multiply units) of modern GPUs (A100, H100).
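The baseline operation being accelerated - FFT-based linear convolution - is a few lines of NumPy (a reference sketch, not FlashFFTConv itself):

```python
import numpy as np

def fft_conv(u, k):
    """Linear convolution via FFT in O(N log N): zero-pad to avoid circular wraparound."""
    N = len(u)
    L = 1 << (2 * N - 1).bit_length()                 # power-of-two FFT length >= 2N - 1
    y = np.fft.irfft(np.fft.rfft(u, L) * np.fft.rfft(k, L), L)
    return y[:N]                                      # keep the causal part, length N

rng = np.random.default_rng(0)
u, k = rng.standard_normal(1000), rng.standard_normal(1000)
assert np.allclose(fft_conv(u, k), np.convolve(u, k)[:1000])   # matches direct O(N^2) conv
```

FlashFFTConv keeps this $O(N \log N)$ algorithm but restructures the FFT into dense small-matrix multiplies that map onto Tensor Cores.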
FlashFFTConv solution: decompose the length-$N$ FFT convolution via the Monarch factorization into a product of shorter matrix operations (small dense matrix multiplies), enabling Tensor Core utilization while preserving the $O(N \log N)$ asymptotic complexity.
Key results:
- 2-7.3x speedup over standard FFT convolution for sequences up to 4M tokens
- Linear memory in sequence length (vs. quadratic for full attention)
- Enables training SSMs on sequences orders of magnitude longer than attention allows
Connection to Mamba: Mamba (Gu & Dao, 2023) uses selective state spaces - a time-varying SSM where the transition matrices depend on the input. The resulting recurrence can be parallelized via a parallel scan (prefix sum), NOT an FFT. FlashFFTConv applies specifically to time-invariant SSMs like S4.
8.5 Spectral Methods in Graph Neural Networks
Graph Neural Networks (GNNs) performing spectral graph convolution use the graph Laplacian eigenvectors as a "Fourier basis" for the graph. The connection is:
- Graph Laplacian $L = D - A$ (where $D$ is the degree matrix, $A$ the adjacency matrix)
- Spectral decomposition $L = U \Lambda U^\top$ provides the "graph Fourier modes" (columns of $U$)
- Graph convolution: $g_\theta * x = U \, g_\theta(\Lambda) \, U^\top x$, where $g_\theta$ is a learnable spectral filter
This is a graph-domain analog of the DFT, with the regular frequency grid replaced by the graph eigenvalue spectrum.
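The analogy can be made concrete on a tiny graph. Below, the eigendecomposition of the path-graph Laplacian plays the role of the DFT, and a low-pass spectral filter $g(\lambda) = e^{-\lambda}$ smooths an impulse (the graph and filter are illustrative choices):

```python
import numpy as np

# Path graph on 5 nodes: L = D - A; eigenvectors of L form the graph Fourier basis
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
D = np.diag(A.sum(axis=1))
Lap = D - A
lam, U = np.linalg.eigh(Lap)              # eigenvalues ascending: the "graph frequencies"

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # impulse on node 0
x_hat = U.T @ x                            # graph Fourier transform
y = U @ (np.exp(-lam) * x_hat)             # filter in the spectral domain, transform back
print(np.round(lam, 3))                    # lam[0] = 0: the constant vector is the DC mode
```

On a cycle graph, this construction recovers exactly the classical DFT basis, which is the sense in which the DFT is the special case for regular grids.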
Forward reference to Section 11-04: The full treatment of spectral graph convolution - including ChebNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2017), and the connection to the graph Laplacian - is the canonical content of Section 11-04 Spectral Graph Theory. Here we note only the structural analogy: DFT is to regular grids as graph Fourier transform is to irregular graphs.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using np.fft.fft output directly as frequency-domain amplitude without normalization | fft returns $\sum_n x[n] e^{-i 2\pi k n / N}$; the peak magnitude for a unit sinusoid is $N/2$, not 1 | Divide by $N$ (and multiply by 2 for a one-sided real spectrum) to recover true sinusoidal amplitudes |
| 2 | Building the frequency axis as np.arange(N) * fs / N and treating all bins as positive frequencies | Bins $k > N/2$ represent negative frequencies wrapped around; calling bin $k$ "frequency $k f_s / N$" is incorrect there | Use np.fft.fftfreq(N, d=1/fs), which returns the correct signed frequencies; apply fftshift to center zero |
| 3 | Forgetting to apply fftshift before plotting a spectrum | Without fftshift, the negative-frequency bins appear at the right end of the plot, not the left, making the spectrum appear discontinuous | Always call fftshift(X) and fftshift(freqs) before plotting a two-sided spectrum |
| 4 | Concluding that zero-padding improves frequency resolution | Zero-padding increases the density of frequency bins but does NOT reduce $\Delta f = f_s / L$; two sinusoids closer than $\Delta f$ remain unresolvable regardless of zero-padding | Say "zero-padding interpolates the spectrum," not "improves resolution"; use longer observation windows for genuine resolution improvement |
| 5 | Applying DFT-property formulas (e.g., shift property) to linearly-shifted signals | The DFT shift property $x[(n - n_0) \bmod N] \leftrightarrow X[k]\, e^{-i 2\pi k n_0 / N}$ applies only to circular shifts. A linear shift (with zeros entering from one end) introduces boundary effects that invalidate it | Use circular indexing explicitly, or use the linear convolution approach with appropriate zero-padding |
| 6 | Confusing the DFT normalization conventions of different libraries | NumPy puts $1/N$ on ifft only; some textbooks use $1/\sqrt{N}$ on both; PyTorch matches NumPy by default but offers a norm parameter | Always check the documentation of the specific function; transform a known signal (e.g., pure DC) and verify the expected coefficient |
| 7 | Misinterpreting spectral leakage as separate frequency components | When a sinusoid does not land on a DFT bin, its energy spreads to neighboring bins, appearing as multiple "peaks" in the spectrum | Apply a window function before the DFT and look for main lobes (not individual bins) as frequency indicators |
| 8 | Treating the Nyquist frequency bin as a complex value | For real input, $X[N/2]$ is always real (by conjugate symmetry: $X[k] = X^*[N - k]$). Its imaginary part should be zero (up to numerical noise) | Assert np.allclose(X[N//2].imag, 0) for real input; take only the real part when reporting the Nyquist coefficient |
| 9 | Using an FFT size smaller than the signal length | If n < len(x), NumPy silently truncates the input; this discards signal information and produces incorrect results | Set $N \geq$ the signal length; for linear convolution, use $N \geq N_1 + N_2 - 1$ |
| 10 | Applying windowing after the FFT | Windows must be applied to the time-domain signal before the FFT. Multiplying the frequency-domain output by a window corresponds to circular convolution in the time domain, not spectral smoothing | Multiply x * w in the time domain, then compute fft(x * w) |
| 11 | Overlooking numerical precision loss in the FFT | FFT algorithms propagate floating-point errors; for large $N$, accumulated rounding grows slowly (roughly with $\sqrt{\log N}$ in the RMS sense), which matters for precision-sensitive applications | Use np.float64 (double precision) inputs; for critically sensitive applications, verify against a higher-precision reference |
| 12 | Applying rfft to complex-valued signals | rfft assumes real input and returns $N/2 + 1$ values; calling it on complex input discards the imaginary parts and produces incorrect results | Use fft (not rfft) for complex inputs; use rfft only when you are certain the signal is real-valued |
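Several of these pitfalls can be checked directly in NumPy; the signal parameters below are arbitrary illustrations:

```python
import numpy as np

fs, N = 100.0, 64
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 10.3 * t)           # off-bin sinusoid -> spectral leakage

X = np.fft.fft(x)
freqs = np.fft.fftfreq(N, d=1 / fs)        # mistake 2: correct signed frequency axis
assert freqs[N // 2 + 1] < 0               # upper bins are negative frequencies

amp = 2 * np.abs(X) / N                    # mistake 1: one-sided amplitude normalization
assert amp[: N // 2].max() <= 1.0 + 1e-9   # a unit sinusoid never exceeds amplitude 1

assert abs(X[N // 2].imag) < 1e-9          # mistake 8: Nyquist bin is real for real input

X_pad = np.fft.fft(x, n=8 * N)             # mistake 4: zero-padding interpolates the
assert len(X_pad) == 8 * N                 #   spectrum; it does not change resolution
```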
10. Exercises
Exercise 1 * - DFT by Hand
(a) Compute the 4-point DFT of $x$ by hand using $\omega_4 = e^{-i 2\pi / 4} = -i$. Show all four coefficients $X[0], X[1], X[2], X[3]$.
(b) Verify that $X[0]$ equals the sum of all samples (DC component) and that $X$ satisfies Parseval's theorem.
(c) What is the IDFT of $X$? Compute it by hand and verify that it recovers $x$.
(d) For the constant signal $x[n] = 1$, what is the DFT? Explain geometrically why $X[k] = 0$ for $k \neq 0$.
Exercise 2 * - DFT Matrix Properties
(a) Construct the 4x4 DFT matrix $F_4$ explicitly and verify numerically that $F_4 F_4^H = 4I$.
(b) Show that $F_4^2 = 4\Pi$, where $\Pi$ is the time-reversal permutation matrix (i.e., applying the DFT twice returns the time-reversed signal scaled by $N$).
(c) Compute the eigenvalues of the normalized matrix $U = F_4 / 2$ and show they belong to $\{1, -1, i, -i\}$. (Hint: $U^4 = I$.)
(d) For a random complex vector $x$, numerically verify the unitary property: $\|F_4 x\|^2 = 4 \|x\|^2$.
Exercise 3 * - FFT Implementation from Scratch
(a) Implement the radix-2 Cooley-Tukey FFT recursively in Python (without using numpy.fft). Your function should accept any power-of-2 length.
(b) Test on inputs of several power-of-2 lengths. Verify against numpy.fft.fft to within floating-point tolerance.
(c) Time your implementation vs numpy.fft.fft for a large power-of-2 length. By what factor is NumPy faster?
(d) Implement the bit-reversal permutation as a standalone function and verify it matches the permutation implicitly used by your FFT.
Exercise 4 ** - Spectral Leakage and Windowing
(a) Generate a sinusoid at a non-integer bin frequency over $N$ samples. Compute and plot the DFT magnitude for rectangular, Hann, and Blackman windows.
(b) Quantify the leakage: for the rectangular window, what fraction of the total spectral energy lies outside the main lobe?
(c) Take two sinusoids at nearby (non-integer) frequencies over $N$ samples. Show that their peaks cannot be distinguished on the bare $N$-point DFT grid with the rectangular window, but can be partially separated after zero-padding to a longer FFT length.
(d) Implement the Kaiser window (using scipy.special.i0) and sweep the shape parameter $\beta$ from 0 to 14. Plot peak sidelobe level vs. main-lobe width. Reproduce the three window parameter rows from the table in Section 5.3.
Exercise 5 ** - Parseval, Energy, and the Power Spectral Density
(a) Generate a bandlimited noise signal: a sum of sinusoids over a band of bins with random phases $\phi_k$. Verify Parseval's theorem numerically: $\sum_n |x[n]|^2 = \frac{1}{N} \sum_k |X[k]|^2$.
(b) Compute the one-sided power spectral density (PSD) and identify the occupied bandwidth.
(c) Compare the PSD estimated by the DFT (single-frame) vs Welch's method (scipy.signal.welch, averaged over overlapping frames). Show that Welch's estimate is less noisy.
(d) For white noise with variance $\sigma^2$, verify that the expected PSD is approximately flat, and that the variance of the PSD estimate decreases with the number of Welch averages.
Exercise 6 ** - STFT and Spectrogram Analysis
(a) Generate a chirp signal sweeping from $f_0$ to $f_1$ Hz over $N$ samples (1 second) at sample rate $f_s$. Compute and display the STFT spectrogram using a Hann window of length 256 samples with 50% overlap.
(b) Show how the spectrogram changes as you vary the window length (e.g., 64, 256, 1024 samples). Identify the time-frequency resolution tradeoff.
(c) Add a 500 Hz sinusoidal interference to the chirp. Show that a narrow window resolves the interference in time but not frequency, while a wide window resolves it in frequency but not time.
(d) Use the STFT to perform simple spectral subtraction denoising: estimate noise power from a "silent" region, subtract from all frames, reconstruct via overlap-add. Report the signal-to-noise ratio improvement.
Exercise 7 *** - Whisper Mel Spectrogram Pipeline
(a) Implement the full Whisper mel-spectrogram pipeline from scratch:
- STFT with Hann window (400 samples), hop 160, FFT size 512
- Power spectrogram
- 80 mel filterbanks covering 0-8000 Hz on the mel scale
- Log compression with floor at $10^{-10}$
Test on a 1-second synthetic signal: a mixture of three sinusoids at 300, 1200, and 3500 Hz.
(b) Verify that your mel filterbank outputs approximately preserve the total spectral power: $\sum_m \text{Mel}[m, t] \approx \sum_k S[k, t]$ for each frame $t$.
(c) Apply your pipeline to the test signal and plot the 80x100 mel spectrogram. Label the three sinusoid peaks.
(d) Analyze the effect of mel filterbank overlap: are the three sinusoids at 300, 1200, and 3500 Hz resolved in separate mel bins? What is the mel-bin width in Hz at each of the three frequencies?
Exercise 8 *** - Fourier Neural Operator: Spectral Convolution Layer
(a) Implement the 1D FNO spectral convolution layer:
def spectral_conv1d(x, weights, modes):
# x: (batch, channels, N)
# weights: (in_channels, out_channels, modes) complex
# modes: number of retained Fourier modes
The layer: DFT along last axis -> truncate to modes -> multiply by weights -> zero-pad -> IDFT.
(b) Test on a 1D Burgers'-style setup: generate smooth initial conditions and apply your spectral layer. Show that retaining a small number of low modes captures nearly all of the signal energy for smooth functions.
(c) Compare the spectral layer against a naive dense convolution layer of comparable parameter count. For your chosen $N$, channel width, and number of modes: how many parameters does the spectral layer have vs. the dense layer?
(d) Implement a 2-layer FNO and train it to approximate an integral operator on discretized grids of several sizes. Report the relative error and show that the spectral layer generalizes across grid resolutions (resolution-invariance).
11. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Impact |
|---|---|
| FFT O(N log N) complexity | Foundation of all efficient sequence models: FNet, S4, Mamba, FNO, FlashFFTConv - the difference between feasible and infeasible at 100K+ token lengths |
| DFT matrix unitarity | Guarantees that FFT-based feature maps preserve signal energy; relevant to gradient stability in FNO and FNet - unitary transforms avoid the vanishing/exploding gradient problem in deep spectral networks |
| Spectral truncation | FNO's key innovation: learning only low-frequency behavior yields generalizable, resolution-invariant operators; matches the inductive bias that real physics is smooth |
| STFT and mel spectrograms | The front-end of every modern audio model (Whisper, AudioPaLM, MusicGen, AudioLM, VALL-E 2); getting the FFT pipeline right is prerequisite for any audio AI work |
| Windowing | Hann/Hamming windows in Whisper's STFT directly affect spectrogram quality; poor window choice causes leakage artifacts that corrupt speech features used by the ASR transformer |
| Butterfly matrix structure | Monarch matrices (Dao et al., 2022) parameterize structured transforms with $O(N^{3/2})$ parameters, enabling trainable FFT-like operations inside transformers at 2-4x speedup |
| Circular convolution theorem | Makes FFT-based convolution $O(N \log N)$ instead of $O(N^2)$; used in FlashFFTConv for long-range dependencies; connection to Section 04 where this is developed fully |
| Nyquist-Shannon theorem | Governs sample rate requirements for all audio AI; at 16 kHz (Whisper), the Nyquist limit is 8 kHz, matching the mel filterbank design; misunderstanding aliasing causes silent audio processing bugs |
| Frequency resolution | In FNO: the number of retained modes determines the scale of PDE features the network can learn; too few modes = underfitting; too many = overfitting to noise; analogous to bandwidth selection in statistics |
| Fourier Neural Operator | State-of-the-art for weather prediction (FourCastNet, GraphCast partially inspired), fluid dynamics, and climate modeling at a small fraction of the cost of classical numerical solvers; represents the convergence of PDE numerics and deep learning |
12. Conceptual Bridge
Looking Backward
This section discretizes and algorithmizes the continuous Fourier Transform from Section 20-02. The connection is precise: the DFT evaluates the continuous FT of a sampled, windowed signal at equally-spaced frequencies, with errors governed by the Nyquist theorem and windowing artifacts. Every property of the continuous FT - linearity, shift, scaling, Plancherel - has a discrete analog, with the crucial difference that shifts are now circular. The inner product of $L^2(\mathbb{R})$ becomes the finite inner product on $\mathbb{C}^N$; unitarity of the continuous FT (Plancherel) becomes unitarity of the normalized DFT matrix. The Poisson summation formula from Section 02-7.5 directly explains why sampled spectra are periodic and why aliasing occurs.
The Cooley-Tukey FFT is not a mathematical theorem about the DFT - it is an algorithmic fact about how to implement the DFT matrix-vector product by exploiting the Vandermonde structure of the DFT matrix. The mathematics is in the DFT; the computer science is in the FFT.
Looking Forward
The immediate next step is Section 20-04 Convolution Theorem, which proves the central result previewed in Section 3.6: DFT multiplication equals circular convolution. This makes FFT-based convolution $O(N \log N)$ instead of $O(N^2)$, and underlies every CNN, SSM, and FNO architecture. Overlap-add and overlap-save (introduced in Section 5.5) are the standard tools for applying this theorem to long signals.
Section 20-05 Wavelets addresses the STFT's fundamental limitation: fixed time-frequency resolution. Wavelets use variable window sizes - short windows at high frequencies, long windows at low frequencies - to achieve optimal time-frequency localization simultaneously across all scales. The connection runs through the filter bank theory developed in Section 04: Daubechies wavelets are exactly the solution to the "perfect reconstruction filter bank" problem.
Further afield, the Fourier Neural Operator of Section 8.2 will appear again in advanced PDE-solving contexts, and the spectral graph convolution preview of Section 8.5 connects to Section 11-04 Spectral Graph Theory. The mathematical framework of the DFT as a change of basis in a finite-dimensional Hilbert space generalizes to infinite-dimensional Hilbert spaces in Section 12-02 Hilbert Spaces, where the orthonormal Fourier basis becomes a complete orthonormal system and Parseval's identity extends to an equality that characterizes separable Hilbert spaces.
POSITION IN THE FOURIER ANALYSIS CURRICULUM
========================================================================
Section 20-01 Fourier Series periodic functions, L2[-pi, pi]
v (take period T -> infty)
Section 20-02 Fourier Transform continuous FT, Plancherel, uncertainty
v (sample at rate fs, observe N points)
Section 20-03 DFT and FFT <- YOU ARE HERE
v (DFT multiplication = circular convolution)
Section 20-04 Convolution Theorem CNNs, WaveNet, S4/Mamba, Hyena
v (variable-resolution analysis)
Section 20-05 Wavelets MRA, Daubechies, scattering networks
Key Connections:
+-----------------------------------------------------------+
| DFT = discrete + finite version of the continuous FT |
| FFT = O(N log N) algorithm for computing the DFT matrix |
| STFT = sliding DFT -> time-frequency representation |
| Windowing = trade leakage vs. frequency resolution |
| Parseval: ||x||^2 = ||X||^2 / N (energy conserved) |
| Circular convolution theorem -> leads to Section 04 |
| Fixed resolution limitation of STFT -> leads to Section 05 |
+-----------------------------------------------------------+
========================================================================
Appendix A: Extended DFT Theory
A.1 The DFT as a Projection onto the Fourier Basis
The geometric interpretation of the DFT deserves deeper development. Working in $\mathbb{C}^N$ with the standard Hermitian inner product $\langle x, y \rangle = \sum_{n=0}^{N-1} x[n]\, \overline{y[n]}$, we construct the Fourier basis.
Orthonormality proof. Define $\phi_k[n] = e^{i 2\pi k n / N}$. Then:

$$\langle \phi_k, \phi_l \rangle = \sum_{n=0}^{N-1} e^{i 2\pi (k - l) n / N} = N \delta_{kl}$$

by the geometric sum formula. The normalized DFT matrix $F / \sqrt{N}$ has columns $\phi_k / \sqrt{N}$ (up to conjugation), confirming it is unitary.
The DFT as expansion coefficients. The expansion of any $x \in \mathbb{C}^N$ in the Fourier basis is:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{i 2\pi k n / N}$$

The DFT coefficient $X[k]$ is $\sqrt{N}$ times the coordinate of $x$ in the $k$-th (normalized) Fourier direction.
Projection interpretation. $X[k] = \langle x, \phi_k \rangle$ measures the "overlap" or "correlation" of $x$ with the $k$-th complex sinusoid. A large $|X[k]|$ means $x$ has a significant oscillatory component at frequency $k/N$ cycles per sample.
A.2 Spectral Density and the Periodogram
The periodogram is the classical (non-parametric) estimator of the Power Spectral Density (PSD):

$$\hat{S}[k] = \frac{1}{N} \left| X[k] \right|^2$$
The periodogram is an inconsistent estimator - its variance does not decrease as , even though its mean converges to the true PSD. This is because each DFT coefficient is approximately a chi-squared random variable with 2 degrees of freedom, regardless of .
Smoothed periodogram estimators:
- Bartlett's method: average periodograms from $K$ non-overlapping blocks of length $N/K$. Reduces variance by a factor of $K$ at the cost of frequency resolution (coarser by the same factor $K$).
- Welch's method: average periodograms from overlapping windowed blocks. With 50% overlap, roughly twice as many blocks fit, giving further variance reduction at the same frequency resolution as Bartlett.
- Multitaper method (Thomson, 1982): use multiple orthogonal "Slepian" (DPSS) tapers instead of one window. Optimal bias-variance tradeoff for stationary processes.
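The inconsistency of the raw periodogram, and the variance reduction from averaging, are easy to demonstrate with SciPy on white noise (the signal length and segment length are illustrative):

```python
import numpy as np
from scipy.signal import periodogram, welch

rng = np.random.default_rng(0)
x = rng.standard_normal(16384)            # white noise: the true PSD is flat

f1, P1 = periodogram(x)                   # single-frame periodogram: very noisy
f2, P2 = welch(x, nperseg=1024)           # ~31 averaged 50%-overlapped segments
print(np.var(P1) / np.var(P2))            # large ratio: averaging shrinks the variance
```

Both estimates have roughly the same mean (the flat true PSD); only the Welch estimate's variance shrinks with the number of averaged blocks.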
For AI: Welch's method is used in biomedical signal processing (EEG power spectra), audio analysis, and anomaly detection. Understanding PSD estimation is essential for interpreting spectral features in any time-series ML pipeline.
A.3 The Discrete-Time Fourier Transform (DTFT) Relation
The DTFT of an infinite sequence $x[n]$ is:

$$X(e^{i\omega}) = \sum_{n=-\infty}^{\infty} x[n] \, e^{-i \omega n}$$

The DTFT produces a continuous spectrum over the normalized frequency $\omega \in [-\pi, \pi)$.
DFT as sampled DTFT. If $x[n]$ is nonzero only for $0 \le n \le N-1$ (or equivalently, if we window to this range), then:

$$X[k] = X(e^{i\omega}) \Big|_{\omega = 2\pi k / N}$$

The DFT is obtained by sampling the DTFT at $N$ equally-spaced points on the unit circle. Denser sampling (a larger $N$ via zero-padding) evaluates the DTFT at more points but does not change the underlying DTFT.
The $z$-transform connection. The DTFT evaluates the $z$-transform $X(z)$ on the unit circle $z = e^{i\omega}$. The DFT evaluates it at $N$ equally-spaced points on the unit circle. Poles of $X(z)$ inside the unit circle correspond to stable causal systems.
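The "DFT = sampled DTFT" relation can be verified directly: the zero-padded DFT agrees exactly with the DTFT summed by brute force on the same frequency grid.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
N, M = len(x), 128

# DTFT of the finite sequence, evaluated on a dense frequency grid by direct summation
omega = 2 * np.pi * np.arange(M) / M
dtft = np.array([np.sum(x * np.exp(-1j * w * np.arange(N))) for w in omega])

# The zero-padded DFT evaluates the same DTFT at exactly those M grid points
assert np.allclose(np.fft.fft(x, n=M), dtft)
```

This is the precise sense in which zero-padding "interpolates the spectrum": it samples an unchanged underlying DTFT more densely.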
A.4 Goertzel Algorithm: Single-Frequency DFT
Computing the entire $N$-point DFT costs $O(N \log N)$. If you need only a single DFT coefficient $X[k]$ for one specific frequency bin $k$, the FFT is wasteful. The Goertzel algorithm computes $X[k]$ in $O(N)$ operations using a second-order IIR filter.
Second-order filter interpretation. Define the recurrence $s[n] = x[n] + 2\cos(2\pi k / N)\, s[n-1] - s[n-2]$, initialized with $s[-1] = s[-2] = 0$. After processing all $N$ samples, $X[k] = e^{i 2\pi k / N}\, s[N-1] - s[N-2]$.
Applications:
- DTMF (Dual-Tone Multi-Frequency) telephone tone decoding: detect specific frequencies (697, 770, 852, 941, 1209, 1336, 1477, 1633 Hz)
- Musical pitch detection when only octave-frequency analysis is needed
- Frequency-selective filtering in embedded systems with limited memory
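The Goertzel recurrence is a direct loop; the sketch below follows one common formulation (real implementations run it as a streaming filter rather than over a stored array):

```python
import numpy as np

def goertzel(x, k):
    """Goertzel algorithm: compute the single DFT coefficient X[k] in O(N)."""
    N = len(x)
    w = 2 * np.pi * k / N
    coeff = 2 * np.cos(w)
    s1 = s2 = 0.0
    for sample in x:                       # second-order IIR recurrence
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    return np.exp(1j * w) * s1 - s2        # X[k] = e^{i w} s[N-1] - s[N-2]

x = np.cos(2 * np.pi * 3 * np.arange(32) / 32)
assert np.allclose(goertzel(x, 3), np.fft.fft(x)[3])   # matches the full FFT's bin 3
```

For DTMF decoding, one such loop per target tone replaces an entire FFT.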
Appendix B: Advanced FFT Topics
B.1 In-Place FFT and Memory Efficiency
The butterfly computation in the DIT-FFT can be performed in-place: the two output values of each butterfly can overwrite the two input values, because after the butterfly both inputs are no longer needed. This means the entire -point FFT can be computed using only the complex input/output array (plus twiddle factor storage).
Implementation sketch:
FOR each stage s = 1, 2, ..., log2(N):
stride = N / 2^s
FOR each group g = 0, 1, ..., 2^(s-1) - 1:
twiddle = omega^{-g * stride}
FOR each butterfly within group:
a = x[position]
b = twiddle * x[position + stride]
x[position] = a + b (overwrite)
x[position + stride] = a - b (overwrite)
This avoids allocating a second array of size , halving memory usage - critical for embedded systems and for FFTs of very large signals.
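The sketch above translates into a complete in-place implementation in pure Python (bit-reversal first, then the butterfly stages; this is a didactic version, not an optimized library routine):

```python
import cmath

def fft_inplace(x):
    """Iterative radix-2 DIT FFT, in place: bit-reverse, then log2(N) butterfly stages."""
    n = len(x)
    j = 0                                      # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    length = 2                                 # butterfly stages: 2, 4, ..., n
    while length <= n:
        w_step = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0 + 0j
            for k in range(start, start + length // 2):
                a, b = x[k], w * x[k + length // 2]
                x[k] = a + b                   # butterfly outputs overwrite inputs
                x[k + length // 2] = a - b
                w *= w_step
        length <<= 1
    return x

print(fft_inplace([1 + 0j, 0j, 0j, 0j]))       # impulse -> flat spectrum [1, 1, 1, 1]
```

Only the array `x` and a handful of scalars are live at any point, illustrating the memory claim above.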
B.2 Multi-Dimensional FFT
The 2D DFT is separable: apply a 1D FFT to each row, then a 1D FFT to each column (or vice versa):

$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1} e^{-i 2\pi k_1 n_1 / N_1} \sum_{n_2=0}^{N_2-1} x[n_1, n_2] \, e^{-i 2\pi k_2 n_2 / N_2}$$

Computational cost: $N_1$ FFTs of length $N_2$ (rows) plus $N_2$ FFTs of length $N_1$ (columns) = $O(N_1 N_2 \log(N_1 N_2))$.
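Separability is a one-line NumPy check: composing 1D FFTs over each axis reproduces `fft2` exactly.

```python
import numpy as np

img = np.random.default_rng(2).standard_normal((8, 16))
rows_then_cols = np.fft.fft(np.fft.fft(img, axis=1), axis=0)  # 1D FFTs: rows, then columns
assert np.allclose(rows_then_cols, np.fft.fft2(img))          # identical to the 2D FFT
```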
Applications in AI:
- Image processing: 2D FFT used for frequency-domain image filtering, noise removal, and texture analysis
- Convolutional neural networks: image convolution can be computed via 2D FFT for large kernels ($O(N^2 \log N)$ vs. $O(N^2 K^2)$ for direct convolution of an $N \times N$ image with a $K \times K$ kernel)
- FNO on 2D domains: the 2D FNO for Navier-Stokes uses a 2D DFT with $k_{\max} \times k_{\max}$ retained modes
B.3 Number-Theoretic Transform (NTT)
The Number-Theoretic Transform replaces complex arithmetic with arithmetic modulo a prime $p$. The NTT uses a primitive $N$-th root of unity modulo $p$ - an integer $\omega$ such that $\omega^N \equiv 1 \pmod{p}$ and $\omega^k \not\equiv 1 \pmod{p}$ for $0 < k < N$.
Why NTT matters: unlike the FFT with floating-point arithmetic, the NTT is exact (integer arithmetic modulo $p$). It is used for:
- Polynomial multiplication modulo a prime: key operation in lattice-based cryptography (NTRU, CRYSTALS-Kyber)
- Large integer multiplication: GNU MP library uses NTT for multiplying large integers
- Post-quantum cryptography: Kyber and Dilithium (NIST post-quantum standards) rely on NTT for efficient polynomial operations
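A toy NTT makes the exactness concrete. The tiny modulus and length below are for illustration only; production systems (e.g., Kyber) use large NTT-friendly primes and FFT-style butterfly evaluation rather than this $O(N^2)$ direct sum.

```python
# Toy NTT modulo p = 17 with N = 4: omega = 4 is a primitive 4th root of unity
# (4^2 = 16 = -1 mod 17, 4^4 = 1 mod 17).
p, N, omega = 17, 4, 4

def ntt(a):
    return [sum(a[n] * pow(omega, k * n, p) for n in range(N)) % p for k in range(N)]

def intt(A):
    inv_omega = pow(omega, p - 2, p)      # modular inverses via Fermat's little theorem
    inv_N = pow(N, p - 2, p)
    return [inv_N * sum(A[k] * pow(inv_omega, k * n, p) for k in range(N)) % p
            for n in range(N)]

a = [1, 2, 3, 4]
assert intt(ntt(a)) == a                  # exact round trip: no floating-point error
```

Because every operation stays in integers mod $p$, the inverse transform recovers the input bit-for-bit, which is what cryptographic polynomial arithmetic requires.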
For AI security: language model serving with privacy guarantees (homomorphic encryption) requires NTT-based polynomial multiplications. The efficiency of NTT directly affects the feasibility of private inference for LLMs.
Appendix C: Worked Examples
C.1 Complete DFT Computation: 8-Point Example
Let x = (x[0], x[1], ..., x[7]) be a real, symmetric 8-point signal. Compute the 8-point DFT.
Step 1: Note that x is symmetric: x[n] = x[(8 - n) mod 8] for all n. By conjugate symmetry for real input, X[8 - k] = X[k]*. For a symmetric real input, all X[k] are therefore real.
Step 2: Split into even and odd subsequences:
- Even: (x[0], x[2], x[4], x[6])
- Odd: (x[1], x[3], x[5], x[7])
Step 3: Compute the 4-point DFT E[k] of the even subsequence.
Step 4: Compute the 4-point DFT O[k] of the odd subsequence.
Step 5: Combine with twiddle factors W_8^{-k} = e^{-2 pi i k / 8}:
X[k] = E[k] + W_8^{-k} O[k],  X[k + 4] = E[k] - W_8^{-k} O[k],  for k = 0, 1, 2, 3.
The exact arithmetic is tedious by hand but follows mechanically from the combine step.
Verification: X[0] must equal the sum of the samples, and Parseval's relation sum_n |x[n]|^2 = (1/8) sum_k |X[k]|^2 must hold.
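The example's structure can be checked numerically. The input below is hypothetical (the worked values are not reproduced here); any real, symmetric 8-point signal behaves the same way:

```python
import numpy as np

x = np.array([4., 3., 2., 1., 0., 1., 2., 3.])   # hypothetical; x[n] == x[(8-n) % 8]
X = np.fft.fft(x)

assert np.allclose(X.imag, 0)            # symmetric real input -> real DFT
assert np.isclose(X[0].real, x.sum())    # X[0] equals the sum of samples
# Parseval: sum |x|^2 == (1/N) sum |X|^2
assert np.isclose(np.sum(np.abs(x)**2), np.sum(np.abs(X)**2) / 8)
```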
C.2 Leakage Quantification: Rectangular vs Hann
Signal: x[n] = cos(2 pi * 3.5 * n / 16) for N = 16 samples. The frequency lies exactly halfway between bins 3 and 4.
Rectangular window: all 16 DFT values are nonzero. The two closest bins (k = 3, k = 4) carry most of the energy, but their magnitudes fall well short of the maximum of 8 reached by a bin-aligned cosine. Leakage energy (all bins except 3, 4 and their mirror images) is a significant fraction of the total.
Hann window: w[n] = 0.5 - 0.5 cos(2 pi n / (N - 1)). The main lobe is now 4 bins wide (bins 2, 3, 4, 5 have significant energy). Leakage energy outside the main lobe drops by roughly two orders of magnitude - about a 100x reduction in far-field leakage.
Trade-off: the Hann window "blurs" the frequency axis (wider main lobe), but dramatically reduces interference from distant frequency components. For audio analysis with multiple simultaneous tones, the Hann window is almost always preferred.
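A sketch that reproduces this comparison numerically (bin choices as in the text; exact percentages depend on the window convention):

```python
import numpy as np

N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * 3.5 * n / N)   # exactly halfway between bins 3 and 4

def leakage(windowed, keep):
    """Fraction of spectral power outside the kept bins (and their mirrors)."""
    p = np.abs(np.fft.fft(windowed)) ** 2
    kept = sum(p[k] + p[(N - k) % N] for k in keep)
    return 1 - kept / p.sum()

rect = leakage(x, [3, 4])                        # rectangular window
hann = leakage(x * np.hanning(N), [2, 3, 4, 5])  # Hann: 4-bin main lobe
print(f"rect leakage: {rect:.3f}, hann leakage: {hann:.3f}")
```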
C.3 FFT for Image Compression (Preview)
A grayscale image of size N x N has N^2 pixel values. The 2D DFT produces N^2 complex frequency coefficients.
Low-frequency concentration: natural images have most energy in low spatial frequencies - slowly varying regions (sky, smooth surfaces). Empirically, the low-frequency corner of the DFT magnitude array (the lowest 16 frequencies in each direction) typically captures most of the total image energy.
Compression idea: retain only the K x K lowest-frequency DFT coefficients. Setting all others to zero and applying the IDFT gives a low-frequency reconstruction. For example, with N = 256 and K = 16, only 16^2 = 256 of the 65,536 coefficients are kept (about 0.4%), with visually acceptable reconstruction for most images.
This is the conceptual predecessor of JPEG, which uses the Discrete Cosine Transform (DCT) - the real-valued cousin of the DFT - in 8 x 8 blocks, achieving better compression because it avoids complex arithmetic and reduces boundary artifacts.
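A toy version of the compression idea, assuming a small synthetic 64 x 64 image (parameters are illustrative only):

```python
import numpy as np

N, K = 64, 8
yy, xx = np.mgrid[0:N, 0:N]
# Hypothetical smooth test image: two low-frequency sinusoids plus faint noise.
img = (np.sin(2 * np.pi * xx / N) + np.cos(4 * np.pi * yy / N)
       + 0.05 * np.random.default_rng(0).standard_normal((N, N)))

F = np.fft.fft2(img)
mask = np.zeros_like(F)
# Keep the K lowest frequencies in each direction (the four corners of the
# unshifted FFT layout hold the low positive and negative frequencies).
mask[:K, :K] = 1
mask[:K, -K:] = 1
mask[-K:, :K] = 1
mask[-K:, -K:] = 1
recon = np.fft.ifft2(F * mask).real

err = np.linalg.norm(img - recon) / np.linalg.norm(img)
print(f"kept {4*K*K} of {N*N} coefficients, relative error {err:.3f}")
```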
Appendix D: Numerical Implementation Notes
D.1 Choosing FFT Size for Maximum Speed
Different values of N require dramatically different computation times even with the same asymptotic complexity. FFTW benchmarks (in rough order of efficiency):
| Type of N | Relative speed |
|---|---|
| Powers of 2 (N = 2^m) | Fastest |
| Products of small primes (e.g. N = 2^a 3^b 5^c) | Very fast |
| Smooth numbers (all prime factors small) | Fast |
| Prime N (uses Bluestein's algorithm) | Slow - avoid if possible |
| N = 2p for a large prime p | Slow |
Rule of thumb: always zero-pad to a highly composite length. scipy.fft.next_fast_len(N) returns the smallest integer >= N that has only small prime factors.
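A pure-Python sketch of what such a search does, restricted here to 5-smooth sizes (the real scipy routine admits a somewhat larger factor set):

```python
def is_smooth(n, primes=(2, 3, 5)):
    """True if n has no prime factor outside `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_fast_len(n):
    """Smallest 5-smooth integer >= n (sketch of a next_fast_len-style search)."""
    while not is_smooth(n):
        n += 1
    return n

print(next_fast_len(1031))   # 1031 is prime; pads up to 1080 = 2^3 * 3^3 * 5
```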
D.2 Avoiding Common Numerical Pitfalls
Precision requirements:
- float32 (single precision): machine epsilon ~ 1.2e-7. FFT round-off accumulates slowly (roughly O(eps sqrt(log N))), so even for N ~ 10^6 the relative error stays on the order of 1e-6.
- float64 (double precision): machine epsilon ~ 2.2e-16; FFT error for N ~ 10^6 is on the order of 1e-15.
For audio processing (16-bit integer input -> 32-bit float): single precision is fine. For scientific computing or precision-sensitive applications: use double precision.
Large DFT sizes and memory bandwidth: for large N (millions of points and beyond), the FFT becomes memory-bandwidth-limited rather than compute-limited. The working set (the data accessed at each butterfly stage) exceeds the CPU cache, and cache misses dominate runtime. For very large FFTs, consider algorithms with cache-oblivious access patterns (e.g., FFTW's recursive divide-and-conquer).
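The float32/float64 gap can be measured directly. Since numpy's FFT promotes input to double precision, the sketch below uses a small dtype-preserving recursive FFT for the comparison:

```python
import numpy as np

def fft_rec(x):
    """Recursive radix-2 FFT that preserves the input dtype."""
    N = len(x)
    if N == 1:
        return x.copy()
    E = fft_rec(x[::2])                 # even-indexed samples
    O = fft_rec(x[1::2])                # odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N).astype(x.dtype)
    return np.concatenate([E + tw * O, E - tw * O])

rng = np.random.default_rng(0)
x64 = rng.standard_normal(1 << 12) + 1j * rng.standard_normal(1 << 12)
ref = np.fft.fft(x64)                   # double-precision reference

for dt in (np.complex64, np.complex128):
    err = np.abs(fft_rec(x64.astype(dt)) - ref).max() / np.abs(ref).max()
    print(np.dtype(dt).name, err)       # single precision: much larger error
```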
D.3 GPU FFT Considerations
CUDA FFT (cuFFT) and ROCm FFT achieve peak performance for:
- Power-of-2 sizes on any GPU
- Batch FFTs (many independent FFTs of the same size): significantly faster than sequential single FFTs
- Real-to-complex (rfft) transforms on modern GPUs with Tensor Cores
For ML training: when using FNO or FlashFFTConv, always batch the FFT computation across the mini-batch and channel dimensions simultaneously. torch.fft.fft(x, dim=-1) can process a (batch, channels, N) tensor with a single cuFFT call, achieving near-peak GPU utilization.
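The batched-call semantics look like this in NumPy (a stand-in for the torch.fft/cuFFT path; the shapes are illustrative):

```python
import numpy as np

# (batch, channels, N) tensor: one call transforms all batch*channels signals.
x = np.random.default_rng(0).standard_normal((8, 4, 256))
X = np.fft.fft(x, axis=-1)      # batched over the leading dimensions

# Same result as transforming each signal separately, but issued as one call:
print(np.allclose(X[3, 2], np.fft.fft(x[3, 2])))   # → True
```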
Appendix E: Connections to Other Areas
E.1 DFT and Polynomial Multiplication
The DFT has an elegant algebraic interpretation: it converts polynomial multiplication in the ring C[x]/(x^N - 1) into pointwise multiplication of coefficient sequences.
Fact. Let a(x) = sum_n a_n x^n and b(x) = sum_n b_n x^n. Their product modulo x^N - 1 has coefficients given by the circular convolution c = a (circ) b. The DFT converts this to C[k] = A[k] B[k] (pointwise product). Thus: the DFT diagonalizes the cyclic convolution operator.
More precisely, the matrix of circular convolution by h (the circulant matrix C_h) has eigenvectors equal to the Fourier basis vectors v_k, with eigenvalues H[k] (the DFT of h): C_h v_k = H[k] v_k.
The DFT simultaneously diagonalizes all circulant matrices. This is the algebraic foundation of the Convolution Theorem (Section 04).
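This diagonalization can be verified directly (a small numerical sketch):

```python
import numpy as np

N = 8
h = np.random.default_rng(0).standard_normal(N)

# Circulant matrix C_h: column k is h circularly shifted down by k,
# so (C_h x) is the circular convolution h (circ) x.
C = np.array([np.roll(h, k) for k in range(N)]).T

F = np.fft.fft(np.eye(N))            # the N x N DFT matrix
D = F @ C @ np.linalg.inv(F)         # conjugation by the DFT matrix

# D is diagonal, with the DFT of h on the diagonal:
print(np.allclose(D, np.diag(np.fft.fft(h))))   # → True
```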
E.2 DFT and Number Theory: Quadratic Gauss Sums
The quadratic Gauss sum G = sum_{n=0}^{N-1} e^{-2 pi i n^2 / N} arises naturally in number theory and signal processing. For odd N its magnitude satisfies |G| = sqrt(N), which underlies several FFT-based tricks with quadratic (chirp) phases.
Bluestein's algorithm uses the identity nk = (n^2 + k^2 - (k - n)^2) / 2 to rewrite the DFT as a convolution:
X[k] = e^{-pi i k^2 / N} sum_{n=0}^{N-1} (x[n] e^{-pi i n^2 / N}) e^{+pi i (k - n)^2 / N}
This is a linear convolution of a_n = x[n] e^{-pi i n^2 / N} with the chirp kernel b_n = e^{+pi i n^2 / N}, computable via zero-padded FFTs of any convenient length - hence an O(N log N) DFT for any N, including prime N.
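A compact implementation of the chirp trick (an illustrative sketch, not a tuned library routine):

```python
import numpy as np

def bluestein_dft(x):
    """DFT of arbitrary length N via chirp convolution (Bluestein)."""
    N = len(x)
    n = np.arange(N)
    chirp = np.exp(-1j * np.pi * n * n / N)      # e^{-i pi n^2 / N}
    M = 1 << (2 * N - 1).bit_length()            # power-of-2 length >= 2N - 1
    a = np.zeros(M, dtype=complex)
    a[:N] = x * chirp
    b = np.zeros(M, dtype=complex)
    b[:N] = np.conj(chirp)                       # b_n = e^{+i pi n^2 / N}
    b[M - N + 1:] = np.conj(chirp[1:])[::-1]     # b_{-n} = b_n, wrapped around
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    return chirp * conv[:N]

x = np.random.default_rng(1).standard_normal(7)  # N = 7: prime length works too
print(np.allclose(bluestein_dft(x), np.fft.fft(x)))   # → True
```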
E.3 DFT and the Good-Thomas Prime Factor Algorithm
When N = N1 N2 with gcd(N1, N2) = 1, the Good-Thomas (Prime Factor) Algorithm maps the 1D DFT to a 2D N1 x N2 DFT via the Chinese Remainder Theorem (CRT). This gives a different algorithm from Cooley-Tukey that:
- Requires no twiddle-factor multiplications between the two stages
- Is advantageous when N has several small mutually coprime factors
The CRT output mapping is k <-> (k1, k2) = (k mod N1, k mod N2), with a matching index map on the input. The DFT then factors as:
X[k1, k2] = sum_{n1} sum_{n2} x[n1, n2] W_{N1}^{n1 k1} W_{N2}^{n2 k2}
Connection to representation theory: the Good-Thomas algorithm expresses the DFT on the cyclic group Z/N in terms of DFTs on Z/N1 and Z/N2 when N1 and N2 are coprime (since Z/N is isomorphic to Z/N1 x Z/N2). The general theory of the DFT on finite abelian groups is a special case of harmonic analysis on locally compact abelian groups (Pontryagin duality), which is developed in full in Section 12 Functional Analysis.
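The CRT re-indexing can be demonstrated for the toy case N = 15 = 3 x 5 (a sketch; the helper names are mine):

```python
import numpy as np

def good_thomas_dft(x, N1, N2):
    """Prime-factor DFT for N = N1*N2 with gcd(N1, N2) = 1: no twiddles."""
    N = N1 * N2
    n1, n2 = np.meshgrid(np.arange(N1), np.arange(N2), indexing="ij")
    # Input map (Ruritanian): n = (N2*n1 + N1*n2) mod N
    grid = x[(N2 * n1 + N1 * n2) % N]
    # Two passes of small DFTs - a plain 2D DFT, no twiddle factors between stages.
    G = np.fft.fft(np.fft.fft(grid, axis=0), axis=1)
    # Output map (CRT): the k with k = k1 (mod N1) and k = k2 (mod N2)
    e1 = N2 * pow(N2, -1, N1)        # = 1 mod N1, = 0 mod N2
    e2 = N1 * pow(N1, -1, N2)        # = 0 mod N1, = 1 mod N2
    X = np.empty(N, dtype=complex)
    X[(e1 * n1 + e2 * n2) % N] = G   # scatter (k1, k2) back to k
    return X

x = np.random.default_rng(0).standard_normal(15)
print(np.allclose(good_thomas_dft(x, 3, 5), np.fft.fft(x)))   # → True
```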
Appendix F: DFT Properties - Detailed Proofs
F.1 Time Reversal
If y[n] = x[(-n) mod N], then Y[k] = X[(-k) mod N].
Proof. Y[k] = sum_n x[(-n) mod N] e^{-2 pi i nk / N}. Substituting m = (-n) mod N (a relabeling of the summation index): Y[k] = sum_m x[m] e^{+2 pi i mk / N} = sum_m x[m] e^{-2 pi i m (-k) / N} = X[(-k) mod N].
Consequence. For real x, X[(-k) mod N] = X[k]* (conjugate symmetry, Section 3.4). Time reversal in the time domain maps to frequency reversal - and, for real signals, conjugation - in the frequency domain.
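A quick numerical check of both statements:

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
rev = (-np.arange(N)) % N                  # index map n -> (-n) mod N

X = np.fft.fft(x)
Y = np.fft.fft(x[rev])                     # DFT of the time-reversed signal
assert np.allclose(Y, X[rev])              # Y[k] = X[(-k) mod N]

xr = rng.standard_normal(N)                # real signal
Xr = np.fft.fft(xr)
assert np.allclose(Xr[rev], np.conj(Xr))   # reversal = conjugation for real input
```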
F.2 Duality
If y[n] = X[n] (i.e., the time-domain signal equals the DFT of another signal x), then: Y[k] = N x[(-k) mod N].
Proof. From the IDFT: x[m] = (1/N) sum_n X[n] e^{+2 pi i nm / N}. Evaluate at m = (-k) mod N:
x[(-k) mod N] = (1/N) sum_n X[n] e^{-2 pi i nk / N} = (1/N) Y[k].
Therefore Y[k] = N x[(-k) mod N].
Application. If x is a rectangular pulse (1 on a block of consecutive samples, else 0), its DFT is a Dirichlet kernel. By duality, if the input is a Dirichlet kernel, the DFT is (essentially) a rectangular pulse. This is the discrete analog of the rectangle-sinc FT pair.
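Numerically, applying the DFT twice exposes the duality directly:

```python
import numpy as np

N = 8
x = np.random.default_rng(0).standard_normal(N) + 0j
rev = (-np.arange(N)) % N

# DFT applied twice returns N times the time-reversed signal:
XX = np.fft.fft(np.fft.fft(x))
print(np.allclose(XX, N * x[rev]))   # → True
```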
F.3 DFT of Real and Imaginary Parts
For a complex signal z[n] = x[n] + i y[n] with real and imaginary parts x and y, and Z = DFT(z):
X[k] = (Z[k] + Z[(-k) mod N]*) / 2,    Y[k] = (Z[k] - Z[(-k) mod N]*) / (2i)
This allows computing two real DFTs in the time of one complex DFT - an important optimization for processing stereo audio or in-phase/quadrature (IQ) signals.
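A sketch of the two-for-one trick (using NumPy for clarity):

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x, y = rng.standard_normal(N), rng.standard_normal(N)

Z = np.fft.fft(x + 1j * y)                 # one complex FFT
Zrev = np.conj(Z[(-np.arange(N)) % N])     # Z[(-k) mod N]*
X = (Z + Zrev) / 2                         # DFT of the real part
Y = (Z - Zrev) / 2j                        # DFT of the imaginary part

assert np.allclose(X, np.fft.fft(x))
assert np.allclose(Y, np.fft.fft(y))
```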
Appendix G: Windowing Theory - Spectral Analysis Framework
G.1 The Window Design Problem
Given a finite observation x[n] w[n], n = 0, ..., N - 1, of a longer signal x, the DFT gives:
X_w[k] = (1/N) (X (circ) W)[k]
where W is the DFT of the window and (circ) denotes circular convolution. The true spectrum is "blurred" by W.
Ideal window properties (mutually contradictory):
- W[k] = delta[k] (no leakage - only possible for infinite observation)
- w[n] = 1 for all n (rectangular window: maximum time-domain energy)
- Compact support in both time and frequency (impossible by the uncertainty principle)
Any window must trade off between these competing desiderata. The Kaiser window, with its adjustable shape parameter beta, offers a continuous tradeoff between resolution and leakage control.
G.2 Sidelobe Rolloff Rate
The rate at which sidelobes decrease away from the main lobe is called the rolloff rate (in dB per octave):
- Rectangular window: discontinuities at the endpoints -> sidelobes ~ 1/f -> -6 dB/octave rolloff
- Hann window: continuous at the endpoints (value 0, derivative 0) but discontinuous second derivative -> ~ 1/f^3 -> -18 dB/octave rolloff
- Blackman window: also zero at the endpoints with zero first derivative -> -18 dB/octave rolloff, but a much lower peak sidelobe than Hann
Generalization. A window whose first p - 1 derivatives are continuous at the endpoints (with the p-th derivative discontinuous) has sidelobe falloff ~ |f|^{-(p+1)}, corresponding to a rolloff rate of -6(p + 1) dB/octave. This is a direct consequence of the Riemann-Lebesgue lemma applied to the windowed Fourier series.
G.3 Optimal Windows: Prolate Spheroidal Wave Functions
The Slepian sequences (discrete prolate spheroidal wave functions, DPSS) are the theoretically optimal windows in the following sense:
Problem. Among all length-N windows w with unit energy (sum_n w[n]^2 = 1), find the one that maximizes the fraction of spectral energy concentrated in the band [-W, W] (a specified bandwidth):
lambda(w) = (int_{-W}^{W} |W_hat(f)|^2 df) / (int_{-1/2}^{1/2} |W_hat(f)|^2 df)
The solution is the first Slepian sequence: the eigenvector corresponding to the largest eigenvalue of the N x N matrix A with entries A[m, n] = sin(2 pi W (m - n)) / (pi (m - n)) (and A[n, n] = 2W).
Slepian sequences achieve theoretically optimal concentration - the maximum possible energy fraction in a given bandwidth - and are used in multitaper spectral estimation (Section A.2).
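The concentration eigenproblem can be solved directly for a small case (N and W below are arbitrary illustrative choices):

```python
import numpy as np

N, W = 64, 0.05          # window length and half-bandwidth (cycles/sample)
m = np.arange(N)
d = m[:, None] - m[None, :]

# Concentration matrix: A[m, n] = sin(2*pi*W*(m - n)) / (pi*(m - n)), diag = 2W
with np.errstate(invalid="ignore", divide="ignore"):
    A = np.sin(2 * np.pi * W * d) / (np.pi * d)
A[np.arange(N), np.arange(N)] = 2 * W

eigvals, eigvecs = np.linalg.eigh(A)
slepian = eigvecs[:, -1]         # first Slepian sequence (largest eigenvalue)
concentration = eigvals[-1]      # fraction of energy inside [-W, W]
assert np.isclose(slepian @ A @ slepian, concentration)

print(f"lambda_0 = {concentration:.6f}")   # very close to 1 for this N*W
```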
Appendix H: The FFT in Modern Hardware
H.1 Vectorization and SIMD
Modern CPUs provide SIMD (Single Instruction Multiple Data) instructions that process multiple values simultaneously:
- SSE2 (128-bit): 2 single-precision complex numbers per operation
- AVX2 (256-bit): 4 single-precision complex numbers per operation
- AVX-512 (512-bit): 8 single-precision complex numbers per operation
FFTW automatically generates SIMD-optimized code for each detected instruction set. At single precision, AVX-512 can provide up to approximately an 8x speedup over scalar code.
H.2 GPU FFT (cuFFT)
NVIDIA's cuFFT library achieves peak performance by:
- Tiling: decomposing the DFT into tiles that fit in shared memory (SRAM)
- Warp-level butterfly: mapping butterfly computations to warp-level shuffle operations
- Tensor Core utilization: for batch FFTs, reformulating twiddle-factor multiplication as GEMM operations executable on Tensor Cores
Peak throughput: an NVIDIA H100 reaches its highest FFT throughput on batch FFTs of power-of-2 size with float16 (half precision).
H.3 Distributed FFT for Very Large Transforms
For transforms exceeding a single GPU's memory, the DFT can be decomposed across P GPUs using the row-column algorithm. Write N = N1 N2 and view the data as an N1 x N2 matrix:
- Distribute the N1 rows across the P GPUs: each GPU holds N1 / P rows
- Each GPU applies 1D FFTs to its rows (N2-point FFTs)
- An all-to-all communication transposes the data (global shuffle)
- Each GPU applies 1D FFTs to its columns (N1-point FFTs) and applies the twiddle factors
- A final all-to-all restores the original layout
Total communication: N / P complex numbers per GPU per transpose step.
Applications: seismic data processing, climate modeling, and cosmological simulations routinely require distributed FFTs far beyond single-device memory.