Math for LLMs

Discrete Fourier Transform and FFT


"The FFT is the most important numerical algorithm of our lifetime."

  • Gilbert Strang, MIT Mathematics

Overview

The Discrete Fourier Transform (DFT) is the numerically computable version of the Fourier Transform: given N equally-spaced samples of a signal, it produces N complex frequency coefficients that describe the signal's spectral content. The DFT is not merely an approximation to the continuous Fourier Transform - it is an exact, invertible change of basis on \mathbb{C}^N, with its own complete theory of properties, convergence, and error.

What makes the DFT indispensable is the Fast Fourier Transform (FFT): the Cooley-Tukey algorithm (1965) reduces the O(N^2) matrix-vector product defining the DFT to O(N \log N) arithmetic operations through a recursive divide-and-conquer strategy. For N = 2^{20} \approx 10^6, this is the difference between 10^{12} and 2 \times 10^7 operations - five orders of magnitude. The FFT is the engine behind digital audio, MRI reconstruction, radar, telecommunications, JPEG compression, and - increasingly - large language models, neural operators, and efficient sequence modeling architectures.

This section develops the DFT rigorously from first principles, derives the Cooley-Tukey algorithm and its O(N \log N) complexity, treats spectral leakage and windowing in depth, introduces the Short-Time Fourier Transform (STFT) for time-varying signals, and traces the FFT through modern AI systems: Whisper's mel spectrogram pipeline, the Fourier Neural Operator, Monarch matrices, and FlashFFTConv.

Prerequisites

  • Fourier Transform - continuous FT definition (\xi-convention), Plancherel's theorem, uncertainty principle, Poisson summation formula (Section 20-02)
  • Fourier Series - complex Fourier coefficients, orthonormality of complex exponentials (Section 20-01)
  • Complex numbers - e^{i\theta} = \cos\theta + i\sin\theta, roots of unity, complex modulus and argument (Section 01)
  • Linear algebra - matrix-vector products, unitary matrices, change of basis, inner products (Section 02)
  • Sampling theory - Nyquist-Shannon theorem (introduced here; connection to Section 02-7.5 Poisson summation)

Companion Notebooks

  • theory.ipynb - DFT matrix, FFT from scratch, butterfly diagrams, window functions, STFT spectrograms, Whisper mel pipeline, FNO spectral layer
  • exercises.ipynb - 10 graded problems: DFT by hand through Monarch butterfly factorization

Learning Objectives

After completing this section, you will:

  1. Define the N-point DFT and compute it by hand for small N using roots of unity
  2. Express the DFT as a unitary matrix transformation and verify orthonormality of the Fourier basis vectors
  3. State and prove the Cooley-Tukey radix-2 FFT recurrence and derive the O(N \log_2 N) complexity
  4. Draw and interpret a butterfly diagram for an 8-point DFT
  5. Apply all standard DFT properties: linearity, circular shift, modulation, conjugate symmetry, Parseval
  6. Explain spectral leakage and quantify the tradeoff between main-lobe width and side-lobe level for Hann, Hamming, Blackman, and Kaiser windows
  7. Set up a correct FFT-based spectral analysis: choose N, build the frequency axis, apply fftshift, handle negative frequencies
  8. Define the STFT and describe the uncertainty tradeoff between time and frequency resolution
  9. Implement the mel-spectrogram pipeline used by OpenAI Whisper (Radford et al., 2022)
  10. Describe the Fourier Neural Operator (Li et al., 2021) and implement its truncated spectral convolution layer
  11. Explain Monarch matrices (Dao et al., 2022) as a product-of-sparse-butterfly-matrices parameterization of FFT-like transforms
  12. Identify and correct the eight most common DFT errors: index conventions, frequency axis misuse, leakage misdiagnosis, normalization ambiguity

Table of Contents


1. Intuition

1.1 From Continuous to Discrete: Sampling a Spectrum

Every practical computation with signals must at some point confront a fundamental constraint: computers cannot store or process continuous functions. What they can store is a finite sequence of numbers - samples of a signal taken at discrete time instants. The passage from the continuous world of the Fourier Transform to the discrete world of the DFT is therefore not a mathematical nicety but a computational necessity.

Suppose we record N samples of a signal x(t) at a sampling rate of f_s samples per second:

x[n] = x(n T_s), \quad T_s = \frac{1}{f_s}, \quad n = 0, 1, \ldots, N-1

The total observation time is T = N T_s. What frequencies can we see in this finite record? Two constraints emerge immediately:

  1. Highest observable frequency (Nyquist): any frequency above f_s/2 will be indistinguishable from a lower frequency due to aliasing - two distinct sinusoids will produce identical sample sequences.
  2. Frequency resolution: the smallest frequency difference we can distinguish is \Delta f = f_s / N = 1/T - the reciprocal of the observation duration.

The DFT honours both constraints. It examines the signal at exactly N equally-spaced frequency bins:

\xi_k = \frac{k f_s}{N}, \quad k = 0, 1, \ldots, N-1

covering one full period [0, f_s) of the periodic spectrum. The first N/2 bins correspond to positive frequencies [0, f_s/2); the remaining N/2 bins correspond to negative frequencies (-f_s/2, 0) wrapped around to [f_s/2, f_s). This wrapping is the discrete manifestation of the Nyquist limit.
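The bin layout just described can be sketched in a few lines of NumPy (the values of fs and N below are illustrative, not from the text):

```python
import numpy as np

# Frequency-axis bookkeeping for an N-point DFT at sampling rate fs.
fs = 8.0                       # sampling rate, samples per second
N = 8
bins = np.arange(N) * fs / N   # xi_k = k * fs / N, covering [0, fs)

# np.fft.fftfreq returns the same bins, with the upper half relabeled
# as the negative frequencies in (-fs/2, 0).
freqs = np.fft.fftfreq(N, d=1/fs)
```

The first N/2 entries of `freqs` agree with `bins`; the upper half differs by exactly -fs, which is the wrap-around described above.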

For AI: Every neural network that processes time-series data - speech, ECG, seismic, financial - ultimately works with sampled signals. Understanding the frequency axis setup of the DFT is essential for interpreting what spectral features (mel filterbanks in Whisper, power spectral density in EEG classifiers) actually represent. Misinterpreting negative-frequency bins is a common source of subtle bugs in spectrogram preprocessing.

1.2 What the DFT Does Geometrically

The space of N-point complex sequences \mathbb{C}^N has dimension N. The DFT provides a complete orthogonal basis for this space - the Fourier basis - consisting of N complex sinusoids at the frequencies \xi_k = k/N:

\mathbf{f}_k = \begin{bmatrix} 1 \\ \omega_N^k \\ \omega_N^{2k} \\ \vdots \\ \omega_N^{(N-1)k} \end{bmatrix}, \quad \omega_N = e^{2\pi i / N}, \quad k = 0, 1, \ldots, N-1

Each basis vector \mathbf{f}_k is a sampled complex sinusoid at frequency k/N cycles per sample. The DFT coefficient X[k] is the inner product of the signal \mathbf{x} with the k-th basis vector:

X[k] = \langle \mathbf{x}, \mathbf{f}_k \rangle = \sum_{n=0}^{N-1} x[n]\, \overline{\omega_N^{nk}} = \sum_{n=0}^{N-1} x[n]\, \omega_N^{-nk}

The DFT is therefore a change of basis from the standard basis (time domain) to the Fourier basis (frequency domain). The magnitude |X[k]| measures how strongly the signal oscillates at frequency k/N cycles per sample; the phase \angle X[k] measures the phase of that oscillation.

The inverse DFT reconstructs the signal by superimposing all N sinusoids with their computed amplitudes and phases:

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, \omega_N^{nk}

The factor 1/N comes from the normalization of the basis - the \mathbf{f}_k have squared norm \lVert \mathbf{f}_k \rVert^2 = N, not 1. Using the orthonormal basis \mathbf{f}_k / \sqrt{N}, both forward and inverse transforms carry a 1/\sqrt{N} factor. Different software packages use different conventions; always check.

Geometric picture: if you think of the signal as a vector in \mathbb{C}^N, the DFT rotates this vector into a new coordinate system. The new coordinate axes are complex sinusoids. The DFT matrix F_N is the rotation matrix (more precisely, a scaled unitary matrix). The operation \mathbf{X} = F_N \mathbf{x} is a matrix-vector product; naive evaluation takes O(N^2) operations. The FFT computes this same product in O(N \log N).
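The matrix-vector view can be verified directly: build F_N explicitly and compare the O(N^2) product against np.fft.fft (a sketch; `dft_matrix` is an illustrative helper, not a NumPy function):

```python
import numpy as np

def dft_matrix(N):
    """Explicit DFT matrix with entries (F_N)_{kn} = omega_N^{-kn}."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

N = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

X_naive = dft_matrix(N) @ x    # O(N^2) matrix-vector product
X_fft = np.fft.fft(x)          # O(N log N) via the FFT
```

Both routes compute the same change of basis; only the operation count differs.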

1.3 Why O(N log N) Changed the World

To appreciate the scale of the FFT's impact, consider the arithmetic:

N          Naive DFT (N^2)    FFT (N log_2 N)    Speedup
64         4,096              384                11
1,024      1,048,576          10,240             102
65,536     4.3 × 10^9         1.0 × 10^6         4,096
10^6       10^12              2.0 × 10^7         50,000

Before the FFT, computing the spectrum of a 1-second audio clip sampled at 44.1 kHz was infeasible in real time. After the FFT, it takes microseconds. The entire modern infrastructure of digital audio, wireless communications (OFDM), radar, MRI, seismology, and antenna design rests on this algorithmic breakthrough.

The secret is that the DFT matrix F_N has enormous structure: it is a Vandermonde matrix with entries that are powers of N-th roots of unity, and these entries satisfy recursive relationships that allow massive cancellation of work through a divide-and-conquer decomposition.

For AI (2026): Long-context language models face a quadratic bottleneck in self-attention: processing a 100K-token context with full attention requires \sim 10^{10} operations. The FFT offers an alternative: if a sequence operation can be formulated as a convolution, it can be computed in O(N \log N) using the FFT. This is the foundation of FNet (Lee-Thorp et al., 2022), FlashFFTConv (Fu et al., 2023), S4 (Gu et al., 2022), and the Fourier Neural Operator (Li et al., 2021). The O(N \log N) vs O(N^2) complexity gap, already dramatic at N = 10^3, becomes existential at N = 10^5.

1.4 Historical Timeline

The story of the FFT is a case study in mathematical rediscovery and the transformative effect of the right algorithm at the right moment.

1805 - Carl Friedrich Gauss derived what we now recognize as the FFT algorithm while computing asteroid orbits. His manuscript lay unpublished and largely unknown.

1903 - Carl Runge independently derived a related fast algorithm for trigonometric interpolation.

1942 - Danielson and Lanczos published a recursive factorization of the DFT based on even-odd splitting, establishing the core mathematical structure.

1965 - James Cooley and John Tukey published "An Algorithm for the Machine Calculation of Complex Fourier Series," the paper that launched the digital signal processing revolution. Their algorithm was motivated by Tukey's desire to detect Soviet nuclear tests through seismic monitoring. The paper, just a few pages, is one of the most cited in the history of applied mathematics.

1966 - Gentleman and Sande introduced the "decimation in frequency" variant; Bergland gave a comprehensive tutorial; the algorithm became standard in every engineering discipline within a decade.

1997 - FFTW - Frigo and Johnson released FFTW ("Fastest Fourier Transform in the West"), an adaptive library that benchmarks and selects the optimal algorithm for any N and hardware. FFTW underlies the FFT routines of MATLAB and Julia and set the performance standard for scientific computing FFT pipelines.

2021 - Fourier Neural Operator (Li et al.) applies the DFT inside a neural network layer to learn solution operators for PDEs in O(N \log N).

2022 - Monarch matrices (Dao et al.) parameterize FFT-like structured transforms as products of sparse butterfly matrices, enabling learnable near-FFT computations in transformer architectures.


2. Formal Definitions

2.1 The N-Point DFT

Definition (DFT). Let \mathbf{x} = (x[0], x[1], \ldots, x[N-1]) \in \mathbb{C}^N be a finite sequence. The N-point Discrete Fourier Transform of \mathbf{x} is the sequence \mathbf{X} = (X[0], X[1], \ldots, X[N-1]) \in \mathbb{C}^N defined by:

X[k] = \sum_{n=0}^{N-1} x[n]\, \omega_N^{-nk}, \quad k = 0, 1, \ldots, N-1

where \omega_N = e^{2\pi i / N} is the primitive N-th root of unity.

The complex number \omega_N^{-nk} = e^{-2\pi i nk/N} is a sampled complex sinusoid at frequency k/N cycles per sample, evaluated at time n. The DFT coefficient X[k] measures the complex amplitude (magnitude and phase) of the frequency-k component of \mathbf{x}.

Sign convention. The minus sign in \omega_N^{-nk} is the physics/engineering convention (analysis kernel e^{-2\pi i nk/N}). Some mathematical texts use \omega_N^{+nk} (positive sign) for the forward transform; this merely relabels forward and inverse. We follow NumPy/SciPy: the forward DFT carries the minus sign.

Roots of unity. The N values \{\omega_N^0, \omega_N^1, \ldots, \omega_N^{N-1}\} are the N-th roots of unity - equally spaced points on the unit circle in \mathbb{C}, starting at 1 and rotating counterclockwise by 2\pi/N at each step. They satisfy:

\sum_{n=0}^{N-1} \omega_N^{nk} = \begin{cases} N & \text{if } k \equiv 0 \pmod{N} \\ 0 & \text{otherwise} \end{cases}

This orthogonality of roots of unity is the key identity underlying all DFT properties and the inversion formula.

Standard examples for small N:

N = 2: \omega_2 = e^{i\pi} = -1. The 2-point DFT is:

X[0] = x[0] + x[1], \quad X[1] = x[0] - x[1]

This is the 2-point Hadamard/Walsh transform - just a sum and difference.

N = 4: \omega_4 = e^{i\pi/2} = i. The 4-point DFT:

X[0] = x[0]+x[1]+x[2]+x[3], \quad X[1] = x[0] - ix[1] - x[2] + ix[3]
X[2] = x[0]-x[1]+x[2]-x[3], \quad X[3] = x[0]+ix[1]-x[2]-ix[3]
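The hand-computed 4-point formulas above can be sanity-checked against np.fft.fft (the input vector is an arbitrary illustrative choice):

```python
import numpy as np

# Evaluate the N=4 DFT formulas by hand and via the library.
x = np.array([1.0, 2.0, 3.0, 4.0])
by_hand = np.array([
    x[0] + x[1] + x[2] + x[3],           # X[0]
    x[0] - 1j*x[1] - x[2] + 1j*x[3],     # X[1]
    x[0] - x[1] + x[2] - x[3],           # X[2]
    x[0] + 1j*x[1] - x[2] - 1j*x[3],     # X[3]
])
X = np.fft.fft(x)
```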

Non-example: An infinite sequence x[n] for n \in \mathbb{Z} does NOT have a DFT - you must first window or truncate it to N points. Applying the DFT directly to an infinite sequence is a category error; what you would compute is the z-transform or DTFT (Discrete-Time Fourier Transform), which is distinct from the DFT.

2.2 Inverse DFT and Normalization Conventions

Definition (IDFT). The Inverse Discrete Fourier Transform is:

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, \omega_N^{nk}, \quad n = 0, 1, \ldots, N-1

Verification. Substitute the DFT into the IDFT:

\frac{1}{N}\sum_k X[k]\,\omega_N^{nk} = \frac{1}{N}\sum_k \left(\sum_m x[m]\,\omega_N^{-mk}\right)\omega_N^{nk} = \sum_m x[m] \cdot \frac{1}{N}\sum_k \omega_N^{(n-m)k} = \sum_m x[m]\, \delta[n-m] = x[n]

using the orthogonality of roots of unity in the penultimate step.

Three normalization conventions (all valid, software-dependent):

  • Engineering (default): forward X[k] = \sum_n x[n]\,\omega_N^{-nk}; inverse x[n] = \frac{1}{N}\sum_k X[k]\,\omega_N^{nk}. Used by NumPy, SciPy, MATLAB.
  • Symmetric (unitary): forward X[k] = \frac{1}{\sqrt{N}}\sum_n x[n]\,\omega_N^{-nk}; inverse x[n] = \frac{1}{\sqrt{N}}\sum_k X[k]\,\omega_N^{nk}. Used by some textbooks.
  • Inverse-normalized: forward X[k] = \frac{1}{N}\sum_n x[n]\,\omega_N^{-nk}; inverse x[n] = \sum_k X[k]\,\omega_N^{nk}. Rare; avoid.

For AI: Always check which convention a library uses. NumPy uses the engineering convention: np.fft.fft applies no 1/N factor, np.fft.ifft applies 1/N. PyTorch's torch.fft.fft matches NumPy. When mixing libraries or implementing from scratch, a convention mismatch is a common silent bug.
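The three conventions map directly onto NumPy's `norm` keyword (a sketch; `norm="forward"` requires NumPy ≥ 1.20):

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(8)
N = len(x)

X_eng = np.fft.fft(x)                   # engineering: no factor forward
X_unit = np.fft.fft(x, norm="ortho")    # symmetric: 1/sqrt(N) both ways
X_inv = np.fft.fft(x, norm="forward")   # 1/N forward, no factor inverse
```

All three differ only by a global scale, and each pairs with the matching inverse to give a perfect round trip.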

2.3 The DFT Matrix

The DFT is a linear map \mathbf{X} = F_N \mathbf{x}, where the DFT matrix F_N \in \mathbb{C}^{N \times N} has entries:

(F_N)_{kn} = \omega_N^{-kn}, \quad k, n = 0, 1, \ldots, N-1

The matrix is fully specified by its single generator \omega_N. Written out explicitly for N = 4:

F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}

Unitarity. The DFT matrix satisfies F_N F_N^* = N I_N, where F_N^* is the conjugate transpose. Equivalently, the normalized DFT matrix \frac{1}{\sqrt{N}}F_N is unitary:

\frac{1}{\sqrt{N}}F_N \cdot \left(\frac{1}{\sqrt{N}}F_N\right)^* = I_N

Proof. The (k, l) entry of F_N F_N^* is:

(F_N F_N^*)_{kl} = \sum_{n=0}^{N-1} \omega_N^{-kn}\, \overline{\omega_N^{-ln}} = \sum_{n=0}^{N-1} \omega_N^{(l-k)n} = \begin{cases} N & \text{if } k=l \\ 0 & \text{if } k \neq l \end{cases}

by the orthogonality of roots of unity. Therefore F_N F_N^* = N I, so F_N^{-1} = \frac{1}{N}F_N^*, confirming the IDFT formula: \mathbf{x} = \frac{1}{N}F_N^* \mathbf{X}.

Key structural properties of F_N:

  • Vandermonde structure: F_N is a Vandermonde matrix with nodes \omega_N^0, \omega_N^{-1}, \ldots, \omega_N^{-(N-1)}
  • Symmetric: (F_N)_{kn} = (F_N)_{nk} - the matrix is symmetric (not Hermitian)
  • Periodic indexing: all indices are taken modulo N, making the transform naturally circular

2.4 DFT as a Change of Basis

The columns of F_N^* (equivalently, the rows of F_N conjugated) are the Fourier basis vectors:

\mathbf{f}_k = \frac{1}{\sqrt{N}}\begin{bmatrix} 1 \\ \omega_N^k \\ \omega_N^{2k} \\ \vdots \\ \omega_N^{(N-1)k} \end{bmatrix} \in \mathbb{C}^N, \quad k = 0, 1, \ldots, N-1

These form an orthonormal basis for \mathbb{C}^N: \langle \mathbf{f}_k, \mathbf{f}_l \rangle = \delta_{kl}.

The DFT coefficient X[k]/\sqrt{N} is the coordinate of \mathbf{x} in the direction \mathbf{f}_k:

\frac{X[k]}{\sqrt{N}} = \langle \mathbf{x}, \mathbf{f}_k \rangle = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}

Fourier basis as "frequency detector" vectors: \mathbf{f}_k oscillates at exactly k cycles across the N samples. If \mathbf{x} is itself a pure sinusoid at frequency k_0, then |X[k]| is zero for all k \neq k_0 and equals N at k = k_0. The DFT "detects" which frequencies are present by computing inner products with all N basis sinusoids simultaneously.
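The "frequency detector" claim is easy to verify numerically: a unit-amplitude complex sinusoid at integer frequency k_0 lights up exactly one bin with value N (a quick sketch):

```python
import numpy as np

N, k0 = 32, 5
n = np.arange(N)
x = np.exp(2j * np.pi * k0 * n / N)   # pure sinusoid at bin k0
X = np.fft.fft(x)
# X[k0] = N; every other bin is zero up to floating-point noise.
```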

Contrast with DTFT (Discrete-Time Fourier Transform): The DTFT is defined for infinite sequences as X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\omega n} and produces a continuous spectrum over \omega \in [0, 2\pi). The DFT is a sampled version of the DTFT, evaluating it at N equally-spaced frequencies. The DFT is the only version that is both finite and exactly invertible.

2.5 Relation to the Continuous Fourier Transform

The DFT approximates the continuous Fourier Transform of a sampled signal. Specifically, if x(t) is bandlimited to [-f_s/2, f_s/2) and sampled at rate f_s, the FT integral \hat{x}(\xi) = \int x(t)\,e^{-2\pi i \xi t}\,dt evaluated at \xi = k\,\Delta f is approximated by the Riemann sum T_s \sum_n x(nT_s)\,\omega_N^{-nk} = T_s\,X[k], so:

X[k] \approx \frac{1}{T_s}\,\hat{x}(k\,\Delta f)

where \Delta f = f_s/N is the frequency bin spacing. The DFT thus provides a discretized snapshot of the continuous spectrum at N frequency points.

Three key approximation errors arise:

  1. Aliasing: frequencies above f_s/2 fold back into [0, f_s/2) - see Section 6.3
  2. Leakage: the signal is observed for only N samples, equivalent to multiplying by a rectangular window, causing sidelobe contamination in the spectrum - see Section 5
  3. Picket fence: the DFT only evaluates the spectrum at N discrete frequencies, potentially missing peaks between bins - see Section 6.2

3. Properties of the DFT

All DFT properties follow from linearity of the sum and the orthogonality of roots of unity. We state each property, prove it, and note its discrete-specific character where it differs from the continuous FT.

3.1 Linearity

\mathcal{F}\{a\,x[n] + b\,y[n]\}[k] = a\,X[k] + b\,Y[k]

Proof. Immediate from linearity of summation: \sum_n (ax[n]+by[n])\,\omega_N^{-nk} = a\sum_n x[n]\,\omega_N^{-nk} + b\sum_n y[n]\,\omega_N^{-nk}.

Linearity makes the DFT a linear operator on \mathbb{C}^N, represented by the matrix F_N.

3.2 Circular Shift

If y[n] = x[(n - m) \bmod N] (shift \mathbf{x} right by m positions with wrap-around), then:

Y[k] = \omega_N^{-mk}\, X[k]

Proof.

Y[k] = \sum_{n=0}^{N-1} x[(n-m) \bmod N]\,\omega_N^{-nk} \overset{l = n-m}{=} \sum_{l=0}^{N-1} x[l]\,\omega_N^{-(l+m)k} = \omega_N^{-mk} \sum_{l=0}^{N-1} x[l]\,\omega_N^{-lk} = \omega_N^{-mk}\, X[k]

Important distinction from continuous case. In the continuous FT, a time shift by \tau introduces the factor e^{-2\pi i\xi\tau}. In the DFT, the shift is circular (modular): shifting right pushes the last sample to the first position. If you apply a linear (non-circular) shift to a finite-length signal, the result is NOT simply the DFT multiplied by a phase factor - truncation effects occur. This is a frequent source of confusion when applying the shift property carelessly.

Application. In signal processing, circular shift models delay in a circular buffer. In machine learning, circular convolution (Section 3.6) exploits this property to compute convolutions via FFT.
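The circular-shift property can be checked with np.roll, whose wrap-around is exactly the modular shift in the statement above (a small sketch):

```python
import numpy as np

N, m = 16, 3
rng = np.random.default_rng(2)
x = rng.standard_normal(N)
X = np.fft.fft(x)

y = np.roll(x, m)                          # y[n] = x[(n - m) mod N]
Y = np.fft.fft(y)

k = np.arange(N)
phase = np.exp(-2j * np.pi * m * k / N)    # omega_N^{-mk}
# The shifted spectrum is the original times the phase ramp.
```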

3.3 Modulation (Frequency Shift)

If y[n] = \omega_N^{mn}\, x[n] (multiply by a complex sinusoid), then:

Y[k] = X[(k - m) \bmod N]

This is the dual of the circular shift property: multiplication in time = shift in frequency.

Application. Frequency-shift keying (FSK) in communications. In transformers, rotary position encoding (RoPE) multiplies embeddings by complex exponentials - this is modulation in the DFT sense, shifting the frequency-domain representation of each token by its position.

3.4 Conjugate Symmetry for Real Inputs

If x[n] \in \mathbb{R} for all n, then:

X[N-k] = X[k]^*, \quad k = 0, 1, \ldots, N-1

Proof. X[N-k] = \sum_n x[n]\,\omega_N^{-n(N-k)} = \sum_n x[n]\,\omega_N^{nk} = \sum_n x[n]\,\overline{\omega_N^{-nk}} = \overline{X[k]}, using x[n] \in \mathbb{R} so x[n] = \overline{x[n]}, and \omega_N^{-nN} = 1.

Consequence. Only \lfloor N/2 \rfloor + 1 DFT coefficients X[0], X[1], \ldots, X[\lfloor N/2 \rfloor] are independent; the rest are determined by conjugate symmetry. NumPy's rfft exploits this to return only the non-redundant half, saving memory and computation. For AI applications working with real-valued signals (audio, time-series), always use rfft/irfft.

DC and Nyquist bins: X[0] = \sum_n x[n] \in \mathbb{R} (always real, proportional to the mean). For even N: X[N/2] = \sum_n x[n](-1)^n \in \mathbb{R} (always real, equal to the alternating sum).

3.5 Parseval's Theorem for the DFT

\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2

Proof. Since \frac{1}{\sqrt{N}}F_N is unitary, it preserves the \ell^2 norm:

\lVert \mathbf{x} \rVert^2 = \left\lVert \frac{1}{\sqrt{N}}F_N \mathbf{x} \right\rVert^2 = \frac{1}{N}\lVert \mathbf{X} \rVert^2

Interpretation. Energy is perfectly preserved by the DFT - the total energy in the time domain equals the total energy in the frequency domain (up to the 1/N normalization factor). This is the discrete analog of Plancherel's theorem from Section 02.

Application. The power spectral density (PSD) is |X[k]|^2 / N, which by Parseval sums to the total signal power. In Whisper's mel spectrogram, the power |X[k]|^2 of each FFT bin is aggregated over mel filterbanks and the result is then log-compressed.
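Parseval under the engineering convention, checked numerically (sketch):

```python
import numpy as np

x = np.random.default_rng(4).standard_normal(64)
X = np.fft.fft(x)

time_energy = np.sum(np.abs(x)**2)          # sum_n |x[n]|^2
freq_energy = np.sum(np.abs(X)**2) / len(x) # (1/N) sum_k |X[k]|^2
```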

3.6 Circular Convolution - Preview

The most important DFT property is the circular convolution theorem: pointwise multiplication in frequency corresponds to circular convolution in time.

Definition. The circular (cyclic) convolution of \mathbf{x} and \mathbf{y} in \mathbb{C}^N is:

(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\, y[(n-m) \bmod N]

Circular Convolution Theorem. \mathcal{F}\{x \circledast y\}[k] = X[k] \cdot Y[k].

Preview: This theorem is the entire foundation of FFT-based convolution. The full treatment - including the distinction between circular and linear convolution, overlap-add/save for long signals, filter design, and the connection to CNNs - is the subject of Section 20-04 Convolution Theorem. Here we note the fact; the applications follow in Section 04.

Why circular? The DFT implicitly assumes the signal is periodic with period N. The product X[k]Y[k] is the DFT of the circular (not linear) convolution of \mathbf{x} and \mathbf{y}. To compute linear convolution via FFT, zero-pad both sequences to length N + M - 1 before taking the DFT (Section 6.2).
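Both facts - the theorem itself and the zero-padding trick - can be demonstrated in a few lines (a sketch; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 8
x, y = rng.standard_normal(N), rng.standard_normal(N)

# Circular convolution two ways: directly from the definition, and
# as IDFT(DFT(x) * DFT(y)).
circ_direct = np.array([sum(x[m] * y[(n - m) % N] for m in range(N))
                        for n in range(N)])
circ_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

# Zero-pad to length N + M - 1 to recover *linear* convolution.
L = 2 * N - 1
lin_fft = np.fft.ifft(np.fft.fft(x, L) * np.fft.fft(y, L)).real
```

Without the padding, `circ_fft` wraps contributions around modulo N; with it, `lin_fft` matches np.convolve exactly.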

3.7 DFT Properties Master Table

Property               Time domain x[n]            Frequency domain X[k]
Linearity              a x[n] + b y[n]             a X[k] + b Y[k]
Circular shift         x[(n-m) mod N]              \omega_N^{-mk} X[k]
Modulation             \omega_N^{mn} x[n]          X[(k-m) mod N]
Conjugate symmetry     x[n] \in \mathbb{R}         X[N-k] = X[k]^*
Time reversal          x[-n mod N]                 X[-k mod N] = X[N-k]
Conjugation            x[n]^*                      X[-k mod N]^* = X[N-k]^*
Circular convolution   (x \circledast y)[n]        X[k] Y[k]
Pointwise product      x[n] y[n]                   (1/N)(X \circledast Y)[k]
Parseval               \sum_n |x[n]|^2             (1/N) \sum_k |X[k]|^2
Duality                X[n]                        N x[-k mod N]

4. The Fast Fourier Transform Algorithm

4.1 Complexity of Naive DFT

Direct evaluation of X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk} for all k = 0, \ldots, N-1 requires:

  • N complex multiplications per output bin (multiplying each x[n] by \omega_N^{-nk})
  • N-1 complex additions per bin
  • Total: N^2 complex multiplications, N(N-1) complex additions
  • Each complex multiplication = 4 real multiplications + 2 additions

For N = 1024: about 10^6 complex multiplications. For N = 65{,}536: about 4 \times 10^9. At roughly 10^9 operations per second, a 65,536-point DFT takes ~4 seconds. Audio applications need hundreds of transforms per second. The naive DFT is simply not usable.

4.2 The Cooley-Tukey Insight: Divide and Conquer

Assume N = 2^m (a power of 2). Split the sum into even-indexed and odd-indexed samples:

X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk} = \sum_{r=0}^{N/2-1} x[2r]\,\omega_N^{-2rk} + \omega_N^{-k} \sum_{r=0}^{N/2-1} x[2r+1]\,\omega_N^{-2rk}

Observe that \omega_N^{-2rk} = e^{-2\pi i(2r)k/N} = e^{-2\pi i rk/(N/2)} = \omega_{N/2}^{-rk}. The two sums are therefore (N/2)-point DFTs of the even and odd subsequences:

E[k] = \sum_{r=0}^{N/2-1} x[2r]\,\omega_{N/2}^{-rk} = \text{DFT}_{N/2}\!\left(\{x[0], x[2], x[4], \ldots\}\right)[k]
O[k] = \sum_{r=0}^{N/2-1} x[2r+1]\,\omega_{N/2}^{-rk} = \text{DFT}_{N/2}\!\left(\{x[1], x[3], x[5], \ldots\}\right)[k]

Both E[k] and O[k] are periodic in k with period N/2. Using \omega_N^{-(k+N/2)} = -\omega_N^{-k}, we obtain, for k = 0, \ldots, N/2-1:

\boxed{X[k] = E[k] + \omega_N^{-k}\, O[k]} \qquad \boxed{X[k + N/2] = E[k] - \omega_N^{-k}\, O[k]}

This is the Cooley-Tukey butterfly: two N/2-point DFTs plus N/2 complex multiplications (by the twiddle factors \omega_N^{-k}) yield the full N-point DFT.

Recursion: Apply the same split to each N/2-point DFT, then to each N/4-point DFT, and so on, until we reach 2-point DFTs (which are trivial: X[0] = x[0]+x[1], X[1] = x[0]-x[1]).
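The recurrence translates almost line-for-line into code. A from-scratch radix-2 decimation-in-time FFT (a teaching sketch, not optimized; N must be a power of 2):

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 DIT FFT via the Cooley-Tukey recurrence."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    E = fft_recursive(x[0::2])   # (N/2)-point DFT of even-indexed samples
    O = fft_recursive(x[1::2])   # (N/2)-point DFT of odd-indexed samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors
    return np.concatenate([E + W * O, E - W * O])     # the two butterflies

x = np.random.default_rng(6).standard_normal(64)
X_mine = fft_recursive(x)
```

The two halves of the concatenation are exactly the boxed equations: X[k] = E[k] + W^k O[k] and X[k + N/2] = E[k] - W^k O[k].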

4.3 Butterfly Diagram

The Cooley-Tukey recursion is beautifully represented as a signal flow graph called the butterfly diagram. For an 8-point DFT (N = 8, \log_2 8 = 3 stages):

DFT-8 BUTTERFLY DIAGRAM (Decimation in Time)
========================================================================

  Input           Stage 1          Stage 2          Stage 3     Output
  (bit-rev)       N/8 DFTs         N/4 DFTs         N/2 DFTs

  x[0] ----+--------------+---------------+-------- X[0]
            |              |               |
  x[4] ----+W0   ...      |               |
            |              |               |
  x[2] ----+              |W0   ...       |
            |              |               |
  x[6] ----+              |               |
            |              |               |
  x[1] ----+              |               |W0
            |              |               |
  x[5] ----+              |               |W1
            |              |               |
  x[3] ----+              |               |W2
            |              |               |
  x[7] ------------------------------------------ X[7]

  Each node: X[k] = E[k] + W^k * O[k]
             X[k+N/2] = E[k] - W^k * O[k]

  W^k = omega_N^{-k} = e^{-2*pi*i*k/N}   (twiddle factor)

========================================================================

Each butterfly is a 2-input, 2-output operation:

  a ----*-------- a + W*b
         \      /
          \    /
           \  /
            \/       (butterfly crossing)
            /\
           /  \
          /    \
  b ----*-------- a - W*b

For N = 2^m, the diagram has m = \log_2 N stages, each containing N/2 butterflies. Total butterflies: \frac{N}{2}\log_2 N.

4.4 Decimation in Time vs Decimation in Frequency

Decimation in Time (DIT): Split the input by even/odd indices. The input must be in bit-reversed order (see Section 4.5), while the output emerges in natural order. This is the form described above, and the most common in practice.

Decimation in Frequency (DIF): Split the output by even/odd frequencies. The input is in natural order, while the output emerges in bit-reversed order. The butterfly computation differs slightly, but the complexity is identical.

Both variants require exactly the same number of operations. The choice is often determined by memory access patterns or hardware constraints.

In-place computation: The butterfly computation can be performed in-place - the two outputs overwrite the two inputs without needing extra memory. This makes the FFT extremely memory-efficient for large N.

4.5 Bit-Reversal Permutation

The DIT-FFT requires the input in bit-reversed order. For N = 8 (3-bit indices):

Natural index   Binary   Bit-reversed   Bit-reversed decimal
0               000      000            0
1               001      100            4
2               010      010            2
3               011      110            6
4               100      001            1
5               101      101            5
6               110      011            3
7               111      111            7

Why bit-reversal arises: Each stage of the DIT recursion separates samples by their last bit (even vs odd) - what was the last bit of the original index becomes the first bit at the deepest recursion level. After \log_2 N stages of even-odd splitting, the indices are fully bit-reversed.

Efficient computation: Bit-reversal can be computed in O(N) time using integer bit-reversal operations or a simple loop over the bits of each index. Hardware FFT chips often include dedicated bit-reversal circuits.
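A minimal bit-reversal permutation for N = 2^m, reproducing the table above for N = 8 (a sketch; `bit_reversed_indices` is an illustrative name):

```python
import numpy as np

def bit_reversed_indices(N):
    """Return the bit-reversal permutation of 0..N-1 (N a power of 2)."""
    bits = N.bit_length() - 1        # number of bits per index
    idx = np.zeros(N, dtype=int)
    for i in range(N):
        r, v = 0, i
        for _ in range(bits):
            r = (r << 1) | (v & 1)   # append lowest bit of v to r
            v >>= 1
        idx[i] = r
    return idx
```

For N = 8 this returns [0, 4, 2, 6, 1, 5, 3, 7], matching the table's last column.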

4.6 Complexity Analysis: O(N log N)

Let T(N) be the number of complex multiplications required by the FFT. The recursion is:

T(N) = 2\,T(N/2) + \frac{N}{2}

where N/2 twiddle-factor multiplications are needed to combine two N/2-point DFTs. With T(1) = 0:

Solving the recurrence by repeated substitution (each of the \log_2 N levels contributes N/2 multiplications):

T(N) = 2T(N/2) + N/2
     = 4T(N/4) + N/2 + N/2 = 4T(N/4) + N
     = 8T(N/8) + 3N/2
     \;\vdots
     = N\,T(1) + \frac{N}{2}\log_2 N = \frac{N}{2}\log_2 N

Total operations: \frac{N}{2}\log_2 N complex multiplications and N\log_2 N complex additions. This is O(N \log N).

Comparison with naive DFT (N2N^2 multiplications):

| NN | FFT / DFT multiplication ratio |
|---|-------------------------------|
| 64 | 3/64 ≈ 4.7% |
| 1,024 | 5/1024 ≈ 0.49% |
| 10^6 | ≈ 0.001% |

For large NN, the FFT uses less than 0.01% of the operations required by the naive DFT.

Empirical validation: The O(NlogN)O(N \log N) scaling can be confirmed by timing numpy.fft.fft for N=2kN = 2^k, k=6,,20k = 6, \ldots, 20 - the log-log plot of time versus NN should have slope 1\approx 1, since NlogNN \log N grows only slightly faster than linearly for moderate NN.

4.7 Variants and Extensions

Mixed-radix FFT. The pure radix-2 algorithm requires NN to be a power of 2. Real-world signals rarely have power-of-2 lengths. Mixed-radix algorithms factor N=N1×N2××NrN = N_1 \times N_2 \times \cdots \times N_r into small prime factors and apply a generalized Cooley-Tukey recursion. With N=N1N2N = N_1 N_2, the DFT can be decomposed into N1N_1 DFTs of size N2N_2 and N2N_2 DFTs of size N1N_1. Common radices: 2, 3, 4, 5, 7, 8.

Split-radix FFT. Combines radix-2 and radix-4 butterflies to achieve the minimum theoretical operation count: approximately 13Nlog2N\frac{1}{3}N\log_2 N complex multiplications. The split-radix algorithm is the most operation-efficient known for power-of-2 lengths.

Prime-length algorithms:

  • Rader's algorithm (1968): Converts a prime-length DFT into a circular convolution of composite length, solvable by FFT.
  • Bluestein's algorithm (1970): Converts any DFT into a convolution via the identity nk=(n+k2)(n2)(k2)nk = \binom{n+k}{2} - \binom{n}{2} - \binom{k}{2}, enabling FFT-based computation for any NN.

Real-valued FFT (rfft). For real inputs, conjugate symmetry (Section 3.4) means only N/2+1N/2+1 output values are independent. np.fft.rfft computes only these, roughly halving computation and memory.
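A quick NumPy check of this redundancy:

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(64)   # real signal, N = 64
X_full = np.fft.fft(x)                             # 64 complex values
X_half = np.fft.rfft(x)                            # 33 = N/2 + 1 values

assert len(X_half) == len(x) // 2 + 1
assert np.allclose(X_half, X_full[:33])            # one-sided half of the full DFT
# Conjugate symmetry: the discarded half is the mirror of the kept half
assert np.allclose(X_full[33:], np.conj(X_full[1:32][::-1]))
```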

FFTW (Fastest Fourier Transform in the West). FFTW (Frigo & Johnson, 2005) automatically selects the best algorithm for any NN and hardware through a "plan" computed at initialization time. It achieves near-optimal performance across a wide range of NN and machine architectures. All major scientific computing libraries (NumPy, SciPy, MATLAB, Julia) use FFTW or FFTW-compatible libraries under the hood.

For AI: PyTorch's torch.fft.fft uses CUDA-accelerated FFT (cuFFT) on GPU. For sequence lengths that are not powers of 2, cuFFT uses mixed-radix algorithms. When tuning sequence lengths for FFT-based models (FNet, FNO, FlashFFTConv), choosing N=2kN = 2^k or N=2a3b5cN = 2^a \cdot 3^b \cdot 5^c avoids prime-length slowdowns.


5. Spectral Leakage and Windowing

5.1 The Leakage Problem

The DFT assumes that the NN samples represent one complete period of a periodic signal. In reality, we observe a finite segment of a continuous signal - we multiply by a rectangular window that is 1 inside [0,N1][0, N-1] and 0 outside. In frequency, multiplication by a rectangular window corresponds to convolution with the DFT of the rectangle - the Dirichlet kernel:

Wrect[k]=n=0N1ωNnk={Nk=01ωNNk1ωNk=sin(πk)sin(πk/N)eiπk(N1)/Nk0W_{\text{rect}}[k] = \sum_{n=0}^{N-1} \omega_N^{-nk} = \begin{cases} N & k=0 \\ \frac{1-\omega_N^{-Nk}}{1-\omega_N^{-k}} = \frac{\sin(\pi k)}{\sin(\pi k/N)}e^{-i\pi k(N-1)/N} & k \neq 0 \end{cases}

For non-integer frequencies (frequencies that do not land exactly on a DFT bin), the Dirichlet kernel's large sidelobes spread energy from the true frequency into adjacent bins - this is spectral leakage.

Example. Suppose the signal is x(t)=cos(2πf0t)x(t) = \cos(2\pi f_0 t) with f0=3.5fs/Nf_0 = 3.5 f_s/N (a non-integer number of cycles in the window). The true spectrum has just two peaks at ±f0\pm f_0. The DFT shows energy spread across all NN bins, with the main peak at the nearest bin (k=3k=3 or k=4k=4) and significant sidelobe contamination at all other bins.
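The example can be reproduced numerically; this sketch contrasts an on-bin tone (k=3k = 3 cycles) with the half-bin-offset tone:

```python
import numpy as np

N = 64
n = np.arange(N)

X_on  = np.fft.fft(np.cos(2 * np.pi * 3.0 * n / N))  # integer cycles: no leakage
X_off = np.fft.fft(np.cos(2 * np.pi * 3.5 * n / N))  # half-bin offset: leakage

assert np.sum(np.abs(X_on) > 1e-6) == 2              # only bins 3 and 61 are nonzero
leaked = np.sum(np.abs(X_off) > 0.01 * np.abs(X_off).max())
assert leaked > 10                                    # energy spread across many bins
print(leaked, "of", N, "bins carry more than 1% of the peak magnitude")
```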

Consequence for AI: In Whisper's mel spectrogram, a sharp onset (e.g., a consonant burst) at a non-bin frequency would appear smeared across multiple mel filterbanks. Window choice directly affects how well transient events are separated in the spectrogram.

5.2 Window Functions

A window function w[n]w[n] tapers the signal to zero at its endpoints, reducing the discontinuity that causes leakage. The windowed DFT is:

Xw[k]=n=0N1w[n]x[n]ωNnkX_w[k] = \sum_{n=0}^{N-1} w[n]\, x[n]\, \omega_N^{-nk}

The spectrum XwX_w is the DFT of w[n]x[n]w[n]x[n] - by the product property (Section 3.7), this equals the circular convolution of XX and WW (normalized by 1/N1/N). Windows with smaller sidelobes produce less leakage at the cost of a wider main lobe (reduced frequency resolution).

Standard window functions (all defined for n=0,1,,N1n = 0, 1, \ldots, N-1):

Rectangular (Boxcar):

w[n]=1w[n] = 1
  • Main lobe width: 4π/N4\pi/N (narrowest)
  • Peak sidelobe: 13dB-13\,\mathrm{dB} (highest leakage)
  • Use case: when the signal exactly fits NN samples; spectral analysis of periodic signals at bin frequencies

Hann (Hanning):

w[n]=12(1cos ⁣2πnN1)w[n] = \frac{1}{2}\left(1 - \cos\!\frac{2\pi n}{N-1}\right)
  • Main lobe width: 8π/N8\pi/N
  • Peak sidelobe: 31.5dB-31.5\,\mathrm{dB}
  • Use case: default choice for general spectral analysis; used in Whisper STFT

Hamming:

w[n]=0.540.46cos ⁣2πnN1w[n] = 0.54 - 0.46\cos\!\frac{2\pi n}{N-1}
  • Main lobe width: 8π/N8\pi/N
  • Peak sidelobe: 42.7dB-42.7\,\mathrm{dB}
  • Use case: narrowband applications requiring low sidelobes; slightly wider main lobe than Hann

Blackman:

w[n]=0.420.5cos ⁣2πnN1+0.08cos ⁣4πnN1w[n] = 0.42 - 0.5\cos\!\frac{2\pi n}{N-1} + 0.08\cos\!\frac{4\pi n}{N-1}
  • Main lobe width: 12π/N12\pi/N
  • Peak sidelobe: 58dB-58\,\mathrm{dB}
  • Use case: high-dynamic-range spectral analysis

Kaiser:

w[n]=I0 ⁣(β1(2n/(N1)1)2)I0(β)w[n] = \frac{I_0\!\left(\beta\sqrt{1-(2n/(N-1)-1)^2}\right)}{I_0(\beta)}

where I0I_0 is the zeroth-order modified Bessel function and β\beta is a shape parameter.

  • β=0\beta = 0: rectangular; β5\beta \approx 5: similar to Hamming; β8.6\beta \approx 8.6: similar to Blackman
  • Near-optimal concentration: the Kaiser window approximates the prolate spheroidal (DPSS) window, which maximizes main-lobe energy concentration; the Dolph-Chebyshev window is the one that exactly minimizes peak sidelobe level for a given main-lobe width
  • Use case: when precise sidelobe control is required; used in scientific spectral analysis and filter design
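The quoted sidelobe levels are measurable: zero-pad a window heavily and read the peak of its spectrum outside the main lobe. A sketch for the Hann window (the 2.1-bin cutoff is our choice, just past the Hann null at ±2 bins):

```python
import numpy as np

N, pad = 256, 32
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))       # periodic Hann window

# Finely sampled window spectrum in dB, normalized to the DC peak (= sum of w)
W_db = 20 * np.log10(np.abs(np.fft.fft(w, N * pad)) / w.sum() + 1e-16)
bins = np.fft.fftfreq(N * pad) * N              # frequency axis in original DFT bins

peak_sidelobe = W_db[np.abs(bins) > 2.1].max()  # exclude the main lobe
print(f"Hann peak sidelobe: {peak_sidelobe:.1f} dB")  # close to -31.5 dB
```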

5.3 The Main-Lobe/Side-Lobe Tradeoff

Every window function faces a fundamental tradeoff: main-lobe width (in frequency bins) and peak side-lobe level (in dB) cannot both be minimized simultaneously. This is a consequence of the uncertainty principle applied to finite sequences.

| Window | Main-lobe width (bins) | Peak sidelobe (dB) | Sidelobe rolloff (dB/oct) |
|--------|------------------------|--------------------|---------------------------|
| Rectangular | 2 | -13 | -6 |
| Hann | 4 | -31.5 | -18 |
| Hamming | 4 | -42.7 | -6 |
| Blackman | 6 | -58 | -18 |
| Blackman-Harris | 8 | -92 | -6 |
| Kaiser (β = 8.6) | 8 | -80 | -6 |

Coherent gain: the window amplitude gain, nw[n]/N\sum_n w[n]/N. Windows with lower sidelobes also have lower coherent gain (the main peak is smaller), which must be corrected by dividing by the coherent gain when measuring amplitudes.

Equivalent noise bandwidth (ENBW): the width of a rectangular window that would pass the same noise power. Hann has ENBW =1.5= 1.5 bins - it is effectively 1.5 times wider than the rectangular window for noise measurements.
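Both figures of merit are one-line computations; for the periodic Hann window they come out to exactly 0.5 and 1.5 bins:

```python
import numpy as np

N = 1024
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))  # periodic Hann

coherent_gain = w.sum() / N                 # amplitude gain: 0.5 for Hann
enbw = N * np.sum(w**2) / w.sum()**2        # equivalent noise bandwidth in bins

print(coherent_gain, enbw)                  # approximately 0.5 and 1.5
```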

5.4 Scalloping Loss and the Picket Fence Effect

The DFT evaluates the spectrum at exactly NN uniformly-spaced bins. If a sinusoid's true frequency f0f_0 falls exactly halfway between two bins (at k+0.5k + 0.5 in bin units), then both neighboring bins are equally attenuated. The worst-case attenuation - the scalloping loss - depends on the window:

  • Rectangular: 3.9dB-3.9\,\mathrm{dB} (worst case for nearest-bin amplitude estimate)
  • Hann: 1.4dB-1.4\,\mathrm{dB}
  • Blackman: 0.8dB-0.8\,\mathrm{dB}

This is the picket fence effect: the DFT spectrum looks like a picket fence - it shows signal energy at the posts (bins) but may miss peaks in the gaps between posts.

Mitigation strategies:

  1. Zero-padding (Section 6.2): append zeros to increase NN without changing the observation window, interpolating the spectrum between bins. This does NOT improve frequency resolution (the bandwidth 1/T1/T is unchanged) but allows finer frequency estimation.

  2. Parabolic interpolation: fit a parabola to the peak bin and its two neighbors; the parabola's maximum is a better estimate of the true frequency.

  3. Quinn's estimator and Jacobsen's estimator: closed-form formulas giving sub-bin frequency estimates with low computational cost.
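Strategies 1 and 2 combine naturally. This sketch (the test frequency and tolerance are our choices) zero-pads by 4x and refines the peak with a parabola fitted to the log-magnitude:

```python
import numpy as np

N, pad, f_true = 128, 4, 20.3                # f_true in bins of the original DFT
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))    # Hann window
x = np.cos(2 * np.pi * f_true * n / N)

X = np.abs(np.fft.rfft(w * x, N * pad))      # zero-padded, windowed spectrum
k0 = int(np.argmax(X))
a, b, c = np.log(X[k0 - 1]), np.log(X[k0]), np.log(X[k0 + 1])
delta = 0.5 * (a - c) / (a - 2 * b + c)      # parabola vertex offset in padded bins
f_est = (k0 + delta) / pad                   # back to original-bin units

assert abs(f_est - f_true) < 0.05            # sub-bin accuracy
```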

5.5 Overlap-Add and Overlap-Save

For long signals (audio tracks, continuous data streams), we cannot apply a single DFT to the entire signal. We must process it in blocks. Two standard methods handle the boundary effects at block edges:

Overlap-Add (OLA):

  1. Segment the input into blocks of length MM
  2. Window each block with w[n]w[n] (length MM)
  3. Zero-pad each block to length NM+L1N \geq M + L - 1 (where LL is the filter length)
  4. Compute DFT-based processing on each block
  5. Overlap and add the outputs (adjacent outputs share L1L-1 samples)

Overlap-Save (OLS):

  1. Segment the input into blocks of length NN with overlap of L1L-1 samples
  2. DFT-based processing on each block (no zero-padding needed)
  3. Discard the first L1L-1 samples of each output block (the "contaminated" samples)
  4. Concatenate the remaining NL+1N - L + 1 samples

Both methods allow exact (non-approximative) processing of long signals with FFT-based operations. The COLA (Constant Overlap-Add) condition for perfect reconstruction is:

m=w[nmH]=Cn\sum_{m=-\infty}^{\infty} w[n - mH] = C \quad \forall n

where HH is the hop size. Hann windows with 50% overlap (H=N/2H = N/2) satisfy COLA, making them the standard for STFT-based audio processing.
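The COLA condition for the periodic Hann window at 50% overlap can be verified directly by overlap-adding bare windows:

```python
import numpy as np

M, H, frames = 64, 32, 20                             # 50% overlap: H = M/2
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(M) / M))  # periodic Hann

acc = np.zeros((frames - 1) * H + M)
for l in range(frames):
    acc[l * H : l * H + M] += w        # overlap-add the windows themselves

interior = acc[M:-M]                   # ignore the partially covered edges
assert np.allclose(interior, 1.0)      # constant overlap-add with C = 1
```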


6. Frequency Resolution and Zero-Padding

6.1 Frequency Resolution

The frequency resolution of the DFT is:

Δf=fsN=1T\Delta f = \frac{f_s}{N} = \frac{1}{T}

where T=N/fsT = N/f_s is the total observation duration. This fundamental limit says: to distinguish two sinusoids at frequencies f1f_1 and f2f_2, you need to observe the signal for at least T=1/f1f2T = 1/|f_1 - f_2| seconds. No amount of zero-padding or upsampling can circumvent this - it is a direct consequence of the uncertainty principle from Section 02.

The time-bandwidth product is fixed: TΔf=1T \cdot \Delta f = 1. You cannot simultaneously have:

  • Short observation window (good for tracking rapid changes)
  • Fine frequency resolution (good for distinguishing close frequencies)

For AI: Whisper uses a 25 ms analysis window (400 samples at 16 kHz) with 10 ms hop size, giving Δf=1/0.025=40\Delta f = 1/0.025 = 40 Hz frequency resolution. This is a deliberate engineering choice: fine enough to separate speech formants (typically 100-300 Hz apart) while short enough to track fast phoneme transitions.

6.2 Zero-Padding

Zero-padding: appending MNM - N zeros to a signal of length NN before computing an MM-point DFT (M>NM > N).

What zero-padding does: it computes MM equally-spaced samples of the continuous DTFT of the original NN-point signal (i.e., the DTFT sampled at MM bins instead of NN). This is frequency-domain interpolation - it fills in points between the original NN bins.

What zero-padding does NOT do: improve frequency resolution. The DTFT itself has resolution limited by 1/N1/N (the original number of samples). Zero-padding only reveals the DTFT more finely; it cannot resolve frequencies closer than 1/N1/N apart.

Applications of zero-padding:

  1. Sub-bin frequency estimation (combine with interpolation): zero-padding by factor of 4-8 allows visual identification of spectral peaks with sub-bin accuracy.

  2. Linear convolution via FFT: to convolve signals of lengths L1L_1 and L2L_2 linearly (not circularly), zero-pad both to length L1+L21\geq L_1 + L_2 - 1 before DFT multiplication (Section 3.6). The circular convolution of the zero-padded sequences equals the linear convolution of the originals.

  3. Next power-of-2: zero-pad to the nearest power of 2 to maximize FFT efficiency, e.g., 1000 -> 1024.
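Application 2 in code: zero-pad both sequences to L1+L21L_1 + L_2 - 1, multiply spectra, and the circular convolution becomes the linear one:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.standard_normal(100), rng.standard_normal(31)

M = len(a) + len(b) - 1                        # 130: length of the linear convolution
y_fft = np.fft.irfft(np.fft.rfft(a, M) * np.fft.rfft(b, M), M)

assert np.allclose(y_fft, np.convolve(a, b))   # matches direct linear convolution
```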

For AI: The Fourier Neural Operator (Section 8.2) explicitly truncates the spectrum to the lowest KK frequencies before learning, implicitly treating the signal as if zero-padded beyond the training domain. FlashFFTConv similarly uses zero-padding to convert long circular convolutions to linear ones.

6.3 The Nyquist-Shannon Sampling Theorem

Theorem (Shannon, 1949; Nyquist, 1928; Whittaker, 1915). If a continuous signal x(t)x(t) is bandlimited - meaning x^(ξ)=0\hat{x}(\xi) = 0 for ξ>B|\xi| > B - then x(t)x(t) can be perfectly reconstructed from samples at rate fs2Bf_s \geq 2B.

Why the DFT cares: if fs<2Bf_s < 2B, frequencies above fs/2f_s/2 alias onto lower frequencies. Specifically, a sinusoid at frequency f0>fs/2f_0 > f_s/2 produces the same sample sequence as a sinusoid at frequency fsf0<fs/2f_s - f_0 < f_s/2. In the DFT, bin kk and bin NkN-k represent the same physical frequency if fsf_s is not high enough.

The aliasing formula: a signal at frequency f0f_0 aliases to falias=f0round(f0/fs)fsf_{\text{alias}} = |f_0 - \text{round}(f_0/f_s) \cdot f_s|, i.e., the nearest copy within [0,fs/2)[0, f_s/2).
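A numerical illustration: a 900 Hz tone sampled at 1 kHz lands exactly where a 100 Hz tone would:

```python
import numpy as np

fs, N = 1000, 1000
t = np.arange(N) / fs
x = np.cos(2 * np.pi * 900 * t)        # f0 = 900 Hz > fs/2 = 500 Hz

X = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(N, d=1/fs)
f_peak = freqs[np.argmax(X)]

# |900 - round(900/1000) * 1000| = 100 Hz
assert abs(f_peak - 100.0) < 1e-9
```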

Connection to Poisson summation (from Section 02-7.5): the spectrum of the sampled signal is the sum of periodically-shifted copies of the continuous spectrum:

Xs(ξ)=fsk=x^(ξkfs)X_s(\xi) = f_s \sum_{k=-\infty}^{\infty} \hat{x}(\xi - kf_s)

If adjacent copies overlap (i.e., fs<2Bf_s < 2B), they contaminate each other - this is aliasing. If fs2Bf_s \geq 2B, the copies are non-overlapping and the original spectrum can be recovered by applying a lowpass filter with cutoff at fs/2f_s/2.

Practical rule: In practice, use fs2.5Bf_s \geq 2.5B with an anti-aliasing lowpass filter at fs/2f_s/2 before the ADC. Audio CDs use fs=44.1f_s = 44.1 kHz because human hearing extends to 20\sim 20 kHz (requires fs>40f_s > 40 kHz), with the extra 2.1 kHz providing a transition band for the anti-aliasing filter.

6.4 Practical FFT Setup Checklist

When computing an FFT-based spectral analysis, follow this sequence:

PRACTICAL FFT CHECKLIST
========================================================================

  1. CHOOSE N
     - Prefer N = 2^k (fastest FFT)
     - If not possible: N = 2^a * 3^b * 5^c (still fast)
     - N controls frequency resolution: delta_f = fs / N

  2. APPLY WINDOW
     - Default: Hann window (good sidelobes, modest resolution loss)
     - High dynamic range: Blackman or Kaiser (beta approx 8)
     - If signal is periodic with exactly N samples: rectangular

  3. COMPUTE FFT
     - Real input: use rfft (returns N/2+1 values)
     - Complex input: use fft (returns N values)

  4. BUILD FREQUENCY AXIS
     - For rfft: freqs = np.fft.rfftfreq(N, d=1/fs)   [0, fs/2]
     - For fft:  freqs = np.fft.fftfreq(N, d=1/fs)    [0, ..., (N/2-1)fs/N, -fs/2, ..., -fs/N]
     - After fftshift: freqs = [-fs/2, ..., 0, ..., fs/2]

  5. INTERPRET MAGNITUDE
     - 2|X[k]| / N  ->  amplitude of a real sinusoid (one-sided)
     - |X[k]|^2 / (fs N)  ->  power spectral density (periodogram)
     - Divide by window coherent gain for absolute amplitudes

  6. HANDLE NEGATIVE FREQUENCIES
     - Use fftshift(X) to center zero-frequency
     - For real inputs: mirror the positive half (rfft returns one-sided)

========================================================================
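The checklist end-to-end, for a tone of amplitude 2.0 placed exactly on bin 50 (parameter values are our choices):

```python
import numpy as np

fs, N = 8000, 1024                           # step 1: N = 2^10
n = np.arange(N)
f0 = 50 * fs / N                             # tone exactly on bin 50
x = 2.0 * np.cos(2 * np.pi * f0 * n / fs)

w = 0.5 * (1 - np.cos(2 * np.pi * n / N))    # step 2: Hann window
X = np.fft.rfft(w * x)                       # step 3: real-input FFT
freqs = np.fft.rfftfreq(N, d=1/fs)           # step 4: frequency axis

k = np.argmax(np.abs(X))
cg = w.sum() / N                             # coherent gain = 0.5 for Hann
amp = 2 * np.abs(X[k]) / (N * cg)            # step 5: amplitude with window correction

assert np.isclose(freqs[k], f0)              # peak at the true frequency
assert np.isclose(amp, 2.0)                  # true amplitude recovered
```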

7. Short-Time Fourier Transform

7.1 Motivation: Time-Varying Spectra

The DFT and continuous FT assume the signal's spectral content is stationary - constant over the entire observation window. This assumption fails catastrophically for real-world signals:

  • Speech: phonemes last 20-100 ms, with frequency content changing at each phoneme boundary
  • Music: pitch and timbre evolve continuously; a melody is precisely a time-varying spectrum
  • EEG: neural oscillations appear and disappear; seizure onset is detected by spectral changes
  • Radar: target velocity changes the Doppler shift over time

The Short-Time Fourier Transform (STFT) addresses this by computing a sequence of DFTs on overlapping short segments of the signal. The result is a two-dimensional representation: time on one axis, frequency on the other.

7.2 STFT Definition

Definition (STFT). The Short-Time Fourier Transform of signal x[n]x[n] with analysis window w[m]w[m] of length MM and hop size HH is:

STFT{x}[l,k]=m=0M1x[lH+m]w[m]ωNmk,l=0,1,,L1,  k=0,1,,N1\text{STFT}\{x\}[l, k] = \sum_{m=0}^{M-1} x[lH + m]\, w[m]\, \omega_N^{-mk}, \quad l = 0, 1, \ldots, L-1,\; k = 0, 1, \ldots, N-1

where:

  • ll is the frame index (time axis), lHlH is the starting sample of frame ll
  • HH is the hop size (number of samples between consecutive frames, HMH \leq M)
  • NMN \geq M is the FFT size (zero-padding if N>MN > M)
  • L=(TsignalM)/H+1L = \lfloor (T_{\text{signal}} - M) / H \rfloor + 1 is the number of frames

Parameters and their effects:

| Parameter | Effect |
|-----------|--------|
| Window length MM | Longer MM -> better frequency resolution, worse time resolution |
| Hop size HH | Smaller HH -> more frames (denser time axis), more computation |
| FFT size NN | N>MN > M: zero-padding for a denser frequency grid (not more resolution) |
| Window type | Hann (default), Hamming, Blackman - sidelobe/resolution tradeoff (Section 5.2) |

Whisper's STFT parameters: M=400M = 400 samples (25 ms), H=160H = 160 samples (10 ms), N=400N = 400, window = Hann. At 16 kHz, this gives 201 frequency bins and one frame every 10 ms.

7.3 The Spectrogram

Definition (Spectrogram). The spectrogram is the squared magnitude of the STFT:

Spectrogram[l,k]=STFT{x}[l,k]2\text{Spectrogram}[l, k] = |\text{STFT}\{x\}[l, k]|^2

It represents the power (not amplitude) of each frequency at each time frame. Displayed as a heatmap (time on x-axis, frequency on y-axis, color = power), it visualizes how the frequency content of the signal evolves over time.

Log-power spectrogram: 10log10(STFT2)10\log_{10}(|\text{STFT}|^2) in decibels. The logarithmic scale compresses the dynamic range and matches the perceptual sensitivity of human hearing (roughly logarithmic in both time and frequency).

Mel spectrogram: sum the spectrogram over mel-frequency filterbanks (triangular filters spaced on the mel scale, which is a perceptual frequency scale logarithmic above 1 kHz). Used in Whisper, most modern ASR systems, and many audio AI models. Full details in Section 8.1.

Chirp spectrogram (example): a signal sweeping linearly from f1f_1 to f2f_2 shows a diagonal stripe in the spectrogram - the instantaneous frequency increases linearly with time. This illustrates how the STFT tracks time-varying frequency content.
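A bare-bones STFT (a loop of windowed FFTs; parameter values are our choices) makes the chirp's rising instantaneous frequency visible in the per-frame peak bin:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                                   # 1 second of samples
x = np.cos(2 * np.pi * (200 * t + 0.5 * 3000 * t**2))    # chirp: 200 -> 3200 Hz

M, H = 256, 128                                          # window length, hop size
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(M) / M))     # Hann window
frames = (len(x) - M) // H + 1
S = np.stack([np.abs(np.fft.rfft(w * x[l*H : l*H + M])) for l in range(frames)])

# Peak frequency per frame climbs as the instantaneous frequency rises
peak_hz = np.fft.rfftfreq(M, d=1/fs)[np.argmax(S, axis=1)]
assert S.shape == (frames, M // 2 + 1)
assert peak_hz[-1] > peak_hz[0] + 2000
```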

7.4 Uncertainty Principle for STFT

The STFT cannot escape the Heisenberg uncertainty principle from Section 02. For the windowed transform with window ww of duration TwT_w and bandwidth BwB_w:

TwBw14πT_w \cdot B_w \geq \frac{1}{4\pi}

For a Hann window of length MM samples (duration Tw=M/fsT_w = M/f_s), the effective bandwidth is Bw2fs/MB_w \approx 2f_s/M (the -3 dB bandwidth of the Hann spectral window). The time-frequency resolution product is fixed:

ΔtΔf14π0.08\Delta t \cdot \Delta f \geq \frac{1}{4\pi} \approx 0.08

Practical consequence: you cannot simultaneously achieve:

  • Fine time resolution (short window \to small MM)
  • Fine frequency resolution (long window \to large MM)

For speech recognition (Whisper): M=400M = 400 samples -> Δt=25ms\Delta t = 25\,\text{ms}, Δf40Hz\Delta f \approx 40\,\text{Hz}. For musical pitch tracking (need Δf<1Hz\Delta f < 1\,\text{Hz} to distinguish semitones near 100 Hz): requires M>16000M > 16000 samples (>1> 1 second window). The STFT is inadequate for simultaneously good time and frequency resolution across scales - this is exactly why wavelets exist (Section 05 Wavelets).

7.5 Synthesis and Perfect Reconstruction

Overlap-Add synthesis: Given STFT frames S[l,k]S[l, k] (possibly modified), reconstruct x[n]x[n] by:

x^[n]=lws[nlH]IDFT{S[l,]}[nlH]lwa[nlH]ws[nlH]\hat{x}[n] = \frac{\sum_l w_s[n - lH] \cdot \text{IDFT}\{S[l, \cdot]\}[n-lH]}{\sum_l w_a[n-lH]\, w_s[n-lH]}

where waw_a is the analysis window and wsw_s is the synthesis window.

COLA (Constant Overlap-Add) condition: for perfect reconstruction of unmodified STFT:

l=w[nlH]2=Cn(using synthesis window ws=wa=w)\sum_{l=-\infty}^{\infty} w[n - lH]^2 = C \quad \forall n \quad \text{(using synthesis window } w_s = w_a = w \text{)}

Hann window with H=M/2H = M/2 (50% overlap): the periodic Hann window satisfies w[n]+w[nM/2]=1w[n] + w[n - M/2] = 1 for all nn, so the amplitude COLA condition holds. For the squared condition with identical analysis and synthesis windows, use the square-root Hann window at 50% overlap (or the Hann window at 75% overlap).

Modified DFT (MDCT): audio codecs (AAC, Vorbis, Opus) use the MDCT - a cosine-based transform with 50% overlap and critically sampled (no redundancy). The MDCT satisfies exact reconstruction (called "Time Domain Aliasing Cancellation") and is why modern audio codecs can achieve high quality at low bit rates. Full filter bank theory is developed in Section 05 Wavelets.

Forward reference: The STFT and MDCT are the simplest multi-resolution analyses. Wavelets generalize this by allowing different window lengths at different frequencies - short windows at high frequencies (good time resolution) and long windows at low frequencies (good frequency resolution). See Section 20-05 Wavelets for the full treatment.


8. Applications in Machine Learning

8.1 Whisper: FFT Pipeline for Speech Recognition

OpenAI Whisper (Radford et al., 2022) is a family of encoder-decoder transformers (39M to 1.55B parameters) trained on 680,000 hours of multilingual speech. Its input is not raw audio but an 80-channel log-mel spectrogram computed by a fixed FFT pipeline.

The Whisper preprocessing pipeline:

WHISPER AUDIO PREPROCESSING PIPELINE
========================================================================

  Raw audio (PCM, 16 kHz)
        |
        v
  1. Resample to 16 kHz (if needed)
        |
        v
  2. Pad/trim to 30-second segments (480,000 samples)
        |
        v
  3. STFT: window=Hann(400), hop=160, N_FFT=400
     -> Complex spectrogram: shape (201, T_frames)
        |
        v
  4. Power spectrogram: |STFT|^2 -> shape (201, T_frames)
        |
        v
  5. Apply 80 mel filterbanks (triangular, mel-spaced, 0-8000 Hz)
     -> Mel power spectrogram: shape (80, T_frames)
        |
        v
  6. Log compression: log10(max(mel_spec, 1e-10)), clamped to max - 8.0
        |
        v
  7. Normalize: (log_mel + 4) / 4
        |
        v
  Transformer input: shape (80, 3000) for 30-second clip
  (3000 frames x 10 ms/frame = 30 seconds)

========================================================================

Mel filterbanks: The mel scale maps frequency ff (Hz) to perceived pitch mm (mel):

m=2595log10 ⁣(1+f700)m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)

Filterbanks are triangular filters equally spaced on the mel scale, covering the range [fmin,fmax]=[0,8000]Hz[f_{\min}, f_{\max}] = [0, 8000]\,\text{Hz}. Human speech information is concentrated below 8 kHz; the mel scale matches the roughly logarithmic frequency resolution of the cochlea.
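The mapping and its inverse in code (helper names ours), including the mel-spaced filterbank center frequencies:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 80 filterbank centers, equally spaced in mel over [0, 8000] Hz
mels = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 80 + 2)
centers_hz = mel_to_hz(mels[1:-1])

print(round(hz_to_mel(700.0), 1))  # 781.2 mel: the 700 Hz "knee" of the scale
assert np.allclose(mel_to_hz(hz_to_mel(centers_hz)), centers_hz)  # round trip
```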

Why log-mel spectrograms? (1) Log compression matches the roughly logarithmic perception of loudness (Stevens' power law). (2) Log mel spectrograms are approximately translation-equivariant in the log-frequency dimension - a pitch shift is a translation in log-frequency. (3) Empirically, they produce much better ASR results than raw spectrograms or MFCCs (Mel-Frequency Cepstral Coefficients) for large-scale transformer models.

For AI: Every audio transformer (Whisper, AudioPaLM, MusicGen, AudioLM) uses a similar FFT -> mel spectrogram pipeline as its front-end. Understanding the pipeline is essential for debugging audio preprocessing bugs, which are among the most common sources of silent failure in audio AI systems.

8.2 Fourier Neural Operator (FNO)

The Fourier Neural Operator (Li et al., 2021, NeurIPS) is a neural architecture for learning solution operators of PDEs. Instead of learning to map one function to another pointwise, the FNO learns in the frequency domain.

Problem setting: given input function a:DRa : D \to \mathbb{R} (e.g., initial condition of a PDE), learn the operator G:au\mathcal{G} : a \mapsto u where uu is the solution (e.g., PDE solution at time TT).

FNO layer: each FNO layer vtvt+1v_t \mapsto v_{t+1} applies:

vt+1(x)=σ ⁣(Wvt(x)+F1 ⁣[RϕF[vt]](x))v_{t+1}(x) = \sigma\!\left(W v_t(x) + \mathcal{F}^{-1}\!\left[R_\phi \cdot \mathcal{F}[v_t]\right](x)\right)

where:

  • F\mathcal{F} is the DFT along the spatial dimension
  • RϕCK×dv×dvR_\phi \in \mathbb{C}^{K \times d_v \times d_v} is a learned complex weight matrix (parametrized by ϕ\phi)
  • KK is the number of retained frequency modes (truncation threshold)
  • WW is a local linear transform (pointwise)
  • σ\sigma is a nonlinearity

Key innovation: truncate to the lowest KK Fourier modes before applying RϕR_\phi:

(F[vt])k=0for k>K(\mathcal{F}[v_t])_k = 0 \quad \text{for } k > K

This implements a global convolution with a low-frequency kernel - learning only the large-scale structure of the operator, discarding high-frequency noise. For 2D problems (e.g., Navier-Stokes), truncate to (K1,K2)(K_1, K_2) modes in each dimension.

Computational cost: O(NlogN)O(N \log N) for the DFT plus O(Kdv2)O(K d_v^2) for the spectral convolution, vs. O(N2)O(N^2) for full global convolution. For N=105N = 10^5 and K=20K = 20, the FNO spectral layer is roughly 250x cheaper than full attention.
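A single spectral layer can be sketched in NumPy (illustrative only: the weights are random rather than learned, and ReLU stands in for the generic nonlinearity σ):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d_v, K = 256, 8, 16                          # grid points, channels, retained modes

v = rng.standard_normal((N, d_v))               # input function, d_v channels
R = rng.standard_normal((K, d_v, d_v)) + 1j * rng.standard_normal((K, d_v, d_v))
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)

V_hat = np.fft.rfft(v, axis=0)                  # per-channel spectrum: (N/2+1, d_v)
out_hat = np.zeros_like(V_hat)
out_hat[:K] = np.einsum('kio,ki->ko', R, V_hat[:K])  # mix channels, lowest K modes only
spectral = np.fft.irfft(out_hat, n=N, axis=0)   # back to physical space

v_next = np.maximum(0.0, v @ W + spectral)      # local term + spectral term, ReLU

assert v_next.shape == (N, d_v)
```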

For AI: FNO was the first neural architecture to demonstrate that learning in frequency space outperforms learning in physical space for many PDE tasks. It has been extended to weather prediction (Pathak et al., 2022 "FourCastNet"), fluid dynamics, and climate modeling. The core idea - truncate, learn in frequency, invert - generalizes the classic Galerkin method from numerical PDE.

8.3 Monarch Matrices and Structured FFT

Dao et al. (2022, "Monarch: Expressive Structured Matrices for Efficient and Accurate Training") observed that the FFT matrix FNF_N can be factored as a product of sparse butterfly matrices:

FN=Blog2NB2B1PF_N = B_{\log_2 N} \cdots B_2 B_1 P

where PP is the bit-reversal permutation and each BjB_j is a block-diagonal matrix built from N/2N/2 butterfly factors. Each BjB_j has 2N2N non-zero entries (a 2×2 block per butterfly), giving O(N)O(N) parameters per stage and O(NlogN)O(N \log N) total parameters for the full factorization.

Monarch parameterization: instead of fixing the butterfly structure to be the FFT, learn the butterfly weights:

M=PLPRM = P L P^\top R

where LL and RR are block-diagonal "butterfly-like" matrices. This gives a learnable structured matrix with O(NN)O(N\sqrt{N}) parameters (between O(N)O(N) for diagonal and O(N2)O(N^2) for dense).

Applications:

  • Efficient attention: Monarch matrices can approximate attention in O(NN)O(N\sqrt{N}) vs O(N2)O(N^2)
  • Dense layer replacement: swapping standard linear layers for Monarch matrices in transformers gives up to a 2x speedup with minimal accuracy loss
  • FFT-based initialization: Monarch matrices initialized as FFT butterfly factors learn sparse frequency representations naturally

For AI: The butterfly matrix decomposition connects signal processing (FFT structure) to deep learning (efficient matrix approximation). Monarch Mixer (M2) and several emerging efficient transformer variants build on butterfly-structured operators as efficient replacements for dense weight matrices.

8.4 FlashFFTConv and Long-Range Convolutions

Dao et al. (2023, "FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores") addresses a key bottleneck in long-context sequence modeling: computing long convolutions efficiently on modern GPU hardware.

The problem: State Space Models (SSMs) like S4 (Gu et al., 2022) and Mamba (Gu & Dao, 2023) compute long convolutions in O(NlogN)O(N \log N) using the FFT-based convolution theorem. However, standard FFT implementations are memory-bandwidth-bound and do not efficiently utilize the Tensor Cores (matrix-multiply units) of modern GPUs (A100, H100).

FlashFFTConv solution: decompose the long convolution kernel hh of length NN using the Monarch factorization into a product of shorter matrix operations, enabling Tensor Core utilization while preserving O(NlogN)O(N \log N) asymptotic complexity.

Key results:

  • 2-7.3x speedup over standard FFT convolution for sequences up to 4M tokens
  • Linear memory in sequence length (vs. quadratic for full attention)
  • Enables training SSMs on sequences orders of magnitude longer than attention allows

Connection to Mamba: Mamba (Gu & Dao, 2023) uses selective state spaces - a time-varying SSM where the transition matrices depend on the input. The resulting recurrence can be parallelized via a parallel scan (prefix sum), NOT an FFT. FlashFFTConv applies specifically to time-invariant SSMs like S4.

8.5 Spectral Methods in Graph Neural Networks

Graph neural networks that perform spectral graph convolution use the graph Laplacian eigenvectors as a "Fourier basis" for the graph. The connection is:

  • Graph Laplacian L=DAL = D - A (where DD is degree matrix, AA is adjacency)
  • Spectral decomposition L=UΛUL = U \Lambda U^\top provides "graph Fourier modes" UU
  • Graph convolution: y=Ugθ(Λ)Uxy = U \cdot g_\theta(\Lambda) \cdot U^\top x where gθg_\theta is a learnable spectral filter

This is a graph-domain analog of the DFT, with the regular frequency grid replaced by the graph eigenvalue spectrum.
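The analogy in code, for a 6-node path graph (a minimal sketch; the filter gθg_\theta here is a fixed low-pass exponential rather than a learned one):

```python
import numpy as np

# Path graph on 6 nodes: adjacency, degree, Laplacian
N = 6
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A

lam, U = np.linalg.eigh(L)          # graph Fourier modes U, "frequencies" lam
assert np.allclose(U @ np.diag(lam) @ U.T, L)

x = np.random.default_rng(4).standard_normal(N)   # signal on the nodes
g = np.exp(-2.0 * lam)              # low-pass spectral filter g_theta(lam)
y = U @ (g * (U.T @ x))             # graph convolution: U g(Lam) U^T x

assert y.shape == (N,)
```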

Forward reference to Section 11-04: The full treatment of spectral graph convolution - including ChebNet (Defferrard et al., 2016), GCN (Kipf & Welling, 2017), and the connection to the graph Laplacian - is the canonical content of Section 11-04 Spectral Graph Theory. Here we note only the structural analogy: DFT is to regular grids as graph Fourier transform is to irregular graphs.


9. Common Mistakes

#MistakeWhy It's WrongFix
1Using np.fft.fft output directly as frequency-domain amplitude without normalizationfft returns X[k]=nx[n]ωNnkX[k] = \sum_n x[n]\omega_N^{-nk}; the amplitude of a unit sinusoid is N/2N/2, not 1Divide by NN (or N/2N/2 for one-sided real FFT) to recover true sinusoidal amplitudes
| # | Pitfall | Why it's wrong | Fix |
|---|---------|----------------|-----|
| 2 | Building the frequency axis as `np.arange(N) * fs / N` and treating all $N$ bins as positive frequencies | Bins $k > N/2$ represent negative frequencies wrapped around; calling $X[N-1]$ "frequency $(N-1)f_s/N$" is incorrect | Use `np.fft.fftfreq(N, d=1/fs)`, which returns the correct signed frequencies; apply `fftshift` to center zero |
| 3 | Forgetting to apply `fftshift` before plotting a spectrum | Without `fftshift`, the negative-frequency bins appear at the right end of the plot, not the left, making the spectrum appear discontinuous | Always call `fftshift(X)` and `fftshift(freqs)` before plotting a two-sided spectrum |
| 4 | Concluding that zero-padding improves frequency resolution | Zero-padding increases the density of frequency bins but does NOT reduce $\Delta f = 1/T$; two sinusoids closer than $1/T$ remain unresolvable regardless of zero-padding | Say "zero-padding interpolates the spectrum," not "improves resolution"; use longer observation windows for genuine resolution improvement |
| 5 | Applying DFT-property formulas (e.g., the shift property) to linearly-shifted signals | The DFT shift property applies only to circular shifts; a linear shift (with zeros entering from one end) introduces boundary effects that invalidate $Y[k] = \omega_N^{-mk}X[k]$ | Use circular indexing explicitly, or use the linear-convolution approach with appropriate zero-padding |
| 6 | Confusing the DFT normalization conventions of different libraries | NumPy puts $1/N$ on `ifft` only; some textbooks use $1/\sqrt{N}$ on both; PyTorch matches NumPy by default but has `norm` parameter options | Always check the documentation of the specific function; transform a known signal (e.g., pure DC) and verify the expected coefficient |
| 7 | Misinterpreting spectral leakage as separate frequency components | When a sinusoid does not land on a DFT bin, its energy spreads to neighboring bins, appearing as multiple "peaks" in the spectrum | Apply a window function before the DFT and treat main lobes (not individual bins) as frequency indicators |
| 8 | Treating the Nyquist-frequency bin $X[N/2]$ as a complex value | For real input, $X[N/2]$ is always real (by conjugate symmetry: $X[N/2] = X[N/2]^*$); its imaginary part should be zero up to numerical noise | Assert `np.allclose(X[N//2].imag, 0)` for real input; take only the real part when reporting the Nyquist coefficient |
| 9 | Using an FFT size smaller than the signal length | If $N < \text{len}(x)$, NumPy silently truncates the input; this discards signal information and produces incorrect results | Set $N \geq \text{len}(x)$; for linear convolution, use $N \geq L_1 + L_2 - 1$ |
| 10 | Applying windowing after the FFT | Windows must be applied to the time-domain signal before the FFT; multiplying the frequency-domain output by a window function is a circular convolution in frequency, not spectral smoothing | Multiply `x * w` in the time domain, then compute `fft(x * w)` |
| 11 | Overlooking numerical precision loss in the FFT | FFT algorithms propagate floating-point errors; for $N = 2^{20}$, accumulated rounding errors can reach $\sim 10^{-9}$; this matters for precision-sensitive applications | Use `np.float64` (double precision) inputs; for critically sensitive applications, use the error bound $\epsilon_{\text{FFT}} \approx \epsilon_{\text{machine}} \log_2 N$ as a guide |
| 12 | Applying `rfft` to complex-valued signals | `rfft` assumes real input and returns $N/2+1$ values; calling it on complex input discards the imaginary part and produces incorrect results | Use `fft` (not `rfft`) for complex inputs; use `rfft` only when you are certain the signal is real-valued |
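Pitfalls 2-4 can be checked in a few lines; a minimal NumPy sketch (the 200 Hz test tone and the sizes are arbitrary choices for illustration):

```python
import numpy as np

fs, N = 1000.0, 64
t = np.arange(N) / fs
x = np.cos(2 * np.pi * 200.0 * t)            # 200 Hz tone sampled at 1 kHz

X = np.fft.fft(x)
freqs = np.fft.fftfreq(N, d=1/fs)            # signed frequencies (pitfall 2)
assert freqs[N // 2 + 1] < 0                 # upper bins are negative frequencies

# fftshift centers zero frequency for plotting (pitfall 3)
fshift = np.fft.fftshift(freqs)
assert fshift[0] == -fs / 2

# zero-padding interpolates the spectrum but adds no resolution (pitfall 4):
# every 4th bin of the 4N-point spectrum reproduces the original N-point DFT
Xpad = np.fft.fft(x, 4 * N)
assert np.allclose(Xpad[::4], X)
```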

10. Exercises

Exercise 1 * - DFT by Hand

(a) Compute the 4-point DFT of $\mathbf{x} = (1, 0, -1, 0)$ by hand using $\omega_4 = i$. Show all four coefficients $X[0], X[1], X[2], X[3]$.

(b) Verify that $X[0]$ equals the sum of all samples (the DC component) and that $|X[k]|^2$ satisfies Parseval's theorem.

(c) What is the IDFT of $\mathbf{X} = (4, 0, 0, 0)$? Compute by hand and verify.

(d) For $\mathbf{x} = (1, 1, 1, 1)$, what is the DFT? Explain geometrically why $X[k] = 0$ for $k \neq 0$.


Exercise 2 * - DFT Matrix Properties

(a) Construct the $4 \times 4$ DFT matrix $F_4$ explicitly and verify numerically that $F_4 F_4^* = 4I$.

(b) Show that $F_4^2 = N P$ with $N = 4$, where $P$ is the time-reversal permutation matrix (i.e., applying the DFT twice returns the time-reversed signal scaled by $N$).

(c) Compute the eigenvalues of $F_8$ and show they belong to $\sqrt{8}\,\{1, -1, i, -i\}$. (Hint: $F_N^4 = N^2 I$, so the eigenvalues of the unitary matrix $F_N/\sqrt{N}$ lie in $\{1, -1, i, -i\}$.)

(d) For a random complex vector $\mathbf{x} \in \mathbb{C}^8$, numerically verify the unitary property: $\lVert F_8\mathbf{x}/\sqrt{8} \rVert_2 = \lVert \mathbf{x} \rVert_2$.


Exercise 3 * - FFT Implementation from Scratch

(a) Implement the radix-2 Cooley-Tukey FFT recursively in Python (without using numpy.fft). Your function should accept any power-of-2 length.

(b) Test on inputs of length $N \in \{8, 64, 512, 4096\}$. Verify against numpy.fft.fft with relative error $< 10^{-10}$.

(c) Time your implementation vs numpy.fft.fft for $N = 2^{14}$. By what factor is NumPy faster?

(d) Implement the bit-reversal permutation as a standalone function and verify it matches the permutation implicitly used by your FFT.


Exercise 4 ** - Spectral Leakage and Windowing

(a) Generate a signal $x[n] = \cos(2\pi \cdot 3.7 \cdot n/N)$ for $N = 128$ samples (a sinusoid at a non-integer frequency). Compute and plot the DFT magnitude for rectangular, Hann, and Blackman windows.

(b) Quantify the leakage: for the rectangular window, what fraction of the total spectral energy lies outside the main lobe (bins $\{3, 4\}$ and their conjugate images)?

(c) Two sinusoids lie at frequencies $f_1 = 5$ and $f_2 = 5.5$ cycles over $N = 64$ samples. Show that they cannot be resolved with the rectangular window, and that zero-padding to $N = 256$ merely interpolates the merged spectrum without separating the two peaks; only a longer observation window genuinely resolves them.

(d) Implement the Kaiser window (using scipy.special.i0) and sweep $\beta$ from 0 to 14. Plot peak sidelobe level vs. main-lobe width. Reproduce the three window parameter rows from the table in Section 5.3.


Exercise 5 ** - Parseval, Energy, and the Power Spectral Density

(a) Generate a bandlimited noise signal: $x[n] = \sum_{k=10}^{20} \cos(2\pi kn/N + \phi_k)$ for $N = 512$ with random phases $\phi_k \sim \mathcal{U}(0, 2\pi)$. Verify Parseval's theorem numerically: $\sum_n |x[n]|^2 = \frac{1}{N}\sum_k |X[k]|^2$.

(b) Compute the one-sided power spectral density (PSD) and identify the occupied bandwidth.

(c) Compare the PSD estimated by the DFT (single frame) vs Welch's method (scipy.signal.welch, averaged over overlapping frames). Show that Welch's estimate is less noisy.

(d) For white noise $x[n] \sim \mathcal{N}(0, 1)$, verify that the expected PSD is approximately flat, and that the variance of the PSD estimate decreases with the number of Welch averages.


Exercise 6 ** - STFT and Spectrogram Analysis

(a) Generate a chirp signal: $x[n] = \cos\big(2\pi(f_0 + \alpha n/N)\, n/f_s\big)$ with $f_0 = 100$ Hz, $\alpha \cdot f_s = 1000$ Hz, $f_s = 8000$ Hz, $N = 8000$ samples (1 second). Compute and display the STFT spectrogram using a Hann window of length 256 samples with 50% overlap.

(b) Show how the spectrogram changes as you vary the window length $M \in \{64, 256, 1024\}$. Identify the time-frequency resolution tradeoff.

(c) Add a 500 Hz sinusoidal interference to the chirp. Show that a narrow window resolves the interference in time but not frequency, while a wide window resolves it in frequency but not time.

(d) Use the STFT to perform simple spectral subtraction denoising: estimate noise power from a "silent" region, subtract from all frames, reconstruct via overlap-add. Report the signal-to-noise ratio improvement.


Exercise 7 *** - Whisper Mel Spectrogram Pipeline

(a) Implement the full Whisper mel-spectrogram pipeline from scratch:

  • STFT with Hann window (400 samples), hop 160, FFT size 512
  • Power spectrogram $|X|^2$
  • 80 mel filterbanks covering 0-8000 Hz on the mel scale
  • Log compression with floor at $10^{-10}$

Test on a 1-second synthetic signal: a mixture of three sinusoids at 300, 1200, and 3500 Hz.

(b) Verify that the mel energies approximately preserve the total spectral power per frame: $\sum_{m=1}^{80} \text{mel\_spec}[m, l] \approx \sum_k |X[k, l]|^2$ wherever the filterbank weights satisfy $\sum_m w_m[k] \approx 1$ (check before log compression).

(c) Apply your pipeline to the test signal and plot the resulting $80 \times 100$ mel spectrogram (80 mel bins by roughly 100 frames for 1 second of audio). Label the three sinusoid peaks.

(d) Analyze the effect of mel filterbank overlap: are the three sinusoids at 300, 1200, and 3500 Hz resolved in separate mel bins? What is the mel-bin width in Hz at each of the three frequencies?


Exercise 8 *** - Fourier Neural Operator: Spectral Convolution Layer

(a) Implement the 1D FNO spectral convolution layer:

def spectral_conv1d(x, weights, modes):
    # x: (batch, channels, N)
    # weights: (in_channels, out_channels, modes) complex
    # modes: number of retained Fourier modes

The layer: DFT along last axis -> truncate to modes -> multiply by weights -> zero-pad -> IDFT.

(b) Test on the 1D Burgers' equation approximation: generate smooth initial conditions $u_0 \sim \mathrm{GP}\big(0, (-\partial_x^2 + 1)^{-2}\big)$ and apply your spectral layer. Show that retaining $K = 16$ modes captures $> 99\%$ of the energy for smooth functions.

(c) Compare the spectral layer against a naive dense convolution layer of comparable parameter count. For $N = 1024$ and $K = 16$: how many parameters does the spectral layer have vs. the dense layer?

(d) Implement a 2-layer FNO and train it to approximate the integral operator $\mathcal{G}[a](x) = \int_0^1 e^{-|x - y|}\, a(y)\,dy$ on discretized grids of size $N \in \{64, 128, 256\}$. Report the relative $L^2$ error and show that the spectral layer generalizes across grid resolutions (resolution invariance).


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
|---------|--------------|
| FFT $O(N \log N)$ complexity | Foundation of all efficient sequence models: FNet, S4, Mamba, FNO, FlashFFTConv; the difference between feasible and infeasible at 100K+ token lengths |
| DFT matrix unitarity | Guarantees that FFT-based feature maps preserve signal energy; relevant to gradient stability in FNO and FNet, since unitary transforms avoid the vanishing/exploding gradient problem in deep spectral networks |
| Spectral truncation | FNO's key innovation: learning only low-frequency behavior yields generalizable, resolution-invariant operators; matches the inductive bias that real physics is smooth |
| STFT and mel spectrograms | The front end of every modern audio model (Whisper, AudioPaLM, MusicGen, AudioLM, VALL-E 2); getting the FFT pipeline right is a prerequisite for any audio AI work |
| Windowing | Hann/Hamming windows in Whisper's STFT directly affect spectrogram quality; a poor window choice causes leakage artifacts that corrupt the speech features used by the ASR transformer |
| Butterfly matrix structure | Monarch matrices (Dao et al., 2022) parameterize structured transforms with $O(N\sqrt{N})$ parameters, enabling trainable FFT-like operations inside transformers at a 2-4× speedup |
| Circular convolution theorem | Makes FFT-based convolution $O(N \log N)$ instead of $O(N^2)$; used in FlashFFTConv for long-range dependencies; developed fully in Section 04 |
| Nyquist-Shannon theorem | Governs sample-rate requirements for all audio AI; at 16 kHz (Whisper), the Nyquist limit is 8 kHz, matching the mel filterbank design; misunderstanding aliasing causes silent audio-processing bugs |
| Frequency resolution | In FNO, the number of retained modes $K$ determines the scale of PDE features the network can learn; too few modes underfit, too many overfit to noise; analogous to bandwidth selection in statistics |
| Fourier Neural Operator | State of the art for weather prediction (FourCastNet; GraphCast partially inspired), fluid dynamics, and climate modeling at $O(N\log N)$ cost; represents the convergence of PDE numerics and deep learning |

12. Conceptual Bridge

Looking Backward

This section discretizes and algorithmizes the continuous Fourier Transform from Section 20-02. The connection is precise: the DFT evaluates the continuous FT of a sampled signal at $N$ equally-spaced frequencies, with errors bounded by the Nyquist theorem and the windowing artifact. Every property of the continuous FT - linearity, shift, scaling, Plancherel - has a discrete analog, with the crucial difference that shifts are now circular. The $L^2$ inner product on $\mathbb{R}$ becomes the finite inner product on $\mathbb{C}^N$; unitarity of the continuous FT (Plancherel) becomes unitarity of the DFT matrix. The Poisson summation formula from Section 02-7.5 directly explains why sampled spectra are periodic and why aliasing occurs.

The Cooley-Tukey FFT is not a mathematical theorem about the DFT - it is an algorithmic fact about how to implement the DFT matrix-vector product by exploiting the Vandermonde structure of the DFT matrix. The mathematics is in the DFT; the computer science is in the FFT.

Looking Forward

The immediate next step is Section 20-04 Convolution Theorem, which proves the central result previewed in Section 3.6: DFT multiplication equals circular convolution. This makes FFT-based convolution $O(N\log N)$ instead of $O(N^2)$, and underlies every CNN, SSM, and FNO architecture. Overlap-add and overlap-save (introduced in Section 5.5) are the standard tools for applying this theorem to long signals.

Section 20-05 Wavelets addresses the STFT's fundamental limitation: fixed time-frequency resolution. Wavelets use variable window sizes - short windows at high frequencies, long windows at low frequencies - to achieve optimal time-frequency localization simultaneously across all scales. The connection runs through the filter bank theory developed in Section 04: Daubechies wavelets are exactly the solution to the "perfect reconstruction filter bank" problem.

Further afield, the Fourier Neural Operator of Section 8.2 will appear again in advanced PDE-solving contexts, and the spectral graph convolution preview of Section 8.5 connects to Section 11-04 Spectral Graph Theory. The mathematical framework of the DFT as a change of basis in a finite-dimensional Hilbert space generalizes to infinite-dimensional Hilbert spaces in Section 12-02 Hilbert Spaces, where the orthonormal Fourier basis becomes a complete orthonormal system and Parseval's identity extends to an equality that characterizes separable Hilbert spaces.

POSITION IN THE FOURIER ANALYSIS CURRICULUM
========================================================================

  Section 20-01  Fourier Series           periodic functions, L^2[-pi, pi]
          v  (take period T -> infty)
  Section 20-02  Fourier Transform        continuous FT, Plancherel, uncertainty
          v  (sample at rate fs, observe N points)
  Section 20-03  DFT and FFT              <- YOU ARE HERE
          v  (DFT multiplication = circular convolution)
  Section 20-04  Convolution Theorem      CNNs, WaveNet, S4/Mamba, Hyena
          v  (variable-resolution analysis)
  Section 20-05  Wavelets                 MRA, Daubechies, scattering networks

  Key Connections:
  +-----------------------------------------------------------+
  | DFT = discrete + finite version of the continuous FT      |
  | FFT = O(N log N) algorithm for computing the DFT matrix   |
  | STFT = sliding DFT -> time-frequency representation        |
  | Windowing = control leakage vs. frequency resolution     |
  | Parseval: ||x||^2 = ||X||^2 / N  (energy conserved)      |
  | Circular convolution theorem -> leads to Section 04               |
  | Fixed resolution limitation of STFT -> leads to Section 05        |
  +-----------------------------------------------------------+

========================================================================



Appendix A: Extended DFT Theory

A.1 The DFT as a Projection onto the Fourier Basis

The geometric interpretation of the DFT deserves deeper development. Working in $\mathbb{C}^N$ with the standard Hermitian inner product $\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{n=0}^{N-1} u[n]\,\overline{v[n]}$, we construct the Fourier basis.

Orthonormality proof. Define $\mathbf{e}_k = \frac{1}{\sqrt{N}}\big(1, \omega_N^k, \omega_N^{2k}, \ldots, \omega_N^{(N-1)k}\big)^\top$. Then:

$$\langle \mathbf{e}_k, \mathbf{e}_l \rangle = \frac{1}{N}\sum_{n=0}^{N-1} \omega_N^{nk}\,\overline{\omega_N^{nl}} = \frac{1}{N}\sum_{n=0}^{N-1}\omega_N^{n(k-l)} = \delta_{kl}$$

by the geometric sum formula. The normalized DFT matrix $\frac{1}{\sqrt{N}}F_N$ has columns $\mathbf{e}_0, \mathbf{e}_1, \ldots, \mathbf{e}_{N-1}$ (up to conjugation), confirming it is unitary.

The DFT as expansion coefficients. The expansion of any $\mathbf{x} \in \mathbb{C}^N$ in the Fourier basis is:

$$\mathbf{x} = \sum_{k=0}^{N-1} \hat{x}_k\, \mathbf{e}_k, \qquad \hat{x}_k = \langle \mathbf{x}, \mathbf{e}_k \rangle = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk} = \frac{X[k]}{\sqrt{N}}$$

The DFT coefficient $X[k]$ is $\sqrt{N}$ times the coordinate of $\mathbf{x}$ in the $k$-th Fourier direction.

Projection interpretation. $X[k] = \langle \mathbf{x}, \sqrt{N}\,\mathbf{e}_k \rangle$ measures the "overlap" or "correlation" of $\mathbf{x}$ with the $k$-th complex sinusoid. A large $|X[k]|$ means $\mathbf{x}$ has a significant oscillatory component at frequency $k/N$ cycles per sample.
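The projection picture is easy to verify numerically; a minimal sketch with $N = 8$:

```python
import numpy as np

N = 8
n, k = np.arange(N), np.arange(N)
# columns of E are the orthonormal Fourier basis vectors e_k
E = np.exp(2j * np.pi * np.outer(n, k) / N) / np.sqrt(N)

# orthonormality: E^H E = I
assert np.allclose(E.conj().T @ E, np.eye(N))

x = np.random.default_rng(0).standard_normal(N)
coords = E.conj().T @ x                       # <x, e_k> for every k
assert np.allclose(np.sqrt(N) * coords, np.fft.fft(x))   # X[k] = sqrt(N) <x, e_k>
```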

A.2 Spectral Density and the Periodogram

The periodogram is the classical (non-parametric) estimator of the power spectral density (PSD):

$$\hat{S}(f_k) = \frac{1}{N f_s}\, |X[k]|^2, \qquad f_k = \frac{k f_s}{N}$$

The periodogram is an inconsistent estimator: its variance does not decrease as $N \to \infty$, even though its mean converges to the true PSD. This is because each DFT coefficient is approximately a chi-squared random variable with 2 degrees of freedom, regardless of $N$.

Smoothed periodogram estimators:

  1. Bartlett's method: average periodograms from non-overlapping blocks of length $M$. Reduces variance by a factor of $K$ (the number of blocks) at the cost of frequency resolution $\Delta f = f_s/M$.

  2. Welch's method: average periodograms from overlapping windowed blocks. With 50% overlap and $K$ blocks: variance reduction $\approx K/1.5$, with the same frequency resolution as Bartlett.

  3. Multitaper method (Thomson, 1982): use multiple orthogonal Slepian (DPSS) tapers instead of a single window. Optimal bias-variance tradeoff for stationary processes.

For AI: Welch's method is used in biomedical signal processing (EEG power spectra), audio analysis, and anomaly detection. Understanding PSD estimation is essential for interpreting spectral features in any time-series ML pipeline.
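The variance contrast is easy to see on white noise; a sketch using `scipy.signal` (lengths and sample rate are arbitrary choices):

```python
import numpy as np
from scipy.signal import periodogram, welch

rng = np.random.default_rng(0)
fs = 1000.0
x = rng.standard_normal(16384)               # unit-variance white noise: flat PSD

f_p, P_p = periodogram(x, fs=fs)             # single-frame periodogram
f_w, P_w = welch(x, fs=fs, nperseg=512)      # ~63 averaged overlapping frames

# both estimates scatter around the true one-sided level 2/fs = 0.002,
# but the averaged Welch estimate has far smaller variance
print(f"periodogram std: {P_p.std():.2e}, Welch std: {P_w.std():.2e}")
assert P_w.std() < P_p.std() / 3
```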

A.3 The Discrete-Time Fourier Transform (DTFT) Relation

The DTFT of an infinite sequence $x[n]$ is:

$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}, \qquad \omega \in [-\pi, \pi)$$

The DTFT produces a continuous spectrum over the normalized frequency $\omega = 2\pi f / f_s \in [-\pi, \pi)$.

DFT as sampled DTFT. If $x[n]$ is nonzero only for $n = 0, \ldots, N-1$ (or equivalently, if we window to this range), then:

$$X[k] = X(e^{j\omega})\Big|_{\omega = 2\pi k/N} = X\!\left(e^{j2\pi k/N}\right)$$

The DFT is obtained by sampling the DTFT at $N$ equally-spaced points on the unit circle. Denser sampling (a larger $N$ via zero-padding) evaluates the DTFT at more points but does not change the underlying DTFT.
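A quick numerical check of this sampling relation (lengths chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32
x = rng.standard_normal(N)
n = np.arange(N)

# the 8N-point zero-padded DFT samples the same DTFT on a denser grid
X_dense = np.fft.fft(x, 8 * N)

# bin 5 of the dense grid equals the DTFT at omega = 2*pi*5/(8N),
# a frequency that lies between the bins of the N-point DFT
omega = 2 * np.pi * 5 / (8 * N)
dtft_value = np.sum(x * np.exp(-1j * omega * n))
assert np.allclose(X_dense[5], dtft_value)

# and the dense grid passes exactly through the original N-point samples
assert np.allclose(X_dense[::8], np.fft.fft(x))
```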

The $z$-transform connection. The DTFT evaluates the $z$-transform $X(z) = \sum_n x[n]\,z^{-n}$ on the unit circle $z = e^{j\omega}$. The DFT evaluates it at $N$ equally-spaced points on the unit circle. Poles of $X(z)$ inside the unit circle correspond to stable causal systems.

A.4 Goertzel Algorithm: Single-Frequency DFT

Computing the entire $N$-point DFT costs $O(N\log N)$. If you need only a single DFT coefficient $X[k_0]$ for a specific frequency $k_0$, the FFT is wasteful. The Goertzel algorithm computes $X[k_0]$ in $O(N)$ operations using a second-order IIR filter.

Second-order filter interpretation. With $\omega_0 = 2\pi k_0/N$, run the recurrence $y[n] = x[n] + 2\cos(\omega_0)\,y[n-1] - y[n-2]$ (initialized with $y[-1] = y[-2] = 0$) for $n = 0, \ldots, N$, where the final step takes $x[N] = 0$. Then $X[k_0] = y[N] - e^{-j\omega_0}\, y[N-1]$.
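A direct transcription of this recurrence, with the final zero-input step made explicit; a minimal sketch:

```python
import numpy as np

def goertzel(x, k0):
    """Single DFT coefficient X[k0] via the Goertzel second-order recurrence."""
    N = len(x)
    w0 = 2 * np.pi * k0 / N
    coeff = 2 * np.cos(w0)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:                          # N recurrence steps
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    s = coeff * s_prev - s_prev2              # one extra step with x[N] = 0
    return s - np.exp(-1j * w0) * s_prev      # X[k0] = y[N] - e^{-j w0} y[N-1]

x = np.random.default_rng(0).standard_normal(128)
assert np.allclose(goertzel(x, 17), np.fft.fft(x)[17])
```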

Applications:

  • DTMF (Dual-Tone Multi-Frequency) telephone tone decoding: detect specific frequencies (697, 770, 852, 941, 1209, 1336, 1477, 1633 Hz)
  • Musical pitch detection when only octave-frequency analysis is needed
  • Frequency-selective filtering in embedded systems with limited memory

Appendix B: Advanced FFT Topics

B.1 In-Place FFT and Memory Efficiency

The butterfly computation in the DIT-FFT can be performed in-place: the two output values of each butterfly can overwrite the two input values, because after the butterfly both inputs are no longer needed. This means the entire $N$-point FFT can be computed using only the $N$-element complex input/output array (plus $O(\log N)$ twiddle-factor storage).

Implementation sketch:

FOR each stage s = 1, 2, ..., log2(N):
    stride = N / 2^s
    FOR each group g = 0, 1, ..., 2^(s-1) - 1:
        twiddle = omega^{-g * stride}
        FOR each butterfly within group:
            a = x[position]
            b = twiddle * x[position + stride]
            x[position]          = a + b   (overwrite)
            x[position + stride] = a - b   (overwrite)

This avoids allocating a second array of size NN, halving memory usage - critical for embedded systems and for FFTs of very large signals.
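The sketch above translates directly into a runnable in-place FFT; a minimal version, assuming a power-of-2 length (educational only; `numpy.fft` is far faster in practice):

```python
import numpy as np

def fft_inplace(x):
    """Iterative radix-2 DIT FFT, overwriting the complex array x in place."""
    N = len(x)
    # bit-reversal permutation
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # butterfly stages: sub-DFT size m doubles each stage
    m = 2
    while m <= N:
        w_m = np.exp(-2j * np.pi / m)
        for start in range(0, N, m):
            w = 1.0
            for pos in range(start, start + m // 2):
                a = x[pos]
                b = w * x[pos + m // 2]
                x[pos] = a + b                # overwrite
                x[pos + m // 2] = a - b       # overwrite
                w *= w_m
        m *= 2
    return x

x = np.exp(2j * np.pi * 3 * np.arange(8) / 8)     # pure tone at bin 3
assert np.allclose(fft_inplace(x.copy()), np.fft.fft(x))
```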

B.2 Multi-Dimensional FFT

The 2D DFT is separable: apply a 1D FFT to each row, then a 1D FFT to each column (or vice versa):

$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1} x[n_1, n_2]\,\omega_{N_1}^{-n_1 k_1}\,\omega_{N_2}^{-n_2 k_2}$$

Computational cost: $N_2$ FFTs of length $N_1$ (for rows) plus $N_1$ FFTs of length $N_2$ (for columns), i.e. $O(N_1 N_2 \log(N_1 N_2))$ in total.
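Separability is a one-line check in NumPy (array shape arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 16))

# 1D FFTs over the rows, then over the columns, equal the full 2D FFT
rows_then_cols = np.fft.fft(np.fft.fft(img, axis=1), axis=0)
assert np.allclose(rows_then_cols, np.fft.fft2(img))
```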

Applications in AI:

  • Image processing: 2D FFT used for frequency-domain image filtering, noise removal, and texture analysis
  • Convolutional neural networks: image convolution can be computed via 2D FFT for large kernels ($O(N^2 \log N)$ vs $O(N^2 K^2)$ for direct convolution with kernel size $K$)
  • FNO on 2D domains: the 2D FNO for Navier-Stokes uses the 2D DFT with $K^2$ retained modes

B.3 Number-Theoretic Transform (NTT)

The Number-Theoretic Transform replaces complex arithmetic with arithmetic modulo a prime $p$. The NTT uses a primitive $N$-th root of unity modulo $p$: an integer $\omega$ such that $\omega^N \equiv 1 \pmod{p}$ and $\omega^k \not\equiv 1 \pmod{p}$ for $0 < k < N$.

Why NTT matters: unlike the FFT with floating-point arithmetic, the NTT is exact (integer arithmetic modulo $p$). It is used for:

  • Polynomial multiplication modulo a prime: key operation in lattice-based cryptography (NTRU, CRYSTALS-Kyber)
  • Large integer multiplication: the GNU MP library uses FFT-style transforms (Schönhage-Strassen) for multiplying very large integers
  • Post-quantum cryptography: Kyber and Dilithium (NIST post-quantum standards) rely on NTT for efficient polynomial operations

For AI security: language model serving with privacy guarantees (homomorphic encryption) requires NTT-based polynomial multiplications. The efficiency of NTT directly affects the feasibility of private inference for LLMs.


Appendix C: Worked Examples

C.1 Complete DFT Computation: 8-Point Example

Let $\mathbf{x} = (1, 2, 3, 4, 4, 3, 2, 1)$. Compute the 8-point DFT.

Step 1: Note that $\mathbf{x}$ is symmetric about its midpoint: $x[n] = x[N-1-n]$. By conjugate symmetry for real input, $X[k] = X[N-k]^*$. The half-sample symmetry additionally forces linear phase: $X[k] = e^{-i\pi k(N-1)/N} A[k]$ with $A[k]$ real, so every coefficient's phase is known in advance even though the coefficients themselves are complex.

Step 2: Even and odd subsequences:

  • Even: $(x[0], x[2], x[4], x[6]) = (1, 3, 4, 2)$
  • Odd: $(x[1], x[3], x[5], x[7]) = (2, 4, 3, 1)$

Step 3: 4-point DFT of the even subsequence ($\omega_4 = i$):

$$E[0] = 1+3+4+2 = 10, \qquad E[1] = 1-3i-4+2i = -3-i$$
$$E[2] = 1-3+4-2 = 0, \qquad E[3] = 1+3i-4-2i = -3+i$$

Step 4: 4-point DFT of the odd subsequence:

$$O[0] = 2+4+3+1 = 10, \qquad O[1] = 2-4i-3+i = -1-3i$$
$$O[2] = 2-4+3-1 = 0, \qquad O[3] = 2+4i-3-i = -1+3i$$

Step 5: Combine with twiddle factors $\omega_8^k = e^{-i\pi k/4}$:

$$X[0] = E[0] + \omega_8^0 O[0] = 10 + 10 = 20$$
$$X[1] = E[1] + \omega_8^1 O[1] = (-3-i) + \frac{1-i}{\sqrt{2}}(-1-3i)$$

Since $(1-i)(-1-3i) = -4-2i$, the twiddle term is $(-4-2i)/\sqrt{2} = -2\sqrt{2} - \sqrt{2}\,i$, so $X[1] = -(3+2\sqrt{2}) - (1+\sqrt{2})\,i \approx -5.83 - 2.41i$. Carrying out the remaining combinations (using $X[k+4] = E[k] - \omega_8^k O[k]$) gives, numerically:

$$X \approx \big(20,\; -5.83-2.41i,\; 0,\; -0.17-0.41i,\; 0,\; -0.17+0.41i,\; 0,\; -5.83+2.41i\big)$$

Verification: $X[0] = \sum_n x[n] = 1+2+3+4+4+3+2+1 = 20$ PASS. Parseval: $\sum_n x[n]^2 = 1+4+9+16+16+9+4+1 = 60$; $\frac{1}{8}\sum_k |X[k]|^2 = \frac{1}{8}(400 + 2 \cdot 39.8 + 0 + 2 \cdot 0.2) = 60$ PASS.
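NumPy confirms the coefficient values, the linear-phase structure, and Parseval:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 4, 3, 2, 1], dtype=float)
X = np.fft.fft(x)

assert np.isclose(X[0].real, 20)                     # DC = sum of samples
assert np.allclose(X[[2, 4, 6]], 0)                  # even nonzero bins vanish
assert np.isclose(np.abs(X[1]), 6.31, atol=0.01)     # |X[1]| = |-5.83 - 2.41i|

# half-sample symmetry x[n] = x[N-1-n]  =>  X[k] e^{+i pi k (N-1)/N} is real
k = np.arange(8)
assert np.allclose((X * np.exp(1j * np.pi * k * 7 / 8)).imag, 0, atol=1e-12)

# Parseval
assert np.isclose(np.sum(x**2), np.sum(np.abs(X)**2) / 8)
```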

C.2 Leakage Quantification: Rectangular vs Hann

Signal: $x(t) = \cos(2\pi \cdot 3.5\, t)$ sampled at $f_s = 1$ Hz for $N = 16$ samples. The frequency $f_0 = 3.5$ lies exactly halfway between bins 3 and 4.

Rectangular window: all 16 DFT values are nonzero. The two closest bins ($k=3$, $k=4$) have magnitudes $\approx 5.0$ and $\approx 5.2$ (versus 8 for a bin-aligned frequency). Even the farthest bin ($k=8$) still has magnitude $1.0$. Leakage energy (all bins except 3 and 4 and their conjugate images 13 and 12): $\approx 17\%$ of the total.

Hann window: $w[n] = \frac{1}{2}(1 - \cos(2\pi n/15))$. The main lobe is now 4 bins wide (bins 2, 3, 4, 5 carry significant energy). Leakage energy outside the main lobe drops to a fraction of a percent, roughly two orders of magnitude below the rectangular case.

Trade-off: the Hann window "blurs" the frequency axis (wider main lobe), but dramatically reduces interference from distant frequency components. For audio analysis with multiple simultaneous tones, the Hann window is almost always preferred.
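The leakage fractions quoted above can be computed directly; a sketch in which the main-lobe bin sets include the conjugate images:

```python
import numpy as np

N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * 3.5 * n / N)          # half-bin-offset sinusoid

def leakage(sig, main_bins):
    """Fraction of spectral energy outside the given main-lobe bins."""
    P = np.abs(np.fft.fft(sig)) ** 2
    mask = np.zeros(N, dtype=bool)
    mask[list(main_bins)] = True
    return P[~mask].sum() / P.sum()

rect = leakage(x, {3, 4, 12, 13})
hann = leakage(x * np.hanning(N), {2, 3, 4, 5, 11, 12, 13, 14})
print(f"rectangular leakage: {rect:.1%}, Hann leakage: {hann:.2%}")
assert hann < rect / 10
```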

C.3 FFT for Image Compression (Preview)

A grayscale image $I[m, n]$ of size $256 \times 256$ has $2^{16}$ pixel values. The 2D DFT produces $2^{16}$ complex frequency coefficients.

Low-frequency concentration: natural images have most of their energy in low spatial frequencies - slowly varying regions (sky, smooth surfaces). Empirically, the lowest 16 frequencies in each direction (the corner blocks of the unshifted 2D DFT magnitude) typically capture $> 90\%$ of total image energy.

Compression idea: retain only the $K \times K$ lowest-frequency DFT coefficients. Setting all others to zero and applying the IDFT gives a low-frequency reconstruction. For $K = 32$: 1024 retained coefficients out of 65536 ($\approx 1.6\%$), with visually acceptable reconstruction for most images.

This is the conceptual predecessor of JPEG, which uses the Discrete Cosine Transform (DCT) - the real-valued cousin of the DFT - in $8 \times 8$ blocks, achieving better compression by avoiding complex arithmetic and reducing boundary artifacts.
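A sketch of the idea on a synthetic smooth image, keeping the lowest $K$ frequencies in each direction (for the unshifted DFT these live in all four corner blocks; a real photograph would retain somewhat less energy than this idealized example):

```python
import numpy as np

N, K = 256, 32
u, v = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
# smooth synthetic "image": a few low-frequency 2D cosines
img = (np.cos(2 * np.pi * 3 * u / N) * np.cos(2 * np.pi * 5 * v / N)
       + 0.5 * np.cos(2 * np.pi * 8 * (u + v) / N))

F = np.fft.fft2(img)
mask = np.zeros((N, N), dtype=bool)
mask[:K, :K] = True          # low-frequency corner blocks of the unshifted DFT
mask[:K, -K:] = True
mask[-K:, :K] = True
mask[-K:, -K:] = True

recon = np.fft.ifft2(np.where(mask, F, 0)).real
energy_kept = np.sum(np.abs(F[mask])**2) / np.sum(np.abs(F)**2)
print(f"kept {mask.sum()} of {N*N} coefficients, energy retained: {energy_kept:.3%}")
assert energy_kept > 0.999   # all modes of this smooth image lie below K
```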


Appendix D: Numerical Implementation Notes

D.1 Choosing FFT Size for Maximum Speed

Different values of NN require dramatically different computation times even with the same asymptotic complexity. FFTW benchmarks (in rough order of efficiency):

| $N$ type | Relative speed |
|----------|----------------|
| Powers of 2 ($2^k$) | Fastest |
| Products $2^a 3^b 5^c$ | Very fast |
| Smooth numbers (all prime factors $\leq 7$) | Fast |
| Prime $N$ (uses Bluestein's algorithm) | Slow; avoid if possible |
| $N = 2p$ for a large prime $p$ | Slow |

Rule of thumb: always zero-pad to a highly composite number. `scipy.fft.next_fast_len(N)` returns the smallest integer $\geq N$ that has only small prime factors.
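For example (absolute timings are machine-dependent; the smoothness check is the point):

```python
import numpy as np
from timeit import timeit
from scipy.fft import fft, next_fast_len

N = 10007                                    # prime length: slow Bluestein path
L = next_fast_len(N)                         # nearby length with small prime factors
x = np.random.default_rng(0).standard_normal(N)

t_prime = timeit(lambda: fft(x), number=20)
t_smooth = timeit(lambda: fft(x, L), number=20)   # zero-padded to L
print(f"N = {N}: {t_prime:.4f} s   padded to L = {L}: {t_smooth:.4f} s")

# L is composite with only small prime factors
m = L
for p in (2, 3, 5, 7, 11):
    while m % p == 0:
        m //= p
assert m == 1 and L >= N
```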

D.2 Avoiding Common Numerical Pitfalls

Precision requirements:

  • float32 (single precision): round-off error $\approx 10^{-7}$ per operation. For $N = 2^{20}$: accumulated FFT error $\approx 10^{-7} \times \log_2(2^{20}) \approx 2 \times 10^{-6}$.
  • float64 (double precision): error $\approx 10^{-16}$ per operation. FFT error for $N = 2^{20}$: $\approx 3 \times 10^{-15}$.

For audio processing (16-bit integer input -> 32-bit float): single precision is fine. For scientific computing or precision-sensitive applications: use double precision.

Large DFT sizes and memory bandwidth: for $N > 10^6$, the FFT becomes memory-bandwidth-limited rather than compute-limited. The working set (the data accessed at each butterfly stage) exceeds the CPU cache, and cache misses dominate runtime. For very large FFTs, consider FFT algorithms with cache-oblivious access patterns (e.g., FFTW's recursive divide-and-conquer).

D.3 GPU FFT Considerations

CUDA FFT (cuFFT) and ROCm FFT achieve peak performance for:

  • Power-of-2 sizes on any GPU
  • Batch FFTs (many independent FFTs of the same size): significantly faster than sequential single FFTs
  • Real-to-complex (rfft) transforms on modern GPUs with Tensor Cores

For ML training: when using FNO or FlashFFTConv, always batch the FFT computation across the mini-batch and channel dimensions simultaneously. torch.fft.fft(x, dim=-1) can process a (batch, channels, N) tensor with a single cuFFT call, achieving near-peak GPU utilization.


Appendix E: Connections to Other Areas

E.1 DFT and Polynomial Multiplication

The DFT has an elegant algebraic interpretation: it converts polynomial multiplication in the ring $\mathbb{C}[x]/(x^N - 1)$ into pointwise multiplication of coefficient sequences.

Fact. Let $p(x) = \sum_{n=0}^{N-1} a_n x^n$ and $q(x) = \sum_{n=0}^{N-1} b_n x^n$. Their product modulo $x^N - 1$ has coefficients $\mathbf{c} = \mathbf{a} \circledast \mathbf{b}$, the circular convolution. The DFT converts this to $C[k] = A[k] \cdot B[k]$ (pointwise product). Thus the DFT diagonalizes the cyclic convolution operator.

More precisely, the matrix of circular convolution by $\mathbf{a}$ (the circulant matrix $C_a$) has the Fourier basis vectors $\mathbf{f}_k$ as eigenvectors, with eigenvalues $A[k]$:

$$C_a\, \mathbf{f}_k = A[k]\, \mathbf{f}_k$$

The DFT simultaneously diagonalizes all circulant matrices. This is the algebraic foundation of the Convolution Theorem (Section 04).
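This simultaneous diagonalization is a one-line check using `scipy.linalg.circulant`:

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
C = circulant(a)                             # matrix of circular convolution by a

A = np.fft.fft(a)                            # eigenvalues A[k]
N = 8
F = np.exp(2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)  # columns f_k
assert np.allclose(C @ F, F * A)             # C f_k = A[k] f_k, all k at once
```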

E.2 DFT and Number Theory: Quadratic Gauss Sums

The Gauss sum $G = \sum_{n=0}^{N-1} e^{2\pi i n^2/N}$ arises naturally in number theory and signal processing. For odd $N$ its magnitude satisfies $|G| = \sqrt{N}$ (for even $N$ the value depends on $N \bmod 4$), and such quadratic-phase "chirp" sequences underlie several FFT-based tricks.

Bluestein's algorithm uses the identity $nk = \frac{1}{2}\big[n^2 + k^2 - (n-k)^2\big]$ to rewrite the DFT as a convolution:

$$X[k] = e^{-i\pi k^2/N} \sum_{n=0}^{N-1} \left(x[n]\, e^{-i\pi n^2/N}\right) e^{i\pi(n-k)^2/N}$$

This is a convolution of $h[n] = x[n]\,e^{-i\pi n^2/N}$ with the chirp kernel $f[n] = e^{i\pi n^2/N}$, computable via FFT for any $N$.
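A minimal NumPy sketch of Bluestein's algorithm, following the identity above; the FFT length `L` is the next power of two $\geq 2N-1$ so the circular convolution realizes the needed linear convolution:

```python
import numpy as np

def bluestein_dft(x):
    """DFT of arbitrary length N via Bluestein's chirp trick (sketch)."""
    N = len(x)
    n = np.arange(N)
    chirp = np.exp(-1j * np.pi * n**2 / N)   # e^{-i pi n^2 / N}
    a = x * chirp                            # pre-multiplied sequence h[n]

    L = 1 << int(np.ceil(np.log2(2 * N - 1)))   # fast power-of-2 FFT length
    b = np.zeros(L, dtype=complex)
    b[:N] = np.conj(chirp)                   # kernel f[m] = e^{+i pi m^2 / N}
    b[L - N + 1:] = np.conj(chirp[1:][::-1]) # negative lags wrap around

    conv = np.fft.ifft(np.fft.fft(a, L) * np.fft.fft(b))[:N]
    return chirp * conv                      # post-multiply by e^{-i pi k^2 / N}

x = np.random.default_rng(1).standard_normal(12)   # N = 12: not a power of 2
assert np.allclose(bluestein_dft(x), np.fft.fft(x))
```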

E.3 DFT and the Good-Thomas Prime Factor Algorithm

When $N = N_1 N_2$ with $\gcd(N_1, N_2) = 1$, the Good-Thomas (Prime Factor) Algorithm maps the 1D DFT to a 2D DFT via the Chinese Remainder Theorem (CRT). This gives an algorithm distinct from Cooley-Tukey that:

  • Requires no twiddle-factor multiplications between the two stages
  • Is advantageous when $N$ has several small coprime factors

The input indices are mapped by $n = (N_2 n_1 + N_1 n_2) \bmod N$, and the output index $k$ is recovered via the CRT from the pair $(k \bmod N_1,\, k \bmod N_2)$. Under these two index permutations the DFT factors exactly as a tensor product:

$$F_N = P_{\text{out}}\,\big(F_{N_1} \otimes F_{N_2}\big)\,P_{\text{in}}$$

where $P_{\text{in}}$ and $P_{\text{out}}$ are permutation matrices.

Connection to representation theory: the Good-Thomas algorithm expresses the DFT on the cyclic group $\mathbb{Z}_N$ in terms of DFTs on the subgroups $\mathbb{Z}_{N_1}$ and $\mathbb{Z}_{N_2}$ when $N = N_1 N_2$ with coprime factors. The general theory of the DFT on finite abelian groups is a special case of harmonic analysis on locally compact abelian groups (Pontryagin duality), developed in full in Section 12 Functional Analysis.


Appendix F: DFT Properties - Detailed Proofs

F.1 Time Reversal

If $y[n] = x[-n \bmod N]$, then $Y[k] = X[-k \bmod N] = X[N-k]$.

Proof.

$$Y[k] = \sum_{n=0}^{N-1} x[-n \bmod N]\,\omega_N^{-nk} \overset{m = -n \bmod N}{=} \sum_{m=0}^{N-1} x[m]\,\omega_N^{mk} = X[-k \bmod N]$$

Consequence. For real $\mathbf{x}$: $X[N-k] = X[-k] = X[k]^*$ (conjugate symmetry, Section 3.4). Time reversal in the time domain maps to frequency reversal (conjugation, for real signals) in the frequency domain.

F.2 Duality

If $Y[n] = X[n]$ (i.e., the time-domain signal equals the DFT of another signal), then:

$$\mathcal{F}\{X[n]\}[k] = N\, x[-k \bmod N]$$

Proof. From the IDFT, $x[n] = \frac{1}{N}\sum_k X[k]\,\omega_N^{nk}$. Evaluate at $n = -k$ and rename the summation index to $n$:

$$x[-k] = \frac{1}{N}\sum_{n=0}^{N-1} X[n]\,\omega_N^{-kn} = \frac{1}{N}\,\mathcal{F}\{X[n]\}[k]$$

Therefore $\mathcal{F}\{X[n]\}[k] = N\, x[-k \bmod N]$.

Application. If $\mathbf{x}$ is a rectangular pulse ($1$ for $n = 0, \ldots, M-1$, else $0$), its DFT $X[k] = \frac{\sin(\pi k M/N)}{\sin(\pi k/N)}\, e^{-i\pi k(M-1)/N}$ is a Dirichlet kernel. By duality, if the input is a Dirichlet kernel, the DFT is (essentially) a rectangular pulse. This is the discrete analog of the rectangle-sinc FT pair.

F.3 DFT of Real and Imaginary Parts

For a complex signal z[n]=x[n]+iy[n]z[n] = x[n] + iy[n] with real parts x[n]x[n] and y[n]y[n]:

X[k] = \mathcal{F}\{x[n]\}[k] = \frac{Z[k] + Z[N-k]^*}{2}

Y[k] = \mathcal{F}\{y[n]\}[k] = \frac{Z[k] - Z[N-k]^*}{2i}

This allows computing two real DFTs in the time of one complex DFT - an important optimization for processing stereo audio or in-phase/quadrature (IQ) signals.
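The split formulas above are easy to verify numerically; a minimal NumPy sketch (variable names illustrative), recovering the two real-signal DFTs from one complex FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)   # first real signal (e.g., left channel)
y = rng.standard_normal(N)   # second real signal (e.g., right channel)

# One complex FFT of z = x + iy:
Z = np.fft.fft(x + 1j * y)

# Z[N-k]^* with the k=0 bin wrapping to itself:
Zr = np.conj(Z[(-np.arange(N)) % N])
X = (Z + Zr) / 2
Y = (Z - Zr) / (2j)

assert np.allclose(X, np.fft.fft(x))
assert np.allclose(Y, np.fft.fft(y))
```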


Appendix G: Windowing Theory - Spectral Analysis Framework

G.1 The Window Design Problem

Given a finite observation xw[n]=x[n]w[n]x_w[n] = x[n] \cdot w[n] of an infinite signal x[n]x[n], the DFT gives:

Xw[k]=1N(XW)[k]X_w[k] = \frac{1}{N}(X \circledast W)[k]

where W[k]=F{w[n]}[k]W[k] = \mathcal{F}\{w[n]\}[k] is the DFT of the window. The true spectrum X[k]X[k] is "blurred" by W[k]W[k].

Ideal window properties (contradictory):

  1. W[k]=Nδ[k]W[k] = N\delta[k] (no leakage - only possible for infinite observation)
  2. w[n]=1w[n] = 1 for n[0,N1]n \in [0, N-1] (rectangular window: maximum time-domain energy)
  3. Compact support in both time and frequency (impossible by uncertainty principle)

Any window must trade off between these competing desiderata. The Kaiser window with adjustable β\beta offers a continuous tradeoff between resolution and leakage control.

G.2 Sidelobe Rolloff Rate

The rate at which sidelobes decrease away from the main lobe is called the rolloff rate (in dB per octave):

  • Rectangular window: value discontinuities at n=0 and n=N-1 -> |W[k]| \sim 1/k -> rolloff \approx -6\,\text{dB/octave}
  • Hann window: value and first derivative vanish at the endpoints, but the second derivative does not -> |W[k]| \sim 1/k^3 -> rolloff \approx -18\,\text{dB/octave}
  • Blackman window: same endpoint smoothness as Hann (value and first derivative zero) -> |W[k]| \sim 1/k^3 -> rolloff \approx -18\,\text{dB/octave}, but a much lower peak sidelobe (-58 dB vs -31.5 dB)

Generalization. A window with p1p-1 continuous derivatives at the endpoints has sidelobe falloff O(k(p+1))O(k^{-(p+1)}), corresponding to a rolloff rate of 6(p+1)dB/octave-6(p+1)\,\text{dB/octave}. This is a direct consequence of the Riemann-Lebesgue lemma applied to the windowed Fourier series.
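The peak sidelobe levels quoted above can be checked numerically; a quick sketch (the helper name is illustrative) that zero-pads to sample the DTFT finely, walks down the main lobe, and reports the largest remaining magnitude:

```python
import numpy as np

# Peak sidelobe level of a window, in dB relative to the main-lobe peak.
def peak_sidelobe_db(w, pad=1 << 16):
    W = np.abs(np.fft.fft(w, pad))
    W = W / W.max()
    i = 1
    while W[i + 1] < W[i]:        # descend the main lobe to its first null
        i += 1
    return 20 * np.log10(W[i:pad // 2].max())

N = 256
rect_db = peak_sidelobe_db(np.ones(N))        # ~ -13.3 dB
hann_db = peak_sidelobe_db(np.hanning(N))     # ~ -31.5 dB
black_db = peak_sidelobe_db(np.blackman(N))   # ~ -58 dB
```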

G.3 Optimal Windows: Prolate Spheroidal Wave Functions

The Slepian sequences (discrete prolate spheroidal wave functions, DPSS) are the theoretically optimal windows in the following sense:

Problem. Among all windows w[n]w[n] with nw[n]2=1\sum_n w[n]^2 = 1, find the one that maximizes the fraction of energy concentrated in ξW|\xi| \leq W (a specified bandwidth):

\text{maximize} \quad \frac{\sum_{|k| \leq WN} |W[k]|^2}{\sum_k |W[k]|^2}

The solution is the first Slepian sequence, which is the eigenvector corresponding to the largest eigenvalue of the matrix Anm=sin(2πW(nm))π(nm)A_{nm} = \frac{\sin(2\pi W(n-m))}{\pi(n-m)}.

Slepian sequences achieve theoretically optimal concentration - the maximum possible energy fraction in a given bandwidth - and are used in multitaper spectral estimation (Section A.2).
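Slepian sequences are available directly in SciPy; a short sketch (parameter values illustrative) where NW is the time-half-bandwidth product, i.e. the target band is |\xi| \leq W with W = NW/N cycles per sample:

```python
import numpy as np
from scipy.signal import windows

N, NW = 512, 4
tapers, ratios = windows.dpss(N, NW, Kmax=5, return_ratios=True)

# ratios[k] is the in-band energy concentration (the eigenvalue) of the
# k-th Slepian sequence; the first ~2NW-1 of them are extremely close to 1.
print(tapers.shape)   # (5, 512)
```

The lower-order tapers are the ones used in multitaper spectral estimation (Section A.2).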


Appendix H: The FFT in Modern Hardware

H.1 Vectorization and SIMD

Modern CPUs provide SIMD (Single Instruction Multiple Data) instructions that process multiple values simultaneously:

  • SSE2 (128-bit): 2 doubles per register = 1 double-precision complex value
  • AVX2 (256-bit): 4 doubles per register = 2 double-precision complex values
  • AVX-512 (512-bit): 8 doubles per register = 4 double-precision complex values

FFTW automatically generates SIMD-optimized code for each detected instruction set. For N = 2^{20} at double precision, AVX-512 provides up to an 8\times speedup over scalar code on the compute-bound butterfly stages (less in practice, since large transforms are memory-bound).

H.2 GPU FFT (cuFFT)

NVIDIA's cuFFT library achieves peak performance by:

  1. Tiling: decomposing the DFT into tiles that fit in shared memory (SRAM)
  2. Warp-level butterfly: mapping butterfly computations to warp-level shuffle operations
  3. Tensor Core utilization: for batch FFTs, reformulating twiddle-factor multiplication as GEMM operations executable on Tensor Cores

Peak throughput: NVIDIA H100 achieves 30TFLOPS\sim 30\,\text{TFLOPS} on batch FFTs of power-of-2 size with float16 (half precision).

H.3 Distributed FFT for Very Large Transforms

For transforms exceeding GPU memory (e.g., N>108N > 10^8), the DFT can be decomposed across multiple GPUs using the row-column algorithm:

  1. Distribute N=N1×N2N = N_1 \times N_2 points across PP GPUs: each GPU holds N/PN/P rows
  2. Each GPU applies 1D FFT to its rows (N2N_2-point FFTs)
  3. All-to-all communication transposes the data (global shuffle)
  4. Each GPU applies 1D FFT to its columns (N1N_1-point FFTs) and applies twiddle factors
  5. Final all-to-all restores the layout

Total communication: O(N/P) complex numbers per GPU per transpose step. For P = 8 GPUs and N = 10^9 complex doubles: \sim 16\,\text{GB} of data crosses the interconnect per transpose.

Applications: seismic data processing, climate modeling, and cosmological simulations routinely require distributed FFTs of 10910^9-101210^{12} points.


Appendix I: DFT and Signal Reconstruction

I.1 Bandlimited Interpolation

Given DFT coefficients X[0],,X[K]X[0], \ldots, X[K] (retaining only the lowest K+1K+1 frequencies), the bandlimited interpolation of the corresponding signal to any intermediate time tt (not necessarily a sample point) is:

x^(t)=1Nk=0KX[k]e2πikt/N+conjugate terms\hat{x}(t) = \frac{1}{N}\sum_{k=0}^{K} X[k]\, e^{2\pi i k t/N} + \text{conjugate terms}

This is the continuous analog of what FNO does (Section 8.2): it parameterizes a function by its low-frequency DFT coefficients, and evaluates it at any desired spatial resolution.
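Bandlimited interpolation is exactly spectral zero-padding: place the retained coefficients (and their conjugate partners) into a larger spectrum and inverse-transform. A sketch for even N (helper name illustrative; the even-length Nyquist bin must be split between the two halves):

```python
import numpy as np

# Evaluate the trigonometric interpolant of x on a finer grid of M points.
def bandlimited_interp(x, M):
    N = len(x)                 # assumes N even and M > N
    X = np.fft.fft(x)
    Xp = np.zeros(M, dtype=complex)
    h = N // 2
    Xp[:h] = X[:h]             # positive frequencies
    Xp[-h:] = X[-h:]           # negative frequencies
    Xp[h] = X[h] / 2           # split the shared Nyquist bin
    Xp[-h] = X[h] / 2
    return np.real(np.fft.ifft(Xp)) * (M / N)

# A pure tone sampled coarsely, interpolated 4x onto intermediate times:
N, M = 16, 64
x = np.cos(2 * np.pi * 3 * np.arange(N) / N)
x_fine = bandlimited_interp(x, M)
t_fine = np.arange(M) * N / M
assert np.allclose(x_fine, np.cos(2 * np.pi * 3 * t_fine / N))
```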

I.2 Compressed Sensing and Sparse Recovery

Problem (Compressed Sensing). A signal xCN\mathbf{x} \in \mathbb{C}^N is ss-sparse in the DFT domain: only sNs \ll N Fourier coefficients are nonzero. Can we recover x\mathbf{x} from mNm \ll N measurements?

Answer (Candès, Romberg, Tao, 2006; Donoho, 2006). If m \geq C \cdot s \log(N/s) random measurements are taken, \mathbf{x} can be exactly recovered (with high probability) by solving the \ell^1 minimization:

minx^x^1subject toAx^=b\min_{\hat{\mathbf{x}}} \lVert \hat{\mathbf{x}} \rVert_1 \quad \text{subject to} \quad A\hat{\mathbf{x}} = \mathbf{b}

where A is the random measurement matrix. This is dramatically below the m \geq N requirement of classical sampling.

For AI: compressed sensing underlies MRI reconstruction (6-8\times speedup by acquiring \sim 1/6 of k-space data), compressed attention (attending to sparse subsets of frequency components), and efficient LLM inference when activations are sparse in the Fourier domain. The theoretical guarantees connect directly to the RIP (Restricted Isometry Property) of random DFT matrices.

I.3 Phase Retrieval

Problem. Given only X[k]2|X[k]|^2 (the power spectrum, without phase), recover x\mathbf{x}.

Phase retrieval is fundamentally ill-posed in 1D (any signal and its time-reversed, time-shifted version have the same power spectrum). In 2D (images), phase retrieval becomes uniquely solvable (generically) due to oversampling constraints.

Algorithms: Gerchberg-Saxton (1972), HIO (Fienup, 1982), and modern deep learning approaches (phase retrieval nets).

For AI: X-ray crystallography uses phase retrieval to determine protein structure. Modern protein structure prediction (AlphaFold 2, RoseTTAFold) sidesteps phase retrieval by directly predicting 3D coordinates from sequence and co-evolutionary information.


Appendix J: Worked ML Examples

J.1 Building a Mel Spectrogram from Scratch

This section works through the complete mathematics behind the mel spectrogram used in Whisper, step by step.

Step 1: Mel scale. The mel scale converts frequency ff (Hz) to perceived pitch mm (mels):

m(f)=2595log10 ⁣(1+f700),f(m)=700(10m/25951)m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \quad f(m) = 700\left(10^{m/2595} - 1\right)

The mel scale is approximately linear below 1 kHz (300 Hz \approx 401 mel; 700 Hz \approx 781 mel) and logarithmic above 1 kHz, matching the frequency resolution of the human auditory system (the cochlea's place-frequency mapping).

Step 2: Filterbank design. For M=80M = 80 mel filters covering [fmin,fmax]=[0,8000]Hz[f_{\min}, f_{\max}] = [0, 8000]\,\text{Hz}:

  1. Convert the frequency range to mels: [m_{\min}, m_{\max}] = [0, 2840]\,\text{mels}
  2. Place M+2=82M+2 = 82 equally-spaced mel points: m0<m1<<m81m_0 < m_1 < \cdots < m_{81}
  3. Convert back to Hz: fk=f(mk)f_k = f(m_k)
  4. For filterbank mm (m=1,,80m = 1, \ldots, 80), define the triangular filter:
Hm[k]={fkfm1fmfm1fm1fkfmfm+1fkfm+1fmfmfkfm+10otherwiseH_m[k] = \begin{cases} \dfrac{f_k - f_{m-1}}{f_m - f_{m-1}} & f_{m-1} \leq f_k \leq f_m \\ \dfrac{f_{m+1} - f_k}{f_{m+1} - f_m} & f_m \leq f_k \leq f_{m+1} \\ 0 & \text{otherwise} \end{cases}

where fk=kfs/Nf_k = k f_s / N are the DFT bin frequencies.

Step 3: Power accumulation. For each STFT frame ll:

mel_spec[m,l]=k=0N/2Hm[k]X[k,l]2\text{mel\_spec}[m, l] = \sum_{k=0}^{N/2} H_m[k] \cdot |X[k, l]|^2

Step 4: Log compression.

log_mel[m,l]=log10 ⁣(max ⁣(mel_spec[m,l],1010))\text{log\_mel}[m, l] = \log_{10}\!\left(\max\!\left(\text{mel\_spec}[m, l], 10^{-10}\right)\right)

The floor 101010^{-10} prevents log(0)\log(0). The Whisper normalization subtracts the mean of log_mel and divides by 4 (an empirically chosen scale factor that centers the values near [1,1][-1, 1] for typical speech).
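The filterbank construction (Steps 1-2) can be sketched in NumPy; the parameters follow the Whisper-style 16 kHz setup above, and the helper names are illustrative. Steps 3-4 are then a matrix product and a log, as noted in the final comment.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sr=16000, fmin=0.0, fmax=8000.0):
    # M+2 equally spaced mel points, converted back to Hz (Step 2)
    f = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.arange(n_fft // 2 + 1) * sr / n_fft   # DFT bin frequencies f_k
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        rising = (bins - f[m - 1]) / (f[m] - f[m - 1])
        falling = (f[m + 1] - bins) / (f[m + 1] - f[m])
        H[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return H

H = mel_filterbank()
# Step 3 for one frame x is then: mel_spec = H @ np.abs(np.fft.rfft(x))**2
# and Step 4: log_mel = np.log10(np.maximum(mel_spec, 1e-10))
```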

Why these choices?

  • 80 mel bins: empirically optimal for large-scale ASR (more bins -> marginal improvement; fewer -> information loss)
  • 0-8000 Hz: human speech information is concentrated here; extending to 8+ kHz adds noise without significant ASR benefit
  • 25 ms window / 10 ms hop: matches the typical duration of speech phonemes (20-200 ms)
  • Log compression: matches perceptual loudness; prevents high-amplitude phonemes from dominating the feature representation

J.2 FNO Architecture: Complete Mathematical Description

The Fourier Neural Operator (Li et al., 2021) architecture for 1D problems:

Input: discretized function \mathbf{a} \in \mathbb{R}^{N \times d_a} (input function on N points with d_a channels)
Output: discretized function \mathbf{u} \in \mathbb{R}^{N \times d_u} (solution on N points with d_u channels)

Architecture (4 FNO layers + output projection):

Layer 0: Input lifting
  a ∈ R^{N × d_a}  ->  v ∈ R^{N × d_v}   (learnable MLP)

Layer 1-4: FNO blocks
  v <- FNO-Block(v)

Layer 5: Output projection
  v ∈ R^{N × d_v}  ->  u ∈ R^{N × d_u}   (learnable MLP)

FNO Block:

FNO-Block(v):
    1. Spectral branch:
       V = DFT(v)                   # shape: (N, d_v) complex
       V_trunc = V[:K, :]           # keep lowest K modes
       V_out = R @ V_trunc          # R ∈ C^{K × d_v × d_v}: learned weight
       V_pad = [V_out; 0, ..., 0]   # zero-pad back to N modes
       v_spectral = IDFT(V_pad)     # shape: (N, d_v)

    2. Local branch:
       v_local = W @ v              # W ∈ R^{d_v × d_v}: learned weight

    3. Combine:
       v_new = σ(v_spectral + v_local)   # σ = GeLU

    return v_new
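The spectral branch can be sketched in a few lines of NumPy (a sketch of the idea, not the reference implementation; rfft handles the conjugate-symmetric half-spectrum for real inputs):

```python
import numpy as np

# One FNO spectral branch for a real 1D signal with d_v channels:
# FFT -> keep K lowest modes -> per-mode d_v x d_v complex mixing -> IFFT.
def spectral_conv(v, R):
    # v: (N, d_v) real;  R: (K, d_v, d_v) complex learned weights
    N = v.shape[0]
    K = R.shape[0]
    V = np.fft.rfft(v, axis=0)                   # (N//2+1, d_v) complex
    out = np.zeros_like(V)
    out[:K] = np.einsum('kij,kj->ki', R, V[:K])  # mix channels per mode
    return np.fft.irfft(out, n=N, axis=0)        # back to (N, d_v) real

rng = np.random.default_rng(0)
N, d_v, K = 64, 8, 16
v = rng.standard_normal((N, d_v))
R = rng.standard_normal((K, d_v, d_v)) + 1j * rng.standard_normal((K, d_v, d_v))
assert spectral_conv(v, R).shape == (N, d_v)

# Resolution invariance: the same R applies at a finer discretization.
v2 = rng.standard_normal((256, d_v))
assert spectral_conv(v2, R).shape == (256, d_v)
```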

Key design choices:

  • K=16K = 16 retained modes: sufficient for smooth PDE solutions; higher KK is used for more complex solutions (turbulence: K=32K = 32)
  • dv=64d_v = 64 channels: comparable to modern transformer hidden dimension
  • Weight sharing: the same RR is used across all spatial positions (translation equivariance in the frequency domain)
  • Resolution-invariance: the FNO can be trained on N=64N = 64 and evaluated on N=256N = 256 by changing the DFT size; the spectral weights RR and local weights WW are independent of NN

Parameter count comparison (1D, N=1024N = 1024, K=16K = 16, dv=64d_v = 64):

| Layer | Dense global convolution | FNO spectral layer |
|---|---|---|
| Parameters | N \times d_v^2 = 4.2 \times 10^6 | K \times d_v^2 \times 2 = 131{,}072 (complex) |
| FLOPs | O(N^2 d_v^2) = 4 \times 10^9 | O(N d_v \log N + K d_v^2) \approx 6.7 \times 10^6 |

The FNO spectral layer has 32×32\times fewer parameters and 600×600\times fewer FLOPs per layer than a fully global convolution.

J.3 Monarch Matrix Decomposition

A Monarch matrix of size N=N1N2N = N_1 N_2 is defined as:

M=PLPRM = P L P^\top R

where:

  • P{0,1}N×NP \in \{0,1\}^{N \times N} is a fixed permutation (the "interleaving permutation")
  • L=diag(L0,L1,,LN11)L = \operatorname{diag}(L_0, L_1, \ldots, L_{N_1-1}) is block-diagonal with N1N_1 blocks of size N2×N2N_2 \times N_2
  • R=diag(R0,R1,,RN21)R = \operatorname{diag}(R_0, R_1, \ldots, R_{N_2-1}) is block-diagonal with N2N_2 blocks of size N1×N1N_1 \times N_1

Connection to FFT. The DFT matrix factors into butterfly stages:

FN=(IN1FN2)D(FN1IN2)Pbit-reverseF_N = (I_{N_1} \otimes F_{N_2}) \cdot D \cdot (F_{N_1} \otimes I_{N_2}) \cdot P_{\text{bit-reverse}}

where DD is a diagonal twiddle-factor matrix and \otimes is the Kronecker product. Each butterfly stage is a block-diagonal matrix - exactly the Monarch structure. Thus the FFT is a special Monarch matrix with specific parameter values.

Monarch parameterization: instead of fixing the block entries to be DFT matrices, let LiL_i and RjR_j be learnable complex matrices. This gives O(N(N1+N2))=O(NN)O(N(N_1 + N_2)) = O(N\sqrt{N}) parameters (choosing N1=N2=NN_1 = N_2 = \sqrt{N}).

Expressiveness. Monarch matrices can represent any circulant matrix (by choosing LiL_i and RjR_j appropriately) and therefore any convolution. They can also represent some attention patterns and are more expressive than diagonal matrices or individual FFT layers.
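To make "the FFT is a special Monarch matrix" concrete, here is a small sketch (index conventions illustrative): two batched block-diagonal multiplies around a reshape, with DFT-derived blocks and the twiddle factors folded into R, reproduce the full DFT up to the final output permutation (the role of P).

```python
import numpy as np

# Monarch-style matvec for N = N1*N2: block-diagonal R mixes along one
# axis of the reshaped input, block-diagonal L mixes along the other.
def monarch_matvec(L, R, x, N1, N2):
    X = x.reshape(N1, N2)               # X[i, j] = x[N2*i + j]
    Y = np.einsum('jai,ij->aj', R, X)   # block R_j acts on column j
    Z = np.einsum('ibj,ij->ib', L, Y)   # block L_i acts on row i
    return Z

N1, N2 = 3, 4
N = N1 * N2
w = lambda M: np.exp(-2j * np.pi / M)

# Cooley-Tukey blocks: R_j[a, i] = w_N^{a*j} * w_{N1}^{a*i} (twiddle folded in),
# L_i[b, j] = w_{N2}^{b*j} (same DFT block for every i).
R = np.zeros((N2, N1, N1), dtype=complex)
for j in range(N2):
    for a in range(N1):
        for i in range(N1):
            R[j, a, i] = w(N) ** (a * j) * w(N1) ** (a * i)
L = np.zeros((N1, N2, N2), dtype=complex)
for i in range(N1):
    for b in range(N2):
        for j in range(N2):
            L[i, b, j] = w(N2) ** (b * j)

x = np.random.default_rng(0).standard_normal(N)
Z = monarch_matvec(L, R, x, N1, N2)     # Z[k1, k2] = X_dft[N1*k2 + k1]
assert np.allclose(Z.T.reshape(-1), np.fft.fft(x))
```

Replacing these fixed DFT blocks with learned complex matrices gives the trainable Monarch parameterization.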


Appendix K: Reference Summary

K.1 Key Formulas

| Formula | Expression |
|---|---|
| DFT | X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}, \quad \omega_N = e^{2\pi i/N} |
| IDFT | x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk} |
| DFT matrix | (F_N)_{kn} = \omega_N^{-kn}, \quad F_N F_N^* = N I |
| Parseval | \sum_n \lvert x[n]\rvert^2 = \frac{1}{N}\sum_k \lvert X[k]\rvert^2 |
| Circular convolution | (x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\,y[n-m \bmod N] |
| Circular shift | x[n-m \bmod N] \leftrightarrow \omega_N^{-mk} X[k] |
| Conjugate symmetry | x[n] \in \mathbb{R} \Rightarrow X[N-k] = X[k]^* |
| FFT complexity | \frac{N}{2}\log_2 N complex multiplications, N\log_2 N additions |
| Frequency resolution | \Delta f = f_s/N = 1/T |
| Nyquist limit | f_{\max} = f_s/2 |
| STFT | S[l,k] = \sum_{m=0}^{M-1} x[lH+m]\,w[m]\,\omega_N^{-mk} |
| Mel scale | m = 2595\log_{10}(1 + f/700) |

K.2 Quick Comparison: FT vs DFT

| Property | Continuous FT | DFT |
|---|---|---|
| Domain | L^1(\mathbb{R}) or L^2(\mathbb{R}) | \mathbb{C}^N (finite sequences) |
| Output | Continuous spectrum on \mathbb{R} | N complex values |
| Inversion | Integral formula | Exact, finite sum |
| Shift | Linear: e^{-2\pi i\xi\tau} phase factor | Circular: \omega_N^{-mk} phase factor |
| Convolution | Linear convolution | Circular convolution |
| Parseval | \int\lvert f\rvert^2 = \int\lvert\hat{f}\rvert^2 | \sum\lvert x\rvert^2 = \frac{1}{N}\sum\lvert X\rvert^2 |
| Unitarity | FT on L^2 is unitary | \frac{1}{\sqrt{N}}F_N is unitary |
| Computation | Cannot compute exactly | FFT: O(N\log N) |
| Leakage | None (infinite observation) | Leakage from finite windowing |

Appendix L: Case Studies

L.1 Whisper Large-v3: Preprocessing Bug Hunt

A common real-world bug in Whisper-based systems illustrates why understanding the DFT pipeline matters.

Symptom: transcription quality degrades for audio from a different microphone than training data, even though the content is identical.

Root cause: the new microphone captures at 44.1 kHz (not 16 kHz). Naive resampling using librosa.load(file, sr=16000) with default settings may introduce aliasing artifacts or produce a slightly different frequency axis. Specifically:

  1. Aliasing from incorrect anti-aliasing filter: if the 44.1 kHz signal is downsampled by factor 2.76\approx 2.76 without applying a proper lowpass filter at 8 kHz (the Nyquist for 16 kHz), frequencies between 8 kHz and 22.05 kHz alias into the 0-8 kHz band, corrupting the mel spectrogram.

  2. Frequency axis mismatch: if the mel filterbanks were designed for exactly 16 kHz sampling, resampling to 15,994 Hz (a common off-by-one error in resampling libraries) shifts all frequency bin centers by 0.04%\approx 0.04\% - too small to affect single-frequency detection but enough to corrupt the precise triangular filterbank overlaps.

Fix: explicitly verify sample rate after loading (assert sr == 16000), use scipy.signal.resample_poly with explicit anti-aliasing, and validate the mel filterbank output against a known reference signal.
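The fix can be sketched as follows (the test tone and tolerances are illustrative); resample_poly applies a proper anti-aliasing lowpass for the rational factor 160/441:

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
t = np.arange(sr_in) / sr_in
x = np.sin(2 * np.pi * 440 * t)       # 1 s validation tone at 440 Hz

y = resample_poly(x, sr_out, sr_in)   # reduced internally to up=160, down=441
assert len(y) == sr_out               # exactly 16000 samples per second

# Verify the tone survived at the right frequency after resampling:
Y = np.abs(np.fft.rfft(y))
f_peak = np.argmax(Y) * sr_out / len(y)
assert abs(f_peak - 440.0) < 2.0
```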

Lesson: every stage of the FFT pipeline - sampling rate, window length, hop size, FFT size, filterbank design - has exact numerical requirements. The DFT is unforgiving of approximations.

L.2 FNO for Climate Modeling: FourCastNet

FourCastNet (Pathak et al., 2022, NVIDIA) uses a variant of the FNO architecture to predict global weather at 6-hour intervals. It replaces the FFT-based spectral layer with a Fourier-based AFNO (Adaptive Fourier Neural Operator) block:

AFNO Layer:
  1. Patchify: divide input grid (7201440) into patches -> tokens
  2. DFT along spatial dimensions (2D FFT)
  3. Mix frequencies with learned complex weights (block-diagonal for efficiency)
  4. IDFT -> spatial domain
  5. MLP for channel mixing

Key difference from FNO: AFNO uses token mixing in frequency space with a block-diagonal structure, reducing parameters from O(K2d2)O(K^2 d^2) to O(Kd2/B)O(K d^2 / B) where BB is the block size.

Results:

  • Forecasts 10-day global weather in 2 seconds (vs 4 hours for NWP models like ECMWF)
  • Competitive with ECMWF operational forecasts on 2-week horizons for key variables (temperature at 500 hPa, wind at 850 hPa)
  • Trained on 40 years of ERA5 reanalysis data (3.8 TB)

Why frequency domain? Atmospheric dynamics are dominated by large-scale planetary waves (low spatial frequencies) - the Rossby waves that govern jet streams and storm tracks. Truncating to K=32K = 32 modes in each dimension captures >95%> 95\% of the variance in ERA5 data. Higher-frequency turbulence below the grid scale is parameterized separately.

L.3 STFT Denoising in Production Audio

Modern audio denoising systems (used in Zoom, Teams, Discord) use STFT-based processing:

Architecture:

  1. STFT with Hann window (1024 samples, hop 256) -> spectrogram X[l,k]|X[l,k]| and phase X[l,k]\angle X[l,k]
  2. Neural network (small RNN or transformer) estimates the noise mask M[l,k][0,1]M[l,k] \in [0,1]
  3. Apply mask: X^[l,k]=M[l,k]X[l,k]\hat{X}[l,k] = M[l,k] \cdot X[l,k] (keep speech, suppress noise)
  4. Reconstruct waveform via ISTFT with original phase (overlap-add)
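The four steps above can be sketched with SciPy, using an oracle spectral-gating mask as a stand-in for the neural mask (the threshold and signal are illustrative):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 48000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)              # stand-in for speech
noisy = clean + 0.3 * rng.standard_normal(fs)

# Step 1: STFT with Hann window, 1024 samples, hop 256 (noverlap = 768)
f, frames, Z = stft(noisy, fs, window='hann', nperseg=1024, noverlap=768)

# Steps 2-3: crude gating mask - keep cells well above the noise floor
noise_floor = np.median(np.abs(Z))
mask = (np.abs(Z) > 2 * noise_floor).astype(float)

# Step 4: ISTFT with the noisy phase (overlap-add)
_, den = istft(Z * mask, fs, window='hann', nperseg=1024, noverlap=768)

m = min(len(den), len(clean))
mse_noisy = np.mean((noisy[:m] - clean[:m]) ** 2)
mse_den = np.mean((den[:m] - clean[:m]) ** 2)
assert mse_den < mse_noisy                       # noise energy reduced
```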

Why preserve phase: human perception is relatively insensitive to moderate phase errors in speech, but phase-inconsistent reconstruction (Griffin-Lim iterations) introduces audible artifacts. Most commercial denoisers use the noisy signal's phase directly - a pragmatic approximation.

Real-time constraint: at 48 kHz with 1024-sample windows and 256-sample hop: one frame every 5.33 ms. A DNN processing one frame must complete in <5ms< 5\,\text{ms} on the target hardware (often a small DSP or CPU). This limits the network size to 105\sim 10^5 parameters.

Can the STFT itself be learned? In principle, one could learn the analysis window jointly with the denoising network (a learnable STFT). In practice, fixed Hann windows almost always match or outperform learned windows on denoising benchmarks: the Hann window's known good time-frequency resolution is hard to improve upon with a finite training set.


Appendix M: Connections to Other Mathematical Areas

M.1 Harmonic Analysis on Finite Abelian Groups

The DFT on ZN\mathbb{Z}_N (the cyclic group of integers modulo NN) is a special case of harmonic analysis on locally compact abelian groups (LCA groups). For a finite abelian group GG, the Pontryagin dual G^\hat{G} is the group of characters (group homomorphisms χ:GS1\chi: G \to S^1). For G=ZNG = \mathbb{Z}_N: G^ZN\hat{G} \cong \mathbb{Z}_N with characters χk(n)=ωNkn\chi_k(n) = \omega_N^{kn} - exactly the DFT basis.

The abstract Fourier transform on GG is:

f^(χ)=gGf(g)χ(g)\hat{f}(\chi) = \sum_{g \in G} f(g)\,\overline{\chi(g)}

For G=ZNG = \mathbb{Z}_N: this recovers the DFT exactly. For G=RG = \mathbb{R}: this recovers the continuous FT. The same theorem - Pontryagin duality, Plancherel's theorem, Poisson summation - holds in all cases.

This unification is developed fully in Section 12 Functional Analysis, connecting the DFT to representation theory of compact groups, wavelet theory, and the Fourier analysis of non-commutative groups (non-abelian harmonic analysis).

M.2 Number-Theoretic Applications

Quadratic reciprocity via DFT. The Gauss sum G(a,N)=n=0N1e2πian2/NG(a, N) = \sum_{n=0}^{N-1} e^{2\pi i a n^2/N} plays a central role in the proof of quadratic reciprocity. For prime pp:

G(1,p)={pp1(mod4)ipp3(mod4)G(1, p) = \begin{cases} \sqrt{p} & p \equiv 1 \pmod 4 \\ i\sqrt{p} & p \equiv 3 \pmod 4 \end{cases}

This is a DFT-like object: G(a,N)=X[a]G(a, N) = X[a] where x[n]=1x[n] = 1 for all nn and the kernel is ωNan2\omega_N^{an^2} instead of ωNan\omega_N^{an}.

Integer factorization. Shor's quantum algorithm (1994) for integer factorization relies on the quantum Fourier transform (QFT): a quantum circuit that applies the DFT to a quantum superposition state in O((logN)2)O((\log N)^2) gate operations. Classical FFT requires O(NlogN)O(N \log N) operations; QFT requires O((logN)2)O((\log N)^2). Shor's algorithm uses the QFT to find the period of akmodNa^k \bmod N, which reveals the prime factors of NN in polynomial time.


Appendix N: Spectral Analysis in Practice - Extended Examples

N.1 Frequency Estimation: Sub-Bin Accuracy

A key practical problem: given noisy measurements y[n]=Acos(2πf0n/fs+ϕ)+ε[n]y[n] = A\cos(2\pi f_0 n/f_s + \phi) + \varepsilon[n], estimate f0f_0, AA, and ϕ\phi with sub-bin accuracy.

Step 1: Coarse estimate. Compute the DFT magnitude Y[k]|Y[k]| and find kmax=argmaxkY[k]k_{\max} = \arg\max_k |Y[k]|. This gives f^0(1)=kmaxfs/N\hat{f}_0^{(1)} = k_{\max} f_s / N with error up to ±Δf/2=±fs/(2N)\pm \Delta f/2 = \pm f_s/(2N).

Step 2: Parabolic interpolation. Fit a parabola to Y[kmax1]|Y[k_{\max}-1]|, Y[kmax]|Y[k_{\max}]|, Y[kmax+1]|Y[k_{\max}+1]|:

k^=kmax+Y[kmax+1]Y[kmax1]2(2Y[kmax]Y[kmax1]Y[kmax+1])\hat{k} = k_{\max} + \frac{|Y[k_{\max}+1]| - |Y[k_{\max}-1]|}{2(2|Y[k_{\max}]| - |Y[k_{\max}-1]| - |Y[k_{\max}+1]|)}

This achieves approximately 10\times better accuracy than the coarse estimate for SNR > 20 dB.

Step 3: Quinn's Estimator. For even better accuracy, use the complex DFT values directly:

k^=kmax+Re ⁣[Y[kmax+1]/Y[kmax]1Y[kmax+1]/Y[kmax]]\hat{k} = k_{\max} + \operatorname{Re}\!\left[\frac{Y[k_{\max}+1]/Y[k_{\max}]}{1 - Y[k_{\max}+1]/Y[k_{\max}]}\right]

Quinn's estimator is unbiased and achieves accuracy near the Cramér-Rao bound.

Step 4: Amplitude and phase. Once k^\hat{k} is estimated, the amplitude is A^=2Y[k^]/N\hat{A} = 2|Y[\hat{k}]|/N (for a one-sided real spectrum) corrected for the window's coherent gain, and the phase is ϕ^=Y[k^]\hat{\phi} = \angle Y[\hat{k}].
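Steps 1-2 in NumPy (signal parameters illustrative), showing the coarse argmax refined by parabolic interpolation:

```python
import numpy as np

fs, N = 1000.0, 256
f0 = 123.4                                  # true frequency, off-bin
n = np.arange(N)
y = np.cos(2 * np.pi * f0 * n / fs)

Y = np.abs(np.fft.rfft(y * np.hanning(N)))  # window to limit leakage
k = int(np.argmax(Y))                       # Step 1: coarse estimate
f_coarse = k * fs / N                       # error up to fs/(2N) ~ 2 Hz

# Step 2: parabolic interpolation around the peak bin
num = Y[k + 1] - Y[k - 1]
den = 2 * (2 * Y[k] - Y[k - 1] - Y[k + 1])
f_fine = (k + num / den) * fs / N

assert abs(f_fine - f0) < abs(f_coarse - f0)
assert abs(f_fine - f0) < 1.0
```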

N.2 Detecting Harmonics in Speech

Speech is approximately quasi-periodic: voiced sounds (vowels) have a fundamental frequency F0F_0 (pitch) and harmonics at 2F0,3F0,2F_0, 3F_0, \ldots The harmonic structure is visible in the DFT as a series of peaks at integer multiples of F0F_0.

HPS (Harmonic Product Spectrum) algorithm:

  1. Compute the DFT magnitude X[k]|X[k]|
  2. Form the product spectrum: P[k]=h=1HX[hk]P[k] = \prod_{h=1}^{H} |X[hk]|
  3. Estimate F0=fsargmaxkP[k]/NF_0 = f_s \cdot \arg\max_k P[k] / N

The HPS effectively downsamples the spectrum by factors 1,2,,H1, 2, \ldots, H and takes the product - aligning the harmonic peaks while suppressing noise between them.
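A minimal HPS implementation (parameters and the synthetic harmonic tone are illustrative):

```python
import numpy as np

# Harmonic Product Spectrum pitch estimate.
def hps_pitch(x, fs, H=4):
    X = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    P = X.copy()
    for h in range(2, H + 1):
        P[:len(X) // h] *= X[::h][:len(X) // h]   # downsample-and-multiply
    k = np.argmax(P[:len(X) // H])
    return k * fs / len(x)

fs, N = 16000, 2048
n = np.arange(N)
f0 = 220.0
# A harmonic-rich tone: fundamental plus three decaying harmonics
x = sum(np.sin(2 * np.pi * h * f0 * n / fs) / h for h in range(1, 5))
assert abs(hps_pitch(x, fs) - f0) < fs / N        # within one DFT bin
```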

In AI: pitch estimation (fundamental frequency detection) is a key component of speech processing pipelines for voice synthesis (VALL-E, SoundStorm), emotion recognition, and singing voice synthesis. Modern pitch detectors, both probabilistic (pYIN) and learned (CREPE, PENN), outperform HPS; most still operate on frame-based spectral or waveform representations closely related to the STFT.

N.3 The Spectrogram in Transformer-Based Audio Models

The dominant paradigm for audio language models (AudioLM, VALL-E, MusicLM, Stable Audio) is:

Audio waveform
    -> FFT -> mel spectrogram (continuous representation)
    -> VQ-VAE or EnCodec -> discrete tokens
    -> Language model (transformer) -> next-token prediction
    -> Decoder -> reconstructed mel spectrogram
    -> Vocoder (HiFi-GAN, WaveGrad) -> audio waveform

The FFT and mel spectrogram serve as both the input representation for the encoder and the target representation for the decoder. The transformer operates entirely in the token space derived from the mel spectrogram.

Why not train end-to-end on waveforms? The waveform representation at 16-44 kHz requires sequence lengths of 10410^4-10510^5 samples per second - far beyond current transformer context lengths. The mel spectrogram with 10 ms frame shift compresses by a factor of 160×\sim 160\times (16 kHz) while preserving the perceptually relevant structure. This compression is what makes transformer-based audio generation feasible.


Appendix O: Summary of AI Model Connections

| AI System | DFT Role | Key Parameters |
|---|---|---|
| OpenAI Whisper | FFT -> mel spectrogram as input | f_s=16 kHz, M=400, H=160, N=400, 80 mel bins |
| FNet (Google) | Full 2D DFT replaces attention | DFT along sequence and embedding dims; no learned weights |
| FNO (Caltech) | Truncated spectral convolution | K=16-32 modes; global PDE operator learning |
| FourCastNet (NVIDIA) | AFNO block for weather prediction | 2D FFT on 720\times1440 lat/lon grid; 73 variables |
| FlashFFTConv | Butterfly-structured long convolution | Monarch factorization; O(N\log N) with Tensor Cores |
| Monarch (Together AI) | FFT butterfly as trainable transform | O(N\sqrt{N}) parameters; used as attention surrogate |
| VALL-E | Codec (EnCodec) uses STFT loss | Multi-resolution STFT loss for perceptual quality |
| AudioLM | Semantic/acoustic tokenization | EnCodec uses mel-scale frequency band processing |
| CREPE (pitch) | CNN pitch detector on waveform frames | 1024-sample frames at 16 kHz, 10 ms hop |
| HiFi-GAN (vocoder) | Multi-period discriminator | Multi-resolution STFT loss; discriminates real vs fake |
| S4/Mamba | Spectral view of state space models | SSM recurrence equivalent to causal convolution (FFT in training) |
| RoPE (LLaMA) | Frequency-domain position encoding | Complex rotation = modulation in DFT sense (Section 3.3) |

Appendix P: Proofs and Derivations

P.1 Derivation of Cooley-Tukey for General Radix

For N=N1N2N = N_1 \cdot N_2, write n=N2n1+n2n = N_2 n_1 + n_2 and k=N1k2+k1k = N_1 k_2 + k_1 with n1,k1[N1]n_1, k_1 \in [N_1], n2,k2[N2]n_2, k_2 \in [N_2]:

X[k]=n1=0N11n2=0N21x[N2n1+n2]ωN(N2n1+n2)(N1k2+k1)X[k] = \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1} x[N_2 n_1 + n_2]\, \omega_N^{-(N_2 n_1 + n_2)(N_1 k_2 + k_1)}

Expanding the exponent: (N2n1+n2)(N1k2+k1)=N1N2n1k2N2n1k1n2N1k2n2k1-(N_2 n_1 + n_2)(N_1 k_2 + k_1) = -N_1 N_2 n_1 k_2 - N_2 n_1 k_1 - n_2 N_1 k_2 - n_2 k_1

Since ωNN1N2=1\omega_N^{N_1 N_2} = 1, the first term vanishes. Grouping:

X[N1k2+k1]=n2=0N21ωN2n2k2[ωNn2k1twiddlen1=0N11ωN1n1k1x[N2n1+n2]]X[N_1 k_2 + k_1] = \sum_{n_2=0}^{N_2-1} \omega_{N_2}^{-n_2 k_2} \left[\underbrace{\omega_N^{-n_2 k_1}}_{\text{twiddle}} \sum_{n_1=0}^{N_1-1} \omega_{N_1}^{-n_1 k_1} x[N_2 n_1 + n_2]\right]

The inner sum is an N1N_1-point DFT (for each fixed n2,k1n_2, k_1), and the outer sum is an N2N_2-point DFT (for each fixed k1k_1). The twiddle factors ωNn2k1\omega_N^{-n_2 k_1} multiply between the two stages.

Total operations: N2N_2 DFTs of size N1N_1 + N1N_1 DFTs of size N2N_2 + N1N2N_1 N_2 twiddle multiplications = O(NlogN)O(N \log N) by the master recurrence.
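The radix-2 special case (N_1 = 2, applied recursively) is only a few lines; checking it against numpy.fft confirms the derivation:

```python
import numpy as np

# Recursive radix-2 Cooley-Tukey (decimation in time), N a power of 2.
def fft_radix2(x):
    N = len(x)
    if N == 1:
        return x.astype(complex)
    E = fft_radix2(x[0::2])              # DFT of even-indexed samples
    O = fft_radix2(x[1::2])              # DFT of odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors
    return np.concatenate([E + tw * O, E - tw * O])

x = np.random.default_rng(0).standard_normal(16)
assert np.allclose(fft_radix2(x), np.fft.fft(x))
```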

P.2 The Uncertainty Principle for the DFT (Donoho-Stark Theorem)

Theorem (Donoho & Stark, 1989). For any nonzero xCN\mathbf{x} \in \mathbb{C}^N:

supp(x)supp(X)N|\text{supp}(\mathbf{x})| \cdot |\text{supp}(\mathbf{X})| \geq N

where supp(x)={n:x[n]0}\text{supp}(\mathbf{x}) = \{n : x[n] \neq 0\} is the support of x\mathbf{x}.

Proof sketch. Suppose supp(x)=S|\text{supp}(\mathbf{x})| = S and supp(X)=F|\text{supp}(\mathbf{X})| = F. The DFT restricted to supp(x)\text{supp}(\mathbf{x}) and supp(X)\text{supp}(\mathbf{X}) defines an F×SF \times S submatrix AA of 1NFN\frac{1}{\sqrt{N}}F_N. Since x\mathbf{x} is supported on SS points and X\mathbf{X} is supported on FF points, this submatrix satisfies AxS=XF/NA \mathbf{x}_{|S} = \mathbf{X}_{|F}/\sqrt{N}. A counting argument using the rank of AA gives SFNSF \geq N.

Corollary. A signal cannot have fewer than N\sqrt{N} nonzero samples AND fewer than N\sqrt{N} nonzero DFT coefficients simultaneously. This is the discrete analog of the Heisenberg uncertainty principle.

For AI: compressed sensing exploits this theorem: a signal supported on SS DFT bins can be recovered from mCSlogNm \geq CS\log N time-domain measurements - far fewer than NN. This justifies FNO's use of KNK \ll N Fourier modes: smooth solutions to PDEs have small Fourier support.

P.3 Aliasing Formula: Proof via Poisson Summation

When x(t)x(t) is sampled at rate fsf_s (below the Nyquist rate for bandlimited xx), the DFT of the samples is:

X_s[k] = \frac{1}{T_s}\sum_{n=-\infty}^{\infty} \hat{x}\!\left(\frac{k}{N T_s} - \frac{n}{T_s}\right) = f_s \sum_{n=-\infty}^{\infty} \hat{x}(k\Delta f - n f_s)

This is the aliased spectrum: the true spectrum x^\hat{x} summed over all frequency shifts by integer multiples of fsf_s. When x^\hat{x} is nonzero outside [fs/2,fs/2)[-f_s/2, f_s/2), adjacent copies overlap and contaminate each other - this is aliasing.

Proof. Starting from the Poisson summation formula (Section 02-7.5):

n=f(tnT)=1Tk=f^(k/T)e2πikt/T\sum_{n=-\infty}^{\infty} f(t - nT) = \frac{1}{T}\sum_{k=-\infty}^{\infty} \hat{f}(k/T)\, e^{2\pi i k t/T}

Setting f(t)=x(t)e2πiξtf(t) = x(t) e^{-2\pi i \xi t} and T=1/fsT = 1/f_s, and evaluating at t=0t=0:

nx(n/fs)e2πiξn/fs=fskx^(ξkfs)\sum_n x(n/f_s)\, e^{-2\pi i\xi n/f_s} = f_s \sum_k \hat{x}(\xi - kf_s)

The left side is the DTFT of the sampled signal; the right side is the periodized spectrum. The DFT is the sampling of this DTFT at ξ=kfs/N\xi = k f_s/N, giving the result.
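The aliasing formula predicts, for example, that a 7 kHz tone sampled at 10 kHz lands at 10 - 7 = 3 kHz; a quick numerical check:

```python
import numpy as np

fs, N = 10000, 1000
n = np.arange(N)
x = np.cos(2 * np.pi * 7000 * n / fs)   # 7 kHz tone, above Nyquist (5 kHz)
X = np.abs(np.fft.rfft(x))
k_peak = int(np.argmax(X))
f_apparent = k_peak * fs / N
assert f_apparent == 3000.0             # aliased copy at f_s - 7000 Hz
```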


Appendix Q: Further Reading

Primary References

  1. Cooley, J.W. and Tukey, J.W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation, 19(90), 297-301. - The original FFT paper; five pages, historically indispensable.

  2. Oppenheim, A.V. and Schafer, R.W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice Hall. - The standard reference for DSP; chapters 8-10 cover DFT, FFT, and windowing comprehensively.

  3. Strang, G. (1999). "The Discrete Cosine Transform." SIAM Review, 41(1), 135-147. - Beautiful exposition connecting FFT, DCT, and linear algebra.

  4. Frigo, M. and Johnson, S.G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE, 93(2), 216-231. - FFTW's adaptive algorithm selection strategy.

  5. Li, Z. et al. (2021). "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR 2021. - FNO architecture and theory.

  6. Dao, T. et al. (2022). "Monarch: Expressive Structured Matrices for Efficient and Accurate Training." ICML 2022. - Butterfly matrix parameterization.

  7. Radford, A. et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. - Whisper architecture and mel spectrogram preprocessing.

  8. Donoho, D.L. and Stark, P.B. (1989). "Uncertainty Principles and Signal Recovery." SIAM Journal on Applied Mathematics, 49(3), 906-931. - Discrete uncertainty principle.

Online Resources

  • NumPy FFT documentation: numpy.fft module reference with complete API
  • SciPy signal processing: scipy.signal.spectrogram, scipy.signal.welch, scipy.fft
  • Whisper preprocessing code: openai/whisper GitHub repository, audio.py
  • FNO implementation: neuraloperator/neuraloperator GitHub repository

Appendix R: Quick-Reference Formula Sheet

| Symbol | Meaning | Value / Formula |
|---|---|---|
| \omega_N | Principal N-th root of unity | e^{2\pi i/N} |
| X[k] | k-th DFT coefficient | \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk} |
| x[n] | IDFT reconstruction | \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk} |
| F_N | DFT matrix | (F_N)_{kn} = \omega_N^{-kn}, so F_N F_N^* = NI |
| \Delta f | Frequency bin width | f_s/N = 1/T |
| f_{\text{Nyq}} | Nyquist frequency | f_s/2 |
| N_{\min} | Minimum N to resolve \Delta f_0 | \lceil f_s/\Delta f_0 \rceil |
| S[l,k] | STFT coefficient | \sum_{m} x[lH+m]\,w[m]\,\omega_N^{-mk} |
| COLA | Overlap-add constant | \sum_l w[n-lH] = C for all n |
| X[N-k] | Conjugate symmetry (real x) | X[k]^* |

Window cheat-sheet

| Window | Main lobe (bins) | First sidelobe (dB) | Use case |
|---|---|---|---|
| Rectangular | 2 | -13 | Broadband noise, transients |
| Hann | 4 | -31.5 | General purpose |
| Hamming | 4 | -42.7 | Speech processing |
| Blackman | 6 | -58 | Weak sinusoids near strong ones |
| Kaiser \beta=8.6 | variable | -69 | Precision filter design |

FFT complexity

T(N)=2T(N/2)+O(N)    T(N)=O(Nlog2N)T(N) = 2T(N/2) + O(N) \implies T(N) = O(N \log_2 N)

Compared to naive DFT: N=106N = 10^6 reduces from 101210^{12} to 2×1072 \times 10^7 operations - a 50000×50\,000\times speedup.

Key identities

x \circledast y \;\overset{\mathcal{F}}{\longleftrightarrow}\; X[k] \cdot Y[k] \qquad \text{(see Section 04)}

\sum_{n=0}^{N-1}|x[n]|^2 = \frac{1}{N}\sum_{k=0}^{N-1}|X[k]|^2 \qquad \text{(Parseval)}

X[k] = X[k+N] \qquad \text{(periodicity in frequency)}

x[n] = x[n+N] \qquad \text{(periodicity in time, implicit)}