Discrete Fourier Transform and FFT: Part 1: Intuition to 6. Frequency Resolution and Zero-Padding

1. Intuition

1.1 From Continuous to Discrete: Sampling a Spectrum

Every practical computation with signals must at some point confront a fundamental constraint: computers cannot store or process continuous functions. What they can store is a finite sequence of numbers - samples of a signal taken at discrete time instants. The passage from the continuous world of the Fourier Transform to the discrete world of the DFT is therefore not a mathematical nicety but a computational necessity.

Suppose we record $N$ samples of a signal $x(t)$ at a sampling rate of $f_s$ samples per second:

$$x[n] = x(n \cdot T_s), \quad T_s = \frac{1}{f_s}, \quad n = 0, 1, \ldots, N-1$$

The total observation time is $T = N \cdot T_s$. What frequencies can we see in this finite record? Two constraints emerge immediately:

  1. Highest observable frequency (Nyquist): any frequency above $f_s/2$ will be indistinguishable from a lower frequency due to aliasing - two distinct sinusoids will produce identical sample sequences.
  2. Frequency resolution: the smallest frequency difference we can distinguish is $\Delta f = f_s / N = 1/T$ - the reciprocal of the observation duration.

The DFT honours both constraints. It examines the signal at exactly $N$ equally-spaced frequency bins:

$$\xi_k = \frac{k \cdot f_s}{N}, \quad k = 0, 1, \ldots, N-1$$

covering one full period $[0, f_s)$ of the periodic spectrum. The first $N/2$ bins correspond to positive frequencies $[0, f_s/2)$; the remaining $N/2$ bins correspond to negative frequencies $(-f_s/2, 0)$ wrapped around to $[f_s/2, f_s)$. This wrapping is the discrete manifestation of the Nyquist limit.
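A quick way to see this bin layout in code is np.fft.fftfreq, which returns the frequency each DFT bin represents; the sampling rate and length below are arbitrary illustrative values:

```python
import numpy as np

fs = 8.0      # sampling rate in Hz (illustrative value)
N = 8         # number of samples

# np.fft.fftfreq lists the positive frequencies first, then the negative
# frequencies that the bins k >= N/2 actually represent.
freqs = np.fft.fftfreq(N, d=1/fs)
print(freqs)                      # [ 0.  1.  2.  3. -4. -3. -2. -1.]

# The "unwrapped" axis [0, fs) described above is simply k * fs / N:
print(np.arange(N) * fs / N)      # [0. 1. 2. 3. 4. 5. 6. 7.]
```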

For AI: Every neural network that processes time-series data - speech, ECG, seismic, financial - ultimately works with sampled signals. Understanding the frequency axis setup of the DFT is essential for interpreting what spectral features (mel filterbanks in Whisper, power spectral density in EEG classifiers) actually represent. Misinterpreting negative-frequency bins is a common source of subtle bugs in spectrogram preprocessing.

1.2 What the DFT Does Geometrically

The space of $N$-point complex sequences $\mathbb{C}^N$ has dimension $N$. The DFT provides a complete orthogonal basis for this space - the Fourier basis - consisting of $N$ complex sinusoids at the frequencies $\xi_k = k/N$:

$$\mathbf{f}_k = \begin{bmatrix} 1 \\ \omega_N^k \\ \omega_N^{2k} \\ \vdots \\ \omega_N^{(N-1)k} \end{bmatrix}, \quad \omega_N = e^{2\pi i / N}, \quad k = 0, 1, \ldots, N-1$$

Each basis vector $\mathbf{f}_k$ is a sampled complex sinusoid at frequency $k/N$ cycles per sample. The DFT coefficient $X[k]$ is the inner product of the signal $\mathbf{x}$ with the $k$-th basis vector:

$$X[k] = \langle \mathbf{x}, \mathbf{f}_k \rangle = \sum_{n=0}^{N-1} x[n]\, \overline{\omega_N^{nk}} = \sum_{n=0}^{N-1} x[n]\, \omega_N^{-nk}$$

The DFT is therefore a change of basis from the standard basis (time domain) to the Fourier basis (frequency domain). The magnitude $|X[k]|$ measures how strongly the signal oscillates at frequency $k/N$ cycles per sample; the phase $\angle X[k]$ measures the phase of that oscillation.

The inverse DFT reconstructs the signal by superimposing all $N$ sinusoids with their computed amplitudes and phases:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, \omega_N^{nk}$$

The factor $1/N$ comes from the normalization of the basis - the $\mathbf{f}_k$ have squared norm $\lVert \mathbf{f}_k \rVert^2 = N$, not 1. Using the orthonormal basis $\mathbf{f}_k / \sqrt{N}$, both forward and inverse transforms carry a $1/\sqrt{N}$ factor. Different software packages use different conventions; always check.

Geometric picture: if you think of the signal as a vector in $\mathbb{C}^N$, the DFT rotates this vector into a new coordinate system. The new coordinate axes are complex sinusoids. The DFT matrix $F_N$ is the rotation matrix (more precisely, a scaled unitary matrix). The operation $\mathbf{X} = F_N \mathbf{x}$ is a matrix-vector product; naive evaluation takes $O(N^2)$ operations. The FFT computes this same product in $O(N \log N)$.
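The change-of-basis picture can be checked directly by building $F_N$ and comparing the matrix-vector product against np.fft.fft. A minimal sketch of the $O(N^2)$ path (for illustration, not production use):

```python
import numpy as np

def dft_matrix(N: int) -> np.ndarray:
    """DFT matrix with entries omega_N^{-kn} (engineering sign convention)."""
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N)

N = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

X_matrix = dft_matrix(N) @ x        # O(N^2) matrix-vector product
X_fft = np.fft.fft(x)               # O(N log N) FFT, same result

print(np.allclose(X_matrix, X_fft))  # True
```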

1.3 Why O(N log N) Changed the World

To appreciate the scale of the FFT's impact, consider the arithmetic:

| $N$ | Naive DFT ($N^2$) | FFT ($N \log_2 N$) | Speedup |
|---|---|---|---|
| 64 | 4,096 | 384 | 11 |
| 1,024 | 1,048,576 | 10,240 | 102 |
| 65,536 | $4.3 \times 10^9$ | $1.0 \times 10^6$ | 4,369 |
| $10^6$ | $10^{12}$ | $2.0 \times 10^7$ | 50,000 |

Before the FFT, computing the spectrum of a 1-second audio clip sampled at 44.1 kHz was infeasible in real time. After the FFT, it takes microseconds. The entire modern infrastructure of digital audio, wireless communications (OFDM), radar, MRI, seismology, and antenna design rests on this algorithmic breakthrough.

The secret is that the DFT matrix $F_N$ has enormous structure: it is a Vandermonde matrix with entries that are powers of the $N$-th roots of unity, and these entries satisfy recursive relationships that allow massive cancellation of work through a divide-and-conquer decomposition.

For AI (2026): Long-context language models face a quadratic bottleneck in self-attention: processing a 100K-token context with full attention requires $\sim 10^{10}$ operations. The FFT offers an alternative: if a sequence operation can be formulated as a convolution, it can be computed in $O(N \log N)$ using the FFT. This is the foundation of FNet (Lee-Thorp et al., 2022), FlashFFTConv (Dao et al., 2023), S4 (Gu et al., 2022), and the Fourier Neural Operator (Li et al., 2021). The $O(N \log N)$ vs $O(N^2)$ complexity gap, already dramatic at $N = 10^3$, becomes existential at $N = 10^5$.

1.4 Historical Timeline

The story of the FFT is a case study in mathematical rediscovery and the transformative effect of the right algorithm at the right moment.

1805 - Carl Friedrich Gauss derived what we now recognize as the FFT algorithm while computing asteroid orbits. His manuscript lay unpublished and largely unknown.

1903 - Carl Runge independently derived a related fast algorithm for trigonometric interpolation.

1942 - Danielson and Lanczos published a recursive factorization of the DFT based on even-odd splitting, establishing the core mathematical structure.

1965 - James Cooley and John Tukey published "An Algorithm for the Machine Calculation of Complex Fourier Series," the paper that launched the digital signal processing revolution. Their algorithm was motivated by Tukey's desire to detect Soviet nuclear tests through seismic monitoring. The paper, only a few pages long, is one of the most cited in the history of applied mathematics.

1966 - Gentleman and Sande introduced the "decimation in frequency" variant; Bergland gave a comprehensive tutorial; the algorithm became standard in every engineering discipline within a decade.

1997 - Frigo and Johnson released FFTW ("Fastest Fourier Transform in the West"), an adaptive library that benchmarks and selects the optimal algorithm for any $N$ and hardware. MATLAB and Julia use FFTW under the hood; NumPy and SciPy ship the comparable pocketfft library, so essentially all scientific computing FFT pipelines rest on this family of implementations.

2021 - Fourier Neural Operator (Li et al.) applies the DFT inside a neural network layer to learn solution operators for PDEs in O(NlogN)O(N \log N).

2022 - Monarch matrices (Dao et al.) parameterize FFT-like structured transforms as products of sparse butterfly matrices, enabling learnable near-FFT computations in transformer architectures.


2. Formal Definitions

2.1 The N-Point DFT

Definition (DFT). Let $\mathbf{x} = (x[0], x[1], \ldots, x[N-1]) \in \mathbb{C}^N$ be a finite sequence. The $N$-point Discrete Fourier Transform of $\mathbf{x}$ is the sequence $\mathbf{X} = (X[0], X[1], \ldots, X[N-1]) \in \mathbb{C}^N$ defined by:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, \omega_N^{-nk}, \quad k = 0, 1, \ldots, N-1$$

where $\omega_N = e^{2\pi i / N}$ is the primitive $N$-th root of unity.

The complex number $\omega_N^{-nk} = e^{-2\pi i nk/N}$ is a sampled complex sinusoid at frequency $k/N$ cycles per sample, evaluated at time $n$. The DFT coefficient $X[k]$ measures the complex amplitude (magnitude and phase) of the frequency-$k$ component of $\mathbf{x}$.

Sign convention. The minus sign in $\omega_N^{-nk}$ is the physics/engineering convention (analysis kernel $e^{-2\pi i nk/N}$). Some mathematical texts use $\omega_N^{+nk}$ (positive sign) for the forward transform; this merely relabels forward and inverse. We follow NumPy/SciPy: the forward DFT carries the minus sign.

Roots of unity. The $N$ values $\{\omega_N^0, \omega_N^1, \ldots, \omega_N^{N-1}\}$ are the $N$-th roots of unity - equally spaced points on the unit circle in $\mathbb{C}$, starting at 1 and rotating counterclockwise by $2\pi/N$ at each step. They satisfy:

$$\sum_{n=0}^{N-1} \omega_N^{nk} = \begin{cases} N & \text{if } k \equiv 0 \pmod{N} \\ 0 & \text{otherwise} \end{cases}$$

This orthogonality of roots of unity is the key identity underlying all DFT properties and the inversion formula.

Standard examples for small NN:

$N=2$: $\omega_2 = e^{i\pi} = -1$. The 2-point DFT is:

$$X[0] = x[0] + x[1], \quad X[1] = x[0] - x[1]$$

This is the Hadamard/Walsh transform in 2D - just a sum and difference.

$N=4$: $\omega_4 = e^{i\pi/2} = i$. The 4-point DFT:

$$X[0] = x[0]+x[1]+x[2]+x[3], \quad X[1] = x[0] - ix[1] - x[2] + ix[3]$$
$$X[2] = x[0]-x[1]+x[2]-x[3], \quad X[3] = x[0]+ix[1]-x[2]-ix[3]$$

Non-example: An infinite sequence $x[n]$ for $n \in \mathbb{Z}$ does NOT have a DFT - you must first window or truncate it to $N$ points. Applying the DFT directly to an infinite sequence is a category error; what you would compute is the $z$-transform or DTFT (Discrete-Time Fourier Transform), which is distinct from the DFT.

2.2 Inverse DFT and Normalization Conventions

Definition (IDFT). The Inverse Discrete Fourier Transform is:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, \omega_N^{nk}, \quad n = 0, 1, \ldots, N-1$$

Verification. Substitute the DFT into the IDFT:

$$\frac{1}{N}\sum_k X[k]\,\omega_N^{nk} = \frac{1}{N}\sum_k \left(\sum_m x[m]\,\omega_N^{-mk}\right)\omega_N^{nk} = \sum_m x[m] \cdot \frac{1}{N}\sum_k \omega_N^{(n-m)k} = \sum_m x[m]\, \delta[n-m] = x[n]$$

using the orthogonality of roots of unity in the penultimate step.

Three normalization conventions (all valid, software-dependent):

| Convention | Forward DFT | Inverse DFT | Used by |
|---|---|---|---|
| Engineering (default) | $X[k] = \sum_n x[n]\,\omega_N^{-nk}$ | $x[n] = \frac{1}{N}\sum_k X[k]\,\omega_N^{nk}$ | NumPy, SciPy, MATLAB |
| Symmetric (unitary) | $X[k] = \frac{1}{\sqrt{N}}\sum_n x[n]\,\omega_N^{-nk}$ | $x[n] = \frac{1}{\sqrt{N}}\sum_k X[k]\,\omega_N^{nk}$ | Some textbooks |
| Inverse-normalized | $X[k] = \frac{1}{N}\sum_n x[n]\,\omega_N^{-nk}$ | $x[n] = \sum_k X[k]\,\omega_N^{nk}$ | Rare; avoid |

For AI: Always check which convention a library uses. NumPy uses the engineering convention: np.fft.fft has no $1/N$ factor, np.fft.ifft has the $1/N$. PyTorch's torch.fft.fft matches NumPy. When mixing libraries or implementing from scratch, convention mismatch is a common silent bug.
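Recent NumPy versions expose all three conventions through the norm argument ("backward", "ortho", "forward"); a quick consistency check:

```python
import numpy as np

x = np.random.default_rng(1).standard_normal(8)
N = len(x)

X_eng   = np.fft.fft(x)                    # engineering: no 1/N on the forward pass
X_ortho = np.fft.fft(x, norm="ortho")      # unitary: 1/sqrt(N) on both passes
X_fwd   = np.fft.fft(x, norm="forward")    # 1/N on the forward pass

print(np.allclose(X_ortho, X_eng / np.sqrt(N)))   # True
print(np.allclose(X_fwd,   X_eng / N))            # True

# Round trips are exact under every convention
print(np.allclose(np.fft.ifft(X_eng), x))                    # True
print(np.allclose(np.fft.ifft(X_ortho, norm="ortho"), x))    # True
```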

2.3 The DFT Matrix

The DFT is a linear map $\mathbf{X} = F_N \mathbf{x}$, where the DFT matrix $F_N \in \mathbb{C}^{N \times N}$ has entries:

$$(F_N)_{kn} = \omega_N^{-kn}, \quad k, n = 0, 1, \ldots, N-1$$

The matrix is fully specified by its single generator $\omega_N$. Written out explicitly for $N=4$:

$$F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}$$

Unitarity. The DFT matrix satisfies $F_N F_N^* = N I_N$, where $F_N^*$ is the conjugate transpose. Equivalently, the normalized DFT matrix $\frac{1}{\sqrt{N}}F_N$ is unitary:

$$\frac{1}{\sqrt{N}}F_N \cdot \left(\frac{1}{\sqrt{N}}F_N\right)^* = I_N$$

Proof. The $(k, l)$ entry of $F_N F_N^*$ is:

$$(F_N F_N^*)_{kl} = \sum_{n=0}^{N-1} \omega_N^{-kn}\, \overline{\omega_N^{-ln}} = \sum_{n=0}^{N-1} \omega_N^{(l-k)n} = \begin{cases} N & \text{if } k=l \\ 0 & \text{if } k \neq l \end{cases}$$

by the orthogonality of roots of unity. Therefore $F_N F_N^* = NI$, so $F_N^{-1} = \frac{1}{N}F_N^*$, confirming the IDFT formula: $\mathbf{x} = \frac{1}{N}F_N^* \mathbf{X}$.
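A short numerical confirmation of this identity (the size $N = 8$ is arbitrary):

```python
import numpy as np

N = 8
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # (F_N)_{kn} = omega_N^{-kn}

print(np.allclose(F @ F.conj().T, N * np.eye(N)))      # True: F F* = N I
print(np.allclose(np.linalg.inv(F), F.conj().T / N))   # True: F^{-1} = F* / N
```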

Key structural properties of $F_N$:

  • Vandermonde structure: $F_N$ is a Vandermonde matrix with nodes $\omega_N^0, \omega_N^{-1}, \ldots, \omega_N^{-(N-1)}$
  • Symmetric: $(F_N)_{kn} = (F_N)_{nk}$ - the matrix is symmetric (not Hermitian)
  • Periodic indexing: all indices are taken modulo $N$, making the transform naturally circular

2.4 DFT as a Change of Basis

The columns of $F_N^*$ (equivalently, the rows of $F_N$ conjugated) are the Fourier basis vectors:

$$\mathbf{f}_k = \frac{1}{\sqrt{N}}\begin{bmatrix} 1 \\ \omega_N^k \\ \omega_N^{2k} \\ \vdots \\ \omega_N^{(N-1)k} \end{bmatrix} \in \mathbb{C}^N, \quad k = 0, 1, \ldots, N-1$$

These form an orthonormal basis for $\mathbb{C}^N$: $\langle \mathbf{f}_k, \mathbf{f}_l \rangle = \delta_{kl}$.

The DFT coefficient $X[k]/\sqrt{N}$ is the coordinate of $\mathbf{x}$ in the direction $\mathbf{f}_k$:

$$\frac{X[k]}{\sqrt{N}} = \left\langle \mathbf{x}, \mathbf{f}_k \right\rangle = \frac{1}{\sqrt{N}}\sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$$

Fourier basis as "frequency detector" vectors: $\mathbf{f}_k$ oscillates at exactly $k$ cycles across the $N$ samples. If $\mathbf{x}$ is itself a pure sinusoid at frequency $k_0$, then $|X[k]|$ is zero for all $k \neq k_0$ and equals $N$ at $k = k_0$. The DFT "detects" which frequencies are present by computing inner products with all $N$ basis sinusoids simultaneously.

Contrast with DTFT (Discrete-Time Fourier Transform): The DTFT is defined for infinite sequences as $X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\,e^{-j\omega n}$ and produces a continuous spectrum over $\omega \in [0, 2\pi)$. The DFT is a sampled version of the DTFT, evaluating it at $N$ equally-spaced frequencies. The DFT is the only version that is both finite and exactly invertible.

2.5 Relation to the Continuous Fourier Transform

The DFT approximates the continuous Fourier Transform of a sampled signal. Specifically, if $x(t)$ is bandlimited to $[-f_s/2, f_s/2)$ and sampled at rate $f_s$:

$$X[k] \approx \frac{1}{T_s}\,\hat{x}\!\left(\frac{k}{N T_s}\right) = \frac{1}{T_s}\,\hat{x}(k\,\Delta f)$$

where $\Delta f = f_s/N$ is the frequency bin spacing; the relation follows from viewing $T_s \sum_n x(nT_s)\,e^{-2\pi i (k\Delta f)\, n T_s} = T_s\, X[k]$ as a Riemann-sum approximation of the Fourier integral. The DFT thus provides a discretized snapshot of the continuous spectrum at $N$ frequency points.

Three key approximation errors arise:

  1. Aliasing: frequencies above $f_s/2$ fold back into $[0, f_s/2)$ - see Section 6.3
  2. Leakage: the signal is observed for only $N$ samples, equivalent to multiplying by a rectangular window, causing sidelobe contamination in the spectrum - see Section 5
  3. Picket fence: the DFT only evaluates the spectrum at $N$ discrete frequencies, potentially missing peaks between bins - see Section 6.2

3. Properties of the DFT

All DFT properties follow from linearity of the sum and the orthogonality of roots of unity. We state each property, prove it, and note its discrete-specific character where it differs from the continuous FT.

3.1 Linearity

$$\mathcal{F}\{ax[n] + by[n]\}[k] = a\,X[k] + b\,Y[k]$$

Proof. Immediate from linearity of summation: $\sum_n (ax[n]+by[n])\,\omega_N^{-nk} = a\sum_n x[n]\,\omega_N^{-nk} + b\sum_n y[n]\,\omega_N^{-nk}$.

Linearity makes the DFT a linear operator on $\mathbb{C}^N$, represented by the matrix $F_N$.

3.2 Circular Shift

If $y[n] = x[n - m \bmod N]$ (shift $\mathbf{x}$ right by $m$ positions with wrap-around), then:

$$Y[k] = \omega_N^{-mk}\, X[k]$$

Proof.

$$Y[k] = \sum_{n=0}^{N-1} x[n-m]\,\omega_N^{-nk} \overset{l=n-m}{=} \sum_{l=0}^{N-1} x[l]\,\omega_N^{-(l+m)k} = \omega_N^{-mk} \sum_{l=0}^{N-1} x[l]\,\omega_N^{-lk} = \omega_N^{-mk}\, X[k]$$

Important distinction from continuous case. In the continuous FT, a time shift by $\tau$ introduces the factor $e^{-2\pi i\xi\tau}$. In the DFT, the shift is circular (modular): shifting right pushes the last sample to the first position. If you apply a linear (non-circular) shift to a finite-length signal, the result is NOT simply the DFT multiplied by a phase factor - truncation effects occur. This is a frequent source of confusion when applying the shift property carelessly.
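A minimal check of the circular-shift property, using np.roll (which implements exactly the wrap-around shift assumed here):

```python
import numpy as np

N, m = 16, 3
x = np.random.default_rng(2).standard_normal(N)

X = np.fft.fft(x)
Y = np.fft.fft(np.roll(x, m))                  # circular shift right by m

k = np.arange(N)
phase = np.exp(-2j * np.pi * m * k / N)        # omega_N^{-mk}
print(np.allclose(Y, phase * X))               # True: circular shift <-> phase ramp
```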

Application. In signal processing, circular shift models delay in a circular buffer. In machine learning, circular convolution (Section 3.6) exploits this property to compute convolutions via FFT.

3.3 Modulation (Frequency Shift)

If $y[n] = \omega_N^{mn}\, x[n]$ (multiply by a complex sinusoid), then:

$$Y[k] = X[k - m \bmod N]$$

This is the dual of the circular shift property: multiplication in time = shift in frequency.

Application. Frequency-shift keying (FSK) in communications. In transformers, rotary position encoding (RoPE) multiplies embeddings by complex exponentials - this is modulation in the DFT sense, shifting the frequency-domain representation of each token by its position.

3.4 Conjugate Symmetry for Real Inputs

If $x[n] \in \mathbb{R}$ for all $n$, then:

$$X[N-k] = X[k]^*, \quad k = 0, 1, \ldots, N-1$$

Proof. $X[N-k] = \sum_n x[n]\,\omega_N^{-n(N-k)} = \sum_n x[n]\,\omega_N^{nk} = \sum_n x[n]\,\overline{\omega_N^{-nk}} = \overline{X[k]}$, using $x[n] \in \mathbb{R}$ so $x[n] = \overline{x[n]}$ and $\omega_N^{nN} = 1$.

Consequence. Only $\lfloor N/2 \rfloor + 1$ DFT coefficients $X[0], X[1], \ldots, X[N/2]$ are independent; the rest are determined by conjugate symmetry. NumPy's rfft exploits this to return only the non-redundant half, saving memory and computation. For AI applications working with real-valued signals (audio, time-series), always use rfft/irfft.

DC and Nyquist bins: $X[0] = \sum_n x[n] \in \mathbb{R}$ (always real, proportional to the mean). For even $N$: $X[N/2] = \sum_n x[n](-1)^n \in \mathbb{R}$ (always real, proportional to the alternating sum).
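A small sketch showing how rfft relates to the full fft for a real input, and that the DC and Nyquist bins come out (numerically) real:

```python
import numpy as np

N = 8
x = np.random.default_rng(3).standard_normal(N)     # real-valued signal

X_full = np.fft.fft(x)            # N complex values, conjugate-symmetric
X_half = np.fft.rfft(x)           # only the N/2 + 1 non-redundant values

print(len(X_half))                                    # 5  (= N/2 + 1)
print(np.allclose(X_full[:N//2 + 1], X_half))         # True
print(np.allclose(X_full[N-1], np.conj(X_full[1])))   # True: X[N-k] = X[k]*

# DC and Nyquist bins are real for real input
print(abs(X_half[0].imag) < 1e-12, abs(X_half[-1].imag) < 1e-12)  # True True
```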

3.5 Parseval's Theorem for the DFT

$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2$$

Proof. Since $\frac{1}{\sqrt{N}}F_N$ is unitary, it preserves the $\ell^2$ norm:

$$\lVert \mathbf{x} \rVert^2 = \left\lVert \frac{1}{\sqrt{N}}F_N \mathbf{x} \right\rVert^2 = \frac{1}{N}\lVert \mathbf{X} \rVert^2$$

Interpretation. Energy is perfectly preserved by the DFT - the total energy in the time domain equals the total energy in the frequency domain (up to the $1/N$ normalization factor). This is the discrete analog of Plancherel's theorem from Section 02.

Application. The power spectral density (PSD) is $|X[k]|^2 / N$, which by Parseval sums to the total signal power. In Whisper's mel spectrogram, the log power $\log(|X[k]|^2)$ of each FFT bin is computed and then summed over mel filterbanks.
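Parseval's theorem is a one-line numerical check:

```python
import numpy as np

x = np.random.default_rng(4).standard_normal(1024)
X = np.fft.fft(x)

energy_time = np.sum(np.abs(x) ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / len(x)

print(np.isclose(energy_time, energy_freq))   # True: equal up to rounding error
```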

3.6 Circular Convolution - Preview

The most important DFT property is the circular convolution theorem: pointwise multiplication in frequency corresponds to circular convolution in time.

Definition. The circular (cyclic) convolution of $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{C}^N$ is:

$$(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\, y[n-m \bmod N]$$

Circular Convolution Theorem. $\mathcal{F}\{x \circledast y\}[k] = X[k] \cdot Y[k]$.

Preview: This theorem is the entire foundation of FFT-based convolution. The full treatment - including the distinction between circular and linear convolution, overlap-add/save for long signals, filter design, and the connection to CNNs - is the subject of Section 20-04 Convolution Theorem. Here we note the fact; the applications follow in Section 04.

Why circular? The DFT implicitly assumes the signal is periodic with period $N$. The product $X[k]\,Y[k]$ is the DFT of the circular (not linear) convolution of $\mathbf{x}$ and $\mathbf{y}$. To compute linear convolution via FFT, zero-pad both sequences to length $N + M - 1$ before taking the DFT (Section 6.2), as the sketch below illustrates.
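A sketch of both options: multiplying un-padded DFTs gives the circular convolution, while zero-padding to length $L_1 + L_2 - 1$ recovers the linear convolution (the signal values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # length L1 = 4
h = np.array([1.0, -1.0, 0.5])               # length L2 = 3

# Circular convolution (length 4): FFT multiply without padding
circ = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, n=len(x))).real

# Linear convolution (length L1 + L2 - 1 = 6): zero-pad both first
M = len(x) + len(h) - 1
lin = np.fft.ifft(np.fft.fft(x, n=M) * np.fft.fft(h, n=M)).real

print(np.allclose(lin, np.convolve(x, h)))   # True: matches direct linear convolution
print(circ)                                  # wrap-around contaminates the first samples
```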

3.7 DFT Properties Master Table

| Property | Time domain $x[n]$ | Frequency domain $X[k]$ |
|---|---|---|
| Linearity | $ax[n] + by[n]$ | $aX[k] + bY[k]$ |
| Circular shift | $x[n-m \bmod N]$ | $\omega_N^{-mk}\, X[k]$ |
| Modulation | $\omega_N^{mn}\, x[n]$ | $X[k-m \bmod N]$ |
| Conjugate symmetry | $x[n] \in \mathbb{R}$ | $X[N-k] = X[k]^*$ |
| Time reversal | $x[-n \bmod N]$ | $X[-k \bmod N] = X[N-k]$ |
| Conjugation | $x[n]^*$ | $X[-k \bmod N]^* = X[N-k]^*$ |
| Circular convolution | $(x \circledast y)[n]$ | $X[k]\cdot Y[k]$ |
| Pointwise product | $x[n] \cdot y[n]$ | $\frac{1}{N}(X \circledast Y)[k]$ |
| Parseval | $\sum_n \lvert x[n] \rvert^2$ | $\frac{1}{N}\sum_k \lvert X[k] \rvert^2$ |
| Duality | $X[n]$ | $N\, x[-k \bmod N]$ |

4. The Fast Fourier Transform Algorithm

4.1 Complexity of Naive DFT

Direct evaluation of $X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$ for all $k = 0, \ldots, N-1$ requires:

  • $N$ complex multiplications per output bin (multiplying each $x[n]$ by $\omega_N^{-nk}$)
  • $N-1$ complex additions per bin
  • Total: $N^2$ complex multiplications, $N(N-1)$ complex additions
  • Each complex multiplication = 4 real multiplications + 2 additions

For $N = 1024$: about $10^6$ complex multiplications. For $N = 65{,}536$: about $4 \times 10^9$. At roughly $10^9$ operations per second, a 65,536-point DFT takes about 4 seconds. Audio applications need hundreds per second. The naive DFT is simply not usable.

4.2 The Cooley-Tukey Insight: Divide and Conquer

Assume $N = 2^m$ (a power of 2). Split the sum into even-indexed and odd-indexed samples, pulling the common factor $\omega_N^{-k}$ out of the odd branch:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk} = \underbrace{\sum_{r=0}^{N/2-1} x[2r]\,\omega_N^{-2rk}}_{E[k]} + \omega_N^{-k}\underbrace{\sum_{r=0}^{N/2-1} x[2r+1]\,\omega_N^{-2rk}}_{O[k]}$$

Observe that $\omega_N^{-2rk} = e^{-2\pi i(2r)k/N} = e^{-2\pi i rk/(N/2)} = \omega_{N/2}^{-rk}$. Therefore:

$$E[k] = \sum_{r=0}^{N/2-1} x[2r]\,\omega_{N/2}^{-rk} = \text{DFT}_{N/2}\!\left(\{x[0], x[2], x[4], \ldots\}\right)[k]$$
$$O[k] = \sum_{r=0}^{N/2-1} x[2r+1]\,\omega_{N/2}^{-rk} = \text{DFT}_{N/2}\!\left(\{x[1], x[3], x[5], \ldots\}\right)[k]$$

Both $E[k]$ and $O[k]$ are $(N/2)$-point DFTs, and both are periodic in $k$ with period $N/2$. Combined with $\omega_N^{-(k+N/2)} = -\omega_N^{-k}$, this gives, for $k = 0, \ldots, N/2-1$:

$$\boxed{X[k] = E[k] + \omega_N^{-k}\, O[k]} \qquad \boxed{X[k + N/2] = E[k] - \omega_N^{-k}\, O[k]}$$

This is the Cooley-Tukey butterfly: two $N/2$-point DFTs plus $N/2$ complex multiplications (by the twiddle factors $\omega_N^{-k}$) yield the full $N$-point DFT.

Recursion: Apply the same split to each $N/2$-point DFT, then to each $N/4$-point DFT, and so on, until we reach 2-point DFTs (which are trivial: $X[0] = x[0]+x[1]$, $X[1] = x[0]-x[1]$).
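The recursion translates almost line for line into code. A minimal, unoptimized radix-2 sketch for power-of-2 lengths, checked against np.fft.fft:

```python
import numpy as np

def fft_radix2(x):
    """Recursive Cooley-Tukey DIT FFT. Requires len(x) to be a power of 2."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    E = fft_radix2(x[0::2])                       # DFT of even-indexed samples
    O = fft_radix2(x[1::2])                       # DFT of odd-indexed samples
    k = np.arange(N // 2)
    twiddle = np.exp(-2j * np.pi * k / N)         # omega_N^{-k}
    return np.concatenate([E + twiddle * O,       # X[k]       = E[k] + W^k O[k]
                           E - twiddle * O])      # X[k + N/2] = E[k] - W^k O[k]

x = np.random.default_rng(5).standard_normal(64)
print(np.allclose(fft_radix2(x), np.fft.fft(x)))  # True
```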

4.3 Butterfly Diagram

The Cooley-Tukey recursion is beautifully represented as a signal flow graph called the butterfly diagram. For an 8-point DFT ($N=8$, $\log_2 8 = 3$ stages):

DFT-8 BUTTERFLY DIAGRAM (Decimation in Time)
========================================================================

  Input            Stage 1          Stage 2          Stage 3      Output
  (bit-reversed)   2-point DFTs     4-point DFTs     8-point DFT

  x[0] ----+--------------+---------------+-------- X[0]
           |              |               |
  x[4] ----+ W0   ...     |               |
           |              |               |
  x[2] ----+              | W0   ...      |
           |              |               |
  x[6] ----+              |               |
           |              |               |
  x[1] ----+              |               | W0
           |              |               |
  x[5] ----+              |               | W1
           |              |               |
  x[3] ----+              |               | W2
           |              |               |
  x[7] ----+--------------+---------------+-------- X[7]

  Each butterfly: X[k]       = E[k] + W^k * O[k]
                  X[k + N/2] = E[k] - W^k * O[k]

  W^k = ω_N^{-k} = e^{-2πik/N}   (twiddle factor)

========================================================================

Each butterfly is a 2-input, 2-output operation:

  a ---+--------------- a + W*b
       | \           /
       |  \         /
       |   (butterfly crossing)
       |  /         \
       | /           \
  b ---+--------------- a - W*b

For $N = 2^m$, the diagram has $m = \log_2 N$ stages, each containing $N/2$ butterflies. Total butterflies: $\frac{N}{2}\log_2 N$.

4.4 Decimation in Time vs Decimation in Frequency

Decimation in Time (DIT): Split the input by even/odd indices. The input must be in bit-reversed order (see Section 4.5), while the output emerges in natural order. This is the form described above, and the most common in practice.

Decimation in Frequency (DIF): Split the output by even/odd frequencies. The input is in natural order, while the output emerges in bit-reversed order. The butterfly computation slightly differs but the complexity is identical.

Both variants require exactly the same number of operations. The choice is often determined by memory access patterns or hardware constraints.

In-place computation: The butterfly computation can be performed in-place - the two outputs overwrite the two inputs without needing extra memory. This makes the FFT extremely memory-efficient for large NN.

4.5 Bit-Reversal Permutation

The DIT-FFT requires the input in bit-reversed order. For $N = 8$ (3-bit indices):

| Natural index | Binary | Bit-reversed binary | Bit-reversed index |
|---|---|---|---|
| 0 | 000 | 000 | 0 |
| 1 | 001 | 100 | 4 |
| 2 | 010 | 010 | 2 |
| 3 | 011 | 110 | 6 |
| 4 | 100 | 001 | 1 |
| 5 | 101 | 101 | 5 |
| 6 | 110 | 011 | 3 |
| 7 | 111 | 111 | 7 |

Why bit-reversal arises: Each stage of the DIT recursion separates samples by their last bit (even vs odd) - what was the last bit of the original index becomes the first bit at the deepest recursion level. After $\log_2 N$ stages of even-odd splitting, the indices are fully bit-reversed.

Efficient computation: Bit-reversal can be computed in $O(N)$ time using integer bit-reversal operations or a simple loop with bin() in Python. Hardware FFT chips often include dedicated bit-reversal circuits.
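A minimal sketch of the permutation (using Python string formatting rather than bin(), purely for brevity):

```python
import numpy as np

def bit_reversal_permutation(N: int) -> np.ndarray:
    """Indices of 0..N-1 in bit-reversed order, for N a power of 2."""
    bits = N.bit_length() - 1
    return np.array([int(format(i, f"0{bits}b")[::-1], 2) for i in range(N)])

print(bit_reversal_permutation(8))   # [0 4 2 6 1 5 3 7] - matches the table above
```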

4.6 Complexity Analysis: O(N log N)

Let $T(N)$ be the number of complex multiplications required by the FFT. The recursion is:

$$T(N) = 2\,T(N/2) + \frac{N}{2}$$

where $N/2$ twiddle-factor multiplications are needed to combine two $N/2$-point DFTs. With $T(1) = 0$:

Solving the recurrence:

$$T(N) = 2T(N/2) + N/2$$
$$= 2\bigl[2T(N/4) + N/4\bigr] + N/2 = 4T(N/4) + N/2 + N/2$$
$$= 4\bigl[2T(N/8) + N/8\bigr] + N = 8T(N/8) + 3N/2$$
$$\vdots$$
$$= N\,T(1) + \frac{N}{2}\log_2 N = \frac{N}{2}\log_2 N$$

Total operations: $\frac{N}{2}\log_2 N$ complex multiplications and $N\log_2 N$ complex additions. This is $O(N \log N)$.

Comparison with naive DFT ($N^2$ multiplications):

| $N$ | FFT / DFT ratio |
|---|---|
| 64 | $3/64 \approx 4.7\%$ |
| 1,024 | $5/1024 \approx 0.49\%$ |
| $10^6$ | $\approx 0.002\%$ |

For large NN, the FFT uses less than 0.01% of the operations required by the naive DFT.

Empirical validation: The $O(N \log N)$ curve can be confirmed by timing numpy.fft.fft for $N = 2^k$, $k = 6, \ldots, 20$ - the log-log plot should have slope $\approx 1$, since $N \log N$ grows only slightly faster than linear for moderate $N$.
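A rough version of that timing experiment; absolute numbers depend entirely on the machine, only the near-linear trend matters:

```python
import time
import numpy as np

for k in range(6, 21, 2):
    N = 2 ** k
    x = np.random.default_rng(0).standard_normal(N)
    t0 = time.perf_counter()
    for _ in range(10):                    # average over a few runs
        np.fft.fft(x)
    elapsed = (time.perf_counter() - t0) / 10
    print(f"N = 2^{k:2d}   time per FFT = {elapsed * 1e6:9.1f} us")
```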

4.7 Variants and Extensions

Mixed-radix FFT. The pure radix-2 algorithm requires $N$ to be a power of 2. Real-world signals rarely have power-of-2 lengths. Mixed-radix algorithms factor $N = N_1 \times N_2 \times \cdots \times N_r$ into small prime factors and apply a generalized Cooley-Tukey recursion. With $N = N_1 N_2$, the DFT can be decomposed into $N_1$ DFTs of size $N_2$ and $N_2$ DFTs of size $N_1$. Common radices: 2, 3, 4, 5, 7, 8.

Split-radix FFT. Combines radix-2 and radix-4 butterflies to achieve approximately $\frac{1}{3}N\log_2 N$ complex multiplications - long the best known operation count for power-of-2 lengths.

Prime-length algorithms:

  • Rader's algorithm (1968): Converts a prime-length DFT into a circular convolution of composite length, solvable by FFT.
  • Bluestein's algorithm (1970): Converts any DFT into a convolution via the identity $nk = \binom{n+k}{2} - \binom{n}{2} - \binom{k}{2}$, enabling FFT-based computation for any $N$.

Real-valued FFT (rfft). For real inputs, conjugate symmetry (Section 3.4) means only $N/2+1$ output values are independent. np.fft.rfft computes only these, roughly halving computation and memory.

FFTW (Fastest Fourier Transform in the West). FFTW (Frigo & Johnson, 2005) automatically selects the best algorithm for any $N$ and hardware through a "plan" computed at initialization time. It achieves near-optimal performance across a wide range of $N$ and machine architectures. MATLAB and Julia use FFTW under the hood; NumPy and SciPy ship the comparable pocketfft library.

For AI: PyTorch's torch.fft.fft uses CUDA-accelerated FFT (cuFFT) on GPU. For sequence lengths that are not powers of 2, cuFFT uses mixed-radix algorithms. When tuning sequence lengths for FFT-based models (FNet, FNO, FlashFFTConv), choosing $N = 2^k$ or $N = 2^a \cdot 3^b \cdot 5^c$ avoids prime-length slowdowns.


5. Spectral Leakage and Windowing

5.1 The Leakage Problem

The DFT assumes that the $N$ samples represent one complete period of a periodic signal. In reality, we observe a finite segment of a continuous signal - we multiply by a rectangular window that is 1 inside $[0, N-1]$ and 0 outside. In frequency, multiplication by a rectangular window corresponds to convolution with the DFT of the rectangle - the Dirichlet kernel:

$$W_{\text{rect}}[k] = \sum_{n=0}^{N-1} \omega_N^{-nk} = \begin{cases} N & k=0 \\ \dfrac{1-\omega_N^{-Nk}}{1-\omega_N^{-k}} = \dfrac{\sin(\pi k)}{\sin(\pi k/N)}\,e^{-i\pi k(N-1)/N} & k \neq 0 \end{cases}$$

For non-integer frequencies (frequencies that do not land exactly on a DFT bin), the Dirichlet kernel's large sidelobes spread energy from the true frequency into adjacent bins - this is spectral leakage.

Example. Suppose the signal is $x(t) = \cos(2\pi f_0 t)$ with $f_0 = 3.5\, f_s/N$ (a non-integer number of cycles in the window). The true spectrum has just two peaks at $\pm f_0$. The DFT shows energy spread across all $N$ bins, with the main peak at the nearest bin ($k=3$ or $k=4$) and significant sidelobe contamination at all other bins.

Consequence for AI: In Whisper's mel spectrogram, a sharp onset (e.g., a consonant burst) at a non-bin frequency would appear smeared across multiple mel filterbanks. Window choice directly affects how well transient events are separated in the spectrogram.
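A small sketch of the leakage effect, comparing an on-bin and an off-bin tone (the sampling rate and length are illustrative): the on-bin tone occupies a single bin, while the off-bin tone spreads across essentially the whole spectrum.

```python
import numpy as np

fs, N = 64.0, 64
n = np.arange(N)

on_bin  = np.cos(2 * np.pi * (3.0 * fs / N) * n / fs)   # exactly 3 cycles in the window
off_bin = np.cos(2 * np.pi * (3.5 * fs / N) * n / fs)   # 3.5 cycles: worst-case leakage

X_on  = np.abs(np.fft.rfft(on_bin))
X_off = np.abs(np.fft.rfft(off_bin))

print(np.sum(X_on > 1e-9))    # 1  -> all energy sits in bin k = 3
print(np.sum(X_off > 1e-9))   # 33 -> energy leaks into essentially every bin
```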

5.2 Window Functions

A window function $w[n]$ tapers the signal to zero at its endpoints, reducing the discontinuity that causes leakage. The windowed DFT is:

$$X_w[k] = \sum_{n=0}^{N-1} w[n]\, x[n]\, \omega_N^{-nk}$$

The spectrum $X_w$ is the DFT of $w[n]\,x[n]$ - by the product property (Section 3.7), this equals the circular convolution of $X$ and $W$ (normalized by $1/N$). Windows with smaller sidelobes produce less leakage at the cost of a wider main lobe (reduced frequency resolution).

Standard window functions (all defined for $n = 0, 1, \ldots, N-1$):

Rectangular (Boxcar):

$$w[n] = 1$$
  • Main lobe width: $4\pi/N$ (narrowest)
  • Peak sidelobe: $-13\,\mathrm{dB}$ (highest leakage)
  • Use case: when the signal exactly fits $N$ samples; spectral analysis of periodic signals at bin frequencies

Hann (Hanning):

$$w[n] = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N-1}\right)$$
  • Main lobe width: $8\pi/N$
  • Peak sidelobe: $-31.5\,\mathrm{dB}$
  • Use case: default choice for general spectral analysis; used in Whisper STFT

Hamming:

$$w[n] = 0.54 - 0.46\cos\frac{2\pi n}{N-1}$$
  • Main lobe width: $8\pi/N$
  • Peak sidelobe: $-42.7\,\mathrm{dB}$
  • Use case: narrowband applications requiring low peak sidelobes; same main-lobe width as Hann, but slower sidelobe rolloff

Blackman:

$$w[n] = 0.42 - 0.5\cos\frac{2\pi n}{N-1} + 0.08\cos\frac{4\pi n}{N-1}$$
  • Main lobe width: $12\pi/N$
  • Peak sidelobe: $-58\,\mathrm{dB}$
  • Use case: high-dynamic-range spectral analysis

Kaiser:

$$w[n] = \frac{I_0\!\left(\beta\sqrt{1-(2n/(N-1)-1)^2}\right)}{I_0(\beta)}$$

where $I_0$ is the zeroth-order modified Bessel function and $\beta$ is a shape parameter.

  • $\beta = 0$: rectangular; $\beta \approx 5$: similar to Hamming; $\beta \approx 8.6$: similar to Blackman
  • Near-optimal: the Kaiser window closely approximates the prolate-spheroidal (DPSS) window, which maximizes main-lobe energy concentration for a given bandwidth
  • Use case: when precise sidelobe control is required; used in scientific spectral analysis and filter design
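The peak-sidelobe figures quoted above can be estimated empirically from the window transforms. A rough sketch using NumPy's built-in window functions (the main-lobe detection is deliberately crude):

```python
import numpy as np

N = 64
windows = {
    "rectangular": np.ones(N),
    "hann":        np.hanning(N),
    "hamming":     np.hamming(N),
    "blackman":    np.blackman(N),
    "kaiser(8.6)": np.kaiser(N, 8.6),
}

for name, w in windows.items():
    # Zero-pad heavily so the sidelobe structure is finely sampled
    W = np.abs(np.fft.rfft(w, n=16 * N))
    W /= W[0]                                    # normalize the main-lobe peak to 1
    # Crude estimate: first local minimum after the peak ends the main lobe
    main_lobe_end = np.argmax(np.diff(W) > 0)
    sidelobe_db = 20 * np.log10(W[main_lobe_end:].max())
    print(f"{name:12s} peak sidelobe ~ {sidelobe_db:6.1f} dB")
```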

5.3 The Main-Lobe/Side-Lobe Tradeoff

Every window function faces a fundamental constraint: main-lobe width (in frequency bins) and side-lobe level (in dB) cannot be minimized simultaneously. This is a consequence of the uncertainty principle applied to finite sequences.

| Window | Main-lobe width (bins) | Peak sidelobe (dB) | Sidelobe rolloff (dB/oct) |
|---|---|---|---|
| Rectangular | 2 | -13 | -6 |
| Hann | 4 | -31.5 | -18 |
| Hamming | 4 | -42.7 | -6 |
| Blackman | 6 | -58 | -18 |
| Blackman-Harris | 8 | -92 | -6 |
| Kaiser ($\beta=8.6$) | 8 | -80 | -6 |

Coherent gain: the window amplitude gain, $\sum_n w[n]/N$. Windows with lower sidelobes also have lower coherent gain (the main peak is smaller), which must be corrected by dividing by the coherent gain when measuring amplitudes.

Equivalent noise bandwidth (ENBW): the width of a rectangular window that would pass the same noise power. Hann has ENBW $= 1.5$ bins - it is effectively 1.5 times wider than the rectangular window for noise measurements.

5.4 Scalloping Loss and the Picket Fence Effect

The DFT evaluates the spectrum at exactly $N$ uniformly-spaced bins. If a sinusoid's true frequency $f_0$ falls exactly halfway between two bins (at $k + 0.5$ in bin units), then both neighboring bins are equally attenuated. The worst-case attenuation - the scalloping loss - depends on the window:

  • Rectangular: $-3.9\,\mathrm{dB}$ (worst case for nearest-bin amplitude estimate)
  • Hann: $-1.4\,\mathrm{dB}$
  • Blackman: $-0.8\,\mathrm{dB}$

This is the picket fence effect: the DFT spectrum looks like a picket fence - it shows signal energy at the posts (bins) but may miss peaks in the gaps between posts.

Mitigation strategies:

  1. Zero-padding (Section 6.2): append zeros to increase $N$ without changing the observation window, interpolating the spectrum between bins. This does NOT improve frequency resolution (the resolution limit $1/T$ is unchanged) but allows finer frequency estimation.

  2. Parabolic interpolation: fit a parabola to the peak bin and its two neighbors; the parabola's maximum is a better estimate of the true frequency (see the sketch after this list).

  3. Quinn's estimator and Jacobsen's estimator: closed-form formulas giving sub-bin frequency estimates with low computational cost.
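A sketch combining strategies 1 and 2: zero-pad to refine the grid, then parabolically interpolate the log-magnitude peak. The test frequency below is an arbitrary off-bin value chosen for illustration:

```python
import numpy as np

fs, N = 1000.0, 256
f_true = 123.4                                   # deliberately off-bin (bin width ~3.9 Hz)
n = np.arange(N)
x = np.hanning(N) * np.cos(2 * np.pi * f_true * n / fs)

pad = 8                                          # zero-pad factor: finer frequency grid
X = np.abs(np.fft.rfft(x, n=pad * N))

k = np.argmax(X)                                 # coarse peak on the padded grid
# Parabolic (quadratic) interpolation on the log magnitude around the peak
a, b, c = np.log(X[k - 1]), np.log(X[k]), np.log(X[k + 1])
delta = 0.5 * (a - c) / (a - 2 * b + c)          # sub-bin offset in padded-bin units
f_est = (k + delta) * fs / (pad * N)

print(f"true = {f_true} Hz, estimated = {f_est:.3f} Hz")
```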

5.5 Overlap-Add and Overlap-Save

For long signals (audio tracks, continuous data streams), we cannot apply a single DFT to the entire signal. We must process it in blocks. Two standard methods handle the boundary effects at block edges:

Overlap-Add (OLA):

  1. Segment the input into blocks of length $M$
  2. Window each block with $w[n]$ (length $M$)
  3. Zero-pad each block to length $N \geq M + L - 1$ (where $L$ is the filter length)
  4. Compute DFT-based processing on each block
  5. Overlap and add the outputs (adjacent outputs share $L-1$ samples)

Overlap-Save (OLS):

  1. Segment the input into blocks of length $N$ with overlap of $L-1$ samples
  2. DFT-based processing on each block (no zero-padding needed)
  3. Discard the first $L-1$ samples of each output block (the "contaminated" samples)
  4. Concatenate the remaining $N - L + 1$ samples

Both methods allow exact (not merely approximate) processing of long signals with FFT-based operations. The COLA (Constant Overlap-Add) condition for perfect reconstruction is:

$$\sum_{m=-\infty}^{\infty} w[n - mH] = C \quad \forall n$$

where $H$ is the hop size. Hann windows with 50% overlap ($H = N/2$) satisfy COLA, making them the standard for STFT-based audio processing.
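A quick numerical check of COLA for the Hann window at 50% overlap; note that the periodic Hann (denominator $N$) is required here, not NumPy's symmetric np.hanning:

```python
import numpy as np

N, H = 512, 256                                   # window length, hop = 50% overlap
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))         # periodic Hann (denominator N)

# Sum many hop-shifted copies and check the interior is constant
length = 8 * N
acc = np.zeros(length)
for start in range(0, length - N + 1, H):
    acc[start:start + N] += w

interior = acc[N:-N]                              # ignore edges where copies are missing
print(interior.min(), interior.max())             # both ~1.0: COLA holds with C = 1
```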


6. Frequency Resolution and Zero-Padding

6.1 Frequency Resolution

The frequency resolution of the DFT is:

$$\Delta f = \frac{f_s}{N} = \frac{1}{T}$$

where $T = N/f_s$ is the total observation duration. This fundamental limit says: to distinguish two sinusoids at frequencies $f_1$ and $f_2$, you need to observe the signal for at least $T = 1/|f_1 - f_2|$ seconds. No amount of zero-padding or upsampling can circumvent this - it is a direct consequence of the uncertainty principle from Section 02.

The time-bandwidth product is fixed: $T \cdot \Delta f = 1$. You cannot simultaneously have:

  • Short observation window (good for tracking rapid changes)
  • Fine frequency resolution (good for distinguishing close frequencies)

For AI: Whisper uses a 25 ms analysis window (400 samples at 16 kHz) with 10 ms hop size, giving $\Delta f = 1/0.025 = 40$ Hz frequency resolution. This is a deliberate engineering choice: fine enough to separate speech formants (typically 100-300 Hz apart) while short enough to track fast phoneme transitions.

6.2 Zero-Padding

Zero-padding: appending $M - N$ zeros to a signal of length $N$ before computing an $M$-point DFT ($M > N$).

What zero-padding does: it computes $M$ equally-spaced samples of the continuous DTFT of the original $N$-point signal (i.e., the DTFT sampled at $M$ bins instead of $N$). This is frequency-domain interpolation - it fills in points between the original $N$ bins.

What zero-padding does NOT do: improve frequency resolution. The DTFT itself has resolution limited by $1/N$ (set by the original number of samples). Zero-padding only reveals the DTFT more finely; it cannot resolve frequencies closer than $1/N$ cycles per sample apart.
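A sketch illustrating the distinction: two tones 1 Hz apart stay merged under any amount of zero-padding, and only a longer observation separates them (all parameters are illustrative):

```python
import numpy as np

fs = 100.0
f1, f2 = 20.0, 21.0                # two tones 1 Hz apart

def two_tones(N):
    n = np.arange(N)
    return np.cos(2*np.pi*f1*n/fs) + np.cos(2*np.pi*f2*n/fs)

def significant_peaks(x, pad=1):
    """Local maxima of the Hann-windowed magnitude spectrum above 20% of its peak."""
    X = np.abs(np.fft.rfft(x * np.hanning(len(x)), n=pad * len(x)))
    is_peak = (X[1:-1] > X[:-2]) & (X[1:-1] > X[2:]) & (X[1:-1] > 0.2 * X.max())
    return int(np.sum(is_peak))

print(significant_peaks(two_tones(64)))           # 1: T = 0.64 s cannot resolve 1 Hz
print(significant_peaks(two_tones(64), pad=16))   # 1: zero-padding interpolates, nothing more
print(significant_peaks(two_tones(512)))          # 2: a longer record (T = 5.12 s) resolves them
```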

Applications of zero-padding:

  1. Sub-bin frequency estimation (combine with interpolation): zero-padding by factor of 4-8 allows visual identification of spectral peaks with sub-bin accuracy.

  2. Linear convolution via FFT: to convolve signals of lengths $L_1$ and $L_2$ linearly (not circularly), zero-pad both to length $\geq L_1 + L_2 - 1$ before DFT multiplication (Section 3.6). The circular convolution of the zero-padded sequences equals the linear convolution of the originals.

  3. Next power-of-2: zero-pad to the nearest power of 2 to maximize FFT efficiency, e.g., 1000 -> 1024.

For AI: The Fourier Neural Operator (Section 8.2) explicitly truncates the spectrum to the lowest $K$ frequencies before learning, implicitly treating the signal as if zero-padded beyond the training domain. FlashFFTConv similarly uses zero-padding to convert long circular convolutions to linear ones.

6.3 The Nyquist-Shannon Sampling Theorem

Theorem (Shannon, 1949; Nyquist, 1928; Whittaker, 1915). If a continuous signal $x(t)$ is bandlimited - meaning $\hat{x}(\xi) = 0$ for $|\xi| > B$ - then $x(t)$ can be perfectly reconstructed from samples at rate $f_s \geq 2B$.

Why the DFT cares: if $f_s < 2B$, frequencies above $f_s/2$ alias onto lower frequencies. Specifically, a sinusoid at frequency $f_0 > f_s/2$ produces the same sample sequence as a sinusoid at frequency $f_s - f_0 < f_s/2$. In the DFT, bin $k$ and bin $N-k$ represent the same physical frequency if $f_s$ is not high enough.

The aliasing formula: a signal at frequency $f_0$ aliases to $f_{\text{alias}} = |f_0 - \text{round}(f_0/f_s) \cdot f_s|$, i.e., the nearest copy within $[0, f_s/2)$.
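A minimal demonstration of the aliasing formula (sampling rate and frequencies are illustrative):

```python
import numpy as np

fs, N = 100.0, 200                    # sample at 100 Hz for 2 seconds
t = np.arange(N) / fs

f_true = 70.0                         # above Nyquist (50 Hz)
f_alias = abs(f_true - round(f_true / fs) * fs)   # |70 - 100| = 30 Hz

x_true = np.cos(2 * np.pi * f_true * t)
x_alias = np.cos(2 * np.pi * f_alias * t)

# The two sinusoids produce identical samples (up to rounding error)...
print(np.allclose(x_true, x_alias))   # True

# ...so the DFT of the undersampled tone peaks at the aliased frequency
freqs = np.fft.rfftfreq(N, d=1/fs)
print(freqs[np.argmax(np.abs(np.fft.rfft(x_true)))])   # 30.0
```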

Connection to Poisson summation (from Section 02-7.5): the spectrum of the sampled signal is the sum of periodically-shifted copies of the continuous spectrum:

$$X_s(\xi) = f_s \sum_{k=-\infty}^{\infty} \hat{x}(\xi - k f_s)$$

If adjacent copies overlap (i.e., $f_s < 2B$), they contaminate each other - this is aliasing. If $f_s \geq 2B$, the copies are non-overlapping and the original spectrum can be recovered by applying a lowpass filter with cutoff at $f_s/2$.

Practical rule: In practice, use $f_s \geq 2.5B$ with an anti-aliasing lowpass filter at $f_s/2$ before the ADC. Audio CDs use $f_s = 44.1$ kHz because human hearing extends to $\sim 20$ kHz (requiring $f_s > 40$ kHz), with the extra 2.1 kHz providing a transition band for the anti-aliasing filter.

6.4 Practical FFT Setup Checklist

When computing an FFT-based spectral analysis, follow this sequence:

PRACTICAL FFT CHECKLIST
========================================================================

  1. CHOOSE N
     - Prefer N = 2^k (fastest FFT)
     - If not possible: N = 2^a * 3^b * 5^c (still fast)
     - N controls frequency resolution: Δf = fs / N

  2. APPLY WINDOW
     - Default: Hann window (good sidelobes, modest resolution loss)
     - High dynamic range: Blackman or Kaiser (β ≈ 8)
     - If signal is periodic with exactly N samples: rectangular

  3. COMPUTE FFT
     - Real input: use rfft (returns N/2+1 values)
     - Complex input: use fft (returns N values)

  4. BUILD FREQUENCY AXIS
     - For rfft: freqs = np.fft.rfftfreq(N, d=1/fs)   [0, fs/2]
     - For fft:  freqs = np.fft.fftfreq(N, d=1/fs)    [0 ... fs/2) then (-fs/2 ... 0)
     - After fftshift: freqs = [-fs/2, ..., 0, ..., fs/2]

  5. INTERPRET MAGNITUDE
     - |X[k]| / N  ->  amplitude of a complex sinusoid (x2 for a real sinusoid,
       since its energy splits between bins k and N-k)
     - |X[k]|^2 / N  ->  power spectral density
     - Divide by window coherent gain for absolute amplitudes

  6. HANDLE NEGATIVE FREQUENCIES
     - Use fftshift(X) to center zero-frequency
     - For real inputs: mirror the positive half (rfft returns one-sided)

========================================================================
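A worked example following the checklist end to end; the sampling rate, length, and tone are illustrative assumptions:

```python
import numpy as np

fs = 8192.0                      # assumed sampling rate (power of 2 for a clean example)
N = 4096                         # step 1: N = 2^12, so df = fs/N = 2 Hz
n = np.arange(N)
x = (0.5 * np.cos(2 * np.pi * 440.0 * n / fs)            # 440 Hz tone, amplitude 0.5
     + 0.05 * np.random.default_rng(0).standard_normal(N))  # plus a little noise

w = np.hanning(N)                # step 2: Hann window
coherent_gain = w.sum() / N      # ~0.5 for Hann; needed for absolute amplitudes

X = np.fft.rfft(x * w)                 # step 3: real input -> rfft
freqs = np.fft.rfftfreq(N, d=1/fs)     # step 4: frequency axis, 0 .. fs/2

# Step 5: amplitude at the peak (x2 for a real sinusoid, / coherent gain)
k = np.argmax(np.abs(X))
amp = 2 * np.abs(X[k]) / (N * coherent_gain)
print(f"peak at {freqs[k]:.1f} Hz, amplitude ~ {amp:.3f}")   # ~440 Hz, ~0.5
```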
