Part 3
Discrete Fourier Transform and FFT: Appendix I: DFT and Signal Reconstruction to Appendix R: Quick-Reference Formula Sheet

Appendix I: DFT and Signal Reconstruction

I.1 Bandlimited Interpolation

Given DFT coefficients $X[0], \ldots, X[K]$ (retaining only the lowest $K+1$ frequencies), the bandlimited interpolation of the corresponding signal at any intermediate time $t$ (not necessarily a sample point) is:

$$\hat{x}(t) = \frac{1}{N}\sum_{k=0}^{K} X[k]\, e^{2\pi i k t/N} + \text{conjugate terms}$$

This is the continuous analog of what FNO does (Section 8.2): it parameterizes a function by its low-frequency DFT coefficients, and evaluates it at any desired spatial resolution.
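As a concrete illustration, here is a minimal NumPy sketch of this interpolation for a real signal (the function name and the explicit conjugate-pair loop are our own, assuming $K < N/2$ so the Nyquist mode needs no special handling):

```python
import numpy as np

def bandlimited_interp(x, K, t):
    """Evaluate the K-mode bandlimited interpolant of a real signal x at times t.

    Keeps DFT modes 0..K together with their conjugate partners X[N-k] = X[k]^*,
    so the result is real.  Assumes K < N/2.
    """
    N = len(x)
    X = np.fft.fft(x)
    t = np.atleast_1d(t).astype(float)
    out = np.full(t.shape, X[0].real / N)          # DC term
    for k in range(1, K + 1):
        # X[k] e^{2 pi i k t/N} plus its conjugate partner, folded into a real part
        out += (2.0 / N) * np.real(X[k] * np.exp(2j * np.pi * k * t / N))
    return out

# A pure mode-3 cosine is reproduced exactly with K = 3 at the sample points ...
N = 16
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)
xt = bandlimited_interp(x, 3, n)
# ... and can now be evaluated between samples, e.g. at t = 0.5.
mid = bandlimited_interp(x, 3, 0.5)
```
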

I.2 Compressed Sensing and Sparse Recovery

Problem (Compressed Sensing). A signal $\mathbf{x} \in \mathbb{C}^N$ is $s$-sparse in the DFT domain: only $s \ll N$ Fourier coefficients are nonzero. Can we recover $\mathbf{x}$ from $m \ll N$ measurements?

Answer (Candès, Romberg, Tao, 2006; Donoho, 2006). If $m \geq C \cdot s \log(N/s)$ random measurements are taken, $\mathbf{x}$ can be exactly recovered (with high probability) by solving the $\ell^1$ minimization:

$$\min_{\hat{\mathbf{x}}} \lVert \hat{\mathbf{x}} \rVert_1 \quad \text{subject to} \quad A\hat{\mathbf{x}} = \mathbf{b}$$

where $A$ is the random measurement matrix. This is dramatically below the $m \geq N$ requirement of classical sampling.

For AI: compressed sensing underlies MRI reconstruction (6-8$\times$ speedup by acquiring ~1/6 of $k$-space data), compressed attention (attending to sparse subsets of frequency components), and efficient LLM inference when activations are sparse in the Fourier domain. The theoretical guarantees connect directly to the RIP (Restricted Isometry Property) of randomly subsampled DFT matrices.

I.3 Phase Retrieval

Problem. Given only $|X[k]|^2$ (the power spectrum, without phase), recover $\mathbf{x}$.

Phase retrieval is fundamentally ill-posed in 1D (a signal and its time-reversed, time-shifted versions share the same power spectrum). In 2D (images), phase retrieval becomes generically uniquely solvable thanks to oversampling constraints.

Algorithms: Gerchberg-Saxton (1972), HIO (Fienup, 1982), and modern deep learning approaches (phase retrieval nets).

For AI: X-ray crystallography uses phase retrieval to determine protein structure. Modern protein structure prediction (AlphaFold 2, RoseTTAFold) sidesteps phase retrieval by directly predicting 3D coordinates from sequence and co-evolutionary information.


Appendix J: Worked ML Examples

J.1 Building a Mel Spectrogram from Scratch

This section works through the complete mathematics behind the mel spectrogram used in Whisper, step by step.

Step 1: Mel scale. The mel scale converts frequency $f$ (Hz) to perceived pitch $m$ (mels):

$$m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \quad f(m) = 700\left(10^{m/2595} - 1\right)$$

The mel scale is approximately linear below 1 kHz (300 Hz ≈ 402 mel; 700 Hz ≈ 781 mel) and logarithmic above 1 kHz, matching the frequency resolution of the human auditory system (the cochlea's place-frequency mapping).
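The two conversion formulas are exact inverses and are easy to check numerically (the function names `hz_to_mel` / `mel_to_hz` are our own):

```python
import numpy as np

def hz_to_mel(f):
    # m(f) = 2595 log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # f(m) = 700 (10^(m/2595) - 1), the exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

m700 = hz_to_mel(700.0)                 # about 781 mel
roundtrip = mel_to_hz(hz_to_mel(440.0)) # back to 440 Hz
```
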

Step 2: Filterbank design. For $M = 80$ mel filters covering $[f_{\min}, f_{\max}] = [0, 8000]\,\text{Hz}$:

  1. Convert the frequency range to mels: $[m_{\min}, m_{\max}] \approx [0, 2840]\,\text{mels}$
  2. Place $M+2 = 82$ equally spaced mel points: $m_0 < m_1 < \cdots < m_{81}$
  3. Convert back to Hz: $f_j = f(m_j)$ for $j = 0, \ldots, 81$
  4. For filter $m$ ($m = 1, \ldots, 80$), define the triangular filter:

$$H_m[k] = \begin{cases} \dfrac{f_k - f_{m-1}}{f_m - f_{m-1}} & f_{m-1} \leq f_k \leq f_m \\[1ex] \dfrac{f_{m+1} - f_k}{f_{m+1} - f_m} & f_m \leq f_k \leq f_{m+1} \\[1ex] 0 & \text{otherwise} \end{cases}$$

where $f_k = k f_s / N$ are the DFT bin frequencies.

Step 3: Power accumulation. For each STFT frame $l$:

$$\text{mel\_spec}[m, l] = \sum_{k=0}^{N/2} H_m[k] \cdot |X[k, l]|^2$$

Step 4: Log compression.

$$\text{log\_mel}[m, l] = \log_{10}\!\left(\max\!\left(\text{mel\_spec}[m, l],\, 10^{-10}\right)\right)$$

The floor $10^{-10}$ prevents $\log(0)$. Whisper's normalization then clamps values to within 8 of the spectrogram's maximum and rescales by $(x + 4)/4$, an empirically chosen transform that centers the values near $[-1, 1]$ for typical speech.
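Steps 1-4 combine into a self-contained NumPy sketch (our own minimal implementation with $N=512$, hop 160, and 80 bins as assumptions; not Whisper's actual code, which also applies the normalization just described):

```python
import numpy as np

def mel_filterbank(n_mels=80, n_fft=512, fs=16000, fmin=0.0, fmax=8000.0):
    """Triangular mel filterbank of shape (n_mels, n_fft//2 + 1), per Steps 1-2."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = hz(np.linspace(mel(fmin), mel(fmax), n_mels + 2))   # 82 edge freqs in Hz
    bins = np.arange(n_fft // 2 + 1) * fs / n_fft               # DFT bin frequencies
    fb = np.zeros((n_mels, len(bins)))
    for m in range(1, n_mels + 1):
        lo, center, hi = edges[m - 1], edges[m], edges[m + 1]
        rising = (bins - lo) / (center - lo)
        falling = (hi - bins) / (hi - center)
        fb[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(x, n_fft=512, hop=160, n_mels=80, fs=16000):
    """Steps 3-4: windowed STFT power -> mel weighting -> log compression."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[l * hop:l * hop + n_fft] * w for l in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # (frames, n_fft//2+1)
    mel_spec = power @ mel_filterbank(n_mels, n_fft, fs).T      # (frames, n_mels)
    return np.log10(np.maximum(mel_spec, 1e-10)).T              # (n_mels, frames)

fs = 16000
t = np.arange(fs) / fs                                          # one second of audio
lm = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

A 440 Hz tone (≈550 mel) should light up a band in the lower fifth of the 80 mel channels.
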

Why these choices?

  • 80 mel bins: empirically optimal for large-scale ASR (more bins -> marginal improvement; fewer -> information loss)
  • 0-8000 Hz: human speech information is concentrated here; extending to 8+ kHz adds noise without significant ASR benefit
  • 25 ms window / 10 ms hop: matches the typical duration of speech phonemes (20-200 ms)
  • Log compression: matches perceptual loudness; prevents high-amplitude phonemes from dominating the feature representation

J.2 FNO Architecture: Complete Mathematical Description

The Fourier Neural Operator (Li et al., 2021) architecture for 1D problems:

Input: discretized function $\mathbf{a} \in \mathbb{R}^{N \times d_a}$ (input function on $N$ points with $d_a$ channels)
Output: discretized function $\mathbf{u} \in \mathbb{R}^{N \times d_u}$ (solution on $N$ points with $d_u$ channels)

Architecture (4 FNO layers + output projection):

Layer 0: Input lifting
  a ∈ R^{N × d_a}  →  v ∈ R^{N × d_v}   (learnable MLP)

Layers 1-4: FNO blocks
  v ← FNO-Block(v)

Layer 5: Output projection
  v ∈ R^{N × d_v}  →  u ∈ R^{N × d_u}   (learnable MLP)

FNO Block:

FNO-Block(v):
    1. Spectral branch:
       V = DFT(v)                   # shape: (N, d_v) complex
       V_trunc = V[:K, :]           # keep lowest K modes
       V_out = R @ V_trunc          # R ∈ C^{K × d_v × d_v}: learned weights
       V_pad = [V_out; 0, ..., 0]   # zero-pad back to N modes
       v_spectral = IDFT(V_pad)     # shape: (N, d_v)

    2. Local branch:
       v_local = W @ v              # W ∈ R^{d_v × d_v}: learned weights

    3. Combine:
       v_new = σ(v_spectral + v_local)   # σ = GeLU

    return v_new
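A minimal NumPy sketch of one such block (our own illustration; real FNO implementations batch this and learn R and W by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(s):
    # tanh approximation of GeLU
    return 0.5 * s * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (s + 0.044715 * s ** 3)))

def fno_block(v, R, W, K):
    """One FNO block on a 1D grid. v: (N, d_v) real, R: (K, d_v, d_v) complex, W: (d_v, d_v)."""
    N = v.shape[0]
    V = np.fft.rfft(v, axis=0)                       # spectral branch: (N//2+1, d_v)
    V_out = np.zeros_like(V)
    V_out[:K] = np.einsum('kij,kj->ki', R, V[:K])    # mix channels of the lowest K modes
    v_spectral = np.fft.irfft(V_out, n=N, axis=0)    # modes above K stay zeroed out
    v_local = v @ W.T                                # local (pointwise) branch
    return gelu(v_spectral + v_local)

N, d_v, K = 64, 8, 16
v = rng.standard_normal((N, d_v))
R = (rng.standard_normal((K, d_v, d_v)) + 1j * rng.standard_normal((K, d_v, d_v))) / d_v
W = rng.standard_normal((d_v, d_v)) / np.sqrt(d_v)
out = fno_block(v, R, W, K)
# Resolution invariance: the same R, W apply unchanged on a 4x finer grid.
out_fine = fno_block(rng.standard_normal((4 * N, d_v)), R, W, K)
```
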

Key design choices:

  • $K = 16$ retained modes: sufficient for smooth PDE solutions; higher $K$ is used for more complex solutions (turbulence: $K = 32$)
  • $d_v = 64$ channels: comparable to modern transformer hidden dimensions
  • Weight sharing: the same $R$ is used across all spatial positions (translation equivariance in the frequency domain)
  • Resolution invariance: the FNO can be trained on $N = 64$ and evaluated on $N = 256$ by changing the DFT size; the spectral weights $R$ and local weights $W$ are independent of $N$

Parameter count comparison (1D, $N = 1024$, $K = 16$, $d_v = 64$):

| Layer | Dense global convolution | FNO spectral layer |
|---|---|---|
| Parameters | $N \times d_v^2 = 4.2 \times 10^6$ | $K \times d_v^2 \times 2 = 131{,}072$ (complex) |
| FLOPs | $O(N^2 d_v^2) = 4 \times 10^9$ | $O(N d_v \log N + K d_v^2) = 6.7 \times 10^6$ |

The FNO spectral layer has $32\times$ fewer parameters and $600\times$ fewer FLOPs per layer than a fully global convolution.

J.3 Monarch Matrix Decomposition

A Monarch matrix of size $N = N_1 N_2$ is defined as:

$$M = P L P^\top R$$

where:

  • $P \in \{0,1\}^{N \times N}$ is a fixed permutation (the "interleaving permutation")
  • $L = \operatorname{diag}(L_0, L_1, \ldots, L_{N_1-1})$ is block-diagonal with $N_1$ blocks of size $N_2 \times N_2$
  • $R = \operatorname{diag}(R_0, R_1, \ldots, R_{N_2-1})$ is block-diagonal with $N_2$ blocks of size $N_1 \times N_1$

Connection to FFT. The FFT butterfly matrix factors as:

$$F_N = (I_{N_1} \otimes F_{N_2}) \cdot D \cdot (F_{N_1} \otimes I_{N_2}) \cdot P_{\text{bit-reverse}}$$

where $D$ is a diagonal twiddle-factor matrix and $\otimes$ is the Kronecker product. Each butterfly stage is a block-diagonal matrix - exactly the Monarch structure. Thus the FFT is a special Monarch matrix with specific parameter values.

Monarch parameterization: instead of fixing the block entries to be DFT matrices, let $L_i$ and $R_j$ be learnable complex matrices. This gives $O(N(N_1 + N_2)) = O(N\sqrt{N})$ parameters (choosing $N_1 = N_2 = \sqrt{N}$).

Expressiveness. Monarch matrices can represent any circulant matrix (by choosing $L_i$ and $R_j$ appropriately) and therefore any convolution. They can also represent some attention patterns and are more expressive than diagonal matrices or individual FFT layers.
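A dense NumPy sketch of the factorization $M = P L P^\top R$ (our own construction with small $N_1 = N_2 = 4$; production code multiplies via batched reshapes rather than materializing these matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
N1 = N2 = 4
N = N1 * N2

def blockdiag(blocks):
    """Assemble square blocks into a block-diagonal matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        out[i:i + b.shape[0], i:i + b.shape[0]] = b
        i += b.shape[0]
    return out

# Interleaving permutation: index n1*N2 + n2  ->  n2*N1 + n1
P = np.zeros((N, N))
for n1 in range(N1):
    for n2 in range(N2):
        P[n2 * N1 + n1, n1 * N2 + n2] = 1.0

L = blockdiag([rng.standard_normal((N2, N2)) for _ in range(N1)])  # N1 blocks of N2 x N2
R = blockdiag([rng.standard_normal((N1, N1)) for _ in range(N2)])  # N2 blocks of N1 x N1
M = P @ L @ P.T @ R

# Learnable entries: N1*N2^2 + N2*N1^2 = N*(N1 + N2), versus N^2 for a dense matrix
params = N1 * N2 ** 2 + N2 * N1 ** 2
```
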


Appendix K: Reference Summary

K.1 Key Formulas

| Formula | Expression |
|---|---|
| DFT | $X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$, $\quad\omega_N = e^{2\pi i/N}$ |
| IDFT | $x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk}$ |
| DFT matrix | $(F_N)_{kn} = \omega_N^{-kn}$, $\quad F_N F_N^* = N I$ |
| Parseval | $\sum_n \lvert x[n]\rvert^2 = \frac{1}{N}\sum_k \lvert X[k]\rvert^2$ |
| Circular convolution | $(x \circledast y)[n] = \sum_{m=0}^{N-1} x[m]\,y[(n-m)\bmod N]$ |
| Circular shift | $x[(n-m)\bmod N] \leftrightarrow \omega_N^{-mk} X[k]$ |
| Conjugate symmetry | $x[n]\in\mathbb{R} \Rightarrow X[N-k] = X[k]^*$ |
| FFT complexity | $\frac{N}{2}\log_2 N$ complex multiplications, $N\log_2 N$ additions |
| Frequency resolution | $\Delta f = f_s/N = 1/T$ |
| Nyquist limit | $f_{\max} = f_s/2$ |
| STFT | $S[l,k] = \sum_{m=0}^{M-1} x[lH+m]\,w[m]\,\omega_N^{-mk}$ |
| Mel scale | $m = 2595\log_{10}(1 + f/700)$ |
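All of these identities can be checked numerically against `numpy.fft` in a few lines (a verification sketch, not part of the lesson's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
x = rng.standard_normal(N)
y = rng.standard_normal(N)
X, Y = np.fft.fft(x), np.fft.fft(y)
k = np.arange(N)

# Parseval: sum |x|^2 = (1/N) sum |X|^2
assert np.isclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2) / N)

# Circular convolution <-> pointwise product of DFTs
circ = np.array([np.sum(x * np.roll(y[::-1], n + 1)) for n in range(N)])
assert np.allclose(circ, np.fft.ifft(X * Y).real)

# Circular shift by m <-> omega_N^{-mk} phase factor
m = 5
assert np.allclose(np.fft.fft(np.roll(x, m)), np.exp(-2j * np.pi * m * k / N) * X)

# Conjugate symmetry for real x: X[N-k] = X[k]^*
assert np.allclose(X[(N - k) % N], np.conj(X))
```
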

K.2 Quick Comparison: FT vs DFT

| Property | Continuous FT | DFT |
|---|---|---|
| Domain | $L^1(\mathbb{R})$ or $L^2(\mathbb{R})$ | $\mathbb{C}^N$ (finite sequences) |
| Output | Continuous spectrum on $\mathbb{R}$ | $N$ complex values |
| Inversion | Integral formula | Exact, finite sum |
| Shift | Linear: $e^{-2\pi i\xi\tau}$ phase factor | Circular: $\omega_N^{-mk}$ phase factor |
| Convolution | Linear convolution | Circular convolution |
| Parseval | $\int\lvert f\rvert^2 = \int\lvert\hat{f}\rvert^2$ | $\sum\lvert x\rvert^2 = \frac{1}{N}\sum\lvert X\rvert^2$ |
| Unitarity | FT on $L^2$ is unitary (Plancherel) | $\frac{1}{\sqrt{N}}F_N$ is unitary |
| Computation | No exact algorithm in general | FFT: $O(N\log N)$ |
| Leakage | None without truncation | Leakage from finite windowing |

Appendix L: Case Studies

L.1 Whisper Large-v3: Preprocessing Bug Hunt

A common real-world bug in Whisper-based systems illustrates why understanding the DFT pipeline matters.

Symptom: transcription quality degrades for audio from a different microphone than training data, even though the content is identical.

Root cause: the new microphone captures at 44.1 kHz (not 16 kHz). Naive resampling using librosa.load(file, sr=16000) with default settings may introduce aliasing artifacts or produce a slightly different frequency axis. Specifically:

  1. Aliasing from incorrect anti-aliasing filter: if the 44.1 kHz signal is downsampled by a factor of $\approx 2.76$ without applying a proper lowpass filter at 8 kHz (the Nyquist frequency for 16 kHz), frequencies between 8 kHz and 22.05 kHz alias into the 0-8 kHz band, corrupting the mel spectrogram.

  2. Frequency axis mismatch: if the mel filterbanks were designed for exactly 16 kHz sampling, resampling to 15,994 Hz (a common off-by-one error in resampling libraries) shifts all frequency bin centers by $\approx 0.04\%$ - too small to affect single-frequency detection but enough to corrupt the precise triangular filterbank overlaps.

Fix: explicitly verify sample rate after loading (assert sr == 16000), use scipy.signal.resample_poly with explicit anti-aliasing, and validate the mel filterbank output against a known reference signal.
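A sketch of that fix (the helper name is our own; `resample_poly` applies its own anti-aliasing lowpass): resampling a tone that lies above the target Nyquist should suppress it rather than alias it down to 4 kHz.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(x, sr, target_sr=16000):
    """Polyphase resampling; resample_poly includes an anti-aliasing lowpass."""
    if sr == target_sr:
        return np.asarray(x, dtype=np.float32)
    g = np.gcd(sr, target_sr)                 # 44100 -> 16000 uses up=160, down=441
    return resample_poly(x, target_sr // g, sr // g).astype(np.float32)

# A 12 kHz tone at 44.1 kHz lies above the 8 kHz Nyquist of the target rate.
sr = 44100
t = np.arange(sr) / sr
bad_tone = np.sin(2 * np.pi * 12000 * t)
y = resample_to_16k(bad_tone, sr)
assert len(y) == 16000                        # the "verify after loading" habit
residual = np.abs(np.fft.rfft(y)).max()      # energy left anywhere in 0-8 kHz
```

With proper anti-aliasing, `residual` is tiny compared to the spectral peak of an in-band tone passed through the same path.
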

Lesson: every stage of the FFT pipeline - sampling rate, window length, hop size, FFT size, filterbank design - has exact numerical requirements. The DFT is unforgiving of approximations.

L.2 FNO for Climate Modeling: FourCastNet

FourCastNet (Pathak et al., 2022, NVIDIA) uses a variant of the FNO architecture to predict global weather at 6-hour intervals. It replaces the standard FNO spectral layer with an AFNO (Adaptive Fourier Neural Operator) block:

AFNO Layer:
  1. Patchify: divide input grid (720 × 1440) into patches → tokens
  2. DFT along spatial dimensions (2D FFT)
  3. Mix frequencies with learned complex weights (block-diagonal for efficiency)
  4. IDFT → back to the spatial domain
  5. MLP for channel mixing

Key difference from FNO: AFNO mixes tokens in frequency space with a block-diagonal weight structure, reducing parameters from $O(K^2 d^2)$ to $O(K d^2 / B)$, where $B$ is the block size.

Results:

  • Forecasts 10-day global weather in 2 seconds (vs 4 hours for NWP models like ECMWF)
  • Competitive with ECMWF operational forecasts on 2-week horizons for key variables (temperature at 500 hPa, wind at 850 hPa)
  • Trained on 40 years of ERA5 reanalysis data (3.8 TB)

Why frequency domain? Atmospheric dynamics are dominated by large-scale planetary waves (low spatial frequencies) - the Rossby waves that govern jet streams and storm tracks. Truncating to $K = 32$ modes in each dimension captures $> 95\%$ of the variance in ERA5 data. Higher-frequency turbulence below the grid scale is parameterized separately.

L.3 STFT Denoising in Production Audio

Modern audio denoising systems (used in Zoom, Teams, Discord) use STFT-based processing:

Architecture:

  1. STFT with Hann window (1024 samples, hop 256) → magnitude $|X[l,k]|$ and phase $\angle X[l,k]$
  2. A neural network (small RNN or transformer) estimates the noise mask $M[l,k] \in [0,1]$
  3. Apply the mask: $\hat{X}[l,k] = M[l,k] \cdot X[l,k]$ (keep speech, suppress noise)
  4. Reconstruct the waveform via ISTFT with the original phase (overlap-add)

Why preserve phase: human perception is relatively insensitive to phase errors in speech, but phase-inconsistent reconstruction (e.g. Griffin-Lim iterations) introduces audible artifacts. Most commercial denoisers use the noisy signal's phase directly - a pragmatic approximation.

Real-time constraint: at 48 kHz with 1024-sample windows and a 256-sample hop, one frame arrives every 5.33 ms. A DNN processing one frame must complete in $< 5\,\text{ms}$ on the target hardware (often a small DSP or CPU). This limits the network size to $\sim 10^5$ parameters.

Could the STFT be replaced by a learned feature extractor? In principle, one could learn the analysis window jointly with the denoising network (a learnable STFT). In practice, fixed Hann windows almost always match or outperform learned windows on denoising benchmarks: the Hann window's well-understood time-frequency resolution is hard to improve upon with a finite training set.


Appendix M: Connections to Other Mathematical Areas

M.1 Harmonic Analysis on Finite Abelian Groups

The DFT on $\mathbb{Z}_N$ (the cyclic group of integers modulo $N$) is a special case of harmonic analysis on locally compact abelian (LCA) groups. For a finite abelian group $G$, the Pontryagin dual $\hat{G}$ is the group of characters (group homomorphisms $\chi: G \to S^1$). For $G = \mathbb{Z}_N$: $\hat{G} \cong \mathbb{Z}_N$ with characters $\chi_k(n) = \omega_N^{kn}$ - exactly the DFT basis.

The abstract Fourier transform on GG is:

$$\hat{f}(\chi) = \sum_{g \in G} f(g)\,\overline{\chi(g)}$$

For $G = \mathbb{Z}_N$ this recovers the DFT exactly; for $G = \mathbb{R}$ (with the sum replaced by an integral) it recovers the continuous FT. The same theorems - Pontryagin duality, Plancherel's theorem, Poisson summation - hold in all cases.

This unification is developed fully in Section 12 Functional Analysis, connecting the DFT to representation theory of compact groups, wavelet theory, and the Fourier analysis of non-commutative groups (non-abelian harmonic analysis).

M.2 Number-Theoretic Applications

Quadratic reciprocity via DFT. The Gauss sum $G(a, N) = \sum_{n=0}^{N-1} e^{2\pi i a n^2/N}$ plays a central role in the proof of quadratic reciprocity. For prime $p$:

$$G(1, p) = \begin{cases} \sqrt{p} & p \equiv 1 \pmod 4 \\ i\sqrt{p} & p \equiv 3 \pmod 4 \end{cases}$$

This is a DFT-like object: $G(a, N) = X[a]$ where $x[n] = 1$ for all $n$ and the kernel is $\omega_N^{an^2}$ instead of $\omega_N^{an}$.

Integer factorization. Shor's quantum algorithm (1994) for integer factorization relies on the quantum Fourier transform (QFT): a quantum circuit that applies the DFT to a quantum superposition state in $O((\log N)^2)$ gate operations, versus $O(N \log N)$ operations for the classical FFT. Shor's algorithm uses the QFT to find the period of $k \mapsto a^k \bmod N$, which reveals the prime factors of $N$ in polynomial time.


Appendix N: Spectral Analysis in Practice - Extended Examples

N.1 Frequency Estimation: Sub-Bin Accuracy

A key practical problem: given noisy measurements $y[n] = A\cos(2\pi f_0 n/f_s + \phi) + \varepsilon[n]$, estimate $f_0$, $A$, and $\phi$ with sub-bin accuracy.

Step 1: Coarse estimate. Compute the DFT magnitude $|Y[k]|$ and find $k_{\max} = \arg\max_k |Y[k]|$. This gives $\hat{f}_0^{(1)} = k_{\max} f_s / N$ with error up to $\pm \Delta f/2 = \pm f_s/(2N)$.

Step 2: Parabolic interpolation. Fit a parabola to $|Y[k_{\max}-1]|$, $|Y[k_{\max}]|$, $|Y[k_{\max}+1]|$:

$$\hat{k} = k_{\max} + \frac{|Y[k_{\max}+1]| - |Y[k_{\max}-1]|}{2\left(2|Y[k_{\max}]| - |Y[k_{\max}-1]| - |Y[k_{\max}+1]|\right)}$$

This achieves approximately $10\times$ better accuracy than the coarse estimate for SNR $> 20$ dB.

Step 3: Quinn's Estimator. For even better accuracy, use the complex DFT values directly:

$$\hat{k} = k_{\max} + \operatorname{Re}\!\left[\frac{Y[k_{\max}+1]/Y[k_{\max}]}{1 - Y[k_{\max}+1]/Y[k_{\max}]}\right]$$

Quinn's estimator is unbiased and achieves accuracy near the Cramér-Rao bound.

Step 4: Amplitude and phase. Once $\hat{k}$ is estimated, the amplitude is $\hat{A} = 2|Y[\hat{k}]|/N$ (for a one-sided real spectrum), corrected for the window's coherent gain, and the phase is $\hat{\phi} = \angle Y[\hat{k}]$.
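Steps 1-2 in NumPy (our own sketch; a Hann window is applied first, and the accuracy figures above assume reasonable SNR):

```python
import numpy as np

def estimate_freq(y, fs):
    """Coarse argmax estimate (Step 1) refined by parabolic interpolation (Step 2)."""
    N = len(y)
    Y = np.abs(np.fft.rfft(y * np.hanning(N)))
    k = int(np.argmax(Y[1:-1])) + 1               # skip the DC and Nyquist edges
    a, b, c = Y[k - 1], Y[k], Y[k + 1]
    delta = 0.5 * (c - a) / (2 * b - a - c)       # vertex of the fitted parabola
    return (k + delta) * fs / N

fs, N = 16000.0, 1024
f0 = 440.0                                        # 440 / (fs/N) = 28.16: off-bin
n = np.arange(int(N))
f_hat = estimate_freq(np.cos(2 * np.pi * f0 * n / fs + 0.3), fs)
# the coarse estimate alone would give 28 * 15.625 = 437.5 Hz; f_hat lands much closer
```
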

N.2 Detecting Harmonics in Speech

Speech is approximately quasi-periodic: voiced sounds (vowels) have a fundamental frequency $F_0$ (pitch) and harmonics at $2F_0, 3F_0, \ldots$ The harmonic structure is visible in the DFT as a series of peaks at integer multiples of $F_0$.

HPS (Harmonic Product Spectrum) algorithm:

  1. Compute the DFT magnitude $|X[k]|$
  2. Form the product spectrum: $P[k] = \prod_{h=1}^{H} |X[hk]|$
  3. Estimate $F_0 = f_s \cdot \arg\max_k P[k] / N$

The HPS effectively downsamples the spectrum by factors $1, 2, \ldots, H$ and takes the product - aligning the harmonic peaks while suppressing noise between them.
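A minimal HPS sketch (our own; we use $H = 3$ because the toy signal below contains only three harmonics - with $H$ larger than the number of harmonics present, the product at the true $F_0$ collapses toward zero):

```python
import numpy as np

def hps_pitch(x, fs, H=3, fmin=60.0, fmax=500.0):
    """Harmonic Product Spectrum pitch estimate, searched over a plausible F0 range."""
    N = len(x)
    X = np.abs(np.fft.rfft(x * np.hanning(N)))
    kmax = len(X) // H
    P = np.ones(kmax)
    for h in range(1, H + 1):
        P *= X[::h][:kmax]                        # X[::h][k] == X[h*k]
    lo = max(1, int(fmin * N / fs))
    hi = min(kmax, int(fmax * N / fs))
    k0 = lo + int(np.argmax(P[lo:hi]))
    return k0 * fs / N

fs, N = 16000, 4096
n = np.arange(N)
f0 = 200.0
# a voiced-speech-like signal: three harmonics with decaying amplitudes
x = sum(a * np.sin(2 * np.pi * h * f0 * n / fs) for h, a in [(1, 1.0), (2, 0.6), (3, 0.4)])
pitch = hps_pitch(x, fs)
```

The estimate is quantized to the bin width $f_s/N \approx 3.9$ Hz, so it lands near 200 Hz rather than exactly on it.
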

In AI: pitch estimation (fundamental frequency detection) is a key component of speech processing pipelines for voice synthesis (VALL-E, SoundStorm), emotion recognition, and singing voice synthesis. Modern deep-learning pitch detectors (CREPE, PYIN, PENN) outperform HPS but rely on spectral representations derived from the STFT.

N.3 The Spectrogram in Transformer-Based Audio Models

The dominant paradigm for audio language models (AudioLM, VALL-E, MusicLM, Stable Audio) is:

Audio waveform
    ↓ FFT → mel spectrogram (continuous representation)
    ↓ VQ-VAE or EnCodec → discrete tokens
    ↓ Language model (transformer) → next-token prediction
    ↓ Decoder → reconstructed mel spectrogram
    ↓ Vocoder (HiFi-GAN, WaveGrad) → audio waveform

The FFT and mel spectrogram serve as both the input representation for the encoder and the target representation for the decoder. The transformer operates entirely in the token space derived from the mel spectrogram.

Why not train end-to-end on waveforms? The waveform representation at 16-44 kHz requires sequence lengths of $10^4$-$10^5$ samples per second - far beyond current transformer context lengths for more than a few seconds of audio. The mel spectrogram with a 10 ms frame shift compresses by a factor of $\sim 160\times$ (at 16 kHz) while preserving the perceptually relevant structure. This compression is what makes transformer-based audio generation feasible.


Appendix O: Summary of AI Model Connections

| AI System | DFT Role | Key Parameters |
|---|---|---|
| OpenAI Whisper | FFT → mel spectrogram as input | $f_s=16$ kHz, $M=400$, $H=160$, $N=512$, 80 mel bins |
| FNet (Google) | Full 2D DFT replaces attention | DFT along sequence AND embedding dims; no learned weights |
| FNO (Caltech) | Truncated spectral convolution | $K=16$-$32$ modes; global PDE operator learning |
| FourCastNet (NVIDIA) | AFNO block for weather prediction | 2D FFT on $720\times1440$ lat/lon grid; 73 variables |
| FlashFFTConv | Butterfly-structured long convolution | Monarch factorization; $O(N\log N)$ with Tensor Cores |
| Monarch (Together AI) | FFT butterfly as trainable transform | $O(N\sqrt{N})$ parameters; used as attention surrogate |
| VALL-E | Codec (EnCodec) uses STFT loss | Multi-resolution STFT loss for perceptual quality |
| AudioLM | Semantic/acoustic tokenization | EnCodec uses mel-scale frequency band processing |
| CREPE (pitch) | Spectrogram input to CNN | $n_{\text{fft}}=1024$, $h=10$ ms; pitch detection on mel spec |
| HiFi-GAN (vocoder) | Multi-period discriminator | Multi-resolution STFT loss; discriminates real vs fake |
| S4/Mamba | Spectral view of state-space models | SSM recurrence equivalent to causal convolution (FFT in training) |
| RoPE (LLaMA) | Frequency-domain position encoding | Complex rotation = modulation in the DFT sense (Section 3.3) |

Appendix P: Proofs and Derivations

P.1 Derivation of Cooley-Tukey for General Radix

For $N = N_1 \cdot N_2$, write $n = N_2 n_1 + n_2$ and $k = N_1 k_2 + k_1$ with $n_1, k_1 \in [N_1]$, $n_2, k_2 \in [N_2]$:

$$X[k] = \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1} x[N_2 n_1 + n_2]\, \omega_N^{-(N_2 n_1 + n_2)(N_1 k_2 + k_1)}$$

Expanding the exponent: $-(N_2 n_1 + n_2)(N_1 k_2 + k_1) = -N_1 N_2 n_1 k_2 - N_2 n_1 k_1 - N_1 n_2 k_2 - n_2 k_1$

Since $\omega_N^{N_1 N_2} = 1$, the first term vanishes. Grouping:

$$X[N_1 k_2 + k_1] = \sum_{n_2=0}^{N_2-1} \omega_{N_2}^{-n_2 k_2} \left[\underbrace{\omega_N^{-n_2 k_1}}_{\text{twiddle}} \sum_{n_1=0}^{N_1-1} \omega_{N_1}^{-n_1 k_1}\, x[N_2 n_1 + n_2]\right]$$

The inner sum is an $N_1$-point DFT (one for each fixed $n_2$), and the outer sum is an $N_2$-point DFT (one for each fixed $k_1$). The twiddle factors $\omega_N^{-n_2 k_1}$ multiply between the two stages.

Total operations: $N_2$ DFTs of size $N_1$, plus $N_1$ DFTs of size $N_2$, plus $N_1 N_2$ twiddle multiplications; applied recursively, this yields $O(N \log N)$ by the master recurrence.
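The derivation translates directly into code: one decomposition step, with the sub-DFTs delegated to `numpy.fft` (our own sketch):

```python
import numpy as np

def ct_fft(x, N1, N2):
    """One Cooley-Tukey step for N = N1*N2, using np.fft for the sub-DFTs."""
    N = N1 * N2
    xr = np.asarray(x).reshape(N1, N2)             # xr[n1, n2] = x[N2*n1 + n2]
    inner = np.fft.fft(xr, axis=0)                 # N2 inner DFTs of size N1 -> [k1, n2]
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    twiddled = inner * np.exp(-2j * np.pi * k1 * n2 / N)   # twiddle factors
    outer = np.fft.fft(twiddled, axis=1)           # N1 outer DFTs of size N2 -> [k1, k2]
    X = np.empty(N, dtype=complex)
    k2 = np.arange(N2)[None, :]
    X[N1 * k2 + k1] = outer                        # X[N1*k2 + k1] = outer[k1, k2]
    return X

rng = np.random.default_rng(3)
x = rng.standard_normal(12) + 1j * rng.standard_normal(12)
```

Running `ct_fft(x, 3, 4)` (or `ct_fft(x, 4, 3)`) matches `np.fft.fft(x)` to machine precision.
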

P.2 The Uncertainty Principle for the DFT (Donoho-Stark Theorem)

Theorem (Donoho & Stark, 1989). For any nonzero $\mathbf{x} \in \mathbb{C}^N$:

$$|\operatorname{supp}(\mathbf{x})| \cdot |\operatorname{supp}(\mathbf{X})| \geq N$$

where $\operatorname{supp}(\mathbf{x}) = \{n : x[n] \neq 0\}$ is the support of $\mathbf{x}$.

Proof sketch. Suppose $|\operatorname{supp}(\mathbf{x})| = S$ and $|\operatorname{supp}(\mathbf{X})| = F$. Restricting the DFT to the rows in $\operatorname{supp}(\mathbf{X})$ and the columns in $\operatorname{supp}(\mathbf{x})$ defines an $F \times S$ submatrix $A$ of $\frac{1}{\sqrt{N}}F_N$ satisfying $A\,\mathbf{x}_{|S} = \mathbf{X}_{|F}/\sqrt{N}$. A counting argument using the rank of such submatrices gives $SF \geq N$.

Corollary. A signal cannot have fewer than $\sqrt{N}$ nonzero samples AND fewer than $\sqrt{N}$ nonzero DFT coefficients simultaneously. This is the discrete analog of the Heisenberg uncertainty principle.

For AI: compressed sensing exploits this theorem: a signal supported on $S$ DFT bins can be recovered from $m \geq C S\log N$ time-domain measurements - far fewer than $N$. This justifies FNO's use of $K \ll N$ Fourier modes: smooth solutions to PDEs have small Fourier support.
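The bound is tight: a Dirac comb with spacing $\sqrt{N}$ has exactly $\sqrt{N}$ nonzero samples and $\sqrt{N}$ nonzero DFT coefficients (a quick numerical check, our own):

```python
import numpy as np

# The Dirac comb with spacing sqrt(N) meets the bound with equality:
# sqrt(N) nonzero samples, sqrt(N) nonzero DFT coefficients, product exactly N.
N = 16
x = np.zeros(N)
x[::4] = 1.0                                  # support {0, 4, 8, 12}
X = np.fft.fft(x)
supp_x = int(np.count_nonzero(np.abs(x) > 1e-9))
supp_X = int(np.count_nonzero(np.abs(X) > 1e-9))
```
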

P.3 Aliasing Formula: Proof via Poisson Summation

When $x(t)$ is sampled at rate $f_s$ (possibly below the Nyquist rate for a bandlimited $x$), the DFT of the samples is:

$$X_s[k] = \frac{1}{T_s}\sum_{n=-\infty}^{\infty} \hat{x}\!\left(\frac{k}{N T_s} - \frac{n}{T_s}\right) = \frac{1}{T_s}\sum_{n=-\infty}^{\infty} \hat{x}(k\Delta f - n f_s)$$

This is the aliased spectrum: the true spectrum $\hat{x}$ summed over all frequency shifts by integer multiples of $f_s$. When $\hat{x}$ is nonzero outside $[-f_s/2, f_s/2)$, adjacent copies overlap and contaminate each other - this is aliasing.

Proof. Starting from the Poisson summation formula (Section 02-7.5):

$$\sum_{n=-\infty}^{\infty} f(t - nT) = \frac{1}{T}\sum_{k=-\infty}^{\infty} \hat{f}(k/T)\, e^{2\pi i k t/T}$$

Setting $f(t) = x(t)\, e^{-2\pi i \xi t}$ and $T = 1/f_s$, and evaluating at $t=0$:

$$\sum_n x(n/f_s)\, e^{-2\pi i\xi n/f_s} = f_s \sum_k \hat{x}(\xi - kf_s)$$

The left side is the DTFT of the sampled signal; the right side is the periodized spectrum. Sampling this DTFT at $\xi = k f_s/N$ gives the DFT, yielding the result.
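A quick numerical illustration of the formula (our own): a 7 kHz tone sampled at 10 kHz lands at the aliased frequency $|7 - 10| = 3$ kHz.

```python
import numpy as np

fs, N = 10_000, 1000                      # sampling rate 10 kHz, Nyquist 5 kHz
n = np.arange(N)
x = np.cos(2 * np.pi * 7000 * n / fs)     # 7 kHz tone: above the Nyquist limit
spec = np.abs(np.fft.rfft(x))
k_peak = int(np.argmax(spec))
f_apparent = k_peak * fs / N              # the energy shows up at 3000 Hz, not 7000
```
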


Appendix Q: Further Reading

Primary References

  1. Cooley, J.W. and Tukey, J.W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation, 19(90), 297-301. - The original FFT paper; just a few pages, historically indispensable.

  2. Oppenheim, A.V. and Schafer, R.W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice Hall. - The standard reference for DSP; chapters 8-10 cover DFT, FFT, and windowing comprehensively.

  3. Strang, G. (1999). "The Discrete Cosine Transform." SIAM Review, 41(1), 135-147. - Beautiful exposition connecting the FFT, DCT, and linear algebra.

  4. Frigo, M. and Johnson, S.G. (2005). "The Design and Implementation of FFTW3." Proceedings of the IEEE, 93(2), 216-231. - FFTW's adaptive algorithm selection strategy.

  5. Li, Z. et al. (2021). "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR 2021. - FNO architecture and theory.

  6. Dao, T. et al. (2022). "Monarch: Expressive Structured Matrices for Efficient and Accurate Training." ICML 2022. - Butterfly matrix parameterization.

  7. Radford, A. et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML 2023. - Whisper architecture and mel spectrogram preprocessing.

  8. Donoho, D.L. and Stark, P.B. (1989). "Uncertainty Principles and Signal Recovery." SIAM Journal on Applied Mathematics, 49(3), 906-931. - Discrete uncertainty principle.

Online Resources

  • NumPy FFT documentation: numpy.fft module reference with complete API
  • SciPy signal processing: scipy.signal.spectrogram, scipy.signal.welch, scipy.fft
  • Whisper preprocessing code: openai/whisper GitHub repository, audio.py
  • FNO implementation: neuraloperator/neuraloperator GitHub repository

Appendix R: Quick-Reference Formula Sheet

| Symbol | Meaning | Value / Formula |
|---|---|---|
| $\omega_N$ | Principal $N$-th root of unity | $e^{2\pi i/N}$ |
| $X[k]$ | $k$-th DFT coefficient | $\sum_{n=0}^{N-1} x[n]\,\omega_N^{-nk}$ |
| $x[n]$ | IDFT reconstruction | $\frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{nk}$ |
| $F_N$ | DFT matrix | $(F_N)_{kn} = \omega_N^{-kn}$, so $F_N F_N^* = NI$ |
| $\Delta f$ | Frequency bin width | $f_s/N = 1/T$ |
| $f_{\text{Nyq}}$ | Nyquist frequency | $f_s/2$ |
| $N_{\min}$ | Minimum $N$ to resolve $\Delta f_0$ | $\lceil f_s/\Delta f_0 \rceil$ |
| $S[l,k]$ | STFT coefficient | $\sum_{m} x[lH+m]\,w[m]\,\omega_N^{-mk}$ |
| COLA | Overlap-add constant | $\sum_l w[n-lH] = C$ for all $n$ |
| $X[N-k]$ | Conjugate symmetry (real $x$) | $X[k]^*$ |

Window cheat-sheet

| Window | Main lobe (bins) | First sidelobe (dB) | Use case |
|---|---|---|---|
| Rectangular | 2 | -13 | Broadband noise, transients |
| Hann | 4 | -31.5 | General purpose |
| Hamming | 4 | -42.7 | Speech processing |
| Blackman | 6 | -58 | Weak sinusoids near strong ones |
| Kaiser ($\beta=8.6$) | variable | -69 | Precision filter design |

FFT complexity

$$T(N) = 2T(N/2) + O(N) \implies T(N) = O(N \log_2 N)$$

Compared to the naive DFT at $N = 10^6$: from $10^{12}$ down to $2 \times 10^7$ operations - a $50\,000\times$ speedup.

Key identities

$$x \circledast y \;\overset{\mathcal{F}}{\longleftrightarrow}\; X[k] \cdot Y[k] \qquad \text{(see Section 04)}$$

$$\sum_{n=0}^{N-1}|x[n]|^2 = \frac{1}{N}\sum_{k=0}^{N-1}|X[k]|^2 \qquad \text{(Parseval)}$$

$$X[k] = X[k+N] \qquad \text{(periodicity in frequency)}$$

$$x[n] = x[n+N] \qquad \text{(periodicity in time, implicit)}$$
