
Fourier Transform: Part 8: Applications in Machine Learning to Appendix A: Extended Properties and Derivations

8. Applications in Machine Learning

8.1 FNet: Replacing Self-Attention with Fourier Transforms

Paper: Lee-Thorp, Ainslie, Eckstein, Ontanon (2022). "FNet: Mixing Tokens with Fourier Transforms." NAACL.

The idea: In a standard Transformer, the self-attention sublayer computes:

$$\text{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

which is $O(N^2 d)$ in sequence length $N$. FNet replaces this with a 2D DFT:

$$\text{FNet-mixing}(X) = \operatorname{Re}\!\left(\mathcal{F}_{\text{seq}}\!\left[\mathcal{F}_{\text{model}}[X]\right]\right)$$

where $\mathcal{F}_{\text{seq}}$ applies the FFT along the sequence dimension and $\mathcal{F}_{\text{model}}$ along the embedding dimension. This is $O(N\log N \cdot d)$ - much faster.

Why does it work? The Fourier Transform is a global linear mixing operation: every output position depends on every input position (via the sum $\hat{x}_k = \sum_n x_n e^{-2\pi ink/N}$). This is similar to attention's global mixing, but unparameterized - the mixing weights are fixed Fourier basis functions rather than learned attention scores.

Results: FNet achieves 92-97% of BERT's performance on GLUE benchmarks. For tasks requiring fine-grained token interaction (e.g., extractive QA), the gap is larger; for tasks where global context suffices (e.g., sentence classification), FNet nearly matches BERT. Training is 7× faster on GPUs and 2× faster on TPUs.

Mathematical insight: The DFT matrix $F$ is symmetric ($F = F^\top$), and its normalized version $F/\sqrt{N}$ is unitary: $(F/\sqrt{N})(F/\sqrt{N})^\dagger = I$. The DFT is thus a unitary change of basis, just like the continuous FT by Plancherel. The real part operation ensures the output is real-valued.

Code sketch:

import numpy as np

def fnet_mixing(X):
    """
    X: (batch, seq_len, d_model) array
    Returns: real-valued mixing output of same shape
    """
    X_fft = np.fft.fft(np.fft.fft(X, axis=1), axis=2)
    return np.real(X_fft)
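
As a quick numerical check of the mathematical insight above, the following sketch (sizes are illustrative) verifies that the normalized DFT matrix is symmetric and unitary:

import numpy as np

N = 64
n = np.arange(N)
# Normalized DFT matrix with entries exp(-2*pi*i*j*k/N) / sqrt(N).
F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

assert np.allclose(F, F.T)                     # symmetric: F = F^T
assert np.allclose(F @ F.conj().T, np.eye(N))  # unitary: F F^dagger = I (Plancherel)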

8.2 Random Fourier Features

Paper: Rahimi & Recht (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS (Test of Time Award, 2017).

The setup: Kernel methods (SVMs, Gaussian Processes) require computing $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ for all pairs of $N$ training points - an $O(N^2)$ matrix that costs $O(N^3)$ to invert. For $N > 10^5$, this is prohibitive.

Bochner's theorem: A continuous, shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ is positive definite if and only if it is the Fourier Transform of a non-negative measure $p(\boldsymbol{\omega})$:

$$k(\mathbf{x} - \mathbf{y}) = \int p(\boldsymbol{\omega})\,e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i\boldsymbol{\omega}^\top\mathbf{x}}\,\overline{e^{i\boldsymbol{\omega}^\top\mathbf{y}}}\right]$$

(Full treatment of Bochner's theorem in Section 12-03 Kernel Methods.)

The approximation: Sample $D$ frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega})$ and define the random feature map:

$$\phi(\mathbf{x}) = \sqrt{\frac{2}{D}}\left[\cos(\boldsymbol{\omega}_1^\top\mathbf{x} + b_1), \ldots, \cos(\boldsymbol{\omega}_D^\top\mathbf{x} + b_D)\right] \in \mathbb{R}^D$$

where $b_j \sim \mathcal{U}[0, 2\pi]$ are random phase offsets.

Guarantee: By the law of large numbers:

$$k(\mathbf{x}, \mathbf{y}) \approx \phi(\mathbf{x})^\top\phi(\mathbf{y}), \qquad \mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})$$

with concentration: $P\!\left(|\phi(\mathbf{x})^\top\phi(\mathbf{y}) - k(\mathbf{x},\mathbf{y})| > \epsilon\right) \leq 2\exp(-D\epsilon^2/4)$.

For popular kernels:

  • RBF kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2/(2\sigma^2)}$: $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \sigma^{-2}I)$
  • Laplace kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert_1/\sigma}$: $p$ is a Cauchy distribution
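
Code sketch - a minimal RFF construction for the RBF kernel (all names and sizes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, D, sigma=1.0):
    """
    Random Fourier features for k(x,y) = exp(-||x-y||^2 / (2 sigma^2)).
    X: (n, d) data matrix. Returns (n, D) features with phi(x)^T phi(y) ~ k(x,y).
    """
    n, d = X.shape
    # Bochner: the RBF spectral density is N(0, sigma^{-2} I).
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Compare against the exact kernel on random points.
X = rng.normal(size=(100, 2))
phi = rff_map(X, D=2000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_exact - phi @ phi.T).max())  # small, shrinking as D grows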

Impact: Random Fourier Features reduce kernel SVM training from $O(N^3)$ to $O(ND + D^3)$, making kernel methods scalable. The same idea appears in Performer (Choromanski et al., 2021) - an efficient attention mechanism that approximates the attention kernel with random features, reducing attention to $O(N)$.

8.3 Spectral Normalization

Paper: Miyato, Kataoka, Koyama, Yoshida (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR.

The problem: Training GANs is notoriously unstable because the discriminator can overfit, causing gradient vanishing for the generator. A Lipschitz constraint on the discriminator stabilizes training.

The solution: Normalize each weight matrix $W$ by its spectral norm $\lVert W \rVert_2 = \sigma_{\max}(W)$ (the largest singular value):

$$\tilde{W} = \frac{W}{\sigma_{\max}(W)}$$

This makes every linear layer 1-Lipschitz: $\lVert\tilde{W}\mathbf{x} - \tilde{W}\mathbf{y}\rVert \leq \lVert\mathbf{x} - \mathbf{y}\rVert$. The composition of 1-Lipschitz layers is 1-Lipschitz - so the entire discriminator is 1-Lipschitz.

Computing $\sigma_{\max}$: Power iteration gives an efficient approximation. Starting from a random unit vector $\tilde{\mathbf{v}}$:

  1. $\tilde{\mathbf{u}} = W\tilde{\mathbf{v}} / \lVert W\tilde{\mathbf{v}} \rVert$
  2. $\tilde{\mathbf{v}} = W^\top\tilde{\mathbf{u}} / \lVert W^\top\tilde{\mathbf{u}} \rVert$
  3. $\hat{\sigma} = \tilde{\mathbf{u}}^\top W\tilde{\mathbf{v}}$

One iteration per training step suffices empirically, because $W$ changes only slightly between steps and the persisted $\tilde{\mathbf{u}}, \tilde{\mathbf{v}}$ stay close to the top singular vectors. A code sketch follows.
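
Code sketch - power iteration for $\sigma_{\max}$ (a minimal numpy version, not the exact reference implementation; in practice $\tilde{\mathbf{u}}, \tilde{\mathbf{v}}$ are carried over between training steps):

import numpy as np

def spectral_norm_estimate(W, v, n_iters=1):
    """One or more power-iteration steps on W; returns (sigma_hat, u, v)."""
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return u @ (W @ v), u, v  # Rayleigh-quotient estimate of sigma_max

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
v = rng.normal(size=128)
sigma_hat, _, _ = spectral_norm_estimate(W, v, n_iters=50)
print(sigma_hat, np.linalg.svd(W, compute_uv=False)[0])  # nearly equal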

Why "spectral"? The spectral norm W2\lVert W \rVert_2 is the largest eigenvalue of WW\sqrt{W^\top W} - the spectrum of the operator WW viewed as a linear map. This connects to Fourier analysis: for a convolution layer with kernel hh, the spectral norm is h^=maxξh^(ξ)\lVert\hat{h}\rVert_\infty = \max_\xi|\hat{h}(\xi)| - the supremum of the Fourier Transform of the filter.

For AI: Spectral normalization is now standard in GAN training. It also appears in attention normalization - normalizing the key/value projection matrices to control the Lipschitz constant of the attention map, which is critical for stable training of large models.

8.4 Fourier Neural Operator - Preview

The Fourier Neural Operator (FNO) (Li et al., 2021) learns a mapping between function spaces (e.g., PDE initial conditions -> solutions) by:

  1. Lift the input $\mathbf{v}(t)$ to a higher-dimensional representation $\mathbf{v}_0 = P(\mathbf{v})$
  2. Apply $L$ Fourier integral operator layers: $\mathbf{v}_{l+1}(t) = \sigma\big(W\mathbf{v}_l(t) + \mathcal{F}^{-1}[R_\phi \cdot \mathcal{F}[\mathbf{v}_l]](t)\big)$
  3. Project to the output: $u = Q(\mathbf{v}_L)$

The key operation $R_\phi$ is a learnable complex-valued tensor applied pointwise in Fourier space - a global convolution with a learned filter. Only the lowest $k_{\max}$ Fourier modes are kept (Plancherel guarantees the truncation error equals the discarded energy).
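
Code sketch - the spectral part of one such layer in 1D, simplified to a per-channel filter rather than the full channel-mixing tensor of the paper (names and sizes illustrative):

import numpy as np

def spectral_conv_1d(v, R, k_max):
    """
    Global convolution in Fourier space: keep the lowest k_max modes,
    multiply pointwise by a learned complex filter R, transform back.
    v: (n_grid, channels) real field; R: (k_max, channels) complex weights.
    """
    v_hat = np.fft.rfft(v, axis=0)        # (n_grid//2 + 1, channels)
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max] = R * v_hat[:k_max]   # truncate + pointwise multiply
    return np.fft.irfft(out_hat, n=v.shape[0], axis=0)

rng = np.random.default_rng(0)
v = rng.normal(size=(128, 4))
R = rng.normal(size=(16, 4)) + 1j * rng.normal(size=(16, 4))
print(spectral_conv_1d(v, R, k_max=16).shape)  # (128, 4)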

FNO achieves up to 1000× speedups over classical PDE solvers for problems like weather prediction and turbulence simulation.

Full treatment in Section 20-03 DFT and FFT - the discretized version of FNO and its implementation via FFT is covered there.

8.5 Bochner's Theorem and Shift-Invariant Kernels - Preview

Bochner's theorem (1932) characterizes all positive-definite, shift-invariant kernels as Fourier Transforms of non-negative measures:

$$k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,dp(\boldsymbol{\omega})$$

This is a profound connection between Fourier analysis and kernel methods. The kernel's shape in the spatial domain determines (and is determined by) its spectral density pp.

Full treatment in Section 12-03 Kernel Methods - the RKHS theory, Mercer's theorem, and applications to SVMs, Gaussian Processes, and the Neural Tangent Kernel are covered there. Here we note that Bochner's theorem is the mathematical foundation for Random Fourier Features (Section 8.2).

8.6 RoPE from the Continuous FT Perspective

Rotary Position Embedding (RoPE) (Su et al., 2021) is now used in LLaMA-3, Mistral, Qwen, GPT-NeoX, and virtually every frontier LLM. Its mathematical foundation is the Fourier Transform on the circle group.

Setup. In attention, the score between query $\mathbf{q}$ at position $m$ and key $\mathbf{k}$ at position $n$ should depend only on the relative position $m - n$. This is a translation invariance requirement - exactly the condition Bochner's theorem characterizes.

Construction. Pair up embedding dimensions: $(q_{2j}, q_{2j+1})$ and $(k_{2j}, k_{2j+1})$ for $j = 0, \ldots, d/2 - 1$. Treat each pair as a complex number $q_j^{(c)} = q_{2j} + iq_{2j+1}$. Apply a rotation by angle $m\theta_j$:

$$f_q(\mathbf{q}, m)_j = q_j^{(c)} \cdot e^{im\theta_j}, \qquad \theta_j = 10000^{-2j/d_{\text{model}}}$$

The attention score becomes:

$$\langle f_q(\mathbf{q}, m), f_k(\mathbf{k}, n)\rangle = \operatorname{Re}\!\left[\sum_j q_j^{(c)}\overline{k_j^{(c)}} \cdot e^{i(m-n)\theta_j}\right]$$

which depends only on $m - n$ - the relative position. This is the Fourier modulation theorem at work: multiplying by $e^{im\theta_j}$ is a frequency shift by $m$ in the "position domain."

The frequencies $\theta_j$: The geometric spacing $\theta_j = 10000^{-2j/d}$ means each pair encodes a different "octave" of positional information. Low-index dimensions ($j$ small) use high frequencies $\theta_j \approx 1$ - sensitive to local position differences. High-index dimensions use low frequencies $\theta_j \approx 10000^{-1}$ - sensitive to global structure over thousands of positions. This is a multi-resolution decomposition - the Fourier Transform at multiple frequency scales.
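
Code sketch - RoPE in the complex-pair view, with a check that scores depend only on relative position (a minimal illustration, not a production implementation):

import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate embedding pairs: x is (seq_len, d) with d even."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)          # frequencies theta_j
    xc = x[:, 0::2] + 1j * x[:, 1::2]                  # pairs as complex numbers
    xc = xc * np.exp(1j * positions[:, None] * theta)  # rotate pair j by m*theta_j
    out = np.empty_like(x)
    out[:, 0::2], out[:, 1::2] = xc.real, xc.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
m = np.arange(8)
# Shifting all positions by a common offset leaves the score matrix unchanged,
# because it depends only on m - n.
s1 = rope(q, m) @ rope(k, m).T
s2 = rope(q, m + 5) @ rope(k, m + 5).T
assert np.allclose(s1, s2)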

Why the $10000$ base? The maximum useful context is $\sim 10000$ tokens at the original RoPE scale (each frequency completes at most one full rotation). For LLaMA-3 with 128K context, the effective base must be much larger - which is why LLaMA-3.1 uses $\theta_{\text{base}} = 500000$ instead of $10000$.

ROPE: FOURIER TRANSFORM IN ATTENTION
========================================================================

  f_q(q, m) = q · e^{imθ}   <- modulation by Fourier basis e^{imθ}

  ⟨f_q(q,m), f_k(k,n)⟩ = Re[q·k̄ · e^{i(m-n)θ}]
                            ^ depends only on relative position m-n

  θ_j = 10000^{-2j/d}:  j=0 -> θ_j ≈ 1 (fast/local),  j=d/2 -> θ_j ≈ 1/10000 (slow/global)

  Context extension:
    Original RoPE: base=10000, max context ~10K tokens
    LLaMA-3.1:     base=500K,  max context 128K tokens
    LongRoPE:      non-uniform rescaling up to 2M context

========================================================================

9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing the $\omega$ and $\xi$ conventions and getting $2\pi$ factors wrong | The $\omega$-convention carries a $1/(2\pi)$ in the inverse transform; the $\xi$-convention does not. Mixing them adds spurious factors | Always identify the convention on first use; for this curriculum use the $\xi$-convention $\hat{f}(\xi)=\int f(t)e^{-2\pi i\xi t}\,dt$ |
| 2 | Assuming $\hat{f} \in L^1$ whenever $f \in L^1$ | $f=\operatorname{rect}$ has $\hat{f}=\operatorname{sinc} \notin L^1$. The Riemann-Lebesgue lemma only guarantees $\hat{f} \in C_0$, not $L^1$ | Check whether $f$ is smooth enough (e.g., $f \in L^1$ and Lipschitz) for $\hat{f} \in L^1$ to hold |
| 3 | Writing $\mathcal{F}[\delta](\xi) = 0$ | $\delta$ has a flat spectrum: $\mathcal{F}[\delta](\xi)=1$ for all $\xi$. A "pointlike" impulse contains all frequencies equally | Compute the distributional FT: $\langle\hat{\delta},\phi\rangle=\langle\delta,\hat{\phi}\rangle=\hat{\phi}(0)=\int\phi\,dt$ |
| 4 | Thinking the uncertainty principle only limits our instruments | $\Delta t\cdot\Delta\xi\geq 1/(4\pi)$ is a theorem about functions, not about measurement devices. It holds for any $f\in L^2$, regardless of how it is measured | Accept the bound as a mathematical fact; designing around it requires fundamentally different time-frequency representations (wavelets, Section 05) |
| 5 | Applying the differentiation property to non-differentiable $f$ | $\mathcal{F}[f'](\xi)=2\pi i\xi\hat{f}(\xi)$ requires $f'\in L^1$. For $f=\operatorname{rect}$, $f'=\delta(t+1/2)-\delta(t-1/2)$ in the distributional sense | Work in the distributional setting when $f$ has jumps; the formula still holds but $f'$ is a distribution |
| 6 | Claiming Parseval says $\int \lvert f\rvert^2 = \int \lvert\hat{f}\rvert^2$ in every convention | The identity is exact only in the $\xi$-convention; in the $\omega$-convention it reads $\int\lvert f\rvert^2\,dt = \frac{1}{2\pi}\int\lvert\hat{f}(\omega)\rvert^2\,d\omega$ | State the convention before invoking Parseval; in this curriculum's $\xi$-convention, $\lVert f\rVert_2 = \lVert\hat{f}\rVert_2$ holds exactly |
| 7 | Mixing up the FT of a product vs. convolution | $\mathcal{F}[f\cdot g]=\hat{f}*\hat{g}$ (convolution in frequency), not $\hat{f}\cdot\hat{g}$. The multiplicative dual of the convolution theorem is often reversed | Memorize the duality table: convolution in time $\leftrightarrow$ product in frequency; product in time $\leftrightarrow$ convolution in frequency |
| 8 | Forgetting the $1/\lvert a\rvert$ in the scaling theorem | $\mathcal{F}[f(at)](\xi)=\frac{1}{\lvert a\rvert}\hat{f}(\xi/a)$; dropping the $1/\lvert a\rvert$ breaks the area identity $\hat{f}(0)=\int f\,dt$ | Sanity-check the $\xi=0$ value (the total integral) after any rescaling to catch the missing factor |
| 9 | Treating negative frequencies as "unphysical" | For real $f$, negative frequencies are redundant (Hermitian symmetry) but are not wrong. For complex $f$ (e.g., analytic signals), negative and positive frequencies are distinct and carry different information | Embrace negative frequencies; they simplify formulas and are essential for complex signals |
| 10 | Applying $\mathcal{F}^{-1}\circ\mathcal{F}=I$ pointwise instead of a.e. | Inversion holds almost everywhere (for $L^1$ functions with sufficient regularity). At a jump discontinuity, $\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi=\frac{1}{2}[f(t^+)+f(t^-)]$ - the average of the left and right limits | Account for the $1/2$-average behavior at discontinuities (same as Dirichlet's theorem in Section 01) |
| 11 | Using $\operatorname{sinc}(x) = \sin(x)/x$ and $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ interchangeably | There are two conventions for sinc. The normalized $\operatorname{sinc}(x)=\sin(\pi x)/(\pi x)$ satisfies $\mathcal{F}[\operatorname{rect}]=\operatorname{sinc}$ in the $\xi$-convention; the unnormalized $\sin(x)/x$ does not | In this curriculum: $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ (normalized), consistent with the $\xi$-convention FT |
| 12 | Thinking FNet replaces attention entirely | FNet is a weaker mixer than attention: it cannot learn data-dependent weights. It works well for classification but poorly for tasks requiring fine-grained token interactions (QA, coreference) | Use FNet as a fast baseline; switch to attention for tasks requiring selective, query-dependent mixing |

10. Exercises

Exercise 1 * - Computing Fourier Transforms from Definition

(a) Compute $\mathcal{F}[f](\xi)$ for $f(t) = e^{-a|t|}$ with $a > 0$. Show all steps of the integration. What type of function is $\hat{f}$?

(b) Compute $\mathcal{F}[\operatorname{rect}(t/T)](\xi)$ for a pulse of width $T > 0$. Express the result using the normalized sinc function.

(c) Verify the $L^1$ bound: show $|\hat{f}(\xi)| \leq \lVert f\rVert_1$ for your answer in (a).

(d) Confirm the Riemann-Lebesgue lemma numerically for $f(t) = e^{-t^2}$: plot $|\hat{f}(\xi)|$ and verify it decays to 0.


Exercise 2 * - Applying Properties

(a) Given $\hat{f}(\xi) = e^{-\pi\xi^2}$, use the time-shift property to write the Fourier Transform of $g(t) = f(t - 3)$.

(b) Given $\hat{h}(\xi) = \operatorname{sinc}(\xi)$, find the Fourier Transform of $h(t - 2)\cos(4\pi t)$. (Use shift + modulation.)

(c) If $f(t)$ has $\hat{f}(\xi) = g(\xi)$, what is the FT of $f(3t + 1)$? (Use both shift and scaling.)

(d) Prove: if $f$ is real and even, then $\hat{f}$ is real and even.


Exercise 3 * - Parseval's Relation Numerically

(a) For $f(t) = e^{-\pi t^2}$, compute $\lVert f\rVert_2^2 = \int|f|^2\,dt$ analytically.

(b) Compute $\lVert\hat{f}\rVert_2^2$ analytically and verify $\lVert f\rVert_2^2 = \lVert\hat{f}\rVert_2^2$.

(c) Numerically verify Parseval using the FFT: discretize $f$ on $[-5, 5]$ with $N = 1024$ points, apply the FFT, and compare $\sum_n|f_n|^2$ with $\frac{1}{N}\sum_k|\hat{f}_k|^2$ (the discrete Parseval identity in NumPy's FFT convention).

(d) Use Parseval to evaluate $\int_{-\infty}^\infty \frac{1}{(1 + 4\pi^2 t^2)^2}\,dt$.


Exercise 4 ** - The Inversion Theorem

(a) For $f(t) = \operatorname{rect}(t)$ (whose transform $\operatorname{sinc} \notin L^1$), show that the principal value integral $\lim_{R\to\infty}\int_{-R}^{R}\operatorname{sinc}(\xi)e^{2\pi i\xi t}\,d\xi$ converges to $\operatorname{rect}(t)$ for $|t| \neq 1/2$ by evaluating the integral explicitly.

(b) Show that at $t = 1/2$ (the jump discontinuity), the principal value integral converges to $1/2 = \frac{1}{2}[\operatorname{rect}(1/2^-) + \operatorname{rect}(1/2^+)]$.

(c) This mirrors Dirichlet's theorem from Section 20-01. Identify the precise analog: what plays the role of the Dirichlet kernel in the Fourier integral case?


Exercise 5 ** - Uncertainty Principle

(a) For $f(t) = e^{-\pi t^2/\sigma^2}$ (normalized so that $\lVert f\rVert_2 = 1$), compute $\Delta t$ (RMS time spread) and $\Delta\xi$ (RMS frequency spread). Verify $\Delta t\cdot\Delta\xi = 1/(4\pi)$.

(b) For $f(t) = \operatorname{rect}(t)$, compute $\Delta t$ and $\Delta\xi$ numerically. Is the uncertainty bound satisfied? Is it tight?

(c) A signal designer wants to transmit a pulse with duration $\Delta t \leq 1\,\mu\text{s}$ using bandwidth $\Delta\xi \leq 100\,\text{kHz}$. Is this possible? What does the uncertainty principle say?

(d) Show that any signal of the form $f(t) = Ce^{-\alpha t^2}e^{2\pi i\xi_0 t}$ achieves equality in the uncertainty principle.


Exercise 6 ** - Tempered Distributions

(a) Verify $\mathcal{F}[\delta(t - a)](\xi) = e^{-2\pi i\xi a}$ using the distributional definition $\langle\hat{T},\phi\rangle = \langle T,\hat{\phi}\rangle$.

(b) Compute $\mathcal{F}[\delta'(t)](\xi)$ (the FT of the derivative of the delta).

(c) Use the result from (a) to verify: $\mathcal{F}[\cos(2\pi\xi_0 t)](\xi) = \frac{1}{2}[\delta(\xi - \xi_0) + \delta(\xi + \xi_0)]$.

(d) Verify the Poisson summation formula numerically for $f(t) = e^{-\pi t^2}$: compare $\sum_{n=-N}^{N} f(n)$ with $\sum_{n=-N}^{N} \hat{f}(n)$ for $N = 5, 10, 20$.


Exercise 7 *** - Random Fourier Features

(a) For the RBF kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2}$ (setting $\sigma = 1$), the spectral density is $p(\boldsymbol{\omega}) = (4\pi)^{-d/2}e^{-\lVert\boldsymbol{\omega}\rVert^2/4}$. Draw $D = 500$ frequencies $\boldsymbol{\omega}_j \sim \mathcal{N}(\mathbf{0}, 2I)$ for $d = 2$ and construct the random feature map $\phi(\mathbf{x}) \in \mathbb{R}^{2D}$ (using cos and sin).

(b) Generate 100 random pairs $(\mathbf{x}_i, \mathbf{y}_i)$ with $\mathbf{x}_i, \mathbf{y}_i \sim \mathcal{N}(\mathbf{0}, I)$. For each pair, compute the exact kernel value $k(\mathbf{x}_i, \mathbf{y}_i)$ and the RFF approximation $\phi(\mathbf{x}_i)^\top\phi(\mathbf{y}_i)$. Plot the approximation vs. the exact values and report the RMSE.

(c) Study the approximation quality as a function of $D \in \{10, 50, 100, 500, 1000\}$. How does the RMSE scale with $D$? Is it consistent with the theoretical bound $\mathbb{E}[\text{error}^2] = O(D^{-1})$?

(d) Takeaway: explain in 2 sentences why RFF matters for LLM-scale systems where exact kernel computation would be infeasible.


Exercise 8 *** - FNet vs Attention Experiment

(a) Implement a minimal FNet mixing layer: given $X \in \mathbb{R}^{N \times d}$, apply the 2D FFT and take the real part. Implement a minimal attention mixing layer: $\text{Attn}(X) = \operatorname{softmax}(XX^\top/\sqrt{d})X$.

(b) For a synthetic task - "copy the token from position $k$ to the output" (a task requiring precise position tracking) - construct training pairs $(X, y)$ and train both mixers with a linear head on top. Compare accuracy and training speed.

(c) For a synthetic task - "output the mean of all input tokens" (a global aggregation task) - repeat the comparison. Which mixer is better suited? Why?

(d) Takeaway: connect your results to the mathematical properties of the FT (fixed unparameterized mixing vs. learned data-dependent mixing) and explain which types of tasks each mixer is appropriate for.

11. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Fourier Transform as unitary operator (Plancherel) | FNet (Lee-Thorp et al., 2022) uses the FT as a token mixer: same information content, 7× faster than attention on GPU. The unitarity of the FT ensures no information loss during mixing. |
| Convolution-multiplication duality | The theoretical basis for $O(N\log N)$ convolution via the FFT, and the reason CNNs can be trained efficiently. FlashConv (Fu et al., 2023) and Hyena (Poli et al., 2023) extend this to long-range sequence models. |
| Modulation theorem (time-shift $\leftrightarrow$ frequency-shift) | Foundation of RoPE (Su et al., 2021): multiplying embeddings by $e^{im\theta}$ shifts the position representation. Now standard in LLaMA-3, Mistral, Qwen, GPT-NeoX. |
| Uncertainty principle $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$ | Fundamental constraint on context-length extension. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) must lower the frequencies $\theta_j$ to extend context; the uncertainty principle dictates the minimum frequency needed. |
| Bochner's theorem + FT of spectral density | Random Fourier Features (Rahimi & Recht, 2007) reduce kernel computation from $O(N^2)$ to $O(ND)$. Performer (Choromanski et al., 2021) uses the same idea to get $O(N)$ attention. |
| Spectral norm $\lVert W\rVert_2 = \sigma_{\max}$ | Spectral normalization (Miyato et al., 2018): dividing discriminator weights by their spectral norm enforces Lipschitz constraints. Standard in GAN training and used in diffusion model discriminators. |
| FT as basis for function-space operators | Fourier Neural Operator (Li et al., 2021) solves PDEs up to 1000× faster by learning in Fourier space. Used in climate modeling, fluid simulation, and materials science. |
| Differentiation in the frequency domain ($f' \leftrightarrow 2\pi i\xi\hat{f}$) | Heat and wave equation solutions in closed form via the FT. The heat kernel is a Gaussian, which connects to why diffusion models add Gaussian noise: it has maximal spectral support. |
| Gaussian as optimal time-frequency window | Gaussian attention (Tsai et al., 2019): replacing dot-product attention with a Gaussian kernel gives attention an explicit time-frequency interpretation. |
| Dirac comb + Poisson summation | Shannon sampling theorem: ensures that Whisper's mel-spectrogram preprocessing (16 kHz sampling, 80-channel mel filterbank) captures all speech frequencies (up to 8 kHz) without aliasing. |

12. Conceptual Bridge

Looking Back: From Fourier Series to Fourier Transform

The Fourier series (Section 20-01) established that periodic functions decompose into discrete harmonics. The key insight was geometric: the trigonometric functions $\{e^{inx}\}$ form an orthonormal basis of $L^2[-\pi,\pi]$, and the Fourier coefficients are the projections of $f$ onto these basis vectors.

The Fourier Transform carries this insight to its natural conclusion. Letting the period $T \to \infty$, the discrete harmonic sum $\sum_n c_n e^{2\pi int/T}$ becomes the continuous integral $\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$, and the discrete spectrum $\{c_n\}$ becomes the continuous spectrum $\hat{f}(\xi)$. Everything from the Fourier series has an analog:

  • Orthonormality -> Plancherel's theorem (Section 5)
  • Parseval's identity -> Parseval's relation (Section 5.2)
  • Dirichlet's theorem on convergence -> Fourier inversion theorem (Section 4)
  • Gibbs phenomenon -> spectral leakage in the DFT (Section 03)
  • Smoothness <-> spectral decay -> differentiation property (Section 3.5)

The distribution theory in Section 7 reveals the deepest unity: the Fourier series is just the Fourier Transform restricted to periodic distributions. The Dirac comb with spacing $T$ transforms to a Dirac comb with spacing $1/T$ - the very relationship between period and harmonic spacing that the Fourier series encodes.

Looking Forward: Three Branches

From the Fourier Transform, the curriculum branches in three directions:

Branch 1: Discretization -> Section 20-03 DFT and FFT. The continuous FT must be made computable. Discretizing both time and frequency leads to the DFT, and the Cooley-Tukey algorithm (1965) computes it in $O(N\log N)$ instead of $O(N^2)$. The FFT is arguably the most important algorithm in scientific computing - it is what makes spectrograms, MRI, and FNet feasible. The DFT inherits all properties of the continuous FT but adds complications: aliasing (from sampling in time), spectral leakage (from finite duration), and the circular convolution structure.

Branch 2: Convolution Theorem -> Section 20-04. The most important property of the FT for applications is that convolution in time equals multiplication in frequency. This converts the $O(N^2)$ convolution operation into $O(N\log N)$ FFT-based multiplication. The convolution theorem is the mathematical foundation of CNNs, WaveNet, S4/Mamba, and Hyena - all architectures where the key operation is a learned filter applied by convolution. The full theory of LTI systems, frequency response, filter design, and the Wiener-Khinchin theorem belongs there.

Branch 3: Time-Frequency Localization -> Section 20-05 Wavelets. The FT has a fundamental limitation: $|\hat{f}(\xi)|$ tells you which frequencies are present but not when they occur. The uncertainty principle shows this is not fixable - any attempt to localize in both time and frequency is bounded by $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$. Wavelets overcome this by accepting a principled tradeoff: high frequencies are analyzed at fine time resolution, low frequencies at coarse time resolution. This multi-resolution analysis (MRA) is the mathematical foundation of JPEG 2000, EEG analysis, and wavelet attention mechanisms.

POSITION IN THE FOURIER CURRICULUM
========================================================================

  Section 01 FOURIER SERIES         Section 02 FOURIER TRANSFORM (HERE)
  -------------------------         -----------------------------------
  Periodic functions                Aperiodic functions on R
  Discrete spectrum {c_n}           Continuous spectrum f^(ξ)
  Parseval for series               Plancherel's theorem
  Dirichlet convergence             Inversion theorem
  Gibbs phenomenon                  Spectral leakage (-> Section 03)
  RoPE derivation                   Full uncertainty principle
                   v                         v
  +--------------------------------------------------------------------+
  |                   Section 02 -> THREE BRANCHES                     |
  |                                                                    |
  |  Discretize:            Convolve:                Time-localize:    |
  |  Section 03 DFT/FFT ->  Section 04 Convolution -> Section 05 Wavelets |
  |  O(N log N)             CNNs, Mamba              MRA, multi-scale  |
  |  Whisper, FNO           WaveNet, Hyena           Scattering networks |
  +--------------------------------------------------------------------+
                          v
       Section 21 Statistical Learning Theory
       (spectral learning bounds, kernel methods)

========================================================================

Key Takeaways

The Fourier Transform is characterized by three central theorems:

  1. Plancherel - the FT is a unitary isometry on $L^2(\mathbb{R})$: it preserves energy and inner products. The Fourier Transform changes how you describe a signal, not how much information it contains.

  2. Uncertainty - $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$: time and frequency concentration are fundamentally coupled. This is a theorem about analysis, not about physics, and it constrains every signal processing and ML system that operates in both domains simultaneously.

  3. Inversion - the FT is invertible: given $\hat{f}$, you can recover $f$ exactly. No information is lost in the Fourier domain representation - it is a complete, lossless change of basis.

Together, these theorems say: the Fourier Transform is an invertible, energy-preserving, mathematically constrained change of representation. It reveals structure (frequency content) that is invisible in the time domain, converts convolution to multiplication, and converts differentiation to algebra - making it the single most powerful analytical tool for understanding and processing signals, functions, and the weight matrices of neural networks.




Appendix A: Extended Properties and Derivations

A.1 The Duality Theorem in Full

One of the most elegant properties of the Fourier Transform is self-duality: applying the FT twice returns the time-reversed original.

Theorem A.1 (Duality). For $f \in L^1(\mathbb{R})$ with $\hat{f} \in L^1(\mathbb{R})$:

$$\mathcal{F}[\hat{f}(t)](\xi) = f(-\xi)$$

Equivalently: if $\mathcal{F}[f](\xi) = g(\xi)$, then $\mathcal{F}[g](\xi) = f(-\xi)$.

Proof: $\mathcal{F}[\hat{f}(t)](\xi) = \int\hat{f}(t)e^{-2\pi i\xi t}\,dt = \int\left(\int f(s)e^{-2\pi ist}\,ds\right)e^{-2\pi i\xi t}\,dt$

Exchanging the order of integration and reading the inner integral distributionally: $= \int f(s)\left(\int e^{-2\pi i(s+\xi)t}\,dt\right)ds = \int f(s)\,\delta(s+\xi)\,ds = f(-\xi)$ $\square$

Corollary: Applying the FT four times returns the original: $\mathcal{F}^4[f] = f$. The Fourier Transform has order 4 as an operator.

Using duality to compute new transforms:

If you know $\mathcal{F}[f](\xi) = g(\xi)$, then $\mathcal{F}[g](\xi) = f(-\xi)$.

Example: We know $\mathcal{F}[\operatorname{rect}(t)](\xi) = \operatorname{sinc}(\xi)$. By duality: $\mathcal{F}[\operatorname{sinc}(t)](\xi) = \operatorname{rect}(-\xi) = \operatorname{rect}(\xi)$ (since rect is even).

So $\mathcal{F}[\operatorname{sinc}](\xi) = \operatorname{rect}(\xi)$: a sinc in time has a rectangular (bandlimited) spectrum. This is the ideal low-pass filter - a filter that passes frequencies $|\xi| \leq 1/2$ perfectly and rejects all others.

A.2 Analytic Signals and the Hilbert Transform

Definition A.1 (Analytic Signal). For a real signal $f(t)$, its analytic signal is:

$$f_+(t) = f(t) + i\,\mathcal{H}[f](t)$$

where $\mathcal{H}[f]$ is the Hilbert Transform: $\mathcal{H}[f](t) = \frac{1}{\pi}\,\text{P.V.}\int_{-\infty}^\infty \frac{f(\tau)}{t-\tau}\,d\tau$.

Fourier domain characterization:

$$\hat{f}_+(\xi) = \begin{cases} 2\hat{f}(\xi) & \xi > 0 \\ \hat{f}(0) & \xi = 0 \\ 0 & \xi < 0 \end{cases}$$

The analytic signal has only non-negative frequencies: it is a one-sided spectrum signal. This is achieved by zeroing out the negative frequencies and doubling the positive ones - a frequency-domain operation.

Why this matters: The analytic signal enables the definition of instantaneous frequency $\omega(t) = \frac{d}{dt}\arg f_+(t)$ - the frequency of the signal at a given instant. This is essential for frequency-modulated signals (FM radio, chirps) and is used in empirical mode decomposition for analyzing non-stationary signals like EEG and speech.

Hilbert Transform in spectral terms: $\mathcal{F}[\mathcal{H}[f]](\xi) = -i\,\operatorname{sgn}(\xi)\,\hat{f}(\xi)$. The Hilbert Transform applies a $-\pi/2$ phase shift to positive frequencies and $+\pi/2$ to negative frequencies - it is a "90-degree phase rotator."
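
Code sketch - the analytic signal via the FFT (the same one-sided-spectrum construction used by scipy.signal.hilbert; the test signal is illustrative):

import numpy as np

def analytic_signal(x):
    """Zero the negative frequencies, double the positive ones, keep DC/Nyquist."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0  # Nyquist bin for even N
    return np.fft.ifft(X * h)

fs = 1000
t = np.arange(fs) / fs
z = analytic_signal(np.cos(2 * np.pi * 50 * t))
# Instantaneous frequency = phase derivative / (2 pi); recovers ~50 Hz.
inst_freq = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)
print(inst_freq[:5])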

A.3 The Fourier Transform on $\mathbb{R}^d$

The Fourier Transform extends naturally to $d$ dimensions. For $f: \mathbb{R}^d \to \mathbb{C}$ with $f \in L^1(\mathbb{R}^d)$:

$$\hat{f}(\boldsymbol{\xi}) = \int_{\mathbb{R}^d} f(\mathbf{t})\,e^{-2\pi i \boldsymbol{\xi}^\top\mathbf{t}}\,d\mathbf{t}$$

where $\boldsymbol{\xi} \in \mathbb{R}^d$ is the frequency vector. All properties generalize:

  • Shift: $\mathcal{F}[f(\mathbf{t} - \mathbf{a})](\boldsymbol{\xi}) = e^{-2\pi i\boldsymbol{\xi}^\top\mathbf{a}}\,\hat{f}(\boldsymbol{\xi})$
  • Scaling by an invertible matrix $A$: $\mathcal{F}[f(A\mathbf{t})](\boldsymbol{\xi}) = \frac{1}{|\det A|}\,\hat{f}(A^{-\top}\boldsymbol{\xi})$
  • Differentiation: $\mathcal{F}[\partial_{t_j}f](\boldsymbol{\xi}) = 2\pi i\xi_j\,\hat{f}(\boldsymbol{\xi})$
  • Plancherel: $\lVert\hat{f}\rVert_{L^2(\mathbb{R}^d)} = \lVert f\rVert_{L^2(\mathbb{R}^d)}$

For AI: The 2D Fourier Transform is used in convolutional networks for images. A 2D convolution with filter $h$ is, in Fourier space, pointwise multiplication: $\hat{y}(\boldsymbol{\xi}) = \hat{x}(\boldsymbol{\xi})\cdot\hat{h}(\boldsymbol{\xi})$. For FNet (Section 8.1), the 2D FT is applied over both the sequence dimension and the embedding dimension.
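
A quick numerical illustration of this duality (array sizes are illustrative), checking circular 2D convolution against pointwise multiplication in Fourier space:

import numpy as np

rng = np.random.default_rng(0)
x, h = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))

# Circular 2D convolution via the FFT ...
y = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(h)))
# ... matches the direct sum definition at a sample output position.
i, j = 5, 7
s = sum(x[a, b] * h[(i - a) % 32, (j - b) % 32]
        for a in range(32) for b in range(32))
assert np.allclose(y[i, j], s)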

Radial functions: For radially symmetric $f(\mathbf{t}) = f_0(\lVert\mathbf{t}\rVert)$, the Fourier Transform is also radial: $\hat{f}(\boldsymbol{\xi}) = \hat{f}_0(\lVert\boldsymbol{\xi}\rVert)$. The RBF kernel $k(\mathbf{x}-\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2}$ is radially symmetric, so its spectral density $p(\boldsymbol{\omega}) = c_d\,e^{-\lVert\boldsymbol{\omega}\rVert^2/4}$ is also radially symmetric - which is why sampling $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2I)$ works for RFF with this kernel (as in Exercise 7).

A.4 The Fourier Transform and PDEs

The differentiation property $\mathcal{F}[f^{(n)}](\xi) = (2\pi i\xi)^n\hat{f}(\xi)$ converts partial differential equations into algebraic equations in frequency space. This is the primary tool for solving linear constant-coefficient PDEs.

Example 1: Heat Equation. Solve $\partial_t u = \kappa\,\partial_{xx}u$, $u(x,0) = u_0(x)$.

Taking the Fourier Transform in $x$:

$$\partial_t\hat{u}(\xi,t) = \kappa(2\pi i\xi)^2\hat{u}(\xi,t) = -4\pi^2\kappa\xi^2\,\hat{u}(\xi,t)$$

This is a first-order ODE in $t$ for each fixed $\xi$: $\hat{u}(\xi,t) = \hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}$.

Inverting: $u(x,t) = \int\hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}\,e^{2\pi i\xi x}\,d\xi = (u_0 * G_t)(x)$

where $G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}\,e^{-x^2/(4\kappa t)}$ is the Gaussian heat kernel (a Gaussian with time-dependent width $\sigma = \sqrt{2\kappa t}$).

Interpretation: Heat diffusion = convolution with a Gaussian. High frequencies $\xi$ decay as $e^{-4\pi^2\kappa\xi^2 t}$ - much faster than low frequencies. Sharp features (high-frequency content) are smoothed out rapidly; broad features (low-frequency content) persist.
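
Code sketch - a spectral heat-equation step on a periodic grid, showing the frequency-dependent damping (parameters are illustrative):

import numpy as np

def heat_step_spectral(u0, kappa, t, L=1.0):
    """u_hat(xi, t) = u_hat0(xi) * exp(-4 pi^2 kappa xi^2 t) on a periodic domain."""
    N = len(u0)
    xi = np.fft.fftfreq(N, d=L / N)  # frequencies in cycles per unit length
    u_hat = np.fft.fft(u0) * np.exp(-4 * np.pi**2 * kappa * xi**2 * t)
    return np.real(np.fft.ifft(u_hat))

x = np.linspace(0, 1, 256, endpoint=False)
u0 = np.sign(np.sin(2 * np.pi * x))           # square wave: sharp edges
u = heat_step_spectral(u0, kappa=1e-3, t=1.0)
# The high harmonics of the edges are damped far more than the fundamental.
print(np.abs(np.fft.fft(u0))[[1, 17]], np.abs(np.fft.fft(u))[[1, 17]])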

For AI - Diffusion Models: The DDPM forward process $q(x_t|x_0) = \mathcal{N}\big(x_t;\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\big)$ is the discrete analog of heat diffusion: the signal is progressively degraded by Gaussian noise (which has a flat spectrum - all frequencies equally). The network learns to reverse this process. The Fourier perspective explains why the early denoising steps (low noise) must recover high-frequency details while later steps (high noise) recover the coarse structure - the exact inverse of heat diffusion.

Example 2: Wave Equation. Solve $\partial_{tt}u = c^2\partial_{xx}u$, $u(x,0) = u_0(x)$, $\partial_t u(x,0) = v_0(x)$.

FT in $x$: $\partial_{tt}\hat{u} = (2\pi i\xi)^2 c^2\hat{u} = -4\pi^2c^2\xi^2\,\hat{u}$

Solution: $\hat{u}(\xi,t) = \hat{u}_0(\xi)\cos(2\pi c\xi t) + \hat{v}_0(\xi)\,\dfrac{\sin(2\pi c\xi t)}{2\pi c\xi}$

Inverting gives d'Alembert's formula: $u(x,t) = \frac{1}{2}[u_0(x+ct) + u_0(x-ct)] + \frac{1}{2c}\int_{x-ct}^{x+ct}v_0(s)\,ds$

The wave equation propagates each frequency at the same speed $c$ - no dispersion. In contrast, the heat equation attenuates high frequencies faster than low frequencies - strong dissipation.

Example 3: Schrödinger Equation. The free-particle Schrödinger equation $i\hbar\,\partial_t\psi = -\frac{\hbar^2}{2m}\partial_{xx}\psi$ is formally a heat equation with imaginary time. Its solution $\hat{\psi}(\xi,t) = \hat{\psi}_0(\xi)\,e^{-i\hbar(2\pi\xi)^2t/(2m)}$ shows that each frequency component oscillates (rather than decays) with angular frequency $\hbar(2\pi\xi)^2/(2m)$ - quantum dispersion. The uncertainty principle $\Delta x\cdot\Delta p \geq \hbar/2$ (where $p = h\xi$ by de Broglie) is the physics version of the mathematical uncertainty principle in Section 6.

A.5 Windowed Fourier Transform (STFT)

The continuous Fourier Transform gives global frequency information but no time localization: $|\hat{f}(\xi)|$ tells you whether frequency $\xi$ is present somewhere in $f$, but not when.

Definition A.2 (Short-Time Fourier Transform / Spectrogram). For a window function $g \in L^2(\mathbb{R})$ and signal $f \in L^2(\mathbb{R})$:

$$\text{STFT}_f(t, \xi) = \int_{-\infty}^{\infty} f(\tau)\,\overline{g(\tau - t)}\,e^{-2\pi i\xi\tau}\,d\tau$$

This localizes the analysis around time $t$ using the window $g$. The spectrogram is $|\text{STFT}_f(t,\xi)|^2$ - a time-frequency power map.
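
Code sketch - a minimal discrete STFT/spectrogram (window length and hop are illustrative):

import numpy as np

def spectrogram(x, win_len=256, hop=64):
    """Windowed FFT frames: rows are time frames, columns frequency bins."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 8000
t = np.arange(2 * fs) / fs
chirp = np.sin(2 * np.pi * (200 + 400 * t) * t)  # frequency rises over time
S = spectrogram(chirp)
print(S.shape)  # (n_frames, win_len // 2 + 1); the spectral ridge drifts upward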

The uncertainty principle applies: The time resolution $\Delta t$ and frequency resolution $\Delta\xi$ of the STFT are determined by the window $g$:

  • Narrow window gg -> good time resolution, poor frequency resolution
  • Wide window gg -> poor time resolution, good frequency resolution

The Gaussian window $g(t) = e^{-\pi t^2/\sigma^2}$ minimizes the product $\Delta t\cdot\Delta\xi$ for a given $\sigma$, making it the optimal STFT window (the resulting time-frequency representation is called the Gabor transform).

For AI - Audio Models: Whisper (Radford et al., 2022) processes audio as mel-spectrograms: it applies a 25 ms STFT window (adequate time resolution for phoneme detection) with a 10 ms hop, maps the frequency axis to the mel scale (logarithmic, matching human perception), and feeds the result as a 2D "image" to a Transformer encoder. The STFT parameters are fixed - the uncertainty principle determines the time-frequency resolution tradeoff.

STFT TIME-FREQUENCY TRADEOFF (UNCERTAINTY PRINCIPLE IN PRACTICE)
========================================================================

  Short window (e.g., 2ms):            Long window (e.g., 50ms):
  -------------------------            --------------------------
  Fine time resolution (~2ms)          Coarse time resolution (~50ms)
  Poor freq resolution (~500 Hz)       Fine freq resolution (~20 Hz)
  Good for: transients, clicks         Good for: sustained tones, pitch

  Whisper (speech): 25ms window -> resolves individual phonemes (50-100ms)
  while achieving ~40Hz frequency resolution (enough for pitch/formants)

  The Gaussian window achieves the minimum Δt·Δξ product - it is the
  optimal window under the uncertainty principle.

========================================================================

A.6 Fourier Transform and Operator Theory

From the perspective of functional analysis, the Fourier Transform is a unitary operator on the Hilbert space $L^2(\mathbb{R})$. This abstract viewpoint connects Fourier analysis to the broader theory of self-adjoint operators.

The Fourier operator $\mathcal{F}$: $\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R})$ is unitary: $\mathcal{F}^*\mathcal{F} = \mathcal{F}\mathcal{F}^* = I$.

Since $\mathcal{F}^4 = I$, the eigenvalues satisfy $\lambda^4 = 1$: $\lambda \in \{1, -1, i, -i\}$.

The spectral decomposition of $\mathcal{F}$ in terms of its eigenspaces: $L^2(\mathbb{R}) = \mathcal{E}_1 \oplus \mathcal{E}_{-1} \oplus \mathcal{E}_i \oplus \mathcal{E}_{-i}$, where each $\mathcal{E}_\lambda$ is spanned by the Hermite functions with the corresponding eigenvalue.

Connection to quantum mechanics: The position operator $\hat{X}f(t) = tf(t)$ and momentum operator $\hat{P}f(t) = \frac{1}{2\pi i}\frac{df}{dt}$ (in natural units) are both self-adjoint operators on $L^2(\mathbb{R})$, related by $\hat{P} = \mathcal{F}\hat{X}\mathcal{F}^{-1}$. The commutator $[\hat{X},\hat{P}] = \frac{i}{2\pi}I$ implies the uncertainty principle $\Delta X\cdot\Delta P \geq \frac{1}{4\pi}$. The mathematical identity and the physics are the same theorem.

For AI: The language of operator theory is increasingly used to analyze neural networks. The weight matrix $W$ of a linear layer is a linear operator; its spectral norm $\lVert W\rVert_2$ is its "size" as an operator; its singular value decomposition expresses it in terms of rank-1 operators. The FT connects linear operator theory to signal processing, unifying the analysis of both classical filters and neural network layers.

