Math for LLMs: Notes

Fourier Transform

Fourier Analysis and Signal Processing / Fourier Transform

"The Fourier transform is a mathematical prism: it refracts a signal into its constituent frequencies, revealing structure invisible in time."

  • paraphrasing Norbert Wiener

Overview

The Fourier Transform extends Fourier series to non-periodic signals by taking the period to infinity. Where a Fourier series decomposes a periodic function into a discrete sum of harmonics, the Fourier Transform decomposes an aperiodic function into a continuous integral over all frequencies. The result is the spectrum of a signal - a complete description of which frequencies are present, at what amplitude, and with what phase.

This extension is not merely a generalization for mathematical completeness. It is the foundation of modern signal processing, quantum mechanics, partial differential equations, and - crucially for this curriculum - large language models, neural operators, and kernel methods. The Fourier Transform converts convolution (an expensive $O(N^2)$ operation) into pointwise multiplication, converts differentiation into algebraic multiplication by frequency, and provides the mathematical basis for the uncertainty principle that constrains every time-frequency representation from spectrograms to attention windows.

We develop the theory in full: the $L^1$ definition and its limitations, the $L^2$ extension via Plancherel's theorem, the complete set of operational properties, the inversion theorem, the Heisenberg uncertainty principle with proof, the distribution-theoretic extension to handle Dirac deltas and periodic signals, and a thorough treatment of AI applications including FNet, random Fourier features, spectral normalization, and the Fourier Neural Operator.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive derivations: FT of standard functions, properties, uncertainty principle, Plancherel, FNet, random Fourier features |
| exercises.ipynb | 10 graded problems, from computing transforms to proving inversion to implementing kernel approximation |

Learning Objectives

After completing this section, you will:

  1. Derive the Fourier Transform formula as the $T \to \infty$ limit of the Fourier series
  2. Compute the Fourier Transform of the Gaussian, rectangular pulse, Lorentzian, and triangle function from the definition
  3. Apply all ten standard properties (linearity, shift, scaling, differentiation, conjugation, duality, convolution) to evaluate transforms without integration
  4. State and prove Plancherel's theorem and use it to evaluate $L^2$ norms in the frequency domain
  5. State and prove the Heisenberg uncertainty principle and identify the Gaussian as the unique extremal function
  6. Define the Schwartz space and tempered distributions; compute the Fourier Transform of $\delta(t)$, $1$, $\cos(2\pi\xi_0 t)$, and the Dirac comb
  7. Explain the Poisson summation formula and its role in sampling theory
  8. Describe how FNet (Lee-Thorp et al., 2022) replaces self-attention with Fourier transforms and analyze its computational tradeoffs
  9. Implement random Fourier features (Rahimi & Recht, 2007) and verify the kernel approximation guarantee
  10. Explain spectral normalization (Miyato et al., 2018) and how it uses the FT to enforce Lipschitz constraints on discriminators
  11. Distinguish the three normalization conventions ($\omega$, $\xi$, $f$) and convert between them without error
  12. Explain how the uncertainty principle constrains RoPE context length extension and attention window design

Table of Contents


1. Intuition

1.1 From Fourier Series to Fourier Transform

Recall from Section 20-01 Fourier Series that a $2L$-periodic function has a Fourier series in complex exponential form:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\, e^{in\pi x/L}, \qquad c_n = \frac{1}{2L}\int_{-L}^{L} f(x)\,e^{-in\pi x/L}\,dx$$

The frequencies present are the discrete set $\{\ldots, -2/2L, -1/2L, 0, 1/2L, 2/2L, \ldots\}$, separated by spacing $\Delta\xi = 1/(2L)$.

Now ask: what happens as $L \to \infty$? The function is no longer required to repeat - it becomes an arbitrary function on all of $\mathbb{R}$. Three things change simultaneously:

  1. The frequency spacing $\Delta\xi = 1/(2L) \to 0$: the discrete spectrum becomes a continuum.
  2. The discrete sum $\sum_n$ becomes an integral $\int_{-\infty}^{\infty} d\xi$.
  3. The coefficient $c_n$ (which scales as $1/(2L)$) becomes a density $\hat{f}(\xi)\,d\xi$.

Concretely, substitute $\xi_n = n/(2L)$ and let $L \to \infty$:

$$f(x) = \sum_{n=-\infty}^{\infty} \underbrace{\left(\int_{-L}^{L} f(t)\,e^{-2\pi i \xi_n t}\,dt\right)}_{\approx \hat{f}(\xi_n)} e^{2\pi i \xi_n x} \underbrace{\frac{1}{2L}}_{\Delta\xi} \;\xrightarrow{L\to\infty}\; \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi x}\,d\xi$$

where the Fourier Transform (using the $\xi$-convention) is:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\,dt$$

This derivation makes the Fourier Transform inevitable: it is not a definition pulled from thin air, but the natural limit of the Fourier series as periodicity is removed. The continuous spectrum $\hat{f}(\xi)$ is the spectral density of $f$ - the amplitude and phase contributed by frequency $\xi$ per unit frequency interval.

Non-example: The derivation above requires that the "coefficients" $\hat{f}(\xi)$ are well-defined as $L \to \infty$. For this, we need $\int_{-\infty}^{\infty} |f(t)|\,dt < \infty$, i.e., $f \in L^1(\mathbb{R})$. A constant function $f(t) = 1$ is NOT in $L^1(\mathbb{R})$, so it does not have a classical Fourier Transform - we must extend the theory to distributions (Section 7) to handle it.
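As a sanity check on this limit, the following sketch (my own illustration, not part of the companion notebooks) computes the rescaled Fourier coefficient $2L\,c_n$ of the Gaussian $e^{-\pi t^2}$ at the harmonic nearest a fixed target frequency and compares it to the known transform $e^{-\pi\xi^2}$; the grids and target frequency are arbitrary choices.

import numpy as np

def rescaled_coeff(f, n, L, num=200_001):
    """Approximate 2L * c_n for the 2L-periodic extension of f by a Riemann sum."""
    t = np.linspace(-L, L, num)
    xi_n = n / (2 * L)                            # n-th harmonic frequency
    vals = f(t) * np.exp(-2j * np.pi * xi_n * t)
    return np.sum(vals) * (t[1] - t[0])

gauss = lambda t: np.exp(-np.pi * t**2)           # self-dual: exact FT is e^{-pi xi^2}

xi_target = 0.3
for L in (2.0, 8.0, 32.0):
    n = int(round(xi_target * 2 * L))             # harmonic closest to xi_target
    approx = rescaled_coeff(gauss, n, L).real
    print(f"L={L:5.1f}  2L*c_n={approx:.6f}  exact={np.exp(-np.pi * xi_target**2):.6f}")

As $L$ grows, the harmonic grid $\{n/(2L)\}$ refines and $2L\,c_n$ converges to $\hat{f}(\xi)$ - the limit taken in the derivation above.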

1.2 Frequency as a Continuous Variable

In a Fourier series, the spectrum is a discrete set of spikes: the function has energy only at the harmonics $n/T$. In the Fourier Transform, the spectrum is a continuous function $\hat{f}(\xi)$, and the function can have energy spread continuously across all frequencies.

The magnitude spectrum $|\hat{f}(\xi)|$ tells you how much of each frequency is present. The phase spectrum $\angle\hat{f}(\xi)$ tells you the phase offset of each frequency component. Together they completely determine $f$ (via the inversion theorem). Some examples of what the spectrum looks like:

  • A pure sinusoid $f(t) = \cos(2\pi\xi_0 t)$: spectrum is two spikes at $\pm\xi_0$ (as Dirac deltas).
  • A rectangular pulse (nonzero on $[-T/2, T/2]$): spectrum is a sinc function, spreading out in frequency.
  • A Gaussian $f(t) = e^{-\pi t^2}$: spectrum is another Gaussian (this is the self-dual case).
  • A chirp (frequency increasing linearly with time): spectrum is spread over a range of frequencies, with the time-frequency tradeoff governed by the uncertainty principle.
  • White noise: flat spectrum - equal energy at every frequency.

The key principle is the time-frequency duality: a signal that is concentrated in time (short duration) must have a broad spectrum, and a signal with a narrow spectrum (nearly monochromatic) must extend over a long time. This is not a limitation of our measurement apparatus - it is a mathematical theorem (the Heisenberg uncertainty principle, Section 6).

For AI: This duality directly constrains attention mechanisms. RoPE uses frequencies to encode position, and extending context length (a longer time window) requires lower frequencies - which is why YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) rescale the frequency base to accommodate longer sequences. The uncertainty principle is the fundamental reason why this rescaling is necessary.

1.3 Why It Matters for AI (2026 Perspective)

The Fourier Transform is not peripheral to modern AI - it is structurally present in several of the most important systems:

FNet (Lee-Thorp et al., 2022): Replaces the self-attention sublayer in a Transformer with a 2D Fourier transform over the sequence and embedding dimensions. Achieves 92% of BERT's accuracy on GLUE benchmarks at 7× the training speed, because the FFT is $O(N \log N)$ while attention is $O(N^2)$. The mathematical insight: the FT acts as a "global mixer" that combines all token representations - a weaker but cheaper version of attention.

Random Fourier Features (Rahimi & Recht, 2007, NeurIPS Best Paper): Every shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ is the Fourier Transform of a non-negative measure (Bochner's theorem). By sampling random frequencies from that measure, one gets a randomized feature map $\phi(\mathbf{x}) \in \mathbb{R}^D$ such that $k(\mathbf{x},\mathbf{y}) \approx \phi(\mathbf{x})^\top\phi(\mathbf{y})$. This reduces kernel machine computation from $O(N^2)$ to $O(ND)$. A minimal sketch appears below.
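A minimal sketch of the random Fourier feature map for the Gaussian (RBF) kernel, whose spectral measure is itself Gaussian; the feature dimension D, bandwidth sigma, and seeds here are illustrative choices of mine, not values from the paper.

import numpy as np

def make_rff(d, D=2048, sigma=1.0, seed=0):
    """Feature map phi approximating k(x,y) = exp(-||x-y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies ~ spectral measure
    b = rng.uniform(0, 2 * np.pi, size=D)            # random phases
    return lambda X: np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
phi = make_rff(d=3)

K_approx = phi(X) @ phi(X).T
sq_dists = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K_exact = np.exp(-sq_dists / 2.0)                    # sigma = 1
print(np.abs(K_approx - K_exact).max())              # small; shrinks like 1/sqrt(D)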

Spectral Normalization (Miyato et al., 2018): To train stable GANs, the discriminator's weight matrices are normalized by their largest singular value (the spectral norm). This enforces a Lipschitz constraint on the discriminator, stabilizing training. The "spectral" in the name refers to the spectrum of the weight matrix viewed as a linear operator; for convolutional layers that spectrum can be computed via the FT of the kernel, since convolution is diagonalized in the Fourier basis.

Fourier Neural Operator (Li et al., 2021): Learns mappings between function spaces by applying a learnable linear transformation in Fourier space, then transforming back. Used for solving PDEs up to 1000× faster than traditional numerical methods. The key is that convolution in space = pointwise multiplication in Fourier space, so global dependencies can be captured efficiently.

RoPE (Su et al., 2021): Used in LLaMA-3, Mistral, Qwen, and virtually every modern LLM. Interprets each position as a rotation in the complex plane, using Fourier basis vectors $e^{im\theta_j}$ at different frequencies $\theta_j$ for different embedding dimensions. The relative-position property follows from the group structure of the complex exponential.

1.4 Historical Timeline

FOURIER TRANSFORM - HISTORICAL MILESTONES
========================================================================

  1822  Fourier's "Theorie analytique de la chaleur": introduces the
        integral formula that becomes the Fourier Transform
  1880s Rayleigh uses Fourier integrals in optics and acoustics
  1898  Heaviside operational calculus (early form of the Laplace/FT)
  1915  Plancherel proves the L2 isometry theorem (Plancherel's theorem)
  1927  Heisenberg formulates the uncertainty principle in quantum
        mechanics; Kennard gives the precise mathematical inequality
  1933  Norbert Wiener's "The Fourier Integral and Certain of its
        Applications" - rigorous L2 theory; lays groundwork for
        signal processing as a mathematical discipline
  1949  Shannon's sampling theorem (uses Poisson summation formula)
  1965  Cooley & Tukey publish the FFT - makes FT computable in O(NlogN)
  1950s Schwartz distributions: FT extended to δ, constants, sinusoids
  1975  FT enters digital signal processing (audio, radar, MRI)
  2007  Rahimi & Recht: Random Fourier Features (NeurIPS Best Paper)
  2017  Vaswani et al.: sinusoidal positional encodings in Transformers
  2018  Miyato et al.: spectral normalization for GANs
  2021  Li et al.: Fourier Neural Operator for PDE solving
  2021  Su et al.: RoPE - now in LLaMA-3, Mistral, Qwen, GPT-NeoX
  2022  Lee-Thorp et al.: FNet - Fourier-based alternative to attention
  2023  YaRN (Peng et al.): Fourier-based context length extension
  2024  LongRoPE (Ding et al.): progressive frequency rescaling to 2M
        context length; RWKV-7 uses state-space FT interpretation

========================================================================

1.5 What the Fourier Transform Does Geometrically

The most useful geometric picture is that the Fourier Transform decomposes $f$ in an orthonormal basis of complex exponentials - just as the Fourier series decomposed a periodic function in the trigonometric basis. The difference is that now the "basis" is uncountably infinite and the "coefficients" form a continuous function rather than a sequence.

Formally, the complex exponentials $\{e^{2\pi i \xi t}\}_{\xi \in \mathbb{R}}$ do not form a basis of $L^2(\mathbb{R})$ in the usual Hilbert space sense (they are not in $L^2(\mathbb{R})$ themselves - they have constant modulus 1 and are not square-integrable). The rigorous statement is that the Fourier Transform is a unitary operator on $L^2(\mathbb{R})$ (Plancherel's theorem), meaning it preserves inner products and norms:

$$\langle f, g \rangle_{L^2} = \langle \hat{f}, \hat{g} \rangle_{L^2}$$

Think of the Fourier Transform as a rotation in an infinite-dimensional space: it rotates the function $f$ into a new coordinate system (the frequency domain) where the same function looks different but has exactly the same "length" (energy). Just as rotating a vector in $\mathbb{R}^n$ preserves its norm but changes its coordinates, the Fourier Transform preserves energy but expresses it in frequency coordinates.

A second geometric picture: the Fourier Transform of the convolution $f * g$ is the pointwise product $\hat{f} \cdot \hat{g}$. Convolution in time/space corresponds to multiplication in frequency. This is the key theorem for signal processing and the foundation of efficient convolution in CNNs via the FFT. (Full treatment in Section 20-04 Convolution Theorem.)

GEOMETRIC PICTURE OF THE FOURIER TRANSFORM
========================================================================

  Time domain                    Frequency domain
  -------------                  ----------------
  f(t): a function of time       f̂(ξ): a function of frequency

  Signal duration T    <-->  Bandwidth ~ 1/T  (uncertainty principle)

  Convolution f*g      <-->  Pointwise product f̂·ĝ  (Convolution Theorem)

  Differentiation d/dt <-->  Multiplication by 2πiξ

  Translation f(t-a)   <-->  Modulation e^{-2πiξa} f̂(ξ)

  Scaling f(at)        <-->  Dilation (1/|a|) f̂(ξ/a)

  The FT is a UNITARY OPERATOR on L²(R):
    ‖f̂‖₂ = ‖f‖₂       (Plancherel)
    ⟨f,g⟩ = ⟨f̂,ĝ⟩     (Parseval)

========================================================================

2. Formal Definitions

2.1 The Fourier Transform on $L^1(\mathbb{R})$

Definition 2.1 (Fourier Transform on $L^1$). For $f \in L^1(\mathbb{R})$ (i.e., $\int_{-\infty}^{\infty}|f(t)|\,dt < \infty$), the Fourier Transform of $f$ is the function $\hat{f}: \mathbb{R} \to \mathbb{C}$ defined by:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\,dt$$

We write $\hat{f} = \mathcal{F}[f]$ or $f \mapsto \hat{f}$.

Well-definedness. For $f \in L^1(\mathbb{R})$ and any $\xi \in \mathbb{R}$:

$$|\hat{f}(\xi)| = \left|\int_{-\infty}^{\infty} f(t)\,e^{-2\pi i\xi t}\,dt\right| \leq \int_{-\infty}^{\infty}|f(t)|\,|e^{-2\pi i\xi t}|\,dt = \int_{-\infty}^{\infty}|f(t)|\,dt = \lVert f \rVert_1$$

So $\hat{f}$ is bounded: $\lVert\hat{f}\rVert_\infty \leq \lVert f \rVert_1$. Moreover, one can show (by dominated convergence) that $\hat{f}$ is continuous on $\mathbb{R}$, and by the Riemann-Lebesgue lemma, $\hat{f}(\xi) \to 0$ as $|\xi| \to \infty$.

Theorem 2.1 (Riemann-Lebesgue Lemma). If $f \in L^1(\mathbb{R})$, then $\hat{f} \in C_0(\mathbb{R})$ (continuous functions vanishing at infinity):

$$\lim_{|\xi|\to\infty} \hat{f}(\xi) = 0$$

Proof sketch. For step functions, the integral reduces to a finite sum of terms $\sim e^{-2\pi i\xi a}/\xi$, each going to 0. Approximate general $L^1$ functions by step functions and use the boundedness $\lVert\hat{f}\rVert_\infty \leq \lVert f \rVert_1$. $\square$

What $\hat{f}(\xi)$ tells you. The value $\hat{f}(\xi)$ is a complex number encoding:

  • $|\hat{f}(\xi)|$: the amplitude of frequency $\xi$ in $f$
  • $\arg\hat{f}(\xi)$: the phase of frequency $\xi$ in $f$

The power spectrum $|\hat{f}(\xi)|^2$ gives the energy density per unit frequency at $\xi$.

2.2 Convention War: $\omega$ vs $\xi$ vs $f$

The Fourier Transform appears in three notational conventions in the literature, all equivalent but related by rescaling:

| Convention | Transform formula | Inverse formula | Used in |
| --- | --- | --- | --- |
| Frequency $\xi$ (Hz) | $\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt$ | $f(t) = \int \hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$ | Signal processing, probability |
| Angular freq $\omega$ (rad/s), no $\frac{1}{2\pi}$ | $\hat{f}(\omega) = \int f(t)e^{-i\omega t}\,dt$ | $f(t) = \frac{1}{2\pi}\int \hat{f}(\omega)e^{i\omega t}\,d\omega$ | Physics, engineering |
| Angular freq $\omega$, symmetric | $\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}}\int f(t)e^{-i\omega t}\,dt$ | $f(t) = \frac{1}{\sqrt{2\pi}}\int \hat{f}(\omega)e^{i\omega t}\,d\omega$ | Pure mathematics |

This section uses the $\xi$-convention (row 1). It is self-symmetric (no $2\pi$ factors), maps the Gaussian to itself, and is standard in modern ML papers (FNet, FNO, RFF all use it). The relationship between conventions: $\hat{f}_\omega(\omega) = \hat{f}_\xi(\omega/2\pi)$.

Warning: The most common source of errors in Fourier analysis is mixing conventions. When reading a paper, identify the convention in the first equation before proceeding. Properties like "differentiation multiplies by $i\omega$" vs "multiplies by $2\pi i\xi$" depend entirely on which convention is in use. The sketch below makes the conversion concrete.
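A small numerical check of the conversion rule (my own sketch; the quadrature grid is an arbitrary choice), using the self-dual Gaussian $f(t) = e^{-\pi t^2}$, for which both closed forms are known: $\hat{f}_\xi(\xi) = e^{-\pi\xi^2}$ and $\hat{f}_\omega(\omega) = e^{-\omega^2/(4\pi)}$.

import numpy as np

t = np.linspace(-10, 10, 100_001)
dt = t[1] - t[0]
f = np.exp(-np.pi * t**2)

def ft_xi(xi):        # xi-convention: integral of f(t) e^{-2 pi i xi t} dt
    return np.sum(f * np.exp(-2j * np.pi * xi * t)) * dt

def ft_omega(omega):  # omega-convention, no 1/(2 pi) on the forward transform
    return np.sum(f * np.exp(-1j * omega * t)) * dt

omega = 3.0
print(ft_omega(omega).real)              # ≈ e^{-omega^2/(4 pi)}
print(ft_xi(omega / (2 * np.pi)).real)   # same value: f̂_ω(ω) = f̂_ξ(ω/2π)
print(np.exp(-omega**2 / (4 * np.pi)))   # closed form for comparison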

2.3 Standard Examples

These four transforms appear constantly in applications and should be memorized.

Example 1: Rectangular Pulse (rect function)

$$\operatorname{rect}(t) = \begin{cases}1 & |t| \leq 1/2 \\ 0 & |t| > 1/2\end{cases}$$

$$\widehat{\operatorname{rect}}(\xi) = \int_{-1/2}^{1/2} e^{-2\pi i\xi t}\,dt = \frac{e^{\pi i\xi} - e^{-\pi i\xi}}{2\pi i\xi} = \frac{\sin(\pi\xi)}{\pi\xi} =: \operatorname{sinc}(\xi)$$

The sinc function oscillates and decays as $1/|\xi|$. The slow decay reflects the sharp discontinuity of $\operatorname{rect}$ - sharp features in time produce slow decay in frequency (this is the spectral analog of the Gibbs phenomenon from Section 20-01).

A general pulse of width $T$: $f(t) = \operatorname{rect}(t/T)$ has transform $\hat{f}(\xi) = T\,\operatorname{sinc}(T\xi)$. Wider pulse -> narrower, taller sinc lobe.

Example 2: Gaussian

$$f(t) = e^{-\pi t^2} \qquad\Longrightarrow\qquad \hat{f}(\xi) = \int_{-\infty}^{\infty} e^{-\pi t^2} e^{-2\pi i\xi t}\,dt = e^{-\pi\xi^2}$$

The Gaussian is self-dual under the Fourier Transform with the $\xi$-convention: $\hat{f} = f$. This is unique among "nice" functions and makes the Gaussian the natural test function in Fourier analysis, the ground state in quantum mechanics, and the kernel of the heat equation.

Derivation: Complete the square in the exponent: $-\pi t^2 - 2\pi i\xi t = -\pi(t + i\xi)^2 - \pi\xi^2$. Then:

$$\hat{f}(\xi) = e^{-\pi\xi^2}\int_{-\infty}^{\infty}e^{-\pi(t+i\xi)^2}\,dt = e^{-\pi\xi^2} \cdot 1$$

where the integral equals 1 by contour integration (the Gaussian integrand is analytic, and the contour shift $t \to t + i\xi$ is justified because the integrand decays rapidly).

Example 3: Lorentzian (Cauchy distribution)

$$f(t) = \frac{1}{1 + (2\pi t)^2} \qquad\Longrightarrow\qquad \hat{f}(\xi) = \frac{1}{2}e^{-|\xi|}$$

The exponential decay in frequency reflects the analyticity of the Lorentzian: it extends holomorphically to a strip around the real axis, though it is not compactly supported. The relationship between smoothness and spectral decay is captured by the general theorem in Section 3.5.

Example 4: One-Sided Exponential

$$f(t) = e^{-at}\,\mathbb{1}_{t \geq 0}, \quad a > 0 \qquad\Longrightarrow\qquad \hat{f}(\xi) = \int_0^\infty e^{-at}e^{-2\pi i\xi t}\,dt = \frac{1}{a + 2\pi i\xi}$$

This has modulus $|\hat{f}(\xi)| = 1/\sqrt{a^2 + 4\pi^2\xi^2}$, which decays as $1/|\xi|$ for large $|\xi|$ - reflecting the discontinuity at $t = 0$.

Non-example 1: $f(t) = 1/(1+t^2)^{1/4}$. This is in $L^2(\mathbb{R})$ but not $L^1(\mathbb{R})$ (it decays too slowly), so the integral $\int f(t)e^{-2\pi i\xi t}\,dt$ does not converge absolutely and the definition above does not apply. The $L^2$ theory (Section 5) handles this case.

Non-example 2: $f(t) = \sin(2\pi\xi_0 t)$. This is bounded but not in $L^1(\mathbb{R})$ or $L^2(\mathbb{R})$ - it does not decay. Its Fourier Transform exists only as a distribution: $\mathcal{F}[\sin(2\pi\xi_0 t)] = \frac{i}{2}[\delta(\xi + \xi_0) - \delta(\xi - \xi_0)]$ (see Section 7). (Examples 1, 2, and 4 are verified numerically below.)
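A quick quadrature check of Examples 1, 2, and 4 (a sketch of my own; the grid, test frequency, and the choice $a = 2$ are arbitrary):

import numpy as np

t = np.linspace(-50, 50, 2_000_001)
dt = t[1] - t[0]

def ft(f_vals, xi):
    """xi-convention FT at a single frequency, by Riemann sum."""
    return np.sum(f_vals * np.exp(-2j * np.pi * xi * t)) * dt

xi = 0.7
rect = (np.abs(t) <= 0.5).astype(float)
gauss = np.exp(-np.pi * t**2)
one_sided = np.where(t >= 0, np.exp(-2.0 * t), 0.0)    # a = 2

print(ft(rect, xi).real, np.sinc(xi))                  # np.sinc(x) = sin(pi x)/(pi x)
print(ft(gauss, xi).real, np.exp(-np.pi * xi**2))      # self-dual Gaussian
print(ft(one_sided, xi), 1 / (2.0 + 2j * np.pi * xi))  # 1/(a + 2 pi i xi)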

2.4 The Inverse Fourier Transform

Definition 2.2 (Inverse Fourier Transform). For $g \in L^1(\mathbb{R})$:

$$\mathcal{F}^{-1}[g](t) = \int_{-\infty}^{\infty} g(\xi)\,e^{2\pi i\xi t}\,d\xi$$

Note: the inverse is the same as the forward transform except that the sign in the exponent is flipped ($-2\pi i\xi t \to +2\pi i\xi t$). This symmetric form is one of the advantages of the $\xi$-convention.

The inversion problem: Given $\hat{f}$, can we recover $f$? Yes, under mild conditions:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,d\xi$$

but this requires knowing that $\hat{f} \in L^1(\mathbb{R})$ (which is not guaranteed by $f \in L^1(\mathbb{R})$ alone - for instance, $\widehat{\operatorname{rect}}(\xi) = \operatorname{sinc}(\xi)$ is not in $L^1$). The full inversion theorem is treated rigorously in Section 4.

2.5 Non-Examples and the $L^2$ Extension

The classical L1L^1 definition fails for many important functions:

| Function | Why the $L^1$ FT fails | Solution |
| --- | --- | --- |
| $f(t) = e^{-t^2/2}$ (broader Gaussian) | In $L^1$ - this one is fine | Not a failure |
| $f(t) = (1+t^2)^{-1/4}$ | Not in $L^1$ | Use $L^2$ Plancherel (Section 5) |
| $f(t) = 1$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $f(t) = \sin(2\pi\xi_0 t)$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $f(t) = \delta(t)$ | Not a function | Use distributions (Section 7) |

The $L^2$ extension proceeds via a density argument. The subspace $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ is dense in $L^2(\mathbb{R})$. For $f \in L^1 \cap L^2$, the classical integral defines $\hat{f}$. The Plancherel theorem (Section 5) then shows $\lVert\hat{f}\rVert_2 = \lVert f \rVert_2$, which allows extending $\mathcal{F}$ to all of $L^2$ by continuity. The resulting transform is a unitary operator on $L^2(\mathbb{R})$.

3. Core Properties

3.1 Linearity and Conjugate Symmetry

Theorem 3.1 (Linearity). For $f, g \in L^1(\mathbb{R})$ and $\alpha, \beta \in \mathbb{C}$:

$$\mathcal{F}[\alpha f + \beta g](\xi) = \alpha\hat{f}(\xi) + \beta\hat{g}(\xi)$$

Proof: Immediate from linearity of the integral. $\square$

Theorem 3.2 (Conjugate Symmetry / Hermitian Property). For real-valued $f \in L^1(\mathbb{R})$:

$$\hat{f}(-\xi) = \overline{\hat{f}(\xi)}$$

Proof: $\hat{f}(-\xi) = \int f(t)e^{2\pi i\xi t}\,dt = \overline{\int f(t)e^{-2\pi i\xi t}\,dt} = \overline{\hat{f}(\xi)}$, where the last step uses $f(t) = \overline{f(t)}$ (real-valued). $\square$

Corollaries for real ff:

  • $|\hat{f}(-\xi)| = |\hat{f}(\xi)|$: the magnitude spectrum is even
  • $\arg\hat{f}(-\xi) = -\arg\hat{f}(\xi)$: the phase spectrum is odd
  • If $f$ is also even, then $\hat{f}$ is real-valued and even
  • If $f$ is also odd, then $\hat{f}$ is purely imaginary and odd

For AI: The Hermitian symmetry means that for real signals, half the spectrum is redundant - you only need frequencies $\xi \geq 0$. This is why the FFT of a real signal produces $N/2 + 1$ unique complex outputs from $N$ real inputs (see the sketch below). Real-valued weight matrices in neural networks have Hermitian-symmetric spectra, which is exploited in efficient spectral computation.
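A two-line illustration of this redundancy using NumPy's real-input FFT (my own sketch):

import numpy as np

x = np.random.default_rng(0).normal(size=8)        # N = 8 real samples
X = np.fft.fft(x)
print(np.allclose(X[1:4], np.conj(X[-1:-4:-1])))   # Hermitian symmetry: X[N-k] = conj(X[k])
print(np.fft.rfft(x).shape)                        # (5,) = N/2 + 1 unique outputs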

3.2 Time-Shift and Frequency-Shift (Modulation)

Theorem 3.3 (Time-Shift / Translation). For $a \in \mathbb{R}$:

$$\mathcal{F}[f(t - a)](\xi) = e^{-2\pi i\xi a}\,\hat{f}(\xi)$$

Proof: $\int f(t-a)e^{-2\pi i\xi t}\,dt \stackrel{u=t-a}{=} \int f(u)e^{-2\pi i\xi(u+a)}\,du = e^{-2\pi i\xi a}\hat{f}(\xi)$. $\square$

Interpretation: Shifting a signal in time multiplies its spectrum by a complex phase factor. The magnitude spectrum $|\hat{f}(\xi)|$ is unchanged - a time shift does not affect which frequencies are present, only their phases. The phase changes linearly with frequency: $\phi \mapsto \phi - 2\pi\xi a$.

Theorem 3.4 (Frequency-Shift / Modulation). For $\xi_0 \in \mathbb{R}$:

$$\mathcal{F}[e^{2\pi i\xi_0 t}f(t)](\xi) = \hat{f}(\xi - \xi_0)$$

Proof: $\int e^{2\pi i\xi_0 t}f(t)e^{-2\pi i\xi t}\,dt = \int f(t)e^{-2\pi i(\xi - \xi_0)t}\,dt = \hat{f}(\xi - \xi_0)$. $\square$

Interpretation: Multiplying by a complex exponential (modulation) shifts the spectrum. This is the mathematical basis for amplitude modulation (AM) radio and for the RoPE positional encoding.

For AI - RoPE connection: In RoPE, the query at position $m$ is $\mathbf{q}_m = e^{im\theta}\mathbf{q}$ (rotation by $m\theta$ in the complex plane). The inner product between the query at $m$ and the key at $n$ is $\langle e^{im\theta}\mathbf{q},\, e^{in\theta}\mathbf{k}\rangle = e^{i(m-n)\theta}\langle\mathbf{q},\mathbf{k}\rangle$, depending only on the relative position $m - n$. This is the modulation theorem in action - verified numerically below.
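A minimal check of the relative-position property with complex-valued queries and keys (my own sketch; real RoPE implementations rotate pairs of real dimensions, which is equivalent to the complex form used here):

import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # complex dimensions
q = rng.normal(size=d) + 1j * rng.normal(size=d)
k = rng.normal(size=d) + 1j * rng.normal(size=d)
theta = 10000.0 ** (-np.arange(d) / d)         # per-dimension frequencies, RoPE-style

def rotate(x, pos):
    return x * np.exp(1j * pos * theta)        # modulation: position becomes phase

def score(m, n):
    return np.vdot(rotate(k, n), rotate(q, m)) # <q_m, k_n>, conjugating the key

print(score(5, 2))     # depends only on m - n = 3 ...
print(score(13, 10))   # ... identical value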

3.3 Scaling and Dilation

Theorem 3.5 (Scaling / Dilation). For $a \in \mathbb{R}$, $a \neq 0$:

$$\mathcal{F}[f(at)](\xi) = \frac{1}{|a|}\hat{f}\!\left(\frac{\xi}{a}\right)$$

Proof: $\int f(at)e^{-2\pi i\xi t}\,dt \stackrel{u=at}{=} \frac{1}{|a|}\int f(u)e^{-2\pi i(\xi/a)u}\,du = \frac{1}{|a|}\hat{f}(\xi/a)$. $\square$

Interpretation: This is the time-bandwidth product in action:

  • Compressing a signal in time ($a > 1$, so $f(at)$ is narrower): the spectrum stretches by a factor of $a$ and shrinks in amplitude by $1/a$
  • Stretching a signal in time ($0 < a < 1$): the spectrum compresses

This confirms the time-frequency duality: doubling the duration halves the bandwidth, and vice versa. This is a hard mathematical constraint, not an engineering limitation.

3.4 Time Reversal

Theorem 3.6 (Time Reversal). For $f \in L^1(\mathbb{R})$:

$$\mathcal{F}[f(-t)](\xi) = \hat{f}(-\xi) = \overline{\hat{f}(\xi)} \ \text{(for real } f\text{)}$$

Proof: $\int f(-t)e^{-2\pi i\xi t}\,dt \stackrel{u=-t}{=} \int f(u)e^{2\pi i\xi u}\,du = \hat{f}(-\xi)$. $\square$

Duality: There is a deeper symmetry: $\mathcal{F}[\hat{f}](t) = f(-t)$. Applying the Fourier Transform four times returns the original function: $\mathcal{F}^4 = I$. This means the FT has eigenvalues $\{1, -i, -1, i\}$, and the Hermite functions are its eigenfunctions (with the Gaussian as the eigenfunction for eigenvalue 1).

3.5 Differentiation and Integration

Theorem 3.7 (Differentiation in Time). If $f, f' \in L^1(\mathbb{R})$:

$$\mathcal{F}[f'(t)](\xi) = 2\pi i\xi\,\hat{f}(\xi)$$

More generally, $\mathcal{F}[f^{(n)}(t)](\xi) = (2\pi i\xi)^n\hat{f}(\xi)$.

Proof: Integration by parts: $\int f'(t)e^{-2\pi i\xi t}\,dt = [f(t)e^{-2\pi i\xi t}]_{-\infty}^{\infty} + 2\pi i\xi\int f(t)e^{-2\pi i\xi t}\,dt$. The boundary term vanishes: since $f' \in L^1$, $f(t)$ has limits as $t \to \pm\infty$, and $f \in L^1$ forces those limits to be 0. $\square$

Theorem 3.8 (Differentiation in Frequency). If $tf(t) \in L^1(\mathbb{R})$:

$$\frac{d}{d\xi}\hat{f}(\xi) = -2\pi i\,\widehat{tf(t)}(\xi)$$

Equivalently: $\mathcal{F}[(-2\pi it)^n f(t)](\xi) = \frac{d^n}{d\xi^n}\hat{f}(\xi)$.

Key consequence for smoothness vs. spectral decay:

| Smoothness of $f$ | Decay of $\hat{f}$ |
| --- | --- |
| $f \in C^k$ and $f^{(k)} \in L^1$ | $\hat{f}(\xi) = o(|\xi|^{-k})$ |
| $f \in C^\infty$ (smooth) | $\hat{f}$ decays faster than any polynomial |
| $f$ has a jump discontinuity | $\hat{f}$ decays like $1/|\xi|$ |
| $f$ is analytic (in a strip) | $\hat{f}$ decays exponentially |

For AI: The differentiation property is why Fourier methods are efficient for solving differential equations. It converts the PDE $\partial_t u = a\,\partial_{xx}u$ into the ODE $\partial_t\hat{u} = -4\pi^2\xi^2 a\,\hat{u}$, which is solved by $\hat{u}(\xi, t) = \hat{u}(\xi, 0)\,e^{-4\pi^2 a\xi^2 t}$ - a simple exponential decay. The Fourier Neural Operator (Section 8.4) exploits this by learning the spectral solution operator directly. A spectral heat-equation solver along these lines is sketched below.
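A minimal spectral solver for the periodic heat equation $\partial_t u = a\,\partial_{xx}u$ on $[0, 1)$, applying exactly the per-mode exponential decay above (my own sketch; the grid size, diffusivity, and initial condition are arbitrary):

import numpy as np

N, a, t_final = 256, 0.01, 0.5
x = np.arange(N) / N                          # periodic grid on [0, 1)
u0 = np.exp(-100 * (x - 0.5)**2)              # initial bump

xi = np.fft.fftfreq(N, d=1.0 / N)             # integer frequencies for a period-1 domain
u_hat = np.fft.fft(u0)
u_hat *= np.exp(-4 * np.pi**2 * a * xi**2 * t_final)   # exact decay of each mode
u = np.fft.ifft(u_hat).real

print(u0.max(), u.max())                      # the bump has diffused and flattened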

3.6 The Master Properties Table

| Property | $f(t)$ | $\hat{f}(\xi)$ |
| --- | --- | --- |
| Linearity | $\alpha f(t) + \beta g(t)$ | $\alpha\hat{f}(\xi) + \beta\hat{g}(\xi)$ |
| Time shift | $f(t - a)$ | $e^{-2\pi i\xi a}\hat{f}(\xi)$ |
| Frequency shift | $e^{2\pi i\xi_0 t}f(t)$ | $\hat{f}(\xi - \xi_0)$ |
| Scaling | $f(at)$ | $\frac{1}{\lvert a\rvert}\hat{f}(\xi/a)$ |
| Time reversal | $f(-t)$ | $\hat{f}(-\xi)$ |
| Conjugation | $\overline{f(t)}$ | $\overline{\hat{f}(-\xi)}$ |
| Hermitian (real $f$) | $f(t) \in \mathbb{R}$ | $\hat{f}(-\xi) = \overline{\hat{f}(\xi)}$ |
| Differentiation | $f^{(n)}(t)$ | $(2\pi i\xi)^n\hat{f}(\xi)$ |
| Freq. differentiation | $(-2\pi it)^n f(t)$ | $\hat{f}^{(n)}(\xi)$ |
| Convolution | $(f * g)(t)$ | $\hat{f}(\xi)\cdot\hat{g}(\xi)$ |
| Multiplication | $f(t)\cdot g(t)$ | $(\hat{f} * \hat{g})(\xi)$ |
| Duality | $\hat{f}(t)$ | $f(-\xi)$ |

3.7 Convolution-Multiplication Duality - Preview

The Convolution Theorem states that the Fourier Transform converts convolution into pointwise multiplication:

$$\mathcal{F}[f * g](\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)$$

where $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau)\,d\tau$.

This is the most important property of the Fourier Transform for applications. It means that:

  • Linear filtering (convolution with a filter $h$) corresponds to pointwise multiplication of spectra
  • A filter's effect on frequency $\xi$ is simply multiplication by $\hat{h}(\xi)$ (the frequency response)
  • Convolution of length-$N$ signals costs $O(N^2)$ directly but $O(N\log N)$ via the FFT

Preview: The full treatment of the convolution theorem - including circular convolution, cross-correlation, the Wiener-Khinchin theorem, and applications to CNNs, WaveNet, and Mamba - is in Section 20-04 Convolution Theorem. Here we state the theorem and use it; the proof and all applications belong there.

4. The Fourier Inversion Theorem

4.1 Statement and Sufficient Conditions

The fundamental question: given $\hat{f}$, can we recover $f$? The answer is yes under appropriate conditions, but the precise statement requires care.

Theorem 4.1 (Fourier Inversion Theorem). Suppose $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then for almost every $t \in \mathbb{R}$:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,d\xi$$

Moreover, if $f$ is also continuous at $t$, equality holds exactly (not just a.e.).

The subtlety: The condition $\hat{f} \in L^1(\mathbb{R})$ is not automatic. If $f \in L^1(\mathbb{R})$, then $\hat{f}$ is bounded and continuous (by Riemann-Lebesgue), but not necessarily in $L^1$. For example, $f = \operatorname{rect}$ gives $\hat{f} = \operatorname{sinc}$, and $\operatorname{sinc} \notin L^1(\mathbb{R})$ (it decays too slowly: $\int_1^\infty \frac{|\sin(\pi\xi)|}{\pi\xi}\,d\xi = \infty$). The inversion of the rect function requires the $L^2$ theory.

Sufficient condition for inversion: If $f \in L^1(\mathbb{R})$ and $f$ has bounded variation near $t$ (e.g., $f$ is differentiable at $t$), then the principal value integral:

$$\lim_{R\to\infty}\int_{-R}^{R}\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$$

converges to $f(t)$ where $f$ is continuous, and to $\frac{1}{2}[f(t^+) + f(t^-)]$ at jump discontinuities - exactly as in Dirichlet's theorem for Fourier series.

4.2 Proof via Approximate Identity

The standard proof uses the concept of an approximate identity - a family of functions that converge to the Dirac delta as a parameter goes to zero.

Definition (Approximate Identity). A family $\{k_\epsilon\}_{\epsilon > 0}$ is an approximate identity if:

  1. $k_\epsilon \geq 0$
  2. $\int k_\epsilon(t)\,dt = 1$ for all $\epsilon > 0$
  3. For every $\delta > 0$: $\int_{|t| > \delta} k_\epsilon(t)\,dt \to 0$ as $\epsilon \to 0$

The Gauss-Weierstrass kernel $k_\epsilon(t) = \frac{1}{\epsilon}e^{-\pi t^2/\epsilon^2}$ is an approximate identity.

Proof sketch of inversion theorem:

Define the Gauss-regularized inversion:

$$f_\epsilon(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,e^{-\pi\epsilon^2\xi^2}\,d\xi$$

The factor $e^{-\pi\epsilon^2\xi^2}$ provides absolute convergence (the Gaussian is in $L^1$). Using Fubini's theorem to interchange the order of integration:

$$f_\epsilon(t) = \int_{-\infty}^{\infty} f(s)\, \underbrace{\int_{-\infty}^{\infty} e^{-\pi\epsilon^2\xi^2}e^{2\pi i\xi(t-s)}\,d\xi}_{= \frac{1}{\epsilon}e^{-\pi(t-s)^2/\epsilon^2}}\, ds = (f * k_\epsilon)(t)$$

Since $\{k_\epsilon\}$ is an approximate identity, $f_\epsilon(t) \to f(t)$ as $\epsilon \to 0$ at every point of continuity of $f$ (and in $L^1$ norm). On the other hand, when $\hat{f} \in L^1$, dominated convergence gives $f_\epsilon(t) \to \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$ as $\epsilon \to 0$, and the result follows. $\square$

Significance: This proof reveals why the Gaussian plays a central role - it is the unique function (up to scaling) that is its own Fourier Transform, making the Gauss regularization self-consistent.

4.3 The Inversion Formula in Practice

The inversion formula is used to:

  1. Recover a signal from its spectrum: given $\hat{f}$, compute $f(t) = \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$
  2. Compute transforms of new functions from known transforms: use the FT table plus properties
  3. Solve PDEs: take the FT, solve the resulting ODE, invert

Example: Compute $f(t)$ such that $\hat{f}(\xi) = e^{-2\pi|\xi|}$ (double-sided exponential spectrum).

By the inversion formula:

$$f(t) = \int_{-\infty}^{\infty} e^{-2\pi|\xi|}e^{2\pi i\xi t}\,d\xi = \int_0^\infty e^{-2\pi\xi(1-it)}\,d\xi + \int_{-\infty}^0 e^{2\pi\xi(1+it)}\,d\xi = \frac{1}{2\pi(1-it)} + \frac{1}{2\pi(1+it)} = \frac{1}{\pi}\cdot\frac{1}{1+t^2}$$

So $\mathcal{F}[1/(\pi(1+t^2))](\xi) = e^{-2\pi|\xi|}$ - confirming the Lorentzian result from Section 2.3 (Example 3).

Numerical verification: The inversion theorem can be verified numerically via the FFT: compute $\hat{f}$ numerically, then invert, and check that you recover $f$ up to discretization error (this is Exercise 3 in Section 10).
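A minimal FFT round-trip along these lines (my own sketch; any smooth, well-sampled signal works):

import numpy as np

t = np.linspace(-8, 8, 1024, endpoint=False)
f = np.exp(-np.pi * t**2) * np.cos(4 * np.pi * t)   # a modulated Gaussian

f_roundtrip = np.fft.ifft(np.fft.fft(f))
print(np.max(np.abs(f - f_roundtrip)))              # ~1e-16: inversion up to float error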

4.4 The Self-Dual Gaussian

The Gaussian $g(t) = e^{-\pi t^2}$ satisfies $\hat{g} = g$, making it a fixed point of the Fourier Transform. This property makes it indispensable in both theory and practice.

Generalized Gaussian family. For $\sigma > 0$, define $g_\sigma(t) = e^{-\pi t^2/\sigma^2}$ (a Gaussian of width $\sigma$). By the scaling theorem:

$$\hat{g}_\sigma(\xi) = \sigma\,e^{-\pi\sigma^2\xi^2} = \sigma\,g_{1/\sigma}(\xi)$$

Wide Gaussian ($\sigma$ large) -> narrow Fourier Transform (bandwidth $\sim 1/\sigma$). This is the uncertainty principle made explicit: the product of the time-width $\sigma$ and the frequency-width $1/\sigma$ equals 1, the minimum possible (see Section 6).

Eigenfunctions of the FT: The Fourier Transform has eigenvalues $\lambda \in \{1, -i, -1, i\}$. The corresponding eigenfunctions are the Hermite functions:

$$\psi_n(t) = H_n(\sqrt{2\pi}\,t)\,e^{-\pi t^2}$$

where $H_n$ is the $n$-th Hermite polynomial. For $n = 0$: $\psi_0(t) = e^{-\pi t^2}$ (eigenvalue 1, the self-dual Gaussian). For $n = 1$: $\psi_1(t) = 2\sqrt{2\pi}\,t\,e^{-\pi t^2}$ (eigenvalue $-i$).

For AI: The Gaussian window is the optimal time-frequency window (by the uncertainty principle), which is why Gaussian smoothing is used in signal preprocessing and why Gaussian attention kernels appear in some efficient attention variants.

GAUSSIAN SELF-DUALITY
========================================================================

  Time domain               Frequency domain
  -------------             ----------------
  g(t) = e^{-πt²}   --FT-->   ĝ(ξ) = e^{-πξ²}   <- same function!

  Width Δt = 1/(2√π)          Width Δξ = 1/(2√π)
  Product: Δt · Δξ = 1/(4π) = MINIMUM (uncertainty bound)

  Scaling:
  f(t) = e^{-πt²/σ²}  --FT-->  σ e^{-πσ²ξ²}
  ^ wider in time              ^ narrower in frequency

========================================================================

5. Plancherel's Theorem and $L^2$ Theory

5.1 Statement: FT as a Unitary Isometry

Theorem 5.1 (Plancherel's Theorem). The Fourier Transform extends uniquely from $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ to a unitary operator on $L^2(\mathbb{R})$:

$$\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R}), \qquad \lVert \hat{f} \rVert_{L^2} = \lVert f \rVert_{L^2}$$

Unitarity means $\mathcal{F}^{-1} = \mathcal{F}^*$ (the adjoint equals the inverse), which in this case means $\mathcal{F}^*[\hat{f}](t) = \int \hat{f}(\xi)e^{+2\pi i\xi t}\,d\xi$ - the inverse FT.

Why this is non-trivial: For $f \in L^2(\mathbb{R})$ but $f \notin L^1(\mathbb{R})$, the defining integral $\int f(t)e^{-2\pi i\xi t}\,dt$ need not converge absolutely. Plancherel's theorem says we can still define $\hat{f} \in L^2$ as the $L^2$ limit:

$$\hat{f} = \lim_{R\to\infty} \int_{-R}^{R} f(t)e^{-2\pi i\xi t}\,dt \qquad \text{(limit in } L^2 \text{ norm)}$$

The limit exists because the truncated transforms form a Cauchy sequence in $L^2$ (by the $L^1 \cap L^2$ case and the density of that subspace in $L^2$).

5.2 Parseval's Relation

Theorem 5.2 (Parseval / Plancherel Identity). For $f, g \in L^2(\mathbb{R})$:

$$\int_{-\infty}^{\infty} f(t)\overline{g(t)}\,dt = \int_{-\infty}^{\infty} \hat{f}(\xi)\overline{\hat{g}(\xi)}\,d\xi$$

The special case $g = f$ gives energy conservation (Parseval's formula):

$$\int_{-\infty}^{\infty} |f(t)|^2\,dt = \int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\,d\xi$$

Interpretation: The total energy of a signal is the same whether measured in the time domain or the frequency domain. No energy is created or destroyed by the Fourier Transform - it is a lossless change of representation.

Using Parseval to compute integrals: Sometimes $\int|f|^2\,dt$ is hard but $\int|\hat{f}|^2\,d\xi$ is easy (or vice versa). Example: compute $\int_{-\infty}^{\infty} \operatorname{sinc}^2(\xi)\,d\xi$.

We know $\mathcal{F}[\operatorname{rect}] = \operatorname{sinc}$, so by Parseval:

$$\int_{-\infty}^{\infty}|\operatorname{sinc}(\xi)|^2\,d\xi = \int_{-\infty}^{\infty}|\operatorname{rect}(t)|^2\,dt = \int_{-1/2}^{1/2}1\,dt = 1$$

For AI: Parseval's relation underpins the analysis of spectral energy distribution in neural networks. The WeightWatcher tool (Martin & Mahoney, 2021) analyzes the spectrum of weight matrices to diagnose training quality - a healthy model has a power-law spectral distribution. Parseval tells us the Frobenius norm $\lVert W \rVert_F^2 = \sum_i\sigma_i^2$ equals the energy in the spectral domain. (A discrete Parseval check follows.)
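The discrete analog of Parseval's formula is easy to verify with NumPy's FFT conventions, where the identity picks up a factor of $N$ because numpy's forward transform is unnormalized (my own sketch):

import numpy as np

x = np.random.default_rng(0).normal(size=1000)
X = np.fft.fft(x)
print(np.sum(np.abs(x)**2))            # time-domain energy
print(np.sum(np.abs(X)**2) / len(x))   # same value: discrete Parseval with 1/N factor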

5.3 Proof Sketch: Extension from L1L2L^1 \cap L^2 to L2L^2

The proof of Plancherel's theorem follows a classic functional analysis pattern:

Step 1: Show the Parseval identity for $f \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ (the intersection is dense in both spaces).

For $f \in L^1 \cap L^2$, use Fubini to compute:

$$\int |\hat{f}(\xi)|^2\,d\xi = \int\hat{f}(\xi)\,\overline{\hat{f}(\xi)}\,d\xi = \int\hat{f}(\xi)\left(\int \overline{f(t)}\,e^{2\pi i\xi t}\,dt\right)d\xi = \int \overline{f(t)}\,\underbrace{\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi}_{=f(t) \text{ by inversion}}\,dt = \int|f(t)|^2\,dt$$

Step 2: The isometry $\lVert\hat{f}\rVert_2 = \lVert f \rVert_2$ for $f \in L^1 \cap L^2$ shows that $\mathcal{F}: L^1\cap L^2 \to L^2$ is a bounded operator with operator norm 1.

Step 3: By density ($L^1 \cap L^2$ is dense in $L^2$) and the fact that a bounded linear isometry extends uniquely to the closure, $\mathcal{F}$ extends to all of $L^2$ preserving the isometry.

Step 4: Show $\mathcal{F}$ is surjective by showing that $\mathcal{F}^{-1}$ (the inverse FT) also extends and $\mathcal{F}\circ\mathcal{F}^{-1} = I$. $\square$

5.4 Energy Conservation in Practice

The Parseval identity has immediate computational consequences:

Low-pass filtering: A filter that removes frequencies $|\xi| > W$ (a rectangular window in frequency) retains the fraction $\int_{|\xi| \leq W}|\hat{f}(\xi)|^2\,d\xi \,/\, \lVert\hat{f}\rVert_2^2$ of the total energy. For a signal with most energy at low frequencies, this fraction is close to 1.

Compression by truncation: For $f \in L^2$, the best $k$-term approximation in the Fourier basis minimizes the $L^2$ error. The error is $\int_{|\xi| > W}|\hat{f}(\xi)|^2\,d\xi$ - the energy in the discarded frequencies. This is the Fourier analog of truncated SVD in matrix approximation.

For AI: The Fourier Neural Operator (Section 8.4) exploits this by keeping only the top-$k$ Fourier modes (lowest frequencies) of the input function and discarding the rest. Plancherel guarantees the error is exactly the discarded energy - a principled truncation criterion, illustrated below.
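A sketch of mode truncation with the discarded-energy identity checked via the discrete Parseval formula (my own illustration; the signal and cutoff are arbitrary):

import numpy as np

N = 1024
t = np.arange(N) / N
f = np.exp(np.cos(2 * np.pi * t))             # a smooth periodic signal
F = np.fft.fft(f)

k = 8                                          # keep only modes with |xi| <= k
mask = np.abs(np.fft.fftfreq(N, d=1.0 / N)) > k
F_trunc = F.copy()
F_trunc[mask] = 0.0
f_trunc = np.fft.ifft(F_trunc).real

err_energy = np.sum(np.abs(f - f_trunc)**2)    # time-domain error energy
discarded = np.sum(np.abs(F[mask])**2) / N     # discarded spectral energy
print(err_energy, discarded)                   # equal, by Parseval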

6. The Heisenberg Uncertainty Principle

6.1 Time Spread and Frequency Spread

To state the uncertainty principle precisely, we need quantitative measures of how "spread out" a signal is in time and frequency.

Definition 6.1 (Time Center and Spread). For $f \in L^2(\mathbb{R})$ with $\lVert f \rVert_2 = 1$:

$$\mu_t = \int_{-\infty}^{\infty} t\,|f(t)|^2\,dt \quad \text{(time center)}$$

$$\Delta t = \left(\int_{-\infty}^{\infty} (t - \mu_t)^2\,|f(t)|^2\,dt\right)^{1/2} \quad \text{(time spread / RMS duration)}$$

Definition 6.2 (Frequency Center and Spread). Similarly:

$$\mu_\xi = \int_{-\infty}^{\infty} \xi\,|\hat{f}(\xi)|^2\,d\xi \quad \text{(frequency center)}$$

$$\Delta\xi = \left(\int_{-\infty}^{\infty} (\xi - \mu_\xi)^2\,|\hat{f}(\xi)|^2\,d\xi\right)^{1/2} \quad \text{(frequency spread / bandwidth)}$$

Note: since $\lVert f \rVert_2^2 = \lVert\hat{f}\rVert_2^2 = 1$ (Plancherel), $|f(t)|^2$ and $|\hat{f}(\xi)|^2$ are probability densities. The time and frequency spreads are the standard deviations of these distributions.

6.2 Formal Statement: $\Delta t \cdot \Delta\xi \geq \frac{1}{4\pi}$

Theorem 6.1 (Heisenberg-Kennard Uncertainty Principle). For any $f \in L^2(\mathbb{R})$ with $\lVert f \rVert_2 = 1$ and $tf(t),\ \xi\hat{f}(\xi) \in L^2(\mathbb{R})$:

$$\Delta t \cdot \Delta\xi \geq \frac{1}{4\pi}$$

Equality holds if and only if $f$ is a Gaussian (up to time-shift, frequency-shift, and scaling):

$$f(t) = A\,e^{-\pi(t-t_0)^2/\sigma^2}\,e^{2\pi i\xi_0 t}$$

for some $A, t_0, \xi_0 \in \mathbb{R}$ and $\sigma > 0$.

The bound is fundamental. This is not a statement about measurement error or quantum physics - it is a purely mathematical theorem about functions and their Fourier Transforms. Any signal with time duration $\Delta t$ necessarily has bandwidth at least $1/(4\pi\Delta t)$. You cannot concentrate a signal to be both time-limited and bandwidth-limited simultaneously.

6.3 Proof via Cauchy-Schwarz

Proof. Without loss of generality, assume $\mu_t = \mu_\xi = 0$ (we can always translate in time and frequency without changing the spreads). Then:

$$\Delta t^2 \cdot \Delta\xi^2 = \left(\int t^2|f(t)|^2\,dt\right)\left(\int \xi^2|\hat{f}(\xi)|^2\,d\xi\right)$$

By Parseval and the differentiation property, $\int\xi^2|\hat{f}(\xi)|^2\,d\xi = \frac{1}{4\pi^2}\int|f'(t)|^2\,dt$ (using $\mathcal{F}[f'](\xi) = 2\pi i\xi\hat{f}(\xi)$ and Parseval).

So we need to show:

$$\left(\int t^2|f|^2\,dt\right)\left(\frac{1}{4\pi^2}\int|f'|^2\,dt\right) \geq \frac{1}{16\pi^2}$$

i.e., $\left(\int t^2|f|^2\right)\left(\int|f'|^2\right) \geq \frac{1}{4}$.

The key identity comes from integrating the derivative of $t|f(t)|^2$. Since $\frac{d}{dt}\big(t|f|^2\big) = |f|^2 + 2t\operatorname{Re}(f'\bar{f})$, and $t|f(t)|^2 \to 0$ as $|t| \to \infty$, integrating over $\mathbb{R}$ gives:

$$0 = \int|f|^2\,dt + 2\int t\operatorname{Re}(f'\bar{f})\,dt = 1 + 2\operatorname{Re}\int t\,\bar{f}(t)f'(t)\,dt$$

So $\operatorname{Re}\int t\,\bar{f}f'\,dt = -1/2$. Then by the Cauchy-Schwarz inequality:

$$\frac{1}{4} = \left|\operatorname{Re}\int t\bar{f}f'\,dt\right|^2 \leq \left|\int t\bar{f}f'\,dt\right|^2 \leq \left(\int t^2|f|^2\,dt\right)\left(\int|f'|^2\,dt\right)$$

Combining with $\Delta\xi^2 = \frac{1}{4\pi^2}\int|f'|^2\,dt$:

$$\Delta t^2 \cdot \Delta\xi^2 \geq \frac{1}{4} \cdot \frac{1}{4\pi^2} = \frac{1}{16\pi^2} \qquad\therefore\qquad \Delta t \cdot \Delta\xi \geq \frac{1}{4\pi} \qquad \square$$

6.4 The Gaussian as the Unique Extremal Function

Equality in Cauchy-Schwarz requires $f'(t) = \lambda t f(t)$ for some constant $\lambda$. This ODE has solution $f(t) = Ce^{\lambda t^2/2}$. For $f \in L^2(\mathbb{R})$, we need $\lambda < 0$, giving $f(t) = Ce^{-\alpha t^2}$ for $\alpha > 0$ - a Gaussian.

Adding back the time-shift $t_0$ and frequency-shift $\xi_0$:

$$f(t) = C\,e^{-\pi(t-t_0)^2/\sigma^2}\,e^{2\pi i\xi_0 t}, \qquad \Delta t = \frac{\sigma}{2\sqrt{\pi}}, \quad \Delta\xi = \frac{1}{2\sigma\sqrt{\pi}}$$

Check: $\Delta t \cdot \Delta\xi = \frac{\sigma}{2\sqrt{\pi}} \cdot \frac{1}{2\sigma\sqrt{\pi}} = \frac{1}{4\pi}$ - exactly the bound. In particular, for the normalized $g(t) = e^{-\pi t^2}$ (i.e., $\sigma = 1$): $\Delta t = \Delta\xi = 1/(2\sqrt{\pi})$.

Implication: The Gaussian is the only signal that achieves perfect time-frequency concentration in the sense of minimizing $\Delta t \cdot \Delta\xi$. Every other signal is "more spread out" in at least one of the two domains. This is why Gaussian windows are optimal for time-frequency analysis (Gabor atoms), and why the Gaussian noise model is natural in signal processing. The numerical check below computes both spreads on a grid.
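A discretized check of the uncertainty product (my own sketch; the FFT approximates the continuous FT on a centered grid, and only magnitudes are needed for the spreads). The triangle function is used as the comparison signal because its frequency spread is finite; the rect function's is not.

import numpy as np

def spreads(f_vals, t):
    """Return (Delta t, Delta xi) for a sampled signal, FFT approximating the FT."""
    dt = t[1] - t[0]
    p = np.abs(f_vals)**2
    p = p / (np.sum(p) * dt)                         # |f|^2 as a probability density
    mu_t = np.sum(t * p) * dt
    d_t = np.sqrt(np.sum((t - mu_t)**2 * p) * dt)

    xi = np.fft.fftshift(np.fft.fftfreq(len(t), d=dt))
    q = np.abs(np.fft.fftshift(np.fft.fft(f_vals)))**2
    dxi = xi[1] - xi[0]
    q = q / (np.sum(q) * dxi)                        # |f̂|^2 as a probability density
    mu_xi = np.sum(xi * q) * dxi
    d_xi = np.sqrt(np.sum((xi - mu_xi)**2 * q) * dxi)
    return d_t, d_xi

t = np.linspace(-20, 20, 4096, endpoint=False)
for name, f in [("gaussian", np.exp(-np.pi * t**2)),
                ("triangle", np.maximum(0.0, 1 - np.abs(t)))]:
    d_t, d_xi = spreads(f, t)
    print(name, d_t * d_xi, ">=", 1 / (4 * np.pi))   # Gaussian meets the bound exactly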

6.5 Implications for Neural Architecture Design

The uncertainty principle has direct, concrete implications for AI systems:

RoPE and context length extension: In RoPE, each dimension pair uses frequency $\theta_j = 10000^{-2j/d_{\text{model}}}$. The lowest frequencies $\theta_j$ encode the longest-range positional information. To extend context from 4K to 128K tokens (as in LLaMA-3 long context), the lowest frequencies must be able to distinguish positions up to 128K. But by the uncertainty principle, a signal that is distinguishable over a range $T$ in "position space" must have frequency resolution $\Delta\xi \leq 1/T$ - which requires the frequency to be low enough. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) resolve this by rescaling $\theta_j$ to use lower frequencies for long-context models.

Spectrogram and STFT: Audio models like Whisper (Radford et al., 2022) use the Short-Time Fourier Transform - the FT applied to windowed segments. The window length controls the time-frequency tradeoff. Longer windows: better frequency resolution, worse time resolution. The Gaussian window is optimal but the Hamming window is commonly used for computational reasons.

Attention window size: Multi-head self-attention with local windows (as in Longformer, BigBird) effectively applies a bandpass filter in "position space." The uncertainty principle says a window of size $W$ can resolve frequencies only to a resolution of about $1/W$ - setting a fundamental limit on the position encodings that can be usefully resolved within the window.

UNCERTAINTY PRINCIPLE - ARCHITECTURE IMPLICATIONS
========================================================================

  Signal Analysis            Neural Architecture Analog
  -----------------          -------------------------
  Time duration T            Context window / sequence length
  Bandwidth B ~ 1/T          Position frequency resolution
  Δt · Δξ >= 1/(4π)          Short context -> coarse PE, long -> fine PE

  STFT window size W:        Local attention window size W:
    -> freq res: ~ 1/W          -> PE resolution: ~ 1/W
    -> time res: ~ W            -> local context: ~ W

  Gaussian window:           Gaussian attention kernel:
    -> optimal concentration    -> used in some efficient attention variants

  LongRoPE: rescales θ_j to lower freqs -> finer resolution at long range

========================================================================

7. Tempered Distributions and the Dirac Delta

7.1 Why Distributions Are Necessary

The L1L^1 and L2L^2 theories of the Fourier Transform handle a large class of functions, but they exclude some of the most important objects in signal processing and physics:

  • $\delta(t)$ - the Dirac delta: not a function, but models an ideal impulse
  • $f(t) = 1$ - the constant function: not in $L^1$ or $L^2$, but models a DC signal
  • $f(t) = e^{2\pi i\xi_0 t}$ - a pure tone: not in $L^1$ or $L^2$, but is a fundamental signal
  • $\sum_{n=-\infty}^{\infty}\delta(t - nT)$ - the Dirac comb: not a function, but models periodic sampling

All of these have Fourier Transforms in the distributional sense - and all appear in practical signal processing. The distribution framework, developed by Laurent Schwartz in the 1950s, extends the Fourier Transform to this broader class.

7.2 Schwartz Space and Tempered Distributions

Definition 7.1 (Schwartz Space). The Schwartz space $\mathcal{S}(\mathbb{R})$ is the space of all $C^\infty$ functions $\phi: \mathbb{R} \to \mathbb{C}$ such that for all $m, n \geq 0$:

$$\sup_{t \in \mathbb{R}} \left|t^m \phi^{(n)}(t)\right| < \infty$$

i.e., $\phi$ and all its derivatives decay faster than any polynomial. Examples: $e^{-t^2}$ and compactly supported smooth bump functions. Non-example: $e^{-|t|}$ (it decays fast enough, but it is not differentiable at 0, so it is not Schwartz).

The Schwartz space is closed under the Fourier Transform: if $\phi \in \mathcal{S}$, then $\hat{\phi} \in \mathcal{S}$. This is the key property: the FT maps $\mathcal{S}$ to itself bijectively.

Definition 7.2 (Tempered Distributions). A tempered distribution $T$ is a continuous linear functional $T: \mathcal{S}(\mathbb{R}) \to \mathbb{C}$. We write $T(\phi) = \langle T, \phi \rangle$.

Examples of tempered distributions:

  1. Regular distributions: Every function $f \in L^p(\mathbb{R})$ ($1 \leq p \leq \infty$) defines a tempered distribution $T_f(\phi) = \int f(t)\phi(t)\,dt$.

  2. Dirac delta: $\langle\delta, \phi\rangle = \phi(0)$. The Dirac delta evaluates $\phi$ at 0. It is NOT a function.

  3. Derivative of delta: $\langle\delta', \phi\rangle = -\phi'(0)$.

  4. Dirac comb: $\langle\text{III}_T, \phi\rangle = \sum_{n=-\infty}^\infty \phi(nT)$.

Fourier Transform of a tempered distribution: Define $\langle\hat{T}, \phi\rangle = \langle T, \hat{\phi}\rangle$ for $\phi \in \mathcal{S}$. This is the "transpose" of the Fourier Transform.

7.3 The Dirac Delta and its Fourier Transform

The Dirac delta $\delta(t)$ models a unit impulse concentrated at $t = 0$:

$$\langle\delta, \phi\rangle = \phi(0) = \int_{-\infty}^{\infty} \delta(t)\phi(t)\,dt \quad \text{(formal)}$$

It is the limit of a sequence of ordinary functions: $\delta_\epsilon(t) = \frac{1}{\epsilon}e^{-\pi t^2/\epsilon^2}$ converges to $\delta$ in the distributional sense as $\epsilon \to 0$.

Fourier Transform of $\delta$: By the definition, $\langle\hat{\delta}, \phi\rangle = \langle\delta, \hat{\phi}\rangle = \hat{\phi}(0) = \int\phi(t)\cdot 1\,dt = \langle 1, \phi\rangle$. Therefore:

$$\mathcal{F}[\delta](\xi) = 1$$

An ideal impulse at $t = 0$ has a flat (white) spectrum - it contains equal energy at all frequencies. This makes perfect physical sense: to create a perfect impulse, you need all frequencies constructively interfering at $t = 0$.

Fourier Transform of a constant: By duality (applying the FT twice): $\mathcal{F}[1](\xi) = \delta(\xi)$. A constant signal (DC) lives entirely at frequency $\xi = 0$.

The duality principle in full: The FT maps $\delta(t) \mapsto 1$ and $1 \mapsto \delta(\xi)$ - a striking symmetry.

More generally, for a delta at $t = a$:

$$\mathcal{F}[\delta(t-a)](\xi) = e^{-2\pi i\xi a}$$

This is the time-shift property: the shift by $a$ gives the phase factor $e^{-2\pi i\xi a}$, but the magnitude is still flat: $|\mathcal{F}[\delta(t-a)]| = 1$. (A discrete illustration follows.)
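The discrete analog is immediate: the FFT of a unit impulse is the all-ones vector, and shifting the impulse changes only the phase (my own sketch):

import numpy as np

N = 8
impulse = np.zeros(N); impulse[0] = 1.0
print(np.fft.fft(impulse))            # all ones: flat spectrum

shifted = np.zeros(N); shifted[3] = 1.0      # delta at t = 3
X = np.fft.fft(shifted)
print(np.abs(X))                      # magnitudes still all 1
print(np.angle(X))                    # phase linear in frequency: -2*pi*k*3/N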

7.4 FT of Constants, Sinusoids, and Periodic Signals

Using $\mathcal{F}[1] = \delta$ and the frequency-shift property $\mathcal{F}[e^{2\pi i\xi_0 t}f(t)](\xi) = \hat{f}(\xi - \xi_0)$:

Complex exponential:

$$\mathcal{F}[e^{2\pi i\xi_0 t}](\xi) = \delta(\xi - \xi_0)$$

Cosine and Sine:

$$\mathcal{F}[\cos(2\pi\xi_0 t)](\xi) = \frac{1}{2}[\delta(\xi - \xi_0) + \delta(\xi + \xi_0)]$$

$$\mathcal{F}[\sin(2\pi\xi_0 t)](\xi) = \frac{1}{2i}[\delta(\xi - \xi_0) - \delta(\xi + \xi_0)]$$

A pure cosine has a spectrum consisting of two Dirac deltas, one at $+\xi_0$ and one at $-\xi_0$ (the negative frequency is the mirror image required by Hermitian symmetry). The amplitude of each spike is $1/2$ because the energy is split equally between the two.

General periodic signal: A $T$-periodic function $f(t) = \sum_{n=-\infty}^\infty c_n e^{2\pi i nt/T}$ has Fourier Transform:

$$\hat{f}(\xi) = \sum_{n=-\infty}^\infty c_n\,\delta\!\left(\xi - \frac{n}{T}\right)$$

The Fourier Transform of a periodic signal is a discrete sum of Dirac deltas at the harmonics - precisely the discrete spectrum of the Fourier series! This shows that Fourier series and Fourier Transform are two aspects of the same theory, unified by the distributional framework.

7.5 The Dirac Comb and Poisson Summation Formula

Definition 7.3 (Dirac Comb). The $T$-periodic Dirac comb is:

$$\text{III}_T(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

Theorem 7.1 (FT of the Dirac Comb).

$$\mathcal{F}[\text{III}_T](\xi) = \frac{1}{T}\,\text{III}_{1/T}(\xi) = \frac{1}{T}\sum_{n=-\infty}^{\infty} \delta\!\left(\xi - \frac{n}{T}\right)$$

A comb of spacing $T$ in time has a spectrum that is a comb of spacing $1/T$ in frequency. This is the time-frequency duality made concrete: fine temporal sampling -> coarse frequency sampling, and vice versa.

Theorem 7.2 (Poisson Summation Formula). For $f \in \mathcal{S}(\mathbb{R})$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{n=-\infty}^{\infty} \hat{f}(n)$$

More generally, for sampling at interval $T$:

$$\frac{1}{T}\sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{T}\right) = \sum_{n=-\infty}^{\infty} \hat{f}(nT)$$

Proof: The function $g(t) = \sum_n f(t + n)$ is $1$-periodic, so it has a Fourier series with coefficients $c_k = \hat{f}(k)$ (by computing $\int_0^1 g(t)e^{-2\pi ikt}\,dt = \int_{-\infty}^\infty f(t)e^{-2\pi ikt}\,dt = \hat{f}(k)$). Evaluating at $t = 0$: $\sum_n f(n) = g(0) = \sum_k c_k = \sum_k\hat{f}(k)$. $\square$
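A numeric check of Poisson summation with a scaled Gaussian $f(t) = e^{-\pi t^2/\sigma^2}$, whose transform $\hat{f}(\xi) = \sigma e^{-\pi\sigma^2\xi^2}$ is known in closed form (my own sketch; the value of σ and the truncation range are arbitrary):

import numpy as np

sigma = 1.7
n = np.arange(-50, 51)                                   # truncation: both tails negligible
lhs = np.sum(np.exp(-np.pi * n**2 / sigma**2))           # sum of f(n)
rhs = np.sum(sigma * np.exp(-np.pi * sigma**2 * n**2))   # sum of f̂(n)
print(lhs, rhs)                                          # equal to machine precision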

The Poisson summation formula is the key to sampling theory. Shannon's sampling theorem (1949) follows directly: a signal ff with bandwidth BB (i.e., f^(ξ)=0\hat{f}(\xi) = 0 for ξ>B|\xi| > B) is completely determined by its samples at rate 2B\geq 2B (the Nyquist rate). The Poisson summation formula explains both the reconstruction formula and the aliasing that occurs when undersampling.

For AI: The discrete Fourier Transform (Section 03) is the computational realization of the Poisson summation formula. The FFT computes the sampled Fourier Transform exactly only when the signal is band-limited (or periodic); any violation folds out-of-band frequencies back into the band - aliasing, the sampling-side counterpart of the spectral leakage caused by finite-duration truncation.

7.6 Unifying Fourier Series and Fourier Transform

The distribution framework reveals that Fourier series and Fourier Transform are the same thing:

| Situation | Signal | Spectrum |
|-----------|--------|----------|
| General $L^2$ signal | Continuous, aperiodic | Continuous, aperiodic |
| Periodic signal (period $T$) | Continuous, periodic | Discrete (spacing $1/T$): Fourier series |
| Bandlimited (bandwidth $B$) | Discrete (rate $\geq 2B$): sampling | Periodic (period $\geq 2B$): aliases |
| Periodic AND bandlimited | Discrete and periodic | Discrete and periodic: DFT (Section 03) |
FOURIER TRANSFORM UNIFICATION
========================================================================

  Time domain           Frequency domain        Setting
  -----------           ----------------        -------
  Continuous            Continuous              L1/L2 Fourier Transform
  Continuous            Discrete                Fourier Series (periodic)
  Discrete              Continuous              DTFT (sampled signal)
  Discrete              Discrete                DFT / FFT -> see Section 20-03

  All four are the same theory viewed through the distributional lens.
  The Dirac comb + Poisson summation connects all four cases.

========================================================================

8. Applications in Machine Learning

8.1 FNet: Replacing Self-Attention with Fourier Transforms

Paper: Lee-Thorp, Ainslie, Eckstein, Ontanon (2022). "FNet: Mixing Tokens with Fourier Transforms." NAACL.

The idea: In a standard Transformer, the self-attention sublayer computes:

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

which is O(N2d)O(N^2 d) in sequence length NN. FNet replaces this with a 2D DFT:

FNet-mixing(X)=Re ⁣(Fseq ⁣[Fmodel[X]])\text{FNet-mixing}(X) = \operatorname{Re}\!\left(\mathcal{F}_{\text{seq}}\!\left[\mathcal{F}_{\text{model}}[X]\right]\right)

where Fseq\mathcal{F}_{\text{seq}} applies FFT along the sequence dimension and Fmodel\mathcal{F}_{\text{model}} along the embedding dimension. This is O(NlogNd)O(N\log N \cdot d) - much faster.

Why does it work? The Fourier Transform is a global linear mixing operation: every output position depends on every input position (via the sum x^k=nxne2πink/N\hat{x}_k = \sum_n x_n e^{-2\pi ink/N}). This is similar to attention's global mixing, but unparameterized - the mixing weights are fixed Fourier basis functions rather than learned attention scores.

Results: FNet achieves 92-97% of BERT's performance on GLUE benchmarks. For tasks requiring fine-grained token interaction (e.g., extractive QA), the gap is larger; for tasks where global context suffices (e.g., sentence classification), FNet nearly matches BERT. Training is 7× faster on GPUs and 2× faster on TPUs.

Mathematical insight: The key observation is that the DFT matrix $F$, with entries $F_{kn} = e^{-2\pi ikn/N}$, is symmetric ($F = F^\top$), and $F/\sqrt{N}$ is unitary: $(F/\sqrt{N})(F/\sqrt{N})^\dagger = I$. The DFT preserves energy up to normalization, just as the continuous FT does by Plancherel. The real part operation ensures the output is real-valued.

Code sketch:

```python
import numpy as np

def fnet_mixing(X):
    """
    X: (batch, seq_len, d_model) array
    Returns: real-valued mixing output of same shape
    """
    # FFT along the sequence axis (token mixing), then along the embedding
    # axis; taking the real part keeps the representation real-valued.
    X_fft = np.fft.fft(np.fft.fft(X, axis=1), axis=2)
    return np.real(X_fft)
```

8.2 Random Fourier Features

Paper: Rahimi & Recht (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS (Best Paper Award).

The setup: Kernel methods (SVMs, Gaussian Processes) require computing Kij=k(x(i),x(j))K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) for all NN training pairs - an O(N2)O(N^2) matrix that costs O(N3)O(N^3) to invert. For N>105N > 10^5, this is prohibitive.

Bochner's theorem: A continuous, shift-invariant kernel k(x,y)=k(xy)k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y}) is positive definite if and only if it is the Fourier Transform of a non-negative measure p(ω)p(\boldsymbol{\omega}):

k(xy)=p(ω)eiω(xy)dω=Eωp[eiωxeiωy]k(\mathbf{x} - \mathbf{y}) = \int p(\boldsymbol{\omega})\,e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}[e^{i\boldsymbol{\omega}^\top\mathbf{x}}\overline{e^{i\boldsymbol{\omega}^\top\mathbf{y}}}]

(Full treatment of Bochner's theorem in Section 12-03 Kernel Methods.)

The approximation: Sample DD frequencies ω1,,ωDp(ω)\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega}) and define the random feature map:

ϕ(x)=2D[cos(ω1x+b1),,cos(ωDx+bD)]RD\phi(\mathbf{x}) = \sqrt{\frac{2}{D}}\left[\cos(\boldsymbol{\omega}_1^\top\mathbf{x} + b_1), \ldots, \cos(\boldsymbol{\omega}_D^\top\mathbf{x} + b_D)\right] \in \mathbb{R}^D

where bjU[0,2π]b_j \sim \mathcal{U}[0, 2\pi] are random phase offsets.

Guarantee: By the law of large numbers:

k(x,y)ϕ(x)ϕ(y),E[ϕ(x)ϕ(y)]=k(x,y)k(\mathbf{x}, \mathbf{y}) \approx \phi(\mathbf{x})^\top\phi(\mathbf{y}), \qquad \mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})

with concentration: P ⁣(ϕ(x)ϕ(y)k(x,y)>ϵ)2exp(Dϵ2/4)P\!\left(|\phi(\mathbf{x})^\top\phi(\mathbf{y}) - k(\mathbf{x},\mathbf{y})| > \epsilon\right) \leq 2\exp(-D\epsilon^2/4).

For popular kernels:

  • RBF kernel k(x,y)=exy2/(2σ2)k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2/(2\sigma^2)}: p(ω)=N(0,σ2I)p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \sigma^{-2}I)
  • Laplace kernel k(x,y)=exy1/σk(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert_1/\sigma}: pp is a Cauchy distribution
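The feature map takes only a few lines. Below is a minimal numpy sketch for the RBF kernel, sampling $\boldsymbol{\omega}_j \sim \mathcal{N}(\mathbf{0}, \sigma^{-2}I)$ as in the list above; the function name `rff_features` and the constants are illustrative choices, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, D=500, sigma=1.0):
    """Random Fourier features for k(x,y) = exp(-||x - y||^2 / (2 sigma^2)).

    X: (N, d) data. Returns (N, D) features with E[phi(x) @ phi(y)] = k(x, y).
    """
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # omega_j ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Sanity check against the exact kernel on a handful of points
X = rng.normal(size=(5, 3))
phi = rff_features(X, D=20000)
approx = phi @ phi.T
exact = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())    # small, shrinking like O(1/sqrt(D))
```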

Impact: Random Fourier Features reduce kernel SVM training from O(N3)O(N^3) to O(ND+D3)O(ND + D^3), making kernel methods scalable. The same idea appears in Performer (Choromanski et al., 2021) - an efficient attention mechanism that approximates the attention kernel with random features, reducing attention to O(N)O(N).

8.3 Spectral Normalization

Paper: Miyato, Kataoka, Koyama, Yoshida (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR.

The problem: Training GANs is notoriously unstable because the discriminator can overfit, causing gradient vanishing for the generator. A Lipschitz constraint on the discriminator stabilizes training.

The solution: Normalize each weight matrix WW by its spectral norm W2=σmax(W)\lVert W \rVert_2 = \sigma_{\max}(W) (the largest singular value):

W~=Wσmax(W)\tilde{W} = \frac{W}{\sigma_{\max}(W)}

This makes every linear layer 1-Lipschitz: W~xW~yxy\lVert\tilde{W}\mathbf{x} - \tilde{W}\mathbf{y}\rVert \leq \lVert\mathbf{x} - \mathbf{y}\rVert. The composition of 1-Lipschitz layers is 1-Lipschitz - so the entire discriminator is 1-Lipschitz.

Computing σmax\sigma_{\max}: Power iteration gives an efficient approximation. Starting from a random unit vector v~\tilde{\mathbf{v}}:

  1. u~=Wv~/Wv~\tilde{\mathbf{u}} = W\tilde{\mathbf{v}} / \lVert W\tilde{\mathbf{v}} \rVert
  2. v~=Wu~/Wu~\tilde{\mathbf{v}} = W^\top\tilde{\mathbf{u}} / \lVert W^\top\tilde{\mathbf{u}} \rVert
  3. $\hat{\sigma} = \tilde{\mathbf{u}}^\top W\tilde{\mathbf{v}}$

One iteration per training step suffices empirically, since the estimate is warm-started from the previous update.
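A minimal numpy sketch of the procedure (run from a cold start here, hence more iterations than the single warm-started step used during training):

```python
import numpy as np

def spectral_norm(W, n_iters=50, seed=0):
    """Estimate sigma_max(W) by power iteration (a sketch of the SN-GAN recipe)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)            # step 1: left singular direction
        v = W.T @ u
        v /= np.linalg.norm(v)            # step 2: right singular direction
    return u @ W @ v                      # step 3: estimate of sigma_max

W = np.random.default_rng(1).normal(size=(64, 32))
print(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0])  # near-identical
W_tilde = W / spectral_norm(W)            # the spectrally normalized weight
```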

Why "spectral"? The spectral norm W2\lVert W \rVert_2 is the largest eigenvalue of WW\sqrt{W^\top W} - the spectrum of the operator WW viewed as a linear map. This connects to Fourier analysis: for a convolution layer with kernel hh, the spectral norm is h^=maxξh^(ξ)\lVert\hat{h}\rVert_\infty = \max_\xi|\hat{h}(\xi)| - the supremum of the Fourier Transform of the filter.

For AI: Spectral normalization is now standard in GAN training. It also appears in attention normalization - normalizing the key/value projection matrices to control the Lipschitz constant of the attention map, which is critical for stable training of large models.

8.4 Fourier Neural Operator - Preview

The Fourier Neural Operator (FNO) (Li et al., 2021) learns a mapping between function spaces (e.g., PDE initial conditions -> solutions) by:

  1. Lift input v(t)\mathbf{v}(t) to a higher-dimensional representation v0=P(v)\mathbf{v}_0 = P(\mathbf{v})
  2. Apply LL Fourier integral operator layers: vl+1(t)=σ(Wvl(t)+F1[RϕF[vl]](t))\mathbf{v}_{l+1}(t) = \sigma(W\mathbf{v}_l(t) + \mathcal{F}^{-1}[R_\phi \cdot \mathcal{F}[\mathbf{v}_l]](t))
  3. Project to output: u=Q(vL)u = Q(\mathbf{v}_L)

The key operation RϕR_\phi is a learnable complex-valued tensor applied pointwise in Fourier space - a global convolution with a learned filter. Only the lowest kmaxk_{\max} Fourier modes are kept (Plancherel guarantees the truncation error equals the discarded energy).
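A minimal numpy sketch of one spectral convolution (1D, with channel-diagonal weights for brevity; the actual FNO uses a full $(k_{\max}, c_{\text{in}}, c_{\text{out}})$ complex tensor that also mixes channels within each mode):

```python
import numpy as np

def spectral_conv_1d(v, R):
    """FNO-style spectral convolution (sketch): FFT, filter low modes, inverse FFT.

    v: (n_points, channels) real signal on a uniform grid
    R: (k_max, channels) complex weights applied to the lowest Fourier modes
    """
    k_max = R.shape[0]
    v_hat = np.fft.rfft(v, axis=0)         # (n_points // 2 + 1, channels)
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max] = v_hat[:k_max] * R    # learned filter; higher modes truncated
    return np.fft.irfft(out_hat, n=v.shape[0], axis=0)

rng = np.random.default_rng(0)
v = rng.normal(size=(128, 4))
R = rng.normal(size=(16, 4)) + 1j * rng.normal(size=(16, 4))
print(spectral_conv_1d(v, R).shape)        # (128, 4)
```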

FNO achieves up to a 1000× speedup over classical PDE solvers for problems like weather prediction and turbulence simulation.

Full treatment in Section 20-03 DFT and FFT - the discretized version of FNO and its implementation via FFT is covered there.

8.5 Bochner's Theorem and Shift-Invariant Kernels - Preview

Bochner's theorem (1932) characterizes all positive-definite, shift-invariant kernels as Fourier Transforms of non-negative measures:

k(xy)=Rdeiω(xy)dp(ω)k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,dp(\boldsymbol{\omega})

This is a profound connection between Fourier analysis and kernel methods. The kernel's shape in the spatial domain determines (and is determined by) its spectral density pp.

Full treatment in Section 12-03 Kernel Methods - the RKHS theory, Mercer's theorem, and applications to SVMs, Gaussian Processes, and the Neural Tangent Kernel are covered there. Here we note that Bochner's theorem is the mathematical foundation for Random Fourier Features (Section 8.2).

8.6 RoPE from the Continuous FT Perspective

Rotary Position Embedding (RoPE) (Su et al., 2021) is now used in LLaMA-3, Mistral, Qwen, GPT-NeoX, and virtually every frontier LLM. Its mathematical foundation is the Fourier Transform on the circle group.

Setup. In attention, the score between query q\mathbf{q} at position mm and key k\mathbf{k} at position nn should depend only on the relative position mnm - n. This is a translation invariance requirement - exactly the condition Bochner's theorem characterizes.

Construction. Pair up embedding dimensions: (q2j,q2j+1)(q_{2j}, q_{2j+1}) and (k2j,k2j+1)(k_{2j}, k_{2j+1}) for j=0,,d/21j = 0, \ldots, d/2 - 1. Treat each pair as a complex number qj(c)=q2j+iq2j+1q_j^{(c)} = q_{2j} + iq_{2j+1}. Apply rotation by angle mθjm\theta_j:

fq(q,m)=qj(c)eimθjθj=100002j/dmodelf_q(\mathbf{q}, m) = q_j^{(c)} \cdot e^{im\theta_j} \qquad \theta_j = 10000^{-2j/d_{\text{model}}}

The attention score becomes:

fq(q,m),fk(k,n)=Re ⁣[jqj(c)kj(c)ei(mn)θj]\langle f_q(\mathbf{q}, m), f_k(\mathbf{k}, n)\rangle = \operatorname{Re}\!\left[\sum_j q_j^{(c)}\overline{k_j^{(c)}} \cdot e^{i(m-n)\theta_j}\right]

which depends only on mnm - n - the relative position. This is the Fourier modulation theorem: multiplying by eimθje^{im\theta_j} is a frequency shift by mm in the "position domain."

The frequencies θj\theta_j: The geometric spacing θj=100002j/d\theta_j = 10000^{-2j/d} means each pair encodes a different "octave" of positional information. Low-index dimensions (jj small) use high frequencies θj1\theta_j \approx 1 - sensitive to local position differences. High-index dimensions use low frequencies θj100001\theta_j \approx 10000^{-1} - sensitive to global structure over thousands of positions. This is a multi-resolution decomposition - the Fourier Transform at multiple frequency scales.

Why the $10000$ base? At the original RoPE scale, the maximum useful context is $\sim 10000$ tokens: the slowest frequency $\theta_{d/2-1} \approx 1/10000$ accumulates only about one radian of rotation over that span, so longer-range positions become poorly distinguished. For LLaMA-3 with 128K context, the effective base must be much larger - which is why LLaMA-3.1 uses $\theta_{\text{base}} = 500000$ instead of $10000$.
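The relative-position property is easy to verify directly. A minimal numpy sketch (the helper `rope` and the test positions are illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding (sketch). x: (d,) with d even; pos: position index."""
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # theta_j = base^{-2j/d}
    z = x[0::2] + 1j * x[1::2]                       # pair dims into complex numbers
    z = z * np.exp(1j * pos * theta)                 # rotate pair j by pos * theta_j
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The score depends only on the relative position m - n:
s1 = rope(q, 7) @ rope(k, 3)        # m - n = 4
s2 = rope(q, 104) @ rope(k, 100)    # m - n = 4
print(np.isclose(s1, s2))           # True
```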

ROPE: FOURIER TRANSFORM IN ATTENTION
========================================================================

  f_q(q, m) = q . e^{im*theta}   <- modulation by Fourier basis e^{im*theta}

  <f_q(q,m), f_k(k,n)> = Re[q k* . e^{i(m-n)*theta}]
                                  ^ depends only on relative position m-n

  theta_j = 10000^{-2j/d}:  j=0 -> theta ~ 1 (fast/local),  j=d/2 -> theta ~ 1/10000 (slow/global)

  Context extension:
    Original RoPE: base=10000, max context ~10K tokens
    LLaMA-3.1:     base=500K,  max context 128K tokens
    LongRoPE:      non-uniform rescaling up to 2M context

========================================================================

9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing the $\omega$ and $\xi$ conventions and getting $2\pi$ factors wrong | The $\omega$-convention $\hat{f}(\omega)=\int f(t)e^{-i\omega t}\,dt$ requires a $1/(2\pi)$ factor in the inverse; the $\xi$-convention does not. Mixing them adds spurious factors | Always identify the convention on first use; this curriculum uses the $\xi$-convention $\hat{f}(\xi)=\int f(t)e^{-2\pi i\xi t}\,dt$ |
| 2 | Assuming $\hat{f} \in L^1$ whenever $f \in L^1$ | $f=\operatorname{rect}$ has $\hat{f}=\operatorname{sinc} \notin L^1$. The Riemann-Lebesgue lemma only guarantees $\hat{f} \in C_0$, not $L^1$ | Check whether $f$ is smooth enough (e.g., $f \in L^1$ and Lipschitz) for $\hat{f} \in L^1$ to hold |
| 3 | Writing $\mathcal{F}[\delta](\xi) = 0$ | $\delta$ has a flat spectrum: $\mathcal{F}[\delta](\xi)=1$ for all $\xi$. A "pointlike" impulse contains all frequencies equally | Compute the distributional FT: $\langle\hat{\delta},\phi\rangle=\langle\delta,\hat{\phi}\rangle=\hat{\phi}(0)=\int\phi\,dt$ |
| 4 | Thinking the uncertainty principle only limits our instruments | $\Delta t\cdot\Delta\xi\geq 1/(4\pi)$ is a theorem about functions, not about measurement devices. It holds for any $f\in L^2$, regardless of how it is measured | Accept the bound as a mathematical fact; designing around it requires fundamentally different time-frequency representations (wavelets, Section 05) |
| 5 | Applying the differentiation property to non-differentiable $f$ | $\mathcal{F}[f'](\xi)=2\pi i\xi\hat{f}(\xi)$ requires $f'\in L^1$. For $f=\operatorname{rect}$, $f'=\delta(t+1/2)-\delta(t-1/2)$ in the distributional sense | Work in the distributional setting when $f$ has jumps; the formula still holds but $f'$ is a distribution |
| 6 | Claiming Parseval says $\int f^2 = \int \hat{f}^2$ without moduli | $\hat{f}$ is complex-valued in general; the correct statement is $\int\lvert f\rvert^2\,dt = \int\lvert\hat{f}\rvert^2\,d\xi$ | Always write Parseval/Plancherel with squared moduli |
| 7 | Mixing up the FT of a product vs. convolution | $\mathcal{F}[f\cdot g]=\hat{f}*\hat{g}$ (convolution in frequency), not $\hat{f}\cdot\hat{g}$. The multiplicative dual of the convolution theorem is often reversed | Memorize the duality table: convolution in time $\leftrightarrow$ product in frequency; product in time $\leftrightarrow$ convolution in frequency |
| 8 | Forgetting the $1/\lvert a\rvert$ in the scaling theorem | The scaling theorem reads $\mathcal{F}[f(at)](\xi)=\frac{1}{\lvert a\rvert}\hat{f}(\xi/a)$; dropping the $1/\lvert a\rvert$ violates energy conservation (Plancherel) | Remember the Jacobian factor: compressing in time by $a$ stretches and rescales the spectrum |
| 9 | Treating negative frequencies as "unphysical" | For real $f$, negative frequencies are redundant (Hermitian symmetry) but are not wrong. For complex $f$ (e.g., analytic signals), negative and positive frequencies are distinct and carry different information | Embrace negative frequencies; they simplify formulas and are essential for complex signals |
| 10 | Applying $\mathcal{F}^{-1}\circ\mathcal{F}=I$ pointwise instead of a.e. | Inversion holds almost everywhere (given $f \in L^1$ and sufficient regularity). At a jump discontinuity, $\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi=\frac{1}{2}[f(t^+)+f(t^-)]$ - the average of left and right limits | Account for the $1/2$-average behavior at discontinuities (same as Dirichlet's theorem in Section 01) |
| 11 | Using $\operatorname{sinc}(x) = \sin(x)/x$ and $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ interchangeably | There are two conventions for sinc. The normalized $\operatorname{sinc}(x)=\sin(\pi x)/(\pi x)$ satisfies $\mathcal{F}[\operatorname{rect}]=\operatorname{sinc}$ in the $\xi$-convention; the unnormalized $\sin(x)/x$ does not | In this curriculum: $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ (normalized), consistent with the $\xi$-convention FT |
| 12 | Thinking FNet replaces attention entirely | FNet is a weaker mixer than attention - it cannot learn data-dependent weights. It works well for classification but poorly for tasks requiring fine-grained token interactions (QA, coreference) | Use FNet as a fast baseline; switch to attention for tasks requiring selective, query-dependent mixing |

10. Exercises

Exercise 1 * - Computing Fourier Transforms from Definition

(a) Compute F[f](ξ)\mathcal{F}[f](\xi) for f(t)=eatf(t) = e^{-a|t|} with a>0a > 0. Show all steps of integration. What type of function is f^\hat{f}?

(b) Compute F[rect(t/T)](ξ)\mathcal{F}[\operatorname{rect}(t/T)](\xi) for a pulse of width T>0T > 0. Express using the normalized sinc function.

(c) Verify the L1L^1 bound: show f^(ξ)f1|\hat{f}(\xi)| \leq \lVert f\rVert_1 for your answer in (a).

(d) Confirm the Riemann-Lebesgue lemma numerically for f(t)=et2f(t) = e^{-t^2}: plot f^(ξ)|\hat{f}(\xi)| and verify it decays to 0.


Exercise 2 * - Applying Properties

(a) Given f^(ξ)=eπξ2\hat{f}(\xi) = e^{-\pi\xi^2}, use the time-shift property to write the Fourier Transform of g(t)=f(t3)g(t) = f(t - 3).

(b) Given h^(ξ)=sinc(ξ)\hat{h}(\xi) = \operatorname{sinc}(\xi), find the Fourier Transform of h(t2)cos(4πt)h(t - 2)\cos(4\pi t). (Use shift + modulation.)

(c) If f(t)f(t) has f^(ξ)=g(ξ)\hat{f}(\xi) = g(\xi), what is the FT of f(3t+1)f(3t + 1)? (Use both shift and scaling.)

(d) Prove: if ff is real and even, then f^\hat{f} is real and even.


Exercise 3 * - Parseval's Relation Numerically

(a) For f(t)=eπt2f(t) = e^{-\pi t^2}, compute f22=f2dt\lVert f\rVert_2^2 = \int|f|^2\,dt analytically.

(b) Compute f^22\lVert\hat{f}\rVert_2^2 analytically and verify f22=f^22\lVert f\rVert_2^2 = \lVert\hat{f}\rVert_2^2.

(c) Numerically verify Parseval using the FFT: discretize ff on [5,5][-5, 5] with N=1024N = 1024 points, apply FFT, and compare fn2/N\sum|f_n|^2/N vs f^k2/N\sum|\hat{f}_k|^2/N.

(d) Use Parseval to evaluate 1(1+4π2t2)2dt\int_{-\infty}^\infty \frac{1}{(1 + 4\pi^2 t^2)^2}\,dt.


Exercise 4 ** - The Inversion Theorem

(a) For f(t)=rect(t)f(t) = \operatorname{rect}(t) (not in L1L^1 after FT since sincL1\operatorname{sinc} \notin L^1), show that the principal value integral limRRRsinc(ξ)e2πiξtdξ\lim_{R\to\infty}\int_{-R}^{R}\operatorname{sinc}(\xi)e^{2\pi i\xi t}\,d\xi converges to rect(t)\operatorname{rect}(t) for t1/2|t| \neq 1/2 by evaluating the integral explicitly.

(b) Show that at t=1/2t = 1/2 (the jump discontinuity), the principal value integral converges to 1/2=12[rect(1/2)+rect(1/2+)]1/2 = \frac{1}{2}[\operatorname{rect}(1/2^-) + \operatorname{rect}(1/2^+)].

(c) This mirrors Dirichlet's theorem from Section 20-01. Identify the precise analog: what plays the role of the Dirichlet kernel in the Fourier integral case?


Exercise 5 ** - Uncertainty Principle

(a) For f(t)=eπt2/σ2f(t) = e^{-\pi t^2/\sigma^2} (normalized to f2=1\lVert f\rVert_2 = 1 appropriately), compute Δt\Delta t (RMS time spread) and Δξ\Delta\xi (RMS frequency spread). Verify ΔtΔξ=1/(4π)\Delta t\cdot\Delta\xi = 1/(4\pi).

(b) For f(t)=rect(t)f(t) = \operatorname{rect}(t), compute Δt\Delta t and Δξ\Delta\xi numerically. Is the uncertainty bound satisfied? Is it tight?

(c) A signal designer wants to transmit a pulse with duration $\Delta t \leq 1\,\mu\text{s}$ using bandwidth $\Delta\xi \leq 100\,\text{kHz}$. Is this possible? What does the uncertainty principle say?

(d) Show that any signal of the form f(t)=Ceαt2e2πiξ0tf(t) = Ce^{-\alpha t^2}e^{2\pi i\xi_0 t} achieves equality in the uncertainty principle.


Exercise 6 ** - Tempered Distributions

(a) Verify F[δ(ta)](ξ)=e2πiξa\mathcal{F}[\delta(t - a)](\xi) = e^{-2\pi i\xi a} using the distributional definition T^,ϕ=T,ϕ^\langle\hat{T},\phi\rangle = \langle T,\hat{\phi}\rangle.

(b) Compute F[δ(t)](ξ)\mathcal{F}[\delta'(t)](\xi) (FT of the derivative of delta).

(c) Use the result from (a) to verify: F[cos(2πξ0t)](ξ)=12[δ(ξξ0)+δ(ξ+ξ0)]\mathcal{F}[\cos(2\pi\xi_0 t)](\xi) = \frac{1}{2}[\delta(\xi - \xi_0) + \delta(\xi + \xi_0)].

(d) Verify the Poisson summation formula numerically for f(t)=eπt2f(t) = e^{-\pi t^2}: compare n=NNf(n)\sum_{n=-N}^{N} f(n) with n=NNf^(n)\sum_{n=-N}^{N} \hat{f}(n) for N=5,10,20N = 5, 10, 20.


Exercise 7 *** - Random Fourier Features

(a) For the RBF kernel k(x,y)=exy2k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2} (setting σ=1\sigma = 1), the spectral density is p(ω)=(4π)d/2eω2/4p(\boldsymbol{\omega}) = (4\pi)^{-d/2}e^{-\lVert\boldsymbol{\omega}\rVert^2/4}. Draw D=500D = 500 frequencies ωjN(0,2I)\boldsymbol{\omega}_j \sim \mathcal{N}(\mathbf{0}, 2I) for d=2d = 2 and construct the random feature map ϕ(x)R2D\phi(\mathbf{x}) \in \mathbb{R}^{2D} (using cos and sin).

(b) Generate 100 random pairs (xi,yi)(\mathbf{x}_i, \mathbf{y}_i) with xi,yiN(0,I)\mathbf{x}_i, \mathbf{y}_i \sim \mathcal{N}(\mathbf{0}, I). For each pair, compute the exact kernel value k(xi,yi)k(\mathbf{x}_i, \mathbf{y}_i) and the RFF approximation ϕ(xi)ϕ(yi)\phi(\mathbf{x}_i)^\top\phi(\mathbf{y}_i). Plot the approximation vs. exact values and report RMSE.

(c) Study the approximation quality as a function of D{10,50,100,500,1000}D \in \{10, 50, 100, 500, 1000\}. How does RMSE scale with DD? Is it consistent with the theoretical bound E[error2]=O(D1)\mathbb{E}[\text{error}^2] = O(D^{-1})?

(d) Takeaway: explain in 2 sentences why RFF matters for LLM-scale systems where kernel computation would be infeasible.


Exercise 8 *** - FNet vs Attention Experiment

(a) Implement a minimal FNet mixing layer: given XRN×dX \in \mathbb{R}^{N \times d}, apply 2D FFT and take the real part. Implement a minimal attention mixing layer: Attn(X)=softmax(XX/d)X\text{Attn}(X) = \operatorname{softmax}(XX^\top/\sqrt{d})X.

(b) For a synthetic task - "copy the token from position kk to the output" (a task requiring precise position tracking) - construct training pairs (X,y)(X, y) and train both mixers with a linear head on top. Compare accuracy and training speed.

(c) For a synthetic task - "output the mean of all input tokens" (a global aggregation task) - repeat the comparison. Which mixer is better suited? Why?

(d) Takeaway: connect your results to the mathematical properties of the FT (fixed unparameterized mixing vs. learned data-dependent mixing) and explain what types of tasks each mixer is appropriate for.

11. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Fourier Transform as unitary operator (Plancherel) | FNet (Lee-Thorp et al., 2022) uses the FT as a token mixer - same information content, 7× faster than attention on GPU. The unitarity of the FT ensures no information loss during mixing. |
| Convolution-multiplication duality | The theoretical basis for $O(N\log N)$ convolution via FFT - the reason CNNs can be trained efficiently. FlashConv (Fu et al., 2023) and Hyena (Poli et al., 2023) extend this to long-range sequence models. |
| Modulation theorem (time-shift $\leftrightarrow$ frequency-shift) | Foundation of RoPE (Su et al., 2021) - multiplying embeddings by $e^{im\theta}$ shifts the position representation. Now standard in LLaMA-3, Mistral, Qwen, GPT-NeoX. |
| Uncertainty principle $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$ | Fundamental constraint on context length extension. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) must lower the frequencies $\theta_j$ to extend context - the UP dictates the minimum frequency needed. |
| Bochner's theorem + FT of spectral density | Random Fourier Features (Rahimi & Recht, 2007) reduce kernel computation from $O(N^2)$ to $O(ND)$. Performer (Choromanski et al., 2021) uses the same idea to get $O(N)$ attention. |
| Spectral norm $\lVert W\rVert_2 = \sigma_{\max}$ | Spectral normalization (Miyato et al., 2018) - dividing discriminator weights by their spectral norm enforces Lipschitz constraints. Standard in GAN training and used in diffusion model discriminators. |
| FT as basis for function space operators | Fourier Neural Operator (Li et al., 2021) - solves PDEs up to 1000× faster by learning in Fourier space. Used in climate modeling, fluid simulation, and materials science. |
| Differentiation in frequency domain ($f' \leftrightarrow 2\pi i\xi\hat{f}$) | Heat and wave equation solutions in closed form via FT. The heat kernel is a Gaussian - explains why diffusion models add Gaussian noise: white noise has a flat spectrum, perturbing all frequencies. |
| Gaussian as optimal time-frequency window | Gaussian attention (Tsai et al., 2019) - replacing dot-product attention with a Gaussian kernel gives an attention map with an explicit time-frequency interpretation. |
| Dirac comb + Poisson summation | Shannon sampling theorem - ensures that Whisper's mel-spectrogram preprocessing (16kHz sampling, 80-channel mel filterbank) captures all speech frequencies (up to 8kHz) without aliasing. |

12. Conceptual Bridge

Looking Back: From Fourier Series to Fourier Transform

The Fourier series (Section 20-01) established that periodic functions decompose into discrete harmonics. The key insight was geometric: the trigonometric functions {einx}\{e^{inx}\} form an orthonormal basis of L2[π,π]L^2[-\pi,\pi], and the Fourier coefficients are the projections of ff onto these basis vectors.

The Fourier Transform carries this insight to its natural conclusion. By letting the period TT \to \infty, the discrete harmonic sum ncne2πint/T\sum_n c_n e^{2\pi int/T} becomes the continuous integral f^(ξ)e2πiξtdξ\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi, and the discrete spectrum {cn}\{c_n\} becomes the continuous spectrum f^(ξ)\hat{f}(\xi). Everything from the Fourier series has an analog:

  • Orthonormality -> Plancherel's theorem (Section 5)
  • Parseval's identity -> Parseval's relation (Section 5.2)
  • Dirichlet's theorem on convergence -> Fourier inversion theorem (Section 4)
  • Gibbs phenomenon -> spectral leakage in DFT (Section 03)
  • Smoothness $\leftrightarrow$ spectral decay -> differentiation property (Section 3.5)

The distribution theory in Section 7 reveals the deepest unity: Fourier series is just the Fourier Transform restricted to periodic distributions. The Dirac comb with spacing TT transforms to a Dirac comb with spacing 1/T1/T - the very relationship between period and harmonic spacing that the Fourier series encodes.

Looking Forward: Three Branches

From the Fourier Transform, the curriculum branches in three directions:

Branch 1: Discretization -> Section 20-03 DFT and FFT. The continuous FT must be made computable. Discretizing both time and frequency leads to the DFT, and the Cooley-Tukey algorithm (1965) computes it in O(NlogN)O(N\log N) instead of O(N2)O(N^2). The FFT is arguably the most important algorithm in scientific computing - it is what makes spectrograms, MRI, and FNet feasible. The DFT inherits all properties of the continuous FT but adds complications: aliasing (from discretizing frequency), spectral leakage (from finite duration), and the circular convolution structure.

Branch 2: Convolution Theorem -> Section 20-04. The most important property of the FT for applications is that convolution in time equals multiplication in frequency. This converts the O(N2)O(N^2) convolution operation into O(NlogN)O(N\log N) FFT-based multiplication. The convolution theorem is the mathematical foundation of CNNs, WaveNet, S4/Mamba, and Hyena - all architectures where the key operation is a learned filter applied by convolution. The full theory of LTI systems, frequency response, filter design, and the Wiener-Khinchin theorem belongs there.

Branch 3: Time-Frequency Localization -> Section 20-05 Wavelets. The FT has a fundamental limitation: f^(ξ)|\hat{f}(\xi)| tells you which frequencies are present but not when they occur. The uncertainty principle shows this is not fixable - any attempt to localize in both time and frequency is bounded by ΔtΔξ1/(4π)\Delta t\cdot\Delta\xi \geq 1/(4\pi). Wavelets overcome this by accepting a principled tradeoff: high frequencies are analyzed at fine time resolution, low frequencies at coarse time resolution. This multi-resolution analysis (MRA) is the mathematical foundation of JPEG 2000, EEG analysis, and wavelet attention mechanisms.

POSITION IN THE FOURIER CURRICULUM
========================================================================

  Section 01 FOURIER SERIES                Section 02 FOURIER TRANSFORM (HERE)
  ---------------------             ---------------------------------
  Periodic functions                Aperiodic functions on R
  Discrete spectrum {c_n}           Continuous spectrum f^(xi)
  Parseval for series               Plancherel's theorem
  Dirichlet convergence             Inversion theorem
  Gibbs phenomenon                  Spectral leakage (-> Section 03)
  RoPE derivation                   Full uncertainty principle
                   v                         v
  +----------------------------------------------------------------+
  |              Section 02 -> THREE BRANCHES                              |
  |                                                                |
  |  Discretize:      Convolve:         Time-localize:            |
  |  Section 03 DFT/FFT  ->  Section 04 Convolution  ->  Section 05 Wavelets            |
  |  O(N log N)      CNNs, Mamba         MRA, multi-scale          |
  |  Whisper, FNO    WaveNet, Hyena      Scattering networks       |
  +----------------------------------------------------------------+
                          v
       Section 21 Statistical Learning Theory
       (spectral learning bounds, kernel methods)

========================================================================

Key Takeaways

The Fourier Transform is characterized by three central theorems:

  1. Plancherel - the FT is a unitary isometry on L2(R)L^2(\mathbb{R}): it preserves energy and inner products. The Fourier Transform changes how you describe a signal, not how much information it contains.

  2. Uncertainty - ΔtΔξ1/(4π)\Delta t\cdot\Delta\xi \geq 1/(4\pi): time and frequency concentration are fundamentally coupled. This is a theorem about analysis, not about physics, and it constrains every signal processing and ML system that operates in both domains simultaneously.

  3. Inversion - the FT is invertible: given f^\hat{f}, you can recover ff exactly. No information is lost in the Fourier domain representation - it is a complete, lossless change of basis.

Together, these theorems say: the Fourier Transform is an invertible, energy-preserving, mathematically constrained change of representation. It reveals structure (frequency content) that is invisible in the time domain, converts convolution to multiplication, and converts differentiation to algebra - making it the single most powerful analytical tool for understanding and processing signals, functions, and the weight matrices of neural networks.




Appendix A: Extended Properties and Derivations

A.1 The Duality Theorem in Full

One of the most elegant properties of the Fourier Transform is self-duality: applying the FT twice returns the time-reversed original.

Theorem A.1 (Duality). For fL1(R)f \in L^1(\mathbb{R}):

F[f^(t)](ξ)=f(ξ)\mathcal{F}[\hat{f}(t)](\xi) = f(-\xi)

Equivalently: if F[f](ξ)=g(ξ)\mathcal{F}[f](\xi) = g(\xi), then F[g](ξ)=f(ξ)\mathcal{F}[g](\xi) = f(-\xi).

Proof: F[f^(t)](ξ)=f^(t)e2πiξtdt=(f(s)e2πistds)e2πiξtdt\mathcal{F}[\hat{f}(t)](\xi) = \int\hat{f}(t)e^{-2\pi i\xi t}\,dt = \int\left(\int f(s)e^{-2\pi ist}\,ds\right)e^{-2\pi i\xi t}\,dt

Exchanging the order of integration and interpreting the inner integral distributionally (it is the $\mathcal{F}[1] = \delta$ identity): $= \int f(s)\left(\int e^{-2\pi i(s+\xi)t}\,dt\right)ds = \int f(s)\,\delta(s+\xi)\,ds = f(-\xi)$ $\square$

Corollary: Applying the FT four times: F4[f]=f\mathcal{F}^4[f] = f. The Fourier Transform has order 4 as an operator.

Using duality to compute new transforms:

If you know F[f](ξ)=g(ξ)\mathcal{F}[f](\xi) = g(\xi), then F[g](ξ)=f(ξ)\mathcal{F}[g](\xi) = f(-\xi).

Example: We know F[rect(t)](ξ)=sinc(ξ)\mathcal{F}[\operatorname{rect}(t)](\xi) = \operatorname{sinc}(\xi). By duality: F[sinc(t)](ξ)=rect(ξ)=rect(ξ)\mathcal{F}[\operatorname{sinc}(t)](\xi) = \operatorname{rect}(-\xi) = \operatorname{rect}(\xi) (since rect is even).

So F[sinc](ξ)=rect(ξ)\mathcal{F}[\operatorname{sinc}](\xi) = \operatorname{rect}(\xi): a sinc in time has a rectangular (bandlimited) spectrum. This is the ideal low-pass filter - a filter that passes frequencies ξ1/2|\xi| \leq 1/2 perfectly and rejects all others.

A.2 Analytic Signals and the Hilbert Transform

Definition A.1 (Analytic Signal). For a real signal f(t)f(t), its analytic signal is:

f+(t)=f(t)+iH[f](t)f_+(t) = f(t) + i\,\mathcal{H}[f](t)

where H[f]\mathcal{H}[f] is the Hilbert Transform: H[f](t)=1πP.V.f(τ)tτdτ\mathcal{H}[f](t) = \frac{1}{\pi}\,\text{P.V.}\int_{-\infty}^\infty \frac{f(\tau)}{t-\tau}\,d\tau.

Fourier domain characterization:

f^+(ξ)={2f^(ξ)ξ>0f^(0)ξ=00ξ<0\hat{f}_+(\xi) = \begin{cases}2\hat{f}(\xi) & \xi > 0 \\ \hat{f}(0) & \xi = 0 \\ 0 & \xi < 0\end{cases}

The analytic signal has only non-negative frequencies: it is a one-sided spectrum signal. This is achieved by zeroing out the negative frequencies and doubling the positive ones - a frequency-domain operation.

Why this matters: The analytic signal enables the definition of instantaneous frequency ω(t)=ddtargf+(t)\omega(t) = \frac{d}{dt}\arg f_+(t) - the frequency of the signal at a given instant. This is essential for frequency-modulated signals (FM radio, chirps) and is used in empirical mode decomposition for analyzing non-stationary signals like EEG and speech.

Hilbert Transform in spectral terms: F[H[f]](ξ)=isgn(ξ)f^(ξ)\mathcal{F}[\mathcal{H}[f]](\xi) = -i\,\operatorname{sgn}(\xi)\,\hat{f}(\xi). The Hilbert Transform applies a π/2-\pi/2 phase shift to positive frequencies and +π/2+\pi/2 to negative frequencies - it is a "90-degree phase rotator."
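The one-sided-spectrum rule translates directly into an FFT recipe. A minimal sketch (mirroring what scipy.signal.hilbert computes; the even-$N$ Nyquist-bin handling follows the discrete convention):

```python
import numpy as np

def analytic_signal(f):
    """Analytic signal via FFT (sketch): zero negative bins, double positive ones."""
    N = len(f)
    F = np.fft.fft(f)
    H = np.zeros(N)
    H[0] = 1.0                        # DC bin kept as-is
    if N % 2 == 0:
        H[1:N // 2] = 2.0             # positive frequencies doubled
        H[N // 2] = 1.0               # Nyquist bin kept as-is
    else:
        H[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(F * H)         # complex-valued analytic signal

t = np.linspace(0, 1, 400, endpoint=False)
f = np.cos(2 * np.pi * 10 * t)                 # real pure tone
f_plus = analytic_signal(f)
print(np.allclose(f_plus.real, f))             # True: real part recovers f
print(np.allclose(np.abs(f_plus), 1.0))        # True: unit envelope, as expected
```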

A.3 The Fourier Transform on Rd\mathbb{R}^d

The Fourier Transform extends naturally to dd dimensions. For f:RdCf: \mathbb{R}^d \to \mathbb{C} with fL1(Rd)f \in L^1(\mathbb{R}^d):

f^(ξ)=Rdf(t)e2πiξtdt\hat{f}(\boldsymbol{\xi}) = \int_{\mathbb{R}^d} f(\mathbf{t})\,e^{-2\pi i \boldsymbol{\xi}^\top\mathbf{t}}\,d\mathbf{t}

where ξRd\boldsymbol{\xi} \in \mathbb{R}^d is the frequency vector. All properties generalize:

  • Shift: F[f(ta)](ξ)=e2πiξaf^(ξ)\mathcal{F}[f(\mathbf{t} - \mathbf{a})](\boldsymbol{\xi}) = e^{-2\pi i\boldsymbol{\xi}^\top\mathbf{a}}\hat{f}(\boldsymbol{\xi})
  • Scaling by matrix AA: F[f(At)](ξ)=1detAf^(Aξ)\mathcal{F}[f(A\mathbf{t})](\boldsymbol{\xi}) = \frac{1}{|\det A|}\hat{f}(A^{-\top}\boldsymbol{\xi})
  • Differentiation: F[tjf](ξ)=2πiξjf^(ξ)\mathcal{F}[\partial_{t_j}f](\boldsymbol{\xi}) = 2\pi i\xi_j\,\hat{f}(\boldsymbol{\xi})
  • Plancherel: f^L2(Rd)=fL2(Rd)\lVert\hat{f}\rVert_{L^2(\mathbb{R}^d)} = \lVert f\rVert_{L^2(\mathbb{R}^d)}

For AI: The 2D Fourier Transform is used in convolutional networks for images. A 2D convolution with filter hh is, in Fourier space, pointwise multiplication: y^(ξ)=x^(ξ)h^(ξ)\hat{y}(\boldsymbol{\xi}) = \hat{x}(\boldsymbol{\xi})\cdot\hat{h}(\boldsymbol{\xi}). For FNet (Section 8.1), the 2D FT is applied over both the sequence dimension and the embedding dimension.

Radial functions: For radially symmetric $f(\mathbf{t}) = f_0(\lVert\mathbf{t}\rVert)$, the Fourier Transform is also radial: $\hat{f}(\boldsymbol{\xi}) = \hat{f}_0(\lVert\boldsymbol{\xi}\rVert)$. The RBF kernel $k(\mathbf{x}-\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2}$ is radially symmetric, so its spectral density $p(\boldsymbol{\omega}) = c_d e^{-\lVert\boldsymbol{\omega}\rVert^2/4}$ is also radially symmetric - which is why sampling $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2I)$ works for RFF with this kernel (matching Exercise 7).

A.4 The Fourier Transform and PDEs

The differentiation property F[f(n)](ξ)=(2πiξ)nf^(ξ)\mathcal{F}[f^{(n)}](\xi) = (2\pi i\xi)^n\hat{f}(\xi) converts partial differential equations into algebraic equations in frequency space. This is the primary tool for solving linear constant-coefficient PDEs.

Example 1: Heat Equation. Solve tu=κxxu\partial_t u = \kappa\partial_{xx}u, u(x,0)=u0(x)u(x,0) = u_0(x).

Taking the Fourier Transform in xx:

tu^(ξ,t)=κ(2πiξ)2u^(ξ,t)=4π2κξ2u^(ξ,t)\partial_t\hat{u}(\xi,t) = \kappa(2\pi i\xi)^2\hat{u}(\xi,t) = -4\pi^2\kappa\xi^2\hat{u}(\xi,t)

This is a first-order ODE in tt for each fixed ξ\xi: u^(ξ,t)=u^0(ξ)e4π2κξ2t\hat{u}(\xi,t) = \hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}.

Inverting: u(x,t)=u^0(ξ)e4π2κξ2te2πiξxdξ=(u0Gt)(x)u(x,t) = \int\hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}\,e^{2\pi i\xi x}\,d\xi = (u_0 * G_t)(x)

where Gt(x)=14πκtex2/(4κt)G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}e^{-x^2/(4\kappa t)} is the Gaussian heat kernel (a Gaussian with time-dependent width σ=2κt\sigma = \sqrt{2\kappa t}).

Interpretation: Heat diffusion = convolution with a Gaussian. High frequencies ξ\xi decay as e4π2κξ2te^{-4\pi^2\kappa\xi^2 t} - much faster than low frequencies. Sharp features (high frequency content) are smoothed out rapidly; broad features (low frequency) persist.
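The spectral solution is a one-liner with the FFT: transform, multiply each mode by its decay factor, transform back. A minimal sketch on a periodic domain (the grid parameters are illustrative):

```python
import numpy as np

# Heat equation via the FT (a sketch): each mode decays as exp(-4 pi^2 kappa xi^2 t).
N, L, kappa, t = 512, 10.0, 0.5, 1.0
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
u0 = (np.abs(x) < 1.0).astype(float)       # sharp initial pulse (high frequencies)
xi = np.fft.fftfreq(N, d=L / N)            # frequencies in the xi-convention
u_hat = np.fft.fft(u0) * np.exp(-4 * np.pi**2 * kappa * xi**2 * t)
u = np.fft.ifft(u_hat).real                # smoothed profile at time t
print(u0.max(), round(u.max(), 3))         # peak drops: sharp features smoothed out
```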

For AI - Diffusion Models: The DDPM forward process q(xtx0)=N(xt;αˉtx0,(1αˉt)I)q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I) is exactly discrete heat diffusion: the signal is progressively smoothed by adding Gaussian noise (which has a flat spectrum = all frequencies). The network learns to reverse this process. The Fourier perspective explains why the early denoising steps (low noise) must recover high-frequency details while later steps (high noise) recover the coarse structure - the exact inverse of heat diffusion.

Example 2: Wave Equation. Solve ttu=c2xxu\partial_{tt}u = c^2\partial_{xx}u, u(x,0)=u0(x)u(x,0) = u_0(x), tu(x,0)=v0(x)\partial_t u(x,0) = v_0(x).

FT in xx: ttu^=(2πiξ)2c2u^=4π2c2ξ2u^\partial_{tt}\hat{u} = (2\pi i\xi)^2 c^2\hat{u} = -4\pi^2c^2\xi^2\hat{u}

Solution: u^(ξ,t)=u^0(ξ)cos(2πcξt)+v^0(ξ)2πicξsin(2πcξt)\hat{u}(\xi,t) = \hat{u}_0(\xi)\cos(2\pi c|\xi|t) + \frac{\hat{v}_0(\xi)}{2\pi ic|\xi|}\sin(2\pi c|\xi|t)

By d'Alembert's formula (inverting): u(x,t)=12[u0(x+ct)+u0(xct)]+12cxctx+ctv0(s)dsu(x,t) = \frac{1}{2}[u_0(x+ct) + u_0(x-ct)] + \frac{1}{2c}\int_{x-ct}^{x+ct}v_0(s)\,ds

The wave equation propagates each frequency at the same speed cc - no dispersion. In contrast, the heat equation attenuates high frequencies faster than low frequencies - strong dispersion (dissipation).

Example 3: Schrödinger Equation. The free-particle Schrödinger equation $i\hbar\partial_t\psi = -\frac{\hbar^2}{2m}\partial_{xx}\psi$ is formally a heat equation with imaginary time. Its solution $\hat{\psi}(\xi,t) = \hat{\psi}_0(\xi)e^{-i\hbar(2\pi\xi)^2t/(2m)}$ shows that each frequency component oscillates (rather than decays) with angular frequency $\hbar(2\pi\xi)^2/(2m)$ - quantum dispersion. The uncertainty principle $\Delta x\cdot\Delta p \geq \hbar/2$ (where $p = h\xi$ by de Broglie) is the physics version of the mathematical uncertainty principle in Section 6.

A.5 Windowed Fourier Transform (STFT)

The continuous Fourier Transform gives global frequency information but no time localization: f^(ξ)|\hat{f}(\xi)| tells you whether frequency ξ\xi is present somewhere in ff, but not when.

Definition A.2 (Short-Time Fourier Transform / Spectrogram). For a window function gL2(R)g \in L^2(\mathbb{R}) and signal fL2(R)f \in L^2(\mathbb{R}):

STFTf(t,ξ)=f(τ)g(τt)e2πiξτdτ\text{STFT}_f(t, \xi) = \int_{-\infty}^{\infty} f(\tau)\,\overline{g(\tau - t)}\,e^{-2\pi i\xi\tau}\,d\tau

This localizes the analysis around time tt using the window gg. The spectrogram is STFTf(t,ξ)2|\text{STFT}_f(t,\xi)|^2 - a time-frequency power map.

The uncertainty principle applies: The time resolution Δt\Delta t and frequency resolution Δξ\Delta\xi of the STFT are determined by the window gg:

  • Narrow window gg -> good time resolution, poor frequency resolution
  • Wide window gg -> poor time resolution, good frequency resolution

The Gaussian window g(t)=eπt2/σ2g(t) = e^{-\pi t^2/\sigma^2} minimizes the product ΔtΔξ\Delta t\cdot\Delta\xi for a given σ\sigma, making it the optimal STFT window (the resulting time-frequency representation is called the Gabor transform).

For AI - Audio Models: Whisper (Radford et al., 2022) processes audio as mel-spectrograms: it applies a 25ms STFT window (adequate time resolution for phoneme detection) with 10ms hop, then maps the frequency axis to the mel scale (logarithmic, matching human perception), and feeds the resulting 2D time-frequency "image" to a Transformer encoder. The STFT parameters are fixed - the uncertainty principle determines the time-frequency resolution tradeoff.

STFT TIME-FREQUENCY TRADEOFF (UNCERTAINTY PRINCIPLE IN PRACTICE)
========================================================================

  Short window (e.g., 2ms):            Long window (e.g., 50ms):
  -------------------------            --------------------------
  Fine time resolution (~2ms)          Coarse time resolution (~50ms)
  Poor freq resolution (~500 Hz)       Fine freq resolution (~20 Hz)
  Good for: transients, clicks         Good for: sustained tones, pitch

  Whisper (speech): 25ms window -> resolves individual phonemes (50-100ms)
  while achieving ~40Hz frequency resolution (enough for pitch/formants)

  The Gaussian window achieves the minimum Δt·Δξ product - it is the
  optimal window under the uncertainty principle.

========================================================================
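A minimal STFT sketch in numpy (Gaussian window; the window length, hop, and chirp test signal are illustrative choices - production code would use scipy.signal.stft or torchaudio):

```python
import numpy as np

def stft(f, win_len=256, hop=64, fs=1.0):
    """Spectrogram via a sliding Gaussian window (a sketch)."""
    n = np.arange(win_len)
    g = np.exp(-np.pi * ((n - win_len / 2) / (win_len / 4))**2)  # Gaussian window
    starts = np.arange(0, len(f) - win_len + 1, hop)
    frames = np.stack([f[s:s + win_len] * g for s in starts])
    S = np.fft.rfft(frames, axis=1)                 # one spectrum per frame
    times = (starts + win_len / 2) / fs
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    return np.abs(S)**2, times, freqs               # power map and its axes

fs = 8000.0
t = np.arange(0, 1.0, 1.0 / fs)
chirp = np.sin(2 * np.pi * (200 * t + 1500 * t**2))   # inst. freq 200 + 3000 t Hz
spec, times, freqs = stft(chirp, fs=fs)
peak = freqs[spec.argmax(axis=1)]       # the ridge tracks the rising frequency
print(spec.shape, peak[0], peak[-1])    # low at the start, high at the end
```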

A.6 Fourier Transform and Operator Theory

From the perspective of functional analysis, the Fourier Transform is a unitary operator on the Hilbert space L2(R)L^2(\mathbb{R}). This abstract viewpoint connects Fourier analysis to the broader theory of self-adjoint operators.

The Fourier operator F\mathcal{F}: F:L2(R)L2(R)\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R}) is unitary: FF=FF=I\mathcal{F}^*\mathcal{F} = \mathcal{F}\mathcal{F}^* = I.

Since F4=I\mathcal{F}^4 = I, the eigenvalues satisfy λ4=1\lambda^4 = 1: λ{1,1,i,i}\lambda \in \{1, -1, i, -i\}.

The spectral decomposition of F\mathcal{F} in terms of its eigenspaces: L2(R)=E1E1EiEiL^2(\mathbb{R}) = \mathcal{E}_1 \oplus \mathcal{E}_{-1} \oplus \mathcal{E}_i \oplus \mathcal{E}_{-i}, where each Eλ\mathcal{E}_\lambda is spanned by the Hermite functions with the appropriate eigenvalue.

Connection to quantum mechanics: The position operator X^f(t)=tf(t)\hat{X}f(t) = tf(t) and momentum operator P^f(t)=12πidfdt\hat{P}f(t) = \frac{1}{2\pi i}\frac{df}{dt} (in natural units) are both self-adjoint operators on L2(R)L^2(\mathbb{R}), related by P^=FX^F1\hat{P} = \mathcal{F}\hat{X}\mathcal{F}^{-1}. The commutator [X^,P^]=i2πI[\hat{X},\hat{P}] = \frac{i}{2\pi}I implies the uncertainty principle ΔXΔP14π\Delta X\cdot\Delta P \geq \frac{1}{4\pi}. The mathematical identity and the physics are the same theorem.

For AI: The language of operator theory is increasingly used to analyze neural networks. The weight matrix WW of a linear layer is a linear operator; its spectral norm W2\lVert W\rVert_2 is its "size" as an operator; its singular value decomposition expresses it in terms of rank-1 operators. The FT connects linear operator theory to signal processing, unifying the analysis of both classical filters and neural network layers.


Appendix B: Worked Examples and Techniques

B.1 The Fourier Transform Table - Complete Reference

The following table lists the most important Fourier Transform pairs, using the ξ\xi-convention f^(ξ)=f(t)e2πiξtdt\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt. All functions are assumed in L1L2L^1 \cap L^2 unless otherwise noted.

| # | $f(t)$ | $\hat{f}(\xi)$ | Notes |
|---|--------|----------------|-------|
| 1 | $e^{-\pi t^2}$ | $e^{-\pi\xi^2}$ | Self-dual Gaussian |
| 2 | $e^{-\pi t^2/\sigma^2}$ | $\sigma e^{-\pi\sigma^2\xi^2}$ | Scaled Gaussian |
| 3 | $e^{-a\lvert t\rvert}$, $a>0$ | $\frac{2a}{a^2+(2\pi\xi)^2}$ | Double-sided exponential |
| 4 | $e^{-at}\mathbb{1}_{t\geq0}$, $a>0$ | $\frac{1}{a+2\pi i\xi}$ | One-sided exponential |
| 5 | $\operatorname{rect}(t/T)$ | $T\operatorname{sinc}(T\xi)$ | Rectangular pulse of width $T$ |
| 6 | $\operatorname{sinc}(Wt)$ | $\frac{1}{W}\operatorname{rect}(\xi/W)$ | Ideal low-pass filter (by duality) |
| 7 | $\Lambda(t)=\max(1-\lvert t\rvert,0)$ | $\operatorname{sinc}^2(\xi)$ | Triangle function |
| 8 | $\frac{1}{1+(2\pi t)^2}$ | $\frac{1}{2}e^{-\lvert\xi\rvert}$ | Lorentzian / Cauchy |
| 9 | $\delta(t)$ | $1$ | Dirac impulse -> white spectrum |
| 10 | $1$ | $\delta(\xi)$ | Constant -> DC spike (distributional) |
| 11 | $\delta(t-a)$ | $e^{-2\pi i\xi a}$ | Shifted impulse |
| 12 | $e^{2\pi i\xi_0 t}$ | $\delta(\xi-\xi_0)$ | Pure tone (distributional) |
| 13 | $\cos(2\pi\xi_0 t)$ | $\frac{1}{2}[\delta(\xi-\xi_0)+\delta(\xi+\xi_0)]$ | Cosine |
| 14 | $\sin(2\pi\xi_0 t)$ | $\frac{1}{2i}[\delta(\xi-\xi_0)-\delta(\xi+\xi_0)]$ | Sine |
| 15 | $\sum_{n=-\infty}^\infty\delta(t-nT)$ | $\frac{1}{T}\sum_{n=-\infty}^\infty\delta(\xi-n/T)$ | Dirac comb |
| 16 | $\operatorname{sgn}(t)$ | $\frac{1}{\pi i\xi}$ (distributional) | Sign function |
| 17 | $\mathbb{1}_{t\geq0}$ (unit step) | $\frac{1}{2}\delta(\xi)+\frac{1}{2\pi i\xi}$ | Heaviside step (distributional) |
| 18 | $t^n e^{-at}$, $a>0$, $t\geq0$ | $\frac{n!}{(a+2\pi i\xi)^{n+1}}$ | Polynomial exponential |

B.2 Four Full Worked Examples

Worked Example 1: Triangular Pulse

The triangular function Λ(t)=max(1t,0)\Lambda(t) = \max(1 - |t|, 0) is zero outside [1,1][-1, 1]. We compute Λ^\hat{\Lambda} directly.

Λ^(ξ)=10(1+t)e2πiξtdt+01(1t)e2πiξtdt\hat{\Lambda}(\xi) = \int_{-1}^{0}(1+t)e^{-2\pi i\xi t}\,dt + \int_0^1(1-t)e^{-2\pi i\xi t}\,dt

For the first integral, let u=tu = -t:

10(1+t)e2πiξtdt=01(1u)e2πiξudu\int_{-1}^{0}(1+t)e^{-2\pi i\xi t}\,dt = \int_0^1(1-u)e^{2\pi i\xi u}\,du

Adding both integrals: Λ^(ξ)=01(1t)(e2πiξt+e2πiξt)dt=201(1t)cos(2πξt)dt\hat{\Lambda}(\xi) = \int_0^1(1-t)(e^{2\pi i\xi t}+e^{-2\pi i\xi t})\,dt = 2\int_0^1(1-t)\cos(2\pi\xi t)\,dt.

Integrating by parts:

=2[(1t)sin(2πξt)2πξ]01+201sin(2πξt)2πξdt=0+1πξ[cos(2πξt)2πξ]01=1cos(2πξ)2π2ξ2= 2\left[\frac{(1-t)\sin(2\pi\xi t)}{2\pi\xi}\right]_0^1 + 2\int_0^1\frac{\sin(2\pi\xi t)}{2\pi\xi}\,dt = 0 + \frac{1}{\pi\xi}\left[-\frac{\cos(2\pi\xi t)}{2\pi\xi}\right]_0^1 = \frac{1-\cos(2\pi\xi)}{2\pi^2\xi^2}

Using 1cosθ=2sin2(θ/2)1 - \cos\theta = 2\sin^2(\theta/2):

Λ^(ξ)=2sin2(πξ)2π2ξ2=(sin(πξ)πξ)2=sinc2(ξ)\hat{\Lambda}(\xi) = \frac{2\sin^2(\pi\xi)}{2\pi^2\xi^2} = \left(\frac{\sin(\pi\xi)}{\pi\xi}\right)^2 = \operatorname{sinc}^2(\xi)

The triangular pulse is the FT of sinc2\operatorname{sinc}^2. Alternatively: Λ=rectrect\Lambda = \operatorname{rect} * \operatorname{rect} (convolution of two rect functions), so Λ^=rect^rect^=sinc2\hat{\Lambda} = \hat{\operatorname{rect}}\cdot\hat{\operatorname{rect}} = \operatorname{sinc}^2 by the convolution theorem. The sinc2\operatorname{sinc}^2 spectrum decays faster than sinc\operatorname{sinc} (as 1/ξ21/\xi^2 vs 1/ξ1/\xi) because Λ\Lambda is continuous (no jumps), while rect\operatorname{rect} has jump discontinuities.

Worked Example 2: Gaussian via Differential Equation

An elegant alternative proof that F[eπt2]=eπξ2\mathcal{F}[e^{-\pi t^2}] = e^{-\pi\xi^2} uses the fact that f(t)=eπt2f(t) = e^{-\pi t^2} satisfies f(t)=2πtf(t)f'(t) = -2\pi t\,f(t).

Taking the FT of both sides and using the differentiation rules:

  • LHS: $\mathcal{F}[f'](\xi) = 2\pi i\xi\,\hat{f}(\xi)$
  • RHS: by differentiation in frequency, $\mathcal{F}[tf(t)](\xi) = \frac{1}{-2\pi i}\frac{d}{d\xi}\hat{f}(\xi) = \frac{i}{2\pi}\hat{f}'(\xi)$, so $\mathcal{F}[-2\pi t\,f(t)](\xi) = -2\pi\cdot\frac{i}{2\pi}\hat{f}'(\xi) = -i\hat{f}'(\xi)$

Equating the two sides: $2\pi i\xi\,\hat{f}(\xi) = -i\hat{f}'(\xi)$.

This gives the ODE: f^(ξ)=2πξf^(ξ)\hat{f}'(\xi) = -2\pi\xi\,\hat{f}(\xi), with solution f^(ξ)=Ceπξ2\hat{f}(\xi) = C\,e^{-\pi\xi^2}.

To find CC: f^(0)=eπt2dt=1\hat{f}(0) = \int e^{-\pi t^2}\,dt = 1 (Gaussian integral). So C=1C = 1 and f^(ξ)=eπξ2\hat{f}(\xi) = e^{-\pi\xi^2}. \square

This proof is superior for teaching: it shows that the Gaussian's self-duality follows from the fact that eπt2e^{-\pi t^2} satisfies the simplest possible first-order ODE, and the FT converts this ODE into the same ODE in frequency space.

Worked Example 3: FT of eate^{-a|t|} and a Contour Argument

f^(ξ)=eate2πiξtdt=0eate2πiξtdt+0eate2πiξtdt\hat{f}(\xi) = \int_{-\infty}^\infty e^{-a|t|}e^{-2\pi i\xi t}\,dt = \int_0^\infty e^{-at}e^{-2\pi i\xi t}\,dt + \int_{-\infty}^0 e^{at}e^{-2\pi i\xi t}\,dt

For the first integral (t>0t > 0):

0e(a+2πiξ)tdt=1a+2πiξ(valid for Re(a+2πiξ)=a>0)\int_0^\infty e^{-(a+2\pi i\xi)t}\,dt = \frac{1}{a+2\pi i\xi} \quad \text{(valid for } \operatorname{Re}(a+2\pi i\xi) = a > 0\text{)}

For the second integral, let u=tu = -t (u>0u > 0, t<0t < 0):

0eaue2πiξudu=1a2πiξ\int_0^\infty e^{-au}e^{2\pi i\xi u}\,du = \frac{1}{a-2\pi i\xi}

Therefore:

f^(ξ)=1a+2πiξ+1a2πiξ=(a2πiξ)+(a+2πiξ)a2+(2πξ)2=2aa2+4π2ξ2\hat{f}(\xi) = \frac{1}{a+2\pi i\xi} + \frac{1}{a-2\pi i\xi} = \frac{(a-2\pi i\xi)+(a+2\pi i\xi)}{a^2+(2\pi\xi)^2} = \frac{2a}{a^2+4\pi^2\xi^2}

This is a Lorentzian (Cauchy-type) profile centered at $\xi = 0$ with half-width $a/(2\pi)$. As $a \to 0^+$, the time-domain signal $e^{-a|t|}$ approaches the constant $1$, while its transform $\frac{2a}{a^2+4\pi^2\xi^2}$ becomes a nascent delta: it has unit integral for every $a > 0$ (substituting $u = 2\pi\xi$ reduces it to the Poisson kernel $\frac{1}{\pi}\frac{a}{a^2+u^2} \to \delta(u)$) and concentrates at $\xi = 0$. The limit confirms $\mathcal{F}[1] = \delta$.

Worked Example 4: The FT and Convolution for the Heat Equation Kernel

The heat kernel at time t>0t > 0 is Gt(x)=14πκtex2/(4κt)G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}e^{-x^2/(4\kappa t)}. We verify F[Gt](ξ)=e4π2κtξ2\mathcal{F}[G_t](\xi) = e^{-4\pi^2\kappa t\xi^2} using the scaling theorem.

We know $\mathcal{F}[e^{-\pi x^2}](\xi) = e^{-\pi\xi^2}$; the scaling theorem then gives the general Gaussian pair $\mathcal{F}[e^{-\alpha x^2}](\xi) = \sqrt{\pi/\alpha}\,e^{-\pi^2\xi^2/\alpha}$ for $\alpha > 0$.

Write $G_t(x) = ce^{-\alpha x^2}$ with $c = (4\pi\kappa t)^{-1/2}$ and $\alpha = 1/(4\kappa t)$. Then:

F[Gt](ξ)=cπαeπ2ξ2/α=14πκt4πκteπ2ξ24κt=e4π2κtξ2\mathcal{F}[G_t](\xi) = c\sqrt{\frac{\pi}{\alpha}}e^{-\pi^2\xi^2/\alpha} = \frac{1}{\sqrt{4\pi\kappa t}}\cdot\sqrt{4\pi\kappa t}\cdot e^{-\pi^2\xi^2\cdot 4\kappa t} = e^{-4\pi^2\kappa t\xi^2}

This confirms that in the ξ\xi-convention, the heat kernel's FT is e4π2κtξ2e^{-4\pi^2\kappa t\xi^2} - exactly the solution we derived from the heat equation in Section A.4.

B.3 Decay Rates and Smoothness: The Complete Picture

The relationship between smoothness of ff and decay of f^\hat{f} is one of the deepest qualitative results in Fourier analysis. The complete picture:

Theorem B.1 (Smoothness Spectral Decay). Let fL1(R)f \in L^1(\mathbb{R}).

| Condition on $f$ | Decay of $\lvert\hat{f}(\xi)\rvert$ as $\lvert\xi\rvert \to \infty$ |
|------------------|---------------------------------------------------------------------|
| $f$ has a jump discontinuity | $O(1/\lvert\xi\rvert)$ |
| $f \in C^1$ (continuous, one derivative) | $o(1/\lvert\xi\rvert)$ |
| $f \in C^k$, $f^{(k)} \in L^1$ | $O(1/\lvert\xi\rvert^k)$ |
| $f \in C^\infty$ | Faster than any polynomial: $O(1/\lvert\xi\rvert^N)$ for all $N$ |
| $f$ analytic (holomorphic in a strip $\lvert\operatorname{Im}(t)\rvert < \delta$) | Exponential decay: $O(e^{-2\pi\delta\lvert\xi\rvert})$ |

Proof sketch: For the $k$-derivative case: $|\hat{f}(\xi)| = |\mathcal{F}[f^{(k)}](\xi)|/|2\pi i\xi|^k \leq \lVert f^{(k)}\rVert_1/(2\pi|\xi|)^k$. For the analytic case, shift the contour of integration to $\operatorname{Im}(t) = \mp\delta$ (for $\xi \gtrless 0$), gaining an exponential factor $e^{-2\pi|\xi|\delta}$. $\square$

Applications:

  • Signal compression: Smooth signals (like audio with no sharp transients) have rapidly decaying spectra -> can be compressed by truncating high frequencies with small error.
  • Neural network generalization: Smooth target functions (natural image statistics) have rapidly decaying spectra -> a network with limited bandwidth can approximate them well. This connects to the spectral bias of neural networks (Rahaman et al., 2019): networks preferentially learn low-frequency components first.
  • Discretization error: The aliasing error when sampling a signal at rate TT is bounded by n0f^(ξ+n/T)C/Tk\sum_{n \neq 0} |\hat{f}(\xi + n/T)| \leq C/T^k if fCkf \in C^k - smoother signals alias less.
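The decay rates in the table are easy to observe numerically. A minimal sketch comparing a jump (rect), a kink (triangle), and a smooth Gaussian, probed at off-integer frequencies to dodge the sinc zeros:

```python
import numpy as np

# Spectral decay vs smoothness (a sketch): rect ~ 1/|xi|, triangle ~ 1/|xi|^2,
# Gaussian faster than any polynomial (it quickly hits the FFT noise floor).
N, L = 2**14, 64.0
t = np.linspace(-L / 2, L / 2, N, endpoint=False)
dt = L / N
xi = np.fft.rfftfreq(N, d=dt)
signals = {"rect": (np.abs(t) <= 0.5).astype(float),
           "triangle": np.maximum(1 - np.abs(t), 0.0),
           "gaussian": np.exp(-np.pi * t**2)}
idx = np.searchsorted(xi, [2.5, 5.5, 10.5])    # off-integer probe frequencies
for name, f in signals.items():
    # |FFT| * dt approximates |f_hat|; the grid offset only changes the phase
    F = np.abs(np.fft.rfft(f)) * dt
    print(f"{name:9s}", F[idx])
```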

B.4 The Fourier Transform in Different Function Spaces

A complete picture of where the FT lives:

DOMAIN OF DEFINITION - FOURIER TRANSFORM
========================================================================

  L^1(R)      ->  C_0(R)        (bounded, continuous, vanishes at infinity)
                  ||f^||_inf <= ||f||_1

  L^2(R)      ->  L^2(R)        (Plancherel: FT is unitary)
                  ||f^||_2 = ||f||_2

  L^1 ∩ L^2   ->  L^2           (both apply; easiest to work with)

  S(R)        ->  S(R)          (Schwartz space: FT preserves rapidly
                  (bijection)    decreasing smooth functions)

  S'(R)       ->  S'(R)         (tempered distributions: largest domain)
                  (bijection)    includes δ, constants, periodic signals

  Key inclusions:  S ⊂ L^1 ∩ L^2 ⊂ L^1 ⊂ S'
                   S ⊂ L^1 ∩ L^2 ⊂ L^2 ⊂ S'

  For practical computation: work in S' (distributions),
  which unifies all cases under one theory.

========================================================================

B.5 Numerical Computation of the Fourier Transform

In practice, the Fourier Transform is computed numerically via the DFT/FFT (Section 20-03). Here we note the key relationships.

Approximation scheme. To numerically approximate f^(ξ)\hat{f}(\xi) for fL1(R)f \in L^1(\mathbb{R}):

  1. Truncate the signal to [L,L][-L, L] for large enough LL (error: t>Lf(t)dt\int_{|t|>L}|f(t)|\,dt)
  2. Sample at rate N/(2L)N/(2L) to get NN equally-spaced samples fn=f(nT)f_n = f(nT) where T=2L/NT = 2L/N
  3. Apply DFT: f^k=Tn=0N1fne2πink/N\hat{f}_k = T\sum_{n=0}^{N-1}f_n e^{-2\pi ink/N}
  4. The result approximates f^(k/(2L))\hat{f}(k/(2L)) for k=0,1,,N/2k = 0, 1, \ldots, N/2.
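A minimal sketch of this scheme for the self-dual Gaussian, where the exact answer $e^{-\pi\xi^2}$ is available for comparison (the $(-1)^k$-type phase factor compensates for the grid starting at $-L$ rather than $0$):

```python
import numpy as np

# Numerical FT of f(t) = exp(-pi t^2) via truncation + sampling + DFT (a sketch)
L, N = 8.0, 1024
T = 2 * L / N                              # sampling interval (step 2)
t = -L + T * np.arange(N)                  # grid covering [-L, L)
f = np.exp(-np.pi * t**2)
k = np.arange(N // 2)
xi = k / (2 * L)                           # frequencies resolved by the DFT (step 4)
f_hat = T * np.exp(1j * np.pi * k) * np.fft.fft(f)[:N // 2]   # step 3, with phase
print(np.abs(f_hat - np.exp(-np.pi * xi**2)).max())  # ~1e-15: machine precision
```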

Sources of error:

  • Aliasing: Undersampling causes high-frequency components to fold back - reduced by sampling faster (higher NN)
  • Leakage: Truncation at ±L\pm L multiplies by a rect function -> convolves spectrum with sinc -> spectral leakage. Reduced by windowing (multiplying by a smooth window before DFT)
  • Resolution: The frequency resolution is 1/(2L)1/(2L) - the smallest frequency distinguishable in the DFT is 1/(2L)1/(2L). For finer resolution, use longer time windows.

All three error types are direct consequences of the Fourier Transform properties developed in this section. Full treatment in Section 20-03.


Appendix C: Connections, Extensions, and Deep Dives

C.1 The Fourier Transform on Locally Compact Abelian Groups

The Fourier Transform generalizes far beyond R\mathbb{R}. For any locally compact abelian (LCA) group GG with Haar measure μ\mu, there is a Fourier Transform F:L1(G)C0(G^)\mathcal{F}: L^1(G) \to C_0(\hat{G}) where G^\hat{G} is the Pontryagin dual group (the group of continuous homomorphisms GS1G \to S^1).

| Group $G$ | Dual $\hat{G}$ | Fourier Transform | Name |
|-----------|----------------|-------------------|------|
| $\mathbb{R}$ | $\mathbb{R}$ | $\int f(t)e^{-2\pi i\xi t}\,dt$ | Fourier integral |
| $\mathbb{Z}$ | $S^1 = \mathbb{R}/\mathbb{Z}$ | $\sum_n f(n)e^{-2\pi i\xi n}$ | Discrete-time FT (DTFT) |
| $S^1$ | $\mathbb{Z}$ | $\frac{1}{2\pi}\int_0^{2\pi} f(e^{i\theta})e^{-in\theta}\,d\theta$ | Fourier series |
| $\mathbb{Z}/N\mathbb{Z}$ | $\mathbb{Z}/N\mathbb{Z}$ | $\sum_{k=0}^{N-1}f(k)e^{-2\pi ikm/N}$ | Discrete Fourier Transform (DFT) |
| $\mathbb{R}^d$ | $\mathbb{R}^d$ | $\int f(\mathbf{t})e^{-2\pi i\boldsymbol{\xi}^\top\mathbf{t}}\,d\mathbf{t}$ | Multidimensional FT |

Plancherel's theorem, the convolution theorem, and the inversion theorem all hold in this general setting. The unification of Fourier series and Fourier Transform as special cases of the same construction on different groups is one of the most elegant results in abstract harmonic analysis.

For AI: The group structure is what makes RoPE work. The circle group $S^1$ (the complex numbers of modulus 1 under multiplication) is a compact abelian group, and multiplying by $e^{im\theta}$ is the group action of rotation by $m\theta$. The relative-position property of RoPE, $\langle e^{im\theta}q, e^{in\theta}k\rangle = e^{i(m-n)\theta}\langle q,k\rangle$, says that the rotations act by unitaries: absolute positions cancel, and the inner product sees only the group element $e^{i(m-n)\theta}$ corresponding to relative position.
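A quick numerical check of this identity (dimension, angle, and positions below are arbitrary):

```python
import numpy as np

# Check <e^{im.theta} q, e^{in.theta} k> = e^{i(m-n).theta} <q, k>
rng = np.random.default_rng(0)
q = rng.normal(size=4) + 1j * rng.normal(size=4)
k = rng.normal(size=4) + 1j * rng.normal(size=4)
theta, m, n = 0.3, 7, 2

# With <u, v> = sum_i u_i conj(v_i), i.e., np.vdot(v, u)
lhs = np.vdot(np.exp(1j * n * theta) * k, np.exp(1j * m * theta) * q)
rhs = np.exp(1j * (m - n) * theta) * np.vdot(k, q)
assert np.isclose(lhs, rhs)   # the inner product sees only m - n
```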

C.2 Non-Commutative Fourier Analysis

For non-abelian groups (e.g., SO(3)SO(3), permutation groups SNS_N), the Pontryagin duality breaks down but a Fourier Transform still exists, with irreducible unitary representations playing the role of frequencies.

Graph Fourier Transform: For a graph G=(V,E)G = (V, E) with Laplacian L=DAL = D - A (where DD is the degree matrix), the eigenvectors Luk=λkukL\mathbf{u}_k = \lambda_k\mathbf{u}_k play the role of complex exponentials. The graph Fourier Transform is f^=Uf\hat{\mathbf{f}} = U^\top\mathbf{f} where U=[u1,,uN]U = [\mathbf{u}_1,\ldots,\mathbf{u}_N] is the matrix of eigenvectors. The eigenvalues λk\lambda_k represent "graph frequencies" - how rapidly the eigenvector oscillates across the graph.

This is the foundation of spectral graph neural networks (Bruna et al., 2014; Defferrard et al., 2016 ChebNet; Kipf & Welling, 2017 GCN). The spectral GNN applies a learned filter hθ(Λ)h_\theta(\Lambda) in graph Fourier space: h^θf=Uhθ(Λ)Uf\hat{h}_\theta * \mathbf{f} = U\,h_\theta(\Lambda)\,U^\top\mathbf{f}.
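A minimal NumPy sketch of the graph Fourier Transform on a small path graph (graph and signal are illustrative):

```python
import numpy as np

N = 6
A = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)  # path-graph adjacency
L_graph = np.diag(A.sum(axis=1)) - A                          # Laplacian L = D - A

lam, U = np.linalg.eigh(L_graph)   # graph frequencies and Fourier basis
f = np.random.default_rng(0).normal(size=N)                   # a signal on the nodes

f_hat = U.T @ f                    # graph Fourier Transform
f_rec = U @ f_hat                  # inverse transform (U is orthogonal)
assert np.allclose(f, f_rec)
```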

Full treatment in Section 11-04 Spectral Graph Theory.

C.3 The Fourier Transform and Learning Theory

The Fourier Transform connects deeply to several foundational results in learning theory:

Spectral Bias (Rahaman et al., 2019): Neural networks trained by gradient descent learn the Fourier components of the target function in order of increasing frequency. This is because the NTK (Neural Tangent Kernel) has eigenvalues that decay rapidly with frequency - the network converges faster on low-frequency components. Consequence: networks generalize well on smooth targets and may overfit on high-frequency features.

Frequency Principle: Let $f^*$ be the target function and $f_t$ the network at training step $t$. Then the residual $|\hat{f}_t(\xi) - \hat{f}^*(\xi)|$ decays roughly like $e^{-\lambda_\xi t}$, where $\lambda_\xi$ is the NTK eigenvalue at frequency $\xi$; since $\lambda_\xi$ decreases with $|\xi|$, high-frequency components converge last.
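An illustrative (not definitive) PyTorch experiment: fit a two-tone target with a small tanh MLP and watch the residual at the low frequency shrink before the residual at the high frequency:

```python
import torch

torch.manual_seed(0)
t = (torch.arange(256) / 256).unsqueeze(1)
target = torch.sin(2 * torch.pi * 2 * t) + torch.sin(2 * torch.pi * 32 * t)

net = torch.nn.Sequential(
    torch.nn.Linear(1, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2001):
    loss = ((net(t) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        # residual spectrum: frequency 2 (low) vs frequency 32 (high)
        err = torch.fft.rfft((net(t) - target).squeeze().detach())
        print(step, abs(err[2]).item(), abs(err[32]).item())
```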

Implications:

  • Data augmentation that adds high-frequency perturbations (e.g., random cropping, noise injection) improves generalization precisely because it forces the network to represent these frequencies.
  • Adversarial examples often correspond to high-frequency perturbations of inputs that fool the network but are invisible to humans - the network hasn't fully learned the high-frequency structure.
  • Neural networks with spectral inductive bias (e.g., SIREN - sinusoidal activation functions) can represent high-frequency targets more efficiently.

Rademacher Complexity and Fourier Norms: The generalization error of a hypothesis class can be bounded by its Fourier 1\ell^1 norm: ξf^(ξ)\sum_\xi |\hat{f}(\xi)|. Functions with small Fourier 1\ell^1 norm have low complexity and generalize well. This connects to the theory of Barron functions (Barron, 1993): functions ff with f^(ξ)ξdξ<C\int|\hat{f}(\xi)||\xi|\,d\xi < C can be approximated to accuracy ϵ\epsilon by a two-layer neural network with O(C2/ϵ2)O(C^2/\epsilon^2) neurons.

C.4 Phase Retrieval - When Magnitude Alone Is Insufficient

The FT f^(ξ)=f^(ξ)eiϕ(ξ)\hat{f}(\xi) = |\hat{f}(\xi)|e^{i\phi(\xi)} consists of a magnitude f^(ξ)|\hat{f}(\xi)| and a phase ϕ(ξ)\phi(\xi). Phase retrieval asks: can you recover ff from f^|\hat{f}| alone (without the phase)?

In general, the answer is no - many different signals can have the same magnitude spectrum. For example, f(t)f(t) and f(t)f(-t) have the same magnitude spectrum f^(ξ)|\hat{f}(\xi)| but may be entirely different. More strikingly, phase-randomized images (keeping f^|\hat{f}| but using random ϕ(ξ)\phi(\xi)) look like noise - the phase carries most of the perceptual information.

Phase is more important than magnitude for perception. The classic experiment: take images $A$ and $B$ and reconstruct using $A$'s magnitude spectrum with $B$'s phase spectrum. The result looks like image $B$ - demonstrating that phase dominates perception.

For AI: This has implications for:

  • Adversarial robustness: Many adversarial attacks perturb the high-frequency phase of input images - imperceptible to humans but highly disruptive to CNNs.
  • Generative models: Diffusion models and GANs must learn both the magnitude and phase of the data distribution's spectrum to generate perceptually realistic images.
  • FNet limitation: FNet takes the real part of the FT output, effectively discarding half the phase information. This is one reason FNet performs worse than attention on tasks requiring precise token interactions.

C.5 Fast Algorithms Beyond FFT

The FFT computes the NN-point DFT in O(NlogN)O(N\log N) operations, a vast improvement over the naive O(N2)O(N^2). Several modern algorithms push further:

Sparse FFT: If the signal has only $k \ll N$ significant frequencies, the sparse FFT (Hassanieh et al., 2012, the MIT sFFT) computes the DFT in $O(k\log N)$ - sublinear in $N$. Applications: energy-efficient spectrum sensing, MRI acceleration.

Non-Uniform FFT (NUFFT): The standard FFT requires uniformly-spaced samples. The NUFFT handles non-uniform grids with O(NlogN)O(N\log N) complexity, using oversampling and interpolation. Used in MRI reconstruction (non-Cartesian k-space trajectories) and radio astronomy.

Butterfly factorization (Monarch matrices): The FFT matrix can be written as a product of $O(\log N)$ sparse matrices (butterfly factors), each applicable in $O(N)$ operations. Monarch matrices (Dao et al., 2022) generalize this structure for learned efficient linear layers: they parameterize a broad class of structured matrices with $O(N\sqrt{N})$ parameters that can be applied in $O(N\sqrt{N})$ time, interpolating between dense matrices ($O(N^2)$) and diagonal ones ($O(N)$).

The FFT algorithm and its applications - Cooley-Tukey, Bluestein, Rader - are covered in full in Section 20-03 DFT and FFT.

C.6 Summary: The Central Theorems of the Fourier Transform

This section has developed the following central results, in order of importance:

1. Definition and well-posedness (Section 2):

f^(ξ)=f(t)e2πiξtdt,fL1(R)\hat{f}(\xi) = \int_{-\infty}^\infty f(t)e^{-2\pi i\xi t}\,dt, \qquad f \in L^1(\mathbb{R})

Well-defined, bounded, continuous. Riemann-Lebesgue: f^(ξ)0\hat{f}(\xi) \to 0 as ξ|\xi| \to \infty.

2. Properties (Section 3): Linearity, shift, modulation, scaling, time reversal, differentiation, convolution-multiplication duality. The Master Properties Table encodes all relationships between time and frequency operations.

3. Inversion (Section 4): Under sufficient conditions, f(t)=f^(ξ)e2πiξtdξf(t) = \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi. The Gauss-Weierstrass regularization gives a constructive proof.

4. Plancherel (Section 5): f^2=f2\lVert\hat{f}\rVert_2 = \lVert f\rVert_2. The FT is a unitary isometry on L2(R)L^2(\mathbb{R}) - energy is preserved.

5. Uncertainty Principle (Section 6): ΔtΔξ1/(4π)\Delta t\cdot\Delta\xi \geq 1/(4\pi), with equality iff ff is a Gaussian. Time-frequency concentration is fundamentally limited.

6. Distribution extension (Section 7): The FT extends to tempered distributions, giving F[δ]=1\mathcal{F}[\delta] = 1, F[1]=δ\mathcal{F}[1] = \delta, F[e2πiξ0t]=δ(ξξ0)\mathcal{F}[e^{2\pi i\xi_0 t}] = \delta(\xi-\xi_0), and the Dirac comb formula. Poisson summation unifies Fourier series and Fourier Transform.

These six results form a complete, self-consistent theory that has shaped mathematics, physics, engineering, and AI for two centuries - and shows no sign of becoming less central.


Appendix D: Machine Learning Deep Dives

D.1 FNet Architecture - Full Mathematical Analysis

We give a complete mathematical analysis of the FNet architecture (Lee-Thorp et al., 2022) using the formalism developed in this section.

Standard Transformer token mixing (attention):

Given token matrix XRN×dX \in \mathbb{R}^{N \times d} (sequence length NN, embedding dimension dd):

Attn(X)=softmax ⁣(XWQWKXdk)XWV\text{Attn}(X) = \operatorname{softmax}\!\left(\frac{XW_Q W_K^\top X^\top}{\sqrt{d_k}}\right)XW_V

This costs $O(N^2 d_k)$ time and $O(N^2)$ memory; the $N \times N$ attention matrix dominates for large $N$. The attention matrix $A_{ij} = \operatorname{softmax}(\text{score})_{ij}$ is data-dependent: each row is a different weighted sum of the value vectors, with weights determined by the specific input.

FNet token mixing:

FNetMix(X)=Re(Fseq[Fmodel[X]])\text{FNetMix}(X) = \operatorname{Re}(\mathcal{F}_\text{seq}[\mathcal{F}_\text{model}[X]])

where Fseq\mathcal{F}_\text{seq} applies the DFT along the sequence dimension (rows) and Fmodel\mathcal{F}_\text{model} along the embedding dimension (columns).

The DFT as a matrix multiplication:

The 1D DFT can be written as x^=Fx\hat{\mathbf{x}} = F\mathbf{x} where FCN×NF \in \mathbb{C}^{N \times N} with Fkn=e2πikn/N/NF_{kn} = e^{-2\pi ikn/N}/\sqrt{N} (unitary DFT matrix). The 2D FNet mixing is:

FNetMix(X)=Re(FNXFd)\text{FNetMix}(X) = \operatorname{Re}(F_N X F_d^\top)

where FNCN×NF_N \in \mathbb{C}^{N\times N} and FdCd×dF_d \in \mathbb{C}^{d\times d} are DFT matrices.

Key properties:

  1. Unitarity: FNFN=INF_N F_N^* = I_N and FdFd=IdF_d F_d^* = I_d. The FNet mixing is an isometry (before taking real part): FNXFdF=XF\lVert F_N X F_d^\top\rVert_F = \lVert X\rVert_F by Plancherel.

  2. Global mixing: Each output position Ykj=nmXnm(FN)kn(Fd)jmY_{kj} = \sum_n\sum_m X_{nm}(F_N)_{kn}(F_d)_{jm} depends on all N×dN \times d input values. FNet is a global mixer - exactly like attention.

  3. No parameters: Unlike attention, FNet has zero parameters in the mixing layer (FNF_N and FdF_d are fixed). All learning happens in the feed-forward network after mixing.

  4. Complexity: $O(Nd\log(Nd))$ via 2D FFT, vs $O(N^2 d)$ for attention. For $N = 512$, $d = 768$, the ratio is $N/\log_2(Nd) \approx 28$: roughly $30\times$ fewer operations in the mixing layer.

Why take the real part? The DFT output is complex even for real inputs. Taking $\operatorname{Re}(\cdot)$ projects back to real numbers (required since the subsequent feed-forward layers expect real inputs). This discards the imaginary part - half the information, including much of the phase. This is one reason FNet is weaker than attention: it cannot separately process magnitude and phase information.
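A minimal PyTorch sketch of the mixing step (the published implementation uses the unnormalized DFT; norm="ortho" below matches the unitary convention above and differs only by a scale factor):

```python
import torch

def fnet_mixing(x: torch.Tensor) -> torch.Tensor:
    """2D DFT token mixing: FFT along embedding then sequence dims, keep Re."""
    return torch.fft.fft(
        torch.fft.fft(x, dim=-1, norm="ortho"), dim=-2, norm="ortho"
    ).real

x = torch.randn(2, 512, 768)   # (batch, N, d)
y = fnet_mixing(x)             # same shape; zero learned parameters in the mixer
```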

Theoretical analysis: The FNet paper shows that attention with random (untrained) weights performs similarly to FNet. This suggests the attention mechanism's power comes primarily from its (trained) data-dependent mixing, not merely from being a global mixer. FNet is the "global but data-independent" baseline.

When to use FNet vs Attention:

| Task type | FNet | Attention | Reason |
|---|---|---|---|
| Sentence classification | ~92-97% of BERT | 100% | Global mixing is sufficient |
| Named entity recognition | ~90% | 100% | Moderate position sensitivity |
| Extractive QA (SQuAD) | ~80% | 100% | Requires precise token matching |
| Speed-critical inference | ~7x faster | baseline | FNet wins when the quality tradeoff is acceptable |

D.2 Random Fourier Features - Implementation Details

Setup for RBF kernel: k(x,y)=eγxy2k(\mathbf{x},\mathbf{y}) = e^{-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert^2}.

The spectral density is $p(\boldsymbol{\omega}) = (4\pi\gamma)^{-d/2}\,e^{-\lVert\boldsymbol{\omega}\rVert^2/(4\gamma)} = \mathcal{N}(\mathbf{0}, 2\gamma I)$.

(With γ=1/2\gamma = 1/2: p(ω)=N(0,I)p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, I), which is the standard form.)

Feature map construction (two variants):

Random Fourier Features (RFF): Sample ωjp\boldsymbol{\omega}_j \sim p, bjU[0,2π]b_j \sim \mathcal{U}[0,2\pi]:

ϕ(x)=2/D[cos(ω1x+b1),,cos(ωDx+bD)]\phi(\mathbf{x}) = \sqrt{2/D}\,[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}+b_1),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}+b_D)]

Random Fourier Features (paired, no bias): Sample ωjp\boldsymbol{\omega}_j \sim p:

ϕ(x)=1D[cos(ω1x),sin(ω1x),,cos(ωDx),sin(ωDx)]\phi(\mathbf{x}) = \frac{1}{\sqrt{D}}[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}),\sin(\boldsymbol{\omega}_1^\top\mathbf{x}),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}),\sin(\boldsymbol{\omega}_D^\top\mathbf{x})]

Both give E[ϕ(x)ϕ(y)]=k(x,y)\mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y}). The paired version has slightly lower variance.
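A minimal NumPy sketch of the biased-cosine variant, checking the kernel approximation on one pair of points ($d$, $D$, and $\gamma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 8, 4096, 0.5

omega = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))  # omega ~ N(0, 2*gamma*I)
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(y)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
print(approx, exact)            # agree to O(1/sqrt(D))
```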

Concentration bound:

P\!\left(\sup_{\mathbf{x},\mathbf{y}\in\mathcal{M}}\left|\phi(\mathbf{x})^\top\phi(\mathbf{y}) - k(\mathbf{x},\mathbf{y})\right| \geq \epsilon\right) \leq 2^8\left(\frac{\sigma_p\,\operatorname{diam}(\mathcal{M})}{\epsilon}\right)^2 e^{-D\epsilon^2/(4(d+2))}

where $\sigma_p = \sqrt{\mathbb{E}_p[\lVert\boldsymbol{\omega}\rVert^2]}$ and $\mathcal{M}$ is the (compact) data domain, e.g., a ball of radius $R = \max_i\lVert\mathbf{x}^{(i)}\rVert$.

Variance reduction via Quasi-Monte Carlo: Instead of random ωj\boldsymbol{\omega}_j, use quasi-random sequences (Sobol, Halton) to improve coverage of frequency space. Gives O(D1logdD)O(D^{-1}\log^d D) error vs O(D1/2)O(D^{-1/2}) for random - significant improvement in practice.

Orthogonal Random Features (ORF) (Yu et al., 2016): Sample the frequency matrix with orthogonal rows, e.g., via the structured product $G = \frac{1}{\sqrt{D}}\,H D_1 S$, where $H$ is a Hadamard matrix, $D_1$ is a random diagonal $\pm 1$ matrix, and $S$ is a diagonal scaling ($D_1$ is distinct from the feature count $D$). This gives:

  • Same unbiasedness: E[ϕ(x)ϕ(y)]=k(x,y)\mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})
  • Lower variance: still $O(1/D)$, but with a strictly smaller constant than i.i.d. RFF (the improvement grows with dimension $d$)
  • Faster computation: O(DlogD)O(D\log D) using fast Hadamard transform

Connection to Performer: The Performer attention (Choromanski et al., 2021) approximates exp(qk/d)\exp(q^\top k/\sqrt{d}) (the unnormalized attention softmax kernel) using orthogonal random features:

\exp(q^\top k/\sqrt{d}) \approx \phi(q)^\top\phi(k), \quad \phi(\mathbf{x}) = \frac{e^{-\lVert\mathbf{x}\rVert^2/2}}{\sqrt{D}}\left[e^{\boldsymbol{\omega}_1^\top\mathbf{x}},\ldots,e^{\boldsymbol{\omega}_D^\top\mathbf{x}}\right]

This reduces attention from O(N2d)O(N^2 d) to O(NDd)O(N D d) - linear in sequence length NN.
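A sketch of the positive-feature identity behind this approximation, $e^{\mathbf{q}^\top\mathbf{k}} = \mathbb{E}_{\boldsymbol{\omega}\sim\mathcal{N}(0,I)}\big[e^{\boldsymbol{\omega}^\top\mathbf{q}-\lVert\mathbf{q}\rVert^2/2}\,e^{\boldsymbol{\omega}^\top\mathbf{k}-\lVert\mathbf{k}\rVert^2/2}\big]$, using i.i.d. rather than orthogonal frequencies for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 8192
q = rng.normal(size=d) * 0.3    # modest norms keep the estimator's variance low
k = rng.normal(size=d) * 0.3

W = rng.normal(size=(D, d))     # i.i.d. Gaussian frequencies (Performer uses orthogonal)
phi = lambda x: np.exp(W @ x - x @ x / 2) / np.sqrt(D)
print(phi(q) @ phi(k), np.exp(q @ k))   # close for large D
```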

D.3 Spectral Analysis of Neural Network Weight Matrices

The spectral norm W2=σmax(W)\lVert W\rVert_2 = \sigma_{\max}(W) and the spectral distribution of weight matrices encode important information about the training state of a neural network.

Heavy-tailed self-regularization (Martin & Mahoney, 2021): In well-trained neural networks, the eigenvalue distribution of WWW^\top W follows a power law: p(λ)λαp(\lambda) \propto \lambda^{-\alpha} for some α[2,6]\alpha \in [2, 6]. Under-trained or over-regularized networks have bulk (Marchenko-Pastur) distributions. This is detected by fitting a power law to the sorted singular values.

WeightWatcher: The open-source tool weightwatcher analyzes the spectral distribution of weight matrices in PyTorch/TensorFlow models and reports $\alpha$ values as a proxy for model quality - without needing test data. Well-trained production-scale models (e.g., the LLaMA family) typically show $\alpha \approx 2$-$4$; poorly trained models show $\alpha > 6$ or non-power-law distributions.

Spectral norm and Lipschitz constant: For a multi-layer network f=fLf1f = f_L\circ\cdots\circ f_1, the Lipschitz constant satisfies:

\text{Lip}(f) \leq \text{Lip}(\sigma)^{L-1}\prod_{l=1}^L\lVert W^{[l]}\rVert_2

For 1-Lipschitz activations (σ=ReLU\sigma = \operatorname{ReLU}, Lip(σ)=1\text{Lip}(\sigma) = 1): Lip(f)lW[l]2\text{Lip}(f) \leq \prod_l\lVert W^{[l]}\rVert_2.

Spectral normalization (Section 8.3) sets each W[l]2=1\lVert W^{[l]}\rVert_2 = 1, making the entire network 1-Lipschitz.
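In practice $\sigma_{\max}$ is estimated with a few steps of power iteration rather than a full SVD; a standalone sketch (the in-training version reuses the vector $u$ across update steps):

```python
import torch

def spectral_norm_power_iter(W: torch.Tensor, n_iters: int = 50) -> torch.Tensor:
    """Estimate sigma_max(W) by power iteration."""
    u = torch.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v = v / v.norm()
        u = W @ v
        u = u / u.norm()
    return u @ W @ v               # Rayleigh-quotient estimate of sigma_max

W = torch.randn(256, 128)
print(spectral_norm_power_iter(W))
print(torch.linalg.matrix_norm(W, ord=2))   # exact value, for comparison
```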

For LLMs: Analyzing the spectral properties of attention weight matrices (WQ,WK,WV,WOW_Q, W_K, W_V, W_O) is an active area of mechanistic interpretability research. Low-rank structure in these matrices (small effective rank = few large singular values) indicates specialized computation. Anthropic's interpretability work has found "singular value signatures" of specific computational patterns in attention heads.

D.4 The Fourier Perspective on Attention

Attention and the Fourier Transform are both global linear mixing operations on sequences. Understanding their relationship clarifies when each is appropriate.

Attention as a data-dependent filter: The output of attention can be written as:

yi=jAijvj,Aij=exp(qikj/dk)exp(qik/dk)y_i = \sum_j A_{ij}v_j, \quad A_{ij} = \frac{\exp(q_i^\top k_j/\sqrt{d_k})}{\sum_\ell\exp(q_i^\top k_\ell/\sqrt{d_k})}

For fixed query ii, this is a weighted average of value vectors with data-dependent weights AijA_{ij}. In signal processing terms, it is a position-dependent filter: the filter coefficients AijA_{ij} change for each output position ii.

Fourier Transform as a fixed filter bank: FNet applies the DFT, which is a fixed (non-data-dependent) filter: the output at frequency kk is always x^k=nxne2πikn/N\hat{x}_k = \sum_n x_n e^{-2\pi ikn/N}, regardless of the content of xx. Each DFT frequency is a specific pattern of oscillation at rate k/Nk/N.
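The contrast in miniature: the same input mixed by a fixed DFT filter and by data-dependent attention weights (identity projections $W_Q = W_K = W_V = I$ assumed purely for illustration):

```python
import torch

N, d = 8, 4
X = torch.randn(N, d)

# Fixed filter: the DFT matrix mixes tokens identically for every input.
F_N = torch.fft.fft(torch.eye(N, dtype=torch.complex64), dim=0) / N**0.5
Y_dft = (F_N @ X.to(torch.complex64)).real

# Data-dependent filter: the mixing weights are computed from X itself.
A = torch.softmax(X @ X.T / d**0.5, dim=-1)   # rows = per-position filters
Y_attn = A @ X
```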

Comparison table:

| Property | Self-Attention | FNet (DFT) |
|---|---|---|
| Filter type | Data-dependent | Fixed |
| Parameters in mixer | $3d^2$ ($W_Q$, $W_K$, $W_V$) | 0 |
| Computational complexity | $O(N^2 d)$ | $O(Nd\log(Nd))$ |
| Expressive power | High (can select specific tokens) | Low (global uniform mixing) |
| Position awareness | Via PE or RoPE | Via DFT phases |
| Best for | Selective retrieval tasks | Global mixing tasks |
| Example success | Machine translation, QA | Text classification, embeddings |

The analogy: Attention is like a learnable short-time Fourier Transform with adaptive windows - for each output position, it selects which "frequencies" (patterns) to attend to. The DFT in FNet is like a fixed global frequency analysis with no adaptivity. The power of attention comes from its ability to compute different effective filters for different positions and different inputs.


Appendix E: Reference Summary and Quick Checks

E.1 The Six Most Important Facts

If you retain only six things from this section, retain these:

1. Definition: f^(ξ)=f(t)e2πiξtdt\hat{f}(\xi) = \int_{-\infty}^\infty f(t)e^{-2\pi i\xi t}\,dt (use the ξ\xi-convention; no 2π2\pi in the inverse).

2. Self-dual Gaussian: F[eπt2](ξ)=eπξ2\mathcal{F}[e^{-\pi t^2}](\xi) = e^{-\pi\xi^2}. The Gaussian maps to itself. All other transforms can be derived from this one via properties.

3. Plancherel: f^2=f2\lVert\hat{f}\rVert_2 = \lVert f\rVert_2. Energy is conserved. The FT is a unitary operator on L2L^2.

4. Uncertainty principle: ΔtΔξ1/(4π)\Delta t\cdot\Delta\xi \geq 1/(4\pi), equality iff Gaussian. You cannot have both time and frequency concentration.

5. Distributions: F[δ]=1\mathcal{F}[\delta] = 1, F[1]=δ\mathcal{F}[1] = \delta, F[e2πiξ0t]=δ(ξξ0)\mathcal{F}[e^{2\pi i\xi_0 t}] = \delta(\xi-\xi_0). Pure tones map to spikes. Spikes have flat spectra.

6. AI: The FT is used in RoPE (modulation theorem), FNet (global mixing via DFT), RFF (Bochner + spectral sampling), spectral normalization (spectral norm of weight matrices), and FNO (learn in Fourier space).

E.2 Quick Reference: Which Theorem to Use When

| Problem | Theorem/Tool |
|---|---|
| Compute FT of $e^{-a\lvert t\rvert}$ | Direct integration (two one-sided exponentials); gives the Lorentzian $2a/(a^2+4\pi^2\xi^2)$ |
| Compute FT of $f(t-a)g(t)$ | Time-shift + modulation properties |
| Show $\int\lvert f\rvert^2 = \int\lvert\hat{f}\rvert^2$ | Plancherel's theorem |
| Find $\int_{-\infty}^\infty\operatorname{sinc}^2(\xi)\,d\xi$ | Parseval applied to rect (the integral equals 1) |
| Show $\hat{f}(\xi)\to 0$ as $\lvert\xi\rvert\to\infty$ | Riemann-Lebesgue lemma |
| Compute FT of $f'(t)$ | Differentiation property: multiply by $2\pi i\xi$ |
| Compute FT of $\cos(2\pi\xi_0 t)$ | Modulation theorem or distributions: two delta spikes |
| Verify $f$ recoverable from $\hat{f}$ | Inversion theorem (check $f,\hat{f}\in L^1$ or use $L^2$ theory) |
| Lower bound on signal bandwidth | Uncertainty principle: $\Delta\xi \geq 1/(4\pi\Delta t)$ |
| FT of a periodic signal $\sum c_n e^{2\pi int/T}$ | Distribute over the sum: $\sum c_n\delta(\xi - n/T)$ |
| FT of $f(at)$ (rescaled signal) | Scaling theorem: $(1/\lvert a\rvert)\hat{f}(\xi/a)$ |
| Show kernel $k(\mathbf{x}-\mathbf{y})$ is PD | Bochner: show the spectral density $p \geq 0$ |

E.3 Sign Convention Reference

Different sources use different sign conventions. This table gives the conversion formulas.

| Convention | Forward FT | Inverse FT | Gaussian self-transform |
|---|---|---|---|
| $\xi$ (this section) | $\int f(t)e^{-2\pi i\xi t}\,dt$ | $\int\hat{f}(\xi)e^{+2\pi i\xi t}\,d\xi$ | $e^{-\pi t^2}\mapsto e^{-\pi\xi^2}$ |
| $\omega$, no $\frac{1}{2\pi}$ | $\int f(t)e^{-i\omega t}\,dt$ | $\frac{1}{2\pi}\int F(\omega)e^{+i\omega t}\,d\omega$ | $e^{-t^2/2}\mapsto\sqrt{2\pi}\,e^{-\omega^2/2}$ |
| $\omega$, symmetric | $\frac{1}{\sqrt{2\pi}}\int f(t)e^{-i\omega t}\,dt$ | $\frac{1}{\sqrt{2\pi}}\int F(\omega)e^{+i\omega t}\,d\omega$ | $e^{-t^2/2}\mapsto e^{-\omega^2/2}$ |

Conversion: If Fξ(ξ)F_\xi(\xi) is the transform in the ξ\xi-convention, then Fω(ω)=Fξ(ω/(2π))F_\omega(\omega) = F_\xi(\omega/(2\pi)) for the non-symmetric ω\omega-convention.

Properties in the ω\omega-convention (no 1/(2π)1/(2\pi)):

  • Differentiation: Fω[f](ω)=iωF(ω)\mathcal{F}_\omega[f'](\omega) = i\omega F(\omega) (not 2πiξ2\pi i\xi)
  • Time-shift: Fω[f(ta)](ω)=eiωaF(ω)\mathcal{F}_\omega[f(t-a)](\omega) = e^{-i\omega a}F(\omega) (not e2πiξae^{-2\pi i\xi a})
  • Parseval: f2=12πF2\int|f|^2 = \frac{1}{2\pi}\int|F|^2 (extra 1/(2π)1/(2\pi) factor!)

The ξ\xi-convention is self-symmetric and avoids all these 2π2\pi factors - strongly recommended for all new work.

E.4 Prerequisites Revisited: What You Now Know

This section assumed you knew Fourier series (Section 01) and functional analysis basics. Here is what each prerequisite contributed:

| Prerequisite | How it was used |
|---|---|
| Complex exponentials $e^{i\theta}$ | The basis functions $e^{-2\pi i\xi t}$ of the FT; Euler's formula is used constantly |
| $L^2$ inner product | Plancherel's theorem is the FT version of Parseval from Fourier series |
| Orthonormality of $\{e^{inx}\}$ | Generalized: the FT is the "orthonormal expansion" in the continuous limit |
| Parseval's identity (series) | Generalized to Plancherel's theorem in Section 5 |
| Gibbs phenomenon | Reinterpreted as spectral leakage - the frequency-domain manifestation of truncation |
| Dirichlet conditions | Sufficient conditions for pointwise inversion (Section 4.1) |
| Integration / IBP | Proofs of the differentiation theorem; inversion via approximate identity |
| Convergence of series | $L^2$ convergence in the Plancherel proof; distributional convergence in Section 7 |

E.5 Glossary

| Term | Definition |
|---|---|
| Fourier Transform | $\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt$; maps time-domain functions to frequency-domain functions |
| Spectrum | The function $\hat{f}(\xi)$; describes which frequencies are present and at what amplitude/phase |
| Power spectrum | $\lvert\hat{f}(\xi)\rvert^2$; energy density per unit frequency |
| Plancherel theorem | $\lVert\hat{f}\rVert_2 = \lVert f\rVert_2$; FT is a unitary isometry on $L^2$ |
| Parseval's relation | $\int f\bar{g} = \int\hat{f}\bar{\hat{g}}$; the FT preserves inner products |
| Uncertainty principle | $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$; time and frequency spread cannot simultaneously be small |
| Riemann-Lebesgue lemma | $\hat{f}(\xi)\to 0$ as $\lvert\xi\rvert\to\infty$ for $f \in L^1$ |
| Dirac delta $\delta(t)$ | Distributional unit impulse; $\int\delta(t)\phi(t)\,dt = \phi(0)$; FT is the constant 1 |
| Tempered distribution | Continuous linear functional on Schwartz space; includes all $L^p$ functions, $\delta$, and more |
| Schwartz space $\mathcal{S}$ | Smooth functions decaying faster than any polynomial; FT maps $\mathcal{S}$ to $\mathcal{S}$ bijectively |
| Dirac comb | $\sum_n\delta(t-nT)$; its FT is another comb with spacing $1/T$ |
| Poisson summation | $\sum_n f(n) = \sum_n\hat{f}(n)$; connects Fourier series and FT |
| Modulation theorem | $\mathcal{F}[e^{2\pi i\xi_0 t}f](\xi) = \hat{f}(\xi-\xi_0)$; multiplication by a tone = frequency shift |
| Sinc function | $\operatorname{sinc}(\xi) = \sin(\pi\xi)/(\pi\xi)$; FT of the rectangular pulse; ideal low-pass filter kernel |
| Spectral norm | $\lVert W\rVert_2 = \sigma_{\max}(W)$; largest singular value; Lipschitz constant of the linear map $W$ |
| RFF | Random Fourier Features; $k(\mathbf{x}-\mathbf{y})\approx\phi(\mathbf{x})^\top\phi(\mathbf{y})$ via frequencies sampled from the spectral density |
| STFT | Short-Time Fourier Transform; localizes FT analysis to a time window; gives the spectrogram |
| FNet | Transformer variant replacing attention with a 2D DFT; $O(Nd\log(Nd))$ mixing |
| FNO | Fourier Neural Operator; learns PDE solution operators by parameterizing filters in Fourier space |
| RoPE | Rotary Position Embedding; encodes position as rotation $e^{im\theta}$ in the complex plane |

Appendix F: The Fourier Transform in Practice - Case Studies

F.1 Case Study: Audio Preprocessing for Whisper

OpenAI's Whisper (Radford et al., 2022) is a speech recognition model that accepts raw audio and produces text. Its preprocessing pipeline is a direct application of STFT and spectral analysis.

Step 1: Resampling. Raw audio is resampled to 16kHz. By Shannon's sampling theorem (Poisson summation): a signal sampled at 16kHz can faithfully represent frequencies up to 8kHz - which covers all speech frequencies (human voice: 80Hz-8kHz, formants: 200Hz-4kHz).

Step 2: STFT. Apply a windowed Fourier Transform with:

  • Window: Hann window (smooth tapering reduces spectral leakage)
  • Window length: 25 ms × 16 kHz = 400 samples
  • Hop length: 10 ms × 16 kHz = 160 samples (60% overlap)
  • FFT size: 400 points (equal to the window length)

This gives a spectrogram of shape $(\text{frames} \times 201)$, where $201 = 400/2 + 1$ unique frequencies.

Step 3: Mel filterbank. Map the 201 linear frequency bins to 80 mel-scale bins using triangular filters. The mel scale $m = 2595\log_{10}(1 + f/700)$ is perceptually uniform - equal mel intervals are roughly equally distinguishable to humans. The uncertainty principle is not the binding constraint here: the 25 ms window is longer than the shortest perceptible phoneme cues (~10-20 ms), so a little time resolution is traded for good frequency resolution.

Step 4: Log compression. Take $\log(\max(\text{mel spectrogram}, 10^{-10}))$. The logarithm compresses the enormous dynamic range of audio energy into a numerically well-scaled range for the network.

Result: (frames×80)(\text{frames} \times 80) log-mel spectrogram, treated as a 2D "image" and fed to a CNN + Transformer encoder. The entire preprocessing is fixed (not learned) - it embeds centuries of signal processing knowledge into the model architecture.
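A sketch of such a pipeline using torchaudio's built-in transforms with the hyperparameters quoted above (this mirrors, but is not, Whisper's own preprocessing code):

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80,
    window_fn=torch.hann_window,
)
audio = torch.randn(16000 * 3)                  # 3 seconds of stand-in 16 kHz audio
spec = mel(audio)                               # (80, frames) mel spectrogram
log_spec = torch.log10(spec.clamp(min=1e-10))   # log compression
```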

Fourier analysis used:

  • Fourier Transform -> STFT window analysis
  • Uncertainty principle -> window length tradeoff
  • Sampling theorem -> 16kHz minimum rate
  • Spectral leakage -> Hann window choice

F.2 Case Study: LLaMA-3 Context Extension via RoPE Scaling

LLaMA-3 uses RoPE with base θbase=500,000\theta_{\text{base}} = 500{,}000 and supports up to 128K token context (compared to LLaMA-2's 4K with θbase=10,000\theta_{\text{base}} = 10{,}000).

The problem: Standard RoPE with $\theta_j = \theta_{\text{base}}^{-2j/d}$ (head dimension $d$) has frequencies ranging from $\theta_0 = 1$ down to $\theta_{d/2-1} \approx 10^{-4}$ for $\theta_{\text{base}} = 10{,}000$. At frequency $\theta_j$, the encoding completes one full rotation every $2\pi/\theta_j$ tokens, so the longest period is $T_{\max} = 2\pi/\theta_{\min} \approx 2\pi\times 10^4 \approx 62{,}832$ tokens. This is the hard wrap-around limit; in practice LLaMA-2 degrades well before it, beyond its 4K training window, because the phase combinations at unseen positions lie outside the training distribution.

The Fourier interpretation: Each dimension pair is an oscillator at frequency $\theta_j$. For two positions $m$ and $n$ to be distinguishable, the phase difference $(m-n)\theta_j$ must differ from $0$ mod $2\pi$ for at least one $j$. Once $|m-n|$ exceeds $2\pi/\theta_{\min}$, every dimension has wrapped around at least once and the encodings become ambiguous. This is aliasing in the position domain, caused by the minimum frequency $\theta_{\min}$ being too high.

The fix: Reduce the minimum frequency by increasing θbase\theta_{\text{base}}:

θj=θbase2j/d,θmin=θbase1,Tmax=2πθbase\theta_j = \theta_{\text{base}}^{-2j/d}, \quad \theta_{\min} = \theta_{\text{base}}^{-1}, \quad T_{\max} = 2\pi\cdot\theta_{\text{base}}

LLaMA-3 uses θbase=500,000\theta_{\text{base}} = 500{,}000, giving Tmax=2π×500,0003.14×106T_{\max} = 2\pi\times 500{,}000 \approx 3.14\times 10^6 - more than enough for 128K context.
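A quick computation of the per-dimension rotation periods for the two bases (head dimension 128 assumed for illustration):

```python
import numpy as np

def rope_periods(d_head: int, base: float) -> np.ndarray:
    """Rotation periods (in tokens), 2*pi/theta_j, for theta_j = base^(-2j/d)."""
    j = np.arange(d_head // 2)
    theta = base ** (-2 * j / d_head)
    return 2 * np.pi / theta

for base in (10_000, 500_000):
    longest = rope_periods(128, base).max()
    print(f"base={base:>7}: longest period ~ {longest:,.0f} tokens")
# base=  10000: longest period ~ 54,000 tokens
# base= 500000: longest period ~ 2,560,000 tokens
```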

The cost: Higher θbase\theta_{\text{base}} means the low-frequency dimensions (jj large) oscillate very slowly, providing little positional information for short contexts. There is a fundamental tradeoff between context length and short-range positional precision - a manifestation of the uncertainty principle in the discrete position domain.

YaRN (Peng et al., 2023): Instead of uniformly rescaling all frequencies, YaRN uses a non-uniform rescaling: high-frequency dimensions (small jj) are scaled minimally (they handle short-range position fine), while low-frequency dimensions (large jj) are scaled aggressively. The interpolation factor s=target context/original contexts = \text{target context} / \text{original context} is applied per-frequency, weighted by the "needed" extension. This maintains short-range precision while extending long-range capacity - a principled application of the time-frequency tradeoff.

F.3 Case Study: Spectral Normalization in Stable Diffusion

Modern generative image pipelines include adversarially trained components - for example, the latent autoencoder behind Stable Diffusion is trained with a patch-based discriminator that distinguishes real from generated data. Spectral normalization (Section 8.3) is used to stabilize this adversarial training.

Setup: The discriminator $D: \mathbb{R}^{H\times W\times 3} \to [0,1]$ has $L$ convolutional layers with weights $W^{[l]} \in \mathbb{R}^{C_\text{out}\times C_\text{in}\times k\times k}$. The relevant spectral norm is the operator norm of the layer as a linear map:

\lVert W^{[l]}\rVert_2 = \max_{\mathbf{x}:\,\lVert\mathbf{x}\rVert=1}\lVert W^{[l]}*\mathbf{x}\rVert

For a single-channel circular convolution with kernel $w(k_x, k_y)$, this operator norm equals $\max_{\xi_x,\xi_y}|\hat{w}(\xi_x,\xi_y)|$ - the maximum magnitude of the kernel's 2D Fourier Transform. This connects spectral normalization directly to the FT: normalizing by the spectral norm bounds the filter's frequency response by 1 at every frequency.
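A NumPy sketch of this fact: for a single-channel circular convolution, the spectral norm is just the largest magnitude of the kernel's 2D frequency response (kernel and grid size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
H = W = 32                                      # spatial grid; convolution acts circularly

freq_response = np.fft.fft2(kernel, s=(H, W))   # kernel's 2D frequency response
sigma_max = np.abs(freq_response).max()         # = spectral norm of the conv operator
print(sigma_max)
```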

Effect on generated images: A spectrally-normalized discriminator applies Lipschitz constraints that prevent it from exploiting arbitrary high-frequency artifacts in generated images. This forces the generator to eliminate high-frequency artifacts (checkerboard patterns from transposed convolutions, aliasing from bilinear upsampling) - improving sample quality.

Alias-free GAN / StyleGAN3 (Karras et al., 2021): Takes spectral control further by enforcing alias-free processing throughout the generator, with carefully designed low-pass filters at every upsampling step. The key insight from Fourier analysis: any nonlinearity applied to a bandlimited signal creates harmonics beyond the original bandwidth, and these must be filtered out before downsampling to avoid aliasing. Enforcing this principle throughout the generator eliminates "texture sticking" and yields far smoother results when latents are animated or interpolated.


Appendix G: Further Reading and References

G.1 Foundational Texts

For the classical theory:

  • Stein, E.M. & Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press. - The standard reference for rigorous Fourier analysis.
  • Katznelson, Y. (2004). An Introduction to Harmonic Analysis, 3rd ed. Cambridge. - Concise and rigorous; excellent for the abstract perspective.
  • Dym, H. & McKean, H.P. (1972). Fourier Series and Integrals. Academic Press. - Beautiful, accessible treatment with applications to probability.
  • Körner, T.W. (1988). Fourier Analysis. Cambridge. - The most readable and historically motivated account.

For distributions and Schwartz space:

  • Schwartz, L. (1950). Théorie des Distributions. Hermann, Paris. - The original; defines tempered distributions.
  • Hörmander, L. (1990). The Analysis of Linear Partial Differential Operators I. Springer. - Comprehensive; connects distributions to PDE theory.

For signal processing:

  • Oppenheim, A.V. & Schafer, R.W. (2009). Discrete-Time Signal Processing, 3rd ed. Prentice Hall. - Standard DSP textbook.
  • Mallat, S. (2008). A Wavelet Tour of Signal Processing, 3rd ed. Academic Press. - Bridges Fourier analysis, wavelets, and sparse signal processing.

G.2 Machine Learning Papers

FNet:

  • Lee-Thorp, J., Ainslie, J., Eckstein, I. & Ontañón, S. (2022). FNet: Mixing Tokens with Fourier Transforms. NAACL-HLT. arXiv:2105.03824.

Random Fourier Features:

  • Rahimi, A. & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS. [NeurIPS 2017 Test of Time Award]
  • Yu, F.X. et al. (2016). Orthogonal Random Features. NeurIPS.
  • Choromanski, K. et al. (2021). Rethinking Attention with Performers. ICLR. arXiv:2009.14794.

Spectral Normalization:

  • Miyato, T., Kataoka, T., Koyama, M. & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR. arXiv:1802.05957.

Fourier Neural Operator:

  • Li, Z. et al. (2021). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR. arXiv:2010.08895.

RoPE and Context Extension:

  • Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
  • Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071.
  • Ding, Y. et al. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. arXiv:2402.13753.

Spectral Bias:

  • Rahaman, N. et al. (2019). On the Spectral Bias of Neural Networks. ICML. arXiv:1806.08734.
  • Tancik, M. et al. (2020). Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS. arXiv:2006.10739.

Audio Models:

  • Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356.

Alias-free GAN:

  • Karras, T. et al. (2021). Alias-Free Generative Adversarial Networks. NeurIPS. arXiv:2106.12423.

G.3 Online Resources

  • MIT OCW 18.103 (Fourier Analysis) - rigorous mathematical treatment with lecture notes
  • Stanford EE261 (The Fourier Transform and its Applications) - engineering-oriented, excellent visualization
  • 3Blue1Brown "But what is the Fourier Transform?" - geometric intuition for absolute beginners
  • Seeing Theory (Brown University) - interactive visualizations of Fourier series and transforms

Appendix H: Self-Assessment Questions

These questions test conceptual understanding without computation. Answer each in 2-3 sentences.

  1. Explain in words why the Fourier Transform of a spike (Dirac delta) is a constant (flat spectrum). What physical interpretation does this have?

  2. A signal engineer claims: "If I double the duration of my pulse, I can halve its bandwidth." Is this claim correct, and what theorem guarantees it? What is the fundamental limit on simultaneous time and frequency concentration?

  3. What is the difference between the Parseval identity from Fourier series (Section 01) and Plancherel's theorem from this section? Are they the same result in different forms?

  4. Why does the Fourier Transform of a real-valued function ff satisfy f^(ξ)=f^(ξ)\hat{f}(-\xi) = \overline{\hat{f}(\xi)}? What does this mean for the magnitude and phase spectra?

  5. Explain why FNet achieves near-BERT performance on classification tasks but not on extractive QA. What mathematical property of the DFT explains this performance gap?

  6. In RoPE, the frequencies θj=100002j/d\theta_j = 10000^{-2j/d} are chosen geometrically. Why geometric spacing? What would happen if the frequencies were linearly spaced?

  7. Bochner's theorem says a shift-invariant kernel is PD iff it is the FT of a non-negative measure. What breaks down if the spectral density is not non-negative? Give an example.

  8. Explain the connection between the Poisson summation formula and Shannon's sampling theorem. Why does undersampling (below the Nyquist rate) cause aliasing?

  9. The heat equation solution is a convolution with a Gaussian that spreads over time. In the Fourier domain, how does the solution evolve? Which frequencies decay fastest and why?

  10. The differentiation property states F[f](ξ)=2πiξf^(ξ)\mathcal{F}[f'](\xi) = 2\pi i\xi\hat{f}(\xi). Use this to explain why smoother functions have faster spectral decay. What happens for a function with a jump discontinuity?

  11. In spectral normalization, the discriminator is normalized by σmax(W)\sigma_{\max}(W). Why does this enforce a Lipschitz constraint? How does the Lipschitz constant relate to training stability?

  12. The Gaussian eπt2e^{-\pi t^2} is the unique minimizer of the uncertainty bound. Does this mean we should always use Gaussian signals in practice? What are the tradeoffs?


End of Section 20-02 Fourier Transform.

Next: Section 20-03 DFT and FFT - discretizing the Fourier Transform for computation via the Cooley-Tukey algorithm.