Fourier Transform, Part 1: From Intuition (Section 1) to Tempered Distributions and the Dirac Delta (Section 7)

1. Intuition

1.1 From Fourier Series to Fourier Transform

Recall from Section 20-01 Fourier Series that a $2L$-periodic function has a Fourier series in complex exponential form:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\, e^{in\pi x/L}, \qquad c_n = \frac{1}{2L}\int_{-L}^{L} f(x)\,e^{-in\pi x/L}\,dx$$

The frequencies present are the discrete set $\{\ldots, -2/2L,\, -1/2L,\, 0,\, 1/2L,\, 2/2L, \ldots\}$, separated by spacing $\Delta\xi = 1/(2L)$.

Now ask: what happens as $L \to \infty$? The function is no longer required to repeat - it becomes an arbitrary function on all of $\mathbb{R}$. Three things change simultaneously:

  1. The frequency spacing $\Delta\xi = 1/(2L) \to 0$: the discrete spectrum becomes a continuum.
  2. The discrete sum $\sum_n$ becomes an integral $\int_{-\infty}^{\infty} \cdot\; d\xi$.
  3. The coefficient $c_n$ (which scales as $1/(2L)$) becomes a density $\hat{f}(\xi)\,d\xi$.

Concretely, substitute $\xi_n = n/(2L)$ and let $L \to \infty$:

$$f(x) = \sum_{n=-\infty}^{\infty} \underbrace{\left(\int_{-L}^{L} f(t)\,e^{-2\pi i \xi_n t}\,dt\right)}_{\approx\, \hat{f}(\xi_n)} e^{2\pi i \xi_n x}\, \underbrace{\frac{1}{2L}}_{\Delta\xi} \;\xrightarrow{\;L\to\infty\;}\; \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi x}\,d\xi$$

where the Fourier Transform (using the $\xi$-convention) is:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\,dt$$

This derivation makes the Fourier Transform inevitable: it is not a definition pulled from thin air, but the natural limit of the Fourier series as periodicity is removed. The continuous spectrum $\hat{f}(\xi)$ is the spectral density of $f$ - the amplitude and phase contributed by frequency $\xi$ per unit frequency interval.

Non-example: The derivation above requires that the "coefficients" $\hat{f}(\xi)$ remain well-defined as $L \to \infty$. For this, we need $\int_{-\infty}^{\infty} |f(t)|\,dt < \infty$, i.e., $f \in L^1(\mathbb{R})$. A constant function $f(t) = 1$ is NOT in $L^1(\mathbb{R})$, so it does not have a classical Fourier Transform - we must extend the theory to distributions (Section 7) to handle it.
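The limit above can be checked numerically. A minimal sketch (the test function, frequency, and grid sizes are our choices): compare the rescaled Fourier coefficient $2L\,c_n$ of a Gaussian on $[-L, L]$ against the continuous transform $\hat{f}(\xi) = e^{-\pi\xi^2}$.

```python
import numpy as np

# Rescaled Fourier coefficient 2L*c_n at xi_n = n/(2L), approximated by a
# Riemann sum, approaches the continuous transform as L grows.
f = lambda t: np.exp(-np.pi * t**2)   # self-dual Gaussian
xi = 0.5                              # n = 2L*xi is an integer for L = 1, 2, 4
exact = np.exp(-np.pi * xi**2)

for L in (1.0, 2.0, 4.0):
    t = np.linspace(-L, L, 8192, endpoint=False)
    dt = t[1] - t[0]
    coeff = np.sum(f(t) * np.exp(-2j * np.pi * xi * t)) * dt   # = 2L * c_n
    print(L, coeff.real, exact)       # gap shrinks as L grows
```

At $L = 1$ the truncated Gaussian tails are still visible in the coefficient; by $L = 4$ the two values agree to near machine precision.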

1.2 Frequency as a Continuous Variable

In a Fourier series, the spectrum is a discrete set of spikes: the function has energy only at the harmonics $n/T$. In the Fourier Transform, the spectrum is a continuous function $\hat{f}(\xi)$, and the function can have energy spread continuously across all frequencies.

The magnitude spectrum $|\hat{f}(\xi)|$ tells you how much of each frequency is present. The phase spectrum $\angle\hat{f}(\xi)$ tells you the phase offset of each frequency component. Together they completely determine $f$ (via the inversion theorem). Some examples of what the spectrum looks like:

  • A pure sinusoid $f(t) = \cos(2\pi\xi_0 t)$: spectrum is two spikes at $\pm\xi_0$ (as Dirac deltas).
  • A rectangular pulse (nonzero on $[-T/2, T/2]$): spectrum is a sinc function, spreading out in frequency.
  • A Gaussian $f(t) = e^{-\pi t^2}$: spectrum is another Gaussian (this is the self-dual case).
  • A chirp (frequency increasing linearly with time): spectrum is spread over a range of frequencies, with the time-frequency tradeoff governed by the uncertainty principle.
  • White noise: flat spectrum - equal energy at every frequency.

The key principle is the time-frequency duality: a signal that is concentrated in time (short duration) must have a broad spectrum, and a signal with a narrow spectrum (nearly monochromatic) must extend over a long time. This is not a limitation of our measurement apparatus - it is a mathematical theorem (the Heisenberg uncertainty principle, Section 6).

For AI: This duality directly constrains attention mechanisms. RoPE uses frequencies to encode position, and extending context length (longer time window) requires lower frequencies - which is why YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) rescale the frequency base to accommodate longer sequences. The uncertainty principle is the fundamental reason why this rescaling is necessary.

1.3 Why It Matters for AI (2026 Perspective)

The Fourier Transform is not peripheral to modern AI - it is structurally present in several of the most important systems:

FNet (Lee-Thorp et al., 2022): Replaces the self-attention sublayer in a Transformer with a 2D Fourier transform over the sequence and embedding dimensions. Achieves 92% of BERT's accuracy on GLUE benchmarks at 7x the training speed, because the FFT is $O(N \log N)$ while attention is $O(N^2)$. The mathematical insight: the FT acts as a "global mixer" that combines all token representations - a weaker but cheaper version of attention.

Random Fourier Features (Rahimi & Recht, 2007, NeurIPS Best Paper): Every shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ is the Fourier Transform of a non-negative measure (Bochner's theorem). By sampling random frequencies from that measure, one gets a randomized feature map $\phi(\mathbf{x}) \in \mathbb{R}^D$ such that $k(\mathbf{x},\mathbf{y}) \approx \phi(\mathbf{x})^\top\phi(\mathbf{y})$. This reduces kernel machine computation from $O(N^2)$ to $O(ND)$.
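A minimal sketch of the construction (dimensions, seed, and the choice of RBF kernel are ours): for the Gaussian kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2/2}$, Bochner's theorem gives the standard Gaussian as the spectral measure, so we sample frequencies from it.

```python
import numpy as np

# Random Fourier Features for the RBF kernel exp(-||x - y||^2 / 2),
# whose spectral measure (by Bochner's theorem) is N(0, I_d).
rng = np.random.default_rng(0)
d, D = 5, 20_000                      # input dim, number of random features
x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)      # a nearby pair, so k(x, y) is O(1)

W = rng.normal(size=(D, d))           # frequencies drawn from the spectral measure
b = rng.uniform(0, 2 * np.pi, size=D) # random phases

def phi(z):
    """Randomized feature map with phi(x)·phi(y) ≈ k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ z + b)

approx = phi(x) @ phi(y)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
print(approx, exact)                  # agree to O(1/sqrt(D))
```

The approximation error shrinks like $O(1/\sqrt{D})$, which is why a few thousand features already give a usable kernel estimate.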

Spectral Normalization (Miyato et al., 2018): To train stable GANs, the discriminator's weight matrices are normalized by their largest singular value (the spectral norm). This enforces a Lipschitz constraint on the discriminator, stabilizing training. The "spectral" in the name refers to the spectrum of the weight matrix viewed as a linear operator - its singular values - rather than to a Fourier spectrum.

Fourier Neural Operator (Li et al., 2021): Learns mappings between function spaces by applying a learnable linear transformation in Fourier space, then transforming back. Used for solving PDEs up to 1000x faster than traditional numerical methods. The key is that convolution in space = pointwise multiplication in Fourier space, so global dependencies can be captured efficiently.

RoPE (Su et al., 2021): Used in LLaMA-3, Mistral, Qwen, and virtually every modern LLM. Interprets each position as a rotation in the complex plane, using Fourier basis vectors $e^{im\theta_j}$ at different frequencies $\theta_j$ for different embedding dimensions. The relative-position property follows from the group structure of the complex exponential.

1.4 Historical Timeline

FOURIER TRANSFORM - HISTORICAL MILESTONES
========================================================================

  1822  Fourier's "Théorie analytique de la chaleur": introduces the
        integral formula that becomes the Fourier Transform
  1880s Rayleigh uses Fourier integrals in optics and acoustics
  1898  Heaviside operational calculus (early form of the Laplace/FT)
  1915  Plancherel proves the L2 isometry theorem (Plancherel's theorem)
  1927  Heisenberg formulates the uncertainty principle in quantum
        mechanics; Kennard gives the precise mathematical inequality
  1933  Norbert Wiener's "The Fourier Integral and Certain of its
        Applications" - rigorous L2 theory; lays groundwork for
        signal processing as a mathematical discipline
  1949  Shannon's sampling theorem (uses Poisson summation formula)
  1950  Schwartz's "Théorie des distributions": FT extended to the Dirac
        delta, constants, sinusoids
  1965  Cooley & Tukey publish the FFT - makes FT computable in O(N log N)
  1975  FT enters digital signal processing (audio, radar, MRI)
  2007  Rahimi & Recht: Random Fourier Features (NeurIPS Best Paper)
  2017  Vaswani et al.: sinusoidal positional encodings in Transformers
  2018  Miyato et al.: spectral normalization for GANs
  2021  Li et al.: Fourier Neural Operator for PDE solving
  2021  Su et al.: RoPE - now in LLaMA-3, Mistral, Qwen, GPT-NeoX
  2022  Lee-Thorp et al.: FNet - Fourier-based alternative to attention
  2023  YaRN (Peng et al.): Fourier-based context length extension
  2024  LongRoPE (Ding et al.): progressive frequency rescaling to 2M
        context length; RWKV-7 uses state-space FT interpretation

========================================================================

1.5 What the Fourier Transform Does Geometrically

The most useful geometric picture is that the Fourier Transform decomposes $f$ in an orthonormal basis of complex exponentials - just as the Fourier series decomposed a periodic function in the trigonometric basis. The difference is that now the "basis" is uncountably infinite and the "coefficients" form a continuous function rather than a sequence.

Formally, the complex exponentials $\{e^{2\pi i \xi t}\}_{\xi \in \mathbb{R}}$ do not form a basis of $L^2(\mathbb{R})$ in the usual Hilbert space sense (they are not in $L^2(\mathbb{R})$ themselves - they have constant modulus 1 and are not square-integrable). The rigorous statement is that the Fourier Transform is a unitary operator on $L^2(\mathbb{R})$ (Plancherel's theorem), meaning it preserves inner products and norms:

$$\langle f, g \rangle_{L^2} = \langle \hat{f}, \hat{g} \rangle_{L^2}$$

Think of the Fourier Transform as a rotation in an infinite-dimensional space: it rotates the function $f$ into a new coordinate system (the frequency domain) where the same function looks different but has exactly the same "length" (energy). Just as rotating a vector in $\mathbb{R}^n$ preserves its norm but changes its coordinates, the Fourier Transform preserves energy but expresses it in frequency coordinates.
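The discrete analogue of this unitarity is easy to check (a quick sketch; the random signal is arbitrary): with the DFT's standard normalization, $\sum_k |x_k|^2 = \frac{1}{N}\sum_k |X_k|^2$.

```python
import numpy as np

# Discrete analogue of Plancherel: the DFT (suitably normalized)
# preserves the l2 energy of the signal.
x = np.random.default_rng(2).normal(size=1024)
X = np.fft.fft(x)
energy_time = np.sum(np.abs(x) ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / len(x)
print(energy_time, energy_freq)       # identical up to rounding
```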

A second geometric picture: the Fourier Transform of the convolution $f * g$ is the pointwise product $\hat{f} \cdot \hat{g}$. Convolution in time/space corresponds to multiplication in frequency. This is the key theorem for signal processing and the foundation of efficient convolution in CNNs via the FFT. (Full treatment in Section 20-04 Convolution Theorem.)

GEOMETRIC PICTURE OF THE FOURIER TRANSFORM
========================================================================

  Time domain                    Frequency domain
  -------------                  ----------------
  f(t): a function of time       f̂(ξ): a function of frequency

  Signal duration T    <-->  Bandwidth ~ 1/T            (uncertainty principle)

  Convolution f*g      <-->  Pointwise product f̂·ĝ      (Convolution Theorem)

  Differentiation d/dt <-->  Multiplication by 2πiξ

  Translation f(t-a)   <-->  Modulation e^{-2πiξa} f̂(ξ)

  Scaling f(at)        <-->  Dilation (1/|a|) f̂(ξ/a)

  The FT is a UNITARY OPERATOR on L²(ℝ):
    ‖f̂‖₂ = ‖f‖₂        (Plancherel)
    ⟨f, g⟩ = ⟨f̂, ĝ⟩    (Parseval)

========================================================================

2. Formal Definitions

2.1 The Fourier Transform on $L^1(\mathbb{R})$

Definition 2.1 (Fourier Transform on $L^1$). For $f \in L^1(\mathbb{R})$ (i.e., $\int_{-\infty}^{\infty}|f(t)|\,dt < \infty$), the Fourier Transform of $f$ is the function $\hat{f}: \mathbb{R} \to \mathbb{C}$ defined by:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\,dt$$

We write $\hat{f} = \mathcal{F}[f]$ or $f \mapsto \hat{f}$.

Well-definedness. For $f \in L^1(\mathbb{R})$ and any $\xi \in \mathbb{R}$:

$$|\hat{f}(\xi)| = \left|\int_{-\infty}^{\infty} f(t)\,e^{-2\pi i\xi t}\,dt\right| \leq \int_{-\infty}^{\infty}|f(t)|\,|e^{-2\pi i\xi t}|\,dt = \int_{-\infty}^{\infty}|f(t)|\,dt = \lVert f \rVert_1$$

So $\hat{f}$ is bounded: $\lVert\hat{f}\rVert_\infty \leq \lVert f \rVert_1$. Moreover, one can show (by dominated convergence) that $\hat{f}$ is continuous on $\mathbb{R}$, and by the Riemann-Lebesgue lemma, $\hat{f}(\xi) \to 0$ as $|\xi| \to \infty$.

Theorem 2.1 (Riemann-Lebesgue Lemma). If $f \in L^1(\mathbb{R})$, then $\hat{f} \in C_0(\mathbb{R})$ (continuous functions vanishing at infinity):

$$\lim_{|\xi|\to\infty} \hat{f}(\xi) = 0$$

Proof sketch. For step functions, the integral reduces to a finite sum of terms $\sim e^{-2\pi i\xi a}/\xi$, each going to 0. Approximate a general $L^1$ function by step functions and use the boundedness $\lVert\hat{f}\rVert_\infty \leq \lVert f \rVert_1$. $\square$

What $\hat{f}(\xi)$ tells you. The value $\hat{f}(\xi)$ is a complex number encoding:

  • $|\hat{f}(\xi)|$: the amplitude of frequency $\xi$ in $f$
  • $\arg\hat{f}(\xi)$: the phase of frequency $\xi$ in $f$

The power spectrum $|\hat{f}(\xi)|^2$ gives the energy density per unit frequency at $\xi$.

2.2 Convention War: $\omega$ vs $\xi$ vs $f$

The Fourier Transform appears in three notational conventions in the literature, all equivalent but related by rescaling:

  • Frequency $\xi$ (Hz): $\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt$, inverse $f(t) = \int \hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$. Used in signal processing and probability.
  • Angular frequency $\omega$ (rad/s), no $\frac{1}{2\pi}$ in the forward transform: $\hat{f}(\omega) = \int f(t)e^{-i\omega t}\,dt$, inverse $f(t) = \frac{1}{2\pi}\int \hat{f}(\omega)e^{i\omega t}\,d\omega$. Used in physics and engineering.
  • Angular frequency $\omega$, symmetric: $\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}}\int f(t)e^{-i\omega t}\,dt$, inverse $f(t) = \frac{1}{\sqrt{2\pi}}\int \hat{f}(\omega)e^{i\omega t}\,d\omega$. Used in pure mathematics.

This section uses the $\xi$-convention (the first convention above). It is self-symmetric (no $2\pi$ prefactors), maps the Gaussian to itself, and is standard in modern ML papers (FNet, FNO, RFF all use it). The relationship between conventions: $\hat{f}_\omega(\omega) = \hat{f}_\xi(\omega/2\pi)$.

Warning: The most common source of errors in Fourier analysis is mixing conventions. When reading a paper, identify the convention in the first equation before proceeding. Properties like "differentiation multiplies by $i\omega$" vs "multiplies by $2\pi i\xi$" depend entirely on which convention is in use.

2.3 Standard Examples

These four transforms appear constantly in applications and should be memorized.

Example 1: Rectangular Pulse (rect function)

$$\operatorname{rect}(t) = \begin{cases}1 & |t| \leq 1/2 \\ 0 & |t| > 1/2\end{cases}$$

$$\widehat{\operatorname{rect}}(\xi) = \int_{-1/2}^{1/2} e^{-2\pi i\xi t}\,dt = \frac{e^{\pi i\xi} - e^{-\pi i\xi}}{2\pi i\xi} = \frac{\sin(\pi\xi)}{\pi\xi} =: \operatorname{sinc}(\xi)$$

The sinc function oscillates and decays as $1/|\xi|$. The slow decay reflects the sharp discontinuity of $\operatorname{rect}$ - sharp features in time produce slow decay in frequency (this is the spectral analog of the Gibbs phenomenon from Section 20-01).

A general pulse of width $T$: $f(t) = \operatorname{rect}(t/T)$ has transform $\hat{f}(\xi) = T\,\operatorname{sinc}(T\xi)$. Wider pulse -> narrower, taller sinc lobe.
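A numerical sanity check of the rect-sinc pair (grid parameters are ours); conveniently, NumPy's `np.sinc` uses the same normalized convention $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$.

```python
import numpy as np

# Riemann-sum Fourier transform of rect, compared against sinc(xi).
t = np.linspace(-2, 2, 2**16, endpoint=False)
dt = t[1] - t[0]
rect = (np.abs(t) <= 0.5).astype(float)

for xi in (0.3, 1.3, 2.7):
    num = np.sum(rect * np.exp(-2j * np.pi * xi * t)) * dt
    print(xi, num.real, np.sinc(xi))   # agree to roughly the grid spacing
```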

Example 2: Gaussian

$$f(t) = e^{-\pi t^2} \qquad\qquad \hat{f}(\xi) = \int_{-\infty}^{\infty} e^{-\pi t^2} e^{-2\pi i\xi t}\,dt = e^{-\pi\xi^2}$$

The Gaussian is self-dual under the Fourier Transform with the $\xi$-convention: $\hat{f} = f$. This makes the Gaussian the natural test function in Fourier analysis, the ground state in quantum mechanics, and the kernel of the heat equation.

Derivation: Complete the square in the exponent: $-\pi t^2 - 2\pi i\xi t = -\pi(t + i\xi)^2 - \pi\xi^2$. Then:

$$\hat{f}(\xi) = e^{-\pi\xi^2}\int_{-\infty}^{\infty}e^{-\pi(t+i\xi)^2}\,dt = e^{-\pi\xi^2} \cdot 1$$

where the integral equals 1 by contour integration (the integrand is analytic and the contour shift $t \to t + i\xi$ is justified because the integrand decays rapidly).

Example 3: Lorentzian (Cauchy distribution)

$$f(t) = \frac{1}{1 + (2\pi t)^2} \qquad\qquad \hat{f}(\xi) = \frac{1}{2}e^{-|\xi|}$$

The exponential decay in frequency reflects the analyticity of the Lorentzian: it extends to a complex-analytic function on a strip (its nearest poles sit at $t = \pm i/(2\pi)$), and analyticity in a strip corresponds to exponential spectral decay. The relationship between smoothness and spectral decay is captured by the general theorem in Section 3.5.

Example 4: One-Sided Exponential

$$f(t) = e^{-at}\,\mathbb{1}_{t \geq 0}, \quad a > 0 \qquad\qquad \hat{f}(\xi) = \int_0^\infty e^{-at}e^{-2\pi i\xi t}\,dt = \frac{1}{a + 2\pi i\xi}$$

This has modulus $|\hat{f}(\xi)| = 1/\sqrt{a^2 + 4\pi^2\xi^2}$, which decays as $1/|\xi|$ for large $\xi$ - reflecting the discontinuity at $t = 0$.
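Checking $\hat{f}(\xi) = 1/(a + 2\pi i\xi)$ numerically (the truncation at $t = 40$ and the grid size are our choices):

```python
import numpy as np

# Riemann-sum transform of the one-sided exponential e^{-at} 1_{t>=0}.
a, xi = 1.0, 0.7
t = np.linspace(0, 40, 2**18, endpoint=False)
dt = t[1] - t[0]
num = np.sum(np.exp(-a * t) * np.exp(-2j * np.pi * xi * t)) * dt
exact = 1 / (a + 2j * np.pi * xi)
print(num, exact, abs(num - exact))
```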

Non-example 1: $f(t) = 1/(1+t^2)^{1/4}$. This is in $L^2(\mathbb{R})$ but not $L^1(\mathbb{R})$ (it decays too slowly), so the integral $\int f(t)e^{-2\pi i\xi t}\,dt$ does not converge absolutely and the definition above does not apply. The $L^2$ theory (Section 5) handles this case.

Non-example 2: $f(t) = \sin(2\pi\xi_0 t)$. This is bounded but not in $L^1(\mathbb{R})$ or $L^2(\mathbb{R})$ - it does not decay. Its Fourier Transform exists only as a distribution: $\mathcal{F}[\sin(2\pi\xi_0 t)] = \frac{i}{2}[\delta(\xi + \xi_0) - \delta(\xi - \xi_0)]$ (see Section 7).

2.4 The Inverse Fourier Transform

Definition 2.2 (Inverse Fourier Transform). For $g \in L^1(\mathbb{R})$:

$$\mathcal{F}^{-1}[g](t) = \int_{-\infty}^{\infty} g(\xi)\,e^{2\pi i\xi t}\,d\xi$$

Note: the inverse is the same as the forward transform except that the sign in the exponent is flipped ($-2\pi i\xi t \to +2\pi i\xi t$). This symmetry between the forward and inverse transforms is one of the advantages of the $\xi$-convention.

The inversion problem: Given $\hat{f}$, can we recover $f$? Yes, under mild conditions:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,d\xi$$

but this requires knowing that $\hat{f} \in L^1(\mathbb{R})$ (which is not guaranteed by $f \in L^1(\mathbb{R})$ alone - for instance, $\widehat{\operatorname{rect}}(\xi) = \operatorname{sinc}(\xi)$ is not in $L^1$). The full inversion theorem is treated rigorously in Section 4.

2.5 Non-Examples and the $L^2$ Extension

The classical $L^1$ definition fails for many important functions:

  • $f(t) = e^{-t^2/2}$ (a broader Gaussian): in $L^1$, so the classical definition applies - not a failure.
  • $f(t) = (1+t^2)^{-1/4}$: not in $L^1$; handled by the $L^2$ Plancherel theory (Section 5).
  • $f(t) = 1$: not in $L^1$ or $L^2$; handled by distributions (Section 7).
  • $f(t) = \sin(2\pi\xi_0 t)$: not in $L^1$ or $L^2$; handled by distributions (Section 7).
  • $f(t) = \delta(t)$: not a function at all; handled by distributions (Section 7).

The $L^2$ extension proceeds via a density argument. The subspace $L^1(\mathbb{R}) \cap L^2(\mathbb{R})$ is dense in $L^2(\mathbb{R})$. For $f \in L^1 \cap L^2$, the classical integral defines $\hat{f}$. The Plancherel theorem (Section 5) then shows $\lVert\hat{f}\rVert_2 = \lVert f \rVert_2$, which allows extending $\mathcal{F}$ to all of $L^2$ by continuity. The resulting transform is a unitary operator on $L^2(\mathbb{R})$.

3. Core Properties

3.1 Linearity and Conjugate Symmetry

Theorem 3.1 (Linearity). For $f, g \in L^1(\mathbb{R})$ and $\alpha, \beta \in \mathbb{C}$:

$$\mathcal{F}[\alpha f + \beta g](\xi) = \alpha\hat{f}(\xi) + \beta\hat{g}(\xi)$$

Proof: Immediate from linearity of the integral. $\square$

Theorem 3.2 (Conjugate Symmetry / Hermitian Property). For real-valued $f \in L^1(\mathbb{R})$:

$$\hat{f}(-\xi) = \overline{\hat{f}(\xi)}$$

Proof: $\hat{f}(-\xi) = \int f(t)e^{2\pi i\xi t}\,dt = \overline{\int f(t)e^{-2\pi i\xi t}\,dt} = \overline{\hat{f}(\xi)}$, where the last step uses $f(t) = \overline{f(t)}$ (real-valued). $\square$

Corollaries for real $f$:

  • $|\hat{f}(-\xi)| = |\hat{f}(\xi)|$: the magnitude spectrum is even
  • $\arg\hat{f}(-\xi) = -\arg\hat{f}(\xi)$: the phase spectrum is odd
  • If $f$ is also even, then $\hat{f}$ is real-valued and even
  • If $f$ is also odd, then $\hat{f}$ is purely imaginary and odd

For AI: The Hermitian symmetry means that for real signals, half the spectrum is redundant - you only need frequencies $\xi \geq 0$. This is why the FFT of a real signal produces $N/2 + 1$ unique complex outputs from $N$ real inputs. Real-valued weight matrices in neural networks have Hermitian-symmetric spectra, which is exploited in efficient spectral computation.
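This redundancy is directly visible in NumPy (array sizes are ours): the full FFT of a real signal satisfies $X[N-k] = \overline{X[k]}$, and `np.fft.rfft` returns only the $N/2 + 1$ non-redundant bins.

```python
import numpy as np

# Hermitian symmetry of the spectrum of a real signal: X[N-k] = conj(X[k]).
x = np.random.default_rng(4).normal(size=256)
X = np.fft.fft(x)
sym_err = np.max(np.abs(X[1:][::-1] - np.conj(X[1:])))
print(sym_err)                # tiny (floating-point rounding only)
print(np.fft.rfft(x).shape)   # (129,) == N/2 + 1 non-redundant bins
```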

3.2 Time-Shift and Frequency-Shift (Modulation)

Theorem 3.3 (Time-Shift / Translation). For $a \in \mathbb{R}$:

$$\mathcal{F}[f(t - a)](\xi) = e^{-2\pi i\xi a}\,\hat{f}(\xi)$$

Proof: $\int f(t-a)e^{-2\pi i\xi t}\,dt \stackrel{u=t-a}{=} \int f(u)e^{-2\pi i\xi(u+a)}\,du = e^{-2\pi i\xi a}\hat{f}(\xi)$. $\square$

Interpretation: Shifting a signal in time multiplies its spectrum by a complex phase factor. The magnitude spectrum $|\hat{f}(\xi)|$ is unchanged - a time shift does not affect which frequencies are present, only their phases. The phase changes linearly with frequency: $\phi \mapsto \phi - 2\pi\xi a$.

Theorem 3.4 (Frequency-Shift / Modulation). For $\xi_0 \in \mathbb{R}$:

$$\mathcal{F}[e^{2\pi i\xi_0 t}f(t)](\xi) = \hat{f}(\xi - \xi_0)$$

Proof: $\int e^{2\pi i\xi_0 t}f(t)e^{-2\pi i\xi t}\,dt = \int f(t)e^{-2\pi i(\xi - \xi_0)t}\,dt = \hat{f}(\xi - \xi_0)$. $\square$

Interpretation: Multiplying by a complex exponential (modulation) shifts the spectrum. This is the mathematical basis for amplitude modulation (AM) radio and for the RoPE positional encoding.

For AI - RoPE connection: In RoPE, the query at position $m$ is $\mathbf{q}_m = e^{im\theta}\mathbf{q}$ (rotation by $m\theta$ in the complex plane). The inner product between the query at $m$ and the key at $n$ is $\langle e^{im\theta}\mathbf{q},\, e^{in\theta}\mathbf{k}\rangle = e^{i(m-n)\theta}\langle\mathbf{q},\mathbf{k}\rangle$, depending only on the relative position $m - n$. This is the modulation theorem in action.
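A toy numerical check of the relative-position property, with scalar complex "queries" and "keys" (the values and the angle $\theta$ are ours):

```python
import numpy as np

# Rotating q by m*theta and k by n*theta leaves the complex inner
# product depending only on the offset m - n.
rng = np.random.default_rng(5)
q = complex(rng.normal(), rng.normal())
k = complex(rng.normal(), rng.normal())
theta = 0.3

def score(m, n):
    """Inner product of rotated query and key: e^{i(m-n)theta} q conj(k)."""
    return np.exp(1j * m * theta) * q * np.conj(np.exp(1j * n * theta) * k)

s1, s2 = score(7, 4), score(107, 104)   # same relative offset m - n = 3
print(s1, s2)                           # identical up to rounding
```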

3.3 Scaling and Dilation

Theorem 3.5 (Scaling / Dilation). For $a \in \mathbb{R}$, $a \neq 0$:

$$\mathcal{F}[f(at)](\xi) = \frac{1}{|a|}\hat{f}\!\left(\frac{\xi}{a}\right)$$

Proof: $\int f(at)e^{-2\pi i\xi t}\,dt \stackrel{u=at}{=} \frac{1}{|a|}\int f(u)e^{-2\pi i(\xi/a)u}\,du = \frac{1}{|a|}\hat{f}(\xi/a)$. $\square$

Interpretation: This is the time-bandwidth product in action:

  • Compressing a signal in time ($a > 1$, so $f(at)$ is narrower): the spectrum stretches by a factor of $a$ and shrinks in amplitude by $1/a$
  • Stretching a signal in time ($0 < a < 1$): the spectrum compresses

This confirms the time-frequency duality: doubling the duration halves the bandwidth, and vice versa. This is a hard mathematical constraint, not an engineering limitation.

3.4 Time Reversal

Theorem 3.6 (Time Reversal). For $f \in L^1(\mathbb{R})$:

$$\mathcal{F}[f(-t)](\xi) = \hat{f}(-\xi) = \overline{\hat{f}(\xi)} \text{ (for real } f\text{)}$$

Proof: $\int f(-t)e^{-2\pi i\xi t}\,dt \stackrel{u=-t}{=} \int f(u)e^{2\pi i\xi u}\,du = \hat{f}(-\xi)$. $\square$

Duality: There is a deeper symmetry: $\mathcal{F}[\hat{f}](t) = f(-t)$. Applying the Fourier Transform four times returns the original function: $\mathcal{F}^4 = I$. This means the FT has eigenvalues $\{1, -i, -1, i\}$ and the Hermite functions are its eigenfunctions (with the Gaussian as the eigenfunction for eigenvalue 1).

3.5 Differentiation and Integration

Theorem 3.7 (Differentiation in Time). If $f, f' \in L^1(\mathbb{R})$:

$$\mathcal{F}[f'(t)](\xi) = 2\pi i\xi\,\hat{f}(\xi)$$

More generally, $\mathcal{F}[f^{(n)}(t)](\xi) = (2\pi i\xi)^n\hat{f}(\xi)$.

Proof: Integration by parts: $\int f'(t)e^{-2\pi i\xi t}\,dt = [f(t)e^{-2\pi i\xi t}]_{-\infty}^{\infty} + 2\pi i\xi\int f(t)e^{-2\pi i\xi t}\,dt$. The boundary term vanishes: since $f' \in L^1$, $f$ has limits at $\pm\infty$, and these limits must be 0 because $f \in L^1$. $\square$

Theorem 3.8 (Differentiation in Frequency). If $tf(t) \in L^1(\mathbb{R})$:

$$\frac{d}{d\xi}\hat{f}(\xi) = -2\pi i\,\widehat{tf(t)}(\xi)$$

Equivalently: $\mathcal{F}[(-2\pi it)^n f(t)](\xi) = \frac{d^n}{d\xi^n}\hat{f}(\xi)$.

Key consequence for smoothness vs. spectral decay:

  • $f \in C^k$ with $f^{(k)} \in L^1$: $\hat{f}(\xi) = o(|\xi|^{-k})$ as $|\xi| \to \infty$
  • $f \in C^\infty$ (smooth): $\hat{f}$ decays faster than any polynomial
  • $f$ has a jump discontinuity: $\hat{f}$ decays no faster than $1/|\xi|$
  • $f$ analytic (in a strip): $\hat{f}$ decays exponentially

For AI: The differentiation property is why Fourier methods are efficient for solving differential equations. It converts the PDE $\partial_t u = a\,\partial_{xx}u$ into an ODE $\partial_t\hat{u} = -4\pi^2\xi^2 a\,\hat{u}$, which is solved by $\hat{u}(\xi, t) = \hat{u}(\xi, 0)\,e^{-4\pi^2 a\xi^2 t}$ - a simple exponential decay. The Fourier Neural Operator (Section 8.4) exploits this by learning the spectral solution operator directly.
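A sketch of this spectral recipe for the heat equation on a periodic domain (the grid, diffusivity, and single-mode initial condition are our choices):

```python
import numpy as np

# Solve u_t = a u_xx on a periodic domain: each Fourier mode just decays
# by exp(-4 pi^2 a xi^2 t). Compare with the exact single-mode solution.
N, a, t_end = 256, 0.01, 0.5
x = np.linspace(0, 1, N, endpoint=False)
u0 = np.sin(2 * np.pi * x)
xi = np.fft.fftfreq(N, d=1 / N)       # frequencies in cycles per unit length
u_hat = np.fft.fft(u0) * np.exp(-4 * np.pi**2 * a * xi**2 * t_end)
u = np.fft.ifft(u_hat).real
u_exact = np.exp(-4 * np.pi**2 * a * t_end) * np.sin(2 * np.pi * x)
err = np.max(np.abs(u - u_exact))
print(err)   # near machine precision
```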

3.6 The Master Properties Table

Each row reads as a pair $f(t) \;\leftrightarrow\; \hat{f}(\xi)$:

  • Linearity: $\alpha f(t) + \beta g(t) \;\leftrightarrow\; \alpha\hat{f}(\xi) + \beta\hat{g}(\xi)$
  • Time shift: $f(t - a) \;\leftrightarrow\; e^{-2\pi i\xi a}\hat{f}(\xi)$
  • Frequency shift: $e^{2\pi i\xi_0 t}f(t) \;\leftrightarrow\; \hat{f}(\xi - \xi_0)$
  • Scaling: $f(at) \;\leftrightarrow\; \frac{1}{|a|}\hat{f}(\xi/a)$
  • Time reversal: $f(-t) \;\leftrightarrow\; \hat{f}(-\xi)$
  • Conjugation: $\overline{f(t)} \;\leftrightarrow\; \overline{\hat{f}(-\xi)}$
  • Hermitian (real $f$): $f(t) \in \mathbb{R} \;\Rightarrow\; \hat{f}(-\xi) = \overline{\hat{f}(\xi)}$
  • Differentiation: $f^{(n)}(t) \;\leftrightarrow\; (2\pi i\xi)^n\hat{f}(\xi)$
  • Frequency differentiation: $(-2\pi it)^n f(t) \;\leftrightarrow\; \hat{f}^{(n)}(\xi)$
  • Convolution: $(f * g)(t) \;\leftrightarrow\; \hat{f}(\xi)\cdot\hat{g}(\xi)$
  • Multiplication: $f(t)\cdot g(t) \;\leftrightarrow\; (\hat{f} * \hat{g})(\xi)$
  • Duality: $\hat{f}(t) \;\leftrightarrow\; f(-\xi)$

3.7 Convolution-Multiplication Duality - Preview

The Convolution Theorem states that the Fourier Transform converts convolution into pointwise multiplication:

$$\mathcal{F}[f * g](\xi) = \hat{f}(\xi) \cdot \hat{g}(\xi)$$

where $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau)\,d\tau$.

This is the most important property of the Fourier Transform for applications. It means that:

  • Linear filtering (convolution with a filter $h$) corresponds to pointwise multiplication of spectra
  • A filter's effect on frequency $\xi$ is simply multiplication by $\hat{h}(\xi)$ (the frequency response)
  • Convolution of length-$N$ signals costs $O(N^2)$ directly but $O(N\log N)$ via the FFT

Preview: The full treatment of the convolution theorem - including circular convolution, cross-correlation, the Wiener-Khinchin theorem, and applications to CNNs, WaveNet, and Mamba - is in Section 20-04 Convolution Theorem. Here we state the theorem and use it; the proof and all applications belong there.
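In discrete form the statement can be used right away (sizes and seed are ours): FFT-based convolution with zero padding reproduces direct convolution.

```python
import numpy as np

# Convolution theorem, discrete form: multiply spectra, transform back.
# Zero-pad to length >= 2N - 1 so circular convolution equals linear.
rng = np.random.default_rng(1)
f, g = rng.normal(size=256), rng.normal(size=256)

direct = np.convolve(f, g)                                     # O(N^2), length 511
viafft = np.fft.ifft(np.fft.fft(f, 512) * np.fft.fft(g, 512)).real[:511]
err = np.max(np.abs(direct - viafft))
print(err)   # floating-point rounding only
```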

4. The Fourier Inversion Theorem

4.1 Statement and Sufficient Conditions

The fundamental question: given $\hat{f}$, can we recover $f$? The answer is yes under appropriate conditions, but the precise statement requires care.

Theorem 4.1 (Fourier Inversion Theorem). Suppose $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then for almost every $t \in \mathbb{R}$:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,d\xi$$

Moreover, if $f$ is also continuous at $t$, equality holds exactly (not just a.e.).

The subtlety: The condition $\hat{f} \in L^1(\mathbb{R})$ is not automatic. If $f \in L^1(\mathbb{R})$, then $\hat{f}$ is bounded and continuous (by Riemann-Lebesgue), but not necessarily in $L^1$. For example, $f = \operatorname{rect}$ gives $\hat{f} = \operatorname{sinc}$, and $\operatorname{sinc} \notin L^1(\mathbb{R})$ (it decays too slowly: $\int_1^\infty \frac{|\sin(\pi\xi)|}{\pi\xi}\,d\xi = \infty$). The inversion of the rect function requires the $L^2$ theory.

Sufficient condition for inversion: If $f \in L^1(\mathbb{R})$ and $f$ has bounded variation near $t$ (e.g., $f$ is differentiable at $t$), then the principal value integral:

$$\lim_{R\to\infty}\int_{-R}^{R}\hat{f}(\xi)\,e^{2\pi i\xi t}\,d\xi$$

converges to $f(t)$ where $f$ is continuous, and to $\frac{1}{2}[f(t^+) + f(t^-)]$ at jump discontinuities - exactly as in Dirichlet's theorem for Fourier series.
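The midpoint behavior can be seen numerically for $f = \operatorname{rect}$ at its jump $t = 1/2$, where $f(t^+) = 0$ and $f(t^-) = 1$ (the truncation $R$ and the grid are our choices):

```python
import numpy as np

# Truncated inversion integral of rect at the jump t = 1/2:
# Re[sinc(xi) e^{i pi xi}] = sinc(xi) cos(pi xi); the odd imaginary
# part integrates to zero. The limit is the midpoint value 1/2.
R = 50.0
xi, dxi = np.linspace(-R, R, 400_001, retstep=True)
integrand = np.sinc(xi) * np.cos(np.pi * xi)
partial = np.sum(integrand) * dxi
print(partial)   # close to 0.5, with an O(1/R) oscillatory correction
```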

4.2 Proof via Approximate Identity

The standard proof uses the concept of an approximate identity - a family of functions that converges to the Dirac delta as a parameter goes to zero.

Definition (Approximate Identity). A family $\{k_\epsilon\}_{\epsilon > 0}$ is an approximate identity if:

  1. $k_\epsilon \geq 0$
  2. $\int k_\epsilon(t)\,dt = 1$ for all $\epsilon > 0$
  3. For every $\delta > 0$: $\int_{|t| > \delta} k_\epsilon(t)\,dt \to 0$ as $\epsilon \to 0$

The Gauss-Weierstrass kernel $k_\epsilon(t) = \frac{1}{\epsilon}e^{-\pi t^2/\epsilon^2}$ is an approximate identity.

Proof sketch of inversion theorem:

Define the Gauss-regularized inversion:

$$f_\epsilon(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\,e^{2\pi i\xi t}\,e^{-\pi\epsilon^2\xi^2}\,d\xi$$

The factor $e^{-\pi\epsilon^2\xi^2}$ provides absolute convergence (the Gaussian is in $L^1$). Using Fubini's theorem to interchange the order of integration:

$$f_\epsilon(t) = \int_{-\infty}^{\infty} f(s)\, \underbrace{\int_{-\infty}^{\infty} e^{-\pi\epsilon^2\xi^2}e^{2\pi i\xi(t-s)}\,d\xi}_{=\; \frac{1}{\epsilon}e^{-\pi(t-s)^2/\epsilon^2}}\, ds = (f * k_\epsilon)(t)$$

Since $\{k_\epsilon\}$ is an approximate identity, $f_\epsilon(t) \to f(t)$ as $\epsilon \to 0$ at every point of continuity of $f$ (and in $L^1$ norm). Since also $f_\epsilon(t) \to \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$ as $\epsilon \to 0$ when $\hat{f} \in L^1$ (by dominated convergence), the result follows. $\square$

Significance: This proof reveals why the Gaussian plays a central role - it is a fixed point of the Fourier Transform, which makes the Gauss regularization self-consistent.

4.3 The Inversion Formula in Practice

The inversion formula is used to:

  1. Recover a signal from its spectrum: given $\hat{f}$, compute $f(t) = \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$
  2. Compute transforms of new functions from known transforms: use the FT table plus the properties of Section 3
  3. Solve PDEs: take the FT, solve the resulting ODE, invert

Example: Compute $f(t)$ such that $\hat{f}(\xi) = e^{-2\pi|\xi|}$ (double-sided exponential spectrum).

By the inversion formula:

$$f(t) = \int_{-\infty}^{\infty} e^{-2\pi|\xi|}e^{2\pi i\xi t}\,d\xi = \int_0^\infty e^{-2\pi\xi(1-it)}\,d\xi + \int_{-\infty}^0 e^{2\pi\xi(1+it)}\,d\xi = \frac{1}{2\pi(1-it)} + \frac{1}{2\pi(1+it)} = \frac{1}{\pi}\cdot\frac{1}{1+t^2}$$

So $\mathcal{F}[1/(\pi(1+t^2))](\xi) = e^{-2\pi|\xi|}$ - confirming the Lorentzian result from Section 2.3 (Example 3).

Numerical verification: The inversion theorem can be verified numerically via the FFT: compute $\hat{f}$ numerically, then invert, and check that you recover $f$ up to discretization error (this is Exercise 3 in Section 10).
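A minimal discrete version of that check (length and seed are ours): `fft` followed by `ifft` returns the signal to machine precision.

```python
import numpy as np

# Discrete inversion round trip: ifft(fft(x)) == x up to rounding.
x = np.random.default_rng(3).normal(size=512)
x_rec = np.fft.ifft(np.fft.fft(x)).real
err = np.max(np.abs(x - x_rec))
print(err)   # near machine precision
```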

4.4 The Self-Dual Gaussian

The Gaussian g(t)=eπt2g(t) = e^{-\pi t^2} satisfies g^=g\hat{g} = g, making it a fixed point of the Fourier Transform. This property makes it indispensable in both theory and practice.

Generalized Gaussian family. For σ>0\sigma > 0, define gσ(t)=eπt2/σ2g_\sigma(t) = e^{-\pi t^2/\sigma^2} (Gaussian of width σ\sigma). By the scaling theorem:

g^σ(ξ)=σeπσ2ξ2=σg1/σ(ξ)\hat{g}_\sigma(\xi) = \sigma\,e^{-\pi\sigma^2\xi^2} = \sigma\,g_{1/\sigma}(\xi)

Wide Gaussian (σ\sigma large) -> narrow Fourier Transform (bandwidth 1/σ\sim 1/\sigma). This is the uncertainty principle made explicit: the product of time-width σ\sigma and frequency-width 1/σ1/\sigma equals 1, the minimum possible (see Section 6).

Eigenfunctions of the FT: The Fourier Transform has eigenvalues λ{1,i,1,i}\lambda \in \{1, -i, -1, i\}. The corresponding eigenfunctions are the Hermite functions:

ψn(t)=Hn(2πt)eπt2\psi_n(t) = H_n(\sqrt{2\pi}\,t)\,e^{-\pi t^2}

where HnH_n is the nn-th Hermite polynomial. For n=0n = 0: ψ0(t)=eπt2\psi_0(t) = e^{-\pi t^2} (eigenvalue 1, the self-dual Gaussian). For n=1n = 1: ψ1(t)=22πteπt2\psi_1(t) = 2\sqrt{2\pi}\,t\,e^{-\pi t^2} (eigenvalue i-i).
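
The eigenvalue relations can be verified by direct quadrature of the defining integral; a sketch with illustrative grid bounds (the Gaussian factor makes the truncation safe):

```python
import numpy as np

t = np.arange(-10.0, 10.0, 1e-3)
dt = 1e-3
psi0 = np.exp(-np.pi * t**2)                         # eigenvalue 1
psi1 = 2 * np.sqrt(2 * np.pi) * t * np.exp(-np.pi * t**2)   # eigenvalue -i

def ft(f, xi):
    # Continuous FT by Riemann sum: f_hat(xi) = sum f(t) e^{-2 pi i xi t} dt
    return np.sum(f * np.exp(-2j * np.pi * xi * t)) * dt

xis = [0.0, 0.3, 1.0]
# F[psi_0](xi) should equal psi_0(xi); F[psi_1](xi) should equal -i psi_1(xi).
err0 = max(abs(ft(psi0, x) - np.exp(-np.pi * x**2)) for x in xis)
err1 = max(abs(ft(psi1, x) + 1j * 2 * np.sqrt(2 * np.pi) * x * np.exp(-np.pi * x**2))
           for x in xis)
```
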

For AI: The Gaussian window is the optimal time-frequency window (by the uncertainty principle), which is why Gaussian smoothing is used in signal preprocessing and why Gaussian attention kernels appear in some efficient attention variants.

GAUSSIAN SELF-DUALITY
========================================================================

  Time domain               Frequency domain
  -------------             ----------------
  g(t) = e^{-πt²}   --FT-->  ĝ(ξ) = e^{-πξ²}   <- same function!

  Width Δt = 1/(2√π)         Width Δξ = 1/(2√π)
  Product: Δt · Δξ = 1/(4π) = MINIMUM (uncertainty bound)

  Scaling:
  f(t) = e^{-πt²/σ²}  --FT-->  σ e^{-πσ²ξ²}
  ^ wider in time              ^ narrower in frequency

========================================================================

5. Plancherel's Theorem and L2L^2 Theory

5.1 Statement: FT as a Unitary Isometry

Theorem 5.1 (Plancherel's Theorem). The Fourier Transform extends uniquely from L1(R)L2(R)L^1(\mathbb{R}) \cap L^2(\mathbb{R}) to a unitary operator on L2(R)L^2(\mathbb{R}):

F:L2(R)L2(R),f^L2=fL2\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R}), \qquad \lVert \hat{f} \rVert_{L^2} = \lVert f \rVert_{L^2}

Unitarity means F1=F\mathcal{F}^{-1} = \mathcal{F}^* (the adjoint equals the inverse), which in this case means (F)[f^](t)=f^(ξ)e+2πiξtdξ(\mathcal{F}^*)[\hat{f}](t) = \int \hat{f}(\xi)e^{+2\pi i\xi t}\,d\xi - the inverse FT.

Why this is non-trivial: For fL2(R)f \in L^2(\mathbb{R}) but fL1(R)f \notin L^1(\mathbb{R}), the defining integral f(t)e2πiξtdt\int f(t)e^{-2\pi i\xi t}\,dt need not converge absolutely. Plancherel's theorem says we can still define f^L2\hat{f} \in L^2 as the L2L^2 limit:

f^=limRRRf(t)e2πiξtdt(limit in L2 norm)\hat{f} = \lim_{R\to\infty} \int_{-R}^{R} f(t)e^{-2\pi i\xi t}\,dt \qquad \text{(limit in } L^2\text{ norm)}

The limit exists because the truncated transforms form a Cauchy sequence in L2L^2 (by the L1L2L^1 \cap L^2 case and density of that subspace in L2L^2).

5.2 Parseval's Relation

Theorem 5.2 (Parseval / Plancherel Identity). For f,gL2(R)f, g \in L^2(\mathbb{R}):

f(t)g(t)dt=f^(ξ)g^(ξ)dξ\int_{-\infty}^{\infty} f(t)\overline{g(t)}\,dt = \int_{-\infty}^{\infty} \hat{f}(\xi)\overline{\hat{g}(\xi)}\,d\xi

The special case g=fg = f gives the energy conservation (or Parseval's formula):

f(t)2dt=f^(ξ)2dξ\int_{-\infty}^{\infty} |f(t)|^2\,dt = \int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\,d\xi

Interpretation: The total energy of a signal is the same whether measured in the time domain or the frequency domain. No energy is created or destroyed by the Fourier Transform - it is a lossless change of representation.

Using Parseval to compute integrals: Sometimes f2dt\int|f|^2\,dt is hard but f^2dξ\int|\hat{f}|^2\,d\xi is easy (or vice versa). Example: compute sinc2(ξ)dξ\int_{-\infty}^{\infty} \operatorname{sinc}^2(\xi)\,d\xi.

We know F[rect]=sinc\mathcal{F}[\operatorname{rect}] = \operatorname{sinc}, so by Parseval:

sinc(ξ)2dξ=rect(t)2dt=1/21/21dt=1\int_{-\infty}^{\infty}|\operatorname{sinc}(\xi)|^2\,d\xi = \int_{-\infty}^{\infty}|\operatorname{rect}(t)|^2\,dt = \int_{-1/2}^{1/2}1\,dt = 1
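
A quick numerical confirmation of this Parseval computation. Note that np.sinc is the normalized sinc sin(πx)/(πx)\sin(\pi x)/(\pi x), matching the convention used here; the integration window is an illustrative choice (the integrand decays only like 1/ξ21/\xi^2).

```python
import numpy as np

# Riemann-sum approximation of the integral of sinc^2 over the real line.
xi = np.arange(-2000.0, 2000.0, 0.01)
sinc_energy = np.sum(np.sinc(xi) ** 2) * 0.01   # should be ~ 1
```
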

For AI: Parseval's relation underpins the analysis of spectral energy distribution in neural networks. The WeightWatcher tool (Martin & Mahoney, 2021) analyzes the spectrum of weight matrices to diagnose training quality - a healthy model has a power-law spectral distribution. Parseval tells us the Frobenius norm WF2=iσi2\lVert W \rVert_F^2 = \sum_i\sigma_i^2 equals the energy in the spectral domain.
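
A sketch of the norm identity cited above, WF2=iσi2\lVert W \rVert_F^2 = \sum_i \sigma_i^2, on a random stand-in for a weight matrix (the shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
frob_sq = np.sum(W ** 2)                                       # ||W||_F^2
sigma = np.linalg.svd(W, compute_uv=False)                     # singular values
spectral_sq = np.sum(sigma ** 2)                               # sum sigma_i^2
```
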

5.3 Proof Sketch: Extension from L1L2L^1 \cap L^2 to L2L^2

The proof of Plancherel's theorem follows a classic functional analysis pattern:

Step 1: Show the Parseval identity for fL1(R)L2(R)f \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}) (the intersection is dense in both spaces).

For fL1L2f \in L^1 \cap L^2, use Fubini to compute:

f^(ξ)2dξ=f^(ξ)f^(ξ)dξ=f^(ξ)(f(t)e2πiξtdt)dξ\int |\hat{f}(\xi)|^2\,d\xi = \int\hat{f}(\xi)\overline{\hat{f}(\xi)}\,d\xi = \int\hat{f}(\xi)\left(\int \overline{f(t)}\,e^{2\pi i\xi t}\,dt\right)d\xi =f(t)f^(ξ)e2πiξtdξ=f(t) by inversiondt=f(t)2dt= \int \overline{f(t)}\,\underbrace{\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi}_{=f(t) \text{ by inversion}}\,dt = \int|f(t)|^2\,dt

Step 2: The isometry f^2=f2\lVert\hat{f}\rVert_2 = \lVert f \rVert_2 for fL1L2f \in L^1 \cap L^2 shows F:L1L2L2\mathcal{F}: L^1\cap L^2 \to L^2 is a bounded operator with operator norm 1.

Step 3: By density (L1L2L^1 \cap L^2 is dense in L2L^2) and the fact that a bounded linear isometry extends uniquely to the closure, F\mathcal{F} extends to all of L2L^2 preserving the isometry.

Step 4: Show F\mathcal{F} is surjective by showing F1\mathcal{F}^{-1} (the inverse FT) also extends and FF1=I\mathcal{F}\circ\mathcal{F}^{-1} = I. \square

5.4 Energy Conservation in Practice

The Parseval identity has immediate computational consequences:

Low-pass filtering: A filter that removes frequencies ξ>W|\xi| > W (a rectangular window in frequency) preserves exactly the fraction ξWf^(ξ)2dξ/f^22\int_{|\xi| \leq W}|\hat{f}(\xi)|^2\,d\xi \,/\, \lVert\hat{f}\rVert_2^2 of the total energy. For a signal with most energy at low frequencies, this fraction is close to 1.

Compression by truncation: For fL2f \in L^2, the best kk-term approximation in the Fourier basis minimizes the L2L^2 error. The error is ξ>Wf^(ξ)2dξ\int_{|\xi| > W}|\hat{f}(\xi)|^2\,d\xi - the energy in the discarded frequencies. This is the Fourier analog of truncated SVD in matrix approximation.

For AI: The Fourier Neural Operator (Section 8.4) exploits this by keeping only the top-kk Fourier modes (lowest frequencies) of the input function and discarding the rest. Plancherel guarantees the error is exactly the discarded energy - a principled truncation criterion.
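
A sketch of compression by truncation in the discrete setting, using the unitary ("ortho") DFT so the discrete Parseval identity holds exactly: the squared L2L^2 reconstruction error equals the discarded spectral energy. The signal and cutoff are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
x = rng.standard_normal(n)
X = np.fft.fft(x, norm="ortho")         # unitary DFT: Parseval holds exactly

freqs = np.fft.fftfreq(n)
keep = np.abs(freqs) <= 16 / n          # keep only the lowest-frequency band
x_rec = np.fft.ifft(np.where(keep, X, 0.0), norm="ortho").real  # x real

err_energy = np.sum((x - x_rec) ** 2)            # ||x - x_rec||^2
discarded_energy = np.sum(np.abs(X[~keep]) ** 2)  # energy zeroed out
```

The mask is symmetric in ±\pm frequency, so the reconstruction of a real signal is real up to rounding.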

6. The Heisenberg Uncertainty Principle

6.1 Time Spread and Frequency Spread

To state the uncertainty principle precisely, we need quantitative measures of how "spread out" a signal is in time and frequency.

Definition 6.1 (Time Center and Spread). For fL2(R)f \in L^2(\mathbb{R}) with f2=1\lVert f \rVert_2 = 1:

μt=tf(t)2dt(time center)\mu_t = \int_{-\infty}^{\infty} t\,|f(t)|^2\,dt \quad \text{(time center)} Δt=((tμt)2f(t)2dt)1/2(time spread / RMS duration)\Delta t = \left(\int_{-\infty}^{\infty} (t - \mu_t)^2\,|f(t)|^2\,dt\right)^{1/2} \quad \text{(time spread / RMS duration)}

Definition 6.2 (Frequency Center and Spread). Similarly:

μξ=ξf^(ξ)2dξ(frequency center)\mu_\xi = \int_{-\infty}^{\infty} \xi\,|\hat{f}(\xi)|^2\,d\xi \quad \text{(frequency center)} Δξ=((ξμξ)2f^(ξ)2dξ)1/2(frequency spread / bandwidth)\Delta\xi = \left(\int_{-\infty}^{\infty} (\xi - \mu_\xi)^2\,|\hat{f}(\xi)|^2\,d\xi\right)^{1/2} \quad \text{(frequency spread / bandwidth)}

Note: since f22=f^22=1\lVert f \rVert_2^2 = \lVert\hat{f}\rVert_2^2 = 1 (Plancherel), f(t)2|f(t)|^2 and f^(ξ)2|\hat{f}(\xi)|^2 are probability densities. The time and frequency spreads are the standard deviations of these distributions.

6.2 Formal Statement: ΔtΔξ14π\Delta t \cdot \Delta\xi \geq \frac{1}{4\pi}

Theorem 6.1 (Heisenberg-Kennard Uncertainty Principle). For any fL2(R)f \in L^2(\mathbb{R}) with f2=1\lVert f \rVert_2 = 1 and tf(t),ξf^(ξ)L2(R)tf(t), \xi\hat{f}(\xi) \in L^2(\mathbb{R}):

ΔtΔξ14π\Delta t \cdot \Delta\xi \geq \frac{1}{4\pi}

Equality holds if and only if ff is a Gaussian (up to time-shift, frequency-shift, and scaling):

f(t)=Aeπ(tt0)2/σ2e2πiξ0tf(t) = A\,e^{-\pi(t-t_0)^2/\sigma^2}\,e^{2\pi i\xi_0 t}

for some A,t0,ξ0RA, t_0, \xi_0 \in \mathbb{R} and σ>0\sigma > 0.

The bound is fundamental. This is not a statement about measurement error or quantum physics - it is a purely mathematical theorem about functions and their Fourier Transforms. Any signal with time duration Δt\Delta t necessarily has bandwidth at least 1/(4πΔt)1/(4\pi\Delta t). You cannot concentrate a signal to be both time-limited and bandwidth-limited simultaneously.

6.3 Proof via Cauchy-Schwarz

Proof. Without loss of generality, assume μt=μξ=0\mu_t = \mu_\xi = 0 (we can always translate in time and frequency without changing the spreads). Then:

Δt2Δξ2=(t2f(t)2dt)(ξ2f^(ξ)2dξ)\Delta t^2 \cdot \Delta\xi^2 = \left(\int t^2|f(t)|^2\,dt\right)\left(\int \xi^2|\hat{f}(\xi)|^2\,d\xi\right)

By Parseval and the differentiation property, ξ2f^(ξ)2dξ=14π2f(t)2dt\int\xi^2|\hat{f}(\xi)|^2\,d\xi = \frac{1}{4\pi^2}\int|f'(t)|^2\,dt (using F[f](ξ)=2πiξf^(ξ)\mathcal{F}[f'](\xi) = 2\pi i\xi\hat{f}(\xi) and Parseval).

So we need to show:

(t2f2dt)(14π2f2dt)116π2\left(\int t^2|f|^2\,dt\right)\left(\frac{1}{4\pi^2}\int|f'|^2\,dt\right) \geq \frac{1}{16\pi^2}

i.e., (t2f2)(f2)14\left(\int t^2|f|^2\right)\left(\int|f'|^2\right) \geq \frac{1}{4}.

Integrate by parts and then apply Cauchy-Schwarz. Since ddt(tf2)=f2+2tRe(ffˉ)\frac{d}{dt}\left(t|f|^2\right) = |f|^2 + 2t\operatorname{Re}(f'\bar{f}), integrating over R\mathbb{R} makes the left side vanish (the boundary term tf2t|f|^2 tends to 00 at ±\pm\infty for the class of ff considered):

0=f2dt+2tRe(ffˉ)dt=1+2Retfˉ(t)f(t)dt0 = \int|f|^2\,dt + 2\int t\operatorname{Re}(f'\bar{f})\,dt = 1 + 2\operatorname{Re}\int t\,\bar{f}(t)f'(t)\,dt

So Retfˉfdt=1/2\operatorname{Re}\int t\,\bar{f}f'\,dt = -1/2. Then by Cauchy-Schwarz:

14=Retfˉfdt2tfˉfdt2(t2f2dt)(f2dt)\frac{1}{4} = \left|\operatorname{Re}\int t\bar{f}f'\,dt\right|^2 \leq \left|\int t\bar{f}f'\,dt\right|^2 \leq \left(\int t^2|f|^2\,dt\right)\left(\int|f'|^2\,dt\right)

Multiplying both sides by 1/(4π2)1/(4\pi^2):

Δt2Δξ21414π2=116π2\Delta t^2 \cdot \Delta\xi^2 \geq \frac{1}{4} \cdot \frac{1}{4\pi^2} = \frac{1}{16\pi^2} ΔtΔξ14π\therefore \quad \Delta t \cdot \Delta\xi \geq \frac{1}{4\pi} \qquad \square

6.4 The Gaussian as the Unique Extremal Function

Equality in Cauchy-Schwarz requires f(t)=λtf(t)f'(t) = \lambda t f(t) for some constant λ\lambda. This ODE has solution f(t)=Ceλt2/2f(t) = Ce^{\lambda t^2/2}. For fL2(R)f \in L^2(\mathbb{R}), we need λ<0\lambda < 0, giving f(t)=Ceαt2f(t) = Ce^{-\alpha t^2} for α>0\alpha > 0 - a Gaussian.

Adding back the time-shift t0t_0 and frequency-shift ξ0\xi_0:

f(t)=Ceπ(tt0)2/σ2e2πiξ0t,Δt=σ2π,Δξ=12σπf(t) = C\,e^{-\pi(t-t_0)^2/\sigma^2}\,e^{2\pi i\xi_0 t}, \qquad \Delta t = \frac{\sigma}{2\sqrt{\pi}}, \quad \Delta\xi = \frac{1}{2\sigma\sqrt{\pi}}

Check: ΔtΔξ=σ2π12σπ=14π\Delta t \cdot \Delta\xi = \frac{\sigma}{2\sqrt{\pi}} \cdot \frac{1}{2\sigma\sqrt{\pi}} = \frac{1}{4\pi} - exactly the bound. For σ=1\sigma = 1 this is the self-dual Gaussian g(t)=eπt2g(t) = e^{-\pi t^2}, with Δt=Δξ=1/(2π)\Delta t = \Delta\xi = 1/(2\sqrt{\pi}).

Implication: The Gaussian is the only signal that achieves perfect time-frequency concentration in the sense of minimizing ΔtΔξ\Delta t \cdot \Delta\xi. Every other signal is "more spread out" in at least one of the two domains. This is why Gaussian windows are optimal for time-frequency analysis (Gabor atoms), and why the Gaussian noise model is natural in signal processing.
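
The extremal property can be checked on a grid. For real ff centered at μt=μξ=0\mu_t = \mu_\xi = 0, the identity from the proof gives Δξ2=14π2f2/f22\Delta\xi^2 = \frac{1}{4\pi^2}\int|f'|^2 / \lVert f\rVert_2^2, so no transform is needed. A sketch comparing the Gaussian (which should sit at the bound) against a Lorentzian profile, an illustrative non-extremal choice:

```python
import numpy as np

def uncertainty_product(f, t):
    # Delta_t * Delta_xi for a real, centered signal sampled on grid t.
    dt = t[1] - t[0]
    norm_sq = np.sum(f ** 2) * dt
    var_t = np.sum(t ** 2 * f ** 2) * dt / norm_sq            # Delta t^2
    df = np.gradient(f, dt)                                    # f'
    var_xi = np.sum(df ** 2) * dt / norm_sq / (4 * np.pi ** 2) # Delta xi^2
    return np.sqrt(var_t * var_xi)

t = np.arange(-50.0, 50.0, 1e-3)
bound = 1.0 / (4 * np.pi)
p_gauss = uncertainty_product(np.exp(-np.pi * t ** 2), t)   # ~ bound
p_lorentz = uncertainty_product(1.0 / (1.0 + t ** 2), t)    # strictly larger
```
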

6.5 Implications for Neural Architecture Design

The uncertainty principle has direct, concrete implications for AI systems:

RoPE and context length extension: In RoPE, each dimension pair uses frequency θj=100002j/dmodel\theta_j = 10000^{-2j/d_{\text{model}}}. The lowest frequencies θj\theta_j encode the longest-range positional information. To extend context from 4K to 128K tokens (as in LLaMA-3 long context), the lowest frequencies must be able to distinguish positions up to 128K. But by the uncertainty principle, a signal that is distinguishable over a range TT in "position space" must have frequency resolution Δξ1/T\Delta\xi \leq 1/T - which requires the frequency to be low enough. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) resolve this by rescaling θj\theta_j to use lower frequencies for long-context models.

Spectrogram and STFT: Audio models like Whisper (Radford et al., 2022) use the Short-Time Fourier Transform - the FT applied to windowed segments. The window length controls the time-frequency tradeoff. Longer windows: better frequency resolution, worse time resolution. The Gaussian window is optimal but the Hamming window is commonly used for computational reasons.

Attention window size: Multi-head self-attention with local windows (as in LongFormer, BigBird) effectively applies a bandpass filter in "position space." The uncertainty principle says a window of size WW can resolve frequency up to 1/W1/W - setting a fundamental limit on the position encodings that can be usefully resolved within the window.

UNCERTAINTY PRINCIPLE - ARCHITECTURE IMPLICATIONS
========================================================================

  Signal Analysis            Neural Architecture Analog
  -----------------          -------------------------
  Time duration T            Context window / sequence length
  Bandwidth B ~ 1/T          Position frequency resolution
  Δt · Δξ >= 1/(4π)          Short context -> coarse PE, long -> fine PE

  STFT window size W:        Local attention window size W:
    -> freq res: ~ 1/W          -> PE resolution: ~ 1/W
    -> time res: ~ W            -> local context: ~ W

  Gaussian window:           Gaussian attention kernel:
    -> optimal concentration    -> used in some efficient attention variants

  LongRoPE: rescales θ_j to lower freqs -> finer resolution at long range

========================================================================

7. Tempered Distributions and the Dirac Delta

7.1 Why Distributions Are Necessary

The L1L^1 and L2L^2 theories of the Fourier Transform handle a large class of functions, but they exclude some of the most important objects in signal processing and physics:

  • δ(t)\delta(t) - the Dirac delta: not a function, but models an ideal impulse
  • f(t)=1f(t) = 1 - the constant function: not in L1L^1 or L2L^2, but models a DC signal
  • f(t)=e2πiξ0tf(t) = e^{2\pi i\xi_0 t} - a pure tone: not in L1L^1 or L2L^2, but is a fundamental signal
  • n=δ(tnT)\sum_{n=-\infty}^{\infty}\delta(t - nT) - the Dirac comb: not a function, but models periodic sampling

All of these have Fourier Transforms in the distributional sense - and all appear in practical signal processing. The distribution framework, developed by Laurent Schwartz in the late 1940s and early 1950s, extends the Fourier Transform to this broader class.

7.2 Schwartz Space and Tempered Distributions

Definition 7.1 (Schwartz Space). The Schwartz space S(R)\mathcal{S}(\mathbb{R}) is the space of all CC^\infty functions ϕ:RC\phi: \mathbb{R} \to \mathbb{C} such that for all m,n0m, n \geq 0:

suptRtmϕ(n)(t)<\sup_{t \in \mathbb{R}} \left|t^m \phi^{(n)}(t)\right| < \infty

i.e., ϕ\phi and all its derivatives decay faster than any polynomial. Examples: et2e^{-t^2} and compactly supported CC^\infty bump functions. Non-example: ete^{-|t|} (not differentiable at 00, so it fails the smoothness requirement despite its fast decay).

The Schwartz space is closed under the Fourier Transform: if ϕS\phi \in \mathcal{S}, then ϕ^S\hat{\phi} \in \mathcal{S}. This is the key property: the FT maps S\mathcal{S} to itself bijectively.

Definition 7.2 (Tempered Distributions). A tempered distribution TT is a continuous linear functional T:S(R)CT: \mathcal{S}(\mathbb{R}) \to \mathbb{C}. We write T(ϕ)=T,ϕT(\phi) = \langle T, \phi \rangle.

Examples of tempered distributions:

  1. Regular distributions: Every function fLp(R)f \in L^p(\mathbb{R}) (1p1 \leq p \leq \infty) defines a tempered distribution Tf(ϕ)=f(t)ϕ(t)dtT_f(\phi) = \int f(t)\phi(t)\,dt.

  2. Dirac delta: δ,ϕ=ϕ(0)\langle\delta, \phi\rangle = \phi(0). The Dirac delta evaluates ϕ\phi at 0. It is NOT a function.

  3. Derivative of delta: δ,ϕ=ϕ(0)\langle\delta', \phi\rangle = -\phi'(0).

  4. Dirac comb: IIIT,ϕ=n=ϕ(nT)\langle\text{III}_T, \phi\rangle = \sum_{n=-\infty}^\infty \phi(nT).

Fourier Transform of a tempered distribution: Define T^,ϕ=T,ϕ^\langle\hat{T}, \phi\rangle = \langle T, \hat{\phi}\rangle for ϕS\phi \in \mathcal{S}. This is the "transpose" of the Fourier Transform.

7.3 The Dirac Delta and its Fourier Transform

The Dirac delta δ(t)\delta(t) models a unit impulse concentrated at t=0t = 0:

δ,ϕ=ϕ(0)=δ(t)ϕ(t)dt(formal)\langle\delta, \phi\rangle = \phi(0) = \int_{-\infty}^{\infty} \delta(t)\phi(t)\,dt \quad \text{(formal)}

It is the limit of a sequence of ordinary functions: δϵ(t)=1ϵeπt2/ϵ2\delta_\epsilon(t) = \frac{1}{\epsilon}e^{-\pi t^2/\epsilon^2} converges to δ\delta in the distributional sense as ϵ0\epsilon \to 0.
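
A sketch of this distributional convergence: pair δϵ\delta_\epsilon with a smooth test function and watch δϵ,ϕϕ(0)\langle\delta_\epsilon, \phi\rangle \to \phi(0). The test function ϕ\phi here is an arbitrary smooth, rapidly decaying choice.

```python
import numpy as np

t = np.arange(-5.0, 5.0, 1e-4)
dt = 1e-4
phi = np.cos(t) * np.exp(-t**2)   # smooth test function with phi(0) = 1

def pair(eps):
    # <delta_eps, phi> = integral of delta_eps * phi (Riemann sum)
    delta_eps = (1.0 / eps) * np.exp(-np.pi * t**2 / eps**2)
    return np.sum(delta_eps * phi) * dt

# Pairing error |<delta_eps, phi> - phi(0)| shrinks as eps -> 0.
errs = [abs(pair(eps) - 1.0) for eps in (0.5, 0.1, 0.02)]
```
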

Fourier Transform of δ\delta: By the definition δ^,ϕ=δ,ϕ^=ϕ^(0)=ϕ(t)1dt=1,ϕ\langle\hat{\delta}, \phi\rangle = \langle\delta, \hat{\phi}\rangle = \hat{\phi}(0) = \int\phi(t)\cdot 1\,dt = \langle 1, \phi\rangle. Therefore:

F[δ](ξ)=1\mathcal{F}[\delta](\xi) = 1

An ideal impulse at t=0t = 0 has a flat (white) spectrum - it contains equal energy at all frequencies. This makes perfect physical sense: to create a perfect impulse, you need all frequencies constructively interfering at t=0t = 0.

Fourier Transform of a constant: By duality (applying FT twice): F[1](ξ)=δ(ξ)\mathcal{F}[1](\xi) = \delta(\xi). A constant signal (DC) lives entirely at frequency ξ=0\xi = 0.

The duality principle in full: The FT maps δ(t)1\delta(t) \leftrightarrow 1 and 1δ(ξ)1 \leftrightarrow \delta(\xi) - a striking symmetry.

More generally: A delta at t=at = a:

F[δ(ta)](ξ)=e2πiξa\mathcal{F}[\delta(t-a)](\xi) = e^{-2\pi i\xi a}

This is the time-shift property: the shift by aa shows up as the phase factor e2πiξae^{-2\pi i\xi a}, while the magnitude stays flat: F[δ(ta)]=1|\mathcal{F}[\delta(t-a)]| = 1.

7.4 FT of Constants, Sinusoids, and Periodic Signals

Using F[1]=δ\mathcal{F}[1] = \delta and the frequency-shift property F[e2πiξ0tf(t)](ξ)=f^(ξξ0)\mathcal{F}[e^{2\pi i\xi_0 t}f(t)](\xi) = \hat{f}(\xi - \xi_0):

Complex exponential:

F[e2πiξ0t](ξ)=δ(ξξ0)\mathcal{F}[e^{2\pi i\xi_0 t}](\xi) = \delta(\xi - \xi_0)

Cosine and Sine:

F[cos(2πξ0t)](ξ)=12[δ(ξξ0)+δ(ξ+ξ0)]\mathcal{F}[\cos(2\pi\xi_0 t)](\xi) = \frac{1}{2}[\delta(\xi - \xi_0) + \delta(\xi + \xi_0)] F[sin(2πξ0t)](ξ)=12i[δ(ξξ0)δ(ξ+ξ0)]\mathcal{F}[\sin(2\pi\xi_0 t)](\xi) = \frac{1}{2i}[\delta(\xi - \xi_0) - \delta(\xi + \xi_0)]

A pure cosine has a spectrum consisting of two Dirac deltas, one at +ξ0+\xi_0 and one at ξ0-\xi_0 (the negative frequency is the mirror image required by Hermitian symmetry). The amplitude of each spike is 1/21/2 because the energy is split equally between the two.
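
The discrete counterpart is easy to see: the DFT of a pure cosine sampled over an integer number of periods concentrates all energy in two bins, at ±\pm the tone frequency, each with weight 1/21/2. The parameters below are illustrative.

```python
import numpy as np

n = 1024
t = np.arange(n) / n                    # one "second" at n samples
xi0 = 50                                # tone frequency: integer # of periods
x = np.cos(2 * np.pi * xi0 * t)
X = np.fft.fft(x) / n                   # normalized so a unit tone gives 1/2

spike_pos = np.abs(X[xi0])              # bin at +xi0
spike_neg = np.abs(X[n - xi0])          # bin at -xi0 (wrapped)
leak = np.abs(X).sum() - spike_pos - spike_neg   # everything else ~ 0
```
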

General periodic signal: A TT-periodic function f(t)=n=cne2πint/Tf(t) = \sum_{n=-\infty}^\infty c_n e^{2\pi i nt/T} has Fourier Transform:

f^(ξ)=n=cnδ ⁣(ξnT)\hat{f}(\xi) = \sum_{n=-\infty}^\infty c_n\,\delta\!\left(\xi - \frac{n}{T}\right)

The Fourier Transform of a periodic signal is a discrete sum of Dirac deltas at the harmonics - precisely the discrete spectrum of the Fourier series! This shows that Fourier series and Fourier Transform are two aspects of the same theory, unified by the distributional framework.

7.5 The Dirac Comb and Poisson Summation Formula

Definition 7.3 (Dirac Comb). The TT-periodic Dirac comb is:

IIIT(t)=n=δ(tnT)\text{III}_T(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)

Theorem 7.1 (FT of the Dirac Comb).

F[IIIT](ξ)=1TIII1/T(ξ)=1Tn=δ ⁣(ξnT)\mathcal{F}[\text{III}_T](\xi) = \frac{1}{T}\,\text{III}_{1/T}(\xi) = \frac{1}{T}\sum_{n=-\infty}^{\infty} \delta\!\left(\xi - \frac{n}{T}\right)

A comb of spacing TT in time has a spectrum that is a comb of spacing 1/T1/T in frequency. This is the time-frequency duality made concrete: fine temporal sampling -> coarse frequency sampling, and vice versa.

Theorem 7.2 (Poisson Summation Formula). For fS(R)f \in \mathcal{S}(\mathbb{R}):

n=f(n)=n=f^(n)\sum_{n=-\infty}^{\infty} f(n) = \sum_{n=-\infty}^{\infty} \hat{f}(n)

More generally, for sampling at interval TT:

1Tn=f ⁣(nT)=n=f^(nT)\frac{1}{T}\sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{T}\right) = \sum_{n=-\infty}^{\infty} \hat{f}(nT)

Proof: The function g(t)=nf(t+n)g(t) = \sum_n f(t + n) is 11-periodic, so it has a Fourier series with coefficients ck=f^(k)c_k = \hat{f}(k) (by computing 01g(t)e2πiktdt=f(t)e2πiktdt=f^(k)\int_0^1 g(t)e^{-2\pi ikt}\,dt = \int_{-\infty}^\infty f(t)e^{-2\pi ikt}\,dt = \hat{f}(k)). Evaluating at t=0t = 0: nf(n)=g(0)=kck=kf^(k)\sum_n f(n) = g(0) = \sum_k c_k = \sum_k\hat{f}(k). \square

The Poisson summation formula is the key to sampling theory. Shannon's sampling theorem (1949) follows directly: a signal ff with bandwidth BB (i.e., f^(ξ)=0\hat{f}(\xi) = 0 for ξ>B|\xi| > B) is completely determined by its samples at rate 2B\geq 2B (the Nyquist rate). The Poisson summation formula explains both the reconstruction formula and the aliasing that occurs when undersampling.
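
Poisson summation can be confirmed directly for the Gaussian f(t)=eπt2/af(t) = e^{-\pi t^2/a}, whose transform is f^(ξ)=aeπaξ2\hat{f}(\xi) = \sqrt{a}\,e^{-\pi a\xi^2} (Section 4.4); the width aa below is an illustrative choice. Both sums converge to machine precision after a few dozen terms.

```python
import numpy as np

a = 2.0
n = np.arange(-50, 51)
lhs = np.sum(np.exp(-np.pi * n**2 / a))             # sum f(n)
rhs = np.sum(np.sqrt(a) * np.exp(-np.pi * a * n**2))  # sum f_hat(n)
```

This is the classical Jacobi theta-function identity in disguise.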

For AI: The discrete Fourier Transform (Section 03) is the computational realization of the Poisson summation formula. The FFT computes the sampled Fourier Transform exactly when the signal is band-limited (or periodic); undersampling causes aliasing, and truncating a non-periodic signal smears its spectrum (spectral leakage).

7.6 Unifying Fourier Series and Fourier Transform

The distribution framework reveals that Fourier series and Fourier Transform are the same thing:

| Situation | Signal | Spectrum |
|---|---|---|
| General L2L^2 signal | Continuous, aperiodic | Continuous, aperiodic |
| Periodic signal (period TT) | Continuous, periodic | Discrete (spacing 1/T1/T): Fourier series |
| Bandlimited (bandwidth BB) | Discrete (rate 2B\geq 2B): sampling | Periodic (period 2B\geq 2B): aliases |
| Periodic AND bandlimited | Discrete and periodic | Discrete and periodic: DFT (Section 03) |

FOURIER TRANSFORM UNIFICATION
========================================================================

  Time domain           Frequency domain        Setting
  -----------           ----------------        -------
  Continuous            Continuous              L1/L2 Fourier Transform
  Continuous            Discrete                Fourier Series (periodic)
  Discrete              Continuous              DTFT (sampled signal)
  Discrete              Discrete                DFT / FFT -> see Section 20-03

  All four are the same theory viewed through the distributional lens.
  The Dirac comb + Poisson summation connects all four cases.

========================================================================
