Fourier Transform: Part 1: Intuition to 7. Tempered Distributions and the Dirac Delta
1. Intuition
1.1 From Fourier Series to Fourier Transform
Recall from Section 20-01 Fourier Series that a $T$-periodic function $f$ has a Fourier series in complex exponential form:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n\, e^{2\pi i n t/T}, \qquad c_n = \frac{1}{T}\int_{-T/2}^{T/2} f(t)\, e^{-2\pi i n t/T}\, dt.$$

The frequencies present are the discrete set $\xi_n = n/T$, separated by spacing $\Delta\xi = 1/T$.

Now ask: what happens as $T \to \infty$? The function is no longer required to repeat - it becomes an arbitrary function on all of $\mathbb{R}$. Three things change simultaneously:

- The frequency spacing $\Delta\xi = 1/T \to 0$: the discrete spectrum becomes a continuum.
- The discrete sum $\sum_n$ becomes an integral $\int d\xi$.
- The coefficient $c_n$ (which scales as $1/T$) becomes a density $\hat{f}(\xi_n) \approx T c_n$.

Concretely, substitute $\xi_n = n/T$, $\Delta\xi = 1/T$ and let $T \to \infty$:

$$f(t) = \sum_{n=-\infty}^{\infty} \left(T c_n\right) e^{2\pi i \xi_n t}\, \Delta\xi \;\longrightarrow\; \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi,$$

where the Fourier Transform (using the $2\pi$-convention) is:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt.$$

This derivation makes the Fourier Transform inevitable: it is not a definition pulled from thin air, but the natural limit of the Fourier series as periodicity is removed. The continuous spectrum $\hat{f}(\xi)$ is the spectral density of $f$ - the amplitude and phase contributed by frequency $\xi$ per unit frequency interval.
Non-example: The derivation above requires that the "coefficients" $T c_n$ are well-defined as $T \to \infty$. For this, we need $\int_{-\infty}^{\infty} |f(t)|\, dt < \infty$, i.e., $f \in L^1(\mathbb{R})$. A constant function $f(t) = 1$ is NOT in $L^1$, so it does not have a classical Fourier Transform - we must extend the theory to distributions (Section 7) to handle it.
1.2 Frequency as a Continuous Variable
In a Fourier series, the spectrum is a discrete set of spikes: the function has energy only at the harmonics $\xi_n = n/T$. In the Fourier Transform, the spectrum is a continuous function $\hat{f}(\xi)$, and the function can have energy spread continuously across all frequencies.
The magnitude spectrum $|\hat{f}(\xi)|$ tells you how much of each frequency is present. The phase spectrum $\arg \hat{f}(\xi)$ tells you the phase offset of each frequency component. Together they completely determine $f$ (via the inversion theorem). Some examples of what the spectrum looks like:
- A pure sinusoid $\cos(2\pi \xi_0 t)$: spectrum is two spikes at $\pm\xi_0$ (as Dirac deltas).
- A rectangular pulse (nonzero on $[-1/2, 1/2]$): spectrum is a sinc function, spreading out in frequency.
- A Gaussian $e^{-\pi t^2}$: spectrum is another Gaussian (this is the self-dual case).
- A chirp (frequency increasing linearly with time): spectrum is spread over a range of frequencies, with the time-frequency tradeoff governed by the uncertainty principle.
- White noise: flat spectrum - equal energy at every frequency.
The key principle is the time-frequency duality: a signal that is concentrated in time (short duration) must have a broad spectrum, and a signal with a narrow spectrum (nearly monochromatic) must extend over a long time. This is not a limitation of our measurement apparatus - it is a mathematical theorem (the Heisenberg uncertainty principle, Section 6).
For AI: This duality directly constrains attention mechanisms. RoPE uses a bank of frequencies $\theta_j$ to encode position, and extending context length (a longer time window) requires lower frequencies - which is why YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) rescale the frequency base to accommodate longer sequences. The uncertainty principle is the fundamental reason why this rescaling is necessary.
1.3 Why It Matters for AI (2026 Perspective)
The Fourier Transform is not peripheral to modern AI - it is structurally present in several of the most important systems:
FNet (Lee-Thorp et al., 2022): Replaces the self-attention sublayer in a Transformer with a 2D Fourier transform over the sequence and embedding dimensions. Achieves 92% of BERT's accuracy on GLUE benchmarks at 7x the training speed, because the FFT is $O(n \log n)$ while attention is $O(n^2)$. The mathematical insight: the FT acts as a "global mixer" that combines all token representations - a weaker but cheaper version of attention.
Random Fourier Features (Rahimi & Recht, 2007, NeurIPS Best Paper): Every shift-invariant kernel $k(x, y) = k(x - y)$ is the Fourier Transform of a non-negative measure (Bochner's theorem). By sampling random frequencies from that measure, one gets a randomized feature map $z: \mathbb{R}^d \to \mathbb{R}^D$ such that $z(x)^\top z(y) \approx k(x, y)$. This reduces kernel machine computation from $O(n^2)$ to $O(nD)$.
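As a concrete illustration, here is a minimal NumPy sketch of Random Fourier Features for the Gaussian (RBF) kernel; the dimensions, bandwidth, and seed are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4096, 1.0          # input dim, feature dim, kernel bandwidth

# Bochner: the RBF kernel is the FT of a Gaussian measure over frequencies,
# so we sample random frequencies from that Gaussian.
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)            # random phases

def z(x):
    """Random Fourier feature map: z(x)^T z(y) approximates k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # RBF kernel value
approx = z(x) @ z(y)                                        # randomized estimate
print(exact, approx)   # close for large D
```

The approximation error decays like $1/\sqrt{D}$, so increasing `D` tightens the match.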
Spectral Normalization (Miyato et al., 2018): To train stable GANs, the discriminator's weight matrices are normalized by their largest singular value (the spectral norm). This enforces a Lipschitz constraint on the discriminator, stabilizing training. The "spectral" in the name refers to the spectrum of the weight matrix viewed as a linear operator - operator spectral theory rather than the FT proper, but the same family of ideas.
Fourier Neural Operator (Li et al., 2021): Learns mappings between function spaces by applying a learnable linear transformation in Fourier space, then transforming back. Used for solving PDEs up to 1000x faster than traditional numerical methods. The key is that convolution in space = pointwise multiplication in Fourier space, so global dependencies can be captured efficiently.
RoPE (Su et al., 2021): Used in LLaMA-3, Mistral, Qwen, and virtually every modern LLM. Interprets each position as a rotation in the complex plane, using Fourier basis vectors at different frequencies for different embedding dimensions. The relative-position property follows from the group structure of the complex exponential.
1.4 Historical Timeline
FOURIER TRANSFORM - HISTORICAL MILESTONES
========================================================================
1822 Fourier's "Théorie analytique de la chaleur": introduces the
     integral formula that becomes the Fourier Transform
1880s Rayleigh uses Fourier integrals in optics and acoustics
1898 Heaviside operational calculus (early form of the Laplace/FT)
1915 Plancherel proves the L2 isometry theorem (Plancherel's theorem)
1927 Heisenberg formulates the uncertainty principle in quantum
mechanics; Kennard gives the precise mathematical inequality
1933 Norbert Wiener's "The Fourier Integral and Certain of its
Applications" - rigorous L2 theory; lays groundwork for
signal processing as a mathematical discipline
1949 Shannon's sampling theorem (uses Poisson summation formula)
1965 Cooley & Tukey publish the FFT - makes the FT computable in O(N log N)
1966 Schwartz distributions: FT extended to δ, constants, sinusoids
1975 FT enters digital signal processing (audio, radar, MRI)
2007 Rahimi & Recht: Random Fourier Features (NeurIPS Best Paper)
2017 Vaswani et al.: sinusoidal positional encodings in Transformers
2018 Miyato et al.: spectral normalization for GANs
2021 Li et al.: Fourier Neural Operator for PDE solving
2021 Su et al.: RoPE - now in LLaMA-3, Mistral, Qwen, GPT-NeoX
2022 Lee-Thorp et al.: FNet - Fourier-based alternative to attention
2023 YaRN (Peng et al.): Fourier-based context length extension
2024 LongRoPE (Ding et al.): progressive frequency rescaling to 2M
context length; RWKV-7 uses state-space FT interpretation
========================================================================
1.5 What the Fourier Transform Does Geometrically
The most useful geometric picture is that the Fourier Transform decomposes $f$ in a "basis" of complex exponentials $e^{2\pi i \xi t}$ - just as the Fourier series decomposed a periodic function in the trigonometric basis. The difference is that now the "basis" is uncountably infinite and the "coefficients" form a continuous function rather than a sequence.
Formally, the complex exponentials do not form a basis of $L^2(\mathbb{R})$ in the usual Hilbert space sense (they are not in $L^2$ themselves - they have constant modulus 1 and are not square-integrable). The rigorous statement is that the Fourier Transform is a unitary operator on $L^2(\mathbb{R})$ (Plancherel's theorem), meaning it preserves inner products and norms:

$$\langle f, g \rangle = \langle \hat{f}, \hat{g} \rangle, \qquad \|f\|_2 = \|\hat{f}\|_2.$$

Think of the Fourier Transform as a rotation in an infinite-dimensional space: it rotates the function into a new coordinate system (the frequency domain) where the same function looks different but has exactly the same "length" (energy). Just as rotating a vector in $\mathbb{R}^n$ preserves its norm but changes its coordinates, the Fourier Transform preserves energy but expresses it in frequency coordinates.
A second geometric picture: the Fourier Transform of the convolution $f * g$ is the pointwise product $\hat{f}\,\hat{g}$. Convolution in time/space corresponds to multiplication in frequency. This is the key theorem for signal processing and the foundation of efficient convolution in CNNs via the FFT. (Full treatment in Section 20-04 Convolution Theorem.)
GEOMETRIC PICTURE OF THE FOURIER TRANSFORM
========================================================================
Time domain                      Frequency domain
-------------                    ----------------
f(t): a function of time         f̂(ξ): a function of frequency
Signal duration T         <-->   Bandwidth ~ 1/T (uncertainty principle)
Convolution f*g           <-->   Pointwise product f̂ ĝ (Convolution Theorem)
Differentiation d/dt      <-->   Multiplication by 2πiξ
Translation f(t-a)        <-->   Modulation e^{-2πiaξ} f̂(ξ)
Scaling f(at)             <-->   Dilation (1/|a|) f̂(ξ/a)
The FT is a UNITARY OPERATOR on L2(R):
  ‖f‖₂ = ‖f̂‖₂              (Plancherel)
  ⟨f,g⟩ = ⟨f̂,ĝ⟩            (Parseval)
========================================================================
2. Formal Definitions
2.1 The Fourier Transform on $L^1(\mathbb{R})$
Definition 2.1 (Fourier Transform on $L^1$). For $f \in L^1(\mathbb{R})$ (i.e., $\int_{-\infty}^{\infty} |f(t)|\, dt < \infty$), the Fourier Transform of $f$ is the function $\hat{f}: \mathbb{R} \to \mathbb{C}$ defined by:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt.$$

We write $\hat{f} = \mathcal{F}[f]$ or $\mathcal{F}: f \mapsto \hat{f}$.
Well-definedness. For $f \in L^1$ and any $\xi \in \mathbb{R}$:

$$|\hat{f}(\xi)| \le \int_{-\infty}^{\infty} |f(t)|\, \left|e^{-2\pi i \xi t}\right|\, dt = \|f\|_1.$$

So $\hat{f}$ is bounded: $\|\hat{f}\|_\infty \le \|f\|_1$. Moreover, one can show (by dominated convergence) that $\hat{f}$ is continuous on $\mathbb{R}$, and by the Riemann-Lebesgue lemma, $\hat{f}(\xi) \to 0$ as $|\xi| \to \infty$.
Theorem 2.1 (Riemann-Lebesgue Lemma). If $f \in L^1(\mathbb{R})$, then $\hat{f} \in C_0(\mathbb{R})$ (continuous functions vanishing at infinity):

$$\lim_{|\xi| \to \infty} \hat{f}(\xi) = 0.$$

Proof sketch. For step functions, the integral reduces to a finite sum of terms of the form $\frac{e^{-2\pi i \xi a} - e^{-2\pi i \xi b}}{2\pi i \xi}$, each going to 0 as $|\xi| \to \infty$. Approximate a general $f \in L^1$ by step functions and use the boundedness $\|\hat{f}\|_\infty \le \|f\|_1$.
What $\hat{f}$ tells you. The value $\hat{f}(\xi)$ is a complex number encoding:
- $|\hat{f}(\xi)|$: the amplitude of frequency $\xi$ in $f$
- $\arg \hat{f}(\xi)$: the phase of frequency $\xi$ in $f$
The power spectrum $|\hat{f}(\xi)|^2$ gives the energy density per unit frequency at $\xi$.
2.2 Convention War: $\xi$ vs $\omega$ vs symmetric $\omega$
The Fourier Transform appears in three notational conventions in the literature, all equivalent but related by rescaling:
| Convention | Transform formula | Inverse formula | Used in |
|---|---|---|---|
| Frequency $\xi$ (Hz) | $\hat{f}(\xi) = \int f(t)\, e^{-2\pi i \xi t}\, dt$ | $f(t) = \int \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$ | Signal processing, probability |
| Angular freq $\omega$ (rad/s), no prefactor | $\hat{f}(\omega) = \int f(t)\, e^{-i \omega t}\, dt$ | $f(t) = \frac{1}{2\pi}\int \hat{f}(\omega)\, e^{i \omega t}\, d\omega$ | Physics, engineering |
| Angular freq $\omega$, symmetric | $\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}}\int f(t)\, e^{-i \omega t}\, dt$ | $f(t) = \frac{1}{\sqrt{2\pi}}\int \hat{f}(\omega)\, e^{i \omega t}\, d\omega$ | Pure mathematics |

This section uses the $2\pi$-convention (row 1). It is self-symmetric (no $2\pi$ prefactors), maps the Gaussian $e^{-\pi t^2}$ to itself, and is standard in modern ML papers (FNet, FNO, RFF all use it). The relationship between conventions: $\hat{f}_{2\pi}(\xi) = \hat{f}_{\omega}(2\pi\xi)$.
Warning: The most common source of errors in Fourier analysis is mixing conventions. When reading a paper, identify the convention in the first equation before proceeding. Properties like "differentiation multiplies by $2\pi i \xi$" vs "multiplies by $i\omega$" depend entirely on which convention is in use.
2.3 Standard Examples
These four transforms appear constantly in applications and should be memorized.
Example 1: Rectangular Pulse (rect function)

$$\mathrm{rect}(t) = \begin{cases} 1 & |t| \le 1/2 \\ 0 & \text{otherwise} \end{cases} \qquad \widehat{\mathrm{rect}}(\xi) = \frac{\sin(\pi\xi)}{\pi\xi} = \mathrm{sinc}(\xi).$$

The sinc function oscillates and decays as $1/|\xi|$. The slow decay reflects the sharp discontinuity of $\mathrm{rect}$ - sharp features in time produce slow decay in frequency (this is the spectral analog of the Gibbs phenomenon from Section 20-01).
A general pulse of width $T$: $\mathrm{rect}(t/T)$ has transform $T\,\mathrm{sinc}(T\xi)$. Wider pulse -> narrower, taller sinc lobe.
Example 2: Gaussian

$$g(t) = e^{-\pi t^2} \qquad \hat{g}(\xi) = e^{-\pi \xi^2}.$$

The Gaussian is self-dual under the Fourier Transform with the $2\pi$-convention: $\hat{g} = g$. This is unique among "nice" functions and makes the Gaussian the natural test function in Fourier analysis, the ground state in quantum mechanics, and the kernel of the heat equation.
Derivation: Complete the square in the exponent: $-\pi t^2 - 2\pi i \xi t = -\pi(t + i\xi)^2 - \pi\xi^2$. Then:

$$\hat{g}(\xi) = e^{-\pi \xi^2} \int_{-\infty}^{\infty} e^{-\pi(t + i\xi)^2}\, dt = e^{-\pi \xi^2},$$

where the integral equals 1 by contour integration (the Gaussian integrand is analytic and the contour shift is justified because it decays rapidly).
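The self-duality can be checked numerically by approximating the defining integral with a Riemann sum on a truncated grid (grid bounds and spacing are arbitrary demo choices):

```python
import numpy as np

# Fine grid; the Gaussian is negligible beyond |t| = 10.
t = np.linspace(-10, 10, 20001)
dt = t[1] - t[0]
g = np.exp(-np.pi * t**2)

def ft(f_vals, xi):
    """Riemann-sum approximation of the 2*pi-convention FT at frequency xi."""
    return np.sum(f_vals * np.exp(-2j * np.pi * xi * t)) * dt

# Compare the numerical transform against e^{-pi xi^2} at a few frequencies.
for xi in [0.0, 0.5, 1.0, 2.0]:
    print(xi, ft(g, xi).real, np.exp(-np.pi * xi**2))
```

The two columns agree to many digits because the Gaussian decays so fast that both truncation and discretization errors are negligible.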
Example 3: Lorentzian (Cauchy distribution)

$$f(t) = \frac{1}{\pi}\frac{1}{1 + t^2} \qquad \hat{f}(\xi) = e^{-2\pi|\xi|}.$$

The exponential decay in frequency reflects the analyticity of the Lorentzian: it extends holomorphically to a strip around the real axis, and the poles at $t = \pm i$ set the decay rate. The relationship between smoothness and spectral decay is captured by the general theorem in Section 3.5.
Example 4: One-Sided Exponential

$$f(t) = e^{-t}\,\mathbf{1}_{t \ge 0} \qquad \hat{f}(\xi) = \frac{1}{1 + 2\pi i \xi}.$$

This has modulus $|\hat{f}(\xi)| = \frac{1}{\sqrt{1 + 4\pi^2 \xi^2}}$, which decays as $1/|\xi|$ for large $|\xi|$ - reflecting the jump discontinuity at $t = 0$.
Non-example 1: $f(t) = \mathrm{sinc}(t)$. This is in $L^2$ but not $L^1$ (it decays too slowly), so the integral does not converge absolutely and the definition above does not apply. The $L^2$ theory (Section 5) handles this case.
Non-example 2: $f(t) = \cos(2\pi \xi_0 t)$. This is bounded but not in $L^1$ or $L^2$ - it does not decay. Its Fourier Transform exists only as a distribution: $\hat{f} = \frac{1}{2}\left(\delta_{\xi_0} + \delta_{-\xi_0}\right)$ (see Section 7).
2.4 The Inverse Fourier Transform
Definition 2.2 (Inverse Fourier Transform). For $g \in L^1(\mathbb{R})$:

$$(\mathcal{F}^{-1}g)(t) = \int_{-\infty}^{\infty} g(\xi)\, e^{+2\pi i \xi t}\, d\xi.$$

Note: the inverse is the same as the forward transform except the sign in the exponent is flipped ($e^{-2\pi i \xi t} \to e^{+2\pi i \xi t}$). This symmetric form is one of the advantages of the $2\pi$-convention.
The inversion problem: Given $\hat{f}$, can we recover $f$? Yes, under mild conditions:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi,$$

but this requires knowing that $\hat{f} \in L^1$ (which is not guaranteed by $f \in L^1$ alone - for instance, $\widehat{\mathrm{rect}} = \mathrm{sinc}$ is not in $L^1$). The full inversion theorem is treated rigorously in Section 4.
2.5 Non-Examples and the $L^2$ Extension
The classical definition fails for many important functions:
| Function | Why FT fails | Solution |
|---|---|---|
| $e^{-t^2}$ (broader Gaussian) | In $L^1$ - this one is fine | Not a failure |
| $\mathrm{sinc}(t)$ | Not in $L^1$ | Use Plancherel (Section 5) |
| $f(t) = 1$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $\cos(2\pi \xi_0 t)$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $\delta(t)$ | Not a function | Use distributions (Section 7) |
The $L^2$ extension proceeds via a density argument. The subspace $L^1 \cap L^2$ is dense in $L^2$. For $f \in L^1 \cap L^2$, the classical integral defines $\hat{f}$. The Plancherel theorem (Section 5) then shows $\|\hat{f}\|_2 = \|f\|_2$, which allows extending $\mathcal{F}$ to all of $L^2$ by continuity. The resulting transform is a unitary operator on $L^2(\mathbb{R})$.
3. Core Properties
3.1 Linearity and Conjugate Symmetry
Theorem 3.1 (Linearity). For $f, g \in L^1$ and $a, b \in \mathbb{C}$:

$$\widehat{af + bg} = a\hat{f} + b\hat{g}.$$

Proof: Immediate from linearity of the integral.
Theorem 3.2 (Conjugate Symmetry / Hermitian Property). For real-valued $f$:

$$\hat{f}(-\xi) = \overline{\hat{f}(\xi)}.$$

Proof: $\hat{f}(-\xi) = \int f(t)\, e^{2\pi i \xi t}\, dt = \overline{\int \overline{f(t)}\, e^{-2\pi i \xi t}\, dt} = \overline{\hat{f}(\xi)}$, where the last step uses $\overline{f(t)} = f(t)$ (real-valued).
Corollaries for real $f$:
- $|\hat{f}(-\xi)| = |\hat{f}(\xi)|$: the magnitude spectrum is even
- $\arg\hat{f}(-\xi) = -\arg\hat{f}(\xi)$: the phase spectrum is odd
- If $f$ is also even, then $\hat{f}$ is real-valued and even
- If $f$ is also odd, then $\hat{f}$ is purely imaginary and odd
For AI: The Hermitian symmetry means that for real signals, half the spectrum is redundant - you only need frequencies $\xi \ge 0$. This is why the FFT of a real length-$N$ signal produces only $N/2 + 1$ unique complex outputs from $N$ real inputs. Real-valued weight matrices in neural networks have Hermitian-symmetric spectra, which is exploited in efficient spectral computation.
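A quick NumPy check of the Hermitian symmetry and the real-FFT output count (the signal length 16 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=16)              # a real signal
X = np.fft.fft(x)

# Hermitian symmetry for real input: X[N-k] = conj(X[k]).
# X[1:][::-1] is X[15], X[14], ..., X[1], which should equal conj(X[1:]).
print(np.allclose(X[1:][::-1], np.conj(X[1:])))   # True

# rfft keeps only the non-redundant half: N//2 + 1 = 9 outputs for N = 16.
print(len(np.fft.rfft(x)))
```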
3.2 Time-Shift and Frequency-Shift (Modulation)
Theorem 3.3 (Time-Shift / Translation). For $a \in \mathbb{R}$:

$$\widehat{f(\cdot - a)}(\xi) = e^{-2\pi i a \xi}\, \hat{f}(\xi).$$

Proof: $\int f(t - a)\, e^{-2\pi i \xi t}\, dt = \int f(s)\, e^{-2\pi i \xi (s + a)}\, ds = e^{-2\pi i a \xi}\, \hat{f}(\xi)$ (substituting $s = t - a$).
Interpretation: Shifting a signal in time multiplies its spectrum by a complex phase factor. The magnitude spectrum is unchanged - a time shift does not affect which frequencies are present, only their phases. The phase changes linearly with frequency: $\Delta\phi(\xi) = -2\pi a \xi$.
Theorem 3.4 (Frequency-Shift / Modulation). For $\xi_0 \in \mathbb{R}$:

$$\widehat{e^{2\pi i \xi_0 t} f(t)}(\xi) = \hat{f}(\xi - \xi_0).$$

Proof: $\int e^{2\pi i \xi_0 t} f(t)\, e^{-2\pi i \xi t}\, dt = \int f(t)\, e^{-2\pi i (\xi - \xi_0) t}\, dt = \hat{f}(\xi - \xi_0)$.
Interpretation: Multiplying by a complex exponential (modulation) shifts the spectrum. This is the mathematical basis for amplitude modulation (AM) radio and for the RoPE positional encoding.
For AI - RoPE connection: In RoPE, the query at position $m$ is $q\, e^{i m \theta}$ (rotation by $m\theta$ in the complex plane). The inner product between the query at $m$ and the key at $n$ is $\mathrm{Re}\!\left(q \bar{k}\, e^{i(m-n)\theta}\right)$, depending only on the relative position $m - n$. This is the modulation theorem in action.
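A minimal sketch of this relative-position property for a single RoPE frequency channel, treating the dimension pair as one complex number (the frequency value `theta` here is a hypothetical choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.01                          # one frequency channel (demo value)
q = rng.normal() + 1j * rng.normal()  # query, as a complex number
k = rng.normal() + 1j * rng.normal()  # key

def rope(v, pos):
    """Rotate a complex embedding by pos * theta (RoPE for one dim pair)."""
    return v * np.exp(1j * pos * theta)

# The score Re(q_m conj(k_n)) = Re(q conj(k) e^{i(m-n)theta}) depends only on m - n:
s1 = (rope(q, 100) * np.conj(rope(k, 90))).real     # positions 100 and 90
s2 = (rope(q, 1010) * np.conj(rope(k, 1000))).real  # positions 1010 and 1000
print(s1, s2)   # equal: both pairs have relative position m - n = 10
```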
3.3 Scaling and Dilation
Theorem 3.5 (Scaling / Dilation). For $a \in \mathbb{R}$, $a \ne 0$:

$$\widehat{f(at)}(\xi) = \frac{1}{|a|}\, \hat{f}\!\left(\frac{\xi}{a}\right).$$

Proof: $\int f(at)\, e^{-2\pi i \xi t}\, dt = \frac{1}{|a|}\int f(s)\, e^{-2\pi i (\xi/a) s}\, ds = \frac{1}{|a|}\hat{f}(\xi/a)$ (substituting $s = at$).
Interpretation: This is the time-bandwidth product in action:
- Compressing a signal in time ($|a| > 1$, so $f(at)$ is narrower): the spectrum stretches by factor $|a|$ and shrinks in amplitude by $1/|a|$
- Stretching a signal in time ($|a| < 1$): the spectrum compresses
This confirms the time-frequency duality: doubling the duration halves the bandwidth, and vice versa. This is a hard mathematical constraint, not an engineering limitation.
3.4 Time Reversal
Theorem 3.6 (Time Reversal).

$$\widehat{f(-t)}(\xi) = \hat{f}(-\xi).$$

Proof: $\int f(-t)\, e^{-2\pi i \xi t}\, dt = \int f(s)\, e^{2\pi i \xi s}\, ds = \hat{f}(-\xi)$ (substituting $s = -t$).
Duality: There is a deeper symmetry: $\mathcal{F}[\hat{f}](t) = f(-t)$. Applying the Fourier Transform four times returns the original function: $\mathcal{F}^4 = \mathrm{id}$. This means the FT has eigenvalues $\{1, -i, -1, i\}$ and the Hermite functions are its eigenfunctions (with the Gaussian as the eigenfunction for eigenvalue 1).
3.5 Differentiation and Integration
Theorem 3.7 (Differentiation in Time). If $f, f' \in L^1$:

$$\widehat{f'}(\xi) = 2\pi i \xi\, \hat{f}(\xi).$$

More generally, $\widehat{f^{(n)}}(\xi) = (2\pi i \xi)^n\, \hat{f}(\xi)$.
Proof: Integration by parts: $\int f'(t)\, e^{-2\pi i \xi t}\, dt = \left[f(t)\, e^{-2\pi i \xi t}\right]_{-\infty}^{\infty} + 2\pi i \xi \int f(t)\, e^{-2\pi i \xi t}\, dt$. The boundary term vanishes since $f, f' \in L^1$ implies $f(t) \to 0$ as $|t| \to \infty$.
Theorem 3.8 (Differentiation in Frequency). If $t f(t) \in L^1$:

$$\widehat{(-2\pi i t)\, f(t)}(\xi) = \hat{f}'(\xi).$$

Equivalently: $\widehat{t f(t)}(\xi) = \frac{i}{2\pi}\, \hat{f}'(\xi)$.
Key consequence for smoothness vs. spectral decay:
| Smoothness of $f$ | Decay of $\hat{f}$ |
|---|---|
| $f \in C^k$ and $f^{(k)} \in L^1$ | $|\hat{f}(\xi)| = o(|\xi|^{-k})$ |
| $f \in C^\infty$ (smooth, all derivatives in $L^1$) | decays faster than any polynomial |
| $f$ has a jump discontinuity | $|\hat{f}(\xi)| \sim 1/|\xi|$ |
| $f$ is analytic (in a strip) | decays exponentially |
For AI: The differentiation property is why Fourier methods are efficient for solving differential equations. It converts the heat equation $\partial_t u = \partial_x^2 u$ into the ODE $\partial_t \hat{u}(\xi, t) = -4\pi^2 \xi^2\, \hat{u}(\xi, t)$, which is solved by $\hat{u}(\xi, t) = e^{-4\pi^2 \xi^2 t}\, \hat{u}(\xi, 0)$ - a simple exponential decay, mode by mode. The Fourier Neural Operator (Section 8.4) exploits this by learning the spectral solution operator directly.
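This remark can be turned into a few lines of code: on a periodic grid the FFT diagonalizes $\partial_x^2$, so each Fourier mode simply decays. A sketch (grid size, domain, and initial condition are arbitrary demo choices):

```python
import numpy as np

# Solve u_t = u_xx on a periodic domain [0, 2*pi) by working in Fourier space.
N, L, T = 256, 2 * np.pi, 0.1
x = np.linspace(0, L, N, endpoint=False)
u0 = np.sin(3 * x)                               # initial condition: one mode

k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi       # angular wavenumbers
u_hat = np.fft.fft(u0) * np.exp(-k**2 * T)       # each mode decays as e^{-k^2 t}
u = np.fft.ifft(u_hat).real

# Exact solution for this mode: sin(3x) * e^{-9T}
print(np.max(np.abs(u - np.sin(3 * x) * np.exp(-9 * T))))   # ~ machine epsilon
```

No time stepping is needed: the spectral solution is exact for each mode, which is precisely the structure the Fourier Neural Operator learns.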
3.6 The Master Properties Table
| Property | $f(t)$ | $\hat{f}(\xi)$ |
|---|---|---|
| Linearity | $a f(t) + b g(t)$ | $a \hat{f}(\xi) + b \hat{g}(\xi)$ |
| Time shift | $f(t - a)$ | $e^{-2\pi i a \xi}\, \hat{f}(\xi)$ |
| Frequency shift | $e^{2\pi i \xi_0 t} f(t)$ | $\hat{f}(\xi - \xi_0)$ |
| Scaling | $f(at)$ | $\frac{1}{\|a\|}\, \hat{f}(\xi/a)$ |
| Time reversal | $f(-t)$ | $\hat{f}(-\xi)$ |
| Conjugation | $\overline{f(t)}$ | $\overline{\hat{f}(-\xi)}$ |
| Hermitian (real $f$) | $f(t) \in \mathbb{R}$ | $\hat{f}(-\xi) = \overline{\hat{f}(\xi)}$ |
| Differentiation | $f'(t)$ | $2\pi i \xi\, \hat{f}(\xi)$ |
| Freq. differentiation | $-2\pi i t\, f(t)$ | $\hat{f}'(\xi)$ |
| Convolution | $(f * g)(t)$ | $\hat{f}(\xi)\, \hat{g}(\xi)$ |
| Multiplication | $f(t)\, g(t)$ | $(\hat{f} * \hat{g})(\xi)$ |
| Duality | $\hat{f}(t)$ | $f(-\xi)$ |
3.7 Convolution-Multiplication Duality - Preview
The Convolution Theorem states that the Fourier Transform converts convolution into pointwise multiplication:

$$\widehat{f * g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi),$$

where $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$.
This is the most important property of the Fourier Transform for applications. It means that:
- Linear filtering (convolution with a filter $h$) corresponds to pointwise multiplication of spectra
- A filter's effect on frequency $\xi$ is simply multiplication by $\hat{h}(\xi)$ (the frequency response)
- Convolution of length-$N$ signals costs $O(N^2)$ directly but $O(N \log N)$ via FFT
Preview: The full treatment of the convolution theorem - including circular convolution, cross-correlation, the Wiener-Khinchin theorem, and applications to CNNs, WaveNet, and Mamba - is in Section 20-04 Convolution Theorem. Here we state the theorem and use it; the proof and all applications belong there.
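A minimal numerical confirmation that FFT-based convolution matches direct convolution (the signal lengths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.normal(size=400)
g = rng.normal(size=400)

# Direct (linear) convolution: O(N^2)
direct = np.convolve(f, g)

# FFT convolution: zero-pad to the full output length, multiply spectra, invert.
n = len(f) + len(g) - 1
fast = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fast))   # True: convolution = product of spectra
```

The zero-padding to length `n` is what turns the FFT's circular convolution into the linear convolution computed by `np.convolve`.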
4. The Fourier Inversion Theorem
4.1 Statement and Sufficient Conditions
The fundamental question: given , can we recover ? The answer is yes under appropriate conditions, but the precise statement requires care.
Theorem 4.1 (Fourier Inversion Theorem). Suppose $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then for almost every $t$:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi.$$

Moreover, if $f$ is also continuous at $t$, equality holds exactly (not just a.e.).
The subtlety: The condition $\hat{f} \in L^1$ is not automatic. If $f \in L^1$, then $\hat{f}$ is bounded and continuous (by Riemann-Lebesgue), but not necessarily in $L^1$. For example, $f = \mathrm{rect}$ gives $\hat{f} = \mathrm{sinc}$, and $\mathrm{sinc} \notin L^1$ (it decays too slowly: $|\mathrm{sinc}(\xi)| \sim 1/|\xi|$). The inversion of the rect function requires the $L^2$ theory.
Sufficient condition for inversion: If $f \in L^1$ and $f$ has bounded variation near $t$ (e.g., $f$ is differentiable at $t$), then the principal value integral:

$$\lim_{R \to \infty} \int_{-R}^{R} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$$

converges to $f(t)$ where $f$ is continuous, and to $\frac{f(t^+) + f(t^-)}{2}$ at jump discontinuities - exactly as in Dirichlet's theorem for Fourier series.
4.2 Proof via Approximate Identity
The standard proof uses the concept of an approximate identity - a family of functions that converge to the Dirac delta as a parameter goes to zero.
Definition (Approximate Identity). A family $\{k_\epsilon\}_{\epsilon > 0} \subset L^1(\mathbb{R})$ is an approximate identity if:
- $\int k_\epsilon(t)\, dt = 1$ for all $\epsilon$
- For every $f \in L^1$: $\|k_\epsilon * f - f\|_1 \to 0$ as $\epsilon \to 0$
The Gauss-Weierstrass kernel $k_\epsilon(t) = \epsilon^{-1} e^{-\pi t^2/\epsilon^2}$ is an approximate identity.
Proof sketch of inversion theorem:
Define the Gauss-regularized inversion:

$$I_\epsilon(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-\pi \epsilon^2 \xi^2}\, e^{2\pi i \xi t}\, d\xi.$$

The factor $e^{-\pi \epsilon^2 \xi^2}$ provides absolute convergence (the Gaussian is in $L^1$). Using Fubini's theorem to interchange integration order:

$$I_\epsilon(t) = \int f(s) \left[\int e^{-\pi \epsilon^2 \xi^2}\, e^{2\pi i \xi (t - s)}\, d\xi\right] ds = (k_\epsilon * f)(t), \qquad k_\epsilon(u) = \epsilon^{-1} e^{-\pi u^2/\epsilon^2}.$$

Since $\{k_\epsilon\}$ is an approximate identity: $(k_\epsilon * f)(t) \to f(t)$ as $\epsilon \to 0$ at every point of continuity of $f$ (and in $L^1$ norm). Since $I_\epsilon(t) \to \int \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$ as $\epsilon \to 0$ when $\hat{f} \in L^1$ (dominated convergence), the result follows.
Significance: This proof reveals why the Gaussian plays a central role - it is the unique function (up to scaling) that is its own Fourier Transform, making the Gauss regularization self-consistent.
4.3 The Inversion Formula in Practice
The inversion formula is used to:
- Recover a signal from its spectrum: Given $\hat{f}$, compute $f = \mathcal{F}^{-1}\hat{f}$
- Compute transforms of new functions from known transforms: Use the FT table plus properties
- Solve PDEs: Take FT, solve the resulting ODE, invert
Example: Compute $f$ such that $\hat{f}(\xi) = e^{-2\pi|\xi|}$ (double-sided exponential spectrum).
By the inversion formula:

$$f(t) = \int_{-\infty}^{\infty} e^{-2\pi|\xi|}\, e^{2\pi i \xi t}\, d\xi = 2\int_0^{\infty} e^{-2\pi\xi} \cos(2\pi \xi t)\, d\xi = \frac{1}{\pi}\frac{1}{1 + t^2}.$$

So $f$ is the Lorentzian - confirming the result from Section 2.3 (Example 3).
Numerical verification: The inversion theorem can be verified numerically via the FFT: compute $\hat{f}$ numerically, then invert, and check that you recover $f$ up to discretization error (this is Exercise 3 in Section 10).
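The worked example above can also be checked directly by quadrature, without the FFT; a sketch with an arbitrary truncation and grid spacing:

```python
import numpy as np

# Inverse FT of the double-sided exponential spectrum e^{-2*pi*|xi|},
# approximated by a Riemann sum; should give the Lorentzian 1/(pi (1 + t^2)).
xi = np.linspace(-40, 40, 800001)
dxi = xi[1] - xi[0]
spec = np.exp(-2 * np.pi * np.abs(xi))

def inv_ft(t):
    """Riemann-sum approximation of the inverse FT at time t."""
    return np.sum(spec * np.exp(2j * np.pi * xi * t)).real * dxi

for t in [0.0, 1.0, 3.0]:
    print(t, inv_ft(t), 1 / (np.pi * (1 + t**2)))
```

The two columns match to high accuracy; the spectrum decays so fast that truncating at $|\xi| = 40$ costs essentially nothing.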
4.4 The Self-Dual Gaussian
The Gaussian $g(t) = e^{-\pi t^2}$ satisfies $\hat{g} = g$, making it a fixed point of the Fourier Transform. This property makes it indispensable in both theory and practice.
Generalized Gaussian family. For $\sigma > 0$, define $g_\sigma(t) = e^{-\pi t^2/\sigma^2}$ (Gaussian of width $\sigma$). By the scaling theorem:

$$\hat{g}_\sigma(\xi) = \sigma\, e^{-\pi \sigma^2 \xi^2}.$$

Wide Gaussian ($\sigma$ large) -> narrow Fourier Transform (bandwidth $\sim 1/\sigma$). This is the uncertainty principle made explicit: the product of the time-width $\sigma$ and the frequency-width $1/\sigma$ equals 1, the minimum possible for this family (see Section 6).
Eigenfunctions of the FT: The Fourier Transform has eigenvalues $\{1, -i, -1, i\}$. The corresponding eigenfunctions are the Hermite functions:

$$h_n(t) = H_n(\sqrt{2\pi}\, t)\, e^{-\pi t^2}, \qquad \mathcal{F}[h_n] = (-i)^n\, h_n,$$

where $H_n$ is the $n$-th Hermite polynomial. For $n = 0$: $h_0(t) = e^{-\pi t^2}$ (eigenvalue 1, the self-dual Gaussian). For $n = 1$: $h_1(t) \propto t\, e^{-\pi t^2}$ (eigenvalue $-i$).
For AI: The Gaussian window is the optimal time-frequency window (by the uncertainty principle), which is why Gaussian smoothing is used in signal preprocessing and why Gaussian attention kernels appear in some efficient attention variants.
GAUSSIAN SELF-DUALITY
========================================================================
Time domain                        Frequency domain
-------------                      ----------------
g(t) = e^{-πt²}      --FT-->       ĝ(ξ) = e^{-πξ²}    <- same function!
Width σ_t = 1/(2√π)                Width σ_ξ = 1/(2√π)
Product: σ_t σ_ξ = 1/(4π) = MINIMUM (uncertainty bound)
Scaling:
  f(t) = e^{-πt²/σ²}  --FT-->      σ e^{-πσ²ξ²}
    ^ wider in time                  ^ narrower in frequency
========================================================================
5. Plancherel's Theorem and $L^2$ Theory
5.1 Statement: FT as a Unitary Isometry
Theorem 5.1 (Plancherel's Theorem). The Fourier Transform extends uniquely from $L^1 \cap L^2$ to a unitary operator on $L^2(\mathbb{R})$:

$$\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R}), \qquad \|\mathcal{F}f\|_2 = \|f\|_2.$$

Unitarity means $\mathcal{F}^* = \mathcal{F}^{-1}$ (the adjoint equals the inverse), which in this case means the adjoint is the inverse FT.
Why this is non-trivial: For $f \in L^2$ but $f \notin L^1$, the defining integral need not converge absolutely. Plancherel's theorem says we can still define $\hat{f}$ as the limit:

$$\hat{f}(\xi) = \lim_{R \to \infty} \int_{-R}^{R} f(t)\, e^{-2\pi i \xi t}\, dt \quad \text{(limit in the } L^2 \text{ norm)}.$$

The limit exists because the truncated transforms form a Cauchy sequence in $L^2$ (by the $L^1 \cap L^2$ case and density of that subspace in $L^2$).
5.2 Parseval's Relation
Theorem 5.2 (Parseval / Plancherel Identity). For $f, g \in L^2(\mathbb{R})$:

$$\int_{-\infty}^{\infty} f(t)\, \overline{g(t)}\, dt = \int_{-\infty}^{\infty} \hat{f}(\xi)\, \overline{\hat{g}(\xi)}\, d\xi.$$

The special case $g = f$ gives the energy conservation (or Parseval's formula):

$$\int_{-\infty}^{\infty} |f(t)|^2\, dt = \int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\, d\xi.$$
Interpretation: The total energy of a signal is the same whether measured in the time domain or the frequency domain. No energy is created or destroyed by the Fourier Transform - it is a lossless change of representation.
Using Parseval to compute integrals: Sometimes $\int |f|^2$ is hard but $\int |\hat{f}|^2$ is easy (or vice versa). Example: compute $\int_{-\infty}^{\infty} \mathrm{sinc}^2(\xi)\, d\xi$.
We know $\widehat{\mathrm{rect}} = \mathrm{sinc}$, so by Parseval:

$$\int_{-\infty}^{\infty} \mathrm{sinc}^2(\xi)\, d\xi = \int_{-\infty}^{\infty} \mathrm{rect}^2(t)\, dt = 1.$$
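Both the discrete Parseval identity and the $\int \mathrm{sinc}^2 = 1$ computation can be checked numerically; a sketch (array sizes and truncation are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1024)
X = np.fft.fft(x)

# Discrete Parseval: sum |x_n|^2 = (1/N) sum |X_k|^2
lhs = np.sum(x**2)
rhs = np.sum(np.abs(X)**2) / len(x)
print(lhs, rhs)   # equal up to round-off

# Continuous version: integral of sinc^2 = integral of rect^2 = 1.
xi = np.linspace(-2000, 2000, 4_000_001)
dxi = xi[1] - xi[0]
sinc_energy = np.sum(np.sinc(xi)**2) * dxi   # np.sinc(x) = sin(pi x)/(pi x)
print(sinc_energy)   # ~ 1 (the tail beyond |xi| = 2000 holds ~ 5e-5 of the energy)
```

Note the slow $1/\xi^2$ decay of $\mathrm{sinc}^2$ forces a very wide truncation window, in contrast to the Gaussian examples.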
For AI: Parseval's relation underpins the analysis of spectral energy distribution in neural networks. The WeightWatcher tool (Martin & Mahoney, 2021) analyzes the spectrum of weight matrices to diagnose training quality - a healthy model has a power-law spectral distribution. Parseval tells us the Frobenius norm equals the energy in the spectral domain.
5.3 Proof Sketch: Extension from $L^1 \cap L^2$ to $L^2$
The proof of Plancherel's theorem follows a classic functional analysis pattern:
Step 1: Show the Parseval identity for $f, g \in L^1 \cap L^2$ (this intersection is dense in both spaces).
For such $f, g$, use Fubini to compute $\int \hat{f}\, \overline{\hat{g}}\, d\xi$ as a double integral over $t$ and $s$; formally, the $\xi$-integral $\int e^{-2\pi i \xi (t - s)}\, d\xi$ produces $\delta(t - s)$ and collapses the double integral to $\int f(t)\, \overline{g(t)}\, dt$ (made rigorous with a Gaussian regularizer, as in Section 4.2).
Step 2: The isometry $\|\hat{f}\|_2 = \|f\|_2$ for $f \in L^1 \cap L^2$ shows $\mathcal{F}$ is a bounded operator with operator norm 1.
Step 3: By density ($L^1 \cap L^2$ is dense in $L^2$) and the fact that a bounded linear isometry extends uniquely to the closure, $\mathcal{F}$ extends to all of $L^2$ preserving the isometry.
Step 4: Show $\mathcal{F}$ is surjective by showing $\mathcal{F}^{-1}$ (the inverse FT) also extends and $\mathcal{F}\mathcal{F}^{-1} = \mathcal{F}^{-1}\mathcal{F} = \mathrm{id}$.
5.4 Energy Conservation in Practice
The Parseval identity has immediate computational consequences:
Low-pass filtering: A filter that removes frequencies $|\xi| > \Omega$ (a rectangular window in frequency) keeps exactly the energy $\int_{|\xi| \le \Omega} |\hat{f}(\xi)|^2\, d\xi$. For a signal with most energy at low frequencies, this is close to the total energy.
Compression by truncation: For $f \in L^2$, the best approximation using only frequencies in a set $S$ keeps $\hat{f}$ on $S$ and zeroes the rest. The squared error is $\int_{\xi \notin S} |\hat{f}(\xi)|^2\, d\xi$ - the energy in the discarded frequencies. This is the Fourier analog of truncated SVD in matrix approximation.
For AI: The Fourier Neural Operator (Section 8.4) exploits this by keeping only a fixed number of low-frequency Fourier modes of the input function and discarding the rest. Plancherel guarantees the error is exactly the discarded energy - a principled truncation criterion.
6. The Heisenberg Uncertainty Principle
6.1 Time Spread and Frequency Spread
To state the uncertainty principle precisely, we need quantitative measures of how "spread out" a signal is in time and frequency.
Definition 6.1 (Time Center and Spread). For $f \in L^2$ with $\|f\|_2 = 1$:

$$t_0 = \int_{-\infty}^{\infty} t\, |f(t)|^2\, dt, \qquad \sigma_t^2 = \int_{-\infty}^{\infty} (t - t_0)^2\, |f(t)|^2\, dt.$$

Definition 6.2 (Frequency Center and Spread). Similarly:

$$\xi_0 = \int_{-\infty}^{\infty} \xi\, |\hat{f}(\xi)|^2\, d\xi, \qquad \sigma_\xi^2 = \int_{-\infty}^{\infty} (\xi - \xi_0)^2\, |\hat{f}(\xi)|^2\, d\xi.$$

Note: since $\|\hat{f}\|_2 = \|f\|_2 = 1$ (Plancherel), $|f(t)|^2$ and $|\hat{f}(\xi)|^2$ are probability densities. The time and frequency spreads are the standard deviations of these distributions.
6.2 Formal Statement: $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$
Theorem 6.1 (Heisenberg-Kennard Uncertainty Principle). For any $f \in L^2(\mathbb{R})$ with $\|f\|_2 = 1$ and finite spreads:

$$\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}.$$

Equality holds if and only if $f$ is a Gaussian (up to time-shift, frequency-shift, and scaling):

$$f(t) = C\, e^{2\pi i \xi_0 t}\, e^{-\pi (t - t_0)^2/\sigma^2}$$

for some $t_0, \xi_0 \in \mathbb{R}$ and $\sigma > 0$.
The bound is fundamental. This is not a statement about measurement error or quantum physics - it is a purely mathematical theorem about functions and their Fourier Transforms. Any signal with time spread $\sigma_t$ necessarily has frequency spread at least $\frac{1}{4\pi \sigma_t}$. You cannot concentrate a signal to be both time-limited and bandwidth-limited simultaneously.
6.3 Proof via Cauchy-Schwarz
Proof. Without loss of generality, assume $t_0 = \xi_0 = 0$ (translate in time and frequency; neither changes the spreads). By Parseval and the differentiation property $\widehat{f'}(\xi) = 2\pi i \xi\, \hat{f}(\xi)$:

$$\sigma_\xi^2 = \int \xi^2\, |\hat{f}(\xi)|^2\, d\xi = \frac{1}{4\pi^2}\, \|f'\|_2^2.$$

So the claim $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$ is equivalent to:

$$\|t f\|_2\, \|f'\|_2 \ge \frac{1}{2}.$$

Start from $\frac{d}{dt}|f(t)|^2 = f'(t)\overline{f(t)} + f(t)\overline{f'(t)} = 2\,\mathrm{Re}\!\left(f'(t)\,\overline{f(t)}\right)$ and integrate by parts (the boundary term vanishes since $tf, f' \in L^2$ forces $t\,|f(t)|^2 \to 0$):

$$1 = \|f\|_2^2 = \int |f(t)|^2\, dt = -\int t\, \frac{d}{dt}|f(t)|^2\, dt = -2\,\mathrm{Re}\int t\, f'(t)\, \overline{f(t)}\, dt.$$

By the Cauchy-Schwarz inequality:

$$1 \le 2\left|\int t\, \overline{f(t)}\, f'(t)\, dt\right| \le 2\, \|t f\|_2\, \|f'\|_2.$$

Hence $\sigma_t \cdot 2\pi \sigma_\xi \ge \frac{1}{2}$, i.e., $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$. $\blacksquare$
6.4 The Gaussian as the Unique Extremal Function
Equality in Cauchy-Schwarz requires $f'(t) = c\, t\, f(t)$ for some constant $c$. This ODE has solution $f(t) = A\, e^{c t^2/2}$. For $f \in L^2$, we need $\mathrm{Re}(c) < 0$, giving $f(t) = A\, e^{-\pi t^2/\sigma^2}$ for some $\sigma > 0$ - a Gaussian.
Adding back the time-shift $t_0$ and frequency-shift $\xi_0$:

$$f(t) = C\, e^{2\pi i \xi_0 t}\, e^{-\pi (t - t_0)^2/\sigma^2}.$$

Check: with the normalized self-dual Gaussian $f(t) = 2^{1/4} e^{-\pi t^2}$ (so $\|f\|_2 = 1$): $\sigma_t^2 = \sigma_\xi^2 = \frac{1}{4\pi}$, so $\sigma_t\, \sigma_\xi = \frac{1}{4\pi}$ - exactly the bound.
Implication: The Gaussian is the only signal that achieves perfect time-frequency concentration in the sense of minimizing the product $\sigma_t\, \sigma_\xi$. Every other signal is "more spread out" in at least one of the two domains. This is why Gaussian windows are optimal for time-frequency analysis (Gabor atoms), and why the Gaussian noise model is natural in signal processing.
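A numerical check of the bound and its extremal case: compute $\sigma_t \sigma_\xi$ for a Gaussian and for a two-sided exponential, using Parseval to express $\sigma_\xi$ through $\|f'\|_2$ (grid parameters are arbitrary demo choices):

```python
import numpy as np

t = np.linspace(-20, 20, 400001)
dt = t[1] - t[0]

def spreads(f):
    """(sigma_t, sigma_xi) for a real even signal f(t), so both centers are 0.
    sigma_xi is computed via Parseval: sigma_xi^2 = ||f'||^2 / (4 pi^2 ||f||^2)."""
    p = f**2 / (np.sum(f**2) * dt)            # normalized |f|^2 density
    sigma_t = np.sqrt(np.sum(t**2 * p) * dt)
    fp = np.gradient(f, dt)                    # numerical derivative
    sigma_xi = np.sqrt(np.sum(fp**2) / np.sum(f**2)) / (2 * np.pi)
    return sigma_t, sigma_xi

bound = 1 / (4 * np.pi)
for name, f in [("gaussian", np.exp(-np.pi * t**2)),
                ("two-sided exp", np.exp(-np.abs(t)))]:
    st, sx = spreads(f)
    print(name, st * sx, ">=", bound)
```

The Gaussian product lands on the bound $1/(4\pi) \approx 0.0796$; the two-sided exponential comes out strictly larger (about $0.11$), as the theorem demands.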
6.5 Implications for Neural Architecture Design
The uncertainty principle has direct, concrete implications for AI systems:
RoPE and context length extension: In RoPE, each dimension pair $j$ uses a frequency $\theta_j$ from a geometric sequence; the lowest frequencies encode the longest-range positional information. To extend context from 4K to 128K tokens (as in LLaMA-3 long context), the lowest frequencies must be able to distinguish positions up to 128K. But by the uncertainty principle, resolving structure over a position range $L$ requires frequency resolution $\sim 1/L$ - which requires frequencies low enough that the rotation does not wrap within the window. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) resolve this by rescaling $\theta_j$ to use lower frequencies for long-context models.
Spectrogram and STFT: Audio models like Whisper (Radford et al., 2022) use the Short-Time Fourier Transform - the FT applied to windowed segments. The window length controls the time-frequency tradeoff. Longer windows: better frequency resolution, worse time resolution. The Gaussian window is optimal but the Hamming window is commonly used for computational reasons.
Attention window size: Multi-head self-attention with local windows (as in Longformer, BigBird) effectively applies a bandpass filter in "position space." The uncertainty principle says a window of size $W$ can resolve frequency differences only down to $\sim 1/W$ - setting a fundamental limit on the position encodings that can be usefully resolved within the window.
UNCERTAINTY PRINCIPLE - ARCHITECTURE IMPLICATIONS
========================================================================
Signal Analysis                    Neural Architecture Analog
-----------------                  -------------------------
Time duration T                    Context window / sequence length
Bandwidth B ~ 1/T                  Position frequency resolution
σ_t σ_ξ >= 1/(4π)                  Short context coarse PE, long fine PE
STFT window size W:                Local attention window size W:
  -> freq res: ~ 1/W                 -> PE resolution: ~ 1/W
  -> time res: ~ W                   -> local context: ~ W
Gaussian window:                   Gaussian attention kernel:
  -> optimal concentration           -> used in some efficient attention variants
LongRoPE: rescales θ_j to lower freqs -> finer resolution at long range
========================================================================
7. Tempered Distributions and the Dirac Delta
7.1 Why Distributions Are Necessary
The $L^1$ and $L^2$ theories of the Fourier Transform handle a large class of functions, but they exclude some of the most important objects in signal processing and physics:
- $\delta(t)$ - the Dirac delta: not a function, but models an ideal impulse
- $f(t) = 1$ - the constant function: not in $L^1$ or $L^2$, but models a DC signal
- $\cos(2\pi \xi_0 t)$ - a pure tone: not in $L^1$ or $L^2$, but is a fundamental signal
- $Ш_T(t) = \sum_n \delta(t - nT)$ - the Dirac comb: not a function, but models periodic sampling
All of these have Fourier Transforms in the distributional sense - and all appear in practical signal processing. The distribution framework, developed by Laurent Schwartz in the 1950s, extends the Fourier Transform to this broader class.
7.2 Schwartz Space and Tempered Distributions
Definition 7.1 (Schwartz Space). The Schwartz space $\mathcal{S}(\mathbb{R})$ is the space of all functions $\phi \in C^\infty(\mathbb{R})$ such that for all $m, n \ge 0$:

$$\sup_{t \in \mathbb{R}} \left| t^m\, \phi^{(n)}(t) \right| < \infty,$$

i.e., $\phi$ and all its derivatives decay faster than any polynomial. Examples: $e^{-\pi t^2}$ is Schwartz; $e^{-|t|}$ is not (the derivative condition fails at 0); compactly supported $C^\infty$ bump functions are Schwartz.
The Schwartz space is closed under the Fourier Transform: if $\phi \in \mathcal{S}$, then $\hat{\phi} \in \mathcal{S}$. This is the key property: the FT maps $\mathcal{S}$ to itself bijectively.
Definition 7.2 (Tempered Distributions). A tempered distribution is a continuous linear functional $u: \mathcal{S}(\mathbb{R}) \to \mathbb{C}$. We write $u \in \mathcal{S}'(\mathbb{R})$.
Examples of tempered distributions:
- Regular distributions: Every locally integrable function $f$ of polynomial growth (e.g., $f \in L^p$) defines a tempered distribution $u_f(\phi) = \int f(t)\, \phi(t)\, dt$.
- Dirac delta: $\delta(\phi) = \phi(0)$. The Dirac delta evaluates at 0. It is NOT a function.
- Derivative of delta: $\delta'(\phi) = -\phi'(0)$.
- Dirac comb: $Ш_T(\phi) = \sum_{n=-\infty}^{\infty} \phi(nT)$.
Fourier Transform of a tempered distribution: Define $\hat{u}(\phi) = u(\hat{\phi})$ for all $\phi \in \mathcal{S}$. This is the "transpose" of the Fourier Transform on $\mathcal{S}$.
7.3 The Dirac Delta and its Fourier Transform
The Dirac delta models a unit impulse concentrated at $t = 0$:

$$\delta(\phi) = \phi(0), \qquad \text{informally } \int \delta(t)\, \phi(t)\, dt = \phi(0).$$

It is the limit of a sequence of ordinary functions: $k_\epsilon(t) = \epsilon^{-1} e^{-\pi t^2/\epsilon^2}$ converges to $\delta$ in the distributional sense as $\epsilon \to 0$.
Fourier Transform of $\delta$: By the definition, $\hat{\delta}(\phi) = \delta(\hat{\phi}) = \hat{\phi}(0) = \int \phi(t)\, dt$, which is the action of the constant function 1. Therefore:

$$\hat{\delta} = 1.$$

An ideal impulse at $t = 0$ has a flat (white) spectrum - it contains equal energy at all frequencies. This makes perfect physical sense: to create a perfect impulse, you need all frequencies constructively interfering at $t = 0$.
Fourier Transform of a constant: By duality (applying FT twice): $\hat{1} = \delta$. A constant signal (DC) lives entirely at frequency $\xi = 0$.
The duality principle in full: The FT maps $\delta \mapsto 1$ and $1 \mapsto \delta$ - a striking symmetry.
More generally: A delta at $t = a$:

$$\widehat{\delta_a}(\xi) = e^{-2\pi i a \xi}.$$

This is the time-shift property: $\delta$ shifted by $a$ gives a phase factor $e^{-2\pi i a \xi}$, but the magnitude is still flat: $|\widehat{\delta_a}(\xi)| = 1$.
7.4 FT of Constants, Sinusoids, and Periodic Signals
Using $\hat{\delta} = 1$ and the frequency-shift property:
Complex exponential:

$$\widehat{e^{2\pi i \xi_0 t}} = \delta_{\xi_0}.$$

Cosine and Sine:

$$\widehat{\cos(2\pi \xi_0 t)} = \frac{1}{2}\left(\delta_{\xi_0} + \delta_{-\xi_0}\right), \qquad \widehat{\sin(2\pi \xi_0 t)} = \frac{1}{2i}\left(\delta_{\xi_0} - \delta_{-\xi_0}\right).$$

A pure cosine has a spectrum consisting of two Dirac deltas, one at $+\xi_0$ and one at $-\xi_0$ (the negative frequency is the mirror image required by Hermitian symmetry). The amplitude of each spike is $\frac{1}{2}$ because the energy is split equally between the two.
General periodic signal: A $T$-periodic function $f(t) = \sum_n c_n\, e^{2\pi i n t/T}$ has Fourier Transform:

$$\hat{f} = \sum_{n=-\infty}^{\infty} c_n\, \delta_{n/T}.$$

The Fourier Transform of a periodic signal is a discrete sum of Dirac deltas at the harmonics $n/T$ - precisely the discrete spectrum of the Fourier series! This shows that Fourier series and Fourier Transform are two aspects of the same theory, unified by the distributional framework.
7.5 The Dirac Comb and Poisson Summation Formula
Definition 7.3 (Dirac Comb). The $T$-periodic Dirac comb is:

$$Ш_T(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT).$$

Theorem 7.1 (FT of the Dirac Comb).

$$\widehat{Ш_T} = \frac{1}{T}\, Ш_{1/T}.$$

A comb of spacing $T$ in time has a spectrum that is a comb of spacing $1/T$ in frequency. This is the time-frequency duality made concrete: fine temporal sampling -> coarse frequency spacing, and vice versa.
Theorem 7.2 (Poisson Summation Formula). For $f \in \mathcal{S}(\mathbb{R})$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{k=-\infty}^{\infty} \hat{f}(k).$$

More generally, for sampling at interval $T$:

$$\sum_{n=-\infty}^{\infty} f(nT) = \frac{1}{T} \sum_{k=-\infty}^{\infty} \hat{f}\!\left(\frac{k}{T}\right).$$

Proof: The function $F(t) = \sum_n f(t + n)$ is 1-periodic, so it has a Fourier series with coefficients $c_k = \int_0^1 F(t)\, e^{-2\pi i k t}\, dt = \hat{f}(k)$. Evaluating at $t = 0$: $\sum_n f(n) = F(0) = \sum_k c_k = \sum_k \hat{f}(k)$.
The Poisson summation formula is the key to sampling theory. Shannon's sampling theorem (1949) follows directly: a signal with bandwidth $B$ (i.e., $\hat{f}(\xi) = 0$ for $|\xi| > B$) is completely determined by its samples at rate $2B$ (the Nyquist rate). The Poisson summation formula explains both the reconstruction formula and the aliasing that occurs when undersampling.
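Poisson summation can be verified to machine precision with a dilated Gaussian, whose transform is known in closed form (the width $s$ is an arbitrary choice):

```python
import numpy as np

# Poisson summation for f(t) = exp(-pi t^2 / s^2), whose FT is
# fhat(xi) = s * exp(-pi s^2 xi^2):  sum_n f(n) = sum_k fhat(k).
s = 0.7
n = np.arange(-50, 51)                             # both sums converge fast
lhs = np.sum(np.exp(-np.pi * n**2 / s**2))         # samples of f
rhs = s * np.sum(np.exp(-np.pi * s**2 * n**2))     # samples of fhat
print(lhs, rhs)   # equal to machine precision
```

This identity is exactly the classical theta-function transformation law, here obtained as a two-line consequence of Poisson summation.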
For AI: The discrete Fourier Transform (Section 03) is the computational realization of the Poisson summation formula. The FFT computes the sampled Fourier Transform exactly, which requires the signal to be band-limited (or periodic) - any violation causes aliasing, the discrete analog of spectral leakage.
7.6 Unifying Fourier Series and Fourier Transform
The distribution framework reveals that Fourier series and Fourier Transform are the same thing:
| Situation | Signal | Spectrum |
|---|---|---|
| General signal | Continuous, aperiodic | Continuous, aperiodic |
| Periodic signal (period $T$) | Continuous, periodic | Discrete (spacing $1/T$): Fourier series |
| Bandlimited (bandwidth $B$), sampled | Discrete (rate $2B$): sampling | Periodic (period $2B$): aliases |
| Periodic AND bandlimited | Discrete and periodic | Discrete and periodic: DFT (Section 03) |
FOURIER TRANSFORM UNIFICATION
========================================================================
Time domain Frequency domain Setting
----------- ---------------- -------
Continuous Continuous L1/L2 Fourier Transform
Continuous Discrete Fourier Series (periodic)
Discrete Continuous DTFT (sampled signal)
Discrete Discrete DFT / FFT -> see Section 20-03
All four are the same theory viewed through the distributional lens.
The Dirac comb + Poisson summation connects all four cases.
========================================================================