
Fourier Transform: Appendix B: Worked Examples and Techniques to Appendix F: The Fourier Transform in Practice - Case Studies

Appendix B: Worked Examples and Techniques

B.1 The Fourier Transform Table - Complete Reference

The following table lists the most important Fourier Transform pairs, using the $\xi$-convention $\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt$. All functions are assumed to lie in $L^1 \cap L^2$ unless otherwise noted.

| # | $f(t)$ | $\hat{f}(\xi)$ | Notes |
|---|--------|----------------|-------|
| 1 | $e^{-\pi t^2}$ | $e^{-\pi\xi^2}$ | Self-dual Gaussian |
| 2 | $e^{-\pi t^2/\sigma^2}$ | $\sigma e^{-\pi\sigma^2\xi^2}$ | Scaled Gaussian |
| 3 | $e^{-a\lvert t\rvert}$, $a>0$ | $\frac{2a}{a^2+(2\pi\xi)^2}$ | Double-sided exponential |
| 4 | $e^{-at}\mathbb{1}_{t\geq0}$, $a>0$ | $\frac{1}{a+2\pi i\xi}$ | One-sided exponential |
| 5 | $\operatorname{rect}(t/T)$ | $T\operatorname{sinc}(T\xi)$ | Rectangular pulse of width $T$ |
| 6 | $\operatorname{sinc}(Wt)$ | $\frac{1}{W}\operatorname{rect}(\xi/W)$ | Ideal low-pass filter (by duality) |
| 7 | $\Lambda(t)=\max(1-\lvert t\rvert,0)$ | $\operatorname{sinc}^2(\xi)$ | Triangle function |
| 8 | $\frac{1}{1+(2\pi t)^2}$ | $\frac{1}{2}e^{-\lvert\xi\rvert}$ | Lorentzian / Cauchy |
| 9 | $\delta(t)$ | $1$ | Dirac impulse -> white spectrum |
| 10 | $1$ | $\delta(\xi)$ | Constant -> DC spike (distributional) |
| 11 | $\delta(t-a)$ | $e^{-2\pi i\xi a}$ | Shifted impulse |
| 12 | $e^{2\pi i\xi_0 t}$ | $\delta(\xi-\xi_0)$ | Pure tone (distributional) |
| 13 | $\cos(2\pi\xi_0 t)$ | $\frac{1}{2}[\delta(\xi-\xi_0)+\delta(\xi+\xi_0)]$ | Cosine |
| 14 | $\sin(2\pi\xi_0 t)$ | $\frac{1}{2i}[\delta(\xi-\xi_0)-\delta(\xi+\xi_0)]$ | Sine |
| 15 | $\sum_{n=-\infty}^\infty\delta(t-nT)$ | $\frac{1}{T}\sum_{n=-\infty}^\infty\delta(\xi-n/T)$ | Dirac comb |
| 16 | $\operatorname{sgn}(t)$ | $\frac{1}{\pi i\xi}$ (distributional) | Sign function |
| 17 | $\mathbb{1}_{t\geq0}$ (unit step) | $\frac{1}{2}\delta(\xi)+\frac{1}{2\pi i\xi}$ | Heaviside step (distributional) |
| 18 | $t^n e^{-at}$, $a>0$, $t\geq0$ | $\frac{n!}{(a+2\pi i\xi)^{n+1}}$ | Polynomial exponential |
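As a quick sanity check on the table, the sketch below numerically evaluates the defining integral for rows 1 and 3 and compares against the closed forms (a minimal sketch assuming numpy and scipy; `ft` is a hypothetical helper written for this check, not a library function):

```python
import numpy as np
from scipy.integrate import quad

def ft(f, xi):
    """Numerically evaluate f_hat(xi) = int f(t) exp(-2*pi*i*xi*t) dt."""
    re = quad(lambda t: f(t) * np.cos(2 * np.pi * xi * t), -np.inf, np.inf)[0]
    im = quad(lambda t: -f(t) * np.sin(2 * np.pi * xi * t), -np.inf, np.inf)[0]
    return re + 1j * im

xi, a = 0.7, 2.0
print(ft(lambda t: np.exp(-np.pi * t**2), xi), np.exp(-np.pi * xi**2))           # row 1
print(ft(lambda t: np.exp(-a * np.abs(t)), xi), 2*a / (a**2 + (2*np.pi*xi)**2))  # row 3
```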

B.2 Four Full Worked Examples

Worked Example 1: Triangular Pulse

The triangular function $\Lambda(t) = \max(1 - \lvert t\rvert, 0)$ is zero outside $[-1, 1]$. We compute $\hat{\Lambda}$ directly.

$$\hat{\Lambda}(\xi) = \int_{-1}^{0}(1+t)e^{-2\pi i\xi t}\,dt + \int_0^1(1-t)e^{-2\pi i\xi t}\,dt$$

For the first integral, let $u = -t$:

$$\int_{-1}^{0}(1+t)e^{-2\pi i\xi t}\,dt = \int_0^1(1-u)e^{2\pi i\xi u}\,du$$

Adding both integrals: $\hat{\Lambda}(\xi) = \int_0^1(1-t)(e^{2\pi i\xi t}+e^{-2\pi i\xi t})\,dt = 2\int_0^1(1-t)\cos(2\pi\xi t)\,dt$.

Integrating by parts:

$$= 2\left[\frac{(1-t)\sin(2\pi\xi t)}{2\pi\xi}\right]_0^1 + 2\int_0^1\frac{\sin(2\pi\xi t)}{2\pi\xi}\,dt = 0 + \frac{1}{\pi\xi}\left[-\frac{\cos(2\pi\xi t)}{2\pi\xi}\right]_0^1 = \frac{1-\cos(2\pi\xi)}{2\pi^2\xi^2}$$

Using $1 - \cos\theta = 2\sin^2(\theta/2)$:

$$\hat{\Lambda}(\xi) = \frac{2\sin^2(\pi\xi)}{2\pi^2\xi^2} = \left(\frac{\sin(\pi\xi)}{\pi\xi}\right)^2 = \operatorname{sinc}^2(\xi)$$

So $\hat{\Lambda} = \operatorname{sinc}^2$. Alternatively: $\Lambda = \operatorname{rect} * \operatorname{rect}$ (the convolution of two rect functions), so $\hat{\Lambda} = \hat{\operatorname{rect}}\cdot\hat{\operatorname{rect}} = \operatorname{sinc}^2$ by the convolution theorem. The $\operatorname{sinc}^2$ spectrum decays faster than $\operatorname{sinc}$ (as $1/\xi^2$ vs $1/\xi$) because $\Lambda$ is continuous (no jumps), while $\operatorname{rect}$ has jump discontinuities.
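A direct numerical check of this example: sample $\Lambda$ on a fine grid and compare a Riemann-sum approximation of $\hat{\Lambda}$ against $\operatorname{sinc}^2$ (a minimal sketch assuming numpy):

```python
import numpy as np

# Riemann-sum approximation of the FT of the triangle function,
# compared against the closed form sinc^2(xi) derived above.
T, N = 0.001, 4096
t = (np.arange(N) - N // 2) * T          # grid covering roughly [-2, 2]
tri = np.maximum(1 - np.abs(t), 0.0)

for xi in (0.5, 1.5, 2.7):
    approx = T * np.sum(tri * np.exp(-2j * np.pi * xi * t))
    exact = (np.sin(np.pi * xi) / (np.pi * xi)) ** 2
    print(xi, approx.real, exact)        # agree to ~1e-6
```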

Worked Example 2: Gaussian via Differential Equation

An elegant alternative proof that $\mathcal{F}[e^{-\pi t^2}] = e^{-\pi\xi^2}$ uses the fact that $f(t) = e^{-\pi t^2}$ satisfies $f'(t) = -2\pi t\,f(t)$.

Taking the FT of both sides and using the differentiation rules:

  • LHS: $\mathcal{F}[f'](\xi) = 2\pi i\xi\,\hat{f}(\xi)$
  • RHS: using $\mathcal{F}[tf(t)](\xi) = \frac{1}{-2\pi i}\frac{d}{d\xi}\hat{f}(\xi) = \frac{i}{2\pi}\hat{f}'(\xi)$, we get $\mathcal{F}[-2\pi t\,f(t)](\xi) = -2\pi\cdot\frac{i}{2\pi}\hat{f}'(\xi) = -i\hat{f}'(\xi)$

Equating the two sides: $2\pi i\xi\,\hat{f}(\xi) = -i\hat{f}'(\xi)$. This gives the ODE $\hat{f}'(\xi) = -2\pi\xi\,\hat{f}(\xi)$, with solution $\hat{f}(\xi) = C\,e^{-\pi\xi^2}$.

To find $C$: $\hat{f}(0) = \int e^{-\pi t^2}\,dt = 1$ (Gaussian integral). So $C = 1$ and $\hat{f}(\xi) = e^{-\pi\xi^2}$. $\square$

This derivation is especially instructive: it shows that the Gaussian's self-duality follows from the fact that $e^{-\pi t^2}$ satisfies the simplest possible first-order ODE, and the FT converts this ODE into the same ODE in frequency space.

Worked Example 3: FT of $e^{-a\lvert t\rvert}$ and the Limit $a \to 0^+$

$$\hat{f}(\xi) = \int_{-\infty}^\infty e^{-a\lvert t\rvert}e^{-2\pi i\xi t}\,dt = \int_0^\infty e^{-at}e^{-2\pi i\xi t}\,dt + \int_{-\infty}^0 e^{at}e^{-2\pi i\xi t}\,dt$$

For the first integral ($t > 0$):

$$\int_0^\infty e^{-(a+2\pi i\xi)t}\,dt = \frac{1}{a+2\pi i\xi} \quad \text{(valid for } \operatorname{Re}(a+2\pi i\xi) = a > 0\text{)}$$

For the second integral, let $u = -t$ (so $u > 0$ for $t < 0$):

$$\int_0^\infty e^{-au}e^{2\pi i\xi u}\,du = \frac{1}{a-2\pi i\xi}$$

Therefore:

$$\hat{f}(\xi) = \frac{1}{a+2\pi i\xi} + \frac{1}{a-2\pi i\xi} = \frac{(a-2\pi i\xi)+(a+2\pi i\xi)}{a^2+(2\pi\xi)^2} = \frac{2a}{a^2+4\pi^2\xi^2}$$

This is a Lorentzian (Cauchy) profile centered at $\xi = 0$ with half-width $a/(2\pi)$. As $a \to 0^+$, the time-domain signal $e^{-a\lvert t\rvert}$ approaches the constant $1$, while the spectrum becomes an ever taller, ever narrower spike of constant unit area: $\int \frac{2a}{a^2+4\pi^2\xi^2}\,d\xi = 1$ for every $a > 0$. Indeed, $\frac{2a}{a^2+4\pi^2\xi^2} = \frac{\varepsilon}{\pi(\varepsilon^2+\xi^2)}$ with $\varepsilon = a/(2\pi)$, and the Poisson kernel $\frac{\varepsilon}{\pi(\varepsilon^2+\xi^2)} \to \delta(\xi)$ as $\varepsilon \to 0^+$. The limit confirms $\mathcal{F}[1] = \delta$.

Worked Example 4: The FT and Convolution for the Heat Equation Kernel

The heat kernel at time $t > 0$ is $G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}e^{-x^2/(4\kappa t)}$. We verify $\mathcal{F}[G_t](\xi) = e^{-4\pi^2\kappa t\xi^2}$ using the scaling theorem.

We know $\mathcal{F}[e^{-\pi x^2}](\xi) = e^{-\pi\xi^2}$. By the scaling theorem, for any $\alpha > 0$:

$$\mathcal{F}[e^{-\alpha x^2}](\xi) = \sqrt{\frac{\pi}{\alpha}}\,e^{-\pi^2\xi^2/\alpha}$$

Write $G_t(x) = c\,e^{-\alpha x^2}$ with $c = (4\pi\kappa t)^{-1/2}$ and $\alpha = 1/(4\kappa t)$. Then:

$$\mathcal{F}[G_t](\xi) = c\sqrt{\frac{\pi}{\alpha}}\,e^{-\pi^2\xi^2/\alpha} = \frac{1}{\sqrt{4\pi\kappa t}}\cdot\sqrt{4\pi\kappa t}\cdot e^{-\pi^2\xi^2\cdot 4\kappa t} = e^{-4\pi^2\kappa t\xi^2}$$

This confirms that in the $\xi$-convention, the heat kernel's FT is $e^{-4\pi^2\kappa t\xi^2}$ - exactly the solution we derived from the heat equation in Section A.4.

B.3 Decay Rates and Smoothness: The Complete Picture

The relationship between the smoothness of $f$ and the decay of $\hat{f}$ is one of the deepest qualitative results in Fourier analysis. The complete picture:

Theorem B.1 (Smoothness-Spectral Decay). Let $f \in L^1(\mathbb{R})$.

| Condition on $f$ | Decay of $\lvert\hat{f}(\xi)\rvert$ as $\lvert\xi\rvert \to \infty$ |
|---|---|
| $f$ has a jump discontinuity | $O(1/\lvert\xi\rvert)$ |
| $f \in C^1$ (continuous, one derivative) | $o(1/\lvert\xi\rvert)$ |
| $f \in C^k$, $f^{(k)} \in L^1$ | $O(1/\lvert\xi\rvert^k)$ |
| $f \in C^\infty$ | Faster than any polynomial: $O(1/\lvert\xi\rvert^N)$ for all $N$ |
| $f$ analytic (holomorphic in a strip $\lvert\operatorname{Im}(t)\rvert < \delta$) | Exponential decay: $O(e^{-2\pi\delta\lvert\xi\rvert})$ |

Proof sketch: For the $k$-derivative case: $\lvert\hat{f}(\xi)\rvert = \lvert\mathcal{F}[f^{(k)}](\xi)\rvert/\lvert2\pi i\xi\rvert^k \leq \lVert f^{(k)}\rVert_1/(2\pi\lvert\xi\rvert)^k$. For the analytic case, shift the contour of integration to $\operatorname{Im}(t) = \mp\delta$ (for $\xi \gtrless 0$), gaining an exponential factor $e^{-2\pi\lvert\xi\rvert\delta}$. $\square$

Applications:

  • Signal compression: Smooth signals (like audio with no sharp transients) have rapidly decaying spectra -> they can be compressed by truncating high frequencies with small error.
  • Neural network generalization: Smooth target functions (natural image statistics) have rapidly decaying spectra -> a network with limited bandwidth can approximate them well. This connects to the spectral bias of neural networks (Rahaman et al., 2019): networks preferentially learn low-frequency components first.
  • Discretization error: The aliasing error when sampling with spacing $T$ is bounded by $\sum_{n \neq 0} \lvert\hat{f}(\xi + n/T)\rvert \leq C\,T^k$ if $f \in C^k$ - smoother signals alias less. (The sketch below illustrates the decay hierarchy numerically.)
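The decay hierarchy of Theorem B.1 is easy to observe. The sketch below (assuming numpy) compares high-frequency spectral magnitudes of a rect (jump), a triangle ($C^0$), and a Gaussian (analytic) on the same grid:

```python
import numpy as np

# Observing Theorem B.1: smoother signals have faster spectral decay.
T, N = 0.01, 2**14
t = (np.arange(N) - N // 2) * T
signals = {
    "rect (jump)":         (np.abs(t) <= 0.5).astype(float),
    "triangle (C^0)":      np.maximum(1 - np.abs(t), 0.0),
    "gaussian (analytic)": np.exp(-np.pi * t**2),
}
xi = np.fft.fftfreq(N, d=T)
band = (np.abs(xi) > 20) & (np.abs(xi) < 40)   # a high-frequency band
for name, f in signals.items():
    F = np.abs(T * np.fft.fft(f))              # FT magnitude (phase irrelevant here)
    print(name, F[band].max())                 # rect >> triangle >> gaussian
```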

B.4 The Fourier Transform in Different Function Spaces

A complete picture of where the FT lives:

DOMAIN OF DEFINITION - FOURIER TRANSFORM
========================================================================

  L1(R)     ->  C0(R)        (bounded, continuous, vanishes at infty)
                ‖f̂‖_∞ <= ‖f‖_1

  L2(R)     ->  L2(R)        (Plancherel: FT is unitary)
                ‖f̂‖_2 = ‖f‖_2

  L1 ∩ L2   ->  L2           (both apply; easiest to work with)

  S(R)      ->  S(R)         (Schwartz space: FT preserves rapidly
               (bijection)    decreasing smooth functions)

  S'(R)     ->  S'(R)        (tempered distributions: largest domain;
               (bijection)    includes δ, constants, periodic signals)

  Key inclusions:  S ⊂ L1 ∩ L2 ⊂ L1 ⊂ S'
                   S ⊂ L1 ∩ L2 ⊂ L2 ⊂ S'

  For practical computation: work in S' (distributions),
  which unifies all cases under one theory.

========================================================================

B.5 Numerical Computation of the Fourier Transform

In practice, the Fourier Transform is computed numerically via the DFT/FFT (Section 20-03). Here we note the key relationships.

Approximation scheme. To numerically approximate $\hat{f}(\xi)$ for $f \in L^1(\mathbb{R})$:

  1. Truncate the signal to $[-L, L]$ for large enough $L$ (truncation error: $\int_{\lvert t\rvert>L}\lvert f(t)\rvert\,dt$)
  2. Sample at spacing $T = 2L/N$ to get $N$ equally spaced samples $f_n = f(-L + nT)$
  3. Apply the DFT: $\hat{f}_k = T\sum_{n=0}^{N-1}f_n e^{-2\pi ink/N}$
  4. Up to a known phase factor from the grid offset, the result approximates $\hat{f}(k/(2L))$ for $k = 0, 1, \ldots, N/2$ (see the sketch below).
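A minimal numpy sketch of steps 1-4, checked against the self-dual Gaussian; the last multiplication supplies the phase factor produced by shifting the sample grid to $[-L, L)$:

```python
import numpy as np

L, N = 8.0, 1024                       # truncation half-width, number of samples
T = 2 * L / N                          # sample spacing (step 2)
t = -L + T * np.arange(N)              # samples covering [-L, L)
f = np.exp(-np.pi * t**2)              # test signal: self-dual Gaussian

k = np.arange(N // 2 + 1)
xi = k / (2 * L)                       # frequency grid k/(2L) (step 4)
F = T * np.fft.rfft(f)                 # scaled DFT (step 3)
F *= np.exp(2j * np.pi * xi * L)       # phase correction for the grid offset

print(np.max(np.abs(F - np.exp(-np.pi * xi**2))))   # ~1e-15
```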

Sources of error:

  • Aliasing: Undersampling causes high-frequency components to fold back - reduced by sampling faster (higher $N$)
  • Leakage: Truncation at $\pm L$ multiplies by a rect function -> convolves the spectrum with a sinc -> spectral leakage. Reduced by windowing (multiplying by a smooth window before the DFT)
  • Resolution: The frequency grid spacing is $1/(2L)$ - the smallest frequency separation the DFT can resolve. For finer resolution, use longer time windows.

All three error types are direct consequences of the Fourier Transform properties developed in this section. Full treatment in Section 20-03.


Appendix C: Connections, Extensions, and Deep Dives

C.1 The Fourier Transform on Locally Compact Abelian Groups

The Fourier Transform generalizes far beyond $\mathbb{R}$. For any locally compact abelian (LCA) group $G$ with Haar measure $\mu$, there is a Fourier Transform $\mathcal{F}: L^1(G) \to C_0(\hat{G})$ where $\hat{G}$ is the Pontryagin dual group (the group of continuous homomorphisms $G \to S^1$).

| Group $G$ | Dual $\hat{G}$ | Fourier Transform | Name |
|---|---|---|---|
| $\mathbb{R}$ | $\mathbb{R}$ | $\int f(t)e^{-2\pi i\xi t}\,dt$ | Fourier integral |
| $\mathbb{Z}$ | $S^1 = \mathbb{R}/\mathbb{Z}$ | $\sum_n f(n)e^{-2\pi i\xi n}$ | Discrete-time FT (DTFT) |
| $S^1$ | $\mathbb{Z}$ | $\int_0^1 f(e^{2\pi i\theta})e^{-2\pi in\theta}\,d\theta$ | Fourier series |
| $\mathbb{Z}/N\mathbb{Z}$ | $\mathbb{Z}/N\mathbb{Z}$ | $\sum_{k=0}^{N-1}f(k)e^{-2\pi ikm/N}$ | Discrete Fourier Transform (DFT) |
| $\mathbb{R}^d$ | $\mathbb{R}^d$ | $\int f(\mathbf{t})e^{-2\pi i\boldsymbol{\xi}^\top\mathbf{t}}\,d\mathbf{t}$ | Multidimensional FT |

Plancherel's theorem, the convolution theorem, and the inversion theorem all hold in this general setting. The unification of Fourier series and Fourier Transform as special cases of the same construction on different groups is one of the most elegant results in abstract harmonic analysis.

For AI: The group structure is what makes RoPE work. The circle group $S^1$ (the group of complex numbers of modulus 1 under multiplication) is a compact abelian group. Multiplying by $e^{im\theta}$ is the group action of rotation by $m\theta$. The relative-position property of RoPE, $\langle e^{im\theta}q, e^{in\theta}k\rangle = e^{i(m-n)\theta}\langle q,k\rangle$, says the inner product is invariant under simultaneous rotation: it sees only the group element $e^{i(m-n)\theta}$ corresponding to relative position.

C.2 Non-Commutative Fourier Analysis

For non-abelian groups (e.g., $SO(3)$, permutation groups $S_N$), Pontryagin duality breaks down but a Fourier Transform still exists, with irreducible unitary representations playing the role of frequencies.

Graph Fourier Transform: For a graph $G = (V, E)$ with Laplacian $L = D - A$ (where $D$ is the degree matrix), the eigenvectors $\mathbf{u}_k$ satisfying $L\mathbf{u}_k = \lambda_k\mathbf{u}_k$ play the role of complex exponentials. The graph Fourier Transform is $\hat{\mathbf{f}} = U^\top\mathbf{f}$ where $U = [\mathbf{u}_1,\ldots,\mathbf{u}_N]$ is the matrix of eigenvectors. The eigenvalues $\lambda_k$ represent "graph frequencies" - how rapidly the eigenvector oscillates across the graph.
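A minimal sketch of the graph Fourier Transform on a small cycle graph (assuming numpy; the cycle graph and test signal are illustrative choices):

```python
import numpy as np

# Graph FT on an 8-node cycle, using the combinatorial Laplacian L = D - A.
N = 8
A = np.zeros((N, N))
for i in range(N):                         # link each node to its neighbor
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1
L = np.diag(A.sum(axis=1)) - A             # Laplacian L = D - A

lam, U = np.linalg.eigh(L)                 # graph frequencies, eigenvector basis
f = np.sin(2 * np.pi * np.arange(N) / N)   # a smooth signal on the cycle
f_hat = U.T @ f                            # graph FT: expand in the eigenbasis
f_rec = U @ f_hat                          # inverse graph FT
print(np.allclose(f, f_rec))               # True: U orthogonal, transform lossless
```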

This is the foundation of spectral graph neural networks (Bruna et al., 2014; Defferrard et al., 2016, ChebNet; Kipf & Welling, 2017, GCN). A spectral GNN applies a learned filter $h_\theta(\Lambda)$ in graph Fourier space: $h_\theta * \mathbf{f} = U\,h_\theta(\Lambda)\,U^\top\mathbf{f}$.

Full treatment in Section 11-04 Spectral Graph Theory.

C.3 The Fourier Transform and Learning Theory

The Fourier Transform connects deeply to several foundational results in learning theory:

Spectral Bias (Rahaman et al., 2019): Neural networks trained by gradient descent learn the Fourier components of the target function in order of increasing frequency. This is because the NTK (Neural Tangent Kernel) has eigenvalues that decay rapidly with frequency - the network converges faster on low-frequency components. Consequence: networks generalize well on smooth targets and may overfit on high-frequency features.

Frequency Principle: Let $f^*$ be the target function and $f_t$ the network at training step $t$. Then $\lvert\hat{f}_t(\xi) - \hat{f}^*(\xi)\rvert$ decreases at a rate $\propto \lambda_\xi$ (the NTK eigenvalue at frequency $\xi$), and $\lambda_\xi$ decreases with $\lvert\xi\rvert$. High-frequency components therefore converge last.

Implications:

  • Data augmentation that adds high-frequency perturbations (e.g., random cropping, noise injection) improves generalization in part because it forces the network to represent these frequencies.
  • Adversarial examples often correspond to high-frequency perturbations of inputs that fool the network but are invisible to humans - the network hasn't fully learned the high-frequency structure.
  • Neural networks with spectral inductive bias (e.g., SIREN - sinusoidal activation functions) can represent high-frequency targets more efficiently.

Rademacher Complexity and Fourier Norms: The generalization error of a hypothesis class can be bounded by its Fourier $\ell^1$ norm $\sum_\xi \lvert\hat{f}(\xi)\rvert$. Functions with small Fourier $\ell^1$ norm have low complexity and generalize well. This connects to the theory of Barron functions (Barron, 1993): functions $f$ with $\int\lvert\hat{f}(\xi)\rvert\,\lvert\xi\rvert\,d\xi < C$ can be approximated to accuracy $\epsilon$ by a two-layer neural network with $O(C^2/\epsilon^2)$ neurons.

C.4 Phase Retrieval - When Magnitude Alone Is Insufficient

The FT $\hat{f}(\xi) = \lvert\hat{f}(\xi)\rvert e^{i\phi(\xi)}$ consists of a magnitude $\lvert\hat{f}(\xi)\rvert$ and a phase $\phi(\xi)$. Phase retrieval asks: can you recover $f$ from $\lvert\hat{f}\rvert$ alone (without the phase)?

In general, the answer is no - many different signals share the same magnitude spectrum. For example, $f(t)$ and $f(-t)$ have the same magnitude spectrum $\lvert\hat{f}(\xi)\rvert$ but may be entirely different. More strikingly, phase-randomized images (keeping $\lvert\hat{f}\rvert$ but using random $\phi(\xi)$) look like noise - the phase carries most of the perceptual information.

Phase is more important than magnitude for perception. A classic experiment: take image $A$ and image $B$, swap their phase spectra (keep $A$'s magnitude, use $B$'s phase), and reconstruct. The result looks like image $B$ - demonstrating that phase dominates perception.

For AI: This has implications for:

  • Adversarial robustness: Many adversarial attacks perturb the high-frequency phase of input images - imperceptible to humans but highly disruptive to CNNs.
  • Generative models: Diffusion models and GANs must learn both the magnitude and phase of the data distribution's spectrum to generate perceptually realistic images.
  • FNet limitation: FNet takes the real part of the FT output, effectively discarding half the phase information. This is one reason FNet performs worse than attention on tasks requiring precise token interactions.

C.5 Fast Algorithms Beyond FFT

The FFT computes the $N$-point DFT in $O(N\log N)$ operations, a vast improvement over the naive $O(N^2)$. Several modern algorithms push further:

Sparse FFT: If the signal has only $k \ll N$ significant frequencies, the sparse FFT (Hassanieh et al., 2012, the MIT sFFT) computes the DFT in $O(k\log N)$ - sublinear in $N$. Applications: energy-efficient spectrum sensing, MRI acceleration.

Non-Uniform FFT (NUFFT): The standard FFT requires uniformly spaced samples. The NUFFT handles non-uniform grids with $O(N\log N)$ complexity, using oversampling and interpolation. Used in MRI reconstruction (non-Cartesian k-space trajectories) and radio astronomy.

Butterfly factorization (Monarch matrices): The FFT matrix can be written as a product of $O(\log N)$ sparse matrices (butterfly factors), each applicable in $O(N)$ operations. Monarch matrices (Dao et al., 2022) generalize this structure for learned efficient linear layers: they parameterize an expressive class of structured matrices with $O(N\sqrt{N})$ parameters that can be applied in $O(N\sqrt{N})$ time, interpolating between dense ($O(N^2)$) and diagonal ($O(N)$) matrices.

The FFT algorithm and its applications - Cooley-Tukey, Bluestein, Rader - are covered in full in Section 20-03 DFT and FFT.

C.6 Summary: The Central Theorems of the Fourier Transform

This section has developed the following central results, in order of importance:

1. Definition and well-posedness (Section 2):

$$\hat{f}(\xi) = \int_{-\infty}^\infty f(t)e^{-2\pi i\xi t}\,dt, \qquad f \in L^1(\mathbb{R})$$

Well-defined, bounded, continuous. Riemann-Lebesgue: $\hat{f}(\xi) \to 0$ as $\lvert\xi\rvert \to \infty$.

2. Properties (Section 3): Linearity, shift, modulation, scaling, time reversal, differentiation, convolution-multiplication duality. The Master Properties Table encodes all relationships between time and frequency operations.

3. Inversion (Section 4): Under sufficient conditions, $f(t) = \int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$. The Gauss-Weierstrass regularization gives a constructive proof.

4. Plancherel (Section 5): $\lVert\hat{f}\rVert_2 = \lVert f\rVert_2$. The FT is a unitary isometry on $L^2(\mathbb{R})$ - energy is preserved.

5. Uncertainty Principle (Section 6): $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$, with equality iff $f$ is a Gaussian. Time-frequency concentration is fundamentally limited.

6. Distribution extension (Section 7): The FT extends to tempered distributions, giving $\mathcal{F}[\delta] = 1$, $\mathcal{F}[1] = \delta$, $\mathcal{F}[e^{2\pi i\xi_0 t}] = \delta(\xi-\xi_0)$, and the Dirac comb formula. Poisson summation unifies Fourier series and the Fourier Transform.

These six results form a complete, self-consistent theory that has shaped mathematics, physics, engineering, and AI for two centuries - and shows no sign of becoming less central.


Appendix D: Machine Learning Deep Dives

D.1 FNet Architecture - Full Mathematical Analysis

We give a complete mathematical analysis of the FNet architecture (Lee-Thorp et al., 2022) using the formalism developed in this section.

Standard Transformer token mixing (attention):

Given a token matrix $X \in \mathbb{R}^{N \times d}$ (sequence length $N$, embedding dimension $d$):

$$\text{Attn}(X)=\operatorname{softmax}\!\left(\frac{XW_Q W_K^\top X^\top}{\sqrt{d_k}}\right)XW_V$$

This is $O(N^2 d_k)$ in memory and time (the $N \times N$ attention matrix dominates for large $N$). The attention matrix $A_{ij} = \operatorname{softmax}(\text{score})_{ij}$ is data-dependent: each output row is a different weighted sum of the value vectors, with weights that depend on the specific input.

FNet token mixing:

$$\text{FNetMix}(X) = \operatorname{Re}(\mathcal{F}_\text{seq}[\mathcal{F}_\text{model}[X]])$$

where $\mathcal{F}_\text{seq}$ applies the DFT along the sequence dimension (rows) and $\mathcal{F}_\text{model}$ along the embedding dimension (columns).

The DFT as a matrix multiplication:

The 1D DFT can be written as $\hat{\mathbf{x}} = F\mathbf{x}$ where $F \in \mathbb{C}^{N \times N}$ with $F_{kn} = e^{-2\pi ikn/N}/\sqrt{N}$ (the unitary DFT matrix). The 2D FNet mixing is:

$$\text{FNetMix}(X) = \operatorname{Re}(F_N X F_d^\top)$$

where $F_N \in \mathbb{C}^{N\times N}$ and $F_d \in \mathbb{C}^{d\times d}$ are DFT matrices.
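A minimal sketch (assuming numpy) confirming that this matrix form coincides with an orthonormal 2D FFT:

```python
import numpy as np

# FNet mixing two ways: explicit DFT matrices vs numpy's 2D FFT.
N, d = 16, 8
X = np.random.randn(N, d)

def dft_matrix(n):
    """Unitary DFT matrix F with F[k, m] = exp(-2*pi*i*k*m/n)/sqrt(n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

F_N, F_d = dft_matrix(N), dft_matrix(d)
mix_matrix = np.real(F_N @ X @ F_d.T)             # Re(F_N X F_d^T)
mix_fft = np.real(np.fft.fft2(X, norm="ortho"))   # same operation via 2D FFT
print(np.allclose(mix_matrix, mix_fft))           # True
```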

Key properties:

  1. Unitarity: $F_N F_N^* = I_N$ and $F_d F_d^* = I_d$. The FNet mixing is an isometry (before taking the real part): $\lVert F_N X F_d^\top\rVert_F = \lVert X\rVert_F$ by Plancherel.

  2. Global mixing: Each output entry $Y_{kj} = \sum_n\sum_m X_{nm}(F_N)_{kn}(F_d)_{jm}$ depends on all $N \times d$ input values. FNet is a global mixer - exactly like attention.

  3. No parameters: Unlike attention, FNet has zero parameters in the mixing layer ($F_N$ and $F_d$ are fixed). All learning happens in the feed-forward network after mixing.

  4. Complexity: $O(Nd\log(Nd))$ via the 2D FFT, vs $O(N^2 d)$ for attention. For $N = 512$, $d = 768$, FNet uses $\sim 500\times$ fewer operations in the mixing layer.

Why take the real part? The DFT output is complex even for real inputs. Taking $\operatorname{Re}(\cdot)$ projects back to real numbers (required since the subsequent feed-forward layers expect real inputs). This discards the imaginary part - half the phase information. This is one reason FNet is weaker than attention: it cannot separately process magnitude and phase information.

Theoretical analysis: The FNet paper shows that attention with random (untrained) weights performs similarly to FNet. This suggests the attention mechanism's power comes primarily from its (trained) data-dependent mixing, not merely from being a global mixer. FNet is the "global but data-independent" baseline.

When to use FNet vs Attention:

| Task type | FNet | Attention | Reason |
|---|---|---|---|
| Sentence classification | ~92-97% of BERT | 100% | Global mixing is sufficient |
| Named entity recognition | ~90% | 100% | Moderate position sensitivity |
| Extractive QA (SQuAD) | ~80% | 100% | Requires precise token matching |
| Speed-critical inference | ~7x faster | baseline | FNet wins when the quality tradeoff is acceptable |

D.2 Random Fourier Features - Implementation Details

Setup for the RBF kernel: $k(\mathbf{x},\mathbf{y}) = e^{-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert^2}$.

The spectral density is $p(\boldsymbol{\omega}) = (4\pi\gamma)^{-d/2}\,e^{-\lVert\boldsymbol{\omega}\rVert^2/(4\gamma)}$, i.e. $\mathcal{N}(\mathbf{0}, 2\gamma I)$.

(With $\gamma = 1/2$: $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, I)$, which is the standard form.)

Feature map construction (two variants):

Random Fourier Features (RFF): Sample $\boldsymbol{\omega}_j \sim p$, $b_j \sim \mathcal{U}[0,2\pi]$:

$$\phi(\mathbf{x}) = \sqrt{2/D}\,[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}+b_1),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}+b_D)]$$

Random Fourier Features (paired, no bias): Sample $\boldsymbol{\omega}_j \sim p$:

$$\phi(\mathbf{x}) = \frac{1}{\sqrt{D}}[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}),\sin(\boldsymbol{\omega}_1^\top\mathbf{x}),\ldots,\cos(\boldsymbol{\omega}_D^\top\mathbf{x}),\sin(\boldsymbol{\omega}_D^\top\mathbf{x})]$$

Both give $\mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})$. The paired version has slightly lower variance.
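A minimal sketch of the paired feature map for the RBF kernel (assuming numpy; $\gamma = 0.5$ is an illustrative choice):

```python
import numpy as np

# Paired RFF for k(x, y) = exp(-gamma * ||x - y||^2).
rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 0.5

W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))  # omega_j ~ N(0, 2*gamma*I)

def phi(x):
    """Paired feature map [cos(w.x), sin(w.x)] / sqrt(D)."""
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = phi(x) @ phi(y)
print(exact, approx)   # approx -> exact as D grows, error O(1/sqrt(D))
```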

Concentration bound:

$$P\!\left(\sup_{\mathbf{x},\mathbf{y}}\left|\phi(\mathbf{x})^\top\phi(\mathbf{y}) - k(\mathbf{x},\mathbf{y})\right| > \epsilon\right) \leq 2^8\left(\frac{\sigma_p R}{\epsilon}\right)^2 e^{-D\epsilon^2/8}$$

where $\sigma_p = \sqrt{\mathbb{E}_p[\lVert\boldsymbol{\omega}\rVert^2]}$ and $R = \max_i\lVert\mathbf{x}^{(i)}\rVert$.

Variance reduction via Quasi-Monte Carlo: Instead of random $\boldsymbol{\omega}_j$, use quasi-random sequences (Sobol, Halton) to improve coverage of frequency space. This gives $O(D^{-1}\log^d D)$ error vs $O(D^{-1/2})$ for random sampling - a significant improvement in practice.

Orthogonal Random Features (ORF) (Yu et al., 2016): Sample the $\boldsymbol{\omega}_j$ as rows of $G = \frac{1}{\sqrt{D}}H D_{\pm} S$, where $H$ is a Hadamard matrix, $D_{\pm}$ is a diagonal matrix of random signs, and $S$ is a diagonal scaling. This gives:

  • The same unbiasedness: $\mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})$
  • Lower variance: the same $O(1/D)$ rate as i.i.d. RFF, but with strictly better constants
  • Faster computation: $O(D\log D)$ using the fast Hadamard transform

Connection to Performer: The Performer attention (Choromanski et al., 2021) approximates $\exp(q^\top k/\sqrt{d})$ (the unnormalized attention softmax kernel) using orthogonal random features:

$$\exp(q^\top k/\sqrt{d}) \approx \phi(q)^\top\phi(k), \quad \phi(\mathbf{x}) = \frac{e^{-\lVert\mathbf{x}\rVert^2/2}}{\sqrt{D}}\,[e^{\boldsymbol{\omega}_1^\top\mathbf{x}},\ldots,e^{\boldsymbol{\omega}_D^\top\mathbf{x}}]$$

(with the $1/\sqrt{d}$ scaling absorbed into $q$ and $k$).

This reduces attention from $O(N^2 d)$ to $O(NDd)$ - linear in the sequence length $N$.
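A minimal Monte Carlo check of the Performer feature map above (assuming numpy; i.i.d. rather than orthogonal frequencies, for simplicity):

```python
import numpy as np

# Checking E[phi(q) . phi(k)] = exp(q . k) for the positive feature map;
# in Performer, q and k are already rescaled by d**(-1/4).
rng = np.random.default_rng(1)
d, D = 4, 100_000
W = rng.normal(size=(D, d))                      # omega_j ~ N(0, I)

def phi(x):
    return np.exp(W @ x - x @ x / 2) / np.sqrt(D)

q, k = 0.5 * rng.normal(size=d), 0.5 * rng.normal(size=d)
print(np.exp(q @ k), phi(q) @ phi(k))            # close for large D
```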

D.3 Spectral Analysis of Neural Network Weight Matrices

The spectral norm $\lVert W\rVert_2 = \sigma_{\max}(W)$ and the spectral distribution of weight matrices encode important information about the training state of a neural network.

Heavy-tailed self-regularization (Martin & Mahoney, 2021): In well-trained neural networks, the eigenvalue distribution of $W^\top W$ follows a power law $p(\lambda) \propto \lambda^{-\alpha}$ for some $\alpha \in [2, 6]$. Under-trained or over-regularized networks have bulk (Marchenko-Pastur) distributions. This is detected by fitting a power law to the sorted singular values.

WeightWatcher: The open-source tool weightwatcher analyzes the spectral distribution of weight matrices in PyTorch/TensorFlow models and reports $\alpha$ values as a proxy for model quality - without needing test data. Well-trained frontier models (GPT-4, LLaMA) are reported to have $\alpha \approx 2$-$4$; poorly trained models have $\alpha > 6$ or non-power-law distributions.

Spectral norm and Lipschitz constant: For a multi-layer network $f = f_L\circ\cdots\circ f_1$, the Lipschitz constant satisfies:

$$\text{Lip}(f) \leq \text{Lip}(\sigma)^{L-1}\prod_{l=1}^L\lVert W^{[l]}\rVert_2$$

For 1-Lipschitz activations ($\sigma = \operatorname{ReLU}$, $\text{Lip}(\sigma) = 1$): $\text{Lip}(f) \leq \prod_l\lVert W^{[l]}\rVert_2$.

Spectral normalization (Section 8.3) rescales each layer so that $\lVert W^{[l]}\rVert_2 = 1$, making the entire network 1-Lipschitz.
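A minimal power-iteration sketch for estimating $\sigma_{\max}$, the same estimator spectral normalization layers use internally (assuming numpy):

```python
import numpy as np

def spectral_norm(W, n_iter=50):
    """Estimate sigma_max(W) by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v                    # Rayleigh quotient ~ sigma_max

W = np.random.default_rng(1).normal(size=(64, 32))
print(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0])  # agree
W_sn = W / spectral_norm(W)             # spectrally normalized: ||W_sn||_2 = 1
```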

For LLMs: Analyzing the spectral properties of the attention weight matrices ($W_Q, W_K, W_V, W_O$) is an active area of mechanistic interpretability research. Low-rank structure in these matrices (small effective rank, i.e. few large singular values) indicates specialized computation. Anthropic's interpretability work has found "singular value signatures" of specific computational patterns in attention heads.

D.4 The Fourier Perspective on Attention

Attention and the Fourier Transform are both global token-mixing operations on sequences. Understanding their relationship clarifies when each is appropriate.

Attention as a data-dependent filter: The output of attention can be written as:

$$y_i = \sum_j A_{ij}v_j, \quad A_{ij} = \frac{\exp(q_i^\top k_j/\sqrt{d_k})}{\sum_\ell\exp(q_i^\top k_\ell/\sqrt{d_k})}$$

For a fixed query $i$, this is a weighted average of the value vectors with data-dependent weights $A_{ij}$. In signal-processing terms, it is a position-dependent filter: the filter coefficients $A_{ij}$ change for each output position $i$.

Fourier Transform as a fixed filter bank: FNet applies the DFT, which is a fixed (non-data-dependent) filter: the output at frequency $k$ is always $\hat{x}_k = \sum_n x_n e^{-2\pi ikn/N}$, regardless of the content of $x$. Each DFT frequency is a specific pattern of oscillation at rate $k/N$.

Comparison table:

| Property | Self-Attention | FNet (DFT) |
|---|---|---|
| Filter type | Data-dependent | Fixed |
| Parameters in mixer | $3d^2$ ($W_Q, W_K, W_V$) | 0 |
| Computational complexity | $O(N^2 d)$ | $O(Nd\log(Nd))$ |
| Expressive power | High (can select specific tokens) | Low (global uniform mixing) |
| Position awareness | Via PE or RoPE | Via DFT phases |
| Best for | Selective retrieval tasks | Global mixing tasks |
| Example success | Machine translation, QA | Text classification, embedding |

The analogy: Attention is like a learnable short-time Fourier Transform with adaptive windows - for each output position, it selects which "frequencies" (patterns) to attend to. The DFT in FNet is like a fixed global frequency analysis with no adaptivity. The power of attention comes from its ability to compute different effective filters for different positions and different inputs.


Appendix E: Reference Summary and Quick Checks

E.1 The Six Most Important Facts

If you retain only six things from this section, retain these:

1. Definition: $\hat{f}(\xi) = \int_{-\infty}^\infty f(t)e^{-2\pi i\xi t}\,dt$ (use the $\xi$-convention; no $2\pi$ prefactor in the inverse).

2. Self-dual Gaussian: $\mathcal{F}[e^{-\pi t^2}](\xi) = e^{-\pi\xi^2}$. The Gaussian maps to itself. All other transforms can be derived from this one via the properties.

3. Plancherel: $\lVert\hat{f}\rVert_2 = \lVert f\rVert_2$. Energy is conserved. The FT is a unitary operator on $L^2$.

4. Uncertainty principle: $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$, with equality iff Gaussian. You cannot have both time and frequency concentration.

5. Distributions: $\mathcal{F}[\delta] = 1$, $\mathcal{F}[1] = \delta$, $\mathcal{F}[e^{2\pi i\xi_0 t}] = \delta(\xi-\xi_0)$. Pure tones map to spikes. Spikes have flat spectra.

6. AI: The FT is used in RoPE (modulation theorem), FNet (global mixing via DFT), RFF (Bochner + spectral sampling), spectral normalization (spectral norm of weight matrices), and FNO (learn in Fourier space).

E.2 Quick Reference: Which Theorem to Use When

| Problem | Theorem/Tool |
|---|---|
| Compute the FT of $e^{-a\lvert t\rvert}$ or $e^{-at}\mathbb{1}_{t\geq0}$ | Direct integration (Table B.1, rows 3-4) |
| Compute the FT of $f(t-a)g(t)$ | Time-shift + modulation properties |
| Show $\int\lvert f\rvert^2 = \int\lvert\hat{f}\rvert^2$ | Plancherel's theorem |
| Find $\int_{-\infty}^\infty\operatorname{sinc}^2(\xi)\,d\xi$ | Parseval applied to rect |
| Show $\hat{f}(\xi)\to 0$ as $\lvert\xi\rvert\to\infty$ | Riemann-Lebesgue lemma |
| Compute the FT of $f'(t)$ | Differentiation property: multiply by $2\pi i\xi$ |
| Compute the FT of $\cos(2\pi\xi_0 t)$ | Modulation theorem or distributions: two delta spikes |
| Verify $f$ recoverable from $\hat{f}$ | Inversion theorem (check $f,\hat{f}\in L^1$ or use $L^2$ theory) |
| Lower bound on signal bandwidth | Uncertainty principle: $\Delta\xi \geq 1/(4\pi\Delta t)$ |
| FT of a periodic signal $\sum c_n e^{2\pi int/T}$ | Distribute over the sum: $\sum c_n\delta(\xi - n/T)$ |
| FT of $f(at)$ (rescaled signal) | Scaling theorem: $(1/\lvert a\rvert)\hat{f}(\xi/a)$ |
| Show kernel $k(\mathbf{x}-\mathbf{y})$ is PD | Bochner: show spectral density $p\geq0$ |

E.3 Sign Convention Reference

Different sources use different sign conventions. This table gives the conversion formulas.

| Convention | Forward FT | Inverse FT | Gaussian pair |
|---|---|---|---|
| $\xi$ (this section) | $\int f(t)e^{-2\pi i\xi t}\,dt$ | $\int\hat{f}(\xi)e^{+2\pi i\xi t}\,d\xi$ | $e^{-\pi t^2}\mapsto e^{-\pi\xi^2}$ |
| $\omega$, no $\frac{1}{2\pi}$ | $\int f(t)e^{-i\omega t}\,dt$ | $\frac{1}{2\pi}\int F(\omega)e^{+i\omega t}\,d\omega$ | $e^{-t^2/2}\mapsto\sqrt{2\pi}\,e^{-\omega^2/2}$ |
| $\omega$, symmetric | $\frac{1}{\sqrt{2\pi}}\int f(t)e^{-i\omega t}\,dt$ | $\frac{1}{\sqrt{2\pi}}\int F(\omega)e^{+i\omega t}\,d\omega$ | $e^{-t^2/2}\mapsto e^{-\omega^2/2}$ |

Conversion: If $F_\xi(\xi)$ is the transform in the $\xi$-convention, then $F_\omega(\omega) = F_\xi(\omega/(2\pi))$ for the non-symmetric $\omega$-convention.

Properties in the $\omega$-convention (no $1/(2\pi)$):

  • Differentiation: $\mathcal{F}_\omega[f'](\omega) = i\omega F(\omega)$ (not $2\pi i\xi$)
  • Time-shift: $\mathcal{F}_\omega[f(t-a)](\omega) = e^{-i\omega a}F(\omega)$ (not $e^{-2\pi i\xi a}$)
  • Parseval: $\int\lvert f\rvert^2 = \frac{1}{2\pi}\int\lvert F\rvert^2$ (extra $1/(2\pi)$ factor!)

The $\xi$-convention is self-symmetric and avoids all these $2\pi$ factors - strongly recommended for all new work.

E.4 Prerequisites Revisited: What You Now Know

This section assumed you knew Fourier series (Section 01) and functional analysis basics. Here is what each prerequisite contributed:

| Prerequisite | How it was used |
|---|---|
| Complex exponentials $e^{i\theta}$ | The basis functions $e^{-2\pi i\xi t}$ of the FT; Euler's formula is used constantly |
| $L^2$ inner product | Plancherel's theorem is the FT version of Parseval from Fourier series |
| Orthonormality of $\{e^{inx}\}$ | Generalized: the FT is the "orthonormal expansion" in the continuous limit |
| Parseval's identity (series) | Generalized to Plancherel's theorem in Section 5 |
| Gibbs phenomenon | Reinterpreted as spectral leakage - the frequency-domain manifestation of truncation |
| Dirichlet conditions | Sufficient conditions for pointwise inversion (Section 4.1) |
| Integration / IBP | Proofs of the differentiation theorem, inversion via approximate identity |
| Convergence of series | $L^2$ convergence in the Plancherel proof; distributional convergence in Section 7 |

E.5 Glossary

| Term | Definition |
|---|---|
| Fourier Transform | $\hat{f}(\xi) = \int f(t)e^{-2\pi i\xi t}\,dt$; maps time-domain functions to frequency-domain functions |
| Spectrum | The function $\hat{f}(\xi)$; describes which frequencies are present and at what amplitude/phase |
| Power spectrum | $\lvert\hat{f}(\xi)\rvert^2$; the distribution of signal energy over frequency |
| Plancherel theorem | $\lVert\hat{f}\rVert_2 = \lVert f\rVert_2$; the FT is a unitary isometry on $L^2$ |
| Parseval's relation | $\int f\bar{g} = \int\hat{f}\bar{\hat{g}}$; the FT preserves inner products |
| Uncertainty principle | $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$; time and frequency spread cannot simultaneously be small |
| Riemann-Lebesgue lemma | $\hat{f}(\xi)\to0$ as $\lvert\xi\rvert\to\infty$ for $f \in L^1$ |
| Dirac delta $\delta(t)$ | Distributional unit impulse; $\int\delta(t)\phi(t)\,dt = \phi(0)$; its FT is the constant $1$ |
| Tempered distribution | Continuous linear functional on Schwartz space; includes all $L^p$ functions, $\delta$, and more |
| Schwartz space $\mathcal{S}$ | Smooth functions decaying faster than any polynomial; the FT maps $\mathcal{S}$ to $\mathcal{S}$ bijectively |
| Dirac comb | $\sum_n\delta(t-nT)$; its FT is another comb with spacing $1/T$ |
| Poisson summation | $\sum_n f(n) = \sum_n\hat{f}(n)$; connects Fourier series and the FT |
| Modulation theorem | $\mathcal{F}[e^{2\pi i\xi_0 t}f](\xi) = \hat{f}(\xi-\xi_0)$; multiplication by a tone = frequency shift |
| Sinc function | $\operatorname{sinc}(\xi) = \sin(\pi\xi)/(\pi\xi)$; FT of the rectangular pulse; ideal low-pass filter kernel |
| Spectral norm | $\lVert W\rVert_2 = \sigma_{\max}(W)$; largest singular value; Lipschitz constant of the linear map $W$ |
| RFF | Random Fourier Features; $k(\mathbf{x}-\mathbf{y})\approx\phi(\mathbf{x})^\top\phi(\mathbf{y})$ via frequencies sampled from the spectral density |
| STFT | Short-Time Fourier Transform; localizes FT analysis to a time window; gives the spectrogram |
| FNet | Transformer variant replacing attention with a 2D DFT; $O(N\log N)$ mixing |
| FNO | Fourier Neural Operator; learns PDE solutions by parameterizing the spectral operator |
| RoPE | Rotary Position Embedding; encodes position as rotation $e^{im\theta}$ in the complex plane |

Appendix F: The Fourier Transform in Practice - Case Studies

F.1 Case Study: Audio Preprocessing for Whisper

OpenAI's Whisper (Radford et al., 2022) is a speech recognition model that accepts raw audio and produces text. Its preprocessing pipeline is a direct application of STFT and spectral analysis.

Step 1: Resampling. Raw audio is resampled to 16 kHz. By Shannon's sampling theorem (a consequence of Poisson summation), a signal sampled at 16 kHz can faithfully represent frequencies up to 8 kHz - which covers the speech band (human voice: 80 Hz-8 kHz, formants: 200 Hz-4 kHz).

Step 2: STFT. Apply a windowed Fourier Transform with:

  • Window: Hann window (smooth tapering reduces spectral leakage)
  • Window length: 25 ms × 16 kHz = 400 samples
  • Hop length: 10 ms × 16 kHz = 160 samples (60% window overlap)
  • FFT size: 512 points (zero-padded for frequency resolution)

This gives a spectrogram of shape $(\text{frames} \times 257)$, where $257 = 512/2 + 1$ unique frequencies.

Step 3: Mel filterbank. Map the 257 linear frequencies to 80 mel-scale frequency bins using triangular filters. The mel scale $m = 2595\log_{10}(1 + f/700)$ is perceptually uniform: equal mel intervals are equally distinguishable to humans. The uncertainty principle is not a limiting factor here because the window (25 ms) is longer than the shortest perceptible phoneme events (~10-20 ms), so a little time resolution is sacrificed for good frequency resolution.

Step 4: Log compression. Take $\log(\max(\text{mel spectrogram}, 10^{-10}))$. The log operation compresses the dynamic range of audio energy (from ~120 dB to ~12 dB).

Result: a $(\text{frames} \times 80)$ log-mel spectrogram, treated as a 2D "image" and fed to a CNN + Transformer encoder. The entire preprocessing is fixed (not learned) - it embeds centuries of signal processing knowledge into the model architecture.
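A simplified log-mel pipeline following steps 2-4 (a sketch assuming numpy/scipy; an illustrative stand-in with a crude filterbank, not Whisper's exact implementation):

```python
import numpy as np
from scipy.signal import stft, get_window

sr = 16_000
audio = np.random.randn(sr)                       # 1 s of placeholder audio

# Step 2: STFT with a 25 ms Hann window, 10 ms hop, 512-point FFT
f, t, Z = stft(audio, fs=sr, window=get_window("hann", 400),
               nperseg=400, noverlap=400 - 160, nfft=512)
power = np.abs(Z) ** 2                            # (257, frames) power spectrogram

# Step 3: crude triangular mel filterbank (80 bands)
def mel(f_hz):  return 2595 * np.log10(1 + f_hz / 700)
def inv_mel(m): return 700 * (10 ** (m / 2595) - 1)
edges = inv_mel(np.linspace(mel(0), mel(sr / 2), 82))
fb = np.zeros((80, len(f)))
for i in range(80):
    lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
    fb[i] = np.clip(np.minimum((f - lo) / (mid - lo), (hi - f) / (hi - mid)), 0, None)

# Step 4: log compression
log_mel = np.log(np.maximum(fb @ power, 1e-10))   # (80, frames)
print(log_mel.shape)
```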

Fourier analysis used:

  • Fourier Transform -> STFT window analysis
  • Uncertainty principle -> window length tradeoff
  • Sampling theorem -> 16kHz minimum rate
  • Spectral leakage -> Hann window choice

F.2 Case Study: LLaMA-3 Context Extension via RoPE Scaling

LLaMA-3 uses RoPE with base $\theta_{\text{base}} = 500{,}000$ and supports up to 128K tokens of context (compared to LLaMA-2's 4K with $\theta_{\text{base}} = 10{,}000$).

The problem: Standard RoPE with $\theta_j = 10000^{-2j/d}$ has frequencies ranging from $\theta_0 = 1$ (dimension pair 0) down to $\theta_{d/2-1} \approx 10^{-4}$ (last dimension pair). At frequency $\theta_j$, the RoPE encoding completes one full rotation every $2\pi/\theta_j$ tokens. The maximum distinguishable context is the longest period: $T_{\max} = 2\pi/\theta_{\min} \approx 2\pi\times 10^4 \approx 62{,}832$ tokens. Beyond this, even the slowest dimension pair has wrapped around and absolute position becomes ambiguous; in practice LLaMA-2 degrades much earlier, beyond the ~4K length it was trained on.

The Fourier interpretation: Each dimension pair is an oscillator at frequency $\theta_j$. For two positions $m$ and $n$ to be distinguishable, the phase difference $(m-n)\theta_j$ must be $\neq 0 \bmod 2\pi$ for at least one $j$. If $\lvert m-n\rvert$ is larger than $2\pi/\theta_{\min}$, all dimensions have wrapped around - the positions become indistinguishable. This is aliasing in the position domain, caused by the minimum frequency $\theta_{\min}$ being too high.

The fix: Reduce the minimum frequency by increasing $\theta_{\text{base}}$:

$$\theta_j = \theta_{\text{base}}^{-2j/d}, \quad \theta_{\min} \approx \theta_{\text{base}}^{-1}, \quad T_{\max} \approx 2\pi\,\theta_{\text{base}}$$

LLaMA-3 uses $\theta_{\text{base}} = 500{,}000$, giving $T_{\max} \approx 2\pi\times 500{,}000 \approx 3.14\times 10^6$ - more than enough for 128K context.
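A small sketch (assuming numpy; head dimension $d = 128$ is an illustrative choice) computing the per-pair rotation periods for both bases:

```python
import numpy as np

def rope_periods(base, d=128):
    """Periods (in tokens) of each RoPE dimension pair: 2*pi / theta_j."""
    j = np.arange(d // 2)
    theta = base ** (-2 * j / d)        # per-pair rotation frequencies
    return 2 * np.pi / theta

for base in (10_000, 500_000):
    periods = rope_periods(base)
    print(base, int(periods.max()))     # longest period grows ~ 2*pi*base
```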

The cost: A higher $\theta_{\text{base}}$ means the low-frequency dimensions (large $j$) oscillate very slowly, providing little positional signal over short contexts. There is a fundamental tradeoff between context length and short-range positional precision - a manifestation of the uncertainty principle in the discrete position domain.

YaRN (Peng et al., 2023): Instead of uniformly rescaling all frequencies, YaRN uses a non-uniform rescaling: high-frequency dimensions (small $j$) are scaled minimally (they handle short-range position fine), while low-frequency dimensions (large $j$) are scaled aggressively. The interpolation factor $s = \text{target context} / \text{original context}$ is applied per frequency, weighted by the needed extension. This maintains short-range precision while extending long-range capacity - a principled application of the time-frequency tradeoff.

F.3 Case Study: Spectral Normalization in Stable Diffusion

Modern diffusion models (Stable Diffusion, DALL-E 3, Flux) train a discriminator or guidance network to distinguish real from generated data. Spectral normalization (Section 8.3) is used to stabilize this training.

Setup: The discriminator $D: \mathbb{R}^{H\times W\times 3} \to [0,1]$ has $L$ convolutional layers, each with weight $W^{[l]} \in \mathbb{R}^{C_\text{out}\times C_\text{in}\times k\times k}$. For a 2D convolution layer, the relevant spectral norm is:

$$\lVert W^{[l]}\rVert_2 = \max_{\mathbf{h}:\lVert\mathbf{h}\rVert=1}\lVert W^{[l]}*\mathbf{h}\rVert$$

For a convolution with kernel $h(k_x, k_y)$, the spectral norm equals $\max_{\xi_x,\xi_y}\lvert\hat{h}(\xi_x,\xi_y)\rvert$ - the peak magnitude of the kernel's 2D Fourier Transform. This connects spectral normalization directly to the FT: normalizing by the spectral norm makes the filter's frequency response bounded by 1 at every frequency.
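A minimal numerical confirmation of this fact for a single-channel circular convolution (a sketch assuming numpy; the $3\times3$ kernel and $32\times32$ grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32
h = rng.normal(size=(3, 3))                 # single-channel 3x3 conv kernel

# Frequency response of the (zero-padded) kernel on the H x W grid
response = np.fft.fft2(h, s=(H, W))
sigma_fft = np.abs(response).max()          # max |h_hat(xi_x, xi_y)|

# Cross-check: materialize the circular-convolution matrix and take its SVD
M = np.zeros((H * W, H * W))
for i in range(H * W):
    e = np.zeros(H * W); e[i] = 1.0         # apply the conv to each basis vector
    col = np.real(np.fft.ifft2(np.fft.fft2(e.reshape(H, W)) * response))
    M[:, i] = col.ravel()
print(sigma_fft, np.linalg.svd(M, compute_uv=False)[0])   # the two agree
```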

Effect on generated images: A spectrally-normalized discriminator applies Lipschitz constraints that prevent it from exploiting arbitrary high-frequency artifacts in generated images. This forces the generator to eliminate high-frequency artifacts (checkerboard patterns from transposed convolutions, aliasing from bilinear upsampling) - improving sample quality.

Alias-free GAN / StyleGAN3 (Karras et al., 2021): Takes spectral control further by enforcing alias-free processing throughout the generator using carefully designed low-pass filters at every upsampling step. The key insight from Fourier analysis: any nonlinearity applied to a bandlimited signal creates harmonics beyond the original bandwidth - these must be filtered out before downsampling to avoid aliasing. StyleGAN3's architecture enforces this principle throughout, producing significantly more temporally consistent video outputs.

