
Fourier Transform: Part 8: Applications in Machine Learning to Appendix A: Extended Properties and Derivations

8. Applications in Machine Learning

8.1 FNet: Replacing Self-Attention with Fourier Transforms

Paper: Lee-Thorp, Ainslie, Eckstein, Ontanon (2022). "FNet: Mixing Tokens with Fourier Transforms." NAACL.

The idea: In a standard Transformer, the self-attention sublayer computes:

$$\text{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

which is $O(N^2 d)$ in sequence length $N$. FNet replaces this with a 2D DFT:

$$\text{FNet-mixing}(X) = \operatorname{Re}\!\left(\mathcal{F}_{\text{seq}}\!\left[\mathcal{F}_{\text{model}}[X]\right]\right)$$

where $\mathcal{F}_{\text{seq}}$ applies the FFT along the sequence dimension and $\mathcal{F}_{\text{model}}$ along the embedding dimension. This is $O(N\log N \cdot d)$ - much faster.

Why does it work? The Fourier Transform is a global linear mixing operation: every output position depends on every input position (via the sum $\hat{x}_k = \sum_n x_n e^{-2\pi ink/N}$). This is similar to attention's global mixing, but unparameterized - the mixing weights are fixed Fourier basis functions rather than learned attention scores.

Results: FNet achieves 92-97% of BERT's performance on GLUE benchmarks. For tasks requiring fine-grained token interaction (e.g., extractive QA), the gap is larger; for tasks where global context suffices (e.g., sentence classification), FNet nearly matches BERT. Training is 7× faster on GPUs and 2× faster on TPUs.

Mathematical insight: The DFT matrix $F$ is symmetric ($F = F^\top$), and its normalized version $F/\sqrt{N}$ is unitary: $(F/\sqrt{N})(F/\sqrt{N})^\dagger = I$. The DFT is thus a unitary change of basis, just like the continuous FT by Plancherel. The real part operation ensures the output is real-valued.

Code sketch:

import numpy as np

def fnet_mixing(X):
    """
    X: (batch, seq_len, d_model) array
    Returns: real-valued mixing output of same shape
    """
    X_fft = np.fft.fft(np.fft.fft(X, axis=1), axis=2)
    return np.real(X_fft)
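
As a quick numerical check of the mathematical insight above, the following sketch (sizes are illustrative) verifies that the normalized DFT matrix is symmetric and unitary:

import numpy as np

N = 64
n = np.arange(N)
# Normalized DFT matrix with entries exp(-2*pi*i*j*k/N) / sqrt(N).
F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

assert np.allclose(F, F.T)                     # symmetric: F = F^T
assert np.allclose(F @ F.conj().T, np.eye(N))  # unitary: F F^dagger = I (Plancherel)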

8.2 Random Fourier Features

Paper: Rahimi & Recht (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS (Test of Time Award, 2017).

The setup: Kernel methods (SVMs, Gaussian Processes) require computing $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ for all pairs of $N$ training points - an $O(N^2)$ matrix that costs $O(N^3)$ to invert. For $N > 10^5$, this is prohibitive.

Bochner's theorem: A continuous, shift-invariant kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ is positive definite if and only if it is the Fourier Transform of a non-negative measure $p(\boldsymbol{\omega})$:

$$k(\mathbf{x} - \mathbf{y}) = \int p(\boldsymbol{\omega})\,e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega} \sim p}\left[e^{i\boldsymbol{\omega}^\top\mathbf{x}}\,\overline{e^{i\boldsymbol{\omega}^\top\mathbf{y}}}\right]$$

(Full treatment of Bochner's theorem in Section 12-03 Kernel Methods.)

The approximation: Sample $D$ frequencies $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega})$ and define the random feature map:

$$\phi(\mathbf{x}) = \sqrt{\frac{2}{D}}\left[\cos(\boldsymbol{\omega}_1^\top\mathbf{x} + b_1), \ldots, \cos(\boldsymbol{\omega}_D^\top\mathbf{x} + b_D)\right] \in \mathbb{R}^D$$

where $b_j \sim \mathcal{U}[0, 2\pi]$ are random phase offsets.

Guarantee: By the law of large numbers:

$$k(\mathbf{x}, \mathbf{y}) \approx \phi(\mathbf{x})^\top\phi(\mathbf{y}), \qquad \mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x},\mathbf{y})$$

with concentration: $P\!\left(|\phi(\mathbf{x})^\top\phi(\mathbf{y}) - k(\mathbf{x},\mathbf{y})| > \epsilon\right) \leq 2\exp(-D\epsilon^2/4)$.

For popular kernels:

  • RBF kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2/(2\sigma^2)}$: $p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \sigma^{-2}I)$
  • Laplace kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert_1/\sigma}$: $p$ is a Cauchy distribution
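
Code sketch - a minimal RFF construction for the RBF kernel (all names and sizes here are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, D, sigma=1.0):
    """
    Random Fourier features for k(x,y) = exp(-||x-y||^2 / (2 sigma^2)).
    X: (n, d) data matrix. Returns (n, D) features with phi(x)^T phi(y) ~ k(x,y).
    """
    n, d = X.shape
    # Bochner: the RBF spectral density is N(0, sigma^{-2} I).
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Compare against the exact kernel on random points.
X = rng.normal(size=(100, 2))
phi = rff_map(X, D=2000)
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_exact - phi @ phi.T).max())  # small, shrinking as D grows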

Impact: Random Fourier Features reduce kernel SVM training from $O(N^3)$ to $O(ND + D^3)$, making kernel methods scalable. The same idea appears in Performer (Choromanski et al., 2021) - an efficient attention mechanism that approximates the attention kernel with random features, reducing attention to $O(N)$.

8.3 Spectral Normalization

Paper: Miyato, Kataoka, Koyama, Yoshida (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR.

The problem: Training GANs is notoriously unstable because the discriminator can overfit, causing gradient vanishing for the generator. A Lipschitz constraint on the discriminator stabilizes training.

The solution: Normalize each weight matrix $W$ by its spectral norm $\lVert W \rVert_2 = \sigma_{\max}(W)$ (the largest singular value):

$$\tilde{W} = \frac{W}{\sigma_{\max}(W)}$$

This makes every linear layer 1-Lipschitz: $\lVert\tilde{W}\mathbf{x} - \tilde{W}\mathbf{y}\rVert \leq \lVert\mathbf{x} - \mathbf{y}\rVert$. The composition of 1-Lipschitz layers is 1-Lipschitz - so the entire discriminator is 1-Lipschitz.

Computing $\sigma_{\max}$: Power iteration gives an efficient approximation. Starting from a random unit vector $\tilde{\mathbf{v}}$:

  1. $\tilde{\mathbf{u}} = W\tilde{\mathbf{v}} / \lVert W\tilde{\mathbf{v}} \rVert$
  2. $\tilde{\mathbf{v}} = W^\top\tilde{\mathbf{u}} / \lVert W^\top\tilde{\mathbf{u}} \rVert$
  3. $\hat{\sigma} = \tilde{\mathbf{u}}^\top W\tilde{\mathbf{v}}$

One iteration per training step suffices empirically, because $W$ changes only slightly between steps and the persisted $\tilde{\mathbf{u}}, \tilde{\mathbf{v}}$ stay close to the top singular vectors. A code sketch follows.
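
Code sketch - power iteration for $\sigma_{\max}$ (a minimal numpy version, not the exact reference implementation; in practice $\tilde{\mathbf{u}}, \tilde{\mathbf{v}}$ are carried over between training steps):

import numpy as np

def spectral_norm_estimate(W, v, n_iters=1):
    """One or more power-iteration steps on W; returns (sigma_hat, u, v)."""
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return u @ (W @ v), u, v  # Rayleigh-quotient estimate of sigma_max

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
v = rng.normal(size=128)
sigma_hat, _, _ = spectral_norm_estimate(W, v, n_iters=50)
print(sigma_hat, np.linalg.svd(W, compute_uv=False)[0])  # nearly equal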

Why "spectral"? The spectral norm W2\lVert W \rVert_2 is the largest eigenvalue of WW\sqrt{W^\top W} - the spectrum of the operator WW viewed as a linear map. This connects to Fourier analysis: for a convolution layer with kernel hh, the spectral norm is h^=maxξh^(ξ)\lVert\hat{h}\rVert_\infty = \max_\xi|\hat{h}(\xi)| - the supremum of the Fourier Transform of the filter.

For AI: Spectral normalization is now standard in GAN training. It also appears in attention normalization - normalizing the key/value projection matrices to control the Lipschitz constant of the attention map, which is critical for stable training of large models.

8.4 Fourier Neural Operator - Preview

The Fourier Neural Operator (FNO) (Li et al., 2021) learns a mapping between function spaces (e.g., PDE initial conditions -> solutions) by:

  1. Lift the input $\mathbf{v}(t)$ to a higher-dimensional representation $\mathbf{v}_0 = P(\mathbf{v})$
  2. Apply $L$ Fourier integral operator layers: $\mathbf{v}_{l+1}(t) = \sigma\big(W\mathbf{v}_l(t) + \mathcal{F}^{-1}[R_\phi \cdot \mathcal{F}[\mathbf{v}_l]](t)\big)$
  3. Project to the output: $u = Q(\mathbf{v}_L)$

The key operation $R_\phi$ is a learnable complex-valued tensor applied pointwise in Fourier space - a global convolution with a learned filter. Only the lowest $k_{\max}$ Fourier modes are kept (Plancherel guarantees the truncation error equals the discarded energy).
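
Code sketch - the spectral part of one such layer in 1D, simplified to a per-channel filter rather than the full channel-mixing tensor of the paper (names and sizes illustrative):

import numpy as np

def spectral_conv_1d(v, R, k_max):
    """
    Global convolution in Fourier space: keep the lowest k_max modes,
    multiply pointwise by a learned complex filter R, transform back.
    v: (n_grid, channels) real field; R: (k_max, channels) complex weights.
    """
    v_hat = np.fft.rfft(v, axis=0)        # (n_grid//2 + 1, channels)
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max] = R * v_hat[:k_max]   # truncate + pointwise multiply
    return np.fft.irfft(out_hat, n=v.shape[0], axis=0)

rng = np.random.default_rng(0)
v = rng.normal(size=(128, 4))
R = rng.normal(size=(16, 4)) + 1j * rng.normal(size=(16, 4))
print(spectral_conv_1d(v, R, k_max=16).shape)  # (128, 4)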

FNO achieves up to 1000× speedups over classical PDE solvers for problems like weather prediction and turbulence simulation.

Full treatment in Section 20-03 DFT and FFT - the discretized version of FNO and its implementation via FFT is covered there.

8.5 Bochner's Theorem and Shift-Invariant Kernels - Preview

Bochner's theorem (1932) characterizes all positive-definite, shift-invariant kernels as Fourier Transforms of non-negative measures:

$$k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^d} e^{i\boldsymbol{\omega}^\top(\mathbf{x}-\mathbf{y})}\,dp(\boldsymbol{\omega})$$

This is a profound connection between Fourier analysis and kernel methods. The kernel's shape in the spatial domain determines (and is determined by) its spectral density pp.

Full treatment in Section 12-03 Kernel Methods - the RKHS theory, Mercer's theorem, and applications to SVMs, Gaussian Processes, and the Neural Tangent Kernel are covered there. Here we note that Bochner's theorem is the mathematical foundation for Random Fourier Features (Section 8.2).

8.6 RoPE from the Continuous FT Perspective

Rotary Position Embedding (RoPE) (Su et al., 2021) is now used in LLaMA-3, Mistral, Qwen, GPT-NeoX, and virtually every frontier LLM. Its mathematical foundation is the Fourier Transform on the circle group.

Setup. In attention, the score between query $\mathbf{q}$ at position $m$ and key $\mathbf{k}$ at position $n$ should depend only on the relative position $m - n$. This is a translation invariance requirement - exactly the condition Bochner's theorem characterizes.

Construction. Pair up embedding dimensions: $(q_{2j}, q_{2j+1})$ and $(k_{2j}, k_{2j+1})$ for $j = 0, \ldots, d/2 - 1$. Treat each pair as a complex number $q_j^{(c)} = q_{2j} + iq_{2j+1}$. Apply a rotation by angle $m\theta_j$:

$$f_q(\mathbf{q}, m)_j = q_j^{(c)} \cdot e^{im\theta_j}, \qquad \theta_j = 10000^{-2j/d_{\text{model}}}$$

The attention score becomes:

$$\langle f_q(\mathbf{q}, m), f_k(\mathbf{k}, n)\rangle = \operatorname{Re}\!\left[\sum_j q_j^{(c)}\overline{k_j^{(c)}} \cdot e^{i(m-n)\theta_j}\right]$$

which depends only on $m - n$ - the relative position. This is the Fourier modulation theorem at work: multiplying by $e^{im\theta_j}$ is a frequency shift by $m$ in the "position domain."

The frequencies $\theta_j$: The geometric spacing $\theta_j = 10000^{-2j/d}$ means each pair encodes a different "octave" of positional information. Low-index dimensions ($j$ small) use high frequencies $\theta_j \approx 1$ - sensitive to local position differences. High-index dimensions use low frequencies $\theta_j \approx 10000^{-1}$ - sensitive to global structure over thousands of positions. This is a multi-resolution decomposition - the Fourier Transform at multiple frequency scales.
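
Code sketch - RoPE in the complex-pair view, with a check that scores depend only on relative position (a minimal illustration, not a production implementation):

import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate embedding pairs: x is (seq_len, d) with d even."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)          # frequencies theta_j
    xc = x[:, 0::2] + 1j * x[:, 1::2]                  # pairs as complex numbers
    xc = xc * np.exp(1j * positions[:, None] * theta)  # rotate pair j by m*theta_j
    out = np.empty_like(x)
    out[:, 0::2], out[:, 1::2] = xc.real, xc.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
m = np.arange(8)
# Shifting all positions by a common offset leaves the score matrix unchanged,
# because it depends only on m - n.
s1 = rope(q, m) @ rope(k, m).T
s2 = rope(q, m + 5) @ rope(k, m + 5).T
assert np.allclose(s1, s2)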

Why the $10000$ base? The maximum useful context is $\sim 10000$ tokens at the original RoPE scale (each frequency completes at most one full rotation). For LLaMA-3 with 128K context, the effective base must be much larger - which is why LLaMA-3.1 uses $\theta_{\text{base}} = 500000$ instead of $10000$.

ROPE: FOURIER TRANSFORM IN ATTENTION
========================================================================

  f_q(q, m) = q · e^{imθ}   <- modulation by Fourier basis e^{imθ}

  ⟨f_q(q,m), f_k(k,n)⟩ = Re[q·k̄ · e^{i(m-n)θ}]
                            ^ depends only on relative position m-n

  θ_j = 10000^{-2j/d}:  j=0 -> θ_j ≈ 1 (fast/local),  j=d/2 -> θ_j ≈ 1/10000 (slow/global)

  Context extension:
    Original RoPE: base=10000, max context ~10K tokens
    LLaMA-3.1:     base=500K,  max context 128K tokens
    LongRoPE:      non-uniform rescaling up to 2M context

========================================================================

9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---------|----------------|-----|
| 1 | Confusing the $\omega$ and $\xi$ conventions and getting $2\pi$ factors wrong | The $\omega$-convention carries a $1/(2\pi)$ in the inverse transform; the $\xi$-convention does not. Mixing them adds spurious factors | Always identify the convention on first use; for this curriculum use the $\xi$-convention $\hat{f}(\xi)=\int f(t)e^{-2\pi i\xi t}\,dt$ |
| 2 | Assuming $\hat{f} \in L^1$ whenever $f \in L^1$ | $f=\operatorname{rect}$ has $\hat{f}=\operatorname{sinc} \notin L^1$. The Riemann-Lebesgue lemma only guarantees $\hat{f} \in C_0$, not $L^1$ | Check whether $f$ is smooth enough (e.g., $f \in L^1$ and Lipschitz) for $\hat{f} \in L^1$ to hold |
| 3 | Writing $\mathcal{F}[\delta](\xi) = 0$ | $\delta$ has a flat spectrum: $\mathcal{F}[\delta](\xi)=1$ for all $\xi$. A "pointlike" impulse contains all frequencies equally | Compute the distributional FT: $\langle\hat{\delta},\phi\rangle=\langle\delta,\hat{\phi}\rangle=\hat{\phi}(0)=\int\phi\,dt$ |
| 4 | Thinking the uncertainty principle only limits our instruments | $\Delta t\cdot\Delta\xi\geq 1/(4\pi)$ is a theorem about functions, not about measurement devices. It holds for any $f\in L^2$, regardless of how it is measured | Accept the bound as a mathematical fact; designing around it requires fundamentally different time-frequency representations (wavelets, Section 05) |
| 5 | Applying the differentiation property to non-differentiable $f$ | $\mathcal{F}[f'](\xi)=2\pi i\xi\hat{f}(\xi)$ requires $f'\in L^1$. For $f=\operatorname{rect}$, $f'=\delta(t+1/2)-\delta(t-1/2)$ in the distributional sense | Work in the distributional setting when $f$ has jumps; the formula still holds but $f'$ is a distribution |
| 6 | Claiming Parseval says $\int \lvert f\rvert^2 = \int \lvert\hat{f}\rvert^2$ in every convention | The identity is exact only in the $\xi$-convention; in the $\omega$-convention it reads $\int\lvert f\rvert^2\,dt = \frac{1}{2\pi}\int\lvert\hat{f}(\omega)\rvert^2\,d\omega$ | State the convention before invoking Parseval; in this curriculum's $\xi$-convention, $\lVert f\rVert_2 = \lVert\hat{f}\rVert_2$ holds exactly |
| 7 | Mixing up the FT of a product vs. convolution | $\mathcal{F}[f\cdot g]=\hat{f}*\hat{g}$ (convolution in frequency), not $\hat{f}\cdot\hat{g}$. The multiplicative dual of the convolution theorem is often reversed | Memorize the duality table: convolution in time $\leftrightarrow$ product in frequency; product in time $\leftrightarrow$ convolution in frequency |
| 8 | Forgetting the $1/\lvert a\rvert$ in the scaling theorem | $\mathcal{F}[f(at)](\xi)=\frac{1}{\lvert a\rvert}\hat{f}(\xi/a)$; dropping the $1/\lvert a\rvert$ breaks the area identity $\hat{f}(0)=\int f\,dt$ | Sanity-check the $\xi=0$ value (the total integral) after any rescaling to catch the missing factor |
| 9 | Treating negative frequencies as "unphysical" | For real $f$, negative frequencies are redundant (Hermitian symmetry) but are not wrong. For complex $f$ (e.g., analytic signals), negative and positive frequencies are distinct and carry different information | Embrace negative frequencies; they simplify formulas and are essential for complex signals |
| 10 | Applying $\mathcal{F}^{-1}\circ\mathcal{F}=I$ pointwise instead of a.e. | Inversion holds almost everywhere (for $L^1$ functions with sufficient regularity). At a jump discontinuity, $\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi=\frac{1}{2}[f(t^+)+f(t^-)]$ - the average of the left and right limits | Account for the $1/2$-average behavior at discontinuities (same as Dirichlet's theorem in Section 01) |
| 11 | Using $\operatorname{sinc}(x) = \sin(x)/x$ and $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ interchangeably | There are two conventions for sinc. The normalized $\operatorname{sinc}(x)=\sin(\pi x)/(\pi x)$ satisfies $\mathcal{F}[\operatorname{rect}]=\operatorname{sinc}$ in the $\xi$-convention; the unnormalized $\sin(x)/x$ does not | In this curriculum: $\operatorname{sinc}(x) = \sin(\pi x)/(\pi x)$ (normalized), consistent with the $\xi$-convention FT |
| 12 | Thinking FNet replaces attention entirely | FNet is a weaker mixer than attention: it cannot learn data-dependent weights. It works well for classification but poorly for tasks requiring fine-grained token interactions (QA, coreference) | Use FNet as a fast baseline; switch to attention for tasks requiring selective, query-dependent mixing |

10. Exercises

Exercise 1 * - Computing Fourier Transforms from Definition

(a) Compute $\mathcal{F}[f](\xi)$ for $f(t) = e^{-a|t|}$ with $a > 0$. Show all steps of the integration. What type of function is $\hat{f}$?

(b) Compute $\mathcal{F}[\operatorname{rect}(t/T)](\xi)$ for a pulse of width $T > 0$. Express the result using the normalized sinc function.

(c) Verify the $L^1$ bound: show $|\hat{f}(\xi)| \leq \lVert f\rVert_1$ for your answer in (a).

(d) Confirm the Riemann-Lebesgue lemma numerically for $f(t) = e^{-t^2}$: plot $|\hat{f}(\xi)|$ and verify it decays to 0.


Exercise 2 * - Applying Properties

(a) Given $\hat{f}(\xi) = e^{-\pi\xi^2}$, use the time-shift property to write the Fourier Transform of $g(t) = f(t - 3)$.

(b) Given $\hat{h}(\xi) = \operatorname{sinc}(\xi)$, find the Fourier Transform of $h(t - 2)\cos(4\pi t)$. (Use shift + modulation.)

(c) If $f(t)$ has $\hat{f}(\xi) = g(\xi)$, what is the FT of $f(3t + 1)$? (Use both shift and scaling.)

(d) Prove: if $f$ is real and even, then $\hat{f}$ is real and even.


Exercise 3 * - Parseval's Relation Numerically

(a) For $f(t) = e^{-\pi t^2}$, compute $\lVert f\rVert_2^2 = \int|f|^2\,dt$ analytically.

(b) Compute $\lVert\hat{f}\rVert_2^2$ analytically and verify $\lVert f\rVert_2^2 = \lVert\hat{f}\rVert_2^2$.

(c) Numerically verify Parseval using the FFT: discretize $f$ on $[-5, 5]$ with $N = 1024$ points, apply the FFT, and compare $\sum_n|f_n|^2$ with $\frac{1}{N}\sum_k|\hat{f}_k|^2$ (the discrete Parseval identity in NumPy's FFT convention).

(d) Use Parseval to evaluate $\int_{-\infty}^\infty \frac{1}{(1 + 4\pi^2 t^2)^2}\,dt$.


Exercise 4 ** - The Inversion Theorem

(a) For $f(t) = \operatorname{rect}(t)$ (whose transform $\operatorname{sinc} \notin L^1$), show that the principal value integral $\lim_{R\to\infty}\int_{-R}^{R}\operatorname{sinc}(\xi)e^{2\pi i\xi t}\,d\xi$ converges to $\operatorname{rect}(t)$ for $|t| \neq 1/2$ by evaluating the integral explicitly.

(b) Show that at $t = 1/2$ (the jump discontinuity), the principal value integral converges to $1/2 = \frac{1}{2}[\operatorname{rect}(1/2^-) + \operatorname{rect}(1/2^+)]$.

(c) This mirrors Dirichlet's theorem from Section 20-01. Identify the precise analog: what plays the role of the Dirichlet kernel in the Fourier integral case?


Exercise 5 ** - Uncertainty Principle

(a) For $f(t) = e^{-\pi t^2/\sigma^2}$ (normalized so that $\lVert f\rVert_2 = 1$), compute $\Delta t$ (RMS time spread) and $\Delta\xi$ (RMS frequency spread). Verify $\Delta t\cdot\Delta\xi = 1/(4\pi)$.

(b) For $f(t) = \operatorname{rect}(t)$, compute $\Delta t$ and $\Delta\xi$ numerically. Is the uncertainty bound satisfied? Is it tight?

(c) A signal designer wants to transmit a pulse with duration $\Delta t \leq 1\,\mu\text{s}$ using bandwidth $\Delta\xi \leq 100\,\text{kHz}$. Is this possible? What does the uncertainty principle say?

(d) Show that any signal of the form $f(t) = Ce^{-\alpha t^2}e^{2\pi i\xi_0 t}$ achieves equality in the uncertainty principle.


Exercise 6 ** - Tempered Distributions

(a) Verify $\mathcal{F}[\delta(t - a)](\xi) = e^{-2\pi i\xi a}$ using the distributional definition $\langle\hat{T},\phi\rangle = \langle T,\hat{\phi}\rangle$.

(b) Compute $\mathcal{F}[\delta'(t)](\xi)$ (the FT of the derivative of the delta).

(c) Use the result from (a) to verify: $\mathcal{F}[\cos(2\pi\xi_0 t)](\xi) = \frac{1}{2}[\delta(\xi - \xi_0) + \delta(\xi + \xi_0)]$.

(d) Verify the Poisson summation formula numerically for $f(t) = e^{-\pi t^2}$: compare $\sum_{n=-N}^{N} f(n)$ with $\sum_{n=-N}^{N} \hat{f}(n)$ for $N = 5, 10, 20$.


Exercise 7 *** - Random Fourier Features

(a) For the RBF kernel $k(\mathbf{x},\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2}$ (setting $\sigma = 1$), the spectral density is $p(\boldsymbol{\omega}) = (4\pi)^{-d/2}e^{-\lVert\boldsymbol{\omega}\rVert^2/4}$. Draw $D = 500$ frequencies $\boldsymbol{\omega}_j \sim \mathcal{N}(\mathbf{0}, 2I)$ for $d = 2$ and construct the random feature map $\phi(\mathbf{x}) \in \mathbb{R}^{2D}$ (using cos and sin).

(b) Generate 100 random pairs $(\mathbf{x}_i, \mathbf{y}_i)$ with $\mathbf{x}_i, \mathbf{y}_i \sim \mathcal{N}(\mathbf{0}, I)$. For each pair, compute the exact kernel value $k(\mathbf{x}_i, \mathbf{y}_i)$ and the RFF approximation $\phi(\mathbf{x}_i)^\top\phi(\mathbf{y}_i)$. Plot the approximation vs. the exact values and report the RMSE.

(c) Study the approximation quality as a function of $D \in \{10, 50, 100, 500, 1000\}$. How does the RMSE scale with $D$? Is it consistent with the theoretical bound $\mathbb{E}[\text{error}^2] = O(D^{-1})$?

(d) Takeaway: explain in 2 sentences why RFF matters for LLM-scale systems where exact kernel computation would be infeasible.


Exercise 8 *** - FNet vs Attention Experiment

(a) Implement a minimal FNet mixing layer: given $X \in \mathbb{R}^{N \times d}$, apply the 2D FFT and take the real part. Implement a minimal attention mixing layer: $\text{Attn}(X) = \operatorname{softmax}(XX^\top/\sqrt{d})X$.

(b) For a synthetic task - "copy the token from position $k$ to the output" (a task requiring precise position tracking) - construct training pairs $(X, y)$ and train both mixers with a linear head on top. Compare accuracy and training speed.

(c) For a synthetic task - "output the mean of all input tokens" (a global aggregation task) - repeat the comparison. Which mixer is better suited? Why?

(d) Takeaway: connect your results to the mathematical properties of the FT (fixed unparameterized mixing vs. learned data-dependent mixing) and explain which types of tasks each mixer is appropriate for.

11. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact |
|---------|-----------|
| Fourier Transform as unitary operator (Plancherel) | FNet (Lee-Thorp et al., 2022) uses the FT as a token mixer: same information content, 7× faster than attention on GPU. The unitarity of the FT ensures no information loss during mixing. |
| Convolution-multiplication duality | The theoretical basis for $O(N\log N)$ convolution via the FFT, and the reason CNNs can be trained efficiently. FlashConv (Fu et al., 2023) and Hyena (Poli et al., 2023) extend this to long-range sequence models. |
| Modulation theorem (time-shift $\leftrightarrow$ frequency-shift) | Foundation of RoPE (Su et al., 2021): multiplying embeddings by $e^{im\theta}$ shifts the position representation. Now standard in LLaMA-3, Mistral, Qwen, GPT-NeoX. |
| Uncertainty principle $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$ | Fundamental constraint on context-length extension. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) must lower the frequencies $\theta_j$ to extend context; the uncertainty principle dictates the minimum frequency needed. |
| Bochner's theorem + FT of spectral density | Random Fourier Features (Rahimi & Recht, 2007) reduce kernel computation from $O(N^2)$ to $O(ND)$. Performer (Choromanski et al., 2021) uses the same idea to get $O(N)$ attention. |
| Spectral norm $\lVert W\rVert_2 = \sigma_{\max}$ | Spectral normalization (Miyato et al., 2018): dividing discriminator weights by their spectral norm enforces Lipschitz constraints. Standard in GAN training and used in diffusion model discriminators. |
| FT as basis for function-space operators | Fourier Neural Operator (Li et al., 2021) solves PDEs up to 1000× faster by learning in Fourier space. Used in climate modeling, fluid simulation, and materials science. |
| Differentiation in the frequency domain ($f' \leftrightarrow 2\pi i\xi\hat{f}$) | Heat and wave equation solutions in closed form via the FT. The heat kernel is a Gaussian, which connects to why diffusion models add Gaussian noise: it has maximal spectral support. |
| Gaussian as optimal time-frequency window | Gaussian attention (Tsai et al., 2019): replacing dot-product attention with a Gaussian kernel gives attention an explicit time-frequency interpretation. |
| Dirac comb + Poisson summation | Shannon sampling theorem: ensures that Whisper's mel-spectrogram preprocessing (16 kHz sampling, 80-channel mel filterbank) captures all speech frequencies (up to 8 kHz) without aliasing. |

12. Conceptual Bridge

Looking Back: From Fourier Series to Fourier Transform

The Fourier series (Section 20-01) established that periodic functions decompose into discrete harmonics. The key insight was geometric: the trigonometric functions $\{e^{inx}\}$ form an orthonormal basis of $L^2[-\pi,\pi]$, and the Fourier coefficients are the projections of $f$ onto these basis vectors.

The Fourier Transform carries this insight to its natural conclusion. Letting the period $T \to \infty$, the discrete harmonic sum $\sum_n c_n e^{2\pi int/T}$ becomes the continuous integral $\int\hat{f}(\xi)e^{2\pi i\xi t}\,d\xi$, and the discrete spectrum $\{c_n\}$ becomes the continuous spectrum $\hat{f}(\xi)$. Everything from the Fourier series has an analog:

  • Orthonormality -> Plancherel's theorem (Section 5)
  • Parseval's identity -> Parseval's relation (Section 5.2)
  • Dirichlet's theorem on convergence -> Fourier inversion theorem (Section 4)
  • Gibbs phenomenon -> spectral leakage in the DFT (Section 03)
  • Smoothness <-> spectral decay -> differentiation property (Section 3.5)

The distribution theory in Section 7 reveals the deepest unity: the Fourier series is just the Fourier Transform restricted to periodic distributions. The Dirac comb with spacing $T$ transforms to a Dirac comb with spacing $1/T$ - the very relationship between period and harmonic spacing that the Fourier series encodes.

Looking Forward: Three Branches

From the Fourier Transform, the curriculum branches in three directions:

Branch 1: Discretization -> Section 20-03 DFT and FFT. The continuous FT must be made computable. Discretizing both time and frequency leads to the DFT, and the Cooley-Tukey algorithm (1965) computes it in $O(N\log N)$ instead of $O(N^2)$. The FFT is arguably the most important algorithm in scientific computing - it is what makes spectrograms, MRI, and FNet feasible. The DFT inherits all properties of the continuous FT but adds complications: aliasing (from sampling in time), spectral leakage (from finite duration), and the circular convolution structure.

Branch 2: Convolution Theorem -> Section 20-04. The most important property of the FT for applications is that convolution in time equals multiplication in frequency. This converts the $O(N^2)$ convolution operation into $O(N\log N)$ FFT-based multiplication. The convolution theorem is the mathematical foundation of CNNs, WaveNet, S4/Mamba, and Hyena - all architectures where the key operation is a learned filter applied by convolution. The full theory of LTI systems, frequency response, filter design, and the Wiener-Khinchin theorem belongs there.

Branch 3: Time-Frequency Localization -> Section 20-05 Wavelets. The FT has a fundamental limitation: $|\hat{f}(\xi)|$ tells you which frequencies are present but not when they occur. The uncertainty principle shows this is not fixable - any attempt to localize in both time and frequency is bounded by $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$. Wavelets overcome this by accepting a principled tradeoff: high frequencies are analyzed at fine time resolution, low frequencies at coarse time resolution. This multi-resolution analysis (MRA) is the mathematical foundation of JPEG 2000, EEG analysis, and wavelet attention mechanisms.

POSITION IN THE FOURIER CURRICULUM
========================================================================

  Section 01 FOURIER SERIES         Section 02 FOURIER TRANSFORM (HERE)
  -------------------------         -----------------------------------
  Periodic functions                Aperiodic functions on R
  Discrete spectrum {c_n}           Continuous spectrum f^(ξ)
  Parseval for series               Plancherel's theorem
  Dirichlet convergence             Inversion theorem
  Gibbs phenomenon                  Spectral leakage (-> Section 03)
  RoPE derivation                   Full uncertainty principle
                   v                         v
  +--------------------------------------------------------------------+
  |                   Section 02 -> THREE BRANCHES                     |
  |                                                                    |
  |  Discretize:            Convolve:                Time-localize:    |
  |  Section 03 DFT/FFT ->  Section 04 Convolution -> Section 05 Wavelets |
  |  O(N log N)             CNNs, Mamba              MRA, multi-scale  |
  |  Whisper, FNO           WaveNet, Hyena           Scattering networks |
  +--------------------------------------------------------------------+
                          v
       Section 21 Statistical Learning Theory
       (spectral learning bounds, kernel methods)

========================================================================

Key Takeaways

The Fourier Transform is characterized by three central theorems:

  1. Plancherel - the FT is a unitary isometry on $L^2(\mathbb{R})$: it preserves energy and inner products. The Fourier Transform changes how you describe a signal, not how much information it contains.

  2. Uncertainty - $\Delta t\cdot\Delta\xi \geq 1/(4\pi)$: time and frequency concentration are fundamentally coupled. This is a theorem about analysis, not about physics, and it constrains every signal processing and ML system that operates in both domains simultaneously.

  3. Inversion - the FT is invertible: given $\hat{f}$, you can recover $f$ exactly. No information is lost in the Fourier domain representation - it is a complete, lossless change of basis.

Together, these theorems say: the Fourier Transform is an invertible, energy-preserving, mathematically constrained change of representation. It reveals structure (frequency content) that is invisible in the time domain, converts convolution to multiplication, and converts differentiation to algebra - making it the single most powerful analytical tool for understanding and processing signals, functions, and the weight matrices of neural networks.




Appendix A: Extended Properties and Derivations

A.1 The Duality Theorem in Full

One of the most elegant properties of the Fourier Transform is self-duality: applying the FT twice returns the time-reversed original.

Theorem A.1 (Duality). For $f \in L^1(\mathbb{R})$ with $\hat{f} \in L^1(\mathbb{R})$:

$$\mathcal{F}[\hat{f}(t)](\xi) = f(-\xi)$$

Equivalently: if $\mathcal{F}[f](\xi) = g(\xi)$, then $\mathcal{F}[g](\xi) = f(-\xi)$.

Proof: $\mathcal{F}[\hat{f}(t)](\xi) = \int\hat{f}(t)e^{-2\pi i\xi t}\,dt = \int\left(\int f(s)e^{-2\pi ist}\,ds\right)e^{-2\pi i\xi t}\,dt$

Exchanging the order of integration and reading the inner integral distributionally: $= \int f(s)\left(\int e^{-2\pi i(s+\xi)t}\,dt\right)ds = \int f(s)\,\delta(s+\xi)\,ds = f(-\xi)$ $\square$

Corollary: Applying the FT four times returns the original: $\mathcal{F}^4[f] = f$. The Fourier Transform has order 4 as an operator.

Using duality to compute new transforms:

If you know $\mathcal{F}[f](\xi) = g(\xi)$, then $\mathcal{F}[g](\xi) = f(-\xi)$.

Example: We know $\mathcal{F}[\operatorname{rect}(t)](\xi) = \operatorname{sinc}(\xi)$. By duality: $\mathcal{F}[\operatorname{sinc}(t)](\xi) = \operatorname{rect}(-\xi) = \operatorname{rect}(\xi)$ (since rect is even).

So $\mathcal{F}[\operatorname{sinc}](\xi) = \operatorname{rect}(\xi)$: a sinc in time has a rectangular (bandlimited) spectrum. This is the ideal low-pass filter - a filter that passes frequencies $|\xi| \leq 1/2$ perfectly and rejects all others.

A.2 Analytic Signals and the Hilbert Transform

Definition A.1 (Analytic Signal). For a real signal $f(t)$, its analytic signal is:

$$f_+(t) = f(t) + i\,\mathcal{H}[f](t)$$

where $\mathcal{H}[f]$ is the Hilbert Transform: $\mathcal{H}[f](t) = \frac{1}{\pi}\,\text{P.V.}\int_{-\infty}^\infty \frac{f(\tau)}{t-\tau}\,d\tau$.

Fourier domain characterization:

$$\hat{f}_+(\xi) = \begin{cases} 2\hat{f}(\xi) & \xi > 0 \\ \hat{f}(0) & \xi = 0 \\ 0 & \xi < 0 \end{cases}$$

The analytic signal has only non-negative frequencies: it is a one-sided spectrum signal. This is achieved by zeroing out the negative frequencies and doubling the positive ones - a frequency-domain operation.

Why this matters: The analytic signal enables the definition of instantaneous frequency $\omega(t) = \frac{d}{dt}\arg f_+(t)$ - the frequency of the signal at a given instant. This is essential for frequency-modulated signals (FM radio, chirps) and is used in empirical mode decomposition for analyzing non-stationary signals like EEG and speech.

Hilbert Transform in spectral terms: $\mathcal{F}[\mathcal{H}[f]](\xi) = -i\,\operatorname{sgn}(\xi)\,\hat{f}(\xi)$. The Hilbert Transform applies a $-\pi/2$ phase shift to positive frequencies and $+\pi/2$ to negative frequencies - it is a "90-degree phase rotator."
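
Code sketch - the analytic signal via the FFT (the same one-sided-spectrum construction used by scipy.signal.hilbert; the test signal is illustrative):

import numpy as np

def analytic_signal(x):
    """Zero the negative frequencies, double the positive ones, keep DC/Nyquist."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0  # Nyquist bin for even N
    return np.fft.ifft(X * h)

fs = 1000
t = np.arange(fs) / fs
z = analytic_signal(np.cos(2 * np.pi * 50 * t))
# Instantaneous frequency = phase derivative / (2 pi); recovers ~50 Hz.
inst_freq = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)
print(inst_freq[:5])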

A.3 The Fourier Transform on $\mathbb{R}^d$

The Fourier Transform extends naturally to $d$ dimensions. For $f: \mathbb{R}^d \to \mathbb{C}$ with $f \in L^1(\mathbb{R}^d)$:

$$\hat{f}(\boldsymbol{\xi}) = \int_{\mathbb{R}^d} f(\mathbf{t})\,e^{-2\pi i \boldsymbol{\xi}^\top\mathbf{t}}\,d\mathbf{t}$$

where $\boldsymbol{\xi} \in \mathbb{R}^d$ is the frequency vector. All properties generalize:

  • Shift: $\mathcal{F}[f(\mathbf{t} - \mathbf{a})](\boldsymbol{\xi}) = e^{-2\pi i\boldsymbol{\xi}^\top\mathbf{a}}\,\hat{f}(\boldsymbol{\xi})$
  • Scaling by an invertible matrix $A$: $\mathcal{F}[f(A\mathbf{t})](\boldsymbol{\xi}) = \frac{1}{|\det A|}\,\hat{f}(A^{-\top}\boldsymbol{\xi})$
  • Differentiation: $\mathcal{F}[\partial_{t_j}f](\boldsymbol{\xi}) = 2\pi i\xi_j\,\hat{f}(\boldsymbol{\xi})$
  • Plancherel: $\lVert\hat{f}\rVert_{L^2(\mathbb{R}^d)} = \lVert f\rVert_{L^2(\mathbb{R}^d)}$

For AI: The 2D Fourier Transform is used in convolutional networks for images. A 2D convolution with filter $h$ is, in Fourier space, pointwise multiplication: $\hat{y}(\boldsymbol{\xi}) = \hat{x}(\boldsymbol{\xi})\cdot\hat{h}(\boldsymbol{\xi})$. For FNet (Section 8.1), the 2D FT is applied over both the sequence dimension and the embedding dimension.
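
A quick numerical illustration of this duality (array sizes are illustrative), checking circular 2D convolution against pointwise multiplication in Fourier space:

import numpy as np

rng = np.random.default_rng(0)
x, h = rng.normal(size=(32, 32)), rng.normal(size=(32, 32))

# Circular 2D convolution via the FFT ...
y = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(h)))
# ... matches the direct sum definition at a sample output position.
i, j = 5, 7
s = sum(x[a, b] * h[(i - a) % 32, (j - b) % 32]
        for a in range(32) for b in range(32))
assert np.allclose(y[i, j], s)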

Radial functions: For radially symmetric $f(\mathbf{t}) = f_0(\lVert\mathbf{t}\rVert)$, the Fourier Transform is also radial: $\hat{f}(\boldsymbol{\xi}) = \hat{f}_0(\lVert\boldsymbol{\xi}\rVert)$. The RBF kernel $k(\mathbf{x}-\mathbf{y}) = e^{-\lVert\mathbf{x}-\mathbf{y}\rVert^2}$ is radially symmetric, so its spectral density $p(\boldsymbol{\omega}) = c_d\,e^{-\lVert\boldsymbol{\omega}\rVert^2/4}$ is also radially symmetric - which is why sampling $\boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, 2I)$ works for RFF with this kernel (as in Exercise 7).

A.4 The Fourier Transform and PDEs

The differentiation property $\mathcal{F}[f^{(n)}](\xi) = (2\pi i\xi)^n\hat{f}(\xi)$ converts partial differential equations into algebraic equations in frequency space. This is the primary tool for solving linear constant-coefficient PDEs.

Example 1: Heat Equation. Solve $\partial_t u = \kappa\,\partial_{xx}u$, $u(x,0) = u_0(x)$.

Taking the Fourier Transform in $x$:

$$\partial_t\hat{u}(\xi,t) = \kappa(2\pi i\xi)^2\hat{u}(\xi,t) = -4\pi^2\kappa\xi^2\,\hat{u}(\xi,t)$$

This is a first-order ODE in $t$ for each fixed $\xi$: $\hat{u}(\xi,t) = \hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}$.

Inverting: $u(x,t) = \int\hat{u}_0(\xi)\,e^{-4\pi^2\kappa\xi^2 t}\,e^{2\pi i\xi x}\,d\xi = (u_0 * G_t)(x)$

where $G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}\,e^{-x^2/(4\kappa t)}$ is the Gaussian heat kernel (a Gaussian with time-dependent width $\sigma = \sqrt{2\kappa t}$).

Interpretation: Heat diffusion = convolution with a Gaussian. High frequencies $\xi$ decay as $e^{-4\pi^2\kappa\xi^2 t}$ - much faster than low frequencies. Sharp features (high-frequency content) are smoothed out rapidly; broad features (low-frequency content) persist.
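
Code sketch - a spectral heat-equation step on a periodic grid, showing the frequency-dependent damping (parameters are illustrative):

import numpy as np

def heat_step_spectral(u0, kappa, t, L=1.0):
    """u_hat(xi, t) = u_hat0(xi) * exp(-4 pi^2 kappa xi^2 t) on a periodic domain."""
    N = len(u0)
    xi = np.fft.fftfreq(N, d=L / N)  # frequencies in cycles per unit length
    u_hat = np.fft.fft(u0) * np.exp(-4 * np.pi**2 * kappa * xi**2 * t)
    return np.real(np.fft.ifft(u_hat))

x = np.linspace(0, 1, 256, endpoint=False)
u0 = np.sign(np.sin(2 * np.pi * x))           # square wave: sharp edges
u = heat_step_spectral(u0, kappa=1e-3, t=1.0)
# The high harmonics of the edges are damped far more than the fundamental.
print(np.abs(np.fft.fft(u0))[[1, 17]], np.abs(np.fft.fft(u))[[1, 17]])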

For AI - Diffusion Models: The DDPM forward process $q(x_t|x_0) = \mathcal{N}\big(x_t;\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\big)$ is the discrete analog of heat diffusion: the signal is progressively degraded by Gaussian noise (which has a flat spectrum - all frequencies equally). The network learns to reverse this process. The Fourier perspective explains why the early denoising steps (low noise) must recover high-frequency details while later steps (high noise) recover the coarse structure - the exact inverse of heat diffusion.

Example 2: Wave Equation. Solve $\partial_{tt}u = c^2\partial_{xx}u$, $u(x,0) = u_0(x)$, $\partial_t u(x,0) = v_0(x)$.

FT in $x$: $\partial_{tt}\hat{u} = (2\pi i\xi)^2 c^2\hat{u} = -4\pi^2c^2\xi^2\,\hat{u}$

Solution: $\hat{u}(\xi,t) = \hat{u}_0(\xi)\cos(2\pi c\xi t) + \hat{v}_0(\xi)\,\dfrac{\sin(2\pi c\xi t)}{2\pi c\xi}$

Inverting gives d'Alembert's formula: $u(x,t) = \frac{1}{2}[u_0(x+ct) + u_0(x-ct)] + \frac{1}{2c}\int_{x-ct}^{x+ct}v_0(s)\,ds$

The wave equation propagates each frequency at the same speed $c$ - no dispersion. In contrast, the heat equation attenuates high frequencies faster than low frequencies - strong dissipation.

Example 3: Schrödinger Equation. The free-particle Schrödinger equation $i\hbar\,\partial_t\psi = -\frac{\hbar^2}{2m}\partial_{xx}\psi$ is formally a heat equation with imaginary time. Its solution $\hat{\psi}(\xi,t) = \hat{\psi}_0(\xi)\,e^{-i\hbar(2\pi\xi)^2t/(2m)}$ shows that each frequency component oscillates (rather than decays) with angular frequency $\hbar(2\pi\xi)^2/(2m)$ - quantum dispersion. The uncertainty principle $\Delta x\cdot\Delta p \geq \hbar/2$ (where $p = h\xi$ by de Broglie) is the physics version of the mathematical uncertainty principle in Section 6.

A.5 Windowed Fourier Transform (STFT)

The continuous Fourier Transform gives global frequency information but no time localization: $|\hat{f}(\xi)|$ tells you whether frequency $\xi$ is present somewhere in $f$, but not when.

Definition A.2 (Short-Time Fourier Transform / Spectrogram). For a window function $g \in L^2(\mathbb{R})$ and signal $f \in L^2(\mathbb{R})$:

$$\text{STFT}_f(t, \xi) = \int_{-\infty}^{\infty} f(\tau)\,\overline{g(\tau - t)}\,e^{-2\pi i\xi\tau}\,d\tau$$

This localizes the analysis around time $t$ using the window $g$. The spectrogram is $|\text{STFT}_f(t,\xi)|^2$ - a time-frequency power map.
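
Code sketch - a minimal discrete STFT/spectrogram (window length and hop are illustrative):

import numpy as np

def spectrogram(x, win_len=256, hop=64):
    """Windowed FFT frames: rows are time frames, columns frequency bins."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 8000
t = np.arange(2 * fs) / fs
chirp = np.sin(2 * np.pi * (200 + 400 * t) * t)  # frequency rises over time
S = spectrogram(chirp)
print(S.shape)  # (n_frames, win_len // 2 + 1); the spectral ridge drifts upward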

The uncertainty principle applies: The time resolution $\Delta t$ and frequency resolution $\Delta\xi$ of the STFT are determined by the window $g$:

  • Narrow window gg -> good time resolution, poor frequency resolution
  • Wide window gg -> poor time resolution, good frequency resolution

The Gaussian window $g(t) = e^{-\pi t^2/\sigma^2}$ minimizes the product $\Delta t\cdot\Delta\xi$ for a given $\sigma$, making it the optimal STFT window (the resulting time-frequency representation is called the Gabor transform).

For AI - Audio Models: Whisper (Radford et al., 2022) processes audio as mel-spectrograms: it applies a 25 ms STFT window (adequate time resolution for phoneme detection) with a 10 ms hop, maps the frequency axis to the mel scale (logarithmic, matching human perception), and feeds the result as a 2D "image" to a Transformer encoder. The STFT parameters are fixed - the uncertainty principle determines the time-frequency resolution tradeoff.

STFT TIME-FREQUENCY TRADEOFF (UNCERTAINTY PRINCIPLE IN PRACTICE)
========================================================================

  Short window (e.g., 2ms):            Long window (e.g., 50ms):
  -------------------------            --------------------------
  Fine time resolution (~2ms)          Coarse time resolution (~50ms)
  Poor freq resolution (~500 Hz)       Fine freq resolution (~20 Hz)
  Good for: transients, clicks         Good for: sustained tones, pitch

  Whisper (speech): 25ms window -> resolves individual phonemes (50-100ms)
  while achieving ~40Hz frequency resolution (enough for pitch/formants)

  The Gaussian window achieves the minimum Δt·Δξ product - it is the
  optimal window under the uncertainty principle.

========================================================================

A.6 Fourier Transform and Operator Theory

From the perspective of functional analysis, the Fourier Transform is a unitary operator on the Hilbert space $L^2(\mathbb{R})$. This abstract viewpoint connects Fourier analysis to the broader theory of self-adjoint operators.

The Fourier operator $\mathcal{F}$: $\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R})$ is unitary: $\mathcal{F}^*\mathcal{F} = \mathcal{F}\mathcal{F}^* = I$.

Since $\mathcal{F}^4 = I$, the eigenvalues satisfy $\lambda^4 = 1$: $\lambda \in \{1, -1, i, -i\}$.

The spectral decomposition of $\mathcal{F}$ in terms of its eigenspaces: $L^2(\mathbb{R}) = \mathcal{E}_1 \oplus \mathcal{E}_{-1} \oplus \mathcal{E}_i \oplus \mathcal{E}_{-i}$, where each $\mathcal{E}_\lambda$ is spanned by the Hermite functions with the corresponding eigenvalue.

Connection to quantum mechanics: The position operator $\hat{X}f(t) = tf(t)$ and momentum operator $\hat{P}f(t) = \frac{1}{2\pi i}\frac{df}{dt}$ (in natural units) are both self-adjoint operators on $L^2(\mathbb{R})$, related by $\hat{P} = \mathcal{F}\hat{X}\mathcal{F}^{-1}$. The commutator $[\hat{X},\hat{P}] = \frac{i}{2\pi}I$ implies the uncertainty principle $\Delta X\cdot\Delta P \geq \frac{1}{4\pi}$. The mathematical identity and the physics are the same theorem.

For AI: The language of operator theory is increasingly used to analyze neural networks. The weight matrix $W$ of a linear layer is a linear operator; its spectral norm $\lVert W\rVert_2$ is its "size" as an operator; its singular value decomposition expresses it in terms of rank-1 operators. The FT connects linear operator theory to signal processing, unifying the analysis of both classical filters and neural network layers.

