
Fourier Series: Appendix B: Advanced Topics to Appendix I: Implementation Details

Appendix B: Advanced Topics

B.1 Multidimensional Fourier Series

On a $d$-dimensional torus $\mathbb{T}^d = [-\pi,\pi]^d$, the Fourier series generalizes naturally. For $\mathbf{x} = (x_1,\ldots,x_d) \in \mathbb{T}^d$ and $\mathbf{n} = (n_1,\ldots,n_d) \in \mathbb{Z}^d$:

$$f(\mathbf{x}) = \sum_{\mathbf{n} \in \mathbb{Z}^d} c_{\mathbf{n}} e^{i\mathbf{n}\cdot\mathbf{x}}, \quad c_{\mathbf{n}} = \frac{1}{(2\pi)^d}\int_{\mathbb{T}^d} f(\mathbf{x})\, e^{-i\mathbf{n}\cdot\mathbf{x}}\,d\mathbf{x}$$

The orthonormality relation is $\langle e^{i\mathbf{n}\cdot\mathbf{x}}, e^{i\mathbf{m}\cdot\mathbf{x}} \rangle = \delta_{\mathbf{n}\mathbf{m}}$, and Parseval generalizes to $\lVert f \rVert^2 = \sum_{\mathbf{n}} |c_{\mathbf{n}}|^2$.

For AI - 2D images: Images are functions on a 2D torus (with periodic boundary). The 2D Fourier series (or DFT) represents an image as a sum of 2D complex exponentials $e^{i(n_1 x_1 + n_2 x_2)}$. Low-frequency components $(n_1, n_2)$ near the origin represent slowly varying regions; high-frequency components represent edges and textures. JPEG compression discards high-frequency 2D DCT coefficients - this is the 2D cosine series applied to $8 \times 8$ blocks.
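A minimal NumPy sketch of this low-vs-high frequency split (the synthetic $8 \times 8$ block and the chosen low-frequency window are our own illustration, not data from the text):

import numpy as np

# Synthetic 8x8 "image" block: a smooth gradient plus a sharp vertical edge
x1, x2 = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 0.5 + 0.3 * np.cos(2 * np.pi * x1 / 8)
block[:, 4:] += 1.0                      # the edge adds high-frequency content

C = np.fft.fft2(block) / block.size      # 2D Fourier coefficients c_{n1,n2}
power = np.abs(C) ** 2

# Energy fraction in the lowest frequencies |n1|, |n2| <= 1
low = power[[0, 1, -1]][:, [0, 1, -1]].sum()
print(f"low-frequency energy fraction: {low / power.sum():.3f}")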

B.2 Distributions and the Dirac Delta

For a rigorous treatment of impulse-like functions in Fourier analysis, we need distributions (generalized functions). The Dirac delta $\delta_{x_0}$ is defined by:

$$\langle \delta_{x_0}, \phi \rangle = \phi(x_0) \quad \text{for all test functions } \phi \in C^\infty[-\pi,\pi]$$

Its Fourier coefficients are $c_n[\delta_0] = \frac{1}{2\pi}$, so its Fourier series is:

$$\delta_0(x) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} e^{inx}$$

This is the completeness relation of the Fourier system: the identity operator decomposes as a sum over all frequency components. In functional analysis, this is written as $I = \sum_n |e^{inx}\rangle\langle e^{inx}|$ (in Dirac notation).

Verification: For any smooth $f$, pairing the series against $f$ gives $\frac{1}{2\pi}\sum_n \int_{-\pi}^{\pi} e^{inx} f(x)\,dx = \sum_n c_{-n}[f] = \sum_n c_n[f]$ (reindexing $n \to -n$), and by the inversion formula evaluated at $x = 0$ this equals $f(0) = \langle \delta_0, f \rangle$, as required.

Comb function: The Dirac comb $\text{III}(x) = \sum_{k=-\infty}^{\infty} \delta(x - 2\pi k)$ has Fourier coefficients $c_n = \frac{1}{2\pi}$ for all $n$ - the same Fourier series as $\delta_0$. This is no paradox: a Fourier series always represents a $2\pi$-periodic object, and the $2\pi$-periodization of $\delta_0$ is exactly the comb.

B.3 Pointwise Convergence: The Full Story (Carleson's Theorem)

The question of pointwise convergence of Fourier series remained open for well over a century after Dirichlet's 1829 result.

Timeline of the convergence problem:

  • Dirichlet (1829): Piecewise smooth functions -> pointwise convergence everywhere
  • Du Bois-Reymond (1876): Exhibited a continuous function whose Fourier series diverges at one point
  • Kolmogorov (1923): Exhibited an $L^1$ function whose Fourier series diverges almost everywhere (strengthened to everywhere in 1926)!
  • Carleson (1966): Proved that for every $f \in L^2$, the Fourier series converges pointwise almost everywhere
  • Hunt (1968): Extended to $f \in L^p$ for any $p > 1$

Carleson's theorem (he later won the Abel Prize in 2006) resolved the central question: $L^2$ functions have Fourier series converging a.e. The proof is one of the most difficult in analysis, involving a refined version of Calderón-Zygmund theory.

Practical implication: For the functions we encounter in ML (bounded measurable, piecewise smooth, $L^2$), Fourier series converge almost everywhere - and everywhere for piecewise smooth functions. Everywhere-divergence requires leaving every $L^p$ with $p > 1$, as in Kolmogorov's $L^1$ example.

B.4 Sobolev Spaces and Regularity via Fourier Series

Definition (Sobolev spaces): For $s \geq 0$, the Sobolev space $H^s[-\pi,\pi]$ consists of the $L^2$ functions whose Fourier coefficients satisfy:

$$\lVert f \rVert_{H^s}^2 = \sum_{n=-\infty}^{\infty} (1 + n^2)^s |c_n|^2 < \infty$$

The parameter $s$ measures smoothness: $H^0 = L^2$, $H^1$ is roughly "once weakly differentiable with derivative in $L^2$", and $H^s$ for $s > 1/2$ consists of continuous functions (Sobolev embedding theorem).

Fourier characterization of regularity: If $|c_n| = O(|n|^{-s-1/2-\varepsilon})$ for some $\varepsilon > 0$, then $f \in H^s$; conversely, membership in $H^s$ says exactly that $(1+n^2)^{s/2} c_n$ is square-summable. Up to this summability caveat, Sobolev regularity is readable directly from the Fourier spectrum.

For AI - Spectral bias quantified: The spectral bias (Section 7.3) says networks approximate $f \in H^s$ more easily if $s$ is large (smooth target). Formally, a network trained by gradient descent with $n$ samples achieves $L^2$ risk of order $n^{-2s/(2s+1)}$ when targeting $f \in H^s$ - the standard minimax rate for nonparametric regression. Higher $s$ -> faster convergence -> fewer samples needed.

For AI - Kernel smoothness: The RKHS of many standard kernels is (equivalent to) a Sobolev space $H^s$. The RBF kernel corresponds informally to "$H^\infty$" (infinitely smooth functions), while the Matérn kernel with regularity parameter $s$ in dimension $d$ corresponds to $H^{s+d/2}$. This is why the choice of kernel implicitly assumes a smoothness level for the target function.

B.5 Fejér-Riesz Theorem and Spectral Factorization

Theorem (Fejér-Riesz): Any trigonometric polynomial $p(x) = \sum_{|n| \leq N} c_n e^{inx}$ that is non-negative on $[-\pi,\pi]$ can be written as $p(x) = |q(x)|^2$, where $q(x) = \sum_{n=0}^N d_n e^{inx}$ is an analytic polynomial (only non-negative frequencies).

Spectral factorization: Given a non-negative spectral density $S(\omega) \geq 0$ (such as the power spectral density of a stationary process), we can find a causal filter $H(\omega)$ such that $|H(\omega)|^2 = S(\omega)$. This is the foundation of the Wiener filter (used in denoising) and of Cholesky-like factorizations in probability theory.

For AI - Connection to SSMs: State Space Models (S4, Mamba) parameterize their convolution kernels through a spectral factorization. The kernel is written as $K = \text{IFFT}(H(\omega))$ where $H(\omega)$ is constrained to be the frequency response of a stable ARMA model. The Fejér-Riesz theorem guarantees that the spectral density $|H|^2$ can always be factored into a causal part.

B.6 The Discrete Cosine Transform and Signal Compression

The Discrete Cosine Transform (DCT) of Type II is defined as:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\left(\frac{\pi k (2n+1)}{2N}\right), \quad k = 0, 1, \ldots, N-1$$

This is the even-extension Fourier series (cosine series) applied to a finite sequence. The DCT has two critical properties for signal compression:

  1. Energy compaction: For natural signals (images, audio), the DCT concentrates most of the signal energy in the first few low-frequency coefficients. The DCT of a typical $8 \times 8$ image block has 95%+ of its energy in the top-left $3 \times 3$ coefficients.

  2. Decorrelation: The DCT approximately diagonalizes the covariance matrix of natural signals, making the coefficients nearly uncorrelated. This allows independent quantization of each coefficient.

JPEG compression pipeline:

  1. Divide the image into $8 \times 8$ blocks
  2. Apply 2D DCT to each block
  3. Quantize coefficients (coarser quantization for high frequencies)
  4. Run-length encode the quantized sparse coefficients
  5. Huffman encode

The quality parameter in JPEG controls the quantization step sizes - higher quality means finer quantization of the high-frequency DCT coefficients.

DCT vs DFT: The DCT avoids the "spectral leakage" (Gibbs phenomenon) of the DFT because the even extension of a finite sequence is continuous at the boundaries. This is why JPEG uses DCT rather than DFT.
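A short sketch of energy compaction, using scipy.fft.dctn for the 2D DCT-II (the smooth test block is our own stand-in; real image blocks behave similarly):

import numpy as np
from scipy.fft import dctn

# A smooth 8x8 block, standing in for a typical low-detail image patch
x1, x2 = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 100 + 30 * np.cos(np.pi * x1 / 8) + 20 * np.cos(np.pi * x2 / 8)

coeffs = dctn(block, type=2, norm="ortho")   # orthonormal 2D DCT-II
energy = coeffs ** 2
frac = energy[:3, :3].sum() / energy.sum()   # top-left 3x3 corner
print(f"energy in the top-left 3x3 coefficients: {frac:.4f}")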

B.7 Uniform Convergence: Weierstrass M-Test Application

Let us prove uniform convergence of the triangle wave Fourier series rigorously.

Claim: The Fourier series $f(x) = \frac{\pi}{2} - \frac{4}{\pi}\sum_{k=0}^\infty \frac{\cos((2k+1)x)}{(2k+1)^2}$ (the triangle wave $|x|$) converges uniformly.

Proof via Weierstrass M-test: We need $\sum_k M_k < \infty$, where $M_k = \sup_x |a_k \cos((2k+1)x)| = |a_k| = \frac{4}{\pi(2k+1)^2}$.

Then $\sum_{k=0}^\infty M_k = \frac{4}{\pi}\sum_{k=0}^\infty \frac{1}{(2k+1)^2} = \frac{4}{\pi} \cdot \frac{\pi^2}{8} = \frac{\pi}{2} < \infty$.

Since the series of bounds converges, the Weierstrass M-test gives uniform convergence. $\square$

Contrast with square wave: there the coefficients are $b_{2k+1} = 4/(\pi(2k+1))$, so $M_k = 4/(\pi(2k+1))$, and $\sum_k 4/(\pi(2k+1))$ diverges (the odd harmonic series). The Weierstrass M-test fails - consistent with the non-uniform convergence near the jumps.
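A quick numerical check (our own sketch): the sup-norm error of the partial sum is dominated by the tail $\sum_{k>N} 4/(\pi(2k+1)^2)$, and in fact the two agree at $x = 0$.

import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)
target = np.abs(x)   # the triangle wave equals |x| on [-pi, pi]

for N in (5, 50, 500):
    S = np.pi / 2 - (4 / np.pi) * sum(
        np.cos((2 * k + 1) * x) / (2 * k + 1) ** 2 for k in range(N + 1)
    )
    sup_err = np.max(np.abs(S - target))
    # Tail of the M-test bound, truncated far out
    tail = (4 / np.pi) * sum(1.0 / (2 * k + 1) ** 2 for k in range(N + 1, 10**6))
    print(f"N={N:4d}  sup error {sup_err:.2e}  <=  tail bound {tail:.2e}")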

B.8 Connection to Functional Analysis: Completeness Proof Sketch

Here is a sketch of why the trigonometric system is complete in $L^2[-\pi,\pi]$ - a result we stated but did not prove in Section 2.2.

Step 1: Stone-Weierstrass theorem: Continuous periodic functions can be uniformly approximated by trigonometric polynomials (polynomials in $e^{inx}$).

Step 2: Density: Continuous functions are dense in $L^2[-\pi,\pi]$ (a standard result in measure theory).

Step 3: Combining: For any $f \in L^2$ and $\varepsilon > 0$, find continuous $g$ with $\lVert f - g \rVert_2 < \varepsilon/2$, then find a trigonometric polynomial $T$ with $\lVert g - T \rVert_\infty < \varepsilon/(2\sqrt{2\pi})$, giving $\lVert g - T \rVert_2 \leq \sqrt{2\pi}\,\lVert g - T \rVert_\infty < \varepsilon/2$. The triangle inequality gives $\lVert f - T \rVert_2 < \varepsilon$. $\square$

Full proof: See Section 12-02 Hilbert Spaces where the abstract completeness theorem is proved and the trigonometric system is given as the primary example.


Appendix C: Problem-Solving Strategies

C.1 Computing Fourier Coefficients: A Systematic Approach

Follow this checklist when computing Fourier coefficients:

  1. Check parity first. Is $f$ even? -> only $a_n$ terms. Odd? -> only $b_n$ terms. Neither? -> compute all.

  2. Check for a known waveform. Square wave, sawtooth, triangle wave, rectangular pulse - look up or recall the result.

  3. Split piecewise functions. Write $\int_{-\pi}^{\pi} = \int_{-\pi}^{0} + \int_0^{\pi}$ if $f$ is defined piecewise.

  4. Use integration by parts for polynomial-times-trig integrals $\int x^k \cos(nx)\,dx$. Set $u = x^k$, $dv = \cos(nx)\,dx$. Repeat until done.

  5. Check boundary terms. For periodic $f$, boundary terms in integration by parts vanish. For non-periodic problems (PDEs), keep them.

  6. Use the differentiation theorem as a shortcut: if you know $c_n[f']$, then $c_n[f] = c_n[f']/(in)$ for $n \neq 0$.

  7. Verify via Parseval. Compute $\sum |c_n|^2$ and check it equals $\frac{1}{2\pi}\int |f|^2$ - this catches algebra errors (see the sketch below).
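A sketch of step 7 in practice, for the sawtooth $f(x) = x$ whose exact answer $\pi^2/3$ we know (grid size and test function are our own choices):

import numpy as np

N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
c = np.fft.fft(x) / N            # approximate Fourier coefficients

lhs = np.sum(np.abs(c) ** 2)     # sum of |c_n|^2
rhs = np.mean(x ** 2)            # (1/2pi) * integral of |f|^2, rectangle rule
print(lhs, rhs, np.pi**2 / 3)    # all three should agree closely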

C.2 Determining Convergence Type

| Condition | Conclusion |
|---|---|
| $f$ piecewise smooth | Pointwise convergence to $\frac{f(x^+)+f(x^-)}{2}$ (Dirichlet) |
| $f$ continuous + piecewise $C^1$ | $\lVert f - S_N f \rVert_\infty \to 0$ (uniform) |
| $f \in L^2$ | $\lVert f - S_N f \rVert_2 \to 0$ (mean-square) |
| $f \in C^k$, periodic | Uniform convergence; coefficients $O(1/n^k)$ |
| $f$ analytic | Exponential coefficient decay; exponentially fast uniform convergence |

C.3 Recognizing When Fourier Series Appear in ML

Watch for these signatures of Fourier analysis in ML papers and code:

  • $\sin(\omega_k t)$ and $\cos(\omega_k t)$ in positional encodings -> Fourier basis
  • "Frequency domain" or "spectral" in the title -> likely uses the DFT or FT
  • Complex multiplication or rotation in attention -> RoPE or a relative-position variant
  • "Low-pass filter" or "high-pass filter" -> filter = convolution = Fourier operation
  • $1/n^2$ or $1/n^4$ weight decay in regularization -> Sobolev norm regularization
  • "Spectral density" or "power spectral density" -> autocorrelation -> Wiener-Khinchin -> Fourier
  • State space model (SSM, S4, Mamba) -> the convolution kernel is parameterized in the frequency domain

Appendix D: Deep Dive - Proofs and Derivations

D.1 Proof of Dirichlet's Theorem (Complete)

We prove the central convergence result: for $f$ piecewise smooth and $2\pi$-periodic, $S_N f(x) \to \frac{f(x^+) + f(x^-)}{2}$.

Setup. Write $S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t) D_N(x-t)\,dt$. By periodicity, we may center the integral at $x$:

$$S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x+t) D_N(t)\,dt$$

Since $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(t)\,dt = 1$, we can write:

$$S_N f(x) - \frac{f(x^+)+f(x^-)}{2} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[f(x+t) - \frac{f(x^+)+f(x^-)}{2}\right] D_N(t)\,dt$$

Split the integral. Since $D_N(t) = \frac{\sin((N+1/2)t)}{\sin(t/2)}$ and $D_N(t) = D_N(-t)$ (even function):

$$= \frac{1}{2\pi}\int_0^{\pi} \left[f(x+t) - f(x^+) + f(x-t) - f(x^-)\right] \frac{\sin((N+1/2)t)}{\sin(t/2)}\,dt$$

Key function. Define:

$$g(t) = \frac{f(x+t) - f(x^+) + f(x-t) - f(x^-)}{\sin(t/2)}$$

We need $g \in L^1[0,\pi]$. Near $t = 0$: the numerator behaves like $f'(x^+)t - f'(x^-)t$ (by piecewise smoothness), so the numerator is $O(t)$, and $\sin(t/2) \sim t/2$, giving $g(t) = O(1)$ - bounded near $t = 0$. Away from $t = 0$: $g$ is bounded since $f$ is bounded and $\sin(t/2) \geq c > 0$. So $g \in L^1$.

Apply Riemann-Lebesgue. The expression becomes:

$$\frac{1}{2\pi}\int_0^{\pi} g(t) \sin\!\left(\left(N+\frac{1}{2}\right)t\right)\,dt$$

a Fourier-type integral of $g$ at frequency $N+\frac{1}{2}$. By the Riemann-Lebesgue lemma (proved below), this $\to 0$ as $N \to \infty$. $\square$

Riemann-Lebesgue Lemma (proof): For $h \in L^1[a,b]$:

$$\int_a^b h(t)\sin(\lambda t)\,dt \to 0 \quad \text{as } \lambda \to \infty$$

Proof for step functions: If $h = \mathbf{1}_{[c,d]}$, then $\int_c^d \sin(\lambda t)\,dt = \frac{\cos(\lambda c) - \cos(\lambda d)}{\lambda} \to 0$. By linearity, this holds for any step function. For general $L^1$ functions, approximate by step functions and use the $L^1$ bound. $\square$

D.2 Parseval's Identity: Detailed Proof

Theorem. For $f \in L^2[-\pi,\pi]$ with complex Fourier coefficients $\{c_n\}_{n \in \mathbb{Z}}$:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx = \sum_{n=-\infty}^{\infty} |c_n|^2$$

Proof (assuming completeness). Consider the partial sum error:

$$\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lVert S_N f \rVert^2$$

The cross-term vanishes: $\langle f, S_N f \rangle = \sum_{|n| \leq N} |c_n|^2 = \lVert S_N f \rVert^2$ (by the projection property: $S_N f$ is the best approximation to $f$ from $\text{span}\{e^{inx} : |n| \leq N\}$).

So $\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \sum_{|n| \leq N} |c_n|^2 \geq 0$ (Bessel).

By completeness, $\lVert f - S_N f \rVert \to 0$, so:

$$0 = \lim_{N\to\infty}\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lim_{N\to\infty}\sum_{|n| \leq N} |c_n|^2 = \lVert f \rVert^2 - \sum_{n=-\infty}^\infty |c_n|^2$$

This gives the result. $\square$

Remark: The key step is $\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lVert S_N f \rVert^2$. This is the Pythagorean theorem in $L^2$: $f$ decomposes orthogonally into $S_N f$ (the projection) and $f - S_N f$ (the residual). Not an analogy - it IS the Pythagorean theorem, in an infinite-dimensional space.

D.3 Gibbs Phenomenon: Precise Quantification

Let $f(x) = \operatorname{sgn}(x)$ (square wave). We compute the exact overshoot at the jump $x = 0^+$.

The partial sum $S_N f(x) = \frac{4}{\pi}\sum_{k=0}^N \frac{\sin((2k+1)x)}{2k+1}$ has its first local maximum at $x_N = \frac{\pi}{2(N+1)}$: its derivative $(S_N f)'(x) = \frac{4}{\pi}\sum_{k=0}^N \cos((2k+1)x) = \frac{2}{\pi}\,\frac{\sin(2(N+1)x)}{\sin x}$ first vanishes there. Evaluating instead at the nearby point $\frac{\pi}{2N+1}$ (the two agree up to $O(1/N^2)$ and give the same limit) keeps the computation cleanest:

$$S_N f\!\left(\frac{\pi}{2N+1}\right) = \frac{4}{\pi}\sum_{k=0}^N \frac{\sin\!\left(\frac{(2k+1)\pi}{2N+1}\right)}{2k+1}$$

This is a Riemann sum for $\frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt$ with step $\Delta t = \frac{2\pi}{2N+1}$. As $N \to \infty$:

$$\lim_{N\to\infty} S_N f\!\left(\frac{\pi}{2N+1}\right) = \frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt = \frac{2}{\pi} \cdot \mathrm{Si}(\pi) \approx \frac{2}{\pi} \times 1.8519 \approx 1.1790$$

The jump in $f$ is from $-1$ to $+1$ (height 2). The overshoot above the limiting value $+1$ is $1.1790 - 1.0 = 0.1790$, which is $8.95\% \approx 9\%$ of the jump height.

Numerical value of $\mathrm{Si}(\pi)$:

$$\mathrm{Si}(\pi) = \int_0^\pi \frac{\sin t}{t}\,dt = \sum_{k=0}^\infty \frac{(-1)^k \pi^{2k+1}}{(2k+1)\cdot(2k+1)!} \approx 1.8519$$

Why $9\%$ is universal: The fraction $\frac{1}{2}\left(\frac{2}{\pi}\mathrm{Si}(\pi) - 1\right) \approx 0.0895$ does not depend on the specific piecewise smooth function - only on the jump height. Every jump discontinuity in any piecewise smooth function produces a Gibbs overshoot of this universal fraction of the jump.
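A numerical confirmation of the $1.1790$ limit (our own sketch; scipy.integrate.quad supplies $\mathrm{Si}(\pi)$):

import numpy as np
from scipy.integrate import quad

Si_pi, _ = quad(lambda t: np.sin(t) / t, 0, np.pi)
print("2/pi * Si(pi) =", 2 / np.pi * Si_pi)       # ~1.1790

for N in (10, 100, 1000):
    xN = np.pi / (2 * N + 1)
    S = (4 / np.pi) * sum(
        np.sin((2 * k + 1) * xN) / (2 * k + 1) for k in range(N + 1)
    )
    print(N, S)                                    # approaches 1.1790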

D.4 The Weierstrass Approximation Theorem and Completeness

Weierstrass Approximation Theorem (1885): Every continuous function $f$ on $[a,b]$ can be uniformly approximated by algebraic polynomials.

Trigonometric variant: Every continuous $2\pi$-periodic function can be uniformly approximated by trigonometric polynomials $T(x) = \sum_{|n| \leq N} c_n e^{inx}$.

Proof sketch (Fejér's approach): The Cesàro means $\sigma_N f$ are trigonometric polynomials (each $S_k f$ is a trigonometric polynomial of degree $k$, and their average $\sigma_N f$ is one too). Fejér proved $\sigma_N f \to f$ uniformly for continuous $f$. $\square$

Why completeness follows: Given $f \in L^2$ and $\varepsilon > 0$:

  1. Find continuous $g$ with $\lVert f - g \rVert_2 < \varepsilon/2$ (density of $C$ in $L^2$)
  2. Find a trigonometric polynomial $T$ with $\lVert g - T \rVert_\infty < \varepsilon/(2\sqrt{2\pi})$ (Weierstrass/Fejér)
  3. Then $\lVert g - T \rVert_2 \leq \sqrt{2\pi}\,\lVert g - T \rVert_\infty < \varepsilon/2$
  4. Triangle inequality: $\lVert f - T \rVert_2 < \varepsilon$ $\square$

D.5 Fourier Series and Operator Theory

The Fourier series establishes a fundamental connection between function analysis and operator theory.

The operator perspective: Consider the differentiation operator $D = \frac{d}{dx}$ on periodic functions. The eigenfunctions of $D$ are exactly the complex exponentials:

$$D e^{inx} = in \cdot e^{inx}$$

with eigenvalue $in$. The Fourier series is the eigenfunction expansion of $D$. Every linear differential operator with constant coefficients

$$L = a_k D^k + a_{k-1}D^{k-1} + \cdots + a_0$$

is diagonalized by the Fourier basis:

$$L e^{inx} = (a_k (in)^k + a_{k-1}(in)^{k-1} + \cdots + a_0)\, e^{inx} = p(in)\, e^{inx}$$

where $p(z) = a_k z^k + \cdots + a_0$ is the symbol of the operator.

For AI: Attention is (in the linear case) an operator on sequence positions. The linear attention matrix $A$ with entries $A_{mn} = K(m-n)$ (for a shift-invariant kernel) is diagonalized by the Fourier basis - its eigenvectors are the complex exponentials and its eigenvalues are the Fourier transform of $K$. This is why spectral analysis of attention patterns connects directly to Fourier theory, and why RoPE (which multiplies by $e^{im\theta}$) is so natural in this framework.

Compact operators: If an operator's symbol satisfies $p(in) \to 0$ as $|n| \to \infty$, the operator is compact. The inverse problem (recovering $f$ from $Lf = g$) is then ill-posed: small changes in $g$ can cause large changes in $f$. This is the mathematical underpinning of regularization in machine learning.


Appendix E: Summary Tables

E.1 Fourier Coefficient Reference

| Function $f(x)$ on $(-\pi,\pi)$ | $c_n$ (complex) | Notes |
|---|---|---|
| $1$ (constant) | $c_0 = 1$, $c_n = 0$ ($n \neq 0$) | DC component only |
| $e^{iax}$ ($a \in \mathbb{Z}$) | $c_a = 1$, $c_n = 0$ ($n \neq a$) | Single frequency |
| $\cos(mx)$ | $c_m = c_{-m} = 1/2$, rest $0$ | |
| $\sin(mx)$ | $c_m = 1/(2i)$, $c_{-m} = -1/(2i)$ | |
| $\operatorname{sgn}(x)$ (square wave) | $c_n = \frac{2}{in\pi}$ ($n$ odd), $0$ (even/zero) | $1/n$ decay |
| $x$ (sawtooth) | $c_n = \frac{(-1)^{n+1}}{in}$ ($n \neq 0$), $0$ ($n=0$) | $1/n$ decay |
| $\lvert x\rvert$ (triangle) | $c_0 = \pi/2$, $c_n = -\frac{2}{\pi n^2}$ ($n$ odd), $0$ (even $\neq 0$) | $1/n^2$ decay |
| $x^2$ | $c_0 = \pi^2/3$, $c_n = \frac{2(-1)^n}{n^2}$ ($n \neq 0$) | $1/n^2$ decay |
| $e^{ax}$ | $c_n = \frac{(-1)^n\sinh(\pi a)}{\pi(a-in)}$ | $1/n$ decay (jump at $\pm\pi$) |
| $\delta(x)$ (Dirac) | $c_n = \frac{1}{2\pi}$ (all $n$) | No decay (distributional) |

E.2 Properties Reference Table

| Property | Time domain | Frequency domain |
|---|---|---|
| Linearity | $\alpha f + \beta g$ | $\alpha c_n[f] + \beta c_n[g]$ |
| Shift | $f(x - x_0)$ | $e^{-inx_0} c_n[f]$ |
| Modulation | $e^{imx} f(x)$ | $c_{n-m}[f]$ |
| Differentiation | $f'(x)$ | $in \cdot c_n[f]$ |
| Integration | $\int_0^x f(t)\,dt$ | $c_n[f]/(in)$ ($n \neq 0$) |
| Reflection | $f(-x)$ | $c_{-n}[f]$ |
| Conjugation | $\overline{f(x)}$ | $\overline{c_{-n}[f]}$ |
| Convolution | $(f*g)(x) = \frac{1}{2\pi}\int f(t)g(x-t)\,dt$ | $c_n[f] \cdot c_n[g]$ |
| Multiplication | $f(x)g(x)$ | $\sum_k c_k[f]\, c_{n-k}[g]$ (convolution in frequency) |

E.3 Convergence Summary

| Type | Condition | Rate | Gibbs? |
|---|---|---|---|
| $L^2$ | $f \in L^2$ (always) | $\lVert f - S_N f \rVert_2 \to 0$ | N/A |
| Pointwise | $f$ piecewise smooth | $O(1/N)$ error at smooth points | Yes at jumps |
| Uniform | $f \in C^1$, periodic | $O(1/N)$ in $\lVert \cdot \rVert_\infty$ | No |
| Polynomial | $f \in C^k$, periodic | $O(1/N^k)$ | No |
| Exponential | $f$ analytic | $O(e^{-cN})$ | No |
| Cesàro | $f$ continuous | Uniform (Fejér) | No |

Appendix F: Complete Worked Problems

F.1 Problem: Full Solution of the Fourier Series for $f(x) = x(\pi - x)$ on $[0, \pi]$

This is a common PDE boundary value problem. We want the sine series of $f(x) = x(\pi - x)$ on $[0, \pi]$ (odd extension):

$$b_n = \frac{2}{\pi}\int_0^\pi x(\pi - x)\sin(nx)\,dx$$

Step 1: Split the integral.

$$b_n = \frac{2}{\pi}\left[\pi\int_0^\pi x\sin(nx)\,dx - \int_0^\pi x^2\sin(nx)\,dx\right]$$

Step 2: Compute $\int_0^\pi x\sin(nx)\,dx$ by parts.

Set $u = x$, $dv = \sin(nx)\,dx$:

$$\int_0^\pi x\sin(nx)\,dx = \left[\frac{-x\cos(nx)}{n}\right]_0^\pi + \frac{1}{n}\int_0^\pi \cos(nx)\,dx = \frac{-\pi\cos(n\pi)}{n} + \frac{\sin(nx)}{n^2}\bigg|_0^\pi = \frac{\pi(-1)^{n+1}}{n}$$

Step 3: Compute $\int_0^\pi x^2\sin(nx)\,dx$ by parts twice.

First integration by parts, with $u = x^2$, $dv = \sin(nx)\,dx$:

$$\int_0^\pi x^2\sin(nx)\,dx = \left[\frac{-x^2\cos(nx)}{n}\right]_0^\pi + \frac{2}{n}\int_0^\pi x\cos(nx)\,dx = \frac{-\pi^2(-1)^n}{n} + \frac{2}{n}\int_0^\pi x\cos(nx)\,dx$$

Second integration by parts:

$$\int_0^\pi x\cos(nx)\,dx = \left[\frac{x\sin(nx)}{n}\right]_0^\pi - \frac{1}{n}\int_0^\pi \sin(nx)\,dx = 0 + \frac{1}{n^2}\big[\cos(nx)\big]_0^\pi = \frac{(-1)^n - 1}{n^2}$$

So:

$$\int_0^\pi x^2\sin(nx)\,dx = \frac{-\pi^2(-1)^n}{n} + \frac{2((-1)^n - 1)}{n^3}$$

Step 4: Combine.

$$b_n = \frac{2}{\pi}\left[\pi \cdot \frac{\pi(-1)^{n+1}}{n} - \left(\frac{-\pi^2(-1)^n}{n} + \frac{2((-1)^n-1)}{n^3}\right)\right] = \frac{2}{\pi}\left[\frac{\pi^2(-1)^{n+1}}{n} + \frac{\pi^2(-1)^n}{n} - \frac{2((-1)^n-1)}{n^3}\right]$$

$$= \frac{2}{\pi}\left[0 - \frac{2((-1)^n - 1)}{n^3}\right] = \frac{-4((-1)^n - 1)}{\pi n^3} = \begin{cases} 8/(\pi n^3) & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$

Result: $f(x) = x(\pi - x) = \frac{8}{\pi}\sum_{k=0}^\infty \frac{\sin((2k+1)x)}{(2k+1)^3}$

Verification via Parseval:

$$\frac{2}{\pi}\int_0^\pi [x(\pi-x)]^2\,dx = \frac{2}{\pi}\int_0^\pi (\pi^2 x^2 - 2\pi x^3 + x^4)\,dx = \frac{2}{\pi}\left[\frac{\pi^2 \cdot \pi^3}{3} - \frac{\pi^5}{2} + \frac{\pi^5}{5}\right] = \frac{2\pi^4}{30} = \frac{\pi^4}{15}$$

Parseval then gives $\sum_{k=0}^\infty b_{2k+1}^2 = \sum_{k=0}^\infty \frac{64}{\pi^2(2k+1)^6} = \frac{\pi^4}{15}$, hence $\sum_{k=0}^\infty \frac{1}{(2k+1)^6} = \frac{\pi^6}{960}$. (Verifiable!)
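A sketch verifying the closed form against direct quadrature (scipy.integrate.quad; first few $n$ only):

import numpy as np
from scipy.integrate import quad

# b_n = (2/pi) * int_0^pi x(pi - x) sin(nx) dx vs. 8/(pi n^3) for odd n
for n in range(1, 7):
    val, _ = quad(lambda x: x * (np.pi - x) * np.sin(n * x), 0, np.pi)
    bn = 2 / np.pi * val
    exact = 8 / (np.pi * n**3) if n % 2 == 1 else 0.0
    print(f"n={n}: numeric {bn:+.6f}   closed form {exact:+.6f}")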

Application to the heat equation: The solution of $u_t = \alpha^2 u_{xx}$ on $[0,\pi]$ with $u(0,t) = u(\pi,t) = 0$ and $u(x,0) = x(\pi-x)$ is:

$$u(x,t) = \frac{8}{\pi}\sum_{k=0}^\infty \frac{1}{(2k+1)^3}\sin((2k+1)x)\,e^{-\alpha^2(2k+1)^2 t}$$

At $t = 0$ this reduces to the Fourier series of $x(\pi-x)$. At large $t$ the dominant term is $\frac{8}{\pi}\sin(x)e^{-\alpha^2 t}$ - the solution relaxes toward zero via the fundamental mode.

F.2 Problem: Fourier Series of $f(x) = e^{ix}$ - The Modulation Property

This trivial-seeming example illuminates the modulation property.

$f(x) = e^{ix}$ is already a single Fourier mode, with $c_1 = 1$ and $c_n = 0$ for $n \neq 1$.

Now consider $g(x) = e^{iax}f(x) = e^{i(a+1)x}$ for integer $a$. By the modulation property:

$$c_n[g] = c_{n-a}[f] = \delta_{n-a, 1} = \delta_{n, a+1}$$

So $g$ has all its energy concentrated at frequency $n = a+1$. This is the frequency shift or modulation property: multiplying by $e^{iax}$ shifts the spectrum by $a$.

Application to RoPE in detail: In RoPE, the query at position $m$ for dimension pair $i$ is:

$$q_{m,i}^{\text{rotated}} = (q_{2i} + iq_{2i+1}) \cdot e^{im\theta_i}$$

This is multiplication by the complex exponential $e^{im\theta_i}$ - i.e., modulation with phase proportional to the position $m$. The inner product with the key at position $n$:

$$\langle q_m^{\text{rot}}, k_n^{\text{rot}} \rangle = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})e^{im\theta_i} \cdot \overline{(k_{2i}+ik_{2i+1})e^{in\theta_i}}\right] = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})\overline{(k_{2i}+ik_{2i+1})} \cdot e^{i(m-n)\theta_i}\right]$$

The score depends only on $m-n$ - the relative position - by this modulate-then-conjugate-product structure. This is the Fourier shift theorem instantiated in transformer attention.

F.3 Problem: Convergence Rate Comparison

Compare the convergence rates of the Fourier series for four functions:

Function A: Square wave (1 discontinuity per period)

  • $|c_n| = O(1/n)$ -> $\lVert f - S_N f \rVert_2^2 = O(1/N)$
  • Needs $N \sim 10^6$ terms for $10^{-3}$ relative error

Function B: Triangle wave (continuous, corners = 1 discontinuity in the derivative)

  • $|c_n| = O(1/n^2)$ -> $\lVert f - S_N f \rVert_2^2 = O(1/N^3)$
  • Needs $N \sim 10^2$ terms for $10^{-3}$ relative error

Function C: Smooth periodic bump $f(x) = e^{\cos x}$

  • $f$ is analytic -> $|c_n|$ decays exponentially, $|c_n| \leq Ce^{-cn}$
  • Needs $N \sim 20$ terms for $10^{-6}$ relative error

Function D: $f(x) = 1/2 - |x|/(2\pi)$ (slowly varying linear ramp)

  • $|c_n| = 1/(\pi^2 n^2)$ for odd $n$ -> same decay class as the triangle wave
  • Fast convergence

Lesson: The algebraic convergence rates for discontinuous or non-smooth functions make Fourier truncation expensive. This motivates wavelets (Section 20-05), which can represent localized features (edges) efficiently without paying the Gibbs penalty.

F.4 Problem: Convolution in Fourier Space

Setup: Let $f(x) = \cos(x)$ and $g(x) = \sin(2x)$. Compute $(f * g)(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)g(x-t)\,dt$ using the convolution theorem.

Using the convolution theorem: $c_n[f*g] = c_n[f] \cdot c_n[g]$.

Fourier coefficients of $f = \cos x$: $c_1 = c_{-1} = 1/2$, rest zero.

Fourier coefficients of $g = \sin(2x)$: $c_2 = 1/(2i) = -i/2$, $c_{-2} = i/2$, rest zero.

Product $c_n[f] \cdot c_n[g]$: This is zero for all $n$, since the non-zero frequencies of $f$ ($n = \pm 1$) and $g$ ($n = \pm 2$) are disjoint!

So $(f * g)(x) = 0$ - the convolution of $\cos(x)$ and $\sin(2x)$ is identically zero.

Direct verification: $\frac{1}{2\pi}\int_{-\pi}^{\pi} \cos(t)\sin(2(x-t))\,dt = \frac{1}{2\pi}\int_{-\pi}^{\pi} \cos t\,(\sin 2x\cos 2t - \cos 2x\sin 2t)\,dt$. Each integral is zero by orthogonality. $\checkmark$

Lesson: The convolution of two functions with non-overlapping spectra is zero. This is the frequency-domain multiplication interpretation of convolution. In signal processing, this means two signals at different frequencies don't interfere - the superposition principle holds in frequency space.

For AI: This is why attention heads can specialize to different frequency ranges. If head $h_1$ attends to low-frequency positional patterns and head $h_2$ attends to high-frequency syntactic patterns, their convolution-like operations in frequency space are orthogonal.
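The same computation on a discrete periodic grid, via the convolution theorem (a sketch; the grid size is an arbitrary choice):

import numpy as np

N = 256
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f, g = np.cos(x), np.sin(2 * x)

# Discrete analogue of (f*g)(x) = (1/2pi) int f(t) g(x-t) dt:
# circular convolution via FFT, with the grid step folded into the 1/N factor
conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real / N
print(np.max(np.abs(conv)))   # ~1e-16: zero up to roundoff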

F.5 Problem: Parseval and the Isoperimetric Inequality

The isoperimetric inequality - among all closed curves of length $L$, the circle encloses the maximum area - has an elegant proof via Fourier series.

Setup: Parameterize a smooth closed curve of length $L$ by arclength: $(x(t), y(t))$ for $t \in [0, L]$, with period $L$. The constraint is $\dot{x}^2 + \dot{y}^2 = 1$.

Area via Green's theorem: $A = \frac{1}{2}\int_0^L (x\dot{y} - y\dot{x})\,dt$.

Fourier expand: $x(t) = \sum_n a_n e^{2\pi int/L}$, $y(t) = \sum_n b_n e^{2\pi int/L}$.

Apply Parseval to the arclength constraint $\dot{x}^2 + \dot{y}^2 = 1$:

$$\sum_n \left(\frac{2\pi n}{L}\right)^2 (|a_n|^2 + |b_n|^2) = 1$$

Bound the area: Using the Fourier expansion, the bound $|a_n\bar{b}_n - \bar{a}_n b_n| \leq |a_n|^2 + |b_n|^2$, and $|n| \leq n^2$ for integers:

$$|A| = \pi \left|\sum_n n(a_n\bar{b}_n - \bar{a}_n b_n)\right| \leq \pi \sum_n |n|\,(|a_n|^2 + |b_n|^2) \leq \frac{\pi L^2}{4\pi^2}\sum_n \frac{4\pi^2 n^2}{L^2}(|a_n|^2+|b_n|^2) = \frac{L^2}{4\pi}$$

Equality holds when only the $n = \pm 1$ terms are non-zero - i.e., when the curve is a circle.

Conclusion: $A \leq L^2/(4\pi)$, with equality for the circle. This is the isoperimetric inequality, proved entirely via Parseval's theorem.

For AI: Isoperimetric inequalities in function spaces underlie capacity control in learning theory. The VC dimension and Rademacher complexity measure the "surface area to volume ratio" of hypothesis classes in high-dimensional function spaces - and Fourier analysis provides the tools to compute these quantities.


Appendix G: In-Depth AI Applications

G.1 Transformer Positional Encodings: A Unified View

To understand the deep connection between Fourier series and transformer positional encodings, let us develop the theory carefully.

The fundamental problem in attention. The self-attention mechanism $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top/\sqrt{d_k})V$ is permutation-equivariant: permuting the input tokens permutes the output, but the attention weights treat all positions symmetrically. Without positional encodings, "the cat sat on the mat" and "the mat sat on the cat" would produce the same attention weights.

The sinusoidal solution (Vaswani et al., 2017). Add a position-dependent vector $\text{PE}(m) \in \mathbb{R}^d$ to the $m$-th token embedding before attention. Use:

$$\text{PE}(m)_{2i} = \sin\!\left(\frac{m}{\theta_i}\right), \quad \text{PE}(m)_{2i+1} = \cos\!\left(\frac{m}{\theta_i}\right), \quad \theta_i = 10000^{2i/d}$$

The key property. For any fixed offset $k$, there exists a linear map $M_k \in \mathbb{R}^{d \times d}$ such that $\text{PE}(m + k) = M_k \text{PE}(m)$. This is because:

$$\begin{pmatrix} \sin((m+k)/\theta_i) \\ \cos((m+k)/\theta_i) \end{pmatrix} = \begin{pmatrix} \cos(k/\theta_i) & \sin(k/\theta_i) \\ -\sin(k/\theta_i) & \cos(k/\theta_i) \end{pmatrix} \begin{pmatrix} \sin(m/\theta_i) \\ \cos(m/\theta_i) \end{pmatrix}$$

Each dimension pair $(2i, 2i+1)$ transforms via a rotation by $k/\theta_i$ - a Fourier shift in 2D.

Why different frequencies? At frequency $1/\theta_i = 10000^{-2i/d}$:

  • $i = 0$: $\theta_0 = 1$, period $= 2\pi \approx 6.28$ - distinguishes adjacent tokens
  • $i = d/4$: $\theta_{d/4} = 10000^{1/2} = 100$ - distinguishes tokens $\sim 100$ apart
  • $i = d/2-1$: $\theta_{d/2-1} \approx 10000$ - distinguishes tokens up to distance $\sim 10000$

The frequencies form a geometric ladder of wavelengths spanning $2\pi$ up to $\sim 2\pi \times 10000$ - a multi-scale analogue of the harmonics of a Fourier series on $[0, 10000]$.

The Fourier basis interpretation. Define $\mathbf{z}_m = \text{PE}(m)$. The inner product $\mathbf{z}_m \cdot \mathbf{z}_n = \sum_{i=0}^{d/2-1} \cos((m-n)/\theta_i)$ depends only on $m - n$. This is the autocorrelation function of the positional encoding - it measures how "similar" two positions are based on their relative offset. A minimal sketch verifying the shift property appears below.

Limitation. These encodings are absolute: the encoding of position 5 is the same regardless of whether the sequence has length 10 or 1000. For sequences longer than those seen in training, the high-frequency components ($\theta_0 = 1$) still behave as in training, but the model has never seen the low-frequency components ($\theta_{d/2-1} \approx 10000$) vary much. This causes length-generalization failure.
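The sketch promised above: build the encodings and check $\text{PE}(m+k) = M_k\,\text{PE}(m)$ numerically (dimension and positions are arbitrary test values):

import numpy as np

d, base = 16, 10000.0
theta = base ** (2 * np.arange(d // 2) / d)     # theta_i = 10000^{2i/d}

def pe(m):
    out = np.empty(d)
    out[0::2] = np.sin(m / theta)               # even dims: sin
    out[1::2] = np.cos(m / theta)               # odd dims: cos
    return out

def shift_matrix(k):
    M = np.zeros((d, d))
    for i, th in enumerate(theta):
        c, s = np.cos(k / th), np.sin(k / th)
        M[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]  # 2x2 rotation block
    return M

m, k = 7, 5
print(np.allclose(pe(m + k), shift_matrix(k) @ pe(m)))   # True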

G.2 RoPE: Relative Positions via Complex Multiplication

The RoPE construction. Instead of adding $\text{PE}(m)$ to the embedding, RoPE rotates the query and key vectors:

$$q_m = R_m q, \quad k_n = R_n k$$

where $R_m$ is the block-diagonal rotation matrix:

$$R_m = \begin{pmatrix} R(\theta_0, m) & & \\ & R(\theta_1, m) & \\ & & \ddots \end{pmatrix}, \quad R(\theta, m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

Computing the attention score:

$$q_m^\top k_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

since $R_m^\top R_n = R_{n-m}$ (a rotation by $n - m$). The score is a function of the offset $m - n$ only - position-independent in the absolute sense, position-sensitive in the relative sense.

The complex representation. In complex notation, the $i$-th pair $(q_{2i}, q_{2i+1})$ maps to the complex number $q_{2i} + iq_{2i+1} \in \mathbb{C}$. The RoPE operation is:

$$q_{2i} + iq_{2i+1} \;\mapsto\; (q_{2i} + iq_{2i+1}) \cdot e^{im\theta_i}$$

This is multiplication by the unit complex number $e^{im\theta_i}$ - a phase rotation of the complex Fourier coefficient at frequency $\theta_i$.

Long-context extrapolation. With frequencies $1/\theta_i = 10000^{-2i/d}$, the maximum period is $2\pi\,\theta_{d/2-1} \approx 2\pi \times 10000 \approx 62832$. For positions beyond this, the encoding wraps around (becomes periodic) and position information is lost. In practice, models with standard RoPE struggle at contexts well beyond their training length (e.g. $\sim 4096$ tokens).

Fix - extended RoPE: Scale $\theta_i \to \theta_i \cdot s$ (equivalently, divide positions by $s$), where $s$ is the ratio of the new context length to the training context length (Position Interpolation, Chen et al., 2023). Alternatively, YaRN (Peng et al., 2023) uses a non-uniform frequency scaling based on Fourier analysis of the empirical attention patterns.

G.3 Spectral Bias: Mathematical Theory

The spectral bias (or frequency principle) of neural networks has a rigorous formulation in terms of the NTK and Fourier analysis.

The NTK framework. For a wide neural network $f_\theta(\mathbf{x})$ trained by gradient descent, the NTK $\Theta(\mathbf{x}, \mathbf{x}') = \nabla_\theta f_\theta(\mathbf{x}) \cdot \nabla_\theta f_\theta(\mathbf{x}')$ is approximately constant during training. The training dynamics are:

$$\frac{d f_\theta(\mathbf{x})}{dt} = -\int \Theta(\mathbf{x}, \mathbf{x}')\, \nabla_{f} \mathcal{L}(f(\mathbf{x}'), y')\,d\mathbf{x}'$$

This is a kernel regression problem in function space, where $\Theta$ acts as the kernel.

Spectral decomposition. Let $\lambda_k, \phi_k$ be the eigenvalues/eigenfunctions of $\Theta$ in $L^2$. The projection of the target $f^*$ onto $\phi_k$ has coefficient $\alpha_k^\infty = \langle f^*, \phi_k \rangle$. Under gradient descent, the corresponding coefficient of the network evolves as:

$$\alpha_k(t) = \alpha_k(0)\, e^{-\lambda_k t} + \left(1 - e^{-\lambda_k t}\right) \alpha_k^\infty$$

The coefficient along eigenfunction $k$ converges at rate $\lambda_k$ - fast if $\lambda_k$ is large, slow if $\lambda_k$ is small.

The spectral bias. For standard MLPs with Gaussian initialization and ReLU activations, the NTK has eigenvalues $\lambda_k$ that decrease with frequency (high-frequency eigenfunctions have small eigenvalues). Therefore:

  • Low-frequency Fourier components have large eigenvalues -> learned quickly
  • High-frequency components have small eigenvalues -> learned slowly

Quantitative: the Barron approximation theorem. A function with Barron norm $C = \int |\omega|\,|\hat{f}(\omega)|\,d\omega$ can be approximated to $\varepsilon$ accuracy in $L^2$ by a network with $O(C^2 \varepsilon^{-2})$ neurons. The Barron norm penalizes high-frequency components: if $f^*$ has large high-frequency content ($|\hat{f^*}(\omega)|$ large for large $|\omega|$), then $C$ is large and the network needs more parameters.

Implications for practice:

  1. Overfitting to noise: High-frequency noise (large $|\omega|$) is learned last -> early stopping acts as a frequency low-pass filter
  2. Data augmentation: Geometric augmentations (rotations, flips) destroy spurious high-frequency features -> force the network to rely on low-frequency structure -> better generalization
  3. NeRF: Without positional encoding, networks can't represent the high-frequency variation in scene appearance -> Fourier feature encodings inject high frequencies explicitly

G.4 SIREN Networks: Replacing ReLU with Sine

SIREN (Sitzmann et al., 2020). Instead of $\text{ReLU}$ activations, SIREN uses sines:

$$f_\theta(\mathbf{x}) = W_L\, \phi_{L-1} \circ \cdots \circ \phi_1(\mathbf{x}), \quad \phi_i(\mathbf{x}) = \sin(W_i \mathbf{x} + \mathbf{b}_i)$$

The gradient $\nabla_\mathbf{x} f$ is again a SIREN-like network: differentiating turns sines into cosines, which are phase-shifted sines. This is the key property: the derivative of a SIREN is itself a SIREN with the same architecture, so supervision on derivatives (as in PDE residuals) stays in the same function class.

Fourier interpretation. The output of a SIREN is a sum of sinusoids at multiple frequencies and phases. Composing sine activations creates a rich set of Fourier components, so a SIREN can represent high-frequency content that a standard ReLU network would struggle to learn.

Mathematical property. For a single-layer SIREN with $n$ neurons, the output is:

$$f(x) = \sum_{k=1}^n c_k \sin(\omega_k x + \phi_k)$$

a sum of $n$ Fourier basis functions with learned frequencies, amplitudes, and phases. Unlike a standard Fourier series with fixed frequencies $\omega_k = k$, a SIREN can choose its frequencies freely - making it more expressive. A minimal forward-pass sketch follows.

Applications: Implicit neural representations (NeRF, video), solving PDEs (physics-informed neural networks), generative modeling of images and 3D shapes.
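The sketch promised above: the single-layer case with random (untrained) parameters; omega_0 is the frequency-scaling constant suggested in the SIREN paper.

import numpy as np

rng = np.random.default_rng(0)
n_neurons, omega_0 = 64, 30.0

w = omega_0 * rng.uniform(-1, 1, n_neurons)           # frequencies omega_k
phi = rng.uniform(-np.pi, np.pi, n_neurons)           # phases phi_k
c = rng.normal(0, 1 / np.sqrt(n_neurons), n_neurons)  # output weights c_k

def siren(x):
    # f(x) = sum_k c_k sin(w_k x + phi_k), for a batch of scalar inputs
    return np.sin(np.outer(x, w) + phi) @ c

print(siren(np.linspace(-1, 1, 5)))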

G.5 The Spectral Analysis of LLM Attention Patterns

Recent research (2023-2024) has revealed striking spectral structure in LLM attention patterns:

Attention as a filter. In a transformer layer, the attention mechanism computes:

$$\mathbf{y}_m = \sum_n A_{mn} V \mathbf{x}_n$$

where $A_{mn} = \text{softmax}_n(q_m^\top k_n / \sqrt{d_k})$. If $A_{mn}$ depends only on $m - n$ (approximately true for well-trained models with relative positional encodings), then this is a convolution with kernel $a[k]$, where $A_{mn} = a[m-n]$. In frequency space:

$$\hat{Y}(\xi) = \hat{A}(\xi) \cdot \hat{X}(\xi)$$

The attention pattern is a frequency filter. Different heads learn different filters:

  • Local heads: $A_{mn}$ large only for $|m - n|$ small -> low-pass filter (smooth the sequence)
  • Global heads: $A_{mn}$ relatively uniform -> near-constant filter (compute the mean)
  • Periodic heads: $A_{mn}$ large when $m - n \equiv k \pmod{T}$ -> periodic filter (detect regular patterns)

Empirical findings:

  • In BERT, some heads function as "copy heads" (attend to identical tokens) and "position heads" (attend to positions at specific relative offsets)
  • In GPT-2/3, induction heads attend strongly to previous occurrences of the current token
  • The frequency-domain view explains why attention heads naturally factorize into frequency bands

WeightWatcher (Martin & Mahoney, 2021). The singular value spectra of weight matrices in LLMs follow power laws $\rho(\sigma) \sim \sigma^{-\alpha}$, with the exponent $\alpha$ related to model quality: well-trained models tend to have $\alpha \in [2, 4]$, while overfit or poorly trained models fall outside this range. This is spectral analysis of the weight matrices - the spectral density reflects effective rank and implicit regularization.

G.6 Circular Convolution and LLMs

The circular convolution theorem (proved in Section 20-04) states that in the DFT domain, elementwise multiplication corresponds to circular convolution in the time domain. This has a direct connection to autoregressive language models.

Autoregressive language models as causal filters. A causal language model computes $p(x_t \mid x_1, \ldots, x_{t-1})$ by attending only to previous tokens. To the extent the attention pattern is shift-invariant, it acts in the frequency domain as:

$$\hat{P}(\xi) = \hat{A}(\xi) \cdot \hat{X}(\xi)$$

where $\hat{A}(\xi)$ is the response of a one-sided (causal) filter. The causal constraint $A_{mn} = 0$ for $n > m$ has a frequency-domain signature: for a causal filter, the real and imaginary parts of the frequency response are linked by a Hilbert transform (the Kramers-Kronig relations), so the phase cannot be chosen independently of the magnitude.

State space models. Models like S4, H3, and Mamba parameterize the causal convolution kernel $h[n]$ through a (typically diagonal) state matrix $A$ and evaluate it efficiently in frequency space. The output is:

$$y[t] = (x * h)[t] = \text{IFFT}(\hat{X} \cdot \hat{H})$$

where $\hat{H}$ is the learned frequency response. This is exactly the convolution theorem applied to sequence modeling, and it enables $O(N \log N)$ computation instead of $O(N^2)$ attention.
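A sketch of the FFT path (the decaying kernel is a stand-in for a learned SSM kernel; zero-padding to length $2N$ makes the circular convolution agree with the causal linear one):

import numpy as np

N = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal(N)
h = np.exp(-0.01 * np.arange(N))         # stand-in causal kernel h[n]

L = 2 * N                                # pad to avoid circular wrap-around
y_fft = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)[:N]

y_direct = np.convolve(x, h)[:N]         # O(N^2) reference
print(np.allclose(y_fft, y_direct))      # True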


Appendix H: Connections to Probability and Statistics

H.1 Characteristic Functions as Fourier Transforms of Distributions

The characteristic function of a random variable $X$ is:

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = \int_{-\infty}^{\infty} e^{itx}\,dF_X(x)$$

This is the Fourier transform of the probability measure $F_X$! The characteristic function exists for every random variable (unlike the moment generating function) and uniquely determines the distribution.

Fourier inversion of densities. If $X$ has density $p_X$:

$$\varphi_X(t) = \int e^{itx} p_X(x)\,dx = \hat{p}_X(-t)$$

The density is recovered by the inverse Fourier transform: $p_X(x) = \frac{1}{2\pi}\int e^{-itx} \varphi_X(t)\,dt$.

Central limit theorem via characteristic functions. Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The standardized sum $S_n = (X_1 + \cdots + X_n - n\mu)/(\sigma\sqrt{n})$ has characteristic function:

$$\varphi_{S_n}(t) = \left[\varphi_X\!\left(\frac{t}{\sigma\sqrt{n}}\right) e^{-it\mu/(\sigma\sqrt{n})}\right]^n$$

Expanding $\varphi_X(s) = 1 + \mu(is) + \frac{\mu^2 + \sigma^2}{2}(is)^2 + O(s^3)$ for small $s$ and substituting:

$$\varphi_{S_n}(t) = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n \to e^{-t^2/2} \quad \text{as } n \to \infty$$

The Gaussian $e^{-t^2/2}$ is the Fourier transform (characteristic function) of the standard normal density. The CLT is a convergence in Fourier space to the Gaussian fixed point.

For AI - Training stability. The distribution of gradient updates in neural network training follows a CLT when the batch size is large: the stochastic gradient is a sum of $B$ per-sample gradients, so its distribution is approximately Gaussian for large $B$. This is why large-batch training behaves differently from small-batch training - the gradient noise becomes more Gaussian (less heavy-tailed) as $B$ grows.
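A sketch of the CLT in Fourier space: the characteristic function of a standardized sum of uniforms marching toward $e^{-t^2/2}$ (the uniform distribution is our own choice of example):

import numpy as np

t = np.linspace(-5, 5, 11)

def phi_uniform(t):
    # characteristic function of Uniform(-sqrt(3), sqrt(3)): mean 0, variance 1
    a = np.sqrt(3)
    return np.sinc(a * t / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)

for n in (1, 4, 32, 256):
    phi_Sn = phi_uniform(t / np.sqrt(n)) ** n
    gap = np.max(np.abs(phi_Sn - np.exp(-t**2 / 2)))
    print(f"n={n:4d}  max deviation from the Gaussian: {gap:.4f}")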

H.2 Autocorrelation and Power Spectral Density

For a stationary stochastic process $\{X_t\}$ with mean zero, the autocorrelation function is:

$$R_X(\tau) = \mathbb{E}[X_t X_{t+\tau}]$$

The power spectral density (PSD) is the Fourier transform of $R_X$:

$$S_X(f) = \int_{-\infty}^{\infty} R_X(\tau)\, e^{-2\pi if\tau}\,d\tau$$

This is the Wiener-Khinchin theorem: the PSD and the autocorrelation are a Fourier pair.

Interpretation of the PSD:

  • $S_X(f) \geq 0$ for all $f$ (non-negativity, by Bochner's theorem)
  • $\int S_X(f)\,df = R_X(0) = \mathbb{E}[X_t^2]$ (total power)
  • $S_X(f)$ tells how much of the signal's power is concentrated near frequency $f$

White noise: $S_X(f) = N_0$ (constant) -> $R_X(\tau) = N_0\,\delta(\tau)$ (uncorrelated across all lags). White noise has equal power at all frequencies - it is the "maximally random" process.

$1/f$ noise (pink noise): $S_X(f) \propto 1/f$ - found in LLM token frequency distributions (Zipf's law), financial returns, and EEG signals. The $1/f$ spectrum is exactly at the boundary between stationary ($S_X$ integrable) and non-stationary processes.

For AI: The power spectral density of the hidden states in LLMs has been studied empirically. Models trained on natural language show 1/f-like spectra in their activations, consistent with the fractal/scale-free structure of language. This suggests that efficient LLM architectures should process information at multiple timescales simultaneously - exactly what multi-head attention (with different positional frequency ranges per head) achieves.
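A discrete sanity check of the Fourier-pair relationship (our own sketch, using an AR(1) process; in the discrete, circular setting the periodogram is exactly the DFT of the circular sample autocorrelation):

import numpy as np

rng = np.random.default_rng(0)
N, a = 2**14, 0.9
x = np.zeros(N)
for t in range(1, N):                      # AR(1): x_t = a x_{t-1} + noise
    x[t] = a * x[t - 1] + rng.standard_normal()

periodogram = np.abs(np.fft.rfft(x)) ** 2 / N

# circular sample autocorrelation via the FFT, then transformed back
r = np.fft.irfft(np.abs(np.fft.rfft(x)) ** 2) / N
psd_from_autocorr = np.fft.rfft(r).real
print(np.allclose(periodogram, psd_from_autocorr))   # True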

H.3 Fourier Analysis and the Sampling Theorem

The Shannon-Nyquist Sampling Theorem connects continuous Fourier analysis to discrete signal representation:

Theorem (Shannon, 1949; Nyquist, 1928): A bandlimited signal $f$ with $\hat{f}(\xi) = 0$ for $|\xi| > W$ (bandwidth $W$) can be perfectly reconstructed from its samples $\{f(n/(2W))\}_{n \in \mathbb{Z}}$:

$$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right) \operatorname{sinc}(2Wt - n)$$

Proof sketch. Define the sampled signal $f_s(t) = \sum_n f(n/(2W))\,\delta(t - n/(2W))$. Its Fourier transform is the periodized spectrum $\hat{f}_s(\xi) = 2W \sum_k \hat{f}(\xi - 2kW)$. Since $f$ is bandlimited to $|\xi| \leq W$, the copies don't overlap (no aliasing). Applying an ideal low-pass filter with cutoff $W$ recovers $\hat{f}$, and the inverse FT gives $f$. $\square$

Aliasing: If $f$ has frequency content above $W$ but is sampled at rate $2W$, the high-frequency copies in $\hat{f}_s$ overlap, corrupting the spectrum. This is aliasing - high-frequency components masquerade as low-frequency ones.

For AI - Tokenization: Tokenization is analogous to sampling. A byte-pair-encoding tokenizer samples text at a coarse rate - on the order of one token per word. If the "semantic content" of text has structure at finer granularity than the token, the tokenizer introduces aliasing: some semantic distinctions are lost. This is why character-level models can capture morphological patterns that word-level models miss.

Anti-aliasing in CNNs: The standard strided convolution in CNNs (stride 2 for downsampling) can alias: it samples the feature map at rate $1/2$ without first applying a low-pass filter. Zhang (2019) showed that adding a blur filter before strided convolutions dramatically improves shift-equivariance. This is the CNN version of anti-aliasing.
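A sketch of sinc reconstruction for a signal bandlimited below $W = 1$ (our own test signal; the sample range is truncated, so the match is close but not exact):

import numpy as np

W = 1.0
f = lambda t: np.cos(2 * np.pi * 0.7 * t) + 0.5 * np.sin(2 * np.pi * 0.3 * t)

t = np.linspace(-3, 3, 601)
n = np.arange(-200, 201)                     # truncated sample indices
samples = f(n / (2 * W))                     # samples at the Nyquist rate 2W

# f(t) = sum_n f(n/2W) sinc(2W t - n); np.sinc is the normalized sinc
recon = samples @ np.sinc(2 * W * t[None, :] - n[:, None])
print(np.max(np.abs(recon - f(t))))          # small truncation error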


Appendix I: Implementation Details

I.1 Numerical Computation of Fourier Coefficients

Direct formula. For a function given by samples $f_0, \ldots, f_{N-1}$ at equally spaced points $x_k = -\pi + 2\pi k/N$:

$$c_n \approx \frac{1}{N}\sum_{k=0}^{N-1} f_k\, e^{-inx_k} = \frac{(-1)^n}{N}\,\text{DFT}(f)[n]$$

This approximates the integral $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ by the rectangle rule. The $(-1)^n = e^{in\pi}$ phase comes from the grid starting at $-\pi$ rather than $0$; sampling on $[0, 2\pi)$ instead (legitimate by periodicity) removes it.

Accuracy. For a $C^k$ periodic function the approximation error decays like $O(1/N^k)$ (spectrally fast for analytic functions). For discontinuous functions, the error is $O(1/N)$ regardless of how fine the sampling.

Using NumPy:

import numpy as np

# N equispaced samples of f on [-pi, pi); keep N even so the (-1)^n
# phase correction below is also valid for the negative frequencies
N = 256
f = np.cos                      # stand-in for any 2*pi-periodic function
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f_samples = f(x)
# Fourier coefficients (approximate); (-1)^n corrects the phase from
# the grid starting at -pi rather than 0
c = np.fft.fft(f_samples) / N * (-1.0) ** np.arange(N)
# c[n] is the n-th coefficient for n = 0, 1, ..., N-1
# Negative frequencies: c[N-n] corresponds to c_{-n}

Aliasing in discrete computation. The DFT computes only $N$ coefficients $c_0, \ldots, c_{N-1}$ (interpreted modulo $N$). For a bandlimited function with $c_n = 0$ for $|n| > N/2$, the computation is exact. Otherwise, the $n$-th computed coefficient is actually $c_n + c_{n+N} + c_{n-N} + \cdots$ (all aliased copies).

I.2 Plotting Spectra Correctly

Common mistakes when plotting Fourier spectra:

  1. Frequency axis: NumPy's FFT output has frequencies $[0, 1, \ldots, N/2-1, -N/2, \ldots, -1]/N$ in units of cycles/sample. Use np.fft.fftfreq(N) to get the correct axis.

  2. Centering: Use np.fft.fftshift to rearrange the output so that zero frequency is in the center.

  3. One-sided vs two-sided: For real signals, plot only $n \geq 0$ (the two-sided spectrum is conjugate-symmetric). Use np.fft.rfft for real signals - it returns only the positive-frequency half.

  4. dB scale: Power spectra are often plotted in dB: $10\log_{10}|c_n|^2$, equivalently $20\log_{10}|c_n|$. This makes it easier to see components spanning many orders of magnitude.

  5. Parseval check: Always verify $\sum |c_n|^2 \approx \frac{1}{N}\sum |f_k|^2$ (with $c = \text{FFT}/N$). If these don't match, there's a normalization error. The snippet below exercises all five items.
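A sketch tying the checklist together for a simple test tone (sampling rate and signal are our own choices; matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt

N, fs = 1024, 100.0                        # samples and sampling rate (Hz)
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5 * t) + 0.01 * np.random.randn(N)

X = np.fft.rfft(x) / N                     # one-sided spectrum (item 3)
freqs = np.fft.rfftfreq(N, d=1 / fs)       # correct frequency axis (item 1)

plt.plot(freqs, 20 * np.log10(np.abs(X) + 1e-12))   # dB scale (item 4)
plt.xlabel("frequency (Hz)"); plt.ylabel("|X| (dB)")

# Parseval check (item 5); one-sided bins 1..N/2-1 count twice
power = 2 * np.sum(np.abs(X[1:-1])**2) + np.abs(X[0])**2 + np.abs(X[-1])**2
print(power, np.mean(x**2))                # should agree closely
plt.show()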

I.3 RoPE Implementation Reference

import numpy as np

def rope_embed(x, positions, base=10000):
    """
    Apply RoPE to input tensor x.
    x: shape (seq_len, d_model) - query or key vectors
    positions: integer positions, shape (seq_len,)
    Returns: rotated x, same shape
    """
    d = x.shape[-1]
    positions = np.asarray(positions, dtype=float)
    # Frequencies: theta_i = base^{-2i/d} for i = 0, ..., d/2 - 1
    i = np.arange(0, d, 2)
    freqs = base ** (-i / d)          # shape (d/2,)
    # Angles: m * theta_i
    angles = positions[:, None] * freqs[None, :]  # (seq_len, d/2)
    # Build rotation: cos(angle) and sin(angle)
    cos = np.cos(angles)   # (seq_len, d/2)
    sin = np.sin(angles)   # (seq_len, d/2)
    # Rotate pairs (x_{2i}, x_{2i+1})
    x_even = x[:, 0::2]   # shape (seq_len, d/2)
    x_odd  = x[:, 1::2]
    x_rot_even = x_even * cos - x_odd * sin
    x_rot_odd  = x_even * sin + x_odd * cos
    # Interleave back
    x_rot = np.empty_like(x)
    x_rot[:, 0::2] = x_rot_even
    x_rot[:, 1::2] = x_rot_odd
    return x_rot

Verification: For any query $q$ at position $m$ and key $k$ at position $n$, the score depends only on the relative offset $m - n$:

score(m, n) = dot(rope_embed(q, [m]), rope_embed(k, [n]))
            == score(m + delta, n + delta)  # for any integer delta

This confirms that the attention score depends only on relative position.

