Fourier Series: Part 1

1. Intuition

1.1 What Is a Fourier Series?

Strike a guitar string and you hear a pitch - the fundamental frequency. Listen more carefully and you hear overtones: frequencies at twice, three times, four times the fundamental. The rich timbre of a guitar versus a piano playing the same note comes entirely from the relative amplitudes of these overtones. A Fourier series is the precise mathematical statement of this physical observation: every periodic function is a (possibly infinite) weighted sum of sinusoids at integer multiples of a fundamental frequency.

More precisely, if $f$ is a $2\pi$-periodic function satisfying mild regularity conditions, then:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos(nx) + b_n \sin(nx) \right]$$

where the coefficients $a_n$ and $b_n$ are uniquely determined by $f$. The term with $n=1$ is the fundamental harmonic, and the terms with $n>1$ are the overtones or harmonics.

The miracle is how general this is. The function $f$ does not have to be smooth. The square wave - which jumps discontinuously between $-1$ and $+1$ - has a Fourier series:

$$f(x) = \frac{4}{\pi}\left(\sin x + \frac{\sin 3x}{3} + \frac{\sin 5x}{5} + \cdots\right) = \frac{4}{\pi}\sum_{k=0}^\infty \frac{\sin((2k+1)x)}{2k+1}$$

Summing 100 terms gives a function that is nearly indistinguishable from the square wave - except near the jumps, where a small overshoot persists no matter how many terms you include. This is the Gibbs phenomenon, and we will study it carefully.

Non-example: Not every function has a convergent Fourier series in the pointwise sense. A function that is not in $L^2[-\pi,\pi]$ (e.g., $f(x) = 1/x$ near $x=0$) does not have a well-defined Fourier series. A continuous function can even have a Fourier series that diverges at a single point (Du Bois-Reymond, 1876). The correct framework is $L^2$ convergence, not pointwise convergence.
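The square-wave partial sums are easy to compute numerically. A minimal NumPy sketch (illustrative, not part of the lesson's code) that sums the first 100 odd harmonics and exhibits both the good approximation away from the jumps and the persistent overshoot near them:

```python
import numpy as np

def square_wave_partial_sum(x, n_terms):
    """Partial sum (4/pi) * sum_{k=0}^{n_terms-1} sin((2k+1)x)/(2k+1)."""
    total = np.zeros_like(x)
    for k in range(n_terms):
        n = 2 * k + 1
        total += np.sin(n * x) / n
    return 4.0 / np.pi * total

x = np.linspace(-np.pi, np.pi, 20001)
s100 = square_wave_partial_sum(x, 100)

# Away from the jumps, 100 terms track sgn(x) closely...
mid = (np.abs(x) > 0.5) & (np.abs(x) < np.pi - 0.5)
print(np.max(np.abs(s100[mid] - np.sign(x[mid]))))  # small (< 0.1)

# ...but the maximum near x = 0 overshoots 1 by roughly 9% of the jump (Gibbs):
print(s100.max())  # ≈ 1.179
```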

1.2 Why It Matters for AI

Fourier series are not an abstract curiosity - they are embedded in the architecture of modern LLMs, CNNs, and audio models:

For AI:

  • Positional encodings in Transformers (Vaswani et al., 2017) use $\sin(pos/10000^{2i/d})$ and $\cos(pos/10000^{2i/d})$ - these are Fourier basis functions at geometrically spaced frequencies.
  • Rotary Position Embedding (RoPE) (Su et al., 2021), used in LLaMA-3, GPT-NeoX, and Mistral, interprets token positions as rotations in the complex plane - directly using complex Fourier basis vectors $e^{in\theta}$.
  • Spectral bias (Rahaman et al., 2019): neural networks learn low-frequency Fourier components of the target function first and high-frequency components last. This governs convergence speed, generalization, and the effectiveness of data augmentation.
  • Random Fourier features (Rahimi & Recht, 2007): kernel machines (SVMs, GPs) can be approximated in $O(D)$ time by projecting data onto $D$ random Fourier basis functions sampled from the kernel's spectral density.

1.3 Historical Timeline

FOURIER ANALYSIS - HISTORICAL MILESTONES
========================================================================

  1748  Euler studies vibrating strings; writes trigonometric series
  1807  Fourier presents "Theorie de la chaleur" to Paris Academy;
        claims every function has a trigonometric series expansion
        (claim rejected - Laplace and Lagrange are skeptical)
  1822  Fourier publishes "Theorie analytique de la chaleur"
  1829  Dirichlet proves convergence for piecewise smooth functions;
        gives first rigorous proof of pointwise convergence
  1854  Riemann extends the theory; defines the Riemann integral partly
        for the purposes of studying Fourier series
  1873  Du Bois-Reymond exhibits a continuous function whose Fourier
        series diverges at a point - Fourier's original claim was wrong
  1900  Fejer proves Cesaro summability: every continuous function's
        Fourier series converges in the Cesaro sense
  1907  Parseval's identity proved rigorously by Fatou & Riesz
  1966  Carleson proves pointwise convergence a.e. for L2 functions
        (Abel Prize, 2006)
  2017  Vaswani et al. use Fourier basis for transformer PE
  2021  Su et al. introduce RoPE; now standard in LLaMA-3, Mistral
  2022  FNet (Lee-Thorp) replaces attention with Fourier transforms
  2023  Spectral analysis of LLM weight matrices goes mainstream

========================================================================

1.4 The Geometric Picture

The key insight that makes Fourier series transparent is to think of functions as vectors in an infinite-dimensional inner product space.

On the interval $[-\pi, \pi]$, define the inner product of two functions $f$ and $g$ by:

$$\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$$

The functions $\{e^{inx}\}_{n \in \mathbb{Z}}$ form an orthonormal set under this inner product:

$$\langle e^{inx}, e^{imx} \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i(n-m)x}\,dx = \delta_{nm}$$

where $\delta_{nm}$ is the Kronecker delta. This is easy to verify: if $n \neq m$, the integral of $e^{i(n-m)x}$ over a full period is zero.

Now the Fourier series $f(x) = \sum_n c_n e^{inx}$ is just the expansion of $f$ in this orthonormal basis. The coefficient $c_n$ is the projection of $f$ onto the basis vector $e^{inx}$:

$$c_n = \langle f, e^{inx} \rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x) e^{-inx}\,dx$$

This is identical to the formula for coordinates in any orthonormal basis: $c_n = \langle f, \mathbf{e}_n \rangle$. The Fourier series is not a formula to be memorized - it is the inevitable consequence of projecting onto an orthonormal basis.
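The projection formula can be checked numerically: approximating the inner-product integral by a Riemann sum on an equispaced grid recovers the coefficients of a known trigonometric polynomial exactly. A minimal NumPy sketch (helper name ours):

```python
import numpy as np

# Approximate c_n = <f, e^{inx}> by a Riemann sum on an equispaced grid.
x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dx = x[1] - x[0]

def coeff(f_vals, n):
    """Numerical projection of sampled f onto the basis vector e^{inx}."""
    return np.sum(f_vals * np.exp(-1j * n * x)) * dx / (2 * np.pi)

# A known trigonometric polynomial: f(x) = 3 e^{i 2x} - 0.5 e^{-i x}
f = 3 * np.exp(2j * x) - 0.5 * np.exp(-1j * x)

print(coeff(f, 2))   # ≈ 3
print(coeff(f, -1))  # ≈ -0.5
print(coeff(f, 5))   # ≈ 0
```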

For AI: This geometric view directly explains RoPE. In RoPE, each attention head dimension pair $(q_{2i}, q_{2i+1})$ is treated as a 2D vector and rotated by angle $m\theta_i$ for token position $m$. This rotation is multiplication by $e^{im\theta_i}$ - a Fourier basis vector. The inner product between rotated queries and keys then depends only on their relative position $m - n$, making the positional encoding relative rather than absolute.

1.5 Three Equivalent Representations

Every Fourier series can be written in three equivalent forms. Each has advantages in different contexts.

Real (sine-cosine) form:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left[ a_n \cos(nx) + b_n \sin(nx) \right]$$

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx$$

Best for: real-valued functions where you want to see real amplitudes explicitly.

Complex exponential form:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}, \qquad c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$$

Best for: computation, derivations, and AI applications (RoPE, FNet). More compact; the complex exponentials are eigenfunctions of differentiation.

Amplitude-phase form:

$$f(x) = A_0 + \sum_{n=1}^\infty A_n \cos(nx + \phi_n)$$

where $A_n = \sqrt{a_n^2 + b_n^2}$ and $\phi_n = \arctan(-b_n/a_n)$. Best for: signal processing applications where amplitude and phase are the physically meaningful quantities.

Conversion between forms: $c_n = (a_n - ib_n)/2$ for $n > 0$, $c_0 = a_0/2$, and $c_{-n} = \overline{c_n}$ for real $f$. These identities follow immediately from Euler's formula $e^{inx} = \cos(nx) + i\sin(nx)$.

2. Formal Definitions

2.1 The Space $L^2[-\pi,\pi]$

The natural domain for Fourier series is the space of square-integrable functions on $[-\pi, \pi]$:

$$L^2[-\pi,\pi] = \left\{ f: [-\pi,\pi] \to \mathbb{C} \;\middle|\; \int_{-\pi}^{\pi} |f(x)|^2\,dx < \infty \right\}$$

This space carries the inner product:

$$\langle f, g \rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$$

and induced norm $\lVert f \rVert = \sqrt{\langle f, f \rangle}$. Two functions that agree except on a set of measure zero are identified. Under this identification, $L^2[-\pi,\pi]$ is a complete inner product space - a Hilbert space. The full abstract theory is in Section 12-02 Hilbert Spaces; here we use only the concrete definition.

Examples in $L^2[-\pi,\pi]$:

  • Every bounded measurable function (e.g., square wave, triangle wave)
  • $f(x) = |x|^{-1/3}$ (integrable singularity; $\lVert f \rVert < \infty$)
  • $f(x) = \sin(nx)$ for any $n \in \mathbb{Z}$

Non-examples:

  • $f(x) = 1/x$ near $x = 0$: $\int_{-\pi}^{\pi} 1/x^2\,dx = \infty$, so $f \notin L^2$
  • $f(x) = e^{1/x^2}$ near $x = 0$: grows too fast

2.2 The Trigonometric System

Definition (Trigonometric System): The collection $\mathcal{T} = \{e^{inx}\}_{n \in \mathbb{Z}}$ is called the complex trigonometric system on $[-\pi, \pi]$.

Theorem (Orthonormality): The rescaled functions $\phi_n(x) = e^{inx}/\sqrt{2\pi}$ satisfy $\langle \phi_n, \phi_m \rangle_{L^2} = \delta_{nm}$, using the unweighted inner product $\langle f, g \rangle = \int_{-\pi}^{\pi} f\bar{g}\,dx$.

Proof. For $n \neq m$:

$$\int_{-\pi}^{\pi} e^{inx} e^{-imx}\,dx = \int_{-\pi}^{\pi} e^{i(n-m)x}\,dx = \frac{e^{i(n-m)x}}{i(n-m)}\Bigg|_{-\pi}^{\pi} = \frac{e^{i(n-m)\pi} - e^{-i(n-m)\pi}}{i(n-m)} = \frac{2\sin((n-m)\pi)}{n-m} = 0$$

since $n-m \in \mathbb{Z}\setminus\{0\}$, so $\sin((n-m)\pi) = 0$. For $n = m$: $\int_{-\pi}^{\pi} 1\,dx = 2\pi$. $\square$

The real trigonometric system $\{1, \cos x, \sin x, \cos 2x, \sin 2x, \ldots\}$ is similarly orthogonal, with norms $\lVert 1 \rVert = \sqrt{2\pi}$ and $\lVert \cos nx \rVert = \lVert \sin nx \rVert = \sqrt{\pi}$ for $n \geq 1$.
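These orthogonality relations and norms can be sanity-checked numerically with a Riemann sum for the unweighted inner product. A short NumPy sketch (illustrative only):

```python
import numpy as np

# Riemann-sum check of orthogonality in the unweighted inner product
# <f, g> = integral of f(x) g(x) over [-pi, pi], real trigonometric system.
x = np.linspace(-np.pi, np.pi, 8192, endpoint=False)
dx = x[1] - x[0]
ip = lambda f, g: np.sum(f * g) * dx

print(ip(np.cos(3 * x), np.cos(3 * x)))      # ≈ π   (norm² of cos 3x)
print(ip(np.sin(2 * x), np.sin(2 * x)))      # ≈ π   (norm² of sin 2x)
print(ip(np.ones_like(x), np.ones_like(x)))  # ≈ 2π  (norm² of 1)
print(ip(np.cos(3 * x), np.sin(3 * x)))      # ≈ 0
print(ip(np.cos(3 * x), np.cos(5 * x)))      # ≈ 0
```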

Completeness (stated): The trigonometric system is complete in $L^2[-\pi,\pi]$: for every $f \in L^2$ and every $\varepsilon > 0$, there exists a trigonometric polynomial $T(x) = \sum_{|n| \leq N} c_n e^{inx}$ such that $\lVert f - T \rVert < \varepsilon$. The proof uses the Weierstrass approximation theorem and is in Section 12-02.

2.3 Fourier Coefficients

Definition (Fourier Coefficients): For $f \in L^2[-\pi,\pi]$, the complex Fourier coefficients of $f$ are:

$$c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx, \qquad n \in \mathbb{Z}$$

The real Fourier coefficients are:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx \quad (n \geq 0), \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx \quad (n \geq 1)$$

Relations between forms: For real $f$: $c_0 = a_0/2$, $c_n = (a_n - ib_n)/2$ for $n \geq 1$, and $c_{-n} = \overline{c_n}$ (Hermitian symmetry).

Standard examples - complete computations:

(i) Square wave: $f(x) = \operatorname{sgn}(x) = +1$ for $x > 0$, $-1$ for $x < 0$.

Since $f$ is odd, $a_n = 0$ for all $n$. For $b_n$:

$$b_n = \frac{2}{\pi}\int_0^\pi \sin(nx)\,dx = \frac{2}{\pi}\left[\frac{-\cos(nx)}{n}\right]_0^\pi = \frac{2}{\pi n}(1 - \cos(n\pi)) = \begin{cases} 4/(\pi n) & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$

So $f(x) = \frac{4}{\pi}\sum_{k=0}^\infty \frac{\sin((2k+1)x)}{2k+1}$.

(ii) Sawtooth wave: $f(x) = x$ for $x \in (-\pi, \pi)$, extended periodically.

$f$ is odd, so $a_n = 0$. For $b_n$, integrate by parts:

$$b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} x\sin(nx)\,dx = \frac{2}{\pi}\int_0^\pi x\sin(nx)\,dx = \frac{2(-1)^{n+1}}{n}$$

So $f(x) = 2\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\sin(nx) = 2\left(\sin x - \frac{\sin 2x}{2} + \frac{\sin 3x}{3} - \cdots\right)$.

(iii) Triangle wave: $f(x) = |x|$ for $x \in [-\pi, \pi]$.

$f$ is even, so $b_n = 0$. For $a_n$ ($n \geq 1$):

$$a_n = \frac{2}{\pi}\int_0^\pi x\cos(nx)\,dx = \frac{2}{\pi}\left[\frac{x\sin(nx)}{n} + \frac{\cos(nx)}{n^2}\right]_0^\pi = \frac{2(\cos(n\pi) - 1)}{\pi n^2} = \begin{cases} -4/(\pi n^2) & n \text{ odd} \\ 0 & n \text{ even}\end{cases}$$

And $a_0 = \pi$. So $f(x) = \frac{\pi}{2} - \frac{4}{\pi}\sum_{k=0}^\infty \frac{\cos((2k+1)x)}{(2k+1)^2}$.

Non-examples: Not every sequence $\{c_n\}$ is the Fourier coefficient sequence of an $L^2$ function. By Parseval's identity (Section 4.2), we need $\sum_n |c_n|^2 < \infty$. The sequence $c_n = 1$ for all $n$ is not a valid Fourier coefficient sequence.

2.4 Partial Sums and the Dirichlet Kernel

Definition: The $N$-th partial sum of the Fourier series of $f$ is:

$$S_N f(x) = \sum_{n=-N}^{N} c_n e^{inx}$$

A crucial observation: the partial sum can be written as a convolution:

$$S_N f(x) = (f * D_N)(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\, D_N(x - t)\,dt$$

where the Dirichlet kernel $D_N$ is:

$$D_N(x) = \sum_{n=-N}^{N} e^{inx} = \frac{\sin((N + 1/2)x)}{\sin(x/2)}$$

The closed form follows by summing the geometric series $\sum_{n=-N}^N e^{inx}$ and simplifying using $e^{ix/2} - e^{-ix/2} = 2i\sin(x/2)$.

Key properties of $D_N$:

  • $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(x)\,dx = 1$ (normalized)
  • $D_N$ oscillates, with $N$ peaks near $x = 0$
  • $D_N$ does NOT stay non-negative (unlike an approximate identity), causing convergence issues

The failure of $D_N$ to remain non-negative is precisely what allows the Gibbs phenomenon (Section 3.4) and prevents uniform convergence at discontinuities.
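Both the closed form and these properties can be verified numerically. A short NumPy sketch (function names are ours, for illustration):

```python
import numpy as np

def dirichlet_sum(x, N):
    """D_N(x) = sum_{n=-N}^{N} e^{inx}; real-valued by conjugate symmetry."""
    return np.real(sum(np.exp(1j * n * x) for n in range(-N, N + 1)))

def dirichlet_closed(x, N):
    """Closed form sin((N + 1/2)x) / sin(x/2), for x not a multiple of 2π."""
    return np.sin((N + 0.5) * x) / np.sin(x / 2)

x = np.linspace(0.01, np.pi, 500)
print(np.max(np.abs(dirichlet_sum(x, 10) - dirichlet_closed(x, 10))))  # ≈ 0

xs = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dx = xs[1] - xs[0]
print(dirichlet_sum(xs, 10).sum() * dx / (2 * np.pi))  # ≈ 1  (normalized)
print(dirichlet_sum(xs, 10).min())                     # < 0  (not non-negative)
```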

3. Convergence Theory

3.1 Pointwise Convergence: Dirichlet's Theorem

Theorem (Dirichlet, 1829): Let $f$ be a $2\pi$-periodic function that is piecewise smooth on $[-\pi,\pi]$ (i.e., $f$ and $f'$ are piecewise continuous). Then the Fourier series of $f$ converges for every $x$, and:

$$S_N f(x) \;\to\; \frac{f(x^+) + f(x^-)}{2} \quad \text{as } N \to \infty$$

At points of continuity, $f(x^+) = f(x^-)$, so the series converges to $f(x)$. At a jump discontinuity, the series converges to the average of the left and right limits.

Proof sketch. Write $S_N f(x) - \frac{f(x^+)+f(x^-)}{2}$ as an integral involving $D_N$ and the function $g_x(t) = f(x+t) - \frac{f(x^+)+f(x^-)}{2}$. The key step: $g_x(t)/\sin(t/2)$ is integrable if $f$ is piecewise smooth (the singularity at $t=0$ is removable). Then apply the Riemann-Lebesgue Lemma (Section 5.5): $\int h(t)\sin(Nt)\,dt \to 0$ as $N \to \infty$ for any integrable $h$. $\square$

Dirichlet conditions (sufficient, not necessary):

  1. $f$ is bounded and has finitely many maxima, minima, and discontinuities on $[-\pi,\pi]$
  2. $f$ has left and right derivatives everywhere

What Dirichlet's theorem does NOT say:

  • It does not guarantee uniform convergence (fails at jumps due to Gibbs)
  • It does not apply to all continuous functions (Du Bois-Reymond's example)
  • It says nothing about the $L^2$ convergence rate

3.2 $L^2$ Convergence and Completeness

The $L^2$ convergence theorem is stronger and cleaner than the pointwise result:

Theorem ($L^2$ Convergence): For every $f \in L^2[-\pi,\pi]$:

$$\lVert f - S_N f \rVert_{L^2} \to 0 \quad \text{as } N \to \infty$$

Equivalently, $f = \sum_{n=-\infty}^{\infty} c_n e^{inx}$ as an $L^2$ limit. This follows directly from the completeness of the trigonometric system in $L^2$ (stated in Section 2.2).

Bessel's Inequality (preliminary to Parseval): For any $f \in L^2$:

$$\sum_{n=-\infty}^{\infty} |c_n|^2 \leq \lVert f \rVert^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx$$

Proof. Expand $0 \leq \lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \sum_{|n| \leq N} |c_n|^2$. This shows the partial sums of $\sum |c_n|^2$ are bounded by $\lVert f \rVert^2$. Taking $N \to \infty$ gives Bessel. $\square$

Equality holds (Bessel becomes Parseval) exactly when the system is complete.

3.3 Uniform Convergence

Theorem: If $f \in C^1[-\pi,\pi]$ (continuously differentiable) and periodic, then $S_N f \to f$ uniformly.

Proof sketch. Integrate by parts: $c_n[f] = c_n[f']/(in)$ for $n \neq 0$, so $|c_n| \leq |c_n[f']|/|n|$. By Cauchy-Schwarz and Bessel's inequality applied to $f'$, $\sum_{n \neq 0} |c_n| \leq \big(\sum_{n \neq 0} |c_n[f']|^2\big)^{1/2}\big(\sum_{n \neq 0} 1/n^2\big)^{1/2} < \infty$, so the partial sums converge uniformly by the Weierstrass M-test. For $f \in C^k$, $|c_n| = O(|n|^{-k})$, giving faster uniform convergence. $\square$

Key point: Continuity alone is insufficient for uniform convergence. The counterexample (Du Bois-Reymond) is a continuous function whose Fourier series diverges at a single point.

3.4 The Gibbs Phenomenon

The Gibbs phenomenon is one of the most important practical facts about Fourier series.

Observation: Near a jump discontinuity at $x = x_0$, the partial sum $S_N f$ overshoots the function value by approximately $\frac{1}{2}\left(\frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt - 1\right) \approx 0.0895$ times the jump height - about 9% of the jump - regardless of how large $N$ is.

Precise statement for the square wave: For $f(x) = \operatorname{sgn}(x)$ with jump height $2$:

$$\lim_{N\to\infty} S_N f\!\left(\frac{\pi}{2N+1}\right) = \frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt \approx 1.1790$$

So the overshoot is $\approx 0.179$, which is $9\%$ of the total jump of $2$.

Why it persists: The Dirichlet kernel $D_N$ has a tall central spike but also oscillating side lobes that take negative values. These side lobes cannot be eliminated by taking $N$ larger - they just become narrower and move closer to the discontinuity.

For AI: The Gibbs phenomenon is why "ringing" artifacts appear when you sharply truncate a frequency spectrum (e.g., in audio compression, image filtering, or when a language model encounters out-of-distribution high-frequency tokens). Windowing functions (Section 20-03) are the engineering fix.

Remedy - Fejer's theorem: Instead of taking partial sums, take their Cesaro means $\sigma_N f = \frac{1}{N+1}\sum_{k=0}^N S_k f$. Fejer proved these converge everywhere and do NOT exhibit Gibbs overshoot.

3.5 Fejer's Theorem and Cesaro Summability

Theorem (Fejer, 1900): Let $f \in C[-\pi,\pi]$ be $2\pi$-periodic. Then $\sigma_N f \to f$ uniformly as $N \to \infty$.

The Cesaro means use the Fejer kernel $F_N = \frac{1}{N+1}\sum_{k=0}^N D_k$:

$$F_N(x) = \frac{1}{N+1}\left(\frac{\sin((N+1)x/2)}{\sin(x/2)}\right)^2 \geq 0$$

Unlike the Dirichlet kernel, the Fejer kernel is non-negative. This is why Cesaro summation fixes the Gibbs phenomenon - the averaging eliminates the negative side lobes.
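The contrast between partial sums and Cesaro means is easy to see numerically on the square wave. A minimal NumPy sketch (illustrative, not the lesson's code):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)

def S(x, N):
    """Square-wave partial sum with harmonics up to n = N (odd n only)."""
    out = np.zeros_like(x)
    for n in range(1, N + 1, 2):
        out += 4 / (np.pi * n) * np.sin(n * x)
    return out

N = 99
sigma = sum(S(x, k) for k in range(N + 1)) / (N + 1)  # Cesaro mean σ_N

print(S(x, N).max())  # ≈ 1.179: Gibbs overshoot persists in S_N
print(sigma.max())    # ≤ 1: no overshoot, since the Fejer kernel is ≥ 0
```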

For AI: Fejer's theorem is the prototype for windowing in signal processing. Multiplying a signal by a window function before taking the Fourier transform is equivalent to using a smoother summation kernel, which reduces spectral leakage and ringing (-> Section 20-03).


4. Parseval's Theorem and Energy

4.1 Bessel's Inequality

We proved Bessel's inequality in Section 3.2: $\sum_{n} |c_n|^2 \leq \lVert f \rVert^2$. This says the "energy" in the frequency representation is at most the energy in the time representation. Equality requires completeness.

4.2 Parseval's Identity

Theorem (Parseval's Identity): For $f \in L^2[-\pi,\pi]$ with Fourier coefficients $\{c_n\}$:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx = \sum_{n=-\infty}^{\infty} |c_n|^2$$

In real form: $\frac{a_0^2}{2} + \sum_{n=1}^\infty (a_n^2 + b_n^2) = \frac{1}{\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx$.

Proof. By completeness, $\lVert f - S_N f \rVert \to 0$. Then:

$$\lVert f \rVert^2 = \lim_{N\to\infty} \lVert S_N f \rVert^2 = \lim_{N\to\infty} \sum_{|n|\leq N} |c_n|^2 = \sum_{n=-\infty}^\infty |c_n|^2$$

using orthonormality of $\{e^{inx}\}$ for the second equality. $\square$

Physical interpretation: The total energy (power) of a signal equals the sum of energies in each frequency component. Fourier analysis is an energy-preserving change of basis.

More general form: For $f, g \in L^2$:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx = \sum_{n=-\infty}^{\infty} c_n[f]\,\overline{c_n[g]}$$

This is the polarization identity version of Parseval.

4.3 Applications: Evaluating Infinite Series

Parseval's identity is a powerful tool for evaluating series that have no elementary closed form.

Example 1 - Basel problem via the sawtooth wave: Apply Parseval to $f(x) = x$ on $(-\pi,\pi)$:

$$\frac{1}{\pi}\int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3}, \qquad \sum_{n=1}^\infty b_n^2 = \sum_{n=1}^\infty \frac{4}{n^2}$$

Parseval gives $\frac{2\pi^2}{3} = 4\sum_{n=1}^\infty \frac{1}{n^2}$, so $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$.

Example 2 - $\sum 1/(2k+1)^2$: Apply Parseval to the square wave. The non-zero Fourier coefficients are $b_{2k+1} = 4/(\pi(2k+1))$. Then:

$$\frac{1}{\pi}\int_{-\pi}^{\pi} 1\,dx = 2 = \sum_{k=0}^\infty \frac{16}{\pi^2(2k+1)^2}$$

giving $\sum_{k=0}^\infty \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$.
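Both identities can be sanity-checked numerically by truncating the series. A small NumPy sketch:

```python
import numpy as np

# Example 1, numerically: (1/π)∫x²dx = 2π²/3 should equal Σ b_n² = Σ 4/n².
n = np.arange(1, 200001)
lhs = 2 * np.pi**2 / 3
rhs = np.sum(4.0 / n**2)
print(lhs, rhs)  # agree to ~4 decimal places (truncation tail ~ 4/N)

# Hence the Basel sum:
print(np.sum(1.0 / n**2), np.pi**2 / 6)
```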

5. Properties of Fourier Coefficients

5.1 Linearity, Shift, and Scaling

Let $c_n[f]$ denote the $n$-th Fourier coefficient of $f$.

| Property | Statement | Proof idea |
|----------|-----------|------------|
| Linearity | $c_n[\alpha f + \beta g] = \alpha c_n[f] + \beta c_n[g]$ | Integral is linear |
| Shift | $c_n[f(\cdot - x_0)] = e^{-inx_0} c_n[f]$ | Change of variable $t = x - x_0$ |
| Conjugation | $c_n[\overline{f}] = \overline{c_{-n}[f]}$ | Conjugate the integral |
| Reflection | $c_n[f(-\cdot)] = c_{-n}[f]$ | Change of variable $x \to -x$ |

The shift property is the Fourier-series version of the time-shift property of the Fourier transform. It says: shifting a signal in time multiplies its Fourier coefficients by a complex exponential - i.e., introduces a linear phase. This is fundamental in signal alignment and is the mechanism behind relative positional encodings in transformers.

For AI - RoPE connection: In RoPE, the query and key vectors at positions $m$ and $n$ are $\mathbf{q}_m = R_m \mathbf{q}$ and $\mathbf{k}_n = R_n \mathbf{k}$, where $R_m$ is a rotation matrix. The dot product is $\mathbf{q}_m^\top \mathbf{k}_n = \mathbf{q}^\top R_{n-m} \mathbf{k}$ - it depends only on the relative position $m - n$. This is exactly the Fourier shift property: shifting both signals by the same amount leaves their inner product unchanged.
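The shift property itself can be verified numerically with Riemann-sum coefficients. A short NumPy sketch (helper names and the test signal are ours):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dx = x[1] - x[0]

def coeffs(f_vals, ns):
    """Riemann-sum Fourier coefficients c_n for each n in ns."""
    return np.array([np.sum(f_vals * np.exp(-1j * n * x)) * dx / (2 * np.pi)
                     for n in ns])

f = lambda t: np.cos(3 * t) + 0.5 * np.sin(7 * t)  # any 2π-periodic signal
x0 = 0.8
ns = np.arange(-8, 9)

c_f = coeffs(f(x), ns)
c_shifted = coeffs(f(x - x0), ns)

# Shift in time = linear phase in frequency:
print(np.max(np.abs(c_shifted - np.exp(-1j * ns * x0) * c_f)))  # ≈ 0
```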

5.2 Differentiation and Integration

Differentiation in frequency space:

$$c_n[f'] = in \cdot c_n[f]$$

Proof. Integrate by parts: $c_n[f'] = \frac{1}{2\pi}\int_{-\pi}^{\pi} f'(x) e^{-inx}\,dx = \frac{1}{2\pi}\left[f(x)e^{-inx}\right]_{-\pi}^{\pi} + \frac{in}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$. By periodicity, the boundary term vanishes, leaving $c_n[f'] = in\,c_n[f]$. $\square$

Iterated differentiation: $c_n[f^{(k)}] = (in)^k c_n[f]$.

Implication for smoothness: A function $f \in C^k$ ($k$-times continuously differentiable and periodic) has $|c_n| = O(|n|^{-k})$ as $|n| \to \infty$. The smoother $f$ is, the faster its Fourier coefficients decay. This is the quantitative content of Section 5.4.

Integration: $c_n\left[\int_0^x f(t)\,dt - \langle f \rangle x\right] = \frac{c_n[f]}{in}$ for $n \neq 0$, where $\langle f \rangle = c_0[f]$ is the mean.
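The differentiation rule is easy to check numerically for a smooth periodic function such as $e^{\cos x}$ (our choice, for illustration):

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dx = x[1] - x[0]

def coeff(vals, n):
    """Riemann-sum Fourier coefficient c_n of the sampled function."""
    return np.sum(vals * np.exp(-1j * n * x)) * dx / (2 * np.pi)

f = np.exp(np.cos(x))                # smooth, 2π-periodic
fp = -np.sin(x) * np.exp(np.cos(x))  # its derivative f'

for n in [1, 2, 5]:
    print(coeff(fp, n), 1j * n * coeff(f, n))  # each pair agrees: c_n[f'] = in c_n[f]
```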

5.3 Even and Odd Functions

Even functions ($f(-x) = f(x)$) have $b_n = 0$ for all $n$, so the Fourier series contains only cosines: $f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty a_n \cos(nx)$ - a cosine series. The coefficients are $a_n = \frac{2}{\pi}\int_0^\pi f(x)\cos(nx)\,dx$.

Odd functions ($f(-x) = -f(x)$) have $a_n = 0$ for all $n$, so the Fourier series contains only sines: $f(x) = \sum_{n=1}^\infty b_n \sin(nx)$ - a sine series. The coefficients are $b_n = \frac{2}{\pi}\int_0^\pi f(x)\sin(nx)\,dx$.

For AI - DCT connection: The Discrete Cosine Transform (DCT) used in JPEG and MP3 compression is the even-extension Fourier series. The DCT coefficients $a_n$ are exactly the Fourier coefficients of the even extension of the signal to $[-\pi,\pi]$. This is why the DCT achieves better energy compaction than the DFT for real signals.

5.4 Smoothness and Spectral Decay

The relationship between regularity and spectral decay is fundamental to both signal processing and machine learning:

| Regularity of $f$ | Decay of $|c_n|$ | Practical implication |
|-------------------|------------------|-----------------------|
| $f \in L^2$ | $|c_n| \to 0$ (by Riemann-Lebesgue) | All coefficients vanish asymptotically |
| $f$ piecewise smooth, discontinuity | $|c_n| \sim C/|n|$ | Slow $1/n$ decay (Gibbs) |
| $f \in C^1$ (continuously diff.) | $|c_n| = O(1/n^2)$ | Faster decay, no Gibbs |
| $f \in C^k$ | $|c_n| = O(1/n^{k+1})$ | Super-algebraic if $k$ large |
| $f$ analytic | $|c_n| \leq Ce^{-r|n|}$ for some $r > 0$ | Exponential decay |

For AI - Spectral bias (Rahaman et al., 2019): Neural networks learn the target function's Fourier decomposition from lowest to highest frequency. Formally, if $f^* = \sum_n c_n e^{inx}$ is the target function, a network trained by gradient descent first approximates the $c_n$ for small $|n|$ (low frequencies) and only later approximates high-frequency components. This implies:

  • Benefit: Implicit regularization toward smooth (low-frequency) solutions -> good generalization
  • Cost: Slow convergence on high-frequency targets; requires more data for texture-rich images
  • Fix: Fourier feature embeddings (RFF, NeRF's positional encoding) inject high-frequency components explicitly

5.5 Riemann-Lebesgue Lemma

Theorem (Riemann-Lebesgue): If $f \in L^1[-\pi,\pi]$, then $c_n[f] \to 0$ as $|n| \to \infty$.

Proof sketch. For $f$ a step function: direct computation. For general $f$: approximate by step functions and use the $L^1$ bound. $\square$

Significance: This says every $L^1$ function has Fourier coefficients that vanish at high frequencies. The rate of decay depends on smoothness (Section 5.4), but the qualitative statement holds for all integrable functions. This is the mathematical foundation of the claim "smooth signals are compressible in the Fourier domain."


6. Fourier Series on General Intervals

6.1 Extension to $[-L, L]$

For a $2L$-periodic function $f$, the Fourier series on $[-L, L]$ is obtained by the change of variables $t = \pi x / L$:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty \left[ a_n \cos\!\left(\frac{n\pi x}{L}\right) + b_n \sin\!\left(\frac{n\pi x}{L}\right) \right]$$

with coefficients:

$$a_n = \frac{1}{L}\int_{-L}^{L} f(x)\cos\!\left(\frac{n\pi x}{L}\right)dx, \qquad b_n = \frac{1}{L}\int_{-L}^{L} f(x)\sin\!\left(\frac{n\pi x}{L}\right)dx$$

The fundamental frequency is $f_1 = 1/(2L)$, and the $n$-th harmonic has frequency $n/(2L)$.

For AI - Transformer context length: In a transformer with context length $T$, sinusoidal positional encodings use frequencies $1/10000^{2i/d}$ for $i = 0, \ldots, d/2 - 1$. This covers a geometric range of frequencies over the interval $[0, T]$ - analogous to sampling the Fourier series on $[0, T]$ at $d/2$ geometrically spaced frequencies. Longer context requires lower minimum frequencies, which is why RoPE's $\theta_{\min} = 10000^{-1}$ limits effective context length and why extended-context models (LLaMA-3-128K, Mistral) modify the base $\theta$.

6.2 Half-Range Expansions

If $f$ is defined on $[0, L]$ (not the full period), we can extend it to $[-L, L]$ in two ways:

Even extension: Extend by $f(-x) = f(x)$ -> leads to a cosine series on $[0, L]$:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^\infty a_n \cos\!\left(\frac{n\pi x}{L}\right), \qquad a_n = \frac{2}{L}\int_0^L f(x)\cos\!\left(\frac{n\pi x}{L}\right)dx$$

Odd extension: Extend by $f(-x) = -f(x)$ -> leads to a sine series on $[0, L]$:

$$f(x) = \sum_{n=1}^\infty b_n \sin\!\left(\frac{n\pi x}{L}\right), \qquad b_n = \frac{2}{L}\int_0^L f(x)\sin\!\left(\frac{n\pi x}{L}\right)dx$$

Application: Solving the heat equation on $[0, L]$ with Dirichlet boundary conditions ($f(0) = f(L) = 0$) uses the sine series expansion. With Neumann conditions ($f'(0) = f'(L) = 0$), use the cosine series.

7. Applications in Machine Learning

7.1 Sinusoidal Positional Encodings in Transformers

The original Transformer (Vaswani et al., 2017) uses positional encodings:

$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

This assigns each position $pos \in \{0, 1, \ldots, T-1\}$ a $d_{\text{model}}$-dimensional vector whose components are sine and cosine values at geometrically spaced frequencies $\omega_i = 1/10000^{2i/d_{\text{model}}}$.

Why this works: The set of frequencies $\{\omega_i\}$ forms a geometric progression spanning from $1$ (very high frequency, distinguishes adjacent tokens) down to $10000^{-1}$ (very low frequency, distinguishes widely separated tokens). This is exactly the strategy of a Fourier series on a long interval: use many harmonics at different scales.

Limitation: These are absolute positional encodings - the embedding for position 5 is fixed regardless of context. This creates problems for generalization to longer sequences than seen in training.
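The encoding above is a few lines of NumPy. A minimal sketch (the function name is ours):

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Sinusoidal positional encodings, shape (T, d):
    even columns sin(pos * ω_i), odd columns cos(pos * ω_i)."""
    pos = np.arange(T)[:, None]            # (T, 1)
    i = np.arange(d // 2)[None, :]         # (1, d/2)
    freqs = 1.0 / 10000 ** (2 * i / d)     # geometric frequency ladder ω_i
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

pe = sinusoidal_pe(128, 64)
# Each (sin, cos) pair lies on the unit circle at every position:
print(np.allclose(pe[:, 0::2]**2 + pe[:, 1::2]**2, 1.0))  # True
```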

7.2 Rotary Positional Encoding (RoPE)

RoPE (Su et al., 2021), now standard in LLaMA-2/3, Mistral, Falcon, and GPT-NeoX, encodes position as a rotation of the query and key vectors.

Construction: For a dd-dimensional query vector q\mathbf{q}, split into d/2d/2 pairs (q2i,q2i+1)(q_{2i}, q_{2i+1}). Rotate each pair by angle mθim\theta_i where mm is the token position and θi=100002i/d\theta_i = 10000^{-2i/d}:

(q2iq2i+1)=(cos(mθi)sin(mθi)sin(mθi)cos(mθi))(q2iq2i+1)\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}

This is multiplication by eimθie^{im\theta_i} in the complex representation q2i+iq2i+1q_{2i} + iq_{2i+1}.

The key property: The attention score between query at position mm and key at position nn is:

score(m,n)=Re ⁣[i=0d/21(q2i+iq2i+1)eimθi(k2i+ik2i+1)einθi]\text{score}(m, n) = \operatorname{Re}\!\left[\sum_{i=0}^{d/2-1} (q_{2i} + iq_{2i+1})e^{im\theta_i} \cdot \overline{(k_{2i} + ik_{2i+1})e^{in\theta_i}}\right] =Re ⁣[i=0d/21(q2i+iq2i+1)(k2i+ik2i+1)ei(mn)θi]= \operatorname{Re}\!\left[\sum_{i=0}^{d/2-1} (q_{2i} + iq_{2i+1})\overline{(k_{2i} + ik_{2i+1})} \cdot e^{i(m-n)\theta_i}\right]

The score depends only on mnm - n - the relative position. This is the Fourier shift theorem: multiplying by eimθie^{im\theta_i} and taking a conjugate product gives sensitivity to relative displacement.
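The relative-position property is easy to check numerically. A sketch with toy dimensions (our names; the complex form is exactly the one above):

```python
import numpy as np

d = 8
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)   # theta_i as defined above

def rotate(v, m):
    """Apply RoPE at position m via the complex form q_{2i} + i*q_{2i+1}."""
    z = (v[0::2] + 1j * v[1::2]) * np.exp(1j * m * theta)
    out = np.empty_like(v)
    out[0::2], out[1::2] = z.real, z.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)
score = lambda m, n: rotate(q, m) @ rotate(k, n)

# Shifting both positions by 7 leaves the score unchanged: it depends only on m - n.
print(np.isclose(score(3, 10), score(3 + 7, 10 + 7)))   # True
```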

Extended context: The maximum effective context length is set by the lowest frequency, $\theta_{\min} = \theta_{d/2-1} = 10000^{-(d-2)/d} \approx 10000^{-1}$. Models like LLaMA-3-128K extend context by scaling $\theta_i \to \theta_i \cdot (L_{\text{train}}/L_{\text{new}})$ (Position Interpolation, Chen et al., 2023) or by increasing the base from 10000 (500000 in LLaMA-3; 1000000 in Code Llama, Roziere et al., 2023).

7.3 Spectral Bias of Neural Networks

The phenomenon (Rahaman et al., 2019): Neural networks trained with gradient descent learn a biased decomposition of the target function: low-frequency Fourier components are learned first, high-frequency components last.

Mathematical statement: Let ff^* be the target function with Fourier decomposition f(x)=ncneinxf^*(x) = \sum_n c_n e^{inx}. A network fθf_\theta trained on a finite sample first converges on the low-n|n| components (cn[ffθ]|c_n[f^* - f_\theta]| decreases for small n|n| first).

Consequences for AI:

  1. Good: Networks converge to smooth solutions -> implicit regularization -> good out-of-distribution generalization for smooth targets
  2. Bad: Learning high-frequency signals (sharp edges, fine texture) requires more data and training time
  3. NeRF / SIREN fix: Sinusoidal activations (Sitzmann et al., 2020) or Fourier feature mappings γ(x)=[cos(Bx),sin(Bx)]\gamma(\mathbf{x}) = [\cos(B\mathbf{x}), \sin(B\mathbf{x})] (Tancik et al., 2020) inject high-frequency components, overcoming spectral bias for 3D scene representation

Mechanism: The NTK (Neural Tangent Kernel) of a standard MLP has a spectrum that decays with frequency. High-frequency components of the target correspond to low-eigenvalue directions of the NTK; since gradient descent shrinks the residual in each eigendirection at a rate proportional to its eigenvalue, those components converge slowly.
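This mechanism can be seen in a toy diagonal model. Under gradient flow, the residual along an NTK eigendirection with eigenvalue $\lambda$ decays like $e^{-\lambda t}$; here we assume, purely for illustration, a spectrum $\lambda_n = 1/n^2$ indexed by frequency (the exact decay depends on the architecture):

```python
import numpy as np

# Toy NTK dynamics in the kernel's eigenbasis (Fourier modes here).
# Assumed spectrum lam_n = 1/n^2 -- illustrative, not a derived MLP spectrum.
n = np.arange(1, 6)
lam = 1.0 / n**2
c = np.ones_like(lam)                    # target energy in each mode

for t in [1.0, 10.0, 100.0]:
    residual = c * np.exp(-lam * t)      # |c_n[f* - f_theta]| after "time" t
    print(f"t={t:5.1f}  residual per mode:", residual.round(3))
# Low-frequency residuals vanish first; mode n needs ~n^2 times longer than mode 1.
```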

7.4 Random Fourier Features and Kernel Approximation

The problem: Kernel methods (SVMs, GPs) require storing and computing an n×nn \times n kernel matrix - O(n2)O(n^2) memory and O(n3)O(n^3) computation. For large nn, this is infeasible.

The solution (Rahimi & Recht, 2007): Any shift-invariant kernel k(x,y)=k(xy)k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y}) can be written, by Bochner's theorem, as the Fourier transform of a non-negative measure:

k(xy)=p(ω)eiω(xy)dωk(\mathbf{x} - \mathbf{y}) = \int p(\boldsymbol{\omega}) e^{i\boldsymbol{\omega}^\top(\mathbf{x} - \mathbf{y})}\,d\boldsymbol{\omega}

Sampling DD frequencies ω1,,ωDp(ω)\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_D \sim p(\boldsymbol{\omega}) and defining the feature map:

ϕ(x)=1D[cos(ω1x),sin(ω1x),,cos(ωDx),sin(ωDx)]\phi(\mathbf{x}) = \frac{1}{\sqrt{D}}\left[\cos(\boldsymbol{\omega}_1^\top\mathbf{x}), \sin(\boldsymbol{\omega}_1^\top\mathbf{x}), \ldots, \cos(\boldsymbol{\omega}_D^\top\mathbf{x}), \sin(\boldsymbol{\omega}_D^\top\mathbf{x})\right]

gives E[ϕ(x)ϕ(y)]=k(xy)\mathbb{E}[\phi(\mathbf{x})^\top\phi(\mathbf{y})] = k(\mathbf{x} - \mathbf{y}). This reduces kernel computation to O(nD)O(nD) - linear in the dataset size.

Connection to Fourier series: ϕ(x)\phi(\mathbf{x}) is a finite Fourier expansion with randomly sampled frequencies. The approximation quality improves as DD increases, with concentration bounds showing k(xy)ϕ(x)ϕ(y)ε|k(\mathbf{x}-\mathbf{y}) - \phi(\mathbf{x})^\top\phi(\mathbf{y})| \leq \varepsilon with high probability for D=O(dlog(1/ε)/ε2)D = O(d\log(1/\varepsilon)/\varepsilon^2).
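A minimal Monte Carlo check of this identity for the Gaussian (RBF) kernel, whose spectral density is $\mathcal{N}(\mathbf{0}, \sigma^{-2}I)$ (toy sizes, our names):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 2, 2000, 1.0

W = rng.standard_normal((D, d)) / sigma      # omega_j ~ N(0, I / sigma^2)

def phi(x):
    """Random Fourier feature map: stacked cos/sin projections, scaled by 1/sqrt(D)."""
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
approx = phi(x) @ phi(y)
print(abs(exact - approx))   # small: the Monte Carlo error is O(1/sqrt(D))
```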


8. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Using $c_n = \int_{-\pi}^{\pi} f e^{-inx}\,dx$ without the $\frac{1}{2\pi}$ factor | Missing normalization; coefficients will be off by a factor of $2\pi$ | Always check which convention your source uses; in this repo we use $c_n = \frac{1}{2\pi}\int$ |
| 2 | Assuming the Fourier series always converges pointwise | False: Du Bois-Reymond exhibited a continuous function whose series diverges at a point | State the convergence type explicitly; use $L^2$ convergence for most theory |
| 3 | Assuming the sum at a discontinuity equals $f(x)$ | The series converges to the average $\frac{f(x^+)+f(x^-)}{2}$ at a jump | Apply Dirichlet's theorem: check for discontinuities before claiming convergence to $f$ |
| 4 | Writing $c_{-n} = c_n$ for a real function | The correct relation is $c_{-n} = \overline{c_n}$ (conjugate, not equal) | For real $f$: $c_{-n} = \overline{c_n}$; the $c_n$ are real only when $f$ is also even |
| 5 | Applying differentiation term-by-term without checking conditions | $c_n[f'] = in\,c_n[f]$ requires $f$ to be absolutely continuous and periodic | Check that $f \in AC[-\pi,\pi]$ (absolutely continuous) before differentiating term-by-term |
| 6 | Confusing Parseval's theorem with Bessel's inequality | Bessel gives $\leq$; Parseval gives $=$ | Parseval requires completeness of the trigonometric system; Bessel holds for any orthonormal set |
| 7 | Claiming $c_n = O(1/n)$ decay for a smooth function | Smoothness gives faster decay: $f \in C^k$ periodic implies $c_n = o(1/n^k)$; $O(1/n)$ decay is the signature of a jump discontinuity | Match the claimed decay rate to the smoothness class of $f$ |
| 8 | Forgetting to subtract the mean before computing sine/cosine coefficients | $c_0 = a_0/2$ is the DC component; it appears in the $a_0/2$ term, not in $a_n$ for $n \geq 1$ | Always compute $a_0$ separately; the series is $\frac{a_0}{2} + \sum_{n \geq 1}$ |
| 9 | Using the complex form with $e^{inx}$ but real-form coefficients | The complex form requires $c_n = \frac{1}{2\pi}\int fe^{-inx}$; the real form uses $a_n, b_n$ - these are different | Pick one form and use it consistently; convert via $c_n = (a_n - ib_n)/2$ |
| 10 | Claiming the Gibbs phenomenon goes away as $N \to \infty$ | The overshoot height stays at ~9% of the jump; only its width shrinks | Gibbs is a permanent feature of the partial sums near discontinuities; use windowing to mitigate |

9. Exercises

Exercise 1 (*): Compute the complex Fourier coefficients cnc_n of the triangle wave f(x)=xf(x) = |\,x\,| on (π,π)(-\pi, \pi) and write out the first five non-zero terms of the Fourier series. Verify that ff is even, so cnRc_n \in \mathbb{R}.

Exercise 2 (*): Prove from first principles that the functions {1/2π,cos(nx)/π,sin(nx)/π}n1\{1/\sqrt{2\pi},\, \cos(nx)/\sqrt{\pi},\, \sin(nx)/\sqrt{\pi}\}_{n \geq 1} form an orthonormal set in L2[π,π]L^2[-\pi,\pi] with the inner product f,g=ππfgˉdx\langle f, g \rangle = \int_{-\pi}^{\pi} f\bar{g}\,dx.

Exercise 3 (*): Using Parseval's identity applied to the square wave, show that k=01(2k+1)2=π28\sum_{k=0}^\infty \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}. Then use n=11n2=k=11(2k)2+k=01(2k+1)2\sum_{n=1}^\infty \frac{1}{n^2} = \sum_{k=1}^\infty \frac{1}{(2k)^2} + \sum_{k=0}^\infty \frac{1}{(2k+1)^2} to recover the Basel sum n=11n2=π26\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}.

Exercise 4 (**): Let f(x)=eaxf(x) = e^{ax} for x(π,π)x \in (-\pi, \pi) with a0a \neq 0 real. (a) Compute cn[f]c_n[f]. (b) Apply Parseval's identity to obtain a formula for 1sinh2(πa)\frac{1}{\sinh^2(\pi a)} in terms of a series involving 1/(a2+n2)1/(a^2 + n^2). (c) Verify your answer for a=1a = 1 numerically.

Exercise 5 (**): Prove the Riemann-Lebesgue Lemma: if fL1[π,π]f \in L^1[-\pi,\pi], then cn[f]0c_n[f] \to 0 as n|n| \to \infty. (Hint: first prove it for step functions, then approximate general ff.)

Exercise 6 (**): Consider f(x)=x2f(x) = x^2 on (π,π)(-\pi,\pi). (a) Find all Fourier coefficients an,bna_n, b_n. (b) Show that SNffS_N f \to f uniformly. (c) Set x=0x = 0 in the resulting series to derive n=1(1)n+1/n2=π2/12\sum_{n=1}^\infty (-1)^{n+1}/n^2 = \pi^2/12.

Exercise 7 (***): Implement RoPE from scratch in Python. (a) Given query and key vectors of dimension d=64d = 64 and sequence length T=512T = 512, construct the rotation matrices RmR_m for each position mm using θi=100002i/d\theta_i = 10000^{-2i/d}. (b) Compute the rotated attention scores score(m,n)=(Rmq)(Rnk)\text{score}(m, n) = (R_m \mathbf{q})^\top (R_n \mathbf{k}) for all pairs (m,n)(m, n). (c) Verify that score(m,n)\text{score}(m, n) depends only on mnm - n by checking score(m,n)=score(m+k,n+k)\text{score}(m, n) = \text{score}(m+k, n+k) for several values of kk.

Exercise 8 (***): Implement Random Fourier Features for the RBF kernel. (a) For the Gaussian kernel k(x,y)=exy2/(2σ2)k(\mathbf{x}, \mathbf{y}) = e^{-\lVert \mathbf{x} - \mathbf{y} \rVert^2/(2\sigma^2)}, show that the spectral density is p(ω)=N(0,σ2I)p(\boldsymbol{\omega}) = \mathcal{N}(\mathbf{0}, \sigma^{-2} I). (b) Implement the feature map ϕ(x)R2D\phi(\mathbf{x}) \in \mathbb{R}^{2D} using D=100D = 100 random frequency samples. (c) On a synthetic dataset in R2\mathbb{R}^2, compare the exact kernel matrix KK with the approximation ΦΦ\Phi\Phi^\top and plot the approximation error as a function of DD.


10. Why This Matters for AI (2026 Perspective)

| Concept | AI Impact | Concrete System |
| --- | --- | --- |
| Fourier basis vectors $e^{inx}$ | Foundation of positional encodings in all transformer variants | RoPE in LLaMA-3, Gemma-2, Mistral-7B, Falcon |
| Complex Fourier form | Rotation = position; enables relative PE via the shift theorem | RoPE (Su et al., 2021), xPos, YaRN |
| Parseval's identity | Energy preservation: Fourier is a unitary transform; spectral analysis of embeddings | WeightWatcher spectral analysis of LLM health |
| Spectral bias ($1/n^k$ decay) | Networks learn smooth functions first; governs training dynamics | SIREN (Sitzmann et al., 2020), NeRF frequency encoding |
| Gibbs phenomenon | Ringing in over-compressed audio/images; sudden context-length failure modes | JPEG compression artifacts, LLM boundary token issues |
| Fourier completeness | Any periodic signal can be exactly represented; digital audio encoding | MP3, AAC, Ogg Vorbis compression standards |
| Random Fourier features | $O(nD)$ kernel approximation replacing the $O(n^2)$ kernel matrix | Large-scale SVM/GP inference (Rahimi & Recht, 2007) |

11. Conceptual Bridge

Looking backward: Fourier series is built on three pillars from earlier chapters: the inner product structure of L2L^2 (from Section 12-02 Hilbert Spaces), complex exponentials einxe^{inx} (from Section 01 Mathematical Foundations), and convergence of series (from real analysis in Section 04 Calculus). If those foundations feel shaky, revisit them before proceeding.

Looking forward: Fourier series handles periodic functions on a bounded interval. The Fourier Transform (Section 20-02) extends this to aperiodic signals on the entire real line by taking the period TT \to \infty. The discrete, computationally feasible version is the DFT and FFT (Section 20-03). The Convolution Theorem (Section 20-04) shows why Fourier analysis is so powerful: convolution in time becomes multiplication in frequency. Finally, Wavelets (Section 20-05) overcome Fourier's fundamental limitation - the inability to localize in both time and frequency simultaneously.

POSITION IN THE FOURIER ANALYSIS CHAPTER
========================================================================

  Section 20-01  Fourier Series          -- YOU ARE HERE
    v  (take T -> infty)
  Section 20-02  Fourier Transform
    v  (discretize: N samples)
  Section 20-03  DFT and FFT
    v  (mult. in freq <-> conv. in time)
  Section 20-04  Convolution Theorem
    v  (localize in time AND freq)
  Section 20-05  Wavelets

  Prerequisites:              Forward pointers:
  Section 01 Mathematical Foundations -> Section 20-02 (continuous FT)
  Section 04 Calculus Fundamentals    -> Section 20-03 (discrete FT)
  Section 12-02 Hilbert Spaces        -> Section 12-03 (kernel methods via Bochner)

========================================================================

<- Back to Fourier Analysis | Next: Fourier Transform ->


Appendix A: Extended Examples and Computations

A.1 Full Derivation: Fourier Series of Common Waveforms

This appendix provides complete, step-by-step derivations for the Fourier coefficients of the most important periodic waveforms. These are not presented as exercises - they are reference derivations that you should work through once and understand deeply. Every step is motivated.

A.1.1 Rectangular Pulse Train

Define a pulse of width 2τ2\tau centered at x=0x = 0, repeated with period 2π2\pi:

f(x)={1xτ0τ<xπf(x) = \begin{cases} 1 & |x| \leq \tau \\ 0 & \tau < |x| \leq \pi \end{cases}

Complex coefficients: for n0n \neq 0:

cn=12πππf(x)einxdx=12πττeinxdx=12πeinτeinτin=sin(nτ)nπ=τπsinc(nτ/π)c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx = \frac{1}{2\pi}\int_{-\tau}^{\tau} e^{-inx}\,dx = \frac{1}{2\pi} \cdot \frac{e^{in\tau} - e^{-in\tau}}{in} = \frac{\sin(n\tau)}{n\pi} = \frac{\tau}{\pi}\operatorname{sinc}(n\tau/\pi)

where sinc(u)=sin(πu)/(πu)\operatorname{sinc}(u) = \sin(\pi u)/(\pi u). For n=0n = 0: c0=τ/πc_0 = \tau/\pi (the duty cycle).
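A quick numerical sanity check of this closed form, using midpoint quadrature (the value $\tau = 0.5$ is an arbitrary choice for the demo):

```python
import numpy as np

tau, M = 0.5, 200_000
x = -np.pi + (np.arange(M) + 0.5) * (2 * np.pi / M)   # midpoint grid on [-pi, pi]
f = (np.abs(x) <= tau).astype(float)                  # the rectangular pulse

def c(n):
    """Numerical c_n = (1/2pi) * integral of f(x) e^{-inx} dx."""
    return np.sum(f * np.exp(-1j * n * x)) * (2 * np.pi / M) / (2 * np.pi)

# Compare with the closed form sin(n*tau)/(n*pi) for a few n.
for n in [1, 2, 5]:
    print(n, c(n).real, np.sin(n * tau) / (n * np.pi))
```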

Key observation: The envelope of cn|c_n| versus nn follows a sinc\operatorname{sinc} function. The first zero crossing occurs at n=π/τn = \pi/\tau. A narrower pulse (τ\tau small) -> wider spectrum (more high frequencies needed to represent the sharp edges). This is the time-frequency trade-off in action.

Spectrum width: The "bandwidth" (distance to the first spectral null) is $B = 1/(2\tau)$ in normalized frequency. Narrow pulses are "broadband"; wide pulses ($\tau/\pi \to 1$, approaching a constant) have most of their energy concentrated near $n = 0$.

A.1.2 Full-Wave Rectified Sine

f(x)=sinxf(x) = |\sin x|

ff is even and non-negative. Since sinx=sin(x)|\sin x| = |\sin(-x)|, bn=0b_n = 0 for all nn. For ana_n:

a0=2π0πsinxdx=4πa_0 = \frac{2}{\pi}\int_0^\pi \sin x\,dx = \frac{4}{\pi} an=2π0πsinxcos(nx)dx=2π0πsinxcos(nx)dxa_n = \frac{2}{\pi}\int_0^\pi |\sin x|\cos(nx)\,dx = \frac{2}{\pi}\int_0^\pi \sin x\cos(nx)\,dx

Using the product-to-sum formula sinxcos(nx)=12[sin((1+n)x)+sin((1n)x)]\sin x\cos(nx) = \frac{1}{2}[\sin((1+n)x) + \sin((1-n)x)]:

For n=1n = 1: a1=1π0πsin(2x)dx=0a_1 = \frac{1}{\pi}\int_0^\pi \sin(2x)\,dx = 0.

For n1n \neq 1:

an=1π0π[sin((n+1)x)+sin((1n)x)]dx=1π[cos((n+1)x)n+1+cos((1n)x)1n]0πa_n = \frac{1}{\pi}\int_0^\pi [\sin((n+1)x) + \sin((1-n)x)]\,dx = \frac{1}{\pi}\left[\frac{-\cos((n+1)x)}{n+1} + \frac{\cos((1-n)x)}{1-n}\right]_0^\pi =1π(1cos((n+1)π)n+11cos((n1)π)n1)={4π(n21)n even0n odd, n1= \frac{1}{\pi}\left(\frac{1-\cos((n+1)\pi)}{n+1} - \frac{1-\cos((n-1)\pi)}{n-1}\right) = \begin{cases} -\frac{4}{\pi(n^2-1)} & n \text{ even} \\ 0 & n \text{ odd, } n \neq 1 \end{cases}

So sinx=2π4πk=1cos(2kx)4k21|\sin x| = \frac{2}{\pi} - \frac{4}{\pi}\sum_{k=1}^\infty \frac{\cos(2kx)}{4k^2 - 1}.

Application: The full-wave rectified sine is used in AC-to-DC conversion. Its spectrum contains only even harmonics, and because $|\sin x|$ is continuous (its corners only make $f'$ jump), the coefficients decay as $1/n^2$, so the series converges faster than that of the discontinuous square wave, whose coefficients decay as $1/n$.
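The $1/n^2$ decay makes the series cheap to evaluate numerically; a sketch checking it at the arbitrary point $x = 1$:

```python
import numpy as np

x = 1.0
k = np.arange(1, 10_001)
# |sin x| = 2/pi - (4/pi) * sum_k cos(2kx)/(4k^2 - 1)
series = 2 / np.pi - (4 / np.pi) * np.sum(np.cos(2 * k * x) / (4 * k**2 - 1))
print(series, abs(np.sin(x)))   # agree to about 1e-4 with 10^4 terms
```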

A.1.3 The Sawtooth Wave (Staircase Convergence)

f(x)=xf(x) = x on (π,π)(-\pi,\pi), f(π)=f(π)=0f(\pi) = f(-\pi) = 0 (we assign the average at the jumps). We showed f(x)=2n=1(1)n+1nsin(nx)f(x) = 2\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\sin(nx).

Let us verify: at x=π/2x = \pi/2, f(π/2)=π/2f(\pi/2) = \pi/2. The series gives:

2n=1(1)n+1nsin(nπ/2)=2(113+1517+)=2π4=π22\sum_{n=1}^\infty \frac{(-1)^{n+1}}{n}\sin(n\pi/2) = 2\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 2 \cdot \frac{\pi}{4} = \frac{\pi}{2} \checkmark

This uses the Leibniz formula 11/3+1/5=π/41 - 1/3 + 1/5 - \cdots = \pi/4.

Convergence rate: For the sawtooth, cn=1/n|c_n| = 1/n, so fSNf2=2n>N1/n22/N\lVert f - S_N f \rVert^2 = 2\sum_{n>N} 1/n^2 \sim 2/N (from the integral test). Convergence is slow: to reduce the L2L^2 error below ε\varepsilon, we need N2/ε2N \sim 2/\varepsilon^2 terms.
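Both claims above, the pointwise value at $\pi/2$ and the slow $O(1/N)$ energy decay, can be checked directly:

```python
import numpy as np

n = np.arange(1, 100_001)

# Pointwise check at x = pi/2 (alternating series: error < first omitted term).
partial = 2 * np.sum((-1.0) ** (n + 1) / n * np.sin(n * np.pi / 2))
print(partial, np.pi / 2)

# L2 tail: 2 * sum_{n>N} 1/n^2 is ~ 2/N, confirming the slow convergence.
N = 100
tail = 2 * np.sum(1.0 / n[n > N] ** 2)
print(tail, 2 / N)
```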

A.2 The Heat Equation: Fourier's Original Application

The problem that motivated Fourier's entire theory is the heat equation on a rod:

ut=α22ux2,x[0,L],t>0\frac{\partial u}{\partial t} = \alpha^2 \frac{\partial^2 u}{\partial x^2}, \quad x \in [0, L], \quad t > 0

with boundary conditions u(0,t)=u(L,t)=0u(0, t) = u(L, t) = 0 (zero temperature at the ends) and initial condition u(x,0)=f(x)u(x, 0) = f(x).

Solution by separation of variables:

Assume u(x,t)=X(x)T(t)u(x, t) = X(x)T(t). Substituting: XT=α2XTX T' = \alpha^2 X'' T, so T/T=α2X/X=λT'/T = \alpha^2 X''/X = -\lambda (both sides must equal the same constant). This gives:

X+λα2X=0with X(0)=X(L)=0X'' + \frac{\lambda}{\alpha^2} X = 0 \quad \text{with } X(0) = X(L) = 0 T+λT=0T' + \lambda T = 0

The boundary condition forces Xn(x)=sin(nπx/L)X_n(x) = \sin(n\pi x/L) with eigenvalues λn=(nπα/L)2\lambda_n = (n\pi\alpha/L)^2. The corresponding time solution is Tn(t)=eλntT_n(t) = e^{-\lambda_n t}.

Superposition: The general solution is:

u(x,t)=n=1bnsin ⁣(nπxL)e(nπα/L)2tu(x, t) = \sum_{n=1}^\infty b_n \sin\!\left(\frac{n\pi x}{L}\right) e^{-(n\pi\alpha/L)^2 t}

The coefficients bnb_n are determined by the initial condition:

f(x)=u(x,0)=n=1bnsin ⁣(nπxL)f(x) = u(x, 0) = \sum_{n=1}^\infty b_n \sin\!\left(\frac{n\pi x}{L}\right)

This is exactly the sine series (half-range expansion, Section 6.2). So bn=2L0Lf(x)sin(nπx/L)dxb_n = \frac{2}{L}\int_0^L f(x)\sin(n\pi x/L)\,dx.

Physical insight: Each mode sin(nπx/L)\sin(n\pi x/L) decays at rate e(nπα/L)2te^{-(n\pi\alpha/L)^2 t}. High-frequency modes (large nn) decay much faster than low-frequency modes. At long times, the solution looks like just the n=1n = 1 fundamental mode. This is Fourier's original discovery: heat diffusion is a low-pass filter in frequency space.
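A sketch of this solution for a simple "hot spot" initial condition (the initial profile, grid sizes, and mode count are our choices for the demo):

```python
import numpy as np

L, alpha, M = np.pi, 1.0, 200
n = np.arange(1, M + 1)

# Initial condition: f = 1 on the middle third of the rod, 0 elsewhere.
xm = (np.arange(2000) + 0.5) * (L / 2000)              # midpoint quadrature grid
fm = ((xm > L / 3) & (xm < 2 * L / 3)).astype(float)

# Sine coefficients b_n = (2/L) * integral_0^L f(x) sin(n pi x / L) dx
b = (2 / L) * (np.sin(np.outer(n, np.pi * xm / L)) @ fm) * (L / 2000)

x = np.linspace(0, L, 401)
def u(t):
    """Series solution: each mode decays at rate (n pi alpha / L)^2."""
    decay = np.exp(-((n * np.pi * alpha / L) ** 2) * t)
    return (b * decay) @ np.sin(np.outer(n, np.pi * x / L))

# High modes die almost immediately; by t ~ 0.5 essentially only n = 1 remains.
print(u(0.0).max(), u(0.5).max())
```

Note that by symmetry of this initial condition about $L/2$, all even-$n$ coefficients vanish.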

For AI: This is the origin of the spectral bias observation. Gradient descent on a neural network has the same mathematical structure as heat diffusion: each Fourier mode of the residual $f^* - f_\theta$ decays at its own exponential rate. The rates, however, are set by the NTK spectrum rather than by $n^2$, and they favor the opposite end: low-frequency components (signal) are fit first, while high-frequency components (fine detail, noise) linger.

A.3 Dirichlet's Kernel: Detailed Analysis

Understanding why the Dirichlet kernel causes problems requires a careful look at its behavior.

Closed form derivation:

DN(x)=n=NNeinx=eiNxei(2N+1)x1eix1=eiNxeiNxei(N+1/2)xei(N+1/2)xeix/2eix/2D_N(x) = \sum_{n=-N}^N e^{inx} = e^{-iNx} \cdot \frac{e^{i(2N+1)x} - 1}{e^{ix} - 1} = e^{-iNx} \cdot e^{iNx} \cdot \frac{e^{i(N+1/2)x} - e^{-i(N+1/2)x}}{e^{ix/2} - e^{-ix/2}} =sin((N+1/2)x)sin(x/2)= \frac{\sin((N+1/2)x)}{\sin(x/2)}

Properties:

  1. DN(0)=2N+1D_N(0) = 2N + 1 (the value at the origin)
  2. 12πππDN(x)dx=1\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(x)\,dx = 1 (normalized)
  3. DND_N has zeros at x=2πk/(2N+1)x = 2\pi k/(2N+1) for k=±1,±2,,±Nk = \pm 1, \pm 2, \ldots, \pm N
  4. DND_N alternates sign: it is NOT a non-negative approximate identity
  5. DNL14π2lnN\lVert D_N \rVert_{L^1} \sim \frac{4}{\pi^2}\ln N (the L1L^1 norm grows logarithmically)

The logarithmic growth of $\lVert D_N \rVert_{L^1}$ is the root cause of convergence problems. A proper approximate identity has uniformly bounded $L^1$ norm (for a non-negative kernel of unit mass it equals 1); the Dirichlet kernel fails this, which is exactly why pointwise convergence can fail for continuous functions.

Contrast with the Fejer kernel:

FN(x)=1N+1k=0NDk(x)=1N+1(sin((N+1)x/2)sin(x/2))2F_N(x) = \frac{1}{N+1}\sum_{k=0}^N D_k(x) = \frac{1}{N+1}\left(\frac{\sin((N+1)x/2)}{\sin(x/2)}\right)^2

The Fejer kernel satisfies: FN0F_N \geq 0, 12πFN=1\frac{1}{2\pi}\int F_N = 1, and for any δ>0\delta > 0: FN(x)0F_N(x) \to 0 uniformly on {xδ}\{|x| \geq \delta\}. These three properties make FNF_N a genuine approximate identity, guaranteeing uniform convergence.
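The contrast shows up immediately in numerics. A sketch comparing the two kernels' (normalized) $L^1$ norms, using a midpoint grid to avoid the removable singularity at $x = 0$:

```python
import numpy as np

def dirichlet(N, x):
    return np.sin((N + 0.5) * x) / np.sin(x / 2)

def fejer(N, x):
    return (np.sin((N + 1) * x / 2) / np.sin(x / 2)) ** 2 / (N + 1)

M = 200_000
x = -np.pi + (np.arange(M) + 0.5) * (2 * np.pi / M)    # midpoints, never x = 0

for N in [10, 100, 1000]:
    # np.mean over the grid approximates (1/2pi) * integral over [-pi, pi]
    print(N, np.mean(np.abs(dirichlet(N, x))), np.mean(fejer(N, x)))
# Dirichlet's L1 norm grows like (4/pi^2) ln N; Fejer's stays at 1.
```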

A.4 Complex Analysis Connection: Laurent Series

The Fourier series of a function on [π,π][-\pi,\pi] is closely related to the Laurent series in complex analysis. If ff has Fourier series f(x)=ncneinxf(x) = \sum_n c_n e^{inx}, define z=eixz = e^{ix} (a point on the unit circle z=1|z| = 1). Then:

f(x)=n=cnznf(x) = \sum_{n=-\infty}^{\infty} c_n z^n

This is a Laurent series in zz centered at the origin, evaluated on the unit circle. The Fourier coefficient cnc_n is the znz^n coefficient of this Laurent series.

Analyticity and series convergence: If ff extends analytically to an annulus r<z<1/rr < |z| < 1/r (with r<1r < 1), then the Laurent series converges absolutely and uniformly on the unit circle, and the Fourier coefficients decay exponentially: cnCrn|c_n| \leq C r^{|n|}. This is the deep reason why analytic functions have exponentially decaying Fourier coefficients (Section 5.4).

For AI: The connection between Fourier analysis and analytic continuation underlies the theory of analytic functions of neural networks and the frequency-domain analysis of attention patterns. When LLM researchers analyze learned positional encodings in the complex plane, they are (often implicitly) using this Laurent series picture.

A.5 Numerical Experiments: Convergence Visualization

The following experiments (implemented in theory.ipynb) illustrate the key convergence phenomena:

Experiment 1 - Convergence speed: Compute partial sums SNfS_N f for N=1,5,10,50,100N = 1, 5, 10, 50, 100 for the square wave. Plot the L2L^2 error fSNf2=n>Ncn2\lVert f - S_N f \rVert^2 = \sum_{|n|>N} |c_n|^2 as a function of NN. Observe the O(1/N)O(1/N) decay rate from the cn1/n|c_n| \sim 1/n coefficients.

Experiment 2 - Gibbs phenomenon: Plot $S_{100}f$ for the square wave. Zoom in near $x = 0$. Measure the overshoot height and verify it is $\approx 0.179$, about $9\%$ of the jump height 2.

Experiment 3 - Cesaro means fix Gibbs: Plot σ100f\sigma_{100} f (Cesaro means) alongside S100fS_{100} f. Observe that the Gibbs overshoot is absent in the Cesaro means.

Experiment 4 - Decay rate vs smoothness: Compare the coefficient decay rate for: (a) square wave (discontinuous): cn1/n|c_n| \sim 1/n; (b) triangle wave (continuous, piecewise linear): cn1/n2|c_n| \sim 1/n^2; (c) smooth bump function: cn|c_n| decays super-polynomially. Plot logcn\log |c_n| vs logn\log n to see the exponent.

Experiment 5 - RoPE implementation: Implement RoPE as in Exercise 7. Visualize the rotation matrices RmR_m for positions m=0,1,,20m = 0, 1, \ldots, 20 and frequency index i=0i = 0. Verify that consecutive positions differ by a fixed rotation angle θ0\theta_0, confirming the Fourier basis interpretation.
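Experiment 2 can be sketched in a few lines; the measured overshoot matches the theoretical value $(2/\pi)\,\mathrm{Si}(\pi) - 1 \approx 0.1790$:

```python
import numpy as np

N = 100
k = np.arange(N)
x = np.linspace(0.0, 0.2, 20_001)   # zoom window just right of the jump at 0

# Partial sum of the square-wave series: (4/pi) * sum_k sin((2k+1)x)/(2k+1)
S = (4 / np.pi) * np.sum(np.sin(np.outer(2 * k + 1, x)) / (2 * k + 1)[:, None], axis=0)
overshoot = S.max() - 1.0
print(overshoot)   # ~0.179, i.e. ~9% of the jump height 2, for any large N
```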

