Fourier Series, Part 1: Intuition through Appendix A (Extended Examples and Computations)
1. Intuition
1.1 What Is a Fourier Series?
Strike a guitar string and you hear a pitch - the fundamental frequency. Listen more carefully and you hear overtones: frequencies at twice, three times, four times the fundamental. The rich timbre of a guitar versus a piano playing the same note comes entirely from the relative amplitudes of these overtones. A Fourier series is the precise mathematical statement of this physical observation: every periodic function is a (possibly infinite) weighted sum of sinusoids at integer multiples of a fundamental frequency.
More precisely, if $f$ is a $2\pi$-periodic function satisfying mild regularity conditions, then:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big),$$

where the coefficients $a_n$ and $b_n$ are uniquely determined by $f$. The term with $n = 1$ is the fundamental harmonic, and the terms with $n \ge 2$ are the overtones or higher harmonics.
The miracle is how general this is. The function $f$ does not have to be smooth. The square wave - which jumps discontinuously between $-1$ and $+1$ - has a Fourier series:

$$f(x) = \frac{4}{\pi}\sum_{k=0}^{\infty}\frac{\sin\big((2k+1)x\big)}{2k+1}.$$
Summing 100 terms gives a function that is nearly indistinguishable from the square wave - except near the jumps, where a small overshoot persists no matter how many terms you include. This is the Gibbs phenomenon, and we will study it carefully.
Non-example: Not every function has a convergent Fourier series in the pointwise sense. A function that is not in $L^1$ (e.g., $f(x) = 1/x$ near $x = 0$) does not even have well-defined Fourier coefficients. A continuous function can have a Fourier series that diverges at a single point (Du Bois-Reymond, 1873). The correct framework is $L^2$ convergence, not pointwise convergence.
1.2 Why It Matters for AI
Fourier series are not an abstract curiosity - they are embedded in the architecture of modern LLMs, CNNs, and audio models:
For AI:
- Positional encodings in Transformers (Vaswani et al., 2017) use $\sin(pos/10000^{2i/d})$ and $\cos(pos/10000^{2i/d})$ - these are Fourier basis functions at geometrically spaced frequencies.
- Rotary Position Embedding (RoPE) (Su et al., 2021), used in LLaMA-3, GPT-NeoX, and Mistral, interprets token positions as rotations in the complex plane - directly using complex Fourier basis vectors $e^{im\theta}$.
- Spectral bias (Rahaman et al., 2019): neural networks learn low-frequency Fourier components of the target function first and high-frequency components last. This governs convergence speed, generalization, and the effectiveness of data augmentation.
- Random Fourier features (Rahimi & Recht, 2007): kernel machines (SVMs, GPs) can be approximated in $O(nD)$ time by projecting data onto $D$ random Fourier basis functions sampled from the kernel's spectral density.
1.3 Historical Timeline
FOURIER ANALYSIS - HISTORICAL MILESTONES
========================================================================
1748 Euler studies vibrating strings; writes trigonometric series
1807 Fourier presents "Theorie de la chaleur" to Paris Academy;
claims every function has a trigonometric series expansion
(claim rejected - Laplace and Lagrange are skeptical)
1822 Fourier publishes "Theorie analytique de la chaleur"
1829 Dirichlet proves convergence for piecewise smooth functions;
gives first rigorous proof of pointwise convergence
1854 Riemann extends the theory; defines the Riemann integral partly
for the purposes of studying Fourier series
1873 Du Bois-Reymond exhibits a continuous function whose Fourier
series diverges at a point - Fourier's original claim was wrong
1900 Fejer proves Cesaro summability: every continuous function's
Fourier series converges in the Cesaro sense
1907 Parseval's identity proved rigorously by Fatou & Riesz
1966 Carleson proves pointwise convergence a.e. for L2 functions
     (Abel Prize, 2006)
2017 Vaswani et al. use Fourier basis for transformer PE
2021 Su et al. introduce RoPE; now standard in LLaMA-3, Mistral
2022 FNet (Lee-Thorp) replaces attention with Fourier transforms
2023 Spectral analysis of LLM weight matrices goes mainstream
========================================================================
1.4 The Geometric Picture
The key insight that makes Fourier series transparent is to think of functions as vectors in an infinite-dimensional inner product space.
On the interval $[-\pi, \pi]$, define the inner product of two functions $f$ and $g$ by:

$$\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx.$$

The functions $e_n(x) = e^{inx}$, $n \in \mathbb{Z}$, form an orthonormal set under this inner product:

$$\langle e_n, e_m\rangle = \delta_{nm},$$

where $\delta_{nm}$ is the Kronecker delta. This is easy to verify: if $n \neq m$, the integral of $e^{i(n-m)x}$ over a full period is zero.

Now the Fourier series is just the expansion of $f$ in this orthonormal basis. The coefficient $c_n$ is the projection of $f$ onto the basis vector $e_n$:

$$c_n = \langle f, e_n\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx.$$

This is identical to the formula for coordinates in any orthonormal basis: $v = \sum_n \langle v, e_n\rangle\,e_n$. The Fourier series is not a formula to be memorized - it is the inevitable consequence of projecting onto an orthonormal basis.

For AI: This geometric view directly explains RoPE. In RoPE, each attention-head dimension pair is treated as a 2D vector and rotated by angle $m\theta_k$ for token position $m$. This rotation is multiplication by $e^{im\theta_k}$ - a Fourier basis vector. The inner product between rotated queries and keys then depends only on their relative position $m - n$, making the positional encoding relative rather than absolute.
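A quick numerical check of this orthonormality (a minimal sketch; on a uniform grid over one period, the grid average stands in for $\frac{1}{2\pi}\int$, and the grid size is arbitrary):

```python
import numpy as np

# Check <e_n, e_m> = delta_nm under <f, g> = (1/2pi) * integral of f * conj(g).
x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
inner = lambda f, g: np.mean(f * np.conj(g))   # uniform-grid average = (1/2pi)*integral

for n in range(-3, 4):
    for m in range(-3, 4):
        val = inner(np.exp(1j * n * x), np.exp(1j * m * x))
        assert np.isclose(val, 1.0 if n == m else 0.0)
print("orthonormality verified for |n|, |m| <= 3")
```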
1.5 Three Equivalent Representations
Every Fourier series can be written in three equivalent forms. Each has advantages in different contexts.
Real (sine-cosine) form:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big).$$
Best for: real-valued functions where you want to see real amplitudes explicitly.
Complex exponential form:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{inx}.$$
Best for: computation, derivations, and AI applications (RoPE, FNet). More compact; the complex exponentials are eigenfunctions of differentiation.
Amplitude-phase form:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} A_n\cos(nx - \varphi_n),$$

where $A_n = \sqrt{a_n^2 + b_n^2}$ and $\varphi_n = \operatorname{atan2}(b_n, a_n)$. Best for: signal processing applications where amplitude and phase are the physically meaningful quantities.

Conversion between forms: $c_n = \frac{a_n - ib_n}{2}$ for $n \ge 1$, $c_0 = \frac{a_0}{2}$, and $c_{-n} = \overline{c_n}$ for real $f$. These identities follow immediately from Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$.
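The three forms can be checked against each other numerically. A minimal sketch using the square wave's coefficients $a_n = 0$, $b_n = \frac{4}{\pi n}$ for odd $n$ (derived in Section 2.3); the truncation order $N$ is arbitrary:

```python
import numpy as np

# All three forms give the same partial sum of the square wave series.
N = 25
x = np.linspace(-np.pi, np.pi, 1001)
n = np.arange(1, N + 1)
a = np.zeros(N)
b = np.where(n % 2 == 1, 4.0 / (np.pi * n), 0.0)

real_form = sum(a[k] * np.cos(n[k] * x) + b[k] * np.sin(n[k] * x) for k in range(N))

c = (a - 1j * b) / 2                                  # c_n for n >= 1
# the n and -n terms are conjugates, so their sum is twice the real part
complex_form = sum(2 * (c[k] * np.exp(1j * n[k] * x)).real for k in range(N))

A, phi = np.hypot(a, b), np.arctan2(b, a)             # amplitude-phase form
phase_form = sum(A[k] * np.cos(n[k] * x - phi[k]) for k in range(N))

assert np.allclose(real_form, complex_form)
assert np.allclose(real_form, phase_form)
```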
2. Formal Definitions
2.1 The Space $L^2(-\pi, \pi)$

The natural domain for Fourier series is the space of square-integrable functions on $(-\pi, \pi)$:

$$L^2(-\pi, \pi) = \Big\{ f : (-\pi, \pi) \to \mathbb{C} \;\Big|\; \int_{-\pi}^{\pi} |f(x)|^2\,dx < \infty \Big\}.$$

This space carries the (normalized) inner product:

$$\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx,$$

and induced norm $\|f\| = \sqrt{\langle f, f\rangle}$. Two functions that agree except on a set of measure zero are identified. Under this identification, $L^2(-\pi, \pi)$ is a complete inner product space - a Hilbert space. The full abstract theory is in Section 12-02 Hilbert Spaces; here we use only the concrete definition.
Examples in $L^2(-\pi, \pi)$:
- Every bounded measurable function (e.g., square wave, triangle wave)
- $f(x) = \log|x|$ (integrable singularity: $\int_{-\pi}^{\pi}\log^2|x|\,dx < \infty$)
- $f(x) = |x|^{-\alpha}$ for any $\alpha < 1/2$

Non-examples:
- $f(x) = 1/x$ near $x = 0$: $|f(x)|^2 = 1/x^2$, so $\int_{-\pi}^{\pi}|f|^2\,dx = \infty$
- $f(x) = e^{1/|x|}$ near $x = 0$: grows too fast to be square-integrable
2.2 The Trigonometric System
Definition (Trigonometric System): The collection $\{e^{inx} : n \in \mathbb{Z}\}$ is called the complex trigonometric system on $[-\pi, \pi]$.

Theorem (Orthonormality): $\langle e^{inx}, e^{imx}\rangle = \delta_{nm}$ under the normalized inner product of Section 2.1; equivalently, the rescaled functions $\frac{1}{\sqrt{2\pi}}e^{inx}$ are orthonormal under the unweighted inner product $\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$.

Proof. For $n \neq m$:

$$\langle e^{inx}, e^{imx}\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i(n-m)x}\,dx = \frac{1}{2\pi}\cdot\frac{e^{i(n-m)x}}{i(n-m)}\Bigg|_{-\pi}^{\pi} = 0,$$

since $e^{i(n-m)\pi} = e^{-i(n-m)\pi}$ (the exponents differ by a multiple of $2\pi i$), so the boundary values cancel. For $n = m$: $\langle e^{inx}, e^{inx}\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} 1\,dx = 1$.

The real trigonometric system $\{1, \cos(nx), \sin(nx) : n \ge 1\}$ is similarly orthogonal, with $\int_{-\pi}^{\pi}\cos^2(nx)\,dx = \int_{-\pi}^{\pi}\sin^2(nx)\,dx = \pi$ for $n \ge 1$ and $\int_{-\pi}^{\pi} 1\,dx = 2\pi$.

Completeness (stated): The trigonometric system is complete in $L^2(-\pi, \pi)$: for every $f \in L^2$ and every $\varepsilon > 0$, there exists a trigonometric polynomial $P$ such that $\|f - P\| < \varepsilon$. The proof uses the Weierstrass approximation theorem and is in Section 12-02.
2.3 Fourier Coefficients
Definition (Fourier Coefficients): For $f \in L^2(-\pi, \pi)$, the complex Fourier coefficients of $f$ are:

$$c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx, \qquad n \in \mathbb{Z}.$$

The real Fourier coefficients are:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx.$$

Relations between forms: For real $f$: $c_n = \frac{a_n - ib_n}{2}$ for $n \ge 1$, $c_0 = \frac{a_0}{2}$, and $c_{-n} = \overline{c_n}$ (Hermitian symmetry).
Standard examples - complete computations:
(i) Square wave: $f(x) = 1$ for $0 < x < \pi$, $f(x) = -1$ for $-\pi < x < 0$.

Since $f$ is odd, $a_n = 0$ for all $n$. For $b_n$:

$$b_n = \frac{2}{\pi}\int_0^{\pi}\sin(nx)\,dx = \frac{2}{\pi}\cdot\frac{1 - \cos(n\pi)}{n} = \frac{2\big(1 - (-1)^n\big)}{\pi n}.$$

So $b_n = \frac{4}{\pi n}$ for odd $n$ and $b_n = 0$ for even $n$.
(ii) Sawtooth wave: $f(x) = x$ for $-\pi < x < \pi$, extended periodically.

$f$ is odd, so $a_n = 0$. For $b_n$, integrate by parts:

$$b_n = \frac{2}{\pi}\int_0^{\pi} x\sin(nx)\,dx = \frac{2}{\pi}\left[\frac{-x\cos(nx)}{n}\Bigg|_0^{\pi} + \frac{1}{n}\int_0^{\pi}\cos(nx)\,dx\right] = \frac{2(-1)^{n+1}}{n}.$$

So $x \sim 2\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\sin(nx)$.
(iii) Triangle wave: $f(x) = |x|$ for $-\pi \le x \le \pi$.

$f$ is even, so $b_n = 0$. For $a_n$ ($n \ge 1$):

$$a_n = \frac{2}{\pi}\int_0^{\pi} x\cos(nx)\,dx = \frac{2}{\pi}\cdot\frac{\cos(n\pi) - 1}{n^2} = \frac{2\big((-1)^n - 1\big)}{\pi n^2}.$$

And $a_0 = \frac{2}{\pi}\int_0^{\pi} x\,dx = \pi$. So $a_n = -\frac{4}{\pi n^2}$ for odd $n$, $a_n = 0$ for even $n \ge 2$, and

$$|x| = \frac{\pi}{2} - \frac{4}{\pi}\sum_{k=0}^{\infty}\frac{\cos\big((2k+1)x\big)}{(2k+1)^2}.$$
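These closed forms are easy to sanity-check against the definition $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ (a minimal sketch; the grid size and tolerance are arbitrary):

```python
import numpy as np

# Recover c_n numerically via a Riemann sum and compare with the closed
# forms (i)-(iii), converted through c_n = (a_n - i b_n)/2.
x = np.linspace(-np.pi, np.pi, 200000, endpoint=False)
coef = lambda f, n: np.mean(f * np.exp(-1j * n * x))   # grid average = (1/2pi)*integral

square, sawtooth, triangle = np.sign(x), x, np.abs(x)

for n in (1, 2, 3, 5):
    c_sq  = -2j / (np.pi * n) if n % 2 else 0        # from b_n = 4/(pi n), odd n
    c_saw = -1j * (-1) ** (n + 1) / n                # from b_n = 2(-1)^{n+1}/n
    c_tri = -2 / (np.pi * n**2) if n % 2 else 0      # from a_n = -4/(pi n^2), odd n
    assert np.allclose(
        [coef(square, n), coef(sawtooth, n), coef(triangle, n)],
        [c_sq, c_saw, c_tri], atol=1e-4)
```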
Non-examples: Not every sequence is the Fourier coefficient sequence of an $L^2$ function. By Parseval's identity (Section 4.2), we need $\sum_n |c_n|^2 < \infty$. The sequence $c_n = 1$ for all $n$ is not a valid Fourier coefficient sequence.
2.4 Partial Sums and the Dirichlet Kernel
Definition: The $N$-th partial sum of the Fourier series of $f$ is:

$$S_N f(x) = \sum_{n=-N}^{N} c_n\,e^{inx}.$$

A crucial observation: the partial sum can be written as a convolution:

$$S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,D_N(x - t)\,dt,$$

where the Dirichlet kernel is:

$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = \frac{\sin\big((N + \tfrac{1}{2})u\big)}{\sin(u/2)}.$$

The closed form follows by summing the geometric series and simplifying using $e^{iu/2} - e^{-iu/2} = 2i\sin(u/2)$.
Key properties of $D_N$:
- $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ oscillates, with a central peak of height $2N + 1$ near $u = 0$
- $D_N$ does NOT stay non-negative (unlike an approximate identity), causing convergence issues

The failure of $D_N$ to remain non-negative is precisely what allows the Gibbs phenomenon (Section 3.4) and prevents uniform convergence at discontinuities.
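A short numerical check of the closed form and these properties (a minimal sketch; the tiny offset merely dodges the removable singularity at $u = 0$):

```python
import numpy as np

# Dirichlet kernel D_N: compare the exponential sum with the closed form,
# check the normalization (1/2pi)*integral D_N = 1, and note the sign changes.
N = 10
u = np.linspace(-np.pi, np.pi, 20000, endpoint=False) + 1e-9   # dodge u = 0
D_sum = np.sum([np.cos(n * u) for n in range(-N, N + 1)], axis=0)
D_closed = np.sin((N + 0.5) * u) / np.sin(u / 2)

assert np.allclose(D_sum, D_closed)
assert abs(np.mean(D_closed) - 1.0) < 1e-3   # grid mean approximates (1/2pi)*integral
assert D_closed.min() < 0                    # negative lobes: not an approximate identity
```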
3. Convergence Theory
3.1 Pointwise Convergence: Dirichlet's Theorem
Theorem (Dirichlet, 1829): Let $f$ be a $2\pi$-periodic function that is piecewise smooth on $[-\pi, \pi]$ (i.e., $f$ and $f'$ are piecewise continuous). Then the Fourier series of $f$ converges for every $x$, and:

$$\lim_{N\to\infty} S_N f(x) = \frac{f(x^+) + f(x^-)}{2}.$$

At points of continuity, $f(x^+) = f(x^-) = f(x)$, so the series converges to $f(x)$. At a jump discontinuity, the series converges to the average of the left and right limits.

Proof sketch. Write $S_N f(x) - \frac{f(x^+) + f(x^-)}{2}$ as an integral involving $D_N$ and the function $g(t) = \frac{f(x+t) + f(x-t) - f(x^+) - f(x^-)}{2\sin(t/2)}$. The key step: $g$ is integrable if $f$ is piecewise smooth (the singularity at $t = 0$ is removable). Then apply the Riemann-Lebesgue Lemma (Section 5.5): $\int g(t)\sin\big((N + \tfrac{1}{2})t\big)\,dt \to 0$ as $N \to \infty$ for any integrable $g$.
Dirichlet conditions (sufficient, not necessary):
- $f$ is bounded and has finitely many maxima, minima, and discontinuities on $[-\pi, \pi]$
- $f$ has left and right derivatives at every point
What Dirichlet's theorem does NOT say:
- It does not guarantee uniform convergence (fails at jumps due to Gibbs)
- It does not apply to all continuous functions (Du Bois-Reymond's example)
- It says nothing about convergence rate
3.2 $L^2$ Convergence and Completeness

The $L^2$ convergence theorem is stronger and cleaner than the pointwise result:

Theorem ($L^2$ Convergence): For every $f \in L^2(-\pi, \pi)$:

$$\lim_{N\to\infty}\|f - S_N f\| = 0.$$

Equivalently, $f = \sum_{n=-\infty}^{\infty} c_n e^{inx}$ as an $L^2$ limit. This follows directly from the completeness of the trigonometric system in $L^2$ (stated in Section 2.2).

Bessel's Inequality (preliminary to Parseval): For any $f \in L^2$:

$$\sum_{n=-N}^{N} |c_n|^2 \le \|f\|^2.$$

Proof. Expand $0 \le \|f - S_N f\|^2 = \|f\|^2 - \sum_{|n| \le N}|c_n|^2$. This shows the partial sums of $\sum_n |c_n|^2$ are bounded by $\|f\|^2$. Taking $N \to \infty$ gives Bessel.
Equality holds (Bessel becomes Parseval) exactly when the system is complete.
3.3 Uniform Convergence
Theorem: If $f \in C^1$ (continuously differentiable) and $2\pi$-periodic, then $S_N f \to f$ uniformly.

Proof sketch. Integrate by parts: $c_n(f') = in\,c_n(f)$, so $|c_n(f)| = \frac{|c_n(f')|}{|n|}$ for $n \neq 0$. By Cauchy-Schwarz and Bessel applied to $f'$, $\sum_{n\neq 0}|c_n(f)| \le \big(\sum_{n\neq 0}\tfrac{1}{n^2}\big)^{1/2}\big(\sum_{n\neq 0}|c_n(f')|^2\big)^{1/2} < \infty$, so the partial sums converge uniformly by the Weierstrass M-test. For $f \in C^2$, $|c_n| = O(1/n^2)$, giving faster uniform convergence.
Key point: Continuity alone is insufficient for uniform convergence. The counterexample (Du Bois-Reymond) is a continuous function whose Fourier series diverges at a single point.
3.4 The Gibbs Phenomenon
The Gibbs phenomenon is one of the most important practical facts about Fourier series.
Observation: Near a jump discontinuity at $x_0$, the partial sum $S_N f$ overshoots the function value by approximately $0.09\,\Delta$, where $\Delta$ is the jump height - about 9% of the jump - regardless of how large $N$ is.

Precise statement for the square wave: For the square wave with jump height $\Delta = 2$ at $x = 0$:

$$\lim_{N\to\infty} S_N f\!\left(\frac{\pi}{N}\right) = \frac{2}{\pi}\int_0^{\pi}\frac{\sin t}{t}\,dt \approx 1.17898.$$

So the overshoot is $\approx 0.179$, which is $\approx 8.9\%$ of the total jump of $2$.

Why it persists: The Dirichlet kernel has a tall central spike but also oscillating side lobes whose negative area does not vanish as $N$ grows. These side lobes cannot be eliminated by taking larger $N$ - they just become narrower and move closer to the discontinuity.
For AI: The Gibbs phenomenon is why "ringing" artifacts appear when you sharply truncate a frequency spectrum (e.g., in audio compression, image filtering, or when a language model encounters out-of-distribution high-frequency tokens). Windowing functions (Section 20-03) are the engineering fix.
Remedy - Fejer's theorem: Instead of taking partial sums $S_N f$, take their Cesaro means $\sigma_N f = \frac{1}{N+1}\sum_{k=0}^{N} S_k f$. Fejer proved these converge (uniformly for continuous $f$) and do NOT exhibit Gibbs overshoot.
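A minimal numerical sketch of both claims (the grid, the values of $N$, and the Cesaro order are arbitrary choices):

```python
import numpy as np

# Gibbs overshoot for the square wave: S_N overshoots the value 1 by a fixed
# fraction (~8.9% of the jump of 2) no matter how large N is, while the
# Cesaro mean sigma_N stays at or below 1.
x = np.linspace(1e-4, 1.0, 8000)                # just to the right of the jump at 0

def S(N):
    """N-th partial sum of the square wave series at the points x."""
    n = np.arange(1, N + 1, 2)                  # only odd harmonics contribute
    if n.size == 0:
        return np.zeros_like(x)
    return (4 / np.pi) * np.sin(np.outer(n, x)).T @ (1.0 / n)

for N in (50, 200, 800):
    print(N, S(N).max())                        # -> ~1.1790 for every N

sigma = np.mean([S(k) for k in range(0, 201)], axis=0)   # Cesaro mean sigma_200
print("Cesaro max:", sigma.max())               # <= 1: no overshoot
```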
3.5 Fejer's Theorem and Cesaro Summability
Theorem (Fejer, 1900): Let $f$ be continuous and $2\pi$-periodic. Then $\sigma_N f \to f$ uniformly as $N \to \infty$.

The Cesaro means use the Fejer kernel $F_N$ in place of the Dirichlet kernel:

$$\sigma_N f = f * F_N, \qquad F_N(u) = \frac{1}{N+1}\sum_{k=0}^{N} D_k(u) = \frac{1}{N+1}\left(\frac{\sin\frac{(N+1)u}{2}}{\sin\frac{u}{2}}\right)^2.$$
Unlike the Dirichlet kernel, the Fejer kernel is non-negative. This is why Cesaro summation fixes the Gibbs phenomenon - the averaging eliminates the negative side lobes.
For AI: Fejer's theorem is the prototype for windowing in signal processing. Multiplying a signal by a window function before taking the Fourier transform is equivalent to using a smoother summation kernel, which reduces spectral leakage and ringing (-> Section 20-03).
4. Parseval's Theorem and Energy
4.1 Bessel's Inequality
We proved Bessel's inequality in Section 3.2: $\sum_{n=-\infty}^{\infty}|c_n|^2 \le \|f\|^2$. This says the "energy" in the frequency representation is at most the energy in the time representation. Equality requires completeness.
4.2 Parseval's Identity
Theorem (Parseval's Identity): For $f \in L^2(-\pi, \pi)$ with Fourier coefficients $c_n$:

$$\sum_{n=-\infty}^{\infty} |c_n|^2 = \|f\|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx.$$

In real form: $\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^{\infty}\big(a_n^2 + b_n^2\big)$.

Proof. By completeness, $\|f - S_N f\| \to 0$. Then:

$$\|f\|^2 = \lim_{N\to\infty}\|S_N f\|^2 = \lim_{N\to\infty}\sum_{n=-N}^{N}|c_n|^2 = \sum_{n=-\infty}^{\infty}|c_n|^2,$$

using orthonormality of $\{e^{inx}\}$ for the second equality.
Physical interpretation: The total energy (power) of a signal equals the sum of energies in each frequency component. Fourier analysis is an energy-preserving change of basis.
More general form: For $f, g \in L^2(-\pi, \pi)$:

$$\langle f, g\rangle = \sum_{n=-\infty}^{\infty} c_n(f)\,\overline{c_n(g)}.$$

This is the polarization-identity version of Parseval.
4.3 Applications: Evaluating Infinite Series
Parseval's identity is a powerful tool for evaluating series that have no elementary closed form.
Example 1 - Basel problem via the sawtooth wave: Apply Parseval (real form) to $f(x) = x$ on $[-\pi, \pi]$, whose coefficients are $b_n = \frac{2(-1)^{n+1}}{n}$:

$$\frac{1}{\pi}\int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3} = \sum_{n=1}^{\infty} b_n^2 = \sum_{n=1}^{\infty}\frac{4}{n^2}.$$

Parseval gives $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$ - the Basel sum.

Example 2 - $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2}$: Apply Parseval to the square wave. The non-zero Fourier coefficients are $b_n = \frac{4}{\pi n}$ for odd $n$. Then:

$$\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = 2 = \sum_{n \text{ odd}}\frac{16}{\pi^2 n^2},$$

giving $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$.
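Both evaluations are easy to confirm by brute-force summation (a minimal check; the truncation point is arbitrary):

```python
import numpy as np

# Brute-force check of both Parseval evaluations.
n = np.arange(1, 2_000_000)
print(np.sum(1.0 / n**2), np.pi**2 / 6)        # Basel sum via the sawtooth
odd = n[::2]                                   # 1, 3, 5, ...
print(np.sum(1.0 / odd**2), np.pi**2 / 8)      # odd-squares sum via the square wave
```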
5. Properties of Fourier Coefficients
5.1 Linearity, Shift, and Scaling
Let $c_n(f)$ denote the $n$-th Fourier coefficient of $f$.

| Property | Statement | Proof idea |
|---|---|---|
| Linearity | $c_n(\alpha f + \beta g) = \alpha\,c_n(f) + \beta\,c_n(g)$ | Integral is linear |
| Shift | $c_n\big(f(\cdot - t_0)\big) = e^{-int_0}\,c_n(f)$ | Change of variable $x \mapsto x + t_0$ |
| Conjugation | $c_n(\overline{f}) = \overline{c_{-n}(f)}$ | Conjugate the integral |
| Reflection | $c_n\big(f(-\cdot)\big) = c_{-n}(f)$ | Change of variable $x \mapsto -x$ |
The shift property is the Fourier-series version of the time-shift property of the Fourier transform. It says: shifting a signal in time multiplies its Fourier coefficients by a complex exponential - i.e., introduces a linear phase. This is fundamental in signal alignment and is the mechanism behind relative positional encodings in transformers.
For AI - RoPE connection: In RoPE, the key and query vectors at positions $m$ and $n$ are $R_m q$ and $R_n k$, where $R_m$ is a (block-diagonal) rotation matrix. The dot product satisfies $(R_m q)^\top(R_n k) = q^\top R_{n-m} k$ - it depends only on the relative position $n - m$. This is exactly the Fourier shift property: shifting both signals by the same amount leaves their inner product unchanged.
5.2 Differentiation and Integration
Differentiation in frequency space: for continuously differentiable, $2\pi$-periodic $f$:

$$c_n(f') = in\,c_n(f).$$

Proof. Integrate by parts:

$$c_n(f') = \frac{1}{2\pi}\int_{-\pi}^{\pi} f'(x)\,e^{-inx}\,dx = \frac{1}{2\pi}\Big[f(x)\,e^{-inx}\Big]_{-\pi}^{\pi} + \frac{in}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx.$$

By periodicity, the boundary term vanishes, leaving $c_n(f') = in\,c_n(f)$.

Iterated differentiation: $c_n\big(f^{(k)}\big) = (in)^k\,c_n(f)$.

Implication for smoothness: A function $f \in C^k$ (k-times continuously differentiable and periodic) has $|c_n| = o(|n|^{-k})$ as $|n| \to \infty$. The smoother $f$ is, the faster its Fourier coefficients decay. This is the quantitative content of Section 5.4.

Integration: the antiderivative $F(x) = \int_0^x\big(f(t) - \bar{f}\big)\,dt$ is periodic and satisfies $c_n(F) = \frac{c_n(f)}{in}$ for $n \neq 0$, where $\bar{f} = c_0(f)$ is the mean.
5.3 Even and Odd Functions
Even functions ($f(-x) = f(x)$) have $b_n = 0$ for all $n$, so the Fourier series contains only cosines: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n\cos(nx)$ - a cosine series. The coefficients are $a_n = \frac{2}{\pi}\int_0^{\pi} f(x)\cos(nx)\,dx$.

Odd functions ($f(-x) = -f(x)$) have $a_n = 0$ for all $n$, so the Fourier series contains only sines: $f(x) = \sum_{n=1}^{\infty} b_n\sin(nx)$ - a sine series. The coefficients are $b_n = \frac{2}{\pi}\int_0^{\pi} f(x)\sin(nx)\,dx$.

For AI - DCT connection: The Discrete Cosine Transform (DCT) used in JPEG and MP3 compression is the even-extension Fourier series. The DCT coefficients are exactly the Fourier coefficients of the even extension of the signal to the doubled interval. This is why the DCT achieves better energy compaction than the DFT for real signals.
5.4 Smoothness and Spectral Decay
The relationship between regularity and spectral decay is fundamental to both signal processing and machine learning:
| Regularity of $f$ | Decay of $c_n$ | Practical implication |
|---|---|---|
| $f \in L^1$ | $c_n \to 0$ (by Riemann-Lebesgue) | All coefficients vanish asymptotically |
| piecewise smooth, with a jump discontinuity | $\|c_n\| \sim 1/\|n\|$ | Slow decay (Gibbs) |
| $C^1$ (continuously diff.) | $o(1/\|n\|)$ | Faster decay, no Gibbs |
| $C^k$ | $o(1/\|n\|^k)$ | Super-algebraic if $k$ large |
| analytic | $O(e^{-a\|n\|})$ for some $a > 0$ | Exponential decay |
For AI - Spectral bias (Rahaman et al., 2019): Neural networks learn the target function's Fourier decomposition from lowest to highest frequency. Formally, if $f^*$ is the target function, a network trained by gradient descent first approximates the coefficients $c_n(f^*)$ for small $|n|$ (low frequencies) and only later approximates high-frequency components. This implies:
- Benefit: Implicit regularization toward smooth (low-frequency) solutions -> good generalization
- Cost: Slow convergence on high-frequency targets; requires more data for texture-rich images
- Fix: Fourier feature embeddings (RFF, NeRF's positional encoding) inject high-frequency components explicitly
5.5 Riemann-Lebesgue Lemma
Theorem (Riemann-Lebesgue): If $f \in L^1(-\pi, \pi)$, then $c_n \to 0$ as $|n| \to \infty$.

Proof sketch. For a step function: direct computation. For general $f \in L^1$: approximate by step functions in the $L^1$ norm and use the bound $|c_n(f) - c_n(g)| \le \frac{1}{2\pi}\|f - g\|_{L^1}$.

Significance: This says every integrable function has Fourier coefficients that vanish at high frequencies. The rate of decay depends on smoothness (Section 5.4), but the qualitative statement holds for all integrable functions. This is the mathematical foundation of the claim "smooth signals are compressible in the Fourier domain."
6. Fourier Series on General Intervals
6.1 Extension to $[-L, L]$
For a $2L$-periodic function $f$, the Fourier series on $[-L, L]$ is obtained by the change of variables $x \mapsto \pi x/L$:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{in\pi x/L},$$

with coefficients:

$$c_n = \frac{1}{2L}\int_{-L}^{L} f(x)\,e^{-in\pi x/L}\,dx.$$

The fundamental frequency is $\pi/L$, and the $n$-th harmonic has frequency $n\pi/L$.

For AI - Transformer context length: In a transformer, sinusoidal positional encodings use frequencies $\omega_i = 10000^{-2i/d}$ for $i = 0, \ldots, d/2 - 1$. This covers a geometric range of frequencies spanning $[10^{-4}, 1]$ - analogous to sampling the Fourier series on a long interval at geometrically spaced frequencies. Longer context requires lower minimum frequencies, which is why RoPE's lowest frequency limits effective context length and why extended-context models (LLaMA-3-128K, Mistral) modify the base $10000$.
6.2 Half-Range Expansions
If $f$ is defined only on $[0, L]$ (not the full period), we can extend it to $[-L, L]$ in two ways:

Even extension: Extend by $f(-x) := f(x)$ -> leads to a cosine series on $[0, L]$:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n\cos\frac{n\pi x}{L}, \qquad a_n = \frac{2}{L}\int_0^{L} f(x)\cos\frac{n\pi x}{L}\,dx.$$

Odd extension: Extend by $f(-x) := -f(x)$ -> leads to a sine series on $[0, L]$:

$$f(x) = \sum_{n=1}^{\infty} b_n\sin\frac{n\pi x}{L}, \qquad b_n = \frac{2}{L}\int_0^{L} f(x)\sin\frac{n\pi x}{L}\,dx.$$

Application: Solving the heat equation on $[0, L]$ with Dirichlet boundary conditions ($u(0, t) = u(L, t) = 0$) uses the sine series expansion. With Neumann conditions ($u_x(0, t) = u_x(L, t) = 0$), use the cosine series.
7. Applications in Machine Learning
7.1 Sinusoidal Positional Encodings in Transformers
The original Transformer (Vaswani et al., 2017) uses positional encodings:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right).$$

This assigns each position $pos$ a $d$-dimensional vector whose components are sine and cosine values at geometrically spaced frequencies $\omega_i = 10000^{-2i/d}$.

Why this works: The set of frequencies forms a geometric progression spanning from $\omega_0 = 1$ (very high frequency, distinguishes adjacent tokens) down to $\omega_{d/2-1} \approx 10^{-4}$ (very low frequency, distinguishes widely separated tokens). This is exactly the strategy of a Fourier series on a long interval: use many harmonics at different scales.
Limitation: These are absolute positional encodings - the embedding for position 5 is fixed regardless of context. This creates problems for generalization to longer sequences than seen in training.
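A minimal NumPy sketch of this encoding (the function name sinusoidal_pe and the shapes are our own choices, not from the paper; $d$ is assumed even):

```python
import numpy as np

# Sinusoidal positional encodings: a (seq_len, d) matrix whose columns are
# sines/cosines at geometrically spaced frequencies 10000^(-2i/d).
def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                    # (1, d/2)
    freqs = 1.0 / 10000 ** (2 * i / d)                # geometric frequency ladder
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(pos * freqs)                 # even dims: sine
    pe[:, 1::2] = np.cos(pos * freqs)                 # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=128, d=64)
print(pe.shape)                                       # (128, 64)
```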
7.2 Rotary Positional Encoding (RoPE)
RoPE (Su et al., 2021), now standard in LLaMA-2/3, Mistral, Falcon, and GPT-NeoX, encodes position as a rotation of the query and key vectors.
Construction: For a $d$-dimensional query vector $q$, split it into $d/2$ pairs $(q_0, q_1), (q_2, q_3), \ldots$. Rotate the $i$-th pair by angle $m\theta_i$, where $m$ is the token position and $\theta_i = 10000^{-2i/d}$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}.$$

This is multiplication by $e^{im\theta_i}$ in the complex representation $z_i = q_{2i} + i\,q_{2i+1}$.

The key property: The attention score between the query at position $m$ and the key at position $n$ is:

$$\langle R_m q, R_n k\rangle = \langle q, R_{n-m} k\rangle.$$

The score depends only on $m - n$ - the relative position. This is the Fourier shift theorem: multiplying by $e^{im\theta}$ and taking a conjugate product gives sensitivity to relative displacement.
Extended context: The maximum effective context length is determined by the lowest frequency $\theta_{d/2-1} = 10000^{-(d-2)/d}$. Models like LLaMA-3-128K extend context by scaling positions (Position Interpolation, Chen et al., 2023) or by increasing the base from 10000 to 500000 (Roziere et al., 2023).
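A minimal RoPE sketch in the pairwise-rotation form above (the helper name rope, dimensions, and test positions are our own illustrative choices; real implementations vectorize over the whole sequence):

```python
import numpy as np

# Rotate 2D pairs of a vector by position-dependent angles and check that
# the dot product depends only on the relative position m - n.
def rope(v: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    d = v.shape[0]
    theta = base ** (-2 * np.arange(d // 2) / d)      # per-pair frequencies
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    v1, v2 = v[0::2], v[1::2]                         # split into pairs
    out = np.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin                   # 2x2 rotation per pair
    out[1::2] = v1 * sin + v2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope(q, m=5) @ rope(k, m=9)                      # relative offset 4
s2 = rope(q, m=100) @ rope(k, m=104)                  # same offset, shifted
assert np.isclose(s1, s2)                             # score depends on m - n only
```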
7.3 Spectral Bias of Neural Networks
The phenomenon (Rahaman et al., 2019): Neural networks trained with gradient descent learn a biased decomposition of the target function: low-frequency Fourier components are learned first, high-frequency components last.
Mathematical statement: Let $f^*$ be the target function with Fourier decomposition $f^* = \sum_n c_n e^{inx}$. A network trained on a finite sample first converges on the low-$|n|$ components (the error in $c_n$ decreases for small $|n|$ first).
Consequences for AI:
- Good: Networks converge to smooth solutions -> implicit regularization -> good out-of-distribution generalization for smooth targets
- Bad: Learning high-frequency signals (sharp edges, fine texture) requires more data and training time
- NeRF / SIREN fix: Sinusoidal activations (Sitzmann et al., 2020) or Fourier feature mappings (Tancik et al., 2020) inject high-frequency components, overcoming spectral bias for 3D scene representation
Mechanism: The NTK (Neural Tangent Kernel) of a standard MLP has a spectrum that decays with frequency. High-frequency components of the target correspond to small-eigenvalue directions of the NTK, which converge slowly under gradient descent.
7.4 Random Fourier Features and Kernel Approximation
The problem: Kernel methods (SVMs, GPs) require storing and computing an $n \times n$ kernel matrix - $O(n^2)$ memory and at least $O(n^2)$ computation. For large $n$, this is infeasible.

The solution (Rahimi & Recht, 2007): Any shift-invariant kernel $k(x, y) = k(x - y)$ can be written, by Bochner's theorem, as the Fourier transform of a non-negative measure:

$$k(x - y) = \int_{\mathbb{R}^d} e^{i\omega^\top(x - y)}\,p(\omega)\,d\omega.$$

Sampling frequencies $\omega_1, \ldots, \omega_D \sim p$ and defining the feature map:

$$z(x) = \sqrt{\tfrac{2}{D}}\big(\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)\big), \qquad b_j \sim \mathrm{Unif}[0, 2\pi],$$

gives $k(x, y) \approx z(x)^\top z(y)$. This reduces kernel computation to $O(nD)$ - linear in the dataset size.

Connection to Fourier series: $z(x)^\top z(y)$ is a finite Fourier expansion with randomly sampled frequencies. The approximation quality improves as $D$ increases, with concentration bounds showing $|z(x)^\top z(y) - k(x, y)| \le \varepsilon$ uniformly with high probability once $D$ grows like $\varepsilon^{-2}$ (up to logarithmic and dimension-dependent factors).
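A minimal sketch of the construction for the RBF kernel with $\sigma = 1$, where the spectral density is a standard Gaussian (the sizes $n$, $d$, $D$ are arbitrary):

```python
import numpy as np

# Random Fourier features for k(x, y) = exp(-||x - y||^2 / 2):
# draw frequencies from the kernel's spectral density (standard normal here).
rng = np.random.default_rng(0)
n, d, D = 200, 5, 4000
X = rng.normal(size=(n, d))

W = rng.normal(size=(d, D))                   # omega_j ~ N(0, I) for sigma = 1
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # feature map z(x)

K_approx = Z @ Z.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2)
print("max abs error:", np.abs(K_approx - K_exact).max())   # shrinks as D grows
```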
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using $c_n = \int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ without the $\frac{1}{2\pi}$ factor | Missing normalization; coefficients will be off by $2\pi$ | Always check which convention your source uses; in this repo we use $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ |
| 2 | Assuming the Fourier series always converges pointwise | False: Du Bois-Reymond exhibited a continuous function whose series diverges at a point | State the convergence type explicitly; use $L^2$ convergence for most theory |
| 3 | Assuming the sum at a discontinuity equals $f(x)$ | The series converges to the average $\frac{f(x^+) + f(x^-)}{2}$ at a jump | Apply Dirichlet's theorem: check for discontinuities before claiming convergence to $f(x)$ |
| 4 | Writing $c_{-n} = c_n$ for a real function | Correct relation is $c_{-n} = \overline{c_n}$ (conjugate, not equal) | For real $f$: $c_{-n} = \overline{c_n}$; the $c_n$ are real only if $f$ is real and even |
| 5 | Applying differentiation term-by-term without checking conditions | $c_n(f') = in\,c_n(f)$ requires $f$ to be absolutely continuous and periodic | Check that $f$ is absolutely continuous before differentiating term-by-term |
| 6 | Confusing Parseval's theorem with Bessel's inequality | Bessel gives $\sum\|c_n\|^2 \le \|f\|^2$; Parseval gives equality - they differ | Parseval requires completeness of the trigonometric system; Bessel holds for any orthonormal set |
| 7 | Claiming $\|c_n\| = O(1/n)$ decay for a smooth function | $O(1/n)$ decay is the signature of a jump discontinuity; $C^k$ functions decay as $o(1/n^k)$ | Match the decay rate to the smoothness class (Section 5.4) |
| 8 | Forgetting to subtract the mean before computing sine/cosine coefficients | The mean is the DC component; it appears in the $\frac{a_0}{2}$ term, not in $a_n, b_n$ for $n \ge 1$ | Always compute $a_0$ separately; the series is $\frac{a_0}{2} + \sum_{n\ge1}(a_n\cos nx + b_n\sin nx)$ |
| 9 | Using the complex form with $n \ge 0$ only but real-form coefficients | Complex form requires $n \in \mathbb{Z}$; real form uses $n \ge 0$ - these are different index sets | Pick one form and use it consistently; convert via $c_n = \frac{a_n - ib_n}{2}$ |
| 10 | Claiming the Gibbs phenomenon goes away as $N \to \infty$ | The overshoot height stays at ~9% of the jump; only its width shrinks | Gibbs is a permanent feature of the partial sum near discontinuities; use windowing to mitigate |
9. Exercises
Exercise 1 (*): Compute the complex Fourier coefficients $c_n$ of the triangle wave $f(x) = |x|$ on $[-\pi, \pi]$ and write out the first five non-zero terms of the Fourier series. Verify that $f$ is even, so the $c_n$ are real.

Exercise 2 (*): Prove from first principles that the functions $e_n(x) = e^{inx}$, $n \in \mathbb{Z}$, form an orthonormal set in $L^2(-\pi, \pi)$ with the inner product $\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$.

Exercise 3 (*): Using Parseval's identity applied to the square wave, show that $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$. Then use the odd/even split $\sum_{n\ge1}\frac{1}{n^2} = \frac{\pi^2}{8} + \frac{1}{4}\sum_{n\ge1}\frac{1}{n^2}$ to recover the Basel sum $\frac{\pi^2}{6}$.

Exercise 4 (**): Let $f(x) = e^{ax}$ for $-\pi < x < \pi$ with $a$ real and non-zero. (a) Compute $c_n$. (b) Apply Parseval's identity to obtain a formula for $\coth(\pi a)$ in terms of a series involving $\frac{1}{a^2 + n^2}$. (c) Verify your answer for $a = 1$ numerically.

Exercise 5 (**): Prove the Riemann-Lebesgue Lemma: if $f \in L^1(-\pi, \pi)$, then $c_n \to 0$ as $|n| \to \infty$. (Hint: first prove it for step functions, then approximate general $f$ in $L^1$.)

Exercise 6 (**): Consider $f(x) = x^2$ on $[-\pi, \pi]$. (a) Find all Fourier coefficients. (b) Show that $S_N f \to f$ uniformly. (c) Set $x = \pi$ in the resulting series to derive $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$.

Exercise 7 (***): Implement RoPE from scratch in Python. (a) Given query and key vectors of dimension $d$ and sequence length $n$, construct the rotation matrices for each position using $\theta_i = 10000^{-2i/d}$. (b) Compute the rotated attention scores for all pairs $(m, n)$. (c) Verify that the score depends only on $m - n$ by checking $\langle R_m q, R_n k\rangle = \langle R_{m+s} q, R_{n+s} k\rangle$ for several values of $s$.

Exercise 8 (***): Implement Random Fourier Features for the RBF kernel. (a) For the Gaussian kernel $k(x, y) = \exp\big(-\frac{\|x - y\|^2}{2\sigma^2}\big)$, show that the spectral density is a Gaussian $p(\omega) \propto \exp\big(-\frac{\sigma^2\|\omega\|^2}{2}\big)$. (b) Implement the feature map $z(x)$ using $D$ random frequency samples. (c) On a synthetic dataset in $\mathbb{R}^2$, compare the exact kernel matrix with the approximation $z(X)z(X)^\top$ and plot the approximation error as a function of $D$.
10. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact | Concrete System |
|---|---|---|
| Fourier basis vectors $e^{inx}$ | Foundation of positional encodings in all transformer variants | RoPE in LLaMA-3, Gemma-2, Mistral-7B, Falcon |
| Complex Fourier form $\sum_n c_n e^{inx}$ | Rotation = position; enables relative PE via shift theorem | RoPE (Su et al., 2021), xPos, YaRN |
| Parseval's identity | Energy preservation: Fourier is a unitary transform; spectral analysis of embeddings | WeightWatcher spectral analysis of LLM health |
| Spectral bias ($\|c_n\|$ decay) | Networks learn smooth functions first; governs training dynamics | SIREN (Sitzmann et al., 2020), NeRF frequency encoding |
| Gibbs phenomenon | Ringing in over-compressed audio/images; sudden context-length failure modes | JPEG compression artifacts, LLM boundary token issues |
| Fourier completeness | Any periodic signal can be exactly represented; digital audio encoding | MP3, AAC, Ogg Vorbis compression standards |
| Random Fourier features | $O(nD)$ kernel approximation replacing the $O(n^2)$ kernel matrix | Large-scale SVM/GP inference (Rahimi & Recht, 2007) |
11. Conceptual Bridge
Looking backward: Fourier series is built on three pillars from earlier chapters: the inner product structure of (from Section 12-02 Hilbert Spaces), complex exponentials (from Section 01 Mathematical Foundations), and convergence of series (from real analysis in Section 04 Calculus). If those foundations feel shaky, revisit them before proceeding.
Looking forward: Fourier series handles periodic functions on a bounded interval. The Fourier Transform (Section 20-02) extends this to aperiodic signals on the entire real line by taking the period $T \to \infty$. The discrete, computationally feasible version is the DFT and FFT (Section 20-03). The Convolution Theorem (Section 20-04) shows why Fourier analysis is so powerful: convolution in time becomes multiplication in frequency. Finally, Wavelets (Section 20-05) overcome Fourier's fundamental limitation - the inability to localize in both time and frequency simultaneously.
POSITION IN THE FOURIER ANALYSIS CHAPTER
========================================================================
Section 20-01 Fourier Series -- YOU ARE HERE
v (take T -> infty)
Section 20-02 Fourier Transform
v (discretize: N samples)
Section 20-03 DFT and FFT
v (conv. in time <-> mult. in freq)
Section 20-04 Convolution Theorem
v (localize in time AND freq)
Section 20-05 Wavelets
Prerequisites: Forward pointers:
Section 01 Mathematical Foundations -> Section 20-02 (continuous FT)
Section 04 Calculus Fundamentals -> Section 20-03 (discrete FT)
Section 12-02 Hilbert Spaces -> Section 12-03 (kernel methods via Bochner)
========================================================================
<- Back to Fourier Analysis | Next: Fourier Transform ->
Appendix A: Extended Examples and Computations
A.1 Full Derivation: Fourier Series of Common Waveforms
This appendix provides complete, step-by-step derivations for the Fourier coefficients of the most important periodic waveforms. These are not presented as exercises - they are reference derivations that you should work through once and understand deeply. Every step is motivated.
A.1.1 Rectangular Pulse Train
Define a pulse of width $2a$ ($0 < a < \pi$) centered at $x = 0$, repeated with period $2\pi$:

$$f(x) = \begin{cases} 1, & |x| \le a, \\ 0, & a < |x| \le \pi. \end{cases}$$

Complex coefficients: for $n \neq 0$:

$$c_n = \frac{1}{2\pi}\int_{-a}^{a} e^{-inx}\,dx = \frac{\sin(na)}{\pi n} = \frac{a}{\pi}\,\mathrm{sinc}(na),$$

where $\mathrm{sinc}(u) = \frac{\sin u}{u}$. For $n = 0$: $c_0 = \frac{a}{\pi}$ (the duty cycle).

Key observation: The envelope of $|c_n|$ versus $n$ follows a $\mathrm{sinc}$ function. The first zero crossing occurs at $n = \pi/a$. A narrower pulse ($a$ small) -> wider spectrum (more high frequencies needed to represent the sharp edges). This is the time-frequency trade-off in action.

Spectrum width: The "bandwidth" (distance to first spectral null) is $\pi/a$ in normalized frequency. Narrow pulses are "broadband"; wide pulses (large $a$, approaching a constant) have most energy concentrated near $n = 0$.
A.1.2 Full-Wave Rectified Sine
The function $f(x) = |\sin x|$ is even and non-negative. Since $f$ is even, $b_n = 0$ for all $n$. For $a_n$:

$$a_n = \frac{2}{\pi}\int_0^{\pi}\sin x\,\cos(nx)\,dx.$$

Using the product-to-sum formula $\sin x\cos(nx) = \frac{1}{2}\big[\sin((1+n)x) + \sin((1-n)x)\big]$:

For $n = 1$: $a_1 = \frac{1}{\pi}\int_0^{\pi}\sin(2x)\,dx = 0$.

For $n \neq 1$:

$$a_n = \frac{1}{\pi}\left[\frac{1 - \cos((1+n)\pi)}{1+n} + \frac{1 - \cos((1-n)\pi)}{1-n}\right] = \begin{cases} -\dfrac{4}{\pi(n^2 - 1)}, & n \text{ even}, \\[4pt] 0, & n \text{ odd}. \end{cases}$$

So, with $a_0 = \frac{4}{\pi}$:

$$|\sin x| = \frac{2}{\pi} - \frac{4}{\pi}\sum_{k=1}^{\infty}\frac{\cos(2kx)}{4k^2 - 1}.$$

Application: The full-wave rectified sine is used in AC-to-DC conversion. Its spectrum has no odd harmonics - only even harmonics - which is why its Fourier representation converges faster (coefficients decay as $1/n^2$) than the square wave (which has $1/n$ decay).
A.1.3 The Sawtooth Wave (Staircase Convergence)
$f(x) = x$ on $(-\pi, \pi)$, with $f(\pm\pi) = 0$ (we assign the average at the jumps). We showed $b_n = \frac{2(-1)^{n+1}}{n}$.

Let us verify: at $x = \frac{\pi}{2}$, $f\big(\frac{\pi}{2}\big) = \frac{\pi}{2}$. The series gives:

$$2\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\sin\frac{n\pi}{2} = 2\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 2\cdot\frac{\pi}{4} = \frac{\pi}{2}.$$

This uses the Leibniz formula $1 - \frac{1}{3} + \frac{1}{5} - \cdots = \frac{\pi}{4}$.

Convergence rate: For the sawtooth, $|b_n| = \frac{2}{n}$, so the squared $L^2$ error of $S_N$ is $\sum_{n > N} b_n^2 = O(1/N)$ (from the integral test). Convergence is slow: to reduce the $L^2$ error below $\varepsilon$, we need $N \sim 1/\varepsilon^2$ terms.
A.2 The Heat Equation: Fourier's Original Application
The problem that motivated Fourier's entire theory is the heat equation on a rod:

$$\frac{\partial u}{\partial t} = \alpha\,\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < L,\ t > 0,$$

with boundary conditions $u(0, t) = u(L, t) = 0$ (zero temperature at the ends) and initial condition $u(x, 0) = f(x)$.
Solution by separation of variables:
Assume $u(x, t) = X(x)\,T(t)$. Substituting: $\frac{T'(t)}{\alpha T(t)} = \frac{X''(x)}{X(x)} = -\lambda$ (both sides must equal the same constant). This gives:

$$X'' + \lambda X = 0, \qquad T' + \alpha\lambda T = 0.$$

The boundary condition forces $X_n(x) = \sin\frac{n\pi x}{L}$ with eigenvalues $\lambda_n = \big(\frac{n\pi}{L}\big)^2$. The corresponding time solution is $T_n(t) = e^{-\alpha\lambda_n t}$.
Superposition: The general solution is:

$$u(x, t) = \sum_{n=1}^{\infty} b_n\,e^{-\alpha(n\pi/L)^2 t}\,\sin\frac{n\pi x}{L}.$$

The coefficients are determined by the initial condition:

$$u(x, 0) = \sum_{n=1}^{\infty} b_n\sin\frac{n\pi x}{L} = f(x).$$

This is exactly the sine series (half-range expansion, Section 6.2). So $b_n = \frac{2}{L}\int_0^{L} f(x)\sin\frac{n\pi x}{L}\,dx$.

Physical insight: Each mode decays at rate $\alpha\big(\frac{n\pi}{L}\big)^2$. High-frequency modes (large $n$) decay much faster than low-frequency modes. At long times, the solution looks like just the fundamental mode $n = 1$. This is Fourier's original discovery: heat diffusion is a low-pass filter in frequency space.
For AI: This is the origin of the spectral bias observation. In a sense, gradient descent on a neural network is like running the heat equation in weight space - it diffuses high-frequency components (noise) faster than low-frequency components (signal). The spectral bias of neural networks has exactly the same mathematical structure as heat diffusion.
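A minimal numerical sketch of the sine-series solution (the initial condition $f(x) = x(L - x)$, $\alpha = 0.1$, and the mode count are arbitrary illustrative choices):

```python
import numpy as np

# Sine-series solution of u_t = alpha * u_xx on (0, L) with u(0) = u(L) = 0.
L, alpha = np.pi, 0.1
x = np.linspace(0.0, L, 400)
dx = x[1] - x[0]
f = x * (L - x)                                      # initial temperature profile

n = np.arange(1, 101)[:, None]                       # mode numbers, shape (100, 1)
modes = np.sin(n * np.pi * x / L)                    # shape (100, len(x))
b = (2.0 / L) * (f * modes).sum(axis=1) * dx         # sine coefficients b_n

def u(t):
    decay = np.exp(-alpha * (n.ravel() * np.pi / L) ** 2 * t)
    return (b * decay) @ modes                       # superposition of decayed modes

print(np.abs(u(0.0) - f).max())   # small: the series reproduces f at t = 0
print(np.abs(u(50.0)).max())      # the solution has largely decayed toward 0
```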
A.3 Dirichlet's Kernel: Detailed Analysis
Understanding why the Dirichlet kernel causes problems requires a careful look at its behavior.
Closed form derivation:

$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = e^{-iNu}\,\frac{e^{i(2N+1)u} - 1}{e^{iu} - 1} = \frac{\sin\big((N + \tfrac{1}{2})u\big)}{\sin(u/2)}.$$
Properties:
- $D_N(0) = 2N + 1$ (the value at the origin)
- $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ has zeros at $u = \frac{2\pi k}{2N+1}$ for $k = 1, \ldots, 2N$
- $D_N$ alternates sign between consecutive zeros: it is NOT a non-negative approximate identity
- $\frac{1}{2\pi}\int_{-\pi}^{\pi}|D_N(u)|\,du \sim \frac{4}{\pi^2}\log N$ (the $L^1$ norm grows logarithmically)

The logarithmic growth of $\|D_N\|_{L^1}$ is the root cause of convergence problems. A proper approximate identity would have bounded $L^1$ norms - the Dirichlet kernel fails this, which is exactly why pointwise convergence can fail for continuous functions.
Contrast with the Fejer kernel:

$$F_N(u) = \frac{1}{N+1}\left(\frac{\sin\frac{(N+1)u}{2}}{\sin\frac{u}{2}}\right)^2.$$

The Fejer kernel satisfies: $F_N \ge 0$, $\frac{1}{2\pi}\int_{-\pi}^{\pi} F_N(u)\,du = 1$, and for any $\delta > 0$: $F_N \to 0$ uniformly on $\delta \le |u| \le \pi$. These three properties make $\{F_N\}$ a genuine approximate identity, guaranteeing uniform convergence of $\sigma_N f$ for continuous $f$.
A.4 Complex Analysis Connection: Laurent Series
The Fourier series of a function on $[-\pi, \pi]$ is closely related to the Laurent series in complex analysis. If $f$ has Fourier series $\sum_{n=-\infty}^{\infty} c_n e^{inx}$, define $z = e^{ix}$ (a point on the unit circle $|z| = 1$). Then:

$$f = \sum_{n=-\infty}^{\infty} c_n z^n.$$

This is a Laurent series in $z$ centered at the origin, evaluated on the unit circle. The Fourier coefficient $c_n$ is the $n$-th coefficient of this Laurent series.

Analyticity and series convergence: If $f$ extends analytically to an annulus $r < |z| < R$ (with $r < 1 < R$), then the Laurent series converges absolutely and uniformly on the unit circle, and the Fourier coefficients decay exponentially: $|c_n| \le C\rho^{|n|}$ for some $\rho < 1$. This is the deep reason why analytic functions have exponentially decaying Fourier coefficients (Section 5.4).
For AI: The connection between Fourier analysis and analytic continuation underlies the theory of analytic functions of neural networks and the frequency-domain analysis of attention patterns. When LLM researchers analyze learned positional encodings in the complex plane, they are (often implicitly) using this Laurent series picture.
A.5 Numerical Experiments: Convergence Visualization
The following experiments (implemented in theory.ipynb) illustrate the key convergence phenomena:
Experiment 1 - Convergence speed: Compute partial sums $S_N f$ for a range of $N$ for the square wave. Plot the $L^2$ error as a function of $N$. Observe the $O(1/\sqrt{N})$ decay rate implied by the $O(1/n)$ coefficients.

Experiment 2 - Gibbs phenomenon: Plot $S_N f$ for the square wave for increasing $N$. Zoom in near the jump at $x = 0$. Measure the overshoot height and verify it is $\approx 9\%$ of the jump height 2.

Experiment 3 - Cesaro means fix Gibbs: Plot $\sigma_N f$ (Cesaro means) alongside $S_N f$. Observe that the Gibbs overshoot is absent in the Cesaro means.

Experiment 4 - Decay rate vs smoothness: Compare the coefficient decay rate for: (a) square wave (discontinuous): $|c_n| \sim 1/n$; (b) triangle wave (continuous, piecewise linear): $|c_n| \sim 1/n^2$; (c) smooth bump function: decays super-polynomially. Plot $\log|c_n|$ vs $\log n$ to see the exponent.

Experiment 5 - RoPE implementation: Implement RoPE as in Exercise 7. Visualize the rotation matrices for the first few positions at frequency index $i = 0$. Verify that consecutive positions differ by a fixed rotation angle $\theta_0$, confirming the Fourier basis interpretation.