
Fourier Series: Appendix B: Advanced Topics to Appendix I: Implementation Details

Appendix B: Advanced Topics

B.1 Multidimensional Fourier Series

On a $d$-dimensional torus $\mathbb{T}^d = [-\pi,\pi]^d$, the Fourier series generalizes naturally. For $\mathbf{x} = (x_1,\ldots,x_d) \in \mathbb{T}^d$ and $\mathbf{n} = (n_1,\ldots,n_d) \in \mathbb{Z}^d$:

$$f(\mathbf{x}) = \sum_{\mathbf{n} \in \mathbb{Z}^d} c_{\mathbf{n}} e^{i\mathbf{n}\cdot\mathbf{x}}, \quad c_{\mathbf{n}} = \frac{1}{(2\pi)^d}\int_{\mathbb{T}^d} f(\mathbf{x})\, e^{-i\mathbf{n}\cdot\mathbf{x}}\,d\mathbf{x}$$

The orthonormality relation is $\langle e^{i\mathbf{n}\cdot\mathbf{x}}, e^{i\mathbf{m}\cdot\mathbf{x}} \rangle = \delta_{\mathbf{n}\mathbf{m}}$, and Parseval generalizes to $\lVert f \rVert^2 = \sum_{\mathbf{n}} |c_{\mathbf{n}}|^2$.

For AI - 2D images: Images are functions on a 2D torus (with periodic boundary). The 2D Fourier series (or DFT) represents an image as a sum of 2D complex exponentials $e^{i(n_1 x_1 + n_2 x_2)}$. Low-frequency components $(n_1, n_2)$ near the origin represent slowly varying regions; high-frequency components represent edges and textures. JPEG compression discards high-frequency 2D DCT coefficients - this is the 2D cosine series applied to $8 \times 8$ blocks.
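A minimal NumPy sketch of this low-vs-high frequency split (the synthetic $8 \times 8$ block and the chosen low-frequency window are our own illustration, not data from the text):

import numpy as np

# Synthetic 8x8 "image" block: a smooth gradient plus a sharp vertical edge
x1, x2 = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 0.5 + 0.3 * np.cos(2 * np.pi * x1 / 8)
block[:, 4:] += 1.0                      # the edge adds high-frequency content

C = np.fft.fft2(block) / block.size      # 2D Fourier coefficients c_{n1,n2}
power = np.abs(C) ** 2

# Energy fraction in the lowest frequencies |n1|, |n2| <= 1
low = power[[0, 1, -1]][:, [0, 1, -1]].sum()
print(f"low-frequency energy fraction: {low / power.sum():.3f}")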

B.2 Distributions and the Dirac Delta

For a rigorous treatment of impulse-like functions in Fourier analysis, we need distributions (generalized functions). The Dirac delta $\delta_{x_0}$ is defined by:

$$\langle \delta_{x_0}, \phi \rangle = \phi(x_0) \quad \text{for all test functions } \phi \in C^\infty[-\pi,\pi]$$

Its Fourier coefficients are $c_n[\delta_0] = \frac{1}{2\pi}$, so its Fourier series is:

$$\delta_0(x) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} e^{inx}$$

This is the completeness relation of the Fourier system: the identity operator decomposes as a sum over all frequency components. In functional analysis, this is written as $I = \sum_n |e^{inx}\rangle\langle e^{inx}|$ (in Dirac notation).

Verification: For any smooth $f$, pairing the series against $f$ gives $\frac{1}{2\pi}\sum_n \int_{-\pi}^{\pi} e^{inx} f(x)\,dx = \sum_n c_{-n}[f] = \sum_n c_n[f]$ (reindexing $n \to -n$), and by the inversion formula evaluated at $x = 0$ this equals $f(0) = \langle \delta_0, f \rangle$, as required.

Comb function: The Dirac comb $\text{III}(x) = \sum_{k=-\infty}^{\infty} \delta(x - 2\pi k)$ has Fourier coefficients $c_n = \frac{1}{2\pi}$ for all $n$ - the same Fourier series as $\delta_0$. This is no paradox: a Fourier series always represents a $2\pi$-periodic object, and the $2\pi$-periodization of $\delta_0$ is exactly the comb.

B.3 Pointwise Convergence: The Full Story (Carleson's Theorem)

The question of pointwise convergence of Fourier series remained open for well over a century after Dirichlet's 1829 result.

Timeline of the convergence problem:

  • Dirichlet (1829): Piecewise smooth functions -> pointwise convergence everywhere
  • Du Bois-Reymond (1876): Exhibited a continuous function whose Fourier series diverges at one point
  • Kolmogorov (1923): Exhibited an $L^1$ function whose Fourier series diverges almost everywhere (strengthened to everywhere in 1926)!
  • Carleson (1966): Proved that for every $f \in L^2$, the Fourier series converges pointwise almost everywhere
  • Hunt (1968): Extended to $f \in L^p$ for any $p > 1$

Carleson's theorem (he later won the Abel Prize in 2006) resolved the central question: $L^2$ functions have Fourier series converging a.e. The proof is one of the most difficult in analysis, involving a refined version of Calderón-Zygmund theory.

Practical implication: For the functions we encounter in ML (bounded measurable, piecewise smooth, $L^2$), Fourier series converge almost everywhere - and everywhere for piecewise smooth functions. Everywhere-divergence requires leaving every $L^p$ with $p > 1$, as in Kolmogorov's $L^1$ example.

B.4 Sobolev Spaces and Regularity via Fourier Series

Definition (Sobolev spaces): For $s \geq 0$, the Sobolev space $H^s[-\pi,\pi]$ consists of the $L^2$ functions whose Fourier coefficients satisfy:

$$\lVert f \rVert_{H^s}^2 = \sum_{n=-\infty}^{\infty} (1 + n^2)^s |c_n|^2 < \infty$$

The parameter $s$ measures smoothness: $H^0 = L^2$, $H^1$ is roughly "once weakly differentiable with derivative in $L^2$", and $H^s$ for $s > 1/2$ consists of continuous functions (Sobolev embedding theorem).

Fourier characterization of regularity: If $|c_n| = O(|n|^{-s-1/2-\varepsilon})$ for some $\varepsilon > 0$, then $f \in H^s$; conversely, membership in $H^s$ says exactly that $(1+n^2)^{s/2} c_n$ is square-summable. Up to this summability caveat, Sobolev regularity is readable directly from the Fourier spectrum.

For AI - Spectral bias quantified: The spectral bias (Section 7.3) says networks approximate $f \in H^s$ more easily if $s$ is large (smooth target). Formally, a network trained by gradient descent with $n$ samples achieves $L^2$ risk of order $n^{-2s/(2s+1)}$ when targeting $f \in H^s$ - the standard minimax rate for nonparametric regression. Higher $s$ -> faster convergence -> fewer samples needed.

For AI - Kernel smoothness: The RKHS of many standard kernels is (equivalent to) a Sobolev space $H^s$. The RBF kernel corresponds informally to "$H^\infty$" (infinitely smooth functions), while the Matérn kernel with regularity parameter $s$ in dimension $d$ corresponds to $H^{s+d/2}$. This is why the choice of kernel implicitly assumes a smoothness level for the target function.

B.5 Fejér-Riesz Theorem and Spectral Factorization

Theorem (Fejér-Riesz): Any trigonometric polynomial $p(x) = \sum_{|n| \leq N} c_n e^{inx}$ that is non-negative on $[-\pi,\pi]$ can be written as $p(x) = |q(x)|^2$, where $q(x) = \sum_{n=0}^N d_n e^{inx}$ is an analytic polynomial (only non-negative frequencies).

Spectral factorization: Given a non-negative spectral density $S(\omega) \geq 0$ (such as the power spectral density of a stationary process), we can find a causal filter $H(\omega)$ such that $|H(\omega)|^2 = S(\omega)$. This is the foundation of the Wiener filter (used in denoising) and of Cholesky-like factorizations in probability theory.

For AI - Connection to SSMs: State Space Models (S4, Mamba) parameterize their convolution kernels through a spectral factorization. The kernel is written as $K = \text{IFFT}(H(\omega))$ where $H(\omega)$ is constrained to be the frequency response of a stable ARMA model. The Fejér-Riesz theorem guarantees that the spectral density $|H|^2$ can always be factored into a causal part.

B.6 The Discrete Cosine Transform and Signal Compression

The Discrete Cosine Transform (DCT) of Type II is defined as:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\left(\frac{\pi k (2n+1)}{2N}\right), \quad k = 0, 1, \ldots, N-1$$

This is the even-extension Fourier series (cosine series) applied to a finite sequence. The DCT has two critical properties for signal compression:

  1. Energy compaction: For natural signals (images, audio), the DCT concentrates most of the signal energy in the first few low-frequency coefficients. The DCT of a typical $8 \times 8$ image block has 95%+ of its energy in the top-left $3 \times 3$ coefficients.

  2. Decorrelation: The DCT approximately diagonalizes the covariance matrix of natural signals, making the coefficients nearly uncorrelated. This allows independent quantization of each coefficient.

JPEG compression pipeline:

  1. Divide the image into $8 \times 8$ blocks
  2. Apply 2D DCT to each block
  3. Quantize coefficients (coarser quantization for high frequencies)
  4. Run-length encode the quantized sparse coefficients
  5. Huffman encode

The quality parameter in JPEG controls the quantization step sizes - higher quality means finer quantization of the high-frequency DCT coefficients.

DCT vs DFT: The DCT avoids the "spectral leakage" (Gibbs phenomenon) of the DFT because the even extension of a finite sequence is continuous at the boundaries. This is why JPEG uses DCT rather than DFT.
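A short sketch of energy compaction, using scipy.fft.dctn for the 2D DCT-II (the smooth test block is our own stand-in; real image blocks behave similarly):

import numpy as np
from scipy.fft import dctn

# A smooth 8x8 block, standing in for a typical low-detail image patch
x1, x2 = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 100 + 30 * np.cos(np.pi * x1 / 8) + 20 * np.cos(np.pi * x2 / 8)

coeffs = dctn(block, type=2, norm="ortho")   # orthonormal 2D DCT-II
energy = coeffs ** 2
frac = energy[:3, :3].sum() / energy.sum()   # top-left 3x3 corner
print(f"energy in the top-left 3x3 coefficients: {frac:.4f}")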

B.7 Uniform Convergence: Weierstrass M-Test Application

Let us prove uniform convergence of the triangle wave Fourier series rigorously.

Claim: The Fourier series $f(x) = \frac{\pi}{2} - \frac{4}{\pi}\sum_{k=0}^\infty \frac{\cos((2k+1)x)}{(2k+1)^2}$ (the triangle wave $|x|$) converges uniformly.

Proof via Weierstrass M-test: We need $\sum_k M_k < \infty$, where $M_k = \sup_x |a_k \cos((2k+1)x)| = |a_k| = \frac{4}{\pi(2k+1)^2}$.

Then $\sum_{k=0}^\infty M_k = \frac{4}{\pi}\sum_{k=0}^\infty \frac{1}{(2k+1)^2} = \frac{4}{\pi} \cdot \frac{\pi^2}{8} = \frac{\pi}{2} < \infty$.

Since the series of bounds converges, the Weierstrass M-test gives uniform convergence. $\square$

Contrast with square wave: there the coefficients are $b_{2k+1} = 4/(\pi(2k+1))$, so $M_k = 4/(\pi(2k+1))$, and $\sum_k 4/(\pi(2k+1))$ diverges (the odd harmonic series). The Weierstrass M-test fails - consistent with the non-uniform convergence near the jumps.
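A quick numerical check (our own sketch): the sup-norm error of the partial sum is dominated by the tail $\sum_{k>N} 4/(\pi(2k+1)^2)$, and in fact the two agree at $x = 0$.

import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)
target = np.abs(x)   # the triangle wave equals |x| on [-pi, pi]

for N in (5, 50, 500):
    S = np.pi / 2 - (4 / np.pi) * sum(
        np.cos((2 * k + 1) * x) / (2 * k + 1) ** 2 for k in range(N + 1)
    )
    sup_err = np.max(np.abs(S - target))
    # Tail of the M-test bound, truncated far out
    tail = (4 / np.pi) * sum(1.0 / (2 * k + 1) ** 2 for k in range(N + 1, 10**6))
    print(f"N={N:4d}  sup error {sup_err:.2e}  <=  tail bound {tail:.2e}")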

B.8 Connection to Functional Analysis: Completeness Proof Sketch

Here is a sketch of why the trigonometric system is complete in $L^2[-\pi,\pi]$ - a result we stated but did not prove in Section 2.2.

Step 1: Stone-Weierstrass theorem: Continuous periodic functions can be uniformly approximated by trigonometric polynomials (polynomials in $e^{inx}$).

Step 2: Density: Continuous functions are dense in $L^2[-\pi,\pi]$ (a standard result in measure theory).

Step 3: Combining: For any $f \in L^2$ and $\varepsilon > 0$, find continuous $g$ with $\lVert f - g \rVert_2 < \varepsilon/2$, then find a trigonometric polynomial $T$ with $\lVert g - T \rVert_\infty < \varepsilon/(2\sqrt{2\pi})$, giving $\lVert g - T \rVert_2 \leq \sqrt{2\pi}\,\lVert g - T \rVert_\infty < \varepsilon/2$. The triangle inequality gives $\lVert f - T \rVert_2 < \varepsilon$. $\square$

Full proof: See Section 12-02 Hilbert Spaces where the abstract completeness theorem is proved and the trigonometric system is given as the primary example.


Appendix C: Problem-Solving Strategies

C.1 Computing Fourier Coefficients: A Systematic Approach

Follow this checklist when computing Fourier coefficients:

  1. Check parity first. Is $f$ even? -> only $a_n$ terms. Odd? -> only $b_n$ terms. Neither? -> compute all.

  2. Check for a known waveform. Square wave, sawtooth, triangle wave, rectangular pulse - look up or recall the result.

  3. Split piecewise functions. Write $\int_{-\pi}^{\pi} = \int_{-\pi}^{0} + \int_0^{\pi}$ if $f$ is defined piecewise.

  4. Use integration by parts for polynomial-times-trig integrals $\int x^k \cos(nx)\,dx$. Set $u = x^k$, $dv = \cos(nx)\,dx$. Repeat until done.

  5. Check boundary terms. For periodic $f$, boundary terms in integration by parts vanish. For non-periodic problems (PDEs), keep them.

  6. Use the differentiation theorem as a shortcut: if you know $c_n[f']$, then $c_n[f] = c_n[f']/(in)$ for $n \neq 0$.

  7. Verify via Parseval. Compute $\sum |c_n|^2$ and check it equals $\frac{1}{2\pi}\int |f|^2$ - this catches algebra errors (see the sketch below).
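A sketch of step 7 in practice, for the sawtooth $f(x) = x$ whose exact answer $\pi^2/3$ we know (grid size and test function are our own choices):

import numpy as np

N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
c = np.fft.fft(x) / N            # approximate Fourier coefficients

lhs = np.sum(np.abs(c) ** 2)     # sum of |c_n|^2
rhs = np.mean(x ** 2)            # (1/2pi) * integral of |f|^2, rectangle rule
print(lhs, rhs, np.pi**2 / 3)    # all three should agree closely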

C.2 Determining Convergence Type

| Condition | Conclusion |
|---|---|
| $f$ piecewise smooth | Pointwise convergence to $\frac{f(x^+)+f(x^-)}{2}$ (Dirichlet) |
| $f$ continuous + piecewise $C^1$ | $\lVert f - S_N f \rVert_\infty \to 0$ (uniform) |
| $f \in L^2$ | $\lVert f - S_N f \rVert_2 \to 0$ (mean-square) |
| $f \in C^k$, periodic | Uniform convergence; coefficients $O(1/n^k)$ |
| $f$ analytic | Exponential coefficient decay; exponentially fast uniform convergence |

C.3 Recognizing When Fourier Series Appear in ML

Watch for these signatures of Fourier analysis in ML papers and code:

  • $\sin(\omega_k t)$ and $\cos(\omega_k t)$ in positional encodings -> Fourier basis
  • "Frequency domain" or "spectral" in the title -> likely uses the DFT or FT
  • Complex multiplication or rotation in attention -> RoPE or a relative-position variant
  • "Low-pass filter" or "high-pass filter" -> filter = convolution = Fourier operation
  • $1/n^2$ or $1/n^4$ weight decay in regularization -> Sobolev norm regularization
  • "Spectral density" or "power spectral density" -> autocorrelation -> Wiener-Khinchin -> Fourier
  • State space model (SSM, S4, Mamba) -> the convolution kernel is parameterized in the frequency domain

Appendix D: Deep Dive - Proofs and Derivations

D.1 Proof of Dirichlet's Theorem (Complete)

We prove the central convergence result: for $f$ piecewise smooth and $2\pi$-periodic, $S_N f(x) \to \frac{f(x^+) + f(x^-)}{2}$.

Setup. Write $S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t) D_N(x-t)\,dt$. By periodicity, we may center the integral at $x$:

$$S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x+t) D_N(t)\,dt$$

Since $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(t)\,dt = 1$, we can write:

$$S_N f(x) - \frac{f(x^+)+f(x^-)}{2} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[f(x+t) - \frac{f(x^+)+f(x^-)}{2}\right] D_N(t)\,dt$$

Split the integral. Since $D_N(t) = \frac{\sin((N+1/2)t)}{\sin(t/2)}$ and $D_N(t) = D_N(-t)$ (even function):

$$= \frac{1}{2\pi}\int_0^{\pi} \left[f(x+t) - f(x^+) + f(x-t) - f(x^-)\right] \frac{\sin((N+1/2)t)}{\sin(t/2)}\,dt$$

Key function. Define:

$$g(t) = \frac{f(x+t) - f(x^+) + f(x-t) - f(x^-)}{\sin(t/2)}$$

We need $g \in L^1[0,\pi]$. Near $t = 0$: the numerator behaves like $f'(x^+)t - f'(x^-)t$ (by piecewise smoothness), so the numerator is $O(t)$, and $\sin(t/2) \sim t/2$, giving $g(t) = O(1)$ - bounded near $t = 0$. Away from $t = 0$: $g$ is bounded since $f$ is bounded and $\sin(t/2) \geq c > 0$. So $g \in L^1$.

Apply Riemann-Lebesgue. The expression becomes:

$$\frac{1}{2\pi}\int_0^{\pi} g(t) \sin\!\left(\left(N+\frac{1}{2}\right)t\right)\,dt$$

a Fourier-type integral of $g$ at frequency $N+\frac{1}{2}$. By the Riemann-Lebesgue lemma (proved below), this $\to 0$ as $N \to \infty$. $\square$

Riemann-Lebesgue Lemma (proof): For $h \in L^1[a,b]$:

$$\int_a^b h(t)\sin(\lambda t)\,dt \to 0 \quad \text{as } \lambda \to \infty$$

Proof for step functions: If $h = \mathbf{1}_{[c,d]}$, then $\int_c^d \sin(\lambda t)\,dt = \frac{\cos(\lambda c) - \cos(\lambda d)}{\lambda} \to 0$. By linearity, this holds for any step function. For general $L^1$ functions, approximate by step functions and use the $L^1$ bound. $\square$

D.2 Parseval's Identity: Detailed Proof

Theorem. For $f \in L^2[-\pi,\pi]$ with complex Fourier coefficients $\{c_n\}_{n \in \mathbb{Z}}$:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\,dx = \sum_{n=-\infty}^{\infty} |c_n|^2$$

Proof (assuming completeness). Consider the partial sum error:

$$\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lVert S_N f \rVert^2$$

The cross-term vanishes: $\langle f, S_N f \rangle = \sum_{|n| \leq N} |c_n|^2 = \lVert S_N f \rVert^2$ (by the projection property: $S_N f$ is the best approximation to $f$ from $\text{span}\{e^{inx} : |n| \leq N\}$).

So $\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \sum_{|n| \leq N} |c_n|^2 \geq 0$ (Bessel).

By completeness, $\lVert f - S_N f \rVert \to 0$, so:

$$0 = \lim_{N\to\infty}\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lim_{N\to\infty}\sum_{|n| \leq N} |c_n|^2 = \lVert f \rVert^2 - \sum_{n=-\infty}^\infty |c_n|^2$$

This gives the result. $\square$

Remark: The key step is $\lVert f - S_N f \rVert^2 = \lVert f \rVert^2 - \lVert S_N f \rVert^2$. This is the Pythagorean theorem in $L^2$: $f$ decomposes orthogonally into $S_N f$ (the projection) and $f - S_N f$ (the residual). Not an analogy - it IS the Pythagorean theorem, in an infinite-dimensional space.

D.3 Gibbs Phenomenon: Precise Quantification

Let $f(x) = \operatorname{sgn}(x)$ (square wave). We compute the exact overshoot at the jump $x = 0^+$.

The partial sum $S_N f(x) = \frac{4}{\pi}\sum_{k=0}^N \frac{\sin((2k+1)x)}{2k+1}$ has its first local maximum at $x_N = \frac{\pi}{2(N+1)}$: its derivative $(S_N f)'(x) = \frac{4}{\pi}\sum_{k=0}^N \cos((2k+1)x) = \frac{2}{\pi}\,\frac{\sin(2(N+1)x)}{\sin x}$ first vanishes there. Evaluating instead at the nearby point $\frac{\pi}{2N+1}$ (the two agree up to $O(1/N^2)$ and give the same limit) keeps the computation cleanest:

$$S_N f\!\left(\frac{\pi}{2N+1}\right) = \frac{4}{\pi}\sum_{k=0}^N \frac{\sin\!\left(\frac{(2k+1)\pi}{2N+1}\right)}{2k+1}$$

This is a Riemann sum for $\frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt$ with step $\Delta t = \frac{2\pi}{2N+1}$. As $N \to \infty$:

$$\lim_{N\to\infty} S_N f\!\left(\frac{\pi}{2N+1}\right) = \frac{2}{\pi}\int_0^\pi \frac{\sin t}{t}\,dt = \frac{2}{\pi} \cdot \mathrm{Si}(\pi) \approx \frac{2}{\pi} \times 1.8519 \approx 1.1790$$

The jump in $f$ is from $-1$ to $+1$ (height 2). The overshoot above the limiting value $+1$ is $1.1790 - 1.0 = 0.1790$, which is $8.95\% \approx 9\%$ of the jump height.

Numerical value of $\mathrm{Si}(\pi)$:

$$\mathrm{Si}(\pi) = \int_0^\pi \frac{\sin t}{t}\,dt = \sum_{k=0}^\infty \frac{(-1)^k \pi^{2k+1}}{(2k+1)\cdot(2k+1)!} \approx 1.8519$$

Why $9\%$ is universal: The fraction $\frac{1}{2}\left(\frac{2}{\pi}\mathrm{Si}(\pi) - 1\right) \approx 0.0895$ does not depend on the specific piecewise smooth function - only on the jump height. Every jump discontinuity in any piecewise smooth function produces a Gibbs overshoot of this universal fraction of the jump.
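A numerical confirmation of the $1.1790$ limit (our own sketch; scipy.integrate.quad supplies $\mathrm{Si}(\pi)$):

import numpy as np
from scipy.integrate import quad

Si_pi, _ = quad(lambda t: np.sin(t) / t, 0, np.pi)
print("2/pi * Si(pi) =", 2 / np.pi * Si_pi)       # ~1.1790

for N in (10, 100, 1000):
    xN = np.pi / (2 * N + 1)
    S = (4 / np.pi) * sum(
        np.sin((2 * k + 1) * xN) / (2 * k + 1) for k in range(N + 1)
    )
    print(N, S)                                    # approaches 1.1790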

D.4 The Weierstrass Approximation Theorem and Completeness

Weierstrass Approximation Theorem (1885): Every continuous function $f$ on $[a,b]$ can be uniformly approximated by algebraic polynomials.

Trigonometric variant: Every continuous $2\pi$-periodic function can be uniformly approximated by trigonometric polynomials $T(x) = \sum_{|n| \leq N} c_n e^{inx}$.

Proof sketch (Fejér's approach): The Cesàro means $\sigma_N f$ are trigonometric polynomials (each $S_k f$ is a trigonometric polynomial of degree $k$, and their average $\sigma_N f$ is one too). Fejér proved $\sigma_N f \to f$ uniformly for continuous $f$. $\square$

Why completeness follows: Given $f \in L^2$ and $\varepsilon > 0$:

  1. Find continuous $g$ with $\lVert f - g \rVert_2 < \varepsilon/2$ (density of $C$ in $L^2$)
  2. Find a trigonometric polynomial $T$ with $\lVert g - T \rVert_\infty < \varepsilon/(2\sqrt{2\pi})$ (Weierstrass/Fejér)
  3. Then $\lVert g - T \rVert_2 \leq \sqrt{2\pi}\,\lVert g - T \rVert_\infty < \varepsilon/2$
  4. Triangle inequality: $\lVert f - T \rVert_2 < \varepsilon$ $\square$

D.5 Fourier Series and Operator Theory

The Fourier series establishes a fundamental connection between function analysis and operator theory.

The operator perspective: Consider the differentiation operator $D = \frac{d}{dx}$ on periodic functions. The eigenfunctions of $D$ are exactly the complex exponentials:

$$D e^{inx} = in \cdot e^{inx}$$

with eigenvalue $in$. The Fourier series is the eigenfunction expansion of $D$. Every linear differential operator with constant coefficients

$$L = a_k D^k + a_{k-1}D^{k-1} + \cdots + a_0$$

is diagonalized by the Fourier basis:

$$L e^{inx} = (a_k (in)^k + a_{k-1}(in)^{k-1} + \cdots + a_0)\, e^{inx} = p(in)\, e^{inx}$$

where $p(z) = a_k z^k + \cdots + a_0$ is the symbol of the operator.

For AI: Attention is (in the linear case) an operator on sequence positions. The linear attention matrix $A$ with entries $A_{mn} = K(m-n)$ (for a shift-invariant kernel) is diagonalized by the Fourier basis - its eigenvectors are the complex exponentials and its eigenvalues are the Fourier transform of $K$. This is why spectral analysis of attention patterns connects directly to Fourier theory, and why RoPE (which multiplies by $e^{im\theta}$) is so natural in this framework.

Compact operators: If an operator's symbol satisfies $p(in) \to 0$ as $|n| \to \infty$, the operator is compact. The inverse problem (recovering $f$ from $Lf = g$) is then ill-posed: small changes in $g$ can cause large changes in $f$. This is the mathematical underpinning of regularization in machine learning.


Appendix E: Summary Tables

E.1 Fourier Coefficient Reference

| Function $f(x)$ on $(-\pi,\pi)$ | $c_n$ (complex) | Notes |
|---|---|---|
| $1$ (constant) | $c_0 = 1$, $c_n = 0$ ($n \neq 0$) | DC component only |
| $e^{iax}$ ($a \in \mathbb{Z}$) | $c_a = 1$, $c_n = 0$ ($n \neq a$) | Single frequency |
| $\cos(mx)$ | $c_m = c_{-m} = 1/2$, rest $0$ | |
| $\sin(mx)$ | $c_m = 1/(2i)$, $c_{-m} = -1/(2i)$ | |
| $\operatorname{sgn}(x)$ (square wave) | $c_n = \frac{2}{in\pi}$ ($n$ odd), $0$ (even/zero) | $1/n$ decay |
| $x$ (sawtooth) | $c_n = \frac{(-1)^{n+1}}{in}$ ($n \neq 0$), $0$ ($n=0$) | $1/n$ decay |
| $\lvert x\rvert$ (triangle) | $c_0 = \pi/2$, $c_n = -\frac{2}{\pi n^2}$ ($n$ odd), $0$ (even $\neq 0$) | $1/n^2$ decay |
| $x^2$ | $c_0 = \pi^2/3$, $c_n = \frac{2(-1)^n}{n^2}$ ($n \neq 0$) | $1/n^2$ decay |
| $e^{ax}$ | $c_n = \frac{(-1)^n\sinh(\pi a)}{\pi(a-in)}$ | $1/n$ decay (jump at $\pm\pi$) |
| $\delta(x)$ (Dirac) | $c_n = \frac{1}{2\pi}$ (all $n$) | No decay (distributional) |

E.2 Properties Reference Table

| Property | Time domain | Frequency domain |
|---|---|---|
| Linearity | $\alpha f + \beta g$ | $\alpha c_n[f] + \beta c_n[g]$ |
| Shift | $f(x - x_0)$ | $e^{-inx_0} c_n[f]$ |
| Modulation | $e^{imx} f(x)$ | $c_{n-m}[f]$ |
| Differentiation | $f'(x)$ | $in \cdot c_n[f]$ |
| Integration | $\int_0^x f(t)\,dt$ | $c_n[f]/(in)$ ($n \neq 0$) |
| Reflection | $f(-x)$ | $c_{-n}[f]$ |
| Conjugation | $\overline{f(x)}$ | $\overline{c_{-n}[f]}$ |
| Convolution | $(f*g)(x) = \frac{1}{2\pi}\int f(t)g(x-t)\,dt$ | $c_n[f] \cdot c_n[g]$ |
| Multiplication | $f(x)g(x)$ | $\sum_k c_k[f]\, c_{n-k}[g]$ (convolution in frequency) |

E.3 Convergence Summary

| Type | Condition | Rate | Gibbs? |
|---|---|---|---|
| $L^2$ | $f \in L^2$ (always) | $\lVert f - S_N f \rVert_2 \to 0$ | N/A |
| Pointwise | $f$ piecewise smooth | $O(1/N)$ error at smooth points | Yes at jumps |
| Uniform | $f \in C^1$, periodic | $O(1/N)$ in $\lVert \cdot \rVert_\infty$ | No |
| Polynomial | $f \in C^k$, periodic | $O(1/N^k)$ | No |
| Exponential | $f$ analytic | $O(e^{-cN})$ | No |
| Cesàro | $f$ continuous | Uniform (Fejér) | No |

Appendix F: Complete Worked Problems

F.1 Problem: Full Solution of the Fourier Series for $f(x) = x(\pi - x)$ on $[0, \pi]$

This is a common PDE boundary value problem. We want the sine series of $f(x) = x(\pi - x)$ on $[0, \pi]$ (odd extension):

$$b_n = \frac{2}{\pi}\int_0^\pi x(\pi - x)\sin(nx)\,dx$$

Step 1: Split the integral.

$$b_n = \frac{2}{\pi}\left[\pi\int_0^\pi x\sin(nx)\,dx - \int_0^\pi x^2\sin(nx)\,dx\right]$$

Step 2: Compute $\int_0^\pi x\sin(nx)\,dx$ by parts.

Set $u = x$, $dv = \sin(nx)\,dx$:

$$\int_0^\pi x\sin(nx)\,dx = \left[\frac{-x\cos(nx)}{n}\right]_0^\pi + \frac{1}{n}\int_0^\pi \cos(nx)\,dx = \frac{-\pi\cos(n\pi)}{n} + \frac{\sin(nx)}{n^2}\bigg|_0^\pi = \frac{\pi(-1)^{n+1}}{n}$$

Step 3: Compute $\int_0^\pi x^2\sin(nx)\,dx$ by parts twice.

First integration by parts, with $u = x^2$, $dv = \sin(nx)\,dx$:

$$\int_0^\pi x^2\sin(nx)\,dx = \left[\frac{-x^2\cos(nx)}{n}\right]_0^\pi + \frac{2}{n}\int_0^\pi x\cos(nx)\,dx = \frac{-\pi^2(-1)^n}{n} + \frac{2}{n}\int_0^\pi x\cos(nx)\,dx$$

Second integration by parts:

$$\int_0^\pi x\cos(nx)\,dx = \left[\frac{x\sin(nx)}{n}\right]_0^\pi - \frac{1}{n}\int_0^\pi \sin(nx)\,dx = 0 + \frac{1}{n^2}\big[\cos(nx)\big]_0^\pi = \frac{(-1)^n - 1}{n^2}$$

So:

$$\int_0^\pi x^2\sin(nx)\,dx = \frac{-\pi^2(-1)^n}{n} + \frac{2((-1)^n - 1)}{n^3}$$

Step 4: Combine.

$$b_n = \frac{2}{\pi}\left[\pi \cdot \frac{\pi(-1)^{n+1}}{n} - \left(\frac{-\pi^2(-1)^n}{n} + \frac{2((-1)^n-1)}{n^3}\right)\right] = \frac{2}{\pi}\left[\frac{\pi^2(-1)^{n+1}}{n} + \frac{\pi^2(-1)^n}{n} - \frac{2((-1)^n-1)}{n^3}\right]$$

$$= \frac{2}{\pi}\left[0 - \frac{2((-1)^n - 1)}{n^3}\right] = \frac{-4((-1)^n - 1)}{\pi n^3} = \begin{cases} 8/(\pi n^3) & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$

Result: $f(x) = x(\pi - x) = \frac{8}{\pi}\sum_{k=0}^\infty \frac{\sin((2k+1)x)}{(2k+1)^3}$

Verification via Parseval:

$$\frac{2}{\pi}\int_0^\pi [x(\pi-x)]^2\,dx = \frac{2}{\pi}\int_0^\pi (\pi^2 x^2 - 2\pi x^3 + x^4)\,dx = \frac{2}{\pi}\left[\frac{\pi^2 \cdot \pi^3}{3} - \frac{\pi^5}{2} + \frac{\pi^5}{5}\right] = \frac{2\pi^4}{30} = \frac{\pi^4}{15}$$

Parseval then gives $\sum_{k=0}^\infty b_{2k+1}^2 = \sum_{k=0}^\infty \frac{64}{\pi^2(2k+1)^6} = \frac{\pi^4}{15}$, hence $\sum_{k=0}^\infty \frac{1}{(2k+1)^6} = \frac{\pi^6}{960}$. (Verifiable!)
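A sketch verifying the closed form against direct quadrature (scipy.integrate.quad; first few $n$ only):

import numpy as np
from scipy.integrate import quad

# b_n = (2/pi) * int_0^pi x(pi - x) sin(nx) dx vs. 8/(pi n^3) for odd n
for n in range(1, 7):
    val, _ = quad(lambda x: x * (np.pi - x) * np.sin(n * x), 0, np.pi)
    bn = 2 / np.pi * val
    exact = 8 / (np.pi * n**3) if n % 2 == 1 else 0.0
    print(f"n={n}: numeric {bn:+.6f}   closed form {exact:+.6f}")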

Application to the heat equation: The solution of $u_t = \alpha^2 u_{xx}$ on $[0,\pi]$ with $u(0,t) = u(\pi,t) = 0$ and $u(x,0) = x(\pi-x)$ is:

$$u(x,t) = \frac{8}{\pi}\sum_{k=0}^\infty \frac{1}{(2k+1)^3}\sin((2k+1)x)\,e^{-\alpha^2(2k+1)^2 t}$$

At $t = 0$ this reduces to the Fourier series of $x(\pi-x)$. At large $t$ the dominant term is $\frac{8}{\pi}\sin(x)e^{-\alpha^2 t}$ - the solution relaxes toward zero via the fundamental mode.

F.2 Problem: Fourier Series of $f(x) = e^{ix}$ - The Modulation Property

This trivial-seeming example illuminates the modulation property.

$f(x) = e^{ix}$ is already a single Fourier mode, with $c_1 = 1$ and $c_n = 0$ for $n \neq 1$.

Now consider $g(x) = e^{iax}f(x) = e^{i(a+1)x}$ for integer $a$. By the modulation property:

$$c_n[g] = c_{n-a}[f] = \delta_{n-a, 1} = \delta_{n, a+1}$$

So $g$ has all its energy concentrated at frequency $n = a+1$. This is the frequency shift or modulation property: multiplying by $e^{iax}$ shifts the spectrum by $a$.

Application to RoPE in detail: In RoPE, the query at position $m$ for dimension pair $i$ is:

$$q_{m,i}^{\text{rotated}} = (q_{2i} + iq_{2i+1}) \cdot e^{im\theta_i}$$

This is multiplication by the complex exponential $e^{im\theta_i}$ - i.e., modulation with phase proportional to the position $m$. The inner product with the key at position $n$:

$$\langle q_m^{\text{rot}}, k_n^{\text{rot}} \rangle = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})e^{im\theta_i} \cdot \overline{(k_{2i}+ik_{2i+1})e^{in\theta_i}}\right] = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})\overline{(k_{2i}+ik_{2i+1})} \cdot e^{i(m-n)\theta_i}\right]$$

The score depends only on $m-n$ - the relative position - by this modulate-then-conjugate-product structure. This is the Fourier shift theorem instantiated in transformer attention.

F.3 Problem: Convergence Rate Comparison

Compare the convergence rates of the Fourier series for four functions:

Function A: Square wave (1 discontinuity per period)

  • $|c_n| = O(1/n)$ -> $\lVert f - S_N f \rVert_2^2 = O(1/N)$
  • Needs $N \sim 10^6$ terms for $10^{-3}$ relative error

Function B: Triangle wave (continuous, corners = 1 discontinuity in the derivative)

  • $|c_n| = O(1/n^2)$ -> $\lVert f - S_N f \rVert_2^2 = O(1/N^3)$
  • Needs $N \sim 10^2$ terms for $10^{-3}$ relative error

Function C: Smooth periodic bump $f(x) = e^{\cos x}$

  • $f$ is analytic -> $|c_n|$ decays exponentially, $|c_n| \leq Ce^{-cn}$
  • Needs $N \sim 20$ terms for $10^{-6}$ relative error

Function D: $f(x) = 1/2 - |x|/(2\pi)$ (slowly varying linear ramp)

  • $|c_n| = 1/(\pi^2 n^2)$ for odd $n$ -> same decay class as the triangle wave
  • Fast convergence

Lesson: The algebraic convergence rates for discontinuous or non-smooth functions make Fourier truncation expensive. This motivates wavelets (Section 20-05), which can represent localized features (edges) efficiently without paying the Gibbs penalty.

F.4 Problem: Convolution in Fourier Space

Setup: Let $f(x) = \cos(x)$ and $g(x) = \sin(2x)$. Compute $(f * g)(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)g(x-t)\,dt$ using the convolution theorem.

Using the convolution theorem: $c_n[f*g] = c_n[f] \cdot c_n[g]$.

Fourier coefficients of $f = \cos x$: $c_1 = c_{-1} = 1/2$, rest zero.

Fourier coefficients of $g = \sin(2x)$: $c_2 = 1/(2i) = -i/2$, $c_{-2} = i/2$, rest zero.

Product $c_n[f] \cdot c_n[g]$: This is zero for all $n$, since the non-zero frequencies of $f$ ($n = \pm 1$) and $g$ ($n = \pm 2$) are disjoint!

So $(f * g)(x) = 0$ - the convolution of $\cos(x)$ and $\sin(2x)$ is identically zero.

Direct verification: $\frac{1}{2\pi}\int_{-\pi}^{\pi} \cos(t)\sin(2(x-t))\,dt = \frac{1}{2\pi}\int_{-\pi}^{\pi} \cos t\,(\sin 2x\cos 2t - \cos 2x\sin 2t)\,dt$. Each integral is zero by orthogonality. $\checkmark$

Lesson: The convolution of two functions with non-overlapping spectra is zero. This is the frequency-domain multiplication interpretation of convolution. In signal processing, this means two signals at different frequencies don't interfere - the superposition principle holds in frequency space.

For AI: This is why attention heads can specialize to different frequency ranges. If head $h_1$ attends to low-frequency positional patterns and head $h_2$ attends to high-frequency syntactic patterns, their convolution-like operations in frequency space are orthogonal.
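The same computation on a discrete periodic grid, via the convolution theorem (a sketch; the grid size is an arbitrary choice):

import numpy as np

N = 256
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f, g = np.cos(x), np.sin(2 * x)

# Discrete analogue of (f*g)(x) = (1/2pi) int f(t) g(x-t) dt:
# circular convolution via FFT, with the grid step folded into the 1/N factor
conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real / N
print(np.max(np.abs(conv)))   # ~1e-16: zero up to roundoff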

F.5 Problem: Parseval and the Isoperimetric Inequality

The isoperimetric inequality - among all closed curves of length $L$, the circle encloses the maximum area - has an elegant proof via Fourier series.

Setup: Parameterize a smooth closed curve of length $L$ by arclength: $(x(t), y(t))$ for $t \in [0, L]$, with period $L$. The constraint is $\dot{x}^2 + \dot{y}^2 = 1$.

Area via Green's theorem: $A = \frac{1}{2}\int_0^L (x\dot{y} - y\dot{x})\,dt$.

Fourier expand: $x(t) = \sum_n a_n e^{2\pi int/L}$, $y(t) = \sum_n b_n e^{2\pi int/L}$.

Apply Parseval to the arclength constraint $\dot{x}^2 + \dot{y}^2 = 1$:

$$\sum_n \left(\frac{2\pi n}{L}\right)^2 (|a_n|^2 + |b_n|^2) = 1$$

Bound the area: Using the Fourier expansion, the bound $|a_n\bar{b}_n - \bar{a}_n b_n| \leq |a_n|^2 + |b_n|^2$, and $|n| \leq n^2$ for integers:

$$|A| = \pi \left|\sum_n n(a_n\bar{b}_n - \bar{a}_n b_n)\right| \leq \pi \sum_n |n|\,(|a_n|^2 + |b_n|^2) \leq \frac{\pi L^2}{4\pi^2}\sum_n \frac{4\pi^2 n^2}{L^2}(|a_n|^2+|b_n|^2) = \frac{L^2}{4\pi}$$

Equality holds when only the $n = \pm 1$ terms are non-zero - i.e., when the curve is a circle.

Conclusion: $A \leq L^2/(4\pi)$, with equality for the circle. This is the isoperimetric inequality, proved entirely via Parseval's theorem.

For AI: Isoperimetric inequalities in function spaces underlie capacity control in learning theory. The VC dimension and Rademacher complexity measure the "surface area to volume ratio" of hypothesis classes in high-dimensional function spaces - and Fourier analysis provides the tools to compute these quantities.


Appendix G: In-Depth AI Applications

G.1 Transformer Positional Encodings: A Unified View

To understand the deep connection between Fourier series and transformer positional encodings, let us develop the theory carefully.

The fundamental problem in attention. The self-attention mechanism $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top/\sqrt{d_k})V$ is permutation-equivariant: permuting the input tokens permutes the output, but the attention weights treat all positions symmetrically. Without positional encodings, "the cat sat on the mat" and "the mat sat on the cat" would produce the same attention weights.

The sinusoidal solution (Vaswani et al., 2017). Add a position-dependent vector $\text{PE}(m) \in \mathbb{R}^d$ to the $m$-th token embedding before attention. Use:

$$\text{PE}(m)_{2i} = \sin\!\left(\frac{m}{\theta_i}\right), \quad \text{PE}(m)_{2i+1} = \cos\!\left(\frac{m}{\theta_i}\right), \quad \theta_i = 10000^{2i/d}$$

The key property. For any fixed offset $k$, there exists a linear map $M_k \in \mathbb{R}^{d \times d}$ such that $\text{PE}(m + k) = M_k \text{PE}(m)$. This is because:

$$\begin{pmatrix} \sin((m+k)/\theta_i) \\ \cos((m+k)/\theta_i) \end{pmatrix} = \begin{pmatrix} \cos(k/\theta_i) & \sin(k/\theta_i) \\ -\sin(k/\theta_i) & \cos(k/\theta_i) \end{pmatrix} \begin{pmatrix} \sin(m/\theta_i) \\ \cos(m/\theta_i) \end{pmatrix}$$

Each dimension pair $(2i, 2i+1)$ transforms via a rotation by $k/\theta_i$ - a Fourier shift in 2D.

Why different frequencies? At frequency $1/\theta_i = 10000^{-2i/d}$:

  • $i = 0$: $\theta_0 = 1$, period $= 2\pi \approx 6.28$ - distinguishes adjacent tokens
  • $i = d/4$: $\theta_{d/4} = 10000^{1/2} = 100$ - distinguishes tokens $\sim 100$ apart
  • $i = d/2-1$: $\theta_{d/2-1} \approx 10000$ - distinguishes tokens up to distance $\sim 10000$

The frequencies form a geometric ladder of wavelengths spanning $2\pi$ up to $\sim 2\pi \times 10000$ - a multi-scale analogue of the harmonics of a Fourier series on $[0, 10000]$.

The Fourier basis interpretation. Define $\mathbf{z}_m = \text{PE}(m)$. The inner product $\mathbf{z}_m \cdot \mathbf{z}_n = \sum_{i=0}^{d/2-1} \cos((m-n)/\theta_i)$ depends only on $m - n$. This is the autocorrelation function of the positional encoding - it measures how "similar" two positions are based on their relative offset. A minimal sketch verifying the shift property appears below.

Limitation. These encodings are absolute: the encoding of position 5 is the same regardless of whether the sequence has length 10 or 1000. For sequences longer than those seen in training, the high-frequency components ($\theta_0 = 1$) still behave as in training, but the model has never seen the low-frequency components ($\theta_{d/2-1} \approx 10000$) vary much. This causes length-generalization failure.
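The sketch promised above: build the encodings and check $\text{PE}(m+k) = M_k\,\text{PE}(m)$ numerically (dimension and positions are arbitrary test values):

import numpy as np

d, base = 16, 10000.0
theta = base ** (2 * np.arange(d // 2) / d)     # theta_i = 10000^{2i/d}

def pe(m):
    out = np.empty(d)
    out[0::2] = np.sin(m / theta)               # even dims: sin
    out[1::2] = np.cos(m / theta)               # odd dims: cos
    return out

def shift_matrix(k):
    M = np.zeros((d, d))
    for i, th in enumerate(theta):
        c, s = np.cos(k / th), np.sin(k / th)
        M[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]  # 2x2 rotation block
    return M

m, k = 7, 5
print(np.allclose(pe(m + k), shift_matrix(k) @ pe(m)))   # True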

G.2 RoPE: Relative Positions via Complex Multiplication

The RoPE construction. Instead of adding $\text{PE}(m)$ to the embedding, RoPE rotates the query and key vectors:

$$q_m = R_m q, \quad k_n = R_n k$$

where $R_m$ is the block-diagonal rotation matrix:

$$R_m = \begin{pmatrix} R(\theta_0, m) & & \\ & R(\theta_1, m) & \\ & & \ddots \end{pmatrix}, \quad R(\theta, m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

Computing the attention score:

$$q_m^\top k_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

since $R_m^\top R_n = R_{n-m}$ (a rotation by $n - m$). The score is a function of the offset $m - n$ only - position-independent in the absolute sense, position-sensitive in the relative sense.

The complex representation. In complex notation, the $i$-th pair $(q_{2i}, q_{2i+1})$ maps to the complex number $q_{2i} + iq_{2i+1} \in \mathbb{C}$. The RoPE operation is:

$$q_{2i} + iq_{2i+1} \;\mapsto\; (q_{2i} + iq_{2i+1}) \cdot e^{im\theta_i}$$

This is multiplication by the unit complex number $e^{im\theta_i}$ - a phase rotation of the complex Fourier coefficient at frequency $\theta_i$.

Long-context extrapolation. With frequencies $1/\theta_i = 10000^{-2i/d}$, the maximum period is $2\pi\,\theta_{d/2-1} \approx 2\pi \times 10000 \approx 62832$. For positions beyond this, the encoding wraps around (becomes periodic) and position information is lost. In practice, models with standard RoPE struggle at contexts well beyond their training length (e.g. $\sim 4096$ tokens).

Fix - extended RoPE: Scale $\theta_i \to \theta_i \cdot s$ (equivalently, divide positions by $s$), where $s$ is the ratio of the new context length to the training context length (Position Interpolation, Chen et al., 2023). Alternatively, YaRN (Peng et al., 2023) uses a non-uniform frequency scaling based on Fourier analysis of the empirical attention patterns.

G.3 Spectral Bias: Mathematical Theory

The spectral bias (or frequency principle) of neural networks has a rigorous formulation in terms of the NTK and Fourier analysis.

The NTK framework. For a wide neural network $f_\theta(\mathbf{x})$ trained by gradient descent, the NTK $\Theta(\mathbf{x}, \mathbf{x}') = \nabla_\theta f_\theta(\mathbf{x}) \cdot \nabla_\theta f_\theta(\mathbf{x}')$ is approximately constant during training. The training dynamics are:

$$\frac{d f_\theta(\mathbf{x})}{dt} = -\int \Theta(\mathbf{x}, \mathbf{x}')\, \nabla_{f} \mathcal{L}(f(\mathbf{x}'), y')\,d\mathbf{x}'$$

This is a kernel regression problem in function space, where $\Theta$ acts as the kernel.

Spectral decomposition. Let $\lambda_k, \phi_k$ be the eigenvalues/eigenfunctions of $\Theta$ in $L^2$. The projection of the target $f^*$ onto $\phi_k$ has coefficient $\alpha_k^\infty = \langle f^*, \phi_k \rangle$. Under gradient descent, the corresponding coefficient of the network evolves as:

$$\alpha_k(t) = \alpha_k(0)\, e^{-\lambda_k t} + \left(1 - e^{-\lambda_k t}\right) \alpha_k^\infty$$

The coefficient along eigenfunction $k$ converges at rate $\lambda_k$ - fast if $\lambda_k$ is large, slow if $\lambda_k$ is small.

The spectral bias. For standard MLPs with Gaussian initialization and ReLU activations, the NTK has eigenvalues $\lambda_k$ that decrease with frequency (high-frequency eigenfunctions have small eigenvalues). Therefore:

  • Low-frequency Fourier components have large eigenvalues -> learned quickly
  • High-frequency components have small eigenvalues -> learned slowly

Quantitative: the Barron approximation theorem. A function with Barron norm $C = \int |\omega|\,|\hat{f}(\omega)|\,d\omega$ can be approximated to $\varepsilon$ accuracy in $L^2$ by a network with $O(C^2 \varepsilon^{-2})$ neurons. The Barron norm penalizes high-frequency components: if $f^*$ has large high-frequency content ($|\hat{f^*}(\omega)|$ large for large $|\omega|$), then $C$ is large and the network needs more parameters.

Implications for practice:

  1. Overfitting to noise: High-frequency noise (large $|\omega|$) is learned last -> early stopping acts as a frequency low-pass filter
  2. Data augmentation: Geometric augmentations (rotations, flips) destroy spurious high-frequency features -> force the network to rely on low-frequency structure -> better generalization
  3. NeRF: Without positional encoding, networks can't represent the high-frequency variation in scene appearance -> Fourier feature encodings inject high frequencies explicitly

G.4 SIREN Networks: Replacing ReLU with Sine

SIREN (Sitzmann et al., 2020). Instead of $\text{ReLU}$ activations, SIREN uses sines:

$$f_\theta(\mathbf{x}) = W_L\, \phi_{L-1} \circ \cdots \circ \phi_1(\mathbf{x}), \quad \phi_i(\mathbf{x}) = \sin(W_i \mathbf{x} + \mathbf{b}_i)$$

The gradient $\nabla_\mathbf{x} f$ is again a SIREN-like network: differentiating turns sines into cosines, which are phase-shifted sines. This is the key property: the derivative of a SIREN is itself a SIREN with the same architecture, so supervision on derivatives (as in PDE residuals) stays in the same function class.

Fourier interpretation. The output of a SIREN is a sum of sinusoids at multiple frequencies and phases. Composing sine activations creates a rich set of Fourier components, so a SIREN can represent high-frequency content that a standard ReLU network would struggle to learn.

Mathematical property. For a single-layer SIREN with $n$ neurons, the output is:

$$f(x) = \sum_{k=1}^n c_k \sin(\omega_k x + \phi_k)$$

a sum of $n$ Fourier basis functions with learned frequencies, amplitudes, and phases. Unlike a standard Fourier series with fixed frequencies $\omega_k = k$, a SIREN can choose its frequencies freely - making it more expressive. A minimal forward-pass sketch follows.

Applications: Implicit neural representations (NeRF, video), solving PDEs (physics-informed neural networks), generative modeling of images and 3D shapes.
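The sketch promised above: the single-layer case with random (untrained) parameters; omega_0 is the frequency-scaling constant suggested in the SIREN paper.

import numpy as np

rng = np.random.default_rng(0)
n_neurons, omega_0 = 64, 30.0

w = omega_0 * rng.uniform(-1, 1, n_neurons)           # frequencies omega_k
phi = rng.uniform(-np.pi, np.pi, n_neurons)           # phases phi_k
c = rng.normal(0, 1 / np.sqrt(n_neurons), n_neurons)  # output weights c_k

def siren(x):
    # f(x) = sum_k c_k sin(w_k x + phi_k), for a batch of scalar inputs
    return np.sin(np.outer(x, w) + phi) @ c

print(siren(np.linspace(-1, 1, 5)))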

G.5 The Spectral Analysis of LLM Attention Patterns

Recent research (2023-2024) has revealed striking spectral structure in LLM attention patterns:

Attention as a filter. In a transformer layer, the attention mechanism computes:

$$\mathbf{y}_m = \sum_n A_{mn} V \mathbf{x}_n$$

where $A_{mn} = \text{softmax}_n(q_m^\top k_n / \sqrt{d_k})$. If $A_{mn}$ depends only on $m - n$ (approximately true for well-trained models with relative positional encodings), then this is a convolution with kernel $a[k]$, where $A_{mn} = a[m-n]$. In frequency space:

$$\hat{Y}(\xi) = \hat{A}(\xi) \cdot \hat{X}(\xi)$$

The attention pattern is a frequency filter. Different heads learn different filters:

  • Local heads: $A_{mn}$ large only for $|m - n|$ small -> low-pass filter (smooth the sequence)
  • Global heads: $A_{mn}$ relatively uniform -> near-constant filter (compute the mean)
  • Periodic heads: $A_{mn}$ large when $m - n \equiv k \pmod{T}$ -> periodic filter (detect regular patterns)

Empirical findings:

  • In BERT, some heads function as "copy heads" (attend to identical tokens) and "position heads" (attend to positions at specific relative offsets)
  • In GPT-2/3, induction heads attend strongly to previous occurrences of the current token
  • The frequency-domain view explains why attention heads naturally factorize into frequency bands

WeightWatcher (Martin & Mahoney, 2021). The singular value spectra of weight matrices in LLMs follow power laws $\rho(\sigma) \sim \sigma^{-\alpha}$, with the exponent $\alpha$ related to model quality: well-trained models tend to have $\alpha \in [2, 4]$, while overfit or poorly trained models fall outside this range. This is spectral analysis of the weight matrices - the spectral density reflects effective rank and implicit regularization.

G.6 Circular Convolution and LLMs

The circular convolution theorem (proved in Section 20-04) states that in the DFT domain, elementwise multiplication corresponds to circular convolution in the time domain. This has a direct connection to autoregressive language models.

Autoregressive language models as causal filters. A causal language model computes $p(x_t \mid x_1, \ldots, x_{t-1})$ by attending only to previous tokens. To the extent the attention pattern is shift-invariant, it acts in the frequency domain as:

$$\hat{P}(\xi) = \hat{A}(\xi) \cdot \hat{X}(\xi)$$

where $\hat{A}(\xi)$ is the response of a one-sided (causal) filter. The causal constraint $A_{mn} = 0$ for $n > m$ has a frequency-domain signature: for a causal filter, the real and imaginary parts of the frequency response are linked by a Hilbert transform (the Kramers-Kronig relations), so the phase cannot be chosen independently of the magnitude.

State space models. Models like S4, H3, and Mamba parameterize the causal convolution kernel $h[n]$ through a (typically diagonal) state matrix $A$ and evaluate it efficiently in frequency space. The output is:

$$y[t] = (x * h)[t] = \text{IFFT}(\hat{X} \cdot \hat{H})$$

where $\hat{H}$ is the learned frequency response. This is exactly the convolution theorem applied to sequence modeling, and it enables $O(N \log N)$ computation instead of $O(N^2)$ attention.
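A sketch of the FFT path (the decaying kernel is a stand-in for a learned SSM kernel; zero-padding to length $2N$ makes the circular convolution agree with the causal linear one):

import numpy as np

N = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal(N)
h = np.exp(-0.01 * np.arange(N))         # stand-in causal kernel h[n]

L = 2 * N                                # pad to avoid circular wrap-around
y_fft = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)[:N]

y_direct = np.convolve(x, h)[:N]         # O(N^2) reference
print(np.allclose(y_fft, y_direct))      # True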


Appendix H: Connections to Probability and Statistics

H.1 Characteristic Functions as Fourier Transforms of Distributions

The characteristic function of a random variable $X$ is:

$$\varphi_X(t) = \mathbb{E}[e^{itX}] = \int_{-\infty}^{\infty} e^{itx}\,dF_X(x)$$

This is the Fourier transform of the probability measure $F_X$! The characteristic function exists for every random variable (unlike the moment generating function) and uniquely determines the distribution.

Fourier inversion of densities. If $X$ has density $p_X$:

$$\varphi_X(t) = \int e^{itx} p_X(x)\,dx = \hat{p}_X(-t)$$

The density is recovered by the inverse Fourier transform: $p_X(x) = \frac{1}{2\pi}\int e^{-itx} \varphi_X(t)\,dt$.

Central limit theorem via characteristic functions. Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The standardized sum $S_n = (X_1 + \cdots + X_n - n\mu)/(\sigma\sqrt{n})$ has characteristic function:

$$\varphi_{S_n}(t) = \left[\varphi_X\!\left(\frac{t}{\sigma\sqrt{n}}\right) e^{-it\mu/(\sigma\sqrt{n})}\right]^n$$

Expanding $\varphi_X(s) = 1 + \mu(is) + \frac{\mu^2 + \sigma^2}{2}(is)^2 + O(s^3)$ for small $s$ and substituting:

$$\varphi_{S_n}(t) = \left[1 - \frac{t^2}{2n} + O(n^{-3/2})\right]^n \to e^{-t^2/2} \quad \text{as } n \to \infty$$

The Gaussian $e^{-t^2/2}$ is the Fourier transform (characteristic function) of the standard normal density. The CLT is a convergence in Fourier space to the Gaussian fixed point.

For AI - Training stability. The distribution of gradient updates in neural network training follows a CLT when the batch size is large: the stochastic gradient is a sum of $B$ per-sample gradients, so its distribution is approximately Gaussian for large $B$. This is why large-batch training behaves differently from small-batch training - the gradient noise becomes more Gaussian (less heavy-tailed) as $B$ grows.
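A sketch of the CLT in Fourier space: the characteristic function of a standardized sum of uniforms marching toward $e^{-t^2/2}$ (the uniform distribution is our own choice of example):

import numpy as np

t = np.linspace(-5, 5, 11)

def phi_uniform(t):
    # characteristic function of Uniform(-sqrt(3), sqrt(3)): mean 0, variance 1
    a = np.sqrt(3)
    return np.sinc(a * t / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)

for n in (1, 4, 32, 256):
    phi_Sn = phi_uniform(t / np.sqrt(n)) ** n
    gap = np.max(np.abs(phi_Sn - np.exp(-t**2 / 2)))
    print(f"n={n:4d}  max deviation from the Gaussian: {gap:.4f}")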

H.2 Autocorrelation and Power Spectral Density

For a stationary stochastic process $\{X_t\}$ with mean zero, the autocorrelation function is:

$$R_X(\tau) = \mathbb{E}[X_t X_{t+\tau}]$$

The power spectral density (PSD) is the Fourier transform of $R_X$:

$$S_X(f) = \int_{-\infty}^{\infty} R_X(\tau)\, e^{-2\pi if\tau}\,d\tau$$

This is the Wiener-Khinchin theorem: the PSD and the autocorrelation are a Fourier pair.

Interpretation of the PSD:

  • $S_X(f) \geq 0$ for all $f$ (non-negativity, by Bochner's theorem)
  • $\int S_X(f)\,df = R_X(0) = \mathbb{E}[X_t^2]$ (total power)
  • $S_X(f)$ tells how much of the signal's power is concentrated near frequency $f$

White noise: $S_X(f) = N_0$ (constant) -> $R_X(\tau) = N_0\,\delta(\tau)$ (uncorrelated across all lags). White noise has equal power at all frequencies - it is the "maximally random" process.

$1/f$ noise (pink noise): $S_X(f) \propto 1/f$ - found in LLM token frequency distributions (Zipf's law), financial returns, and EEG signals. The $1/f$ spectrum is exactly at the boundary between stationary ($S_X$ integrable) and non-stationary processes.

For AI: The power spectral density of the hidden states in LLMs has been studied empirically. Models trained on natural language show 1/f-like spectra in their activations, consistent with the fractal/scale-free structure of language. This suggests that efficient LLM architectures should process information at multiple timescales simultaneously - exactly what multi-head attention (with different positional frequency ranges per head) achieves.
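A discrete sanity check of the Fourier-pair relationship (our own sketch, using an AR(1) process; in the discrete, circular setting the periodogram is exactly the DFT of the circular sample autocorrelation):

import numpy as np

rng = np.random.default_rng(0)
N, a = 2**14, 0.9
x = np.zeros(N)
for t in range(1, N):                      # AR(1): x_t = a x_{t-1} + noise
    x[t] = a * x[t - 1] + rng.standard_normal()

periodogram = np.abs(np.fft.rfft(x)) ** 2 / N

# circular sample autocorrelation via the FFT, then transformed back
r = np.fft.irfft(np.abs(np.fft.rfft(x)) ** 2) / N
psd_from_autocorr = np.fft.rfft(r).real
print(np.allclose(periodogram, psd_from_autocorr))   # True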

H.3 Fourier Analysis and the Sampling Theorem

The Shannon-Nyquist Sampling Theorem connects continuous Fourier analysis to discrete signal representation:

Theorem (Shannon, 1949; Nyquist, 1928): A bandlimited signal $f$ with $\hat{f}(\xi) = 0$ for $|\xi| > W$ (bandwidth $W$) can be perfectly reconstructed from its samples $\{f(n/(2W))\}_{n \in \mathbb{Z}}$:

$$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n}{2W}\right) \operatorname{sinc}(2Wt - n)$$

Proof sketch. Define the sampled signal $f_s(t) = \sum_n f(n/(2W))\,\delta(t - n/(2W))$. Its Fourier transform is the periodized spectrum $\hat{f}_s(\xi) = 2W \sum_k \hat{f}(\xi - 2kW)$. Since $f$ is bandlimited to $|\xi| \leq W$, the copies don't overlap (no aliasing). Applying an ideal low-pass filter with cutoff $W$ recovers $\hat{f}$, and the inverse FT gives $f$. $\square$

Aliasing: If $f$ has frequency content above $W$ but is sampled at rate $2W$, the high-frequency copies in $\hat{f}_s$ overlap, corrupting the spectrum. This is aliasing - high-frequency components masquerade as low-frequency ones.

For AI - Tokenization: Tokenization is analogous to sampling. A byte-pair-encoding tokenizer samples text at a coarse rate - on the order of one token per word. If the "semantic content" of text has structure at finer granularity than the token, the tokenizer introduces aliasing: some semantic distinctions are lost. This is why character-level models can capture morphological patterns that word-level models miss.

Anti-aliasing in CNNs: The standard strided convolution in CNNs (stride 2 for downsampling) can alias: it samples the feature map at rate $1/2$ without first applying a low-pass filter. Zhang (2019) showed that adding a blur filter before strided convolutions dramatically improves shift-equivariance. This is the CNN version of anti-aliasing.
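A sketch of sinc reconstruction for a signal bandlimited below $W = 1$ (our own test signal; the sample range is truncated, so the match is close but not exact):

import numpy as np

W = 1.0
f = lambda t: np.cos(2 * np.pi * 0.7 * t) + 0.5 * np.sin(2 * np.pi * 0.3 * t)

t = np.linspace(-3, 3, 601)
n = np.arange(-200, 201)                     # truncated sample indices
samples = f(n / (2 * W))                     # samples at the Nyquist rate 2W

# f(t) = sum_n f(n/2W) sinc(2W t - n); np.sinc is the normalized sinc
recon = samples @ np.sinc(2 * W * t[None, :] - n[:, None])
print(np.max(np.abs(recon - f(t))))          # small truncation error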


Appendix I: Implementation Details

I.1 Numerical Computation of Fourier Coefficients

Direct formula. For a function given by samples $f_0, \ldots, f_{N-1}$ at equally spaced points $x_k = -\pi + 2\pi k/N$:

$$c_n \approx \frac{1}{N}\sum_{k=0}^{N-1} f_k\, e^{-inx_k} = \frac{(-1)^n}{N}\,\text{DFT}(f)[n]$$

This approximates the integral $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ by the rectangle rule. The $(-1)^n = e^{in\pi}$ phase comes from the grid starting at $-\pi$ rather than $0$; sampling on $[0, 2\pi)$ instead (legitimate by periodicity) removes it.

Accuracy. For a $C^k$ periodic function the approximation error decays like $O(1/N^k)$ (spectrally fast for analytic functions). For discontinuous functions, the error is $O(1/N)$ regardless of how fine the sampling.

Using NumPy:

import numpy as np

# N equispaced samples of f on [-pi, pi); keep N even so the (-1)^n
# phase correction below is also valid for the negative frequencies
N = 256
f = np.cos                      # stand-in for any 2*pi-periodic function
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f_samples = f(x)
# Fourier coefficients (approximate); (-1)^n corrects the phase from
# the grid starting at -pi rather than 0
c = np.fft.fft(f_samples) / N * (-1.0) ** np.arange(N)
# c[n] is the n-th coefficient for n = 0, 1, ..., N-1
# Negative frequencies: c[N-n] corresponds to c_{-n}

Aliasing in discrete computation. The DFT computes only $N$ coefficients $c_0, \ldots, c_{N-1}$ (interpreted modulo $N$). For a bandlimited function with $c_n = 0$ for $|n| > N/2$, the computation is exact. Otherwise, the $n$-th computed coefficient is actually $c_n + c_{n+N} + c_{n-N} + \cdots$ (all aliased copies).

I.2 Plotting Spectra Correctly

Common mistakes when plotting Fourier spectra:

  1. Frequency axis: NumPy's FFT output has frequencies $[0, 1, \ldots, N/2-1, -N/2, \ldots, -1]/N$ in units of cycles/sample. Use np.fft.fftfreq(N) to get the correct axis.

  2. Centering: Use np.fft.fftshift to rearrange the output so that zero frequency is in the center.

  3. One-sided vs two-sided: For real signals, plot only $n \geq 0$ (the two-sided spectrum is conjugate-symmetric). Use np.fft.rfft for real signals - it returns only the positive-frequency half.

  4. dB scale: Power spectra are often plotted in dB: $10\log_{10}|c_n|^2$, equivalently $20\log_{10}|c_n|$. This makes it easier to see components spanning many orders of magnitude.

  5. Parseval check: Always verify $\sum |c_n|^2 \approx \frac{1}{N}\sum |f_k|^2$ (with $c = \text{FFT}/N$). If these don't match, there's a normalization error. The snippet below exercises all five items.
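A sketch tying the checklist together for a simple test tone (sampling rate and signal are our own choices; matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt

N, fs = 1024, 100.0                        # samples and sampling rate (Hz)
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 5 * t) + 0.01 * np.random.randn(N)

X = np.fft.rfft(x) / N                     # one-sided spectrum (item 3)
freqs = np.fft.rfftfreq(N, d=1 / fs)       # correct frequency axis (item 1)

plt.plot(freqs, 20 * np.log10(np.abs(X) + 1e-12))   # dB scale (item 4)
plt.xlabel("frequency (Hz)"); plt.ylabel("|X| (dB)")

# Parseval check (item 5); one-sided bins 1..N/2-1 count twice
power = 2 * np.sum(np.abs(X[1:-1])**2) + np.abs(X[0])**2 + np.abs(X[-1])**2
print(power, np.mean(x**2))                # should agree closely
plt.show()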

I.3 RoPE Implementation Reference

import numpy as np

def rope_embed(x, positions, base=10000):
    """
    Apply RoPE to input tensor x.
    x: shape (seq_len, d_model) - query or key vectors
    positions: integer positions, shape (seq_len,)
    Returns: rotated x, same shape
    """
    d = x.shape[-1]
    positions = np.asarray(positions, dtype=float)
    # Frequencies: theta_i = base^{-2i/d} for i = 0, ..., d/2 - 1
    i = np.arange(0, d, 2)
    freqs = base ** (-i / d)          # shape (d/2,)
    # Angles: m * theta_i
    angles = positions[:, None] * freqs[None, :]  # (seq_len, d/2)
    # Build rotation: cos(angle) and sin(angle)
    cos = np.cos(angles)   # (seq_len, d/2)
    sin = np.sin(angles)   # (seq_len, d/2)
    # Rotate pairs (x_{2i}, x_{2i+1})
    x_even = x[:, 0::2]   # shape (seq_len, d/2)
    x_odd  = x[:, 1::2]
    x_rot_even = x_even * cos - x_odd * sin
    x_rot_odd  = x_even * sin + x_odd * cos
    # Interleave back
    x_rot = np.empty_like(x)
    x_rot[:, 0::2] = x_rot_even
    x_rot[:, 1::2] = x_rot_odd
    return x_rot

Verification: For any query $q$ at position $m$ and key $k$ at position $n$, the score depends only on the relative offset $m - n$:

score(m, n) = dot(rope_embed(q, [m]), rope_embed(k, [n]))
            == score(m + delta, n + delta)  # for any integer delta

This confirms that the attention score depends only on relative position.

