
Fourier Series: Appendix J: Fourier Series in Diverse Mathematical Contexts to Appendix S: Summary of Key Results

Appendix J: Fourier Series in Diverse Mathematical Contexts

J.1 Fourier Series and Number Theory

The connection between Fourier analysis and number theory is profound and historically important.

Dirichlet characters. For a modulus $q$, a Dirichlet character $\chi: \mathbb{Z} \to \mathbb{C}$ is a completely multiplicative arithmetic function with period $q$ (vanishing at integers not coprime to $q$). These are precisely the characters of the group $(\mathbb{Z}/q\mathbb{Z})^*$. Dirichlet's proof that there are infinitely many primes in any arithmetic progression $\{a, a+q, a+2q, \ldots\}$ with $\gcd(a,q) = 1$ uses Fourier analysis on the group $(\mathbb{Z}/q\mathbb{Z})^*$ - a finite version of Fourier series.

The Poisson summation formula. For a "nice" function $f$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{k=-\infty}^{\infty} \hat{f}(k)$$

This connects sums of a function over the integers to sums of its Fourier transform over the integers. Applications include (a numerical check follows the list):

  • Jacobi theta functions: $\theta(\tau) = \sum_{n=-\infty}^\infty e^{i\pi n^2 \tau}$ satisfies a functional equation obtained from Poisson summation
  • Riemann zeta function: the functional equation $\zeta(s) = 2^s \pi^{s-1} \sin(\pi s/2)\, \Gamma(1-s)\, \zeta(1-s)$ is proved via Poisson summation
  • Quadratic forms: the number of ways to represent an integer as a sum of $k$ squares is expressed via theta functions
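
A quick numerical check (a minimal sketch in NumPy): for the Gaussian $f(x) = e^{-\pi t x^2}$, whose Fourier transform is $\hat{f}(\xi) = t^{-1/2} e^{-\pi \xi^2/t}$, Poisson summation yields the theta functional equation $\theta(t) = t^{-1/2}\,\theta(1/t)$ - the Jacobi theta bullet above at purely imaginary argument $\tau = it$.

import numpy as np

def theta(t, n_max=50):
    # sum over n of exp(-pi n^2 t), truncated at |n| <= n_max
    n = np.arange(-n_max, n_max + 1)
    return np.exp(-np.pi * n**2 * t).sum()

# Poisson summation for f(x) = exp(-pi t x^2): theta(t) = theta(1/t) / sqrt(t)
for t in [0.1, 0.5, 2.0]:
    print(t, theta(t), theta(1.0 / t) / np.sqrt(t))  # last two columns agree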

For AI: Poisson summation appears in the theory of neural network generalization. The neural tangent kernel at infinite width is related to the Fourier transform of the network architecture, and bounds on generalization involve the Poisson summation formula applied to the empirical Fourier spectrum of the training data.

J.2 Fourier Series on Non-Euclidean Spaces

Fourier analysis on groups. The Fourier series on $[-\pi,\pi]$ is really harmonic analysis on the group $S^1 = \{e^{i\theta} : \theta \in [-\pi,\pi]\}$. The generalization to compact groups $G$ gives:

$$f(g) = \sum_{\pi \in \hat{G}} d_\pi \operatorname{tr}\big(\hat{f}(\pi)\, \pi(g)\big)$$

where $\hat{G}$ is the set of irreducible unitary representations of $G$ and $d_\pi$ is the dimension of the representation $\pi$. For $G = S^1$: all irreducible representations are 1-dimensional ($d_\pi = 1$), and we recover the ordinary Fourier series.

For AI - Equivariant networks: Group-equivariant neural networks (Cohen & Welling, 2016) build in symmetries (rotations, reflections) by working with group-Fourier analysis. A convolutional network on $\mathbb{R}^2$ is equivariant to translations; lifting to $SE(2)$ (rigid motions) gives equivariance to rotation and translation simultaneously. The mathematical foundation is the group Fourier transform - Fourier series on a non-abelian group.

Fourier analysis on graphs. The graph Laplacian $L = D - A$ has eigenvectors $U = [u_1, \ldots, u_n]$ with eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. The graph Fourier transform is $\hat{f} = U^\top f$. Graph convolution (used in GCN and ChebNet - covered in Section 11-04) is multiplication in this graph Fourier domain:

$$f *_G h = U\big((U^\top f) \odot (U^\top h)\big)$$

This is exactly the convolution theorem applied to graphs, replacing the standard Fourier basis $e^{inx}$ with the Laplacian eigenvectors $u_k$.
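
Here is a minimal sketch of this pipeline on a toy graph (a 5-node cycle; the graph and signals are illustrative):

import numpy as np

n = 5
A = np.zeros((n, n))
for i in range(n):                       # adjacency matrix of a cycle
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L = D - A

lam, U = np.linalg.eigh(L)               # eigenvalues ascending, U orthonormal
print(np.round(lam, 3))                  # 0 = lambda_1 <= ... <= lambda_n

f = np.random.randn(n)                   # a signal on the nodes
h = np.random.randn(n)                   # a filter, also a node signal

f_hat = U.T @ f                          # graph Fourier transform
h_hat = U.T @ h
conv = U @ (f_hat * h_hat)               # graph convolution f *_G h
print(conv)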

J.3 Fourier Series and Quantum Mechanics

The mathematical structure of quantum mechanics is built on Fourier analysis.

Position and momentum representations. A quantum state $|\psi\rangle$ can be represented in the position basis, $\psi(x) = \langle x|\psi\rangle$ (the wavefunction), or in the momentum basis, $\tilde{\psi}(p) = \langle p|\psi\rangle$. The two representations are related by the Fourier transform:

$$\tilde{\psi}(p) = \frac{1}{\sqrt{2\pi\hbar}} \int \psi(x)\, e^{-ipx/\hbar}\, dx$$

The Heisenberg uncertainty principle $\Delta x \cdot \Delta p \geq \hbar/2$ is exactly the Fourier uncertainty principle $\sigma_t \cdot \sigma_\xi \geq 1/(4\pi)$ in disguise, under the substitution $x = t$ and $p = h\xi$ (with $h = 2\pi\hbar$, since $\xi$ is measured in cycles rather than radians).
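
The bound can be checked numerically. The sketch below (assuming the cycles-per-unit Fourier convention above) computes the time and frequency spreads of a Gaussian; the product hits $1/(4\pi) \approx 0.0796$ because the Gaussian is the extremal case.

import numpy as np

n, dt = 4096, 0.01
t = (np.arange(n) - n // 2) * dt
f = np.exp(-t**2)                        # Gaussian: the minimizer of the bound

p = np.abs(f)**2; p /= p.sum()           # probability weights in time
sigma_t = np.sqrt((p * t**2).sum())

F = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(f)))
xi = np.fft.fftshift(np.fft.fftfreq(n, d=dt))
q = np.abs(F)**2; q /= q.sum()           # probability weights in frequency
sigma_xi = np.sqrt((q * xi**2).sum())

print(sigma_t * sigma_xi, 1 / (4 * np.pi))   # both ~0.0796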

For AI: The mathematical structure of quantum mechanics (Hilbert space, operators, eigenvalues) and the mathematical structure of attention (inner products, softmax, value aggregation) are strikingly parallel. Several researchers have proposed "quantum attention" mechanisms, and the Fourier analysis connecting position and momentum representations underlies proposals for "quantum-inspired" sequence modeling.

J.4 The Ergodic Theorem and Fourier Analysis

Birkhoff's ergodic theorem. For an ergodic measure-preserving transformation $T$ and $f \in L^1$:

$$\frac{1}{N}\sum_{k=0}^{N-1} f(T^k x) \to \int f\, d\mu \quad \text{a.e.}$$

The time average equals the space average.

Connection to Fourier series. For the rotation $T: x \mapsto x + \alpha \pmod{1}$ on the circle (with $\alpha$ irrational), Birkhoff's theorem gives:

$$\frac{1}{N}\sum_{k=0}^{N-1} f(x + k\alpha) \to \int_0^1 f\, dx$$

In Fourier space: $\frac{1}{N}\sum_{k=0}^{N-1} c_n[f]\, e^{2\pi i n k\alpha} = c_n[f] \cdot \frac{1}{N}\sum_{k=0}^{N-1} e^{2\pi i n k\alpha}$. For irrational $\alpha$ and $n \neq 0$: $\frac{1}{N}\sum_k e^{2\pi i n k\alpha} \to 0$ (equidistribution) -> only the $n = 0$ term survives -> the time average equals $c_0[f] = \int f$.
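
A minimal numerical sketch of this convergence, using the irrational rotation number $\alpha = \sqrt{2} - 1$:

import numpy as np

alpha = np.sqrt(2) - 1                       # irrational rotation number
f = lambda x: np.sin(2 * np.pi * x)**2       # space average = 1/2

x0 = 0.123
for N in [10, 100, 1000, 10000]:
    orbit = (x0 + alpha * np.arange(N)) % 1.0
    print(N, f(orbit).mean())                # time averages -> 0.5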

For AI: The ergodic theorem underpins the mixing of training data. In language model training, a sequence of tokens is modeled as a stationary stochastic process (natural language). Ergodicity says that sampling from a long enough sequence gives a representative sample of the distribution - justifying the use of random windowed crops for training, rather than requiring the full corpus at each step.


Appendix K: Connections Between Sections

K.1 Why This Section is the Foundation

All subsequent sections in this chapter build on the foundations laid here:

-> Section 20-02 (Fourier Transform): Take the period $T \to \infty$ in the Fourier series. The discrete sum $\sum_n c_n e^{inx}$ becomes the integral $\int \hat{f}(\xi) e^{2\pi i \xi x}\, d\xi$. Every property of the Fourier series has an analogue in the Fourier transform - the derivations are nearly identical. The key new feature is the continuous spectrum and the $L^2(\mathbb{R})$ setting.

-> Section 20-03 (DFT and FFT): Discretize the Fourier transform: sample $f$ at $N$ points and compute $N$ frequency coefficients. The DFT is the Fourier series applied to the sequence $(f_0, \ldots, f_{N-1})$ viewed as a periodic sequence of period $N$. The FFT is the efficient algorithm for computing it.

-> Section 20-04 (Convolution Theorem): The property $c_n[f*g] = c_n[f] \cdot c_n[g]$ (convolution in time = multiplication in frequency) is the single most important theorem in signal processing. It follows directly from the orthogonality of the Fourier basis and is proved here in the series setting (a numerical check follows this list); the full treatment with applications to CNNs is in Section 20-04.

-> Section 20-05 (Wavelets): Wavelets are a replacement for Fourier series that provides both time and frequency localization. The Haar wavelet is a piecewise constant basis - like the Fourier basis, but localized in time. The mathematical framework of multiresolution analysis is the generalization of the Fourier series framework (nested subspaces, orthonormal bases) to time-localized expansions.
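
As promised above, a minimal numerical check of the convolution theorem in its discrete periodic form: the DFT of a circular convolution equals the product of the DFTs.

import numpy as np

N = 32
f, g = np.random.randn(N), np.random.randn(N)

# circular convolution computed directly from the definition
circ = np.array([sum(f[m] * g[(k - m) % N] for m in range(N))
                 for k in range(N)])

# ... and via the convolution theorem: multiply in the Fourier domain
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
print(np.allclose(circ, via_fft))        # True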

After this section, you have two natural paths:

Path A - Depth first: Continue with Section 20-02 (Fourier Transform) to see how the theory extends to aperiodic signals and the full real line. This path eventually leads to the uncertainty principle and the deeper function space theory.

Path B - Applications first: Jump to Section 20-04 (Convolution Theorem) to see immediately how Fourier series powers CNNs and sequence models. The series form of the convolution theorem proved here is enough to get started; the full real-line version requires the Fourier transform, so circle back to Section 20-02 as needed.

For the AI-focused reader: The most direct path from this section to ML applications is:

  1. Section 20-01 (this section) - Fourier series fundamentals
  2. Section 20-03 (DFT/FFT) - computational implementation
  3. Section 20-04 (Convolution Theorem) - CNNs and sequence models
  4. Section 20-02 (Fourier Transform) - theory for spectral methods
  5. Section 20-05 (Wavelets) - multi-scale models and compression

Appendix L: The Fourier Series in Numerical Practice

L.1 Accuracy Benchmarks: Theoretical vs. Observed

When implementing Fourier series approximations numerically, understanding the gap between theoretical and observed accuracy is essential.

Experiment: Square wave approximation accuracy.

For the square wave $f(x) = \operatorname{sgn}(x)$ with $N$-term Fourier series $S_N f$:

  • Theoretical $L^2$ error: $\lVert f - S_N f \rVert_2^2 = \sum_{k>N/2,\ k\text{ odd}} \frac{16}{\pi^2 k^2} \sim \frac{8}{\pi^2 N}$
  • Observed: plot $\lVert f - S_N f \rVert_2^2$ vs. $N$ on a log-log scale. The slope should be $-1$, confirming $O(1/N)$ decay - see the sketch after this list.
  • At the jump: the maximum overshoot $\max_x S_N f(x) - 1 \approx 0.179$ regardless of $N$ (Gibbs); the overshoot point slides toward the jump like $\pi/N$, but its height never shrinks.
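
A minimal sketch of this experiment (the exact constants depend on the normalization convention):

import numpy as np

x = np.linspace(-np.pi, np.pi, 20001)
f = np.sign(x)

for N in [8, 32, 128, 512]:
    S = np.zeros_like(x)
    for k in range(1, N + 1, 2):             # odd harmonics only
        S += (4 / (np.pi * k)) * np.sin(k * x)
    err2 = np.mean((f - S)**2)               # ~ (1/2pi) * integral of (f-S)^2
    print(f"N={N:4d}  err^2={err2:.5f}  overshoot={S.max() - 1:.3f}")
# err^2 drops by ~4x each time N quadruples (O(1/N), slope -1 on log-log);
# the overshoot stays near 0.179 at every N (Gibbs).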

Experiment: Triangle wave approximation accuracy.

For $f(x) = |x|$ with $|c_n| = 4/(\pi^2 n^2)$ (odd $n$):

  • Theoretical $L^2$ error: $\lVert f - S_N f \rVert_2^2 = \sum_{k>N/2,\ k\text{ odd}} \frac{16}{\pi^4 k^4} \sim \frac{4}{3\pi^4 N^3}$
  • Observed: slope $\approx -3$ on the log-log plot, confirming $O(1/N^3)$ decay

Experiment: Smooth function accuracy.

For $f(x) = e^{\cos x}$ (analytic, $2\pi$-periodic):

  • Theoretical: $|c_n| \leq C e^{-|n|}$ -> $\lVert f - S_N f \rVert_2 \leq C' e^{-N}$
  • Observed: a semi-log plot of $\lVert f - S_N f \rVert_2$ vs. $N$ is linear (exponential decay)
  • Essentially machine-precision accuracy for $N \geq 30$

Takeaway: The rule of thumb is "smooth = exponentially good, nonsmooth = algebraically poor." For discontinuous signals, truncated Fourier series are often a poor choice - JPEG-style block transforms or wavelets are better.

L.2 The Fast Fourier Transform Preview

The DFT (fully treated in Section 20-03) computes exactly the same Fourier coefficients as the Fourier series formula, but for a finite sequence and in $O(N \log N)$ time.

The connection: For samples $f_k = f(x_k)$ at $N$ equispaced points $x_k = 2\pi k/N$ (shifting the interval to $[0, 2\pi]$ by periodicity):

$$\text{DFT}(f)[n] = \sum_{k=0}^{N-1} f_k\, e^{-2\pi i k n/N}$$

Comparing with $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx \approx \frac{1}{N}\sum_{k=0}^{N-1} f_k\, e^{-inx_k}$ (the rectangle rule with $\Delta x = 2\pi/N$):

$$c_n \approx \frac{1}{N} \cdot \text{DFT}(f)[n]$$

So the DFT is just the Fourier series computation via a rectangle-rule quadrature, scaled by NN.
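
A minimal numerical confirmation, using a trigonometric polynomial whose coefficients are known exactly (for bandlimited inputs the rectangle rule is exact):

import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N
f = 1 + 2 * np.cos(x) + 3 * np.sin(2 * x)
# exact coefficients: c_0 = 1, c_1 = c_-1 = 1, c_2 = -1.5j, c_-2 = +1.5j

c = np.fft.fft(f) / N
print(np.round(c[:3], 10))    # [1, 1, -1.5j]  (c_0, c_1, c_2)
print(np.round(c[-2:], 10))   # [1.5j, 1]      (c_-2, c_-1)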

The FFT (Cooley-Tukey, 1965) computes all $N$ DFT values in $O(N \log N)$ operations instead of the naive $O(N^2)$. For $N = 10^6$: the naive DFT requires $10^{12}$ operations; the FFT requires $\approx 2 \times 10^7$ - a factor of 50,000 speedup.
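
The gap is easy to feel even at modest sizes. A minimal timing sketch comparing a naive matrix DFT against np.fft.fft:

import numpy as np, time

N = 2048
x = np.random.randn(N)
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N)   # the N x N DFT matrix

t0 = time.perf_counter(); X_naive = W @ x; t1 = time.perf_counter()
X_fft = np.fft.fft(x);                     t2 = time.perf_counter()

print(np.allclose(X_naive, X_fft))             # same result
print(f"naive {t1 - t0:.4f}s  vs  fft {t2 - t1:.6f}s")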

This speedup is what made digital signal processing, audio compression, and modern AI practically feasible. The STFT and mel-spectrogram preprocessing in Whisper, the frequency-domain convolutions in FNet, and the state-space model computations in S4/Mamba all rely on this speedup. Full treatment: Section 20-03 Discrete Fourier Transform and FFT.

L.3 Fourier Series vs. Polynomial Approximation

Both Fourier series and polynomial approximation (Taylor, Chebyshev, Legendre) approximate functions by basis expansions. Understanding when to use each is important.

| Aspect | Fourier series | Polynomial (Chebyshev) |
| --- | --- | --- |
| Best for | Periodic functions, global frequency analysis | Non-periodic functions on $[a,b]$ |
| Convergence for smooth $f$ | Exponential (analytic), algebraic (finitely smooth) | Exponential (analytic), algebraic (finitely smooth) |
| At discontinuities | Gibbs phenomenon (9% overshoot) | Runge phenomenon near endpoints (for equispaced nodes; Chebyshev nodes avoid it) |
| Complexity for $N$ terms | $O(N \log N)$ via FFT | $O(N^2)$, or $O(N \log N)$ for Chebyshev via DCT |
| AI applications | RoPE, spectrograms, convolutions | Polynomial activations, rational networks |
| Sparse representation | Works for bandlimited signals | Works for smooth functions |

Key insight: For periodic and global structure, use Fourier series. For localized and non-periodic structure, use wavelets (Section 20-05) or local polynomial bases. The choice of approximation basis should match the intrinsic structure of the function.

L.4 When Fourier Series Fail: Pathological Examples

Understanding the failure modes of Fourier series deepens intuition.

Example 1 - Du Bois-Reymond's construction (1876): There exists a continuous $2\pi$-periodic function $f$ whose Fourier series diverges at $x = 0$. The construction exploits the slow growth of the Dirichlet kernel norm $\lVert D_N \rVert_{L^1} \sim \frac{4}{\pi^2}\ln N$: build $f$ so that $|S_N f(0)|$ grows like $\ln N$.

Example 2 - Kolmogorov's function (1923): There exists an $L^1$ function whose Fourier series diverges everywhere. This shows that $L^1$ is too weak for pointwise Fourier convergence.

Example 3 - Weierstrass nowhere-differentiable function:

$$W(x) = \sum_{n=0}^\infty a^n \cos(b^n \pi x), \qquad 0 < a < 1, \quad ab > 1 + \frac{3\pi}{2}$$

This is a Fourier series (a cosine series) that converges uniformly (geometric series bounds), but the sum is nowhere differentiable; in Weierstrass's original construction $b$ is an odd integer. The coefficients $a^n$ decay geometrically but the frequencies $b^n$ grow geometrically - the function has "fractal" structure with self-similar behavior at all scales.
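
A minimal sketch of the uniform convergence claim: successive partial sums agree to within the geometric tail bound $a^N/(1-a)$, even though the limit has no derivative anywhere.

import numpy as np

a, b = 0.5, 13                           # ab = 6.5 > 1 + 3*pi/2, b odd
x = np.linspace(0.0, 1.0, 5)

for N in [5, 10, 20]:
    W = sum(a**n * np.cos(b**n * np.pi * x) for n in range(N))
    print(N, np.round(W, 6))
# Going from N to a larger M changes the values by at most a^N/(1-a):
# uniform convergence, yet the limit is nowhere differentiable.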

For AI: Nowhere-differentiable functions are the worst case for neural network optimization. The loss landscape of a neural network restricted to certain parameter subspaces can exhibit Weierstrass-like behavior - highly oscillatory, with gradients carrying no information about the global minimum. This motivates smoothing techniques (momentum as in Adam, gradient clipping) and the study of loss landscape geometry.


Appendix M: Quick Reference

M.1 Key Formulas at a Glance

$$\boxed{c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx, \quad f(x) = \sum_{n=-\infty}^\infty c_n e^{inx}}$$

$$\boxed{a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \quad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx}$$

$$\boxed{\text{Parseval: } \sum_{n=-\infty}^\infty |c_n|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi |f(x)|^2\,dx}$$

$$\boxed{\text{Shift: } c_n[f(\cdot - x_0)] = e^{-inx_0}c_n[f], \quad \text{Deriv: } c_n[f'] = in\, c_n[f]}$$

$$\boxed{S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,D_N(x-t)\,dt, \quad D_N(x) = \frac{\sin((N+\frac{1}{2})x)}{\sin(x/2)}}$$

$$\boxed{\text{RoPE: } (q_{2i}+iq_{2i+1})\,e^{im\theta_i}, \quad \theta_i = 10000^{-2i/d}}$$

$$\boxed{\text{Score}(m,n) = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})\,\overline{(k_{2i}+ik_{2i+1})}\, e^{i(m-n)\theta_i}\right]}$$

M.2 Decision Tree: Which Tool to Use

Is your signal periodic?
+-- YES -> Fourier series (Section 20-01)
|   +-- Need to compute? -> DFT/FFT (Section 20-03)
|   +-- Need to convolve? -> Convolution theorem (Section 20-04)
+-- NO -> Fourier transform (Section 20-02)
    +-- Bandlimited? -> Sampling theorem
    +-- Need time localization? -> Wavelets (Section 20-05)
    +-- Aperiodic transient? -> STFT (Section 20-03)

Is your signal discrete?
+-- YES -> DFT/FFT (Section 20-03)
+-- NO -> Fourier transform (Section 20-02)

Do you need time AND frequency localization?
+-- YES -> Wavelets (Section 20-05)

Appendix N: Fourier Series and Modern Transformers - Extended Analysis

N.1 YaRN and NTK-Aware Frequency Scaling

RoPE with base $10000$, trained at a context window of $\sim 4096$ tokens, degrades rapidly beyond the trained window. Several methods have been proposed to extend it:

Linear Position Interpolation (PI) - Chen et al., 2023. Scale all positions by the context extension ratio $s = L_{\text{new}}/L_{\text{orig}}$: position $m$ becomes $m/s$. This compresses the angles so that $m\theta_i/s$ stays within the trained range. Equivalently, reduce all frequencies: $\theta_i \to \theta_i/s$.

Issue with linear interpolation: every frequency is compressed by the same factor $s$. The highest-frequency components (large $\theta_i$, which rotate rapidly and distinguish adjacent tokens) lose the most resolution: after compression they can no longer separate neighboring positions, so high-frequency positional information degrades.

NTK-RoPE - Bloc97, 2023. Scale the base rather than the positions: $10000 \to 10000 \cdot s^{d/(d-2)}$. This modifies low-frequency components much more than high-frequency ones - a non-uniform scaling motivated by NTK analysis.

YaRN - Peng et al., 2023. Interpolates frequencies non-uniformly, according to how fast each dimension rotates within the trained context: (i) high-frequency dimensions (many full rotations per context window) are left unchanged, i.e., they extrapolate; (ii) intermediate dimensions get a linear blend of the two regimes; (iii) low-frequency dimensions get full position interpolation. The boundaries between regimes are set by per-model tuned parameters $\alpha, \beta$. YaRN achieves strong performance at 128K context while fine-tuning on roughly 0.1% of the original pre-training data.
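
A minimal sketch (illustrative helper names, not any library's API) of how these scaling rules act on the RoPE frequency ladder:

import numpy as np

def rope_freqs(d, base=10000.0):
    i = np.arange(d // 2)
    return base ** (-2 * i / d)                  # theta_i, fast to slow

d, s = 128, 4.0                                  # head dim, extension ratio
theta = rope_freqs(d)

theta_pi  = theta / s                            # linear position interpolation
theta_ntk = rope_freqs(d, base=10000.0 * s ** (d / (d - 2)))  # NTK-aware

print(theta_pi[0] / theta[0])      # = 1/s for every i: uniform compression
print(theta_ntk[0] / theta[0])     # ~1.0: highest frequency untouched
print(theta_ntk[-1] / theta[-1])   # ~1/s: lowest frequency fully compressed
# NTK-aware scaling interpolates between these extremes; YaRN makes the
# split explicit with its per-dimension ramp.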

The Fourier interpretation: All these methods are attempting to find the right frequency allocation - which Fourier basis vectors should be responsible for which positional scales. The ideal is to have each frequency range $[1/(2\theta_i), 1/(2\theta_{i+1})]$ cover roughly equal logarithmic bandwidth, so that no scale is under- or over-represented.

N.2 Frequency Analysis of Layer Norms

LayerNorm as a high-pass filter. LayerNorm subtracts the mean and divides by standard deviation:

$$\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \bar{\mathbf{x}}}{\sigma_{\mathbf{x}}}$$

The mean subtraction is DC removal in signal-processing terms: it removes the $n = 0$ (constant) Fourier component of the vector being normalized. The division by $\sigma_{\mathbf{x}}$ is a gain normalization.

Implication: After LayerNorm, the token representation has zero mean. This concentrates the representational capacity in the non-DC Fourier components, reinforcing the tendency of transformers to represent relative (not absolute) positional information.

RMSNorm (used in LLaMA) only divides by the RMS (root mean square), without mean subtraction. It does not remove the DC component - preserving more absolute magnitude information. The trade-off usually cited: slightly less stable training vs. a richer representation.
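
A minimal sketch contrasting the two, treating a single feature vector as the signal being normalized:

import numpy as np

x = np.random.randn(64) + 3.0            # features with a large DC offset

ln  = (x - x.mean()) / x.std()           # LayerNorm (no learned affine)
rms = x / np.sqrt(np.mean(x**2))         # RMSNorm

print(abs(np.fft.fft(ln)[0]))            # ~0: DC (n=0) component removed
print(abs(np.fft.fft(rms)[0]))           # large: DC component preserved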

N.3 Fourier Analysis of Embedding Matrices

What does the Fourier transform of a token embedding matrix reveal?

Let $E \in \mathbb{R}^{V \times d}$ be the embedding matrix for vocabulary size $V$ and dimension $d$. For a fixed dimension $j$, the vector $e_j = E_{:,j} \in \mathbb{R}^V$ gives the $j$-th coordinate of all token embeddings.

If we sort tokens by frequency rank (most common first) and look at eje_j as a function of token rank, we can ask: what is its Fourier transform?

Empirical finding (2024): The embedding vectors of tokens sorted by frequency show 1/f-like spectra in dimension $j$ vs. frequency rank - consistent with the Zipf distribution of token frequencies. This suggests that the embedding space has a natural "frequency ordering" that mirrors the frequency distribution of the training corpus.
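
A minimal sketch of the measurement itself (a random matrix stands in for a trained embedding here; with a real checkpoint, the claim above predicts a 1/f-like decay instead of the flat spectrum this placeholder produces):

import numpy as np

V, d = 8192, 64
E = np.random.randn(V, d)        # placeholder: substitute a trained matrix,
                                 # with rows sorted by token frequency rank
j = 0
e_j = E[:, j]                    # j-th embedding coordinate for every token
spectrum = np.abs(np.fft.rfft(e_j))**2
print(spectrum[:5] / spectrum.mean())    # flat (white) for this placeholder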

For interpretability: The Fourier components of the embedding matrix at different "token frequencies" correspond to different semantic clusters. Low-frequency components (affecting the most common tokens) encode universal semantic primitives (noun/verb, positive/negative). High-frequency components encode rare, specific meanings.

N.4 The Fourier Perspective on Attention Sinks

The "attention sink" phenomenon (Xiao et al., 2023): In all transformer models, the first token (often <BOS> or the period .) receives disproportionate attention from most heads, even when semantically irrelevant.

Fourier explanation: The attention sink is the DC component (zero frequency) of the attention pattern. In signal processing, a low-pass filter preserves the DC component. If attention heads are implicitly low-pass filtering the sequence, they must "send" the removed energy somewhere - and the first token becomes the "drain" for this energy.

StreamingLLM (Xiao et al., 2023): By always keeping the first 4 tokens in the KV cache (the "attention sinks"), models can be extended to arbitrarily long sequences without performance degradation. This is a practical application of the Fourier insight: preserve the DC reference point.

N.5 Complete Worked Example: Sinusoidal PE Computation

Let $d = 8$ (small for illustration) and positions $\{0, 1, 2, 3, 4\}$.

Frequencies: $\omega_i = 10000^{-2i/8}$ for $i = 0, 1, 2, 3$:

  • $\omega_0 = 1.0000$
  • $\omega_1 = 10000^{-1/4} \approx 0.1000$
  • $\omega_2 = 10000^{-1/2} \approx 0.0100$
  • $\omega_3 = 10000^{-3/4} \approx 0.0010$

Positional encoding matrix $\text{PE} \in \mathbb{R}^{5 \times 8}$ (code to reproduce it follows the table):

| pos | sin(pos·1.0) | cos(pos·1.0) | sin(pos·0.1) | cos(pos·0.1) | sin(pos·0.01) | cos(pos·0.01) | sin(pos·0.001) | cos(pos·0.001) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.100 | 0.995 | 0.010 | 1.000 | 0.001 | 1.000 |
| 2 | 0.909 | -0.416 | 0.199 | 0.980 | 0.020 | 1.000 | 0.002 | 1.000 |
| 3 | 0.141 | -0.990 | 0.296 | 0.955 | 0.030 | 1.000 | 0.003 | 1.000 |
| 4 | -0.757 | -0.654 | 0.389 | 0.921 | 0.040 | 1.000 | 0.004 | 1.000 |
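
The table can be reproduced with a few lines of NumPy (a minimal sketch):

import numpy as np

d, n_pos = 8, 5
pos = np.arange(n_pos)[:, None]                   # positions 0..4, as a column
omega = 10000.0 ** (-2 * np.arange(d // 2) / d)   # the four frequencies

pe = np.zeros((n_pos, d))
pe[:, 0::2] = np.sin(pos * omega)                 # even dims: sin
pe[:, 1::2] = np.cos(pos * omega)                 # odd dims: cos
print(np.round(pe, 3))                            # matches the table above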

Observations:

  • Dimensions 0-1 ($\omega_0 = 1$): oscillate rapidly - a full cycle every $2\pi \approx 6.28$ positions. Distinguishes adjacent tokens.
  • Dimensions 2-3 ($\omega_1 = 0.1$): oscillate at 1/10th the rate. Distinguishes tokens up to ~63 apart.
  • Dimensions 4-5 ($\omega_2 = 0.01$): very slow oscillation. Distinguishes tokens up to ~628 apart.
  • Dimensions 6-7 ($\omega_3 = 0.001$): extremely slow. Distinguishes tokens up to ~6283 apart.

This is the Fourier series multi-resolution strategy: use harmonics at multiple scales to encode position information at all scales simultaneously. Each dimension pair $(2i, 2i+1)$ is a Fourier basis function at a different frequency - the entire positional encoding is a multi-frequency Fourier expansion of the position index.


Appendix O: Further Reading and References

O.1 Primary References

Textbooks (Classical):

  • Körner, T.W. (1988). Fourier Analysis. Cambridge University Press. - The most readable rigorous treatment; includes Gibbs phenomenon, convergence theory, and applications to number theory.
  • Katznelson, Y. (2004). An Introduction to Harmonic Analysis, 3rd ed. Cambridge. - Definitive graduate reference.
  • Stein, E. & Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton. - The multidimensional extension.
  • Tolstov, G.P. (1962). Fourier Series (Silverman translation). Dover. - Inexpensive, detailed, complete derivations of all standard results.

Textbooks (Applied):

  • Oppenheim, A. & Willsky, A. (1997). Signals and Systems, 2nd ed. Prentice Hall. - Standard engineering reference for Fourier methods in signal processing.
  • Bracewell, R.N. (2000). The Fourier Transform and Its Applications, 3rd ed. McGraw-Hill. - Intuitive engineering treatment with excellent visualizations.

O.2 Key ML Papers Using Fourier Methods

| Paper | Year | Contribution | Fourier connection |
| --- | --- | --- | --- |
| Vaswani et al., "Attention Is All You Need" | 2017 | Transformer; sinusoidal PE | Fourier basis for position |
| Rahimi & Recht, "Random Features for Large-Scale Kernel Machines" | 2007 | Random Fourier features | Bochner's theorem |
| Rahaman et al., "On the Spectral Bias of Neural Networks" | 2019 | Spectral bias phenomenon | NTK Fourier decomposition |
| Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" | 2021 | RoPE | Complex Fourier rotation |
| Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" | 2021 | FNet | DFT replaces attention |
| Sitzmann et al., "Implicit Neural Representations with Periodic Activations" | 2020 | SIREN | Fourier basis activations |
| Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" | 2020 | Fourier feature encoding | NeRF; overcomes spectral bias |
| Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces" | 2021 | S4 model | Frequency-domain parameterization |
| Peng et al., "YaRN: Efficient Context Window Extension of LLMs" | 2023 | Extended context | RoPE frequency scaling |
| Xiao et al., "Efficient Streaming Language Models with Attention Sinks" | 2023 | StreamingLLM | DC component preservation |

O.3 Connections to Other Sections in This Curriculum

| Topic | Where to find it | Connection to Fourier series |
| --- | --- | --- |
| $L^2$ Hilbert spaces, completeness | Section 12-02 Hilbert Spaces | The abstract framework for Fourier series |
| Inner products on function spaces | Section 12-01 Normed Spaces | The $\langle f,g \rangle$ that defines Fourier coefficients |
| Kernel methods, RKHS | Section 12-03 Kernel Methods | Bochner's theorem; random Fourier features |
| Complex numbers, series | Section 01 Mathematical Foundations | $e^{inx}$, Euler's formula |
| Gradient descent, loss landscape | Section 08 Optimization | NTK, spectral bias |
| Spectral graph convolution | Section 11-04 Spectral Graph Theory | Graph Fourier transform |
| Probability characteristic functions | Section 06 Probability Theory | Bochner, CLT via characteristic functions |

Appendix P: Common Error Patterns in Code

When implementing Fourier series and Fourier-based methods in Python, these are the most frequent bugs:

P.1 Normalization Factor Confusion

Bug: Using np.fft.fft(x) without dividing by N.

import numpy as np

N = len(x)  # x: N samples of one period of the signal

# WRONG: NumPy's FFT convention includes no 1/N factor
c_wrong = np.fft.fft(x)        # these are N * c_n, not c_n!

# CORRECT: divide by N to get the Fourier coefficients
c_correct = np.fft.fft(x) / N

Why: NumPy's FFT convention is $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n k/N}$ (no $1/N$ factor). The mathematical Fourier series coefficient is $c_k = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n k/N}$.

P.2 Frequency Axis Misidentification

# WRONG: assuming output index k corresponds to frequency k
freqs = np.arange(N)  # Wrong!

# CORRECT: use fftfreq
freqs = np.fft.fftfreq(N, d=1.0/sample_rate)  # In Hz
# or for [-pi, pi] interval:
freqs = np.fft.fftfreq(N) * N  # integer wavenumbers -N/2 to N/2

Why: The NumPy FFT output holds the integer wavenumbers $[0, 1, \ldots, N/2-1, -N/2, \ldots, -1]$ (cycles per record), not $[0, 1, \ldots, N-1]$.

P.3 Forgetting to Handle Real FFT Symmetry

For real inputs, $c_{-n} = \overline{c_n}$ (Hermitian symmetry). The positive-frequency half is sufficient:

# For real input, use rfft (returns only the positive frequencies):
N = len(x)
c = np.fft.rfft(x) / N     # length N//2 + 1
freqs = np.fft.rfftfreq(N) # corresponding frequencies

# To recover the full spectrum, remember to double all |c_n|^2 for n > 0:
power_onesided = np.abs(c)**2
power_onesided[1:-1] *= 2   # Double all except DC and Nyquist

P.4 Aliasing from Insufficient Sampling

# DANGER: signal content above the Nyquist frequency gets aliased
sample_rate = 100                         # Hz
t = np.arange(100) / sample_rate          # 1 second at 100 samples/s
x = np.sin(2 * np.pi * 80 * t)            # 80 Hz signal

# Nyquist frequency = sample_rate / 2 = 50 Hz; 80 Hz is ABOVE Nyquist.
# The DFT shows a false component at 100 - 80 = 20 Hz (the alias).
# ALWAYS check: max frequency in signal < sample_rate / 2

P.5 Misinterpreting Parseval in Discrete vs. Continuous Setting

Continuous Parseval: $\frac{1}{2\pi}\int_{-\pi}^\pi |f|^2 = \sum_n |c_n|^2$

Discrete Parseval (DFT): $\frac{1}{N}\sum_{k=0}^{N-1} |x[k]|^2 = \frac{1}{N^2}\sum_{n=0}^{N-1} |X[n]|^2$, where $X = \text{DFT}(x)$.

# Check Parseval numerically:
N = 1024
x = np.random.randn(N)
X = np.fft.fft(x)
lhs = np.mean(np.abs(x)**2)        # (1/N) * sum |x[k]|^2
rhs = np.sum(np.abs(X / N)**2)     # sum |c_n|^2 = (1/N^2) * sum |X[n]|^2
assert np.isclose(lhs, rhs), f"Parseval failed: {lhs} != {rhs}"



Appendix Q: Historical Vignettes

Q.1 Fourier's Audacity and the Reception of His Theory

Joseph Fourier (1768-1830) was not primarily a mathematician. He was a physicist and administrator who, as Napoleon's prefect of Isère, oversaw the draining of malarial swamps and the construction of roads. His mathematical masterpiece grew from a practical problem: how does heat conduct through a solid body?

When Fourier presented his theory to the Paris Academy in 1807, the reaction was hostile. The paper was rejected - or rather, shelved - on the grounds that the claim "every function can be represented as a trigonometric series" was both wrong (they thought) and insufficiently rigorous (it was). The judges included Laplace, Lagrange, and Monge - the greatest mathematicians of the age.

Lagrange's objection was mathematical: a sum of smooth functions (sines and cosines) cannot converge to a discontinuous function. This was a perfectly reasonable objection in 1807, before the modern theory of pointwise vs. $L^2$ convergence had been developed. Fourier's answer - essentially, "but it does!" - was more right than Lagrange, but not for the reasons Fourier gave.

Fourier persisted. In 1822, he published his masterpiece Théorie analytique de la chaleur, which remains one of the great works of mathematical physics. His key equations - the heat equation, the Fourier series, and the Fourier integral - were all present, though not rigorously proved.

The rigorous proof came from Dirichlet in 1829. The controversy about pointwise convergence of the Fourier series of a continuous function was not finally resolved until Carleson in 1966 - 159 years after Fourier's original claim. Mathematics sometimes takes time to catch up with correct intuitions.

Q.2 Dirichlet's Proof: The Birth of Rigorous Analysis

Peter Gustav Lejeune Dirichlet (1805-1859) was a student of Fourier's. His 1829 paper giving the first rigorous proof of pointwise convergence is often cited as one of the founding documents of modern analysis - not just because of the result, but because of the method.

Dirichlet's proof introduced two innovations that transformed mathematics:

  1. The modern definition of function. Before Dirichlet, a "function" meant something expressible by a formula. Dirichlet explicitly worked with functions that could behave differently on different parts of their domain, introducing what we now recognize as the modern concept. His paper is one of the first places where the contemporary idea of function (as any rule assigning outputs to inputs) appears clearly.

  2. Careful convergence arguments. Dirichlet's proof carefully tracked when limits could be exchanged and when they could not - a concern that Cauchy had introduced but that Dirichlet applied systematically to Fourier series. This careful accounting of convergence, rather than treating series as formal objects, is the hallmark of 19th-century rigor.

In this sense, Fourier series didn't just give us a useful computational tool. They provoked the development of the rigorous analysis that underlies all of modern mathematics and machine learning theory.

Q.3 Carleson's Theorem: The 159-Year-Old Problem

For well over a century after Dirichlet's theorem (1829), the pointwise convergence question remained open in successively sharper forms:

Does the Fourier series of every continuous function converge pointwise everywhere?

  • 1873: Du Bois-Reymond showed NO for continuous functions at one point
  • 1923: Kolmogorov showed NO for $L^1$ functions everywhere (!)
  • But for the natural space $L^2$...?

In 1966, Lennart Carleson proved: YES, the Fourier series of every $L^2$ function converges pointwise almost everywhere. This was one of the most celebrated theorems in 20th-century analysis. Carleson received the Abel Prize in 2006, with this theorem among the centerpiece contributions cited.

The proof uses a remarkable decomposition of the Dirichlet kernel into "tiles" in the time-frequency plane, combined with a sophisticated stopping-time argument. It is considered one of the most difficult proofs in all of harmonic analysis - spanning 30+ pages of dense mathematics.

For our purposes, the practical upshot is: every $L^2$ function (which includes every function we encounter in ML) has its Fourier series converging pointwise almost everywhere. The theoretical pathologies of convergence failure are measure-zero exceptions.


Appendix R: Self-Assessment Checklist

Use this checklist to verify understanding before moving to Section 20-02.

R.1 Computational Mastery

  • Can you compute the Fourier coefficients of the square wave, sawtooth, and triangle wave from scratch without looking them up?
  • Can you use the differentiation theorem ($c_n[f'] = in\, c_n[f]$) to derive the triangle wave coefficients from the square wave coefficients?
  • Can you apply Parseval's identity to evaluate $\sum_{n=1}^\infty 1/n^4$ (via the Fourier series of $x^2$)?
  • Can you identify the decay rate of Fourier coefficients ($O(1/n)$, $O(1/n^2)$, exponential) from a description of the function?
  • Can you write the partial sum $S_N f$ as a convolution with the Dirichlet kernel?

R.2 Theoretical Understanding

  • Can you state Dirichlet's theorem and identify what happens at a jump discontinuity?
  • Can you explain the Gibbs phenomenon and why it persists at any truncation order?
  • Can you state Parseval's identity and explain its physical meaning (energy conservation)?
  • Can you explain why $L^2$ convergence is the "correct" convergence for Fourier series?
  • Can you explain why the Fejér kernel (Cesàro means) fixes the Gibbs phenomenon while the Dirichlet kernel doesn't?
  • Can you explain the Riemann-Lebesgue lemma and its consequence: smooth signals have decaying Fourier coefficients?

R.3 AI Connections

  • Can you derive the sinusoidal PE formula from the Fourier basis perspective?
  • Can you explain RoPE in terms of complex Fourier multiplication and the shift theorem?
  • Can you explain the spectral bias phenomenon and its consequences for neural network training?
  • Can you explain why Random Fourier Features approximate kernel methods and how Bochner's theorem is involved?
  • Can you explain the attention sink phenomenon using Fourier analysis (DC component)?

R.4 Connections to Other Sections

  • Do you understand that the Fourier Transform (Section 20-02) is the $T \to \infty$ limit of the Fourier series?
  • Do you understand that the DFT (Section 20-03) is the Fourier series for a finite discrete sequence?
  • Do you understand that the convolution theorem (Section 20-04) is a consequence of Fourier series orthogonality ($c_n[f*g] = c_n[f] \cdot c_n[g]$)?
  • Do you understand that wavelets (Section 20-05) overcome the time-localization failure of Fourier series?

If you cannot answer YES to all items in R.1 and R.2, and at least 4 items in R.3 and R.4, review the corresponding sections before proceeding.




Appendix S: Summary of Key Results

S.1 The Core Theorems

Theorem 1 (Dirichlet, 1829). If $f$ is $2\pi$-periodic and piecewise smooth, then for every $x$:

$$\sum_{n=-N}^N c_n e^{inx} \;\to\; \frac{f(x^+) + f(x^-)}{2} \quad \text{as } N \to \infty.$$

Theorem 2 ($L^2$ convergence). For every $f \in L^2[-\pi,\pi]$:

$$\left\lVert f - \sum_{|n|\leq N} c_n e^{inx} \right\rVert_{L^2} \to 0 \quad \text{as } N \to \infty.$$

Theorem 3 (Parseval-Bessel). For $f \in L^2[-\pi,\pi]$:

$$\sum_{n=-\infty}^\infty |c_n|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi |f(x)|^2 \, dx.$$

Theorem 4 (Fejér, 1900). For every $f \in C[-\pi,\pi]$ (continuous and $2\pi$-periodic):

$$\frac{1}{N+1}\sum_{k=0}^N S_k f \to f \quad \text{uniformly as } N \to \infty.$$

Theorem 5 (Riemann-Lebesgue). For $f \in L^1[-\pi,\pi]$:

$$c_n[f] \to 0 \quad \text{as } |n| \to \infty.$$

Theorem 6 (Carleson, 1966). For every $f \in L^2[-\pi,\pi]$:

$$\sum_{|n|\leq N} c_n e^{inx} \to f(x) \quad \text{for almost every } x \in [-\pi,\pi].$$

S.2 The Most Important Formula

Among all the formulas in this section, one stands out as the most consequential for both mathematics and AI:

$$\boxed{c_n[f] = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx \qquad \longleftrightarrow \qquad f(x) = \sum_{n=-\infty}^{\infty} c_n\, e^{inx}}$$

Every other result in this section - convergence theorems, Parseval, the shift property, RoPE, spectral bias - is a consequence or application of this pair.
