
Fourier Series: Appendix J: Fourier Series in Diverse Mathematical Contexts to Appendix S: Summary of Key Results

Appendix J: Fourier Series in Diverse Mathematical Contexts

J.1 Fourier Series and Number Theory

The connection between Fourier analysis and number theory is profound and historically important.

Dirichlet characters. For a modulus $q$, a Dirichlet character $\chi: \mathbb{Z} \to \mathbb{C}$ is a completely multiplicative arithmetic function with period $q$ (vanishing at integers not coprime to $q$). These are precisely the characters of the group $(\mathbb{Z}/q\mathbb{Z})^*$. Dirichlet's proof that there are infinitely many primes in any arithmetic progression $\{a, a+q, a+2q, \ldots\}$ with $\gcd(a,q) = 1$ uses Fourier analysis on the group $(\mathbb{Z}/q\mathbb{Z})^*$ - a finite version of Fourier series.

The Poisson summation formula. For a "nice" function $f$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{k=-\infty}^{\infty} \hat{f}(k)$$

This connects sums of a function over the integers to sums of its Fourier transform over the integers. Applications include (a numerical check follows the list):

  • Jacobi theta functions: $\theta(\tau) = \sum_{n=-\infty}^\infty e^{i\pi n^2 \tau}$ satisfies a functional equation obtained from Poisson summation
  • Riemann zeta function: the functional equation $\zeta(s) = 2^s \pi^{s-1} \sin(\pi s/2)\, \Gamma(1-s)\, \zeta(1-s)$ is proved via Poisson summation
  • Quadratic forms: the number of ways to represent an integer as a sum of $k$ squares is expressed via theta functions
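
A quick numerical check (a minimal sketch in NumPy): for the Gaussian $f(x) = e^{-\pi t x^2}$, whose Fourier transform is $\hat{f}(\xi) = t^{-1/2} e^{-\pi \xi^2/t}$, Poisson summation yields the theta functional equation $\theta(t) = t^{-1/2}\,\theta(1/t)$ - the Jacobi theta bullet above at purely imaginary argument $\tau = it$.

import numpy as np

def theta(t, n_max=50):
    # sum over n of exp(-pi n^2 t), truncated at |n| <= n_max
    n = np.arange(-n_max, n_max + 1)
    return np.exp(-np.pi * n**2 * t).sum()

# Poisson summation for f(x) = exp(-pi t x^2): theta(t) = theta(1/t) / sqrt(t)
for t in [0.1, 0.5, 2.0]:
    print(t, theta(t), theta(1.0 / t) / np.sqrt(t))  # last two columns agree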

For AI: Poisson summation appears in the theory of neural network generalization. The neural tangent kernel at infinite width is related to the Fourier transform of the network architecture, and bounds on generalization involve the Poisson summation formula applied to the empirical Fourier spectrum of the training data.

J.2 Fourier Series on Non-Euclidean Spaces

Fourier analysis on groups. The Fourier series on $[-\pi,\pi]$ is really harmonic analysis on the group $S^1 = \{e^{i\theta} : \theta \in [-\pi,\pi]\}$. The generalization to compact groups $G$ gives:

$$f(g) = \sum_{\pi \in \hat{G}} d_\pi \operatorname{tr}\big(\hat{f}(\pi)\, \pi(g)\big)$$

where $\hat{G}$ is the set of irreducible unitary representations of $G$ and $d_\pi$ is the dimension of the representation $\pi$. For $G = S^1$: all irreducible representations are 1-dimensional ($d_\pi = 1$), and we recover the ordinary Fourier series.

For AI - Equivariant networks: Group-equivariant neural networks (Cohen & Welling, 2016) build in symmetries (rotations, reflections) by working with group-Fourier analysis. A convolutional network on $\mathbb{R}^2$ is equivariant to translations; lifting to $SE(2)$ (rigid motions) gives equivariance to rotation and translation simultaneously. The mathematical foundation is the group Fourier transform - Fourier series on a non-abelian group.

Fourier analysis on graphs. The graph Laplacian $L = D - A$ has eigenvectors $U = [u_1, \ldots, u_n]$ with eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. The graph Fourier transform is $\hat{f} = U^\top f$. Graph convolution (used in GCN and ChebNet - covered in Section 11-04) is multiplication in this graph Fourier domain:

$$f *_G h = U\big((U^\top f) \odot (U^\top h)\big)$$

This is exactly the convolution theorem applied to graphs, replacing the standard Fourier basis $e^{inx}$ with the Laplacian eigenvectors $u_k$.
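
Here is a minimal sketch of this pipeline on a toy graph (a 5-node cycle; the graph and signals are illustrative):

import numpy as np

n = 5
A = np.zeros((n, n))
for i in range(n):                       # adjacency matrix of a cycle
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L = D - A

lam, U = np.linalg.eigh(L)               # eigenvalues ascending, U orthonormal
print(np.round(lam, 3))                  # 0 = lambda_1 <= ... <= lambda_n

f = np.random.randn(n)                   # a signal on the nodes
h = np.random.randn(n)                   # a filter, also a node signal

f_hat = U.T @ f                          # graph Fourier transform
h_hat = U.T @ h
conv = U @ (f_hat * h_hat)               # graph convolution f *_G h
print(conv)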

J.3 Fourier Series and Quantum Mechanics

The mathematical structure of quantum mechanics is built on Fourier analysis.

Position and momentum representations. A quantum state $|\psi\rangle$ can be represented in the position basis, $\psi(x) = \langle x|\psi\rangle$ (the wavefunction), or in the momentum basis, $\tilde{\psi}(p) = \langle p|\psi\rangle$. The two representations are related by the Fourier transform:

$$\tilde{\psi}(p) = \frac{1}{\sqrt{2\pi\hbar}} \int \psi(x)\, e^{-ipx/\hbar}\, dx$$

The Heisenberg uncertainty principle $\Delta x \cdot \Delta p \geq \hbar/2$ is exactly the Fourier uncertainty principle $\sigma_t \cdot \sigma_\xi \geq 1/(4\pi)$ in disguise, under the substitution $x = t$ and $p = h\xi$ (with $h = 2\pi\hbar$, since $\xi$ is measured in cycles rather than radians).
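
The bound can be checked numerically. The sketch below (assuming the cycles-per-unit Fourier convention above) computes the time and frequency spreads of a Gaussian; the product hits $1/(4\pi) \approx 0.0796$ because the Gaussian is the extremal case.

import numpy as np

n, dt = 4096, 0.01
t = (np.arange(n) - n // 2) * dt
f = np.exp(-t**2)                        # Gaussian: the minimizer of the bound

p = np.abs(f)**2; p /= p.sum()           # probability weights in time
sigma_t = np.sqrt((p * t**2).sum())

F = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(f)))
xi = np.fft.fftshift(np.fft.fftfreq(n, d=dt))
q = np.abs(F)**2; q /= q.sum()           # probability weights in frequency
sigma_xi = np.sqrt((q * xi**2).sum())

print(sigma_t * sigma_xi, 1 / (4 * np.pi))   # both ~0.0796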

For AI: The mathematical structure of quantum mechanics (Hilbert space, operators, eigenvalues) and the mathematical structure of attention (inner products, softmax, value aggregation) are strikingly parallel. Several researchers have proposed "quantum attention" mechanisms, and the Fourier analysis connecting position and momentum representations underlies proposals for "quantum-inspired" sequence modeling.

J.4 The Ergodic Theorem and Fourier Analysis

Birkhoff's ergodic theorem. For an ergodic measure-preserving transformation $T$ and $f \in L^1$:

$$\frac{1}{N}\sum_{k=0}^{N-1} f(T^k x) \to \int f\, d\mu \quad \text{a.e.}$$

The time average equals the space average.

Connection to Fourier series. For the rotation $T: x \mapsto x + \alpha \pmod{1}$ on the circle (with $\alpha$ irrational), Birkhoff's theorem gives:

$$\frac{1}{N}\sum_{k=0}^{N-1} f(x + k\alpha) \to \int_0^1 f\, dx$$

In Fourier space: $\frac{1}{N}\sum_{k=0}^{N-1} c_n[f]\, e^{2\pi i n k\alpha} = c_n[f] \cdot \frac{1}{N}\sum_{k=0}^{N-1} e^{2\pi i n k\alpha}$. For irrational $\alpha$ and $n \neq 0$: $\frac{1}{N}\sum_k e^{2\pi i n k\alpha} \to 0$ (equidistribution) -> only the $n = 0$ term survives -> the time average equals $c_0[f] = \int f$.
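
A minimal numerical sketch of this convergence, using the irrational rotation number $\alpha = \sqrt{2} - 1$:

import numpy as np

alpha = np.sqrt(2) - 1                       # irrational rotation number
f = lambda x: np.sin(2 * np.pi * x)**2       # space average = 1/2

x0 = 0.123
for N in [10, 100, 1000, 10000]:
    orbit = (x0 + alpha * np.arange(N)) % 1.0
    print(N, f(orbit).mean())                # time averages -> 0.5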

For AI: The ergodic theorem underpins the mixing of training data. In language model training, a sequence of tokens is modeled as a stationary stochastic process (natural language). Ergodicity says that sampling from a long enough sequence gives a representative sample of the distribution - justifying the use of random windowed crops for training, rather than requiring the full corpus at each step.


Appendix K: Connections Between Sections

K.1 Why This Section is the Foundation

All subsequent sections in this chapter build on the foundations laid here:

-> Section 20-02 (Fourier Transform): Take the period $T \to \infty$ in the Fourier series. The discrete sum $\sum_n c_n e^{inx}$ becomes the integral $\int \hat{f}(\xi) e^{2\pi i \xi x}\, d\xi$. Every property of the Fourier series has an analogue in the Fourier transform - the derivations are nearly identical. The key new feature is the continuous spectrum and the $L^2(\mathbb{R})$ setting.

-> Section 20-03 (DFT and FFT): Discretize the Fourier transform: sample $f$ at $N$ points and compute $N$ frequency coefficients. The DFT is the Fourier series applied to the sequence $(f_0, \ldots, f_{N-1})$ viewed as a periodic sequence of period $N$. The FFT is the efficient algorithm for computing it.

-> Section 20-04 (Convolution Theorem): The property $c_n[f*g] = c_n[f] \cdot c_n[g]$ (convolution in time = multiplication in frequency) is the single most important theorem in signal processing. It follows directly from the orthogonality of the Fourier basis and is proved here in the series setting (a numerical check follows this list); the full treatment with applications to CNNs is in Section 20-04.

-> Section 20-05 (Wavelets): Wavelets are a replacement for Fourier series that provides both time and frequency localization. The Haar wavelet is a piecewise constant basis - like the Fourier basis, but localized in time. The mathematical framework of multiresolution analysis is the generalization of the Fourier series framework (nested subspaces, orthonormal bases) to time-localized expansions.
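
As promised above, a minimal numerical check of the convolution theorem in its discrete periodic form: the DFT of a circular convolution equals the product of the DFTs.

import numpy as np

N = 32
f, g = np.random.randn(N), np.random.randn(N)

# circular convolution computed directly from the definition
circ = np.array([sum(f[m] * g[(k - m) % N] for m in range(N))
                 for k in range(N)])

# ... and via the convolution theorem: multiply in the Fourier domain
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
print(np.allclose(circ, via_fft))        # True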

After this section, you have two natural paths:

Path A - Depth first: Continue with Section 20-02 (Fourier Transform) to see how the theory extends to aperiodic signals and the full real line. This path eventually leads to the uncertainty principle and the deeper function space theory.

Path B - Applications first: Jump to Section 20-04 (Convolution Theorem) to see immediately how Fourier series powers CNNs and sequence models. The series form of the convolution theorem proved here is enough to get started; the full real-line version requires the Fourier transform, so circle back to Section 20-02 as needed.

For the AI-focused reader: The most direct path from this section to ML applications is:

  1. Section 20-01 (this section) - Fourier series fundamentals
  2. Section 20-03 (DFT/FFT) - computational implementation
  3. Section 20-04 (Convolution Theorem) - CNNs and sequence models
  4. Section 20-02 (Fourier Transform) - theory for spectral methods
  5. Section 20-05 (Wavelets) - multi-scale models and compression

Appendix L: The Fourier Series in Numerical Practice

L.1 Accuracy Benchmarks: Theoretical vs. Observed

When implementing Fourier series approximations numerically, understanding the gap between theoretical and observed accuracy is essential.

Experiment: Square wave approximation accuracy.

For the square wave $f(x) = \operatorname{sgn}(x)$ with $N$-term Fourier series $S_N f$:

  • Theoretical $L^2$ error: $\lVert f - S_N f \rVert_2^2 = \sum_{k>N/2,\ k\text{ odd}} \frac{16}{\pi^2 k^2} \sim \frac{8}{\pi^2 N}$
  • Observed: plot $\lVert f - S_N f \rVert_2^2$ vs. $N$ on a log-log scale. The slope should be $-1$, confirming $O(1/N)$ decay - see the sketch after this list.
  • At the jump: the maximum overshoot $\max_x S_N f(x) - 1 \approx 0.179$ regardless of $N$ (Gibbs); the overshoot point slides toward the jump like $\pi/N$, but its height never shrinks.
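
A minimal sketch of this experiment (the exact constants depend on the normalization convention):

import numpy as np

x = np.linspace(-np.pi, np.pi, 20001)
f = np.sign(x)

for N in [8, 32, 128, 512]:
    S = np.zeros_like(x)
    for k in range(1, N + 1, 2):             # odd harmonics only
        S += (4 / (np.pi * k)) * np.sin(k * x)
    err2 = np.mean((f - S)**2)               # ~ (1/2pi) * integral of (f-S)^2
    print(f"N={N:4d}  err^2={err2:.5f}  overshoot={S.max() - 1:.3f}")
# err^2 drops by ~4x each time N quadruples (O(1/N), slope -1 on log-log);
# the overshoot stays near 0.179 at every N (Gibbs).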

Experiment: Triangle wave approximation accuracy.

For $f(x) = |x|$ with $|c_n| = 4/(\pi^2 n^2)$ (odd $n$):

  • Theoretical $L^2$ error: $\lVert f - S_N f \rVert_2^2 = \sum_{k>N/2,\ k\text{ odd}} \frac{16}{\pi^4 k^4} \sim \frac{4}{3\pi^4 N^3}$
  • Observed: slope $\approx -3$ on the log-log plot, confirming $O(1/N^3)$ decay

Experiment: Smooth function accuracy.

For $f(x) = e^{\cos x}$ (analytic, $2\pi$-periodic):

  • Theoretical: $|c_n| \leq C e^{-|n|}$ -> $\lVert f - S_N f \rVert_2 \leq C' e^{-N}$
  • Observed: a semi-log plot of $\lVert f - S_N f \rVert_2$ vs. $N$ is linear (exponential decay)
  • Essentially machine-precision accuracy for $N \geq 30$

Takeaway: The rule of thumb is "smooth = exponentially good, nonsmooth = algebraically poor." For discontinuous signals, truncated Fourier series are often a poor choice - JPEG-style block transforms or wavelets are better.

L.2 The Fast Fourier Transform Preview

The DFT (fully treated in Section 20-03) computes exactly the same Fourier coefficients as the Fourier series formula, but for a finite sequence and in $O(N \log N)$ time.

The connection: For samples $f_k = f(x_k)$ at $N$ equispaced points $x_k = 2\pi k/N$ (shifting the interval to $[0, 2\pi]$ by periodicity):

$$\text{DFT}(f)[n] = \sum_{k=0}^{N-1} f_k\, e^{-2\pi i k n/N}$$

Comparing with $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx \approx \frac{1}{N}\sum_{k=0}^{N-1} f_k\, e^{-inx_k}$ (the rectangle rule with $\Delta x = 2\pi/N$):

$$c_n \approx \frac{1}{N} \cdot \text{DFT}(f)[n]$$

So the DFT is just the Fourier series computation via a rectangle-rule quadrature, scaled by NN.
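
A minimal numerical confirmation, using a trigonometric polynomial whose coefficients are known exactly (for bandlimited inputs the rectangle rule is exact):

import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N
f = 1 + 2 * np.cos(x) + 3 * np.sin(2 * x)
# exact coefficients: c_0 = 1, c_1 = c_-1 = 1, c_2 = -1.5j, c_-2 = +1.5j

c = np.fft.fft(f) / N
print(np.round(c[:3], 10))    # [1, 1, -1.5j]  (c_0, c_1, c_2)
print(np.round(c[-2:], 10))   # [1.5j, 1]      (c_-2, c_-1)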

The FFT (Cooley-Tukey, 1965) computes all $N$ DFT values in $O(N \log N)$ operations instead of the naive $O(N^2)$. For $N = 10^6$: the naive DFT requires $10^{12}$ operations; the FFT requires $\approx 2 \times 10^7$ - a factor of 50,000 speedup.
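
The gap is easy to feel even at modest sizes. A minimal timing sketch comparing a naive matrix DFT against np.fft.fft:

import numpy as np, time

N = 2048
x = np.random.randn(N)
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N)   # the N x N DFT matrix

t0 = time.perf_counter(); X_naive = W @ x; t1 = time.perf_counter()
X_fft = np.fft.fft(x);                     t2 = time.perf_counter()

print(np.allclose(X_naive, X_fft))             # same result
print(f"naive {t1 - t0:.4f}s  vs  fft {t2 - t1:.6f}s")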

This speedup is what made digital signal processing, audio compression, and modern AI practically feasible. The STFT and mel-spectrogram preprocessing in Whisper, the frequency-domain convolutions in FNet, and the state-space model computations in S4/Mamba all rely on this speedup. Full treatment: Section 20-03 Discrete Fourier Transform and FFT.

L.3 Fourier Series vs. Polynomial Approximation

Both Fourier series and polynomial approximation (Taylor, Chebyshev, Legendre) approximate functions by basis expansions. Understanding when to use each is important.

| Aspect | Fourier series | Polynomial (Chebyshev) |
| --- | --- | --- |
| Best for | Periodic functions, global frequency analysis | Non-periodic functions on $[a,b]$ |
| Convergence for smooth $f$ | Exponential (analytic), algebraic (finitely smooth) | Exponential (analytic), algebraic (finitely smooth) |
| At discontinuities | Gibbs phenomenon (9% overshoot) | Runge phenomenon near endpoints (for equispaced nodes; Chebyshev nodes avoid it) |
| Complexity for $N$ terms | $O(N \log N)$ via FFT | $O(N^2)$, or $O(N \log N)$ for Chebyshev via DCT |
| AI applications | RoPE, spectrograms, convolutions | Polynomial activations, rational networks |
| Sparse representation | Works for bandlimited signals | Works for smooth functions |

Key insight: For periodic and global structure, use Fourier series. For localized and non-periodic structure, use wavelets (Section 20-05) or local polynomial bases. The choice of approximation basis should match the intrinsic structure of the function.

L.4 When Fourier Series Fail: Pathological Examples

Understanding the failure modes of Fourier series deepens intuition.

Example 1 - Du Bois-Reymond's construction (1876): There exists a continuous $2\pi$-periodic function $f$ whose Fourier series diverges at $x = 0$. The construction exploits the slow growth of the Dirichlet kernel norm $\lVert D_N \rVert_{L^1} \sim \frac{4}{\pi^2}\ln N$: build $f$ so that $|S_N f(0)|$ grows like $\ln N$.

Example 2 - Kolmogorov's function (1923): There exists an $L^1$ function whose Fourier series diverges everywhere. This shows that $L^1$ is too weak for pointwise Fourier convergence.

Example 3 - Weierstrass nowhere-differentiable function:

$$W(x) = \sum_{n=0}^\infty a^n \cos(b^n \pi x), \qquad 0 < a < 1, \quad ab > 1 + \frac{3\pi}{2}$$

This is a Fourier series (a cosine series) that converges uniformly (geometric series bounds), but the sum is nowhere differentiable; in Weierstrass's original construction $b$ is an odd integer. The coefficients $a^n$ decay geometrically but the frequencies $b^n$ grow geometrically - the function has "fractal" structure with self-similar behavior at all scales.
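
A minimal sketch of the uniform convergence claim: successive partial sums agree to within the geometric tail bound $a^N/(1-a)$, even though the limit has no derivative anywhere.

import numpy as np

a, b = 0.5, 13                           # ab = 6.5 > 1 + 3*pi/2, b odd
x = np.linspace(0.0, 1.0, 5)

for N in [5, 10, 20]:
    W = sum(a**n * np.cos(b**n * np.pi * x) for n in range(N))
    print(N, np.round(W, 6))
# Going from N to a larger M changes the values by at most a^N/(1-a):
# uniform convergence, yet the limit is nowhere differentiable.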

For AI: Nowhere-differentiable functions are the worst case for neural network optimization. The loss landscape of a neural network restricted to certain parameter subspaces can exhibit Weierstrass-like behavior - highly oscillatory, with gradients carrying no information about the global minimum. This motivates smoothing techniques (momentum as in Adam, gradient clipping) and the study of loss landscape geometry.


Appendix M: Quick Reference

M.1 Key Formulas at a Glance

$$\boxed{c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx, \quad f(x) = \sum_{n=-\infty}^\infty c_n e^{inx}}$$

$$\boxed{a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \quad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx}$$

$$\boxed{\text{Parseval: } \sum_{n=-\infty}^\infty |c_n|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi |f(x)|^2\,dx}$$

$$\boxed{\text{Shift: } c_n[f(\cdot - x_0)] = e^{-inx_0}c_n[f], \quad \text{Deriv: } c_n[f'] = in\, c_n[f]}$$

$$\boxed{S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,D_N(x-t)\,dt, \quad D_N(x) = \frac{\sin((N+\frac{1}{2})x)}{\sin(x/2)}}$$

$$\boxed{\text{RoPE: } (q_{2i}+iq_{2i+1})\,e^{im\theta_i}, \quad \theta_i = 10000^{-2i/d}}$$

$$\boxed{\text{Score}(m,n) = \operatorname{Re}\!\left[\sum_i (q_{2i}+iq_{2i+1})\,\overline{(k_{2i}+ik_{2i+1})}\, e^{i(m-n)\theta_i}\right]}$$

M.2 Decision Tree: Which Tool to Use

Is your signal periodic?
+-- YES -> Fourier series (Section 20-01)
|   +-- Need to compute? -> DFT/FFT (Section 20-03)
|   +-- Need to convolve? -> Convolution theorem (Section 20-04)
+-- NO -> Fourier transform (Section 20-02)
    +-- Bandlimited? -> Sampling theorem
    +-- Need time localization? -> Wavelets (Section 20-05)
    +-- Aperiodic transient? -> STFT (Section 20-03)

Is your signal discrete?
+-- YES -> DFT/FFT (Section 20-03)
+-- NO -> Fourier transform (Section 20-02)

Do you need time AND frequency localization?
+-- YES -> Wavelets (Section 20-05)

Appendix N: Fourier Series and Modern Transformers - Extended Analysis

N.1 YaRN and NTK-Aware Frequency Scaling

RoPE with base $10000$, trained at a context window of $\sim 4096$ tokens, degrades rapidly beyond the trained window. Several methods have been proposed to extend it:

Linear Position Interpolation (PI) - Chen et al., 2023. Scale all positions by the context extension ratio $s = L_{\text{new}}/L_{\text{orig}}$: position $m$ becomes $m/s$. This compresses the angles so that $m\theta_i/s$ stays within the trained range. Equivalently, reduce all frequencies: $\theta_i \to \theta_i/s$.

Issue with linear interpolation: every frequency is compressed by the same factor $s$. The highest-frequency components (large $\theta_i$, which rotate rapidly and distinguish adjacent tokens) lose the most resolution: after compression they can no longer separate neighboring positions, so high-frequency positional information degrades.

NTK-RoPE - Bloc97, 2023. Scale the base rather than the positions: $10000 \to 10000 \cdot s^{d/(d-2)}$. This modifies low-frequency components much more than high-frequency ones - a non-uniform scaling motivated by NTK analysis.

YaRN - Peng et al., 2023. Interpolates frequencies non-uniformly, according to how fast each dimension rotates within the trained context: (i) high-frequency dimensions (many full rotations per context window) are left unchanged, i.e., they extrapolate; (ii) intermediate dimensions get a linear blend of the two regimes; (iii) low-frequency dimensions get full position interpolation. The boundaries between regimes are set by per-model tuned parameters $\alpha, \beta$. YaRN achieves strong performance at 128K context while fine-tuning on roughly 0.1% of the original pre-training data.
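
A minimal sketch (illustrative helper names, not any library's API) of how these scaling rules act on the RoPE frequency ladder:

import numpy as np

def rope_freqs(d, base=10000.0):
    i = np.arange(d // 2)
    return base ** (-2 * i / d)                  # theta_i, fast to slow

d, s = 128, 4.0                                  # head dim, extension ratio
theta = rope_freqs(d)

theta_pi  = theta / s                            # linear position interpolation
theta_ntk = rope_freqs(d, base=10000.0 * s ** (d / (d - 2)))  # NTK-aware

print(theta_pi[0] / theta[0])      # = 1/s for every i: uniform compression
print(theta_ntk[0] / theta[0])     # ~1.0: highest frequency untouched
print(theta_ntk[-1] / theta[-1])   # ~1/s: lowest frequency fully compressed
# NTK-aware scaling interpolates between these extremes; YaRN makes the
# split explicit with its per-dimension ramp.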

The Fourier interpretation: All these methods are attempting to find the right frequency allocation - which Fourier basis vectors should be responsible for which positional scales. The ideal is to have each frequency range $[1/(2\theta_i), 1/(2\theta_{i+1})]$ cover roughly equal logarithmic bandwidth, so that no scale is under- or over-represented.

N.2 Frequency Analysis of Layer Norms

LayerNorm as a high-pass filter. LayerNorm subtracts the mean and divides by standard deviation:

$$\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \bar{\mathbf{x}}}{\sigma_{\mathbf{x}}}$$

The mean subtraction is DC removal in signal-processing terms: it removes the $n = 0$ (constant) Fourier component of the vector being normalized. The division by $\sigma_{\mathbf{x}}$ is a gain normalization.

Implication: After LayerNorm, the token representation has zero mean. This concentrates the representational capacity in the non-DC Fourier components, reinforcing the tendency of transformers to represent relative (not absolute) positional information.

RMSNorm (used in LLaMA) only divides by the RMS (root mean square), without mean subtraction. It does not remove the DC component - preserving more absolute magnitude information. The trade-off usually cited: slightly less stable training vs. a richer representation.
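
A minimal sketch contrasting the two, treating a single feature vector as the signal being normalized:

import numpy as np

x = np.random.randn(64) + 3.0            # features with a large DC offset

ln  = (x - x.mean()) / x.std()           # LayerNorm (no learned affine)
rms = x / np.sqrt(np.mean(x**2))         # RMSNorm

print(abs(np.fft.fft(ln)[0]))            # ~0: DC (n=0) component removed
print(abs(np.fft.fft(rms)[0]))           # large: DC component preserved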

N.3 Fourier Analysis of Embedding Matrices

What does the Fourier transform of a token embedding matrix reveal?

Let $E \in \mathbb{R}^{V \times d}$ be the embedding matrix for vocabulary size $V$ and dimension $d$. For a fixed dimension $j$, the vector $e_j = E_{:,j} \in \mathbb{R}^V$ gives the $j$-th coordinate of all token embeddings.

If we sort tokens by frequency rank (most common first) and look at eje_j as a function of token rank, we can ask: what is its Fourier transform?

Empirical finding (2024): The embedding vectors of tokens sorted by frequency show 1/f-like spectra in dimension $j$ vs. frequency rank - consistent with the Zipf distribution of token frequencies. This suggests that the embedding space has a natural "frequency ordering" that mirrors the frequency distribution of the training corpus.
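
A minimal sketch of the measurement itself (a random matrix stands in for a trained embedding here; with a real checkpoint, the claim above predicts a 1/f-like decay instead of the flat spectrum this placeholder produces):

import numpy as np

V, d = 8192, 64
E = np.random.randn(V, d)        # placeholder: substitute a trained matrix,
                                 # with rows sorted by token frequency rank
j = 0
e_j = E[:, j]                    # j-th embedding coordinate for every token
spectrum = np.abs(np.fft.rfft(e_j))**2
print(spectrum[:5] / spectrum.mean())    # flat (white) for this placeholder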

For interpretability: The Fourier components of the embedding matrix at different "token frequencies" correspond to different semantic clusters. Low-frequency components (affecting the most common tokens) encode universal semantic primitives (noun/verb, positive/negative). High-frequency components encode rare, specific meanings.

N.4 The Fourier Perspective on Attention Sinks

The "attention sink" phenomenon (Xiao et al., 2023): In all transformer models, the first token (often <BOS> or the period .) receives disproportionate attention from most heads, even when semantically irrelevant.

Fourier explanation: The attention sink is the DC component (zero frequency) of the attention pattern. In signal processing, a low-pass filter preserves the DC component. If attention heads are implicitly low-pass filtering the sequence, they must "send" the removed energy somewhere - and the first token becomes the "drain" for this energy.

StreamingLLM (Xiao et al., 2023): By always keeping the first 4 tokens in the KV cache (the "attention sinks"), models can be extended to arbitrarily long sequences without performance degradation. This is a practical application of the Fourier insight: preserve the DC reference point.

N.5 Complete Worked Example: Sinusoidal PE Computation

Let $d = 8$ (small for illustration) and positions $\{0, 1, 2, 3, 4\}$.

Frequencies: $\omega_i = 10000^{-2i/8}$ for $i = 0, 1, 2, 3$:

  • $\omega_0 = 1.0000$
  • $\omega_1 = 10000^{-1/4} \approx 0.1000$
  • $\omega_2 = 10000^{-1/2} \approx 0.0100$
  • $\omega_3 = 10000^{-3/4} \approx 0.0010$

Positional encoding matrix $\text{PE} \in \mathbb{R}^{5 \times 8}$ (code to reproduce it follows the table):

| pos | sin(pos·1.0) | cos(pos·1.0) | sin(pos·0.1) | cos(pos·0.1) | sin(pos·0.01) | cos(pos·0.01) | sin(pos·0.001) | cos(pos·0.001) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.100 | 0.995 | 0.010 | 1.000 | 0.001 | 1.000 |
| 2 | 0.909 | -0.416 | 0.199 | 0.980 | 0.020 | 1.000 | 0.002 | 1.000 |
| 3 | 0.141 | -0.990 | 0.296 | 0.955 | 0.030 | 1.000 | 0.003 | 1.000 |
| 4 | -0.757 | -0.654 | 0.389 | 0.921 | 0.040 | 1.000 | 0.004 | 1.000 |
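
The table can be reproduced with a few lines of NumPy (a minimal sketch):

import numpy as np

d, n_pos = 8, 5
pos = np.arange(n_pos)[:, None]                   # positions 0..4, as a column
omega = 10000.0 ** (-2 * np.arange(d // 2) / d)   # the four frequencies

pe = np.zeros((n_pos, d))
pe[:, 0::2] = np.sin(pos * omega)                 # even dims: sin
pe[:, 1::2] = np.cos(pos * omega)                 # odd dims: cos
print(np.round(pe, 3))                            # matches the table above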

Observations:

  • Dimensions 0-1 ($\omega_0 = 1$): oscillate rapidly - a full cycle every $2\pi \approx 6.28$ positions. Distinguishes adjacent tokens.
  • Dimensions 2-3 ($\omega_1 = 0.1$): oscillate at 1/10th the rate. Distinguishes tokens up to ~63 apart.
  • Dimensions 4-5 ($\omega_2 = 0.01$): very slow oscillation. Distinguishes tokens up to ~628 apart.
  • Dimensions 6-7 ($\omega_3 = 0.001$): extremely slow. Distinguishes tokens up to ~6283 apart.

This is the Fourier series multi-resolution strategy: use harmonics at multiple scales to encode position information at all scales simultaneously. Each dimension pair $(2i, 2i+1)$ is a Fourier basis function at a different frequency - the entire positional encoding is a multi-frequency Fourier expansion of the position index.


Appendix O: Further Reading and References

O.1 Primary References

Textbooks (Classical):

  • Körner, T.W. (1988). Fourier Analysis. Cambridge University Press. - The most readable rigorous treatment; includes Gibbs phenomenon, convergence theory, and applications to number theory.
  • Katznelson, Y. (2004). An Introduction to Harmonic Analysis, 3rd ed. Cambridge. - Definitive graduate reference.
  • Stein, E. & Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton. - The multidimensional extension.
  • Tolstov, G.P. (1962). Fourier Series (Silverman translation). Dover. - Inexpensive, detailed, complete derivations of all standard results.

Textbooks (Applied):

  • Oppenheim, A. & Willsky, A. (1997). Signals and Systems, 2nd ed. Prentice Hall. - Standard engineering reference for Fourier methods in signal processing.
  • Bracewell, R.N. (2000). The Fourier Transform and Its Applications, 3rd ed. McGraw-Hill. - Intuitive engineering treatment with excellent visualizations.

O.2 Key ML Papers Using Fourier Methods

| Paper | Year | Contribution | Fourier connection |
| --- | --- | --- | --- |
| Vaswani et al., "Attention Is All You Need" | 2017 | Transformer; sinusoidal PE | Fourier basis for position |
| Rahimi & Recht, "Random Features for Large-Scale Kernel Machines" | 2007 | Random Fourier features | Bochner's theorem |
| Rahaman et al., "On the Spectral Bias of Neural Networks" | 2019 | Spectral bias phenomenon | NTK Fourier decomposition |
| Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" | 2021 | RoPE | Complex Fourier rotation |
| Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" | 2021 | FNet | DFT replaces attention |
| Sitzmann et al., "Implicit Neural Representations with Periodic Activations" | 2020 | SIREN | Fourier basis activations |
| Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" | 2020 | Fourier feature encoding | NeRF; overcomes spectral bias |
| Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces" | 2021 | S4 model | Frequency-domain parameterization |
| Peng et al., "YaRN: Efficient Context Window Extension of LLMs" | 2023 | Extended context | RoPE frequency scaling |
| Xiao et al., "Efficient Streaming Language Models with Attention Sinks" | 2023 | StreamingLLM | DC component preservation |

O.3 Connections to Other Sections in This Curriculum

| Topic | Where to find it | Connection to Fourier series |
| --- | --- | --- |
| $L^2$ Hilbert spaces, completeness | Section 12-02 Hilbert Spaces | The abstract framework for Fourier series |
| Inner products on function spaces | Section 12-01 Normed Spaces | The $\langle f,g \rangle$ that defines Fourier coefficients |
| Kernel methods, RKHS | Section 12-03 Kernel Methods | Bochner's theorem; random Fourier features |
| Complex numbers, series | Section 01 Mathematical Foundations | $e^{inx}$, Euler's formula |
| Gradient descent, loss landscape | Section 08 Optimization | NTK, spectral bias |
| Spectral graph convolution | Section 11-04 Spectral Graph Theory | Graph Fourier transform |
| Probability characteristic functions | Section 06 Probability Theory | Bochner, CLT via characteristic functions |

Appendix P: Common Error Patterns in Code

When implementing Fourier series and Fourier-based methods in Python, these are the most frequent bugs:

P.1 Normalization Factor Confusion

Bug: Using np.fft.fft(x) without dividing by N.

import numpy as np

N = len(x)  # x: N samples of one period of the signal

# WRONG: NumPy's FFT convention includes no 1/N factor
c_wrong = np.fft.fft(x)        # these are N * c_n, not c_n!

# CORRECT: divide by N to get the Fourier coefficients
c_correct = np.fft.fft(x) / N

Why: NumPy's FFT convention is $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n k/N}$ (no $1/N$ factor). The mathematical Fourier series coefficient is $c_k = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n k/N}$.

P.2 Frequency Axis Misidentification

# WRONG: assuming output index k corresponds to frequency k
freqs = np.arange(N)  # Wrong!

# CORRECT: use fftfreq
freqs = np.fft.fftfreq(N, d=1.0/sample_rate)  # In Hz
# or for [-pi, pi] interval:
freqs = np.fft.fftfreq(N) * N  # integer wavenumbers -N/2 to N/2

Why: The NumPy FFT output holds the integer wavenumbers $[0, 1, \ldots, N/2-1, -N/2, \ldots, -1]$ (cycles per record), not $[0, 1, \ldots, N-1]$.

P.3 Forgetting to Handle Real FFT Symmetry

For real inputs, $c_{-n} = \overline{c_n}$ (Hermitian symmetry). The positive-frequency half is sufficient:

# For real input, use rfft (returns only the positive frequencies):
N = len(x)
c = np.fft.rfft(x) / N     # length N//2 + 1
freqs = np.fft.rfftfreq(N) # corresponding frequencies

# To recover the full spectrum, remember to double all |c_n|^2 for n > 0:
power_onesided = np.abs(c)**2
power_onesided[1:-1] *= 2   # Double all except DC and Nyquist

P.4 Aliasing from Insufficient Sampling

# DANGER: signal content above the Nyquist frequency gets aliased
sample_rate = 100                         # Hz
t = np.arange(100) / sample_rate          # 1 second at 100 samples/s
x = np.sin(2 * np.pi * 80 * t)            # 80 Hz signal

# Nyquist frequency = sample_rate / 2 = 50 Hz; 80 Hz is ABOVE Nyquist.
# The DFT shows a false component at 100 - 80 = 20 Hz (the alias).
# ALWAYS check: max frequency in signal < sample_rate / 2

P.5 Misinterpreting Parseval in Discrete vs. Continuous Setting

Continuous Parseval: $\frac{1}{2\pi}\int_{-\pi}^\pi |f|^2 = \sum_n |c_n|^2$

Discrete Parseval (DFT): $\frac{1}{N}\sum_{k=0}^{N-1} |x[k]|^2 = \frac{1}{N^2}\sum_{n=0}^{N-1} |X[n]|^2$, where $X = \text{DFT}(x)$.

# Check Parseval numerically:
N = 1024
x = np.random.randn(N)
X = np.fft.fft(x)
lhs = np.mean(np.abs(x)**2)        # (1/N) * sum |x[k]|^2
rhs = np.sum(np.abs(X / N)**2)     # sum |c_n|^2 = (1/N^2) * sum |X[n]|^2
assert np.isclose(lhs, rhs), f"Parseval failed: {lhs} != {rhs}"



Appendix Q: Historical Vignettes

Q.1 Fourier's Audacity and the Reception of His Theory

Joseph Fourier (1768-1830) was not primarily a mathematician. He was a physicist and administrator who, as Napoleon's prefect of Isère, oversaw the draining of malarial swamps and the construction of roads. His mathematical masterpiece grew from a practical problem: how does heat conduct through a solid body?

When Fourier presented his theory to the Paris Academy in 1807, the reaction was hostile. The paper was rejected - or rather, shelved - on the grounds that the claim "every function can be represented as a trigonometric series" was both wrong (they thought) and insufficiently rigorous (it was). The judges included Laplace, Lagrange, and Monge - the greatest mathematicians of the age.

Lagrange's objection was mathematical: a sum of smooth functions (sines and cosines) cannot converge to a discontinuous function. This was a perfectly reasonable objection in 1807, before the modern theory of pointwise vs. $L^2$ convergence had been developed. Fourier's answer - essentially, "but it does!" - was more right than Lagrange, but not for the reasons Fourier gave.

Fourier persisted. In 1822, he published his masterpiece Théorie analytique de la chaleur, which remains one of the great works of mathematical physics. His key equations - the heat equation, the Fourier series, and the Fourier integral - were all present, though not rigorously proved.

The rigorous proof came from Dirichlet in 1829. The controversy about pointwise convergence of the Fourier series of a continuous function was not finally resolved until Carleson in 1966 - 159 years after Fourier's original claim. Mathematics sometimes takes time to catch up with correct intuitions.

Q.2 Dirichlet's Proof: The Birth of Rigorous Analysis

Peter Gustav Lejeune Dirichlet (1805-1859) was a student of Fourier's. His 1829 paper giving the first rigorous proof of pointwise convergence is often cited as one of the founding documents of modern analysis - not just because of the result, but because of the method.

Dirichlet's proof introduced two innovations that transformed mathematics:

  1. The modern definition of function. Before Dirichlet, a "function" meant something expressible by a formula. Dirichlet explicitly worked with functions that could behave differently on different parts of their domain, introducing what we now recognize as the modern concept. His paper is one of the first places where the contemporary idea of function (as any rule assigning outputs to inputs) appears clearly.

  2. Careful convergence arguments. Dirichlet's proof carefully tracked when limits could be exchanged and when they could not - a concern that Cauchy had introduced but that Dirichlet applied systematically to Fourier series. This careful accounting of convergence, rather than treating series as formal objects, is the hallmark of 19th-century rigor.

In this sense, Fourier series didn't just give us a useful computational tool. They provoked the development of the rigorous analysis that underlies all of modern mathematics and machine learning theory.

Q.3 Carleson's Theorem: The 159-Year-Old Problem

For well over a century after Dirichlet's theorem (1829), the pointwise convergence question remained open in successively sharper forms:

Does the Fourier series of every continuous function converge pointwise everywhere?

  • 1873: Du Bois-Reymond showed NO for continuous functions at one point
  • 1923: Kolmogorov showed NO for $L^1$ functions everywhere (!)
  • But for the natural space $L^2$...?

In 1966, Lennart Carleson proved: YES, the Fourier series of every $L^2$ function converges pointwise almost everywhere. This was one of the most celebrated theorems in 20th-century analysis. Carleson received the Abel Prize in 2006, with this theorem among the centerpiece contributions cited.

The proof uses a remarkable decomposition of the Dirichlet kernel into "tiles" in the time-frequency plane, combined with a sophisticated stopping-time argument. It is considered one of the most difficult proofs in all of harmonic analysis - spanning 30+ pages of dense mathematics.

For our purposes, the practical upshot is: every $L^2$ function (which includes every function we encounter in ML) has its Fourier series converging pointwise almost everywhere. The theoretical pathologies of convergence failure are measure-zero exceptions.


Appendix R: Self-Assessment Checklist

Use this checklist to verify understanding before moving to Section 20-02.

R.1 Computational Mastery

  • Can you compute the Fourier coefficients of the square wave, sawtooth, and triangle wave from scratch without looking them up?
  • Can you use the differentiation theorem ($c_n[f'] = in\, c_n[f]$) to derive the triangle wave coefficients from the square wave coefficients?
  • Can you apply Parseval's identity to evaluate $\sum_{n=1}^\infty 1/n^4$ (via the Fourier series of $x^2$)?
  • Can you identify the decay rate of Fourier coefficients ($O(1/n)$, $O(1/n^2)$, exponential) from a description of the function?
  • Can you write the partial sum $S_N f$ as a convolution with the Dirichlet kernel?

R.2 Theoretical Understanding

  • Can you state Dirichlet's theorem and identify what happens at a jump discontinuity?
  • Can you explain the Gibbs phenomenon and why it persists at any truncation order?
  • Can you state Parseval's identity and explain its physical meaning (energy conservation)?
  • Can you explain why $L^2$ convergence is the "correct" convergence for Fourier series?
  • Can you explain why the Fejér kernel (Cesàro means) fixes the Gibbs phenomenon while the Dirichlet kernel doesn't?
  • Can you explain the Riemann-Lebesgue lemma and its consequence: smooth signals have decaying Fourier coefficients?

R.3 AI Connections

  • Can you derive the sinusoidal PE formula from the Fourier basis perspective?
  • Can you explain RoPE in terms of complex Fourier multiplication and the shift theorem?
  • Can you explain the spectral bias phenomenon and its consequences for neural network training?
  • Can you explain why Random Fourier Features approximate kernel methods and how Bochner's theorem is involved?
  • Can you explain the attention sink phenomenon using Fourier analysis (DC component)?

R.4 Connections to Other Sections

  • Do you understand that the Fourier Transform (Section 20-02) is the $T \to \infty$ limit of the Fourier series?
  • Do you understand that the DFT (Section 20-03) is the Fourier series for a finite discrete sequence?
  • Do you understand that the convolution theorem (Section 20-04) is a consequence of Fourier series orthogonality ($c_n[f*g] = c_n[f] \cdot c_n[g]$)?
  • Do you understand that wavelets (Section 20-05) overcome the time-localization failure of Fourier series?

If you cannot answer YES to all items in R.1 and R.2, and at least 4 items in R.3 and R.4, review the corresponding sections before proceeding.




Appendix S: Summary of Key Results

S.1 The Core Theorems

Theorem 1 (Dirichlet, 1829). If $f$ is $2\pi$-periodic and piecewise smooth, then for every $x$:

$$\sum_{n=-N}^N c_n e^{inx} \;\to\; \frac{f(x^+) + f(x^-)}{2} \quad \text{as } N \to \infty.$$

Theorem 2 ($L^2$ convergence). For every $f \in L^2[-\pi,\pi]$:

$$\left\lVert f - \sum_{|n|\leq N} c_n e^{inx} \right\rVert_{L^2} \to 0 \quad \text{as } N \to \infty.$$

Theorem 3 (Parseval-Bessel). For $f \in L^2[-\pi,\pi]$:

$$\sum_{n=-\infty}^\infty |c_n|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi |f(x)|^2 \, dx.$$

Theorem 4 (Fejér, 1900). For every $f \in C[-\pi,\pi]$ (continuous and $2\pi$-periodic):

$$\frac{1}{N+1}\sum_{k=0}^N S_k f \to f \quad \text{uniformly as } N \to \infty.$$

Theorem 5 (Riemann-Lebesgue). For $f \in L^1[-\pi,\pi]$:

$$c_n[f] \to 0 \quad \text{as } |n| \to \infty.$$

Theorem 6 (Carleson, 1966). For every $f \in L^2[-\pi,\pi]$:

$$\sum_{|n|\leq N} c_n e^{inx} \to f(x) \quad \text{for almost every } x \in [-\pi,\pi].$$

S.2 The Most Important Formula

Among all the formulas in this section, one stands out as the most consequential for both mathematics and AI:

$$\boxed{c_n[f] = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx \qquad \longleftrightarrow \qquad f(x) = \sum_{n=-\infty}^{\infty} c_n\, e^{inx}}$$

Every other result in this section - convergence theorems, Parseval, the shift property, RoPE, spectral bias - is a consequence or application of this pair.
