Fourier Series: Appendix J: Fourier Series in Diverse Mathematical Contexts to Appendix S: Summary of Key Results
Appendix J: Fourier Series in Diverse Mathematical Contexts
J.1 Fourier Series and Number Theory
The connection between Fourier analysis and number theory is profound and historically important.
Dirichlet characters. For a modulus $q$, a Dirichlet character $\chi$ is a completely multiplicative arithmetic function with period $q$ (and $\chi(n) = 0$ when $\gcd(n, q) > 1$). These are precisely the characters of the group $(\mathbb{Z}/q\mathbb{Z})^\times$. Dirichlet's proof that there are infinitely many primes in any arithmetic progression $a, a+q, a+2q, \dots$ with $\gcd(a, q) = 1$ uses Fourier analysis on the group $(\mathbb{Z}/q\mathbb{Z})^\times$ - a finite version of Fourier series.
The Poisson summation formula. For a "nice" (e.g. Schwartz-class) function $f$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{k=-\infty}^{\infty} \hat{f}(k), \qquad \hat{f}(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i k x}\, dx$$
This connects sums over integers to sums over their Fourier transforms. Applications include:
- Jacobi theta functions: $\theta(t) = \sum_{n \in \mathbb{Z}} e^{-\pi n^2 t}$ satisfies the functional equation $\theta(1/t) = \sqrt{t}\,\theta(t)$, obtained from Poisson summation applied to the Gaussian
- Riemann zeta function: The functional equation relating $\zeta(s)$ to $\zeta(1-s)$ is proved via Poisson summation
- Quadratic forms: The number of ways to represent an integer as a sum of squares is expressed via theta functions
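The theta functional equation is easy to verify numerically. A minimal sketch (the truncation `n_max = 50` and the evaluation point `t = 0.7` are arbitrary choices; the terms decay like $e^{-\pi n^2 t}$, so far fewer terms would suffice):

```python
import numpy as np

def theta(t, n_max=50):
    """Truncated Jacobi theta function: sum of exp(-pi * n^2 * t) over |n| <= n_max."""
    n = np.arange(-n_max, n_max + 1)
    return np.sum(np.exp(-np.pi * n**2 * t))

t = 0.7
lhs = theta(1.0 / t)            # theta(1/t)
rhs = np.sqrt(t) * theta(t)     # sqrt(t) * theta(t)
assert abs(lhs - rhs) < 1e-12   # the functional equation holds to machine precision
```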
For AI: Poisson summation appears in the theory of neural network generalization. The neural tangent kernel at infinite width is related to the Fourier transform of the network architecture, and bounds on generalization involve the Poisson summation formula applied to the empirical Fourier spectrum of the training data.
J.2 Fourier Series on Non-Euclidean Spaces
Fourier analysis on groups. The Fourier series on $[0, 2\pi)$ is really harmonic analysis on the circle group $\mathbb{T} = \mathbb{R}/2\pi\mathbb{Z}$. The generalization to a compact group $G$ is the Peter-Weyl expansion:

$$f(g) = \sum_{\rho \in \hat{G}} d_\rho \, \mathrm{tr}\!\left(\hat{f}(\rho)\, \rho(g)\right)$$

where $\hat{G}$ is the set of irreducible unitary representations of $G$ and $d_\rho$ is the dimension of representation $\rho$. For $G = \mathbb{T}$: all irreducible representations are 1-dimensional ($\rho_n(\theta) = e^{in\theta}$), and we recover the ordinary Fourier series.
For AI - Equivariant networks: Group-equivariant neural networks (Cohen & Welling, 2016) build in symmetries (rotations, reflections) by working with group-Fourier analysis. A convolutional network on $\mathbb{Z}^2$ is equivariant to translations; lifting to $SE(2)$ (rigid motions) gives equivariance to rotation and translation simultaneously. The mathematical foundation is the group Fourier transform - Fourier series on a non-abelian group.
Fourier analysis on graphs. The graph Laplacian $L = D - A$ has orthonormal eigenvectors $u_1, \dots, u_N$ with eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_N$. The graph Fourier transform of a signal $x \in \mathbb{R}^N$ is $\hat{x} = U^\top x$, where $U = [u_1, \dots, u_N]$. Graph convolution (used in GCN, ChebNet - covered in Section 11-04) is multiplication in this graph Fourier domain:

$$g \star x = U\, \hat{g}(\Lambda)\, U^\top x$$

This is exactly the convolution theorem applied to graphs, replacing the standard Fourier basis $e^{int}$ with the Laplacian eigenvectors $u_k$.
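The whole pipeline fits in a few lines of NumPy. In the sketch below, the path graph and the heat-kernel filter $\hat{g}(\lambda) = e^{-\lambda}$ are illustrative choices, not from the text:

```python
import numpy as np

np.random.seed(0)

# Path graph on 5 nodes: adjacency A and combinatorial Laplacian L = D - A
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)       # eigenvalues lambda_k, eigenvectors u_k (columns of U)

x = np.random.randn(5)           # a signal on the graph nodes
x_hat = U.T @ x                  # graph Fourier transform
y = U @ (np.exp(-lam) * x_hat)   # low-pass filter g(lambda) = e^{-lambda}, then invert

# The eigenbasis is orthonormal, so the transform preserves energy (graph Parseval)
assert np.isclose(np.sum(x**2), np.sum(x_hat**2))
assert abs(lam[0]) < 1e-10       # connected graph: smallest eigenvalue is 0
```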
J.3 Fourier Series and Quantum Mechanics
The mathematical structure of quantum mechanics is built on Fourier analysis.
Position and momentum representations. A quantum state can be represented in the position basis as $\psi(x)$ (the wavefunction), or in the momentum basis as $\tilde{\psi}(p)$. The two representations are related by the Fourier transform:

$$\tilde{\psi}(p) = \frac{1}{\sqrt{2\pi\hbar}} \int_{-\infty}^{\infty} \psi(x)\, e^{-ipx/\hbar}\, dx$$

The Heisenberg uncertainty principle $\Delta x \, \Delta p \ge \hbar/2$ is exactly the Fourier uncertainty principle in disguise, with position playing the role of time and momentum the role of frequency.
For AI: The mathematical structure of quantum mechanics (Hilbert space, operators, eigenvalues) and the mathematical structure of attention (inner products, softmax, value aggregation) are strikingly parallel. Several researchers have proposed "quantum attention" mechanisms, and the Fourier analysis connecting position and momentum representations underlies proposals for "quantum-inspired" sequence modeling.
J.4 The Ergodic Theorem and Fourier Analysis
Birkhoff's ergodic theorem. For an ergodic measure-preserving transformation $T$ of a probability space $(X, \mu)$ and $f \in L^1(\mu)$:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} f(T^n x) = \int_X f \, d\mu \quad \text{for almost every } x$$
The time average equals the space average.
Connection to Fourier series. For the rotation $Tx = x + \alpha \pmod{2\pi}$ on the circle ($\alpha/2\pi$ irrational), Birkhoff's theorem gives:

$$\frac{1}{N} \sum_{n=0}^{N-1} f(x + n\alpha) \to \frac{1}{2\pi} \int_0^{2\pi} f(t)\, dt$$

In Fourier space: $\frac{1}{N} \sum_{n=0}^{N-1} e^{ik(x + n\alpha)} = e^{ikx} \cdot \frac{1}{N} \sum_{n=0}^{N-1} e^{ikn\alpha}$. For irrational $\alpha/2\pi$ and $k \neq 0$: $\frac{1}{N} \sum_{n=0}^{N-1} e^{ikn\alpha} \to 0$ (equidistribution) -> only the $k = 0$ term survives -> the time average equals $c_0$.
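The equidistribution argument can be checked numerically. A sketch, using the golden-ratio rotation (an illustrative choice of irrational $\alpha/2\pi$) and $f(x) = \cos x$, whose space average is $c_0 = 0$:

```python
import numpy as np

alpha = 2 * np.pi * (np.sqrt(5) - 1) / 2   # rotation angle; alpha/2pi is irrational
x0 = 0.3                                   # starting point (arbitrary)
N = 100_000

n = np.arange(N)
time_avg = np.mean(np.cos(x0 + n * alpha)) # Birkhoff time average of f(x) = cos(x)

# The space average (1/2pi) * integral of cos(t) dt over the circle is 0
assert abs(time_avg) < 1e-2
```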
For AI: The ergodic theorem underpins the mixing of training data. In language model training, a sequence of tokens is generated by a stationary stochastic process (natural language). Ergodicity says that sampling from a long enough sequence gives a representative sample of the distribution - justifying the use of random windowed crops for training, rather than requiring the full corpus at each step.
Appendix K: Connections Between Sections
K.1 Why This Section is the Foundation
All subsequent sections in this chapter build on the foundations laid here:
-> Section 20-02 (Fourier Transform): Take the period $T \to \infty$ in the Fourier series. The discrete sum $\sum_n c_n e^{2\pi i n t / T}$ becomes the integral $\int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$. Every property of the Fourier series has an analogue in the Fourier transform - the derivations are nearly identical. The key new feature is the continuous spectrum and the $L^2(\mathbb{R})$ setting.
-> Section 20-03 (DFT and FFT): Discretize the Fourier transform: sample at $N$ points and compute $N$ frequency coefficients. The DFT is the Fourier series applied to the sample sequence viewed as a periodic sequence of period $N$. The FFT is the efficient $O(N \log N)$ algorithm for computing it.
-> Section 20-04 (Convolution Theorem): The property $\widehat{f * g} = \hat{f} \cdot \hat{g}$ (convolution in time = multiplication in frequency) is the single most important theorem in signal processing. It follows directly from the orthogonality of the Fourier basis and is proved here in the series setting; the full treatment with applications to CNNs is in Section 20-04.
-> Section 20-05 (Wavelets): Wavelets are a replacement for Fourier series that provides both time and frequency localization. The Haar wavelet is a piecewise constant basis - like the Fourier basis, but localized in time. The mathematical framework of multiresolution analysis is the generalization of the Fourier series framework (nested subspaces, orthonormal bases) to time-localized expansions.
K.2 What to Read Next
After this section, you have two natural paths:
Path A - Depth first: Continue with Section 20-02 (Fourier Transform) to see how the theory extends to aperiodic signals and the full real line. This path eventually leads to the uncertainty principle and the deeper function space theory.
Path B - Applications first: Jump to Section 20-04 (Convolution Theorem) to see immediately how Fourier series powers CNNs and sequence models. The convolution theorem proof requires the Fourier transform (Section 20-02), so read that first.
For the AI-focused reader: The most direct path from this section to ML applications is:
- Section 20-01 (this section) - Fourier series fundamentals
- Section 20-03 (DFT/FFT) - computational implementation
- Section 20-04 (Convolution Theorem) - CNNs and sequence models
- Section 20-02 (Fourier Transform) - theory for spectral methods
- Section 20-05 (Wavelets) - multi-scale models and compression
Appendix L: The Fourier Series in Numerical Practice
L.1 Accuracy Benchmarks: Theoretical vs. Observed
When implementing Fourier series approximations numerically, understanding the gap between theoretical and observed accuracy is essential.
Experiment: Square wave approximation accuracy.
For the square wave with $N$-term Fourier series $S_N f$:
- Theoretical error: $\|f - S_N f\|_{L^2} = O(N^{-1/2})$ (coefficients decay like $1/n$)
- Observed: Plot $\|f - S_N f\|_{L^2}$ vs $N$ on a log-log scale. Slope should be $-1/2$, confirming $O(N^{-1/2})$ decay.
- At the jump: overshoot of about 9% of the jump height, regardless of $N$ (Gibbs)
Experiment: Triangle wave approximation accuracy.
For the triangle wave, whose nonzero coefficients decay like $1/n^2$ (odd $n$ only):
- Theoretical error: $\|f - S_N f\|_{L^2} = O(N^{-3/2})$
- Observed: slope $-3/2$ on the log-log plot, confirming $O(N^{-3/2})$ decay
Experiment: Smooth function accuracy.
For an analytic $2\pi$-periodic function:
- Theoretical: $|c_n| \le C e^{-a|n|}$ -> $\|f - S_N f\|_{L^2} = O(e^{-aN})$
- Observed: semi-log plot of $\|f - S_N f\|_{L^2}$ vs $N$ is linear (exponential decay)
- Essentially machine-precision accuracy for modest $N$ (a few dozen terms)
Takeaway: The rule of thumb is "smooth = exponentially good, nonsmooth = algebraically poor." For discontinuous signals, truncated Fourier series are often a poor choice - JPEG-style block transforms or wavelets are better.
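The square wave experiment above can be sketched in a few lines, using the standard series $f(t) = \frac{4}{\pi} \sum_{\text{odd } n} \frac{\sin nt}{n}$ (the grid size 4096 and the range of $N$ are arbitrary choices):

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
f = np.sign(np.sin(t))                       # square wave

def partial_sum(N):
    """N-term truncation of the square wave's Fourier series (odd harmonics only)."""
    S = np.zeros_like(t)
    for n in range(1, N + 1, 2):
        S += (4 / (np.pi * n)) * np.sin(n * t)
    return S

Ns = np.array([8, 16, 32, 64, 128])
errs = np.array([np.sqrt(np.mean((f - partial_sum(N))**2)) for N in Ns])

# Fit log(err) against log(N): the slope should be close to -1/2
slope = np.polyfit(np.log(Ns), np.log(errs), 1)[0]
assert -0.7 < slope < -0.3
```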
L.2 The Fast Fourier Transform Preview
The DFT (fully treated in Section 20-03) computes the Fourier coefficients of a finite sequence - the same quantities as the Fourier series formula, up to quadrature error - in $O(N \log N)$ time via the FFT.
The connection: For $N$ samples $f(t_j)$ at equispaced points $t_j = 2\pi j / N$:

$$X_k = \sum_{j=0}^{N-1} f(t_j)\, e^{-2\pi i jk/N}$$

Comparing with $c_k = \frac{1}{2\pi} \int_0^{2\pi} f(t)\, e^{-ikt}\, dt$ (rectangle rule with step $\Delta t = 2\pi/N$): $c_k \approx \frac{1}{N} X_k$.
So the DFT is just the Fourier series computation via a rectangle-rule quadrature, scaled by $N$.
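For a trigonometric polynomial the rectangle rule is exact, so the scaled DFT recovers the Fourier series coefficients exactly - a quick check (the test signal is an arbitrary choice):

```python
import numpy as np

N = 64
t = 2 * np.pi * np.arange(N) / N
x = 3.0 + 2 * np.cos(5 * t) + 4 * np.sin(3 * t)

c = np.fft.fft(x) / N               # scaled DFT = Fourier series coefficients
assert np.isclose(c[0], 3.0)        # DC term
assert np.isclose(c[5], 1.0)        # 2*cos(5t) = e^{i5t} + e^{-i5t}
assert np.isclose(c[3], -2.0j)      # 4*sin(3t) = (4/2i)(e^{i3t} - e^{-i3t})
```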
The FFT (Cooley-Tukey, 1965) computes all $N$ DFT values in $O(N \log N)$ operations instead of the naive $O(N^2)$. For $N = 2^{20} \approx 10^6$: the naive DFT requires $\approx 10^{12}$ operations; the FFT requires $\approx 2 \times 10^7$ operations - a factor of ~50,000 speedup.
This speedup is what made digital signal processing, audio compression, and modern AI practically feasible. The STFT and mel-spectrogram preprocessing in Whisper, the frequency-domain convolutions in FNet, and the state-space model computations in S4/Mamba all rely on this speedup. Full treatment: Section 20-03 Discrete Fourier Transform and FFT.
L.3 Fourier Series vs. Polynomial Approximation
Both Fourier series and polynomial approximation (Taylor, Chebyshev, Legendre) approximate functions by basis expansions. Understanding when to use each is important.
| Aspect | Fourier Series | Polynomial (Chebyshev) |
|---|---|---|
| Best for | Periodic functions, global frequency analysis | Non-periodic functions on $[-1, 1]$ |
| Convergence for smooth $f$ | Exponential (analytic), algebraic ($C^k$) | Exponential (analytic), algebraic ($C^k$) |
| At discontinuities | Gibbs phenomenon (9% overshoot) | Runge phenomenon (equispaced nodes; avoided by Chebyshev nodes) |
| Complexity for $N$ terms | $O(N \log N)$ via FFT | $O(N^2)$ (or $O(N \log N)$ for Chebyshev via DCT) |
| AI applications | RoPE, spectrograms, convolutions | Polynomial activations, rational networks |
| Sparse representation | Works for bandlimited signals | Works for smooth functions |
Key insight: For periodic and global structure, use Fourier series. For localized and non-periodic structure, use wavelets (Section 20-05) or local polynomial bases. The choice of approximation basis should match the intrinsic structure of the function.
L.4 When Fourier Series Fail: Pathological Examples
Understanding the failure modes of Fourier series deepens intuition.
Example 1 - Du Bois-Reymond's construction (1876): There exists a continuous $2\pi$-periodic function whose Fourier series diverges at $t = 0$. The construction exploits the slow (logarithmic) growth of the Dirichlet kernel's $L^1$ norm, $\|D_N\|_{L^1} \sim \frac{4}{\pi^2} \log N$: build $f$ so that the partial sums $S_N f(0)$ grow without bound.
Example 2 - Kolmogorov's function (1923): There exists an $L^1$ function whose Fourier series diverges everywhere. This shows that $L^1$ is too weak a setting for pointwise Fourier convergence.
Example 3 - Weierstrass nowhere-differentiable function:

$$W(t) = \sum_{n=0}^{\infty} a^n \cos(b^n t), \qquad 0 < a < 1, \quad ab > 1 + \tfrac{3\pi}{2}, \quad b \text{ an odd integer}$$

This is a Fourier series (cosine series) that converges uniformly (geometric series bounds), but the sum is nowhere differentiable. The coefficients decay geometrically but the frequencies grow geometrically - the function has "fractal" structure with self-similar behavior at all scales.
For AI: Nowhere-differentiable functions are the worst case for gradient-based optimization. The loss landscape of a neural network restricted to certain parameter subspaces can exhibit Weierstrass-like behavior - highly oscillatory, with no local gradient information about the global minimum. This motivates smoothing mechanisms (momentum as in Adam, gradient clipping) and the study of loss landscape geometry.
Appendix M: Quick Reference
M.1 Key Formulas at a Glance
For a $2\pi$-periodic function $f$:
- Synthesis: $f(t) = \sum_{n=-\infty}^{\infty} c_n e^{int}$
- Analysis: $c_n = \frac{1}{2\pi} \int_0^{2\pi} f(t)\, e^{-int}\, dt$
- Parseval: $\frac{1}{2\pi} \int_0^{2\pi} |f(t)|^2\, dt = \sum_{n=-\infty}^{\infty} |c_n|^2$
- Convolution: $c_n(f * g) = c_n(f)\, c_n(g)$
- Shift: $c_n(f(\cdot - a)) = e^{-ina}\, c_n(f)$
- Differentiation: $c_n(f') = in\, c_n(f)$
M.2 Decision Tree: Which Tool to Use
Is your signal periodic?
+-- YES -> Fourier series (Section 20-01)
| +-- Need to compute? -> DFT/FFT (Section 20-03)
| +-- Need to convolve? -> Convolution theorem (Section 20-04)
+-- NO -> Fourier transform (Section 20-02)
+-- Bandlimited? -> Sampling theorem
+-- Need time localization? -> Wavelets (Section 20-05)
+-- Aperiodic transient? -> STFT (Section 20-03)
Is your signal discrete?
+-- YES -> DFT/FFT (Section 20-03)
+-- NO -> Fourier transform (Section 20-02)
Do you need time AND frequency localization?
+-- YES -> Wavelets (Section 20-05)
Appendix N: Fourier Series and Modern Transformers - Extended Analysis
N.1 YaRN and NTK-Aware Frequency Scaling
The original RoPE, with base $b = 10000$ and frequencies $\theta_i = b^{-2i/d}$, runs out of positional expressiveness beyond its trained context length $L_{\text{train}}$. Several methods have been proposed to extend it:
Linear Position Interpolation (PI) - Chen et al., 2023. Scale all positions by $1/s$, where $s = L_{\text{new}} / L_{\text{train}}$: position $m$ becomes $m/s$. This compresses the angles $m\theta_i$ to keep them within the trained range $[0, L_{\text{train}} \theta_i]$. Equivalently, reduce all frequencies: $\theta_i \to \theta_i / s$.
Issue with linear interpolation: The lowest-frequency components (large $i$, which rotate slowly) are barely affected in absolute terms, but the highest-frequency components (small $i$, which rotate rapidly) are compressed most aggressively. High-frequency positional information (distinguishing adjacent tokens) degrades.
NTK-RoPE - Bloc97, 2023. Scale the base rather than the positions: $b \to b \cdot s^{d/(d-2)}$. This modifies low-frequency components more than high-frequency ones - a non-uniform scaling motivated by the NTK analysis.
YaRN - Peng et al., 2023. Interpolates frequencies non-uniformly, according to how many full rotations $r_i = L_{\text{train}} \theta_i / 2\pi$ each component completes within the trained context: (i) $r_i > \beta$ (high frequencies): unchanged - these extrapolate; (ii) $\alpha \le r_i \le \beta$ (middle frequencies): a ramped blend of the two regimes; (iii) $r_i < \alpha$ (low frequencies): linear interpolation. The thresholds $\alpha, \beta$ are tuned per model. YaRN achieves strong performance at 128K context while fine-tuning on roughly 0.1% of the original pretraining data.
The Fourier interpretation: All these methods are attempting to find the right frequency allocation - which Fourier basis vectors should be responsible for which positional scales. The ideal is to have each frequency range cover roughly equal logarithmic bandwidth, so no scale is under- or over-represented.
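The two simplest schemes are one-liners on the frequency vector. A sketch, assuming the standard RoPE frequencies $\theta_i = b^{-2i/d}$ and the NTK-aware base scaling $b \to b \cdot s^{d/(d-2)}$ (the head dimension, base, and extension factor $s = 4$ are illustrative values):

```python
import numpy as np

d, b, s = 128, 10000.0, 4.0      # head dim, RoPE base, context extension factor
i = np.arange(d // 2)
theta = b ** (-2 * i / d)        # original RoPE frequencies theta_i

theta_pi = theta / s                                    # linear position interpolation
theta_ntk = (b * s ** (d / (d - 2))) ** (-2 * i / d)    # NTK-aware base scaling

# NTK scaling leaves the highest frequency (i = 0) untouched and reaches
# full interpolation at the lowest frequency - a smooth blend of the regimes.
assert np.isclose(theta_ntk[0], theta[0])
assert np.isclose(theta_ntk[-1], theta_pi[-1])
```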
N.2 Frequency Analysis of Layer Norms
LayerNorm as a high-pass filter. LayerNorm subtracts the mean and divides by the standard deviation across the feature dimension:

$$\mathrm{LN}(x)_i = \gamma_i \, \frac{x_i - \mu}{\sigma} + \beta_i, \qquad \mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \quad \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

The mean subtraction is a DC removal in signal processing terms: it removes the $k = 0$ (constant) Fourier component of the feature vector. The division by $\sigma$ is a gain normalization.
Implication: After LayerNorm, the token representation has zero mean. This concentrates the representational capacity in the non-DC Fourier components, reinforcing the tendency of transformers to represent relative (not absolute) positional information.
RMSNorm (used in LLaMA) only divides by the RMS (root mean square), without mean subtraction. It does not remove the DC component - preserving more absolute magnitude information. The trade-off: slightly less stable training vs. richer representation.
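The DC-removal contrast can be seen directly in the feature vector's DFT. A minimal sketch (hand-rolled norm functions for illustration, without the learned scale and shift parameters):

```python
import numpy as np

np.random.seed(0)

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

x = np.random.randn(64) + 5.0            # feature vector with a large DC offset

dc_ln = np.fft.fft(layer_norm(x))[0]     # k = 0 bin = sum of the entries
dc_rms = np.fft.fft(rms_norm(x))[0]

assert abs(dc_ln) < 1e-6                 # mean subtraction zeroes the DC bin
assert abs(dc_rms) > 1.0                 # RMSNorm preserves the DC component
```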
N.3 Fourier Analysis of Embedding Matrices
What does the Fourier transform of a token embedding matrix reveal?
Let $E \in \mathbb{R}^{V \times d}$ be the embedding matrix for vocabulary size $V$ and dimension $d$. For a fixed dimension $j$, the vector $E_{:,j} \in \mathbb{R}^V$ gives the $j$-th coordinate of all token embeddings.
If we sort tokens by frequency rank (most common first) and look at $E_{:,j}$ as a function of token rank, we can ask: what is its Fourier transform?
Empirical finding (2024): Embedding coordinates, viewed as functions of frequency-sorted token rank, show 1/f-like power spectra - consistent with the Zipf distribution of token frequencies. This suggests that the embedding space has a natural "frequency ordering" that mirrors the frequency distribution of the training corpus.
For interpretability: The Fourier components of the embedding matrix at different "token frequencies" correspond to different semantic clusters. Low-frequency components (affecting the most common tokens) encode universal semantic primitives (noun/verb, positive/negative). High-frequency components encode rare, specific meanings.
N.4 The Fourier Perspective on Attention Sinks
The "attention sink" phenomenon (Xiao et al., 2023): In all transformer models, the first token (often <BOS> or the period .) receives disproportionate attention from most heads, even when semantically irrelevant.
Fourier explanation: The attention sink is the DC component (zero frequency) of the attention pattern. In signal processing, a low-pass filter preserves the DC component. If attention heads are implicitly low-pass filtering the sequence, they must "send" the removed energy somewhere - and the first token becomes the "drain" for this energy.
StreamingLLM (Xiao et al., 2023): By always keeping the first 4 tokens in the KV cache (the "attention sinks"), models can be extended to arbitrarily long sequences without performance degradation. This is a practical application of the Fourier insight: preserve the DC reference point.
N.5 Complete Worked Example: Sinusoidal PE Computation
Let $d_{\text{model}} = 8$ (small for illustration), positions $\text{pos} = 0, 1, 2, 3, 4$.
Frequencies: $\omega_i = 10000^{-2i/d_{\text{model}}} = 10^{-i}$ for $i = 0, 1, 2, 3$: $\omega_0 = 1.0$, $\omega_1 = 0.1$, $\omega_2 = 0.01$, $\omega_3 = 0.001$.
Positional encoding matrix, with $PE[\text{pos}, 2i] = \sin(\text{pos} \cdot \omega_i)$ and $PE[\text{pos}, 2i+1] = \cos(\text{pos} \cdot \omega_i)$:
| pos | sin(pos·1.0) | cos(pos·1.0) | sin(pos·0.1) | cos(pos·0.1) | sin(pos·0.01) | cos(pos·0.01) | sin(pos·0.001) | cos(pos·0.001) |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.100 | 0.995 | 0.010 | 1.000 | 0.001 | 1.000 |
| 2 | 0.909 | -0.416 | 0.199 | 0.980 | 0.020 | 1.000 | 0.002 | 1.000 |
| 3 | 0.141 | -0.990 | 0.296 | 0.955 | 0.030 | 1.000 | 0.003 | 1.000 |
| 4 | -0.757 | -0.654 | 0.389 | 0.921 | 0.040 | 1.000 | 0.004 | 1.000 |
Observations:
- Dimensions 0-1 ($\omega = 1.0$): Oscillate rapidly - a full cycle every $2\pi \approx 6.3$ positions. Distinguishes adjacent tokens.
- Dimensions 2-3 ($\omega = 0.1$): Oscillate at 1/10th the rate. Distinguishes tokens up to ~63 apart.
- Dimensions 4-5 ($\omega = 0.01$): Very slow oscillation. Distinguishes tokens up to ~628 apart.
- Dimensions 6-7 ($\omega = 0.001$): Extremely slow. Distinguishes tokens up to ~6283 apart.
This is the Fourier series multi-resolution strategy: use harmonics at multiple scales to encode position information at all scales simultaneously. Each dimension pair is a Fourier basis function at a different frequency - the entire positional encoding is a multi-frequency Fourier expansion of the position index.
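The table above can be regenerated in a few lines (same conventions: even dimensions get sin, odd dimensions get cos):

```python
import numpy as np

d_model = 8
pos = np.arange(5)[:, None]                 # positions 0..4, as a column
i = np.arange(d_model // 2)                 # dimension-pair index 0..3
omega = 10000.0 ** (-2 * i / d_model)       # frequencies 1.0, 0.1, 0.01, 0.001

PE = np.zeros((5, d_model))
PE[:, 0::2] = np.sin(pos * omega)           # even dimensions
PE[:, 1::2] = np.cos(pos * omega)           # odd dimensions

print(np.round(PE, 3))                      # reproduces the table above
assert np.isclose(PE[1, 0], np.sin(1.0))    # pos=1, omega=1.0: 0.841
assert np.isclose(PE[3, 3], np.cos(0.3))    # pos=3, omega=0.1: 0.955
```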
Appendix O: Further Reading and References
O.1 Primary References
Textbooks (Classical):
- Korner, T.W. (1988). Fourier Analysis. Cambridge University Press. - The most readable rigorous treatment; includes Gibbs phenomenon, convergence theory, and applications to number theory.
- Katznelson, Y. (2004). An Introduction to Harmonic Analysis, 3rd ed. Cambridge. - Definitive graduate reference.
- Stein, E. & Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton. - The multidimensional extension.
- Tolstov, G.P. (1962). Fourier Series (Silverman translation). Dover. - Inexpensive, detailed, complete derivations of all standard results.
Textbooks (Applied):
- Oppenheim, A. & Willsky, A. (1997). Signals and Systems, 2nd ed. Prentice Hall. - Standard engineering reference for Fourier methods in signal processing.
- Bracewell, R.N. (2000). The Fourier Transform and Its Applications, 3rd ed. McGraw-Hill. - Intuitive engineering treatment with excellent visualizations.
O.2 Key ML Papers Using Fourier Methods
| Paper | Year | Contribution | Fourier connection |
|---|---|---|---|
| Vaswani et al., "Attention Is All You Need" | 2017 | Transformer; sinusoidal PE | Fourier basis for position |
| Rahimi & Recht, "Random Features for Large-Scale Kernel Machines" | 2007 | Random Fourier features | Bochner's theorem |
| Rahaman et al., "On the Spectral Bias of Neural Networks" | 2019 | Spectral bias phenomenon | NTK Fourier decomposition |
| Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" | 2021 | RoPE | Complex Fourier rotation |
| Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" | 2021 | FNet | DFT replaces attention |
| Sitzmann et al., "Implicit Neural Representations with Periodic Activations" | 2020 | SIREN | Fourier basis activations |
| Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" | 2020 | Fourier feature encoding | NeRF; overcomes spectral bias |
| Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces" | 2021 | S4 model | Frequency domain parameterization |
| Peng et al., "YaRN: Efficient Context Window Extension of LLMs" | 2023 | Extended context | RoPE frequency scaling |
| Xiao et al., "Efficient Streaming Language Models with Attention Sinks" | 2023 | StreamingLLM | DC component preservation |
O.3 Connections to Other Sections in This Curriculum
| Topic | Where to find it | Connection to Fourier Series |
|---|---|---|
| Hilbert spaces, completeness | Section 12-02 Hilbert Spaces | The abstract framework for Fourier series |
| Inner products on function spaces | Section 12-01 Normed Spaces | The inner product $\langle f, g \rangle = \frac{1}{2\pi} \int_0^{2\pi} f \bar{g}\, dt$ that defines Fourier coefficients |
| Kernel methods, RKHS | Section 12-03 Kernel Methods | Bochner's theorem; RFFs |
| Complex numbers, series | Section 01 Mathematical Foundations | $e^{i\theta} = \cos\theta + i\sin\theta$, Euler's formula |
| Gradient descent, loss landscape | Section 08 Optimization | NTK, spectral bias |
| Spectral graph convolution | Section 11-04 Spectral Graph Theory | Graph Fourier transform |
| Probability characteristic functions | Section 06 Probability Theory | Bochner, CLT via characteristic functions |
Appendix P: Common Error Patterns in Code
When implementing Fourier series and Fourier-based methods in Python, these are the most frequent bugs:
P.1 Normalization Factor Confusion
Bug: Using np.fft.fft(x) without dividing by N.
# WRONG: np.fft.fft omits the 1/N factor
c_wrong = np.fft.fft(x)  # These are N * c_n, not c_n!
# CORRECT: divide by N to get the Fourier series coefficients
c_correct = np.fft.fft(x) / N
Why: NumPy's FFT convention is $X_k = \sum_{j=0}^{N-1} x_j e^{-2\pi i jk/N}$ (no $1/N$ factor). The mathematical Fourier series coefficient is $c_k = X_k / N$ - the rectangle-rule approximation of $\frac{1}{2\pi} \int_0^{2\pi} f(t)\, e^{-ikt}\, dt$.
P.2 Frequency Axis Misidentification
# WRONG: assuming output index k corresponds to frequency k
freqs = np.arange(N)  # indices above N/2 are actually negative frequencies!
# CORRECT: use fftfreq
freqs = np.fft.fftfreq(N, d=1.0/sample_rate)  # in Hz
# or for integer wavenumbers on a 2*pi-periodic grid:
freqs = np.fft.fftfreq(N) * N  # integers -N/2 ... N/2 - 1, in FFT order
Why: NumPy FFT output index $k$ corresponds to frequency $k/N$ cycles/sample for $k < N/2$, and to the negative frequency $(k - N)/N$ cycles/sample for $k \ge N/2$ - not to frequency $k$.
P.3 Forgetting to Handle Real FFT Symmetry
For real inputs, $X_{N-k} = \overline{X_k}$ (Hermitian symmetry). The positive-frequency half is sufficient:
# For real input, use rfft (returns only positive frequencies):
c = np.fft.rfft(x) / N # Length N//2 + 1
freqs = np.fft.rfftfreq(N) # Corresponding frequencies
# To recover the full spectrum, remember to double all |c_n|^2 for n > 0:
power_onesided = np.abs(c)**2
power_onesided[1:-1] *= 2 # Double all except DC and Nyquist
P.4 Aliasing from Insufficient Sampling
# DANGER: signal with frequency content above Nyquist
import numpy as np
fs = 100                            # 100 samples per second
t = np.arange(fs) / fs              # one second of samples
x = np.sin(2 * np.pi * 80 * t)      # 80 Hz signal
# Nyquist frequency = fs/2 = 50 Hz; 80 Hz is ABOVE Nyquist
# The DFT shows a false component: 80 Hz aliases to fs - 80 = 20 Hz
# ALWAYS check: max frequency in signal < sample_rate / 2
P.5 Misinterpreting Parseval in Discrete vs. Continuous Setting
Continuous Parseval: $\frac{1}{2\pi} \int_0^{2\pi} |f(t)|^2\, dt = \sum_{n=-\infty}^{\infty} |c_n|^2$.
Discrete Parseval (DFT): $\frac{1}{N} \sum_{k=0}^{N-1} |x_k|^2 = \sum_{n=0}^{N-1} |c_n|^2$, where $c_n = X_n / N$.
# Check Parseval numerically:
import numpy as np
N = 256
x = np.random.randn(N)
X = np.fft.fft(x)
lhs = np.mean(np.abs(x)**2)      # (1/N) * sum |x[k]|^2
rhs = np.sum(np.abs(X / N)**2)   # sum |c_n|^2
assert np.isclose(lhs, rhs), f"Parseval failed: {lhs} != {rhs}"
This completes the Section 20-01 Fourier Series reference notes. Continue with Section 20-02 Fourier Transform -> to see how the period-$T$ Fourier series becomes the continuous Fourier transform as $T \to \infty$.
Appendix Q: Historical Vignettes
Q.1 Fourier's Audacity and the Reception of His Theory
Joseph Fourier (1768-1830) was not primarily a mathematician. He was a physicist and administrator who, as Napoleon's prefect of Isère, oversaw the draining of malarial swamps and the construction of roads. His mathematical masterpiece grew from a practical problem: how does heat conduct through a solid body?
When Fourier presented his theory to the Paris Academy in 1807, the reaction was hostile. The paper was rejected - or rather, shelved - on the grounds that the claim "every function can be represented as a trigonometric series" was both wrong (they thought) and insufficiently rigorous (it was). The judges included Laplace, Lagrange, and Monge - the greatest mathematicians of the age.
Lagrange's objection was mathematical: a sum of smooth functions (sines and cosines) cannot converge to a discontinuous function. This was a perfectly reasonable objection in 1807, before the modern theory of pointwise vs. $L^2$ convergence had been developed. Fourier's answer - essentially, "but it does!" - was more right than Lagrange, but not for the reasons Fourier gave.
Fourier persisted. In 1822, he published his masterpiece Théorie analytique de la chaleur, which remains one of the great works of mathematical physics. His key equations - the heat equation, the Fourier series, and the Fourier integral - were all present, though not rigorously proved.
The rigorous proof came from Dirichlet in 1829. The controversy about pointwise convergence of the Fourier series of a continuous function was not finally resolved until Carleson in 1966 - 159 years after Fourier's original claim. Mathematics sometimes takes time to catch up with correct intuitions.
Q.2 Dirichlet's Proof: The Birth of Rigorous Analysis
Peter Gustav Lejeune Dirichlet (1805-1859) was a student of Fourier's. His 1829 paper giving the first rigorous proof of pointwise convergence is often cited as one of the founding documents of modern analysis - not just because of the result, but because of the method.
Dirichlet's proof introduced two innovations that transformed mathematics:
- The modern definition of function. Before Dirichlet, a "function" meant something expressible by a formula. Dirichlet explicitly worked with functions that could behave differently on different parts of their domain, introducing what we now recognize as the modern concept. His paper is one of the first places where the contemporary idea of a function (as any rule assigning outputs to inputs) appears clearly.
- Careful convergence arguments. Dirichlet's proof carefully tracked when limits could be exchanged and when they could not - a concern that Cauchy had introduced but that Dirichlet applied systematically to Fourier series. This careful accounting of convergence, rather than treating series as formal objects, is the hallmark of 19th-century rigor.
In this sense, Fourier series didn't just give us a useful computational tool. They provoked the development of the rigorous analysis that underlies all of modern mathematics and machine learning theory.
Q.3 Carleson's Theorem: The 159-Year-Old Problem
For more than a century after Dirichlet's theorem (1829), the precise scope of pointwise convergence remained open. The naive question:
Does the Fourier series of every continuous function converge pointwise everywhere?
- 1873: Du Bois-Reymond showed NO: there is a continuous function whose Fourier series diverges at a point
- 1923: Kolmogorov showed NO even more dramatically: an $L^1$ function whose Fourier series diverges everywhere (!)
- But for the natural space $L^2$...?
In 1966, Lennart Carleson proved: YES, the Fourier series of every $L^2$ function converges pointwise almost everywhere. This was one of the most celebrated theorems in 20th-century analysis, and it featured prominently in the citation for Carleson's 2006 Abel Prize.
The proof uses a remarkable decomposition of the Dirichlet kernel into "tiles" in the time-frequency plane, combined with a sophisticated stopping-time argument. It is considered one of the most difficult proofs in all of harmonic analysis - spanning 30+ pages of dense mathematics.
For our purposes, the practical upshot is: every $L^2$ function (which includes essentially every function we encounter in ML) has a Fourier series converging pointwise almost everywhere. The theoretical pathologies of convergence failure are measure-zero exceptions.
Appendix R: Self-Assessment Checklist
Use this checklist to verify understanding before moving to Section 20-02.
R.1 Computational Mastery
- Can you compute the Fourier coefficients of the square wave, sawtooth, and triangle wave from scratch without looking them up?
- Can you use the differentiation theorem ($c_n(f') = in\, c_n(f)$) to derive the triangle wave coefficients from the square wave coefficients?
- Can you apply Parseval's identity to evaluate $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$ (via the Fourier series of $f(t) = t$ on $[-\pi, \pi)$)?
- Can you identify the decay rate of Fourier coefficients ($1/n$, $1/n^2$, exponential) from a description of the function's smoothness?
- Can you write the partial sum as a convolution with the Dirichlet kernel, $S_N f = f * D_N$?
R.2 Theoretical Understanding
- Can you state Dirichlet's theorem and identify what happens at a jump discontinuity?
- Can you explain the Gibbs phenomenon and why it persists at any truncation order?
- Can you state Parseval's identity and explain its physical meaning (energy conservation)?
- Can you explain why $L^2$ convergence is the "correct" convergence for Fourier series?
- Can you explain why the Fejer kernel (Cesaro means) fixes the Gibbs phenomenon while the Dirichlet kernel doesn't?
- Can you state the Riemann-Lebesgue lemma ($c_n \to 0$ for $f \in L^1$) and explain why smoother signals have faster-decaying Fourier coefficients?
R.3 AI Connections
- Can you derive the sinusoidal PE formula from the Fourier basis perspective?
- Can you explain RoPE in terms of complex Fourier multiplication and the shift theorem?
- Can you explain the spectral bias phenomenon and its consequences for neural network training?
- Can you explain why Random Fourier Features approximate kernel methods and how Bochner's theorem is involved?
- Can you explain the attention sink phenomenon using Fourier analysis (DC component)?
R.4 Connections to Other Sections
- Do you understand that the Fourier Transform (Section 20-02) is the $T \to \infty$ limit of the period-$T$ Fourier series?
- Do you understand that the DFT (Section 20-03) is the Fourier series for a finite discrete sequence?
- Do you understand that the convolution theorem (Section 20-04) is a consequence of Fourier basis orthogonality ($\frac{1}{2\pi} \int_0^{2\pi} e^{i(m-n)t}\, dt = \delta_{mn}$)?
- Do you understand that wavelets (Section 20-05) overcome the time-localization failure of Fourier series?
If you cannot answer YES to all items in R.1 and R.2, and at least 4 items in R.3 and R.4, review the corresponding sections before proceeding.
End of Section 20-01 Fourier Series
<- Back to Fourier Analysis | Next: Fourier Transform ->
Appendix S: Summary of Key Results
S.1 The Core Theorems
Theorem 1 (Dirichlet, 1829). If $f$ is $2\pi$-periodic and piecewise smooth, then for every $t$:

$$\lim_{N \to \infty} S_N f(t) = \frac{f(t^+) + f(t^-)}{2}$$

Theorem 2 ($L^2$ convergence). For every $f \in L^2[0, 2\pi]$:

$$\|f - S_N f\|_{L^2} \to 0 \quad \text{as } N \to \infty$$

Theorem 3 (Parseval-Bessel). For $f \in L^2[0, 2\pi]$:

$$\frac{1}{2\pi} \int_0^{2\pi} |f(t)|^2\, dt = \sum_{n=-\infty}^{\infty} |c_n|^2$$

Theorem 4 (Fejér, 1900). For every $f \in C(\mathbb{T})$ (continuous and $2\pi$-periodic), the Cesàro means $\sigma_N f = \frac{1}{N+1} \sum_{k=0}^{N} S_k f$ satisfy:

$$\sigma_N f \to f \quad \text{uniformly}$$

Theorem 5 (Riemann-Lebesgue). For $f \in L^1[0, 2\pi]$:

$$c_n \to 0 \quad \text{as } |n| \to \infty$$

Theorem 6 (Carleson, 1966). For every $f \in L^2[0, 2\pi]$:

$$S_N f(t) \to f(t) \quad \text{for almost every } t$$
S.2 The Most Important Formula
Among all the formulas in this section, one pair stands out as the most consequential for both mathematics and AI - the synthesis/analysis pair:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n e^{int}, \qquad c_n = \frac{1}{2\pi} \int_0^{2\pi} f(t)\, e^{-int}\, dt$$

Every other result in this section - convergence theorems, Parseval, the shift property, RoPE, spectral bias - is a consequence or application of this pair.