Fourier Series, Part 1: Intuition through Appendix A (Extended Examples and Computations)
1. Intuition
1.1 What Is a Fourier Series?
Strike a guitar string and you hear a pitch - the fundamental frequency. Listen more carefully and you hear overtones: frequencies at twice, three times, four times the fundamental. The rich timbre of a guitar versus a piano playing the same note comes entirely from the relative amplitudes of these overtones. A Fourier series is the precise mathematical statement of this physical observation: every periodic function is a (possibly infinite) weighted sum of sinusoids at integer multiples of a fundamental frequency.
More precisely, if $f$ is a $2\pi$-periodic function satisfying mild regularity conditions, then:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big),$$

where the coefficients $a_n$ and $b_n$ are uniquely determined by $f$. The term with $n = 1$ is the fundamental harmonic, and the terms with $n \ge 2$ are the overtones or higher harmonics.
The miracle is how general this is. The function $f$ does not have to be smooth. The square wave - which jumps discontinuously between $-1$ and $+1$ - has a Fourier series:

$$f(x) = \frac{4}{\pi}\sum_{k=0}^{\infty}\frac{\sin\big((2k+1)x\big)}{2k+1}.$$
Summing 100 terms gives a function that is nearly indistinguishable from the square wave - except near the jumps, where a small overshoot persists no matter how many terms you include. This is the Gibbs phenomenon, and we will study it carefully.
Non-example: Not every function has a convergent Fourier series in the pointwise sense. A function that is not in $L^1$ (e.g., $f(x) = 1/x$ near $x = 0$) does not even have well-defined Fourier coefficients. A continuous function can have a Fourier series that diverges at a single point (Du Bois-Reymond, 1873). The correct framework is $L^2$ convergence, not pointwise convergence.
1.2 Why It Matters for AI
Fourier series are not an abstract curiosity - they are embedded in the architecture of modern LLMs, CNNs, and audio models:
For AI:
- Positional encodings in Transformers (Vaswani et al., 2017) use $\sin(pos/10000^{2i/d})$ and $\cos(pos/10000^{2i/d})$ - these are Fourier basis functions at geometrically spaced frequencies.
- Rotary Position Embedding (RoPE) (Su et al., 2021), used in LLaMA-3, GPT-NeoX, and Mistral, interprets token positions as rotations in the complex plane - directly using complex Fourier basis vectors $e^{im\theta}$.
- Spectral bias (Rahaman et al., 2019): neural networks learn low-frequency Fourier components of the target function first and high-frequency components last. This governs convergence speed, generalization, and the effectiveness of data augmentation.
- Random Fourier features (Rahimi & Recht, 2007): kernel machines (SVMs, GPs) can be approximated in $O(nD)$ time by projecting data onto $D$ random Fourier basis functions sampled from the kernel's spectral density.
1.3 Historical Timeline
FOURIER ANALYSIS - HISTORICAL MILESTONES
========================================================================
1748 Euler studies vibrating strings; writes trigonometric series
1807 Fourier presents "Theorie de la chaleur" to Paris Academy;
claims every function has a trigonometric series expansion
(claim rejected - Laplace and Lagrange are skeptical)
1822 Fourier publishes "Theorie analytique de la chaleur"
1829 Dirichlet proves convergence for piecewise smooth functions;
gives first rigorous proof of pointwise convergence
1854 Riemann extends the theory; defines the Riemann integral partly
for the purposes of studying Fourier series
1873 Du Bois-Reymond exhibits a continuous function whose Fourier
series diverges at a point - Fourier's original claim was wrong
1900 Fejer proves Cesaro summability: every continuous function's
Fourier series converges in the Cesaro sense
1907 Parseval's identity proved rigorously by Fatou & Riesz
1966 Carleson proves pointwise convergence a.e. for L2 functions
     (Abel Prize, 2006)
2017 Vaswani et al. use Fourier basis for transformer PE
2021 Su et al. introduce RoPE; now standard in LLaMA-3, Mistral
2022 FNet (Lee-Thorp) replaces attention with Fourier transforms
2023 Spectral analysis of LLM weight matrices goes mainstream
========================================================================
1.4 The Geometric Picture
The key insight that makes Fourier series transparent is to think of functions as vectors in an infinite-dimensional inner product space.
On the interval $[-\pi, \pi]$, define the inner product of two functions $f$ and $g$ by:

$$\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx.$$

The functions $e_n(x) = e^{inx}$, $n \in \mathbb{Z}$, form an orthonormal set under this inner product:

$$\langle e_n, e_m\rangle = \delta_{nm},$$

where $\delta_{nm}$ is the Kronecker delta. This is easy to verify: if $n \neq m$, the integral of $e^{i(n-m)x}$ over a full period is zero.

Now the Fourier series is just the expansion of $f$ in this orthonormal basis. The coefficient $c_n$ is the projection of $f$ onto the basis vector $e_n$:

$$c_n = \langle f, e_n\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx.$$

This is identical to the formula for coordinates in any orthonormal basis: $v = \sum_n \langle v, e_n\rangle\,e_n$. The Fourier series is not a formula to be memorized - it is the inevitable consequence of projecting onto an orthonormal basis.

For AI: This geometric view directly explains RoPE. In RoPE, each attention-head dimension pair is treated as a 2D vector and rotated by angle $m\theta_k$ for token position $m$. This rotation is multiplication by $e^{im\theta_k}$ - a Fourier basis vector. The inner product between rotated queries and keys then depends only on their relative position $m - n$, making the positional encoding relative rather than absolute.
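A quick numerical check of this orthonormality (a minimal sketch; on a uniform grid over one period, the grid average stands in for $\frac{1}{2\pi}\int$, and the grid size is arbitrary):

```python
import numpy as np

# Check <e_n, e_m> = delta_nm under <f, g> = (1/2pi) * integral of f * conj(g).
x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
inner = lambda f, g: np.mean(f * np.conj(g))   # uniform-grid average = (1/2pi)*integral

for n in range(-3, 4):
    for m in range(-3, 4):
        val = inner(np.exp(1j * n * x), np.exp(1j * m * x))
        assert np.isclose(val, 1.0 if n == m else 0.0)
print("orthonormality verified for |n|, |m| <= 3")
```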
1.5 Three Equivalent Representations
Every Fourier series can be written in three equivalent forms. Each has advantages in different contexts.
Real (sine-cosine) form:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos(nx) + b_n\sin(nx)\big).$$
Best for: real-valued functions where you want to see real amplitudes explicitly.
Complex exponential form:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{inx}.$$
Best for: computation, derivations, and AI applications (RoPE, FNet). More compact; the complex exponentials are eigenfunctions of differentiation.
Amplitude-phase form:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} A_n\cos(nx - \varphi_n),$$

where $A_n = \sqrt{a_n^2 + b_n^2}$ and $\varphi_n = \operatorname{atan2}(b_n, a_n)$. Best for: signal processing applications where amplitude and phase are the physically meaningful quantities.

Conversion between forms: $c_n = \frac{a_n - ib_n}{2}$ for $n \ge 1$, $c_0 = \frac{a_0}{2}$, and $c_{-n} = \overline{c_n}$ for real $f$. These identities follow immediately from Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$.
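The three forms can be checked against each other numerically. A minimal sketch using the square wave's coefficients $a_n = 0$, $b_n = \frac{4}{\pi n}$ for odd $n$ (derived in Section 2.3); the truncation order $N$ is arbitrary:

```python
import numpy as np

# All three forms give the same partial sum of the square wave series.
N = 25
x = np.linspace(-np.pi, np.pi, 1001)
n = np.arange(1, N + 1)
a = np.zeros(N)
b = np.where(n % 2 == 1, 4.0 / (np.pi * n), 0.0)

real_form = sum(a[k] * np.cos(n[k] * x) + b[k] * np.sin(n[k] * x) for k in range(N))

c = (a - 1j * b) / 2                                  # c_n for n >= 1
# the n and -n terms are conjugates, so their sum is twice the real part
complex_form = sum(2 * (c[k] * np.exp(1j * n[k] * x)).real for k in range(N))

A, phi = np.hypot(a, b), np.arctan2(b, a)             # amplitude-phase form
phase_form = sum(A[k] * np.cos(n[k] * x - phi[k]) for k in range(N))

assert np.allclose(real_form, complex_form)
assert np.allclose(real_form, phase_form)
```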
2. Formal Definitions
2.1 The Space $L^2(-\pi, \pi)$

The natural domain for Fourier series is the space of square-integrable functions on $(-\pi, \pi)$:

$$L^2(-\pi, \pi) = \Big\{ f : (-\pi, \pi) \to \mathbb{C} \;\Big|\; \int_{-\pi}^{\pi} |f(x)|^2\,dx < \infty \Big\}.$$

This space carries the (normalized) inner product:

$$\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx,$$

and induced norm $\|f\| = \sqrt{\langle f, f\rangle}$. Two functions that agree except on a set of measure zero are identified. Under this identification, $L^2(-\pi, \pi)$ is a complete inner product space - a Hilbert space. The full abstract theory is in Section 12-02 Hilbert Spaces; here we use only the concrete definition.
Examples in $L^2(-\pi, \pi)$:
- Every bounded measurable function (e.g., square wave, triangle wave)
- $f(x) = \log|x|$ (integrable singularity: $\int_{-\pi}^{\pi}\log^2|x|\,dx < \infty$)
- $f(x) = |x|^{-\alpha}$ for any $\alpha < 1/2$

Non-examples:
- $f(x) = 1/x$ near $x = 0$: $|f(x)|^2 = 1/x^2$, so $\int_{-\pi}^{\pi}|f|^2\,dx = \infty$
- $f(x) = e^{1/|x|}$ near $x = 0$: grows too fast to be square-integrable
2.2 The Trigonometric System
Definition (Trigonometric System): The collection $\{e^{inx} : n \in \mathbb{Z}\}$ is called the complex trigonometric system on $[-\pi, \pi]$.

Theorem (Orthonormality): $\langle e^{inx}, e^{imx}\rangle = \delta_{nm}$ under the normalized inner product of Section 2.1; equivalently, the rescaled functions $\frac{1}{\sqrt{2\pi}}e^{inx}$ are orthonormal under the unweighted inner product $\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$.

Proof. For $n \neq m$:

$$\langle e^{inx}, e^{imx}\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i(n-m)x}\,dx = \frac{1}{2\pi}\cdot\frac{e^{i(n-m)x}}{i(n-m)}\Bigg|_{-\pi}^{\pi} = 0,$$

since $e^{i(n-m)\pi} = e^{-i(n-m)\pi}$ (the exponents differ by a multiple of $2\pi i$), so the boundary values cancel. For $n = m$: $\langle e^{inx}, e^{inx}\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} 1\,dx = 1$.

The real trigonometric system $\{1, \cos(nx), \sin(nx) : n \ge 1\}$ is similarly orthogonal, with $\int_{-\pi}^{\pi}\cos^2(nx)\,dx = \int_{-\pi}^{\pi}\sin^2(nx)\,dx = \pi$ for $n \ge 1$ and $\int_{-\pi}^{\pi} 1\,dx = 2\pi$.

Completeness (stated): The trigonometric system is complete in $L^2(-\pi, \pi)$: for every $f \in L^2$ and every $\varepsilon > 0$, there exists a trigonometric polynomial $P$ such that $\|f - P\| < \varepsilon$. The proof uses the Weierstrass approximation theorem and is in Section 12-02.
2.3 Fourier Coefficients
Definition (Fourier Coefficients): For $f \in L^2(-\pi, \pi)$, the complex Fourier coefficients of $f$ are:

$$c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx, \qquad n \in \mathbb{Z}.$$

The real Fourier coefficients are:

$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx.$$

Relations between forms: For real $f$: $c_n = \frac{a_n - ib_n}{2}$ for $n \ge 1$, $c_0 = \frac{a_0}{2}$, and $c_{-n} = \overline{c_n}$ (Hermitian symmetry).
Standard examples - complete computations:
(i) Square wave: $f(x) = 1$ for $0 < x < \pi$, $f(x) = -1$ for $-\pi < x < 0$.

Since $f$ is odd, $a_n = 0$ for all $n$. For $b_n$:

$$b_n = \frac{2}{\pi}\int_0^{\pi}\sin(nx)\,dx = \frac{2}{\pi}\cdot\frac{1 - \cos(n\pi)}{n} = \frac{2\big(1 - (-1)^n\big)}{\pi n}.$$

So $b_n = \frac{4}{\pi n}$ for odd $n$ and $b_n = 0$ for even $n$.
(ii) Sawtooth wave: $f(x) = x$ for $-\pi < x < \pi$, extended periodically.

$f$ is odd, so $a_n = 0$. For $b_n$, integrate by parts:

$$b_n = \frac{2}{\pi}\int_0^{\pi} x\sin(nx)\,dx = \frac{2}{\pi}\left[\frac{-x\cos(nx)}{n}\Bigg|_0^{\pi} + \frac{1}{n}\int_0^{\pi}\cos(nx)\,dx\right] = \frac{2(-1)^{n+1}}{n}.$$

So $x \sim 2\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\sin(nx)$.
(iii) Triangle wave: $f(x) = |x|$ for $-\pi \le x \le \pi$.

$f$ is even, so $b_n = 0$. For $a_n$ ($n \ge 1$):

$$a_n = \frac{2}{\pi}\int_0^{\pi} x\cos(nx)\,dx = \frac{2}{\pi}\cdot\frac{\cos(n\pi) - 1}{n^2} = \frac{2\big((-1)^n - 1\big)}{\pi n^2}.$$

And $a_0 = \frac{2}{\pi}\int_0^{\pi} x\,dx = \pi$. So $a_n = -\frac{4}{\pi n^2}$ for odd $n$, $a_n = 0$ for even $n \ge 2$, and

$$|x| = \frac{\pi}{2} - \frac{4}{\pi}\sum_{k=0}^{\infty}\frac{\cos\big((2k+1)x\big)}{(2k+1)^2}.$$
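These closed forms are easy to sanity-check against the definition $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ (a minimal sketch; the grid size and tolerance are arbitrary):

```python
import numpy as np

# Recover c_n numerically via a Riemann sum and compare with the closed
# forms (i)-(iii), converted through c_n = (a_n - i b_n)/2.
x = np.linspace(-np.pi, np.pi, 200000, endpoint=False)
coef = lambda f, n: np.mean(f * np.exp(-1j * n * x))   # grid average = (1/2pi)*integral

square, sawtooth, triangle = np.sign(x), x, np.abs(x)

for n in (1, 2, 3, 5):
    c_sq  = -2j / (np.pi * n) if n % 2 else 0        # from b_n = 4/(pi n), odd n
    c_saw = -1j * (-1) ** (n + 1) / n                # from b_n = 2(-1)^{n+1}/n
    c_tri = -2 / (np.pi * n**2) if n % 2 else 0      # from a_n = -4/(pi n^2), odd n
    assert np.allclose(
        [coef(square, n), coef(sawtooth, n), coef(triangle, n)],
        [c_sq, c_saw, c_tri], atol=1e-4)
```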
Non-examples: Not every sequence is the Fourier coefficient sequence of an $L^2$ function. By Parseval's identity (Section 4.2), we need $\sum_n |c_n|^2 < \infty$. The sequence $c_n = 1$ for all $n$ is not a valid Fourier coefficient sequence.
2.4 Partial Sums and the Dirichlet Kernel
Definition: The $N$-th partial sum of the Fourier series of $f$ is:

$$S_N f(x) = \sum_{n=-N}^{N} c_n\,e^{inx}.$$

A crucial observation: the partial sum can be written as a convolution:

$$S_N f(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,D_N(x - t)\,dt,$$

where the Dirichlet kernel is:

$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = \frac{\sin\big((N + \tfrac{1}{2})u\big)}{\sin(u/2)}.$$

The closed form follows by summing the geometric series and simplifying using $e^{iu/2} - e^{-iu/2} = 2i\sin(u/2)$.
Key properties of $D_N$:
- $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ oscillates, with a central peak of height $2N + 1$ near $u = 0$
- $D_N$ does NOT stay non-negative (unlike an approximate identity), causing convergence issues

The failure of $D_N$ to remain non-negative is precisely what allows the Gibbs phenomenon (Section 3.4) and prevents uniform convergence at discontinuities.
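A short numerical check of the closed form and these properties (a minimal sketch; the tiny offset merely dodges the removable singularity at $u = 0$):

```python
import numpy as np

# Dirichlet kernel D_N: compare the exponential sum with the closed form,
# check the normalization (1/2pi)*integral D_N = 1, and note the sign changes.
N = 10
u = np.linspace(-np.pi, np.pi, 20000, endpoint=False) + 1e-9   # dodge u = 0
D_sum = np.sum([np.cos(n * u) for n in range(-N, N + 1)], axis=0)
D_closed = np.sin((N + 0.5) * u) / np.sin(u / 2)

assert np.allclose(D_sum, D_closed)
assert abs(np.mean(D_closed) - 1.0) < 1e-3   # grid mean approximates (1/2pi)*integral
assert D_closed.min() < 0                    # negative lobes: not an approximate identity
```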
3. Convergence Theory
3.1 Pointwise Convergence: Dirichlet's Theorem
Theorem (Dirichlet, 1829): Let $f$ be a $2\pi$-periodic function that is piecewise smooth on $[-\pi, \pi]$ (i.e., $f$ and $f'$ are piecewise continuous). Then the Fourier series of $f$ converges for every $x$, and:

$$\lim_{N\to\infty} S_N f(x) = \frac{f(x^+) + f(x^-)}{2}.$$

At points of continuity, $f(x^+) = f(x^-) = f(x)$, so the series converges to $f(x)$. At a jump discontinuity, the series converges to the average of the left and right limits.

Proof sketch. Write $S_N f(x) - \frac{f(x^+) + f(x^-)}{2}$ as an integral involving $D_N$ and the function $g(t) = \frac{f(x+t) + f(x-t) - f(x^+) - f(x^-)}{2\sin(t/2)}$. The key step: $g$ is integrable if $f$ is piecewise smooth (the singularity at $t = 0$ is removable). Then apply the Riemann-Lebesgue Lemma (Section 5.5): $\int g(t)\sin\big((N + \tfrac{1}{2})t\big)\,dt \to 0$ as $N \to \infty$ for any integrable $g$.
Dirichlet conditions (sufficient, not necessary):
- $f$ is bounded and has finitely many maxima, minima, and discontinuities on $[-\pi, \pi]$
- $f$ has left and right derivatives at every point
What Dirichlet's theorem does NOT say:
- It does not guarantee uniform convergence (fails at jumps due to Gibbs)
- It does not apply to all continuous functions (Du Bois-Reymond's example)
- It says nothing about convergence rate
3.2 $L^2$ Convergence and Completeness

The $L^2$ convergence theorem is stronger and cleaner than the pointwise result:

Theorem ($L^2$ Convergence): For every $f \in L^2(-\pi, \pi)$:

$$\lim_{N\to\infty}\|f - S_N f\| = 0.$$

Equivalently, $f = \sum_{n=-\infty}^{\infty} c_n e^{inx}$ as an $L^2$ limit. This follows directly from the completeness of the trigonometric system in $L^2$ (stated in Section 2.2).

Bessel's Inequality (preliminary to Parseval): For any $f \in L^2$:

$$\sum_{n=-N}^{N} |c_n|^2 \le \|f\|^2.$$

Proof. Expand $0 \le \|f - S_N f\|^2 = \|f\|^2 - \sum_{|n| \le N}|c_n|^2$. This shows the partial sums of $\sum_n |c_n|^2$ are bounded by $\|f\|^2$. Taking $N \to \infty$ gives Bessel.
Equality holds (Bessel becomes Parseval) exactly when the system is complete.
3.3 Uniform Convergence
Theorem: If $f \in C^1$ (continuously differentiable) and $2\pi$-periodic, then $S_N f \to f$ uniformly.

Proof sketch. Integrate by parts: $c_n(f') = in\,c_n(f)$, so $|c_n(f)| = \frac{|c_n(f')|}{|n|}$ for $n \neq 0$. By Cauchy-Schwarz and Bessel applied to $f'$, $\sum_{n\neq 0}|c_n(f)| \le \big(\sum_{n\neq 0}\tfrac{1}{n^2}\big)^{1/2}\big(\sum_{n\neq 0}|c_n(f')|^2\big)^{1/2} < \infty$, so the partial sums converge uniformly by the Weierstrass M-test. For $f \in C^2$, $|c_n| = O(1/n^2)$, giving faster uniform convergence.
Key point: Continuity alone is insufficient for uniform convergence. The counterexample (Du Bois-Reymond) is a continuous function whose Fourier series diverges at a single point.
3.4 The Gibbs Phenomenon
The Gibbs phenomenon is one of the most important practical facts about Fourier series.
Observation: Near a jump discontinuity at $x_0$, the partial sum $S_N f$ overshoots the function value by approximately $0.09\,\Delta$, where $\Delta$ is the jump height - about 9% of the jump - regardless of how large $N$ is.

Precise statement for the square wave: For the square wave with jump height $\Delta = 2$ at $x = 0$:

$$\lim_{N\to\infty} S_N f\!\left(\frac{\pi}{N}\right) = \frac{2}{\pi}\int_0^{\pi}\frac{\sin t}{t}\,dt \approx 1.17898.$$

So the overshoot is $\approx 0.179$, which is $\approx 8.9\%$ of the total jump of $2$.

Why it persists: The Dirichlet kernel has a tall central spike but also oscillating side lobes whose negative area does not vanish as $N$ grows. These side lobes cannot be eliminated by taking larger $N$ - they just become narrower and move closer to the discontinuity.
For AI: The Gibbs phenomenon is why "ringing" artifacts appear when you sharply truncate a frequency spectrum (e.g., in audio compression, image filtering, or when a language model encounters out-of-distribution high-frequency tokens). Windowing functions (Section 20-03) are the engineering fix.
Remedy - Fejer's theorem: Instead of taking partial sums $S_N f$, take their Cesaro means $\sigma_N f = \frac{1}{N+1}\sum_{k=0}^{N} S_k f$. Fejer proved these converge (uniformly for continuous $f$) and do NOT exhibit Gibbs overshoot.
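A minimal numerical sketch of both claims (the grid, the values of $N$, and the Cesaro order are arbitrary choices):

```python
import numpy as np

# Gibbs overshoot for the square wave: S_N overshoots the value 1 by a fixed
# fraction (~8.9% of the jump of 2) no matter how large N is, while the
# Cesaro mean sigma_N stays at or below 1.
x = np.linspace(1e-4, 1.0, 8000)                # just to the right of the jump at 0

def S(N):
    """N-th partial sum of the square wave series at the points x."""
    n = np.arange(1, N + 1, 2)                  # only odd harmonics contribute
    if n.size == 0:
        return np.zeros_like(x)
    return (4 / np.pi) * np.sin(np.outer(n, x)).T @ (1.0 / n)

for N in (50, 200, 800):
    print(N, S(N).max())                        # -> ~1.1790 for every N

sigma = np.mean([S(k) for k in range(0, 201)], axis=0)   # Cesaro mean sigma_200
print("Cesaro max:", sigma.max())               # <= 1: no overshoot
```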
3.5 Fejer's Theorem and Cesaro Summability
Theorem (Fejer, 1900): Let $f$ be continuous and $2\pi$-periodic. Then $\sigma_N f \to f$ uniformly as $N \to \infty$.

The Cesaro means use the Fejer kernel $F_N$ in place of the Dirichlet kernel:

$$\sigma_N f = f * F_N, \qquad F_N(u) = \frac{1}{N+1}\sum_{k=0}^{N} D_k(u) = \frac{1}{N+1}\left(\frac{\sin\frac{(N+1)u}{2}}{\sin\frac{u}{2}}\right)^2.$$
Unlike the Dirichlet kernel, the Fejer kernel is non-negative. This is why Cesaro summation fixes the Gibbs phenomenon - the averaging eliminates the negative side lobes.
For AI: Fejer's theorem is the prototype for windowing in signal processing. Multiplying a signal by a window function before taking the Fourier transform is equivalent to using a smoother summation kernel, which reduces spectral leakage and ringing (-> Section 20-03).
4. Parseval's Theorem and Energy
4.1 Bessel's Inequality
We proved Bessel's inequality in Section 3.2: $\sum_{n=-\infty}^{\infty}|c_n|^2 \le \|f\|^2$. This says the "energy" in the frequency representation is at most the energy in the time representation. Equality requires completeness.
4.2 Parseval's Identity
Theorem (Parseval's Identity): For $f \in L^2(-\pi, \pi)$ with Fourier coefficients $c_n$:

$$\sum_{n=-\infty}^{\infty} |c_n|^2 = \|f\|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx.$$

In real form: $\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^{\infty}\big(a_n^2 + b_n^2\big)$.

Proof. By completeness, $\|f - S_N f\| \to 0$. Then:

$$\|f\|^2 = \lim_{N\to\infty}\|S_N f\|^2 = \lim_{N\to\infty}\sum_{n=-N}^{N}|c_n|^2 = \sum_{n=-\infty}^{\infty}|c_n|^2,$$

using orthonormality of $\{e^{inx}\}$ for the second equality.
Physical interpretation: The total energy (power) of a signal equals the sum of energies in each frequency component. Fourier analysis is an energy-preserving change of basis.
More general form: For $f, g \in L^2(-\pi, \pi)$:

$$\langle f, g\rangle = \sum_{n=-\infty}^{\infty} c_n(f)\,\overline{c_n(g)}.$$

This is the polarization-identity version of Parseval.
4.3 Applications: Evaluating Infinite Series
Parseval's identity is a powerful tool for evaluating series that have no elementary closed form.
Example 1 - Basel problem via the sawtooth wave: Apply Parseval (real form) to $f(x) = x$ on $[-\pi, \pi]$, whose coefficients are $b_n = \frac{2(-1)^{n+1}}{n}$:

$$\frac{1}{\pi}\int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3} = \sum_{n=1}^{\infty} b_n^2 = \sum_{n=1}^{\infty}\frac{4}{n^2}.$$

Parseval gives $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$ - the Basel sum.

Example 2 - $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2}$: Apply Parseval to the square wave. The non-zero Fourier coefficients are $b_n = \frac{4}{\pi n}$ for odd $n$. Then:

$$\frac{1}{\pi}\int_{-\pi}^{\pi}|f(x)|^2\,dx = 2 = \sum_{n \text{ odd}}\frac{16}{\pi^2 n^2},$$

giving $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$.
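Both evaluations are easy to confirm by brute-force summation (a minimal check; the truncation point is arbitrary):

```python
import numpy as np

# Brute-force check of both Parseval evaluations.
n = np.arange(1, 2_000_000)
print(np.sum(1.0 / n**2), np.pi**2 / 6)        # Basel sum via the sawtooth
odd = n[::2]                                   # 1, 3, 5, ...
print(np.sum(1.0 / odd**2), np.pi**2 / 8)      # odd-squares sum via the square wave
```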
5. Properties of Fourier Coefficients
5.1 Linearity, Shift, and Scaling
Let $c_n(f)$ denote the $n$-th Fourier coefficient of $f$.

| Property | Statement | Proof idea |
|---|---|---|
| Linearity | $c_n(\alpha f + \beta g) = \alpha\,c_n(f) + \beta\,c_n(g)$ | Integral is linear |
| Shift | $c_n\big(f(\cdot - t_0)\big) = e^{-int_0}\,c_n(f)$ | Change of variable $x \mapsto x + t_0$ |
| Conjugation | $c_n(\overline{f}) = \overline{c_{-n}(f)}$ | Conjugate the integral |
| Reflection | $c_n\big(f(-\cdot)\big) = c_{-n}(f)$ | Change of variable $x \mapsto -x$ |
The shift property is the Fourier-series version of the time-shift property of the Fourier transform. It says: shifting a signal in time multiplies its Fourier coefficients by a complex exponential - i.e., introduces a linear phase. This is fundamental in signal alignment and is the mechanism behind relative positional encodings in transformers.
For AI - RoPE connection: In RoPE, the key and query vectors at positions $m$ and $n$ are $R_m q$ and $R_n k$, where $R_m$ is a (block-diagonal) rotation matrix. The dot product satisfies $(R_m q)^\top(R_n k) = q^\top R_{n-m} k$ - it depends only on the relative position $n - m$. This is exactly the Fourier shift property: shifting both signals by the same amount leaves their inner product unchanged.
5.2 Differentiation and Integration
Differentiation in frequency space: for continuously differentiable, $2\pi$-periodic $f$:

$$c_n(f') = in\,c_n(f).$$

Proof. Integrate by parts:

$$c_n(f') = \frac{1}{2\pi}\int_{-\pi}^{\pi} f'(x)\,e^{-inx}\,dx = \frac{1}{2\pi}\Big[f(x)\,e^{-inx}\Big]_{-\pi}^{\pi} + \frac{in}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx.$$

By periodicity, the boundary term vanishes, leaving $c_n(f') = in\,c_n(f)$.

Iterated differentiation: $c_n\big(f^{(k)}\big) = (in)^k\,c_n(f)$.

Implication for smoothness: A function $f \in C^k$ (k-times continuously differentiable and periodic) has $|c_n| = o(|n|^{-k})$ as $|n| \to \infty$. The smoother $f$ is, the faster its Fourier coefficients decay. This is the quantitative content of Section 5.4.

Integration: the antiderivative $F(x) = \int_0^x\big(f(t) - \bar{f}\big)\,dt$ is periodic and satisfies $c_n(F) = \frac{c_n(f)}{in}$ for $n \neq 0$, where $\bar{f} = c_0(f)$ is the mean.
5.3 Even and Odd Functions
Even functions ($f(-x) = f(x)$) have $b_n = 0$ for all $n$, so the Fourier series contains only cosines: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n\cos(nx)$ - a cosine series. The coefficients are $a_n = \frac{2}{\pi}\int_0^{\pi} f(x)\cos(nx)\,dx$.

Odd functions ($f(-x) = -f(x)$) have $a_n = 0$ for all $n$, so the Fourier series contains only sines: $f(x) = \sum_{n=1}^{\infty} b_n\sin(nx)$ - a sine series. The coefficients are $b_n = \frac{2}{\pi}\int_0^{\pi} f(x)\sin(nx)\,dx$.

For AI - DCT connection: The Discrete Cosine Transform (DCT) used in JPEG and MP3 compression is the even-extension Fourier series. The DCT coefficients are exactly the Fourier coefficients of the even extension of the signal to the doubled interval. This is why the DCT achieves better energy compaction than the DFT for real signals.
5.4 Smoothness and Spectral Decay
The relationship between regularity and spectral decay is fundamental to both signal processing and machine learning:
| Regularity of $f$ | Decay of $c_n$ | Practical implication |
|---|---|---|
| $f \in L^1$ | $c_n \to 0$ (by Riemann-Lebesgue) | All coefficients vanish asymptotically |
| piecewise smooth, with a jump discontinuity | $\|c_n\| \sim 1/\|n\|$ | Slow decay (Gibbs) |
| $C^1$ (continuously diff.) | $o(1/\|n\|)$ | Faster decay, no Gibbs |
| $C^k$ | $o(1/\|n\|^k)$ | Super-algebraic if $k$ large |
| analytic | $O(e^{-a\|n\|})$ for some $a > 0$ | Exponential decay |
For AI - Spectral bias (Rahaman et al., 2019): Neural networks learn the target function's Fourier decomposition from lowest to highest frequency. Formally, if $f^*$ is the target function, a network trained by gradient descent first approximates the coefficients $c_n(f^*)$ for small $|n|$ (low frequencies) and only later approximates high-frequency components. This implies:
- Benefit: Implicit regularization toward smooth (low-frequency) solutions -> good generalization
- Cost: Slow convergence on high-frequency targets; requires more data for texture-rich images
- Fix: Fourier feature embeddings (RFF, NeRF's positional encoding) inject high-frequency components explicitly
5.5 Riemann-Lebesgue Lemma
Theorem (Riemann-Lebesgue): If $f \in L^1(-\pi, \pi)$, then $c_n \to 0$ as $|n| \to \infty$.

Proof sketch. For a step function: direct computation. For general $f \in L^1$: approximate by step functions in the $L^1$ norm and use the bound $|c_n(f) - c_n(g)| \le \frac{1}{2\pi}\|f - g\|_{L^1}$.

Significance: This says every integrable function has Fourier coefficients that vanish at high frequencies. The rate of decay depends on smoothness (Section 5.4), but the qualitative statement holds for all integrable functions. This is the mathematical foundation of the claim "smooth signals are compressible in the Fourier domain."
6. Fourier Series on General Intervals
6.1 Extension to $[-L, L]$
For a $2L$-periodic function $f$, the Fourier series on $[-L, L]$ is obtained by the change of variables $x \mapsto \pi x/L$:

$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{in\pi x/L},$$

with coefficients:

$$c_n = \frac{1}{2L}\int_{-L}^{L} f(x)\,e^{-in\pi x/L}\,dx.$$

The fundamental frequency is $\pi/L$, and the $n$-th harmonic has frequency $n\pi/L$.

For AI - Transformer context length: In a transformer, sinusoidal positional encodings use frequencies $\omega_i = 10000^{-2i/d}$ for $i = 0, \ldots, d/2 - 1$. This covers a geometric range of frequencies spanning $[10^{-4}, 1]$ - analogous to sampling the Fourier series on a long interval at geometrically spaced frequencies. Longer context requires lower minimum frequencies, which is why RoPE's lowest frequency limits effective context length and why extended-context models (LLaMA-3-128K, Mistral) modify the base $10000$.
6.2 Half-Range Expansions
If $f$ is defined only on $[0, L]$ (not the full period), we can extend it to $[-L, L]$ in two ways:

Even extension: Extend by $f(-x) := f(x)$ -> leads to a cosine series on $[0, L]$:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n\cos\frac{n\pi x}{L}, \qquad a_n = \frac{2}{L}\int_0^{L} f(x)\cos\frac{n\pi x}{L}\,dx.$$

Odd extension: Extend by $f(-x) := -f(x)$ -> leads to a sine series on $[0, L]$:

$$f(x) = \sum_{n=1}^{\infty} b_n\sin\frac{n\pi x}{L}, \qquad b_n = \frac{2}{L}\int_0^{L} f(x)\sin\frac{n\pi x}{L}\,dx.$$

Application: Solving the heat equation on $[0, L]$ with Dirichlet boundary conditions ($u(0, t) = u(L, t) = 0$) uses the sine series expansion. With Neumann conditions ($u_x(0, t) = u_x(L, t) = 0$), use the cosine series.
7. Applications in Machine Learning
7.1 Sinusoidal Positional Encodings in Transformers
The original Transformer (Vaswani et al., 2017) uses positional encodings:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right).$$

This assigns each position $pos$ a $d$-dimensional vector whose components are sine and cosine values at geometrically spaced frequencies $\omega_i = 10000^{-2i/d}$.

Why this works: The set of frequencies forms a geometric progression spanning from $\omega_0 = 1$ (very high frequency, distinguishes adjacent tokens) down to $\omega_{d/2-1} \approx 10^{-4}$ (very low frequency, distinguishes widely separated tokens). This is exactly the strategy of a Fourier series on a long interval: use many harmonics at different scales.
Limitation: These are absolute positional encodings - the embedding for position 5 is fixed regardless of context. This creates problems for generalization to longer sequences than seen in training.
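A minimal NumPy sketch of this encoding (the function name sinusoidal_pe and the shapes are our own choices, not from the paper; $d$ is assumed even):

```python
import numpy as np

# Sinusoidal positional encodings: a (seq_len, d) matrix whose columns are
# sines/cosines at geometrically spaced frequencies 10000^(-2i/d).
def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d // 2)[None, :]                    # (1, d/2)
    freqs = 1.0 / 10000 ** (2 * i / d)                # geometric frequency ladder
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(pos * freqs)                 # even dims: sine
    pe[:, 1::2] = np.cos(pos * freqs)                 # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=128, d=64)
print(pe.shape)                                       # (128, 64)
```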
7.2 Rotary Positional Encoding (RoPE)
RoPE (Su et al., 2021), now standard in LLaMA-2/3, Mistral, Falcon, and GPT-NeoX, encodes position as a rotation of the query and key vectors.
Construction: For a $d$-dimensional query vector $q$, split it into $d/2$ pairs $(q_0, q_1), (q_2, q_3), \ldots$. Rotate the $i$-th pair by angle $m\theta_i$, where $m$ is the token position and $\theta_i = 10000^{-2i/d}$:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}.$$

This is multiplication by $e^{im\theta_i}$ in the complex representation $z_i = q_{2i} + i\,q_{2i+1}$.

The key property: The attention score between the query at position $m$ and the key at position $n$ is:

$$\langle R_m q, R_n k\rangle = \langle q, R_{n-m} k\rangle.$$

The score depends only on $m - n$ - the relative position. This is the Fourier shift theorem: multiplying by $e^{im\theta}$ and taking a conjugate product gives sensitivity to relative displacement.
Extended context: The maximum effective context length is determined by the lowest frequency $\theta_{d/2-1} = 10000^{-(d-2)/d}$. Models like LLaMA-3-128K extend context by scaling positions (Position Interpolation, Chen et al., 2023) or by increasing the base from 10000 to 500000 (Roziere et al., 2023).
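A minimal RoPE sketch in the pairwise-rotation form above (the helper name rope, dimensions, and test positions are our own illustrative choices; real implementations vectorize over the whole sequence):

```python
import numpy as np

# Rotate 2D pairs of a vector by position-dependent angles and check that
# the dot product depends only on the relative position m - n.
def rope(v: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    d = v.shape[0]
    theta = base ** (-2 * np.arange(d // 2) / d)      # per-pair frequencies
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    v1, v2 = v[0::2], v[1::2]                         # split into pairs
    out = np.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin                   # 2x2 rotation per pair
    out[1::2] = v1 * sin + v2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope(q, m=5) @ rope(k, m=9)                      # relative offset 4
s2 = rope(q, m=100) @ rope(k, m=104)                  # same offset, shifted
assert np.isclose(s1, s2)                             # score depends on m - n only
```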
7.3 Spectral Bias of Neural Networks
The phenomenon (Rahaman et al., 2019): Neural networks trained with gradient descent learn a biased decomposition of the target function: low-frequency Fourier components are learned first, high-frequency components last.
Mathematical statement: Let $f^*$ be the target function with Fourier decomposition $f^* = \sum_n c_n e^{inx}$. A network trained on a finite sample first converges on the low-$|n|$ components (the error in $c_n$ decreases for small $|n|$ first).
Consequences for AI:
- Good: Networks converge to smooth solutions -> implicit regularization -> good out-of-distribution generalization for smooth targets
- Bad: Learning high-frequency signals (sharp edges, fine texture) requires more data and training time
- NeRF / SIREN fix: Sinusoidal activations (Sitzmann et al., 2020) or Fourier feature mappings (Tancik et al., 2020) inject high-frequency components, overcoming spectral bias for 3D scene representation
Mechanism: The NTK (Neural Tangent Kernel) of a standard MLP has a spectrum that decays with frequency. High-frequency components of the target correspond to small-eigenvalue directions of the NTK, which converge slowly under gradient descent.
7.4 Random Fourier Features and Kernel Approximation
The problem: Kernel methods (SVMs, GPs) require storing and computing an $n \times n$ kernel matrix - $O(n^2)$ memory and at least $O(n^2)$ computation. For large $n$, this is infeasible.

The solution (Rahimi & Recht, 2007): Any shift-invariant kernel $k(x, y) = k(x - y)$ can be written, by Bochner's theorem, as the Fourier transform of a non-negative measure:

$$k(x - y) = \int_{\mathbb{R}^d} e^{i\omega^\top(x - y)}\,p(\omega)\,d\omega.$$

Sampling frequencies $\omega_1, \ldots, \omega_D \sim p$ and defining the feature map:

$$z(x) = \sqrt{\tfrac{2}{D}}\big(\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)\big), \qquad b_j \sim \mathrm{Unif}[0, 2\pi],$$

gives $k(x, y) \approx z(x)^\top z(y)$. This reduces kernel computation to $O(nD)$ - linear in the dataset size.

Connection to Fourier series: $z(x)^\top z(y)$ is a finite Fourier expansion with randomly sampled frequencies. The approximation quality improves as $D$ increases, with concentration bounds showing $|z(x)^\top z(y) - k(x, y)| \le \varepsilon$ uniformly with high probability once $D$ grows like $\varepsilon^{-2}$ (up to logarithmic and dimension-dependent factors).
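A minimal sketch of the construction for the RBF kernel with $\sigma = 1$, where the spectral density is a standard Gaussian (the sizes $n$, $d$, $D$ are arbitrary):

```python
import numpy as np

# Random Fourier features for k(x, y) = exp(-||x - y||^2 / 2):
# draw frequencies from the kernel's spectral density (standard normal here).
rng = np.random.default_rng(0)
n, d, D = 200, 5, 4000
X = rng.normal(size=(n, d))

W = rng.normal(size=(d, D))                   # omega_j ~ N(0, I) for sigma = 1
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # feature map z(x)

K_approx = Z @ Z.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2)
print("max abs error:", np.abs(K_approx - K_exact).max())   # shrinks as D grows
```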
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using $c_n = \int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ without the $\frac{1}{2\pi}$ factor | Missing normalization; coefficients will be off by $2\pi$ | Always check which convention your source uses; in this repo we use $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)e^{-inx}\,dx$ |
| 2 | Assuming the Fourier series always converges pointwise | False: Du Bois-Reymond exhibited a continuous function whose series diverges at a point | State the convergence type explicitly; use $L^2$ convergence for most theory |
| 3 | Assuming the sum at a discontinuity equals $f(x)$ | The series converges to the average $\frac{f(x^+) + f(x^-)}{2}$ at a jump | Apply Dirichlet's theorem: check for discontinuities before claiming convergence to $f(x)$ |
| 4 | Writing $c_{-n} = c_n$ for a real function | Correct relation is $c_{-n} = \overline{c_n}$ (conjugate, not equal) | For real $f$: $c_{-n} = \overline{c_n}$; the $c_n$ are real only if $f$ is real and even |
| 5 | Applying differentiation term-by-term without checking conditions | $c_n(f') = in\,c_n(f)$ requires $f$ to be absolutely continuous and periodic | Check that $f$ is absolutely continuous before differentiating term-by-term |
| 6 | Confusing Parseval's theorem with Bessel's inequality | Bessel gives $\sum\|c_n\|^2 \le \|f\|^2$; Parseval gives equality - they differ | Parseval requires completeness of the trigonometric system; Bessel holds for any orthonormal set |
| 7 | Claiming $\|c_n\| = O(1/n)$ decay for a smooth function | $O(1/n)$ decay is the signature of a jump discontinuity; $C^k$ functions decay as $o(1/n^k)$ | Match the decay rate to the smoothness class (Section 5.4) |
| 8 | Forgetting to subtract the mean before computing sine/cosine coefficients | The mean is the DC component; it appears in the $\frac{a_0}{2}$ term, not in $a_n, b_n$ for $n \ge 1$ | Always compute $a_0$ separately; the series is $\frac{a_0}{2} + \sum_{n\ge1}(a_n\cos nx + b_n\sin nx)$ |
| 9 | Using the complex form with $n \ge 0$ only but real-form coefficients | Complex form requires $n \in \mathbb{Z}$; real form uses $n \ge 0$ - these are different index sets | Pick one form and use it consistently; convert via $c_n = \frac{a_n - ib_n}{2}$ |
| 10 | Claiming the Gibbs phenomenon goes away as $N \to \infty$ | The overshoot height stays at ~9% of the jump; only its width shrinks | Gibbs is a permanent feature of the partial sum near discontinuities; use windowing to mitigate |
9. Exercises
Exercise 1 (*): Compute the complex Fourier coefficients $c_n$ of the triangle wave $f(x) = |x|$ on $[-\pi, \pi]$ and write out the first five non-zero terms of the Fourier series. Verify that $f$ is even, so the $c_n$ are real.

Exercise 2 (*): Prove from first principles that the functions $e_n(x) = e^{inx}$, $n \in \mathbb{Z}$, form an orthonormal set in $L^2(-\pi, \pi)$ with the inner product $\langle f, g\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx$.

Exercise 3 (*): Using Parseval's identity applied to the square wave, show that $\sum_{k=0}^{\infty}\frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$. Then use the odd/even split $\sum_{n\ge1}\frac{1}{n^2} = \frac{\pi^2}{8} + \frac{1}{4}\sum_{n\ge1}\frac{1}{n^2}$ to recover the Basel sum $\frac{\pi^2}{6}$.

Exercise 4 (**): Let $f(x) = e^{ax}$ for $-\pi < x < \pi$ with $a$ real and non-zero. (a) Compute $c_n$. (b) Apply Parseval's identity to obtain a formula for $\coth(\pi a)$ in terms of a series involving $\frac{1}{a^2 + n^2}$. (c) Verify your answer for $a = 1$ numerically.

Exercise 5 (**): Prove the Riemann-Lebesgue Lemma: if $f \in L^1(-\pi, \pi)$, then $c_n \to 0$ as $|n| \to \infty$. (Hint: first prove it for step functions, then approximate general $f$ in $L^1$.)

Exercise 6 (**): Consider $f(x) = x^2$ on $[-\pi, \pi]$. (a) Find all Fourier coefficients. (b) Show that $S_N f \to f$ uniformly. (c) Set $x = \pi$ in the resulting series to derive $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$.

Exercise 7 (***): Implement RoPE from scratch in Python. (a) Given query and key vectors of dimension $d$ and sequence length $n$, construct the rotation matrices for each position using $\theta_i = 10000^{-2i/d}$. (b) Compute the rotated attention scores for all pairs $(m, n)$. (c) Verify that the score depends only on $m - n$ by checking $\langle R_m q, R_n k\rangle = \langle R_{m+s} q, R_{n+s} k\rangle$ for several values of $s$.

Exercise 8 (***): Implement Random Fourier Features for the RBF kernel. (a) For the Gaussian kernel $k(x, y) = \exp\big(-\frac{\|x - y\|^2}{2\sigma^2}\big)$, show that the spectral density is a Gaussian $p(\omega) \propto \exp\big(-\frac{\sigma^2\|\omega\|^2}{2}\big)$. (b) Implement the feature map $z(x)$ using $D$ random frequency samples. (c) On a synthetic dataset in $\mathbb{R}^2$, compare the exact kernel matrix with the approximation $z(X)z(X)^\top$ and plot the approximation error as a function of $D$.
10. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact | Concrete System |
|---|---|---|
| Fourier basis vectors $e^{inx}$ | Foundation of positional encodings in all transformer variants | RoPE in LLaMA-3, Gemma-2, Mistral-7B, Falcon |
| Complex Fourier form $\sum_n c_n e^{inx}$ | Rotation = position; enables relative PE via shift theorem | RoPE (Su et al., 2021), xPos, YaRN |
| Parseval's identity | Energy preservation: Fourier is a unitary transform; spectral analysis of embeddings | WeightWatcher spectral analysis of LLM health |
| Spectral bias ($\|c_n\|$ decay) | Networks learn smooth functions first; governs training dynamics | SIREN (Sitzmann et al., 2020), NeRF frequency encoding |
| Gibbs phenomenon | Ringing in over-compressed audio/images; sudden context-length failure modes | JPEG compression artifacts, LLM boundary token issues |
| Fourier completeness | Any periodic signal can be exactly represented; digital audio encoding | MP3, AAC, Ogg Vorbis compression standards |
| Random Fourier features | $O(nD)$ kernel approximation replacing the $O(n^2)$ kernel matrix | Large-scale SVM/GP inference (Rahimi & Recht, 2007) |
11. Conceptual Bridge
Looking backward: Fourier series is built on three pillars from earlier chapters: the inner product structure of (from Section 12-02 Hilbert Spaces), complex exponentials (from Section 01 Mathematical Foundations), and convergence of series (from real analysis in Section 04 Calculus). If those foundations feel shaky, revisit them before proceeding.
Looking forward: Fourier series handles periodic functions on a bounded interval. The Fourier Transform (Section 20-02) extends this to aperiodic signals on the entire real line by taking the period $T \to \infty$. The discrete, computationally feasible version is the DFT and FFT (Section 20-03). The Convolution Theorem (Section 20-04) shows why Fourier analysis is so powerful: convolution in time becomes multiplication in frequency. Finally, Wavelets (Section 20-05) overcome Fourier's fundamental limitation - the inability to localize in both time and frequency simultaneously.
POSITION IN THE FOURIER ANALYSIS CHAPTER
========================================================================
Section 20-01 Fourier Series -- YOU ARE HERE
v (take T -> infty)
Section 20-02 Fourier Transform
v (discretize: N samples)
Section 20-03 DFT and FFT
v (conv. in time <-> mult. in freq)
Section 20-04 Convolution Theorem
v (localize in time AND freq)
Section 20-05 Wavelets
Prerequisites: Forward pointers:
Section 01 Mathematical Foundations -> Section 20-02 (continuous FT)
Section 04 Calculus Fundamentals -> Section 20-03 (discrete FT)
Section 12-02 Hilbert Spaces -> Section 12-03 (kernel methods via Bochner)
========================================================================
<- Back to Fourier Analysis | Next: Fourier Transform ->
Appendix A: Extended Examples and Computations
A.1 Full Derivation: Fourier Series of Common Waveforms
This appendix provides complete, step-by-step derivations for the Fourier coefficients of the most important periodic waveforms. These are not presented as exercises - they are reference derivations that you should work through once and understand deeply. Every step is motivated.
A.1.1 Rectangular Pulse Train
Define a pulse of width $2a$ ($0 < a < \pi$) centered at $x = 0$, repeated with period $2\pi$:

$$f(x) = \begin{cases} 1, & |x| \le a, \\ 0, & a < |x| \le \pi. \end{cases}$$

Complex coefficients: for $n \neq 0$:

$$c_n = \frac{1}{2\pi}\int_{-a}^{a} e^{-inx}\,dx = \frac{\sin(na)}{\pi n} = \frac{a}{\pi}\,\mathrm{sinc}(na),$$

where $\mathrm{sinc}(u) = \frac{\sin u}{u}$. For $n = 0$: $c_0 = \frac{a}{\pi}$ (the duty cycle).

Key observation: The envelope of $|c_n|$ versus $n$ follows a $\mathrm{sinc}$ function. The first zero crossing occurs at $n = \pi/a$. A narrower pulse ($a$ small) -> wider spectrum (more high frequencies needed to represent the sharp edges). This is the time-frequency trade-off in action.

Spectrum width: The "bandwidth" (distance to first spectral null) is $\pi/a$ in normalized frequency. Narrow pulses are "broadband"; wide pulses (large $a$, approaching a constant) have most energy concentrated near $n = 0$.
A.1.2 Full-Wave Rectified Sine
The function $f(x) = |\sin x|$ is even and non-negative. Since $f$ is even, $b_n = 0$ for all $n$. For $a_n$:

$$a_n = \frac{2}{\pi}\int_0^{\pi}\sin x\,\cos(nx)\,dx.$$

Using the product-to-sum formula $\sin x\cos(nx) = \frac{1}{2}\big[\sin((1+n)x) + \sin((1-n)x)\big]$:

For $n = 1$: $a_1 = \frac{1}{\pi}\int_0^{\pi}\sin(2x)\,dx = 0$.

For $n \neq 1$:

$$a_n = \frac{1}{\pi}\left[\frac{1 - \cos((1+n)\pi)}{1+n} + \frac{1 - \cos((1-n)\pi)}{1-n}\right] = \begin{cases} -\dfrac{4}{\pi(n^2 - 1)}, & n \text{ even}, \\[4pt] 0, & n \text{ odd}. \end{cases}$$

So, with $a_0 = \frac{4}{\pi}$:

$$|\sin x| = \frac{2}{\pi} - \frac{4}{\pi}\sum_{k=1}^{\infty}\frac{\cos(2kx)}{4k^2 - 1}.$$

Application: The full-wave rectified sine is used in AC-to-DC conversion. Its spectrum has no odd harmonics - only even harmonics - which is why its Fourier representation converges faster (coefficients decay as $1/n^2$) than the square wave (which has $1/n$ decay).
A.1.3 The Sawtooth Wave (Staircase Convergence)
$f(x) = x$ on $(-\pi, \pi)$, with $f(\pm\pi) = 0$ (we assign the average at the jumps). We showed $b_n = \frac{2(-1)^{n+1}}{n}$.

Let us verify: at $x = \frac{\pi}{2}$, $f\big(\frac{\pi}{2}\big) = \frac{\pi}{2}$. The series gives:

$$2\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\sin\frac{n\pi}{2} = 2\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 2\cdot\frac{\pi}{4} = \frac{\pi}{2}.$$

This uses the Leibniz formula $1 - \frac{1}{3} + \frac{1}{5} - \cdots = \frac{\pi}{4}$.

Convergence rate: For the sawtooth, $|b_n| = \frac{2}{n}$, so the squared $L^2$ error of $S_N$ is $\sum_{n > N} b_n^2 = O(1/N)$ (from the integral test). Convergence is slow: to reduce the $L^2$ error below $\varepsilon$, we need $N \sim 1/\varepsilon^2$ terms.
A.2 The Heat Equation: Fourier's Original Application
The problem that motivated Fourier's entire theory is the heat equation on a rod:

$$\frac{\partial u}{\partial t} = \alpha\,\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < L,\ t > 0,$$

with boundary conditions $u(0, t) = u(L, t) = 0$ (zero temperature at the ends) and initial condition $u(x, 0) = f(x)$.
Solution by separation of variables:
Assume $u(x, t) = X(x)\,T(t)$. Substituting: $\frac{T'(t)}{\alpha T(t)} = \frac{X''(x)}{X(x)} = -\lambda$ (both sides must equal the same constant). This gives:

$$X'' + \lambda X = 0, \qquad T' + \alpha\lambda T = 0.$$

The boundary condition forces $X_n(x) = \sin\frac{n\pi x}{L}$ with eigenvalues $\lambda_n = \big(\frac{n\pi}{L}\big)^2$. The corresponding time solution is $T_n(t) = e^{-\alpha\lambda_n t}$.
Superposition: The general solution is:

$$u(x, t) = \sum_{n=1}^{\infty} b_n\,e^{-\alpha(n\pi/L)^2 t}\,\sin\frac{n\pi x}{L}.$$

The coefficients are determined by the initial condition:

$$u(x, 0) = \sum_{n=1}^{\infty} b_n\sin\frac{n\pi x}{L} = f(x).$$

This is exactly the sine series (half-range expansion, Section 6.2). So $b_n = \frac{2}{L}\int_0^{L} f(x)\sin\frac{n\pi x}{L}\,dx$.

Physical insight: Each mode decays at rate $\alpha\big(\frac{n\pi}{L}\big)^2$. High-frequency modes (large $n$) decay much faster than low-frequency modes. At long times, the solution looks like just the fundamental mode $n = 1$. This is Fourier's original discovery: heat diffusion is a low-pass filter in frequency space.
For AI: This is the origin of the spectral bias observation. In a sense, gradient descent on a neural network is like running the heat equation in weight space - it diffuses high-frequency components (noise) faster than low-frequency components (signal). The spectral bias of neural networks has exactly the same mathematical structure as heat diffusion.
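A minimal numerical sketch of the sine-series solution (the initial condition $f(x) = x(L - x)$, $\alpha = 0.1$, and the mode count are arbitrary illustrative choices):

```python
import numpy as np

# Sine-series solution of u_t = alpha * u_xx on (0, L) with u(0) = u(L) = 0.
L, alpha = np.pi, 0.1
x = np.linspace(0.0, L, 400)
dx = x[1] - x[0]
f = x * (L - x)                                      # initial temperature profile

n = np.arange(1, 101)[:, None]                       # mode numbers, shape (100, 1)
modes = np.sin(n * np.pi * x / L)                    # shape (100, len(x))
b = (2.0 / L) * (f * modes).sum(axis=1) * dx         # sine coefficients b_n

def u(t):
    decay = np.exp(-alpha * (n.ravel() * np.pi / L) ** 2 * t)
    return (b * decay) @ modes                       # superposition of decayed modes

print(np.abs(u(0.0) - f).max())   # small: the series reproduces f at t = 0
print(np.abs(u(50.0)).max())      # the solution has largely decayed toward 0
```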
A.3 Dirichlet's Kernel: Detailed Analysis
Understanding why the Dirichlet kernel causes problems requires a careful look at its behavior.
Closed form derivation:

$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = e^{-iNu}\,\frac{e^{i(2N+1)u} - 1}{e^{iu} - 1} = \frac{\sin\big((N + \tfrac{1}{2})u\big)}{\sin(u/2)}.$$
Properties:
- $D_N(0) = 2N + 1$ (the value at the origin)
- $\frac{1}{2\pi}\int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ has zeros at $u = \frac{2\pi k}{2N+1}$ for $k = 1, \ldots, 2N$
- $D_N$ alternates sign between consecutive zeros: it is NOT a non-negative approximate identity
- $\frac{1}{2\pi}\int_{-\pi}^{\pi}|D_N(u)|\,du \sim \frac{4}{\pi^2}\log N$ (the $L^1$ norm grows logarithmically)

The logarithmic growth of $\|D_N\|_{L^1}$ is the root cause of convergence problems. A proper approximate identity would have bounded $L^1$ norms - the Dirichlet kernel fails this, which is exactly why pointwise convergence can fail for continuous functions.
Contrast with the Fejer kernel:

$$F_N(u) = \frac{1}{N+1}\left(\frac{\sin\frac{(N+1)u}{2}}{\sin\frac{u}{2}}\right)^2.$$

The Fejer kernel satisfies: $F_N \ge 0$, $\frac{1}{2\pi}\int_{-\pi}^{\pi} F_N(u)\,du = 1$, and for any $\delta > 0$: $F_N \to 0$ uniformly on $\delta \le |u| \le \pi$. These three properties make $\{F_N\}$ a genuine approximate identity, guaranteeing uniform convergence of $\sigma_N f$ for continuous $f$.
A.4 Complex Analysis Connection: Laurent Series
The Fourier series of a function on $[-\pi, \pi]$ is closely related to the Laurent series in complex analysis. If $f$ has Fourier series $\sum_{n=-\infty}^{\infty} c_n e^{inx}$, define $z = e^{ix}$ (a point on the unit circle $|z| = 1$). Then:

$$f = \sum_{n=-\infty}^{\infty} c_n z^n.$$

This is a Laurent series in $z$ centered at the origin, evaluated on the unit circle. The Fourier coefficient $c_n$ is the $n$-th coefficient of this Laurent series.

Analyticity and series convergence: If $f$ extends analytically to an annulus $r < |z| < R$ (with $r < 1 < R$), then the Laurent series converges absolutely and uniformly on the unit circle, and the Fourier coefficients decay exponentially: $|c_n| \le C\rho^{|n|}$ for some $\rho < 1$. This is the deep reason why analytic functions have exponentially decaying Fourier coefficients (Section 5.4).
For AI: The connection between Fourier analysis and analytic continuation underlies the theory of analytic functions of neural networks and the frequency-domain analysis of attention patterns. When LLM researchers analyze learned positional encodings in the complex plane, they are (often implicitly) using this Laurent series picture.
A.5 Numerical Experiments: Convergence Visualization
The following experiments (implemented in theory.ipynb) illustrate the key convergence phenomena:
Experiment 1 - Convergence speed: Compute partial sums $S_N f$ for a range of $N$ for the square wave. Plot the $L^2$ error as a function of $N$. Observe the $O(1/\sqrt{N})$ decay rate implied by the $O(1/n)$ coefficients.

Experiment 2 - Gibbs phenomenon: Plot $S_N f$ for the square wave for increasing $N$. Zoom in near the jump at $x = 0$. Measure the overshoot height and verify it is $\approx 9\%$ of the jump height 2.

Experiment 3 - Cesaro means fix Gibbs: Plot $\sigma_N f$ (Cesaro means) alongside $S_N f$. Observe that the Gibbs overshoot is absent in the Cesaro means.

Experiment 4 - Decay rate vs smoothness: Compare the coefficient decay rate for: (a) square wave (discontinuous): $|c_n| \sim 1/n$; (b) triangle wave (continuous, piecewise linear): $|c_n| \sim 1/n^2$; (c) smooth bump function: decays super-polynomially. Plot $\log|c_n|$ vs $\log n$ to see the exponent.

Experiment 5 - RoPE implementation: Implement RoPE as in Exercise 7. Visualize the rotation matrices for the first few positions at frequency index $i = 0$. Verify that consecutive positions differ by a fixed rotation angle $\theta_0$, confirming the Fourier basis interpretation.