Fourier Series: Appendix B (Advanced Topics) through Appendix I (Implementation Details)
Appendix B: Advanced Topics
B.1 Multidimensional Fourier Series
On a $d$-dimensional torus $\mathbb{T}^d = [-\pi, \pi)^d$, the Fourier series generalizes naturally. For $f \in L^2(\mathbb{T}^d)$ and $\mathbf{n} = (n_1, \dots, n_d) \in \mathbb{Z}^d$:

$$f(\mathbf{x}) = \sum_{\mathbf{n} \in \mathbb{Z}^d} c_{\mathbf{n}}\, e^{i\mathbf{n}\cdot\mathbf{x}}, \qquad c_{\mathbf{n}} = \frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} f(\mathbf{x})\, e^{-i\mathbf{n}\cdot\mathbf{x}}\, d\mathbf{x}.$$

The orthonormality relation is $\frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} e^{i\mathbf{m}\cdot\mathbf{x}}\, \overline{e^{i\mathbf{n}\cdot\mathbf{x}}}\, d\mathbf{x} = \delta_{\mathbf{m}\mathbf{n}}$, and Parseval generalizes to $\frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} |f|^2 = \sum_{\mathbf{n}} |c_{\mathbf{n}}|^2$.
For AI - 2D images: Images are functions on a 2D torus (with periodic boundary). The 2D Fourier series (or DFT) represents an image as a sum of 2D complex exponentials $e^{i(n_1 x + n_2 y)}$. Low-frequency components near the origin represent slowly varying regions; high-frequency components represent edges and textures. JPEG compression discards high-frequency 2D DCT coefficients - this is the 2D cosine series applied to 8x8 blocks.
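A quick numerical sketch of this low-frequency dominance (the 64x64 synthetic "image" and the 9x9 central block are arbitrary illustrative choices, not part of any standard):

```python
import numpy as np

# Smooth synthetic "image" plus a little noise; its 2D spectrum should
# concentrate almost all power near zero spatial frequency.
n = np.arange(64)
X, Y = np.meshgrid(n, n)
img = np.sin(2 * np.pi * X / 64) * np.cos(2 * np.pi * Y / 32)
img += 0.05 * np.random.default_rng(0).normal(size=img.shape)

power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
low = power[28:37, 28:37].sum()      # 9x9 block around the (shifted) origin
print(low / power.sum())             # close to 1: low-frequency dominance
```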
B.2 Distributions and the Dirac Delta
For rigorous treatment of impulse-like functions in Fourier analysis, we need distributions (generalized functions). The Dirac delta $\delta(x)$ is defined by its action on smooth test functions $\varphi$:

$$\int_{-\pi}^{\pi} \delta(x)\, \varphi(x)\, dx = \varphi(0).$$

Its Fourier coefficients are $c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} \delta(x)\, e^{-inx}\, dx = \frac{1}{2\pi}$ for all $n$, so its Fourier series is:

$$\delta(x) = \frac{1}{2\pi} \sum_{n \in \mathbb{Z}} e^{inx}.$$

This is the completeness relation of the Fourier system: the identity operator decomposes as a sum over all frequency components. In functional analysis, this is written as $I = \sum_n |e_n\rangle\langle e_n|$ (in Dirac notation).

Verification: For any smooth $\varphi$, pairing the series with $\varphi$ gives $\sum_n \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi(x)\, e^{inx}\, dx = \sum_n c_{-n}(\varphi) = \varphi(0)$, which is exactly the inversion formula evaluated at $0$ (and Parseval confirms the pairing is well-defined).

Comb function: The Dirac comb $\text{Ш}(x) = \sum_{k \in \mathbb{Z}} \delta(x - 2\pi k)$ has Fourier coefficients $c_n = \frac{1}{2\pi}$ for all $n$ - its Fourier series is the same as that of $\delta$. This might seem paradoxical, but on the torus the comb and the single delta coincide: the series context automatically periodizes.
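A numerical check of the completeness relation (a sketch; the test function $e^{\cos x}$ is an arbitrary smooth periodic choice). Pairing the truncated series with $\varphi$ should approach $\varphi(0) = e$:

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 20000)   # even count: avoids x = 0 exactly
phi = np.exp(np.cos(x))                 # smooth 2*pi-periodic test function

for N in [5, 50, 500]:
    # Closed form for (1/2pi) * sum_{|n|<=N} e^{inx}: the Dirichlet kernel
    DN = np.sin((N + 0.5) * x) / (2 * np.pi * np.sin(x / 2))
    print(N, np.trapz(DN * phi, x))     # -> phi(0) = e = 2.71828...
```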
B.3 Pointwise Convergence: The Full Story (Carleson's Theorem)
The question of pointwise convergence of Fourier series remained open for over 130 years after Dirichlet's 1829 result.
Timeline of the convergence problem:
- Dirichlet (1829): Piecewise smooth functions -> pointwise convergence everywhere
- Du Bois-Reymond (1876): Exhibited a continuous function whose Fourier series diverges at one point
- Kolmogorov (1923): Exhibited an $L^1$ function whose Fourier series diverges almost everywhere (strengthened to divergence everywhere in 1926)!
- Carleson (1966): Proved that for every $f \in L^2$, the Fourier series converges pointwise almost everywhere
- Hunt (1968): Extended to $L^p$ for any $p > 1$
Carleson's theorem (he was awarded the Abel Prize in 2006) resolved the central question: $L^2$ functions have Fourier series converging a.e. The proof is one of the most difficult in analysis, involving a refined version of Calderon-Zygmund theory.
Practical implication: For all functions we encounter in ML (bounded measurable, piecewise smooth - hence in $L^2$), Fourier series converge pointwise almost everywhere. The pathological examples live in $L^1 \setminus L^2$ or are delicately constructed continuous functions - nothing that arises in practice.
B.4 Sobolev Spaces and Regularity via Fourier Series
Definition (Sobolev spaces): For $s \geq 0$, the Sobolev space $H^s(\mathbb{T})$ consists of functions $f \in L^2$ whose Fourier coefficients satisfy:

$$\|f\|_{H^s}^2 = \sum_{n \in \mathbb{Z}} (1 + n^2)^s\, |c_n|^2 < \infty.$$

The parameter $s$ measures smoothness: $H^0 = L^2$, $H^1$ is approx "once weakly differentiable with derivative in $L^2$", and for $s > 1/2$, $H^s$ consists of continuous functions (Sobolev embedding theorem).

Fourier characterization of regularity: A function is in $H^s$ exactly when $\sum_n (1 + n^2)^s |c_n|^2 < \infty$; in particular, $|c_n| \leq C\, |n|^{-s - 1/2 - \epsilon}$ for some $\epsilon > 0$ suffices. This makes Sobolev regularity entirely readable from the Fourier spectrum.
For AI - Spectral bias quantified: The spectral bias (Section 7.3) says networks approximate $f$ more easily if $s$ is large (smooth target). Formally, a network trained by gradient descent with $n$ samples achieves risk of order $n^{-2s/(2s+d)}$ when targeting $f \in H^s$ - the standard minimax rate for nonparametric regression. Higher $s$ -> faster convergence -> need fewer samples.

For AI - Kernel smoothness: The RKHS of a kernel is often a Sobolev space $H^s$. The RBF kernel corresponds to $s = \infty$ (infinitely smooth functions), while the Matern kernel with regularity parameter $\nu$ corresponds to $H^{\nu + d/2}$ in dimension $d$. This is why the choice of kernel implicitly assumes a smoothness level for the target function.
B.5 Fejer-Riesz Theorem and Spectral Factorization
Theorem (Fejer-Riesz): Any trigonometric polynomial $p(\theta) = \sum_{n=-N}^{N} a_n e^{in\theta}$ that is non-negative on $[-\pi, \pi]$ can be written as $p(\theta) = |q(e^{i\theta})|^2$, where $q(z) = \sum_{n=0}^{N} b_n z^n$ is an analytic polynomial.

Spectral factorization: Given a non-negative spectral density $S(\omega)$ (such as the power spectral density of a stationary process), we can find a causal filter with frequency response $H(\omega)$ such that $S(\omega) = |H(\omega)|^2$. This is the foundation of the Wiener filter (used in denoising) and Cholesky-like factorizations in probability theory.
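A minimal numerical sketch of the factorization (assumptions: the input is a strictly positive trigonometric polynomial with real autocorrelation coefficients; `spectral_factor` is a name invented here, not a library routine). Roots of the associated Laurent polynomial come in $z \leftrightarrow 1/\bar{z}$ pairs; keeping those inside the unit circle gives the causal (minimum-phase) factor:

```python
import numpy as np

def spectral_factor(r):
    """Factor P(theta) = sum_{|n|<=N} r_{|n|} e^{i n theta} (assumed > 0)
    as |Q(e^{i theta})|^2, given real autocorrelations r = [r_0, ..., r_N].
    Returns Q's coefficients, highest degree first."""
    laurent = np.concatenate([r[::-1], r[1:]])   # coeffs of z^N * P(z), highest power first
    roots = np.roots(laurent)
    q = np.poly(roots[np.abs(roots) < 1])        # keep minimum-phase roots (inside unit circle)
    scale = np.sqrt(np.sum(laurent) / np.abs(np.polyval(q, 1.0))**2)  # match P(0) = |Q(1)|^2
    return scale * q

# Example: P(theta) = 2.5 + 2 cos(theta) > 0 factors as |sqrt(2)(e^{i theta} + 0.5)|^2
b = spectral_factor(np.array([2.5, 1.0]))
theta = np.linspace(0, 2 * np.pi, 9)
P = 2.5 + 2 * np.cos(theta)
Q2 = np.abs(np.polyval(b, np.exp(1j * theta)))**2
print(np.allclose(P, Q2))   # True
```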
For AI - Connection to SSMs: State Space Models (S4, Mamba) parameterize their convolution kernels through a spectral factorization. The kernel's transform is written as $\hat{K}(\omega) = H(\omega)$, where $H$ is constrained to be the frequency response of a stable ARMA-like state-space model. The Fejer-Riesz theorem guarantees that the spectral density can always be factored into a causal part.
B.6 The Discrete Cosine Transform and Signal Compression
The Discrete Cosine Transform (DCT) of Type II is defined as:

$$X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right], \qquad k = 0, 1, \dots, N-1.$$
This is the even-extension Fourier series (cosine series) applied to a finite sequence. The DCT has two critical properties for signal compression:
- Energy compaction: For natural signals (images, audio), the DCT concentrates most of the signal energy in the first few low-frequency coefficients. The DCT of a typical 8x8 image block has 95%+ of its energy in the top-left coefficients.
- Decorrelation: The DCT approximately diagonalizes the covariance matrix of natural signals, making the coefficients nearly uncorrelated. This allows independent quantization of each coefficient.
JPEG compression pipeline:
- Divide image into 8x8 blocks
- Apply 2D DCT to each block
- Quantize coefficients (coarser quantization for high frequencies)
- Run-length encode the quantized sparse coefficients
- Huffman encode
The quality parameter in JPEG controls the quantization step sizes - higher quality means finer quantization of the high-frequency DCT coefficients.
DCT vs DFT: The DCT avoids the "spectral leakage" (Gibbs phenomenon) of the DFT because the even extension of a finite sequence is continuous at the boundaries. This is why JPEG uses DCT rather than DFT.
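A small energy-compaction sketch (the smooth test signal is an arbitrary stand-in for an image row; uses `scipy.fft.dct`):

```python
import numpy as np
from scipy.fft import dct

n = np.arange(64)
x = np.sin(2 * np.pi * n / 64) + 0.5 * np.cos(2 * np.pi * 3 * n / 64)

X = dct(x, norm='ortho')                  # DCT-II with orthonormal scaling
energy = np.cumsum(X**2) / np.sum(X**2)
print(np.searchsorted(energy, 0.95) + 1)  # coefficients needed for 95% of energy
```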
B.7 Uniform Convergence: Weierstrass M-Test Application
Let us prove uniform convergence of the triangle wave Fourier series rigorously.
Claim: The Fourier series of the triangle wave, $|x| = \frac{\pi}{2} - \frac{4}{\pi} \sum_{k\ \text{odd}} \frac{\cos(kx)}{k^2}$, converges uniformly.

Proof via Weierstrass M-Test: We need $|a_k \cos(kx)| \leq M_k$ for all $x$, with $\sum_k M_k < \infty$, where $M_k = \sup_x |a_k \cos(kx)| = |a_k|$.

Then $M_k = \frac{4}{\pi k^2}$ for odd $k$ (and $0$ for even $k$), and $\sum_{k\ \text{odd}} \frac{4}{\pi k^2} < \infty$.
Since the series of bounds converges, the Weierstrass M-test gives uniform convergence.
Contrast with square wave: For the square wave, the corresponding bounds are $M_k = \frac{4}{\pi k}$ for odd $k$, and $\sum_k \frac{1}{k}$ diverges (harmonic series). So the Weierstrass M-test fails - consistent with non-uniform convergence near the jumps. (A numerical comparison follows.)
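A numerical check of the contrast (sketch): the sup-norm error of the triangle-wave partial sums shrinks, while the square wave's does not:

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4001)
triangle, square = np.abs(x), np.sign(x)

for N in [11, 101, 1001]:
    k = np.arange(1, N + 1, 2)   # odd harmonics only
    tri_N = np.pi / 2 - (4 / np.pi) * (np.cos(np.outer(x, k)) / k**2).sum(axis=1)
    sq_N = (4 / np.pi) * (np.sin(np.outer(x, k)) / k).sum(axis=1)
    print(N, np.abs(tri_N - triangle).max(), np.abs(sq_N - square).max())
# Triangle sup-error -> 0 (uniform convergence); square sup-error stays O(1) near the jumps.
```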
B.8 Connection to Functional Analysis: Completeness Proof Sketch
Here is a sketch of why the trigonometric system is complete in - a result we stated but did not prove in Section 2.2.
Step 1: Stone-Weierstrass theorem: Continuous periodic functions can be uniformly approximated by trigonometric polynomials (polynomials in $\cos x$ and $\sin x$, equivalently in $e^{\pm ix}$).

Step 2: Density: Continuous functions are dense in $L^2$ (this is a standard result in measure theory).

Step 3: Combining: For any $f \in L^2$ and $\epsilon > 0$, find continuous $g$ with $\|f - g\|_2 < \epsilon/2$, then find a trigonometric polynomial $p$ with $\|g - p\|_\infty < \frac{\epsilon}{2\sqrt{2\pi}}$, giving $\|g - p\|_2 < \epsilon/2$. Triangle inequality gives $\|f - p\|_2 < \epsilon$.
Full proof: See Section 12-02 Hilbert Spaces where the abstract completeness theorem is proved and the trigonometric system is given as the primary example.
Appendix C: Problem-Solving Strategies
C.1 Computing Fourier Coefficients: A Systematic Approach
Follow this checklist when computing Fourier coefficients:
- Check parity first. Is $f$ even? -> only cosine ($a_k$) terms. Odd? -> only sine ($b_k$) terms. Neither? -> compute all.
- Check for known waveform. Square wave, sawtooth, triangle wave, rectangular pulse - look up or recall the result.
- Split piecewise functions. Write $\int_{-\pi}^{\pi} = \int_{-\pi}^{0} + \int_{0}^{\pi}$ if $f$ is defined piecewise.
- Use integration by parts for polynomial x trig: $\int x^m \cos(kx)\, dx$. Set $u = x^m$, $dv = \cos(kx)\, dx$. Repeat until the polynomial is gone.
- Check boundary terms. For periodic $f$: boundary terms in integration by parts vanish. For non-periodic problems (PDEs), keep them.
- Use the differentiation theorem as a shortcut: if you know the coefficients $c_n(f)$, then $c_n(f') = in\, c_n(f)$.
- Verify via Parseval. Compute $\sum_n |c_n|^2$ and check it equals $\frac{1}{2\pi} \int_{-\pi}^{\pi} |f|^2$ - this catches algebra errors (see the numerical sketch after this list).
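A sketch of the Parseval check for the sawtooth $f(x) = x$ (with the $1/N$ FFT normalization, the discrete identity holds exactly):

```python
import numpy as np

N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = x                                  # sawtooth

c = np.fft.fft(f) / N                  # approximate Fourier coefficients
lhs = np.sum(np.abs(c)**2)             # sum_n |c_n|^2
rhs = np.mean(f**2)                    # (1/2pi) * integral |f|^2, rectangle rule
print(lhs, rhs)                        # both ~ pi^2 / 3
```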
C.2 Determining Convergence Type
| Condition | Conclusion |
|---|---|
| $f$ piecewise smooth | Pointwise convergence to $\frac{1}{2}(f(x^+) + f(x^-))$ (Dirichlet) |
| $f$ continuous + piecewise $C^1$ | Uniform convergence |
| $f \in L^2$ | $\|S_N f - f\|_2 \to 0$ (mean-square) |
| $f \in C^1$, periodic | Uniform convergence; coefficients $o(1/n)$ |
| $f$ analytic | Exponential coefficient decay; exponentially fast uniform convergence |
C.3 Recognizing When Fourier Series Appear in ML
Watch for these signatures of Fourier analysis in ML papers and code:
- $\sin$ and $\cos$ in positional encodings -> Fourier basis
- "Frequency domain" or "spectral" in the title -> likely uses DFT or FT
- Complex multiplication or rotation in attention -> RoPE or a rotary variant
- "Low-pass filter" or "high-pass filter" -> filter = convolution = Fourier operation
- Derivative penalties ($\|\nabla f\|^2$) or weight decay in regularization -> Sobolev-style norm regularization
- "Spectral density" or "power spectral density" -> autocorrelation -> Wiener-Khinchin -> Fourier
- State space model (SSM, S4, Mamba) -> the convolution kernel is parameterized in frequency domain
Appendix D: Deep Dive - Proofs and Derivations
D.1 Proof of Dirichlet's Theorem (Complete)
We prove the central convergence result: for $f$ piecewise smooth and $2\pi$-periodic, $S_N f(x) \to \frac{1}{2}\left(f(x^+) + f(x^-)\right)$ at every $x$.

Setup. Write

$$S_N f(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x - t)\, D_N(t)\, dt, \qquad D_N(t) = \sum_{|n| \leq N} e^{int} = \frac{\sin\!\left(\left(N + \frac{1}{2}\right)t\right)}{\sin(t/2)}.$$

By periodicity, we may center the integral at $t = 0$.

Since $\frac{1}{2\pi} \int_{-\pi}^{\pi} D_N(t)\, dt = 1$, we can write:

$$S_N f(x) - \frac{f(x^+) + f(x^-)}{2} = \frac{1}{2\pi} \int_0^{\pi} \left[f(x+t) - f(x^+)\right] D_N(t)\, dt + \frac{1}{2\pi} \int_0^{\pi} \left[f(x-t) - f(x^-)\right] D_N(t)\, dt.$$

Split the integral. Here we used $\int_{-\pi}^{\pi} = \int_{-\pi}^{0} + \int_{0}^{\pi}$ and $D_N(-t) = D_N(t)$ (even function), which gives $\frac{1}{2\pi} \int_0^{\pi} D_N = \frac{1}{2}$.

Key function. Define:

$$g(t) = \frac{f(x+t) - f(x^+)}{\sin(t/2)} \quad \text{on } (0, \pi]$$

(and its analogue for the second term). We need $g \in L^1(0, \pi)$. Near $t = 0$: the numerator behaves like $f'(x^+)\, t$ (by piecewise smoothness), so numerator $= O(t)$, and $\sin(t/2) \approx t/2$, giving $g = O(1)$ - bounded near $0$. Away from $0$: $g$ is bounded since $f$ is bounded and $\sin(t/2) \geq \sin(\delta/2) > 0$ for $t \geq \delta$. So $g \in L^1$.

Apply Riemann-Lebesgue. The expression becomes:

$$\frac{1}{2\pi} \int_0^{\pi} g(t)\, \sin\!\left(\left(N + \tfrac{1}{2}\right)t\right) dt.$$

By the Riemann-Lebesgue lemma (proved below), this $\to 0$ as $N \to \infty$.

Riemann-Lebesgue Lemma (proof): For $g \in L^1$:

$$\int g(t)\, e^{-i\lambda t}\, dt \to 0 \quad \text{as } |\lambda| \to \infty.$$

Proof for step functions: If $g = \mathbf{1}_{[a,b]}$, then $\int_a^b e^{-i\lambda t}\, dt = \frac{e^{-i\lambda a} - e^{-i\lambda b}}{i\lambda} \to 0$. By linearity, this holds for any step function. For general $g \in L^1$, approximate by a step function $h$ with $\|g - h\|_1 < \epsilon$ and use the bound $\left|\int (g - h)\, e^{-i\lambda t}\right| \leq \epsilon$.
D.2 Parseval's Identity: Detailed Proof
Theorem. For $f \in L^2$ with complex Fourier coefficients $c_n$:

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |f(x)|^2\, dx = \sum_{n \in \mathbb{Z}} |c_n|^2.$$

Proof (assuming completeness). Work with the normalized inner product $\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f \bar{g}$. Consider the partial sum error:

$$\|f - S_N f\|^2 = \|f\|^2 - 2\, \mathrm{Re}\, \langle f, S_N f \rangle + \|S_N f\|^2.$$

The cross-term satisfies $\langle f, S_N f \rangle = \|S_N f\|^2$ (by the projection property: $S_N f$ is the best approximation to $f$ from the span of $\{e^{inx}\}_{|n| \leq N}$).

So $\|f - S_N f\|^2 = \|f\|^2 - \|S_N f\|^2 = \|f\|^2 - \sum_{|n| \leq N} |c_n|^2 \geq 0$ (Bessel).

By completeness, $\|f - S_N f\| \to 0$, so:

$$\|f\|^2 = \lim_{N \to \infty} \sum_{|n| \leq N} |c_n|^2 = \sum_{n \in \mathbb{Z}} |c_n|^2.$$
This gives the result.
Remark: The key step is $\|f\|^2 = \|S_N f\|^2 + \|f - S_N f\|^2$. This is the Pythagorean theorem in $L^2$: $f$ decomposes orthogonally into $S_N f$ (the projection) and $f - S_N f$ (the residual). No analogy - it IS the Pythagorean theorem, in an infinite-dimensional space.
D.3 Gibbs Phenomenon: Precise Quantification
Let $f(x) = \operatorname{sgn}(x)$ on $(-\pi, \pi)$ (square wave). We compute the exact overshoot at the jump $x = 0$.

The partial sum $S_{2m-1}(x) = \frac{4}{\pi} \sum_{j=1}^{m} \frac{\sin((2j-1)x)}{2j-1}$ has its first local maximum at $x_m = \frac{\pi}{2m}$ (the first zero of $S_{2m-1}'(x) = \frac{2}{\pi} \frac{\sin(2mx)}{\sin x}$, which is zero when $2mx = \pi$).

Computing:

$$S_{2m-1}(x_m) = \frac{4}{\pi} \sum_{j=1}^{m} \frac{\sin\!\left((2j-1)\frac{\pi}{2m}\right)}{2j-1}.$$

This is a Riemann sum for $\frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\, dt$ with step $\frac{\pi}{m}$. As $m \to \infty$:

$$S_{2m-1}(x_m) \to \frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\, dt = \frac{2}{\pi} \operatorname{Si}(\pi) \approx 1.17898.$$

The jump in $f$ is from $-1$ to $+1$ (height 2). The overshoot above the limiting value $1$ is $\approx 0.17898$, which is $\approx 8.95\%$ of the jump height.

Numerical value: $\frac{2}{\pi} \operatorname{Si}(\pi) = 1.1789797\ldots$, so the overshoot fraction is $\frac{1}{2}\left(\frac{2}{\pi} \operatorname{Si}(\pi) - 1\right) \approx 0.0895$.

Why 8.95% is universal: The integral $\frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\, dt$ does not depend on the specific piecewise smooth function - only the jump height scales it. Every jump discontinuity in any piecewise smooth function gives a Gibbs overshoot of this universal fraction.
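A numerical confirmation (sketch) that the first maximum converges to $\frac{2}{\pi}\operatorname{Si}(\pi)$:

```python
import numpy as np

for m in [10, 100, 1000, 10000]:
    k = np.arange(1, 2 * m, 2)        # odd harmonics 1, 3, ..., 2m-1
    x_m = np.pi / (2 * m)             # location of the first maximum
    print(m, (4 / np.pi) * np.sum(np.sin(k * x_m) / k))
# -> 1.178979... = (2/pi) Si(pi), essentially independent of m
```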
D.4 The Weierstrass Approximation Theorem and Completeness
Weierstrass Approximation Theorem (1885): Every continuous function on $[a, b]$ can be uniformly approximated by algebraic polynomials.

Trigonometric variant: Every continuous $2\pi$-periodic function can be uniformly approximated by trigonometric polynomials $\sum_{n=-N}^{N} c_n e^{inx}$.

Proof sketch (Fejer's approach): The Cesaro means $\sigma_N f = \frac{1}{N+1} \sum_{n=0}^{N} S_n f$ are trigonometric polynomials (each $S_n f$ is a trigonometric polynomial of degree $n$, and their average is also one). Fejer proved $\sigma_N f \to f$ uniformly for continuous $f$.
Why completeness follows: Given $f \in L^2$ and $\epsilon > 0$:

- Find continuous $g$ with $\|f - g\|_2 < \epsilon/2$ (density of $C(\mathbb{T})$ in $L^2$)
- Find a trigonometric polynomial $p$ with $\|g - p\|_\infty < \frac{\epsilon}{2\sqrt{2\pi}}$ (Weierstrass/Fejer)
- Then $\|g - p\|_2 \leq \sqrt{2\pi}\, \|g - p\|_\infty < \epsilon/2$
- Triangle inequality: $\|f - p\|_2 < \epsilon$
D.5 Fourier Series and Operator Theory
The Fourier series establishes a fundamental connection between function analysis and operator theory.
The operator perspective: Consider the differentiation operator $D = \frac{d}{dx}$ on periodic functions. The eigenfunctions of $D$ are exactly the complex exponentials:

$$D\, e^{inx} = in\, e^{inx},$$

with eigenvalue $in$. The Fourier series is the eigenfunction expansion of $D$. Every linear differential operator with constant coefficients:

$$L = \sum_{j=0}^{m} a_j D^j$$

is diagonalized by the Fourier basis:

$$L\, e^{inx} = p(in)\, e^{inx},$$

where $p(z) = \sum_j a_j z^j$ is the symbol of the operator.
For AI: Attention is (in the linear case) an operator on sequence positions. The linear attention matrix with entries $A_{mn} = h(m - n)$ (for a shift-invariant kernel $h$) is diagonalized by the Fourier basis - its eigenvectors are the complex exponentials and its eigenvalues are the Fourier transform of $h$. This is why spectral analysis of attention patterns connects directly to Fourier theory, and why RoPE (which multiplies by $e^{im\theta}$) is so natural in this framework.

Compact operators: If an operator's eigenvalues $\lambda_n \to 0$ as $|n| \to \infty$, then it is a compact operator. The inverse problem (recovering the input from the output) is then ill-posed (small changes in the data can cause large changes in the answer) - this is the mathematical underpinning of regularization in machine learning.
Appendix E: Summary Tables
E.1 Fourier Coefficient Reference
| Function on $[-\pi, \pi)$ | $c_n$ (complex) | Notes |
|---|---|---|
| $1$ (constant) | $c_0 = 1$, $c_n = 0$ ($n \neq 0$) | DC component only |
| $e^{ikx}$ ($k \in \mathbb{Z}$) | $c_k = 1$, $c_n = 0$ ($n \neq k$) | Single frequency |
| $\cos(kx)$ | $c_{\pm k} = \frac{1}{2}$, rest zero | |
| $\sin(kx)$ | $c_{\pm k} = \mp\frac{i}{2}$, rest zero | |
| $\operatorname{sgn}(x)$ (square wave) | $c_n = \frac{2}{i\pi n}$ ($n$ odd), $0$ (even/zero) | $1/n$ decay |
| $x$ (sawtooth) | $c_n = \frac{i(-1)^n}{n}$ ($n \neq 0$), $c_0 = 0$ | $1/n$ decay |
| $\lvert x \rvert$ (triangle) | $c_0 = \frac{\pi}{2}$, $c_n = -\frac{2}{\pi n^2}$ ($n$ odd), $0$ (even $\neq 0$) | $1/n^2$ decay |
| $x^2$ | $c_0 = \frac{\pi^2}{3}$, $c_n = \frac{2(-1)^n}{n^2}$ ($n \neq 0$) | $1/n^2$ decay |
| analytic (e.g. $\frac{1}{2 - \cos x}$) | - | Exponential decay |
| $\delta(x)$ (Dirac) | $c_n = \frac{1}{2\pi}$ (all $n$) | No decay (distributional) |
E.2 Properties Reference Table
| Property | Time domain | Frequency domain |
|---|---|---|
| Linearity | $\alpha f(x) + \beta g(x)$ | $\alpha c_n + \beta d_n$ |
| Shift | $f(x - a)$ | $e^{-ina}\, c_n$ |
| Modulation | $e^{ikx} f(x)$ | $c_{n-k}$ |
| Differentiation | $f'(x)$ | $in\, c_n$ |
| Integration | $\int_0^x f(t)\, dt$ | $\frac{c_n}{in}$ ($n \neq 0$; requires $c_0 = 0$) |
| Reflection | $f(-x)$ | $c_{-n}$ |
| Conjugation | $\overline{f(x)}$ | $\overline{c_{-n}}$ |
| Convolution | $(f * g)(x) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\, g(x - t)\, dt$ | $c_n d_n$ |
| Multiplication | $f(x)\, g(x)$ | $\sum_k c_k d_{n-k}$ (convolution in freq) |
E.3 Convergence Summary
| Type | Condition | Rate | Gibbs? |
|---|---|---|---|
| Mean-square ($L^2$) | $f \in L^2$ (always) | $\|S_N f - f\|_2 \to 0$ | N/A |
| Pointwise | $f$ piecewise smooth | $O(1/N)$ error at smooth points | Yes at jumps |
| Uniform | $f \in C^1$, periodic | $o(N^{-1/2})$ in $\|\cdot\|_\infty$ | No |
| Superpolynomial | $f \in C^\infty$, periodic | Faster than any power of $1/N$ | No |
| Exponential | $f$ analytic | $O(e^{-aN})$ for some $a > 0$ | No |
| Cesaro | $f$ continuous | Uniform (Fejer) | No |
Appendix F: Complete Worked Problems
F.1 Problem: Full Solution of the Fourier Sine Series of $f(x) = x(\pi - x)$ on $[0, \pi]$

This is a common PDE boundary value problem. We want the sine series of $f(x) = x(\pi - x)$ on $[0, \pi]$ (odd extension):

$$b_n = \frac{2}{\pi} \int_0^{\pi} x(\pi - x) \sin(nx)\, dx.$$

Step 1: Split the integral.

$$b_n = \frac{2}{\pi} \left[ \pi \int_0^{\pi} x \sin(nx)\, dx - \int_0^{\pi} x^2 \sin(nx)\, dx \right].$$

Step 2: Compute $\int_0^{\pi} x \sin(nx)\, dx$ by parts.

Set $u = x$, $dv = \sin(nx)\, dx$:

$$\int_0^{\pi} x \sin(nx)\, dx = \left[ -\frac{x \cos(nx)}{n} \right]_0^{\pi} + \frac{1}{n} \int_0^{\pi} \cos(nx)\, dx = -\frac{\pi (-1)^n}{n}.$$

Step 3: Compute $\int_0^{\pi} x^2 \sin(nx)\, dx$ by parts twice.

First integration by parts: $u = x^2$, $dv = \sin(nx)\, dx$:

$$\int_0^{\pi} x^2 \sin(nx)\, dx = -\frac{\pi^2 (-1)^n}{n} + \frac{2}{n} \int_0^{\pi} x \cos(nx)\, dx.$$

Second integration by parts:

$$\int_0^{\pi} x \cos(nx)\, dx = \left[ \frac{x \sin(nx)}{n} \right]_0^{\pi} - \frac{1}{n} \int_0^{\pi} \sin(nx)\, dx = \frac{(-1)^n - 1}{n^2}.$$

So:

$$\int_0^{\pi} x^2 \sin(nx)\, dx = -\frac{\pi^2 (-1)^n}{n} + \frac{2\left((-1)^n - 1\right)}{n^3}.$$

Step 4: Combine.

$$b_n = \frac{2}{\pi} \left[ -\frac{\pi^2 (-1)^n}{n} + \frac{\pi^2 (-1)^n}{n} - \frac{2\left((-1)^n - 1\right)}{n^3} \right] = \frac{4\left(1 - (-1)^n\right)}{\pi n^3}.$$

Result:

$$x(\pi - x) = \frac{8}{\pi} \sum_{n\ \text{odd}} \frac{\sin(nx)}{n^3}.$$

Verification via Parseval:

$$\frac{2}{\pi} \int_0^{\pi} x^2 (\pi - x)^2\, dx = \frac{2}{\pi} \cdot \frac{\pi^5}{30} = \frac{\pi^4}{15}.$$

Parseval: $\sum_n b_n^2 = \frac{64}{\pi^2} \sum_{n\ \text{odd}} \frac{1}{n^6} = \frac{64}{\pi^2} \cdot \frac{\pi^6}{960} = \frac{\pi^4}{15}$, giving a match. (Verifiable!)

Application to heat equation: The solution of $u_t = u_{xx}$ on $[0, \pi]$ with $u(0, t) = u(\pi, t) = 0$ and $u(x, 0) = x(\pi - x)$ is:

$$u(x, t) = \frac{8}{\pi} \sum_{n\ \text{odd}} \frac{e^{-n^2 t}}{n^3} \sin(nx).$$

At $t = 0$: this reduces to the Fourier series of $x(\pi - x)$. At large $t$: the dominant term is $\frac{8}{\pi} e^{-t} \sin(x)$ - the solution relaxes toward zero via the fundamental mode.
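A numerical check of the coefficients (sketch, for the same $f(x) = x(\pi - x)$):

```python
import numpy as np

x = np.linspace(0, np.pi, 20001)
f = x * (np.pi - x)
for n in range(1, 6):
    b_n = (2 / np.pi) * np.trapz(f * np.sin(n * x), x)
    exact = 8 / (np.pi * n**3) if n % 2 else 0.0
    print(n, round(b_n, 8), round(exact, 8))   # numerical vs analytic
```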
F.2 Problem: Fourier Series of $e^{ikx}\sin(x)$ - The Modulation Property
This trivial-seeming example illuminates the modulation property.
$\sin(x)$ is already a single Fourier mode pair, with $c_{\pm 1} = \mp\frac{i}{2}$ and $c_n = 0$ for $|n| \neq 1$.

Now consider $g(x) = e^{ikx} \sin(x)$ for integer $k$. By the modulation property:

$$d_n = c_{n-k}, \qquad \text{so} \qquad d_{k \pm 1} = \mp\frac{i}{2}, \quad d_n = 0 \text{ otherwise}.$$

So $g$ has all its energy concentrated at frequencies $k \pm 1$. This is the frequency shift or modulation property: multiplying by $e^{ikx}$ shifts the spectrum by $k$.
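A quick FFT verification of the modulation property (sampling on $[0, 2\pi)$ so that bin $n$ is exactly $c_n$):

```python
import numpy as np

N, k = 256, 5
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
c_f = np.fft.fft(np.sin(x)) / N
c_g = np.fft.fft(np.exp(1j * k * x) * np.sin(x)) / N
print(np.allclose(c_g, np.roll(c_f, k)))   # True: spectrum shifted by k bins
```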
Application to RoPE in detail: In RoPE, the query at position $m$ for dimension pair $i$ is:

$$q_m^{(i)} = q^{(i)}\, e^{im\theta_i}.$$

This is multiplication by the complex exponential $e^{im\theta_i}$ - i.e., modulation by frequency $\theta_i$. The inner product with the key at position $n$:

$$\langle q_m, k_n \rangle = \mathrm{Re} \sum_i q^{(i)}\, \overline{k^{(i)}}\, e^{i(m-n)\theta_i}.$$

The score depends only on $m - n$ - relative position - by the modulation-then-conjugate-product structure. This is the Fourier basis shift theorem instantiated in transformer attention.
F.3 Problem: Convergence Rate Comparison
Compare the convergence rates of the Fourier series for four functions:
Function A: Square wave (1 discontinuity per period)

- $|c_n| \sim 1/n$ -> $L^2$ truncation error $\sim N^{-1/2}$
- Needs $\sim 1/\epsilon^2$ terms for relative error $\epsilon$

Function B: Triangle wave (continuous, corners = 1 discontinuity in derivative)

- $|c_n| \sim 1/n^2$ -> uniform truncation error $\sim 1/N$
- Needs $\sim 1/\epsilon$ terms for relative error $\epsilon$

Function C: Smooth periodic bump

- $f$ is analytic -> $|c_n|$ decays exponentially, $|c_n| \leq C e^{-a|n|}$
- Needs $\sim \log(1/\epsilon)$ terms for relative error $\epsilon$

Function D: Slowly varying linear ramp (made continuous at the period boundary)

- $|c_n| \sim 1/n^2$ -> similar to triangle wave
- Fast convergence
Lesson: The algebraic convergence rates for discontinuous or non-smooth functions make Fourier truncation expensive. This motivates wavelets (Section 20-05), which can represent localized features (edges) efficiently without paying the Gibbs penalty.
F.4 Problem: Convolution in Fourier Space
Setup: Let $f(x) = \sin(x)$ and $g(x) = \cos(2x)$. Compute $(f * g)(x)$ using the convolution theorem.

Using the convolution theorem: $\widehat{f * g}(n) = c_n d_n$, with the normalized convolution $(f * g)(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\, g(x - t)\, dt$.

Fourier coefficients of $f$: $c_{\pm 1} = \mp\frac{i}{2}$, rest zero.

Fourier coefficients of $g$: $d_2 = \frac{1}{2}$, $d_{-2} = \frac{1}{2}$, rest zero.

Product $c_n d_n$: This is zero for all $n$ since the non-zero frequencies of $f$ ($\pm 1$) and $g$ ($\pm 2$) are disjoint!

So $f * g \equiv 0$ - the convolution of $\sin(x)$ and $\cos(2x)$ is identically zero.

Direct verification: $(f * g)(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \sin(t) \cos(2(x - t))\, dt = \frac{\cos(2x)}{2\pi} \int_{-\pi}^{\pi} \sin(t) \cos(2t)\, dt + \frac{\sin(2x)}{2\pi} \int_{-\pi}^{\pi} \sin(t) \sin(2t)\, dt$. Each integral is zero by orthogonality.
Lesson: The convolution of two functions with non-overlapping spectra is zero. This is the frequency-domain multiplication interpretation of convolution. In signal processing, this means two signals at different frequencies don't interfere - the superposition principle holds in frequency space.
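The same computation numerically (sketch), using the FFT to perform the circular convolution:

```python
import numpy as np

N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
f, g = np.sin(x), np.cos(2 * x)

# Normalized circular convolution: (f*g)(x) = (1/2pi) * integral f(t) g(x-t) dt
conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real / N
print(np.abs(conv).max())   # ~1e-16: identically zero, as predicted
```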
For AI: This is why attention heads can specialize to different frequency ranges. If one head attends to low-frequency positional patterns and another attends to high-frequency syntactic patterns, their convolution-like operations in frequency space are orthogonal.
F.5 Problem: Parseval and the Isoperimetric Inequality
The isoperimetric inequality - that among all closed curves of length , the circle encloses the maximum area - has an elegant proof via Fourier series.
Setup: Parameterize a smooth closed curve of length $L$ by arclength; rescale so that $L = 2\pi$, giving $(x(s), y(s))$ for $s \in [0, 2\pi]$ with period $2\pi$. The constraint is $\dot{x}(s)^2 + \dot{y}(s)^2 = 1$.

Area via Green's theorem: $A = \frac{1}{2} \oint (x\, dy - y\, dx) = \frac{1}{2} \int_0^{2\pi} \left( x\dot{y} - y\dot{x} \right) ds$.

Fourier expand: $x(s) = \sum_n a_n e^{ins}$, $y(s) = \sum_n b_n e^{ins}$.

Apply Parseval to the arc length constraint: $\frac{1}{2\pi} \int_0^{2\pi} (\dot{x}^2 + \dot{y}^2)\, ds = 1$ -> after Parseval:

$$\sum_n n^2 \left( |a_n|^2 + |b_n|^2 \right) = 1.$$

Bound the area: Using the Fourier expansion and $2|a||b| \leq |a|^2 + |b|^2$:

$$|A| = \left| \pi \sum_n 2n\, \mathrm{Im}\!\left( a_n \bar{b}_n \right) \right| \leq \pi \sum_n |n| \left( |a_n|^2 + |b_n|^2 \right) \leq \pi \sum_n n^2 \left( |a_n|^2 + |b_n|^2 \right) = \pi.$$

Equality holds when only the $n = \pm 1$ terms are non-zero - i.e., when the curve is a circle.

Conclusion: $A \leq \pi = \frac{L^2}{4\pi}$, with equality for a circle. This is the isoperimetric inequality, proved entirely via Parseval's theorem.
For AI: Isoperimetric inequalities in function spaces underlie capacity control in learning theory. The VC dimension and Rademacher complexity measure the "surface area to volume ratio" of hypothesis classes in high-dimensional function spaces - and Fourier analysis provides the tools to compute these quantities.
Appendix G: In-Depth AI Applications
G.1 Transformer Positional Encodings: A Unified View
To understand the deep connection between Fourier series and transformer positional encodings, let us develop the theory carefully.
The fundamental problem in attention. The self-attention mechanism is permutation-equivariant: permuting the input tokens permutes the output, but the attention weights treat all positions symmetrically. Without positional encodings, "the cat sat on the mat" and "the mat sat on the cat" would produce the same attention weights.
The sinusoidal solution (Vaswani et al., 2017). Add a position-dependent vector $p_m$ to the $m$-th token embedding before attention. Use:

$$p_m[2i] = \sin(m\, \omega_i), \qquad p_m[2i+1] = \cos(m\, \omega_i), \qquad \omega_i = 10000^{-2i/d}.$$

The key property. For any fixed offset $\delta$, there exists a linear map $T_\delta$ (independent of $m$) such that $T_\delta\, p_m = p_{m+\delta}$. This is because:

$$\begin{pmatrix} \sin((m+\delta)\omega_i) \\ \cos((m+\delta)\omega_i) \end{pmatrix} = \begin{pmatrix} \cos(\delta\omega_i) & \sin(\delta\omega_i) \\ -\sin(\delta\omega_i) & \cos(\delta\omega_i) \end{pmatrix} \begin{pmatrix} \sin(m\omega_i) \\ \cos(m\omega_i) \end{pmatrix}.$$

Each dimension pair transforms via a rotation by $\delta\omega_i$ - a Fourier shift in 2D.
Why different frequencies? At frequency $\omega_i = 10000^{-2i/d}$:

- $i = 0$: $\omega_0 = 1$, period $2\pi$ - distinguishes adjacent tokens
- intermediate $i$: periods of tens to hundreds of tokens - distinguishes tokens that many positions apart
- $i = d/2 - 1$: $\omega \approx 10^{-4}$, period $\approx 2\pi \cdot 10^4$ - distinguishes tokens up to distances of tens of thousands

This is exactly the geometric progression of harmonics in a Fourier series, tiling scales from a single token up to the full context.
The Fourier basis interpretation. Define $\phi(m) = p_m$. The inner product $\langle p_m, p_{m+\delta} \rangle = \sum_i \cos(\delta\, \omega_i)$ depends only on $\delta$. This is the autocorrelation function of the positional encoding - it measures how "similar" two positions are based on their relative offset.

Limitation. These encodings are absolute (the encoding of position 5 is the same regardless of whether the sequence has length 10 or 1000). For sequences longer than those seen in training, the high-frequency components ($\omega_i$ near 1) still work, but the model has never seen the low-frequency components ($\omega_i \ll 1$) vary much. This causes length generalization failure. (A numerical check of the offset-only property follows.)
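A small check that the inner product depends only on the offset (`sinusoidal_pe` is a hypothetical helper mirroring the formula above):

```python
import numpy as np

def sinusoidal_pe(m, d=64, base=10000.0):
    freqs = base ** (-2 * np.arange(d // 2) / d)
    pe = np.empty(d)
    pe[0::2], pe[1::2] = np.sin(m * freqs), np.cos(m * freqs)
    return pe

# Both pairs have offset 5, so the inner products agree: sum_i cos(5 * w_i)
print(sinusoidal_pe(10) @ sinusoidal_pe(15),
      sinusoidal_pe(100) @ sinusoidal_pe(105))
```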
G.2 RoPE: Relative Positions via Complex Multiplication
The RoPE construction. Instead of adding $p_m$ to the embedding, RoPE rotates the query and key vectors:

$$q_m = R(m)\, q, \qquad k_n = R(n)\, k,$$

where $R(m)$ is the block-diagonal rotation matrix with $2 \times 2$ blocks:

$$R_i(m) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}.$$

Computing the attention score.

$$\langle q_m, k_n \rangle = q^\top R(m)^\top R(n)\, k = q^\top R(n - m)\, k,$$

since $R(m)^\top R(n) = R(n - m)$ (rotation by $(n - m)\theta_i$ in each block). The score is a function of $n - m$ only - position-independent in the absolute sense, position-sensitive in the relative sense.

The complex representation. In complex notation, the $i$-th pair maps to the complex number $z_i = x_{2i} + i\, x_{2i+1}$. The RoPE operation is:

$$z_i \mapsto z_i\, e^{im\theta_i}.$$

This is multiplication by the unit complex number $e^{im\theta_i}$ - a phase rotation of the complex Fourier coefficient at frequency $\theta_i$.

Long-context extrapolation. The base frequency $10000$ means the maximum period is $2\pi \cdot 10000 \approx 6 \times 10^4$ positions. For positions beyond this, the encoding wraps around (becomes periodic) and position information is lost. This is why standard RoPE models struggle at contexts beyond tens of thousands of tokens.

Fix - Extended RoPE: Scale $\theta_i \mapsto \theta_i / s$, where $s$ is the ratio of new context length to training context length (Position Interpolation, Chen et al., 2023). Alternatively, YaRN (Peng et al., 2023) uses a non-uniform frequency scaling based on Fourier analysis of the empirical attention patterns.
G.3 Spectral Bias: Mathematical Theory
The spectral bias (or frequency principle) of neural networks has a rigorous formulation in terms of the NTK and Fourier analysis.
The NTK framework. For a wide neural network trained with gradient descent, the NTK matrix $\Theta$ is approximately constant during training. The training dynamics are:

$$\frac{d f_t}{dt} = -\Theta\, (f_t - f^*).$$

This is a kernel regression problem in function space, where $\Theta$ acts as the kernel.

Spectral decomposition. Let $(\lambda_k, \phi_k)$ be eigenvalues/eigenfunctions of $\Theta$ in $L^2$. The projection of the target $f^*$ onto eigenfunction $\phi_k$ has coefficient $a_k = \langle f^*, \phi_k \rangle$. Under gradient descent (starting from $f_0 = 0$):

$$\langle f_t, \phi_k \rangle = a_k \left( 1 - e^{-\lambda_k t} \right),$$

where $a_k$ is the target coefficient. The coefficient at eigenfrequency $k$ converges at rate $\lambda_k$ - fast if $\lambda_k$ is large, slow if small.
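A toy sketch of these dynamics (assumptions: a diagonalized kernel whose eigenvalues decay like $1/k^2$ in the frequency $k$; all numbers are illustrative):

```python
import numpy as np

ks = np.array([1, 20])         # one low- and one high-frequency target mode
lam = 1.0 / ks**2              # assumed NTK eigenvalue decay with frequency
a = np.array([1.0, 1.0])       # target coefficients

for t in [1, 10, 100, 1000]:
    print(t, a * (1 - np.exp(-lam * t)))   # closed-form gradient-flow solution
# The k=1 mode is essentially learned by t ~ 10; the k=20 mode takes ~400x longer.
```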
The spectral bias. For standard MLPs with Gaussian initialization and ReLU activations, the NTK has eigenvalues that decrease with frequency (high-frequency eigenfunctions have small eigenvalues). Therefore:
- Low-frequency Fourier components have large eigenvalues -> learned quickly
- High-frequency components have small eigenvalues -> learned slowly
Quantitative: the Barron approximation theorem. Functions with finite Barron norm $C_f = \int |\omega|\, |\hat{f}(\omega)|\, d\omega$ can be approximated with $N$ neurons to accuracy $O(C_f / \sqrt{N})$ in $L^2$. The Barron norm penalizes high-frequency components: if $f$ has large high-frequency content ($|\hat{f}(\omega)|$ large for large $|\omega|$), then $C_f$ is large and the network needs more parameters.
Implications for practice:
- Overfitting to noise: High-frequency noise (which lives at large $|n|$) is learned last -> early stopping acts as a frequency low-pass filter
- Data augmentation: Adding geometric augmentations (rotations, flips) disrupts brittle high-frequency features -> forces the network to rely on low-frequency structure -> better generalization
- NeRF: Without positional encoding, networks can't represent the high-frequency variation in scene appearance -> Fourier feature encodings inject high frequencies explicitly
G.4 SIREN Networks: Replacing ReLU with Sine
SIREN (Sitzmann et al., 2020). Instead of using ReLU activations, SIREN uses:

$$\phi(x) = \sin(\omega_0 x),$$

so a layer computes $\mathbf{h}^{(l+1)} = \sin\!\left( W^{(l)} \mathbf{h}^{(l)} + b^{(l)} \right)$.

The derivative is again a SIREN (cosines are just sines with a shifted phase). This is the key property: the derivative of a SIREN is also a SIREN with the same architecture.

Fourier interpretation. The output of a SIREN is a sum of sinusoids at multiple frequencies and phases. The composition of sine activations creates a rich set of Fourier components. The Fourier spectrum of a SIREN output can represent high-frequency content that a standard ReLU network would struggle to learn.

Mathematical property. For a single-layer SIREN with $N$ neurons, the output is:

$$f(x) = \sum_{j=1}^{N} a_j \sin(w_j x + b_j),$$

a sum of Fourier basis functions with learned frequencies, amplitudes, and phases. Unlike a standard Fourier series with fixed frequencies $n = 1, 2, 3, \ldots$, SIREN can choose the frequencies freely - making it more expressive.
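A minimal single-hidden-layer SIREN-style forward pass in NumPy (a sketch with random, untrained parameters; in practice $w$, $b$, $a$ would be learned):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32
w = rng.uniform(-30, 30, N)            # frequencies (random init here)
b = rng.uniform(-np.pi, np.pi, N)      # phases
a = rng.normal(0, 1 / np.sqrt(N), N)   # output weights

def siren(x):
    # f(x) = sum_j a_j * sin(w_j x + b_j): a Fourier sum with free frequencies
    return np.sin(np.outer(x, w) + b) @ a

print(siren(np.linspace(-1, 1, 5)))
```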
Applications: Implicit neural representations (NeRF, video), solving PDEs (physics-informed neural networks), generative modeling of images and 3D shapes.
G.5 The Spectral Analysis of LLM Attention Patterns
Recent research (2023-2024) has revealed striking spectral structure in LLM attention patterns:
Attention as a filter. In a transformer layer, the attention mechanism computes:

$$y_m = \sum_n A_{mn}\, v_n,$$

where $A = \mathrm{softmax}\!\left( QK^\top / \sqrt{d} \right)$. If $A_{mn}$ depends only on $m - n$ (which is approximately true for well-trained models with relative PE), then this is a convolution with kernel $a(\delta) = A_{m, m-\delta}$. In frequency space:

$$\hat{y}(\omega) = \hat{a}(\omega)\, \hat{v}(\omega).$$

The attention pattern is a frequency filter. Different heads learn different filters:

- Local heads: $a(\delta)$ large only for small $|\delta|$ -> low-pass filter (smooth the sequence)
- Global heads: $a(\delta)$ relatively uniform -> filter concentrated at zero frequency (compute the mean)
- Periodic heads: $a(\delta)$ large when $\delta$ is a multiple of some period $T$ -> periodic filter (detect regular patterns)
Empirical findings:
- In BERT, some heads function as "copy heads" (attend to identical tokens) and "position heads" (attend to positions with specific relative offsets)
- In GPT-2/3, induction heads attend strongly to previous occurrences of the same token
- The frequency-domain view reveals why attention heads naturally factorize into frequency bands
WeightWatcher (Martin & Mahoney, 2021). The singular value spectra of weight matrices in LLMs follow power laws with exponent $\alpha$ related to model quality. Healthy models have $\alpha$ roughly in $[2, 6]$. Overfit or poorly trained models have $\alpha$ outside this range. This is a spectral (singular value) analysis of the weight matrices - the spectral density tells us about the effective rank and implicit regularization.
G.6 Circular Convolution and LLMs
The circular convolution theorem (proved in Section 20-04) states that in the DFT domain, elementwise multiplication corresponds to circular convolution in the time domain. This has a direct connection to autoregressive language models.
Autoregressive language models as circular filters. A causal language model computes $p(x_t \mid x_{<t})$ by attending to all previous tokens. In the frequency domain:

$$y = h * x,$$

where $h$ is a one-sided (causal) filter: $h(t) = 0$ for $t < 0$ - only attending to past tokens. In frequency space, causality constrains the frequency response: its real and imaginary parts are linked by a Hilbert transform (the Kramers-Kronig relations).

State space models. Models like S4, H3, and Mamba parameterize the causal convolution kernel directly in frequency space (via a diagonal state space matrix). The output is:

$$y = \mathrm{IDFT}\!\left( \hat{K} \odot \mathrm{DFT}(x) \right),$$

where $\hat{K}$ is the learned frequency response. This is exactly the convolution theorem applied to sequence modeling, and it enables $O(L \log L)$ computation instead of $O(L^2)$ attention.
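A sketch of the FFT-based causal convolution trick (zero-pad to length $2L$ to avoid circular wrap-around; the decaying kernel is an arbitrary stand-in for a learned SSM kernel):

```python
import numpy as np

L = 8
rng = np.random.default_rng(1)
x = rng.normal(size=L)            # input sequence
K = np.exp(-0.5 * np.arange(L))   # causal kernel (decaying impulse response)

y_fft = np.fft.irfft(np.fft.rfft(K, 2 * L) * np.fft.rfft(x, 2 * L), 2 * L)[:L]
y_direct = np.array([np.sum(K[:t + 1] * x[t::-1]) for t in range(L)])
print(np.allclose(y_fft, y_direct))   # True: O(L log L) matches O(L^2)
```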
Appendix H: Connections to Probability and Statistics
H.1 Characteristic Functions as Fourier Transforms of Distributions
The characteristic function of a random variable $X$ is:

$$\varphi_X(t) = \mathbb{E}\left[ e^{itX} \right].$$

This is the Fourier transform of the probability measure $\mu_X$! The characteristic function exists for all random variables (unlike the moment generating function) and uniquely determines the distribution.

Fourier inversion of densities. If $X$ has density $f$:

$$\varphi_X(t) = \int_{\mathbb{R}} e^{itx}\, f(x)\, dx.$$

The density is recovered by the inverse Fourier transform: $f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-itx}\, \varphi_X(t)\, dt$.

Central limit theorem via characteristic functions. Let $X_1, X_2, \ldots$ be i.i.d. with mean $0$ and variance $1$. The standardized sum $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ has characteristic function:

$$\varphi_{Z_n}(t) = \left[ \varphi_X\!\left( \frac{t}{\sqrt{n}} \right) \right]^n.$$

Expanding for small arguments, $\varphi_X(s) = 1 - \frac{s^2}{2} + o(s^2)$, and substituting:

$$\varphi_{Z_n}(t) = \left( 1 - \frac{t^2}{2n} + o\!\left(\tfrac{1}{n}\right) \right)^n \to e^{-t^2/2}.$$

The Gaussian $e^{-t^2/2}$ is the Fourier transform of the standard normal density. The CLT is a convergence in Fourier space to the Gaussian fixed point.
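An empirical sketch of this convergence, using Uniform(-1, 1) summands (variance 1/3):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-3, 3, 7)
for n in [1, 4, 64]:
    Z = rng.uniform(-1, 1, (100_000, n)).sum(axis=1) / np.sqrt(n / 3)  # standardized sum
    ecf = np.exp(1j * np.outer(t, Z)).mean(axis=1).real               # empirical char. function
    print(n, np.abs(ecf - np.exp(-t**2 / 2)).max())                   # deviation shrinks with n
```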
For AI - Training stability. The distribution of gradient updates in neural network training follows a CLT when the batch size is large: the stochastic gradient is a sum of $B$ per-sample gradients, so its distribution is approximately Gaussian for large $B$. This is why large-batch training behaves differently from small-batch training - the gradient noise becomes more Gaussian (less heavy-tailed) as $B$ grows.
H.2 Autocorrelation and Power Spectral Density
For a stationary stochastic process $X_t$ with mean zero, the autocorrelation function is:

$$R(\tau) = \mathbb{E}\left[ X_t\, X_{t+\tau} \right].$$

The power spectral density (PSD) is the Fourier transform of $R$:

$$S(\omega) = \sum_{\tau} R(\tau)\, e^{-i\omega\tau} \quad \text{(discrete time)}.$$

This is the Wiener-Khinchin theorem: the PSD and autocorrelation are a Fourier pair.
Interpretation of PSD:
- $S(\omega) \geq 0$ for all $\omega$ (non-negative by Bochner's theorem)
- $\frac{1}{2\pi} \int S(\omega)\, d\omega = R(0) = \mathbb{E}[X_t^2]$ (total power)
- $S(\omega)$ tells how much of the signal's power is concentrated near frequency $\omega$

White noise: $S(\omega) = \sigma^2$ (constant) -> $R(\tau) = \sigma^2 \delta_{\tau 0}$ (uncorrelated across all lags). White noise has equal power at all frequencies - it is the "maximally random" process.

1/f noise (pink noise): $S(\omega) \sim 1/|\omega|$ -> found in LLM token frequency distributions (Zipf's law), financial returns, EEG signals. The 1/f spectrum is exactly at the boundary between stationary ($S$ integrable) and non-stationary processes.
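A Wiener-Khinchin sanity check on an AR(1) process (a sketch: the inverse FFT of the periodogram gives the sample autocovariance, which should decay geometrically like $a^{|\tau|}$):

```python
import numpy as np

rng = np.random.default_rng(0)
a, T = 0.8, 2**16
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()   # AR(1): x_t = a x_{t-1} + eps_t

psd = np.abs(np.fft.fft(x))**2 / T       # periodogram (empirical PSD)
acov = np.fft.ifft(psd).real             # inverse FFT -> sample autocovariance
print(acov[:4] / acov[0])                # ~ [1, 0.8, 0.64, 0.512]
```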
For AI: The power spectral density of the hidden states in LLMs has been studied empirically. Models trained on natural language show 1/f-like spectra in their activations, consistent with the fractal/scale-free structure of language. This suggests that efficient LLM architectures should process information at multiple timescales simultaneously - exactly what multi-head attention (with different positional frequency ranges per head) achieves.
H.3 Fourier Analysis and the Sampling Theorem
The Shannon-Nyquist Sampling Theorem connects continuous Fourier analysis to discrete signal representation:
Theorem (Shannon, 1949; Nyquist, 1928): A bandlimited signal $f$ with $\hat{f}(\omega) = 0$ for $|\omega| > \Omega$ (bandwidth $\Omega$) can be perfectly reconstructed from its samples $f(nT)$, $T = \pi/\Omega$:

$$f(t) = \sum_{n \in \mathbb{Z}} f(nT)\, \operatorname{sinc}\!\left( \frac{t - nT}{T} \right), \qquad \operatorname{sinc}(u) = \frac{\sin(\pi u)}{\pi u}.$$

Proof sketch. Define the sampled signal $f_s(t) = \sum_n f(nT)\, \delta(t - nT)$. Its Fourier transform is the periodized spectrum: $\hat{f}_s(\omega) = \frac{1}{T} \sum_k \hat{f}\!\left( \omega - \frac{2\pi k}{T} \right)$. Since $f$ is bandlimited to $[-\Omega, \Omega]$, the copies don't overlap (no aliasing). Applying an ideal low-pass filter with cutoff $\Omega$ recovers $\hat{f}$, and inverse FT gives $f$.

Aliasing: If $f$ has frequency content above $\Omega$ but is sampled at spacing $T = \pi/\Omega$, the high-frequency copies in $\hat{f}_s$ overlap, corrupting the spectrum. This is aliasing - high-frequency components masquerade as low-frequency ones. (A two-line demo follows.)
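The classic demo: a 7 Hz sine sampled at 10 Hz (below the 14 Hz Nyquist rate for this signal) is indistinguishable from a negative-frequency 3 Hz sine at the sample instants:

```python
import numpy as np

fs = 10.0                        # sampling rate: too low for a 7 Hz component
t = np.arange(0, 2, 1 / fs)
print(np.allclose(np.sin(2 * np.pi * 7 * t), np.sin(-2 * np.pi * 3 * t)))  # True
```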
For AI - Tokenization: Tokenization is analogous to sampling. A tokenizer samples text at a rate of roughly 1-1.5 tokens/word (for byte-pair encoding). If the "semantic content" of text has structure at finer granularity than 1 word, the tokenizer introduces aliasing: some semantic distinctions are lost. This is why character-level models can capture morphological patterns that word-level models miss.
Anti-aliasing in CNNs: The standard strided convolution in CNNs (stride-2 for downsampling) can alias: it subsamples the feature map by a factor of 2 without first applying a low-pass filter. Zhang (2019) showed that adding a blur filter before strided convolutions dramatically improves shift-equivariance. This is the CNN version of anti-aliasing.
Appendix I: Implementation Details
I.1 Numerical Computation of Fourier Coefficients
Direct formula. For a function given by $N$ samples at equally spaced points $x_j = 2\pi j / N$ (any full period works, by periodicity):

$$c_n \approx \frac{1}{N} \sum_{j=0}^{N-1} f(x_j)\, e^{-inx_j}.$$

This approximates the integral $\frac{1}{2\pi} \int f(x)\, e^{-inx}\, dx$ using the rectangle rule.

Accuracy. For a periodic integrand, the rectangle rule is spectrally accurate: the error is exponentially small for analytic functions and $O(N^{-k})$ for $C^k$ functions. For discontinuous functions, the error is only $O(1/N)$, regardless of how fine the sampling.
Using NumPy:

```python
import numpy as np

# N equispaced samples of f over one period; sample on [0, 2*pi) so that
# np.fft.fft matches c_n directly (sampling on [-pi, pi) introduces an
# extra (-1)^n phase factor).
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
f_samples = f(x)

# Fourier coefficients (approximate)
c = np.fft.fft(f_samples) / N
# c[n] corresponds to c_n for n = 0, 1, ..., N-1
# Negative frequencies: c[N-n] corresponds to c_{-n}
```
Aliasing in discrete computation. The DFT computes only the $N$ coefficients $-N/2 \leq n < N/2$. For a bandlimited function with $c_n = 0$ for $|n| \geq N/2$, this is exact. Otherwise, the $n$-th computed coefficient is actually $\sum_{m \equiv n\ (\mathrm{mod}\ N)} c_m$ (all aliased copies), as the sketch below verifies in the exact case.
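A sanity check of exactness in the bandlimited case (sketch):

```python
import numpy as np

N = 16
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
f = 2.0 + np.cos(3 * x) + np.sin(x)     # bandlimited: c_n = 0 for |n| >= N/2
c = np.fft.fft(f) / N
print(np.round(c[0], 12), np.round(c[3], 12), np.round(c[1], 12))
# c_0 = 2, c_3 = 1/2, c_1 = -i/2 (exact up to floating point)
```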
I.2 Plotting Spectra Correctly
Common mistakes when plotting Fourier spectra:
- Frequency axis: NumPy's FFT output has frequencies in units of cycles/sample. Use `np.fft.fftfreq(N)` to get the correct axis.
- Centering: Use `np.fft.fftshift` to rearrange the output so that zero frequency is in the center.
- One-sided vs two-sided: For real signals, plot only $n \geq 0$ (the two-sided spectrum is conjugate-symmetric). Use `np.fft.rfft` for real signals - it returns only the positive-frequency half.
- dB scale: Power spectra are often plotted in dB: $10 \log_{10} |c_n|^2$, equivalently $20 \log_{10} |c_n|$. This makes it easier to see components spanning many orders of magnitude.
- Parseval check: Always verify $\sum_n |c_n|^2 = \frac{1}{N} \sum_j |f_j|^2$ (with the $1/N$ coefficient normalization). If these don't match, there's a normalization error.

A minimal plotting recipe following this checklist appears below.
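```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: one-sided dB spectrum of a real test signal (signal is arbitrary).
fs, N = 100.0, 1024                          # sample rate (Hz), number of samples
t = np.arange(N) / fs
sig = np.sin(2 * np.pi * 5 * t) + 0.1 * np.sin(2 * np.pi * 20 * t)

c = np.fft.rfft(sig) / N                     # one-sided spectrum of a real signal
freqs = np.fft.rfftfreq(N, d=1 / fs)         # frequency axis in Hz
plt.plot(freqs, 20 * np.log10(np.abs(c) + 1e-12))   # dB scale
plt.xlabel("frequency (Hz)")
plt.ylabel("magnitude (dB)")
plt.show()
```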
I.3 RoPE Implementation Reference
import numpy as np

def rope_embed(x, positions, base=10000):
"""
Apply RoPE to input tensor x.
x: shape (seq_len, d_model) - query or key vectors
positions: NumPy array of integer positions, shape (seq_len,)
Returns: rotated x, same shape
"""
d = x.shape[-1]
# Frequencies: theta_i = base^{-2i/d} for i = 0, ..., d/2 - 1
i = np.arange(0, d, 2)
freqs = base ** (-i / d) # shape (d/2,)
# Angles: m * theta_i
angles = positions[:, None] * freqs[None, :] # (seq_len, d/2)
# Build rotation: cos(angle) and sin(angle)
cos = np.cos(angles) # (seq_len, d/2)
sin = np.sin(angles) # (seq_len, d/2)
# Rotate pairs (x_{2i}, x_{2i+1})
x_even = x[:, 0::2] # shape (seq_len, d/2)
x_odd = x[:, 1::2]
x_rot_even = x_even * cos - x_odd * sin
x_rot_odd = x_even * sin + x_odd * cos
# Interleave back
x_rot = np.empty_like(x)
x_rot[:, 0::2] = x_rot_even
x_rot[:, 1::2] = x_rot_odd
return x_rot
Verification: For queries $q$ and keys $k$ (shape `(1, d_model)` here), the score depends only on the relative position $m - n$:

```python
m, n, delta = 3, 11, 7
score = rope_embed(q, np.array([m])) @ rope_embed(k, np.array([n])).T
score_shifted = rope_embed(q, np.array([m + delta])) @ rope_embed(k, np.array([n + delta])).T
assert np.allclose(score, score_shifted)   # score(m, n) == score(m+delta, n+delta)
```

This confirms that the attention score depends only on relative position.