Fourier Transform: Part 1: Intuition to 7. Tempered Distributions and the Dirac Delta
1. Intuition
1.1 From Fourier Series to Fourier Transform
Recall from Section 20-01 Fourier Series that a $T$-periodic function $f$ has a Fourier series in complex exponential form:

$$f(t) = \sum_{n=-\infty}^{\infty} c_n\, e^{2\pi i n t/T}, \qquad c_n = \frac{1}{T}\int_{-T/2}^{T/2} f(t)\, e^{-2\pi i n t/T}\, dt.$$

The frequencies present are the discrete set $\xi_n = n/T$, separated by spacing $\Delta\xi = 1/T$.

Now ask: what happens as $T \to \infty$? The function is no longer required to repeat - it becomes an arbitrary function on all of $\mathbb{R}$. Three things change simultaneously:

- The frequency spacing $\Delta\xi = 1/T \to 0$: the discrete spectrum becomes a continuum.
- The discrete sum $\sum_n$ becomes an integral $\int d\xi$.
- The coefficient $c_n$ (which scales as $1/T$) becomes a density $\hat{f}(\xi_n) \approx T c_n$.

Concretely, substitute $\xi_n = n/T$, $\Delta\xi = 1/T$ and let $T \to \infty$:

$$f(t) = \sum_{n=-\infty}^{\infty} \left(T c_n\right) e^{2\pi i \xi_n t}\, \Delta\xi \;\longrightarrow\; \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi,$$

where the Fourier Transform (using the $2\pi$-convention) is:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt.$$

This derivation makes the Fourier Transform inevitable: it is not a definition pulled from thin air, but the natural limit of the Fourier series as periodicity is removed. The continuous spectrum $\hat{f}(\xi)$ is the spectral density of $f$ - the amplitude and phase contributed by frequency $\xi$ per unit frequency interval.
Non-example: The derivation above requires that the "coefficients" $T c_n$ are well-defined as $T \to \infty$. For this, we need $\int_{-\infty}^{\infty} |f(t)|\, dt < \infty$, i.e., $f \in L^1(\mathbb{R})$. A constant function $f(t) = 1$ is NOT in $L^1$, so it does not have a classical Fourier Transform - we must extend the theory to distributions (Section 7) to handle it.
1.2 Frequency as a Continuous Variable
In a Fourier series, the spectrum is a discrete set of spikes: the function has energy only at the harmonics $\xi_n = n/T$. In the Fourier Transform, the spectrum is a continuous function $\hat{f}(\xi)$, and the function can have energy spread continuously across all frequencies.
The magnitude spectrum $|\hat{f}(\xi)|$ tells you how much of each frequency is present. The phase spectrum $\arg \hat{f}(\xi)$ tells you the phase offset of each frequency component. Together they completely determine $f$ (via the inversion theorem). Some examples of what the spectrum looks like:
- A pure sinusoid $\cos(2\pi \xi_0 t)$: spectrum is two spikes at $\pm\xi_0$ (as Dirac deltas).
- A rectangular pulse (nonzero on $[-1/2, 1/2]$): spectrum is a sinc function, spreading out in frequency.
- A Gaussian $e^{-\pi t^2}$: spectrum is another Gaussian (this is the self-dual case).
- A chirp (frequency increasing linearly with time): spectrum is spread over a range of frequencies, with the time-frequency tradeoff governed by the uncertainty principle.
- White noise: flat spectrum - equal energy at every frequency.
The key principle is the time-frequency duality: a signal that is concentrated in time (short duration) must have a broad spectrum, and a signal with a narrow spectrum (nearly monochromatic) must extend over a long time. This is not a limitation of our measurement apparatus - it is a mathematical theorem (the Heisenberg uncertainty principle, Section 6).
For AI: This duality directly constrains attention mechanisms. RoPE uses a bank of frequencies $\theta_j$ to encode position, and extending context length (a longer time window) requires lower frequencies - which is why YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) rescale the frequency base to accommodate longer sequences. The uncertainty principle is the fundamental reason why this rescaling is necessary.
1.3 Why It Matters for AI (2026 Perspective)
The Fourier Transform is not peripheral to modern AI - it is structurally present in several of the most important systems:
FNet (Lee-Thorp et al., 2022): Replaces the self-attention sublayer in a Transformer with a 2D Fourier transform over the sequence and embedding dimensions. Achieves 92% of BERT's accuracy on GLUE benchmarks at 7x the training speed, because the FFT is $O(n \log n)$ while attention is $O(n^2)$. The mathematical insight: the FT acts as a "global mixer" that combines all token representations - a weaker but cheaper version of attention.
Random Fourier Features (Rahimi & Recht, 2007, NeurIPS Best Paper): Every shift-invariant kernel $k(x, y) = k(x - y)$ is the Fourier Transform of a non-negative measure (Bochner's theorem). By sampling random frequencies from that measure, one gets a randomized feature map $z: \mathbb{R}^d \to \mathbb{R}^D$ such that $z(x)^\top z(y) \approx k(x, y)$. This reduces kernel machine computation from $O(n^2)$ to $O(nD)$.
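As a concrete illustration, here is a minimal NumPy sketch of Random Fourier Features for the Gaussian (RBF) kernel; the dimensions, bandwidth, and seed are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4096, 1.0          # input dim, feature dim, kernel bandwidth

# Bochner: the RBF kernel is the FT of a Gaussian measure over frequencies,
# so we sample random frequencies from that Gaussian.
W = rng.normal(scale=1.0 / sigma, size=(D, d))   # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)            # random phases

def z(x):
    """Random Fourier feature map: z(x)^T z(y) approximates k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # RBF kernel value
approx = z(x) @ z(y)                                        # randomized estimate
print(exact, approx)   # close for large D
```

The approximation error decays like $1/\sqrt{D}$, so increasing `D` tightens the match.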
Spectral Normalization (Miyato et al., 2018): To train stable GANs, the discriminator's weight matrices are normalized by their largest singular value (the spectral norm). This enforces a Lipschitz constraint on the discriminator, stabilizing training. The "spectral" in the name refers to the spectrum of the weight matrix viewed as a linear operator - operator spectral theory rather than the FT proper, but the same family of ideas.
Fourier Neural Operator (Li et al., 2021): Learns mappings between function spaces by applying a learnable linear transformation in Fourier space, then transforming back. Used for solving PDEs up to 1000x faster than traditional numerical methods. The key is that convolution in space = pointwise multiplication in Fourier space, so global dependencies can be captured efficiently.
RoPE (Su et al., 2021): Used in LLaMA-3, Mistral, Qwen, and virtually every modern LLM. Interprets each position as a rotation in the complex plane, using Fourier basis vectors at different frequencies for different embedding dimensions. The relative-position property follows from the group structure of the complex exponential.
1.4 Historical Timeline
FOURIER TRANSFORM - HISTORICAL MILESTONES
========================================================================
1822 Fourier's "Théorie analytique de la chaleur": introduces the
     integral formula that becomes the Fourier Transform
1880s Rayleigh uses Fourier integrals in optics and acoustics
1898 Heaviside operational calculus (early form of the Laplace/FT)
1915 Plancherel proves the L2 isometry theorem (Plancherel's theorem)
1927 Heisenberg formulates the uncertainty principle in quantum
mechanics; Kennard gives the precise mathematical inequality
1933 Norbert Wiener's "The Fourier Integral and Certain of its
Applications" - rigorous L2 theory; lays groundwork for
signal processing as a mathematical discipline
1949 Shannon's sampling theorem (uses Poisson summation formula)
1965 Cooley & Tukey publish the FFT - makes the FT computable in O(N log N)
1966 Schwartz distributions: FT extended to δ, constants, sinusoids
1975 FT enters digital signal processing (audio, radar, MRI)
2007 Rahimi & Recht: Random Fourier Features (NeurIPS Best Paper)
2017 Vaswani et al.: sinusoidal positional encodings in Transformers
2018 Miyato et al.: spectral normalization for GANs
2021 Li et al.: Fourier Neural Operator for PDE solving
2021 Su et al.: RoPE - now in LLaMA-3, Mistral, Qwen, GPT-NeoX
2022 Lee-Thorp et al.: FNet - Fourier-based alternative to attention
2023 YaRN (Peng et al.): Fourier-based context length extension
2024 LongRoPE (Ding et al.): progressive frequency rescaling to 2M
context length; RWKV-7 uses state-space FT interpretation
========================================================================
1.5 What the Fourier Transform Does Geometrically
The most useful geometric picture is that the Fourier Transform decomposes $f$ in a "basis" of complex exponentials $e^{2\pi i \xi t}$ - just as the Fourier series decomposed a periodic function in the trigonometric basis. The difference is that now the "basis" is uncountably infinite and the "coefficients" form a continuous function rather than a sequence.
Formally, the complex exponentials do not form a basis of $L^2(\mathbb{R})$ in the usual Hilbert space sense (they are not in $L^2$ themselves - they have constant modulus 1 and are not square-integrable). The rigorous statement is that the Fourier Transform is a unitary operator on $L^2(\mathbb{R})$ (Plancherel's theorem), meaning it preserves inner products and norms:

$$\langle f, g \rangle = \langle \hat{f}, \hat{g} \rangle, \qquad \|f\|_2 = \|\hat{f}\|_2.$$

Think of the Fourier Transform as a rotation in an infinite-dimensional space: it rotates the function into a new coordinate system (the frequency domain) where the same function looks different but has exactly the same "length" (energy). Just as rotating a vector in $\mathbb{R}^n$ preserves its norm but changes its coordinates, the Fourier Transform preserves energy but expresses it in frequency coordinates.
A second geometric picture: the Fourier Transform of the convolution $f * g$ is the pointwise product $\hat{f}\,\hat{g}$. Convolution in time/space corresponds to multiplication in frequency. This is the key theorem for signal processing and the foundation of efficient convolution in CNNs via the FFT. (Full treatment in Section 20-04 Convolution Theorem.)
GEOMETRIC PICTURE OF THE FOURIER TRANSFORM
========================================================================
Time domain                      Frequency domain
-------------                    ----------------
f(t): a function of time         f̂(ξ): a function of frequency
Signal duration T         <-->   Bandwidth ~ 1/T (uncertainty principle)
Convolution f*g           <-->   Pointwise product f̂ ĝ (Convolution Theorem)
Differentiation d/dt      <-->   Multiplication by 2πiξ
Translation f(t-a)        <-->   Modulation e^{-2πiaξ} f̂(ξ)
Scaling f(at)             <-->   Dilation (1/|a|) f̂(ξ/a)
The FT is a UNITARY OPERATOR on L2(R):
  ‖f‖₂ = ‖f̂‖₂              (Plancherel)
  ⟨f,g⟩ = ⟨f̂,ĝ⟩            (Parseval)
========================================================================
2. Formal Definitions
2.1 The Fourier Transform on $L^1(\mathbb{R})$
Definition 2.1 (Fourier Transform on $L^1$). For $f \in L^1(\mathbb{R})$ (i.e., $\int_{-\infty}^{\infty} |f(t)|\, dt < \infty$), the Fourier Transform of $f$ is the function $\hat{f}: \mathbb{R} \to \mathbb{C}$ defined by:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt.$$

We write $\hat{f} = \mathcal{F}[f]$ or $\mathcal{F}: f \mapsto \hat{f}$.
Well-definedness. For $f \in L^1$ and any $\xi \in \mathbb{R}$:

$$|\hat{f}(\xi)| \le \int_{-\infty}^{\infty} |f(t)|\, \left|e^{-2\pi i \xi t}\right|\, dt = \|f\|_1.$$

So $\hat{f}$ is bounded: $\|\hat{f}\|_\infty \le \|f\|_1$. Moreover, one can show (by dominated convergence) that $\hat{f}$ is continuous on $\mathbb{R}$, and by the Riemann-Lebesgue lemma, $\hat{f}(\xi) \to 0$ as $|\xi| \to \infty$.
Theorem 2.1 (Riemann-Lebesgue Lemma). If $f \in L^1(\mathbb{R})$, then $\hat{f} \in C_0(\mathbb{R})$ (continuous functions vanishing at infinity):

$$\lim_{|\xi| \to \infty} \hat{f}(\xi) = 0.$$

Proof sketch. For step functions, the integral reduces to a finite sum of terms of the form $\frac{e^{-2\pi i \xi a} - e^{-2\pi i \xi b}}{2\pi i \xi}$, each going to 0 as $|\xi| \to \infty$. Approximate a general $f \in L^1$ by step functions and use the boundedness $\|\hat{f}\|_\infty \le \|f\|_1$.
What $\hat{f}$ tells you. The value $\hat{f}(\xi)$ is a complex number encoding:
- $|\hat{f}(\xi)|$: the amplitude of frequency $\xi$ in $f$
- $\arg \hat{f}(\xi)$: the phase of frequency $\xi$ in $f$
The power spectrum $|\hat{f}(\xi)|^2$ gives the energy density per unit frequency at $\xi$.
2.2 Convention War: $\xi$ vs $\omega$ vs symmetric $\omega$
The Fourier Transform appears in three notational conventions in the literature, all equivalent but related by rescaling:
| Convention | Transform formula | Inverse formula | Used in |
|---|---|---|---|
| Frequency $\xi$ (Hz) | $\hat{f}(\xi) = \int f(t)\, e^{-2\pi i \xi t}\, dt$ | $f(t) = \int \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$ | Signal processing, probability |
| Angular freq $\omega$ (rad/s), no prefactor | $\hat{f}(\omega) = \int f(t)\, e^{-i \omega t}\, dt$ | $f(t) = \frac{1}{2\pi}\int \hat{f}(\omega)\, e^{i \omega t}\, d\omega$ | Physics, engineering |
| Angular freq $\omega$, symmetric | $\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}}\int f(t)\, e^{-i \omega t}\, dt$ | $f(t) = \frac{1}{\sqrt{2\pi}}\int \hat{f}(\omega)\, e^{i \omega t}\, d\omega$ | Pure mathematics |

This section uses the $2\pi$-convention (row 1). It is self-symmetric (no $2\pi$ prefactors), maps the Gaussian $e^{-\pi t^2}$ to itself, and is standard in modern ML papers (FNet, FNO, RFF all use it). The relationship between conventions: $\hat{f}_{2\pi}(\xi) = \hat{f}_{\omega}(2\pi\xi)$.
Warning: The most common source of errors in Fourier analysis is mixing conventions. When reading a paper, identify the convention in the first equation before proceeding. Properties like "differentiation multiplies by $2\pi i \xi$" vs "multiplies by $i\omega$" depend entirely on which convention is in use.
2.3 Standard Examples
These four transforms appear constantly in applications and should be memorized.
Example 1: Rectangular Pulse (rect function)

$$\mathrm{rect}(t) = \begin{cases} 1 & |t| \le 1/2 \\ 0 & \text{otherwise} \end{cases} \qquad \widehat{\mathrm{rect}}(\xi) = \frac{\sin(\pi\xi)}{\pi\xi} = \mathrm{sinc}(\xi).$$

The sinc function oscillates and decays as $1/|\xi|$. The slow decay reflects the sharp discontinuity of $\mathrm{rect}$ - sharp features in time produce slow decay in frequency (this is the spectral analog of the Gibbs phenomenon from Section 20-01).
A general pulse of width $T$: $\mathrm{rect}(t/T)$ has transform $T\,\mathrm{sinc}(T\xi)$. Wider pulse -> narrower, taller sinc lobe.
Example 2: Gaussian

$$g(t) = e^{-\pi t^2} \qquad \hat{g}(\xi) = e^{-\pi \xi^2}.$$

The Gaussian is self-dual under the Fourier Transform with the $2\pi$-convention: $\hat{g} = g$. This is unique among "nice" functions and makes the Gaussian the natural test function in Fourier analysis, the ground state in quantum mechanics, and the kernel of the heat equation.
Derivation: Complete the square in the exponent: $-\pi t^2 - 2\pi i \xi t = -\pi(t + i\xi)^2 - \pi\xi^2$. Then:

$$\hat{g}(\xi) = e^{-\pi \xi^2} \int_{-\infty}^{\infty} e^{-\pi(t + i\xi)^2}\, dt = e^{-\pi \xi^2},$$

where the integral equals 1 by contour integration (the Gaussian integrand is analytic and the contour shift is justified because it decays rapidly).
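The self-duality can be checked numerically by approximating the defining integral with a Riemann sum on a truncated grid (grid bounds and spacing are arbitrary demo choices):

```python
import numpy as np

# Fine grid; the Gaussian is negligible beyond |t| = 10.
t = np.linspace(-10, 10, 20001)
dt = t[1] - t[0]
g = np.exp(-np.pi * t**2)

def ft(f_vals, xi):
    """Riemann-sum approximation of the 2*pi-convention FT at frequency xi."""
    return np.sum(f_vals * np.exp(-2j * np.pi * xi * t)) * dt

# Compare the numerical transform against e^{-pi xi^2} at a few frequencies.
for xi in [0.0, 0.5, 1.0, 2.0]:
    print(xi, ft(g, xi).real, np.exp(-np.pi * xi**2))
```

The two columns agree to many digits because the Gaussian decays so fast that both truncation and discretization errors are negligible.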
Example 3: Lorentzian (Cauchy distribution)

$$f(t) = \frac{1}{\pi}\frac{1}{1 + t^2} \qquad \hat{f}(\xi) = e^{-2\pi|\xi|}.$$

The exponential decay in frequency reflects the analyticity of the Lorentzian: it extends holomorphically to a strip around the real axis, and the poles at $t = \pm i$ set the decay rate. The relationship between smoothness and spectral decay is captured by the general theorem in Section 3.5.
Example 4: One-Sided Exponential

$$f(t) = e^{-t}\,\mathbf{1}_{t \ge 0} \qquad \hat{f}(\xi) = \frac{1}{1 + 2\pi i \xi}.$$

This has modulus $|\hat{f}(\xi)| = \frac{1}{\sqrt{1 + 4\pi^2 \xi^2}}$, which decays as $1/|\xi|$ for large $|\xi|$ - reflecting the jump discontinuity at $t = 0$.
Non-example 1: $f(t) = \mathrm{sinc}(t)$. This is in $L^2$ but not $L^1$ (it decays too slowly), so the integral does not converge absolutely and the definition above does not apply. The $L^2$ theory (Section 5) handles this case.
Non-example 2: $f(t) = \cos(2\pi \xi_0 t)$. This is bounded but not in $L^1$ or $L^2$ - it does not decay. Its Fourier Transform exists only as a distribution: $\hat{f} = \frac{1}{2}\left(\delta_{\xi_0} + \delta_{-\xi_0}\right)$ (see Section 7).
2.4 The Inverse Fourier Transform
Definition 2.2 (Inverse Fourier Transform). For $g \in L^1(\mathbb{R})$:

$$(\mathcal{F}^{-1}g)(t) = \int_{-\infty}^{\infty} g(\xi)\, e^{+2\pi i \xi t}\, d\xi.$$

Note: the inverse is the same as the forward transform except the sign in the exponent is flipped ($e^{-2\pi i \xi t} \to e^{+2\pi i \xi t}$). This symmetric form is one of the advantages of the $2\pi$-convention.
The inversion problem: Given $\hat{f}$, can we recover $f$? Yes, under mild conditions:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi,$$

but this requires knowing that $\hat{f} \in L^1$ (which is not guaranteed by $f \in L^1$ alone - for instance, $\widehat{\mathrm{rect}} = \mathrm{sinc}$ is not in $L^1$). The full inversion theorem is treated rigorously in Section 4.
2.5 Non-Examples and the $L^2$ Extension
The classical definition fails for many important functions:
| Function | Why FT fails | Solution |
|---|---|---|
| $e^{-t^2}$ (broader Gaussian) | In $L^1$ - this one is fine | Not a failure |
| $\mathrm{sinc}(t)$ | Not in $L^1$ | Use Plancherel (Section 5) |
| $f(t) = 1$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $\cos(2\pi \xi_0 t)$ | Not in $L^1$ or $L^2$ | Use distributions (Section 7) |
| $\delta(t)$ | Not a function | Use distributions (Section 7) |
The $L^2$ extension proceeds via a density argument. The subspace $L^1 \cap L^2$ is dense in $L^2$. For $f \in L^1 \cap L^2$, the classical integral defines $\hat{f}$. The Plancherel theorem (Section 5) then shows $\|\hat{f}\|_2 = \|f\|_2$, which allows extending $\mathcal{F}$ to all of $L^2$ by continuity. The resulting transform is a unitary operator on $L^2(\mathbb{R})$.
3. Core Properties
3.1 Linearity and Conjugate Symmetry
Theorem 3.1 (Linearity). For $f, g \in L^1$ and $a, b \in \mathbb{C}$:

$$\widehat{af + bg} = a\hat{f} + b\hat{g}.$$

Proof: Immediate from linearity of the integral.
Theorem 3.2 (Conjugate Symmetry / Hermitian Property). For real-valued $f$:

$$\hat{f}(-\xi) = \overline{\hat{f}(\xi)}.$$

Proof: $\hat{f}(-\xi) = \int f(t)\, e^{2\pi i \xi t}\, dt = \overline{\int \overline{f(t)}\, e^{-2\pi i \xi t}\, dt} = \overline{\hat{f}(\xi)}$, where the last step uses $\overline{f(t)} = f(t)$ (real-valued).
Corollaries for real $f$:
- $|\hat{f}(-\xi)| = |\hat{f}(\xi)|$: the magnitude spectrum is even
- $\arg\hat{f}(-\xi) = -\arg\hat{f}(\xi)$: the phase spectrum is odd
- If $f$ is also even, then $\hat{f}$ is real-valued and even
- If $f$ is also odd, then $\hat{f}$ is purely imaginary and odd
For AI: The Hermitian symmetry means that for real signals, half the spectrum is redundant - you only need frequencies $\xi \ge 0$. This is why the FFT of a real length-$N$ signal produces only $N/2 + 1$ unique complex outputs from $N$ real inputs. Real-valued weight matrices in neural networks have Hermitian-symmetric spectra, which is exploited in efficient spectral computation.
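A quick NumPy check of the Hermitian symmetry and the real-FFT output count (the signal length 16 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=16)              # a real signal
X = np.fft.fft(x)

# Hermitian symmetry for real input: X[N-k] = conj(X[k]).
# X[1:][::-1] is X[15], X[14], ..., X[1], which should equal conj(X[1:]).
print(np.allclose(X[1:][::-1], np.conj(X[1:])))   # True

# rfft keeps only the non-redundant half: N//2 + 1 = 9 outputs for N = 16.
print(len(np.fft.rfft(x)))
```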
3.2 Time-Shift and Frequency-Shift (Modulation)
Theorem 3.3 (Time-Shift / Translation). For $a \in \mathbb{R}$:

$$\widehat{f(\cdot - a)}(\xi) = e^{-2\pi i a \xi}\, \hat{f}(\xi).$$

Proof: $\int f(t - a)\, e^{-2\pi i \xi t}\, dt = \int f(s)\, e^{-2\pi i \xi (s + a)}\, ds = e^{-2\pi i a \xi}\, \hat{f}(\xi)$ (substituting $s = t - a$).
Interpretation: Shifting a signal in time multiplies its spectrum by a complex phase factor. The magnitude spectrum is unchanged - a time shift does not affect which frequencies are present, only their phases. The phase changes linearly with frequency: $\Delta\phi(\xi) = -2\pi a \xi$.
Theorem 3.4 (Frequency-Shift / Modulation). For $\xi_0 \in \mathbb{R}$:

$$\widehat{e^{2\pi i \xi_0 t} f(t)}(\xi) = \hat{f}(\xi - \xi_0).$$

Proof: $\int e^{2\pi i \xi_0 t} f(t)\, e^{-2\pi i \xi t}\, dt = \int f(t)\, e^{-2\pi i (\xi - \xi_0) t}\, dt = \hat{f}(\xi - \xi_0)$.
Interpretation: Multiplying by a complex exponential (modulation) shifts the spectrum. This is the mathematical basis for amplitude modulation (AM) radio and for the RoPE positional encoding.
For AI - RoPE connection: In RoPE, the query at position $m$ is $q\, e^{i m \theta}$ (rotation by $m\theta$ in the complex plane). The inner product between the query at $m$ and the key at $n$ is $\mathrm{Re}\!\left(q \bar{k}\, e^{i(m-n)\theta}\right)$, depending only on the relative position $m - n$. This is the modulation theorem in action.
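A minimal sketch of this relative-position property for a single RoPE frequency channel, treating the dimension pair as one complex number (the frequency value `theta` here is a hypothetical choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.01                          # one frequency channel (demo value)
q = rng.normal() + 1j * rng.normal()  # query, as a complex number
k = rng.normal() + 1j * rng.normal()  # key

def rope(v, pos):
    """Rotate a complex embedding by pos * theta (RoPE for one dim pair)."""
    return v * np.exp(1j * pos * theta)

# The score Re(q_m conj(k_n)) = Re(q conj(k) e^{i(m-n)theta}) depends only on m - n:
s1 = (rope(q, 100) * np.conj(rope(k, 90))).real     # positions 100 and 90
s2 = (rope(q, 1010) * np.conj(rope(k, 1000))).real  # positions 1010 and 1000
print(s1, s2)   # equal: both pairs have relative position m - n = 10
```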
3.3 Scaling and Dilation
Theorem 3.5 (Scaling / Dilation). For $a \in \mathbb{R}$, $a \ne 0$:

$$\widehat{f(at)}(\xi) = \frac{1}{|a|}\, \hat{f}\!\left(\frac{\xi}{a}\right).$$

Proof: $\int f(at)\, e^{-2\pi i \xi t}\, dt = \frac{1}{|a|}\int f(s)\, e^{-2\pi i (\xi/a) s}\, ds = \frac{1}{|a|}\hat{f}(\xi/a)$ (substituting $s = at$).
Interpretation: This is the time-bandwidth product in action:
- Compressing a signal in time ($|a| > 1$, so $f(at)$ is narrower): the spectrum stretches by factor $|a|$ and shrinks in amplitude by $1/|a|$
- Stretching a signal in time ($|a| < 1$): the spectrum compresses
This confirms the time-frequency duality: doubling the duration halves the bandwidth, and vice versa. This is a hard mathematical constraint, not an engineering limitation.
3.4 Time Reversal
Theorem 3.6 (Time Reversal).

$$\widehat{f(-t)}(\xi) = \hat{f}(-\xi).$$

Proof: $\int f(-t)\, e^{-2\pi i \xi t}\, dt = \int f(s)\, e^{2\pi i \xi s}\, ds = \hat{f}(-\xi)$ (substituting $s = -t$).
Duality: There is a deeper symmetry: $\mathcal{F}[\hat{f}](t) = f(-t)$. Applying the Fourier Transform four times returns the original function: $\mathcal{F}^4 = \mathrm{id}$. This means the FT has eigenvalues $\{1, -i, -1, i\}$ and the Hermite functions are its eigenfunctions (with the Gaussian as the eigenfunction for eigenvalue 1).
3.5 Differentiation and Integration
Theorem 3.7 (Differentiation in Time). If $f, f' \in L^1$:

$$\widehat{f'}(\xi) = 2\pi i \xi\, \hat{f}(\xi).$$

More generally, $\widehat{f^{(n)}}(\xi) = (2\pi i \xi)^n\, \hat{f}(\xi)$.
Proof: Integration by parts: $\int f'(t)\, e^{-2\pi i \xi t}\, dt = \left[f(t)\, e^{-2\pi i \xi t}\right]_{-\infty}^{\infty} + 2\pi i \xi \int f(t)\, e^{-2\pi i \xi t}\, dt$. The boundary term vanishes since $f, f' \in L^1$ implies $f(t) \to 0$ as $|t| \to \infty$.
Theorem 3.8 (Differentiation in Frequency). If $t f(t) \in L^1$:

$$\widehat{(-2\pi i t)\, f(t)}(\xi) = \hat{f}'(\xi).$$

Equivalently: $\widehat{t f(t)}(\xi) = \frac{i}{2\pi}\, \hat{f}'(\xi)$.
Key consequence for smoothness vs. spectral decay:
| Smoothness of $f$ | Decay of $\hat{f}$ |
|---|---|
| $f \in C^k$ and $f^{(k)} \in L^1$ | $|\hat{f}(\xi)| = o(|\xi|^{-k})$ |
| $f \in C^\infty$ (smooth, all derivatives in $L^1$) | decays faster than any polynomial |
| $f$ has a jump discontinuity | $|\hat{f}(\xi)| \sim 1/|\xi|$ |
| $f$ is analytic (in a strip) | decays exponentially |
For AI: The differentiation property is why Fourier methods are efficient for solving differential equations. It converts the heat equation $\partial_t u = \partial_x^2 u$ into the ODE $\partial_t \hat{u}(\xi, t) = -4\pi^2 \xi^2\, \hat{u}(\xi, t)$, which is solved by $\hat{u}(\xi, t) = e^{-4\pi^2 \xi^2 t}\, \hat{u}(\xi, 0)$ - a simple exponential decay, mode by mode. The Fourier Neural Operator (Section 8.4) exploits this by learning the spectral solution operator directly.
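This remark can be turned into a few lines of code: on a periodic grid the FFT diagonalizes $\partial_x^2$, so each Fourier mode simply decays. A sketch (grid size, domain, and initial condition are arbitrary demo choices):

```python
import numpy as np

# Solve u_t = u_xx on a periodic domain [0, 2*pi) by working in Fourier space.
N, L, T = 256, 2 * np.pi, 0.1
x = np.linspace(0, L, N, endpoint=False)
u0 = np.sin(3 * x)                               # initial condition: one mode

k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi       # angular wavenumbers
u_hat = np.fft.fft(u0) * np.exp(-k**2 * T)       # each mode decays as e^{-k^2 t}
u = np.fft.ifft(u_hat).real

# Exact solution for this mode: sin(3x) * e^{-9T}
print(np.max(np.abs(u - np.sin(3 * x) * np.exp(-9 * T))))   # ~ machine epsilon
```

No time stepping is needed: the spectral solution is exact for each mode, which is precisely the structure the Fourier Neural Operator learns.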
3.6 The Master Properties Table
| Property | $f(t)$ | $\hat{f}(\xi)$ |
|---|---|---|
| Linearity | $a f(t) + b g(t)$ | $a \hat{f}(\xi) + b \hat{g}(\xi)$ |
| Time shift | $f(t - a)$ | $e^{-2\pi i a \xi}\, \hat{f}(\xi)$ |
| Frequency shift | $e^{2\pi i \xi_0 t} f(t)$ | $\hat{f}(\xi - \xi_0)$ |
| Scaling | $f(at)$ | $\frac{1}{\|a\|}\, \hat{f}(\xi/a)$ |
| Time reversal | $f(-t)$ | $\hat{f}(-\xi)$ |
| Conjugation | $\overline{f(t)}$ | $\overline{\hat{f}(-\xi)}$ |
| Hermitian (real $f$) | $f(t) \in \mathbb{R}$ | $\hat{f}(-\xi) = \overline{\hat{f}(\xi)}$ |
| Differentiation | $f'(t)$ | $2\pi i \xi\, \hat{f}(\xi)$ |
| Freq. differentiation | $-2\pi i t\, f(t)$ | $\hat{f}'(\xi)$ |
| Convolution | $(f * g)(t)$ | $\hat{f}(\xi)\, \hat{g}(\xi)$ |
| Multiplication | $f(t)\, g(t)$ | $(\hat{f} * \hat{g})(\xi)$ |
| Duality | $\hat{f}(t)$ | $f(-\xi)$ |
3.7 Convolution-Multiplication Duality - Preview
The Convolution Theorem states that the Fourier Transform converts convolution into pointwise multiplication:

$$\widehat{f * g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi),$$

where $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$.
This is the most important property of the Fourier Transform for applications. It means that:
- Linear filtering (convolution with a filter $h$) corresponds to pointwise multiplication of spectra
- A filter's effect on frequency $\xi$ is simply multiplication by $\hat{h}(\xi)$ (the frequency response)
- Convolution of length-$N$ signals costs $O(N^2)$ directly but $O(N \log N)$ via FFT
Preview: The full treatment of the convolution theorem - including circular convolution, cross-correlation, the Wiener-Khinchin theorem, and applications to CNNs, WaveNet, and Mamba - is in Section 20-04 Convolution Theorem. Here we state the theorem and use it; the proof and all applications belong there.
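A minimal numerical confirmation that FFT-based convolution matches direct convolution (the signal lengths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.normal(size=400)
g = rng.normal(size=400)

# Direct (linear) convolution: O(N^2)
direct = np.convolve(f, g)

# FFT convolution: zero-pad to the full output length, multiply spectra, invert.
n = len(f) + len(g) - 1
fast = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fast))   # True: convolution = product of spectra
```

The zero-padding to length `n` is what turns the FFT's circular convolution into the linear convolution computed by `np.convolve`.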
4. The Fourier Inversion Theorem
4.1 Statement and Sufficient Conditions
The fundamental question: given , can we recover ? The answer is yes under appropriate conditions, but the precise statement requires care.
Theorem 4.1 (Fourier Inversion Theorem). Suppose $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then for almost every $t$:

$$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi.$$

Moreover, if $f$ is also continuous at $t$, equality holds exactly (not just a.e.).
The subtlety: The condition $\hat{f} \in L^1$ is not automatic. If $f \in L^1$, then $\hat{f}$ is bounded and continuous (by Riemann-Lebesgue), but not necessarily in $L^1$. For example, $f = \mathrm{rect}$ gives $\hat{f} = \mathrm{sinc}$, and $\mathrm{sinc} \notin L^1$ (it decays too slowly: $|\mathrm{sinc}(\xi)| \sim 1/|\xi|$). The inversion of the rect function requires the $L^2$ theory.
Sufficient condition for inversion: If $f \in L^1$ and $f$ has bounded variation near $t$ (e.g., $f$ is differentiable at $t$), then the principal value integral:

$$\lim_{R \to \infty} \int_{-R}^{R} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$$

converges to $f(t)$ where $f$ is continuous, and to $\frac{f(t^+) + f(t^-)}{2}$ at jump discontinuities - exactly as in Dirichlet's theorem for Fourier series.
4.2 Proof via Approximate Identity
The standard proof uses the concept of an approximate identity - a family of functions that converge to the Dirac delta as a parameter goes to zero.
Definition (Approximate Identity). A family $\{k_\epsilon\}_{\epsilon > 0} \subset L^1(\mathbb{R})$ is an approximate identity if:
- $\int k_\epsilon(t)\, dt = 1$ for all $\epsilon$
- For every $f \in L^1$: $\|k_\epsilon * f - f\|_1 \to 0$ as $\epsilon \to 0$
The Gauss-Weierstrass kernel $k_\epsilon(t) = \epsilon^{-1} e^{-\pi t^2/\epsilon^2}$ is an approximate identity.
Proof sketch of inversion theorem:
Define the Gauss-regularized inversion:

$$I_\epsilon(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-\pi \epsilon^2 \xi^2}\, e^{2\pi i \xi t}\, d\xi.$$

The factor $e^{-\pi \epsilon^2 \xi^2}$ provides absolute convergence (the Gaussian is in $L^1$). Using Fubini's theorem to interchange integration order:

$$I_\epsilon(t) = \int f(s) \left[\int e^{-\pi \epsilon^2 \xi^2}\, e^{2\pi i \xi (t - s)}\, d\xi\right] ds = (k_\epsilon * f)(t), \qquad k_\epsilon(u) = \epsilon^{-1} e^{-\pi u^2/\epsilon^2}.$$

Since $\{k_\epsilon\}$ is an approximate identity: $(k_\epsilon * f)(t) \to f(t)$ as $\epsilon \to 0$ at every point of continuity of $f$ (and in $L^1$ norm). Since $I_\epsilon(t) \to \int \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$ as $\epsilon \to 0$ when $\hat{f} \in L^1$ (dominated convergence), the result follows.
Significance: This proof reveals why the Gaussian plays a central role - it is the unique function (up to scaling) that is its own Fourier Transform, making the Gauss regularization self-consistent.
4.3 The Inversion Formula in Practice
The inversion formula is used to:
- Recover a signal from its spectrum: Given $\hat{f}$, compute $f = \mathcal{F}^{-1}\hat{f}$
- Compute transforms of new functions from known transforms: Use the FT table plus properties
- Solve PDEs: Take FT, solve the resulting ODE, invert
Example: Compute $f$ such that $\hat{f}(\xi) = e^{-2\pi|\xi|}$ (double-sided exponential spectrum).
By the inversion formula:

$$f(t) = \int_{-\infty}^{\infty} e^{-2\pi|\xi|}\, e^{2\pi i \xi t}\, d\xi = 2\int_0^{\infty} e^{-2\pi\xi} \cos(2\pi \xi t)\, d\xi = \frac{1}{\pi}\frac{1}{1 + t^2}.$$

So $f$ is the Lorentzian - confirming the result from Section 2.3 (Example 3).
Numerical verification: The inversion theorem can be verified numerically via the FFT: compute $\hat{f}$ numerically, then invert, and check that you recover $f$ up to discretization error (this is Exercise 3 in Section 10).
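The worked example above can also be checked directly by quadrature, without the FFT; a sketch with an arbitrary truncation and grid spacing:

```python
import numpy as np

# Inverse FT of the double-sided exponential spectrum e^{-2*pi*|xi|},
# approximated by a Riemann sum; should give the Lorentzian 1/(pi (1 + t^2)).
xi = np.linspace(-40, 40, 800001)
dxi = xi[1] - xi[0]
spec = np.exp(-2 * np.pi * np.abs(xi))

def inv_ft(t):
    """Riemann-sum approximation of the inverse FT at time t."""
    return np.sum(spec * np.exp(2j * np.pi * xi * t)).real * dxi

for t in [0.0, 1.0, 3.0]:
    print(t, inv_ft(t), 1 / (np.pi * (1 + t**2)))
```

The two columns match to high accuracy; the spectrum decays so fast that truncating at $|\xi| = 40$ costs essentially nothing.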
4.4 The Self-Dual Gaussian
The Gaussian $g(t) = e^{-\pi t^2}$ satisfies $\hat{g} = g$, making it a fixed point of the Fourier Transform. This property makes it indispensable in both theory and practice.
Generalized Gaussian family. For $\sigma > 0$, define $g_\sigma(t) = e^{-\pi t^2/\sigma^2}$ (Gaussian of width $\sigma$). By the scaling theorem:

$$\hat{g}_\sigma(\xi) = \sigma\, e^{-\pi \sigma^2 \xi^2}.$$

Wide Gaussian ($\sigma$ large) -> narrow Fourier Transform (bandwidth $\sim 1/\sigma$). This is the uncertainty principle made explicit: the product of the time-width $\sigma$ and the frequency-width $1/\sigma$ equals 1, the minimum possible for this family (see Section 6).
Eigenfunctions of the FT: The Fourier Transform has eigenvalues $\{1, -i, -1, i\}$. The corresponding eigenfunctions are the Hermite functions:

$$h_n(t) = H_n(\sqrt{2\pi}\, t)\, e^{-\pi t^2}, \qquad \mathcal{F}[h_n] = (-i)^n\, h_n,$$

where $H_n$ is the $n$-th Hermite polynomial. For $n = 0$: $h_0(t) = e^{-\pi t^2}$ (eigenvalue 1, the self-dual Gaussian). For $n = 1$: $h_1(t) \propto t\, e^{-\pi t^2}$ (eigenvalue $-i$).
For AI: The Gaussian window is the optimal time-frequency window (by the uncertainty principle), which is why Gaussian smoothing is used in signal preprocessing and why Gaussian attention kernels appear in some efficient attention variants.
GAUSSIAN SELF-DUALITY
========================================================================
Time domain                        Frequency domain
-------------                      ----------------
g(t) = e^{-πt²}      --FT-->       ĝ(ξ) = e^{-πξ²}    <- same function!
Width σ_t = 1/(2√π)                Width σ_ξ = 1/(2√π)
Product: σ_t σ_ξ = 1/(4π) = MINIMUM (uncertainty bound)
Scaling:
  f(t) = e^{-πt²/σ²}  --FT-->      σ e^{-πσ²ξ²}
    ^ wider in time                  ^ narrower in frequency
========================================================================
5. Plancherel's Theorem and $L^2$ Theory
5.1 Statement: FT as a Unitary Isometry
Theorem 5.1 (Plancherel's Theorem). The Fourier Transform extends uniquely from $L^1 \cap L^2$ to a unitary operator on $L^2(\mathbb{R})$:

$$\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R}), \qquad \|\mathcal{F}f\|_2 = \|f\|_2.$$

Unitarity means $\mathcal{F}^* = \mathcal{F}^{-1}$ (the adjoint equals the inverse), which in this case means the adjoint is the inverse FT.
Why this is non-trivial: For $f \in L^2$ but $f \notin L^1$, the defining integral need not converge absolutely. Plancherel's theorem says we can still define $\hat{f}$ as the limit:

$$\hat{f}(\xi) = \lim_{R \to \infty} \int_{-R}^{R} f(t)\, e^{-2\pi i \xi t}\, dt \quad \text{(limit in the } L^2 \text{ norm)}.$$

The limit exists because the truncated transforms form a Cauchy sequence in $L^2$ (by the $L^1 \cap L^2$ case and density of that subspace in $L^2$).
5.2 Parseval's Relation
Theorem 5.2 (Parseval / Plancherel Identity). For $f, g \in L^2(\mathbb{R})$:

$$\int_{-\infty}^{\infty} f(t)\, \overline{g(t)}\, dt = \int_{-\infty}^{\infty} \hat{f}(\xi)\, \overline{\hat{g}(\xi)}\, d\xi.$$

The special case $g = f$ gives the energy conservation (or Parseval's formula):

$$\int_{-\infty}^{\infty} |f(t)|^2\, dt = \int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\, d\xi.$$
Interpretation: The total energy of a signal is the same whether measured in the time domain or the frequency domain. No energy is created or destroyed by the Fourier Transform - it is a lossless change of representation.
Using Parseval to compute integrals: Sometimes $\int |f|^2$ is hard but $\int |\hat{f}|^2$ is easy (or vice versa). Example: compute $\int_{-\infty}^{\infty} \mathrm{sinc}^2(\xi)\, d\xi$.
We know $\widehat{\mathrm{rect}} = \mathrm{sinc}$, so by Parseval:

$$\int_{-\infty}^{\infty} \mathrm{sinc}^2(\xi)\, d\xi = \int_{-\infty}^{\infty} \mathrm{rect}^2(t)\, dt = 1.$$
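Both the discrete Parseval identity and the $\int \mathrm{sinc}^2 = 1$ computation can be checked numerically; a sketch (array sizes and truncation are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1024)
X = np.fft.fft(x)

# Discrete Parseval: sum |x_n|^2 = (1/N) sum |X_k|^2
lhs = np.sum(x**2)
rhs = np.sum(np.abs(X)**2) / len(x)
print(lhs, rhs)   # equal up to round-off

# Continuous version: integral of sinc^2 = integral of rect^2 = 1.
xi = np.linspace(-2000, 2000, 4_000_001)
dxi = xi[1] - xi[0]
sinc_energy = np.sum(np.sinc(xi)**2) * dxi   # np.sinc(x) = sin(pi x)/(pi x)
print(sinc_energy)   # ~ 1 (the tail beyond |xi| = 2000 holds ~ 5e-5 of the energy)
```

Note the slow $1/\xi^2$ decay of $\mathrm{sinc}^2$ forces a very wide truncation window, in contrast to the Gaussian examples.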
For AI: Parseval's relation underpins the analysis of spectral energy distribution in neural networks. The WeightWatcher tool (Martin & Mahoney, 2021) analyzes the spectrum of weight matrices to diagnose training quality - a healthy model has a power-law spectral distribution. Parseval tells us the Frobenius norm equals the energy in the spectral domain.
5.3 Proof Sketch: Extension from $L^1 \cap L^2$ to $L^2$
The proof of Plancherel's theorem follows a classic functional analysis pattern:
Step 1: Show the Parseval identity for $f, g \in L^1 \cap L^2$ (this intersection is dense in both spaces).
For such $f, g$, use Fubini to compute $\int \hat{f}\, \overline{\hat{g}}\, d\xi$ as a double integral over $t$ and $s$; formally, the $\xi$-integral $\int e^{-2\pi i \xi (t - s)}\, d\xi$ produces $\delta(t - s)$ and collapses the double integral to $\int f(t)\, \overline{g(t)}\, dt$ (made rigorous with a Gaussian regularizer, as in Section 4.2).
Step 2: The isometry $\|\hat{f}\|_2 = \|f\|_2$ for $f \in L^1 \cap L^2$ shows $\mathcal{F}$ is a bounded operator with operator norm 1.
Step 3: By density ($L^1 \cap L^2$ is dense in $L^2$) and the fact that a bounded linear isometry extends uniquely to the closure, $\mathcal{F}$ extends to all of $L^2$ preserving the isometry.
Step 4: Show $\mathcal{F}$ is surjective by showing $\mathcal{F}^{-1}$ (the inverse FT) also extends and $\mathcal{F}\mathcal{F}^{-1} = \mathcal{F}^{-1}\mathcal{F} = \mathrm{id}$.
5.4 Energy Conservation in Practice
The Parseval identity has immediate computational consequences:
Low-pass filtering: A filter that removes frequencies $|\xi| > \Omega$ (a rectangular window in frequency) keeps exactly the energy $\int_{|\xi| \le \Omega} |\hat{f}(\xi)|^2\, d\xi$. For a signal with most energy at low frequencies, this is close to the total energy.
Compression by truncation: For $f \in L^2$, the best approximation using only frequencies in a set $S$ keeps $\hat{f}$ on $S$ and zeroes the rest. The squared error is $\int_{\xi \notin S} |\hat{f}(\xi)|^2\, d\xi$ - the energy in the discarded frequencies. This is the Fourier analog of truncated SVD in matrix approximation.
For AI: The Fourier Neural Operator (Section 8.4) exploits this by keeping only a fixed number of low-frequency Fourier modes of the input function and discarding the rest. Plancherel guarantees the error is exactly the discarded energy - a principled truncation criterion.
6. The Heisenberg Uncertainty Principle
6.1 Time Spread and Frequency Spread
To state the uncertainty principle precisely, we need quantitative measures of how "spread out" a signal is in time and frequency.
Definition 6.1 (Time Center and Spread). For $f \in L^2$ with $\|f\|_2 = 1$:

$$t_0 = \int_{-\infty}^{\infty} t\, |f(t)|^2\, dt, \qquad \sigma_t^2 = \int_{-\infty}^{\infty} (t - t_0)^2\, |f(t)|^2\, dt.$$

Definition 6.2 (Frequency Center and Spread). Similarly:

$$\xi_0 = \int_{-\infty}^{\infty} \xi\, |\hat{f}(\xi)|^2\, d\xi, \qquad \sigma_\xi^2 = \int_{-\infty}^{\infty} (\xi - \xi_0)^2\, |\hat{f}(\xi)|^2\, d\xi.$$

Note: since $\|\hat{f}\|_2 = \|f\|_2 = 1$ (Plancherel), $|f(t)|^2$ and $|\hat{f}(\xi)|^2$ are probability densities. The time and frequency spreads are the standard deviations of these distributions.
6.2 Formal Statement: $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$
Theorem 6.1 (Heisenberg-Kennard Uncertainty Principle). For any $f \in L^2(\mathbb{R})$ with $\|f\|_2 = 1$ and finite spreads:

$$\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}.$$

Equality holds if and only if $f$ is a Gaussian (up to time-shift, frequency-shift, and scaling):

$$f(t) = C\, e^{2\pi i \xi_0 t}\, e^{-\pi (t - t_0)^2/\sigma^2}$$

for some $t_0, \xi_0 \in \mathbb{R}$ and $\sigma > 0$.
The bound is fundamental. This is not a statement about measurement error or quantum physics - it is a purely mathematical theorem about functions and their Fourier Transforms. Any signal with time spread $\sigma_t$ necessarily has frequency spread at least $\frac{1}{4\pi \sigma_t}$. You cannot concentrate a signal to be both time-limited and bandwidth-limited simultaneously.
6.3 Proof via Cauchy-Schwarz
Proof. Without loss of generality, assume $t_0 = \xi_0 = 0$ (translate in time and frequency; neither changes the spreads). By Parseval and the differentiation property $\widehat{f'}(\xi) = 2\pi i \xi\, \hat{f}(\xi)$:

$$\sigma_\xi^2 = \int \xi^2\, |\hat{f}(\xi)|^2\, d\xi = \frac{1}{4\pi^2}\, \|f'\|_2^2.$$

So the claim $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$ is equivalent to:

$$\|t f\|_2\, \|f'\|_2 \ge \frac{1}{2}.$$

Start from $\frac{d}{dt}|f(t)|^2 = f'(t)\overline{f(t)} + f(t)\overline{f'(t)} = 2\,\mathrm{Re}\!\left(f'(t)\,\overline{f(t)}\right)$ and integrate by parts (the boundary term vanishes since $tf, f' \in L^2$ forces $t\,|f(t)|^2 \to 0$):

$$1 = \|f\|_2^2 = \int |f(t)|^2\, dt = -\int t\, \frac{d}{dt}|f(t)|^2\, dt = -2\,\mathrm{Re}\int t\, f'(t)\, \overline{f(t)}\, dt.$$

By the Cauchy-Schwarz inequality:

$$1 \le 2\left|\int t\, \overline{f(t)}\, f'(t)\, dt\right| \le 2\, \|t f\|_2\, \|f'\|_2.$$

Hence $\sigma_t \cdot 2\pi \sigma_\xi \ge \frac{1}{2}$, i.e., $\sigma_t\, \sigma_\xi \ge \frac{1}{4\pi}$. $\blacksquare$
6.4 The Gaussian as the Unique Extremal Function
Equality in Cauchy-Schwarz requires $f'(t) = c\, t\, f(t)$ for some constant $c$. This ODE has solution $f(t) = A\, e^{c t^2/2}$. For $f \in L^2$, we need $\mathrm{Re}(c) < 0$, giving $f(t) = A\, e^{-\pi t^2/\sigma^2}$ for some $\sigma > 0$ - a Gaussian.
Adding back the time-shift $t_0$ and frequency-shift $\xi_0$:

$$f(t) = C\, e^{2\pi i \xi_0 t}\, e^{-\pi (t - t_0)^2/\sigma^2}.$$

Check: with the normalized self-dual Gaussian $f(t) = 2^{1/4} e^{-\pi t^2}$ (so $\|f\|_2 = 1$): $\sigma_t^2 = \sigma_\xi^2 = \frac{1}{4\pi}$, so $\sigma_t\, \sigma_\xi = \frac{1}{4\pi}$ - exactly the bound.
Implication: The Gaussian is the only signal that achieves perfect time-frequency concentration in the sense of minimizing the product $\sigma_t\, \sigma_\xi$. Every other signal is "more spread out" in at least one of the two domains. This is why Gaussian windows are optimal for time-frequency analysis (Gabor atoms), and why the Gaussian noise model is natural in signal processing.
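A numerical check of the bound and its extremal case: compute $\sigma_t \sigma_\xi$ for a Gaussian and for a two-sided exponential, using Parseval to express $\sigma_\xi$ through $\|f'\|_2$ (grid parameters are arbitrary demo choices):

```python
import numpy as np

t = np.linspace(-20, 20, 400001)
dt = t[1] - t[0]

def spreads(f):
    """(sigma_t, sigma_xi) for a real even signal f(t), so both centers are 0.
    sigma_xi is computed via Parseval: sigma_xi^2 = ||f'||^2 / (4 pi^2 ||f||^2)."""
    p = f**2 / (np.sum(f**2) * dt)            # normalized |f|^2 density
    sigma_t = np.sqrt(np.sum(t**2 * p) * dt)
    fp = np.gradient(f, dt)                    # numerical derivative
    sigma_xi = np.sqrt(np.sum(fp**2) / np.sum(f**2)) / (2 * np.pi)
    return sigma_t, sigma_xi

bound = 1 / (4 * np.pi)
for name, f in [("gaussian", np.exp(-np.pi * t**2)),
                ("two-sided exp", np.exp(-np.abs(t)))]:
    st, sx = spreads(f)
    print(name, st * sx, ">=", bound)
```

The Gaussian product lands on the bound $1/(4\pi) \approx 0.0796$; the two-sided exponential comes out strictly larger (about $0.11$), as the theorem demands.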
6.5 Implications for Neural Architecture Design
The uncertainty principle has direct, concrete implications for AI systems:
RoPE and context length extension: In RoPE, each dimension pair $j$ uses a frequency $\theta_j$ from a geometric sequence; the lowest frequencies encode the longest-range positional information. To extend context from 4K to 128K tokens (as in LLaMA-3 long context), the lowest frequencies must be able to distinguish positions up to 128K. But by the uncertainty principle, resolving structure over a position range $L$ requires frequency resolution $\sim 1/L$ - which requires frequencies low enough that the rotation does not wrap within the window. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) resolve this by rescaling $\theta_j$ to use lower frequencies for long-context models.
Spectrogram and STFT: Audio models like Whisper (Radford et al., 2022) use the Short-Time Fourier Transform - the FT applied to windowed segments. The window length controls the time-frequency tradeoff. Longer windows: better frequency resolution, worse time resolution. The Gaussian window is optimal but the Hamming window is commonly used for computational reasons.
Attention window size: Multi-head self-attention with local windows (as in Longformer, BigBird) effectively applies a bandpass filter in "position space." The uncertainty principle says a window of size $W$ can resolve frequency differences only down to $\sim 1/W$ - setting a fundamental limit on the position encodings that can be usefully resolved within the window.
UNCERTAINTY PRINCIPLE - ARCHITECTURE IMPLICATIONS
========================================================================
Signal Analysis                    Neural Architecture Analog
-----------------                  -------------------------
Time duration T                    Context window / sequence length
Bandwidth B ~ 1/T                  Position frequency resolution
σ_t σ_ξ >= 1/(4π)                  Short context coarse PE, long fine PE
STFT window size W:                Local attention window size W:
  -> freq res: ~ 1/W                 -> PE resolution: ~ 1/W
  -> time res: ~ W                   -> local context: ~ W
Gaussian window:                   Gaussian attention kernel:
  -> optimal concentration           -> used in some efficient attention variants
LongRoPE: rescales θ_j to lower freqs -> finer resolution at long range
========================================================================
7. Tempered Distributions and the Dirac Delta
7.1 Why Distributions Are Necessary
The $L^1$ and $L^2$ theories of the Fourier Transform handle a large class of functions, but they exclude some of the most important objects in signal processing and physics:
- $\delta(t)$ - the Dirac delta: not a function, but models an ideal impulse
- $f(t) = 1$ - the constant function: not in $L^1$ or $L^2$, but models a DC signal
- $\cos(2\pi \xi_0 t)$ - a pure tone: not in $L^1$ or $L^2$, but is a fundamental signal
- $Ш_T(t) = \sum_n \delta(t - nT)$ - the Dirac comb: not a function, but models periodic sampling
All of these have Fourier Transforms in the distributional sense - and all appear in practical signal processing. The distribution framework, developed by Laurent Schwartz in the 1950s, extends the Fourier Transform to this broader class.
7.2 Schwartz Space and Tempered Distributions
Definition 7.1 (Schwartz Space). The Schwartz space $\mathcal{S}(\mathbb{R})$ is the space of all functions $\phi \in C^\infty(\mathbb{R})$ such that for all $m, n \ge 0$:

$$\sup_{t \in \mathbb{R}} \left| t^m\, \phi^{(n)}(t) \right| < \infty,$$

i.e., $\phi$ and all its derivatives decay faster than any polynomial. Examples: $e^{-\pi t^2}$ is Schwartz; $e^{-|t|}$ is not (the derivative condition fails at 0); compactly supported $C^\infty$ bump functions are Schwartz.
The Schwartz space is closed under the Fourier Transform: if $\phi \in \mathcal{S}$, then $\hat{\phi} \in \mathcal{S}$. This is the key property: the FT maps $\mathcal{S}$ to itself bijectively.
Definition 7.2 (Tempered Distributions). A tempered distribution is a continuous linear functional $u: \mathcal{S}(\mathbb{R}) \to \mathbb{C}$. We write $u \in \mathcal{S}'(\mathbb{R})$.
Examples of tempered distributions:
- Regular distributions: Every locally integrable function $f$ of polynomial growth (e.g., $f \in L^p$) defines a tempered distribution $u_f(\phi) = \int f(t)\, \phi(t)\, dt$.
- Dirac delta: $\delta(\phi) = \phi(0)$. The Dirac delta evaluates at 0. It is NOT a function.
- Derivative of delta: $\delta'(\phi) = -\phi'(0)$.
- Dirac comb: $Ш_T(\phi) = \sum_{n=-\infty}^{\infty} \phi(nT)$.
Fourier Transform of a tempered distribution: Define $\hat{u}(\phi) = u(\hat{\phi})$ for all $\phi \in \mathcal{S}$. This is the "transpose" of the Fourier Transform on $\mathcal{S}$.
7.3 The Dirac Delta and its Fourier Transform
The Dirac delta models a unit impulse concentrated at $t = 0$:

$$\delta(\phi) = \phi(0), \qquad \text{informally } \int \delta(t)\, \phi(t)\, dt = \phi(0).$$

It is the limit of a sequence of ordinary functions: $k_\epsilon(t) = \epsilon^{-1} e^{-\pi t^2/\epsilon^2}$ converges to $\delta$ in the distributional sense as $\epsilon \to 0$.
Fourier Transform of $\delta$: By the definition, $\hat{\delta}(\phi) = \delta(\hat{\phi}) = \hat{\phi}(0) = \int \phi(t)\, dt$, which is the action of the constant function 1. Therefore:

$$\hat{\delta} = 1.$$

An ideal impulse at $t = 0$ has a flat (white) spectrum - it contains equal energy at all frequencies. This makes perfect physical sense: to create a perfect impulse, you need all frequencies constructively interfering at $t = 0$.
Fourier Transform of a constant: By duality (applying FT twice): $\hat{1} = \delta$. A constant signal (DC) lives entirely at frequency $\xi = 0$.
The duality principle in full: The FT maps $\delta \mapsto 1$ and $1 \mapsto \delta$ - a striking symmetry.
More generally: A delta at $t = a$:

$$\widehat{\delta_a}(\xi) = e^{-2\pi i a \xi}.$$

This is the time-shift property: $\delta$ shifted by $a$ gives a phase factor $e^{-2\pi i a \xi}$, but the magnitude is still flat: $|\widehat{\delta_a}(\xi)| = 1$.
7.4 FT of Constants, Sinusoids, and Periodic Signals
Using $\hat{\delta} = 1$ and the frequency-shift property:
Complex exponential:

$$\widehat{e^{2\pi i \xi_0 t}} = \delta_{\xi_0}.$$

Cosine and Sine:

$$\widehat{\cos(2\pi \xi_0 t)} = \frac{1}{2}\left(\delta_{\xi_0} + \delta_{-\xi_0}\right), \qquad \widehat{\sin(2\pi \xi_0 t)} = \frac{1}{2i}\left(\delta_{\xi_0} - \delta_{-\xi_0}\right).$$

A pure cosine has a spectrum consisting of two Dirac deltas, one at $+\xi_0$ and one at $-\xi_0$ (the negative frequency is the mirror image required by Hermitian symmetry). The amplitude of each spike is $\frac{1}{2}$ because the energy is split equally between the two.
General periodic signal: A $T$-periodic function $f(t) = \sum_n c_n\, e^{2\pi i n t/T}$ has Fourier Transform:

$$\hat{f} = \sum_{n=-\infty}^{\infty} c_n\, \delta_{n/T}.$$

The Fourier Transform of a periodic signal is a discrete sum of Dirac deltas at the harmonics $n/T$ - precisely the discrete spectrum of the Fourier series! This shows that Fourier series and Fourier Transform are two aspects of the same theory, unified by the distributional framework.
7.5 The Dirac Comb and Poisson Summation Formula
Definition 7.3 (Dirac Comb). The $T$-periodic Dirac comb is:

$$Ш_T(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT).$$

Theorem 7.1 (FT of the Dirac Comb).

$$\widehat{Ш_T} = \frac{1}{T}\, Ш_{1/T}.$$

A comb of spacing $T$ in time has a spectrum that is a comb of spacing $1/T$ in frequency. This is the time-frequency duality made concrete: fine temporal sampling -> coarse frequency spacing, and vice versa.
Theorem 7.2 (Poisson Summation Formula). For $f \in \mathcal{S}(\mathbb{R})$:

$$\sum_{n=-\infty}^{\infty} f(n) = \sum_{k=-\infty}^{\infty} \hat{f}(k).$$

More generally, for sampling at interval $T$:

$$\sum_{n=-\infty}^{\infty} f(nT) = \frac{1}{T} \sum_{k=-\infty}^{\infty} \hat{f}\!\left(\frac{k}{T}\right).$$

Proof: The function $F(t) = \sum_n f(t + n)$ is 1-periodic, so it has a Fourier series with coefficients $c_k = \int_0^1 F(t)\, e^{-2\pi i k t}\, dt = \hat{f}(k)$. Evaluating at $t = 0$: $\sum_n f(n) = F(0) = \sum_k c_k = \sum_k \hat{f}(k)$.
The Poisson summation formula is the key to sampling theory. Shannon's sampling theorem (1949) follows directly: a signal with bandwidth $B$ (i.e., $\hat{f}(\xi) = 0$ for $|\xi| > B$) is completely determined by its samples at rate $2B$ (the Nyquist rate). The Poisson summation formula explains both the reconstruction formula and the aliasing that occurs when undersampling.
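Poisson summation can be verified to machine precision with a dilated Gaussian, whose transform is known in closed form (the width $s$ is an arbitrary choice):

```python
import numpy as np

# Poisson summation for f(t) = exp(-pi t^2 / s^2), whose FT is
# fhat(xi) = s * exp(-pi s^2 xi^2):  sum_n f(n) = sum_k fhat(k).
s = 0.7
n = np.arange(-50, 51)                             # both sums converge fast
lhs = np.sum(np.exp(-np.pi * n**2 / s**2))         # samples of f
rhs = s * np.sum(np.exp(-np.pi * s**2 * n**2))     # samples of fhat
print(lhs, rhs)   # equal to machine precision
```

This identity is exactly the classical theta-function transformation law, here obtained as a two-line consequence of Poisson summation.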
For AI: The discrete Fourier Transform (Section 03) is the computational realization of the Poisson summation formula. The FFT computes the sampled Fourier Transform exactly, which requires the signal to be band-limited (or periodic) - any violation causes aliasing, the discrete analog of spectral leakage.
7.6 Unifying Fourier Series and Fourier Transform
The distribution framework reveals that Fourier series and Fourier Transform are the same thing:
| Situation | Signal | Spectrum |
|---|---|---|
| General signal | Continuous, aperiodic | Continuous, aperiodic |
| Periodic signal (period $T$) | Continuous, periodic | Discrete (spacing $1/T$): Fourier series |
| Bandlimited (bandwidth $B$), sampled | Discrete (rate $2B$): sampling | Periodic (period $2B$): aliases |
| Periodic AND bandlimited | Discrete and periodic | Discrete and periodic: DFT (Section 03) |
FOURIER TRANSFORM UNIFICATION
========================================================================
Time domain Frequency domain Setting
----------- ---------------- -------
Continuous Continuous L1/L2 Fourier Transform
Continuous Discrete Fourier Series (periodic)
Discrete Continuous DTFT (sampled signal)
Discrete Discrete DFT / FFT -> see Section 20-03
All four are the same theory viewed through the distributional lens.
The Dirac comb + Poisson summation connects all four cases.
========================================================================