Wavelets and Multiresolution Analysis
"Wavelets are a mathematical microscope: by changing the magnification and the position of the lens, one can examine local features at any desired scale."
- Stephane Mallat, A Wavelet Tour of Signal Processing, 1998
Overview
The Fourier transform is one of the most powerful tools in mathematics - but it has a fundamental blind spot. When you decompose a signal into sinusoids, you learn which frequencies are present, but you lose all information about when they occur. A spike at 1 kHz in the first millisecond looks identical to a spike at 1 kHz in the last millisecond. For stationary signals this is fine, but real-world signals - speech, music, seismic data, financial time series, images - are profoundly non-stationary. Their frequency content changes over time.
Wavelets solve this problem by replacing the infinite sinusoidal Fourier atoms with localized oscillations called wavelets ("small waves"). A wavelet is a function with zero mean, compact (or rapidly decaying) support, and enough smoothness to serve as a basis. By shifting a wavelet in time and dilating it in scale, we get a family of basis functions that simultaneously encode position and scale information. The resulting wavelet transform gives a time-scale representation of signals that adapts to their local structure.
This section develops wavelets from first principles: the continuous wavelet transform (CWT), the theory of Multiresolution Analysis (MRA) that underpins discrete wavelets, the Mallat fast algorithm ($O(N)$ - faster than the FFT!), the celebrated Daubechies construction of compactly-supported wavelets with vanishing moments, 2D wavelet transforms and JPEG 2000, and the modern AI applications: Mallat's scattering networks (stable CNNs without learned weights), wavelet-based attention transformers, and wavelet regularization in diffusion models.
Prerequisites
- Fourier Transform - FT definition, Plancherel theorem, uncertainty principle (Section 20-02)
- DFT and FFT - discrete Fourier transform, filter banks (Section 20-03)
- Convolution Theorem - filter design, LTI systems, QMF conditions (Section 20-04)
- $L^2$ spaces - inner products, orthonormal bases, Hilbert space completeness (Section 12-02 preview)
- Complex exponentials - $e^{i\omega t}$, modulus, argument (Section 01)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | CWT scalograms, MRA cascade, Mallat algorithm, Daubechies construction, 2D DWT, scattering networks, wavelet denoising |
| exercises.ipynb | 10 graded problems: Haar DWT by hand through scattering network feature analysis |
Learning Objectives
After completing this section, you will:
- Explain why the Fourier transform cannot localize events in time and how wavelets resolve this via the uncertainty principle
- Define the continuous wavelet transform (CWT) and state the admissibility condition for reconstruction
- State the five axioms of Multiresolution Analysis and derive the two-scale relation
- Derive the wavelet from the scaling function via the quadrature mirror filter relation
- Implement the Mallat fast DWT algorithm and explain why its complexity is $O(N)$
- Explain what "vanishing moments" means for a wavelet and why more vanishing moments improve compression
- Construct the Daubechies db2 filter from spectral factorization
- Apply the 2D separable DWT and identify the LL, LH, HL, HH subbands
- Describe Mallat's scattering network and explain why it provides stable, translation-invariant features
- Apply soft/hard wavelet thresholding for signal denoising (Donoho-Johnstone framework)
- Connect wavelet filter banks to hierarchical vision architectures (Swin Transformer, PVT)
- Identify and fix the 10 most common wavelet implementation errors
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Multiresolution Analysis
- 4. The Mallat Algorithm (Fast DWT)
- 5. Daubechies Wavelets
- 6. Discrete Wavelet Transform in Practice
- 7. Time-Frequency Analysis
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition
1.1 The Time-Frequency Problem
Consider a musical score. It tells you two things simultaneously: which note is played (frequency) and when it is played (time). The Fourier transform is like a musician who can tell you every pitch in a piece, but cannot tell you when any of them occur. For a signal $f(t)$, the Fourier coefficient:
$$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$$
integrates over all time. Information about the temporal location of a frequency event is completely lost. A piano chord struck at $t = 0$ and the same chord struck $T$ seconds later have identical Fourier transforms (in magnitude).
This is not a defect of a particular implementation - it is a fundamental consequence of using globally supported basis functions. Sinusoids have infinite support: they oscillate forever in both directions. When you decompose a signal in this basis, temporal information is irrevocably mixed into the phase relationships between coefficients, and extracting it requires knowing all coefficients simultaneously.
The consequences for AI are immediate. When processing speech, EEG signals, financial time series, or any non-stationary data, global frequency analysis discards the temporal structure that carries meaning. Voice recordings of "cat" and "tac" have similar Fourier magnitude spectra but different time-frequency patterns. The Whisper model addresses this by computing a mel-spectrogram (short-time Fourier transform) that provides local frequency information - but the STFT has its own limitations that wavelets resolve more elegantly.
1.2 The Uncertainty Principle and What Wavelets Buy
The Heisenberg-Gabor uncertainty principle (Section 20-02) states that no function can be simultaneously localized in both time and frequency beyond the bound:
$$\sigma_t\, \sigma_\omega \ge \frac{1}{2}$$
where $\sigma_t$ and $\sigma_\omega$ are the standard deviations of the energy densities $|f(t)|^2$ and $|\hat{f}(\omega)|^2$ respectively. This is a hard mathematical limit: you cannot have perfect resolution in both domains simultaneously.
What the Short-Time Fourier Transform (STFT) does: Fix a window function $w(t)$ (Gaussian, Hann, etc.) and compute:
$$\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{\infty} f(t)\, w(t - \tau)\, e^{-i\omega t}\, dt$$
This gives local frequency information near time $\tau$. But the window is fixed - every frequency is analyzed at the same time resolution $\sigma_t$. This is suboptimal: low frequencies vary slowly (wide window optimal), while high frequencies vary rapidly (narrow window optimal).
What wavelets do: Wavelets use a window whose width automatically adapts to the analysis scale. At high frequencies (small scale $a$), the wavelet is narrow in time - providing good temporal resolution. At low frequencies (large scale $a$), the wavelet is wide - providing good frequency resolution. This is the "constant-Q" or "constant relative bandwidth" property:
$$\frac{\Delta \omega}{\omega} = \text{const}$$
This adaptive tiling of the time-frequency plane is the geometric essence of wavelet analysis. It matches the structure of natural signals, which tend to have brief, sharp high-frequency events (transients) and slow, extended low-frequency oscillations.
1.3 Wavelets as Mathematical Microscopes
A wavelet is a function $\psi \in L^2(\mathbb{R})$ satisfying the admissibility condition (Section 2.1). From it, we build a two-parameter family of analyzing functions by translation (shifting in time by $b$) and dilation (stretching in scale by $a > 0$):
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t - b}{a}\right)$$
The factor $1/\sqrt{a}$ normalizes energy: $\|\psi_{a,b}\|_2 = \|\psi\|_2$ for all $a, b$.
Think of this geometrically: as $a$ decreases (zooming in), $\psi_{a,b}$ becomes a narrow, high-frequency probe localized near $t = b$. As $a$ increases (zooming out), $\psi_{a,b}$ becomes a wide, low-frequency probe covering a broad time interval. The Continuous Wavelet Transform:
$$W_f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \overline{\psi_{a,b}(t)}\, dt$$
measures the "content" of $f$ near time $b$ at scale $a$ - precisely the behavior of a microscope adjusted to different magnifications and positions.
For AI: Scattering networks (Mallat, 2012) use this idea to build stable, translation-invariant feature representations by cascading wavelet transforms with modulus nonlinearities. The resulting features have provable stability properties that learned CNNs lack - they cannot be destabilized by small deformations of the input.
1.4 Historical Timeline
| Year | Contributor | Development |
|---|---|---|
| 1807 | Fourier | Series decomposition of heat equation solutions |
| 1910 | Haar | First wavelet: the Haar function |
| 1946 | Gabor | Time-frequency atoms; uncertainty principle |
| 1965 | Cooley-Tukey | FFT algorithm - makes DFT computationally practical |
| 1982 | Morlet, Grossmann | Continuous wavelet transform formalized for geophysics |
| 1986 | Meyer | Construction of smooth orthonormal wavelets |
| 1988 | Daubechies | Compactly supported orthonormal wavelets with vanishing moments (db) |
| 1989 | Mallat | Multiresolution Analysis framework; fast DWT algorithm |
| 1992 | Coifman-Wickerhauser | Wavelet packets; best-basis algorithm |
| 1994 | Donoho-Johnstone | Wavelet thresholding for near-optimal denoising |
| 2000 | ISO | JPEG 2000 standard based on Cohen-Daubechies-Feauveau (CDF 9/7) biorthogonal wavelets |
| 2012 | Mallat | Scattering networks: stable CNNs from wavelet cascades |
| 2021 | Yao et al. | WaveBERT: wavelet token compression for long-range Transformers |
| 2022 | Liu et al. | WaveMix: multi-scale 2D wavelet mixing for vision |
| 2023 | Various | Wavelet-domain diffusion models; wavelet attention transformers |
2. Formal Definitions
2.1 The Continuous Wavelet Transform
Definition 2.1 (Admissible wavelet). A function $\psi \in L^2(\mathbb{R})$ is an admissible wavelet if it satisfies the admissibility condition:
$$C_\psi = \int_0^\infty \frac{|\hat{\psi}(\omega)|^2}{\omega}\, d\omega < \infty$$
This requires $\hat{\psi}(0) = 0$, i.e., $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$ - the wavelet must have zero mean. Informally, it must "oscillate" (wave) while also being localized (small = "let"). The condition ensures the CWT is invertible.
Definition 2.2 (Continuous Wavelet Transform). For $f \in L^2(\mathbb{R})$ and admissible $\psi$, the CWT is:
$$W_f(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \overline{\psi\!\left(\frac{t - b}{a}\right)}\, dt, \qquad a > 0,\; b \in \mathbb{R}$$
Theorem 2.1 (CWT inversion formula). If $\psi$ is admissible, then:
$$f(t) = \frac{1}{C_\psi} \int_0^\infty \int_{-\infty}^{\infty} W_f(a, b)\, \psi_{a,b}(t)\, \frac{db\, da}{a^2}$$
The measure $\frac{da\, db}{a^2}$ is the natural invariant measure on the affine group (translations and dilations). The CWT is an isometry up to the constant $C_\psi$.
Parseval formula for CWT. The energy identity reads:
$$\int_0^\infty \int_{-\infty}^{\infty} |W_f(a, b)|^2\, \frac{db\, da}{a^2} = C_\psi\, \|f\|_2^2$$
For AI: The CWT is used for speech analysis (Morlet CWT of mel-filtered audio), EEG artifact detection, and seismic signal interpretation. However, it is the discrete wavelet transform - derived from MRA - that dominates computational applications.
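As a concrete illustration, a minimal sketch of a Morlet-CWT scalogram with PyWavelets' `pywt.cwt`; the chirp signal, scale grid, and Morlet choice here are illustrative assumptions, not part of the theory above:

```python
import numpy as np
import pywt

# Linear chirp: instantaneous frequency sweeps 5 Hz -> 50 Hz over 2 s
fs = 256.0
t = np.arange(0, 2, 1 / fs)
f0, f1 = 5.0, 50.0
x = np.cos(2 * np.pi * (f0 + (f1 - f0) * t / 4) * t)

scales = np.geomspace(2, 128, num=64)              # logarithmic scale grid
coefs, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)

scalogram = np.abs(coefs) ** 2                     # |W_f(a,b)|^2, shape (scales, time)
print(scalogram.shape, freqs[:3])                  # the energy ridge tracks the chirp
```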
2.2 Common Wavelet Families
The Haar Wavelet is the simplest compactly-supported wavelet, introduced by Alfred Haar in 1910:
$$\psi(t) = \begin{cases} \;\;\,1 & 0 \le t < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} \le t < 1 \\ \;\;\,0 & \text{otherwise} \end{cases}$$
The Haar wavelet has compact support $[0, 1]$, exactly 1 vanishing moment, and its scaling function is the box function $\phi = \mathbf{1}_{[0,1)}$. It is the only compactly supported orthonormal wavelet that is symmetric (up to reflection). Its discontinuity makes it poor for smooth signals but ideal for edge detection and theoretical exposition.
The Morlet Wavelet is a complex sinusoid modulated by a Gaussian:
$$\psi(t) = \pi^{-1/4} \left( e^{i\omega_0 t} - e^{-\omega_0^2/2} \right) e^{-t^2/2}$$
where $\omega_0$ is the center frequency. The correction term $e^{-\omega_0^2/2}$ ensures $\int \psi(t)\, dt = 0$ exactly. For $\omega_0 \gtrsim 5$, the correction term is negligible and $\psi(t) \approx \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^2/2}$.
The Morlet wavelet achieves optimal time-frequency localization (it saturates the Heisenberg uncertainty bound) and is widely used for analyzing oscillatory signals like EEG, ECG, and seismic data. It is not compactly supported (Gaussian tails) and does not generate an orthonormal basis, so Morlet analysis uses the CWT (or a redundant frame) rather than an orthogonal DWT.
The Mexican Hat (Ricker) Wavelet is the (normalized, negated) second derivative of a Gaussian:
$$\psi(t) = \frac{2}{\sqrt{3}\, \pi^{1/4}} \left(1 - t^2\right) e^{-t^2/2}$$
Named for its characteristic shape. It has 2 vanishing moments and is real-valued. Used in edge detection and image analysis; its 2D analogue is the Laplacian-of-Gaussian filter of classical blob detection.
Daubechies Wavelets db$N$ are the compactly-supported orthonormal wavelets with $N$ vanishing moments and minimal support. Their explicit construction is in Section 5. Key members:
- db1 = Haar wavelet (discontinuous, 1 vanishing moment)
- db2 = Daubechies 4-tap filter (Hölder continuous, 2 vanishing moments)
- db4 = Daubechies 8-tap filter (more regular, 4 vanishing moments)
- db8 = 16-tap filter (very smooth, 8 vanishing moments)
Increasing $N$ gives smoother wavelets at the cost of longer filters (support $[0, 2N - 1]$).
Symlets sym$N$ are near-symmetric modifications of db$N$ with the same number of vanishing moments but more symmetric filter banks. Preferred in image processing to reduce edge artifacts.
Coiflets coif$N$ have vanishing moments for both $\psi$ and the scaling function $\phi$, making the approximation coefficients near-samples of the signal at coarse scales.
2.3 Inner Product View and Energy Preservation
Hilbert space background. The space $L^2(\mathbb{R})$ is a Hilbert space with inner product $\langle f, g \rangle = \int_{-\infty}^{\infty} f(t)\, \overline{g(t)}\, dt$.
Recall (Section 12-02): A Hilbert space is a complete inner product space. In $L^2(\mathbb{R})$, the Riesz representation theorem guarantees that every bounded linear functional is of the form $f \mapsto \langle f, g \rangle$ for a unique $g$.
A family of wavelets $\{\psi_{j,k}\}_{j,k \in \mathbb{Z}}$ forms an orthonormal basis of $L^2(\mathbb{R})$ if:
- $\langle \psi_{j,k}, \psi_{j',k'} \rangle = \delta_{jj'}\, \delta_{kk'}$ (orthonormality)
- $f = \sum_{j,k} \langle f, \psi_{j,k} \rangle\, \psi_{j,k}$ for all $f \in L^2(\mathbb{R})$ (completeness)
Theorem 2.2 (Parseval for discrete wavelets). If $\{\psi_{j,k}\}$ is an orthonormal wavelet basis, then:
$$\|f\|_2^2 = \sum_{j,k} |\langle f, \psi_{j,k} \rangle|^2$$
Energy is perfectly preserved in the wavelet domain.
For AI: This energy preservation is crucial for compression and denoising. The wavelet coefficient $d_{j,k} = \langle f, \psi_{j,k} \rangle$ measures the "energy contribution" of scale $j$, position $k$. Large coefficients correspond to significant signal features; small coefficients correspond to noise - the basis for threshold denoising.
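A quick numerical check of Theorem 2.2 using PyWavelets - with `mode='periodization'` and an orthogonal wavelet the DWT is an orthonormal change of basis; the random test signal is arbitrary:

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

# Orthogonal DWT: with mode='periodization' the coefficient count equals len(x)
coeffs = pywt.wavedec(x, 'db4', mode='periodization', level=5)

signal_energy = np.sum(x ** 2)
coeff_energy = sum(np.sum(c ** 2) for c in coeffs)
print(signal_energy, coeff_energy)          # agree to machine precision (Parseval)
assert np.isclose(signal_energy, coeff_energy)
```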
3. Multiresolution Analysis
3.1 The MRA Axioms
Multiresolution Analysis (MRA), introduced by Stephane Mallat and Yves Meyer in the late 1980s, provides the algebraic framework that links continuous wavelet theory to fast discrete algorithms. An MRA is a ladder of closed subspaces of $L^2(\mathbb{R})$ satisfying five axioms:
Definition 3.1 (Multiresolution Analysis). A sequence of closed subspaces $\{V_j\}_{j \in \mathbb{Z}} \subset L^2(\mathbb{R})$ forms an MRA if:
1. Nesting: $V_{j+1} \subset V_j$ (each coarser approximation space sits inside the finer one; indexing convention: $j$ increases = coarser resolution)
2. Density: $\overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R})$ (arbitrary precision is achievable by taking finer and finer scales)
3. Separation: $\bigcap_{j \in \mathbb{Z}} V_j = \{0\}$ (at infinite coarsening, only the zero function remains)
4. Scaling: $f(t) \in V_j \iff f(2t) \in V_{j-1}$ (going to a finer scale is equivalent to compressing the function by a factor of 2)
5. Orthonormal basis: There exists $\phi \in V_0$ such that $\{\phi(t - k)\}_{k \in \mathbb{Z}}$ is an orthonormal basis for $V_0$.
The function $\phi$ is the scaling function (or father wavelet) associated with the MRA. The orthonormality condition means:
$$\langle \phi(\cdot - k),\, \phi(\cdot - l) \rangle = \int_{-\infty}^{\infty} \phi(t - k)\, \overline{\phi(t - l)}\, dt = \delta_{kl}$$
Example 3.1 (Haar MRA). Define $V_j$ as the space of $L^2$ functions that are piecewise constant on the dyadic intervals $[2^j k,\, 2^j(k+1))$ for $k \in \mathbb{Z}$. The scaling function is $\phi = \mathbf{1}_{[0,1)}$. All five MRA axioms are satisfied. The approximation of $f$ at scale $j$ is the conditional expectation $\mathbb{E}[f \mid \mathcal{F}_j]$, where $\mathcal{F}_j$ is the $\sigma$-algebra generated by dyadic intervals of width $2^j$.
Example 3.2 (Sinc / Shannon MRA). Define $V_j$ as the Paley-Wiener space of functions bandlimited to $[-2^{-j}\pi,\ 2^{-j}\pi]$. The scaling function is $\phi(t) = \mathrm{sinc}(t) = \frac{\sin(\pi t)}{\pi t}$. This is the MRA of ideally bandlimited signals. The sinc wavelet has infinite support - impractical but theoretically illuminating.
3.2 Scaling Functions and the Two-Scale Relation
Since $V_0 \subset V_{-1}$ (axiom 1), and $\{\phi(t - k)\}$ is an orthonormal basis for $V_0$ while $\{\sqrt{2}\, \phi(2t - k)\}$ is an orthonormal basis for $V_{-1}$ (by axiom 4), we can expand $\phi$ in the finer-scale basis:
Definition 3.2 (Two-scale relation / refinement equation). The scaling function satisfies:
$$\phi(t) = \sqrt{2} \sum_{k} h_k\, \phi(2t - k), \qquad h_k = \langle \phi,\ \sqrt{2}\, \phi(2\cdot{} - k) \rangle$$
where the coefficients $\{h_k\}$ form the low-pass analysis filter. This is the fundamental equation of wavelet theory - it says the scaling function at scale 0 is a weighted sum of translated scaling functions at the next finer scale (twice the resolution).
The Fourier transform of the two-scale relation gives:
$$\hat{\phi}(\omega) = \frac{1}{\sqrt{2}}\, \hat{h}\!\left(\frac{\omega}{2}\right) \hat{\phi}\!\left(\frac{\omega}{2}\right), \qquad \hat{h}(\omega) = \sum_k h_k\, e^{-ik\omega}$$
Applying this recursively: $\hat{\phi}(\omega) = \hat{\phi}(0) \prod_{j=1}^{\infty} \frac{\hat{h}(\omega/2^j)}{\sqrt{2}}$. The infinite product formula shows $\phi$ is determined entirely by the filter $\hat{h}$ (the Fourier transform of $\{h_k\}$).
Orthonormality conditions. For $\{\phi(t - k)\}$ to be orthonormal, the filter must satisfy:
$$\sum_k h_k\, h_{k+2m} = \delta_{m0} \quad \Longleftrightarrow \quad |\hat{h}(\omega)|^2 + |\hat{h}(\omega + \pi)|^2 = 2$$
This is the critical constraint in wavelet filter design.
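Both forms of the orthonormality condition can be checked numerically for any of the Daubechies filters PyWavelets ships; a sketch (db4 is an arbitrary choice):

```python
import numpy as np
import pywt

h = np.array(pywt.Wavelet('db4').rec_lo)   # scaling (low-pass) filter, sum = sqrt(2)

# Time-domain condition: sum_k h_k h_{k+2m} = delta_{m0}
for m in range(4):
    val = np.sum(h[: len(h) - 2 * m] * h[2 * m:])
    print(f"m={m}: {val:+.2e}")            # 1 for m=0, ~0 otherwise

# Frequency-domain condition: |h_hat(w)|^2 + |h_hat(w+pi)|^2 = 2 on a grid
w = np.linspace(0, np.pi, 101)
k = np.arange(len(h))
h_hat = lambda w: np.sum(h[None, :] * np.exp(-1j * np.outer(w, k)), axis=1)
print(np.max(np.abs(np.abs(h_hat(w))**2 + np.abs(h_hat(w + np.pi))**2 - 2)))
```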
For AI: The two-scale relation is structurally reminiscent of a residual connection: the coarse approximation at scale $j$ is recursively expressed in terms of the approximation at scale $j - 1$ (twice the resolution) plus detail. This is precisely the recursion implemented by U-Net and multi-scale vision models.
3.3 Detail Spaces and the Orthogonal Complement
Since $V_{j+1} \subset V_j$ (the coarser space sits inside the finer one - mind the indexing convention), define the detail space $W_{j+1}$ as the orthogonal complement of $V_{j+1}$ within $V_j$:
$$V_j = V_{j+1} \oplus W_{j+1}, \qquad W_{j+1} \perp V_{j+1}$$
The detail space captures the "difference in information" between resolution $j$ and resolution $j+1$. Applying this decomposition repeatedly:
$$V_j = V_J \oplus W_J \oplus W_{J-1} \oplus \cdots \oplus W_{j+1}, \qquad J > j$$
Taking the limits and using the density and separation axioms:
$$L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j$$
This is the fundamental wavelet decomposition theorem: the detail spaces form a direct orthogonal sum decomposition of $L^2(\mathbb{R})$.
Orthonormal basis for $W_j$. If $\psi$ is the mother wavelet (Section 3.4), then $\{\psi_{j,k}(t) = 2^{-j/2}\, \psi(2^{-j} t - k)\}_{k \in \mathbb{Z}}$ is an orthonormal basis for $W_j$. Therefore:
$$\{\psi_{j,k}\}_{j,k \in \mathbb{Z}} \text{ is an orthonormal basis of } L^2(\mathbb{R})$$
Every $f \in L^2(\mathbb{R})$ expands as $f = \sum_{j,k} d_{j,k}\, \psi_{j,k}$ with $d_{j,k} = \langle f, \psi_{j,k} \rangle$.
3.4 The Wavelet Equation and QMF Relation
The mother wavelet $\psi$ is defined by the analogous two-scale relation:
$$\psi(t) = \sqrt{2} \sum_k g_k\, \phi(2t - k)$$
where $\{g_k\}$ is the high-pass analysis filter. The filters $h$ and $g$ are related by the Quadrature Mirror Filter (QMF) condition:
$$g_k = (-1)^k\, h_{1-k}$$
In the frequency domain: $\hat{g}(\omega) = -e^{-i\omega}\, \overline{\hat{h}(\omega + \pi)}$, which makes $\hat{h}$ and $\hat{g}$ power complementary ($|\hat{h}(\omega)|^2 + |\hat{g}(\omega)|^2 = 2$) so that together they span all frequencies.
The QMF condition guarantees:
- $\langle \psi(\cdot - k),\, \phi(\cdot - l) \rangle = 0$ (wavelet is orthogonal to the scaling function)
- $\langle \psi(\cdot - k),\, \psi(\cdot - l) \rangle = \delta_{kl}$ (translated wavelets are orthonormal)
- $\langle \psi_{j,k},\, \psi_{j',k'} \rangle = 0$ for $j \ne j'$ (wavelets at different scales are orthogonal)
Example 3.3 (Haar filters). For the Haar wavelet: $h_0 = h_1 = \frac{1}{\sqrt{2}}$ (all others zero). Then $g_0 = \frac{1}{\sqrt{2}}$, $g_1 = -\frac{1}{\sqrt{2}}$. Verify: $\sum_k h_k^2 = \tfrac{1}{2} + \tfrac{1}{2} = 1$ PASS; $\sum_k h_k g_k = \tfrac{1}{2} - \tfrac{1}{2} = 0$ PASS
Geometric interpretation. The filter bank splits the normalized frequency axis $[0, \pi]$ into two halves: $h$ preserves $[0, \pi/2]$ (low frequencies = approximation), $g$ preserves $[\pi/2, \pi]$ (high frequencies = detail). This is the wavelet counterpart of the Convolution Theorem from Section 20-04: decomposition into frequency bands via orthogonal filter pairs.
4. The Mallat Algorithm (Fast DWT)
4.1 Analysis Filter Bank
The Mallat algorithm (1989) computes the DWT via iterated convolution + downsampling - the inner products $\langle f, \psi_{j,k} \rangle$ never need to be computed directly.
Given a sequence $a^{(0)}[n]$ (the "approximation coefficients" at the finest scale, equivalent to sampling $f$ at integer positions), one step of the DWT computes:
$$a^{(j+1)}[k] = \sum_n h_{n-2k}\, a^{(j)}[n], \qquad d^{(j+1)}[k] = \sum_n g_{n-2k}\, a^{(j)}[n]$$
Here $a^{(j+1)}$ are the approximation coefficients at the next coarser scale (low-pass, half the samples), and $d^{(j+1)}$ are the detail coefficients (high-pass, half the samples). Each output is a convolution with the time-reversed (conjugated) filter followed by downsampling by 2.
Pictorially, one DWT step is:
a^{(j)} --+--[low-pass h]--[v2]---> a^{(j+1)} (approximation)
+--[high-pass g]--[v2]---> d^{(j+1)} (detail)
Repeating this step on $a^{(j+1)}$ gives the full multi-level decomposition.
For AI: This is a convolutional downsampling block with fixed (not learned) filters. The downsampling ($\downarrow 2$) is equivalent to a stride-2 convolution. U-Net, FPN (Feature Pyramid Network), and Swin Transformer all implement learned variants of this Mallat tree.
4.2 Synthesis Filter Bank
Reconstruction (inverse DWT) is the exact reverse:
$$a^{(j)}[n] = \sum_k h_{n-2k}\, a^{(j+1)}[k] + \sum_k g_{n-2k}\, d^{(j+1)}[k]$$
Pictorially:
a^{(j+1)} --[^2]--[h]--+
+---> a^{(j)}
d^{(j+1)} --[^2]--[g]--+
where [^2] denotes upsampling by 2 (zero insertion between samples). The synthesis filters are the time-reverses of the analysis filters (for orthogonal wavelets).
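A from-scratch sketch of one analysis + synthesis step with periodic boundary handling, using the Haar filters for concreteness (any orthonormal QMF pair would work), verifying perfect reconstruction:

```python
import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2)    # Haar low-pass
g = np.array([1.0, -1.0]) / np.sqrt(2)   # Haar high-pass (QMF partner)

def dwt_step(a):
    """a^(j+1)[k] = sum_m h[m] a[2k+m], d^(j+1)[k] = sum_m g[m] a[2k+m] (periodic)."""
    N = len(a)
    idx = (2 * np.arange(N // 2)[:, None] + np.arange(len(h))[None, :]) % N
    return a[idx] @ h, a[idx] @ g

def idwt_step(approx, detail):
    """a^(j)[n] = sum_k h[n-2k] approx[k] + g[n-2k] detail[k] (periodic)."""
    N = 2 * len(approx)
    rec = np.zeros(N)
    for k in range(len(approx)):
        for m in range(len(h)):
            rec[(2 * k + m) % N] += h[m] * approx[k] + g[m] * detail[k]
    return rec

x = np.random.default_rng(1).standard_normal(16)
a1, d1 = dwt_step(x)
assert np.allclose(idwt_step(a1, d1), x)   # perfect reconstruction
```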
4.3 Perfect Reconstruction Conditions
Definition 4.1 (Perfect Reconstruction). A filter bank achieves perfect reconstruction if, for every input sequence $x$, the synthesized output after analysis + synthesis equals $x$ (up to a delay).
For orthogonal wavelets, the synthesis filters equal the time-reversed analysis filters:
$$\tilde{h}_k = h_{-k}, \qquad \tilde{g}_k = g_{-k}$$
The PR conditions in the $z$-transform domain are:
- Alias cancellation: $\tilde{H}(z)\, H(-z) + \tilde{G}(z)\, G(-z) = 0$
- Distortion-free: $\tilde{H}(z)\, H(z) + \tilde{G}(z)\, G(z) = 2 z^{-d}$
Together, these ensure that aliasing from the $\downarrow 2$ operation is exactly cancelled in reconstruction.
Biorthogonal wavelets (used in JPEG 2000) use different analysis and synthesis filters, relaxing the orthogonality constraint while maintaining PR. This allows symmetric filters (impossible for compactly supported orthogonal wavelets with more than 1 vanishing moment, by Daubechies' theorem).
4.4 Computational Complexity
At level 1, the DWT processes $N$ samples with two convolutions + downsamplings, costing about $2LN$ operations (each filter has $L$ taps, and $L$ is a fixed constant). At level 2, we process $N/2$ samples, costing half as much. Summing the geometric series:
$$2LN \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots \right) \le 4LN = O(N)$$
The DWT is $O(N)$ - faster than the FFT's $O(N \log N)$!
Comparison:
| Algorithm | Complexity | Memory | Time localization |
|---|---|---|---|
| DFT / FFT | $O(N \log N)$ | $O(N)$ | None (global) |
| STFT | $O(N \log M)$ ($M$-sample window) | $O(NM/H)$ (hop $H$, redundant) | Fixed window |
| CWT (discrete approx.) | $O(N \log N)$ per scale | $O(N \cdot n_{\text{scales}})$ (redundant) | Scale-dependent |
| DWT (Mallat) | $O(N)$ | $O(N)$ | Scale-dependent |
This complexity is why wavelets underpin all modern image compression standards (JPEG 2000, EZW, SPIHT) and are computationally preferred over STFT for multi-scale analysis.
5. Daubechies Wavelets
5.1 Vanishing Moments
The most important property distinguishing different wavelet families is the number of vanishing moments.
Definition 5.1 (Vanishing moments). A wavelet $\psi$ has $p$ vanishing moments if:
$$\int_{-\infty}^{\infty} t^m\, \psi(t)\, dt = 0, \qquad m = 0, 1, \ldots, p - 1$$
Equivalently, in the Fourier domain: $\hat{\psi}^{(m)}(0) = 0$ for $m = 0, \ldots, p - 1$, meaning $\hat{\psi}$ has a zero of order $p$ at $\omega = 0$.
Why vanishing moments matter:
1. Polynomial annihilation. If $f$ is a polynomial of degree $< p$ on the support of $\psi_{j,k}$, then $d_{j,k} = \langle f, \psi_{j,k} \rangle = 0$. Smooth signals (well-approximated by polynomials locally) have near-zero detail coefficients everywhere except near discontinuities. This leads to sparse wavelet representations - ideal for compression.
2. Approximation order. With $p$ vanishing moments, the detail coefficients of an $f \in C^p$ decay like $|W_f(a, b)| = O(a^{p + 1/2})$ as the scale $a \to 0$, so truncating at scale $a$ incurs error $O(a^p)$. More vanishing moments = better approximation of smooth functions.
3. Filter regularity. More vanishing moments enforce more zeros of $\hat{h}(\omega)$ at $\omega = \pi$, which forces the scaling function to be smoother.
Trade-off. More vanishing moments require longer filters (wider support). Daubechies db$N$ has exactly $N$ vanishing moments and the minimum possible support $[0, 2N - 1]$ (filter length $2N$). There is no way to get $N$ vanishing moments from a shorter filter while maintaining orthonormality.
5.2 Constructing db2 via Spectral Factorization
We construct the Daubechies db2 wavelet (4-tap filter, 2 vanishing moments) explicitly.
Step 1: Set up constraints. We need a filter $(h_0, h_1, h_2, h_3)$ (4 taps, support $[0, 3]$) satisfying:
- Orthogonality at shift 2: $h_0 h_2 + h_1 h_3 = 0$ (together with the other conditions, unit energy $\sum_k h_k^2 = 1$ follows)
- 2 vanishing moments: $\sum_k (-1)^k h_k = 0$ and $\sum_k (-1)^k k\, h_k = 0$
- Normalization: $\sum_k h_k = \sqrt{2}$ (i.e., $\hat{h}(0) = \sqrt{2}$)
These give 4 equations for 4 unknowns.
Step 2: Parameterize. The power-complementary condition $|\hat{h}(\omega)|^2 + |\hat{h}(\omega + \pi)|^2 = 2$, combined with a factor $\left(\frac{1 + e^{-i\omega}}{2}\right)^N$ enforcing the vanishing moments, leads to a Bezout equation whose solution is:
$$|\hat{h}(\omega)|^2 = 2 \cos^{2N}\!\left(\tfrac{\omega}{2}\right) P\!\left(\sin^2 \tfrac{\omega}{2}\right)$$
where $P$ is a polynomial satisfying $(1 - y)^N P(y) + y^N P(1 - y) = 1$ for $y \in [0, 1]$.
Step 3: Minimum phase factorization. For $N = 2$, the minimal-degree solution is $P(y) = 1 + 2y$ (chosen for minimum support). Evaluating:
$$|\hat{h}(\omega)|^2 = 2 \cos^4\!\left(\tfrac{\omega}{2}\right)\left(1 + 2 \sin^2 \tfrac{\omega}{2}\right)$$
Factoring this as $\hat{h}(\omega)\, \overline{\hat{h}(\omega)}$ (spectral factorization) and keeping the root inside the unit disk (minimum phase) yields:
The db2 filter coefficients (normalized so $\sum_k h_k = \sqrt{2}$):
$$h = \frac{1}{4\sqrt{2}}\left(1 + \sqrt{3},\ \ 3 + \sqrt{3},\ \ 3 - \sqrt{3},\ \ 1 - \sqrt{3}\right)$$
Numerically: $h \approx (0.48296,\ 0.83652,\ 0.22414,\ -0.12941)$.
Verification: $\sum_k h_k = \sqrt{2}$ PASS, $\sum_k h_k^2 = 1$ PASS, $h_0 h_2 + h_1 h_3 = 0$ PASS
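The closed form is easy to check numerically, and it matches the filter PyWavelets ships (note: pywt stores the decomposition filters time-reversed, so the closed form corresponds to `rec_lo`):

```python
import numpy as np
import pywt

s = np.sqrt(3)
h = np.array([1 + s, 3 + s, 3 - s, 1 - s]) / (4 * np.sqrt(2))

assert np.isclose(h.sum(), np.sqrt(2))                 # normalization
assert np.isclose((h ** 2).sum(), 1.0)                 # unit energy
assert np.isclose(h[0] * h[2] + h[1] * h[3], 0.0)      # shift-2 orthogonality
k = np.arange(4)
assert np.isclose(((-1) ** k * h).sum(), 0.0)          # 1st vanishing moment
assert np.isclose(((-1) ** k * k * h).sum(), 0.0)      # 2nd vanishing moment

print(np.allclose(h, pywt.Wavelet('db2').rec_lo))      # True: matches the library
```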
5.3 Regularity and Holder Continuity
The regularity of the Daubechies scaling function is governed by the decay of $\hat{\phi}(\omega)$, which is controlled by the spectral radius of a transition matrix built from the filter $h$.
Theorem 5.1 (Daubechies regularity). The db$N$ scaling function is Hölder continuous, $\phi_N \in C^{\alpha_N}$, with an exponent $\alpha_N$ that grows roughly linearly in $N$ (asymptotically $\alpha_N \approx 0.2\, N$).
Specific values:
- db1 (Haar): discontinuous (bounded variation)
- db2: $\alpha \approx 0.55$ (Hölder continuous but not differentiable)
- db4: $\alpha \approx 1.28$ (once continuously differentiable)
- db8: $\alpha \approx 2.9$ (twice continuously differentiable)
The scaling function becomes smoother as $N$ increases, at the cost of wider support ($2N$ filter taps, support width $2N - 1$).
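The increasing smoothness is easy to visualize: `Wavelet.wavefun` in PyWavelets runs the cascade algorithm and returns samples of $\phi$ and $\psi$ on a dyadic grid (a minimal sketch; plotting is optional):

```python
import pywt

# Cascade algorithm: iterate the two-scale relation starting from a box function
for name in ['db1', 'db2', 'db4', 'db8']:
    phi, psi, x = pywt.Wavelet(name).wavefun(level=8)
    print(f"{name}: support [0, {x[-1]:.0f}], {len(x)} samples")
    # plt.plot(x, phi)  # db1 is a step function; db8 is visibly smooth
```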
5.4 Wavelet Family Comparison
WAVELET FAMILY COMPARISON
========================================================================
Wavelet | Support | VM | Regularity | Symmetric | Orthogonal
-----------|---------|----|------------|-----------|------------
Haar | [0,1] | 1 | C^0 | Yes | Yes
db2 | [0,3] | 2 | C^0.55 | No | Yes
db4 | [0,7] | 4 | C^1.28 | No | Yes
db8 | [0,15] | 8 | C^2.9 | No | Yes
sym4 | [0,7] | 4 | C^1.28 | Near-sym | Yes
coif4 | [-5,5] | 4 | C^1.28 | Near-sym | Yes
bior2.2 | [-1,3] | 2 | C^1 | Yes | Biortho
CDF 9/7 | large | 4 | smooth | Yes | Biortho
Mexican Hat| infty | 2 | C^infty | Yes | No (frame)
Morlet | infty | infty | C^infty | Near-sym | No (frame)
VM = Vanishing Moments; CDF 9/7 = JPEG 2000 standard
========================================================================
For AI: Smooth wavelets (large $N$) produce sparse representations for smooth neural network weights and activations, while Haar is efficient for piecewise constant approximations. The CDF 9/7 biorthogonal wavelet used in JPEG 2000 balances symmetry, smoothness, and reconstruction quality - it is also used in wavelet-domain diffusion models to separate frequency bands.
6. Discrete Wavelet Transform in Practice
6.1 Multi-Level Decomposition Tree
The Mallat algorithm applied $J$ times to a length-$N$ signal produces the wavelet decomposition tree:
WAVELET DECOMPOSITION TREE (J=3 levels)
========================================================================
Input: a[n], length N
|
+--[h,v2]---> a1[n], length N/2 (approximation, level 1)
| |
| +--[h,v2]---> a2[n], length N/4 (approx., L2)
| | |
| | +--[h,v2]---> a3[n] (approx., L3)
| | +--[g,v2]---> d3[n] (detail, L3)
| |
| +--[g,v2]---> d2[n], length N/4 (detail, L2)
|
+--[g,v2]---> d1[n], length N/2 (detail, level 1)
Storage: N/2 + N/4 + N/8 + N/8 = N (no redundancy!)
========================================================================
The output is the coefficient set $\{a_J,\ d_J,\ d_{J-1},\ \ldots,\ d_1\}$ with total length $N$ (for periodic boundary handling). No information is lost or duplicated.
Frequency interpretation (sampling rate $f_s$):
- $d_1$: frequencies in $[f_s/4,\ f_s/2]$ Hz (highest octave)
- $d_2$: frequencies in $[f_s/8,\ f_s/4]$ Hz
- $d_j$: frequencies in $[f_s/2^{j+1},\ f_s/2^j]$ Hz (octave bands)
- $a_J$: frequencies in $[0,\ f_s/2^{J+1}]$ Hz (lowest frequencies)
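The octave structure shows up directly in the coefficient lengths; a quick sketch with PyWavelets:

```python
import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(1024)
coeffs = pywt.wavedec(x, 'haar', mode='periodization', level=3)

# [a3, d3, d2, d1] -> lengths 128, 128, 256, 512 (sum = 1024, non-redundant)
for name, c in zip(['a3', 'd3', 'd2', 'd1'], coeffs):
    print(name, len(c))
print('total:', sum(len(c) for c in coeffs))
```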
6.2 Wavelet Packets
Standard DWT applies the filter bank only to the approximation branch ($a^{(j)}$). Wavelet packets apply it to both branches - creating a full binary tree of subband decompositions:
At each node (level $j$, node $n$), both the "approximation" and "detail" sub-signals are further decomposed. This gives $2^j$ subbands at level $j$, each of width $f_s/2^{j+1}$. The collection of all leaves at level $j$ therefore covers the spectrum with uniform resolution - the same as a DFT with $2^j$ bins.
The best-basis algorithm (Coifman-Wickerhauser, 1992) selects the pruning of this tree that minimizes a cost function (e.g., entropy) - giving an adaptive time-frequency decomposition that is optimal for the specific signal.
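A short sketch of the packet tree with PyWavelets' `WaveletPacket` (node paths are strings over {'a', 'd'}; the test signal is arbitrary):

```python
import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(256)
wp = pywt.WaveletPacket(data=x, wavelet='db2', mode='periodization', maxlevel=3)

# 2^3 = 8 leaves at level 3, ordered from lowest to highest frequency band
leaves = wp.get_level(3, order='freq')
print([node.path for node in leaves])        # 'aaa', 'aad', ..., 'ddd'
print([len(node.data) for node in leaves])   # each leaf holds 256 / 8 = 32 coefficients
```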
6.3 2D DWT and Image Subbands
For 2D signals (images), the DWT is applied separably: first along rows, then along columns (or vice versa). One level of 2D DWT produces four subbands of size $(H/2) \times (W/2)$:
2D DWT SUBBANDS (one level)
========================================================================
Input image (H x W)
+-------------------------+
|                         |
|        f(x, y)          |
|                         |
+-------------------------+
            |
+--------------+--------------+
| LL (H/2xW/2) | LH (H/2xW/2) |
|  Low-row     |  Low-row     |
|  Low-col     |  High-col    |
|  (approx.)   | (hor. edges) |
+--------------+--------------+
| HL (H/2xW/2) | HH (H/2xW/2) |
|  High-row    |  High-row    |
|  Low-col     |  High-col    |
| (vert. edges)| (diag. edges)|
+--------------+--------------+
LL = approximation (coarse image)
LH = lowpass rows, highpass columns (horizontal edges)
HL = highpass rows, lowpass columns (vertical edges)
HH = highpass rows and columns (diagonal edges)
========================================================================
Applying the 2D DWT recursively to the LL subband gives the standard pyramid decomposition used in image processing.
For AI: This 2D decomposition is the backbone of Swin Transformer's patch merging, PVT's pyramid vision representation, and wavelet-enhanced U-Net architectures. The subbands naturally separate global structure (LL) from local edge features (LH, HL, HH).
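A minimal sketch of one 2D DWT level with `pywt.dwt2` on a synthetic image (subband naming follows the convention above; pywt itself labels the detail tuple horizontal/vertical/diagonal):

```python
import numpy as np
import pywt

# Synthetic 64x64 image: a bright square on a dark background
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0

LL, (LH, HL, HH) = pywt.dwt2(img, 'haar', mode='periodization')
print(LL.shape, LH.shape, HL.shape, HH.shape)   # all (32, 32)

# Energy concentrates in LL; detail bands light up only along the square's edges
for name, band in [('LL', LL), ('LH', LH), ('HL', HL), ('HH', HH)]:
    print(name, np.sum(band ** 2))
```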
6.4 JPEG 2000 and Image Compression
JPEG 2000 (2000) uses the CDF 9/7 biorthogonal wavelet (Le Gall 5/3 for lossless) to transform images before quantization and entropy coding. The key advantages over DCT-based JPEG:
- No blocking artifacts - the wavelet transform is applied to the whole image (or large tiles), avoiding the 8x8 block boundary artifacts of JPEG
- Scalable quality - truncating the bitstream at any point gives a valid (lower quality) image
- Region of Interest (ROI) coding - different quality levels for different image regions
- Better compression at high quality - 20-30% better than JPEG at visually lossless quality
The coding steps: 2D DWT (up to 5 levels) -> coefficient quantization -> EBCOT entropy coding (Embedded Block Coding with Optimized Truncation).
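A minimal top-$p\%$ wavelet compression experiment in this spirit (keep the largest coefficients by magnitude, zero the rest, reconstruct, measure PSNR); the synthetic image, wavelet, and level are arbitrary choices:

```python
import numpy as np
import pywt

def wavelet_compress(img, keep_frac=0.05, wavelet='db4', level=3):
    """Keep the top keep_frac of wavelet coefficients by magnitude; reconstruct."""
    coeffs = pywt.wavedec2(img, wavelet, mode='periodization', level=level)
    arr, slices = pywt.coeffs_to_array(coeffs)
    thresh = np.quantile(np.abs(arr), 1 - keep_frac)
    arr_kept = np.where(np.abs(arr) >= thresh, arr, 0.0)
    coeffs_kept = pywt.array_to_coeffs(arr_kept, slices, output_format='wavedec2')
    return pywt.waverec2(coeffs_kept, wavelet, mode='periodization')

img = np.outer(np.sin(np.linspace(0, 8, 128)), np.cos(np.linspace(0, 8, 128)))
rec = wavelet_compress(img, keep_frac=0.05)
mse = np.mean((img - rec) ** 2)
psnr = 10 * np.log10(np.ptp(img) ** 2 / mse)
print(f"PSNR with 5% of coefficients kept: {psnr:.1f} dB")
```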
Modern relevance: Wavelet-based compression principles inform:
- Neural image codecs (Balle et al.): hyperprior model operates on wavelet-like feature maps
- Wavelet-domain diffusion models (WaveDiff, 2023): diffusion operates on DWT coefficients rather than pixels, reducing spatial dimensions by 4x per level
- VCT (Video Compression Transformer): uses learned transforms similar to wavelet filter banks
7. Time-Frequency Analysis
7.1 The Scalogram
The scalogram is the time-scale energy density:
$$\mathrm{Scal}_f(a, b) = |W_f(a, b)|^2$$
It is the wavelet analogue of the spectrogram (the squared magnitude of the STFT). The scalogram shows how the energy of the signal is distributed across scales and times.
Reading the scalogram:
- Horizontal axis: time
- Vertical axis: scale (often plotted as frequency , increasing downward = decreasing scale)
- Color intensity: energy
- A vertical ridge indicates a transient (localized in time)
- A horizontal ridge indicates a sustained oscillation (localized in frequency)
Edge effects (cone of influence). Near the boundaries of the signal, wavelet coefficients are unreliable because the wavelet extends outside the signal domain. The cone of influence at scale $a$ is the region of times affected by boundary effects: approximately $|b - t_{\mathrm{edge}}| < C\, a$, where $C$ is the support radius of $\psi$. Coefficients inside this boundary-affected region should be discounted.
Application example: For a chirp signal (instantaneous frequency linearly increasing with time), the scalogram shows a diagonal ridge tracing the time-frequency trajectory. The Fourier spectrum merely shows a broad band of frequencies with no temporal information.
7.2 Gabor Wavelets and the STFT Connection
Gabor wavelets (complex Morlet wavelets with exact zero mean) interpolate between the STFT and the CWT:
$$\psi(t) = e^{i\eta t}\, e^{-t^2/(2\sigma^2)} \quad (\text{minus the small correction that enforces zero mean})$$
The parameter $\sigma$ controls the time-frequency trade-off:
- Large $\sigma$: good frequency resolution (narrow band), poor time resolution
- Small $\sigma$: good time resolution, poor frequency resolution
The STFT with a Gaussian window of width $\sigma$ is identical to the CWT with Gabor wavelets at a fixed scale. The key difference: in the STFT, $\sigma$ is fixed for all frequencies; in the CWT, the effective window width scales inversely with frequency (constant-Q).
The STFT time-frequency tiling uses uniform rectangles: each tile has the same shape $\sigma_t \times \sigma_\omega$.
The CWT uses logarithmic tiling: tiles are wide in time and narrow in frequency at low frequencies (large $a$), and narrow in time and wide in frequency at high frequencies (small $a$).
7.3 Synchrosqueezing Transform
The Synchrosqueezing Transform (SST) (Daubechies-Lu-Wu, 2011) sharpens the scalogram by reassigning energy to the instantaneous frequency curve, producing a "synchrosqueezed" time-frequency representation with much higher resolution.
For a signal $f(t) = A(t)\cos\theta(t)$ with instantaneous frequency $\theta'(t)$, the phase derivative of the CWT:
$$\omega_f(a, b) = \frac{1}{i}\, \frac{\partial_b W_f(a, b)}{W_f(a, b)}$$
estimates the instantaneous frequency at $(a, b)$. The synchrosqueezed transform reassigns the energy from the scalogram at $(a, b)$ to the time-frequency point $(b, \omega_f(a, b))$. The result is a sharper, more interpretable time-frequency map.
Application: SST is used to separate overlapping modes in multicomponent signals (e.g., ECG with P, QRS, T waves; audio with multiple instruments). It offers a mathematically grounded alternative to empirical mode decomposition; implementations are available in dedicated packages (e.g., ssqueezepy in Python, wsst in MATLAB).
8. Applications in Machine Learning
8.1 Mallat Scattering Networks
The scattering transform (Mallat, 2012; Bruna-Mallat, 2013) builds provably stable, translation-invariant features by cascading wavelet transforms with modulus nonlinearities:
$$S_0 f = f * \phi_J, \qquad S_1 f(\lambda_1) = |f * \psi_{\lambda_1}| * \phi_J, \qquad S_2 f(\lambda_1, \lambda_2) = \big| |f * \psi_{\lambda_1}| * \psi_{\lambda_2} \big| * \phi_J$$
where $\phi_J$ is the low-pass filter at scale $2^J$, and $\lambda$ ranges over wavelets at scales $2^j$ for $j \le J$.
Key theorems:
Theorem 8.1 (Translation invariance). Scattering coefficients are (approximately) invariant to translations of size $\ll 2^J$.
Theorem 8.2 (Stability to deformations). If $f_\tau(x) = f(x - \tau(x))$ is a deformation close to a translation (i.e., $\|\nabla \tau\|_\infty < 1$), then:
$$\|S f_\tau - S f\| \le C\, \|\nabla \tau\|_\infty\, \|f\|$$
This Lipschitz-in-deformation bound is the key property generic CNNs lack: without it, even an arbitrarily small warp of the input can produce a feature change that is not controlled by the size of the deformation.
Scattering vs CNNs: Scattering networks have no learned parameters. Their features are competitive with shallow CNNs on MNIST and CIFAR-10. The architecture is:
Scattering Network Architecture
========================================================================
Input f(x)
|
+-- f * phi_J ---------------------------------> S_0 (global average)
|
+-- |f * psi_{j1}| * phi_J --------------------> S_1 (J terms)
|
+-- ||f * psi_{j1}| * psi_{j2}| * phi_J -------> S_2 (~J^2/2 terms)
Total feature dimension: O(J^2) (vs O(K^2 C) learned weights for a CNN)
Invariance: proven (vs empirical for CNN)
Stability: proven (vs empirical for CNN)
========================================================================
For AI: Scattering features fed to a simple linear classifier achieve ~87% on CIFAR-10. In protein structure prediction, a variant of scattering (SE(3)-equivariant networks) is used by AlphaFold2 to process atomic coordinates.
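A deliberately crude first-order scattering sketch built from PyWavelets' undecimated DWT - global averaging of modulus subbands stands in for the low-pass $\phi_J$, which is a simplification of the Morlet-based construction:

```python
import numpy as np
import pywt

def scatter_order1(x, wavelet='db4', J=4):
    """S0 plus first-order scattering: mean of |undecimated detail| per scale."""
    bands = pywt.swt(x, wavelet, level=J, trim_approx=True)   # [aJ, dJ, ..., d1]
    s0 = bands[0].mean()                          # low-pass average (S_0)
    s1 = [np.abs(d).mean() for d in bands[1:]]    # avg |f * psi_j| (S_1)
    return np.array([s0] + s1)

x = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 512))
f1 = scatter_order1(x)
f2 = scatter_order1(np.roll(x, 37))               # circularly shifted copy
print(np.max(np.abs(f1 - f2)))                    # ~0: translation-invariant features
```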
8.2 Wavelet-Based Attention
WaveBERT (2021) compresses long input sequences by applying wavelet transforms to token embeddings before attention, reducing sequence length by a factor of $2^J$:
- Apply the DWT along the sequence axis of the token embeddings: $X \in \mathbb{R}^{N \times d} \to a^{(J)} \in \mathbb{R}^{(N/2^J) \times d}$
- Run Transformer attention only on the approximation (length $N/2^J$)
- Reconstruct full-resolution outputs via the IDWT
This reduces the attention complexity from $O(N^2)$ to $O((N/2^J)^2)$ - a factor of $4^J$ speedup.
Wavelet Attention (2023, various groups) applies 2D DWT to attention logits:
- Compute the attention score matrix $A$ of shape $N \times N$
- Apply a 2D DWT to $A$ -> sparse representation (most attention is low-frequency)
- Threshold small wavelet coefficients (attention is concentrated on main diagonals)
- Reconstruct and apply softmax
This achieves near-full-attention quality with sparse coefficient computation, similar to the linear attention approximations but with theoretical backing from wavelet compression theory.
WaveMix (2022) replaces the attention mechanism entirely with 2D wavelet mixing, applying the DWT across spatial dimensions in vision tasks. The model achieves competitive accuracy on ImageNet with significantly fewer parameters than ViT.
8.3 Multiresolution in Vision Transformers
The standard Vision Transformer (ViT) processes all patches at a single scale with quadratic attention. Hierarchical architectures introduce the wavelet-like multi-scale structure:
Swin Transformer (2021): Applies attention within local windows and shifts them across layers. The patch merging operation (concatenating 2x2 patches and projecting) is equivalent to a stride-2 convolution - structurally analogous to the DWT low-pass filter + downsampling step.
PVT (Pyramid Vision Transformer, 2021): Explicitly uses 4 stages with spatial reduction strides $\{4, 8, 16, 32\}$ - matching the wavelet octave decomposition exactly.
FPN (Feature Pyramid Network, 2017): Multi-scale feature maps combined via lateral connections - the convolutional analogue of the wavelet synthesis filter bank.
Common pattern: Resolution halved, channels doubled at each level. This is exactly the wavelet dyadic tree structure, with learned filters replacing fixed wavelet filters.
8.4 Wavelet Regularization in Diffusion Models
WaveDiff (2023) applies the diffusion process in the wavelet domain rather than pixel space:
- Apply a 2D DWT to images: $x \to (LL, LH, HL, HH)$
- Apply Gaussian noise only to the high-frequency subbands ($LH$, $HL$, $HH$) - preserving the global structure in $LL$
- Train the denoising network on the (much smaller) wavelet coefficients
Benefits:
- Spatial dimensions reduced by 4x per level -> attention is cheaper
- High-frequency details are easier to model when separated from low-frequency structure
- Generation quality is preserved since wavelets are invertible
Wavelet-based training regularization: The spectral bias phenomenon (neural networks learn low frequencies first) can be accelerated by decomposing loss functions in the wavelet domain. Penalizing different scales at different rates speeds up convergence - equivalent to preconditioning by the wavelet spectrum.
8.5 Wavelet Denoising: Donoho-Johnstone
The Donoho-Johnstone framework (1994) provides a near-optimal, computationally efficient signal denoising procedure based on wavelet coefficient thresholding:
Setup. Observe $y_i = f(t_i) + \sigma\, \epsilon_i$ for $i = 1, \ldots, N$, where $f$ is the true signal and $\epsilon_i \sim \mathcal{N}(0, 1)$ is Gaussian noise.
Algorithm:
- Compute the DWT of $y$: coefficients $\{a_{J,k}\}$ and $\{d_{j,k}\}$
- Estimate the noise level: $\hat{\sigma} = \mathrm{median}(|d_{1,k}|) / 0.6745$ (using the finest-scale detail coefficients)
- Set the universal threshold: $\lambda = \hat{\sigma} \sqrt{2 \ln N}$
- Apply thresholding to all detail coefficients:
  - Soft: $\eta_\lambda(d) = \mathrm{sign}(d)\, \max(|d| - \lambda,\ 0)$
  - Hard: $\eta_\lambda(d) = d \cdot \mathbf{1}[|d| > \lambda]$
- Reconstruct via the IDWT of the thresholded coefficients (a compact implementation follows below)
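The whole pipeline is a few lines with PyWavelets (the test signal and noise level are arbitrary choices):

```python
import numpy as np
import pywt

def denoise(y, wavelet='db4', level=5, mode='soft'):
    """Donoho-Johnstone universal-threshold wavelet denoising."""
    coeffs = pywt.wavedec(y, wavelet, mode='periodization', level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # MAD estimate from d1
    lam = sigma * np.sqrt(2 * np.log(len(y)))             # universal threshold
    coeffs[1:] = [pywt.threshold(d, lam, mode=mode) for d in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet, mode='periodization')

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 2048)
f = np.sin(8 * np.pi * t) * (t < 0.6)                     # signal with a discontinuity
y = f + 0.2 * rng.standard_normal(len(t))
print("noisy MSE:   ", np.mean((y - f) ** 2))
print("denoised MSE:", np.mean((denoise(y) - f) ** 2))    # substantially smaller
```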
Why does this work? Sparse coding: a smooth signal has few large wavelet coefficients; noise spreads uniformly across all coefficients. By zeroing small coefficients (which are dominated by noise) and preserving large coefficients (dominated by signal), we efficiently remove noise.
Theorem 8.3 (Near-optimality, Donoho-Johnstone 1994). The soft-threshold estimator $\hat{f}$ with the universal threshold satisfies the oracle inequality:
$$\mathbb{E}\|\hat{f} - f\|_2^2 \;\le\; (2 \ln N + 1)\left(\sigma^2 + \sum_{j,k} \min\!\big(d_{j,k}^2,\ \sigma^2\big)\right)$$
The bracketed quantity is the ideal "oracle" risk (keep exactly those coefficients that beat the noise), so thresholding is within a $2 \ln N + 1$ factor of the oracle - essentially unimprovable without additional assumptions.
For AI: Wavelet thresholding connects to:
- LASSO ($\ell_1$ penalty) in the wavelet domain: soft thresholding is the proximal operator of $\lambda \|\cdot\|_1$
- Sparse autoencoders (SAE): identifying sparse representations of neural network activations
- Gradient compression in distributed training: transmit only large-magnitude gradient components (analogous to hard thresholding)
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Applying DWT without specifying boundary conditions | Boundary effects corrupt coefficients near signal edges | Use mode='periodization' for circular, 'symmetric' for mirror padding; always check the cone of influence |
| 2 | Using Haar wavelets for smooth signals and expecting good compression | Haar has only 1 vanishing moment; smooth functions need wavelets with many vanishing moments | Choose db4+ or sym4+ for signals with continuous derivatives |
| 3 | Confusing scale and frequency | In CWT, small scale $a$ = high frequency, large $a$ = low frequency; axes are often plotted reversed | Always label axes explicitly; clarify whether the vertical axis is "scale" or "frequency" |
| 4 | Using the CWT for discrete signals without discretization | CWT is defined for continuous time; applying it naively to sampled data gives aliasing | Use the DWT (Mallat algorithm) for discrete signals, or a properly sampled discrete CWT |
| 5 | Thinking DWT is redundant (like STFT) | DWT has exactly $N$ coefficients for length-$N$ inputs - it is non-redundant | The redundant version (undecimated DWT, "a trous") is useful for denoising but costs $O(NJ)$ storage |
| 6 | Forgetting the $1/\sqrt{a}$ normalization in the CWT | Without it, $\|\psi_{a,b}\|_2$ varies with scale, breaking energy conservation | Always include the $1/\sqrt{a}$ factor |
| 7 | Applying only one level of DWT and calling it "multi-scale" | Single-level DWT splits into only 2 bands; multi-scale analysis requires many levels | Apply DWT recursively to the approximation branch for $J \approx \log_2 N$ levels |
| 8 | Selecting threshold globally (same for all scales) | Noise variance varies across scales for colored noise; a global threshold over- or under-smooths | Use a scale-adaptive threshold: $\lambda_j = \hat{\sigma}_j \sqrt{2 \ln N_j}$ where $N_j$ is the number of coefficients at level $j$ |
| 9 | Assuming wavelet coefficients are Gaussian when noise is Gaussian | Wavelet coefficients of a Gaussian noise signal are approximately Gaussian only for orthogonal wavelets; correlated noise produces non-Gaussian coefficients | Verify distribution; use non-parametric or robust thresholding for colored noise |
| 10 | Conflating the DWT with the FFT as frequency decomposition | DWT gives octave-band (logarithmic) frequency decomposition; FFT gives uniform frequency decomposition | Use DWT when you want multi-scale (octave) analysis; use FFT when you want uniform frequency resolution |
10. Exercises
Exercise 1 * - Haar DWT by Hand
Given a length-8 signal $x \in \mathbb{R}^8$ (choose concrete values, e.g., small integers):
(a) Compute one level of Haar DWT by hand: approximation coefficients $a^{(1)}_k = \frac{x_{2k} + x_{2k+1}}{\sqrt{2}}$ and detail coefficients $d^{(1)}_k = \frac{x_{2k} - x_{2k+1}}{\sqrt{2}}$.
(b) Apply one more level to $a^{(1)}$ to get $a^{(2)}$ and $d^{(2)}$.
(c) Verify that $\|x\|^2 = \|a^{(2)}\|^2 + \|d^{(2)}\|^2 + \|d^{(1)}\|^2$ (Parseval).
(d) Reconstruct $x$ from the wavelet coefficients by reversing the steps. Verify exact reconstruction.
Exercise 2 * - Admissibility and Zero Mean
(a) Verify that the Haar wavelet has zero mean: $\int_0^1 \psi(t)\, dt = 0$.
(b) Compute $\hat{\psi}(\omega)$ for the Haar wavelet explicitly. Show $C_\psi < \infty$.
(c) A student proposes $\psi(t) = e^{-t^2/2}$ (a Gaussian) as a wavelet. Show this fails the admissibility condition.
(d) Compute the admissibility constant for the Mexican Hat wavelet $\psi(t) = (1 - t^2)\, e^{-t^2/2}$. Is it finite?
Exercise 3 * - MRA Axiom Verification
Verify that the Shannon (sinc) MRA satisfies all five MRA axioms:
(a) Prove the nesting property $V_{j+1} \subset V_j$ directly from the definition (a function bandlimited to the narrower band is bandlimited to the wider one).
(b) Prove density: show that for any $f \in L^2(\mathbb{R})$ and $\epsilon > 0$, there exist $j$ and $f_j \in V_j$ with $\|f - f_j\| < \epsilon$.
(c) Verify that $\{\mathrm{sinc}(t - k)\}_{k \in \mathbb{Z}}$ is orthonormal using Parseval's theorem.
(d) Write down the scaling function in the frequency domain ($\hat{\phi} = \mathbf{1}_{[-\pi, \pi]}$). What is the two-scale relation? What is $\hat{h}(\omega)$?
Exercise 4 ** - Mallat Algorithm Implementation
Implement the complete Mallat DWT (analysis) and IDWT (synthesis) algorithms from scratch using only numpy.
(a) Implement haar_dwt(x, J) that applies $J$ levels of Haar DWT to a 1D signal x. Return the coefficient vector of total length len(x).
(b) Implement haar_idwt(coeffs, J) that reconstructs the signal from the coefficients. Verify with max error $< 10^{-10}$.
(c) Apply your DWT to a noisy signal $y[n] = f[n] + \sigma\, \epsilon[n]$ with $\epsilon[n] \sim \mathcal{N}(0, 1)$ (any smooth $f$ will do). Apply soft thresholding with $\lambda = \hat{\sigma}\sqrt{2 \ln N}$ to all detail coefficients. Reconstruct and compute the SNR improvement.
(d) Benchmark your implementation vs pywt.wavedec on a long signal (e.g., $N = 2^{16}$) with 4 levels. Compare execution times and verify the outputs agree to numerical precision.
Exercise 5 ** - Daubechies Filter Verification
(a) Given the db2 filter $h = \frac{1}{4\sqrt{2}}\left(1 + \sqrt{3},\ 3 + \sqrt{3},\ 3 - \sqrt{3},\ 1 - \sqrt{3}\right)$, verify:
- $\sum_k h_k = \sqrt{2}$ (normalization)
- $\sum_k h_k^2 = 1$ (unit energy)
- $h_0 h_2 + h_1 h_3 = 0$ (orthogonality at shift 2)
(b) Compute the QMF partner $g_k = (-1)^k h_{3-k}$ for db2. Verify $|\hat{h}(\omega)|^2 + |\hat{g}(\omega)|^2 = 2$ numerically for 100 values of $\omega$.
(c) Verify db2 has exactly 2 vanishing moments: compute $\sum_k (-1)^k k^m h_k$ for $m = 0$ and $m = 1$ (should both be zero) and $m = 2$ (should be nonzero).
(d) Construct the db2 scaling function by iterating the cascade algorithm: start from $\phi^{(0)} = \mathbf{1}_{[0,1)}$ and apply $\phi^{(i+1)}(t) = \sqrt{2} \sum_k h_k\, \phi^{(i)}(2t - k)$ for 6 iterations. Plot the result.
Exercise 6 ** - 2D DWT and Image Compression
(a) Load or generate a grayscale test image (use scipy.datasets.face(gray=True) or a numpy-generated pattern). Apply 3 levels of 2D Haar DWT using pywt.wavedec2.
(b) Visualize the 3-level wavelet decomposition, labeling all subbands (LL3, LH3, HL3, HH3, LH2, HL2, HH2, LH1, HL1, HH1).
(c) Implement a compression experiment: keep only the top $p\%$ of wavelet coefficients (by magnitude), zero the rest, and reconstruct. Plot PSNR vs $p$ over a grid of values (e.g., $p \in \{1, 2, 5, 10, 20\}$).
(d) Compare with JPEG-style DCT compression: apply the 2D DCT to $8 \times 8$ blocks and keep only the top $p\%$ of DCT coefficients. Plot PSNR vs $p$ for both methods on the same axes. At what compression ratio does DWT outperform DCT?
Exercise 7 *** - Scattering Transform Features
(a) Implement a simple 1D scattering transform (order 1 and 2) using pywt:
- Zeroth order: low-pass filter the input: $S_0 = f * \phi_J$
- First order: $S_1(j_1) = \mathrm{avg}\, |f * \psi_{j_1}|$
- Second order: $S_2(j_1, j_2) = \mathrm{avg}\, \big| |f * \psi_{j_1}| * \psi_{j_2} \big|$
(b) Generate signals from two classes of your choosing (e.g., class A = smooth oscillations, class B = transient bursts). Compute scattering features and verify they are approximately translation-invariant: shift each signal by $\tau$ samples and check that $\|S f_\tau - S f\| \ll \|S f\|$ for $\tau \ll 2^J$.
(c) Test stability to deformation: apply a small time-stretching $f(t) \mapsto f((1 + \epsilon)\, t)$ and measure $\|S f_\epsilon - S f\|$. Compare with the same test for raw Fourier features.
(d) Train a linear classifier on scattering features vs raw FFT features for a 4-class signal classification problem. Compare accuracy and number of features needed.
Exercise 8 *** - Wavelet Denoising and Sparse Coding
(a) Implement Donoho-Johnstone wavelet denoising with:
- Universal threshold $\lambda = \hat{\sigma} \sqrt{2 \ln N}$
- Both soft and hard thresholding
- Noise level estimation via MAD: $\hat{\sigma} = \mathrm{median}(|d_{1,k}|) / 0.6745$
(b) Test on a piecewise-smooth signal containing a discontinuity (your choice) at several noise levels $\sigma$. Plot SNR improvement vs $\sigma$ for soft vs hard thresholding.
(c) Show the connection to LASSO: prove that soft thresholding is the proximal operator of the $\ell_1$ norm: $\eta_\lambda(y) = \arg\min_x \left\{ \tfrac{1}{2}(x - y)^2 + \lambda |x| \right\}$.
(d) Implement a sparse coding experiment: represent a signal as a sparse linear combination of Daubechies db4 wavelets at 3 scales. Use LASSO regression (sklearn.linear_model.Lasso) to find the sparse coefficients. Compare reconstruction quality (PSNR) vs number of non-zero coefficients to direct DWT thresholding.
11. Why This Matters for AI (2026 Perspective)
| AI Concept | Wavelet Connection | Concrete Impact |
|---|---|---|
| Convolutional neural networks | Each conv layer is a learned filter bank; DWT is the fixed-filter special case | Understanding wavelets explains WHY CNNs learn edge detectors and Gabor-like filters at lower layers |
| Swin Transformer patch merging | Structurally a stride-2 DWT low-pass step: concatenate 2x2 patches -> linear projection | Multi-scale attention is the learned version of multi-level DWT |
| ViT + FPN / feature pyramids | Multi-resolution feature maps follow the wavelet octave structure exactly | Architecture design: number of levels, channel doubling at each scale |
| Speech: Whisper / wav2vec 2.0 | Mel spectrogram is an STFT with mel-warped frequencies, approximating wavelet constant-Q analysis | Replacing STFT with CWT could improve speech models; wavelet features are more robust to pitch shift |
| Diffusion models (WaveDiff) | Separate noise diffusion of LL vs LH/HL/HH subbands; reduces spatial dimensions 4x per level | Lower sampling cost, better preservation of global structure, faster inference |
| Scattering networks | CNN-equivalent features with zero learned parameters; provably stable to deformations | Robustness certification, data-efficient learning, group-equivariant architectures |
| Sparse autoencoders (SAE) | Mechanistic interpretability via sparse wavelet-like features of activations | Soft thresholding = LASSO proximal operator; same math as wavelet denoising |
| Gradient compression | Transmit top-$k$ gradient values = hard threshold in parameter space | Wavelet-domain gradients: compress per scale/location, not per parameter |
| Long-range Transformers (WaveBERT) | Compress token sequence by $2^J$ via DWT before attention | $O((N/2^J)^2)$ attention complexity; quality preserved via wavelet inversion |
| Protein structure (SE(3)-nets) | Scattering-inspired SE(3)-equivariant networks process atomic coordinates | AlphaFold2's geometry processing; antibody design; molecular dynamics |
| JPEG 2000 / neural codecs | CDF 9/7 biorthogonal wavelet; hyperprior in learned image compression | Variable-rate coding; ROI coding; streaming and progressive decode |
| Spectral bias in neural networks | Networks learn low-frequency components first (captured by coarse wavelet scales) | Curriculum learning: train on coarse scales first; progressive training strategies |
12. Conceptual Bridge
Looking Backward
This section sits at the summit of the Fourier Analysis chapter. Everything built in Section 20-01 through Section 20-04 contributes:
From Section 20-01 (Fourier Series): The idea of decomposing a signal into orthogonal basis functions is generalized: instead of sinusoids with fixed global support, we use wavelets with variable-scale local support. The Parseval identity from Section 20-01 reappears as the energy conservation property of the wavelet transform.
From Section 20-02 (Fourier Transform): The uncertainty principle is the fundamental constraint that motivates wavelets. The continuous Fourier transform is the "full-global" limit; the CWT interpolates between full-time and full-frequency localization. Plancherel's theorem underlies both.
From Section 20-03 (DFT and FFT): The Mallat algorithm uses convolution + downsampling - precisely the operations made efficient by the FFT in Section 20-03. The DWT at each level costs $O(N)$ via direct convolution (the filter is short); a CWT discretization would cost $O(N \log N)$ per scale via FFT.
From Section 20-04 (Convolution Theorem): The filter bank conditions (QMF, perfect reconstruction) are applications of the convolution algebra from Section 20-04. The LTI system framework identifies each wavelet scale as a bandpass filter; the synthesis bank cancels aliasing via the Convolution Theorem's frequency-domain perspective.
Looking Forward
Statistical Learning Theory (Section 21): Wavelet regularity classes (Besov spaces, Sobolev spaces) are the natural function classes for non-parametric estimation. Minimax rates of convergence depend on the wavelet regularity of the target function.
Functional Analysis (Section 12): The MRA axioms live in - a Hilbert space (Section 12-02). Riesz bases, frames, and Hilbert space geometry underpin the infinite-dimensional wavelet theory. The scattering transform connects to group representations (Section 12-04 preview).
Differential Geometry (Section 25): Wavelets generalize to manifolds via harmonic analysis on Riemannian spaces. Spectral graph wavelets (Section 11-04) extend the MRA framework to graphs. Heat kernel wavelets define multi-scale decompositions on curved spaces - used in geometric deep learning.
WAVELET SECTION IN CURRICULUM
=======================================================================
Section 20-01 Fourier Series Section 12-02 Hilbert Spaces
v orthogonal basis concept v L2 theory
Section 20-02 Fourier Transform Section 20-03 DFT and FFT
v uncertainty principle v filter banks, O(N log N)
Section 20-04 Convolution Theorem
v QMF, LTI systems
+======================================+
| Section 20-05 WAVELETS |
| ----------------------------------- |
| CWT, MRA, Mallat O(N) algorithm |
| Daubechies, QMF, perfect recon. |
| 2D DWT, JPEG 2000, scattering |
| Denoising, WaveDiff, WaveBERT |
+======================================+
v v
Section 21 Statistical Section 11-04 Spectral
Learning Theory Graph Theory
(Besov spaces, (Graph wavelets,
minimax rates) GCN, ChebNet)
v v
Section 25 Differential Section 12-04 Group
Geometry Representations
(manifold wavelets) (scattering theory)
=======================================================================
<- Back to Fourier Analysis | Previous: Convolution Theorem <- | Next Chapter: Statistical Learning Theory ->
Appendix A: Biorthogonal Wavelets and JPEG 2000
A.1 Motivation for Biorthogonal Wavelets
Daubechies' theorem establishes an uncomfortable trade-off for orthonormal wavelets: a compactly supported orthonormal wavelet with more than 1 vanishing moment cannot be symmetric. This is a fundamental obstruction - symmetry requires the filter to satisfy $h_k = h_{L-1-k}$ for filter length $L$, which is incompatible with compact support and orthonormality (except for the Haar wavelet).
Symmetry is highly desirable in image processing: symmetric filters do not introduce phase distortion, and symmetric boundary extension (mirror padding) is more natural than periodic extension. For this reason, biorthogonal wavelets (Cohen-Daubechies-Feauveau, 1992) replace the single orthonormal filter bank with a pair of filter banks:
- Analysis filters: $(\tilde{h}, \tilde{g})$ - used in the forward DWT
- Synthesis filters: $(h, g)$ - used in the inverse DWT
where the analysis and synthesis functions $(\tilde{\phi}, \tilde{\psi})$ and $(\phi, \psi)$ satisfy the biorthogonality condition:
$$\langle \phi(\cdot - k),\, \tilde{\phi}(\cdot - l) \rangle = \delta_{kl}, \qquad \langle \psi_{j,k},\, \tilde{\psi}_{j',k'} \rangle = \delta_{jj'}\, \delta_{kk'}$$
but $\langle \phi(\cdot - k),\, \phi(\cdot - l) \rangle \ne \delta_{kl}$ in general (the basis functions are not orthogonal to their own translates).
A.2 The CDF 9/7 Wavelet
The Cohen-Daubechies-Feauveau 9/7 wavelet (CDF 9/7) is the biorthogonal wavelet used in JPEG 2000 for lossy compression. It has:
- Analysis filter length: 9 taps (odd, therefore symmetric)
- Synthesis filter length: 7 taps (odd, symmetric)
- 4 vanishing moments (analysis) / 4 vanishing moments (synthesis)
- Linear phase (symmetric filter -> zero phase distortion)
The 9/7 analysis lowpass filter coefficients are:
| $k$ | $h_k$ |
|---|---|
| 0 | 0.6029490182363579 |
| ±1 | 0.2668641184428723 |
| ±2 | -0.07822326652898785 |
| ±3 | -0.01686411844287495 |
| ±4 | 0.02674875741080976 |
The filter is symmetric: $h_{-k} = h_k$. The corresponding synthesis filter has 7 nonzero coefficients.
Why 9/7? The choice balances:
- Compression efficiency (4 vanishing moments approx good polynomial annihilation)
- Visual quality (smooth synthesis wavelet approx no ringing artifacts)
- Computational cost (9 taps = fast, especially in hardware)
- Symmetry (eliminates the Daubechies asymmetry problem)
For lossless JPEG 2000, the Le Gall 5/3 biorthogonal wavelet is used (integer lifting steps, exact arithmetic).
A.3 Lifting Scheme
The lifting scheme (Sweldens, 1995) provides an efficient factorization of any wavelet filter bank into elementary "lifting steps" - pairs of predict and update operations:
1. Split: divide the signal into even ($x_{2n}$) and odd ($x_{2n+1}$) samples
2. Predict: replace the odd samples by their prediction error: $d[n] = x_{2n+1} - P(x_{\mathrm{even}})[n]$
3. Update: adjust the even samples to maintain global statistics (e.g., the mean): $a[n] = x_{2n} + U(d)[n]$
4. Scale: normalize: $a \leftarrow \kappa\, a$, $d \leftarrow d / \kappa$
The lifting scheme:
- Requires roughly half the multiplications and additions of the direct filter-bank implementation
- Allows in-place computation (no extra memory)
- Supports the integer DWT (lossless coding) by rounding at each step
- Makes hardware implementation trivial (pipeline two lifting steps)
The Haar DWT is the simplest lifting scheme: Predict step $d[n] = x_{2n+1} - x_{2n}$, Update step $a[n] = x_{2n} + d[n]/2$.
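A minimal sketch of this Haar lifting step and its inverse (unnormalized, integer-friendly form; multiplying $a$ by $\sqrt{2}$ and dividing $d$ by $\sqrt{2}$ would restore orthonormality):

```python
import numpy as np

def haar_lift_forward(x):
    """One Haar lifting step: predict then update."""
    even, odd = x[0::2].copy(), x[1::2].copy()
    d = odd - even            # predict: detail = prediction error of odd samples
    a = even + d / 2          # update: approximation keeps the pairwise mean
    return a, d

def haar_lift_inverse(a, d):
    even = a - d / 2          # undo update
    odd = d + even            # undo predict
    x = np.empty(2 * len(a))
    x[0::2], x[1::2] = even, odd
    return x

x = np.arange(8.0)
a, d = haar_lift_forward(x)
assert np.allclose(haar_lift_inverse(a, d), x)   # lifting steps are trivially invertible
```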
Appendix B: Frames and Redundant Representations
B.1 Wavelet Frames
When wavelet functions do not form an exact orthonormal basis but still allow signal reconstruction, they form a frame. A frame $\{\psi_\gamma\}$ satisfies:
$$A\, \|f\|^2 \le \sum_\gamma |\langle f, \psi_\gamma \rangle|^2 \le B\, \|f\|^2 \qquad \text{for all } f \in L^2(\mathbb{R})$$
for constants $0 < A \le B < \infty$ called frame bounds. If $A = B$, the frame is tight (uniform stability in all directions). Tight frames satisfy $f = \frac{1}{A} \sum_\gamma \langle f, \psi_\gamma \rangle\, \psi_\gamma$ (simple inversion formula, same as an orthonormal basis but with redundancy $A$).
Why frames? Overcomplete (redundant) representations have:
- Better numerical stability (more "angles" to represent the signal)
- Invariance properties (e.g., undecimated DWT = shift-invariant wavelet frame)
- Flexibility in design (no restriction to dyadic sampling)
Continuous wavelet frames. The CWT with an arbitrary discretization of $(a, b)$ forms a frame as long as the sampling is dense enough. For dyadic-style sampling $a = a_0^j$, $b = k\, b_0\, a_0^j$ with $a_0$ close enough to 1 and $b_0$ small enough, the sampled CWT forms a frame.
B.2 Undecimated DWT (Stationary Wavelet Transform)
The undecimated DWT (also called shift-invariant DWT or "a trous" algorithm) removes the downsampling step in the Mallat algorithm:
Analysis step (undecimated):
a^(j+1)[n] = sum_k h_k a^(j)[n - 2^j k]   (convolution with the upsampled filter)
d^(j+1)[n] = sum_k g_k a^(j)[n - 2^j k]
The output at each level has the same length $N$ as the input. Total output size: $(J + 1)\, N$ (redundant by a factor of $J + 1$).
Properties:
- Shift-invariant: circularly shifting the input circularly shifts every coefficient sequence identically - unlike the downsampled DWT, where shifts cause coefficient permutations across subbands
- Better denoising: Averaging over all circular shifts of the DWT (cycle-spinning) gives the undecimated DWT implicitly; this averages out the artifacts from the dyadic grid
- Cost: $O(NJ) = O(N \log N)$ instead of $O(N)$ - more expensive but often worth it for denoising
Cycle-spinning (Coifman-Donoho, 1995): Average wavelet-threshold denoising over all cyclic shifts. Equivalent to undecimated DWT thresholding. Eliminates "ringing" artifacts near discontinuities that standard DWT thresholding produces.
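A quick demonstration of the shift-invariance difference between the decimated DWT and the SWT (the impulse position and wavelet are arbitrary choices):

```python
import numpy as np
import pywt

x = np.zeros(64); x[20] = 1.0            # unit impulse
x_shift = np.roll(x, 1)

# Decimated DWT: detail coefficients change non-trivially under a 1-sample shift
d = pywt.wavedec(x, 'db2', mode='periodization', level=2)[-1]
d_s = pywt.wavedec(x_shift, 'db2', mode='periodization', level=2)[-1]
print(np.allclose(np.roll(d, 1), d_s))   # False: not shift-equivariant

# Undecimated SWT: coefficients simply shift along with the input
(ca, cd), = pywt.swt(x, 'db2', level=1)
(ca_s, cd_s), = pywt.swt(x_shift, 'db2', level=1)
print(np.allclose(np.roll(cd, 1), cd_s))  # True: shift-invariant representation
```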
Appendix C: Wavelet Regularity and Approximation Theory
C.1 Besov Spaces
The natural function spaces for wavelet approximation theory are Besov spaces , which generalize Sobolev spaces by specifying different regularity in different norms.
A function $f$ belongs to $B^s_{p,q}$ ($s$ = smoothness, $p$ = integrability, $q$ = scale summability) if and only if its wavelet coefficients satisfy (1D case, with the fine-scale indexing convention):
$$\left( \sum_{j} \left( 2^{j(s + 1/2 - 1/p)} \Big( \sum_k |d_{j,k}|^p \Big)^{1/p} \right)^{q} \right)^{1/q} < \infty$$
This characterization makes wavelet coefficients the canonical coordinate system for Besov spaces.
This characterization makes wavelet coefficients the canonical coordinate system for Besov spaces.
Key special cases:
- $B^s_{2,2} = H^s$ (Sobolev space) - standard smooth functions
- $B^s_{\infty,\infty} = C^s$ (Hölder space) - functions with uniformly Hölder-continuous derivatives
- $B^s_{1,1}$ - functions with sparse wavelet representations (relevant for compression)
For AI: The spectral bias of neural networks (Rahaman et al., 2018) can be formalized as: gradient descent preferentially fits functions in lower-order Besov spaces first. Understanding Besov spaces explains why networks need curriculum learning for high-frequency targets.
C.2 Approximation Error in Wavelet Bases
Theorem C.1 (Nonlinear approximation, Cohen-Dahmen-DeVore 2001). For $f$ in the Besov space $B^s_{p,p}$ with $1/p = 1/2 + s/d$, the best $n$-term approximation $f_n$ (keeping the $n$ largest wavelet coefficients) satisfies:
$$\|f - f_n\|_{L^2} \le C\, n^{-s/d}$$
where $d$ is the signal dimension ($d = 1$ for 1D signals, $d = 2$ for images). Compare to linear approximation (keeping all $n$ coefficients at the coarsest scales):
$$\|f - P_n f\|_{L^2} \le C\, n^{-s/d} \quad \text{only when } f \in H^s$$
where $n$ is the number of kept coefficients. Both rates are the same for Sobolev spaces, but nonlinear approximation is far better for piecewise smooth functions (e.g., natural images with edges).
For image compression: Natural images are well-modeled as piecewise smooth (cartoon model: smooth regions + edges). Their best $n$-term wavelet approximation error decays as $O(n^{-1})$ in 2D, while DCT/Fourier-based approximation achieves only $O(n^{-1/2})$ (no adaptation to edges). This is the theoretical explanation for JPEG 2000's superiority over JPEG.
C.3 Wavelet Coefficients and Signal Properties
The wavelet coefficients $d_{j,k} = \langle f, \psi_{j,k} \rangle$ carry rich information about the local structure of $f$:
| Coefficient behavior | Signal property |
|---|---|
| $|d_{j,k}|$ decays rapidly toward fine scales | $f$ is smooth near $t = 2^j k$ |
| $|d_{j,k}| \sim a^{\alpha + 1/2}$ power law in the scale $a = 2^j$ | $f$ is Hölder-$\alpha$ at that location |
| Large isolated coefficients persisting across scales | Point singularity (discontinuity) at that location |
| Large $|d_{j,k}|$ confined to the cone $|2^j k - t_0| \le C\, 2^j$ | A singularity at $t_0$ affects exactly the coefficients in its cone of influence |
This local regularity characterization is used in:
- Image edge detection via wavelet modulus maxima (Mallat-Hwang algorithm)
- Fractal dimension estimation from the power-law behavior of $|d_{j,k}|$ across scales
- Neural network weight analysis - regularity of trained weight matrices
Appendix D: Wavelet Connections to Other Methods
D.1 Relationship to Subband Coding
Subband coding (used in audio codecs such as MP3 and DTS) divides a signal into frequency bands using a filter bank. The DWT is a special case of octave subband coding with:
- Logarithmic (octave) frequency spacing
- Orthogonal / biorthogonal filter bank
- Perfect reconstruction guarantee
MP3 uses a 32-band uniform QMF filter bank (not wavelet - uniform bands). AAC uses the Modified DCT (MDCT) - a lapped transform that shares the PR property with wavelets but uses cosine instead of wavelet basis functions. The similarity is not coincidental: all PR filter banks can be parameterized via unitary matrices (Vaidyanathan, 1993).
D.2 Relationship to Multirate Signal Processing
The DWT is a multirate signal processing system. Key identities:
- Polyphase representation: express the filter bank in terms of even/odd polyphase components $H_e(z)$, $H_o(z)$, so that all filtering runs at the low (downsampled) rate
- Noble identities: filtering with $H(z)$ then upsampling by 2 equals upsampling then filtering with $H(z^2)$ (with a dual statement for downsampling)
- Nyquist (half-band) criterion for PR: the product filter $P(z) = H(z)\tilde{H}(z)$ must satisfy $P(z) + P(-z) = 2$
Understanding this connection enables implementation optimizations: the polyphase form halves the arithmetic (it never computes samples that downsampling would discard), and FFT-based polyphase filtering reduces the per-sample cost from $O(L)$ to $O(\log L)$ for long ($L$-tap) filters.
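A minimal NumPy check that the polyphase form matches direct convolve-then-downsample (a sketch; the even/odd phase pairing shown corresponds to keeping the odd-indexed outputs, one of several equivalent conventions):

```python
import numpy as np
import pywt

h = np.array(pywt.Wavelet('db2').dec_lo)             # analysis low-pass, 4 taps
x = np.random.default_rng(0).standard_normal(256)

# Direct form: filter at full rate, then discard every other output.
direct = np.convolve(x, h)[1::2]

# Polyphase form: split signal and filter into even/odd phases and
# convolve at the half rate - same result, half the arithmetic.
x_even, x_odd = x[0::2], x[1::2]
h_even, h_odd = h[0::2], h[1::2]
poly = np.convolve(x_odd, h_even) + np.convolve(x_even, h_odd)

print(np.allclose(direct, poly))                     # True
```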
D.3 Relationship to Attention Mechanisms
The wavelet transform and self-attention both compute weighted sums over input positions. The differences are instructive:
| Property | Wavelet Transform | Self-Attention |
|---|---|---|
| Weights | Fixed (filters $h$, $g$) | Data-dependent ($\mathrm{softmax}(QK^\top / \sqrt{d_k})$) |
| Locality | Compact support of $\psi$ | Global (all pairs) |
| Scale | Dyadic hierarchy (fixed) | Single scale (fixed head dimension) |
| Complexity | $O(N)$ | $O(N^2)$ |
| Equivariance | Translation-equivariant | Permutation-equivariant |
| Inversion | Exact (PR) | Approximate (residual stream preserves input) |
Attention can be viewed as a data-adaptive, globally-supported filter bank - it learns which positions to attend to, while a wavelet selects which scale/location to probe. The convergence of wavelet ideas and attention is driving research in wavelet-based long-range Transformers (see Section 8.2).
D.4 Wavelets and Renormalization Group Theory
In theoretical physics, the renormalization group (RG) describes how physical systems look at different scales. The key operation - integrating out short-distance degrees of freedom to obtain an effective theory at larger scales - is structurally identical to the DWT low-pass step:
- RG block-spin transformation (Kadanoff 1966): average spins in a block -> coarser description
- DWT approximation step: $a_{j-1,k} = \sum_n h[n - 2k]\, a_{j,n}$ - average input samples via $h$ -> coarser approximation
The detail coefficients capture the "fluctuations" integrated out at each RG step. The analogy is exact for the Haar wavelet - block averaging is the Kadanoff transformation up to a $\sqrt{2}$ normalization, as the sketch below shows - and approximate for general wavelets.
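A few lines of NumPy make the Haar/block-spin correspondence concrete (a sketch using pywt's Haar filters):

```python
import numpy as np
import pywt

x = np.random.default_rng(1).standard_normal(16)
cA, cD = pywt.dwt(x, 'haar', mode='periodization')

block_avg = (x[0::2] + x[1::2]) / 2              # Kadanoff-style pairwise block average
print(np.allclose(cA, np.sqrt(2) * block_avg))   # True: Haar cA = sqrt(2) * block average
```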
For AI: This RG/wavelet correspondence explains why:
- Neural networks with hierarchical architectures (ResNet, UNet, Swin) effectively implement learned RG flows
- The number of scales needed grows logarithmically with the ratio of finest to coarsest detail
- Feature representations at different layers of a CNN correspond to effective theories at different RG scales
Numerical renormalization group methods in quantum many-body physics use tensor network contractions that are analogous to wavelet coefficient manipulations.
Appendix E: Quick Reference
E.1 Notation Summary
| Symbol | Meaning |
|---|---|
| $\psi(t)$ | Mother wavelet |
| $\phi(t)$ | Scaling function (father wavelet) |
| $\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)$ | Dilated/translated wavelet |
| $\psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k)$ | DWT basis function (scale $j$, position $k$) |
| $W_f(a,b)$ | Continuous Wavelet Transform |
| $d_{j,k}$ | Detail coefficients |
| $a_{j,k}$ | Approximation coefficients |
| $h[n]$ | Low-pass (scaling) filter |
| $g[n]$ | High-pass (wavelet) filter (QMF relation) |
| $V_j$ | Approximation space at scale $j$ |
| $W_j$ | Detail space at scale $j$ |
| $C_\psi$ | Admissibility constant |
| db$N$ | Daubechies wavelet with $N$ vanishing moments |
| sym$N$ | Symlet (near-symmetric version of db$N$) |
E.2 Key Formulas
Admissibility: $\psi$ is admissible iff $C_\psi = \int_0^\infty \frac{\lvert \hat{\psi}(\omega) \rvert^2}{\omega}\, d\omega < \infty$.

CWT: $W_f(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi^*\!\left(\frac{t-b}{a}\right) dt$

CWT inversion: $f(t) = \frac{1}{C_\psi} \int_0^{\infty}\!\! \int_{-\infty}^{\infty} W_f(a,b)\, \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t-b}{a}\right) \frac{db\, da}{a^2}$

Two-scale relation: $\phi(t) = \sqrt{2} \sum_n h[n]\, \phi(2t - n)$

Wavelet relation: $\psi(t) = \sqrt{2} \sum_n g[n]\, \phi(2t - n)$

QMF relation: $g[n] = (-1)^n h[1 - n]$ (equivalently $g[n] = (-1)^n h[L - 1 - n]$ for a length-$L$ filter)

Mallat analysis: $a_{j-1}[k] = \sum_n h[n - 2k]\, a_j[n]$, $\quad d_{j-1}[k] = \sum_n g[n - 2k]\, a_j[n]$

Mallat synthesis: $a_j[n] = \sum_k h[n - 2k]\, a_{j-1}[k] + \sum_k g[n - 2k]\, d_{j-1}[k]$

Parseval (DWT): $\|f\|_2^2 = \sum_k \lvert a_{j_0,k} \rvert^2 + \sum_{j \ge j_0} \sum_k \lvert d_{j,k} \rvert^2$

Vanishing moments: $\int t^m\, \psi(t)\, dt = 0$ for $m = 0, 1, \ldots, N-1$

Universal threshold: $\lambda = \hat{\sigma} \sqrt{2 \ln N}$, $\quad \hat{\sigma} = \mathrm{median}_k \lvert d_{j,k} \rvert / 0.6745$ at the finest scale

Soft threshold: $\eta_\lambda(x) = \mathrm{sign}(x)\, \max(\lvert x \rvert - \lambda,\, 0)$
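Several of these identities can be sanity-checked numerically with pywt (a sketch; `db4` and the signal length are arbitrary choices):

```python
import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(512)

# Perfect reconstruction (Mallat analysis + synthesis)
coeffs = pywt.wavedec(x, 'db4', mode='periodization')
print(np.allclose(pywt.waverec(coeffs, 'db4', mode='periodization'), x))  # True

# Parseval: signal energy equals total coefficient energy (orthogonal wavelet)
print(np.isclose(np.sum(x**2), sum(np.sum(c**2) for c in coeffs)))        # True

# Vanishing moments: the high-pass filter annihilates polynomials of degree < N
g = np.array(pywt.Wavelet('db4').dec_hi)
n = np.arange(len(g))
print([np.isclose(np.sum(n**m * g), 0) for m in range(4)])                # all True
```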
E.3 Standard Wavelet Types and Their Uses
| Use Case | Recommended Wavelet | Reason |
|---|---|---|
| Audio denoising | db4, sym4 | 4 vanishing moments, regular, fast |
| Image compression | CDF 9/7, CDF 5/3 | Symmetric, linear phase, JPEG 2000 standard |
| Edge detection | Mexican Hat, db2 | 2nd derivative structure; 2 vanishing moments |
| EEG/EMG analysis | Morlet, db4 | Optimal time-frequency, physiological band structure |
| Time series (financial) | db4, Haar | Haar for piecewise constant, db4 for regular |
| Geophysical seismology | Morlet, Paul | Complex, analytic (instantaneous frequency) |
| Scattering networks | Complex Morlet | Optimal time-frequency uncertainty; modulus of an analytic wavelet gives smooth envelopes |
| Lossless image coding | CDF 5/3 (integer) | Integer lifting steps, exact inverse |
| Quick analysis | Haar | Fastest, simplest, pedagogical |
Appendix F: Wavelet Software Ecosystem
F.1 PyWavelets (pywt)
The standard Python wavelet library. Key functions:
```python
import numpy as np
import pywt

# List available wavelets
pywt.families()               # ['haar', 'db', 'sym', 'coif', 'bior', 'rbio', 'dmey', ...]
pywt.wavelist(family='db')    # ['db1', 'db2', ..., 'db38']

# Create wavelet object and inspect its filters
w = pywt.Wavelet('db4')
print(w.dec_lo)   # decomposition (analysis) low-pass filter
print(w.dec_hi)   # decomposition (analysis) high-pass filter
print(w.rec_lo)   # reconstruction (synthesis) low-pass filter
print(w.rec_hi)   # reconstruction (synthesis) high-pass filter

# 1D DWT (signal: any 1D array)
coeffs = pywt.wavedec(signal, 'db4', level=5)   # [cA5, cD5, cD4, cD3, cD2, cD1]
rec = pywt.waverec(coeffs, 'db4')

# 2D DWT (image: any 2D array)
coeffs2 = pywt.wavedec2(image, 'db4', level=3)
# coeffs2 = [cA3, (cH3, cV3, cD3), (cH2, cV2, cD2), (cH1, cV1, cD1)]
rec2 = pywt.waverec2(coeffs2, 'db4')

# Continuous Wavelet Transform (fs: sampling rate in Hz)
scales = np.arange(1, 64)
coefficients, frequencies = pywt.cwt(signal, scales, 'morl', sampling_period=1/fs)
```
F.2 Denoising with pywt
```python
import numpy as np
import pywt

def denoise_signal(y, wavelet='db4', level=5, mode='soft'):
    # Decompose
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Estimate noise std from the finest-scale details (MAD / 0.6745)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    # Universal threshold (Donoho-Johnstone)
    N = len(y)
    threshold = sigma * np.sqrt(2 * np.log(N))
    # Threshold all detail coefficients (keep the approximation untouched)
    thresholded = [coeffs[0]]
    for c in coeffs[1:]:
        thresholded.append(pywt.threshold(c, threshold, mode=mode))
    # Reconstruct (trim a possible extra boundary sample)
    return pywt.waverec(thresholded, wavelet)[:len(y)]
```
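A quick usage check, continuing from the snippet above (the test signal and the `snr_db` helper are illustrative):

```python
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 2048)
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sign(np.sin(2 * np.pi * 2 * t))
noisy = clean + 0.3 * rng.standard_normal(t.size)

denoised = denoise_signal(noisy)

def snr_db(ref, est):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est)**2))

print(f"SNR before: {snr_db(clean, noisy):.1f} dB, after: {snr_db(clean, denoised):.1f} dB")
```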
F.3 Integration with PyTorch
For differentiable wavelet transforms in deep learning:
```python
import torch
from pytorch_wavelets import DWTForward, DWTInverse  # pip install pytorch_wavelets

xfm = DWTForward(J=3, mode='zero', wave='db3')   # forward DWT, 3 levels
ifm = DWTInverse(mode='zero', wave='db3')        # inverse DWT

x = torch.randn(8, 3, 64, 64)                    # (batch, channels, height, width)
Yl, Yh = xfm(x)      # Yl: low-frequency (coarse) band; Yh: list of high-frequency subbands
x_rec = ifm((Yl, Yh))
```
For learnable wavelet-like transforms, the filter coefficients $h$ and $g$ can be parameterized as network weights while enforcing the PR conditions as hard constraints or soft penalties.
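One way such a soft penalty can look in PyTorch (a sketch, not an established API: the module name `LearnableWavelet`, the tap count, and the penalty form are illustrative assumptions). The penalty encodes the orthonormality conditions $\sum_n h[n]\, h[n-2k] = \delta_{k,0}$ and $\sum_n h[n] = \sqrt{2}$; $g$ is derived from $h$ via the QMF relation, so only $h$ is learned:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableWavelet(nn.Module):
    """Hypothetical learnable 1D analysis filter pair with a soft PR penalty."""
    def __init__(self, num_taps=8):
        super().__init__()
        self.h = nn.Parameter(torch.randn(num_taps) / num_taps ** 0.5)

    def g(self):
        # QMF relation: g[n] = (-1)^n h[L-1-n]
        signs = (-1.0) ** torch.arange(self.h.numel())
        return signs * self.h.flip(0)

    def pr_penalty(self):
        L = self.h.numel()
        # Autocorrelation of h at even lags 0, 2, 4, ... should be delta_{k,0}
        corr = F.conv1d(self.h.view(1, 1, -1), self.h.view(1, 1, -1),
                        padding=L - 1).view(-1)[L - 1::2]
        target = torch.zeros_like(corr)
        target[0] = 1.0
        norm_pen = (self.h.sum() - 2 ** 0.5) ** 2     # sum h = sqrt(2)
        return ((corr - target) ** 2).sum() + norm_pen

    def forward(self, x):
        # x: (batch, 1, length); analysis step = filter, then downsample by 2
        h = self.h.view(1, 1, -1)
        g = self.g().view(1, 1, -1)
        a = F.conv1d(x, h, stride=2, padding=h.numel() - 1)
        d = F.conv1d(x, g, stride=2, padding=g.numel() - 1)
        return a, d
```

During training, a combined objective such as `loss = task_loss + lam * model.pr_penalty()` (with `lam` a hyperparameter) nudges the learned filters toward a perfect-reconstruction pair.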
Appendix G: Proofs of Key Results
G.1 Proof: QMF Relation Ensures Orthogonality
Claim: If $g[n] = (-1)^n h[1-n]$, then the wavelet translates $\{\psi(t-k)\}_k$ form an orthonormal system orthogonal to the scaling translates $\{\phi(t-k)\}_k$.

Proof sketch. In the frequency domain, $\hat{\psi}(\omega) = m_1(\omega/2)\, \hat{\phi}(\omega/2)$, where $m_1(\omega) = \frac{1}{\sqrt{2}} \sum_n g[n]\, e^{-in\omega}$. The orthonormality condition for translates of $\psi$ requires:

$$\sum_k \lvert \hat{\psi}(\omega + 2\pi k) \rvert^2 = 1 \quad \text{for a.e. } \omega.$$

Splitting the sum into even and odd $k$ and using $\sum_k \lvert \hat{\phi}(\omega + 2\pi k) \rvert^2 = 1$ (orthonormality of translates of $\phi$), this reduces to the power-complementary property:

$$\lvert m_1(\omega) \rvert^2 + \lvert m_1(\omega + \pi) \rvert^2 = 1.$$

The QMF relation gives $m_1(\omega) = -e^{-i\omega}\, \overline{m_0(\omega + \pi)}$, so this is exactly the property $\lvert m_0(\omega) \rvert^2 + \lvert m_0(\omega + \pi) \rvert^2 = 1$ already satisfied by $h$. Cross-orthogonality to $\phi$ reduces similarly to $m_0(\omega)\, \overline{m_1(\omega)} + m_0(\omega + \pi)\, \overline{m_1(\omega + \pi)} = 0$, which the QMF relation guarantees identically.
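The power-complementary identity at the heart of the proof is easy to verify numerically for a concrete filter (a sketch; db4 and the grid size are arbitrary):

```python
import numpy as np
import pywt

h = np.array(pywt.Wavelet('db4').rec_lo)          # scaling filter, sum = sqrt(2)
w = np.linspace(0, np.pi, 1000)

def m0(w):
    n = np.arange(len(h))
    return (h / np.sqrt(2)) @ np.exp(-1j * np.outer(n, w))

val = np.abs(m0(w))**2 + np.abs(m0(w + np.pi))**2
print(np.allclose(val, 1.0))                      # True: |m0(w)|^2 + |m0(w+pi)|^2 = 1
```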
G.2 Proof: Mallat Algorithm Computes Inner Products
Claim: The approximation coefficients produced by the Mallat algorithm equal $a_{j,k} = \langle f, \phi_{j,k} \rangle$ and the detail coefficients equal $d_{j,k} = \langle f, \psi_{j,k} \rangle$.

Proof sketch. By induction on the level. At the finest scale $J$, $a_{J,k} = \langle f, \phi_{J,k} \rangle \approx 2^{-J/2} f(2^{-J} k)$ for a sufficiently fine sampling grid (the standard initialization). For the inductive step, the filter-and-downsample operation gives:

$$a_{j-1,k} = \sum_n h[n - 2k]\, a_{j,n} = \Big\langle f, \sum_n h[n - 2k]\, \phi_{j,n} \Big\rangle = \langle f, \phi_{j-1,k} \rangle,$$

using the two-scale relation $\phi_{j-1,k} = \sum_n h[n - 2k]\, \phi_{j,n}$ (which follows from $\phi(t) = \sqrt{2} \sum_n h[n]\, \phi(2t - n)$). The identical argument with $g$ in place of $h$ gives $d_{j-1,k} = \langle f, \psi_{j-1,k} \rangle$.
G.3 Proof: DWT Has Complexity
Claim: Computing the full $J$-level DWT of a length-$N$ signal costs $O(N)$ operations.

Proof. Level $\ell$ of the cascade processes a sequence of length $N/2^{\ell-1}$ with two filters ($h$ and $g$) of $L$ taps each, producing $N/2^{\ell-1}$ output coefficients at $L$ multiply-adds apiece. Total:

$$\sum_{\ell=1}^{J} L \cdot \frac{N}{2^{\ell-1}} \le L N \sum_{\ell=0}^{\infty} 2^{-\ell} = 2 L N.$$

Since $L$ is fixed (it depends only on the wavelet family, not on $N$), the total is $O(N)$.

Remark: The constant is small and explicit: for an $L$-tap filter, at most $2L$ multiplications per input sample across all levels. For db4 ($L = 8$): 16 multiplications/sample - far fewer than the $O(N \log N)$ total (roughly $\log_2 N$ operations per sample) of the FFT.
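An empirical confirmation of the linear scaling (a sketch; absolute timings are machine-dependent, but the per-sample cost should stay roughly constant as $N$ grows):

```python
import time
import numpy as np
import pywt

for N in [2**16, 2**17, 2**18, 2**19]:
    x = np.random.default_rng(0).standard_normal(N)
    t0 = time.perf_counter()
    for _ in range(20):
        pywt.wavedec(x, 'db4')
    dt = (time.perf_counter() - t0) / 20
    print(f"N = {N:7d}: {dt*1e3:6.2f} ms  ({dt/N*1e9:.2f} ns/sample)")
```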
This section is part of Chapter 20: Fourier Analysis and Signal Processing. For questions or corrections, open an issue in the repository.