Wavelets and Multiresolution Analysis

"Wavelets are a mathematical microscope: by changing the magnification and the position of the lens, one can examine local features at any desired scale."

  • Stéphane Mallat, A Wavelet Tour of Signal Processing, 1998

Overview

The Fourier transform is one of the most powerful tools in mathematics - but it has a fundamental blind spot. When you decompose a signal into sinusoids, you learn which frequencies are present, but you lose all information about when they occur. A spike at 1 kHz in the first millisecond looks identical to a spike at 1 kHz in the last millisecond. For stationary signals this is fine, but real-world signals - speech, music, seismic data, financial time series, images - are profoundly non-stationary. Their frequency content changes over time.

Wavelets solve this problem by replacing the infinite sinusoidal Fourier atoms with localized oscillations called wavelets ("small waves"). A wavelet $\psi$ is a function with zero mean, compact (or rapidly decaying) support, and enough smoothness to serve as a basis. By shifting $\psi$ in time and scaling it in frequency, we get a family of basis functions that simultaneously encode position and scale information. The resulting wavelet transform gives a time-scale representation of signals that adapts to their local structure.

This section develops wavelets from first principles: the continuous wavelet transform (CWT), the theory of Multiresolution Analysis (MRA) that underpins discrete wavelets, the Mallat fast algorithm ($O(N)$ - faster than FFT!), the celebrated Daubechies construction of compactly-supported wavelets with vanishing moments, 2D wavelet transforms and JPEG 2000, and the modern AI applications: Mallat's scattering networks (stable CNNs without learned weights), wavelet-based attention transformers, and wavelet regularization in diffusion models.

Prerequisites

  • Fourier Transform - FT definition, Plancherel theorem, uncertainty principle (Section 20-02)
  • DFT and FFT - discrete Fourier transform, filter banks (Section 20-03)
  • Convolution Theorem - filter design, LTI systems, QMF conditions (Section 20-04)
  • $L^2$ spaces - inner products, orthonormal bases, Hilbert space completeness (Section 12-02 preview)
  • Complex exponentials - $e^{i\theta} = \cos\theta + i\sin\theta$, modulus, argument (Section 01)

Companion Notebooks

Notebook | Description
--- | ---
theory.ipynb | CWT scalograms, MRA cascade, Mallat algorithm, Daubechies construction, 2D DWT, scattering networks, wavelet denoising
exercises.ipynb | 10 graded problems: Haar DWT by hand through scattering network feature analysis

Learning Objectives

After completing this section, you will:

  1. Explain why the Fourier transform cannot localize events in time and how wavelets resolve this via the uncertainty principle
  2. Define the continuous wavelet transform (CWT) and state the admissibility condition for reconstruction
  3. State the five axioms of Multiresolution Analysis and derive the two-scale relation $\phi(x) = \sqrt{2}\sum_k h_k\,\phi(2x-k)$
  4. Derive the wavelet $\psi$ from the scaling function $\phi$ via the quadrature mirror filter relation $g_k = (-1)^k h_{1-k}$
  5. Implement the Mallat fast DWT algorithm and explain why its complexity is $O(N)$
  6. Explain what "vanishing moments" means for a wavelet and why more vanishing moments improve compression
  7. Construct the Daubechies db2 filter from spectral factorization
  8. Apply the 2D separable DWT and identify the LL, LH, HL, HH subbands
  9. Describe Mallat's scattering network and explain why it provides stable, translation-invariant features
  10. Apply soft/hard wavelet thresholding for signal denoising (Donoho-Johnstone framework)
  11. Connect wavelet filter banks to hierarchical vision architectures (Swin Transformer, PVT)
  12. Identify and fix the 10 most common wavelet implementation errors



1. Intuition

1.1 The Time-Frequency Problem

Consider a musical score. It tells you two things simultaneously: which note is played (frequency) and when it is played (time). The Fourier transform is like a musician who can tell you every pitch in a piece, but cannot tell you when any of them occur. For a signal $f(t)$, the Fourier coefficient:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt$$

integrates over all time. Information about the temporal location of a frequency event is completely lost. A piano chord struck at $t=0$ and the same chord struck at $t=10$ seconds have identical Fourier transforms (in magnitude).

This is not a defect of a particular implementation - it is a fundamental consequence of using globally supported basis functions. Sinusoids $e^{2\pi i \xi t}$ have infinite support: they oscillate forever in both directions. When you decompose a signal in this basis, temporal information is irrevocably mixed into the phase relationships between coefficients, and extracting it requires knowing all coefficients simultaneously.

The consequences for AI are immediate. When processing speech, EEG signals, financial time series, or any non-stationary data, global frequency analysis discards the temporal structure that carries meaning. Recordings of "cat" and "tac" have similar Fourier magnitude spectra but different time-frequency patterns. The Whisper model addresses this by computing a mel-spectrogram (short-time Fourier transform) that provides local frequency information - but the STFT has its own limitations that wavelets resolve more elegantly.

1.2 The Uncertainty Principle and What Wavelets Buy

The Heisenberg-Gabor uncertainty principle (Section 20-02) states that no function can be simultaneously localized in both time and frequency beyond the bound:

$$\Delta t \cdot \Delta \xi \geq \frac{1}{4\pi}$$

where $\Delta t$ and $\Delta \xi$ are the standard deviations of $|f(t)|^2$ and $|\hat{f}(\xi)|^2$ respectively. This is a hard mathematical limit: you cannot have perfect resolution in both domains simultaneously.

What the Short-Time Fourier Transform (STFT) does: Fix a window function $g(t)$ (Gaussian, Hann, etc.) and compute:

$$\text{STFT}_f(t,\xi) = \int_{-\infty}^{\infty} f(\tau)\, g(\tau - t)\, e^{-2\pi i \xi \tau}\, d\tau$$

This gives local frequency information near time $t$. But the window $g$ is fixed - every frequency is analyzed at the same time resolution $\Delta t$. This is suboptimal: low frequencies vary slowly (a wide window is optimal), while high frequencies vary rapidly (a narrow window is optimal).

What wavelets do: Wavelets use a window whose width automatically adapts to the analysis scale. At high frequencies (small scale $a$), the wavelet is narrow in time - providing good temporal resolution. At low frequencies (large scale $a$), the wavelet is wide - providing good frequency resolution. This is the "constant-Q" or "constant relative bandwidth" property:

$$\frac{\Delta\xi}{\xi} = \text{const}$$

This adaptive tiling of the time-frequency plane is the geometric essence of wavelet analysis. It matches the structure of natural signals, which tend to have brief, sharp high-frequency events (transients) and slow, extended low-frequency oscillations.

1.3 Wavelets as Mathematical Microscopes

A wavelet $\psi$ is a function satisfying the admissibility condition (Section 2.1). From it, we build a two-parameter family of analyzing functions by translation (shifting in time by $b$) and dilation (stretching in scale by $a > 0$):

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)$$

The factor $1/\sqrt{a}$ normalizes energy: $\|\psi_{a,b}\|_2 = \|\psi\|_2$ for all $a, b$.

Think of this geometrically: as $a$ decreases (zooming in), $\psi_{a,b}$ becomes a narrow, high-frequency probe localized near $t = b$. As $a$ increases (zooming out), $\psi_{a,b}$ becomes a wide, low-frequency probe covering a broad time interval. The Continuous Wavelet Transform:

$$W_\psi f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)\, dt$$

measures the "content" of $f$ near time $b$ at scale $a$ - precisely the behavior of a microscope adjusted to different magnifications and positions.

For AI: Scattering networks (Mallat, 2012) use this idea to build stable, translation-invariant feature representations by cascading wavelet transforms with modulus nonlinearities. The resulting features have provable stability properties that learned CNNs lack - they cannot be destabilized by small deformations of the input.
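The "microscope" behavior is easy to see numerically. Below is a minimal numpy sketch (not from the companion notebooks) of the CWT via direct Riemann sums, using the Mexican hat wavelet defined in Section 2.2; the helper names `mexican_hat` and `cwt` are ad hoc:

```python
import numpy as np

def mexican_hat(t):
    # Ricker wavelet (Section 2.2): negative second derivative of a Gaussian
    return (2 / (np.sqrt(3) * np.pi**0.25)) * (1 - t**2) * np.exp(-t**2 / 2)

def cwt(f, dt, scales):
    """Direct Riemann-sum approximation of W_psi f(a, b) on a sampled grid."""
    n = len(f)
    t = (np.arange(n) - n // 2) * dt
    out = np.zeros((len(scales), n))
    for i, a in enumerate(scales):
        psi = mexican_hat(t / a) / np.sqrt(a)     # psi_{a,0}, energy-normalized
        # correlating f with shifted copies of psi_{a,0} approximates the integral
        out[i] = np.convolve(f, psi[::-1], mode="same") * dt
    return out

dt = 1 / 512
t = np.arange(0, 1, dt)
# 5 Hz in the first half, 40 Hz in the second: a non-stationary test signal
f = np.where(t < 0.5, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 40 * t))
W = cwt(f, dt, scales=[0.004, 0.032])
q = len(t) // 4
# The small scale (high-frequency probe) responds in the late quarter,
# the large scale (low-frequency probe) in the early quarter.
```

A Fourier magnitude spectrum of `f` would show both peaks but not their order in time; the two rows of `W` separate them by position as well as scale.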

1.4 Historical Timeline

Year | Contributor | Development
--- | --- | ---
1807 | Fourier | Series decomposition of heat equation solutions
1910 | Haar | First wavelet: the Haar function $\psi = \mathbf{1}_{[0,1/2)} - \mathbf{1}_{[1/2,1)}$
1946 | Gabor | Time-frequency atoms; uncertainty principle
1965 | Cooley-Tukey | FFT algorithm - makes DFT computationally practical
1982 | Morlet, Grossmann | Continuous wavelet transform formalized for geophysics
1986 | Meyer | Construction of smooth orthonormal wavelets
1988 | Daubechies | Compactly supported orthonormal wavelets with $p$ vanishing moments (db$p$)
1989 | Mallat | Multiresolution Analysis framework; fast $O(N)$ DWT algorithm
1992 | Coifman-Wickerhauser | Wavelet packets; best-basis algorithm
1994 | Donoho-Johnstone | Wavelet thresholding for near-optimal denoising
2000 | ISO | JPEG 2000 standard based on Cohen-Daubechies-Feauveau (CDF 9/7) biorthogonal wavelets
2012 | Mallat | Scattering networks: stable CNNs from wavelet cascades
2021 | Yao et al. | WaveBERT: wavelet token compression for long-range Transformers
2022 | Liu et al. | WaveMix: multi-scale 2D wavelet mixing for vision
2023 | Various | Wavelet-domain diffusion models; wavelet attention transformers

2. Formal Definitions

2.1 The Continuous Wavelet Transform

Definition 2.1 (Admissible wavelet). A function $\psi \in L^2(\mathbb{R})$ is an admissible wavelet if it satisfies the admissibility condition:

$$C_\psi = \int_0^{\infty} \frac{|\hat{\psi}(\xi)|^2}{\xi}\, d\xi < \infty$$

This requires $\hat{\psi}(0) = 0$, i.e., $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$ - the wavelet must have zero mean. Informally, it must "oscillate" (wave) while also being localized (small = "let"). The condition $C_\psi < \infty$ ensures the CWT is invertible.

Definition 2.2 (Continuous Wavelet Transform). For $f \in L^2(\mathbb{R})$ and admissible $\psi$, the CWT is:

$$W_\psi f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{a}}\,\overline{\psi\!\left(\frac{t-b}{a}\right)}\, dt, \qquad a > 0,\ b \in \mathbb{R}$$

Theorem 2.1 (CWT inversion formula). If $\psi$ is admissible, then:

$$f(t) = \frac{1}{C_\psi} \int_0^{\infty}\int_{-\infty}^{\infty} W_\psi f(a,b)\, \psi_{a,b}(t)\, \frac{da\, db}{a^2}$$

The measure $\frac{da\, db}{a^2}$ is the natural invariant measure on the affine group (translations and dilations). The CWT is an isometry: $\|W_\psi f\|^2_{L^2(da\,db/a^2)} = C_\psi \|f\|^2_{L^2}$.

Parseval formula for CWT. The energy identity reads:

$$\frac{1}{C_\psi}\int_0^\infty\int_{-\infty}^\infty |W_\psi f(a,b)|^2\,\frac{da\,db}{a^2} = \|f\|_{L^2}^2$$

For AI: The CWT is used for speech analysis (Morlet CWT of mel-filtered audio), EEG artifact detection, and seismic signal interpretation. However, it is the discrete wavelet transform - derived from MRA - that dominates computational applications.

2.2 Common Wavelet Families

The Haar Wavelet is the simplest compactly-supported wavelet, introduced by Alfred Haar in 1910:

$$\psi_\text{Haar}(t) = \begin{cases} 1 & 0 \leq t < 1/2 \\ -1 & 1/2 \leq t < 1 \\ 0 & \text{otherwise} \end{cases}$$

The Haar wavelet has compact support $[0,1]$, exactly 1 vanishing moment, and its scaling function is the box function $\phi_\text{Haar}(t) = \mathbf{1}_{[0,1)}(t)$. It is the only wavelet that is simultaneously symmetric (about its center, up to sign), orthonormal, and compactly supported. Its discontinuity makes it poor for smooth signals but ideal for edge detection and theoretical exposition.

The Morlet Wavelet is a complex sinusoid modulated by a Gaussian:

$$\psi_\text{Morlet}(t) = \pi^{-1/4}\left(e^{i\omega_0 t} - e^{-\omega_0^2/2}\right)e^{-t^2/2}$$

where $\omega_0 \approx 5$ is the center frequency. The correction term $e^{-\omega_0^2/2}$ ensures $\int \psi = 0$ exactly. For $\omega_0 \geq 5$, the correction term is negligible and $\psi_\text{Morlet}(t) \approx \pi^{-1/4} e^{i\omega_0 t - t^2/2}$.

The Morlet wavelet achieves optimal time-frequency localization (it saturates the Heisenberg uncertainty bound) and is widely used for analyzing oscillatory signals like EEG, ECG, and seismic data. It is not compactly supported (Gaussian tails) and does not generate an orthonormal basis, so it is used with the continuous transform rather than an orthogonal DWT.
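A quick numerical sanity check (a minimal numpy sketch, assuming the formula above) confirms that the correction term enforces zero mean and that the $\pi^{-1/4}$ factor gives (approximately) unit energy:

```python
import numpy as np

w0 = 5.0
t = np.linspace(-10, 10, 20001)
dt = t[1] - t[0]
psi = np.pi**-0.25 * (np.exp(1j * w0 * t) - np.exp(-w0**2 / 2)) * np.exp(-t**2 / 2)

mean = psi.sum() * dt                 # admissibility: should vanish
energy = (np.abs(psi) ** 2).sum() * dt  # L2 norm squared: should be ~1
```

Analytically, $\int \psi\,dt = \pi^{-1/4}\sqrt{2\pi}\,(e^{-\omega_0^2/2} - e^{-\omega_0^2/2}) = 0$ exactly; the Riemann sum reproduces this to numerical precision.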

The Mexican Hat (Ricker) Wavelet is the negative second derivative of a Gaussian:

$$\psi_\text{MexHat}(t) = \frac{2}{\sqrt{3}\,\pi^{1/4}}\left(1 - t^2\right)e^{-t^2/2}$$

Named for its characteristic shape. It has 2 vanishing moments and is real-valued. Used in edge detection and image analysis - it is the 1D analogue of the Laplacian-of-Gaussian kernel familiar from blob detection.

Daubechies Wavelets db$N$ are the compactly-supported orthonormal wavelets with $N$ vanishing moments and minimal support (unique up to the phase choice in the spectral factorization). Their explicit construction is in Section 5. Key members:

  • db1 = Haar wavelet (discontinuous, 1 vanishing moment)
  • db2 = Daubechies 4-tap filter (Hölder continuous, 2 vanishing moments)
  • db4 = Daubechies 8-tap filter (more regular, 4 vanishing moments)
  • db8 = 16-tap filter (very smooth, 8 vanishing moments)

Increasing $N$ gives smoother wavelets at the cost of longer filters (support width $= 2N-1$).

Symlets sym$N$ are near-symmetric modifications of db$N$ with equal numbers of vanishing moments but more symmetric filter banks. Preferred in image processing to reduce edge artifacts.

Coiflets coif$N$ have vanishing moments for both $\psi$ and $\phi$ (the scaling function), making the approximation coefficients near-samples of the signal at coarse scales.

2.3 Inner Product View and Energy Preservation

Hilbert space background. The space $L^2(\mathbb{R}) = \{f : \mathbb{R} \to \mathbb{C} : \int |f|^2 < \infty\}$ is a Hilbert space with inner product $\langle f, g \rangle = \int_{-\infty}^\infty f(t)\,\overline{g(t)}\, dt$.

Recall (Section 12-02): A Hilbert space is a complete inner product space. In $L^2(\mathbb{R})$, the Riesz representation theorem guarantees that every bounded linear functional is of the form $f \mapsto \langle f, g \rangle$ for a unique $g$.

A family $\{\psi_{j,k}\}$ of wavelets forms an orthonormal basis of $L^2(\mathbb{R})$ if:

  1. $\langle \psi_{j,k}, \psi_{j',k'} \rangle = \delta_{jj'}\delta_{kk'}$ (orthonormality)
  2. $f = \sum_{j,k} \langle f, \psi_{j,k} \rangle \psi_{j,k}$ for all $f \in L^2(\mathbb{R})$ (completeness)

Theorem 2.2 (Parseval for discrete wavelets). If $\{\psi_{j,k}\}$ is an orthonormal wavelet basis, then:

$$\|f\|^2 = \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} |\langle f, \psi_{j,k}\rangle|^2$$

Energy is perfectly preserved in the wavelet domain.

For AI: This energy preservation is crucial for compression and denoising. The wavelet coefficients $d_{j,k} = \langle f, \psi_{j,k}\rangle$ measure the "energy contribution" of scale $j$, position $k$. Large coefficients correspond to significant signal features; small coefficients correspond to noise - the basis for threshold denoising.
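Orthonormality across scales and positions can be checked concretely in the discrete Haar case. The sketch below (a minimal numpy illustration; `haar_vec` is an ad hoc helper) builds the 8 discrete Haar basis vectors on an 8-point grid and verifies that their Gram matrix is the identity:

```python
import numpy as np

def haar_vec(N, j, k):
    """Discrete Haar wavelet at scale j, position k, on an N-point grid:
    +1 on the first half of a block of 2^j samples, -1 on the second half."""
    w = 2**j
    v = np.zeros(N)
    v[k * w : k * w + w // 2] = 1.0
    v[k * w + w // 2 : (k + 1) * w] = -1.0
    return v / np.sqrt(w)          # unit norm

N = 8
basis = [haar_vec(N, j, k) for j in (1, 2, 3) for k in range(N // 2**j)]
basis.append(np.ones(N) / np.sqrt(N))   # coarsest scaling vector completes the basis

# Gram matrix of inner products: identity <=> orthonormal basis of R^8
G = np.array([[u @ v for v in basis] for u in basis])
```

Because the Gram matrix is the identity, Parseval holds automatically: for any vector $f \in \mathbb{R}^8$, the squared coefficients sum to $\|f\|^2$.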


3. Multiresolution Analysis

3.1 The MRA Axioms

Multiresolution Analysis (MRA), introduced by Stéphane Mallat and Yves Meyer in 1989, provides the algebraic framework that links continuous wavelet theory to fast discrete algorithms. An MRA is a ladder of closed subspaces $\{V_j\}_{j\in\mathbb{Z}}$ of $L^2(\mathbb{R})$ satisfying five axioms:

Definition 3.1 (Multiresolution Analysis). A sequence of closed subspaces $\{V_j : j \in \mathbb{Z}\} \subset L^2(\mathbb{R})$ forms an MRA if:

  1. Nesting: $\cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots$ (indexing convention: $j$ increases = coarser resolution, so each coarser space sits inside the next finer one)

  2. Density: $\overline{\bigcup_{j\in\mathbb{Z}} V_j} = L^2(\mathbb{R})$ (arbitrary precision is achievable by taking finer and finer scales)

  3. Separation: $\bigcap_{j\in\mathbb{Z}} V_j = \{0\}$ (at infinite coarsening, only the zero function remains)

  4. Scaling: $f(t) \in V_j \iff f(2t) \in V_{j-1}$ (going to a finer scale is equivalent to compressing the function by a factor of 2)

  5. Orthonormal basis: There exists $\phi \in V_0$ such that $\{\phi(t-k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $V_0$.

The function $\phi$ is the scaling function (or father wavelet) associated with the MRA. The orthonormality condition means:

$$\langle \phi(\cdot - k), \phi(\cdot - l) \rangle = \delta_{kl}, \qquad k, l \in \mathbb{Z}$$

Example 3.1 (Haar MRA). Define $V_j$ as the space of functions that are piecewise constant on dyadic intervals $[k \cdot 2^j, (k+1) \cdot 2^j)$ for $k \in \mathbb{Z}$. The scaling function is $\phi(t) = \mathbf{1}_{[0,1)}$. All five MRA axioms are satisfied. The approximation of $f$ at scale $j$ is the conditional expectation $E[f \mid \mathcal{F}_j]$, where $\mathcal{F}_j$ is the $\sigma$-algebra generated by dyadic intervals of width $2^j$.

Example 3.2 (Sinc / Shannon MRA). Define $V_j$ as the Paley-Wiener space of functions bandlimited to $[-2^{-j}\pi, 2^{-j}\pi]$. The scaling function is $\phi(t) = \operatorname{sinc}(t) = \sin(\pi t)/(\pi t)$. This is the MRA of ideally bandlimited signals. The sinc wavelet has infinite support - impractical but theoretically illuminating.
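The Haar MRA of Example 3.1 can be exercised directly on sampled data: projecting onto $V_j$ is just averaging over dyadic blocks, and the nesting axiom says that coarsening a fine projection gives the coarse projection. A minimal numpy sketch (the helper `project_Vj` is ad hoc):

```python
import numpy as np

def project_Vj(f, j):
    """Project a length-2^J sample vector onto the Haar space V_j:
    replace each dyadic block of 2^j samples by its average."""
    w = 2**j
    blocks = f.reshape(-1, w)
    return np.repeat(blocks.mean(axis=1), w)

f = np.array([1.0, 3.0, 2.0, 6.0, 0.0, 0.0, 4.0, 4.0])
p1 = project_Vj(f, 1)   # piecewise constant on pairs
p2 = project_Vj(f, 2)   # piecewise constant on blocks of 4 (coarser)

# Nesting: projecting the fine approximation to the coarse scale
# gives the same result as projecting the original directly.
```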

3.2 Scaling Functions and the Two-Scale Relation

Since $V_0 \subset V_{-1}$ (axiom 1) and $\{\phi(t-k)\}_k$ is an orthonormal basis for $V_0$, while $\{\sqrt{2}\,\phi(2t-k)\}_k$ is an orthonormal basis for $V_{-1}$ (by axiom 4), we can expand $\phi$ in the finer-scale basis:

Definition 3.2 (Two-scale relation / refinement equation). The scaling function satisfies:

$$\phi(t) = \sqrt{2}\sum_{k \in \mathbb{Z}} h_k\,\phi(2t - k)$$

where the coefficients $\{h_k\}$ form the low-pass analysis filter. This is the fundamental equation of wavelet theory - it says the scaling function at scale $0$ is a weighted sum of translated scaling functions at scale $-1$ (twice as fine).

The Fourier transform of the two-scale relation gives:

$$\hat{\phi}(\xi) = H\!\left(\frac{\xi}{2}\right)\hat{\phi}\!\left(\frac{\xi}{2}\right), \qquad H(\xi) = \frac{1}{\sqrt{2}}\sum_k h_k\, e^{-2\pi i k\xi}$$

Applying this recursively: $\hat{\phi}(\xi) = \prod_{j=1}^{\infty} H(\xi/2^j)$. The infinite product formula shows $\hat{\phi}$ is determined entirely by the filter $H(\xi)$ (the Fourier transform of the sequence $\{h_k/\sqrt{2}\}$).
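In the time domain, the same recursion is the cascade algorithm: iterate the two-scale relation starting from a box function, doubling the grid resolution each step. A minimal numpy sketch for the db2 filter (coefficients from Section 5.2; `cascade` is an ad hoc helper):

```python
import numpy as np

# db2 low-pass filter taps (Section 5.2); sum of taps = sqrt(2)
s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))

def cascade(h, n_iter=8):
    """Approximate the scaling function by iterating
    phi_{m+1}(t) = sqrt(2) * sum_k h_k phi_m(2t - k), starting from a box.
    After n_iter steps the samples live on a grid of spacing 2^{-n_iter}."""
    phi = np.array([1.0])
    for _ in range(n_iter):
        up = np.zeros(2 * len(phi))
        up[::2] = phi                          # zero-interleave: phi_m on a finer grid
        phi = np.sqrt(2) * np.convolve(up, h)  # apply the refinement filter
    return phi

phi = cascade(h)
dt = 2.0**-8
# Both invariants of the refinement step survive discretization:
# integral of phi is 1, and (by the CQF condition) its L2 norm is 1.
```

For the Haar filter the iteration reproduces the box function exactly; for db2 it converges to the familiar fractal-looking Daubechies scaling function.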

Orthonormality conditions. For $\{\phi(\cdot - k)\}_k$ to be orthonormal, the filter must satisfy:

$$|H(\xi)|^2 + |H(\xi + 1/2)|^2 = 1 \qquad \text{(power complementary, or Conjugate Quadrature Filter condition)}$$

This is the critical constraint in wavelet filter design.

For AI: The two-scale relation is structurally identical to a residual connection. The coarse approximation at scale $j$ is recursively expressed in terms of the approximation at scale $j-1$ (twice the resolution). This is precisely the architecture of U-Net and multi-scale vision models.

3.3 Detail Spaces and the Orthogonal Complement

Since $V_{j+1} \subset V_j$ (the coarser space sits inside the finer one - careful with the indexing convention), define the detail space $W_{j+1}$ as the orthogonal complement of $V_{j+1}$ within $V_j$:

$$V_j = V_{j+1} \oplus W_{j+1}$$

The detail space $W_{j+1}$ captures the "difference in information" between resolution $j$ and resolution $j+1$. Applying this decomposition repeatedly:

$$V_0 = V_J \oplus W_J \oplus W_{J-1} \oplus \cdots \oplus W_1 \qquad \text{(for any } J > 0\text{)}$$

Letting $J \to \infty$ (separation axiom) and applying the same splitting to the finer spaces $V_j$, $j < 0$ (density axiom):

$$L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j$$

This is the fundamental wavelet decomposition theorem: the detail spaces $\{W_j\}_{j\in\mathbb{Z}}$ form a direct orthogonal sum decomposition of $L^2(\mathbb{R})$.

Orthonormal basis for $W_j$. If $\psi$ is the mother wavelet (Section 3.4), then $\{\psi_{j,k}(t) = 2^{-j/2}\psi(2^{-j}t - k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $W_j$. Therefore:

$$\{\psi_{j,k} : j, k \in \mathbb{Z}\} \quad \text{is an orthonormal basis for } L^2(\mathbb{R})$$

Every $f \in L^2(\mathbb{R})$ expands as $f = \sum_{j,k} \langle f, \psi_{j,k}\rangle \psi_{j,k}$ with $\|f\|^2 = \sum_{j,k} |\langle f, \psi_{j,k}\rangle|^2$.

3.4 The Wavelet Equation and QMF Relation

The mother wavelet $\psi \in W_0$ is defined by the analogous two-scale relation:

$$\psi(t) = \sqrt{2}\sum_{k \in \mathbb{Z}} g_k\,\phi(2t - k)$$

where $\{g_k\}$ is the high-pass analysis filter. The filters $h$ and $g$ are related by the Quadrature Mirror Filter (QMF) condition:

$$g_k = (-1)^k h_{1-k} \qquad \text{(time-reverse and alternate signs)}$$

In the frequency domain: $G(\xi) = e^{-2\pi i\xi}\,\overline{H(\xi + 1/2)}$ (up to a sign convention), which ensures $H$ and $G$ are power complementary and together span all frequencies.

The QMF condition guarantees:

  1. $\psi \perp \phi$ (wavelet is orthogonal to scaling function)
  2. $\psi_{0,k} \perp \psi_{0,l}$ for $k \neq l$ (translated wavelets are orthogonal)
  3. $\psi_{j,k} \perp \psi_{j',k'}$ for $j \neq j'$ (wavelets at different scales are orthogonal)

Example 3.3 (Haar filters). For the Haar wavelet: $h_0 = h_1 = 1/\sqrt{2}$ (all others zero). Then $g_0 = 1/\sqrt{2}$, $g_1 = -1/\sqrt{2}$. Verify against $g_k = (-1)^k h_{1-k}$: $g_0 = (-1)^0 h_1 = 1/\sqrt{2}$ and $g_1 = (-1)^1 h_0 = -1/\sqrt{2}$. ✓

Geometric interpretation. The filter bank $[H, G]$ splits the normalized frequency axis $[0, 1]$ into two halves: $H$ preserves $[0, 1/2]$ (low frequencies = approximation), $G$ preserves $[1/2, 1]$ (high frequencies = detail). This is the wavelet counterpart of the Convolution Theorem from Section 20-04: decomposition into frequency bands via orthogonal filter pairs.
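The same check works for longer filters. A minimal sketch for db2 (filter values from Section 5.2; the helper `ip` is ad hoc) applies $g_k = (-1)^k h_{1-k}$ literally, including the negative indices it produces, and verifies the orthogonality it promises:

```python
import numpy as np

s3 = np.sqrt(3)
h = {0: (1 + s3) / (4 * np.sqrt(2)), 1: (3 + s3) / (4 * np.sqrt(2)),
     2: (3 - s3) / (4 * np.sqrt(2)), 3: (1 - s3) / (4 * np.sqrt(2))}

# QMF relation g_k = (-1)^k h_{1-k}: the support of g is {-2, -1, 0, 1}
g = {k: (-1) ** k * h[1 - k] for k in range(-2, 2)}

def ip(x, y, shift):
    """sum_k x_k * y_{k + 2*shift} over the common support"""
    return sum(x[k] * y.get(k + 2 * shift, 0.0) for k in x)

zero_mean = abs(sum(g.values()))                       # high-pass: coefficients sum to 0
cross = max(abs(ip(h, g, m)) for m in range(-3, 4))    # h and g orthogonal at all even shifts
unit = abs(sum(v * v for v in g.values()) - 1)         # g inherits unit norm from h
```

All three quantities vanish to machine precision, which is exactly what items 1-3 above assert at the filter level.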


4. The Mallat Algorithm (Fast DWT)

4.1 Analysis Filter Bank

The Mallat algorithm (1989) computes the DWT via iterated convolution + downsampling - the inner products $\langle f, \psi_{j,k}\rangle$ never need to be computed directly.

Given a sequence $\mathbf{a}^{(0)} = \{a^{(0)}_n\}$ (the "approximation coefficients" at the finest scale, equivalent to sampling $f$ at integer positions), one step of the DWT computes:

$$a^{(j+1)}_k = \sum_n h_{n-2k}\, a^{(j)}_n \qquad \text{(convolve with } \tilde{h}\text{, then downsample by 2)}$$

$$d^{(j+1)}_k = \sum_n g_{n-2k}\, a^{(j)}_n \qquad \text{(convolve with } \tilde{g}\text{, then downsample by 2)}$$

Here $\mathbf{a}^{(j+1)}$ are the approximation coefficients at scale $j+1$ (low-pass, half the samples), and $\mathbf{d}^{(j+1)}$ are the detail coefficients at scale $j+1$ (high-pass, half the samples). The sums above are convolutions with the time-reversed conjugate filters $\tilde{h}_n = \overline{h_{-n}}$ and $\tilde{g}_n = \overline{g_{-n}}$, evaluated at even positions.

Pictorially, one DWT step is:

a^{(j)} --+--[low-pass h]--[v2]--->  a^{(j+1)}   (approximation)
          +--[high-pass g]--[v2]--->  d^{(j+1)}   (detail)

Repeating on $\mathbf{a}^{(j+1)}$ gives the full multi-level decomposition.

For AI: This is a convolutional block whose filters are fixed rather than learned. The downsampling ($\downarrow 2$) is equivalent to stride-2 convolution. The architectures of U-Net, FPN (Feature Pyramid Network), and Swin Transformer all implement variants of this Mallat tree.
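One analysis step fits in a few lines of numpy. The sketch below (not from the companion notebooks) uses periodic boundary handling and the db2 filter from Section 5.2, with the equivalent causal QMF choice $g_k = (-1)^k h_{3-k}$ (a shift by 2 of the relation in Section 3.4):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # g_k = (-1)^k h_{3-k}: causal QMF partner

def dwt_step(a, h, g):
    """One Mallat analysis step with periodic boundaries:
    a_next[k] = sum_n h[n] * a[(n + 2k) % N]  (filter, keep even shifts)."""
    N = len(a)
    idx = (np.arange(len(h))[:, None] + 2 * np.arange(N // 2)[None, :]) % N
    return h @ a[idx], g @ a[idx]

x = np.sin(2 * np.pi * np.arange(16) / 16)
approx, detail = dwt_step(x, h, g)
# Orthogonality of the periodized filter bank => energy splits exactly
# between the approximation and detail channels.
```

Each output channel has half the samples of the input; iterating on `approx` produces the full decomposition tree of Section 6.1.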

4.2 Synthesis Filter Bank

Reconstruction (inverse DWT) is the exact reverse:

$$a^{(j)}_n = \sum_k a^{(j+1)}_k\, h_{n-2k} + \sum_k d^{(j+1)}_k\, g_{n-2k}$$

Pictorially:

a^{(j+1)} --[^2]--[h]--+
                       +--->  a^{(j)}
d^{(j+1)} --[^2]--[g]--+

where $\uparrow 2$ denotes upsampling by 2 (zero insertion between samples). The synthesis filters are the time-reverses of the analysis filters (for orthogonal wavelets).
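Analysis followed by synthesis should return the input exactly. A self-contained numpy sketch (periodic boundaries, db2 filters as in Section 5.2; `analysis`/`synthesis` are ad hoc helpers):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # causal QMF partner of h

def analysis(a, f):
    """coefficients c[k] = sum_n f[n] * a[(n + 2k) % N]"""
    N = len(a)
    idx = (np.arange(len(f))[:, None] + 2 * np.arange(N // 2)[None, :]) % N
    return f @ a[idx]

def synthesis(approx, detail, h, g):
    """Inverse step: upsample by zero insertion, filter, sum both branches."""
    N = 2 * len(approx)
    out = np.zeros(N)
    for k in range(len(approx)):
        for n in range(len(h)):          # scatter each coefficient back in place
            out[(n + 2 * k) % N] += approx[k] * h[n] + detail[k] * g[n]
    return out

x = np.random.default_rng(1).standard_normal(16)
a, d = analysis(x, h), analysis(x, g)
x_rec = synthesis(a, d, h, g)            # perfect reconstruction, to machine precision
```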

4.3 Perfect Reconstruction Conditions

Definition 4.1 (Perfect Reconstruction). A filter bank $\{h, g, \tilde{h}, \tilde{g}\}$ achieves perfect reconstruction if, for every input sequence $\mathbf{a}^{(j)}$, the synthesized output after analysis + synthesis equals $\mathbf{a}^{(j)}$ (up to a delay).

For orthogonal wavelets, the analysis filters equal the synthesis filters (after time-reversal):

$$\tilde{h}_n = \overline{h_{-n}}, \qquad \tilde{g}_n = \overline{g_{-n}}$$

The PR conditions in the $z$-transform domain (with $H(z) = \sum_k h_k z^{-k}$, so that $H(1) = \sqrt{2}$) are:

  1. Alias cancellation: $H(z)H(-z^{-1}) + G(z)G(-z^{-1}) = 0$
  2. Distortion-free: $H(z)H(z^{-1}) + G(z)G(z^{-1}) = 2$

Together, these ensure that aliasing from the $\downarrow 2$ operation is exactly cancelled in reconstruction.

Biorthogonal wavelets (used in JPEG 2000) use different analysis and synthesis filters, relaxing the orthogonality constraint while maintaining PR. This allows symmetric filters (impossible for compactly supported orthogonal wavelets with more than 1 vanishing moment, by Daubechies' theorem).
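The two PR conditions are easy to verify on the unit circle, where $z = e^{2\pi i \xi}$ and the half-band shift $z \mapsto -z$ becomes $\xi \mapsto \xi + 1/2$. A minimal numpy check for db2 (filters as in Section 5.2; `on_circle` is an ad hoc helper):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # causal QMF partner

def on_circle(c, xi):
    """Evaluate C(xi) = sum_k c[k] e^{-2 pi i k xi}  (z-transform on |z| = 1)"""
    k = np.arange(len(c))
    return np.sum(c * np.exp(-2j * np.pi * k * xi))

xi = np.linspace(0, 1, 64, endpoint=False)
H  = np.array([on_circle(h, x) for x in xi])
G  = np.array([on_circle(g, x) for x in xi])
Hs = np.array([on_circle(h, x + 0.5) for x in xi])   # half-band shifted
Gs = np.array([on_circle(g, x + 0.5) for x in xi])

distortion = np.abs(H) ** 2 + np.abs(G) ** 2     # distortion-free: constant 2
alias = H * np.conj(Hs) + G * np.conj(Gs)        # alias cancellation: identically 0
```

With the $H(1) = \sqrt{2}$ normalization, `distortion` evaluates to the constant 2 and `alias` vanishes at every frequency, which is the frequency-domain form of conditions 1 and 2 above.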

4.4 Computational Complexity

At level 1, the DWT processes $N$ samples with two convolutions + downsamplings, costing $O(N)$ operations (each filter has $p$ taps, so $O(pN)$, but $p$ is fixed). At level 2, we process $N/2$ samples, costing $O(N/2)$. Summing the geometric series:

$$\text{Total cost} = O(N) + O(N/2) + O(N/4) + \cdots = O(N)\sum_{j=0}^{J} 2^{-j} \leq 2\,O(N) = O(N)$$

The DWT is $O(N)$ - faster than the FFT's $O(N\log N)$!

Comparison:

Algorithm | Complexity | Memory | Time localization
--- | --- | --- | ---
DFT / FFT | $O(N\log N)$ | $O(N)$ | None (global)
STFT | $O(N\log N)$ | $O(N)$ | Fixed window
CWT (discrete approx.) | $O(N\log N)$ | $O(N\log N)$ | Scale-dependent
DWT (Mallat) | $O(N)$ | $O(N)$ | Scale-dependent

This $O(N)$ complexity is why wavelets underpin modern wavelet image coders (the JPEG 2000 standard and the EZW and SPIHT algorithms) and are computationally preferred over STFT for multi-scale analysis.


5. Daubechies Wavelets

5.1 Vanishing Moments

The most important property distinguishing different wavelet families is the number of vanishing moments.

Definition 5.1 (Vanishing moments). A wavelet $\psi$ has $N$ vanishing moments if:

$$\int_{-\infty}^{\infty} t^k\,\psi(t)\,dt = 0 \qquad \text{for } k = 0, 1, \ldots, N-1$$

Equivalently, in the Fourier domain: $\hat{\psi}^{(k)}(0) = 0$ for $k = 0,\ldots,N-1$, meaning $\hat{\psi}$ has a zero of order $N$ at $\xi = 0$.

Why vanishing moments matter:

  1. Polynomial annihilation. If $f$ is a polynomial of degree $< N$ on the support of $\psi_{j,k}$, then $\langle f, \psi_{j,k}\rangle = 0$. Smooth signals (well-approximated by polynomials locally) have near-zero detail coefficients everywhere except near discontinuities. This leads to sparse wavelet representations - ideal for compression.

  2. Approximation order. With $N$ vanishing moments, the wavelet approximation error at scale $j$ satisfies $\|f - P_j f\| \leq C \cdot 2^{-jN} \|f^{(N)}\|$ for $f \in C^N$. More vanishing moments = better approximation of smooth functions.

  3. Filter regularity. More vanishing moments force more zeros of $G(\xi)$ at $\xi = 0$ (equivalently, of $H(\xi)$ at $\xi = 1/2$), which makes the scaling function smoother.

Trade-off. More vanishing moments require longer filters (wider support). Daubechies db$N$ has exactly $N$ vanishing moments and minimum possible support $[0, 2N-1]$. There is no way to have $N$ vanishing moments with a shorter filter while maintaining orthonormality.
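Polynomial annihilation can be observed directly at the filter level: the db2 high-pass filter has 2 vanishing moments, so it maps any sampled linear signal to (numerically) zero detail coefficients, while a quadratic survives. A minimal numpy sketch (`details` is an ad hoc helper; boundary wrap is avoided by using interior windows only):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # db2 high-pass: 2 vanishing moments

n = np.arange(64)
linear = 0.5 * n + 3.0                   # degree-1 polynomial: annihilated
quad = n.astype(float) ** 2              # degree-2 polynomial: NOT annihilated

def details(x, g):
    """Interior detail coefficients d[k] = sum_m g[m] x[2k + m]"""
    return np.array([np.dot(g, x[2 * k : 2 * k + 4])
                     for k in range((len(x) - 4) // 2)])

d_lin = details(linear, g)               # ~ 0 everywhere
d_quad = details(quad, g)                # constant, nonzero
```

This is exactly why smooth stretches of a signal produce near-zero db$N$ detail coefficients: locally they look like low-degree polynomials.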

5.2 Constructing db2 via Spectral Factorization

We construct the Daubechies db2 wavelet (4-tap filter, 2 vanishing moments) explicitly.

Step 1: Set up constraints. We need a filter $h = \{h_0, h_1, h_2, h_3\}$ (4 taps, support $[0,3]$) satisfying:

  • Power complementary: $|H(\xi)|^2 + |H(\xi + 1/2)|^2 = 1$
  • 2 vanishing moments: $G(0) = 0$ and $G'(0) = 0$, i.e., $H(1/2) = 0$ and $H'(1/2) = 0$
  • Normalization: $H(0) = 1$ (i.e., $\sum_k h_k = \sqrt{2}$)

These give 4 equations for 4 unknowns.

Step 2: Parameterize. The power-complementary condition $|H(\xi)|^2 + |H(\xi+1/2)|^2 = 1$ in the $z$-domain (with $H(z) = \sum_k h_k z^{-k}$) reads $H(z)H(z^{-1}) + H(-z)H(-z^{-1}) = 2$. This is a Bezout-type equation whose general solution is:

$$|H(\xi)|^2 = \left(\cos^2(\pi\xi)\right)^{N} P(\sin^2(\pi\xi))$$

where $P$ is a polynomial satisfying $P(y) \geq 0$ for $y \in [0,1]$ and the Bezout identity $(1-y)^N P(y) + y^N P(1-y) = 1$.

Step 3: Minimum phase factorization. For $N=2$, the minimal-degree solution is $P(y) = 1 + 2y$. Then:

$$|H(\xi)|^2 = \cos^4(\pi\xi)\left(1 + 2\sin^2(\pi\xi)\right)$$

Factoring this as $H(z)H(z^{-1})$ (spectral factorization), with the root contributed by the $P$-factor chosen inside the unit disk (minimum phase):

$$H(z) = \frac{1+\sqrt{3}}{4\sqrt{2}} + \frac{3+\sqrt{3}}{4\sqrt{2}}z^{-1} + \frac{3-\sqrt{3}}{4\sqrt{2}}z^{-2} + \frac{1-\sqrt{3}}{4\sqrt{2}}z^{-3}$$

The db2 filter coefficients (normalized so $\sum_k h_k^2 = 1$):

$$h_0 = \frac{1+\sqrt{3}}{4\sqrt{2}}, \quad h_1 = \frac{3+\sqrt{3}}{4\sqrt{2}}, \quad h_2 = \frac{3-\sqrt{3}}{4\sqrt{2}}, \quad h_3 = \frac{1-\sqrt{3}}{4\sqrt{2}}$$

Numerically: $h \approx \{0.4830, 0.8365, 0.2241, -0.1294\}$.

Verification: $h_0^2 + h_1^2 + h_2^2 + h_3^2 = 1$ ✓, $\sum_k h_k = \sqrt{2}$ ✓, $h_0h_2 + h_1h_3 = 0$ ✓
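The whole construction can be replayed numerically: build $|H(\xi)|^2$ as a Laurent polynomial in $z = e^{2\pi i\xi}$, find its roots, keep the minimum-phase one, and recover the four taps. A numpy sketch (variable names ad hoc; on the unit circle $\cos^2(\pi\xi) = (z + 2 + z^{-1})/4$ and $\sin^2(\pi\xi) = (-z + 2 - z^{-1})/4$):

```python
import numpy as np

# |H|^2 = cos^4(pi xi) * (1 + 2 sin^2(pi xi)) as a Laurent polynomial;
# arrays hold coefficients from the highest power of z down to the lowest.
c2 = np.array([0.25, 0.5, 0.25])        # cos^2(pi xi) = (z + 2 + 1/z)/4
P = np.array([-0.5, 2.0, -0.5])         # 1 + 2 sin^2(pi xi) = 2 - (z + 1/z)/2
M = np.polymul(np.polymul(c2, c2), P)   # coefficients of z^3 ... z^{-3}

roots = np.roots(M)                     # roots of z^3 * |H|^2(z)
# Four roots cluster at z = -1 (the cos^4 factor, ON the unit circle, split
# evenly between H(z) and H(z^{-1})); the P factor gives a reciprocal pair
# r and 1/r. Minimum phase keeps the root well inside the disk.
r = roots[np.abs(roots) < 0.9][0].real  # expect r = 2 - sqrt(3) ~ 0.268

hz = np.polymul(np.polymul([1.0, 1.0], [1.0, 1.0]), [1.0, -r])  # (1+z^{-1})^2 (1 - r z^{-1})
h = np.sqrt(2) * hz / hz.sum()          # normalize so sum h_k = sqrt(2)
```

The recovered `h` matches the closed-form db2 coefficients above; choosing $1/r$ instead of $r$ would give the time-reversed (maximum-phase) filter, illustrating the phase ambiguity of spectral factorization.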

5.3 Regularity and Hölder Continuity

The regularity of the Daubechies scaling function $\phi$ is determined by the spectral radius of the transition matrix $T_{ij} = \sqrt{2}\, h_{i-2j}$. For db$N$:

Theorem 5.1 (Daubechies regularity). The db$N$ scaling function $\phi \in C^\alpha$ (Hölder continuous of order $\alpha$) with:

$$\alpha_N \approx 0.2075\,N \qquad \text{(asymptotically as } N \to \infty\text{)}$$

Specific values:

  • db1 (Haar): $\alpha = 0$ (discontinuous, bounded variation)
  • db2: $\alpha \approx 0.550$ (Hölder continuous but not differentiable)
  • db4: $\alpha \approx 1.275$ (once continuously differentiable)
  • db8: $\alpha \approx 2.9$ (twice continuously differentiable)

The scaling function becomes smoother as $N$ increases, at the cost of wider support ($2N-1$ points).

5.4 Wavelet Family Comparison

WAVELET FAMILY COMPARISON
========================================================================

  Wavelet     | Support  | VM  | Regularity | Symmetric | Orthogonal
  ------------|----------|-----|------------|-----------|------------
  Haar        | [0,1]    |  1  | C^0        | Yes       | Yes
  db2         | [0,3]    |  2  | C^0.55     | No        | Yes
  db4         | [0,7]    |  4  | C^1.28     | No        | Yes
  db8         | [0,15]   |  8  | C^2.9      | No        | Yes
  sym4        | [0,7]    |  4  | C^1.28     | Near-sym  | Yes
  coif4       | [-5,5]   |  4  | C^1.28     | Near-sym  | Yes
  bior2.2     | [-1,3]   |  2  | C^1        | Yes       | Biortho
  CDF 9/7     | large    |  4  | smooth     | Yes       | Biortho
  Mexican Hat | infinite |  2  | C^inf      | Yes       | No (frame)
  Morlet      | infinite | inf | C^inf      | Near-sym  | No (frame)

  VM = Vanishing Moments; CDF 9/7 = JPEG 2000 standard

========================================================================

For AI: Smooth wavelets (large NN) produce sparse representations for smooth neural network weights and activations, while Haar is efficient for piecewise constant approximations. The CDF 9/7 biorthogonal wavelet used in JPEG 2000 balances symmetry, smoothness, and reconstruction quality - it is also used in wavelet-domain diffusion models to separate frequency bands.


6. Discrete Wavelet Transform in Practice

6.1 Multi-Level Decomposition Tree

The Mallat algorithm applied JJ times to a length-NN signal produces the wavelet decomposition tree:

WAVELET DECOMPOSITION TREE (J=3 levels)
========================================================================

  Input:  a[n],  length N
     |
     +--[h,v2]--->  a1[n],  length N/2    (approximation, level 1)
     |               |
     |               +--[h,v2]--->  a2[n],  length N/4   (approx., L2)
     |               |               |
     |               |               +--[h,v2]--->  a3[n]  (approx., L3)
     |               |               +--[g,v2]--->  d3[n]  (detail, L3)
     |               |
     |               +--[g,v2]--->  d2[n],  length N/4   (detail, L2)
     |
     +--[g,v2]--->  d1[n],  length N/2    (detail, level 1)

  Storage: N/2 + N/4 + N/8 + N/8 = N  (no redundancy!)

========================================================================

The output is: {aJ,dJ,dJ1,,d1}\{a_J, d_J, d_{J-1}, \ldots, d_1\} with total length =N= N (for periodic boundary). No information is lost or duplicated.

Frequency interpretation:

  • d_1: frequencies in [f_s/4, f_s/2] (highest octave)
  • d_2: frequencies in [f_s/8, f_s/4]
  • d_j: frequencies in [f_s/2^{j+1}, f_s/2^j] (octave bands)
  • a_J: frequencies in [0, f_s/2^{J+1}] (lowest frequencies)

where f_s is the sampling rate.
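The storage accounting and Parseval property of the tree can be checked in a few lines of numpy. This sketch uses the orthonormal Haar filters for simplicity; pywt.wavedec is the standard library route:

```python
import numpy as np

def haar_step(x):
    # One analysis step: orthonormal Haar low/high-pass, then downsample by 2
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

a, details = x, []
for _ in range(3):                        # J = 3 levels, as in the tree above
    a, d = haar_step(a)
    details.append(d)                     # details = [d1, d2, d3]

lengths = [len(a)] + [len(d) for d in details]
print(lengths)                            # [8, 32, 16, 8]: total is 64 = N
assert sum(lengths) == len(x)             # non-redundant storage

energy = a @ a + sum(d @ d for d in details)
assert np.isclose(energy, x @ x)          # Parseval across the whole tree
```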

6.2 Wavelet Packets

Standard DWT applies the filter bank only to the approximation branch (a(j)\mathbf{a}^{(j)}). Wavelet packets apply it to both branches - creating a full binary tree of subband decompositions:

At each node (j,n)(j, n) (level jj, node nn), both the "approximation" and "detail" sub-signals are further decomposed. This gives 2j2^j subbands at level jj, each holding N/2jN/2^j coefficients. The collection of all leaves at level JJ corresponds to 2J2^J uniform frequency subbands - a uniform tiling of the frequency axis, comparable to windowed DFT analysis with 2J2^J bands.

The best-basis algorithm (Coifman-Wickerhauser, 1992) selects the pruning of this tree that minimizes a cost function (e.g., entropy) - giving an adaptive time-frequency decomposition that is optimal for the specific signal.
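A toy version of the best-basis recursion illustrates the idea. This sketch uses Haar splits and the additive (unnormalized) entropy cost; the test signals and depth are arbitrary choices, not the full Coifman-Wickerhauser implementation:

```python
import numpy as np

def haar_split(x):
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def cost(x):
    # Unnormalized entropy cost; additive across subbands (Haar is orthogonal,
    # so energy is preserved and subtree costs are directly comparable)
    v = x**2
    v = v[v > 1e-12]
    return -np.sum(v * np.log(v)) if v.size else 0.0

def best_basis(x, depth):
    # Return (best cost, decision at this node) for the subtree rooted at x
    if depth == 0 or len(x) < 2:
        return cost(x), "leaf"
    lo, hi = haar_split(x)
    c_children = best_basis(lo, depth - 1)[0] + best_basis(hi, depth - 1)[0]
    c_here = cost(x)
    return (c_children, "split") if c_children < c_here else (c_here, "keep")

const = np.ones(256)                      # energy concentrates under splitting
delta = np.zeros(256); delta[0] = 1.0     # a spike is already maximally sparse
decision_const = best_basis(const, 4)[1]
decision_delta = best_basis(delta, 4)[1]
print(decision_const, decision_delta)     # split keep
```

The pruning adapts to the signal: a flat signal is cheapest when fully decomposed, while a lone spike is best left in the time domain.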

6.3 2D DWT and Image Subbands

For 2D signals (images), the DWT is applied separably: first along rows, then along columns (or vice versa). One level of 2D DWT produces four subbands of size (H/2)×(W/2)(H/2) \times (W/2):

2D DWT SUBBANDS (one level)
========================================================================

  Input image (H x W)
  +-------------------------+
  |                         |
  |        f(x, y)          |
  |                         |
  +-------------------------+
              |
              v
  +------------------+------------------+
  | LL (H/2 x W/2)   | LH (H/2 x W/2)   |
  | low-pass rows    | low-pass rows    |
  | low-pass cols    | high-pass cols   |
  | (approximation)  | (hor. edges)     |
  +------------------+------------------+
  | HL (H/2 x W/2)   | HH (H/2 x W/2)   |
  | high-pass rows   | high-pass rows   |
  | low-pass cols    | high-pass cols   |
  | (vert. edges)    | (diag. edges)    |
  +------------------+------------------+

  LL = approximation (coarse image)
  LH = horizontal detail (horizontal edges)
  HL = vertical detail (vertical edges)
  HH = diagonal detail (diagonal edges)

========================================================================

Applying the 2D DWT recursively to the LL subband gives the standard pyramid decomposition used in image processing.
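One level of the separable transform is short to write out. A pure-numpy Haar sketch that checks the subband shapes and energy conservation (pywt.dwt2 is the library equivalent):

```python
import numpy as np

def haar1d(x, axis):
    # Orthonormal Haar analysis along one axis: low-pass and high-pass halves
    e = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    o = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (e + o) / np.sqrt(2), (e - o) / np.sqrt(2)

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))

L, H = haar1d(img, axis=0)        # filter along rows first
LL, LH = haar1d(L, axis=1)        # then along columns
HL, HH = haar1d(H, axis=1)

for band in (LL, LH, HL, HH):
    assert band.shape == (32, 32)

# The transform is orthogonal: total energy is preserved across subbands
energy = sum(np.sum(b**2) for b in (LL, LH, HL, HH))
assert np.isclose(energy, np.sum(img**2))
```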

For AI: This 2D decomposition is the backbone of Swin Transformer's patch merging, PVT's pyramid vision representation, and wavelet-enhanced U-Net architectures. The subbands naturally separate global structure (LL) from local edge features (LH, HL, HH).

6.4 JPEG 2000 and Image Compression

JPEG 2000 (2000) uses the CDF 9/7 biorthogonal wavelet (Le Gall 5/3 for lossless) to transform images before quantization and entropy coding. The key advantages over DCT-based JPEG:

  1. No blocking artifacts - the transform is applied to the whole image rather than to independent blocks, avoiding the 8×88 \times 8 block boundary artifacts of JPEG
  2. Scalable quality - truncating the bitstream at any point gives a valid (lower quality) image
  3. Region of Interest (ROI) coding - different quality levels for different image regions
  4. Better compression at high quality - 20-30% better than JPEG at visually lossless quality

The coding steps: 2D DWT (up to 5 levels) -> coefficient quantization -> EBCOT entropy coding (embedded block coding with optimal truncation).

Modern relevance: Wavelet-based compression principles inform:

  • Neural image codecs (Balle et al.): hyperprior model operates on wavelet-like feature maps
  • Wavelet-domain diffusion models (WaveDiff, 2023): diffusion operates on DWT coefficients rather than pixels, reducing sequence length by 4×4\times per level
  • VCT (Video Compression Transformer): uses learned transforms similar to wavelet filter banks

7. Time-Frequency Analysis

7.1 The Scalogram

The scalogram is the time-scale energy density:

Sf(a,b)=Wψf(a,b)2S_f(a, b) = |W_\psi f(a, b)|^2

It is the wavelet analogue of the spectrogram (squared magnitude of STFT). The scalogram shows how the energy of the signal is distributed across scales and times.

Reading the scalogram:

  • Horizontal axis: time bb
  • Vertical axis: scale aa (often plotted as frequency 1/a1/a, increasing downward = decreasing scale)
  • Color intensity: energy Wψf(a,b)2|W_\psi f(a,b)|^2
  • A vertical ridge indicates a transient (localized in time)
  • A horizontal ridge indicates a sustained oscillation (localized in frequency)

Edge effects (cone of influence). Near the boundaries of the signal, wavelet coefficients are unreliable because the wavelet ψa,b\psi_{a,b} extends outside the signal domain. The cone of influence at scale aa is the region of times affected by boundary effects: approximately those within a distance aσψa \cdot \sigma_\psi of an edge, where σψ\sigma_\psi is the support radius of ψ\psi - the affected region widens linearly with scale. Coefficients inside this edge-affected region should not be trusted.

Application example: For a chirp signal f(t)=cos(2πt2)f(t) = \cos(2\pi t^2) (instantaneous frequency linearly increasing with time), the scalogram shows a diagonal ridge tracing the time-frequency trajectory. The Fourier spectrum merely shows a broad band of frequencies with no temporal information.
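The chirp example can be reproduced with a hand-rolled CWT. This sketch uses a real Morlet-style wavelet and direct convolution; the choice w0=6w_0 = 6 and the scale grid are illustrative, not canonical:

```python
import numpy as np

def psi(t, w0=6.0):
    # Real Morlet-style wavelet (illustrative; ignores the small DC correction)
    return np.cos(w0 * t) * np.exp(-t**2 / 2)

def cwt(f, scales, w0=6.0):
    out = np.empty((len(scales), len(f)))
    for i, a in enumerate(scales):
        t = np.arange(-4 * a, 4 * a + 1)
        out[i] = np.convolve(f, psi(t / a, w0) / np.sqrt(a), mode="same")
    return out

t = np.arange(1024)
f = np.cos(2 * np.pi * 0.0001 * t**2)       # chirp: inst. frequency 0.0002*t

W = np.abs(cwt(f, np.arange(2, 64)))        # scalogram magnitude
early = np.argmax(W[:, 200])                # dominant scale index at t = 200
late = np.argmax(W[:, 800])                 # dominant scale index at t = 800
assert late < early                         # frequency rises, so scale falls
```

The ridge of `W` drifts toward smaller scales as time advances - exactly the diagonal trajectory described above, which the Fourier spectrum cannot show.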

7.2 Gabor Wavelets and the STFT Connection

Gabor wavelets (complex Morlet wavelets with exact zero mean) interpolate between the STFT and the CWT:

ψGabor(t)=1πσ(eiω0teσ2ω02/2)et2/(2σ2)\psi_\text{Gabor}(t) = \frac{1}{\sqrt{\pi\sigma}}\left(e^{i\omega_0 t} - e^{-\sigma^2\omega_0^2/2}\right)e^{-t^2/(2\sigma^2)}

The parameter σ\sigma controls the time-frequency trade-off:

  • Large σ\sigma: good frequency resolution (narrow band), poor time resolution
  • Small σ\sigma: good time resolution, poor frequency resolution

The STFT with a Gaussian window of width σ\sigma is identical to the CWT with Gabor wavelets at a fixed scale. The key difference: in the STFT, σ\sigma is fixed for all frequencies; in the CWT, σ\sigma scales with aa (constant-Q).

The STFT time-frequency tiling uses uniform rectangles: each tile has the same area ΔtΔξ=const\Delta t \cdot \Delta\xi = \text{const}.
The CWT uses logarithmic tiling: tiles are wider at low frequencies (long Δt\Delta t, small Δξ\Delta\xi) and narrow at high frequencies.

7.3 Synchrosqueezing Transform

The Synchrosqueezing Transform (SST) (Daubechies-Lu-Wu, 2011) sharpens the scalogram by reassigning energy to the instantaneous frequency curve, producing a "synchrosqueezed" time-frequency representation with much higher resolution.

For a signal with instantaneous frequency ω(t)\omega(t):

ω^(a,b)=bWψf(a,b)Wψf(a,b)12πi\hat{\omega}(a, b) = \frac{\partial_b W_\psi f(a, b)}{W_\psi f(a,b)} \cdot \frac{1}{2\pi i}

estimates the instantaneous frequency at (a,b)(a, b). The synchrosqueezed transform reassigns the energy from the scalogram at (a,b)(a, b) to the time-frequency point (b,ω^(a,b))(b, \hat{\omega}(a,b)). The result is a sharper, more interpretable time-frequency map.
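A minimal sketch of the phase-derivative estimator behind SST: compute a complex Morlet CWT of a pure tone at one matched scale, then take a finite difference of the unwrapped phase. The wavelet parameters and the matched-scale choice are illustrative assumptions:

```python
import numpy as np

def cwt_row(f, a, w0=6.0):
    # Complex Morlet CWT at a single scale, via direct convolution
    t = np.arange(-4 * a, 4 * a + 1)
    kernel = np.exp(1j * w0 * t / a - (t / a)**2 / 2) / np.sqrt(a)
    return np.convolve(f, kernel, mode="same")

nu = 0.05                                    # true frequency (cycles/sample)
t = np.arange(2048)
f = np.cos(2 * np.pi * nu * t)

a = 6.0 / (2 * np.pi * nu)                   # scale whose center freq matches nu
W = cwt_row(f, a)

# Discrete version of omega_hat = (d_b W / W) / (2 pi i): finite difference of
# the unwrapped phase, restricted to the interior (inside the usable region)
nu_hat = np.diff(np.unwrap(np.angle(W[512:1536]))) / (2 * np.pi)
assert abs(np.median(nu_hat) - nu) < 5e-3
```

Synchrosqueezing would then move the scalogram energy at (a,b)(a, b) to the estimated frequency ν^(b)\hat{\nu}(b), collapsing the smeared ridge onto a sharp curve.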

Application: SST is used to separate overlapping modes in multicomponent signals (e.g., ECG with P, QRS, T waves; audio with multiple instruments). It has become a standard alternative to empirical mode decomposition and is available in dedicated toolboxes (e.g., the ssqueezepy package in Python, wsst in MATLAB).


8. Applications in Machine Learning

8.1 Mallat Scattering Networks

The scattering transform (Mallat, 2012; Bruna-Mallat, 2013) builds provably stable, translation-invariant features by cascading wavelet transforms with modulus nonlinearities:

S_0 f = f * \phi_J \qquad \text{(zeroth order: low-pass average)}

S_1 f(\lambda_1, x) = |f * \psi_{\lambda_1}| * \phi_J(x) \qquad \text{(first order: averaged wavelet modulus)}

S_2 f(\lambda_1, \lambda_2, x) = ||f * \psi_{\lambda_1}| * \psi_{\lambda_2}| * \phi_J(x) \qquad \text{(second order)}

where ϕJ\phi_J is the low-pass filter at scale JJ, and ψλ\psi_\lambda ranges over wavelets at scales λ=2j\lambda = 2^j for j=1,,Jj = 1, \ldots, J.
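A toy first-order scattering transform makes the invariance concrete. This sketch uses circular FFT convolutions, dyadic Morlet filters, and a moving-average ϕJ\phi_J - all simplifications relative to a real implementation such as kymatio:

```python
import numpy as np

def circ_conv(x, k):
    # Circular convolution via FFT (keeps every stage shift-equivariant)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, len(x)))

def morlet(n, j, w0=6.0):
    a = 2.0**j
    t = np.arange(n) - n // 2
    return np.exp(1j * w0 * t / a - (t / a)**2 / 2) / np.sqrt(a)

def scatter1(x, J):
    # S1[j] = | x * psi_j | * phi_J, with phi_J a 2^J-wide moving average
    n = len(x)
    phi = np.ones(2**J) / 2**J
    feats = [np.real(circ_conv(np.abs(circ_conv(x, morlet(n, j))), phi))
             for j in range(1, J + 1)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
J = 5
S, S_shift = scatter1(x, J), scatter1(np.roll(x, 7), J)

rel = np.linalg.norm(S - S_shift) / np.linalg.norm(S)
rel_raw = np.linalg.norm(x - np.roll(x, 7)) / np.linalg.norm(x)
assert rel < 0.5 and rel < rel_raw   # features move far less than raw samples
```

A shift of 7 samples (well below 2J=322^J = 32) barely perturbs the scattering features, while the raw signal changes almost completely.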

Key theorems:

Theorem 8.1 (Translation invariance). Scattering coefficients are (approximately) invariant to translations of size 2J\leq 2^J.

Theorem 8.2 (Stability to deformations). If τ\tau is a C2C^2 diffeomorphism close to a translation (i.e., supτϵ1\sup |\nabla\tau| \leq \epsilon \ll 1), then:

SfS(fτ)Cϵf\|Sf - S(f \circ \tau)\| \leq C\,\epsilon\,\|f\|

This is the key guarantee that generic CNNs lack: without it, even an arbitrarily small rotation or stretch can change the features by an amount that is not controlled by the size of the deformation.

Scattering vs CNNs: Scattering networks have no learned parameters. Their features are competitive with shallow CNNs on MNIST and CIFAR-10. The architecture is:

Scattering Network Architecture
========================================================================

  Input f(x)
      |
      +-- f * phi_J ----------------------------> S_0  (global average)
      |
      +-- |f * psi_{j1}| * phi_J ---------------> S_1  (J terms)
      |
      +-- ||f * psi_{j1}| * psi_{j2}| * phi_J --> S_2  (O(J^2) terms)

  Total feature dimension: O(J^2)   (vs O(K^2 C) for CNN)
  Invariance: proven                (vs empirical for CNN)
  Stability:  proven                (vs empirical for CNN)

========================================================================

For AI: Scattering features fed to a simple linear classifier achieve ~87% on CIFAR-10. The same stability-by-design philosophy underlies SE(3)-equivariant networks, which process atomic coordinates in protein structure prediction and molecular modeling pipelines.

8.2 Wavelet-Based Attention

WaveBERT (2021) compresses long input sequences by applying wavelet transforms to token embeddings before attention, reducing sequence length by a factor of 2J2^J:

  1. Apply DWT to the token embedding sequence: e1,,eN(aJ,dJ,,d1)e_1, \ldots, e_N \to (a_J, d_J, \ldots, d_1)
  2. Run Transformer attention only on the approximation aJa_J (length N/2JN/2^J)
  3. Reconstruct full-resolution outputs via IDWT

This reduces the O(N2)O(N^2) attention complexity to O((N/2J)2)O((N/2^J)^2) - a factor of 4J4^J speedup.
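The pipeline can be sketched end to end in numpy. Here the detail branches are simply discarded so only step 2 is exercised; a real implementation would keep every djd_j for the IDWT reconstruction in step 3 (Haar filters used for simplicity):

```python
import numpy as np

def dwt_approx(E, J):
    # J levels of the Haar low-pass along the sequence axis; details d_1..d_J
    # are dropped in this sketch (step 3 would need them to invert)
    a = E
    for _ in range(J):
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return a

def attention(Q, K, V):
    # Plain softmax attention
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
N, d, J = 256, 16, 2
E = rng.standard_normal((N, d))           # token embedding sequence

A = dwt_approx(E, J)                      # length N / 2^J = 64
out = attention(A, A, A)
assert A.shape == (N // 2**J, d)
# Score matrix shrinks from N^2 = 65536 to (N/2^J)^2 = 4096 entries: 16x fewer
```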

Wavelet Attention (2023, various groups) applies 2D DWT to attention logits:

  • Compute attention score matrix S=QK/dS = QK^\top/\sqrt{d} of shape (L×L)(L \times L)
  • Apply 2D DWT to SS -> sparse representation (most attention is low-frequency)
  • Threshold small wavelet coefficients (attention is concentrated on main diagonals)
  • Reconstruct and apply softmax

This achieves near-full-attention quality with sparse coefficient computation, similar to the linear attention approximations but with theoretical backing from wavelet compression theory.

WaveMix (2022) replaces the attention mechanism entirely with 2D wavelet mixing, applying the DWT across spatial dimensions in vision tasks. The model achieves competitive accuracy on ImageNet with significantly fewer parameters than ViT.

8.3 Multiresolution in Vision Transformers

The standard Vision Transformer (ViT) processes all patches at a single scale with quadratic attention. Hierarchical architectures introduce the wavelet-like multi-scale structure:

Swin Transformer (2021): Applies attention within local windows and shifts them across layers. The patch merging operation (concatenating 2×22 \times 2 patches and projecting) is equivalent to a stride-2 convolution - structurally analogous to the DWT low-pass filter + downsampling step.

PVT (Pyramid Vision Transformer, 2021): Explicitly uses 4 stages with spatial reduction ratios {1,1/2,1/4,1/8}\{1, 1/2, 1/4, 1/8\} - matching the wavelet octave decomposition exactly.

FPN (Feature Pyramid Network, 2017): Multi-scale feature maps combined via lateral connections - the convolutional analogue of the wavelet synthesis filter bank.

Common pattern: Resolution halved, channels doubled at each level. This is exactly the wavelet dyadic tree structure, with learned filters replacing fixed wavelet filters.
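The analogy is exact in a toy case: the Haar LL subband of an image equals 2x2 average pooling up to a scalar, which is the simplest instance of "patch merging = low-pass + downsample":

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))

# Haar LL subband: orthonormal low-pass + downsample along both axes
row_lo = (img[0::2] + img[1::2]) / np.sqrt(2)
LL = (row_lo[:, 0::2] + row_lo[:, 1::2]) / np.sqrt(2)

# 2x2 average pooling (a stride-2 "patch merging" with an averaging projection)
pool = (img[0::2, 0::2] + img[0::2, 1::2]
        + img[1::2, 0::2] + img[1::2, 1::2]) / 4

assert np.allclose(LL, 2 * pool)   # identical up to the orthonormal factor of 2
```

Learned patch-merging projections generalize this fixed averaging filter, just as learned conv filters generalize fixed wavelet filters.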

8.4 Wavelet Regularization in Diffusion Models

WaveDiff (2023) applies the diffusion process in the wavelet domain rather than pixel space:

  1. Apply 2D DWT to images: x(xLL,xLH,xHL,xHH)x \to (x_{LL}, x_{LH}, x_{HL}, x_{HH})
  2. Apply Gaussian noise only to high-frequency subbands (xLH,xHL,xHHx_{LH}, x_{HL}, x_{HH}) - preserving the global structure in xLLx_{LL}
  3. Train the denoising network on the (much smaller) wavelet coefficients

Benefits:

  • Sequence length reduced by 4×4\times per level -> attention is 16×16\times cheaper
  • High-frequency details are easier to model when separated from low-frequency structure
  • Generation quality is preserved since wavelets are invertible

Wavelet-based training regularization: The spectral bias phenomenon (neural networks learn low frequencies first) can be accelerated by decomposing loss functions in the wavelet domain. Penalizing different scales at different rates speeds up convergence - equivalent to preconditioning by the wavelet spectrum.

8.5 Wavelet Denoising: Donoho-Johnstone

The Donoho-Johnstone framework (1994) provides a near-optimal, computationally efficient signal denoising procedure based on wavelet coefficient thresholding:

Setup. Observe yn=fn+ϵny_n = f_n + \epsilon_n where ff is the true signal and ϵnN(0,σ2)\epsilon_n \sim \mathcal{N}(0, \sigma^2) is Gaussian noise.

Algorithm:

  1. Compute DWT of yy: coefficients {dj,k}\{d_{j,k}\}
  2. Estimate noise level: σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|) / 0.6745 (using finest-scale detail coefficients)
  3. Set universal threshold: λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}
  4. Apply thresholding to all detail coefficients:
    • Soft: d^j,k=sign(dj,k)max(dj,kλ,0)\hat{d}_{j,k} = \text{sign}(d_{j,k})\max(|d_{j,k}| - \lambda, 0)
    • Hard: d^j,k=dj,k1[dj,k>λ]\hat{d}_{j,k} = d_{j,k} \cdot \mathbf{1}[|d_{j,k}| > \lambda]
  5. Reconstruct via IDWT of {aJ,d^J,,d^1}\{a_J, \hat{d}_J, \ldots, \hat{d}_1\}

Why does this work? Sparse coding: a smooth signal ff has few large wavelet coefficients; noise ϵ\epsilon spreads uniformly across all coefficients. By zeroing small coefficients (which are dominated by noise) and preserving large coefficients (dominated by signal), we efficiently remove noise.
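The five steps fit in a short numpy script. This sketch uses Haar filters and a low-frequency test signal; pywt would supply smoother wavelets and boundary handling:

```python
import numpy as np

def dwt(x, J):
    a, details = x, []
    for _ in range(J):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        details.append(d)            # details = [d1 (finest), ..., dJ]
    return a, details

def idwt(a, details):
    for d in reversed(details):      # rebuild from the coarsest level up
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2)
        up[1::2] = (a - d) / np.sqrt(2)
        a = up
    return a

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)
clean = np.sin(2 * np.pi * 4 * t / n)
noisy = clean + 0.5 * rng.standard_normal(n)

a, details = dwt(noisy, J=5)
sigma = np.median(np.abs(details[0])) / 0.6745        # MAD from finest scale
lam = sigma * np.sqrt(2 * np.log(n))                  # universal threshold
soft = [np.sign(d) * np.maximum(np.abs(d) - lam, 0) for d in details]
denoised = idwt(a, soft)

mse_before = np.mean((noisy - clean)**2)
mse_after = np.mean((denoised - clean)**2)
assert mse_after < mse_before
```

The slowly varying sine lives almost entirely in the approximation band, so thresholding the details removes most of the noise while leaving the signal intact.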

Theorem 8.3 (Near-optimality, Donoho-Johnstone 1994). The soft-threshold estimator f^\hat{f} satisfies:

E\|\hat{f} - f\|^2 \leq \left(2\log N + 1\right)\left(\sigma^2 + \sum_{n=1}^{N} \min(f_n^2, \sigma^2)\right)

This is within a 2logN2\log N factor of the minimax optimal risk - essentially unimprovable without additional assumptions.

For AI: Wavelet thresholding connects to:

  • LASSO (1\ell^1 penalty) in the wavelet domain: soft thresholding is the proximal operator for 1\ell^1
  • Sparse autoencoders (SAE): identifying sparse representations of neural network activations
  • Gradient compression in distributed training: transmit only large-magnitude gradient components (analogous to hard thresholding)
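The first connection can be checked numerically: the proximal objective is separable per coordinate, so a 1D grid search should land on the soft-threshold value (grid resolution is an arbitrary choice):

```python
import numpy as np

def soft(v, lam):
    # Closed-form soft threshold = prox of lam * |.|
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0)

# prox_{lam*||.||_1}(v) = argmin_x 0.5*(x - v)^2 + lam*|x|, per coordinate
lam, v = 0.7, 1.3
grid = np.linspace(-3, 3, 120001)
obj = 0.5 * (grid - v)**2 + lam * np.abs(grid)
assert abs(grid[np.argmin(obj)] - soft(v, lam)) < 1e-3   # both give ~0.6
```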

9. Common Mistakes

  1. Applying DWT without specifying boundary conditions.
     Why it's wrong: boundary effects corrupt coefficients near signal edges.
     Fix: use mode='periodization' for circular extension, 'symmetric' for mirror padding; always check the cone of influence.

  2. Using Haar wavelets for smooth signals and expecting good compression.
     Why it's wrong: Haar has only 1 vanishing moment; smooth functions need wavelets with many vanishing moments.
     Fix: choose db4+ or sym4+ for signals with continuous derivatives.

  3. Confusing scale and frequency.
     Why it's wrong: in the CWT, small scale aa = high frequency and large aa = low frequency; axes are often plotted reversed.
     Fix: always label axes explicitly; clarify whether the vertical axis is "scale" or "frequency".

  4. Using the CWT for discrete signals without discretization.
     Why it's wrong: the CWT is defined for continuous fL2(R)f \in L^2(\mathbb{R}); applying it naively to sampled data gives aliasing.
     Fix: use the DWT (Mallat algorithm) for discrete signals, or the pseudo-CWT with a=2ja = 2^j sampling.

  5. Thinking the DWT is redundant (like the STFT).
     Why it's wrong: the DWT has exactly NN coefficients for NN inputs - it is non-redundant.
     Fix: the redundant version (undecimated DWT, "à trous") is useful for denoising but costs O(NlogN)O(N\log N) storage.

  6. Forgetting the 1/a1/\sqrt{a} normalization in the CWT.
     Why it's wrong: without it, ψa,b\|\psi_{a,b}\| varies with scale, breaking energy conservation.
     Fix: always use ψa,b(t)=a1/2ψ((tb)/a)\psi_{a,b}(t) = a^{-1/2}\psi((t-b)/a) for L2L^2 normalization.

  7. Applying only one level of DWT and calling it "multi-scale".
     Why it's wrong: a single-level DWT splits into only 2 bands; multi-scale analysis requires J3J \geq 3 levels.
     Fix: apply the DWT recursively to the approximation branch for J=log2N3J = \lfloor\log_2 N\rfloor - 3 levels.

  8. Selecting the threshold λ\lambda globally (same for all scales).
     Why it's wrong: noise variance varies across scales for colored noise; a global threshold over- or under-smooths.
     Fix: use a scale-adaptive threshold λj=σ^j2logNj\lambda_j = \hat{\sigma}_j\sqrt{2\log N_j} where NjN_j is the number of coefficients at level jj.

  9. Assuming wavelet coefficients are Gaussian whenever the noise is Gaussian.
     Why it's wrong: coefficients of Gaussian noise are i.i.d. Gaussian only for orthogonal wavelets; correlated noise produces correlated, scale-dependent coefficients.
     Fix: verify the distribution; use non-parametric or robust thresholding for colored noise.

  10. Conflating the DWT with the FFT as a frequency decomposition.
      Why it's wrong: the DWT gives octave-band (logarithmic) frequency decomposition; the FFT gives uniform frequency decomposition.
      Fix: use the DWT when you want multi-scale (octave) analysis; use the FFT when you want uniform frequency resolution.

10. Exercises

Exercise 1 * - Haar DWT by Hand

Given the signal f=[4,6,10,12,8,6,5,5]f = [4, 6, 10, 12, 8, 6, 5, 5]:

(a) Compute one level of Haar DWT by hand: approximation coefficients a(1)\mathbf{a}^{(1)} and detail coefficients d(1)\mathbf{d}^{(1)}. (Recall: ak=(f2k+f2k+1)/2a_k = (f_{2k} + f_{2k+1})/\sqrt{2}, dk=(f2kf2k+1)/2d_k = (f_{2k} - f_{2k+1})/\sqrt{2}.)

(b) Apply one more level to a(1)\mathbf{a}^{(1)} to get a(2)\mathbf{a}^{(2)} and d(2)\mathbf{d}^{(2)}.

(c) Verify that f2=a(2)2+d(2)2+d(1)2\|\mathbf{f}\|^2 = \|\mathbf{a}^{(2)}\|^2 + \|\mathbf{d}^{(2)}\|^2 + \|\mathbf{d}^{(1)}\|^2 (Parseval).

(d) Reconstruct f\mathbf{f} from the wavelet coefficients {a(2),d(2),d(1)}\{\mathbf{a}^{(2)}, \mathbf{d}^{(2)}, \mathbf{d}^{(1)}\} by reversing the steps. Verify exact reconstruction.


Exercise 2 * - Admissibility and Zero Mean

(a) Verify that the Haar wavelet ψ=1[0,1/2)1[1/2,1)\psi = \mathbf{1}_{[0,1/2)} - \mathbf{1}_{[1/2,1)} has zero mean: ψ(t)dt=0\int_{-\infty}^\infty \psi(t)\,dt = 0.

(b) Compute ψ^(ξ)\hat{\psi}(\xi) for the Haar wavelet explicitly. Show ψ^(0)=0\hat{\psi}(0) = 0.

(c) A student proposes ψ(t)=et2/2\psi(t) = e^{-t^2/2} (Gaussian) as a wavelet. Show this fails the admissibility condition.

(d) Compute the admissibility constant Cψ=0ψ^(ξ)2/ξdξC_\psi = \int_0^\infty |\hat{\psi}(\xi)|^2/\xi\,d\xi for the Mexican Hat wavelet ψ^(ξ)=2πξ2eξ2/2\hat{\psi}(\xi) = \sqrt{2\pi}\,\xi^2\, e^{-\xi^2/2}. Is it finite?


Exercise 3 * - MRA Axiom Verification

Verify that the Shannon (sinc) MRA satisfies all five MRA axioms:

  • Vj={fL2(R):supp(f^)[2jπ,2jπ]}V_j = \{f \in L^2(\mathbb{R}) : \text{supp}(\hat{f}) \subseteq [-2^{-j}\pi, 2^{-j}\pi]\}
  • ϕ(t)=sinc(πt)=sin(πt)/(πt)\phi(t) = \text{sinc}(\pi t) = \sin(\pi t)/(\pi t)

(a) Prove the nesting property V1V0V_1 \subset V_0 directly from the definition.

(b) Prove density: show that for any fL2(R)f \in L^2(\mathbb{R}) and ϵ>0\epsilon > 0, there exists jj and gVjg \in V_j with fg<ϵ\|f - g\| < \epsilon.

(c) Verify that {ϕ(tk)}kZ\{\phi(t-k)\}_{k\in\mathbb{Z}} is orthonormal using Parseval's theorem.

(d) Write down the scaling function ϕ\phi in the frequency domain. What is the two-scale relation ϕ^(ξ)=H(ξ/2)ϕ^(ξ/2)\hat{\phi}(\xi) = H(\xi/2)\hat{\phi}(\xi/2)? What is H(ξ)H(\xi)?


Exercise 4 ** - Mallat Algorithm Implementation

Implement the complete Mallat DWT (analysis) and IDWT (synthesis) algorithms from scratch using only numpy.

(a) Implement haar_dwt(x, J) that applies JJ levels of Haar DWT to 1D signal x. Return the coefficient vector [aJ,dJ,dJ1,,d1][a_J, d_J, d_{J-1}, \ldots, d_1] of total length len(x).

(b) Implement haar_idwt(coeffs, J) that reconstructs the signal from the coefficients. Verify IDWT(DWT(x))=x\text{IDWT}(\text{DWT}(x)) = x with max error <1010< 10^{-10}.

(c) Apply your DWT to a noisy signal y[n]=sin(2π0.1n)+0.3ε[n]y[n] = \sin(2\pi \cdot 0.1 \cdot n) + 0.3\varepsilon[n] for n=0,,255n=0,\ldots,255, where εN(0,1)\varepsilon \sim \mathcal{N}(0,1). Apply soft thresholding with λ=0.5\lambda = 0.5 to all detail coefficients. Reconstruct and compute SNR improvement.

(d) Benchmark your implementation vs pywt.wavedec for N=214N = 2^{14} and 4 levels. Compare execution times and verify the outputs agree to numerical precision.


Exercise 5 ** - Daubechies Filter Verification

(a) Given the db2 filter h=[(1+3)/(42),(3+3)/(42),(33)/(42),(13)/(42)]h = [(1+\sqrt{3})/(4\sqrt{2}),\, (3+\sqrt{3})/(4\sqrt{2}),\, (3-\sqrt{3})/(4\sqrt{2}),\, (1-\sqrt{3})/(4\sqrt{2})], verify:

  • khk=2\sum_k h_k = \sqrt{2} (normalization)
  • khk2=1\sum_k h_k^2 = 1 (unit energy)
  • khkhk2=0\sum_k h_k h_{k-2} = 0 (orthogonality shift by 2)

(b) Compute the QMF partner gk=(1)kh1kg_k = (-1)^k h_{1-k} for db2. Verify H(ξ)2+G(ξ)2=1|H(\xi)|^2 + |G(\xi)|^2 = 1 numerically for 100 values of ξ[0,1]\xi \in [0, 1].

(c) Verify db2 has exactly 2 vanishing moments: compute k(1)kkmh1k\sum_k (-1)^k k^m h_{1-k} for m=0m=0 and m=1m=1 (should both be zero) and m=2m=2 (should be nonzero).

(d) Construct the db2 scaling function ϕ\phi by iterating the cascade algorithm: start from ϕ(0)=1[0,1]\phi^{(0)} = \mathbf{1}_{[0,1]} and apply ϕ(n+1)(t)=2khkϕ(n)(2tk)\phi^{(n+1)}(t) = \sqrt{2}\sum_k h_k\,\phi^{(n)}(2t-k) for 6 iterations. Plot the result.


Exercise 6 ** - 2D DWT and Image Compression

(a) Load or generate a 256×256256 \times 256 grayscale test image (use scipy.datasets.face(gray=True) or a numpy-generated pattern). Apply 3 levels of 2D Haar DWT using pywt.wavedec2.

(b) Visualize the 3-level wavelet decomposition, labeling all subbands (LL3, LH3, HL3, HH3, LH2, HL2, HH2, LH1, HL1, HH1).

(c) Implement a compression experiment: keep only the top k%k\% of wavelet coefficients (by magnitude), zero the rest, and reconstruct. Plot PSNR vs kk for k{1,2,5,10,20,50}k \in \{1, 2, 5, 10, 20, 50\}.

(d) Compare with JPEG-style DCT compression: apply 2D DCT to 8×88\times 8 blocks and keep only the top k%k\% of DCT coefficients. Plot PSNR vs kk for both methods on the same axes. At what compression ratio does DWT outperform DCT?


Exercise 7 *** - Scattering Transform Features

(a) Implement a simple 1D scattering transform (order 1 and 2) using pywt:

  • Zeroth order: low-pass filter the input
  • First order: S1[f,j1]=DWTjfϕJS_1[f, j_1] = |\text{DWT}_j f| * \phi_J
  • Second order: S2[f,j1,j2]=DWTj1fψj2ϕJS_2[f, j_1, j_2] = ||\text{DWT}_{j_1} f| * \psi_{j_2}| * \phi_J

(b) Generate signals from two classes: class A = sin(2π0.1t)+0.1ε\sin(2\pi \cdot 0.1 t) + 0.1\varepsilon; class B = sin(2π0.3t)+0.1ε\sin(2\pi \cdot 0.3 t) + 0.1\varepsilon. Compute scattering features and verify they are approximately translation-invariant: shift the signal by kk samples and check that SfS(Tkf)/Sf<0.01\|Sf - S(T_k f)\| / \|Sf\| < 0.01 for k2Jk \leq 2^J.

(c) Test stability to deformation: apply a small time-stretching ff(1.05t)f \to f(1.05t) and measure SfS(fτ)/Sf\|Sf - S(f \circ \tau)\| / \|Sf\|. Compare with the same test for raw Fourier features.

(d) Train a linear classifier on scattering features vs raw FFT features for a 4-class signal classification problem. Compare accuracy and number of features needed.


Exercise 8 *** - Wavelet Denoising and Sparse Coding

(a) Implement Donoho-Johnstone wavelet denoising with:

  • Universal threshold λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}
  • Both soft and hard thresholding
  • Noise level estimation via MAD: σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|)/0.6745

(b) Test on f[n]=2sin(2π0.1n)+sin(2π0.35n)f[n] = 2\sin(2\pi \cdot 0.1 n) + \sin(2\pi \cdot 0.35 n) with N=512N=512, noise levels σ{0.1,0.5,1.0,2.0}\sigma \in \{0.1, 0.5, 1.0, 2.0\}. Plot SNR improvement vs σ\sigma for soft vs hard thresholding.

(c) Show the connection to LASSO: prove that soft thresholding is the proximal operator of the 1\ell^1 norm: proxλ1(v)i=sign(vi)max(viλ,0)\text{prox}_{\lambda\|\cdot\|_1}(v)_i = \text{sign}(v_i)\max(|v_i|-\lambda, 0).

(d) Implement a sparse coding experiment: represent a signal as a sparse linear combination of Daubechies db4 wavelets at 3 scales. Use LASSO regression (sklearn.linear_model.Lasso) to find the sparse coefficients. Compare reconstruction quality (PSNR) vs number of non-zero coefficients to direct DWT thresholding.


11. Why This Matters for AI (2026 Perspective)

  • Convolutional neural networks - each conv layer is a learned filter bank; the DWT is the fixed-filter special case. Impact: explains why CNNs learn edge detectors and Gabor-like filters at lower layers.

  • Swin Transformer patch merging - structurally identical to the stride-2 DWT low-pass step: concatenate 2×22 \times 2 patches, then linear projection. Impact: multi-scale attention is the learned version of multi-level DWT.

  • ViT + FPN / feature pyramids - multi-resolution feature maps follow the wavelet octave structure exactly. Impact: architecture design (number of levels, channel doubling at each scale).

  • Speech: Whisper / wav2vec 2.0 - the mel spectrogram is an STFT with mel-warped frequencies, approximating wavelet constant-Q analysis. Impact: replacing the STFT with a CWT could improve speech models; wavelet features are more robust to pitch shift.

  • Diffusion models (WaveDiff) - separate noise diffusion of LL vs LH/HL/HH subbands; spatial size reduced 4×4\times per level. Impact: lower sampling cost, better preservation of global structure, faster inference.

  • Scattering networks - CNN-equivalent features with zero learned parameters; provably stable to deformations. Impact: robustness certification, data-efficient learning, group-equivariant architectures.

  • Sparse autoencoders (SAE) - mechanistic interpretability via sparse wavelet-like features of activations. Impact: soft thresholding = LASSO proximal operator; the same math as wavelet denoising.

  • Gradient compression - transmitting top-kk gradient values = hard thresholding in parameter space. Impact: wavelet-domain gradients can be compressed per scale/location, not per parameter.

  • Long-range Transformers (WaveBERT) - compress the token sequence by 2J2^J via DWT before attention. Impact: O((N/2J)2)O((N/2^J)^2) attention complexity; quality preserved via wavelet inversion.

  • Protein structure (SE(3)-nets) - scattering-inspired SE(3)-equivariant networks process atomic coordinates. Impact: geometry processing in structure-prediction pipelines; antibody design; molecular dynamics.

  • JPEG 2000 / neural codecs - CDF 9/7 biorthogonal wavelet; hyperprior in learned image compression. Impact: variable-rate coding; ROI coding; streaming and progressive decode.

  • Spectral bias in neural networks - networks learn low-frequency components first (captured by coarse wavelet scales). Impact: curriculum learning; coarse-to-fine progressive training strategies.

12. Conceptual Bridge

Looking Backward

This section sits at the summit of the Fourier Analysis chapter. Everything built in Section 20-01 through Section 20-04 contributes:

From Section 20-01 (Fourier Series): The idea of decomposing a signal into orthogonal basis functions is generalized: instead of sinusoids with fixed global support, we use wavelets with variable-scale local support. The Parseval identity from Section 20-01 reappears as the energy conservation property of the wavelet transform.

From Section 20-02 (Fourier Transform): The uncertainty principle ΔtΔξ1/(4π)\Delta t \cdot \Delta\xi \geq 1/(4\pi) is the fundamental constraint that motivates wavelets. The continuous Fourier transform is the "full-global" limit; the CWT interpolates between full-time and full-frequency localization. Plancherel's theorem underlies both.

From Section 20-03 (DFT and FFT): The Mallat algorithm uses convolution + downsampling - precisely the operations made efficient by the FFT in Section 20-03. The DWT at each level costs O(N)O(N) via direct convolution (filter is short); a CWT discretization would cost O(NlogN)O(N\log N) via FFT.

From Section 20-04 (Convolution Theorem): The filter bank conditions (QMF, perfect reconstruction) are applications of the convolution algebra from Section 20-04. The LTI system framework identifies each wavelet scale as a bandpass filter; the synthesis bank cancels aliasing via the Convolution Theorem's frequency-domain perspective.

Looking Forward

Statistical Learning Theory (Section 21): Wavelet regularity classes (Besov spaces, Sobolev spaces) are the natural function classes for non-parametric estimation. Minimax rates of convergence depend on the wavelet regularity of the target function.

Functional Analysis (Section 12): The MRA axioms live in L2(R)L^2(\mathbb{R}) - a Hilbert space (Section 12-02). Riesz bases, frames, and Hilbert space geometry underpin the infinite-dimensional wavelet theory. The scattering transform connects to group representations (Section 12-04 preview).

Differential Geometry (Section 25): Wavelets generalize to manifolds via harmonic analysis on Riemannian spaces. Spectral graph wavelets (Section 11-04) extend the MRA framework to graphs. Heat kernel wavelets define multi-scale decompositions on curved spaces - used in geometric deep learning.

WAVELET SECTION IN CURRICULUM
=======================================================================

  Section 20-01 Fourier Series          Section 12-02 Hilbert Spaces
     | orthogonal basis concept            | L^2 theory
     v                                     v
  Section 20-02 Fourier Transform       Section 20-03 DFT and FFT
     | uncertainty principle               | filter banks, O(N log N)
     v                                     v
  Section 20-04 Convolution Theorem
     | QMF, LTI systems
     v
  +======================================+
  |  Section 20-05 WAVELETS              |
  |  ----------------------------------- |
  |  CWT, MRA, Mallat O(N) algorithm     |
  |  Daubechies, QMF, perfect recon.     |
  |  2D DWT, JPEG 2000, scattering       |
  |  Denoising, WaveDiff, WaveBERT       |
  +======================================+
          |                     |
          v                     v
  Section 21 Statistical     Section 11-04 Spectral
  Learning Theory            Graph Theory
  (Besov spaces,             (graph wavelets,
   minimax rates)             GCN, ChebNet)
          |                     |
          v                     v
  Section 25 Differential    Section 12-04 Group
  Geometry                   Representations
  (manifold wavelets)        (scattering theory)

=======================================================================

<- Back to Fourier Analysis | Previous: Convolution Theorem <- | Next Chapter: Statistical Learning Theory ->


Appendix A: Biorthogonal Wavelets and JPEG 2000

A.1 Motivation for Biorthogonal Wavelets

Daubechies' theorem establishes an uncomfortable trade-off for orthonormal wavelets: apart from the Haar wavelet, a compactly supported orthonormal wavelet cannot be symmetric. This is a fundamental obstruction - symmetry of the (real) filter requires the linear-phase condition H(\xi) = e^{-2\pi i K\xi}\overline{H(\xi)} for some integer KK, which is incompatible with compact support and orthonormality except in the Haar case.

Symmetry is highly desirable in image processing: symmetric filters do not introduce phase distortion, and symmetric boundary extension (mirror padding) is more natural than periodic extension. For this reason, biorthogonal wavelets (Cohen-Daubechies-Feauveau, 1992) replace the single orthonormal filter bank with a pair of filter banks:

  • Analysis filters: {h,g}\{h, g\} - used in the forward DWT
  • Synthesis filters: {h~,g~}\{\tilde{h}, \tilde{g}\} - used in the inverse DWT

where the analysis and synthesis wavelets ψ\psi and ψ~\tilde{\psi} satisfy the biorthogonality condition:

ψj,k,ψ~j,k=δjjδkk\langle \psi_{j,k}, \tilde{\psi}_{j',k'}\rangle = \delta_{jj'}\delta_{kk'}

but \langle \psi_{j,k}, \psi_{j,k'}\rangle \neq 0 in general for kkk \neq k' - each family is not orthogonal to its own translates; orthogonality holds only between the analysis and synthesis families.

A.2 The CDF 9/7 Wavelet

The Cohen-Daubechies-Feauveau 9/7 wavelet (CDF 9/7) is the biorthogonal wavelet used in JPEG 2000 for lossy compression. It has:

  • Analysis filter length: 9 taps (odd, therefore symmetric)
  • Synthesis filter length: 7 taps (odd, symmetric)
  • 4 vanishing moments (analysis) / 4 vanishing moments (synthesis)
  • Linear phase (symmetric filter -> zero phase distortion)

The 9/7 analysis lowpass filter coefficients are:

  k      h_k
  0       0.6029490182363579
  ±1      0.2668641184428723
  ±2     -0.07822326652898785
  ±3     -0.01686411844287495
  ±4      0.02674875741080976

The filter is symmetric: hk=hkh_k = h_{-k}. The corresponding synthesis filter has 7 nonzero coefficients.

Why 9/7? The choice balances:

  • Compression efficiency (4 vanishing moments approx good polynomial annihilation)
  • Visual quality (smooth synthesis wavelet approx no ringing artifacts)
  • Computational cost (9 taps = fast, especially in hardware)
  • Symmetry (eliminates the Daubechies asymmetry problem)

For lossless JPEG 2000, the Le Gall 5/3 biorthogonal wavelet is used (integer lifting steps, exact arithmetic).
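As a concrete illustration of the reversible 5/3 transform, here is a minimal NumPy sketch of one level of the Le Gall 5/3 integer lifting. The function names are illustrative, and periodic boundary handling is used for simplicity (JPEG 2000 itself uses symmetric extension):

```python
import numpy as np

def legall53_forward(x):
    """One level of the Le Gall 5/3 integer DWT via lifting (periodic boundary)."""
    e, o = x[0::2].copy(), x[1::2].copy()     # split into even / odd samples
    # Predict: detail = odd - floor-average of neighboring evens
    d = o - (e + np.roll(e, -1)) // 2
    # Update: approximation = even + rounded quarter-sum of neighboring details
    a = e + (np.roll(d, 1) + d + 2) // 4
    return a, d

def legall53_inverse(a, d):
    """Exact integer inverse: undo the update, then the predict, then interleave."""
    e = a - (np.roll(d, 1) + d + 2) // 4
    o = d + (e + np.roll(e, -1)) // 2
    x = np.empty(2 * e.size, dtype=a.dtype)
    x[0::2], x[1::2] = e, o
    return x
```

Because the inverse subtracts exactly the integer quantities the forward pass added, reconstruction is bit-exact despite the rounding - this is what makes lossless coding possible.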

A.3 Lifting Scheme

The lifting scheme (Sweldens, 1995) provides an efficient factorization of any wavelet filter bank into elementary "lifting steps" - pairs of predict and update operations:

  1. Split: divide the signal into even (en=x2ne_n = x_{2n}) and odd (on=x2n+1o_n = x_{2n+1}) samples
  2. Predict: update odd samples to reduce prediction error: ononP(e)no_n \leftarrow o_n - P(e)_n
  3. Update: update even samples to maintain global statistics: enen+U(o)ne_n \leftarrow e_n + U(o)_n
  4. Scale: normalize: enen/Ke_n \leftarrow e_n/K, ononKo_n \leftarrow o_n \cdot K

The lifting scheme:

  • Requires exactly NN multiplications and NN additions for NN inputs (half the cost of the naive filter bank)
  • Allows in-place computation (no extra memory)
  • Supports the integer DWT (lossless coding) by rounding at each step
  • Makes hardware implementation trivial (pipeline two lifting steps)

The Haar DWT is the simplest lifting scheme: Predict step ononeno_n \leftarrow o_n - e_n, Update step enen+on/2e_n \leftarrow e_n + o_n/2.
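The four steps above can be sketched directly in NumPy. This is an illustrative implementation for even-length inputs (function name and normalization choice are mine, with K=1/2K = 1/\sqrt{2} in the text's notation):

```python
import numpy as np

def haar_lifting(x):
    """One level of the Haar DWT via lifting: split, predict, update, scale."""
    e, o = x[0::2].astype(float), x[1::2].astype(float)  # 1. split
    o = o - e                      # 2. predict: detail = odd - even
    e = e + o / 2                  # 3. update: e becomes the pairwise average
    K = np.sqrt(2)
    return e * K, o / K            # 4. scale to the orthonormal convention
```

The result is the orthonormal Haar pair a_k = (x_{2k} + x_{2k+1})/\sqrt{2}, d_k = (x_{2k+1} - x_{2k})/\sqrt{2}, so energy is preserved (Parseval).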


Appendix B: Frames and Redundant Representations

B.1 Wavelet Frames

When wavelet functions do not form an exact orthonormal basis but still allow signal reconstruction, they form a frame. A frame {ψj,k}\{\psi_{j,k}\} satisfies:

Af2j,kf,ψj,k2Bf2A\|f\|^2 \leq \sum_{j,k} |\langle f, \psi_{j,k}\rangle|^2 \leq B\|f\|^2

for constants 0<AB<0 < A \leq B < \infty called frame bounds. If A=BA = B, the frame is tight (uniform stability in all directions). Tight frames satisfy f=1Aj,kf,ψj,kψj,kf = \frac{1}{A}\sum_{j,k}\langle f, \psi_{j,k}\rangle \psi_{j,k} (simple inversion formula, same as orthonormal basis but with redundancy B/A1B/A \geq 1).
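A finite-dimensional toy example makes the tight-frame identity concrete (not from the text, but a standard illustration): three unit vectors at 120° in R2\mathbb{R}^2 - the "Mercedes-Benz" frame - form a tight frame with A=B=3/2A = B = 3/2:

```python
import numpy as np

# Three unit vectors at 120-degree spacing: a tight frame for R^2 with A = B = 3/2
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
frame = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

def frame_energy(f):
    """Sum of squared frame coefficients sum_k |<f, phi_k>|^2."""
    return np.sum((frame @ f) ** 2)

f = np.random.randn(2)
ratio = frame_energy(f) / np.dot(f, f)   # equals A = 1.5 for every nonzero f
```

The redundancy here is 3 vectors for a 2-dimensional space, yet the reconstruction formula is as simple as for an orthonormal basis, up to the factor 1/A1/A.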

Why frames? Overcomplete (redundant) representations have:

  • Better numerical stability (more "angles" to represent the signal)
  • Invariance properties (e.g., undecimated DWT = shift-invariant wavelet frame)
  • Flexibility in design (no restriction to dyadic sampling)

Continuous wavelet frames. The CWT with arbitrary discretization of (a,b)(a, b) forms a frame as long as the sampling is dense enough. For dyadic sampling a=2ja = 2^j, b=k2jδb = k \cdot 2^j \cdot \delta with δ>0\delta > 0 small enough, the CWT forms a frame.

B.2 Undecimated DWT (Stationary Wavelet Transform)

The undecimated DWT (also called the shift-invariant DWT or "à trous" algorithm) removes the 2\downarrow 2 downsampling step in the Mallat algorithm:

Analysis step (undecimated):
  a^{(j+1)}[n] = \sum_k h_k\, a^{(j)}[n - 2^j k]   (convolution with the filter upsampled by 2^j)
  d^{(j+1)}[n] = \sum_k g_k\, a^{(j)}[n - 2^j k]

The output at each level has the same length NN as the input. Total output size: O(NJ)O(N \cdot J) (redundant by factor JJ).

Properties:

  • Shift-invariant: UDWT(Tkf)=Tk(UDWT(f))\text{UDWT}(T_k f) = T_k(\text{UDWT}(f)) - unlike the downsampled DWT where shifts cause coefficient permutations
  • Better denoising: Averaging over all circular shifts of the DWT (cycle-spinning) gives the undecimated DWT implicitly; this averages out the artifacts from the dyadic grid
  • Cost: O(NJ)O(NJ) instead of O(N)O(N) - more expensive but often worth it for denoising

Cycle-spinning (Coifman-Donoho, 1995): Average wavelet-threshold denoising over all NN cyclic shifts. Equivalent to undecimated DWT thresholding. Eliminates "ringing" artifacts near discontinuities that standard DWT thresholding produces.
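A minimal NumPy sketch of the à trous recursion for the Haar filters illustrates the shift-invariance property (circular boundary; a toy version, not a full implementation):

```python
import numpy as np

def udwt_haar(x, J):
    """Undecimated Haar DWT: no downsampling, filter holes widen by 2^j per level."""
    a = np.asarray(x, dtype=float)
    details = []
    for j in range(J):
        shifted = np.roll(a, 2 ** j)                  # a^(j)[n - 2^j], circular
        details.append((a - shifted) / np.sqrt(2))    # high-pass: g = (1, -1)/sqrt(2)
        a = (a + shifted) / np.sqrt(2)                # low-pass:  h = (1, 1)/sqrt(2)
    return a, details
```

Since every level is a circular convolution and circular convolutions commute with circular shifts, transforming a shifted signal simply shifts all coefficient arrays - exactly the property the decimated DWT lacks.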


Appendix C: Wavelet Regularity and Approximation Theory

C.1 Besov Spaces

The natural function spaces for wavelet approximation theory are Besov spaces Bp,qs(R)B^s_{p,q}(\mathbb{R}), which generalize Sobolev spaces by specifying different regularity in different LpL^p norms.

A function ff belongs to Bp,qsB^s_{p,q} (smoothness ss, LpL^p integrability, q\ell^q scale summability) if and only if its wavelet coefficients satisfy:

(j0(2js(kdj,kp)1/p)q)1/q<\left(\sum_{j \geq 0} \left(2^{js} \left(\sum_k |d_{j,k}|^p\right)^{1/p}\right)^q\right)^{1/q} < \infty

This characterization makes wavelet coefficients the canonical coordinate system for Besov spaces.

Key special cases:

  • B2,2s=HsB^s_{2,2} = H^s (Sobolev space) - standard smooth functions
  • B^s_{\infty,\infty} = C^s (Hölder space) - Hölder-continuous functions of order ss (for non-integer ss)
  • B1,1sB^s_{1,1} - functions with sparse wavelet representations (relevant for compression)

For AI: The spectral bias of neural networks (Rahaman et al., 2018) can be formalized as: gradient descent preferentially fits functions in lower-order Besov spaces first. Understanding Besov spaces explains why networks need curriculum learning for high-frequency targets.

C.2 Approximation Error in Wavelet Bases

Theorem C.1 (Nonlinear approximation, Cohen-Dahmen-DeVore 2001). For fBp,psf \in B^s_{p,p} with s>0s > 0, the best nn-term approximation fnf_n (keeping the nn largest wavelet coefficients) satisfies:

ffnL2Cns/dfBp,ps\|f - f_n\|_{L^2} \leq C\, n^{-s/d} \|f\|_{B^s_{p,p}}

where dd is the signal dimension (d=1d=1 for 1D signals, d=2d=2 for images). Compare to linear approximation (keeping all coefficients at the JJ coarsest scales):

fPJfL2CNJs/d\|f - P_J f\|_{L^2} \leq C\, N_J^{-s/d}

where NJ=2JdN_J = 2^{Jd} is the number of kept coefficients. Both rates are the same for Sobolev spaces, but nonlinear approximation is far better for piecewise smooth functions (e.g., natural images with edges).

For image compression: Natural images are well-modeled as piecewise smooth (Cartoon model: smooth regions + C2C^2 edges). Their best nn-term wavelet approximation error decays as n1n^{-1} in 2D, while DCT-based compression (JPEG) achieves only n1/2n^{-1/2} (no adaptation to edges). This is the theoretical explanation for JPEG 2000's superiority over JPEG.
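The linear-vs-nonlinear gap is easy to see numerically for a step function, whose Haar coefficients are sparse. A small sketch (orthonormal Haar DWT written out in NumPy; by Parseval, the L2L^2 approximation error equals the 2\ell^2 norm of the discarded coefficients):

```python
import numpy as np

def haar_coeffs(x):
    """Full orthonormal Haar DWT, returned coarse-to-fine: [a_J, d_J, ..., d_1]."""
    a, levels = np.asarray(x, dtype=float), []
    while a.size > 1:
        e, o = a[0::2], a[1::2]
        levels.append((e - o) / np.sqrt(2))   # detail arrays, finest first
        a = (e + o) / np.sqrt(2)
    return np.concatenate([a] + levels[::-1])

x = np.where(np.arange(64) < 37, 0.0, 1.0)    # step function: sparse in Haar
c = haar_coeffs(x)
n = 8
# Nonlinear: keep the n largest-magnitude coefficients (Parseval gives the error)
err_nonlinear = np.sqrt(np.sum(np.sort(c ** 2)[:-n]))
# Linear: keep the n coarsest coefficients
err_linear = np.sqrt(np.sum(c[n:] ** 2))
```

The step has only one nonzero detail coefficient per scale (the block containing the jump), so keeping the 8 largest coefficients captures it essentially exactly, while the linear scheme discards the fine-scale coefficients at the jump and incurs an O(1)O(1) error.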

C.3 Wavelet Coefficients and Signal Properties

The wavelet coefficients dj,k=f,ψj,kd_{j,k} = \langle f, \psi_{j,k}\rangle carry rich information about the local structure of ff:

  Coefficient behavior                                              Signal property
  |d_{j,k}| \le C\, 2^{j(\alpha+1/2)} at fine scales                f is Hölder-regular with exponent \alpha near 2^j k
  |d_{j,k}| decays rapidly at fine scales                           f is smooth near 2^j k
  Large isolated coefficients across scales                         Point singularity (discontinuity) at 2^j k
  Cone of influence: large |d_{j,k}| for |2^j k - t_0| \le C\, 2^j  Singularity located at t_0

This local regularity characterization is used in:

  • Image edge detection via wavelet modulus maxima (Mallat-Hwang algorithm)
  • Fractal dimension estimation from power-law behavior of kdj,k2\sum_k |d_{j,k}|^2
  • Neural network weight analysis - regularity of trained weight matrices

Appendix D: Wavelet Connections to Other Methods

D.1 Relationship to Subband Coding

Subband coding (used in audio compression, e.g. the analysis filter banks of MP3 and AAC) divides a signal into frequency bands using a filter bank. The DWT is a special case of octave subband coding with:

  • Logarithmic (octave) frequency spacing
  • Orthogonal / biorthogonal filter bank
  • Perfect reconstruction guarantee

MP3 uses a 32-band uniform polyphase (pseudo-QMF) filter bank - uniform bands, not wavelet octaves. AAC uses the Modified DCT (MDCT) - a lapped transform that shares the PR property with wavelets but uses cosine rather than wavelet basis functions. The similarity is not coincidental: all PR filter banks can be parameterized via unitary matrices (Vaidyanathan, 1993).

D.2 Relationship to Multirate Signal Processing

The DWT is a multirate signal processing system. Key identifiers:

  • Polyphase representation: filter bank in terms of z2z^2 operations (even/odd polyphase components)
  • Noble identity: upsampling-then-filtering = filtering-then-upsampling with upsampled filter
  • Nyquist criterion for PR: H(z)H(z1)+H(z)H(z1)=2H(z)H(z^{-1}) + H(-z)H(-z^{-1}) = 2 is the half-band Nyquist filter condition

Understanding this connection enables implementation optimization: FFT-based polyphase filtering reduces the cost of DWT from O(pN)O(pN) to O(Nlogp)O(N\log p) for long filters.

D.3 Relationship to Attention Mechanisms

The wavelet transform and self-attention both compute weighted sums over all input positions. The differences are instructive:

  Property        Wavelet Transform              Self-Attention
  Weights         Fixed (filter h)               Data-dependent (QK^\top)
  Locality        Compact support of \psi        Global (all pairs)
  Scale           Dyadic hierarchy (fixed)       Single scale (fixed head dim)
  Complexity      O(N)                           O(N^2)
  Equivariance    Translation-equivariant        Permutation-equivariant
  Inversion       Exact (PR)                     Approximate (LayerNorm + residuals)

Attention can be viewed as a data-adaptive, globally-supported filter bank - it learns which positions to attend to, while a wavelet selects which scale/location to probe. The convergence of wavelet ideas and attention is driving research in wavelet-based long-range Transformers (see Section 8.2).

D.4 Wavelets and Renormalization Group Theory

In theoretical physics, the renormalization group (RG) describes how physical systems look at different scales. The key operation - integrating out short-distance degrees of freedom to obtain an effective theory at larger scales - is structurally identical to the DWT low-pass step:

  • RG block-spin transformation (Kadanoff 1966): average spins in a block -> coarser description
  • DWT approximation step: ak(j+1)=nhn2kan(j)a^{(j+1)}_k = \sum_n h_{n-2k} a^{(j)}_n - average input samples via hh -> coarser approximation

The detail coefficients d(j)d^{(j)} capture the "fluctuations" integrated out at each RG step. The analogy is precise for the Haar wavelet (block averaging = Kadanoff transformation) and approximate for general wavelets.

For AI: This RG/wavelet correspondence explains why:

  1. Neural networks with hierarchical architectures (ResNet, UNet, Swin) effectively implement learned RG flows
  2. The number of scales needed grows logarithmically with the ratio of finest to coarsest detail log2(N)\log_2(N)
  3. Feature representations at different layers of a CNN correspond to effective theories at different RG scales

Numerical renormalization group methods in quantum many-body physics use tensor network contractions that are analogous to wavelet coefficient manipulations.


Appendix E: Quick Reference

E.1 Notation Summary

  Symbol                                            Meaning
  \psi                                              Mother wavelet
  \phi                                              Scaling function (father wavelet)
  \psi_{a,b}(t) = a^{-1/2}\psi((t-b)/a)             Dilated/translated wavelet
  \psi_{j,k}(t) = 2^{-j/2}\psi(2^{-j}t - k)         DWT basis function (scale j, position k)
  W_\psi f(a,b) = \langle f, \psi_{a,b}\rangle      Continuous Wavelet Transform
  d_{j,k} = \langle f, \psi_{j,k}\rangle            Detail coefficients
  a_{j,k} = \langle f, \phi_{j,k}\rangle            Approximation coefficients
  h_k                                               Low-pass (scaling) filter
  g_k = (-1)^k h_{1-k}                              High-pass (wavelet) filter (QMF relation)
  V_j                                               Approximation space at scale j
  W_j                                               Detail space at scale j
  C_\psi                                            Admissibility constant
  dbN                                               Daubechies wavelet with N vanishing moments
  symN                                              Symlet (near-symmetric version of dbN)

E.2 Key Formulas

Admissibility: C_\psi = \int_0^\infty |\hat{\psi}(\xi)|^2/\xi\,d\xi < \infty; finiteness requires \hat{\psi}(0) = 0, and given mild decay of \hat{\psi} is equivalent to it.

CWT: Wψf(a,b)=a1/2f(t)ψ((tb)/a)dtW_\psi f(a,b) = a^{-1/2}\int f(t)\overline{\psi((t-b)/a)}\,dt

CWT inversion: f(t)=Cψ10RWψf(a,b)ψa,b(t)dadba2f(t) = C_\psi^{-1}\int_0^\infty\int_\mathbb{R} W_\psi f(a,b)\,\psi_{a,b}(t)\,\frac{da\,db}{a^2}

Two-scale relation: ϕ(t)=2khkϕ(2tk)\phi(t) = \sqrt{2}\sum_k h_k\phi(2t-k)

Wavelet relation: ψ(t)=2kgkϕ(2tk)\psi(t) = \sqrt{2}\sum_k g_k\phi(2t-k)

QMF relation: gk=(1)kh1kg_k = (-1)^k h_{1-k}

Mallat analysis: ak(j+1)=nhn2kan(j)a^{(j+1)}_k = \sum_n h_{n-2k}\,a^{(j)}_n, dk(j+1)=ngn2kan(j)d^{(j+1)}_k = \sum_n g_{n-2k}\,a^{(j)}_n

Mallat synthesis: an(j)=khn2kak(j+1)+kgn2kdk(j+1)a^{(j)}_n = \sum_k h_{n-2k}\,a^{(j+1)}_k + \sum_k g_{n-2k}\,d^{(j+1)}_k

Parseval (DWT): f2=j,kdj,k2+kaJ,k2\|f\|^2 = \sum_{j,k}|d_{j,k}|^2 + \sum_k |a_{J,k}|^2

Vanishing moments: tmψ(t)dt=0\int t^m \psi(t)\,dt = 0 for m=0,,N1m = 0,\ldots,N-1

Universal threshold: λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}, σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|)/0.6745

Soft threshold: Tλ(x)=sign(x)max(xλ,0)T_\lambda(x) = \text{sign}(x)\max(|x|-\lambda, 0)
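In code, the soft-thresholding rule is a one-liner (NumPy sketch; function name is mine):

```python
import numpy as np

def soft_threshold(x, lam):
    """T_lam(x) = sign(x) * max(|x| - lam, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Coefficients below lam are zeroed; the rest shrink toward zero by lam
soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)   # -> [2.0, 0.0, -1.0]
```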

E.3 Standard Wavelet Types and Their Uses

  Use Case                  Recommended Wavelet    Reason
  Audio denoising           db4, sym4              4 vanishing moments, regular, fast
  Image compression         CDF 9/7, CDF 5/3       Symmetric, linear phase, JPEG 2000 standard
  Edge detection            Mexican Hat, db2       2nd-derivative structure; 2 vanishing moments
  EEG/EMG analysis          Morlet, db4            Good time-frequency trade-off, physiological bands
  Geophysical seismology    Morlet, Paul           Complex, analytic (instantaneous frequency)
  Time series (financial)   db4, Haar              Haar for piecewise constant, db4 for regular
  Scattering networks       Complex Morlet         Near-optimal time-frequency uncertainty
  Lossless image coding     CDF 5/3 (integer)      Integer lifting steps, exact inverse
  Quick analysis            Haar                   Fastest, simplest, pedagogical

Appendix F: Wavelet Software Ecosystem

F.1 PyWavelets (pywt)

The standard Python wavelet library. Key functions:

import numpy as np
import pywt

# List available wavelets
pywt.families()             # ['haar', 'db', 'sym', 'coif', 'bior', 'rbio', 'dmey', ...]
pywt.wavelist(family='db')  # ['db1', 'db2', ..., 'db38']

# Create wavelet object and inspect its filter bank
w = pywt.Wavelet('db4')
print(w.dec_lo)   # decomposition (analysis) low-pass filter
print(w.dec_hi)   # decomposition (analysis) high-pass filter
print(w.rec_lo)   # reconstruction (synthesis) low-pass filter
print(w.rec_hi)   # reconstruction (synthesis) high-pass filter

# Example inputs
fs = 1000.0                                   # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.random.randn(t.size)
image = np.random.rand(256, 256)

# 1D DWT
coeffs = pywt.wavedec(signal, 'db4', level=5)  # [cA5, cD5, cD4, cD3, cD2, cD1]
rec = pywt.waverec(coeffs, 'db4')

# 2D DWT
coeffs2 = pywt.wavedec2(image, 'db4', level=3)
# coeffs2 = [cA3, (cH3, cV3, cD3), (cH2, cV2, cD2), (cH1, cV1, cD1)]
rec2 = pywt.waverec2(coeffs2, 'db4')

# Continuous Wavelet Transform
scales = np.arange(1, 64)
coefficients, frequencies = pywt.cwt(signal, scales, 'morl', sampling_period=1/fs)

F.2 Denoising with pywt

import pywt, numpy as np

def denoise_signal(y, wavelet='db4', level=5, mode='soft'):
    # Decompose
    coeffs = pywt.wavedec(y, wavelet, level=level)
    
    # Estimate noise from finest scale
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    
    # Universal threshold
    N = len(y)
    threshold = sigma * np.sqrt(2 * np.log(N))
    
    # Threshold all detail coefficients (not approximation)
    thresholded = [coeffs[0]]  # keep approximation
    for c in coeffs[1:]:
        thresholded.append(pywt.threshold(c, threshold, mode=mode))
    
    # Reconstruct
    return pywt.waverec(thresholded, wavelet)[:len(y)]

F.3 Integration with PyTorch

For differentiable wavelet transforms in deep learning:

# pytorch_wavelets (pip install pytorch_wavelets)
from pytorch_wavelets import DWTForward, DWTInverse

xfm = DWTForward(J=3, mode='zero', wave='db3')  # forward DWT
ifm = DWTInverse(mode='zero', wave='db3')       # inverse DWT

# x: (batch, channels, height, width)
Yl, Yh = xfm(x)  # Yl: low-freq (coarse), Yh: list of high-freq subbands
x_rec = ifm((Yl, Yh))

For learnable wavelet-like transforms, the filter coefficients hkh_k and gkg_k can be parameterized as network weights while enforcing the PR conditions as constraints or soft penalties.


Appendix G: Proofs of Key Results

G.1 Proof: QMF Relation Ensures Orthogonality

Claim: If gk=(1)kh1kg_k = (-1)^k h_{1-k}, then ψ0,k,ψ0,l=δkl\langle\psi_{0,k}, \psi_{0,l}\rangle = \delta_{kl}.

Proof sketch. In the frequency domain, G(ξ)=e2πiξH(ξ+1/2)G(\xi) = e^{-2\pi i \xi}\overline{H(\xi + 1/2)}. The orthonormality condition for translates of ψ\psi requires:

nZψ^(ξ+n)2=1a.e.\sum_{n\in\mathbb{Z}} |\hat{\psi}(\xi + n)|^2 = 1 \quad \text{a.e.}

Using \hat{\psi}(\xi) = G(\xi/2)\hat{\phi}(\xi/2), split the sum over nn into even terms n=2mn = 2m and odd terms n=2m+1n = 2m+1; since GG is 1-periodic and \sum_m |\hat{\phi}(\zeta + m)|^2 = 1 (orthonormality of the ϕ\phi translates):

\sum_n |\hat{\psi}(\xi+n)|^2 = |G(\xi/2)|^2 + |G(\xi/2 + 1/2)|^2

Substituting G(\xi) = e^{-2\pi i\xi}\overline{H(\xi+1/2)} turns the right-hand side into |H(\xi/2+1/2)|^2 + |H(\xi/2)|^2 = 1 by the power-complementary property |H(\xi)|^2 + |H(\xi+1/2)|^2 = 1. \square

G.2 Proof: Mallat Algorithm Computes Inner Products

Claim: The approximation coefficients produced by the Mallat algorithm equal ak(j)=f,ϕj,ka^{(j)}_k = \langle f, \phi_{j,k}\rangle and detail coefficients equal dk(j)=f,ψj,kd^{(j)}_k = \langle f, \psi_{j,k}\rangle.

Proof sketch. By induction: at scale j=0j=0, take a^{(0)}_n = \langle f, \phi_{0,n}\rangle (in practice initialized as a^{(0)}_n \approx f(n), which is exact for suitably band-limited ff). The downsampling step:

ak(j+1)=nhn2kan(j)=f,nhn2kϕj,n=f,ϕj+1,ka^{(j+1)}_k = \sum_n h_{n-2k}\,a^{(j)}_n = \langle f, \sum_n h_{n-2k}\,\phi_{j,n}\rangle = \langle f, \phi_{j+1,k}\rangle

using the two-scale relation ϕj+1,k=nhn2kϕj,n\phi_{j+1,k} = \sum_n h_{n-2k}\,\phi_{j,n} (derived from ϕ(t)=2khkϕ(2tk)\phi(t) = \sqrt{2}\sum_k h_k\phi(2t-k)). Similarly for dk(j+1)d^{(j+1)}_k. \square

G.3 Proof: DWT Has O(N)O(N) Complexity

Claim: Computing the full JJ-level DWT of a length-NN signal costs O(N)O(N) operations.

Proof. Each level jj processes a sequence of length Nj=N/2j1N_j = N/2^{j-1} with a filter of length pp (number of taps). The cost of convolution + downsampling is O(pNj)O(p \cdot N_j) operations. Total:

T(N) = \sum_{j=1}^{J} p\,N/2^{j-1} = pN\sum_{j=0}^{J-1}(1/2)^j = 2pN\,(1 - 2^{-J}) < 2pN

Since pp is fixed (depends only on the wavelet family, not NN), T(N)=O(N)T(N) = O(N). \square

Remark: The O(N)O(N) complexity is strict: for a pp-tap filter, the constant is exactly 2p2p multiplications per input sample (across all levels). For db4 (p=8p=8): 16 multiplications/sample - far fewer than the Nlog2NN\log_2 N operations of the FFT.
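A quick arithmetic check of the 2p2p-per-sample constant (illustrative numbers, not from the text):

```python
# Multiplication count for a J-level DWT with a p-tap filter:
# level j processes N_j = N / 2^(j-1) samples at p multiplications each.
N, p, J = 1 << 20, 8, 20
total_mults = sum(p * (N >> (j - 1)) for j in range(1, J + 1))
assert total_mults < 2 * p * N          # the geometric series stays below 2pN
mults_per_sample = total_mults / N      # approaches 2p = 16 for db4 (p = 8)
```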


This section is part of Chapter 20: Fourier Analysis and Signal Processing. For questions or corrections, open an issue in the repository.