Wavelets and Multiresolution Analysis

"Wavelets are a mathematical microscope: by changing the magnification and the position of the lens, one can examine local features at any desired scale."

  • Stéphane Mallat, A Wavelet Tour of Signal Processing, 1998

Overview

The Fourier transform is one of the most powerful tools in mathematics - but it has a fundamental blind spot. When you decompose a signal into sinusoids, you learn which frequencies are present, but you lose all information about when they occur. A spike at 1 kHz in the first millisecond looks identical to a spike at 1 kHz in the last millisecond. For stationary signals this is fine, but real-world signals - speech, music, seismic data, financial time series, images - are profoundly non-stationary. Their frequency content changes over time.

Wavelets solve this problem by replacing the infinite sinusoidal Fourier atoms with localized oscillations called wavelets ("small waves"). A wavelet $\psi$ is a function with zero mean, compact (or rapidly decaying) support, and enough smoothness to serve as a basis. By shifting $\psi$ in time and scaling it in frequency, we get a family of basis functions that simultaneously encode position and scale information. The resulting wavelet transform gives a time-scale representation of signals that adapts to their local structure.

This section develops wavelets from first principles: the continuous wavelet transform (CWT), the theory of Multiresolution Analysis (MRA) that underpins discrete wavelets, the Mallat fast algorithm ($O(N)$ - faster than FFT!), the celebrated Daubechies construction of compactly-supported wavelets with vanishing moments, 2D wavelet transforms and JPEG 2000, and the modern AI applications: Mallat's scattering networks (stable CNNs without learned weights), wavelet-based attention transformers, and wavelet regularization in diffusion models.

Prerequisites

  • Fourier Transform - FT definition, Plancherel theorem, uncertainty principle (Section 20-02)
  • DFT and FFT - discrete Fourier transform, filter banks (Section 20-03)
  • Convolution Theorem - filter design, LTI systems, QMF conditions (Section 20-04)
  • $L^2$ spaces - inner products, orthonormal bases, Hilbert space completeness (Section 12-02 preview)
  • Complex exponentials - $e^{i\theta} = \cos\theta + i\sin\theta$, modulus, argument (Section 01)

Companion Notebooks

Notebook | Description
--- | ---
theory.ipynb | CWT scalograms, MRA cascade, Mallat algorithm, Daubechies construction, 2D DWT, scattering networks, wavelet denoising
exercises.ipynb | 10 graded problems: Haar DWT by hand through scattering network feature analysis

Learning Objectives

After completing this section, you will:

  1. Explain why the Fourier transform cannot localize events in time and how wavelets resolve this via the uncertainty principle
  2. Define the continuous wavelet transform (CWT) and state the admissibility condition for reconstruction
  3. State the five axioms of Multiresolution Analysis and derive the two-scale relation $\phi(x) = \sqrt{2}\sum_k h_k\,\phi(2x-k)$
  4. Derive the wavelet $\psi$ from the scaling function $\phi$ via the quadrature mirror filter relation $g_k = (-1)^k h_{1-k}$
  5. Implement the Mallat fast DWT algorithm and explain why its complexity is $O(N)$
  6. Explain what "vanishing moments" means for a wavelet and why more vanishing moments improve compression
  7. Construct the Daubechies db2 filter from spectral factorization
  8. Apply the 2D separable DWT and identify the LL, LH, HL, HH subbands
  9. Describe Mallat's scattering network and explain why it provides stable, translation-invariant features
  10. Apply soft/hard wavelet thresholding for signal denoising (Donoho-Johnstone framework)
  11. Connect wavelet filter banks to hierarchical vision architectures (Swin Transformer, PVT)
  12. Identify and fix the 10 most common wavelet implementation errors



1. Intuition

1.1 The Time-Frequency Problem

Consider a musical score. It tells you two things simultaneously: which note is played (frequency) and when it is played (time). The Fourier transform is like a musician who can tell you every pitch in a piece, but cannot tell you when any of them occur. For a signal $f(t)$, the Fourier coefficient:

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt$$

integrates over all time. Information about the temporal location of a frequency event is completely lost. A piano chord struck at $t=0$ and the same chord struck at $t=10$ seconds have identical Fourier transforms (in magnitude).

This is not a defect of a particular implementation - it is a fundamental consequence of using globally supported basis functions. Sinusoids $e^{2\pi i \xi t}$ have infinite support: they oscillate forever in both directions. When you decompose a signal in this basis, temporal information is irrevocably mixed into the phase relationships between coefficients, and extracting it requires knowing all coefficients simultaneously.

The consequences for AI are immediate. When processing speech, EEG signals, financial time series, or any non-stationary data, global frequency analysis discards the temporal structure that carries meaning. Recordings of "cat" and "tac" have similar Fourier magnitude spectra but different time-frequency patterns. The Whisper model addresses this by computing a mel-spectrogram (short-time Fourier transform) that provides local frequency information - but the STFT has its own limitations that wavelets resolve more elegantly.

1.2 The Uncertainty Principle and What Wavelets Buy

The Heisenberg-Gabor uncertainty principle (Section 20-02) states that no function can be simultaneously localized in both time and frequency beyond the bound:

$$\Delta t \cdot \Delta \xi \geq \frac{1}{4\pi}$$

where $\Delta t$ and $\Delta \xi$ are the standard deviations of $|f(t)|^2$ and $|\hat{f}(\xi)|^2$ respectively. This is a hard mathematical limit: you cannot have perfect resolution in both domains simultaneously.

What the Short-Time Fourier Transform (STFT) does: Fix a window function $g(t)$ (Gaussian, Hann, etc.) and compute:

$$\text{STFT}_f(t,\xi) = \int_{-\infty}^{\infty} f(\tau)\, g(\tau - t)\, e^{-2\pi i \xi \tau}\, d\tau$$

This gives local frequency information near time $t$. But the window $g$ is fixed - every frequency is analyzed at the same time resolution $\Delta t$. This is suboptimal: low frequencies vary slowly (a wide window is optimal), while high frequencies vary rapidly (a narrow window is optimal).

What wavelets do: Wavelets use a window whose width automatically adapts to the analysis scale. At high frequencies (small scale $a$), the wavelet is narrow in time - providing good temporal resolution. At low frequencies (large scale $a$), the wavelet is wide - providing good frequency resolution. This is the "constant-Q" or "constant relative bandwidth" property:

$$\frac{\Delta\xi}{\xi} = \text{const}$$

This adaptive tiling of the time-frequency plane is the geometric essence of wavelet analysis. It matches the structure of natural signals, which tend to have brief, sharp high-frequency events (transients) and slow, extended low-frequency oscillations.

1.3 Wavelets as Mathematical Microscopes

A wavelet $\psi$ is a function satisfying the admissibility condition (Section 2.1). From it, we build a two-parameter family of analyzing functions by translation (shifting in time by $b$) and dilation (stretching in scale by $a > 0$):

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)$$

The factor $1/\sqrt{a}$ normalizes energy: $\|\psi_{a,b}\|_2 = \|\psi\|_2$ for all $a, b$.

Think of this geometrically: as $a$ decreases (zooming in), $\psi_{a,b}$ becomes a narrow, high-frequency probe localized near $t = b$. As $a$ increases (zooming out), $\psi_{a,b}$ becomes a wide, low-frequency probe covering a broad time interval. The Continuous Wavelet Transform:

$$W_\psi f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)\, dt$$

measures the "content" of $f$ near time $b$ at scale $a$ - precisely the behavior of a microscope adjusted to different magnifications and positions.

For AI: Scattering networks (Mallat, 2012) use this idea to build stable, translation-invariant feature representations by cascading wavelet transforms with modulus nonlinearities. The resulting features have provable stability properties that learned CNNs lack - they cannot be destabilized by small deformations of the input.
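The "microscope" behavior is easy to see numerically. Below is a minimal numpy sketch (not from the companion notebooks) of the CWT via direct Riemann sums, using the Mexican hat wavelet defined in Section 2.2; the helper names `mexican_hat` and `cwt` are ad hoc:

```python
import numpy as np

def mexican_hat(t):
    # Ricker wavelet (Section 2.2): negative second derivative of a Gaussian
    return (2 / (np.sqrt(3) * np.pi**0.25)) * (1 - t**2) * np.exp(-t**2 / 2)

def cwt(f, dt, scales):
    """Direct Riemann-sum approximation of W_psi f(a, b) on a sampled grid."""
    n = len(f)
    t = (np.arange(n) - n // 2) * dt
    out = np.zeros((len(scales), n))
    for i, a in enumerate(scales):
        psi = mexican_hat(t / a) / np.sqrt(a)     # psi_{a,0}, energy-normalized
        # correlating f with shifted copies of psi_{a,0} approximates the integral
        out[i] = np.convolve(f, psi[::-1], mode="same") * dt
    return out

dt = 1 / 512
t = np.arange(0, 1, dt)
# 5 Hz in the first half, 40 Hz in the second: a non-stationary test signal
f = np.where(t < 0.5, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 40 * t))
W = cwt(f, dt, scales=[0.004, 0.032])
q = len(t) // 4
# The small scale (high-frequency probe) responds in the late quarter,
# the large scale (low-frequency probe) in the early quarter.
```

A Fourier magnitude spectrum of `f` would show both peaks but not their order in time; the two rows of `W` separate them by position as well as scale.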

1.4 Historical Timeline

Year | Contributor | Development
--- | --- | ---
1807 | Fourier | Series decomposition of heat equation solutions
1910 | Haar | First wavelet: the Haar function $\psi = \mathbf{1}_{[0,1/2)} - \mathbf{1}_{[1/2,1)}$
1946 | Gabor | Time-frequency atoms; uncertainty principle
1965 | Cooley-Tukey | FFT algorithm - makes DFT computationally practical
1982 | Morlet, Grossmann | Continuous wavelet transform formalized for geophysics
1986 | Meyer | Construction of smooth orthonormal wavelets
1988 | Daubechies | Compactly supported orthonormal wavelets with $p$ vanishing moments (db$p$)
1989 | Mallat | Multiresolution Analysis framework; fast $O(N)$ DWT algorithm
1992 | Coifman-Wickerhauser | Wavelet packets; best-basis algorithm
1994 | Donoho-Johnstone | Wavelet thresholding for near-optimal denoising
2000 | ISO | JPEG 2000 standard based on Cohen-Daubechies-Feauveau (CDF 9/7) biorthogonal wavelets
2012 | Mallat | Scattering networks: stable CNNs from wavelet cascades
2021 | Yao et al. | WaveBERT: wavelet token compression for long-range Transformers
2022 | Liu et al. | WaveMix: multi-scale 2D wavelet mixing for vision
2023 | Various | Wavelet-domain diffusion models; wavelet attention transformers

2. Formal Definitions

2.1 The Continuous Wavelet Transform

Definition 2.1 (Admissible wavelet). A function $\psi \in L^2(\mathbb{R})$ is an admissible wavelet if it satisfies the admissibility condition:

$$C_\psi = \int_0^{\infty} \frac{|\hat{\psi}(\xi)|^2}{\xi}\, d\xi < \infty$$

This requires $\hat{\psi}(0) = 0$, i.e., $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$ - the wavelet must have zero mean. Informally, it must "oscillate" (wave) while also being localized (small = "let"). The condition $C_\psi < \infty$ ensures the CWT is invertible.

Definition 2.2 (Continuous Wavelet Transform). For $f \in L^2(\mathbb{R})$ and admissible $\psi$, the CWT is:

$$W_\psi f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{a}}\,\overline{\psi\!\left(\frac{t-b}{a}\right)}\, dt, \qquad a > 0,\ b \in \mathbb{R}$$

Theorem 2.1 (CWT inversion formula). If $\psi$ is admissible, then:

$$f(t) = \frac{1}{C_\psi} \int_0^{\infty}\int_{-\infty}^{\infty} W_\psi f(a,b)\, \psi_{a,b}(t)\, \frac{da\, db}{a^2}$$

The measure $\frac{da\, db}{a^2}$ is the natural invariant measure on the affine group (translations and dilations). The CWT is an isometry: $\|W_\psi f\|^2_{L^2(da\,db/a^2)} = C_\psi \|f\|^2_{L^2}$.

Parseval formula for CWT. The energy identity reads:

$$\frac{1}{C_\psi}\int_0^\infty\int_{-\infty}^\infty |W_\psi f(a,b)|^2\,\frac{da\,db}{a^2} = \|f\|_{L^2}^2$$

For AI: The CWT is used for speech analysis (Morlet CWT of mel-filtered audio), EEG artifact detection, and seismic signal interpretation. However, it is the discrete wavelet transform - derived from MRA - that dominates computational applications.

2.2 Common Wavelet Families

The Haar Wavelet is the simplest compactly-supported wavelet, introduced by Alfred Haar in 1910:

$$\psi_\text{Haar}(t) = \begin{cases} 1 & 0 \leq t < 1/2 \\ -1 & 1/2 \leq t < 1 \\ 0 & \text{otherwise} \end{cases}$$

The Haar wavelet has compact support $[0,1]$, exactly 1 vanishing moment, and its scaling function is the box function $\phi_\text{Haar}(t) = \mathbf{1}_{[0,1)}(t)$. It is the only wavelet that is simultaneously symmetric (about its center, up to sign), orthonormal, and compactly supported. Its discontinuity makes it poor for smooth signals but ideal for edge detection and theoretical exposition.

The Morlet Wavelet is a complex sinusoid modulated by a Gaussian:

$$\psi_\text{Morlet}(t) = \pi^{-1/4}\left(e^{i\omega_0 t} - e^{-\omega_0^2/2}\right)e^{-t^2/2}$$

where $\omega_0 \approx 5$ is the center frequency. The correction term $e^{-\omega_0^2/2}$ ensures $\int \psi = 0$ exactly. For $\omega_0 \geq 5$, the correction term is negligible and $\psi_\text{Morlet}(t) \approx \pi^{-1/4} e^{i\omega_0 t - t^2/2}$.

The Morlet wavelet achieves optimal time-frequency localization (it saturates the Heisenberg uncertainty bound) and is widely used for analyzing oscillatory signals like EEG, ECG, and seismic data. It is not compactly supported (Gaussian tails) and does not generate an orthonormal basis, so it is used with the continuous transform rather than an orthogonal DWT.
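A quick numerical sanity check (a minimal numpy sketch, assuming the formula above) confirms that the correction term enforces zero mean and that the $\pi^{-1/4}$ factor gives (approximately) unit energy:

```python
import numpy as np

w0 = 5.0
t = np.linspace(-10, 10, 20001)
dt = t[1] - t[0]
psi = np.pi**-0.25 * (np.exp(1j * w0 * t) - np.exp(-w0**2 / 2)) * np.exp(-t**2 / 2)

mean = psi.sum() * dt                 # admissibility: should vanish
energy = (np.abs(psi) ** 2).sum() * dt  # L2 norm squared: should be ~1
```

Analytically, $\int \psi\,dt = \pi^{-1/4}\sqrt{2\pi}\,(e^{-\omega_0^2/2} - e^{-\omega_0^2/2}) = 0$ exactly; the Riemann sum reproduces this to numerical precision.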

The Mexican Hat (Ricker) Wavelet is the negative second derivative of a Gaussian:

$$\psi_\text{MexHat}(t) = \frac{2}{\sqrt{3}\,\pi^{1/4}}\left(1 - t^2\right)e^{-t^2/2}$$

Named for its characteristic shape. It has 2 vanishing moments and is real-valued. Used in edge detection and image analysis - it is the 1D analogue of the Laplacian-of-Gaussian kernel familiar from blob detection.

Daubechies Wavelets db$N$ are the compactly-supported orthonormal wavelets with $N$ vanishing moments and minimal support (unique up to the phase choice in the spectral factorization). Their explicit construction is in Section 5. Key members:

  • db1 = Haar wavelet (discontinuous, 1 vanishing moment)
  • db2 = Daubechies 4-tap filter (Hölder continuous, 2 vanishing moments)
  • db4 = Daubechies 8-tap filter (more regular, 4 vanishing moments)
  • db8 = 16-tap filter (very smooth, 8 vanishing moments)

Increasing $N$ gives smoother wavelets at the cost of longer filters (support width $= 2N-1$).

Symlets sym$N$ are near-symmetric modifications of db$N$ with equal numbers of vanishing moments but more symmetric filter banks. Preferred in image processing to reduce edge artifacts.

Coiflets coif$N$ have vanishing moments for both $\psi$ and $\phi$ (the scaling function), making the approximation coefficients near-samples of the signal at coarse scales.

2.3 Inner Product View and Energy Preservation

Hilbert space background. The space $L^2(\mathbb{R}) = \{f : \mathbb{R} \to \mathbb{C} : \int |f|^2 < \infty\}$ is a Hilbert space with inner product $\langle f, g \rangle = \int_{-\infty}^\infty f(t)\,\overline{g(t)}\, dt$.

Recall (Section 12-02): A Hilbert space is a complete inner product space. In $L^2(\mathbb{R})$, the Riesz representation theorem guarantees that every bounded linear functional is of the form $f \mapsto \langle f, g \rangle$ for a unique $g$.

A family $\{\psi_{j,k}\}$ of wavelets forms an orthonormal basis of $L^2(\mathbb{R})$ if:

  1. $\langle \psi_{j,k}, \psi_{j',k'} \rangle = \delta_{jj'}\delta_{kk'}$ (orthonormality)
  2. $f = \sum_{j,k} \langle f, \psi_{j,k} \rangle \psi_{j,k}$ for all $f \in L^2(\mathbb{R})$ (completeness)

Theorem 2.2 (Parseval for discrete wavelets). If $\{\psi_{j,k}\}$ is an orthonormal wavelet basis, then:

$$\|f\|^2 = \sum_{j \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} |\langle f, \psi_{j,k}\rangle|^2$$

Energy is perfectly preserved in the wavelet domain.

For AI: This energy preservation is crucial for compression and denoising. The wavelet coefficients $d_{j,k} = \langle f, \psi_{j,k}\rangle$ measure the "energy contribution" of scale $j$, position $k$. Large coefficients correspond to significant signal features; small coefficients correspond to noise - the basis for threshold denoising.
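Orthonormality across scales and positions can be checked concretely in the discrete Haar case. The sketch below (a minimal numpy illustration; `haar_vec` is an ad hoc helper) builds the 8 discrete Haar basis vectors on an 8-point grid and verifies that their Gram matrix is the identity:

```python
import numpy as np

def haar_vec(N, j, k):
    """Discrete Haar wavelet at scale j, position k, on an N-point grid:
    +1 on the first half of a block of 2^j samples, -1 on the second half."""
    w = 2**j
    v = np.zeros(N)
    v[k * w : k * w + w // 2] = 1.0
    v[k * w + w // 2 : (k + 1) * w] = -1.0
    return v / np.sqrt(w)          # unit norm

N = 8
basis = [haar_vec(N, j, k) for j in (1, 2, 3) for k in range(N // 2**j)]
basis.append(np.ones(N) / np.sqrt(N))   # coarsest scaling vector completes the basis

# Gram matrix of inner products: identity <=> orthonormal basis of R^8
G = np.array([[u @ v for v in basis] for u in basis])
```

Because the Gram matrix is the identity, Parseval holds automatically: for any vector $f \in \mathbb{R}^8$, the squared coefficients sum to $\|f\|^2$.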


3. Multiresolution Analysis

3.1 The MRA Axioms

Multiresolution Analysis (MRA), introduced by Stéphane Mallat and Yves Meyer in 1989, provides the algebraic framework that links continuous wavelet theory to fast discrete algorithms. An MRA is a ladder of closed subspaces $\{V_j\}_{j\in\mathbb{Z}}$ of $L^2(\mathbb{R})$ satisfying five axioms:

Definition 3.1 (Multiresolution Analysis). A sequence of closed subspaces $\{V_j : j \in \mathbb{Z}\} \subset L^2(\mathbb{R})$ forms an MRA if:

  1. Nesting: $\cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots$ (indexing convention: $j$ increases = coarser resolution, so each coarser space sits inside the next finer one)

  2. Density: $\overline{\bigcup_{j\in\mathbb{Z}} V_j} = L^2(\mathbb{R})$ (arbitrary precision is achievable by taking finer and finer scales)

  3. Separation: $\bigcap_{j\in\mathbb{Z}} V_j = \{0\}$ (at infinite coarsening, only the zero function remains)

  4. Scaling: $f(t) \in V_j \iff f(2t) \in V_{j-1}$ (going to a finer scale is equivalent to compressing the function by a factor of 2)

  5. Orthonormal basis: There exists $\phi \in V_0$ such that $\{\phi(t-k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $V_0$.

The function $\phi$ is the scaling function (or father wavelet) associated with the MRA. The orthonormality condition means:

$$\langle \phi(\cdot - k), \phi(\cdot - l) \rangle = \delta_{kl}, \qquad k, l \in \mathbb{Z}$$

Example 3.1 (Haar MRA). Define $V_j$ as the space of functions that are piecewise constant on dyadic intervals $[k \cdot 2^j, (k+1) \cdot 2^j)$ for $k \in \mathbb{Z}$. The scaling function is $\phi(t) = \mathbf{1}_{[0,1)}$. All five MRA axioms are satisfied. The approximation of $f$ at scale $j$ is the conditional expectation $E[f \mid \mathcal{F}_j]$, where $\mathcal{F}_j$ is the $\sigma$-algebra generated by dyadic intervals of width $2^j$.

Example 3.2 (Sinc / Shannon MRA). Define $V_j$ as the Paley-Wiener space of functions bandlimited to $[-2^{-j}\pi, 2^{-j}\pi]$. The scaling function is $\phi(t) = \operatorname{sinc}(t) = \sin(\pi t)/(\pi t)$. This is the MRA of ideally bandlimited signals. The sinc wavelet has infinite support - impractical but theoretically illuminating.
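The Haar MRA of Example 3.1 can be exercised directly on sampled data: projecting onto $V_j$ is just averaging over dyadic blocks, and the nesting axiom says that coarsening a fine projection gives the coarse projection. A minimal numpy sketch (the helper `project_Vj` is ad hoc):

```python
import numpy as np

def project_Vj(f, j):
    """Project a length-2^J sample vector onto the Haar space V_j:
    replace each dyadic block of 2^j samples by its average."""
    w = 2**j
    blocks = f.reshape(-1, w)
    return np.repeat(blocks.mean(axis=1), w)

f = np.array([1.0, 3.0, 2.0, 6.0, 0.0, 0.0, 4.0, 4.0])
p1 = project_Vj(f, 1)   # piecewise constant on pairs
p2 = project_Vj(f, 2)   # piecewise constant on blocks of 4 (coarser)

# Nesting: projecting the fine approximation to the coarse scale
# gives the same result as projecting the original directly.
```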

3.2 Scaling Functions and the Two-Scale Relation

Since $V_0 \subset V_{-1}$ (axiom 1) and $\{\phi(t-k)\}_k$ is an orthonormal basis for $V_0$, while $\{\sqrt{2}\,\phi(2t-k)\}_k$ is an orthonormal basis for $V_{-1}$ (by axiom 4), we can expand $\phi$ in the finer-scale basis:

Definition 3.2 (Two-scale relation / refinement equation). The scaling function satisfies:

$$\phi(t) = \sqrt{2}\sum_{k \in \mathbb{Z}} h_k\,\phi(2t - k)$$

where the coefficients $\{h_k\}$ form the low-pass analysis filter. This is the fundamental equation of wavelet theory - it says the scaling function at scale $0$ is a weighted sum of translated scaling functions at scale $-1$ (twice as fine).

The Fourier transform of the two-scale relation gives:

$$\hat{\phi}(\xi) = H\!\left(\frac{\xi}{2}\right)\hat{\phi}\!\left(\frac{\xi}{2}\right), \qquad H(\xi) = \frac{1}{\sqrt{2}}\sum_k h_k\, e^{-2\pi i k\xi}$$

Applying this recursively: $\hat{\phi}(\xi) = \prod_{j=1}^{\infty} H(\xi/2^j)$. The infinite product formula shows $\hat{\phi}$ is determined entirely by the filter $H(\xi)$ (the Fourier transform of the sequence $\{h_k/\sqrt{2}\}$).
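In the time domain, the same recursion is the cascade algorithm: iterate the two-scale relation starting from a box function, doubling the grid resolution each step. A minimal numpy sketch for the db2 filter (coefficients from Section 5.2; `cascade` is an ad hoc helper):

```python
import numpy as np

# db2 low-pass filter taps (Section 5.2); sum of taps = sqrt(2)
s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))

def cascade(h, n_iter=8):
    """Approximate the scaling function by iterating
    phi_{m+1}(t) = sqrt(2) * sum_k h_k phi_m(2t - k), starting from a box.
    After n_iter steps the samples live on a grid of spacing 2^{-n_iter}."""
    phi = np.array([1.0])
    for _ in range(n_iter):
        up = np.zeros(2 * len(phi))
        up[::2] = phi                          # zero-interleave: phi_m on a finer grid
        phi = np.sqrt(2) * np.convolve(up, h)  # apply the refinement filter
    return phi

phi = cascade(h)
dt = 2.0**-8
# Both invariants of the refinement step survive discretization:
# integral of phi is 1, and (by the CQF condition) its L2 norm is 1.
```

For the Haar filter the iteration reproduces the box function exactly; for db2 it converges to the familiar fractal-looking Daubechies scaling function.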

Orthonormality conditions. For $\{\phi(\cdot - k)\}_k$ to be orthonormal, the filter must satisfy:

$$|H(\xi)|^2 + |H(\xi + 1/2)|^2 = 1 \qquad \text{(power complementary, or Conjugate Quadrature Filter condition)}$$

This is the critical constraint in wavelet filter design.

For AI: The two-scale relation is structurally identical to a residual connection. The coarse approximation at scale $j$ is recursively expressed in terms of the approximation at scale $j-1$ (twice the resolution). This is precisely the architecture of U-Net and multi-scale vision models.

3.3 Detail Spaces and the Orthogonal Complement

Since $V_{j+1} \subset V_j$ (the coarser space sits inside the finer one - careful with the indexing convention), define the detail space $W_{j+1}$ as the orthogonal complement of $V_{j+1}$ within $V_j$:

$$V_j = V_{j+1} \oplus W_{j+1}$$

The detail space $W_{j+1}$ captures the "difference in information" between resolution $j$ and resolution $j+1$. Applying this decomposition repeatedly:

$$V_0 = V_J \oplus W_J \oplus W_{J-1} \oplus \cdots \oplus W_1 \qquad \text{(for any } J > 0\text{)}$$

Letting $J \to \infty$ (separation axiom) and applying the same splitting to the finer spaces $V_j$, $j < 0$ (density axiom):

$$L^2(\mathbb{R}) = \bigoplus_{j \in \mathbb{Z}} W_j$$

This is the fundamental wavelet decomposition theorem: the detail spaces $\{W_j\}_{j\in\mathbb{Z}}$ form a direct orthogonal sum decomposition of $L^2(\mathbb{R})$.

Orthonormal basis for $W_j$. If $\psi$ is the mother wavelet (Section 3.4), then $\{\psi_{j,k}(t) = 2^{-j/2}\psi(2^{-j}t - k)\}_{k\in\mathbb{Z}}$ is an orthonormal basis for $W_j$. Therefore:

$$\{\psi_{j,k} : j, k \in \mathbb{Z}\} \quad \text{is an orthonormal basis for } L^2(\mathbb{R})$$

Every $f \in L^2(\mathbb{R})$ expands as $f = \sum_{j,k} \langle f, \psi_{j,k}\rangle \psi_{j,k}$ with $\|f\|^2 = \sum_{j,k} |\langle f, \psi_{j,k}\rangle|^2$.

3.4 The Wavelet Equation and QMF Relation

The mother wavelet $\psi \in W_0$ is defined by the analogous two-scale relation:

$$\psi(t) = \sqrt{2}\sum_{k \in \mathbb{Z}} g_k\,\phi(2t - k)$$

where $\{g_k\}$ is the high-pass analysis filter. The filters $h$ and $g$ are related by the Quadrature Mirror Filter (QMF) condition:

$$g_k = (-1)^k h_{1-k} \qquad \text{(time-reverse and alternate signs)}$$

In the frequency domain: $G(\xi) = e^{-2\pi i\xi}\,\overline{H(\xi + 1/2)}$ (up to a sign convention), which ensures $H$ and $G$ are power complementary and together span all frequencies.

The QMF condition guarantees:

  1. $\psi \perp \phi$ (wavelet is orthogonal to scaling function)
  2. $\psi_{0,k} \perp \psi_{0,l}$ for $k \neq l$ (translated wavelets are orthogonal)
  3. $\psi_{j,k} \perp \psi_{j',k'}$ for $j \neq j'$ (wavelets at different scales are orthogonal)

Example 3.3 (Haar filters). For the Haar wavelet: $h_0 = h_1 = 1/\sqrt{2}$ (all others zero). Then $g_0 = 1/\sqrt{2}$, $g_1 = -1/\sqrt{2}$. Verify against $g_k = (-1)^k h_{1-k}$: $g_0 = (-1)^0 h_1 = 1/\sqrt{2}$ and $g_1 = (-1)^1 h_0 = -1/\sqrt{2}$. ✓

Geometric interpretation. The filter bank $[H, G]$ splits the normalized frequency axis $[0, 1]$ into two halves: $H$ preserves $[0, 1/2]$ (low frequencies = approximation), $G$ preserves $[1/2, 1]$ (high frequencies = detail). This is the wavelet counterpart of the Convolution Theorem from Section 20-04: decomposition into frequency bands via orthogonal filter pairs.
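The same check works for longer filters. A minimal sketch for db2 (filter values from Section 5.2; the helper `ip` is ad hoc) applies $g_k = (-1)^k h_{1-k}$ literally, including the negative indices it produces, and verifies the orthogonality it promises:

```python
import numpy as np

s3 = np.sqrt(3)
h = {0: (1 + s3) / (4 * np.sqrt(2)), 1: (3 + s3) / (4 * np.sqrt(2)),
     2: (3 - s3) / (4 * np.sqrt(2)), 3: (1 - s3) / (4 * np.sqrt(2))}

# QMF relation g_k = (-1)^k h_{1-k}: the support of g is {-2, -1, 0, 1}
g = {k: (-1) ** k * h[1 - k] for k in range(-2, 2)}

def ip(x, y, shift):
    """sum_k x_k * y_{k + 2*shift} over the common support"""
    return sum(x[k] * y.get(k + 2 * shift, 0.0) for k in x)

zero_mean = abs(sum(g.values()))                       # high-pass: coefficients sum to 0
cross = max(abs(ip(h, g, m)) for m in range(-3, 4))    # h and g orthogonal at all even shifts
unit = abs(sum(v * v for v in g.values()) - 1)         # g inherits unit norm from h
```

All three quantities vanish to machine precision, which is exactly what items 1-3 above assert at the filter level.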


4. The Mallat Algorithm (Fast DWT)

4.1 Analysis Filter Bank

The Mallat algorithm (1989) computes the DWT via iterated convolution + downsampling - the inner products $\langle f, \psi_{j,k}\rangle$ never need to be computed directly.

Given a sequence $\mathbf{a}^{(0)} = \{a^{(0)}_n\}$ (the "approximation coefficients" at the finest scale, equivalent to sampling $f$ at integer positions), one step of the DWT computes:

$$a^{(j+1)}_k = \sum_n h_{n-2k}\, a^{(j)}_n \qquad \text{(convolve with } \tilde{h}\text{, then downsample by 2)}$$

$$d^{(j+1)}_k = \sum_n g_{n-2k}\, a^{(j)}_n \qquad \text{(convolve with } \tilde{g}\text{, then downsample by 2)}$$

Here $\mathbf{a}^{(j+1)}$ are the approximation coefficients at scale $j+1$ (low-pass, half the samples), and $\mathbf{d}^{(j+1)}$ are the detail coefficients at scale $j+1$ (high-pass, half the samples). The sums above are convolutions with the time-reversed conjugate filters $\tilde{h}_n = \overline{h_{-n}}$ and $\tilde{g}_n = \overline{g_{-n}}$, evaluated at even positions.

Pictorially, one DWT step is:

a^{(j)} --+--[low-pass h]--[v2]--->  a^{(j+1)}   (approximation)
          +--[high-pass g]--[v2]--->  d^{(j+1)}   (detail)

Repeating on $\mathbf{a}^{(j+1)}$ gives the full multi-level decomposition.

For AI: This is a convolutional block whose filters are fixed rather than learned. The downsampling ($\downarrow 2$) is equivalent to stride-2 convolution. The architectures of U-Net, FPN (Feature Pyramid Network), and Swin Transformer all implement variants of this Mallat tree.
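One analysis step fits in a few lines of numpy. The sketch below (not from the companion notebooks) uses periodic boundary handling and the db2 filter from Section 5.2, with the equivalent causal QMF choice $g_k = (-1)^k h_{3-k}$ (a shift by 2 of the relation in Section 3.4):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # g_k = (-1)^k h_{3-k}: causal QMF partner

def dwt_step(a, h, g):
    """One Mallat analysis step with periodic boundaries:
    a_next[k] = sum_n h[n] * a[(n + 2k) % N]  (filter, keep even shifts)."""
    N = len(a)
    idx = (np.arange(len(h))[:, None] + 2 * np.arange(N // 2)[None, :]) % N
    return h @ a[idx], g @ a[idx]

x = np.sin(2 * np.pi * np.arange(16) / 16)
approx, detail = dwt_step(x, h, g)
# Orthogonality of the periodized filter bank => energy splits exactly
# between the approximation and detail channels.
```

Each output channel has half the samples of the input; iterating on `approx` produces the full decomposition tree of Section 6.1.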

4.2 Synthesis Filter Bank

Reconstruction (inverse DWT) is the exact reverse:

$$a^{(j)}_n = \sum_k a^{(j+1)}_k\, h_{n-2k} + \sum_k d^{(j+1)}_k\, g_{n-2k}$$

Pictorially:

a^{(j+1)} --[^2]--[h]--+
                       +--->  a^{(j)}
d^{(j+1)} --[^2]--[g]--+

where $\uparrow 2$ denotes upsampling by 2 (zero insertion between samples). The synthesis filters are the time-reverses of the analysis filters (for orthogonal wavelets).
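Analysis followed by synthesis should return the input exactly. A self-contained numpy sketch (periodic boundaries, db2 filters as in Section 5.2; `analysis`/`synthesis` are ad hoc helpers):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # causal QMF partner of h

def analysis(a, f):
    """coefficients c[k] = sum_n f[n] * a[(n + 2k) % N]"""
    N = len(a)
    idx = (np.arange(len(f))[:, None] + 2 * np.arange(N // 2)[None, :]) % N
    return f @ a[idx]

def synthesis(approx, detail, h, g):
    """Inverse step: upsample by zero insertion, filter, sum both branches."""
    N = 2 * len(approx)
    out = np.zeros(N)
    for k in range(len(approx)):
        for n in range(len(h)):          # scatter each coefficient back in place
            out[(n + 2 * k) % N] += approx[k] * h[n] + detail[k] * g[n]
    return out

x = np.random.default_rng(1).standard_normal(16)
a, d = analysis(x, h), analysis(x, g)
x_rec = synthesis(a, d, h, g)            # perfect reconstruction, to machine precision
```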

4.3 Perfect Reconstruction Conditions

Definition 4.1 (Perfect Reconstruction). A filter bank $\{h, g, \tilde{h}, \tilde{g}\}$ achieves perfect reconstruction if, for every input sequence $\mathbf{a}^{(j)}$, the synthesized output after analysis + synthesis equals $\mathbf{a}^{(j)}$ (up to a delay).

For orthogonal wavelets, the analysis filters equal the synthesis filters (after time-reversal):

$$\tilde{h}_n = \overline{h_{-n}}, \qquad \tilde{g}_n = \overline{g_{-n}}$$

The PR conditions in the $z$-transform domain (with $H(z) = \sum_k h_k z^{-k}$, so that $H(1) = \sqrt{2}$) are:

  1. Alias cancellation: $H(z)H(-z^{-1}) + G(z)G(-z^{-1}) = 0$
  2. Distortion-free: $H(z)H(z^{-1}) + G(z)G(z^{-1}) = 2$

Together, these ensure that aliasing from the $\downarrow 2$ operation is exactly cancelled in reconstruction.

Biorthogonal wavelets (used in JPEG 2000) use different analysis and synthesis filters, relaxing the orthogonality constraint while maintaining PR. This allows symmetric filters (impossible for compactly supported orthogonal wavelets with more than 1 vanishing moment, by Daubechies' theorem).
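The two PR conditions are easy to verify on the unit circle, where $z = e^{2\pi i \xi}$ and the half-band shift $z \mapsto -z$ becomes $\xi \mapsto \xi + 1/2$. A minimal numpy check for db2 (filters as in Section 5.2; `on_circle` is an ad hoc helper):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # causal QMF partner

def on_circle(c, xi):
    """Evaluate C(xi) = sum_k c[k] e^{-2 pi i k xi}  (z-transform on |z| = 1)"""
    k = np.arange(len(c))
    return np.sum(c * np.exp(-2j * np.pi * k * xi))

xi = np.linspace(0, 1, 64, endpoint=False)
H  = np.array([on_circle(h, x) for x in xi])
G  = np.array([on_circle(g, x) for x in xi])
Hs = np.array([on_circle(h, x + 0.5) for x in xi])   # half-band shifted
Gs = np.array([on_circle(g, x + 0.5) for x in xi])

distortion = np.abs(H) ** 2 + np.abs(G) ** 2     # distortion-free: constant 2
alias = H * np.conj(Hs) + G * np.conj(Gs)        # alias cancellation: identically 0
```

With the $H(1) = \sqrt{2}$ normalization, `distortion` evaluates to the constant 2 and `alias` vanishes at every frequency, which is the frequency-domain form of conditions 1 and 2 above.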

4.4 Computational Complexity

At level 1, the DWT processes $N$ samples with two convolutions + downsamplings, costing $O(N)$ operations (each filter has $p$ taps, so $O(pN)$, but $p$ is fixed). At level 2, we process $N/2$ samples, costing $O(N/2)$. Summing the geometric series:

$$\text{Total cost} = O(N) + O(N/2) + O(N/4) + \cdots = O(N)\sum_{j=0}^{J} 2^{-j} \leq 2\,O(N) = O(N)$$

The DWT is $O(N)$ - faster than the FFT's $O(N\log N)$!

Comparison:

Algorithm | Complexity | Memory | Time localization
--- | --- | --- | ---
DFT / FFT | $O(N\log N)$ | $O(N)$ | None (global)
STFT | $O(N\log N)$ | $O(N)$ | Fixed window
CWT (discrete approx.) | $O(N\log N)$ | $O(N\log N)$ | Scale-dependent
DWT (Mallat) | $O(N)$ | $O(N)$ | Scale-dependent

This $O(N)$ complexity is why wavelets underpin modern wavelet image coders (the JPEG 2000 standard and the EZW and SPIHT algorithms) and are computationally preferred over STFT for multi-scale analysis.


5. Daubechies Wavelets

5.1 Vanishing Moments

The most important property distinguishing different wavelet families is the number of vanishing moments.

Definition 5.1 (Vanishing moments). A wavelet $\psi$ has $N$ vanishing moments if:

$$\int_{-\infty}^{\infty} t^k\,\psi(t)\,dt = 0 \qquad \text{for } k = 0, 1, \ldots, N-1$$

Equivalently, in the Fourier domain: $\hat{\psi}^{(k)}(0) = 0$ for $k = 0,\ldots,N-1$, meaning $\hat{\psi}$ has a zero of order $N$ at $\xi = 0$.

Why vanishing moments matter:

  1. Polynomial annihilation. If $f$ is a polynomial of degree $< N$ on the support of $\psi_{j,k}$, then $\langle f, \psi_{j,k}\rangle = 0$. Smooth signals (well-approximated by polynomials locally) have near-zero detail coefficients everywhere except near discontinuities. This leads to sparse wavelet representations - ideal for compression.

  2. Approximation order. With $N$ vanishing moments, the wavelet approximation error at scale $j$ satisfies $\|f - P_j f\| \leq C \cdot 2^{-jN} \|f^{(N)}\|$ for $f \in C^N$. More vanishing moments = better approximation of smooth functions.

  3. Filter regularity. More vanishing moments force more zeros of $G(\xi)$ at $\xi = 0$ (equivalently, of $H(\xi)$ at $\xi = 1/2$), which makes the scaling function smoother.

Trade-off. More vanishing moments require longer filters (wider support). Daubechies db$N$ has exactly $N$ vanishing moments and minimum possible support $[0, 2N-1]$. There is no way to have $N$ vanishing moments with a shorter filter while maintaining orthonormality.
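Polynomial annihilation can be observed directly at the filter level: the db2 high-pass filter has 2 vanishing moments, so it maps any sampled linear signal to (numerically) zero detail coefficients, while a quadratic survives. A minimal numpy sketch (`details` is an ad hoc helper; boundary wrap is avoided by using interior windows only):

```python
import numpy as np

s3 = np.sqrt(3)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g = ((-1) ** np.arange(4)) * h[::-1]     # db2 high-pass: 2 vanishing moments

n = np.arange(64)
linear = 0.5 * n + 3.0                   # degree-1 polynomial: annihilated
quad = n.astype(float) ** 2              # degree-2 polynomial: NOT annihilated

def details(x, g):
    """Interior detail coefficients d[k] = sum_m g[m] x[2k + m]"""
    return np.array([np.dot(g, x[2 * k : 2 * k + 4])
                     for k in range((len(x) - 4) // 2)])

d_lin = details(linear, g)               # ~ 0 everywhere
d_quad = details(quad, g)                # constant, nonzero
```

This is exactly why smooth stretches of a signal produce near-zero db$N$ detail coefficients: locally they look like low-degree polynomials.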

5.2 Constructing db2 via Spectral Factorization

We construct the Daubechies db2 wavelet (4-tap filter, 2 vanishing moments) explicitly.

Step 1: Set up constraints. We need a filter $h = \{h_0, h_1, h_2, h_3\}$ (4 taps, support $[0,3]$) satisfying:

  • Power complementary: $|H(\xi)|^2 + |H(\xi + 1/2)|^2 = 1$
  • 2 vanishing moments: $G(0) = 0$ and $G'(0) = 0$, i.e., $H(1/2) = 0$ and $H'(1/2) = 0$
  • Normalization: $H(0) = 1$ (i.e., $\sum_k h_k = \sqrt{2}$)

These give 4 equations for 4 unknowns.

Step 2: Parameterize. The power-complementary condition $|H(\xi)|^2 + |H(\xi+1/2)|^2 = 1$ in the $z$-domain (with $H(z) = \sum_k h_k z^{-k}$) reads $H(z)H(z^{-1}) + H(-z)H(-z^{-1}) = 2$. This is a Bezout-type equation whose general solution is:

$$|H(\xi)|^2 = \left(\cos^2(\pi\xi)\right)^{N} P(\sin^2(\pi\xi))$$

where $P$ is a polynomial satisfying $P(y) \geq 0$ for $y \in [0,1]$ and the Bezout identity $(1-y)^N P(y) + y^N P(1-y) = 1$.

Step 3: Minimum phase factorization. For $N=2$, the minimal-degree solution is $P(y) = 1 + 2y$. Then:

$$|H(\xi)|^2 = \cos^4(\pi\xi)\left(1 + 2\sin^2(\pi\xi)\right)$$

Factoring this as $H(z)H(z^{-1})$ (spectral factorization), with the root contributed by the $P$-factor chosen inside the unit disk (minimum phase):

$$H(z) = \frac{1+\sqrt{3}}{4\sqrt{2}} + \frac{3+\sqrt{3}}{4\sqrt{2}}z^{-1} + \frac{3-\sqrt{3}}{4\sqrt{2}}z^{-2} + \frac{1-\sqrt{3}}{4\sqrt{2}}z^{-3}$$

The db2 filter coefficients (normalized so $\sum_k h_k^2 = 1$):

$$h_0 = \frac{1+\sqrt{3}}{4\sqrt{2}}, \quad h_1 = \frac{3+\sqrt{3}}{4\sqrt{2}}, \quad h_2 = \frac{3-\sqrt{3}}{4\sqrt{2}}, \quad h_3 = \frac{1-\sqrt{3}}{4\sqrt{2}}$$

Numerically: $h \approx \{0.4830, 0.8365, 0.2241, -0.1294\}$.

Verification: $h_0^2 + h_1^2 + h_2^2 + h_3^2 = 1$ ✓, $\sum_k h_k = \sqrt{2}$ ✓, $h_0h_2 + h_1h_3 = 0$ ✓
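The whole construction can be replayed numerically: build $|H(\xi)|^2$ as a Laurent polynomial in $z = e^{2\pi i\xi}$, find its roots, keep the minimum-phase one, and recover the four taps. A numpy sketch (variable names ad hoc; on the unit circle $\cos^2(\pi\xi) = (z + 2 + z^{-1})/4$ and $\sin^2(\pi\xi) = (-z + 2 - z^{-1})/4$):

```python
import numpy as np

# |H|^2 = cos^4(pi xi) * (1 + 2 sin^2(pi xi)) as a Laurent polynomial;
# arrays hold coefficients from the highest power of z down to the lowest.
c2 = np.array([0.25, 0.5, 0.25])        # cos^2(pi xi) = (z + 2 + 1/z)/4
P = np.array([-0.5, 2.0, -0.5])         # 1 + 2 sin^2(pi xi) = 2 - (z + 1/z)/2
M = np.polymul(np.polymul(c2, c2), P)   # coefficients of z^3 ... z^{-3}

roots = np.roots(M)                     # roots of z^3 * |H|^2(z)
# Four roots cluster at z = -1 (the cos^4 factor, ON the unit circle, split
# evenly between H(z) and H(z^{-1})); the P factor gives a reciprocal pair
# r and 1/r. Minimum phase keeps the root well inside the disk.
r = roots[np.abs(roots) < 0.9][0].real  # expect r = 2 - sqrt(3) ~ 0.268

hz = np.polymul(np.polymul([1.0, 1.0], [1.0, 1.0]), [1.0, -r])  # (1+z^{-1})^2 (1 - r z^{-1})
h = np.sqrt(2) * hz / hz.sum()          # normalize so sum h_k = sqrt(2)
```

The recovered `h` matches the closed-form db2 coefficients above; choosing $1/r$ instead of $r$ would give the time-reversed (maximum-phase) filter, illustrating the phase ambiguity of spectral factorization.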

5.3 Regularity and Hölder Continuity

The regularity of the Daubechies scaling function $\phi$ is determined by the spectral radius of the transition matrix $T_{ij} = \sqrt{2}\, h_{i-2j}$. For db$N$:

Theorem 5.1 (Daubechies regularity). The db$N$ scaling function $\phi \in C^\alpha$ (Hölder continuous of order $\alpha$) with:

$$\alpha_N \approx 0.2075\,N \qquad \text{(asymptotically as } N \to \infty\text{)}$$

Specific values:

  • db1 (Haar): $\alpha = 0$ (discontinuous, bounded variation)
  • db2: $\alpha \approx 0.550$ (Hölder continuous but not differentiable)
  • db4: $\alpha \approx 1.275$ (once continuously differentiable)
  • db8: $\alpha \approx 2.9$ (twice continuously differentiable)

The scaling function becomes smoother as $N$ increases, at the cost of wider support ($2N-1$ points).

5.4 Wavelet Family Comparison

WAVELET FAMILY COMPARISON
========================================================================

  Wavelet     | Support  | VM  | Regularity | Symmetric | Orthogonal
  ------------|----------|-----|------------|-----------|------------
  Haar        | [0,1]    |  1  | C^0        | Yes       | Yes
  db2         | [0,3]    |  2  | C^0.55     | No        | Yes
  db4         | [0,7]    |  4  | C^1.28     | No        | Yes
  db8         | [0,15]   |  8  | C^2.9      | No        | Yes
  sym4        | [0,7]    |  4  | C^1.28     | Near-sym  | Yes
  coif4       | [-5,5]   |  4  | C^1.28     | Near-sym  | Yes
  bior2.2     | [-1,3]   |  2  | C^1        | Yes       | Biortho
  CDF 9/7     | large    |  4  | smooth     | Yes       | Biortho
  Mexican Hat | infinite |  2  | C^inf      | Yes       | No (frame)
  Morlet      | infinite | inf | C^inf      | Near-sym  | No (frame)

  VM = Vanishing Moments; CDF 9/7 = JPEG 2000 standard

========================================================================

For AI: Smooth wavelets (large NN) produce sparse representations for smooth neural network weights and activations, while Haar is efficient for piecewise constant approximations. The CDF 9/7 biorthogonal wavelet used in JPEG 2000 balances symmetry, smoothness, and reconstruction quality - it is also used in wavelet-domain diffusion models to separate frequency bands.


6. Discrete Wavelet Transform in Practice

6.1 Multi-Level Decomposition Tree

The Mallat algorithm applied JJ times to a length-NN signal produces the wavelet decomposition tree:

WAVELET DECOMPOSITION TREE (J=3 levels)
========================================================================

  Input:  a[n],  length N
     |
     +--[h,v2]--->  a1[n],  length N/2    (approximation, level 1)
     |               |
     |               +--[h,v2]--->  a2[n],  length N/4   (approx., L2)
     |               |               |
     |               |               +--[h,v2]--->  a3[n]  (approx., L3)
     |               |               +--[g,v2]--->  d3[n]  (detail, L3)
     |               |
     |               +--[g,v2]--->  d2[n],  length N/4   (detail, L2)
     |
     +--[g,v2]--->  d1[n],  length N/2    (detail, level 1)

  Storage: N/2 + N/4 + N/8 + N/8 = N  (no redundancy!)

========================================================================

The output is: {aJ,dJ,dJ1,,d1}\{a_J, d_J, d_{J-1}, \ldots, d_1\} with total length =N= N (for periodic boundary). No information is lost or duplicated.

Frequency interpretation:

  • d_1: frequencies in [f_s/4, f_s/2] (highest octave)
  • d_2: frequencies in [f_s/8, f_s/4]
  • d_j: frequencies in [f_s/2^{j+1}, f_s/2^j] (octave bands)
  • a_J: frequencies in [0, f_s/2^{J+1}] (lowest frequencies)

where f_s is the sampling rate.
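The storage accounting and Parseval property of the tree can be checked in a few lines of numpy. This sketch uses the orthonormal Haar filters for simplicity; pywt.wavedec is the standard library route:

```python
import numpy as np

def haar_step(x):
    # One analysis step: orthonormal Haar low/high-pass, then downsample by 2
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

a, details = x, []
for _ in range(3):                        # J = 3 levels, as in the tree above
    a, d = haar_step(a)
    details.append(d)                     # details = [d1, d2, d3]

lengths = [len(a)] + [len(d) for d in details]
print(lengths)                            # [8, 32, 16, 8]: total is 64 = N
assert sum(lengths) == len(x)             # non-redundant storage

energy = a @ a + sum(d @ d for d in details)
assert np.isclose(energy, x @ x)          # Parseval across the whole tree
```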

6.2 Wavelet Packets

Standard DWT applies the filter bank only to the approximation branch (a(j)\mathbf{a}^{(j)}). Wavelet packets apply it to both branches - creating a full binary tree of subband decompositions:

At each node (j,n)(j, n) (level jj, node nn), both the "approximation" and "detail" sub-signals are further decomposed. This gives 2j2^j subbands at level jj, each holding N/2jN/2^j coefficients. The collection of all leaves at level JJ corresponds to 2J2^J uniform frequency subbands - a uniform tiling of the frequency axis, comparable to windowed DFT analysis with 2J2^J bands.

The best-basis algorithm (Coifman-Wickerhauser, 1992) selects the pruning of this tree that minimizes a cost function (e.g., entropy) - giving an adaptive time-frequency decomposition that is optimal for the specific signal.
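A toy version of the best-basis recursion illustrates the idea. This sketch uses Haar splits and the additive (unnormalized) entropy cost; the test signals and depth are arbitrary choices, not the full Coifman-Wickerhauser implementation:

```python
import numpy as np

def haar_split(x):
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def cost(x):
    # Unnormalized entropy cost; additive across subbands (Haar is orthogonal,
    # so energy is preserved and subtree costs are directly comparable)
    v = x**2
    v = v[v > 1e-12]
    return -np.sum(v * np.log(v)) if v.size else 0.0

def best_basis(x, depth):
    # Return (best cost, decision at this node) for the subtree rooted at x
    if depth == 0 or len(x) < 2:
        return cost(x), "leaf"
    lo, hi = haar_split(x)
    c_children = best_basis(lo, depth - 1)[0] + best_basis(hi, depth - 1)[0]
    c_here = cost(x)
    return (c_children, "split") if c_children < c_here else (c_here, "keep")

const = np.ones(256)                      # energy concentrates under splitting
delta = np.zeros(256); delta[0] = 1.0     # a spike is already maximally sparse
decision_const = best_basis(const, 4)[1]
decision_delta = best_basis(delta, 4)[1]
print(decision_const, decision_delta)     # split keep
```

The pruning adapts to the signal: a flat signal is cheapest when fully decomposed, while a lone spike is best left in the time domain.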

6.3 2D DWT and Image Subbands

For 2D signals (images), the DWT is applied separably: first along rows, then along columns (or vice versa). One level of 2D DWT produces four subbands of size (H/2)×(W/2)(H/2) \times (W/2):

2D DWT SUBBANDS (one level)
========================================================================

  Input image (H x W)
  +-------------------------+
  |                         |
  |        f(x, y)          |
  |                         |
  +-------------------------+
              |
              v
  +------------------+------------------+
  | LL (H/2 x W/2)   | LH (H/2 x W/2)   |
  | low-pass rows    | low-pass rows    |
  | low-pass cols    | high-pass cols   |
  | (approximation)  | (hor. edges)     |
  +------------------+------------------+
  | HL (H/2 x W/2)   | HH (H/2 x W/2)   |
  | high-pass rows   | high-pass rows   |
  | low-pass cols    | high-pass cols   |
  | (vert. edges)    | (diag. edges)    |
  +------------------+------------------+

  LL = approximation (coarse image)
  LH = horizontal detail (horizontal edges)
  HL = vertical detail (vertical edges)
  HH = diagonal detail (diagonal edges)

========================================================================

Applying the 2D DWT recursively to the LL subband gives the standard pyramid decomposition used in image processing.
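One level of the separable transform is short to write out. A pure-numpy Haar sketch that checks the subband shapes and energy conservation (pywt.dwt2 is the library equivalent):

```python
import numpy as np

def haar1d(x, axis):
    # Orthonormal Haar analysis along one axis: low-pass and high-pass halves
    e = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    o = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (e + o) / np.sqrt(2), (e - o) / np.sqrt(2)

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))

L, H = haar1d(img, axis=0)        # filter along rows first
LL, LH = haar1d(L, axis=1)        # then along columns
HL, HH = haar1d(H, axis=1)

for band in (LL, LH, HL, HH):
    assert band.shape == (32, 32)

# The transform is orthogonal: total energy is preserved across subbands
energy = sum(np.sum(b**2) for b in (LL, LH, HL, HH))
assert np.isclose(energy, np.sum(img**2))
```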

For AI: This 2D decomposition is the backbone of Swin Transformer's patch merging, PVT's pyramid vision representation, and wavelet-enhanced U-Net architectures. The subbands naturally separate global structure (LL) from local edge features (LH, HL, HH).

6.4 JPEG 2000 and Image Compression

JPEG 2000 (2000) uses the CDF 9/7 biorthogonal wavelet (Le Gall 5/3 for lossless) to transform images before quantization and entropy coding. The key advantages over DCT-based JPEG:

  1. No blocking artifacts - the transform is applied to the whole image rather than to independent blocks, avoiding the 8×88 \times 8 block boundary artifacts of JPEG
  2. Scalable quality - truncating the bitstream at any point gives a valid (lower quality) image
  3. Region of Interest (ROI) coding - different quality levels for different image regions
  4. Better compression at high quality - 20-30% better than JPEG at visually lossless quality

The coding steps: 2D DWT (up to 5 levels) -> coefficient quantization -> EBCOT entropy coding (embedded block coding with optimal truncation).

Modern relevance: Wavelet-based compression principles inform:

  • Neural image codecs (Balle et al.): hyperprior model operates on wavelet-like feature maps
  • Wavelet-domain diffusion models (WaveDiff, 2023): diffusion operates on DWT coefficients rather than pixels, reducing sequence length by 4×4\times per level
  • VCT (Video Compression Transformer): uses learned transforms similar to wavelet filter banks

7. Time-Frequency Analysis

7.1 The Scalogram

The scalogram is the time-scale energy density:

Sf(a,b)=Wψf(a,b)2S_f(a, b) = |W_\psi f(a, b)|^2

It is the wavelet analogue of the spectrogram (squared magnitude of STFT). The scalogram shows how the energy of the signal is distributed across scales and times.

Reading the scalogram:

  • Horizontal axis: time bb
  • Vertical axis: scale aa (often plotted as frequency 1/a1/a, increasing downward = decreasing scale)
  • Color intensity: energy Wψf(a,b)2|W_\psi f(a,b)|^2
  • A vertical ridge indicates a transient (localized in time)
  • A horizontal ridge indicates a sustained oscillation (localized in frequency)

Edge effects (cone of influence). Near the boundaries of the signal, wavelet coefficients are unreliable because the wavelet ψa,b\psi_{a,b} extends outside the signal domain. The cone of influence at scale aa is the region of times affected by boundary effects: approximately those within a distance aσψa \cdot \sigma_\psi of an edge, where σψ\sigma_\psi is the support radius of ψ\psi - the affected region widens linearly with scale. Coefficients inside this edge-affected region should not be trusted.

Application example: For a chirp signal f(t)=cos(2πt2)f(t) = \cos(2\pi t^2) (instantaneous frequency linearly increasing with time), the scalogram shows a diagonal ridge tracing the time-frequency trajectory. The Fourier spectrum merely shows a broad band of frequencies with no temporal information.
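The chirp example can be reproduced with a hand-rolled CWT. This sketch uses a real Morlet-style wavelet and direct convolution; the choice w0=6w_0 = 6 and the scale grid are illustrative, not canonical:

```python
import numpy as np

def psi(t, w0=6.0):
    # Real Morlet-style wavelet (illustrative; ignores the small DC correction)
    return np.cos(w0 * t) * np.exp(-t**2 / 2)

def cwt(f, scales, w0=6.0):
    out = np.empty((len(scales), len(f)))
    for i, a in enumerate(scales):
        t = np.arange(-4 * a, 4 * a + 1)
        out[i] = np.convolve(f, psi(t / a, w0) / np.sqrt(a), mode="same")
    return out

t = np.arange(1024)
f = np.cos(2 * np.pi * 0.0001 * t**2)       # chirp: inst. frequency 0.0002*t

W = np.abs(cwt(f, np.arange(2, 64)))        # scalogram magnitude
early = np.argmax(W[:, 200])                # dominant scale index at t = 200
late = np.argmax(W[:, 800])                 # dominant scale index at t = 800
assert late < early                         # frequency rises, so scale falls
```

The ridge of `W` drifts toward smaller scales as time advances - exactly the diagonal trajectory described above, which the Fourier spectrum cannot show.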

7.2 Gabor Wavelets and the STFT Connection

Gabor wavelets (complex Morlet wavelets with exact zero mean) interpolate between the STFT and the CWT:

ψGabor(t)=1πσ(eiω0teσ2ω02/2)et2/(2σ2)\psi_\text{Gabor}(t) = \frac{1}{\sqrt{\pi\sigma}}\left(e^{i\omega_0 t} - e^{-\sigma^2\omega_0^2/2}\right)e^{-t^2/(2\sigma^2)}

The parameter σ\sigma controls the time-frequency trade-off:

  • Large σ\sigma: good frequency resolution (narrow band), poor time resolution
  • Small σ\sigma: good time resolution, poor frequency resolution

The STFT with a Gaussian window of width σ\sigma is identical to the CWT with Gabor wavelets at a fixed scale. The key difference: in the STFT, σ\sigma is fixed for all frequencies; in the CWT, σ\sigma scales with aa (constant-Q).

The STFT time-frequency tiling uses uniform rectangles: each tile has the same area ΔtΔξ=const\Delta t \cdot \Delta\xi = \text{const}.
The CWT uses logarithmic tiling: tiles are wider at low frequencies (long Δt\Delta t, small Δξ\Delta\xi) and narrow at high frequencies.

7.3 Synchrosqueezing Transform

The Synchrosqueezing Transform (SST) (Daubechies-Lu-Wu, 2011) sharpens the scalogram by reassigning energy to the instantaneous frequency curve, producing a "synchrosqueezed" time-frequency representation with much higher resolution.

For a signal with instantaneous frequency ω(t)\omega(t):

ω^(a,b)=bWψf(a,b)Wψf(a,b)12πi\hat{\omega}(a, b) = \frac{\partial_b W_\psi f(a, b)}{W_\psi f(a,b)} \cdot \frac{1}{2\pi i}

estimates the instantaneous frequency at (a,b)(a, b). The synchrosqueezed transform reassigns the energy from the scalogram at (a,b)(a, b) to the time-frequency point (b,ω^(a,b))(b, \hat{\omega}(a,b)). The result is a sharper, more interpretable time-frequency map.
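A minimal sketch of the phase-derivative estimator behind SST: compute a complex Morlet CWT of a pure tone at one matched scale, then take a finite difference of the unwrapped phase. The wavelet parameters and the matched-scale choice are illustrative assumptions:

```python
import numpy as np

def cwt_row(f, a, w0=6.0):
    # Complex Morlet CWT at a single scale, via direct convolution
    t = np.arange(-4 * a, 4 * a + 1)
    kernel = np.exp(1j * w0 * t / a - (t / a)**2 / 2) / np.sqrt(a)
    return np.convolve(f, kernel, mode="same")

nu = 0.05                                    # true frequency (cycles/sample)
t = np.arange(2048)
f = np.cos(2 * np.pi * nu * t)

a = 6.0 / (2 * np.pi * nu)                   # scale whose center freq matches nu
W = cwt_row(f, a)

# Discrete version of omega_hat = (d_b W / W) / (2 pi i): finite difference of
# the unwrapped phase, restricted to the interior (inside the usable region)
nu_hat = np.diff(np.unwrap(np.angle(W[512:1536]))) / (2 * np.pi)
assert abs(np.median(nu_hat) - nu) < 5e-3
```

Synchrosqueezing would then move the scalogram energy at (a,b)(a, b) to the estimated frequency ν^(b)\hat{\nu}(b), collapsing the smeared ridge onto a sharp curve.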

Application: SST is used to separate overlapping modes in multicomponent signals (e.g., ECG with P, QRS, T waves; audio with multiple instruments). It has become a standard alternative to empirical mode decomposition and is available in dedicated toolboxes (e.g., the ssqueezepy package in Python, wsst in MATLAB).


8. Applications in Machine Learning

8.1 Mallat Scattering Networks

The scattering transform (Mallat, 2012; Bruna-Mallat, 2013) builds provably stable, translation-invariant features by cascading wavelet transforms with modulus nonlinearities:

S_0 f = f * \phi_J \qquad \text{(zeroth order: low-pass average)}

S_1 f(\lambda_1, x) = |f * \psi_{\lambda_1}| * \phi_J(x) \qquad \text{(first order: averaged wavelet modulus)}

S_2 f(\lambda_1, \lambda_2, x) = ||f * \psi_{\lambda_1}| * \psi_{\lambda_2}| * \phi_J(x) \qquad \text{(second order)}

where ϕJ\phi_J is the low-pass filter at scale JJ, and ψλ\psi_\lambda ranges over wavelets at scales λ=2j\lambda = 2^j for j=1,,Jj = 1, \ldots, J.
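A toy first-order scattering transform makes the invariance concrete. This sketch uses circular FFT convolutions, dyadic Morlet filters, and a moving-average ϕJ\phi_J - all simplifications relative to a real implementation such as kymatio:

```python
import numpy as np

def circ_conv(x, k):
    # Circular convolution via FFT (keeps every stage shift-equivariant)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, len(x)))

def morlet(n, j, w0=6.0):
    a = 2.0**j
    t = np.arange(n) - n // 2
    return np.exp(1j * w0 * t / a - (t / a)**2 / 2) / np.sqrt(a)

def scatter1(x, J):
    # S1[j] = | x * psi_j | * phi_J, with phi_J a 2^J-wide moving average
    n = len(x)
    phi = np.ones(2**J) / 2**J
    feats = [np.real(circ_conv(np.abs(circ_conv(x, morlet(n, j))), phi))
             for j in range(1, J + 1)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
J = 5
S, S_shift = scatter1(x, J), scatter1(np.roll(x, 7), J)

rel = np.linalg.norm(S - S_shift) / np.linalg.norm(S)
rel_raw = np.linalg.norm(x - np.roll(x, 7)) / np.linalg.norm(x)
assert rel < 0.5 and rel < rel_raw   # features move far less than raw samples
```

A shift of 7 samples (well below 2J=322^J = 32) barely perturbs the scattering features, while the raw signal changes almost completely.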

Key theorems:

Theorem 8.1 (Translation invariance). Scattering coefficients are (approximately) invariant to translations of size 2J\leq 2^J.

Theorem 8.2 (Stability to deformations). If τ\tau is a C2C^2 diffeomorphism close to a translation (i.e., supτϵ1\sup |\nabla\tau| \leq \epsilon \ll 1), then:

SfS(fτ)Cϵf\|Sf - S(f \circ \tau)\| \leq C\,\epsilon\,\|f\|

This is the key guarantee that generic CNNs lack: without it, even an arbitrarily small rotation or stretch can change the features by an amount that is not controlled by the size of the deformation.

Scattering vs CNNs: Scattering networks have no learned parameters. Their features are competitive with shallow CNNs on MNIST and CIFAR-10. The architecture is:

Scattering Network Architecture
========================================================================

  Input f(x)
      |
      +-- f * phi_J ----------------------------> S_0  (global average)
      |
      +-- |f * psi_{j1}| * phi_J ---------------> S_1  (J terms)
      |
      +-- ||f * psi_{j1}| * psi_{j2}| * phi_J --> S_2  (O(J^2) terms)

  Total feature dimension: O(J^2)   (vs O(K^2 C) for CNN)
  Invariance: proven                (vs empirical for CNN)
  Stability:  proven                (vs empirical for CNN)

========================================================================

For AI: Scattering features fed to a simple linear classifier achieve ~87% on CIFAR-10. The same stability-by-design philosophy underlies SE(3)-equivariant networks, which process atomic coordinates in protein structure prediction and molecular modeling pipelines.

8.2 Wavelet-Based Attention

WaveBERT (2021) compresses long input sequences by applying wavelet transforms to token embeddings before attention, reducing sequence length by a factor of 2J2^J:

  1. Apply DWT to the token embedding sequence: e1,,eN(aJ,dJ,,d1)e_1, \ldots, e_N \to (a_J, d_J, \ldots, d_1)
  2. Run Transformer attention only on the approximation aJa_J (length N/2JN/2^J)
  3. Reconstruct full-resolution outputs via IDWT

This reduces the O(N2)O(N^2) attention complexity to O((N/2J)2)O((N/2^J)^2) - a factor of 4J4^J speedup.
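The pipeline can be sketched end to end in numpy. Here the detail branches are simply discarded so only step 2 is exercised; a real implementation would keep every djd_j for the IDWT reconstruction in step 3 (Haar filters used for simplicity):

```python
import numpy as np

def dwt_approx(E, J):
    # J levels of the Haar low-pass along the sequence axis; details d_1..d_J
    # are dropped in this sketch (step 3 would need them to invert)
    a = E
    for _ in range(J):
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
    return a

def attention(Q, K, V):
    # Plain softmax attention
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
N, d, J = 256, 16, 2
E = rng.standard_normal((N, d))           # token embedding sequence

A = dwt_approx(E, J)                      # length N / 2^J = 64
out = attention(A, A, A)
assert A.shape == (N // 2**J, d)
# Score matrix shrinks from N^2 = 65536 to (N/2^J)^2 = 4096 entries: 16x fewer
```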

Wavelet Attention (2023, various groups) applies 2D DWT to attention logits:

  • Compute attention score matrix S=QK/dS = QK^\top/\sqrt{d} of shape (L×L)(L \times L)
  • Apply 2D DWT to SS -> sparse representation (most attention is low-frequency)
  • Threshold small wavelet coefficients (attention is concentrated on main diagonals)
  • Reconstruct and apply softmax

This achieves near-full-attention quality with sparse coefficient computation, similar to the linear attention approximations but with theoretical backing from wavelet compression theory.

WaveMix (2022) replaces the attention mechanism entirely with 2D wavelet mixing, applying the DWT across spatial dimensions in vision tasks. The model achieves competitive accuracy on ImageNet with significantly fewer parameters than ViT.

8.3 Multiresolution in Vision Transformers

The standard Vision Transformer (ViT) processes all patches at a single scale with quadratic attention. Hierarchical architectures introduce the wavelet-like multi-scale structure:

Swin Transformer (2021): Applies attention within local windows and shifts them across layers. The patch merging operation (concatenating 2×22 \times 2 patches and projecting) is equivalent to a stride-2 convolution - structurally analogous to the DWT low-pass filter + downsampling step.

PVT (Pyramid Vision Transformer, 2021): Explicitly uses 4 stages with spatial reduction ratios {1,1/2,1/4,1/8}\{1, 1/2, 1/4, 1/8\} - matching the wavelet octave decomposition exactly.

FPN (Feature Pyramid Network, 2017): Multi-scale feature maps combined via lateral connections - the convolutional analogue of the wavelet synthesis filter bank.

Common pattern: Resolution halved, channels doubled at each level. This is exactly the wavelet dyadic tree structure, with learned filters replacing fixed wavelet filters.
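The analogy is exact in a toy case: the Haar LL subband of an image equals 2x2 average pooling up to a scalar, which is the simplest instance of "patch merging = low-pass + downsample":

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))

# Haar LL subband: orthonormal low-pass + downsample along both axes
row_lo = (img[0::2] + img[1::2]) / np.sqrt(2)
LL = (row_lo[:, 0::2] + row_lo[:, 1::2]) / np.sqrt(2)

# 2x2 average pooling (a stride-2 "patch merging" with an averaging projection)
pool = (img[0::2, 0::2] + img[0::2, 1::2]
        + img[1::2, 0::2] + img[1::2, 1::2]) / 4

assert np.allclose(LL, 2 * pool)   # identical up to the orthonormal factor of 2
```

Learned patch-merging projections generalize this fixed averaging filter, just as learned conv filters generalize fixed wavelet filters.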

8.4 Wavelet Regularization in Diffusion Models

WaveDiff (2023) applies the diffusion process in the wavelet domain rather than pixel space:

  1. Apply 2D DWT to images: x(xLL,xLH,xHL,xHH)x \to (x_{LL}, x_{LH}, x_{HL}, x_{HH})
  2. Apply Gaussian noise only to high-frequency subbands (xLH,xHL,xHHx_{LH}, x_{HL}, x_{HH}) - preserving the global structure in xLLx_{LL}
  3. Train the denoising network on the (much smaller) wavelet coefficients

Benefits:

  • Sequence length reduced by 4×4\times per level -> attention is 16×16\times cheaper
  • High-frequency details are easier to model when separated from low-frequency structure
  • Generation quality is preserved since wavelets are invertible

Wavelet-based training regularization: The spectral bias phenomenon (neural networks learn low frequencies first) can be accelerated by decomposing loss functions in the wavelet domain. Penalizing different scales at different rates speeds up convergence - equivalent to preconditioning by the wavelet spectrum.

8.5 Wavelet Denoising: Donoho-Johnstone

The Donoho-Johnstone framework (1994) provides a near-optimal, computationally efficient signal denoising procedure based on wavelet coefficient thresholding:

Setup. Observe yn=fn+ϵny_n = f_n + \epsilon_n where ff is the true signal and ϵnN(0,σ2)\epsilon_n \sim \mathcal{N}(0, \sigma^2) is Gaussian noise.

Algorithm:

  1. Compute DWT of yy: coefficients {dj,k}\{d_{j,k}\}
  2. Estimate noise level: σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|) / 0.6745 (using finest-scale detail coefficients)
  3. Set universal threshold: λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}
  4. Apply thresholding to all detail coefficients:
    • Soft: d^j,k=sign(dj,k)max(dj,kλ,0)\hat{d}_{j,k} = \text{sign}(d_{j,k})\max(|d_{j,k}| - \lambda, 0)
    • Hard: d^j,k=dj,k1[dj,k>λ]\hat{d}_{j,k} = d_{j,k} \cdot \mathbf{1}[|d_{j,k}| > \lambda]
  5. Reconstruct via IDWT of {aJ,d^J,,d^1}\{a_J, \hat{d}_J, \ldots, \hat{d}_1\}

Why does this work? Sparse coding: a smooth signal ff has few large wavelet coefficients; noise ϵ\epsilon spreads uniformly across all coefficients. By zeroing small coefficients (which are dominated by noise) and preserving large coefficients (dominated by signal), we efficiently remove noise.
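The five steps fit in a short numpy script. This sketch uses Haar filters and a low-frequency test signal; pywt would supply smoother wavelets and boundary handling:

```python
import numpy as np

def dwt(x, J):
    a, details = x, []
    for _ in range(J):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        details.append(d)            # details = [d1 (finest), ..., dJ]
    return a, details

def idwt(a, details):
    for d in reversed(details):      # rebuild from the coarsest level up
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2)
        up[1::2] = (a - d) / np.sqrt(2)
        a = up
    return a

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)
clean = np.sin(2 * np.pi * 4 * t / n)
noisy = clean + 0.5 * rng.standard_normal(n)

a, details = dwt(noisy, J=5)
sigma = np.median(np.abs(details[0])) / 0.6745        # MAD from finest scale
lam = sigma * np.sqrt(2 * np.log(n))                  # universal threshold
soft = [np.sign(d) * np.maximum(np.abs(d) - lam, 0) for d in details]
denoised = idwt(a, soft)

mse_before = np.mean((noisy - clean)**2)
mse_after = np.mean((denoised - clean)**2)
assert mse_after < mse_before
```

The slowly varying sine lives almost entirely in the approximation band, so thresholding the details removes most of the noise while leaving the signal intact.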

Theorem 8.3 (Near-optimality, Donoho-Johnstone 1994). The soft-threshold estimator f^\hat{f} satisfies:

E\|\hat{f} - f\|^2 \leq \left(2\log N + 1\right)\left(\sigma^2 + \sum_{n=1}^{N} \min(f_n^2, \sigma^2)\right)

This is within a 2logN2\log N factor of the minimax optimal risk - essentially unimprovable without additional assumptions.

For AI: Wavelet thresholding connects to:

  • LASSO (1\ell^1 penalty) in the wavelet domain: soft thresholding is the proximal operator for 1\ell^1
  • Sparse autoencoders (SAE): identifying sparse representations of neural network activations
  • Gradient compression in distributed training: transmit only large-magnitude gradient components (analogous to hard thresholding)
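The first connection can be checked numerically: the proximal objective is separable per coordinate, so a 1D grid search should land on the soft-threshold value (grid resolution is an arbitrary choice):

```python
import numpy as np

def soft(v, lam):
    # Closed-form soft threshold = prox of lam * |.|
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0)

# prox_{lam*||.||_1}(v) = argmin_x 0.5*(x - v)^2 + lam*|x|, per coordinate
lam, v = 0.7, 1.3
grid = np.linspace(-3, 3, 120001)
obj = 0.5 * (grid - v)**2 + lam * np.abs(grid)
assert abs(grid[np.argmin(obj)] - soft(v, lam)) < 1e-3   # both give ~0.6
```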

9. Common Mistakes

  1. Applying DWT without specifying boundary conditions.
     Why it's wrong: boundary effects corrupt coefficients near signal edges.
     Fix: use mode='periodization' for circular extension, 'symmetric' for mirror padding; always check the cone of influence.

  2. Using Haar wavelets for smooth signals and expecting good compression.
     Why it's wrong: Haar has only 1 vanishing moment; smooth functions need wavelets with many vanishing moments.
     Fix: choose db4+ or sym4+ for signals with continuous derivatives.

  3. Confusing scale and frequency.
     Why it's wrong: in the CWT, small scale aa = high frequency and large aa = low frequency; axes are often plotted reversed.
     Fix: always label axes explicitly; clarify whether the vertical axis is "scale" or "frequency".

  4. Using the CWT for discrete signals without discretization.
     Why it's wrong: the CWT is defined for continuous fL2(R)f \in L^2(\mathbb{R}); applying it naively to sampled data gives aliasing.
     Fix: use the DWT (Mallat algorithm) for discrete signals, or the pseudo-CWT with a=2ja = 2^j sampling.

  5. Thinking the DWT is redundant (like the STFT).
     Why it's wrong: the DWT has exactly NN coefficients for NN inputs - it is non-redundant.
     Fix: the redundant version (undecimated DWT, "à trous") is useful for denoising but costs O(NlogN)O(N\log N) storage.

  6. Forgetting the 1/a1/\sqrt{a} normalization in the CWT.
     Why it's wrong: without it, ψa,b\|\psi_{a,b}\| varies with scale, breaking energy conservation.
     Fix: always use ψa,b(t)=a1/2ψ((tb)/a)\psi_{a,b}(t) = a^{-1/2}\psi((t-b)/a) for L2L^2 normalization.

  7. Applying only one level of DWT and calling it "multi-scale".
     Why it's wrong: a single-level DWT splits into only 2 bands; multi-scale analysis requires J3J \geq 3 levels.
     Fix: apply the DWT recursively to the approximation branch for J=log2N3J = \lfloor\log_2 N\rfloor - 3 levels.

  8. Selecting the threshold λ\lambda globally (same for all scales).
     Why it's wrong: noise variance varies across scales for colored noise; a global threshold over- or under-smooths.
     Fix: use a scale-adaptive threshold λj=σ^j2logNj\lambda_j = \hat{\sigma}_j\sqrt{2\log N_j} where NjN_j is the number of coefficients at level jj.

  9. Assuming wavelet coefficients are Gaussian whenever the noise is Gaussian.
     Why it's wrong: coefficients of Gaussian noise are i.i.d. Gaussian only for orthogonal wavelets; correlated noise produces correlated, scale-dependent coefficients.
     Fix: verify the distribution; use non-parametric or robust thresholding for colored noise.

  10. Conflating the DWT with the FFT as a frequency decomposition.
      Why it's wrong: the DWT gives octave-band (logarithmic) frequency decomposition; the FFT gives uniform frequency decomposition.
      Fix: use the DWT when you want multi-scale (octave) analysis; use the FFT when you want uniform frequency resolution.

10. Exercises

Exercise 1 * - Haar DWT by Hand

Given the signal f=[4,6,10,12,8,6,5,5]f = [4, 6, 10, 12, 8, 6, 5, 5]:

(a) Compute one level of Haar DWT by hand: approximation coefficients a(1)\mathbf{a}^{(1)} and detail coefficients d(1)\mathbf{d}^{(1)}. (Recall: ak=(f2k+f2k+1)/2a_k = (f_{2k} + f_{2k+1})/\sqrt{2}, dk=(f2kf2k+1)/2d_k = (f_{2k} - f_{2k+1})/\sqrt{2}.)

(b) Apply one more level to a(1)\mathbf{a}^{(1)} to get a(2)\mathbf{a}^{(2)} and d(2)\mathbf{d}^{(2)}.

(c) Verify that f2=a(2)2+d(2)2+d(1)2\|\mathbf{f}\|^2 = \|\mathbf{a}^{(2)}\|^2 + \|\mathbf{d}^{(2)}\|^2 + \|\mathbf{d}^{(1)}\|^2 (Parseval).

(d) Reconstruct f\mathbf{f} from the wavelet coefficients {a(2),d(2),d(1)}\{\mathbf{a}^{(2)}, \mathbf{d}^{(2)}, \mathbf{d}^{(1)}\} by reversing the steps. Verify exact reconstruction.


Exercise 2 * - Admissibility and Zero Mean

(a) Verify that the Haar wavelet ψ=1[0,1/2)1[1/2,1)\psi = \mathbf{1}_{[0,1/2)} - \mathbf{1}_{[1/2,1)} has zero mean: ψ(t)dt=0\int_{-\infty}^\infty \psi(t)\,dt = 0.

(b) Compute ψ^(ξ)\hat{\psi}(\xi) for the Haar wavelet explicitly. Show ψ^(0)=0\hat{\psi}(0) = 0.

(c) A student proposes ψ(t)=et2/2\psi(t) = e^{-t^2/2} (Gaussian) as a wavelet. Show this fails the admissibility condition.

(d) Compute the admissibility constant Cψ=0ψ^(ξ)2/ξdξC_\psi = \int_0^\infty |\hat{\psi}(\xi)|^2/\xi\,d\xi for the Mexican Hat wavelet ψ^(ξ)=2πξ2eξ2/2\hat{\psi}(\xi) = \sqrt{2\pi}\,\xi^2\, e^{-\xi^2/2}. Is it finite?


Exercise 3 * - MRA Axiom Verification

Verify that the Shannon (sinc) MRA satisfies all five MRA axioms:

  • Vj={fL2(R):supp(f^)[2jπ,2jπ]}V_j = \{f \in L^2(\mathbb{R}) : \text{supp}(\hat{f}) \subseteq [-2^{-j}\pi, 2^{-j}\pi]\}
  • ϕ(t)=sinc(πt)=sin(πt)/(πt)\phi(t) = \text{sinc}(\pi t) = \sin(\pi t)/(\pi t)

(a) Prove the nesting property V1V0V_1 \subset V_0 directly from the definition.

(b) Prove density: show that for any fL2(R)f \in L^2(\mathbb{R}) and ϵ>0\epsilon > 0, there exists jj and gVjg \in V_j with fg<ϵ\|f - g\| < \epsilon.

(c) Verify that {ϕ(tk)}kZ\{\phi(t-k)\}_{k\in\mathbb{Z}} is orthonormal using Parseval's theorem.

(d) Write down the scaling function ϕ\phi in the frequency domain. What is the two-scale relation ϕ^(ξ)=H(ξ/2)ϕ^(ξ/2)\hat{\phi}(\xi) = H(\xi/2)\hat{\phi}(\xi/2)? What is H(ξ)H(\xi)?


Exercise 4 ** - Mallat Algorithm Implementation

Implement the complete Mallat DWT (analysis) and IDWT (synthesis) algorithms from scratch using only numpy.

(a) Implement haar_dwt(x, J) that applies JJ levels of Haar DWT to 1D signal x. Return the coefficient vector [aJ,dJ,dJ1,,d1][a_J, d_J, d_{J-1}, \ldots, d_1] of total length len(x).

(b) Implement haar_idwt(coeffs, J) that reconstructs the signal from the coefficients. Verify IDWT(DWT(x))=x\text{IDWT}(\text{DWT}(x)) = x with max error <1010< 10^{-10}.

(c) Apply your DWT to a noisy signal y[n]=sin(2π0.1n)+0.3ε[n]y[n] = \sin(2\pi \cdot 0.1 \cdot n) + 0.3\varepsilon[n] for n=0,,255n=0,\ldots,255, where εN(0,1)\varepsilon \sim \mathcal{N}(0,1). Apply soft thresholding with λ=0.5\lambda = 0.5 to all detail coefficients. Reconstruct and compute SNR improvement.

(d) Benchmark your implementation vs pywt.wavedec for N=214N = 2^{14} and 4 levels. Compare execution times and verify the outputs agree to numerical precision.


Exercise 5 ** - Daubechies Filter Verification

(a) Given the db2 filter h=[(1+3)/(42),(3+3)/(42),(33)/(42),(13)/(42)]h = [(1+\sqrt{3})/(4\sqrt{2}),\, (3+\sqrt{3})/(4\sqrt{2}),\, (3-\sqrt{3})/(4\sqrt{2}),\, (1-\sqrt{3})/(4\sqrt{2})], verify:

  • khk=2\sum_k h_k = \sqrt{2} (normalization)
  • khk2=1\sum_k h_k^2 = 1 (unit energy)
  • khkhk2=0\sum_k h_k h_{k-2} = 0 (orthogonality shift by 2)

(b) Compute the QMF partner gk=(1)kh1kg_k = (-1)^k h_{1-k} for db2. Verify H(ξ)2+G(ξ)2=1|H(\xi)|^2 + |G(\xi)|^2 = 1 numerically for 100 values of ξ[0,1]\xi \in [0, 1].

(c) Verify db2 has exactly 2 vanishing moments: compute k(1)kkmh1k\sum_k (-1)^k k^m h_{1-k} for m=0m=0 and m=1m=1 (should both be zero) and m=2m=2 (should be nonzero).

(d) Construct the db2 scaling function ϕ\phi by iterating the cascade algorithm: start from ϕ(0)=1[0,1]\phi^{(0)} = \mathbf{1}_{[0,1]} and apply ϕ(n+1)(t)=2khkϕ(n)(2tk)\phi^{(n+1)}(t) = \sqrt{2}\sum_k h_k\,\phi^{(n)}(2t-k) for 6 iterations. Plot the result.


Exercise 6 ** - 2D DWT and Image Compression

(a) Load or generate a 256×256256 \times 256 grayscale test image (use scipy.datasets.face(gray=True) or a numpy-generated pattern). Apply 3 levels of 2D Haar DWT using pywt.wavedec2.

(b) Visualize the 3-level wavelet decomposition, labeling all subbands (LL3, LH3, HL3, HH3, LH2, HL2, HH2, LH1, HL1, HH1).

(c) Implement a compression experiment: keep only the top k%k\% of wavelet coefficients (by magnitude), zero the rest, and reconstruct. Plot PSNR vs kk for k{1,2,5,10,20,50}k \in \{1, 2, 5, 10, 20, 50\}.

(d) Compare with JPEG-style DCT compression: apply 2D DCT to 8×88\times 8 blocks and keep only the top k%k\% of DCT coefficients. Plot PSNR vs kk for both methods on the same axes. At what compression ratio does DWT outperform DCT?


Exercise 7 *** - Scattering Transform Features

(a) Implement a simple 1D scattering transform (order 1 and 2) using pywt:

  • Zeroth order: low-pass filter the input
  • First order: S1[f,j1]=DWTjfϕJS_1[f, j_1] = |\text{DWT}_j f| * \phi_J
  • Second order: S2[f,j1,j2]=DWTj1fψj2ϕJS_2[f, j_1, j_2] = ||\text{DWT}_{j_1} f| * \psi_{j_2}| * \phi_J

(b) Generate signals from two classes: class A = sin(2π0.1t)+0.1ε\sin(2\pi \cdot 0.1 t) + 0.1\varepsilon; class B = sin(2π0.3t)+0.1ε\sin(2\pi \cdot 0.3 t) + 0.1\varepsilon. Compute scattering features and verify they are approximately translation-invariant: shift the signal by kk samples and check that SfS(Tkf)/Sf<0.01\|Sf - S(T_k f)\| / \|Sf\| < 0.01 for k2Jk \leq 2^J.

(c) Test stability to deformation: apply a small time-stretching ff(1.05t)f \to f(1.05t) and measure SfS(fτ)/Sf\|Sf - S(f \circ \tau)\| / \|Sf\|. Compare with the same test for raw Fourier features.

(d) Train a linear classifier on scattering features vs raw FFT features for a 4-class signal classification problem. Compare accuracy and number of features needed.


Exercise 8 *** - Wavelet Denoising and Sparse Coding

(a) Implement Donoho-Johnstone wavelet denoising with:

  • Universal threshold λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}
  • Both soft and hard thresholding
  • Noise level estimation via MAD: σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|)/0.6745

(b) Test on f[n]=2sin(2π0.1n)+sin(2π0.35n)f[n] = 2\sin(2\pi \cdot 0.1 n) + \sin(2\pi \cdot 0.35 n) with N=512N=512, noise levels σ{0.1,0.5,1.0,2.0}\sigma \in \{0.1, 0.5, 1.0, 2.0\}. Plot SNR improvement vs σ\sigma for soft vs hard thresholding.

(c) Show the connection to LASSO: prove that soft thresholding is the proximal operator of the 1\ell^1 norm: proxλ1(v)i=sign(vi)max(viλ,0)\text{prox}_{\lambda\|\cdot\|_1}(v)_i = \text{sign}(v_i)\max(|v_i|-\lambda, 0).

(d) Implement a sparse coding experiment: represent a signal as a sparse linear combination of Daubechies db4 wavelets at 3 scales. Use LASSO regression (sklearn.linear_model.Lasso) to find the sparse coefficients. Compare reconstruction quality (PSNR) vs number of non-zero coefficients to direct DWT thresholding.


11. Why This Matters for AI (2026 Perspective)

  • Convolutional neural networks - each conv layer is a learned filter bank; the DWT is the fixed-filter special case. Impact: explains why CNNs learn edge detectors and Gabor-like filters at lower layers.

  • Swin Transformer patch merging - structurally identical to the stride-2 DWT low-pass step: concatenate 2×22 \times 2 patches, then linear projection. Impact: multi-scale attention is the learned version of multi-level DWT.

  • ViT + FPN / feature pyramids - multi-resolution feature maps follow the wavelet octave structure exactly. Impact: architecture design (number of levels, channel doubling at each scale).

  • Speech: Whisper / wav2vec 2.0 - the mel spectrogram is an STFT with mel-warped frequencies, approximating wavelet constant-Q analysis. Impact: replacing the STFT with a CWT could improve speech models; wavelet features are more robust to pitch shift.

  • Diffusion models (WaveDiff) - separate noise diffusion of LL vs LH/HL/HH subbands; spatial size reduced 4×4\times per level. Impact: lower sampling cost, better preservation of global structure, faster inference.

  • Scattering networks - CNN-equivalent features with zero learned parameters; provably stable to deformations. Impact: robustness certification, data-efficient learning, group-equivariant architectures.

  • Sparse autoencoders (SAE) - mechanistic interpretability via sparse wavelet-like features of activations. Impact: soft thresholding = LASSO proximal operator; the same math as wavelet denoising.

  • Gradient compression - transmitting top-kk gradient values = hard thresholding in parameter space. Impact: wavelet-domain gradients can be compressed per scale/location, not per parameter.

  • Long-range Transformers (WaveBERT) - compress the token sequence by 2J2^J via DWT before attention. Impact: O((N/2J)2)O((N/2^J)^2) attention complexity; quality preserved via wavelet inversion.

  • Protein structure (SE(3)-nets) - scattering-inspired SE(3)-equivariant networks process atomic coordinates. Impact: geometry processing in structure-prediction pipelines; antibody design; molecular dynamics.

  • JPEG 2000 / neural codecs - CDF 9/7 biorthogonal wavelet; hyperprior in learned image compression. Impact: variable-rate coding; ROI coding; streaming and progressive decode.

  • Spectral bias in neural networks - networks learn low-frequency components first (captured by coarse wavelet scales). Impact: curriculum learning; coarse-to-fine progressive training strategies.

12. Conceptual Bridge

Looking Backward

This section sits at the summit of the Fourier Analysis chapter. Everything built in Section 20-01 through Section 20-04 contributes:

From Section 20-01 (Fourier Series): The idea of decomposing a signal into orthogonal basis functions is generalized: instead of sinusoids with fixed global support, we use wavelets with variable-scale local support. The Parseval identity from Section 20-01 reappears as the energy conservation property of the wavelet transform.

From Section 20-02 (Fourier Transform): The uncertainty principle ΔtΔξ1/(4π)\Delta t \cdot \Delta\xi \geq 1/(4\pi) is the fundamental constraint that motivates wavelets. The continuous Fourier transform is the "full-global" limit; the CWT interpolates between full-time and full-frequency localization. Plancherel's theorem underlies both.

From Section 20-03 (DFT and FFT): The Mallat algorithm uses convolution + downsampling - precisely the operations made efficient by the FFT in Section 20-03. The DWT at each level costs O(N)O(N) via direct convolution (filter is short); a CWT discretization would cost O(NlogN)O(N\log N) via FFT.

From Section 20-04 (Convolution Theorem): The filter bank conditions (QMF, perfect reconstruction) are applications of the convolution algebra from Section 20-04. The LTI system framework identifies each wavelet scale as a bandpass filter; the synthesis bank cancels aliasing via the Convolution Theorem's frequency-domain perspective.

Looking Forward

Statistical Learning Theory (Section 21): Wavelet regularity classes (Besov spaces, Sobolev spaces) are the natural function classes for non-parametric estimation. Minimax rates of convergence depend on the wavelet regularity of the target function.

Functional Analysis (Section 12): The MRA axioms live in L2(R)L^2(\mathbb{R}) - a Hilbert space (Section 12-02). Riesz bases, frames, and Hilbert space geometry underpin the infinite-dimensional wavelet theory. The scattering transform connects to group representations (Section 12-04 preview).

Differential Geometry (Section 25): Wavelets generalize to manifolds via harmonic analysis on Riemannian spaces. Spectral graph wavelets (Section 11-04) extend the MRA framework to graphs. Heat kernel wavelets define multi-scale decompositions on curved spaces - used in geometric deep learning.

WAVELET SECTION IN CURRICULUM
=======================================================================

  Section 20-01 Fourier Series          Section 12-02 Hilbert Spaces
     | orthogonal basis concept            | L^2 theory
     v                                     v
  Section 20-02 Fourier Transform       Section 20-03 DFT and FFT
     | uncertainty principle               | filter banks, O(N log N)
     v                                     v
  Section 20-04 Convolution Theorem
     | QMF, LTI systems
     v
  +======================================+
  |  Section 20-05 WAVELETS              |
  |  ----------------------------------- |
  |  CWT, MRA, Mallat O(N) algorithm     |
  |  Daubechies, QMF, perfect recon.     |
  |  2D DWT, JPEG 2000, scattering       |
  |  Denoising, WaveDiff, WaveBERT       |
  +======================================+
          |                     |
          v                     v
  Section 21 Statistical     Section 11-04 Spectral
  Learning Theory            Graph Theory
  (Besov spaces,             (graph wavelets,
   minimax rates)             GCN, ChebNet)
          |                     |
          v                     v
  Section 25 Differential    Section 12-04 Group
  Geometry                   Representations
  (manifold wavelets)        (scattering theory)

=======================================================================

<- Back to Fourier Analysis | Previous: Convolution Theorem <- | Next Chapter: Statistical Learning Theory ->


Appendix A: Biorthogonal Wavelets and JPEG 2000

A.1 Motivation for Biorthogonal Wavelets

Daubechies' theorem establishes an uncomfortable trade-off for orthonormal wavelets: apart from the Haar wavelet, a compactly supported orthonormal wavelet cannot be symmetric. This is a fundamental obstruction - symmetry of the (real) filter requires the linear-phase condition H(\xi) = e^{-2\pi i K\xi}\overline{H(\xi)} for some integer KK, which is incompatible with compact support and orthonormality except in the Haar case.

Symmetry is highly desirable in image processing: symmetric filters do not introduce phase distortion, and symmetric boundary extension (mirror padding) is more natural than periodic extension. For this reason, biorthogonal wavelets (Cohen-Daubechies-Feauveau, 1992) replace the single orthonormal filter bank with a pair of filter banks:

  • Analysis filters: {h,g}\{h, g\} - used in the forward DWT
  • Synthesis filters: {h~,g~}\{\tilde{h}, \tilde{g}\} - used in the inverse DWT

where the analysis and synthesis wavelets ψ\psi and ψ~\tilde{\psi} satisfy the biorthogonality condition:

ψj,k,ψ~j,k=δjjδkk\langle \psi_{j,k}, \tilde{\psi}_{j',k'}\rangle = \delta_{jj'}\delta_{kk'}

but \langle \psi_{j,k}, \psi_{j,k'}\rangle \neq 0 in general for kkk \neq k' - each family is not orthogonal to its own translates; orthogonality holds only between the analysis and synthesis families.

A.2 The CDF 9/7 Wavelet

The Cohen-Daubechies-Feauveau 9/7 wavelet (CDF 9/7) is the biorthogonal wavelet used in JPEG 2000 for lossy compression. It has:

  • Analysis filter length: 9 taps (odd, therefore symmetric)
  • Synthesis filter length: 7 taps (odd, symmetric)
  • 4 vanishing moments (analysis) / 4 vanishing moments (synthesis)
  • Linear phase (symmetric filter -> zero phase distortion)

The 9/7 analysis lowpass filter coefficients are:

  k      h_k
  0       0.6029490182363579
  ±1      0.2668641184428723
  ±2     -0.07822326652898785
  ±3     -0.01686411844287495
  ±4      0.02674875741080976

The filter is symmetric: hk=hkh_k = h_{-k}. The corresponding synthesis filter has 7 nonzero coefficients.

Why 9/7? The choice balances:

  • Compression efficiency (4 vanishing moments approx good polynomial annihilation)
  • Visual quality (smooth synthesis wavelet approx no ringing artifacts)
  • Computational cost (9 taps = fast, especially in hardware)
  • Symmetry (eliminates the Daubechies asymmetry problem)

For lossless JPEG 2000, the Le Gall 5/3 biorthogonal wavelet is used (integer lifting steps, exact arithmetic).
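As a concrete illustration of the reversible 5/3 transform, here is a minimal NumPy sketch of one level of the Le Gall 5/3 integer lifting. The function names are illustrative, and periodic boundary handling is used for simplicity (JPEG 2000 itself uses symmetric extension):

```python
import numpy as np

def legall53_forward(x):
    """One level of the Le Gall 5/3 integer DWT via lifting (periodic boundary)."""
    e, o = x[0::2].copy(), x[1::2].copy()     # split into even / odd samples
    # Predict: detail = odd - floor-average of neighboring evens
    d = o - (e + np.roll(e, -1)) // 2
    # Update: approximation = even + rounded quarter-sum of neighboring details
    a = e + (np.roll(d, 1) + d + 2) // 4
    return a, d

def legall53_inverse(a, d):
    """Exact integer inverse: undo the update, then the predict, then interleave."""
    e = a - (np.roll(d, 1) + d + 2) // 4
    o = d + (e + np.roll(e, -1)) // 2
    x = np.empty(2 * e.size, dtype=a.dtype)
    x[0::2], x[1::2] = e, o
    return x
```

Because the inverse subtracts exactly the integer quantities the forward pass added, reconstruction is bit-exact despite the rounding - this is what makes lossless coding possible.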

A.3 Lifting Scheme

The lifting scheme (Sweldens, 1995) provides an efficient factorization of any wavelet filter bank into elementary "lifting steps" - pairs of predict and update operations:

  1. Split: divide the signal into even (en=x2ne_n = x_{2n}) and odd (on=x2n+1o_n = x_{2n+1}) samples
  2. Predict: update odd samples to reduce prediction error: ononP(e)no_n \leftarrow o_n - P(e)_n
  3. Update: update even samples to maintain global statistics: enen+U(o)ne_n \leftarrow e_n + U(o)_n
  4. Scale: normalize: enen/Ke_n \leftarrow e_n/K, ononKo_n \leftarrow o_n \cdot K

The lifting scheme:

  • Requires exactly NN multiplications and NN additions for NN inputs (half the cost of the naive filter bank)
  • Allows in-place computation (no extra memory)
  • Supports the integer DWT (lossless coding) by rounding at each step
  • Makes hardware implementation trivial (pipeline two lifting steps)

The Haar DWT is the simplest lifting scheme: Predict step ononeno_n \leftarrow o_n - e_n, Update step enen+on/2e_n \leftarrow e_n + o_n/2.
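The four steps above can be sketched directly in NumPy. This is an illustrative implementation for even-length inputs (function name and normalization choice are mine, with K=1/2K = 1/\sqrt{2} in the text's notation):

```python
import numpy as np

def haar_lifting(x):
    """One level of the Haar DWT via lifting: split, predict, update, scale."""
    e, o = x[0::2].astype(float), x[1::2].astype(float)  # 1. split
    o = o - e                      # 2. predict: detail = odd - even
    e = e + o / 2                  # 3. update: e becomes the pairwise average
    K = np.sqrt(2)
    return e * K, o / K            # 4. scale to the orthonormal convention
```

The result is the orthonormal Haar pair a_k = (x_{2k} + x_{2k+1})/\sqrt{2}, d_k = (x_{2k+1} - x_{2k})/\sqrt{2}, so energy is preserved (Parseval).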


Appendix B: Frames and Redundant Representations

B.1 Wavelet Frames

When wavelet functions do not form an exact orthonormal basis but still allow signal reconstruction, they form a frame. A frame {ψj,k}\{\psi_{j,k}\} satisfies:

Af2j,kf,ψj,k2Bf2A\|f\|^2 \leq \sum_{j,k} |\langle f, \psi_{j,k}\rangle|^2 \leq B\|f\|^2

for constants 0<AB<0 < A \leq B < \infty called frame bounds. If A=BA = B, the frame is tight (uniform stability in all directions). Tight frames satisfy f=1Aj,kf,ψj,kψj,kf = \frac{1}{A}\sum_{j,k}\langle f, \psi_{j,k}\rangle \psi_{j,k} (simple inversion formula, same as orthonormal basis but with redundancy B/A1B/A \geq 1).
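A finite-dimensional toy example makes the tight-frame identity concrete (not from the text, but a standard illustration): three unit vectors at 120° in R2\mathbb{R}^2 - the "Mercedes-Benz" frame - form a tight frame with A=B=3/2A = B = 3/2:

```python
import numpy as np

# Three unit vectors at 120-degree spacing: a tight frame for R^2 with A = B = 3/2
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
frame = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

def frame_energy(f):
    """Sum of squared frame coefficients sum_k |<f, phi_k>|^2."""
    return np.sum((frame @ f) ** 2)

f = np.random.randn(2)
ratio = frame_energy(f) / np.dot(f, f)   # equals A = 1.5 for every nonzero f
```

The redundancy here is 3 vectors for a 2-dimensional space, yet the reconstruction formula is as simple as for an orthonormal basis, up to the factor 1/A1/A.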

Why frames? Overcomplete (redundant) representations have:

  • Better numerical stability (more "angles" to represent the signal)
  • Invariance properties (e.g., undecimated DWT = shift-invariant wavelet frame)
  • Flexibility in design (no restriction to dyadic sampling)

Continuous wavelet frames. The CWT with arbitrary discretization of (a,b)(a, b) forms a frame as long as the sampling is dense enough. For dyadic sampling a=2ja = 2^j, b=k2jδb = k \cdot 2^j \cdot \delta with δ>0\delta > 0 small enough, the CWT forms a frame.

B.2 Undecimated DWT (Stationary Wavelet Transform)

The undecimated DWT (also called the shift-invariant DWT or "à trous" algorithm) removes the 2\downarrow 2 downsampling step in the Mallat algorithm:

Analysis step (undecimated):
  a^{(j+1)}[n] = \sum_k h_k\, a^{(j)}[n - 2^j k]   (convolution with the filter upsampled by 2^j)
  d^{(j+1)}[n] = \sum_k g_k\, a^{(j)}[n - 2^j k]

The output at each level has the same length NN as the input. Total output size: O(NJ)O(N \cdot J) (redundant by factor JJ).

Properties:

  • Shift-invariant: UDWT(Tkf)=Tk(UDWT(f))\text{UDWT}(T_k f) = T_k(\text{UDWT}(f)) - unlike the downsampled DWT where shifts cause coefficient permutations
  • Better denoising: Averaging over all circular shifts of the DWT (cycle-spinning) gives the undecimated DWT implicitly; this averages out the artifacts from the dyadic grid
  • Cost: O(NJ)O(NJ) instead of O(N)O(N) - more expensive but often worth it for denoising

Cycle-spinning (Coifman-Donoho, 1995): Average wavelet-threshold denoising over all NN cyclic shifts. Equivalent to undecimated DWT thresholding. Eliminates "ringing" artifacts near discontinuities that standard DWT thresholding produces.
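A minimal NumPy sketch of the à trous recursion for the Haar filters illustrates the shift-invariance property (circular boundary; a toy version, not a full implementation):

```python
import numpy as np

def udwt_haar(x, J):
    """Undecimated Haar DWT: no downsampling, filter holes widen by 2^j per level."""
    a = np.asarray(x, dtype=float)
    details = []
    for j in range(J):
        shifted = np.roll(a, 2 ** j)                  # a^(j)[n - 2^j], circular
        details.append((a - shifted) / np.sqrt(2))    # high-pass: g = (1, -1)/sqrt(2)
        a = (a + shifted) / np.sqrt(2)                # low-pass:  h = (1, 1)/sqrt(2)
    return a, details
```

Since every level is a circular convolution and circular convolutions commute with circular shifts, transforming a shifted signal simply shifts all coefficient arrays - exactly the property the decimated DWT lacks.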


Appendix C: Wavelet Regularity and Approximation Theory

C.1 Besov Spaces

The natural function spaces for wavelet approximation theory are Besov spaces Bp,qs(R)B^s_{p,q}(\mathbb{R}), which generalize Sobolev spaces by specifying different regularity in different LpL^p norms.

A function ff belongs to Bp,qsB^s_{p,q} (smoothness ss, LpL^p integrability, q\ell^q scale summability) if and only if its wavelet coefficients satisfy:

(j0(2js(kdj,kp)1/p)q)1/q<\left(\sum_{j \geq 0} \left(2^{js} \left(\sum_k |d_{j,k}|^p\right)^{1/p}\right)^q\right)^{1/q} < \infty

This characterization makes wavelet coefficients the canonical coordinate system for Besov spaces.

Key special cases:

  • B2,2s=HsB^s_{2,2} = H^s (Sobolev space) - standard smooth functions
  • B^s_{\infty,\infty} = C^s (Hölder space) - Hölder-continuous functions of order ss (for non-integer ss)
  • B1,1sB^s_{1,1} - functions with sparse wavelet representations (relevant for compression)

For AI: The spectral bias of neural networks (Rahaman et al., 2018) can be formalized as: gradient descent preferentially fits functions in lower-order Besov spaces first. Understanding Besov spaces explains why networks need curriculum learning for high-frequency targets.

C.2 Approximation Error in Wavelet Bases

Theorem C.1 (Nonlinear approximation, Cohen-Dahmen-DeVore 2001). For fBp,psf \in B^s_{p,p} with s>0s > 0, the best nn-term approximation fnf_n (keeping the nn largest wavelet coefficients) satisfies:

ffnL2Cns/dfBp,ps\|f - f_n\|_{L^2} \leq C\, n^{-s/d} \|f\|_{B^s_{p,p}}

where dd is the signal dimension (d=1d=1 for 1D signals, d=2d=2 for images). Compare to linear approximation (keeping all coefficients at the JJ coarsest scales):

fPJfL2CNJs/d\|f - P_J f\|_{L^2} \leq C\, N_J^{-s/d}

where NJ=2JdN_J = 2^{Jd} is the number of kept coefficients. Both rates are the same for Sobolev spaces, but nonlinear approximation is far better for piecewise smooth functions (e.g., natural images with edges).

For image compression: Natural images are well-modeled as piecewise smooth (Cartoon model: smooth regions + C2C^2 edges). Their best nn-term wavelet approximation error decays as n1n^{-1} in 2D, while DCT-based compression (JPEG) achieves only n1/2n^{-1/2} (no adaptation to edges). This is the theoretical explanation for JPEG 2000's superiority over JPEG.
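The linear-vs-nonlinear gap is easy to see numerically for a step function, whose Haar coefficients are sparse. A small sketch (orthonormal Haar DWT written out in NumPy; by Parseval, the L2L^2 approximation error equals the 2\ell^2 norm of the discarded coefficients):

```python
import numpy as np

def haar_coeffs(x):
    """Full orthonormal Haar DWT, returned coarse-to-fine: [a_J, d_J, ..., d_1]."""
    a, levels = np.asarray(x, dtype=float), []
    while a.size > 1:
        e, o = a[0::2], a[1::2]
        levels.append((e - o) / np.sqrt(2))   # detail arrays, finest first
        a = (e + o) / np.sqrt(2)
    return np.concatenate([a] + levels[::-1])

x = np.where(np.arange(64) < 37, 0.0, 1.0)    # step function: sparse in Haar
c = haar_coeffs(x)
n = 8
# Nonlinear: keep the n largest-magnitude coefficients (Parseval gives the error)
err_nonlinear = np.sqrt(np.sum(np.sort(c ** 2)[:-n]))
# Linear: keep the n coarsest coefficients
err_linear = np.sqrt(np.sum(c[n:] ** 2))
```

The step has only one nonzero detail coefficient per scale (the block containing the jump), so keeping the 8 largest coefficients captures it essentially exactly, while the linear scheme discards the fine-scale coefficients at the jump and incurs an O(1)O(1) error.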

C.3 Wavelet Coefficients and Signal Properties

The wavelet coefficients dj,k=f,ψj,kd_{j,k} = \langle f, \psi_{j,k}\rangle carry rich information about the local structure of ff:

  Coefficient behavior                                              Signal property
  |d_{j,k}| \le C\, 2^{j(\alpha+1/2)} at fine scales                f is Hölder-regular with exponent \alpha near 2^j k
  |d_{j,k}| decays rapidly at fine scales                           f is smooth near 2^j k
  Large isolated coefficients across scales                         Point singularity (discontinuity) at 2^j k
  Cone of influence: large |d_{j,k}| for |2^j k - t_0| \le C\, 2^j  Singularity located at t_0

This local regularity characterization is used in:

  • Image edge detection via wavelet modulus maxima (Mallat-Hwang algorithm)
  • Fractal dimension estimation from power-law behavior of kdj,k2\sum_k |d_{j,k}|^2
  • Neural network weight analysis - regularity of trained weight matrices

Appendix D: Wavelet Connections to Other Methods

D.1 Relationship to Subband Coding

Subband coding (used in audio compression, e.g. the analysis filter banks of MP3 and AAC) divides a signal into frequency bands using a filter bank. The DWT is a special case of octave subband coding with:

  • Logarithmic (octave) frequency spacing
  • Orthogonal / biorthogonal filter bank
  • Perfect reconstruction guarantee

MP3 uses a 32-band uniform polyphase (pseudo-QMF) filter bank - uniform bands, not wavelet octaves. AAC uses the Modified DCT (MDCT) - a lapped transform that shares the PR property with wavelets but uses cosine rather than wavelet basis functions. The similarity is not coincidental: all PR filter banks can be parameterized via unitary matrices (Vaidyanathan, 1993).

D.2 Relationship to Multirate Signal Processing

The DWT is a multirate signal processing system. Key identifiers:

  • Polyphase representation: filter bank in terms of z2z^2 operations (even/odd polyphase components)
  • Noble identity: upsampling-then-filtering = filtering-then-upsampling with upsampled filter
  • Nyquist criterion for PR: H(z)H(z1)+H(z)H(z1)=2H(z)H(z^{-1}) + H(-z)H(-z^{-1}) = 2 is the half-band Nyquist filter condition

Understanding this connection enables implementation optimization: FFT-based polyphase filtering reduces the cost of DWT from O(pN)O(pN) to O(Nlogp)O(N\log p) for long filters.

D.3 Relationship to Attention Mechanisms

The wavelet transform and self-attention both compute weighted sums over all input positions. The differences are instructive:

  Property        Wavelet Transform              Self-Attention
  Weights         Fixed (filter h)               Data-dependent (QK^\top)
  Locality        Compact support of \psi        Global (all pairs)
  Scale           Dyadic hierarchy (fixed)       Single scale (fixed head dim)
  Complexity      O(N)                           O(N^2)
  Equivariance    Translation-equivariant        Permutation-equivariant
  Inversion       Exact (PR)                     Approximate (LayerNorm + residuals)

Attention can be viewed as a data-adaptive, globally-supported filter bank - it learns which positions to attend to, while a wavelet selects which scale/location to probe. The convergence of wavelet ideas and attention is driving research in wavelet-based long-range Transformers (see Section 8.2).

D.4 Wavelets and Renormalization Group Theory

In theoretical physics, the renormalization group (RG) describes how physical systems look at different scales. The key operation - integrating out short-distance degrees of freedom to obtain an effective theory at larger scales - is structurally identical to the DWT low-pass step:

  • RG block-spin transformation (Kadanoff 1966): average spins in a block -> coarser description
  • DWT approximation step: ak(j+1)=nhn2kan(j)a^{(j+1)}_k = \sum_n h_{n-2k} a^{(j)}_n - average input samples via hh -> coarser approximation

The detail coefficients d(j)d^{(j)} capture the "fluctuations" integrated out at each RG step. The analogy is precise for the Haar wavelet (block averaging = Kadanoff transformation) and approximate for general wavelets.

For AI: This RG/wavelet correspondence explains why:

  1. Neural networks with hierarchical architectures (ResNet, UNet, Swin) effectively implement learned RG flows
  2. The number of scales needed grows logarithmically with the ratio of finest to coarsest detail log2(N)\log_2(N)
  3. Feature representations at different layers of a CNN correspond to effective theories at different RG scales

Numerical renormalization group methods in quantum many-body physics use tensor network contractions that are analogous to wavelet coefficient manipulations.


Appendix E: Quick Reference

E.1 Notation Summary

  Symbol                                            Meaning
  \psi                                              Mother wavelet
  \phi                                              Scaling function (father wavelet)
  \psi_{a,b}(t) = a^{-1/2}\psi((t-b)/a)             Dilated/translated wavelet
  \psi_{j,k}(t) = 2^{-j/2}\psi(2^{-j}t - k)         DWT basis function (scale j, position k)
  W_\psi f(a,b) = \langle f, \psi_{a,b}\rangle      Continuous Wavelet Transform
  d_{j,k} = \langle f, \psi_{j,k}\rangle            Detail coefficients
  a_{j,k} = \langle f, \phi_{j,k}\rangle            Approximation coefficients
  h_k                                               Low-pass (scaling) filter
  g_k = (-1)^k h_{1-k}                              High-pass (wavelet) filter (QMF relation)
  V_j                                               Approximation space at scale j
  W_j                                               Detail space at scale j
  C_\psi                                            Admissibility constant
  dbN                                               Daubechies wavelet with N vanishing moments
  symN                                              Symlet (near-symmetric version of dbN)

E.2 Key Formulas

Admissibility: C_\psi = \int_0^\infty |\hat{\psi}(\xi)|^2/\xi\,d\xi < \infty; finiteness requires \hat{\psi}(0) = 0, and given mild decay of \hat{\psi} is equivalent to it.

CWT: Wψf(a,b)=a1/2f(t)ψ((tb)/a)dtW_\psi f(a,b) = a^{-1/2}\int f(t)\overline{\psi((t-b)/a)}\,dt

CWT inversion: f(t)=Cψ10RWψf(a,b)ψa,b(t)dadba2f(t) = C_\psi^{-1}\int_0^\infty\int_\mathbb{R} W_\psi f(a,b)\,\psi_{a,b}(t)\,\frac{da\,db}{a^2}

Two-scale relation: ϕ(t)=2khkϕ(2tk)\phi(t) = \sqrt{2}\sum_k h_k\phi(2t-k)

Wavelet relation: ψ(t)=2kgkϕ(2tk)\psi(t) = \sqrt{2}\sum_k g_k\phi(2t-k)

QMF relation: gk=(1)kh1kg_k = (-1)^k h_{1-k}

Mallat analysis: ak(j+1)=nhn2kan(j)a^{(j+1)}_k = \sum_n h_{n-2k}\,a^{(j)}_n, dk(j+1)=ngn2kan(j)d^{(j+1)}_k = \sum_n g_{n-2k}\,a^{(j)}_n

Mallat synthesis: an(j)=khn2kak(j+1)+kgn2kdk(j+1)a^{(j)}_n = \sum_k h_{n-2k}\,a^{(j+1)}_k + \sum_k g_{n-2k}\,d^{(j+1)}_k

Parseval (DWT): f2=j,kdj,k2+kaJ,k2\|f\|^2 = \sum_{j,k}|d_{j,k}|^2 + \sum_k |a_{J,k}|^2

Vanishing moments: tmψ(t)dt=0\int t^m \psi(t)\,dt = 0 for m=0,,N1m = 0,\ldots,N-1

Universal threshold: λ=σ^2logN\lambda = \hat{\sigma}\sqrt{2\log N}, σ^=median(d1,k)/0.6745\hat{\sigma} = \text{median}(|d_{1,k}|)/0.6745

Soft threshold: Tλ(x)=sign(x)max(xλ,0)T_\lambda(x) = \text{sign}(x)\max(|x|-\lambda, 0)
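In code, the soft-thresholding rule is a one-liner (NumPy sketch; function name is mine):

```python
import numpy as np

def soft_threshold(x, lam):
    """T_lam(x) = sign(x) * max(|x| - lam, 0), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Coefficients below lam are zeroed; the rest shrink toward zero by lam
soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)   # -> [2.0, 0.0, -1.0]
```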

E.3 Standard Wavelet Types and Their Uses

  Use Case                  Recommended Wavelet    Reason
  Audio denoising           db4, sym4              4 vanishing moments, regular, fast
  Image compression         CDF 9/7, CDF 5/3       Symmetric, linear phase, JPEG 2000 standard
  Edge detection            Mexican Hat, db2       2nd-derivative structure; 2 vanishing moments
  EEG/EMG analysis          Morlet, db4            Good time-frequency trade-off, physiological bands
  Geophysical seismology    Morlet, Paul           Complex, analytic (instantaneous frequency)
  Time series (financial)   db4, Haar              Haar for piecewise constant, db4 for regular
  Scattering networks       Complex Morlet         Near-optimal time-frequency uncertainty
  Lossless image coding     CDF 5/3 (integer)      Integer lifting steps, exact inverse
  Quick analysis            Haar                   Fastest, simplest, pedagogical

Appendix F: Wavelet Software Ecosystem

F.1 PyWavelets (pywt)

The standard Python wavelet library. Key functions:

import numpy as np
import pywt

# List available wavelets
pywt.families()             # ['haar', 'db', 'sym', 'coif', 'bior', 'rbio', 'dmey', ...]
pywt.wavelist(family='db')  # ['db1', 'db2', ..., 'db38']

# Create wavelet object and inspect its filter bank
w = pywt.Wavelet('db4')
print(w.dec_lo)   # decomposition (analysis) low-pass filter
print(w.dec_hi)   # decomposition (analysis) high-pass filter
print(w.rec_lo)   # reconstruction (synthesis) low-pass filter
print(w.rec_hi)   # reconstruction (synthesis) high-pass filter

# Example inputs
fs = 1000.0                                   # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.random.randn(t.size)
image = np.random.rand(256, 256)

# 1D DWT
coeffs = pywt.wavedec(signal, 'db4', level=5)  # [cA5, cD5, cD4, cD3, cD2, cD1]
rec = pywt.waverec(coeffs, 'db4')

# 2D DWT
coeffs2 = pywt.wavedec2(image, 'db4', level=3)
# coeffs2 = [cA3, (cH3, cV3, cD3), (cH2, cV2, cD2), (cH1, cV1, cD1)]
rec2 = pywt.waverec2(coeffs2, 'db4')

# Continuous Wavelet Transform
scales = np.arange(1, 64)
coefficients, frequencies = pywt.cwt(signal, scales, 'morl', sampling_period=1/fs)

F.2 Denoising with pywt

import pywt, numpy as np

def denoise_signal(y, wavelet='db4', level=5, mode='soft'):
    # Decompose
    coeffs = pywt.wavedec(y, wavelet, level=level)
    
    # Estimate noise from finest scale
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    
    # Universal threshold
    N = len(y)
    threshold = sigma * np.sqrt(2 * np.log(N))
    
    # Threshold all detail coefficients (not approximation)
    thresholded = [coeffs[0]]  # keep approximation
    for c in coeffs[1:]:
        thresholded.append(pywt.threshold(c, threshold, mode=mode))
    
    # Reconstruct
    return pywt.waverec(thresholded, wavelet)[:len(y)]

F.3 Integration with PyTorch

For differentiable wavelet transforms in deep learning:

# pytorch_wavelets (pip install pytorch_wavelets)
from pytorch_wavelets import DWTForward, DWTInverse

xfm = DWTForward(J=3, mode='zero', wave='db3')  # forward DWT
ifm = DWTInverse(mode='zero', wave='db3')       # inverse DWT

# x: (batch, channels, height, width)
Yl, Yh = xfm(x)  # Yl: low-freq (coarse), Yh: list of high-freq subbands
x_rec = ifm((Yl, Yh))

For learnable wavelet-like transforms, the filter coefficients hkh_k and gkg_k can be parameterized as network weights while enforcing the PR conditions as constraints or soft penalties.


Appendix G: Proofs of Key Results

G.1 Proof: QMF Relation Ensures Orthogonality

Claim: If gk=(1)kh1kg_k = (-1)^k h_{1-k}, then ψ0,k,ψ0,l=δkl\langle\psi_{0,k}, \psi_{0,l}\rangle = \delta_{kl}.

Proof sketch. In the frequency domain, G(ξ)=e2πiξH(ξ+1/2)G(\xi) = e^{-2\pi i \xi}\overline{H(\xi + 1/2)}. The orthonormality condition for translates of ψ\psi requires:

nZψ^(ξ+n)2=1a.e.\sum_{n\in\mathbb{Z}} |\hat{\psi}(\xi + n)|^2 = 1 \quad \text{a.e.}

Using \hat{\psi}(\xi) = G(\xi/2)\hat{\phi}(\xi/2), split the sum over nn into even terms n=2mn = 2m and odd terms n=2m+1n = 2m+1; since GG is 1-periodic and \sum_m |\hat{\phi}(\zeta + m)|^2 = 1 (orthonormality of the ϕ\phi translates):

\sum_n |\hat{\psi}(\xi+n)|^2 = |G(\xi/2)|^2 + |G(\xi/2 + 1/2)|^2

Substituting G(\xi) = e^{-2\pi i\xi}\overline{H(\xi+1/2)} turns the right-hand side into |H(\xi/2+1/2)|^2 + |H(\xi/2)|^2 = 1 by the power-complementary property |H(\xi)|^2 + |H(\xi+1/2)|^2 = 1. \square

G.2 Proof: Mallat Algorithm Computes Inner Products

Claim: The approximation coefficients produced by the Mallat algorithm equal ak(j)=f,ϕj,ka^{(j)}_k = \langle f, \phi_{j,k}\rangle and detail coefficients equal dk(j)=f,ψj,kd^{(j)}_k = \langle f, \psi_{j,k}\rangle.

Proof sketch. By induction: at scale j=0j=0, take a^{(0)}_n = \langle f, \phi_{0,n}\rangle (in practice initialized as a^{(0)}_n \approx f(n), which is exact for suitably band-limited ff). The downsampling step:

ak(j+1)=nhn2kan(j)=f,nhn2kϕj,n=f,ϕj+1,ka^{(j+1)}_k = \sum_n h_{n-2k}\,a^{(j)}_n = \langle f, \sum_n h_{n-2k}\,\phi_{j,n}\rangle = \langle f, \phi_{j+1,k}\rangle

using the two-scale relation ϕj+1,k=nhn2kϕj,n\phi_{j+1,k} = \sum_n h_{n-2k}\,\phi_{j,n} (derived from ϕ(t)=2khkϕ(2tk)\phi(t) = \sqrt{2}\sum_k h_k\phi(2t-k)). Similarly for dk(j+1)d^{(j+1)}_k. \square

G.3 Proof: DWT Has O(N)O(N) Complexity

Claim: Computing the full JJ-level DWT of a length-NN signal costs O(N)O(N) operations.

Proof. Each level jj processes a sequence of length Nj=N/2j1N_j = N/2^{j-1} with a filter of length pp (number of taps). The cost of convolution + downsampling is O(pNj)O(p \cdot N_j) operations. Total:

T(N) = \sum_{j=1}^{J} p\,N/2^{j-1} = pN\sum_{j=0}^{J-1}(1/2)^j = 2pN\,(1 - 2^{-J}) < 2pN

Since pp is fixed (depends only on the wavelet family, not NN), T(N)=O(N)T(N) = O(N). \square

Remark: The O(N)O(N) complexity is strict: for a pp-tap filter, the constant is exactly 2p2p multiplications per input sample (across all levels). For db4 (p=8p=8): 16 multiplications/sample - far fewer than the Nlog2NN\log_2 N operations of the FFT.
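A quick arithmetic check of the 2p2p-per-sample constant (illustrative numbers, not from the text):

```python
# Multiplication count for a J-level DWT with a p-tap filter:
# level j processes N_j = N / 2^(j-1) samples at p multiplications each.
N, p, J = 1 << 20, 8, 20
total_mults = sum(p * (N >> (j - 1)) for j in range(1, J + 1))
assert total_mults < 2 * p * N          # the geometric series stays below 2pN
mults_per_sample = total_mults / N      # approaches 2p = 16 for db4 (p = 8)
```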


This section is part of Chapter 20: Fourier Analysis and Signal Processing. For questions or corrections, open an issue in the repository.