Fourier Transform: Part 8: Applications in Machine Learning to Appendix A: Extended Properties and Derivations
8. Applications in Machine Learning
8.1 FNet: Replacing Self-Attention with Fourier Transforms
Paper: Lee-Thorp, Ainslie, Eckstein, Ontanon (2022). "FNet: Mixing Tokens with Fourier Transforms." NAACL.
The idea: In a standard Transformer, the self-attention sublayer computes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
which is $O(n^2)$ in sequence length $n$. FNet replaces this with a 2D DFT:
$$y = \Re\big(\mathcal{F}_{\text{seq}}(\mathcal{F}_{\text{hidden}}(x))\big)$$
where $\mathcal{F}_{\text{seq}}$ applies the FFT along the sequence dimension and $\mathcal{F}_{\text{hidden}}$ along the embedding dimension. This is $O(n \log n)$ - much faster.
Why does it work? The Fourier Transform is a global linear mixing operation: every output position depends on every input position (via the sum $\hat{x}_k = \sum_n x_n e^{-2\pi i kn/N}$). This is similar to attention's global mixing, but unparameterized - the mixing weights are fixed Fourier basis functions rather than learned attention scores.
Results: FNet achieves 92-97% of BERT's performance on GLUE benchmarks. For tasks requiring fine-grained token interaction (e.g., extractive QA), the gap is larger; for tasks where global context suffices (e.g., sentence classification), FNet nearly matches BERT. Training is 7× faster on GPUs and 2× faster on TPUs.
Mathematical insight: The key fact is that the DFT matrix $F$ with entries $F_{jk} = e^{-2\pi i jk/N}/\sqrt{N}$ is symmetric ($F = F^\top$) and satisfies $F^*F = I$ - the DFT is unitary, just like the continuous FT by Plancherel. The real-part operation ensures the output is real-valued.
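As a quick numerical sanity check (illustrative, not from the paper; $N$ chosen arbitrarily), both the symmetry and the unitarity of the normalized DFT matrix can be verified directly in NumPy:

```python
import numpy as np

# The normalized DFT matrix F_mn = exp(-2*pi*i*m*n/N) / sqrt(N)
# is symmetric and unitary.
N = 8
m, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
F = np.exp(-2j * np.pi * m * n / N) / np.sqrt(N)

assert np.allclose(F, F.T)                     # symmetric: F = F^T
assert np.allclose(F.conj().T @ F, np.eye(N))  # unitary:   F* F = I
```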
Code sketch:

```python
import numpy as np

def fnet_mixing(X):
    """FNet token mixing: 2D DFT over sequence and hidden dims, real part.

    X: (batch, seq_len, d_model) array
    Returns: real-valued mixing output of same shape
    """
    # FFT along the sequence dimension, then along the embedding dimension
    X_fft = np.fft.fft(np.fft.fft(X, axis=1), axis=2)
    return np.real(X_fft)
```
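Because the 2D DFT is unitary up to normalization, the mixing preserves total energy before the real part is taken. A small check (illustrative only; NumPy's unnormalized FFT scales Parseval's identity by seq_len × d_model):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 16, 32))  # (batch, seq_len, d_model)

# 2D DFT over sequence and hidden dimensions, as in the FNet mixing layer
X_fft = np.fft.fft(np.fft.fft(X, axis=1), axis=2)

# Parseval for NumPy's unnormalized FFT: energy scales by seq_len * d_model
assert np.allclose(np.sum(np.abs(X_fft) ** 2), 16 * 32 * np.sum(X ** 2))
```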
8.2 Random Fourier Features
Paper: Rahimi & Recht (2007). "Random Features for Large-Scale Kernel Machines." NeurIPS (Best Paper Award).
The setup: Kernel methods (SVMs, Gaussian Processes) require computing $k(x_i, x_j)$ for all training pairs - an $n \times n$ matrix that costs $O(n^3)$ to invert. For large $n$, this is prohibitive.
Bochner's theorem: A continuous, shift-invariant kernel $k(x, y) = k(x - y)$ is positive definite if and only if it is the Fourier Transform of a non-negative measure $\mu$:
$$k(x - y) = \int_{\mathbb{R}^d} e^{i\omega^\top (x - y)}\, d\mu(\omega)$$
(Full treatment of Bochner's theorem in Section 12-03 Kernel Methods.)
The approximation: Sample frequencies $\omega_1, \dots, \omega_D \sim p(\omega)$ (the normalized spectral density) and define the random feature map:
$$\varphi(x) = \sqrt{\tfrac{2}{D}}\left(\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\right)$$
where $b_i \sim \mathrm{Uniform}[0, 2\pi]$ are random phase offsets.
Guarantee: By the law of large numbers:
$$\varphi(x)^\top \varphi(y) \xrightarrow{\;D \to \infty\;} k(x - y)$$
with concentration: $\Pr\big[\lvert\varphi(x)^\top\varphi(y) - k(x, y)\rvert \geq \varepsilon\big] \leq 2\exp(-D\varepsilon^2/4)$.
For popular kernels:
- RBF kernel $k(\delta) = \exp(-\lVert\delta\rVert^2/2\sigma^2)$: $p(\omega) = \mathcal{N}(0, \sigma^{-2} I)$
- Laplace kernel $k(\delta) = \exp(-\lVert\delta\rVert_1)$: $p(\omega)$ is a Cauchy distribution
Impact: Random Fourier Features reduce kernel SVM training from $O(n^3)$ to $O(nD^2)$ - linear in the number of training points - making kernel methods scalable. The same idea appears in Performer (Choromanski et al., 2021) - an efficient attention mechanism that approximates the attention kernel with random features, reducing attention from $O(n^2)$ to $O(n)$.
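A minimal sketch of Random Fourier Features for the RBF kernel with $\sigma = 1$ (the function name and sizes are illustrative, not from the paper):

```python
import numpy as np

def rff_features(X, D, rng):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-||x - y||^2 / 2), whose spectral density is N(0, I_d)."""
    d = X.shape[1]
    W = rng.standard_normal((d, D))           # omega_i ~ N(0, I_d)
    b = rng.uniform(0.0, 2 * np.pi, size=D)   # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Z = rff_features(X, D=20000, rng=rng)

K_exact = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
K_rff = Z @ Z.T                               # inner products approximate k
assert np.max(np.abs(K_exact - K_rff)) < 0.1  # pointwise error ~ O(1/sqrt(D))
```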
8.3 Spectral Normalization
Paper: Miyato, Kataoka, Koyama, Yoshida (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR.
The problem: Training GANs is notoriously unstable because the discriminator can overfit, causing gradient vanishing for the generator. A Lipschitz constraint on the discriminator stabilizes training.
The solution: Normalize each weight matrix $W$ by its spectral norm $\sigma(W)$ (the largest singular value):
$$W_{\mathrm{SN}} = \frac{W}{\sigma(W)}$$
This makes every linear layer 1-Lipschitz: $\lVert W_{\mathrm{SN}} x\rVert \leq \lVert x\rVert$. The composition of 1-Lipschitz layers is 1-Lipschitz - so the entire discriminator is 1-Lipschitz.
Computing $\sigma(W)$: Power iteration gives an efficient approximation. Starting from a random unit vector $u$:
$$v \leftarrow \frac{W^\top u}{\lVert W^\top u\rVert}, \qquad u \leftarrow \frac{Wv}{\lVert Wv\rVert}, \qquad \sigma(W) \approx u^\top W v$$
- One iteration per training step suffices empirically.
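The iteration above can be sketched as follows (illustrative; in real training one iteration per step is performed, with $u$ cached across steps):

```python
import numpy as np

def spectral_norm(W, n_iters=200, rng=None):
    """Estimate sigma(W), the largest singular value, by power iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)   # v <- W^T u / ||W^T u||
        u = W @ v
        u /= np.linalg.norm(u)   # u <- W v / ||W v||
    return float(u @ W @ v)      # sigma ~ u^T W v

W = np.random.default_rng(1).standard_normal((64, 32))
sigma = spectral_norm(W)
assert np.isclose(sigma, np.linalg.svd(W, compute_uv=False)[0], rtol=1e-6)
# W / sigma is then 1-Lipschitz as a linear map
```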
Why "spectral"? The spectral norm is the square root of the largest eigenvalue of $W^\top W$ - the spectrum of $W$ viewed as a linear operator. This connects to Fourier analysis: for a (circular) convolution layer with kernel $h$, the spectral norm is $\sup_\omega \lvert\hat h(\omega)\rvert$ - the supremum of the Fourier Transform of the filter.
For AI: Spectral normalization is now standard in GAN training. It also appears in attention normalization - normalizing the key/value projection matrices to control the Lipschitz constant of the attention map, which is critical for stable training of large models.
8.4 Fourier Neural Operator - Preview
The Fourier Neural Operator (FNO) (Li et al., 2021) learns a mapping between function spaces (e.g., PDE initial conditions -> solutions) by:
- Lifting the input to a higher-dimensional representation: $v_0 = P(a)$
- Applying Fourier integral operator layers: $v_{\ell+1} = \sigma\big(W v_\ell + \mathcal{F}^{-1}(R_\theta \cdot \mathcal{F}(v_\ell))\big)$
- Projecting to the output: $u = Q(v_L)$
The key operation is a learnable complex-valued tensor $R_\theta$ applied pointwise in Fourier space - a global convolution with a learned filter. Only the lowest $k_{\max}$ Fourier modes are kept (Plancherel guarantees the truncation error equals the discarded energy).
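The Fourier layer's core operation can be sketched in 1D (a simplified illustration, not the reference implementation; `R` plays the role of the learned tensor $R_\theta$):

```python
import numpy as np

def spectral_conv_1d(v, R, k_max):
    """One FNO-style spectral convolution (sketch).

    v:     (n_grid, channels) real input sampled on a uniform 1D grid
    R:     (k_max, channels, channels) complex learned weights
    k_max: number of low Fourier modes kept
    """
    v_hat = np.fft.rfft(v, axis=0)                  # forward FFT
    out_hat = np.zeros_like(v_hat)
    # multiply the lowest k_max modes by the learned tensor, drop the rest
    out_hat[:k_max] = np.einsum("kio,ki->ko", R, v_hat[:k_max])
    return np.fft.irfft(out_hat, n=v.shape[0], axis=0)

rng = np.random.default_rng(0)
v = rng.standard_normal((64, 4))
R = rng.standard_normal((12, 4, 4)) + 1j * rng.standard_normal((12, 4, 4))
out = spectral_conv_1d(v, R, k_max=12)
assert out.shape == (64, 4)
```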
FNO achieves a 1000× speedup over classical PDE solvers for problems like weather prediction and turbulence simulation.
Full treatment in Section 20-03 DFT and FFT - the discretized version of FNO and its implementation via FFT is covered there.
8.5 Bochner's Theorem and Shift-Invariant Kernels - Preview
Bochner's theorem (1932) characterizes all positive-definite, shift-invariant kernels as Fourier Transforms of non-negative measures:
$$k(\delta) = \int_{\mathbb{R}^d} e^{i\omega^\top \delta}\, d\mu(\omega)$$
This is a profound connection between Fourier analysis and kernel methods. The kernel's shape in the spatial domain determines (and is determined by) its spectral density $p(\omega)$.
Full treatment in Section 12-03 Kernel Methods - the RKHS theory, Mercer's theorem, and applications to SVMs, Gaussian Processes, and the Neural Tangent Kernel are covered there. Here we note that Bochner's theorem is the mathematical foundation for Random Fourier Features (Section 8.2).
8.6 RoPE from the Continuous FT Perspective
Rotary Position Embedding (RoPE) (Su, Lu, Pan, Meng, Luo, 2021) is now used in LLaMA-3, Mistral, Qwen, GPT-NeoX, and virtually every frontier LLM. Its mathematical foundation is the Fourier Transform on the circle group.
Setup. In attention, the score between the query at position $m$ and the key at position $n$ should depend only on the relative position $m - n$. This is a translation invariance requirement - exactly the condition Bochner's theorem characterizes.
Construction. Pair up embedding dimensions: $(x_{2j}, x_{2j+1})$ for $j = 0, \dots, d/2 - 1$. Treat each pair as a complex number $z_j = x_{2j} + i x_{2j+1}$. Apply rotation by angle $m\theta_j$ at position $m$:
$$f(x, m)_j = z_j\, e^{i m \theta_j}, \qquad \theta_j = 10000^{-2j/d}$$
The attention score becomes:
$$\langle f_q(q, m), f_k(k, n)\rangle = \Re\Big[\sum_j q_j\, \bar{k}_j\, e^{i(m-n)\theta_j}\Big]$$
which depends only on $m - n$ - the relative position. This is the Fourier modulation theorem: multiplying by $e^{im\theta}$ is a frequency shift by $\theta$ in the "position domain."
The frequencies $\theta_j = 10000^{-2j/d}$: The geometric spacing means each pair encodes a different "octave" of positional information. Low-index dimensions ($j$ small, $\theta_j \approx 1$) use high frequencies - sensitive to local position differences. High-index dimensions ($\theta_j \approx 1/10000$) use low frequencies - sensitive to global structure over thousands of positions. This is a multi-resolution decomposition - the Fourier Transform at multiple frequency scales.
Why the base 10000? The maximum useful context is on the order of the base - roughly 10K tokens at the original RoPE scale (the slowest frequency completes at most one full rotation). For LLaMA-3 with 128K context, the effective base must be much larger - which is why LLaMA-3.1 uses base 500,000 instead of 10,000.
ROPE: FOURIER TRANSFORM IN ATTENTION
========================================================================
f_q(q, m) = q e^{imθ}                <- modulation by Fourier basis e^{imθ}
⟨f_q(q,m), f_k(k,n)⟩ = Re[q k* e^{i(m-n)θ}]
                       ^ depends only on relative position m-n
θ_j = 10000^{-2j/d}:  j=0 -> θ≈1 (fast/local),  j=d/2 -> θ≈1/10000 (slow/global)
Context extension:
  Original RoPE:  base=10000,   max context ~10K tokens
  LLaMA-3.1:      base=500K,    max context 128K tokens
  LongRoPE:       non-uniform rescaling up to 2M context
========================================================================
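A minimal sketch of RoPE for a single vector (illustrative; production implementations vectorize over positions and batch), verifying that scores depend only on the relative offset:

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Apply rotary position embedding to one vector x (even dim d) at position m.
    Pairs (x[2j], x[2j+1]) are rotated by m * theta_j, theta_j = base^(-2j/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)
    z = x[0::2] + 1j * x[1::2]        # treat each pair as a complex number
    z = z * np.exp(1j * m * theta)    # rotate by m * theta_j
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

# The score depends only on the relative position m - n:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope(q, 5) @ rope(k, 3)     # positions (5, 3), offset 2
s2 = rope(q, 12) @ rope(k, 10)   # positions (12, 10), same offset 2
assert np.isclose(s1, s2)
```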
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Confusing the $\omega$- and $\nu$-conventions and getting $2\pi$ factors wrong | The $\omega$-convention has a $\frac{1}{2\pi}$ in the inverse; the $\nu$-convention ($e^{-2\pi i\nu t}$) does not. Mixing these adds spurious $2\pi$ factors | Always identify the convention on first use; this curriculum uses the $\nu$-convention |
| 2 | Assuming $\hat f \in L^1$ whenever $f \in L^1$ | $\mathrm{rect} \in L^1$ has $\hat f = \mathrm{sinc} \notin L^1$. The Riemann-Lebesgue lemma only guarantees $\hat f(\nu) \to 0$, not integrability | Check whether $f$ is smooth enough (e.g., $f \in C^1$ and Lipschitz) for $\hat f \in L^1$ to hold |
| 3 | Writing $\hat\delta(\nu) = \delta(\nu)$ | $\delta$ has a flat spectrum: $\hat\delta(\nu) = 1$ for all $\nu$. A "pointlike" impulse contains all frequencies equally | Compute the distributional FT: $\mathcal{F}[\delta] = 1$ |
| 4 | Thinking the uncertainty principle only limits our instruments | $\sigma_t\sigma_\omega \geq \tfrac{1}{2}$ is a theorem about functions, not about measurement devices. It holds for any $f \in L^2$, regardless of how it is measured | Accept the bound as a mathematical fact; designing around it requires fundamentally different time-frequency representations (wavelets, Section 05) |
| 5 | Applying the differentiation property to non-differentiable $f$ | $\mathcal{F}[f'](\nu) = 2\pi i\nu\,\hat f(\nu)$ requires $f$ to be (weakly) differentiable. For $f$ with a jump, $f'$ contains $\delta$ terms in the distributional sense | Work in the distributional setting when $f$ has jumps; the formula still holds but $f'$ is a distribution |
| 6 | Claiming Parseval says $\int \lvert f\rvert^2 = \int \lvert\hat f\rvert^2$ in every convention | In the $\omega$-convention there is a $\frac{1}{2\pi}$: $\int \lvert f\rvert^2\, dt = \frac{1}{2\pi}\int \lvert\hat f\rvert^2\, d\omega$ | In the $\nu$-convention the energies match exactly: $\int \lvert f\rvert^2\, dt = \int \lvert\hat f\rvert^2\, d\nu$ |
| 7 | Mixing up the FT of a product vs. convolution | $\mathcal{F}[fg] = \hat f * \hat g$ (convolution in frequency), not $\hat f\hat g$. The multiplicative dual of the convolution theorem is often reversed | Memorize the duality table: convolution in time <-> product in frequency; product in time <-> convolution in frequency |
| 8 | Forgetting the $1/\lvert a\rvert$ in the scaling theorem | $\mathcal{F}[f(at)](\nu) = \frac{1}{\lvert a\rvert}\hat f(\nu/a)$; dropping the $1/\lvert a\rvert$ violates energy conservation | Remember: compressing time by $a$ stretches frequency by $a$ and scales amplitude by $1/\lvert a\rvert$ |
| 9 | Treating negative frequencies as "unphysical" | For real $f$, negative frequencies are redundant (Hermitian symmetry) but are not wrong. For complex $f$ (e.g., analytic signals), negative and positive frequencies are distinct and carry different information | Embrace negative frequencies; they simplify formulas and are essential for complex signals |
| 10 | Applying inversion pointwise instead of a.e. | Inversion holds almost everywhere (for sufficient regularity). At a jump discontinuity, the inversion integral converges to $\tfrac{1}{2}\big(f(t^+) + f(t^-)\big)$ - the average of left and right limits | Account for the $\tfrac{1}{2}$-average behavior at discontinuities (same as Dirichlet's theorem in Section 01) |
| 11 | Using $\sin(\pi x)/(\pi x)$ and $\sin(x)/x$ interchangeably | There are two conventions for sinc. The normalized sinc satisfies $\mathcal{F}[\mathrm{rect}] = \mathrm{sinc}$ in the $\nu$-convention; the unnormalized $\sin(x)/x$ does not | In this curriculum: $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$ (normalized), consistent with the $\nu$-convention FT |
| 12 | Thinking FNet replaces attention entirely | FNet is a weaker mixer than attention - it cannot learn data-dependent weights. It works well for classification but poorly for tasks requiring fine-grained token interactions (QA, coreference) | Use FNet as a fast baseline; switch to attention for tasks requiring selective, query-dependent mixing |
10. Exercises
Exercise 1 * - Computing Fourier Transforms from Definition
(a) Compute $\hat f(\nu)$ for $f(t) = e^{-a\lvert t\rvert}$ with $a > 0$. Show all steps of integration. What type of function is $\hat f$?
(b) Compute $\hat f(\nu)$ for the pulse $f = \mathbf{1}_{[-T/2,\, T/2]}$ of width $T$. Express $\hat f$ using the normalized sinc function.
(c) Verify the bound $\lvert\hat f(\nu)\rvert \leq \lVert f\rVert_{L^1}$: show it holds for your answer in (a).
(d) Confirm the Riemann-Lebesgue lemma numerically for the pulse in (b): plot $\lvert\hat f(\nu)\rvert$ and verify it decays to 0.
Exercise 2 * - Applying Properties
(a) Given $\hat f(\nu)$, use the time-shift property to write the Fourier Transform of $f(t - t_0)$.
(b) Given $\hat f(\nu)$, find the Fourier Transform of $e^{2\pi i\nu_0 t} f(t - t_0)$. (Use shift + modulation.)
(c) If $f$ has transform $\hat f(\nu)$, what is the FT of $f(at - b)$? (Use both shift and scaling.)
(d) Prove: if $f$ is real and even, then $\hat f$ is real and even.
Exercise 3 * - Parseval's Relation Numerically
(a) For $f(t) = e^{-\pi t^2}$, compute $\hat f(\nu)$ analytically.
(b) Compute $\int_{-\infty}^{\infty} \lvert f(t)\rvert^2\, dt$ analytically and verify that $\int_{-\infty}^{\infty} \lvert\hat f(\nu)\rvert^2\, d\nu$ equals it.
(c) Numerically verify Parseval using the FFT: discretize $f$ on $[-L, L]$ with $N$ points, apply the FFT, and compare $\sum_n \lvert f_n\rvert^2$ vs $\frac{1}{N}\sum_k \lvert\hat f_k\rvert^2$.
(d) Use Parseval to evaluate $\int_{-\infty}^{\infty} \big(\tfrac{\sin x}{x}\big)^2\, dx$.
Exercise 4 ** - The Inversion Theorem
(a) For $f(t) = e^{-t}\,\mathbf{1}_{t > 0}$ (whose transform $\hat f$ is not in $L^1$, so the $L^1$ inversion theorem does not apply), show that the principal value integral $\mathrm{p.v.}\!\int \hat f(\nu)\, e^{2\pi i\nu t}\, d\nu$ converges to $f(t)$ for $t \neq 0$ by evaluating the integral explicitly.
(b) Show that at $t = 0$ (the jump discontinuity), the principal value integral converges to $\tfrac{1}{2} = \tfrac{f(0^+) + f(0^-)}{2}$.
(c) This mirrors Dirichlet's theorem from Section 20-01. Identify the precise analog: what plays the role of the Dirichlet kernel in the Fourier integral case?
Exercise 5 ** - Uncertainty Principle
(a) For the Gaussian $f(t) = e^{-t^2/2}$ (normalized to $\lVert f\rVert_2 = 1$ appropriately), compute $\sigma_t$ (RMS time spread) and $\sigma_\omega$ (RMS frequency spread). Verify $\sigma_t\sigma_\omega = \tfrac{1}{2}$.
(b) For $f(t) = e^{-\lvert t\rvert}$, compute $\sigma_t$ and $\sigma_\omega$ numerically. Is the uncertainty bound satisfied? Is it tight?
(c) A signal designer wants to transmit a pulse whose duration $\sigma_t$ and bandwidth $\sigma_\omega$ satisfy $\sigma_t\sigma_\omega < \tfrac{1}{2}$. Is this possible? What does the uncertainty principle say?
(d) Show that any signal of the form $f(t) = C e^{-\alpha t^2}$ with $\alpha > 0$ achieves equality in the uncertainty principle.
Exercise 6 ** - Tempered Distributions
(a) Verify $\mathcal{F}[\delta] = 1$ using the distributional definition $\langle\hat T, \varphi\rangle = \langle T, \hat\varphi\rangle$.
(b) Compute $\mathcal{F}[\delta']$ (FT of the derivative of delta).
(c) Use the result from (a) to verify: $\mathcal{F}[1] = \delta$.
(d) Verify the Poisson summation formula numerically for $f(t) = e^{-t^2}$: compare $\sum_n f(n)$ with $\sum_k \hat f(k)$ for $\lvert n\rvert, \lvert k\rvert \leq 10$.
Exercise 7 *** - Random Fourier Features
(a) For the RBF kernel $k(x, y) = e^{-\lVert x - y\rVert^2/2}$ (setting $\sigma = 1$), the spectral density is $p(\omega) = \mathcal{N}(0, I_d)$. Draw $D$ frequencies $\omega_i \sim \mathcal{N}(0, I_d)$ for $d = 2$ and construct the random feature map $z(x)$ (using cos and sin).
(b) Generate 100 random pairs $(x, y)$ with $x, y \sim \mathcal{N}(0, I_d)$. For each pair, compute the exact kernel value $k(x, y)$ and the RFF approximation $z(x)^\top z(y)$. Plot the approximation vs. exact values and report RMSE.
(c) Study the approximation quality as a function of $D$. How does RMSE scale with $D$? Is it consistent with the theoretical $O(1/\sqrt{D})$ bound?
(d) Takeaway: explain in 2 sentences why RFF matters for LLM-scale systems where kernel computation would be infeasible.
Exercise 8 *** - FNet vs Attention Experiment
(a) Implement a minimal FNet mixing layer: given $X \in \mathbb{R}^{n\times d}$, apply the 2D FFT and take the real part. Implement a minimal attention mixing layer: $\mathrm{softmax}\!\big(XW_Q (XW_K)^\top/\sqrt{d}\big)\, XW_V$.
(b) For a synthetic task - "copy the token at a designated position $p$ to the output" (a task requiring precise position tracking) - construct training pairs and train both mixers with a linear head on top. Compare accuracy and training speed.
(c) For a synthetic task - "output the mean of all input tokens" (a global aggregation task) - repeat the comparison. Which mixer is better suited? Why?
(d) Takeaway: connect your results to the mathematical properties of the FT (fixed unparameterized mixing vs. learned data-dependent mixing) and explain what types of tasks each mixer is appropriate for.
11. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact |
|---|---|
| Fourier Transform as unitary operator (Plancherel) | FNet (Lee-Thorp et al., 2022) uses FT as a token mixer - same information content, 7× faster than attention on GPU. The unitarity of FT ensures no information loss during mixing. |
| Convolution-multiplication duality | The theoretical basis for convolution via FFT - the reason CNNs can be trained efficiently. FlashConv (Fu et al., 2023) and Hyena (Poli et al., 2023) extend this to long-range sequence models. |
| Modulation theorem (time-shift <-> frequency-shift) | Foundation of RoPE (Su et al., 2021) - multiplying embeddings by $e^{im\theta}$ shifts the position representation. Now standard in LLaMA-3, Mistral, Qwen, GPT-NeoX. |
| Uncertainty principle | Fundamental constraint on context length extension. YaRN (Peng et al., 2023) and LongRoPE (Ding et al., 2024) must lower the frequencies $\theta_j$ to extend context - the UP dictates the minimum frequency needed. |
| Bochner's theorem + FT of spectral density | Random Fourier Features (Rahimi & Recht, 2007) reduce kernel computation from $O(n^3)$ to $O(nD^2)$. Performer (Choromanski et al., 2021) uses the same idea to get linear-time $O(n)$ attention. |
| Spectral norm | Spectral normalization (Miyato et al., 2018) - dividing discriminator weights by their spectral norm enforces Lipschitz constraints. Standard in GAN training and used in diffusion model discriminators. |
| FT as basis for function space operators | Fourier Neural Operator (Li et al., 2021) - solves PDEs 1000× faster by learning in Fourier space. Used in climate modeling, fluid simulation, and materials science. |
| Differentiation in frequency domain ($\partial_t \leftrightarrow 2\pi i\nu$) | Heat and wave equation solutions in closed form via FT. The kernel of the heat equation is a Gaussian - explains why diffusion models add Gaussian noise: it has maximal spectral support. |
| Gaussian as optimal time-frequency window | Gaussian embeddings in Gaussian attention (Tsai et al., 2019) - replacing dot-product attention with a Gaussian kernel gives an attention with explicit time-frequency interpretation. |
| Dirac comb + Poisson summation | Shannon sampling theorem - ensures that Whisper's mel-spectrogram preprocessing (16kHz sampling, 80-channel mel filterbank) captures all speech frequencies (up to 8kHz) without aliasing. |
12. Conceptual Bridge
Looking Back: From Fourier Series to Fourier Transform
The Fourier series (Section 20-01) established that periodic functions decompose into discrete harmonics. The key insight was geometric: the trigonometric functions form an orthonormal basis of , and the Fourier coefficients are the projections of onto these basis vectors.
The Fourier Transform carries this insight to its natural conclusion. By letting the period $T \to \infty$, the discrete harmonic sum $\sum_n c_n e^{2\pi i n t/T}$ becomes the continuous integral $\int \hat f(\nu)\, e^{2\pi i\nu t}\, d\nu$, and the discrete spectrum $\{c_n\}$ becomes the continuous spectrum $\hat f(\nu)$. Everything from the Fourier series has an analog:
- Orthonormality -> Plancherel's theorem (Section 5)
- Parseval's identity -> Parseval's relation (Section 5.2)
- Dirichlet's theorem on convergence -> Fourier inversion theorem (Section 4)
- Gibbs phenomenon -> spectral leakage in DFT (Section 03)
- Smoothness <-> spectral decay -> differentiation property (Section 3.5)
The distribution theory in Section 7 reveals the deepest unity: Fourier series is just the Fourier Transform restricted to periodic distributions. The Dirac comb with spacing $T$ transforms to a Dirac comb with spacing $1/T$ - the very relationship between period and harmonic spacing that the Fourier series encodes.
Looking Forward: Three Branches
From the Fourier Transform, the curriculum branches in three directions:
Branch 1: Discretization -> Section 20-03 DFT and FFT. The continuous FT must be made computable. Discretizing both time and frequency leads to the DFT, and the Cooley-Tukey algorithm (1965) computes it in $O(N\log N)$ instead of $O(N^2)$. The FFT is arguably the most important algorithm in scientific computing - it is what makes spectrograms, MRI, and FNet feasible. The DFT inherits all properties of the continuous FT but adds complications: aliasing (from sampling in time), spectral leakage (from finite duration), and the circular convolution structure.
Branch 2: Convolution Theorem -> Section 20-04. The most important property of the FT for applications is that convolution in time equals multiplication in frequency. This converts the convolution operation into FFT-based multiplication. The convolution theorem is the mathematical foundation of CNNs, WaveNet, S4/Mamba, and Hyena - all architectures where the key operation is a learned filter applied by convolution. The full theory of LTI systems, frequency response, filter design, and the Wiener-Khinchin theorem belongs there.
Branch 3: Time-Frequency Localization -> Section 20-05 Wavelets. The FT has a fundamental limitation: $\hat f(\nu)$ tells you which frequencies are present but not when they occur. The uncertainty principle shows this is not fixable - any attempt to localize in both time and frequency is bounded by $\sigma_t\sigma_\omega \geq \tfrac{1}{2}$. Wavelets overcome this by accepting a principled tradeoff: high frequencies are analyzed at fine time resolution, low frequencies at coarse time resolution. This multi-resolution analysis (MRA) is the mathematical foundation of JPEG 2000, EEG analysis, and wavelet attention mechanisms.
POSITION IN THE FOURIER CURRICULUM
========================================================================
Section 01 FOURIER SERIES Section 02 FOURIER TRANSFORM (HERE)
--------------------- ---------------------------------
Periodic functions Aperiodic functions on R
Discrete spectrum {c_n}         Continuous spectrum f^(ν)
Parseval for series Plancherel's theorem
Dirichlet convergence Inversion theorem
Gibbs phenomenon Spectral leakage (-> Section 03)
RoPE derivation Full uncertainty principle
v v
+----------------------------------------------------------------+
| Section 02 -> THREE BRANCHES |
| |
| Discretize: Convolve: Time-localize: |
| Section 03 DFT/FFT -> Section 04 Convolution -> Section 05 Wavelets |
|   O(N log N)              CNNs, Mamba             MRA, multi-scale     |
| Whisper, FNO WaveNet, Hyena Scattering networks |
+----------------------------------------------------------------+
v
Section 21 Statistical Learning Theory
(spectral learning bounds, kernel methods)
========================================================================
Key Takeaways
The Fourier Transform is characterized by three central theorems:
- Plancherel - the FT is a unitary isometry on $L^2$: it preserves energy and inner products. The Fourier Transform changes how you describe a signal, not how much information it contains.
- Uncertainty - $\sigma_t\sigma_\omega \geq \tfrac{1}{2}$: time and frequency concentration are fundamentally coupled. This is a theorem about analysis, not about physics, and it constrains every signal processing and ML system that operates in both domains simultaneously.
- Inversion - the FT is invertible: given $\hat f$, you can recover $f$ exactly. No information is lost in the Fourier domain representation - it is a complete, lossless change of basis.
Together, these theorems say: the Fourier Transform is an invertible, energy-preserving, mathematically constrained change of representation. It reveals structure (frequency content) that is invisible in the time domain, converts convolution to multiplication, and converts differentiation to algebra - making it the single most powerful analytical tool for understanding and processing signals, functions, and the weight matrices of neural networks.
Appendix A: Extended Properties and Derivations
A.1 The Duality Theorem in Full
One of the most elegant properties of the Fourier Transform is self-duality: applying the FT twice returns the time-reversed original.
Theorem A.1 (Duality). For $f \in L^1$ with $\hat f \in L^1$:
$$\mathcal{F}[\mathcal{F}[f]](t) = f(-t)$$
Equivalently: if $f(t) \leftrightarrow \hat f(\nu)$, then $\hat f(t) \leftrightarrow f(-\nu)$.
Proof:
$$\mathcal{F}[\hat f](\xi) = \int \hat f(\nu)\, e^{-2\pi i\nu\xi}\, d\nu = \int \hat f(\nu)\, e^{2\pi i\nu(-\xi)}\, d\nu = f(-\xi)$$
where the last step is the inversion theorem evaluated at $-\xi$; Fubini (justified since $f, \hat f \in L^1$) makes the interchange of integrals legitimate. $\square$
Corollary: Applying the FT four times returns the original: $\mathcal{F}^4 = I$. The Fourier Transform has order 4 as an operator.
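The order-4 property has an exact discrete analogue that is easy to check with NumPy's unnormalized FFT (dividing by $N$ after each double application):

```python
import numpy as np

# Applying the (normalized) DFT twice reverses the signal modulo N;
# four applications recover it exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)

F2 = np.fft.fft(np.fft.fft(x)) / len(x)   # F^2, normalized
reversed_x = np.roll(x[::-1], 1)          # x[(-k) mod N]
assert np.allclose(F2, reversed_x)

F4 = np.fft.fft(np.fft.fft(F2)) / len(x)  # F^4 = identity
assert np.allclose(F4, x)
```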
Using duality to compute new transforms:
If you know $\mathcal{F}[f] = g$, then $\mathcal{F}[g](\nu) = f(-\nu)$ - a new transform pair for free.
Example: We know $\mathcal{F}[\mathrm{rect}] = \mathrm{sinc}$. By duality: $\mathcal{F}[\mathrm{sinc}](\nu) = \mathrm{rect}(-\nu) = \mathrm{rect}(\nu)$ (since rect is even).
So $\mathrm{sinc}(t) \leftrightarrow \mathrm{rect}(\nu)$: a sinc in time has a rectangular (bandlimited) spectrum. This is the ideal low-pass filter - a filter that passes frequencies $\lvert\nu\rvert < \tfrac{1}{2}$ perfectly and rejects all others.
A.2 Analytic Signals and the Hilbert Transform
Definition A.1 (Analytic Signal). For a real signal $f(t)$, its analytic signal is:
$$f_a(t) = f(t) + i\,(Hf)(t)$$
where $H$ is the Hilbert Transform: $(Hf)(t) = \frac{1}{\pi}\,\mathrm{p.v.}\!\int_{-\infty}^{\infty} \frac{f(s)}{t - s}\, ds$.
Fourier domain characterization:
$$\hat{f_a}(\nu) = \begin{cases} 2\hat f(\nu), & \nu > 0 \\ \hat f(0), & \nu = 0 \\ 0, & \nu < 0 \end{cases}$$
The analytic signal has only non-negative frequencies: it is a one-sided spectrum signal. This is achieved by zeroing out the negative frequencies and doubling the positive ones - a frequency-domain operation.
Why this matters: The analytic signal enables the definition of instantaneous frequency - the frequency of the signal at a given instant. This is essential for frequency-modulated signals (FM radio, chirps) and is used in empirical mode decomposition for analyzing non-stationary signals like EEG and speech.
Hilbert Transform in spectral terms: $\widehat{Hf}(\nu) = -i\,\mathrm{sgn}(\nu)\,\hat f(\nu)$. The Hilbert Transform applies a $-90°$ phase shift to positive frequencies and $+90°$ to negative frequencies - it is a "90-degree phase rotator."
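The frequency-domain characterization gives a direct FFT construction of the analytic signal (a sketch of the standard approach, as used by `scipy.signal.hilbert`; even-length inputs assumed):

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero negative frequencies, double positive."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = h[N // 2] = 1.0      # DC and Nyquist bins unchanged
    h[1:N // 2] = 2.0           # double positive frequencies
    return np.fft.ifft(X * h)   # negative frequencies are zeroed

t = np.linspace(0, 1, 256, endpoint=False)
x = np.cos(2 * np.pi * 8 * t)
xa = analytic_signal(x)
assert np.allclose(xa.real, x)                   # real part is the signal
assert np.allclose(np.abs(xa), 1.0)              # envelope of a pure tone
# Hilbert transform of cos is sin: the -90 degree phase rotation
assert np.allclose(xa.imag, np.sin(2 * np.pi * 8 * t))
```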
A.3 The Fourier Transform on
The Fourier Transform extends naturally to $d$ dimensions. For $f: \mathbb{R}^d \to \mathbb{C}$ with $f \in L^1(\mathbb{R}^d)$:
$$\hat f(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i\,\xi\cdot x}\, dx$$
where $\xi \in \mathbb{R}^d$ is the frequency vector. All properties generalize:
- Shift: $f(x - a) \leftrightarrow e^{-2\pi i\,\xi\cdot a}\,\hat f(\xi)$
- Scaling by an invertible matrix $A$: $f(Ax) \leftrightarrow \frac{1}{\lvert\det A\rvert}\,\hat f(A^{-\top}\xi)$
- Differentiation: $\partial_{x_j} f \leftrightarrow 2\pi i\,\xi_j\,\hat f(\xi)$
- Plancherel: $\int_{\mathbb{R}^d} \lvert f\rvert^2\, dx = \int_{\mathbb{R}^d} \lvert\hat f\rvert^2\, d\xi$
For AI: The 2D Fourier Transform is used in convolutional networks for images. A 2D convolution with filter $h$ is, in Fourier space, pointwise multiplication: $\widehat{h * f} = \hat h \cdot \hat f$. For FNet (Section 8.1), the 2D FT is applied over both the sequence dimension and the embedding dimension.
Radial functions: For radially symmetric $f(x) = g(\lVert x\rVert)$, the Fourier Transform is also radial: $\hat f(\xi) = G(\lVert\xi\rVert)$. The RBF kernel is radially symmetric, so its spectral density is also radially symmetric - which is why sampling $\omega \sim \mathcal{N}(0, \sigma^{-2}I)$ works for RFF with the RBF kernel.
A.4 The Fourier Transform and PDEs
The differentiation property converts partial differential equations into algebraic equations in frequency space. This is the primary tool for solving linear constant-coefficient PDEs.
Example 1: Heat Equation. Solve $u_t = \kappa u_{xx}$, $u(x, 0) = f(x)$.
Taking the Fourier Transform in $x$ (angular-frequency convention):
$$\frac{\partial\hat u}{\partial t}(\omega, t) = -\kappa\omega^2\,\hat u(\omega, t)$$
This is a first-order ODE in $t$ for each fixed $\omega$: $\hat u(\omega, t) = \hat f(\omega)\, e^{-\kappa\omega^2 t}$.
Inverting:
$$u(x, t) = (f * G_t)(x), \qquad G_t(x) = \frac{1}{\sqrt{4\pi\kappa t}}\, e^{-x^2/(4\kappa t)}$$
where $G_t$ is the Gaussian heat kernel (a Gaussian with time-dependent width $\sqrt{2\kappa t}$).
Interpretation: Heat diffusion = convolution with a Gaussian. High frequencies decay as $e^{-\kappa\omega^2 t}$ - much faster than low frequencies. Sharp features (high frequency content) are smoothed out rapidly; broad features (low frequency) persist.
For AI - Diffusion Models: The DDPM forward process is exactly discrete heat diffusion: the signal is progressively smoothed by adding Gaussian noise (which has a flat spectrum = all frequencies). The network learns to reverse this process. The Fourier perspective explains why the early denoising steps (low noise) must recover high-frequency details while later steps (high noise) recover the coarse structure - the exact inverse of heat diffusion.
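The spectral solution of the heat equation can be implemented in a few lines on a periodic grid (a sketch with periodic boundary conditions; $\kappa$ and $t$ chosen arbitrarily):

```python
import numpy as np

# Heat equation on a periodic grid: evolve each Fourier mode exactly by
# u_hat(omega, t) = f_hat(omega) * exp(-kappa * omega^2 * t).
N, L, kappa, t = 256, 2 * np.pi, 0.1, 0.5
x = np.linspace(0, L, N, endpoint=False)
u0 = np.sign(np.sin(x))                          # square wave: sharp jumps

omega = 2 * np.pi * np.fft.fftfreq(N, d=L / N)   # angular frequencies
u = np.real(np.fft.ifft(np.fft.fft(u0) * np.exp(-kappa * omega**2 * t)))

# High frequencies decayed fastest: jumps are smoothed, total energy dropped
assert np.max(np.abs(np.diff(u))) < 0.5 * np.max(np.abs(np.diff(u0)))
assert np.sum(u**2) < np.sum(u0**2)
```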
Example 2: Wave Equation. Solve $u_{tt} = c^2 u_{xx}$, $u(x, 0) = f(x)$, $u_t(x, 0) = 0$.
FT in $x$:
$$\hat u_{tt}(\omega, t) = -c^2\omega^2\,\hat u(\omega, t)$$
Solution:
$$\hat u(\omega, t) = \hat f(\omega)\cos(c\omega t) = \tfrac{1}{2}\hat f(\omega)\big(e^{ic\omega t} + e^{-ic\omega t}\big)$$
By d'Alembert's formula (inverting):
$$u(x, t) = \tfrac{1}{2}\big[f(x - ct) + f(x + ct)\big]$$
The wave equation propagates each frequency at the same speed $c$ - no dispersion. In contrast, the heat equation attenuates high frequencies faster than low frequencies - strong dissipation.
Example 3: Schrodinger Equation. The free-particle Schrodinger equation $i\hbar\psi_t = -\tfrac{\hbar^2}{2m}\psi_{xx}$ is formally a heat equation with imaginary time. Its solution shows that each frequency component $e^{ikx}$ oscillates (rather than decays) with frequency $\hbar k^2/2m$ - quantum dispersion. The uncertainty principle $\Delta x\,\Delta p \geq \hbar/2$ (where $p = \hbar k$ by de Broglie) is the physics version of the mathematical uncertainty principle in Section 6.
A.5 Windowed Fourier Transform (STFT)
The continuous Fourier Transform gives global frequency information but no time localization: tells you whether frequency is present somewhere in , but not when.
Definition A.2 (Short-Time Fourier Transform / Spectrogram). For a window function $g$ and signal $f$:
$$(V_g f)(\tau, \nu) = \int_{-\infty}^{\infty} f(t)\, \overline{g(t - \tau)}\, e^{-2\pi i\nu t}\, dt$$
This localizes the analysis around time $\tau$ using the window $g$. The spectrogram is $\lvert(V_g f)(\tau, \nu)\rvert^2$ - a time-frequency power map.
The uncertainty principle applies: The time resolution and frequency resolution of the STFT are determined by the window $g$:
- Narrow window -> good time resolution, poor frequency resolution
- Wide window -> poor time resolution, good frequency resolution
The Gaussian window minimizes the product $\sigma_t\sigma_\nu$ among all windows, making it the optimal STFT window (the resulting time-frequency representation is called the Gabor transform).
For AI - Audio Models: Whisper (Radford et al., 2022) processes audio as mel-spectrograms: it applies a 25ms STFT window (adequate time resolution for phoneme detection) with a 10ms hop, maps the frequency axis to the mel scale (logarithmic, matching human perception), and feeds the resulting 2D representation to a Transformer encoder. The STFT parameters are fixed - the uncertainty principle determines the time-frequency resolution tradeoff.
STFT TIME-FREQUENCY TRADEOFF (UNCERTAINTY PRINCIPLE IN PRACTICE)
========================================================================
Short window (e.g., 2ms): Long window (e.g., 50ms):
------------------------- --------------------------
Fine time resolution (~2ms) Coarse time resolution (~50ms)
Poor freq resolution (~500 Hz) Fine freq resolution (~20 Hz)
Good for: transients, clicks Good for: sustained tones, pitch
Whisper (speech): 25ms window -> resolves individual phonemes (50-100ms)
while achieving ~40Hz frequency resolution (enough for pitch/formants)
The Gaussian window achieves the minimum Δt·Δν product - it is the
optimal window under the uncertainty principle.
========================================================================
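A minimal STFT sketch (illustrative parameter choices; cf. `scipy.signal.stft` for a full implementation) showing the time-frequency localization the box describes:

```python
import numpy as np

def stft_power(x, win_len, hop, fs):
    """Minimal STFT power spectrogram with a Gaussian window (sketch)."""
    n = np.arange(win_len)
    g = np.exp(-0.5 * ((n - win_len / 2) / (win_len / 6)) ** 2)  # Gaussian window
    frames = np.array([x[i:i + win_len] * g
                       for i in range(0, len(x) - win_len + 1, hop)])
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2, freqs

fs = 8000
t = np.arange(fs) / fs
# 400 Hz tone in the first half-second, 1200 Hz in the second
x = np.where(t < 0.5, np.sin(2 * np.pi * 400 * t), np.sin(2 * np.pi * 1200 * t))
S, freqs = stft_power(x, win_len=256, hop=128, fs=fs)

# Each frame localizes the dominant frequency of its half of the signal
assert abs(freqs[np.argmax(S[2])] - 400) < 50    # early frame: ~400 Hz
assert abs(freqs[np.argmax(S[-3])] - 1200) < 50  # late frame: ~1200 Hz
```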
A.6 Fourier Transform and Operator Theory
From the perspective of functional analysis, the Fourier Transform is a unitary operator on the Hilbert space . This abstract viewpoint connects Fourier analysis to the broader theory of self-adjoint operators.
The Fourier operator $\mathcal{F}: L^2(\mathbb{R}) \to L^2(\mathbb{R})$ is unitary: $\mathcal{F}^* = \mathcal{F}^{-1}$.
Since $\mathcal{F}^4 = I$, the eigenvalues $\lambda$ satisfy $\lambda^4 = 1$: $\lambda \in \{1, -i, -1, i\}$.
The spectral decomposition of $L^2$ in terms of its eigenspaces: $L^2 = H_0 \oplus H_1 \oplus H_2 \oplus H_3$, where each $H_k$ is spanned by the Hermite functions $h_n$ with $n \equiv k \pmod 4$ (eigenvalue $(-i)^n$).
Connection to quantum mechanics: The position operator $(Xf)(x) = x f(x)$ and momentum operator $P = -i\,\frac{d}{dx}$ (in natural units) are both self-adjoint operators on $L^2$, related by $P = \mathcal{F}^{-1} X \mathcal{F}$. The commutator $[X, P] = iI$ implies the uncertainty principle $\sigma_X\sigma_P \geq \tfrac{1}{2}$. The mathematical identity and the physics are the same theorem.
For AI: The language of operator theory is increasingly used to analyze neural networks. The weight matrix of a linear layer is a linear operator; its spectral norm is its "size" as an operator; its singular value decomposition expresses it in terms of rank-1 operators. The FT connects linear operator theory to signal processing, unifying the analysis of both classical filters and neural network layers.