CNN and Convolution Math

Convolutional neural networks use local filters, shared weights, and spatial hierarchies to model images and other grid-like data efficiently.

Overview

The core 2D cross-correlation used in deep learning is:

$$ y[i,j]=\sum_{u,v}x[i+u,j+v]w[u,v]. $$

With channels, every output channel has a kernel over all input channels. Stride, padding, and dilation determine output shape. Stacking layers grows receptive fields. Pooling and strided convolution downsample. Residual blocks and normalization make deep CNNs trainable.

Prerequisites

Matrix and tensor shapes
Dot products and gradients
Basic neural-network loss functions
Some image or grid-data intuition

Learning Objectives

After this section, you should be able to:

Compute 1D and 2D convolution/cross-correlation by hand.
Calculate output sizes from kernel, padding, stride, and dilation.
Count convolution parameters and FLOPs.
Explain pooling, strided convolution, receptive fields, and dilation.
Derive kernel, input, and bias gradients at a high level.
Explain residual blocks, bottlenecks, normalization, and feature pyramids.
Connect CNN patch operations to vision transformer patch embeddings.
Build shape and receptive-field diagnostics for CNN models.

Convolution as Local Linear Map
1.1 Local receptive field
1.2 Weight sharing
1.3 Translation equivariance
1.4 Cross-correlation convention
1.5 Channels
Output Shape Arithmetic
2.1 Stride
2.2 Padding
2.3 Dilation
2.4 Output height
2.5 Same convolution
Parameter and FLOP Counts
3.1 Parameter count
3.2 Output elements
3.3 FLOPs
3.4 1 by 1 convolution
3.5 Depthwise separable convolution
Pooling and Downsampling
4.1 Max pooling
4.2 Average pooling
4.3 Strided convolution
4.4 Global average pooling
4.5 Aliasing
Receptive Field
5.1 Single layer
5.2 Stacked layers
5.3 Jump
5.4 Dilation effect
5.5 Effective receptive field
Backpropagation Through Convolution
6.1 Kernel gradient
6.2 Input gradient
6.3 Bias gradient
6.4 im2col view
6.5 Autodiff check
CNN Building Blocks
7.1 Conv activation norm
7.2 Residual block
7.3 Bottleneck block
7.4 Batch normalization
7.5 Dropout and augmentation
Vision Tasks
8.1 Classification
8.2 Detection
8.3 Segmentation
8.4 Feature pyramids
8.5 Transfer learning
CNNs and Modern AI
9.1 Inductive bias
9.2 Data efficiency
9.3 Hybrid models
9.4 Patch embedding
9.5 When CNNs still matter
Diagnostics
10.1 Shape checks
10.2 Kernel visualization
10.3 Activation statistics
10.4 Receptive field test
10.5 Ablations

Shape Map

image batch:      X      shape (N, C_in, H, W)
kernel bank:      W      shape (C_out, C_in, K_h, K_w)
feature map:      Y      shape (N, C_out, H_out, W_out)
classification:   logits shape (N, num_classes)
segmentation:     logits shape (N, num_classes, H, W)

1. Convolution as Local Linear Map

This part studies convolution as local linear map as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Local receptive field	each output sees a small input neighborhood	$K_h\times K_w$
Weight sharing	the same kernel is reused across spatial positions	$W_{u,v}$ shared
Translation equivariance	shifting the input shifts the feature map	$f(T_\Delta x)=T_\Delta f(x)$
Cross-correlation convention	deep learning libraries usually do not flip the kernel	$y[i,j]=\sum_{u,v}x[i+u,j+v]w[u,v]$
Channels	filters mix input channels into output channels	$W\in\mathbb{R}^{C_{out}\times C_{in}\times K_h\times K_w}$

1.1 Local receptive field

Main idea. Each output sees a small input neighborhood.

Core relation:

$$K_h\times K_w$$

Convolutional networks are built from local linear maps with shared weights. The math is simple but unforgiving: every padding, stride, dilation, and channel convention changes the output shape and the information path through the model.

Worked micro-example. A 3 by 3 convolution from 64 input channels to 128 output channels has $128\cdot64\cdot3\cdot3=73,728$ weights, independent of image height and width. A dense layer over a 224 by 224 image would scale with every pixel location.

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

Common mistake. Do not confuse mathematical convolution with the cross-correlation operation used by most deep learning libraries. The learned kernel adapts either way, but the indexing convention matters for hand calculations.

1.2 Weight sharing

Main idea. The same kernel is reused across spatial positions.

Core relation:

$$W_{u,v}$ shared$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

1.3 Translation equivariance

Main idea. Shifting the input shifts the feature map.

Core relation:

$$f(T_\Delta x)=T_\Delta f(x)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

1.4 Cross-correlation convention

Main idea. Deep learning libraries usually do not flip the kernel.

Core relation:

$$y[i,j]=\sum_{u,v}x[i+u,j+v]w[u,v]$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

1.5 Channels

Main idea. Filters mix input channels into output channels.

Core relation:

$$W\in\mathbb{R}^{C_{out}\times C_{in}\times K_h\times K_w}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

2. Output Shape Arithmetic

This part studies output shape arithmetic as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Stride	move the filter by more than one pixel	$S$
Padding	add border values to control output size	$P$
Dilation	space kernel taps apart	$D$
Output height	compute spatial dimension after convolution	$H_{out}=\lfloor(H+2P-D(K-1)-1)/S\rfloor+1$
Same convolution	choose padding so output size is preserved for stride one	$P=(K-1)/2$ for odd K

2.1 Stride

Main idea. Move the filter by more than one pixel.

Core relation:

$$S$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

2.2 Padding

Main idea. Add border values to control output size.

Core relation:

$$P$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

2.3 Dilation

Main idea. Space kernel taps apart.

Core relation:

$$D$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

2.4 Output height

Main idea. Compute spatial dimension after convolution.

Core relation:

$$H_{out}=\lfloor(H+2P-D(K-1)-1)/S\rfloor+1$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. Most CNN implementation bugs are shape arithmetic bugs.

2.5 Same convolution

Main idea. Choose padding so output size is preserved for stride one.

Core relation:

$$P=(K-1)/2$ for odd K$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

3. Parameter and FLOP Counts

This part studies parameter and flop counts as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Parameter count	kernel parameters do not depend on image size	$P=C_{out}C_{in}K_hK_w+C_{out}$
Output elements	number of spatial positions times output channels	$C_{out}H_{out}W_{out}$
FLOPs	each output element performs a dot product over channel and kernel axes	$O(C_{out}H_{out}W_{out}C_{in}K_hK_w)$
1 by 1 convolution	mix channels without spatial neighborhood	$K_h=K_w=1$
Depthwise separable convolution	factor spatial and channel mixing	$C_{in}K^2+C_{in}C_{out}$

3.1 Parameter count

Main idea. Kernel parameters do not depend on image size.

Core relation:

$$P=C_{out}C_{in}K_hK_w+C_{out}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is why CNNs can process large images with far fewer parameters than dense layers.

3.2 Output elements

Main idea. Number of spatial positions times output channels.

Core relation:

$$C_{out}H_{out}W_{out}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

3.3 FLOPs

Main idea. Each output element performs a dot product over channel and kernel axes.

Core relation:

$$O(C_{out}H_{out}W_{out}C_{in}K_hK_w)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

3.4 1 by 1 convolution

Main idea. Mix channels without spatial neighborhood.

Core relation:

$$K_h=K_w=1$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

3.5 Depthwise separable convolution

Main idea. Factor spatial and channel mixing.

Core relation:

$$C_{in}K^2+C_{in}C_{out}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

4. Pooling and Downsampling

This part studies pooling and downsampling as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Max pooling	take the largest value in a local window	$y=\max_{(u,v)\in R}x[u,v]$
Average pooling	average local values	$y=\|R\|^{-1}\sum_{(u,v)\in R}x[u,v]$
Strided convolution	learned downsampling alternative	$S>1$
Global average pooling	collapse spatial dimensions into channel statistics	$z_c=(HW)^{-1}\sum_{i,j}x_{c,i,j}$
Aliasing	downsampling without low-pass behavior can lose or distort information	$\mathrm{sample}\downarrow$

4.1 Max pooling

Main idea. Take the largest value in a local window.

Core relation:

$$y=\max_{(u,v)\in R}x[u,v]$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

4.2 Average pooling

Main idea. Average local values.

Core relation:

$$y=|R|^{-1}\sum_{(u,v)\in R}x[u,v]$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

4.3 Strided convolution

Main idea. Learned downsampling alternative.

Core relation:

$$S>1$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

4.4 Global average pooling

Main idea. Collapse spatial dimensions into channel statistics.

Core relation:

$$z_c=(HW)^{-1}\sum_{i,j}x_{c,i,j}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

4.5 Aliasing

Main idea. Downsampling without low-pass behavior can lose or distort information.

Core relation:

$$\mathrm{sample}\downarrow$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

5. Receptive Field

This part studies receptive field as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Single layer	kernel size determines immediate field	$R=K$
Stacked layers	receptive field grows through depth	$R_l=R_{l-1}+(K_l-1)J_{l-1}$
Jump	effective stride between neighboring output positions	$J_l=J_{l-1}S_l$
Dilation effect	dilation expands field without more parameters	$K_\mathrm{eff}=D(K-1)+1$
Effective receptive field	learned influence is often concentrated near the center	$\partial y/\partial x$

5.1 Single layer

Main idea. Kernel size determines immediate field.

Core relation:

$$R=K$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

5.2 Stacked layers

Main idea. Receptive field grows through depth.

Core relation:

$$R_l=R_{l-1}+(K_l-1)J_{l-1}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

5.3 Jump

Main idea. Effective stride between neighboring output positions.

Core relation:

$$J_l=J_{l-1}S_l$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

5.4 Dilation effect

Main idea. Dilation expands field without more parameters.

Core relation:

$$K_\mathrm{eff}=D(K-1)+1$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

5.5 Effective receptive field

Main idea. Learned influence is often concentrated near the center.

Core relation:

$$\partial y/\partial x$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

6. Backpropagation Through Convolution

This part studies backpropagation through convolution as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Kernel gradient	sum input patches weighted by output gradients	$\partial L/\partial W=\sum \delta y\cdot x_\mathrm{patch}$
Input gradient	spread output gradients back through kernel taps	$\partial L/\partial x$
Bias gradient	sum output gradients over batch and spatial axes	$\partial L/\partial b_c=\sum_{n,i,j}\delta y_{n,c,i,j}$
im2col view	convolution can be lowered to matrix multiplication	$Y=W_\mathrm{mat}X_\mathrm{col}$
Autodiff check	finite differences can verify small convolution gradients	$\frac{L(W+\epsilon)-L(W-\epsilon)}{2\epsilon}$

6.1 Kernel gradient

Main idea. Sum input patches weighted by output gradients.

Core relation:

$$\partial L/\partial W=\sum \delta y\cdot x_\mathrm{patch}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

6.2 Input gradient

Main idea. Spread output gradients back through kernel taps.

Core relation:

$$\partial L/\partial x$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

6.3 Bias gradient

Main idea. Sum output gradients over batch and spatial axes.

Core relation:

$$\partial L/\partial b_c=\sum_{n,i,j}\delta y_{n,c,i,j}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

6.4 im2col view

Main idea. Convolution can be lowered to matrix multiplication.

Core relation:

$$Y=W_\mathrm{mat}X_\mathrm{col}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

6.5 Autodiff check

Main idea. Finite differences can verify small convolution gradients.

Core relation:

$$\frac{L(W+\epsilon)-L(W-\epsilon)}{2\epsilon}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

7. CNN Building Blocks

This part studies cnn building blocks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Conv activation norm	standard block combines convolution, nonlinearity, and normalization	$\mathrm{Norm}(\phi(\mathrm{Conv}(x)))$
Residual block	learn a correction around identity	$y=x+F(x)$
Bottleneck block	use 1 by 1 convolutions to reduce and restore channels	$1\times1\rightarrow3\times3\rightarrow1\times1$
Batch normalization	normalize channel statistics over batch and spatial axes	$\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}$
Dropout and augmentation	regularize feature learning	$\tilde x=A(x)$

7.1 Conv activation norm

Main idea. Standard block combines convolution, nonlinearity, and normalization.

Core relation:

$$\mathrm{Norm}(\phi(\mathrm{Conv}(x)))$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

7.2 Residual block

Main idea. Learn a correction around identity.

Core relation:

$$y=x+F(x)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. Residual connections made very deep CNNs practical by preserving an identity path.

7.3 Bottleneck block

Main idea. Use 1 by 1 convolutions to reduce and restore channels.

Core relation:

$$1\times1\rightarrow3\times3\rightarrow1\times1$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

7.4 Batch normalization

Main idea. Normalize channel statistics over batch and spatial axes.

Core relation:

$$\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

7.5 Dropout and augmentation

Main idea. Regularize feature learning.

Core relation:

$$\tilde x=A(x)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

8. Vision Tasks

This part studies vision tasks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Classification	map image features to class logits	$p(y\mid x)=\mathrm{softmax}(Wh+b)$
Detection	predict boxes and classes over spatial anchors or queries	$(b_i,c_i)$
Segmentation	predict a class for each pixel	$p(y_{i,j}\mid x)$
Feature pyramids	combine multi-scale feature maps	$F_1,\ldots,F_L$
Transfer learning	reuse pretrained convolutional features	$\theta=\theta_0+\Delta\theta$

8.1 Classification

Main idea. Map image features to class logits.

Core relation:

$$p(y\mid x)=\mathrm{softmax}(Wh+b)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

8.2 Detection

Main idea. Predict boxes and classes over spatial anchors or queries.

Core relation:

$$(b_i,c_i)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

8.3 Segmentation

Main idea. Predict a class for each pixel.

Core relation:

$$p(y_{i,j}\mid x)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

8.4 Feature pyramids

Main idea. Combine multi-scale feature maps.

Core relation:

$$F_1,\ldots,F_L$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

8.5 Transfer learning

Main idea. Reuse pretrained convolutional features.

Core relation:

$$\theta=\theta_0+\Delta\theta$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

9. CNNs and Modern AI

This part studies cnns and modern ai as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Inductive bias	locality and translation equivariance suit images	$x_{i,j}$ neighborhoods
Data efficiency	weight sharing reduces sample complexity compared with dense layers	$P_\mathrm{conv}\ll P_\mathrm{dense}$
Hybrid models	modern vision systems often mix CNNs and attention	$\mathrm{Conv}+\mathrm{Attention}$
Patch embedding	ViT patch projection is a strided convolution view	$p=W\mathrm{vec}(\mathrm{patch})$
When CNNs still matter	edge vision and dense prediction often benefit from convolutional efficiency	$\mathrm{latency},\mathrm{memory}$

9.1 Inductive bias

Main idea. Locality and translation equivariance suit images.

Core relation:

$$x_{i,j}$ neighborhoods$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

9.2 Data efficiency

Main idea. Weight sharing reduces sample complexity compared with dense layers.

Core relation:

$$P_\mathrm{conv}\ll P_\mathrm{dense}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

9.3 Hybrid models

Main idea. Modern vision systems often mix cnns and attention.

Core relation:

$$\mathrm{Conv}+\mathrm{Attention}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

9.4 Patch embedding

Main idea. Vit patch projection is a strided convolution view.

Core relation:

$$p=W\mathrm{vec}(\mathrm{patch})$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This connects classical convolution math to modern vision transformers.

9.5 When CNNs still matter

Main idea. Edge vision and dense prediction often benefit from convolutional efficiency.

Core relation:

$$\mathrm{latency},\mathrm{memory}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

10. Diagnostics

This part studies diagnostics as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

Subtopic	Question	Formula
Shape checks	track batch, channel, height, and width explicitly	$(N,C,H,W)$
Kernel visualization	inspect early filters and feature maps	$W_{c,:,:}$
Activation statistics	dead channels or saturated activations reveal training issues	$\mu_c,\sigma_c$
Receptive field test	verify output depends on intended input region	$\partial y/\partial x$
Ablations	compare kernel size, stride, depthwise factorization, residuals, and normalization	$\Delta S,\Delta T$

10.1 Shape checks

Main idea. Track batch, channel, height, and width explicitly.

Core relation:

$$(N,C,H,W)$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

10.2 Kernel visualization

Main idea. Inspect early filters and feature maps.

Core relation:

$$W_{c,:,:}$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

10.3 Activation statistics

Main idea. Dead channels or saturated activations reveal training issues.

Core relation:

$$\mu_c,\sigma_c$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

10.4 Receptive field test

Main idea. Verify output depends on intended input region.

Core relation:

$$\partial y/\partial x$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. A receptive-field check catches accidental padding, stride, or dilation mistakes.

10.5 Ablations

Main idea. Compare kernel size, stride, depthwise factorization, residuals, and normalization.

Core relation:

$$\Delta S,\Delta T$$

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. This is a practical convolutional-model control variable.

Practice Exercises

Compute a 1D cross-correlation output.
Compute output size from kernel, padding, stride, and dilation.
Count convolution parameters.
Compare dense and convolutional parameter counts.
Compute max pooling and average pooling.
Compute receptive field through stacked layers.
Compute depthwise separable convolution parameters.
Build an im2col matrix for a tiny input.
Compute a residual block output.
Write a CNN debugging checklist.

Why This Matters for AI

Even in an LLM-heavy world, convolution math remains central for vision, audio, time series, edge models, segmentation, detection, multimodal encoders, and efficient local feature extraction. CNNs also teach key design principles: locality, weight sharing, equivariance, receptive fields, and hierarchy.

Bridge Forward

This completes the model-specific chapter arc: dense models, neural networks, probabilistic models, RNNs, transformers, reinforcement learning, generative models, and CNNs. The same accounting skills reappear in modern multimodal LLMs.

References

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", 1998: https://doi.org/10.1109/5.726791
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015: https://arxiv.org/abs/1512.03385
Stanford CS231n, "Convolutional Neural Networks": https://cs231n.github.io/convolutional-networks/

CNN and Convolution Math

Overview

Prerequisites

Learning Objectives

Table of Contents

Shape Map

1. Convolution as Local Linear Map

1.1 Local receptive field

1.2 Weight sharing

1.3 Translation equivariance

1.4 Cross-correlation convention

1.5 Channels

2. Output Shape Arithmetic

2.1 Stride

2.2 Padding

2.3 Dilation

2.4 Output height

2.5 Same convolution

3. Parameter and FLOP Counts

3.1 Parameter count

3.2 Output elements

3.3 FLOPs

3.4 1 by 1 convolution

3.5 Depthwise separable convolution

4. Pooling and Downsampling

4.1 Max pooling

4.2 Average pooling

4.3 Strided convolution

4.4 Global average pooling

4.5 Aliasing

5. Receptive Field

5.1 Single layer

5.2 Stacked layers

5.3 Jump

5.4 Dilation effect

5.5 Effective receptive field

6. Backpropagation Through Convolution

6.1 Kernel gradient

6.2 Input gradient

6.3 Bias gradient

6.4 im2col view

6.5 Autodiff check

7. CNN Building Blocks

7.1 Conv activation norm

7.2 Residual block

7.3 Bottleneck block

7.4 Batch normalization

7.5 Dropout and augmentation

8. Vision Tasks

8.1 Classification

8.2 Detection

8.3 Segmentation

8.4 Feature pyramids

8.5 Transfer learning

9. CNNs and Modern AI

9.1 Inductive bias

9.2 Data efficiency

9.3 Hybrid models

9.4 Patch embedding

9.5 When CNNs still matter

10. Diagnostics

10.1 Shape checks

10.2 Kernel visualization

10.3 Activation statistics

10.4 Receptive field test

10.5 Ablations

Practice Exercises

Why This Matters for AI

Bridge Forward

References