Convolutional neural networks use local filters, shared weights, and spatial hierarchies to model images and other grid-like data efficiently.
Overview
The core 2D cross-correlation used in deep learning is
$y_{i,j} = \sum_{u=0}^{K_h-1} \sum_{v=0}^{K_w-1} W_{u,v}\, x_{i+u,\, j+v} + b.$
With channels, every output channel has a kernel over all input channels. Stride, padding, and dilation determine output shape. Stacking layers grows receptive fields. Pooling and strided convolution downsample. Residual blocks and normalization make deep CNNs trainable.
Prerequisites
- Matrix and tensor shapes
- Dot products and gradients
- Basic neural-network loss functions
- Some image or grid-data intuition
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Demonstrates convolution indexing, output shapes, parameter counts, pooling, receptive fields, im2col, gradient checks, residuals, and patch embeddings. |
| exercises.ipynb | Ten practice problems for CNN shape and convolution arithmetic. |
Learning Objectives
After this section, you should be able to:
- Compute 1D and 2D convolution/cross-correlation by hand.
- Calculate output sizes from kernel, padding, stride, and dilation.
- Count convolution parameters and FLOPs.
- Explain pooling, strided convolution, receptive fields, and dilation.
- Derive kernel, input, and bias gradients at a high level.
- Explain residual blocks, bottlenecks, normalization, and feature pyramids.
- Connect CNN patch operations to vision transformer patch embeddings.
- Build shape and receptive-field diagnostics for CNN models.
Table of Contents
- Convolution as Local Linear Map
- 1.1 Local receptive field
- 1.2 Weight sharing
- 1.3 Translation equivariance
- 1.4 Cross-correlation convention
- 1.5 Channels
- Output Shape Arithmetic
- 2.1 Stride
- 2.2 Padding
- 2.3 Dilation
- 2.4 Output height
- 2.5 Same convolution
- Parameter and FLOP Counts
- 3.1 Parameter count
- 3.2 Output elements
- 3.3 FLOPs
- 3.4 1 by 1 convolution
- 3.5 Depthwise separable convolution
- Pooling and Downsampling
- 4.1 Max pooling
- 4.2 Average pooling
- 4.3 Strided convolution
- 4.4 Global average pooling
- 4.5 Aliasing
- Receptive Field
- 5.1 Single layer
- 5.2 Stacked layers
- 5.3 Jump
- 5.4 Dilation effect
- 5.5 Effective receptive field
- Backpropagation Through Convolution
- 6.1 Kernel gradient
- 6.2 Input gradient
- 6.3 Bias gradient
- 6.4 im2col view
- 6.5 Autodiff check
- CNN Building Blocks
- 7.1 Conv activation norm
- 7.2 Residual block
- 7.3 Bottleneck block
- 7.4 Batch normalization
- 7.5 Dropout and augmentation
- Vision Tasks
- 8.1 Classification
- 8.2 Detection
- 8.3 Segmentation
- 8.4 Feature pyramids
- 8.5 Transfer learning
- CNNs and Modern AI
- 9.1 Inductive bias
- 9.2 Data efficiency
- 9.3 Hybrid models
- 9.4 Patch embedding
- 9.5 When CNNs still matter
- Diagnostics
- 10.1 Shape checks
- 10.2 Kernel visualization
- 10.3 Activation statistics
- 10.4 Receptive field test
- 10.5 Ablations
Shape Map
image batch: X shape (N, C_in, H, W)
kernel bank: W shape (C_out, C_in, K_h, K_w)
feature map: Y shape (N, C_out, H_out, W_out)
classification: logits shape (N, num_classes)
segmentation: logits shape (N, num_classes, H, W)
1. Convolution as Local Linear Map
This part studies convolution as a local linear map, in the language of spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Local receptive field | each output sees a small input neighborhood | $y_{i,j}=\sum_{u,v} W_{u,v}\,x_{i+u,j+v}+b$ |
| Weight sharing | the same kernel is reused across spatial positions | $W$ shared over all $(i,j)$ |
| Translation equivariance | shifting the input shifts the feature map | $f(T_\delta x)=T_\delta f(x)$ |
| Cross-correlation convention | deep learning libraries usually do not flip the kernel | indexes $x_{i+u,j+v}$, not $x_{i-u,j-v}$ |
| Channels | filters mix input channels into output channels | $y_{c,i,j}=\sum_{c'}\sum_{u,v} W_{c,c',u,v}\,x_{c',i+u,j+v}+b_c$ |
1.1 Local receptive field
Main idea. Each output sees a small input neighborhood.
Core relation: $y_{i,j} = \sum_{u=0}^{K_h-1} \sum_{v=0}^{K_w-1} W_{u,v}\, x_{i+u,\, j+v} + b$; the sum runs only over a $K_h \times K_w$ window.
Convolutional networks are built from local linear maps with shared weights. The math is simple but unforgiving: every padding, stride, dilation, and channel convention changes the output shape and the information path through the model.
Worked micro-example. A 3 by 3 convolution from 64 input channels to 128 output channels has $3 \cdot 3 \cdot 64 \cdot 128 = 73{,}728$ weights (plus 128 biases), independent of image height and width. A dense layer over a 224 by 224 image would scale with every pixel location.
Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.
AI connection. Locality is why one convolutional layer can process large images with a small, fixed weight budget.
Common mistake. Do not confuse mathematical convolution with the cross-correlation operation used by most deep learning libraries. The learned kernel adapts either way, but the indexing convention matters for hand calculations.
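As a concreteness check, here is a minimal single-channel cross-correlation in NumPy. This is a sketch: the helper name `xcorr2d` is ours, and `scipy.signal.correlate2d` is used only to verify the loop version.

```python
import numpy as np
from scipy.signal import correlate2d

def xcorr2d(x, w, b=0.0):
    """Naive valid-mode 2D cross-correlation: y[i,j] = sum_{u,v} w[u,v] * x[i+u, j+v] + b."""
    H, W = x.shape
    Kh, Kw = w.shape
    Hout, Wout = H - Kh + 1, W - Kw + 1
    y = np.empty((Hout, Wout))
    for i in range(Hout):
        for j in range(Wout):
            y[i, j] = np.sum(w * x[i:i + Kh, j:j + Kw]) + b
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
assert np.allclose(xcorr2d(x, w), correlate2d(x, w, mode="valid"))
print(xcorr2d(x, w).shape)  # (4, 4): each output sees only a local 3x3 neighborhood
```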
1.2 Weight sharing
Main idea. The same kernel is reused across spatial positions.
Core relation: the same $W_{u,v}$ is applied at every spatial position $(i,j)$.
Because one kernel serves every location, the parameter count depends only on kernel size and channel counts, not on image size, and every position contributes gradient signal to the same weights.
1.3 Translation equivariance
Main idea. Shifting the input shifts the feature map.
Core relation: if $T_\delta$ shifts the input by $\delta$, then $f(T_\delta x) = T_\delta f(x)$ for stride-1 convolution, up to border effects.
A pattern detected at one location is detected the same way everywhere, which is exactly the symmetry that suits images.
1.4 Cross-correlation convention
Main idea. Deep learning libraries usually do not flip the kernel.
Core relation: libraries compute $y_{i,j} = \sum_{u,v} W_{u,v}\, x_{i+u,\, j+v}$ with no kernel flip; true mathematical convolution indexes $x_{i-u,\, j-v}$.
Common mistake. Do not confuse mathematical convolution with the cross-correlation operation used by most deep learning libraries. The learned kernel adapts either way, but the indexing convention matters for hand calculations.
1.5 Channels
Main idea. Filters mix input channels into output channels.
Core relation: $y_{c,i,j} = \sum_{c'=1}^{C_{in}} \sum_{u,v} W_{c,c',u,v}\, x_{c',i+u,\, j+v} + b_c$.
Each output channel has its own kernel over all input channels, so a layer maps $(N, C_{in}, H, W)$ to $(N, C_{out}, H_{out}, W_{out})$.
2. Output Shape Arithmetic
This part studies output shape arithmetic as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Stride | move the filter by more than one pixel | $x_{iS+u,\,jS+v}$ |
| Padding | add border values to control output size | $H \to H+2P$ |
| Dilation | space kernel taps apart | taps at $i+uD$, extent $D(K-1)+1$ |
| Output height | compute spatial dimension after convolution | $H_{out}=\lfloor (H+2P-D(K-1)-1)/S \rfloor + 1$ |
| Same convolution | choose padding so output size is preserved for stride one | $P=(K-1)/2$ for odd $K$ |
2.1 Stride
Main idea. Move the filter by more than one pixel.
Core relation: $y_{i,j} = \sum_{u,v} W_{u,v}\, x_{iS+u,\, jS+v} + b$ for stride $S$.
Stride $S$ evaluates every $S$-th window, dividing the output's spatial size by roughly $S$.
2.2 Padding
Main idea. Add border values to control output size.
Core relation: padding $P$ extends the input to height $H + 2P$ (and width $W + 2P$) before the window slides.
Zero padding is the common default; it controls output size and lets border pixels sit under the kernel center.
2.3 Dilation
Main idea. Space kernel taps apart.
Core relation: taps are spaced $D$ apart, so a kernel of size $K$ covers an effective extent of $D(K-1) + 1$ input pixels.
Dilation widens the field of view without adding parameters.
2.4 Output height
Main idea. Compute spatial dimension after convolution.
Core relation: $H_{out} = \left\lfloor \dfrac{H + 2P - D(K-1) - 1}{S} \right\rfloor + 1$.
Worked micro-example. $H=32$, $K=3$, $P=1$, $S=2$, $D=1$ gives $H_{out} = \lfloor (32 + 2 - 2 - 1)/2 \rfloor + 1 = 16$.
Implementation check. For every layer, write the tensor shape before and after. Verify output size with this formula before debugging the model logic.
AI connection. Most CNN implementation bugs are shape arithmetic bugs.
2.5 Same convolution
Main idea. Choose padding so output size is preserved for stride one.
Core relation: $P = (K-1)/2$ for odd $K$ keeps $H_{out} = H$ at stride 1.
For even $K$, exact "same" behavior requires asymmetric padding, which is one reason odd kernel sizes dominate in practice.
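The arithmetic from 2.1 through 2.5 fits in one helper. A minimal sketch (the function name `conv_out` is ours, not a library API):

```python
import math

def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    return math.floor((size + 2 * padding - dilation * (kernel - 1) - 1) / stride) + 1

print(conv_out(32, 3, stride=2, padding=1))    # 16, matching the worked micro-example
print(conv_out(224, 7, stride=2, padding=3))   # 112, the classic 7x7 stride-2 stem
print(conv_out(28, 3, padding=1))              # 28: 'same' padding P=(K-1)/2 at stride 1
print(conv_out(28, 3, dilation=2, padding=2))  # 28: dilation widens the tap span to 5
```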
3. Parameter and FLOP Counts
This part studies parameter and flop counts as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Parameter count | kernel parameters do not depend on image size | $C_{out}(C_{in}K_hK_w+1)$ |
| Output elements | number of spatial positions times output channels | $N\,C_{out}H_{out}W_{out}$ |
| FLOPs | each output element performs a dot product over channel and kernel axes | $\approx 2\,C_{in}K_hK_w \cdot C_{out}H_{out}W_{out}$ |
| 1 by 1 convolution | mix channels without spatial neighborhood | $K_h=K_w=1$ |
| Depthwise separable convolution | factor spatial and channel mixing | $C_{in}K_hK_w + C_{in}C_{out}$ |
3.1 Parameter count
Main idea. Kernel parameters do not depend on image size.
Core relation: $\text{params} = C_{out}\,(C_{in} K_h K_w + 1)$ with bias.
Worked micro-example. A 3 by 3 convolution from 64 input channels to 128 output channels has $128 \cdot (64 \cdot 9 + 1) = 73{,}856$ parameters, independent of image height and width. A dense layer over a 224 by 224 image would scale with every pixel location.
AI connection. This is why CNNs can process large images with far fewer parameters than dense layers.
3.2 Output elements
Main idea. Number of spatial positions times output channels.
Core relation: $\#\text{outputs} = N\, C_{out}\, H_{out} W_{out}$.
Activation memory scales with this count, which in early high-resolution layers usually dwarfs the parameter count.
3.3 FLOPs
Main idea. Each output element performs a dot product over channel and kernel axes.
Core relation: $\text{FLOPs} \approx 2\, C_{in} K_h K_w \cdot C_{out} H_{out} W_{out}$, counting one multiply and one add per kernel tap.
Unlike the parameter count, compute does scale with spatial size, so parameter and FLOP budgets can diverge sharply.
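A sketch of the counting rules from 3.1 and 3.3 (helper names are ours; the factor of 2 is the common multiply-plus-add convention):

```python
def conv_params(c_in, c_out, kh, kw, bias=True):
    """Parameters of one convolution layer."""
    return c_out * (c_in * kh * kw + (1 if bias else 0))

def conv_flops(c_in, c_out, kh, kw, h_out, w_out):
    """One multiply + one add per kernel tap, per output element."""
    return 2 * c_in * kh * kw * c_out * h_out * w_out

print(conv_params(64, 128, 3, 3))              # 73856, independent of image size
print(conv_flops(64, 128, 3, 3, 56, 56) / 1e9) # ~0.46 GFLOPs at 56x56: compute tracks resolution
```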
3.4 1 by 1 convolution
Main idea. Mix channels without spatial neighborhood.
Core relation: $y_{c,i,j} = \sum_{c'} W_{c,c'}\, x_{c',i,j} + b_c$, a per-pixel linear map over channels.
1 by 1 convolutions change channel width cheaply and appear in bottleneck blocks and projection shortcuts.
3.5 Depthwise separable convolution
Main idea. Factor spatial and channel mixing.
Core relation: $\text{params} = C_{in} K_h K_w + C_{in} C_{out}$ (a depthwise spatial filter followed by a 1 by 1 pointwise mix), versus $C_{in} C_{out} K_h K_w$ for a standard convolution.
Worked micro-example. For $C_{in}=64$, $C_{out}=128$, $K=3$: $64 \cdot 9 + 64 \cdot 128 = 8{,}768$ versus $73{,}728$ weights, roughly an 8x reduction.
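Comparing the two factorizations from 3.5 numerically, as a sketch (biases omitted, function names ours):

```python
def standard_params(c_in, c_out, k):
    return c_in * c_out * k * k            # full spatial-and-channel mixing

def depthwise_separable_params(c_in, c_out, k):
    return c_in * k * k + c_in * c_out     # depthwise spatial filter + 1x1 pointwise mix

std = standard_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep, round(std / sep, 1))       # 73728 8768 8.4
```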
4. Pooling and Downsampling
This part studies pooling and downsampling as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Max pooling | take the largest value in a local window | $y=\max_{\Omega} x$ |
| Average pooling | average local values | mean of $x$ over the window $\Omega$ |
| Strided convolution | learned downsampling alternative | convolution with $S>1$ |
| Global average pooling | collapse spatial dimensions into channel statistics | $y_c=\frac{1}{HW}\sum_{i,j} x_{c,i,j}$ |
| Aliasing | downsampling without low-pass behavior can lose or distort information | sub-Nyquist sampling folds frequencies |
4.1 Max pooling
Main idea. Take the largest value in a local window.
Core relation: $y_{i,j} = \max_{(u,v) \in \Omega} x_{iS+u,\, jS+v}$ over a local window $\Omega$.
Max pooling keeps the strongest local response, adds a little translation tolerance, and has no parameters.
4.2 Average pooling
Main idea. Average local values.
Core relation: $y_{i,j} = \frac{1}{|\Omega|} \sum_{(u,v)\in\Omega} x_{iS+u,\, jS+v}$.
Averaging smooths rather than selects; it is a fixed low-pass filter.
4.3 Strided convolution
Main idea. Learned downsampling alternative.
Core relation: a convolution with stride $S > 1$ filters and downsamples in one learned step.
Many modern CNNs replace pooling with strided convolution so that the downsampling weights are trained rather than fixed.
4.4 Global average pooling
Main idea. Collapse spatial dimensions into channel statistics.
Core relation: $y_c = \frac{1}{HW} \sum_{i,j} x_{c,i,j}$, collapsing $(C, H, W)$ to $(C,)$.
Global average pooling makes the classifier head independent of input resolution and removes most of the dense-head parameters.
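A sketch of non-overlapping pooling via a reshape trick, plus global average pooling (the helper `pool2d` is ours and assumes the spatial size is divisible by the window):

```python
import numpy as np

def pool2d(x, k, mode="max"):
    """Non-overlapping k x k pooling; x has shape (C, H, W) with H and W divisible by k."""
    C, H, W = x.shape
    x = x.reshape(C, H // k, k, W // k, k)
    return x.max(axis=(2, 4)) if mode == "max" else x.mean(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
print(pool2d(x, 2).shape)         # (3, 4, 4) max pooling
print(pool2d(x, 2, "avg").shape)  # (3, 4, 4) average pooling
gap = x.mean(axis=(1, 2))         # global average pooling: one statistic per channel
print(gap.shape)                  # (3,)
```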
4.5 Aliasing
Main idea. Downsampling without low-pass behavior can lose or distort information.
Core relation: sampling below the Nyquist rate folds high-frequency content into lower frequencies.
Strided layers and pooling with no low-pass behavior can make outputs change under one-pixel input shifts; anti-aliased (blur-pool) variants were proposed to mitigate this.
5. Receptive Field
This part studies receptive field as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Single layer | kernel size determines immediate field | $r = D(K-1)+1$ |
| Stacked layers | receptive field grows through depth | $r_l = r_{l-1} + (K_l-1)\,j_{l-1}$ |
| Jump | effective stride between neighboring output positions | $j_l = j_{l-1} S_l$ |
| Dilation effect | dilation expands field without more parameters | $K \to D(K-1)+1$ |
| Effective receptive field | learned influence is often concentrated near the center | roughly Gaussian influence profile |
5.1 Single layer
Main idea. Kernel size determines immediate field.
Core relation: $r = D(K-1) + 1$ for a single layer.
A lone 3 by 3 layer sees a 3 by 3 input patch per output position; dilation stretches that span without adding taps.
5.2 Stacked layers
Main idea. Receptive field grows through depth.
Core relation: $r_l = r_{l-1} + (K_l - 1)\, j_{l-1}$, where $j_{l-1}$ is the cumulative stride (jump) before layer $l$.
Worked micro-example. Two stacked 3 by 3 stride-1 layers give $r = 1 + 2 + 2 = 5$, matching one 5 by 5 layer with fewer parameters.
5.3 Jump
Main idea. Effective stride between neighboring output positions.
Core relation: $j_l = j_{l-1}\, S_l$.
The jump is the input-pixel distance between adjacent output positions; every stride in the stack multiplies it.
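The recursions from 5.1 through 5.3 in one small calculator, as a sketch (function name ours):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation). Returns (rf, jump) via the standard recursion."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1       # dilation stretches the kernel's span
        rf += (k_eff - 1) * jump      # growth is scaled by the cumulative stride
        jump *= s
    return rf, jump

print(receptive_field([(3, 1, 1)] * 3))         # (7, 1): rf grows 1 -> 3 -> 5 -> 7
print(receptive_field([(3, 2, 1), (3, 1, 1)]))  # (7, 2): an early stride makes later kernels count double
```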
5.4 Dilation effect
Main idea. Dilation expands field without more parameters.
Core relation: replacing $K$ with $D(K-1)+1$ in the receptive-field recursion shows that dilation grows the field with no new parameters.
Stacking dilations of 1, 2, 4, and so on grows the field exponentially with depth.
5.5 Effective receptive field
Main idea. Learned influence is often concentrated near the center.
Core relation: the gradient-based influence of input pixels on an output is roughly Gaussian and concentrated well inside the theoretical field.
The theoretical receptive field is an upper bound; measure effective influence empirically rather than trusting the arithmetic alone.
6. Backpropagation Through Convolution
This part studies backpropagation through convolution as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Kernel gradient | sum input patches weighted by output gradients | $\partial L/\partial W_{u,v}=\sum_{i,j}\delta_{i,j}\,x_{i+u,j+v}$ |
| Input gradient | spread output gradients back through kernel taps | $\partial L/\partial x_{m,n}=\sum_{u,v}\delta_{m-u,n-v}\,W_{u,v}$ |
| Bias gradient | sum output gradients over batch and spatial axes | $\partial L/\partial b=\sum_{n,i,j}\delta_{n,i,j}$ |
| im2col view | convolution can be lowered to matrix multiplication | $Y_{mat}=W_{mat}X_{col}$ |
| Autodiff check | finite differences can verify small convolution gradients | $(L(\theta+\epsilon)-L(\theta-\epsilon))/(2\epsilon)$ |
6.1 Kernel gradient
Main idea. Sum input patches weighted by output gradients.
Core relation: $\dfrac{\partial L}{\partial W_{u,v}} = \sum_{i,j} \delta_{i,j}\, x_{i+u,\, j+v}$, where $\delta = \partial L / \partial y$.
The kernel gradient is itself a cross-correlation: the input correlated with the output gradient.
6.2 Input gradient
Main idea. Spread output gradients back through kernel taps.
Core relation: $\dfrac{\partial L}{\partial x_{m,n}} = \sum_{u,v} \delta_{m-u,\, n-v}\, W_{u,v}$, a "full" convolution of $\delta$ with the kernel.
Each input pixel collects gradient from every output window that touched it.
6.3 Bias gradient
Main idea. Sum output gradients over batch and spatial axes.
Core relation: $\dfrac{\partial L}{\partial b_c} = \sum_{n,i,j} \delta_{n,c,i,j}$.
The bias appears additively in every output element, so its gradient simply sums over batch and spatial axes.
6.4 im2col view
Main idea. Convolution can be lowered to matrix multiplication.
Core relation: $Y_{mat} = W_{mat}\, X_{col}$, where $X_{col}$ stacks each $C_{in} K_h K_w$ input patch as a column and $W_{mat}$ has shape $(C_{out},\, C_{in} K_h K_w)$.
Lowering to matrix multiplication trades extra memory for the speed of highly tuned GEMM kernels.
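A sketch of the im2col lowering (valid mode, stride 1; function names are ours):

```python
import numpy as np

def im2col(x, kh, kw):
    """x: (C, H, W) -> columns (C*kh*kw, Hout*Wout), one valid patch per column."""
    C, H, W = x.shape
    Hout, Wout = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, Hout * Wout))
    for i in range(Hout):
        for j in range(Wout):
            cols[:, i * Wout + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))
w = rng.standard_normal((4, 2, 3, 3))                       # (C_out, C_in, Kh, Kw)
y = (w.reshape(4, -1) @ im2col(x, 3, 3)).reshape(4, 3, 3)   # the whole conv as one GEMM
print(y.shape)  # (4, 3, 3)
```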
6.5 Autodiff check
Main idea. Finite differences can verify small convolution gradients.
Core relation: $\dfrac{\partial L}{\partial \theta} \approx \dfrac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$ for each scalar parameter $\theta$.
On a tiny convolution in float64, the central difference should match the analytic gradient to several significant digits; a sketch follows.
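A minimal finite-difference check of the kernel gradient from 6.1, as a sketch using SciPy's `correlate2d` (the loss and upstream gradient `g` are arbitrary test fixtures):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
g = rng.standard_normal((4, 4))                 # upstream gradient dL/dy

def loss(w_):
    return np.sum(correlate2d(x, w_, mode="valid") * g)

analytic = correlate2d(x, g, mode="valid")      # dL/dW = x cross-correlated with dL/dy
numeric = np.zeros_like(w)
eps = 1e-6
for u in range(3):
    for v in range(3):
        wp, wm = w.copy(), w.copy()
        wp[u, v] += eps
        wm[u, v] -= eps
        numeric[u, v] = (loss(wp) - loss(wm)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))       # ~1e-9 in float64
```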
7. CNN Building Blocks
This part studies CNN building blocks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Conv activation norm | standard block combines convolution, nonlinearity, and normalization | $y=\phi(\mathrm{Norm}(W*x+b))$ |
| Residual block | learn a correction around identity | $y=x+F(x)$ |
| Bottleneck block | use 1 by 1 convolutions to reduce and restore channels | $1{\times}1 \to 3{\times}3 \to 1{\times}1$ |
| Batch normalization | normalize channel statistics over batch and spatial axes | $\hat{x}=(x-\mu_c)/\sqrt{\sigma_c^2+\epsilon}$ |
| Dropout and augmentation | regularize feature learning | zero with probability $p$, rescale by $1/(1-p)$ |
7.1 Conv activation norm
Main idea. Standard block combines convolution, nonlinearity, and normalization.
Core relation: $y = \phi(\mathrm{Norm}(W * x + b))$, convolution followed by normalization and a nonlinearity.
The exact ordering varies by architecture; whatever the order, track the $(N, C, H, W)$ shape through every step.
7.2 Residual block
Main idea. Learn a correction around identity.
Core relation: $y = x + F(x)$, where $F$ is a small stack of conv-norm-activation layers.
AI connection. Residual connections made very deep CNNs practical by preserving an identity path: if $F \approx 0$ the block is a no-op, and gradients flow unimpeded through the skip.
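A single-channel residual block, as a sketch (no normalization, "same" padding so the skip shapes match; helper names are ours):

```python
import numpy as np
from scipy.signal import correlate2d

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x), with F = conv -> ReLU -> conv; 'same' padding keeps shapes aligned."""
    h = relu(correlate2d(x, w1, mode="same"))
    return x + correlate2d(h, w2, mode="same")

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w1, w2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
print(residual_block(x, w1, w2).shape)                           # (8, 8): skip forces shape preservation
print(np.allclose(residual_block(x, w1, np.zeros((3, 3))), x))   # True: F ~ 0 reduces to identity
```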
7.3 Bottleneck block
Main idea. Use 1 by 1 convolutions to reduce and restore channels.
Core relation: a 1 by 1 reduction (for example $C \to C/4$), a 3 by 3 convolution at the reduced width, then a 1 by 1 restoration ($C/4 \to C$).
The reduction makes the expensive 3 by 3 operate on fewer channels, cutting parameters and FLOPs at similar accuracy.
7.4 Batch normalization
Main idea. Normalize channel statistics over batch and spatial axes.
Core relation: $\hat{x} = \dfrac{x - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$, $y = \gamma_c \hat{x} + \beta_c$, with $\mu_c$ and $\sigma_c^2$ computed per channel over batch and spatial axes.
Common mistake. Remember the train/eval distinction: training uses batch statistics, inference uses running averages.
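A sketch of the training-mode batchnorm forward pass, normalizing each channel over the $(N, H, W)$ axes (function name ours; running averages for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W). Normalize each channel over batch and spatial axes (training mode)."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # (1, C, 1, 1) per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((4, 2, 8, 8))
y = batchnorm_forward(x, np.ones(2), np.zeros(2))
print(y.mean(axis=(0, 2, 3)), y.std(axis=(0, 2, 3)))  # ~0 and ~1 per channel
```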
7.5 Dropout and augmentation
Main idea. Regularize feature learning.
Core relation: dropout zeroes activations with probability $p$ and rescales survivors by $1/(1-p)$; augmentation transforms inputs while keeping labels fixed.
For convolutional features, channel-wise (spatial) dropout and geometric or photometric augmentation are the common forms.
8. Vision Tasks
This part studies vision tasks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Classification | map image features to class logits | logits $(N, \text{num\_classes})$ |
| Detection | predict boxes and classes over spatial anchors or queries | class scores + box offsets per anchor/query |
| Segmentation | predict a class for each pixel | logits $(N, \text{num\_classes}, H, W)$ |
| Feature pyramids | combine multi-scale feature maps | $P_l=\mathrm{Conv}_{1\times1}(C_l)+\mathrm{Up}(P_{l+1})$ |
| Transfer learning | reuse pretrained convolutional features | pretrained backbone, new head |
8.1 Classification
Main idea. Map image features to class logits.
Core relation: $\text{logits} = W_{fc}\, \mathrm{GAP}(F(x)) + b$, with shape $(N, \text{num\_classes})$.
The convolutional backbone produces features; global pooling plus a small linear head produces class scores.
8.2 Detection
Main idea. Predict boxes and classes over spatial anchors or queries.
Core relation: per anchor or query, predict class scores plus box offsets $(\Delta x, \Delta y, \Delta w, \Delta h)$.
Dense detection heads run small convolutions over multi-scale feature maps, one prediction per spatial position.
8.3 Segmentation
Main idea. Predict a class for each pixel.
Core relation: logits have shape $(N, \text{num\_classes}, H, W)$, one class distribution per pixel.
Encoders downsample; decoders upsample (transposed convolution or interpolation) back to input resolution, often with skip connections.
8.4 Feature pyramids
Main idea. Combine multi-scale feature maps.
Core relation: $P_l = \mathrm{Conv}_{1\times1}(C_l) + \mathrm{Upsample}(P_{l+1})$, merging deep semantics with shallow resolution.
Small objects are read off fine pyramid levels, large objects off coarse ones.
8.5 Transfer learning
Main idea. Reuse pretrained convolutional features.
Core relation: reuse pretrained backbone weights; retrain or fine-tune only the task head.
Early convolutional filters (edges, textures) transfer broadly across vision tasks, which is why pretrained backbones are the default starting point.
9. CNNs and Modern AI
This part studies CNNs and modern AI as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Inductive bias | locality and translation equivariance suit images | local $x_{i,j}$ neighborhoods, shared $W$ |
| Data efficiency | weight sharing reduces sample complexity compared with dense layers | one $W$, gradients from all positions |
| Hybrid models | modern vision systems often mix CNNs and attention | conv stages + attention stages |
| Patch embedding | ViT patch projection is a strided convolution view | conv with $K=S=P_{patch}$ |
| When CNNs still matter | edge vision and dense prediction often benefit from convolutional efficiency | linear cost in pixels vs quadratic in tokens |
9.1 Inductive bias
Main idea. Locality and translation equivariance suit images.
Core relation: each output depends only on a local $x_{i,j}$ neighborhood, and the dependence is identical at every location.
These assumptions match natural image statistics, which is why convolution is a strong prior for grid data.
9.2 Data efficiency
Main idea. Weight sharing reduces sample complexity compared with dense layers.
Core relation: one shared kernel receives gradient signal from every spatial position of every image.
Weight sharing acts like built-in augmentation over positions, reducing sample complexity compared with dense layers.
9.3 Hybrid models
Main idea. Modern vision systems often mix CNNs and attention.
Core relation: convolutional stems or early stages supply cheap local features; attention layers supply global context at lower resolution.
Many strong vision backbones interleave the two rather than committing to one.
9.4 Patch embedding
Main idea. The ViT patch projection is a strided convolution in disguise.
Core relation: a convolution with kernel size and stride both equal to the patch size $P$ maps $(N, C, H, W)$ to $(N, d, H/P, W/P)$; flattening the spatial grid yields the token sequence.
AI connection. This connects classical convolution math to modern vision transformers; a sketch of the equivalence follows.
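A sketch of the patch-embedding equivalence: cutting non-overlapping patches, flattening each, and applying one linear map gives exactly what a conv with kernel = stride = patch size computes (function name ours, single image for simplicity):

```python
import numpy as np

def patch_embed(x, w):
    """x: (C, H, W); w: (d, C, P, P). Equivalent to a conv with kernel = stride = P."""
    d, C, P, _ = w.shape
    _, H, W = x.shape
    gh, gw = H // P, W // P
    # cut into non-overlapping P x P patches, flatten each, apply one shared linear map
    patches = x.reshape(C, gh, P, gw, P).transpose(1, 3, 0, 2, 4).reshape(gh * gw, C * P * P)
    return patches @ w.reshape(d, -1).T   # (num_tokens, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))
w = rng.standard_normal((16, 3, 8, 8))
tokens = patch_embed(x, w)
print(tokens.shape)  # (16, 16): a 4x4 grid of patches, each projected to d=16
```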
9.5 When CNNs still matter
Main idea. Edge vision and dense prediction often benefit from convolutional efficiency.
Core relation: convolutional cost grows linearly with pixel count, while global self-attention grows quadratically with token count.
Edge deployment, high-resolution dense prediction, and small-data regimes therefore still benefit from convolutional efficiency.
10. Diagnostics
This part studies diagnostics as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.
| Subtopic | Idea | Formula |
|---|---|---|
| Shape checks | track batch, channel, height, and width explicitly | $(N, C, H, W)$ at every layer |
| Kernel visualization | inspect early filters and feature maps | first-layer $W$ rendered as images |
| Activation statistics | dead channels or saturated activations reveal training issues | per-channel mean, variance, zero fraction |
| Receptive field test | verify output depends on intended input region | perturb a pixel, diff the outputs |
| Ablations | compare kernel size, stride, depthwise factorization, residuals, and normalization | one factor at a time |
10.1 Shape checks
Main idea. Track batch, channel, height, and width explicitly.
Core relation: track $(N, C, H, W)$ through every layer with the output-size formula from Part 2.
Assert expected shapes at block boundaries instead of waiting for a downstream crash; a sketch of such a checker follows.
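A sketch of a shape-tracing diagnostic built from the output-size formula (the layer specs and names here are illustrative, not a library API):

```python
import math

def trace_shapes(hw, layers):
    """Print (H, W) after each layer; layers: list of (name, kernel, stride, padding, dilation)."""
    h, w = hw
    for name, k, s, p, d in layers:
        out = lambda n: math.floor((n + 2 * p - d * (k - 1) - 1) / s) + 1
        h, w = out(h), out(w)
        assert h > 0 and w > 0, f"{name}: spatial size collapsed to {h}x{w}"
        print(f"{name:10s} -> {h} x {w}")

trace_shapes((224, 224), [
    ("stem", 7, 2, 3, 1),     # 112 x 112
    ("pool", 3, 2, 1, 1),     # 56 x 56
    ("conv3x3", 3, 1, 1, 1),  # 56 x 56
])
```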
10.2 Kernel visualization
Main idea. Inspect early filters and feature maps.
Core relation: render first-layer kernels as small images and intermediate feature maps as heatmaps.
Healthy first layers often resemble edge, blob, and color-contrast filters; structureless noise after training suggests an optimization or data problem.
10.3 Activation statistics
Main idea. Dead channels or saturated activations reveal training issues.
Core relation: track per-channel mean, variance, and zero fraction (for ReLU) on a held-out batch.
Channels that are always zero are dead; variances that explode or vanish with depth point to initialization or normalization issues.
10.4 Receptive field test
Main idea. Verify output depends on intended input region.
Core relation: perturb one input pixel and record which output positions change.
AI connection. A receptive-field check catches accidental padding, stride, or dilation mistakes; a sketch follows.
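A perturbation-based receptive-field probe, as a sketch (two random 3 by 3 layers kept linear for clarity, so the changed region is exactly the theoretical field):

```python
import numpy as np
from scipy.signal import correlate2d

def forward(x, kernels):
    """A toy stack of stride-1 'same' convolutions."""
    for w in kernels:
        x = correlate2d(x, w, mode="same")
    return x

rng = np.random.default_rng(0)
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]
x = rng.standard_normal((9, 9))

# perturb the center pixel and see which output positions move
x2 = x.copy()
x2[4, 4] += 1.0
changed = np.abs(forward(x2, kernels) - forward(x, kernels)) > 1e-12
rows, cols = np.where(changed)
print(rows.min(), rows.max(), cols.min(), cols.max())  # 2 6 2 6: a 5x5 region, as two 3x3 layers predict
```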
10.5 Ablations
Main idea. Compare kernel size, stride, depthwise factorization, residuals, and normalization.
Core relation: vary one design factor at a time at matched parameter and FLOP budgets.
Kernel size, stride placement, depthwise factorization, residual connections, and normalization are the highest-leverage axes to ablate.
Practice Exercises
- Compute a 1D cross-correlation output.
- Compute output size from kernel, padding, stride, and dilation.
- Count convolution parameters.
- Compare dense and convolutional parameter counts.
- Compute max pooling and average pooling.
- Compute receptive field through stacked layers.
- Compute depthwise separable convolution parameters.
- Build an im2col matrix for a tiny input.
- Compute a residual block output.
- Write a CNN debugging checklist.
Why This Matters for AI
Even in an LLM-heavy world, convolution math remains central for vision, audio, time series, edge models, segmentation, detection, multimodal encoders, and efficient local feature extraction. CNNs also teach key design principles: locality, weight sharing, equivariance, receptive fields, and hierarchy.
Bridge Forward
This completes the model-specific chapter arc: dense models, neural networks, probabilistic models, RNNs, transformers, reinforcement learning, generative models, and CNNs. The same accounting skills reappear in modern multimodal LLMs.
References
- Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", 1998: https://doi.org/10.1109/5.726791
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
- Kaiming He et al., "Deep Residual Learning for Image Recognition", 2015: https://arxiv.org/abs/1512.03385
- Stanford CS231n, "Convolutional Neural Networks": https://cs231n.github.io/convolutional-networks/