CNN and Convolution Math

Convolutional neural networks use local filters, shared weights, and spatial hierarchies to model images and other grid-like data efficiently.

Overview

The core 2D cross-correlation used in deep learning is:

y[i,j] = \sum_{u,v} x[i+u, j+v]\, w[u,v].

With channels, every output channel has a kernel over all input channels. Stride, padding, and dilation determine output shape. Stacking layers grows receptive fields. Pooling and strided convolution downsample. Residual blocks and normalization make deep CNNs trainable.
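
The per-element definition above can be checked with a direct NumPy loop; this is a minimal sketch for hand verification, not an efficient kernel:

```python
import numpy as np

def cross_correlate2d(x, w):
    """Valid-mode 2D cross-correlation: y[i, j] = sum_{u,v} x[i+u, j+v] * w[u, v]."""
    Kh, Kw = w.shape
    H_out, W_out = x.shape[0] - Kh + 1, x.shape[1] - Kw + 1
    y = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            y[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * w)  # dot product with one patch
    return y

x = np.arange(16.0).reshape(4, 4)    # a 4x4 input
w = np.ones((3, 3))                  # box filter: each output is a patch sum
y = cross_correlate2d(x, w)          # shape (2, 2)
```

With K = 3 and no padding, the 4 by 4 input shrinks to 2 by 2, matching the shape arithmetic in section 2.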

Prerequisites

  • Matrix and tensor shapes
  • Dot products and gradients
  • Basic neural-network loss functions
  • Some image or grid-data intuition

Companion Notebooks

  • theory.ipynb: demonstrates convolution indexing, output shapes, parameter counts, pooling, receptive fields, im2col, gradient checks, residuals, and patch embeddings.
  • exercises.ipynb: ten practice problems for CNN shape and convolution arithmetic.

Learning Objectives

After this section, you should be able to:

  • Compute 1D and 2D convolution/cross-correlation by hand.
  • Calculate output sizes from kernel, padding, stride, and dilation.
  • Count convolution parameters and FLOPs.
  • Explain pooling, strided convolution, receptive fields, and dilation.
  • Derive kernel, input, and bias gradients at a high level.
  • Explain residual blocks, bottlenecks, normalization, and feature pyramids.
  • Connect CNN patch operations to vision transformer patch embeddings.
  • Build shape and receptive-field diagnostics for CNN models.

Table of Contents

  1. Convolution as Local Linear Map
  2. Output Shape Arithmetic
  3. Parameter and FLOP Counts
  4. Pooling and Downsampling
  5. Receptive Field
  6. Backpropagation Through Convolution
  7. CNN Building Blocks
  8. Vision Tasks
  9. CNNs and Modern AI
  10. Diagnostics

Shape Map

image batch:      X      shape (N, C_in, H, W)
kernel bank:      W      shape (C_out, C_in, K_h, K_w)
feature map:      Y      shape (N, C_out, H_out, W_out)
classification:   logits shape (N, num_classes)
segmentation:     logits shape (N, num_classes, H, W)

1. Convolution as Local Linear Map

This part treats convolution as a local linear map, expressed in spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

  • Local receptive field: each output sees a small input neighborhood; K_h \times K_w
  • Weight sharing: the same kernel is reused across spatial positions; W_{u,v} shared
  • Translation equivariance: shifting the input shifts the feature map; f(T_\Delta x) = T_\Delta f(x)
  • Cross-correlation convention: deep learning libraries usually do not flip the kernel; y[i,j] = \sum_{u,v} x[i+u,j+v]\, w[u,v]
  • Channels: filters mix input channels into output channels; W \in \mathbb{R}^{C_{out}\times C_{in}\times K_h\times K_w}

1.1 Local receptive field

Main idea. Each output sees a small input neighborhood.

Core relation:

K_h \times K_w

Convolutional networks are built from local linear maps with shared weights. The math is simple but unforgiving: every padding, stride, dilation, and channel convention changes the output shape and the information path through the model.

Worked micro-example. A 3 by 3 convolution from 64 input channels to 128 output channels has 128\cdot64\cdot3\cdot3 = 73,728 weights, independent of image height and width. A dense layer over a 224 by 224 image would scale with every pixel location.

Implementation check. For every layer, write the tensor shape before and after. Verify output size with the shape formula before debugging the model logic.

AI connection. Locality is a practical control variable in convolutional models: kernel size sets how much context each output position sees.

Common mistake. Do not confuse mathematical convolution with the cross-correlation operation used by most deep learning libraries. The learned kernel adapts either way, but the indexing convention matters for hand calculations.

1.2 Weight sharing

Main idea. The same kernel is reused across spatial positions.

Core relation:

W_{u,v} shared

1.3 Translation equivariance

Main idea. Shifting the input shifts the feature map.

Core relation:

f(T_\Delta x) = T_\Delta f(x)

1.4 Cross-correlation convention

Main idea. Deep learning libraries usually do not flip the kernel.

Core relation:

y[i,j] = \sum_{u,v} x[i+u, j+v]\, w[u,v]
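
To see the convention concretely: true convolution is cross-correlation with the kernel flipped on both spatial axes. A NumPy sketch (the helper name is illustrative):

```python
import numpy as np

def xcorr2d(x, w):
    """Valid-mode cross-correlation, no kernel flip."""
    Kh, Kw = w.shape
    return np.array([[np.sum(x[i:i + Kh, j:j + Kw] * w)
                      for j in range(x.shape[1] - Kw + 1)]
                     for i in range(x.shape[0] - Kh + 1)])

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
w = rng.standard_normal((3, 3))

xc = xcorr2d(x, w)               # what most deep learning libraries compute
conv = xcorr2d(x, np.flip(w))    # mathematical convolution: flip both kernel axes
```

For an asymmetric kernel the two outputs differ; a learned kernel absorbs the flip, so only hand calculations need to care.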

1.5 Channels

Main idea. Filters mix input channels into output channels.

Core relation:

W \in \mathbb{R}^{C_{out}\times C_{in}\times K_h\times K_w}
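
The channel bookkeeping can be made explicit by lowering the convolution to patch extraction plus one tensor contraction; the sketch below is illustrative, with arbitrary sizes:

```python
import numpy as np

N, C_in, H, W_img = 2, 3, 8, 8
C_out, Kh, Kw = 4, 3, 3
x = np.random.default_rng(1).standard_normal((N, C_in, H, W_img))
w = np.random.default_rng(2).standard_normal((C_out, C_in, Kh, Kw))

# All Kh x Kw patches: shape (N, C_in, H_out, W_out, Kh, Kw)
patches = np.lib.stride_tricks.sliding_window_view(x, (Kh, Kw), axis=(2, 3))
# Every output channel contracts over input channels and kernel taps.
y = np.einsum('nchwij,ocij->nohw', patches, w)   # (N, C_out, H_out, W_out)
```

Each output position is one dot product of length C_in * Kh * Kw, which is exactly the im2col view used later in section 6.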

2. Output Shape Arithmetic

This part works through output shape arithmetic. Keep track of axes, output shapes, receptive fields, and parameter sharing.

  • Stride: move the filter by more than one pixel; S
  • Padding: add border values to control output size; P
  • Dilation: space kernel taps apart; D
  • Output height: spatial size after convolution; H_{out} = \lfloor (H + 2P - D(K-1) - 1)/S \rfloor + 1
  • Same convolution: padding that preserves size at stride one; P = (K-1)/2 for odd K

2.1 Stride

Main idea. Move the filter by more than one pixel.

Core relation:

S

2.2 Padding

Main idea. Add border values to control output size.

Core relation:

P

2.3 Dilation

Main idea. Space kernel taps apart.

Core relation:

D

2.4 Output height

Main idea. Compute spatial dimension after convolution.

Core relation:

H_{out} = \lfloor (H + 2P - D(K-1) - 1)/S \rfloor + 1
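
The floor formula is worth wrapping in a helper so every layer's output size can be checked before running the model; a sketch:

```python
def conv_out_size(H, K, P=0, S=1, D=1):
    """H_out = floor((H + 2*P - D*(K-1) - 1) / S) + 1."""
    return (H + 2 * P - D * (K - 1) - 1) // S + 1

same = conv_out_size(224, K=3, P=1)          # 'same' convolution keeps 224
half = conv_out_size(224, K=3, P=1, S=2)     # strided downsample to 112
dil  = conv_out_size(224, K=3, P=2, D=2)     # dilation 2 with matching padding keeps 224
```

Apply the same function independently to height and width when the two differ.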

AI connection. Most CNN implementation bugs are shape arithmetic bugs.

2.5 Same convolution

Main idea. Choose padding so output size is preserved for stride one.

Core relation:

P = (K-1)/2 for odd K

3. Parameter and FLOP Counts

This part works through parameter and FLOP counting. Keep track of axes, output shapes, receptive fields, and parameter sharing.

  • Parameter count: kernel parameters do not depend on image size; P = C_{out} C_{in} K_h K_w + C_{out}
  • Output elements: spatial positions times output channels; C_{out} H_{out} W_{out}
  • FLOPs: each output element is a dot product over channel and kernel axes; O(C_{out} H_{out} W_{out} C_{in} K_h K_w)
  • 1 by 1 convolution: mix channels without a spatial neighborhood; K_h = K_w = 1
  • Depthwise separable convolution: factor spatial and channel mixing; C_{in} K^2 + C_{in} C_{out}

3.1 Parameter count

Main idea. Kernel parameters do not depend on image size.

Core relation:

P = C_{out} C_{in} K_h K_w + C_{out}
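
For instance, a 3 by 3 convolution from 64 to 128 channels has 73,728 weights plus 128 biases; a one-line counter (a sketch) reproduces this:

```python
def conv_params(C_in, C_out, K_h, K_w, bias=True):
    """P = C_out * C_in * K_h * K_w, plus C_out bias terms if present."""
    return C_out * C_in * K_h * K_w + (C_out if bias else 0)

weights_only = conv_params(64, 128, 3, 3, bias=False)   # 73,728
with_bias    = conv_params(64, 128, 3, 3)               # 73,856
```

Forgetting the bias term is a common off-by-C_out slip when auditing model sizes.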

AI connection. This is why CNNs can process large images with far fewer parameters than dense layers.

3.2 Output elements

Main idea. Number of spatial positions times output channels.

Core relation:

C_{out} H_{out} W_{out}

3.3 FLOPs

Main idea. Each output element performs a dot product over channel and kernel axes.

Core relation:

O(C_{out} H_{out} W_{out} C_{in} K_h K_w)
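
Cost scales with both the output grid and the kernel volume. A rough multiply-accumulate counter (a sketch; doubling MACs gives FLOPs under the usual convention):

```python
def conv_macs(C_in, C_out, H_out, W_out, K_h, K_w):
    """One dot product of length C_in * K_h * K_w per output element."""
    return C_out * H_out * W_out * C_in * K_h * K_w

macs = conv_macs(64, 128, 56, 56, 3, 3)   # roughly 231 million MACs for one layer
```

Note the contrast with parameter count: halving spatial resolution quarters the MACs but leaves the parameters unchanged.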

3.4 1 by 1 convolution

Main idea. Mix channels without spatial neighborhood.

Core relation:

K_h = K_w = 1

3.5 Depthwise separable convolution

Main idea. Factor spatial and channel mixing.

Core relation:

C_{in} K^2 + C_{in} C_{out}
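
The savings from the factorization are easy to quantify: a depthwise spatial stage plus a 1 by 1 pointwise stage replaces the full kernel bank. A sketch, ignoring bias terms:

```python
def standard_conv_params(C_in, C_out, K):
    """Full kernel bank: every output channel sees every input channel spatially."""
    return C_out * C_in * K * K

def separable_conv_params(C_in, C_out, K):
    """Depthwise K x K stage + pointwise 1x1 channel-mixing stage."""
    return C_in * K * K + C_in * C_out

full = standard_conv_params(64, 128, 3)    # 73,728
sep  = separable_conv_params(64, 128, 3)   # 8,768 (roughly 8x fewer)
```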

4. Pooling and Downsampling

This part covers pooling and downsampling. Keep track of axes, output shapes, receptive fields, and parameter sharing.

  • Max pooling: take the largest value in a local window; y = \max_{(u,v)\in R} x[u,v]
  • Average pooling: average local values; y = |R|^{-1} \sum_{(u,v)\in R} x[u,v]
  • Strided convolution: learned downsampling alternative; S > 1
  • Global average pooling: collapse spatial dimensions into channel statistics; z_c = (HW)^{-1} \sum_{i,j} x_{c,i,j}
  • Aliasing: downsampling without low-pass behavior can lose or distort information; \mathrm{sample}\downarrow

4.1 Max pooling

Main idea. Take the largest value in a local window.

Core relation:

y = \max_{(u,v)\in R} x[u,v]

4.2 Average pooling

Main idea. Average local values.

Core relation:

y = |R|^{-1} \sum_{(u,v)\in R} x[u,v]
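
For non-overlapping windows, max and average pooling both reduce to a reshape plus a reduction; a NumPy sketch on a 4 by 4 map with 2 by 2 windows:

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)
# Group into 2x2 windows: (4, 4) -> (2, 2, 2, 2) -> (2, 2, 4)
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
max_pool = windows.max(axis=-1)    # largest value per window
avg_pool = windows.mean(axis=-1)   # mean value per window
```

The transpose is the easy step to get wrong: it regroups axes so each length-4 slice really is one spatial window.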

4.3 Strided convolution

Main idea. Learned downsampling alternative.

Core relation:

S > 1

4.4 Global average pooling

Main idea. Collapse spatial dimensions into channel statistics.

Core relation:

z_c = (HW)^{-1} \sum_{i,j} x_{c,i,j}

4.5 Aliasing

Main idea. Downsampling without low-pass behavior can lose or distort information.

Core relation:

\mathrm{sample}\downarrow

5. Receptive Field

This part covers the receptive field. Keep track of axes, output shapes, receptive fields, and parameter sharing.

  • Single layer: kernel size determines the immediate field; R = K
  • Stacked layers: receptive field grows through depth; R_l = R_{l-1} + (K_l - 1) J_{l-1}
  • Jump: effective stride between neighboring output positions; J_l = J_{l-1} S_l
  • Dilation effect: dilation expands the field without more parameters; K_\mathrm{eff} = D(K-1) + 1
  • Effective receptive field: learned influence is often concentrated near the center; \partial y / \partial x

5.1 Single layer

Main idea. Kernel size determines immediate field.

Core relation:

R = K

5.2 Stacked layers

Main idea. Receptive field grows through depth.

Core relation:

R_l = R_{l-1} + (K_l - 1) J_{l-1}

5.3 Jump

Main idea. Effective stride between neighboring output positions.

Core relation:

J_l = J_{l-1} S_l
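
The receptive field and jump recursions combine into a short calculator; layers are (kernel, stride) pairs from input to output (a sketch):

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride) pairs, input to output order."""
    R, J = 1, 1                      # a raw input pixel sees only itself
    for K, S in layers:
        R = R + (K - 1) * J          # new kernel taps land J input pixels apart
        J = J * S                    # downsampling multiplies the jump
    return R

three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])   # stacking grows the field linearly
strided   = receptive_field([(7, 2), (3, 1)])           # an early stride amplifies later growth
```

For dilated layers, substitute K_eff = D*(K-1)+1 for the kernel size before calling the helper.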

5.4 Dilation effect

Main idea. Dilation expands field without more parameters.

Core relation:

K_\mathrm{eff} = D(K-1) + 1

5.5 Effective receptive field

Main idea. Learned influence is often concentrated near the center.

Core relation:

\partial y / \partial x

6. Backpropagation Through Convolution

This part studies backpropagation through convolution as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Kernel gradient | sum input patches weighted by output gradients | $\partial L/\partial W=\sum \delta y\cdot x_\mathrm{patch}$ |
| Input gradient | spread output gradients back through kernel taps | $\partial L/\partial x$ |
| Bias gradient | sum output gradients over batch and spatial axes | $\partial L/\partial b_c=\sum_{n,i,j}\delta y_{n,c,i,j}$ |
| im2col view | convolution can be lowered to matrix multiplication | $Y=W_\mathrm{mat}X_\mathrm{col}$ |
| Autodiff check | finite differences can verify small convolution gradients | $\frac{L(W+\epsilon)-L(W-\epsilon)}{2\epsilon}$ |

6.1 Kernel gradient

Main idea. Sum input patches weighted by output gradients.

Core relation:

$\partial L/\partial W=\sum \delta y\cdot x_\mathrm{patch}$

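On a tiny 1D case this rule can be verified directly with NumPy (an illustrative sketch; the variable names are ours, not from the notebooks):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)        # 1D input
w = rng.standard_normal(3)        # 3-tap kernel
n_out = x.size - w.size + 1       # valid cross-correlation length

# Forward: y[i] = sum_u x[i+u] * w[u]
y = np.array([x[i:i + w.size] @ w for i in range(n_out)])

dy = rng.standard_normal(n_out)   # upstream gradient dL/dy

# dL/dw[u] = sum_i dy[i] * x[i+u]: a cross-correlation of the
# input with the output gradient, one sum per kernel tap.
dw = np.array([x[u:u + n_out] @ dy for u in range(w.size)])

# Each tap's gradient matches the explicit double sum.
dw_loop = np.zeros(3)
for u in range(3):
    for i in range(n_out):
        dw_loop[u] += dy[i] * x[i + u]
assert np.allclose(dw, dw_loop)
```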

6.2 Input gradient

Main idea. Spread output gradients back through kernel taps.

Core relation:

$\partial L/\partial x$

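The scatter picture and the flipped-kernel closed form agree on a small 1D example (a minimal sketch, assuming valid cross-correlation in the forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
x_len, k = 8, 3
w = rng.standard_normal(k)
n_out = x_len - k + 1
dy = rng.standard_normal(n_out)   # upstream gradient dL/dy

# Scatter view: output gradient dy[i] flows back to the inputs
# x[i..i+k-1] that produced y[i], weighted by the kernel taps.
dx = np.zeros(x_len)
for i in range(n_out):
    dx[i:i + k] += dy[i] * w

# Equivalent closed form: a "full" convolution of dy with w.
# (np.convolve flips its second argument, which is exactly the
# flip-the-kernel rule for backprop through cross-correlation.)
dx_conv = np.convolve(dy, w, mode="full")
assert np.allclose(dx, dx_conv)
```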

6.3 Bias gradient

Main idea. Sum output gradients over batch and spatial axes.

Core relation:

$\partial L/\partial b_c=\sum_{n,i,j}\delta y_{n,c,i,j}$

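In `(N, C, H, W)` layout this is a single reduction (a minimal sketch with arbitrary numbers):

```python
import numpy as np

# Output gradient in (N, C, H, W) layout; values are arbitrary.
dy = np.arange(48.0).reshape(2, 3, 2, 4)

# One bias per output channel, so sum over batch and spatial axes.
db = dy.sum(axis=(0, 2, 3))
print(db.shape)  # (3,)
```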

6.4 im2col view

Main idea. Convolution can be lowered to matrix multiplication.

Core relation:

$Y=W_\mathrm{mat}X_\mathrm{col}$

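The lowering can be demonstrated on a single-channel toy input (an illustrative sketch; `im2col` here handles only stride 1 and valid padding):

```python
import numpy as np

def im2col(x, k):
    """Stack every k-by-k patch of a 2D array as a column."""
    H, W = x.shape
    cols = [x[i:i + k, j:j + k].ravel()
            for i in range(H - k + 1) for j in range(W - k + 1)]
    return np.stack(cols, axis=1)   # shape (k*k, H_out*W_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
w = rng.standard_normal((3, 3))

# Direct valid cross-correlation.
y_direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * w)
                      for j in range(3)] for i in range(3)])

# Lowered form: one matrix multiply, Y = W_mat @ X_col.
y_mm = (w.ravel() @ im2col(x, 3)).reshape(3, 3)

assert np.allclose(y_direct, y_mm)
```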

6.5 Autodiff check

Main idea. Finite differences can verify small convolution gradients.

Core relation:

$\frac{L(W+\epsilon)-L(W-\epsilon)}{2\epsilon}$

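A full central-difference check on a tiny 2D convolution (a sketch under our own loss choice $L=\frac{1}{2}\sum y^2$, for which $\partial L/\partial y = y$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))

def conv2d_valid(x_, w_):
    H, W = x_.shape
    k = w_.shape[0]
    return np.array([[np.sum(x_[i:i + k, j:j + k] * w_)
                      for j in range(W - k + 1)]
                     for i in range(H - k + 1)])

def loss(w_):
    return 0.5 * np.sum(conv2d_valid(x, w_) ** 2)

# Analytic kernel gradient: correlate the input with dL/dy = y.
dy = conv2d_valid(x, w)
grad = np.array([[np.sum(x[u:u + dy.shape[0], v:v + dy.shape[1]] * dy)
                  for v in range(3)] for u in range(3)])

# Central finite difference on every tap.
eps = 1e-6
fd = np.zeros_like(w)
for u in range(3):
    for v in range(3):
        wp, wm = w.copy(), w.copy()
        wp[u, v] += eps
        wm[u, v] -= eps
        fd[u, v] = (loss(wp) - loss(wm)) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-4)
```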

7. CNN Building Blocks

This part studies CNN building blocks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Conv activation norm | standard block combines convolution, nonlinearity, and normalization | $\mathrm{Norm}(\phi(\mathrm{Conv}(x)))$ |
| Residual block | learn a correction around identity | $y=x+F(x)$ |
| Bottleneck block | use 1 by 1 convolutions to reduce and restore channels | $1\times1\rightarrow3\times3\rightarrow1\times1$ |
| Batch normalization | normalize channel statistics over batch and spatial axes | $\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}$ |
| Dropout and augmentation | regularize feature learning | $\tilde x=A(x)$ |

7.1 Conv activation norm

Main idea. Standard block combines convolution, nonlinearity, and normalization.

Core relation:

$\mathrm{Norm}(\phi(\mathrm{Conv}(x)))$


7.2 Residual block

Main idea. Learn a correction around identity.

Core relation:

$y=x+F(x)$

AI connection. Residual connections made very deep CNNs practical by preserving an identity path.
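A minimal numeric sketch of the identity path (our own toy `F`, not a real convolutional block):

```python
import numpy as np

def residual(x, f):
    """y = x + F(x): the block learns a correction around identity."""
    return x + f(x)

x = np.random.default_rng(0).standard_normal(4)

# If F is initialized to (near) zero, the block starts out as the
# identity, so signal and gradient pass through unchanged.
assert np.allclose(residual(x, lambda z: np.zeros_like(z)), x)

# dy/dx = I + dF/dx: for the scalar map F(x) = a*x the slope is
# 1 + a, so the gradient path never vanishes unless a = -1.
a = 0.3
y = residual(x, lambda z: a * z)
assert np.allclose(y, (1 + a) * x)
```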

7.3 Bottleneck block

Main idea. Use 1 by 1 convolutions to reduce and restore channels.

Core relation:

$1\times1\rightarrow3\times3\rightarrow1\times1$


7.4 Batch normalization

Main idea. Normalize channel statistics over batch and spatial axes.

Core relation:

$\hat x=(x-\mu)/\sqrt{\sigma^2+\epsilon}$

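The key point is which axes the statistics reduce over: batch and space, never channels. A minimal NumPy sketch (inference-style, no learned scale or shift):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3, 5, 5))      # (N, C, H, W)

# Per-channel statistics over batch and spatial axes.
mu = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# Each channel is now (approximately) zero-mean, unit-variance.
assert np.allclose(x_hat.mean(axis=(0, 2, 3)), 0.0, atol=1e-7)
assert np.allclose(x_hat.std(axis=(0, 2, 3)), 1.0, atol=1e-2)
```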

7.5 Dropout and augmentation

Main idea. Regularize feature learning.

Core relation:

$\tilde x=A(x)$


8. Vision Tasks

This part studies vision tasks as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Classification | map image features to class logits | $p(y\mid x)=\mathrm{softmax}(Wh+b)$ |
| Detection | predict boxes and classes over spatial anchors or queries | $(b_i,c_i)$ |
| Segmentation | predict a class for each pixel | $p(y_{i,j}\mid x)$ |
| Feature pyramids | combine multi-scale feature maps | $F_1,\ldots,F_L$ |
| Transfer learning | reuse pretrained convolutional features | $\theta=\theta_0+\Delta\theta$ |

8.1 Classification

Main idea. Map image features to class logits.

Core relation:

$p(y\mid x)=\mathrm{softmax}(Wh+b)$


8.2 Detection

Main idea. Predict boxes and classes over spatial anchors or queries.

Core relation:

$(b_i,c_i)$


8.3 Segmentation

Main idea. Predict a class for each pixel.

Core relation:

$p(y_{i,j}\mid x)$


8.4 Feature pyramids

Main idea. Combine multi-scale feature maps.

Core relation:

$F_1,\ldots,F_L$


8.5 Transfer learning

Main idea. Reuse pretrained convolutional features.

Core relation:

$\theta=\theta_0+\Delta\theta$


9. CNNs and Modern AI

This part studies CNNs and modern AI as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Inductive bias | locality and translation equivariance suit images | $x_{i,j}$ neighborhoods |
| Data efficiency | weight sharing reduces sample complexity compared with dense layers | $P_\mathrm{conv}\ll P_\mathrm{dense}$ |
| Hybrid models | modern vision systems often mix CNNs and attention | $\mathrm{Conv}+\mathrm{Attention}$ |
| Patch embedding | ViT patch projection is a strided convolution view | $p=W\,\mathrm{vec}(\mathrm{patch})$ |
| When CNNs still matter | edge vision and dense prediction often benefit from convolutional efficiency | latency, memory |

9.1 Inductive bias

Main idea. Locality and translation equivariance suit images.

Core relation:

$x_{i,j}$ neighborhoods


9.2 Data efficiency

Main idea. Weight sharing reduces sample complexity compared with dense layers.

Core relation:

$P_\mathrm{conv}\ll P_\mathrm{dense}$


9.3 Hybrid models

Main idea. Modern vision systems often mix CNNs and attention.

Core relation:

$\mathrm{Conv}+\mathrm{Attention}$


9.4 Patch embedding

Main idea. ViT patch projection is a strided convolution view.

Core relation:

$p=W\,\mathrm{vec}(\mathrm{patch})$

AI connection. This connects classical convolution math to modern vision transformers.
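The equivalence is easy to see numerically: flattening each non-overlapping $P\times P$ patch and applying one projection matrix gives the same tokens a convolution with kernel size $P$ and stride $P$ would. A minimal sketch (the random `W` stands in for a learned projection):

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, D = 4, 3, 8                         # patch size, channels, embed dim
img = rng.standard_normal((C, 8, 8))      # one (C, H, W) image
W = rng.standard_normal((D, C * P * P))   # illustrative projection matrix

# ViT-style patch embedding: flatten each non-overlapping PxP patch
# and project it to D dimensions.
tokens = []
for i in range(0, 8, P):
    for j in range(0, 8, P):
        patch = img[:, i:i + P, j:j + P].ravel()
        tokens.append(W @ patch)
tokens = np.stack(tokens)                 # (num_patches, D) = (4, 8)
assert tokens.shape == (4, D)
```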

9.5 When CNNs still matter

Main idea. Edge vision and dense prediction often benefit from convolutional efficiency.

Core relation:

latency, memory


10. Diagnostics

This part studies diagnostics as spatial tensor math. Keep track of axes, output shapes, receptive fields, and parameter sharing.

| Subtopic | Question | Formula |
| --- | --- | --- |
| Shape checks | track batch, channel, height, and width explicitly | $(N,C,H,W)$ |
| Kernel visualization | inspect early filters and feature maps | $W_{c,:,:}$ |
| Activation statistics | dead channels or saturated activations reveal training issues | $\mu_c,\sigma_c$ |
| Receptive field test | verify output depends on intended input region | $\partial y/\partial x$ |
| Ablations | compare kernel size, stride, depthwise factorization, residuals, and normalization | $\Delta S,\Delta T$ |

10.1 Shape checks

Main idea. Track batch, channel, height, and width explicitly.

Core relation:

$(N,C,H,W)$

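A shape check is easiest with the standard output-size formula as a helper (a minimal sketch; `conv_out_size` is our name, and it uses the usual floor convention):

```python
def conv_out_size(n: int, k: int, p: int = 0, s: int = 1, d: int = 1) -> int:
    """Output length along one spatial axis for a convolution with
    kernel k, padding p, stride s, and dilation d."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

assert conv_out_size(224, 3, p=1) == 224        # "same" 3x3 conv
assert conv_out_size(224, 7, p=3, s=2) == 112   # typical stem conv
assert conv_out_size(32, 3, s=2) == 15          # stride 2, no padding
```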

10.2 Kernel visualization

Main idea. Inspect early filters and feature maps.

Core relation:

$W_{c,:,:}$


10.3 Activation statistics

Main idea. Dead channels or saturated activations reveal training issues.

Core relation:

$\mu_c,\sigma_c$


10.4 Receptive field test

Main idea. Verify output depends on intended input region.

Core relation:

$\partial y/\partial x$

AI connection. A receptive-field check catches accidental padding, stride, or dilation mistakes.
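Without autodiff, the test can be done by perturbing each input pixel and watching one output unit (a minimal sketch: two stacked 3 by 3 valid convolutions should give the center output a 5 by 5 receptive field):

```python
import numpy as np

def conv2d_valid(x, w):
    k = w.shape[0]
    n = x.shape[0] - k + 1            # square input assumed
    return np.array([[np.sum(x[i:i + k, j:j + k] * w)
                      for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7))
w1 = rng.standard_normal((3, 3))
w2 = rng.standard_normal((3, 3))

def center_output(x_):
    y = conv2d_valid(conv2d_valid(x_, w1), w2)   # 7x7 -> 5x5 -> 3x3
    return y[y.shape[0] // 2, y.shape[1] // 2]

# Perturb each pixel; a nonzero response marks the receptive field.
base = center_output(x)
mask = np.zeros_like(x, dtype=bool)
for i in range(7):
    for j in range(7):
        xp = x.copy()
        xp[i, j] += 1.0
        mask[i, j] = abs(center_output(xp) - base) > 1e-9

# Two stacked 3x3 convolutions -> a 5x5 = 25-pixel receptive field.
assert mask.sum() == 25
```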

10.5 Ablations

Main idea. Compare kernel size, stride, depthwise factorization, residuals, and normalization.

Core relation:

$\Delta S,\Delta T$



Practice Exercises

  1. Compute a 1D cross-correlation output.
  2. Compute output size from kernel, padding, stride, and dilation.
  3. Count convolution parameters.
  4. Compare dense and convolutional parameter counts.
  5. Compute max pooling and average pooling.
  6. Compute receptive field through stacked layers.
  7. Compute depthwise separable convolution parameters.
  8. Build an im2col matrix for a tiny input.
  9. Compute a residual block output.
  10. Write a CNN debugging checklist.

Why This Matters for AI

Even in an LLM-heavy world, convolution math remains central for vision, audio, time series, edge models, segmentation, detection, multimodal encoders, and efficient local feature extraction. CNNs also teach key design principles: locality, weight sharing, equivariance, receptive fields, and hierarchy.

Bridge Forward

This completes the model-specific chapter arc: dense models, neural networks, probabilistic models, RNNs, transformers, reinforcement learning, generative models, and CNNs. The same accounting skills reappear in modern multimodal LLMs.

References