Convolutional Neural Networks Explained

A Convolutional Neural Network (ConvNet or CNN) is a special type of neural network used effectively for image recognition and classification. ConvNets are highly proficient at identifying objects, faces, and traffic signs, and they also power vision in self-driving cars and robots.

Four main operations exist in the ConvNet:

  1. Convolution
  2. Non-Linearity (ReLU)
  3. Pooling or Sub Sampling
  4. Classification (Fully Connected Layer)

These operations form the basic foundation of every Convolutional Neural Network, so to develop a sound understanding of how ConvNets work, we need to understand each operation thoroughly. We’ll go into detail about each of them in the sections below.

Representation of an image

Any image can be represented as a matrix of pixel values. Each component of an image is conventionally referred to as a channel. Images obtained from a standard digital camera consist of 3 channels – red, green, and blue. These can be pictured as three two-dimensional matrices, one for each color, stacked on top of one another, with pixel values between 0 and 255.

You may have heard of a grayscale image, which is an image with just one channel. In this post, we will use only grayscale images, so a single 2D matrix depicts an image. Each pixel value in the matrix ranges from 0 to 255, with 0 indicating black and 255 indicating white.

The Convolution Step

The name ConvNet is derived from an operation called ‘convolution’. The primary objective of this operation is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features from small squares of input data. Let’s look at how it works on images.

Consider a 5 x 5 image whose pixel values are only 0 and 1 (pixel values range from 0 to 255 in grayscale images, but this is a special case), shown below as a green matrix:

[Figure: a 5 x 5 binary image shown as a green matrix]

Consider one more 3 x 3 matrix as shown below:

Convolution of the 5 x 5 image and the 3 x 3 matrix can be calculated in the animation depicted below:

[Animation: the 3 x 3 orange matrix sliding over the 5 x 5 green image to produce a 3 x 3 pink output matrix]

Let’s discuss the computation being done here. The orange matrix is slid over the original green image one pixel at a time (this step size is known as the stride). At each position, we perform element-wise multiplication between the two matrices and sum the products to obtain a single element (an integer) of the output matrix, shown in pink. Also note that the 3 x 3 matrix ‘sees’ only a part of the input image at each stride.
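The sliding computation just described can be sketched in plain NumPy. Since the original figures are not reproduced here, the 5 x 5 binary image and 3 x 3 filter below are assumed example values:

```python
import numpy as np

# A 5x5 binary input image and a 3x3 filter (assumed example values)
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(img, k, stride=1):
    """Slide the kernel over the image; at each position, take the
    element-wise product with the patch underneath and sum it up."""
    kh, kw = k.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=img.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)
    return out

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

Each entry of the 3 x 3 output is one “multiply the overlapping values and add them up” step of the sliding window.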

In CNN terminology, the 3×3 matrix is called a ‘kernel’, ‘filter’, or ‘feature detector’, and the matrix obtained by sliding the filter over the image and computing the dot product is referred to as the ‘Convolved Feature’, ‘Activation Map’, or ‘Feature Map’. Filters serve as feature detectors for the original input image. It should be clear by now that different values of the filter matrix will produce different Feature Maps even though the input image is the same. For instance, consider the input image below:

The effects of convolving the image above with different filters are listed in the table below. Operations such as sharpening, blurring, and edge detection can be performed simply by changing the numeric values of the filter matrix before the convolution, which means that different filters detect different features, such as curves and edges, in an image.

When one filter (say, the one outlined in red) is slid over the input image, it produces one feature map; convolving the same image with another filter (say, the one outlined in green) yields a different feature map, as depicted. The convolution operation thus captures the local dependencies in the original image. Note how two entirely different feature maps result from the same image. Keep in mind, as stated above, that the two filters and the image are just numeric matrices.
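To make this concrete, here is a small sketch. The three kernels below are standard textbook filters, assumed for illustration since the original table of filters is not reproduced; the toy image is a simple vertical step edge:

```python
import numpy as np

# Standard 3x3 filters (assumed illustrative values, not from the post)
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])
blur = np.ones((3, 3)) / 9.0

# A toy 5x5 image: dark left half, bright right half (a vertical edge)
image = np.array([[0, 0, 0, 9, 9]] * 5, dtype=float)

def convolve2d(img, k):
    """Valid (no padding) convolution with stride 1."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

# Same image, different filters -> entirely different feature maps:
edges = convolve2d(image, edge_detect)   # responds only at the edge
smoothed = convolve2d(image, blur)       # smooths the step
```

The edge-detection filter outputs zero over the flat regions and large values at the intensity change, while the blur filter simply averages each neighborhood, illustrating how the filter values alone determine which feature is detected.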

The values of these filters are learned automatically by the CNN during the training process. However, we still need to define a few parameters before training, such as the number of filters, the filter size, and the network architecture. The more filters we have, the more image features are extracted and the better our network becomes at recognizing patterns in new images.

Three parameters control the size of our Feature Map (Convolved Feature). They need to be set before the convolution step is performed.

  • Depth: It defines the number of filters used for the convolution operation. In the figure below, the convolution of the original boat image is performed with 3 distinct filters, generating 3 distinct feature maps. These feature maps can be thought of as stacked 2D matrices, so the ‘depth’ of the feature map here is 3.

  • Stride: It is the number of pixels by which we slide the filter matrix over the input matrix. With a stride of 1, the filter moves one pixel at a time; with a stride of 2, it jumps 2 pixels at a time. Note that larger strides produce smaller feature maps.
  • Zero-padding: It is often convenient to pad the input matrix with zeros around the border so that the filter can also be applied to the bordering elements of the input image matrix. The advantage of zero-padding is that it lets us control the size of the feature map. Convolution with zero-padding is known as ‘wide convolution’, and convolution without it as ‘narrow convolution’.
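The interplay of these three parameters follows a standard formula: for input size W, filter size F, padding P, and stride S, the output spatial size is (W − F + 2P)/S + 1. A minimal sketch:

```python
def conv_output_size(w, f, stride=1, padding=0):
    """Spatial size of a feature map: (W - F + 2P) / S + 1,
    assuming a square input of side w and a square filter of side f."""
    return (w - f + 2 * padding) // stride + 1

# 5x5 input, 3x3 filter, stride 1, no padding -> 3x3 map
# ("narrow convolution", as in the earlier example)
print(conv_output_size(5, 3))
# Same input with 1 pixel of zero-padding -> 5x5 map, size preserved
# ("wide convolution")
print(conv_output_size(5, 3, padding=1))
# Stride 2 shrinks the map further
print(conv_output_size(7, 3, stride=2))
```

Depth does not appear in the formula because it only sets how many such maps are produced, one per filter.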

Introducing Non-Linearity (ReLU)

In a typical ConvNet, an additional operation called ReLU is performed after every convolution. ReLU stands for Rectified Linear Unit and is a non-linear operation.

ReLU is an element-wise operation (applied per pixel) that replaces every negative pixel value in the feature map with zero. Its purpose is to introduce non-linearity into the ConvNet, since most real-world data we want our ConvNet to learn is non-linear. (Convolution is a linear operation, consisting of element-wise matrix multiplication and addition, so we need a non-linear function such as ReLU to account for non-linearity.)

The figure below depicts the ReLU operation. The output feature map here is also referred to as the ‘Rectified’ feature map.

Other non-linear functions, such as ‘tanh’ or ‘sigmoid’, can be used in place of ReLU. However, ReLU is the most widely used because it performs better in most situations.
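Since the figure is not reproduced here, the element-wise replacement can be sketched directly (the feature map values below are assumed toy numbers):

```python
import numpy as np

# A small feature map with some negative values (assumed toy numbers)
feature_map = np.array([[ 3, -5,  2],
                        [-1,  0,  4],
                        [ 6, -2, -7]])

# ReLU acts element-wise: every pixel becomes max(0, pixel),
# so negative values are replaced by zero
rectified = np.maximum(0, feature_map)
print(rectified)
# [[3 0 2]
#  [0 0 4]
#  [6 0 0]]
```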

The Pooling Step

If we wish to reduce the dimensionality of individual feature maps while retaining the most important information, we use Spatial Pooling, also known as downsampling or subsampling. Pooling comes in several types: Max, Average, Sum, etc.

For Max Pooling, we first define a spatial neighborhood (such as a 2×2 window) and then take the largest element of the rectified feature map within that window. If we take the average instead of the largest element, it is called Average Pooling, and if we take the sum of the elements in the window, it is called Sum Pooling.

In practice, Max Pooling has been shown to work best.

After the Convolution + ReLU operation produces a rectified feature map, the Max Pooling operation is performed on it using a 2×2 window, as shown in the figure below.

The 2 x 2 window is slid by 2 cells at a time (this is its “stride”) and takes the largest value in each region.

Also, in the figure below, observe that the pooling operation is applied to each feature map separately; as a result, we obtain 3 output maps from the 3 input maps we fed in.
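A minimal sketch of 2×2 Max Pooling with a stride of 2, using an assumed 4×4 rectified feature map in place of the missing figure:

```python
import numpy as np

# A 4x4 rectified feature map (assumed toy values)
fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])

def max_pool(x, size=2, stride=2):
    """Slide a size x size window by `stride` cells and keep
    only the largest value in each region."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

pooled = max_pool(fmap)
print(pooled)
# [[6 8]
#  [3 4]]
```

The 4×4 map shrinks to 2×2 while each region’s strongest activation is preserved, which is exactly the dimensionality reduction described above.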

The basic purpose of pooling is to progressively reduce the spatial size of the input representation. More specifically, pooling:

  • Makes the input representations (feature dimensions) smaller and more manageable.
  • Reduces the number of parameters and computations in the network, thereby helping to control overfitting.
  • Makes the network resistant to small transformations, distortions, and translations in the input image. Small distortions in the input will not change the pooling output much, since we take the maximum or average value in a local neighborhood.
  • Helps us arrive at an almost scale-invariant representation of the input image (the exact term is ‘equivariant’). This is extremely useful because it enables us to detect objects in an image no matter where they are located.

We have now studied how Convolution, ReLU, and Pooling work. These layers form the basic building blocks of any CNN. In our example network there are 2 sets of Convolution, ReLU, and Pooling layers: the 2nd convolution layer performs convolution on the output of the first Pooling layer using six filters, producing six feature maps. ReLU is then applied to each of these six feature maps individually, and finally Max Pooling is performed on each of the six rectified feature maps separately.

Together, these layers extract useful features from the input images, introduce non-linearity into the network, and reduce feature dimensions while making the features somewhat invariant to scale and translation. The output of the 2nd pooling layer serves as the input to the Fully Connected Layer, which is discussed further in the article.
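The two-stage pipeline can be sketched by composing the earlier operations. This is a simplified illustration with random filters and a random input; in a real CNN, each second-layer filter would span all maps of the first stage, whereas here, for brevity, the six second-layer filters are applied to just one first-stage map:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(img, k):
    """Valid (no padding) convolution with stride 1."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2):
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = rng.random((28, 28))                          # grayscale input
filters1 = [rng.standard_normal((3, 3)) for _ in range(3)]
filters2 = [rng.standard_normal((3, 3)) for _ in range(6)]

# Stage 1: Convolution -> ReLU -> Max Pooling (3 feature maps)
stage1 = [max_pool(relu(convolve2d(image, f))) for f in filters1]
# Stage 2: six filters on a first-stage map -> six smaller maps
stage2 = [max_pool(relu(convolve2d(stage1[0], f))) for f in filters2]

print(len(stage1), stage1[0].shape)   # 3 maps of 13x13
print(len(stage2), stage2[0].shape)   # 6 maps of 5x5
```

Note how each stage shrinks the spatial size (28 → 13 → 5) while the number of maps grows, which is the feature extraction and dimension reduction described above.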

Fully Connected Layer

This layer is a Multilayer Perceptron that uses a softmax activation function in the output layer (an SVM classifier can also be used, but we will limit our approach to softmax here). The term “Fully Connected” indicates that every neuron in one layer is connected to every neuron in the next layer.

The outputs of the convolutional and pooling layers represent high-level features of the input image. The fully connected layer uses these features to classify the input image into one of the classes defined by the training dataset.

Apart from classification, adding a fully connected layer is also a cheap way of learning non-linear combinations of these features. Most of the features from the convolutional and pooling layers may be good for the classification task on their own, but combinations of those features may be even better.

The sum of all output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax activation function in the output layer. The Softmax function takes an arbitrary vector of real-valued scores and squashes it into a vector of values between zero and one that sum to 1.
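The squashing just described is a one-line formula. The raw scores and class names below are hypothetical, chosen only to show the behavior:

```python
import numpy as np

def softmax(z):
    """Turn arbitrary real-valued scores into values in (0, 1)
    that sum to 1. Subtracting the max first is a standard trick
    for numerical stability; it does not change the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw scores from the fully connected layer
# for four classes (e.g. dog, cat, boat, bird)
scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(probs)          # every entry lies in (0, 1)
print(probs.sum())    # the entries sum to 1
```

The largest raw score always maps to the largest probability, so the predicted class is unchanged; softmax only rescales the scores into a proper probability distribution.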


To summarize, ConvNets combine these operations to make accurate predictions. They are a great tool to use whenever you intend to apply Machine Learning to images (Computer Vision).
