Step-by-Step: Building Your First Convolutional Neural Network

An Introduction to Convolutional Neural Networks for Beginners

Ever wondered how the human brain works? How do we comprehend and draw distinctions between various objects? There are dense networks of neurons inside the human brain that allow us to see and process the data acquired from our vision.

And just like the human brain, neural networks can be created for computers. Although these neural networks are way less complicated than the ones inside our body (phew! What a relief for us programmers, am I right?), they might need a little practice before you can master them.

With the rising popularity of artificial intelligence and machine learning, artificial neural networks have become the backbone of deep learning algorithms. Apart from simple artificial neural networks (ANNs), more advanced architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are becoming more and more popular nowadays.

Convolutional Neural Networks (CNNs) are a type of artificial neural network designed for image processing, natural language processing, and other kinds of cognitive tasks. They consist of an input layer, multiple hidden layers (including convolutional, pooling, and fully connected layers), and an output layer. CNNs are capable of identifying complex patterns in data, making them a key technology in the field of deep learning.

In this article, we will take a look at what convolutional neural networks are and how we can use them to classify images.

By the end of this tutorial, you will have created your first convolutional neural network in Python. Let’s get started!

Understanding Convolutional Neural Networks

Convolutional neural networks are mostly used for processing image data, natural language processing, classification tasks, and so on. A convolutional neural network mainly consists of three kinds of layers through which an image or other data is processed: the input layer, n hidden layers (where n denotes the variable number of hidden layers used for data processing), and an output layer.

In the hidden layers, there are three types of sublayers. They are:

  • convolutional layer
  • pooling layer
  • fully connected layer

Most of the feature mapping, filtering, and computation is done in the convolutional layer. There can be more than one convolutional or pooling layer.

To get a proper idea of what a convolutional neural network is, we’ll look at an example using the dog picture below.

Dog Picture

When we look at pictures or real-life objects, our brain immediately processes the features of that particular object to recognize it. In this case, the first thing that we notice about the dog is not its color but its snout, its ears and mouth, its paws, and its tail. So even if the color or the position of the picture changes, we can still identify it as a dog.

These separate features then pool into further nodes that tell us about the dog’s head and body, and finally, we are able to recognize it as a dog.

Feature Classification And Image Identification

In the above picture, a feature map has been simplified for basic understanding. Now let’s see how this is done in more computational terms. For computers to understand this, we use filters in our convolutional layer, each of which looks for a separate feature.

Diving Deep into the Convolutional Layer

The convolutional layer is the most important and most computationally intensive layer of a CNN. Most of the calculations are carried out in this layer.

To a computer, an image is nothing but a collection of pixels represented in the form of a matrix; the image data is stored as pixel values. In the case of color images, there are three aspects to the image: height, width, and depth, where the depth corresponds to its RGB channels.
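
As a quick sketch of this idea (using NumPy, which is not otherwise needed in this tutorial), here is how a tiny, made-up color image looks as an array of pixel values:

import numpy as np

# A made-up 4x4 color image: height x width x depth (the RGB channels),
# with pixel intensities ranging from 0 to 255.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)  # (4, 4, 3) -> height, width, depth
print(image[0, 0])  # the R, G, B values of the top-left pixel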

Let us take a number such as “6”. The number six can be represented in the following manner:

Feature Classification And Image Identification

Now, these features will be implemented as filters over a grid of numbers consisting of 1s and -1s (for simplification purposes, the pixel data here consists only of 1s and -1s).

Filters in Convolutional Implementation

Filters (also called kernels) are 2D arrays that move over an image to recognize different parts of it. They are usually 3×3 matrices, and each one detects a different pattern across the entire image. At each position on the image, the dot product of the image data and the filter is calculated; a high value means the filter’s pattern is present at that location.

This process is repeated until there are no more receptive fields left for the filter to recognize.

The final dot product of the image pixel data and the filters is called a feature map.

Since the pixel data is usually larger than the 3×3 filter matrix, we take 3×3 grids one at a time and perform the dot product with the filter. The number of pixels by which the filter shifts each time is known as the stride. Strides can take different values depending on the requirements and the size of the filter.
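
To make this concrete, here is a minimal NumPy sketch of a 3×3 filter sliding over a small grid of 1s and -1s with a stride of 1; the image and filter values are made up for illustration and are not the exact ones from the figures.

import numpy as np

# Made-up 5x5 pixel grid of 1s and -1s and a 3x3 filter (illustrative values only).
image = np.array([[ 1, -1, -1, -1,  1],
                  [-1,  1, -1,  1, -1],
                  [-1, -1,  1, -1, -1],
                  [-1,  1, -1,  1, -1],
                  [ 1, -1, -1, -1,  1]])
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

stride = 1
out = (image.shape[0] - filt.shape[0]) // stride + 1
feature_map = np.zeros((out, out))

# Slide the filter over every 3x3 receptive field and take the dot product.
for i in range(out):
    for j in range(out):
        patch = image[i*stride:i*stride+3, j*stride:j*stride+3]
        feature_map[i, j] = np.sum(patch * filt)

print(feature_map)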

The Feature Map

In the above picture, we can see that one value in the feature map is equal to 1. This is what the filter tells us: the purple highlighted value says that the filter is activated at that particular place in the number 6. In this map, the loop filter can be seen activated at the very bottom of the number six.

Now, after each convolutional layer’s computation, we apply a rectified linear unit, or ReLU, activation function to introduce non-linearity. It is a function that outputs zero unless the value is strictly positive, in which case it passes the value through unchanged. It helps mitigate the vanishing gradient problem in deep learning algorithms, and it removes unwanted negative values that would make the process unnecessarily lengthy.

ReLU Activation Function
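
As a rough sketch (in NumPy) of what ReLU does to a small, made-up feature map:

import numpy as np

def relu(x):
    # Keep strictly positive values as they are; replace everything else with zero.
    return np.maximum(0, x)

feature_map = np.array([[-0.55, 0.33],
                        [ 1.00, -0.11]])
print(relu(feature_map))
# [[0.   0.33]
#  [1.   0.  ]]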

This shifting of the filters over the entire image’s receptive field is called convolution.

There can be more than one convolutional layer, and together they form hierarchies. For example, in the dog example from the previous section, there are two convolutional layers: one identifies the facial features, which, when clubbed together, form the next layer in the hierarchy; that layer further combines with the body layer and then gives an output.

Read more about the ReLU function here!

Unraveling the Pooling Layer

In the pooling layer, filters go over the feature maps and downsample them. The pooling layer assembles the features before passing them to the fully connected layer: it collects all the feature maps and further reduces their dimensions, which reduces the amount of computation to be done.

There are two types of pooling layers:

  • Average pooling: It takes the average value of each receptive field of the image data and then feeds it to the next layer.
  • Max pooling: This is the preferred method when processing image data. Max pooling, or maximum pooling, takes the maximum values from the feature maps, which are then fed further into the model. It is generally more effective than average pooling.

We will be using max pooling in the later sections, so let’s look at how it’s done.

How Max Pooling Works

From the above feature map, which has been passed through the ReLU activation function, we extract square 2×2 matrices, take the maximum from each of them, and pool those maximums together to downsize our sample.

Here we will take the stride as 1. This means that we will start from the leftmost column, that is, the 1st column. Since our stride is 1, we will move one column forward each time and extract another square matrix.

Hence, our 1st smaller square grid will consist of the (1,1), (1,2), (2,1), (2,2) elements, where positions are written as (row number, column number). The maximum among these four elements will be chosen, and it will serve as the very 1st element of our pooled feature map, situated at the (1,1) position in the new matrix.

Since the maximum in the very first smaller square grid is 0, the first element of our pooled feature map will be 0 at (1,1).

Our second square grid will consist of the (1,2), (1,3), (2,2), (2,3) elements, from which the maximum will again be extracted and placed at the (1,2) position of the new matrix.

This process will continue till we are left with a 4×2 matrix.

MAX POOLING WITH STRIDE=1
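
Here is a small NumPy sketch of that procedure; the feature map values are made up rather than taken from the figure, but the 2×2 window and stride of 1 match the walkthrough above.

import numpy as np

# A made-up ReLU-activated feature map (5x3), not the exact values from the figure.
feature_map = np.array([[0.0, 0.0, 1.0],
                        [0.0, 0.3, 0.0],
                        [0.6, 0.0, 0.0],
                        [0.0, 0.2, 0.0],
                        [0.0, 0.0, 0.9]])

window, stride = 2, 1
rows = (feature_map.shape[0] - window) // stride + 1
cols = (feature_map.shape[1] - window) // stride + 1
pooled = np.zeros((rows, cols))

# Take the maximum of every 2x2 patch, moving one step at a time.
for i in range(rows):
    for j in range(cols):
        patch = feature_map[i*stride:i*stride+window, j*stride:j*stride+window]
        pooled[i, j] = patch.max()

print(pooled.shape)  # (4, 2), matching the walkthrough above
print(pooled)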

Some advantages of pooling are:

  • It reduces unwanted and extra parameters.
  • It downsizes the huge sample.
  • It makes the computation faster.

One of the main disadvantages of this process is that a lot of information is lost when we reduce the dimensions of our feature map, which can affect the overall accuracy of our model.

Decoding the Fully Connected Layer

The fully connected layer flattens our input into a single dimension, where every input is connected to the neurons. This is where classification is done: each input, or its pixel data, is assigned to one or more of the predefined classes corresponding to real-world object categories.

The fully connected layer is usually found at the very end of the CNN architecture, where classification is carried out: inputs are classified using a function such as softmax, which scales numbers into probabilities between 0 and 1.

COMPLETE CONVOLUTIONAL NEURAL NETWORK
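
To see how softmax turns raw scores into probabilities, here is a small NumPy sketch with made-up scores for three classes:

import numpy as np

def softmax(scores):
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs of the last layer
print(softmax(scores))               # roughly [0.659 0.242 0.099]
print(softmax(scores).sum())         # 1.0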

Related: What Are the Different Types of Classification Algorithms?

Hands-on: Building Your First Convolutional Neural Network

In this section, we are going to create our very own simple convolutional neural network using a very popular dataset containing a large number of images for our model to train on.

We’ll classify images using our own CNN by building it with the help of TensorFlow and Keras. We can visualize our output using matplotlib as well. Let’s get started.

Before jumping right into the code, you need to take care of some prerequisites. Make sure that you have TensorFlow and Keras installed/updated on your system. TensorFlow is a library used for various machine learning purposes. Keras, on the other hand, is a high-level neural network API that runs on top of TensorFlow and is used to build and train deep learning models.

You can check out how to install TensorFlow from here according to your system requirements. In this tutorial, we are going to use Google Colab, which is a cloud-based Python environment that comes pre-installed with TensorFlow, so the only thing you need to do is import it along with Keras.

#import all required modules.
import tensorflow 
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Flatten, Dense

We are going to use the CIFAR-10 dataset, which consists of 60,000 32×32 color images, out of which 50,000 are for training our model and 10,000 are for testing.

There are 10 different classes in the dataset, each associated with 6,000 images.

The Classes In The CIFAR-10 Dataset

We’ll now load the CIFAR-10 dataset in our program.

#importing CIFAR10 dataset into our code
(xtrainmodel, ytrainmodel) , (xtestmodel, ytestmodel) = datasets.cifar10.load_data()

You can check the number of training and testing samples and their shapes, if you want to, by running the following code using the shape attribute.

print(xtrainmodel.shape)
print(ytrainmodel.shape)
print(xtestmodel.shape)
print(ytestmodel.shape)

You’ll get the following output:

(50000, 32, 32, 3) #xtrainmodel shape
(50000, 1) #ytrainmodel shape
(10000, 32, 32, 3) #xtestmodel shape
(10000, 1) #ytestmodel shape

Now, we’ll flatten our ‘ytrainmodel’ labels and also normalize our ‘xtestmodel’ and ‘xtrainmodel’ values to standardize everything.

ytrainmodel=ytrainmodel.reshape(-1,)
xtestmodel = xtestmodel / 255
xtrainmodel = xtrainmodel / 255

Note: We are dividing the images by 255 because all pixel values range from 0 to 255. This helps normalize the pixel values and bring them into the range of 0 to 1.

Now, we’ll define our model. You do not need to worry about the filter values too much because they are learned automatically during training; we only need to define their dimensions.

convol=Sequential()
                       #first convolutional layer
convol.add(Conv2D(filters=32 , kernel_size=(3,3),
                                        activation='relu',
                                        input_shape=(32,32,3)))
                        #1st max pooling
                        #size of feature map after pooling 2x2
convol.add(MaxPooling2D((2,2)))

                        #2nd convolutional layer
convol.add(Conv2D(filters=64,
              kernel_size=(3,3),
              activation='relu'))
                        #2nd max pooling
                        #size of feature map after pooling is given by 2x2 
convol.add(MaxPooling2D((2,2)))

                        #fully connected layer
                        #flattening our inputs
convol.add(Flatten())
convol.add(Dense(64 , activation = 'relu'))
#using the softmax function
convol.add(Dense(10 , activation = 'softmax'))

In the above block of code, we have created our convolutional neural network using the sequential model. In the beginning, we specified our kernel size and our input shape, and used the ReLU function to introduce non-linearity in our convolutional layer. Next, we used max pooling to pool our feature maps and downsize our sample.

This process, consisting of a convolutional layer followed by a pooling layer, has been repeated twice. The very last portion of the code defines the fully connected layers, where we have flattened our input and performed classification using the softmax activation function.
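
If you want to double-check the shapes produced at each layer before training, you can print a summary of the model:

# Prints a layer-by-layer overview of the network, including output shapes
# and the number of trainable parameters.
convol.summary()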

All that is left now is to compile our model and then train it. We will compile our model in the following way:

convol.compile(optimizer="adam",loss='sparse_categorical_crossentropy',metrics=['accuracy'])

We have used the adam optimizer to improve the speed and accuracy of our training. Along with that, we have used the sparse_categorical_crossentropy loss because each element in this dataset belongs to exactly one class and the labels are stored as integers rather than one-hot vectors. Besides that, we have asked for the accuracy metric to be displayed so that we can see how accurate our model is during training.

Now, we’ll train our model with the training samples. We will use 10 epochs, which is simply the number of complete passes the neural network makes over the given dataset during training.

convol.fit(xtrainmodel,ytrainmodel, epochs=10)

The output of the training will be something like this.

Training Our Convolutional Neural Network

In the output, you can observe how the accuracy increases with each epoch. In the beginning, it was about 46%, and by the end, it went up to 78%. That’s pretty impressive! For comparison, a simple ANN trained on the same data gives a final accuracy of only about 45%.

This is why CNN is way more accurate and popular nowadays. Now let’s evaluate our test samples.

convol.evaluate(xtestmodel,ytestmodel)

The output will be:

313/313 [==============================] - 6s 18ms/step - loss: 0.9689 - accuracy: 0.6918
[0.9689394235610962, 0.6917999982833862]

Well, our model is doing a pretty good job with about 70% accuracy!

Voila! You have created your very first and very own convolutional neural network!

The entire code of our CNN is given below all clubbed together.

#import all required modules.
import tensorflow 
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Flatten, Dense

#import dataset
(xtrainmodel, ytrainmodel) , (xtestmodel, ytestmodel) = datasets.cifar10.load_data()

#normalizing our data for training and testing
ytrainmodel=ytrainmodel.reshape(-1,)
xtestmodel = xtestmodel / 255
xtrainmodel = xtrainmodel / 255

#our CNN

convol=Sequential()
                       #first convolutional layer
convol.add(Conv2D(filters=32 , kernel_size=(3,3),
                                        activation='relu',
                                        input_shape=(32,32,3)))
                        #1st max pooling
                        #size of feature map after pooling 2x2
convol.add(MaxPooling2D((2,2)))

                        #2nd convolutional layer
convol.add(Conv2D(filters=64,
              kernel_size=(3,3),
              activation='relu'))
                        #2nd max pooling
                        #size of feature map after pooling is given by 2x2 
convol.add(MaxPooling2D((2,2)))

                        #fully connected layer
                        #flattening our inputs
convol.add(Flatten())
convol.add(Dense(64 , activation = 'relu'))
#using the softmax function
convol.add(Dense(10 , activation = 'softmax'))

#compiling our model
convol.compile(optimizer="adam",loss='sparse_categorical_crossentropy',metrics=['accuracy'])

#training our model
convol.fit(xtrainmodel,ytrainmodel, epochs=10)

#testing our model
convol.evaluate(xtestmodel,ytestmodel)

Note: Even though this is the entire code, it is advisable to run it in smaller separate cells in Google Colab, in the order given above, for ease of understanding.

Challenges and Final Thoughts on Convolutional Neural Networks

We’ve learned what a convolutional neural network is and how you can implement one on your own. But there are two sides to every coin. While we have seen that a CNN performs far better than a simple ANN at correctly identifying images, the algorithm also has some limitations.

It takes a lot of time to train these neural networks, because to reach high accuracy the training data needs to be well organized and labeled. They also need a large amount of computational resources, which is not feasible for every task.

Overfitting is a huge problem for CNNs, which makes them hard to apply to smaller datasets. There is also information loss in the pooling layers, which might generalize our models more than required.

Nonetheless, CNNs deserve their immense fame in recent years because they have revolutionized machine learning in many ways. From natural language processing to computer vision, they are used everywhere. So, how do you feel after creating your very first convolutional neural network?