## Understanding Convolutional Neural Network/ConvNet in a Simple Way

### Introduction

This article explains the Convolutional Neural Network (CNN, or ConvNet) in a simple way. Along with ConvNets, readers will study the basics of neural networks: input, hidden and output nodes, the feed forward neural network and the back-propagation algorithm. Once these terms are clear, the focus shifts to the ConvNet architecture and its different layers, such as the convolutional, pooling, normalization, dropout and fully connected layers. Along the way, readers will gain an understanding of the mathematics behind calculating the number of parameters and the output dimensions of each layer. At the end of this reading, one will find some good references where more information can be found.

### What is a Neural Network

A neural network, or more precisely an artificial neural network, is a computational model based on the working of neurons in the human brain. As for applications, neural networks (NN) are widely used in computer vision and image processing (image classification, object detection, image segmentation) as well as in speech recognition. Before proceeding further, let us understand the theory and mathematics of a single neuron from this link. Once you have a quick understanding of a single neuron, it is important to build the concepts of input, hidden and output nodes in order to understand the feed forward neural network.

### Feed Forward, input, hidden and output nodes, weight and bias

A feed forward neural network is defined as an arrangement of neurons (also called nodes or units) in different layers. These layers are known as the input, hidden and output layers. Information moves in only one direction in such a network, and there are no cycles or loops back. This is what makes it different from a recurrent neural network (RNN), and it is worth understanding the difference between the two. For this, AI Sangam has written an article entitled Difference between Feed Forward and RNN. Please go through that article, which explains the difference with diagrams. Let us discuss the three layers of the feed forward neural network, which are:

• Input Layer
• Hidden Layer
• Output Layer

Input Layer: This is the layer where data is fed into the network. In this layer, along with the nodes representing the data, one extra node for the bias is also added. This bias node always outputs 1 and has its own connection weights. Let us look at the diagram below to understand the input, hidden and output layers more precisely.

Hidden Layer: Each neuron in the hidden layer receives inputs from the neurons in the previous layer. Each input is multiplied by its associated weight, the products are summed, and the sum is passed to an activation function, which in turn produces an output (say a). To learn how a neuron works and how the computation is done at this stage, please refer to the single neuron link.

Output Layer: This is the final layer of the feed forward neural network. Based on the input received from the last hidden layer, it produces an output y, as shown in the figure Layers of Feed Forward Neural Network.
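The three layers described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the network size (2 inputs, 2 hidden neurons, 1 output), the weight values and the choice of a sigmoid activation are all assumptions made for the example, not values taken from the figure.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute the activations of one layer.

    weights[j][i] is the weight from input i to neuron j;
    biases[j] is the weight on the constant bias input of 1.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs plus the bias node (always 1) times its weight.
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias * 1.0
        outputs.append(sigmoid(z))
    return outputs

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
x = [0.5, -1.0]
hidden_w = [[0.1, 0.4], [-0.3, 0.2]]
hidden_b = [0.0, 0.1]
output_w = [[0.7, -0.5]]
output_b = [0.2]

a = layer_forward(x, hidden_w, hidden_b)  # hidden layer activations
y = layer_forward(a, output_w, output_b)  # output of the network
print(a, y)
```

Information flows strictly from `x` to `a` to `y`; nothing is fed back, which is what makes the network feed forward.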

Now that the layers of the feed forward neural network are covered, let us move on to understanding weights and bias.

Weights: In the input layer, each node carries an input, and each input is associated with a weight. The question then arises of what the initial value of these weights should be. A simple answer is to initialize them with random values between 0 and 1. If you are working in Python, you can draw such numbers with random.random, and call random.seed first to make the sequence reproducible. But this simple answer is not the whole story. If you look at the initialization schemes proposed by Xavier and Kaiming, you will get a different perspective on initializing weights. AI Sangam has written a document on initializing weights using Xavier and Kaiming initialization. I would suggest reading that document to get a better understanding of weights.
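To make the three options concrete, here is a small sketch in plain Python. The function names are made up for this example, and the formulas are the commonly quoted ones (Xavier uniform with limit sqrt(6 / (fan_in + fan_out)), Kaiming normal with standard deviation sqrt(2 / fan_in)); the linked document may present other variants.

```python
import math
import random

random.seed(0)  # fixes the random sequence for reproducibility; it does not generate numbers itself

def uniform_init(fan_in, fan_out):
    """Naive scheme: each weight drawn uniformly from [0, 1)."""
    return [[random.random() for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_uniform_init(fan_in, fan_out):
    """Xavier (Glorot) uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def kaiming_normal_init(fan_in, fan_out):
    """Kaiming (He) normal: N(0, std^2) with std = sqrt(2 / fan_in), often used with ReLU."""
    std = math.sqrt(2.0 / fan_in)
    return [[random.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# A layer with 4 inputs (fan_in) and 3 neurons (fan_out).
w_naive = uniform_init(4, 3)
w_xavier = xavier_uniform_init(4, 3)
w_kaiming = kaiming_normal_init(4, 3)
```

Note that the naive scheme produces only positive weights, while the other two are centred on zero and scaled by the size of the layer.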

Bias: For bias, I have written a document, i.e. single neuron. Just open that document and scroll to the section Why Biasing is Needed.

### Minimizing Loss: Gradient Descent and Backpropagation

To understand gradient descent, I would like readers to go to my article Difference between Feed Forward and RNN and study the points under the heading What is backpropagation method. To understand the mathematics behind backpropagation, as well as how the chain rule is applied with respect to the cost function and the weights, please study from here.
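As a quick taste before reading those articles, the sketch below applies the chain rule and one gradient descent update to a single sigmoid neuron. The squared-error loss, the starting weights, the training example and the learning rate are all assumptions chosen for illustration; the linked articles may use a different cost function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, learning_rate):
    """One gradient descent update for a single sigmoid neuron.

    Loss: L = 0.5 * (a - target)^2, with a = sigmoid(w * x + b).
    """
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    loss = 0.5 * (a - target) ** 2
    # Backward pass (chain rule): dL/dw = dL/da * da/dz * dz/dw
    dL_da = a - target
    da_dz = a * (1.0 - a)
    dL_dw = dL_da * da_dz * x
    dL_db = dL_da * da_dz * 1.0
    # Move the parameters a small step against the gradient
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db
    return w, b, loss

w, b = 0.5, 0.0
losses = []
for _ in range(100):
    w, b, loss = train_step(w, b, x=1.5, target=1.0, learning_rate=0.5)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as training proceeds
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back to the input, which is where the name backpropagation comes from.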

### Difference between CNN and NN and what makes CNN special

A neural network is also called an artificial neural network (ANN). The classes of ANN include the convolutional neural network (CNN), the recurrent neural network (RNN), autoencoders and deep belief networks (DBN). From this it is clear that a CNN is a type of neural network. Two important features that make a CNN special are reduced computational complexity and translation invariance.

Reducing computational complexity: In a CNN, all the neurons in one channel of a feature map (more technically, in each depth slice of the feature map) share the same weights, which leads to a drastic decrease in the number of parameters. In this way, the computational cost decreases.
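A short calculation shows how large the saving can be. The layer sizes below (a 32x32x3 input, eight 5x5 filters, stride 1, no padding) are assumed purely for illustration, and the helper names are hypothetical.

```python
def conv_params_shared(filter_size, in_depth, num_filters):
    """Parameter sharing: one filter (plus one bias) per output channel."""
    return (filter_size * filter_size * in_depth + 1) * num_filters

def conv_params_unshared(out_size, filter_size, in_depth, num_filters):
    """No sharing: every output neuron owns its own filter weights and bias."""
    neurons = out_size * out_size * num_filters
    return neurons * (filter_size * filter_size * in_depth + 1)

# Assumed example: 32x32x3 input, eight 5x5 filters, stride 1, no padding,
# which gives a 28x28x8 output volume.
shared = conv_params_shared(filter_size=5, in_depth=3, num_filters=8)
unshared = conv_params_unshared(out_size=28, filter_size=5, in_depth=3, num_filters=8)
print(shared, unshared)  # 608 476672
```

With sharing, the layer in this example needs 608 parameters instead of 476,672.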

Translation invariance: In a CNN, input patterns are recognized irrespective of any translation. Translation here refers to shifting the position of an object in an image.
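The idea can be sketched on a one-dimensional signal. Strictly speaking, the convolution itself is translation equivariant (the response moves together with the pattern), and it is taking the maximum over the whole response that removes the position. The kernel and the signals below are made-up values for illustration.

```python
def correlate1d(signal, kernel):
    """Slide the kernel over the signal (valid positions only) and return the responses."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, 2, 1]                     # hypothetical pattern detector
signal = [0, 0, 1, 2, 1, 0, 0, 0, 0]   # pattern near the left
shifted = [0, 0, 0, 0, 0, 1, 2, 1, 0]  # same pattern moved three steps to the right

response = correlate1d(signal, kernel)           # [1, 4, 6, 4, 1, 0, 0]
response_shifted = correlate1d(shifted, kernel)  # [0, 0, 0, 1, 4, 6, 4]

# The peak moves with the pattern, but its strength is unchanged.
print(response.index(max(response)), max(response))                  # 2 6
print(response_shifted.index(max(response_shifted)), max(response_shifted))  # 5 6
```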

Along with these two features, a CNN works on the principle of local connectivity. Local connectivity is a vital feature that makes it different from an ordinary neural network. Have a look at this image to get an idea of what local connectivity is. If you look carefully at the image, you will see that the total number of connections a neuron receives in the conv layer comes out to be 13.

Local connectivity: This is what makes a CNN different from an ordinary NN (Neural Network). Because of local connectivity, a CNN is able to learn from adjacent, neighbouring pixels, which is very important for tasks such as image recognition, object detection, image classification or image segmentation, because nearby pixels carry more closely related information than far away pixels.

### CNN/ConvNet Layers

A ConvNet comprises two sections. Section 1 extracts the features and Section 2 performs the classification. The former contains layers such as the convolutional, pooling and normalization layers, whereas the latter works in the same way as an ordinary neural network and is called the fully connected network. The classification step comprises a flatten layer, fully connected layers and an activation function that is applied at the dense layer. Let us discuss these layers section by section.

Convolutional Layer: This is the core building block of a convolutional neural network, and much of the computation happens in this layer. Here the input is convolved with filters that hold the weights. Local connectivity is the principle used in this layer: each neuron is connected to only a small region of the input image. Please remember that the area of this region equals the length*width of the filter. The convolution produces an output volume whose depth equals the number of filters used. The output volume can have a large or a small number of parameters, depending on whether parameter sharing is used or not. I urge readers to read both documents to get a better idea of how the number of parameters is reduced by parameter sharing. A lot of mathematics is involved in these documents, so please go through the calculations as well. Have a look at the image below to see the output of the convolution operation, or in a broader sense the activation maps that are formed as the output. The calculation is taken from parameter sharing, so please keep that document open to follow how the output volume is formed from the input volume.
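The output dimension of a convolutional layer follows the widely used formula (W - F + 2P) / S + 1, where W is the input size, F the filter size, P the padding and S the stride. The sketch below implements that formula and a naive single-channel convolution in plain Python. It assumes square images and filters, and the image and filter values are invented for the example; as in most deep learning libraries, the filter is slid over the image without flipping it.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv2d_single_channel(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding) and sum the products."""
    f = len(kernel)
    out = conv_output_size(len(image), f, stride)
    result = []
    for r in range(out):
        row = []
        for c in range(out):
            total = 0
            # Each output neuron only sees an f x f region of the input.
            for i in range(f):
                for j in range(f):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        result.append(row)
    return result

image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
filters = [
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],  # hypothetical filter 1
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # hypothetical filter 2 (identity)
]

# One activation map per filter: the output depth equals the number of filters.
feature_maps = [conv2d_single_channel(image, k) for k in filters]
print(feature_maps)  # [[[9, 6], [4, 9]], [[1, 2], [0, 1]]]
```

For example, `conv_output_size(32, 5)` gives 28, and `conv_output_size(32, 5, stride=1, padding=2)` keeps the size at 32.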

Depending on the kernel g(x) that is used, different features are extracted. Kernels come in many types, such as kernels for edge detection, sharpening, box blur, Gaussian blur, identity and many more. The importance of the convolutional layers is that they learn increasingly high-level representations of an image: for example, the first convolutional layer detects edges, the second curves and the third the whole object. Other feature extraction algorithms such as HOG and SIFT, by comparison, detect local intensity variations and then combine them to make the classification. In this way deep learning tries to work in a way similar to how humans work, which makes it different from traditional machine learning. This is also the reason that it has gained a lot of success in recent times, both in accuracy and in practical usage.
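To see how the choice of kernel changes what is extracted, the sketch below applies a few classic hand-crafted 3x3 kernels to a tiny image with a vertical edge. The kernel values are the commonly used textbook ones (the edge kernel is Sobel-style), and the image is invented for the example; in a trained CNN the kernel values are learnt rather than fixed by hand.

```python
def correlate2d(image, kernel):
    """Slide a square kernel over a 2-D image (no padding) and sum the products."""
    f = len(kernel)
    rows, cols = len(image) - f + 1, len(image[0]) - f + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(f) for j in range(f))
             for c in range(cols)]
            for r in range(rows)]

KERNELS = {
    "identity": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "box_blur": [[1 / 9] * 3 for _ in range(3)],
    "sharpen": [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "vertical_edge": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],  # Sobel-style
}

# Hypothetical 4x5 image: dark columns (0) on the left, bright columns (9) on the right.
image = [[0, 0, 9, 9, 9] for _ in range(4)]

edges = correlate2d(image, KERNELS["vertical_edge"])
same = correlate2d(image, KERNELS["identity"])
blurred = correlate2d(image, KERNELS["box_blur"])
sharpened = correlate2d(image, KERNELS["sharpen"])

print(edges)  # [[36, 36, 0], [36, 36, 0]]: strong response at the edge, zero in the flat area
print(same)   # [[0, 9, 9], [0, 9, 9]]: the identity kernel copies the centre pixel
```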

Pooling Layer: It is inserted between successive conv layers. It reduces the spatial dimensions (width and height) with the help of a filter and a stride. Normally a stride of up to 2 is used for better results. Pooling is divided into two categories (average and max pooling).

• Average Pooling: It takes the average of the values in each window of the feature map.
• Max Pooling: It picks the maximum value in each window of the feature map.

Please have a look at the image attached here to understand it better. Because the pooling layer shrinks the feature maps, the fully connected layers that follow need fewer parameters and the later layers perform fewer computations, and hence the computational cost is reduced.
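Both kinds of pooling can be sketched with one small function. The 2x2 window with stride 2 and the feature map values are assumptions chosen for the example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the feature map and reduce each window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [feature_map[r * stride + i][c * stride + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [3.5, 5.0]]
```

The 4x4 feature map becomes 2x2, so width and height are halved while the depth stays the same. Note that the pooling layer itself has no weights to learn.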

Batch Normalization layer: To speed up the training of a convolutional neural network and to reduce its sensitivity to network initialization, batch normalization is used between the convolutional layers and the non-linear layers such as ReLU and tanh. Please see the attached image to understand it with a diagram.
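In code, the training-time computation for one feature looks roughly like the sketch below: subtract the batch mean, divide by the batch standard deviation, then scale by gamma and shift by beta (both learnt during training). The batch values and the default gamma, beta and eps are assumptions for illustration; at inference time, libraries typically use running averages instead of the statistics of the current batch.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

def relu(v):
    return max(0.0, v)

# Hypothetical conv outputs for one feature across a batch of four images.
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(pre_activations)      # roughly zero mean, unit variance
activated = [relu(v) for v in normalized]     # non-linearity applied after normalization
print(normalized, activated)
```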

Dropout layer: To prevent overfitting, it is appropriate to use a dropout layer. It may be used in the convolutional part (conv-ReLU-dropout), but this is not recommended, as it leads to the loss of some important information that is to be learnt from an image. Dropout means randomly ignoring some of the neurons during training, so that the network does not rely too heavily on any single neuron. Values such as 0.1, 0.2 and 0.3 are considered good, but some research papers and readings also use a value of 0.5. These values represent the fraction of neurons in a layer to be discarded. So use dropout on the dense layers (fully connected layers).
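A minimal sketch of the mechanism is shown below. It uses the "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at prediction time; the function name and the example values are assumptions made for illustration.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each activation with probability `rate` during training
    and rescale the survivors by 1 / (1 - rate). At prediction time, do nothing."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is reproducible
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, training=False)
print(dropped)
```

Each value in `dropped` is either 0.0 (discarded) or 2.0 (kept and rescaled). The number of weights in the layer is unchanged; only which neurons take part in a given training step varies.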

Flatten Layer: The output of the feature extraction phase is a multi-dimensional array, whereas the input to the fully connected layer must be a one-dimensional vector. The flatten layer converts the n-dimensional output into the 1-D input for the fully connected layer.
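Flattening involves no learning at all; it only rearranges the values. A small sketch with nested Python lists standing in for a hypothetical 2x2x2 output volume:

```python
def flatten(volume):
    """Recursively flatten a nested list (e.g. height x width x depth) into a 1-D list."""
    if not isinstance(volume, list):
        return [volume]
    flat = []
    for item in volume:
        flat.extend(flatten(item))
    return flat

volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # hypothetical 2x2x2 output volume
vector = flatten(volume)
print(vector)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The length of the vector is simply height * width * depth of the volume, here 2 * 2 * 2 = 8.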

Fully connected layer: The output of the flatten layer is fed to the fully connected layer, also called the dense layer. Fully connected means that each neuron receives input from every element of the previous layer. Please see the code below to understand this point in more detail.

```python
classifier.add(Dense(units=128, activation='relu'))
```

Explanation of the above code

Line 1: According to the above code, there are 128 neurons/units in the hidden layer, and the result is passed to the activation function, which is ReLU. I would urge readers to read this document.

Line 2 (the next line, which does not appear in the snippet above): This dense layer is the output layer, and it has one neuron to predict the output. When more than one neuron is used to predict the output, the problem is called multiclass classification, whereas the problem above is binary classification.
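The sketch below puts some numbers on these two layers and contrasts the binary and multiclass cases. It is written in plain Python rather than with the library used in the snippet; the flattened input size of 6272, the raw scores, and the use of sigmoid for one output neuron and softmax for several are assumptions for illustration (they are the usual choices, but the original code for the output layer is not shown).

```python
import math

def dense_param_count(num_inputs, units):
    """Every unit has one weight per input plus one bias."""
    return num_inputs * units + units

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened input of 6272 values feeding the 128-unit hidden layer.
hidden_params = dense_param_count(6272, 128)  # 802944
output_params = dense_param_count(128, 1)     # 129

# Binary classification: one output neuron, sigmoid, threshold at 0.5.
probability = sigmoid(0.8)
label = 1 if probability >= 0.5 else 0

# Multiclass classification: one output neuron per class, softmax, pick the largest.
class_probs = softmax([2.0, 1.0, 0.1])
predicted_class = class_probs.index(max(class_probs))

print(hidden_params, output_params, label, predicted_class)
```

Because every unit is connected to every input, the dense layers usually hold most of the parameters of a ConvNet, which is one more reason to shrink the feature maps before flattening.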