MULTI LAYER PERCEPTRON explained

Sahib Singh
Published in Analytics Vidhya · 7 min read · Apr 27, 2021


I am beginning my blogging journey today. For my very first piece I'll be explaining a simple but very essential concept for studying DEEP LEARNING:
the MULTI LAYER PERCEPTRON.

DATA SET

For this blog we will be using the make_moons data set from sklearn.
This dataset is selected because it cannot be separated by a straight line.
Execute the following script to generate the data set that we are going to use to train and test our neural network.
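Here is a minimal sketch of what that script can look like (the variable names features and labels and the noise level are my own choices):

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

np.random.seed(0)  # reproducible data and weights
features, labels = datasets.make_moons(100, noise=0.10)  # two interleaving half circles
labels = labels.reshape(100, 1)  # column vector, convenient for the matrix maths later

plt.scatter(features[:, 0], features[:, 1], c=labels.ravel(), cmap=plt.cm.coolwarm)
plt.show()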

In the script above we import the datasets module from the sklearn library. To create a non-linear dataset of 100 data points, we use the make_moons method and pass it 100 as the first parameter. The method returns a dataset which, when plotted, contains two interleaving half circles, as shown in the figure below:

You can clearly see that this data cannot be separated by a single straight line, hence the perceptron cannot be used to correctly classify this data.

Let's verify this concept. To do so, we'll use a simple perceptron with one input layer and one output layer, because this kind of structure is basically a logistic regression function with an input and an output, and we know we cannot classify this data with just a logistic regression (I mean we can try, but the error won't decrease below a certain value).
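One quick way to see this is to fit a plain logistic regression on the same data (a sketch of my own, using scikit-learn's LogisticRegression instead of a hand-rolled perceptron):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(features, labels.ravel())
# a single straight decision boundary tops out well below 100% accuracy on the moons data
print("training accuracy:", clf.score(features, labels.ravel()))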

Neural Networks with One Hidden Layer

In this section, we will create a neural network with one input layer, one hidden layer, and one output layer (I prefer to call it a SINGLE LAYER PERCEPTRON, since it has a single hidden layer). The architecture of our neural network will look like this:

In the figure above, we have a neural network with 2 inputs, one hidden layer, and one output layer. The hidden layer has 4 nodes. The output layer has 1 node since we are solving a binary classification problem, where there can be only two possible outputs. This neural network architecture is capable of finding non-linear boundaries.

The basic principle of a neural network comes down to four steps:

  1. FEED FORWARD
  2. CALCULATE LOSS
  3. BACK PROPAGATION
  4. CHANGE WEIGHTS

And these are the four things we are going to do in this blog today.

FEED FORWARD

For each row we have two features (x1 and x2). To calculate the value of each node in the hidden layer, we multiply the inputs by the corresponding weights of the node whose value we are calculating, and then pass the result through an activation function to get the final value. Initially the weights are chosen randomly. Since the input layer (2 nodes) is connected to the 4 nodes of the HIDDEN LAYER,
our weight matrix for layer 1 will be of shape (2, 4), holding 8 weights in total, because every input node is connected to all 4 nodes of the HIDDEN LAYER.

weights_layer1 = np.random.rand(features.shape[1], 4)  # weight matrix of shape (2, 4) for the moons data: 2 input features x 4 hidden nodes

For instance to calculate the final value for the first node in the hidden layer, which is denoted by “ah1”, you need to perform the following calculation:

zh1 = x1*w1 + x2*w2 → Equation a
ah1 = 1/(1 + np.exp(-zh1)) → Equation b

This is the resulting value for the top-most node in the hidden layer. In the same way, you can calculate the values for the 2nd, 3rd, and 4th nodes of the hidden layer.

Similarly, to calculate the value for the output layer, the values in the hidden layer nodes are treated as inputs. Therefore, to calculate the output, multiply the values of the hidden layer nodes with their corresponding weights and pass the result through an activation function.

This operation can be mathematically expressed by the following equation:

zo = ah1*w9+ah2*w10+ah3*w11+ah4*w12 → Equation c
ao = 1/(1+np.exp(-zo)) → Equation d
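Putting the whole feed-forward step into NumPy for all 100 rows at once (a sketch; the sigmoid helper and weights_layer2 are names I'm introducing here, and weights_layer1 is the matrix defined earlier):

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

weights_layer2 = np.random.rand(4, 1)  # w9..w12: 4 hidden nodes feeding 1 output node

zh = np.dot(features, weights_layer1)  # Equation a for every row and hidden node at once
ah = sigmoid(zh)                       # Equation b, shape (100, 4)
zo = np.dot(ah, weights_layer2)        # Equation c
ao = sigmoid(zo)                       # Equation d: predicted outputs, shape (100, 1)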

BACK PROPAGATION

In the back-propagation phase, we will first define our loss function. We will be using the mean squared error cost function. It can be represented mathematically as:

MSE = (1/(2*n)) * sum((ao - labels)**2)

AND THIS IS OUR COST FUNCTION.

Here n is the number of observations.

Phase 1

In the first phase of back propagation, we need to update the weights of the output layer, i.e. w9, w10, w11, and w12. So for the time being, just consider that our neural network has only the following part:

The purpose of the first phase of back propagation is to update weights w9, w10, w11, and w12 in such a way that the final error is minimized. This is an optimization problem where we have to find the function minima for our cost function.

To find the minima of a function, we can use the gradient descent algorithm, which repeatedly applies the following update to each weight:

w = w - learning_rate * dcost_dw

In our neural network, the predicted output is represented by “ao”, which means that we basically have to minimize this function:

Cost = ((1 / 2) * (np.power((ao - labels), 2)))

  1. labels are the true values
  2. ao are the predicted values
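If you want a single number to monitor as training progresses, you can average this cost over all observations (a small addition of my own, not part of the original script):

loss = np.mean(np.power(ao - labels, 2)) / 2  # one scalar per forward pass
print(loss)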

We have to update the weight values such that the cost decreases. To do so, we need to take the derivative of the cost function with respect to each weight. Since in this phase we are dealing with the weights of the output layer, we need to differentiate the cost function with respect to w9, w10, w11, and w12.

The differentiation of the cost function is calculated using the chain rule.

FOR OUR BLOG

we write a derivative such as dao/dwo as dao_dwo ( VERY IMPORTANT )

dcost_dwo = dcost_dao * dao_dzo * dzo_dwo → Equation 1

Here “wo” refers to the weights in the output layer. The letter “d” at the start of each term refers to derivative.

dcost_dao = 2 * (ao - labels) / n, where n is the total number of observations.

Here 2 and n are just constant scaling factors (they can be absorbed into the learning rate), so we can ignore them.
So dcost_dao = (ao - labels) → Equation 2

Next we need the derivative of ao with respect to zo, which is simply the derivative of the sigmoid:
dao_dzo = sigmoid(zo) * (1 - sigmoid(zo)) → Equation 3

Finally, we need to find the derivative of “zo” with respect to “wo”. This derivative is simply the inputs coming from the hidden layer, as shown below:

dzo_dwo = ah →Equation 4

Here “ah” refers to the 4 values coming from the hidden layer. Equation 1 can be used to find the updated weight values for the output layer. To find the new weight values, the gradient returned by Equation 1 is simply multiplied by the learning rate and subtracted from the current weight values. This is straightforward, and we have done this previously.
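In code, Phase 1 can look like this (a sketch continuing the feed-forward snippet above; the learning rate value is an assumption):

learning_rate = 0.5  # assumed value, tune as needed

dcost_dao = ao - labels                    # Equation 2, shape (100, 1)
dao_dzo = sigmoid(zo) * (1 - sigmoid(zo))  # Equation 3
dzo_dwo = ah                               # Equation 4, shape (100, 4)

# Equation 1, summed over all observations via the transpose / dot product
dcost_dwo = np.dot(dzo_dwo.T, dcost_dao * dao_dzo)  # shape (4, 1), same as weights_layer2

weights_layer2 -= learning_rate * dcost_dwo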

Phase 2

In the previous section, we saw how to find the updated values for the output layer weights, i.e. w9, w10, w11, and w12. In this section, we will back-propagate the error to the previous layer and find the new weight values for the hidden layer weights, i.e. weights w1 to w8.

We denote the hidden layer weights as “wh”. We basically have to differentiate the cost function with respect to “wh”. Mathematically, we can use the chain rule of differentiation to represent it as:

dcost_dwh = dcost_dah * dah_dzh * dzh_dwh → Equation 7

Again we will break Equation 7 into parts.

The first term, dcost_dah, can be expanded using the chain rule of differentiation as follows:

dcost_dah = dcost_dzo * dzo_dah → equation 7.1

Let's again break Equation 7.1 into individual terms. Using the chain rule once more, we can write dcost_dzo as follows:

dcost_dzo = dcost_dao * dao_dzo → equation 7.2

We have already calculated the value of dcost_dao in Equation 2 and dao_dzo in Equation 3.

Now we need to find dzo_dah from Equation c. If we look at zo, it has the following value:

zo = ah1*w9+ah2*w10+ah3*w11+ah4*w12

If we differentiate it with respect to the inputs coming from the hidden layer, denoted by “ah”, then we are left with the weights of the output layer, denoted by “wo”. Therefore:

dzo_dah = wo →equation 7.3

Now we can find the value of dcost_dah by substituting Equations 7.2 and 7.3 into Equation 7.1.

Coming back to Equation 7, we have yet to find dah_dzh and dzh_dwh.

dah_dzh= sigmoid(zh)* ( 1- sigmoid(zh)) →equation 7.4

dzh_dwh = input features →equation 7.5

If we substitute Equations 7.1, 7.4, and 7.5 into Equation 7, we get the gradient matrix for the hidden layer weights. To find the new values of the hidden layer weights “wh”, the gradient returned by Equation 7 is simply multiplied by the learning rate and subtracted from the current weight values. And that's pretty much it.
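In code, Phase 2 can look like this (again a sketch continuing the earlier snippets; in a real training loop you would compute both gradients before applying either weight update):

dcost_dzo = dcost_dao * dao_dzo            # Equation 7.2, shape (100, 1)
dzo_dah = weights_layer2                   # Equation 7.3, shape (4, 1)
dcost_dah = np.dot(dcost_dzo, dzo_dah.T)   # Equation 7.1, shape (100, 4)

dah_dzh = sigmoid(zh) * (1 - sigmoid(zh))  # Equation 7.4
dzh_dwh = features                         # Equation 7.5, shape (100, 2)

# Equation 7: gradient for the hidden layer weights, shape (2, 4)
dcost_dwh = np.dot(dzh_dwh.T, dcost_dah * dah_dzh)

weights_layer1 -= learning_rate * dcost_dwh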
There are a lot of calculations, but yes,
you have learnt something very beautiful.

The link for the code is
https://github.com/sahibpreetsingh12/100daysofmlcode/blob/main/D8-SLP/D8-SLP.ipynb

and the link for the repo is https://github.com/sahibpreetsingh12/100daysofmlcode

Hope you like it :)
