How to Implement the Backpropagation Algorithm From Scratch In Python

Description

This section provides a brief introduction to the Backpropagation Algorithm and the Wheat Seeds dataset that we will be using in this tutorial.

Backpropagation Algorithm

The Backpropagation algorithm is a supervised learning method for multilayer feed-forward networks from the field of Artificial Neural Networks.

Feed-forward neural networks are inspired by the information processing of one or more neural cells, called neurons. A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon carries the signal out to synapses, which are the connections of a cell’s axon to other cells’ dendrites.

The principle of the backpropagation approach is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system’s output and a known expected output is presented to the system and used to modify its internal state.

Technically, the backpropagation algorithm is a method for training the weights in a multilayer feed-forward neural network. As such, it requires a network structure to be defined of one or more layers where one layer is fully connected to the next layer. A standard network structure is one input layer, one hidden layer, and one output layer.

Backpropagation can be used for both classification and regression problems, but we will focus on classification in this tutorial.

In classification problems, best results are achieved when the network has one neuron in the output layer for each class value. For example, consider a 2-class or binary classification problem with the class values A and B. The expected outputs would have to be transformed into binary vectors with one column for each class value, such as [1, 0] and [0, 1] for A and B respectively. This is called a one hot encoding.
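
As a minimal sketch of this encoding, class values can be mapped to binary vectors as shown here. The helper name one_hot() and the mapping of A and B to the indices 0 and 1 are assumptions made for illustration; they are not part of the tutorial's code.

```python
# Hypothetical helper: turn an integer class index into a one hot vector.
def one_hot(class_index, n_classes):
    vector = [0] * n_classes
    vector[class_index] = 1
    return vector

# Assumed mapping of class values to integer indices: A -> 0, B -> 1
labels = {'A': 0, 'B': 1}
print(one_hot(labels['A'], 2))
print(one_hot(labels['B'], 2))
```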

Wheat Seeds Dataset

The seeds dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

There are 201 records and 7 numerical input variables. It is a classification problem with 3 output classes. The scale of each numeric input value varies, so some data normalization may be required for use with algorithms that weight inputs like the backpropagation algorithm.

Below is a sample of the first 5 rows of the dataset.

15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1

Using the Zero Rule algorithm that predicts the most common class value, the baseline accuracy for the problem is 28.095%.
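
For readers unfamiliar with the Zero Rule baseline, here is a rough sketch of the idea: find the most common class in the training labels and predict it for every test row. The function name and the toy labels are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical helper: predict the most common training class for every test row.
def zero_rule_classification(train_labels, n_test):
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

# Toy labels, not taken from the seeds dataset
train_labels = [1, 1, 2, 3, 1, 2]
print(zero_rule_classification(train_labels, 3))
```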

Download the seeds dataset and place it into your current working directory with the filename seeds_dataset.csv. The dataset is in tab-separated format, so you must convert it to CSV using a text editor or a spreadsheet program.

Tutorial

This tutorial is broken down into 6 parts:

Initialize Network.

Forward Propagate.

Back Propagate Error.

Train Network.

Predict.

Seeds Dataset Case Study.

These steps will provide the foundation that you need to implement the backpropagation algorithm from scratch and apply it to your own predictive modeling problems.

1. Initialize Network

Let’s start with something easy, the creation of a new network ready for training.

Each neuron has a set of weights that need to be maintained: one weight for each input connection and an additional weight for the bias. We will need to store additional properties for a neuron during training, so we will use a dictionary to represent each neuron and store properties by names such as ‘weights’ for the weights.

A network is organized into layers. The input layer is really just a row from our training dataset. The first real layer is the hidden layer. This is followed by the output layer that has one neuron for each class value.

We will organize layers as arrays of dictionaries and treat the whole network as an array of layers.

It is good practice to initialize the network weights to small random numbers. In this case, we will use random numbers in the range of 0 to 1.

Below is a function named initialize_network() that creates a new neural network ready for training. It accepts three parameters, the number of inputs, the number of neurons to have in the hidden layer and the number of outputs.

You can see that for the hidden layer we create n_hidden neurons and each neuron in the hidden layer has n_inputs + 1 weights, one for each input column in a dataset and an additional one for the bias.

You can also see that the output layer that connects to the hidden layer has n_outputs neurons, each with n_hidden + 1 weights. This means that each neuron in the output layer connects to (has a weight for) each neuron in the hidden layer.
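
The listing for initialize_network() is not reproduced in this text, so what follows is a minimal sketch consistent with the description above rather than a definitive implementation. The seed value is an assumption, used only so that repeated runs print the same weights.

```python
from random import seed, random

# Sketch of network creation: each neuron is a dictionary holding its
# weights, with the bias stored as the last weight in the list.
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = []
    hidden_layer = [{'weights': [random() for _ in range(n_inputs + 1)]}
                    for _ in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights': [random() for _ in range(n_hidden + 1)]}
                    for _ in range(n_outputs)]
    network.append(output_layer)
    return network

seed(1)  # assumed seed, for repeatable output only
network = initialize_network(2, 1, 2)
for layer in network:
    print(layer)
```

The call above builds a network with 2 inputs, 1 hidden neuron and 2 output neurons.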

Running the example, you can see that the code prints out each layer one by one. You can see the hidden layer has one neuron with 2 input weights plus the bias. The output layer has 2 neurons, each with 1 weight plus the bias.

Now that we know how to create and initialize a network, let’s see how we can use it to calculate an output.

2. Forward Propagate

We can calculate an output from a neural network by propagating an input signal through each layer until the output layer outputs its values.

We call this forward-propagation.

It is the technique we will need to generate predictions during training that will need to be corrected, and it is the method we will need after the network is trained to make predictions on new data.

We can break forward propagation down into three parts:

Neuron Activation.

Neuron Transfer.

Forward Propagation.

2.1. Neuron Activation

The first step is to calculate the activation of one neuron given an input.

The input could be a row from our training dataset, as in the case of the hidden layer. It may also be the outputs from each neuron in the hidden layer, in the case of the output layer.

Neuron activation is calculated as the weighted sum of the inputs, much like linear regression.

activation = sum(weight_i * input_i) + bias

Where weight is a network weight, input is an input, i is the index of a weight or an input and bias is a special weight that has no input to multiply with (or you can think of the input as always being 1.0).

Below is an implementation of this in a function named activate(). You can see that the function assumes that the bias is the last weight in the list of weights. This helps here and later to make the code easier to read.

# Calculate neuron activation for an input
def activate(weights, inputs):
    activation = weights[-1]
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

Now, let’s see how to use the neuron activation.

2.2. Neuron Transfer

Once a neuron is activated, we need to transfer the activation to see what the neuron output actually is.

The sigmoid activation function looks like an S shape; it is also called the logistic function. It can take any input value and produce a number between 0 and 1 on an S-curve. It is also a function whose derivative (slope) we can easily calculate, which we will need later when backpropagating error.

We can transfer an activation using the sigmoid function as follows:

output = 1 / (1 + e^(-activation))

Where e is the base of the natural logarithms (Euler’s number).

Below is a function named transfer() that implements the sigmoid equation.

from math import exp

# Transfer neuron activation
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

Now that we have the pieces, let’s see how they are used.

2.3. Forward Propagation

Forward propagating an input is straightforward.

We work through each layer of our network calculating the outputs for each neuron. All of the outputs from one layer become inputs to the neurons on the next layer.

Below is a function named forward_propagate() that implements the forward propagation for a row of data from our dataset with our neural network.

You can see that a neuron’s output value is stored in the neuron with the name ‘output’. You can also see that we collect the outputs for a layer in an array named new_inputs that becomes the array inputs and is used as inputs for the following layer.

The function returns the outputs from the last layer also called the output layer.

# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

Let’s put all of these pieces together and test out the forward propagation of our network.

We define our network inline with one hidden neuron that expects 2 input values and an output layer with two neurons.

Running the example propagates the input pattern [1, 0] and produces an output value that is printed. Because the output layer has two neurons, we get a list of two numbers as output.

The actual output values are just nonsense for now, but next, we will start to learn how to make the weights in the neurons more useful.
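
The complete listing is not reproduced in this text, so here is a minimal sketch of the test. The inline weights are assumed values chosen for illustration; with them, the two outputs come out at roughly 0.663 and 0.725. The trailing None in the row stands in for a class label, which forward propagation never reads.

```python
from math import exp

# Calculate neuron activation for an input (bias is the last weight)
def activate(weights, inputs):
    activation = weights[-1]
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

# Transfer neuron activation with the sigmoid function
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

# Assumed weights: one hidden neuron (2 inputs + bias),
# two output neurons (1 input + bias each)
network = [
    [{'weights': [0.13436424411240122, 0.8474337369372327, 0.763774618976614]}],
    [{'weights': [0.2550690257394217, 0.49543508709194095]},
     {'weights': [0.4494910647887381, 0.651592972722763]}],
]
row = [1, 0, None]
output = forward_propagate(network, row)
print(output)
```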

[0.6629970129852887, 0.7253160725279748]

3. Back Propagate Error

The backpropagation algorithm is named for the way in which weights are trained.

Error is calculated between the expected outputs and the outputs forward propagated from the network. These errors are then propagated backward through the network from the output layer to the hidden layer, assigning blame for the error and updating weights as they go.

The math for backpropagating error is rooted in calculus, but we will remain high level in this section and focus on what is calculated and how rather than why the calculations take this particular form.

This part is broken down into two sections.

Transfer Derivative.

Error Backpropagation.

3.1. Transfer Derivative

Given an output value from a neuron, we need to calculate its slope.

We are using the sigmoid transfer function, the derivative of which can be calculated as follows:

derivative = output * (1.0 - output)

Below is a function named transfer_derivative() that implements this equation.

# Calculate the derivative of a neuron output
def transfer_derivative(output):
    return output * (1.0 - output)

Now, let’s see how this can be used.

3.2. Error Backpropagation

The first step is to calculate the error for each output neuron. This gives us the error signal (input) to propagate backwards through the network.

The error for a given neuron can be calculated as follows:

error = (expected - output) * transfer_derivative(output)

Where expected is the expected output value for the neuron, output is the output value for the neuron and transfer_derivative() calculates the slope of the neuron’s output value, as shown above.

This error calculation is used for neurons in the output layer. The expected value is the element of the one hot encoded class value that corresponds to the neuron. In the hidden layer, things are a little more complicated.

The error signal for a neuron in the hidden layer is calculated as the weighted error of each neuron in the output layer. Think of the error traveling back along the weights of the output layer to the neurons in the hidden layer.

The back-propagated error signal is accumulated and then used to determine the error for the neuron in the hidden layer, as follows:

error = (weight_k * error_j) * transfer_derivative(output)

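Where error_j is the error signal from the jth neuron in the output layer, weight_k is the weight that connects the kth neuron to the current neuron and output is the output for the current neuron. In the implementation below, this product is summed over every neuron in the output layer before it is multiplied by the transfer derivative.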

Below is a function named backward_propagate_error() that implements this procedure.

You can see that the error signal calculated for each neuron is stored with the name ‘delta’. You can see that the layers of the network are iterated in reverse order, starting at the output and working backwards. This ensures that the neurons in the output layer have ‘delta’ values calculated first that neurons in the hidden layer can use in the subsequent iteration. I chose the name ‘delta’ to reflect the change the error implies on the neuron (e.g. the weight delta).

You can see that the error signal for neurons in the hidden layer is accumulated from neurons in the output layer where the hidden neuron number j is also the index of the neuron’s weight in the output layer neuron['weights'][j].

# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network) - 1:
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(expected[j] - neuron['output'])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

Let’s put all of the pieces together and see how it works.

We define a fixed neural network with output values and backpropagate an expected output pattern. The complete example is listed below.
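
The complete listing is not reproduced in this text. As a hedged stand-in, the sketch below forward propagates one input through a small network to obtain output values and then backpropagates the expected pattern [0, 1]. The weights and the input row are assumptions for illustration, not the values used in the original example.

```python
from math import exp

def activate(weights, inputs):
    activation = weights[-1]  # bias is the last weight
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            neuron['output'] = transfer(activate(neuron['weights'], inputs))
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

def transfer_derivative(output):
    return output * (1.0 - output)

# Backpropagate error and store the result in each neuron as 'delta'
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = []
        if i != len(network) - 1:
            # Hidden layer: accumulate weighted deltas from the next layer
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += neuron['weights'][j] * neuron['delta']
                errors.append(error)
        else:
            # Output layer: compare against the expected pattern
            for j, neuron in enumerate(layer):
                errors.append(expected[j] - neuron['output'])
        for j, neuron in enumerate(layer):
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

# Assumed weights for illustration
network = [
    [{'weights': [0.13436424411240122, 0.8474337369372327, 0.763774618976614]}],
    [{'weights': [0.2550690257394217, 0.49543508709194095]},
     {'weights': [0.4494910647887381, 0.651592972722763]}],
]
forward_propagate(network, [1, 0, None])  # fills in the 'output' values
backward_propagate_error(network, [0, 1])
for layer in network:
    print(layer)
```

After running, every neuron dictionary carries a ‘delta’ entry next to its ‘weights’ and ‘output’. The output neuron whose expected value is 0 receives a negative delta, and the one whose expected value is 1 receives a positive delta.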

4. Train Network

The network is trained using stochastic gradient descent. This involves multiple iterations of exposing a training dataset to the network and, for each row of data, forward propagating the inputs, backpropagating the error and updating the network weights.

This part is broken down into two sections:

Update Weights.

Train Network.

4.1. Update Weights

Once errors are calculated for each neuron in the network via the back propagation method above, they can be used to update weights.

Network weights are updated as follows:

weight = weight + learning_rate * error * input

Where weight is a given weight, learning_rate is a parameter that you must specify, error is the error calculated by the backpropagation procedure for the neuron and input is the input value that caused the error.

The same procedure can be used for updating the bias weight, except there is no input term, or input is the fixed value of 1.0.

Learning rate controls how much to change the weight to correct for the error. For example, a value of 0.1 will update the weight by 10% of the amount that it possibly could be updated. Small learning rates that cause slower learning over a large number of training iterations are preferred. This increases the likelihood of the network finding a good set of weights across all layers rather than quickly settling on the first set of weights that reduces the error (called premature convergence).

Below is a function named update_weights() that updates the weights for a network given an input row of data and a learning rate. It assumes that a forward and backward propagation have already been performed.

Remember that the input for the output layer is a collection of outputs from the hidden layer.

# Update network weights with error
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] += l_rate * neuron['delta']

Now that we know how to update network weights, let’s see how we can do it repeatedly.

4.2. Train Network

As mentioned, the network is updated using stochastic gradient descent.

This involves first looping for a fixed number of epochs and within each epoch updating the network for each row in the training dataset.

Because updates are made for each training pattern, this type of learning is called online learning. If errors were instead accumulated across an epoch before updating the weights, it would be called batch learning or batch gradient descent.

Below is a function that implements the training of an already initialized neural network with a given training dataset, learning rate, fixed number of epochs and an expected number of output values.

The expected number of output values is used to transform class values in the training data into a one hot encoding. That is a binary vector with one column for each class value to match the output of the network. This is required to calculate the error for the output layer.

You can also see that the sum squared error between the expected output and the network output is accumulated each epoch and printed. This is helpful to create a trace of how much the network is learning and improving each epoch.

We now have all of the pieces to train the network. We can put together an example that includes everything we’ve seen so far including network initialization and train a network on a small dataset.

Below is a small contrived dataset that we can use to test out training our neural network.

X1              X2              Y
2.7810836       2.550537003     0
1.465489372     2.362125076     0
3.396561688     4.400293529     0
1.38807019      1.850220317     0
3.06407232      3.005305973     0
7.627531214     2.759262235     1
5.332441248     2.088626775     1
6.922596716     1.77106367      1
8.675418651     -0.242068655    1
7.673756466     3.508563011     1

Below is the complete example. We will use 2 neurons in the hidden layer. It is a binary classification problem (2 classes) so there will be two neurons in the output layer. The network will be trained for 20 epochs with a learning rate of 0.5, which is high because we are training for so few iterations.
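
The complete listing is not reproduced in this text, so the following is a hedged sketch assembled from the functions described so far. The bodies of initialize_network() and train_network() are reconstructions from the prose, the seed value is an assumption, and returning the per-epoch errors is an addition made here so that the trend can be inspected.

```python
from math import exp
from random import seed, random

def initialize_network(n_inputs, n_hidden, n_outputs):
    hidden = [{'weights': [random() for _ in range(n_inputs + 1)]} for _ in range(n_hidden)]
    output = [{'weights': [random() for _ in range(n_hidden + 1)]} for _ in range(n_outputs)]
    return [hidden, output]

def activate(weights, inputs):
    activation = weights[-1]  # bias is the last weight
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            neuron['output'] = transfer(activate(neuron['weights'], inputs))
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

def transfer_derivative(output):
    return output * (1.0 - output)

def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = []
        if i != len(network) - 1:
            for j in range(len(layer)):
                errors.append(sum(n['weights'][j] * n['delta'] for n in network[i + 1]))
        else:
            for j, neuron in enumerate(layer):
                errors.append(expected[j] - neuron['output'])
        for j, neuron in enumerate(layer):
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] += l_rate * neuron['delta']

# Train a network for a fixed number of epochs (online learning)
def train_network(network, train, l_rate, n_epoch, n_outputs):
    errors = []  # addition for inspection: sum squared error per epoch
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            outputs = forward_propagate(network, row)
            expected = [0] * n_outputs
            expected[row[-1]] = 1  # one hot encode the class value
            sum_error += sum((expected[i] - outputs[i]) ** 2 for i in range(n_outputs))
            backward_propagate_error(network, expected)
            update_weights(network, row, l_rate)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
        errors.append(sum_error)
    return errors

dataset = [[2.7810836, 2.550537003, 0], [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0], [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0], [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1], [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1], [7.673756466, 3.508563011, 1]]

seed(1)  # assumed seed, for repeatable output only
n_inputs = len(dataset[0]) - 1
n_outputs = len(set(row[-1] for row in dataset))
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, dataset, 0.5, 20, n_outputs)
for layer in network:
    print(layer)
```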

Running the example first prints the sum squared error each training epoch. We can see a trend of this error decreasing with each epoch.

Once trained, the network is printed, showing the learned weights. Also still in the network are output and delta values that can be ignored. We could update our training function to delete these data if we wanted.

5. Predict

Making predictions with a trained neural network is easy enough.

We have already seen how to forward-propagate an input pattern to get an output. This is all we need to do to make a prediction. We can use the output values themselves directly as the probability of a pattern belonging to each output class.

It may be more useful to turn this output back into a crisp class prediction. We can do this by selecting the class value with the largest probability. This is also called the arg max function.

Below is a function named predict() that implements this procedure. It returns the index in the network output that has the largest probability. It assumes that class values have been converted to integers starting at 0.

# Make a prediction with a network
def predict(network, row):
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))

We can put this together with our code above for forward propagating input and with our small contrived dataset to test making predictions with an already-trained network. The example hardcodes a network trained from the previous step.
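
The complete listing is not reproduced in this text, so here is a minimal sketch of it. The hard-coded weights are rounded, illustrative values assumed for this sketch; they resemble what a short training run on the contrived dataset might produce, but they are not confirmed to be the author's exact numbers.

```python
from math import exp

def activate(weights, inputs):
    activation = weights[-1]  # bias is the last weight
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            neuron['output'] = transfer(activate(neuron['weights'], inputs))
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

# Make a prediction with a network (arg max over the outputs)
def predict(network, row):
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))

dataset = [[2.7810836, 2.550537003, 0], [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0], [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0], [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1], [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1], [7.673756466, 3.508563011, 1]]

# Assumed, rounded weights standing in for a previously trained network
network = [
    [{'weights': [-1.4823, 1.8309, 1.0784]},
     {'weights': [0.2324, 0.3622, 0.4029]}],
    [{'weights': [2.5002, 0.7887, -1.1027]},
     {'weights': [-2.4294, 0.8358, 1.0699]}],
]

for row in dataset:
    prediction = predict(network, row)
    print('Expected=%d, Got=%d' % (row[-1], prediction))
```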

Running the example prints the expected output for each record in the training dataset, followed by the crisp prediction made by the network.

It shows that the network achieves 100% accuracy on this small dataset.

Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=0, Got=0
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1
Expected=1, Got=1

Now we are ready to apply our backpropagation algorithm to a real world dataset.

6. Wheat Seeds Dataset

This section applies the Backpropagation algorithm to the wheat seeds dataset.

The first step is to load the dataset and convert the loaded data to numbers that we can use in our neural network. For this we will use the helper function load_csv() to load the file, str_column_to_float() to convert string numbers to floats and str_column_to_int() to convert the class column to integer values.
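
The helper functions are named above but their listings are not reproduced in this text. The sketch below shows one plausible way to write them; the function bodies are assumptions. In particular, sorting the class values before assigning integers is a choice made here so that the mapping is repeatable.

```python
import os
import tempfile
from csv import reader

# Load a CSV file into a list of rows, skipping empty lines
def load_csv(filename):
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

# Convert a string column to floats, in place
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert a string class column to integers starting at 0, in place
def str_column_to_int(dataset, column):
    unique = sorted(set(row[column] for row in dataset))  # sorted: an assumption
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Demo on a temporary file holding the first two rows of the sample shown earlier
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('15.26,14.84,0.871,5.763,3.312,2.221,5.22,1\n')
    tmp.write('14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1\n')
    path = tmp.name
dataset = load_csv(path)
os.remove(path)
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
str_column_to_int(dataset, len(dataset[0]) - 1)
print(dataset)
```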

Input values vary in scale and need to be normalized to the range between 0 and 1. It is generally good practice to normalize input values to the range of the chosen transfer function, in this case, the sigmoid function that outputs values between 0 and 1. The dataset_minmax() and normalize_dataset() helper functions are used to normalize the input values.
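
The two normalization helpers are not listed in this text, so the following is a sketch under stated assumptions. It rescales every column except the last (the class value) and maps a constant column to 0.0 to avoid dividing by zero; that guard is an addition made here, not something the tutorial specifies.

```python
# Find the min and max values for each column
def dataset_minmax(dataset):
    return [[min(column), max(column)] for column in zip(*dataset)]

# Rescale input columns to the range 0-1, in place
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row) - 1):  # skip the class value in the last column
            low, high = minmax[i]
            if high == low:
                row[i] = 0.0  # assumed handling of a constant column
            else:
                row[i] = (row[i] - low) / (high - low)

# Toy rows for illustration: two inputs and a class value
data = [[1.0, 10.0, 0], [3.0, 10.0, 1], [2.0, 10.0, 0]]
normalize_dataset(data, dataset_minmax(data))
print(data)
```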

We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 201/5=40.2 or 40 records will be in each fold. We will use the helper functions evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions.
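
To make the fold arithmetic concrete, here is a rough sketch of a fold split and an accuracy calculation. The name cross_validation_split() appears in the update note for this post and accuracy_metric() is named above, but both bodies below are assumptions. Rows that do not divide evenly into the folds are simply left out.

```python
from random import seed, randrange

# Split a dataset into k folds of equal, integer size
def cross_validation_split(dataset, n_folds):
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        folds.append(fold)
    return folds

# Calculate accuracy as a percentage
def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

seed(1)  # assumed seed, for a repeatable split only
rows = [[i] for i in range(201)]  # 201 dummy rows standing in for the dataset
folds = cross_validation_split(rows, 5)
# Each fold holds 40 rows, so 1 of the 201 rows is left unused
print([len(fold) for fold in folds])
print(accuracy_metric([0, 1, 1, 0], [0, 1, 0, 0]))
```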

A new function named back_propagation() was developed to manage the application of the Backpropagation algorithm, first initializing a network, training it on the training dataset and then using the trained network to make predictions on a test dataset.

A network with 5 neurons in the hidden layer and 3 neurons in the output layer was constructed. The network was trained for 500 epochs with a learning rate of 0.3. These parameters were found with a little trial and error, but you may be able to do much better.

Running the example prints the classification accuracy on each fold as well as the average performance across all folds.

You can see that backpropagation and the chosen configuration achieved a mean classification accuracy of 95.238%, which is dramatically better than the Zero Rule baseline of 28.095% accuracy.

Extensions

This section lists extensions to the tutorial that you may wish to explore.

Tune Algorithm Parameters. Try larger or smaller networks trained for longer or shorter. See if you can get better performance on the seeds dataset.

Additional Methods. Experiment with different weight initialization techniques (such as small random numbers) and different transfer functions (such as tanh).

More Layers. Add support for more hidden layers, trained in just the same way as the one hidden layer used in this tutorial.

Regression. Change the network so that there is only one neuron in the output layer and that a real value is predicted. Pick a regression dataset to practice on. A linear transfer function could be used for neurons in the output layer, or the output values of the chosen dataset could be scaled to values between 0 and 1.

Batch Gradient Descent. Change the training procedure from online to batch gradient descent and update the weights only at the end of each epoch.

Did you try any of these extensions?
Share your experiences in the comments below.

Review

In this tutorial, you discovered how to implement the Backpropagation algorithm from scratch.

Specifically, you learned:

How to forward propagate an input to calculate a network output.

How to back propagate error and update network weights.

How to apply the backpropagation algorithm to a real world dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

When I step through the code above for the ‘forward_propagate’ test case, I see the code correctly generate the output for the single hidden node but that output doesn’t get correctly processed when determining the outputs for the output layer. As written above in the activate function ‘for i in range(len(inputs)-1):’, when the calculation gets to the activate function for the output node for class=0, since ‘inputs’ has a single element in it (the output from the single hidden node), ‘len(inputs) – 1’ equals 0 so the for loop never executes. I’m assuming the code is supposed to read ‘for i in range(len(weights) -1):’ Does that make sense?

I’m just trying to make sure I don’t fundamentally misunderstand something and improve this post for other readers. This site has been really, really helpful for me.

Hi, Thanks for the tutorial, I’m doing a backpropagation project at the moment so its been really useful.

I was a little confused on the back-propagation error calculation function. Does “if i != len(network)-1:” mean that if the current layer isn’t the output layer then this following code is run or does it mean that the current layer is an output layer?

I have another question.
Would it be possible to extend the code from this tutorial and create a network that trains using the MNIST handwritten digit set? using a input unit to represent each pixel in the image. I’m also not sure whether/how I could use feature extractors for the images.

I have a project where I have to implement the Backpropagation algorithm with possibly the MNIST handwritten digit training set.

Couldn’t be the case that expected[row[-1]] = 1 will throw IndexError, as n_outputs is the size of the training set which is a subset of the dataset and row basically contains values from the whole dataset?

I’ve had the same error at the ‘train_network’ function. Is your dataset fine? I’ve had some problems because the CSV file wasn’t loaded correctly due to my regional windows settings. I’ve had to adjust my settings and everything worked out alright.

Thanks for the code and post.
Why is “expected” in expected = [0 for i in range(n_outputs)] initialized to [0,0] ?
Should not the o/p values be taken as expected when training the model ?
i.e for example in case of Xor should not 1 be taken as the expected ?

Hello, I have a couple more questions. When training the network with a dataset, does the error at each epoch indicate the distance between the predicted outcomes and the expected outcomes together for the whole dataset? Also when the mean accuracy is given in my case being 13% when I used the MNIST digit set, does this mean that the network will be correct 13% of the time and would have an error rate of 87%?

The epoch error does capture how wrong the algorithm is on all training data. This may or may not be a distance depending on the error measure used. RMSE is technically not a distance measure, you could use Euclidean distance if you like, but I would not recommend it.

Yes, in general when the model makes predictions your understanding is correct.

“Where error_j is the error signal from the jth neuron in the output layer, weight_k is the weight that connects the kth neuron to the current neuron and output is the output for the current neuron.”

is the k-th neuron a neuron in the output layer or a neuron in the hidden layer we’re “on”? What about the current neuron, are you referring to the neuron in the output layer? Sorry, english is not my native tongue.

Hello Jason, great tutorial, I am developer and I do not really know much about this machine learning thing but I need to extend this your code to incorporate the Momentum aspect to the training, can you please explain how I can achieve this extension?

Hi Jason,
I have my own code written in C++, which works similar to your code. My intention is to extend my code to convolutional deep neural nets, and i have actually written the convolution, Relu and pooling functions however i could not begin to apply the backpropagation i have used in my shallow neural net, to the convolutional deep net, cause i really cant imagine the transition of the backpropagation calculation between the convolutional layers and the standard shallow layers existing in the same system. I hoped to find a source for this issue however i always come to the point that there is a standard backpropagation algorithm given for shallow nets that i applied already. Can you please guide me on this problem?

Just a suggestion for the people who would be using their own dataset(not the seeds_dataset) for training their network, make sure you add an IF loop as follows before the 45th line :
if minmax[i][1]!=minmax[i][0]

This is because your own dataset might contain same values in the same column and that might cause a divide by zero error.

Thanks jason for the amazing posts of your from scratch python implementations! i have learned so much from you!

I have followed through both your naive bayes and backprop posts, and I have a (perhaps quite naive) question:

what is the relationship between the two? did backprop actually implement bayesian inference (after all, what i understand is that bayesian = weights being updated every cycle) already? perhaps just non-gaussian? so.. are non-gaussian PDF weight updates not bayesian inference?

i guess to put it simply : is backpropagation essentially a bayesian inference loop for an n number of epochs?

I came from the naive bayes tutorial wanting to implement backpropagation together with your naive bayes implementation but got a bit lost along the way.

sorry if i was going around in circles, i sincerely hope someone would be able to at least point me on the right direction.

No, they are both very different. Naive bayes is a direct use of the probabilities and bayes theorem. The neural net is approximating a mapping function from inputs and outputs – a very different approach that does not directly use the joint probability.

Generally, you want to split the data so that each fold is representative of the dataset. The objective measure is how closely the mean performance reflect the actual performance of the model on unseen data. We can only estimate this in practice (standard error?).

thank you for the reply! I read up a bit more about the differences between Naive Bayes (or Bayesian Nets in general) and Neural Networks and found this Quora answer that i thought was very clear. I’ll put it up here to give other readers a good point to go from:

TL:DR :
– they look the same, but every node in a Bayesian Network has meaning, in that you can read a Bayesian network structure (like a mind map) and see what’s happening where and why.
– a Neural Network structure doesn’t have explicit meaning, its just dots that link previous dots.
– there are more reasons, but the above two highlighted the biggest difference.

Just a quick guess after playing around with backpropagation a little: the way NB and backprop NN would work together is by running Naive Bayes to get a good ‘first guess’ of initial weights that are then run through and Neural Network and Backpropagated?

Update Jan/2017: Changed the calculation of fold_size in cross_validation_split() to always be an integer. Fixes issues with Python 3.

I’m still having this same problem whilst using python 3, on both the seeds data set and my own. It returns an error at line 75 saying ‘list object has no attribute ‘sum” and also saying than ‘an integer is required.’

Any help would be very much appreciated.
Overall this code is very helpful. Thank you!

As I am working on Iris Recognition, I have extracted the features of each eye and store it in .csv file, Can u suggest how further can I build my Backpropagation code.
As when I run your code I am getting many errors.
Thank you

Yes. By default we are back-propagating the error of the expected output vs the network output (inputs = row[:-1]), but if we are not the output layer, propagate the error from the previous layer in the network (inputs = [neuron[‘output’] for neuron in network[i – 1]]).

In the function backward_propagate_error(network, expected), as far as I understand, execution passes sequentially up to:
if i != len(network)-1:
    for j in range(len(layer)):
        error = 0.0
        for neuron in network[i + 1]:
            error += (neuron['weights'][j] * neuron['delta'])
My question is: which value is used in neuron['delta']?

Thank you very much for this awesome implementation of neural network,
I have a question for you : I want to replace the activation function from Sigmoid
to RELU . So, what are the changes that I should perform in order to get
correct predictions?

I am working on a program that recognizes handwritten digits, the dataset is consisting of pictures (45*45) pixels each, which is 2025 input neurons, this causes me a problem in the activation function, the summation of (weight[i] * input[i]) is big, then it gives me always a result of (0.99 -> 1) after putting the value of the activation function in the Sigmoid function, any suggestions?

Your blog is totally awesome, not only this post but the whole series about neural networks. Some of the posts explain far more useful things than others on the Internet. They help me a lot to understand the core of the network instead of applying Keras or TensorFlow directly.

Just one question: if I would like to change the task from classification to regression, which part of the backpropagation do I need to change, and how?

This is a very interesting contribution to the community 🙂
Have you tried using the algorithm with other activation functions?
I tried with Gaussian, tanh and sinx, but the accuracy was not that high, so I think that I omitted something. What I altered were the activation functions and the derivatives. Is there something else that needs to be changed?

Thanks for the great post. Here is an observation that I am not able to understand. In the backward propagation you are not taking all the weights, only the jth one. Can you kindly help me understand this? I was under the impression that the delta from the output is applied across all the weights.
    for neuron in network[i + 1]:
        error += (neuron['weights'][j] * neuron['delta'])

Thanks for the great article. In the backward propagate, the delta value is applied for each weight across the neuron and the error is summed. I am curious why is the delta not applied to individual weights of the neuron and the error summed for that neuron. Can you please clarify?

Why don’t you split the data into train and test sets, for example 80% of the dataset for training and 20% for testing? If you train on 100% of the rows and then test on some of those same rows, the accuracy will be good. But if you add new data to seeds.csv, the model will be less accurate, right?

Thanks for the post! I have a question about cross-validation. The seeds dataset divides perfectly into 5 folds, but what about a dataset of 211 rows? Will I still get uniformly sized subsets (211/5)? Can you suggest how I could handle that?
Thanks in advance.
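
For what it is worth, with an integer fold size (as in the Jan/2017 update noted above) 211 rows would give five folds of 42 and leave one row unused. If every row should be used, one option is to spread the remainder over the first folds. The sketch below is an assumption about how that could be written; split_into_folds is a hypothetical name, not the tutorial's cross_validation_split().

```python
from random import seed, shuffle

def split_into_folds(dataset, n_folds):
    # Shuffle a copy, then hand out rows so fold sizes differ by at most one.
    rows = list(dataset)
    shuffle(rows)
    base, extra = divmod(len(rows), n_folds)
    folds = []
    start = 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(rows[start:start + size])
        start += size
    return folds

seed(1)
folds = split_into_folds([[i] for i in range(211)], 5)
print([len(fold) for fold in folds])  # [43, 42, 42, 42, 42]
```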

OK, thank you very much, Jason.
But it won’t work with searches the algorithm has not seen.
I read something in the book “Programming Collective Intelligence” about a neural net from scratch for this kind of problem, but I don’t understand how it works for the moment…

Hi Jason,
thanks for your code and the good description here; I like it very much.
I ran your example code and encountered the same error as others who left notes here.
The error is:
expected[row[-1]] = 1
IndexError: list assignment index out of range
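
One likely cause, stated as an assumption since the dataset is not shown: the last column holds class values that are not integers in the range 0 to n_outputs - 1 (for example labels 1, 2, 3 with three output neurons). A sketch of remapping the labels before training follows; encode_labels is a hypothetical helper.

```python
def encode_labels(dataset):
    # Map each distinct class value in the last column to 0, 1, 2, ...
    labels = sorted(set(row[-1] for row in dataset))
    lookup = {label: index for index, label in enumerate(labels)}
    for row in dataset:
        row[-1] = lookup[row[-1]]
    return lookup

# Made-up rows whose class values start at 1 instead of 0.
dataset = [[1.2, 3.4, 1], [0.5, 2.2, 3], [0.9, 1.1, 2]]
lookup = encode_labels(dataset)
n_outputs = len(lookup)

for row in dataset:
    expected = [0 for _ in range(n_outputs)]
    expected[row[-1]] = 1  # now always a valid index
    print(row[-1], expected)
```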

Hi..
Thanks for your code. It was very helpful. Can you suggest how to use this code for classifying Tamil characters? I have tried a CNN and now I need to compare the result with a BPN. Can you please advise?

Classification of Tamil characters, sir. I have 144 different classes. I have taken 7 GLCM features of each character, and I need to train on these features with backpropagation and predict which class each character belongs to.

Hi, so I wasn’t following this tutorial when implementing my neural network from scratch, and mine is in JavaScript. I just need help with the theory. How do I calculate the error for each node in the net so that I can incrementally change the weights? Great tutorial btw

[ 6.38491205 5.333345 4.81565798 5.43552204 9.96445304 2.57268919 4.07671018 1.5258789 6.19728301 0 1 ]
Dear sir,
the numerical values above were extracted from a dental X-ray image using a gray-level co-occurrence matrix [10 inputs and 1 output]. This dataset is used as input for a BPN classifier. Can the same dataset, as a .csv file, be used as input for a deep convolutional neural network? And can I get the output as an image? For example, if I give the dental X-ray images as numerical values, I need to get the caries-affected teeth as the output for the given dataset.

3) What exactly is the loss function in your example? In other explanations I usually find derivations of the loss (cost?) function, not of the transfer function. I’m actually very confused by the notation I find around …

4) Momentum and weight decay. In your example, can you implement them by subtracting the calculated decay and adding the calculated momentum (to the weight update)? Again, I found forms which subtract both and update the weight as w + deltaW, so again I’m very confused by the notation for backpropagation that I found…

Sorry for dumb questions, … math is not my strong side, so many things which can be inferred by math sense are simply hidden for me.

A VERY GOOD TUTORIAL SIR…
Sir i am implementing remote sensed image classification using BPN neural network using IDL.
I am not finding good resources on constructing features for input dataset and also number of hidden layers and number of neurons in hidden layer.
Any resources you know, can help me?

I only have one slight issue: I implemented this in Ruby and I tried to train it using the IRIS dataset, keeping the network simple (1 input layer, 1 hidden layer, 1 output layer) and after decreasing for a while the error rate keeps increasing. I tried lowering the learning rate, even making it dynamic so it decreases whenever the error increases but it doesn’t seem to help. Could you give me some advice? P.S sorry for my bad English

but the activation variable here is a single value…what I understand is that if I have set n_hidden = 5 (number of hidden layers), I should get N*5 (N = number of features in the dataset) outputs if I print the activation…

I have a question on the delta calculation at the output layer, where
the primary value is the difference between the neuron output and
the expected output. And we are then multiplying this difference
with the transfer_derivative. where transfer_derivative is a function
of neuron’s output.

My question is, is it correct to find the difference between the
neuron’s output and the expected output?

In this case of the example, you have chosen digital outputs [0,1]
and hence it may not have come up .. but my point is…
one is already subjected to a transfer function, and one is not.

The neuron’s output is always subjected to a transfer function and
hence will be in a specific range, say -.5 to +.5 or something..
But the expected output is the user’s choice, isn’t it?
user can have an expected value of say 488.34, for some stock price
learning.. then is it still correct to find this primary difference
between the expected output and the neuron output, at the output
layer delta calculation?

Shouldn’t the expected output also be subjected to the same transfer
function before finding the difference? Or the other way around:
shouldn’t the neuron output be subjected to a reverse transfer function
before comparing with the expected output directly?

I have a question concerning the back-propagation : what if instead of having an error function I only have a desired gradient for the output (in the case of an actor-critic model for example)?
How can I change your backprop function to make it work? Or can I just use the gradient as the error?

Hi Jason, thank you for providing this tutorial. I’m confused about how to implement the same backpropagation algorithm with non-binary output, since I noticed that your example has binary output. An example would be predicting a stock price given the open, high, low and close values. Regards.

great article. I have an interest in NN but I am not that good at python.

What I wanted to try was to withhold, say, 5 rows from the dataset and have the trained network predict the results for those rows. This is different from what I think the example does, which is rolling predictions with the learning. Removing 5 rows from the dataset is of course easy, but my pitiful attempts at predicting with unseen data like below fail (I guess network is not in scope at the end). Any help appreciated!

Hi Jason, I am trying to generalize your implementation to work with a variable number of layers and nodes. However, whenever I try to increase the number of nodes too much it stops working (the network freezes at one error rate and all output nodes are active, i.e. giving 1). Although the code would work if I decreased the layers and the errors will go down.
Is there something I am missing when using too many layers? The concepts should be the same.

I trained a network with 4 layers: [14,10,10,4] and it worked.
I trained a network with 4 layers [14,100,40,4] and it is stuck. Same dataset.

I’m trying to alter the code to represent a regression problem (sigmoid on hidden layer, linear on output layer). As far as I know, the main part of the code that would have to be modified is the FF algorithm. I’ve rewritten the code as below:

# Forward propagate input to a network output
def forward_propagate_regression(network, row):
    inputs = row
    new_inputs = []
    # gets the 1st layer, applies sigmoid activation
    hiddenlayer = network[0]
    for neuron in hiddenlayer:
        activation = activate(neuron['weights'], inputs)
        neuron['output'] = transfer(activation)
        new_inputs.append(neuron['output'])
    inputs = new_inputs
    # gets the last layer, applies linear activation
    outputlayer = network[-1]
    for neuron in outputlayer:
        activation = activate(neuron['weights'], inputs)
        neuron['output'] = activation
        new_inputs.append(neuron['output'])
    inputs = new_inputs
    return inputs

With this code, I’m getting an “OverflowError: (34, ‘Result too large’)” error. Could you please tell what I’m doing wrong? All the other parts of the code are as you’ve written.

I got the hidden layer (network[0]), and I applied your algorithm (calculate activation, transfer the activation to the output, append that to a new list called “new_inputs”).

After that, I get the output layer (network[-1]), I calculate the activation with the “new_inputs”, but I do NOT apply the sigmoid transfer function (so, the outputs should be linear). The results are appended to a new list, which is set to be the return of the function.

Would that be the best way to remove the sigmoid function from the output layer, making the code a regression, instead of a classification?
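
A hedged sketch of one way to write this follows. Two things stand out in the function as posted: new_inputs does not appear to be reset before the output layer, so the returned list would also contain the hidden outputs; and with a linear output neuron the output delta would normally drop the sigmoid derivative, since the derivative of a linear transfer is 1. The helper output_delta is a hypothetical name, and activate() is assumed to treat the last weight as the bias.

```python
from math import exp

def activate(weights, inputs):
    # Weighted sum of the inputs; the last weight is the bias.
    activation = weights[-1]
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    # Sigmoid transfer for the hidden layers.
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate_regression(network, row):
    inputs = row
    for i, layer in enumerate(network):
        new_inputs = []  # start a fresh list for every layer
        is_output = (i == len(network) - 1)
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            # Linear output layer, sigmoid everywhere else.
            neuron['output'] = activation if is_output else transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs

def output_delta(expected, output):
    # Linear transfer has derivative 1, so no transfer_derivative factor.
    return expected - output

# Made-up weights: the hidden neuron outputs sigmoid(0) = 0.5.
network = [
    [{'weights': [0.0, 0.0, 0.0]}],
    [{'weights': [2.0, 1.0]}],
]
prediction = forward_propagate_regression(network, [3.0, 4.0])
print(prediction)  # [2.0]
```

Scaling the target values to a small range before training may also help with the overflow, because large unscaled targets produce large deltas and weight updates.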

Hi Jason, nice post, and it really helps a lot.
for j in range(len(layer)):
    neuron = layer[j]
    neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
Should the neuron['output'] here be the output of the activation function instead of the transfer function?

Please tell me how we can change the number of neurons in the hidden layer and in the output layer.
What will the result be when we change the neurons in the hidden layer and in the output layer?
In this tutorial you use one hidden layer. Can we use more than one hidden layer, and how?

I’m trying to adapt the code to support many hidden layers. I’ve adapted the code as below, with a new input called “n_layers”, to insert N hidden layers in the network.

# Initialize a network with "n_layers" hidden layers
def initialize_network3(n_inputs, n_hidden, n_layers, n_outputs):
    network = list()
    for i in range(n_layers):
        hidden_layer = [{'weights':[random() for i in range(n_inputs + 1)]} for i in range(n_hidden)]
        network.append(hidden_layer)
    output_layer = [{'weights':[random() for i in range(n_hidden)]} for i in range(n_outputs)]
    network.append(output_layer)
    return network

When I try to run the code, it shows the error below. Do you have any idea why?

You need to add a conditional after your first layer to make sure your subsequent hidden layer weights have the proper dimensions (n_hidden+1, n_hidden)

for i in range(n_layers):
    hidden_layer = [{'weights':[random() for _ in range(n_inputs + 1)]} for _ in range(n_hidden)]
    if i > 0:
        hidden_layer = [{'weights':[random() for _ in range(n_hidden + 1)]} for _ in range(n_hidden)]
    network.append(hidden_layer)
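
The same idea can be sketched as a complete function by tracking the width of the previous layer. This is an illustration under the assumption that every neuron stores one weight per input plus a trailing bias weight; it is not the tutorial's own initialize_network().

```python
from random import random, seed

def initialize_network(n_inputs, n_hidden, n_layers, n_outputs):
    network = []
    n_prev = n_inputs  # width of the layer feeding the one being built
    for _ in range(n_layers):
        layer = [{'weights': [random() for _ in range(n_prev + 1)]}
                 for _ in range(n_hidden)]
        network.append(layer)
        n_prev = n_hidden
    # The output layer also needs one weight per hidden neuron plus a bias.
    output_layer = [{'weights': [random() for _ in range(n_prev + 1)]}
                    for _ in range(n_outputs)]
    network.append(output_layer)
    return network

seed(1)
network = initialize_network(4, 3, 2, 2)
for layer in network:
    print(len(layer), 'neurons,', len(layer[0]['weights']), 'weights each')
```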

In the output (last) layer, when we calculate the backprop error, why do we multiply (expected - output) by the transfer derivative? The transfer derivative is already canceled out for the last layer, so shouldn’t the update be only (expected - output) * previous_layer_input?
Thanks
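
Whether the transfer derivative cancels depends on the loss being minimised. The short derivation below compares the two usual cases for a sigmoid output; the symbols t_j (expected value), o_j (neuron output) and a_j (activation before the transfer function) are notation introduced here, and the assumption is that the tutorial's delta corresponds to the squared-error case.

```latex
% Sigmoid output: o_j = \sigma(a_j), with \sigma'(a_j) = o_j (1 - o_j)

% Squared error: the derivative of the transfer function stays in the delta
E_{\mathrm{SE}} = \tfrac{1}{2} \sum_j (t_j - o_j)^2
\quad\Rightarrow\quad
-\frac{\partial E_{\mathrm{SE}}}{\partial a_j} = (t_j - o_j)\, o_j (1 - o_j)

% Cross-entropy: the factor o_j (1 - o_j) cancels
E_{\mathrm{CE}} = -\sum_j \bigl[ t_j \ln o_j + (1 - t_j) \ln (1 - o_j) \bigr]
\quad\Rightarrow\quad
-\frac{\partial E_{\mathrm{CE}}}{\partial a_j} = t_j - o_j
```

So the update reduces to (expected - output) * previous_layer_input only when the cross-entropy loss is paired with a sigmoid output.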

Really good article. Thanks a lot.
Need a little bit of clarification.
For backward propagation starting at the output layer,
you get the error by appending to errors expected[j] – neuron[‘output’].
Isn’t Error = 0.5 * sum(errors)?
and then using this sum of errors for back-propagation?
Thanks.

Thanks for the tutorial! I am trying to modify your code to do a regression model and I am stuck. I have an input data set (4 columns and many rows) and a single variable output data set (in range of tens of thousands). I fed them into the train procedure and I get an error when it reaches “expected = [0 for i in range(n_outputs)]” in the train portion. The error reads “only length-1 arrays can be converted to Python scalar”. Now I understand this is because of the intended purpose for the code was a categorization problem but I am wondering what I would need to modify to get this to work? Any help would go a long way as I have been stuck on this issue for some time now.

I have been studying how to develop a neural network from scratch and this tutorial is the main one I have been following because it is helping me so much.
I have a doubt: When I study the theory I see the neural network scheme carrying only the weights and bias. And here in practice I see that the network is also carrying the output values and the delta i.e (weights, bias, output and delta). Will the final model be saved like this? with the latter (weights, bias, output and delta)? would this be the rule in practice?
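
As a hedged illustration of the point raised here: in an implementation like this one, 'output' and 'delta' look like scratch values that are overwritten on every forward and backward pass, so only the weights (with the bias stored as the last weight) would need to be kept for prediction. The helper names export_weights and import_weights are hypothetical.

```python
def export_weights(network):
    # Keep only the learned weights, dropping per-row scratch values.
    return [[list(neuron['weights']) for neuron in layer] for layer in network]

def import_weights(saved):
    # Rebuild the list-of-dicts structure expected by the forward pass.
    return [[{'weights': list(weights)} for weights in layer] for layer in saved]

# Made-up trained network still carrying 'output' and 'delta'.
trained = [
    [{'weights': [0.1, 0.2, 0.3], 'output': 0.7, 'delta': 0.001}],
    [{'weights': [0.4, 0.1], 'output': 0.6, 'delta': -0.144}],
]
saved = export_weights(trained)
restored = import_weights(saved)
print(saved)
```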

I would appreciate it if you could help with this issue so that I could get out of where I left off.

Your posts are really very good; they are where I find my way into machine learning.

Fantastic stuff here. I had a question about the network configuration. This 2 input, 2 hidden and 2 output seems a bit odd to me. I’m used to seeing 2, 2, 1 for XOR – can you explain why you have two output nodes and how they work with the max function? I think it would better explain this line for me in train():

In the train_network function there is the line “expected[row[-1]] = 1”. What I understand is that you take the Y value of every row (which is either 0 or 1), use it as an index into the expected array, and change the value at that index to 1. First, I don’t know whether I understand that correctly. If so, wouldn’t the modification to the expected array be limited to the first and second index, because “expected[row[-1]] = 1” would only ever be expected[0] or expected[1]? And how would that help in our algorithm?
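
That reading matches the one hot encoding described at the top of the tutorial: with two classes there are two output neurons, and the class value selects which position of the expected vector is set to 1. A small sketch, where one_hot and predict_class are hypothetical helper names:

```python
def one_hot(label, n_outputs):
    # Build the expected output vector for an integer class label.
    expected = [0 for _ in range(n_outputs)]
    expected[label] = 1
    return expected

def predict_class(outputs):
    # The predicted class is the index of the largest output.
    return outputs.index(max(outputs))

print(one_hot(0, 2))              # [1, 0]
print(one_hot(1, 2))              # [0, 1]
print(predict_class([0.2, 0.9]))  # 1
```

With more classes the same line simply indexes a longer vector, one position per output neuron.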

Thanks for replying. I know the keras and have been using keras for a while. But in the problem I am focusing on, I need to make changes on the back propagation. That’s why I didn’t use keras.
So let’s go back to my original question, is the error term the cost function? Thanks.

Hello Jason,
Thanks for the informative tutorial. I have a question.
If I want to change the error equation, as well as the equations between the input and hidden layers and between the hidden and output layers, how can I change them?
I hope you will reply soon.

Hey there,
I’ve been following your tutorial and I’m having problems using my dataset with it. The outputs of the hidden neurons appear to be exactly 1 all the time. I’m not sure what’s wrong or how to fix it, but it’s resulting in the network not learning at all. Please let me know if you can help.
Thanks,
Raj