Tuesday, June 21, 2016

In a previous post, I talked about how to use the gradient descent algorithm to optimize the weights and biases of an artificial neural network so that it gives selected outputs for selected inputs. However, plain gradient descent is inefficient on large neural nets since it recomputes a lot of values for each neuron. It also has to be rewritten for every change in architecture (number of neurons and layers) and requires a lot of code. The standard solution is to use the backpropagation algorithm, an optimized way of carrying out gradient descent on neural nets.

I will assume that you have read the previous post. Whereas the previous post was very specific, using a fixed architecture and a fixed activation and cost function, here I will keep things as general as possible so that they can be applied to any feedforward neural network. Here are the definitions of the symbols we shall be using, similar to last post's definitions:

To help explain the algorithm and equations, I shall be applying them to last post's architecture. So here is last post's neural net:

In the example, we have a 3-layer neural net with 2 neurons in each layer and 2 input neurons. It uses the sigmoid function as the activation function and the mean squared error as the cost function.
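As a rough sketch of what this example network computes, here is a forward pass written one neuron at a time in plain Python. The weight and bias values are placeholders invented for illustration, and the convention that `w[i][j]` connects neuron `i` of the previous layer to neuron `j` of the current layer is an assumption of this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Pass input x through every layer, one neuron at a time."""
    activations = x
    for w, b in zip(weights, biases):
        next_activations = []
        for j in range(len(b)):  # j indexes the neurons of this layer
            z = sum(activations[i] * w[i][j] for i in range(len(activations))) + b[j]
            next_activations.append(sigmoid(z))
        activations = next_activations
    return activations

def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

# Placeholder parameters (hypothetical values, chosen only for illustration):
# 3 layers, each with a 2x2 weight table and 2 biases.
weights = [[[0.1, -0.2], [0.3, 0.4]],
           [[-0.5, 0.6], [0.7, -0.8]],
           [[0.9, -1.0], [1.1, 1.2]]]
biases = [[0.0, 0.1], [-0.1, 0.2], [0.3, -0.3]]

outputs = forward([0.0, 1.0], weights, biases)
cost = mean_squared_error(outputs, [1.0, 0.0])
```

Every weighted sum here is recomputed from scratch on each call, which is the kind of repeated work that backpropagation avoids when computing gradients.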

The main point of interest in the backpropagation algorithm is the way you find the derivative of the cost function with respect to a weight or bias in any layer. Let's look at how to find these derivatives for each layer in the example neural net.

Weights and bias in layer L (output layer)

We begin by finding the derivatives for the output layer, which are the simplest.

d Cost / d Weight L

The derivative of the cost with respect to any weight in layer L can be found using

Here's how we can arrive at this equation based on a weight in layer 3 in the example:

d Cost / d Bias L

The derivative of the cost with respect to any bias in layer L can be found using

Here's how we can arrive at this equation based on a bias in layer 3 in the example:

Using delta

Part of the weight and bias equations is repeated. When finding the derivatives of other layers, it will be convenient to represent this repeated part with an intermediate symbol, the lower case delta (𝛿).
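One way to write the output layer equations is sketched below. The notation is an assumption of this sketch: C is the cost, f is the activation function, a^l_j is the activation of neuron j in layer l, z^l_j is its weighted sum before the activation function is applied, and w^l_{ij} is the weight connecting neuron i in layer l-1 to neuron j in layer l.

```latex
\begin{align*}
z^L_j &= \sum_i a^{L-1}_i \, w^L_{ij} + b^L_j, \qquad a^L_j = f(z^L_j) \\
\delta^L_j &= \frac{\partial C}{\partial a^L_j} \, f'(z^L_j) \\
\frac{\partial C}{\partial w^L_{ij}} &= \delta^L_j \, a^{L-1}_i \\
\frac{\partial C}{\partial b^L_j} &= \delta^L_j
\end{align*}
```

For the sigmoid function, f'(z) = f(z)(1 - f(z)), and for a squared error cost the factor ∂C/∂a^L_j is proportional to the difference between the output and its target.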

Weights and bias in layer L-1

Now we find the derivative with respect to the weights and biases in layer L-1. The derivations will be continuations of the previous derivations and we won't be repeating the first bit of the derivations again.

d Cost / d Weight L-1

The derivative of the cost with respect to any weight in layer L-1 can be found using

Here's how we can arrive at this equation based on a weight in layer 2 in the example:

d Cost / d Bias L-1

The derivative of the cost with respect to any bias in layer L-1 can be found using

Here's how we can arrive at this equation based on a bias in layer 2 in the example:

Using delta

These equations can be shortened by using delta again, and this time we can use the previous delta to shorten the new one even more.

Notice how there is a recursive definition of delta. This is the basis for the backpropagation algorithm.
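A sketch of this recursive definition follows, assuming that w^L_{jp} is the weight connecting neuron j in layer L-1 to neuron p in layer L, that z and a are a neuron's weighted sum and activation, and that f is the activation function.

```latex
\begin{align*}
\delta^{L-1}_j &= f'(z^{L-1}_j) \sum_p \delta^L_p \, w^L_{jp} \\
\frac{\partial C}{\partial w^{L-1}_{ij}} &= \delta^{L-1}_j \, a^{L-2}_i \\
\frac{\partial C}{\partial b^{L-1}_j} &= \delta^{L-1}_j
\end{align*}
```

Each delta in layer L-1 gathers the deltas of layer L through the weights that connect the two layers, which is where the "backward" in backpropagation comes from.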

Weights and bias in layer L-2

Now we find the derivative with respect to the weights and biases in layer L-2. Like before, we won't be repeating the first bit of the derivations again.

d Cost / d Weight L-2

The derivative of the cost with respect to any weight in layer L-2 can be found using

Here's how we can arrive at this equation based on a weight in layer 1 in the example (notice that there is a trick we'll be using with the summations that is explained below):

At the end of the derivation we did a trick with the sigmas (summations) in order to turn the equation into a recursive one.

First, we moved the delta to the inside of the sigmas. This can be done because it's just a common term that is factored out and we're factoring it back in. For example if the equation is "𝛿(a + b + c)" then we can turn that into "𝛿a + 𝛿b + 𝛿c".

Second, we swapped the "p" summation with the "q" summation. This only changes the order in which the same terms are added together, and since the sums are finite the total stays the same.

Finally we replaced the "p" summation with the previous delta, allowing us to shorten the equation.
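The three steps can be sketched as follows. The notation is assumed: w^l_{ij} connects neuron i in layer l-1 to neuron j in layer l, p runs over the neurons of layer L, q runs over the neurons of layer L-1, and S_j stands for the part of the derivative that contains the two summations.

```latex
\begin{align*}
S_j &= \sum_p \delta^L_p \sum_q w^L_{qp} \, f'(z^{L-1}_q) \, w^{L-1}_{jq} \\
    &= \sum_p \sum_q \delta^L_p \, w^L_{qp} \, f'(z^{L-1}_q) \, w^{L-1}_{jq}
       && \text{(delta moved inside)} \\
    &= \sum_q \sum_p \delta^L_p \, w^L_{qp} \, f'(z^{L-1}_q) \, w^{L-1}_{jq}
       && \text{(summations swapped)} \\
    &= \sum_q w^{L-1}_{jq} \Big[ f'(z^{L-1}_q) \sum_p \delta^L_p \, w^L_{qp} \Big] \\
    &= \sum_q w^{L-1}_{jq} \, \delta^{L-1}_q
       && \text{(previous delta substituted)}
\end{align*}
```

Under these assumptions the full derivative is then ∂C/∂w^{L-2}_{ij} = f'(z^{L-2}_j) a^{L-3}_i S_j, where a^{L-3} is the network input in the 3-layer example.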

d Cost / d Bias L-2

The derivative of the cost with respect to any bias in layer L-2 can be found using

Here's how we can arrive at this equation based on a bias in layer 1 in the example:

Using delta

Again, we can shorten the equations using delta.

Notice that this has exactly the same form as the previous delta. This is because from layer L-1 backwards, all the deltas follow the same pattern and can be defined using a single recursive relation.

In general: using indices

Up to now we saw what the derivatives for layers L, L-1, and L-2 are. We can now generalize these equations to any layer. Given the recursive pattern in the deltas, we can create an equation that gives the deltas of any layer provided you have the deltas of the layer that follows it, which is the one computed just before it when working backwards from the output. The base case of the recursion is layer L, which is defined on its own.
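In index form, the general equations can be sketched like this. The notation is assumed: l is any layer from 1 to L, a^0 is the network input, and w^l_{ij} connects neuron i in layer l-1 to neuron j in layer l.

```latex
\begin{align*}
\delta^l_j &=
\begin{cases}
\dfrac{\partial C}{\partial a^L_j} \, f'(z^L_j) & \text{if } l = L \\[2ex]
f'(z^l_j) \displaystyle\sum_p \delta^{l+1}_p \, w^{l+1}_{jp} & \text{if } l < L
\end{cases} \\
\frac{\partial C}{\partial w^l_{ij}} &= \delta^l_j \, a^{l-1}_i \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j
\end{align*}
```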

Now we have a way to find the derivatives for any layer, no matter how deep the neural network is. Here is how you'd use these equations on the example:

In general: using matrices

As things stand, there are a lot of index letters littering our equations (i, j, p). However, we can get rid of them if we use matrix operations that work on all the indices at once. Doing this also gives us shorter code when programming the backpropagation algorithm, since we can make use of a fast matrix library such as numpy.

Using matrices in neural nets

Let's look at how we can use matrices to compute the output of a neural net. A whole layer of activations can be calculated at once by treating it as a vector. We'll treat vectors as being horizontal by default, so they need to be transposed in order to be made vertical.

Here's an example of what we want to calculate:

Of course, organising our activations in a vector does not by itself get rid of the indices. In order to do that we need to treat the weights as a matrix, where the first index is the row number and the second index is the column number. That way we can have a single letter "w" which represents all of the weights of a layer. When multiplied by a vector of the previous layer's activations, the matrix multiplication will result in another vector of weighted sums which can be added to a bias vector and passed through an activation function to compute the next vector of activations. Here is an example of the calculation based on the above example:
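A minimal numpy sketch of this calculation for a single layer is shown below. The parameter values are random placeholders, and the shape convention (row vectors, with `w[i, j]` going from previous neuron `i` to current neuron `j`) is an assumption that follows the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical parameters for one layer with 2 inputs and 2 neurons.
w = rng.normal(size=(2, 2))   # w[i, j]: from previous neuron i to neuron j
b = rng.normal(size=(1, 2))   # one bias per neuron, as a row vector

a_prev = np.array([[0.0, 1.0]])  # previous layer's activations as a 1x2 row vector

z = a_prev @ w + b   # weighted sums for the whole layer at once
a = sigmoid(z)       # activations for the whole layer

# The same calculation with explicit indices, for comparison.
a_loop = np.array([[sigmoid(sum(a_prev[0, i] * w[i, j] for i in range(2)) + b[0, j])
                    for j in range(2)]])
```

Both `a` and `a_loop` hold the same values; the matrix version simply hides the indices inside the `@` operator.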

Final clean generalized deltas and cost gradients

And now we can finally see what the clean matrix-based general derivatives are for any layer:
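The sketch below shows one way the matrix-based deltas and gradients might be coded with numpy, together with a finite-difference check on one weight. It assumes row vectors, the sigmoid activation, and a mean squared error cost over the outputs of a single input-target pair. The function names and the random parameters are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, ws, bs):
    """Return the activations and weighted sums ("z") of every layer."""
    activations, zs = [x], []
    for w, b in zip(ws, bs):
        zs.append(activations[-1] @ w + b)
        activations.append(sigmoid(zs[-1]))
    return activations, zs

def cost(output, target):
    return np.mean((output - target) ** 2)

def backward(x, target, ws, bs):
    """Gradients of the cost with respect to every weight matrix and bias vector."""
    activations, _ = forward(x, ws, bs)
    grad_ws, grad_bs = [None] * len(ws), [None] * len(bs)
    # Output layer: delta = dC/da * f'(z), where f'(z) = a(1 - a) for the sigmoid.
    a = activations[-1]
    delta = (2 * (a - target) / a.size) * a * (1 - a)
    for l in reversed(range(len(ws))):
        grad_ws[l] = activations[l].T @ delta  # layer input (transposed) times delta
        grad_bs[l] = delta
        if l > 0:
            a = activations[l]
            # Recursive delta for the layer before this one.
            delta = (delta @ ws[l].T) * a * (1 - a)
    return grad_ws, grad_bs

# Random placeholder parameters for the 2-2-2-2 example network.
rng = np.random.default_rng(0)
ws = [rng.normal(size=(2, 2)) for _ in range(3)]
bs = [rng.normal(size=(1, 2)) for _ in range(3)]
x, target = np.array([[0.0, 1.0]]), np.array([[1.0, 0.0]])

grad_ws, grad_bs = backward(x, target, ws, bs)

# Finite-difference estimate of the gradient for one weight in the first layer.
eps = 1e-6
ws_plus = [w.copy() for w in ws]
ws_plus[0][1, 0] += eps
ws_minus = [w.copy() for w in ws]
ws_minus[0][1, 0] -= eps
numeric = (cost(forward(x, ws_plus, bs)[0][-1], target)
           - cost(forward(x, ws_minus, bs)[0][-1], target)) / (2 * eps)
```

Comparing `numeric` with `grad_ws[0][1, 0]` is a cheap sanity check that the backward pass agrees with the definition of a derivative.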

The backpropagation algorithm

Finding the gradients is one step (the most complex one) of the backpropagation algorithm. The full backpropagation algorithm goes as follows:

For each input-target pair in the training set,

Compute the activations and the "z" of each layer when passing the input through the network. This is called a forward pass.

Use the values collected from the forward pass to calculate the gradient of the cost with respect to each weight and bias, starting from the output layer and using the delta equations to compute the gradients for previous layers. This is called the backward pass.

Following the backward pass, you have the gradient of the cost with respect to each weight and bias for each input in the training set. Add up all the corresponding gradients of each input in the training set (all the gradients with respect to the same weight or bias).

Now use the gradient to perform gradient descent by multiplying the gradient by a step size, or learning rate as it's called in the backpropagation algorithm, and subtracting it from the corresponding weights and biases. In algebra,
weight = weight - learningrate*dCost/dWeight
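The summing and update steps can be sketched as follows. The helper name `gradient_descent_step` and the toy numbers are invented for illustration, and the per-example gradients are assumed to have already been produced by a backward pass.

```python
import numpy as np

def gradient_descent_step(params, per_example_grads, learning_rate):
    """Sum the per-example gradients and take one gradient descent step.

    params: list of arrays (weights or biases).
    per_example_grads: one list of arrays per training pair, shaped like params.
    """
    summed = [sum(grads) for grads in zip(*per_example_grads)]
    return [p - learning_rate * g for p, g in zip(params, summed)]

# Toy numbers (hypothetical): one 2x2 weight matrix and gradients from two training pairs.
weights = [np.array([[1.0, 2.0], [3.0, 4.0]])]
grads = [[np.array([[0.1, 0.0], [0.0, 0.1]])],
         [np.array([[0.1, 0.2], [0.0, 0.1]])]]
new_weights = gradient_descent_step(weights, grads, learning_rate=0.5)
```

The same function works for the biases, since it only relies on the parameters and their gradients having matching shapes.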

In code

And this is what the full program looks like in Python 3 using numpy to perform matrix operations. Again, we shall be training the neural net to perform the function of a half adder. A half adder adds together two binary digits and returns the sum and carry. So 0 + 1 in binary gives 1 carry 0 whilst 1 + 1 in binary gives 0 carry 1.
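What follows is a compact sketch of such a program, not necessarily the original listing. The random seed, learning rate, number of epochs, and the mean squared error over the two outputs are all assumptions of this sketch. Because the activation is the sigmoid, its derivative is computed from the stored activations rather than from "z".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Half adder truth table: inputs (a, b) -> targets (sum, carry).
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0, 0], [1, 0], [1, 0], [0, 1]], dtype=float)

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
layer_sizes = [2, 2, 2, 2]      # 2 inputs followed by 3 layers of 2 neurons
ws = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros((1, n)) for n in layer_sizes[1:]]
learning_rate = 0.5             # assumed value
epochs = 5000                   # assumed value

def forward(x):
    """Forward pass: return the activations of every layer, input included."""
    activations = [x]
    for w, b in zip(ws, bs):
        activations.append(sigmoid(activations[-1] @ w + b))
    return activations

def total_cost():
    """Sum of the mean squared errors over the whole training set."""
    return sum(np.mean((forward(x[None, :])[-1] - t[None, :]) ** 2)
               for x, t in zip(inputs, targets))

initial_cost = total_cost()
for epoch in range(epochs):
    sum_grad_ws = [np.zeros_like(w) for w in ws]
    sum_grad_bs = [np.zeros_like(b) for b in bs]
    for x, t in zip(inputs, targets):
        x, t = x[None, :], t[None, :]  # make them 1x2 row vectors
        activations = forward(x)
        # Backward pass, starting from the output layer's delta.
        a = activations[-1]
        delta = (2 * (a - t) / a.size) * a * (1 - a)
        for l in reversed(range(len(ws))):
            sum_grad_ws[l] += activations[l].T @ delta
            sum_grad_bs[l] += delta
            if l > 0:
                a = activations[l]
                delta = (delta @ ws[l].T) * a * (1 - a)
    # Gradient descent step using the summed gradients.
    for l in range(len(ws)):
        ws[l] -= learning_rate * sum_grad_ws[l]
        bs[l] -= learning_rate * sum_grad_bs[l]
final_cost = total_cost()
```

To inspect the result, print `np.round(forward(inputs)[-1], 2)` and compare it with `targets`. How closely the outputs match depends on the seed, learning rate, and number of epochs, since a network this small can settle in a poor local minimum.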