Neural Networks and the Backpropagation Algorithm

Neurons, as an Extension of the Perceptron Model

In a previous post in this series we investigated the Perceptron model for determining whether some data was linearly separable. That is, given a data set where the points are labelled in one of two classes, we were interested in finding a hyperplane that separates the classes. In the case of points in the plane, this just reduced to finding lines which separated the points like this:

As we saw last time, the Perceptron model is particularly bad at learning data. More accurately, the Perceptron model is very good at learning linearly separable data, but most kinds of data just happen to be more complicated. Even with those disappointing results, there are two interesting generalizations of the Perceptron model that have exploded into huge fields of research. The two generalizations can roughly be described as follows:

Use a number of Perceptron models in some sort of conjunction.

Use the Perceptron model on some non-linear transformation of the data.

The point of both of these is to introduce some sort of non-linearity into the decision boundary. The first generalization leads to the neural network, and the second leads to the support vector machine. Obviously this post will focus entirely on the first idea, but we plan to cover support vector machines in the near future. Recall further that the separating hyperplane was itself defined by a single vector (a normal vector to the plane) $\mathbf{w} = (w_1, \dots, w_n)$. To “decide” what class a new point $\mathbf{x}$ is in, we check the sign of an inner product with an added constant shifting term $b$:

$$f(\mathbf{x}) = \operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle + b) = \operatorname{sign}\left( b + \sum_{i=1}^n w_i x_i \right)$$

The class of a point is just the value of this function, and as we saw with the Perceptron this corresponds geometrically to which side of the hyperplane the point lies on. Now we can design a “neuron” based on this same formula. We consider a point $\mathbf{x} = (x_1, \dots, x_n)$ to be an input to the neuron, and the output will be the sign of the above sum for some coefficients $w_i$. In picture form it would look like this:

It is quite useful to literally think of this picture as a directed graph (see this blog’s gentle introduction to graph theory if you don’t know what a graph is). The edges corresponding to the coordinates of the input vector $\mathbf{x}$ have weights $w_i$, and the output edge corresponds to the sign of the linear combination. If we further enforce the inputs to be binary (that is, $\mathbf{x} \in \{0,1\}^n$), then we get a very nice biological interpretation of the system. If we think of the unit as a neuron, then the input edges correspond to nerve impulses, which can either be on or off (identically to an electrical circuit: there is high current or low current). The weights correspond to the strength of the neuronal connection. The neuron transmits or does not transmit a pulse as output depending on whether the inputs are strong enough.
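As a concrete illustration of the threshold neuron just described, here is a minimal Python sketch. The function name `threshold_neuron` and the particular weights are assumptions made for illustration; the weights are chosen by hand so that the neuron fires only when both binary inputs are on.

```python
def threshold_neuron(weights, bias, inputs):
    """Return +1 or -1 depending on the sign of <weights, inputs> + bias."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else -1


# Hypothetical hand-picked weights: the neuron fires only when both inputs are on.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, threshold_neuron(and_weights, and_bias, x))
```

The output jumps abruptly between the two classes, which is exactly the discontinuity that the next paragraphs set out to smooth away.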

We’re not quite done, though, because in this interpretation the output of the neuron will either fire or not fire. However, neurons in real life are somewhat more complicated. Specifically, neurons do not fire signals according to a discontinuous function. In addition, we want to use the usual tools from classical calculus to analyze our neuron, but we cannot do that unless the activation function is differentiable, and a prerequisite for that is to be continuous. In plain words, we need to allow our neurons to be able to “partially fire.” We need a small range at which the neuron ramps up quickly from not firing to firing, so that the activation function as a whole is differentiable.

This raises the obvious question: what function should we pick? It turns out that there are a number of possible functions we could use, ranging from polynomial to exponential in nature. But before we pick one in particular, let’s outline the qualities we want such a function to have.

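Definition: A function $\sigma: \mathbb{R} \to [0,1]$ is an activation function if it satisfies the following properties:

It has a first derivative $\sigma'$.

$\sigma$ is non-decreasing, that is $\sigma'(x) \geq 0$ for all $x$.

$\sigma$ has horizontal asymptotes at both 0 and 1 (and as a consequence, $\lim_{x \to \infty} \sigma(x) = 1$ and $\lim_{x \to -\infty} \sigma(x) = 0$).

$\sigma$ and $\sigma'$ are both computable functions.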

With appropriate shifting and normalizing, there are a few reasonable (and time-tested) activation functions. The two main ones are the hyperbolic tangent $\tanh(x)$ and the sigmoid curve $1/(1+e^{-x})$. They both look (more or less) like this:

A sigmoid function (source: Wikipedia)

And it is easy to see visually that this is what we want.
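For readers who prefer numbers to pictures, the following sketch checks the listed properties on a handful of sample points. The helper name `shifted_tanh` and the sampling range are assumptions for illustration; the shift and scale are one way of doing the "appropriate shifting and normalizing" so that the hyperbolic tangent has asymptotes at 0 and 1.

```python
import math


def sigmoid(x):
    """The logistic curve 1 / (1 + e^(-x)), with values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def shifted_tanh(x):
    """tanh shifted and scaled so its asymptotes are 0 and 1 instead of -1 and 1."""
    return (math.tanh(x) + 1.0) / 2.0


for f in (sigmoid, shifted_tanh):
    samples = [f(x) for x in range(-10, 11)]
    # Non-decreasing on the sampled points.
    assert all(a <= b for a, b in zip(samples, samples[1:]))
    # Close to the asymptotes 0 and 1 at the ends of the sampled range.
    assert samples[0] < 0.001 and samples[-1] > 0.999
    print(f.__name__, round(f(-2), 4), round(f(0), 4), round(f(2), 4))
```

The two curves are closely related: the shifted hyperbolic tangent at $x$ equals the sigmoid at $2x$, so the choice between them is mostly one of scaling.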

As a side note, the sigmoid function is actually not used very often in practice for a good reason: it gets too “flat” when the function value approaches 0 or 1. This is bad because how “flat” the function is (the gradient) guides the learning process. If the function is very flat, then the network won’t learn as quickly. This will manifest itself in our test later in this post, when we see that a neural network struggles to learn the sine function. It struggles specifically at those values of the function that are close to 1 or -1. Though I don’t want to go into too much detail about this, one alternative that has found a lot of success in deep learning is the “rectified linear unit.” This alternative also breaks the assumption of having a derivative everywhere, so one needs a bit more work to deal with that.

Withholding any discussion of why one would pick one specific activation over another, there is one more small detail. In the Perceptron model we allowed a “bias” $b$ which translated the separating hyperplane so that it need not pass through the origin, hence allowing the set of all pairs $(\mathbf{w}, b)$ to represent every possible hyperplane. Perhaps the simplest way to incorporate the bias into this model is to add another input $x_0$ which is fixed to 1. Then we add a weight $w_0$, and it is easy to see that the constant $b$ can just be replaced with the weight $w_0$. In other words, the inner product $b \cdot 1 + \langle \mathbf{w}, \mathbf{x} \rangle$ is the same as the inner product of two new vectors $\mathbf{w}', \mathbf{x}'$ where we set $w'_0 = b$ and $x'_0 = 1$, and $w'_i = w_i$, $x'_i = x_i$ for all other $i$.

The updated picture is now:

Now the specification of a single neuron is complete:

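Definition: A neuron $N$ is a pair $(W, \sigma)$, where $W$ is a list of weights $(w_0, w_1, \dots, w_k)$, and $\sigma$ is an activation function. The impulse function of a neuron $N$, which we will denote $f_N: \{0,1\}^{k+1} \to [0,1]$, is defined as

$$f_N(\mathbf{x}) = \sigma(\langle \mathbf{w}, \mathbf{x} \rangle) = \sigma\left( \sum_{i=0}^k w_i x_i \right)$$

We call $w_0$ the bias weight, and by convention the first input coordinate $x_0$ is fixed to 1 for all inputs $\mathbf{x}$.

(Since we always fix the first input to 1, $f_N$ is technically a function $\{0,1\}^k \to [0,1]$, but the reader will forgive us for blurring these details.)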
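The definition translates almost directly into code. The sketch below is one possible rendering, not the implementation used later in this post: the class name `Neuron`, the method name `impulse`, and the sample weights are assumptions for illustration. The fixed first input of 1 is prepended automatically, so callers pass only the remaining coordinates.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Neuron:
    """A neuron as a pair (weights, activation); weights[0] is the bias weight."""

    def __init__(self, weights, activation=sigmoid):
        self.weights = list(weights)
        self.activation = activation

    def impulse(self, inputs):
        """Apply the activation to the weighted sum, prepending the fixed input 1."""
        extended = [1.0] + list(inputs)
        if len(extended) != len(self.weights):
            raise ValueError("expected %d inputs" % (len(self.weights) - 1))
        total = sum(w * x for w, x in zip(self.weights, extended))
        return self.activation(total)


# Hypothetical weights: bias weight -1.5, then one weight per real input.
neuron = Neuron([-1.5, 1.0, 1.0])
print(neuron.impulse([1, 1]))  # weighted sum is 0.5, so the output is above 1/2
print(neuron.impulse([0, 1]))  # weighted sum is -0.5, so the output is below 1/2
```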

Combining Neurons into a Network

The question of how to “train” a single neuron is just a reformulation of the Perceptron problem. If we have a data set $\{\mathbf{x}_i\}$ with class labels $y_i$, we want to update the weights of a neuron so that the outputs agree with their class labels; that is, $f_N(\mathbf{x}_i) = y_i$ for all $i$. And we saw in the Perceptron how to do this: it’s fast and efficient, given that the data are linearly separable. And in fact training a neuron in this model (accounting for the new activation function) will give us identical decision functions as in the Perceptron model. All we have done so far is change our perspective from geometry to biology. But as we mentioned originally, we want to form a mathematical army of neurons, all working together to form a more powerful decision function.

The question is what form should this army take? Since we already thought of a single neuron as a graph, let’s generalize this. Instead of having a bunch of “input” vertices, a single “output” vertex, and one neuron doing the computation, we now have the same set of input vertices, the same output vertex, but now a number of intermediate neurons connected arbitrarily to each other. That is, the edges that are outputs of some neurons are connected to the inputs of other neurons, and the very last neuron’s output is the final output. We call such a construction a neural network.

For example, the following graph gives a neural network with 5 neurons

To compute the output of any neuron $N$, we need to compute the values of the impulse functions for each neuron whose output feeds into $N$. This in turn requires computing the values of the impulse functions for each of the inputs to those neurons, and so on. If we imagine electric current flowing through such a structure, we can view it as a kind of network flow problem, which is where the name “neural networks” comes from. This structure is also called a dependency graph, and (in the parlance of graph theory) a directed acyclic graph. Though nothing technical about these structures will show up in this particular post, we plan in the future to provide primers on their basic theories.

We remark that we view the above picture as a directed graph with the directed edges going upwards. And as in the picture, the incidence structure (which pairs of neurons are connected or not connected) of the graph is totally arbitrary, as long as it has no cycles. Note that this is in contrast to the classical idea of a neural network as “layered” with one or more intermediate layers, such that all neurons in neighboring layers are completely connected to one another. Hence we will take a slightly more general approach in this post.

Now the question of training a network of interconnected neurons is significantly more complicated than that of training a single neuron. The algorithm to do so is called backpropagation, because we will check to see if the final output is an error, and if it is we will propagate the error backward through the network, updating weights as we go. But before we get there, let’s explore some motivation for the algorithm.

The Backpropagation Algorithm – Single Neuron

Let us return to the case of a single neuron $N$ with weights $\mathbf{w} = (w_0, \dots, w_k)$ and an input $\mathbf{x} = (x_0, \dots, x_k)$. And momentarily, let us remove the activation function $\sigma$ from the picture (so that $f_N$ just computes the summation part). In this simplified world it is easy to take a given set of training inputs $\mathbf{x}_j$ with labels $y_j \in \{0,1\}$ and compute the error of our neuron $N$ on the entire training set. A standard mathematical way to compute error is by the sum of the squared deviations of our neuron’s output from the actual label:

$$E(\mathbf{w}) = \sum_j (y_j - f_N(\mathbf{x}_j))^2$$

The important part is that $E$ is a function just of the weights $w_i$. In other words, the set of weights completely specifies the behavior of a single neuron.

Enter calculus. Any time we have a multivariate function (here, each of the weights is a variable), then we can speak of its minima and maxima. In our case we strive to find a global minimum of the error function $E$, for then we would have learned our target classification function as well as possible. Indeed, to improve upon our current set of weights, we can use the standard gradient-descent algorithm. We have discussed versions of the gradient-descent algorithm on this blog before, as in our posts on decrypting substitution ciphers with n-grams and finding optimal stackings in Texas Hold ‘Em. We didn’t work with calculus there because the spaces involved were all discrete. But here we will eventually extend this error function to allow the inputs to be real-valued instead of binary, and so we need the full power of calculus. Luckily for the uninformed reader, the concept of gradient descent is the same in both cases. Since $E$ gives us a real number for each possible neuron (each choice of weights), we can take our current neuron and make it better by changing the weights slightly, and ensuring our change gives us a smaller value under $E$. If we cannot ensure this, then we have reached a minimum.

Here are the details. For convenience we add a factor of $1/2$ to $E$ and drop the subscript $N$ from $f_N$. Since minimizing $E$ is the same as minimizing $\frac{1}{2}E$, this changes nothing about the minima of the function. That is, we will henceforth work with

$$E(\mathbf{w}) = \frac{1}{2} \sum_j (y_j - f(\mathbf{x}_j))^2$$

Then we compute the gradient $\nabla E$ of $E$. For fixed values of the variables $w_i$ (our current set of weights) this is a vector in $\mathbb{R}^{k+1}$, and as we know from calculus it points in the direction of steepest ascent of the function $E$. That is, if we subtract some sufficiently small multiple of this vector from our current weight vector, we will be closer to a minimum of the error function than we were before. If we were to add, we’d go toward a maximum.

Note that $E$ is never negative, and so it will have a global minimum value at or near 0 (if it is possible for the neuron to represent the target function perfectly, it will be zero). That is, our update rule should be

$$\mathbf{w}_\text{current} = \mathbf{w}_\text{current} - \eta \nabla E(\mathbf{w}_\text{current})$$

where $\eta$ is some fixed parameter between 0 and 1 that represents the “learning rate.” We will not mention $\eta$ too much except to say that as long as it is sufficiently small and we allow ourselves enough time to learn, we are guaranteed to get a good approximation of some local minimum (though it might not be a global one).

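With this update rule it suffices to compute $\nabla E$ explicitly.

$$\nabla E = \left( \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_k} \right)$$

In each partial $\partial E / \partial w_i$ we consider each other variable beside $w_i$ to be constant, and combining this with the chain rule gives

$$\frac{\partial E}{\partial w_i} = -\sum_j (y_j - f(\mathbf{x}_j)) \frac{\partial f}{\partial w_i}$$

Since in the summation formula for $f$ the variable $w_i$ only shows up in the product $w_i x_{j,i}$ (where $x_{j,i}$ is the $i$-th term of the vector $\mathbf{x}_j$), the last part expands as $x_{j,i}$. That is, we have

$$\frac{\partial E}{\partial w_i} = -\sum_j (y_j - f(\mathbf{x}_j)) x_{j,i}$$

Noting the negatives cancelling, this makes our update rule just

$$\mathbf{w} = \mathbf{w} + \eta \sum_j (y_j - f(\mathbf{x}_j)) \mathbf{x}_j$$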

There is an alternative form of an update rule that allows one to update the weights after each individual input is tested (as opposed to checking the outputs of the entire training set). This is called the stochastic update rule, and is given identically as above but without summing over all $\mathbf{x}_j$:

$$\mathbf{w} = \mathbf{w} + \eta (y_j - f(\mathbf{x}_j)) \mathbf{x}_j$$

For our purposes in this post, the stochastic and non-stochastic update rules will give identical results.
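To make the two update rules concrete, here is a small sketch for a neuron with the activation function removed, as in the derivation above. The function names, the toy data set, and the learning rate are all assumptions for illustration. The data are generated from $y = 1 + 2x$, so a perfect set of weights exists and both rules can be compared against it.

```python
def output(weights, x):
    """Linear neuron: just the weighted sum, with no activation function."""
    return sum(w * xi for w, xi in zip(weights, x))


def error(weights, data):
    """Half the sum of squared deviations over the training set."""
    return 0.5 * sum((y - output(weights, x)) ** 2 for x, y in data)


def batch_step(weights, data, eta):
    """One update using the gradient summed over the entire training set."""
    grads = [0.0] * len(weights)
    for x, y in data:
        delta = y - output(weights, x)
        for i, xi in enumerate(x):
            grads[i] += delta * xi
    return [w + eta * g for w, g in zip(weights, grads)]


def stochastic_step(weights, x, y, eta):
    """One update using a single training example."""
    delta = y - output(weights, x)
    return [w + eta * delta * xi for w, xi in zip(weights, x)]


# Toy data (an assumption for illustration): y = 1 + 2*x, first coordinate fixed to 1.
data = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0)]

w_batch = [0.0, 0.0]
w_stoch = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_step(w_batch, data, eta=0.1)
    for x, y in data:
        w_stoch = stochastic_step(w_stoch, x, y, eta=0.1)

print(w_batch, error(w_batch, data))
print(w_stoch, error(w_stoch, data))
```

The two rules follow different paths through weight space, but on this small problem both settle on the same weights.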

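Adding in the activation function is not hard, but we will choose our $\sigma$ so that it has particularly nice computability properties. Specifically, we will pick the sigmoid function $\sigma(x) = 1/(1+e^{-x})$, because it satisfies the identity

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

So instead of $\partial f$ in the formula above we need $\partial (\sigma \circ f) / \partial w_i$, and this requires the chain rule once again:

$$\frac{\partial E}{\partial w_i} = -\sum_j (y_j - \sigma(f(\mathbf{x}_j))) \, \sigma'(f(\mathbf{x}_j)) \, x_{j,i}$$

And using the identity for $\sigma'$ gives us

$$\frac{\partial E}{\partial w_i} = -\sum_j (y_j - \sigma(f(\mathbf{x}_j))) \, \sigma(f(\mathbf{x}_j)) (1 - \sigma(f(\mathbf{x}_j))) \, x_{j,i}$$

and a similar update rule as before. If we denote by $o_j$ the output value $\sigma(f(\mathbf{x}_j))$, then the stochastic version of this update rule is

$$\mathbf{w} = \mathbf{w} + \eta \, o_j (1 - o_j)(y_j - o_j) \, \mathbf{x}_j$$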

Now that we have motivated an update rule for a single neuron, let’s see how to apply this to an entire network of neurons.
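Before moving on, here is a sketch of the single-neuron rule with the sigmoid included, applied to the logical OR of two binary inputs. The function names, the learning rate, the number of updates, and the random seed are assumptions for illustration; each step changes the weights by the learning rate times $o(1-o)(y-o)$ times the input, where $o$ is the neuron's current output.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_neuron(data, eta=0.5, updates=20000, seed=0):
    """Stochastic updates w += eta * o * (1 - o) * (y - o) * x."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in data[0][0]]
    for _ in range(updates):
        x, y = rng.choice(data)
        o = neuron_output(weights, x)
        scale = eta * o * (1 - o) * (y - o)
        weights = [w + scale * xi for w, xi in zip(weights, x)]
    return weights


# Logical OR, with the first coordinate fixed to 1 for the bias weight.
or_data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
weights = train_neuron(or_data)
for x, y in or_data:
    print(x[1:], y, round(neuron_output(weights, x), 3))
```

The outputs approach the labels without ever reaching exactly 0 or 1, since the sigmoid attains those values only in the limit.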

The Backpropagation Algorithm – Entire Network

There is a glaring problem in training a neural network using the update rule above. We don’t know what the “expected” output of any of the internal edges in the graph is. In order to compute the error we need to know what the correct output should be, but we don’t immediately have this information.

We don’t know the error value for a non-output node in the network.

In the picture above, we know the expected value of the edge leaving the output node, but not that of an internal node $N$. In order to compute the error for $N$, we need to derive some kind of error value for nodes in the middle of the network.

It seems reasonable that the error for an internal node $N$ should depend on the errors of the nodes for which $N$ provides an input. That is, in the following picture the error should come from all of the neurons $M_1, \dots, M_n$.

In particular, one possible error value for a particular input $\mathbf{x}$ to the entire network would be a weighted sum over the errors of the $M_i$, where the weights $w_i$ are the weights of the edges from $N$ to $M_i$. In other words, if $N$ has little effect on the output of one particular $M_i$, it shouldn’t assume too much responsibility for that error. That is, using the above picture, the error for $N$ (in terms of the input weights $v_i$) is

$$E_N(v_1, \dots, v_m) = \sum_i w_i E_{M_i}$$

where $E_{M_i}$ is the error computed for the node $M_i$.

It turns out that there is a nice theoretical justification for using this quantity as well. In particular, if we think of the entire network as a single function, we can imagine the error as being a very convoluted function of all the weights in the network. But no matter how confusing the function may be to write down, we know that it only involves addition, multiplication, and composition of differentiable functions. So if we want to know how to update the error with respect to a weight that is hidden very far down in the network, in theory it just requires enough applications of the chain rule to find it.

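To see this, let’s say we have nodes $N_{1,i}$ connected forward to nodes $N_{2,j}$ connected forward to nodes $N_{3,k}$, such that the weights $w_{i,j}$ go from $N_{1,i} \to N_{2,j}$, and the weights $v_{j,k}$ go from $N_{2,j} \to N_{3,k}$.

If we want to know the partial derivative of $E$ with respect to the deeply nested weight $w_{i,j}$, then we can just compute it:

$$\frac{\partial E}{\partial w_{i,j}} = -\sum_k (y_k - o_k) \frac{\partial o_k}{\partial w_{i,j}}$$

where $o_k = f_{N_{3,k}}(\dots)$ represents the value of the impulse function at each of the output neurons, in terms of a bunch of crazy summations we omit for clarity.

But after applying the chain rule, the partial of the inner summation only depends on $w_{i,j}$ via the coefficient $v_{j,k}$. That is, the weight $w_{i,j}$ only affects node $N_{3,k}$ by the output of $N_{2,j}$ passing through the edge labeled $v_{j,k}$. So we get a sum

$$\frac{\partial E}{\partial w_{i,j}} = -\sum_k (y_k - o_k) \, o_k (1 - o_k) \, v_{j,k} \, \frac{\partial}{\partial w_{i,j}} f_{N_{2,j}}(\dots)$$

That is, it’s simply a weighted sum of the final errors by the right weights. The stuff inside the $\partial / \partial w_{i,j}$ is simply the output of that node, which is again a sum over its inputs. In stochastic form, this makes our update rule (for the weights of $N_{2,j}$) just

$$\mathbf{w} = \mathbf{w} + \eta \, o_j (1 - o_j) \left( \sum_k v_{j,k} E_k \right) \mathbf{z}$$

where $o_j$ is the output of $N_{2,j}$, $E_k = o_k (1 - o_k)(y_k - o_k)$ is the error term at the output node $N_{3,k}$, and by $\mathbf{z}$ we denote the vector of inputs to the neuron in question (these may be the original input if this neuron is the first in the network and all of the inputs are connected to it, or it may be the outputs of other neurons feeding into it).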

The argument we gave only really holds for a network where there are only two edges from the input to the output. But the reader who has mastered the art of juggling notation may easily generalize this via induction to prove it in general. This really is a sensible weight update for any neuron in the network.
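One way to build confidence in a chain-rule computation like this is to compare it against a finite-difference approximation. The sketch below does so for a small three-layer arrangement with sigmoid activations on the middle and output nodes; all function names and numeric values are assumptions for illustration. Here `z` holds the outputs of the first layer, `w[i][j]` is the weight from the $i$-th first-layer node to the $j$-th middle node, and `v[j][k]` is the weight from the $j$-th middle node to the $k$-th output node.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(z, w, v):
    """Return the outputs of the middle nodes and of the output nodes."""
    hidden = [sigmoid(sum(z[i] * w[i][j] for i in range(len(z))))
              for j in range(len(w[0]))]
    outputs = [sigmoid(sum(hidden[j] * v[j][k] for j in range(len(hidden))))
               for k in range(len(v[0]))]
    return hidden, outputs


def error(z, w, v, y):
    _, outputs = forward(z, w, v)
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, outputs))


def chain_rule_partial(z, w, v, y, i, j):
    """dE/dw[i][j]: weighted sum of output errors times the local derivative."""
    hidden, outputs = forward(z, w, v)
    propagated = sum(v[j][k] * ok * (1 - ok) * (yk - ok)
                     for k, (yk, ok) in enumerate(zip(y, outputs)))
    return -propagated * hidden[j] * (1 - hidden[j]) * z[i]


def numeric_partial(z, w, v, y, i, j, h=1e-6):
    """Central finite difference, used only as an independent check."""
    up = [row[:] for row in w]
    down = [row[:] for row in w]
    up[i][j] += h
    down[i][j] -= h
    return (error(z, up, v, y) - error(z, down, v, y)) / (2 * h)


# Arbitrary illustrative values: 2 first-layer outputs, 2 middle nodes, 2 outputs.
z = [0.3, 0.9]
w = [[0.5, -0.4], [0.1, 0.8]]
v = [[-0.7, 0.2], [0.6, 0.9]]
y = [1.0, 0.0]

for i in range(2):
    for j in range(2):
        print(i, j, chain_rule_partial(z, w, v, y, i, j),
              numeric_partial(z, w, v, y, i, j))
```

The two columns agree to many decimal places, which is what one expects if the weighted sum of output errors really is the partial derivative.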

And now that we have established our update rule, the backpropagation algorithm for training a neural network becomes relatively straightforward. Start by initializing the weights in the network at random. Evaluate an input $\mathbf{x}$ by feeding it forward through the network and recording at each internal node the output value $o_j$, and call the final output $o$. Then compute the error for that output value, propagate the error back to each of the nodes feeding into the output node, and update the weights for the output node using our update rule. Repeat this error propagation followed by a weight update for each of the nodes feeding into the output node in the same way, compute the updates for the nodes feeding into those nodes, and so on until the weights of the entire network are updated. Then repeat with a new input $\mathbf{x}$.

One minor issue is when to stop. Specifically, it won’t be the case that we only need to evaluate each input exactly once. Depending on how the learning parameter is set, we may need to evaluate the entire training set many times! Indeed, we should only stop when the gradient for all of our examples is small, or we have run it for long enough to exhaust our patience. For simplicity we will ignore checking for a small gradient, and we will simply fix a number of iterations. We leave the gradient check as an exercise to the reader.

Then the result is a trained network, which we can further use to evaluate the labels for unknown inputs.

Python Implementation

The first thing we need to implement all of this is a data structure for a network. That is, we need to represent nodes and edges connecting nodes. Moreover each edge needs to have an associated value, and each node needs to store multiple values (the error that has been propagated back, and the output produced at that node). So we will represent this via two classes:
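A minimal sketch of such a pair of classes follows. The class names `Node` and `Edge` come from the description above; the attribute names and the choice of a uniform random initial weight are assumptions for illustration.

```python
import random


class Node:
    def __init__(self):
        self.lastOutput = None   # output produced on the most recent evaluation
        self.lastInput = None    # inputs received on the most recent evaluation
        self.error = None        # error propagated back to this node
        self.incomingEdges = []
        self.outgoingEdges = []


class Edge:
    def __init__(self, source, target):
        self.weight = random.uniform(0, 1)  # random initial weight
        self.source = source
        self.target = target
        # Register the edge with both of its endpoints.
        source.outgoingEdges.append(self)
        target.incomingEdges.append(self)


a, b = Node(), Node()
edge = Edge(a, b)
print(edge in a.outgoingEdges, edge in b.incomingEdges)
```

Having each edge register itself with its endpoints means the graph can be traversed in either direction, which is what the forward evaluation and the backward error propagation both need.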

In particular, each Node needs to know about its most recent output, input, and error in order to update its weights. So any time we evaluate some input, we need to store these values in the Node. We will progressively fill in these classes with the methods needed to evaluate and train the network on data. But before we can do anything, we need to be able to distinguish between an input node and a node internal to the network. For this we create a subclass of Node called InputNode:

A network calls the evaluate function on its output node, and each node recursively calls evaluate on the sources of each of its incoming edges. An InputNode simply returns the corresponding entry in the inputVector (which requires us to pass the input vector along through the recursive calls). Since our graph structure is arbitrary, we note that some nodes may be “evaluated” more than once per evaluation. As such, we need to store the node’s output for the duration of the evaluation. We also need to store this value for use in training, and so before a call to evaluate we must clear this value. We omit the details here for brevity.

In addition, we need to automatically add bias nodes and corresponding edges to the non-input nodes. This results in a new subclass of Node which has a default evaluate() value of 1. Because of the way we organized things, the existence of this class changes nothing about the training algorithm.

We simply add a function to the Node class (which is overridden in the InputNode class) which adds a bias node and edge to every non-input node. The details are trivial; the reader may see them in the full source code.
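The recursive evaluation, the cached outputs, and the two special node types can be sketched together as follows. This is a simplified rendering under assumptions, not the full source code: the names `addBias`, `clearEvaluateCache`, and `BiasNode`, the optional `weight` argument, and the small example network are all chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Edge:
    def __init__(self, source, target, weight=None):
        self.weight = random.uniform(0, 1) if weight is None else weight
        self.source, self.target = source, target
        source.outgoingEdges.append(self)
        target.incomingEdges.append(self)


class Node:
    def __init__(self):
        self.lastOutput = None
        self.lastInput = None
        self.incomingEdges = []
        self.outgoingEdges = []

    def addBias(self):
        Edge(BiasNode(), self)

    def evaluate(self, inputVector):
        if self.lastOutput is not None:      # already computed in this evaluation
            return self.lastOutput
        self.lastInput = [e.source.evaluate(inputVector) for e in self.incomingEdges]
        total = sum(e.weight * x for e, x in zip(self.incomingEdges, self.lastInput))
        self.lastOutput = sigmoid(total)
        return self.lastOutput

    def clearEvaluateCache(self):
        if self.lastOutput is not None:
            self.lastOutput = None
            for e in self.incomingEdges:
                e.source.clearEvaluateCache()


class InputNode(Node):
    def __init__(self, index):
        Node.__init__(self)
        self.index = index

    def addBias(self):
        pass                                  # input nodes get no bias

    def evaluate(self, inputVector):
        self.lastOutput = inputVector[self.index]
        return self.lastOutput


class BiasNode(InputNode):
    def __init__(self):
        Node.__init__(self)

    def evaluate(self, inputVector):
        self.lastOutput = 1.0
        return self.lastOutput


# A hypothetical network: two inputs, two internal nodes, one output node.
inputs = [InputNode(0), InputNode(1)]
hidden = [Node(), Node()]
out = Node()
for h in hidden:
    for i in inputs:
        Edge(i, h, weight=1.0)
    Edge(h, out, weight=1.0)
for node in hidden + [out]:
    node.addBias()

first = out.evaluate([0, 1])
out.clearEvaluateCache()                      # required before a new input
second = out.evaluate([1, 1])
print(first, second)
```

Forgetting the call to `clearEvaluateCache` would make the second evaluation return the cached output of the first, which is why the cache must be cleared before every new input.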

The training algorithm will come in a loop consisting of three steps: first, evaluate an input example. Second, go through the network updating the error values of each node using backpropagation. Third, go through the network again to update the weights of the edges appropriately. This can be written as a very short function on a Network, which then requires a number of longer functions on the Node classes:

These are simply the formulas we derived in the previous sections translated into code. The propagated error is computed as a weighted sum in getError(), and the previous input and output values were saved from the call to evaluate().
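Here is a compact, self-contained sketch of the three-step loop. The names `Node`, `InputNode`, `Network`, `evaluate`, and `getError` follow the text; `updateWeights`, `train`, the constructor arguments, and the single-hidden-layer wiring are assumptions for illustration. In this sketch each downstream error is scaled by that node's sigmoid derivative while it is propagated, which is one bookkeeping choice for applying the chain rule. The example data follow the parity test mentioned later: 3-bit numbers labelled 1 if even and 0 if odd, with three internal nodes.

```python
import math
import random


class Edge:
    def __init__(self, source, target, rng):
        self.weight = rng.uniform(-1, 1)
        self.source, self.target = source, target
        source.outgoingEdges.append(self)
        target.incomingEdges.append(self)


class Node:
    def __init__(self):
        self.incomingEdges, self.outgoingEdges = [], []
        self.lastOutput = self.lastInput = self.error = None

    def evaluate(self, inputVector):
        if self.lastOutput is None:
            self.lastInput = [e.source.evaluate(inputVector) for e in self.incomingEdges]
            total = sum(e.weight * x for e, x in zip(self.incomingEdges, self.lastInput))
            self.lastOutput = 1.0 / (1.0 + math.exp(-total))
        return self.lastOutput

    def getError(self, label):
        if self.error is None:
            if not self.outgoingEdges:            # output node
                self.error = label - self.lastOutput
            else:                                 # weighted sum of downstream errors
                self.error = sum(
                    e.weight * e.target.lastOutput * (1 - e.target.lastOutput)
                    * e.target.getError(label) for e in self.outgoingEdges)
        return self.error

    def updateWeights(self, learningRate):
        scale = learningRate * self.lastOutput * (1 - self.lastOutput) * self.error
        for e, x in zip(self.incomingEdges, self.lastInput):
            e.weight += scale * x


class InputNode(Node):
    def __init__(self, index):
        Node.__init__(self)
        self.index = index

    def evaluate(self, inputVector):
        return inputVector[self.index]


class BiasNode(Node):
    def evaluate(self, inputVector):
        return 1.0


class Network:
    def __init__(self, numInputs, numHidden, seed=0):
        rng = random.Random(seed)
        inputs = [InputNode(i) for i in range(numInputs)]
        self.hidden = [Node() for _ in range(numHidden)]
        self.output = Node()
        for h in self.hidden:
            for source in inputs + [BiasNode()]:
                Edge(source, h, rng)
            Edge(h, self.output, rng)
        Edge(BiasNode(), self.output, rng)

    def evaluate(self, inputVector):
        for node in self.hidden + [self.output]:
            node.lastOutput = node.error = None        # clear cached values
        return self.output.evaluate(inputVector)

    def train(self, examples, learningRate=0.5, maxIterations=3000):
        for _ in range(maxIterations):
            for inputVector, label in examples:
                self.evaluate(inputVector)             # step 1: feed forward
                for node in [self.output] + self.hidden:
                    node.getError(label)               # step 2: propagate errors
                for node in self.hidden + [self.output]:
                    node.updateWeights(learningRate)   # step 3: update weights


# Labels: 1 if the 3-bit number is even (last bit 0), 0 if it is odd.
examples = [((a, b, c), 1 - c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
network = Network(numInputs=3, numHidden=3)
network.train(examples)
for inputVector, label in examples:
    print(inputVector, label, round(network.evaluate(inputVector), 3))
```

All errors are computed before any weight is changed, so that the propagated sums use the same weights that produced the outputs.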

A Sine Curve Example, and Issues

One simple example we can use to illustrate this is actually not a decision problem, per se, but a function estimation problem. In the course of all of this calculus, we implicitly allowed our neural network to output any values between 0 and 1 (indeed, the activation function did this for us). And so we can use a neural network to approximate any function which has values in $[0,1]$. In particular we will try this on

$$f(x) = \frac{1}{2}(\sin(x) + 1)$$

on the domain $[0, 4\pi]$.

Our network is simple: we have a single layer of twenty neurons, each of which is connected to a single input neuron and a single output neuron. The learning rate is set to 0.25, the number of iterations is set to a hundred thousand, and the training set is randomly sampled from the domain.

After training (which takes around fifteen seconds), the average error (when tested against a new random sample) is between 0.03 and 0.06. Here is an example of one such output:

An example of a 20-node neural network approximating two periods of a sine function.

This picture hints at an important shortcoming of our algorithm. Note how the neural network’s approximation of the sine function does particularly poorly close to 0 and 1. This is not a coincidence, but rather a side effect of our activation function $\sigma$. In particular, the sigmoid function achieves the values 0 and 1 only in the limit. That is, it never actually achieves 0 or 1, and in order to get close we require prohibitively large weights (which in turn correspond to rather large values to be fed to the activation function). One potential solution is to modify our sine function slightly more, by scaling it and translating it so that its values lie in $[0.1, 0.9]$, say. We leave this as an exercise to the reader.
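As a starting point for that exercise, the rescaling itself might look like the sketch below. The function names and the target interval of 0.1 to 0.9 are assumptions for illustration; any interval strictly inside 0 and 1 would serve the same purpose. The inverse map is included so that a trained network's outputs can be converted back to ordinary sine values.

```python
import math


def scaled_sine(x, low=0.1, high=0.9):
    """Map sin(x) from [-1, 1] into [low, high]."""
    unit = (math.sin(x) + 1.0) / 2.0          # values in [0, 1]
    return low + (high - low) * unit


def unscale(value, low=0.1, high=0.9):
    """Invert the rescaling to recover a value in [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


# Sample two periods of the rescaled target.
xs = [4 * math.pi * i / 200 for i in range(201)]
targets = [scaled_sine(x) for x in xs]
print(min(targets), max(targets))
```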

As one might expect, the neural network also does better when we test it on a single period instead of two (since the sine function is less “complicated” on a single period). We also constructed a data set of binary numbers whose labels were 1 if the number was even and 0 if the number was odd. A similar layout to the sine example with three internal nodes again gave good results.

The issues arise on larger datasets. One big problem with training a neural network is that it’s near impossible to determine the “correct” structure of the network ahead of time. The success of our sine function example, for instance, depended much more than we anticipated on the number of nodes used. Of course, this also depends on the choice of learning rate and the number of iterations allowed, but the point is the same: the neural network is fraught with arbitrary choices. What’s worse is that it’s just as impossible to tell if your choices are justified. All you have is an empirical number to determine how well your network does on one training set, and inspecting the values of the various weights will tell you nothing in all but the most trivial of examples.

There are a number of researchers who have attempted to alleviate this problem in some way. One prominent example is the Cascade Correlation algorithm, which dynamically builds the network structure depending on the data. Other avenues include dynamically updating the learning rate and using a variety of other activation and error functions, based on information theory and game theory (adding penalties for various undesirable properties). Still other methods involve alternative weight updates based on more advanced optimization techniques (such as the conjugate gradient method). Part of the benefit of the backpropagation algorithm is that the choice of error function is irrelevant, as long as it is differentiable. This gives us a lot of flexibility to customize the neural network for our own application domain.

These sorts of questions are what have caused neural networks to become such a huge field of research in machine learning. As such, this blog post has only given the reader a small taste of what is out there. This is the bread and butter in a world of fine cuisine: it’s proven to be a solid choice, but it leaves a cornucopia of flavors unrealized and untapped.

The future of this machine learning series, however, will deviate from the neural network path. We will instead investigate the other extension of the Perceptron model, the Support Vector Machine. We will also lay down some formal theories of learning, because as of yet we have simply been exploring algorithms without the ability to give guarantees on their performance. In order to do that, we must needs formalize the notion of an algorithm “learning” a task. This is no small feat, and nobody has quite agreed on the best formalization. We will nevertheless explore these frameworks, and see what kinds of theorems we can prove in them.

To be completely honest, I don’t know much about deep learning, and from what I’ve read on sparse autoencoders I’m not completely convinced (philosophically) that they do anything interesting (of course the mathematics and engineering is another question). Part of the reason I want to switch to talking about learning theory is just that: we can specify what it means to learn something, and then formally talk about which algorithms achieve that.

Excellent intro to neural networks, thanks a lot! You have a knack to explain things that are often smoothed over as “obvious” and lead to impenetrable “black magic” – even in textbooks. I know how I hated that as a student. Even with my calculus being rusty for years now and having learned it in another language than English I had no problems to follow the text. Once more – thanks!

Wonderful and thoughtful series of posts. Concur with the earlier comment that it would considerably add to the series to cover deep learning as it is becoming more important. One of the requirements of supervised learning is that features have to be defined manually. The promise of unsupervised deep learning is that features can be found autonomously though it requires considerable compute resources.

To the best of my knowledge deep learning is just one part of unsupervised learning based on neural networks. We’ll be investigating plenty of other unsupervised methods on this blog, but I have some reservations about deep learning until I learn more about it. (For instance, is there a universal definition of what it means to learn a concept in deep learning? Has it actually been used to do anything that wasn’t possible before?)

You certainly don’t get the same *behavior* as a batch update rule, but under the same usual assumptions of convexity, you can prove that the stochastic rule converges to the global minimum as well. So it essentially has all of the “same features” as the batch update rule. See http://en.wikipedia.org/wiki/Stochastic_gradient_descent for more.

Hi, great article! I have a curiosity: if I would want to approximate a nonlinear function whose output values are not necessarily in the [-1, 1] interval, will the activation function that you used be still suitable?

Nope, particularly because the output of the entire neural network is constrained by the output neuron, which has an activation function. There are ways around this, though I don’t know what industrial-strength folks use. You could, for example, scale the function to have outputs in [-k,k] if you know the range of your function.

Yes, it makes sense. Also, I believe it is possible to use a second activation function (e.g. the linear function) for the output neurons, and to keep the logistic function that you used for the hidden ones.

There’s a confusing thing: when you first calculate the weight updates, you defined E as the sum of the squared deviations. That is, a single number. But later on, for computing an error for the next neuron N, you’re saying “Emᵢ is the error computed for the node Mᵢ”. But wait, how could E be computed for a separate node, if you previously defined it as a single number. Perhaps did you mean instead the weight update for these nodes (which is dE/dwᵢ × η)?

It looks like there is a typo in the back-propagation section. Shouldn’t o_i = F_(N_{1, j}) instead be o_i = F_(N_{3, i})? N_3 seems to be the output layer, so not sure why the index here references N_1.

Thanks for your kind thoughts. In fact, if you read the whole source code I cache the results to prevent re-computation. I omitted those details for brevity and clarity, since they are unrelated to the math. And also because it’s trivial to implement, maybe four lines of code.

It seems to me, because caching did not occur to you, that you are not an algorithms expert, or that you have not done enough serious engineering to come across these issues and see the obvious solution. I suggest, when talking about math and engineering in general, that you hold the default assumption that the person you are talking to is smarter than you are and that your questions are stupid, and therefore give them the benefit of the doubt. This has helped me greatly, as 99% of the time I am indeed wrong, and this attitude prevents me from looking like more of an idiot than I am.

I read your blog on SVM and feel more confident about the topic now. I know I will be reading a lot of your blogs now. Bookmarked!

Now, I see that you wrote this blog on NN in 2012 and made a comment somewhere that you will post more on deep-learning once you are convinced that it can do things that were not done earlier (or DL can do things better maybe).

I am new to this subject but I see usage/talks about Deep-Learning (i.e, deep networks) very often these days and students doing research using DL/NN a lot.