Friday, December 30, 2016

Lernmi: Building a Neural Network in Ruby

Neural networks

A neural network is based somewhat on a real brain - the idea being that data comes into a bunch of neurons at the front, gets propagated through multiple layers, and an answer comes out the end. We've covered this a little in another blog post, so here we'll just focus on what goes on under the covers.

Math time

A neural network is a type of non-linear function that can approximate any other function. How well it can approximate is based on the number of links it has between its layers of neurons. In other words, more neurons and more layers gives you a better (but slower) neural network. If you want to have a play with this idea, have a look at the Tensorflow Playground - if you can't get it to match, try a mix of more layers and more neurons per layer.

More math time

We have a non-linear function, and we need to approximate something based on stochastic gradient descent. In other words, we get a bunch of samples as input, and we need to adjust the network based on what the output *should* be, and then the network will change to give us answers that are slightly closer to what we expect. If our training data is a diverse sample of our real data, then we can train a neural network on our training data, and it'll also work as well with other data that it hasn't seen before.

This does seem kind of like magic though, right? What's going on under the covers?

The calculus bit

We generally use a method called Stochastic Gradient Descent, or a method related to it. The "stochastic" part means we randomly choose samples from our training data so that the model doesn't get biased, and all the little nudges we give it start to shape it in the way we want it to behave. The "gradient descent" part means for each sample, we look at how wrong the network currently is (the "error"), and calculate a gradient (or slope, if you like), and nudge the network in that direction.

This approach is robust and it works for a lot of things, but the hard part with a neural network is trying to find the gradient for every single weight in the network - if your network is 10 layers deep (a small network by modern standards), how does an error at the output adjust the weights 10 layers back at the start?

Backpropagation

We use partial derivatives to find the gradient at each layer, and then use that gradient to adjust the network. At each neuron, we take the sum of all of the downstream gradients (they get "backpropagated" to us), multiplied by the derivative of the activation function applied to the forward-propagated data. We then take this number, multiplied by the weight of the previous link, and add a fraction of this (multiplied by the training rate) to the current weight of that link. We also take this number multiplied by the forward-propagated data and backpropagate that to the previous neuron.

Simple... right?

This is some fiddly stuff if you're not currently a calculus student, and it took me a couple weeks to get my head around. You can find it all on the wikipedia page for backpropagation, and there are a couple of papers floating around the internet that explain it in different ways.

It shouldn't be this hard

The idea of multiplying partial derivatives is a simple one, and I spent a long time thinking about how to make this accessible. The architecture that I've settled on for Lernmi has made a couple of trade-offs, but I feel it breaks out the backpropagation into bite-sized pieces, and I hope that it makes it easier to understand.

Lernmi

I've published a neural network application in ruby, hosted here. The logic is broken up between Neurons and Links, and instead of the word "gradient", I've used the word "sensitivity" as it better translates to what we're using it for - high sensitivity means we adjust the weight more, low sensitivity means we adjust it less.

When we propagate forwards, it's fairly simple. Each Link in a layer takes the output from it's input Neuron, multiplies it by its weight, and inputs it to its output Neuron. Each Neuron sums all of its inputs, applies the activation function (the sigmoid function), and waits for the next Link to take the resulting value.

When we backpropagate, the logic is spread across the Neurons and Links. For the output layer, the sensitivity is just the error (the actual output minus the expected output). Each Link in the last layer takes the sensitivity from its output Neuron, and multiplies this by the sensitivity of the activation function - we call this the "output sensitivity". The reason we do this is because our neurons use their activation function when they propagate, so we need to find the sensitivity of that, and we multiply it by the sensitivity of the output Neuron to get the sensitivity of the Link. Perhaps a better name could be the "link sensitivity"? We'll call it the "output sensitivity" for now though.

We use the output sensitivity in two ways: when we update the weight, we multiply it by its input value (the larger our input value, the more we probably contributed to the error), and adjust the weight of this Link by that amount. We also need to send a sensitivity back to the input Neuron - we multiply the output sensitivity by our weight, so that the earlier layer knows how much this Link contributed to the error.

Whose fault is this?

Don't be surprised if this is confusing the first time your read through. I would love to find a way to simplify this further - the principle is as simple as asking "what part of the network is to blame for the error". The math is a bit frustrating, as we're ultimately multiplying partial derivatives along the length of the network, but the goal of it is to do lots of tiny adjustments until the network is a good-enough approximation of the underlying data.

What are we approximating though?

This is where it gets a little freaky. It turns out that everything - handwriting, faces, pictures of various different objects, they all have a mathematical approximation. With a big enough neural network, you can recognise people, animals, objects, and various different styles of handwriting and signwriting. Even in a board game like chess or go, you can make a mathematical approximation of which moves are better, to the point where you can have a computer that can play at a world class level.

This isn't the same as intelligence; the computer isn't thinking, but at the same time, what is thinking anyway? Is what we call "thinking" and "judging" just a mathematical approximation in our heads?