Information Retrieval and other interesting topics

Linear Regression In PHP

I've had a couple of emails recently about the excellent Stanford Machine Learning and AI online classes, so I thought I'd put up the odd post or two on some of the techniques they cover, and what they might look like in PHP.

The second lecture of the ML class jumps into a simple, but often powerful, technique: linear regression - or fitting a line to data (don't worry if you haven't watched it, this post hopefully makes sense on it's own). There are a lot of problems that fall under predicting these types of continuous values based on limited inputs - for example: given the air pressure, how much rain will there be, given the qualifying times, how quick will the fastest lap be in the race. By taking a bunch of existing data and fitting a line, we will be able to make a prediction easily - and often reasonably correctly.

We can define a line with 2 variables, or parameters: the intercept (where it crosses the axis) and the gradient (how much is moves in one dimension for a move of one in the other dimension). Because we're going to want to predict variables with that line, we'll write a function that defines it, which, in keeping with tradition, we'll refer to as our hypothesis function:

But which line is the best fit? Intuitively, it is the one closest to the data points we have - which we can measure by taking the square of the different between each predicted value and actual value - we square because then we get positive numbers no matter which way the difference goes:

So, our job is to find the line which minimises the squared error - this type of function is referred to the cost function. Lets take our functions above and take a first stab at fitting our line. We'll write a very simple algorithm which just tests moving each parameter up and down a little bit at a time, and sees if the score gets lower or higher. If it gets lower, we'll take those parameters and restart the process, until we don't see any improvement:

So for this data, we get the line y = 1 + 4x. If we plot our data and our line on a graph, we can see it's a pretty good fit:

This algorithm, while straightforward, has several problems. We're moving in steps of 0.25, so the best answer could be not in this range, and we don't cache any of our checks, so we'll always end up evaluating our previous position as a possible move, which is a waste of time. It's also kind of slow - surely we can do something a bit better?

Of course, we can, and what we can do is use the gradient descent algorithm. This relies on the fact that we are effectively describing a curve with our cost function (or more accurately a curved surface) and we can see how sloped the surface is at any point by taking the derivative, and use the direction of the slope to guide our improvement. In order to use that, we need a bit of calculus to take the partial derivatives for the two parameters, which makes our score function look like this:

With this new function, we're going to get a positive or negative value for each field based on the direction, so we know we need to move our values the other way. We can use this in a new step function, which we'll call gradient. We're going update our parameters by subtracting the derivative each time, which we'll multiply by a small learning rate (sometimes referred to as alpha), in order to avoid over-shooting our optimal line.

What's cool about this is that we actually get a better result y = 0.29 + 4.08x - which provides a marginally improved score over the 1 and 4 our stepper returned. To be fair, there are easier ways of determining this value mathematically, and things get more interesting if we have more than 2 variables - but we'll leave that for another post. If you want to see the code for this post in one file, it's on Github.