Linear regression by hand

Suppose I take the same data from the pylab example and try to add a linear function to represent that data. Here are two choices.

Which is better? The red line or the blue one? How do you decide? Well, you have to make up some criteria for choosing the best line. Commonly, you pick the line that minimizes the sum of the d² values. I displayed these d values on the graph for you. Notice that they are the vertical distances from the real data points to the fitting linear function. Why this way? Well, typically the horizontal variable is your independent variable – so these might be set values. The vertical data is typically the one with the most error (but not always). You could instead look at the horizontal distance from the data, or even the perpendicular distance.

I don’t want to add up these vertical distances because some will be positive and some negative. Instead, I will add up the vertical distances squared, such that:
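Writing the line as y = mx + b, the vertical distance for data point i is d_i = y_i − (m x_i + b), so the sum of the squares is:

```latex
S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2
```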

Well that is just great. Now what? If I let S be the sum of the square of the distances, then I want to pick a line such that S is the smallest. Hint: this is where the term ‘least squares fit’ comes from. How do you minimize a function? The simple answer is to change the parameters m and b.

Let me pretend that I changed the parameter m and each time calculated the sum of the vertical distances squared (S). Suppose I then made a plot of S for the different values of m and it looks like this:

On this graph, which labeled point (a – d) is S at a minimum? Go ahead. You can say it. How many of you said ‘c’? Well, you would be right. But, how do you find that lowest point without making a graph? There is one important thing about the lowest point. Right before that lowest point, the function is decreasing. Right after that lowest point, the function is increasing. And so AT the lowest point the function is neither increasing nor decreasing (with respect to changing m). Of course, I am talking about the slope of this function. I can find this lowest point by finding where the slope (the derivative with respect to m) is zero.
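That search can be sketched numerically (with made-up data points, since the original data isn't reproduced here): hold b fixed, compute S for a range of m values, and pick out the smallest.

```python
# Sketch: compute S (sum of squared vertical distances) for a range of
# slope values m, holding the intercept b fixed. The data points here
# are made up for illustration.
xdata = [0.0, 1.0, 2.0, 3.0, 4.0]
ydata = [1.1, 2.9, 5.2, 7.1, 8.8]  # roughly y = 2x + 1
b = 1.0  # hold the intercept fixed while varying m

def sum_sq(m, b, xs, ys):
    """Sum of squared vertical distances from the data to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Evaluate S at several slopes and find the one giving the smallest S
slopes = [1.0 + 0.1 * k for k in range(21)]  # m from 1.0 to 3.0
S_values = [sum_sq(m, b, xdata, ydata) for m in slopes]
best_m = slopes[S_values.index(min(S_values))]
print(best_m)
```

A plot of S_values against slopes would look just like the graph described above: decreasing, bottoming out near the best slope, then increasing again.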

I know, I know. It is possible for a function to have a zero slope and NOT be a minimum. Let me proceed anyway (assuming the only location with a zero slope is a minimum). There are two things that I can change to get S to be a minimum – m and b. Let me assume that I can just vary one parameter at a time (this means that I can use the partial derivative instead of the full derivative). Here is the partial derivative of S with respect to m – note that for sums I will leave off the “i = 1 to n” part.
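Dropping the “i = 1 to n” on the sums, the chain rule gives:

```latex
\frac{\partial S}{\partial m} = -2 \sum x_i \left( y_i - m x_i - b \right)
```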

That is the slope. I will set it equal to zero and I get (divide both sides by that pesky -2):
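After dividing by −2 and splitting the sum term by term:

```latex
\sum x_i y_i = m \sum x_i^2 + b \sum x_i
```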

Now to do a similar thing with how S changes with the parameter b.
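The same chain-rule step, just without the factor of x_i:

```latex
\frac{\partial S}{\partial b} = -2 \sum \left( y_i - m x_i - b \right)
```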

And again, setting it equal to zero (and dividing both sides by -2):
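Since adding b up over all n data points just gives nb, this condition is:

```latex
\sum y_i = m \sum x_i + n b
```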

Now there are two equations and two unknowns (m and b). The n is the number of data points. All the other stuff (like the sum over the x_i) is technically known. What I want to do next is solve for m and b.
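Solving the second equation for b and substituting into the first gives (leaving b written in terms of m):

```latex
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b = \frac{\sum y_i - m \sum x_i}{n}
```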

It should be obvious that I skipped some of the algebraic steps. They aren’t too difficult. You should be able to go through them yourself.

But, now that I have an expression for b and m, what to do? Well, if I know all the x and y data points, I can just calculate m and then b (since I left b in terms of m). If I don’t have too many data points, I could do this by hand. Or I could do it in python – or I could do it in a spreadsheet. Randomly, I will choose to do this in a spreadsheet.
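For anyone who would rather see the python version, here is a minimal sketch of the same calculation using the summation formulas (the data points are made up, standing in for the spreadsheet table):

```python
# Least-squares slope and intercept from the summation formulas.
# The data here is made up for illustration.
xdata = [0.0, 1.0, 2.0, 3.0, 4.0]
ydata = [1.1, 2.9, 5.2, 7.1, 8.8]

n = len(xdata)
sum_x = sum(xdata)
sum_y = sum(ydata)
sum_xy = sum(x * y for x, y in zip(xdata, ydata))
sum_x2 = sum(x * x for x in xdata)

# m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx**2), then b in terms of m
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(m, b)  # with this data, m is about 1.96 and b is about 1.1
```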

Here is that spreadsheet with the same data AND with the SLOPE() and INTERCEPT() function in google docs to show the answer is the same.

There. That is the basic form of linear regression by hand. Note that there ARE other ways to do this – more complicated ways (assuming different types of distributions for the data). Also, the same basic idea is followed if you want to fit some higher order polynomial. Warning: it gets complicated (algebraically) real quick.