What Do We Mean by the “Best” Line?

To answer that question, first we have to agree on what we mean by the “best fit” of a line to a set
of points. Why do we say that the line on the left fits the points
better than the line on the right? And can we say that some other line
might fit them better still?

Intuitively, we think of a close fit as a
good fit. We look for a line with little space between the
line and the points it’s supposed to fit. We would say that the
best fitting line is the one that has the least
space between itself and the data points, which represent
actual measurements.

Okay, what do we mean by “least space”? There are three ways to
measure the space between a point and a line: vertically in the y
direction, horizontally in the x direction, and on a perpendicular to
the line. We choose to measure the space
vertically. Why? Because our whole purpose in making a
regression line is to use it to predict the y value for a given x, and
the vertical distances are how far off the predictions would be for
the points we actually measured.

If we know we want the line that has the smallest vertical
distance between itself and the points, how do we compute that
vertical distance? If the line is y=3x+2 and we have a point (2,9),
we take the predicted value 3×2+2=8 and subtract the actual
measured value 9. We say that the deviation is –1, negative because the
predicted value is less than the actual value. In general, the
deviation (vertical gap) between any given point
(x,y) and the line y=mx+b will be
mx+b–y.
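The deviation rule can be sketched in a few lines of Python. The function name deviation is only an illustrative choice, not something from a library; it follows the sign convention above (predicted value minus actual value).

```python
def deviation(m, b, x, y):
    """Vertical gap between the line y = m*x + b and the point (x, y).

    Sign convention from the text: predicted value minus actual value.
    """
    predicted = m * x + b
    return predicted - y


# The worked example from the text: line y = 3x + 2, point (2, 9).
print(deviation(3, 2, 2, 9))  # prints -1
```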

But each deviation could be positive or negative, depending on
whether the line falls above or below that point. We can’t simply add
up deviations, because then a line would be considered good if it fell
way below some points as long as it fell way above others. To prevent
that, we square each deviation, and add
up the squares. (This also has the desirable effect that a few small
deviations are more tolerable than one or two big ones.)

And at long last we can say exactly what we mean by the line of
best fit. If we compute the deviations in the y direction, square each
one, and add up the squares, we say the line of best fit is the line
for which that sum is the least. Since it’s a sum of squares, the
method is called the method of least
squares.
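To see why squaring matters, here is a minimal Python sketch. It assumes the five sample points that appear in the table under Alternative Formulas; the helper names and the sloped candidate line y = –3.5x + 6.5 are hypothetical, picked by eye for illustration only.

```python
# Five sample points, borrowed from the table in the Alternative Formulas section.
points = [(0, 6), (2, -1), (3, -3), (5, -10), (6, -16)]


def deviations(m, b, pts):
    """Signed vertical deviations (predicted minus actual) for each point."""
    return [m * x + b - y for x, y in pts]


def sum_of_squares(m, b, pts):
    """E: the sum of squared vertical deviations for the line y = m*x + b."""
    return sum(d * d for d in deviations(m, b, pts))


# A flat line through the mean of the y values: a poor fit, yet its signed
# deviations cancel out (up to floating-point rounding).
flat_signed = sum(deviations(0, -4.8, points))
flat_e = sum_of_squares(0, -4.8, points)

# A hypothetical sloped candidate, chosen by eye for illustration only.
sloped_e = sum_of_squares(-3.5, 6.5, points)

print("signed deviations cancel for the flat line:", abs(flat_signed) < 1e-9)
print(f"flat line E = {flat_e:.2f}, sloped line E = {sloped_e:.2f}")
```

The flat line looks perfect if you only add up signed deviations, but its sum of squares is far larger than that of the sloped candidate, which is the behavior the squaring step is meant to capture.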

How Do We Find That Best Line?

It’s always a giant step in finding something to get clear on what
it is you’re looking for, and we’ve done that. The best-fit line, as
we have decided, is the line that minimizes the
sum of squares of vertical deviations between itself and the
measured points. We can write that sum as

    E = ∑(ŷ – y)²

where ŷ is the predicted value on the line for a given x
(namely mx+b), and y is the actual value measured for that x.

Do we just try a bunch of lines, compute their E values, and pick
the line with the lowest E value? No, we could never be sure that
there wasn’t some other line with a still lower E, and of
course it would be a lot of work, too.

Instead, we use a powerful and common
trick in mathematics: We assume we know the line,
and use its properties to help us find its identity. Here’s how that
works.

What is the line of best fit? It’s y=mx+b, because any
line (except a vertical one) is y=mx+b. We happen not to know m and b
just yet, but we can use the properties of the line to find them.

What is the chief property of the
line? It is that E is less for this line than for any other
line that might pass through the same set of points. In other words,
E is minimized by varying m and b. Let’s
look at how we can write an expression for E in terms of m and b, and
of course using the measured data points (x,y).

The squared deviation for any one point follows from the definition
we gave earlier:

    (mx+b–y)² = m²x² + 2bmx + b² – 2mxy – 2by + y²

E is found by summing over all points:

    E(m,b) = m²∑x² + 2bm∑x + nb² – 2m∑xy – 2b∑y + ∑y²

Once we find the m and b that minimize that quantity, we will know
the exact equation of the line of best fit.

As soon as you hear “minimize”, you think “calculus”. And indeed
calculus can find m and b. Surprisingly, we can also find m and b
using plain algebra.

Historical Note

It’s not entirely clear who invented the method of least squares.
Most authors attach it to the name of Carl Friedrich Gauss
(1777–1855), who first published on the subject in 1809.

But the Frenchman Adrien Marie Legendre (1752–1833) “published a
clear explanation of the method, with a worked example, in 1805”
according to Stephen Stigler in Statistics on the Table
(Cambridge, Massachusetts; Harvard University Press, 1999; see Chapter
17). In setting up the new metric system of
measurement, the meter was to be fixed at a ten-millionth of the
distance from the North Pole through Paris to the Equator. Surveyors
had measured portions of that arc, and Legendre invented the method of
least squares to get the best measurement for the whole arc.

The Calculus Way

In calculus, a function can have a minimum only
where its derivative is 0. Since we need to adjust both m and
b, we take the derivative of E with respect to m, and separately with
respect to b, and set both to 0:

    ∂E/∂m = 2m∑x² + 2b∑x – 2∑xy = 0
    ∂E/∂b = 2m∑x + 2nb – 2∑y = 0

Each equation then gets divided by the common
factor 2, and the terms not involving m or b are moved to the other
side. With a little thought you can recognize the result as two
simultaneous equations in m and b, namely:

    (∑x²)m + (∑x)b = ∑xy
    (∑x)m + nb = ∑y

The summation expressions are all just numbers,
the result of summing x and y in various combinations.

(By the way, how do we know that these will give us a minimum and not
a maximum or inflection point? Because the second derivatives of E,
2∑x² with respect to m and 2n with respect to b, are positive for
all values of m and b, and if the first derivative is 0 and the second
derivative is positive you have a minimum.)

These simultaneous equations can be solved like any others: by
substitution or by linear combination. Let’s try substitution. The
second equation looks easy to solve for b:

    b = (∑y – m∑x) / n

Substitute that in the other equation and you eventually come up
with

    m = (n∑xy – ∑x∑y) / (n∑x² – (∑x)²)
    b = (∑x²∑y – ∑x∑xy) / (n∑x² – (∑x)²)

And that is very probably what your calculator (or Excel) does: Add
up all the x’s, all the x², all the xy, and so on, and compute
the coefficients. It’s tedious, but not hard. (Usually these equations
are presented in the shortcut form shown
below.)
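The add-up-the-sums recipe can be sketched in Python as follows. This is only a guess at the general approach, not the actual code in any calculator or spreadsheet; the function name is hypothetical, and the five points are the ones tabulated under Alternative Formulas.

```python
def least_squares_from_sums(pts):
    """Slope m and intercept b of the least-squares line, from running sums.

    Uses the standard closed-form solution of the two simultaneous equations:
        m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2)
        b = (Sy - m*Sx) / n
    """
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)

    denom = n * sxx - sx * sx
    if denom == 0:
        # All x values are identical, so no unique non-vertical line exists.
        raise ValueError("x values must not all be equal")

    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b


points = [(0, 6), (2, -1), (3, -3), (5, -10), (6, -16)]
m, b = least_squares_from_sums(points)
print(f"m = {m:.4f}, b = {b:.4f}")  # m = -3.5175, b = 6.4561
```

Only five running totals (n, ∑x, ∑y, ∑x², ∑xy) are needed, which is why the method suits a device with very little memory.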

The Calculus-Free Way

But you don’t need calculus to solve
every minimum or maximum problem. Look back again at the equation for
E, which is the quantity we want to minimize:

    E(m,b) = m²∑x² + 2bm∑x + nb² – 2m∑xy – 2b∑y + ∑y²

Now that may look intimidating, but remember that all
the sigmas are just constants, formed by adding up various
combinations of the (x,y) of the original points. In fact, collecting
like terms reveals that E is really just a
parabola with respect to m or b:

    E = (∑x²)m² + (2b∑x – 2∑xy)m + (nb² – 2b∑y + ∑y²)
    E = nb² + (2m∑x – 2∑y)b + (m²∑x² – 2m∑xy + ∑y²)

Both these parabolas are open
upward. (Why? Because the coefficients of the m² and
b² terms are positive. The sum of x² must be positive unless
all x’s are 0; and of course n, the number of points, is positive.)
Since the parabolas are open upward, each one has a minimum at its vertex.

Where is the vertex for each of these parabolas? Well, recall
that a parabola y=px²+qx+r has its vertex at x = –q/(2p).
These are parabolas in m and b, not in x, but you can find the vertex
of each one the same way:

    m = –(2b∑x – 2∑xy) / (2∑x²) = (∑xy – b∑x) / ∑x²
    b = –(2m∑x – 2∑y) / (2n) = (∑y – m∑x) / n

Now there are two equations in m and b. Substitute one into the
other one, perhaps the second into the first, and the solution is

    m = (n∑xy – ∑x∑y) / (n∑x² – (∑x)²)
    b = (∑x²∑y – ∑x∑xy) / (n∑x² – (∑x)²)

“These values agree precisely with the regression equation
calculated by a TI-83 for the same data,” he said smugly.
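As a numerical sanity check on the vertex argument, here is a minimal Python sketch. It assumes the five sample points tabulated under Alternative Formulas; the step size and variable names are arbitrary illustrative choices.

```python
points = [(0, 6), (2, -1), (3, -3), (5, -10), (6, -16)]

n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Closed-form solution of the two vertex equations.
m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b = (sy - m * sx) / n


def E(m, b):
    """Sum of squared vertical deviations for the line y = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in points)


# At the solution, each vertex equation should reproduce its own variable.
m_vertex = (sxy - b * sx) / sxx   # vertex of the parabola in m, with b held fixed
b_vertex = (sy - m * sx) / n      # vertex of the parabola in b, with m held fixed
print(abs(m - m_vertex) < 1e-12, abs(b - b_vertex) < 1e-12)

# Nudging m or b in either direction should only increase E.
step = 0.01  # arbitrary small nudge, chosen for illustration
best = E(m, b)
nudges = [(step, 0), (-step, 0), (0, step), (0, -step)]
print(all(E(m + dm, b + db) > best for dm, db in nudges))
```

This does not prove anything the algebra has not already shown; it just confirms, for one data set, that the solved m and b sit at the bottom of both parabolas.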

Alternative Formulas

Some authors give a different form of the solutions for m and b, such as:

    m = ∑(x–x̅)(y–y̅) / ∑(x–x̅)²
    b = y̅ – m x̅

where x̅ and y̅
are the average of all x’s and average of all y’s.

These formulas are equivalent to the ones we derived earlier.
(Can you prove that? Remember that nx̅ is
∑ x, and
similarly for y.) While the m formula looks
simpler, it requires you to compute mean x and mean y first. If you do
that, here’s how the numbers work out:

n=5

           x       y    x–x̅    y–y̅   (x–x̅)²   (x–x̅)(y–y̅)
           0       6   –3.2    10.8     10.24        –34.56
           2      –1   –1.2     3.8      1.44         –4.56
           3      –3   –0.2     1.8      0.04         –0.36
           5     –10    1.8    –5.2      3.24         –9.36
           6     –16    2.8   –11.2      7.84        –31.36
    ∑     16     –24      0       0     22.8         –80.2
    mean   3.2    –4.8

Whew! Once you’ve got through that, m and b are only a little more work:
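That last step can be sketched in Python as a cross-check on the table. The variable names are illustrative assumptions; the formulas are the mean-deviation forms, m = ∑(x–x̅)(y–y̅) / ∑(x–x̅)² and b = y̅ – m x̅.

```python
points = [(0, 6), (2, -1), (3, -3), (5, -10), (6, -16)]

n = len(points)
x_bar = sum(x for x, _ in points) / n   # mean of the x values
y_bar = sum(y for _, y in points) / n   # mean of the y values

# Column sums from the table: squared x deviations and cross products.
sum_dx2 = sum((x - x_bar) ** 2 for x, _ in points)
sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in points)

m = sum_dxdy / sum_dx2
b = y_bar - m * x_bar

print(f"mean x = {x_bar:.1f}, mean y = {y_bar:.1f}")
print(f"column sums: {sum_dx2:.2f} and {sum_dxdy:.2f}")
print(f"m = {m:.4f}, b = {b:.4f}")  # m = -3.5175, b = 6.4561
```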