The Regression Line

Does education pay? Figure 1 shows the relationship between income and education, for a
representative sample of 637 California men age 25-29 in 1988. The summary statistics are:

average education is 12.5 years, SD is 4 years
average income is $19,700, SD is $16,000
r = 0.35

The regression estimates for average income at each educational level fall along the
regression line shown in the figure. The line slopes up, showing that on the average,
income does go up with education.

Figure 1. The regression line. The scatter diagram shows income and education for a
representative sample of 637 California men age 25-29 in 1988.

Any line can be described in terms of its slope and intercept. The y-intercept is the
height of the line when x is 0. And the slope is the rate at which y increases, per unit
increase in x. Slope and intercept are illustrated in figure 2.

Figure 2. Slope and intercept.

What do the slope and intercept mean for the regression line? To continue with the
income-education example: associated with an increase of one SD in education, there is an
increase of r SDs in income. That is, 4 extra years of education are worth an extra 0.35 x
$16,000 = $5,600 of income, on the average. So each extra year is worth $5,600/4 = $1,400.
The slope of the regression line is $1,400 per year. So far, it looks like education does
pay off for the men, at the rate of $1,400 per year.
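The slope arithmetic above can be checked with a short Python sketch. The variable names
are illustrative choices made here, not notation from the text; the numbers are the
summary statistics quoted for the men.

```python
# Summary statistics for the men, taken from the text above.
sd_education = 4.0       # years
sd_income = 16_000.0     # dollars
r = 0.35                 # correlation coefficient

# One SD of education is associated with r SDs of income, on the average.
income_change_per_sd = r * sd_income           # dollars per 4 years of education

# Dividing by the SD of education converts "per SD" into "per year".
slope = income_change_per_sd / sd_education    # dollars per year of education

print(f"change per SD of education: ${income_change_per_sd:,.0f}")   # $5,600
print(f"slope: ${slope:,.0f} per year")                              # $1,400 per year
```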

Figure 3. Finding the slope and intercept of the regression line.

The intercept of the regression line is its height when x = 0, corresponding to
men with 0 years of education. Such men are 12.5 years below average in education. And
each year costs $1,400; that is what the slope says. A man with no education is predicted
to have an income which is below average by

12.5 years x $1,400 per year = $17,500.

So, his income is predicted as $19,700 - $17,500 = $2,200. This is the intercept:
the predicted value of y when x = 0. (See figure 3.)
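The same steps for the intercept, as a minimal Python sketch. The slope of $1,400 per
year is taken as given from the calculation in the text, and the variable names are
illustrative only.

```python
# Averages for the men and the slope, all taken from the text above.
avg_education = 12.5     # years
avg_income = 19_700.0    # dollars
slope = 1_400.0          # dollars per year of education

# A man with 0 years of education is this far below average in education.
years_below_average = avg_education - 0.0

# Each year below average lowers the predicted income by the slope.
income_shortfall = years_below_average * slope

# The intercept is the predicted income at x = 0.
intercept = avg_income - income_shortfall

print(f"predicted shortfall: ${income_shortfall:,.0f}")   # $17,500
print(f"intercept: ${intercept:,.0f}")                    # $2,200
```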

Zero years of education may sound extreme, but there were several men who reported
having no education, and their incomes ranged from $0 to about $12,000; their points are
in the lower left corner of figure 1.

Associated with each unit increase in x there is some average change in y. The
slope of the regression line says how much this change is. The formula for the slope is

r · (SD of y) / (SD of x)

The intercept of the regression line is just the predicted value for y, when x
is 0.

Any line has an equation, in terms of its slope and intercept:

y = slope · x + intercept.

The equation for the regression line is called (not surprisingly) the regression
equation. In figure 3, the regression equation is

predicted income = ($1,400 per year) x education + $2,200.

There is nothing new here: the regression equation is just an alternative way of using
the regression method to predict y from x. However, the regression equation is often used
in the social sciences. An investigator who has to make many predictions might find it
easier to compute the slope and intercept once and for all, and then substitute into the
equation. Furthermore, the slope and intercept can be interesting in their own right.
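To illustrate that the equation adds nothing new, the sketch below predicts income for
the men in two ways: step by step in SD units (an increase of one SD in education goes
with an increase of r SDs in income), and by substituting into the regression equation.
The function names and the education levels of 8, 12 and 16 years are choices made here
for illustration; they are not taken from the text.

```python
# Summary statistics for the men, taken from the text above.
AVG_X, SD_X = 12.5, 4.0            # education, in years
AVG_Y, SD_Y = 19_700.0, 16_000.0   # income, in dollars
R = 0.35

def predict_by_regression_method(education):
    """Predict income by working in SD units."""
    sds_above_average_x = (education - AVG_X) / SD_X
    sds_above_average_y = R * sds_above_average_x
    return AVG_Y + sds_above_average_y * SD_Y

# Compute the slope and intercept once, then reuse them for every prediction.
SLOPE = R * SD_Y / SD_X
INTERCEPT = AVG_Y - SLOPE * AVG_X

def predict_by_equation(education):
    """Predict income by substituting into the regression equation."""
    return SLOPE * education + INTERCEPT

for years in (8, 12, 16):
    by_method = predict_by_regression_method(years)
    by_equation = predict_by_equation(years)
    print(f"{years:>2} years: method ${by_method:,.0f}, equation ${by_equation:,.0f}")
```

The two columns agree at every educational level, up to floating-point rounding.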

Example 1. For 676 California women age 25-29 in 1988, there is a relationship
between income and education; data are from the Current Population Survey. The
relationship can be summarized as follows.

average education 12 years, SD 3.5 years
average income $11,600, SD $10,500
r = 0.4

(a) Find the regression equation for predicting income from education.

(b) Use the equation to predict the income of a woman whose educational level is: 8
years, 12 years, 16 years.

Solution. Part (a). The first step is to find the slope (figure 3). In a run of
one SD of education, the regression line rises r SDs of income. So

slope = (0.4 x $10,500)/3.5 years = $1,200 per year.

The interpretation: on the average, each extra year of schooling is worth an extra
$1,200 of income; each year less of schooling costs $1,200 of income.

The next step is to find the intercept. That is the height of the regression line at x
= 0. In other words, it is the predicted income of a woman with no education. Such a
woman is 12 years below average; using the slope, she is predicted to be below average in
income by

12 years x $1,200 per year = $14,400.

So her income is predicted as

$11,600 - $14,400 = -$2,800.

That is the intercept: the prediction for y when x = 0. (The regression line
becomes less and less reliable as you move away from the center of the data, so a negative
intercept is not too disturbing.) The regression equation is

predicted income = ($1,200 per year) x (education) - $2,800.

Part (b). Substituting 8 years for education gives

($1,200 per year) x (8 years) - $2,800 = $6,800.

Substituting 12 years for education gives

($1,200 per year) x (12 years) - $2,800 = $11,600.

Substituting 16 years for education gives

($1,200 per year) x (16 years) - $2,800 = $16,400.

This completes the solution. Despite the negative intercept, the predictions are quite
reasonable -- for most of the women.
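The whole of example 1 can be reproduced with a small helper, sketched here in Python.
The function regression_line is a hypothetical name introduced for this illustration;
it takes the five summary statistics and returns the slope and intercept.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the regression line for predicting y from x."""
    slope = r * sd_y / sd_x
    # The line passes through the point of averages, so back out its height at x = 0.
    intercept = avg_y - slope * avg_x
    return slope, intercept

# Part (a): summary statistics for the women in example 1.
slope, intercept = regression_line(
    avg_x=12.0, sd_x=3.5, avg_y=11_600.0, sd_y=10_500.0, r=0.4
)
print(f"slope: ${slope:,.0f} per year, intercept: ${intercept:,.0f}")

# Part (b): substitute each educational level into the regression equation.
predictions = {years: slope * years + intercept for years in (8, 12, 16)}
for years, income in predictions.items():
    print(f"{years:>2} years of education: predicted income ${income:,.0f}")
```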

In this example, the slope is $1,200 per year. Associated with each extra year of
education, there is an increase of $1,200 in income, on the average. The phrase
"associated with" sounds like it is talking around some difficulty, and here is
the issue: Are income differences caused by differences in educational level, or do both
reflect the common influence of some third variable? The phrase "associated
with" was invented to let statisticians talk about regressions without having to
commit themselves on this sort of point.

Often, the slope is used to predict how y will respond, if someone intervenes and
changes x. This is legitimate when the data come from a controlled experiment. However,
with observational studies the inference is often shaky because of confounding. Take
example 1. On the average, the women who finished college (16 years of education) earned
about $4,800 more than women who just finished high school (12 years).

If the government sent a representative group of women with high school degrees on to
get college degrees, the slope suggests that their income would go up by an average of 4 x
$1,200 = $4,800. However, example 1 is based on survey data rather than a controlled
experiment. One group of women in the survey had 12 years of education. Another, separate,
group had 16 years. The two groups were probably different with respect to many factors
besides education -- like intelligence, ambition, and family background.

The effects of these factors are confounded with the effect of education, and their
effects go into the slope. Sending people off to get college degrees probably would make
their incomes go up, but not by the full $4,800. To measure the impact of a college degree
on incomes, it might be necessary to run a controlled experiment or use an advanced
technique called multiple regression.
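A small simulation can make the confounding argument concrete. Everything in the sketch
below is invented for illustration: the "background" factor, the true effect of $500 per
year, and all the other numbers are assumptions, not estimates from the survey. The
point is only that when a third factor raises both education and income, the slope
computed from the observed data overstates what an intervention on education would do.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_EFFECT = 500.0   # hypothetical: dollars of income caused by one extra year
N = 20_000            # number of simulated people

education, income = [], []
for _ in range(N):
    background = random.gauss(0, 1)                     # hypothetical confounder
    years = 12 + 2 * background + random.gauss(0, 1)    # background raises education
    dollars = (TRUE_EFFECT * years
               + 3_000 * background                     # background also raises income
               + random.gauss(0, 5_000))
    education.append(years)
    income.append(dollars)

# Slope of the regression line, r * (SD of y) / (SD of x),
# written equivalently as covariance over variance.
mean_x = sum(education) / N
mean_y = sum(income) / N
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(education, income)) / N
var_x = sum((x - mean_x) ** 2 for x in education) / N
observed_slope = cov_xy / var_x

print(f"true effect of one extra year: ${TRUE_EFFECT:,.0f}")
print(f"slope from observational data: ${observed_slope:,.0f}")
```

Under these made-up settings the observed slope comes out near $1,700 per year, more
than three times the true effect built into the simulation.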

With an observational study, the slope and intercept of the regression line are only
descriptive statistics. They say how the average value of one variable is related to
values of another variable, in the population being observed. The slope cannot be relied
on to predict how y would respond if the investigator changes the value of x.

These notes are drawn from Statistics, by Freedman, Pisani, Purves
and Adhikari.