The automobile dataset given above includes both Weight and Mileage of
60 automobiles. In addition to describing location and dispersion for each variable separately,
we also may be interested in what kind of relationship exists between these variables. The following
figure represents a scatterplot of these variables with the respective means superimposed. This
shows that for a high percentage of cars, those with above average Weight tend to have below
average Mileage, and those with below average Weight have above average Mileage. This is an
example of a decreasing relationship, and most of the data points in the plot fall in the upper
left/lower right quadrants. In an increasing relationship, most of the points will fall in the
lower left/upper right quadrants.

We can derive a measure of association for two variables by considering the
deviations of the data values from their respective means. Note that the
product of deviations for a data point in the lower left or upper right quadrants
is positive and the product of deviations for a data point in the upper left or lower
right quadrants is negative. Therefore, most of these products for variables with a
strong increasing relationship will be positive, and most of these products for
variables with a strong decreasing relationship will be negative. This implies that
the sum of these products will be a large positive number for variables that have
a strong increasing relationship, and the sum will be a large negative number for
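variables that have a strong decreasing relationship. This is the motivation for using

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

as a measure of association between two variables. This quantity is called the
correlation coefficient. The denominator of r is a scale
factor that makes the correlation coefficient dimensionless and scales it so that
$-1 \le r \le 1$. Note that this can be expressed equivalently as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right), $$

where $s_X$ and $s_Y$ are the sample standard deviations of X and Y.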

If the correlation coefficient is close to 1, then the variables have a strong
increasing relationship and if the correlation coefficient is close to -1,
then the variables have a strong decreasing relationship. If the correlation is
exactly 1 or -1, then the data must fall exactly on a straight line. The
correlation coefficient is limited in that it is only valid for linear
relationships. A correlation coefficient close to 0 indicates that there is no
linear relationship. There may be a strong relationship in this case,
just not linear. Furthermore, the correlation may understate the strength of
the relationship even when r is large, if the relationship is
non-linear.
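
The two forms of r and the caveat about non-linear relationships can be checked numerically. The following is a minimal R sketch, assuming a small made-up dataset rather than the automobile data; the object names (x, y, dev.prod, r.def, r.z, u, v) are chosen here for illustration only.

```r
# Made-up illustrative data (not the automobile dataset).
x <- c(1.9, 2.3, 2.6, 2.9, 3.2, 3.6)
y <- c(35, 31, 27, 25, 22, 19)
n <- length(x)

# Products of deviations from the means; a decreasing relationship
# makes most of them negative.
dev.prod <- (x - mean(x)) * (y - mean(y))

# r from the definition: sum of products divided by the scale factor.
r.def <- sum(dev.prod) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Equivalent form: sum of products of standardized values over n - 1.
r.z <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Both should agree with the built-in cor().
c(r.def, r.z, cor(x, y))

# An exact but non-linear relationship can still have zero correlation.
u <- -3:3
v <- u^2
cor(u, v)
```

The last two lines use a symmetric parabola, for which the positive and negative products of deviations cancel even though v is completely determined by u.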

The correlation coefficient between Weight and Mileage is -0.848. This is
a fairly large negative number, and so there is a fairly strong linear,
decreasing relationship between Weight and Mileage. This is confirmed by the
scatterplot. Since these variables are so strongly related, we can ask how
well we can predict Mileage just by knowing the Weight of a vehicle. To answer
this question, we first define a measure of distance between a dataset and a
line.

Suppose we have measured two variables for each individual in a sample, denoted
by $(X_1, Y_1), \ldots, (X_n, Y_n)$, and we wish to predict the value of
Y given the value of X for a particular individual using
a straight line for the prediction. A reasonable approach would be to use the
line that comes closest to the data for this prediction. Let Y=a+bX
denote the equation of a prediction line, and let $\hat{Y}_i = a + bX_i$
denote the
predicted value of Y for $X_i$. The difference between an actual and
predicted Y-value, $Y_i - \hat{Y}_i$, represents the error of prediction for that data
point. We define the distance between a prediction line and a point in
the dataset to be the square of the prediction error for that observation. The
total distance between the actual and predicted Y-values is then the
sum of the squared errors, which is the variance of the prediction errors
multiplied by $n-1$ whenever the errors average to zero, as they do for the
best-fitting line. Since the predicted values, and hence the errors, depend on
the slope and intercept of the prediction line, we can express this total
distance by

$$ D(a,b) = \sum_{i=1}^{n} (Y_i - a - bX_i)^2. $$

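Our goal now is to find the line that is closest to the data using this
definition of distance. This line has slope and intercept that minimize
$D(a,b)$. We can use differential calculus to find the minimum. The partial
derivatives are

$$ \frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} (Y_i - a - bX_i), \qquad \frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} X_i (Y_i - a - bX_i). $$

Setting these equal to 0 gives the system of equations

$$ na + b \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i, \qquad a \sum_{i=1}^{n} X_i + b \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i. $$

Therefore,

$$ a = \bar{Y} - b\bar{X}, $$

and, after substituting for $a$ in the second equation and solving for $b$,

$$ b = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}. $$

It can be shown that the numerator equals $(n-1)\, r\, s_X s_Y$ and the denominator
equals $(n-1)\, s_X^2$, where $s_X$ and $s_Y$ denote the sample standard deviations
of X and Y. Hence,

$$ b = r \, \frac{s_Y}{s_X}. $$

The prediction line, referred to as the least squares regression line,
is then

$$ \hat{Y} = \bar{Y} + r \, \frac{s_Y}{s_X} (X - \bar{X}). $$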

The next question that can be asked related to this prediction problem is how
well the prediction line predicts. We can't answer that question completely
yet because the full answer requires inference tools that we have not yet
covered, but we can give a descriptive answer. The distance
measure, D(a,b), evaluated at the least squares line equals the variance of the
prediction errors multiplied by $n-1$.
One way of describing how well the prediction line performs is to compare it to
the best prediction we could obtain without using the X values to
predict. In that case, our predictor would be a single number. We have already
seen that the closest single number to a dataset is the mean of the data, so in
this case, the best predictor based only on the Y values is
$\bar{Y}$. This corresponds to a horizontal line with intercept
$\bar{Y}$, and so the distance between this line and the data is
$D(\bar{Y}, 0) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$. This quantity represents, up to
the same factor $n-1$, the error variance for the best
predictor that does not make use of the X values, and so the
difference,

$$ D(\bar{Y}, 0) - D(\hat{a}, \hat{b}), $$

where $\hat{a}$ and $\hat{b}$ are the least squares intercept and slope,
represents the reduction in error variance (improvement in prediction) that
results from use of the X values to predict. If we express this as
a percent,

$$ 100 \times \frac{D(\bar{Y}, 0) - D(\hat{a}, \hat{b})}{D(\bar{Y}, 0)}, $$

then this is the percent of the error variance that can be removed if we
use the least squares regression line to predict as opposed to simply using the
mean of the Y's. It can be shown that the corresponding proportion is equal to the
square of the correlation coefficient,

$$ R^2 = \frac{D(\bar{Y}, 0) - D(\hat{a}, \hat{b})}{D(\bar{Y}, 0)} = r^2. $$

R-squared also can be interpreted as the proportion of variability in the
Y-variable that can be explained by the presence of a linear
relationship between X and Y.
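
The closed-form slope and intercept and the identity between R-squared and the squared correlation can be verified numerically. This is a minimal R sketch on made-up data (not the automobile dataset); the names D, a.hat, b.hat, fit and R2 are illustrative choices, and lm() is used only as a cross-check.

```r
# Made-up illustrative data (not the automobile dataset).
x <- c(1.9, 2.3, 2.6, 2.9, 3.2, 3.6)
y <- c(35, 31, 27, 25, 22, 19)

# Total squared prediction error for the line Y = a + bX.
D <- function(a, b) sum((y - a - b * x)^2)

# Least squares slope and intercept from the closed-form solution.
r <- cor(x, y)
b.hat <- r * sd(y) / sd(x)
a.hat <- mean(y) - b.hat * mean(x)

# lm() should return the same intercept and slope.
fit <- lm(y ~ x)
rbind(formula = c(a.hat, b.hat), lm = coef(fit))

# Proportional reduction in squared error relative to predicting
# every Y by mean(y), i.e. the horizontal line with slope 0.
R2 <- (D(mean(y), 0) - D(a.hat, b.hat)) / D(mean(y), 0)
c(R2, r^2, summary(fit)$r.squared)
```

The three numbers in the last line should coincide, which is the equality of R-squared and the squared correlation stated above.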

In the automobile example, the correlation between Weight and Mileage was
r = -0.848, and so $r^2 = 0.719$. If we use the regression line to
predict Mileage based on Weight, we can remove 71.9% of the variance of the
Mileage data. Another way of expressing this
is to ask: Why don't all cars have the same mileage? Part of the answer to that
question is that cars don't all weigh the same and there is a fairly strong
linear relationship between weight and mileage that accounts for 71.9% of the
variability in mileage. This leaves 28.1% of this variability that is related
to other factors, including the possibility of a non-linear relationship
between Mileage and Weight.

To help judge the adequacy of a linear regression fit, we can plot the residuals
versus the predictor variable $X$. The residuals are the prediction errors,
$e_i = Y_i - \hat{Y}_i$, $i = 1, \ldots, n$. If a linear fit is reasonable, then the
residuals should have no discernible relationship with $X$ and should be
essentially noise. This plot for a linear fit to predict Mileage based on
Weight is shown below.

This shows that the residuals are still related to Weight, so a linear fit is
not adequate. Note that removal of the linear component of the relationship
between weight and mileage, as represented by the residuals from a linear fit,
does a better job of revealing this non-linearity than a scatterplot of these
variables. This will be discussed in greater detail later.
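
A residual plot of this kind can be produced along the following lines. This is a minimal R sketch using made-up data with a deliberately curved relationship, not the automobile data; the names x, y, fit and e are illustrative.

```r
# Made-up data with a curved relationship (not the automobile dataset).
x <- seq(1, 10, by = 0.5)
y <- 40 - 6 * x + 0.4 * x^2

fit <- lm(y ~ x)
e <- resid(fit)   # prediction errors: actual Y minus fitted Y

# Residuals from a least squares line are uncorrelated with X by
# construction, so any pattern left in them is non-linear.
cor(e, x)

# Residuals versus the predictor; here they trace a U shape,
# which signals that a straight line is not adequate.
plot(x, e, xlab = "X", ylab = "Residual")
abline(h = 0, lty = 2)
```

In this example the residuals are positive at both ends of the X range and negative in the middle, a pattern that is much harder to see in the plain scatterplot of y against x.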

Now suppose we only wish to consider cars whose engine displacements are
less than 225. We can define a logical expression that represents such
cars and use that to subset the fuel data frame:

ndx = fuel.frame$Disp < 225
fuel1 = fuel.frame[ndx,]

Then we can use the fuel1 data frame to plot Mileage versus Weight and
to fit a linear regression model.
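
These steps might look as follows. This is only a sketch: so that it runs on its own, it first builds a small hypothetical stand-in for fuel.frame with made-up values and only the three columns used here. With the real data frame loaded, that first step should be skipped. The name fit1 is an illustrative choice.

```r
# Hypothetical stand-in for fuel.frame so the sketch is self-contained;
# the values are made up. Skip this if the real data frame is loaded.
fuel.frame <- data.frame(
  Weight  = c(2560, 2345, 1845, 2885, 3310, 3735, 3850),
  Disp    = c(97, 114, 81, 153, 180, 232, 302),
  Mileage = c(33, 33, 37, 24, 23, 19, 18)
)

# Subset as in the text.
ndx <- fuel.frame$Disp < 225
fuel1 <- fuel.frame[ndx, ]

# Scatterplot of Mileage versus Weight for the subset.
plot(Mileage ~ Weight, data = fuel1)

# Least squares fit on the subset, with the fitted line added to the plot.
fit1 <- lm(Mileage ~ Weight, data = fuel1)
abline(fit1)

coef(fit1)                           # intercept and slope
cor(fuel1$Weight, fuel1$Mileage)^2   # R-squared for the subset
```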

It is important to remember that correlation is a mathematical concept that
says nothing about causation. The presence of a strong correlation between
two variables indicates that there may be a causal relationship,
but does not prove that one exists, nor does it indicate the direction of any
causality.