The Correlation Coefficient

(Drawn from Statistics by Freedman, Pisani, Purves et al)

Suppose you are looking at the relationship between two variables, and have already
plotted the scatter diagram. The graph looks like a cloud of points. How can it be
summarized numerically? This is what the correlation coefficient does.

You already know how to summarize interval variables individually: you compute the mean
and the standard deviation (SD). For example, to summarize the variable "GPA"
for BC students you would just calculate the mean and the standard deviation, as in:

The mean GPA is 2.90

The standard deviation (the amount of spread around the mean) is 0.5.

Assuming that the distribution of GPA scores in the population is bell-shaped
("normally distributed"), this means that about 68% of all students have a
GPA that falls in the interval [2.40, 3.40], and about 95% of all students have GPAs that
fall within [1.90, 3.90].
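As a quick check of this rule of thumb, the sketch below computes the one-SD and two-SD intervals for the GPA example and the share of a normal curve that falls inside each. It assumes GPA is exactly normally distributed, which is only an approximation; the function name interval_and_share is made up for illustration.

```python
from statistics import NormalDist

mean_gpa = 2.90
sd_gpa = 0.5

# Assumption: GPA follows a normal (bell-shaped) curve with this mean and SD.
gpa = NormalDist(mu=mean_gpa, sigma=sd_gpa)

def interval_and_share(k):
    """Return (low, high, share) for the interval within k SDs of the mean."""
    low = mean_gpa - k * sd_gpa
    high = mean_gpa + k * sd_gpa
    share = gpa.cdf(high) - gpa.cdf(low)
    return low, high, share

for k in (1, 2):
    low, high, share = interval_and_share(k)
    print(f"within {k} SD: [{low:.2f}, {high:.2f}] holds about {share:.0%}")
```

Under the normal curve, roughly 68% of values fall within one SD of the mean and roughly 95% within two SDs.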

So, in the diagram below (Figure 4a), a first step would be to mark a point
showing the average of the x-values and the average of the y-values. This is the point
of averages. It locates the center of the cloud. The next step would be to measure the
spread of the cloud from side to side. This can be done using the SD of the
x-values. Most of the points (about 95%) will be within 2 horizontal SDs on either side of
the point of averages (Figure 4b). In the same way, the SD of the y-values can be used to
measure the spread of the cloud from top to bottom. Most of the points will be within 2
vertical SDs above or below the point of averages (Figure 4c).

Figure 4. Summarizing a scatter diagram.
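The point of averages and the two SDs described above can be computed with a few lines of Python. This is only a sketch: the data are made up, the function name summarize_cloud is hypothetical, and the SD is assumed to be the one that divides by the number of points (pstdev), not by one less.

```python
from statistics import fmean, pstdev

def summarize_cloud(xs, ys):
    """Return the point of averages and the spread in each direction."""
    return {
        "avg_x": fmean(xs), "sd_x": pstdev(xs),  # center and side-to-side spread
        "avg_y": fmean(ys), "sd_y": pstdev(ys),  # center and top-to-bottom spread
    }

# Hypothetical data, invented for illustration only.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

summary = summarize_cloud(xs, ys)
print(summary)
```

For these five points the point of averages is (4, 7), with an SD of 2 for x and 4 for y.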

So far, the summary statistics are

average of x-values, SD of x-values;

average of y-values, SD of y-values.

These statistics tell us the center of the cloud, and how spread out it is, both
horizontally and vertically. But we still need to summarize the association between the
two variables. Look at the scatter diagrams in figure 5. Both clouds have the same center
and show the same spread, horizontally and vertically.

However, the points in the first cloud are tightly clustered around a line: there is a
strong linear association between the two variables. The correlation is quite high (the
highest possible is 1.0; here it is perhaps about 0.8). In the second cloud, the clustering is
much looser. When the clustering looks more like a circle than a football, the correlation
is near zero. The strength of the association is different in the two diagrams. To measure
the association, one more summary statistic is needed: the correlation coefficient. This
coefficient is usually denoted r.

The correlation coefficient, r, is a measure of linear
association or clustering around a line. The relationship between two variables can be
summarized by:

the average of the x-values, the SD of the x-values

the average of the y-values, the SD of the y-values

the correlation coefficient r.

The formula for computing r will be presented later. Right now we want to focus on the
graphical interpretation.

Following this paragraph are a number of scatter diagrams generated by computer using
200 hypothetical data points. The computer has printed the value of the correlation
coefficient at the top of each diagram. The first diagram (below) shows a formless,
circle-like cloud. The correlation between those two variables is just about zero. You can
see that small values of X have all kinds of Y values -- small, medium and large. So do
medium values of X, and so do large values of X. In short, the Y values are not related to
the X values. Knowing the X value for a given point does not help much to predict the Y
value.

The next scatter diagram has a little more of an elliptical shape. The average Y
value for points with small X values is lower than the average Y value for points with
large X values. The correlation is r = 0.28.

The next diagram (below) shows a correlation of about 0.48. This is just a little
higher than the correlation between income and education in the United States. A positive
correlation of that size means that, on average, people with more education make more money.

The next diagram shows a much stronger correlation (r = 0.72). The band is
getting quite narrow, and big X values show much higher Y values than do small X values.

The next diagram has a really high correlation of 0.89. This is just a little less
than the actual correlation between the heights of identical twins of all ages. Note that
even with a correlation of 0.89, you don't really expect the twins to have exactly the
same height: almost always there is SOME difference. All this really says is that on
average, the difference is small. Occasionally, though, there will be some big
differences.

The last picture shows a correlation of 0.99. You will probably never see a real
correlation that big. It's practically a straight line. Knowing a person's X value tells
you their Y value to within a couple of decimal places of accuracy. In other words,
virtually every person with a certain X value (say, 0.5), has practically the same Y value
(around 0.8).
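If you want to generate clouds like these yourself, here is one possible sketch. It is not the method used to draw the diagrams above (that method is not described); it assumes both variables are standard normal and builds Y as a mix of X and independent noise. The names make_cloud and correlation are made up for illustration.

```python
import math
import random
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Average product of the two variables in standard units."""
    mx, my = fmean(xs), fmean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return fmean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

def make_cloud(target_r, n=200, seed=0):
    """Generate n points whose correlation is about target_r (assumed recipe)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    noise_weight = math.sqrt(1 - target_r ** 2)
    ys = [target_r * x + noise_weight * rng.gauss(0, 1) for x in xs]
    return xs, ys

for target in (0.0, 0.28, 0.48, 0.72, 0.89, 0.99):
    xs, ys = make_cloud(target)
    print(f"target r = {target:.2f}, r in this sample = {correlation(xs, ys):.2f}")
```

With only 200 points the correlation in any one sample will wander a little around the target value; larger samples land closer to it.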

So what am I saying?

The more tightly clustered the points are along a line, the stronger the relationship
between the variables, and the closer r is to 1.0.

When the correlation is near 1.0, knowing a point's X value allows you to predict its Y
value with very little error.

But that doesn't mean that the Y value is the same or nearly the same as the X value,
since the Y variable may be expressed in completely different units. It just means that it
is predictable from the X value. For example, if X is education level, in years, Y could
be income, in dollars; even if education predicts income perfectly, that doesn't mean that
the Y value associated with 16 years of education is anything like 16 -- it is probably
more like $35,000.
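To see that a perfect correlation says nothing about the units, consider the sketch below. The rule linking education to income is entirely hypothetical, chosen only so that 16 years of education gives $35,000; the helper name correlation is also made up.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Average product of the two variables in standard units."""
    mx, my = fmean(xs), fmean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return fmean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

years = [10, 12, 14, 16, 18, 20]

# Hypothetical rule: income is an exact straight-line function of education.
income = [3000 + 2000 * y for y in years]       # 16 years -> 35,000 dollars
income_thousands = [d / 1000 for d in income]   # same incomes, different units

r_dollars = correlation(years, income)
r_thousands = correlation(years, income_thousands)
print(round(r_dollars, 6), round(r_thousands, 6))
```

Both correlations come out at 1 (up to rounding), even though no income value is anywhere near the matching number of years, and changing the units of income changes nothing.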

So far, only positive association has been discussed. In the United States, women with
more education tend to have fewer children. This is negative association. An increase in
education is accompanied on the whole by a decrease in the number of children. (What are
they learning in school?!) Negative association is indicated by a negative sign in the
correlation coefficient. A correlation of -0.90, for instance, indicates the same degree
of clustering as one of +0.90. With the negative sign, the clustering is around a line
which slopes down; with a positive sign, the line slopes up. For women of childbearing age
in the United States, the correlation between education and number of children is around
-0.2; not strong, but there.

Correlations are always between -1 and +1, but can take any value in between. A
positive correlation means that the cloud slopes up; as one variable increases, so does
the other. A negative correlation means that the cloud slopes down; as one variable
increases, the other decreases.
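The effect of the sign can be checked with a small sketch. The data are made up, and flipping the sign of every Y value is just a convenient way to turn an upward-sloping cloud into a downward-sloping one with the same degree of clustering.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Average product of the two variables in standard units."""
    mx, my = fmean(xs), fmean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return fmean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

# Hypothetical data, invented for illustration only.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

r_up = correlation(xs, ys)                       # cloud slopes up
r_down = correlation(xs, [-y for y in ys])       # mirror image, slopes down
print(round(r_up, 2), round(r_down, 2))
```

The two values have the same size and opposite signs, and both stay between -1 and +1.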

Interpreting the Correlation Coefficient

A correlation of, say, r = 0.80 does not mean that 80% of the points are tightly
clustered around a line, nor does it indicate twice as much linearity as r = 0.40. The
correlation measures the extent to which knowing the value of X helps you to predict the
value of Y.

Suppose you wanted to predict the GPA of a random BC student, and you knew that the
average GPA for all BC students was 2.9. Then you would just guess 2.9. Why? Because
the average of a list of numbers is the value that is least different from all the others.
Look at the table below. The column labeled X has a set of 10 values. The sum is 140, and
the average is 14. The next column over computes the squared difference of each value
in X from 14. The average squared difference is 115.8. The next three columns compute the
squared difference of each value in X from other candidate values: 12, 11 and 17. Notice that the
average squared difference is smallest for 14, the average. It's always this way: the
average is the value that is least different (in terms of squared differences) from all
the numbers in a list. So the average is your best guess of the GPA of a random student.

            X   (X-14)^2   (X-12)^2   (X-11)^2   (X-17)^2
1           1        169        121        100        256
2           3        121         81         64        196
3           6         64         36         25        121
4           6         64         36         25        121
5           7         49         25         16        100
6          13          1          1          4         16
7          17          9         25         36          0
8          27        169        225        256        100
9          30        256        324        361        169
10         30        256        324        361        169
Total:    140       1158       1198       1248       1248
Average:   14      115.8      119.8      124.8      124.8
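The averages in the bottom row of the table can be reproduced with a short sketch. The X values are the ones from the table; the function name mean_squared_difference is made up for illustration.

```python
from statistics import fmean

# The ten X values from the table above.
X = [1, 3, 6, 6, 7, 13, 17, 27, 30, 30]

def mean_squared_difference(values, guess):
    """Average of the squared differences between each value and the guess."""
    return fmean([(v - guess) ** 2 for v in values])

for guess in (14, 12, 11, 17):
    print(guess, mean_squared_difference(X, guess))

# Try every whole-number guess from 0 to 30 and keep the best one.
best_guess = min(range(0, 31), key=lambda g: mean_squared_difference(X, g))
print("best guess:", best_guess)
```

Of all the whole-number guesses from 0 to 30, the one with the smallest average squared difference is 14, the average of X.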

But suppose you knew that female students have a higher GPA than male students (by,
say, 0.20 GPA points). If you knew the sex of the random student, you would adjust your
estimate up or down depending on the sex. This should mean that, on average, your guesses
would not be as far off as if you just guessed 2.90. The more sex is related to GPA, the
more knowing someone's sex improves your guesses.

Suppose you are trying to guess a randomly chosen person's height. Your best guess is
the mean height. But if you knew what shoe size they wore, you could do a much better
job of guessing: you would guess the average height of people with that particular shoe
size.
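Here is a small sketch of that idea. The shoe sizes and heights are invented for illustration and are not real measurements; the point is only to compare how far off the two kinds of guesses are, using the root-mean-square (r.m.s.) size of the errors.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (shoe size, height in inches) pairs, invented for illustration.
people = [(7, 64), (7, 66), (8, 65), (8, 67), (9, 68),
          (9, 70), (10, 69), (10, 71), (11, 72), (11, 74)]

def rms(errors):
    """Root-mean-square size of a list of errors."""
    return fmean([e * e for e in errors]) ** 0.5

# Guess 1: always guess the overall mean height.
overall_mean = fmean([h for _, h in people])
rms_overall = rms([h - overall_mean for _, h in people])

# Guess 2: guess the mean height of people with the same shoe size.
heights_by_size = defaultdict(list)
for size, h in people:
    heights_by_size[size].append(h)
mean_by_size = {size: fmean(hs) for size, hs in heights_by_size.items()}
rms_by_size = rms([h - mean_by_size[size] for size, h in people])

print(f"r.m.s. error guessing the overall mean: {rms_overall:.2f}")
print(f"r.m.s. error guessing by shoe size:     {rms_by_size:.2f}")
```

In this made-up data the guesses based on shoe size are off by about 1 inch, while the guesses that ignore shoe size are off by about 3 inches.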

The correlation coefficient tells you how many standard deviations above the mean on
the Y variable people are, on average, given that they are one standard
deviation above the mean on the X variable. If the correlation is 1.0, it means that for
all persons who are 1 SD above the mean on X, the average value of Y is 1 SD above the
average of Y. If the correlation is 0.8, it means that on average, people 1 SD over the
mean on X are about 0.8 SDs above the average of Y.

If the correlation is 0.0, it means that the average Y value for people 1 SD over the
average on X is just about 0 SDs over the average of Y, which means that it is just the
average of Y. In other words, when there is no correlation between X and Y, you just
predict the mean of Y, ignoring the value of X.
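This interpretation can be turned into a small prediction rule: put X in standard units, multiply by r, and convert back to the units of Y. The sketch below assumes you already know the five summary statistics; the function name predict_y and all the numbers are hypothetical.

```python
def predict_y(x, avg_x, sd_x, avg_y, sd_y, r):
    """Predicted average Y for people with a given X value."""
    x_standard = (x - avg_x) / sd_x          # how many SDs above the mean X is
    return avg_y + r * x_standard * sd_y     # move r times that many SDs on Y

# Hypothetical summary statistics, invented for illustration.
avg_x, sd_x = 10, 2
avg_y, sd_y = 50, 5

# A person 1 SD above the mean on X (x = 12), under three correlations.
for r in (1.0, 0.8, 0.0):
    print(r, predict_y(12, avg_x, sd_x, avg_y, sd_y, r))
```

With r = 1.0 the prediction is a full SD above the average of Y (55), with r = 0.8 it is 0.8 SDs above (54), and with r = 0.0 it is just the average of Y (50).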

Calculating The Correlation Coefficient

Step 1. Convert the X and Y variables to standard units. Call the results X* and Y*. To
do this for X, subtract the mean of X from each X value, then divide each deviation
by the standard deviation of X. Do the same for Y. For roughly bell-shaped data, about 95%
of the resulting values will lie between -2 and 2. The mean of the new variable, X*, will
be zero, and its standard deviation will be one.

Step 2. For each point, multiply its X* value by its Y* value.

Step 3. Take the average of the products computed in step 2. That is the correlation.
In the example shown, the correlation is 0.876.
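The calculation can be sketched in a few lines of Python: convert both variables to standard units, multiply the standard-unit values pair by pair, and average the products. The data below are made up and are not the data behind the 0.876 figure; the function names are hypothetical, and the SD is assumed to be the one that divides by the number of points (pstdev).

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Subtract the mean from each value, then divide by the SD."""
    avg, sd = fmean(values), pstdev(values)
    return [(v - avg) / sd for v in values]

def correlation(xs, ys):
    x_star = standard_units(xs)                          # standard units for X
    y_star = standard_units(ys)                          # standard units for Y
    products = [a * b for a, b in zip(x_star, y_star)]   # pairwise products
    return fmean(products)                               # average of the products

# Hypothetical data, invented for illustration only.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

print("X*:", standard_units(xs))
print("Y*:", standard_units(ys))
print("r =", round(correlation(xs, ys), 3))
```

For this small data set the products are 0.75, -0.25, 0, -0.75 and 2.25, and their average is 0.40.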

Exercises

1. (a) Would the correlation between the age of a second-hand car and its price be
positive or negative? Why? (Antiques are not included.)

(b) What about the correlation between a car's weight and its miles per gallon?

2. For each scatter diagram below:

(a) The average of x is around 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, or 4.0?
(b) Same, for y.
(c) The SD of x is around 0.25, 0.5, 1.0, or 1.5?
(d) Same, for y.
(e) Is the correlation positive, negative, or zero?

3. For which of the diagrams in the previous exercise is the correlation closer to
0, forgetting about signs?

4. In figure 1 of The Scatter Diagram, is the correlation
between the heights of the fathers and sons around -0.3, 0, 0.5, or 0.8?

5. In figure 1 of The Scatter Diagram, if you took
only the fathers who were taller than 6 feet and their sons, would the correlation between
the heights be around -0.3, 0, 0.5, or 0.8?

6. If women always married men who were five years older, what would the correlation
between their ages be? Why?

7. The correlation between the ages of husbands and wives in the U.S. is:

exactly -1,

close to -1,

close to 0,

close to 1,

exactly 1

Why did you give the answer you did?

8. Investigators are studying registered students at the University of California. The
students fill out questionnaires giving their year of birth, age (in years), age of
mother, and so forth. Fill in the blanks, using the options given below, and explain.

(a) The correlation between student's age and year of birth is:

(b) The correlation between student's age and mother's age is:

-1

almost -1

somewhat negative

0

somewhat positive

almost 1

1

9. True or false: If the correlation coefficient is 0.90, then 90% of the points are
highly correlated.